World Community Grid - View Thread - exited with code 29 (0x1d, -227)

World Community Grid Forums

Category: Completed Research

Forum: Discovering Dengue Drugs - Together - Phase 2 Forum

Thread: exited with code 29 (0x1d, -227)


Thread Status: Active
Total posts in this thread: 42


This topic has been viewed 74672 times and has 41 replies

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: exited with code 29 (0x1d, -227)

The early departures with 29 are by design as in "we know they signal to be of no further use". The way they come and go off stage does get the razzie prize if such a prize were to be given in distributed computing.

Meantime the 3 I have left after the first 2 with 29 are now on 60 and 70%... expecting them to finish proper.

----------------------------------------

WCG

Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!

[Apr 17, 2010 8:43:54 AM]

JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3716
Status: Offline


Re: exited with code 29 (0x1d, -227)

The early departures with 29 are by design as in "we know they signal to be of no further use".

Still, maybe Uplinger could tell us more about why in the same quorum WUs are discovering that they are no longer useful at as different percentages as 69.08 and 22.20 %?

Until now the most consistent quorum I have seen for this ts05 distribution is 3 at 13.00 % and 2 at 18.36 %.

Edit: Sorry, actually the most consistent one is the only one which completed fine for both my wingman and me. :-)

(Edit2: i.e. a good WU with two valid results and only two copies.)
And there is still some hope for the fifth one which is still In Progress.

----------------------------------------

Team--> Decrypthon -->Statistics/Join -->Thread

----------------------------------------
[Edit 2 times, last edit by JmBoullier at Apr 17, 2010 1:39:31 PM]

[Apr 17, 2010 9:35:37 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: exited with code 29 (0x1d, -227)

The early departures with 29 are by design as in "we know they signal to be of no further use".

Still, maybe Uplinger could tell us more about why in the same quorum WUs are discovering that they are no longer useful at as different percentages as 69.08 and 22.20 %?

ts05_a193_ps0000 is a fine example of three error 29s at totally different locations. And what is more, at least my WU hits the error only after running more than 30% uninterrupted. If it is restarted from the last backup taken just before the error position, it continues for another 30%. So the error is not reproducible at the same location this way; the task always needs more than 30% of uninterrupted running to discover that it is of no further use... ;-)
Of course this does not apply to all WUs which error immediately after starting.

[Apr 17, 2010 10:02:06 AM]

Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline


Re: exited with code 29 (0x1d, -227)

@mweisensee: How did you get a WU to restart from a checkpoint after it has experienced a computation error? I thought they became irretrievable after that happens.
@JmBoullier too:
I've repeated your findings & questions in Changes to distribution of error work units, where I asked another question re timing of changes to the max no of error copies in a WU quorum.

[Apr 17, 2010 12:49:19 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: exited with code 29 (0x1d, -227)

@mweisensee: How did you get a WU to restart from a checkpoint after it has experienced a computation error? I thought they became irretrievable after that happens.

Yes, you are right. After a task had an error, all checkpoints are lost.
But I had the same situation before during the beta test (messages can be found in the beta test thread). So I stop BOINC from time to time to make a backup of the BOINC data directory while long-running WUs are active (in fact I have been doing this since I started running Climate Prediction WUs, which take some weeks to complete). Network access is disabled the whole time to prevent BOINC from reporting failures.
So when the error occurred I stopped BOINC again and copied all files for that WU from the backup (slot directory including checkpoints, client state parts, _2 file). Then I restarted BOINC and the restored WU was available again. Of course there is some loss because I do not know for sure when the next error will occur. But I do not lose the WU.
BTW, WU ts05_a193_ps0000_1 had error 29 again at 94% completion after running 32% uninterrupted, exactly the same percentage as with the first error. So I'm pretty sure that it depends on the resources used rather than on the task finding out that it is of no further use. For the night I am leaving it suspended and will complete it tomorrow.
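
A minimal sketch of the backup step described above, assuming BOINC has already been stopped by hand and that the data directory can simply be copied as a whole. The function name `backup_data_dir` and both paths are hypothetical, not part of BOINC; restoring a copy like this is not a supported BOINC feature.

```python
import shutil
import time
from pathlib import Path


def backup_data_dir(data_dir, backup_root):
    """Copy the whole data directory into a new timestamped folder.

    Assumption: the client is stopped, so no file changes mid-copy.
    """
    src = Path(data_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"no such data directory: {src}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"boinc-backup-{stamp}"
    # copytree creates dest (and missing parents) and fails if dest exists,
    # so an earlier backup is never overwritten.
    shutil.copytree(src, dest)
    return dest
```

Called as `backup_data_dir("path/to/boinc/data", "path/to/backups")`, it returns the folder it created, which holds the slot directories and client state files together so they come from the same moment.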

Good night!
Matthias

[Apr 17, 2010 8:01:42 PM]

boulmontjj
Senior Cruncher
France
Joined: Nov 17, 2004
Post Count: 317
Status: Offline


Re: exited with code 29 (0x1d, -227)

My ts05_b150_ps0000 finished in error with the same error after 29 hours. :'(

Result name: ts05_b150_ps0000_2--

<core_client_version>6.2.18</core_client_version>
<![CDATA[
<message>
The system cannot write to the specified device. (0x1d) - exit code 29 (0x1d)
</message>
<stderr_txt>

I'm the second one to hit that error with this specific WU.

I hope my other WU (ts05_b039_ps0000) will finish OK, but I can also see that 2 members have already returned it in error :-(

(same error as the other one).
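
The three numbers in the thread title are the same exit status written three ways: 29 in decimal, 0x1d in hexadecimal, and -227, which is 29 minus 256. On Windows, system error 29 is ERROR_WRITE_FAULT, and the message text in the log above matches that error string. A small sketch of the arithmetic, with `exit_code_forms` as a hypothetical helper name and the assumption that the status fits in one byte:

```python
def exit_code_forms(code):
    """Return (decimal, hex string, code - 256) for a one-byte exit status."""
    if not 0 <= code <= 0xFF:
        raise ValueError("expected a one-byte exit status (0..255)")
    return code, hex(code), code - 256


print(exit_code_forms(29))  # (29, '0x1d', -227)
```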

----------------------------------------

Join us and visit the France team's site here: http://www.grid-france.fr

----------------------------------------
[Edit 1 times, last edit by boulmontjj at Apr 17, 2010 8:21:17 PM]

[Apr 17, 2010 8:19:27 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: exited with code 29 (0x1d, -227)

It seems like these monster tasks for DDDT-2 were poorly designed, particularly in the case where ts05_a193_ps0000_1 and ts05_b159_ps0000_1 appear to be running successfully by stopping and starting BOINC. I think of BOINC as a user interface to see what is being executed, and not the actual execution of the tasks which are being continuously executed in the background with BOINC active or not. Wouldn't suspending and resuming a task with BOINC have the same results? The checkpoint of the task is to provide a point at which to restart should your computer go down or needs to be rebooted for some other reason such as a Windows update for security reasons.

----------------------------------------
[Edit 1 times, last edit by Former Member at Apr 17, 2010 10:56:04 PM]

[Apr 17, 2010 10:49:40 PM]

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: exited with code 29 (0x1d, -227)

We can only speculate what the restart effects were on the tally. If the recount is complete, he had a client_state.xml out of sync with the slot information. This is not CPDN, who have designed in the ability to resume from backups.

edit:

PS: Regarding resources, if anyone sees more than 210 MB RAM use and 730 MB VM for the A-Type, please speak up with the result name. These are the maximums I've observed on my own machines and in reports on the forums. WCG has already set the limit protectively to 1 GB, to be multiplied when running several tasks concurrently.
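
As a rough illustration of the "to be multiplied" remark, here is a sketch of the memory that should be available for a given number of concurrent tasks. It assumes the 1 GB per-task limit means 1024 MB; the function name `memory_budget_mb` is hypothetical.

```python
def memory_budget_mb(concurrent_tasks, per_task_limit_mb=1024):
    """Memory to reserve, in MB, if every running task may reach the limit."""
    if concurrent_tasks < 0:
        raise ValueError("concurrent_tasks must not be negative")
    return concurrent_tasks * per_task_limit_mb


print(memory_budget_mb(4))  # 4096
```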


----------------------------------------
[Edit 1 times, last edit by Sekerob at Apr 18, 2010 6:36:06 AM]

[Apr 18, 2010 6:25:09 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: exited with code 29 (0x1d, -227)

Sek, I restarted the tasks at 08:38:25 CEST (the computer was restarted as well), and now, after 52 min and +1.8%, they both use 213 MB RAM and 805 MB VM.
Do you know whether the memory is allocated step by step or all at once?
Concerning the restart: I wait until the next checkpoint is reached and suspend the task immediately afterwards. Once all tasks are suspended, I stop BOINC and make the backup.

Matthias

[Apr 18, 2010 7:35:48 AM]

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: exited with code 29 (0x1d, -227)

Thanks for the size info. It seems it locks the VM space pretty close to the start, when it sets up the model grid. Making sure the VM can expand freely when needed will at least pre-empt any failure caused by limits on that part.

For good order, you really need to exit BOINC and stop the service for a reliable backup. As noted, if you restore a task, the slot progress info is not the same as the client_state.xml info, since the later clients do it differently, with at times a considerable delay before the control information is written to disk... which is why a sudden power outage can cause more loss than one expects [seen it a few times].


[Apr 18, 2010 8:18:06 AM]
