Thread Status: Active
Total posts in this thread: 42
This topic has been viewed 48083 times and has 41 replies
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: exited with code 29 (0x1d, -227)

The early departures with 29 are by design as in "we know they signal to be of no further use". The way they come and go off stage does get the razzie prize if such a prize were to be given in distributed computing.

Meantime the 3 I have left after the first 2 with 29 are now on 60 and 70%... expecting them to finish proper.
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
[Apr 17, 2010 8:43:54 AM]
JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3715
Status: Offline
Re: exited with code 29 (0x1d, -227)

The early departures with 29 are by design as in "we know they signal to be of no further use".
Still, maybe Uplinger could tell us more about why in the same quorum WUs are discovering that they are no longer useful at as different percentages as 69.08 and 22.20 %?

Until now the most consistent quorum I have seen for this ts05 distribution is 3 at 13.00 % and 2 at 18.36 %.

Edit: Sorry, actually the most consistent one is the only one which completed fine for both my wingman and me. :-) (Edit2: i.e. a good WU with two valid results and only two copies.)
And there is still some hope for the fifth one which is still In Progress.
----------------------------------------
Team--> Decrypthon -->Statistics/Join -->Thread
----------------------------------------
[Edit 2 times, last edit by JmBoullier at Apr 17, 2010 1:39:31 PM]
[Apr 17, 2010 9:35:37 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: exited with code 29 (0x1d, -227)

The early departures with 29 are by design as in "we know they signal to be of no further use".
Still, maybe Uplinger could tell us more about why in the same quorum WUs are discovering that they are no longer useful at as different percentages as 69.08 and 22.20 %?



ts05_a193_ps0000 is a fine example of three error-29 exits at totally different locations. And what is more, at least my WU hits the error only after running more than 30% uninterrupted. If it is restarted from the last backup taken immediately before the error position, it continues for another 30%, i.e. the error is not reproducible at the same location this way; the task always needs more than 30% of uninterrupted running to discover that it is of no further use... ;-)
Of course this does not apply to the WUs which error immediately after starting.
[Apr 17, 2010 10:02:06 AM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: exited with code 29 (0x1d, -227)

@mweisensee: How did you get a WU to restart from a checkpoint after it has experienced a computation error? I thought they became irretrievable after that happens.
@JmBoullier too:
I've repeated your findings & questions in Changes to distribution of error work units, where I asked another question re timing of changes to the max no of error copies in a WU quorum.
[Apr 17, 2010 12:49:19 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: exited with code 29 (0x1d, -227)

@mweisensee: How did you get a WU to restart from a checkpoint after it has experienced a computation error? I thought they became irretrievable after that happens.


Yes, you are right. After a task has had an error, all checkpoints are lost.
But I had the same situation before, during the beta test (messages can be found in the beta test thread). So I stop BOINC from time to time to make a backup of the BOINC data directory whenever long-running WUs are active (in fact I have been doing this since I started running climate prediction WUs, which take some weeks to complete). Network access is disabled the whole time to prevent BOINC from reporting failures.
So when the error occurred I stopped BOINC again and copied all files for that WU from the backup (slot directory including checkpoints, client state parts, _2 file). Then I restarted BOINC, and the restored WU was available again. Of course there is some loss, because I do not know for sure when the next error will occur. But I do not lose the WU.
BTW, WU ts05_a193_ps0000_1 had error 29 again at 94% completion after running 32% uninterrupted, exactly the same percentage as with the first error. So I'm pretty sure that it depends on the resources used rather than on the task finding out that it is of no further use. For the night I am leaving it suspended and will complete it tomorrow.

Good night!
Matthias
[Apr 17, 2010 8:01:42 PM]
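The backup-and-restore routine Matthias describes can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names and directory layout are hypothetical, the sketch copies the whole data directory rather than only the files of one WU, and it must only be run while the client is completely stopped. It is not an officially supported way to recover a task.

```python
import shutil
from pathlib import Path


def backup_data_dir(data_dir: Path, backup_root: Path, label: str) -> Path:
    """Copy the whole data directory to backup_root/label and return that path.

    Assumption: the client is fully stopped, so the slot directories and
    client_state.xml are copied in a consistent state.
    """
    target = backup_root / label
    shutil.copytree(data_dir, target)
    return target


def restore_data_dir(backup: Path, data_dir: Path) -> None:
    """Replace data_dir with the contents of an earlier backup.

    Assumption: again, the client is stopped while this runs.
    """
    if data_dir.exists():
        shutil.rmtree(data_dir)
    shutil.copytree(backup, data_dir)
```

A call such as `backup_data_dir(Path("data"), Path("backups"), "before_error")` would be made after suspending the tasks at a checkpoint and exiting the client; `restore_data_dir` would be called after an error, before starting the client again. Both paths here are placeholders.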
boulmontjj
Senior Cruncher
France
Joined: Nov 17, 2004
Post Count: 317
Status: Offline
Re: exited with code 29 (0x1d, -227)

My ts05_b150_ps0000 finished in error with the same error after 29 hours. :-(
Result name: ts05_b150_ps0000_2--



<core_client_version>6.2.18</core_client_version>
<![CDATA[
<message>
The system cannot write to the specified device. (0x1d) - exit code 29 (0x1d)
</message>
<stderr_txt>

I'm the second to hit that error with this specific WU.

I hope my other WU (ts05_b039_ps0000) will finish OK, but I can also see that 2 members have already returned it in error :-( (the same error as the other one).
----------------------------------------

Join us and visit the France team's site here: http://www.grid-france.fr
----------------------------------------
[Edit 1 times, last edit by boulmontjj at Apr 17, 2010 8:21:17 PM]
[Apr 17, 2010 8:19:27 PM]
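The three numbers in the thread title, 29, 0x1d and -227, are arithmetically consistent with each other: 0x1d is 29 in hexadecimal, and -227 differs from 29 by exactly 256. The small sketch below only checks that relationship; the helper name is made up for illustration, and the thread does not say why the client also reports the value as -227.

```python
def low_byte(value: int) -> int:
    """Return value reduced modulo 256, i.e. its lowest 8 bits as an unsigned number."""
    return value % 256


print(hex(29))         # hexadecimal form of the exit code
print(low_byte(-227))  # the negative form reduced to a single byte
print(29 - 256)        # the offset between the two decimal forms
```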
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: exited with code 29 (0x1d, -227)

It seems like these monster tasks for DDDT-2 were poorly designed, particularly given that ts05_a193_ps0000_1 and ts05_b159_ps0000_1 appear to be running successfully only by stopping and starting BOINC. I think of BOINC as a user interface for seeing what is being executed, not as the actual execution of the tasks, which run continuously in the background whether BOINC is active or not. Wouldn't suspending and resuming a task with BOINC have the same result? The purpose of a task's checkpoint is to provide a point from which to restart should your computer go down or need to be rebooted for some other reason, such as a Windows security update.
----------------------------------------
[Edit 1 times, last edit by Former Member at Apr 17, 2010 10:56:04 PM]
[Apr 17, 2010 10:49:40 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: exited with code 29 (0x1d, -227)

We can only speculate what effect the restarts had on the tally. If the recount is complete, he had a client_state.xml that was out of sync with the slot information. This is not CPDN, which has designed in the ability to resume from backups.

edit:

PS: Regarding resources, if anyone sees more than 210 MB RAM use and 730 MB VM for the A-Type, please speak up with the result name. These are the maximums I've observed on my own machines and in reports on the forums. WCG has already set the limit protectively to 1 GB, to be multiplied by the number of tasks when running several concurrently.
[Edit 1 times, last edit by Sekerob at Apr 18, 2010 6:36:06 AM]
[Apr 18, 2010 6:25:09 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: exited with code 29 (0x1d, -227)

Sek, I restarted the tasks at 08:38:25 CEST (the computer was restarted as well), and now, after 52 min and +1.8%, they both have 213 MB RAM and 805 MB VM.
Do you know whether the memory is allocated step by step or all at once?
Concerning the restart: I wait until the next checkpoint is reached and suspend the task immediately afterwards. Once all tasks are suspended, I stop BOINC and make the backup.

Matthias
[Apr 18, 2010 7:35:48 AM]
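The numbers exchanged in the two posts above can be put side by side with a few lines of arithmetic. This is a sketch under assumptions: the constant and function names are invented for illustration, "Mb" and "Gb" are read as megabytes and gigabytes, and 1 GB is taken as 1024 MB.

```python
# Largest values reported so far for the A-Type tasks, per the thread.
PREVIOUS_MAX_RAM_MB = 210
PREVIOUS_MAX_VM_MB = 730

# Assumption: the protective per-task limit of "1Gb" is read as 1024 MB.
PER_TASK_LIMIT_MB = 1024


def exceeds_previous_max(ram_mb: int, vm_mb: int) -> bool:
    """True if either figure is above the largest value reported so far."""
    return ram_mb > PREVIOUS_MAX_RAM_MB or vm_mb > PREVIOUS_MAX_VM_MB


def total_limit_mb(concurrent_tasks: int) -> int:
    """Per-task limit multiplied by the number of tasks running at once."""
    return PER_TASK_LIMIT_MB * concurrent_tasks


# The observation reported for each of the two running tasks.
print(exceeds_previous_max(213, 805))
print(total_limit_mb(2))
```

With the reported 213 MB RAM and 805 MB VM, both figures are above the earlier maximums while still below the per-task limit assumed here.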
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: exited with code 29 (0x1d, -227)

Thanks for the size info. It seems to lock the VM space pretty close to the start, when it sets up the model grid. Making sure the VM can expand generously when needed will at least rule out failures caused by limits on that part.

For good order, you really need to exit BOINC and stop the service to get a reliable backup. And as noted, if you restore a task, the slot progress info is not the same as the client_state.xml info: the later clients handle this differently, at times with a considerable delay before the control information is written to disk... which is why a sudden power outage can cause more loss than one expects [seen it a few times].
[Apr 18, 2010 8:18:06 AM]