Total posts in this thread: 40
This topic has been viewed 8609 times and has 39 replies
armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695
Status: Offline
Re: Errors exit code 29 (0x1d)

We are looking into the issue along with the researchers. One thing to note is that it appears as though the problem may be related to restoring from a checkpoint. If a user is having a high number of errors with CEP and has the memory resources available to leave the application in memory this may work as a temporary workaround while we investigate the errors.

Thanks,
armstrdj
[Aug 4, 2009 6:45:35 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Errors exit code 29 (0x1d)

"One thing to note is that it appears as though the problem may be related to restoring from a checkpoint."

Sounds very likely to me. I did some troubleshooting on my machines that were throwing errors in CEP, and it turns out that they were actually crashing and rebooting many times a day. I have BOINC on auto-start and these machines have no graphics so I didn't catch this problem earlier. When I would check in on them via BOINCview or remote desktop, all appeared to be working fine. It was only by reading the message log in BOINCview that I could see the BOINC startup sequence repeating randomly. The workunits likely errored after the reboot and they had no checkpoint to restore.

Anyways, I bumped the core voltage on each machine by 0.01 V (2 notches in BIOS) and they've been running 48 hours now with no errored WUs or crashes.
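
For anyone wanting to check a headless machine the same way, here is a minimal sketch that scans a saved message log for repeated client startups. The timestamp format matches the log lines quoted later in this thread; the "Starting BOINC client version" text and the sample lines are assumptions for illustration, so adjust the pattern to whatever your client actually writes.

```python
import re
from datetime import datetime

# Assumption: each client startup writes a line containing this text.
STARTUP = re.compile(
    r"^(\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2}) .*Starting BOINC client version"
)

def find_restarts(log_lines):
    """Return the timestamp of every client startup found in the log."""
    stamps = []
    for line in log_lines:
        match = STARTUP.match(line)
        if match:
            stamps.append(datetime.strptime(match.group(1), "%d-%b-%Y %H:%M:%S"))
    return stamps

# Hypothetical sample lines, not taken from any real log.
sample = [
    "03-Aug-2009 01:10:00 [---] Starting BOINC client version 6.6.36",
    "03-Aug-2009 01:10:05 [World Community Grid] Restarting task X_0",
    "03-Aug-2009 04:42:17 [---] Starting BOINC client version 6.6.36",
]

restarts = find_restarts(sample)
print(len(restarts), "startups found")
```

More than one startup on a machine that was never deliberately rebooted is the pattern described above.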
[Aug 5, 2009 2:18:36 AM]
JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3716
Status: Offline
Re: Errors exit code 29 (0x1d)

I had my first error 29 yesterday afternoon, and it could confirm your theory about restoring from a checkpoint.
It happened about one minute after my quad had restarted 4 CEP tasks after a fresh boot. Only one of the four tasks failed with no message in the message log and only this line "process exited with code 29 (0x1d, -227)" in the Result Log.
This machine usually shows errors only when WUs are wrong, which has not happened for many weeks.
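
The three numbers in that Result Log line look like one value written three ways. A small sketch of the arithmetic, assuming the negative figure is simply the code minus 256 (the thread does not say why the client prints it that way):

```python
def exit_code_views(code):
    """Return (decimal, hex string, value minus 256) for an 8-bit exit code."""
    code &= 0xFF  # keep only the low 8 bits
    return code, hex(code), code - 256

print(exit_code_views(29))  # (29, '0x1d', -227)
```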

This was with BOINC 6.2.18 under Ubuntu 9.04 64-bit, and obviously no antivirus program.

As for the fresh boot and the resulting task restarts, I do this about once a day without any trouble, since I run this machine alternately under Ubuntu 64 and XP 32 every day.
----------------------------------------
Team--> Decrypthon -->Statistics/Join -->Thread
[Aug 5, 2009 8:54:19 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Errors exit code 29 (0x1d)

Jean, do you use the delay on system restart config setting? I had a different, surely unrelated error yesterday around the time of a resume, but with no meaningful error message.

E000916_ 001C_ 009y0510j_ 2-- - In Progress 8/4/09 16:42:04 8/8/09 16:42:04 0.00 0.0 / 0.0
E000916_ 001C_ 009y0510j_ 1-- 632 Error 8/2/09 15:01:10 8/4/09 16:26:25 7.27 106.2 / 0.0
E000916_ 001C_ 009y0510j_ 0-- 632 Inconclusive 8/2/09 15:01:07 8/3/09 19:31:57 13.58 90.6 / 0.0

Oddly, the result first sat in PV waiting on the validator to kick in, suggesting there were signs of a normal finish.

Result Name: E000916_ 001C_ 009y0510j_ 1--
<core_client_version>6.6.38</core_client_version>
<![CDATA[
<stderr_txt>
Calling initGraphics()
INFO: No state to restore. Start from the beginning.
called boinc_finish

</stderr_txt>
]]>

(Yes, yes, yes, this is a test version of BOINC on Vista, and I am not going to try 6.6.39... the list of issues with .38 is now at 13, and of course it ain't true, such as the client coming out of hibernation and remaining in full suspend until manually switched to run always... just venting ;>)
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
[Aug 5, 2009 9:13:00 AM]
JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3716
Status: Offline
Re: Errors exit code 29 (0x1d)

"Jean, do you use the delay on system restart config setting?"

No, but with this config without bells and whistles, the Ubuntu boot has already completed by the time BOINC has been running for more than a minute.
(Even in XP 32 I have reduced it to 60 seconds.)
[Aug 5, 2009 3:53:03 PM]
uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Status: Offline
Re: Errors exit code 29 (0x1d)

We are removing any jobs that end with {0h, 0i, 0j, 0k, 12, 13, 14, 15} as these appear to be causing a larger number of error 29. These work unit simulations push the temperature higher than the others causing the issue to appear more frequently. We are aborting both active work units with this identifier and future work units with these identifiers.

Thank you for your patience as we continue to work through this.

-Uplinger
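
To see whether a task in your own queue falls into the removed group, a quick check against the suffix list from the post above might look like this. It is only a sketch: the idea that a result name is the work unit name plus a short numeric replica index such as "_0" is an assumption based on the names quoted in this thread, and the helper names are made up for illustration.

```python
# Suffixes listed in the post above.
REMOVED_SUFFIXES = ("0h", "0i", "0j", "0k", "12", "13", "14", "15")

def workunit_name(result_name):
    """Strip a trailing replica index like '_0' (assumed naming), if present."""
    base, sep, tail = result_name.rpartition("_")
    # Assumption: replica indexes are one or two digits long.
    if sep and tail.isdigit() and len(tail) <= 2:
        return base
    return result_name

def is_removed(result_name):
    """True if the work unit name ends with one of the removed suffixes."""
    return workunit_name(result_name).endswith(REMOVED_SUFFIXES)

print(is_removed("E000916_001C_009y0510j_1"))  # True  (ends with 0j)
print(is_removed("E000902_117C_009q0670f_0"))  # False (ends with 0f)
```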
[Aug 6, 2009 9:45:15 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Errors exit code 29 (0x1d)

"We are removing any jobs that end with {0h, 0i, 0j, 0k, 12, 13, 14, 15} as these appear to be causing a larger number of error 29. These work unit simulations push the temperature higher than the others causing the issue to appear more frequently. We are aborting both active work units with this identifier and future work units with these identifiers."

Hello!
Will those workunits be reworked and then issued again? Or are they withdrawn completely (leaving us with less work remaining for the project)?
Greetings
Thorsten
[Aug 10, 2009 10:20:05 AM]
uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Status: Offline
Re: Errors exit code 29 (0x1d)

"We are removing any jobs that end with {0h, 0i, 0j, 0k, 12, 13, 14, 15} as these appear to be causing a larger number of error 29. These work unit simulations push the temperature higher than the others causing the issue to appear more frequently. We are aborting both active work units with this identifier and future work units with these identifiers."

"Hello!
Will those workunits be reworked and then issued again? Or are they withdrawn completely (leaving us with less work remaining for the project)?
Greetings
Thorsten"

The researchers may resubmit them with different parameters, but that is unlikely.

-Uplinger
[Aug 10, 2009 11:20:06 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Errors exit code 29 (0x1d)

Yesterday WU E000902_117C_009q0670f had an exception error at 92%. I restarted it (after a backup) and it completed. But it finished within about three minutes, i.e. it needed only THREE minutes to go from 92% to 100% (on an Athlon with an average completion time of 14 hours):

10-Aug-2009 15:33:34 [World Community Grid] Restarting task E000902_117C_009q0670f_0 using cep1 version 632
10-Aug-2009 15:36:48 [World Community Grid] Computation for task E000902_117C_009q0670f_0 finished

Then it was reported (all result files were present). It went to PV AND BECAME VALID!!
So why can't all WUs terminate at 92% if that is sufficient for becoming valid? Or is something wrong with the validator?
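
The two log lines quoted above give the exact gap. A minimal sketch that parses their timestamps and subtracts them, assuming the date and time are always the first two space-separated fields:

```python
from datetime import datetime

def log_time(line):
    """Parse the leading 'DD-Mon-YYYY HH:MM:SS' timestamp of a log line."""
    stamp = " ".join(line.split()[:2])
    return datetime.strptime(stamp, "%d-%b-%Y %H:%M:%S")

restart = "10-Aug-2009 15:33:34 [World Community Grid] Restarting task E000902_117C_009q0670f_0 using cep1 version 632"
finish = "10-Aug-2009 15:36:48 [World Community Grid] Computation for task E000902_117C_009q0670f_0 finished"

elapsed = log_time(finish) - log_time(restart)
print(elapsed)  # 0:03:14
```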

Matthias
[Aug 11, 2009 5:07:00 AM]
GIBA
Ace Cruncher
Joined: Apr 25, 2005
Post Count: 5374
Status: Offline
Re: Errors exit code 29 (0x1d)

Got another one...

One peer is in PV and another is still crunching (the new replica generated after my error was reported):

My result log:

Result Name: E000967_ 627C_ 00a80570f_ 0--

<core_client_version>6.2.28</core_client_version>
<![CDATA[
<message>
The system cannot write to the specified device. (0x1d) - exit code 29 (0x1d)
</message>
<stderr_txt>
Calling initGraphics()
INFO: No state to restore. Start from the beginning.
Calling initGraphics()
INFO: No state to restore. Start from the beginning.
No heartbeat from core client for 30 sec - exiting
Calling initGraphics()
Encountered error. Exiting.

</stderr_txt>
]]>
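
A rough way to read a stderr block like the one above is to count how often the task started and how often it found no checkpoint. This is only a sketch: treating each "Calling initGraphics()" line as one application start is an assumption, not something the thread confirms.

```python
def summarize_stderr(text):
    """Count starts, checkpoint-less starts, and heartbeat loss in stderr text."""
    lines = [ln.strip() for ln in text.splitlines()]
    return {
        # Assumption: one initGraphics() call per application start.
        "starts": sum(ln == "Calling initGraphics()" for ln in lines),
        "no_checkpoint": sum(ln.startswith("INFO: No state to restore") for ln in lines),
        "heartbeat_lost": any("No heartbeat from core client" in ln for ln in lines),
    }

stderr_txt = """Calling initGraphics()
INFO: No state to restore. Start from the beginning.
Calling initGraphics()
INFO: No state to restore. Start from the beginning.
No heartbeat from core client for 30 sec - exiting
Calling initGraphics()
Encountered error. Exiting.
"""

print(summarize_stderr(stderr_txt))
# {'starts': 3, 'no_checkpoint': 2, 'heartbeat_lost': True}
```

Under that assumption, this task started three times, twice with no state to restore, and failed on the third start.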

----------------------------------------
Cheers! GIB@
Join BRASIL - BRAZIL@GRID team and be very happy !
http://www.worldcommunitygrid.org/team/viewTeamInfo.do?teamId=DF99KT5DN1

----------------------------------------
[Edit 1 times, last edit by GIBA at Aug 17, 2009 1:56:54 AM]
[Aug 17, 2009 1:54:48 AM]