| World Community Grid Forums
|
|
Thread Status: Active Total posts in this thread: 40
|
|
| Author |
|
|
armstrdj
Former World Community Grid Tech Joined: Oct 21, 2004 Post Count: 695 Status: Offline
|
We are looking into the issue along with the researchers. One thing to note is that it appears as though the problem may be related to restoring from a checkpoint. If a user is having a high number of errors with CEP and has the memory resources available to leave the application in memory this may work as a temporary workaround while we investigate the errors.
Thanks, armstrdj |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
"One thing to note is that it appears as though the problem may be related to restoring from a checkpoint."
Sounds very likely to me. I did some troubleshooting on my machines that were throwing errors in CEP, and it turns out that they were actually crashing and rebooting many times a day. I have BOINC on auto-start and these machines have no graphics, so I didn't catch this problem earlier. When I would check in on them via BOINCview or remote desktop, all appeared to be working fine. It was only by reading the message log in BOINCview that I could see the BOINC startup sequence repeating randomly. The workunits likely errored after the reboot and they had no checkpoint to restore. Anyways, I bumped the core voltage on each machine by 0.01 V (2 notches in BIOS) and they've been running for 48 hours now with no errored WUs or crashes. |
||
|
|
JmBoullier
Former Community Advisor Normandy - France Joined: Jan 26, 2007 Post Count: 3716 Status: Offline
|
I had my first error 29 yesterday afternoon, and it could confirm your theory about restoring from a checkpoint.
It happened about one minute after my quad had restarted 4 CEP tasks after a fresh boot. Only one of the four tasks failed, with no message in the message log and only this line "process exited with code 29 (0x1d, -227)" in the Result Log. This machine usually shows errors only when WUs are wrong, which has not happened for many weeks. This was with BOINC 6.2.18 under Ubuntu 9.04 64-bit, and obviously no antivirus program. Regarding the fresh boot and thus restarting tasks, I do it about once a day without any trouble, since I run this machine alternately under Ubuntu 64 and XP 32 every day. |
||
|
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
Jean, do you use the delay on system restart config setting? Had a different, surely unrelated one yesterday around the time of a resume, but no error message of meaning.
E000916_ 001C_ 009y0510j_ 2-- - In Progress 8/4/09 16:42:04 8/8/09 16:42:04 0.00 0.0 / 0.0
E000916_ 001C_ 009y0510j_ 1-- 632 Error 8/2/09 15:01:10 8/4/09 16:26:25 7.27 106.2 / 0.0
E000916_ 001C_ 009y0510j_ 0-- 632 Inconclusive 8/2/09 15:01:07 8/3/09 19:31:57 13.58 90.6 / 0.0
Oddly, the result first sat in PV waiting on the validator to kick in, suggesting there were normal closing signs.
Result Name: E000916_ 001C_ 009y0510j_ 1--
<core_client_version>6.6.38</core_client_version>
<![CDATA[
<stderr_txt>
Calling initGraphics()
INFO: No state to restore. Start from the beginning.
called boinc_finish
</stderr_txt>
]]>
(yes yes yes, this is a testing BOINC version on Vista and not going to try 6.6.39... list of issues with .38 now 13, and of course it ain't true, such as coming out of hibernation and the client remaining in full suspend until manually switched to run always... just ventilating ;>)
WCG
Please help to make the Forums an enjoyable experience for All! |
||
|
|
JmBoullier
Former Community Advisor Normandy - France Joined: Jan 26, 2007 Post Count: 3716 Status: Offline
|
"Jean, do you use the delay on system restart config setting?"
No, but with this config without bells and whistles, more than one minute after BOINC started, the Ubuntu boot has already completed (even in XP 32 I have reduced it to 60 seconds). |
||
|
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline
|
We are removing any jobs that end with {0h, 0i, 0j, 0k, 12, 13, 14, 15} as these appear to be causing a larger number of error 29. These work unit simulations push the temperature higher than the others causing the issue to appear more frequently. We are aborting both active work units with this identifier and future work units with these identifiers.
Thank you for your patience as we continue to work through this. -Uplinger |
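As a rough illustration of the filter described in this post, the following Python sketch checks whether a work unit name carries one of the listed endings. It assumes, which the post does not confirm, that the ending refers to the third underscore-separated field of the name (for example `009y0510j` in `E000916_001C_009y0510j_2`). The function name `has_removed_suffix` is hypothetical and is not part of any World Community Grid or BOINC tool.

```python
# Endings listed in the post above.
REMOVED_SUFFIXES = ("0h", "0i", "0j", "0k", "12", "13", "14", "15")


def has_removed_suffix(work_unit_name: str) -> bool:
    """Return True if the job field ends with one of the listed endings.

    Assumption: names look like E000916_001C_009y0510j_2 and the ending
    applies to the third underscore-separated field. Spaces, as they
    appear in names pasted into the forum, are ignored.
    """
    fields = work_unit_name.replace(" ", "").split("_")
    if len(fields) < 3:
        return False  # Not in the assumed name format.
    return fields[2].endswith(REMOVED_SUFFIXES)


if __name__ == "__main__":
    for name in ("E000916_ 001C_ 009y0510j_ 1--", "E000902_117C_009q0670f_0"):
        print(name, has_removed_suffix(name))
```

Under that assumption, the `E000916_ 001C_ 009y0510j` results quoted earlier in the thread would fall in the removed group, while names ending in `0f` would not.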
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
"We are removing any jobs that end with {0h, 0i, 0j, 0k, 12, 13, 14, 15} as these appear to be causing a larger number of error 29. These work unit simulations push the temperature higher than the others causing the issue to appear more frequently. We are aborting both active work units with this identifier and future work units with these identifiers."
Hello! Will those workunits be reworked and then issued again? Or are they withdrawn completely (leaving us with less work remaining for the project)? Greetings, Thorsten |
||
|
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline
|
"We are removing any jobs that end with {0h, 0i, 0j, 0k, 12, 13, 14, 15} as these appear to be causing a larger number of error 29. These work unit simulations push the temperature higher than the others causing the issue to appear more frequently. We are aborting both active work units with this identifier and future work units with these identifiers."
"Hello! Will those workunits be reworked and then issued again? Or are they withdrawn completely (leaving us with less work remaining for the project)? Greetings Thorsten"
The researchers may resubmit them with different parameters, but it is not probable that this will happen. -Uplinger |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Yesterday WU E000902_117C_009q0670f had an exception error at 92%. I restarted it (after a backup) and it completed. But this took only about three minutes, i.e. it needed THREE minutes to run from 92% to 100% (on an Athlon with an average completion time of 14 hours):
10-Aug-2009 15:33:34 [World Community Grid] Restarting task E000902_117C_009q0670f_0 using cep1 version 632
10-Aug-2009 15:36:48 [World Community Grid] Computation for task E000902_117C_009q0670f_0 finished
Then it reported (all result files were present). It was PV, AND BECAME VALID!! So why can't all WUs terminate at 92% if that is sufficient for becoming valid? Or is something wrong with the validator? Matthias |
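The "three minutes" figure can be checked directly from the two message-log timestamps quoted in this post. The sketch below is only an illustration: the helper name `elapsed_between` is hypothetical, and the format string assumes English month abbreviations as they appear in the quoted log lines.

```python
from datetime import datetime, timedelta

# Timestamp layout as seen in the quoted message log, e.g. "10-Aug-2009 15:33:34".
# Assumption: month abbreviations are English (default C locale).
LOG_FORMAT = "%d-%b-%Y %H:%M:%S"


def elapsed_between(start: str, end: str) -> timedelta:
    """Return the time between two message-log timestamps."""
    return datetime.strptime(end, LOG_FORMAT) - datetime.strptime(start, LOG_FORMAT)


restart = "10-Aug-2009 15:33:34"  # "Restarting task ..." line
finish = "10-Aug-2009 15:36:48"   # "Computation for task ... finished" line
elapsed = elapsed_between(restart, finish)
print(elapsed)  # 0:03:14
```

So the restart-to-finish interval was 3 minutes 14 seconds, against the poster's stated average of about 14 hours per work unit.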
||
|
|
GIBA
Ace Cruncher Joined: Apr 25, 2005 Post Count: 5374 Status: Offline |
Got another one...
One peer is in PV; another is still crunching (the new replica generated after my error was reported).
My result log:
Result Log
Result Name: E000967_ 627C_ 00a80570f_ 0--
<core_client_version>6.2.28</core_client_version>
<![CDATA[
<message>
The system cannot write to the specified device. (0x1d) - exit code 29 (0x1d)
</message>
<stderr_txt>
Calling initGraphics()
INFO: No state to restore. Start from the beginning.
Calling initGraphics()
INFO: No state to restore. Start from the beginning.
No heartbeat from core client for 30 sec - exiting
Calling initGraphics()
Encountered error. Exiting.
</stderr_txt>
]]>
Cheers ! GIB@
Join BRASIL - BRAZIL@GRID team and be very happy ! http://www.worldcommunitygrid.org/team/viewTeamInfo.do?teamId=DF99KT5DN1 [Edit 1 times, last edit by GIBA at Aug 17, 2009 1:56:54 AM] |
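For anyone comparing result logs like the one in this post, a small Python sketch can tally the lines that bear on the checkpoint theory discussed in this thread. It only counts the literal messages visible in the quoted `stderr_txt`; the function name `summarize_stderr` is hypothetical, and reading each "No state to restore" line as a start from scratch is an interpretation of the message text, not something the techs have confirmed.

```python
def summarize_stderr(stderr_text: str) -> dict:
    """Count messages of interest in a CEP result log's stderr text."""
    return {
        # Each occurrence means the task found no saved state on (re)start.
        "fresh_starts": stderr_text.count("No state to restore"),
        # The task exited because it lost contact with the BOINC client.
        "heartbeat_lost": "No heartbeat from core client" in stderr_text,
        # The task reported an error before exiting.
        "error_exit": "Encountered error" in stderr_text,
    }


# stderr text copied from the result log quoted in this post.
sample = """Calling initGraphics()
INFO: No state to restore. Start from the beginning.
Calling initGraphics()
INFO: No state to restore. Start from the beginning.
No heartbeat from core client for 30 sec - exiting
Calling initGraphics()
Encountered error. Exiting."""

summary = summarize_stderr(sample)
print(summary)
```

For this log the tally shows two starts without saved state, one lost heartbeat, and a final error exit after the third launch.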
||
|
|
|