World Community Grid Forums
Category: Completed Research Forum: Discovering Dengue Drugs - Together - Phase 2 Forum Thread: extremely long running DDD2 w/u |
Thread Status: Active Total posts in this thread: 37
|
Sandoor
Cruncher Canada Joined: May 22, 2008 Post Count: 8 Status: Offline Project Badges: |
Found one at 13h this morning; after a reboot, the CPU time dropped to 49 min. The work unit completed in regular time and is now in PV (pending validation), with the wingman erroring out at 30h.
|
||
|
JollyJimmy
Advanced Cruncher USA Joined: Aug 23, 2005 Post Count: 115 Status: Offline Project Badges: |
Seippel wrote: I'm in the process of testing several of the work units mentioned in this thread (and the other thread) to attempt to recreate the problem of very long work units that some users have reported.
Thanks for checking into this, Seippel. I've got one too: it reports 3.5% done after almost 6h CPU, with 16h remaining. While you are crunching on bugs (tasty!), any advice? "Keep 'er running" or "abort and abandon"? Would hate to time out after 12h or dump even 22h into an error.
Edit - The same task is now reporting 4.8% after almost 8h CPU and over 22h remaining!! I sure hope the task is not going fungal by simulating the growth of mushrooms.
----------------------------------------
[Edit 1 times, last edit by JollyJimmy at Feb 3, 2011 8:02:25 PM] |
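As a rough sanity check on the figures in the post above, one can extrapolate linearly from the reported progress. This is only a sketch: the function name `projected_total_hours` is made up for illustration, and it assumes progress is linear in CPU time, which may well not hold for a task stuck in a loop.

```python
def projected_total_hours(cpu_hours, percent_done):
    """Linearly extrapolate total CPU hours from the progress so far."""
    if percent_done <= 0:
        raise ValueError("percent_done must be positive to extrapolate")
    return cpu_hours * 100.0 / percent_done


# Figures reported in the post ("almost 6h" and "almost 8h" are treated as 6 and 8).
for cpu_hours, percent_done in [(6.0, 3.5), (8.0, 4.8)]:
    total = projected_total_hours(cpu_hours, percent_done)
    print(f"{percent_done}% after {cpu_hours}h -> about {total:.0f}h total")
```

Under that assumption, both readings project to well over 150 hours in total, far beyond the 16h and 22h "remaining" estimates shown by the client.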
||
|
GB033533
Senior Cruncher UK Joined: Dec 8, 2004 Post Count: 198 Status: Offline Project Badges: |
I too have had a couple of these bad boys:
ts02_ c283_ sr67b1_ 1-- IBM-60B0387EC84 Error 2/1/11 13:53:51 2/3/11 18:14:30 13.13 207.8 / 0.0
ts02_ c283_ sr78b0_ 0-- IBM-60B0387EC84 Error 2/1/11 13:53:51 2/3/11 13:04:32 13.09 204.3 / 0.0
Both failed with <message>Maximum elapsed time exceeded</message>. Am I likely to get any credit for these?
The worrying thing is that the replacement wingman for the second one completed in a normal time of 1.35 hours:
ts02_ c283_ sr78b0_ 2-- 617 Pending Validation 2/3/11 13:27:40 2/3/11 16:10:39 1.35 23.7 / 0.0
Nothing yet from the original wingmen. |
||
|
verheyde
Cruncher Belgium Joined: Dec 7, 2004 Post Count: 25 Status: Offline Project Badges: |
Got one of those running. It has already run for more than 16h CPU and is only at 6.66%:
ts02_c395_sr45b1_0
I'll leave it running for now (and have sent the info to support). |
||
|
ca05065
Senior Cruncher Joined: Dec 4, 2007 Post Count: 325 Status: Offline Project Badges: |
I have also had a long-running work unit:
ts02_c395_sr45b1_3
It was 2.33% complete after 2 hours 48 min, when they usually take only 1.5 hours. I tried rebooting the PC: the work unit restarted and even checkpointed, but so slowly that the time to completion was over 10 hours and increasing. I looked around and found the wcg_checkpoint_00.ckp file. In it I noticed messages of the type:
EIPHIFS: WARNING. dihedral 5 is almost linear. derivatives may be affected for atoms: 11 13 12 16
EIPHIFS> Total of 17 WARNINGs issued.
and
EPHI: WARNING. dihedral 9 is almost linear. derivatives may be affected for atoms: 2 56 3 36
TOTAL OF 135 WARNINGS FROM EPHI
I did not see similar warnings in sr units that ran normally. The above may or may not be relevant to the problem. I have aborted the work unit. The wingmen's results were: in progress, user aborted, and error (max CPU exceeded). |
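For anyone who wants to compare a slow unit against a normal one, a minimal sketch of counting these warnings might look like the following. It assumes the checkpoint contents can be read as plain text and that the warnings appear in the form quoted above; the helper name `count_dihedral_warnings` is hypothetical and not part of any WCG or BOINC tool.

```python
import re
from collections import Counter

# Matches warning lines of the form quoted in the post, e.g.
#   EPHI: WARNING. dihedral 9 is almost linear. ...
_WARNING = re.compile(
    r"\b(EPHI|EIPHIFS): WARNING\.\s+dihedral\s+(\d+)\s+is almost linear"
)


def count_dihedral_warnings(text):
    """Return a Counter mapping routine name to its number of warnings."""
    return Counter(match.group(1) for match in _WARNING.finditer(text))


# Sample text built from the two messages quoted in the post.
sample = (
    "EIPHIFS: WARNING. dihedral 5 is almost linear. "
    "derivatives may be affected for atoms: 11 13 12 16\n"
    "EPHI: WARNING. dihedral 9 is almost linear. "
    "derivatives may be affected for atoms: 2 56 3 36\n"
)
print(count_dihedral_warnings(sample))
```

Running the same count on a checkpoint from a normally running unit would show whether the warnings really are specific to the slow ones.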
||
|
JollyJimmy
Advanced Cruncher USA Joined: Aug 23, 2005 Post Count: 115 Status: Offline Project Badges: |
The task I noted yesterday has timed out. Here are the details:
Result Name | App Version Number | Status | Sent Time | Time Due / Return Time | CPU Time (hours)
ts02_ b483_ sr02b0_ 1-- | 617 | Error | 1/29/11 07:50:34 | 2/4/11 05:22:20 | 15.19
Result Log
Result Name: ts02_ b483_ sr02b0_ 1--
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
Maximum elapsed time exceeded
</message>
<stderr_txt>
INFO: No state to restore. Start from the beginning.
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x77540004
Engaging BOINC Windows Runtime Debugger...
[snip] |
||
|
ov7
Cruncher Joined: May 14, 2009 Post Count: 15 Status: Offline Project Badges: |
I have another one: ts02_c432_pda0004
It has been calculating for 17 hours, with 8 remaining, and is 0.000% complete! BOINC Manager 6.2.28. I think I will shoot it... |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Please do reboot the client (a soft boot, of course). For some, including me, it kicked the task out of an endless loop and let it finish normally.
--//-- |
||
|
Powhatan
Advanced Cruncher Joined: Oct 20, 2009 Post Count: 58 Status: Offline Project Badges: |
JollyJimmy wrote: While you are crunching on bugs (tasty!), any advice? "Keep 'er running" or "abort and abandon"? Would hate to time out after 12h or dump even 22h into an error.
Didn't see a response to this. I've had 3 WUs like these go to 12.5 hours and then error out, and I have 3 more that look like they are going to do the same. If it's a waste of CPU cycles, I'd like to abort them. I've rebooted, but the problem persists. The device is Win 7 x64; my other devices (x64 and x86) do not have this problem.
ts02_ c283_ sqa008_ 0-- patawomeck Error 2/1/11 13:30:27 2/3/11 23:40:48 12.56 262.6 / 0.0
ts02_ c283_ sqa003_ 1-- patawomeck Error 2/1/11 13:30:07 2/3/11 15:26:33 12.67 264.9 / 0.0
ts02_ c283_ sda002_ 0-- patawomeck Error 2/1/11 13:13:05 2/3/11 15:26:33 12.52 261.8 / 0.0
ts02_ c283_ sr45a1_ 0-- patawomeck In Progress 2/1/11 13:52:50 2/11/11 13:52:50 0.00 0.0 / 0.0
ts02_ c283_ sr45a0_ 1-- patawomeck In Progress 2/1/11 13:52:50 2/11/11 13:52:50 0.00 0.0 / 0.0
ts02_ c284_ sr34b1_ 1-- patawomeck In Progress 2/1/11 13:55:45 2/11/11 13:55:45 0.00 0.0 / 0.0 |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
There was no "official" response on what to do. The techs have a range of reference WU names to work with to see if they can reproduce this in the labs. Predominantly it seems to hit the 's' types, but I had a 'p' myself that did this. Maybe the techs can teach the validator to look for that "Maximum elapsed time exceeded" line in the Result Log and then take these automatically out of circulation till a fix is in place.
If your wingmen are OK and your device produces an above-average number of these (I only had 2 out of about 250 on my quad, of which 1 finished properly after a restart), then your particular device could be of interest, so you might want to post the startup section of the message log so that we can see the full setup. Certainly, if the wingmen are OK, I'd abort the overlong-running ones unless instructed otherwise for diagnostic purposes. --//-- |
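To illustrate the kind of check suggested above, here is a minimal sketch that flags a result log containing the "Maximum elapsed time exceeded" message. This is not how the WCG validator works; the function name `hit_elapsed_time_limit` is hypothetical, and the sample log is trimmed from the one posted earlier in this thread.

```python
import re

# Allow whitespace or newlines between the tags and the message text,
# since posted logs show the message both inline and on its own line.
_ELAPSED = re.compile(
    r"<message>\s*Maximum elapsed time exceeded\s*</message>"
)


def hit_elapsed_time_limit(result_log):
    """Return True if the result log reports the elapsed-time error."""
    return _ELAPSED.search(result_log) is not None


# Trimmed from the result log quoted earlier in the thread.
sample_log = (
    "<core_client_version>6.10.58</core_client_version>\n"
    "<![CDATA[\n"
    "<message>\n"
    "Maximum elapsed time exceeded\n"
    "</message>\n"
)
print(hit_elapsed_time_limit(sample_log))
```

A volunteer could run something like this over saved result logs to list the affected work units before reporting them.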
||
|
|