| World Community Grid Forums
Thread Status: Active
Total posts in this thread: 28
|
|
| Author |
|
|
gb009761
Master Cruncher Scotland Joined: Apr 6, 2005 Post Count: 3010 Status: Offline Project Badges:
|
"This is a record breaking WU. LOL 2 days and 9 hours (CPU time) - still counting"
X-Files 27, I can do better than that - 4d, 11+ hours, and still going... |
||
|
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7848 Status: Offline Project Badges:
|
So far, out of about 500 units done, I have had one which appeared "stuck", sitting between 1 and 2% after about 8 hours. The "suspend" and then "resume" gambit worked to get it going again. The timer reset and it finished in about 4 hours. Happened on a Linux machine.
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
|
nanoprobe
Master Cruncher Classified Joined: Aug 29, 2008 Post Count: 2998 Status: Offline Project Badges:
|
FWIW, it seems that if you have a unit at 100% with no checkpoints, it's a dud. I had several earlier that never finished and wouldn't restart.
@gb009761: How many other tasks could you have finished in the 4+ days that you've put into that 1 task? Just sayin.
In 1969 I took an oath to defend and protect the U S Constitution against all enemies, both foreign and Domestic. There was no expiration date.
|
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I'm with nanoprobe. I can be patient for an hour but after that I will do something. If the silly thing will only repeat, I will abort it.
Lawrence |
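The rule of thumb from the two posts above (a unit showing 100% that has never checkpointed, plus about an hour of patience) can be written down as a small check. This is only a sketch: `TaskSnapshot`, its fields, and `looks_like_dud` are hypothetical names, the sample values are illustrative, and nothing here reads from a real BOINC client.

```python
from dataclasses import dataclass


@dataclass
class TaskSnapshot:
    """Hypothetical snapshot of one task, as read off the BOINC Manager by hand."""
    name: str
    percent_done: float      # progress shown in the manager, 0-100
    has_checkpointed: bool   # whether the task has ever written a checkpoint
    elapsed_hours: float     # elapsed time so far, in hours


def looks_like_dud(task: TaskSnapshot, patience_hours: float = 1.0) -> bool:
    """Flag a task that sits at 100%, never checkpointed, and has outlasted our patience."""
    at_full = task.percent_done >= 100.0
    return at_full and not task.has_checkpointed and task.elapsed_hours > patience_hours


# Illustrative values only, not taken from any real task.
stuck = TaskSnapshot("stuck_example", 100.0, False, 13.0)
healthy = TaskSnapshot("healthy_example", 42.0, True, 2.5)

print(looks_like_dud(stuck))
print(looks_like_dud(healthy))
```

The one-hour default mirrors the patience mentioned above; someone who would rather let a unit run longer before aborting can simply pass a larger `patience_hours`.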
||
|
|
gb009761
Master Cruncher Scotland Joined: Apr 6, 2005 Post Count: 3010 Status: Offline Project Badges:
|
nanoprobe, as it's my electric that I'm using, and for the sake of testing/proving that these don't ever appear to finish (just think of the WU's that are sent to "unmonitored" BOINC sessions - after all, the "set and forget" brigade are far larger than those who are active on the forums), I am willing to forgo a few "wasted cycles" if it helps the techs in the long run find out what's going wrong with these.
Just to expand upon the issue, the WU that I've got which appears stuck currently has 4 copies "out in the wild": 3 * "no reply" (of which, I'm assuming, the other two besides mine could still be running - i.e., "wasting" cycles), and another WU "in process", which, again, will probably go the same way. Thus, if I can help the techs, then I will. After all, if no one highlights these, or is willing to see whether they'll ever finish if they're just left to run, then who knows how many cycles would be wasted through the life of this particular project. I'm not here to try and crunch through as many of the WU's as possible, I'm here for the long-term good of the project.
[Edit 1 times, last edit by gb009761 at Nov 22, 2013 6:45:35 AM] |
||
|
|
X-Files 27
Senior Cruncher Canada Joined: May 21, 2007 Post Count: 391 Status: Offline Project Badges:
|
Tech, what should we do about these WUs?
|
||
|
|
gb009761
Master Cruncher Scotland Joined: Apr 6, 2005 Post Count: 3010 Status: Offline Project Badges:
|
Well, due to a "No heartbeat from client for 30 sec - exiting" issue, my WU reset itself again, going straight to 100% complete and, again, with the CPU/Elapsed time continuing to increment.
What I am surprised about, though, is this line from the messages tab:
22/11/2013 13:25:07 World Community Grid Task MCM1_0000045_2091_2 is 4.75 days overdue; you may not get credit for it. Consider aborting it.
I would have thought that BOINC would have killed it - but no, it's left for human intervention to kill it (which I'm just about to do - I have made a copy of the relevant slot for the techs' benefit, if they want anything out of there).
If it is left up to the person running BOINC to spot it and kill it manually (as it certainly looks like), then that further convinces me that those other 3 WU's (2 * "No Reply" and 1 * "In Progress") will continue indefinitely, making this something the techs may seriously want to investigate... |
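As a quick worked check of the "4.75 days overdue" figure in the message quoted above, the deadline it implies can be back-calculated from the message timestamp. This is a sketch under assumptions: the timestamp is read as day/month/year, the 4.75 is a rounded value so the result is approximate, and `days_past` is a hypothetical helper, not anything from BOINC itself.

```python
from datetime import datetime, timedelta

# Timestamp and overdue figure as they appear in the quoted message
# (read as dd/mm/yyyy, which is an assumption about the client's locale).
message_time = datetime(2013, 11, 22, 13, 25, 7)
days_overdue = 4.75

# Work backwards: the deadline is the message time minus the overdue span.
implied_deadline = message_time - timedelta(days=days_overdue)
print(implied_deadline)  # 2013-11-17 19:25:07


def days_past(deadline: datetime, now: datetime) -> float:
    """Hypothetical helper: how many days 'now' is past 'deadline'."""
    return (now - deadline).total_seconds() / 86400


print(days_past(implied_deadline, message_time))
```

So the task's deadline would have been on the evening of 17 November, roughly, which is consistent with it having been sent out well before this part of the thread.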
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
BOINC will not let them run forever. Also, each project gets to set a shorter deadline within the work unit it sends out. Finally, the server has a limit on the number of replications it will allow before it withdraws a work unit and marks it for human attention (about 8 times). Uplinger can tell you what the values are for MCM. I don't know.
I don't worry about a lost hour but I do abort stuck units before they waste a day. If you think you have some useful error information, post it with the work unit identifier. Please don't post the printout from the Windows debugger (unless requested). The debugger works best when it is called from a debug statement within the project code. These problems are normal at the start of a project and are usually handled behind the scenes without much information being given to us. I am not withholding any information.
Lawrence |
||
|
|
gb009761
Master Cruncher Scotland Joined: Apr 6, 2005 Post Count: 3010 Status: Offline Project Badges:
|
Hi Lawrence, that's why I've already e-mailed all the details I believe they'll need through to the techs (as can be seen from my previous posts, I did that yesterday).
As to "These problems are normal at the start of a project", that's why I'm highlighting it for the techs now - whilst we're in the "pre-official announcement, production testing" stage. If (and I'm most certainly not saying that this is - it's not my judgement to call), this is some sort of major issue, then it's best to get it resolved before the project is officially launched. With regards to BOINC eventually killing overdue WU's, I sincerely hope that, eventually, it would have killed them. What I was slightly surprised about, was that it hadn't killed it after being 4.75 days overdue (anyone know what the actual time limit is?). |
||
|
|
wplachy
Senior Cruncher Joined: Sep 4, 2007 Post Count: 423 Status: Offline |
Looks like I have one with the same problem: MCM1_0000054_2408_3. Time is 13:10:31 (13:09:29); both elapsed & CPU continue to increment; percent complete 100%; has never checkpointed.
stderr.txt is:
Commandline = projects/www.worldcommunitygrid.org/wcgrid_mcm1_7.24_windows_x86_64 -SettingsFile MCM1_0000054_2408.txt -DatabaseFile dataset-17_72_SDG_v1.txt
Initializing
wcg_learn_limit = 500000
Running
5 copies sent: 1=User Abort; 2=No Reply; 2=In Progress
MCM1_0000054_2408_4-- - In Progress 11/22/13 14:18:54 11/25/13 14:18:54 0.00 0.0 / 0.0
MCM1_0000054_2408_3-- - In Progress 11/21/13 21:30:15 11/24/13 21:30:15 0.00 0.0 / 0.0 <-Mine
MCM1_0000054_2408_2-- - No Reply 11/12/13 14:18:49 11/22/13 14:18:49 0.00 0.0 / 0.0
MCM1_0000054_2408_0-- - No Reply 11/11/13 21:30:09 11/21/13 21:30:09 0.00 0.0 / 0.0
MCM1_0000054_2408_1-- 724 User Aborted 11/11/13 21:29:51 11/12/13 14:18:32 0.43 15.4 / 0.0
Looks like it's time to abort this one as well. What a waste, 13 hours of time and cost.
Bill P
[Edit 1 times, last edit by wplachy at Nov 22, 2013 5:07:12 PM] |
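For anyone who wants to double-check a status listing like the one in the post above, here is a minimal sketch that tallies the copies by status and works out the time window on each open copy. It assumes the two timestamps on each row are the sent time followed by the due time (or the return time for the aborted copy) in month/day/year order; the listing itself carries no column headers, so that reading is an assumption, and `ROWS`, `status_counts`, and `windows` are names made up for this example.

```python
from collections import Counter
from datetime import datetime

FMT = "%m/%d/%y %H:%M:%S"  # assumed month/day/year order

# (result name, status, first timestamp, second timestamp), copied from the post.
ROWS = [
    ("MCM1_0000054_2408_4", "In Progress", "11/22/13 14:18:54", "11/25/13 14:18:54"),
    ("MCM1_0000054_2408_3", "In Progress", "11/21/13 21:30:15", "11/24/13 21:30:15"),
    ("MCM1_0000054_2408_2", "No Reply", "11/12/13 14:18:49", "11/22/13 14:18:49"),
    ("MCM1_0000054_2408_0", "No Reply", "11/11/13 21:30:09", "11/21/13 21:30:09"),
    ("MCM1_0000054_2408_1", "User Aborted", "11/11/13 21:29:51", "11/12/13 14:18:32"),
]

# Tally how many copies are in each state.
status_counts = Counter(status for _, status, _, _ in ROWS)

# Days between the two timestamps, skipping the aborted copy, whose second
# timestamp is presumably a return time rather than a deadline.
windows = {
    name: (datetime.strptime(second, FMT) - datetime.strptime(first, FMT)).total_seconds() / 86400
    for name, status, first, second in ROWS
    if status != "User Aborted"
}

print(dict(status_counts))
for name, days in windows.items():
    print(name, days)
```

The tally agrees with the "1=User Abort; 2=No Reply; 2=In Progress" summary in the post, and under the stated reading of the columns the two "In Progress" copies have a much shorter window than the two "No Reply" copies.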
||
|
|
|