Thread Status: Active
Total posts in this thread: 28
Posts: 28   Pages: 3   [ Previous Page | 1 2 3 | Next Page ]
This topic has been viewed 28053 times and has 27 replies
gb009761
Master Cruncher
Scotland
Joined: Apr 6, 2005
Post Count: 3010
Status: Offline
Re: Stuck unit?

This is a record-breaking WU. LOL

2 days and 9 hours (CPU time) - still counting


X-Files 27, I can do better than that - 4d, 11+ hours, and still going...
----------------------------------------

[Nov 21, 2013 7:13:25 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7848
Status: Offline
Re: Stuck unit?

So far, out of about 500 units done, I have had one which appeared "stuck", sitting between 1 and 2% after about 8 hours. The "suspend" and then "resume" gambit worked to get it going again. The timer reset and it finished in about 4 hours. Happened on a Linux machine.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Nov 21, 2013 10:19:24 PM]
nanoprobe
Master Cruncher
Classified
Joined: Aug 29, 2008
Post Count: 2998
Status: Offline
Re: Stuck unit?

FWIW, it seems that if you have a unit at 100% with no checkpoints, it's a dud. I had several earlier that never finished and wouldn't restart.

@gb009761: How many other tasks could you have finished in the 4+ days that you've put into that 1 task? Just sayin.
----------------------------------------
In 1969 I took an oath to defend and protect the U S Constitution against all enemies, both foreign and Domestic. There was no expiration date.


[Nov 22, 2013 2:27:44 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Stuck unit?

I'm with nanoprobe. I can be patient for an hour but after that I will do something. If the silly thing will only repeat, I will abort it.

Lawrence
[Nov 22, 2013 5:12:08 AM]
gb009761
Master Cruncher
Scotland
Joined: Apr 6, 2005
Post Count: 3010
Status: Offline
Re: Stuck unit?

nanoprobe, as it's my electric that I'm using, and for the sake of testing/proving that these don't ever appear to finish (just think of the WU's that are sent to "unmonitored" BOINC sessions - after all, the "set and forget" brigade are far larger than those who are active on the forums), I am willing to forgo a few "wasted cycles" if it helps the techs in the long run find out what's going wrong with these.

Just to expand upon the issue, the WU that I've got which appears stuck currently has 4 copies "out in the wild": 3 * "No Reply" (of which, I'm assuming, the other two besides mine could still be running - i.e., "wasting" cycles), and another WU "In Progress" - which, again, will probably go the same way. Thus, if I can help the techs, then I will. After all, if no one highlights these, or is willing to see whether they'll ever finish if they're just left to run, then who knows how many cycles would be wasted through the life of this particular project. I'm not here to try and crunch through as many of the WU's as possible; I'm here for the long-term good of the project.
----------------------------------------

----------------------------------------
[Edit 1 times, last edit by gb009761 at Nov 22, 2013 6:45:35 AM]
[Nov 22, 2013 6:09:19 AM]
X-Files 27
Senior Cruncher
Canada
Joined: May 21, 2007
Post Count: 391
Status: Offline
Re: Stuck unit?

Tech, what should we do about these WUs?
----------------------------------------

[Nov 22, 2013 9:07:24 AM]
gb009761
Master Cruncher
Scotland
Joined: Apr 6, 2005
Post Count: 3010
Status: Offline
Re: Stuck unit?

Well, due to a "No heartbeat from client for 30 sec - exiting" issue, my WU reset itself, again, going straight to 100% complete and again, having the CPU/Elapsed time continuing to increment.

What I am surprised about, though, is this line from the Messages tab:
22/11/2013 13:25:07 World Community Grid Task MCM1_0000045_2091_2 is 4.75 days overdue; you may not get credit for it. Consider aborting it.
I would have thought that BOINC would have killed it - but no, it's left for human intervention to kill it (which I'm just about to do - I have made a copy of the relevant slot for the techs' benefit, if they want anything out of there).
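
As a rough cross-check on that message, the deadline it implies can be worked backwards from the timestamp and the "4.75 days overdue" figure. This is only a sketch: it assumes the timestamp is local time in day-first (DD/MM/YYYY) format and that the overdue figure is rounded to two decimals, so the result is approximate. The function name is made up for illustration and is not part of BOINC.

```python
from datetime import datetime, timedelta


def implied_deadline(message_time: str, days_overdue: float) -> datetime:
    """Estimate the report deadline from an 'is N days overdue' message.

    Assumes message_time looks like '22/11/2013 13:25:07' (day first).
    """
    seen = datetime.strptime(message_time, "%d/%m/%Y %H:%M:%S")
    # Step back by the overdue period to land on the (approximate) deadline.
    return seen - timedelta(days=days_overdue)


if __name__ == "__main__":
    # Values copied from the Messages tab line quoted above.
    print(implied_deadline("22/11/2013 13:25:07", 4.75))
```

With the figures from the message, this puts the deadline at around the evening of 17 November 2013.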

If it is left up to the person running BOINC to spot it and kill it manually (as it certainly looks), then that further convinces me that those other 3 WU's (2 * "No Reply" and 1 * "In Progress") will continue indefinitely - thus making this something the techs may seriously want to investigate...
----------------------------------------

[Nov 22, 2013 1:33:36 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Stuck unit?

BOINC will not let them run forever. Also, each project gets to set a shorter deadline within the work unit it sends out. Finally, the server has a limit on the number of replications it will allow before it withdraws a work unit and marks it for human attention (about 8 times). Uplinger can tell you what the values are for MCM. I don't know.

I don't worry about a lost hour, but I do abort stuck units before they waste a day. If you think you have some useful error information, post it with the work unit identifier. Please don't post the printout from the Windows debugger (unless requested). The debugger works best when it is called from a debug statement within the project code.

These problems are normal at the start of a project and are usually handled behind the scenes without much information being given to us. I am not withholding any information.

Lawrence
[Nov 22, 2013 2:27:51 PM]
gb009761
Master Cruncher
Scotland
Joined: Apr 6, 2005
Post Count: 3010
Status: Offline
Re: Stuck unit?

Hi Lawrence, that's why I've already e-mailed all the details I believe they'll need through to the techs (as can be seen from my previous posts, I did that yesterday).

As to "These problems are normal at the start of a project", that's why I'm highlighting it for the techs now - whilst we're in the "pre-official announcement, production testing" stage. If (and I'm most certainly not saying that this is - it's not my judgement to call), this is some sort of major issue, then it's best to get it resolved before the project is officially launched.

With regards to BOINC eventually killing overdue WU's, I sincerely hope that, eventually, it would have killed them. What I was slightly surprised about, was that it hadn't killed it after being 4.75 days overdue (anyone know what the actual time limit is?).
----------------------------------------

[Nov 22, 2013 2:53:41 PM]
wplachy
Senior Cruncher
Joined: Sep 4, 2007
Post Count: 423
Status: Offline
Re: Stuck unit?

Looks like I have one with the same problem: MCM1_0000054_2408_3. Time is 13:10:31 (13:09:29); both elapsed and CPU time continue to increment; percent complete is 100%; it has never checkpointed.

stderr.txt is:
Commandline = projects/www.worldcommunitygrid.org/wcgrid_mcm1_7.24_windows_x86_64 -SettingsFile MCM1_0000054_2408.txt -DatabaseFile dataset-17_72_SDG_v1.txt
Initializing
wcg_learn_limit = 500000
Running

5 copies sent: 1=User Abort; 2=No Reply; 2=In Progress
MCM1_0000054_2408_4-- - In Progress 11/22/13 14:18:54 11/25/13 14:18:54 0.00 0.0 / 0.0
MCM1_0000054_2408_3-- - In Progress 11/21/13 21:30:15 11/24/13 21:30:15 0.00 0.0 / 0.0 <-Mine
MCM1_0000054_2408_2-- - No Reply 11/12/13 14:18:49 11/22/13 14:18:49 0.00 0.0 / 0.0
MCM1_0000054_2408_0-- - No Reply 11/11/13 21:30:09 11/21/13 21:30:09 0.00 0.0 / 0.0
MCM1_0000054_2408_1-- 724 User Aborted 11/11/13 21:29:51 11/12/13 14:18:32 0.43 15.4 / 0.0

Looks like it's time to abort this one as well. What a waste: 13 hours of time and cost.
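
For anyone wanting to check the tally, the listing above can be summarised with a few lines of Python. This is a minimal sketch: the rows are retyped by hand from the listing, and it assumes the two timestamps on each row are the sent time and the time due (or the return time for the aborted copy), which the listing itself does not label. The function names are made up for illustration.

```python
from collections import Counter
from datetime import datetime

FMT = "%m/%d/%y %H:%M:%S"  # month first, as in the listing

# (result name, status, first timestamp, second timestamp), retyped from the listing.
ROWS = [
    ("MCM1_0000054_2408_4", "In Progress", "11/22/13 14:18:54", "11/25/13 14:18:54"),
    ("MCM1_0000054_2408_3", "In Progress", "11/21/13 21:30:15", "11/24/13 21:30:15"),
    ("MCM1_0000054_2408_2", "No Reply", "11/12/13 14:18:49", "11/22/13 14:18:49"),
    ("MCM1_0000054_2408_0", "No Reply", "11/11/13 21:30:09", "11/21/13 21:30:09"),
    ("MCM1_0000054_2408_1", "User Aborted", "11/11/13 21:29:51", "11/12/13 14:18:32"),
]


def tally(rows):
    """Count how many copies are in each status."""
    return Counter(status for _, status, _, _ in rows)


def window_days(rows):
    """Days between the two timestamps for copies that were never returned."""
    windows = {}
    for name, status, first, second in rows:
        if status in ("In Progress", "No Reply"):
            delta = datetime.strptime(second, FMT) - datetime.strptime(first, FMT)
            windows[name] = delta.total_seconds() / 86400
    return windows


if __name__ == "__main__":
    print(dict(tally(ROWS)))
    for name, days in window_days(ROWS).items():
        print(f"{name}: {days:.1f} days")
```

On these rows the two "In Progress" copies have 3 days between the two timestamps, while the two "No Reply" copies had 10.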
----------------------------------------
Bill P

----------------------------------------
[Edit 1 times, last edit by wplachy at Nov 22, 2013 5:07:12 PM]
[Nov 22, 2013 5:06:40 PM]