World Community Grid Forums
Category: Completed Research | Forum: Help Cure Muscular Dystrophy - Phase 2 Forum | Thread: WU stuck on 88.888%?
Thread Status: Active Total posts in this thread: 27
tombell12
Advanced Cruncher Australia Joined: Oct 8, 2009 Post Count: 87 Status: Offline Project Badges: |
Hey guys
I've come across an HCMD2 workunit that seems to be stuck at exactly 88.888%, with the time remaining hovering around 1:30. I don't know how long it has been like this, as I have been out today. It has run for nearly 13 hours (at the time of typing), where an average unit takes about 7-9 hours with full CPU resources available. I find this unusual because another unit that is running continues to tick along. What could cause an HCMD2 unit to hang like that? Both instances of the app in memory are still consuming CPU cycles.
**Edit: It was an HCMD2 unit mistaken for a RICE unit.
[Edit 1 times, last edit by tombell12 at Feb 18, 2010 12:37:24 PM]
||
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
We don't know why it happens, and with not even 2 months to go before this research project finishes and 25 million results already validated, I doubt anyone is going to investigate the root cause. The quickest possible fix is to restart the client or the system [the latter is better, as it refreshes everything]. If the unit is not progressing properly afterwards, which you should be able to see within 5-10 minutes, abort it.
edit: Ignore the time comment, as this thread was started with the wrong application identified in the OP! Follow below for more.
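For anyone who would rather script the "watch it for 5-10 minutes" check than eyeball it, here is a minimal Python sketch. It assumes you have collected (elapsed seconds, fraction done) samples yourself by some means. The function name `is_stalled`, the 600-second window, and the tolerance are illustrative choices of mine, not anything BOINC or WCG provides.

```python
from typing import List, Tuple


def is_stalled(samples: List[Tuple[float, float]],
               window_s: float = 600.0,
               min_delta: float = 1e-6) -> bool:
    """Return True if progress has not moved over the last `window_s` seconds.

    `samples` is a time-ordered list of (elapsed_seconds, fraction_done).
    Hypothetical helper: the sampling itself is up to you.
    """
    if len(samples) < 2:
        return False  # not enough data to judge

    t_last, f_last = samples[-1]

    # Find the most recent sample that is at least `window_s` older
    # than the latest one; that is the reference point.
    reference = None
    for t, f in samples:
        if t <= t_last - window_s:
            reference = f
        else:
            break

    if reference is None:
        return False  # we have not watched for a full window yet

    return (f_last - reference) <= min_delta


if __name__ == "__main__":
    stuck = [(0.0, 0.88888), (300.0, 0.88888), (600.0, 0.88888)]
    moving = [(0.0, 0.50), (300.0, 0.55), (600.0, 0.60)]
    print("stuck task stalled?", is_stalled(stuck))
    print("moving task stalled?", is_stalled(moving))
```

A True result after a restart would be the cue to abort, as suggested above. A False result only means progress moved within the window, or that the task has not yet been watched for a full window.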
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
[Edit 1 times, last edit by Sekerob at Feb 19, 2010 9:28:23 AM]
||
|
tombell12
Ah cool. I see what you mean Sekerob and if it's a once-off well there's no harm in aborting it if a refresh won't do the trick. Cheers!
|
||
|
tombell12
Major announcement: I have posted in the wrong section. I just saw that it was actually an HCMD2 unit. I got confused because of the similar run times of RICE and HCMD2.
Does that change anything though? (Can this be moved to the HCMD2 forum?)
[Edit 1 times, last edit by tombell12 at Feb 18, 2010 12:38:56 PM]
||
|
Sekerob
Well, let's have this thread moved to the right section then. In the meantime, tell us what happened after the client/system restart.
Is the 13 hours CPU time or elapsed/wall-clock time? Normally HCMD2 jobs finish on the position that was running at the 12th hour. If it's a really tough position, it will take a long time... there are some monsters around. The test would be to look at the WU detail on the Result Status page and see what the wingman did. Let us know.
||
|
tombell12
Here's the interesting thing, Sekerob: after a (hard) reboot, the elapsed time appears to have rolled back to around 11hr:23min. Strange, that?
"The Super Long Workunit": the other cruncher took only 2.80 CPU hours?? I can't think of what I might have done wrong, if I am indeed at fault.
||
|
Sekerob
The fact that you got past the 6 hours indicates that more than 60% of the total was done at that progress point.
Watch the checkpointing if you've switched that log flag on [see the FAQ index for the How To]. Checkpoints indicate saved progress. Presumably the job will be back at that 88.88% pretty quickly, or is already there. If you don't see it finishing soon after it reaches where it was, I think you're best off taking your losses, given that the wingman did it in 2.8 hours, i.e. all positions in the task, including any very tough ones. In short: abort the job.
||
|
tombell12
Well, using the FAQ I found the BOINC data directory and checked the boinc_task_state XML file, which did show the same stalled percentage as the manager. The task has not checkpointed since 4:21pm local time, and it is now 12:49am Friday here.
Contents of boinc_task_state: <active_task>
I will follow your advice and abort the unit.
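The check described above (reading the task state file to see whether progress and checkpoints have moved) can be sketched in Python. The actual file contents were not preserved in this post, so the sample XML below is made up for illustration. The element names `result_name`, `checkpoint_cpu_time`, `checkpoint_elapsed_time` and `fraction_done` are my assumption about what such a file holds and may differ between client versions, and `read_task_state` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up sample for illustration only; NOT the contents from the post above.
# Element names are assumptions and may differ between BOINC client versions.
SAMPLE_TASK_STATE = """<active_task>
    <result_name>hypothetical_task_0</result_name>
    <checkpoint_cpu_time>40000.0</checkpoint_cpu_time>
    <checkpoint_elapsed_time>40980.0</checkpoint_elapsed_time>
    <fraction_done>0.88888</fraction_done>
</active_task>"""


def read_task_state(xml_text: str) -> dict:
    """Pull the progress-related fields out of a task state document."""
    root = ET.fromstring(xml_text)

    def number(tag: str):
        # Return the element's text as a float, or None if it is absent.
        text = root.findtext(tag)
        return float(text) if text else None

    return {
        "result_name": root.findtext("result_name"),
        "checkpoint_cpu_time": number("checkpoint_cpu_time"),
        "checkpoint_elapsed_time": number("checkpoint_elapsed_time"),
        "fraction_done": number("fraction_done"),
    }


if __name__ == "__main__":
    state = read_task_state(SAMPLE_TASK_STATE)
    print(f"task: {state['result_name']}")
    print(f"progress: {state['fraction_done'] * 100:.3f}%")
    print(f"last checkpoint at {state['checkpoint_elapsed_time'] / 3600:.2f} h elapsed")
```

Reading the file twice a few minutes apart and comparing the two results would show whether the task is still checkpointing. With a real file you would pass in its text instead of the sample string.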
||
|
tombell12
The stderr.txt left behind in slot 0 appears to contain debug information. Would this be of any importance?
|
||
|
Sekerob
Always of interest on a fresh project. Please mail the WU name particulars and an issue description to support@worldcommunitygrid.org, f.a.o. uplinger.
Thanks.
||
|
|