Thread Status: Active
Total posts in this thread: 27
This topic has been viewed 2646 times and has 26 replies.
tombell12
Advanced Cruncher
Australia
Joined: Oct 8, 2009
Post Count: 87
Status: Offline
WU stuck on 88.888%?

Hey guys,

I've come across an HCMD2 workunit that seems to be stuck on exactly 88.888%, with the time remaining sitting constantly at around 1:30. I don't know how long it has been like this, as I have been out today. It has been running for nearly 13 hours (at the time of typing), whereas the average unit takes about 7-9 hours with full CPU resources available.

I find this unusual, as another unit which is running continues to tick along.

What could cause an HCMD2 unit to hang like that? CPU cycles are still being consumed by both instances of the app running in memory.

**Edit: It was an HCMD2 unit mistaken for a RICE unit. D'oh!
----------------------------------------
[Edit 1 times, last edit by tombell12 at Feb 18, 2010 12:37:24 PM]
[Feb 18, 2010 7:26:31 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: WU stuck on 88.888%?

Why it happens we don't know, and with not even 2 months before this research project finishes and 25 million results already validated, I doubt anyone is going to investigate the root cause. The quickest possible fix is to restart the client or the system [the latter is better, as it refreshes everything]. If the unit is not progressing properly afterwards, which you should see within 5-10 minutes, abort it.
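For anyone who would rather not eyeball the manager, the "no progress within 5-10 minutes" check can be sketched roughly as below. This is only an illustration: the function name `is_stalled`, the sample format, and the 600-second default are assumptions made for this example, not anything provided by BOINC or WCG. How the progress readings are collected (by hand from the manager, for instance) is left open.

```python
def is_stalled(samples, window_seconds=600.0):
    """Return True if progress has not moved for at least window_seconds.

    samples: list of (timestamp_seconds, fraction_done) tuples, oldest first.
    Both the format and the 10-minute default are assumptions for this sketch.
    """
    if not samples:
        return False
    latest_time, latest_fraction = samples[-1]
    # Walk backwards to find when the current progress value was first seen.
    first_seen = latest_time
    for timestamp, fraction in reversed(samples):
        if fraction < latest_fraction:
            break
        first_seen = timestamp
    return latest_time - first_seen >= window_seconds


# Hypothetical readings taken every 5 minutes after a restart.
readings = [(0, 0.80), (300, 0.888889), (600, 0.888889), (900, 0.888889)]
print(is_stalled(readings))
```

A longer window is the safer choice for a workunit that may simply be grinding through a hard position rather than hanging.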

edit: Ignore the time comment as this thread was started with the wrong application being identified in OP! Follow below for more.
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
----------------------------------------
[Edit 1 times, last edit by Sekerob at Feb 19, 2010 9:28:23 AM]
[Feb 18, 2010 8:02:00 AM]
tombell12
Advanced Cruncher
Australia
Joined: Oct 8, 2009
Post Count: 87
Status: Offline
Re: WU stuck on 88.888%?

Ah cool. I see what you mean, Sekerob, and if it's a once-off, well, there's no harm in aborting it if a refresh won't do the trick. Cheers!
[Feb 18, 2010 12:28:48 PM]
tombell12
Advanced Cruncher
Australia
Joined: Oct 8, 2009
Post Count: 87
Status: Offline
Re: WU stuck on 88.888%?

Major announcement. I have posted in the wrong section. I just saw that it was actually an HCMD2 unit. I got confused cos of the similar run times of RICE and HCMD2.

Does that change anything though?

(Can this be moved to the HCMD2 forum?)
----------------------------------------
[Edit 1 times, last edit by tombell12 at Feb 18, 2010 12:38:56 PM]
[Feb 18, 2010 12:34:49 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: WU stuck on 88.888%?

Well, let's have this thread moved to the right section then. In the meantime, tell us what happened after the client/system restart.

Is the 13 hours CPU time or Elapsed/Wallclock time? Normally HCMD2 jobs will finish on the position that was running at the 12th hour. If it's a real tough position it will take long... there are those monsters around, so the test would be to look on the Result Status page WU detail and see what the wingman did.

Let us know.
[Feb 18, 2010 12:44:52 PM]
tombell12
Advanced Cruncher
Australia
Joined: Oct 8, 2009
Post Count: 87
Status: Offline
Re: WU stuck on 88.888%?

Here's the interesting thing, Sekerob: after (hard) rebooting, it appears the Elapsed time has rolled back to around 11hr:23min. Strange, that?

The Super Long Workunit

The other cruncher took only 2.80 CPU hours?? I can't think of what I might have done wrong, if I am indeed at fault?
[Feb 18, 2010 1:16:01 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: WU stuck on 88.888%?

The fact that you got past the 6 hours indicates that more than 60% of the total was done at that progress point.

Watch the checkpointing if you've switched that log flag on [see FAQ index for the How To]. These indicate saved progress. Presume the job to be pretty quick or already on that 88.88%.

If you don't see the job finishing soon after where it was, I think you're best off taking your losses, given that a wingman did it in 2.8 hours, i.e. all positions in the task, including any very tough ones. In short: abort the job.
[Feb 18, 2010 1:25:59 PM]
tombell12
Advanced Cruncher
Australia
Joined: Oct 8, 2009
Post Count: 87
Status: Offline
Re: WU stuck on 88.888%?

Well, using the FAQ I found the BOINC data directory and checked the boinc_task_state XML file, which indeed showed the same stalled percentage as the manager. It has not checkpointed since 4:21pm local time. It is now 12:49am Friday here.

Contents of boinc_task_state:

<active_task>
<project_master_url>http://www.worldcommunitygrid.org/</project_master_url>
<result_name>CMD2_0342-HXK2A.clustersOccur-2IAE_A.clustersOccur_11_955_972_0</result_name>
<checkpoint_cpu_time>33626.300000</checkpoint_cpu_time>
<checkpoint_elapsed_time>40957.720194</checkpoint_elapsed_time>
<fraction_done>0.888889</fraction_done>
</active_task>
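
For readers who want to make sense of those numbers, here is a minimal Python sketch that parses the snippet above and converts the checkpoint times from seconds into hours:minutes:seconds. The helper names `hms` and `summarize` are made up for this example, and the XML is pasted in as a string rather than read from the slot directory, since the exact path differs per install. The 8/9 approximation of `fraction_done` is plain arithmetic; whether it corresponds to 8 of 9 positions being finished is only a guess.

```python
import xml.etree.ElementTree as ET
from fractions import Fraction

# The boinc_task_state contents quoted in the post, pasted as a string.
TASK_STATE = """<active_task>
<project_master_url>http://www.worldcommunitygrid.org/</project_master_url>
<result_name>CMD2_0342-HXK2A.clustersOccur-2IAE_A.clustersOccur_11_955_972_0</result_name>
<checkpoint_cpu_time>33626.300000</checkpoint_cpu_time>
<checkpoint_elapsed_time>40957.720194</checkpoint_elapsed_time>
<fraction_done>0.888889</fraction_done>
</active_task>"""


def hms(seconds):
    """Format a duration in seconds as H:MM:SS (hypothetical helper)."""
    total = int(round(seconds))
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def summarize(xml_text):
    """Pull the checkpoint fields out of a task-state snippet (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    done = float(root.findtext("fraction_done"))
    return {
        "result_name": root.findtext("result_name"),
        "checkpoint_cpu": hms(float(root.findtext("checkpoint_cpu_time"))),
        "checkpoint_elapsed": hms(float(root.findtext("checkpoint_elapsed_time"))),
        "fraction_done": done,
        # Nearest simple fraction; reading it as "positions done" is an assumption.
        "approx_fraction": Fraction(done).limit_denominator(100),
    }


for key, value in summarize(TASK_STATE).items():
    print(f"{key}: {value}")
```

The checkpoint elapsed time works out to about 11:22:38, which lines up with the Elapsed time of around 11hr:23min reported after the reboot, so the rollback looks consistent with the task resuming from its last checkpoint.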


I will follow your advice and abort the unit.
[Feb 18, 2010 2:21:48 PM]
tombell12
Advanced Cruncher
Australia
Joined: Oct 8, 2009
Post Count: 87
Status: Offline
Re: WU stuck on 88.888%?

The stderr.txt left behind in slot 0 appears to contain debug information. Would this be of any importance?
[Feb 18, 2010 2:41:37 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: WU stuck on 88.888%?

Always of interest on a fresh project. Please mail the WU name, particulars, and an issue description to support@worldcommunitygrid.org f.a.o. uplinger

thanks
[Feb 18, 2010 2:43:56 PM]