Thread Status: Active
Total posts in this thread: 5
This topic has been viewed 772 times and has 4 replies
Davethebrewer
Advanced Cruncher
United States
Joined: Feb 17, 2006
Post Count: 76
Help Conquer Cancer units never finishing

About every two weeks I get a Help Conquer Cancer work unit that never seems to finish. I caught the most recent one this evening and actually had time to post about it. This one had already run for 38 hours and said it was only 11% done with 56 more hours to go and increasing!

When I notice these units (typically not until they have run for over 24 hours), they show an ever-increasing "Time to Completion" value even though the elapsed time is also increasing. Once or twice I have let them run for about another half day, but the time to completion was still increasing. I then abort the unit and it gets reported as an "Error" with no run time associated with it. The other system that gets the same work unit typically does not have an obvious problem, but I have not checked every one, and I have not gone back to see what happened on the system that gets the replacement unit.

I am on the verge of excluding HCC from this PC, but thought I would ask if I can be of any help in tracking this down? None of my other systems exhibit this problem.

Most recent failure: X0000045591340200502091019
Windows XP Pro SP2, Boinc 5.10.30, Pentium 4 3 GHz, 3 GB RAM
Computer ID: 53197

Thanks
Dave
----------------------------------------
[Apr 27, 2008 3:02:35 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Re: Help Conquer Cancer units never finishing

Hi Davethebrewer,

There are a few tips in the Start Here forum that deal with an HPF2 looping/stuck problem, but practice has shown that they work for other projects as well (not always). So, would you please read: http://www.worldcommunitygrid.org/forums/wcg/viewthread?thread=16378

Killing a job loses all useful information, so that information needs to be captured before taking the above action: go to the slots\0\ or slots\1\ task progress directory and copy and paste the contents of the stderr.txt file. Very often the lines between the beginning and the end are an endless repetition, so those can be left out.
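For anyone who would rather script the copy step, here is a minimal Python sketch of the idea: keep the beginning and end of stderr.txt and mark the repetitive middle as snipped. The function names, the head/tail line counts, and the data directory argument are all assumptions made for illustration and are not part of BOINC or the WCG client. Only the slots\N\stderr.txt layout comes from the description above.

```python
from pathlib import Path


def trim_stderr(text: str, head: int = 10, tail: int = 10) -> str:
    """Keep the first `head` and last `tail` lines and mark what was cut.

    The counts are arbitrary defaults (an assumption), not a WCG convention.
    """
    lines = text.splitlines()
    if len(lines) <= head + tail:
        # Short enough to post as-is.
        return "\n".join(lines)
    omitted = len(lines) - head - tail
    marker = f"... snip ({omitted} lines omitted) ..."
    # Slice from an explicit index so tail=0 keeps no trailing lines.
    return "\n".join(lines[:head] + [marker] + lines[len(lines) - tail:])


def read_slot_stderr(boinc_data_dir: str, slot: int) -> str:
    """Read <boinc_data_dir>/slots/<slot>/stderr.txt and trim it for posting.

    `boinc_data_dir` is hypothetical: pass wherever BOINC keeps its data
    on your machine.
    """
    path = Path(boinc_data_dir) / "slots" / str(slot) / "stderr.txt"
    return trim_stderr(path.read_text(errors="replace"))
```

Run it before suspending or aborting the task, so the file is still there to copy.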

ttyl
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
[Apr 27, 2008 7:41:03 AM]
Davethebrewer
Re: Help Conquer Cancer units never finishing

Thanks Sekerob

I will watch for the next time this happens, get the stderr output, and then try the Suspend trick. I believe I had the same problem with HPF2 on this system in the past, since I see that my profile for it is set up to exclude HPF2. Maybe I have a good test bed for this problem!

Dave
----------------------------------------
[Apr 28, 2008 10:00:41 PM]
Davethebrewer
Re: Help Conquer Cancer units never finishing

Well, I got another one of these units this week. 55 hours and counting.

Here is the stderr.txt file for this unit:
---
World Community Grid HCC (projects/www.worldcommunitygrid.org/wcg_hcc1_img_5.20_windows_intelx86) version Failed to get VersionInfo size: 1812

INFO: No state to restore. Start from the beginning.
ERROR: Restoring checkpoint failed. Unable to restore state!
In ExtractGlcmFeatures: End of 0 iteration of outer loop.
In ExtractGlcmFeatures: End of 1 iteration of outer loop.
In ExtractGlcmFeatures: End of 2 iteration of outer loop.
In ExtractGlcmFeatures: End of 3 iteration of outer loop.
In ExtractGlcmFeatures: End of 4 iteration of outer loop.
In ExtractGlcmFeatures: End of 5 iteration of outer loop.
In ExtractGlcmFeatures: End of 6 iteration of outer loop.
In ExtractGlcmFeatures: End of 7 iteration of outer loop.
In ExtractGlcmFeatures: End of 8 iteration of outer loop.
In ExtractGlcmFeatures: End of 9 iteration of outer loop.
In ExtractGlcmFeatures: End of 10 iteration of outer loop.
In ExtractGlcmFeatures: End of 11 iteration of outer loop.
In ExtractGlcmFeatures: End of 12 iteration of outer loop.
In ExtractGlcmFeatures: End of 13 iteration of outer loop.
In ExtractGlcmFeatures: End of 14 iteration of outer loop.
----
I will now suspend and restart it and check back this afternoon to see if that gets it to ever finish.

This is the task name:

6/6/2008 8:10:13 AM|World Community Grid|Restarting task X0000046841006200501311011_1 using hcc1 version 520


Thanks,
Dave
----------------------------------------
[Jun 6, 2008 1:11:32 PM]
Davethebrewer
Re: Help Conquer Cancer units never finishing

I looked again the next morning. After I suspended and restarted it, the unit did eventually finish, about another 13 hours later. A total of 67 hours. I think I may just give up on these if it happens again.

Thanks,
Dave
----------------------------------------
[Jun 7, 2008 2:22:04 PM]