World Community Grid Forums
Davethebrewer
Advanced Cruncher, United States. Joined: Feb 17, 2006. Post Count: 76
About every two weeks I get a Help Conquer Cancer work unit that never seems to finish. I caught the most recent one this evening and actually had time to post about it. This one had already run for 38 hours and said it was only 11% done with 56 more hours to go and increasing!
When I notice these units (typically not until they have run for over 24 hours), they show an ever-increasing "Time to Completion" value even though the elapsed time is also increasing. Once or twice I have let them run for about another half day, but the time to completion kept increasing. I then abort the unit, and it gets reported as an "Error" with no run time associated with it.
The other system that gets the same work unit typically does not have an obvious problem, but I have not checked every case, and I have not gone back to check what happened on the system that got the replacement unit.
I am on the verge of excluding HCC from this PC, but thought I would first ask whether I can be of any help in tracking this down. None of my other systems exhibit this problem.
Most recent failure: X0000045591340200502091019
Windows XP Pro SP2, BOINC 5.10.30, Pentium 4 3 GHz, 3 GB RAM
Computer ID: 53197
Thanks,
Dave
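Why a stuck unit shows a growing "Time to Completion" can be illustrated with a simplified model. The sketch below assumes the remaining time is extrapolated linearly from elapsed time and fraction done; this is an assumption for illustration only, not the BOINC client's actual estimator (which also weighs in the task's initial estimate), so it will not reproduce the client's exact numbers. The function name `remaining_hours` is hypothetical.

```python
def remaining_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Naive linear extrapolation of remaining run time.

    Assumption: a simplified model for illustration, not BOINC's own estimator.
    """
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    projected_total = elapsed_hours / fraction_done
    return projected_total - elapsed_hours


if __name__ == "__main__":
    # With progress stuck at 11%, each extra hour of runtime adds roughly
    # (1 - 0.11) / 0.11, i.e. about 8 hours, to the naive remaining-time estimate.
    for elapsed in (38, 44, 50):
        print(elapsed, round(remaining_hours(elapsed, 0.11), 1))
```

The point is only qualitative: as long as the fraction done does not advance, any estimate that depends on elapsed time will keep growing instead of counting down.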
Sekerob
Ace Cruncher. Joined: Jul 24, 2005. Post Count: 20043
Hi Davethebrewer,
There are a few tips in the Start Here forum that deal with an HPF2 looping/stuck problem, and practice has shown that they work for other projects as well (though not always). So, would you please read: http://www.worldcommunitygrid.org/forums/wcg/viewthread?thread=16378
Killing a job loses all useful information, so before taking the action described there, please capture it: go to the task's working directory (slots\0\ or slots\1\) and copy/paste the contents of the stderr.txt file. Very often the lines between the beginning and the end are just endless repetition, so that middle part can be left out.
ttyl
WCG
Please help to make the Forums an enjoyable experience for All!
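The trimming step Sekerob describes (keep the beginning and the end of stderr.txt, drop the repetitive middle) can be sketched as a small script. This is a minimal sketch under stated assumptions: the script name `trim_stderr.py`, the helper `head_and_tail`, and the default of 10 lines at each end are all hypothetical choices, and you pass it the path to the stderr.txt in the task's slots directory yourself. Head/tail trimming is used rather than de-duplication because the repeated lines differ slightly (for example, the iteration number changes).

```python
import sys
from pathlib import Path


def head_and_tail(lines, head=10, tail=10, marker="... snip ..."):
    """Return the first `head` and last `tail` lines with `marker` in place of the middle.

    If the input is short enough, all lines are returned unchanged.
    """
    lines = list(lines)
    if head < 0 or tail < 0:
        raise ValueError("head and tail must be non-negative")
    if len(lines) <= head + tail:
        return lines
    # Slice from an explicit index so that tail == 0 yields no trailing lines
    return lines[:head] + [marker] + lines[len(lines) - tail:]


def main(argv):
    # Hypothetical usage: python trim_stderr.py <path to slots\0\stderr.txt>
    if len(argv) < 2:
        print("usage: trim_stderr.py PATH_TO_STDERR_TXT")
        return
    text = Path(argv[1]).read_text(errors="replace")
    for line in head_and_tail(text.splitlines()):
        print(line)


if __name__ == "__main__":
    main(sys.argv)
```

The output can then be pasted into a forum post. Copy the file before aborting the task, since the slot directory is cleaned up afterwards.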
Davethebrewer
Sekerob wrote: "Hi Davethebrewer, There is a few tips in the start here forum actually dealing with a HPF2 looping/stuck problem, but practice has shown that it works for other projects as well (not always). Thus, would you please read: http://www.worldcommunitygrid.org/forums/wcg/viewthread?thread=16378"
Thanks, Sekerob. I will watch for the next time this happens, capture the stderr output, and then try the Suspend trick. I believe I also had the same problem with HPF2 on this system in the past, as I see my profile for that system is set up to exclude HPF2. Maybe I have a good test bed for this problem!
Dave
Davethebrewer
Well, I got another one of these units this week. 55 hours and counting.
Here is the stderr.txt file for this unit:
---
World Community Grid HCC (projects/www.worldcommunitygrid.org/wcg_hcc1_img_5.20_windows_intelx86) version
Failed to get VersionInfo size: 1812
INFO: No state to restore. Start from the beginning.
ERROR: Restoring checkpoint failed. Unable to restore state!
In ExtractGlcmFeatures: End of 0 iteration of outer loop.
In ExtractGlcmFeatures: End of 1 iteration of outer loop.
In ExtractGlcmFeatures: End of 2 iteration of outer loop.
In ExtractGlcmFeatures: End of 3 iteration of outer loop.
In ExtractGlcmFeatures: End of 4 iteration of outer loop.
In ExtractGlcmFeatures: End of 5 iteration of outer loop.
In ExtractGlcmFeatures: End of 6 iteration of outer loop.
In ExtractGlcmFeatures: End of 7 iteration of outer loop.
In ExtractGlcmFeatures: End of 8 iteration of outer loop.
In ExtractGlcmFeatures: End of 9 iteration of outer loop.
In ExtractGlcmFeatures: End of 10 iteration of outer loop.
In ExtractGlcmFeatures: End of 11 iteration of outer loop.
In ExtractGlcmFeatures: End of 12 iteration of outer loop.
In ExtractGlcmFeatures: End of 13 iteration of outer loop.
In ExtractGlcmFeatures: End of 14 iteration of outer loop.
----
I will now suspend and restart it, and check back this afternoon to see whether that gets it to finish. This is the task name:
6/6/2008 8:10:13 AM|World Community Grid|Restarting task X0000046841006200501311011_1 using hcc1 version 520
Thanks,
Dave
Davethebrewer
Quoting my earlier post:
"Well, I got another one of these units this week. 55 hours and counting. Here is the stderr.txt file for this unit:
---
World Community Grid HCC (projects/www.worldcommunitygrid.org/wcg_hcc1_img_5.20_windows_intelx86) version
Failed to get VersionInfo size: 1812
INFO: No state to restore. Start from the beginning.
ERROR: Restoring checkpoint failed. Unable to restore state!
In ExtractGlcmFeatures: End of 0 iteration of outer loop.
... snip .....
This is the task name: 6/6/2008 8:10:13 AM|World Community Grid|Restarting task X0000046841006200501311011_1 using hcc1 version 520 ..."
I looked again the next morning. After I suspended and restarted it, the unit did eventually finish, taking about another 13 hours, for a total of 67 hours. I think I may just give up on these if it happens again.
Thanks,
Dave