World Community Grid Forums
Thread Status: Active Total posts in this thread: 5
Rickjb
Veteran Cruncher Australia Joined: Sep 17, 2006 Post Count: 666 Status: Offline Project Badges:
The WU named faah3349_ZINC01671182_xMut_md01470_01_1 seems to be stuck at an indicated 21.4% complete. I have suspended it after 38h 59m of CPU time.

I looked at its entry on my Results Status page to see other crunchers' results: 1 Pending Validation after 5h 46m, 1 No Reply, 1 Error, 1 In Progress (me?).

Before I abort the WU, I expect I can copy its status file(s) and email them to someone for a post-mortem. Instructions, anyone?

Meanwhile, with this WU suspended, BOINC 5.10.42 (yes, I have yet to install 5.10.45) does not seem to be downloading new work. This is probably because BOINC estimates its Time to Completion as 34h 40m. I have about 9 hours of work left.

[Edit 1 times, last edit by Rickjb at Mar 30, 2008 3:11:09 AM]
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
Hi Rick,

What OS is this on? If you suspend WCG in the Projects tab, not just the task, and it does not resume even after a BOINC exit and restart, it's best to abort.

Before you do, visit the slots\0 or slots\1 subdirectory of the BOINC program location and open the stderr.txt file to see if there are any suspicious messages. Each slot stores the progress files of one job, so you need to see which one relates to your stuck job.

Looping can happen, but it should be visible in the graphics screen if you have a regular user install: the dotted line / best-energy graph creeps forward and retreats, and each time the percentage drops back a few tenths.
WCG
Please help to make the Forums an enjoyable experience for All! |
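The stderr.txt check described above can be automated roughly as follows. This is only a sketch: the function name `scan_slots` and the marker list are made up for illustration (the marker strings are taken from the messages quoted later in this thread), and the BOINC location you pass in depends on your install.

```python
from pathlib import Path

# Illustrative marker strings (an assumption, not an official BOINC list).
SUSPICIOUS = ("Unhandled Exception", "Access Violation", "Runtime Debugger")

def scan_slots(boinc_dir):
    """Return {slot_name: [suspicious lines]} for each slots/<n>/stderr.txt."""
    findings = {}
    slots = Path(boinc_dir) / "slots"
    if not slots.is_dir():
        return findings
    for slot in sorted(slots.iterdir()):
        stderr_file = slot / "stderr.txt"
        if not stderr_file.is_file():
            continue
        # errors="replace" keeps the scan going if the log has odd bytes.
        text = stderr_file.read_text(errors="replace")
        hits = [
            line.strip()
            for line in text.splitlines()
            if any(marker.lower() in line.lower() for marker in SUSPICIOUS)
        ]
        if hits:
            findings[slot.name] = hits
    return findings
```

Calling `scan_slots` with the BOINC program location returns only the slots whose log contains one of the markers, which helps narrow down which slot belongs to the stuck job.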
Rickjb
Veteran Cruncher Australia Joined: Sep 17, 2006 Post Count: 666 Status: Offline Project Badges:
What OS is this on? Win2k SP4
I looked at the stderr.txt files in slots\0 and slots\1. The latter was relevant and concludes with the messages:

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x77FCDF2A read attempt to address 0x13623A50
Engaging BOINC Windows Runtime Debugger...

I suspended the WCG project in BOINC, then exited and restarted BOINC, all with the task still suspended. Now resuming it ... It went into the "waiting to run" state, so I suspended another running task. Away she goes, but CPU time has been set to 1h 28m, To Completion = 5h 12m, and Progress was back to 20.x%, now 22.1% after about 5 min. Data have been appended to stderr.txt; it looks OK.

Why was it looping after the crash? (I didn't check CPU usage in task mangler to see whether it was hogging the CPU, though.)

BTW, the stderr.txt files in both slots\0 and slots\1 start with:

(projects/www.worldcommunitygrid.org/wcg_faah_autodock_5.42_windows_intelx86) version Failed to get VersionInfo size: 2

Is this apparent error important? BOINC still won't fetch work, and there's only about 6 hours left in my queue.

Re. the graphics display: sorry, but there are no labels on the axes of the graphs. As far as I am concerned, graphs without properly labeled axes are meaningless. (That's an ex-maths teacher speaking.) I never look at them.

We seem to have got this WU unstuck. Thanks. If it sticks again, I'll update this post.

- Richard.
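For anyone wanting to pull the fields out of an exception record like the one quoted above, here is a minimal sketch. The regular expression is fitted to that single quoted line only; the "write" alternative and the helper name `parse_exception` are assumptions, not something documented by BOINC. The code 0xc0000005 is the Windows status code for an access violation.

```python
import re

# Pattern fitted to the one line quoted in this thread (an assumption,
# not an official description of the BOINC debugger output format).
PATTERN = re.compile(
    r"Reason:\s*(?P<reason>.+?)\s*\((?P<code>0x[0-9a-fA-F]+)\)\s*"
    r"at address\s*(?P<at>0x[0-9a-fA-F]+)\s*"
    r"(?P<op>read|write) attempt to address\s*(?P<target>0x[0-9a-fA-F]+)"
)

def parse_exception(text):
    """Return the fields of the first exception record in text, or None."""
    m = PATTERN.search(text)
    if m is None:
        return None
    return {
        "reason": m.group("reason"),
        "code": int(m.group("code"), 16),
        # Address of the instruction that faulted.
        "instruction_address": int(m.group("at"), 16),
        "operation": m.group("op"),
        # Memory address the instruction tried to read or write.
        "target_address": int(m.group("target"), 16),
    }
```

The split between the instruction address and the target address is the useful part: the first says where the program was executing, the second says which memory location it was not allowed to touch.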
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
You might want to revisit your permission inheritance: if you are allowed to write to the BOINC program directory and its files, that right should also be passed on to any new files in its subdirectories. The "Access Violation" suggests that some security grant is lacking and writes are therefore failing, potentially causing the loop.

Don't worry about your queue. Once that reset job finishes and is reported, new work will come, or even before, as BOINC is now relearning that things are moving faster. The reset to 1h 28m is where the last good checkpoint was stored. Keep a watchful eye and let us know if the job validates.

PS: Visit the Start Here forum for a post regarding the VersionInfo size 2 and 1812 messages. They are standard and benign. The samples show more lines that can be happily ignored.

PPS: Reminds me: I wonder if these Linux/Mac stuck jobs have anything to do with permissions getting lost?
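One rough way to test the write-permission suggestion above is to try creating a throwaway file in the BOINC directory and every subdirectory. This is a sketch under assumptions: `unwritable_dirs` is a made-up helper, it only reflects the account running the script (which may not be the account BOINC runs under), and it does not inspect the inheritance settings themselves.

```python
import os
import tempfile

def unwritable_dirs(root):
    """Return directories under root (inclusive) where creating a file fails."""
    failures = []
    # Note: os.walk silently skips directories it cannot list.
    for dirpath, _dirnames, _filenames in os.walk(root):
        try:
            # The temporary file is removed as soon as the block exits.
            with tempfile.TemporaryFile(dir=dirpath):
                pass
        except OSError:
            failures.append(dirpath)
    return failures
```

An empty list means a file could be created everywhere the walk reached; any path in the list is a place where writes from that account fail.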
Rickjb
Veteran Cruncher Australia Joined: Sep 17, 2006 Post Count: 666 Status: Offline Project Badges:
Thanks Sekerob. I checked that I can write to the BOINC directory & its subdirectories.

The error message looks to me like the program tried to access memory it did not own. Lots of possible causes, e.g. exceeding the bounds of an array. Perhaps an unusual data combination led the program somewhere it had never gone before ...? I've written enough floating-point number-crunching code to know that they're very ornery critters. Or a hardware error? If so, a first for my machine.

Most likely, it's something strange about this WU. My run of it completed in 6h 51m, but was deemed Inconclusive. Add that to a No Reply and an Error. A 5th instance of the WU has been sent out to yet another poor sucker (one's born every minute). I'll follow its progress. Queue now OK.

---------
Later edit: WU instance #5 validated against WU instance #4, ruling me Invalid.

[Edit 2 times, last edit by Rickjb at Apr 1, 2008 6:39:08 AM]