Thread Status: Active
Total posts in this thread: 5
This topic has been viewed 1215 times and has 4 replies
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
faah WU stuck at 21.4% complete, 38h 59m CPU time

The WU named faah3349_ZINC01671182_xMut_md01470_01_1 seems to be stuck at an indicated 21.4% complete. I have suspended it after 38h 59m CPU time.

I looked at its entry from my Results Status page to see other crunchers' results: 1 Pending Validation after 5h 46m, 1 No Reply, 1 Error, 1 In Progress (me?).

Before I abort the WU, I expect I can copy its status file(s) and email it/them to someone for post-mortem. Instructions, anyone?

Meanwhile, with this WU suspended, BOINC 5.10.42 (yes I have yet to install 5.10.45) does not seem to be downloading new work. This is probably because BOINC estimates its Time to Completion as 34h 40m. I have about 9hrs' work left.
----------------------------------------
[Edit 1 times, last edit by Rickjb at Mar 30, 2008 3:11:09 AM]
[Mar 30, 2008 3:07:01 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: faah WU stuck at 21.4% complete, 38h 59m CPU time

Hi Rick,

What OS is this on?

If you suspend WCG in the Project tab (not just the task) and it does not resume even after a BOINC exit and restart, it's best to abort. Before you do, visit the slots\0 or slots\1 subdirectory of the BOINC program location and open the stderr.txt file to see if there are any suspicious messages. Each slot stores the progress files of one job, so you need to see which one relates to your stuck job.
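
For anyone who would rather not open each file by hand, the check above can be sketched in Python. This is only an illustration, not a WCG or BOINC tool: the function name `scan_slots` is made up, the marker strings are an assumed starting list, and the only layout it relies on is the slots\N\stderr.txt arrangement described above.

```python
from pathlib import Path

# Assumed marker strings (hypothetical starting list); extend as needed.
SUSPICIOUS = ("Unhandled Exception", "Access Violation")


def scan_slots(boinc_dir, markers=SUSPICIOUS):
    """Return {slot_name: [matching lines]} for each slots/<n>/stderr.txt."""
    hits = {}
    for stderr_file in sorted(Path(boinc_dir).glob("slots/*/stderr.txt")):
        # errors="replace" keeps the scan going if the log has odd bytes in it.
        lines = stderr_file.read_text(errors="replace").splitlines()
        matched = [ln for ln in lines if any(m in ln for m in markers)]
        if matched:
            hits[stderr_file.parent.name] = matched
    return hits
```

Call it with the BOINC program location on your own machine, e.g. `scan_slots(r"C:\Program Files\BOINC")` (that path is a guess; use wherever BOINC is installed). Slots with no matching lines are left out of the result.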

Looping can happen, but it should be visible in the Graphics screen if you have a regular user install: the dotted line / best energy graph creeps forward and retreats, with the percentage dropping back a few tenths each time.
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
[Mar 30, 2008 8:52:32 AM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: faah WU stuck at 21.4% complete, 38h 59m CPU time

What OS is this on? Win2k SP4

I looked at the stderr.txt files in slots\0 and slots\1. The latter was relevant and concludes with the messages:
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x77FCDF2A read attempt to address 0x13623A50
Engaging BOINC Windows Runtime Debugger...
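
A small sketch for pulling the fields out of a "Reason:" line like the one above, in case it helps with comparing crashes across machines. The regular expression is written against this one sample only, so treat the pattern and the name `parse_exception` as assumptions rather than a description of every message the debugger can print.

```python
import re

# Pattern inferred from the single sample line above (an assumption).
PATTERN = re.compile(
    r"Reason: (?P<reason>.+?) \((?P<code>0x[0-9a-fA-F]+)\)"
    r" at address (?P<at>0x[0-9a-fA-F]+)"
    r"(?: (?P<op>read|write) attempt to address (?P<target>0x[0-9a-fA-F]+))?"
)


def parse_exception(line):
    """Return the parsed fields as a dict, or None if the line does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    d = m.groupdict()
    return {
        "reason": d["reason"],
        "code": int(d["code"], 16),
        "at": int(d["at"], 16),
        "op": d["op"],  # "read", "write", or None when absent
        "target": int(d["target"], 16) if d["target"] else None,
    }
```

On Windows, 0xc0000005 is the standard code for an access violation, and the "read attempt to address" part names the memory address the program tried to read when it crashed.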

I suspended the wcg project in BOINC, exited and restarted BOINC, all with the task still suspended. Now resuming it ... It went into "waiting to run" state, so I suspended another runner. Away she goes, but CPU time has been set at 1h 28, To Completion = 5h 12m, Progress was back to 20.x%, now 22.1 after about 5 min. Data have been appended to stderr.txt - looks OK.

Why was it looping after the crash? (I didn't check CPU usage on task mangler to see whether it was hogging the CPU tho).

BTW, the stderr.txt files in both slots\0 and \1 start with:
(projects/www.worldcommunitygrid.org/wcg_faah_autodock_5.42_windows_intelx86) version Failed to get VersionInfo size: 2

Is this apparent error important?

BOINC still won't fetch work, and there's only about 6hrs' left in my queue.

Re. the graphics display - sorry, but there are no labels on the axes of the graphs. As far as I am concerned, graphs without properly labeled axes are meaningless. (That's an ex- maths teacher speaking). I never look at them.

We seem to have got this WU unstuck. Thanks. If it sticks again, I'll update this post. - Richard.
[Mar 30, 2008 12:34:19 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: faah WU stuck at 21.4% complete, 38h 59m CPU time

You might want to revisit your permission inheritance: if you are allowed to write to the BOINC program directory and its files, check that the right is also passed on to any new files in its subdirectories. The "Access Violation" suggests that some security grant is lacking and writes are failing, potentially causing the loop.

Don't worry about your queue. Once that reset job finishes and is reported, new work will come, or even before, as BOINC is now relearning that things are moving faster. The reset to 1:28 hours is where the last good checkpoint was stored.
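
To show why a restart drops the task back to an earlier point, here is a toy checkpoint/resume loop in Python. It is a generic sketch, not how BOINC or the faah application actually stores checkpoints; the function name, the JSON file, and the `crash_at` parameter are all invented for the illustration.

```python
import json
import os


def run(total_steps, checkpoint_path, checkpoint_every, crash_at=None):
    """Count to total_steps, saving progress every checkpoint_every steps."""
    step = 0
    # Resume from the last saved checkpoint, if there is one.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            step = json.load(f)["step"]
    while step < total_steps:
        step += 1
        if crash_at is not None and step == crash_at:
            raise RuntimeError("simulated crash")
        if step % checkpoint_every == 0:
            with open(checkpoint_path, "w") as f:
                json.dump({"step": step}, f)
    return step
```

If a run that checkpoints every 10 steps dies at step 37, the next run starts again from step 30, and everything done after that checkpoint is redone.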

Keep a watchful eye and let us know if the job validates.

PS,

See the post in the Start Here forum regarding the VersionInfo size 2 and 1812 messages. They are standard and benign. The samples there show more lines that can be happily ignored.

PPS,

Reminds me: wonder if these Linux/Mac stuck jobs have anything to do with permissions getting lost?
[Mar 30, 2008 12:45:55 PM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: faah WU stuck at 21.4% complete, 38h 59m CPU time

Thanks Sekerob. I checked that I can write to BOINC directory & its subdirectories.

The error message looks to me like the program tried to access memory it did not own. Lots of possible causes, eg exceeding the bounds of an array. Perhaps an unusual data combination led the program to somewhere it never went before ...? I've written enough floating-point number-crunching code to know that they're very ornery critters. Or a hardware error? If so, a first for my machine.

Most likely, it's something strange about this WU. My run of it completed in 6h 51m, but was deemed Inconclusive. Add that to a No Reply and an Error. A 5th instance of the WU has been sent out to yet another poor sucker (one's born every minute). I'll follow its progress.

Queue now OK.
---------
Later edit: WU instance #5 validated against WU instance #4, ruling me Invalid.
----------------------------------------
[Edit 2 times, last edit by Rickjb at Apr 1, 2008 6:39:08 AM]
[Mar 31, 2008 2:33:24 PM]