Thread Status: Active
Total posts in this thread: 10
This topic has been viewed 3419 times and has 9 replies
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
A work unit stopped at 23,20 % since 12 hours

The WU ne627_00032-2 has been stuck at 23,2 % since last night.

Should I wait, or should I abort it?

Has this problem occurred before?
[Mar 6, 2010 8:44:51 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: A work unit stopped at 23,20 % since 12 hours

Legrandpiou,

If you exit the client and restart it, the result will resume from the last good checkpoint. The computing hours the task spent stuck in an infinite loop will be lost, but the result will then finish in normal time.

edit: grammar
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
----------------------------------------
[Edit 1 times, last edit by Sekerob at Mar 6, 2010 4:11:08 PM]
[Mar 6, 2010 8:48:08 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: A work unit stopped at 23,20 % since 12 hours

Thanks, it restarted all right.
[Mar 6, 2010 2:53:52 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: A work unit stopped at 23,20 % since 12 hours

Hello,

I have (or had) the same problem with the task nh69_00034_14.

I saw that this task had been running for 16 hours, so I looked at the other crunchers on the result page, but they had "normal" runtimes.
In the meantime my BOINC client ran a CPU benchmark. This benchmark restarted the task from, I guess, the last good checkpoint.
At first I thought I had misread the runtime, but a closer look at my log files showed that the 16 hours were correct.

So I lost 15 hours of CPU runtime due to this bug.

Are the techs already working on this issue? I mean, if I had not looked at my task list just now, I would not even have noticed it.
So if the CPU benchmark recovers from this error, it may also happen to other users without their knowing it.

What I want to say is that losing my 15 hours is not a big deal for me or for WCG, but if this happens to more crunchers we may lose xxxx times more....
[Apr 19, 2010 11:56:08 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: A work unit stopped at 23,20 % since 12 hours

We will have a new version out soon [probably version 6.17] that handles the 401 error, reducing its occurrence by about 90%. Whether that fixes the loop, I don't know. Nobody has yet been able to diagnose why it does that... I now suspect it's actually a 401 in disguise, for I had a looper yesterday that, after a restart, crashed out at the next checkpoint with... 401, with 1:26 hours of CPU time acknowledged, losing only 2:40... because the BOINCTasks tool highlights low CPU efficiency jobs according to configurable preferences.
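
For anyone who wants to watch for loopers without BOINCTasks, the low CPU efficiency check mentioned above can be roughly sketched as follows. This is only an illustration, not how BOINCTasks actually does it: the `Task` class, the function names, the thresholds and the sample task names are all hypothetical, and you would have to supply the CPU time and elapsed time yourself from your own task list.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical record; fill it in by hand from your own task list.
    name: str
    cpu_seconds: float      # CPU time credited to the task so far
    elapsed_seconds: float  # wall-clock time since the task started


def cpu_efficiency(task: Task) -> float:
    """Return CPU time divided by elapsed time (0.0 if nothing has elapsed)."""
    if task.elapsed_seconds <= 0:
        return 0.0
    return task.cpu_seconds / task.elapsed_seconds


def suspicious_tasks(tasks, min_efficiency=0.5, min_elapsed_seconds=3600):
    """Return names of tasks that have run long enough to judge and whose
    CPU efficiency is below the (assumed, configurable) threshold."""
    return [
        t.name
        for t in tasks
        if t.elapsed_seconds >= min_elapsed_seconds
        and cpu_efficiency(t) < min_efficiency
    ]


if __name__ == "__main__":
    # Made-up example values, not taken from any real work unit.
    tasks = [
        Task("example_healthy", cpu_seconds=4 * 3600, elapsed_seconds=4.1 * 3600),
        Task("example_looper", cpu_seconds=600, elapsed_seconds=4 * 3600),
    ]
    print(suspicious_tasks(tasks))
```

A task flagged this way is only a candidate; restarting the client so the task resumes from its last checkpoint is still the fix described earlier in this thread.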
[Apr 20, 2010 7:05:04 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: A work unit stopped at 23,20 % since 12 hours

Thank you for the information.

Can I help by providing more detailed information?
Are there any logs I should check for information that would be useful to you?
[Apr 20, 2010 8:50:09 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: A work unit stopped at 23,20 % since 12 hours

I think for now we just wait for the new version launch and take it from there. Freezing such a job and taking multiple memory dump snapshots could maybe reveal something... but the cause has proven elusive for 3+ years running.
[Apr 20, 2010 9:02:50 AM]
Goku
Advanced Cruncher
France - Caen (Calvados / Normandie)
Joined: Nov 30, 2004
Post Count: 84
Status: Offline
Re: A work unit stopped at 23,20 % since 12 hours

Same problem for me.
Task ni414_ 00058_ 15-- was stopped after more than 36 hours.
730,9 pts claimed but 0 received.

ni414_ 00058_ 15-- Error 03/05/10 01:09:17 04/05/10 16:58:49 36,30 730,9 / 0,0


Result name: ni414_ 00058_ 15--
<core_client_version>6.10.37</core_client_version>
<![CDATA[
<message>
Maximum elapsed time exceeded
</message>
<stderr_txt>


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x75A022A1

----------------------------------------
[May 6, 2010 6:48:38 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: A work unit stopped at 23,20 % since 12 hours

That is a version 6.03 work unit, judging by the distribution date. As of sometime yesterday, work units are distributed with version 6.17 of HPF2 and show a dramatically lower error occurrence.

As for the points, you don't get credit for time until quorum is complete... is it? (It has to be a known standard error)
[May 6, 2010 7:05:45 AM]
Goku
Advanced Cruncher
France - Caen (Calvados / Normandie)
Joined: Nov 30, 2004
Post Count: 84
Status: Offline
Re: A work unit stopped at 23,20 % since 12 hours


As for the points, you don't get credit for time until quorum is complete... is it? (It has to be a known standard error)

Yes, quorum is complete. 88,2 pts for everyone else, but none for me.
----------------------------------------
[May 6, 2010 7:54:29 AM]