Thread Status: Active
Total posts in this thread: 9
This topic has been viewed 2173 times and has 8 replies
KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1684
Failed WUs

Hello!
I was a little disappointed this morning to find the following failures:

WU: lh054_00060
CPU time: 44.23 hours
Claimed/granted Boinc credit: 566.1 / 0.0

<core_client_version>5.10.13</core_client_version>
<![CDATA[
<message>
Maximum CPU time exceeded
</message>
...
------------
WU: lg597_00103
CPU time: 9.37 hours
Claimed/granted Boinc credit: 63.1 / 0.0

<core_client_version>5.10.13</core_client_version>
<![CDATA[
<stderr_txt>
Failed to get VersionInfo size: 2
</stderr_txt>
...

What went wrong?
Regards,
----------------------------------------
[Sep 11, 2007 8:56:23 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Re: Failed WUs

Okay, step by step: please visit the often-mentioned Result Status page and copy the lines for the work units you listed above into a post. A line typically looks like:

dddt0101a0038_ ZINC04146649-0001_ 06_ 0-- Lapsed-01 Pending Validation 09/05/2007 07:49:47 09/11/2007 07:45:56 4.83 51.0 / 0.0

The 566.10 hours looks like a job that ran past its time-out. Did you never notice in the Tasks tab of BOINCmgr that no progress was being made? On the second WU, I'll hold my response until the requested lines have been pasted.
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
[Sep 11, 2007 9:42:21 AM]
KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1684
Re: Failed WUs

Hi Sekerob

Here is the requested info:
lh054_ 00060_ 13-- kermc03 Error 09/08/2007 21:22:24 09/11/2007 04:51:32 44.23 566.1 / 0.0

Because the computer runs unattended and is used only for crunching, I do not look at it very often. The error message I included in my initial post already indicated that the CPU time limit was exceeded.

The second one is:
lg597_ 00103_ 10-- kerdiwi01 Error 09/07/2007 08:53:48 09/08/2007 01:12:05 9.37 63.1 / 0.0
I reported both failures because they surprised me. Normally, I do not have many failures.

Regards
----------------------------------------
[Sep 11, 2007 1:47:55 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Re: Failed WUs

I fear you have a machine that occasionally hangs on an HPF2 job. If you isolate the machine by linking it to a specific device profile (these are easily created; the standard ones are called school, work, and home) and deselect HPF2 in that profile, you will not have to worry about that client. DDDT and FA@H are stable, but HPF2 has the strange looping problem. Closing the project and restarting usually makes it run properly. 566 hours is a pity... that's nearly 4 weeks. You might want to consider setting up BOINCview for remote monitoring.

I'll ask the techs, as it was previously understood that the CPU time-out was 2 weeks, i.e. 296 hours, and the wall-clock limit 3 weeks.
----------------------------------------
[Edit 1 times, last edit by Sekerob at Sep 11, 2007 2:14:22 PM]
[Sep 11, 2007 2:02:30 PM]
JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3716
Re: Failed WUs

Quoting Sekerob: "566 hours is a pity..."

Sekerob, it is "only" 44.23 hours. 566.1 is the claimed credit.
That does not change the problem, but the damage is less dramatic. :-)
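
For anyone else reading these Result Status lines: the trailing columns are the CPU time in hours followed by claimed / granted credit. Below is a minimal Python sketch of pulling those fields apart. It assumes the column layout of the lines quoted in this thread; the function and class names are made up for illustration and are not part of any WCG or BOINC tool.

```python
from typing import NamedTuple


class ResultTail(NamedTuple):
    cpu_hours: float
    claimed_credit: float
    granted_credit: float


def parse_result_tail(line: str) -> ResultTail:
    """Read the trailing 'CPU-hours claimed / granted' fields of a line."""
    tokens = line.split()
    # Assumed layout of the last four whitespace-separated tokens:
    #   <cpu hours> <claimed credit> "/" <granted credit>
    if len(tokens) < 4 or tokens[-2] != "/":
        raise ValueError(f"unexpected line layout: {line!r}")
    return ResultTail(float(tokens[-4]), float(tokens[-3]), float(tokens[-1]))


# Line copied from earlier in this thread.
line = ("lh054_ 00060_ 13-- kermc03 Error 09/08/2007 21:22:24 "
        "09/11/2007 04:51:32 44.23 566.1 / 0.0")
tail = parse_result_tail(line)
print(f"CPU time: {tail.cpu_hours} h, claimed credit: {tail.claimed_credit}")
```

Reading the fields from the end of the line avoids having to guess how the work unit name at the front is split.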

Cheers. Jean.
----------------------------------------
Team--> Decrypthon -->Statistics/Join -->Thread
[Sep 12, 2007 12:29:53 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Re: Failed WUs

(blushing)
----------------------------------------
[Sep 12, 2007 8:09:54 AM]
KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1684
Re: Failed WUs

Hi everybody!

Me again!
During the last two weeks I again observed some failed HPF2 WUs:
lh694_ 00118_ 2-- (2.74 hours 47.8 claimed points)
lh618_ 00092_ 7-- (37.04 hours 244.4 claimed points)
lh582_ 00045_ 4-- (23.21 hours 411.5 claimed points)
I wonder why the same WU can run successfully on some devices and fail on others.
In my particular case, the devices currently crunching are "state of the art" in terms of CPU and RAM. Is it possible that some WUs behave unpredictably depending on the CPU that computes them?
Cheers,
----------------------------------------
[Sep 29, 2007 8:51:57 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Failed WUs

Hello KerSamson,
There is a known bug in HPF2 which causes some work units to get caught in an infinite loop. We have never been able to locate the bug because the same work unit will run fine if it is run again on the same computer. This is probably a problem with an uninitialized memory location or an out-of-bounds memory access. The problem can be worked around by restarting the HPF2 work unit from the last checkpoint. I think it has happened twice on my computer. I know it has happened once.
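
A looping work unit shows up as CPU time that keeps rising while the progress value stays put. Below is a rough Python sketch of that check, for anyone who logs those two numbers by hand or from a monitoring tool. The `Sample` record, the `looks_stuck` function and the 6-hour threshold are all assumptions made for illustration; none of them comes from BOINC or from the HPF2 application, and a work unit can also sit at one progress value for legitimate reasons.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    cpu_hours: float      # CPU time consumed so far
    fraction_done: float  # reported progress, 0.0 to 1.0


def looks_stuck(samples: list[Sample], min_cpu_hours: float = 6.0) -> bool:
    """True if progress has not changed over the last `min_cpu_hours` of CPU time."""
    if len(samples) < 2:
        return False
    latest = samples[-1]
    # Walk backwards to the oldest sample reporting the same progress value.
    oldest_same = latest
    for sample in reversed(samples[:-1]):
        if sample.fraction_done != latest.fraction_done:
            break
        oldest_same = sample
    return latest.cpu_hours - oldest_same.cpu_hours >= min_cpu_hours


# Hypothetical readings: progress frozen at 42% from 2 h to 9 h of CPU time.
history = [Sample(1.0, 0.10), Sample(2.0, 0.42), Sample(5.0, 0.42), Sample(9.0, 0.42)]
print(looks_stuck(history))
```

If the check fires, the remedy is the one described above: stop and restart the work unit so that it resumes from its last checkpoint.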

Sekerob has posted in Start Here: http://www.worldcommunitygrid.org/forums/wcg/viewthread?thread=16378

Lawrence
[Sep 29, 2007 11:29:10 AM]
KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1684
Re: Failed WUs

Hello lawrencehardin,
Thank you very much for your feedback. Indeed, such failures are the hardest to analyse and fix! Anyone who has developed and debugged software (especially real-time software or code written directly in low-level languages) knows this nightmare all too well.
Considering how many WUs my systems complete each week, the number of errors is small after all. By the way, two systems have been running DDDT again for two weeks without any problem (unlike a month ago).
Have a nice week-end,
----------------------------------------
[Sep 29, 2007 12:56:29 PM]