Thread Status: Active
Total posts in this thread: 119
This topic has been viewed 13101 times and has 118 replies
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: DDDT2 Wu Failures

(Translated from French:)
Hello,
This message is for a WCG technician. The following WU has been marked "Invalid" ("Non valide"):
ts01_ c457_ pdb001_ 2-- 617 Valide
07/01/11 06:57:11 09/01/11 14:57:52 0,32 6,9 / 6,7
ts01_ c457_ pdb001_ 1-- 617 Valide
05/01/11 06:17:05 05/01/11 14:35:41 0,43 6,5 / 6,7
ts01_ c457_ pdb001_ 0-- 617 Non valide
05/01/11 06:17:02 07/01/11 06:25:02 2,78 54,9 / 3,4
This type of task (pdb) has never run for such a short time on my machine (QX9650 3GHz, Win7 32-bit); it always takes a consistent time close to 2.78. I fear there may be a double error. Perhaps, to spare the researchers a nasty surprise, it would be wise to rerun this WU on one of your machines.
Have a good day, everyone,
Christian.
[Jan 10, 2011 12:22:51 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: DDDT2 Wu Failures of

Chris,

It is rather obscure why the "pd" (c-type) task failed for you, but not to worry: as you posted, the wingman's copy and the repair copy both validated, so there is no need for the scientists to rerun it.

ts01_ c457_ pdb001_ 2-- 617 Valide
07/01/11 06:57:11 09/01/11 14:57:52 0,32 6,9 / 6,7
ts01_ c457_ pdb001_ 1-- 617 Valide
05/01/11 06:17:05 05/01/11 14:35:41 0,43 6,5 / 6,7
ts01_ c457_ pdb001_ 0-- 617 Non valide
05/01/11 06:17:02 07/01/11 06:25:02 2,78 54,9 / 3,4

If you click on the "Non Valide" link on your result status page, you get the task log. If you copy and paste it into your next post, we can see whether there is a specific error code and start researching it from that angle.

cheers

edit: complete an unfinished line :O
----------------------------------------
[Edit 1 times, last edit by Former Member at Jan 10, 2011 6:06:40 PM]
[Jan 10, 2011 6:05:35 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: DDDT2 Wu Failures of

Hello Sek,
Here are the details of the 3 results for this WU:
First, for the wingmen:
Result name: ts01_ c457_ pdb001_ 2--

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
called boinc_finish

</stderr_txt>
]]>

Result name: ts01_ c457_ pdb001_ 1--

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
called boinc_finish

</stderr_txt>
]]>

And now mine:
Result name: ts01_ c457_ pdb001_ 0--

<core_client_version>6.2.28</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
called boinc_finish

</stderr_txt>
]]>

I see no difference, apart from the BOINC client version number.
Cheers,
Chris
[Jan 11, 2011 7:19:03 AM]
armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695
Status: Offline
Re: DDDT2 Wu Failures of

Chris,

I agree there is nothing unusual in your output. If you start to see more invalid workunits, it may be worth running some hardware tests to check your memory and hard disk. If this is your only invalid, I wouldn't worry about it.

Thanks,
armstrdj
[Jan 13, 2011 2:05:59 PM]
Hypernova
Master Cruncher
Audaces Fortuna Juvat ! Vaud - Switzerland
Joined: Dec 16, 2008
Post Count: 1908
Status: Offline
Re: DDDT2 Wu Failures of

Sek, no ideas for my post just above? It was before the holidays but it is still valid. :D
[Jan 13, 2011 3:34:35 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: DDDT2 Wu Failures of

While I would never counsel ignoring errors, I haven't seen an error ratio that raises an alarm. Over the last month I have had 31 DDDT2 WU errors against 2600 valid WUs, or just over 1%. That means I lost 1.25 hours out of the 9200 hours crunched, which is about 0.014%, i.e. negligible.
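As a quick sanity check on those figures, here is a minimal Python sketch of the two ratios. The variable names are illustrative only, and it assumes the error ratio is taken against the 2600 valid results as stated. The lost time works out to roughly 0.00014 as a fraction of the total, which is about 0.014% when expressed as a percentage.

```python
# Figures quoted in the post above; the names are illustrative only.
errored_wus = 31
valid_wus = 2600
lost_hours = 1.25
crunched_hours = 9200

# Ratio of errored to valid work units, as a percentage.
error_pct = errored_wus / valid_wus * 100

# Share of crunching time lost to errors, as a fraction and as a percentage.
lost_fraction = lost_hours / crunched_hours
lost_pct = lost_fraction * 100

print(f"error ratio: {error_pct:.2f}%")
print(f"time lost:   {lost_fraction:.5f} of the total, i.e. {lost_pct:.3f}%")
```

Either way the conclusion of the post stands: the time lost is tiny compared with the error count.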
[Jan 13, 2011 4:58:55 PM]
keithhenry
Ace Cruncher
Senile old farts of the world ....uh.....uh..... nevermind
Joined: Nov 18, 2004
Post Count: 18665
Status: Offline
Re: DDDT2 Wu Failures of

Sek, no ideas for my post just above? It was before the holidays but it is still valid. :D


Just some thoughts (hopefully useful): could the "INFO: No state to restore. Start from the beginning." message simply mean that the task had been suspended and, when restarted, had no checkpoint yet to start from? Also, I noticed in Sek's response to a post that the invalid task had a total runtime considerably longer (relatively) than those of the two other users. Can that result in a WU being marked invalid? Or does that situation still validate but cut the points granted in half?
[Jan 13, 2011 7:54:15 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: DDDT2 Wu Failures of

I aborted a task after noticing it had run past 100% and was already half an hour over the regular run time for an ''sq'' on the Linux box, but only AFTER checking that the wingman's copy had self-exited with "Maximum elapsed time exceeded".

ts02_ b483_ sqb010_ 2-- - In Progress 31-1-11 12:22:04 4-2-11 12:22:04 0.00 0.0 / 0.0
ts02_ b483_ sqb010_ 0-- 617 User Aborted 29-1-11 07:31:33 31-1-11 23:32:51 2.06 37.3 / 0.0
ts02_ b483_ sqb010_ 1-- 617 Error 29-1-11 07:31:29 31-1-11 12:12:46 14.46 202.8 / 0.0 < Exceeded Max time!
ts02_ b483_ sqb010_ 3-- - Waiting to be sent — — 0.00 0.0 / 0.0

Just for awareness for anyone seeing tasks going over 100%. Check the wingman, then decide what to do.

--//--
[Jan 31, 2011 11:41:57 PM]
gb077492
Advanced Cruncher
Joined: Dec 24, 2004
Post Count: 96
Status: Offline
Re: DDDT2 Wu Failures of

Hi Sek,

Just to let you and the community know that I'm seeing a similar thing with a closely related WU. My machine is an old slow P4 HT and the task is showing 21:32 hours CPU time at only 4.33% (though I notice the last checkpoint was only at 20:47 hours). My wingmen show:

ts02_ b483_ sqb000_ 3-- 617 Pending Validation 31/01/11 08:27:16 01/02/11 04:01:20 1.34 23.9 / 0.0
ts02_ b483_ sqb000_ 2-- 617 Error 30/01/11 03:32:59 31/01/11 08:24:05 10.14 213.4 / 0.0
ts02_ b483_ sqb000_ 1-- - In Progress 29/01/11 07:30:34 08/02/11 07:30:34 0.00 0.0 / 0.0 <== Me
ts02_ b483_ sqb000_ 0-- 617 Error 29/01/11 07:30:33 30/01/11 03:22:38 11.98 216.0 / 0.0

The two wingmen that errored both show "Maximum CPU time exceeded" in the result log.

The last similar task that this machine had (ts02_ b467_ sqb010_ 0) ran for just over 3 hours.

I don't understand why my machine hasn't hit the same CPU limit the two wingmen did. I'm going to abort it. Shame about the loss of credit, though.

Mike
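To put Mike's numbers in perspective, here is a rough Python sketch that extrapolates the total CPU time from the progress so far. It assumes progress is linear, which the thread gives no reason to believe for a task that may be stuck, and `projected_total_hours` is a hypothetical helper, not part of BOINC or WCG.

```python
def projected_total_hours(cpu_hours: float, fraction_done: float) -> float:
    """Naive linear projection of total CPU time from progress so far."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    return cpu_hours / fraction_done


# Figures from the post: 21:32 hours of CPU time at 4.33% progress.
cpu_hours = 21 + 32 / 60
projected = projected_total_hours(cpu_hours, 4.33 / 100)

# The last similar task on this machine ran for "just over 3 hours".
typical_hours = 3.0

print(f"projected total: {projected:.0f} h ({projected / 24:.1f} days)")
print(f"about {projected / typical_hours:.0f}x the previous similar task")
```

Under that assumption the task would need roughly 500 hours, far beyond both the earlier 3-hour run and the 10-day window between the sent and due dates in the table, which supports the decision to abort.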
[Feb 1, 2011 11:56:02 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: DDDT2 Wu Failures of

Look in the log of the PV (Pending Validation) job. For the job I aborted, a PV copy has shown up on my result page with multiple restarts but a normal run time.

Result Name: ts02_ b483_ sqb010_ 2--
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
Calling gridPlatform.init()
INFO: No state to restore. Start from the beginning.
Calling gridPlatform.init()
Copying wcgrestart.rst
Calling gridPlatform.init()
Copying wcgrestart.rst
Calling gridPlatform.init()
Copying wcgrestart.rst
Calling gridPlatform.init()
Copying wcgrestart.rst
called boinc_finish

</stderr_txt>
]]>
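A log like this can also be scanned programmatically. The Python sketch below counts the restarts, assuming that each "Copying wcgrestart.rst" line marks one resume from a checkpoint; that reading is inferred from the log above rather than from any WCG documentation, and `count_restarts` is a hypothetical helper.

```python
# stderr text copied from the result log above.
SAMPLE_STDERR = """\
Calling gridPlatform.init()
INFO: No state to restore. Start from the beginning.
Calling gridPlatform.init()
Copying wcgrestart.rst
Calling gridPlatform.init()
Copying wcgrestart.rst
Calling gridPlatform.init()
Copying wcgrestart.rst
Calling gridPlatform.init()
Copying wcgrestart.rst
called boinc_finish
"""


def count_restarts(stderr_txt: str) -> int:
    """Count lines reporting that the restart file was copied.

    Assumption: one such line corresponds to one resume from a checkpoint.
    """
    return sum(
        1
        for line in stderr_txt.splitlines()
        if line.strip() == "Copying wcgrestart.rst"
    )


def finished(stderr_txt: str) -> bool:
    """True if the log shows the task reached boinc_finish."""
    return "called boinc_finish" in stderr_txt


print(count_restarts(SAMPLE_STDERR), finished(SAMPLE_STDERR))
```

Run against the earlier logs in this thread, which contain only the "No state to restore" line, the same function would report zero restarts.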

Updated distribution:

ts02_ b483_ sqb010_ 3-- - In Progress 1/31/11 23:46:42 2/4/11 23:46:42 0.00 0.0 / 0.0
ts02_ b483_ sqb010_ 2-- 617 Pending Validation 1/31/11 12:22:04 1/31/11 23:56:35 3.39 24.3 / 0.0
ts02_ b483_ sqb010_ 0-- 617 User Aborted 1/29/11 07:31:33 1/31/11 23:32:51 2.06 37.3 / 0.0
ts02_ b483_ sqb010_ 1-- 617 Error 1/29/11 07:31:29 1/31/11 12:12:46 14.46 202.8 / 0.0

Maybe this one managed to break out of an endless loop? If you see the % barely moving and checkpoints still appearing, it may be the same as the > 100% symptom... wild guess. I can't tell if it is the same checkpoint over and over again. We had that behaviour on HPF2. A simple restart of the client, or setting LAIM (Leave Applications In Memory) off, pausing the client, resuming the client, and setting LAIM on again, was almost guaranteed to have them finish without a hitch and validate. I haven't seen that for a long, long time, so the next time I see it, I'll restart the client and see what happens.
[Feb 1, 2011 12:14:52 PM]