World Community Grid Forums
Category: Completed Research Forum: Discovering Dengue Drugs - Together - Phase 2 Forum Thread: DDDT2 Wu Failures |
Thread Status: Active Total posts in this thread: 119
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
(Google translation :)
Hello,
This message is for a technician from WCG. The following WU is marked "Invalid":
ts01_ c457_ pdb001_ 2-- 617 Valid 07/01/11 06:57:11 09/01/11 14:57:52 0,32 6,9 / 6,7
ts01_ c457_ pdb001_ 1-- 617 Valid 05/01/11 06:17:05 05/01/11 14:35:41 0,43 6,5 / 6,7
ts01_ c457_ pdb001_ 0-- 617 Invalid 05/01/11 06:17:02 07/01/11 06:25:02 2,78 54,9 / 3,4
This type of task (pdb) has never run for such a short time on my machine (QX9650 3GHz, Win7 32-bit); it has always taken a regular time close to 2.78. I fear there may be a double error. To spare the researchers a bad surprise, it might be wise to rerun this WU on one of your machines.
Good day everyone,
Christian. |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Chris,
Rather obscure that the "pd" (c-type) fails for you, but not to worry: as you posted, the wingman and the repair copy were both already successful, so there is no need for the scientists to rerun it.
ts01_ c457_ pdb001_ 2-- 617 Valid 07/01/11 06:57:11 09/01/11 14:57:52 0,32 6,9 / 6,7
ts01_ c457_ pdb001_ 1-- 617 Valid 05/01/11 06:17:05 05/01/11 14:35:41 0,43 6,5 / 6,7
ts01_ c457_ pdb001_ 0-- 617 Invalid 05/01/11 06:17:02 07/01/11 06:25:02 2,78 54,9 / 3,4
If you click on the "Non valide" (Invalid) link on your result status page, you get the task log. If you copy and paste it into your next post, we can see whether there is a specific error code and start researching from that angle.
cheers
edit: completed an unfinished line :O
[Edit 1 times, last edit by Former Member at Jan 10, 2011 6:06:40 PM] |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hello Sek,
Here are the details of the 3 results of this WU.
First, for the wingmen:
Result Name: ts01_ c457_ pdb001_ 2--
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
called boinc_finish
</stderr_txt>
]]>
Result Name: ts01_ c457_ pdb001_ 1--
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
called boinc_finish
</stderr_txt>
]]>
And now for mine:
Result Name: ts01_ c457_ pdb001_ 0--
<core_client_version>6.2.28</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
called boinc_finish
</stderr_txt>
]]>
I see no difference, apart from the BOINC client version number.
Cheers,
Chris |
||
|
armstrdj
Former World Community Grid Tech Joined: Oct 21, 2004 Post Count: 695 Status: Offline Project Badges: |
Chris,
I agree there is nothing unusual in your output. If you start to see more invalid workunits, it may be worth running some hardware tests to check your memory and hard disk. If this is your only invalid result, I wouldn't worry about it.
Thanks,
armstrdj |
||
|
Hypernova
Master Cruncher Audaces Fortuna Juvat ! Vaud - Switzerland Joined: Dec 16, 2008 Post Count: 1908 Status: Offline Project Badges: |
Sek, no ideas for my post just above? It was before the holidays, but it is still valid.
|
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
While I would never advise ignoring errors, I haven't seen an error ratio that raises an alarm. Over the last month I have had 31 DDDT2 WUs error out, against 2600 valid WUs, or just over 1%. I therefore lost 1.25 hours against the 9200 hours crunched, which is about 0.014%, i.e. negligible.
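The arithmetic in this post can be checked with a few lines of Python. This is only a sketch using the figures quoted above; the variable names are illustrative. Note that 1.25 hours out of 9200 is roughly 0.00014 as a fraction, which is about 0.014% when expressed as a percentage.

```python
# Figures quoted in the post above; variable names are illustrative only.
errored_wus = 31
valid_wus = 2600
hours_lost = 1.25
hours_crunched = 9200

# Errors relative to valid results, as a percentage.
error_ratio_pct = 100 * errored_wus / valid_wus

# CPU time lost relative to total time crunched, as a percentage.
time_lost_pct = 100 * hours_lost / hours_crunched

print(f"error ratio: {error_ratio_pct:.2f}%")  # error ratio: 1.19%
print(f"time lost:   {time_lost_pct:.4f}%")    # time lost:   0.0136%
```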
|
||
|
keithhenry
Ace Cruncher Senile old farts of the world ....uh.....uh..... nevermind Joined: Nov 18, 2004 Post Count: 18665 Status: Offline Project Badges: |
"Sek no ideas for my post just above. It was before the holidays but it is still valid."
Just some thoughts (hopefully useful): could the "INFO: No state to restore. Start from the beginning." message simply mean that the task had been suspended and, when restarted, had no checkpoint yet to start from? Also, I noticed in Sek's response to a post that the invalid task had a total run time (relatively) much longer than those of the two other users. Can that result in a WU being marked invalid? Or does that situation still give a valid result but cut the points granted in half? |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I aborted a task after noticing it had overrun 100% and was already half an hour past the regular run time for an "sq" on the Linux box, but only AFTER checking that the wingman's copy had self-exited with "Maximum elapsed time exceeded".
ts02_ b483_ sqb010_ 2-- - In Progress 31-1-11 12:22:04 4-2-11 12:22:04 0.00 0.0 / 0.0
ts02_ b483_ sqb010_ 0-- 617 User Aborted 29-1-11 07:31:33 31-1-11 23:32:51 2.06 37.3 / 0.0
ts02_ b483_ sqb010_ 1-- 617 Error 29-1-11 07:31:29 31-1-11 12:12:46 14.46 202.8 / 0.0 < Exceeded Max time!
ts02_ b483_ sqb010_ 3-- - Waiting to be sent — — 0.00 0.0 / 0.0
Just for awareness for anyone seeing tasks going over 100%: check the wingman, then decide what to do.
--//-- |
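The "check the wingman, then decide" rule can be sketched in Python. This is a rough illustration, not a WCG or BOINC tool: the `Result` record, the `wingman_overran` helper, the `regular_hours` value of 1.5 and the `factor` threshold are all hypothetical, and the rows are typed in by hand from the status listing in this post.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """One row of the result status listing (hypothetical record type)."""
    name: str
    status: str
    cpu_hours: float


def wingman_overran(results, mine, regular_hours, factor=3.0):
    """Return True if any other copy errored after running far past the regular time.

    regular_hours and factor are assumptions chosen by the user, not WCG limits.
    """
    return any(
        r.name != mine
        and r.status == "Error"
        and r.cpu_hours > factor * regular_hours
        for r in results
    )


# Rows copied by hand from the status listing in this post.
results = [
    Result("ts02_ b483_ sqb010_ 2--", "In Progress", 0.00),
    Result("ts02_ b483_ sqb010_ 0--", "User Aborted", 2.06),
    Result("ts02_ b483_ sqb010_ 1--", "Error", 14.46),
    Result("ts02_ b483_ sqb010_ 3--", "Waiting to be sent", 0.00),
]

# regular_hours=1.5 is an assumed typical run time, for illustration only.
overran = wingman_overran(results, mine="ts02_ b483_ sqb010_ 0--", regular_hours=1.5)
print(overran)  # True: the wingman errored at 14.46 hours
```

If the check comes back true for a task that is itself past 100%, aborting is a reasonable call; otherwise it may be worth letting the task run on.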
||
|
gb077492
Advanced Cruncher Joined: Dec 24, 2004 Post Count: 96 Status: Offline |
Hi Sek,
Just to let you and the community know that I'm seeing a similar thing with a closely related WU. My machine is an old, slow P4 HT and the task is showing 21:32 hours CPU time at only 4.33% (though I notice the last checkpoint was only at 20:47 hours). My wingmen show:
ts02_ b483_ sqb000_ 3-- 617 Pending Validation 31/01/11 08:27:16 01/02/11 04:01:20 1.34 23.9 / 0.0
ts02_ b483_ sqb000_ 2-- 617 Error 30/01/11 03:32:59 31/01/11 08:24:05 10.14 213.4 / 0.0
ts02_ b483_ sqb000_ 1-- - In Progress 29/01/11 07:30:34 08/02/11 07:30:34 0.00 0.0 / 0.0 <== Me
ts02_ b483_ sqb000_ 0-- 617 Error 29/01/11 07:30:33 30/01/11 03:22:38 11.98 216.0 / 0.0
The two wingmen that errored both show "Maximum CPU time exceeded" in the result log. The last similar task that this machine had (ts02_ b467_ sqb010_ 0) ran for just over 3 hours. I don't understand why my machine hasn't hit the same CPU limit the two wingmen did. I'm going to abort it. Shame about the loss of credit, though.
Mike |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Look in the log of the PV (Pending Validation) job. One showed up for the WU I aborted, with multiple restarts but a normal run time.
Result Name: ts02_ b483_ sqb010_ 2--
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
Calling gridPlatform.init()
INFO: No state to restore. Start from the beginning.
Calling gridPlatform.init()
Copying wcgrestart.rst
Calling gridPlatform.init()
Copying wcgrestart.rst
Calling gridPlatform.init()
Copying wcgrestart.rst
Calling gridPlatform.init()
Copying wcgrestart.rst
called boinc_finish
</stderr_txt>
]]>
Updated distribution:
ts02_ b483_ sqb010_ 3-- - In Progress 1/31/11 23:46:42 2/4/11 23:46:42 0.00 0.0 / 0.0
ts02_ b483_ sqb010_ 2-- 617 Pending Validation 1/31/11 12:22:04 1/31/11 23:56:35 3.39 24.3 / 0.0
ts02_ b483_ sqb010_ 0-- 617 User Aborted 1/29/11 07:31:33 1/31/11 23:32:51 2.06 37.3 / 0.0
ts02_ b483_ sqb010_ 1-- 617 Error 1/29/11 07:31:29 1/31/11 12:12:46 14.46 202.8 / 0.0
Maybe this one managed to break out of an endless loop? If you see the % barely moving while checkpoints still appear, it may be the same as the > 100% symptom... wild guess. I can't tell if it is the same checkpoint over and over again. We had that behaviour on HPF2. A simple restart of the client, or setting LAIM off, pausing the client, resuming the client and setting LAIM back on, would almost guarantee that they finished without a hitch and validated. I have not seen that for a long, long time, so the next time I see it, I'll restart the client and see what happens. |
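For anyone wanting to spot "multiple restarts" in a result log without counting lines by eye, here is a minimal Python sketch. The `count_restarts` helper is hypothetical, and it assumes (as this post suggests, but does not confirm) that each "Copying wcgrestart.rst" line marks a resume from a checkpoint. The sample text is the stderr section of the log quoted in this post.

```python
def count_restarts(stderr_text: str) -> dict:
    """Summarise a task's stderr text (hypothetical helper, not part of BOINC).

    Assumes each "Copying wcgrestart.rst" line marks a resume from a checkpoint.
    """
    lines = [ln.strip() for ln in stderr_text.splitlines() if ln.strip()]
    return {
        "inits": sum(ln == "Calling gridPlatform.init()" for ln in lines),
        "checkpoint_restores": sum(ln == "Copying wcgrestart.rst" for ln in lines),
        "finished": "called boinc_finish" in lines,
    }


# stderr section of the Pending Validation result quoted in this post.
log = """\
Calling gridPlatform.init()
INFO: No state to restore. Start from the beginning.
Calling gridPlatform.init()
Copying wcgrestart.rst
Calling gridPlatform.init()
Copying wcgrestart.rst
Calling gridPlatform.init()
Copying wcgrestart.rst
Calling gridPlatform.init()
Copying wcgrestart.rst
called boinc_finish
"""

summary = count_restarts(log)
print(summary)  # {'inits': 5, 'checkpoint_restores': 4, 'finished': True}
```

A count alone cannot show whether the same checkpoint was reloaded each time; it only confirms that the task started five times and finished.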
||
|
|