| World Community Grid Forums
Thread Status: Active Total posts in this thread: 70
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Why are some of these WUs classified as "error" when no apparent error has occurred and they simply did not finish job 0 within the 18-hour time limit? Wouldn't it be more prudent to classify them as "time exceeded" or some such thing, and save the WCG servers from sending these WUs to more clients?
|
||
|
|
SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline |
The reason is that there are surely other, faster devices that will get to checkpoint 1. Each task gets 5 chances to do that.
|
||
|
|
pvh513
Senior Cruncher Joined: Feb 26, 2011 Post Count: 260 Status: Offline Project Badges:
|
I got 7 of these WUs, of which 1 has finished so far. It exited with RC = 0x100 in job #0 and then skipped jobs #1 through #4. For my wingman it exited with RC = 0x100 in job #3 and then skipped job #4. This is now pending verification and a third unit has been sent out. Name is BETA_E236437_323_S.372.C52H28S2.WEULLIFNMVXNAH-UHFFFAOYSA-N.4_s1_14a.
|
||
|
|
SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline |
" These work units are smaller in size than the previous test. Should allow for the first job to run faster. "
----------------------------------------
Not quite, as others have already observed. Had 3 of 5 on the laptop not checkpointing in the first 17.5 hours and then fate struck... system momentarily busy, heartbeat troubles...

5519 World Community Grid 5/24/2016 7:08:02 PM Task BETA_E236438_79_S.384.C34F2H10N8O4S4.ATKBMOAMWBBVNU-UHFFFAOYSA-N.4_s1_14a_1 exited with zero status but no 'finished' file
5520 World Community Grid 5/24/2016 7:08:02 PM If this happens repeatedly you may need to reset the project.
5521 World Community Grid 5/24/2016 7:08:02 PM [checkpoint] result ZIKA_000001131_x1nb7_HCVJ4_RNAPol_wRNAand2Mn_chnA_0023_0 checkpointed
5522 World Community Grid 5/24/2016 7:08:03 PM [cpu_sched] Restarting task BETA_E236438_79_S.384.C34F2H10N8O4S4.ATKBMOAMWBBVNU-UHFFFAOYSA-N.4_s1_14a_1 using beta11 version 700 in slot 8
5523 5/24/2016 7:08:13 PM Suspending computation - CPU is busy
5524 World Community Grid 5/24/2016 7:08:13 PM [cpu_sched] Preempting BETA_E236438_79_S.384.C34F2H10N8O4S4.ATKBMOAMWBBVNU-UHFFFAOYSA-N.4_s1_14a_1 (left in memory)
5525 World Community Grid 5/24/2016 7:08:13 PM [cpu_sched] Preempting BETA_E236438_990_S.388.C40F5H13N6S3.FYRXAQSGOQBMGM-UHFFFAOYSA-N.9_s1_14a_0 (left in memory)
5526 World Community Grid 5/24/2016 7:08:13 PM [cpu_sched] Preempting BETA_E236438_293_S.392.C44F2H18N4S4.CRKKDMWLGUUGIU-UHFFFAOYSA-N.13_s1_14a_1 (left in memory)
5527 World Community Grid 5/24/2016 7:08:13 PM [cpu_sched] Preempting BETA_E236438_576_S.394.C44H20N2O4S4.JBZGKUDBQBIDKE-UHFFFAOYSA-N.15_s1_14a_1 (left in memory)
5528 World Community Grid 5/24/2016 7:08:13 PM [cpu_sched] Preempting ZIKA_000001118_x1nb7_HCVJ4_RNAPol_wRNAand2Mn_chnA_0058_2 (left in memory)
5529 World Community Grid 5/24/2016 7:08:13 PM [cpu_sched] Preempting ZIKA_000001123_x1nb7_HCVJ4_RNAPol_wRNAand2Mn_chnA_0006_0 (left in memory)
5530 World Community Grid 5/24/2016 7:08:13 PM [cpu_sched] Preempting ZIKA_000001123_x1nb7_HCVJ4_RNAPol_wRNAand2Mn_chnA_0322_0 (left in memory)
5531 World Community Grid 5/24/2016 7:08:13 PM [cpu_sched] Preempting ZIKA_000001131_x1nb7_HCVJ4_RNAPol_wRNAand2Mn_chnA_0023_0 (left in memory)
5532 World Community Grid 5/24/2016 7:08:13 PM Task BETA_E236438_990_S.388.C40F5H13N6S3.FYRXAQSGOQBMGM-UHFFFAOYSA-N.9_s1_14a_0 exited with zero status but no 'finished' file
5533 World Community Grid 5/24/2016 7:08:13 PM If this happens repeatedly you may need to reset the project.
5534 World Community Grid 5/24/2016 7:08:13 PM Task BETA_E236438_293_S.392.C44F2H18N4S4.CRKKDMWLGUUGIU-UHFFFAOYSA-N.13_s1_14a_1 exited with zero status but no 'finished' file
5535 World Community Grid 5/24/2016 7:08:13 PM If this happens repeatedly you may need to reset the project.
5536 World Community Grid 5/24/2016 7:08:13 PM Task BETA_E236438_576_S.394.C44H20N2O4S4.JBZGKUDBQBIDKE-UHFFFAOYSA-N.15_s1_14a_1 exited with zero status but no 'finished' file
5537 World Community Grid 5/24/2016 7:08:13 PM If this happens repeatedly you may need to reset the project.
5538 5/24/2016 7:08:23 PM Resuming computation

Of course the reset advice is a no-go... this is CEP2 after all. Anyway, I caught them 20 minutes into the retry from the start and returned them to sender... doubtful these 3 would have made it to the first checkpoint on this device. The 4th strangely did not budge and has 3 checkpoints, so this one is good to finish in time. The 5th finished in 14:12 with 4 checkpoints.
[Edit 1 times, last edit by SekeRob* at May 24, 2016 5:47:28 PM] |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Yikes, gotta say this forum doesn't tolerate diversity of opinion very well.
But to be on topic, I've received a few betas, all valid, taking between 6 and 10 hours each. |
||
|
|
RTS48
Veteran Cruncher Bolivia Joined: Aug 2, 2009 Post Count: 1353 Status: Offline Project Badges:
|
Betas crunching fine and due to finish in about 18 hours EXCEPT I have just had a power cut (only a few minutes but enough to shut down my UPS). When back up I find that all of my Betas have reset to zero, losing me 64 hours (8 hours by 8 cores) of crunch time. Why oh why does this Beta not do a CPU checkpoint? Please, please ensure that future Betas include a checkpoint so that folks like me (subject to random power cuts) can preserve most of the work already completed.
----------------------------------------
Rod Peel
Santa Cruz Bolivia South America |
||
|
|
nanoprobe
Master Cruncher Classified Joined: Aug 29, 2008 Post Count: 2998 Status: Offline Project Badges:
|
RTS48 wrote: "Betas crunching fine and due to finish in about 18 hours EXCEPT I have just had a power cut (only a few minutes but enough to shut down my UPS). When back up I find that all of my Betas have reset to zero, losing me 64 hours (8 hours by 8 cores) of crunch time. Why oh why does this Beta not do a CPU checkpoint? Please, please ensure that future Betas include a checkpoint so that folks like me (subject to random power cuts) can preserve most of the work already completed."
Checkpointing still seems to be an issue. I thought one of the tests for this new beta was to remedy that problem, but I'm still seeing tasks run past 11 hours before the first checkpoint.
In 1969 I took an oath to defend and protect the U S Constitution against all enemies, both foreign and Domestic. There was no expiration date.
|
||
|
|
Seoulpowergrid
Veteran Cruncher Joined: Apr 12, 2013 Post Count: 823 Status: Offline Project Badges:
|
+1
----------------------------------------
I am over 14 hours in on a 3.4 GHz machine and the first checkpoint is yet to be reached. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
We still have an RC = 0x1 exit in Job #0 not validating with an RC = 0x1 exit in Job #3; instead both go to PVer and a repair unit gets issued. The question is whether one of the original pair still ends up Invalid ...
BETA_E236439_314_S.422.C44H18N4O2S6.PLTGJJHXMUKIKO-UHFFFAOYSA-N.12_s1_14a_2 -- Microsoft Windows 8.1 Professional x64 Edition, (06.03.9600.00) - In Progress 25/05/16 03:55:21 29/05/16 03:55:21 0.00 0.0 / 0.0
BETA_E236439_314_S.422.C44H18N4O2S6.PLTGJJHXMUKIKO-UHFFFAOYSA-N.12_s1_14a_1 -- Microsoft Windows 10 Core x64 Edition, (10.00.10586.00) 700 Pending Verification 24/05/16 10:40:21 25/05/16 03:55:13 8.06 293.7 / 0.0
BETA_E236439_314_S.422.C44H18N4O2S6.PLTGJJHXMUKIKO-UHFFFAOYSA-N.12_s1_14a_0 -- Microsoft x64 Edition, (10.00.10586.00) 700 Pending Verification 24/05/16 10:40:01 24/05/16 12:44:09 2.02 63.3 / 0.0 |
||
|
|
SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline |
Maybe the validator needs an additional rule to always set the canonical candidate to the wingman result with the highest number of jobs completed. Suppose the 3rd copy only gets to job #2: which one is then of binding interest? In the example, for validation purposes, only look at the first 2 jobs that can be matched and assume the 3rd is fine. [I think this is how HCMD2 worked when 2 results had a different number of jobs completed in the allowed time.]
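
To make the suggested rule concrete, here is a rough Python sketch. It is not the actual WCG or BOINC validator code: the `Result` class, the `pick_canonical` function, the per-job numeric outputs and the tolerance are all hypothetical, made up only to illustrate the idea of comparing just the jobs that both copies completed and then preferring the copy that got furthest.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Result:
    """Hypothetical stand-in for one returned copy of a work unit."""
    name: str
    job_outputs: List[float]  # one value per completed job, in job order


def pick_canonical(results: List[Result], tol: float = 1e-6) -> Optional[Result]:
    """Return the agreeing result with the most completed jobs, or None.

    Two results agree when every job that BOTH completed matches within tol.
    Jobs that only one copy reached are assumed fine (the proposed rule).
    """
    best: Optional[Result] = None
    for i, a in enumerate(results):
        for b in results[i + 1:]:
            shared = min(len(a.job_outputs), len(b.job_outputs))
            if shared == 0:
                continue  # no common job to compare
            # zip stops at the shorter list, so only shared jobs are compared
            if all(abs(x - y) <= tol for x, y in zip(a.job_outputs, b.job_outputs)):
                longer = a if len(a.job_outputs) >= len(b.job_outputs) else b
                if best is None or len(longer.job_outputs) > len(best.job_outputs):
                    best = longer
    return best


# One copy stopped after job #0, the other after job #3 (made-up values).
short = Result("copy_0", [1.0])
long_ = Result("copy_1", [1.0, 2.0, 3.0, 4.0])
canonical = pick_canonical([short, long_])
print(canonical.name if canonical else "no agreement")
```

Under these assumptions the two copies validate against each other on job #0 alone, and the copy that reached job #3 becomes the canonical candidate, so no repair unit would be needed.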
|
||
|
|
|