| World Community Grid Forums
Thread Status: Active Total posts in this thread: 10
kateiacy
Veteran Cruncher USA Joined: Jan 23, 2010 Post Count: 1027 Status: Offline Project Badges:
|
Here's a case where my wingman and I produced different results for the same WU, but both were deemed valid. Can someone explain why this happens?
----------------------------------------
Name: The Clean Energy Project - Phase 2
Created: 6/30/10
Name: E200058_034_A.18.C14H10N2OSi.9.2.set1d06
Minimum Quorum: 2
Replication: 2

| Result Name | App Version Number | Status | Sent Time | Time Due / Return Time | CPU Time (hours) | Claimed / Granted BOINC Credit |
| E200058_034_A.18.C14H10N2OSi.9.2.set1d06_1-- | 619 | Valid | 7/4/10 12:06:21 | 7/5/10 04:36:11 | 0.45 | 5.5 / 7.6 |
| E200058_034_A.18.C14H10N2OSi.9.2.set1d06_0-- | 619 | Valid | 7/4/10 12:04:09 | 7/4/10 23:30:00 | 3.16 | 52.3 / 40.7 | <- me

My result log shows all 16 jobs in the WU being completed.

Finished Job #0
Finished Job #1
. . .
Finished Job #15

Here's the wingman's result log. His machine skipped the last 13 jobs. Why would that have happened when he was not near the time limit? And why is that result valid?

Result Log
Result Name: E200058_034_A.18.C14H10N2OSi.9.2.set1d06_1--
<core_client_version>6.10.17</core_client_version>
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[05:35:51] Number of jobs = 16
[05:35:51] Starting job 0,CPU time has been restored to 0.000000.
[05:35:51] Starting new Job
[05:35:51] Qink name = fldman
[05:35:51] Qink name = gesman
[05:35:51] Qink name = scfman
[05:38:58] Qink name = anlman
[05:39:00] End of Job
[05:39:03] Finished Job #0
[05:39:03] Starting job 1,CPU time has been restored to 82.170508.
[05:39:03] Starting new Job
[05:39:03] Qink name = fldman
[05:39:04] Qink name = gesman
[05:39:04] Qink name = scfman
[05:46:29] Qink name = anlman
[05:46:46] End of Job
[05:46:48] Finished Job #1
[05:46:48] Starting job 2,CPU time has been restored to 312.978419.
[05:46:48] Starting new Job
[05:46:49] Qink name = fldman
[05:46:49] Qink name = gesman
[05:46:49] Qink name = scfman
[05:53:39] Qink name = anlman
[05:53:39] Qink name = drvman
[05:55:10] Qink name = optman
[05:55:10] Qink name = fldman
[05:55:10] Qink name = gesman
[05:55:10] Qink name = scfman
[06:07:30] Qink name = anlman
[06:07:30] Qink name = drvman
[06:08:54] Qink name = optman
[06:08:55] Qink name = fldman
[06:08:55] Qink name = gesman
[06:08:55] Qink name = scfman
[06:20:43] Qink name = anlman
[06:20:44] Qink name = drvman
[06:22:15] Qink name = optman
[06:22:15] Qink name = fldman
[06:22:15] Qink name = gesman
[06:22:16] Qink name = scfman
[06:32:41] Qink name = anlman
[06:32:41] Qink name = drvman
[06:34:05] Qink name = optman
[06:34:06] Qink name = fldman
[06:34:06] Qink name = gesman
[06:34:07] Qink name = scfman
Application exited with RC = 0x100
[06:35:05] Finished Job #2
[06:35:05] Starting job 3,CPU time has been restored to 1535.621548.
[06:35:05] Skipping Job #3
[06:35:05] Starting job 4,CPU time has been restored to 1535.621548.
[06:35:05] Skipping Job #4
[06:35:05] Starting job 5,CPU time has been restored to 1535.621548.
[06:35:05] Skipping Job #5
[06:35:05] Starting job 6,CPU time has been restored to 1535.621548.
[06:35:05] Skipping Job #6
[06:35:05] Starting job 7,CPU time has been restored to 1535.621548.
[06:35:05] Skipping Job #7
[06:35:05] Starting job 8,CPU time has been restored to 1535.621548.
[06:35:05] Skipping Job #8
[06:35:05] Starting job 9,CPU time has been restored to 1535.621548.
[06:35:05] Skipping Job #9
[06:35:05] Starting job 10,CPU time has been restored to 1535.621548.
[06:35:05] Skipping Job #10
[06:35:05] Starting job 11,CPU time has been restored to 1535.621548.
[06:35:05] Skipping Job #11
[06:35:05] Starting job 12,CPU time has been restored to 1535.621548.
[06:35:05] Skipping Job #12
[06:35:05] Starting job 13,CPU time has been restored to 1535.621548.
[06:35:05] Skipping Job #13
[06:35:05] Starting job 14,CPU time has been restored to 1535.621548.
[06:35:05] Skipping Job #14
[06:35:05] Starting job 15,CPU time has been restored to 1535.621548.
[06:35:05] Skipping Job #15
called boinc_finish Exiting 0
</stderr_txt>
]]>
|
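For anyone comparing their own result logs against a wingman's, here is a minimal Python sketch that counts finished and skipped jobs and pulls out any return codes. It assumes only the line formats visible in the log above; the function name `summarize_log` and the `sample` text are made up for illustration and are not part of any WCG or BOINC tool.

```python
import re

# Patterns copied from the log lines quoted in this thread.
FINISHED = re.compile(r"Finished Job #(\d+)")
SKIPPED = re.compile(r"Skipping Job #(\d+)")
EXIT_CODE = re.compile(r"Application exited with RC = (0x[0-9a-fA-F]+)")


def summarize_log(text):
    """Return finished job numbers, skipped job numbers, and reported return codes."""
    return {
        "finished": [int(n) for n in FINISHED.findall(text)],
        "skipped": [int(n) for n in SKIPPED.findall(text)],
        "return_codes": EXIT_CODE.findall(text),
    }


# Hypothetical, shortened sample in the same format as the log above.
sample = """
[05:39:03] Finished Job #0
[05:46:48] Finished Job #1
Application exited with RC = 0x100
[06:35:05] Finished Job #2
[06:35:05] Skipping Job #3
[06:35:05] Skipping Job #4
"""

print(summarize_log(sample))
```

Note that the log prints "Finished Job" even for the job that exited with a non-zero RC, so "finished" here means "no longer running", not necessarily "succeeded".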
||
|
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
Not an original explanation: the scientists get the result that has done the most (well, really both go directly to Harvard). Validation in this case is considered sufficient when the two results match on the part that the slowest wingman completed.
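To make that rule concrete, here is a rough Python sketch of "match on the shared part". This is not the actual CEP2 validator, whose details are not given in this thread. It assumes, purely for illustration, that each result is a list with one number per completed job and that two values match when they agree within a small tolerance; `validate_pair` and `tolerance` are hypothetical names.

```python
def validate_pair(result_a, result_b, tolerance=1e-6):
    """Compare two results over the jobs that both of them completed.

    Each result is assumed (hypothetically) to be a list with one
    numeric value per completed job, in job order.
    """
    shared = min(len(result_a), len(result_b))
    if shared == 0:
        # Nothing in common to compare, so no agreement can be shown.
        return False
    # zip stops at the shorter list, i.e. at the shared prefix.
    return all(abs(a - b) <= tolerance for a, b in zip(result_a, result_b))


full_run = [1.25, 3.50, 7.75, 9.00]   # hypothetical: finished every job
short_run = [1.25, 3.50]              # hypothetical: stopped early

print(validate_pair(full_run, short_run))
```

Under these assumptions the pair validates because the two jobs both machines completed agree, even though one result is much longer than the other.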
----------------------------------------
WCG
Please help to make the Forums an enjoyable experience for All! |
||
|
|
kateiacy
Veteran Cruncher USA Joined: Jan 23, 2010 Post Count: 1027 Status: Offline Project Badges:
|
Quote: Not an original explanation: the scientists get the result that has done the most (well, really both go directly to Harvard). Validation in this case is considered sufficient when the two results match on the part that the slowest wingman completed.

But this situation is different from that. There was no "slow wingman" here. The machine that completed only 3 of the 16 jobs exited in less than 1 hour. That's what has me puzzled. |
||
|
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
That is exactly what I explained, as I understand it: yours is the result the scientists are interested in, since it did the most work. I have not heard why some tasks exit early, nor what RC = 0x4 or RC = 0x100 means. Let's see what the techs or scientists have to say on this. Then we can start collecting some of these bits into the "don't worry, these are benign messages" FAQ, depending on how long this phase will run.
|
||
|
|
anhhai
Veteran Cruncher Joined: Mar 22, 2005 Post Count: 839 Status: Offline Project Badges:
|
I haven't seen an update to this question anywhere else. Sekerob, can you tell us what happens when neither cruncher finishes the WU (for whatever reason)? You have already stated that the scientists care about the result that did the most work, but what about the parts that aren't done yet? Will they figure out what hasn't been done and create new WUs with the unfinished parts?
|
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
That will probably be taken care of in "child" WUs as in other projects.
I do not know that for a fact in cep2 but assume it is true as I have gotten a few that were surprisingly short running, both for me and the wing man (around 4 hours). The rest are running 6 hours and up. I have only had one time out so far. Happened a couple weeks ago. |
||
|
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
There are no child results created, AFAIK. Each task consists of a series of dependent jobs (simulations), each taking its input from the previous job or from the root. Why one member of a quorum sometimes computes more than the other I don't know, but the validation process is set to accept that and move on, as defined by the scientists.
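The dependent-job behaviour can be sketched in Python as below. It is only an illustration of the pattern the result logs suggest (once one job fails, the jobs after it are skipped), not the real application code; `run_chain` and the lambda jobs are hypothetical, and each job is assumed to return 0 on success.

```python
def run_chain(jobs):
    """Run jobs in order; after the first failure, skip every later job.

    Each job is a callable returning an integer return code (0 = success).
    The messages imitate the wording seen in the result logs.
    """
    log = []
    failed = False
    for index, job in enumerate(jobs):
        if failed:
            # A later job depends on output that was never produced.
            log.append(f"Skipping Job #{index}")
            continue
        rc = job()
        if rc != 0:
            log.append(f"Application exited with RC = {rc:#x}")
            failed = True
        log.append(f"Finished Job #{index}")
    return log


# Hypothetical chain: the second job fails, so the third is skipped.
for line in run_chain([lambda: 0, lambda: 0x100, lambda: 0]):
    print(line)
```

This reproduces the shape of the wingman logs: the failing job is still reported as finished, and everything after it is skipped.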
|
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Just curious. And still no answer by some scientist found. :-(
I have some WUs where I finished all 16 jobs successfully while my wingman had 'Application exited with RC = 0x100' e.g. within job 12 and was skipping all succeeding jobs, but both tasks validated without problem. On the other hand if the wingman has 'Application exited with RC = 0x4' his task becomes inconclusive and then invalid. Not a real problem, but I'd appreciate a scientific explanation for both return codes and their difference. Or did I miss something? |
||
|
|
AgrFan
Senior Cruncher USA Joined: Apr 17, 2008 Post Count: 397 Status: Offline Project Badges:
|
Quote: Just curious. And still no answer by some scientist found. :-( I have some WUs where I finished all 16 jobs successfully while my wingman had 'Application exited with RC = 0x100' e.g. within job 12 and was skipping all succeeding jobs, but both tasks validated without problem. On the other hand if the wingman has 'Application exited with RC = 0x4' his task becomes inconclusive and then invalid. Not a real problem, but I'd appreciate a scientific explanation for both return codes and their difference. Or did I miss something?

I can't get any CEP2 WUs to finish on an Athlon XP rig running Ubuntu 10.04 with all updates. All units fail with RC = 0x4. Could this project require SSE2 CPU support? I had another rig with a P4 HT CPU that did not have this problem. Just looking for some information before sending the box to the scrap heap. Thanks.
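Nobody in this thread has confirmed what the RC values mean, so the following is only one possible reading. If the RC is a raw POSIX-style wait status as used on Linux (an assumption, not something the project has stated), the low 7 bits hold a terminating signal number and the next byte holds the exit code. The Python sketch below decodes the three values reported in this thread under that assumption; `decode_wait_status` is a hypothetical helper.

```python
def decode_wait_status(rc):
    """Split a Linux-style wait status into (kind, value).

    ASSUMPTION: rc uses the usual Linux encoding, where the low 7 bits
    are a terminating signal and bits 8-15 are the exit code. The thread
    does not confirm that CEP2 reports RC this way.
    """
    signal_number = rc & 0x7F
    if signal_number:
        return ("signal", signal_number)
    return ("exit", (rc >> 8) & 0xFF)


for rc in (0x100, 0x4, 0xE600):
    print(hex(rc), decode_wait_status(rc))
```

Under that reading, 0x100 would be a normal exit with code 1, while 0x4 would be termination by signal 4, which on Linux is SIGILL (illegal instruction). That would be consistent with a CPU missing an instruction set the binary uses, but it remains a guess until a tech or scientist confirms it.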
|
||
|
|
kateiacy
Veteran Cruncher USA Joined: Jan 23, 2010 Post Count: 1027 Status: Offline Project Badges:
|
My wingman has an error I haven't seen before; thought I'd post it in case it's not already a known one. (His result was declared valid, by the way, as was mine which ran correctly all the way through to the end.)
----------------------------------------
Result Log
Result Name: E200498_003_A.25.C18H9N3S3Se.25.4.set1d06_1--
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[19:54:24] Number of jobs = 16
[19:54:24] Starting job 0,CPU time has been restored to 0.000000.
[19:54:24] Starting new Job
[19:54:24] Qink name = fldman
[19:54:25] Qink name = gesman
[19:54:25] Qink name = scfman
[19:58:05] Qink name = anlman
[19:58:09] End of Job
[19:58:11] Finished Job #0
[19:58:11] Starting job 1,CPU time has been restored to 121.648506.
[19:58:11] Starting new Job
[19:58:12] Qink name = fldman
[19:58:13] Qink name = gesman
[19:58:13] Qink name = scfman
[20:07:14] Qink name = anlman
[20:08:03] End of Job
[20:08:06] Finished Job #1
[20:08:06] Starting job 2,CPU time has been restored to 519.510021.
[20:08:06] Starting new Job
[20:08:06] Qink name = fldman
[20:08:07] Qink name = gesman
[20:08:08] Qink name = scfman
[20:15:33] Qink name = anlman
[20:15:33] Qink name = drvman
[20:17:12] Qink name = optman
[20:17:12] Qink name = fldman
[20:17:12] Qink name = gesman
[20:17:12] Qink name = scfman
[20:28:44] Qink name = anlman
[20:28:44] Qink name = drvman
[20:30:28] Qink name = optman
[20:30:28] Qink name = fldman
[20:30:28] Qink name = gesman
[20:30:29] Qink name = scfman
[20:42:51] Qink name = anlman
[20:42:51] Qink name = drvman
[20:44:44] Qink name = optman
[20:44:44] Qink name = fldman
[20:44:44] Qink name = gesman
[20:44:45] Qink name = scfman
[20:57:08] Qink name = anlman
[20:57:08] Qink name = drvman
[20:59:13] Qink name = optman
[20:59:13] Qink name = fldman
[20:59:13] Qink name = gesman
[20:59:13] Qink name = scfman
[21:09:46] Qink name = anlman
[21:09:47] Qink name = drvman
[21:11:34] Qink name = optman
[21:11:34] Qink name = fldman
[21:11:34] Qink name = gesman
[21:11:35] Qink name = scfman
[21:23:26] Qink name = anlman
[21:23:27] Qink name = drvman
[21:24:56] Qink name = optman
[21:24:56] Qink name = fldman
[21:24:56] Qink name = gesman
[21:24:57] Qink name = scfman
[21:35:23] Qink name = anlman
[21:35:23] Qink name = drvman
[21:37:38] Qink name = optman
[21:37:38] Qink name = fldman
[21:37:38] Qink name = gesman
[21:37:38] Qink name = scfman
[21:49:38] Qink name = anlman
[21:49:39] Qink name = drvman
[21:51:03] Qink name = optman
[21:51:03] Qink name = fldman
[21:51:03] Qink name = gesman
[21:51:04] Qink name = scfman
[22:02:46] Qink name = anlman
[22:02:46] Qink name = drvman
[22:04:23] Qink name = optman
[22:04:23] Qink name = fldman
[22:04:23] Qink name = gesman
[22:04:23] Qink name = scfman
[22:15:19] Qink name = anlman
[22:15:19] Qink name = drvman
[22:17:40] Qink name = optman
[22:17:40] Qink name = fldman
[22:17:40] Qink name = gesman
[22:17:41] Qink name = scfman
[22:28:49] Qink name = anlman
[22:28:49] Qink name = drvman
[22:30:20] Qink name = optman
[22:30:20] Qink name = fldman
[22:30:20] Qink name = gesman
[22:30:20] Qink name = scfman
[22:39:46] Qink name = anlman
[22:39:46] Qink name = drvman
[22:41:30] Qink name = optman
[22:41:30] Qink name = fldman
[22:41:30] Qink name = gesman
[22:41:30] Qink name = scfman
[22:52:29] Qink name = anlman
[22:52:29] Qink name = drvman
[22:54:03] Qink name = optman
[22:54:03] Qink name = fldman
[22:54:03] Qink name = gesman
[22:54:03] Qink name = scfman
[23:03:32] Qink name = anlman
[23:03:33] Qink name = drvman
[23:06:02] Qink name = optman
[23:06:02] Qink name = fldman
[23:06:02] Qink name = gesman
[23:06:04] Qink name = scfman
[23:12:40] Qink name = anlman
[23:12:40] Qink name = drvman
[23:15:11] Qink name = optman
[23:15:11] Qink name = fldman
[23:15:11] Qink name = gesman
[23:15:13] Qink name = scfman
[23:21:58] Qink name = anlman
[23:21:58] Qink name = drvman
[23:23:37] Qink name = optman
[23:23:37] Qink name = fldman
[23:23:37] Qink name = gesman
[23:23:38] Qink name = scfman
[23:29:33] Qink name = anlman
[23:29:33] Qink name = drvman
[23:31:23] Qink name = optman
[23:31:23] Qink name = anlman
[23:31:54] End of Job
[23:31:58] Finished Job #2
[23:31:58] Starting job 3,CPU time has been restored to 7492.622946.
[23:31:58] Starting new Job
[23:31:58] Qink name = fldman
[23:31:58] Qink name = gesman
[23:31:59] Qink name = scfman
[23:42:33] Qink name = anlman
[23:43:37] End of Job
[23:43:39] Finished Job #3
[23:43:39] Starting job 4,CPU time has been restored to 7905.664154.
[23:43:39] Starting new Job
[23:43:39] Qink name = fldman
[23:43:40] Qink name = gesman
[23:43:40] Qink name = scfman
[23:52:04] Qink name = anlman
[23:53:08] End of Job
[23:53:11] Finished Job #4
[23:53:11] Starting job 5,CPU time has been restored to 8206.906358.
[23:53:11] Starting new Job
[23:53:11] Qink name = fldman
[23:53:11] Qink name = gesman
[23:53:11] Qink name = scfman
[00:00:43] Qink name = anlman
[00:01:13] End of Job
[00:01:15] Finished Job #5
[00:01:15] Starting job 6,CPU time has been restored to 8497.055248.
[00:01:15] Starting new Job
[00:01:15] Qink name = fldman
[00:01:17] Qink name = gesman
[00:01:17] Qink name = scfman
[00:09:26] Qink name = anlman
[00:10:24] End of Job
[00:10:26] Finished Job #6
[00:10:26] Starting job 7,CPU time has been restored to 8799.896209.
[00:10:27] Starting new Job
[00:10:27] Qink name = fldman
[00:10:28] Qink name = gesman
[00:10:28] Qink name = scfman
[00:21:47] Qink name = anlman
[00:22:45] End of Job
[00:22:48] Finished Job #7
[00:22:48] Starting job 8,CPU time has been restored to 9202.710971.
[00:22:49] Starting new Job
[00:22:49] Qink name = fldman
[00:22:50] Qink name = gesman
[00:22:50] Qink name = scfman
[00:31:08] Qink name = anlman
[00:31:23] End of Job
[00:31:26] Finished Job #8
[00:31:26] Starting job 9,CPU time has been restored to 9488.874467.
[00:31:26] Starting new Job
[00:31:26] Qink name = fldman
[00:31:27] Qink name = gesman
[00:31:27] Qink name = scfman
[00:41:49] Qink name = anlman
[00:43:10] End of Job
[00:43:13] Finished Job #9
[00:43:13] Starting job 10,CPU time has been restored to 9869.784559.
[00:43:13] Starting new Job
[00:43:13] Qink name = fldman
[00:43:14] Qink name = gesman
[00:43:14] Qink name = scfman
[01:04:51] Qink name = anlman
[01:06:13] End of Job
[01:06:16] Finished Job #10
[01:06:16] Starting job 11,CPU time has been restored to 10609.091167.
[01:06:16] Starting new Job
[01:06:16] Qink name = fldman
[01:06:16] Qink name = gesman
[01:06:16] Qink name = scfman
[01:15:36] Qink name = anlman
[01:17:03] End of Job
[01:17:06] Finished Job #11
[01:17:06] Starting job 12,CPU time has been restored to 10975.516461.
[01:17:06] Starting new Job
[01:17:07] Qink name = fldman
[01:17:12] Qink name = gesman
[01:17:14] Qink name = scfman
[02:15:12] Qink name = anlman
[02:26:13] End of Job
[02:26:17] Finished Job #12
[02:26:17] Starting job 13,CPU time has been restored to 13356.576484.
[02:26:18] Starting new Job
[02:26:18] Qink name = fldman
[02:26:23] Qink name = gesman
[02:26:24] Qink name = scfman
[04:50:27] Qink name = anlman
[04:59:47] End of Job
[04:59:51] Finished Job #13
[04:59:51] Starting job 14,CPU time has been restored to 19290.353412.
[ERROR] Failed to open either source or destination files while copying A.25.C18H9N3S3Se.25.4.bp86.svp.n.pbe0.tzvp.n.sp.in to ./A.25.C18H9N3S3Se.25.4.bp86.svp.n.pbe0.tzvp.n.sp/A.25.C18H9N3S3Se.25.4.bp86.svp.n.pbe0.tzvp.n.sp.in. Error: 2
Job file not authorized to run
Application exited with RC = 0xe600
[ERROR] Failed to open either source or destination files while copying A.25.C18H9N3S3Se.25.4.bp86.svp.n.pbe0.tzvp.n.sp/stdout.txt to A.25.C18H9N3S3Se.25.4.bp86.svp.n.pbe0.tzvp.n.sp.out. Error: 2
[04:59:53] Finished Job #14
[04:59:53] Starting job 15,CPU time has been restored to 19290.354411.
[ERROR] Failed to open either source or destination files while copying A.25.C18H9N3S3Se.25.4.bp86.svp.n.upbe0.tzvp.n.sp.in to ./A.25.C18H9N3S3Se.25.4.bp86.svp.n.upbe0.tzvp.n.sp/A.25.C18H9N3S3Se.25.4.bp86.svp.n.upbe0.tzvp.n.sp.in. Error: 2
[04:59:53] Skipping Job #15
called boinc_finish Exiting 0
</stderr_txt>
]]>
|
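A log this long is easier to read once it is boiled down to CPU time per job. The Python sketch below does that from the "CPU time has been restored to" lines, assuming (this is a guess at the meaning, not documented here) that the number is the cumulative CPU seconds at the moment each job starts. The name `cpu_seconds_per_job` and the shortened `sample` are hypothetical.

```python
import re

# Matches e.g. "Starting job 1,CPU time has been restored to 121.648506."
RESTORED = re.compile(
    r"Starting job (\d+),CPU time has been restored to (\d+\.\d+)"
)


def cpu_seconds_per_job(text):
    """Estimate CPU seconds spent in each job from consecutive counters.

    ASSUMPTION: the restored value is cumulative CPU time at job start,
    so a job's cost is the next job's counter minus its own. The last
    job in the log has no successor and is therefore left out.
    """
    marks = [(int(job), float(seconds)) for job, seconds in RESTORED.findall(text)]
    return {
        job: later - start
        for (job, start), (_, later) in zip(marks, marks[1:])
    }


# Shortened sample using the first three counters from the log above.
sample = """
[19:54:24] Starting job 0,CPU time has been restored to 0.000000.
[19:58:11] Starting job 1,CPU time has been restored to 121.648506.
[20:08:06] Starting job 2,CPU time has been restored to 519.510021.
"""

for job, seconds in cpu_seconds_per_job(sample).items():
    print(f"Job #{job}: {seconds:.1f} CPU seconds")
```

Applied to a full log, a job whose counter barely moves before the next "Starting job" line (as with the failed job 14 here) stands out immediately.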
||
|
|
|