This topic has been viewed 2983 times and has 9 replies.
kateiacy
Veteran Cruncher
USA
Joined: Jan 23, 2010
Post Count: 1027
question on mismatched WU results, both called valid

Here's a case where my wingman and I produced different results for the same WU, but both were deemed valid. Can someone explain why this happens?

Name: The Clean Energy Project - Phase 2
Created: 6/30/10
Name: E200058_034_A.18.C14H10N2OSi.9.2.set1d06
Minimum Quorum: 2
Replication: 2

Result Name | App Version Number | Status | Sent Time | Time Due / Return Time | CPU Time (hours) | Claimed / Granted BOINC Credit
E200058_034_A.18.C14H10N2OSi.9.2.set1d06_1-- | 619 | Valid | 7/4/10 12:06:21 | 7/5/10 04:36:11 | 0.45 | 5.5 / 7.6
E200058_034_A.18.C14H10N2OSi.9.2.set1d06_0-- | 619 | Valid | 7/4/10 12:04:09 | 7/4/10 23:30:00 | 3.16 | 52.3 / 40.7 <- me

My result log shows all 16 jobs in the WU being completed.
Finished Job #0
Finished Job #1
.
.
.
Finished Job #15

Here's the wingman's result log. His machine skipped the last 13 jobs. Why would that have happened when he was not near the time limit? And why is that result valid?


Result Log

Result Name: E200058_034_A.18.C14H10N2OSi.9.2.set1d06_1--
<core_client_version>6.10.17</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[05:35:51] Number of jobs = 16
[05:35:51] Starting job 0,CPU time has been restored to 0.000000.
[05:35:51] Starting new Job
[05:35:51] Qink name = fldman
[05:35:51] Qink name = gesman
[05:35:51] Qink name = scfman
[05:38:58] Qink name = anlman
[05:39:00] End of Job
[05:39:03] Finished Job #0
[05:39:03] Starting job 1,CPU time has been restored to 82.170508.
[05:39:03] Starting new Job
[05:39:03] Qink name = fldman
[05:39:04] Qink name = gesman
[05:39:04] Qink name = scfman
[05:46:29] Qink name = anlman
[05:46:46] End of Job
[05:46:48] Finished Job #1
[05:46:48] Starting job 2,CPU time has been restored to 312.978419.
[05:46:48] Starting new Job
[05:46:49] Qink name = fldman
[05:46:49] Qink name = gesman
[05:46:49] Qink name = scfman
[05:53:39] Qink name = anlman
[05:53:39] Qink name = drvman
[05:55:10] Qink name = optman
[05:55:10] Qink name = fldman
[05:55:10] Qink name = gesman
[05:55:10] Qink name = scfman
[06:07:30] Qink name = anlman
[06:07:30] Qink name = drvman
[06:08:54] Qink name = optman
[06:08:55] Qink name = fldman
[06:08:55] Qink name = gesman
[06:08:55] Qink name = scfman
[06:20:43] Qink name = anlman
[06:20:44] Qink name = drvman
[06:22:15] Qink name = optman
[06:22:15] Qink name = fldman
[06:22:15] Qink name = gesman
[06:22:16] Qink name = scfman
[06:32:41] Qink name = anlman
[06:32:41] Qink name = drvman
[06:34:05] Qink name = optman
[06:34:06] Qink name = fldman
[06:34:06] Qink name = gesman
[06:34:07] Qink name = scfman
Application exited with RC = 0x100
[06:35:05] Finished Job #2
[06:35:05] Starting job 3,CPU time has been restored to 1535.621548.
[06:35:05] Skipping Job #3
[06:35:05] Starting job 4,CPU time has been restored to 1535.621548.
[06:35:05] Skipping Job #4
[06:35:05] Starting job 5,CPU time has been restored to 1535.621548.
[06:35:05] Skipping Job #5
[06:35:05] Starting job 6,CPU time has been restored to 1535.621548.
[06:35:05] Skipping Job #6
[06:35:05] Starting job 7,CPU time has been restored to 1535.621548.
[06:35:05] Skipping Job #7
[06:35:05] Starting job 8,CPU time has been restored to 1535.621548.
[06:35:05] Skipping Job #8
[06:35:05] Starting job 9,CPU time has been restored to 1535.621548.
[06:35:05] Skipping Job #9
[06:35:05] Starting job 10,CPU time has been restored to 1535.621548.
[06:35:05] Skipping Job #10
[06:35:05] Starting job 11,CPU time has been restored to 1535.621548.
[06:35:05] Skipping Job #11
[06:35:05] Starting job 12,CPU time has been restored to 1535.621548.
[06:35:05] Skipping Job #12
[06:35:05] Starting job 13,CPU time has been restored to 1535.621548.
[06:35:05] Skipping Job #13
[06:35:05] Starting job 14,CPU time has been restored to 1535.621548.
[06:35:05] Skipping Job #14
[06:35:05] Starting job 15,CPU time has been restored to 1535.621548.
[06:35:05] Skipping Job #15
called boinc_finish
Exiting 0

</stderr_txt>
]]>
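
A quick way to compare two result logs like the ones above is to count the "Finished", "Skipping", and "RC" lines. The following is only a minimal sketch: the function name `summarize_result_log` is hypothetical, and the patterns are taken from the log lines quoted in this post, not from any documented log format.

```python
import re

def summarize_result_log(stderr_text):
    """Collect job counts and return-code lines from a result log (illustrative only)."""
    total = re.search(r"Number of jobs = (\d+)", stderr_text)
    return {
        "total": int(total.group(1)) if total else None,
        "finished": [int(n) for n in re.findall(r"Finished Job #(\d+)", stderr_text)],
        "skipped": [int(n) for n in re.findall(r"Skipping Job #(\d+)", stderr_text)],
        "return_codes": re.findall(
            r"Application exited with RC = (0x[0-9a-fA-F]+)", stderr_text
        ),
    }

# Shortened excerpt in the same shape as the wingman's log.
sample = """\
[05:35:51] Number of jobs = 16
[05:39:03] Finished Job #0
[05:46:48] Finished Job #1
Application exited with RC = 0x100
[06:35:05] Finished Job #2
[06:35:05] Skipping Job #3
[06:35:05] Skipping Job #4
"""

summary = summarize_result_log(sample)
print(summary)
```

Note that in the wingman's log, Job #2 is still reported as "Finished" right after the RC line, so a "Finished" line on its own does not show that the job ran to completion.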
----------------------------------------

[Jul 7, 2010 5:16:15 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043

Not an original explanation: the scientists get the one that has done the most (well, really both go directly to Harvard). Validation in this case is considered sufficient when the two results match on the part that the slowest wingman did.
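
To make the idea concrete, here is a minimal sketch of "match on the part both results completed". It is not the actual WCG or BOINC validator: the function name, the per-job numeric values, and the tolerance are all assumptions made for illustration.

```python
def agree_on_common_jobs(results_a, results_b, tolerance=1e-6):
    """Return True if two per-job result lists match on the jobs both completed.

    Hypothetical model: each list holds one number per finished job, in job order.
    """
    common = min(len(results_a), len(results_b))
    if common == 0:
        return False  # nothing to compare, so no agreement can be shown
    return all(
        abs(a - b) <= tolerance
        for a, b in zip(results_a[:common], results_b[:common])
    )

full_run = [1.0, 2.5, 3.25, 4.0]   # made-up values for a run that finished 4 jobs
short_run = [1.0, 2.5, 3.25]       # made-up values for a run that stopped after 3
print(agree_on_common_jobs(full_run, short_run))
```

Under this model, a result that stopped early can still agree with a complete one, which would explain two "Valid" results with very different CPU times.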
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
[Jul 7, 2010 6:08:28 PM]
kateiacy
Veteran Cruncher
USA
Joined: Jan 23, 2010
Post Count: 1027

But this situation is different from that. There was no "slow wingman" here. The machine that completed only 3 of the 16 jobs exited in less than 1 hour. That's what has me puzzled.
----------------------------------------

[Jul 7, 2010 7:15:51 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043

Exactly as I explained and understand it: yours is then the one of interest to the scientists, having done the most. Why some tasks exit early I have not heard, nor what RC = 0x4 or RC = 0x100 means. Let's see what the techs or scientists have to pitch in on this. Then we can start collecting some of the bits into the "don't worry, these are benign messages" FAQ. Depends on how long this phase will run.
----------------------------------------
[Jul 7, 2010 7:43:05 PM]
anhhai
Veteran Cruncher
Joined: Mar 22, 2005
Post Count: 839

I haven't seen an update to this question anywhere else. Sekerob, can you tell us what happens when neither cruncher finishes the WU (for whatever reason)? You have already stated that the scientists care about the one that did the most work, but what about the parts that aren't done yet? Will they figure out which WUs weren't completed and build new WUs from the unfinished parts?
----------------------------------------

[Sep 16, 2010 1:19:37 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0

That will probably be taken care of in "child" WUs as in other projects.

I do not know that for a fact in CEP2, but I assume it is true, as I have gotten a few that were surprisingly short running, both for me and for the wingman (around 4 hours).

The rest are running 6 hours and up. I have only had one time out so far. Happened a couple weeks ago.
[Sep 18, 2010 6:29:37 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043

There are no child results created, AFAIK. The tasks have a series of dependent jobs (simulations), each taking its input from the previous job or the root. Why one member of a quorum sometimes computes more than the other I don't know, but the validation process is set to accept that and move on, as defined by the scientists.
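
The "series of dependent jobs" behaviour seen in the logs earlier in this thread can be modelled roughly as below. This is a sketch under assumptions, not the CEP2 application: `run_chain` is a hypothetical name, each job is just a callable returning a return code, and the rule "skip everything after the first nonzero RC" is inferred from the posted logs.

```python
def run_chain(jobs):
    """Run jobs in order; after the first nonzero return code, skip the rest."""
    log = []
    failed = False
    for index, job in enumerate(jobs):
        if failed:
            # Later jobs depend on earlier output, so they are not attempted.
            log.append(f"Skipping Job #{index}")
            continue
        rc = job()
        if rc != 0:
            log.append(f"Application exited with RC = {rc:#x}")
            failed = True
        # The posted logs print "Finished" even for the job that hit the RC line.
        log.append(f"Finished Job #{index}")
    return log

chain_log = run_chain([lambda: 0, lambda: 0x100, lambda: 0])
for line in chain_log:
    print(line)
```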
----------------------------------------
[Sep 18, 2010 7:57:57 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0

Just curious, and I still haven't found an answer from any scientist. :-(

I have some WUs where I finished all 16 jobs successfully while my wingman had 'Application exited with RC = 0x100' e.g. within job 12 and was skipping all succeeding jobs, but both tasks validated without problem.
On the other hand if the wingman has 'Application exited with RC = 0x4' his task becomes inconclusive and then invalid.
Not a real problem, but I'd appreciate a scientific explanation for both return codes and their difference. Or did I miss something?
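
One possible reading, offered purely as an assumption since nobody in this thread has confirmed how the application reports RC: if the value is a raw Linux-style wait status, the low 7 bits hold a terminating signal number and the next byte holds the exit code. The sketch below decodes the two codes under that assumption; `decode_wait_status` is a hypothetical helper, not part of BOINC.

```python
def decode_wait_status(status):
    """Split a Linux-style wait status into ("signal", n) or ("exit", n).

    Assumes the conventional encoding: low 7 bits = terminating signal,
    bits 8-15 = exit code. Stopped/continued states are not handled.
    """
    signal_number = status & 0x7F
    if signal_number:
        return ("signal", signal_number)
    return ("exit", (status >> 8) & 0xFF)

for rc in (0x100, 0x4):
    print(hex(rc), decode_wait_status(rc))
```

Read this way, 0x100 would be an ordinary exit with code 1, while 0x4 would mean the process was killed by signal 4, which is SIGILL (illegal instruction) on Linux. That is only a hypothesis and does not apply if the RC values come from a different encoding or from Windows hosts.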
[Oct 31, 2010 4:03:00 PM]
AgrFan
Senior Cruncher
USA
Joined: Apr 17, 2008
Post Count: 397

I can't get any CEP2 WUs to finish on an Athlon XP rig running Ubuntu 10.04 with all updates. All units fail with RC = 0x4. Could this project require SSE2 CPU support? I had another rig with a P4 HT CPU that did not have this problem. Just looking for some information before sending the box to the scrap heap. Thanks.
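
Whether CEP2 needs SSE2 is not answered in this thread, but checking what the CPU advertises is straightforward on Linux. A minimal sketch, assuming an x86 box where /proc/cpuinfo has a "flags" line; `cpu_has_flag` is a hypothetical helper name and the sample flag lists are shortened for illustration.

```python
def cpu_has_flag(cpuinfo_text, flag):
    """Return True if any "flags" line in /proc/cpuinfo-style text lists the flag."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            _, _, value = line.partition(":")
            if flag in value.split():
                return True
    return False

# Shortened, made-up flag lists for illustration.
with_sse2 = "flags\t\t: fpu vme de pse tsc msr mmx fxsr sse sse2 ht"
without_sse2 = "flags\t\t: fpu vme de pse tsc msr mmx fxsr sse 3dnow"

print(cpu_has_flag(with_sse2, "sse2"), cpu_has_flag(without_sse2, "sse2"))

# On a real Linux machine, read the file directly (skipped if it is missing).
try:
    with open("/proc/cpuinfo") as handle:
        print("this machine has sse2:", cpu_has_flag(handle.read(), "sse2"))
except OSError:
    pass
```

The Athlon XP supports SSE but not SSE2, while the Pentium 4 does support SSE2, so the two rigs would differ on this check.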
----------------------------------------

  • i5-10400 (Comet Lake, 6C/12T) @ 2.9 GHz
  • i5-7400 (Kaby Lake, 4C/4T) @ 3.0 GHz
  • i5-4590 (Haswell, 4C/4T) @ 3.3 GHz
  • i5-3330 (Ivy Bridge, 4C/4T) @ 3.0 GHz

[Nov 1, 2010 11:40:54 PM]
kateiacy
Veteran Cruncher
USA
Joined: Jan 23, 2010
Post Count: 1027

My wingman has an error I haven't seen before; thought I'd post it in case it's not already a known one. (His result was declared valid, by the way, as was mine which ran correctly all the way through to the end.)

Result Log

Result Name: E200498_003_A.25.C18H9N3S3Se.25.4.set1d06_1--
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[19:54:24] Number of jobs = 16
[19:54:24] Starting job 0,CPU time has been restored to 0.000000.
[19:54:24] Starting new Job
[19:54:24] Qink name = fldman
[19:54:25] Qink name = gesman
[19:54:25] Qink name = scfman
[19:58:05] Qink name = anlman
[19:58:09] End of Job
[19:58:11] Finished Job #0
[19:58:11] Starting job 1,CPU time has been restored to 121.648506.
[19:58:11] Starting new Job
[19:58:12] Qink name = fldman
[19:58:13] Qink name = gesman
[19:58:13] Qink name = scfman
[20:07:14] Qink name = anlman
[20:08:03] End of Job
[20:08:06] Finished Job #1
[20:08:06] Starting job 2,CPU time has been restored to 519.510021.
[20:08:06] Starting new Job
[20:08:06] Qink name = fldman
[20:08:07] Qink name = gesman
[20:08:08] Qink name = scfman
[20:15:33] Qink name = anlman
[20:15:33] Qink name = drvman
[20:17:12] Qink name = optman
[20:17:12] Qink name = fldman
[20:17:12] Qink name = gesman
[20:17:12] Qink name = scfman
[20:28:44] Qink name = anlman
[20:28:44] Qink name = drvman
[20:30:28] Qink name = optman
[20:30:28] Qink name = fldman
[20:30:28] Qink name = gesman
[20:30:29] Qink name = scfman
[20:42:51] Qink name = anlman
[20:42:51] Qink name = drvman
[20:44:44] Qink name = optman
[20:44:44] Qink name = fldman
[20:44:44] Qink name = gesman
[20:44:45] Qink name = scfman
[20:57:08] Qink name = anlman
[20:57:08] Qink name = drvman
[20:59:13] Qink name = optman
[20:59:13] Qink name = fldman
[20:59:13] Qink name = gesman
[20:59:13] Qink name = scfman
[21:09:46] Qink name = anlman
[21:09:47] Qink name = drvman
[21:11:34] Qink name = optman
[21:11:34] Qink name = fldman
[21:11:34] Qink name = gesman
[21:11:35] Qink name = scfman
[21:23:26] Qink name = anlman
[21:23:27] Qink name = drvman
[21:24:56] Qink name = optman
[21:24:56] Qink name = fldman
[21:24:56] Qink name = gesman
[21:24:57] Qink name = scfman
[21:35:23] Qink name = anlman
[21:35:23] Qink name = drvman
[21:37:38] Qink name = optman
[21:37:38] Qink name = fldman
[21:37:38] Qink name = gesman
[21:37:38] Qink name = scfman
[21:49:38] Qink name = anlman
[21:49:39] Qink name = drvman
[21:51:03] Qink name = optman
[21:51:03] Qink name = fldman
[21:51:03] Qink name = gesman
[21:51:04] Qink name = scfman
[22:02:46] Qink name = anlman
[22:02:46] Qink name = drvman
[22:04:23] Qink name = optman
[22:04:23] Qink name = fldman
[22:04:23] Qink name = gesman
[22:04:23] Qink name = scfman
[22:15:19] Qink name = anlman
[22:15:19] Qink name = drvman
[22:17:40] Qink name = optman
[22:17:40] Qink name = fldman
[22:17:40] Qink name = gesman
[22:17:41] Qink name = scfman
[22:28:49] Qink name = anlman
[22:28:49] Qink name = drvman
[22:30:20] Qink name = optman
[22:30:20] Qink name = fldman
[22:30:20] Qink name = gesman
[22:30:20] Qink name = scfman
[22:39:46] Qink name = anlman
[22:39:46] Qink name = drvman
[22:41:30] Qink name = optman
[22:41:30] Qink name = fldman
[22:41:30] Qink name = gesman
[22:41:30] Qink name = scfman
[22:52:29] Qink name = anlman
[22:52:29] Qink name = drvman
[22:54:03] Qink name = optman
[22:54:03] Qink name = fldman
[22:54:03] Qink name = gesman
[22:54:03] Qink name = scfman
[23:03:32] Qink name = anlman
[23:03:33] Qink name = drvman
[23:06:02] Qink name = optman
[23:06:02] Qink name = fldman
[23:06:02] Qink name = gesman
[23:06:04] Qink name = scfman
[23:12:40] Qink name = anlman
[23:12:40] Qink name = drvman
[23:15:11] Qink name = optman
[23:15:11] Qink name = fldman
[23:15:11] Qink name = gesman
[23:15:13] Qink name = scfman
[23:21:58] Qink name = anlman
[23:21:58] Qink name = drvman
[23:23:37] Qink name = optman
[23:23:37] Qink name = fldman
[23:23:37] Qink name = gesman
[23:23:38] Qink name = scfman
[23:29:33] Qink name = anlman
[23:29:33] Qink name = drvman
[23:31:23] Qink name = optman
[23:31:23] Qink name = anlman
[23:31:54] End of Job
[23:31:58] Finished Job #2
[23:31:58] Starting job 3,CPU time has been restored to 7492.622946.
[23:31:58] Starting new Job
[23:31:58] Qink name = fldman
[23:31:58] Qink name = gesman
[23:31:59] Qink name = scfman
[23:42:33] Qink name = anlman
[23:43:37] End of Job
[23:43:39] Finished Job #3
[23:43:39] Starting job 4,CPU time has been restored to 7905.664154.
[23:43:39] Starting new Job
[23:43:39] Qink name = fldman
[23:43:40] Qink name = gesman
[23:43:40] Qink name = scfman
[23:52:04] Qink name = anlman
[23:53:08] End of Job
[23:53:11] Finished Job #4
[23:53:11] Starting job 5,CPU time has been restored to 8206.906358.
[23:53:11] Starting new Job
[23:53:11] Qink name = fldman
[23:53:11] Qink name = gesman
[23:53:11] Qink name = scfman
[00:00:43] Qink name = anlman
[00:01:13] End of Job
[00:01:15] Finished Job #5
[00:01:15] Starting job 6,CPU time has been restored to 8497.055248.
[00:01:15] Starting new Job
[00:01:15] Qink name = fldman
[00:01:17] Qink name = gesman
[00:01:17] Qink name = scfman
[00:09:26] Qink name = anlman
[00:10:24] End of Job
[00:10:26] Finished Job #6
[00:10:26] Starting job 7,CPU time has been restored to 8799.896209.
[00:10:27] Starting new Job
[00:10:27] Qink name = fldman
[00:10:28] Qink name = gesman
[00:10:28] Qink name = scfman
[00:21:47] Qink name = anlman
[00:22:45] End of Job
[00:22:48] Finished Job #7
[00:22:48] Starting job 8,CPU time has been restored to 9202.710971.
[00:22:49] Starting new Job
[00:22:49] Qink name = fldman
[00:22:50] Qink name = gesman
[00:22:50] Qink name = scfman
[00:31:08] Qink name = anlman
[00:31:23] End of Job
[00:31:26] Finished Job #8
[00:31:26] Starting job 9,CPU time has been restored to 9488.874467.
[00:31:26] Starting new Job
[00:31:26] Qink name = fldman
[00:31:27] Qink name = gesman
[00:31:27] Qink name = scfman
[00:41:49] Qink name = anlman
[00:43:10] End of Job
[00:43:13] Finished Job #9
[00:43:13] Starting job 10,CPU time has been restored to 9869.784559.
[00:43:13] Starting new Job
[00:43:13] Qink name = fldman
[00:43:14] Qink name = gesman
[00:43:14] Qink name = scfman
[01:04:51] Qink name = anlman
[01:06:13] End of Job
[01:06:16] Finished Job #10
[01:06:16] Starting job 11,CPU time has been restored to 10609.091167.
[01:06:16] Starting new Job
[01:06:16] Qink name = fldman
[01:06:16] Qink name = gesman
[01:06:16] Qink name = scfman
[01:15:36] Qink name = anlman
[01:17:03] End of Job
[01:17:06] Finished Job #11
[01:17:06] Starting job 12,CPU time has been restored to 10975.516461.
[01:17:06] Starting new Job
[01:17:07] Qink name = fldman
[01:17:12] Qink name = gesman
[01:17:14] Qink name = scfman
[02:15:12] Qink name = anlman
[02:26:13] End of Job
[02:26:17] Finished Job #12
[02:26:17] Starting job 13,CPU time has been restored to 13356.576484.
[02:26:18] Starting new Job
[02:26:18] Qink name = fldman
[02:26:23] Qink name = gesman
[02:26:24] Qink name = scfman
[04:50:27] Qink name = anlman
[04:59:47] End of Job
[04:59:51] Finished Job #13
[04:59:51] Starting job 14,CPU time has been restored to 19290.353412.
[ERROR] Failed to open either source or destination files while copying A.25.C18H9N3S3Se.25.4.bp86.svp.n.pbe0.tzvp.n.sp.in to ./A.25.C18H9N3S3Se.25.4.bp86.svp.n.pbe0.tzvp.n.sp/A.25.C18H9N3S3Se.25.4.bp86.svp.n.pbe0.tzvp.n.sp.in. Error: 2
Job file not authorized to run
Application exited with RC = 0xe600
[ERROR] Failed to open either source or destination files while copying A.25.C18H9N3S3Se.25.4.bp86.svp.n.pbe0.tzvp.n.sp/stdout.txt to A.25.C18H9N3S3Se.25.4.bp86.svp.n.pbe0.tzvp.n.sp.out. Error: 2
[04:59:53] Finished Job #14
[04:59:53] Starting job 15,CPU time has been restored to 19290.354411.
[ERROR] Failed to open either source or destination files while copying A.25.C18H9N3S3Se.25.4.bp86.svp.n.upbe0.tzvp.n.sp.in to ./A.25.C18H9N3S3Se.25.4.bp86.svp.n.upbe0.tzvp.n.sp/A.25.C18H9N3S3Se.25.4.bp86.svp.n.upbe0.tzvp.n.sp.in. Error: 2
[04:59:53] Skipping Job #15
called boinc_finish
Exiting 0

</stderr_txt>
]]>
----------------------------------------

[Nov 8, 2010 12:12:35 AM]