mreuter80
Advanced Cruncher
Joined: Oct 2, 2006
Post Count: 83
Status: Offline
something's wrong with validation on E200925_880_A.26.C20H11NS3Se2.26.1.set1d06 [resolved]

Hey team,
I saw the following on one of my WUs (the first row, _3--, is from my computer):
E200925_880_A.26.C20H11NS3Se2.26.1.set1d06_3--   635   Valid      1/17/11 00:23:19   1/17/11 12:04:02   8.03   229.7 / 120.5
E200925_880_A.26.C20H11NS3Se2.26.1.set1d06_2--   -     No Reply   1/12/11 23:59:23   1/16/11 23:59:23   0.00   0.0 / 0.0
E200925_880_A.26.C20H11NS3Se2.26.1.set1d06_1--   635   Valid      1/3/11 00:03:21    1/3/11 01:52:39    0.07   1.4 / 15.1
E200925_880_A.26.C20H11NS3Se2.26.1.set1d06_0--   -     No Reply   1/2/11 23:46:27    1/12/11 23:46:27   0.00   0.0 / 0.0

Looking at my wingman's runtime, something seems to be wrong. After checking his status information, I found out that only job 0 finished and job 1 exited with an error. All other jobs were skipped.
I know that a WU doesn't need to finish the entire set of 16 jobs to be valid. However, with such a discrepancy (especially when there is an error in the WU), I think something isn't working properly. Also, if only job 0 will be used for analysis, I think I have wasted 8 hours of precious energy. :'(
Why do we actually have a quorum of 2 if it allows such differences?
Maybe this was just a freak accident, but maybe someone can have a look at it.
Cheers
----------------------------------------
[Edit 1 times, last edit by mreuter80 at Jan 17, 2011 6:22:57 PM]
[Jan 17, 2011 1:39:21 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: something's wrong with validation on E200925_880_A.26.C20H11NS3Se2.26.1.set1d06

Hello mreuter80,
"Why do we actually have a quorum of 2 if it allows such differences?"

Only the project scientists know exactly what the current validation criteria are, and why. But apparently, if a computer has a record of returning valid units, and another computer validates at least one job from that computer, they are willing to accept all the jobs that computer returned. So they appear to demand at least a little validation (to avoid excessive gullibility), but they are trying not to waste CPU time.

:D Doesn't bother me at all!

Lawrence
[Jan 17, 2011 2:20:05 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: something's wrong with validation on E200925_880_A.26.C20H11NS3Se2.26.1.set1d06

Dear mreuter80,
The validation is performed on a job that was finished by both wingmen, in this case job 0. If validation at that point is positive, it is unlikely that the rest of the results differ, so we are satisfied with that.
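
As a rough illustration of the rule described above (not the project's actual validator, which is not shown in this thread), the check can be sketched as comparing only the jobs that both wingmen finished. The function name, the dictionary layout, the numeric values, and the tolerance are all hypothetical.

```python
import math

def validate_on_common_jobs(result_a, result_b, rel_tol=1e-6):
    """Compare two results on the jobs both of them finished.

    Each result is a hypothetical dict mapping job index -> numeric output.
    Returns True if at least one job is common and all common jobs agree.
    """
    common = sorted(set(result_a) & set(result_b))
    if not common:
        return False  # nothing to compare, so no validation is possible
    return all(
        math.isclose(result_a[j], result_b[j], rel_tol=rel_tol)
        for j in common
    )

# Made-up numbers: one wingman finished all 16 jobs, the other only job 0.
full_result = {j: -1234.5 - j for j in range(16)}
short_result = {0: -1234.5}
print(validate_on_common_jobs(full_result, short_result))
```

In this sketch the short result is enough to validate the pair, because job 0 is present in both and the values agree.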

You are right that it would be more efficient to not run the validation copy all the way to conclusion, since the results are not used anyways. We had a detailed discussion on this issue with the IBM team but it seems that it's technically not feasible at this point. So we have to live with it.

Best wishes from
Your Harvard CEP team
[Jan 17, 2011 3:53:27 PM]
mreuter80
Advanced Cruncher
Joined: Oct 2, 2006
Post Count: 83
Status: Offline
Re: something's wrong with validation on E200925_880_A.26.C20H11NS3Se2.26.1.set1d06

Thanks for the quick answers. I'm still not clear on one thing. Is the research team then using only the result from job 0, or the results from all 16 jobs? Or will another WU be sent out that contains only jobs 1 - 15?
[Jan 17, 2011 4:11:36 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: something's wrong with validation on E200925_880_A.26.C20H11NS3Se2.26.1.set1d06

[snip]

"You are right that it would be more efficient to not run the validation copy all the way to conclusion, since the results are not used anyways. We had a detailed discussion on this issue with the IBM team but it seems that it's technically not feasible at this point. So we have to live with it."

I presume the "validation" copy meant here is the one that ran for 0.07 hours. I'm not sure how significant the loss really is if the task just cycles to the end within a minute or so once an end condition (for instance RC=x100) or a kill condition is encountered.

thx.
[Jan 17, 2011 4:20:26 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: something's wrong with validation on E200925_880_A.26.C20H11NS3Se2.26.1.set1d06

Hi mreuter80,
:) Sorry. I should have stated it explicitly. All the jobs returned, up to job 15 (the 16th), are accepted if a wingman validates at least the first job (job 0).
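
Put differently, and purely as an illustrative sketch (the function and argument names are assumptions, not WCG or BOINC code): once job 0 is confirmed by a wingman, every job in the longer result counts.

```python
def accepted_jobs(returned_jobs, wingman_validated_jobs):
    """Return the jobs that are accepted from one result.

    returned_jobs: job indices this result contains (hypothetical input).
    wingman_validated_jobs: job indices a wingman has confirmed.
    Rule as stated here: everything returned is accepted as long as
    the wingman validated at least job 0.
    """
    if 0 in wingman_validated_jobs and 0 in returned_jobs:
        return sorted(returned_jobs)
    return []

print(accepted_jobs(range(16), {0}))    # all of jobs 0..15
print(accepted_jobs(range(16), set()))  # nothing validated, nothing accepted
```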

Lawrence
[Jan 17, 2011 4:33:07 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: something's wrong with validation on E200925_880_A.26.C20H11NS3Se2.26.1.set1d06

Hi SekeRob et al.,
These special cases are just not easy to implement in BOINC/WCG.

But there is some good news: this thread has triggered a new idea on how to do the validation, which would basically eliminate the waste of the current setup. 8-) It may take a month or two to implement, but it is high up on the agenda. We'll keep you posted on how this is going.

Best wishes from
Your Harvard CEP team
[Jan 17, 2011 5:57:22 PM]
mreuter80
Advanced Cruncher
Joined: Oct 2, 2006
Post Count: 83
Status: Offline
Re: something's wrong with validation on E200925_880_A.26.C20H11NS3Se2.26.1.set1d06

Thanks all again for the quick answers.
I'm glad my resources were not wasted and my stupid question eventually triggered a new idea. :D
Please keep us posted about the "new idea".
I will close the thread as solved.
Cheers!
[Jan 17, 2011 6:22:27 PM]