Thread Status: Active
Total posts in this thread: 12
Posts: 12   Pages: 2   [ 1 2 | Next Page ]
Author
Previous Thread This topic has been viewed 2594 times and has 11 replies Next Thread
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Need clarification on server abort

Got an interesting server abort:

This is "me":

E203133_ 942_ C.27.C20H11N3S2SeSi.00059063.2.set1d06_ 2-- 640 Server Aborted 17/09/11 15:06:37 18/09/11 03:11:58 0.00 0.0 / 0.0

and log from clicking "Server Abort" status:

Result Name: E203133_ 942_ C.27.C20H11N3S2SeSi.00059063.2.set1d06_ 2--

<core_client_version>6.10.58</core_client_version>

This is wingman #1:

E203133_ 942_ C.27.C20H11N3S2SeSi.00059063.2.set1d06_ 1-- 640 Valid 07/09/11 15:14:46 10/09/11 12:09:21 9.15 208.4 / 215.8

and quite normal log from clicking "Valid" status:

Result Name: E203133_ 942_ C.27.C20H11N3S2SeSi.00059063.2.set1d06_ 1--

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[06:27:49] Number of jobs = 16
[06:27:49] Starting job 0,CPU time has been restored to 0.000000.
[06:31:43] Finished Job #0
[06:31:43] Starting job 1,CPU time has been restored to 133.817658.
[06:43:02] Finished Job #1
[06:43:02] Starting job 2,CPU time has been restored to 518.562924.
Quit requested: Exiting
[16:47:05] Number of jobs = 16
[16:47:05] Starting job 2,CPU time has been restored to 518.562924.
Quit requested: Exiting
[17:12:44] Number of jobs = 16
[17:12:44] Starting job 2,CPU time has been restored to 518.562924.
Quit requested: Exiting
[18:18:04] Number of jobs = 16
[18:18:04] Starting job 2,CPU time has been restored to 518.562924.
Quit requested: Exiting
[01:22:24] Number of jobs = 16
[01:22:24] Starting job 2,CPU time has been restored to 518.562924.
Quit requested: Exiting
[02:21:40] Number of jobs = 16
[02:21:40] Starting job 2,CPU time has been restored to 518.562924.
Quit requested: Exiting
[02:36:36] Number of jobs = 16
[02:36:36] Starting job 2,CPU time has been restored to 518.562924.
[06:50:12] Finished Job #2
[06:50:12] Starting job 3,CPU time has been restored to 9136.884969.
[07:02:34] Finished Job #3
[07:02:34] Starting job 4,CPU time has been restored to 9571.878158.
[07:10:56] Finished Job #4
[07:10:56] Starting job 5,CPU time has been restored to 9868.248857.
[07:19:43] Finished Job #5
[07:19:43] Starting job 6,CPU time has been restored to 10176.600434.
[07:28:10] Finished Job #6
[07:28:10] Starting job 7,CPU time has been restored to 10473.283136.
[07:40:20] Finished Job #7
[07:40:20] Starting job 8,CPU time has been restored to 10890.882213.
[07:48:29] Finished Job #8
[07:48:29] Starting job 9,CPU time has been restored to 11178.251655.
[16:16:36] Number of jobs = 16
[16:16:36] Starting job 9,CPU time has been restored to 11178.251655.
Quit requested: Exiting
[16:34:27] Number of jobs = 16
[16:34:27] Starting job 9,CPU time has been restored to 11178.251655.
Quit requested: Exiting
[18:15:48] Number of jobs = 16
[18:15:48] Starting job 9,CPU time has been restored to 11178.251655.
[18:24:42] Finished Job #9
[18:24:42] Starting job 10,CPU time has been restored to 11493.607677.
[18:45:32] Finished Job #10
[18:45:32] Starting job 11,CPU time has been restored to 12230.962003.
[18:56:42] Finished Job #11
[18:56:42] Starting job 12,CPU time has been restored to 12627.422945.
[20:35:47] Finished Job #12
[20:35:47] Starting job 13,CPU time has been restored to 16122.188547.
Quit requested: Exiting
[01:36:26] Number of jobs = 16
[01:36:26] Starting job 13,CPU time has been restored to 16122.188547.
Quit requested: Exiting
[02:24:27] Number of jobs = 16
[02:24:27] Starting job 13,CPU time has been restored to 16122.188547.
[04:45:30] Finished Job #13
[04:45:30] Starting job 14,CPU time has been restored to 21116.576162.
[07:13:11] Finished Job #14
[07:13:11] Starting job 15,CPU time has been restored to 26235.000172.
[10:39:55] Finished Job #15
10:40:04 (6356): called boinc_finish

</stderr_txt>
]]>

and this is wingman #2:

E203133_ 942_ C.27.C20H11N3S2SeSi.00059063.2.set1d06_ 0-- 640 Valid 07/09/11 14:53:04 18/09/11 03:04:45 12.00 209.2 / 202.3

and his log from clicking "Valid":

Result Name: E203133_ 942_ C.27.C20H11N3S2SeSi.00059063.2.set1d06_ 0--

<core_client_version>6.12.33</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[16:27:00] Number of jobs = 16
[16:27:00] Starting job 0,CPU time has been restored to 0.000000.
[16:33:16] Finished Job #0
[16:33:16] Starting job 1,CPU time has been restored to 190.265625.
[16:50:22] Finished Job #1
[16:50:22] Starting job 2,CPU time has been restored to 788.921875.
Quit requested: Exiting
[10:09:31] Number of jobs = 16
[10:09:31] Starting job 2,CPU time has been restored to 788.921875.
Quit requested: Exiting
[19:16:08] Number of jobs = 16
[19:16:08] Starting job 2,CPU time has been restored to 788.921875.
Quit requested: Exiting
[16:27:57] Number of jobs = 16
[16:27:57] Starting job 2,CPU time has been restored to 788.921875.
[15:32:17] Finished Job #2
[15:32:17] Starting job 3,CPU time has been restored to 15258.296875.
[16:38:33] Finished Job #3
[16:38:33] Starting job 4,CPU time has been restored to 15948.062500.
[16:54:58] Finished Job #4
[16:54:58] Starting job 5,CPU time has been restored to 16396.562500.
[00:05:09] Finished Job #5
[00:05:09] Starting job 6,CPU time has been restored to 16867.109375.
[00:20:34] Finished Job #6
[00:20:34] Starting job 7,CPU time has been restored to 17316.250000.
[00:43:23] Finished Job #7
[00:43:23] Starting job 8,CPU time has been restored to 18077.937500.
[01:02:31] Finished Job #8
[01:02:31] Starting job 9,CPU time has been restored to 18496.703125.
Quit requested: Exiting
[09:48:54] Number of jobs = 16
[09:48:54] Starting job 9,CPU time has been restored to 18496.703125.
[10:02:23] Finished Job #9
[10:02:23] Starting job 10,CPU time has been restored to 18959.015625.
[10:41:59] Finished Job #10
[10:41:59] Starting job 11,CPU time has been restored to 20354.250000.
[10:59:14] Finished Job #11
[10:59:14] Starting job 12,CPU time has been restored to 20955.515625.
Quit requested: Exiting
[21:36:20] Number of jobs = 16
[21:36:20] Starting job 12,CPU time has been restored to 20955.515625.
Quit requested: Exiting
[09:45:19] Number of jobs = 16
[09:45:19] Starting job 12,CPU time has been restored to 20955.515625.
Quit requested: Exiting
[10:26:07] Number of jobs = 16
[10:26:07] Starting job 12,CPU time has been restored to 20955.515625.
[16:33:58] Finished Job #12
[16:33:58] Starting job 13,CPU time has been restored to 26117.703125.
[21:59:21] Finished Job #13
[21:59:21] Starting job 14,CPU time has been restored to 34799.968750.
Killing job because cpu time has been exceeded. Subjob start time = 0, Subjob current time = 1088486911
[04:54:38] Finished Job #14
04:54:49 (4792): called boinc_finish

</stderr_txt>
]]>


Eh? That is where I need clarification...wingman #2 shows job kill due to CPU time exceeded, but I get the server abort?

I wouldn't have bothered ya'll again, but when the validation process involves at least two people doing the same thing, and one of the credited wingmen did all 16 jobs but the other one did at most 15 jobs, I personally wonder: Why isn't the validation one-for-one?
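For anyone who wants to check the job counts without reading the logs line by line, here is a minimal Python sketch. It assumes the stderr text has been copied out of the result page as a plain string; the helper names `last_finished_job` and `hit_cpu_cutoff` are made up for this illustration and are not part of BOINC or the WCG site. Only the tail of each wingman's log is reproduced.

```python
import re
from typing import Optional

# Pattern for lines such as "[10:39:55] Finished Job #15".
FINISHED_JOB = re.compile(r"Finished Job #(\d+)")
# Message the science application prints when a subjob exceeds its CPU time.
CUTOFF_MESSAGE = "Killing job because cpu time has been exceeded"


def last_finished_job(stderr_text: str) -> Optional[int]:
    """Return the highest 0-based job index reported as finished, or None."""
    indices = [int(m.group(1)) for m in FINISHED_JOB.finditer(stderr_text)]
    return max(indices) if indices else None


def hit_cpu_cutoff(stderr_text: str) -> bool:
    """Return True if the log contains the CPU-time cut-off message."""
    return CUTOFF_MESSAGE in stderr_text


# Tails of the two logs quoted in this post (earlier lines omitted).
WINGMAN_1_TAIL = """\
[07:13:11] Starting job 15,CPU time has been restored to 26235.000172.
[10:39:55] Finished Job #15
10:40:04 (6356): called boinc_finish
"""

WINGMAN_2_TAIL = """\
[21:59:21] Starting job 14,CPU time has been restored to 34799.968750.
Killing job because cpu time has been exceeded. Subjob start time = 0, Subjob current time = 1088486911
[04:54:38] Finished Job #14
04:54:49 (4792): called boinc_finish
"""

for name, tail in (("wingman #1", WINGMAN_1_TAIL), ("wingman #2", WINGMAN_2_TAIL)):
    print(name, "last finished job:", last_finished_job(tail),
          "| cut off:", hit_cpu_cutoff(tail))
```

Because the job indices are 0-based, a last finished index of 15 means 16 jobs and an index of 14 means 15 jobs, which is the mismatch I am asking about.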

(The points...nice to have, but not something my ego depends upon having before I can stand to look at myself in the mirror while shaving.)
[Sep 18, 2011 5:47:46 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Need clarification on server abort

The server abort happened after the late return by the original wingman. Your computer spent no time computing the assigned task.

Repair/Makeup job for No Reply **:
E203133_ 942_ C.27.C20H11N3S2SeSi.00059063.2.set1d06_ 2-- 640 Server Aborted 17/09/11 15:06:37 18/09/11 03:11:58 0.00 0.0 / 0.0
Timely Returned job
E203133_ 942_ C.27.C20H11N3S2SeSi.00059063.2.set1d06_ 1-- 640 Valid 07/09/11 15:14:46 10/09/11 12:09:21 9.15 208.4 / 215.8
Late Returned Result:
E203133_ 942_ C.27.C20H11N3S2SeSi.00059063.2.set1d06_ 0-- 640 Valid 07/09/11 14:53:04 18/09/11 03:04:45 12.00 209.2 / 202.3

edit: The technical reason for quorum 2 is simply that the tools were not in place to do single validation on this science. That will be implemented in the near future (exhaustively discussed), and even then random wingman checks will be run to ensure that the work is "reliable".

--//--
----------------------------------------
[Edit 2 times, last edit by Former Member at Sep 18, 2011 6:07:25 PM]
[Sep 18, 2011 6:02:31 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Need clarification on server abort

The server abort happened after the late return by the original wingman. Your computer spent no time computing the assigned task.

edit: The technical reason for quorum 2 is simply that the tools were not in place to do single validation on this science. That will be implemented in the near future (exhaustively discussed), and even then random wingman checks will be run to ensure that the work is "reliable".

--//--


You quoted what I don't care about - my computer time/my points. What I do ask for clarification on is "the technical reasons for quorum 2", which in my understanding was having two people do the exact same work as a means of validating each other's work. I.e., for each task, if same # of jobs performed yields identical results, then work is "Valid".

But what I see in the results is (absolutely ignoring me as I am moot):
Wingman #1:
[10:39:55] Finished Job #15
10:40:04 (6356): called boinc_finish
Wingman #2:
[04:54:38] Finished Job #14
04:54:49 (4792): called boinc_finish

I.e., there isn't a one-to-one relationship between the jobs each wingman completed before an apparent judgement was made that "enough" of the task had been accomplished to satisfy quorum requirements. That suggests to me one of two things: either the later job stages in a task don't "always" matter, which is why you can have a quorum without performing identical work, or you are running some other validation method against tasks and the need for a quorum has become arbitrary. The second possibility is raised by the fluidity in the definition of "valid" in the WCG Wiki:
The result was returned to the server and was equal to the majority of results returned for the WU and or successfully passed several other verification tests in case of zero redundancy.

If ya'll are satisfied with the validity of your science, then far be it from me to question it. I'm just...curious, as the parameters that I am aware of seem to be going unsatisfied.

(Edit: I am "assuming" - always a bad thing - that I was sent the server abort because the server decided a quorum had been reached and so running that task was no longer necessary.)
----------------------------------------
[Edit 1 times, last edit by Former Member at Sep 19, 2011 2:52:26 PM]
[Sep 19, 2011 2:37:54 PM]
sk..
Master Cruncher
Joined: Mar 22, 2007
Post Count: 2324
Status: Offline
Re: Need clarification on server abort

Your E203133_ 942_ C.27.C20H11N3S2SeSi.00059063.2.set1d06_ 2 task was a resend.

When E203133_ 942_ C.27.C20H11N3S2SeSi.00059063.2.set1d06_ 1 and E203133_ 942_ C.27.C20H11N3S2SeSi.00059063.2.set1d06_ 0 returned and validated, your task was not needed; quorum was met (2 tasks returned and validated).
As your task had not started it was server aborted, to save unnecessary work.

As for why it validated when only 15 of the 16 jobs completed, I don't know what the limit is. Perhaps the jobs-completed requirement changes with different batches, or there isn't one set up?
It's clear however that it reached the time limit and was stopped early.
Once it validated, it was destined to be server aborted, so long as it hadn't started.
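To make the sequence above concrete, here is a rough Python sketch of the decision as I understand it. This is not the actual WCG or BOINC server code: the `Task` class, the status labels and the `tasks_to_abort` function are all hypothetical, and the quorum of 2 is simply the figure mentioned in this thread.

```python
from dataclasses import dataclass
from typing import List

QUORUM = 2  # assumption: two validated results satisfy the work unit


@dataclass
class Task:
    name: str
    status: str  # hypothetical labels: "valid", "in_progress", "not_started"


def tasks_to_abort(tasks: List[Task], quorum: int = QUORUM) -> List[str]:
    """Return names of copies that can be server aborted.

    Once enough results have validated, any copy that has not started
    is no longer needed. Copies already running are left alone here.
    """
    valid_count = sum(1 for t in tasks if t.status == "valid")
    if valid_count < quorum:
        return []
    return [t.name for t in tasks if t.status == "not_started"]


# The three copies from this thread, identified by their suffix only.
work_unit = [
    Task("set1d06_ 0", "valid"),        # late return that still validated
    Task("set1d06_ 1", "valid"),        # timely return
    Task("set1d06_ 2", "not_started"),  # the resend
]
print(tasks_to_abort(work_unit))
```

With both earlier copies validated, only the unstarted resend is selected, which matches the "Server Aborted" status with 0.00 hours on the _2 task.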
----------------------------------------
[Edit 1 times, last edit by skgiven at Sep 19, 2011 3:06:35 PM]
[Sep 19, 2011 2:46:43 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Need clarification on server abort

Your E203133_ 942_ C.27.C20H11N3S2SeSi.00059063.2.set1d06_ 2 task was a resend.

When E203133_ 942_ C.27.C20H11N3S2SeSi.00059063.2.set1d06_ 1 and E203133_ 942_ C.27.C20H11N3S2SeSi.00059063.2.set1d06_ 0 returned and validated, your task was not needed; quorum was met (2 tasks returned and validated).
As your task had not started it was server aborted, to save unnecessary work.

I see that. Let me rephrase this, then: Why did

E203133_ 942_ C.27.C20H11N3S2SeSi.00059063.2.set1d06_ 0

have 15 jobs while

E203133_ 942_ C.27.C20H11N3S2SeSi.00059063.2.set1d06_ 1

had 16 jobs?

(Edit: Forgot '0' base.)
----------------------------------------
[Edit 1 times, last edit by Former Member at Sep 19, 2011 2:59:26 PM]
[Sep 19, 2011 2:58:13 PM]
sk..
Master Cruncher
Joined: Mar 22, 2007
Post Count: 2324
Status: Offline
Re: Need clarification on server abort

I initially missed your point, and subsequently edited my post above. I vaguely recall a scientist's post saying something along the lines of 2 to 4 valids makes it worthwhile (but I might be confusing this project's system with HCMD2).

Don't have that link, but this link alludes to that being the situation: the early jobs within a task are the most important, and the later ones just help refine the results.
----------------------------------------
[Edit 1 times, last edit by skgiven at Sep 19, 2011 3:25:48 PM]
[Sep 19, 2011 3:12:25 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Need clarification on server abort

Ah...a situation of "We'll send 16 jobs per task, and we will decide at some later date if you really needed to do all 16 jobs."

Interesting.

If that is the case, there must not be a way to broadcast to the BOINC clients of grid participants "Hey, if you're running task 'X', you don't need to do jobs [...], 16 - just give us 0 through 'n'." after that first wingman completes the entire task.

Or whenever and however that minimum jobs required decision is made.
[Sep 19, 2011 3:29:30 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Need clarification on server abort

Don't have that link, but this link alludes to that being the situation: the early jobs within a task are the most important, and the later ones just help refine the results.

Geez...O.T., but I wish I hadn't read your link, even though it does offer further explanation. It contains more examples of badge chasing depriving this particular project of participants. I hate to be gimmicky, but...CEP2 should offer a 5 year badge. :D
[Sep 19, 2011 3:37:36 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Need clarification on server abort

As alluded to, both jobs had sufficient matching data to declare both valid, and the one that did the most work is the one retained... anything beyond X jobs is deep refinement that the scientists 'like' to have, but it's not a must. And then back to the edit of my original post:
edit: The technical reason for quorum 2 is simply that the tools were not in place to do single validation on this science. That will be implemented in the near future (exhaustively discussed), and even then random wingman checks will be run to ensure that the work is "reliable".
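As a rough illustration of "sufficient matching data", here is a Python sketch of how two results of unequal length could be compared. This is only a guess at the shape of the logic, not the real CEP2 validator: the `validate_pair` function, the per-job numeric outputs, the tolerance and the `min_jobs` threshold (the "X" above, whose real value is not stated) are all assumptions.

```python
from typing import Optional, Sequence


def validate_pair(
    a: Sequence[float],
    b: Sequence[float],
    min_jobs: int,
    tol: float = 1e-6,
) -> Optional[Sequence[float]]:
    """Compare two results job by job over the jobs both completed.

    Returns the result to retain (the one with more jobs) if at least
    `min_jobs` jobs overlap and every overlapping job agrees within
    `tol`; returns None if the pair cannot be declared valid.
    """
    overlap = min(len(a), len(b))
    if overlap < min_jobs:
        return None
    # zip stops at the shorter result, so only shared jobs are compared.
    for x, y in zip(a, b):
        if abs(x - y) > tol:
            return None
    return a if len(a) >= len(b) else b


# Made-up per-job outputs: one result with 16 jobs, one cut off after 15.
full_result = [0.5 * i for i in range(16)]
cut_off_result = full_result[:15]

retained = validate_pair(full_result, cut_off_result, min_jobs=8)
print("retained jobs:", len(retained) if retained is not None else None)
```

Under these assumptions both copies count as valid because their shared jobs match, and the 16-job result is the one kept.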


Rephrase: In a while the validation changes will make even a second copy redundant, with the caveat above (random wingman checks), so we're way past the discussion of why there is a quorum of 2 with results of unequal computation length (quorums where one or both hit the 12 hour cut-off, extensively discussed elsewhere). The cut-off will likely go away too, to a certain extent, because all jobs, no matter what, are cut off at 10x estimated flops to stop any runaway processes.
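The 10x rule is simple arithmetic; a small Python sketch follows. The factor of 10 comes from the paragraph above, while the function names and the example estimate are invented for illustration and do not reflect how the server or client actually stores these values.

```python
CUTOFF_FACTOR = 10.0  # from the post: jobs are cut off at 10x estimated flops


def flops_cutoff(estimated_flops: float, factor: float = CUTOFF_FACTOR) -> float:
    """Return the operation count at which a job would be stopped."""
    return factor * estimated_flops


def is_runaway(flops_used: float, estimated_flops: float) -> bool:
    """Return True if a job has used more than the allowed operations."""
    return flops_used > flops_cutoff(estimated_flops)


# Hypothetical estimate of 1e12 floating point operations for one job.
estimate = 1e12
print(flops_cutoff(estimate))      # allowed ceiling for this estimate
print(is_runaway(3e12, estimate))  # 3x the estimate: still within bounds
print(is_runaway(2e13, estimate))  # 20x the estimate: treated as runaway
```

The point is that this ceiling scales with each job's own estimate, unlike a fixed 12 hour wall.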

--//--
[Sep 19, 2011 3:45:26 PM]
sk..
Master Cruncher
Joined: Mar 22, 2007
Post Count: 2324
Status: Offline
Re: Need clarification on server abort

In theory, future BOINC versions might allow for more contact/control from the servers to BOINC clients. Not sure if it will ever be implemented at WCG, due to the size (number of crunchers and projects), especially if CEP2 moves to single quorum.

CEP2 should offer a 5 year badge
Undoubtedly that would work for about 500 crunchers straight away, and perhaps a couple of thousand before the close of the project. Unfortunately some people can find negativity in any positive move, enough to scupper a good plan.
Anyway, I have a DIY 2.8M points badge :P
[Sep 19, 2011 5:00:56 PM]