Total posts in this thread: 7
This topic has been viewed 821 times and has 6 replies.
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
About the second copy after a "no reply"

Hello,

I have been meaning to ask this for a while.

When a WU that needs a validation copy does not receive the additional result in time, that result becomes "No Reply" and another copy is sent to a reliable machine, with a turnaround of under 2 days.

Sometimes, for various reasons, the fast machine is not fast anymore, and a third copy is sent.

Now, I quite often notice something like this (and not only for resent WUs):

0000108470112200904091510_ 2-- - In Progress 5/31/11 14:00:36 6/3/11 09:12:36 0.00 0.0 / 0.0
X0000108470112200904091510_ 0-- 642 Error 5/24/11 13:59:42 5/31/11 13:59:35 0.00 0.0 / 0.0
X0000108470112200904091510_ 1-- 642 Pending Validation 5/24/11 13:59:25 5/25/11 16:17:07 1.96 34.5 / 0.0


As you can see, the WU errored out only 7 seconds before the deadline! That cannot be a coincidence!
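
The 7-second gap can be checked with a little date arithmetic. This is only a minimal sketch: it assumes a 7-day deadline for this project and that the two timestamps on the errored row are the sent time and the return time. The variable names are made up for illustration.

```python
from datetime import datetime, timedelta

FMT = "%m/%d/%y %H:%M:%S"      # timestamp layout used in the rows above
DEADLINE = timedelta(days=7)   # assumption: 7-day deadline for this project

# Timestamps copied from the row with status "Error"
sent = datetime.strptime("5/24/11 13:59:42", FMT)
returned = datetime.strptime("5/31/11 13:59:35", FMT)

due = sent + DEADLINE                  # when the result was due
margin = due - returned                # how long before the deadline it came back
margin_seconds = margin.total_seconds()

print(f"due {due}, returned {margin_seconds:.0f} s before the deadline")
```

Under those assumptions the result came back 7 seconds before it was due, which matches the observation above.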

My question is: is this happening because BOINC is freaking out over all these WUs about to expire, and then does something crazy that ends in errors?

I think it has something to do with WUs that are about to expire and never started crunching, and the way BOINC manages them.

Any thoughts?

Thanks!
[May 31, 2011 7:22:11 PM]
gb009761
Master Cruncher
Scotland
Joined: Apr 6, 2005
Post Count: 3010
Status: Offline
Re: About the second copy after a "no reply"

My thoughts would be that BOINC is "freaking out" at having all these WUs to crunch and, instead of attempting to run them (when it's obvious to BOINC that they won't complete in time), it causes them to error out.

What we've got to remember is that HCC is the ONLY project (other than DDDT2) with a 7-day deadline, as opposed to the 10-day one all the other WCG projects get.
[May 31, 2011 7:28:23 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: About the second copy after a "no reply"

latakia,

What client version is listed in the Result log of this task:

X0000108470112200904091510_ 0-- 642 Error 5/24/11 13:59:42 5/31/11 13:59:35 0.00 0.0 / 0.0

Maybe if you click the error link and post a copy of the Result log we can second guess.

Some client versions are designed to abort a task on reaching its deadline, as in ''why waste time on this'', except WCG maybe does not know this state yet, so it is marked ''error''.

We have:

- User Aborted
- Server Aborted
- Aborted (a filter on the Result Status page that shows both)

but not yet

Client/Agent Aborted.

But, as said, if you could first copy / paste the Result log of the errored task into a reply, we might be able to learn more.

--//--

PS: I've noticed something similar, by the way: an error where the Result log lists only the task name and client version and is otherwise empty, which is a sign of an aborted task.
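
The rule of thumb in this post can be written down as a small check. This is only a sketch of a guess, not how WCG classifies results: `ResultRow` and `maybe_client_aborted` are hypothetical names, and the fields are just the status, CPU time and claimed credit columns as they appear in the rows quoted in this thread.

```python
from dataclasses import dataclass


@dataclass
class ResultRow:
    """One row as copied from the results page (hypothetical structure)."""
    status: str       # status text, e.g. "Error"
    cpu_hours: float  # CPU time column
    claimed: float    # claimed credit column


def maybe_client_aborted(row: ResultRow) -> bool:
    """Heuristic only: an 'Error' with no CPU time and no claimed
    credit probably never ran, so it may be a client-side abort."""
    return row.status == "Error" and row.cpu_hours == 0.0 and row.claimed == 0.0


# Values taken from the two returned rows quoted in the first post
errored = ResultRow("Error", 0.00, 0.0)
pending = ResultRow("Pending Validation", 1.96, 34.5)

print(maybe_client_aborted(errored), maybe_client_aborted(pending))
```

The Result log is still the better evidence; a zero CPU time can also come from a task that failed right at startup.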
[May 31, 2011 7:42:36 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: About the second copy after a "no reply"

Sekerob, it is exactly like you described in your last line.

So these are all signs of workunits aborted because there was no time to finish them.

Thanks for the explanation.
[Jun 1, 2011 2:01:13 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: About the second copy after a "no reply"

Oh, I just noticed something!

One of my machines went from reliable to not reliable anymore, and 4 results with a short deadline didn't make it in time...

So, in the status you see "error" BUT if you filter for "aborted" they come out!

So basically they look like errors at first sight, but in reality they are not: BOINC aborts them, and, as Sekerob said, there is no status defined as "BOINC aborted"...

Good to know!

edit: a g was missing...
----------------------------------------
[Edit 1 times, last edit by Former Member at Jun 1, 2011 3:52:32 AM]
[Jun 1, 2011 3:51:25 AM]
deltavee
Ace Cruncher
Texas Hill Country
Joined: Nov 17, 2004
Post Count: 4894
Status: Offline
Re: About the second copy after a "no reply"

A WU completed after almost 15 days! It barely got in under the wire, ahead of wingman no. 3.

E202394_ 100_ C.25.C20H13N3SSi.00508265.2.set1d06_ 3-- 640 Valid 6/26/11 12:16:05 6/27/11 17:02:36 6.17 131.2 / 189.4
E202394_ 100_ C.25.C20H13N3SSi.00508265.2.set1d06_ 2-- - No Reply 6/22/11 12:06:07 6/26/11 12:06:07 0.00 0.0 / 0.0
E202394_ 100_ C.25.C20H13N3SSi.00508265.2.set1d06_ 0-- 640 Valid 6/12/11 12:16:16 6/12/11 22:57:33 10.28 224.7 / 189.4 <--Me
E202394_ 100_ C.25.C20H13N3SSi.00508265.2.set1d06_ 1-- 640 Valid 6/12/11 12:02:07 6/27/11 03:38:47 7.64 154.0 / 189.4 <--15 days!
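
The turnaround of each returned copy can be worked out from the rows above. A minimal sketch, assuming the two timestamps on each row are the sent time and the return time; the `_2` copy is left out because it was never returned. The names `copies`, `turnaround` and `head_start` are made up for illustration.

```python
from datetime import datetime

FMT = "%m/%d/%y %H:%M:%S"


def parse(ts: str) -> datetime:
    """Parse a timestamp in the layout used on the results page."""
    return datetime.strptime(ts, FMT)


# (sent, returned) pairs copied from the rows above
copies = {
    "_0": ("6/12/11 12:16:16", "6/12/11 22:57:33"),
    "_1": ("6/12/11 12:02:07", "6/27/11 03:38:47"),
    "_3": ("6/26/11 12:16:05", "6/27/11 17:02:36"),
}

# Time between sending and return, per copy
turnaround = {name: parse(ret) - parse(sent) for name, (sent, ret) in copies.items()}

# How long before the _3 repair copy the late _1 copy was returned
head_start = parse(copies["_3"][1]) - parse(copies["_1"][1])

for name, delta in turnaround.items():
    print(name, delta)
print("_1 returned", head_start, "before _3")
```

With these timestamps the `_1` copy took 14 days, 15:36:40 and came back 13:23:49 before the `_3` copy did.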
[Jun 28, 2011 2:59:35 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: About the second copy after a "no reply"

Without knowing the client version it's a guess, but the 15-day "original" client probably had not talked to the servers for a long time, and when it did, it had already started the task [the user could have suspended it 2 weeks ago after running it for a little while]. Tasks in that state are not aborted. As long as the task is on the RS pages the ''grace'' period continues... technically ''too late'' though. Regrettably, the client with _3 also did not talk to the servers prior to starting, else it would likely have been ''server aborted''.

Clients are designed to report late tasks immediately. A change is coming whereby the project can set a flag so that clients report a task immediately upon completion. It could be one to employ for CEP2 only and/or for ''No Reply'' repair tasks. Something knreed might want to ponder (probably has already :O).

--//--
[Jun 28, 2011 3:17:29 PM]