World Community Grid - View Thread - About the second copy after a "no reply"

World Community Grid Forums

Category: Support

Forum: BOINC Agent Support

Thread: About the second copy after a "no reply"

Thread Status: Active
Total posts in this thread: 7

This topic has been viewed 821 times and has 6 replies

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


About the second copy after a "no reply"

Hello,

I have been meaning to ask this for a while.

When a WU with a validation copy does not receive the additional result in time, that copy becomes "no reply" and another copy is sent to reliable machines, with a turnaround of less than 2 days.

Sometimes, for various reasons, the fast machine is not fast anymore, and a third copy is sent.

Now, I quite often notice (and not only for resent WUs) something like this:

0000108470112200904091510_ 2-- - In Progress 5/31/11 14:00:36 6/3/11 09:12:36 0.00 0.0 / 0.0
X0000108470112200904091510_ 0-- 642 Error 5/24/11 13:59:42 5/31/11 13:59:35 0.00 0.0 / 0.0
X0000108470112200904091510_ 1-- 642 Pending Validation 5/24/11 13:59:25 5/25/11 16:17:07 1.96 34.5 / 0.0

As you can see, the WU errored out only 7 seconds before the deadline! It cannot be a coincidence!
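The 7-second gap can be checked from the timestamps in the Error row. This is only a minimal sketch: it assumes the deadline is exactly 7 days after the sent time, and that the second timestamp on that row is the moment the result came back. The variable names are just for illustration.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the rows above (month/day/2-digit year).
FMT = "%m/%d/%y %H:%M:%S"

sent = datetime.strptime("5/24/11 13:59:42", FMT)      # sent time of copy _0
returned = datetime.strptime("5/31/11 13:59:35", FMT)  # when the error came back

# Assumption: the deadline is the sent time plus 7 days.
deadline = sent + timedelta(days=7)

# Time left on the clock when the error was reported.
margin = deadline - returned

print("deadline:", deadline.strftime(FMT))
print("seconds before deadline:", margin.total_seconds())
```

Under that assumption, the error was reported with the deadline only seconds away, which fits the observation above.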

My question is: is this happening because BOINC is freaking out at having all these WUs about to expire, and then does something crazy that results in errors being returned?

I think it has something to do with WUs that are about to expire without ever having started crunching, and the way BOINC manages them.

Any thoughts?

Thanks!

[May 31, 2011 7:22:11 PM]

gb009761
Master Cruncher
Scotland
Joined: Apr 6, 2005
Post Count: 3010
Status: Offline

Re: About the second copy after a "no reply"

My thoughts would be that BOINC is "freaking out" at having all these WUs to crunch and, instead of attempting to run them (when it's obvious to BOINC that they won't complete in time), it causes them to error out.

What we've got to remember is that HCC is the ONLY project (other than DDDT2) with a 7-day deadline, as opposed to the 10-day one all the other WCG projects get.

[May 31, 2011 7:28:23 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: About the second copy after a "no reply"

latakia,

What client version is listed in the Result log of this task:

X0000108470112200904091510_ 0-- 642 Error 5/24/11 13:59:42 5/31/11 13:59:35 0.00 0.0 / 0.0

Maybe if you click the error link and post a copy of the Result log, we can make an educated guess.

Some versions of the client are designed to abort a task on reaching its deadline, as in "why waste time on this", except that WCG maybe does not know this state yet, so it is marked "error".

We have:

- User Aborted
- Server Aborted
- Aborted (which, as a filter on the Result Status page, shows both of the above)

but not yet

Client/Agent Aborted.

But, as said, if you could first copy and paste the result log of the errored task into a reply, we might be able to learn more.

--//--

PS: I've noticed something similar, by the way: an error where the log lists only the task name and client version and is otherwise empty, which is a sign of an aborted task.

[May 31, 2011 7:42:36 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: About the second copy after a "no reply"

Sekerob, it is exactly like you described in your last line.

So these are all signs of workunits aborted because there was no time to finish them.

Thanks for the explanation.

[Jun 1, 2011 2:01:13 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: About the second copy after a "no reply"

Oh, I just noticed something!

One of my machines went from reliable to not reliable anymore, and 4 results with a short deadline didn't make it in time...

So, in the status you see "error", BUT if you filter for "aborted" they show up!

So basically they look like errors at first sight, but in reality they are not: BOINC aborts them, and, as Sekerob was saying, there is no status defined as "BOINC aborted"...

Good to know!

edit: a g was missing...

----------------------------------------
[Edit 1 times, last edit by Former Member at Jun 1, 2011 3:52:32 AM]

[Jun 1, 2011 3:51:25 AM]

deltavee
Ace Cruncher
Texas Hill Country
Joined: Nov 17, 2004
Post Count: 4894
Status: Offline

Re: About the second copy after a "no reply"

WU completing after almost 15 days! Barely under the wire of wingman no. 3.

E202394_ 100_ C.25.C20H13N3SSi.00508265.2.set1d06_ 3-- 640 Valid 6/26/11 12:16:05 6/27/11 17:02:36 6.17 131.2 / 189.4
E202394_ 100_ C.25.C20H13N3SSi.00508265.2.set1d06_ 2-- - No Reply 6/22/11 12:06:07 6/26/11 12:06:07 0.00 0.0 / 0.0
E202394_ 100_ C.25.C20H13N3SSi.00508265.2.set1d06_ 0-- 640 Valid 6/12/11 12:16:16 6/12/11 22:57:33 10.28 224.7 / 189.4 <--Me
E202394_ 100_ C.25.C20H13N3SSi.00508265.2.set1d06_ 1-- 640 Valid 6/12/11 12:02:07 6/27/11 03:38:47 7.64 154.0 / 189.4 <--15 days!
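For anyone who wants to check the arithmetic, here is a small sketch that works out the turnaround of the three returned copies from the rows above. It assumes the two timestamps on each row are the sent time and the return time; the short `_0`, `_1`, `_3` labels and the `turnaround` helper are just for illustration.

```python
from datetime import datetime

# Timestamp layout used in the rows above (month/day/2-digit year).
FMT = "%m/%d/%y %H:%M:%S"

# (copy label, sent time, returned time), copied from the rows above.
rows = [
    ("_0", "6/12/11 12:16:16", "6/12/11 22:57:33"),
    ("_1", "6/12/11 12:02:07", "6/27/11 03:38:47"),
    ("_3", "6/26/11 12:16:05", "6/27/11 17:02:36"),
]

def turnaround(sent, returned):
    """Elapsed time between sending a copy and getting its result back."""
    return datetime.strptime(returned, FMT) - datetime.strptime(sent, FMT)

times = {label: turnaround(sent, returned) for label, sent, returned in rows}
for label, elapsed in times.items():
    print(label, elapsed)

# How long before copy _3 reported did the late copy _1 come in?
gap = datetime.strptime(rows[2][2], FMT) - datetime.strptime(rows[1][2], FMT)
print("_1 reported ahead of _3 by", gap)
```

On these numbers copy _1 took a little over 14 days and 15 hours, and it came in roughly 13 hours before copy _3 reported.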

[Jun 28, 2011 2:59:35 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: About the second copy after a "no reply"

Without knowing the client version it's a guess, but the 15-day "original" client probably did not talk to the servers for a long time, and when it did, it had already started the task [the user could have suspended it 2 weeks ago after running it for a little while]. In that case tasks are not aborted. As long as the task is on the RS pages the "grace" period continues... technically "too late" though. Regrettably the client with _3 also did not talk to the servers prior to starting, else it would likely have been "server aborted".

Clients are designed to report late tasks immediately. A change coming is that the project can set a flag so that clients will report a task immediately upon completion. It could be one to employ for CEP2 only and/or for "No Reply" repair tasks. Something knreed might want to ponder (probably has already :O).

--//--

[Jun 28, 2011 3:17:29 PM]
