World Community Grid - View Thread

World Community Grid Forums

Category: Support

Forum: BOINC Agent Support

Thread: Late task handling

Thread Status: Active
Total posts in this thread: 20

This topic has been viewed 2473 times and has 19 replies

OldChap
Veteran Cruncher
UK
Joined: Jun 5, 2009
Post Count: 978
Status: Offline

Late task handling

I had a look through the FAQs and found what happens to an already late task.

Now my question is how are tasks that are on the cusp handled?

By this I mean: if the runtime for a work unit is 1 hour and its deadline is in 45 minutes, does BOINC treat it as a WU that will be late and abort it, or will it allow the machine to crunch it? If it is crunched, does it receive credit/runtime?

I am wondering if I will need to resolve this manually.

Finally, which clock does the project use to calculate this? If it is the rig's clock, I can fix this by moving the time by the duration of a work unit, so that units are aborted before getting any CPU time if need be.

Thanks
OC

----------------------------------------

[Sep 24, 2012 2:00:59 AM]

a_mobile_humanist
Cruncher
Joined: May 20, 2011
Post Count: 34
Status: Offline

Re: Late task handling

My understanding is that if BOINC determines that a workunit might miss its deadline, it will automatically set that workunit to "high priority" and run it more or less exclusively, including pausing other workunits, in order to make the best effort to return on time. If the deadline is still missed, a new copy is sent out to someone else, but your workunit is not automatically abandoned. If you return the workunit before the extra copy returns, you will receive credit. If the extra copy returns first, your copy is marked "Too Late" and receives no credit.

All in all, it's better not to intervene as BOINC will try to make the best effort automatically, including making adjustments to workunit/process priorities, both in the BOINC scheduler and on your actual machine. I've had this happen before, and haven't missed a deadline yet.
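As a rough illustration of the check described above, here is a minimal sketch in Python. It is not the actual BOINC client logic (the real client works from its own runtime estimates and scheduling simulation); the function names `will_miss_deadline` and `client_action` are made up for this example, and the numbers are the ones from the opening post.

```python
from datetime import datetime, timedelta


def will_miss_deadline(now, deadline, remaining_runtime):
    """True if the task cannot finish by its deadline even when started right now."""
    return now + remaining_runtime > deadline


def client_action(now, deadline, remaining_runtime):
    """Per the description above: an at-risk task is prioritised, not aborted."""
    if will_miss_deadline(now, deadline, remaining_runtime):
        return "run at high priority"
    return "run in normal order"


# OldChap's example: 1 hour of runtime left, deadline in 45 minutes.
now = datetime(2012, 9, 24, 2, 0)
print(client_action(now, now + timedelta(minutes=45), timedelta(hours=1)))
```

The point of the sketch is only that a task on the cusp gets pushed to the front rather than dropped; whether it then earns credit depends on when it is returned.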

See also: https://secure.worldcommunitygrid.org/forums/wcg/viewthread?thread=17160

Some extra info on the work scheduler (with a mess 'o links) from the Unofficial BOINC Wiki: http://www.boinc-wiki.info/Work_Scheduler

Apparently, the exact manner in which late workunits are handled can also vary from project to project (~~so also, in the case of an umbrella project like WCG, subproject to subproject?~~ See World Community Grid forum link above).

Disclaimer: All of the above is subject to change depending on someone else knowing better than me.

----------------------------------------
[Edit 5 times, last edit by a_mobile_humanist at Sep 24, 2012 2:44:47 AM]

[Sep 24, 2012 2:11:47 AM]

wplachy
Senior Cruncher
Joined: Sep 4, 2007
Post Count: 423
Status: Offline


Re: Late task handling

The server is the “master” time keeper. Once it determines a WU is late it sends out a repair WU and the next time the late host connects it is sent an “order” to abort the WU if it is not in progress.
Be very careful adjusting the client system time as there is a very real risk you will lose all the cached WUs.

----------------------------------------

Bill P

[Sep 24, 2012 3:05:49 AM]

mikey
Veteran Cruncher
Joined: May 10, 2009
Post Count: 826
Status: Offline

Re: Late task handling

All this being said, it leads to the question of whether it would be wise to just adjust your cache level downward by a day and see if that puts an end to your being late, or almost late.

----------------------------------------

[Sep 24, 2012 2:15:02 PM]

mikey
Veteran Cruncher
Joined: May 10, 2009
Post Count: 826
Status: Offline

Re: Late task handling

On most BOINC projects this "order" to abort the unit is only sent AFTER the copy sent to the other computer has been returned ahead of your original unit. So there is a window in which you can still finish the unit, but MOST projects now send the 'resends' to PCs that normally return units in less than 24 hours, so it is VERY time sensitive. It is best to decrease the size of the workunit cache and abort any units that are so close to the deadline that you don't think you will finish them in time. As stated, if you go ahead and attempt to finish the unit, you could be crunching along and find it aborted mid-task!
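The race between a late original and its resend, as described in this thread, can be sketched like this. It is only a model of what the posters report, not confirmed server behaviour; `late_task_outcome` is a hypothetical name, and times are given as hours after the missed deadline.

```python
from typing import Optional


def late_task_outcome(my_return_h: float, resend_return_h: Optional[float]) -> str:
    """Outcome for a result returned after its deadline.

    my_return_h: hours after the deadline at which the original host reports.
    resend_return_h: hours after the deadline at which the resent copy is
        reported, or None if it has not come back yet.
    """
    if resend_return_h is None or my_return_h < resend_return_h:
        return "credit"    # the original beat the resend
    return "too late"      # the resend came back first


print(late_task_outcome(3.0, None))   # resend still out
print(late_task_outcome(3.0, 20.0))   # original returned first
print(late_task_outcome(30.0, 20.0))  # resend returned first
```

With resends going to hosts that usually return within 24 hours, the window in the middle case is short, which is why trimming the cache is the safer fix.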

----------------------------------------

[Sep 24, 2012 2:20:19 PM]

OldChap
Veteran Cruncher
UK
Joined: Jun 5, 2009
Post Count: 978
Status: Offline

Re: Late task handling

Thanks for your input gents and some good links.

The reason I find myself needing this information is that some of the guys on the team need every last minute of runtime to get their water badges.

Some have big caches downloaded all at once, so there are many pages of work all due to finish in a 3 hour window some days hence. Quick calculations suggest that a couple of pages at least will be late.

The cache size at the time of download was 7 days, but due to the difference between CPU time and real time when running these (and other work carried out on the rig), the amount downloaded was a little high.
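A quick calculation of this kind might look like the sketch below. All of the numbers (thread count, task length, efficiency, queue size) are invented for illustration, and `estimate_late_tasks` is a hypothetical helper, not part of BOINC. It assumes every queued task shares one deadline and that each thread runs one task at a time.

```python
import math


def estimate_late_tasks(queued, threads, cpu_hours_per_task,
                        cpu_efficiency, hours_to_deadline):
    """Estimate how many queued tasks cannot finish before a shared deadline."""
    # Wall-clock time per task is longer than CPU time when the rig does other work.
    wall_hours_per_task = cpu_hours_per_task / cpu_efficiency
    tasks_per_thread = math.floor(hours_to_deadline / wall_hours_per_task)
    return max(0, queued - threads * tasks_per_thread)


# Illustrative values only: 300 tasks, 8 threads, 4.5 CPU hours per task,
# 90% CPU efficiency, 7 days (168 hours) until the deadline.
print(estimate_late_tasks(300, 8, 4.5, 0.9, 168))
```

At 100% efficiency the same queue would fit, which is the effect described above: the cache was sized on CPU time, but deadlines run on real time.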

I guess that it is best to let it run, because those WUs already started will most definitely complete before any re-issued task, provided that "switch between applications every x minutes" is greater than 2 * task duration.

----------------------------------------

[Sep 24, 2012 5:39:52 PM]

Movieman
Veteran Cruncher
Joined: Sep 9, 2006
Post Count: 1042
Status: Offline


Re: Late task handling

Hi all,
Just this past weekend I saw BOINC Manager do some really screwy things.
On a dual E5-2687W with 32 threads I had a 3 day work load, also set to contact at 0.1 day intervals.
When I was approaching 3 days left and seeing what I thought to be more work than I could do in that timeframe, I set BOINC to "no new work" and thought I was covered, BUT BOINC makes the decision to do the work out of order in terms of report deadlines, and then yesterday I see maybe 80 WU have disappeared. There is a flaw here and it's twofold: the doing of the work out of deadline order, and then the wildcard of BOINC seeing your machine as a "trusted machine", for lack of a better word, and flooding you with high priority WU, which then suspends to memory those already being worked on. That has the effect of not only messing with what you'd already set up but also putting a HUGE load on the memory in the system. This machine has 16 gig of DDR3-1333 ECC REG and I have seen the memory load at over 15 gig because of this flaw. My suggestion is no more than one high priority WU to any single machine, and set BOINC to only do WU based on the time to deadline.
Thanks for reading.

----------------------------------------

[Sep 24, 2012 11:54:29 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Late task handling

Unfortunately, a limit of 1 repair assignment per host would cause congestion on the feeder side [repair jobs have first priority on distribution (to reliable devices) and would hold up any backfilling of regular due-date jobs, meaning that devices that are not 100% reliable would not get work until all the repair work is out of the queue]. That the ~80 short deadline tasks were suddenly gone could be for several reasons, such as:

A) because the late reporters of completed work still reported and yours got a server abort instruction, being redundant
B) the client decided that it was over-committed [less likely, and I actually can't remember ever seeing this happen to my hosts].
C) bug
D) don't know (Ingleside could).

The message log would print the reason.

As yet, I've found that being cached under 2 days has not left the device dry once, even through the various troubles we had in past months. With the latest test client and 1.5 days cache [meaning if long jobs come along, the cache inflates to over 2 days or more] this gives me frequent prioritization of repairs, but not too much. Any short deadline job that approaches 2.5 days due gets put ahead, but that's as they complete i.e. not all at the same time, well fitting in the 7GB RAM allowed to BOINC.

A 1 day connect interval actually amplifies the panic state [though uploading of result files is continuous if an internet connection is set active]. Suppose a task is due in 1.5 days and there's a 2 day buffer ahead of it in FIFO order. The client also considers that it won't reconnect until another 24 hours have passed to clear the Ready to Report and fetch new work [6.10 client], so it reckons that the task due in 1.5 days will only make it if put ahead. Earliest Deadline First kicks in.
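The reasoning in that example can be written down as a small sketch. This is a simplification based only on the description above, not the client's real scheduling code; `needs_priority` is a hypothetical name, and all values are in days.

```python
def needs_priority(due_days, work_ahead_days, connect_interval_days):
    """True if a task would miss its effective deadline when run in FIFO order.

    The connect interval is subtracted from the time left, because the client
    assumes it may not get to report the result until the next connection.
    """
    effective_deadline = due_days - connect_interval_days
    return work_ahead_days > effective_deadline


# The case from the post: due in 1.5 days, 2 days of work ahead, 1 day connect interval.
print(needs_priority(1.5, 2.0, 1.0))

# A task due in 2.5 days is fine with no connect interval...
print(needs_priority(2.5, 2.0, 0.0))
# ...but the same task is flagged once a 1 day connect interval is set.
print(needs_priority(2.5, 2.0, 1.0))
```

The last two calls show the amplification: the only thing that changed is the connect interval, and the task moved from normal order to being put ahead.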

The scheduler is not perfect, it is still being tweaked, and the reality is that it tries to cater for an infinite number of project combinations and scenarios. Even with the said latest test client [7.0.36], I see head-scratchers, and that with a 1.5 day buffer setting. I just let it ride. A buffer setting of 3-4 days and up, though, is a guaranteed panic state set-up for a reliable device [really, if running a continuous 3 days or more, there should not be any repairs coming, as the criterion is to regularly return in under 2 days]. Anyway, there's a web option in the unscheduled ToDo list to disable repairs in the web device profile, which would mean that, apart from Beta, all tasks have at least a 7 day deadline. Can't see how any task on a 3 day cache could miss the deadline on that.

Loose bits of info.

[Sep 25, 2012 8:39:38 AM]

Movieman
Veteran Cruncher
Joined: Sep 9, 2006
Post Count: 1042
Status: Offline


Re: Late task handling

Thanks Sek and as always you have a MUCH better understanding of the mechanics of BOINC than I do.
I have sort of what you'd call a "black and white" perspective on the works we do.
I try to get the best machinery that is the most efficient electrically and do the most I can and with that there is this "OMG, they aren't going to finish in time" mentality as if not doing all the listed work will cause entire cities to be lined up and shot.. I know, a bit paranoid but it is what it is.
I keep the 3 day cache because of past history with the (few) downtimes we've seen on WCG so as not to run out of work.
I do think though that "some" limit should be put on these priority WU, as God's truth I've seen a full load of 32 of them while the 32 WU I was working on are stuck on hold and then held in memory, and let's face it, even with 16 gig I'm now running 64 WU, 32 running, 32 in memory, and 16 gig is minimal at that amount of WU. See my point?
Thanks for the input, as always, you answer questions with good info. That is appreciated more than I can say.

----------------------------------------

[Sep 25, 2012 9:59:16 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Late task handling

If a job is pre-empted and there is enough swap file space, I'd have thought that the RAM segment used by the paused task is moved to the swap file. For Windows the default swap size is about 1.5 times RAM, space galore. At 64 times 35MB (that's what HCC is taking at peak on my W7 ATM), that would be 2.2 Gig. With HFCC, peaking at 165MB, that's about 11GB, still within the 16GB you've got on that crunching monster.
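The arithmetic above, spelled out as a small Python sketch. The per-task peaks are the figures quoted in this post; `peak_memory_gb` is just a helper name for the example, and 1GB is taken as 1000MB.

```python
def peak_memory_gb(task_count, peak_mb_per_task):
    """Worst-case memory if every task (running and pre-empted) sits at its peak."""
    return task_count * peak_mb_per_task / 1000


tasks = 64  # 32 running + 32 pre-empted, as in Movieman's case
for name, peak_mb in [("HCC", 35), ("HFCC", 165)]:
    print(f"{name}: {peak_memory_gb(tasks, peak_mb):.2f} GB of 16 GB")
```

This is a worst case: it assumes all 64 tasks hold their peak at once and that nothing has been moved to swap.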

Setting to check: Is RAM permission during work/idle set to 100% or close to that?

BTW, Murphy was listening in... now got 6 HCC running in HP on the octo, due in 2:17 days, with 1.5 cache MinBuffer (which is aka "connect every..."). No disk thrashing/swapping... all operation within the 7GB permitted, including the 6 that were pre-empted.

[Sep 25, 2012 10:43:27 AM]
