OldChap
Veteran Cruncher
UK
Joined: Jun 5, 2009
Post Count: 978
Priority work

This may fit better in BOINC support, but as it is really only a PITA when running GPU work I thought I'd add it here.

I had my rig running happily along and had spent the time staggering the concurrently running work units so they were generally separated in time.

Then the internet was down for a while, and when it reconnected, WCG (or is it BOINC?) decided it knew better, stopped everything mid-flow and started a bunch of new work units.

I can see the point of doing this when a WU runs for 8 or more hours, but is it really necessary to do things this way with GPU work? Would it not make more sense to let a job finish and then start the priority work?

Is it possible to switch off this feature for GPU work and just run a FIFO system? Or perhaps, as mentioned, let jobs finish before running the priority work.



What a waste of time. sad
[Mar 11, 2013 7:54:41 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Priority work

WCG has no control over the order of processing; it only maintains a "repair" logic of shorter deadlines and prioritizes those in the feeder. Critically, you don't mention the client version. And no, I don't disagree with your logic; on the other hand, the GPU jobs last 2-10 minutes or less when run on their own, so what loss/delay is there to consider? We know they have a midway checkpoint, so worst case about a 1-minute loss per task when it is unloaded from memory on pre-emption.

To get that kind of control, a case needs to be made to the BOINC devs. At WCG the GPU jobs are rather short; elsewhere they can run a lot longer, so if FIFO were switched on and WCG is not the only project active on a host, deadlines could get missed whenever the buffer is larger than the shortest deadline.
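
To make that concrete, here is a little sketch: purely illustrative, with made-up runtimes and deadlines, one task at a time on one device, and nothing to do with the actual BOINC scheduler code:

# Illustrative only (invented figures, not the real BOINC scheduler): why strict
# FIFO can blow a short deadline that earliest-deadline-first (which is roughly
# what the "high priority" behaviour approximates) would meet.

def simulate(tasks, order_key):
    """Run tasks back to back in the given order; return names that miss their deadline."""
    now = 0.0
    missed = []
    for name, runtime_h, deadline_h in sorted(tasks, key=order_key):
        now += runtime_h                    # the task occupies the device for runtime_h hours
        if now > deadline_h:
            missed.append(name)
    return missed

# A buffer of 4-hour tasks with the usual long deadline, plus one short-deadline
# repair that arrives last, i.e. at the back of a FIFO queue.
queue = [("normal_%d" % i, 4.0, 168.0) for i in range(9)]   # 36 hours of buffered work
queue.append(("repair", 4.0, 24.0))                          # has to be back within 24 hours

print("FIFO misses:", simulate(queue, order_key=queue.index))      # ['repair'] - finishes at hour 40
print("EDF misses: ", simulate(queue, order_key=lambda t: t[2]))   # []         - repair runs first

With WCG's uniformly short GPU tasks FIFO looks harmless; mix in longer tasks or a buffer bigger than the shortest deadline and the repair at the back of the queue is exactly what gets sacrificed.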
[Mar 11, 2013 8:13:19 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Priority work

P.S. We're still looking for that "Don't send rush jobs" option. There are reasons for certain devices [e.g. mobile or part-time machines on a very irregular schedule] not to be sent repairs. Then FIFO would always be run if just crunching WCG [unless you mix the HCC 7-day deadline with other WCG sciences and run a very large cache].
[Mar 11, 2013 8:25:11 PM]
OldChap
Veteran Cruncher
UK
Joined: Jun 5, 2009
Post Count: 978
Re: Priority work

For the record, BOINC 7.0.42 on this rig.

The pic above is the result of an experiment to see what happens with the offset when heavily loading the number of concurrent WUs. This is after 24 hours and for me is the best result yet, as I can maybe use this if I'm unable to tend the rig for any reason. The downside is that averages are maybe 2-3 seconds per WU longer.

I understand what you are saying about other projects, Rob... Maybe it is something we will have to put up with unless the devs implement some sort of switch under user control.

Feeling less grumpy about it now I've had another coffee wink. Apologies for the negative vibe.
[Mar 11, 2013 8:35:53 PM]
BladeD
Ace Cruncher
USA
Joined: Nov 17, 2004
Post Count: 28976
Re: Priority work

When I'm feeling grumpy and have the time, I will manually suspend those WUs (and any others that might start) and let the ones left waiting run to completion. smile

And guess what? The world didn't end! biggrin
[Mar 11, 2013 11:15:15 PM]
JacobKlein
Cruncher
Joined: Aug 21, 2007
Post Count: 21
Re: Priority work

Regarding the work units that pre-empted your GPU units... were they run as high-priority?

If I'm not mistaken... when BOINC's round-robin simulation predicts that tasks won't make their deadlines, it re-schedules those "deadline miss" tasks to run first, even if they pre-empt other jobs, including GPU ones.
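
Very loosely, the check works like the sketch below: a much-simplified stand-in for the real client code, with a made-up task mix:

# Simplified stand-in for the deadline check (not the real BOINC code, and the
# task mix below is invented): share the processors round-robin among the queued
# tasks and flag anything projected to finish after its deadline.

def deadline_miss_check(tasks, ncpus):
    """tasks: list of (name, remaining_h, deadline_h). Returns names projected to miss."""
    remaining = {name: rem for name, rem, _ in tasks}
    finish = {}
    now = 0.0
    while remaining:
        share = min(1.0, ncpus / len(remaining))    # fraction of a CPU each task gets
        step = min(remaining.values()) / share      # wall-clock time until the next finish
        now += step
        for name in list(remaining):
            remaining[name] -= step * share
            if remaining[name] <= 1e-9:
                finish[name] = now
                del remaining[name]
    return [name for name, _, deadline in tasks if finish[name] > deadline]

tasks = [("task_a", 2.0, 100.0), ("task_b", 30.0, 24.0), ("task_c", 10.0, 48.0)]
print(deadline_miss_check(tasks, ncpus=2))   # -> ['task_b'], so it gets run first

Anything flagged that way jumps the queue, which is exactly the "stops everything mid-flow" behaviour described at the top of the thread.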
[Mar 17, 2013 1:52:19 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Priority work

I have seen this happen once or twice, and it is usually when I have a fairly large cache. Occasionally, instead of calculating estimated runtime based on my multiple-concurrent-WU configuration (about 13:00 minutes each), BOINC decides to calculate for the same number of concurrent WUs but with the time estimate as if I were running just one (1:30 minutes each). This is observable in BOINC Manager. Unfortunately BOINC is still holding onto the multiple-WU configuration core count, so it calculates that my cache is insufficient and requests (and receives) more WUs. By more, I mean it floods me out. That would only be a one-time problem if it stayed that way, but a short time later (sometimes 30 seconds, sometimes a few minutes) it magically reverts to the "good" estimate of 13:00 minutes and promptly freaks out, realizing I can't possibly finish all the re-sent work in time.

The only solution I have found is to keep my cache at 0.5 days.
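
Back-of-the-envelope, the over-fetch looks something like this (the slot count is invented and the real work-fetch code weighs more factors, so treat it as a rough sketch):

# Rough sketch of the over-fetch (hypothetical slot count; the real BOINC
# work-fetch logic considers more than just this).

def tasks_to_fill_cache(cache_days, concurrent_slots, est_runtime_min):
    """Roughly how many WUs it takes to keep `concurrent_slots` busy for `cache_days`."""
    return round(cache_days * 24 * 60 * concurrent_slots / est_runtime_min)

slots = 13   # hypothetical number of concurrent GPU WUs

print(tasks_to_fill_cache(1.0, slots, 13.0))   # sane 13:00 estimate  -> 1440 WUs
print(tasks_to_fill_cache(1.0, slots, 1.5))    # bogus 1:30 estimate  -> 12480 WUs requested
print(tasks_to_fill_cache(0.5, slots, 1.5))    # a 0.5-day cache at least halves the damage

Once the estimate snaps back to 13:00, those extra WUs amount (with these made-up numbers) to roughly 8-9 days of real work sitting in a 1-day cache, hence the panic.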
[Mar 17, 2013 12:05:56 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Priority work

There's a wonk in the system [I'm not 100% sure, needing more observations]. I've seen the TTCs jump and drop for GFAM. When a repair arrives, *all* the GFAM tasks assume a 7+ hour runtime, including everything already in the queue. Then, when the repairs are done and new normal GFAM arrive, everything buffered for GFAM drops back to its approximately correct runtime, around 4 hours. It's beyond me, but with WCG being on <dont_use_dcf> I tend to think it's coming off the server feeder.

Well, say no more: I just upped the cache to force a top-up, and the DSFL were at 5:20 hours. The first DSFL set arrived with a normal deadline and the TTC plummeted to 4:18, then 3 repairs arrived and the DSFL times all jumped to 6:48. All the GFAM sat there at 4:58 and new work adjusted that to 4:56.

Maybe knreed can explain; this is to add to the offline note I issued. This client has actually been punching through GFAM/DSFL in 3:45 to 4:45 hours for weeks now... haven't seen longer. Here are the last 15:

GFAM_x2XFA_PbADF2_box2_0083813_0030_0 -- 2320720 Pending Validation 3/16/13 21:23:33 3/17/13 13:00:59 3.66 / 3.73 100.7 / 0.0
GFAM_x2XFA_PbADF2_box2_0083813_0008_0 -- 2320720 Valid 3/16/13 21:23:17 3/17/13 10:33:04 3.74 / 3.84 104.9 / 106.5
DSFL_00060-46_0000045_0308_0 -- 2320720 Valid 3/16/13 21:23:17 3/17/13 10:03:57 4.50 / 4.61 114.6 / 124.7
GFAM_x2XFA_PbADF2_box2_0083813_0203_0 -- 2320720 Pending Validation 3/16/13 21:23:17 3/17/13 09:55:15 3.81 / 3.91 106.8 / 0.0
DSFL_00060-46_0000045_0360_0 -- 2320720 Valid 3/16/13 21:23:17 3/17/13 09:44:41 3.74 / 3.84 92.1 / 102.3
GFAM_x2XFA_PbADF2_box2_0083813_0240_0 -- 2320720 Valid 3/16/13 21:23:17 3/17/13 09:34:08 4.08 / 4.19 114.5 / 114.5
GFAM_x2XFA_PbADF2_box2_0083813_0162_0 -- 2320720 Valid 3/16/13 21:23:17 3/17/13 09:28:06 3.70 / 3.79 103.2 / 102.2
DSFL_00060-46_0000045_0816_0 -- 2320720 Valid 3/16/13 21:23:17 3/17/13 09:16:56 3.92 / 4.02 96.3 / 103.3
GFAM_x2XFA_PbADF2_box2_0083812_0099_0 -- 2320720 Valid 3/16/13 21:23:17 3/17/13 09:04:35 3.75 / 3.85 104.4 / 108.4
DSFL_00060-46_0000045_0028_0 -- 2320720 Valid 3/16/13 21:20:27 3/17/13 07:11:02 4.40 / 4.45 87.1 / 105.8
DSFL_00060-46_0000045_0347_0 -- 2320720 Valid 3/16/13 21:20:27 3/17/13 07:11:02 4.92 / 4.99 97.7 / 114.6
GFAM_x2XFA_PbADF2_box2_0083813_0195_0 -- 2320720 Pending Validation 3/16/13 21:20:27 3/17/13 07:10:44 3.92 / 3.97 85.6 / 0.0
GFAM_x2XFA_PbADF2_box2_0083813_0215_0 -- 2320720 Valid 3/16/13 21:20:27 3/17/13 07:10:44 3.90 / 3.95 85.1 / 101.6
GFAM_x2XFA_PbADF2_box2_0083812_0248_0 -- 2320720 Valid 3/16/13 21:20:27 3/17/13 07:10:44 4.14 / 4.21 90.6 / 107.5
GFAM_x2XFA_PbADF2_box2_0083812_0242_1 -- 2320720 Pending Validation 3/16/13 21:20:27 3/17/13 07:10:44 3.80 / 3.85 82.9 / 0.0

As for prioritization, GPU takes precedence over CPU, but I'd expect that within the GPU resource the same EDF/FIFO rules apply as for the CPU. It would be a big surprise if not, *but* since LAIM (leave applications in memory) does not work for GPU tasks for the reasons discussed, EDF switching is not lossless; then again, the HCC jobs are short and have a midway checkpoint.

Anyway, is the TTC for GPU tasks being projected from the CPU tasks with a divisor of 10? Some comments suggest so.
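
As for what the TTC jumps above might look like mechanically, here's my guess, hedged: with <dont_use_dcf> the client presumably takes its estimates from the server's per-app-version speed figure rather than a local correction factor, so one revised figure rescales every buffered task at once. Invented numbers:

# Guesswork sketch (invented figures): if the estimate is roughly
#   estimated_runtime = task_fpops_estimate / server_projected_flops
# then a change in the server-side speed figure rescales the TTC of every
# buffered task of that application at the same moment.

task_fpops_estimate = 5.0e13          # hypothetical work content of one GFAM task, in FLOPs

for projected_flops in (3.5e9, 2.0e9, 3.5e9):      # server figure dips, then recovers
    hours = task_fpops_estimate / projected_flops / 3600
    print("projected_flops = %.1e  ->  TTC ~ %.1f h" % (projected_flops, hours))
# -> ~4.0 h, ~6.9 h, ~4.0 h : the whole queue jumps and drops together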
[Mar 17, 2013 1:11:21 PM]
gomeyer
Senior Cruncher
USA
Joined: Jul 11, 2008
Post Count: 161
Re: Priority work

I've had exactly the same experience as Snow Crash above, and have employed exactly the same remedy: lower the cache to about 0.5 days or less and avoid the "Panic Mode" to begin with.

WCG has done such an excellent job of keeping their system up and running that there is usually no need for a large cache. (Except perhaps before a planned outage such as the one coming up tonight; I may increase my cache all the way to 0.75 days. tongue )
[Mar 17, 2013 1:26:42 PM]