World Community Grid - View Thread - Should I be running a smaller additional work buffer?

World Community Grid Forums

Category: Completed Research

Forum: The Clean Energy Project - Phase 2 Forum

Thread Status: Active
Total posts in this thread: 7

This topic has been viewed 1563 times and has 6 replies

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Should I be running a smaller additional work buffer?

I've been running a 0.6 day buffer, and just cranked it down to a 0.5 day buffer as I'm being scheduled high-priority tasks that are pushing queued tasks out to deadline warnings and beyond. For instance, this is a typical eFMer BoincTasks display (although it's been far redder of late) showing a number of "high-priority" tasks:

that results in queued tasks sitting for a week (for example) on Intel 2600Ks/980Xs/etc. that only have 0.6 day buffers and typically don't do anything but CEP2 on their CPUs (although a fractional or full core and a GPU may be doing something else) and so - without the onslaught of high-priority tasks - would have no problem chomping through a measly half-day of tasks. But as you can see in this next image, there are the details of a task that had already been sitting a week...

So I'm wondering if going almost "JIT" might better serve the purpose that BOINC's scheduler sees in my systems....say, by taking my work buffer down to 0.25 of a day...or even lower. Any associated hazards/risks?

(Edit: lolll...yeah, I suspended so I could have some CPU and GPU to take all of those nifty pictures; it isn't late because I dorked up and left it suspended.)

----------------------------------------
[Edit 1 times, last edit by Former Member at Dec 15, 2012 11:18:51 PM]

[Dec 15, 2012 11:12:46 PM]

Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7846
Status: Offline

Re: Should I be running a smaller additional work buffer?

Yeah, I would drop it down to .1 for a while to see if that lessens the load and then gradually up it to whatever your comfort level might be. I would also lower the "connect about every" time to a really small number.
Cheers
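For anyone who would rather set this per machine than through the website, here is a minimal sketch of writing the two buffer settings into a local override file. It assumes the usual BOINC mapping, where "connect about every" is stored as work_buf_min_days and the additional work buffer as work_buf_additional_days in global_prefs_override.xml. The 0.01 value for "really small" is only an illustration, not a recommendation from this thread.

```python
import xml.etree.ElementTree as ET


def build_override(min_days: float, additional_days: float) -> str:
    """Return the text of a global_prefs_override.xml holding only the two buffer settings."""
    root = ET.Element("global_preferences")
    # "Connect about every X days" / minimum work buffer
    ET.SubElement(root, "work_buf_min_days").text = f"{min_days:.2f}"
    # "Additional work buffer" (MaxAB in this thread's shorthand)
    ET.SubElement(root, "work_buf_additional_days").text = f"{additional_days:.2f}"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # 0.10 day additional buffer as suggested above; 0.01 is an assumed "really small" value.
    print(build_override(0.01, 0.10))
```

The output would be saved as global_prefs_override.xml in the BOINC data directory, and the client then has to be told to re-read its local preferences (from the Manager or with boinccmd --read_global_prefs_override) before the new values take effect.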

----------------------------------------

Sgt. Joe
*Minnesota Crunchers*

[Dec 16, 2012 12:59:58 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Should I be running a smaller additional work buffer?

Aside from the over-buffer issue that persists across 6.xx and 7.xx [a timing issue between the servers and client, deemed "too fast" at the alien search site], you don't mention which client you're running. FTM I'm highly frustrated with how it works in 7 through 7.0.42, and how quickly it runs into panic state. Tried to convey the message to developers, but it's one of those "inconvenient" issues, hitting on the "far eastern deafness" as it's called in one language I know. The Maximum additional work buffer, MaxAB, is pretty useless at WCG, particularly if you only run WCG, and I'm not at all sure where WCG stands on the matter. The last word was a limit of 15 tasks per work-fetch call, but lately I've seen 23, and once a crazy 86, contradicting that 15. Is it that timing issue, or is there something else not shared at play?

To understand your case [for a best guess fix], we need to know "Connect about every..." / "Minimum work buffer" (MinB) and Maximum Additional Work Buffer (MaxAB). For sure, since I'm planning an off-line trip, I want to take 7 days with me on the octo lappy, so the past few days have been spent on getting *out* of the reliability class, by holding back reporting till > 48 hours. Last supply series only had 1 out of about 30 being of the repair type, but a few days ago my 1 day buffer consisted 50% of repair jobs [21 out of 43], which thoroughly upsets the applecart. Most of these tasks were aborts at that.

As for your BT screenshot, I can't see what jobs the condensed/filtered tasks are. It does not show the short name/user-friendly name. Looks like at least 2 sciences.

Computer name skghome9... reminds me of a member who's not posted for a longer while... coincidence I suppose.

BTW, the latest 7 clients don't do the "let's try many" to see how long they run during a panic state, and some tasks sit there till near deadline before they get a turn, if they get a turn. The highest number I've seen with just WCG crunching was equal to the maximum number of concurrent jobs going into the "waiting to run" preempted state.

Edit: P.S. Earliest Deadline First / Panic State is really pushed by the Connect / MinB setting [the logic continues to be baffling to me]. If you want the least panic trouble and are always online, keep it as low as your comfort allows. MaxAB is always filled to the brim in 6, but in 7 it only acts as a maximum overflow. The client asks for X seconds of work, but at WCG you'll be lucky to get a handful of tasks, enough to lift you over the MinB level, at which point work fetch stops [for the project with highest work fetch priority].
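To put rough numbers on the "X seconds of work" above, here is a small sketch of the v7 behaviour as described in this post: fetch only when the queue is below MinB, then ask for enough to reach MinB plus MaxAB. This is a simplification for illustration, not the client's actual code. The function name and the per-core scaling are assumptions.

```python
SECONDS_PER_DAY = 86_400


def work_request_seconds(queued_seconds: float, min_buf_days: float,
                         max_add_days: float, ncpus: int) -> float:
    """Estimate the seconds of work requested, per the description in this thread.

    Hypothetical model: no request while the queue is at or above the MinB level;
    otherwise ask for the shortfall up to MinB + MaxAB, scaled by core count.
    """
    low_water = min_buf_days * SECONDS_PER_DAY * ncpus
    high_water = (min_buf_days + max_add_days) * SECONDS_PER_DAY * ncpus
    if queued_seconds >= low_water:
        return 0.0  # already over MinB, so work fetch stops
    return high_water - queued_seconds


if __name__ == "__main__":
    # Empty queue, 8 cores, MinB 0.25 days, MaxAB 0.5 days
    print(work_request_seconds(0, 0.25, 0.5, 8))
```

With those inputs the request comes to 0.75 days across 8 cores. Whether the server actually hands out that much is a separate matter, as noted above.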

Sunday afternoon loose rant [and those with "how to root resolve the problem BOINC code" ideas, it's a waste of time selling them here... really they need to tell the developers at their alpha mail list.]

----------------------------------------
[Edit 1 times, last edit by Former Member at Dec 16, 2012 4:29:57 PM]

[Dec 16, 2012 4:15:30 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Should I be running a smaller additional work buffer?

I typically set a system up so that the CPU cores are dedicated to CEP2 except for a fractional or a whole core dedicated to support an ATI GPU running either HCC (usually), or Collatz or the Seti beta (hey, if ET is out there and accepting hitchhikers - with proper acknowledgements to Douglas Adams - then I'm leaving...I've had enough of Earth's bloody righties).

Given that the central purpose is CEP2, the BOINC client used is the CEP2-"approved" client (6.10.58). I always choose GPU projects as secondaries so that CEP2 gets first shot at the CPU cores. Configuration at the local level always looks like this:

Although 6/12 physical/virtual core boxes may not have "100% of cores" running science...I run virtual machines so that I can, for instance, remote desktop in and use them to write long-winded speeches on WCG without disturbing the busy, busy, busy GPUs.

At the local level under the BOINC Manager "Activity" menu, I set "Network Activity" to "Always Available" and GPU to "Use GPU Always"...and "Run Based on Preferences".

I was wondering if the number of times the HCC application tickles the scheduler has something to do with it? The ATI GPU churns through HCC work units like a kid through M&Ms. Collatz and Seti go through work units far faster than CEP2 does, too, but I'm not altogether sure from "just watching" that they affect the scheduler the same way.

The SKGHOME7 box...while it previously had issues with a similar stack of units running high priority and work units in deadline warning condition, after the recent Msoft "Patch Tuesday" it displayed a string of "exited with zero status but no 'finished' file" messages, so I reset the WCG project. Lo and behold, it started behaving "better". On the off chance that might have a similar impact on the other boxes (that is, clear the queues of "high priority" and near/past deadline work units), I reset their WCG projects...to no effect.

Given how rude a project reset is, I ain't doing that again. Other than reduce their work buffers, I doubt that I'll do anything other than watch the somewhat intimidating colors associated with late/too late and high priority work units spread like the hapless epidemiologist watching the spread of a virus on a map in a "B" movie on the Syfy channel.

[Dec 16, 2012 7:51:59 PM]

Jim1348
Veteran Cruncher
USA
Joined: Jul 13, 2009
Post Count: 1066
Status: Offline

Re: Should I be running a smaller additional work buffer?

I was wondering if the number of times the HCC application tickles the scheduler has something to do with it? The ATI GPU churns through HCC work units like a kid through M&Ms. Collatz and Seti go through work units far faster than CEP2 does, too, but I'm not altogether sure from "just watching" that they affect the scheduler the same way.

On a Core i7-3770 I run only unlimited CEP2, GFAM, SN2S and DSFL, as well as HCC on the GPU, an HD 7770. And I never get panic jobs (except once when I had something obviously misconfigured in BOINC, but I don't remember what it was). So maybe it is a non-WCG project that is causing the problem? (I use 0.25 days min work buffer, and 0.50 days additional work buffer on BOINC 7.0.42 under Win7 64-bit.)

EDIT: I have had as many as 7 CEP2 jobs running at once (one core is reserved for HCC/GPU) with no problem, though normally it runs 2 to 4 at a time.

(Also, it is a dedicated machine that runs 24/7, if that is a factor. I don't know that I understand the problem.)

----------------------------------------
[Edit 2 times, last edit by Jim1348 at Dec 17, 2012 3:51:31 PM]

[Dec 17, 2012 3:23:43 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Should I be running a smaller additional work buffer?

V6/V7 has the nasty habit of wanting to process anything in high priority when the MinB multiplied by 2 is near or greater than the shortest deadline task in the buffer. In effect, when running a 2 day buffer, tasks that get their turn after 2 days are already 2 days old, so here BOINC goes off redlining short-deadline repair tasks in BT anyhow, where at least in V7 it will not put more than 8 into a preempting state for one project [WCG e.g.]. That is, when running CPU tasks only. At some point the v7 dev clients were requesting work no matter if it needed it or not... sort of like "why let a scheduler connection to report results go to waste". Hit update and it would fetch at least one task, on and on, but that was fixed AFAIK at least by version 7.0.31.

Often, if there is a panic mode, simply setting MinB to zero reverses the EDF status and FIFO resumes, but eventually, if the total buffer is greater than the shorter deadlines, the panic continues.
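The rule of thumb in this post can be written down as a quick check. This is only a sketch of the heuristic as described here, not the scheduler's real logic. The function names and the near_factor value standing in for "near" are assumptions.

```python
def likely_panic(min_buf_days: float, shortest_deadline_days: float,
                 near_factor: float = 0.9) -> bool:
    """True when 2 x MinB is near or greater than the shortest deadline in the buffer.

    near_factor is a hypothetical stand-in for "near": 0.9 means within 10%.
    """
    return 2 * min_buf_days >= near_factor * shortest_deadline_days


def buffer_outlasts_deadline(total_buffer_days: float,
                             shortest_deadline_days: float) -> bool:
    """True when the whole buffer is longer than the shortest deadline, the case
    where setting MinB to zero only helps for a while."""
    return total_buffer_days > shortest_deadline_days


if __name__ == "__main__":
    # A 2 day MinB against a task due in 4 days: expect trouble.
    print(likely_panic(2.0, 4.0))
    # A 0.25 day MinB against the same task: should stay in FIFO.
    print(likely_panic(0.25, 4.0))
```

The 4 day deadline is just an example figure. Plug in whatever the shortest deadline in your own task list happens to be.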

----------------------------------------
[Edit 1 times, last edit by Former Member at Dec 17, 2012 4:14:40 PM]

[Dec 17, 2012 4:04:37 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Should I be running a smaller additional work buffer?

I did figure out how to make this even more annoying: Don't pay sufficient attention to Microsoft emails so you miss their re-release of a Microsoft "Patch Tuesday" patch (Security Bulletin MS12-078, KB2783534) and then its subsequent silent automatic install under Windows 7 will automatically reboot your computers which you won't see because they're running on KVMs and don't have dedicated displays...just the ambient temperature on that floor will start dropping because BOINC isn't running anymore...and when you finally figure it out you'll get a bunch of

12/23/2012 11:11:48 PM World Community Grid Task E210860_614_A.31.C26H19NOSSi2.3.1.set1d06_0 is 1.30 days overdue; you may not get credit for it. Consider aborting it.
12/23/2012 11:11:48 PM World Community Grid Task E210860_559_A.32.C28H17NOS2.347.2.set1d06_0 is 1.30 days overdue; you may not get credit for it. Consider aborting it.
12/23/2012 11:11:48 PM World Community Grid Task E210860_377_A.32.C27H17N3S2.259.3.set1d06_1 is 1.30 days overdue; you may not get credit for it. Consider aborting it.

12/23/2012 11:20:05 PM World Community Grid Task E210852_036_C.32.C26H16N2OSSi2.02245857.4.set1d06_0 is 1.87 days overdue; you may not get credit for it. Consider aborting it.
12/23/2012 11:20:05 PM World Community Grid Task E210852_087_C.33.C27H14N4S2.02047269.1.set1d06_1 is 1.85 days overdue; you may not get credit for it. Consider aborting it.
12/23/2012 11:20:05 PM World Community Grid Task E210852_429_C.33.C28H14N2OS2.02227445.0.set1d06_0 is 1.84 days overdue; you may not get credit for it. Consider aborting it.
12/23/2012 11:20:05 PM World Community Grid Task E210852_696_C.33.C27H14N4S2.02047010.0.set1d06_0 is 1.84 days overdue; you may not get credit for it. Consider aborting it.
12/23/2012 11:20:05 PM World Community Grid Task E210852_535_C.33.C28H14N2OS2.02180931.3.set1d06_0 is 1.83 days overdue; you may not get credit for it. Consider aborting it.
12/23/2012 11:20:05 PM World Community Grid Task E210853_163_C.31.C28H16S2Se.02156016.3.set1d06_0 is 1.83 days overdue; you may not get credit for it. Consider aborting it.
12/23/2012 11:20:05 PM World Community Grid Task E210861_834_A.31.C27H17NOSSe.47.4.set1d06_1 is 1.26 days overdue; you may not get credit for it. Consider aborting it.
12/23/2012 11:20:05 PM World Community Grid Task E210852_497_C.32.C27H14N2OSSe.02129996.2.set1d06_0 is 1.87 days overdue; you may not get credit for it. Consider aborting it.
12/23/2012 11:20:05 PM World Community Grid Task E210852_776_C.32.C28H16OS2Si.02234802.2.set1d06_0 is 1.87 days overdue; you may not get credit for it. Consider aborting it.
12/23/2012 11:20:05 PM World Community Grid Task E210852_604_C.32.C27H14N2OSSe.02054247.2.set1d06_0 is 1.87 days overdue; you may not get credit for it. Consider aborting it.
12/23/2012 11:20:05 PM World Community Grid Task E210852_536_C.32.C27H16N2S2Si.02201876.4.set1d06_0 is 1.87 days overdue; you may not get credit for it. Consider aborting it.
12/23/2012 11:20:05 PM World Community Grid Task E210852_481_C.32.C28H16OS2Si.02119190.3.set1d06_0 is 1.87 days overdue; you may not get credit for it. Consider aborting it.
12/23/2012 11:20:05 PM World Community Grid Task E210852_843_C.32.C26H16N2OSSi2.02133160.0.set1d06_0 is 1.87 days overdue; you may not get credit for it. Consider aborting it.
12/23/2012 11:20:05 PM World Community Grid Task E210852_718_C.32.C25H16N4SSi2.02184089.3.set1d06_0 is 1.87 days overdue; you may not get credit for it. Consider aborting it.

...so you do...
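If the list is long, picking the task names out by hand gets old. Here is a minimal sketch that pulls the task name and days overdue out of messages in the format shown above. The regular expression is written against those lines only, so other client versions with different wording would need it adjusted.

```python
import re

# Matches e.g. "Task E210860_614_A...set1d06_0 is 1.30 days overdue"
OVERDUE = re.compile(r"Task (\S+) is (\d+(?:\.\d+)?) days overdue")


def overdue_tasks(log_text: str) -> list[tuple[str, float]]:
    """Return (task_name, days_overdue) for every overdue warning in the log text."""
    return [(m.group(1), float(m.group(2))) for m in OVERDUE.finditer(log_text)]


if __name__ == "__main__":
    sample = (
        "12/23/2012 11:11:48 PM World Community Grid Task "
        "E210860_614_A.31.C26H19NOSSi2.3.1.set1d06_0 is 1.30 days overdue; "
        "you may not get credit for it. Consider aborting it.\n"
        "12/23/2012 11:20:05 PM World Community Grid Task "
        "E210861_834_A.31.C27H17NOSSe.47.4.set1d06_1 is 1.26 days overdue; "
        "you may not get credit for it. Consider aborting it.\n"
    )
    # Most overdue first
    for name, days in sorted(overdue_tasks(sample), key=lambda t: -t[1]):
        print(f"{days:.2f}  {name}")
```

The resulting names can then be worked through in the Manager or BoincTasks, aborting whichever ones are past saving.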

Sigh. Such episodes make me consider running BOINC as a service...and then I remember the times when the power has gone out, come back, and everything comes back up except the air conditioning. I haven't quite got around to wiring in an ambient temp sensor yet, and I figure that is a prerequisite for running BOINC as a service unless I'm going to be here to babysit this stuff 24/7. Which I'm not. ;)

[Dec 24, 2012 6:19:03 AM]