Thread: Let's discuss this one more time (42 posts, viewed 5700 times, 41 replies)
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Let's discuss this one more time


Hi Barney!
Changing the write-to-disk interval only "when you need it" does not work, unfortunately, and you could verify this if you set your cc_config.xml file to log checkpoints in your messages.
Applications query the BOINC client about this parameter only once, when they start or restart. They then use that value until the end of the run (completion or shutdown).
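For reference, a minimal cc_config.xml along these lines should turn that logging on (assuming I remember the flag name correctly; the file lives in the BOINC data directory and is re-read via the Advanced menu's read-config option):

<cc_config>
  <log_flags>
    <!-- write a message each time a task checkpoints -->
    <checkpoint_debug>1</checkpoint_debug>
  </log_flags>
</cc_config>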

Hi Jean,

How ya' been bud?

Now you have me really confused... I was under the impression (perhaps incorrectly) that setting the values, applying them, and then going to Advanced -> Read settings did the trick! You're saying this just isn't the case? YUCK!

A few other comments on your topic.
I usually work with a short queue (0.15 day currently, crunching HCMD2 only) and thus I have many WUs in PV status too. But on average I seldom have more than one day of work in PV, i.e. 6 days of runtime, since I have six WUs active at a time. The last time I made a snapshot of my Result Status pages, the PV tasks totalled 6.5 days of runtime. When there are more, it is usually because the estimated complexity of the jobs is far from reality, as when HCMD2 started recently, and in that case a dedicated feeder for fast returners would be defeated too.

Hmmm... while I'm not attempting to argue, I'm also somewhat mystified by your conclusion. I would have certainly anticipated that an entire batch of WU pairs or triplets, everything needed for quick turnaround, would be kept in the same queue, presumably a FIFO queue. Of course, the dispatcher would need to be smart enough not to send all the copies of a WU that need "validation" to the same machine. Assuming that were solved, the rest should knock the PV Purgatory scenario down to almost no PV for any extended period of time, at least for the fast responders. For those who choose to keep a 10-day buffer of WUs, I don't perceive any real way to help them get out of PV Purgatory; actually, I doubt they experience much of it. I suspect it's those of us whose systems hold only what we are actually working on who are impacted, waiting for the copy that is stuck deep in someone's 10-day buffer.

Over the duration of a project, the time WUs stay in PV status has no noticeable influence on the duration of the project: we are talking about a few more days of delay per WU versus several months or years for the project. In practice, if the average PV time is increased by one or two weeks, the total project will finish one or two weeks later, that's all.

You're likely correct; it probably would have a minimal impact. On the other hand, an impact could be observed if you consider a WU that is held for no more than, say, 5 hours total from the time it is dispatched to the time it is returned. When the requisite validation WU is out for 14 days (because someone has 10 days of WUs in their queue and can barely keep up, or holds that validation WU for 13 or 14 days and it then aborts or otherwise isn't returned in a timely manner), an additional validation WU must be dispatched for the one that didn't complete. So now another delay is introduced and you are still stuck in PV Purgatory.

One can only presume that, should this become more the rule than the exception, more and more reworked or re-issued validation WUs will need to be dispatched. And as has been discussed here, one way to almost make the problem go away for the fast responders is to have all the WUs in a fast-responder queue.

It's likely not perfect, but I can envision it being better than what is currently observed, at least on my system.

I think your suggestion is interesting, particularly for reducing the number of entries in the database at a given time, but I am not sure the added complexity makes it really practical. Also, the techs have other, simpler means to reduce the number of open WUs in their database if they need to, for example by increasing the average duration of WUs.

But that does not forbid discussing it, obviously. Jean. smile


From what I've observed in some of the queuing work I've done on past projects, the adjustment of "tuning knobs" frequently causes unexpected and unanticipated consequences. Hence some of the reasoning and rationale for bringing forth this discussion.

I'm just a mere mortal here on WCG and I know there are some real Titans available. My only hope was someone might give this some consideration.

I do think the suggestion has an opportunity to benefit the charts Sekerob produces for all of us, but that is pure speculation on my part.

On a side note: having to wait on some other machine in the grid to do the same work I've already completed annoys the fool out of me! angry

I want what I want when I want it, and if I can't have it when I want it or believe I should have it, then I'm annoyed because I can't have it when I wanted it in the first place! wink Hmmmm... sounds like someone stuck in PV Purgatory! biggrin

I absolutely hate it when I have to be somewhere by a specific time and all is going according to plan, then I get onto a one-way, one-lane road that could safely be driven at a moderately high speed, only to be stuck behind someone traveling at such a meager pace that the cats and dogs on the roadside are passing us. sick
----------------------------------------
[Edit 2 times, last edit by Former Member at Jul 4, 2009 12:24:16 AM]
[Jul 4, 2009 12:04:28 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Let's discuss this one more time

So I just got this WU:

[screenshot of the workunit's Result Status entry; image not preserved]

Have no fear, it will be completed and returned today. The noteworthy part of the image is that it was dispatched more or less as an emergency validation. If history proves true, my second copy will be back before the first one. All three of the results were dispatched to the worker bees within 2 hours of each other. The first result went back within about 1.5 hours of the client receiving it. Then the one I got is due not on 7/14 but on 7/8, as if someone is in PV Purgatory.

Oh goodness, what will they think of next?

First, I'm not complaining, but scheduling this with an earlier due date than 7/14 makes little logical sense to my feeble mind.

I guess the point is, if this can be dispatched within minutes of the error being detected, it should be plausible to have the fast-work queue. Of course, here's a case where playing with "tuning knobs" just mucks up the works.

I wish I understood what in the world is going on.
----------------------------------------
[Edit 1 times, last edit by Former Member at Jul 4, 2009 11:23:53 AM]
[Jul 4, 2009 11:20:55 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Let's discuss this one more time

The reason we [me] write FAQs is so we [me] don't have to rehash information and, in the process, make mistakes in remembering how it exactly was.

See NB note in:

http://www.worldcommunitygrid.org/forums/wcg/viewthread?thread=17160

The Fast Returner device window [without confirmation] is, from observation, now at least 48 hours.
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
[Jul 4, 2009 11:38:26 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Let's discuss this one more time

Sekerob,

My friend, yes, I do recall seeing this in the FAQ previously.

NB: ++ For database efficiency reasons, deadlines for "Rush" jobs (repair and make-up for bad jobs) use the 20%/40% rule. A request is outstanding to give these particular "Rush" jobs the same deadline as the original, with a minimum of the 40% rule. These Rush jobs are only sent to "reliable" clients that are known to have a very high "valid" rate and usually return results within 21 hours of submission to a device. 40% was chosen because occasionally a very long-running job goes out, and then a short deadline could not be met.


In this case, the error occurred and was reported back within 2 hours of all the WUs being dispatched. Making this particular job a "rush" job tends towards the ludicrous.

Since we had been discussing the idea of a strict queue for fast returners, I used the above WU as a vehicle to illustrate a scenario where simply using tuning knobs to adjust the system's behavior would be mucked up if all the correct thinking for a fast-machine queue was not in place.

Nothing more, nothing less.

From the mind of a mere mortal, I just can't get my head around why this kind of deep queue for fast returns isn't already in place. It simply boggles me completely.
[Jul 4, 2009 12:07:43 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Let's discuss this one more time

How about this idea (and someone might have already vocalized it).

WCG scheduler acts as a matchmaker. Whenever a 'reliable' device requests work:

- First, it looks in the 'rushjobs' pot.
- Second, if no rush jobs are there, send a new job with an artificially short deadline, so its partner automatically goes to the 'rushjobs' queue.

No special batches required, but the chance of the wingman returning a result sooner is much greater. Of course, I don't know what the load on the scheduler would be!
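Roughly, in made-up pseudocode (these names and numbers are mine for illustration, not anything from the actual WCG scheduler):

def assign_work(device, rushjobs, fresh_work,
                short_deadline_days=3, normal_deadline_days=10):
    # 1) a 'reliable' device first empties the rushjobs pot
    if device.reliable and rushjobs:
        return rushjobs.pop(0)
    # 2) otherwise it gets a brand-new job with an artificially short deadline,
    #    so the wingman copy of the same workunit is treated as a rush job
    #    and also lands on a reliable device
    job = fresh_work.pop(0)
    job.deadline_days = short_deadline_days if device.reliable else normal_deadline_days
    return job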

I think this can be applied to all quorum-2 projects, which means CEP, HCC & HCMD2.

BUT, it might have implications for those also crunching outside WCG, and might have different effects on various versions of the BOINC client; from what I have read, several projects don't have exactly in-sync server software versions either. It's a mess.

All of this comes with a cost: whilst the fast machines are seeing a reduction in PV, the slower crunchers will see an increase in PV, simply because they are never matched with fast machines. Net gain? You quantify it. Maybe it induces folk to buffer less? But how many actually can be swung by this?

(and someone might have already vocalized it in this thread or elsewhere).
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
[Jul 4, 2009 2:11:43 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7846
Status: Offline
Re: Let's discuss this one more time

How about this idea (and someone might have already vocalized it).

WCG scheduler acts as a matchmaker. Whenever a 'reliable' device requests work:

- First, it looks in the 'rushjobs' pot.
- Second, if no rush jobs are there, send a new job with an artificially short deadline, so its partner automatically goes to the 'rushjobs' queue.

No special batches required, but the chance of the wingman returning a result sooner is much greater. Of course, I don't know what the load on the scheduler would be!

I think this can be applied to all quorum-2 projects, which means CEP, HCC & HCMD2.

BUT, it might have implications for those also crunching outside WCG, and might have different effects on various versions of the BOINC client; from what I have read, several projects don't have exactly in-sync server software versions either. It's a mess.

All of this comes with a cost: whilst the fast machines are seeing a reduction in PV, the slower crunchers will see an increase in PV, simply because they are never matched with fast machines. Net gain? You quantify it. Maybe it induces folk to buffer less? But how many actually can be swung by this?

(and someone might have already vocalized it in this thread or elsewhere).


There is no net gain, merely a redistribution of workload. The total crunching power of WCG remains unchanged. Total throughput is constrained by the total crunching power of the system, regardless of the individual throughput of any one machine.

Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Jul 4, 2009 3:03:04 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Let's discuss this one more time


No special batches required, but the chance of the wingman returning a result sooner is much greater. Of course, I don't know what the load on the scheduler would be!

I think this can be applied to all quorum-2 projects, which means CEP, HCC & HCMD2.


Sekerob,

The big difference between what you are discussing and what I believe I put forth is that my implementation was for any quorum value, not just the quorum-2 projects. Your idea is an interesting one to consider. Similar goals, different implementation methodology.

And I agree with you, it's a mess.

We continue to use an approach that was reasonable several years ago, but with the technological advances since then the service just hasn't kept up the pace to remain reasonable.

For the life of me, I can't imagine why anyone would set up their system in any way other than a "WU request on demand" scenario.

Now, this isn't to say that, should you know you are going to take, say, an 8 or 12 hour outage, you couldn't download sufficient work to your buffer to cover that duration plus an hour or two, to keep your machine from being idle. But beyond that... a fast-responder queue seems to be the real solution, short of enticing everyone to change their behavior and only request WUs when they are out (or soon to be out) of WUs to process.


There is no net gain, merely a redistribution of workload. The total crunching power of WCG remains unchanged. Total throughput is constrained by the total crunching power of the system, regardless of the individual throughput of any one machine.
Cheers


Sgt.Joe,

I'm sorry, but I disagree with you on one aspect and agree with you on another.

True, the same quantity of WUs gets computed. From that perspective you're correct, and we agree.

I'm sure those with the big brains, the ones waiting for the completed data from corroborating WUs that provide the necessary quorum for validation, would see it as a throughput gain because they get the confirmed WUs back sooner. Now, if they need, say, a group of 500 different WU batches and their validations, and they must wait for the last few batches, that's a different discussion. That all has to do with sequencing, which hasn't really been addressed other than to say that an entire batch of WUs and all of their PVs must reside within the same fast-return queue if you want it back in, say, 24 hours or so.

So it hinges on your definition of "throughput". As I see it, if you can't get all the requisite validation WUs in a batch completed in a timely way, the throughput is arguably reduced to that of the slowest responder.

BTW, what's for dinner up there in Minnesota for this 4th of July? wink
----------------------------------------
[Edit 3 times, last edit by Former Member at Jul 4, 2009 5:56:41 PM]
[Jul 4, 2009 5:29:17 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7846
Status: Offline
Re: Let's discuss this one more time

BTW, what's for dinner up there in Minnesota for this 4th of July?


Off topic.

Hamburgers, watermelon, musk melon, rhubarb pie, potato salad, iced tea and beer. biggrin
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Jul 4, 2009 6:24:44 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Re: Let's discuss this one more time

Interesting discussion, and I appreciate the input. First, I suggest the following definitions for two terms:

Throughput: The number of units of work completed successfully during a given period of time (i.e. the rate at which they complete)

Turnaround: The time between when a unit of work is sent out/started and the time that it is returned/finished.


To understand the perspective of the researchers, the following is the sequence of events regarding how workunits come from the researchers and then results get returned to them:

1) The researchers prepare workunits in groups called batches. Usually they prepare many batches at once. The number of workunits in a batch varies from 400-500 for DDDT up to 5000-6000 for FAAH. We ask the researchers to always keep some 'unclaimed' batches ready for us. Most of the projects keep more than a week.
2) We download the batches from the researchers (and 'claim' them). We try to keep 3-4 days of batches ready to prepare for BOINC.
3) We prepare the batches for loading into BOINC. We try to keep at least 3 days of work ready to load into BOINC.
4) Load the batches into BOINC. We try to keep 48 hours of workunits loaded into BOINC and ready to send at all times.
5) Distribute workunits to the members computers.
6) As workunits are returned, validated and assimilated, they are put into a holding area until the entire batch completes.
7) Once the batch is completed, the results are packaged up and pushed to the researchers. This takes about 1 day.

Once a batch is loaded into BOINC it takes the following on average to complete the batch:

app_name   processing_days
cep1       16.3
dddt       13.6
faah       17.6
flu1       14.1
hcc1       16.7
hcmd2      18.5
hfcc       16.6
hpf2       12.2
rice        9.2

(Note that Rice and HCMD2 have artificially low values. This is because the timestamp that is set when a batch is completely 'loaded' is reset each time a child workunit is created and loaded into BOINC.)

This means that it takes 4-5 weeks from the time the batch is created by the researchers until the results are in their hands. The reason for all of the buffers is that failures of different components or issues creating work can cause delays in producing workunits to distribute. As you have noticed we still have periodic issues with workunit availability even with this system, but if we shortened the buffered work it would occur much more frequently.

About half of the turnaround time is derived from the time to complete the batch within BOINC.
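As a rough back-of-the-envelope check (the stage durations below are illustrative midpoints of the buffers described above, not exact figures):

# approximate days a batch spends at each stage
stage_days = {
    "unclaimed at researchers":  7.0,   # "more than a week" of unclaimed batches kept ready
    "claimed, awaiting prep":    3.5,   # 3-4 days of downloaded batches
    "prepared for BOINC":        3.0,   # at least 3 days of work ready to load
    "loaded, awaiting send":     2.0,   # 48 hours loaded at all times
    "processing within BOINC":  15.0,   # roughly 9-18 days depending on project (table above)
    "packaging for researchers": 1.0,
}
total_days = sum(stage_days.values())                              # about 31.5 days, i.e. 4-5 weeks
boinc_share = stage_days["processing within BOINC"] / total_days   # about half of the total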

The length of time it takes to complete a batch within BOINC is determined by the time it takes the slowest workunit to complete. The slowest workunit is always a 'hard luck' case. These are almost always workunits where one of the initial results is never returned and the 'rush' job created to replace the initial result is not returned by its shortened deadline. The second 'rush' job usually completes.

This means that the turnaround for a batch on BOINC is determined largely by the initial deadline that is set for each workunit within the batch, not by the computing power of the computers assigned to the workunits or the queue depth of those computers.

(One exception: HPF2 = deadline + 2-day buffer. This is because timeouts are generally ignored if the min_quorum of 15 is met, and we just assimilate 15 or more results rather than 19 results for the workunit.)
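To put the 'hard luck' arithmetic another way (the deadline numbers here are placeholders, not our actual settings):

def hard_luck_completion(initial_deadline, rush_deadline, rush_return_time):
    # the initial copy never comes back, the first rush copy misses its shortened
    # deadline, and the second rush copy finally returns after rush_return_time
    return initial_deadline + rush_deadline + rush_return_time

# a typical copy may come back in a day or two, but the batch waits for its slowest member:
batch_days = max(2, hard_luck_completion(initial_deadline=12, rush_deadline=5, rush_return_time=1))
# -> 18 days, set by the deadlines, not by how fast the other computers were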

So how does this relate to the discussion above? Assigning some workunits to be 'fast' and some workunits to be 'slow' within a batch would have no impact on the turnaround time for a batch. We could divide the batches into 'fast' batches and 'slow' batches. This would require the researchers to prioritize their batches. However, in reality they have already prioritized their batches and they are giving them to us in that order. There are occasionally batches that they want as extreme priorities, and we generally accommodate those by moving them to the front of all queues and sending them out only as rush jobs. Doing this causes an hour or two where only 'reliable' computers can get work for that project. This is a rare exception. In general they basically want all of their batches done at the same priority; otherwise they simply hold off making them available for us to send until they are ready.

The second point discussed above is how to reduce the pending validation queue for members. There are benefits to doing this:

1) Members get their points more quickly
2) Members see more quickly if there are issues with their computer (i.e. if they are producing invalid results)
3) Reduce the number of records on the result and workunit tables (thus improving db performance)
4) Reduce filesystem requirements for the BOINC filesystem (although this will probably be matched by an increase in storage required for results waiting to be packaged to return to the researchers)

Essentially, what would need to happen is that a preference would exist allowing a member to say that they prefer workunits with 'short', 'long' or 'no preference' deadlines (this would be set as part of a device profile).

A 'short' preference would mean artificially reduce the deadline for the workunit.
A 'long' preference would mean do not send 'rush' jobs and only process full deadline jobs.
A 'no preference' setting would mean send whatever work the computer can complete on time.

The server would need to ensure that the computer can meet the reduced deadline (if set to 'short'), even for projects that have wildly inaccurate estimates. It would also need to make sure that reduced-deadline workunits do not increase so fast that they cause 'long' members to be unable to obtain work. Additionally, it would need to make sure that all results for a workunit get the same deadline.
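In scheduler terms the preference might boil down to something like the sketch below (illustrative only; the 40% factor is borrowed from the rush-job rule quoted earlier in the thread, and the names and margins are placeholders):

def deadline_for(preference, full_deadline_days, host_expected_turnaround_days):
    # 'short'         -> artificially reduced deadline
    # 'long'          -> full deadline, and the host is never sent rush jobs
    # 'no preference' -> full deadline, rush jobs allowed
    d = 0.4 * full_deadline_days if preference == "short" else full_deadline_days
    # the server still has to be confident the host can finish in time,
    # even when a project's runtime estimates are wildly inaccurate
    if d < host_expected_turnaround_days:
        d = full_deadline_days
    # and every result of the same workunit must be issued with this same deadline
    return d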

We will have to think about this some. I'm interested in members thoughts on this.
----------------------------------------
[Edit 1 times, last edit by knreed at Jul 6, 2009 4:16:37 PM]
[Jul 6, 2009 4:15:15 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Let's discuss this one more time

knreed,

Thanks for your kind and lengthy explanation. I for one appreciate the additional information. Certainly, not knowing the internals of how WCG in all its glory works places limitations on what I can articulate with any certainty. So the best I can do is offer observations and expected benefits.

Now the real question is: when can we have it? Are you done with all the coding yet? biggrin Just kidding.

The whole goal is to make it better for everyone involved. FWIW, I'm still showing 45-50 WUs all in purgatory ATM, awaiting some kind soul to cough up the required validation WU child. Bad child... bad child... it should be sent to its room!

Anyway, thanks for at least considering the discussion and seeing if there's any benefit to the whole conversation.


---Barney
[Jul 6, 2009 9:06:02 PM]