Total posts in this thread: 149
This topic has been viewed 19636 times and has 148 replies
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Re: this is a really long work unit

Barney,

The reliable mechanism looks at two things:

1) The recent average of the length of time between when a result is assigned and the time that the result is reported as done by the client. This time must be less than 21 hours.

2) The most recent 15 or so results checked must have been determined to be valid.

Although item #1 does not explicitly check the depth of the queue on a given machine, the reality is that people do not change their queue size very often. Thus this measurement will include the impact of queue size, because a larger queue size will lengthen the 'turnaround' time for a workunit and push the computer out of the 'reliable' metric.
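
As a rough illustration of the two checks above, here is a minimal Python sketch. The function name, the exact window of 15, the use of a plain mean, and the handling of short histories are all assumptions for illustration; the post only says "recent average" and "15 or so", and this is not the scheduler's actual code.

```python
from statistics import mean

MAX_AVG_TURNAROUND_HOURS = 21   # threshold stated in the post
VALID_WINDOW = 15               # "15 or so" in the post; exact value assumed


def is_reliable(turnaround_hours, validation_results):
    """Hypothetical check mirroring the two conditions described above.

    turnaround_hours: recent assigned-to-reported times, in hours.
    validation_results: booleans for results already checked by the
        validator, oldest first (True means valid).
    """
    # Assumption: a host without enough history is not treated as reliable.
    if not turnaround_hours or len(validation_results) < VALID_WINDOW:
        return False
    # Condition 1: recent average turnaround under 21 hours.
    fast_enough = mean(turnaround_hours) < MAX_AVG_TURNAROUND_HOURS
    # Condition 2: the most recent checked results were all valid.
    all_valid = all(validation_results[-VALID_WINDOW:])
    return fast_enough and all_valid
```

Under these assumptions, a host that averages 4.25 hours with 15 valid results passes, while a single invalid result among the last 15 checked fails it regardless of speed.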

Also - given that this is in the long workunit thread, the scheduler does the following computation to assign a deadline for these 'rush' jobs.

1) Take the original deadline time and divide by 5. So if the original deadline was 10 days, then the rush job deadline would be 2 days.

2) Compute the estimated turnaround time for the job on the client. If this time is more than the value in #1 then use this value up to a max of twice #1.

The greater of #1 and #2 is used as the new allowed time.

For many of these extended workunits, I strongly suspect that rule #2 is being applied.
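
The deadline rule above can be sketched in a few lines of Python. The function and parameter names are hypothetical, and the estimated turnaround is taken as an input because the post does not say how the scheduler computes it.

```python
def rush_deadline_days(original_deadline_days, estimated_turnaround_days):
    """Sketch of the 'rush' deadline rule described above (names assumed)."""
    # Step 1: one fifth of the original deadline.
    base = original_deadline_days / 5
    # Step 2: the client's estimated turnaround, capped at twice step 1.
    capped_estimate = min(estimated_turnaround_days, 2 * base)
    # The greater of the two values is the allowed time.
    return max(base, capped_estimate)
```

With a 10-day original deadline, an estimated turnaround of 1 day gives 2 days (rule #1), 3 days gives 3 days (rule #2), and 7 days is capped at 4 days.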
[Aug 21, 2008 1:55:57 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: this is a really long work unit

Didactylos & Knreed,

Thanks for the explanations




The reliable mechanism looks at two things:

1) The recent average of the length of time between when a result is assigned and the time that the result is reported as done by the client. This time must be less than 21 hours.

2) The most recent 15 or so results checked must have been determined to be valid.

Although item #1 does not explicitly check the depth of the queue on a given machine, the reality is that people do not change their queue size very often. Thus this measurement will include the impact of queue size because a larger queue size will lengthen the 'turnaround' time for a workunit and push the computer out of the 'reliable' metric.


OK, so the queue depth is inferred from the difference between the time a result is dispatched to the client and the time the client returns it; no problem.

This would tend to suggest, for most clients that are connected to the internet all the time and for machines that run 24/7 the Client preferences for optimal settings should look like:



The first value says to connect about every 8 seconds.

The calculations that support this are:

60 sec/min * 60 min/hr * 24 hr/day = 86,400 sec/day, and 86,400 sec/day * 0.0001 day = 8.64 secs.

or its alternative:

60 sec/min * 60 min/hr * 24 hr/day = 86,400 sec/day, and 86,400 sec/day * 0.001 day = 86.4 secs.

The second value, Additional Work Buffer, says to request a work unit about 15 minutes prior to the end of the WU that is about to complete.

The calculations that support this are:

60 sec/min * 60 min/hr * 24 hr/day = 86,400 sec/day, and 86,400 sec/day * 0.01 day = 864 secs; 864 secs / 60 secs/min = 14.4 mins.
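
The conversions above can be checked with a small Python sketch. The helper names are made up for illustration; the only facts used are that the preference values are expressed in days and that a day has 86,400 seconds.

```python
SECONDS_PER_DAY = 60 * 60 * 24  # 86,400 seconds in a day


def days_to_seconds(days):
    """Convert a preference value given in days to seconds."""
    return days * SECONDS_PER_DAY


def days_to_minutes(days):
    """Convert a preference value given in days to minutes."""
    return days_to_seconds(days) / 60
```

Calling `days_to_seconds(0.0001)` gives roughly 8.64 seconds, `days_to_seconds(0.001)` roughly 86.4 seconds, and `days_to_minutes(0.01)` roughly 14.4 minutes, matching the figures worked out by hand.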

So those two settings, working harmoniously with the client and the scheduler, have the WU returned in the shortest period of time, presuming the machine is running 24/7.

The other thing this does as a setting is to almost guarantee that all WU's will be validated or otherwise corroborated in the shortest period of time, so erroneous patterns can emerge quickly.

It's a real toothache when WU's are dispatched, computed and returned for validation, only to have to wait 8-10 days for another client to finally return its results. When a client misses its return deadline, another WU is dispatched for computation, and again a long period of time can go by before the previously completed WU's can be validated.

It appears that WU's that take many days to be completed by other clients with deep queues or otherwise slow processors do indeed keep clients that return WU's for validation quickly from being marked as reliable, because the necessary count of 15 recent WU's cannot be counted as VALID.

The net effect is significant when the scheduler is looking for clients that can be viewed as "reliable".

From where I sit, it seems reasonable to cluster work units onto groups of machines with similar characteristics so you have a core set of clients which permits the majority of similar clients to be grouped as being "reliable."

To do anything else seems to unnecessarily penalize fast responders because of clients that cannot reply in a timely fashion.


Also - given that this is in the long workunit thread, the scheduler does the following computation to assign a deadline for these 'rush' jobs.

1) Take the original deadline time and divide by 5. So if the original deadline was 10 days, then the rush job deadline would be 2 days.

2) Compute the estimated turnaround time for the job on the client. If this time is more than the value in #1 then use this value up to a max of twice #1.

The greater of #1 and #2 is used as the new allowed time.

For many of these extended workunits, I strongly suspect that rule #2 is being applied.


Is it worthwhile to consider limiting the amount of work sent to any client to no more than, say, 23 or 24 hrs of processing at a time, based upon that machine's raw CPU capacity?

It seems that would help everyone out in the long run.
[Aug 21, 2008 5:56:41 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: this is a really long work unit

Hello BarneyBadass,
First, some people like to have large work queues, even when they do not need them. Second, why bother connecting so often? Nyquist's theorem says that you are all right as long as your connect time is less than half the time it takes to process a work unit. If you have an additional work cache that holds at least your connect time, then you are golden. Only lengthy problems (connection, server, etc.) can cause you to run out.
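
The rule of thumb in this post can be written as a small Python check. This is only a sketch of the two conditions as stated here; the function and parameter names are hypothetical and it says nothing about how the client itself behaves.

```python
def settings_look_safe(connect_interval_hours, work_unit_hours, extra_cache_hours):
    """Sketch of the rule of thumb above (all names are assumptions).

    connect_interval_hours: how often the client connects.
    work_unit_hours: typical time to process one work unit.
    extra_cache_hours: additional work cache, expressed in hours of work.
    """
    # Connect time should be less than half the work unit processing time.
    connects_often_enough = connect_interval_hours < work_unit_hours / 2
    # The additional cache should hold at least one connect interval of work.
    cache_covers_gap = extra_cache_hours >= connect_interval_hours
    return connects_often_enough and cache_covers_gap
```

For example, a 4-hour work unit with a 1-hour connect interval and a 1-hour cache satisfies both conditions, while a 3-hour connect interval does not.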

Lawrence
[Aug 21, 2008 10:26:52 PM]
petehardy
Senior Cruncher
USA
Joined: May 4, 2007
Post Count: 318
Status: Offline
Re: this is a really long work unit

Hi Barney,

According to BOINCStats you've got just 1 quad (Q6600); I've got 7 computers with a total of 16 cores.
I'm getting less than half your credit per day. I know that you've got it clocked up pretty good.
My question is: could you start a thread in the Chat Room and give us the details (motherboard, memory, cooling, etc.) and the settings you're using?
I'm sure that many people (including me) would be interested. Also, sorry to bring this up, but are you getting any error WUs?

Pete

Edit - Grammar
----------------------------------------

"Patience is a virtue", I can't wait to learn it!
----------------------------------------
[Edit 1 times, last edit by petehardy at Aug 21, 2008 11:20:58 PM]
[Aug 21, 2008 10:51:46 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: this is a really long work unit

Hi Lawrence,


First, some people like to have large work queues, even when they do not need them.


I agree with your statement.

From what I can see this does not help anyone in many regards.

I enjoy having an automobile which can approach 200 MPH... but just because I enjoy having such a vehicle doesn't mean I can approach those speeds on the freeways or streets.


Second, why bother connecting so often? Nyquist's theorem says that you are all right as long as your connect time is less than half the time it takes to process a work unit.


Lawrence,

Your assessment is correct.

As we both recognize, there are really only two ways for things like this to occur. One is event driven, the other is statistical sampling.

The BOINC client has a modified version of a statistically sampled polled event. Hmmmmm... what's that?

Well, I want the client to have the ability to be able to wait until it can't wait any longer to request a new work unit. So stipulating I don't want any new work unit until the one I'm just about done working on completes gives the scheduler the opportunity to get something into me that's very high priority needing something to be returned in as little time as possible.

The point I was trying to make was that as long as client machines are allowed to request WU's long before they can process them, rather than returning results in as short a period of time as possible, other client machines are always going to be marked as not reliable. Why? Because while I may get a WU that takes, say, 4 hrs to process (on my system), the overall time that WU is in my possession is about 4.25 hrs. Now, suppose the WU just processed has a quorum of 2 or more and my result needs verification, and I have to wait, say, several days for those verifications / confirmations to complete. As a natural byproduct, my client is subsequently marked as non-reliable, which is really unfortunate.

If the WCG scheduler could dispatch the same WU's to systems that have similar response characteristics (not necessarily speed, but generally returning results within 12 hrs of the WU being dispatched to that client), then more machines would be marked as reliable. So "emergency" re-work WU's would have a wider selection of "reliable" clients to be dispatched to. As it stands, it appears very fast client responders can easily be eliminated from the reliable client list because their WU's must wait for validation, which may occur outside the time specifications for validation. As it is right now, my client has about 90% of my last 45 WU's waiting for validation. Care to guess how that would likely affect my client being a member of the reliable client list?


If you have an additional work cache that holds at least your connect time, then you are golden. Only lengthy problems (connection, server, etc.) can cause you to run out.

Lawrence


Lawrence,

Again, I agree with your statement. As I articulated earlier, having my parameters set the way I have them allows for the possibility of being dispatched, at the last minute, WU's that require results to be returned as soon as possible.

Of course, this is not to say that my client will get "emergency" WU processing any more frequently than anyone else, but it does ensure my client will get the WU, process it, and return it in nearly the shortest period of time my client can muster.

---Barney
[Aug 22, 2008 12:08:31 AM]
JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3716
Status: Offline
Re: this is a really long work unit

Barney, I think that conditions shown by knreed are computed separately:
1. the recent average turnaround time must be below 21 hours
and
2. the last 15 WUs which have gone through the validator must be valid.

Since those two conditions are rather restrictive and must be both satisfied there is no need for the "reliable" process to be any more complicated: as soon as a WU is declared "invalid" or "error" the client is not considered as "reliable" until 15 new valid ones have gone through the validator.

Regarding the average turnaround time, there is no need to check whether WUs have been computed too fast on an unstable overclocked machine (for example), because in that case condition #2 will quickly kick in. There is no need to care about the queue depth either, because as soon as it is too large (say, about 0.5 day) the turnaround time will mathematically be too high. That also covers the case of machines with a short queue that are not switched on 24/7.
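
To see why queue depth and uptime show up in the turnaround time without being checked separately, here is a deliberately simplified Python sketch. The model (queued work plus run time, stretched by the fraction of the day the machine is on) and every name in it are assumptions for illustration, not how WCG measures anything.

```python
RELIABLE_LIMIT_HOURS = 21  # average turnaround threshold from knreed's post


def estimated_turnaround_hours(queue_days, run_hours, uptime_fraction=1.0):
    """Very rough turnaround model (an assumption, not the real scheduler).

    queue_days: work already buffered ahead of a new WU, in days of crunching.
    run_hours: time to crunch the new WU itself.
    uptime_fraction: share of each day the machine is on (1.0 means 24/7).
    """
    crunch_hours = queue_days * 24 + run_hours
    # A machine that is on only part of the day needs proportionally longer.
    return crunch_hours / uptime_fraction


def within_reliable_limit(queue_days, run_hours, uptime_fraction=1.0):
    """True if the modelled turnaround stays under the 21-hour threshold."""
    return estimated_turnaround_hours(
        queue_days, run_hours, uptime_fraction
    ) < RELIABLE_LIMIT_HOURS
```

In this toy model, a 0.01-day queue with 4-hour WUs stays near 4.24 hours, a 0.5-day queue with 10-hour WUs lands at 22 hours, and a short 0.1-day queue on a machine that is on a quarter of the time reaches 25.6 hours.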

Cheers. Jean.
----------------------------------------
Team--> Decrypthon -->Statistics/Join -->Thread
[Aug 22, 2008 12:33:34 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: this is a really long work unit

Barney, I think that conditions shown by knreed are computed separately:
1. the recent average turnaround time must be below 21 hours
and
2. the last 15 WUs which have gone through the validator must be valid.

with a short queue but not switched on 24/7.

Cheers. Jean.


Jean,

Many thanks,

So the next question is: how do "Pending Validation" and "Valid" relate?

Does "Pending Validation" imply "Valid"?

Thanks
---Barney
[Aug 22, 2008 12:38:48 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: this is a really long work unit

Barney, the turn-around time is based on return time, not validation time.*

The work buffer serves two purposes:
1) it allows members who are not constantly connected to bridge unconnected intervals.

2) it provides a safety factor for unscheduled outages, both WCG-wide and local ISP related.

* Knreed stated this in his post. Please, if you want to try to pick holes in the algorithms used by World Community Grid, read very, very, carefully everything you are told. Otherwise, you spend a lot of time on baseless ideas (and waste our time, too, explaining the facts again).

edit: Knreed also answered your follow-up questions already.
----------------------------------------
[Edit 1 times, last edit by Former Member at Aug 22, 2008 12:41:02 AM]
[Aug 22, 2008 12:39:30 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: this is a really long work unit

Hi Pete!

I'll be happy to divulge my system configuration and such.

I'll see if I can't open some kind of descriptive thread in the chat room.

As for errors... Nope.. I don't get any on my WU's, unless I do something stupid.

---Barney
[Aug 22, 2008 12:42:16 AM]
JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3716
Status: Offline
Re: this is a really long work unit

As it stands, it appears very fast client responders can easily be eliminated from the reliable client list because their WU's must wait for validation which may occur outside the time specifications for validation.

No, see my previous post.

Even if that were the case that would not be a problem as long as WCG has many more reliable clients than it needs for dispatching emergency WUs, which is the case. Although it is rewarding to realize that you are considered as a reliable fast returner when you see a short deadline WU in your client's queue, you must not forget that the purpose of this process is to quickly complete a WU which has had some problem, it is not to deliver certificates of excellence to the best clients/members. For the latter it could be considered as unfair, but for the former it is simply efficient and achieving its objectives.

Cheers. Jean.
----------------------------------------
Team--> Decrypthon -->Statistics/Join -->Thread
[Aug 22, 2008 12:48:50 AM]