World Community Grid Forums
Category: Support | Forum: Community-maintained FAQs | Thread: BOINC: Estimated Tasks "Time To Complete" Are Totally Wrong
Sekerob
Ace Cruncher | Joined: Jul 24, 2005 | Post Count: 20043
BOINC uses the part of the task header that stores the estimated fpops (the floating point operations recorded in the <rsc_fpops_est> value) to compute an estimated run time with reference to the host (client) benchmark. If the benchmark is a true and correct reflection of the computational power actually applied to the science at hand, then fpops divided by benchmark gives a correct Time To Complete (TTC).
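To make that division concrete, here is a minimal sketch in Python. It is not BOINC client code; the function name, task size, and benchmark figure are made up for illustration:

    # Minimal sketch of the basic TTC estimate described above. Not BOINC
    # client code; the numbers below are illustrative only.

    def estimated_ttc_seconds(rsc_fpops_est, benchmark_flops):
        # Estimated Time To Complete = estimated operations / measured host speed.
        return rsc_fpops_est / benchmark_flops

    # A task estimated at 3e13 fpops on a host benchmarked at 2.5e9 flops
    # comes out at 12,000 seconds, i.e. about 3 hours 20 minutes.
    print(estimated_ttc_seconds(3e13, 2.5e9))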
What happens if the task header's fpops estimate is wrong, or the host benchmark was miscalculated? Cause and effect: BOINC learns from these variations and stores the deviation in a value called the (Result) Duration Correction Factor (rDCF). The next time a job arrives, it takes the fpops estimate from the task header and applies the rDCF to produce an adjusted TTC. The rDCF starts out at a value of 1.000000 and is adjusted when the actual task completion time becomes known: if the task ran longer than estimated, the rDCF is raised immediately to reflect the overrun; if it ran shorter, the rDCF is lowered only gradually, a little with each result. The reason for this asymmetry is that the rDCF is also used to estimate the amount of work requested from the servers, and the logic that follows is obvious: it is safer to fetch too little work than too much, so BOINC reacts instantly to underestimates but only cautiously to overestimates.

As we have experienced a few times, the work is sometimes split, through nobody's fault, into wrong sizes (very hard to avoid with non-deterministic calculations). Either the tasks run much, much longer and need a multiple of the estimated computations, or they are heavily overestimated and take half or less of the expected run time.

What goes wrong now is that, whilst WCG maintains a running average of fpops for each project (used in the new work headers), BOINC was never geared to maintain an rDCF per sub-project; it keeps just one for WCG as a whole. So if one sub-project goes haywire on its estimates, the rDCF starts to affect the estimated TTC for all the other sub-projects as well. Thus, if a FAAH task runs 5x longer than the fpops in its task header suggest it should, the following HPF2 job with a similar fpops estimate is deemed to do the same and gets the same inflated TTC. Sort of.
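A small sketch of how such a correction factor plays out, continuing the made-up numbers from the sketch above. This is not the BOINC client source: the class name and the small step used when lowering the factor are illustrative assumptions, but the asymmetric behaviour (raise immediately, lower slowly) follows the description above:

    # Sketch of a Duration Correction Factor as described above. Not BOINC
    # client code; the 1% step used when lowering is an illustrative choice.

    class DurationCorrectionFactor:
        def __init__(self):
            self.value = 1.0  # rDCF starts out at 1.000000

        def adjusted_ttc(self, rsc_fpops_est, benchmark_flops):
            # Estimated TTC with the learned correction applied.
            return rsc_fpops_est / benchmark_flops * self.value

        def update(self, raw_estimate_seconds, actual_seconds):
            ratio = actual_seconds / raw_estimate_seconds
            if ratio > self.value:
                # Ran longer than expected: raise the factor immediately, so the
                # client does not keep fetching more work than it can finish.
                self.value = ratio
            else:
                # Ran shorter than expected: ease the factor down a small step at a time.
                self.value += 0.01 * (ratio - self.value)

    # One rDCF is kept for WCG as a whole, so a FAAH task that runs 5x longer
    # than its header suggests inflates the TTC shown for the next HPF2 task too.
    dcf = DurationCorrectionFactor()
    dcf.update(raw_estimate_seconds=12000, actual_seconds=60000)  # FAAH overran 5x
    print(dcf.adjusted_ttc(3e13, 2.5e9))  # HPF2 with a similar estimate: 60,000 s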
On August 4, 2008, knreed announced that, to mitigate this effect, future batches of work from projects that WCG knows to produce substantially variable run times will first get a limited sample distribution. These samples are sent to known, reliable clients, and based on the actual result data either the fpops estimate in the headers is adjusted or the batch is further cut to size, in order to keep the total average run time within a target area, e.g. 7 or 8 hours:

"We are going to modify our processes going forward (starting today) so that we send out a limited number of workunits for each batch as soon as the batch is ready to be loaded and sent to members. This work will be sent to the reliable hosts so that we can get information about the behavior of that work as soon as possible. This process will limit the impact to the member community as we should be able to identify surprises like this before we send out tens of thousands of 'surprises'."

Future developments will allow sizing of work according to computational power, so that a weaker machine ends up with approximately the same run time as a power-cruncher. The result would be something which, in the extreme, is similar to the Rice project, where tasks run 8 hours no matter what computer. knreed explained on July 31, 2008:

"Yes - there are actually a lot of advantages to doing this. We have been working with David Anderson and BOINC to get this capability added. David has done a lot of work on this already and the folks at Superlink@Technion are the first BOINC project to put the new code into production. We will be updating our servers to utilize the new code later this year. Once we have the code, the server will assess the 'effective power' of the computer requesting work and try to send it work that won't take it more than a day or so. Effective power is the raw power of the computer * the amount of time that BOINC is allowed to run work on the computer. Once we have tested this and feel good about it, we will modify how we create workunits so that there is a lot of variation in the size and computers will be able to get the appropriate size of work. This will reduce load on our servers as we will be able to send bigger workunits to those powerful always-on computers, and it will improve our ability to effectively use those computers that are less powerful and are only on infrequently (and thus have a hard time completing work currently). So it is a definite advantage to do this and we are anxious to get this in place."
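The "effective power" idea in that quote can be sketched in a few lines. This is only an illustration of the definition given there (raw power times the share of time BOINC is allowed to run); the function names, workunit sizes, and the one-day target are assumptions, not the actual server code:

    # Sketch of effective-power based work sizing, per the quote above.
    # Not the actual WCG/BOINC server code; sizes and target are illustrative.

    def effective_power(raw_flops, on_fraction):
        # Raw power of the computer * the share of time BOINC may run work.
        return raw_flops * on_fraction

    def pick_workunit(sizes_fpops, host_effective_flops, target_seconds=86400):
        # Pick the largest workunit expected to finish within about a day.
        fitting = [s for s in sizes_fpops if s / host_effective_flops <= target_seconds]
        return max(fitting) if fitting else min(sizes_fpops)

    sizes = [1e13, 5e13, 2e14]  # hypothetical workunit sizes in estimated fpops
    print(pick_workunit(sizes, effective_power(5e9, 1.00)))  # always-on power-cruncher -> 2e14
    print(pick_workunit(sizes, effective_power(2e9, 0.25)))  # part-time machine -> 1e13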
phew.... and still tweaking this topic!!!