Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
BOINC: Estimated Tasks "Time To Complete" Are Totally Wrong

BOINC uses the estimated floating point operations stored in the task header (the <rsc_fpops_est> value) to compute an estimated run time against the host (client) benchmark. If the benchmark is a true and correct reflection of the computational power actually applied to the science at hand, then the fpops estimate divided by the benchmark gives a correct Time To Complete (TTC).
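
As a rough sketch of that arithmetic (the variable names and values below are illustrative, not BOINC's actual internals):

      # Illustrative sketch only -- names and values are mine, not BOINC internals.
      rsc_fpops_est = 39560331010126.0    # estimated floating point operations from the task header
      benchmark_flops = 2.0e9             # host benchmark in FLOPS (example value)

      ttc_seconds = rsc_fpops_est / benchmark_flops
      print(ttc_seconds / 3600)           # roughly 5.5 hours estimated Time To Complete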

What happens if the task header's fpops estimate is wrong or the host benchmark was miscalculated?

  • If the fpops estimate is too high or the benchmark too low, the client scheduler over-estimates the TTC
  • If the fpops estimate is too low or the benchmark too high, the client scheduler under-estimates the TTC

    Cause and effect: BOINC learns from these variations and stores the deviation in a value called the (Result) Duration Correction Factor (rDCF).

    The next time a job arrives, the client takes the fpops estimate in the task header and applies the rDCF to produce an adjusted TTC.
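
    In other words (again a sketch, not the actual client code; all values are illustrative):

      # Hedged sketch: the adjusted estimate scales the raw estimate by the learned rDCF.
      rsc_fpops_est = 39560331010126.0    # from the task header
      benchmark_flops = 2.0e9             # host benchmark in FLOPS (example value)
      rdcf = 1.25                         # deviation learned from earlier results (example)

      adjusted_ttc_hours = rdcf * rsc_fpops_est / benchmark_flops / 3600
      print(adjusted_ttc_hours)           # about 6.9 hours instead of about 5.5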

    The rDCF starts out at a value of 1.000000. Once the actual task completion time is known, it changes as follows (sketched in code right after this list):

  • If jobs take longer, the rDCF is aggressively increased.
  • If jobs take less time, the rDCF is slowly reduced.
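
    A hedged sketch of that asymmetric behaviour (the 0.9 and 0.1 weights are illustrative, not the exact constants the BOINC client uses):

      # ratio = actual run time / estimated run time of the task just finished
      def update_rdcf(rdcf, ratio):
          if ratio > rdcf:
              # jobs took longer than the current correction predicts: jump up quickly
              return rdcf + 0.9 * (ratio - rdcf)
          # jobs took less time: creep down slowly
          return rdcf + 0.1 * (ratio - rdcf)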

    The reason for this asymmetry is that the rDCF is also used to estimate the amount of work requested from the servers. The logic is straightforward:

  • If recently completed work has taken much longer than estimated and the rDCF were not raised promptly, too much work would get buffered when using the Cache/Additional buffer function, since new work is always assumed to deviate the same way as completed work. The risk is that deadlines would be threatened by the time the last of the received work gets its turn.
  • If the work turns out to take much less time, too little work is buffered, but no harm is done.

    As we have experienced a few times, the work sometimes gets split, through no one's fault, into wrong sizes (very hard to avoid with non-deterministic calculations). Either the tasks run much, much longer and need a multiple of the estimated computations, or they are way over-estimated and take half the run time or less.

    Now, the flaw is that whilst WCG maintains a running average of fpops for each project (used in the new work headers), BOINC was never geared to maintain an rDCF per sub-project. It just keeps one for WCG as a whole. So if one sub-project goes haywire on its estimates, the rDCF starts to affect the estimated TTC for all the other sub-projects as well. Thus, if FAAH runs 5x longer than the fpops in the task header suggests it should, the following HPF2 job with a similar fpops estimate is deemed to do the same and gets the same inflated TTC.
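
    To put numbers on that example (all values below are purely illustrative):

      # One rDCF shared across all of WCG: a bad FAAH batch inflates HPF2's shown TTC too.
      benchmark_flops = 2.0e9     # host benchmark in FLOPS (example)
      rdcf = 5.0                  # learned after FAAH tasks ran ~5x longer than estimated
      hpf2_fpops_est = 4.0e13     # a similar fpops estimate in a following HPF2 header (example)

      print(rdcf * hpf2_fpops_est / benchmark_flops / 3600)   # ~27.8 hours shown instead of ~5.6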

    A sort of fix:

    On August 4, 2008 knreed announced that, in order to mitigate this effect, future batches of work from projects that WCG knows produce substantially variable run times will first get a limited sample distribution. These samples will be sent to known, reliable clients. Based on the actual result data, either the fpops estimate in the headers is adjusted or the batch is further cut to size, in order to keep the total average run time within a target range, e.g. 7 or 8 hours:
    We are going to modify our processes going forward (starting today) so that we send out a limited number of workunits for each batch as soon as the batch is ready to be loaded and sent to members. This work will be sent to the reliable hosts so that we can get information about the behavior of that work as soon as possible. This process will limit the impact to the member community as we should be able to identify surprises like this before we send out tens of thousands of 'surprises'.

    Future developments will allow sizing of work according to computational power, so that a weaker machine will see approximately the same run time as a power-cruncher. In the extreme, the result would be something similar to the RICE project, where tasks run 8 hours no matter what computer. knreed explained on July 31, 2008:
    Yes - there are actually a lot of advantages to doing this. We have been working with David Anderson and BOINC to get this capability added. David has done a lot of work on this already and the folks at Superlink@Technion! are the first BOINC project to put the new code into production. We will be updating our servers to utilize the new code later this year.

    Once we have the code, the server will assess the 'effective power' of the computer requesting work and try to send it work that won't take it more than a day or so. Effective power is the raw power of the computer * the amount of time that BOINC is allowed to run work on the computer.

    Once we have tested this and feel good about it, we will modify how we create workunits so that there is a lot of variation in the size and computers will be able to get the appropriate size of work. This will reduce load on our servers as we will be able to send bigger workunits to those powerful always on computers and it will improve our ability to effectively use those computers that are less powerful and are only on infrequently (and thus have a hard time completing work currently).

    So it is a definite advantage to do this and we are anxious to get this in place.
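
    That "effective power" works out to something like the sketch below (based only on the quote above; every number is an assumption for illustration):

      # effective power = raw power * fraction of time BOINC is allowed to compute
      raw_flops = 2.0e9             # host benchmark in FLOPS (example)
      on_fraction = 0.5             # BOINC allowed to run half the wall-clock time (example)
      effective_power = raw_flops * on_fraction

      target_seconds = 24 * 3600    # "work that won't take it more than a day or so"
      max_fpops_per_task = effective_power * target_seconds   # ~8.6e13 fpops for this host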

      NB:
    • The client benchmark is re-evaluated once every 5 days (120 hours wall-clock)

    • Though excessively long tasks have a "safety" allowance to let them run extra long, all have a cut-off factor of between 6 and 10 times the originally estimated computations (fpops) needed to complete. This prevents them from running ad infinitum, particularly on clients that are unattended. The factor is the task's <rsc_fpops_bound> value divided by its <rsc_fpops_est> value. For instance, the values below yield a factor-10 cut-off, leading to an aborted "Exceeded CPU time limit" result if computations go beyond that point:

      <rsc_fpops_bound>395603310101260.000000</rsc_fpops_bound>
      <rsc_fpops_est>39560331010126.000000</rsc_fpops_est>
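
    The factor can be read straight off those two header values (plain arithmetic, nothing BOINC-specific):

      # cut-off factor = rsc_fpops_bound / rsc_fpops_est
      rsc_fpops_bound = 395603310101260.0
      rsc_fpops_est = 39560331010126.0
      print(rsc_fpops_bound / rsc_fpops_est)   # 10.0 -> aborted after ~10x the estimated computations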

    phew.... and still tweaking this topic!!!
    ----------------------------------------
    WCG Global & Research > Make Proposal Help: Start Here!
    Please help to make the Forums an enjoyable experience for All!
    ----------------------------------------
    [Edit 2 times, last edit by Sekerob at Aug 6, 2008 7:57:04 PM]
    [Aug 5, 2008 5:00:39 PM]