| World Community Grid Forums
Thread Status: Active Total posts in this thread: 13
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I have a very reliable machine that crunches 24/7 and hasn't had an error in over 1 month. Following the last benchmark, my 7-8 hour HFCC WU are now being projected to 18 hours to complete, even though the last WU to have such an estimate is 42% complete after 3:38. Maybe it was the benchmark and maybe it wasn't.
15/03/2010 11:16:01 AM||Suspending computation - running CPU benchmarks
15/03/2010 11:16:32 AM||Benchmark results:
15/03/2010 11:16:32 AM|| Number of CPUs: 3
15/03/2010 11:16:32 AM|| 2130 floating point MIPS (Whetstone) per CPU
15/03/2010 11:16:32 AM|| 4622 integer MIPS (Dhrystone) per CPU
||
|
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
My even more reliable duo is projecting them HFCC jobbies at 15.5 hours presently with a DCF of 1.68... remember the DCF (Duration Correction Factor)? That's the value computed from the last jobs: the ratio between what they were originally estimated to have in fpops and what they actually had. Run a few RICE or longer HCMD2 (after a few shorties) and that value gets all upset. See FAQs for that special Why topic :O
edit: The link for the FAQ: http://bit.ly/aSoRlH
----------------------------------------
WCG
Please help to make the Forums an enjoyable experience for All!
[Edit 1 times, last edit by Sekerob at Mar 15, 2010 4:50:05 PM]
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Maybe I am just missing the point, but I do not understand why a reliable, constant and consistent computer which has stabilized its DCF on one project, and then switches projects, goes to a DCF that is out by 250%.
The computer hasn't changed, only the project. No errors and no aborts. The new WU had to have a wildly incorrect estimate. Maybe the first WU on HFCC got sent out with a 7 hour estimate and then the computer took 16 hours to crunch it. OK, but no one other than Mr Kermit could have crunched that job in 7 hours. So why did HFCC think that this job was 7 hours? It should not have. It should have been around 16 hours for this box. And in 7-10 days we will all be able to prove it, because every HFCC WU will have an accurate forecast.
What I fully expect to be true is that each project cannot and does not accurately estimate the duration of any particular WU (which is why you are pressing for project-based DCF) in comparison to any other project, thereby introducing sizable errors into the DCF when switching projects. Whatever tool is being used on the server to estimate WU crunching time is wrong. This box has WU that have taken 6-16 hours to finish, and I can bet the 16 hour WU arrived with a 7 hour estimate, so that now the 7 hour WU are arriving with a 16 hour estimate.
Finally, what is happening with a box running both HFCC and HCMD2, where the two projects share the same DCF? Will HFCC be consistently over-estimated and HCMD2 under-estimated? Will every WU be incorrectly forecast? The solution is do not switch projects, and if I were finishing either 2 or 200 HFCC/FA@H/HCC/HPF WU per day, that is what I would be doing.
So my final question is this: if I have a single core system running all projects (even on a high speed processor), does the DCF get pooched each time you process a WU for a different project? If it does not, then why does the DCF get pooched when switching projects?
||
|
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
OK, so the FAQ did not make it clear?
Let's get hypothetical, yet it's a real example: HCMD2 has a moving project mean of X fpops per task that translates to Y amount of time, per your device's benchmark... the global average is about 4.6 hours. Then you fetch an HCMD2 task that actually has a few very tough positions where, right after the first 6 hours had it do more than 60% of what was in the package, the job suddenly runs on to 11.5 hours. There you have 11.5 / 4.6 = 2.5, which knocks up the DCF. Then an HFCC job comes in with its own running project average fpops (dynamically updated from prior days' results). Since WCG=WCG, the DCF driven by the previous HCMD2 jobs is applied to the HFCC jobs. Now, I noted that HFCC currently has a trend of longer jobs after a while of lighter ones, so it works a bit as a double whammy in this HCMD2/HFCC project mix.
The practical advice is: ignore it... just let the client run. If worried about the reliability rating, keep a cache of around 1 day and let it be. It's not going to change... it's impossible to accurately estimate the fpops with all the non-deterministic computations the various sciences perform.
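The arithmetic in the example above can be sketched roughly as follows. This is only an illustration, assuming the client multiplies a task's raw estimate by the one shared DCF; the function name is made up here, and the 7-hour raw HFCC figure is borrowed from the first post rather than from any server data.

```python
def shared_dcf_estimate(raw_estimate_hours, dcf):
    """Scale a raw run-time estimate by the client's single, shared DCF."""
    return raw_estimate_hours * dcf


# An HCMD2 task sized at the 4.6 h project mean actually ran 11.5 h.
hcmd2_estimated_hours = 4.6
hcmd2_actual_hours = 11.5

# A task that runs longer than estimated pushes the DCF up to the full ratio.
dcf = hcmd2_actual_hours / hcmd2_estimated_hours

# Assumption: the next HFCC task would be sized at about 7 h with a DCF of 1.
hfcc_raw_hours = 7.0
hfcc_projected_hours = shared_dcf_estimate(hfcc_raw_hours, dcf)

print(f"DCF = {dcf:.2f}, HFCC projection = {hfcc_projected_hours:.1f} h")
```

Under these assumptions a 7-hour HFCC task gets projected at about 17.5 hours, which is in line with the 18-hour projections reported at the top of the thread.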
||
|
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
PS: No times are estimated. It's real fpops from work returned that is stored in the headers of tasks that go out afterwards. The fpops are converted back to a time estimate derived from the oft bogus benchmark values (particularly for those who have 64 bit clients) and the running DCF the client has stored based on its specific actual performance and throughput.
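As a rough sketch of that conversion, assuming the client treats the Whetstone benchmark as its floating point rate and then scales the result by the DCF: the function name and the fpops value below are hypothetical, and only the 2130 MIPS figure comes from the benchmark log in the first post.

```python
def estimated_runtime_hours(fpops_estimate, whetstone_mips, dcf):
    """Convert a task's fpops estimate to hours via the benchmark and the DCF."""
    flops_per_second = whetstone_mips * 1e6  # MIPS -> operations per second
    seconds = fpops_estimate / flops_per_second * dcf
    return seconds / 3600.0


WHETSTONE_MIPS = 2130      # per-CPU Whetstone result from the first post
FPOPS_ESTIMATE = 5.0e13    # hypothetical value carried in a task header

# Same task, same benchmark: only the stored DCF changes the projection.
for dcf in (1.0, 1.68, 2.5):
    hours = estimated_runtime_hours(FPOPS_ESTIMATE, WHETSTONE_MIPS, dcf)
    print(f"DCF {dcf:.2f} -> estimated {hours:.1f} h")
```

The point of the sketch is that neither the fpops figure nor the benchmark has to change for the displayed estimate to more than double; the DCF alone does it.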
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
So I was partly right. HCMD2 is not able to accurately estimate the fpops for any given WU. So when that (possibly) wildly inaccurate estimate is sent on the header and it is wrong, the computer that gets that WU has its DCF pooched for the next 7-10 days until it can be worked out of the system. Thanks for the feedback, Sek.
|
||
|
|
JmBoullier
Former Community Advisor Normandy - France Joined: Jan 26, 2007 Post Count: 3716 Status: Offline Project Badges:
|
"So when that (possibly) wildly inaccurate estimate is sent on the header and it is wrong, the computer that gets that WU has its DCF pooched for the next 7-10 days until it can be worked out of the system."
It's not measured in days, it's measured in number of WUs. If some WU pushes the DCF from 1 to 2, it will take about 20 accurately sized WUs to bring the DCF back to close to 1. If the machine is a single-core one which processes normal WUs in 12 hours, that will mean 10 days. If it is an 8-thread i7 which processes normal WUs in 4 hours, that will mean less than half a day.
The "problem" with BOINC is that a longer WU raises the DCF full size while a shorter one decreases it by only 10% of the difference. It was a conscious choice made to protect the clients from overcommitting.
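The asymmetric rule described above can be simulated in a few lines. This is a simplified sketch of the rule exactly as stated in this post (jump straight up, come down by 10% of the difference), not the actual BOINC client code; the function names are illustrative.

```python
def update_dcf(dcf, ratio, decay=0.1):
    """Apply one finished task; ratio = actual run time / raw estimate."""
    if ratio > dcf:
        return ratio                      # a longer task raises the DCF full size
    return dcf + decay * (ratio - dcf)    # a shorter one closes 10% of the gap


def simulate(start_dcf, ratios):
    """Run a sequence of finished tasks through the update rule."""
    dcf = start_dcf
    for ratio in ratios:
        dcf = update_dcf(dcf, ratio)
    return dcf


def wall_clock_days(n_tasks, hours_per_task, threads):
    """Days needed to finish n_tasks when `threads` tasks run in parallel."""
    return n_tasks * hours_per_task / threads / 24.0


# One oversized task doubles the DCF, then 20 accurately sized tasks follow.
dcf_after_spike = update_dcf(1.0, 2.0)
dcf_after_recovery = simulate(dcf_after_spike, [1.0] * 20)
print(f"after spike: {dcf_after_spike:.2f}, after 20 tasks: {dcf_after_recovery:.2f}")

print(f"single core, 12 h tasks: {wall_clock_days(20, 12, 1):.1f} days")
print(f"8 threads, 4 h tasks: {wall_clock_days(20, 4, 8):.2f} days")
```

With this rule the DCF after 20 accurate tasks is still a little above 1 (roughly 1.12), which is what "close to 1" amounts to here.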
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
So this same machine has processed 40+ WU since the last analysis. All the HFCC WU were completing in 5-7 hours. Last night I started to get WU estimated at 15 hours, and I have seen 7 WU estimated at 13-15 hours, even though every one of these "long" WU is finishing in about 6.5 hours. Now I wait for 70+ completed WU to get my DCF back to normal, except that is a wasted effort, because the server can at any time send me a WU that you, I and Santa Claus know is only going to take me 5-7 hours but HFCC has estimated to take 25, 35, 45... hours to complete. There has got to be a better solution than "about 5-45 hours (estimate)". It makes it hard to accurately maintain a reasonably sized cache when 3 days = 2 WU on a single core machine when you should get 6-7 WU.
|
||
|
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
Maybe WCG has figured out that your machine likes 6.5 hour HFCC jobs, though I'm totally unaware how they suddenly would be able to figure that out for this project. Others observe long and short, the long ones knocking the DCF off its hooves.
Here is a filter of HFCC jobs of my duo. Not many, as they are moved off fairly quickly after validation:
HFCC_ s2_ 00831608_ s2_ 0000_ 0-- Valid 21-3-10 04:27:06 22-3-10 09:25:02 13.27
HFCC_ s2_ 00834009_ s2_ 0000_ 0-- Valid 21-3-10 00:44:56 22-3-10 04:55:45 10.87
HFCC_ s2_ 00827787_ s2_ 0000_ 0-- Valid 20-3-10 21:24:32 21-3-10 19:10:01 8.52
HFCC_ s2_ 00821093_ s2_ 0000_ 0-- Valid 20-3-10 16:01:32 21-3-10 16:54:47 9.83
HFCC_ s2_ 00827301_ s2_ 0000_ 0-- Valid 20-3-10 15:14:14 21-3-10 09:30:27 7.19
HFCC_ s2_ 00813403_ s2_ 0000_ 0-- Valid 20-3-10 08:10:07 21-3-10 05:54:13 9.29
HFCC_ s2_ 00809744_ s2_ 0001_ 1-- Pending Validation 20-3-10 01:29:25 21-3-10 01:27:42 6.54
HFCC_ s2_ 00796511_ s2_ 0001_ 0-- Valid 19-3-10 20:08:02 20-3-10 19:26:07 9.68
HFCC_ s2_ 00794529_ s2_ 0001_ 0-- Valid 19-3-10 19:46:45 20-3-10 18:00:55 9.37
and the quad:
HFCC_ s2_ 00846862_ s2_ 0001_ 0-- Valid 21-3-10 09:31:33 22-3-10 20:05:02 15.84
HFCC_ s2_ 00847091_ s2_ 0001_ 0-- Valid 21-3-10 09:31:33 22-3-10 15:23:22 13.31
HFCC_ s2_ 00811852_ s2_ 0001_ 0-- Valid 20-3-10 07:52:23 22-3-10 07:55:39 8.16
HFCC_ s2_ 00811505_ s2_ 0000_ 0-- Valid 20-3-10 07:51:50 22-3-10 06:58:32 4.64
HFCC_ s2_ 00811519_ s2_ 0001_ 0-- Valid 20-3-10 07:52:05 22-3-10 06:58:32 5.21
HFCC_ s2_ 00811556_ s2_ 0000_ 0-- Valid 20-3-10 07:51:50 21-3-10 21:01:18 6.88
HFCC_ s2_ 00811556_ s2_ 0001_ 0-- Valid 20-3-10 07:51:50 21-3-10 20:13:44 6.81
HFCC_ s2_ 00811672_ s2_ 0001_ 0-- Valid 20-3-10 07:51:50 21-3-10 13:16:49 3.31
HFCC_ s2_ 00811558_ s2_ 0000_ 0-- Valid 20-3-10 07:51:50 21-3-10 13:16:49 3.04
HFCC_ s2_ 00794630_ s2_ 0000_ 0-- Valid 19-3-10 17:43:04 21-3-10 09:31:47 10.21
HFCC_ s2_ 00794602_ s2_ 0000_ 1-- Pending Validation 19-3-10 17:47:26 21-3-10 09:31:47 8.97
HFCC_ s2_ 00791709_ s2_ 0001_ 0-- Valid 19-3-10 17:37:42 20-3-10 23:52:43 6.61
HFCC_ s2_ 00783755_ s2_ 0001_ 0-- Valid 19-3-10 09:22:12 20-3-10 18:53:50 3.52
HFCC_ s2_ 00786346_ s2_ 0001_ 1-- Pending Validation 19-3-10 07:59:36 20-3-10 18:22:21 9.23
HFCC_ s2_ 00780713_ s2_ 0000_ 0-- Valid 19-3-10 05:14:00 20-3-10 14:58:16 6.69
Run times all over the place. I wonder, since my quad churns out HCC jobs at fairly regular run times, if it's these you're actually referencing. The quad, last 15:
X0000090670238200708162125_ 1-- Pending Validation 21-3-10 13:16:49 22-3-10 22:00:44 4.55
X0000090700390200708031418_ 1-- Pending Validation 21-3-10 23:55:37 22-3-10 18:12:06 4.68
X0000090710350200707200956_ 0-- Valid 22-3-10 06:58:16 22-3-10 18:11:45 4.89
X0000090680635200707261151_ 1-- Valid 21-3-10 15:57:48 22-3-10 13:09:15 4.77
X0000090670633200708021237_ 0-- Pending Validation 21-3-10 11:46:21 22-3-10 11:47:03 4.77
X0000090670620200708021237_ 0-- Pending Validation 21-3-10 11:45:56 22-3-10 07:55:39 4.45
X0000090670505200708021239_ 0-- Pending Validation 21-3-10 11:45:14 22-3-10 06:58:32 4.45
X0000090670497200708021239_ 1-- Valid 21-3-10 11:45:35 22-3-10 06:58:32 4.46
X0000090670511200708021239_ 1-- Pending Validation 21-3-10 11:45:14 21-3-10 22:09:03 4.49
X0000090601019200707171103_ 1-- Valid 20-3-10 07:51:50 21-3-10 20:13:44 4.73
X0000090601023200707171103_ 1-- Valid 20-3-10 07:51:50 21-3-10 17:27:59 4.73
X0000090601036200707171103_ 0-- Valid 20-3-10 07:51:50 21-3-10 15:57:47 4.67
X0000090601037200707171103_ 1-- Valid 20-3-10 07:51:50 21-3-10 12:16:20 4.67
X0000090601043200707171103_ 0-- Valid 20-3-10 07:51:50 21-3-10 11:45:13 4.81
X0000090601042200707171103_ 1-- Valid 20-3-10 07:51:50 21-3-10 11:45:13 4.88
If so, and there are these monster type HFCC in between, then that DCF is not coming down any time soon.
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
And so now, with the release of dddt2 type C WU, we see the true impact of the wacko DCF. I have completed a type C WU in a run time of 38:17 (mm:ss). Not bad. But just for kicks, let's see what the original estimate for that WU was when it was received: 48:33:21 (hh:mm:ss). Ridiculous. Oh yes, when the WU completed, it took a whole 21 minutes off the estimated run time for the other WU not yet started.
Setting my buffer to 10 days, I get 13 WU that my system thinks are going to fill my buffer for 10 DAYS (it has actually forced my system into high priority), and it will in fact be empty in less than 12 HOURS. Not happy.
[Edit 1 times, last edit by Former Member at Mar 23, 2010 2:27:59 PM]
||
|
|
|