Thread Status: Active
Total posts in this thread: 76
Posts: 76   Pages: 8   [ Previous Page | 1 2 3 4 5 6 7 8 | Next Page ]
This topic has been viewed 532356 times and has 75 replies
JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3715
Status: Offline
Re: Parents, children, grandchildren WUs - how does it work?

...as long as the result is returned within the 10 days, it seems strange to have the work unit cut off after 6/12 hours of crunching?

That was precisely one of the reasons (in addition to people not being very fond of endless WUs, myself first of all): during the beta tests it quickly became obvious that some WUs would exceed 10 days even on fast machines running 24/7. So imagine what would happen on slower machines, or on any machine not running 24/7.

If you want to figure it out a little better you can read through this thread of the Beta Test Support forum:
BETA_CMD2_0001-PP1BA.clustersOccur-TPM1A.clustersOccur_xx monster WUs

Cheers. Jean.
----------------------------------------
Team--> Decrypthon -->Statistics/Join -->Thread
[Oct 28, 2009 3:07:09 AM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: Parents, children, grandchildren WUs - how does it work?

The system really does suffer from the Tall Poppy Syndrome. It is the faster-than-average devices that have a bigger proportion of their work discarded. Hardly an incentive to people with faster machinery (... edits device profiles ...). The big losses are when a WU is crunched by a fast machine and reaches 100% in just under 10 (or even 12) hours, but the wingman has a very slow machine that hits its 6hr limit with only a small fraction of the WU done.
I understand the descriptions above of how the system works, and why it would be difficult to change it.
However, I have been thinking ... (dangerous)

WUs where all quorum members run to completion are not the problem. The waste arises where both members, or worse, only one of them, quit after 6 hours.
The greatest wastage is "caused" by the slowest crunchers, eg netbooks, older laptops and hyperthreading P4s. Their work isn't wasted, but that of their faster wingmen is, like the stereotypical whoever-you-want-to-vilify (mythical?) slow driver who causes others to have traffic accidents.

Thought 1: Have a verification system outside of the existing BOINC system. Initially run each parent WU as 2 independent single-redundancy WUs and automatically grant inconclusive status, or valid if necessary to get the WUs out of the BOINC system. Perform DIY validation external to the BOINC system, and re-issue new WUs to cover any parts of the original WU that did not validate. Too much work for the techs.

Thought 2: Wastage could be reduced using the current system if the amounts of work achieved on each WU by each cruncher were made more even, ie faster-than-average devices should truncate WUs earlier, while slow ones should crunch longer. The "60% rule" should use a variable percentage, to bring the amount of work done by every device to the same suitable amount.
Without changes to the BOINC system, this cutoff behaviour would have to be implemented within the science program (wcg_hcmd2_maxdo...), which in turn would have to know the relative speed of the device on which it was running.
I don't know how to convey that info to the running "maxdo", but as a last resort it could perform its own (mini)"BOINC benchmarks" and compare the results to fleet-average scaling parameters read from the WU input file.
If the science program can get the benchmark results from the BOINC client, that would be a less wasteful method.
Such synthetic benchmarks are not ideal, and the most accurate result would be based on the device's average performance for the project (hcmd2) compared to the fleet. I don't expect that a science program is permitted to query the WCG database, but perhaps it is possible for the BOINC system to periodically download a file containing (encrypted) performance scaling factors for a device.
I also assume that the WU input files are finalised before the target device is known, but perhaps the BOINC server could be tweaked to pop the target device's performance factor into the WU input files at the last moment.
These alterations would not eliminate all wasted CPU time, but they might stop a big proportion of the wastage. The very slow devices would still cause wastage if they can't reach the required amount of work in the max time allowed (now 12hrs). But presumably these machines are currently able to crunch WUs from projects other than HCMD2 and Rice, so set the new HCMD2 goal to the level of those projects.
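
A rough sketch of the idea in Thought 2, written in Python purely for illustration. It scales the cut-off time by the device's speed so that every device aims for the same amount of work. The function names, the 90-credit target, and the speeds in the example are all assumptions for illustration; nothing here is the actual science program or BOINC logic.

```python
def cutoff_hours(credits_per_hour, target_credits=90.0, max_hours=12.0):
    """Hours a device should crunch a long WU before truncating it.

    credits_per_hour: the device's speed (hypothetical input, e.g. from a
                      benchmark or a server-supplied scaling factor).
    target_credits:   assumed common work target for every device.
    max_hours:        hard cap on run time (12 h in the scheme discussed here).
    """
    if credits_per_hour <= 0:
        raise ValueError("credits_per_hour must be positive")
    return min(target_credits / credits_per_hour, max_hours)


def work_done(credits_per_hour, hours):
    """Credits' worth of work completed after the given number of hours."""
    return credits_per_hour * hours


# Illustrative speeds only; the slow device is a made-up example.
for name, speed in [("fast quad", 27.0), ("dual core", 16.0), ("slow device", 5.0)]:
    hours = cutoff_hours(speed)
    print(f"{name}: stop after {hours:.2f} h, {work_done(speed, hours):.0f} credits done")
```

In this sketch the two faster devices stop at the same amount of work, while the slow device still runs into the 12-hour cap before reaching the target, which is the residual wastage described above.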

Meanwhile, perhaps the owners of slow machines should be discouraged from selecting HCMD2, even though it has the smallest memory footprint and is the most productive current project to run on machines with limited CPU cache. P4 owners should disable HT if they run HCMD2. Owners of faster machines should be aware that a significant amount of their HCMD2 effort will be discarded.
Your comments will be interesting.
---------
PS: @Sekerob: I got "GG-children" WUs on an Intel C2Q (Windows XP-64) and less often on my AMD Athlon64 X2 s939 (Windows 2000). Work-caches are 0.9 days or less.
----------------------------------------
[Edit 4 times, last edit by Rickjb at Oct 28, 2009 4:57:53 PM]
[Oct 28, 2009 9:33:06 AM]
JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3715
Status: Offline
Re: Parents, children, grandchildren WUs - how does it work?

Rick, you are right, but with the threshold at 6 hours there are now many more WUs that can complete, and with the complexity expected to decrease as the project progresses there should be more and more. Even my P4 HT running dual tasks is able to complete sometimes. Out of 32 results in my Status page for this device, 4 reached completion: one after 7.45 hours, another at 6.61... and the other two at 5.35 and 2.69!

I have also checked 4 pages of results for my quad (60 WUs) and only one was paired with an interrupted one.

Lastly, if the waste ratio turned out to be really outrageous, the techs could always decide to move the limits up to 7 and 14 hours.
[Oct 28, 2009 10:44:04 AM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: Parents, children, grandchildren WUs - how does it work?

"with the complexity expected to decrease as the project progresses ..." Maybe, but according to Sekerob's charts, HCMD2 is not scheduled to end until June 2011, so there are still many WUs to come, and I expect that between now and then much crunching will be wasted unless improvements are made.

[Edit]: "Even my P4 HT running dual tasks is able to complete sometimes." It's not the ones that a slow machine completes that are important, because the usually-faster wingman will have done the same amount of work, ie 100%, so nothing is wasted. It's the ones where you bump into the 6hr limit that cause wastage. Any work that the wingman does in excess of your 6hrs worth gets discarded. Add up the amounts of credit granted to wingmen above what was granted to you for the WUs where you hit the limit to get an estimate of the wastage that you have "caused". Check with the techs that I am correct, but if I am and you want to keep running HCMD2 on the P4, please disable HT.

I just went through 4 pages of my valid HCMD2 results (4 x 15 per page). 9 of the 60 results (15%) either cut off at 6.0 hours, or that happened to the wingman. I have noted the amounts of difference between my Credits Awarded and those of the wingmen. Negative amounts are where the wingman was awarded more credit than me, ie some of his work was wasted rather than mine:
Athlon64 x2: 20, 77, 36, 33, -20 = 5 WUs with loss, 186 credits involved. Total 30 WUs returned by rjb-a64x2.
At an average of about 16 credits/hr, loss = 11.6 hrs = 23 min/WU over 30 WUs.
Intel Yorkfield quad: 98, 89, 40, 57 = 4 WUs with loss, 284 credits involved. Total 30 WUs returned by this C2Q.
At an average of about 27 credits/hr, loss = 10.5 hrs = 21 min/WU over 30 WUs.
In theory, the further your machine is above the fleet average, the greater the amount of your work that is wasted.
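
The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch: the helper name `wasted_time` is made up, and summing absolute values is an assumption that reproduces the "186 credits involved" total, where the one negative entry is the wingman's loss rather than mine.

```python
def wasted_time(credit_diffs, credits_per_hour, total_wus):
    """Return (credits involved, hours lost, minutes lost per WU returned).

    credit_diffs:     differences in Credits Awarded for the affected WUs.
    credits_per_hour: approximate average earning rate of the machine.
    total_wus:        all WUs returned by the machine in the sample.
    """
    credits = sum(abs(d) for d in credit_diffs)
    hours = credits / credits_per_hour
    minutes_per_wu = hours * 60 / total_wus
    return credits, hours, minutes_per_wu


athlon = wasted_time([20, 77, 36, 33, -20], credits_per_hour=16, total_wus=30)
quad = wasted_time([98, 89, 40, 57], credits_per_hour=27, total_wus=30)

for name, (credits, hours, minutes) in [("Athlon64 x2", athlon), ("Yorkfield quad", quad)]:
    print(f"{name}: {credits} credits, {hours:.1f} hrs, {minutes:.0f} min/WU")
```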

I have taken the C2Q off HCMD2 and will reduce the participation of rjb-a64x2 soon, and decommission the machine soon after that.

[Off-topic]: @JMBoullier and other P4 owners: If you want to reduce the running temperature and power consumption of that P4, you might try "undervolting"* it. My first encounter with a P4 was a 530 model (3.0GHz, HT) on which I was able to reduce CPU potential from 1.3875V to 1.1875V. It now runs much cooler.
*Apologies to the family of the late Alessandro Volta for taking liberties with the family's name.
----------------------------------------
[Edit 4 times, last edit by Rickjb at Oct 28, 2009 4:07:40 PM]
[Oct 28, 2009 12:59:48 PM]
JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3715
Status: Offline
Re: Parents, children, grandchildren WUs - how does it work?

My comments about the decreasing complexity and my P4 HT were meant to say that fewer and fewer WUs will be hit by the cut-off feature as the work becomes less demanding. This has already started for my P4, which not long ago was still being stopped by the limit on every WU.

knreed ended one of his posts with this paragraph
It is important to note though, that most workunits have no descendants. Those that do generally have a small number of structures that are computed by one host and not the other. There is a very small percentage of work that is 'lost' due to this technique.
and I tend to trust him. If it were more severe he would not hesitate to raise the cut off time as an emergency measure. From a server workload viewpoint his natural preference is for longer WUs.

If you cannot tolerate this "very small percentage" until he can implement some of the improvements he has outlined, I think you are making the right decision, i.e. removing HCMD2 from your list of projects. There is enough flexibility at WCG for balancing all projects fairly.

Happy crunching! Jean.

PS: My P4 HT is a Sony desktop, so obviously I am allowed to look at the voltages in the BIOS but not to touch them. :-)
Anyway it is rather stable at 55 °C while crunching its two tasks 24/7, so I think it is quite reasonable.
[Oct 28, 2009 6:23:48 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Re: Parents, children, grandchildren WUs - how does it work?

I had a bunch of administrative work that piled up after being at the conference last week. After doing that for a while I needed a break, so I've put some code in that is pairing up workunits to hosts that are similar in terms of CPU power and turnaround time. It will take a couple of days for the impact to be seen. However, this should significantly reduce the 'waste' for workunits on HCMD2. Additionally, people should start seeing their workunits validate more quickly after they are returned (this is being done for all projects except HPF2 and Rice).
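
For readers who want a picture of what matching similar hosts could look like, here is a minimal Python sketch. It is not the actual WCG server code: the `Host` fields, the sort key, and the example hosts are assumptions for illustration, and a real scheduler has to assign work as requests arrive rather than sorting a fixed list.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    credits_per_hour: float   # hypothetical stand-in for "cpu power"
    turnaround_days: float    # hypothetical average time to return a result


def pair_similar_hosts(hosts):
    """Pair hosts that are neighbours when ordered by speed, then turnaround.

    With an odd number of hosts, the last one in the ordering stays unpaired.
    """
    ordered = sorted(hosts, key=lambda h: (h.credits_per_hour, h.turnaround_days))
    return [(ordered[i], ordered[i + 1]) for i in range(0, len(ordered) - 1, 2)]


# Made-up example fleet.
fleet = [
    Host("quad", 27.0, 0.9),
    Host("netbook", 4.0, 3.0),
    Host("dual core", 16.0, 0.9),
    Host("P4 HT", 6.0, 1.5),
]

pairs = pair_similar_hosts(fleet)
for a, b in pairs:
    print(f"{a.name} <-> {b.name}")
```

With hosts of similar speed in the same quorum, both copies of a workunit tend to stop at about the same point, so little of either host's work goes beyond what the other has done.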
[Oct 28, 2009 9:27:11 PM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: Parents, children, grandchildren WUs - how does it work?

@knreed: Thank you for your post, and for the work done to pair up hosts of similar capability.
Did you consider my idea of having long HCMD2 WUs cut themselves off when they estimate that they have performed a targeted amount of work (say 90 credits' worth), rather than at a fixed amount of CPU time (6h)?
Your new system should eliminate much of the wastage, but the situation will still arise where host A has done just under 60% at 6h and stops, while host B has done just over 60% and wastes the time it spends going from host A's percentage to the end of the WU.
The relative effectiveness of these 2 systems would depend on the accuracy with which WUs could calculate the true amount of work they have done, vs your accuracy in pairing up similar hosts, plus the effects of WUs that still get caught by the 6h split. Of course, I don't know the amount of work needed to implement each system.
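
The edge case with two nearly identical hosts can be sketched in Python. The rule modelled here (stop at 6 h if below 60%, otherwise run on for at most 12 h) is only a reading of this thread, not a statement of the actual implementation, and the two speeds are made-up values.

```python
def final_progress(pct_per_hour, check_hours=6.0, threshold_pct=60.0, max_hours=12.0):
    """Percent of the WU completed when the task stops, under the modelled rule."""
    at_check = pct_per_hour * check_hours
    if at_check < threshold_pct:
        # Below the threshold at the check point: the WU is truncated here.
        return at_check
    # Otherwise the task runs on, ending at 100% or at the time cap.
    return min(100.0, pct_per_hour * max_hours)


host_a = final_progress(9.9)    # just under 60% at 6 h, so it stops
host_b = final_progress(10.1)   # just over 60% at 6 h, so it continues
discarded = host_b - host_a     # work by host B beyond host A's stopping point

print(f"host A stops at {host_a:.1f}%, host B at {host_b:.1f}%")
print(f"host B's work beyond host A: {discarded:.1f}% of the WU")
```

So even with almost identical speeds, a time-based check can still split a pair in this model, whereas a cutoff based on work done would stop both hosts at about the same point.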

@JmBoullier: After knreed's changes, it should be OK to leave the HT of your P4 on.
It might be interesting to go through its valid results now, and list the amounts of difference in Credits Awarded to you vs to your wingman, for all the WUs where you or the wingman hit the 6h limit, ie wastage under the old system. Then, when enough results under the new system have been validated, repeat the exercise to check the effectiveness of the changes.
----------------------------------------
[Edit 2 times, last edit by Rickjb at Oct 29, 2009 10:05:02 AM]
[Oct 29, 2009 6:59:53 AM]
martin64
Senior Cruncher
Germany
Joined: May 11, 2009
Post Count: 445
Status: Offline
Re: Parents, children, grandchildren WUs - how does it work?

Did you consider my idea of having long HCMD2 WUs cut themselves off when they estimate that they have performed a targeted amount of work (say 90 credits' worth), rather than at a fixed amount of CPU time (6h)?

This would require changes in the client - definitely more effort and more complicated than just changing the distribution process on the server side...

If the WUs are sent out to computers of equal performance, there should be no significant difference between a point-based and a time-based cutoff, as the performance would likely be compared on the point-to-time ratio anyway.

Regards,
Martin
----------------------------------------

[Oct 29, 2009 9:47:38 AM]
Movieman
Veteran Cruncher
Joined: Sep 9, 2006
Post Count: 1042
Status: Offline
Re: Parents, children, grandchildren WUs - how does it work?

Interesting thread.
Hey guys, send those WU's over to us at XS..
We've got good machines, and as for turning them off, all I can say is that we did have ONE member turn off a machine a couple of years back. Not that people on the team got too upset, but last I heard his doctor said the guy should recover and be able to walk soon.. :D

All tongue in cheek but I do think it's an excellent idea to match machines.
----------------------------------------

[Oct 29, 2009 10:07:05 AM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Re: Parents, children, grandchildren WUs - how does it work?

Did you consider my idea of having long HCMD2 WUs cut themselves off when they estimate that they have performed a targeted amount of work (say 90 credits' worth), rather than at a fixed amount of CPU time (6h)?


This would require an application change and a change to the BOINC API. We originally did discuss implementing it this way, but the challenges were too great.

We will get the vast majority of the benefit though from the matching that is now taking place.
[Oct 29, 2009 10:39:07 AM]