Total posts in this thread: 117
This topic has been viewed 823179 times and has 116 replies
Mysteron347
Senior Cruncher
Australia
Joined: Apr 28, 2007
Post Count: 179
Status: Offline
Re: Looong running WUs

It seems to me, Mysteron, that the techs can't win on this one.
On this sub-forum, they're taking heat for not introducing a hard time stop on the units. If you visit HCMD-2 or CEP-2, you will find threads that frequently pop up to announce that the hard stops waste cycles and they want an option to run until it's done, no matter how long it takes.
This is the method the techs and scientists came up with for this project, and it was probably their best option.
In short... suck it up


The issue with HCMD2 is related to the wastage of returned results. Fundamentally, it's a good system - it just requires adjustment.

What I've proposed is this (some people need a more graphical approach) - colours refer to the cruncher

1/ Two units are despatched, each of 10,000 positions

2/ Suppose one cruncher returns 40% complete (hit 6-hr wall) and the other returns 85% (hit 12-hr wall)
U_0 (0 - 9999, 4000 complete, 3600 sec)
U_1 (0 - 9999, 8500 complete, 7200 sec)

3/ When these two results are presented to the validator, what should happen is (run-time pro-rated between splits)
U_0 (0 - 3999, 4000 complete, 3600 sec) - to validator
U_1 (0 - 3999, 4000 complete, 3388 sec) - to validator

U_4000_8499_1 (4500 complete, 3812 sec) - retained result
U_4000_8499_0 - child despatched
U_8500_9999_0 - child despatched
U_8500_9999_1 - child despatched

4/ Let's assume that all of these are sent to crunchers identical to the red cruncher
The results available would be
U_4000_8499_1 (4500 complete, 3812 sec) - retained result
U_4000_8499_0 (4000 complete, 3600 sec)
U_8500_9999_0(1500 complete, 1350 sec)
U_8500_9999_1(1500 complete, 1350 sec)

5/ This would be processed as:
U_4000_7999_1 (4000 complete, 3388 sec) - to validator
U_8000_8499_1 (500 complete, 424 sec) - retained result
U_4000_7999_0 (4000 complete, 3600 sec) - to validator
U_8000_8499_0 - grandchild unit despatched
U_8500_9999_0(1500 complete, 1350 sec) - to validator
U_8500_9999_1(1500 complete, 1350 sec) - to validator

6/ The grandchild unit returns completed
U_8000_8499_1 (500 complete, 424 sec) - to validator
U_8000_8499_0 (500 complete, 300 sec) - to validator

Note that since the return on the blue cruncher is split three times, NO work is discarded. Even if the validator objects to any particular pair, or one destination cruncher drops out of sight, only the MINIMUM amount of work is duplicated (say, if the orange cruncher fails to respond, then only THAT cruncher's work needs to be sent out again).
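The split-and-pro-rate step in (2)-(3) above can be sketched as follows. This is a minimal illustration of the idea, not anything WCG runs; the function and field names are mine, and it assumes a quorum of 2:

```python
def split_returns(total, done_a, secs_a, done_b, secs_b):
    """Split two partial returns over positions 0..total-1 at the point the
    slower cruncher reached.  Run-time for each fragment is pro-rated by
    position count, as in step 3 of the scheme above."""
    # Order the returns so (n_lo, t_lo) is the less-complete one.
    (n_lo, t_lo), (n_hi, t_hi) = sorted([(done_a, secs_a), (done_b, secs_b)])
    prorate = lambda n, done, secs: round(secs * n / done)
    return {
        # Both crunchers covered 0..n_lo-1: a complete pair for the validator.
        "to_validator": [
            (0, n_lo - 1, n_lo, t_lo),
            (0, n_lo - 1, n_lo, prorate(n_lo, n_hi, t_hi)),
        ],
        # The faster cruncher's extra positions are kept until a second copy arrives.
        "retained": [(n_lo, n_hi - 1, n_hi - n_lo, prorate(n_hi - n_lo, n_hi, t_hi))],
        # Children: one more copy of n_lo..n_hi-1, two copies of the untouched tail.
        "children": [(n_lo, n_hi - 1), (n_hi, total - 1), (n_hi, total - 1)],
    }

# The worked example: 40% and 85% returns on a 10,000-position unit.
r = split_returns(10000, 4000, 3600, 8500, 7200)
```

With those inputs, the pro-rated run-times come out to the 3388 sec and 3812 sec figures in the example, and the three child ranges match the despatched children.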

I'm not attached to the numbering scheme - I hope it simply demonstrates what I'm getting at. Given that a returned unit COULD be split in this way (and I've not received a clear 'Yes' or 'No' in this regard), I believe it would give us the best of all worlds - the 6-hr target for the majority like-to-see-some-progress crew, and NO wasted work. EVER.

The same principle could be extended to projects with a quorum greater than 2 - with minor adjustments.

Another advantage: if the ACTUAL run-time and processed-positions count from the initial return for each processor were recorded into a moving-average system (say, the last 20 returns per project), then a real measure of relative speed would be achieved.

4000 complete, 3600 sec = 4000 in 6hr
8500 complete, 7200 sec = 4250 in 6 hr
4000 complete, 3600 sec = 4000 in 6hr
1500 complete, 1350 sec = 4000 in 6hr
1500 complete, 1350 sec = 4000 in 6hr
500 complete, 300 sec = 6000 in 6hr

So sending units of max. size 4000 for the red crunchers, 4250 for blue and 6000 for indigo would mean ~6-hr run-times. The crunchers themselves could thus control the size of the generated new units, self-adjusting depending on the difficulty of the work being processed.
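The moving-average idea can be sketched like so. Note this keeps the post's own figures, where the 6-hr wall corresponds to 3600 "sec" in the examples; the `CruncherSpeed` class and its 20-return window are illustrative assumptions, not a real WCG component:

```python
from collections import deque

class CruncherSpeed:
    """Moving average of positions/sec over the last `window` returns,
    used to size the next unit to hit a target run-time."""
    def __init__(self, window=20, target_secs=3600):
        self.rates = deque(maxlen=window)   # oldest returns fall off automatically
        self.target_secs = target_secs

    def record(self, positions_done, runtime_secs):
        self.rates.append(positions_done / runtime_secs)

    def next_unit_size(self):
        avg = sum(self.rates) / len(self.rates)
        return round(avg * self.target_secs)

red = CruncherSpeed()
red.record(4000, 3600)      # the red cruncher's return
indigo = CruncherSpeed()
indigo.record(500, 300)     # the grandchild return
```

With the figures in the list above, red would be sent units of ~4000 positions and indigo ~6000, matching the per-cruncher rates worked out in the post.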

A superficial analysis might conclude that this would lead to a large number of small units. It would not. A small unit would only ever be created when two nearly-identical crunchers hit the same wall (or wall-combination), and hitting walls is apparently relatively rare (<15% on figures I saw once).

It should be a relatively simple matter for the slowest 10% (say) to be allotted the smallest units, and the overclocked, supercharged, water-cooled speed-demons the largest available units. This would make better use of the tortoises, which currently may even OBSTRUCT processing by returning late results. This is one of the reasons I advocate relaxing the deadline at the beginning of a project - to allow the slower, older processors the opportunity to return useful work. As the project nears completion, bring in the deadline and move the slower processors onto newer projects with a still-extended deadline.
[Sep 14, 2011 5:40:22 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Looong running WUs

Sensitive point of course - volunteers paying their own electricity and using free time (though the bulk run set-and-forget) - but it's IBM that's footing the dollars for the salaried staff. By my count, WCG still has the same number of assigned project staff as when there were 3-4 projects running, the grid costing multi-million dollars annually anyway. Soon there will be 11 concurrent sciences on an endless variety of device/OS mixes, and a few hundred thousand volunteers always seeking the latest gizmo tweak. Soon too, the last 0.01% of P2/P3 machines will drop off the consideration board. The science complexity moves forward... it can't wait on 20th-century technology.

Until WCG has put the dynamic task-sizing technology and know-how in place, I for one crunch what's on offer, in the expectation that the tasks will be sized down after the early cuts of what we now crunch have been worked through. I'd be disappointed, but would still continue, if it can't be done now. DSFL runs as top shareholder of our crunch time, whilst the weight is clearly still set to go pretty much only to exclusive crunchers. Till then...

Crunching On

--//--
[Sep 14, 2011 5:57:58 PM]
KWSN - A Shrubbery
Master Cruncher
Joined: Jan 8, 2006
Post Count: 1585
Status: Offline
Re: Looong running WUs

The reason for this thread is not necessarily people complaining about long crunch times. I think most people would be fine with the 11-12 hour results if that is what was initially announced. The reason they're pointing out the discrepancies is because the techs stated they were after a 6 hour target.

We'll get there, it's still early in the process and refinements are in the works. The round of betas proves they're working on solutions. Again, this wouldn't be an issue if someone in authority stated that the tasks are expected to run from 6-20 hours. People with slow machines expect them to exceed the predicted run times.

On that note (and Sekerob's mentioning of it), I just retired my last PIII. I'm expecting an i7-2600k today, and the case is ready and waiting. The PIII hadn't died yet, but once I turned it off I was amazed at how noisy its fans had been. I'd say it didn't have much longer to live if I had let it keep running.
----------------------------------------

Distributed computing volunteer since September 27, 2000
[Sep 14, 2011 6:46:20 PM]
uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Status: Offline
Re: Looong running WUs

I am working on a new estimator/work-unit splitter that should give us a better average across targets. I am using the data from last week's beta test to build a generalized curve based on CPU averages per job. This means that if, in the beta, target 7 was 50% larger than target 1, it would be cut up into work units with fewer jobs each, in hopes of achieving 6-hour runtimes.

Our initial analysis of the randomly selected targets was that they ran about the same on average for similar ligands (jobs).

Thanks,
-Uplinger
[Sep 14, 2011 9:04:34 PM]
uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Status: Offline
Re: Looong running WUs

Greetings all,

We have implemented the latest work unit splitter starting with target 16.

Here are some of the average runtimes per job (in seconds) for each target. Our previous estimates were right around 300 seconds per job, so based on each of these figures we are increasing the weight of a job's estimate accordingly.

| target_00000016.pdbqt | 324.287492727273 |
| target_00000017.pdbqt | 464.999171388103 |
| target_00000018.pdbqt | 310.607246987952 |
| target_00000019.pdbqt | 354.410957919255 |
| target_00000020.pdbqt | 343.083308498583 |
| target_00000021.pdbqt | 322.898007062147 |
| target_00000022.pdbqt | 373.51864527027 |
| target_00000023.pdbqt | 353.415841040463 |
| target_00000024.pdbqt | 424.301083333333 |
| target_00000025.pdbqt | 316.467465625 |
| target_00000026.pdbqt | 420.106759046053 |
| target_00000027.pdbqt | 307.070412096774 |
| target_00000028.pdbqt | 370.619392114094 |
| target_00000029.pdbqt | 308.85312554945 |
| target_00000030.pdbqt | 255.667741489362 |
| target_00000031.pdbqt | 471.703388043478 |
| target_00000032.pdbqt | 376.290288043478 |
| target_00000033.pdbqt | 388.217299368687 |
| target_00000034.pdbqt | 287.745596355685 |
| target_00000035.pdbqt | 377.516415 |
| target_00000036.pdbqt | 382.509401470588 |
| target_00000037.pdbqt | 344.251103879311 |
| target_00000038.pdbqt | 457.458003970588 |
| target_00000039.pdbqt | 377.971094134078 |
| target_00000041.pdbqt | 292.983208894879 |
| target_00000042.pdbqt | 358.218019398907 |
| target_00000043.pdbqt | 337.448498453608 |
| target_00000044.pdbqt | 272.720113333334 |
| target_00000045.pdbqt | 446.22657482806 |
| target_00000046.pdbqt | 293.214546957672 |
| target_00000049.pdbqt | 324.538654661017 |
| target_00000050.pdbqt | 416.241356325301 |
| target_00000051.pdbqt | 288.478359269662 |

To give an example, target 14, which did not use this new method, created about 40881 work units. Here are the values for the next 3 targets.

target 16: 44018
target 17: 61590
target 18: 42047

More work units means fewer jobs per work unit, which helps stabilize the average work-unit runtime.
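One way to read the weighting described above is as an inverse scaling of the job count per work unit. The 300-second baseline comes from the post; `jobs_at_baseline` is a made-up scale factor for illustration, not WCG's actual number:

```python
def jobs_per_workunit(avg_secs_per_job, jobs_at_baseline=72, baseline_secs=300.0):
    """Scale the number of jobs packed into a work unit inversely with the
    target's average seconds per job, so slower targets get smaller units."""
    return max(1, round(jobs_at_baseline * baseline_secs / avg_secs_per_job))
```

Under this sketch, target 17's ~465-sec average would get roughly 46 jobs per unit instead of 72 - the same direction as the 61590-vs-44018 work-unit counts above.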

Thanks,
-Uplinger
[Sep 15, 2011 6:09:23 PM]
genes
Advanced Cruncher
USA
Joined: Jan 28, 2006
Post Count: 132
Status: Offline
Re: Looong running WUs

As someone who ran CPDN for a long time, I don't see what the fuss is about. When one WU finishes, another one starts, so you're doing the same work whether you split it up into a lot of little WU's or a few bigger ones. I can see a concern if there were problems with WU's erroring out after running a long time, but the bad batch of WU's that we saw failed immediately so no time was wasted. I haven't seen a problem with these failing after running a long time.
[Sep 15, 2011 9:41:31 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Looong running WUs

The last 2.5 days of mean run times, in hours (decimal fraction):

10.32169 14th
10.42489 15th
10.40692 16th

Once we get to target 16's work, we will hopefully see a more permanent drop in these averages.
[Sep 16, 2011 2:40:15 PM]
RMau
Cruncher
Joined: Feb 6, 2008
Post Count: 44
Status: Offline
Re: Looong running WUs

As someone who ran CPDN for a long time, I don't see what the fuss is about. When one WU finishes, another one starts, so you're doing the same work whether you split it up into a lot of little WU's or a few bigger ones. I can see a concern if there were problems with WU's erroring out after running a long time, but the bad batch of WU's that we saw failed immediately so no time was wasted. I haven't seen a problem with these failing after running a long time.



genes,

I agree with you, crunching is crunching. It shouldn't matter if one WU takes 24 hours, or four WUs take 24 hours. In my case, it wasn't the elapsed time that a WU took to complete that bothered me, it was the difference between elapsed time and CPU time.

I was seeing a delta between Elapsed Time and CPU Time measured in hours for DSFL, not in minutes as I see on other projects. To me, those hours are time that would be more productive working on other projects.
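The delta Rick describes can be observed directly. A minimal sketch using Python's standard clocks (nothing project-specific - just the two clocks whose difference is the elapsed-vs-CPU gap):

```python
import time

def wall_vs_cpu(work):
    """Run work() and return (elapsed wall-clock secs, CPU secs consumed).
    A large gap means the process spent wall time not executing on the CPU:
    I/O waits, checkpoint writes, or being preempted by other load."""
    w0, c0 = time.perf_counter(), time.process_time()
    work()
    return time.perf_counter() - w0, time.process_time() - c0

# Sleeping burns wall time but almost no CPU time, so the two diverge:
elapsed, cpu = wall_vs_cpu(lambda: time.sleep(0.2))
```

A healthy compute-bound task should show the two numbers nearly equal; a multi-hour gap, as reported here, points at accounting or scheduling rather than wasted computation.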

Rick
----------------------------------------

[Sep 16, 2011 4:12:46 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Looong running WUs

Rick, I have asked the techs to look into this, along with other items collected over the past 2 weeks that kept returning in forum discussions. All but 1 item are being investigated; one is presently being prepped for alpha. I totally agree: if your computer is committed to run at 100%, taking 9 hours to do a job and only reporting 1:15 is a loss of credited time. BUT the task is valid and really did use the 9 hours... it's just not accounting for them correctly.

--//--
[Sep 16, 2011 4:21:48 PM]
genes
Advanced Cruncher
USA
Joined: Jan 28, 2006
Post Count: 132
Status: Offline
Re: Looong running WUs


I was seeing a delta between Elapsed Time and CPU Time measured in hours for DSFL, not in minutes as I see on other projects.

That, to me, looks like a bug somewhere. I am seeing about a 5 minute differential after 10 hours crunching a typical WU on this particular system (Intel Core Duo T2500 2GHz, not HT, Windows, 32-bit) with about 3 hours estimated to go. Are you seeing that behavior on all of your machines, Windows and Linux? I know you mentioned AMD, but I have none of those to compare with.
[Sep 17, 2011 3:07:30 AM]