World Community Grid Forums
Thread Status: Active Total posts in this thread: 3520
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12564 Status: Offline |
entity
That is not what I was trying to suggest. I was trying to suggest that the definition of 'reliable' be adjusted so that the fastest machines run 100% on the stragglers and the slower machines 100% on the later generations. As it happens, 20% of outstanding work is in generation 077 and earlier. If the definition could be changed to keep the fastest machines busy on those stragglers, they would not be let loose on the later generations, which might slow down the advancement of those generations. Including generation 078 with the stragglers would boost them to 40% of outstanding work, and mid-range machines could be kept away from the later generations. Otherwise I can see more and more work being needed for the stragglers, which would extend the end date. Personally, I wouldn't mind which classification my machine has. All contributions will be equally valuable. Obviously the techs will decide and I too will keep crunching regardless. Mike |
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7777 Status: Offline |
I ran into a little problem with upping my allocation of work units for ARP. On the main house machine (i7-3770, 8 GB memory), which my wife uses, it turns out I don't have enough memory when she is using the machine. Unbeknownst to me, when she gets on the machine, she has her browser pre-configured to open with about 8 tabs. Not a problem when I am only running 1 ARP unit. However, I upped it to 4 MCM and 4 ARP. Then came a series of ARP units which ran about 8-12 hours and ended with:
rsl_malloc failed allocating 24911668 bytes, called ..\external\RSL_LITE\rsl_bcast.c, line 270, try 1 : Not enough space
Turns out her browser usage was eating up enough memory to cause the ARP units to fail. I cut them back to 1 and now they are finishing fine again. I wish I had discovered it earlier, but there are close to a dozen which all finished in error. I am back to 1, which works just fine, so I think I will leave it there. I tried some on my Linux machines, but with the USBs running a live version, I bricked a couple of USB drives. Also the wi-fi to them tends to choke on the large file downloads and uploads for this project. At least I am getting some done. Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline |
With the growing spread of these workunits, I'm reminded of the last batches of FAH2. Each work unit was supposed to run 150 times, but some work ran another 30+ generations while the stragglers caught up. Imagine running 83 more cycles just for those six lagging work units! I'm hoping that was a one-time issue, and after generating a year's worth of weather data, the generations cease or are manually aborted. Once a unit reaches its final generation it will stop being run. |
||
|
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline |
I am curious if you know why those 6 are still on generation 1. The sequential aspect to this project on any one piece of land means there are limits to how fast we can get the last work unit for the project completed. Dr Camille Le Coz is the person who generates the input data for the project. She just finished her PhD and is taking a well-earned short holiday. Once she returns she will be getting us the inputs we need to resume running those jobs (we had generation 0 cached locally, but we need the later generations restored before we can run them). |
||
|
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline |
Since we are now running with 0 backlog of jobs to send, I've changed our backend processing, so that we are taking a generation that just validated and getting the next generation for that unit out and running within about 10 minutes. As a result there is a steady stream of jobs flowing out to clients.
Stragglers (anything older than 5 generations back) are set to a priority that causes them to only be sent to "reliable" hosts (which are hosts with a history of returning results quickly and that returned a number of consecutive jobs without errors). For units that completed in the past 48 hours we have the following:
hours is the number of hours between when the unit was loaded into BOINC and the time that it was validated and "unloaded" from BOINC. The only control that BOINC provides us for targeting jobs to different clients is the 'reliable' mechanism described above. We also only have one field (average turnaround time) for controlling which hosts qualify as a reliable host. We have to make sure that enough hosts qualify as reliable, relative to the number of jobs that need to be assigned to reliable hosts, so that we don't clog up the system by having too many jobs that need reliable. In order to keep things running smoothly we need to have about 4 times the computing power. |
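The straggler rule described above can be sketched roughly as follows. This is only an illustration, not the actual BOINC or World Community Grid scheduler code: the names `Host`, `is_straggler`, `is_reliable` and `can_send`, and both threshold values in `is_reliable`, are hypothetical. It also assumes "older than 5 generations back" means more than 5 generations behind the lead generation.

```python
from dataclasses import dataclass

# From the post: anything older than 5 generations back is a straggler.
STRAGGLER_GAP = 5


@dataclass
class Host:
    avg_turnaround_hours: float  # the one field the post says is tunable
    consecutive_valid: int       # consecutive jobs returned without error


def is_straggler(unit_generation: int, lead_generation: int,
                 gap: int = STRAGGLER_GAP) -> bool:
    """True if the unit is more than `gap` generations behind the lead."""
    return lead_generation - unit_generation > gap


def is_reliable(host: Host, max_turnaround_hours: float = 48.0,
                min_consecutive_valid: int = 10) -> bool:
    """Hypothetical thresholds; the real values are not given in the thread."""
    return (host.avg_turnaround_hours <= max_turnaround_hours
            and host.consecutive_valid >= min_consecutive_valid)


def can_send(unit_generation: int, lead_generation: int, host: Host) -> bool:
    """Stragglers go only to reliable hosts; other units go to any host."""
    if is_straggler(unit_generation, lead_generation):
        return is_reliable(host)
    return True
```

With a lead generation of 084, a unit on generation 078 would count as a straggler under this reading, while one on generation 079 would not.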
||
|
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline |
A different way to think about the spread of jobs is the following:
Given the current distribution:
The current average generation is 79.7. It is moving forward at a rate of about 0.23 per day. If you ignore the older stragglers due to the errors that were resumed a few months ago, the units would spread out in a normal distribution around this mean and that distribution would move forward at a steady pace. We are attempting to truncate the distribution by assigning the units lower on the distribution to faster hosts via the reliable mechanism. We will do that as much as we can. In order to help us with that we introduced a mechanism that will truncate the units higher on the distribution by holding the units in the lead generation until there are 350 of them (for example, the 084 tasks were just released a couple of hours ago). This will make it easier to identify how many generations behind the lead generation we need to flag as a straggler and give it a preferential assignment. [Edit 3 times, last edit by knreed at Aug 2, 2021 3:01:05 PM] |
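A minimal sketch of the hold on the lead generation, assuming units are simply buffered until the threshold is reached and then released together, after which later units in that generation pass straight through. The class name `LeadGenerationHold` and its methods are hypothetical and do not come from the actual backend.

```python
class LeadGenerationHold:
    """Buffer lead-generation units until `threshold` of them are waiting."""

    def __init__(self, threshold: int = 350):
        self.threshold = threshold
        self.held = []
        self.open = False  # becomes True once the generation is released

    def add(self, unit_id):
        """Return the list of units released by this arrival (may be empty)."""
        if self.open:
            # Generation already released: nothing is held back any more.
            return [unit_id]
        self.held.append(unit_id)
        if len(self.held) >= self.threshold:
            # Threshold reached: release everything that was waiting.
            self.open = True
            batch, self.held = self.held, []
            return batch
        return []
```

Each `add` call returns whatever becomes sendable at that moment, so the caller sees an empty list while the hold is filling and one batch when it opens.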
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Kevin, is the 79.7 an average or a median?
I just got about 20 of the 84s. About 12 to 24 hours from now, those 84s will become 20 85s. Is that going to achieve what you want, or is the front still going to move forward but in bursts rather than smoothly? Additionally, looking back over a day of downloads, I'm getting about 6% high priority work. That's across a little more than 100 downloads. [Edit 1 times, last edit by Former Member at Aug 2, 2021 3:08:11 PM] |
||
|
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline |
79.7 is an average (mean).
The front will still move forward with a new generation about every 4.4 days or so. However, when the new generation is first released, all of those units that were held will be released so there will be a small surge. That 350 units represents 1%. I will be able to control the % marked as needs reliable by using this technique and controlling the number of generations sent out at normal priority. I expect that once this stabilizes over the next week, I can drop the hold on the leading generation down to 0.5% (175 units), keep 5 generations at normal priority and have older generations set to be accelerated via the reliable mechanism while keeping the % needs reliable to around 15%. [Edit 2 times, last edit by knreed at Aug 2, 2021 4:54:01 PM] |
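The figures in this post can be cross-checked with a little arithmetic. The sketch below uses only numbers quoted in the thread (0.23 generations per day, 350 units being 1%); the function names are made up for illustration, and the total of 35,000 units is implied by those figures rather than stated outright.

```python
def days_per_generation(generations_per_day: float) -> float:
    """Days for the average generation to advance by one."""
    return 1.0 / generations_per_day


def implied_total_units(hold_units: int, hold_percent: float) -> float:
    """Total units implied if `hold_units` is `hold_percent` percent of them."""
    return hold_units * 100 / hold_percent


def hold_size(total_units: float, hold_percent: float) -> float:
    """Number of units corresponding to a hold of `hold_percent` percent."""
    return total_units * hold_percent / 100


pace = days_per_generation(0.23)       # a little over 4.3 days
total = implied_total_units(350, 1.0)  # units implied by "350 is 1%"
reduced = hold_size(total, 0.5)        # size of the proposed 0.5% hold
```

The pace comes out slightly under the quoted "about every 4.4 days", which is consistent given that 0.23 per day is itself a rounded figure.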
||
|
pwhidden
Cruncher USA Joined: Nov 17, 2004 Post Count: 32 Status: Offline |
entity, Your 6% must be the luck of the draw. Of the 15 work units on my system 4 are high priority. I suspect your 6% will pick up a bit in the next few days now that you are past your 999 download.
I haven't seen any 084s yet... but I only have 10 threads running ARP in my client farm. <grin> |
||
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12564 Status: Offline |
I see that 24182 units have cleared in the 3 days which means 48364 units validated, but about 51000 had been returned. Would the difference be errors, etc?
Of those 24182, 16421 were generations 077 and earlier, so 2/3 were stragglers. We do seem to be heading in the right direction. My last target date should be increased by a month to allow for only 16000 validated per day. Mike |
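The arithmetic in this post can be laid out as a short check. It assumes, as the post's own doubling implies, that each cleared unit corresponds to two validated results; the variable names are illustrative only.

```python
cleared_units = 24182        # units cleared over the 3 days
results_per_unit = 2         # assumption implied by 24182 -> 48364
returned_results = 51000     # approximate figure from the post
straggler_units = 16421      # cleared units in generation 077 and earlier
days = 3

validated_results = cleared_units * results_per_unit
unvalidated_gap = returned_results - validated_results  # errors, pending, etc.
straggler_share = straggler_units / cleared_units       # roughly two thirds
validated_per_day = validated_results / days            # roughly 16000

print(f"validated results: {validated_results}")
print(f"returned but not validated: about {unvalidated_gap}")
print(f"straggler share: {straggler_share:.1%}")
print(f"validated per day: {validated_per_day:.0f}")
```

The gap between results returned and results validated is only as precise as the "about 51000" figure it is derived from.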
||