Thread Status: Active
Total posts in this thread: 3520
This topic has been viewed 4352491 times and has 3519 replies
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12564
Status: Offline
Re: Work Available

entity

That is not what I was trying to suggest.

I was trying to suggest that the definition of 'reliable' be adjusted so that the fastest machines run 100% on the stragglers and the slower machines 100% on the later generations.

20% of outstanding work happens to be in generation 077 and earlier generations. If the definition could be changed to keep those machines busy on those stragglers, they would not be let loose on the later generation which might slow down the advancement of those generations.

Including generation 078 with the stragglers would boost them to 40% of outstanding work and mid-range machines could be kept away from the later generations.

Otherwise I can see there being more and more work needed for the stragglers which would extend the end date.

Personally, I wouldn't mind which classification my machine has. All contributions will be equally valuable.

Obviously the techs will decide and I too will keep crunching regardless.

Mike
[Aug 1, 2021 7:12:45 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7777
Status: Offline
Re: Work Available

I ran into a little problem with upping my allocation of work units for ARP. On the main house machine (i7-3770, 8 GB memory), which my wife uses, it turns out I don't have enough memory when she is using the machine. Unbeknownst to me, when she gets on the machine, she has her browser pre-configured to open with about 8 tabs. Not a problem when I am only running 1 ARP unit. However, I upped it to 4 MCM and 4 ARP. Then came a series of ARP units which ran about 8-12 hours and ended with:
rsl_malloc failed allocating 24911668 bytes, called ..\external\RSL_LITE\rsl_bcast.c, line 270, try 1
: Not enough space

Turns out her browser usage was eating up enough memory to cause the ARP units to fail. I cut them back to 1 and now they are finishing fine again. I wish I had discovered it earlier, but there are close to a dozen which all finished in error. I am back to 1 which works just fine, so I think I will leave it there.
I tried some on my Linux machines, but they run a live version from USB, and I bricked a couple of USB drives. Also, the Wi-Fi to them tends to choke on the large file downloads and uploads for this project. At least I am getting some done.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Aug 1, 2021 8:21:24 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Re: Work Available


With the growing spread of these workunits, I'm reminded of the last batches of FAH2. Each work unit was supposed to run 150 times, but some work ran another 30+ generations while the stragglers caught up.

Imagine running 83 more cycles just for those six lagging work units!

I'm hoping that was a one-time issue, and after generating a year's worth of weather data, the generations cease or are manually aborted.


Once a unit reaches its final generation it will stop being run.
[Aug 2, 2021 2:17:33 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Re: Work Available

I am curious if you know why those 6 are still on generation 1. The sequential aspect of this project on any one piece of land means there are limits to how fast we can get the last work unit for the project completed.


Dr Camille Le Coz is the person who generates the input data for the project. She just finished her PhD and is taking a well earned short holiday. Once she returns she will be getting us the inputs we need to resume running those jobs (we had generation 0 cached locally, but we need the later generations restored before we can run them).
[Aug 2, 2021 2:22:32 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Re: Work Available

Since we are now running with 0 backlog of jobs to send, I've changed our backend processing, so that we are taking a generation that just validated and getting the next generation for that unit out and running within about 10 minutes. As a result there is a steady stream of jobs flowing out to clients.

Stragglers (anything older than 5 generations back) are set to a priority that causes them to be sent only to "reliable" hosts (hosts with a history of returning results quickly and that have returned a number of consecutive jobs without errors).

For units that completed in the past 48 hours we have the following:

num_units  avg_hours  stddev_hrs  min_hrs  max_hrs  straggler
---------  ---------  ----------  -------  -------  ---------
    15001      100.0        74.7        9      619  No
     1816       66.0        62.3        8      487  Yes


hours is the number of hours between when the unit was loaded into BOINC and the time that it was validated and "unloaded" from BOINC.

The only control that BOINC provides us for targeting jobs to different clients is the 'reliable' mechanism described above. We also only have one field (average turnaround time) for controlling which hosts qualify as a reliable host.

We have to make sure that enough hosts qualify as reliable, relative to the number of jobs that must be assigned to reliable hosts, so that we don't clog up the system with too many jobs that need a reliable host. In order to keep things running smoothly we need to have about 4 times the computing power.
[Aug 2, 2021 2:42:38 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Re: Work Available

A different way to think about the spread of jobs is the following:

Given the current distribution:

count(*) generation
-------- ----------
6 001
...
1 036
3 037
1 038
0 039
2 040
3 041
9 042
1 043
10 044
4 045
8 046
4 047
4 048
4 049
8 050
8 051
6 052
8 053
4 054
14 055
12 056
6 057
12 058
8 059
10 060
10 061
11 062
11 063
12 064
5 065
15 066
19 067
19 068
23 069
15 070
24 071
12 072
27 073
47 074
89 075
346 076
1469 077
4598 078
8097 079
8677 080
6767 081
3504 082
1282 083
374 084


The current average generation is 79.7. It is moving forward at a rate of about 0.23 per day. If you ignore the older stragglers due to the errors that were resumed a few months ago, the units would spread out in a normal distribution around this mean and that distribution would move forward at a steady pace.

We are attempting to truncate the distribution by assigning the units lower on the distribution to faster hosts via the reliable mechanism. We will do that as much as we can. In order to help us with that we introduced a mechanism that will truncate the units higher on the distribution by holding the units in the lead generation until there are 350 of them (for example, the 084 tasks were just released a couple of hours ago). This will make it easier to identify how many generations behind the lead generation we need to flag as a straggler and give it a preferential assignment.
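
For anyone who wants to play with the numbers, here is a minimal sketch that recomputes the mean generation and the share of units that would count as stragglers from the rows listed above. It is not the backend's actual logic: the names `COUNTS`, `mean_generation` and `straggler_share` are made up for illustration, generations 002-035 are elided in the table and are simply left out, and "straggler" is read as "more than 5 generations behind the lead generation", which is only one interpretation of the rule described in this thread.

```python
# Visible rows of the table above: generation -> number of units.
# Generations 002-035 are elided in the post and are omitted here.
COUNTS = {
    1: 6, 36: 1, 37: 3, 38: 1, 39: 0, 40: 2, 41: 3, 42: 9, 43: 1,
    44: 10, 45: 4, 46: 8, 47: 4, 48: 4, 49: 4, 50: 8, 51: 8, 52: 6,
    53: 8, 54: 4, 55: 14, 56: 12, 57: 6, 58: 12, 59: 8, 60: 10,
    61: 10, 62: 11, 63: 11, 64: 12, 65: 5, 66: 15, 67: 19, 68: 19,
    69: 23, 70: 15, 71: 24, 72: 12, 73: 27, 74: 47, 75: 89, 76: 346,
    77: 1469, 78: 4598, 79: 8097, 80: 8677, 81: 6767, 82: 3504,
    83: 1282, 84: 374,
}


def mean_generation(counts):
    """Unit-weighted average generation over the listed rows."""
    total = sum(counts.values())
    return sum(gen * n for gen, n in counts.items()) / total


def straggler_share(counts, lead_generation, normal_generations=5):
    """Fraction of units more than `normal_generations` behind the lead.

    Assumption: a unit is a straggler when its generation is lower than
    lead_generation - normal_generations.
    """
    cutoff = lead_generation - normal_generations
    total = sum(counts.values())
    stragglers = sum(n for gen, n in counts.items() if gen < cutoff)
    return stragglers / total


if __name__ == "__main__":
    print(f"mean generation: {mean_generation(COUNTS):.2f}")
    for lead in (83, 84):
        share = straggler_share(COUNTS, lead)
        print(f"lead {lead:03d}: {share:.1%} of units flagged as stragglers")
```

On the listed rows the mean comes out close to the quoted 79.7. Under the stated reading of the rule, the flagged share is roughly 6.5% with 083 as the lead generation and roughly 19% once 084 counts as the lead, which shows how sensitive the "needs reliable" percentage is to where the cutoff sits.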
----------------------------------------
[Edit 3 times, last edit by knreed at Aug 2, 2021 3:01:05 PM]
[Aug 2, 2021 2:54:19 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Work Available

Kevin, is the 79.7 an average or a median?

I just got about 20 of the 84s. About 12 to 24 hours from now, those 84s will become 20 85s. Is that going to achieve what you want, or is the front still going to move forward but in bursts rather than smoothly?

Additionally, looking back over a day of downloads, I'm getting about 6% high priority work. That's across a little more than 100 downloads.
----------------------------------------
[Edit 1 times, last edit by Former Member at Aug 2, 2021 3:08:11 PM]
[Aug 2, 2021 3:06:17 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Re: Work Available

79.7 is an average (mean).

The front will still move forward with a new generation about every 4.4 days or so. However, when the new generation is first released, all of those units that were held will be released so there will be a small surge.

That 350 units represents 1%. I will be able to control the % marked as needs reliable by using this technique and controlling the number of generations sent out at normal priority.

I expect that once this stabilizes over the next week, I can drop the hold on the leading generation down to 0.5% (175 units), keep 5 generations at normal priority and have older generations set to be accelerated via the reliable mechanism while keeping the % needs reliable to around 15%.
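
The figures in this post hang together arithmetically, and a quick sketch makes that easy to check. It uses only numbers quoted in the thread (an advance of about 0.23 generations per day, and a hold of 350 units described as 1%); the function names are hypothetical and the result is a back-of-the-envelope check, not a description of how the scheduler computes anything.

```python
def days_per_generation(generations_per_day):
    """Days the front needs to advance by one generation."""
    return 1 / generations_per_day


def implied_total_units(hold_units, hold_percent):
    """Total unit count implied by a hold expressed as a percentage."""
    return hold_units * 100 / hold_percent


def hold_size(total_units, hold_percent):
    """Number of units held back for a given percentage of the total."""
    return total_units * hold_percent / 100


if __name__ == "__main__":
    # 0.23 generations per day is the rate quoted earlier in the thread.
    print(f"{days_per_generation(0.23):.2f} days per generation")

    # 350 units being 1% implies the total number of units in flight.
    total = implied_total_units(350, 1)
    print(f"{total:.0f} units in total")

    # Dropping the hold to 0.5% of that total.
    print(f"{hold_size(total, 0.5):.0f} units held at 0.5%")
```

One generation every 1 / 0.23 days is about 4.35 days, in line with the "about every 4.4 days" above, and a 1% hold of 350 units implies roughly 35,000 units in total, so a 0.5% hold is 175 units.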
----------------------------------------
[Edit 2 times, last edit by knreed at Aug 2, 2021 4:54:01 PM]
[Aug 2, 2021 3:34:25 PM]
pwhidden
Cruncher
USA
Joined: Nov 17, 2004
Post Count: 32
Status: Offline
Re: Work Available

entity, your 6% must be the luck of the draw. Of the 15 work units on my system, 4 are high priority. I suspect your 6% will pick up a bit in the next few days now that you are past your 999 download.

I haven't seen any 084s yet... but I only have 10 threads running ARP in my client farm. <grin>
----------------------------------------

[Aug 2, 2021 5:14:19 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12564
Status: Offline
Re: Work Available

I see that 24182 units have cleared in the 3 days which means 48364 units validated, but about 51000 had been returned. Would the difference be errors, etc?

Of those 24182, 16421 were generations 077 and earlier, so 2/3 were stragglers. We do seem to be heading in the right direction.

My last target date should be increased by a month to allow for only 16000 validated per day.

Mike
[Aug 2, 2021 5:55:32 PM]