| World Community Grid Forums
Thread Status: Active Total posts in this thread: 19
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Running 128 simultaneously and averaging 33 hours. I was thinking it would have been a lot worse than that. I'll take it and run with it.
|
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Once upon a time, during Clean Energy, there was a proposal for staggered starting, but now that I read of 128 running concurrently with no results crashing, there is no need. I do wonder, though, what happens if this beast is shut down and all 128 are started simultaneously.
|
||
|
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline Project Badges:
|
I haven't seen anything about starting up or restarting after a checkpoint. As I see it, a lot of those 128 would be trying to checkpoint at the same time when the next 12.5% has been completed and that might well be a problem. And if they were all to be at the same stage then there would be considerable bandwidth required when they all try to report at about the same time. Or a lot of queuing would take place.
Of course, if the machine were to be hibernated instead of being shut down you would not have the same problem with bunching because they would all restart from where they left off instead of back to the last checkpoint. Mike |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Quote: "Once upon a time during Clean Energy there was a proposal of staggered starting but with now reading 128 concurrent and no result crashing, no need, but I do wonder what happens if this beast is shutdown and started all 128 simultaneous."
Not much. I just put maintenance on it this morning and it came right back up, and all 128 were in a running state after about 2 minutes. Bandwidth isn't a problem with 1G fibre to the premises. The machine has 256GB of memory and all 8 memory channels are populated. The HD averages about 4Mb writes per second. All very manageable. The only real anomaly I have noticed is that the hardware interrupts are very high and take about 4% of the processing time.
These WUs do not bunch up. Even if you started all 128 at the same time, the inherent variability in run times guarantees they end and report singly. Same thing with checkpoints.
[Edit 1 times, last edit by Former Member at Jun 11, 2020 2:41:11 PM] |
||
|
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline Project Badges:
|
With 128 units all starting at the same time you would inevitably get some bunching. Say the difference in run times between first and last was 2 hours. In a perfect world, they would all be finishing at about 1 minute intervals. But we don't live in a perfect world. There would be bunching, especially near the middle of the spread. Maybe seconds apart, but still bunching, so uploading/reporting would overlap.
Mike |
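As a rough way to look at the spacing argument above, here is a minimal Python sketch. It takes the 128 units and the 2 hour spread from the post; the uniform random distribution, the fixed seed, and all function names are assumptions made purely for illustration, not measurements from the machine.

```python
import random


def even_interval_minutes(tasks, spread_minutes):
    """Gap between finishes if they were spread perfectly evenly."""
    return spread_minutes / tasks


def simulate_finish_offsets(tasks, spread_minutes, seed=0):
    """Random finish offsets in minutes within the spread.

    A uniform distribution is an assumption; real run times may differ.
    """
    rng = random.Random(seed)
    return sorted(rng.uniform(0, spread_minutes) for _ in range(tasks))


def count_close_gaps(sorted_offsets, window_minutes):
    """Count consecutive finishes that are closer together than the window."""
    return sum(
        1
        for earlier, later in zip(sorted_offsets, sorted_offsets[1:])
        if later - earlier < window_minutes
    )


if __name__ == "__main__":
    offsets = simulate_finish_offsets(128, 120)
    print("even interval (min):", even_interval_minutes(128, 120))
    print("gaps under 30 s:", count_close_gaps(offsets, 0.5))
```

With perfectly even spacing no gap falls under the window, while random offsets will generally produce some short gaps. Whether those count as "bunching" depends on how long each upload takes.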
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Define bunching... I never have more than 3 end within 5 minutes of each other. Is that a "bunch"? I say nay nay. It only takes 2 to 5 seconds per WU to transmit the entire set (60M) of files to WCG. Even if I had 20 (which I never do) uploading at the same time, they would be gone in less than a minute. My experience has been that the spread in runtimes is considerable: a minimum of 28 hours and a maximum of 54 hours, but the graph would look like a bell curve. 80% run in the 32 to 39 hour range. Bear in mind that the 128 thread machine is just one machine. There are 11 others running ARP1, varying between 8 and 16 simultaneous WUs, so they are uploading and downloading at the same time. The network link is mostly idle. Maxed out, the link can do about 130MB per second. So, unless there is a simultaneous upload of about 50 work units (which will never happen except at the end of a maintenance window), it isn't any kind of a problem.
[Edit 1 times, last edit by Former Member at Jun 11, 2020 7:32:11 PM] |
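The upload figures in this post can be sanity-checked with a few lines of Python. This is only a back-of-envelope sketch: it assumes "60M" means roughly 60 MB per work unit and that the full 130 MB per second of the link is available, which gives a lower bound rather than the 2 to 5 seconds per WU actually observed. The function name is made up for the example.

```python
def upload_seconds(work_units, mb_per_unit=60.0, link_mb_per_s=130.0):
    """Link-limited time to upload a batch of work units.

    Assumes ~60 MB per unit and a fully available 130 MB/s link,
    both taken from the post; real transfers will be slower.
    """
    return work_units * mb_per_unit / link_mb_per_s


if __name__ == "__main__":
    for batch in (1, 20, 50):
        print(batch, "WUs:", round(upload_seconds(batch), 1), "s")
```

Under these assumptions even a batch of 50 simultaneous uploads clears in well under a minute, which is consistent with the link being mostly idle.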
||
|
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline Project Badges:
|
So the 128 thread machine is a slow one. Working on the 80%, we have, say, 102 ending in a 7 hour window, so averaging 4 minutes apart. I had presumed it to be much faster than that because of your bandwidth.
Even spacing never happens in the real world but your upload speed seems to be sufficient to compensate for that and the spread of computing times is higher than I imagined. So my apologies. Mike |
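The "4 minutes apart" estimate can be reproduced as follows. This is a small sketch using only the numbers quoted in the thread (128 units, 80% of them finishing in the 32 to 39 hour range); the helper name is hypothetical.

```python
def mean_gap_minutes(total_tasks, fraction, window_hours):
    """Return (tasks in window, average minutes between finishes).

    Assumes the given fraction of tasks finishes evenly across the window.
    """
    tasks_in_window = round(total_tasks * fraction)
    return tasks_in_window, window_hours * 60 / tasks_in_window


if __name__ == "__main__":
    tasks, gap = mean_gap_minutes(128, 0.80, 39 - 32)
    print(tasks, "tasks, about", round(gap, 1), "minutes apart")
```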
||
|
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2346 Status: Offline Project Badges:
|
Quote: "So the 128 thread machine is a slow one."
You can't be serious, Mike. How long does it take for your machine to run 128 ARP1 tasks? Let's be real, entity's device is blowing yours out of the water.
Executing many ARP1s at the same time has a serious, detrimental impact on their runtimes on a machine. When running only one ARP1 my machine will mostly finish it in 16 hours, however when I run 5 ARP1s
[Edit 1 times, last edit by adriverhoef at Jun 12, 2020 2:06:46 PM] |
||
|
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline Project Badges:
|
By slow I was simply referring to the time per unit and not the huge output that 128 threads brings. I would not normally recommend more than 50% of threads for arp. Some of the problem is alleviated by the huge bandwidth that entity has.
If he only wants to run arp then that is fair enough, but if he wants to run other projects as well, it is better to spread them across all machines so each has a mixture rather than one project per machine. Personally, I have an i7-3770 with 8 threads which crunches 4 arp almost as fast as 1, but performance drops off above that, so I run a mix. My priority, currently, is opn followed by mcm, but I am keeping arp ticking over. Mike |
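The difference between time per unit and total output can be put in numbers with a short Python sketch. It uses only figures quoted earlier in the thread (128 concurrent units averaging 33 hours, and a single unit finishing in about 16 hours on another machine) and assumes steady-state running; the function name is hypothetical.

```python
def tasks_per_day(concurrent, hours_per_task):
    """Steady-state completed tasks per day for a machine."""
    return concurrent * 24 / hours_per_task


if __name__ == "__main__":
    print("128 at 33 h each:", round(tasks_per_day(128, 33), 1), "per day")
    print("1 at 16 h each:", round(tasks_per_day(1, 16), 1), "per day")
```

So a machine can be slower per unit and still return far more results per day.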
||
|
|
|