World Community Grid Forums
Thread Status: Active. Total posts in this thread: 22
Bearcat
Master Cruncher, USA. Joined: Jan 6, 2007. Post Count: 2803. Status: Offline.
Just want to know: if I run 10 out of 12 threads, can the computer handle it or will it choke?
----------------------------------------
Crunching for humanity since 2007!
Dataman
Ace Cruncher. Joined: Nov 16, 2004. Post Count: 4865. Status: Offline.
I doubt it would "choke" your computer, as the memory requirement is ~1 GB/WU, but it may choke your network with the very large upload files, especially when several ARP WUs try to upload concurrently. This is not a problem for those running only a few machines, but as I recall you have a large farm.
In my case I am running 29 machines, so my network is in a constant state of download/upload. When a long-running upload occurs, other projects' uploads go into a pending state and may also go into a backoff state for minutes (or hours).

It is a bit of a moot point, as you probably will not get 10 anyway with the current distribution.

Cheers
alanb1951
Veteran Cruncher. Joined: Jan 20, 2006. Post Count: 1317. Status: Offline.
TL;DR -- stick to a mixed WCG work-load; 10 of these at once might be very inefficient...
Dataman covers a lot of interesting stuff there, and things like network bandwidth and trying to avoid using swap space are indeed significant anti-choking factors. So if you choose a mixed WCG work-load you'll be fine.

However, there are other ways to choke a machine, including too much disk I/O and intensive memory access thrashing L3 cache. The effect of lots of L3 cache misses is especially obvious if one runs multiple MIP1 tasks at once. Not only do they wander around memory a lot (hence the misses) but they also have a higher proportion of instructions accessing main memory for data than [most of] the other current projects, so the misses relate to a larger number of overall instructions! There are posts in the Microbiome Immunity Project forum on that very topic...

I suspect that if you try to run 10 of these at once (or a mix of these and MIP1 only - see below) you'll find they run a lot slower than if you only run a couple! This is because the number of L3 cache misses gets very high and the CPUs spend more time in wait states, so you get through fewer instructions per second. ARP1 doesn't seem to have such a high proportion of data memory-accessing instructions, so the effect of cache misses isn't as immediately noticeable - however, I suspect that there will be a number of simultaneous ARP1 tasks beyond which the run-time increases would become unacceptable. (And if/when I can accumulate enough of them at once to test things, I'll try to find out, if no-one else does it beforehand!) As Dataman points out, you're unlikely to be able to collect 10 of these at a time at the moment unless you go out of your way to do so, so it probably won't be a problem!

Most other WCG projects don't tend to thrash L3 cache as much, even FAH2 and HST1. I have used Linux performance monitoring tools on an Intel i7-7700K and an AMD Ryzen 3700X to dig into CPU utilization stats, so this is based on more than just reported run times... The Intel box (4-core, 8-thread with 8MB L3 cache) is allowed at most 1 CPDN task, and at most 6 WCG tasks with a limit of 1 MIP1 and 1 ARP1 imposed via an app_config.xml file. (There's also GPU work going on, hence the "at most 6".) That mix seems to run without MIP1 or ARP1 tasks suffering serious performance hits. My Ryzen 3700X (8 cores, 16 threads, 32MB L3 cache divvied up as 2x16MB) is allowed double the above, and my observations on throughput are similar.

By the way, you may see a noticeable performance degradation on any applications that do large amounts of floating-point instructions if you enable hyperthreading - however, unless you also have a lot of cache-thrashing you'll probably manage to run more work with hyperthreading on, as performance is unlikely to drop by 50%!

Happy crunching - Al.

[Edited to re-order and rephrase some content.]
[Edited again in response to post from mdxi - "you'll also see" changed to "you may see" in acknowledgment of the fact my experiences were based on older hardware...]
[Edit 2 times, last edit by alanb1951 at Nov 4, 2019 3:22:08 AM]
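For anyone who hasn't set per-application limits before, a minimal app_config.xml along the lines alanb1951 describes might look like the sketch below. The short app names "mip1" and "arp1", the project directory path, and the overall cap of 6 are assumptions for illustration, not values confirmed in this thread; the exact short names your client uses can be read from client_state.xml or the BOINC event log, and the project-wide cap needs a reasonably recent BOINC client.

```xml
<!-- Sketch only: the <name> values and the overall cap are assumptions, not
     WCG-published values. Save as app_config.xml in the World Community Grid
     project directory (e.g. projects/www.worldcommunitygrid.org/), then use
     Options -> Read config files in the BOINC Manager to apply it. -->
<app_config>
    <!-- cap the whole project at 6 concurrent tasks, like the "at most 6 WCG tasks" above -->
    <project_max_concurrent>6</project_max_concurrent>
    <!-- at most one Microbiome Immunity Project task at a time -->
    <app>
        <name>mip1</name>
        <max_concurrent>1</max_concurrent>
    </app>
    <!-- at most one Africa Rainfall Project task at a time -->
    <app>
        <name>arp1</name>
        <max_concurrent>1</max_concurrent>
    </app>
</app_config>
```

Changes take effect as soon as the client re-reads its config files; no restart is needed.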
mdxi
Advanced Cruncher. Joined: Dec 6, 2017. Post Count: 109. Status: Offline.
> By the way, you'll also see a noticeable performance degradation on any applications that do large amounts of floating-point instructions if you enable hyperthreading - however, unless you also have a lot of cache-thrashing you'll probably manage to run more work with hyperthreading on, as performance is unlikely to drop by 50%!

I'd like to provide some actual data around this statement. It's something that gets repeated a good deal, but usually as an anecdote.

Earlier this year I benchmarked ZIKA, FAH2, and MCM1. I did 24 hour runs with SMT/HT off, and another 24 hours with it on. The shortest possible takeaway is that there was never any degradation of performance on WCG tasks from SMT being enabled.

I tested on the Ryzen 1600 and two configurations of Ryzen 2700 (stock and underclocked/undervolted). The smallest SMT performance uplift was 1.16X (2700, low-power, ZIKA). The largest uplift was 1.43X (2700, low-power, FAH2). The average uplift across all 9 runs was 1.28X.

I did not benchmark MIP1 this way, because it is well understood that the Rosetta suite used by MIP will cause cache thrashing once you exceed approximately (MB_OF_L3CACHE / 4) concurrent WUs.

I did test some non-WCG software. I tested the Stockfish chess engine, which was incredibly parallelizable. Its lowest SMT uplift was 1.4X, and the highest was 1.46X.

Finally, I tested the OpenFOAM computational fluid dynamics package. CFD is about as non-linear and FP-heavy as it gets, so if SMT was going to have a negative effect anywhere, you would expect to see it here. And I actually did -- sometimes. Going from 8 threads to 16 threads on the 2700 resulted in a 4.2% performance degradation. Going from 12 to 24 threads on the 3900X slowed things down by 1% (and I'm not rounding down there; the actual timings on the benchmark were 23.86s for 12 threads vs 24.00s for 24 threads). However, bizarrely, the 1600 bucked the trend with a 3.5% speedup when going from 6 to 12 threads. I don't know how, but I ran it twice and got the same numbers both times.

So yes. If you are doing physics-based simulations with a complexity on the order of describing the flow of compressible fluids around non-compressible objects, then you MAY see some VERY small slowdowns on modern hardware. For almost everything else, expect to simply get free performance by using all your threads.

[Edit 1 times, last edit by mdxi at Nov 3, 2019 6:35:14 AM]
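As a rough illustration (a sketch, not part of mdxi's benchmarks): the two rules of thumb used in this post - the SMT uplift ratio and the "L3 megabytes divided by 4" MIP1 ceiling - reduce to a couple of lines of arithmetic. The 32 MB cache size in the example is an assumed figure for illustration only.

```python
# Sketch of the two rules of thumb from the post above, not an official WCG tool.

def smt_uplift(wus_with_smt: int, wus_without_smt: int) -> float:
    """Throughput ratio: WUs finished in 24 hours with SMT vs. the same period without."""
    return wus_with_smt / wus_without_smt

def mip1_soft_limit(l3_cache_mb: int) -> int:
    """Rough number of concurrent MIP1 tasks before L3 cache thrashing sets in."""
    return l3_cache_mb // 4

print(smt_uplift(15, 10))    # 1.5 -> a "1.5X uplift" in the terminology used later in the thread
print(mip1_soft_limit(32))   # 8   -> roughly 8 MIP1 tasks on an assumed 32 MB L3 part
```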
hchc
Veteran Cruncher, USA. Joined: Aug 15, 2006. Post Count: 865. Status: Offline.
@mdxi, fascinating results, and thanks for doing all those tests. I'd be interested in also measuring power consumption and CPU temperature with HT/SMT off and then on.
fuzzydice555
Advanced Cruncher. Joined: Mar 25, 2015. Post Count: 89. Status: Offline.
I tested HT on/off power consumption on an older machine (Xeon X5650).
The result was that the power consumption increase was exactly the same as the points increase: something like +40% points = +40% power. Hopefully hyperthreading has gotten better over time; I haven't tested any newer chips.
alanb1951
Veteran Cruncher. Joined: Jan 20, 2006. Post Count: 1317. Status: Offline.
@mdxi
Thanks for lots of interesting data. You've been testing on different (and newer) machines than the ones on which I based my hyperthread observations, so I'm quite prepared to believe things are better now!

> I'd like to provide some actual data around this statement. It's something that gets repeated a good deal, but usually as an anecdote.

My comment was based on experimentation, by the way, but several years ago (pre-MIP1, and before I was familiar with any ways to do per-process performance monitoring!); the comment itself could've been made less "doom and gloom", and I've taken the liberty of changing it (and acknowledging that I've changed it...)

One question on your information - you refer to uplift, and I'd like to be sure I understand your numbers: are those numbers the amount of extra work you got by running twice as many threads, or a measure of how much faster individual jobs ran when not hyperthreading, or am I completely misunderstanding? (That wouldn't be a surprise...)

Once again, thank you!

Cheers - Al.
mdxi
Advanced Cruncher. Joined: Dec 6, 2017. Post Count: 109. Status: Offline.
> One question on your information - you refer to uplift, and I'd like to be sure I understand your numbers; are those numbers the amount of extra work you got by running twice as many threads or a measure of how much faster individual jobs ran when not hyperthreading

It's the ratio of WUs completed in 24 hours with SMT vs the WUs completed in 24 hours without SMT. So 15 WUs with SMT versus 10 WUs without would be a 1.5X uplift.

I have all this in a document, but I need to re-measure the 3900X power consumption numbers in it. The power numbers are too low because they predate people figuring out the 3900's voltage droop/CPU frequency micro-stutter behavior. All the benchmarking and performance data should be correct though. If you're interested, the doc is here. Just skip the undervolting/underclocking sections. As I said before, they are overly optimistic and not in line with real-world usage.

[Edit 1 times, last edit by mdxi at Nov 4, 2019 6:19:45 AM]
Former Member
Cruncher. Joined: May 22, 2018. Post Count: 0. Status: Offline.
Thanks for all the data, mdxi!

I would be really interested in the remeasured power consumption data of the 3900X. How about a comments section on your website, to be able to discuss things there?
mdxi
Advanced Cruncher. Joined: Dec 6, 2017. Post Count: 109. Status: Offline.
> Thanks for all the data, mdxi! I would be really interested in the remeasured power consumption data of the 3900X. How about a comments section on your website, to be able to discuss things there?

I haven't plugged one back up to the killawatt, but I can tell you that where I finally got everything stable was 3.4GHz with a Vcore of 1.01875. And what I mean by "stable" here is "the clocks hold at 3.39GHz under full load". At that clock and voltage, with the stock cooler, temperatures are between 58C and 62C. Next time I clean dust out of the HSF, I'll plug in the killawatt and get a power usage number.

To your other point: my website is built with a static generator, so comments aren't a thing. Sorry!