World Community Grid Forums
Thread Status: Active | Total posts in this thread: 20
thunder7
Senior Cruncher | Netherlands | Joined: Mar 6, 2013 | Post Count: 238 | Status: Offline
I have linux machines like this, all running a full load of MCM on all CPUs:
2x 2696V4, 88 threads, 256 GB memory, 0.3% system time, linux 5.4.0
2x 2696V2, 48 threads, 384 GB memory, 3% system time, linux 6.0.0
2x 2680V4, 40 threads, 64 GB memory, 1% system time, linux 4.19

If it were true that MCM tasks on linux used a lot of system time, I would expect the 88-thread system to have a lot of system time - which I don't see. vmstat 2 on the first machine certainly doesn't show anything obviously wrong.
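(For anyone who wants to run the same check: a minimal version is the command below; under the "cpu" heading of the output, "us" is user time, "sy" is system time and "id" is idle, each as a percentage of total CPU.)

    # print CPU and memory statistics every 2 seconds
    vmstat 2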
What exactly is the command to get that 'perf' thing? I can start perf, I can zoom in on a process, but it doesn't look anything like what you posted.
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 1317 | Status: Offline
@rjs5
Ryzen 3700X and Intel i7-7700K here, both running Xubuntu 22.04. MCM1 tasks on the Ryzen usually take between 1.25 and 2 hours depending on other work running (and which of the two MCM1 task types it is!), whereas MCM1 on the i7 can take from 1.6 to 5 hours!

Whilst I often use perf stat to investigate various aspects of a task's performance, I'm not that familiar with perf top, so I'm not sure what parameters you used for that report, and I'd be interested to know... I tried it on the Ryzen with an MCM1 PID and my output didn't look anything like yours, so perhaps I wasn't using it right :-) -- however, the display did suggest I could get a "higher level view" with some options it suggested (--sort comm,dso), so I tried that too; that showed over 99% process, less than 1% kernel.

It is worth noting that MCM1 tasks have two threads, one of which is presumably where the process management functions (watchdog timer? checkpoint management?) live and seems to be mostly idle[1], whilst the other is the compute thread. So (still on the Ryzen) I tried perf top on the thread IDs for the same task, and got something more akin to your display from the secondary thread, whilst the primary thread display was nothing like it (no kernel references in the top 50 lines...). I had to tune the sample event period to stop it complaining about being "too slow to read the ring buffer", but I don't suppose that would make such a huge difference to the output...

Still on the Ryzen, I used the sort option on the thread IDs to see what would happen, and observed the following...
I looked at the Ryzen first because it's so much quicker on the longer-running (NFCV) tasks, but I eventually got round to the i7 and (surprise, surprise) it seems to behave as indicated in your post! And I checked vmstat (as per thunder7's post, which turned up whilst I was writing this...) and my Ryzen has almost no "system" time (but lots of idle...) whilst the Intel has a consistent 15% system and 50% user... Of course, that's not all about MCM1 tasks on either system, but it's interesting to note the generally higher apparent overheads on the Intel with my mix of BOINC and non-BOINC work...

Now, both systems run the same executable (as far as I am aware - it's certainly the same version number...) and are on the same Linux kernel (5.15 family), the same system libraries and the same BOINC client. So why the huge difference in behaviour??? Are the basic stats used by perf collected differently on the two different system types?

All that said, recalling how much faster the last Beta-test MCM1 program was on Linux over two years ago, I suspect the real performance fix might be simply to recompile with a newer Linux compiler and libraries! Sadly, that Beta didn't result in a new application (and meant that the ARM version they trialled hasn't seen the light of day - pity that, so much more efficient than OPN1...)

Cheers - Al.

[1] Verified with perf stat on both the Intel and the Ryzen.
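(For anyone wanting to repeat the per-process and per-thread perf runs described above: the exact options used aren't given in the thread, but invocations along these lines should give comparable views; <pid> and <tid> are placeholders for whatever ps reports for an MCM1 task on your own machine.)

    # list the threads of a running MCM1 task; the SPID column holds the thread IDs
    ps -T -p <pid>

    # whole-process view, grouped by command and shared object
    sudo perf top -p <pid> --sort comm,dso

    # per-thread view; -c sets the sample event period, which can help if perf
    # complains about being "too slow to read the ring buffer"
    sudo perf top -t <tid> -c 100000

    # overall counters (cycles, instructions, context switches, ...) for one task,
    # sampled over 60 seconds
    sudo perf stat -p <pid> -- sleep 60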
thunder7
Senior Cruncher | Netherlands | Joined: Mar 6, 2013 | Post Count: 238 | Status: Offline
On the other hand, I just noticed my 88 thread machine seemed to be gathering points more slowly than expected.
top output:
35% system time? Oops!

sudo perf top:

Hmmmmm.
rjs5
Cruncher | Joined: Jan 22, 2011 | Post Count: 6 | Status: Offline
Well ... when I use perf top, I just run "perf top". It is VERY helpful when the binary has symbols.
It looks like the running-slow symptom is my problem and not MCM's. The CPU (an i9-9980XE) is sending a signal to Linux indicating a thermal problem, and Linux is "idling" it for a period of time.

I looked around to see if I could get more information on the LOCK problem and found:

    perf lock record      (Ctrl-C after a few seconds)
    perf lock report

The RECORD command stores data in a file "perf.data"; REPORT reads and formats that data. The hottest lock turns out to be "pkg_thermal_notify", which indicates my CPU is overheating. 8-(

perf lock report:

    Name                  acquired  contended   avg wait  total wait    max wait   min wait
    pkg_thermal_noti...      56421      56421   31.11 us      1.76 s   120.87 us       2 ns
    futex_wake+0xa0           4460       4460   28.56 us   127.37 ms   106.07 us    1.54 us
    __folio_end_writ...        320        320   11.84 us     3.79 ms    22.30 us    3.02 us
    rcu_core+0xd4              219        219    3.52 us   770.77 us     9.73 us    1.82 us
    futex_q_lock+0x26          199        199    7.78 us     1.55 ms    63.20 us    1.42 us
    futex_wake+0xa0             56         56   22.70 us     1.27 ms    45.87 us    2.38 us
    raw_spin_rq_lock...         56         56    5.34 us   298.90 us    16.62 us    2.26 us
    raw_spin_rq_lock...         51         51    6.55 us   334.08 us    21.43 us    2.55 us
    raw_spin_rq_lock...         49         49   13.04 us   638.97 us    36.17 us    2.82 us
    raw_spin_rq_lock...         46         46    8.14 us   374.45 us    21.57 us    2.44 us
    raw_spin_rq_lock...         41         41    6.49 us   266.29 us    12.54 us    2.45 us
    btrfs_read_lock_...         40         40   20.20 us   808.07 us    71.87 us    1.83 us
    raw_spin_rq_lock...         37         37    6.07 us   224.74 us    15.19 us    2.59 us
    tick_do_update_j...         37         37    6.94 us   256.72 us    25.97 us    2.30 us
    raw_spin_rq_lock...         35         35    6.64 us   232.44 us    12.37 us    2.53 us
    __queue_work+0x174          34         34    4.51 us   153.33 us     8.50 us    1.42 us
    raw_spin_rq_lock...         34         34    6.76 us   229.97 us    19.46 us    1.90 us
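(Side note: if anyone else suspects thermal throttling, the kernel also keeps per-CPU throttle counters and usually logs throttle events, so a quick cross-check without perf is something like the following; the sysfs paths shown are the usual ones on Intel CPUs, so adjust if your system lays them out differently.)

    # cumulative core/package throttle event counts since boot (Intel)
    grep . /sys/devices/system/cpu/cpu*/thermal_throttle/*throttle_count

    # kernel log messages about thermal events or clock throttling
    sudo dmesg | grep -iE 'thermal|throttl'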
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Just to add another data point, dual 2695v3, 100% MCM, linux 5.4.0:
    Tasks: 593 total,  29 running, 564 sleeping,   0 stopped,   0 zombie
    %Cpu(s):  0.1 us,  1.1 sy, 49.0 ni, 49.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    MiB Mem :  32055.3 total,  28563.2 free,   2443.8 used,   1048.3 buff/cache
    MiB Swap:   2048.0 total,   2048.0 free,      0.0 used.  29174.4 avail Mem

      PID USER   PR  NI   VIRT   RES  SHR S  %CPU %MEM     TIME+ COMMAND
    10866 boinc  39  19  73696 72924 2392 R 105.9  0.2  77:06.78 wcgrid_+
    10809 boinc  39  19  73696 72928 2392 R 100.0  0.2 146:00.80 wcgrid_+
    10812 boinc  39  19  73696 72924 2392 R 100.0  0.2 144:53.36 wcgrid_+
    10814 boinc  39  19  73696 72928 2392 R 100.0  0.2 141:50.95 wcgrid_+
    10819 boinc  39  19  73696 72924 2392 R 100.0  0.2 137:53.19 wcgrid_+
    10824 boinc  39  19  73696 72928 2392 R 100.0  0.2 131:02.93 wcgrid_+
    10835 boinc  39  19  73696 72928 2392 R 100.0  0.2 113:49.22 wcgrid_+
    10859 boinc  39  19  73696 72928 2392 R 100.0  0.2  89:27.37 wcgrid_+
    ....
thunder7
Senior Cruncher | Netherlands | Joined: Mar 6, 2013 | Post Count: 238 | Status: Offline
Just like that, this morning it's gone again
This machine runs no other software, so it's probable that MCM itself causes it. I do notice that yesterday was a bad day in terms of results (60% of the average of the past 4 days) and points (75% of the average of the past 4 days), but not runtime (the same as the past 4 days).

I know I get 2 types of MCM tasks - the short and the long ones. The short ones take a bit less than 4 hours, the long ones take a bit less than 8 hours.

short: https://www.worldcommunitygrid.org/contribution/workunit/245413338
long: https://www.worldcommunitygrid.org/contribution/workunit/245384006

But I'm not sure whether they take longer because 35% of the CPU power is spent on system calls, or whether they are simply different and the extra system time is not the cause of the longer computation. Yesterday a cursory glance told me I had a lot of the 'long' tasks.

I am somewhat suspicious that the long unit above was solved by another linux user in 1.82 hours. A Xeon V4 @ 2.8 GHz isn't the fastest machine, but is there something out there that runs 4 times faster? And why does that system claim just 10% more points for four times the speed?
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 1317 | Status: Offline
thunder7 wrote:
> I am somewhat suspicious that the long unit above was solved by another linux user in 1.82 hours. A Xeon V4 @ 2.8 GHz isn't the fastest machine, but is there something out there that runs 4 times faster? And why does that system claim just 10% more points for four times the speed?

I'm not surprised by the 1.82 hour time for one of the "long" MCM1 tasks (VMethod=NFCV) -- in my earlier post in this thread I mentioned that my Ryzen 3700X gets through [longer] MCM1 tasks in under 2 hours... And it typically takes 1.25 to 1.4 hours for the shorter ones (VMethod=LOO).

As for the points score, that's based on typical times for the particular host, so [for instance] if it's twice as fast that doesn't entitle it to twice the points per task for doing the same amount of work in half the time -- it gets double the points by doing twice as many tasks :-)

By the way, as I had samples of both LOO and NFCV tasks in my present work mix, I looked at an example of each using perf top on both my Ryzen and my Intel i7. The LOO ones show almost no kernel activity in the main thread, whilst the NFCV main threads show a lot of kernel activity on both systems! That explains the differences I reported in that other post... However, whatever it is in the NFCV code-path that provokes the extra kernel time, it doesn't seem to burn off such a high proportion of CPU time on the Ryzen...

Cheers - Al.
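(If anyone wants to compare the user/system split of the two task types without perf, pidstat from the sysstat package gives a per-thread breakdown; <pid> below is a placeholder for an MCM1 task's process ID.)

    # -u = CPU usage, -t = per-thread rows, sampled every 5 seconds; compare the
    # %usr and %system columns of the compute thread for a LOO task against
    # those for an NFCV task
    pidstat -u -t -p <pid> 5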
thunder7
Senior Cruncher | Netherlands | Joined: Mar 6, 2013 | Post Count: 238 | Status: Offline
I can only dream of getting a reaction from the person/team that 'creates' the NFCV tasks...
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 1317 | Status: Offline
thunder7 wrote:
> I can only dream of getting a reaction from the person/team that 'creates' the NFCV tasks...

Nowadays, that would be the same folks as run WCG :-)

For what it's worth, I checked back to the last BETA tests done for MCM1[1] on WCG/IBM as I had a memory of them being faster... However, it appears that they did two sets of tests, the first using NFCV and the second using LOO (but with a different, more complex, set of control parameters). The NFCV tasks showed a minimal improvement in performance on my Ryzen and were, if anything, slightly slower on the Intel; the LOO tasks were quicker on both platforms, but how much of that was down to re-compilation and how much to the re-jigged parameters is unknown.

So it does look like a code-path issue, and whether some changes might be possible remains to be seen -- for now they're probably far more invested in finishing the post-transfer "snagging" (especially the bits regarding communication between the two databases, which seem to be at the core of most outstanding issues not related to work accessibility).

Cheers - Al

[1] The BETA ended without comment as to whether anything useful was learnt, and nothing appeared to change regarding production MCM1 work...
Sgt.Joe
Ace Cruncher | USA | Joined: Jul 4, 2006 | Post Count: 7846 | Status: Offline
Not that I want to get too involved in a highly technical explanation of the differences between the LOO and NFCV units, but it would be nice to at least hear a layman's explanation for the differences and what is being learned from each variation. Just from the disparities in the times, I know they are doing something different during their calculations. Just what that might be would be interesting to know.
----------------------------------------
Cheers
Sgt. Joe
*Minnesota Crunchers*