Thread Status: Active
Total posts in this thread: 6
Mumak
Senior Cruncher
Joined: Dec 7, 2012
Post Count: 477
BOINC not properly handling Windows NUMA systems

As some of us already know, the current BOINC client has some problems when running Windows systems with >64 CPU threads. In that case the available CPU threads are split into multiple CPU groups (NUMA nodes), and it's up to the application to assign the affinity of its child processes/threads.
BOINC currently relies on the default system scheduler, which is not ideal here because it doesn't seem to balance the load effectively. As a result, after a while most of the BOINC child processes end up assigned to one CPU group, while the other group(s) remain almost idle. That means one CPU group is overloaded/over-scheduled while the others are underloaded.

I have already submitted this problem to BOINC forums:
https://boinc.berkeley.edu/dev/forum_thread.php?id=10124
https://github.com/BOINC/boinc/issues/1357
but it seems the developers don't care about it.

So I have decided to create a workaround for this case until (hopefully some day) the BOINC team addresses it. I have created a tool that checks all WCG processes running in the system and spreads their NUMA node affinity across all CPU groups.

If anyone has such problems or is interested, let me know...
[Feb 11, 2016 10:50:57 AM]
SekeRob
Master Cruncher
Joined: Jan 7, 2013
Post Count: 2741
Re: BOINC not properly handling Windows NUMA systems

"but it seems the developers don't care about it."

There is no longer a BOINC team [see publication **]. All development now comes from outside volunteers, and David Anderson's involvement seems to be limited to pulling in code check-ins at times.

To get attention, post to the BOINC alpha mailing list. If you have a workaround that is easily portable into the BOINC open source code, it stands a chance of being incorporated [but is it a Windows-only problem, as the discussion seems to indicate?]

** knreed's name gets mentioned elsewhere still as being server lead, so not sure how fresh the governance document is.
[Feb 11, 2016 11:22:14 AM]
Mumak
Senior Cruncher
Joined: Dec 7, 2012
Post Count: 477
Re: BOINC not properly handling Windows NUMA systems

I'm aware of that change, but I submitted this problem almost a year ago.
I also understand that the developers think the operating system should manage group affinities more effectively, but unfortunately that doesn't seem to be the case.

Nevertheless, my main intention was to provide this workaround to all users experiencing this problem.
[Feb 11, 2016 12:12:46 PM]
Mumak
Senior Cruncher
Joined: Dec 7, 2012
Post Count: 477
Re: BOINC not properly handling Windows NUMA systems

Here's v1.2 of the tool: www.hwinfo.com/beta/NUMA_Balancer_1_2.zip
By default when launched, it will check all WCG processes/threads running and if it determines some of the NUMA nodes are overloaded (while others are not), it will balance the NUMA assignment (by adjusting NUMA node affinity) for all WCG processes/threads.
For best performance it's recommended to start this tool every few minutes, preferably via Task Scheduler.
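
One possible way to set up the periodic run is sketched below in Python. It only builds a `schtasks /Create` command line with a minute-based schedule and runs it on Windows. The executable path, the task name and the 5-minute interval are assumptions for illustration; the post does not say what the file inside the zip is called.

```python
import subprocess
import sys


def schtasks_command(exe_path, minutes=5, task_name="NUMA Balancer"):
    """Build a schtasks command that runs exe_path every `minutes` minutes."""
    return [
        "schtasks", "/Create",
        "/SC", "MINUTE",       # schedule type: every N minutes
        "/MO", str(minutes),   # N
        "/TN", task_name,      # task name (hypothetical)
        "/TR", exe_path,       # program to run
    ]


if __name__ == "__main__" and sys.platform == "win32":
    # Hypothetical path; adjust to wherever the zip was extracted.
    subprocess.run(schtasks_command(r"C:\Tools\NUMA_Balancer.exe"), check=True)
```

The "-w" option is left out here on purpose, since a scheduled run has nobody around to press a key.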

Starting the tool with the "-w" option will wait for a keystroke at the end, so you can see the output.

There are additional options available to use it for any processes (not just WCG). Let me know if you're interested and I'll describe them.
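
To make the balancing idea in this post more concrete, here is a minimal Python sketch of one way such a rebalance could be computed. It is not the tool's actual algorithm, which the post does not describe; the function name and the "move from overloaded to least-loaded node" rule are assumptions. It only produces the new assignment. Applying it on Windows would need Win32 calls such as SetThreadGroupAffinity, which are not shown.

```python
from collections import Counter


def rebalance(assignment, node_count):
    """Return a new {pid: node} mapping with processes spread evenly.

    assignment: dict mapping a process id to its current node index.
    node_count: number of NUMA nodes / CPU groups in the system.
    Only processes sitting on an overloaded node are moved.
    """
    target = -(-len(assignment) // node_count)  # ceiling of the average
    load = Counter({node: 0 for node in range(node_count)})
    load.update(assignment.values())
    result = dict(assignment)
    for pid, node in sorted(assignment.items()):
        if load[node] > target:
            # Move this process to the currently least-loaded node.
            dest = min(range(node_count), key=lambda n: load[n])
            load[node] -= 1
            load[dest] += 1
            result[pid] = dest
    return result


# Hypothetical example: five processes crowded on node 0, one on node 1.
print(rebalance({1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 1}, node_count=2))
```

Processes on nodes that are at or below the target count stay where they are, so an already balanced system is left untouched.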
[Edited 1 time, last edit by Mumak at Jun 16, 2016 8:51:04 AM]
[Jun 16, 2016 8:50:45 AM]
SekeRob
Master Cruncher
Joined: Jan 7, 2013
Post Count: 2741
Re: BOINC not properly handling Windows NUMA systems

"As some of us already know, the current BOINC client has some problems when running Windows systems with >64 CPU threads."

Recent posts suggested that problems already start with more than 32 threads running HST1 (https://secure.worldcommunitygrid.org/forums/...ead_thread,38956_offset,0), but I have no idea whether that bears in any way on your NUMA issue.

(No, moi is not particularly interested in getting any the wiser on the matter, since 8 threads is the max I can run concurrently ;).
[Edited 1 time, last edit by SekeRob* at Jun 16, 2016 9:07:29 AM]
[Jun 16, 2016 9:05:43 AM]
Mumak
Senior Cruncher
Joined: Dec 7, 2012
Post Count: 477
Re: BOINC not properly handling Windows NUMA systems

I don't think the HST1 problem mentioned is the same as the one described here. The NUMA problem occurs only on systems that use multiple CPU groups, which is always the case when there are >64 CPU threads in a system.
But one can set up a multi-group system with <64 threads too; it just has to be done manually, since Windows doesn't do it on such systems by default.
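
For those who prefer a script, the number of CPU groups can also be queried programmatically. The following is a rough Python sketch using ctypes and the Win32 functions GetActiveProcessorGroupCount and GetActiveProcessorCount. The helper name is made up for illustration, and on non-Windows systems it simply reports a single pseudo-group.

```python
import ctypes
import os
import sys


def processor_group_sizes():
    """Return the number of active logical processors in each CPU group."""
    if sys.platform != "win32":
        # Processor groups are a Windows concept; report one pseudo-group.
        return [os.cpu_count() or 1]
    kernel32 = ctypes.WinDLL("kernel32")
    kernel32.GetActiveProcessorGroupCount.restype = ctypes.c_ushort
    kernel32.GetActiveProcessorCount.argtypes = [ctypes.c_ushort]
    kernel32.GetActiveProcessorCount.restype = ctypes.c_ulong
    groups = kernel32.GetActiveProcessorGroupCount()
    return [kernel32.GetActiveProcessorCount(g) for g in range(groups)]


sizes = processor_group_sizes()
print(len(sizes), "group(s):", sizes)
```

If this reports a single group, the multi-group issue discussed in this thread should not apply to that machine.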

Whether a system is affected by the NUMA problem is easy to check: open Task Manager, switch to Performance, right-click on the graph and change it to "NUMA nodes". If that entry is greyed out, your system has just 1 group and this doesn't apply. If you are running at full CPU load and see that one of the NUMA nodes is at 100% while the other one is much lower, then the system is affected: its work threads are not optimally balanced across all CPU threads.
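
The same rule of thumb can be written down as a small Python check, given per-node load percentages read off Task Manager or a monitoring tool. The function name and both thresholds (95% for "fully loaded", a 30-point gap for "much lower") are assumptions chosen for illustration, not values from the tool.

```python
def looks_affected(node_loads, busy=95.0, gap=30.0):
    """Apply the Task Manager rule of thumb to per-node CPU load (percent)."""
    if len(node_loads) < 2:
        return False  # a single group: the problem does not apply
    highest, lowest = max(node_loads), min(node_loads)
    # Affected: one node is saturated while another is far below it.
    return highest >= busy and highest - lowest >= gap


print(looks_affected([100.0, 40.0]))
print(looks_affected([100.0, 98.0]))
```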
[Edited 1 time, last edit by Mumak at Jun 16, 2016 11:25:44 AM]
[Jun 16, 2016 11:25:01 AM]