World Community Grid Forums
Thread Status: Active · Total posts in this thread: 781
flynryan
Senior Cruncher · United States · Joined: Aug 15, 2006 · Post Count: 235
> With my current cache setting of 2 days, each of my Threadrippers has 1000 CPU tasks and is running out of GPU tasks. Guess you found the issue :)
>
> hmm interesting. maybe there's a bug in that the scheduler is seeing your 1000 CPU tasks and applying that to your GPU limit. again, just a guess until the admins take a closer look. it's possible that there is no limit (or a much higher limit) for sending CPU work, but a lower limit for GPU work (known to be 200), and the CPU tasks are being counted towards the GPU limit. There's a message in BOINC that says it has reached the limit of tasks in progress.

Anyway, what do you need a 1000-work-unit buffer for? It's unnecessary. Reduce your buffer to 0.5 or 1 day instead of 2 or more, and the problem is solved.
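For what it's worth, the buffer can be trimmed either in the website preferences or locally on the host. Below is a minimal sketch of a local override, assuming a stock BOINC client: the file is global_prefs_override.xml in the BOINC data directory, and the values here are only examples (BOINC Manager's computing preferences write the same file).

    <!-- global_prefs_override.xml in the BOINC data directory (example values) -->
    <global_preferences>
        <!-- "Store at least N days of work" -->
        <work_buf_min_days>0.5</work_buf_min_days>
        <!-- "Store up to an additional N days of work" -->
        <work_buf_additional_days>0.1</work_buf_additional_days>
    </global_preferences>

The client picks the change up after "Read local prefs file" from BOINC Manager's Options menu, or simply after a client restart.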
Ian-n-Steve C.
Senior Cruncher · United States · Joined: May 15, 2020 · Post Count: 180
> I did not reply because my issues were fixed eventually. But here you go:
>
> Cache settings: 1 day of work + 0.5 additional days. In the event log the client was requesting work for both CPU and GPU, but I only received CPU workunits. No other unexpected messages were reported.
>
> Some extra context:
> - I only have WCG as an active BOINC project
> - I selected multiple WCG projects but de-selected OpenPandemics
> - I selected the option to send me work from the non-selected projects when there is no other work available
>
> The goal here is to crunch with my CPU on all non-OpenPandemics projects and to reserve OpenPandemics for GPU work only. This worked fine, except for the temporary issues I mentioned above.

Did you do anything to resolve the issue, or did you just wait? Did you by chance still have some OPN1 CPU tasks in your task list when first enabling this?

----------------------------------------
EPYC 7V12 / [5] RTX A4000
EPYC 7B12 / [5] RTX 3080Ti + [2] RTX 2080Ti
EPYC 7B12 / [6] RTX 3070Ti + [2] RTX 3060
[2] EPYC 7642 / [2] RTX 2080Ti
Chooka
Cruncher · Australia · Joined: Jan 25, 2017 · Post Count: 49
I receive a maximum of 50 tasks per PC. This doesn't last long with 2 x Radeon VIIs in one PC.

My slower PCs with the Vega 56s and the 280X have no issues keeping work up to them. Currently they all have 50 GPU WUs to complete. Both of the 16-core Threadrippers have now run out of GPU work and are sitting idle :(

I've now changed my WCG setting to stop using the CPU. This should start chewing through the 1000 CPU tasks and, I imagine, will start filling up the GPU work again.

Work is calling... better go. Have a good day/night all.
Richard Haselgrove
Senior Cruncher · United Kingdom · Joined: Feb 19, 2021 · Post Count: 360
> The only message I see that could be of concern is "Not requesting tasks - Too many runnable tasks". I've seen this on another PC also.

'Too many runnable tasks' is a hard limit in the BOINC client code:

https://github.com/BOINC/boinc/blob/master/client/work_fetch.cpp#L1236

It may well be 1,000 - I have a mental note of that figure, but won't check it until tomorrow. If so, it won't be under the control of the admins of this or any other project.

Edit: Oh, OK then.

https://github.com/BOINC/boinc/blob/master/client/client_state.h#L598

    #define WF_MAX_RUNNABLE_JOBS 1000
        // don't fetch work from a project if it has this many runnable jobs.
        // This is a failsafe mechanism to prevent infinite fetching

QED.

[Edit 2 times, last edit by Richard Haselgrove at Apr 29, 2021 9:30:14 PM]
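To make the effect of that constant concrete, here is a small, self-contained sketch in the spirit of the client-side guard. It is an illustration only, with made-up struct and function names, not the actual BOINC source (which is at the work_fetch.cpp link above):

    // Simplified illustration of a client-side failsafe like WF_MAX_RUNNABLE_JOBS.
    // Not the real BOINC code; the names below are invented for the example.
    #include <cstdio>

    #define WF_MAX_RUNNABLE_JOBS 1000  // value quoted from client_state.h above

    struct Project {
        const char* name;
        int n_runnable_jobs;  // tasks downloaded to this host and not yet completed
    };

    // True if the client should stop asking this project for more work.
    bool too_many_runnable_tasks(const Project& p) {
        return p.n_runnable_jobs >= WF_MAX_RUNNABLE_JOBS;
    }

    int main() {
        Project wcg{"World Community Grid", 1000};
        if (too_many_runnable_tasks(wcg)) {
            // Roughly where a client would log "Not requesting tasks - Too many runnable tasks".
            std::printf("%s: not requesting tasks - too many runnable tasks\n", wcg.name);
        }
        return 0;
    }

The point is simply that the cap is per project and counts all of that project's tasks on the host, CPU and GPU alike, which would explain why a 1,000-task CPU cache can shut off GPU work fetch as well.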
uplinger
Former World Community Grid Tech · Joined: May 23, 2005 · Post Count: 3952
> to me it shows fairly poor optimization of the application. the ideal situation is that the application uses as much of the GPU as possible to get the most work done. needing 16 CPU cores to feed a single GPU is absurdly high. most other projects use 1 or less to feed a GPU and keep high utilization the entire time. You shouldn't have to give up 16 cores (along with the power consumption that comes with that) that could otherwise be doing something more useful, like crunching CPU tasks for a project without a GPU app. Look at GPUGRID or Einstein; those are how you want your app to operate: able to feed the GPU to 95+% for the entire run with only a single CPU core to keep the GPU busy. usually this means preloading more data into the GPU memory and making the GPU handle more functions. I know it's a "first cut" for this app, but it still has a long way to go for efficiency in my opinion. we should all push for better utilization of resources for the sake of efficiency and not accept so much waste.
>
> > Uplinger addressed this a while ago. He wanted to keep the GPU work units the same as the CPU work units initially to ensure consistent results, no doubt necessary for the science. He said he would tweak it up later.
>
> this isn't correct. the GPU WUs are not the same as the CPU WUs. the GPU tasks have many, many more jobs prepackaged and are actually much larger than the CPU tasks, and GPU tasks cannot cross-validate with CPU tasks because of their differences. the GPU app optimization has nothing to do with this.

The statement about keeping the GPU tasks as close to the CPU tasks as possible is correct. This helps in multiple ways. It allows us to verify that things are working as they should without adding too many variables to the mix. These work units use the same method of starting and stopping each job (ligand) in the workunit. All that was modified in the way they were generated was that I said to assume a task is allowed to run 20x longer than on CPU; not much else changed beyond that. Keeping the pipelines from the researchers, to us, and then to you similar allows us to decrease the number of variables that we introduce into the equation. Yes, there are differences in the GPU code that are not in the CPU code, but these were vetted and tested by the researchers before we took the application to grid-enable it.

There are multiple options that we are discussing with the researchers. How long it will take to get those implemented from the WCG end is unknown; I cannot promise when an updated version will be released.

We have heard members commenting on the GPU version using too much I/O, and other complaints such as the polar opposite, where it causes issues on their displays. Some members commented on bandwidth usage, etc. The purpose of this stress test was to determine where some of the bottlenecks were in the system. We have heard the comments and suggestions about the application. We have made changes to our load balancer to help handle a lot more work units. We have identified that the small ligand files cause issues with the inodes of the filesystem filling up. All of these are stresses on the system. Some may be easily addressed; others take lots of time and effort. Releasing a new science application does not come as easily and quickly as you would hope; it is distributed to thousands of people and needs to be properly vetted and tested. And all of that while supporting and running the other applications and trying to get some sleep in there. :)
This stress test has been very exciting for us and our team. We are in constant communication with the researchers and they are also very excited about the test so far. Thank you to everyone for your help on making this a successful test. Please try to keep comments positive and helpful towards everyone in the forums and not combative. We try to make things run as best as they can, but we do not have unlimited resources. Thanks, -Uplinger |
Former Member
Cruncher · Joined: May 22, 2018 · Post Count: 0
Once upon a time, WCG implemented a 70-per-thread WU limit. Until recently, WCG was a CPU-only project, and machines with high core/thread counts would hit the 1000 (plus a little) BOINC client limit before reaching the 70-per-thread limit. This project (OPN1/OPNG) is a hybrid. I would assume that the 70-per-thread CPU limit is still there, but I am unsure how they are addressing the GPU WU counts. Richard was correct in stating that Uplinger had set the GPU limit to 50 per GPU. The question is: are those GPU counts included in the 70 per thread, or are they separate counts? On my 128-thread machine, I had to recompile the BOINC client and raise the 1000 limit to 5000 to get a day's worth of OPN1 work. Otherwise it would chew through the 1000 WUs in about 18 hours.
[Edit 2 times, last edit by Former Member at Apr 29, 2021 9:28:51 PM]
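For anyone curious what that rebuild involves: assuming the limit in question is the WF_MAX_RUNNABLE_JOBS constant quoted above, the change amounts to editing one line of client/client_state.h before compiling the client, roughly like this sketch:

    // client/client_state.h -- raise the client-side failsafe before rebuilding
    #define WF_MAX_RUNNABLE_JOBS 5000
        // don't fetch work from a project if it has this many runnable jobs.
        // This is a failsafe mechanism to prevent infinite fetching
        // (stock value is 1000; 5000 is the value described in this post)

Keep in mind this removes a deliberate failsafe, so it only makes sense on hosts that genuinely work through 1000 tasks in well under a day.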
m0320174
Cruncher · Joined: Feb 13, 2021 · Post Count: 11
> did you do anything to resolve the issue? or did you just wait?

I did not do anything; waiting was the solution for me.

> did you by chance still have some OPN1 CPU tasks in your task list when first enabling this?

I did not check, but most likely I did. In any case, the above approach worked fine initially. It was only (some time) after starting to run multiple GPU units in parallel that I ran out of work. And then miraculously it was fixed again. So the only thing I can think of is that there is some kind of mechanism on the server side which distributes work according to previously returned work, and that this mechanism initially did not take into account the GPU speed increase.
MindCrimeZ
Cruncher · Joined: Feb 28, 2014 · Post Count: 9
On the CPU usage of OPNG WUs: it seems like AMD cards only use about half the CPU time, while NVIDIA cards use almost 100%. I'm CPU-limited on my 2 x 7970 machine. Anyone with a newer AMD card cranked up to something like 8-16+ concurrent tasks?
Ian-n-Steve C.
Senior Cruncher · United States · Joined: May 15, 2020 · Post Count: 180
> 'Too many runnable tasks' is a hard limit in the BOINC client code: https://github.com/BOINC/boinc/blob/master/client/work_fetch.cpp#L1236 [...]
> #define WF_MAX_RUNNABLE_JOBS 1000

Ah, his specific error looks to be the default 1000-task limit in BOINC. But there is still the 200 GPU task limit to contend with. I just tried on my multi-GPU host with 200 already in progress, and I get a similar, but not exactly the same, response from the project:

Thu 29 Apr 2021 06:19:19 PM EDT | World Community Grid | This computer has reached a limit on tasks in progress

----------------------------------------
EPYC 7V12 / [5] RTX A4000
EPYC 7B12 / [5] RTX 3080Ti + [2] RTX 2080Ti
EPYC 7B12 / [6] RTX 3070Ti + [2] RTX 3060
[2] EPYC 7642 / [2] RTX 2080Ti
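The two limits being compared here are enforced in different places: "Too many runnable tasks" comes from the client constant quoted above, while "This computer has reached a limit on tasks in progress" is a server-side cap. Whether WCG's customized scheduler uses exactly the stock options is an assumption, but in a standard BOINC project that kind of per-host cap sits in the project's config.xml, roughly like this sketch (the 70 and 50 figures are the ones mentioned earlier in this thread, used only as placeholders):

    <!-- Project-side sketch (stock BOINC scheduler options; it is an assumption
         that WCG's customized scheduler uses these exact settings). -->
    <config>
        <max_wus_in_progress>70</max_wus_in_progress>          <!-- per CPU core -->
        <max_wus_in_progress_gpu>50</max_wus_in_progress_gpu>  <!-- per GPU -->
    </config>

Because the client-side failsafe counts all of a project's tasks together, either cap can be the one that actually stops work fetch on a big host.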
Laserbait
Cruncher · USA · Joined: Oct 27, 2020 · Post Count: 9
> This stress test has been very exciting for us and our team. We are in constant communication with the researchers and they are also very excited about the test so far. Thank you to everyone for your help on making this a successful test.

I just wanted to say thanks, Uplinger; the progression of this has been rather fun for me to participate in as well! Keep up the good work!

--Joseph