flynryan
Senior Cruncher
United States
Joined: Aug 15, 2006
Post Count: 235
Re: OpenPandemics - GPU Stress Test

With my current cache setting of 2 days, each of my Threadrippers has 1,000 CPU tasks and is running out of GPU tasks.

Guess you found the issue :)

Hmm, interesting. Maybe there's a bug where the scheduler sees your 1,000 CPU tasks and applies that count to your GPU limit. Again, just a guess until the admins take a closer look. It's possible that there is no limit (or a much higher limit) for sending CPU work but a lower limit for GPU work (known to be 200), and the CPU tasks are being counted towards the GPU limit.


There's a message in BOINC that says it has reached the limit of tasks in progress. Anyway, what do you need a 1,000-work-unit buffer for? Unnecessary. Reduce your buffer to 0.5 or 1 day instead of 2 or more; problem solved.
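
For anyone who prefers to set this outside the BOINC Manager preferences screen, the same buffer settings can go in global_prefs_override.xml in the BOINC data directory; a minimal sketch, with example values only:

   <global_preferences>
      <work_buf_min_days>0.5</work_buf_min_days>                <!-- "store at least" -->
      <work_buf_additional_days>0.1</work_buf_additional_days>  <!-- "store up to an additional" -->
   </global_preferences>

Have the client re-read local preferences (or restart it) for the change to take effect.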
[Apr 29, 2021 9:09:25 PM]
Ian-n-Steve C.
Senior Cruncher
United States
Joined: May 15, 2020
Post Count: 180
Re: OpenPandemics - GPU Stress Test


I did not reply because my issues were fixed eventually. But here you go:

Cache settings: 1 day of work + 0.5 additional day.

In the event log the client was requesting work for both CPU and GPU, but I only received CPU workunits. No other unexpected messages were reported.

Some extra context:
- WCG is my only active BOINC project
- I selected multiple WCG projects but de-selected OpenPandemics
- I selected the option to send me work from non-selected projects when there is no other work available
--> The goal here is to crunch all non-OpenPandemics projects with my CPU and to reserve OpenPandemics for GPU work only. This worked fine, except for the temporary issues I mentioned above.


did you do anything to resolve the issue? or did you just wait?

did you by chance still have some OPN1 CPU tasks in your task list when first enabling this?
----------------------------------------

EPYC 7V12 / [5] RTX A4000
EPYC 7B12 / [5] RTX 3080Ti + [2] RTX 2080Ti
EPYC 7B12 / [6] RTX 3070Ti + [2] RTX 3060
[2] EPYC 7642 / [2] RTX 2080Ti
[Apr 29, 2021 9:13:54 PM]
Chooka
Cruncher
Australia
Joined: Jan 25, 2017
Post Count: 49
Re: OpenPandemics - GPU Stress Test

I receive a maximum of 50 tasks per PC. This doesn't last long with 2 x Radeon VIIs in one PC.
My slower PCs with the Vega 56s and 280X have no trouble keeping themselves supplied with work. Currently they all have 50 GPU WUs to complete. Both of the 16-core Threadrippers have now run out of GPU work and are sitting idle :(

I've now changed my WCG settings to stop using the CPU. This should start chewing through the 1,000 CPU tasks, and I imagine the GPU work will start filling up again.

Work is calling... better go. Have a good day/night all.
[Apr 29, 2021 9:18:04 PM]
Richard Haselgrove
Senior Cruncher
United Kingdom
Joined: Feb 19, 2021
Post Count: 360
Re: OpenPandemics - GPU Stress Test

The only message I see that could be of concern is - "Not requesting tasks - Too many runnable tasks"

I've seen this on another pc also.
'Too many runnable tasks' is a hard limit in the BOINC client code:

https://github.com/BOINC/boinc/blob/master/client/work_fetch.cpp#L1236

It may well be 1,000 - I have a mental note of that figure, but won't check it until tomorrow. If so, it won't be under the control of the admins of this or any other project.

Edit: Oh, OK then.
https://github.com/BOINC/boinc/blob/master/client/client_state.h#L598
#define WF_MAX_RUNNABLE_JOBS 1000
// don't fetch work from a project if it has this many runnable jobs.
// This is a failsafe mechanism to prevent infinite fetching
QED
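
For anyone curious, that failsafe boils down to a per-project count check before each work request; here is a paraphrased, self-contained C++ sketch (identifiers are illustrative, not the exact work_fetch.cpp names):

   // Illustrative sketch only - not the verbatim BOINC client code.
   #define WF_MAX_RUNNABLE_JOBS 1000

   struct Project {
       int n_runnable_jobs;   // tasks downloaded from this project and not yet finished
   };

   // Skip work fetch from a project once it already holds 1,000 runnable jobs;
   // this is what surfaces as "Not requesting tasks - Too many runnable tasks".
   bool skip_work_fetch(const Project& p) {
       return p.n_runnable_jobs >= WF_MAX_RUNNABLE_JOBS;
   }

If that count is per project across all resources, it would also explain why a large CPU cache can block GPU fetches from the same project, as reported above.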
----------------------------------------
[Edit 2 times, last edit by Richard Haselgrove at Apr 29, 2021 9:30:14 PM]
[Apr 29, 2021 9:24:35 PM]
uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Re: OpenPandemics - GPU Stress Test

To me it shows fairly poor optimization of the application. The ideal situation is that the application uses as much of the GPU as possible to get the most work done. Needing 16 CPU cores to feed a single GPU is absurdly high. Most other projects use one core or less to feed a GPU and keep utilization high the entire time. You shouldn't have to give up 16 cores (along with the power consumption that comes with that) that could otherwise be doing something more useful, like crunching CPU tasks for a project without a GPU app. Look at GPUGRID or Einstein: those are how you want your app to operate, able to feed the GPU to 95+% for the entire run with only a single CPU core to keep the GPU busy. Usually this means preloading more data into GPU memory and making the GPU handle more functions.

I know it's a "first cut" for this app, but it still has a long way to go on efficiency in my opinion. We should all push for better utilization of resources and not accept so much waste.

Uplinger addressed this a while ago. He wanted to keep the GPU work units the same as the CPU work units initially to ensure consistent results, no doubt necessary for the science.
He said he would tweak it up later.


This isn't correct. The GPU WUs are not the same as the CPU WUs: the GPU tasks have many more jobs prepackaged and are actually much larger than the CPU tasks, and GPU tasks cannot cross-validate with CPU tasks because of their differences.

The GPU app optimization has nothing to do with this.


The statement about keeping the GPU tasks as close as possible to the CPU tasks is correct. This helps in multiple ways. It allows us to verify that things are working as they should without adding too many variables to the mix. These work units use the same method of starting and stopping each job (ligand) in the work unit. All that was modified in the way they were generated was that I said to assume each one is allowed to run 20x longer than on CPU. Not much else changed beyond that. Keeping the pipelines from the researchers to us, and then to you, similar allows us to reduce the number of variables that we introduce into the equation.

Yes, there are differences between the GPU code and the CPU code, but these were vetted and tested by the researchers before we took the application and grid-enabled it. There are multiple options that we are discussing with the researchers. How long it will take to implement those from the WCG end is unknown. I cannot promise when an updated version will be released.

We have heard members commenting on the GPU version using too much I/O, and other complaints at the polar opposite, such as it causing issues on their displays... some members commenting on bandwidth usage, etc.

The purpose of this stress test was to determine where some of the bottlenecks were in the system. We have heard the comments and suggestions about the application. We have made changes to our load balancer to help it handle a lot more work units. We have identified that the small ligand files cause issues with the filesystem's inodes filling up. All of these are stresses on the system. Some may be easily addressed; others take a lot of time and effort. Releasing a new science application does not come as easily or quickly as you would hope: it is distributed to thousands of people and needs to be properly vetted and tested. And all of that happens while supporting and running the other applications and trying to get some sleep in there.

This stress test has been very exciting for us and our team. We are in constant communication with the researchers and they are also very excited about the test so far. Thank you to everyone for your help on making this a successful test.

Please try to keep comments positive and helpful towards everyone in the forums and not combative. We try to make things run as well as they can, but we do not have unlimited resources.

Thanks,
-Uplinger
[Apr 29, 2021 9:25:52 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: OpenPandemics - GPU Stress Test

Once upon a time, WCG implemented a limit of 70 WUs per thread. Until recently, WCG was a CPU-only project. Machines with high core/thread counts would hit the 1,000 (plus a little) BOINC client limit before reaching the 70-per-thread limit. This project (OPN1/OPNG) is a hybrid. I would assume that the 70-per-thread CPU limit is still there, but I am unsure how they are addressing the GPU WU counts. Richard was correct in stating that Uplinger had set the GPU limit to 50 per GPU. The question is: are those GPU counts included in the 70 per thread, or are they separate counts? On my 128-thread machine, I had to recompile the BOINC client and raise the 1,000 limit to 5,000 to get a day's worth of OPN1 work. Otherwise it would chew through the 1,000 WUs in about 18 hours.
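
For reference, that recompile boils down to editing the constant Richard quoted above in client/client_state.h and rebuilding the client, roughly:

   // client/client_state.h - local rebuild only, raising the client-side
   // failsafe from the stock 1000 to 5000 as described above.
   #define WF_MAX_RUNNABLE_JOBS 5000

Note this only lifts the client-side failsafe; the project's own per-thread and per-GPU limits still apply.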
----------------------------------------
[Edit 2 times, last edit by Former Member at Apr 29, 2021 9:28:51 PM]
[Apr 29, 2021 9:27:24 PM]
m0320174
Cruncher
Joined: Feb 13, 2021
Post Count: 11
Re: OpenPandemics - GPU Stress Test


I did not reply because my issues were fixed eventually. But here you go:

Cache settings: 1 day of work + 0.5 additional day.

In the event log the client was requesting work for both CPU and GPU, but I only received CPU workunits. No other unexpected messages were reported.

Some extra context:
- WCG is my only active BOINC project
- I selected multiple WCG projects but de-selected OpenPandemics
- I selected the option to send me work from non-selected projects when there is no other work available
--> The goal here is to crunch all non-OpenPandemics projects with my CPU and to reserve OpenPandemics for GPU work only. This worked fine, except for the temporary issues I mentioned above.


did you do anything to resolve the issue? or did you just wait?

I did not do anything; waiting was the solution for me.

did you by chance still have some OPN1 CPU tasks in your task list when first enabling this?

I did not check, but most likely I did. In any case, the above approach worked fine initially. It was only (some time) after I started running multiple GPU work units in parallel that I ran out of work. And then, miraculously, it was fixed again.

So the only thing I can think of is that there is some kind of server-side mechanism which distributes work according to previously returned work, and that this mechanism initially did not take the GPU speed increase into account.
[Apr 29, 2021 9:28:45 PM]
MindCrimeZ
Cruncher
Joined: Feb 28, 2014
Post Count: 9
Re: OpenPandemics - GPU Stress Test

On the CPU usage of OPNG WUs

Seems like AMD cards only use about half the CPU time, while Nvidia cards use almost 100%.

I'm CPU-limited on my 2 x 7970 machine. Has anyone with a newer AMD card cranked it up to something like 8-16+ concurrent tasks?
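
For anyone wanting to try higher concurrency, the usual route is an app_config.xml in the World Community Grid project directory; a sketch along these lines (the <name> value is a placeholder and must match the OPNG app name shown in client_state.xml; 0.125 GPU usage means eight tasks share one GPU):

   <app_config>
      <app>
         <name>opng</name>               <!-- placeholder: use the exact app name from client_state.xml -->
         <gpu_versions>
            <gpu_usage>0.125</gpu_usage> <!-- 1/8 of a GPU per task = 8 concurrent tasks -->
            <cpu_usage>1.0</cpu_usage>   <!-- one CPU core reserved per task -->
         </gpu_versions>
      </app>
   </app_config>

Whether that actually helps depends on the CPU bottleneck discussed above: with a full core needed per task, 8-16 concurrent tasks also means 8-16 cores tied up feeding the GPU.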
[Apr 29, 2021 9:29:19 PM]
Ian-n-Steve C.
Senior Cruncher
United States
Joined: May 15, 2020
Post Count: 180
Re: OpenPandemics - GPU Stress Test

The only message I see that could be of concern is - "Not requesting tasks - Too many runnable tasks"

I've seen this on another pc also.
'Too many runnable tasks' is a hard limit in the BOINC client code:

https://github.com/BOINC/boinc/blob/master/client/work_fetch.cpp#L1236

It may well be 1,000 - I have a mental note of that figure, but won't check it until tomorrow. If so, it won't be under the control of the admins of this or any other project.

Edit: Oh, OK then.
https://github.com/BOINC/boinc/blob/master/client/client_state.h#L598
#define WF_MAX_RUNNABLE_JOBS 1000
// don't fetch work from a project if it has this many runnable jobs.
// This is a failsafe mechanism to prevent infinite fetching
QED


Ah, his specific error looks to be the default 1,000-task limit in BOINC. But there is still the 200 GPU task limit to contend with.

I just tried on my multi-GPU host with 200 already in progress, and I get a similar, but not exactly the same, response from the project.

Thu 29 Apr 2021 06:19:19 PM EDT | World Community Grid | This computer has reached a limit on tasks in progress
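
For context, that "limit on tasks in progress" message comes from the project's scheduler rather than the client. On a stock BOINC server, caps like the ones discussed in this thread would be set in the scheduler's config.xml, along these lines (illustrative only - WCG runs its own customized setup, and the values simply echo the 70-per-thread and 50-per-GPU figures reported above):

   <!-- illustrative stock BOINC scheduler settings, not WCG's actual config -->
   <max_wus_in_progress> 70 </max_wus_in_progress>           <!-- jobs in progress per CPU -->
   <max_wus_in_progress_gpu> 50 </max_wus_in_progress_gpu>   <!-- jobs in progress per GPU -->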

----------------------------------------

EPYC 7V12 / [5] RTX A4000
EPYC 7B12 / [5] RTX 3080Ti + [2] RTX 2080Ti
EPYC 7B12 / [6] RTX 3070Ti + [2] RTX 3060
[2] EPYC 7642 / [2] RTX 2080Ti
[Apr 29, 2021 10:22:24 PM]
Laserbait
Cruncher
USA
Joined: Oct 27, 2020
Post Count: 9
Re: OpenPandemics - GPU Stress Test


This stress test has been very exciting for us and our team. We are in constant communication with the researchers and they are also very excited about the test so far. Thank you to everyone for your help on making this a successful test.

Please try to keep comments positive and helpful towards everyone in the forums and not combative. We try to make things run as well as they can, but we do not have unlimited resources.

Thanks,
-Uplinger



I just wanted to say thanks, Uplinger; the progression of this has been rather fun for me to participate in as well! Keep up the good work!


--Joseph
[Apr 30, 2021 12:39:46 AM]