World Community Grid - View Thread - OpenPandemics

World Community Grid Forums

Category: Active Research

Forum: OpenPandemics - COVID-19 Project

Thread: OpenPandemics - GPU Stress Test

Thread Status: Active
Total posts in this thread: 781

This topic has been viewed 945455 times and has 780 replies

True54Blue
Advanced Cruncher
Joined: Nov 17, 2004
Post Count: 97
Status: Offline

Re: OpenPandemics - GPU Stress Test

I just increased my concurrent GPU WUs to 10, with 10 CPU threads, and I'm still not fully utilizing the RTX 3060 Ti. I think I can see why someone went with 16. Perhaps I'll try that tomorrow.

[Apr 29, 2021 8:51:55 PM]

Ian-n-Steve C.
Senior Cruncher
United States
Joined: May 15, 2020
Post Count: 180
Status: Offline

Re: OpenPandemics - GPU Stress Test

Nobody minds a little meowing......
But I think a few folks are getting tired of the caterwauling.

Meow.

oh well.

----------------------------------------

EPYC 7V12 / [5] RTX A4000
EPYC 7B12 / [5] RTX 3080Ti + [2] RTX 2080Ti
EPYC 7B12 / [6] RTX 3070Ti + [2] RTX 3060
[2] EPYC 7642 / [2] RTX 2080Ti

[Apr 29, 2021 8:53:31 PM]

Richard Haselgrove
Senior Cruncher
United Kingdom
Joined: Feb 19, 2021
Post Count: 360
Status: Offline

Re: OpenPandemics - GPU Stress Test

I observe a 200 task limit. is your CPU+GPU equating to 200 tasks?

Please define your test scenario. I see a limit of 100 - on a dual GPU machine. I also remember Uplinger writing that he was enforcing a 50 task limit.

I surmise that limit was actually '50 tasks per GPU'.

[Apr 29, 2021 8:53:43 PM]

Jim1348
Veteran Cruncher
USA
Joined: Jul 13, 2009
Post Count: 1066
Status: Offline

Re: OpenPandemics - GPU Stress Test

this isn't correct. the GPU WUs are not the same as the CPU WUs. the GPU tasks have many, many more tasks prepackaged and are actually much larger than the CPU tasks. and GPU tasks cannot cross-validate with CPU tasks because of their differences.

the GPU app optimization has nothing to do with this.

Of course there are more of them. But individually I think they are comparable. You need to read his post. He thought he could optimize the work units.

[Apr 29, 2021 8:55:29 PM]

Chooka
Cruncher
Australia
Joined: Jan 25, 2017
Post Count: 49
Status: Offline

Re: OpenPandemics - GPU Stress Test

Sorry... I should have mentioned I am currently running both CPU & GPU for OpenPandemics.
3950X using 27 threads and running 0.33 GPU per task across 2 GPUs (6 tasks running at a time on Radeon VIIs)

[Apr 29, 2021 8:56:32 PM]

Ian-n-Steve C.
Senior Cruncher
United States
Joined: May 15, 2020
Post Count: 180
Status: Offline

Re: OpenPandemics - GPU Stress Test

I observe a 200 task limit. is your CPU+GPU equating to 200 tasks?

the limit is 50 per GPU. or 200 (GPU tasks) per host.

taking a gander at my hosts, one can observe that I have several hosts which will bump into this limit. hosts with more than 4 GPUs do not receive 50 per GPU anymore, they stop at the hard limit of 200.

so if Chooka has 200 CPU tasks, he won't get any GPU tasks due to this limit (that's the guess right now at least, assuming this task limit is shared between CPU+GPU)
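
A rough sketch of the limits described above, in Python. The 50-per-GPU and 200-per-host figures come from this thread; the function names are made up for illustration, and the "shared" calculation only models the guess that CPU tasks count against the GPU limit. It is not confirmed scheduler behaviour.

```python
# Figures reported in this thread, not taken from the scheduler source.
PER_GPU_LIMIT = 50   # GPU tasks allowed per GPU
HOST_LIMIT = 200     # hard cap on GPU tasks per host


def host_gpu_task_limit(num_gpus: int) -> int:
    """GPU task limit for a host: 50 per GPU, capped at 200 per host."""
    if num_gpus < 0:
        raise ValueError("num_gpus must be non-negative")
    return min(PER_GPU_LIMIT * num_gpus, HOST_LIMIT)


def gpu_tasks_sendable_if_shared(num_gpus: int,
                                 gpu_tasks_on_host: int,
                                 cpu_tasks_on_host: int) -> int:
    """HYPOTHESIS ONLY: how many more GPU tasks could be sent if CPU tasks
    were (perhaps wrongly) counted against the same limit."""
    limit = host_gpu_task_limit(num_gpus)
    return max(0, limit - gpu_tasks_on_host - cpu_tasks_on_host)


if __name__ == "__main__":
    for gpus in (1, 2, 4, 7):
        print(gpus, "GPU(s) ->", host_gpu_task_limit(gpus), "GPU tasks")
    # A 2-GPU host already holding 200 CPU tasks, under the shared-limit guess:
    print(gpus_left := gpu_tasks_sendable_if_shared(2, 0, 200))
```

Under these assumptions a dual-GPU host tops out at 100 tasks, which matches the limit Richard reported, and any host with more than 4 GPUs stops at 200.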

----------------------------------------
[Edit 2 times, last edit by Ian-n-Steve C. at Apr 29, 2021 9:09:33 PM]

[Apr 29, 2021 8:57:40 PM]

Chooka
Cruncher
Australia
Joined: Jan 25, 2017
Post Count: 49
Status: Offline

Re: OpenPandemics - GPU Stress Test

Hi Ian n Steve. Thanks for the reply.

The cache setting in WCG is 2 days of work. It was at 7 days, but either way I had the same issue.
The only message I see that could be of concern is "Not requesting tasks - Too many runnable tasks".

I've seen this on another PC also.

then that's your answer. you have too many tasks from the project to be sent more. I'm guessing you're loaded up on CPU tasks.

I'm only running GPU tasks (CPU processing disabled), and I observe a 200 task limit. is your CPU+GPU equating to 200 tasks? I think this is probably from the resource share issue. unfortunately

you could for sure work around this by running multiple clients (which is a bit of a can of worms in itself), one with only CPU work and one with only GPU work, or maybe by playing around with the resource share value between projects. there are options in BOINC to control how many tasks of each type are running at a time, but other than a cache setting (which I'm sure is shared between OPN1 and OPNG since it's the same project) there's no way to tell the project "only send me X amount of CPU tasks".

Thank you.
This is similar to another project where you want to run more GPU work but it pulls CPU work instead and starves the GPU of more work.
Guess I might just have to run GPU only on OpenPandemics. Bit of a shame.

[Apr 29, 2021 8:59:31 PM]

Chooka
Cruncher
Australia
Joined: Jan 25, 2017
Post Count: 49
Status: Offline

Re: OpenPandemics - GPU Stress Test

With my current cache setting of 2 days, each of my Threadrippers has 1000 CPU tasks and is running out of GPU tasks.

Guess you found the issue :)

[Apr 29, 2021 9:00:56 PM]

Ian-n-Steve C.
Senior Cruncher
United States
Joined: May 15, 2020
Post Count: 180
Status: Offline

Re: OpenPandemics - GPU Stress Test

With my current cache setting of 2 days, each of my Threadrippers has 1000 CPU tasks and is running out of GPU tasks.

Guess you found the issue :)

hmm interesting. maybe there's a bug in that the scheduler is seeing your 1000 CPU tasks and applying that to your GPU limit. again, just a guess until the admins take a closer look. it's possible that there is no limit (or a much higher limit) for sending CPU work, but a lower limit for GPU work (known to be 200), but the CPU tasks are being counted towards the GPU limit.

[Apr 29, 2021 9:05:28 PM]

m0320174
Cruncher
Joined: Feb 13, 2021
Post Count: 11
Status: Offline

Re: OpenPandemics - GPU Stress Test

I'm experiencing some strange behaviour after modifying the app_config file.

I forced BOINC to run up to 8 GPU workunits in parallel:

<gpu_usage>0.125</gpu_usage>
<cpu_usage>0.25</cpu_usage>

This works absolutely fine. I run both GPU and CPU workunits and my GPU and CPU are able to process that many in parallel. This obviously has a dramatic effect on throughput.
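
For anyone wondering how those two fractions turn into "8 in parallel", here is a small Python sketch of the arithmetic. It assumes the client keeps starting tasks on a GPU for as long as their gpu_usage fractions fit within 1.0, which is a simplification; the function names are hypothetical and not part of BOINC.

```python
import math


def max_concurrent_gpu_tasks(gpu_usage: float, num_gpus: int = 1) -> int:
    """Tasks that fit when each task claims `gpu_usage` of one GPU.

    Assumes tasks are packed per GPU until the fractions would exceed 1.0.
    """
    if gpu_usage <= 0:
        raise ValueError("gpu_usage must be positive")
    # Small epsilon guards against floating-point results like 7.999999.
    per_gpu = math.floor(1.0 / gpu_usage + 1e-9)
    return per_gpu * num_gpus


def cpu_threads_reserved(cpu_usage: float, running_tasks: int) -> float:
    """CPU threads budgeted for the GPU tasks' support work."""
    return cpu_usage * running_tasks


if __name__ == "__main__":
    tasks = max_concurrent_gpu_tasks(0.125)          # the setting quoted above
    print("concurrent GPU tasks:", tasks)
    print("CPU threads reserved:", cpu_threads_reserved(0.25, tasks))
    print("0.33 on 2 GPUs:", max_concurrent_gpu_tasks(0.33, num_gpus=2))
```

With the values quoted above this gives 8 GPU tasks with 2 CPU threads set aside for them, and 0.33 across 2 GPUs gives the 6 tasks at a time mentioned earlier in the thread.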

However, the BOINC client is not able to fetch GPU workunits anymore. It tries to fetch both CPU and GPU workunits but only receives CPU workunits. Has anybody experienced the same?

Yes... I'm finding the same thing.
I've woken this morning to once again find my PC has run dry of GPU work :/ It's just not fetching more GPU work.

I asked m0320174 these questions, but he never replied. so I'll ask you the same since you're having the same issue.

what are your cache settings? and what does the Event Log say during work fetch? it will usually list a reason for not requesting work. or a reason why they aren't sending you any.

I did not reply because my issues were fixed eventually. But here you go:

Cache settings: 1 day of work + 0.5 additional day.

In the event log the client was requesting work for both CPU and GPU, but I only received CPU workunits. No other unexpected messages were reported.

Some extra context:
- I only have WCG as an active BOINC project
- I selected multiple WCG projects but de-selected OpenPandemics
- I selected the option to send me work from the non-selected projects when there is no other work available

The goal here is to crunch with my CPU on all non-OpenPandemics projects and to reserve OpenPandemics for GPU work only. This worked fine, except for the temporary issues I mentioned above.

[Apr 29, 2021 9:09:10 PM]
