World Community Grid Forums
Thread Status: Active | Total posts in this thread: 781
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline

> > I'm experiencing some strange behaviour after modifying the app_config file. I forced BOINC to run up to 8 GPU work units in parallel:
> >
> >     <gpu_usage>0.125</gpu_usage>
> >     <cpu_usage>0.25</cpu_usage>
> >
> > This works absolutely fine. I run both GPU and CPU work units, and my GPU and CPU are able to process that many in parallel, which has a dramatic effect on throughput. However, the BOINC client is no longer able to fetch GPU work units. It tries to fetch both CPU and GPU work units but only receives CPU work units. Has anybody experienced the same?
>
> That is very likely the old BOINC problem where the scheduler gets confused when you try to run both CPU and GPU work units from the same project. It has something to do with the "duration correction factor" (DCF), as I recall. You have the same problem on Einstein or MilkyWay when you try to run both CPU and GPU. It is as old as the hills. Maybe a BOINC expert (are you there, Richard?) can illuminate it further.

I thought DCF was turned off at WCG and it is handled using an algorithm on the server.

From 2017: As DCF is locked to 1.000000 by WCG on standard clients, meaning the client does not adapt/adjust runtime to real-time throughput, the only messing happening is server-driven. Combined with the lapse rate between work generation, the point where fpops are slotted in, and the current average runtime used as the base for setting those fpops at science level, this makes for chaos on any science that has large variability in its runtime durations; HST1 is no stranger to the issue.

[Edit 1 times, last edit by Former Member at Apr 29, 2021 5:01:37 PM]
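For reference, the two options quoted above live inside an `<app>` block of an `app_config.xml` file in the project's directory. A minimal sketch follows; the app name `opng` is an assumption here, so verify it against the app names your own client reports before using it:

```xml
<!-- Sketch of an app_config.xml that runs 8 tasks per GPU.
     The app name "opng" is an assumption: check the app names
     shown in your BOINC client before relying on it. -->
<app_config>
  <app>
    <name>opng</name>
    <gpu_versions>
      <gpu_usage>0.125</gpu_usage>  <!-- 1/8 of a GPU per task, i.e. 8 tasks per GPU -->
      <cpu_usage>0.25</cpu_usage>   <!-- 1/4 of a CPU core reserved per GPU task -->
    </gpu_versions>
  </app>
</app_config>
```

After editing the file, the client needs an "Options → Read config files" (or a restart) to pick the change up.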
Jim1348
Veteran Cruncher | USA | Joined: Jul 13, 2009 | Post Count: 1066 | Status: Offline

> I thought DCF was turned off at WCG and it is handled using an algorithm on the server.

That looks to be the case. All my DCF values show 1.00000000000. But I don't think that prevents the server from creating the problem, does it? It may not be a problem here, as I noted above, due to the different task names. But I run only GPU for OPN, and see no point in using the CPU.

EDIT: Then of course I can't run any WCG CPU projects, since I have to set CPU to "off". But there are plenty of other worthwhile projects. For COVID-19, there is always Rosetta and SiDock. And plenty of non-COVID projects.

[Edit 2 times, last edit by Jim1348 at Apr 29, 2021 5:26:07 PM]
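As a back-of-the-envelope illustration of what a locked DCF means: the client's runtime estimate is roughly the task's floating-point-operation estimate divided by the device's projected speed, scaled by DCF. A sketch, with illustrative names rather than the actual client symbols:

```python
# Rough sketch of a BOINC-style task runtime estimate.
# Function and parameter names are illustrative, not the client's own.
def estimated_runtime(rsc_fpops_est: float, device_flops: float, dcf: float = 1.0) -> float:
    """Estimated wall-clock seconds for a task.

    rsc_fpops_est: server-side estimate of floating-point ops in the task
    device_flops:  projected speed of the device in FLOPS
    dcf:           duration correction factor; WCG locks this at 1.0,
                   so the client never adapts the estimate locally
    """
    return rsc_fpops_est / device_flops * dcf

# With DCF pinned at 1.0, any error in the server's fpops estimate
# passes straight through to the runtime estimate:
est = estimated_runtime(1.2e13, 4.0e9)  # 3000 seconds
```

The point of locking DCF to 1.0 is that correction happens server-side; the trade-off, as the 2017 quote above describes, is that a stale fpops estimate is not corrected on the client at all.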
uplinger
Former World Community Grid Tech | Joined: May 23, 2005 | Post Count: 3952 | Status: Offline

Good afternoon,

We are going to be making some changes to the work units being sent out, to help prevent a storage issue on the backend. Without these changes, we would more than likely have to stop the stress test before all 30k batches are complete.

The change we are making is setting the deadline to 3 days instead of the 7 days used previously. All new work downloaded will have the 3-day deadline.

Also, because we would like to hit the plateau of packaged work sooner, we are going to over-schedule about 7,000 work units that are preventing about 2,000 batches from completing. This lets us start seeing where a steady state with a 3-day deadline lands, and starts the later stages of the pipeline, which send results back to the researchers, running at a consistent pace.

Note: For this to happen, I will be turning off validation and the feeder for a few minutes.

Thanks,
-Uplinger
uplinger
Former World Community Grid Tech | Joined: May 23, 2005 | Post Count: 3952 | Status: Offline

Hello again,

The feeder and validators have been re-enabled.

Thanks,
-Uplinger
kittyman
Advanced Cruncher | Joined: May 14, 2020 | Post Count: 140 | Status: Offline

You just might toss some BOINC clients into panic mode, and they will start processing the new, shorter-deadline WUs first...

----------------------------------------
Just meowin'. Meow
Richard Haselgrove
Senior Cruncher | United Kingdom | Joined: Feb 19, 2021 | Post Count: 360 | Status: Offline

> > > I'm experiencing some strange behaviour after modifying the app_config file. I forced BOINC to run up to 8 GPU work units in parallel:
> > >
> > >     <gpu_usage>0.125</gpu_usage>
> > >     <cpu_usage>0.25</cpu_usage>
> > >
> > > This works absolutely fine. I run both GPU and CPU work units, and my GPU and CPU are able to process that many in parallel, which has a dramatic effect on throughput. However, the BOINC client is no longer able to fetch GPU work units. It tries to fetch both CPU and GPU work units but only receives CPU work units. Has anybody experienced the same?
> >
> > That is very likely the old BOINC problem where the scheduler gets confused when you try to run both CPU and GPU work units from the same project. It has something to do with the "duration correction factor" (DCF), as I recall. You have the same problem on Einstein or MilkyWay when you try to run both CPU and GPU. It is as old as the hills. Maybe a BOINC expert (are you there, Richard?) can illuminate it further.
>
> I thought DCF was turned off at WCG and it is handled using an algorithm on the server.
>
> From 2017: As DCF is locked to 1.000000 by WCG on standard clients, meaning the client does not adapt/adjust runtime to real-time throughput, the only messing happening is server-driven. Combined with the lapse rate between work generation, the point where fpops are slotted in, and the current average runtime used as the base for setting those fpops at science level, this makes for chaos on any science that has large variability in its runtime durations; HST1 is no stranger to the issue.
Richard Haselgrove
Senior Cruncher | United Kingdom | Joined: Feb 19, 2021 | Post Count: 360 | Status: Offline

> You just might toss some BOINC clients into panic mode, and they will start processing the new, shorter-deadline WUs first...

I think we're fairly safe on that score. These tasks are so short that they make it from the back of the cache to the front in about 2.5 hours.
Pandelta
Advanced Cruncher | Joined: Jun 24, 2012 | Post Count: 55 | Status: Offline

> > > I hope you all can greatly increase GPU units after the stress test and keep this going. I am highly tempted to go buy an overpriced card.
> >
> > From the numbers I have seen, the higher-end cards don't get you much more performance. Maybe someone here with an RTX, for example, could show what they are getting.
>
> After fine-tuning my card, I got 17M points yesterday with my RTX 3080. I might be able to get it to 20M. There's still headroom, because it's not running at 100% all the time.

Holy smokes! I thought I was doing good, lol. That's awesome!
kittyman
Advanced Cruncher | Joined: May 14, 2020 | Post Count: 140 | Status: Offline

> > You just might toss some BOINC clients into panic mode, and they will start processing the new, shorter-deadline WUs first...
>
> I think we're fairly safe on that score. These tasks are so short that they make it from the back of the cache to the front in about 2.5 hours.

Granted. But there are some awfully slow GPUs out there..... LOL. Meow
Grumpy Swede
Master Cruncher | Svíþjóð | Joined: Apr 10, 2020 | Post Count: 2508 | Status: Recently Active

> > > You just might toss some BOINC clients into panic mode, and they will start processing the new, shorter-deadline WUs first...
> >
> > I think we're fairly safe on that score. These tasks are so short that they make it from the back of the cache to the front in about 2.5 hours.
>
> Granted. But there are some awfully slow GPUs out there..... LOL. Meow

Exactly. My slow GTX 660M had a cache of 18 WUs with deadlines of May 5th and 6th. It then got a new WU with a deadline of May 2, went into big panic mode, and immediately started running the one with the May 2 deadline. That was really unnecessary, because those 18 cached WUs would have been finished by tomorrow. BOINC is not especially smart when it comes to things like this.

[Edit 1 times, last edit by Grumpy Swede at Apr 29, 2021 6:00:04 PM]
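The behaviour described here is BOINC's deadline-pressure ("panic") mode: when the client fears a deadline miss, it stops running the cache in arrival order and picks the task with the earliest deadline instead. A toy sketch of that selection rule, illustrative only and not actual client code:

```python
# Toy model of why one short-deadline WU jumps a full cache:
# under deadline pressure the scheduler switches from FIFO to
# earliest-deadline-first (EDF). Names here are made up.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    deadline: str  # ISO dates compare correctly as strings

# 18 cached WUs with the old 7-day deadlines, plus one new 3-day WU.
cache = (
    [Task(f"wu_old_{i}", "2021-05-05") for i in range(9)]
    + [Task(f"wu_old_{i}", "2021-05-06") for i in range(9, 18)]
    + [Task("wu_new", "2021-05-02")]
)

# EDF: the new WU runs first, even though the cached work
# would have finished comfortably within its deadlines.
next_task = min(cache, key=lambda t: t.deadline)
```

The rule itself is blunt: it looks only at deadlines, not at whether the older work was actually at risk, which is exactly the "not especially smart" behaviour complained about above.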