World Community Grid Forums

Thread Status: Active | Total posts in this thread: 781
Grumpy Swede
Master Cruncher Svíþjóð Joined: Apr 10, 2020 Post Count: 2509 Status: Offline Project Badges:
The stress test started Apr 26, 2021 10:05:47 PM (GMT+2)
----------------------------------------
https://www.worldcommunitygrid.org/forums/wcg/viewpostinthread?post=656665

So, this stress test is certainly going to take much longer than 3 days. We were going to run the "new" batches 13345 - 41773, and so far, I have not seen any WU's from any batch higher than 28868.

Edit: Great joy for my electricity supplier, though. I had planned to participate with all my GPU's for 3 days, and now it seems as if there's going to be many more days than that. I'm hanging in for a few more days.

[Edit 3 times, last edit by Grumpy Swede at Apr 30, 2021 9:56:06 AM]
Richard Haselgrove
Senior Cruncher United Kingdom Joined: Feb 19, 2021 Post Count: 360 Status: Offline Project Badges:
"It might be a hardware issue. I got an AMD pop-up to say there was a device hanging or something. Nevermind."

Nope, only the WCG task. That might also be a Windows problem. It was a popup like that that led me to the workaround for my Intel HD 4600 problems during the Beta.
erich56
Senior Cruncher Austria Joined: Feb 24, 2007 Post Count: 300 Status: Offline Project Badges:
Well, this stress test is certainly going to take much longer than 3 days. This became clear, though, from the moment the runtime of a task went up markedly due to very low utilisation of the GPU. With the tasks that had been sent out initially, this was not the case.
maeax
Advanced Cruncher Joined: May 2, 2007 Post Count: 144 Status: Offline Project Badges:
4 GPU tasks with 0.25 CPU and 0.25 GPU.
----------------------------------------
Additional work buffer set to 0.1 days. Getting a constant flow of tasks for both GPU and CPU. The GPU tasks now take 1 hour each with these longer work units. Good work from the researchers and the WCG team, thank you.
AMD Ryzen Threadripper PRO 3995WX 64-Cores/ AMD Radeon (TM) Pro W6600. OS Win11pro
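Running several tasks per GPU, as described above (0.25 CPU + 0.25 GPU per task, i.e. four tasks sharing one card), is typically done with a BOINC `app_config.xml` in the project's data folder. A minimal sketch — the app name `opng` is an assumption for the OpenPandemics GPU app; verify the exact name in your `client_state.xml` before using it:

```xml
<!-- app_config.xml — place in the World Community Grid project folder,
     then use Options > Read config files in the BOINC Manager.
     NOTE: the app name "opng" is assumed; check client_state.xml. -->
<app_config>
  <app>
    <name>opng</name>
    <gpu_versions>
      <gpu_usage>0.25</gpu_usage>  <!-- four tasks share one GPU -->
      <cpu_usage>0.25</cpu_usage>  <!-- CPU budgeted per task (a scheduling hint, not a hard limit) -->
    </gpu_versions>
  </app>
</app_config>
```

Note that `cpu_usage` only tells the scheduler how much CPU to reserve; the actual CPU consumed by each task can be far higher, as later posts in this thread point out.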
kittyman
Advanced Cruncher Joined: May 14, 2020 Post Count: 140 Status: Offline Project Badges:
Well, the kitties only have had 2 error tasks for the duration of their munching on WCG kibble. One was a bum task with zero runtime, and the second was caused when I was trying to increase the modest overclock on their GPU. As soon as I saw the error, I dialed it back and that was the end of that error cause.
----------------------------------------
No more kitty errors since. Meow!
M-spec
Cruncher The Netherlands Joined: Jul 29, 2007 Post Count: 4 Status: Offline Project Badges:
To me it shows fairly poor optimization of the application. The ideal situation is that the application uses as much of the GPU as possible to get the most work done. Needing 16 CPU cores to feed a single GPU is absurdly high. Most other projects use 1 or less to feed a GPU and keep high utilization the entire time. You shouldn't have to give up 16 cores (along with the power consumption that comes with that) that could otherwise be doing something more useful, like crunching CPU tasks for a project without a GPU app. Look at GPUGRID or Einstein; those are how you want your app to operate: able to feed the GPU to 95+% for the entire run with only a single CPU core to keep the GPU busy. Usually this means preloading more data into the GPU memory and making the GPU handle more functions. I know it's a "first cut" for this app, but it still has a long way to go for efficiency, in my opinion. We should all push for better utilization of resources for the sake of efficiency and not accept so much waste.

"Uplinger addressed this a while ago. He wanted to keep the GPU work units the same as the CPU work units initially to ensure consistent results, no doubt necessary for the science. He said he would tweak it up later."

This isn't correct. The GPU WUs are not the same as the CPU WUs. The GPU tasks have many, many more jobs prepackaged and are actually much larger than the CPU tasks, and GPU tasks cannot cross-validate with CPU tasks because of their differences. The GPU app optimization has nothing to do with this.

"The statement of keeping the GPU tasks as close to the CPU tasks is correct. This helps in multiple ways. It allows us to verify that things are working as they should without adding too many variables to the mix. These work units use the same method of starting and stopping each job (ligand) in the workunit. All that was modified in the way they were generated was that I said to assume it's allowed to run 20x longer than on CPU. Not much else changed beyond that.

Keeping the pipelines from the researchers to us, and then to you, similar allows us to decrease the number of variables that we introduce into the equation. Yes, there are differences in the GPU code that are not the same as the CPU code, but these were vetted and tested by the researchers before we took the application to grid-enable it. There are multiple options that we are in discussions with the researchers about. How long it'll take to get those implemented from the WCG end is unknown. I cannot promise when an updated version will be released.

We have heard members commenting on the GPU version using too much IO, and other complaints, such as the polar opposite of it causing them to have issues on their displays... Some members commenting on bandwidth usage, etc... The purpose of this stress test was to determine where some of the bottlenecks were in the system. We have heard the comments and suggestions about the application. We have made changes to our load balancer to help handle a lot more work units. We have identified that the small ligand files cause issues with the inodes of the filesystem filling up. All of these are stresses of the system. Some may be easily addressed; others take lots of time and effort. Releasing a new science application does not come as easily and quickly as you would hope; it is distributed to thousands of people and needs to be properly vetted and tested. All of that is to say: while supporting and running other applications, and trying to get some sleep in there.

This stress test has been very exciting for us and our team. We are in constant communication with the researchers and they are also very excited about the test so far. Thank you to everyone for your help in making this a successful test. Please try to keep comments positive and helpful towards everyone in the forums, and not combative. We try to make things run as best as they can, but we do not have unlimited resources.

Thanks,
-Uplinger"

It's a good stress test, but at the same time it also shows the credit reward is way off from a reasonable balance with the current OPNG application when running 24/7 on a high-end GPU and CPU. For instance, I currently run an RTX 3090 and a 3950X with 32 WU's simultaneously, with all 32 CPU threads @ 100% to back it up. This results in 150+ WU's/hour with the current batch size, and 125x(+) daily credits in comparison to CPU-only work; an insane daily record of 40,000,000(!) credits on a single computer. So many GPU tasks do stress the I/O, with around 80 MB/s writes on average to the SSD, and GPU VRAM usage switching between 6-13 GB. This may be 'fun' for a while, and I do hope the project will deliver promising results, but I doubt it will please all of the volunteers of much-needed CPU-only projects (which can also benefit much more from modern CPU instructions), myself included. Runtime for badges may be harder to achieve, but it now only takes a couple of days to be given the same amount of 'credit' as years of CPU runtime, and of course energy consumption. It's off the charts, even in comparison to today's high-end Epyc server systems. May not be healthy for WCG in the long run.
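As a rough sanity check on the figures quoted above (all inputs are the poster's own numbers, not official WCG data), the implied per-WU credit and the CPU-only baseline can be back-calculated:

```python
# Back-of-the-envelope check of the quoted throughput/credit figures.
wu_per_hour = 150            # claimed GPU work units per hour
credits_per_day = 40_000_000 # claimed daily credit on one machine
gpu_vs_cpu_ratio = 125       # claimed GPU-vs-CPU daily credit multiple

wu_per_day = wu_per_hour * 24                             # 3,600 WUs/day
credits_per_wu = credits_per_day / wu_per_day             # ~11,111 credits per GPU WU
implied_cpu_credits = credits_per_day / gpu_vs_cpu_ratio  # ~320,000/day CPU-only

print(wu_per_day, round(credits_per_wu), round(implied_cpu_credits))
```

So the three quoted numbers are at least mutually consistent: ~11,000 credits per GPU work unit against a CPU-only baseline in the low hundreds of thousands per day.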
goben_2003
Advanced Cruncher Joined: Jun 16, 2006 Post Count: 146 Status: Offline Project Badges:
Quoting M-spec: "It's a good stress test but at the same time it also shows the credit reward is way off from a reasonable balance with the current OPNG application when running 24/7 on a high-end GPU and CPU. [...] May not be healthy for WCG in the long run."

For the points: I personally do not care about points. However, the GPU work units should be worth more than a CPU unit. The earlier GPU units were 20x as much work as a CPU unit; IIRC the new ones are even larger. The points are supposed to line up with the amount of science done (at least roughly so, on average). Even my Intel GPUs do in roughly 1 hour what would take roughly 9 hours using all 8 CPU cores on the same machine. It is even more extreme with discrete cards, especially higher-end ones like you have. So the amount of points generated in a day has gone up an extreme amount because the amount of science done has gone up by such an extreme amount.

For the writes: I know some here disagree, but for me the writes were higher than I was comfortable with when running the Nvidia units. So I set it up to run off of a ramdisk, since I have plenty of RAM.

Cheers
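On Linux, one way to run the heavy-writing slot data from RAM is to mount a tmpfs over the BOINC slots directory. A sketch only — the path `/var/lib/boinc-client` and the `boinc` user are the Debian/Ubuntu packaging defaults and may differ on your system, and anything in a tmpfs is lost on reboot, so in-progress tasks will be abandoned unless you copy them back first:

```
# /etc/fstab — tmpfs over the BOINC slots directory (sketch, paths assumed).
# Stop the BOINC client before mounting; data here does not survive a reboot.
tmpfs  /var/lib/boinc-client/slots  tmpfs  rw,size=16G,uid=boinc,gid=boinc,mode=0755  0  0
```

The trade-off is straightforward: SSD wear drops to near zero for slot I/O, at the cost of RAM and of losing checkpointed work on power loss.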
hnapel
Advanced Cruncher Netherlands Joined: Nov 17, 2004 Post Count: 82 Status: Offline Project Badges:
I didn't edit my app_config.xml files to run more WU's on the same GPU at the same time; I'm just going with the flow of the default settings, and think the project should tweak the jobs on the server end to improve utilization if they want. With increased utilization also comes increased power demand and disk I/O (this should also be tweaked, IMHO), and I'm fine as it is. I just dialed down the setting for CPU percentage so the GPU jobs get sufficient CPU; I like to keep my system CPU usage just below 100%.

I have 4 PCs running the OPNG jobs on a total of 6 GPUs. My daily points are now close to 7M on average (coming from around 250K with only CPU jobs). During this stress test my WCG points rank dropped below 2000 in a dramatic jump, so I pride myself on now being among the top-tier contributors; I'm also close to 100M points for OpenPandemics alone, so I'm happy as is.

I think the post-mortem for this would be an interesting take on distributed computing, including the social aspects. I'm glad that even though this is a test we are doing real science, ramming the ZINC database through our collective rigs, and hopefully it will lead to some insights to fight the pandemic.
erich56
Senior Cruncher Austria Joined: Feb 24, 2007 Post Count: 300 Status: Offline Project Badges:
All my PCs say "no tasks available" -
is anyone else having the same experience?
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
My card has gone back to executing the 7-day work that was suspended when the 3-day work was released. There must be a pause in WU release.