World Community Grid Forums

Thread Status: Active. Total posts in this thread: 781.

Richard Haselgrove
Senior Cruncher United Kingdom Joined: Feb 19, 2021 Post Count: 360 Status: Offline
> Yeah, I got 13 GPUs (each with 4 tasks in tandem) on it since roughly when it started and have had to rely heavily on back-up projects due to stalled/slow transfers in both directions (currently hundreds of pending uploads). Far cry from a steady workflow. I hope whatever insights gained from this server pounding are put into making things more efficient down the line. :)

I agree with that observation: with modern NVidia GPUs, I was producing upload files far faster than the server could accept them. Downloads were also a problem, but less severe than uploads. I've withdrawn my fast machines from this test, and uploaded/reported all outstanding tasks. I'll restart my Windows machines to run on iGPU only, so I can monitor how things go later in the day.

My observations relate to between about 05:30 UTC and 07:00 UTC, which is normally a relatively quiet time: I hate to think what will happen when the USA starts to wake up again. I may dip in and out again with a fast Linux machine, to keep in touch with the wider picture.

There are other side effects from the stress test: this forum is much slower than normal, and I think we've lost at least one scheduled statistics export.
[Edit 1 times, last edit by Richard Haselgrove at Apr 27, 2021 8:55:59 AM]
hnapel
Advanced Cruncher Netherlands Joined: Nov 17, 2004 Post Count: 82 Status: Offline
Lots of uploads go to 100% but somehow do not complete.
----------------------------------------
[Edit 1 times, last edit by hnapel at Apr 27, 2021 10:35:19 AM]
PMH_UK
Veteran Cruncher UK Joined: Apr 26, 2007 Post Count: 786 Status: Offline
You are adding to the problem with a 120-second loop to retry transfers. 900 seconds would be more reasonable: that is enough to stop transfers going into multi-hour backoffs, but won't hammer servers that are already overloaded.
----------------------------------------
Paul.
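For anyone rolling their own retry loop, here is a minimal sketch of the kind of timer being discussed, assuming a local client with boinccmd on the PATH (--network_available is the stock boinccmd call that retries deferred network communication):

```python
#!/usr/bin/env python3
# Minimal retry-loop sketch; assumes boinccmd is on the PATH and the
# local BOINC client is running. Not anyone's actual script.
import subprocess
import time

RETRY_INTERVAL = 900  # seconds; a 120 s loop hammers overloaded servers

while True:
    # Ask the client to retry any deferred (backed-off) transfers.
    subprocess.run(["boinccmd", "--network_available"], check=False)
    time.sleep(RETRY_INTERVAL)
```

At 900 seconds this sends at most four retry nudges an hour per host, versus thirty an hour with a 120-second loop.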
TonyEllis
Senior Cruncher Australia Joined: Jul 9, 2008 Post Count: 286 Status: Offline
Not part of the GPU test (I haven't any GPUs that qualify), but I ended up here to ascertain why ALL of my uploads and downloads were stalling and the forums were so slow. Those interested in the GPU WUs probably knew about the test's potential impact, but what about the rest of us who are severely impacted by a test that has nothing to do with us and weren't informed, i.e. those with no interest in GPU crunching?
----------------------------------------
Anyway, having fitted 10 Linux machines with a retry file-transfer script, they can now transfer files VERY SLOWLY, with multiple retries, until all files for a given WU finally get uploaded/downloaded. I have a Windows laptop on 2.4 GHz wifi that has crunched WCG WUs for 7 years with no problem. Nothing I could do would stop it stalling until I moved it to a 5 GHz AP.
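A rough sketch of what a multi-machine retry script of this sort might look like (not the actual script described above): it assumes each client permits GUI RPC from the controlling box via remote_hosts.cfg, and the hostnames and RPC password below are placeholders:

```python
#!/usr/bin/env python3
# Hypothetical multi-host retry loop; NOT the script described above.
# Assumes each client allows GUI RPC from this machine
# (remote_hosts.cfg) and that the placeholders below are replaced.
import subprocess
import time

HOSTS = ["linuxbox01", "linuxbox02"]  # placeholder hostnames
RPC_PASSWORD = "changeme"             # placeholder GUI RPC password
INTERVAL = 900                        # seconds between retry rounds

while True:
    for host in HOSTS:
        # Tell each client to retry its deferred uploads/downloads.
        subprocess.run(
            ["boinccmd", "--host", host, "--passwd", RPC_PASSWORD,
             "--network_available"],
            check=False,
        )
    time.sleep(INTERVAL)
```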
Run Time Stats https://grassmere-productions.no-ip.biz/
squid
Advanced Cruncher Germany Joined: May 15, 2020 Post Count: 56 Status: Offline
Today my GPU got many GPU tasks and processed them without problems.
The upload of some tasks gave an error like the one below. I think it is WCG server overload.

27-Apr-2021 10:40:28 [World Community Grid] Temporarily failed upload of OPNG_0004774_00156_0_r1196970475_0: transient HTTP error
27-Apr-2021 10:40:28 [World Community Grid] Backing off 00:18:04 on upload of OPNG_0004774_00156_0_r1196970475_0
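To gauge how often the server is refusing uploads, a quick sketch that tallies those messages per hour from the client's event log; the log path is an assumption (stdoutdae.txt is the usual name on Linux installs):

```python
#!/usr/bin/env python3
# Count transient upload failures per hour in the BOINC event log.
# LOG_PATH is an assumption; adjust for your install.
from collections import Counter

LOG_PATH = "/var/lib/boinc-client/stdoutdae.txt"

failures = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        if "Temporarily failed upload of" in line:
            # Lines look like: 27-Apr-2021 10:40:28 [World Community Grid] ...
            hour = line.split()[1].split(":")[0]
            failures[hour] += 1

for hour, count in sorted(failures.items()):
    print(f"{hour}:00  {count} transient upload failures")
```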
goben_2003
Advanced Cruncher Joined: Jun 16, 2006 Post Count: 146 Status: Offline
> You are adding to the problem with a 120-second loop to retry transfers. 900 seconds would be more reasonable: that is enough to stop transfers going into multi-hour backoffs, but won't hammer servers that are already overloaded. Paul.

Sorry Paul, but iirc I have 2 more undersea cables to jump through to get to the servers than you do. So even without the stress test I semi-regularly go into project backoff.
_heinz
Cruncher Joined: Apr 5, 2020 Post Count: 10 Status: Offline
I opened the doors of my V8-Xeon with 3 GTX Titans
will see how the units run :-)
tux93
Cruncher Germany Joined: Jan 5, 2012 Post Count: 9 Status: Offline
> Another option would be to put your boinc directory on a cheap spinny drive or iSCSI NAS.

That's what I ended up doing for the time being: copied the boinc dir to a spinning-rust partition and bind-mounted it to the original location.
----------------------------------------
Primary: Intel i7-4790 + nVidia GTX 1060
Secondary: Intel i7-2600 + nVidia GTX 750 Ti
OS: openSUSE Tumbleweed
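For anyone wanting to try the same move, a hedged sketch of the steps, assuming the usual Linux data dir and a placeholder HDD path; run it as root with the client stopped (the service name varies by distro):

```python
#!/usr/bin/env python3
# Sketch of relocating the BOINC data dir onto a spinning disk and
# bind-mounting it back over the original path. Paths are assumptions;
# run as root. The systemd service name may differ by distro.
import subprocess

SRC = "/var/lib/boinc-client"   # usual Linux data dir (assumption)
DST = "/mnt/hdd/boinc-client"   # placeholder spinning-disk location

def run(*cmd):
    subprocess.run(cmd, check=True)

run("systemctl", "stop", "boinc-client")   # stop the client first
run("rsync", "-a", SRC + "/", DST + "/")   # copy data to the HDD
run("mount", "--bind", DST, SRC)           # overlay the old path
run("systemctl", "start", "boinc-client")
```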
aegidius
Cruncher Joined: Aug 29, 2006 Post Count: 25 Status: Offline
So are the OPNG WUs going to keep coming after the 3-day stress test?
If they are, I'll go buy a better GPU :-)
Chooka
Cruncher Australia Joined: Jan 25, 2017 Post Count: 49 Status: Offline
FWIW, for those commenting on stats: the stats haven't exported for Einstein@Home either, so it might not be limited to WCG... or it's just coincidence.