World Community Grid Forums
Thread Status: Active | Total posts in this thread: 36
Keith Myers
Senior Cruncher | USA | Joined: Apr 6, 2021 | Post Count: 193 | Status: Offline
Quote (from Kevin's reply): "After reviewing your results and the other results returned for the workunits you have run, I don't see any common reason why yours are failing. Over the past 12 hours we have had around 70,000 results returned and 97.2% ran correctly. This is a normal level that indicates the application and the current set of data are running in a healthy state (especially for GPU apps, which tend to have more errors). I've also checked, and we are not seeing individual workunits failing. It is very likely that something changed on your machine - my guess would be the graphics driver (or the inter-operation of it and the kernel). I realize that your computer is running GPU apps from other projects successfully, but there is nothing on our side that points to why your specific machine is suddenly having issues."

Kevin, I just want to reiterate that the problem is not isolated to one machine. ALL of my PCs are having the issue - 5 hosts in total. All modern AMD CPUs: two 3950X, one 5950X, one EPYC 7402P and one EPYC 7443P. All relatively modern GPUs as well, ranging from a 1080 through a 3080, most of the Turing family - 13 GPUs in total across the five hosts.

----------------------------------------
A proud member of the OFA (Old Farts Association)
[Edited 1 time; last edit by Keith Myers at Jan 14, 2022 1:13:19 AM]
Richard Haselgrove
Senior Cruncher | United Kingdom | Joined: Feb 19, 2021 | Post Count: 360 | Status: Offline
Quote: "Now, if there are any watchdog threads associated with OPNG and something holds up progress for too long, maybe the entire process gets killed without the benefit of any error messages! This is speculative, based on my experience of sluggish BOINC agent work with two OPNG tasks, and the fact that lots of folks with lesser equipment (and less time stress per job) don't seem to be having problems at all (and are saying so!), whilst I suspect that the people being bitten have systems most likely to overtax the wrapper/agent code - e.g. multiple GPUs and lots of tasks running at once. Keith's case in point had a powerful GPU - I wonder how many tasks that user is running at once?!?"

I run these tasks under Windows too, so I let Process Explorer take a look at one. You're right - there are a lot of threads, and every few seconds the usage jumps right up:

[Process Explorer screenshot of an OPNG task's threads]

But even so, this oldish Windows 7 machine with 4 CPU cores and three GPUs can run five OPNG tasks simultaneously (when available, of course!).
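To illustrate the pattern being speculated about here: a wrapper-style watchdog typically runs a side thread that samples a progress counter and terminates the whole process if it stops advancing, which would explain a task dying without leaving error messages. The sketch below is a generic, hypothetical example of that pattern; the names, the timeout, and the exit code are assumptions, and it is not the actual OPNG wrapper code.

```c
/* Hypothetical watchdog sketch, NOT the actual OPNG wrapper:
 * a side thread samples a shared progress counter and aborts the
 * process if the counter stops advancing for a full interval. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static atomic_ulong progress_ticks;   /* bumped by the worker as jobs finish */
#define STALL_LIMIT_SEC 300           /* assumed stall limit, purely illustrative */

static void *watchdog_main(void *arg)
{
    (void)arg;
    unsigned long last = atomic_load(&progress_ticks);
    for (;;) {
        sleep(STALL_LIMIT_SEC);
        unsigned long now = atomic_load(&progress_ticks);
        if (now == last) {
            /* No progress for the whole interval: kill the process abruptly,
             * so little or nothing useful ends up in the task's stderr. */
            fprintf(stderr, "watchdog: no progress for %d s, aborting\n",
                    STALL_LIMIT_SEC);
            exit(255);                 /* example exit code only */
        }
        last = now;
    }
    return NULL;
}

int main(void)
{
    pthread_t wd;
    pthread_create(&wd, NULL, watchdog_main, NULL);

    /* Simulated worker: bump the counter as work completes. */
    for (int i = 0; i < 10; i++) {
        sleep(1);                      /* stand-in for real GPU work */
        atomic_fetch_add(&progress_ticks, 1);
    }
    return 0;
}
```

If a heavily loaded multi-GPU host delays the worker enough that the counter goes stale, a watchdog like this would fire even though nothing is wrong with the work itself, which fits the "overtaxed wrapper/agent" theory above.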
BladeD
Ace Cruncher | USA | Joined: Nov 17, 2004 | Post Count: 28976 | Status: Offline
[Keith Myers's post above quoted in full.]

Since all was well a few days ago, you need to restore your PCs back to the state they were in then.
Ian-n-Steve C.
Senior Cruncher | United States | Joined: May 15, 2020 | Post Count: 180 | Status: Offline
Keith, see my post on the Team forum. I know the cause. No solution without compromise at the moment.
----------------------------------------
EPYC 7V12 / [5] RTX A4000
EPYC 7B12 / [5] RTX 3080Ti + [2] RTX 2080Ti
EPYC 7B12 / [6] RTX 3070Ti + [2] RTX 3060
[2] EPYC 7642 / [2] RTX 2080Ti
Keith Myers
Senior Cruncher | USA | Joined: Apr 6, 2021 | Post Count: 193 | Status: Offline
PC state hasn't changed; only these new, larger tasks have. The hosts have been running in the same configuration for months now, and tasks completed with no issues.
I had a teammate try running these new tasks and he hit the same 255 error. So we have figured out why we can't run these tasks: we are running a special Einstein Gamma-ray application that is incompatible with these new, larger tasks. So either we get our developer to tweak the application again to handle OPNG, stop using the application, or stop running OPNG. Sorry to have wasted everyone's time. I just didn't think of the Einstein application because it had been running fine with the older tasks for months now.
Ian-n-Steve C.
Senior Cruncher | United States | Joined: May 15, 2020 | Post Count: 180 | Status: Offline
To be more specific, it's not about the Einstein application itself, and we don't even use a full custom build of the Einstein app (that would be much easier, but we have limitations in doing that). Rather, it's a shared library that we inject into BOINC to manipulate how OpenCL kernels are compiled at runtime, with the aim of optimizing the Einstein application. Since OPNG is also OpenCL, it's having some unintended side effects.
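For readers wondering how a library can change how OpenCL kernels are compiled at runtime: one common technique on Linux is an LD_PRELOAD shim that interposes clBuildProgram and rewrites the build options before calling the real driver. The sketch below is a minimal, hypothetical example of that technique; the file names, the extra flag, and the build/run commands are assumptions, and it is not the team's actual library.

```c
/* Hypothetical sketch only: an LD_PRELOAD shim that interposes
 * clBuildProgram() and appends an extra compiler option before calling
 * the real OpenCL implementation. The flag and names are illustrative.
 *
 * Build (assumption):  gcc -shared -fPIC -o libclshim.so clshim.c -ldl
 * Run   (assumption):  LD_PRELOAD=/path/to/libclshim.so <opencl app>
 */
#define _GNU_SOURCE
#define CL_TARGET_OPENCL_VERSION 120
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <CL/cl.h>

cl_int clBuildProgram(cl_program program, cl_uint num_devices,
                      const cl_device_id *device_list, const char *options,
                      void (CL_CALLBACK *pfn_notify)(cl_program, void *),
                      void *user_data)
{
    /* Look up the real clBuildProgram the first time we are called. */
    typedef cl_int (*build_fn)(cl_program, cl_uint, const cl_device_id *,
                               const char *,
                               void (CL_CALLBACK *)(cl_program, void *),
                               void *);
    static build_fn real_build;
    if (!real_build)
        real_build = (build_fn)dlsym(RTLD_NEXT, "clBuildProgram");

    /* Append an illustrative extra flag to whatever options the app passed. */
    const char *extra = " -cl-fast-relaxed-math";   /* assumption, not the real flag */
    size_t len = (options ? strlen(options) : 0) + strlen(extra) + 1;
    char *patched = malloc(len);
    snprintf(patched, len, "%s%s", options ? options : "", extra);

    cl_int err = real_build(program, num_devices, device_list,
                            patched, pfn_notify, user_data);
    free(patched);
    return err;
}
```

Because an interposed clBuildProgram sees every kernel compiled in that process (and in any child process that inherits the preload), a tweak tuned for one project's OpenCL app gets applied to every other OpenCL app too, which is exactly the kind of unintended side effect described above.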
Ian-n-Steve C.
Senior Cruncher | United States | Joined: May 15, 2020 | Post Count: 180 | Status: Offline
Also, this is an issue caused by the custom library being used. It won't affect anyone other than our team members who are using it.
bluestang
Senior Cruncher | USA | Joined: Oct 1, 2010 | Post Count: 272 | Status: Offline
I'll go out on a limb and say it's not just the custom Einstein code you're running, but also these newer OPNG WUs, which have 10x more jobs inside them, contributing to the issue.
Ian-n-Steve C.
Senior Cruncher | United States | Joined: May 15, 2020 | Post Count: 180 | Status: Offline
The longer jobs are a contributing factor, since the errors seem to happen after about the same amount of time, and the previous, shorter jobs may have been under this threshold. But the custom library in use is certainly the root cause; without it, no errors happen.
Ian-n-Steve C.
Senior Cruncher | United States | Joined: May 15, 2020 | Post Count: 180 | Status: Offline
Quote (bluestang): "...these newer OPNG WUs which have 10x more jobs inside them..."

Where does this 10x value come from? Looking at the few jobs I completed recently, they only ran ~88 AutoDock jobs, which is in line with what they were pushing out a few months ago. These are not excessively long and seem to take the same amount of time as previous work. Did they change to really short WUs with only ~10 AutoDock jobs a few months ago and then change back to "longer" tasks?