World Community Grid Forums
Thread Status: Active | Total posts in this thread: 36
Keith Myers
Senior Cruncher | USA | Joined: Apr 6, 2021 | Post Count: 193 | Status: Offline
Quote (from Kevin's reply): "After reviewing your results and the other results returned for the workunits you have run, I don't see any common reason why yours are failing. Over the past 12 hours we have had around 70,000 results returned and 97.2% ran correctly. This is a normal level that indicates the application and the current set of data are running in a healthy state (especially for GPU apps, which tend to have more errors). I've also checked, and we are not seeing individual workunits failing. It is very likely that something changed on your machine - my guess would be the graphics driver (or the inter-operation of it and the kernel). I realize that your computer is running GPU apps from other projects successfully, but there is nothing on our side that points to why your specific machine is suddenly having issues."

Kevin, I just want to reiterate that the problem is not isolated to one machine. ALL of my PCs are having the issue - 5 hosts in total. All modern AMD CPUs: two 3950X, one 5950X, one EPYC 7402P and one EPYC 7443P. All relatively modern GPUs as well, ranging from a 1080 through a 3080, most of the Turing family - 13 GPUs in total across the five hosts.

----------------------------------------
A proud member of the OFA (Old Farts Association)
[Edited 1 time; last edit by Keith Myers at Jan 14, 2022 1:13:19 AM]
Richard Haselgrove
Senior Cruncher | United Kingdom | Joined: Feb 19, 2021 | Post Count: 360 | Status: Offline
Quote: "Now, if there are any watchdog threads associated with OPNG and something holds up progress for too long, maybe the entire process gets killed without the benefit of any error messages! This is speculative, based on my experience of sluggish BOINC agent work with two OPNG tasks, and the fact that lots of folks with lesser equipment (and less time stress per job) don't seem to be having problems at all (and are saying so!), whilst I suspect that the people being bitten have systems most likely to overtax the wrapper/agent code - e.g. multiple GPUs and lots of tasks running at once. Keith's case in point had a powerful GPU - I wonder how many tasks that user is running at once?!?"

I run these tasks under Windows too, so I let Process Explorer take a look at one. You're right - there are a lot of threads, and every few seconds the usage jumps right up:

[Process Explorer screenshot of an OPNG task's threads]

But even so, this oldish Windows 7 machine with 4 CPU cores and three GPUs can run five OPNG tasks simultaneously (when available, of course!).
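To illustrate the pattern being speculated about here: a wrapper-style watchdog typically runs a side thread that samples a progress counter and terminates the whole process if it stops advancing, which would explain a task dying without leaving error messages. The sketch below is a generic, hypothetical example of that pattern; the names, the timeout, and the exit code are assumptions, and it is not the actual OPNG wrapper code.

```c
/* Hypothetical watchdog sketch, NOT the actual OPNG wrapper:
 * a side thread samples a shared progress counter and aborts the
 * process if the counter stops advancing for a full interval. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static atomic_ulong progress_ticks;   /* bumped by the worker as jobs finish */
#define STALL_LIMIT_SEC 300           /* assumed stall limit, purely illustrative */

static void *watchdog_main(void *arg)
{
    (void)arg;
    unsigned long last = atomic_load(&progress_ticks);
    for (;;) {
        sleep(STALL_LIMIT_SEC);
        unsigned long now = atomic_load(&progress_ticks);
        if (now == last) {
            /* No progress for the whole interval: kill the process abruptly,
             * so little or nothing useful ends up in the task's stderr. */
            fprintf(stderr, "watchdog: no progress for %d s, aborting\n",
                    STALL_LIMIT_SEC);
            exit(255);                 /* example exit code only */
        }
        last = now;
    }
    return NULL;
}

int main(void)
{
    pthread_t wd;
    pthread_create(&wd, NULL, watchdog_main, NULL);

    /* Simulated worker: bump the counter as work completes. */
    for (int i = 0; i < 10; i++) {
        sleep(1);                      /* stand-in for real GPU work */
        atomic_fetch_add(&progress_ticks, 1);
    }
    return 0;
}
```

If a heavily loaded multi-GPU host delays the worker enough that the counter goes stale, a watchdog like this would fire even though nothing is wrong with the work itself, which fits the "overtaxed wrapper/agent" theory above.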
BladeD
Ace Cruncher | USA | Joined: Nov 17, 2004 | Post Count: 28976 | Status: Offline
[Keith Myers's post above quoted in full.]

Since all was well a few days ago, you need to restore your PCs back to the state they were in then.
Ian-n-Steve C.
Senior Cruncher | United States | Joined: May 15, 2020 | Post Count: 180 | Status: Offline
Keith, see my post on the Team forum. I know the cause. No solution without compromise at the moment.
----------------------------------------
EPYC 7V12 / [5] RTX A4000
EPYC 7B12 / [5] RTX 3080Ti + [2] RTX 2080Ti
EPYC 7B12 / [6] RTX 3070Ti + [2] RTX 3060
[2] EPYC 7642 / [2] RTX 2080Ti
Keith Myers
Senior Cruncher | USA | Joined: Apr 6, 2021 | Post Count: 193 | Status: Offline
PC state hasn't changed; only these new, larger tasks have. The hosts have been running in the same configuration for months now, and tasks completed with no issues.
I had a teammate try running these new tasks and he hit the same 255 error. So we have figured out why we can't run these tasks: we are running a special Einstein Gamma-ray application that is incompatible with these new, larger tasks. So either we get our developer to tweak the application again to handle OPNG, stop using the application, or stop running OPNG. Sorry to have wasted everyone's time. I just didn't think of the Einstein application because it had been running fine with the older tasks for months now.
Ian-n-Steve C.
Senior Cruncher | United States | Joined: May 15, 2020 | Post Count: 180 | Status: Offline
To be more specific, it's not about the Einstein application itself, and we don't even use a full custom build of the Einstein app (that would be much easier, but we have limitations in doing that). Rather, it's a shared library that we inject into BOINC to manipulate how OpenCL kernels are compiled at runtime, with the aim of optimizing the Einstein application. Since OPNG is also OpenCL, it's having some unintended side effects.
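For readers wondering how a library can change how OpenCL kernels are compiled at runtime: one common technique on Linux is an LD_PRELOAD shim that interposes clBuildProgram and rewrites the build options before calling the real driver. The sketch below is a minimal, hypothetical example of that technique; the file names, the extra flag, and the build/run commands are assumptions, and it is not the team's actual library.

```c
/* Hypothetical sketch only: an LD_PRELOAD shim that interposes
 * clBuildProgram() and appends an extra compiler option before calling
 * the real OpenCL implementation. The flag and names are illustrative.
 *
 * Build (assumption):  gcc -shared -fPIC -o libclshim.so clshim.c -ldl
 * Run   (assumption):  LD_PRELOAD=/path/to/libclshim.so <opencl app>
 */
#define _GNU_SOURCE
#define CL_TARGET_OPENCL_VERSION 120
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <CL/cl.h>

cl_int clBuildProgram(cl_program program, cl_uint num_devices,
                      const cl_device_id *device_list, const char *options,
                      void (CL_CALLBACK *pfn_notify)(cl_program, void *),
                      void *user_data)
{
    /* Look up the real clBuildProgram the first time we are called. */
    typedef cl_int (*build_fn)(cl_program, cl_uint, const cl_device_id *,
                               const char *,
                               void (CL_CALLBACK *)(cl_program, void *),
                               void *);
    static build_fn real_build;
    if (!real_build)
        real_build = (build_fn)dlsym(RTLD_NEXT, "clBuildProgram");

    /* Append an illustrative extra flag to whatever options the app passed. */
    const char *extra = " -cl-fast-relaxed-math";   /* assumption, not the real flag */
    size_t len = (options ? strlen(options) : 0) + strlen(extra) + 1;
    char *patched = malloc(len);
    snprintf(patched, len, "%s%s", options ? options : "", extra);

    cl_int err = real_build(program, num_devices, device_list,
                            patched, pfn_notify, user_data);
    free(patched);
    return err;
}
```

Because an interposed clBuildProgram sees every kernel compiled in that process (and in any child process that inherits the preload), a tweak tuned for one project's OpenCL app gets applied to every other OpenCL app too, which is exactly the kind of unintended side effect described above.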
Ian-n-Steve C.
Senior Cruncher | United States | Joined: May 15, 2020 | Post Count: 180 | Status: Offline
Also, this is an issue caused by the custom library being used. It won't affect anyone other than our team members who are using it.
bluestang
Senior Cruncher | USA | Joined: Oct 1, 2010 | Post Count: 272 | Status: Offline
I'll go out on a limb and say it's not just the custom Einstein code you're running, but also these newer OPNG WUs, which have 10x more jobs inside them, contributing to the issue.
Ian-n-Steve C.
Senior Cruncher | United States | Joined: May 15, 2020 | Post Count: 180 | Status: Offline
The longer jobs are a contributing factor, since the errors seem to happen after about the same amount of time, and the previous, shorter jobs may have been under this threshold. But the custom library in use is certainly the root cause; without it, no errors happen.
Ian-n-Steve C.
Senior Cruncher | United States | Joined: May 15, 2020 | Post Count: 180 | Status: Offline
Quote (bluestang): "...these newer OPNG WUs which have 10x more jobs inside them..."

Where does this 10x value come from? Looking at the few jobs I completed recently, they only ran ~88 AutoDock jobs, which is in line with what they were pushing out a few months ago. These are not excessively long and seem to take the same amount of time as previous work. Did they change to really short WUs with only ~10 AutoDock jobs a few months ago and then change back to "longer" tasks?