Thread Status: Active
Total posts in this thread: 781
This topic has been viewed 945862 times and has 780 replies
Chooka
Cruncher
Australia
Joined: Jan 25, 2017
Post Count: 49
Status: Offline
Re: OpenPandemics - GPU Stress Test

I wonder if changing my project weight to 0 might keep the work flowing and not run into this limit?
The downside is, if there's not a lot of GPU work, I'll have idle PCs.

My 3950X now has 100 GPU WUs, and I guess that's because I have 2 GPUs in this system. The 1950X still has no GPU work. I've just culled a heap of CPU WUs. Let's see if that adds more GPU tasks.

*edit - Culling the CPU tasks has allowed more GPU work. Hmm... now how to keep the GPU tasks constant.

@MindCrimeZ - I've got a Radeon VII but I've limited it to 3 WUs concurrently. I'm not keen on giving up too many CPU threads to support it.

@Uplinger - Thank you for the testing. We all appreciate your time and effort and fully understand this is in its infancy.
[Edit 1 times, last edit by Chooka at Apr 30, 2021 3:26:31 AM]
[Apr 30, 2021 3:24:53 AM]
Chooka
Cruncher
Australia
Joined: Jan 25, 2017
Post Count: 49
Status: Offline
Re: OpenPandemics - GPU Stress Test

wow...One of these GPU tasks has been running for 47 min running 3 concurrent tasks with a Radeon VII. Sounds a bit much?
[Apr 30, 2021 3:34:43 AM]
Ian-n-Steve C.
Senior Cruncher
United States
Joined: May 15, 2020
Post Count: 180
Status: Offline
Re: OpenPandemics - GPU Stress Test

wow...One of these GPU tasks has been running for 47 min running 3 concurrent tasks with a Radeon VII. Sounds a bit much?


It happens. Some of them are just really long. I have one that ran for over an hour on an RTX 2080, running only 1 task on the GPU.
----------------------------------------

EPYC 7V12 / [5] RTX A4000
EPYC 7B12 / [5] RTX 3080Ti + [2] RTX 2080Ti
EPYC 7B12 / [6] RTX 3070Ti + [2] RTX 3060
[2] EPYC 7642 / [2] RTX 2080Ti
[Apr 30, 2021 4:00:24 AM]
flynryan
Senior Cruncher
United States
Joined: Aug 15, 2006
Post Count: 235
Status: Offline
Re: OpenPandemics - GPU Stress Test

wow...One of these GPU tasks has been running for 47 min running 3 concurrent tasks with a Radeon VII. Sounds a bit much?


Yep, are you running any other workloads on the GPU?
[Apr 30, 2021 5:41:47 AM]
zdnko
Senior Cruncher
Joined: Dec 1, 2005
Post Count: 235
Status: Offline
Re: OpenPandemics - GPU Stress Test

My first error since the beginning of the stress test:
https://www.worldcommunitygrid.org/ms/device/...og.do?resultId=1662839852

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00007FF70444DEE0 read attempt to address 0x00000000

[Apr 30, 2021 6:22:28 AM]
erich56
Senior Cruncher
Austria
Joined: Feb 24, 2007
Post Count: 300
Status: Offline
Re: OpenPandemics - GPU Stress Test

Yesterday evening, I noticed that quite a number of tasks ran for their full time, but the "Status" finally showed "error". Clicking on "error" shows this:

<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 3765269347 (0xe06d7363)</message>
<stderr_txt>
projects/www.worldcommunitygrid.org/wcgrid_opng_autodockgpu_7.28_windows_x86_64__opencl_nvidia_102 -jobs OPNG_0019629_00019.job -input OPNG_0019629_00019.zip -seed 40279370 -wcgruns 4550 -wcgdpf 91
INFO: Using gpu device from app init data 0


Does anyone have any idea what kind of error code this is?
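
For reference on the two codes quoted in this thread, here is a minimal Python sketch that decodes a 32-bit Windows exit code. The helper name `describe_exit_code` and the `KNOWN_CODES` table are made up for illustration and are not part of BOINC or WCG. It relies on two things that are, to my knowledge, standard on Windows: 0xC0000005 is STATUS_ACCESS_VIOLATION, and 0xE06D7363 is the code the Microsoft Visual C++ runtime raises for an uncaught C++ exception (its low three bytes spell "msc" in ASCII).

```python
# Hypothetical helper, not part of BOINC or WCG: decodes a 32-bit Windows
# exit/exception code such as the ones quoted in this thread.

# Two well-known Windows codes; this table is illustrative, not exhaustive.
KNOWN_CODES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION (invalid memory access)",
    0xE06D7363: "Microsoft Visual C++ exception (unhandled C++ throw)",
}


def describe_exit_code(code: int) -> dict:
    """Return the hex form, any printable ASCII tag, and a known meaning."""
    code &= 0xFFFFFFFF  # normalise signed values to unsigned 32-bit
    low3 = (code & 0x00FFFFFF).to_bytes(3, "big")
    # Only report a tag when all three low bytes are printable ASCII.
    tag = low3.decode("ascii") if all(0x20 <= b < 0x7F for b in low3) else ""
    return {
        "hex": f"0x{code:08x}",
        "ascii_tag": tag,
        "meaning": KNOWN_CODES.get(code, "unknown"),
    }


if __name__ == "__main__":
    for c in (3765269347, 0xC0000005):
        print(describe_exit_code(c))
```

If that reading is right, the exit code only says that some C++ exception went unhandled inside the application; it does not say which one or why.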
[Apr 30, 2021 6:47:37 AM]
Chooka
Cruncher
Australia
Joined: Jan 25, 2017
Post Count: 49
Status: Offline
Re: OpenPandemics - GPU Stress Test

wow...One of these GPU tasks has been running for 47 min running 3 concurrent tasks with a Radeon VII. Sounds a bit much?


Yep, are you running any other workloads on the GPU?


Nope, only WCG task.
It might be a hardware issue. I got an AMD pop-up saying there was a device hanging or something.
Never mind.
[Apr 30, 2021 8:42:38 AM]
zdnko
Senior Cruncher
Joined: Dec 1, 2005
Post Count: 235
Status: Offline
Re: OpenPandemics - GPU Stress Test

Is it normal for a WU to be sent to a second wingman 4 days before the deadline?
Project Name: OpenPandemics - COVID-19 - GPU
Created: 04/27/2021 11:03:15
Name: OPNG_0015605_00015
Minimum Quorum: 2
Replication: 3


Result Name | OS Type | OS Version | App Version Number | Status | Sent Time | Time Due / Return Time | CPU Time / Elapsed Time (hours) | Claimed / Granted BOINC Credit
OPNG_0015605_00015_2-- | Microsoft Windows | 10 Professional x64 Edition, (10.00.10586.00) | 728 | Valid | 4/29/21 18:30:23 | 4/30/21 08:44:48 | 0.38 | 1.2 / 1,618.7
OPNG_0015605_00015_1-- | Microsoft Windows | 10 Core x64 Edition, (10.00.19041.00) | 728 | Server Aborted | 4/27/21 22:06:47 | 4/30/21 08:46:49 | 0.00 | 0.0 / 0.0
OPNG_0015605_00015_0-- | Microsoft Windows | 10 Education x64 Edition, (10.00.18363.00) | 728 | Valid | 4/27/21 22:05:11 | 4/28/21 03:39:57 | 0.20 | 0.9 / 1,552.3

Maybe the deadline shown on the WU has remained unchanged, but the server now applies the new 3-day limit?
[Apr 30, 2021 9:06:48 AM]
goben_2003
Advanced Cruncher
Joined: Jun 16, 2006
Post Count: 146
Status: Offline
Re: OpenPandemics - GPU Stress Test

The statement about keeping the GPU tasks as close as possible to the CPU tasks is correct. This helps in multiple ways. It allows us to verify that things are working as they should without adding too many variables to the mix. These work units use the same method of starting and stopping each job (ligand) in the workunit. The only thing modified in the way they were generated is that I said to assume it's allowed to run 20x longer than on CPU. Not much else changed beyond that. Keeping the pipelines from the researchers to us and then to you similar allows us to decrease the number of variables that we introduce into the equation of differences.

Yes, there are parts of the GPU code that differ from the CPU code, but these were vetted and tested by the researchers before we took the application to grid-enable it. There are multiple options that we are in discussions with the researchers about. How long it'll take to get those implemented from the WCG end is unknown. I cannot promise when an updated version will be released.

We have heard members commenting on the GPU version using too much IO and other complaints, such as the polar opposite of it causing them to have issues on their displays...Some members commenting on bandwidth usage, etc...

The purpose of this stress test was to determine where some of the bottlenecks were in the system. We have heard the comments and suggestions about the application. We have made changes to our load balancer to help handle a lot more work units. We have identified that the small ligand files cause issues with the inodes of the filesystem filling up. All of these are stresses of the system. Some may be easily addressed, others take lots of time and effort. Releasing a new science application does not come as easily and quickly as you would hope; it is distributed to thousands of people and needs to be properly vetted and tested. All of that while also supporting and running other applications and trying to get some sleep in there.

This stress test has been very exciting for us and our team. We are in constant communication with the researchers and they are also very excited about the test so far. Thank you to everyone for your help on making this a successful test.

Please try to keep comments positive and helpful towards everyone in the forums and not combative. We try to make things run as best as they can, but we do not have unlimited resources.

Thanks,
-Uplinger

This stress test has been exciting for me too!
Thank you! I am grateful for all the work that you and the rest of the team have put into it!
[Apr 30, 2021 9:26:05 AM]