World Community Grid Forums
Category: Beta Testing
Forum: Beta Test Support Forum
Thread: OpenPandemics GPU Beta Test - March 26 2021 [ Issues Thread ]
Thread Status: Locked
Total posts in this thread: 511
mdxi
Advanced Cruncher | Joined: Dec 6, 2017 | Post Count: 109
Since the last beta I had a very unexpected exchange with an AMD engineer, which led to a bit of research, and long story short: I now have the OpenCL portions of the AMDGPU-PRO drivers deployed on my nodes with AMD GPUs, and this means that those nodes are now OCL 1.2 compliant. Which, in turn, means that OPNG now runs successfully on those nodes (well, only one of the three got WUs, but they're all using the same GPU series, so this result should apply to all of them). 8 WUs completed, all in very reasonable times:
----------------------------------------
All validated already. No infinite hangs or other weirdness. Woo!
||
|
uplinger
Former World Community Grid Tech | Joined: May 23, 2005 | Post Count: 3952

> Keith, FWIW the 4 resend tasks that I had validated before you released the latest 20 batches of betas all received between roughly 1100 & 1300 points each. The betas after those 4 received half or less the amount of points per task. I'm guessing that's because those 4 resends were more involved tasks. I'm not at all complaining. Just giving you a heads up in case there is an issue that needs to be addressed. Thanks for all the hard work you do here.

After my last post I was looking into the points, and there was a bug in the granting of credit. The amount of credit should be higher. I am working to push that change out now.

As for the ones getting closer to 1100 and 1300 points, that means they ran, on average, pretty much 90% of the way through all jobs, some of which were probably hitting 100%.

Do you have an example of the work unit name for the 1100 and 1300 point results that you encountered? I can use those to verify on my end what you're seeing.

Thanks,
-Uplinger
||
|
Speedy51
Veteran Cruncher | New Zealand | Joined: Nov 4, 2005 | Post Count: 1264

Thanks for the details.

> From your comment of getting 600, that sounds like lots of the jobs stopped early. I will be monitoring the values to make sure things look correct over the next few days and early when we go to production.

The task in question (0000162_00228) had 30 jobs inside it. Time spent on the task was 0.03/0.04. It looks like between 2 and 4 seconds were spent on each job within the task.

I noticed I had another task (0030007_00530_0) that had 18 jobs inside and granted me 0.3/820.7 with a runtime of 0.03/0.03. Doesn't that seem excessively high, considering my other task had 30 jobs? It looks like around 7 seconds were spent on each job in this task.

I have included the task names above for you. My device name is DESKTOP-FVL1L8F, if you would like to look into it. I am sure I speak for more than myself when I say that we appreciate the time you spend making sure things work correctly.
||
|
nanoprobe
Master Cruncher | Classified | Joined: Aug 29, 2008 | Post Count: 2998

> After my last post I was looking into the points, and there was a bug in the granting of credit. The amount of credit should be higher. I am working to push that change out now.
>
> As for the ones getting closer to 1100 and 1300 points, that means they ran, on average, pretty much 90% of the way through all jobs, some of which were probably hitting 100%.
>
> Do you have an example of the work unit name for the 1100 and 1300 point results that you encountered? I can use those to verify on my end what you're seeing.
>
> Thanks,
> -Uplinger

Here is the one that was granted the most points. Mine is the Win 7 machine and is the slowest of the 4 I have running.

Project Name: BETA - OpenPandemics - COVID-19 - GPU
Created: 03/26/2021 19:50:39
Name: BETA_OPNG_0000089_00236
Minimum Quorum: 2
Replication: 2

Result Name | OS type | OS version | App Version Number | Status | Sent Time | Time Due / Return Time | CPU Time / Elapsed Time (hours) | Claimed / Granted BOINC Credit
BETA_OPNG_0000089_00236_2-- | Microsoft Windows 7 | Professional x64 Edition, Service Pack 1, (06.01.7601.00) | 728 | Valid | 3/30/21 20:16:22 | 3/30/21 20:37:49 | 0.03 | 0.0 / 1,321.9
BETA_OPNG_0000089_00236_1-- | Microsoft Windows 10 | Professional x64 Edition, (10.00.19042.00) | 728 | Valid | 3/26/21 20:15:44 | 3/26/21 20:37:45 | 0.04 | 0.0 / 1,285.1
BETA_OPNG_0000089_00236_0-- | Microsoft Windows 10 | Core x64 Edition, (10.00.19042.00) | - | No Reply | 3/26/21 20:15:35 | 3/30/21 20:15:35 | 0.00 | 0.0 / 0.0

----------------------------------------
In 1969 I took an oath to defend and protect the U S Constitution against all enemies, both foreign and Domestic. There was no expiration date.
[Edit 1 times, last edit by nanoprobe at Mar 31, 2021 3:49:47 AM]
||
|
uplinger
Former World Community Grid Tech | Joined: May 23, 2005 | Post Count: 3952

Thanks Speedy,

Those credits look in line with what I'm expecting to see.

For a CPU work unit, we estimate that it can run X jobs based on what each job has inside of it. This is based on how many atoms are in a given ligand:

( 0.0000000122 * Atoms^2 + 0.0000000751 * Atoms + 0.0000105946 ) * ga_num_evals * ga_run = how long we estimate it will take on an average CPU

Each job has a different number of atoms and a different structure, which changes the result of the equation, because the number of evals differs and is generally higher with more atoms in a ligand. This is 100% just an estimate, but it gets us a pretty good average runtime on similar processors.

When a work unit is created, we package multiple jobs together or split them up based on how difficult they are. We try to target, say, 3 hours per CPU work unit. For the GPU version, we create them with 20 times the difficulty of the CPU version. These are split the exact same way, so they get 20 times more points because they were originally created 20 times harder. If we ran one of the GPU work units on a CPU, it would on average take 60 hours to complete the same task. This is the basis for why points are granted the way they are for this application.

One thing that is different, as I mentioned just above, is that when a job finds a good answer, it does not need to continue and stops early. This is why you are granted a percentage of the total max points allowed.

Thanks,
-Uplinger
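The estimate described in this post can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the project's actual code: the function names are hypothetical, the post does not state the unit of the estimate (so the raw value is returned unlabelled), and the credit scaling is an assumed reading of "a percentage of the total max points allowed".

```python
# Constants quoted in the post.
CPU_TARGET_HOURS = 3          # "say 3 hours per CPU work unit"
GPU_DIFFICULTY_FACTOR = 20    # GPU work units are created 20 times harder


def estimated_job_cost(atoms: int, ga_num_evals: int, ga_run: int) -> float:
    """Estimated cost of one job on an average CPU.

    Uses the polynomial from the post. The unit is not stated there,
    so this returns the raw number without converting it.
    """
    per_eval = (0.0000000122 * atoms ** 2
                + 0.0000000751 * atoms
                + 0.0000105946)
    return per_eval * ga_num_evals * ga_run


def gpu_work_unit_cpu_hours() -> int:
    """CPU-equivalent size of a GPU work unit: 3 hours * 20 = 60 hours."""
    return CPU_TARGET_HOURS * GPU_DIFFICULTY_FACTOR


def granted_credit(max_points: float, fraction_completed: float) -> float:
    """Hypothetical credit rule: scale the maximum points by the share of
    the work that actually ran before jobs stopped early."""
    if not 0.0 <= fraction_completed <= 1.0:
        raise ValueError("fraction_completed must be between 0 and 1")
    return max_points * fraction_completed


# Illustrative inputs only; these are not values from a real work unit.
print(estimated_job_cost(atoms=30, ga_num_evals=2_500_000, ga_run=20))
print(gpu_work_unit_cpu_hours())
```

Because the atom count enters quadratically, larger ligands raise the estimate quickly, which fits the remark that jobs are packaged together or split up by difficulty.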
||
|
uplinger
Former World Community Grid Tech | Joined: May 23, 2005 | Post Count: 3952

> Here is the one that was granted the most points. Mine is the Win 7 machine and is the slowest of the 4 I have running.
>
> Project Name: BETA - OpenPandemics - COVID-19 - GPU
> Created: 03/26/2021 19:50:39
> Name: BETA_OPNG_0000089_00236
> Minimum Quorum: 2
> Replication: 2
>
> Result Name | OS type | OS version | App Version Number | Status | Sent Time | Time Due / Return Time | CPU Time / Elapsed Time (hours) | Claimed / Granted BOINC Credit
> BETA_OPNG_0000089_00236_2-- | Microsoft Windows 7 | Professional x64 Edition, Service Pack 1, (06.01.7601.00) | 728 | Valid | 3/30/21 20:16:22 | 3/30/21 20:37:49 | 0.03 | 0.0 / 1,321.9
> BETA_OPNG_0000089_00236_1-- | Microsoft Windows 10 | Professional x64 Edition, (10.00.19042.00) | 728 | Valid | 3/26/21 20:15:44 | 3/26/21 20:37:45 | 0.04 | 0.0 / 1,285.1
> BETA_OPNG_0000089_00236_0-- | Microsoft Windows 10 | Core x64 Edition, (10.00.19042.00) | - | No Reply | 3/26/21 20:15:35 | 3/30/21 20:15:35 | 0.00 | 0.0 / 0.0

The no reply person has 3 total results that all went no reply.

For your workunit, yes, you had to basically do all the calculations in each job. Unfortunately I cannot tell the future on these jobs to know what they will do. But this particular work unit was very difficult on every job.

Thanks,
-Uplinger
||
|
bozz4science
Advanced Cruncher | Germany | Joined: May 3, 2020 | Post Count: 104

Looking good so far on my side as well. Just the same issue repeating for me since the start of the first beta test: my 1660S always suddenly stops after a certain amount of time. If I don't babysit these tasks and suspend/unsuspend them, they run into runtime exceeded errors, whereas applying this strategy sets the runtime back by a few minutes and lets them finish within minutes.

No issues on a 970. I'll try a clean driver install before going live to hopefully fix that issue.
----------------------------------------
AMD Ryzen 3700X @ 4.0 GHz / GTX1660S
Intel i5-4278U CPU @ 2.60GHz
||
|
maeax
Advanced Cruncher | Joined: May 2, 2007 | Post Count: 142

> The no reply person has 3 total results that all went no reply.
>
> For your workunit, yes, you had to basically do all the calculations in each job. Unfortunately I cannot tell the future on these jobs to know what they will do. But this particular work unit was very difficult on every job.
>
> Thanks,
> -Uplinger

24 Calculations
https://www.worldcommunitygrid.org/ms/device/...s.do?workunitId=596793907
CPU Time / Elapsed Time (hours): 0.03 / 0.94
Claimed / Granted BOINC Credit: 0.3 / 891.7

29 Calculations
https://www.worldcommunitygrid.org/ms/device/...s.do?workunitId=596792236
CPU Time / Elapsed Time (hours): 0.04 / 1.11
Claimed / Granted BOINC Credit: 0.3 / 1,023.1

The difference seems ok.
----------------------------------------
AMD Ryzen Threadripper PRO 3995WX 64-Cores/ AMD Radeon (TM) Pro W6600. OS Win11pro
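As a rough check on why the difference looks reasonable, the granted credit in the two results above can be divided by the number of calculations in each. The sketch below does only that arithmetic; the dictionary keys and the helper name are made up for illustration, and it assumes, without confirmation from the thread, that the calculations in both work units were of broadly similar difficulty.

```python
# Figures taken from the two results quoted in the post above.
results = [
    {"calculations": 24, "granted_credit": 891.7},
    {"calculations": 29, "granted_credit": 1023.1},
]


def credit_per_calculation(result: dict) -> float:
    """Granted BOINC credit divided by the number of calculations."""
    return result["granted_credit"] / result["calculations"]


per_calc = [credit_per_calculation(r) for r in results]

# Relative gap between the two per-calculation figures.
relative_gap = abs(per_calc[0] - per_calc[1]) / max(per_calc)

for r, value in zip(results, per_calc):
    print(f"{r['calculations']} calculations: {value:.2f} credit each")
print(f"relative gap: {relative_gap:.1%}")
```

Both work units come out at roughly 35 to 37 points per calculation, so the larger total for the 29-calculation result is mostly explained by it simply containing more calculations.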
||
|
nanoprobe
Master Cruncher | Classified | Joined: Aug 29, 2008 | Post Count: 2998

> The no reply person has 3 total results that all went no reply.
>
> For your workunit, yes, you had to basically do all the calculations in each job. Unfortunately I cannot tell the future on these jobs to know what they will do. But this particular work unit was very difficult on every job.
>
> Thanks,
> -Uplinger

Thanks Keith. Glad to see all seems well. Just one more question. When do you sleep?
----------------------------------------
In 1969 I took an oath to defend and protect the U S Constitution against all enemies, both foreign and Domestic. There was no expiration date.
[Edit 1 times, last edit by nanoprobe at Mar 31, 2021 12:18:46 PM]
||
|
uplinger
Former World Community Grid Tech | Joined: May 23, 2005 | Post Count: 3952

> Thanks Keith. Glad to see all seems well. Just one more question. When do you sleep?

Sleep is overrated... We are putting in a few extra hours to help push the release of this on my aggressive timeline :)

On a side note, I am planning on releasing about 200 batches of Beta today. I am scheduling them to build here in a few minutes, and then after my 9am meeting I'll work towards loading them into BOINC.

Also, it looks like the fix to the validator to award points is working as expected.

Thanks,
-Uplinger