World Community Grid Forums

Thread Status: Active | Total posts in this thread: 781
Grumpy Swede
Master Cruncher Svíþjóð Joined: Apr 10, 2020 Post Count: 2509 Status: Offline Project Badges:
The stress test started Apr 26, 2021 10:05:47 PM (GMT+2)
----------------------------------------
https://www.worldcommunitygrid.org/forums/wcg/viewpostinthread?post=656665

So, this stress test is certainly going to take much longer than 3 days. We were going to run the "new" batches 13345 - 41773, and so far, I have not seen any WU's from any batch higher than 28868.

Edit: Great joy for my electricity supplier, though. I had planned to participate with all my GPU's for 3 days, and now it seems as if there's going to be many more days than that. I'm hanging in for a few more days.

[Edit 3 times, last edit by Grumpy Swede at Apr 30, 2021 9:56:06 AM]
Richard Haselgrove
Senior Cruncher United Kingdom Joined: Feb 19, 2021 Post Count: 360 Status: Offline Project Badges:
"It might be a hardware issue. I got an AMD pop-up to say there was a device hanging or something. Nevermind."

Nope, only the WCG task. That might also be a Windows problem. It was a popup like that that led me to the workaround for my Intel HD 4600 problems during the Beta.
erich56
Senior Cruncher Austria Joined: Feb 24, 2007 Post Count: 300 Status: Offline Project Badges:
Well, this stress test is certainly going to take much longer than 3 days. This became clear, though, from the moment the runtime of a task went up markedly due to very low utilisation of the GPU. With the tasks that had been sent out initially, this was not the case.
maeax
Advanced Cruncher Joined: May 2, 2007 Post Count: 144 Status: Offline Project Badges:
4 GPU tasks with 0.25 CPU and 0.25 GPU.
----------------------------------------
Additional work buffer set to 0.1 days. Getting a constant flow of tasks for both GPU and CPU. The GPU tasks now take 1 hour each with these longer work units. Good work from the researchers and the WCG team, thank you.
AMD Ryzen Threadripper PRO 3995WX 64-Cores/ AMD Radeon (TM) Pro W6600. OS Win11pro
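Running several tasks per GPU, as described above (0.25 CPU + 0.25 GPU per task, i.e. four tasks sharing one card), is typically done with a BOINC `app_config.xml` in the project's data folder. A minimal sketch — the app name `opng` is an assumption for the OpenPandemics GPU app; verify the exact name in your `client_state.xml` before using it:

```xml
<!-- app_config.xml — place in the World Community Grid project folder,
     then use Options > Read config files in the BOINC Manager.
     NOTE: the app name "opng" is assumed; check client_state.xml. -->
<app_config>
  <app>
    <name>opng</name>
    <gpu_versions>
      <gpu_usage>0.25</gpu_usage>  <!-- four tasks share one GPU -->
      <cpu_usage>0.25</cpu_usage>  <!-- CPU budgeted per task (a scheduling hint, not a hard limit) -->
    </gpu_versions>
  </app>
</app_config>
```

Note that `cpu_usage` only tells the scheduler how much CPU to reserve; the actual CPU consumed by each task can be far higher, as later posts in this thread point out.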
kittyman
Advanced Cruncher Joined: May 14, 2020 Post Count: 140 Status: Offline Project Badges:
Well, the kitties only have had 2 error tasks for the duration of their munching on WCG kibble. One was a bum task with zero runtime, and the second was caused when I was trying to increase the modest overclock on their GPU. As soon as I saw the error, I dialed it back and that was the end of that error cause.
----------------------------------------
No more kitty errors since. Meow!
M-spec
Cruncher The Netherlands Joined: Jul 29, 2007 Post Count: 4 Status: Offline Project Badges:
To me it shows fairly poor optimization of the application. The ideal situation is that the application uses as much of the GPU as possible to get the most work done. Needing 16 CPU cores to feed a single GPU is absurdly high. Most other projects use 1 or less to feed a GPU and keep high utilization the entire time. You shouldn't have to give up 16 cores (along with the power consumption that comes with that) that could otherwise be doing something more useful, like crunching CPU tasks for a project without a GPU app. Look at GPUGRID or Einstein; those are how you want your app to operate: able to feed the GPU to 95+% for the entire run with only a single CPU core to keep the GPU busy. Usually this means preloading more data into the GPU memory and making the GPU handle more functions. I know it's a "first cut" for this app, but it still has a long way to go for efficiency, in my opinion. We should all push for better utilization of resources for the sake of efficiency and not accept so much waste.

"Uplinger addressed this a while ago. He wanted to keep the GPU work units the same as the CPU work units initially to ensure consistent results, no doubt necessary for the science. He said he would tweak it up later."

This isn't correct. The GPU WUs are not the same as the CPU WUs. The GPU tasks have many, many more jobs prepackaged and are actually much larger than the CPU tasks, and GPU tasks cannot cross-validate with CPU tasks because of their differences. The GPU app optimization has nothing to do with this.

"The statement of keeping the GPU tasks as close to the CPU tasks is correct. This helps in multiple ways. It allows us to verify that things are working as they should without adding too many variables to the mix. These work units use the same method of starting and stopping each job (ligand) in the workunit. All that was modified in the way they were generated was that I said to assume it's allowed to run 20x longer than on CPU. Not much else changed beyond that.

Keeping the pipelines from the researchers to us, and then to you, similar allows us to decrease the number of variables that we introduce into the equation. Yes, there are differences in the GPU code that are not the same as the CPU code, but these were vetted and tested by the researchers before we took the application to grid-enable it. There are multiple options that we are in discussions with the researchers about. How long it'll take to get those implemented from the WCG end is unknown. I cannot promise when an updated version will be released.

We have heard members commenting on the GPU version using too much IO, and other complaints, such as the polar opposite of it causing them to have issues on their displays... Some members commenting on bandwidth usage, etc... The purpose of this stress test was to determine where some of the bottlenecks were in the system. We have heard the comments and suggestions about the application. We have made changes to our load balancer to help handle a lot more work units. We have identified that the small ligand files cause issues with the inodes of the filesystem filling up. All of these are stresses of the system. Some may be easily addressed; others take lots of time and effort. Releasing a new science application does not come as easily and quickly as you would hope; it is distributed to thousands of people and needs to be properly vetted and tested. All of that is to say: while supporting and running other applications, and trying to get some sleep in there.

This stress test has been very exciting for us and our team. We are in constant communication with the researchers and they are also very excited about the test so far. Thank you to everyone for your help in making this a successful test. Please try to keep comments positive and helpful towards everyone in the forums, and not combative. We try to make things run as best as they can, but we do not have unlimited resources.

Thanks,
-Uplinger"

It's a good stress test, but at the same time it also shows the credit reward is way off from a reasonable balance with the current OPNG application when running 24/7 on a high-end GPU and CPU. For instance, I currently run an RTX 3090 and a 3950X with 32 WU's simultaneously, with all 32 CPU threads @ 100% to back it up. This results in 150+ WU's/hour with the current batch size, and 125x(+) daily credits in comparison to CPU-only work; an insane daily record of 40,000,000(!) credits on a single computer. So many GPU tasks do stress the I/O, with around 80 MB/s writes on average to the SSD, and GPU VRAM usage switching between 6-13 GB. This may be 'fun' for a while, and I do hope the project will deliver promising results, but I doubt it will please all of the volunteers of much-needed CPU-only projects (which can also benefit much more from modern CPU instructions), myself included. Runtime for badges may be harder to achieve, but it now only takes a couple of days to be given the same amount of 'credit' as years of CPU runtime, and of course energy consumption. It's off the charts, even in comparison to today's high-end Epyc server systems. May not be healthy for WCG in the long run.
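As a rough sanity check on the figures quoted above (all inputs are the poster's own numbers, not official WCG data), the implied per-WU credit and the CPU-only baseline can be back-calculated:

```python
# Back-of-the-envelope check of the quoted throughput/credit figures.
wu_per_hour = 150            # claimed GPU work units per hour
credits_per_day = 40_000_000 # claimed daily credit on one machine
gpu_vs_cpu_ratio = 125       # claimed GPU-vs-CPU daily credit multiple

wu_per_day = wu_per_hour * 24                             # 3,600 WUs/day
credits_per_wu = credits_per_day / wu_per_day             # ~11,111 credits per GPU WU
implied_cpu_credits = credits_per_day / gpu_vs_cpu_ratio  # ~320,000/day CPU-only

print(wu_per_day, round(credits_per_wu), round(implied_cpu_credits))
```

So the three quoted numbers are at least mutually consistent: ~11,000 credits per GPU work unit against a CPU-only baseline in the low hundreds of thousands per day.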
goben_2003
Advanced Cruncher Joined: Jun 16, 2006 Post Count: 146 Status: Offline Project Badges:
Quoting M-spec: "It's a good stress test but at the same time it also shows the credit reward is way off from a reasonable balance with the current OPNG application when running 24/7 on a high-end GPU and CPU. [...] May not be healthy for WCG in the long run."

For the points: I personally do not care about points. However, the GPU work units should be worth more than a CPU unit. The earlier GPU units were 20x as much work as a CPU unit; IIRC the new ones are even larger. The points are supposed to line up with the amount of science done (at least roughly so, on average). Even my Intel GPUs do in roughly 1 hour what would take roughly 9 hours using all 8 CPU cores on the same machine. It is even more extreme with discrete cards, especially higher-end ones like you have. So the amount of points generated in a day has gone up an extreme amount because the amount of science done has gone up by such an extreme amount.

For the writes: I know some here disagree, but for me the writes were higher than I was comfortable with when running the Nvidia units. So I set it up to run off of a ramdisk, since I have plenty of RAM.

Cheers
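On Linux, one way to run the heavy-writing slot data from RAM is to mount a tmpfs over the BOINC slots directory. A sketch only — the path `/var/lib/boinc-client` and the `boinc` user are the Debian/Ubuntu packaging defaults and may differ on your system, and anything in a tmpfs is lost on reboot, so in-progress tasks will be abandoned unless you copy them back first:

```
# /etc/fstab — tmpfs over the BOINC slots directory (sketch, paths assumed).
# Stop the BOINC client before mounting; data here does not survive a reboot.
tmpfs  /var/lib/boinc-client/slots  tmpfs  rw,size=16G,uid=boinc,gid=boinc,mode=0755  0  0
```

The trade-off is straightforward: SSD wear drops to near zero for slot I/O, at the cost of RAM and of losing checkpointed work on power loss.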
hnapel
Advanced Cruncher Netherlands Joined: Nov 17, 2004 Post Count: 82 Status: Offline Project Badges:
I didn't edit my app_config.xml files to run more WU's on the same GPU at the same time; I'm just going with the flow of the default settings, and think the project should tweak the jobs on the server end to improve utilization if they want. With increased utilization also comes increased power demand and disk I/O (this should also be tweaked, IMHO), and I'm fine as it is. I just dialed down the setting for CPU percentage so the GPU jobs get sufficient CPU; I like to keep my system CPU usage just below 100%.

I have 4 PCs running the OPNG jobs on a total of 6 GPUs. My daily points are now close to 7M on average (coming from around 250K with only CPU jobs). During this stress test my WCG points rank dropped below 2000 in a dramatic jump, so I pride myself on now being among the top-tier contributors; I'm also close to 100M points for OpenPandemics alone, so I'm happy as is.

I think the post-mortem for this would be an interesting take on distributed computing, including the social aspects. I'm glad that even though this is a test we are doing real science, ramming the ZINC database through our collective rigs, and hopefully it will lead to some insights to fight the pandemic.
erich56
Senior Cruncher Austria Joined: Feb 24, 2007 Post Count: 300 Status: Offline Project Badges:
All my PCs say "no tasks available" -
is anyone else having the same experience?
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
My card has gone back to executing the 7-day work that was suspended when the 3-day work was released. There must be a pause in WU release.