World Community Grid Forums
knreed
Former World Community Grid Tech | Joined: Nov 8, 2004 | Post Count: 4504
Background
In March 2021, World Community Grid released a GPU version of the Autodock research application. There was immediate, strong demand for work from volunteer machines; in fact, demand considerably exceeded the supply of GPU work units. The World Community Grid tech team wanted to determine the upper limit of computational power for the program, and to find out whether the current infrastructure could support the load if enough GPU work were provided to meet the demand.

Additionally, the OpenPandemics - COVID-19 scientists and their collaborators at Scripps Research are exploring promising novel target sites on the spike protein of the SARS-CoV-2 virus that could be vulnerable to different ligands, and they were eager to investigate this target as quickly and thoroughly as possible. They provided World Community Grid with approximately 30,000 batches of work (equal to the amount of work done by CPUs in about 10 months), and we let these batches run until they were fully processed. The stress test ran for 8 days, from April 26 through May 4, 2021.

The results outlined below represent World Community Grid's current technical capabilities. This information could help active and future projects make decisions about how they run work with us, keeping in mind that they have varying needs and resources.

Summary

The key findings of the stress test revealed the following points:
Bottlenecks identified

During the test, there were three major issues where the system became unstable until we could identify the bottlenecks and resolve them.

Prior to launch

Before the launch of the stress test, while we were creating the individual workunits to send to volunteers, we exhausted the available inodes on the filesystem. This prevented new files or directories from being created, which caused an outage for our back-end processes and prevented results from being uploaded from volunteer machines. We resolved this issue by increasing the maximum number of inodes allowed, and we added a monitor to warn us if we start approaching the new limit (a rough sketch of that kind of monitor is included at the end of this post).

Launch

Shortly after releasing the large supply of workunits, the connections from our load balancer to the back-end servers reached their maximum configured limits and blocked new connections. This appears to have been caused by clients that opened connections and then stalled or downloaded work very slowly. We implemented logic in the load balancer to automatically close those connections. Once this logic was deployed, the connections from the front-end became stable and work was able to flow freely.

Packaging ramps up

The next obstacle occurred when batches started to complete and packaging became a heavy load on the system. Several changes were made to address this:

- The process of marking batches as complete, which starts the packaging process, originally ran only every 8 hours. We changed this so that batches are marked complete and packaged every 30 minutes.
- Our clustered filesystem had sub-optimal configuration options. We researched how the configuration could be improved to increase performance and then made those changes.

Following these changes, the system was stable even with packaging and building occurring at a high level. The time to package a batch dropped from 9 minutes to 4.5 minutes, and the time to build a batch dropped by a similar amount. Uploads and downloads performed reliably as well. However, we were only able to run in this modified configuration during the final 12 hours of the stress test. It would have been useful to run with these settings for longer to confirm that they resulted in a stable system over an extended period of time.

Potential future changes to further enhance performance

Clustered filesystem tuning

The disk drives that back our clustered filesystem only reached about 50% of their rated throughput. It is possible that the configuration of the filesystem could be further optimized to increase performance. This is not certain, but if the opportunity exists to engage an expert with deep experience optimizing high-performance clustered filesystems using IBM Spectrum Scale, this could be a worthwhile avenue to explore.

Website availability

We identified an issue where high I/O load on the clustered filesystem causes problems with the user experience and performance of the website. These two systems are logically isolated from each other, but share physical infrastructure due to system design. This degradation of website performance should not have happened, but it clearly did. We want to determine for certain why this issue exists, but at this time we believe it stems from the way our load balancer, HAProxy, is configured.
We have HAProxy running as a single instance, with one front-end for all traffic passing the data back to multiple back-ends for each of the different types of traffic. We could instead run HAProxy with multiple instances on the same system, provided there is a separate IP address for each instance to bind to. If we were to run one instance for website traffic and a second instance for all BOINC traffic, we expect that website traffic would perform reliably even when the BOINC system is under heavy load (an illustrative configuration sketch is included at the end of this post).

Thank you to everyone who participated in the stress test.
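For illustration only, here is a minimal sketch of the kind of inode-usage monitor mentioned in the "Prior to launch" section. It is not our actual tooling; the threshold and the way alerts are raised in a real deployment would differ.

```python
#!/usr/bin/env python3
"""Illustrative inode-usage monitor -- a sketch, not the actual WCG tooling."""
import os
import sys

# Warn when a filesystem has used this fraction of its inodes.
# The threshold is arbitrary for the example.
WARN_FRACTION = 0.80


def inode_usage(path):
    """Return the fraction of inodes in use on the filesystem containing `path`."""
    st = os.statvfs(path)
    if st.f_files == 0:  # some filesystems report no inode limit
        return 0.0
    return (st.f_files - st.f_ffree) / st.f_files


def main():
    # Paths to check come from the command line; default to "/" for the example.
    paths = sys.argv[1:] or ["/"]
    exit_code = 0
    for path in paths:
        usage = inode_usage(path)
        status = "WARNING" if usage >= WARN_FRACTION else "OK"
        print(f"{status}: {path} inode usage {usage:.1%}")
        if usage >= WARN_FRACTION:
            exit_code = 1
    return exit_code  # non-zero exit lets a scheduler or monitoring tool raise an alert


if __name__ == "__main__":
    sys.exit(main())
```

A script like this could be run periodically (for example from cron) and wired into whatever alerting the operations team already uses.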
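And here is a rough sketch of the split HAProxy setup described above: two separate instances, each with its own configuration file and its own bind address, one for website traffic and one for BOINC traffic. The directive names are standard HAProxy, but all addresses, ports, file names, and server names are invented for the example and are not our actual configuration.

```
# Illustrative sketch only -- not World Community Grid's actual configuration.
# Two separate HAProxy instances, each with its own config file and its own
# bind address, so website traffic and BOINC traffic never share a front-end.

# --- haproxy-website.cfg (instance 1: website traffic) ---
defaults
    mode http
    timeout connect      5s
    timeout client       30s
    timeout server       30s
    timeout http-request 10s   # drop clients that stall while sending a request

frontend website
    bind 192.0.2.10:80
    default_backend website_servers

backend website_servers
    server web1 10.0.0.11:8080 check

# --- haproxy-boinc.cfg (instance 2: BOINC scheduler/upload/download traffic) ---
defaults
    mode http
    timeout connect      5s
    timeout client       60s
    timeout server       60s
    timeout http-request 15s   # also guards against stalled or very slow clients

frontend boinc
    bind 192.0.2.11:80
    default_backend boinc_servers

backend boinc_servers
    balance roundrobin
    server sched1  10.0.0.21:8080 check
    server upload1 10.0.0.22:8080 check
```

Each instance would then run as its own process pointing at its own configuration file, so heavy BOINC load cannot exhaust the connection limits of the website front-end.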
Sgt.Joe
Ace Cruncher | USA | Joined: Jul 4, 2006 | Post Count: 7745
Thank you for the update. It is interesting to see what the stresses to the system were and how they were remedied. I see the foundation for further improvements to capacity is being blueprinted. It appears one of the bottlenecks is the researchers' ability not only to generate additional work units, but also to handle the influx of results (point 4 in the summary).
Thank you to all of the WCG staff and the researchers for attempting to max out the system. I think everybody concerned (researchers, staff, and volunteers) learned quite a bit about the capabilities of the system and their machines. I think it has also shown the willingness of the volunteers to provide the computing power needed for basic research. I hope IBM is willing to continue to put the needed resources into this endeavor. Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers* |
Rymorea
Cruncher | Turkey | Joined: Feb 12, 2014 | Post Count: 13
Thank you for these detailed explanations. Now we know what's going on in the background operations.
RockLr
Cruncher | China | Joined: Mar 14, 2020 | Post Count: 26
Thanks for the details!
adict2jane
Cruncher | Joined: Aug 18, 2006 | Post Count: 31
Thank you for the update, knreed!
Hope everything with Uplinger is ok.