knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
OpenPandemics GPU Stress Test

Background
In March 2021, World Community Grid released a GPU version of the Autodock research application. Demand for GPU work from volunteer machines was immediately strong; in fact, demand was considerably higher than the supply of GPU work units.

The World Community Grid tech team wanted to determine the upper limit of the computational power available to the program, and to find out whether the current infrastructure could support the load if enough GPU work were provided to meet the demand.

Additionally, the OpenPandemics - COVID-19 scientists and their collaborators at Scripps Research are exploring promising novel target sites on the spike protein of the SARS-CoV-2 virus that could be vulnerable to different ligands, and they were eager to investigate this target as quickly and thoroughly as possible. They provided World Community Grid with approximately 30,000 batches of work (equal to the amount of work done in about 10 months by CPUs), and we let these batches run until they were fully processed.

The stress test took 8 days to run, from April 26 through May 4, 2021.

The results outlined below represent World Community Grid's current technical capabilities. This information could help active and future projects make decisions about how they run work with us, keeping in mind that they have varying needs and resources.

Summary
The key findings of the stress test were as follows:
  • We had previously determined that in 2020, the volunteers contributing to World Community Grid delivered, from CPUs alone, computing power similar to a cluster of 12,000 computing nodes running at 100% capacity 24x7 for the entire year, where each node contains one Intel Core i7-9700K CPU @ 3.60GHz. We can now further state that the volunteers are able to provide an additional 8x that computing power from GPUs.
  • The current World Community Grid infrastructure is able to meet the load generated by this computing power with this particular mix of research projects. However, the infrastructure was pushed to its limit, and any further growth or possibly a different mix of research projects would require increased infrastructure.
  • The OpenPandemics - GPU workunits consisted of many small files, which created high IO load on both the volunteers' computers and the World Community Grid infrastructure. Combining these small files into a few larger files would likely reduce the IO load on both the volunteers' computers and the World Community Grid infrastructure. This change would likely allow the infrastructure to handle a greater load from the volunteers and improve the experience for the volunteers (see the sketch after this list).
  • At the back end of the pipeline, backing up the data and sending results to the Scripps server does not appear to be a bottleneck. However, running OpenPandemics at a higher speed would cause the research team to spend the majority of their time and energy preparing input data sets and archiving returned data rather than analyzing the results and moving the interesting results to the next step in the pipeline. As a result, the project will remain at its current speed for the foreseeable future.
  • Now that we are able to quantify the capabilities of World Community Grid, scientists can use this information as a factor in their decision-making process in addition to their labs' resources and their own data analysis needs.
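
As an illustration of the small-file point above, here is a minimal sketch in Python (with a hypothetical directory layout and file names, not the actual OpenPandemics packaging code) of bundling many per-ligand input files into a single compressed archive per workunit, so the filesystem handles one large file instead of thousands of small ones:

# Hypothetical sketch: pack many small per-ligand input files into one
# compressed archive per workunit. The layout and names are illustrative.
import tarfile
from pathlib import Path

def bundle_workunit(input_dir: str, archive_path: str) -> int:
    """Pack every file under input_dir into a single gzipped tar archive."""
    count = 0
    with tarfile.open(archive_path, "w:gz") as tar:
        for f in sorted(Path(input_dir).rglob("*")):
            if f.is_file():
                tar.add(f, arcname=str(f.relative_to(input_dir)))
                count += 1
    return count

# Example (hypothetical paths):
#   n = bundle_workunit("batch_0001_inputs/", "batch_0001.tar.gz")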

Bottlenecks identified
During the test, there were three major issues in which the system became unstable until we could identify and resolve the bottlenecks.

Prior to Launch
Before the launch of the stress test, while we were creating the individual workunits to send to volunteers, we exhausted the available inodes on the filesystem. This prevented new files or directories from being created on the filesystem, which caused an outage for our back-end processes and prevented results from being uploaded from volunteer machines. We resolved this issue by increasing the maximum number of inodes allowed, and we then added a monitor to warn us if we start approaching the new limit.
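
As a rough illustration of what such a monitor looks like, here is a minimal Python sketch using os.statvfs; the mount point and warning threshold below are placeholders, not the actual values in our setup:

# Hypothetical monitoring sketch: warn when inode usage on a filesystem
# approaches its limit. Mount point and threshold are placeholders.
import os

WARN_THRESHOLD = 0.80  # warn once 80% of the filesystem's inodes are in use

def inode_usage(path: str) -> float:
    """Return the fraction of inodes in use on the filesystem containing path."""
    st = os.statvfs(path)
    if st.f_files == 0:  # some filesystems do not report inode counts
        return 0.0
    return 1.0 - (st.f_ffree / st.f_files)

if __name__ == "__main__":
    usage = inode_usage("/data/workunits")  # placeholder mount point
    if usage >= WARN_THRESHOLD:
        print(f"WARNING: inode usage at {usage:.0%}")
    else:
        print(f"inode usage OK: {usage:.0%}")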

Launch
Shortly after releasing the large supply of workunits, we experienced an issue where the connections from our load balancer to the backend servers reached their maximum configured limits and blocked new connections. This appears to have been caused by clients that opened connections and then stalled out or downloaded work very slowly. We implemented logic in the load balancer to automatically close those connections. Once this logic was deployed, the connections from the front-end became stable and work was able to flow freely.
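
The real fix lives in the HAProxy configuration, but the general idea of reaping stalled connections can be sketched in Python as follows (a hypothetical echo-style server, not our load balancer logic): any client that sends no data within an idle timeout gets its connection closed, freeing the slot for other clients.

# Hypothetical sketch of the general idea (not the actual HAProxy logic):
# a connection that sends no data within IDLE_TIMEOUT seconds is closed.
import asyncio

IDLE_TIMEOUT = 60  # placeholder: seconds a client may stall before being dropped

async def handle_client(reader: asyncio.StreamReader,
                        writer: asyncio.StreamWriter) -> None:
    try:
        while True:
            # wait_for raises TimeoutError if the client stalls too long
            data = await asyncio.wait_for(reader.read(4096), IDLE_TIMEOUT)
            if not data:        # client closed the connection normally
                break
            writer.write(data)  # placeholder for real request handling
            await writer.drain()
    except asyncio.TimeoutError:
        pass                    # stalled connection: drop it
    finally:
        writer.close()
        await writer.wait_closed()

async def main() -> None:
    server = await asyncio.start_server(handle_client, "127.0.0.1", 8080)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())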

Packaging ramps up
The next obstacle occurred when batches started to complete and packaging became a heavy load on the system. Several changes were made to address this:

  • The process that marks batches as completed in order to start the packaging process originally ran only every 8 hours. We changed it so that batches are marked complete and packaged every 30 minutes (a minimal sketch of this scheduling change follows after this list).
  • Our clustered filesystem had sub-optimal configuration options. We researched how the configuration could be improved in order to increase performance and then made those changes.
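
As a minimal sketch of the first change (the two functions below are placeholders for the real back-end jobs, not our actual code):

# Hypothetical sketch of the scheduling change: run the batch-completion and
# packaging jobs every 30 minutes instead of every 8 hours.
import time

INTERVAL_SECONDS = 30 * 60  # was effectively 8 * 60 * 60 before the change

def mark_batches_complete() -> None:
    ...  # placeholder: flag batches whose results have all been returned

def package_completed_batches() -> None:
    ...  # placeholder: package the batches flagged as complete

if __name__ == "__main__":
    while True:
        mark_batches_complete()
        package_completed_batches()
        time.sleep(INTERVAL_SECONDS)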

Following these changes, the system was stable even with packaging and building occurring at a high level. The time to package a batch dropped from 9 minutes to 4.5 minutes, and the time to build a batch dropped by a similar amount. Uploads and downloads performed reliably as well. However, we were only able to run in this modified configuration during the final 12 hours of the stress test. It would have been useful to run with these settings for longer to confirm that they resulted in a stable system over an extended period of time.

Potential future changes to further enhance performance

Clustered filesystem tuning
The disk drives that back our clustered filesystem only reached about 50% of their rated throughput. It is possible that the configuration of the filesystem could be tuned to increase performance further. This is not certain, but if the opportunity exists to engage an expert with deep experience optimizing high-performance clustered filesystems using IBM Spectrum Scale, this could be a worthwhile avenue to explore.


Website availability
We identified an issue where high IO load on the clustered filesystem can degrade the user experience and performance of the website. These two systems are logically isolated from each other, but share physical infrastructure due to system design. This degradation of the website's performance should not have happened, but it clearly did. We have not yet determined for certain why this issue exists, but at this time we believe that it stems from the way our load balancer, HAProxy, is configured.

We run HAProxy as a single instance with one front-end for all traffic, passing the data back to multiple back-ends for each of the different types of traffic. We could instead run multiple HAProxy instances on the same system, provided that there is a separate IP address for each instance to bind to. If we ran one instance for website traffic and a second instance for all BOINC traffic, we expect that website traffic would perform reliably even when the BOINC system is under heavy load.
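
To illustrate the isolation argument, here is a minimal Python sketch of the concept (not an HAProxy configuration; the addresses, ports, and connection budgets are placeholders): two independent listeners, each bound to its own address with its own connection budget, so heavy load on the BOINC listener cannot exhaust the website listener's slots.

# Hypothetical sketch of the proposed split: two independent listeners,
# each with its own address and connection budget. Values are placeholders.
import asyncio

async def serve(name: str, host: str, port: int, max_conns: int) -> None:
    budget = asyncio.Semaphore(max_conns)  # per-listener connection budget

    async def handler(reader: asyncio.StreamReader,
                      writer: asyncio.StreamWriter) -> None:
        async with budget:  # only this listener's budget is consumed
            await reader.read(1024)  # placeholder for proxying a request
            writer.write(b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n")
            await writer.drain()
            writer.close()
            await writer.wait_closed()

    server = await asyncio.start_server(handler, host, port)
    print(f"{name} listening on {host}:{port}")
    async with server:
        await server.serve_forever()

async def main() -> None:
    await asyncio.gather(
        serve("website", "127.0.0.1", 8081, max_conns=100),
        serve("boinc", "127.0.0.2", 8082, max_conns=1000),
    )

if __name__ == "__main__":
    asyncio.run(main())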


Thank you to everyone who participated in the stress test.
[Jun 8, 2021 7:09:35 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7745
Status: Offline
Re: OpenPandemics GPU Stress Test

Thank you for the update. It is interesting to see what the stresses to the system were and how they were remedied. I see the foundation for further improvements to capacity is being blueprinted. It appears one of the bottlenecks is the researchers' ability not only to generate additional work units, but also to handle the influx of results (point 4 in the summary).
Thank you to all of the WCG staff and the researchers for attempting to max out the system. I think everybody concerned, researchers, staff and volunteers, learned quite a bit about the capabilities of the system and their machines. I think it also has shown the willingness of the volunteers to provide the computing power needed for basic research. I hope IBM is willing to continue to put the needed resources into this endeavor.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Jun 8, 2021 7:32:17 PM]
Rymorea
Cruncher
Turkey
Joined: Feb 12, 2014
Post Count: 13
Status: Offline
Re: OpenPandemics GPU Stress Test

Thank you for these detailed explanations. Now we know what's going on in the background operations.
[Jun 9, 2021 9:30:20 AM]
RockLr
Cruncher
China
Joined: Mar 14, 2020
Post Count: 26
Status: Offline
Re: OpenPandemics GPU Stress Test

Thanks for the details!
[Jun 9, 2021 3:37:09 PM]
adict2jane
Cruncher
Joined: Aug 18, 2006
Post Count: 31
Status: Recently Active
Re: OpenPandemics GPU Stress Test

Thank you for the update, knreed!
Hope everything with Uplinger is ok.
[Jun 9, 2021 4:19:12 PM]