knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Re: Degraded database performance

The good news is that over the past few days we have seen the backend server daemons catch up, and work has been flowing steadily to users.

The bad news is that a number of our backend processes that read large numbers of records for reports or for large batch updates have been performing worse and worse. We have identified the cause of this continuing issue.

I've included some articles below that help explain the issue.

Briefly:

MySQL maintains an 'undo log' that allows multiple concurrent transactions to proceed with different isolation levels. Records are added to this undo log as part of ongoing transactions. However, cleaning up the undo log (an operation known as 'purge') is performed on a delayed basis, when the server has time to do it. If the server is under heavy load, this delay can be very long and the 'history list length' can become very large.
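
For anyone who wants to watch this value on their own server, the history list length shows up in the standard InnoDB status output. This is generic MySQL, nothing specific to our setup:

    -- run from the mysql client; the figure is in the TRANSACTIONS section
    SHOW ENGINE INNODB STATUS\G
    -- look for a line of the form "History list length NNNNN" - that number is the purge backlog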

By default, the purge process is performed as part of MySQL's 'master thread'. This means that purge competes with the server's other activities for time.

While the undo log is small and can remain in memory, the purge process can operate quickly. However, if purge starts to fall behind, the log can grow large and eventually has to spill to disk. Once this happens, purging takes considerably longer and the process is likely to fall further behind.

Due to the nature of the undo log, its size directly impacts the performance of certain types of queries: as the log grows, those queries get slower and slower.

At this time, our undo log contains over 51 million entries; it is significantly behind and falling further behind daily. The database averages about 3,500 transactions per second, and the undo log only stores entries for transactions that modify the database, which account for roughly half of our transactions. This means that the undo log is over 8 hours behind and continuing to lose ground.
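
Spelling out that estimate: 51,000,000 entries / (3,500 transactions per second x 0.5 that modify data) = 51,000,000 / 1,750, which is roughly 29,000 seconds, or a little over 8 hours of backlog.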

Yesterday and this morning we made a number of changes to the database in order to improve the performance of the purge process (including using the new MySQL 5.5 option to run purge in its own thread). Unfortunately, with the undo log at its current size and no longer residing entirely in memory, these changes are not enough to let purge catch up during normal operations.
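
For reference, the knobs involved look roughly like this in my.cnf on MySQL 5.5. The values below are illustrative, not the exact settings we are running:

    # my.cnf (MySQL 5.5) - illustrative values only
    [mysqld]
    innodb_purge_threads    = 1    # 1 = run purge in a dedicated thread (new in 5.5); 0 = purge runs inside the master thread
    innodb_purge_batch_size = 300  # how much undo the purge thread processes per batch (the 5.5 default is 20)
    innodb_max_purge_lag    = 0    # if set above 0, throttles new changes when the history list exceeds this many entries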

As a result, we are going to have to take unusual action in order to clear out the undo log. Specifically, we are going to have to stop all access to the database, take a complete backup, delete the existing database, and then restore the database from backup. This is the process that we used when we migrated from MySQL 5.1 to MySQL 5.5. Unfortunately, we estimate that with the current database size, this outage could take up to 24 hours. I will be posting details about that outage shortly.
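
For anyone curious about what that looks like in practice, the general shape of a dump-and-reload is sketched below. This is only an illustration - it is not the exact tooling or options we will be using, and the file name is a placeholder:

    # illustrative dump-and-reload, not our exact procedure
    mysqldump --all-databases --single-transaction --routines --triggers > full_backup.sql
    # stop the backend daemons, remove the existing InnoDB data files, re-initialize the server, then:
    mysql < full_backup.sql

The reason a simple restart does not help is that the undo log lives inside the shared InnoDB system tablespace, which never shrinks in place; rebuilding the database from a backup is what actually reclaims that space and resets the history list.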

This will restore the database to its normal behavior. We have high confidence that the changes we have put in place this week and last week will allow the purge process to keep up with the database transactions. However, as we plan to continually recruit additional volunteers and keep growing, we are examining more substantive changes to accommodate that growth. The options include repacking workunits into different sizes in order to reduce the total number of results per day; migrating to MySQL 5.6 when it is released later this year; migrating to Percona Server; and adding a replica database dedicated to read-only work (reports, backups, the result status page on the website, etc.). Any one of these options would ensure that we do not face this issue again in the future, but we need to weigh them against our long-term growth plans before making the proper changes.
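
As a very rough sketch of the read-replica idea (standard MySQL 5.5 replication; the host name, user and log coordinates below are placeholders, not our configuration):

    -- on the replica, illustrative placeholders only
    CHANGE MASTER TO
      MASTER_HOST='db-primary.example.org',
      MASTER_USER='repl',
      MASTER_PASSWORD='********',
      MASTER_LOG_FILE='mysql-bin.000001',
      MASTER_LOG_POS=4;
    START SLAVE;
    SET GLOBAL read_only = 1;  -- reports, backups and the result status page would read from here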

We appreciate the patience you have shown while we investigate and we look forward to returning to normal operations soon.

http://www.pythian.com/news/32571/some-fun-around-mysql-history-list/
https://mysqlquicksand.wordpress.com/2012/01/...5-upgrade-blues-part-one/
http://www.mysqlperformanceblog.com/2010/06/1...y-main-innodb-tablespace/
http://dev.mysql.com/doc/refman/5.5/en/glossary.html#glos_purge
[Jan 18, 2013 2:43:13 AM]
knreed
Re: Degraded database performance

Ok - we have now handled the very significant surge in scheduler requests following the outage. The system handled the load, but it was definitely under pressure.

Now that we have been up and running for a few hours, the system is performing very well. The end-of-day stats and other processes are running now, and they are running quickly. Additionally, some of our backend processes, such as the reports we use for monitoring batch progression, are working properly and quickly again.

We are going to let the backend server daemons continue to run while the database backup runs in a few hours. If all goes as planned, there will not be significant interference.

The next item on our agenda is the installation of the additional memory, which will occur about 36 hours from now.
[Jan 23, 2013 1:31:30 AM]
knreed
Re: Degraded database performance

The database backup completed in 2 hours even with the backend server daemons running. This is good news as it means we don't have to stop various pieces of the system during the database backups.

The history list length has remained short except during the database backup, when it grew to over 3 million entries. However, once the backup completed, the list shrank back to under one thousand within 45 minutes. The changes to improve purge performance appear to be helping the system keep up with the workload.

We are looking forward to the RAM installation tomorrow, as that will further help the system handle the load.
[Jan 23, 2013 6:59:40 PM]
knreed
Re: Degraded database performance

The database server continues to run nicely, and we have been adding back various bits of functionality that we disabled in order to squeeze out every last bit of performance. Over the last 24 hours we have re-enabled binary logging, and we will shortly resume taking backups with binary log coordinates. If both of these work without impacting the server, then we will be close to being back to our desired normal operating conditions.
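
For context, "backups with log coordinates" simply means the dump records the binary log position it corresponds to, so it can serve as a point-in-time and replication baseline. Roughly (illustrative options, not our exact backup commands):

    # my.cnf - binary logging re-enabled (illustrative)
    [mysqld]
    log_bin = mysql-bin

    # a dump that records the binlog coordinates it was taken at
    mysqldump --all-databases --single-transaction --master-data=2 > backup_with_coords.sql
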
[Feb 7, 2013 2:23:45 PM]