knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Re: Slow validation

Just an update. We are currently running well (good supply of workunits and the transitioner is caught up). The validators are also caught up.

We are trying to track down a deadlock that causes the transitioners to exit when they try to find new workunits to transition. This is an issue because once the transitioners stop running, we eventually run out of work and validation falls way behind.
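For anyone curious how this kind of thing shows up on the database side (assuming the tables are on InnoDB), MariaDB reports the most recent deadlock in its InnoDB status output and can optionally log every deadlock as it happens. This is a general illustration of where one would look, not a description of our exact diagnostic steps:

-- The "LATEST DETECTED DEADLOCK" section of this output shows the two
-- transactions involved and which one was chosen as the victim:
SHOW ENGINE INNODB STATUS\G

-- Optionally, write every detected deadlock to the error log as it happens:
SET GLOBAL innodb_print_all_deadlocks = ON;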

As we look into what is causing this deadlock, we expect the transitioners to get knocked offline now and again. As a result, we may experience temporary delays in validation and shortages of work. We are working to get this resolved and we appreciate your patience while we work through this.
[Jan 16, 2018 2:46:06 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Slow validation

I love to speculate about this sort of thing, so excuse me for chucking in my 2p's worth.

Assuming this problem is new and likely caused by the Meltdown/Spectre fixes, one of the things I vaguely remember seeing is that the resolution of (some?) timers was reduced. Could this be the cause of (some of) your issues?

Either way, I think you all deserve a big hand for working so hard to get the systems up and working so well again, especially if your servers are now running more slowly because of those fixes.

And thank you for keeping us informed of your actions, progress, and likely impacts.
[Jan 16, 2018 3:50:14 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Re: Slow validation

We are going to take a short outage of the feeder and BOINC daemons. During this time boinc-client will not be able to report work completed or get new work.

We are attempting to improve the performance of one index that is at the core of many of our current challenges.

We appreciate your patience while we continue to work through this.
[Jan 16, 2018 5:32:58 PM]
mdxi
Advanced Cruncher
Joined: Dec 6, 2017
Post Count: 109
Re: Website Database Outage

Wasn't one of the "features" of moving to a cloud infrastructure, the ability to respond more quickly to resource needs by adding CPUs, storage paths, memory etc as needed since they are supposedly virtual?


This is mostly true for infrastructure pieces which are more-or-less stateless (i.e. not holding the data which must be stored long-term). Things like HTTP servers, load balancing front-ends, and application servers which have been designed for horizontal scale-out.

Databases are a different beast. They're about as stateful as you can get. Even if you do spin up a new DB instance, all the load is going to be on the existing DB machines (in one way or another) until the new one has sync'd the data. And the sync process might make things worse before it makes things better.

Disclaimer: I know nothing about WCG's backend. I'm speaking in generalities, from my own experience with distributed and/or virtualized systems.
[Jan 16, 2018 6:24:36 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Re: Slow validation

This outage is done and we are seeing the performance we previously expected. We are going to watch things closely over the next few days and make sure the issue does not occur again. Thanks to everyone for their patience during this past week and a half.
[Jan 16, 2018 6:25:26 PM]
KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1684
Re: Slow validation

Thank you Kevin for your support and for the feedback.
Hopefully, the forthcoming days will be a little bit more quiet for you and your colleagues.
Cheers,
Yves
[Jan 16, 2018 11:57:07 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Re: Slow validation

Things have continued to run well over the past 12 hours or so. Now that we think we have gotten to the root cause, I can give some more detail about what was going on. This is highly technical, for those who are interested.

When we did the update to apply the Spectre and Meltdown patches, we started to see various performance-related issues on the systems. Since the issues we saw were so highly correlated in time with the application of those patches, we spent a lot of time investigating that. However, during our investigation some information started to appear that didn't make sense with that being the explanation. In particular, we discovered that these queries were not performing as we expected:
MariaDB [BOINC]> select id from workunit where transition_time < 1;
Empty set (26.46 sec)

MariaDB [BOINC]> select id from workunit ignore index(wu_timeout) where transition_time < 1;
Empty set (7.53 sec)

The table workunit has an index called wu_timeout that consists of transition_time. That means that a query with transition_time in the where clause should perform significantly faster using the index than not using it. However, we were seeing the exact opposite, which was very odd.
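As an aside, the simplest way to see which index the optimizer actually chooses for a query like this is EXPLAIN; the key column of its output names the index used (or NULL if none):

-- If the index is healthy and being used, the key column should show wu_timeout:
EXPLAIN SELECT id FROM workunit WHERE transition_time < 1;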

After reading and looking at various performance metrics, we reached the conclusion that something during the Dec 6th updates had somehow damaged these indexes, and we eventually decided that they needed to be dropped and recreated. That is what we did during the short outage yesterday. Immediately after that we saw the following performance:
MariaDB [BOINC]> select id from workunit where transition_time < 1;
Empty set (0.00 sec)

MariaDB [BOINC]> select id from workunit ignore index(wu_timeout) where transition_time < 1;
Empty set (5.26 sec)

The query time using the index went from 26.5 seconds down to a tiny fraction of a second. The overall load and contention on the database dropped dramatically and the system started to behave like it had been before.

We looked a little further and saw a few other indexes that were exhibiting similar symptoms, so we rebuilt those as well. As a result, the database is back to performing beautifully and the team's stress level has gone way down.
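For those interested in the mechanics, rebuilding a damaged index like this in MariaDB is essentially a drop-and-recreate, roughly along these lines based on the index definition described above (not necessarily the exact statements we ran):

-- Sketch of a drop-and-recreate of the wu_timeout index (illustrative):
ALTER TABLE workunit DROP INDEX wu_timeout;
ALTER TABLE workunit ADD INDEX wu_timeout (transition_time);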

Based on this, we still feel that Spectre and Meltdown have caused a performance hit on our database servers of somewhere between 20% and 30%. However, the database servers were sized to handle at least a doubling of load, so we are still very comfortably within our performance capabilities.
[Jan 17, 2018 3:13:24 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Re: Website Database Outage

Wasn't one of the "features" of moving to a cloud infrastructure, the ability to respond more quickly to resource needs by adding CPUs, storage paths, memory etc as needed since they are supposedly virtual?


This is mostly true for infrastructure pieces which are more-or-less stateless (i.e. not holding the data which must be stored long-term). Things like HTTP servers, load balancing front-ends, and application servers which have been designed for horizontal scale-out.

Databases are a different beast. They're about as stateful as you can get. Even if you do spin up a new DB instance, all the load is going to be on the existing DB machines (in one way or another) until the new one has sync'd the data. And the sync process might make things worse before it makes things better.

Disclaimer: I know nothing about WCG's backend. I'm speaking in generalities, from my own experience with distributed and/or virtualized systems.


mdxi - this is a very good way of expressing this situation. With our move to the IBM Cloud (even though we are actually running on bare metal) we can add or remove servers and resources in a tiny fraction of the time that we could at a traditional data center.

As part of planning our move to IBM Cloud, we had to analyze the databases and decide whether to move to a replicated or shared-nothing database setup or remain in a single-instance configuration. Replicated databases are more complicated to configure and maintain, and there would have been additional work during the migration to prepare and set up the replicated scenario, which would have taken away from other changes we could make. As a result, we decided that investing our time in improving our automation/devops and monitoring, rather than adding the complexity of a replicated backend, was a better use of our time. In the future, if we get to the point where we need a replicated setup, we can do so, but we are not at that point now (or in the near future).
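To give a sense of the extra moving parts, a basic MariaDB primary/replica setup adds replication configuration and monitoring on top of a single-instance install. A generic sketch (placeholders throughout; this is not a description of our environment):

-- On a new replica, point it at the primary and start replication
-- (host, user, credentials and log coordinates are placeholders):
CHANGE MASTER TO
    MASTER_HOST='primary.example.com',
    MASTER_USER='repl',
    MASTER_PASSWORD='replica-password',
    MASTER_LOG_FILE='mariadb-bin.000001',
    MASTER_LOG_POS=4;
START SLAVE;
-- Replication lag and errors then need to be monitored on an ongoing basis:
SHOW SLAVE STATUS\G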

In this particular situation, the root cause was an issue with the database indexes. Adding more CPU or RAM to resolve the problem would not have materially improved performance, since the performance hit was due to row-level lock contention, which would not change with more memory or CPUs.
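For context, row-level lock contention shows up inside the database rather than in CPU or memory graphs; in MariaDB with InnoDB it can be inspected with something like the following (illustrative, not our actual monitoring queries):

-- Show which transactions are waiting on row locks and which ones are blocking them:
SELECT r.trx_id AS waiting_trx, r.trx_query AS waiting_query,
       b.trx_id AS blocking_trx, b.trx_query AS blocking_query
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id;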
[Jan 17, 2018 3:26:13 PM]
gb009761
Master Cruncher
Scotland
Joined: Apr 6, 2005
Post Count: 3010
Re: Website Database Outage

Thank you, Kevin, for both of those very interesting posts - and, of course, for all of your and your colleagues' efforts in resolving the situation.

I'm curious to know: do you, or would you ever consider, rebuilding the database indexes on a regular schedule (say, once a week or month)? When I was working for IBM several years ago, we had regular housekeeping jobs that did just that.
[Jan 17, 2018 3:55:27 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Re: Website Database Outage

I'm curious to know: do you, or would you ever consider, rebuilding the database indexes on a regular schedule (say, once a week or month)? When I was working for IBM several years ago, we had regular housekeeping jobs that did just that.


I think you are referring to a table or index reorganization.

The website database runs on DB2, and DB2 has added a lot of features to automatically handle these things for us. So we basically tell it "don't use more than this much memory" and it adjusts various memory parameters and settings to optimize its performance. It also keeps track of the tables and indexes and decides when to collect stats for the indexes or reorganize the tables and indexes. All of this happens automatically, without our intervention (for the website/DB2 database).
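For those who like specifics, the knobs involved are DB2's self-tuning memory manager and its automatic maintenance settings. The relevant parameters look roughly like this (illustrative values; the database name is a placeholder and this is not our actual configuration):

-- Sketch of the DB2 settings in this area (database name is a placeholder):
db2 update db cfg for WCGWEB using SELF_TUNING_MEM ON DATABASE_MEMORY AUTOMATIC
db2 update db cfg for WCGWEB using AUTO_MAINT ON AUTO_TBL_MAINT ON AUTO_RUNSTATS ON AUTO_REORG ON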

The BOINC database is a little bit different. Most of the data in that database is in the workunit and result tables. These tables undergo very heavy modification (50% change per week). As a result, they are in a persistent state of needing reorganization - to the point that it isn't really worth trying to constantly reorganize (or, in MariaDB terms, defragment) the tables, since they would very quickly become fragmented again.
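(For reference, "defragmenting" a table in MariaDB terms is essentially OPTIMIZE TABLE, which rebuilds the table and its indexes; we don't schedule this routinely, for the reasons above:)

-- Manual defragmentation in MariaDB; rebuilds the table and its indexes.
-- Not something we run on a schedule, since these tables fragment again quickly:
OPTIMIZE TABLE workunit;
OPTIMIZE TABLE result;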

Prior to whatever event damaged the index, the tables and indexes had remained at a constant high-performance state without having to be rebuilt periodically. We are now back at that level of performance, so other than monitoring the load on the server more closely to see if things start to behave poorly again, we do not think that additional steps are necessary.
[Jan 18, 2018 2:16:54 PM]