World Community Grid Forums
Thread Status: Active | Total posts in this thread: 31
knreed
Former World Community Grid Tech | Joined: Nov 8, 2004 | Post Count: 4504 | Status: Offline
Just an update. We are currently running well (good supply of workunits and the transitioner is caught up). The validators are also caught up.
We are looking to track down an issue that is causing a deadlock, which makes the transitioners exit when they try to find new workunits to transition. This is a problem because once the transitioners stop running, we eventually run out of work and validation falls way behind. While we look into what is causing this deadlock, we expect the transitioners to get knocked offline now and again. As a result, we may experience temporary delays in validation and shortages of work. We are working to get this resolved and we appreciate your patience while we work through it.

[Edit 1 times, last edit by knreed at Jan 16, 2018 2:46:43 PM]
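For readers curious how a deadlock like this is usually tracked down on a MariaDB/InnoDB backend: the server keeps a record of the most recent deadlock it resolved, so a generic first diagnostic (a sketch only, not necessarily what the WCG team ran) looks like this:

MariaDB [BOINC]> SHOW ENGINE INNODB STATUS\G
-- The "LATEST DETECTED DEADLOCK" section of the output lists the two transactions involved,
-- the row locks each one held and was waiting for, and which transaction InnoDB chose to roll back.

Matching that output against the transitioner's workunit queries is typically how the conflicting access pattern gets identified.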
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
I love to speculate about this sort of thing, so excuse me for chucking in my 2p's worth.
Assuming this problem is new and likely caused by the Meltdown/Spectre fixes, one of the things I vaguely remember seeing is that the resolution of (some?) timers was reduced. Could this be the cause of (some of) your issues? Either way, I think you all deserve a big hand for working so hard to get the systems up and working so well again, especially if your servers are now running more slowly because of those fixes. And thank you for keeping us informed of your actions, progress and likely impacts.
knreed
Former World Community Grid Tech | Joined: Nov 8, 2004 | Post Count: 4504 | Status: Offline
We are going to take the feeder and BOINC daemons offline for a short outage. During this time boinc-client will not be able to report completed work or get new work.
We are attempting to improve the performance of one index that is at the core of many of our current challenges. We appreciate your patience while we continue to work through this.
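For context, dropping and recreating an index on MariaDB during such an outage typically looks like the statements below. This is a hedged sketch: the workunit table and its wu_timeout index on transition_time are named in a later post in this thread, but the exact commands the team ran are an assumption.

-- Sketch only: rebuild the suspect index, then refresh optimizer statistics.
MariaDB [BOINC]> ALTER TABLE workunit DROP INDEX wu_timeout;
MariaDB [BOINC]> ALTER TABLE workunit ADD INDEX wu_timeout (transition_time);
MariaDB [BOINC]> ANALYZE TABLE workunit;

While the index is gone, queries that depend on it fall back to table scans, which is presumably why this was scheduled as a planned outage.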
mdxi
Advanced Cruncher | Joined: Dec 6, 2017 | Post Count: 109 | Status: Offline
Wasn't one of the "features" of moving to a cloud infrastructure the ability to respond more quickly to resource needs by adding CPUs, storage paths, memory etc. as needed, since they are supposedly virtual?

This is mostly true for infrastructure pieces which are more or less stateless (i.e. not holding the data which must be stored long-term): things like HTTP servers, load-balancing front-ends, and application servers which have been designed for horizontal scale-out.

Databases are a different beast. They're about as stateful as you can get. Even if you do spin up a new DB instance, all the load is going to be on the existing DB machines (in one way or another) until the new one has sync'd the data. And the sync process might make things worse before it makes things better.

Disclaimer: I know nothing about WCG's backend. I'm speaking in generalities, from my own experience with distributed and/or virtualized systems.
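To make the "new instance still has to sync" point concrete, bringing up a MariaDB replica generally involves something like the following. This is a generic sketch with placeholder host names, credentials and binlog coordinates; it is not a description of WCG's actual setup.

-- Sketch only: attach a freshly provisioned replica to the existing primary.
MariaDB> CHANGE MASTER TO
           MASTER_HOST='db-primary.example.org',   -- placeholder host
           MASTER_USER='repl',                     -- placeholder replication account
           MASTER_PASSWORD='********',
           MASTER_LOG_FILE='mysql-bin.000123',     -- placeholder binlog coordinates
           MASTER_LOG_POS=4;
MariaDB> START SLAVE;
MariaDB> SHOW SLAVE STATUS\G
-- Until Seconds_Behind_Master reaches 0, the existing primary carries the full production
-- load plus the extra cost of feeding the new replica its data.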
knreed
Former World Community Grid Tech | Joined: Nov 8, 2004 | Post Count: 4504 | Status: Offline
This outage is done and we are seeing the performance we previously expected. We are going to watch things closely over the next few days and make sure the issue does not occur again. Thanks to everyone for their patience during this past week and a half.
KerSamson
Master Cruncher | Switzerland | Joined: Jan 29, 2007 | Post Count: 1684 | Status: Offline
Thank you Kevin for your support and for the feedback.
Hopefully, the coming days will be a bit quieter for you and your colleagues.
Cheers, Yves
knreed
Former World Community Grid Tech | Joined: Nov 8, 2004 | Post Count: 4504 | Status: Offline
Things have continued to run well over the past 12 hours or so. Now that we think we have gotten to the root cause I can give some more detail about what was going on. This is highly technical for those who are interested.
When we did the update to apply the Spectre and Meltdown patches, we started to see various performance-related issues on the systems. Since the issues we saw were so highly correlated in time with the application of those patches, we spent a lot of time investigating that. However, during our investigation some information started to appear that didn't make sense with that being the explanation. In particular, we discovered that these queries were not performing like we expected:

MariaDB [BOINC]> select id from workunit where transition_time < 1;
Empty set (26.46 sec)

MariaDB [BOINC]> select id from workunit ignore index(wu_timeout) where transition_time < 1;
Empty set (7.53 sec)

The workunit table has an index called wu_timeout that consists of transition_time. That means a query with transition_time in the where clause should perform significantly faster using the index than not using it. However, we were seeing the exact opposite, which was very odd.

After reading up and looking at various performance metrics, we reached the conclusion that something during the Dec 6th updates had somehow damaged these indexes, and we eventually decided that they needed to be dropped and recreated. That is what we did during the short outage yesterday. Immediately afterwards we saw the following performance:

MariaDB [BOINC]> select id from workunit where transition_time < 1;
Empty set (0.00 sec)

MariaDB [BOINC]> select id from workunit ignore index(wu_timeout) where transition_time < 1;
Empty set (5.26 sec)

The query time using the index went from 26.5 seconds down to a tiny fraction of a second. The overall load and contention on the database dropped dramatically and the system started to behave like it had before. We looked a little further and saw a few other indexes exhibiting similar symptoms, so we rebuilt those as well. As a result, the database is back to performing beautifully and the team's stress level has gone way down.

Based on this, we still feel that Spectre and Meltdown have caused a performance hit on our database servers of somewhere between 20-30%. However, the database servers were sized to handle at least a doubling of load, so we are still very comfortably within our performance capabilities.
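For anyone who wants to reproduce this kind of check, comparing the plan the optimizer picks with and without the index is the standard way to confirm that the index itself, rather than the query, is the problem. A generic sketch using the table and index names from the post above:

-- Sketch only: inspect the execution plans the optimizer chooses.
MariaDB [BOINC]> EXPLAIN select id from workunit where transition_time < 1;
MariaDB [BOINC]> EXPLAIN select id from workunit ignore index(wu_timeout) where transition_time < 1;
-- A healthy index shows up as key = wu_timeout with a small "rows" estimate in the first plan.
-- When an index plan runs consistently slower than the scan it should be beating, as in the
-- timings above, a damaged or badly bloated index is a likely suspect.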
knreed
Former World Community Grid Tech | Joined: Nov 8, 2004 | Post Count: 4504 | Status: Offline
mdxi - this is a very good way of expressing this situation. With our move to the IBM Cloud (even though we are actually running on bare metal) we can add or remove servers and resources in a tiny fraction of the time that we could at a traditional data center.

As part of planning our move to IBM Cloud we had to analyze the databases and decide whether to move to a replicated or shared-nothing database setup or remain in a single-instance configuration. Replicated databases are more complicated to configure and maintain, and there would have been additional work during the migration to prepare and set up the replicated scenario, which would have taken away from other changes we could make. As a result, we decided that investing our time in improving our automation/devops and monitoring, rather than adding the complexity of a replicated backend, was a better use of our time. In the future, if we get to the point where we need a replicated setup, we can do so, but we are not at that point now (or in the near future).

In this particular situation, the root cause was an issue with the database indexes. Adding more CPU or RAM would not have materially improved performance, since the hit was due to row-level lock contention, which would not change with more memory or CPUs.

[Edit 3 times, last edit by knreed at Jan 17, 2018 3:29:56 PM]
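For those wondering what row-level lock contention looks like from the database side, MariaDB/InnoDB exposes it through its information_schema lock tables. A generic sketch; whether the team used these exact views is an assumption:

-- Sketch only: list blocked transactions and the transactions blocking them.
MariaDB [BOINC]> SELECT r.trx_id AS waiting_trx, r.trx_query AS waiting_query,
                        b.trx_id AS blocking_trx, b.trx_query AS blocking_query
                 FROM information_schema.INNODB_LOCK_WAITS w
                 JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
                 JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id;
-- Extra CPUs or RAM do not help here: a waiting transaction sits idle until the blocking
-- one commits or rolls back, however large the server is.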
gb009761
Master Cruncher | Scotland | Joined: Apr 6, 2005 | Post Count: 3010 | Status: Offline
Thank you, Kevin, for both those very interesting posts - and, of course, for all your and your colleagues' effort in resolving the situation.
I'm curious to know: do you, or would you ever, consider rebuilding the database indexes on a regular schedule (say, once a week or month)? When I was working for IBM several years ago, we had regular housekeeping jobs that did just that.
knreed
Former World Community Grid Tech | Joined: Nov 8, 2004 | Post Count: 4504 | Status: Offline
I think you are referring to a table or index reorganization. The website database is running on DB2, and DB2 has added a lot of features to handle these things for us automatically. We basically tell it "don't use more than this much memory" and it adjusts various memory parameters and settings to optimize its performance. It also keeps track of the tables and indexes and decides when to collect stats for the indexes or reorganize the tables and indexes. All of this happens automatically, without our intervention, for the website/DB2 database.

The BOINC database is a little bit different. Most of the data in that database is in the workunit and result tables. These tables undergo very heavy modification (about a 50% change per week). As a result they are in a persistent state of needing reorganization, to the point that it isn't really worth constantly reorganizing (or, in MariaDB terms, defragmenting) them, since they would very quickly become fragmented again.

Prior to whatever event damaged the index, the tables and indexes had remained at a consistently high level of performance without periodic rebuilds. We are now back at that level of performance, so other than monitoring the load on the server more closely to see whether things start to behave poorly again, we do not think additional steps are necessary.
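For reference, the periodic housekeeping being discussed maps to standard statements on the MariaDB side. A generic sketch; nothing in the post suggests WCG actually schedules these:

-- Sketch only: manual equivalents of periodic index/table maintenance.
MariaDB [BOINC]> ANALYZE TABLE workunit, result;    -- refresh the index statistics the optimizer uses
MariaDB [BOINC]> OPTIMIZE TABLE workunit, result;   -- rebuild (defragment) the tables and their indexes
-- On InnoDB, OPTIMIZE TABLE is carried out as a full table rebuild, which is why running it
-- constantly against tables that churn 50% per week buys very little.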