World Community Grid Forums
Thread Status: Active. Total posts in this thread: 11
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline
Greetings,
There have been some questions asked recently about our environment, which we recently migrated to IBM Cloud. We have three different environments at IBM Cloud, each isolated from the others except for a few specific connections (such as allowing monitoring).

The first is our "Admin/DevOps" environment. This is available only inside the IBM firewall and houses the tools that our team uses to help us run and manage the program. It is where we run programs like Sonar (code quality analysis tool), Jenkins (build automation), Jira (task management tool), Confluence (collaborative team wiki), GitLab (code repository) and Nagios (monitoring tool). These are installed on 2 Linux (Red Hat 7) virtual machines.

The second is our QA environment, where we carry out our testing. We are running 5 Linux virtual machines in this environment. This is enough servers to match each "flavor" of server that we have running in production (see below).

The third is our production environment. We are running 11 Linux physical machines in this environment. The "flavors" of servers we have in production are as follows:
We are using IBM Spectrum Scale, Apache Aurora, and Apache Mesos so that when we run any of our backend tasks, such as the BOINC validators, workunit-building tools or result-aggregation scripts, the task is distributed to whichever science server has the least work and is run there. Each server has the same access to files and the database as the others. This setup makes the infrastructure easy to manage. It also makes it very easy to add capacity by simply adding additional servers and configuring them like the others.

One question that some people will ask is: if we are in the cloud, why are we using physical machines? There are two answers to this question. The first is that IBM Cloud has made it as easy to provision a physical server as it is to provision a virtual server. We do lose a few things, like being able to easily take a snapshot of the server and save it, but since we used automation heavily during setup, it is relatively easy for us to recreate a server from scratch. The second reason is that we use a lot of bandwidth and file I/O. These are things that have proven to be a challenge for the industry to get to perform well at scale when virtualized. A lot of research is being done on this, and it will probably change in the next few years. However, we wanted to make sure that we had the I/O capabilities we needed, and going physical was the best way to ensure that.

We are using two automation tools. We have used Ansible for much of the low-level setup and configuration of the servers, and we are using IBM UrbanCode Deploy for the ongoing deployment of changes to our different applications and processes.

Now, to the question of what caused the two major outages that we have had recently. Both of them occurred when IBM Spectrum Scale determined that it couldn't trust the data on a set of disks on one of the servers following a reboot.
In order to restore operations, the software needed to perform a long recovery process to scan and check the data. IBM Spectrum Scale is very defensive in order to make sure that the data it provides access to is reliable (this is a good thing). The reason this happened is more complicated. The long and the short of it is that we are working with an internal operations team that we haven't worked with before. There were some questions about who was responsible for certain activities, which resulted in some things being set up wrong and triggered these issues. While this is embarrassing to our team, we are confident we can resolve these issues quickly, and we look forward to leveraging the new capabilities we have built into our new environment.
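The "send each backend task to whichever science server has the least work" behavior described above can be sketched in miniature. This is a hypothetical toy model, not the actual Aurora/Mesos scheduler (which works on resource offers rather than a simple task count); the server names and counts are invented:

```python
# Toy model of least-loaded dispatch, as described in the post.
# `pending` maps a science-server name to its pending task count
# (hypothetical; real Mesos scheduling uses resource offers).

def pick_server(pending):
    """Return the name of the server with the fewest pending tasks."""
    return min(pending, key=pending.get)

def dispatch(task_name, pending):
    """Assign a task to the least-loaded server and bump its count."""
    server = pick_server(pending)
    pending[server] += 1
    return server

pending = {"sci1": 12, "sci2": 5, "sci3": 9}  # hypothetical backlog
assert dispatch("boinc_validator", pending) == "sci2"
assert pending["sci2"] == 6
```

Because every server has the same view of the shared filesystem and database, any server chosen this way can run any task, which is what makes adding capacity as simple as configuring another identical server.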
gb009761
Master Cruncher Scotland Joined: Apr 6, 2005 Post Count: 3010 Status: Offline
Thanks Kevin for that very insightful post.
----------------------------------------
I'm sure that this will go a long way toward answering many of the questions that have recently been posed, and also reassure everyone that, as you've highlighted (and we've witnessed), there have been a few 'teething problems', but hopefully there won't be many (preferably any) more.
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7846 Status: Offline
Thanks for the explanation. Very informative.
----------------------------------------
Cheers
Sgt. Joe
*Minnesota Crunchers*
AMuthig
Advanced Cruncher USA Joined: Nov 30, 2013 Post Count: 59 Status: Offline
Thank you for the information!
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Thank you for the information, presented in a way that I could mostly understand. I promise not to moan if and (yes) when they go wrong again.
----------------------------------------
The horrors of my "senior analyst / engineer" days are now well past. When a disk on a RAID 5 server started to fail, a new identical disk was ordered and, whilst the machine was off, replaced. On restart, the first message was that the "geometry" on the disk was not correct; it got sillier on a second restart, when the message said "The disk is not sane". The only thing to do was to get in touch with the supplier, an IBM-accredited supplier, who came two days later and had the server working inside 4 hours. The server was running the company's financial records.
[Edit 1 times, last edit by Former Member at Jul 23, 2017 9:40:39 AM]
cjslman
Master Cruncher Mexico Joined: Nov 23, 2004 Post Count: 2082 Status: Offline
Thanks for the explanation, it helps with understanding the complex environment in which WCG is run.
----------------------------------------
CJSL
Keep Calm and Crunch !!!
KerSamson
Master Cruncher Switzerland Joined: Jan 29, 2007 Post Count: 1684 Status: Offline
Thank you knreed for this very informative post.
----------------------------------------
For my part, I am rather surprised that WCG is not as "large" as I had assumed. I supposed that you would have a larger server infrastructure, and I noticed that you did not mention the storage units (SAN). Thank you for the explanation about the two outages. I wish you every success with the further consolidation activities. As usual, the most important point is to be able to learn from failures and incidents (without pointing fingers at a culprit).
Cheers,
Yves
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 2173 Status: Offline
Thanks for the info. It clears up a few things that haven't been mentioned when the actual "move to the cloud" happened a few weeks ago.
And as far as I am concerned, the last paragraph pretty much sums up what the issue behind the last (unplanned) outage was. And as that seems to be more of a communication issue than a plain technical issue, it should be relatively easy to fix and therefore prevent future occurrences...
Ralf
KLiK
Master Cruncher Croatia Joined: Nov 13, 2006 Post Count: 3108 Status: Offline
WoW, nice ones...enjoying your work!
----------------------------------------
Thanks for the info.
asdavid
Veteran Cruncher FRANCE Joined: Nov 18, 2004 Post Count: 521 Status: Offline
Appreciate this very informative post. Thanks.
----------------------------------------
Anne-Sophie