Thread Status: Active
Total posts in this thread: 11
This topic has been viewed 2688 times and has 10 replies
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Our new environment

Greetings,
There have been some questions asked recently about our environment, which we have now migrated to IBM Cloud.

We have three different environments at IBM Cloud. Each is isolated from the others except for a few specific connections (for example, to allow monitoring).

The first is our “Admin/DevOps” environment. This is available only inside the IBM firewall and houses the tools that our team uses to run and manage the program. It is where we run programs like Sonar (code quality analysis tool), Jenkins (continuous integration tool), Jira (task management tool), Confluence (collaborative team wiki), GitLab (code repository) and Nagios (monitoring tool). These are installed on 2 Linux (Red Hat 7) virtual machines.

The second is our QA environment, where we carry out our testing. We are running 5 Linux virtual machines in this environment. This is enough servers to match each “flavor” of server that we have running in production (see below).

The third is our production environment. We are running 11 physical Linux machines in this environment.

The “flavors” of servers we have in production are as follows:

  • 2 servers (24 cores, 64GB RAM, 20 Gbps connections) host our load balancers (HAProxy)
  • 2 servers (24 cores, 64GB RAM, 20 Gbps connections) host the BOINC scheduler and the website (Apache HTTPD (web server) and WebSphere Application Server)
  • 2 servers (28 cores, 512GB RAM, 20 Gbps connections) host the databases (DB2 and MariaDB)
  • 5 servers (28 cores, 64GB RAM, 20 Gbps connections) host the science data, run the BOINC daemons and handle file transfers (Apache HTTPD, Apache Mesos (distributed task scheduling tool), Apache Aurora (Mesos framework for long running jobs), IBM Spectrum Scale (clustered storage tool))

We are using IBM Spectrum Scale, Apache Aurora, and Apache Mesos so that when we run any of our backend tasks, such as the BOINC validators, workunit building tools or result aggregation scripts, the task is distributed to whichever science server has the least work and is run there. Each server has the same access to the files and the database as the others. This setup makes the infrastructure easy to manage. It also makes it very easy to add capacity by simply adding servers and configuring them like the others.
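To make that a little more concrete, here is a minimal sketch of what an Aurora job definition looks like in Aurora's Python-based configuration DSL. This is not our actual configuration; the cluster, role, paths and names are hypothetical and are only meant to illustrate declaring a long-running backend process that Aurora and Mesos then place on whichever science server has spare capacity.

```python
# Minimal sketch of an Aurora job definition (.aurora file, Python-based DSL).
# NOT our actual configuration: cluster, role, paths and names are hypothetical.

run_validator = Process(
    name = 'run_validator',
    # Hypothetical validator command, for illustration only.
    cmdline = '/opt/boinc/bin/example_validator -d 3 -app example_app'
)

validator_task = Task(
    name = 'validator',
    processes = [run_validator],
    resources = Resources(cpu = 2, ram = 4 * GB, disk = 8 * GB)
)

jobs = [
    Service(                      # Service = a Job that Aurora keeps running and restarts on exit
        cluster = 'wcg-prod',     # hypothetical cluster name
        role = 'boinc',           # hypothetical role
        environment = 'prod',
        name = 'example_validator',
        task = validator_task,
        instances = 1
    )
]
```

Because every science server mounts the same IBM Spectrum Scale filesystem and reaches the same databases, it doesn't matter which machine Mesos ultimately picks for a given task.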

One question that some people will ask is: if we are in the cloud, why are we using physical machines? There are two answers to this question. The first is that IBM Cloud has made it as easy to provision a physical server as it is to provision a virtual server. We do lose a few things, like being able to easily take a snapshot of the server and save it, but since we used automation heavily during setup, it is relatively easy for us to recreate a server from scratch. The second reason is that we use a lot of bandwidth and file I/O. These are areas where the industry has found it challenging to get good performance at scale when virtualized. A lot of research is being done on this and it will probably change in the next few years. However, we wanted to make sure that we had the I/O capabilities we needed, and going physical was the best way to ensure that.

We are using two automation tools. We have used Ansible for a lot of the low-level setup and configuration of the servers, and we are using IBM UrbanCode Deploy for the ongoing deployment of changes to our different applications and processes.

Now, to the question of what caused the two major outages that we have had recently. Both of them occurred when IBM Spectrum Scale determined that it couldn't trust the data on a set of disks on one of the servers following a reboot. In order to restore operations, the software needed to perform a long recovery process to scan and check the data. IBM Spectrum Scale is very defensive in order to make sure that the data it provides access to is reliable (this is a good thing). The reason this happened is more complicated. The long and the short of it is that we are working with an internal operations team that we haven't worked with before. There were some questions about who was responsible for certain activities, which resulted in some things being set up wrong that triggered these issues. While this is embarrassing to our team, we are confident that we can resolve these issues quickly, and we look forward to leveraging the new capabilities we have built into our new environment.
[Jul 21, 2017 10:17:58 PM]
gb009761
Master Cruncher
Scotland
Joined: Apr 6, 2005
Post Count: 3010
Status: Offline
Re: Our new environment

Thanks, Kevin, for that very insightful post.

I'm sure that this will go a long way towards answering many of the questions that have recently been posed, and also reassure everyone that, as you've highlighted (and we've witnessed), there have been a few 'teething problems', but hopefully there won't be many (preferably any) more.
----------------------------------------

[Jul 21, 2017 11:26:55 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7846
Status: Offline
Re: Our new environment

Thanks for the explanation. Very informative.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Jul 22, 2017 1:24:01 AM]
AMuthig
Advanced Cruncher
USA
Joined: Nov 30, 2013
Post Count: 59
Status: Offline
Re: Our new environment

Thank you for the information!
[Jul 22, 2017 5:41:08 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Our new environment

Thank you for the information, presented in a way that I could mostly understand. I promise not to moan if and (yes) when things go wrong again.

The horrors of my "senior analyst / engineer" days are now well past. When a disk on a RAID 5 server started to fail, a new identical disk was ordered and, whilst the machine was off, replaced. On restart, the first message was that the "geometry" on the disk was not correct; it got sillier on a second restart, when the message said "The disk is not sane". The only thing to do was to get in touch with the supplier, an IBM-accredited one, who came two days later and had the server working inside 4 hours. The server was running the company's financial records.
----------------------------------------
[Edit 1 times, last edit by Former Member at Jul 23, 2017 9:40:39 AM]
[Jul 23, 2017 9:39:51 AM]
cjslman
Master Cruncher
Mexico
Joined: Nov 23, 2004
Post Count: 2082
Status: Offline
Re: Our new environment

Thanks for the explanation; it helps with understanding the complex environment in which WCG is run.
CJSL

Keep Calm and Crunch !!!
----------------------------------------
I follow the Gimli philosophy: "Keep breathing. That's the key. Breathe."
Join The Cahuamos Team


[Jul 23, 2017 12:58:07 PM]
KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1684
Status: Offline
Re: Our new environment

Thank you knreed for this very informative post.
For my part, I am rather impressed to see that WCG is not as "large" as I had assumed. I expected a larger server infrastructure, though I did notice that you did not mention the storage units (SAN).
Thank you for the explanation about the two outages.
I wish you success with the further consolidation activities. As usual, the most important point is to be able to learn from failures and incidents (without pointing fingers).
Cheers,
Yves
----------------------------------------
[Jul 23, 2017 1:00:36 PM]
TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 2173
Status: Offline
Re: Our new environment

Thanks for the info. It clears up a few things that hadn't been mentioned when the actual "move to the cloud" happened a few weeks ago.
And as far as I am concerned, the last paragraph pretty much sums up what the issue behind the last (unplanned) outage was. And as that seems to be more of a communication issue than a purely technical issue, it should be relatively easy to fix and therefore prevent future occurrences...

Ralf
[Jul 26, 2017 6:52:12 AM]
KLiK
Master Cruncher
Croatia
Joined: Nov 13, 2006
Post Count: 3108
Status: Offline
Re: Our new environment

WoW, nice ones...enjoying your work!

Thanks for the info.
----------------------------------------
oldies:UDgrid.org & PS3 Life@home


non-profit org. Play4Life in Zagreb, Croatia
[Jul 26, 2017 12:36:53 PM]
asdavid
Veteran Cruncher
FRANCE
Joined: Nov 18, 2004
Post Count: 521
Status: Offline
Re: Our new environment

Appreciate this very informative post. Thanks.
----------------------------------------
Anne-Sophie

[Jul 26, 2017 2:20:01 PM]