World Community Grid Forums
Thread Status: Active. Total posts in this thread: 11
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline
Greetings,
There have been some questions asked recently about our environment, which we recently migrated to IBM Cloud. We have three different environments at IBM Cloud, each isolated from the others except for a few specific connections (such as allowing monitoring).

The first is our "Admin/DevOps" environment. This is available only inside the IBM firewall and houses the tools that our team uses to help us run and manage the program. It is where we run programs like Sonar (code quality analysis tool), Jenkins (build automation), Jira (task management tool), Confluence (collaborative team wiki), GitLab (code repository) and Nagios (monitoring tool). These are installed on 2 Linux (Red Hat 7) virtual machines.

The second is our QA environment, where we carry out our testing. We are running 5 Linux virtual machines in this environment. This is enough servers to match each "flavor" of server that we have running in production (see below).

The third is our production environment. We are running 11 Linux physical machines in this environment. The "flavors" of servers we have in production are as follows:
We are using IBM Spectrum Scale, Apache Aurora, and Apache Mesos so that when we run any of our backend tasks, such as the BOINC validators, workunit-building tools or result-aggregation scripts, the task is distributed to whichever science server has the least work and is run there. Each server has the same access to files and the database as the others. This setup makes the infrastructure easy to manage. It also makes it very easy to add capacity by simply adding additional servers and configuring them like the others.

One question that some people will ask is: if we are in the cloud, why are we using physical machines? There are two answers to this question. The first is that IBM Cloud has made it as easy to provision a physical server as it is to provision a virtual server. We do lose a few things, like being able to easily take a snapshot of the server and save it, but since we used automation heavily during setup, it is relatively easy for us to recreate a server from scratch. The second reason is that we use a lot of bandwidth and file I/O. These are things that have proven to be a challenge for the industry to get to perform well at scale when virtualized. A lot of research is being done on this, and it will probably change in the next few years. However, we wanted to make sure that we had the I/O capabilities we needed, and going physical was the best way to ensure that.

We are using two automation tools. We have used Ansible for much of the low-level setup and configuration of the servers, and we are using IBM UrbanCode Deploy for the ongoing deployment of changes to our different applications and processes.

Now, to the question of what caused the two major outages that we have had recently. Both of them occurred when IBM Spectrum Scale determined that it couldn't trust the data on a set of disks on one of the servers following a reboot.
In order to restore operations, the software needed to perform a long recovery process to scan and check the data. IBM Spectrum Scale is very defensive in order to make sure that the data it provides access to is reliable (this is a good thing). The reason this happened is more complicated. The long and the short of it is that we are working with an internal operations team that we haven't worked with before. There were some questions about who was responsible for certain activities, which resulted in some things being set up wrong and triggered these issues. While this is embarrassing to our team, we are confident we can resolve these issues quickly, and we look forward to leveraging the new capabilities we have built into our new environment.
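The "send each backend task to whichever science server has the least work" behavior described above can be sketched in miniature. This is a hypothetical toy model, not the actual Aurora/Mesos scheduler (which works on resource offers rather than a simple task count); the server names and counts are invented:

```python
# Toy model of least-loaded dispatch, as described in the post.
# `pending` maps a science-server name to its pending task count
# (hypothetical; real Mesos scheduling uses resource offers).

def pick_server(pending):
    """Return the name of the server with the fewest pending tasks."""
    return min(pending, key=pending.get)

def dispatch(task_name, pending):
    """Assign a task to the least-loaded server and bump its count."""
    server = pick_server(pending)
    pending[server] += 1
    return server

pending = {"sci1": 12, "sci2": 5, "sci3": 9}  # hypothetical backlog
assert dispatch("boinc_validator", pending) == "sci2"
assert pending["sci2"] == 6
```

Because every server has the same view of the shared filesystem and database, any server chosen this way can run any task, which is what makes adding capacity as simple as configuring another identical server.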
gb009761
Master Cruncher Scotland Joined: Apr 6, 2005 Post Count: 3010 Status: Offline
Thanks Kevin for that very insightful post.
----------------------------------------
I'm sure that this will go a long way toward answering many of the questions that have recently been posed, and also reassure everyone that, as you've highlighted (and we've witnessed), there have been a few 'teething problems', but hopefully there won't be many (preferably any) more.
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7846 Status: Offline
Thanks for the explanation. Very informative.
----------------------------------------
Cheers
Sgt. Joe
*Minnesota Crunchers*
AMuthig
Advanced Cruncher USA Joined: Nov 30, 2013 Post Count: 59 Status: Offline
Thank you for the information!
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
Thank you for the information, presented in a way that I could mostly understand. I promise not to moan if and (yes) when they go wrong again.
----------------------------------------
The horrors of my "senior analyst / engineer" days are now well past. When a disk on a RAID 5 server started to fail, a new identical disk was ordered and, whilst the machine was off, replaced. On restart, the first message was that the "geometry" on the disk was not correct; it got sillier on a second restart, when the message said "The disk is not sane". The only thing to do was to get in touch with the supplier, an IBM-accredited supplier, who came two days later and had the server working inside 4 hours. The server was running the company's financial records.
[Edit 1 times, last edit by Former Member at Jul 23, 2017 9:40:39 AM]
cjslman
Master Cruncher Mexico Joined: Nov 23, 2004 Post Count: 2082 Status: Offline
Thanks for the explanation, it helps with understanding the complex environment in which WCG is run.
----------------------------------------
CJSL
Keep Calm and Crunch !!!
KerSamson
Master Cruncher Switzerland Joined: Jan 29, 2007 Post Count: 1684 Status: Offline
Thank you knreed for this very informative post.
----------------------------------------
For my part, I am rather surprised that WCG is not as "large" as I had assumed. I supposed that you would have a larger server infrastructure, and I noticed that you did not mention the storage units (SAN). Thank you for the explanation about the two outages. I wish you every success with the further consolidation activities. As usual, the most important point is to be able to learn from failures and incidents (without pointing fingers at a culprit).
Cheers,
Yves
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 2173 Status: Offline
Thanks for the info. It clears up a few things that haven't been mentioned when the actual "move to the cloud" happened a few weeks ago.
And as far as I am concerned, the last paragraph pretty much sums up what the issue behind the last (unplanned) outage was. And as that seems to be more of a communication issue than a plain technical issue, it should be relatively easy to fix and therefore prevent future occurrences...
Ralf
KLiK
Master Cruncher Croatia Joined: Nov 13, 2006 Post Count: 3108 Status: Offline
WoW, nice ones...enjoying your work!
----------------------------------------
Thanks for the info.
asdavid
Veteran Cruncher FRANCE Joined: Nov 18, 2004 Post Count: 521 Status: Offline
Appreciate this very informative post. Thanks.
----------------------------------------
Anne-Sophie