Thread Status: Active
Total posts in this thread: 8
This topic has been viewed 1696 times and has 7 replies
keithhenry
Ace Cruncher
Senile old farts of the world ....uh.....uh..... nevermind
Joined: Nov 18, 2004
Post Count: 18667
Status: Offline
Is failover file system possible???

With the recent file system error that required more than a full day of copying, I got to wondering whether a duplicate would be possible: a near-real-time replica (probably a few minutes behind is the best that would be possible) that the servers could "fail over" to in the case of another file system error. That would at least keep things running and give the admins a bit more flexibility in resolving the problem. There might be an issue with how to deal with files created between the last update to the replica and the time of the error on the primary. Still, could something like this work?
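
To make the gap concern concrete, here is a minimal Python sketch, assuming a hypothetical replica that syncs on a fixed schedule. The file names, timestamps, and the `files_at_risk` helper are all made up for illustration and say nothing about how the WCG servers actually work. It lists the files created after the last successful sync and up to the failure, which is the set a failover replica would be missing.

```python
from datetime import datetime

def files_at_risk(created_at, last_sync, failure_time):
    """Return names of files created after the last replica sync
    and up to the primary's failure (so absent from the replica)."""
    return sorted(
        name for name, ts in created_at.items()
        if last_sync < ts <= failure_time
    )

# Hypothetical example data, not taken from any real system.
created = {
    "result_001.zip": datetime(2010, 1, 1, 9, 58),
    "result_002.zip": datetime(2010, 1, 1, 10, 2),
    "result_003.zip": datetime(2010, 1, 1, 10, 4),
}
last_sync = datetime(2010, 1, 1, 10, 0)   # replica last updated here
failure = datetime(2010, 1, 1, 10, 5)     # primary fails here

print(files_at_risk(created, last_sync, failure))
```

The shorter the sync interval, the smaller this set gets, but something would still have to re-create or re-request those files after a failover.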
----------------------------------------
Join/Website/IMODB



[Oct 18, 2010 2:14:55 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Is failover file system possible???

In the last few discussions by knreed there's talk of fail-over robustness being hardened... the recent switch to GPFS and so on: http://www-03.ibm.com/systems/software/gpfs/. I would imagine that they're still working to complete this transition... at least that is my reading.

Today we're heading for resuming the consecutive [real] records... 355 CPU years is what I make of it at the moment.
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
[Oct 18, 2010 2:45:58 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Re: Is failover file system possible???

A second real-time failover file system would be prohibitively expensive and difficult to maintain given the volume of work that moves on and off of the file system.

However, we are now fully using GPFS as the file system for the BOINC file system. This means the following:


  • Each of the servers that has access to the file system has the storage mounted directly rather than through an intermediary server (previously we had a single point of failure: the one server that mounted the SAN storage and shared the file system via NFS to the other servers)
  • Each server that mounts the storage has redundant HBA cards with redundant pathing to the SAN storage array
  • The SAN storage array is a highly available solution that supports multiple customers (not just World Community Grid)


It is unlikely that we will lose access to the file system. However, human error could always occur and that could cause issues.

There is always more money that could be spent on making the system more available and more reliable. But each further gain in reliability costs more than the last, with diminishing returns.

As far as I can recall (without digging through notes and history) there have been 3 major outages of the BOINC system since we started using it in 2005. One was right before Christmas 2006 and was significant because BOINC primarily ran off of one server at that time without redundancy. The second was when we migrated to our new hosting environment. The third was last week. There have been other outages ranging from a few minutes to a few hours, but nothing beyond that.

Our scheduled uptime is above 99% over our 4-5 year history (I could argue it is above 99.5%, but that would require me to be more accurate). I personally think that is a reasonable availability for a system that does not involve finances, medical, or other life-threatening or critical systems.
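
For anyone who wants to check a figure like this against their own notes, here is a minimal Python sketch of the calculation, assuming availability is simply uptime divided by total time. The outage durations below are placeholders made up for illustration; they are NOT real WCG figures.

```python
HOURS_PER_YEAR = 365 * 24  # ignores leap years

def availability(total_hours, outage_hours):
    """Fraction of the period during which the service was up."""
    downtime = sum(outage_hours)
    return 1.0 - downtime / total_hours

# Placeholder outage durations in hours; NOT real WCG data.
outages = [48.0, 36.0, 30.0, 2.0, 0.5]
period = 5 * HOURS_PER_YEAR

print(f"{availability(period, outages):.4%}")
```

Swapping in the real outage log and the real length of the history would give the actual number.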
[Oct 19, 2010 2:13:59 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Is failover file system possible???

I believe that WCG has fantastically high availability, and that's one of the main reasons I crunch almost exclusively here. Distributed computing allows for some downtime, which in WCG's case is almost nonexistent.
[Oct 19, 2010 2:43:45 PM]
Dataman
Ace Cruncher
Joined: Nov 16, 2004
Post Count: 4865
Status: Offline
Re: Is failover file system possible???

A second real-time failover file system would be prohibitively expensive and difficult to maintain given the volume of work that moves on and off of the file system.

I personally think that is a reasonable availability for a system that does not involve finances, medical, or other life-threatening or critical systems.


This thread asks the wrong question. Is a failover system possible? Of course it is. Is it necessary and cost justifiable? Absolutely not. Kevin says it perfectly correctly. This is not a critical system as defined by the business world. Whether a project is down two hours, two days, two weeks or two months is inconsequential (except for the whining of crunchers).
I find that even one nine of availability (90.00%) achieved by WCG is simply amazing for a data center of this type. Even commercial banks and hospitals can only achieve 2 nines, and only the most critical data centers achieve 3 nines. Removing every single point of failure is all but impossible except for the most mission-critical data centers. In my (non-humble) opinion, WCG has greatly exceeded expectations for availability, and I am certain they have met or exceeded their service level agreements with their clients. Moving to 99.90% would roughly double the expenditure, and I hope that if their budget doubles they spend it on capacity and not redundancy.
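
For reference, a short Python sketch of what the "nines" work out to as downtime per year. It assumes the usual convention that n nines means an availability of 1 - 10^-n and a 365-day year; the function names are made up for this example.

```python
HOURS_PER_YEAR = 365 * 24  # ignores leap years

def nines_to_availability(nines):
    """Availability fraction for a whole number of nines, e.g. 2 -> 0.99."""
    return 1.0 - 10.0 ** (-nines)

def downtime_hours_per_year(availability):
    """Hours of downtime per year that a given availability permits."""
    return (1.0 - availability) * HOURS_PER_YEAR

for n in (1, 2, 3):
    a = nines_to_availability(n)
    print(f"{n} nine(s): {a:.1%} available, "
          f"{downtime_hours_per_year(a):.1f} h/year down")
```

Under those assumptions, one nine allows about 36.5 days of downtime a year, two nines about 3.65 days, and three nines under 9 hours.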

I know it is hard to remember sometimes but the DC technology was not created to keep crunchers entertained and its primary work product is not points and badges.
Having said that … Where’s my UD credits, Kevin?
Cheers and crunch on.
[Edit 1 times, last edit by Dataman at Oct 19, 2010 3:52:42 PM]
[Oct 19, 2010 3:41:20 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Is failover file system possible???

This thread asks the wrong question. Is a failover system possible? Of course it is. Is it necessary and cost justifiable? Absolutely not.
It is so easy to tell someone else how to spend their money.
Whether a project is down two hours, two days, two weeks or two months is inconsequential (except for the whining of crunchers)
Which unfortunately does enter into the equation, as even the imperturbable Sekerob has been rattled by the cacophony of whining on occasion.
WCG has greatly exceeded expectations for availability
With the last outage, I lost about 15 days of crunching. That was due to MY decision to run with a 0.1-day buffer. While I lost some time, I have received other benefits, so no whining from me and no stress over the loss. It is what it is, and I am still at 0.1 days.
[Oct 19, 2010 3:56:44 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Is failover file system possible???

  • The SAN storage array is a highly available solution that supports multiple customers (not just World Community Grid)

Saw a lot of this in the past 48 hours... clouds... forecast for today... cloudless. Good harvest day for the olives.
[Edit 1 times, last edit by Sekerob at Oct 20, 2010 1:03:58 AM]
[Oct 20, 2010 1:03:26 AM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7846
Status: Offline
Re: Is failover file system possible???

A second real-time failover file system would be prohibitively expensive and difficult to maintain given the volume of work that moves on and off of the file system.

I personally think that is a reasonable availability for a system that does not involve finances, medical, or other life-threatening or critical systems.


This thread asks the wrong question. Is a failover system possible? Of course it is. Is it necessary and cost justifiable? Absolutely not. Kevin says it perfectly correctly. This is not a critical system as defined by the business world. Whether a project is down two hours, two days, two weeks or two months is inconsequential (except for the whining of crunchers).
I find that even one nine of availability (90.00%) achieved by WCG is simply amazing for a data center of this type. Even commercial banks and hospitals can only achieve 2 nines, and only the most critical data centers achieve 3 nines. Removing every single point of failure is all but impossible except for the most mission-critical data centers. In my (non-humble) opinion, WCG has greatly exceeded expectations for availability, and I am certain they have met or exceeded their service level agreements with their clients. Moving to 99.90% would roughly double the expenditure, and I hope that if their budget doubles they spend it on capacity and not redundancy.

I know it is hard to remember sometimes but the DC technology was not created to keep crunchers entertained and its primary work product is not points and badges.
Having said that … Where’s my UD credits, Kevin?
Cheers and crunch on.


I agree wholeheartedly. I had some peripheral experience with a Stratasys system with redundant everything, including backup power. Very expensive, but necessary for the environment in which it was used.

Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Oct 20, 2010 1:14:40 AM]