| World Community Grid Forums
Thread Status: Active Total posts in this thread: 8
|
|
| Author |
|
|
keithhenry
Ace Cruncher Senile old farts of the world ....uh.....uh..... nevermind Joined: Nov 18, 2004 Post Count: 18667 Status: Offline Project Badges:
|
With the recent file system error that required more than a full day of copying, I got to wondering whether a duplicate would be possible: a real-time-ish replica (a few minutes behind is probably the best that would be possible) that the servers could "failover" to in the case of another file system error. That would at least keep things running and give the admins a bit more flexibility in resolving the problem. There might be an issue to address with how to deal with files created between the last update to the replica and the time of the error on the primary. Still, could something like this work?
---------------------------------------- |
||
|
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
In the last few discussions by knreed there's talk of fail-over robustness being hardened... the recent switch to GPFS and so on http://www-03.ibm.com/systems/software/gpfs/ . I would imagine that they're still working to complete this transition... at least so is my reading.
----------------------------------------
Today we're heading for resuming the consecutive [real] records... 355 CPU years is what I make of it at the moment.
WCG
Please help to make the Forums an enjoyable experience for All! |
||
|
|
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline Project Badges:
|
A second real-time failover file system would be prohibitively expensive and difficult to maintain given the volume of work that moves on and off of the file system.
However, we are now fully using GPFS as the file system for the BOINC file system. This means the following:
It is unlikely that we will lose access to the file system. However, human error could always occur and that could cause issues. There is always more money that could be spent on making the system more available and more reliable. But making it ever more reliable becomes increasingly expensive, with diminishing returns.
As far as I can recall (without digging through notes and history) there have been 3 major outages of the BOINC system since we started using it in 2005. One was right before Christmas 2006 and was significant because BOINC primarily ran off of one server at that time without redundancy. The second was when we migrated to our new hosting environment. The third was last week. There have been other outages ranging from a few minutes to a few hours, but nothing beyond that.
Our scheduled up time is above 99% over our 4-5 year history (I could argue it is above 99.5% but that would require me to get more accurate numbers). I personally think that is a reasonable availability for a system that does not involve finances, medical, or other life-threatening or critical systems. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I believe that WCG has fantastically high availability, and that's one of the main reasons I crunch almost exclusively here. Distributed computing allows for some downtime, which in WCG's case is almost nonexistent.
|
||
|
|
Dataman
Ace Cruncher Joined: Nov 16, 2004 Post Count: 4865 Status: Offline Project Badges:
|
knreed wrote: "A second real-time failover file system would be prohibitively expensive and difficult to maintain given the volume of work that moves on and off of the file system."
knreed wrote: "I personally think that is a reasonable availability for a system that does not involve finances, medical, or other life-threatening or critical systems."
This thread asks the wrong question. Is a failover system possible? Of course it is. Is it necessary and cost justifiable? Absolutely not. Kevin has it exactly right. This is not a critical system as defined by the business world. Whether a project is down two hours, two days, two weeks or two months is inconsequential (except for the whining of crunchers). I find that even one nine of availability (90.00%) achieved by WCG is simply amazing for a data center of this type. Even commercial banks and hospitals can only achieve 2 nines and only the most critical data centers achieve 3 nines. Removing every single point of failure is all but impossible except for the most mission-critical data centers. In my (non-humble) opinion, WCG has greatly exceeded expectations for availability and I am certain they have met or exceeded their service level agreements with their clients. Moving to 99.90% would roughly double the expenditure and I hope that if their budget doubles they spend it on capacity and not redundancy. I know it is hard to remember sometimes but the DC technology was not created to keep crunchers entertained and its primary work product is not points and badges.
Having said that … Where’s my UD credits, Kevin? Cheers and crunch on.
[Edit 1 times, last edit by Dataman at Oct 19, 2010 3:52:42 PM] |
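For anyone who wants to see what the "nines" quoted in this thread mean in clock time, here is a rough sketch. It assumes availability is measured against a full 365-day calendar year, which is not necessarily how WCG counts its "scheduled up time" (planned maintenance may be excluded there). The function name is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap calendar year, 8760 hours


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the hours of downtime per year implied by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    # The percentages mentioned in the posts above.
    for pct in (90.0, 99.0, 99.5, 99.9):
        hours = downtime_hours_per_year(pct)
        print(f"{pct:5.1f}% -> {hours:7.1f} hours/year ({hours / 24:.1f} days)")
```

Under that assumption, each extra nine cuts the allowed downtime by a factor of ten, which is one way to see why each step up in availability costs so much more than the last.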
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Dataman wrote: "This thread asks the wrong question. Is a failover system possible? Of course it is. Is it necessary and cost justifiable? Absolutely not."
It is so easy to tell someone else how to spend their money.
Dataman wrote: "Whether a project is down two hours, two days, two weeks or two months is inconsequential (except for the whining of crunchers)"
Which unfortunately does enter into the equation, as even the imperturbable Sekerob has been rattled by the cacophony of whining on occasion.
Dataman wrote: "WCG has greatly exceeded expectations for availability"
With the last outage, I lost about 15 days of crunching. That was due to MY decision to run at .1 days buffer. While I lost some time, I have received other benefits, so no whining from me and no stress over the loss. It is what it is, and I am still at .1 days |
||
|
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
Saw a lot of this in the past 48 hours... clouds... forecast for today ... cloudless. Good harvest day for the olives.
[Edit 1 times, last edit by Sekerob at Oct 20, 2010 1:03:58 AM] |
||
|
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7846 Status: Offline Project Badges:
|
Quoting Dataman's post above: "This thread asks the wrong question. [...] Cheers and crunch on."
I agree wholeheartedly. I had some peripheral experience with a Stratasys system with redundant everything, including backup power. Very expensive, but necessary for the environment in which it was used. Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
|
|