World Community Grid - View Thread - No Tasks Available

Here is the post describing the outage: https://secure.worldcommunitygrid.org/forums/wcg/viewthread_thread,35980

As for the comments regarding 24x7 support, etc.: all programs and systems should be evaluated and reviewed against the level of uptime that is desired. As you go from a target of 95% -> 98.5% -> 99% -> 99.9% -> 99.999%, the cost rises significantly with each step. IBM provides us with a solid budget to run this program. Within that budget, we have to decide how much to spend on redundancy for server infrastructure, how much to spend on manpower for support and responding to incidents, how much to spend onboarding more research projects, how much to spend developing the website, and how much to spend responding to emails, forums, social media, etc.

Our target for uptime at the application layer for World Community Grid is 99.0%. This means that each year our goal (excluding planned maintenance) is to be available 8,672 out of 8,760 hours. We have usually been closer to 99.5%. Early this year we had 3 incidents that caused some extended downtime, and we are unfortunately going to be close to 99.0% this year. Note that the hosting infrastructure has a 24x7 staff and has higher availability targets.
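
For anyone who wants to check the arithmetic behind these targets, here is a rough sketch in Python. The function name and the 365-day year are assumptions made for illustration; they are not taken from any World Community Grid tooling. At 99.0% the exact figure is 8,672.4 available hours, which is quoted above as 8,672.

```python
# Illustrative only: names and the 365-day year are assumptions.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def allowed_downtime_hours(target_percent: float, hours: int = HOURS_PER_YEAR) -> float:
    """Hours of unplanned downtime per year permitted by an uptime target."""
    return hours * (1 - target_percent / 100)


if __name__ == "__main__":
    # The same sequence of targets mentioned in the post.
    for target in (95.0, 98.5, 99.0, 99.9, 99.999):
        down = allowed_downtime_hours(target)
        up = HOURS_PER_YEAR - down
        print(f"{target}% uptime: {up:.1f} hours up, {down:.2f} hours down per year")
```

Each step up the list shrinks the permitted downtime sharply, which is why every additional "nine" tends to cost more in redundancy and staffing.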

We do not like having outages and we work to keep the system at a high level of availability. However, we do feel that the target of 99.0% availability is the right balance for the use of our budget on this project.

For those of you frustrated by having your machines idled for part of this time last night, I encourage you to learn about the ability to control how much work is buffered on your devices. You can instruct the client to store X hours of work on your machine so that you will have a supply of work to run locally during events such as this. For those of you new to us, outages of this duration are quite rare and a buffer of 8-12 hours is the most that I would recommend to store.

As someone who had to deal with a 24/7 operation, I absolutely agree with Kevin. We operated on a 4 hour/year maximum downtime goal, which is about .9995 uptime. This is an expensive operational goal. It involves not only personnel on call but also multiple redundant systems. It is not for the faint of heart nor the light-in-the-wallet crowd. I applaud IBM and WCG for excellent service.
Cheers
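
The 4 hour/year figure above can be converted to an uptime fraction the other way around. This is a minimal sketch, assuming a 365-day year; the function name is made up for illustration.

```python
# Illustrative only: the name and the 365-day year are assumptions.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def uptime_fraction(downtime_hours: float, hours: int = HOURS_PER_YEAR) -> float:
    """Fraction of the year a service is up, given its total downtime in hours."""
    return 1 - downtime_hours / hours


if __name__ == "__main__":
    # 4 hours of downtime per year, as in the post above.
    print(f"{uptime_fraction(4):.4f}")
```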