World Community Grid Forums
Thread Status: Active Total posts in this thread: 51
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
to whoever fixed the problem... well done and thanks!
----------------------------------------[Edit 1 times, last edit by Former Member at Dec 10, 2013 1:54:13 PM] |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
New tasks flowing.....many thanks
||
|
CandymanWCG
Senior Cruncher Romania Joined: Dec 20, 2010 Post Count: 421 Status: Offline |
All's well when it ends well. I got myself a couple of shiny Betas!
----------------------------------------
Knowledge is limited. Imagination encircles the world! - Albert Einstein |
||
|
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline |
Here is the post describing the outage: https://secure.worldcommunitygrid.org/forums/wcg/viewthread_thread,35980
As for the comments regarding 24x7 support, etc.: all programs and systems should be evaluated and reviewed against the level of uptime that is desired. As you go from a target of 95% -> 98.5% -> 99% -> 99.9% -> 99.999%, the cost gets significantly higher with each step. IBM provides us with a solid budget to run this program. Within that budget, we have to decide how much we are going to spend on redundancy for server infrastructure, how much to spend on manpower for support and responding to incidents, how much to spend onboarding more research projects, how much to spend developing the website and how much to spend responding to emails, forums, social media, etc.
Our target for uptime at the application layer for World Community Grid is 99.0%. This means that each year our goal (excluding planned maintenance) is to be available 8,672 out of 8,760 hours. We have usually been closer to 99.5%. Early this year we had 3 incidents that caused some extended downtime, and we are unfortunately going to be close to 99.0% this year. Note that the hosting infrastructure has a 24x7 staff and has higher availability targets.
We do not like having outages and we work to keep the system at a high level of availability. However, we do feel that the target of 99.0% availability is the right balance for the use of our budget on this project.
For those of you frustrated by having your machines idled for part of this time last night, I encourage you to learn about the ability to control how much work is buffered on your devices. You can instruct the client to store X hours of work on your machine so that you will have a supply of work to run locally during events such as this. For those of you new to us, outages of this duration are quite rare, and a buffer of 8-12 hours is the most that I would recommend storing. |
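For anyone who wants to check the arithmetic behind those targets, here is a minimal Python sketch. It only restates the numbers in the post above: a target percentage applied to the 8,760 hours in a non-leap year. The function names `uptime_budget` and `buffer_days` are made up for illustration, and the hours-to-days helper rests on the assumption that your client expresses its work buffer in days rather than hours.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, the figure used in the post


def uptime_budget(target_percent, hours_in_period=HOURS_PER_YEAR):
    """Return (available_hours, allowed_downtime_hours) for an uptime target."""
    available = hours_in_period * target_percent / 100
    return available, hours_in_period - available


def buffer_days(hours):
    """Convert a work buffer given in hours to days.

    Assumption: the client setting takes days, so 8-12 hours must be converted.
    """
    return hours / 24


if __name__ == "__main__":
    # The uptime targets listed in the post
    for target in (95.0, 98.5, 99.0, 99.9, 99.999):
        available, downtime = uptime_budget(target)
        print(f"{target:>7}% -> {available:8.1f} h up, {downtime:7.2f} h down per year")
    # The recommended buffer range from the post
    for hours in (8, 12):
        print(f"{hours} h buffer = {buffer_days(hours):.2f} days")
```

At the 99.0% target this works out to roughly 8,672 available hours and a little under 88 hours of allowed unplanned downtime per year, which matches the figure quoted above.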
||
|
CandymanWCG
Senior Cruncher Romania Joined: Dec 20, 2010 Post Count: 421 Status: Offline |
Hi Kevin,
Thank you for the detailed explanation and updates. Indeed, 99% uptime is more than reasonable as it is, and I believe many of us understand the reasons behind it, especially with the facts that you have provided. Not to beat a dead horse here, but is there any chance that some scripts or notifications could be set in place to send some warning to you techs so if the situation calls for it and you don't mind getting out of bed at strange hours, you can quickly fix it? Regarding the cache, I'm sure that there are many of us who, for one reason or another, need to keep a low or zero cache, so it's not really about learning how to use this very nice feature; it's the various factors that prevent us from doing so. Anyway, thanks again for your support and great work!
----------------------------------------[Edit 1 times, last edit by CandymanWCG at Dec 10, 2013 2:50:25 PM] |
||
|
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline |
Quote: "Not to beat a dead horse here, but is there any chance that some scripts or notifications could be set in place to send some warning to you techs so if the situation calls for it and you don't mind getting out of bed at strange hours, you can quickly fix it?"
We do in fact have many scripts and alerts in place. Not sure if you have ever been on support before, but it is very hard to tune monitoring so that it notifies only on real errors and does not send false alerts. In any system like this you get a number of false alerts for every real alert. This means that you actually need someone 'on-call' to respond and determine whether an alert is a real issue or a false positive. We can only ensure that someone will respond to an issue at the next scheduled work interval. Having said that, we frequently check alerts when we see them, day or night, weekend or weekday. Last night just happened to be the convergence of several factors where no one was able to check them until this morning (US time), and the issue started relatively early in the night. |
||
|
CandymanWCG
Senior Cruncher Romania Joined: Dec 20, 2010 Post Count: 421 Status: Offline |
Crystal clear. Many thanks for your reply and patience!
Cheers! |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Whilst I was away fetching an i7-4770 DT (3.4-3.9 GHz) to replace the old Q6600 (2.4 GHz), WCG came back in my absence. The first thing the clients did was load up on Beta, though I'd set the profiles back for each device to not seek them out in preference. Great. It seems that after initially receiving only faah, the FAHV are flowing again too.
||
|
Steve W
Advanced Cruncher Joined: Dec 9, 2005 Post Count: 110 Status: Offline |
Just looking at my machine logs and getting the "Project has no tasks available" again.
I'm hoping that it's just down to someone snaffling all the existing WU and not the feeder dying again. |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
It's hard to quantify the impact of these events, but we know that whoever had work managed to report it and met with a wingman if there was one. There was just no buffer backfill. For now the data says the Tuesday morning validations were 290 years' worth [a record], and the afternoon 225, a differential of -65. Once the machines started receiving work again, when asking again [one of mine having hit a 15 hour back-off], any task completed would have a higher chance of finding a wingman in the initial 12-24 hours... drained-cache machines are in sync on the first jobs they do, so it will be interesting to see what Wednesday morning will bring... more / less? For the moment my PV exploded... from 86 yesterday to 108 now. We'll know in a little while. Place your bets.
|
||