World Community Grid - View Thread

World Community Grid Forums

Category: Support

Forum: Website Support

Thread: Major Outage

Thread Status: Active
Total posts in this thread: 44

This topic has been viewed 42352 times and has 43 replies

Dataman
Ace Cruncher
Joined: Nov 16, 2004
Post Count: 4865
Status: Offline

Re: Major Outage

Wow, what a morning!! Milkyway also went down this morning and GPUGrid (which has huge upload files) finally nailed the lid on the coffin for my farm. All died from communications errors. I booted everything and let BOINC sort it out ... which it finally did.

Let's not do this again anytime soon.

----------------------------------------

[Aug 15, 2010 4:37:55 PM]

sk..
Master Cruncher
Joined: Mar 22, 2007
Post Count: 2324
Status: Offline

Re: Major Outage

For most of my systems I keep a day or two of cache, unless they have a limited GPU connected to GPUGrid. Even for those systems, I keep them attached to GPUGrid, WCG and at least one other project, so I saw no real impact on overall contribution.
Downtime is to be expected, for any project, even for a project of this size.

[Aug 15, 2010 6:37:47 PM]

Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline

Re: Major Outage, 15 Aug 2010

Post-mortem:
It's now the day after the outage, and I just found this thread.
For some reason, I was unable to access the WCG forums or any more than about 2% of the website during the outage. I just got the "The application is unavailable at this time, please try again later." page, or a blank page, or an SQL error message. Yet it seems others were able to make posts in this thread. ??? [Edit]: I'm bookmarked into http://www.worldcommunitygrid.org/index.jsp

Main reason for this post:
Did anyone else check what was happening in the Transfers tab of their BOINC client(s)? With the WCG server off-line, you would expect that the uploads of each file would try to start, fail, and time out. Not this time. There must have been some front-end server still active. The files would upload - you could see the kb/s and their progress, reach 100% transferred, and then fail. For every file, BOINC would back off for a time that increased up to about 2 (?) hours, and then repeat the process. Over and over, with more and more files as they arrived in the queue. The Net traffic going into the WCG data centre must have been incredible, but it didn't seem to slow down.

Having multiple machines crunching, I suspended network activity in all but the slowest. I let that continue so that I could see when the system was back up.

For any of you whose ISP counts uploads and charges $$ for over quota, you could be in for a nasty surprise when your next bill arrives.

I know that the techs were probably tearing their hair out over other issues, but I suggest that if such an event happens again, the front-end server, or whatever zombie was handling our uploads, be disabled.

Suggestion #1 was going to be that WCG have an emergency substitute website server available to display a single page on which the techs could post simple status reports. However, it seems that some people could access the forum, so maybe there's no need.

----------------------------------------
[Edit 2 times, last edit by Rickjb at Aug 16, 2010 1:04:20 PM]

[Aug 16, 2010 12:56:21 PM]

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: Major Outage, 15 Aug 2010

No one, to my knowledge, could access the forums. I sent a mail to support and got a reply about 2 hours before the site came back, saying that everything had turned read-only. As soon as the forums were woken up, knreed's KI (Known Issues) message was there; he's probably got a backdoor key.

The substitutes are Facebook and Twitter, where several messages on the topic passed yesterday. The front page actually did load partially here, including its bottom line where the Facebook/Twitter links appear.

The new 6.10 client has a bandwidth volume setting to protect against a rampage. I've set it to 350MB daily (quad volume). If the client exceeds that, it will not upload again until after midnight. Granted, it's not perfect, as it's a per-client limit, but then those who have a farm will have farm bandwidth too (except for one I think I read about recently).
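To make the idea of a per-client daily volume cap concrete, here is a minimal sketch in Python. It is not BOINC's code: the class `DailyTransferCap` and its method names are hypothetical, and the only things taken from the post are the 350MB figure and the "no more uploads until after midnight" behaviour.

```python
from dataclasses import dataclass, field
from datetime import date

MB = 1024 * 1024


@dataclass
class DailyTransferCap:
    """Hypothetical per-client daily transfer budget (illustration only)."""
    limit_bytes: int
    used_bytes: int = 0
    day: date = field(default_factory=date.today)

    def _roll_over(self, today: date) -> None:
        # A new calendar day ("after midnight") resets the counter.
        if today != self.day:
            self.day = today
            self.used_bytes = 0

    def may_transfer(self, today: date) -> bool:
        # Transfers are allowed only while today's usage is under the limit.
        self._roll_over(today)
        return self.used_bytes < self.limit_bytes

    def record(self, nbytes: int, today: date) -> None:
        # Failed uploads that still pushed data count against the budget too.
        self._roll_over(today)
        self.used_bytes += nbytes


# 350MB per day, as in the post; the dates are just example values.
cap = DailyTransferCap(limit_bytes=350 * MB, day=date(2010, 8, 15))
```

Because the counter lives in each client, a farm of N machines could still move up to N times the limit per day, which is the imperfection mentioned above.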

----------------------------------------

WCG

Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!

[Aug 16, 2010 2:01:31 PM]

JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3716
Status: Offline

Re: Major Outage, 15 Aug 2010

Rick,
Nobody was able to access the forum for several hours. knreed's post in Known Issues was made at 14:17 UTC, and the whole thing had already been down since around 7 UTC for me.
Then, after the first steps of the repair operations, the website became accessible again, and the forum as a consequence.
Then uploads became possible again.
And finally reporting and fetching of new tasks.

Regarding unsuccessful uploads and their possible cost:
Since there was nothing at the other end of the pipe, what you could see in the Transfers tab every time the client retried uploads was a few blocks pushed into the pipe (enough for a small file to reach 100%).
Then upon receiving no acknowledgement at all from the other end the client entered back-off mode for a random time.
Therefore very few kilobytes of data were sent to your ISP during these attempts.

----------------------------------------

Team--> Decrypthon -->Statistics/Join -->Thread

[Aug 16, 2010 2:09:34 PM]

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: Major Outage, 15 Aug 2010

Actually Jean, this was different, as noted. From here the _4 CEP2 files (19MB) all went up to 100% several times, and so did every other result file going to WCG, for its full size. It's just that the connection did not seem to allow the file to be stored in its designated spot after that... a move from temp to db archive slot, I suppose.

[Aug 16, 2010 2:18:06 PM]

JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3716
Status: Offline

Re: Major Outage, 15 Aug 2010

If Rick's jobs were CEP2 it may have been different, since the Harvard server could receive files; if it needs the WCG server's availability to complete the process, it's a different story, unfortunately.
But Rick has not mentioned CEP2 in his post so far.

On the other hand he mentions something which was not true, at least for a set of files without any CEP2 tasks:

"For every file, BOINC would back off for a time that increased up to about 2 (?) hours, and then repeat the process."

Every time I watched a retry, BOINC started to transfer a few files as I described, and then it backed off the whole list without trying every file in the list.

[Aug 16, 2010 2:34:28 PM]

RaymondFO
Veteran Cruncher
USA
Joined: Nov 30, 2004
Post Count: 561
Status: Offline

Re: Major Outage, 15 Aug 2010

It would appear CEP2 WUs were affected by the outage, as none of my CEP2 WUs were able to upload, nor was I able to receive additional CEP2 WUs.

[Aug 16, 2010 2:44:11 PM]

Ingleside
Veteran Cruncher
Norway
Joined: Nov 19, 2005
Post Count: 974
Status: Offline

Re: Major Outage, 15 Aug 2010

"Main reason for this post:
Did anyone else check what was happening in the Transfers tab of their BOINC client(s)? With the WCG server off-line, you would expect that the uploads of each file would try to start, fail, and time out. Not this time. There must have been some front-end server still active. The files would upload - you could see the kb/s and their progress, reach 100% transferred, and then fail. For every file, BOINC would back off for a time that increased up to about 2 (?) hours, and then repeat the process. Over and over, with more and more files as they arrived in the queue. The Net traffic going into the WCG data centre must have been incredible, but it didn't seem to slow down."

Sekerob and JmBoullier have commented on the server side of things, so I'll stick to how the client handles errors.

In v6.10.xx and older clients, each file transfer (upload or download) has its own random backoff in case of errors. This is between 1 minute and 4 hours; it always starts at 1 minute and increases with further errors.
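As a rough illustration of a per-file backoff of this kind, here is a minimal Python sketch. Only the 1 minute and 4 hour bounds come from the post. The doubling ceiling and the uniform random draw are assumptions made for the example, and the function name `next_backoff` is hypothetical; the real client's formula may well differ.

```python
import random

MIN_BACKOFF_S = 60          # 1 minute, from the post
MAX_BACKOFF_S = 4 * 3600    # 4 hours, from the post


def next_backoff(consecutive_errors: int,
                 rng: random.Random,
                 min_s: float = MIN_BACKOFF_S,
                 max_s: float = MAX_BACKOFF_S) -> float:
    """Return a randomised delay in seconds after a run of failed transfers.

    Assumption: the ceiling doubles with every consecutive error and the
    delay is drawn uniformly between min_s and that ceiling.
    """
    if consecutive_errors < 1:
        return 0.0  # no error, no backoff
    # First error: ceiling == min_s, so the delay is exactly one minute.
    ceiling = min(max_s, min_s * 2 ** (consecutive_errors - 1))
    return rng.uniform(min_s, ceiling)


# Example: delays for the first few consecutive failures of one file.
rng = random.Random()
delays = [next_backoff(n, rng) for n in range(1, 6)]
```

With these assumptions the ceiling reaches the 4 hour cap at the ninth consecutive error, and each file keeps its own error count.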

v6.10.xx also has a new project-wide backoff: after 3 transfer errors in a row to the same project (**), the whole project gets a random backoff, again between 1 minute and 4 hours. This is so that fast, multi-core computers, which can have many file transfers pending when there are problems, won't be continuously trying a different transfer as the individual file backoffs time out. While a project has a project-wide backoff, any new uploads won't be tried immediately, but will wait for the project-wide backoff to count down.

The next series of BOINC clients, v6.12.xx, currently in development, will have an additional change: instead of the 1 minute to 4 hours random backoff, it's increased to 10 minutes to 12 hours, and the same for the project-wide backoff.

(**) In case a project has multiple upload servers and/or download servers, all the listed servers will be tried before the project-wide backoff kicks in, so there will be more than 3 errors before you get a project-wide backoff in this instance. AFAIK WCG doesn't have multiple servers, except for the Linux-only CEP2.
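The project-wide rule can be sketched in the same spirit. This Python example is not taken from the BOINC source: the class `ProjectBackoff` is hypothetical, the uniform random draw is an assumption, and so is the threshold of 3 errors per listed server (the post only says that with several servers it takes more than 3 errors).

```python
import random
from dataclasses import dataclass


@dataclass
class ProjectBackoff:
    """Hypothetical tracker for a project-wide transfer backoff."""
    num_servers: int = 1           # upload/download servers listed by the project
    errors_per_server: int = 3     # "3 transfer errors in a row", from the post
    min_s: float = 60.0            # 1 minute (v6.10.xx bounds)
    max_s: float = 4 * 3600.0      # 4 hours (v6.10.xx bounds)
    consecutive_errors: int = 0
    backoff_until: float = 0.0     # timestamp in seconds

    @property
    def threshold(self) -> int:
        # Assumption: every listed server gets its own 3 tries first.
        return self.errors_per_server * self.num_servers

    def on_success(self) -> None:
        # Any successful transfer ends the run of errors.
        self.consecutive_errors = 0

    def on_error(self, now: float, rng: random.Random) -> None:
        self.consecutive_errors += 1
        if self.consecutive_errors >= self.threshold:
            # The whole project waits for a random time within the bounds.
            self.backoff_until = now + rng.uniform(self.min_s, self.max_s)

    def may_transfer(self, now: float) -> bool:
        # New uploads wait until the project-wide backoff has counted down.
        return now >= self.backoff_until


# v6.10.xx-style bounds, and the wider bounds described for v6.12.xx.
v610 = ProjectBackoff()
v612 = ProjectBackoff(min_s=600.0, max_s=12 * 3600.0)
```

Keeping the bounds as fields means the change described for v6.12.xx is only a matter of different parameters, not different logic.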

----------------------------------------

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."

----------------------------------------
[Edit 1 times, last edit by Ingleside at Aug 16, 2010 2:53:18 PM]

[Aug 16, 2010 2:48:55 PM]

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: Major Outage, 15 Aug 2010

"On the other hand he mentions something which was not true, at least for a set of files without any CEP2 tasks:

'For every file, BOINC would back off for a time that increased up to about 2 (?) hours, and then repeat the process.'

Every time I have watched a retry BOINC was starting to transfer a few files as I described, and then it backed off the whole list without trying for every file in the list."

Correctomundo, with 6.10.58 on both Linux and W7 the backoff immediately struck for all uploads, which towards service revival time probably neared 60-70 in the quad's queue. But, as far as I could see, even the 600-750KB files that went to WCG, not Harvard, uploaded to 100% before the backoff kicked in. Backoff counters went all the way to about 3.5 hours... I was bad once and hit the retry now and got slapped ;P

[Aug 16, 2010 2:50:52 PM]
