World Community Grid Forums
Thread Status: Active. Total posts in this thread: 44
Dataman
Ace Cruncher | Joined: Nov 16, 2004 | Post Count: 4865 | Status: Offline
Wow, what a morning!! Milkyway also went down this morning and GPUGrid (which has huge upload files) finally nailed the lid on the coffin for my farm. All died from communications errors. I booted everything and let BOINC sort it out ... which it finally did.
----------------------------------------
Let's not do this again anytime soon.
sk..
Master Cruncher | Joined: Mar 22, 2007 | Post Count: 2324 | Status: Offline
For most of my systems I keep a day or two of cache, unless they have a limited GPU connected to GPUGrid. Even for those systems, I keep them attached to GPUGrid, WCG and at least one other project, so I saw no real impact on overall contribution.
Downtime is to be expected for any project, even one of this size.
Rickjb
Veteran Cruncher | Australia | Joined: Sep 17, 2006 | Post Count: 666 | Status: Offline
Post-mortem:
----------------------------------------
It's now the day after the outage, and I just found this thread. For some reason, I was unable to access the WCG forums or any more than about 2% of the website during the outage. I just got the "The application is unavailable at this time, please try again later." page, or a blank page, or an SQL error message. Yet it seems others were able to make posts in this thread. ???

[Edit]: I'm bookmarked into http://www.worldcommunitygrid.org/index.jsp

Main reason for this post: Did anyone else check what was happening in the Transfers tab of their BOINC client(s)? With the WCG server off-line, you would expect that the uploads of each file would try to start, fail, and time out. Not this time. There must have been some front-end server still active. The files would upload - you could see the kB/s and their progress - reach 100% transferred, and then fail. For every file, BOINC would back off for a time that increased up to about 2 (?) hours, and then repeat the process. Over and over, with more and more files as they arrived in the queue. The net traffic going into the WCG data centre must have been incredible, but it didn't seem to slow down. Having multiple machines crunching, I suspended network activity in all but the slowest, which I let continue so that I could see when the system was back up.

For any of you whose ISP counts uploads and charges $$ for going over quota, you could be in for a nasty surprise when your next bill arrives.

I know the techs were probably tearing their hair out over other issues, but I suggest that if such an event happens again, the front-end server (or whatever zombie was handling our uploads) gets disabled.

Suggestion #1 was going to be that WCG have an emergency substitute website server available to display a single page on which the techs could post simple status reports. However, it seems that some people could access the forum, so maybe there's no need.
[Edit 2 times, last edit by Rickjb at Aug 16, 2010 1:04:20 PM]
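The cost worry above is easy to put numbers on. The sketch below is a hypothetical model, not BOINC code: it assumes each retry streams the whole file to 100% before failing (as Rickjb observed) and that the backoff doubles from 1 minute up to a 2-hour cap, then estimates how much data one stuck upload could burn over a long outage.

```python
def wasted_upload_bytes(file_size_mb, outage_hours, first_backoff_min=1, cap_min=120):
    """Estimate data sent for ONE stuck upload that streams to 100%
    and then fails, retrying with a doubling backoff (hypothetical model)."""
    t = 0.0                      # minutes elapsed since the outage began
    backoff = first_backoff_min
    sent_mb = 0.0
    attempts = 0
    while t < outage_hours * 60:
        sent_mb += file_size_mb  # whole file goes up, then the server rejects it
        attempts += 1
        t += backoff
        backoff = min(backoff * 2, cap_min)  # grow toward the ~2 h cap Rick saw
    return attempts, sent_mb

# e.g. one 19 MB GPUGrid-sized result file over an 8-hour outage
attempts, mb = wasted_upload_bytes(file_size_mb=19, outage_hours=8)
print(attempts, mb)
```

With these assumed parameters a single 19 MB file gets resent 10 times, i.e. 190 MB up the pipe for nothing - and that is per file, per machine.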
Sekerob
Ace Cruncher | Joined: Jul 24, 2005 | Post Count: 20043 | Status: Offline
No one, to my knowledge, could access the forums. I sent a mail to support and got a reply about 2 hours before the site came back that everything had turned read-only, and as soon as the forums were woken up, there was knreed's Known Issues message; he's probably got a backdoor key.
----------------------------------------
The subs are on Facebook, where several messages passed yesterday on the topic, and on Twitter. The front page actually did load partially here, down to the bottom line where the Facebook/Twitter links appear.

The new 6.10 client has a bandwidth volume setting to protect against a rampage. I've set it to 350MB daily (quad volume). If the client exceeds that, it will not upload again until after midnight. Granted, it's not perfect, as it's a per-client limit, but then those that have a farm will have farm bandwidth too (except one, I think I read recently).
WCG
Please help to make the Forums an enjoyable experience for All!
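The per-client daily limit Sekerob describes behaves roughly like the gate below. This is a hypothetical sketch, not the actual BOINC implementation: it tracks bytes sent since local midnight and refuses further uploads once the daily quota (e.g. 350MB) is spent, until the day rolls over.

```python
from datetime import date

class DailyUploadQuota:
    """Hypothetical per-client gate mimicking the 6.10 daily bandwidth cap."""
    def __init__(self, limit_mb):
        self.limit = limit_mb * 1024 * 1024
        self.day = date.today()
        self.sent = 0

    def _roll_over(self, today):
        if today != self.day:          # past midnight: reset the counter
            self.day, self.sent = today, 0

    def try_send(self, nbytes, today=None):
        """Return True if this upload still fits in today's remaining quota."""
        self._roll_over(today or date.today())
        if self.sent + nbytes > self.limit:
            return False               # over quota: wait until after midnight
        self.sent += nbytes
        return True

quota = DailyUploadQuota(limit_mb=350)
print(quota.try_send(300 * 1024 * 1024))  # fits under 350MB
print(quota.try_send(100 * 1024 * 1024))  # would push the day's total to 400MB
```

As Sekerob notes, a per-client limit like this doesn't cap a whole farm; each host counts only its own traffic.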
JmBoullier
Former Community Advisor | Normandy - France | Joined: Jan 26, 2007 | Post Count: 3716 | Status: Offline
Rick,
----------------------------------------
Nobody was able to access the forum for several hours. knreed's post in the Known Issues was posted at 14:17 UTC, and the whole thing was already down sometime around 7 UTC for me. Then, after the first steps of the repair operations, the website was accessible again, and the forum as a consequence. Then uploads were possible again, and finally reporting and fetching of new tasks.

Regarding unsuccessful uploads and their possible cost: since there was nothing at the other end of the pipe, what you could see in the Transfers tab every time the client retried uploads was a few blocks pushed into the pipe (enough for a small file to reach 100%). Then, upon receiving no acknowledgement at all from the other end, the client entered back-off mode for a random time. Therefore very few kilobytes of data were sent to your ISP during these attempts.
Sekerob
Ace Cruncher | Joined: Jul 24, 2005 | Post Count: 20043 | Status: Offline
Actually Jean, this was different, as noted. From here, the _4 CEP2 files all went up to 100% several times (19MB), as did every other result file going to WCG, for its full size. It's just that the connection did not seem to allow the file to be stored in its designated spot after that... a move from temp to its db archive slot, I suppose.
----------------------------------------
WCG
Please help to make the Forums an enjoyable experience for All!
JmBoullier
Former Community Advisor | Normandy - France | Joined: Jan 26, 2007 | Post Count: 3716 | Status: Offline
If Rick's jobs were CEP2 it may have been different, since the Harvard server could receive files; if that process needs the WCG server's availability to complete, it's a different story, unfortunately.
----------------------------------------
But Rick did not mention CEP2 in his post so far. On the other hand, he mentions something which was not true, at least for a set of files without any CEP2 tasks: "For every file, BOINC would back off for a time that increased up to about 2 (?) hours, and then repeat the process." Every time I watched a retry, BOINC started to transfer a few files, as I described, and then backed off the whole list without trying every file in the list.
RaymondFO
Veteran Cruncher | USA | Joined: Nov 30, 2004 | Post Count: 561 | Status: Offline
It would appear CEP2 WUs were affected by the outage, as none of my CEP2 WUs were able to upload, nor was I able to receive additional CEP2 WUs.
Ingleside
Veteran Cruncher | Norway | Joined: Nov 19, 2005 | Post Count: 974 | Status: Offline
Quoting Rickjb: "Main reason for this post: Did anyone else check what was happening in the Transfers tab of their BOINC client(s)? With the WCG server off-line, you would expect that the uploads of each file would try to start, fail, and time out. Not this time. There must have been some front-end server still active. The files would upload - you could see the kB/s and their progress - reach 100% transferred, and then fail. For every file, BOINC would back off for a time that increased up to about 2 (?) hours, and then repeat the process. Over and over, with more and more files as they arrived in the queue. The net traffic going into the WCG data centre must have been incredible, but it didn't seem to slow down."

Sekerob and JmBoullier have commented on the server side of things, so I'll keep this to how the client handles errors.

In v6.10.xx and older clients, each file transfer (upload or download) has its own random backoff in case of errors; this is between 1 minute and 4 hours, always starting at 1 minute and increasing with further errors.

v6.10.xx also has a new project-wide backoff: after 3 transfer errors in a row to the same project (**), the whole project gets a random backoff, again between 1 minute and 4 hours. This is so that fast, multi-core computers, which can have many file transfers in case of problems, won't continuously try a different transfer as the individual file backoffs time out. While a project has a project-wide backoff, any new uploads won't be tried immediately but will wait for the project-wide backoff to count down.

The next series of BOINC clients currently in development, v6.12.xx, will have an additional change: the 1 minute to 4 hours random backoff is increased to 10 minutes to 12 hours, and the same for the project-wide backoff.

(**) In case a project has multiple upload and/or download servers, all the listed servers will be tried before the project-wide backoff kicks in, so there will be more than 3 errors before you get a project-wide backoff in this instance. AFAIK WCG doesn't have multiple servers, except for the Linux-only CEP2.

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."

[Edit 1 times, last edit by Ingleside at Aug 16, 2010 2:53:18 PM]
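Ingleside's description of the 6.10 client maps onto a small sketch. This is hypothetical code, not BOINC's actual implementation: each transfer keeps its own random backoff within the 1 minute to 4 hours range, growing with its error count, and a third consecutive error to the same project triggers a project-wide random backoff that new uploads must also wait out.

```python
import random

MIN_BACKOFF, MAX_BACKOFF = 60, 4 * 3600   # 1 minute .. 4 hours (v6.10 range)

class ProjectTransfers:
    """Hypothetical model of per-file + project-wide backoff in BOINC 6.10."""
    def __init__(self):
        self.consecutive_errors = 0
        self.project_backoff = 0.0        # seconds; 0 means no project backoff
        self.file_backoff = {}            # filename -> seconds until next try

    def _rand_backoff(self, tries):
        # random backoff that starts near 1 minute and grows with the error count
        upper = min(MAX_BACKOFF, MIN_BACKOFF * 2 ** tries)
        return random.uniform(MIN_BACKOFF, upper)

    def transfer_failed(self, filename, tries):
        self.file_backoff[filename] = self._rand_backoff(tries)
        self.consecutive_errors += 1
        if self.consecutive_errors >= 3:  # 3 errors in a row: back off the project
            self.project_backoff = self._rand_backoff(tries)

    def transfer_ok(self, filename):
        self.file_backoff.pop(filename, None)
        self.consecutive_errors = 0       # any success clears the streak
        self.project_backoff = 0.0

proj = ProjectTransfers()
for i, f in enumerate(["r1_0", "r2_0", "r3_0"]):
    proj.transfer_failed(f, tries=i + 1)
print(proj.project_backoff > 0)   # whole project now waits, not just one file
```

This matches what Sekerob and JmBoullier observed during the outage: after a few failed attempts, the client backed off the whole upload list at once instead of retrying each file individually.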
Sekerob
Ace Cruncher | Joined: Jul 24, 2005 | Post Count: 20043 | Status: Offline
Quoting Jean's post above: "...Every time I have watched a retry BOINC was starting to transfer a few files as I described, and then it backed off the whole list without trying for every file in the list."

Correctomundo. With 6.10.58, both on Linux and W7, the backoff immediately struck for all uploads, which towards service-revival time numbered probably 60-70 in the quad queue. But, as far as I could see, even the 600-750KB files that went to WCG, not Harvard, uploaded to 100% before the backoff kicked in. Backoff counters went all the way to about 3.5 hours... I was bad once, hit "Retry now", and got slapped ;P
WCG
Please help to make the Forums an enjoyable experience for All!