World Community Grid - View Thread - NEW Outage on Boinc Server

World Community Grid Forums

Category: Support

Forum: BOINC Agent Support

Thread: NEW Outage on Boinc Server


Thread Status: Active
Total posts in this thread: 26


This topic has been viewed 2650 times and has 25 replies

neebong
Cruncher
Joined: Mar 1, 2006
Post Count: 1
Status: Offline


Re: NEW Outage on Boinc Server

Does anyone have any idea when the server is coming back up again?

[Dec 22, 2006 8:44:09 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: NEW Outage on Boinc Server

Quote (neebong):
Does anyone have any idea when the server is coming back up again?

Your answer is at:
http://www.worldcommunitygrid.org/forums/wcg/viewthread?thread=10703

One of the last posts in that thread says:

"The worst case scenario is that the BOINC servers will be down until Tuesday, December 26, 2006."

Have a nice holiday!

[Dec 22, 2006 8:58:07 PM]

keithhenry
Ace Cruncher
Senile old farts of the world ....uh.....uh..... nevermind
Joined: Nov 18, 2004
Post Count: 18667
Status: Offline


Re: NEW Outage on Boinc Server

Well, as with anything serious that happens with this project, we always seem to have multiple threads going on the matter. Such is life. I joined this project two days after it started. In all that time, the WCG team has set it up, gotten it going, kept it going and improved it, all while boarding new projects and dealing with the gremlins, grimjits and snipes that have found their way into our realm. Now, after two long years, they finally seem to have a pretty serious outage, if only in duration, and smack dab in the middle of the holiday season, no less. Figures, right? I mean, who hasn't tried to take a nice vacation from work only to have long-awaited plans scuttled by a good old snafu back at the office that you just can't avoid getting pulled into? From what I've been able to pick up over the past two years, I'd guess there are probably 8 to 10 folks, 12 at the outside, whom we collectively label the "WCG admins". Just a handful of folks run this whole shebang. Makes sense if you think about it. Non-profits are notorious for operating on a shoestring, and I expect WCG is no different. IBM may provide the facilities, equipment and such, but this is certainly not a revenue source for them.

So I for one would like to take a moment to say thanks to our brave little band of warriors! Thanks for more than two great years of a job well done. No, it hasn't been perfect, but I think they've done quite a job with what they have to work with.

As for the current situation, I'd suggest modifying your BOINC profile, once that's possible, to set "Connect to server every X days" to 2.0. Once things are fixed, you will then get enough work to last two days in case this bugger proves stubborn and comes back for a third visit. When things look back to normal, you can reset this to whatever you normally use. If you don't run BOINC 24x7, you may want to adjust the 2.0 value accordingly. Let BOINC keep crunching on your machines until it's out of work instead of aborting WUs. Once this is behind us, I would not be surprised to see the admins make some temporary adjustment so that our work isn't counted as returned too late and wasted. If you run out of BOINC work, you can reinstall/reactivate UD and run it in the meantime. When BOINC gets back on its feet, let UD finish its current WU before you shut it down and switch back to BOINC. Please check here in the forums when you can, because once BOINC is running again you'll want to make sure all of your completed WUs that are Ready To Report get returned. Let's try to keep lost crunching to a minimum (not to mention avoid delaying quorums any more than necessary).
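As a rough illustration of the cache advice above, here is a back-of-the-envelope sketch in Python. The function `wus_needed` is hypothetical (it is not part of BOINC), and the average WU runtime, CPU count and fraction of the day BOINC actually runs are numbers you would have to estimate yourself. It does not model how the BOINC client itself decides how much work to fetch.

```python
import math


def wus_needed(outage_days: float, cpus: int, avg_wu_hours: float,
               on_fraction: float = 1.0) -> int:
    """Estimate how many work units a host needs on hand to ride out an outage.

    outage_days  -- wall-clock length of the outage you want to cover
    cpus         -- number of CPUs BOINC crunches on
    avg_wu_hours -- your own estimate of the average WU runtime, in hours
    on_fraction  -- share of the day BOINC is running (1.0 means 24x7)
    """
    # Total CPU-hours of crunching that will happen during the outage.
    crunch_hours = outage_days * 24 * on_fraction * cpus
    # Round up, since a WU is downloaded whole even if only partly needed.
    return math.ceil(crunch_hours / avg_wu_hours)


if __name__ == "__main__":
    # Two-day outage, dual-CPU box, WUs averaging 6 hours, running 24x7.
    print(wus_needed(2.0, 2, 6.0))                   # 16
    # Same box, but BOINC only runs about half the day.
    print(wus_needed(2.0, 2, 6.0, on_fraction=0.5))  # 8
```

The second call shows why the value may need adjusting on a machine that isn't running 24x7: the same wall-clock outage uses up fewer WUs.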

Most importantly, hang tough. I've chosen to crunch 100% for WCG for my own reasons. It's a good crew all in all. I've certainly read stories about other projects that seem to have ongoing problems, so things could be a lot worse for us than they are at the moment. What hasn't changed are the reasons each of us came here and started crunching in the first place.


[Dec 23, 2006 2:48:37 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: NEW Outage on Boinc Server

Quote (keithhenry):
Now, after two long years, they finally seem to have a pretty serious outage, if only in duration, and smack dab in the middle of the holiday season, no less. Figures, right? I mean, who hasn't tried to take a nice vacation from work only to have long-awaited plans scuttled by a good old snafu back at the office that you just can't avoid getting pulled into? From what I've been able to pick up over the past two years, I'd guess there are probably 8 to 10 folks, 12 at the outside, whom we collectively label the "WCG admins". Just a handful of folks run this whole shebang. Makes sense if you think about it. Non-profits are notorious for operating on a shoestring, and I expect WCG is no different. IBM may provide the facilities, equipment and such, but this is certainly not a revenue source for them.

So I for one would like to take a moment to say thanks to our brave little band of warriors! Thanks for more than two great years of a job well done. No, it hasn't been perfect, but I think they've done quite a job with what they have to work with.

I second that. We have all on occasion tweaked nelsonc and other admins, but when push comes to shove, we have a d*** (self-edited to save nelsonc the effort) fine bunch of Admins and Techs.

Of course Murphy's Law says that an outage always occurs on a Friday at 5 pm or at the start of a long weekend.

Hardware failure is one of those things that happens when you least expect it and is beyond the control of the Admins and Techs. And judging from the response time, they were D*** fast.

(Tongue-in-cheek mode on)

When this is all over... tell us it wasn't an IBM-Hitachi DeathStar, err, Speedstor that died.

(Tongue-in-cheek mode off)

A WELL DONE to all the Staff at WCG and, at the risk of offending someone, a Merry Christmas to You All. I hope this doesn't ruin any Christmas plans.

[Dec 23, 2006 3:45:34 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: NEW Outage on Boinc Server

I fully agree with Harvey and Keith. The folks at WCG have been first rate at keeping the work going and responding quickly and positively to any problems that have arisen.
We are fortunate that WCG has, in effect, two separate projects in the BOINC and UD Agents so, if one goes down as has happened, the other serves as a fallback and work can continue. BOINC fans may not like the UD agent because the points payoff isn't as big, but the outage is only temporary and running something is better than nothing.
Both agents achieve the same goal, producing results that matter to the research projects we volunteered to work on.

[Dec 23, 2006 4:23:38 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: NEW Outage on Boinc Server

I agree with the above posts.

In fact, I am surprised that anything at all might get done this snowy holiday week!

Even if all the data is lost, I think we all know that 'stuff' happens! :)

Just let us know, that's all we ask.

[Dec 23, 2006 5:44:52 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: NEW Outage on Boinc Server

Hi halfcard,
According to http://www.worldcommunitygrid.org/forums/wcg/viewthread?thread=10703 no data has been lost. But it has to be loaded onto the new hard drive from the tape backup system, which may take time. I suspect that the hosting facility is running on a skeleton crew this snowy Christmas holiday season.

"In the Colorado Rockies
Where the snow is deep and cold
And a man afoot can starve to death
Unless he's brave and bold. . ."

Lawrence


[Dec 23, 2006 7:32:28 AM]

Dotsch
Advanced Cruncher
Joined: Feb 12, 2006
Post Count: 100
Status: Offline


Re: NEW Outage on Boinc Server

Quote:
We are fortunate that WCG has, in effect, two separate projects in the BOINC and UD Agents so, if one goes down as has happened, the other serves as a fallback and work can continue. BOINC fans may not like the UD agent because the points payoff isn't as big, but the outage is only temporary and running something is better than nothing.
Both agents achieve the same goal, producing results that matter to the research projects we volunteered to work on.

It is also quite easy to set up another BOINC project as a backup project: http://boinc-wiki.ath.cx/index.php?title=BOINC_Powered_Backup_Project
To set one up, give the backup project a low resource share, so that the BOINC client gets most of its work from the primary project and falls back to the backup project when the primary project is down.

A list of the different BOINC projects: http://boinc-wiki.ath.cx/index.php?title=Choosing_a_BOINC_Powered_Project

Also, if you want to keep your systems crunching while a project is down, I recommend that everybody set up a backup project or attach to several BOINC projects.
A larger WU cache will also help keep systems crunching while the project is down. The default is 0.1 days; setting the WU cache to 2 or 3 days is a good value.
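BOINC divides CPU time among attached projects roughly in proportion to their resource shares. The Python sketch below is a deliberately simplified model of that idea, to show why a low share keeps a backup project in the background: `cpu_split` is a hypothetical helper, the project names and share values are made up, and the real client also weighs deadlines and other factors when scheduling.

```python
def cpu_split(shares: dict[str, float],
              has_work: dict[str, bool]) -> dict[str, float]:
    """Approximate fraction of CPU time each project gets.

    Simplified model: projects that currently have work split the CPU in
    proportion to their resource shares; projects with no work get nothing.
    """
    active = {name: share for name, share in shares.items()
              if has_work.get(name, False)}
    total = sum(active.values())
    return {name: (active[name] / total if name in active and total > 0 else 0.0)
            for name in shares}


if __name__ == "__main__":
    shares = {"WCG": 100.0, "Backup": 5.0}  # hypothetical share values
    # Both projects up: the backup only gets about 5% of the CPU.
    print(cpu_split(shares, {"WCG": True, "Backup": True}))
    # WCG down: the backup project picks up all of the CPU time.
    print(cpu_split(shares, {"WCG": False, "Backup": True}))
    # {'WCG': 0.0, 'Backup': 1.0}
```

Lowering the backup's share shrinks its slice further while both projects are up, but in this model, as long as the backup has work, it takes over completely when the primary project is down.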

[Dec 23, 2006 10:25:24 AM]

merko
Cruncher
Joined: Jan 29, 2006
Post Count: 3
Status: Offline


Re: NEW Outage on Boinc Server

Hi all, I have been trying to upload several WUs over the past few days, with no luck. Here is a paste from BOINC Manager:
--
12/23/2006 4:30:56 AM||Resuming network activity
12/23/2006 4:30:56 AM|World Community Grid|[file_xfer] Started upload of file faah1097_d098n817_x1MEU_00_2_0
12/23/2006 4:30:56 AM|World Community Grid|[file_xfer] Started upload of file faah1097_d098n817_x1MEU_00_2_1
12/23/2006 4:30:56 AM|Project Neuron|Fetching scheduler list
12/23/2006 4:31:04 AM|World Community Grid|[file_xfer] Temporarily failed upload of faah1097_d098n817_x1MEU_00_2_0: file not found
12/23/2006 4:31:04 AM|World Community Grid|Backing off 51 minutes and 28 seconds on upload of file faah1097_d098n817_x1MEU_00_2_0
12/23/2006 4:31:04 AM|World Community Grid|[file_xfer] Started upload of file faah1097_d099n184_x1MEU_00_1_0
12/23/2006 4:31:05 AM|World Community Grid|[file_xfer] Temporarily failed upload of faah1097_d098n817_x1MEU_00_2_1: file not found
12/23/2006 4:31:05 AM|World Community Grid|Backing off 3 hours, 29 minutes and 47 seconds on upload of file faah1097_d098n817_x1MEU_00_2_1
12/23/2006 4:31:05 AM|World Community Grid|[file_xfer] Started upload of file faah1097_d099n184_x1MEU_00_1_1
12/23/2006 4:31:08 AM|World Community Grid|[file_xfer] Temporarily failed upload of faah1097_d099n184_x1MEU_00_1_0: file not found
12/23/2006 4:31:08 AM|World Community Grid|Backing off 52 minutes and 54 seconds on upload of file faah1097_d099n184_x1MEU_00_1_0
12/23/2006 4:31:08 AM|World Community Grid|[file_xfer] Temporarily failed upload of faah1097_d099n184_x1MEU_00_1_1: file not found
12/23/2006 4:31:08 AM|World Community Grid|Backing off 19 minutes and 58 seconds on upload of file faah1097_d099n184_x1MEU_00_1_1
12/23/2006 4:31:18 AM||Project communication failed: attempting access to reference site
12/23/2006 4:31:21 AM||Access to reference site succeeded - project servers may be temporarily down.
12/23/2006 4:31:23 AM|Project Neuron|Scheduler list fetch failed: system connect
12/23/2006 4:31:23 AM|Project Neuron|6 consecutive failures fetching scheduler list - deferring 86400 seconds
12/23/2006 4:31:23 AM|Project Neuron|Deferring scheduler requests for 1 days, 0 hours, 0 minutes and 0 seconds
12/23/2006 4:31:24 AM|World Community Grid|[file_xfer] Started upload of file faah1097_d098n817_x1MEU_00_2_0
12/23/2006 4:31:25 AM|World Community Grid|[file_xfer] Started upload of file faah1097_d098n817_x1MEU_00_2_1
12/23/2006 4:31:27 AM|World Community Grid|[file_xfer] Temporarily failed upload of faah1097_d098n817_x1MEU_00_2_0: file not found
12/23/2006 4:31:27 AM|World Community Grid|Backing off 1 hours, 10 minutes and 12 seconds on upload of file faah1097_d098n817_x1MEU_00_2_0
12/23/2006 4:31:28 AM|World Community Grid|[file_xfer] Temporarily failed upload of faah1097_d098n817_x1MEU_00_2_1: file not found
12/23/2006 4:31:28 AM|World Community Grid|Backing off 2 hours, 0 minutes and 26 seconds on upload of file faah1097_d098n817_x1MEU_00_2_1
12/23/2006 4:31:28 AM|World Community Grid|[file_xfer] Started upload of file faah1097_d099n184_x1MEU_00_1_0
12/23/2006 4:31:30 AM|NanoHive@Home|Sending scheduler request: Requested by user
12/23/2006 4:31:30 AM|NanoHive@Home|(not requesting new work or reporting completed tasks)
12/23/2006 4:31:32 AM|World Community Grid|[file_xfer] Temporarily failed upload of faah1097_d099n184_x1MEU_00_1_0: file not found
12/23/2006 4:31:32 AM|World Community Grid|Backing off 2 hours, 39 minutes and 8 seconds on upload of file faah1097_d099n184_x1MEU_00_1_0
12/23/2006 4:31:32 AM|World Community Grid|[file_xfer] Started upload of file faah1097_d099n184_x1MEU_00_1_1
12/23/2006 4:31:34 AM|World Community Grid|[file_xfer] Started upload of file faah1097_d098n817_x1MEU_00_2_0
12/23/2006 4:31:34 AM|World Community Grid|[file_xfer] Temporarily failed upload of faah1097_d099n184_x1MEU_00_1_1: file not found
12/23/2006 4:31:34 AM|World Community Grid|Backing off 8 minutes and 44 seconds on upload of file faah1097_d099n184_x1MEU_00_1_1
12/23/2006 4:31:34 AM|NanoHive@Home|Scheduler RPC succeeded
12/23/2006 4:31:34 AM|NanoHive@Home|Message from server: Project is temporarily shut down for maintenance
12/23/2006 4:31:34 AM|NanoHive@Home|Deferring scheduler requests for 1 hours, 0 minutes and 0 seconds
12/23/2006 4:31:34 AM|NanoHive@Home|Project is down
12/23/2006 4:31:37 AM|World Community Grid|[file_xfer] Temporarily failed upload of faah1097_d098n817_x1MEU_00_2_0: file not found
12/23/2006 4:31:37 AM|World Community Grid|Backing off 53 minutes and 13 seconds on upload of file faah1097_d098n817_x1MEU_00_2_0
12/23/2006 4:31:38 AM|World Community Grid|[file_xfer] Started upload of file faah1097_d098n817_x1MEU_00_2_1
12/23/2006 4:31:39 AM|World Community Grid|[file_xfer] Started upload of file faah1097_d099n184_x1MEU_00_1_0
12/23/2006 4:31:41 AM|World Community Grid|[file_xfer] Temporarily failed upload of faah1097_d098n817_x1MEU_00_2_1: file not found
12/23/2006 4:31:41 AM|World Community Grid|Backing off 3 hours, 45 minutes and 8 seconds on upload of file faah1097_d098n817_x1MEU_00_2_1
12/23/2006 4:31:41 AM|World Community Grid|[file_xfer] Temporarily failed upload of faah1097_d099n184_x1MEU_00_1_0: file not found
12/23/2006 4:31:41 AM|World Community Grid|Backing off 3 hours, 45 minutes and 48 seconds on upload of file faah1097_d099n184_x1MEU_00_1_0
12/23/2006 4:31:55 AM|World Community Grid|Sending scheduler request: Requested by user
12/23/2006 4:31:55 AM|World Community Grid|Reporting 2 tasks
12/23/2006 4:32:00 AM|World Community Grid|Scheduler request failed: HTTP file not found
12/23/2006 4:32:00 AM|World Community Grid|Deferring scheduler requests for 16 minutes and 27 seconds
12/23/2006 4:32:06 AM|Spinhenge@home|Sending scheduler request: Requested by user
12/23/2006 4:32:06 AM|Spinhenge@home|(not requesting new work or reporting completed tasks)
12/23/2006 4:32:11 AM|Spinhenge@home|Scheduler RPC succeeded [server version 507]
12/23/2006 4:32:11 AM|Spinhenge@home|Deferring scheduler requests for 5 minutes and 3 seconds
--

Mark Reiss


[Dec 23, 2006 10:33:54 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: NEW Outage on Boinc Server

Hello merko,
The problem with the World Community Grid BOINC server is explained in 'Known Issues': http://www.worldcommunitygrid.org/forums/wcg/viewthread?thread=10703

Lawrence

[Dec 23, 2006 11:18:37 AM]
