World Community Grid Forums

Category: Support

Forum: Website Support

Thread: Major Outage

Thread Status: Active
Total posts in this thread: 44

This topic has been viewed 42354 times and has 43 replies

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: Major Outage, 15 Aug 2010

> It would appear CEP2 WU's were affected by the outage as none of my CEP2 WU's were able to upload nor was I able to receive additional CEP2 WU's.

Though these result files go, by exception, to Harvard directly, the scheduler that manages this is still inside the WCG daemon, as are task downloads, i.e. nothing at all went to completion for probably about half a day.

----------------------------------------

WCG

Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!

[Aug 16, 2010 2:55:41 PM]

Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline

Re: Major Outage, 15 Aug 2010

> But Rick did not mention CEP2 in his post so far.
No CEP2, just c4cw Beta and DDDDDDDT2
> Every time I have watched a retry BOINC was starting to transfer a few files as I described, and then it backed off the whole list without trying for every file in the list.
At least after a while, each file in my queues had its own retry time, and was retried individually.
The upload progress indications advanced at rates that corresponded to real data transfers happening. I could not correlate BOINC upload activity with my modem/router's activity LEDs due to other Net traffic.
BOINC 6.2.19, AFAIK the same as official WCG BOINC 6.2.28 but without the WCG logos etc. (OT: I tried a later version of BOINC once and hated it. They had removed the display of the no of seconds of work being fetched - something which gives me a useful feel for work cache behaviour. Dumbed down, like Windows ME).

[Edit]: Just noted Ingleside's post re project-wide backoff times in later versions of BOINC. I guess that's the reason for the difference between JmBoullier's and my observations.

----------------------------------------
[Edit 1 times, last edit by Rickjb at Aug 16, 2010 3:38:20 PM]

[Aug 16, 2010 3:04:33 PM]

JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3716
Status: Offline

Re: Major Outage, 15 Aug 2010

> v6.10.xx also has a new project-wide backoff, so with 3 transfer errors in a row to the same project (**), the whole project will get a random backoff, and this again will be between 1 minute and 4 hours. This is so that fast, multi-core computers, which can have many file transfers pending in case of problems, won't be continuously trying a different transfer as the individual file backoffs time out. While a project has a project-wide backoff, any new uploads won't be tried immediately, but will wait for the project-wide backoff to count down.

OK, this explains why I was not seeing the same things as Rickjb: all my devices are currently on 6.10.xx and Rick's are ~~probably~~ not.
Which is not a problem per se, but supporting members with so many different versions around, with almost as many variable behaviors, is becoming very challenging...

Anyway, that's what keeps life funny.

Edit: Rickjb has provided info on his versions while I was writing my post.
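
The project-wide backoff quoted above can be illustrated with a minimal sketch. It is not BOINC source code: the class name `ProjectBackoff`, the uniform random draw, and the rule that a successful transfer resets the error count are all assumptions made for illustration. Only the threshold of 3 consecutive errors and the range of 1 minute to 4 hours come from the quoted description.

```python
import random

# Values taken from the description quoted above.
ERRORS_BEFORE_PROJECT_BACKOFF = 3
MIN_BACKOFF_S = 60           # 1 minute
MAX_BACKOFF_S = 4 * 60 * 60  # 4 hours


class ProjectBackoff:
    """Hypothetical model of a project-wide transfer backoff (not BOINC source)."""

    def __init__(self, rng=None):
        self.rng = rng or random.Random()
        self.consecutive_errors = 0
        self.retry_at = 0.0  # earliest time (in seconds) a transfer may be tried

    def record_transfer(self, succeeded, now):
        if succeeded:
            # Assumption: a successful transfer clears the error streak.
            self.consecutive_errors = 0
            return
        self.consecutive_errors += 1
        if self.consecutive_errors >= ERRORS_BEFORE_PROJECT_BACKOFF:
            # Assumption: the delay is drawn uniformly from the stated range.
            self.retry_at = now + self.rng.uniform(MIN_BACKOFF_S, MAX_BACKOFF_S)

    def may_transfer(self, now):
        # While the project is backed off, no file of that project is tried.
        return now >= self.retry_at


project = ProjectBackoff()
for _ in range(3):
    project.record_transfer(succeeded=False, now=0.0)
print(project.may_transfer(now=0.0))  # False: the whole project is backed off
```

With per-file backoff only, as in the 6.2.x clients discussed here, each file would carry its own retry time instead, which would match the individual retries Rickjb describes.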

----------------------------------------

Team--> Decrypthon -->Statistics/Join -->Thread

----------------------------------------
[Edit 1 times, last edit by JmBoullier at Aug 16, 2010 3:15:21 PM]

[Aug 16, 2010 3:11:42 PM]

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: Major Outage, 15 Aug 2010

> (OT: I tried a later version of BOINC once and hated it. They had removed the display of the no of seconds of work being fetched - something which gives me a useful feel for work cache behaviour. Dumbed down, like Windows ME).

Set some log flags and it will be back. The default log was just 'MEnimalized'

828 World Community Grid 14-08-2010 15:10:16 [sched_op_debug] Starting scheduler request
829 World Community Grid 14-08-2010 15:10:16 Sending scheduler request: To fetch work.
830 World Community Grid 14-08-2010 15:10:16 Requesting new tasks
831 World Community Grid 14-08-2010 15:10:16 [sched_op_debug] CPU work request: 62467.89 seconds; 0.00 CPUs
832 World Community Grid 14-08-2010 15:10:19 Scheduler request completed: got 1 new tasks
833 World Community Grid 14-08-2010 15:10:19 [sched_op_debug] Server version 601
834 World Community Grid 14-08-2010 15:10:19 Project requested delay of 11 seconds
835 World Community Grid 14-08-2010 15:10:19 [sched_op_debug] estimated total CPU job duration: 14356 seconds
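
For anyone who wants to track those figures over time, here is a minimal sketch that pulls the requested and received seconds out of `[sched_op_debug]` lines like the ones above. The helper name `extract_seconds` and the sample string are illustrative only, and the patterns assume the exact wording shown in this log, which may differ between client versions. As far as is known, such flags are set in the `<log_flags>` section of the client's `cc_config.xml`, but verify that against your own client version.

```python
import re

# Three lines copied from the log above, used as sample input.
SAMPLE_LOG = """\
830 World Community Grid 14-08-2010 15:10:16 Requesting new tasks
831 World Community Grid 14-08-2010 15:10:16 [sched_op_debug] CPU work request: 62467.89 seconds; 0.00 CPUs
835 World Community Grid 14-08-2010 15:10:19 [sched_op_debug] estimated total CPU job duration: 14356 seconds
"""

# Patterns assume the message wording seen in this thread.
WORK_REQUEST = re.compile(
    r"\[sched_op_debug\] CPU work request: (\d+(?:\.\d+)?) seconds"
)
JOB_DURATION = re.compile(
    r"\[sched_op_debug\] estimated total CPU job duration: (\d+(?:\.\d+)?) seconds"
)


def extract_seconds(pattern, log_text):
    """Return every seconds value matched by pattern, in log order."""
    return [float(m.group(1)) for m in pattern.finditer(log_text)]


requested = extract_seconds(WORK_REQUEST, SAMPLE_LOG)
received = extract_seconds(JOB_DURATION, SAMPLE_LOG)
print(requested, received)  # [62467.89] [14356.0]
```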

----------------------------------------
[Edit 1 times, last edit by Sekerob at Aug 16, 2010 3:18:09 PM]

[Aug 16, 2010 3:17:02 PM]

Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline


Re: Major Outage, 15 Aug 2010

I repeat my suggestions:
1). During future WCG outages like this one, I think it would be good if the techs could completely disable incoming file transfers.
Works for all versions of BOINC, with all settings.

2). If the forum is likely to go offline, don't post status reports in there, as we won't be able to read them. If feasible, plug in a different "server" that displays a simple HTML status report page, and put updates there. Maybe you can just plug an ethernet cable feed into knreed's suitably-configured netbook.
Works for all kinds of outages, provided his battery has plenty of kick.

WCG member making suggestions ===>

[Aug 17, 2010 11:43:48 AM]

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: Major Outage, 15 Aug 2010

Let's see: I've got a crisis, work my butt off on Sunday, and have this flash moment, whilst 100 other thoughts race through my gray mass to find the cause of the collapse, and think: Let's put this message on the forums, the most commonly known medium, so as soon as it comes up, it will be there... and then there was a message on Facebook and on the Berkeley forums, News on Project Outages.

Almost all crashes at WCG are unique and, this being IBM, they document and note them and put a postmortem report in place, which will surely include your suggestions.

And now for the relaxation:

http://www.cartoonstock.com/directory/s/stand_in_line.asp

There's one for everyone ;-)

[Aug 17, 2010 12:04:30 PM]

JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3716
Status: Offline


Re: Major Outage, 15 Aug 2010

> 2). If the forum is likely to go offline, don't post status reports in there, as we won't be able to read them.

??? Why this comment?
Anybody can check that the forum was available when knreed put his first post in the Known Issues.
07:xx UTC The whole site falls down
14:17 UTC knreed's first post in the Known Issues (KI)
14:21 UTC knreed confirms in his KI thread that the website and the forum are OK
14:23 UTC Jean Pierre opens this thread
14:45 UTC nasher answers JP
14:52 UTC onward Several other posts after nasher's
15:08 UTC knreed announces in his KI thread that uploads are back
16:12 UTC knreed announces in his KI thread that everything is working again

It seems obvious to me that knreed did not post in the forum while it was likely to fall down, but only once he was pretty sure it was working again.

[Aug 17, 2010 2:53:22 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Major Outage, 15 Aug 2010

> If the forum is likely to go offline, don't post status reports in there, as we won't be able to read them

How can you post to the forum if it is going down? knreed is way smarter than you give him credit for.

[Aug 17, 2010 3:10:42 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Major Outage, 15 Aug 2010

Hello WCG

...

"16:12 UTC knreed announces in his KI thread that everything is working again"

- JmBoullier
Community Advisor
[Aug 17, 2010 2:53:22 PM] post

Good job WCG!

Good day

[Aug 17, 2010 3:41:34 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Major Outage, 15 Aug 2010

Good job WCG!

Nice

[Aug 17, 2010 4:26:06 PM]
