Thread Status: Active
Total posts in this thread: 44
Posts: 44   Pages: 5   [ Previous Page | 1 2 3 4 5 | Next Page ]
Previous Thread This topic has been viewed 42354 times and has 43 replies Next Thread
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Major Outage, 15 Aug 2010

It would appear CEP2 WUs were affected by the outage, as none of my CEP2 WUs were able to upload, nor was I able to receive additional CEP2 WUs.

Though these result files go, by exception, to Harvard directly, the scheduler that manages this is still inside the WCG daemon, as are task downloads, i.e. nothing at all went to completion for probably about half a day.
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
[Aug 16, 2010 2:55:41 PM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: Major Outage, 15 Aug 2010

> But Rick did not mention CEP2 in his post so far.
No CEP2, just c4cw Beta and DDDDDDDT2
> Every time I have watched a retry BOINC was starting to transfer a few files as I described, and then it backed off the whole list without trying for every file in the list.
At least after a while, each file in my queues had its own retry time, and was retried individually.
The upload progress indications advanced at rates that corresponded to real data transfers happening. I could not correlate BOINC upload activity with my modem/router's activity LEDs due to other Net traffic.
BOINC 6.2.19, AFAIK the same as official WCG BOINC 6.2.28 but without the WCG logos etc. (OT: I tried a later version of BOINC once and hated it. They had removed the display of the no of seconds of work being fetched - something which gives me a useful feel for work cache behaviour. Dumbed down, like Windows ME).

[Edit]: Just noted Ingleside's post re project-wide backoff times in later versions of BOINC. I guess that's the reason for the difference between JmBoullier's and my observations.
----------------------------------------
[Edit 1 times, last edit by Rickjb at Aug 16, 2010 3:38:20 PM]
[Aug 16, 2010 3:04:33 PM]
JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3716
Status: Offline
Re: Major Outage, 15 Aug 2010

> v6.10.xx also has a new project-wide backoff, so with 3 transfer errors in a row to the same project (**), the whole project will get a random backoff, and this again will be between 1 minute and 4 hours. This is so that fast, multi-core computers, which can have many file transfers pending when there are problems, won't be continuously trying a different transfer as the individual file backoffs time out. While a project has a project-wide backoff, any new uploads won't be tried immediately, but will wait for the project-wide backoff to count down.
OK, this explains why I was not seeing the same things as Rickjb: all my devices are 6.10.xx currently and Rick's ones are probably not.
Which is not a problem per se, but supporting members with so many different versions around, with almost as many variable behaviors, begins to be very challenging...
Anyway, that's what keeps life funny.
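
For anyone who wants to see the quoted project-wide backoff rule spelled out, here is a toy Python sketch. It is not BOINC's actual code: the class name, method names, and the choice to reset the error counter once a backoff starts are all assumptions made for illustration. Only the "3 errors in a row" trigger and the "1 minute to 4 hours" random range come from the quoted description.

```python
import random

MIN_BACKOFF_S = 60            # 1 minute, per the quoted description
MAX_BACKOFF_S = 4 * 60 * 60   # 4 hours, per the quoted description
ERRORS_BEFORE_PROJECT_BACKOFF = 3


class ProjectTransfers:
    """Toy model of a project-wide transfer backoff (hypothetical, not BOINC code)."""

    def __init__(self, rng=None):
        self.rng = rng or random.Random()
        self.consecutive_errors = 0
        self.backoff_until = 0.0  # project-wide deadline; 0 means no backoff

    def record_result(self, ok, now):
        """Record one transfer outcome at time `now` (in seconds)."""
        if ok:
            self.consecutive_errors = 0
            return
        self.consecutive_errors += 1
        if self.consecutive_errors >= ERRORS_BEFORE_PROJECT_BACKOFF:
            # Whole project backs off for a random delay in the stated range.
            delay = self.rng.uniform(MIN_BACKOFF_S, MAX_BACKOFF_S)
            self.backoff_until = now + delay
            self.consecutive_errors = 0  # assumption: counter restarts

    def may_transfer(self, now):
        """New uploads wait until the project-wide backoff has counted down."""
        return now >= self.backoff_until
```

In this model, a client with many queued files stops trying any of them once three transfers in a row have failed, which would match the difference between what Rickjb saw on 6.2.x (each file retried individually) and what I saw on 6.10.xx.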

Edit: Rickjb has provided info on his versions while I was writing my post.
----------------------------------------
Team--> Decrypthon -->Statistics/Join -->Thread
----------------------------------------
[Edit 1 times, last edit by JmBoullier at Aug 16, 2010 3:15:21 PM]
[Aug 16, 2010 3:11:42 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Major Outage, 15 Aug 2010

> (OT: I tried a later version of BOINC once and hated it. They had removed the display of the no of seconds of work being fetched - something which gives me a useful feel for work cache behaviour. Dumbed down, like Windows ME).

Set some log flags and it will be back. The default log was just 'MEnimalized'

828 World Community Grid 14-08-2010 15:10:16 [sched_op_debug] Starting scheduler request
829 World Community Grid 14-08-2010 15:10:16 Sending scheduler request: To fetch work.
830 World Community Grid 14-08-2010 15:10:16 Requesting new tasks
831 World Community Grid 14-08-2010 15:10:16 [sched_op_debug] CPU work request: 62467.89 seconds; 0.00 CPUs
832 World Community Grid 14-08-2010 15:10:19 Scheduler request completed: got 1 new tasks
833 World Community Grid 14-08-2010 15:10:19 [sched_op_debug] Server version 601
834 World Community Grid 14-08-2010 15:10:19 Project requested delay of 11 seconds
835 World Community Grid 14-08-2010 15:10:19 [sched_op_debug] estimated total CPU job duration: 14356 seconds
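
If you want to track the requested seconds over time rather than eyeball them, a small script can pull the figure out of saved log lines. This is only a sketch that assumes the line format shown in the excerpt above; the function name and regular expression are mine, and other BOINC versions may word the message differently.

```python
import re

# Matches the "[sched_op_debug] CPU work request: N seconds" part of a line,
# as it appears in the log excerpt above (format assumed from that excerpt).
WORK_REQUEST = re.compile(
    r"\[sched_op_debug\] CPU work request: (?P<seconds>\d+(?:\.\d+)?) seconds"
)


def work_request_seconds(log_lines):
    """Return the seconds of work requested in each matching log line."""
    return [
        float(m.group("seconds"))
        for line in log_lines
        if (m := WORK_REQUEST.search(line))
    ]
```

Fed the eight lines above, it would pick up only the 62467.89 seconds from message 831, i.e. a request for roughly 17 hours of work.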
----------------------------------------
[Edit 1 times, last edit by Sekerob at Aug 16, 2010 3:18:09 PM]
[Aug 16, 2010 3:17:02 PM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: Major Outage, 15 Aug 2010

I repeat my suggestions:
1). During future WCG outages like this one, I think it would be good if the techs could completely disable incoming file transfers.
Works for all versions of BOINC, with all settings.

2). If the forum is likely to go offline, don't post status reports in there, as we won't be able to read them. If feasible, plug in a different "server" that displays a simple HTML status report page, and put updates there. Maybe you can just plug an ethernet cable feed into knreed's suitably-configured netbook.
Works for all kinds of outages, provided his battery has plenty of kick.
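
To show how little the stand-in "server" in suggestion 2 would need, here is a minimal sketch using only Python's standard library. Everything in it is hypothetical: the update text, the port, and the function and class names are placeholders, and this is not a description of anything WCG actually runs.

```python
import html
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical status updates, newest first; real text would come from the techs.
UPDATES = ["Placeholder: uploads are disabled while we investigate."]


def render_status_page(updates):
    """Build a minimal static HTML page from a list of plain-text updates."""
    items = "".join(f"<li>{html.escape(u)}</li>" for u in updates)
    page = f"<html><body><h1>Status</h1><ul>{items}</ul></body></html>"
    return page.encode("utf-8")


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Every path gets the same page, so none of the normal site is needed.
        body = render_status_page(UPDATES)
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(host="", port=8080):
    """Create (but do not start) the fallback server; the port is illustrative."""
    return HTTPServer((host, port), StatusHandler)
```

Calling `make_server().serve_forever()` on the spare machine would then answer every request with the status page until it is stopped.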

WCG member making suggestions ===>
[Aug 17, 2010 11:43:48 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Major Outage, 15 Aug 2010

Let's see: I've got a crisis, work my butt off on Sunday, and have this flash moment, whilst 100 other thoughts race through my grey matter to find the cause of the collapse, and think: Let's put this message on the forums, the most commonly known medium, so as soon as it comes up, it will be there... and then there was a message on Facebook and on the Berkeley forums, News on Project Outages.

Almost all crashes at WCG are unique and, being IBM, they document and note them and put a postmortem report in place, which will surely include your suggestions.

And now for the relaxation:

http://www.cartoonstock.com/directory/s/stand_in_line.asp

There's one for everyone ;-)
[Aug 17, 2010 12:04:30 PM]
JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3716
Status: Offline
Re: Major Outage, 15 Aug 2010

> 2). If the forum is likely to go offline, don't post status reports in there, as we won't be able to read them.
??? Why this comment?
Anybody can check that the forum was available when knreed put his first post in the Known Issues.
07:xx UTC The whole site falls down
14:17 UTC knreed's first post in the Known Issues (KI)
14:21 UTC knreed confirms in his KI thread that the website and the forum are OK
14:23 UTC Jean Pierre. opens this thread
14:45 UTC nasher answers JP
14:52 UTC onward Several other posts after nasher's
15:08 UTC knreed announces in his KI thread that uploads are back
16:12 UTC knreed announces in his KI thread that everything is working again

It seems obvious to me that knreed did not post in the forum when the forum was likely to fall down, but when he was pretty sure it was working again.
[Aug 17, 2010 2:53:22 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Major Outage, 15 Aug 2010

> If the forum is likely to go offline, don't post status reports in there, as we won't be able to read them
How can you post to the forum if it is going down? knreed is way smarter than you give him credit for.
[Aug 17, 2010 3:10:42 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Major Outage, 15 Aug 2010

Hello WCG

...

"16:12 UTC knreed announces in his KI thread that everything is working again"

- JmBoullier
Community Advisor
[Aug 17, 2010 2:53:22 PM] post
Good job WCG!

Good day
[Aug 17, 2010 3:41:34 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Major Outage, 15 Aug 2010

> Good job WCG!
Nice
[Aug 17, 2010 4:26:06 PM]