World Community Grid - View Thread - Server Errors. [ RESOLVED ]


Thread Status: Active
Total posts in this thread: 352


This topic has been viewed 2178558 times and has 351 replies

RCC_Survivor
Veteran Cruncher
USA
Joined: Apr 28, 2007
Post Count: 1337
Status: Offline


Re: Server Errors.

So what do we crunchers do for the meantime: limit our sync-to-WCG to once every 24hrs? That would seem to be a direct solution to regulate the traffic and thereby help ease congestion. Or, do we crunchers have to play it by ear?

andzgrid,
This is a good suggestion, and I would be glad to set my network access in Preferences to a time period when server load is at a minimum.

SekeRob or knreed,
Do you have any server load info that would help determine the hours when server load is heaviest and lightest?

I do not mind running a 2-day queue with network access restricted to a few hours a day. I have lost a lot of WUs since last month and will do whatever it takes to reduce the losses.

Am I wrong or did this problem start after they did some recent software upgrades?

----------------------------------------

Be kinder than necessary, for everyone you meet is fighting some battle.

Please join the team The survivors

Bilateral Renal, Melanoma, and Squamous Cell cancers

[Jul 27, 2012 11:18:17 PM]

BSD
Senior Cruncher
Joined: Apr 27, 2011
Post Count: 224
Status: Offline


Re: Server Errors.

My devices are playing in someone else's yard for a while.

Hope the techs get it all sorted out. They are a busy bunch.

Cheers

[Jul 28, 2012 12:04:22 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Server Errors.

SekeRob or knreed,
Do you have any server load info that would help determine the hours when server load is heaviest and lightest?

I do not mind running a 2-day queue with network access restricted to a few hours a day. I have lost a lot of WUs since last month and will do whatever it takes to reduce the losses.

Am I wrong or did this problem start after they did some recent software upgrades?

Not likely to happen, but in the US of A / GB, where I estimate most Update / Retry Now hammerers to be, just follow the position of the sun and the moon and you have a pretty good idea when the bulk of button operators are not watching for you to sneak in those uploads. Yesterday had 305, day before 455, days before that 1, 10, 14, so I think things in autonomous mode are doing pretty good with the [fuzzy logic] scheme knreed has devised to respond to momentary overloads. Sure enough there are the random back-offs when things are too busy, but so far they all seem to recover after the deferral countdowns have run down, one / two / three times. Keep a fair cache 1.0 days and never a moment dry.

In all this, lost zero tasks.
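
The random back-offs and deferral countdowns mentioned above can be pictured with a minimal sketch like the one below. It is not BOINC's actual client code: the function name `deferral_seconds`, the 60-second base, and the 4-hour cap are all assumptions chosen purely for illustration. The idea is only that each failed attempt lengthens the wait, and that a random component keeps many clients from retrying at the same instant.

```python
import random


def deferral_seconds(failures: int, base: float = 60.0,
                     cap: float = 14400.0, rng=None) -> float:
    """Return a randomized wait (seconds) that grows with consecutive failures.

    Hypothetical helper: base and cap are illustrative values, not BOINC's.
    """
    rng = rng or random.Random()
    # The upper bound doubles with every failure, up to the cap.
    ceiling = min(cap, base * (2 ** failures))
    # Pick a random point in the upper half of the range so that
    # clients that failed together do not all retry together.
    return rng.uniform(ceiling / 2, ceiling)


if __name__ == "__main__":
    for n in range(5):
        print(n, round(deferral_seconds(n), 1))
```

Hammering Update / Retry Now defeats this kind of scheme, because it replaces the spread-out retries with retries clustered at whatever moment people happen to be watching.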

[Jul 28, 2012 7:10:35 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Server Errors.

... and would be glad to set my network access in Preferences to a time period when server load was at a minimum.

Going forward, we first need to take the above manual step, if only as preparation for the next step: WCG-server to WCG-clientMachine M2M (machine-to-machine) communications. The data in a serverStatus (webpage) would facilitate the transition.

[Jul 28, 2012 4:16:43 PM]

RCC_Survivor
Veteran Cruncher
USA
Joined: Apr 28, 2007
Post Count: 1337
Status: Offline


Re: Server Errors.

(Edit - removed reference because of SekeRob's comment "not a good idea to mention".)

There are problems after sundown.
Based on a lack of info I will use SWAG and experiment until I get it right.
Shouldn't have to do this.

I am really surprised that the servers do not have a load balance/control system and the techs have to improvise.

Again, I think this problem started after the server/filesystem upgrades/updates in June.
It is difficult to troubleshoot intermittent problems when multiple changes were made in a short period.
Been there, done that.
When we had a problem it was discussed on the evening news and the next day's newspaper so we met/exceeded our 99.99% up time target because our paycheck was on the line.
If there was an outage we were required to write letters to upper management explaining why it happened and what we would do to prevent it in the future.
So I understand the difficulty and pressure involved in resolving service problems.

----------------------------------------
[Edit 1 times, last edit by RCC_Survivor at Jul 29, 2012 7:14:52 PM]

[Jul 28, 2012 7:33:07 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Server Errors.

Not a good idea to mention that, and it has become largely superfluous, particularly for those who do scheduled networking and crunching, plus a few more reasons [client 7.0.xx] ;), but *no*, this is separate from the upload saturation issue, which is the 1st part of the upload/reporting cycle. The RtR is part 2.

It needs no repeating that something in the upgrade path on the server side kicked this off, and IBM is now working intensely with the likes of Red Hat and hosting support to get to the root of the issue.

[Jul 28, 2012 7:50:04 PM]

RCC_Survivor
Veteran Cruncher
USA
Joined: Apr 28, 2007
Post Count: 1337
Status: Offline


Re: Server Errors.

Changed queue to 2.0 days and limited network access to 00:00-06:00 EDT (UTC-4), and experienced problems at 02:57 and 03:43 on separate PCs.
There may not be a time period that is free from the problem.
I feel I am having a "Popeil moment" and will "set it and forget it" as there are bigger fish to fry.

[Jul 29, 2012 7:33:19 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Server Errors.

Hello RCC_Survivor,

https://secure.worldcommunitygrid.org/forums/...d,33316_offset,180#385932 and https://secure.worldcommunitygrid.org/forums/wcg/viewthread_thread,33481 imply that there is no schedule. They isolate the file system on an as-required basis.

Lawrence

[Jul 29, 2012 10:53:12 PM]

Bearcat
Master Cruncher
USA
Joined: Jan 6, 2007
Post Count: 2803
Status: Offline

Re: Server Errors.

Think it's best to let BOINC do its thing and let the servers tell our computers when to check back. Seems to work itself out sooner or later.

----------------------------------------

Crunching for humanity since 2007!

[Jul 30, 2012 2:42:29 AM]

knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline

Re: Server Errors.

The occurrence of this issue does not appear to correlate with load. As a result, we have not been able to predict when it will occur. We have been doing a lot of data collection this weekend that will allow for further analysis to hopefully find some clues to what is going on.

We have also figured out how to quickly detect that the issue is occurring in some of our backend processes, such as the applications that we use to load new work into BOINC for distribution. We are using this to cause processes that are not volunteer facing to 'back-off' so that the system recovers quickly. We hope that this will significantly reduce the times when you are not able to upload/download work. We put this in place about 2 hours ago and we are watching to see what happens.
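
The 'back-off' of non-volunteer-facing processes described above might look roughly like the sketch below. This is only an illustration under stated assumptions, not the actual WCG implementation: `run_with_backoff`, `issue_detected`, `do_work`, and the delay values are all hypothetical names and numbers. A background job checks a detection signal before each unit of work and pauses, for progressively longer, while the problem is present.

```python
import time
from typing import Callable, Optional


def run_with_backoff(do_work: Callable[[], None],
                     issue_detected: Callable[[], bool],
                     base_delay: float = 30.0,
                     max_delay: float = 600.0,
                     sleep: Callable[[float], None] = time.sleep,
                     max_iterations: Optional[int] = None) -> int:
    """Run do_work repeatedly, pausing while issue_detected() is true.

    Hypothetical sketch: returns how many units of work were completed.
    """
    delay = base_delay
    done = 0
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        iterations += 1
        if issue_detected():
            # Yield to volunteer-facing traffic; wait longer each time
            # the problem is still present, up to max_delay.
            sleep(delay)
            delay = min(delay * 2, max_delay)
            continue
        # System looks healthy again: reset the delay and do one unit.
        delay = base_delay
        do_work()
        done += 1
    return done
```

Passing `sleep` and `max_iterations` as parameters is a design choice made here so the loop can be exercised without real waiting; a production job would more likely run until its work queue is empty.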

[Jul 30, 2012 3:04:47 AM]
