World Community Grid - View Thread - Server Outage March 25 @ 0900 UTC.

World Community Grid Forums

Category: Support

Forum: Website Support

Thread: Server Outage March 25 @ 0900 UTC.

Thread Status: Active
Total posts in this thread: 29

This topic has been viewed 4543 times and has 28 replies

jonnieb-uk
Ace Cruncher
England
Joined: Nov 30, 2011
Post Count: 6105
Status: Offline

Re: Server Outage March 25 @ 0900 UTC.

@ Mark Reiss
I've seen nothing to indicate a problem with the team update at 00:01 Mar 25. The update at noon, which was initially missing but has now been corrected, would have had no impact on team stats. Can you be more specific when you say the team update did not !!! and give an example of the team stats that were not updated in order that Keith Uplinger knows what he's looking for?

[edit] uplinger got his reply in first.

----------------------------------------

To Join follow this link: Join the UK Team All Welcome! UK Team thread

----------------------------------------
[Edit 1 times, last edit by jonnieb-uk at Mar 26, 2015 12:02:21 AM]

[Mar 26, 2015 12:00:49 AM]

[CSF] Thomas Dupont
Veteran Cruncher
Joined: Aug 25, 2013
Post Count: 685
Status: Offline


Re: Server Outage March 25 @ 0900 UTC.

Thanks for letting me know about the stats.

Users "have" to inform the staff about this?!

Strange, the staff should be aware of this.
It's not the first time that I read comments by users who inform about missing/outdated stats (and the staff seems not to be aware of this).

When we notice an anomaly that has not been announced on the WCG forum, it's important to report it.
That does not mean the WCG staff has not already seen it.
We are simply passing on the information.
A server outage can affect many parameters.
Some volunteers (like me and many others who post in this thread) see this as mutual aid.
And the WCG staff is very responsive, I can assure you.
I have participated in many other distributed computing projects in the past, and the WCG staff is very effective and very responsive compared to the others.

BTW, thanks Keith for the stats.

----------------------------------------

CRUNCHERS SANS FRONTIERES
www.crunchersansfrontieres.org

[Mar 26, 2015 8:09:44 AM]

pcwr
Ace Cruncher
England
Joined: Sep 17, 2005
Post Count: 10903
Status: Offline

Re: Server Outage March 25 @ 0900 UTC.

The majority of my returned WUs since the crash have gone to Pending Verification (PVer) status.

OET1 and CEP2.

regards,
Patrick

----------------------------------------

[Mar 26, 2015 9:33:12 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Server Outage March 25 @ 0900 UTC.

On the Linux client, all OET results [a zero-redundancy project] went, without exception, through 'Pending Verification'... about 58 of them, even though not a single error was logged. Then in the afternoon, before the website was restored, it had one task with error 214 [could not unzip task input file], and the PVer continued for another 30 or so, by which point enough PVer validations had accumulated for the host to go single again. The second [Windows] host running OET did not experience this. Both run CEP2 on 1 core and saw only 1 result turn PVer, so that project seems unaffected.

The PVer series kicked off just before the website went belly up [coincidence]. I do not see how the website outage could affect the BOINC side, which kept running, bar a few 3600-second back-offs in the morning when reporting results.

For reference, this Linux host had 162 valid results yesterday, i.e. it is quick to get back the 20-odd consecutive valid results needed to go it alone again.

[Mar 26, 2015 9:51:06 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Server Outage March 25 @ 0900 UTC.

On the 'root cause', which is of course a secret: I did notice a few days ago that the forum had [never seen before] over 3800 guests and just under 40 logged-in members. If all those lines are committed, I could see something giving way.

[Mar 26, 2015 11:37:40 AM]

uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Status: Offline

Re: Server Outage March 25 @ 0900 UTC.

Sek,

We are still trying to get a firm answer as to the root cause of the outage, but from my understanding the load on the database was not the issue. It was a hardware issue. I fear it was bad physical memory in the device. Until we get a solid answer as to what it was, we are still running off a redundant system, and more tests will be run on the problem device today.

As for your OET1 having lots of pending validations: about 12 hours before the main outage, the validator for OET1 was falling behind. This runs off a different server than the one that failed. As of this moment, I do not see anything outstanding. Please let me know if you are still seeing lots of pending validations for OET1.

Thanks,
-Uplinger

[Mar 26, 2015 12:49:46 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Server Outage March 25 @ 0900 UTC.

Kay, it's Pending Verification, meaning an extra copy was being sent out after return of the original _0, by the boatload, same as for pcwr. This occurred -during- the outage.

Hardware failure... well, no number of guest lines could account for that, but it was something odd.

[Mar 26, 2015 3:15:22 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Server Outage March 25 @ 0900 UTC.

Because the forum was performing poorly, I looked at the who's-online page and saw over 1000. Later I looked again and saw 1479, and then it was poof, maintenance page.

Coincidences or more hardware burnouts?

[Mar 30, 2015 8:11:01 PM]

uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Status: Offline

Re: Server Outage March 25 @ 0900 UTC.

Sek,

This was not due to counts on the forum. Earlier in the week we encountered a hardware problem. When that server came back up, it mounted a filesystem that it was not supposed to. This was only recently discovered and needed to be corrected quickly, as it had the potential for data loss. There was NO data loss, but the potential for it forced me to act quickly without scheduling anything. Sorry about the lack of notification. I was hopeful the downtime would only be a few minutes, but it ended up taking longer than I expected.

Thanks,
-Uplinger

[Mar 30, 2015 8:48:17 PM]
