World Community Grid - View Thread - Scheduled Maint. July 18, 14:00 UTC, Extended?

World Community Grid Forums

Category: Support

Forum: Website Support

Thread: Scheduled Maint. July 18, 14:00 UTC, Extended?


Thread Status: Active
Total posts in this thread: 143


This topic has been viewed 242865 times and has 142 replies

KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1684
Status: Offline
Project Badges:

5 year badge for Human Proteome Folding - Phase 2

180 day badge for Help Cure Muscular Dystrophy

2 year badge for Discovering Dengue Drugs - Together

5 year badge for Nutritious Rice for the World

90 day badge for The Clean Energy Project

10 year badge for Help Fight Childhood Cancer

2 year badge for Influenza Antiviral Drug Search

20 year badge for Help Cure Muscular Dystrophy - Phase 2

2 year badge for Discovering Dengue Drugs - Together - Phase 2

5 year badge for The Clean Energy Project - Phase 2

5 year badge for Computing for Clean Water

5 year badge for Drug Search for Leishmaniasis

2 year badge for GO Fight Against Malaria

2 year badge for Computing for Sustainable Water

100 year badge for Mapping Cancer Markers

10 year badge for Uncovering Genome Mysteries

50 year badge for Outsmart Ebola Together

5 year badge for FightAIDS@Home - Phase 2

20 year badge for Smash Childhood Cancer

10 year badge for Microbiome Immunity Project

20 year badge for Africa Rainfall Project

50 year badge for OpenPandemics - COVID-19


Re: Scheduled Maint. July 18, 14:00 UTC, extended?

Dear WCG Tech Team,
thank you very much for your support. I wish you a quieter time for the rest of the summer.
Regarding the post mortem reports, you can contact me directly. I would really appreciate it if we could learn from such outages (by the way, I support customers operating IBM-based IT infrastructure).
Cheers,
Yves
---
@TonyEllis

Qty 118 'Server Aborts' so far and still happening and counting up... these were all downloaded earlier today... waste of bandwidth at both ends...

I would not be so negative. A server abort is less painful than having to compute WUs that are no longer needed. On my side, I have only about 90 server-aborted WUs out of about 2,000 WUs in progress. In the end it is negligible (in particular compared to the spam e-mails received daily).

----------------------------------------

Décrypthon team progress - KerSamson's contribution

[Jul 20, 2017 11:23:49 AM]

KLiK
Master Cruncher
Croatia
Joined: Nov 13, 2006
Post Count: 3108
Status: Offline
Project Badges:

2 year badge for Human Proteome Folding - Phase 2

90 day badge for Help Cure Muscular Dystrophy

90 day badge for Discovering Dengue Drugs - Together

1 year badge for Nutritious Rice for the World

2 year badge for Help Fight Childhood Cancer

90 day badge for Influenza Antiviral Drug Search

2 year badge for Help Cure Muscular Dystrophy - Phase 2

2 year badge for The Clean Energy Project - Phase 2

2 year badge for Computing for Clean Water

1 year badge for Drug Search for Leishmaniasis

1 year badge for GO Fight Against Malaria

14 day badge for Computing for Sustainable Water

10 year badge for Outsmart Ebola Together

10 year badge for Smash Childhood Cancer

2 year badge for Africa Rainfall Project


Re: Scheduled Maint. July 18, 14:00 UTC, extended?

pvh513 wrote:

Yes, IBM did a great job sponsoring WCG, but let's not lose track of reality here. When WCG was hosting the files in-house it worked more or less flawlessly for years. Since they have moved into the IBM cloud there have been two major melt-downs in short succession, the likes of which I cannot remember. I call that a big step back in reliability... And don't tell me this is a typical IT problem. It is not. We have a big 4 PB file server at my work place and it has been running completely reliably ever since it arrived.

Well, there are only 2 possibilities:
1. WCG staff configured the cloud incorrectly (but they'll learn in the future)
2. IBM has a bad cloud service

I just hope it's not the latter...

----------------------------------------

oldies:UDgrid.org & PS3 Life@home

non-profit org. Play4Life in Zagreb, Croatia

[Jul 20, 2017 2:42:14 PM]

uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Status: Offline
Project Badges:

10 year badge for Human Proteome Folding

45 day badge for Help Cure Muscular Dystrophy

20 year badge for Nutritious Rice for the World

2 year badge for The Clean Energy Project

5 year badge for Help Fight Childhood Cancer

10 year badge for The Clean Energy Project - Phase 2

10 year badge for Drug Search for Leishmaniasis

20 year badge for GO Fight Against Malaria

50 year badge for Mapping Cancer Markers

50 year badge for Uncovering Genome Mysteries

20 year badge for Outsmart Ebola Together

100 year badge for FightAIDS@Home - Phase 2

50 year badge for Microbiome Immunity Project

10 year badge for Africa Rainfall Project


Re: Scheduled Maint. July 18, 14:00 UTC, extended?

Greetings,

On server aborts and extra copies: that mistake was on me. A few things contributed to it.

1. There is a script that runs every day at about 12 UTC that stops all the BOINC daemons and restarts them. This is done for a few reasons, one of which is to rotate the daemons around the backend scheduler. When this script started, it caused many results to be marked as invalid. With 4 transitioners running, the system can process over 50k results per minute. So, many results were incorrectly marked and scheduled to be sent out again.

2. By the time this was caught running, the damage was already done. To alleviate the pain, I ran some db updates on the affected results: 'update result set outcome = 1 and validate_state = 1 where server_state = 5 and outcome in (1,6) and validate_state in (2, 4, 5);' However, this caused results to go into an outcome of 0, unknown on the website. This was done before we restarted everything, and then we started the grid back up.

3. Since many workunits have a limit of 5 results, and many were resent, this left workunits with no wiggle room to send additional copies. Basically, if a workunit already had 3 copies sent and a 4th and 5th were sent after the outage, neither of those two results was allowed to have an error of any sort.

4. Once the errors for outcome 0 were caught, I corrected my database update to reflect these properly and validated many results. If a workunit was marked as valid, then all other results in progress for that workunit would be marked as not needed, or server aborted. That way, if your BOINC client had not started the result yet, it wouldn't start it if a scheduler request happened first.
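
As a rough illustration of items 3 and 4, here is a minimal Python sketch. It is not WCG's or BOINC's code: the names MAX_TOTAL_RESULTS, Result, spare_copies and flag_unneeded, as well as the state labels, are hypothetical, and the rule "any valid copy makes the remaining copies unneeded" is a simplification of what the post describes.

```python
from dataclasses import dataclass

MAX_TOTAL_RESULTS = 5  # per-workunit cap on copies, as described in item 3


@dataclass
class Result:
    name: str
    state: str  # hypothetical labels: "in_progress", "valid", "error", "server_abort"


def spare_copies(sent_so_far, max_total=MAX_TOTAL_RESULTS):
    """Number of further copies the server could still issue for one workunit."""
    return max(0, max_total - sent_so_far)


def flag_unneeded(results):
    """Once any copy is valid, mark the copies still in progress as server aborted."""
    if not any(r.state == "valid" for r in results):
        return list(results)
    return [
        Result(r.name, "server_abort") if r.state == "in_progress" else r
        for r in results
    ]


# 3 copies sent before the outage leaves room for 2 more; after a 4th and 5th
# are sent, nothing is left to cover an error.
print(spare_copies(3))  # 2
print(spare_copies(5))  # 0

workunit = [
    Result("wu_0", "valid"),
    Result("wu_1", "valid"),
    Result("wu_2", "in_progress"),
]
print(flag_unneeded(workunit)[2].state)  # server_abort
```

In this simplified model, a copy that was downloaded but not yet started is exactly the kind of result that ends up as a server abort on the client side.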

So, to help prevent this in the future, we have added more checks on the restart script for the BOINC daemons. As for the database update that marked results with outcome of 0, I am doing some testing to see why it ran without an error. Note that the database updates were reviewed before running as well; the problem was missed by 2 of us.
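
One possible explanation for the update in item 2 running without an error, offered as an assumption and not as the tech team's finding: in some SQL dialects (MySQL and SQLite, for example), "and" inside a SET clause is parsed as a boolean operator. In that case "outcome = 1 and validate_state = 1" assigns the value of "1 and (validate_state = 1)" to outcome and leaves validate_state untouched. The sketch below reproduces this with Python's built-in sqlite3 module as a stand-in. The thread does not say which database WCG runs, and the table layout and rows here are made up.

```python
import sqlite3


def run_update(set_clause):
    """Apply an UPDATE with the given SET clause to a tiny stand-in table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE result (id INTEGER PRIMARY KEY, server_state INTEGER, "
        "outcome INTEGER, validate_state INTEGER)"
    )
    # Hypothetical rows that match the WHERE clause quoted in the post.
    conn.executemany(
        "INSERT INTO result (server_state, outcome, validate_state) VALUES (?, ?, ?)",
        [(5, 1, 2), (5, 6, 4), (5, 1, 5)],
    )
    conn.execute(
        f"UPDATE result SET {set_clause} "
        "WHERE server_state = 5 AND outcome IN (1, 6) AND validate_state IN (2, 4, 5)"
    )
    rows = conn.execute(
        "SELECT outcome, validate_state FROM result ORDER BY id"
    ).fetchall()
    conn.close()
    return rows


# "and" form, as quoted in the post: outcome becomes 0, validate_state is unchanged.
with_and = run_update("outcome = 1 AND validate_state = 1")
print(with_and)  # [(0, 2), (0, 4), (0, 5)]

# Comma form: both columns are assigned.
with_comma = run_update("outcome = 1, validate_state = 1")
print(with_comma)  # [(1, 1), (1, 1), (1, 1)]
```

Under this assumption the statement is valid SQL, which would be consistent with it running cleanly and passing review while still setting outcome to 0.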

Again, we are sorry about this outage and appreciate your patience with us.

Thanks,
-Uplinger

[Jul 20, 2017 3:53:49 PM]

VietOZ
Senior Cruncher
United States
Joined: Apr 8, 2007
Post Count: 205
Status: Offline
Project Badges:

10 year badge for Human Proteome Folding - Phase 2

180 day badge for Nutritious Rice for the World

180 day badge for The Clean Energy Project - Phase 2

20 year badge for Uncovering Genome Mysteries

100 year badge for Outsmart Ebola Together

100 year badge for Smash Childhood Cancer

20 year badge for Microbiome Immunity Project


Re: Scheduled Maint. July 18, 14:00 UTC, extended?

51 pages of server aborted ... ouch

----------------------------------------

if you are looking for a team please consider XtremeSystem Team
Team website: https://xs4s.org/

[Jul 20, 2017 4:07:04 PM]

KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1684
Status: Offline
Project Badges:


Re: Scheduled Maint. July 18, 14:00 UTC, extended?

Hi Uplinger,
many thanks for the explanation regarding the "aborted" WUs.

@All
It was mentioned yesterday that the outage was not caused by the "cloud" but by applied system kernel updates.
Whatever you read about "cloud computing", you should not forget that a cloud infrastructure is nothing more than a (large) collection of servers (hardware, firmware, OS/hypervisors), storage systems (hardware, firmware), FC fabrics (hardware, firmware), networks (hardware, firmware), etc. Each of these hardware and software components requires configuration effort and update activities. Basically, a cloud infrastructure is no more reliable than its components. Redundancy, clustering, etc. can help to improve cloud availability; however, for performance, compatibility, and consistency reasons, some updates have to be applied to all components at the same time. In a way, such overall update activities represent a kind of unavoidable "common mode failure".
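
The point about components, redundancy and common mode failure can be put into numbers. Here is a minimal Python sketch, assuming independent failures and made-up availability figures (none of them come from WCG or IBM); the function names are hypothetical.

```python
def series_availability(components):
    """A chain works only if every component works: multiply the availabilities."""
    total = 1.0
    for availability in components:
        total *= availability
    return total


def redundant_availability(availability, copies):
    """A redundant group works if at least one of its independent copies works."""
    return 1.0 - (1.0 - availability) ** copies


def with_common_mode(availability, copies, p_common):
    """Redundant group plus a shared event (for example, an update applied to
    all copies at once) that takes every copy down with probability p_common."""
    return (1.0 - p_common) * redundant_availability(availability, copies)


# Five components in a chain, each 99% available: the chain is worse than any one part.
chain = series_availability([0.99] * 5)

# Two redundant copies of one 99% component: much better than a single copy...
pair = redundant_availability(0.99, 2)

# ...but a shared failure mode caps the benefit, however many copies are added.
pair_with_shared_risk = with_common_mode(0.99, 2, 0.001)

print(chain, pair, pair_with_shared_risk)
```

With these illustrative figures, the redundant pair is limited mainly by the shared failure probability and no longer by the individual copies.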
On my side, I am rather tired of reading so much criticism about both outages. I don't know what the professional background of the complainers is and I am not willing to speculate about it. Nevertheless, system administrators managing large IT infrastructures know very well that, unfortunately, IT infrastructure reliability is not a given; it is the result of an unstable balance between performance, security, reliability, and, finally, risk management when performing changes and applying updates.
Despite very careful assessments, problems can occur during an update and, because of the complexity, such problems cannot be ruled out.
Like every WCG contributor, I do appreciate the reliability of the WCG platform. Nonetheless, there is no reason to complain and endlessly criticise the tech team and the main WCG sponsor.
Happy crunching,
Yves

----------------------------------------

Décrypthon team progress - KerSamson's contribution

----------------------------------------
[Edit 1 times, last edit by KerSamson at Jul 20, 2017 5:22:45 PM]

[Jul 20, 2017 4:45:30 PM]

jayrope
Cruncher
Joined: Jul 23, 2016
Post Count: 47
Status: Offline


Re: Scheduled Maint. July 18, 14:00 UTC, extended?

Completely agree with Yves. Thanks for pointing this out.

[Jul 20, 2017 4:58:32 PM]

Jozef J
Cruncher
Joined: Sep 24, 2012
Post Count: 45
Status: Offline
Project Badges:

90 day badge for Human Proteome Folding - Phase 2

90 day badge for Help Fight Childhood Cancer

90 day badge for The Clean Energy Project - Phase 2

180 day badge for Drug Search for Leishmaniasis

180 day badge for GO Fight Against Malaria

10 year badge for Mapping Cancer Markers

180 day badge for Uncovering Genome Mysteries

1 year badge for FightAIDS@Home - Phase 2

1 year badge for Microbiome Immunity Project

1 year badge for Africa Rainfall Project

1 year badge for OpenPandemics - COVID-19


Re: Scheduled Maint. July 18, 14:00 UTC, extended?

https://www.worldcommunitygrid.org/ms/viewBoi...y=sentTime&pageNum=77
I have 77 pages in pending verification, and 19 in pending validation. Today I watched this the whole day and the number only grew...

Project: Smash Childhood Cancer
Points Generated: 2,988,555
Results Returned: 12,978
Total Run Time (y:d:h:m:s): 2:190:12:26:41
Badges Earned: Sapphire Badge (2 years) for Smash Childhood Cancer
The total run time on Smash Childhood Cancer has stayed at the same number, 2:190:12:26:41, for days... also 9:107:09:19:41 on Ebola.
It has been a few days already; nothing has changed in my statistics while I keep sending in a huge amount of work.
So I ask: what is going on?

Do I just have to wait?

----------------------------------------

[Jul 20, 2017 8:08:44 PM]

branjo
Master Cruncher
Slovakia
Joined: Jun 29, 2012
Post Count: 1892
Status: Offline
Project Badges:

1 year badge for Human Proteome Folding - Phase 2

180 day badge for Help Fight Childhood Cancer

180 day badge for Help Cure Muscular Dystrophy - Phase 2

1 year badge for Computing for Clean Water

2 year badge for Drug Search for Leishmaniasis

20 year badge for Mapping Cancer Markers

5 year badge for Uncovering Genome Mysteries

5 year badge for Outsmart Ebola Together

5 year badge for Africa Rainfall Project

5 year badge for OpenPandemics - COVID-19


Re: Scheduled Maint. July 18, 14:00 UTC, extended?

pvh513 wrote:

Yes, IBM did a great job sponsoring WCG, but let's not lose track of reality here. When WCG was hosting the files in-house it worked more or less flawlessly for years. Since they have moved into the IBM cloud there have been two major melt-downs in short succession, the likes of which I cannot remember. I call that a big step back in reliability... And don't tell me this is a typical IT problem. It is not. We have a big 4 PB file server at my work place and it has been running completely reliably ever since it arrived.

You are kidding us, aren't you? If your memory of the really bad outages "when WCG was hosting..." is short, try searching this forum. They were not that long ago.

Cheers

----------------------------------------

Crunching@Home since January 13 2000. Shrubbing@Home since January 5 2006

[Jul 20, 2017 8:11:35 PM]

gb009761
Master Cruncher
Scotland
Joined: Apr 6, 2005
Post Count: 3010
Status: Offline
Project Badges:

90 day badge for Nutritious Rice for the World

1 year badge for Help Cure Muscular Dystrophy - Phase 2

1 year badge for Discovering Dengue Drugs - Together - Phase 2

1 year badge for The Clean Energy Project - Phase 2

180 day badge for Computing for Sustainable Water

2 year badge for Uncovering Genome Mysteries

2 year badge for Outsmart Ebola Together

2 year badge for FightAIDS@Home - Phase 2

2 year badge for Microbiome Immunity Project


Re: Scheduled Maint. July 18, 14:00 UTC, extended?

Jozef J - yes, unfortunately, patience is the only cure for PV jail WU's. Just think of them as 'cheques going through the clearing system', and that your money (points/credit/time) will be added to your account in due course.

Personally, whenever I see any _1 WU's (for a zero based science - such as SCC), I try to push them to the front of my queue. Thing is, currently my queue is practically full of these WU's - and I suspect others will have similar experiences.

Edit: On one of my computers, I've got 93 SCC WU's - 80 of them are _1 (or higher) copies. My other computer is crunching the last of the _1 (or higher) WU's.

----------------------------------------

----------------------------------------
[Edit 1 times, last edit by gb009761 at Jul 20, 2017 8:45:24 PM]

[Jul 20, 2017 8:41:33 PM]

Jozef J
Cruncher
Joined: Sep 24, 2012
Post Count: 45
Status: Offline
Project Badges:


Re: Scheduled Maint. July 18, 14:00 UTC, extended?

Branjo, on page 2 of this thread you can find my two posts about Scheduled Maint. July 18, 14:00 UTC, where I discuss the WCG failure.
READ the whole thread Scheduled Maint. July 18, 14:00 UTC...

Then write a comment here...

----------------------------------------

[Jul 20, 2017 9:08:16 PM]
