Total posts in this thread: 143
This topic has been viewed 172439 times and has 142 replies
KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1673
Status: Offline
Re: Scheduled Maint. July 18, 14:00 UTC, extended?

Dear WCG Tech Team,
thank you very much for your support. I wish you a quieter time for the rest of the summer.
Regarding the post mortem reports, you can contact me directly. I would really appreciate it if we could learn from such outages (by the way, I support customers operating IBM-based IT infrastructure).
Cheers,
Yves
---
@TonyEllis
"Qty 118 'Server Aborts' so far and still happening and counting up... these were all downloaded earlier today... waste of bandwidth at both ends..."
I would not be so negative. A server abort is less painful than having to compute WUs that are no longer needed. On my side, I have only about 90 server-aborted WUs out of about 2'000 WUs in progress. In the end it is negligible (particularly compared to the spam e-mails received daily).
----------------------------------------
[Jul 20, 2017 11:23:49 AM]
KLiK
Master Cruncher
Croatia
Joined: Nov 13, 2006
Post Count: 3108
Status: Offline
Re: Scheduled Maint. July 18, 14:00 UTC, extended?

Yes, IBM did a great job sponsoring WCG, but let's not lose track of reality here. When WCG was hosting the files in-house it worked more or less flawlessly for years. Since they have moved into the IBM cloud there have been two major melt-downs in short succession, the likes of which I cannot remember. I call that a big step back in reliability... And don't tell me this is a typical IT problem. It is not. We have a big 4 PB file server at my work place and it has been running completely reliably ever since it arrived.

Well, there are only 2 possible explanations:
1. WCG staff configured the cloud incorrectly (but they'll learn in the future)
2. IBM has a bad cloud service

I just hope it's not the latter...
----------------------------------------
oldies:UDgrid.org & PS3 Life@home


non-profit org. Play4Life in Zagreb, Croatia
[Jul 20, 2017 2:42:14 PM]
uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Status: Offline
Re: Scheduled Maint. July 18, 14:00 UTC, extended?

Greetings,

On the server aborts and extra copies: that mistake was on me. A few things contributed to it.

1. There is a script that runs every day at about 12 UTC that stops all the BOINC daemons and restarts them. This is done for a few reasons, one of which is to rotate the daemons around the backend scheduler. When this script started, it caused many results to be marked as invalid. With 4 transitioners running, the system can process over 50k results per minute, so many results were incorrectly marked and scheduled to be sent out again.

2. By the time this was caught, the damage was already done. To alleviate the pain, I ran a db update on the affected results: 'update result set outcome = 1 and validate_state = 1 where server_state = 5 and outcome in (1,6) and validate_state in (2, 4, 5);' However, this caused results to go to an outcome of 0, shown as unknown on the website. This was done before we restarted everything and started the grid back up.

3. Since many workunits have a limit of 5 results, and many results were resent, the workunits had no wiggle room left to send additional copies. Basically, if a workunit already had 3 copies sent and a 4th and 5th were sent after the outage, neither of those two results could be allowed to have an error of any sort.

4. Once the errors for outcome 0 were caught, I corrected my database update to reflect these properly and validated many results. If a workunit was marked as valid, then all of its other results still in progress were marked as not needed or server aborted. That way, if your BOINC client had not started the result yet, it would not start it, provided a scheduler request happened first.
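
The wiggle-room problem in point 3 can be sketched as simple bookkeeping. This is a minimal illustration, not the actual scheduler logic: the function name copies_left and the parameter max_total_results are assumptions made for the example, and the limit of 5 is taken from the description above.

```python
def copies_left(copies_sent, max_total_results=5):
    """Return how many more copies of a workunit may still be sent.

    Both names are hypothetical; only the limit of 5 comes from the post.
    """
    return max(0, max_total_results - copies_sent)


before_outage = copies_left(3)      # 3 copies already sent
after_resends = copies_left(3 + 2)  # 4th and 5th copies sent after the outage
print(before_outage, after_resends)
```

Once no copies are left, an error on either of the last two results cannot be covered by sending out another copy.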

So, to help prevent this in the future, we have added more checks to the restart script for the BOINC daemons. As for the database update that marked results with an outcome of 0, I am doing some testing to see why it ran without an error. Note that the database updates were reviewed before running as well; the problem was missed by 2 of us.
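
One possible explanation for why the update ran without an error, offered as a sketch rather than a confirmed diagnosis: in SQL dialects that treat a comparison as a value, 'set outcome = 1 and validate_state = 1' is read as a single assignment, outcome = (1 and (validate_state = 1)), not as two assignments. For rows whose validate_state is 2, 4 or 5 the comparison is false, so outcome becomes 0 and validate_state is left untouched. The example below reproduces this with Python's built-in sqlite3 module. The use of SQLite, the sample rows and the helper name run_update are assumptions for illustration only, since the post does not say which database engine the grid uses.

```python
import sqlite3

# The statement as quoted in the post: "and" makes this ONE assignment.
BUGGY = ("update result set outcome = 1 and validate_state = 1 "
         "where server_state = 5 and outcome in (1,6) "
         "and validate_state in (2, 4, 5)")

# A comma separates two independent assignments.
FIXED = ("update result set outcome = 1, validate_state = 1 "
         "where server_state = 5 and outcome in (1,6) "
         "and validate_state in (2, 4, 5)")


def run_update(statement):
    """Apply statement to a small hypothetical result table; return its rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table result (id integer primary key, "
                 "server_state integer, outcome integer, "
                 "validate_state integer)")
    conn.executemany(
        "insert into result values (?, ?, ?, ?)",
        [(1, 5, 1, 2), (2, 5, 6, 4), (3, 5, 1, 5), (4, 4, 0, 0)],
    )
    conn.execute(statement)
    rows = conn.execute(
        "select id, outcome, validate_state from result order by id"
    ).fetchall()
    conn.close()
    return rows


buggy_rows = run_update(BUGGY)
fixed_rows = run_update(FIXED)
print(buggy_rows)  # matched rows end up with outcome 0, validate_state unchanged
print(fixed_rows)  # matched rows end up with outcome 1 and validate_state 1
```

Under these assumptions the statement is valid SQL, which would explain both the lack of an error and the outcome of 0.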

Again, we are sorry about this outage and appreciate your patience with us.

Thanks,
-Uplinger
[Jul 20, 2017 3:53:49 PM]
VietOZ
Senior Cruncher
United States
Joined: Apr 8, 2007
Post Count: 205
Status: Offline
Re: Scheduled Maint. July 18, 14:00 UTC, extended?

51 pages of server aborted ... ouch
----------------------------------------

if you are looking for a team please consider XtremeSystem Team
Team website: https://xs4s.org/
[Jul 20, 2017 4:07:04 PM]
KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1673
Status: Offline
Re: Scheduled Maint. July 18, 14:00 UTC, extended?

Hi Uplinger,
many thanks for the explanation regarding the "aborted" WUs.

@All
It was mentioned yesterday that the outage was not caused by the "cloud" but by applied system kernel updates.
Whatever you read about "cloud computing", you should not forget that a cloud infrastructure is nothing more than a (large) collection of servers (hardware, firmware, OS/hypervisors), storage systems (hardware, firmware), FC fabrics (hardware, firmware), networks (hardware, firmware), etc. Each of these hardware and software components requires configuration effort and update activities. Basically, a cloud infrastructure is no more reliable than its components. Redundancy, clustering, etc. can help to improve cloud availability; however, for performance, compatibility, and consistency reasons, some updates have to be applied to all components at the same time. In a way, such overall update activities represent a kind of unavoidable "common mode failure".
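
The point about component reliability can be made concrete with the standard availability arithmetic: components that all have to work multiply their availabilities, while redundant copies only fail together if every copy fails. A minimal sketch follows; the function names and the 0.999 figures are illustrative assumptions, not measurements of any real system.

```python
import math


def series(availabilities):
    """Availability of a chain in which every component must be up."""
    return math.prod(availabilities)


def redundant(availability, copies):
    """Availability of independent redundant copies (at least one is up)."""
    return 1 - (1 - availability) ** copies


# Five components in a chain, e.g. server, storage, fabric, network, hypervisor.
chain = series([0.999] * 5)
# Two independent copies of a single component.
pair = redundant(0.999, 2)
print(chain, pair)
```

The chain comes out less available than any single component, and the redundant pair more available. An update applied to all copies at once breaks the independence that redundant() relies on, which is the "common mode failure" case.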
On my side, I am rather tired of reading so much criticism of both outages. I don't know what the professional background of the complainers is, and I am not willing to speculate about it. Nevertheless, system administrators managing large IT infrastructures know very well that, unfortunately, infrastructure reliability cannot be taken for granted; it is the result of a delicate balance between performance, security, reliability and, finally, risk management when performing changes and applying updates.
Despite very careful assessments, problems can occur during an update and, because of the complexity, such problems cannot be ruled out.
Like every WCG contributor, I do appreciate the reliability of the WCG platform. Nonetheless, there is no reason to complain and to endlessly criticise the tech team and the main WCG sponsor.
Happy crunching,
Yves
----------------------------------------
----------------------------------------
[Edit 1 times, last edit by KerSamson at Jul 20, 2017 5:22:45 PM]
[Jul 20, 2017 4:45:30 PM]
jayrope
Cruncher
Joined: Jul 23, 2016
Post Count: 46
Status: Offline
Re: Scheduled Maint. July 18, 14:00 UTC, extended?

Completely agree with Yves. Thanks for pointing this out.
[Jul 20, 2017 4:58:32 PM]
Jozef J
Cruncher
Joined: Sep 24, 2012
Post Count: 42
Status: Offline
Re: Scheduled Maint. July 18, 14:00 UTC, extended?

https://www.worldcommunitygrid.org/ms/viewBoi...y=sentTime&pageNum=77
I have 77 pages in pending verification, and 19 in pending validation. I watched this all day today and the number only grew...

Project: Smash Childhood Cancer
Points Generated: 2,988,555
Results Returned: 12,978
Total Run Time (y:d:h:m:s): 2:190:12:26:41
Badges Earned: Sapphire Badge (2 years) for Smash Childhood Cancer
The total run time on SCHC has stayed at the same number, 2:190:12:26:41, for days... also 9:107:09:19:41 on Ebola.
My statistics have not changed for a few days now, even though I am constantly sending in a huge amount of work.
So I ask: what is going on? Do I just have to wait?
----------------------------------------

[Jul 20, 2017 8:08:44 PM]
branjo
Master Cruncher
Slovakia
Joined: Jun 29, 2012
Post Count: 1892
Status: Offline
Re: Scheduled Maint. July 18, 14:00 UTC, extended?

pvh513 wrote:

Yes, IBM did a great job sponsoring WCG, but let's not lose track of reality here. When WCG was hosting the files in-house it worked more or less flawlessly for years. Since they have moved into the IBM cloud there have been two major melt-downs in short succession, the likes of which I cannot remember. I call that a big step back in reliability... And don't tell me this is a typical IT problem. It is not. We have a big 4 PB file server at my work place and it has been running completely reliably ever since it arrived.


You are kidding us, aren't you? If you have a short memory about really bad outages "when WCG was hosting...", try searching this forum. They were not that long ago.

Cheers
----------------------------------------

Crunching@Home since January 13 2000. Shrubbing@Home since January 5 2006

[Jul 20, 2017 8:11:35 PM]
gb009761
Master Cruncher
Scotland
Joined: Apr 6, 2005
Post Count: 2982
Status: Offline
Re: Scheduled Maint. July 18, 14:00 UTC, extended?

Jozef J - yes, unfortunately, patience is the only cure for PV jail WUs. Just think of them as 'cheques going through the clearing system', and that your money (points/credit/time) will be added to your account in due course.

Personally, whenever I see any _1 WUs (for a zero-based science, such as SCC), I try to push them to the front of my queue. Thing is, currently my queue is practically full of these WUs, and I suspect others will have similar experiences.

Edit: On one of my computers, I've got 93 SCC WUs - 80 of them are _1 (or higher) copies. My other computer is crunching the last of the _1 (or higher) WUs.
----------------------------------------

----------------------------------------
[Edit 1 times, last edit by gb009761 at Jul 20, 2017 8:45:24 PM]
[Jul 20, 2017 8:41:33 PM]
Jozef J
Cruncher
Joined: Sep 24, 2012
Post Count: 42
Status: Offline
Re: Scheduled Maint. July 18, 14:00 UTC, extended?

Branjo, on page 2 of this thread you can find my two posts about Scheduled Maint. July 18, 14:00 UTC, where I discuss the WCG failure.
READ the whole thread Scheduled Maint. July 18, 14:00 UTC, then write a comment here.
----------------------------------------

[Jul 20, 2017 9:08:16 PM]