World Community Grid Forums
Thread Status: Active | Total posts in this thread: 143
KerSamson
Master Cruncher | Switzerland | Joined: Jan 29, 2007 | Post Count: 1673 | Status: Offline
Dear WCG Tech Team,
Thank you very much for your support. I wish you a quieter time for the rest of the summer. Regarding the post-mortem reports, you can contact me directly; I would really appreciate it if we could learn from such outages (by the way, I support customers operating IBM-based IT infrastructure). Cheers, Yves

@TonyEllis wrote: "Qty 118 'Server Aborts' so far and still happening and counting up... these were all downloaded earlier today... waste of bandwidth at both ends..."

I would not be so negative. A server abort is less painful than having to compute WUs that are no longer needed. On my side, I have only about 90 server-aborted WUs out of about 2'000 WUs in progress. That is ultimately negligible (particularly compared to the daily flood of spam e-mails).
KLiK
Master Cruncher | Croatia | Joined: Nov 13, 2006 | Post Count: 3108 | Status: Offline
pvh513 wrote: "Yes, IBM did a great job sponsoring WCG, but let's not lose track of reality here. When WCG was hosting the files in-house, it worked more or less flawlessly for years. Since they moved into the IBM cloud, there have been two major meltdowns in short succession, the likes of which I cannot remember. I call that a big step back in reliability... And don't tell me this is a typical IT problem. It is not. We have a big 4 PB file server at my workplace, and it has been running completely reliably ever since it arrived."

Well, there are really only two possible explanations:
1. WCG staff configured the cloud incorrectly (but then they'll learn for the future)
2. IBM has a bad cloud service

I just hope it's not the latter...
uplinger
Former World Community Grid Tech | Joined: May 23, 2005 | Post Count: 3952 | Status: Offline
Greetings,
On the server aborts and extra copies: that mistake was on me. A few things contributed to it:

1. A script runs every day at about 12 UTC that stops all the BOINC daemons and restarts them. This is done for a few reasons, one being to rotate the daemons around the backend scheduler. When this ran, it caused many results to be marked as invalid. With 4 transitioners running, we can process over 50k results per minute, so many results were incorrectly marked and scheduled to be sent out again.

2. By the time this was caught, the damage was already done. To alleviate the pain, I ran some database updates on the affected results: 'update result set outcome = 1 and validate_state = 1 where server_state = 5 and outcome in (1,6) and validate_state in (2, 4, 5);' However, this caused results to go to an outcome of 0, "unknown" on the website. This was done before we brought everything back up, and then we started the grid again.

3. Many workunits have a limit of 5 results, and with so many resends, some workunits were left with no wiggle room to send additional copies. Basically, if a workunit had already had 3 copies sent, and a 4th and 5th were sent after the outage, neither of those two results could be allowed to fail with an error of any sort.

4. Once the outcome-0 errors were caught, I corrected my database update to reflect these states properly, which validated many results. If a result in a workunit was marked as valid, the update would mark all other in-progress results of that workunit as not needed, i.e. server abort. That way, if a scheduler request happened before your BOINC client started such a result, the client would not start it at all.

To help prevent this in the future, we have added more checks to the restart script for the BOINC daemons. As for the database update that set outcomes to 0, I am doing some testing to see why it ran without an error. Note that the database updates were reviewed before running as well; the mistake was missed by both of us.
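An aside on the quoted UPDATE: it joins its two SET assignments with `and` instead of a comma. MySQL (and SQLite, used below to demonstrate) parses that as a single boolean assignment, `outcome = (1 AND (validate_state = 1))`, which would produce exactly the outcome-0 symptom described. A minimal reproduction with an invented toy schema (column names follow the quoted statement; rows and values are made up for illustration):

```python
import sqlite3

# Toy schema modelled loosely on the quoted UPDATE; not WCG's real database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE result (id INTEGER, server_state INTEGER,"
            " outcome INTEGER, validate_state INTEGER)")
con.execute("INSERT INTO result VALUES (1, 5, 6, 2)")

# Buggy form: 'AND' turns the right-hand side into one boolean expression,
# so outcome is set to (1 AND (validate_state = 1)) and validate_state
# is never assigned at all. No error is raised.
con.execute("UPDATE result SET outcome = 1 AND validate_state = 1"
            " WHERE server_state = 5")
print(con.execute("SELECT outcome, validate_state FROM result").fetchone())
# → (0, 2): outcome forced to 0 ("unknown"), validate_state untouched

# Intended form: comma-separated assignments.
con.execute("UPDATE result SET outcome = 1, validate_state = 1"
            " WHERE server_state = 5")
print(con.execute("SELECT outcome, validate_state FROM result").fetchone())
# → (1, 1)
```

Since the statement is syntactically valid, it runs "without an error", which may be why the review by two people missed it.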
Again, we are sorry about this outage and appreciate your patience with us. Thanks, -Uplinger
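The "wiggle room" constraint in point 3 of the post above can be sketched with a toy check. This is a simplified illustration only, not actual BOINC server code; the cap of 5 comes from the post, while the function and names are invented:

```python
def has_wiggle_room(copies_created, max_total_results=5):
    """A workunit can only replace an errored copy by creating another
    result, which is allowed only while it is under its total-results cap."""
    return copies_created < max_total_results

# 3 copies sent before the outage plus a 4th and 5th resent after it:
# the cap is reached, so none of the outstanding copies may error.
assert has_wiggle_room(3) is True
assert has_wiggle_room(3 + 2) is False
```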
VietOZ
Senior Cruncher | United States | Joined: Apr 8, 2007 | Post Count: 205 | Status: Offline
51 pages of server aborted ... ouch
KerSamson
Master Cruncher | Switzerland | Joined: Jan 29, 2007 | Post Count: 1673 | Status: Offline
Hi Uplinger,
Many thanks for the explanation regarding the "aborted" WUs.

@All: It was mentioned yesterday that the outage was caused not by the "cloud" but by applied system kernel updates. Whatever you read about "cloud computing", do not forget that a cloud infrastructure is nothing more than a (large) collection of servers (hardware, firmware, OS/hypervisors), storage systems (hardware, firmware), FC fabrics (hardware, firmware), networks (hardware, firmware), etc., and each of these hardware and software components requires configuration effort and update activity. Basically, a cloud infrastructure is only as reliable as its components. Redundancy, clustering, etc. can help improve cloud availability; however, for performance, compatibility, and consistency reasons, some updates have to be applied to all components at the same time. In a way, such overall update activities represent a kind of unavoidable "common mode failure".

For my part, I am rather tired of reading so many criticisms of the two outages. I don't know what the business background of the complainers is, and I am not willing to speculate about it. Nevertheless, system administrators managing large IT infrastructures know very well that reliability does not come for free: it is the result of an unstable balance between performance, security, reliability, and, finally, the risk taken when performing changes and applying updates. Despite very careful assessment, problems can occur during an update and, because of the complexity involved, cannot be entirely excluded.

Like every WCG contributor, I appreciate the reliability of the WCG platform. There is no reason to complain about and endlessly criticise the tech team and the main WCG sponsor.

Happy crunching, Yves

[Edit 1 times, last edit by KerSamson at Jul 20, 2017 5:22:45 PM]
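The point that a stack is only as reliable as its serially dependent layers can be made concrete with a quick calculation. The layer names and availability figures below are invented for illustration, not WCG's or IBM's actual numbers:

```python
# Availabilities of serially dependent layers multiply: if any one layer
# is down, the whole service is down, so even "three nines" per layer
# erodes quickly across a full stack.
layers = {"storage": 0.999, "network": 0.999, "hypervisor": 0.999, "app": 0.995}

availability = 1.0
for a in layers.values():
    availability *= a

print(round(availability, 4))                 # → 0.992
print(round((1 - availability) * 8760, 1))    # hours of downtime per year, ~70
```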
jayrope
Cruncher | Joined: Jul 23, 2016 | Post Count: 46 | Status: Offline
I completely agree with Yves. Thanks for pointing this out.
Jozef J
Cruncher | Joined: Sep 24, 2012 | Post Count: 42 | Status: Offline
https://www.worldcommunitygrid.org/ms/viewBoi...y=sentTime&pageNum=77
I have 77 pages in Pending Verification and 19 in Pending Validation. I have watched this all day today, and the numbers only grew...

Project | Points Generated | Results Returned | Total Run Time (y:d:h:m:s) | Badges Earned
Smash Childhood Cancer | 2,988,555 | 12,978 | 2:190:12:26:41 | Sapphire Badge (2 years)

My total run time on SCC has stayed at the same number, 2:190:12:26:41, for a few days now (likewise 9:107:09:19:41 on Ebola). Nothing in my statistics has changed while I have been constantly returning a huge amount of work, so I ask: what is going on??
branjo
Master Cruncher | Slovakia | Joined: Jun 29, 2012 | Post Count: 1892 | Status: Offline
pvh513 wrote: "Yes, IBM did a great job sponsoring WCG, but let's not lose track of reality here. When WCG was hosting the files in-house, it worked more or less flawlessly for years. Since they moved into the IBM cloud, there have been two major meltdowns in short succession, the likes of which I cannot remember. I call that a big step back in reliability... And don't tell me this is a typical IT problem. It is not. We have a big 4 PB file server at my workplace, and it has been running completely reliably ever since it arrived."

You are kidding us, aren't you? If your memory of the really bad outages "when WCG was hosting..." is short, try a search on this forum. They were not that long ago.

Cheers

Crunching@Home since January 13 2000. Shrubbing@Home since January 5 2006
gb009761
Master Cruncher | Scotland | Joined: Apr 6, 2005 | Post Count: 2982 | Status: Offline
Jozef J - yes, unfortunately, patience is the only cure for PV jail WU's. Just think of them as 'cheques going through the clearing system', and that your money (points/credit/time) will be added to your account in due course.
Personally, whenever I see any _1 WU's (for a zero-based science such as SCC), I try to push them to the front of my queue. The thing is, my queue is currently practically full of these WU's, and I suspect others will have similar experiences.

Edit: On one of my computers, I've got 93 SCC WU's; 80 of them are _1 (or higher) copies. My other computer is crunching the last of the _1 (or higher) WU's.

[Edit 1 times, last edit by gb009761 at Jul 20, 2017 8:45:24 PM]
Jozef J
Cruncher | Joined: Sep 24, 2012 | Post Count: 42 | Status: Offline
Branjo, on page 2 of this thread you can find my two posts about the "Scheduled Maint. July 18, 14:00 UTC" outage, where I discuss the WCG failure.
READ the whole "Scheduled Maint. July 18, 14:00 UTC" thread.