Thread Status: Active
Total posts in this thread: 10
uplinger
Former World Community Grid Tech
Joined: May 23, 2005
Post Count: 3952
Status: Offline
Website Outage and BOINC outage. [Resolved]

We experienced an outage that caused the website to be unavailable for about 30 minutes. The website and the forums are now back up. We are working towards getting the backend processes (which include BOINC uploads and downloads) up and running. There is no ETA for this yet. We will let you know as soon as possible.

Thank you for your patience,
-Uplinger
----------------------------------------
[Edit 2 times, last edit by knreed at Sep 5, 2012 5:31:40 AM]
[Aug 31, 2012 2:26:32 AM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Re: Website Outage and BOINC outage.

It is going to take a while. During some testing of the SAN performance, a small portion of the GPFS file system was corrupted. We are taking the time to create a current copy of the filesystem now before we attempt any sort of repair activity.
Unfortunately, when you are moving nearly a TB of data around, it takes a lot of time.
[Aug 31, 2012 3:53:55 AM]
knreed
Re: Website Outage and BOINC outage.

We are estimating at this point that, with backups, restores and related recovery work, it is going to take us 12-18 hours before we are able to resume operations. We will advise and update as we go.
[Aug 31, 2012 8:15:46 AM]
knreed
Re: Website Outage and BOINC outage.

We have completed the backup of the filesystem and have rebuilt it. We are now in the process of copying the data back onto the filesystem.

However, because the filesystem was corrupted when we copied the data off, we are validating files in several ways. Scans are already running to ensure that no corrupted files are returned to the researchers. We are now working on validating the workunit input files in order to minimize disruptions in workunit distribution when we resume sending and receiving work. The BOINC client verifies downloaded files and application binaries against signatures and hashes that were computed before the file system issue occurred. As a result, any corrupt files that we miss will be rejected by the client. However, we are running scans now to ensure that this will be only a very small number of files and not thousands.
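
For readers curious what checking files against previously recorded hashes can look like, here is a minimal sketch. It is not the actual scan or the BOINC client code: the manifest format, the function names `file_sha256` and `find_corrupt_files`, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupt_files(expected_hashes, root):
    """Return the names whose file is missing or no longer matches its recorded hash.

    expected_hashes is a hypothetical manifest: {relative file name: hex digest}
    recorded before the corruption occurred.
    """
    corrupt = []
    for name, expected in sorted(expected_hashes.items()):
        path = Path(root) / name
        if not path.is_file() or file_sha256(path) != expected:
            corrupt.append(name)
    return corrupt
```

The key point is that the reference hashes must predate the corruption; a hash computed from the restored copy would simply match the damaged file.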
[Aug 31, 2012 5:41:58 PM]
knreed
Re: Website Outage and BOINC outage.

We have completed the restore back to the filesystems.

We have scanned the job input files and are seeing about 0.2% of input files corrupted. We are going to cancel the impacted jobs and reload them. This may cause some aborts of jobs in progress.

Also, for results that are in progress and due less than 24 hours from now, we are moving the report deadline to 24 hours from now. This will minimize issues with jobs that have missed their deadline. You will not see this change in the client, only on the website.
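
As a rough illustration of the deadline rule described above (not the actual server-side change), the adjustment amounts to raising any deadline that falls before "now plus 24 hours" up to that point and leaving later deadlines alone. The function name `extended_deadline` is hypothetical, and treating already-missed deadlines the same way is an assumption.

```python
from datetime import datetime, timedelta

GRACE = timedelta(hours=24)  # the 24-hour window mentioned in the post


def extended_deadline(deadline, now, grace=GRACE):
    """Return the report deadline after applying the 24-hour floor.

    Deadlines earlier than now + grace are moved to now + grace;
    deadlines further out are returned unchanged.
    """
    floor = now + grace
    return floor if deadline < floor else deadline
```

For example, a result due 5 hours from now would become due 24 hours from now, while a result due in 3 days keeps its original deadline.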

Files that have already been uploaded will be examined as they pass through normal validation.
[Aug 31, 2012 10:24:24 PM]
knreed
Re: Website Outage and BOINC outage.

We have re-enabled file uploads and downloads. We are not yet quite ready to allow scheduler requests.
[Aug 31, 2012 11:19:22 PM]
knreed
Re: Website Outage and BOINC outage.

We are almost ready to start up the scheduler. We are still working on validating HFCC, HCMD2 and FAAH, so we are not going to start distributing work for those yet, but we will let work get reported and will distribute work for our other projects.
[Aug 31, 2012 11:46:14 PM]
knreed
Re: Website Outage and BOINC outage.

OK, the scheduler is back up and running. You can report completed results and fetch work for projects other than the ones listed above.


Please note that validation is not yet running.
----------------------------------------
[Edit 1 times, last edit by knreed at Aug 31, 2012 11:56:49 PM]
[Aug 31, 2012 11:56:33 PM]
uplinger
Re: Website Outage and BOINC outage.

We have restarted work distribution for HFCC, FAAH and HCMD2.

We are still working towards getting all validators back up and running.

Thank you for your patience!
-Uplinger
[Sep 1, 2012 1:09:28 AM]
knreed
Re: Website Outage and BOINC outage.

We started up the validators for a short period of time, but we are now running our usual daily backend scripts for the next hour or so, which stops them. After that, the validators will resume running and catch up.

Everything looks to be operational again, but we will be watching over the next couple of days. We do expect to see higher numbers of invalid results and pending verifications over the next several days due to the additional checks we added to the validators in order to catch any corrupt files. These should return to normal levels early next week.

Computing for Sustainable Water is out of work at the moment, but once the backend scripts finish, it will resume loading new work and people will be able to get work.

We appreciate everyone's patience during this unfortunate event and we appreciate your support now and at all times.
[Sep 1, 2012 3:09:27 AM]