World Community Grid Forums
Category: Retired Forums Forum: Known Issues [read only] Thread: Website Outage and BOINC outage. [Resolved] |
Thread Status: Active Total posts in this thread: 10
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges: |
We experienced an outage that caused the website to be unavailable for about 30 minutes. The website and the forums are now back up. We are working towards getting the backend processes (which include BOINC uploads and downloads) up and running. There is no ETA for this yet. We will let you know as soon as possible.
----------------------------------------Thank you for your patience, -Uplinger [Edit 2 times, last edit by knreed at Sep 5, 2012 5:31:40 AM] |
||
|
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline Project Badges: |
It is going to take a while. During some testing of the SAN performance, a small portion of the GPFS file system was corrupted. We are taking the time to create a current copy of the filesystem now before we attempt any sort of repair activity.
Unfortunately, when you are moving nearly a TB of data around, it takes a lot of time. |
||
|
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline Project Badges: |
At this point we estimate that, with backups, restores and the related recovery work, it will take 12-18 hours before we are able to resume operations. We will advise and update as we go.
|
||
|
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline Project Badges: |
We have completed the backup of the filesystem and have rebuilt it. We are now in the process of copying the data back onto the filesystem.
However, because the filesystem was already corrupted when we copied the data off, we are validating files in several ways. Scans are already running to ensure that no corrupted files are returned to the researchers. We are now working on validating the workunit input files, in order to minimize disruptions in workunit distribution when we resume sending and receiving work. The BOINC client verifies downloaded files and application binaries against signatures and hashes that were computed before the filesystem issue occurred, so any corrupt files we miss will be rejected by the client. However, we are running scans now to ensure that this will affect only a very small number of files and not thousands. |
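For readers curious what checking files against hashes recorded before the incident might look like, here is a minimal sketch. It is not the tooling used on the servers or in the BOINC client: the function names, the `expected_digests` mapping and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def digest_of(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large input files are never loaded whole."""
    hasher = hashlib.new(algorithm)  # algorithm choice is an assumption
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_corrupt_files(expected_digests, root, algorithm="sha256"):
    """Return the names whose file is missing or no longer matches its digest.

    expected_digests is a hypothetical mapping of relative file name to the
    hex digest recorded before the filesystem problem occurred.
    """
    corrupt = []
    for name, expected in sorted(expected_digests.items()):
        path = Path(root) / name
        if not path.is_file() or digest_of(path, algorithm) != expected.lower():
            corrupt.append(name)
    return corrupt
```

Any name returned by `find_corrupt_files` would be a candidate for reloading from a known-good source rather than being sent out to clients.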
||
|
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline Project Badges: |
We have completed the restore back to the filesystems.
We have done a scan of the job input files and we are seeing about 0.2% of input files corrupted. We are going to cancel the impacted jobs and reload them. This may cause some aborts on jobs in progress. Also, for results that are in progress and due less than 24 hours from now, we are moving the report deadline to 24 hours from now. This will minimize issues with jobs that have missed their deadline. You will not see this change on the client, only on the website. Files that have already been uploaded will be examined as they pass through normal validation. |
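The deadline adjustment described above can be sketched as a simple rule: any report deadline that falls earlier than 24 hours from now is moved to exactly 24 hours from now, and later deadlines are left alone. The function and constant names below are hypothetical, and treating already-overdue results the same way is an assumption of this sketch, not a statement of how the server-side change was made.

```python
from datetime import datetime, timedelta

GRACE = timedelta(hours=24)  # the 24-hour window mentioned in the post


def adjusted_deadline(deadline, now, grace=GRACE):
    """Move a deadline that falls before now + grace out to now + grace."""
    earliest_allowed = now + grace
    return earliest_allowed if deadline < earliest_allowed else deadline


now = datetime(2012, 8, 31, 12, 0)
due_soon = now + timedelta(hours=6)    # due within the window: gets moved
due_later = now + timedelta(hours=48)  # due after the window: unchanged
print(adjusted_deadline(due_soon, now))
print(adjusted_deadline(due_later, now))
```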
||
|
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline Project Badges: |
We have re-enabled file uploads and downloads. We are not yet quite ready to allow scheduler requests.
|
||
|
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline Project Badges: |
We are almost ready to start up the scheduler. We are still working on validating hfcc, hcmd2 and faah so we are not going to start distributing work for those, but we will let work get reported and distribute work for our other projects.
|
||
|
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline Project Badges: |
OK, the scheduler is back up and running. You can report completed results and fetch work for projects other than the ones listed above.
----------------------------------------Please note that validation is not yet running. [Edit 1 times, last edit by knreed at Aug 31, 2012 11:56:49 PM] |
||
|
uplinger
Former World Community Grid Tech Joined: May 23, 2005 Post Count: 3952 Status: Offline Project Badges: |
We have restarted work distribution for HFCC, FAAH and HCMD2.
We are still working towards getting all validators back up and running. Thank you for your patience! -Uplinger |
||
|
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline Project Badges: |
We started up the validators for a short period of time, but we are now running our usual daily backend scripts, which stop the validators, for the next hour or so. Following that, the validators will resume running and catch up.
Everything looks to be operational again but we will be watching over the next couple of days. We do expect to see higher invalids and pending verifications over the next several days due to the additional checks we added to the validators in order to catch any corrupt files. These should return to normal levels early next week. Computing for Sustainable Water is out of work at the moment, but once the backend scripts finish, it will resume loading new work and people will be able to get work. We appreciate everyone's patience during this unfortunate event and we appreciate your support now and at all times. |
||
|
|