Thread Status: Closed
Total posts in this thread: 196
This topic has been viewed 2334314 times and has 195 replies
Mad_Max
Cruncher
Russia
Joined: Nov 26, 2012
Post Count: 22
Re: Hardware Recovery Update

W.I.P.

Or possibly R.I.P. sad

No, not yet.
For that, you first need to dig a grave.
It looks like that's what they're doing right now. sad
So it really is "W.I.P." indeed.

More seriously: ever since the move from IBM to Krembil in the summer and autumn of last year, it has seemed quite obvious to me that they "bit off more than they can chew". Their team and their infrastructure are simply too weak to properly support the number of users and the computing power that WCG had under IBM's management, even during periods when everything is running relatively normally.

Therefore, there is no point for them in rushing or putting a lot of effort into restoring service as soon as possible when serious problems occur, like the one we see now or those of last year, because the outflow of users that such delays and problems cause is not perceived as something bad. It may even be a desired side effect if some of the "excess" users simply leave for other DC projects.

Of course, they don't want WCG to die completely, so they will try to restore work little by little. But letting it shrink to a more "manageable" size is another matter. They can't say such things directly, in plain text, but there is also non-verbal messaging.
[Mar 22, 2023 9:19:13 PM]
Cosmic Computing
Cruncher
Canada
Joined: Aug 17, 2016
Post Count: 13
Re: Hardware Recovery Update

Hey Krembil,

Based on the article you published, it sounds like WCG is running on a single bare-metal server... Given this is not 2003, what's going on here? Virtualization has been around since 2003-2008 and has been popularized since 2010-2012. Hearing about a single bare-metal server causing any sort of downtime is shocking to me outside of a homelab or a very small business. I'm currently working on a project to double the compute and storage in a data centre using open source hypervisors and storage; these solutions aren't bound to expensive VMware licensing anymore.

Do you guys want a hand planning this? I'm a Solutions Architect in Victoria, BC, Canada.
Cheers,


Triston Line
tmanaok@gmail.com
----------------------------------------
I may be only #890 out of 750,000 but you just wait until I build my datacentre!
Current personal rack: 12 servers, all dual X5650+, 2TB+ cumulative RAM, 60TB+ storage.
[Mar 22, 2023 9:22:24 PM]
jave200372
Cruncher
Joined: Aug 17, 2008
Post Count: 3
Re: Hardware Recovery Update

@Cosmic Computing - I hear ya. One bare metal server is a bit poor. I hope they reach out to you and consider other options.

Also, I'd be willing to donate a small amount whenever I can to help out WCG. If someone's work unit delivers a vital result for humankind, how can that ever be repaid??
[Mar 24, 2023 7:01:12 AM]
Jurisica
World Community Grid Admin, Mapping Cancer Markers and Help Conquer Cancer Scientist
Joined: Feb 28, 2007
Post Count: 88
Re: Hardware Recovery Update

Thank you for the suggestion and the offer.
Indeed, we use VMs, and we do have multiple blades - but we do not (yet) have capacity for redundancy or sufficient capacity for growth.

This is all older equipment - but despite multiple attempts we do not yet have a generous IT vendor or other partner that would give us a much-needed refresh and redundancy. (Suffice it to note that a refresh was planned across academic HPC in Canada last year, but it has not happened yet.)

However, there are two possible leads - *if* they work out, we would be moved several years ahead.


A quick update: the /science filesystem is finally on the move from the recovery storage unit to the new storage. As of last night, after 3 hours, the new storage /science filesystem showed 1.4TB used. Assuming that average rate of file transfer, it will take about 74 hours. Hopefully, we will be able to restart BOINC from the new storage and finally put the failure behind us. We will keep you posted.
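For anyone who wants to follow the arithmetic behind that estimate, here is a minimal sketch. Only the 1.4 TB, 3 hour and 74 hour figures come from the post; the total data size is not stated, so the value computed below is implied by those figures under the assumption that 74 hours is the full duration of the transfer at a constant rate. The function names are hypothetical.

```python
def transfer_rate_tb_per_hour(transferred_tb: float, elapsed_hours: float) -> float:
    """Average transfer rate observed so far, in TB per hour."""
    if elapsed_hours <= 0:
        raise ValueError("elapsed_hours must be positive")
    return transferred_tb / elapsed_hours


def implied_total_tb(rate_tb_per_hour: float, total_hours: float) -> float:
    """Total data size implied by a constant rate and a total duration."""
    return rate_tb_per_hour * total_hours


def hours_remaining(total_tb: float, transferred_tb: float, rate_tb_per_hour: float) -> float:
    """Hours still needed at a constant rate (never negative)."""
    if rate_tb_per_hour <= 0:
        raise ValueError("rate_tb_per_hour must be positive")
    return max(total_tb - transferred_tb, 0.0) / rate_tb_per_hour


# Figures quoted in the post: 1.4 TB after 3 hours, about 74 hours overall.
rate = transfer_rate_tb_per_hour(1.4, 3.0)
# ASSUMPTION: 74 hours is the whole transfer, not the time left after the first 3 hours.
total = implied_total_tb(rate, 74.0)
remaining = hours_remaining(total, 1.4, rate)

print(f"rate ~ {rate:.2f} TB/h, implied total ~ {total:.1f} TB, remaining ~ {remaining:.0f} h")
```

Under that reading, the quoted numbers imply roughly 34.5 TB to move in total. The real rate will vary with file sizes and load, so this is only a back-of-the-envelope check.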

sincerely
igor
[Mar 24, 2023 11:18:51 AM]
mwroggenbuck
Advanced Cruncher
USA
Joined: Nov 1, 2006
Post Count: 85
angry Re: Hardware Recovery Update

confused

Something to think about:

When the system does finally start up, there will be a lot of pending uploads. I don't know how that will affect the system. My guess is that BOINC will just try again, but the receiving server might need some TLC during this time.
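To illustrate the "just try again" idea, here is a generic sketch of retrying with capped exponential backoff and random jitter, which is a common way to keep many clients from hitting a freshly restarted server at the same moment. This is not BOINC's actual code; the function name, the 60 second base and the 4 hour cap are all assumptions chosen for illustration.

```python
import random


def backoff_delays(attempts, base_s=60.0, cap_s=4 * 3600.0, rng=None):
    """Return one wait time (in seconds) per retry attempt.

    Each attempt doubles the upper bound of the wait, up to cap_s, and the
    actual wait is drawn uniformly between 0 and that bound ("full jitter"),
    so clients that failed together do not all retry together.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap_s, base_s * 2 ** attempt)
        delays.append(rng.uniform(0.0, ceiling))
    return delays


# Example: the waits one hypothetical client would use for 10 failed uploads.
for i, delay in enumerate(backoff_delays(10, rng=random.Random(0))):
    print(f"retry {i + 1}: wait {delay / 60:.1f} min")
```

With a scheme like this, the pending uploads would arrive spread out over hours rather than all at once, although the server side may still need some attention during that time.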
[Mar 24, 2023 11:43:30 AM]
gibbcorp
Advanced Cruncher
Joined: Nov 29, 2005
Post Count: 80
Re: Hardware Recovery Update

You have a large community here, including many who work in the IT industry, me among them. Beyond the donation of hardware resources, there is a lot of experience and knowledge to take advantage of. If you let us know what you need, we can help. Why not set up a Patreon or sell mugs, T-shirts, etc. to raise funds? I am sure people will help. People are already spending money on electricity, so they are willing to support the project financially.
[Mar 24, 2023 12:30:08 PM]
gb009761
Master Cruncher
Scotland
Joined: Apr 6, 2005
Post Count: 3010
Re: Hardware Recovery Update

Dr Jurisica

Assuming such average rate of file transfer, it will take about 74 hours
Can this run unmonitored (i.e., overnight and over the weekend), or will someone need to 'nurse' the transfer? Hopefully the former, allowing the possibility of a restart early next week.
[Mar 24, 2023 12:36:31 PM]
Jurisica
World Community Grid Admin, Mapping Cancer Markers and Help Conquer Cancer Scientist
Joined: Feb 28, 2007
Post Count: 88
Re: Hardware Recovery Update

Thank you all for the suggestions - the move to the new storage is already under way - but we will monitor it - and hopefully all goes well from there. Indeed - the first step would be not to start new WUs, but to download existing work. Hopefully, synchronization across databases will not run into unforeseen dependencies.

As for the help - the logistics are tricky considering we run from a different data centre, and of course we cannot give access to a broad group - but once we can at least walk again, there are things we plan on our side, and others with the broader community. Briefly - we need to simplify the backend; at the moment, we often run into multiple points of failure instead of robustness. But once we are in such a position, we want to run hackathons - this can substantially help with optimizing the code we run on the grid and with bringing in new projects. So far, NVIDIA is interested in discussing this further, as our plan is to bring more GPU projects. But of course the backend has to be upgraded before that, as peak performance during the GPU stress test in 2021 was around 16 PFLOPS.

thank you all for your support

Igor
[Mar 24, 2023 12:47:48 PM]
bfmorse
Senior Cruncher
US
Joined: Jul 26, 2009
Post Count: 442
Re: Hardware Recovery Update

Dr., thank you for your updates!
[Mar 24, 2023 3:23:43 PM]
Eugene Zenzen
Veteran Cruncher
USA
Joined: Mar 31, 2006
Post Count: 890
Re: Hardware Recovery Update

Dr., thank you for your updates!
Yes, thank you for updates and explanations! rose
[Mar 24, 2023 5:52:38 PM]