World Community Grid - View Thread - Hardware Recovery Update (old)

World Community Grid Forums

Category: Official Messages

Forum: News

Thread: Hardware Recovery Update (old)

Thread Status: Closed
Total posts in this thread: 196

This topic has been viewed 2364302 times and has 195 replies

Mad_Max
Cruncher
Russia
Joined: Nov 26, 2012
Post Count: 22
Status: Offline
Project Badges:

180 day badge for Human Proteome Folding - Phase 2

1 year badge for The Clean Energy Project - Phase 2

50 year badge for Mapping Cancer Markers

2 year badge for Uncovering Genome Mysteries

1 year badge for Outsmart Ebola Together

5 year badge for FightAIDS@Home - Phase 2

10 year badge for Smash Childhood Cancer

5 year badge for Africa Rainfall Project

10 year badge for OpenPandemics - COVID-19


Re: Hardware Recovery Update

W.I.P.

Or possibly R.I.P. :(

No, not yet.
For that, you still need to dig a grave first.
It looks like that's what they're doing right now. :(

So really "W.I.P." indeed.

More seriously: ever since the move from IBM to Krembil in the summer and autumn of last year, it has seemed quite obvious to me that they "bit off more than they can chew". Their team and infrastructure are simply too weak to properly support the number of users and the computing power that WCG had under IBM's management, even during periods when everything is running relatively normally.

Therefore, for them there is no point in rushing or putting a lot of effort into restoring service as quickly as possible when serious problems occur, like the ones we see now or those of last year, because the outflow of users that such delays and problems cause is not perceived as something bad. It may even be a desired side effect if some of the "excess" users simply leave for other DC projects.

Of course, they don't want WCG to die completely, so they will try to restore service little by little. But letting it shrink to a more "manageable" level/size is another matter. They can't say such things directly, in plain text, but there is also non-verbal messaging.

[Mar 22, 2023 9:19:13 PM]

Cosmic Computing
Cruncher
Canada
Joined: Aug 17, 2016
Post Count: 13
Status: Offline
Project Badges:

100 year badge for Mapping Cancer Markers

45 day badge for Uncovering Genome Mysteries

20 year badge for Outsmart Ebola Together

20 year badge for Smash Childhood Cancer

50 year badge for Microbiome Immunity Project

2 year badge for Africa Rainfall Project

50 year badge for OpenPandemics - COVID-19


Re: Hardware Recovery Update

Hey Krembil,

Based on the article you published, it sounds like WCG is running on a single bare-metal server... Given this is not 2003, what's going on here? Virtualization has been around since 2003-2008 and has been popular since 2010-2012. To me, hearing about a single bare-metal server causing any sort of downtime is shocking outside of a homelab or a very small business. I'm currently working on a project to double the compute and storage in a data centre using open source hypervisors and storage; the solutions aren't bound to expensive VMware licensing anymore.

Do you guys want a hand planning this? I'm a Solutions Architect in Victoria, BC, Canada.
Cheers,

Triston Line
tmanaok@gmail.com

----------------------------------------

I may be only #890 out of 750,000 but you just wait until I build my datacentre!
Current personal rack: 12 servers, all dual X5650+, 2TB+ cumulative RAM, 60TB+ storage.

[Mar 22, 2023 9:22:24 PM]

jave200372
Cruncher
Joined: Aug 17, 2008
Post Count: 3
Status: Offline
Project Badges:

1 year badge for Human Proteome Folding - Phase 2

14 day badge for Discovering Dengue Drugs - Together

90 day badge for Nutritious Rice for the World

180 day badge for Help Fight Childhood Cancer

14 day badge for Influenza Antiviral Drug Search

1 year badge for Help Cure Muscular Dystrophy - Phase 2

45 day badge for The Clean Energy Project - Phase 2

90 day badge for Computing for Clean Water

45 day badge for Drug Search for Leishmaniasis

45 day badge for GO Fight Against Malaria

20 year badge for Mapping Cancer Markers

90 day badge for Outsmart Ebola Together

180 day badge for FightAIDS@Home - Phase 2

1 year badge for Microbiome Immunity Project

90 day badge for Africa Rainfall Project

2 year badge for OpenPandemics - COVID-19


Re: Hardware Recovery Update

@Cosmic Computing - I hear ya. One bare metal server is a bit poor. I hope they reach out to you and consider other options.

Also, I'd be willing to donate a small amount whenever I can to help out WCG. If someone's work unit delivers a vital result for humankind, how can that ever be repaid??

[Mar 24, 2023 7:01:12 AM]

Jurisica
World Community Grid Admin, Mapping Cancer Markers and Help Conquer Cancer Scientist
Joined: Feb 28, 2007
Post Count: 88
Status: Offline
Project Badges:

90 day badge for The Clean Energy Project

1 year badge for Help Fight Childhood Cancer

90 day badge for Influenza Antiviral Drug Search

180 day badge for Computing for Clean Water

2 year badge for Drug Search for Leishmaniasis

1 year badge for GO Fight Against Malaria

45 day badge for Computing for Sustainable Water

14 day badge for Outsmart Ebola Together

14 day badge for FightAIDS@Home - Phase 2

5 year badge for OpenPandemics - COVID-19


Re: Hardware Recovery Update

Thank you for the suggestion and the offer.
Indeed, we use VMs, and we do have multiple blades - but we do not (yet) have capacity for redundancy or sufficient capacity for growth.

This is all older equipment - but despite multiple attempts we do not yet have a generous IT vendor or other partner that would give us a much-needed refresh and redundancy. (Suffice it to note - there was a planned refresh across academic HPC in Canada last year - but it has not happened yet.)

However, there are two possible leads - *if* they work out, we would be moved several years ahead.

As a quick update: finally, the /science filesystem is being moved from the recovery storage unit to the new storage. As of last night, after 3 hours, the /science filesystem on the new storage showed 1.4TB used. Assuming the transfer continues at that average rate, it will take about 74 hours. Hopefully, we will be able to restart BOINC from the new storage and finally put the failure behind us. We will keep you posted.
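
For anyone wanting to sanity-check that estimate, here is a minimal sketch of the arithmetic in Python. The only inputs are the figures quoted above (1.4TB after 3 hours, about 74 hours). The total size of /science is not stated in the post, so the sketch back-derives the volume those numbers imply, assuming the 74 hours covers the whole copy at a constant rate. All function and variable names are illustrative.

```python
# Figures quoted in the post; everything else is derived from them.
copied_tb = 1.4       # TB shown as used on the new storage
elapsed_h = 3.0       # hours elapsed when that reading was taken
quoted_eta_h = 74.0   # hours quoted for the transfer


def transfer_rate(copied_tb: float, elapsed_h: float) -> float:
    """Average transfer rate in TB per hour."""
    return copied_tb / elapsed_h


def implied_total_tb(rate_tb_per_h: float, eta_h: float) -> float:
    """Data volume implied by a constant rate sustained for eta_h hours."""
    return rate_tb_per_h * eta_h


def hours_needed(total_tb: float, rate_tb_per_h: float) -> float:
    """Time to move total_tb at a constant rate."""
    return total_tb / rate_tb_per_h


rate = transfer_rate(copied_tb, elapsed_h)
total = implied_total_tb(rate, quoted_eta_h)

print(f"average rate: {rate:.2f} TB/h")        # average rate: 0.47 TB/h
print(f"implied volume: {total:.1f} TB")       # implied volume: 34.5 TB
print(f"at half the rate: {hours_needed(total, rate / 2):.0f} h")
```

If the average rate halved, the same volume would take about twice as long, so the 74-hour figure is only as good as the assumption that the rate holds.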

sincerely
igor

[Mar 24, 2023 11:18:51 AM]

mwroggenbuck
Advanced Cruncher
USA
Joined: Nov 1, 2006
Post Count: 87
Status: Offline
Project Badges:

14 day badge for Human Proteome Folding - Phase 2

14 day badge for Help Fight Childhood Cancer

14 day badge for Help Cure Muscular Dystrophy - Phase 2

2 year badge for The Clean Energy Project - Phase 2

14 day badge for Computing for Clean Water

14 day badge for Drug Search for Leishmaniasis

14 day badge for GO Fight Against Malaria

14 day badge for Computing for Sustainable Water

45 day badge for Outsmart Ebola Together

180 day badge for Smash Childhood Cancer

14 day badge for Microbiome Immunity Project

14 day badge for Africa Rainfall Project

20 year badge for OpenPandemics - COVID-19


Re: Hardware Recovery Update

Something to think about:

When the system does finally start up, there will be a lot of pending uploads. I don't know how that will affect the system. My guess is that BOINC will just try again, but the receiving server might need some TLC during this time.

[Mar 24, 2023 11:43:30 AM]

gibbcorp
Advanced Cruncher
Joined: Nov 29, 2005
Post Count: 80
Status: Offline
Project Badges:

2 year badge for Human Proteome Folding - Phase 2

180 day badge for Nutritious Rice for the World

45 day badge for Help Cure Muscular Dystrophy - Phase 2

14 day badge for Discovering Dengue Drugs - Together - Phase 2

14 day badge for The Clean Energy Project - Phase 2

180 day badge for GO Fight Against Malaria

180 day badge for Uncovering Genome Mysteries

2 year badge for Outsmart Ebola Together

90 day badge for FightAIDS@Home - Phase 2

45 day badge for Africa Rainfall Project


Re: Hardware Recovery Update

You have a large community here, and many of us work in the IT industry, including me. Beyond donations of hardware resources, there is a lot of experience and knowledge to take advantage of. Also, if you can let us know what you need, we can help. Why not set up a Patreon or sell mugs, T-shirts, etc. to raise funds? I am sure people will help. People are already spending money on electricity, so they are willing to support the project financially.

[Mar 24, 2023 12:30:08 PM]

gb009761
Master Cruncher
Scotland
Joined: Apr 6, 2005
Post Count: 3010
Status: Offline
Project Badges:

90 day badge for Help Cure Muscular Dystrophy

90 day badge for Discovering Dengue Drugs - Together

2 year badge for Help Fight Childhood Cancer

1 year badge for Discovering Dengue Drugs - Together - Phase 2

1 year badge for Computing for Clean Water

180 day badge for Drug Search for Leishmaniasis

180 day badge for Computing for Sustainable Water

2 year badge for FightAIDS@Home - Phase 2

2 year badge for Microbiome Immunity Project


Re: Hardware Recovery Update

Dr Jurisica

"Assuming such average rate of file transfer, it will take about 74 hours"

Can this run unmonitored (i.e., over the weekend/overnight), or will someone need to 'nurse' the transfer? Hopefully the former, allowing the possibility of a restart early next week.

[Mar 24, 2023 12:36:31 PM]

Jurisica
World Community Grid Admin, Mapping Cancer Markers and Help Conquer Cancer Scientist
Joined: Feb 28, 2007
Post Count: 88
Status: Offline
Project Badges:


Re: Hardware Recovery Update

Thank you all for the suggestions - the move to the new storage is already in progress - but we will monitor it - and hopefully all goes well from there. Indeed - the first step would be not to start new WUs - but to download existing work. Hopefully, synchronization across databases will not run into unforeseen dependencies.

As for the help - logistics are tricky considering we run from a different data centre - and of course we cannot give access to a broad group - but once we can at least walk again, there are things we plan on our side, and others with the broader community. Briefly - we need to simplify the backend - at the moment, we often run into multiple points of failure, instead of robustness. But - once we are in such a position - we want to run hackathons - this can substantially help with optimizing the code we run on the grid, and bring in new projects. So far, nVidia is interested in discussing this further - as our plan is to bring more GPU projects. But - of course the backend has to be upgraded before that - as peak performance during the GPU stress test in 2021 was around 16 PFLOPS.

thank you all for your support

Igor

[Mar 24, 2023 12:47:48 PM]