World Community Grid Forums
Thread Status: Closed | Total posts in this thread: 196
Mad_Max
Cruncher | Russia | Joined: Nov 26, 2012 | Post Count: 22 | Status: Offline
W.I.P. Or possibly R.I.P.

No, not yet. For that, someone would first need to dig a grave, and it does look like that is exactly what they are doing right now. So "W.I.P." really does fit.

More seriously: ever since the move from IBM to Krembil in the summer and autumn of last year, it has been fairly obvious, in my opinion, that they bit off more than they can chew. Their team and their infrastructure are simply too weak to properly support the number of users and the computing power that WCG had under IBM's management, even in periods when everything is running relatively normally.

So there is no incentive for them to rush or put great effort into restoring service as quickly as possible when serious problems hit, like the ones we see now or saw last year. The outflow of users that such delays and problems cause is not perceived as something bad; perhaps it is even a desired side effect, if some of the "excess" users simply leave for other DC projects. Of course, they do not want WCG to die completely, so they will keep restoring it little by little. But letting it shrink to a more "manageable" level/size is another matter. They cannot say such things directly, in plain text, but there is non-verbal messaging as well.
Cosmic Computing
Cruncher | Canada | Joined: Aug 17, 2016 | Post Count: 13 | Status: Offline
Hey Krembil,
Based on the article you published, it sounds like WCG is running on a single bare-metal server. Given this is not 2003, what's going on here? Virtualization has been around since the 2003-2008 era and has been mainstream since roughly 2010-2012, so hearing about a single bare-metal server causing any sort of downtime is shocking to me outside of a homelab or a very small business.

I'm currently working on a project to double the compute and storage in a data centre using open-source hypervisors and storage; solutions aren't bound to expensive VMware licensing anymore. Do you guys want a hand planning this? I'm a Solutions Architect in Victoria, BC, Canada.

Cheers,
Triston Line
tmanaok@gmail.com
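To make the suggestion a bit more concrete, here is a minimal sketch of what managing a service as a KVM guest looks like through the libvirt Python bindings. Everything here (the guest name, sizing, and disk path) is a made-up placeholder for illustration, not anything WCG actually runs:

```python
# Illustrative only: define and boot a KVM guest via the libvirt Python API.
# The domain name, resources, and disk path below are hypothetical placeholders.
import libvirt

DOMAIN_XML = """
<domain type='kvm'>
  <name>wcg-backend-01</name>
  <memory unit='GiB'>16</memory>
  <vcpu>8</vcpu>
  <os><type arch='x86_64'>hvm</type><boot dev='hd'/></os>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/wcg-backend-01.qcow2'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <interface type='network'><source network='default'/></interface>
  </devices>
</domain>
"""

conn = libvirt.open("qemu:///system")  # connect to the local KVM hypervisor
dom = conn.defineXML(DOMAIN_XML)       # register the guest definition
dom.create()                           # boot the guest
print(dom.name(), "running:", dom.isActive() == 1)
conn.close()
```

The point is that a guest like this can be snapshotted, migrated to another host, or rebuilt from its definition, which is exactly the flexibility a single bare-metal box doesn't give you.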
I may be only #890 out of 750,000 but you just wait until I build my datacentre!
Current personal rack: 12 servers, all dual X5650+, 2TB+ cumulative RAM, 60TB+ storage.
jave200372
Cruncher | Joined: Aug 17, 2008 | Post Count: 3 | Status: Offline
@Cosmic Computing - I hear ya. One bare metal server is a bit poor. I hope they reach out to you and consider other options.
Also, I'd be willing to donate a small amount whenever I can to help out WCG. If someone's work unit delivers a vital result for humankind, how can that ever be repaid??
Jurisica
World Community Grid Admin, Mapping Cancer Markers and Help Conquer Cancer Scientist | Joined: Feb 28, 2007 | Post Count: 88 | Status: Offline
Thank you for the suggestion and the offer.
Indeed, we do use VMs, and we do have multiple blades, but we do not (yet) have capacity for redundancy or sufficient capacity for growth. This is all older equipment, and despite multiple attempts we do not yet have a generous IT vendor or other partner that would give us the much-needed refresh and redundancy. (Suffice it to note that a refresh across academic HPC in Canada was planned for last year, but it has not happened yet.) However, there are two possible leads; *if* they work out, we would be moved several years ahead.

A quick update: the /science filesystem is finally on the move from the recovery storage unit to the new storage. As of last night, after 3 hours, the /science filesystem on the new storage showed 1.4 TB used. Assuming that average transfer rate holds, the copy will take about 74 hours. Hopefully, we will then be able to restart BOINC from the new storage and finally put the failure behind us.

We will keep you posted.

Sincerely,
Igor
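As a rough sanity check on that estimate, assuming the transfer rate stays constant, the implied total size of /science would be about 34-35 TB; that total is back-calculated from the numbers above, not stated anywhere officially:

```python
# Back-of-the-envelope ETA for the /science copy, assuming a constant rate.
# Only the "1.4 TB after 3 hours" observation comes from the post above;
# the total size is an assumption inferred from the ~74-hour estimate.
observed_tb = 1.4        # TB copied so far
observed_hours = 3.0     # elapsed time so far
assumed_total_tb = 34.5  # hypothetical total size of /science

rate = observed_tb / observed_hours   # ~0.47 TB/h
eta_hours = assumed_total_tb / rate   # ~74 h
print(f"rate: {rate:.2f} TB/h, ETA: {eta_hours:.0f} h (~{eta_hours / 24:.1f} days)")
```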
mwroggenbuck
Advanced Cruncher | USA | Joined: Nov 1, 2006 | Post Count: 85 | Status: Offline
Something to think about: When the system does finally start up, there will be a lot of pending uploads. I don't know how that will affect the system. My guess is that BOINC will just try again, but the receiving server might need some TLC during this time.
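For anyone curious why the restart should not be a single thundering herd: BOINC clients retry failed transfers with a growing backoff, so the pending uploads should arrive spread out over time. A toy sketch of exponential backoff with jitter (this is only the general idea, not BOINC's actual retry code or its real parameters):

```python
# Toy illustration of exponential backoff with jitter for upload retries.
# Parameters are made up; BOINC's real scheduling logic is more involved.
import random

def next_retry_delay(attempt, base=60, cap=4 * 3600):
    """Seconds to wait before retry number `attempt` (1, 2, 3, ...)."""
    delay = min(cap, base * 2 ** (attempt - 1))   # 60 s, 120 s, 240 s, ...
    return delay * random.uniform(0.5, 1.5)       # jitter keeps clients from syncing up

for attempt in range(1, 6):
    print(f"attempt {attempt}: retry in ~{next_retry_delay(attempt) / 60:.1f} min")
```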
gibbcorp
Advanced Cruncher | Joined: Nov 29, 2005 | Post Count: 80 | Status: Offline
You have a large community here, many of whom work in the IT industry, including me. Beyond donations of hardware resources, there is a lot of experience and knowledge to draw on, so if you let us know what you need, we can help. Why not set up a Patreon, or sell mugs, T-shirts, etc. to raise funds? I am sure people would help; they are already spending money on electricity, so they are clearly willing to support the project financially.
gb009761
Master Cruncher | Scotland | Joined: Apr 6, 2005 | Post Count: 3010 | Status: Offline
Dr Jurisica
----------------------------------------
Assuming such average rate of file transfer, it will take about 74 hours
----------------------------------------
Can this run unmonitored (i.e., over the weekend/overnight), or will someone need to 'nurse' the transfer? Hopefully the former, allowing the possibility of a restart early next week.
Jurisica
World Community Grid Admin, Mapping Cancer Markers and Help Conquer Cancer Scientist | Joined: Feb 28, 2007 | Post Count: 88 | Status: Offline
Thank you all for the suggestions. The move to the new storage is already in place, but we will monitor it, and hopefully all goes well from there. Indeed, the first part would be not to start new WUs but to download existing work. Hopefully, synchronization across databases will not run into unforeseen dependencies.
As for the help: logistics are tricky, considering we run from a different data centre, and of course we cannot give access to a broad group. But once we can at least walk again, there are things we plan to do on our side, and others with the broader community.

Briefly, we need to simplify the backend; at the moment, we often run into multiple points of failure instead of robustness. Once we are in such a position, we want to run hackathons, which can substantially help with optimizing the code we run on the grid and bring in new projects. So far, nVidia is interested in discussing this further, as our plan is to bring more GPU projects. But of course the backend has to be upgraded before that, as peak performance during the GPU stress test in 2021 was around 16 PFLOPS.

Thank you all for your support,
Igor
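To put that 16 PFLOPS figure in perspective, here is a back-of-the-envelope conversion into GPU counts under an assumed per-card throughput; the per-GPU number is a round illustrative figure, not a measurement from the grid:

```python
# Rough scale estimate: how many GPUs could 16 PFLOPS of peak throughput represent?
# The per-GPU throughput is an assumption for illustration, not WCG data.
peak_pflops = 16.0
assumed_tflops_per_gpu = 10.0  # hypothetical mid-range FP32 card

gpu_count = peak_pflops * 1000 / assumed_tflops_per_gpu
print(f"~{gpu_count:.0f} GPUs at {assumed_tflops_per_gpu} TFLOPS each")
```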
bfmorse
Senior Cruncher | US | Joined: Jul 26, 2009 | Post Count: 442 | Status: Offline
Dr., thank you for your updates!
Eugene Zenzen
Veteran Cruncher | USA | Joined: Mar 31, 2006 | Post Count: 890 | Status: Offline
----------------------------------------
Dr., thank you for your updates!
----------------------------------------
Yes, thank you for the updates and explanations!