World Community Grid Forums
Category: Support | Forum: Website Support | Thread: Retry Now...RESOLVED
Thread Status: Active | Total posts in this thread: 49
Mad_Max
Cruncher Russia Joined: Nov 26, 2012 Post Count: 22 Status: Offline
Hi davidjharder and TPCBF, improving the servers is our #1 priority and we will share any news about improvements or changes to them when they happen.

Hello Cyclops,

Well, if that is really your top priority, then I would recommend that you first of all contact the administrators/programmers of the MCM project. I monitored the client downloads on my computers and found that almost all of the bandwidth used when working with WCG projects goes to downloading the same large file over and over (about 80-90% of the traffic falls on this one single file).

It is "mcm1.dataset-sarc1.txt", a file 102 MB in size and one of the main databases of this project, which is the SAME for a huge number of WUs. I am not sure, but I assume this file is common to all current tasks of the project. Due to an incorrect configuration, however, it is not kept among the permanent project files (such as the executables) but is downloaded again and again with each batch of WUs the server sends to the client, creating a huge (and absolutely unnecessary!) load on the WCG file servers and network capacity. This feature (saving and reusing a single file across multiple WUs without downloading it again) has been available in BOINC for a long time and is used successfully by many projects, including (but not limited to) Rosetta@home and Einstein@home, both of which I also run.

And for starters, this file could at least be compressed for transmission over the network. I opened it - it is essentially a regular (structured) text file, so it compresses very well: even plain vanilla ZIP gives about 3x compression (from 102 MB to 33 MB). Right now it is downloaded as uncompressed text. Such a waste of network bandwidth!

This may not solve the problem of an insufficient number of sockets/connections on the servers (because it is just one file download shared by several WUs), but it will definitely help with the lack of bandwidth, given its size (102 MB versus <1 MB for almost all other downloads, which average about ~100 KB).

[Edit 2 times, last edit by Mad_Max at Nov 10, 2022 7:44:45 AM]
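If anyone wants to reproduce the ~3x figure, here is a minimal sketch (the path below is only an assumption about a typical BOINC data directory layout - point it at your own copy of the file):

    # Rough check of the ~3x compression figure for the MCM dataset file.
    # The path is an assumption about a typical BOINC data layout; adjust it.
    import gzip
    import os

    SRC = "projects/www.worldcommunitygrid.org/mcm1.dataset-sarc1.txt"  # assumed path

    raw_size = os.path.getsize(SRC)
    with open(SRC, "rb") as f:
        packed = gzip.compress(f.read(), compresslevel=6)

    print(f"original : {raw_size / 1e6:.1f} MB")
    print(f"gzip -6  : {len(packed) / 1e6:.1f} MB")
    print(f"ratio    : {raw_size / len(packed):.1f}x")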
Mad_Max
Cruncher Russia Joined: Nov 26, 2012 Post Count: 22 Status: Offline
P.S.
The MCM staff DID take some measures - this file is already shared by multiple WUs. But the problem, as I see it, is that the BOINC client only keeps the file while at least one WU in the work queue still uses it. Once all of those WUs are finished, the client immediately deletes it and has to download it again with the next batch of WUs. There is a configuration option in BOINC to mark it as a permanent (sticky) file that is kept regardless of whether there are currently queued jobs using it - either until the application using it changes (R@h uses this option to store its main database, which is used by ALL WUs), or until the server tells the client that the file is no longer needed and can be deleted (E@h uses this approach, because their shared files are common to many WUs but not to all of them).
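For clarity about which mechanism I mean: as I understand the BOINC server documentation, these per-file flags live in the workunit input template on the server side. A purely hypothetical sketch of the relevant fragment follows (NOT WCG's actual template, which only the staff can see; the gzip flag is my recollection of how compressed downloads for 7.0+ clients are enabled):

    # Hypothetical sketch only - NOT WCG's actual input template.
    # As I understand the BOINC server docs, these per-file flags control
    # whether the client keeps a shared input file around between batches.
    STICKY_FILE_INFO = """\
    <file_info>
        <sticky/>     <!-- client keeps the file after the WUs that use it finish -->
        <no_delete/>  <!-- server side: do not delete, other WUs still reference it -->
        <gzip/>       <!-- if I recall correctly: 7.0+ clients fetch a .gz copy -->
    </file_info>
    """

    print(STICKY_FILE_INFO)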
Cyclops
Senior Cruncher Joined: Jun 13, 2022 Post Count: 295 Status: Offline
I would recommend that you first of all contact the administrators/programmers of the MCM project. [...] It is "mcm1.dataset-sarc1.txt", a file 102 MB in size [...] downloaded again and again with each batch of WUs the server sends to the client.

Thanks, Mad_Max! I sent your suggestion to the tech team.
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1932 Status: Offline
Thanks, Mad_Max! I sent your suggestion to the tech team.

Well, Cyclops, why not have "the tech team" work on the underlying issues rather than trying to mask the problem instead? We had download problems when there were no MCM1 WUs being sent out, so that one single large file just can't be the source of the problems...
Ralf
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7579 Status: Recently Active
We had download problems when there were no MCM1 WUs being sent out, so that one single large file just can't be the source of the problems...

That large file is not the source of the problem. During the very few times I have totally run out of MCM work and have had to re-download that file, it appears to download without any problem - not necessarily quickly, but without any HTTP errors. As long as I have any MCM units in the queue, I do not need to reload that file. Even with that file not having to be reloaded, it is the smaller workunit files, both MCM and OPN1, which become stuck. As TPCBF has mentioned more than once, it is not a transmission issue, it is a connection issue.
Cheers
Sgt. Joe
*Minnesota Crunchers*
PMH_UK
Veteran Cruncher UK Joined: Apr 26, 2007 Post Count: 765 Status: Offline
As TPCBF has mentioned more than once, it is not a transmission issue, it is a connection issue.

With connections tied up slowly downloading files, there are fewer available, hence the errors and backoffs. Both more bandwidth, to free up connections sooner, and more available connections are needed.
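As a rough back-of-envelope illustration (every number below is an assumption, purely to show the effect of slots being held by long downloads):

    # Back-of-envelope only: all numbers are made up to illustrate how slow,
    # large downloads hold a limited connection pool hostage.
    POOL = 2000          # assumed concurrent-connection limit on the server
    BIG_MB = 102         # the MCM dataset file
    SMALL_MB = 0.1       # a typical ~100 KB workunit file

    def downloads_per_hour(pool, size_mb, per_conn_mbit_s):
        seconds_held = size_mb * 8 / per_conn_mbit_s   # time one slot stays busy
        return pool * 3600 / seconds_held

    # If each connection only manages ~2 Mbit/s, a 102 MB file holds a slot ~408 s:
    print(f"{downloads_per_hour(POOL, BIG_MB, 2):,.0f} big files/hour")      # ~17,600
    print(f"{downloads_per_hour(POOL, SMALL_MB, 2):,.0f} small files/hour")  # ~18,000,000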
Paul.
bluestang
Senior Cruncher USA Joined: Oct 1, 2010 Post Count: 272 Status: Offline
I will go out on a limb and speculate that a big part of the problem is an inadequate amount of money to compensate for the increased burdens on the existing IT staff, not to mention the availability of increased hardware capacity. Not that throwing money at problems necessarily fixes them, but to me this looks like overburdening the staff and the hardware. Having had experience in a 24/7 environment, my guess is an inadequate amount of both. Redundancy is expensive for both staff and hardware, but if you don't have it when you need it, it can become potentially crippling in operations and more expensive in the long run. Kind of reminds me of two old saws: "Don't bite off more than you can chew" and "Don't let your eyes be bigger than your stomach."
Cheers

I have myself been in the IT business for more than a couple of decades, and I am therefore also aware that there can always be issues creeping up when deploying a new system or migrating an existing one. But in our case here, what is going on is beyond comprehension.

For one, there were likely talks between IBM and Krembil BEFORE they made the announcement of the move back in September 2021, which is now 14 months ago. That is not something that anyone, on either side, just decides over a lunch break. And during those talks, Krembil should already have had an idea of the scope and of the resources generally required to take over the project. If they did not have that info, that would have been downright stupid at that point.

Second, there were another 5 months between the announcement and the shutdown of the system in February 2022. That would have been the right time for Krembil to start putting in place all the resources required to run the project. At the latest after the next three months, from the shutdown in February until the reactivation of the forum in May 2022, when doing the actual, physical migration, they should have been aware of most of the tripwires and other obstacles in the way of getting the project operational again.

So when restarting "for testing" in June, I did expect that there would be a few weeks in which things might not work smoothly and a lot of fine tuning would still have to happen. But the scope and the need for resources should, by that time, 9 months into the transition, already have been obvious. Instead, yet another 5 months later, it just seems to be utter pandemonium, with things amiss at a very basic level in this whole setup. It is incomprehensible that by now, a total of 14 months in, not having the resources is still an issue.

And on top of all of this, not having proper communication from Krembil just makes things worse for us, the volunteers. Platitudes like "this issue is on the radar" or "fixing this is our #1 priority" are of no help here. On the contrary, they just reinforce the feeling that Krembil is totally in over their collective heads at this point...
Ralf

You are assuming IBM told them about everything, and I think IBM had quite a bit of patchwork-type software on their end just to make things work. Don't forget, they could also just borrow some of their own cloud resources. I think IBM sorely and deliberately misinformed any potential party interested in taking over WCG about what was needed (and what they had Frankensteined together) to keep WCG running smoothly with the new generations of hardware that volunteers are capable of having.
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1932 Status: Offline
With connections tied up slowly downloading files, there are fewer available, hence the errors and backoffs. Both more bandwidth, to free up connections sooner, and more available connections are needed.

The lack of connections is not on the Internet-facing side (and hence not "bandwidth" related); it is an issue of getting enough connections from the internal database to the web server(s). The download problem existed regardless of which project is sending WUs, in pretty much any combination. And the vast majority of files that need to be downloaded are <1 KB in size, so bandwidth simply is not an issue - certainly not for "freeing up connections sooner", as all of those files would fit into a single 1500-byte Ethernet frame (the standard maximum Ethernet MTU). Even when downloading a typical MCM1 WU file, which is mostly 290-650 bytes in size, a 33.6 kbit/s dialup line would fetch it in a fraction of a second. And even with just a 100 Mbit/s upload connection (and I am sure that Krembil has an even better one - we have 500 Mbit/s here at the office), that link could, bandwidth-wise, send out tens of thousands of those files EVERY SECOND - if it can get the data for those WUs from the underlying database servers...
Ralf
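A quick payload-only sanity check of that arithmetic (protocol overhead, which in practice dominates for files this small, is deliberately ignored; the point is only the order of magnitude):

    # Payload-only sanity check of the figures above; TCP/HTTP overhead ignored.
    def seconds_per_file(file_bytes, link_bit_s):
        return file_bytes * 8 / link_bit_s

    def files_per_second(file_bytes, link_bit_s):
        return link_bit_s / 8 / file_bytes

    for size in (290, 650):   # typical MCM1 WU download sizes
        print(f"{size:>3} B over 33.6 kbit/s dialup : {seconds_per_file(size, 33_600):.2f} s")
        print(f"{size:>3} B over 100 Mbit/s uplink  : {files_per_second(size, 100_000_000):,.0f} files/s")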
ThreadRipper
Veteran Cruncher Sweden Joined: Apr 26, 2007 Post Count: 1319 Status: Offline
So let's assume it is an issue of the number of connections between the DB server and the web server(s). Then what differs from how IBM had it wired up? Is the server hardware not the same as or better than what IBM had? Because IBM had it running without connectivity issues - what is the difference from that "old"/working setup? Wouldn't trying to approach that setup be helpful? Can Krembil perhaps ask the old IBM techs for some guidance, since this issue has been around for so long now?

Before the OPNG WUs were distributed again, the grid was running fine the last time around. Would a test where no WUs other than OPNG are sent out be relevant - i.e., will the same issue still happen if all other WUs/projects are disabled?
----------------------------------------
Join The International Team: https://www.worldcommunitygrid.org/team/viewTeamInfo.do?teamId=CK9RP1BKX1
AMD TR2990WX @ PBO, 64GB Quad 3200MHz 14-17-17-17-1T, RX6900XT @ Stock
AMD 3800X @ PBO
AMD 2700X @ 4GHz
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1932 Status: Offline
Is the server hardware not the same as or better than what IBM had?

Well, as Cyclops mentioned in his post a bit earlier, Krembil is running WCG now on "real iron", while at IBM, WCG was switched over a couple of years ago to a cloud setup. And honestly, the number of issues after that migration (which required a 3-day shutdown) had increased compared to their previous, non-cloud setup, though all issues that did come up always got resolved by Kevin & Co in very little time.

Can Krembil perhaps ask the old IBM techs for some guidance, since this issue has been around for so long now?

That's where I say that Krembil dropped the ball. They had months before the shutdown, and at least another 3 months while doing the actual physical transfer. And now we are pretty much 6 months into "testing" before they come to realize that they need more hardware resources...

Would a test where no WUs other than OPNG are sent out be relevant?

The download errors happened with pretty much any combination of project WUs; I don't think you can seriously pinpoint it to one specific project. But that hits the same point I have been trying to make for months: Krembil just isn't transparent/communicative enough to decisively determine that. Cyclops mentioned in his post today that they had/have "most valuable alpha testers", and I am seriously wondering who those are and what they have been testing for the last 6 months...
Ralf