World Community Grid Forums
Category: Support | Forum: Website Support | Thread: Retry Now...RESOLVED
Thread Status: Active | Total posts in this thread: 49
Mad_Max
Cruncher Russia Joined: Nov 26, 2012 Post Count: 22 Status: Offline
Hi davidjharder and TPCBF, improving the servers is our #1 priority and we will share any news about improvements or changes to them when they happen.

Hello Cyclops,

Well, if that is really your top priority, then I would recommend that you first of all contact the administrators/programmers of the MCM project. I monitored the client downloads on my computers and found that almost all of the bandwidth used when working with WCG projects goes to downloading the same large file over and over (about 80-90% of the traffic falls on this one single file).

It is "mcm1.dataset-sarc1.txt", a file 102 MB in size and one of the main databases of this project, which is the SAME for a huge number of WUs. I am not sure, but I assume this file is common to all current tasks of the project. Due to an incorrect configuration, however, it is not kept among the permanent project files (such as the executables) but is downloaded again and again with each batch of WUs the server sends to the client, creating a huge (and absolutely unnecessary!) load on the WCG file servers and network capacity. This feature (saving and reusing a single file across multiple WUs without downloading it again) has been available in BOINC for a long time and is used successfully by many projects, including (but not limited to) Rosetta@home and Einstein@home, both of which I also run.

And for starters, this file could at least be compressed for transmission over the network. I opened it - it is essentially a regular (structured) text file, so it compresses very well: even plain vanilla ZIP gives about 3x compression (from 102 MB to 33 MB). Right now it is downloaded as uncompressed text. Such a waste of network bandwidth!

This may not solve the problem of an insufficient number of sockets/connections on the servers (because it is just one file download shared by several WUs), but it will definitely help with the lack of bandwidth, given its size (102 MB versus <1 MB for almost all other downloads, which average about ~100 KB).

[Edit 2 times, last edit by Mad_Max at Nov 10, 2022 7:44:45 AM]
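If anyone wants to reproduce the ~3x figure, here is a minimal sketch (the path below is only an assumption about a typical BOINC data directory layout - point it at your own copy of the file):

    # Rough check of the ~3x compression figure for the MCM dataset file.
    # The path is an assumption about a typical BOINC data layout; adjust it.
    import gzip
    import os

    SRC = "projects/www.worldcommunitygrid.org/mcm1.dataset-sarc1.txt"  # assumed path

    raw_size = os.path.getsize(SRC)
    with open(SRC, "rb") as f:
        packed = gzip.compress(f.read(), compresslevel=6)

    print(f"original : {raw_size / 1e6:.1f} MB")
    print(f"gzip -6  : {len(packed) / 1e6:.1f} MB")
    print(f"ratio    : {raw_size / len(packed):.1f}x")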
Mad_Max
Cruncher Russia Joined: Nov 26, 2012 Post Count: 22 Status: Offline
P.S.
The MCM staff DID take some measures - this file is already shared by multiple WUs. But the problem, as I see it, is that the BOINC client only keeps the file while at least one WU in the work queue still uses it. Once all of those WUs are finished, the client immediately deletes it and has to download it again with the next batch of WUs. There is a configuration option in BOINC to mark it as a permanent (sticky) file that is kept regardless of whether there are currently queued jobs using it - either until the application using it changes (R@h uses this option to store its main database, which is used by ALL WUs), or until the server tells the client that the file is no longer needed and can be deleted (E@h uses this approach, because their shared files are common to many WUs but not to all of them).
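For clarity about which mechanism I mean: as I understand the BOINC server documentation, these per-file flags live in the workunit input template on the server side. A purely hypothetical sketch of the relevant fragment follows (NOT WCG's actual template, which only the staff can see; the gzip flag is my recollection of how compressed downloads for 7.0+ clients are enabled):

    # Hypothetical sketch only - NOT WCG's actual input template.
    # As I understand the BOINC server docs, these per-file flags control
    # whether the client keeps a shared input file around between batches.
    STICKY_FILE_INFO = """\
    <file_info>
        <sticky/>     <!-- client keeps the file after the WUs that use it finish -->
        <no_delete/>  <!-- server side: do not delete, other WUs still reference it -->
        <gzip/>       <!-- if I recall correctly: 7.0+ clients fetch a .gz copy -->
    </file_info>
    """

    print(STICKY_FILE_INFO)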
Cyclops
Senior Cruncher Joined: Jun 13, 2022 Post Count: 295 Status: Offline
I would recommend that you first of all contact the administrators/programmers of the MCM project. [...] It is "mcm1.dataset-sarc1.txt", a file 102 MB in size [...] downloaded again and again with each batch of WUs the server sends to the client.

Thanks, Mad_Max! I sent your suggestion to the tech team.
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1932 Status: Offline
Thanks, Mad_Max! I sent your suggestion to the tech team.

Well, Cyclops, why not have "the tech team" work on the underlying issues rather than trying to mask the problem instead? We had download problems when there were no MCM1 WUs being sent out, so that one single large file just can't be the source of the problems...
Ralf
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7579 Status: Recently Active
We had download problems when there were no MCM1 WUs being sent out, so that one single large file just can't be the source of the problems...

That large file is not the source of the problem. During the very few times I have totally run out of MCM work and have had to re-download that file, it appears to download without any problem - not necessarily quickly, but without any HTTP errors. As long as I have any MCM units in the queue, I do not need to reload that file. Even with that file not having to be reloaded, it is the smaller workunit files, both MCM and OPN1, which become stuck. As TPCBF has mentioned more than once, it is not a transmission issue, it is a connection issue.
Cheers
Sgt. Joe
*Minnesota Crunchers*
PMH_UK
Veteran Cruncher UK Joined: Apr 26, 2007 Post Count: 765 Status: Offline
As TPCBF has mentioned more than once, it is not a transmission issue, it is a connection issue.

With connections tied up slowly downloading files, there are fewer available, hence the errors and backoffs. Both more bandwidth, to free up connections sooner, and more available connections are needed.
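As a rough back-of-envelope illustration (every number below is an assumption, purely to show the effect of slots being held by long downloads):

    # Back-of-envelope only: all numbers are made up to illustrate how slow,
    # large downloads hold a limited connection pool hostage.
    POOL = 2000          # assumed concurrent-connection limit on the server
    BIG_MB = 102         # the MCM dataset file
    SMALL_MB = 0.1       # a typical ~100 KB workunit file

    def downloads_per_hour(pool, size_mb, per_conn_mbit_s):
        seconds_held = size_mb * 8 / per_conn_mbit_s   # time one slot stays busy
        return pool * 3600 / seconds_held

    # If each connection only manages ~2 Mbit/s, a 102 MB file holds a slot ~408 s:
    print(f"{downloads_per_hour(POOL, BIG_MB, 2):,.0f} big files/hour")      # ~17,600
    print(f"{downloads_per_hour(POOL, SMALL_MB, 2):,.0f} small files/hour")  # ~18,000,000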
Paul.
bluestang
Senior Cruncher USA Joined: Oct 1, 2010 Post Count: 272 Status: Offline
I will go out on a limb and speculate that a big part of the problem is an inadequate amount of money to compensate for the increased burdens on the existing IT staff, not to mention the availability of increased hardware capacity. Not that throwing money at problems necessarily fixes them, but to me this looks like overburdening the staff and the hardware. Having had experience in a 24/7 environment, my guess is an inadequate amount of both. Redundancy is expensive for both staff and hardware, but if you don't have it when you need it, it can become potentially crippling in operations and more expensive in the long run. Kind of reminds me of two old saws: "Don't bite off more than you can chew" and "Don't let your eyes be bigger than your stomach."
Cheers

I have myself been in the IT business for more than a couple of decades, and I am therefore also aware that there can always be issues creeping up when deploying a new system or migrating an existing one. But in our case here, what is going on is beyond comprehension.

For one, there were likely talks between IBM and Krembil BEFORE they made the announcement of the move back in September 2021, which is now 14 months ago. That is not something that anyone, on either side, just decides over a lunch break. And during those talks, Krembil should already have had an idea of the scope and of the resources generally required to take over the project. If they did not have that info, that would have been downright stupid at that point.

Second, there were another 5 months between the announcement and the shutdown of the system in February 2022. That would have been the right time for Krembil to start putting in place all the resources required to run the project. At the latest after the next three months, from the shutdown in February until the reactivation of the forum in May 2022, when doing the actual, physical migration, they should have been aware of most of the tripwires and other obstacles in the way of getting the project operational again.

So when restarting "for testing" in June, I did expect that there would be a few weeks in which things might not work smoothly and a lot of fine tuning would still have to happen. But the scope and the need for resources should, by that time, 9 months into the transition, already have been obvious. Instead, yet another 5 months later, it just seems to be utter pandemonium, with things amiss at a very basic level in this whole setup. It is incomprehensible that by now, a total of 14 months in, not having the resources is still an issue.

And on top of all of this, not having proper communication from Krembil just makes things worse for us, the volunteers. Platitudes like "this issue is on the radar" or "fixing this is our #1 priority" are of no help here. On the contrary, they just reinforce the feeling that Krembil is totally in over their collective heads at this point...
Ralf

You are assuming IBM told them about everything, and I think IBM had quite a bit of patchwork-type software on their end just to make things work. Don't forget, they could also just borrow some of their own cloud resources. I think IBM sorely and deliberately misinformed any potential party interested in taking over WCG about what was needed (and what they had Frankensteined together) to keep WCG running smoothly with the new generations of hardware that volunteers are capable of having.
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1932 Status: Offline
With connections tied up slowly downloading files, there are fewer available, hence the errors and backoffs. Both more bandwidth, to free up connections sooner, and more available connections are needed.

The lack of connections is not on the Internet-facing side (and hence not "bandwidth" related); it is an issue of getting enough connections from the internal database to the web server(s). The download problem existed regardless of which project is sending WUs, in pretty much any combination. And the vast majority of files that need to be downloaded are <1 KB in size, so bandwidth simply is not an issue - certainly not for "freeing up connections sooner", as all of those files would fit into a single 1500-byte Ethernet frame (the standard maximum Ethernet MTU). Even when downloading a typical MCM1 WU file, which is mostly 290-650 bytes in size, a 33.6 kbit/s dialup line would fetch it in a fraction of a second. And even with just a 100 Mbit/s upload connection (and I am sure that Krembil has an even better one - we have 500 Mbit/s here at the office), that link could, bandwidth-wise, send out tens of thousands of those files EVERY SECOND - if it can get the data for those WUs from the underlying database servers...
Ralf
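A quick payload-only sanity check of that arithmetic (protocol overhead, which in practice dominates for files this small, is deliberately ignored; the point is only the order of magnitude):

    # Payload-only sanity check of the figures above; TCP/HTTP overhead ignored.
    def seconds_per_file(file_bytes, link_bit_s):
        return file_bytes * 8 / link_bit_s

    def files_per_second(file_bytes, link_bit_s):
        return link_bit_s / 8 / file_bytes

    for size in (290, 650):   # typical MCM1 WU download sizes
        print(f"{size:>3} B over 33.6 kbit/s dialup : {seconds_per_file(size, 33_600):.2f} s")
        print(f"{size:>3} B over 100 Mbit/s uplink  : {files_per_second(size, 100_000_000):,.0f} files/s")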
ThreadRipper
Veteran Cruncher Sweden Joined: Apr 26, 2007 Post Count: 1319 Status: Offline
So let's assume it is an issue of the number of connections between the DB server and the web server(s). Then what differs from how IBM had it wired up? Is the server hardware not the same as or better than what IBM had? Because IBM had it running without connectivity issues - what is the difference from that "old"/working setup? Wouldn't trying to approach that setup be helpful? Can Krembil perhaps ask the old IBM techs for some guidance, since this issue has been around for so long now?

Before the OPNG WUs were distributed again, the grid was running fine the last time around. Would a test where no WUs other than OPNG are sent out be relevant - i.e., will the same issue still happen if all other WUs/projects are disabled?
----------------------------------------
Join The International Team: https://www.worldcommunitygrid.org/team/viewTeamInfo.do?teamId=CK9RP1BKX1
AMD TR2990WX @ PBO, 64GB Quad 3200MHz 14-17-17-17-1T, RX6900XT @ Stock
AMD 3800X @ PBO
AMD 2700X @ 4GHz
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1932 Status: Offline
Is the server hardware not the same as or better than what IBM had?

Well, as Cyclops mentioned in his post a bit earlier, Krembil is running WCG now on "real iron", while at IBM, WCG was switched over a couple of years ago to a cloud setup. And honestly, the number of issues after that migration (which required a 3-day shutdown) had increased compared to their previous, non-cloud setup, though all issues that did come up always got resolved by Kevin & Co in very little time.

Can Krembil perhaps ask the old IBM techs for some guidance, since this issue has been around for so long now?

That's where I say that Krembil dropped the ball. They had months before the shutdown, and at least another 3 months while doing the actual physical transfer. And now we are pretty much 6 months into "testing" before they come to realize that they need more hardware resources...

Would a test where no WUs other than OPNG are sent out be relevant?

The download errors happened with pretty much any combination of project WUs; I don't think you can seriously pinpoint it to one specific project. But that hits the same point I have been trying to make for months: Krembil just isn't transparent/communicative enough to decisively determine that. Cyclops mentioned in his post today that they had/have "most valuable alpha testers", and I am seriously wondering who those are and what they have been testing for the last 6 months...
Ralf