World Community Grid Forums
Category: Official Messages Forum: News Thread: 2022-08-19 (Networking Issue Update) |
Thread Status: Active Total posts in this thread: 203
mwroggenbuck
Advanced Cruncher USA Joined: Nov 1, 2006 Post Count: 77 Status: Offline Project Badges: |
Ralf,
I really enjoy your analysis. However, I also think (like Al) there is some sort of infrastructure shortage. How do you explain the non-responsive home page? Especially the 3 pictures that load very slowly. I would think that has to be a bandwidth shortage, or a shortage of CPU power behind that bandwidth. I don't see how a slow web page load could be anything else. If it were a connection type of shortage, I would expect an all-or-nothing scenario. Of course, I could be wrong. I would be interested in your thoughts. Mark |
||
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1931 Status: Offline Project Badges: |
> Ralf,
> I really enjoy your analysis. However, I also think (like Al) there is some sort of infrastructure shortage. How do you explain the non-responsive home page? Especially the 3 pictures that load very slowly. I would think that has to be a bandwidth shortage, or a shortage of CPU power behind that bandwidth. I don't see how a slow web page load could be anything else. If it was a connection type of shortage, I would expect an all or nothing scenario. Of course, I could be wrong. I would be interested in your thoughts. Mark
How do I explain that?? Well, I don't. Besides, I do not visit that page much in the first place, though I did so in the last few days, somewhat unintentionally, after upgrading Firefox on several machines.
And I have yet to notice a real slowdown of either the home page or the forum. This has been working for me (ignoring the certificate snafu a couple of weeks ago) just fine. All the connectivity issues that I have experienced since Sunday night/Monday morning have been with the download of new WUs. Not uploads, not reporting. Hence my previously posted remote analysis of the problem. In fact, that I did not notice any slowdown on the home page or forum is one of the reasons that I believe the problem is with the "processing stack" (for lack of a better word before I finished my first cup of coffee this morning), not a general "bandwidth problem". The home page and forum are certainly hosted on different servers (be they physical or virtual instances).
If I had to make a WAG, also because it has been reported by only a few people, it would be that this is a more local problem on the receiving end, possibly at the ISP, due to an increased number of connection attempts that may look like a DDoS attempt.
Hence, among possible server side issues, my strong suggestion that Krembil gets its ... together and fixes the download problem, rather than people coming up with all kinds of automated workarounds which all cause an artificially increased number of connection attempts, all without actually transferring any data. For a lot of sysadmins at ISPs, this would have all the hallmarks of a DDoS attempt on a specific IP address (Krembil's), and only further packet inspection would show actual data being transferred (the web site contents).
But again, I did not see that web site slowdown myself and can only guess based on a couple of years of experience with an Open Source firewall project...
Ralf |
||
|
mwroggenbuck
Advanced Cruncher USA Joined: Nov 1, 2006 Post Count: 77 Status: Offline Project Badges: |
I did not think to mention that I routinely clear my web browser cache and history (I use a cleaner program). If I do not clean the cache, the home web page is quite fast after the first (slow) load. I would be curious whether you see the same thing if you clear your cache.
----------------------------------------[Edit 1 times, last edit by mwroggenbuck at Sep 23, 2022 3:24:17 PM] |
||
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1931 Status: Offline Project Badges: |
> There's definitely a shortage of "infrastructure" - whether it can be [partially] solved by adjusting system configuration parameters (e.g. available file handles) is unclear, so that brings us back to why they still don't seem to have all the "servers" they apparently planned for.
Good question. A basic question in return would be whether those "servers" are physical ("real iron") or virtual ones. For each of those there can be different reasons why they are not online yet. Physical servers need space in some rack in the data center, with proper networking and monitoring equipment physically present. Virtual servers, well, they need to be "provisioned" and the images restored and booted up again. I think that we are dealing with the latter, and they might have to deal with a rather tardy data center that is "straight by the book" and needs a work order and signatures on 30 different forms for any changes.
Btw, ever since it was mentioned that not all servers are back online, I have had the suspicion that this is the reason why we don't have the stats back up and running. They never hooked the server(s) responsible for processing the stats back up. It could be just as simple as that...
> Of course, when there's not much work available the servers don't get hammered so hard and it might look as if the problems have been resolved - but no, they've just been deferred! And now we have more OPNG and non-retry ARP1 work so it's no surprise it has kicked off again...
Well, yes, that sounds like a possibility. But that should also be a scenario that (now that this has happened repeatedly) should be possible to diagnose and remedy rather quickly. Which unfortunately didn't happen...
> Blount had a point when singling out the network people -- I'd love to be a fly on the wall when Igor Jurisica or one of his [small] team contacts them (yet again?) to ask when their extra servers will be available, as I suspect the frustration levels must be quite high...
Well, yes, I would love to have a front row seat for that as well... But then the core of the problem here is something that Jurisica has had a full year by now to sort out. And we are 7 months into the physical move, 3 months since the soft/test restart. In the 30+ years that I have been doing sysadmin work for a whole lot of different clients, I have dealt with quite a few different data centers, but even for a public/university setting, those response times for things to happen are well beyond reasonable and acceptable.
> Keep the critique going - eventually we might get some much more detailed responses!
I certainly hope so. Being told earlier this week "this is just a transient problem it will go away on its own", sorry, that just triggered my bullshit meter...
> P.S. there may end up being a total bandwidth issue as when there's lots of work available my download rates seem to plummet by 80% or more (which suggests natural throttling...) However, perhaps when the network/infrastructure issues are resolved there'll also be more total bandwidth? I wonder how much total external capacity Sharcnet has :-)
Well, yes, that is a general possibility. But so far, I don't see anything that would lead me to believe that there is a real plan/strategy for getting the current issues fixed. Once those are identified and decisive action is taken to at least try to fix them, it will be much easier to "scale up" any resources necessary.
And to add to the frustration, communication is pretty much non-existent. I am pretty sure that C. has other jobs to do at Krembil, but a steadier and technically sound flow of information should be possible.
Ralf
PS: Also, one reason why I doubt that this is a general "bandwidth issue" is the simple fact that we are not dealing with THAT huge amounts of data; 99.9% (or more) of the files are, in general Internet terms, rather "tiny" (<1KB), especially considering OPN1/OPNG. Those would take only fractions of a second to transfer even on dial-up (a 33,600 bits/sec dial-up line would be able to transmit 3+ KB of data per second, once a connection to the data source is established). |
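As a rough check of the dial-up figure in the PS above, here is a minimal Python sketch. The function name and the rule of thumb of 10 line bits per payload byte (8 data bits plus framing overhead) are assumptions made for illustration and are not taken from the posts; connection setup time is ignored, as in the post.

```python
def transfer_seconds(size_bytes: int, line_bits_per_sec: int, bits_per_byte: int = 10) -> float:
    """Estimate the time to move size_bytes over a line, ignoring connection setup.

    bits_per_byte=10 is an assumed rule of thumb (8 data bits plus overhead);
    pass 8 for the raw, overhead-free figure.
    """
    bytes_per_sec = line_bits_per_sec / bits_per_byte
    return size_bytes / bytes_per_sec


DIALUP_BPS = 33_600  # line speed quoted in the post

throughput_with_overhead = DIALUP_BPS / 10  # bytes per second under the assumed overhead
throughput_raw = DIALUP_BPS / 8             # bytes per second with no overhead at all
one_kb_file = transfer_seconds(1024, DIALUP_BPS)

print(f"with overhead: {throughput_with_overhead:.0f} bytes/s")
print(f"raw:           {throughput_raw:.0f} bytes/s")
print(f"1 KB file:     {one_kb_file:.2f} s")
```

Under these assumptions the line moves between roughly 3.3 and 4.2 thousand bytes per second, so a file of about 1 KB takes well under a second once a connection exists, which is in line with the "3+ KB per second" estimate.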
||
|
phillipspencer
Advanced Cruncher France Joined: Apr 9, 2015 Post Count: 71 Status: Offline Project Badges: |
> How do you explain the non-responsive home page? Especially the 3 pictures that load very slowly. I would think that has to be a bandwidth shortage, or a shortage of CPU power behind that bandwidth. I don't see how a slow web page load could be anything else. If it was a connection type of shortage, I would expect an all or nothing scenario
I think you make a valid point, Mark. I had noticed the Stats overview page (masochist that I am, I keep checking it!) and Forum pages loading very slowly just recently. (Not make-a-cup-of-tea slow, but a definite white screen for a frustrating number of seconds.) Initially, I had assumed it was my end, but when I checked other, more graphics-heavy websites, they loaded as fast as usual. It is very noticeable tonight, which prompted this comment.
Cheers
Phillip
PS: I am finding this remote problem analysis by you, Ralf and others interesting (especially in the absence of detailed information from Krembil). Appreciate the insights. Thanks! |
||
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 869 Status: Offline Project Badges: |
Ralf - thanks for the response.
> But then the core of the problem here is something that Jurisica had a full year by now to sort out.
I wonder if the pre-migration months were full of nothing but lawyers wrangling with one another :-)
And on bandwidth -- I also don't believe the current problems are caused by a lack of bandwidth, but I suspect end users are seeing the effects of there being a lot of traffic at times. So I do wonder what might happen if/when work output gets back up to WCG-IBM levels (if ever...)
On the same point, I noted your response to mwroggenbuck; whilst huge numbers of [mostly unproductive] connection attempts may cause local grief, throughput is [probably] a different issue. I am seeing the same dramatically reduced download speeds on large files regardless of the time of day (or night!), and it was not that bad a week or more ago, so I tend to believe that it is not a local ISP issue. And uploads seem to go as fast as I let them... But that's merely an observation as I don't have access to my ISP's data logs! :-)
Cheers - Al. |
||
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1931 Status: Offline Project Badges: |
>> But then the core of the problem here is something that Jurisica had a full year by now to sort out.
> I wonder if the pre-migration months were full of nothing but lawyers wrangling with one another :-)
But from September 2021 to February 2022, that would have been 5 months to get all the basic stuff figured out: necessary bandwidth, certificates, required rack space incl. connectivity, etc. That's all just "pencil pusher" work that could have been done while the system was still running at IBM. Then, after the shutdown at the end of February 2022: creating snapshots of all servers (supposedly, these were all cloud instances!), backing up all the databases, moving the required stuff (which had been figured out in the 5 months before) to Toronto, and restoring data and VMs.
Then standing up all the server instances one by one as necessary. That included the web site and forum, which came up first (May? I can't find the announcement right now). That would have been 3 months. Then in June, they claimed to have the basic data servers back up to the point where they could send out "test" WUs. When connectivity issues immediately came up, we were told they had to wait for the networking guy to come back from summer vacation. Weeks went by before there was any apparent progress. And now we are here, at the end of September, the leaves are starting to turn color, and we are still not one step further. Not to mention such a silly snafu as forgetting to extend the certificates for the whole thing. And that even after people were harping on about switching the data transfer from http to https, which would have required installing/accessing the certificates, at which point, at the latest, it should have been noticed that they were set to expire soon.
> And on bandwidth -- I also don't believe the current problems are caused by a lack of bandwidth, but I suspect end users are seeing the effects of there being a lot of traffic at times. So I do wonder what might happen if/when work output gets back up to WCG-IBM levels (if ever...)
I think that every transfer problem that shows up (after being able to catch one of those scarce connections) is a symptom rather than the cause of the issues here. I am not sure if most people (no offense intended) actually understand what "bandwidth" entails, and what the actual series of steps is when the BOINC client requests WUs and the data is transferred.
I don't know the details of the way this is now implemented at Krembil (including database clusters, load balancers, data aggregators), but I have been participating long enough in WCG (and other DC projects before that) that I think I have a more than fair understanding of what is involved. Besides, I have myself designed and programmed functionally similar industrial (MRP) setups, only that the flow of data was reversed (tons of rather small(ish) data packets pushed in very rapid succession onto a database server to be processed, instead of sending out lots of small files to a lot of different clients).
> On the same point, i noted your response to mwroggenbuck; whilst huge numbers of [mostly unproductive] connection attempts may cause local grief, throughput is [probably] a different issue. I am seeing the same dramatically reduced download speeds on large files regardless of the time of day (or night!) and it was not that bad a week or more ago I tend to believe that that is not a local ISP issue. And uploads seem to go as fast as I let them... But that's merely an observation as I don't have acess to my ISP's data logs! :-)
Well, that is the problem with any discussion on our side. We can only base our ideas on the results/symptoms we are seeing. And it is very easy to mistake symptoms for causes.
As I mentioned, I have not seen any unusual slowdown of either the web site (the Results page is the only thing I more or less frequently refresh) or the forum, but again, those should be hosted on different server instances and (if properly implemented in the first place) should not be affected by (or have an effect on) the project's database/BOINC side of things. Pointing to the same general domain should be the only commonality, hence, IMPE, things like a (temporary/false positive) DDoS detection could be one possible explanation. There might be others, but that again is something that can only be deduced from the symptoms experienced in the whole picture.
Ralf |
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7578 Status: Offline Project Badges: |
> I certainly hope so. Being told earlier this week "this is just a transient problem it will go away on its own", sorry, that just triggered my bullXXXX meter...
Please.
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1931 Status: Offline Project Badges: |
>> I certainly hope so. Being told earlier this week "this is just a transient problem it will go away on its own", sorry, that just triggered my bullXXXX meter...
> Please |
||
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 869 Status: Offline Project Badges: |
See this post by Christian (cubes) in the "2022-09-15 Update (Networking & Workunits)" News thread. Lots of interesting details (at last)...
Cheers - Al. |
||
|
|