World Community Grid Forums
Category: Support Forum: Website Support Thread: Retry Now...RESOLVED |
Thread Status: Active Total posts in this thread: 49
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7579 Status: Offline Project Badges: |
The lack of connections is not on the Internet facing side (and hence not "bandwidth" related) but it is an issue of getting enough connections from the internal database to web server(s).

Sorry, I should have clarified.
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 873 Status: Offline Project Badges: |
The lack of connections is not on the Internet facing side (and hence not "bandwidth" related) but it is an issue of getting enough connections from the internal database to web server(s).

I'm confused... If we are talking about upload/download problems, the database(s) should not be directly engaged in the process! Uploading and downloading should be able to take place even if the main service is down, as the communications with the database are done when a client makes requests or reports that uploads have been done, with the BOINC core accessing shared filestore to post or collect files. If IBM/WCG altered the upload/download mechanism to include main database access from those servers, that was a bad move!

Now, the forums and the website are completely different matters, and that is where database connections can become an issue here, especially as just about every page access seems to check login status and the authorization mechanism appears to be one of the areas that overloads and eventually crashes.

If someone can explain what I'm missing[1]...
Cheers - Al.
[1] Whilst I've helped with some debugging on occasions, I've never run a BOINC server system, so what do I know :-) |
||
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1932 Status: Offline Project Badges: |
The lack of connections is not on the Internet facing side (and hence not "bandwidth" related) but it is an issue of getting enough connections from the internal database to web server(s).

I'm confused... If we are talking about upload/download problems, the database(s) should not be directly engaged in the process! Uploading and downloading should be able to take place even if the main service is down, as the communications with the database are done when a client makes requests or reports that uploads have been done, with the BOINC core accessing shared filestore to post or collect files. If IBM/WCG altered the upload/download mechanism to include main database access from those servers, that was a bad move!

Now when you make a request for new WUs, the web server is asking the database servers to give it files "from the hopper/feeder", a database server where the projects have loaded up their prepared WUs. (Well, first the client will upload WUs, but that is a different battlefield, though the same basic process, just reversed.) The data that is sent to you is not located/created on the (BOINC) web servers! And it is this connection, behind the scenes, that is the problem. It is not, regardless of what so many people are always pestering about, the Internet facing server(s) that are the (main) issue, but the connection between the database servers (via the "hopper"/"feeder") and the web facing server(s). And because there are multiple of those servers involved, there is another layer, a load balancing proxy server (running the HAProxy software) between the actual database servers and the web server(s). This was confirmed in the only post we ever got since the move directly from a WCG tech (Christian, "cubes", a couple of months ago).
A more detailed (in some aspects) diagram of the process can be found at https://www.researchgate.net/figure/Internal-...OINC-server_fig1_41472232

Now, the forums and the website are completely different matters, and that is where database connections can become an issue here, especially as just about every page access seems to check login status and the authorization mechanism appears to be one of the areas that overloads and eventually crashes.

Well, that is kind of a different issue. And yes, something in the current setup is not right and needs to be fixed: the BOINC "web" servers (they don't serve up a web site per se, they just provide the http(s) protocol for the upload and download to/from the BOINC clients, and that is the part where everyone out there sees the symptoms if there is a fubar behind the scenes) and the web server(s) that provide the web site (including the "Overview" and "Results" pages) as well as the forum should be separate and should not both go down at the same time, unless there is something like the now infamous expired SSL certificate from a few months ago. Even the Overview/Results stuff should physically be on a separate system from the forum. Both have databases behind them, the forum's being the one that holds all the user profiles/login info as well as all the forum messages. But that one is (should be) more or less standalone. There is however a connection between the aforementioned BOINC servers and the database servers for the Overview/Results pages, in that the latter get their data about the WUs per project/user from the BOINC servers. That is also something that Cyclops alluded to in one of his previous posts.
As far as that sync of data goes, the Results database (server(s)) and the BOINC servers should not be able to take each other down either. If the BOINC servers are down, the only effect should be that the Results just don't get updated until the BOINC servers are back up, and there should be no way in hell that any issue with the Results server(s) should bring down the BOINC servers (at least that part seems to be working, most of the time). The Results server(s)/database is also what handles both the WCG internal and the external stats, with its stats run happening twice a day. So issues with the stats (like backfilling the missing stats from June through Sep 28th) should affect those servers alone, and not otherwise influence the operation of the web site and the BOINC servers.

[1] Whilst I've helped with some debugging on occasions, I've never run a BOINC server system, so what do I know :-)

Me neither, but I have now been around DC (distributed computing) for a couple of decades, starting with SETI well before there ever was BOINC, and have dealt with the user-facing issues at various projects in the past. The basic process as far as BOINC is concerned is the same, in order to have a unified BOINC client for all the different projects out there. Just the implementation details (single server/cluster, real iron/cloud, etc.) differ, as do the actual BOINC applications (the program that the BOINC client starts to process a WU for any given project, and for which it handles the download/upload).

That whole description is a bit simplified, but should be the general process that is involved here. It is certainly more difficult than what a lot of people in here apparently assume... And if I got any of the above parts fundamentally wrong (not by minute details), I would welcome any WCG tech to correct me as needed...
Ralf
[Edit 2 times, last edit by TPCBF at Nov 11, 2022 4:31:19 PM] |
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7579 Status: Offline Project Badges: |
Nice explanation. This is the kind of information which should be coming from Krembil.
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
bfmorse
Senior Cruncher US Joined: Jul 26, 2009 Post Count: 294 Status: Offline Project Badges: |
Ralf,
Thanks. Fills in a lot of blanks for me - even though I am still processing what you presented. Bruce |
||
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1932 Status: Offline Project Badges: |
Ralf, Thanks. Fills in a lot of blanks for me - even though I am still processing what you presented. Bruce

Well, the main problem is that there are A LOT of moving parts in this setup. It is far from being as simple as the "just add more bandwidth" that some people have been screaming...

And as Sgt.Joe mentioned, it would be nice to see a description of the actual setup, with all the moving parts, coming from Krembil/WCG. They should have a map already for their own use, to understand how changes in one part might affect another, and which areas are worthwhile targets for improvement. We have now been dealing with those download issues for 7 days, 6 months into restarting sending out work again...
Ralf |
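A minimal sketch of what such a map could look like as a data structure, assuming only the hard dependencies that Ralf describes in this thread (none of them confirmed by WCG staff). The component names and the `impacted_by` helper are hypothetical and are there for illustration only.

```python
from collections import deque

# Hypothetical map: component -> components it directly depends on.
# The edges follow Ralf's description in this thread, not a confirmed
# WCG/Krembil architecture.
DEPENDS_ON = {
    "boinc_http_servers": ["haproxy"],
    "haproxy": ["boinc_database"],
    "boinc_database": [],
    "results_pages": ["results_database"],
    "results_database": [],
    "forum": ["forum_database"],
    "forum_database": [],
}


def impacted_by(failed, depends_on=DEPENDS_ON):
    """Return every component that directly or indirectly depends on `failed`."""
    # Invert the edges so we can walk from a component to its dependents.
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(name)

    # Breadth-first walk over the inverted edges.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for name in dependents[current]:
            if name not in seen:
                seen.add(name)
                queue.append(name)
    return seen


if __name__ == "__main__":
    print(sorted(impacted_by("boinc_database")))
    print(sorted(impacted_by("forum_database")))
```

With a map like this, the question "what else goes down if X fails?" becomes a graph walk instead of guesswork from symptoms. The value lies in the real edges, which only the operators know.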
||
|
Mad_Max
Cruncher Russia Joined: Nov 26, 2012 Post Count: 22 Status: Offline Project Badges: |
No, you are just making assumptions based on symptoms. The lack of connections is not on the Internet facing side (and hence not "bandwidth" related) but it is an issue of getting enough connections from the internal database to web server(s). The download problem existed regardless of which projects is sending WUs, in pretty much any combination. ..... Ralf

And you're making stupid assumptions out of nothing! There are no problems with connecting to the database back end, because the download problems occur on file servers that DO NOT use databases at all.

The main project scheduler, by contrast, actively uses the database for generating, distributing and issuing WUs to clients and for keeping track of issued WUs (all the stuff you show in the scheme/picture above). But there are no problems with connecting to it: tasks are generated and issued to clients without any trouble ("no tasks available" is not a technical error/problem).

The problems arise at the next step, when the client, having received its tasks from the scheduler (with the necessary DB queries already DONE at this stage), tries to download the files those tasks need from the file servers, which do not use any database and do not have the two-level (front-end/back-end) structure you were trying to describe. They are just simple file servers with static (not dynamically generated) content.

So the processes that USE the database and the two-level front-end/back-end setup run just fine, and the processes that DO NOT use a database server have connection and bandwidth issues all the time. So your statement is completely false!
[Edit 3 times, last edit by Mad_Max at Nov 11, 2022 10:38:07 PM] |
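The two steps described in this post can be sketched as a toy model, assuming a scheduler that touches a database once per work request and a static file server with a fixed connection limit. All class names, the path format and the limit of 2 are hypothetical; this is not BOINC's or WCG's actual code.

```python
from dataclasses import dataclass


@dataclass
class Database:
    """Stand-in for the back-end database; only counts queries."""
    queries: int = 0

    def assign_tasks(self, host: str, count: int) -> list[str]:
        self.queries += 1
        return [f"{host}_task_{i}" for i in range(count)]


@dataclass
class Scheduler:
    """Step 1: database-backed work request."""
    db: Database

    def request_work(self, host: str, count: int) -> list[str]:
        tasks = self.db.assign_tasks(host, count)
        # The reply only names the files; it does not deliver them.
        return [f"/download/{task}.txt" for task in tasks]


@dataclass
class FileServer:
    """Step 2: static file server with a connection limit and no database."""
    max_connections: int
    open_connections: int = 0

    def open_download(self, path: str) -> bool:
        if self.open_connections >= self.max_connections:
            return False  # the client would see a download error and back off
        self.open_connections += 1
        return True


def run_demo() -> tuple[int, list[bool]]:
    db = Database()
    scheduler = Scheduler(db)
    files = FileServer(max_connections=2)  # hypothetical limit
    paths = scheduler.request_work("host1", 3)
    results = [files.open_download(path) for path in paths]
    return db.queries, results


if __name__ == "__main__":
    print(run_demo())
```

In this model the third download is refused even though the single database query succeeded, which is the shape of the argument above: a failure at the file-server step says nothing about the health of the database path.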
||
|
Mad_Max
Cruncher Russia Joined: Nov 26, 2012 Post Count: 22 Status: Offline Project Badges: |
All this complex "machinery" described in TPCBF's posts above comes into play when the BOINC client connects to the BOINC scheduler to report completed WUs and/or request new WUs. It happens via this URL: https://scheduler.worldcommunitygrid.org/boinc/wcg_cgi/fcgi
There are no problems there, and there have not been any lately; all this "large number of gears" is spinning properly and smoothly!

AFTER getting new WUs from the scheduler, and AFTER the scheduler has already completed all requests and modifications to the back-end database, the BOINC client looks for the files needed to process the new WUs (usually input scientific data, but also executables or a client-side database if the client does not already have them from previous WUs). It contacts simple file server(s) to download the files via URLs like https://download.worldcommunitygrid.org/boinc/download/...... where .... is a direct path to the static file, like:
https://download.worldcommunitygrid.org/boinc...odock_7.21_windows_x86_64 - executable
https://download.worldcommunitygrid.org/boinc/slideshow/mip1_02_v01.png - picture
https://download.worldcommunitygrid.org/boinc...ef/mcm1.dataset-sarc1.txt - client-side database/data file common to multiple WUs
https://download.worldcommunitygrid.org/boinc...453_MCM1_0192095_5453.txt - input data for a single WU (from one of the MCM WUs)
And so on...

And here, only at this stage, do download errors occur. Apparently they are related to an insufficient number of sockets, i.e. the number of simultaneous connections that the server is configured for, or that it is able to support due to software or hardware limitations. We cannot see the exact reason from the client side.

Bandwidth issues are indeed NOT the main/root cause. But they are also present (which can be seen from the very low download speed of large files: they come through without errors, but slowly), and they exacerbate the main problem with the number of connections. The lower the download speed, the longer the connections between the server and clients must stay open to transfer large files, and the longer it takes before they are released for other clients, who meanwhile get download errors and wait/back off. That applies even to a client that needs to download just a 1 KB file, because some earlier clients are still downloading a 100 MB file and keep their connections open for ~10 minutes (at 150 KiB/s) instead of ~1 minute (at 1.5 MiB/s) due to low server/data-center bandwidth.

Solving the bandwidth problem will not fix the root cause, but it will reduce the symptoms, perhaps (but perhaps not) to the point where they become invisible to the average participant. At the least they will become less pronounced and less frequent.

P.S. Right now (and all day today) I do not see any download problems: a few hundred files on a few computers downloaded without a single error today. The download speed for large files has also increased several-fold (from 100-200 KiB/s to 500-900 KiB/s), which also confirms that the throughput problem and the number-of-connections problem are related to each other. Of course they are not the same thing at all; they have different causes and different solutions. It's just that one can affect the other to some extent.
[Edit 2 times, last edit by Mad_Max at Nov 12, 2022 12:06:04 AM] |
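The arithmetic in this post can be checked with a short sketch. It assumes the "100 MB" file is 100 MiB and uses Little's law (average open connections = arrival rate × hold time) as a rough steady-state estimate. The arrival rate of 0.5 downloads per second is a made-up number for illustration, not a measured WCG figure, and the function names are hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB


def transfer_seconds(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Time one connection stays open to move `size_bytes` at a given rate."""
    return size_bytes / rate_bytes_per_s


def average_open_connections(arrivals_per_s: float, hold_seconds: float) -> float:
    """Little's law, L = lambda * W: average number of connections in use.

    Only a steady-state approximation; it ignores refused connections
    and client back-off.
    """
    return arrivals_per_s * hold_seconds


if __name__ == "__main__":
    size = 100 * MIB                          # assumption: "100 MB" read as 100 MiB
    slow = transfer_seconds(size, 150 * KIB)  # 150 KiB/s
    fast = transfer_seconds(size, 1.5 * MIB)  # 1.5 MiB/s
    print(f"slow: {slow / 60:.1f} min, fast: {fast / 60:.1f} min")

    arrivals = 0.5  # hypothetical large-file downloads started per second
    print(f"open connections, slow: {average_open_connections(arrivals, slow):.0f}")
    print(f"open connections, fast: {average_open_connections(arrivals, fast):.0f}")
```

The hold times come out at roughly 11.4 minutes versus 1.1 minutes, in line with the ~10 minutes and ~1 minute quoted above. At the same arrival rate, a tenfold drop in speed means about ten times as many connections are open at once, which is how a bandwidth shortfall can push a server into its connection limit without being the root cause.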
||
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 873 Status: Offline Project Badges: |
Thanks to Mad Max for pointing out what I think Ralf might have missed in what I was saying -- the linkage between all the stuff in that useful diagram and the upload and download servers is shared file-store! (Serves me right for not going into a lot more detail in that post...)

As Ralf suggested (but perhaps applied in a slightly different context), I'd be delighted to hear from WCG technical staff if there's something non-[BOINC-]standard in the way their upload and download servers work; until then, I have to assume that WCG/IBM left that part of BOINC well alone!...
Cheers - Al.
[Edit 1 times, last edit by alanb1951 at Nov 12, 2022 1:12:39 AM] |
||
|
Paul Schlaffer
Senior Cruncher USA Joined: Jun 12, 2005 Post Count: 242 Status: Offline Project Badges: |
Nice explanation. This is the kind of information which should be coming from Krembil. Cheers

Agreed, and thank you Ralf for taking the time to post this.

...and because it's Friday, I'm seeing a new problem on my side.

“Where an excess of power prevails, property of no sort is duly respected. No man is safe in his opinions, his person, his faculties, or his possessions.” – James Madison (1792)
[Edit 1 times, last edit by Paul Schlaffer at Nov 12, 2022 2:36:12 AM] |
||
|
|