World Community Grid - View Thread - Retry Now...RESOLVED

World Community Grid Forums

Category: Support

Forum: Website Support

Thread: Retry Now...RESOLVED

Thread Status: Active
Total posts in this thread: 49

This topic has been viewed 7330 times and has 48 replies

Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7579
Status: Offline

Re: Retry Now

The lack of connections is not on the Internet facing side (and hence not "bandwidth" related) but it is an issue of getting enough connections from the internal database to web server(s).

Sorry, I should have clarified.
Cheers

----------------------------------------

Sgt. Joe
*Minnesota Crunchers*

[Nov 11, 2022 4:25:19 AM]

alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 873
Status: Offline

Re: Retry Now

The lack of connections is not on the Internet facing side (and hence not "bandwidth" related) but it is an issue of getting enough connections from the internal database to web server(s).

I'm confused...

If we are talking about upload/download problems, the database(s) should not be directly engaged in the process! Uploading and downloading should be able to take place even if the main service is down, as the communications with the database are done when a client makes requests or reports that uploads have been done, with the BOINC core accessing shared filestore to post or collect files. If IBM/WCG altered the upload/download mechanism to include main database access from those servers, that was a bad move!

Now, the forums and the website are completely different matters, and that is where database connections can become an issue here, especially as just about every page access seems to check login status and the authorization mechanism appears to be one of the areas that overloads and eventually crashes.
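One way to picture the point about per-page login checks is a fixed-size pool of database connections shared by every page request. The following is only a minimal sketch under assumptions: the `ConnectionPool` class, the pool size of 10 and the burst of 25 requests are all hypothetical and say nothing about how the WCG website is actually built.

```python
import threading


class ConnectionPool:
    """Hypothetical fixed-size pool; each page view needs one connection."""

    def __init__(self, size: int) -> None:
        self._slots = threading.BoundedSemaphore(size)

    def try_acquire(self) -> bool:
        # Non-blocking: a request that finds no free connection fails at once.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()


def serve_burst(pool: ConnectionPool, requests: int) -> tuple[int, int]:
    """Simulate page views that all arrive before any connection is freed."""
    served = failed = 0
    for _ in range(requests):
        if pool.try_acquire():
            served += 1  # login check got a database connection
        else:
            failed += 1  # pool exhausted: the page errors out
    for _ in range(served):
        pool.release()  # the burst is over, connections go back to the pool
    return served, failed


if __name__ == "__main__":
    pool = ConnectionPool(size=10)         # assumed pool size
    print(serve_burst(pool, requests=25))  # (10, 15)
```

If every page view takes a connection for its login check, the pool size rather than Internet bandwidth decides how many page views can be in flight at once.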

If someone can explain what I'm missing[1]...

Cheers - Al.

[1] Whilst I've helped with some debugging on occasions, I've never run a BOINC server system, so what do I know :-)

[Nov 11, 2022 8:34:59 AM]

TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 1932
Status: Offline

Re: Retry Now

The lack of connections is not on the Internet facing side (and hence not "bandwidth" related) but it is an issue of getting enough connections from the internal database to web server(s).

For starters, there is not a single server (that much I am VERY sure can be assumed), but several. And the data that the Internet-facing server(s) hand to you (the BOINC client) via upload (a download for you, the BOINC client) needs to come from somewhere. It comes from the database servers "behind the scenes", which are not directly connected to the Internet.
Now when you make a request for new WUs, the web server asks the database servers to give it files "from the hopper/feeder", a database server where the projects have loaded their prepared WUs. (Well, first the client will upload WUs, but that is a different battlefield; it is the same basic process, just reversed.) The data that is sent to you is not located/created on the (BOINC) web servers! And it is this connection, behind the scenes, that is the problem. It is not, regardless of what so many people keep pestering about, the Internet-facing server(s) that are the (main) issue, but the connection between the database servers (via the "hopper"/"feeder") and the web-facing server(s). And because multiple servers are involved, there is another layer, a load-balancing proxy server (running the HAProxy software), between the actual database servers and the web server(s). This was confirmed in the only post we have had directly from a WCG tech since the move (Christian, "cubes", a couple of months ago).

A more detailed (in some aspects) diagram of the process can be found at https://www.researchgate.net/figure/Internal-...OINC-server_fig1_41472232

Now, the forums and the website are completely different matters, and that is where database connections can become an issue here, especially as just about every page access seems to check login status and the authorization mechanism appears to be one of the areas that overloads and eventually crashes.

Well, that is kind of a different issue. And yes, something in the current setup is not right and needs to be fixed: the BOINC "web" servers and the web server(s) that provide the web site (including the "Overview" and "Results" pages) as well as the forum should be separate and should not both go down at the same time, unless there is something like the now infamous expired SSL certificate from a few months ago. (The BOINC "web" servers don't serve up a web site per se; they just provide the http(s) protocol for uploads and downloads to/from the BOINC clients, and that is where everyone out there sees the symptoms if there is a fubar behind the scenes.) Even the Overview/Results stuff should physically be on a separate system from the forum. Both have databases behind them; the forum's holds all the user profiles/login info as well as all the forum messages, but that one is (should be) more or less standalone. There is, however, a connection between the aforementioned BOINC servers and the database servers for the Overview/Results pages, in that the latter get their data about the WUs per project/user from the BOINC servers. That is also something Cyclops alluded to in one of his previous posts. As far as that sync of data goes, the Results database (server(s)) and the BOINC servers should not be able to take each other down either. If the BOINC servers are down, the only effect should be that the Results just don't get updated until the BOINC servers are back up, and there should be no way in hell that any issue with the Results server(s) brings down the BOINC servers (at least that part seems to be working, most of the time).
The Results server(s)/database is also what handles both the WCG internal and the external stats, with its stats run happening twice a day. So issues with the stats (like backfilling the missing stats from June through Sep 28th) should affect those servers alone, and not have any influence on the operation of the web site and the BOINC servers.

[1] Whilst I've helped with some debugging on occasions, I've never run a BOINC server system, so what do I know :-)

Me neither, but I have now been around DC (distributed computing) for a couple of decades, starting with SETI well before there ever was BOINC, and have dealt with the user-facing issues at various projects in the past. The basic process as far as BOINC is concerned is the same everywhere, in order to have a unified BOINC client for all the different projects out there. Only the implementation details (single server/cluster, real iron/cloud, etc.) differ, as do the actual BOINC applications (the program that the BOINC client starts to process a WU for any given project, and for which it handles the download/upload).

That whole process is a bit simplified, but should be the general process that is involved here. It is certainly more difficult than what a lot of people in here apparently assume...

And if I got any of the above parts fundamentally wrong (not by minute details), I would welcome any WCG tech to correct me as needed...

Ralf

----------------------------------------
[Edit 2 times, last edit by TPCBF at Nov 11, 2022 4:31:19 PM]

[Nov 11, 2022 4:13:57 PM]

Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7579
Status: Offline

Re: Retry Now

Nice explanation. This is the kind of information which should be coming from Krembil.
Cheers

----------------------------------------

Sgt. Joe
*Minnesota Crunchers*

[Nov 11, 2022 4:45:09 PM]

bfmorse
Senior Cruncher
US
Joined: Jul 26, 2009
Post Count: 294
Status: Offline

Re: Retry Now

Ralf,
Thanks. Fills in a lot of blanks for me - even though I am still processing what you presented.

Bruce

[Nov 11, 2022 5:02:29 PM]

TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 1932
Status: Offline

Re: Retry Now

Ralf,
Thanks. Fills in a lot of blanks for me - even though I am still processing what you presented.

Bruce

Well, the main problem is that there are A LOT of moving parts in this setup. It is far from being as simple as the "just add more bandwidth" that some people have been screaming...

And as Sgt.Joe mentioned, it would be nice to see a description of the actual setup, with all the moving parts, coming from Krembil/WCG. They should already have a map for their own use, to understand how changes in one part might affect another, and which areas are worthwhile to improve. We have now been dealing with those download issues for 7 days, 6 months into restarting sending out work again...

Ralf

----------------------------------------

[Nov 11, 2022 5:08:54 PM]

Mad_Max
Cruncher
Russia
Joined: Nov 26, 2012
Post Count: 22
Status: Offline

Re: Retry Now

No, you are just making assumptions based on symptoms.
The lack of connections is not on the Internet facing side (and hence not "bandwidth" related) but it is an issue of getting enough connections from the internal database to web server(s).
The download problem existed regardless of which project is sending WUs, in pretty much any combination.
.....

Ralf

And you're making stupid assumptions out of nothing!

There are no problems with connecting to the database back-end, because the download problems occur on file servers that DO NOT use databases at all.
Whereas the main project scheduler actively uses the database for generating and distributing WUs, issuing them to clients, and keeping track of issued WUs (all the stuff you show in the scheme/pic above). But at the same time, there are no problems with connecting to it: tasks are generated and issued to clients without any problems ("no tasks available" is not a technical error/problem).

The problems arise at the next step, when the client, having received the tasks from the scheduler (with the necessary queries to the DB already DONE at this stage), tries to download the files needed for them from the file servers, which do not use any database and do not use the two-level (front-end/back-end, which you were trying to describe) structure. They are just simple file servers with static (not dynamically generated) content.

So the processes which USE the database and the two-level front-end/back-end server setup run just fine.
And the processes which DO NOT use a database server have connection and bandwidth issues all the time.

So your statement is completely false!

----------------------------------------
[Edit 3 times, last edit by Mad_Max at Nov 11, 2022 10:38:07 PM]

[Nov 11, 2022 9:43:57 PM]

Mad_Max
Cruncher
Russia
Joined: Nov 26, 2012
Post Count: 22
Status: Offline

Re: Retry Now

All this complex "machinery" described in the few TPCBF posts above happens when the BOINC client connects to the BOINC scheduler to report completed WUs and/or request new WUs.
It happens via this URL: https://scheduler.worldcommunitygrid.org/boinc/wcg_cgi/fcgi

And there are not, and have not been, any problems there lately; all this "large number of gears" is spinning properly and smoothly!

AFTER getting new WUs from the scheduler, and AFTER the scheduler has already completed all requests and modifications to the back-end database, the BOINC client looks for the files needed to process the new WUs it got from the scheduler (usually input scientific data, but they can also be executables or a client-side database if the client does not already have it from one of the previous WUs).

And it contacts simple file server(s) to download files via URLs like
https://download.worldcommunitygrid.org/boinc/download/......
where ...... is a direct path to a static file, such as
https://download.worldcommunitygrid.org/boinc...odock_7.21_windows_x86_64 - executable
https://download.worldcommunitygrid.org/boinc/slideshow/mip1_02_v01.png - picture
https://download.worldcommunitygrid.org/boinc...ef/mcm1.dataset-sarc1.txt - client side database/data file common for multiple WUs

https://download.worldcommunitygrid.org/boinc...453_MCM1_0192095_5453.txt - input data for a single WU (from one of MCM WUs)
And so on...

And here, only at this stage, do download errors occur.
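The two steps described above can be sketched as a toy model in which the scheduler is the only component that touches a database and the file server is a plain static server with a cap on open connections. Everything here is an assumption made for illustration: the class names, the file names and the connection cap are hypothetical and are not taken from BOINC or from the WCG servers.

```python
from dataclasses import dataclass


@dataclass
class Scheduler:
    """Stand-in for the scheduler: the only step here that touches a database."""
    db_queries: int = 0

    def request_work(self, host_id: str) -> list[str]:
        self.db_queries += 1  # assign WUs and record them (step 1)
        return [f"{host_id}_wu_input.txt", "app_executable"]  # made-up names


@dataclass
class FileServer:
    """Stand-in for a static file server with a cap on open connections."""
    max_connections: int
    open_connections: int = 0  # connections currently held by other clients

    def download(self, name: str) -> str:
        if self.open_connections >= self.max_connections:
            raise ConnectionError(f"no free connection for {name}")
        return f"contents of {name}"


def fetch_work(scheduler: Scheduler, file_server: FileServer, host_id: str) -> dict[str, str]:
    """Step 1: ask the scheduler for work. Step 2: fetch each file."""
    names = scheduler.request_work(host_id)
    status = {}
    for name in names:
        try:
            file_server.download(name)
            status[name] = "ok"
        except ConnectionError:
            status[name] = "retry later"  # the client backs off and retries
    return status


if __name__ == "__main__":
    scheduler = Scheduler()
    busy_server = FileServer(max_connections=2, open_connections=2)
    print(fetch_work(scheduler, busy_server, "host1"))
    print("database queries made:", scheduler.db_queries)
```

In this model the scheduler request succeeds and the database is queried exactly once, while every download fails because the file server has no free connection slot, which matches the symptom pattern described in the post.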

Apparently they are related to an insufficient number of sockets, i.e. the number of simultaneous connections that the server is configured to support, or that it is able to support due to software or hardware limitations. We cannot see the exact reason from the client side.

Bandwidth issues are indeed NOT the main/root cause. But they are also present (which can be seen from the very low download speed of large files: they go through without errors, but slowly). And they exacerbate the main problem with the number of connections, because the lower the download speed, the longer connections between the server and clients have to be kept open to transfer large files, and the longer it takes before they are released for other clients who want to download other files and are waiting/backing off after getting download errors. This happens even if a client needs to download just a 1 KB file, because some of the previous clients are still downloading a 100 MB file and keep their connections open for ~10 minutes (150 KiB/s) instead of ~1 minute (1.5 MiB/s) due to low server/data-center bandwidth.
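The arithmetic behind the "~10 minutes versus ~1 minute" figures can be checked with a few lines, and Little's law (average connections in use = arrival rate × time each connection is held) shows why a slower link ties up more connection slots. This is a rough sketch: treating "100 MB" as 100 MiB and the arrival rate of 0.5 large downloads per second are assumptions chosen only for illustration.

```python
KIB = 1024
MIB = 1024 * KIB


def hold_seconds(file_bytes: float, rate_bytes_per_s: float) -> float:
    """Time one connection stays open to transfer the file."""
    return file_bytes / rate_bytes_per_s


def avg_open_connections(arrivals_per_s: float, hold_s: float) -> float:
    """Little's law: average connections in use = arrival rate * hold time."""
    return arrivals_per_s * hold_s


if __name__ == "__main__":
    size = 100 * MIB                       # "100 MB" taken as 100 MiB (assumption)
    slow = hold_seconds(size, 150 * KIB)   # about 683 s, roughly 11 minutes
    fast = hold_seconds(size, 1.5 * MIB)   # about 67 s, roughly 1 minute
    rate = 0.5                             # assumed large downloads started per second

    print(f"slow link: {slow / 60:.1f} min per file, "
          f"{avg_open_connections(rate, slow):.0f} connections busy on average")
    print(f"fast link: {fast / 60:.1f} min per file, "
          f"{avg_open_connections(rate, fast):.0f} connections busy on average")
```

At the same arrival rate, the tenfold slower link keeps about ten times as many connections occupied, so a fixed connection limit is reached much sooner.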

Solving the bandwidth problem will not fix the root cause, but it will reduce the symptoms. Perhaps (but perhaps not) to the point where they become invisible to the average participant. But at least they will become less pronounced and less frequent.

P.S
Right now (and all day today) I do not see download problems: a few hundred files on a few computers downloaded without a single error today.
The download speed of large files has also increased several-fold (from 100-200 KiB/s to 500-900 KiB/s).
This also confirms that the throughput problem and the number-of-connections problem are related to each other. Although of course they are not at all the same thing, and they have different causes and different solutions. It's just that one can affect the other to some extent.

----------------------------------------
[Edit 2 times, last edit by Mad_Max at Nov 12, 2022 12:06:04 AM]

[Nov 11, 2022 11:45:01 PM]

alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 873
Status: Offline

Re: Retry Now

Thanks to Mad Max for pointing out what I think Ralf might have missed in what I was saying -- the linkage between all the stuff in that useful diagram and the upload and download servers is shared file-store! (Serves me right for not going into a lot more detail in that post...)

As Ralf suggested (but perhaps applied in a slightly different context), I'd be delighted to hear from WCG technical staff if there's something non-[BOINC-]standard in the way their upload and download servers work; until then, I have to assume that WCG/IBM left that part of BOINC well alone!...

Cheers - Al.

----------------------------------------
[Edit 1 times, last edit by alanb1951 at Nov 12, 2022 1:12:39 AM]

[Nov 12, 2022 1:10:54 AM]

Paul Schlaffer
Senior Cruncher
USA
Joined: Jun 12, 2005
Post Count: 242
Status: Offline

Re: Retry Now

Nice explanation. This is the kind of information which should be coming from Krembil.
Cheers

Agreed, and thank you Ralf for taking the time to post this.

...and because it's Friday, I'm seeing a new problem on my side. ~~ARP~~ All units are stuck on the upload side, with upload speeds being in the single (yes single) digit KBps. All these months, I never had issues with uploads, only downloads. (sigh)

----------------------------------------

“Where an excess of power prevails, property of no sort is duly respected. No man is safe in his opinions, his person, his faculties, or his possessions.” – James Madison (1792)

----------------------------------------
[Edit 1 times, last edit by Paul Schlaffer at Nov 12, 2022 2:36:12 AM]

[Nov 12, 2022 2:13:55 AM]
