World Community Grid Forums
Category: Official Messages | Forum: News | Thread: 2022-08-19 (Networking Issue Update)
Thread Status: Active | Total posts in this thread: 203
TPCBF
Master Cruncher | USA | Joined: Jan 2, 2011 | Post Count: 1931 | Status: Offline
> To automate retries I use Adri's wcgresults from https://sourceforge.net/projects/wcgtools/files/
> On Linux use command crontab -e to create a timer to run every 15 minutes with option -x.

Beside that, not everyone is running Linux (probably a minority among the hosts); this is just a crutch to get by for the time being, rather than have Krembil fix this problem at its source...

Ralf
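For anyone going that route, here is a minimal crontab sketch; the install path, the log location and the exact behaviour of -x are assumptions on my part, so check the wcgtools documentation first:

# Hypothetical crontab entry (edit with: crontab -e)
# Runs wcgresults with -x every 15 minutes; binary path and log file are assumed.
*/15 * * * * /usr/local/bin/wcgresults -x >> "$HOME/wcgresults.log" 2>&1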
Robokapp
Senior Cruncher | Joined: Feb 6, 2012 | Post Count: 248 | Status: Offline
small indie company and team of volunteers, remember?
Phill23
Advanced Cruncher | Joined: Jan 3, 2006 | Post Count: 59 | Status: Offline
Experiencing nothing but issues when it comes to downloading the work units; the uploads, however, have been perfectly fine. The retry button, I think, will soon be worn out with the number of times I've been pressing it recently :(
Some friends in the US haven't been having the issue, though, but here (UK) and a friend in Germany we have been having the same problems, with the same HTTP error that a few other people seem to be getting :( Such a shame.
JEvenden
Cruncher | Joined: Aug 18, 2005 | Post Count: 2 | Status: Offline
I have several machines which have been running fine, getting work and processing, until yesterday. Most are now not getting work, and the one I am on has 6 files hung in transfer and only 4 of the usual 8 tasks running. I will be at a full stop in 4 hours.
PMH_UK
Veteran Cruncher | UK | Joined: Apr 26, 2007 | Post Count: 764 | Status: Offline
> Beside that not everyone is running Linux (probably a minority among the hosts), this is just a crutch to get by for the time being, rather than have Krembil fix this problem at its source... Ralf

True, I'm in a minority running only Linux now. wcgresults could probably be run on Windows under Cygwin or WSL. Others have posted ways to automate retry in this and/or other threads. Also, someone just posted a script to retry on multiple systems from one.

Paul.
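On the multi-system retry idea: not the script referred to above, but one illustrative way to do it from a single machine, assuming every client allows remote RPC and shares a GUI RPC password (the host names, the password file and the reliance on boinccmd --network_available are all assumptions):

#!/bin/sh
# Illustrative sketch only -- not the script mentioned in the thread.
# Assumes each host has remote RPC enabled (remote_hosts.cfg) and that the
# shared GUI RPC password is stored in $HOME/.boinc_rpc_pw.
PW=$(cat "$HOME/.boinc_rpc_pw")
for h in host1.local host2.local host3.local; do   # hypothetical host names
    # --network_available tells the client the network is up, prompting it
    # to retry deferred file transfers and scheduler requests.
    boinccmd --host "$h" --passwd "$PW" --network_available \
        || echo "retry request to $h failed" >&2
done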
Sgt.Joe
Ace Cruncher | USA | Joined: Jul 4, 2006 | Post Count: 7578 | Status: Offline
> Some friends in the US haven't been having the issue, though, but here (UK) and a friend in Germany have been having the same issues, with the same HTTP error that a few other people seem to be getting :( Such a shame.

Nope. Still having the same issues here. It is still a Krembil problem, probably everywhere.

Cheers
Sgt. Joe
*Minnesota Crunchers*
bfmorse
Senior Cruncher | US | Joined: Jul 26, 2009 | Post Count: 294 | Status: Offline
I’m in the US and still having issues!
TPCBF
Master Cruncher | USA | Joined: Jan 2, 2011 | Post Count: 1931 | Status: Offline
> True, I'm in a minority running only Linux now.

Linux or not, the bottom line is that workarounds like this are not going to fix the situation. In fact, they are likely to make things worse if workarounds/crutches like this are used in larger numbers, due to an ever increasing barrage of connection attempts on the server side.

> wcgresults could probably be run on Windows under Cygwin or WSL. Others have posted ways to automate retry in this and/or other threads. Also someone just posted a script to retry on multiple systems from one. Paul.

And as I mentioned before, I doubt that this is a "bandwidth" issue. The longer it keeps going, the more I am convinced that this is a limitation on the number of concurrent file handles on the (cluster) file system of the server(s), the number of concurrent connections on the database(s) being used, or the number of concurrent connections on the web server(s) being used. Most likely it is even a combination of those things. I don't think it is a direct problem of processing power of either the database or web server(s), as the latter at least is able to send proper 503 error messages back. If the web server were so overburdened that it didn't answer at all when a connection is attempted, a 408 (timeout) client-side HTTP error would be the more likely outcome...

Ralf
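As a side note, a quick way to see which case a client is actually hitting is to probe the server and log the status code. This is only a sketch; the WCG home page URL is a stand-in, and you would substitute the scheduler or upload URL shown in the BOINC event log:

# Prints the HTTP status code (e.g. 503), or 000 if no response arrives
# before the timeout (which would point at a connection/timeout problem instead).
curl -sS -o /dev/null -w '%{http_code}\n' --max-time 30 'https://www.worldcommunitygrid.org/'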
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 869 | Status: Offline
Ralf - good analysis...
> And as I mentioned before, I doubt that this is a "bandwidth" issue. The longer it keeps going, the more I am convinced that this is a limitation on the number of concurrent file handles on the (cluster) file system of the server(s), the number of concurrent connections on the database(s) being used, or the number of concurrent connections on the web server(s) being used. Most likely it is even a combination of those things.

There's definitely a shortage of "infrastructure" - whether it can be [partially] solved by adjusting system configuration parameters (e.g. available file handles) is unclear, so that brings us back to why they still don't seem to have all the "servers" they apparently planned for.

> I don't think it is a direct problem of processing power of either the database or web server(s), as the latter at least is able to send proper 503 error messages back. If the web server were so overburdened that it didn't answer at all when a connection is attempted, a 408 (timeout) client-side HTTP error would be the more likely outcome...

Of course, when there's not much work available the servers don't get hammered so hard and it might look as if the problems have been resolved - but no, they've just been deferred! And now we have more OPNG and non-retry ARP1 work, so it's no surprise it has kicked off again...

Blount had a point when singling out the network people -- I'd love to be a fly on the wall when Igor Jurisica or one of his [small] team contacts them (yet again?) to ask when their extra servers will be available, as I suspect the frustration levels must be quite high...

Keep the critique going - eventually we might get some much more detailed responses!

Cheers - Al.

P.S. There may end up being a total bandwidth issue, as when there's lots of work available my download rates seem to plummet by 80% or more (which suggests natural throttling...) However, perhaps when the network/infrastructure issues are resolved there'll also be more total bandwidth? I wonder how much total external capacity Sharcnet has :-)