World Community Grid - View Thread - 2022-09-15 Update (Networking & Workunits)

World Community Grid Forums

Category: Official Messages

Forum: News

Thread: 2022-09-15 Update (Networking & Workunits)

Thread Status: Active
Total posts in this thread: 214

This topic has been viewed 149886 times and has 213 replies

Robokapp
Senior Cruncher
Joined: Feb 6, 2012
Post Count: 264
Status: Offline
Project Badges:

2 year badge for Human Proteome Folding - Phase 2

2 year badge for Help Fight Childhood Cancer

180 day badge for Help Cure Muscular Dystrophy - Phase 2

2 year badge for The Clean Energy Project - Phase 2

2 year badge for Computing for Clean Water

2 year badge for Drug Search for Leishmaniasis

2 year badge for GO Fight Against Malaria

180 day badge for Computing for Sustainable Water

50 year badge for Mapping Cancer Markers

2 year badge for Uncovering Genome Mysteries

5 year badge for Outsmart Ebola Together

10 year badge for FightAIDS@Home - Phase 2

10 year badge for Microbiome Immunity Project

10 year badge for Africa Rainfall Project

10 year badge for OpenPandemics - COVID-19


Re: 2022-09-15 Update (Networking & Workunits)

But yes, people mashing the update button doesn't help the system, and it's an incredibly wasteful use of one's time.

Scripts and autoclickers do the mashing. I doubt any noteworthy amount of retries is done by hand.

Sadly, they also mash when no mashing is needed, further stressing the server.

[Sep 26, 2022 5:16:06 AM]

marbesoz
Cruncher
Joined: Jul 4, 2020
Post Count: 8
Status: Offline


Re: 2022-09-15 Update (Networking & Workunits)

In a testing phase after six months? We cannot keep being "politically correct"; it is time to say that IBM has transferred WCG to an organization that lacks adequate competence. I am very angry ...

[Sep 26, 2022 11:44:08 AM]

Grumpy Swede
Master Cruncher
Svíþjóð
Joined: Apr 10, 2020
Post Count: 2494
Status: Offline
Project Badges:

10 year badge for Mapping Cancer Markers

14 day badge for FightAIDS@Home - Phase 2

1 year badge for Microbiome Immunity Project

90 day badge for Africa Rainfall Project

2 year badge for OpenPandemics - COVID-19


Re: 2022-09-15 Update (Networking & Workunits)

Well, according to external stats sites like BoincStats, only 16156 WCG users have earned any credit at all in the last 24 hours. That is nothing compared to before the transfer to Krembil/Jurisica Lab. So if WCG under Krembil/Jurisica Lab has such enormous problems with only 16156 users, just think about what will happen if the rest of the WCG users decide to return. 16156 is just a fraction of the number of users WCG had under IBM.

Even if they fix the problems they have now with downloads, they will come to a screeching halt if even a small number of the WCG users from before the transition return.

This is an infrastructure problem for sure. They need more real iron, and not just more VMs. I also really question whether SHARCNET really has enough bandwidth to host WCG. This is certainly not looking good. This migration was certainly not well thought out, and I will not be surprised at all if Krembil decides to tell Jurisica Lab to just pull the plug on WCG.

----------------------------------------
[Edit 2 times, last edit by Grumpy Swede at Sep 26, 2022 2:46:17 PM]

[Sep 26, 2022 2:34:29 PM]

TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 2173
Status: Offline
Project Badges:

10 year badge for Help Fight Childhood Cancer

2 year badge for Help Cure Muscular Dystrophy - Phase 2

5 year badge for The Clean Energy Project - Phase 2

2 year badge for Computing for Sustainable Water

200 year badge for Mapping Cancer Markers

5 year badge for Uncovering Genome Mysteries

50 year badge for Outsmart Ebola Together

20 year badge for FightAIDS@Home - Phase 2

50 year badge for Smash Childhood Cancer

50 year badge for Microbiome Immunity Project

5 year badge for Africa Rainfall Project

100 year badge for OpenPandemics - COVID-19


Re: 2022-09-15 Update (Networking & Workunits)

I don't remember the exact number now, but I think the number of ACTIVE users before the move was around 80,000. So we are at around 20-25% of the previous number of active users. And with it taking pretty much a year to get things going again, I am not sure we will (ever) get to those pre-move numbers again...

This is an infrastructure problem for sure. They need more real iron, and not just more VMs. I also really question whether SHARCNET really has enough bandwidth to host WCG.

I am not sure if you have followed previous posts by me (and Alan and a couple of others), but the current issues are certainly not a "bandwidth" issue; they are a connection issue. Those two things are decidedly different, and the latter needs to be solved at Krembil's end; it's not an ISP issue. The post from Christian ("cubes") kind of confirmed this, and his reply has so far been the first one that makes me think they are finally on the right track...

Haven't seen any OPNG WUs since I got up this morning (there were a couple of batches late last night (PDT)), but so far, I have not seen ANY stuck or stalled downloads this morning. But then I might have just jinxed it and the problems might be back as soon as OPNG WUs are again released into the wild... ;-)

This is certainly not looking good. This migration was certainly not well thought out,

No argument from me here. I think a lot of the current/recent problems could have been prevented/minimized in the months between the announcement and the move, as well as in the months before they started to bring things back online. (Certificates!)

and I will not be surprised at all if Krembil decides to tell Jurisica Lab to just pull the plug on WCG.

I seriously hope not. Besides, that would be VERY bad publicity for Krembil, and I am not sure that is something they can really afford. It would cast serious doubt on their capability as a research institute...

Ralf

[Sep 26, 2022 6:44:01 PM]

Paul Schlaffer
Senior Cruncher
USA
Joined: Jun 12, 2005
Post Count: 278
Status: Offline
Project Badges:

10 year badge for Human Proteome Folding - Phase 2

180 day badge for Discovering Dengue Drugs - Together

180 day badge for Nutritious Rice for the World

14 day badge for The Clean Energy Project

14 day badge for Influenza Antiviral Drug Search

50 year badge for The Clean Energy Project - Phase 2

100 year badge for Mapping Cancer Markers

10 year badge for Uncovering Genome Mysteries

2 year badge for Outsmart Ebola Together

2 year badge for FightAIDS@Home - Phase 2

20 year badge for Microbiome Immunity Project

2 year badge for Africa Rainfall Project

5 year badge for OpenPandemics - COVID-19


Re: 2022-09-15 Update (Networking & Workunits)

For the latter part, I sure hope not. I am grateful they were willing to take this on after IBM decided they were done. I've been part of this charitable project from the beginning. However, it's clear they have been in over their heads with this project. If it does at some point move on, it should go to someone like Elon Musk, who would have the resources to regrow the project and make it succeed. There's certainly no argument that we've lost a lot of contributors due to the many months of downtime and the buggy restart.

----------------------------------------

“Where an excess of power prevails, property of no sort is duly respected. No man is safe in his opinions, his person, his faculties, or his possessions.” – James Madison (1792)

----------------------------------------
[Edit 2 times, last edit by Paul Schlaffer at Sep 27, 2022 2:47:02 AM]

[Sep 27, 2022 2:02:18 AM]

sam6861
Advanced Cruncher
Joined: Mar 31, 2020
Post Count: 107
Status: Offline
Project Badges:

20 year badge for Mapping Cancer Markers

45 day badge for FightAIDS@Home - Phase 2

180 day badge for Smash Childhood Cancer

5 year badge for Microbiome Immunity Project


Re: 2022-09-15 Update (Networking & Workunits)

Our load balancer runs HAProxy.

Proxy, and 65535 TCP connections limit per address...

An HTTP server may be able to handle millions of incoming connections, but if this is a proxy that forwards all those connections to only 1 or 2 addresses, then with as few as 65535 connections to 1 address or 131070 to 2 addresses, it will run out of TCP ports and have random "no server available" problems. Unsure if this would help, but a possible workaround is to add more IPv4/IPv6 addresses to the upload/download server, maybe with virtual addresses or something.
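A rough sketch of that arithmetic, for illustration only. A TCP connection is identified by source address, source port, destination address and destination port, so a proxy connecting from one source address to one backend address and port can hold at most one connection per source port. The function name and parameters below are hypothetical, and the 65535 figure is taken from the post as an assumption; the usable ephemeral port range on a real system is normally smaller, and nothing here reflects how WCG's HAProxy is actually configured.

```python
# Rough upper bound on concurrent proxy-to-backend TCP connections.
# Assumption: every (source IP, backend address) pair can use up to
# 65535 source ports, the figure quoted in the post. Real ephemeral
# port ranges are usually smaller than this.

PORTS_PER_PAIR = 65535


def max_backend_connections(source_ips: int, backend_addresses: int,
                            ports_per_pair: int = PORTS_PER_PAIR) -> int:
    """Return the upper bound on simultaneous proxy-to-backend connections."""
    if source_ips < 1 or backend_addresses < 1:
        raise ValueError("need at least one source IP and one backend address")
    return source_ips * backend_addresses * ports_per_pair


if __name__ == "__main__":
    # One proxy source address talking to 1, 2 or 4 backend addresses.
    for backends in (1, 2, 4):
        print(backends, "backend address(es):",
              max_backend_connections(1, backends))
```

Under these assumptions the bound grows linearly with each extra backend address (or extra source address on the proxy), which is the idea behind the suggested workaround.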

Slow ARP1 downloads at 50 MegaByte per second have often been a problem as well. Might be either slow network connection or slow server storage, unsure.

[Sep 27, 2022 3:38:37 AM]

TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 2173
Status: Offline
Project Badges:


Re: 2022-09-15 Update (Networking & Workunits)

Our load balancer runs HAProxy.

Sorry, but you don't know what you are talking about. HAProxy is a proxy built specifically for load balancing and has no 64k TCP port limitation. To quote:

Servers equipped with 6 to 8 cores generally achieve between 200000 and 500000 requests per second, and have no trouble saturating a 25 Gbit/s connection under Linux

Ralf

[Sep 27, 2022 5:52:40 AM]

Ingleside
Veteran Cruncher
Norway
Joined: Nov 19, 2005
Post Count: 974
Status: Offline
Project Badges:

1 year badge for Nutritious Rice for the World

1 year badge for The Clean Energy Project

180 day badge for Influenza Antiviral Drug Search

1 year badge for Discovering Dengue Drugs - Together - Phase 2

5 year badge for FightAIDS@Home - Phase 2

20 year badge for OpenPandemics - COVID-19


Re: 2022-09-15 Update (Networking & Workunits)

Slow ARP1 downloads at 50 MegaByte per second have often been a problem as well. Might be either slow network connection or slow server storage, unsure.

If ARP1 downloads really were at 50 megabytes per second, even the largest file would download in 1 second and there wouldn't be any problems.

Unfortunately, in reality ARP1 downloads have been down to around 50 kilobytes per second for any of the 10+ MB files, and at 1+ minute per file this ties up the download server for the same amount of time. With multiple large files, I wouldn't be surprised if a single ARP1 WU took close to 10 minutes of download time (not counting all the hours spent waiting to actually get a connection).

Thankfully, it seems ARP1 downloads have greatly improved, since I did manage to get a single new WU where the input_d0? files came down at 1 to 2.5 megabytes/s and the largest file at roughly 4 megabytes/s.

Whether this is a real improvement due to the download servers not being swamped with all the tiny 1 KB files for other types of WUs is more difficult to know, since with ARP1 "committed to other platforms" it doesn't really look like there's much ARP1 going out at all.
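To put the speeds above side by side, here is a back-of-the-envelope sketch. It assumes a single 10 MB file (the post only says "10+ MB"), decimal units, a constant transfer rate, and no time spent waiting for a connection; the function name and the rate table are made up for illustration.

```python
# Back-of-the-envelope download time for one ARP1 input file.
# Assumptions: a 10 MB file, decimal units (1 MB = 1000 kB), a constant
# transfer rate, and no time spent waiting for a connection.

FILE_MB = 10.0


def download_seconds(file_mb: float, rate_mb_per_s: float) -> float:
    """Seconds needed to transfer file_mb megabytes at rate_mb_per_s."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    return file_mb / rate_mb_per_s


# Rates mentioned in the thread, expressed in MB/s.
RATES_MB_PER_S = {
    "50 MB/s": 50.0,
    "50 kB/s": 0.05,
    "1 MB/s": 1.0,
    "4 MB/s": 4.0,
}

if __name__ == "__main__":
    for label, rate in RATES_MB_PER_S.items():
        print(f"{label}: {download_seconds(FILE_MB, rate):.1f} s")
```

With these assumptions a 10 MB file needs about 200 seconds at 50 kB/s but only a few seconds at 1 to 4 MB/s, so each file occupies a download slot for far less time at the improved speeds.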

----------------------------------------

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."

[Sep 27, 2022 12:44:00 PM]

Aperture_Science_Innovators
Advanced Cruncher
United States
Joined: Jul 6, 2009
Post Count: 139
Status: Offline
Project Badges:

2 year badge for Nutritious Rice for the World

10 year badge for The Clean Energy Project - Phase 2

5 year badge for Computing for Clean Water

5 year badge for Drug Search for Leishmaniasis

5 year badge for GO Fight Against Malaria

1 year badge for Computing for Sustainable Water

50 year badge for Uncovering Genome Mysteries

200 year badge for Outsmart Ebola Together

50 year badge for FightAIDS@Home - Phase 2

100 year badge for Smash Childhood Cancer

100 year badge for Microbiome Immunity Project


Re: 2022-09-15 Update (Networking & Workunits)

...since with ARP1 "committed to other platforms" it doesn't really look like there's much ARP1 going out at all.

Do you set your devices to run mostly/entirely ARP tasks? I have my preferences set to give each device 10-12 ARP tasks at a time (I can probably raise this, given that work is coming in pretty stably, so they are getting lots of other WUs too), and they're not having a hard time pulling that down. My most recent work fetch on one of my systems pulled down 4 ARP tasks out of about 25 tasks total, which is not a bad ratio.

[Sep 27, 2022 1:27:45 PM]

Hans Sveen
Veteran Cruncher
Norge
Joined: Feb 18, 2008
Post Count: 983
Status: Offline
Project Badges:

14 day badge for Human Proteome Folding - Phase 2

14 day badge for The Clean Energy Project - Phase 2

90 day badge for Uncovering Genome Mysteries

1 year badge for Outsmart Ebola Together

1 year badge for FightAIDS@Home - Phase 2


Re: 2022-09-15 Update (Networking & Workunits)

Hello!
Since this morning I have received about 100 new WUs, mostly OPN1.
Some ARP1 and MCM were also downloaded without any extra clicking on retry.

While writing this I just got 13 OPNG WUs, also with no retries. Maybe we can see
much more light at the end of the long tunnel!? 😍

Keep up the good work!

With regards,
H.Sveen

[Sep 27, 2022 2:39:05 PM]