World Community Grid - View Thread - 2022-09-15 Update (Networking & Workunits)

Thread: 2022-09-15 Update (Networking & Workunits)

Thread Status: Active
Total posts in this thread: 214

This topic has been viewed 154710 times and has 213 replies

bfmorse
Senior Cruncher
US
Joined: Jul 26, 2009
Post Count: 448

Re: 2022-09-15 Update (Networking & Workunits)

We GREATLY appreciate hearing anything about what is transpiring there!

We also GREATLY appreciate the extended details of what has been discovered, successfully isolated, additional “http transient error” envelope information, work performed, additional “tweaking” to be done and all the other tidbits you have shared.

Out of frustration from the need to “babysit” the downloads and the lack of any real status reports and prior to seeing your update, I had already configured several of my systems for “no new tasks”. When focusing on manually expediting downloads, a week is a very long time.

But, having read your detailed message, “Allow new tasks” has been restored.

I do look forward to not needing to “babysit” file transfers!

But more importantly: to smoothly advance science. Thank you again for your detailed update!

Bruce

[Sep 24, 2022 2:43:25 PM]

hendermd
Cruncher
United States
Joined: Apr 30, 2010
Post Count: 29

Re: 2022-09-15 Update (Networking & Workunits)

Thanks for the update, but as of now I have not noticed any change in downloads.

[Sep 24, 2022 4:11:13 PM]

Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7852

Re: 2022-09-15 Update (Networking & Workunits)

Cubes:
Thank you for the detailed explanation. It is highly appreciated. Hopefully with a little more tuning, the problem will be minimized. It is interesting to see you now have two servers saturated. Perhaps when they catch up the download situation will be resolved. It is better here, but not yet fully resolved.
Cheers

----------------------------------------

Sgt. Joe
*Minnesota Crunchers*

[Sep 24, 2022 4:12:43 PM]

Traveller42
Cruncher
Joined: May 7, 2017
Post Count: 21

Re: 2022-09-15 Update (Networking & Workunits)

If you can see the issue and make changes that affect the response to that issue, that is great progress.

I look forward to the Download situation settling.

Currently, it looks like about 25% of my attempts are working, but there are streaks where several go through at once, and others where about 50% go through, which is enough to keep the client happy and retrying.

[Sep 24, 2022 4:16:23 PM]

TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 2175

Re: 2022-09-15 Update (Networking & Workunits)

Thanks Christian, this is the kind of update that a lot of us have been waiting for for a while now!

Prior to that happy event, we looked into the source of the "transient" errors reported in client logs. As it happens, the BOINC client will log almost any kind of HTTP/HTTPS error status as a "transient HTTP error". We first investigated our upload/download server, but its logs showed a >99.9% rate of successful responses, and the server load was generally low. Whatever the exact errors the clients were receiving, it seemed they did not come directly from there. So we moved on to the load balancer. Our load balancer runs HAProxy. Examining its operating stats showed it was the source of the BOINC "transient" errors, apparently configured to be a little over-protective of our u/d server, turning down lots of requests. Our HAProxy configuration was originally copied from IBM's, then adapted to work in the new environment, though we left many of the parameters unchanged -- maximum number of simultaneous connections, etc. As it turns out, some of those settings do not work well in the Krembil WCG cluster, at least when we're at 50% download capacity. We made a cautious change or two, but with the new server online now, we will wait until the system settles into a new equilibrium to resume parameter tuning.

That kind of makes sense. Something on the sending HTTP server was responding with those 503 errors, which indicated that the problem had to be somewhere "on the inside", between the server that makes the actual connection to the remote BOINC client and the internal source of the WU data. So my general hunch was at least in the right direction.
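To make that mechanism concrete, here is a minimal Python sketch of how a connection cap on a load balancer could produce 503 responses even while the backend server logs almost no failures. This is a toy model, not WCG's HAProxy configuration: the class `CappedBalancer`, the `maxconn` field, and the numbers are all hypothetical, and a real proxy typically also queues requests and applies timeouts, which this sketch skips.

```python
from dataclasses import dataclass


@dataclass
class CappedBalancer:
    """Toy load balancer that rejects requests once a connection cap is hit."""
    maxconn: int      # hypothetical cap on simultaneous connections
    active: int = 0   # connections currently held open

    def open(self) -> int:
        """Try to open a connection; return an HTTP-style status code."""
        if self.active >= self.maxconn:
            # Turned away at the balancer: the backend never sees this request,
            # so its own logs still show a near-perfect success rate.
            return 503
        self.active += 1
        return 200

    def close(self) -> None:
        """Release one connection slot."""
        if self.active > 0:
            self.active -= 1


def burst(balancer: CappedBalancer, n_requests: int) -> dict:
    """Send n_requests that all arrive before any finishes; tally status codes."""
    tally = {200: 0, 503: 0}
    for _ in range(n_requests):
        tally[balancer.open()] += 1
    return tally


if __name__ == "__main__":
    # Illustrative numbers only: 100 simultaneous requests against a cap of 25.
    print(burst(CappedBalancer(maxconn=25), 100))
```

In this toy model the size of each file never enters the picture; only the number of simultaneous connections does, which is why raising the cap (or adding capacity behind it) is the lever being tuned.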

...

The changes probably won't eliminate the "transient" errors -- initial stats from HAProxy say both download servers are saturated now, but hopefully the second download server reduces the pain, and tuning our load balancer should improve things further.

The downloads still get stuck, though it seems the initial number of files that make it through has somewhat increased, but then this could also be a sign of the weekend...

Also kind of interesting that you mentioned you had been contacted by your ISP about a possible DDoS scenario. Would really love to hear more details if available...

thanks again and hoping to hear from you again soon,

Ralf

----------------------------------------
[Edit 1 times, last edit by TPCBF at Sep 24, 2022 5:38:58 PM]

[Sep 24, 2022 5:30:33 PM]

Grumpy Swede
Master Cruncher
Svíþjóð
Joined: Apr 10, 2020
Post Count: 2550

Re: 2022-09-15 Update (Networking & Workunits)

Well, the HTTP errors on downloads certainly aren't any better on this side of the pond. No difference, or maybe even worse than before. The website and the forum are definitely much worse now. It's like surfing in molasses.

[Sep 24, 2022 10:06:36 PM]

robertmiles
Senior Cruncher
US
Joined: Apr 16, 2008
Post Count: 445

Re: 2022-09-15 Update (Networking & Workunits)

I've noticed that the server now allows downloads of the larger files to complete with fewer retries than it does for smaller files. Is there any reason why?

[Sep 24, 2022 11:47:41 PM]

SnoShu
Cruncher
Joined: May 15, 2020
Post Count: 1

Re: 2022-09-15 Update (Networking & Workunits)

So, I've got over 300 BOINC "Transfers" "queued". They are ALL "Download: pending (project backoff: hh:mm:ss)" with no active tasks. This has been going on all day today (24 Sep 2022, in Idaho, USA)

Sometimes I get one or two tasks and they execute quickly and get uploaded.

Do I just leave it alone and let WCG and BOINC do their thing? Do an "Abort Transfer" for some or all of them? I just don't understand what is so complex about transferring 650 bytes, or 804 bytes or 1.02KB or any other quantity.

My internet connection is solid - 350Mdown/35Mup. Works fine, lasts a long time.

Queued entries are OPNG_0151220_00369_0 through OPNG_0151316_00248_0, and MCM1_0190732_0598_1 through MCM1_0190732_0768_0

Would be nice if they actually transferred so the work could get done...

----------------------------------------
[Edit 1 times, last edit by SnoShu at Sep 25, 2022 3:09:41 AM]

[Sep 25, 2022 3:08:11 AM]

Paul Schlaffer
Senior Cruncher
USA
Joined: Jun 12, 2005
Post Count: 279

Re: 2022-09-15 Update (Networking & Workunits)

Leave it alone. BOINC will take care of the retries on its own, and the server will take care of any aborts if that (unlikely) becomes necessary.

----------------------------------------

“Where an excess of power prevails, property of no sort is duly respected. No man is safe in his opinions, his person, his faculties, or his possessions.” – James Madison (1792)

[Sep 25, 2022 3:21:08 AM]

TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 2175

Re: 2022-09-15 Update (Networking & Workunits)

It is a known problem, and a tech from Krembil/WCG finally posted last night that they found the likely source of the problem, so this will hopefully be fixed soon(ish)...

In the meantime, no need to abort, but "mark all the transfers" and hit [Retry Now] every once in a while when the BOINC client has gone through the list and EVERYTHING is back to "backoff" or "Project: retry".

The problem is not transferring the data in the small files; it is the large number of concurrent connections, for which apparently some setting needs to be adjusted in the load balancer...
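For anyone wondering why the transfers sit in "backoff" instead of retrying at once: spreading retries out in time is a common way to cut down the number of concurrent connections a server sees. Below is a minimal Python sketch of generic exponential backoff with random jitter. It is not BOINC's actual algorithm; the function name `backoff_delays` and the `base` and `cap` values are assumptions made purely for illustration.

```python
import random


def backoff_delays(attempts, base=60.0, cap=3600.0, rng=None):
    """Return one randomized wait time (in seconds) per retry attempt.

    Generic "exponential backoff with full jitter":
    the ceiling doubles with each attempt up to `cap`, and the actual
    delay is drawn uniformly between 0 and that ceiling so that many
    clients do not all retry at the same moment.
    All default values here are hypothetical, not BOINC's.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        delays.append(rng.uniform(0.0, ceiling))
    return delays


if __name__ == "__main__":
    # Print the waits for eight consecutive failed attempts.
    for i, delay in enumerate(backoff_delays(8), start=1):
        print(f"attempt {i}: wait {delay:.0f} s")
```

Hitting [Retry Now] simply skips the remaining wait for the selected transfers, which helps one client get through but adds to the connection count on the server side.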

This has been all explained in a post just a couple ones above yours. The answers are out there, you just need to read them...

Ralf

----------------------------------------
[Edit 1 times, last edit by TPCBF at Sep 25, 2022 3:48:09 PM]

[Sep 25, 2022 3:23:00 AM]
