Thread Status: Active
Total posts in this thread: 214
Posts: 214   Pages: 22
This topic has been viewed 67588 times and has 213 replies
bfmorse
Senior Cruncher
US
Joined: Jul 26, 2009
Post Count: 292
Status: Offline
Re: 2022-09-15 Update (Networking & Workunits)

We GREATLY appreciate hearing anything about what is transpiring there!

We also GREATLY appreciate the extended details of what has been discovered, successfully isolated, additional “http transient error” envelope information, work performed, additional “tweaking” to be done and all the other tidbits you have shared.

Out of frustration from the need to “babysit” the downloads and the lack of any real status reports and prior to seeing your update, I had already configured several of my systems for “no new tasks”. When focusing on manually expediting downloads, a week is a very long time.

But, having read your detailed message, “Allow new tasks” has been restored.

I do look forward to not needing to “babysit” file transfers!

But more importantly: to smoothly advance science. Thank you again for your detailed update!

Bruce
[Sep 24, 2022 2:43:25 PM]
hendermd
Cruncher
United States
Joined: Apr 30, 2010
Post Count: 29
Status: Offline
Re: 2022-09-15 Update (Networking & Workunits)

Thanks for the update, but as of now I have not noticed any change in downloads.
[Sep 24, 2022 4:11:13 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7546
Status: Recently Active
Re: 2022-09-15 Update (Networking & Workunits)

Cubes:
Thank you for the detailed explanation. It is highly appreciated. Hopefully with a little more tuning, the problem will be minimized. It is interesting to see you now have two servers saturated. Perhaps when they catch up the download situation will be resolved. It is better here, but not yet fully resolved.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Sep 24, 2022 4:12:43 PM]
Traveller42
Cruncher
Joined: May 7, 2017
Post Count: 21
Status: Offline
Re: 2022-09-15 Update (Networking & Workunits)

If you can see the issue and make changes that affect the response to that issue, that is great progress.

I look forward to the Download situation settling.

Currently, it looks like about 25% of my attempts are working. There are streaks where several go through at once, and others where about 50% go through, which is enough to keep the client happy and retrying.
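
For a rough feel of what a 25% success rate means for retries, here is a small back-of-the-envelope sketch. It assumes each attempt succeeds independently with the same probability, which the streaks described above suggest is not quite true, so treat the numbers as ballpark only. The function names are made up for illustration.

```python
def expected_attempts(p: float) -> float:
    """Mean number of attempts until the first success, assuming each
    attempt succeeds independently with probability p (an assumption)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return 1.0 / p


def success_within(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds."""
    if not 0.0 <= p <= 1.0 or n < 0:
        raise ValueError("need 0 <= p <= 1 and n >= 0")
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    p = 0.25  # the rough success rate reported in this post
    print(expected_attempts(p))   # average attempts per file
    print(success_within(p, 4))   # chance a file gets through within 4 tries
```

Under that independence assumption, a file needs about four attempts on average, and roughly two out of three files get through within four tries.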
[Sep 24, 2022 4:16:23 PM]
TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 1928
Status: Offline
Re: 2022-09-15 Update (Networking & Workunits)

Thanks Christian, this is the kind of update that a lot of us have been waiting on for a while now!
[Quoting the update:]
Prior to that happy event, we looked into the source of the "transient" errors reported in client logs. As it happens, the BOINC client will log almost any kind of HTTP/HTTPS error status as a "transient HTTP error". We first investigated our upload/download server, but its logs showed a >99.9% rate of successful responses, and the server load was generally low. Whatever the exact errors the clients were receiving, it seemed they did not come directly from there. So we moved on to the load balancer. Our load balancer runs HAProxy. Examining its operating stats showed it was the source of the BOINC "transient" errors, apparently configured to be a little over-protective of our u/d server, turning down lots of requests. Our HAProxy configuration was originally copied from IBM's, then adapted to work in the new environment, though we left many of the parameters unchanged -- maximum number of simultaneous connections, etc. As it turns out, some of those settings do not work well in the Krembil WCG cluster, at least when we're at 50% download capacity. We made a cautious change or two, but with the new server online now, we will wait until the system settles into a new equilibrium to resume parameter tuning.
[End of quote]
That kind of makes sense. Something on the sending HTTP server was responding with those 503 errors, which indicated that the problem had to be somewhere "on the inside", between the server that makes the actual connection to the remote BOINC client and the internal source of the WU data. So my general hunch was at least in the right direction ...
[Quoting the update:]
The changes probably won't eliminate the "transient" errors -- initial stats from HAProxy say both download servers are saturated now, but hopefully the second download server reduces the pain, and tuning our load balancer should improve things further.
[End of quote]
The downloads still get stuck, though it seems the initial number of files that make it through has somewhat increased, but then this could also be a sign of the weekend...

Also kind of interesting that you mentioned you had been contacted by your ISP about a possible DDoS scenario. Would really love to hear more details if available...

thanks again and hoping to hear from you again soon,

Ralf
[Edit 1 times, last edit by TPCBF at Sep 24, 2022 5:38:58 PM]
[Sep 24, 2022 5:30:33 PM]
Grumpy Swede
Master Cruncher
Svíþjóð
Joined: Apr 10, 2020
Post Count: 2068
Status: Recently Active
Re: 2022-09-15 Update (Networking & Workunits)

Well, the HTTP errors on downloads certainly aren't any better on this side of the pond. No difference, or maybe even worse than before. The website and the forum are definitely much worse now. It's like surfing in molasses.
[Sep 24, 2022 10:06:36 PM]
robertmiles
Senior Cruncher
US
Joined: Apr 16, 2008
Post Count: 443
Status: Offline
Re: 2022-09-15 Update (Networking & Workunits)

I've noticed that the server now allows downloads of the larger files to complete with fewer retries than it does for smaller files. Is there any reason why?
[Sep 24, 2022 11:47:41 PM]
SnoShu
Cruncher
Joined: May 15, 2020
Post Count: 1
Status: Offline
Re: 2022-09-15 Update (Networking & Workunits)

So, I've got over 300 BOINC "Transfers" "queued". They are ALL "Download: pending (project backoff: hh:mm:ss)" with no active tasks. This has been going on all day today (24 Sep 2022, in Idaho, USA)

Sometimes I get one or two tasks and they execute quickly and get uploaded.

Do I just leave it alone and let WCG and BOINC do their thing? Do an "Abort Transfer" for some or all of them? I just don't understand what is so complex about transferring 650 bytes, or 804 bytes or 1.02KB or any other quantity.

My internet connection is solid - 350Mdown/35Mup. Works fine, lasts a long time.

Queued entries are OPNG_0151220_00369_0 through OPNG_0151316_00248_0, and MCM1_0190732_0598_1 through MCM1_0190732_0768_0

Would be nice if they actually transferred so the work could get done...
----------------------------------------
[Edit 1 times, last edit by SnoShu at Sep 25, 2022 3:09:41 AM]
[Sep 25, 2022 3:08:11 AM]
Paul Schlaffer
Senior Cruncher
USA
Joined: Jun 12, 2005
Post Count: 240
Status: Offline
Re: 2022-09-15 Update (Networking & Workunits)

Leave it alone. BOINC will take care of the retries on its own, and the server will take care of any aborts if that (unlikely) becomes necessary.
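
For anyone wondering what "taking care of the retries" looks like in general, here is a minimal sketch of retry with exponential backoff and random jitter. This is not BOINC's actual code: the function names, the 60-second base delay and the 4-hour cap are all hypothetical values chosen for illustration.

```python
import random
import time
from typing import Callable, Optional


def backoff_delay(attempt: int, base: float = 60.0, cap: float = 4 * 3600.0,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait after failed attempt number `attempt` (1-based).

    The ceiling doubles with each failure up to `cap`; the actual delay is
    drawn at random from the upper half of that range so that many clients
    do not all retry at the same instant. All values are illustrative.
    """
    rng = rng or random.Random()
    ceiling = min(cap, base * 2 ** (attempt - 1))
    return rng.uniform(ceiling / 2, ceiling)


def download_with_retries(fetch: Callable[[], bool], max_attempts: int = 10,
                          sleep: Callable[[float], None] = time.sleep,
                          rng: Optional[random.Random] = None) -> int:
    """Call fetch() until it returns True; return the number of attempts used.

    `fetch` is a placeholder for one transfer attempt that returns False on
    a transient error. Raises RuntimeError if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        if fetch():
            return attempt
        if attempt < max_attempts:
            sleep(backoff_delay(attempt, rng=rng))
    raise RuntimeError(f"giving up after {max_attempts} attempts")
```

The growing delay is why a long list of transfers can sit in "project backoff" for a while even after the server has recovered.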
----------------------------------------

“Where an excess of power prevails, property of no sort is duly respected. No man is safe in his opinions, his person, his faculties, or his possessions.” – James Madison (1792)
[Sep 25, 2022 3:21:08 AM]
TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 1928
Status: Offline
Re: 2022-09-15 Update (Networking & Workunits)

[Quoting SnoShu:]
So, I've got over 300 BOINC "Transfers" "queued". They are ALL "Download: pending (project backoff: hh:mm:ss)" with no active tasks. This has been going on all day today (24 Sep 2022, in Idaho, USA)

Sometimes I get one or two tasks and they execute quickly and get uploaded.

Do I just leave it alone and let WCG and BOINC do their thing? Do an "Abort Transfer" for some or all of them? I just don't understand what is so complex about transferring 650 bytes, or 804 bytes or 1.02KB or any other quantity.

My internet connection is solid - 350Mdown/35Mup. Works fine, lasts a long time.

Queued entries are OPNG_0151220_00369_0 through OPNG_0151316_00248_0, and MCM1_0190732_0598_1 through MCM1_0190732_0768_0

Would be nice if they actually transferred so the work could get done...
[End of quote]
It is a known problem, and a tech from Krembil/WCG finally posted last night that they found the likely source of the problem, so this will hopefully be fixed soon(ish)...

In the meantime, there is no need to abort. Instead, mark all the transfers and hit [Retry Now] every once in a while, once the BOINC client has gone through the list and EVERYTHING is back to "backoff" or "Project: retry".

The problem is not to transfer the data in the small files, it is the large number of concurrent connections, for which apparently some setting needs to be adjusted in the load balancer...
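
To illustrate why file size hardly matters here, below is a toy model of a connection cap. It is not how HAProxy is implemented (the real thing has queues and timeouts on top of its connection limits); the class and function names are invented for this sketch. The only point is that every transfer needs a slot regardless of how many bytes it carries, and requests arriving while all slots are taken get turned away with a 503.

```python
from typing import List


class ConnectionLimiter:
    """Toy stand-in for a load balancer with a hard connection cap."""

    def __init__(self, max_connections: int) -> None:
        self.max_connections = max_connections
        self.active = 0

    def try_open(self) -> int:
        """Take a slot and return 200 if one is free, otherwise return 503."""
        if self.active >= self.max_connections:
            return 503
        self.active += 1
        return 200

    def close(self) -> None:
        """Release one slot when a transfer finishes."""
        if self.active > 0:
            self.active -= 1


def burst(limiter: ConnectionLimiter, n_requests: int) -> List[int]:
    """Simulate n requests arriving at once, before any of them finishes."""
    return [limiter.try_open() for _ in range(n_requests)]


if __name__ == "__main__":
    lb = ConnectionLimiter(max_connections=2)
    print(burst(lb, 5))   # [200, 200, 503, 503, 503]
    lb.close()            # one transfer finishes
    print(lb.try_open())  # 200, the freed slot is available again
```

In this model, raising the cap or adding a second server behind the balancer both reduce the share of 503s, which matches the direction of the tuning described in the update.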

This has all been explained in a post just a couple of posts above yours. The answers are out there, you just need to read them...

Ralf
[Edit 1 times, last edit by TPCBF at Sep 25, 2022 3:48:09 PM]
[Sep 25, 2022 3:23:00 AM]