World Community Grid Forums
Category: Official Messages Forum: News Thread: 2022-09-15 Update (Networking & Workunits) |
Thread Status: Active Total posts in this thread: 214
|
bfmorse
Senior Cruncher US Joined: Jul 26, 2009 Post Count: 292 Status: Offline Project Badges: |
We GREATLY appreciate hearing anything about what is transpiring there!
We also GREATLY appreciate the extended details of what has been discovered and successfully isolated, the additional “http transient error” envelope information, the work performed, the additional “tweaking” still to be done, and all the other tidbits you have shared.
Out of frustration from the need to “babysit” the downloads and the lack of any real status reports, and prior to seeing your update, I had already configured several of my systems for “no new tasks”. When focusing on manually expediting downloads, a week is a very long time. But, having read your detailed message, “Allow new tasks” has been restored.
I do look forward to not needing to “babysit” file transfers! But more importantly: to smoothly advance science.
Thank you again for your detailed update!
Bruce |
||
|
hendermd
Cruncher United States Joined: Apr 30, 2010 Post Count: 29 Status: Offline Project Badges: |
Thanks for the update, but as of now I have not noticed any change in downloads. |
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7546 Status: Recently Active Project Badges: |
Cubes:
Thank you for the detailed explanation. It is highly appreciated. Hopefully with a little more tuning, the problem will be minimized. It is interesting to see you now have two servers saturated. Perhaps when they catch up the download situation will be resolved. It is better here, but not yet fully resolved.
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
Traveller42
Cruncher Joined: May 7, 2017 Post Count: 21 Status: Offline Project Badges: |
If you can see the issue and make changes that affect the response to that issue, that is great progress.
I look forward to the download situation settling. Currently, it looks like about 25% of my attempts are working, though there are streaks where several go through at once and others where about 50% go through, which is enough to keep the client happy to keep retrying. |
||
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1928 Status: Offline Project Badges: |
Thanks Christian, this is the kind of update that a lot of us have been waiting for for a while now!
Quote:
Prior to that happy event, we looked into the source of the "transient" errors reported in client logs. As it happens, the BOINC client will log almost any kind of HTTP/HTTPS error status as a "transient HTTP error". We first investigated our upload/download server, but its logs showed a >99.9% rate of successful responses, and the server load was generally low. Whatever the exact errors the clients were receiving, it seemed they did not come directly from there. So we moved on to the load balancer. Our load balancer runs HAProxy. Examining its operating stats showed it was the source of the BOINC "transient" errors, apparently configured to be a little over-protective of our u/d server, turning down lots of requests. Our HAProxy configuration was originally copied from IBM's, then adapted to work in the new environment, though we left many of the parameters unchanged -- maximum number of simultaneous connections, etc. As it turns out, some of those settings do not work well in the Krembil WCG cluster, at least when we're at 50% download capacity. We made a cautious change or two, but with the new server online now, we will wait until the system settles into a new equilibrium to resume parameter tuning.

That kind of makes sense. Something on the sending http server was responding with those 503 errors, which indicated that the problem had to be somewhere "on the inside" between the server that makes the actual connection to the remote BOINC client and the internal source of the WU data. So my general hunch was at least in the right direction ...

Quote:
The changes probably won't eliminate the "transient" errors -- initial stats from HAProxy say both download servers are saturated now, but hopefully the second download server reduces the pain, and tuning our load balancer should improve things further.

The downloads still get stuck, though it seems the initial number of files that make it through has somewhat increased, but then this could also be a sign of the weekend...
Also kind of interesting that you mentioned you had been contacted by your ISP about a possible DDoS scenario. Would really love to hear more details if available...
Thanks again, and hoping to hear from you again soon,
Ralf
[Edit 1 times, last edit by TPCBF at Sep 24, 2022 5:38:58 PM] |
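For anyone wondering how a connection cap on a load balancer can turn away most requests while the server behind it stays nearly idle, here is a toy sketch in Python. It is only an illustration under stated assumptions: the function name, the tick-based timing, and all the numbers are made up for this example, and it is not how HAProxy itself works (a real proxy has queues and timeouts that this model ignores).

```python
def simulate_connection_cap(arrivals_per_tick, max_conn, service_ticks, ticks):
    """Toy model of a proxy with a hard cap on simultaneous connections.

    Hypothetical and simplified: a request is admitted only if fewer than
    max_conn requests are in flight; otherwise it is refused at once
    (a stand-in for the 503 / "transient HTTP error" seen by clients).
    Returns (accepted, refused).
    """
    in_flight = []  # remaining ticks for each admitted request
    accepted = 0
    refused = 0
    for _ in range(ticks):
        # Age the admitted requests and drop the ones that have finished.
        in_flight = [t - 1 for t in in_flight if t > 1]
        for _ in range(arrivals_per_tick):
            if len(in_flight) < max_conn:
                in_flight.append(service_ticks)
                accepted += 1
            else:
                refused += 1
    return accepted, refused


if __name__ == "__main__":
    # Same offered load, two different caps (all values are illustrative).
    for cap in (5, 20):
        ok, bad = simulate_connection_cap(
            arrivals_per_tick=10, max_conn=cap, service_ticks=2, ticks=100
        )
        print(f"max_conn={cap}: accepted={ok}, refused={bad}")
```

The point of the sketch is only that the refusals are decided at the proxy, so the backend's own logs can show almost nothing but successes while clients still see plenty of errors.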
||
|
Grumpy Swede
Master Cruncher Svíþjóð Joined: Apr 10, 2020 Post Count: 2068 Status: Recently Active Project Badges: |
Well, the http errors on downloads certainly aren't any better on this side of the pond. No difference, or maybe even worse than before. The website and the forum are definitely much worse now. It's like surfing in molasses.
|
||
|
robertmiles
Senior Cruncher US Joined: Apr 16, 2008 Post Count: 443 Status: Offline Project Badges: |
I've noticed that the server now allows downloads of the larger files to complete with fewer retries than it does for smaller files. Is there any reason why?
|
||
|
SnoShu
Cruncher Joined: May 15, 2020 Post Count: 1 Status: Offline Project Badges: |
So, I've got over 300 BOINC "Transfers" "queued". They are ALL "Download: pending (project backoff: hh:mm:ss)" with no active tasks. This has been going on all day today (24 Sep 2022, in Idaho, USA).
Sometimes I get one or two tasks and they execute quickly and get uploaded.
Do I just leave it alone and let WCG and BOINC do their thing? Do an "Abort Transfer" for some or all of them?
I just don't understand what is so complex about transferring 650 bytes, or 804 bytes or 1.02KB or any other quantity. My internet connection is solid - 350Mdown/35Mup. Works fine, lasts a long time.
Queued entries are OPNG_0151220_00369_0 through OPNG_0151316_00248_0, and MCM1_0190732_0598_1 through MCM1_0190732_0768_0.
Would be nice if they actually transferred so the work could get done...
[Edit 1 times, last edit by SnoShu at Sep 25, 2022 3:09:41 AM] |
||
|
Paul Schlaffer
Senior Cruncher USA Joined: Jun 12, 2005 Post Count: 240 Status: Offline Project Badges: |
Leave it alone. BOINC will take care of the retries on its own, and the server will take care of any aborts in the (unlikely) event that they are necessary.
----------------------------------------“Where an excess of power prevails, property of no sort is duly respected. No man is safe in his opinions, his person, his faculties, or his possessions.” – James Madison (1792) |
||
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1928 Status: Offline Project Badges: |
Quote:
So, I've got over 300 BOINC "Transfers" "queued". They are ALL "Download: pending (project backoff: hh:mm:ss)" with no active tasks. This has been going on all day today (24 Sep 2022, in Idaho, USA)

It is a known problem, and a tech from Krembil/WCG finally posted last night that they found the likely source of the problem, so this will hopefully be fixed soon(ish)...

Quote:
Sometimes I get a one or two tasks and they execute quickly and get uploaded. Do I just leave it alone and let WCG and BOINC do their thing? Do an "Abort Transfer" for some or all of them? I just don't understand what is so complex about transferring 650 bytes, or 804 bytes or 1.02KB or any other quantity. My internet connection is solid - 350Mdown/35Mup. Works fine, lasts a long time. Queued entries are OPNG_0151220_00369_0 through OPNG_0151316_00248_0, and MCM1_0190732_0598_1 through MCM1_0190732_0768_0 Would be nice if they actually transferred so the work could get done...

In the meantime, no need to abort, but "mark all the transfers" and hit [Retry Now] every once in a while when the BOINC client has gone through the list and EVERYTHING is back to "backoff" or "Project: retry".
The problem is not transferring the data in the small files; it is the large number of concurrent connections, for which apparently some setting needs to be adjusted in the load balancer...
This has all been explained in a post just a couple above yours. The answers are out there, you just need to read them...
Ralf
[Edit 1 times, last edit by TPCBF at Sep 25, 2022 3:48:09 PM] |
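To give a feel for why the transfers sit in "project backoff" for longer and longer, here is a minimal Python sketch of exponential backoff with a random spread. It is an assumption-laden illustration, not the BOINC client's actual schedule: the function name, the 60-second base, the one-hour cap, and the randomization range are all hypothetical values chosen for this example.

```python
import random


def backoff_delays(attempts, base=60.0, cap=3600.0, seed=None):
    """Return a list of wait times (seconds) before each successive retry.

    Hypothetical sketch: the ceiling doubles after every failed attempt
    until it reaches `cap`, and the actual wait is drawn at random from
    the upper half of that ceiling so that many clients do not all
    retry at the same instant.
    """
    rng = random.Random(seed)
    delays = []
    for n in range(attempts):
        ceiling = min(cap, base * 2 ** n)
        delays.append(rng.uniform(ceiling / 2, ceiling))
    return delays


if __name__ == "__main__":
    for attempt, wait in enumerate(backoff_delays(8, seed=1), start=1):
        print(f"retry {attempt}: wait about {wait / 60:.1f} min")
```

Under a scheme like this, hitting [Retry Now] simply skips the current wait, which is why doing it by hand every so often gets files moving sooner than leaving the client to its own timer.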
||