World Community Grid Forums
Category: Official Messages | Forum: News | Thread: 2022-09-15 Update (Networking & Workunits)
Thread Status: Active | Total posts in this thread: 214
|
Richard Haselgrove
Senior Cruncher | United Kingdom | Joined: Feb 19, 2021 | Post Count: 360 | Status: Offline |
Networking problems are back with a vengeance - download failures (HTTP 503 Service Unavailable), slow downloads, and now upload failures, plus slow page loads on this forum. The full set.
|
||
|
wildhagen
Veteran Cruncher | The Netherlands | Joined: Jun 5, 2009 | Post Count: 761 | Status: Recently Active |
And the up/download errors are back in massive numbers...
----------------------------------------
30-9-2022 10:10:58 | World Community Grid | Temporarily failed upload of OPNG_0151478_00406_0_r685058128_1: transient HTTP error
30-9-2022 10:10:58 | World Community Grid | Backing off 00:03:26 on upload of OPNG_0151478_00406_0_r685058128_1
30-9-2022 10:11:21 | World Community Grid | Started upload of MCM1_0190979_3604_0_r737390477_0
30-9-2022 10:11:28 | World Community Grid | Temporarily failed upload of MCM1_0190979_3604_0_r737390477_0: transient HTTP error
30-9-2022 10:11:28 | World Community Grid | Backing off 00:05:49 on upload of MCM1_0190979_3604_0_r737390477_0

And that is for both uploads and downloads. Seems we are back to the same issues. Again...

[Edit 1 times, last edit by wildhagen at Sep 30, 2022 8:19:32 AM] |
||
|
sam6861
Advanced Cruncher | Joined: Mar 31, 2020 | Post Count: 107 | Status: Offline |
Slow ARP1 downloads at 25 KB/s
----------------------------------------
My guess is that the server drops the connection in the middle of a download, which may cause the wrong-size error. I wonder why the BOINC client doesn't want to resume and instead just errors the task?

[error] File ARP1_0009321_132_ARP1_0009321_input_d01 has wrong size: expected 12485444, got 2405264

The server may have a slow network: 37% packet drops. Used TCP ping on Linux:

nping --tcp -p 443 www.worldcommunitygrid.org -c 200 --rate 5
Max rtt: 164.893ms | Min rtt: 109.219ms | Avg rtt: 130.245ms
Raw packets sent: 200 (8.000KB) | Rcvd: 126 (5.796KB) | Lost: 74 (37.00%)

Edit: small update - downloads are now getting faster, 240 KB/s, and my ping shows the packet drops are reduced to 15%.

[Edit 1 times, last edit by sam6861 at Sep 30, 2022 8:56:22 AM] |
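For anyone who wants to repeat this check but doesn't have nping installed, below is a rough Python sketch of the same idea: it just times repeated TCP connections to port 443 and counts failures. The host and port are the same ones targeted above; the probe count and pacing are arbitrary stand-ins, so treat it as an approximation of the nping run, not a replacement for it.

import socket
import time

HOST, PORT = "www.worldcommunitygrid.org", 443   # same target as the nping run above
PROBES, TIMEOUT_S, PAUSE_S = 50, 1.0, 0.2        # smaller sample; pause roughly mimics --rate 5

rtts, lost = [], 0
for _ in range(PROBES):
    start = time.monotonic()
    try:
        # A full TCP handshake stands in for nping's TCP "ping".
        with socket.create_connection((HOST, PORT), timeout=TIMEOUT_S):
            rtts.append((time.monotonic() - start) * 1000.0)
    except OSError:
        lost += 1
    time.sleep(PAUSE_S)

if rtts:
    print(f"Max rtt: {max(rtts):.3f}ms | Min rtt: {min(rtts):.3f}ms | Avg rtt: {sum(rtts)/len(rtts):.3f}ms")
print(f"Probes sent: {PROBES} | Lost: {lost} ({100.0 * lost / PROBES:.2f}%)")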
||
|
TPCBF
Master Cruncher | USA | Joined: Jan 2, 2011 | Post Count: 1932 | Status: Offline |
I am probably just as annoyed as the next guy about the ongoing issues with WCG over the last few months, but I think there are quite a few users here who just like to exaggerate issues.
----------------------------------------
I noticed something about the people who complained this morning (my time) about "new" networking issues, "back with a vengeance", etc.: these posts were all made by folks in Europe, who are all 8-9h ahead of me (California/PDT), and at a time when I was already asleep; even in Toronto, it would still have been the middle of the night. I have not experienced any issue since late Monday; all downloads and uploads worked just fine for me, and I didn't really notice any web site or forum slowdown either.

When checking the Event Log on one of my machines, I did see some issues for a (comparatively) short period of time, roughly 0030 through 0115 (again PDT, with Western Europe 8-9h ahead), so for about 45 minutes. No issues were logged before or after, and there were no hung downloads or such on any of the other machines around here either. I just checked another higher-end cruncher in the office (running both CPU and GPU tasks) and that one experienced some issues in the very same time frame. A data server here showed errors for a bit longer, but around that time this machine is doing a 3/4 TB backup at night. Given that this was night time even in Toronto (which would have had these issues around 0330-0415 local time), I think it is not inconceivable that this is some sort of backup over there as well.

And while I am frustrated by the slow going at WCG since the move, I also do recognize that there have been considerable advances in the system since last weekend. Communication with WCG/Krembil is still abysmal, but there really seems to be some light at the end of the tunnel.

And to get to the bottom of what happened last night (my time), it would be more helpful if people would check their logs to get some more qualified info about what happened, at least the time frame. And it would be interesting to keep an eye out for if and when (and for how long) any issue repeats tonight (and in the coming days). If there is some kind of pattern, it makes it much easier to pinpoint the issue and thus find a fix for it.

And mind you, the whole WCG project is NOT officially back "in production", so some glitches are still to be expected.

Ralf |
||
|
TPCBF
Master Cruncher | USA | Joined: Jan 2, 2011 | Post Count: 1932 | Status: Offline |
"Used TCP ping on Linux:
nping --tcp -p 443 www.worldcommunitygrid.org -c 200 --rate 5
Max rtt: 164.893ms | Min rtt: 109.219ms | Avg rtt: 130.245ms
Raw packets sent: 200 (8.000KB) | Rcvd: 126 (5.796KB) | Lost: 74 (37.00%)"

ping is NOT a useful diagnostic tool. At best, it tells you what you already know: THAT there is some kind of issue.

You should much rather do a traceroute, as that will be able to tell you WHERE the issue occurs, which is much more useful than any percentage of packets lost from your end.

Ralf |
||
|
sam6861
Advanced Cruncher | Joined: Mar 31, 2020 | Post Count: 107 | Status: Offline |
I did do traceroute and tcptraceroute. Some limitations:
----------------------------------------
WCG (World Community Grid) servers appear to block ICMP pings, so regular ping doesn't work. tcptraceroute does mostly work, except for a few hops. It can be difficult to read, as it doesn't count the number of dropped packets.

Command I used back then, during the server problem:

tcptraceroute www.worldcommunitygrid.org -n -q 100 -w 1
6 * 64.125.30.62 28.728 ms * * * * * * ... (90% drops)
7 64.125.31.173 22.789 ms 26.757 ms 26.611 ms ... (no drops)
8 128.177.76.42 37.678 ms 37.657 ms 37.534 ms ... (no drops)
9 66.97.16.229 37.766 ms 39.717 ms 39.652 ms ... (no drops)
10 66.97.20.194 39.495 ms 37.753 ms 39.372 ms ... (no drops)
11 199.241.167.118 39.640 ms * 40.472 ms * ... (50% drops here)
12 199.241.167.118 [open] 135.399 ms * * * * * * * 147.358 ms 146.033 ms 159.542 ms * * * * 130.010 ms (about 10 to 40% drops)

Re-tested tcptraceroute now, with fully working and fast WCG downloads:

6 * 64.125.30.62 28.728 ms * * * * * * ... (90% drops again)
11 199.241.167.118 39.640 ms * 40.472 ms * 38.416 ms * (again, 50% drops)
12 199.241.167.118 [open] 38.680 ms 39.964 ms 40.948 ms 39.781 ms ... (no drops)

Hops 6 and 11 should probably be ignored.
Hop 12: high ping and packet drops when the server slows down.

[Edit 1 times, last edit by sam6861 at Sep 30, 2022 6:40:01 PM] |
||
|
TPCBF
Master Cruncher | USA | Joined: Jan 2, 2011 | Post Count: 1932 | Status: Offline |
"Hops 6 and 11 should probably be ignored.
Hop 12: high ping and packet drops when the server slows down."

Not sure why you would consider 6 and 11 to "be ignored". After all, those are the points where your packet loss supposedly occurs.

64.125.30.62 is some infrastructure router in the USA (part of Zayo.com, which is an infrastructure ISP), possibly the connection from Europe (where you apparently are; I can't confirm more of the route as you removed hops 1 through 5). 199.241.167.118 is a router/load balancer at sharcnet.ca, the university ISP for universities and institutes in eastern Canada, which is also the endpoint for reaching worldcommunitygrid.org. There is very little that can likely be done about the loss at hop 6, but I am sure that hop 11 (and 12, which you can see has the same IP, unless you copied something strangely) is what Christian ("cubes") had mentioned they were looking at in his post last week...

A traceroute from me here looks like this:
09/30/22 12:54:47 Fast traceroute worldcommunitygrid.org

That looks perfectly fine, with the two "No Response" entries likely being edge/transport layer routers between networks: between Los Angeles (LAX) and Dallas, TX, as well as the transfer from the USA (Chicago) to the Canadian backbone in Toronto...

As mentioned before, we still don't have an official "go" statement from WCG, so there might still be things to adjust... Most interesting will be whether we see this again (tonight) and whether this is a recurring thing or maybe just a one-time case where someone had his foot on the water hose...

Ralf |
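As an aside, a quick way to put at least a host name to addresses like the two hops discussed above is a reverse DNS lookup; the short Python sketch below is only an illustration of that step (ownership details such as the Zayo/sharcnet attribution still come from a whois lookup, not from reverse DNS).

import socket

# The two addresses discussed above; reverse DNS only, no whois data.
for ip in ("64.125.30.62", "199.241.167.118"):
    try:
        name = socket.gethostbyaddr(ip)[0]
    except OSError:
        name = "(no reverse DNS entry)"
    print(ip, "->", name)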
||
|
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 873 | Status: Offline |
I've been elsewhere and didn't see Ralf's recent post on the generic posts about problems until just now, and it has taken a while to compose this. So this is a bit late but I hope it might be of some use...
----------------------------------------
"These posts were all made by folks in Europe, who are all 8-9h ahead of me (California/PDT), and at a time when I was already asleep; even in Toronto, it would still have been the middle of the night."

Yup - the issues seemed to start around 07:30 UTC - see below...

"And to get to the bottom of what happened last night (my time), it would be more helpful if people would check their logs to get some more qualified info about what happened, at least the time frame."

What??? - Provide useful information??? - Never!!! :-)

I've checked the logs on my systems that connect to WCG, and the problem times for 30th September were as follows:

Download backoffs from 07:28 UTC to 09:25 UTC

The times quoted are when the back-offs were logged; I have no useful record of when they cleared, as I don't log the actual uploads/downloads (to reduce already significant log bloat!)... And there were no other issues that day, and it seemed to maintain a reasonably steady (if low-volume) stream of work throughout the day, whenever work fetch was not stopped because "some download is stalled"...

"And it would be interesting to keep an eye out for if and when (and for how long) any issue repeats tonight (and in the coming days). If there is some kind of pattern, it makes it much easier to pinpoint the issue and thus find a fix for it."

Agreed! :-) If any sort of pattern emerges, it'll be interesting to see if there's a difference between weekdays and weekends, or whether issues are day-specific (e.g. always on Friday mornings!) or happen every so often (e.g. alternate days...). And, of course, if there are any patterns, drawing attention to them can't do any harm, can it?

And yet again, we wait...

Cheers - Al.

[Edited to add link to referenced post...]
[Edit 2 times, last edit by alanb1951 at Oct 1, 2022 12:59:35 AM] |
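For what it's worth, pulling those back-off times out of a saved copy of the BOINC event log only takes a few lines; here is a rough Python sketch under the assumption that the log file is plain text with the same "date | project | message" layout as the excerpts quoted in this thread (the default file name here is just an example).

import re
import sys

# Match lines like:
# 30-9-2022 10:10:58 | World Community Grid | Backing off 00:03:26 on upload of ...
pattern = re.compile(
    r"^(?P<stamp>\S+ \S+) \| World Community Grid \| Backing off (?P<delay>\S+) on (?P<what>.+)$"
)

logfile = sys.argv[1] if len(sys.argv) > 1 else "event_log.txt"  # example file name only
with open(logfile, encoding="utf-8", errors="replace") as log:
    for line in log:
        m = pattern.match(line.strip())
        if m:
            print(m.group("stamp"), m.group("delay"), m.group("what"))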
||
|
Richard Haselgrove
Senior Cruncher | United Kingdom | Joined: Feb 19, 2021 | Post Count: 360 | Status: Offline |
As the author of "back with a vengeance", writing from the UK, I'm quite prepared to accept that this particular glitch was a night-time transient in the Canadian hosting cloud, and wasn't part of normal operations. Indeed, it cleared quite quickly, and I processed a lot of WCG tasks - mostly OPNG - with no observed problems for the rest of the day.
But I think it's important for Krembil to recognise that they have taken on a massive task, played out in public in front of a 24/7 global audience. Things which go 'bump' in the night may not cause local concern, but they look different in the morning light here. A suggestion might be: if there is a scheduled backup, or similar process, which runs each night and puts a significant extra load on the servers, maybe workunit generation should be paused for the duration?

By coincidence, a single new OPNG task popped up on my monitoring screen as I started to compose this post - and it still hasn't completed downloading. I'm not a fast typist!

01/10/2022 09:08:27 | World Community Grid | Temporarily failed download of 2c035af11c9247ee58a10596614bed4a.pdbqt: transient HTTP error
01/10/2022 09:08:27 | World Community Grid | Backing off 00:02:39 on download of 2c035af11c9247ee58a10596614bed4a.pdbqt
01/10/2022 09:11:11 | World Community Grid | Temporarily failed download of 2c035af11c9247ee58a10596614bed4a.pdbqt: transient HTTP error
01/10/2022 09:11:11 | World Community Grid | Backing off 00:06:04 on download of 2c035af11c9247ee58a10596614bed4a.pdbqt
01/10/2022 09:17:20 | World Community Grid | Temporarily failed download of 2c035af11c9247ee58a10596614bed4a.pdbqt: transient HTTP error
01/10/2022 09:17:20 | World Community Grid | Backing off 00:13:03 on download of 2c035af11c9247ee58a10596614bed4a.pdbqt

Note how BOINC increases the delay on repeated failures. Of course, it downloaded immediately when I tried to catch an http_debug log.
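To illustrate that growing delay, here is a toy Python sketch of exponential back-off with jitter; the constants are invented for illustration and are not claimed to be BOINC's actual values, it just shows the general shape of the behaviour visible in the log above.

import random

def next_backoff(failures, base_s=60.0, cap_s=4 * 3600.0):
    # Delay grows with each consecutive failure, up to a cap;
    # jitter spreads retries out so clients don't all retry at once.
    delay = min(cap_s, base_s * (2 ** failures))
    return random.uniform(delay / 2.0, delay)

for n in range(5):
    print(f"failure #{n + 1}: retry in about {next_backoff(n) / 60.0:.1f} min")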
||
|
wildhagen
Veteran Cruncher | The Netherlands | Joined: Jun 5, 2009 | Post Count: 761 | Status: Recently Active |
Same for me, all my machines are back to transient HTTP errors.
1-10-2022 15:07:37 | World Community Grid | Temporarily failed download of db81d33763f7ba36152667e3de4ea37d.pdbqt: transient HTTP error
1-10-2022 15:07:37 | World Community Grid | Backing off 00:27:16 on download of db81d33763f7ba36152667e3de4ea37d.pdbqt
1-10-2022 15:07:37 | World Community Grid | Temporarily failed download of e3200dfb6055f58026f3946b54a5fc5e.job: transient HTTP error
1-10-2022 15:07:37 | World Community Grid | Backing off 00:13:04 on download of e3200dfb6055f58026f3946b54a5fc5e.job
1-10-2022 15:07:38 | World Community Grid | Started download of 7ae87fe0f60df02485e578de1b6cc9f9.zip
1-10-2022 15:07:38 | World Community Grid | Started download of 9c7a2159c0aa96a6487dd185f3141545.gpf
1-10-2022 15:07:42 | World Community Grid | Temporarily failed download of 7ae87fe0f60df02485e578de1b6cc9f9.zip: transient HTTP error
1-10-2022 15:07:42 | World Community Grid | Backing off 00:09:38 on download of 7ae87fe0f60df02485e578de1b6cc9f9.zip
1-10-2022 15:07:42 | World Community Grid | Temporarily failed download of 9c7a2159c0aa96a6487dd185f3141545.gpf: transient HTTP error
1-10-2022 15:07:42 | World Community Grid | Backing off 00:15:55 on download of 9c7a2159c0aa96a6487dd185f3141545.gpf

And we are not talking about a few work units (which would not be that big a problem), but literally hundreds of work units, spread over multiple machines and over a period of hours.
||