World Community Grid Forums
Category: Official Messages | Forum: News | Thread: 2022-09-15 Update (Networking & Workunits)
Thread Status: Active | Total posts in this thread: 214
|
Richard Haselgrove
Senior Cruncher | United Kingdom | Joined: Feb 19, 2021 | Post Count: 360 | Status: Offline |
Networking problems are back with a vengeance - download failures (HTTP 503 Service Unavailable), slow downloads, and now upload failures, plus slow page loads on this forum. The full set.
|
||
|
wildhagen
Veteran Cruncher | The Netherlands | Joined: Jun 5, 2009 | Post Count: 761 | Status: Recently Active |
And the up/download errors are back in massive numbers...
----------------------------------------
30-9-2022 10:10:58 | World Community Grid | Temporarily failed upload of OPNG_0151478_00406_0_r685058128_1: transient HTTP error
30-9-2022 10:10:58 | World Community Grid | Backing off 00:03:26 on upload of OPNG_0151478_00406_0_r685058128_1
30-9-2022 10:11:21 | World Community Grid | Started upload of MCM1_0190979_3604_0_r737390477_0
30-9-2022 10:11:28 | World Community Grid | Temporarily failed upload of MCM1_0190979_3604_0_r737390477_0: transient HTTP error
30-9-2022 10:11:28 | World Community Grid | Backing off 00:05:49 on upload of MCM1_0190979_3604_0_r737390477_0

And that is for both uploads and downloads. Seems we are back to the same issues. Again...

[Edit 1 times, last edit by wildhagen at Sep 30, 2022 8:19:32 AM] |
||
|
sam6861
Advanced Cruncher | Joined: Mar 31, 2020 | Post Count: 107 | Status: Offline |
Slow ARP1 downloads at 25 KB/s
----------------------------------------
My guess is that the server drops the connection in the middle of a download, which may cause the wrong-size error. I wonder why the BOINC client doesn't want to resume and instead just errors the task?

[error] File ARP1_0009321_132_ARP1_0009321_input_d01 has wrong size: expected 12485444, got 2405264

The server may have a slow network: 37% packet drops. Used TCP ping on Linux:

nping --tcp -p 443 www.worldcommunitygrid.org -c 200 --rate 5
Max rtt: 164.893ms | Min rtt: 109.219ms | Avg rtt: 130.245ms
Raw packets sent: 200 (8.000KB) | Rcvd: 126 (5.796KB) | Lost: 74 (37.00%)

Edit: small update - downloads are now getting faster, 240 KB/s, and my ping shows the packet drops are reduced to 15%.

[Edit 1 times, last edit by sam6861 at Sep 30, 2022 8:56:22 AM] |
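For anyone who wants to repeat this check but doesn't have nping installed, below is a rough Python sketch of the same idea: it just times repeated TCP connections to port 443 and counts failures. The host and port are the same ones targeted above; the probe count and pacing are arbitrary stand-ins, so treat it as an approximation of the nping run, not a replacement for it.

import socket
import time

HOST, PORT = "www.worldcommunitygrid.org", 443   # same target as the nping run above
PROBES, TIMEOUT_S, PAUSE_S = 50, 1.0, 0.2        # smaller sample; pause roughly mimics --rate 5

rtts, lost = [], 0
for _ in range(PROBES):
    start = time.monotonic()
    try:
        # A full TCP handshake stands in for nping's TCP "ping".
        with socket.create_connection((HOST, PORT), timeout=TIMEOUT_S):
            rtts.append((time.monotonic() - start) * 1000.0)
    except OSError:
        lost += 1
    time.sleep(PAUSE_S)

if rtts:
    print(f"Max rtt: {max(rtts):.3f}ms | Min rtt: {min(rtts):.3f}ms | Avg rtt: {sum(rtts)/len(rtts):.3f}ms")
print(f"Probes sent: {PROBES} | Lost: {lost} ({100.0 * lost / PROBES:.2f}%)")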
||
|
TPCBF
Master Cruncher | USA | Joined: Jan 2, 2011 | Post Count: 1932 | Status: Offline |
I am probably just as annoyed as the next guy about the ongoing issues with WCG over the last few months, but I think there are quite a few users here who just like to exaggerate issues.
----------------------------------------
I noticed something about the people who complained this morning (my time) about "new" networking issues, "back with a vengeance", etc.: these posts were all made by folks in Europe, who are all 8-9h ahead of me (California/PDT), and at a time when I was already asleep; even in Toronto, it would still have been the middle of the night. I have not experienced any issue since late Monday; all downloads and uploads worked just fine for me, and I didn't really notice any web site or forum slowdown either.

When checking the Event Log on one of my machines, I did see some issues for a (comparatively) short period of time, roughly 0030 through 0115 (again PDT, with Western Europe 8-9h ahead), so for about 45 minutes. No issues were logged before or after, and there were no hung downloads or such on any of the other machines around here either. I just checked another higher-end cruncher in the office (running both CPU and GPU tasks) and that one experienced some issues in the very same time frame. A data server here showed errors for a bit longer, but around that time this machine is doing a 3/4 TB backup at night. Given that this was night time even in Toronto (which would have had these issues around 0330-0415 local time), I think it is not inconceivable that this is some sort of backup over there as well.

And while I am frustrated by the slow going at WCG since the move, I also do recognize that there have been considerable advances in the system since last weekend. Communication with WCG/Krembil is still abysmal, but there really seems to be some light at the end of the tunnel.

And to get to the bottom of what happened last night (my time), it would be more helpful if people would check their logs to get some more qualified info about what happened, at least the time frame. And it would be interesting to keep an eye out for if and when (and for how long) any issue repeats tonight (and in the coming days). If there is some kind of pattern, it makes it much easier to pinpoint the issue and thus find a fix for it.

And mind you, the whole WCG project is NOT officially back "in production", so some glitches are still to be expected.

Ralf |
||
|
TPCBF
Master Cruncher | USA | Joined: Jan 2, 2011 | Post Count: 1932 | Status: Offline |
"Used TCP ping on Linux:
nping --tcp -p 443 www.worldcommunitygrid.org -c 200 --rate 5
Max rtt: 164.893ms | Min rtt: 109.219ms | Avg rtt: 130.245ms
Raw packets sent: 200 (8.000KB) | Rcvd: 126 (5.796KB) | Lost: 74 (37.00%)"

ping is NOT a useful diagnostic tool. At best, it tells you what you already know: THAT there is some kind of issue.

You should much rather do a traceroute, as that will be able to tell you WHERE the issue occurs, which is much more useful than any percentage of packets lost from your end.

Ralf |
||
|
sam6861
Advanced Cruncher | Joined: Mar 31, 2020 | Post Count: 107 | Status: Offline |
I did do traceroute and tcptraceroute. Some limitations:
----------------------------------------
WCG (World Community Grid) servers appear to block ICMP pings, so regular ping doesn't work. tcptraceroute does mostly work, except for a few hops. It can be difficult to read, as it doesn't count the number of dropped packets.

Command I used back then, during the server problem:

tcptraceroute www.worldcommunitygrid.org -n -q 100 -w 1
6 * 64.125.30.62 28.728 ms * * * * * * ... (90% drops)
7 64.125.31.173 22.789 ms 26.757 ms 26.611 ms ... (no drops)
8 128.177.76.42 37.678 ms 37.657 ms 37.534 ms ... (no drops)
9 66.97.16.229 37.766 ms 39.717 ms 39.652 ms ... (no drops)
10 66.97.20.194 39.495 ms 37.753 ms 39.372 ms ... (no drops)
11 199.241.167.118 39.640 ms * 40.472 ms * ... (50% drops here)
12 199.241.167.118 [open] 135.399 ms * * * * * * * 147.358 ms 146.033 ms 159.542 ms * * * * 130.010 ms (about 10 to 40% drops)

Re-tested tcptraceroute now, with fully working and fast WCG downloads:

6 * 64.125.30.62 28.728 ms * * * * * * ... (90% drops again)
11 199.241.167.118 39.640 ms * 40.472 ms * 38.416 ms * (again, 50% drops)
12 199.241.167.118 [open] 38.680 ms 39.964 ms 40.948 ms 39.781 ms ... (no drops)

Hops 6 and 11 should probably be ignored.
Hop 12: high ping and packet drops when the server slows down.

[Edit 1 times, last edit by sam6861 at Sep 30, 2022 6:40:01 PM] |
||
|
TPCBF
Master Cruncher | USA | Joined: Jan 2, 2011 | Post Count: 1932 | Status: Offline |
"Hops 6 and 11 should probably be ignored.
Hop 12: high ping and packet drops when the server slows down."

Not sure why you would consider 6 and 11 to "be ignored". After all, those are the points where your packet loss supposedly occurs.

64.125.30.62 is some infrastructure router in the USA (part of Zayo.com, which is an infrastructure ISP), possibly the connection from Europe (where you apparently are; I can't confirm more of the route as you removed hops 1 through 5). 199.241.167.118 is a router/load balancer at sharcnet.ca, the university ISP for universities and institutes in eastern Canada, which is also the endpoint for reaching worldcommunitygrid.org. There is very little that can likely be done about the loss at hop 6, but I am sure that hop 11 (and 12, which you can see has the same IP, unless you copied something strangely) is what Christian ("cubes") had mentioned they were looking at in his post last week...

A traceroute from me here looks like this:
09/30/22 12:54:47 Fast traceroute worldcommunitygrid.org

That looks perfectly fine, with the two "No Response" entries likely being edge/transport layer routers between networks: between Los Angeles (LAX) and Dallas, TX, as well as the transfer from the USA (Chicago) to the Canadian backbone in Toronto...

As mentioned before, we still don't have an official "go" statement from WCG, so there might still be things to adjust... Most interesting will be whether we see this again (tonight) and whether this is a recurring thing or maybe just a one-time case where someone had his foot on the water hose...

Ralf |
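As an aside, a quick way to put at least a host name to addresses like the two hops discussed above is a reverse DNS lookup; the short Python sketch below is only an illustration of that step (ownership details such as the Zayo/sharcnet attribution still come from a whois lookup, not from reverse DNS).

import socket

# The two addresses discussed above; reverse DNS only, no whois data.
for ip in ("64.125.30.62", "199.241.167.118"):
    try:
        name = socket.gethostbyaddr(ip)[0]
    except OSError:
        name = "(no reverse DNS entry)"
    print(ip, "->", name)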
||
|
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 873 | Status: Offline |
I've been elsewhere and didn't see Ralf's recent post on the generic posts about problems until just now, and it has taken a while to compose this. So this is a bit late but I hope it might be of some use...
----------------------------------------
"These posts were all made by folks in Europe, who are all 8-9h ahead of me (California/PDT), and at a time when I was already asleep; even in Toronto, it would still have been the middle of the night."

Yup - the issues seemed to start around 07:30 UTC - see below...

"And to get to the bottom of what happened last night (my time), it would be more helpful if people would check their logs to get some more qualified info about what happened, at least the time frame."

What??? - Provide useful information??? - Never!!! :-)

I've checked the logs on my systems that connect to WCG, and the problem times for 30th September were as follows:

Download backoffs from 07:28 UTC to 09:25 UTC

The times quoted are when the back-offs were logged; I have no useful record of when they cleared, as I don't log the actual uploads/downloads (to reduce already significant log bloat!)... And there were no other issues that day, and it seemed to maintain a reasonably steady (if low-volume) stream of work throughout the day, whenever work fetch was not stopped because "some download is stalled"...

"And it would be interesting to keep an eye out for if and when (and for how long) any issue repeats tonight (and in the coming days). If there is some kind of pattern, it makes it much easier to pinpoint the issue and thus find a fix for it."

Agreed! :-) If any sort of pattern emerges, it'll be interesting to see if there's a difference between weekdays and weekends, or whether issues are day-specific (e.g. always on Friday mornings!) or happen every so often (e.g. alternate days...). And, of course, if there are any patterns, drawing attention to them can't do any harm, can it?

And yet again, we wait...

Cheers - Al.

[Edited to add link to referenced post...]
[Edit 2 times, last edit by alanb1951 at Oct 1, 2022 12:59:35 AM] |
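For what it's worth, pulling those back-off times out of a saved copy of the BOINC event log only takes a few lines; here is a rough Python sketch under the assumption that the log file is plain text with the same "date | project | message" layout as the excerpts quoted in this thread (the default file name here is just an example).

import re
import sys

# Match lines like:
# 30-9-2022 10:10:58 | World Community Grid | Backing off 00:03:26 on upload of ...
pattern = re.compile(
    r"^(?P<stamp>\S+ \S+) \| World Community Grid \| Backing off (?P<delay>\S+) on (?P<what>.+)$"
)

logfile = sys.argv[1] if len(sys.argv) > 1 else "event_log.txt"  # example file name only
with open(logfile, encoding="utf-8", errors="replace") as log:
    for line in log:
        m = pattern.match(line.strip())
        if m:
            print(m.group("stamp"), m.group("delay"), m.group("what"))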
||
|
Richard Haselgrove
Senior Cruncher | United Kingdom | Joined: Feb 19, 2021 | Post Count: 360 | Status: Offline |
As the author of "back with a vengeance", writing from the UK, I'm quite prepared to accept that this particular glitch was a night-time transient in the Canadian hosting cloud, and wasn't part of normal operations. Indeed, it cleared quite quickly, and I processed a lot of WCG tasks - mostly OPNG - with no observed problems for the rest of the day.
But I think it's important for Krembil to recognise that they have taken on a massive task, played out in public in front of a 24/7 global audience. Things which go 'bump' in the night may not cause local concern, but they look different in the morning light here. A suggestion might be: if there is a scheduled backup, or similar process, which runs each night and puts a significant extra load on the servers, maybe workunit generation should be paused for the duration?

By coincidence, a single new OPNG task popped up on my monitoring screen as I started to compose this post - and it still hasn't completed downloading. I'm not a fast typist!

01/10/2022 09:08:27 | World Community Grid | Temporarily failed download of 2c035af11c9247ee58a10596614bed4a.pdbqt: transient HTTP error
01/10/2022 09:08:27 | World Community Grid | Backing off 00:02:39 on download of 2c035af11c9247ee58a10596614bed4a.pdbqt
01/10/2022 09:11:11 | World Community Grid | Temporarily failed download of 2c035af11c9247ee58a10596614bed4a.pdbqt: transient HTTP error
01/10/2022 09:11:11 | World Community Grid | Backing off 00:06:04 on download of 2c035af11c9247ee58a10596614bed4a.pdbqt
01/10/2022 09:17:20 | World Community Grid | Temporarily failed download of 2c035af11c9247ee58a10596614bed4a.pdbqt: transient HTTP error
01/10/2022 09:17:20 | World Community Grid | Backing off 00:13:03 on download of 2c035af11c9247ee58a10596614bed4a.pdbqt

Note how BOINC increases the delay on repeated failures. Of course, it downloaded immediately when I tried to catch an http_debug log.
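To illustrate that growing delay, here is a toy Python sketch of exponential back-off with jitter; the constants are invented for illustration and are not claimed to be BOINC's actual values, it just shows the general shape of the behaviour visible in the log above.

import random

def next_backoff(failures, base_s=60.0, cap_s=4 * 3600.0):
    # Delay grows with each consecutive failure, up to a cap;
    # jitter spreads retries out so clients don't all retry at once.
    delay = min(cap_s, base_s * (2 ** failures))
    return random.uniform(delay / 2.0, delay)

for n in range(5):
    print(f"failure #{n + 1}: retry in about {next_backoff(n) / 60.0:.1f} min")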
||
|
wildhagen
Veteran Cruncher | The Netherlands | Joined: Jun 5, 2009 | Post Count: 761 | Status: Recently Active |
Same for me, all my machines are back to transient HTTP errors.
1-10-2022 15:07:37 | World Community Grid | Temporarily failed download of db81d33763f7ba36152667e3de4ea37d.pdbqt: transient HTTP error
1-10-2022 15:07:37 | World Community Grid | Backing off 00:27:16 on download of db81d33763f7ba36152667e3de4ea37d.pdbqt
1-10-2022 15:07:37 | World Community Grid | Temporarily failed download of e3200dfb6055f58026f3946b54a5fc5e.job: transient HTTP error
1-10-2022 15:07:37 | World Community Grid | Backing off 00:13:04 on download of e3200dfb6055f58026f3946b54a5fc5e.job
1-10-2022 15:07:38 | World Community Grid | Started download of 7ae87fe0f60df02485e578de1b6cc9f9.zip
1-10-2022 15:07:38 | World Community Grid | Started download of 9c7a2159c0aa96a6487dd185f3141545.gpf
1-10-2022 15:07:42 | World Community Grid | Temporarily failed download of 7ae87fe0f60df02485e578de1b6cc9f9.zip: transient HTTP error
1-10-2022 15:07:42 | World Community Grid | Backing off 00:09:38 on download of 7ae87fe0f60df02485e578de1b6cc9f9.zip
1-10-2022 15:07:42 | World Community Grid | Temporarily failed download of 9c7a2159c0aa96a6487dd185f3141545.gpf: transient HTTP error
1-10-2022 15:07:42 | World Community Grid | Backing off 00:15:55 on download of 9c7a2159c0aa96a6487dd185f3141545.gpf

And we are not talking about a few work units (which would not be that big a problem), but literally hundreds of work units, spread over multiple machines and over a period of hours.
||