World Community Grid Forums
Thread Status: Active Total posts in this thread: 10
jay_Orlando
Senior Cruncher USA Joined: Jan 4, 2006 Post Count: 189 Status: Offline Project Badges:
|
Greetings.
I need help debugging a network problem. I have had failures when trying to ping my closest DNS server. It appears my network goes down for minutes to hours at a time, then recovers. When it recovers, I see several WUs hanging at 100% completed and the rest pending. This can last for several hours, and then all transmissions complete.

Question 1: Is there anything special about sending that last packet, or its ACK?
Question 2: Any suggestions for debugging?

Thank you very much,
Stay safe,
Jay

PS: When pinging the DNS server, once every 5 seconds:

2835 packets transmitted
2182 received
+412 errors
23.0335% packet loss
time 14208925 ms = 3 hours, 56 minutes, 48 seconds, 925 ms
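For anyone who wants to check the figures in that ping summary, here is a minimal Python sketch of the arithmetic. The function name `ping_summary` is made up for illustration; the only inputs are the three numbers quoted in the post.

```python
def ping_summary(transmitted, received, elapsed_ms):
    """Return (loss_percent, (hours, minutes, seconds, ms)) for a ping run."""
    lost = transmitted - received
    loss_percent = 100.0 * lost / transmitted
    # Break the elapsed time down step by step: ms -> s -> min -> h.
    total_seconds, ms = divmod(elapsed_ms, 1000)
    total_minutes, seconds = divmod(total_seconds, 60)
    hours, minutes = divmod(total_minutes, 60)
    return loss_percent, (hours, minutes, seconds, ms)


# Figures taken from the ping summary quoted in the post.
loss, duration = ping_summary(2835, 2182, 14208925)
print(f"{loss:.4f}% packet loss")
print("%d h %d min %d s %d ms" % duration)
```

Both results agree with the summary in the post. Dividing the elapsed time by the packet count gives roughly 5 seconds per ping, which matches the stated interval.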
||
|
|
BobbyB
Veteran Cruncher Canada Joined: Apr 25, 2020 Post Count: 638 Status: Offline Project Badges:
|
What does "It appears my network goes down for minutes to hours at a time" mean?
To me this means no Internet at all: the ISP is down, or the router is disconnected from the ISP. I'm sure this is not what you mean.

Your data says there were:

+412 errors
23.0335% packet loss

That's the DNS server. Which one is it? Can you ping an IP address like Google's DNS 8.8.8.8 or Cloudflare's DNS 1.1.1.1? Maybe you could use one of these as your DNS. IBM has one too: Quad9.

But ugh. If you can't even ping an IP address, then you are really dead. When you say the network is down, can you ping 169.47.63.74, which is www.worldcommunitygrid.org?

[Edit 1 times, last edit by BobbyB at Nov 19, 2020 6:58:41 PM]
||
|
|
jay_Orlando
Senior Cruncher USA Joined: Jan 4, 2006 Post Count: 189 Status: Offline Project Badges:
|
Greetings,
Sorry I'm late; I lost the thread ID.

I live in Orlando, Florida. The nearest DNS server, I assume, is in Miami.

ping -c 3 -l 3 dns.mia.bellsouth.net

(My ISP is AT&T / BellSouth.)

Pings to World Community Grid were OK. My guess is that the network was busy. A few years ago, I had a similar problem with the final ACK.

A well-used script is
These were OK. Slower than usual, but OK. This problem seems to come and go...

Jay

PS: The -l 3 allows 3 ping packets to be sent at once without waiting, and then the responses are counted. (Non-root is allowed 3.)

PPS: Merry Christmas!
||
|
|
BobbyB
Veteran Cruncher Canada Joined: Apr 25, 2020 Post Count: 638 Status: Offline Project Badges:
|
I'm going to concentrate on just the WCG stuff. I'm making a presumption here that this machine is just used for crunching WCG.
I would assign a hard DNS IP in my system and use Google's 8.8.8.8. I doubt they will be down! Ever... And if you can assign a secondary DNS, then use Cloudflare's 1.1.1.1. With these two there should not be a DNS problem. This removes any doubt about your ISP's DNS server.

Now I would hard-assign www.worldcommunitygrid.org to 169.47.63.74 in the hosts file of your OS. I doubt this IP will ever change. With this in place, YOU are the DNS server for WCG.

Now for the script, which it seems is how it is determined that the network is down or slow: just ping 169.47.63.74, 8.8.8.8 and 1.1.1.1.

Now if there are problems connecting to WCG, it is not DNS but something between your machine and the WCG server... or the machine. Is it still crunching while you observe the slowdown?

Let's see how this works out.

[Edit 1 times, last edit by BobbyB at Dec 25, 2020 8:30:24 PM]
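The idea above, separating "DNS is broken" from "the path to the server is broken", could be scripted roughly like this in Python. It is only a sketch with several assumptions: it uses a TCP connection instead of an ICMP ping (ping needs raw sockets, which normally require elevated privileges), it assumes the server accepts connections on port 443, and the function name `classify` and its return strings are invented for illustration.

```python
import socket


def classify(hostname, known_ip, port=443, timeout=5.0,
             resolve=socket.gethostbyname,
             connect=socket.create_connection):
    """Return 'ok', 'dns_failure' or 'connectivity_failure'.

    Step 1 connects to a known IP address, so DNS is not involved.
    Step 2 only runs if step 1 worked, and tests name resolution alone.
    The resolve/connect parameters exist so the logic can be tried
    out with stand-in functions instead of the real network.
    """
    try:
        conn = connect((known_ip, port), timeout)
    except OSError:
        # Could not reach the address at all: not a DNS issue.
        return "connectivity_failure"
    conn.close()
    try:
        resolve(hostname)
    except OSError:
        # The address is reachable, but the name does not resolve.
        return "dns_failure"
    return "ok"
```

A call such as `classify("www.worldcommunitygrid.org", "169.47.63.74")` run every few minutes and logged with a timestamp would show which kind of failure, if any, lines up with the stuck uploads.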
||
|
|
BobbyB
Veteran Cruncher Canada Joined: Apr 25, 2020 Post Count: 638 Status: Offline Project Badges:
|
Ah! I see from your other thread you run Einstein@Home
|
||
|
|
jay_Orlando
Senior Cruncher USA Joined: Jan 4, 2006 Post Count: 189 Status: Offline Project Badges:
|
Bobby,
The DNS is not my problem. Missing the final ACK is the problem. I used the DNS pings just to show/infer that there was some net traffic, but not unrecoverable TCP failures.

Have you encountered missing that final response on your machine(s)? What are your thoughts on the failure to complete the uploads?

Thanks,
Jay

PS: Happy Boxing Day.
||
|
|
BobbyB
Veteran Cruncher Canada Joined: Apr 25, 2020 Post Count: 638 Status: Offline Project Badges:
|
I have not seen what you described. I sometimes see a few WUs hang in there for a short time "ready to report" but they are gone when I check a while later. If I click update on the project tab they disappear and downloads start. Just did that to see what happens.
I zoomed in on DNS because you said you had problems pinging your closest DNS server.

When you say "It appears my network goes down for minutes to hours at a time - then recovers", does it apply to all the machines on your LAN or just these WCG machines? I presumed it was just these WCG machines. If it's everything everywhere, then I can see why the WUs hang. It's connectivity.

To debug, I would start at the router and look at the logs. Disconnects and reconnects would show up there.

[Edit 2 times, last edit by BobbyB at Dec 26, 2020 4:15:11 PM]
||
|
|
BobbyB
Veteran Cruncher Canada Joined: Apr 25, 2020 Post Count: 638 Status: Offline Project Badges:
|
Here is what I observed if it can help.
I watched one WU as it neared completion and counted down the seconds to zero (100% progress). It transitions from working to "uploading" to "ready to report".

I also observed, using a packet sniffer, the transmission of a "ready to report" WU. Yes, it ends with an ACK or two, but I don't see that as relevant to the problem. It's just normal TCP protocol.

I interpret "uploading" as sending the WU output from memory to the disk, or preparing the output on the disk somewhere ready to transmit, because it can stay there for a while in a "ready to report" state, so it is not really uploading (to WCG).

2020-12-27 10:49:29 | World Community Grid | Started upload of MIP1_00327497_15475_0_r1325880629_0
2020-12-27 10:49:33 | World Community Grid | Finished upload of MIP1_00327497_15475_0_r1325880629_0

At 11:28, as I write this, it is still sitting there "ready to report". If I click update on the project tab it will transmit to WCG.

update:

Sun 27 Dec 2020 11:50:11 AM | World Community Grid | Sending scheduler request: To report completed tasks.
Sun 27 Dec 2020 11:50:11 AM | World Community Grid | Reporting 1 completed tasks
Sun 27 Dec 2020 11:50:11 AM | World Community Grid | Not requesting tasks: don't need (job cache full)
Sun 27 Dec 2020 11:50:13 AM | World Community Grid | Scheduler request completed
Sun 27 Dec 2020 11:50:13 AM | World Community Grid | Project requested delay of 121 seconds

They are gone at 11:50.

[Edit 4 times, last edit by BobbyB at Dec 27, 2020 5:27:22 PM]
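If someone wants to measure how long uploads take during one of these slow spells, the "Started upload" / "Finished upload" lines quoted above can be paired up with a few lines of Python. This is a rough sketch that assumes the log uses exactly the `timestamp | project | message` layout and the `YYYY-MM-DD HH:MM:SS` timestamp format shown in those two lines; the function name `upload_durations` is made up, and lines in any other format are skipped.

```python
from datetime import datetime

# The two log lines quoted in the post.
LOG = """\
2020-12-27 10:49:29 | World Community Grid | Started upload of MIP1_00327497_15475_0_r1325880629_0
2020-12-27 10:49:33 | World Community Grid | Finished upload of MIP1_00327497_15475_0_r1325880629_0
"""

STARTED = "Started upload of "
FINISHED = "Finished upload of "


def upload_durations(log_text):
    """Map each uploaded file name to its upload time in seconds."""
    started = {}
    durations = {}
    for line in log_text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 3:
            continue  # not a "timestamp | project | message" line
        stamp, _project, message = parts
        try:
            when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # timestamp in some other format, skip it
        if message.startswith(STARTED):
            started[message[len(STARTED):]] = when
        elif message.startswith(FINISHED):
            name = message[len(FINISHED):]
            if name in started:
                durations[name] = (when - started[name]).total_seconds()
    return durations


print(upload_durations(LOG))
```

For the two lines above this reports a 4-second upload. An upload that starts but never shows a matching "Finished" line would simply be absent from the result, which is itself a hint about where things stall.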
||
|
|
jay_Orlando
Senior Cruncher USA Joined: Jan 4, 2006 Post Count: 189 Status: Offline Project Badges:
|
Bobby,
Thanks for the data, especially on the sniffer. It's been a long while since I worked on TCP/IP.

The net problem was on all of my machines. I live at the end of the lines. I assumed that many people working at home and kids doing virtual classroom contributed to full or near-full net capacity. My ISP is AT&T. Too bad they don't have a network capacity status page or graphic.

The problem has not happened recently. Other people in my neighborhood have told me that they had problems when it rained.

THANKS AGAIN!!
Jay
||
|
|
BobbyB
Veteran Cruncher Canada Joined: Apr 25, 2020 Post Count: 638 Status: Offline Project Badges:
|
It seems I was wrong about my interpretation of "uploading".

I read the fine manual:

https://boinc.berkeley.edu/wiki/How_BOINC_works
https://boinc.berkeley.edu/wiki/Preferences

and when it says "uploading" it really uploads to the data server. I have seen this on a sniffer. Ready to report is "waiting for its points", I guess.

Good that the problem is solved.

[Edit 3 times, last edit by BobbyB at Dec 27, 2020 8:59:35 PM]