World Community Grid - View Thread - Hundreds and Hundreds of "Detached" errors

Thread: Hundreds and Hundreds of "Detached" errors

Thread Status: Active
Total posts in this thread: 21

This topic has been viewed 6016 times and has 20 replies

SekeRob
Master Cruncher
Joined: Jan 7, 2013
Post Count: 2741
Status: Offline


Re: Hundreds and Hundreds of "Detached" errors

Not sure about 7.6.33, but 7.8 ** (get it from gianfranco's PPA at https://launchpad.net/~costamagnagianfranco/+archive/ubuntu/boinc ... he's the Ubuntu/Debian package maintainer) for sure fully erases the project's data subfolder on detach, but the main data dir would still contain the files with WCG in their name, such as account_www.worldcommunitygrid.org.xml and master_www.worldcommunitygrid.org.xml. These need removing, as otherwise re-adding the project would use old base information.
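A minimal sketch for checking whether those two files are still lying around. The function name is made up for illustration, it only looks for the two file names mentioned above, and it only lists them; it deletes nothing. The data dir path is whatever your install uses.

```python
from pathlib import Path


def find_leftover_project_files(data_dir, project_host="www.worldcommunitygrid.org"):
    """Return the account_/master_ files for one project still present in data_dir.

    Hypothetical helper: it checks only the two file names quoted in the post
    and never removes anything.
    """
    base = Path(data_dir)
    wanted = [f"account_{project_host}.xml", f"master_{project_host}.xml"]
    return sorted(base / name for name in wanted if (base / name).is_file())


if __name__ == "__main__":
    # Assumption: the data dir is /var/lib/boinc-client; adjust for your install.
    for path in find_leftover_project_files("/var/lib/boinc-client"):
        print(path)
```

Stop the client before removing anything it reports.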

The exercise I'd certainly go through is checking every device's event log, start-up section, to make sure no duplicate Computer ID is being used. Reading your replies, in no normal world could that be the case... one can even run multiple clients on a single device with proper preparation, as long as the additional instances are made to point to their own exclusive data directory.
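If the start-up sections are saved to text, the duplicate check can be sketched roughly like this. It assumes the log contains the phrase "Computer ID" followed by a number; that pattern and the function name are my assumptions, so adjust the regex to what your logs actually show.

```python
import re
from collections import defaultdict

# Assumption: the start-up section prints "Computer ID" followed by a number.
COMPUTER_ID = re.compile(r"Computer ID:?\s*(\d+)")


def duplicate_computer_ids(logs_by_device):
    """Map each Computer ID seen on more than one device to those devices.

    logs_by_device: dict of device name -> event log text (hypothetical input).
    """
    devices_by_id = defaultdict(set)
    for device, log_text in logs_by_device.items():
        for match in COMPUTER_ID.finditer(log_text):
            devices_by_id[match.group(1)].add(device)
    return {cid: sorted(devs) for cid, devs in devices_by_id.items() if len(devs) > 1}
```

An empty result means every device reported its own ID in the logs you fed in.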

Of course there's a possibility of conflict... could there be multiple data dirs on a device, not only in /var/lib/boinc-client?

** Don't know what's in the present Ubuntu repository, either 7.6.33 or 7.8.3, if you're running Ubuntu... I don't see which distro you run.

[Dec 3, 2017 9:38:31 AM]

SekeRob
Master Cruncher
Joined: Jan 7, 2013
Post Count: 2741
Status: Offline


Re: Hundreds and Hundreds of "Detached" errors

Does this look familiar...

29421 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_61200_0
29422 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_61045_0
29423 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_61374_0
29424 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_61479_0
29425 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_61299_0
29426 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_61416_0
29427 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_61478_0
29428 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_61265_0
29429 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_61443_0
29430 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_61300_0
29431 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_61270_0
29432 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_60693_0
29433 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_60755_0
29434 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_61266_0
29435 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_61355_0

... In my case it was because the write access to the whole BOINC data dir was funked up, so it was cycling through all the jobs, trying to start them, then fetching new copies, several times.

----------------------------------------
[Edit 1 times, last edit by SekeRob* at Dec 3, 2017 12:12:42 PM]

[Dec 3, 2017 11:46:35 AM]

NUCCpod_NAPTIMELABS_01
Cruncher
Joined: Nov 28, 2017
Post Count: 10
Status: Offline

Re: Hundreds and Hundreds of "Detached" errors

Sadly no, it looks like your message ids have those events back to back to back, which is not the case on my end.
My workers start processing the newly downloaded tasks, only to have them canceled by the server, with "Aborted by project" displayed in the GUI.

[Dec 4, 2017 10:29:49 AM]

SekeRob
Master Cruncher
Joined: Jan 7, 2013
Post Count: 2741
Status: Offline


Re: Hundreds and Hundreds of "Detached" errors

Can you post your cc_config.xml file content... just another hunch

[Dec 6, 2017 12:12:26 PM]

BobCat13
Senior Cruncher
Joined: Oct 29, 2005
Post Count: 295
Status: Offline

Re: Hundreds and Hundreds of "Detached" errors

What are the IP addresses for each device as listed in client_state.xml?

If they are running a Debian-based Linux distro and using DHCP, then most likely each of them will report 127.0.1.1 back to the WCG server.
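A rough sketch for pulling the reported name and address out of client_state.xml. It assumes the file parses as XML and has a host_info section with domain_name and ip_addr elements; check your own file if that layout differs. The function name is made up.

```python
import xml.etree.ElementTree as ET


def reported_host_identity(client_state_xml):
    """Return the domain_name and ip_addr found under host_info, or None.

    Assumption: host_info sits below the root element and holds
    <domain_name> and <ip_addr> children.
    """
    root = ET.fromstring(client_state_xml)
    host = root.find(".//host_info")
    if host is None:
        return None
    return {
        "domain_name": (host.findtext("domain_name") or "").strip(),
        "ip_addr": (host.findtext("ip_addr") or "").strip(),
    }
```

Running it against each device's file and comparing the results shows quickly whether they all report the same pair.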

----------------------------------------
[Edit 2 times, last edit by BobCat13 at Dec 6, 2017 7:04:20 PM]

[Dec 6, 2017 6:46:16 PM]

NUCCpod_NAPTIMELABS_01
Cruncher
Joined: Nov 28, 2017
Post Count: 10
Status: Offline


Re: Hundreds and Hundreds of "Detached" errors

Yes that is the case, but *shouldn't* be an issue.

I waited for all the work units to expire, purged and reinstalled BOINC, and reattached my workers. Almost instantly the errors started accumulating again.

"Aborted by project" in the gui,
"Result *soandso* is no longer usable" in the logs

In just an hour or so, I'm back up to 11 pages of errors!
I really would like to get this solved, but I am out of ideas of things to try and fix on my end.

[Dec 20, 2017 1:57:22 AM]

NUCCpod_NAPTIMELABS_01
Cruncher
Joined: Nov 28, 2017
Post Count: 10
Status: Offline


Re: Hundreds and Hundreds of "Detached" errors

As best I can tell, WCG is giving several of my workers the same work unit, and then invalidates the assignment when the workers check in on the next cycle.

10.0.1.246 2017-12-19 17:47:39 Result SCC1_0001579_Lin-CSD-A_5740_0 is no longer usable
10.0.1.245 2017-12-19 17:47:34 Result SCC1_0001579_Lin-CSD-A_5740_0 is no longer usable
10.0.1.245 2017-12-19 17:47:35 Computation for task SCC1_0001579_Lin-CSD-A_5740_0 finished
10.0.1.252 2017-12-19 17:47:35 Result SCC1_0001579_Lin-CSD-A_5740_0 is no longer usable
10.0.1.252 2017-12-19 17:47:37 Computation for task SCC1_0001579_Lin-CSD-A_5740_0 finished
10.0.1.245 2017-12-19 17:37:14 Resent lost task SCC1_0001579_Lin-CSD-A_5740_0
10.0.1.245 2017-12-19 17:37:44 Starting task SCC1_0001579_Lin-CSD-A_5740_0
10.0.1.252 2017-12-19 17:37:16 Resent lost task SCC1_0001579_Lin-CSD-A_5740_0
10.0.1.252 2017-12-19 17:37:41 Starting task SCC1_0001579_Lin-CSD-A_5740_0
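For anyone wanting to check their own merged logs the same way, here is a small sketch that groups lines like the ones above by task name and reports tasks that show up on more than one worker. It assumes each line is "IP date time message" and that the task name follows the word "Result" or "task"; the function name is hypothetical.

```python
import re
from collections import defaultdict

# Assumption: the task name is the token right after "Result" or "task".
TASK_NAME = re.compile(r"\b(?:Result|task) (\S+)")


def tasks_seen_on_multiple_hosts(log_lines):
    """Map each task name to the sorted hosts that logged it, if more than one."""
    hosts_by_task = defaultdict(set)
    for line in log_lines:
        parts = line.split(None, 3)  # host, date, time, message
        if len(parts) < 4:
            continue
        host, message = parts[0], parts[3]
        match = TASK_NAME.search(message)
        if match:
            hosts_by_task[match.group(1)].add(host)
    return {task: sorted(hosts) for task, hosts in hosts_by_task.items() if len(hosts) > 1}
```

Fed the nine lines above, it would list SCC1_0001579_Lin-CSD-A_5740_0 against all three addresses.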

----------------------------------------
[Edit 1 times, last edit by NUCCpod_NAPTIMELABS_01 at Dec 20, 2017 2:09:59 AM]

[Dec 20, 2017 2:07:09 AM]

BobCat13
Senior Cruncher
Joined: Oct 29, 2005
Post Count: 295
Status: Offline


Re: Hundreds and Hundreds of "Detached" errors

I really would like to get this solved, but I am out of ideas of things to try and fix on my end.

I wrote the following up a while ago, but WCG would not let me post it at the time. Edit: still cannot post it all in one reply, so I am going to try breaking it up.

Debian-based Linux distros use 127.0.1.1 as the device IP address in the hosts file if DHCP is used. If you have multiple devices with the same hardware, the same Linux OS, and the same device name, and they all use DHCP, then WCG may see them as one device.
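A quick way to see what a device's hosts file maps to 127.0.1.1, as a sketch. It takes the file contents as text (normally read from /etc/hosts); the function name is made up for illustration.

```python
def names_on_loopback_alias(hosts_text, address="127.0.1.1"):
    """Return the host names that the given hosts-file text maps to address."""
    names = []
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if fields and fields[0] == address:
            names.extend(fields[1:])
    return names


if __name__ == "__main__":
    with open("/etc/hosts") as handle:
        print(names_on_loopback_alias(handle.read()))
```

If every device prints the same name here, they fit the situation described above.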

[Dec 20, 2017 2:52:35 AM]

BobCat13
Senior Cruncher
Joined: Oct 29, 2005
Post Count: 295
Status: Offline


Re: Hundreds and Hundreds of "Detached" errors

This scenario may not be correct, but it came to my mind.

Let's call these devices A and B:
1. Device A connects to request work and tells WCG it has no tasks, so WCG sends tasks.
2. Device B connects to request work and tells WCG it has no tasks, so WCG sends tasks. Since this request told the server that the device had no tasks, all of the tasks for Device A are marked Detached (some other projects call these Abandoned).
3. Device A falls below the cache setting, so it requests more work. All of the tasks it previously received are now marked as Detached, so the server tells the client to Abort them. The server also assigns more tasks, but since the work request does not include the tasks for Device B, those Device B tasks are now marked Detached.
4. Device B requests more work and is told to Abort the tasks it has. It receives more work, but its request doesn't list the Device A tasks, so Device A will be told to Abort on next contact, and so on, and so on.

This can be avoided by giving each device a unique name, e.g. nucc001, nucc002, etc., or by using static IP addresses and making sure they show in the hosts file instead of 127.0.1.1.
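The A/B scenario in the numbered steps can be played out in a toy model. This is only a sketch of the guess above, not how the WCG server actually identifies hosts: the Server class, the host key, and the task names are all invented, and every device asks for work on every round instead of waiting to fall below its cache.

```python
class Server:
    """Toy scheduler that tracks in-progress tasks per host key (an assumption)."""

    def __init__(self):
        self.in_progress = {}  # host key -> set of task names
        self.detached = []     # tasks the server gave up on
        self.next_id = 0

    def request_work(self, host_key, tasks_on_client, want=2):
        known = self.in_progress.setdefault(host_key, set())
        reported = set(tasks_on_client)
        lost = known - reported            # assigned but not reported: mark Detached
        self.detached.extend(sorted(lost))
        known -= lost
        abort = sorted(reported - known)   # reported but no longer assigned: Abort
        new = [f"task{self.next_id + i}" for i in range(want)]
        self.next_id += want
        known.update(new)
        return abort, new


def simulate(host_keys, rounds=2):
    """Let one device per host key request work each round; return detached tasks."""
    server = Server()
    clients = [[] for _ in host_keys]
    for _ in range(rounds):
        for i, key in enumerate(host_keys):
            abort, new = server.request_work(key, clients[i])
            clients[i] = [t for t in clients[i] if t not in abort] + new
    return server.detached
```

With two devices sharing one key, simulate(["nucc", "nucc"]) keeps detaching whatever the other device holds, while simulate(["nucc001", "nucc002"]) detaches nothing.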

----------------------------------------
[Edit 1 times, last edit by BobCat13 at Dec 20, 2017 3:00:18 AM]

[Dec 20, 2017 2:53:19 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Hundreds and Hundreds of "Detached" errors

I really would like to get this solved, but I am out of ideas of things to try and fix on my end.

The internet has an opinion on the most common understanding of what 127.0.0.1 and 127.0.1.1 do.

127.0.0.1 IP Address Explained - Lifewire
https://www.lifewire.com › ... › Basics
Jun 9, 2017 · The IP address 127.0.0.1 is a special-purpose IPv4 address called localhost or loopback address. ... The loopback address is only used by the computer you're on, and only for special circumstances.

----------------------------------------
[Edit 1 times, last edit by Former Member at Dec 20, 2017 2:06:27 PM]

[Dec 20, 2017 2:05:54 PM]
