NUCCpod_NAPTIMELABS_01
Cruncher
Joined: Nov 28, 2017
Post Count: 10
Status: Offline
Hundreds and Hundreds of "Detached" errors

Hello fellow humans.

I'm trying to track down why I'm generating page after page of "Detached" errors for newly issued (BOINC) work units.

I previously had some issues with my BOINC clients hitting a network bottleneck and throwing signature errors, so I reset all the workers on WCG. They are churning away again, except that I now have reams of "Detached" results.

The only information I've managed to find is: "When a newer client gets dis-associated from this project with tasks still in the cache, a message will be sent to the servers to ensure that these tasks get quickly redistributed. With older clients this would not happen, and task copies would not be sent until the 'No Reply' condition occurred."

However, this seems to be happening with newly issued work units at an alarming rate.
Thoughts?
[Dec 2, 2017 7:11:04 AM]
SekeRob
Master Cruncher
Joined: Jan 7, 2013
Post Count: 2741
Status: Offline
Re: Hundreds and Hundreds of "Detached" errors

Open BOINC Manager, hit Ctrl+Shift+E, then scroll to the top of the Event Log window and check whether one of the lines there says 'Starting...'. If so, hit the Copy All button and paste the content in a reply. If not, restart the computer, let it run for a while, then hit Ctrl+Shift+E again and copy/paste in a reply. This lets us do a first read of your device setup and some of the client activity.
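
If any of your boxes are headless, the same event log text can be pulled over the GUI RPC interface with boinccmd instead of the Manager. A minimal sketch in Python, assuming GUI RPC access is enabled on each client; the host list and password below are placeholders for your own setup:

import subprocess

# Placeholder worker addresses and RPC password (from gui_rpc_auth.cfg,
# if one is set); substitute your own values.
HOSTS = ["10.0.1.245", "10.0.1.246"]
PASSWORD = "rpc-password"

for host in HOSTS:
    # boinccmd --get_messages prints the same event log the Manager shows
    out = subprocess.run(
        ["boinccmd", "--host", host, "--passwd", PASSWORD, "--get_messages"],
        capture_output=True, text=True,
    )
    print(f"===== {host} =====")
    print(out.stdout)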
[Dec 2, 2017 12:50:33 PM]
NUCCpod_NAPTIMELABS_01
Cruncher
Joined: Nov 28, 2017
Post Count: 10
Status: Offline
Re: Hundreds and Hundreds of "Detached" errors

Taking a small sample from the log collector, it seems related to the following:
10.0.1.246 2017-11-30 12:50:09 Result FAH2_001810_avx15988-0_000004_000074_007_0 is no longer usable
10.0.1.246 2017-11-30 12:50:09 Result OET1_0005196_x4GV3p_rig_49847_0 is no longer usable
10.0.1.246 2017-11-30 12:50:09 Result ZIKA_000292230_x2o8l_Saur_V8pr_0722_0 is no longer usable
10.0.1.246 2017-11-30 12:50:09 Result OET1_0005196_x4GV3p_rig_49839_0 is no longer usable
10.0.1.246 2017-11-30 12:50:09 Result OET1_0005196_x4GV3p_rig_49840_0 is no longer usable
10.0.1.246 2017-11-30 12:50:09 Result FAH2_001813_avx16757-0_000003_000053_008_0 is no longer usable
10.0.1.246 2017-11-30 12:50:09 Result OET1_0005196_x4GV3p_rig_4984_0 is no longer usable
10.0.1.246 2017-11-30 12:50:09 Result OET1_0005196_x4GV3p_rig_49843_0 is no longer usable
10.0.1.246 2017-11-30 12:50:09 Result ZIKA_000292230_x2o8l_Saur_V8pr_0745_0 is no longer usable
10.0.1.246 2017-11-30 12:50:09 Result MCM1_0138474_0118_1 is no longer usable
10.0.1.246 2017-11-30 12:50:09 Result OET1_0005196_x4GV3p_rig_49863_0 is no longer usable
10.0.1.246 2017-11-30 12:50:09 Result OET1_0005196_x4GV3p_rig_49881_0 is no longer usable
10.0.1.246 2017-11-30 12:50:09 Result SCC1_0001520_Lin-CSD-A_24983_0 is no longer usable
10.0.1.246 2017-11-30 12:50:09 Result ZIKA_000292230_x2o8l_Saur_V8pr_0740_0 is no longer usable
10.0.1.246 2017-11-30 12:50:09 Result OET1_0005196_x4GV3p_rig_49862_0 is no longer usable
10.0.1.246 2017-11-30 12:50:09 Result OET1_0005196_x4GV3p_rig_49882_0 is no longer usable
10.0.1.246 2017-11-30 12:50:09 Result FAH2_001823_avx17257-0_000008_000068_010_0 is no longer usable


Also of note: it looks like multiple of my workers are being sent the same WU?
Even after a client is told to abandon a task, it seems to pop up again, only to error out again and again.


10.0.1.245 2017-11-30 13:36:40 Result SCC1_0001520_Lin-CSD-A_24983_0 is no longer usable
10.0.1.246 2017-11-30 12:50:09 Result SCC1_0001520_Lin-CSD-A_24983_0 is no longer usable
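
For what it's worth, a rough tally sketch (assuming every collector line follows the '<ip> <date> <time> Result <name> is no longer usable' shape above, and a hypothetical combined log file named collector.log) that flags results reported by more than one worker:

import re
from collections import defaultdict

# Which workers reported each result as "no longer usable"?
pattern = re.compile(r"^(\S+) .* Result (\S+) is no longer usable")
hosts_by_result = defaultdict(set)

with open("collector.log") as f:
    for line in f:
        m = pattern.match(line)
        if m:
            ip, result = m.groups()
            hosts_by_result[result].add(ip)

for result, ips in sorted(hosts_by_result.items()):
    if len(ips) > 1:
        print(f"{result} seen on {len(ips)} workers: {', '.join(sorted(ips))}")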

[Dec 2, 2017 1:22:17 PM]
SekeRob
Master Cruncher
Joined: Jan 7, 2013
Post Count: 2741
Status: Offline
Re: Hundreds and Hundreds of "Detached" errors

That's not what I asked for... from the top, please.
[Dec 2, 2017 1:23:52 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Hundreds and Hundreds of "Detached" errors

Did you deploy from one master image? Maybe all the computers somehow have the same ID? That's the only way I can think of that multiple workers would end up with the same jobs.
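
If you want to rule that out, each client records its cross-project host CPID in the <host_info> section of client_state.xml. A quick sketch to spot duplicates, assuming (hypothetically) that each worker's client_state.xml has been copied into a local states/<ip>/ directory:

import re
from pathlib import Path

# Map each host CPID to the workers reporting it.
cpids = {}
for state in Path("states").glob("*/client_state.xml"):
    m = re.search(r"<host_cpid>([0-9a-f]+)</host_cpid>", state.read_text())
    if m:
        cpids.setdefault(m.group(1), []).append(state.parent.name)

for cpid, workers in sorted(cpids.items()):
    if len(workers) > 1:
        print(f"Duplicate host CPID {cpid}: {', '.join(workers)}")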
[Dec 2, 2017 9:19:19 PM]
NUCCpod_NAPTIMELABS_01
Cruncher
Joined: Nov 28, 2017
Post Count: 10
Status: Offline
Re: Hundreds and Hundreds of "Detached" errors

I have 28 workers crunching at the moment, none of which was deployed from a master image that would have duplicated the host CPID, but many do have the same host name. Does WCG identify workers by hostname rather than by BOINC host ID?

It was a long night of bug hunting, so I only just now correctly parsed the request for logs.


Sat 02 Dec 2017 02:15:09 PM PST | | <![CDATA[Starting BOINC client version 7.6.33 for x86_64-pc-linux-gnu]]>
Sat 02 Dec 2017 02:15:09 PM PST | | <![CDATA[log flags: file_xfer, sched_ops, task]]>
Sat 02 Dec 2017 02:15:09 PM PST | | <![CDATA[Libraries: libcurl/7.52.1 OpenSSL/1.0.2g zlib/1.2.11 libidn2/0.16 libpsl/0.17.0 (+libidn2/0.16) librtmp/2.3]]>
Sat 02 Dec 2017 02:15:09 PM PST | | <![CDATA[Data directory: /var/lib/boinc-client]]>
Sat 02 Dec 2017 02:15:10 PM PST | | <![CDATA[CUDA: NVIDIA GPU 0: GeForce 9500 GT (driver version 340.10, CUDA version 6.5, compute capability 1.1, 511MB, 489MB available, 132 GFLOPS peak)]]>
Sat 02 Dec 2017 02:15:10 PM PST | | <![CDATA[OpenCL: NVIDIA GPU 0: GeForce 9500 GT (driver version 340.102, device version OpenCL 1.0 CUDA, 511MB, 489MB available, 132 GFLOPS peak)]]>
Sat 02 Dec 2017 02:15:10 PM PST | | <![CDATA[Host name: nucc]]>
Sat 02 Dec 2017 02:15:10 PM PST | | <![CDATA[Processor: 4 GenuineIntel Intel(R) Core(TM) i3-2120 CPU @ 3.30GHz [Family 6 Model 42 Stepping 7]]]>
Sat 02 Dec 2017 02:15:10 PM PST | | <![CDATA[Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer xsave avx lahf_lm epb tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm arat pln pts]]>
Sat 02 Dec 2017 02:15:10 PM PST | | <![CDATA[OS: Linux: 4.10.0-19-generic]]>
Sat 02 Dec 2017 02:15:10 PM PST | | <![CDATA[Memory: 7.77 GB physical, 0 bytes virtual]]>
Sat 02 Dec 2017 02:15:10 PM PST | | <![CDATA[Disk: 182.38 GB total, 167.39 GB free]]>
Sat 02 Dec 2017 02:15:10 PM PST | | <![CDATA[Local time is UTC -8 hours]]>
Sat 02 Dec 2017 02:15:10 PM PST | | <![CDATA[Config: GUI RPC allowed from any host]]>
Sat 02 Dec 2017 02:15:10 PM PST | | <![CDATA[Config: GUI RPCs allowed from:]]>
Sat 02 Dec 2017 02:15:10 PM PST | Asteroids@home | <![CDATA[URL http://asteroidsathome.net/boinc/; Computer ID 484321; resource share 100]]>
Sat 02 Dec 2017 02:15:10 PM PST | Cosmology@Home | <![CDATA[URL http://www.cosmologyathome.org/; Computer ID 330783; resource share 100]]>
Sat 02 Dec 2017 02:15:10 PM PST | Einstein@Home | <![CDATA[URL http://einsteinathome.org/; Computer ID 12595721; resource share 100]]>
Sat 02 Dec 2017 02:15:10 PM PST | Milkyway@Home | <![CDATA[URL http://milkyway.cs.rpi.edu/milkyway/; Computer ID 753001; resource share 100]]>
Sat 02 Dec 2017 02:15:10 PM PST | NFS@Home | <![CDATA[URL http://escatter11.fullerton.edu/nfs/; Computer ID 5856315; resource share 100]]>
Sat 02 Dec 2017 02:15:10 PM PST | pogs | <![CDATA[URL http://pogs.theskynet.org/pogs/; Computer ID 832781; resource share 100]]>
Sat 02 Dec 2017 02:15:10 PM PST | PrimeGrid | <![CDATA[URL http://www.primegrid.com/; Computer ID 912912; resource share 100]]>
Sat 02 Dec 2017 02:15:10 PM PST | Rosetta@home | <![CDATA[URL http://boinc.bakerlab.org/rosetta/; Computer ID 3292772; resource share 100]]>
Sat 02 Dec 2017 02:15:10 PM PST | SETI@home | <![CDATA[URL http://setiathome.berkeley.edu/; Computer ID 8386483; resource share 100]]>
Sat 02 Dec 2017 02:15:10 PM PST | World Community Grid | <![CDATA[URL http://www.worldcommunitygrid.org/; Computer ID 4234391; resource share 100]]>
Sat 02 Dec 2017 02:15:10 PM PST | World Community Grid | <![CDATA[General prefs: from World Community Grid (last modified 28-Nov-2017 23:37:27)]]>
Sat 02 Dec 2017 02:15:10 PM PST | World Community Grid | <![CDATA[Host location: none]]>
Sat 02 Dec 2017 02:15:10 PM PST | World Community Grid | <![CDATA[General prefs: using your defaults]]>
Sat 02 Dec 2017 02:15:10 PM PST | | <![CDATA[Reading preferences override file]]>
Sat 02 Dec 2017 02:15:10 PM PST | | <![CDATA[Preferences:]]>
Sat 02 Dec 2017 02:15:10 PM PST | | <![CDATA[ max memory usage when active: 3976.74MB]]>
Sat 02 Dec 2017 02:15:10 PM PST | | <![CDATA[ max memory usage when idle: 7158.14MB]]>
Sat 02 Dec 2017 02:15:10 PM PST | | <![CDATA[ max disk usage: 164.15GB]]>
Sat 02 Dec 2017 02:15:10 PM PST | | <![CDATA[ suspend work if non-BOINC CPU load exceeds 25%]]>
Sat 02 Dec 2017 02:15:10 PM PST | | <![CDATA[ (to change preferences, visit a project web site or select Preferences in the Manager)]]>

[Dec 2, 2017 10:19:45 PM]
NUCCpod_NAPTIMELABS_01
Cruncher
Joined: Nov 28, 2017
Post Count: 10
Status: Offline
Re: Hundreds and Hundreds of "Detached" errors

It looks like I'm being sent lost tasks when I request work units, and the next time the client checks in with the project server it's told to abandon the task.


Sat 02 Dec 2017 03:46:39 PM PST | World Community Grid | <![CDATA[Resent lost task SCC1_0001528_Lin-CSD-A_17909_0]]>

Sat 02 Dec 2017 03:46:45 PM PST | World Community Grid | <![CDATA[Starting task SCC1_0001528_Lin-CSD-A_17909_0]]>

Sat 02 Dec 2017 04:35:34 PM PST | World Community Grid | <![CDATA[Result SCC1_0001528_Lin-CSD-A_17909_0 is no longer usable]]>

Sat 02 Dec 2017 04:35:35 PM PST | World Community Grid | <![CDATA[Computation for task SCC1_0001528_Lin-CSD-A_17909_0 finished]]>




Edit: What's more, when I manually prod a worker to update the project, every single work unit that had been sent to that worker is abandoned by the server.
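
A rough way to put a number on the cycle, as a sketch only: assuming the event log was saved to a hypothetical eventlog.txt in the CDATA format above, pair each "Resent lost task" with a later "no longer usable" for the same result.

import re

# Collect result names from "Resent lost task" and "no longer usable"
# lines, then intersect to see how many resent tasks were later abandoned.
resent, unusable = set(), set()
with open("eventlog.txt") as f:
    for line in f:
        m = re.search(r"Resent lost task ([^\]\s]+)", line)
        if m:
            resent.add(m.group(1))
        m = re.search(r"Result (\S+) is no longer usable", line)
        if m:
            unusable.add(m.group(1))

abandoned = resent & unusable
print(f"{len(abandoned)} of {len(resent)} resent tasks later became unusable")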
----------------------------------------
[Edited 1 time, last edit by NUCCpod_NAPTIMELABS_01 at Dec 3, 2017 1:05:47 AM]
[Dec 3, 2017 1:03:12 AM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7849
Status: Offline
Re: Hundreds and Hundreds of "Detached" errors

...but many do have the same host name. Does WCG identify workers by hostname rather than by BOINC host ID?

I have two systems with the same host name and it has never been a problem for them. They do have different computer IDs listed in the logs.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Dec 3, 2017 1:40:36 AM]
SekeRob
Master Cruncher
Joined: Jan 7, 2013
Post Count: 2741
Status: Offline
Re: Hundreds and Hundreds of "Detached" errors

Sat 02 Dec 2017 02:15:10 PM PST | World Community Grid | <![CDATA[URL http://www.worldcommunitygrid.org/; Computer ID 4234391; resource share 100]]>

If more than one client shows this number, 4234391, in its log, there's a problem. If you have devices with the same host ID or the same LAN IP address, there's a problem. If multiple running clients look at the same data directory, there's a problem. The server continuously compares what a client was assigned, and it maintains connect counters; if those are out of sync, it will reset and assign new work, and with 28 clients that could happen frequently. It's important that each client is installed clean and has exclusive access to its own data directory.
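
To check the first point across the whole pod: the Computer ID for each project sits in the <hostid> tag of that project's block in client_state.xml. A sketch, using the same hypothetical states/<ip>/ layout as earlier in the thread, that lists the WCG Computer ID on every worker:

import re
from pathlib import Path

# Print each worker's World Community Grid Computer ID. Each <project>
# block in client_state.xml carries the project's master URL and <hostid>.
for state in sorted(Path("states").glob("*/client_state.xml")):
    for chunk in state.read_text().split("</project>"):
        if "worldcommunitygrid.org" in chunk:
            m = re.search(r"<hostid>(\d+)</hostid>", chunk)
            print(state.parent.name, m.group(1) if m else "no hostid found")
            break

If any two workers print the same number, that is the smoking gun.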

This is puzzling since, per the log, there are many projects attached. Is it only WCG presenting the issue?
[Dec 3, 2017 7:20:51 AM]
NUCCpod_NAPTIMELABS_01
Cruncher
Joined: Nov 28, 2017
Post Count: 10
Status: Offline
Re: Hundreds and Hundreds of "Detached" errors

Each client is a clean install, on its own metal, with its own IP and its own data directory.
The errors became so prevalent that, under observation, up to 100% of the work units being processed were being "abandoned" by the server after being sent to me. I've had to fully detach from WCG for the time being, lest 130-odd threads of potential science fire away into the void.

No other projects have had any problems remotely similar to this (on my end at least), and all other projects were suspended at the time.

My current plan is to attend to some other aspects of this rig and circle back in a week or so, in the hope that these cyclical nightmare work units vanish. The fallback plan is to register a new account and re-attach, which I'd like to avoid entirely if at all possible.
[Dec 3, 2017 8:25:14 AM]