| World Community Grid Forums
Thread Status: Active Total posts in this thread: 21
SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline |
Not sure about 7.6.33, but 7.8 ** (get it from gianfranco's PPA at https://launchpad.net/~costamagnagianfranco/+archive/ubuntu/boinc ... he's the Ubuntu/Debian package maintainer) for sure fully erases the project's data subfolder on detach, but the main data dir would still contain the files with the project name in them, such as account_www.worldcommunitygrid.org.xml and master_www.worldcommunitygrid.org.xml. These need removing, as otherwise re-adding the project would use the old base information.
What I'd certainly do is go through the start-up section of every device's event log and make sure no Computer ID is being used twice. Reading your replies, that should not normally be possible... one can even run multiple clients on a single device with proper preparation, as long as each additional instance points to its own exclusive data directory. Of course there's a possibility of conflict... could there be multiple data dirs on a device, not only in /var/lib/boinc-client? ** Don't know what's in the present Ubuntu repository, either 7.6.33 or 7.8.3, if you're running Ubuntu... I don't see which distro you run. |
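To see what a detach left behind, something like the following could list the top-level files that still carry the project host name. This is only a sketch: the function name is made up for illustration, and the data directory path is an assumption that depends on how BOINC was installed. It lists files and deletes nothing.

```python
from pathlib import Path


def leftover_project_files(data_dir, project_host="www.worldcommunitygrid.org"):
    """Return top-level files in data_dir whose names mention project_host.

    Hypothetical helper: it only lists candidates (for example
    account_<host>.xml and master_<host>.xml) and never deletes anything.
    """
    root = Path(data_dir)
    return sorted(
        path for path in root.iterdir()
        if path.is_file() and project_host in path.name
    )
```

Calling `leftover_project_files("/var/lib/boinc-client")` with the client stopped would show which files to review before re-adding the project.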
||
|
|
SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline |
Does this look familiar...
29421 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_61200_0
29422 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_61045_0
29423 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_61374_0
29424 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_61479_0
29425 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_61299_0
29426 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_61416_0
29427 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_61478_0
29428 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_61265_0
29429 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_61443_0
29430 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_61300_0
29431 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_61270_0
29432 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_60693_0
29433 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_60755_0
29434 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_61266_0
29435 World Community Grid 12/3/2017 12:31:08 PM Resent lost task OET1_0005203_x4GV3_rig_61355_0
...
In my case it happened because write access to the whole BOINC data dir was messed up, so the client was cycling through all the jobs, trying to start them, then fetching new copies, several times.
[Edit 1 times, last edit by SekeRob* at Dec 3, 2017 12:12:42 PM] |
||
|
|
NUCCpod_NAPTIMELABS_01
Cruncher Joined: Nov 28, 2017 Post Count: 10 Status: Offline Project Badges:
|
Sadly no, it looks like your message ids have those events back to back to back, which is not the case on my end.
My workers start processing the newly downloaded tasks, only to have them "canceled" by the server, with "Aborted by project" displayed in the GUI. |
||
|
|
SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline |
Can you post your cc_config.xml file content... just another hunch
|
||
|
|
BobCat13
Senior Cruncher Joined: Oct 29, 2005 Post Count: 295 Status: Offline Project Badges:
|
What are the IP addresses for each device as listed in client_state.xml?
If they are running a Debian-based distro of Linux and using DHCP, then most likely each of them will report 127.0.1.1 back to the WCG server.
[Edit 2 times, last edit by BobCat13 at Dec 6, 2017 7:04:20 PM] |
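A quick way to read those values on each device might look like the sketch below. It assumes client_state.xml parses as XML and holds a `host_info` element with `domain_name` and `ip_addr` children; those element names are my assumption about the file layout, so check them against a real file first. The function name is made up for illustration.

```python
import xml.etree.ElementTree as ET


def reported_identity(client_state_xml):
    """Return (domain_name, ip_addr) found in a client_state.xml string.

    Assumption: the values sit in <host_info><domain_name> and
    <host_info><ip_addr>. Missing elements come back as None.
    """
    root = ET.fromstring(client_state_xml)
    host_info = root.find(".//host_info")
    if host_info is None:
        return (None, None)
    return (host_info.findtext("domain_name"), host_info.findtext("ip_addr"))
```

If every device returns the same pair, for example the same name together with 127.0.1.1, that would fit the situation described above.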
||
|
|
NUCCpod_NAPTIMELABS_01
Cruncher Joined: Nov 28, 2017 Post Count: 10 Status: Offline Project Badges:
|
Quoting BobCat13: "What are the IP addresses for each device as listed in client_state.xml? If they are running a Debian-based distro of Linux and using DHCP, then most likely each of them will report 127.0.1.1 back to the WCG server."
Yes, that is the case, but it *shouldn't* be an issue. I waited for all the work units to expire, purged and reinstalled BOINC, and reattached my workers. Almost instantly the errors started accumulating again: "Aborted by project" in the GUI, "Result *soandso* is no longer usable" in the logs. In just an hour or so, I'm back up to 11 pages of errors! I really would like to get this solved, but I am out of ideas of things to try and fix on my end. |
||
|
|
NUCCpod_NAPTIMELABS_01
Cruncher Joined: Nov 28, 2017 Post Count: 10 Status: Offline Project Badges:
|
The best I can tell is that WCG is giving several of my workers the same work unit, and then invalidating the assignment when the workers check in on the next cycle.
10.0.1.246 2017-12-19 17:47:39 Result SCC1_0001579_Lin-CSD-A_5740_0 is no longer usable
10.0.1.245 2017-12-19 17:47:34 Result SCC1_0001579_Lin-CSD-A_5740_0 is no longer usable
10.0.1.245 2017-12-19 17:47:35 Computation for task SCC1_0001579_Lin-CSD-A_5740_0 finished
10.0.1.252 2017-12-19 17:47:35 Result SCC1_0001579_Lin-CSD-A_5740_0 is no longer usable
10.0.1.252 2017-12-19 17:47:37 Computation for task SCC1_0001579_Lin-CSD-A_5740_0 finished
10.0.1.245 2017-12-19 17:37:14 Resent lost task SCC1_0001579_Lin-CSD-A_5740_0
10.0.1.245 2017-12-19 17:37:44 Starting task SCC1_0001579_Lin-CSD-A_5740_0
10.0.1.252 2017-12-19 17:37:16 Resent lost task SCC1_0001579_Lin-CSD-A_5740_0
10.0.1.252 2017-12-19 17:37:41 Starting task SCC1_0001579_Lin-CSD-A_5740_0
[Edit 1 times, last edit by NUCCpod_NAPTIMELABS_01 at Dec 20, 2017 2:09:59 AM] |
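For a larger merged log, the same check could be scripted. The sketch below assumes each line has the shape shown above (worker IP, date, time, then the client message) and that the task name follows the word "Result" or "task"; both are assumptions drawn from this excerpt only, and the function names are made up for illustration.

```python
import re
from collections import defaultdict

# Task name follows "Result" or "task" in the messages seen in this thread.
TASK_RE = re.compile(r"(?:Result|task) (\S+)")


def hosts_per_task(lines):
    """Map each task name to the set of worker IPs that logged it."""
    seen = defaultdict(set)
    for line in lines:
        parts = line.split(maxsplit=3)  # ip, date, time, message
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        host, _date, _time, message = parts
        match = TASK_RE.search(message)
        if match:
            seen[match.group(1)].add(host)
    return seen


def shared_tasks(lines):
    """Return only the tasks that appear on more than one worker."""
    return {
        task: sorted(hosts)
        for task, hosts in hosts_per_task(lines).items()
        if len(hosts) > 1
    }
```

Fed the nine lines above, `shared_tasks` reports the single task name against all three workers, which matches the reading that one work unit was handed to several devices.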
||
|
|
BobCat13
Senior Cruncher Joined: Oct 29, 2005 Post Count: 295 Status: Offline Project Badges:
|
Quoting NUCCpod_NAPTIMELABS_01: "I really would like to get this solved, but I am out of ideas of things to try and fix on my end."
I wrote the following up a while ago, but WCG would not let me post it at the time.
Edit: still cannot post it all in one reply, so I am going to try breaking it up.
Debian-based distros of Linux use 127.0.1.1 as the device IP address in the hosts file if DHCP is used. If you have multiple devices with the same hardware and Linux OS, same device name, and use DHCP, then WCG may see them as one device. |
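To check what a device's hosts file maps to 127.0.1.1, a small parser along these lines could help. It is a sketch that takes the file contents as text (for example, read from /etc/hosts); the function name is made up for illustration.

```python
def names_for_address(hosts_text, address="127.0.1.1"):
    """Return the host names that a hosts file maps to the given address."""
    names = []
    for raw_line in hosts_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        if fields[0] == address:
            names.extend(fields[1:])
    return names
```

Running it on each device and comparing the results shows quickly whether several of them carry the same name on that loopback entry.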
||
|
|
BobCat13
Senior Cruncher Joined: Oct 29, 2005 Post Count: 295 Status: Offline Project Badges:
|
This scenario may not be correct, but it came to my mind.
Let's call these devices A and B:
1. Device A tells WCG it has no tasks on a connect to request work, so WCG sends tasks.
2. Device B tells WCG it has no tasks on a connect to request work, so WCG sends tasks. Since the request for work told the server that it had no tasks, all of the tasks for Device A are marked Detached (some other projects call these Abandoned).
3. Device A falls below the cache setting, so it requests more work. All of the tasks it previously received are now marked as Detached, so the server tells the client to Abort them. The server also assigns more tasks, but since the work request does not include the tasks for Device B, those Device B tasks are now marked Detached.
4. Device B requests more work and is told to Abort the tasks it has. It receives more work but doesn't know about the Device A tasks, so Device A will be told to Abort on next contact, and so on, and so on.
This can be avoided by giving each device a unique name, e.g. nucc001, nucc002, etc., or by using static IP addresses and making sure they show in the hosts file instead of 127.0.1.1.
[Edit 1 times, last edit by BobCat13 at Dec 20, 2017 3:00:18 AM] |
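The four steps above can be played through with a toy model. This is not the real BOINC scheduler: `ToyServer`, `ToyClient`, and the rule that a host is identified purely by a name key are all assumptions made to illustrate the hypothesis.

```python
from itertools import count


class ToyServer:
    """Keeps one task record per host key (hypothetical identity rule)."""

    def __init__(self):
        self.records = {}  # host key -> {task name: "active" or "detached"}
        self._ids = count(1)

    def request_work(self, host_key, reported_tasks, want=2):
        tasks = self.records.setdefault(host_key, {})
        # Reported tasks already marked detached must be aborted by the client.
        abort = [t for t in reported_tasks if tasks.get(t) == "detached"]
        # Active tasks the client did not report are assumed lost: detach them.
        for name, state in tasks.items():
            if state == "active" and name not in reported_tasks:
                tasks[name] = "detached"
        new = [f"task{next(self._ids)}" for _ in range(want)]
        for name in new:
            tasks[name] = "active"
        return new, abort


class ToyClient:
    def __init__(self, host_key):
        self.host_key = host_key
        self.tasks = []
        self.aborted = []

    def contact(self, server):
        new, abort = server.request_work(self.host_key, list(self.tasks))
        self.tasks = [t for t in self.tasks if t not in abort] + new
        self.aborted.extend(abort)


def run(key_a, key_b, rounds=2):
    """Let devices A and B contact the server in turn; return abort counts."""
    server = ToyServer()
    a, b = ToyClient(key_a), ToyClient(key_b)
    for _ in range(rounds):
        a.contact(server)
        b.contact(server)
    return len(a.aborted), len(b.aborted)
```

In this model `run("nucc", "nucc")` has both devices aborting work from the second round on, while `run("nucc001", "nucc002")` produces no aborts, which is the effect the unique-name suggestion aims for.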
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Quoting BobCat13: "I wrote the following up a while ago, but WCG would not let me post it at the time. Edit: still cannot post it all in one reply, so I am going to try breaking it up. Debian-based distros of Linux use 127.0.1.1 as the device IP address in the hosts file if DHCP is used. If you have multiple devices with the same hardware and Linux OS, same device name, and use DHCP, then WCG may see them as one device."
The internet has an opinion on the most common understanding of what 127.0.0.1 and 127.0.1.1 do.
127.0.0.1 IP Address Explained - Lifewire (https://www.lifewire.com, Jun 9, 2017): "The IP address 127.0.0.1 is a special-purpose IPv4 address called localhost or loopback address. ... The loopback address is only used by the computer you're on, and only for special circumstances."
[Edit 1 times, last edit by Former Member at Dec 20, 2017 2:06:27 PM] |