Thread Status: Active
Total posts in this thread: 10
This topic has been viewed 2425 times and has 9 replies
Marcus.TheOriginal
Cruncher
Joined: Mar 31, 2020
Post Count: 4
Status: Offline
Many "detached" WU on new machines

Hi there,

Last weekend I deployed two new servers to run tasks for WCG and Rosetta. I installed the boinc-client, copied the two xml config files from existing and working servers, attached the project and left the servers to do their job.

Today I've seen that many WUs have the status "detached". What does it mean? And what could be the cause? To me the configuration seems correct (I've copied it from other servers where it still works absolutely fine).

Screenshot from Result Status page

Here's some (cropped) boinc status


root@boinc5:~# boinccmd --get_state
======== Projects ========
1) -----------
name: World Community Grid
master URL: http://www.worldcommunitygrid.org/
[...]
nrpc_failures: 0
master_fetch_failures: 0
master fetch pending: no
scheduler RPC pending: no
trickle upload pending: no
attached via Account Manager: no
ended: no
suspended via GUI: no
don't request more work: no
disk usage: 0.000000
last RPC: Tue May 19 19:25:08 2020

project files downloaded: 1589731973.282394
[...]
jobs succeeded: 7
jobs failed: 1
elapsed time: 57273.015918
2) -----------
name: Rosetta@home
master URL: http://boinc.bakerlab.org/rosetta/
[...]
nrpc_failures: 0
master_fetch_failures: 0
master fetch pending: no
scheduler RPC pending: no
trickle upload pending: no
attached via Account Manager: no
ended: no
suspended via GUI: no
don't request more work: no
disk usage: 0.000000
last RPC: Wed May 20 11:02:28 2020

project files downloaded: 0.000000
[...]
jobs succeeded: 5
jobs failed: 0
elapsed time: 174923.215097

======== Applications ========
1) -----------
name: opn1
Project: World Community Grid
2) -----------
name: rosetta
Project: Rosetta@home

======== Application versions ========
1) -----------
project: World Community Grid
application: opn1
platform: x86_64-pc-linux-gnu
version: 7.17
estimated GFLOPS: 3.77
filename: wcgrid_opn1_autodock_7.17_x86_64-pc-linux-gnu

[...]
======== Workunits ========
1) -----------
name: OPN1_0000259_12304
FP estimate: 3.330834e+13
FP bound: 1.332334e+15
memory bound: 239.13 MB
disk bound: 477.42 MB
[... about 9 more for both projects]
======== Tasks ========
1) -----------
name: OPN1_0000259_12304_0
WU name: OPN1_0000259_12304
project URL: http://www.worldcommunitygrid.org/
received: Mon May 18 12:58:42 2020
report deadline: Mon May 25 12:58:41 2020
ready to report: no
state: uploading
scheduler state: uninitialized
active_task_state: UNINITIALIZED
app version num: 0
resources: 1 CPU
final CPU time: 7855.520000
final elapsed time: 7889.734088
exit_status: 0
signal: 0
[9 more entries alike]
======== Time stats ========
now: 1590002989.133758
on_frac: 0.999983
connected_frac: -1.000000
cpu_and_network_available_frac: 0.999966
active_frac: 0.999966
gpu_active_frac: 0.999966
client_start_time: Sun May 17 18:12:41 2020

previous_uptime: 214.382298
session_active_duration: 271018.599318
session_gpu_active_duration: 271018.599318
total_start_time: Sun May 17 18:08:50 2020

total_duration: 271229.818201
total_active_duration: 271189.574691
total_gpu_active_duration: 271189.574691


Server config:
AMD EPYC Processor (with IBPB) [Family 23 Model 1 Stepping 2] 2 Cores
Ubuntu 18.04.4 LTS [4.15.0-99-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1)]
Ram 1944.91 MB

In contrast, the config of the servers without problems:
Intel Xeon Processor (Skylake, IBRS) [Family 6 Model 85 Stepping 4] 1 Core
Ubuntu 18.04.4 LTS [4.15.0-91-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1)]
Ram 1945.09 MB

and
Intel Xeon Processor (Skylake, IBRS) [Family 6 Model 85 Stepping 4] 2 Cores
Ubuntu 18.04.4 LTS [4.15.0-88-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1)]
Ram 7786.51 MB

All three setups are cloud instances.

Any tips to identify and eliminate the problems?
Thank you :)
[May 20, 2020 7:41:38 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Many "detached" WU on new machines

"copied the two xml config files from existing and working servers"

cc_config and app_config?

Does WCG disappear from these clients? If so, are you using BAM?
----------------------------------------
[Edit 1 times, last edit by Former Member at May 20, 2020 8:04:30 PM]
[May 20, 2020 8:02:53 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Many "detached" WU on new machines

My bet is, you have two machines with the same hostid... The server is probably confused.
On new setups I wouldn't copy the cc_config. Let the new instance create it. If you are creating new instances, why copy the config files? Only reason to copy would be if you were moving an instance to another machine.
----------------------------------------
[Edit 1 times, last edit by Former Member at May 20, 2020 8:17:14 PM]
[May 20, 2020 8:14:21 PM]
Marcus.TheOriginal
Cruncher
Joined: Mar 31, 2020
Post Count: 4
Status: Offline
Re: Many "detached" WU on new machines

Ah, sorry, nope, I did not copy a file named "cc_config". I copied /var/lib/boinc-client/account_boinc.worldcommunitygrid.org_wcg.xml
which contains the authenticator-url. I did the same with the Rosetta config.

I configured the other machines in exactly the same way; only on "boinc1" did I add these files manually.
[May 20, 2020 8:34:58 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Many "detached" WU on new machines

Actually, thinking back on it, the client doesn't create cc_config. The only time I have encountered detached work units was when I tried to start a second client on a host and forgot to set allow_multiple_clients in the cc_config. Did you copy anything other than the file you specified? Specifically, like the client_state.xml?
[May 20, 2020 10:51:19 PM]
Marcus.TheOriginal
Cruncher
Joined: Mar 31, 2020
Post Count: 4
Status: Offline
Re: Many "detached" WU on new machines

nope, I didn't
[May 21, 2020 12:12:23 AM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7846
Status: Offline
Re: Many "detached" WU on new machines

This may or may not be helpful, but the only time I have had "detached" work units is when a hard drive or machine has become unusable. Any work units which had been assigned were lost; the servers could no longer find them. In essence, they ceased to exist even though they had been issued. I would guess the equivalent would be to uninstall BOINC while there is still a cache of work units waiting to be processed.
Hope this helps.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[May 21, 2020 1:35:19 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Many "detached" WU on new machines

"Ah, sorry, nope, I did not copy a file named "cc_config". I copied /var/lib/boinc-client/account_boinc.worldcommunitygrid.org_wcg.xml
which contains the authenticator-url. I did the same with the Rosetta config.

I configured the other machines in exactly the same way; only on "boinc1" did I add these files manually."

By doing this you've created additional machines with the same ID, which confuses the server when it assigns work: it removes the work each time and then sends new work, and this repeats whenever the other duplicate client contacts the server. You need to detach (remove) WCG from these clients and then re-add it to create unique IDs. The problem should then go away.

Are these headless Linux machines, and is that why you do it this way? Attaching a project from the command line is not too difficult, something like
boinccmd --project_attach URL account_key
for example
boinccmd --project_attach www.worldcommunitygrid.org accountkey

The manual: https://boinc.berkeley.edu/wiki/Boinccmd_tool
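
For anyone scripting this across several headless hosts, the detach and re-attach steps described above could be wrapped roughly as follows. This is a sketch only: the helper names reattach_commands and reattach are made up, ACCOUNT_KEY is a placeholder, and it defaults to a dry run that merely prints the commands. It relies on the boinccmd options --project URL detach and --project_attach URL account_key from the manual linked above. Keep in mind that detaching discards whatever tasks the client still holds for that project.

```python
import subprocess


def reattach_commands(project_url, account_key):
    """Build the two boinccmd invocations: detach first, then attach again."""
    return [
        ["boinccmd", "--project", project_url, "detach"],
        ["boinccmd", "--project_attach", project_url, account_key],
    ]


def reattach(project_url, account_key, dry_run=True):
    """Print the commands (dry run) or execute them on the local client."""
    commands = reattach_commands(project_url, account_key)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first command that fails
            subprocess.run(cmd, check=True)
    return commands


# Dry run with a placeholder key; nothing is executed.
reattach("http://www.worldcommunitygrid.org/", "ACCOUNT_KEY")
```

Calling reattach(..., dry_run=False) on the affected host would actually run both commands, so it is worth checking the printed output first.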
----------------------------------------
[Edit 2 times, last edit by Former Member at May 22, 2020 5:15:23 PM]
[May 21, 2020 7:12:49 AM]
Marcus.TheOriginal
Cruncher
Joined: Mar 31, 2020
Post Count: 4
Status: Offline
Re: Many "detached" WU on new machines

Thank you, lavaflow

I read this in some manual: "create the xml, then attach the project". I've done so on several machines without fuss.

I've detached and re-attached the project via boinccmd. Now it seems to work. 8 WUs yesterday, all valid.

My host "boinc4", which had identical config and problems, still produces "detached" WUs. So I will re-attach the project there as well.

Thank you guys for your help!
[May 22, 2020 4:51:44 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Many "detached" WU on new machines

One of the safeguards is a connection counter that is incremented by one on each contact. If a device connects and its counter differs from what the server has recorded for that device, the tasks are ditched: integrity is lost, and the server can't trust that someone is not cooking the books.

Glad that was solved. Happy crunching
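
To make the counter idea above concrete, here is a toy model of it. It is not the actual server code: the class ToyScheduler, the task names and the exact rule (drop everything on any mismatch) are assumptions for illustration. Two clients that share one host ID each keep their own counter, so the server's expectation keeps getting out of step.

```python
class ToyScheduler:
    """Toy server: one expected counter and one task list per host ID."""

    def __init__(self):
        self.expected = {}   # host_id -> counter value expected next
        self.assigned = {}   # host_id -> tasks currently assigned
        self.detached = []   # tasks the server has given up on
        self.issued = 0      # running number used to name new tasks

    def contact(self, host_id, counter):
        expected = self.expected.get(host_id)
        if expected is not None and counter != expected:
            # Counter out of step: drop everything assigned to this host ID.
            self.detached.extend(self.assigned.get(host_id, []))
            self.assigned[host_id] = []
        self.issued += 1
        task = f"task_{self.issued}"
        self.assigned.setdefault(host_id, []).append(task)
        self.expected[host_id] = counter + 1
        return task


server = ToyScheduler()
SHARED_ID = 42  # hypothetical host ID used by two different machines

# Machines A and B alternate; each counts its own contacts from 1.
for counter in (1, 2):
    server.contact(SHARED_ID, counter)  # machine A
    server.contact(SHARED_ID, counter)  # machine B

print(server.detached)
```

In this model a single machine contacting the server with counters 1, 2, 3 never triggers a mismatch, while the two machines sharing an ID lose tasks on every other contact.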
----------------------------------------
[Edit 1 times, last edit by Former Member at May 22, 2020 5:33:45 PM]
[May 22, 2020 5:32:30 PM]