| World Community Grid Forums
Thread Status: Active Total posts in this thread: 10
Marcus.TheOriginal
Cruncher Joined: Mar 31, 2020 Post Count: 4 Status: Offline Project Badges:
|
Hi there,
last weekend I deployed two new servers to run some tasks for wcg and rosetta. I've deployed the machines, installed the boinc-client, copied the two xml config files from existing and working servers, attached the project and left the servers doing their job. Today I've seen that many WUs have the status "detached". What does it mean? And what could be the cause? To me the configuration seems correct (I've copied it from other servers where it still works absolutely fine).

[Screenshot from Result Status page]

Here's some (cropped) boinc status:

root@boinc5:~# boinccmd --get_state
======== Projects ========
1) -----------
   name: World Community Grid
   master URL: http://www.worldcommunitygrid.org/
   [...]
   nrpc_failures: 0
   master_fetch_failures: 0
   master fetch pending: no
   scheduler RPC pending: no
   trickle upload pending: no
   attached via Account Manager: no
   ended: no
   suspended via GUI: no
   don't request more work: no
   disk usage: 0.000000
   last RPC: Tue May 19 19:25:08 2020
   project files downloaded: 1589731973.282394
   [...]
   jobs succeeded: 7
   jobs failed: 1
   elapsed time: 57273.015918
2) -----------
   name: Rosetta@home
   master URL: http://boinc.bakerlab.org/rosetta/
   [...]
   nrpc_failures: 0
   master_fetch_failures: 0
   master fetch pending: no
   scheduler RPC pending: no
   trickle upload pending: no
   attached via Account Manager: no
   ended: no
   suspended via GUI: no
   don't request more work: no
   disk usage: 0.000000
   last RPC: Wed May 20 11:02:28 2020
   project files downloaded: 0.000000
   [...]
   jobs succeeded: 5
   jobs failed: 0
   elapsed time: 174923.215097

======== Applications ========
1) -----------
   name: opn1
   Project: World Community Grid
2) -----------
   name: rosetta
   Project: Rosetta@home

======== Application versions ========
1) -----------
   project: World Community Grid
   application: opn1
   platform: x86_64-pc-linux-gnu
   version: 7.17
   estimated GFLOPS: 3.77
   filename: wcgrid_opn1_autodock_7.17_x86_64-pc-linux-gnu
[...]

======== Workunits ========
1) -----------
   name: OPN1_0000259_12304
   FP estimate: 3.330834e+13
   FP bound: 1.332334e+15
   memory bound: 239.13 MB
   disk bound: 477.42 MB
[... about 9 more for both projects]

======== Tasks ========
1) -----------
   name: OPN1_0000259_12304_0
   WU name: OPN1_0000259_12304
   project URL: http://www.worldcommunitygrid.org/
   received: Mon May 18 12:58:42 2020
   report deadline: Mon May 25 12:58:41 2020
   ready to report: no
   state: uploading
   scheduler state: uninitialized
   active_task_state: UNINITIALIZED
   app version num: 0
   resources: 1 CPU
   final CPU time: 7855.520000
   final elapsed time: 7889.734088
   exit_status: 0
   signal: 0
[9 more entries alike]

======== Time stats ========
   now: 1590002989.133758
   on_frac: 0.999983
   connected_frac: -1.000000
   cpu_and_network_available_frac: 0.999966
   active_frac: 0.999966
   gpu_active_frac: 0.999966
   client_start_time: Sun May 17 18:12:41 2020
   previous_uptime: 214.382298
   session_active_duration: 271018.599318
   session_gpu_active_duration: 271018.599318
   total_start_time: Sun May 17 18:08:50 2020
   total_duration: 271229.818201
   total_active_duration: 271189.574691
   total_gpu_active_duration: 271189.574691

Server config:
   AMD EPYC Processor (with IBPB) [Family 23 Model 1 Stepping 2] 2 Cores
   Ubuntu 18.04.4 LTS [4.15.0-99-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1)]
   Ram 1944.91 MB

In contrast, the config of the servers without problems:
   Intel Xeon Processor (Skylake, IBRS) [Family 6 Model 85 Stepping 4] 1 Core
   Ubuntu 18.04.4 LTS [4.15.0-91-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1)]
   Ram 1945.09 MB
and
   Intel Xeon Processor (Skylake, IBRS) [Family 6 Model 85 Stepping 4] 2 Cores
   Ubuntu 18.04.4 LTS [4.15.0-88-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1)]
   Ram 7786.51 MB

All three setups are cloud instances. Any tips to identify and eliminate the problems? Thank you :) |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
"copied the two xml config files from existing and working servers"
cc_config and app_config? Does WCG disappear from these clients? If so, are you using BAM?
----------------------------------------
[Edit 1 times, last edit by Former Member at May 20, 2020 8:04:30 PM] |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
My bet is, you have two machines with the same hostid... The server is probably confused.
On new setups I wouldn't copy the cc_config. Let the new instance create it. If you are creating new instances, why copy the config files? The only reason to copy would be if you were moving an instance to another machine.
----------------------------------------
[Edit 1 times, last edit by Former Member at May 20, 2020 8:17:14 PM] |
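One way to test the duplicate-hostid guess would be to compare the host IDs that two clients have stored. The sketch below is only an illustration. It assumes that client_state.xml in the BOINC data directory holds one <project> element per attached project with <master_url> and <hostid> children, and that the file parses as well-formed XML; neither is confirmed in this thread. The function names and the idea of comparing two files side by side are made up for this example.

```python
import xml.etree.ElementTree as ET


def project_host_ids(client_state_xml):
    """Map each project's master URL to the host ID stored for it.

    Assumption: the state file contains <project> elements with
    <master_url> and <hostid> children.
    """
    root = ET.fromstring(client_state_xml)
    ids = {}
    for project in root.iter("project"):
        url = (project.findtext("master_url") or "").strip()
        host_id = (project.findtext("hostid") or "").strip()
        if url:
            ids[url] = host_id
    return ids


def shared_host_ids(state_a, state_b):
    """Return the projects for which both state files carry the same host ID."""
    ids_a = project_host_ids(state_a)
    ids_b = project_host_ids(state_b)
    return {
        url: host_id
        for url, host_id in ids_a.items()
        # An empty or zero ID is treated as "not assigned yet" (assumption).
        if host_id not in ("", "0") and ids_b.get(url) == host_id
    }
```

To use it, read client_state.xml from two machines into strings and pass both to shared_host_ids; a non-empty result would support the duplicate-ID theory.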
||
|
|
Marcus.TheOriginal
Cruncher Joined: Mar 31, 2020 Post Count: 4 Status: Offline Project Badges:
|
Ah, sorry, nope, I did not copy a file named "cc_config". I copied /var/lib/boinc-client/account_boinc.worldcommunitygrid.org_wcg.xml
which contains the authenticator-url. Same I did with the rosetta config. I configured the other machines the exactly same way, only on "boinc1" I added these files manually. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Actually, thinking back on it, the client doesn't create cc_config. The only time I have encountered detached work units was when I tried to start a second client on a host and forgot to set allow_multiple_clients in the cc_config. Did you copy anything other than the file you specified? Specifically, like the client_state.xml?
|
||
|
|
Marcus.TheOriginal
Cruncher Joined: Mar 31, 2020 Post Count: 4 Status: Offline Project Badges:
|
nope, I didn't
|
||
|
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7846 Status: Offline Project Badges:
|
This may or may not be helpful, but the only time I have had "detached" work units is when a hard drive or machine has become unusable. Any work units which had been assigned were lost; the servers could no longer find them. In essence, they ceased to exist even though they had been issued. I would guess the equivalent would be to uninstall boinc even though there is an existing cache of work units to be processed.
Hope this helps. Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers* |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
"Ah, sorry, nope, I did not copy a file named "cc_config". I copied /var/lib/boinc-client/account_boinc.worldcommunitygrid.org_wcg.xml which contains the authenticator-url. Same I did with the rosetta config. I configured the other machines the exactly same way, only on "boinc1" I added these files manually."
By doing this you've created additional machines with the same ID, which confuses the server when it assigns work: it removes the work each time and then sends new work, and this repeats whenever the other duplicate client contacts the server. You need to 'detach'/remove WCG from these clients and then re-add it to create unique IDs. The problem should go away.
Is it because these are headless Linux machines that you do this? The command line to attach projects is not too difficult, something like boinccmd --project_attach
The manual: https://boinc.berkeley.edu/wiki/Boinccmd_tool
[Edit 2 times, last edit by Former Member at May 22, 2020 5:15:23 PM] |
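The detach and re-attach step could be scripted for several headless machines. The following is a minimal sketch, not a tested procedure: it uses the boinccmd forms `--project URL detach` and `--project_attach URL auth` from the manual linked in this thread, while the helper names, the dry-run switch and the `YOUR_ACCOUNT_KEY` placeholder are invented for illustration. Detaching a project discards the tasks the client currently holds for it, so it defaults to printing the commands instead of running them.

```python
import shlex
import subprocess


def reattach_commands(project_url, account_key):
    """Build the two boinccmd invocations: detach first, then attach again."""
    return [
        ["boinccmd", "--project", project_url, "detach"],
        ["boinccmd", "--project_attach", project_url, account_key],
    ]


def reattach(project_url, account_key, dry_run=True):
    """Return the command lines; run them only when dry_run is False."""
    lines = []
    for cmd in reattach_commands(project_url, account_key):
        lines.append(shlex.join(cmd))
        if not dry_run:
            # check=True stops at the first failing command.
            subprocess.run(cmd, check=True)
    return lines


if __name__ == "__main__":
    # YOUR_ACCOUNT_KEY is a placeholder, not a real key.
    for line in reattach("http://www.worldcommunitygrid.org/", "YOUR_ACCOUNT_KEY"):
        print(line)
```

Run on each affected host, this would let the server issue a fresh ID for that machine on its next contact, which is the fix described above.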
||
|
|
Marcus.TheOriginal
Cruncher Joined: Mar 31, 2020 Post Count: 4 Status: Offline Project Badges:
|
Thank you, lavaflow
I read this in some manual: "create the xml, then attach the project". I've done so on several machines without fuss. I've detached and re-attached the project via boinccmd. Now it seems to work: 8 WUs yesterday, all valid. My host "boinc4", which had identical config and problems, still produces "detached" WUs, so I will re-attach the project there as well. Thank you guys for your help! |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
One of the safeties is a connection counter which is incremented by one on each connection. If a device connects and its counter differs from what the server has on record for that device, the tasks are ditched: integrity is lost, and the server can't trust that someone is not cooking the books.
Glad that was solved. Happy crunching
----------------------------------------
[Edit 1 times, last edit by Former Member at May 22, 2020 5:33:45 PM] |
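The counter check described above can be pictured with a toy model. This is a simplified sketch of the idea as worded in this thread, not the actual BOINC scheduler logic: the class names, the "expected = last + 1" rule and the host ID 42 are all assumptions made for the example.

```python
class ToyScheduler:
    """Remembers the last connection counter seen for each host ID."""

    def __init__(self):
        self.last_seen = {}  # host_id -> last counter value reported

    def contact(self, host_id, counter):
        last = self.last_seen.get(host_id)
        # The server adopts whatever the client just reported.
        self.last_seen[host_id] = counter
        if last is not None and counter != last + 1:
            return "mismatch"  # in the thread's terms: tasks get ditched
        return "ok"


class ToyClient:
    """Increments its own counter by one on every contact."""

    def __init__(self, host_id):
        self.host_id = host_id
        self.counter = 0

    def contact(self, scheduler):
        self.counter += 1
        return scheduler.contact(self.host_id, self.counter)


if __name__ == "__main__":
    scheduler = ToyScheduler()
    original = ToyClient(host_id=42)
    clone = ToyClient(host_id=42)  # second machine presenting the same ID
    for client in (original, original, original, clone, original, clone):
        print(client.counter + 1, client.contact(scheduler))
```

In this model two machines sharing one ID keep invalidating each other's counter on every alternate contact, which matches the repeating "detached" pattern reported earlier, while a machine with its own ID never trips the check.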
||
|
|
|