World Community Grid - View Thread - Re-imaging systems and stopping job download



schepers
Advanced Cruncher
Canada
Joined: Oct 11, 2006
Post Count: 85

Re-imaging systems and stopping job download

I hope this question is in the right section. It's a bit complex and is specific to my installation.

I have two labs of PCs (42 total) that have been happily crunching for WCG for many years. But I've always had a few problems that I've never seen a solution for. I use SCCM to deploy my lab image.

1. I need to stop all job downloads occasionally for all PCs. I see no way from the website to do this for all machines at once. For now I go around and manually tell each client to not request more jobs, so it will work its way through the queue and sit idle. As you can imagine, this is not something I like to do 42 times.

2. I re-image these machines 3 times a year. For now BOINC is stored on a separate partition (D:) which is not wiped/removed. However, I really want to unify my lab deployment with all the others which do repartition which means BOINC will be stored on C:. Unless I copy the BOINC app and data folders before I reimage, I will lose both the active/waiting/unreported jobs and also my machine identifier in BOINC. If I re-install the BOINC client without restoring these folders, then my machines will appear new to WCG and it will appear my machine count is much higher than it really is.

3. Replacing a failed hard disk causes the most problems. Jobs are lost, and so is the WCG identifier unless I can still read it and copy these off. I have an upcoming task of replacing all the HDDs with SSDs, and so I will have to backup/restore all the BOINC folders manually.

Any solutions? Does someone know how to backup/restore the BOINC folders in the SCCM task sequence?

[Aug 16, 2019 2:47:17 AM]

Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7846

Re: Re-imaging systems and stopping job download

I have a partial solution to #1.
Since the supply of Help Stop TB is so limited, put all the machines on one profile. Select only Help Stop TB. Given the paucity of units there, you will get very few, if any. The machines can work down their queues and when they get to nothing, cut off internet access. You will have a few machines with a few HST units, but probably not many. You can abort any remaining HST units if you do not want to actually process them.
Good luck
Cheers

----------------------------------------

Sgt. Joe
*Minnesota Crunchers*

[Aug 16, 2019 3:19:03 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0


Re: Re-imaging systems and stopping job download

Disclaimer: I'm a Linux guy, but your questions are sort of OS agnostic in a way.

What you generally want is to add to your SCCM another (complementary) automation tool - Ansible is very popular, and I was able to easily find many Google hits for "SCCM and Ansible" which will explain it better than I can, like this one: https://opensource.com/article/19/2/ansible-windows-admin

1. Using a tool such as Ansible, you have a playbook which logs into your worker machines for you and runs the command-line tool "boinccmd --project https://www.worldcommunitygrid.org nomorework" so that each client finishes its existing work and does not request more (same as the GUI clicks).

1a. In order to check they are done/empty ("bled off"), you can run another automation a day later to log into the machines and use "boinccmd --get_tasks" to ensure they're 0 before you move on to re-imaging.
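As a rough illustration of steps 1 and 1a, here is a minimal Python sketch that drives boinccmd against a list of machines over BOINC's own remote RPC instead of logging in with Ansible. It is only a sketch under assumptions: each client must already accept remote GUI RPC connections, you must know its RPC password, and the task count relies on every task in the "--get_tasks" output having one "WU name:" line. The helper names are made up for this example.

```python
import subprocess

WCG_URL = "https://www.worldcommunitygrid.org"  # project URL as quoted in the post


def boinccmd_args(host, password, *operation):
    """Build the argument list for one remote boinccmd call (runs nothing)."""
    return ["boinccmd", "--host", host, "--passwd", password, *operation]


def count_tasks(get_tasks_output):
    """Count tasks in `boinccmd --get_tasks` output.

    Assumption: each task block contains exactly one 'WU name:' line.
    """
    return sum(
        1
        for line in get_tasks_output.splitlines()
        if line.strip().startswith("WU name:")
    )


def run(args):
    """Run one boinccmd command and return its standard output."""
    return subprocess.run(args, capture_output=True, text=True, check=True).stdout


def stop_new_work(hosts, password):
    """Step 1: tell every client to finish its queue and fetch nothing new."""
    for host in hosts:
        run(boinccmd_args(host, password, "--project", WCG_URL, "nomorework"))


def hosts_still_busy(hosts, password):
    """Step 1a: return the hosts that still have at least one task queued."""
    return [
        host
        for host in hosts
        if count_tasks(run(boinccmd_args(host, password, "--get_tasks"))) > 0
    ]
```

Calling stop_new_work(["lab-pc-01", "lab-pc-02"], "secret") and, a day later, hosts_still_busy(...) with the same arguments would mirror the two playbook runs described above; the host names and password here are placeholders.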

2/3: in the BOINC data folder, the magic file that matters the most is "client_state.xml" -- it's always changing because the tasks are stored in there, but what it has is the "CPID" of that machine. You can actually install a fresh BOINC, start it once (to prepare a new data directory), turn it off, then replace the new auto-generated value in client_state.xml with the one you had before and it'll pop right back into place. Example of what the top of it looks like:

<client_state>
<host_info>
    <host_cpid>06a6891138b9872877b824116bbfc958</host_cpid>
    ... lots more stuff...

Takeaway: keep a basic spreadsheet/document/etc. of each machine's "host_cpid" value as disaster recovery; with this data you can restore the connections. Better: save <boinc data dir>/*.xml as a zip file if you can, the rest doesn't matter. It's just a simple XML file and it's very easy to get this data off of each machine that's running. You can do it right now and get started. :)
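For the spreadsheet idea, a small Python sketch along these lines could pull the value out of client_state.xml and append it to a CSV inventory. It assumes the file parses as well-formed XML and has the <client_state>/<host_info>/<host_cpid> layout shown above; the function names and the CSV layout are made up for the example.

```python
import csv
import xml.etree.ElementTree as ET


def read_host_cpid(client_state_path):
    """Return the <host_cpid> text from client_state.xml, or None if missing."""
    root = ET.parse(client_state_path).getroot()
    node = root.find("./host_info/host_cpid")
    if node is None or not node.text:
        return None
    return node.text.strip()


def append_to_inventory(inventory_csv, hostname, cpid):
    """Append one 'hostname,cpid' row to the inventory file (hypothetical layout)."""
    with open(inventory_csv, "a", newline="") as handle:
        csv.writer(handle).writerow([hostname, cpid])
```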

Note: I prefer to save <data dir>/*.xml as a backup for each machine - not all the sub directories or work or other stuff, just the top level set of XML files. There's one for your account manager login, one for the client state, one for the WCG project, etc. - it's a small, light zipfile to save on your backup host (less than 200k, I just tested one of mine). If you have just these XML files you have everything you need to rescue your BOINC workers from complete disaster.
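The "top-level XML files only" backup can be sketched in a few lines of Python. This is just one way to do it, not a tested recovery tool: the function name is invented, and the data directory and zip destination are whatever applies on your machines.

```python
import zipfile
from pathlib import Path


def backup_top_level_xml(data_dir, zip_path):
    """Zip only the *.xml files directly inside data_dir.

    Sub-directories (project files, slots, work) are deliberately skipped.
    Returns the list of file names that were archived.
    """
    archived = []
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for xml_file in sorted(Path(data_dir).glob("*.xml")):
            if xml_file.is_file():
                archive.write(xml_file, arcname=xml_file.name)
                archived.append(xml_file.name)
    return archived
```

It is probably safest to run this while the BOINC client is stopped, so that client_state.xml is not being rewritten during the copy.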

[Aug 16, 2019 1:36:23 PM]

katoda
Senior Cruncher
Poland
Joined: Apr 28, 2007
Post Count: 172

Re: Re-imaging systems and stopping job download

Interesting explanations, but I would like to make sure they are correct. In all my years with WCG I learned that there are two important things in client_state.xml: hostid (the identifier of the machine, which the OP would like to preserve) and rpc_seqno (the number of connection attempts from client to server). From my experience, if you do not preserve rpc_seqno (e.g. you set it to 0 or any value lower than the actual number) and just replace hostid, it will not work: the server knows that on its side rpc_seqno is higher, assumes that you are launching an outdated client (because the current one should have the same or a higher number of connection attempts) and eventually registers a new device with a new hostid.
Now, after reading your explanation, it looks like all this mess could be avoided by just keeping the host_cpid value. Does it mean that I can, as you wrote, launch BOINC on an empty (= not registered) machine, replace host_cpid and boom, everything (including hostid) will be automatically taken from the server and no new device will be registered? Did you try it in the past and did it work, so you checked it "in the wild"?

I prefer to keep all the data from the BOINC directory as well, but sometimes the remote machine is lost and I do not have any recent backup of client_state.xml, so to "revive" the lost host I had to manually increase rpc_seqno, and it always worked. Once I replaced the BOINC directory with a backup made just one minute earlier, where rpc_seqno was 2 less than the actual value, and it was rejected and a new device was created.
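The manual rpc_seqno bump described here could be scripted roughly as follows. This is a sketch under assumptions, not a verified procedure: it assumes client_state.xml holds a <project> element with <master_url> and <rpc_seqno> children, that the file parses as well-formed XML, and that BOINC is stopped while the file is edited. The function name and the "minimum" value you pass in are your own choice; the script has no way of knowing what number the server currently holds.

```python
import xml.etree.ElementTree as ET


def bump_rpc_seqno(client_state_path, master_url, minimum):
    """Raise <rpc_seqno> for one project to at least `minimum`.

    Returns the value written. Never lowers an existing value.
    Assumed layout: <project><master_url>..</master_url><rpc_seqno>..</rpc_seqno></project>
    """
    tree = ET.parse(client_state_path)
    for project in tree.getroot().iter("project"):
        url = (project.findtext("master_url") or "").strip()
        if url.rstrip("/") != master_url.rstrip("/"):
            continue
        node = project.find("rpc_seqno")
        if node is None:
            raise ValueError("project has no <rpc_seqno> element")
        new_value = max(int(node.text), minimum)
        node.text = str(new_value)
        tree.write(client_state_path)  # note: rewrites the whole file
        return new_value
    raise ValueError(f"no project with master_url {master_url!r}")
```

Keep a copy of the original file before trying this, since rewriting it through an XML library may change its formatting.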

----------------------------------------

[Aug 16, 2019 2:28:56 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0


Re: Re-imaging systems and stopping job download

Yes, I have tested in the wild - I've replaced devices with new devices using the same hostname and done exactly as written with the CPID. I have not run into the issue you outline but can understand how it may occur, cool - so add that step to what I wrote to be sure. It can't hurt - and if you'd just saved and restored client_state.xml as an entire file you'd not have to worry about it. I have not had a situation where the drive died (true disaster) - my usage is always to bleed off the clients first which results in a pretty clean XML payload for restores.

As noted, it is preferable to me to save all the top-level XML files and restore them as complete items which would avoid what you're talking about naturally - as I use BAM! to manage my machines, I restore *.xml as a more-common step so that I don't have to go through reconnecting them to the account manager by hand. It's a more holistic disaster recovery plan, IMHO.

[Aug 16, 2019 3:43:55 PM]

PMH_UK
Veteran Cruncher
UK
Joined: Apr 26, 2007
Post Count: 786

Re: Re-imaging systems and stopping job download

Instead of "no new work" you could just set the profile(s) to 0.01 days, then shut down BOINC and copy its data when ready.
That way only a few tasks are lost, which the grid allows for, rather than having idle systems.
Also consider putting BOINCTasks on a PC to more easily manage the rest.
It can now discover PCs on local network so even easier.

Paul.

----------------------------------------

Paul.

[Aug 16, 2019 6:37:31 PM]

katoda
Senior Cruncher
Poland
Joined: Apr 28, 2007
Post Count: 172


Re: Re-imaging systems and stopping job download

Yes, I have tested in the wild - I've replaced devices with new devices using the same hostname and done exactly as written with the CPID.

Just another question, hopefully the last one: did you do such replacement on the same machine (e.g. after reformatting or reimaging the OS) or did you use host_cpid on a totally different hardware?

----------------------------------------

[Aug 16, 2019 7:30:01 PM]

schepers
Advanced Cruncher
Canada
Joined: Oct 11, 2006
Post Count: 85


Re: Re-imaging systems and stopping job download

Thanks for the replies to my question. To be honest, I was disappointed but not surprised that there doesn't seem to be a simple server-based way to stop the job download. The website would seem to be the place to do this for large numbers of systems.

I'm not surprised about the computer ID as it's only local in the XML and needs to be backed up and restored in order to keep it.

What I came up with was a PowerShell script that handles the backup and restore of my BOINC folders during the imaging task sequence. It is still a work in progress but seems to function. The first step in the task sequence runs my PS script that backs up both BOINC folders (using compression) to a network share (if the BOINC folders exist). Later in the TS the script runs again, restores the folders, BOINC is reinstalled and the PC is rebooted to get BOINC functional.

The compressed files are on average about 300 MB per PC, taking about 13 GB total. It's not a bad solution: the systems can now be formatted and repartitioned, and I don't lose any jobs.

Note this solution also works when the hard disk needs replacing (or swapping for an SSD) or when replacing the PCs with new ones that use the same PC names. Just back up the folders manually to the network share before reimaging and nothing is lost.
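The actual script is PowerShell and is not shown in this thread, so the following is only a Python sketch of the same backup-then-restore idea: compress a BOINC folder to a share if it exists, and unpack it again after imaging. The function names, the archive naming scheme and the paths are all assumptions for illustration.

```python
import shutil
from pathlib import Path


def backup_folder(folder, share_dir, pc_name):
    """Compress `folder` to <share_dir>/<pc_name>-<folder name>.zip.

    Returns the archive path, or None when the folder does not exist
    (for example on a PC that never had BOINC installed).
    """
    folder = Path(folder)
    if not folder.is_dir():
        return None
    base = Path(share_dir) / f"{pc_name}-{folder.name}"
    return Path(shutil.make_archive(str(base), "zip", root_dir=str(folder)))


def restore_folder(archive, target):
    """Unpack a backup archive into `target`, creating the folder if needed."""
    Path(target).mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(archive), str(target), "zip")
```

In a task sequence this would be called once per BOINC folder before the disk is wiped, and once per archive after the new image is applied and before BOINC is reinstalled and the PC rebooted.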

[Sep 11, 2019 1:19:34 PM]

hchc
Veteran Cruncher
USA
Joined: Aug 15, 2006
Post Count: 865


Re: Re-imaging systems and stopping job download

I like @xithryx's solution for #1, coupling an automation tool such as Ansible (or Puppet, Chef, etc.) with SCCM that allows you to remotely (and securely, within your organization's policies of course) execute commands. In a past job in a Windows XP ecosystem, I used the PsExec tool in a lot of my batch files for remote application installations and upgrades since my company didn't have SCCM at the time. It worked great, but using PsExec and the Ps suite of tools uses Remote Procedure Calls that may or may not be allowed in many organizations' security policies, even if they're semi-popular Microsoft-ish tools. Plus I'm not sure they still work in Windows 10. There might be a way to get a remote shell using PowerShell too, if you don't want to learn Ansible/Puppet/Chef. This would allow you to create a PowerShell script that, for example, says "For Each object in [text file or csv or spreadsheet with all the hostnames of your lab computers], go through them and run boinccmd.... etc"

But to @katoda's point, I've also encountered that with a device re-installation on WCG. WCG doesn't use the standard BOINC process of matching based on Device ID or Host ID in the client_state.xml file.

From https://boinc.berkeley.edu/wiki/Host_identification_and_merging

Alternate identification method at World Community Grid

WCG uses a different method to recognize existing devices to prevent duplicate registrations. The server compares the following host information:

user name (network name of host)
domain_name (The default on Windows is "WORKGROUP")
ip_addr (the ip of the client on the local network)
operating system name
processor vendor
memory

The most recent record that matches these attributes (if found) will be re-used. It will cancel any results currently assigned to the client, and then issue new work. This is because a user might be trying to clear out some work that was causing some form of trouble. If any of this information is hidden through for instance setting the <suppress_net_info> flag in the cc_config.xml file suppressing the IP address or domain_name, the method fails and will create a new device registration.

I recently reinstalled the OS on a computer (in this case, Debian 10) and took a chance by not even bothering with the BOINC data directory. I made sure that the OS version, hostname, domain and local IP address (from DHCP) were exactly the same, and WCG recognized the device as the older device and matched it. On a Windows 10 Pro reinstall I tried messing with the client_state.xml file instead, but WCG rejected it and basically created a new Device/Host ID, so my OCD is forever tweaked. :P Lesson learned: just understand how WCG matches up machines and use that system instead.

When you re-image your lab computers, do you keep the same hostname? Like, some naming scheme based on the Asset Tag or Service Tag or whatever. If you do, and if you can somehow make sure the machine gets assigned the same IP address via DHCP (just to be extra safe) and if the OS version is the same (i.e., I'm worried that imaging from Windows 10 1903 to 1909 for example might be enough to make WCG think it's a new device), that machine should match up with WCG and download its old device profile.
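One way to sanity-check this before re-imaging is to record the attributes from the wiki list for each machine and compare them afterwards. The Python sketch below only does the comparison. The dictionary keys are illustrative names for the six listed attributes, not fields read from BOINC or from the WCG server, and how you collect the values is up to you.

```python
# Illustrative key names for the attributes the wiki page lists.
MATCH_FIELDS = (
    "host_name",
    "domain_name",
    "ip_addr",
    "os_name",
    "processor_vendor",
    "memory",
)


def changed_fields(before, after, fields=MATCH_FIELDS):
    """Return the names of attributes whose values differ between two snapshots."""
    return [name for name in fields if before.get(name) != after.get(name)]


if __name__ == "__main__":
    old = {
        "host_name": "LAB-PC-01",
        "domain_name": "WORKGROUP",
        "ip_addr": "192.168.1.50",
        "os_name": "Microsoft Windows 10",
        "processor_vendor": "GenuineIntel",
        "memory": 8589934592,
    }
    new = dict(old, ip_addr="192.168.1.77")  # same PC, different DHCP lease
    print(changed_fields(old, new))
```

An empty result would suggest the machine still looks the same on these six points; any name in the list is something to fix (or at least be aware of) before the client first contacts the server.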

I don't see an issue with installing BOINC in C:\Program Files\ in the future instead of D:\. Could even make a BOINC package in SCCM for 7.14.2 (and 7.16 which is coming out soon).

----------------------------------------

i5-7500 (Kaby Lake, 4C/4T) @ 3.4 GHz
i5-4590 (Haswell, 4C/4T) @ 3.3 GHz
i5-3570 (Ivy Bridge, 4C/4T) @ 3.4 GHz

----------------------------------------
[Edit 2 times, last edit by hchc at Sep 13, 2019 6:05:05 AM]

[Sep 13, 2019 6:02:35 AM]

schepers
Advanced Cruncher
Canada
Joined: Oct 11, 2006
Post Count: 85


Re: Re-imaging systems and stopping job download

I back up both the BOINCAPP and BOINCDAT folders (mine are separate) and restore them, then install a new client. This way the XML file(s) are all still with the project. Yes, my lab machines always have the same name, IP, domain, etc. But using the XML always seems to prevent extra registrations. I also created a package in SCCM and deploy BOINC that way. If the package doesn't find a D: drive it assumes C:\ (not Program Files).

[Jan 3, 2020 6:40:47 PM]
