World Community Grid Forums
Thread Status: Active Total posts in this thread: 12
Synapp.IO
Cruncher United States Joined: Sep 16, 2017 Post Count: 18 Status: Offline Project Badges:
|
Hello all,
I'm a little bit new to this, so please point me in the right direction if this isn't the place to ask.

I'm running BOINC (attached to World Community Grid) on a cluster of machines on GCE (Google Compute Engine). They run as "preemptible VMs" in a managed instance group, since this is the most cost-effective way to run computations on GCE (and I want to donate as much computation as possible for the amount of money I have).

How this works: I have a disk image (Ubuntu + BOINC client configured) and GCE starts as many instances as it has capacity for in the given data center (up to a limit set by me). If it doesn't have extra capacity, it stops some or all of the VMs and starts new ones once capacity frees up. It also stops any instance after a maximum of 24 hours (even if excess capacity is available) and starts a new one to take its place.

The (potential) issue with this is that new "devices" keep being added to my WCG account.

Questions:
- Is this a potential problem? I wouldn't want to create a problem for the WCG database by adding thousands of devices that will never be used again.
- If this is a potential problem, how can it be resolved? Can I somehow remove old devices from the WCG database?
- I tried to set the same hostname on all the VMs, thinking that WCG might consider them the same device, but I think WCG still somehow knows that they are different devices.

Thanks,
Attila
||
|
|
SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline |
On its initial project-add connection, each client creates a unique identifier, which is then reused on all subsequent connections by the same client. The host name is irrelevant to that: one can have as many clients with the same name as one wants.

It is not recommended to run these clients with anything other than a zero buffer setting, together with the report_results_immediately flag set in cc_config.xml (see the wiki manual). This ensures no finished tasks remain unreported when the instance is ended.

I seem to remember there's a way to save these instances and then reload them as and when the hardware becomes available again. Someone who provided rendering services did this and just loaded them back in. I suspect this needs identical hardware, or maintaining an index of which instance goes to which specific box.

Cloning is not advised; it will in fact cause conflicts, leading to random dumping of tasks and fetching of new ones.
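As a rough illustration of the two settings mentioned above, here is a minimal Python sketch that builds the XML for both. The report_results_immediately option is the one named in the thread; expressing the zero buffer through work_buf_min_days and work_buf_additional_days in a global_prefs_override.xml file is an assumption on my part (the buffer can also be set in the device profile on the website), and the helper names are purely illustrative.

```python
import xml.etree.ElementTree as ET


def build_cc_config(options):
    """Return a <cc_config> element whose <options> section holds `options`."""
    root = ET.Element("cc_config")
    opts = ET.SubElement(root, "options")
    for name, value in options.items():
        ET.SubElement(opts, name).text = str(value)
    return root


def build_prefs_override(min_days=0, extra_days=0):
    """Return a <global_preferences> element asking for no work buffer.

    Assumption: the buffer is overridden locally instead of on the website.
    """
    root = ET.Element("global_preferences")
    ET.SubElement(root, "work_buf_min_days").text = str(min_days)
    ET.SubElement(root, "work_buf_additional_days").text = str(extra_days)
    return root


# Serialise both documents; on a real instance they would be saved as
# cc_config.xml and global_prefs_override.xml in the BOINC data directory.
cc_config_xml = ET.tostring(
    build_cc_config({"report_results_immediately": 1}), encoding="unicode"
)
prefs_xml = ET.tostring(build_prefs_override(), encoding="unicode")

print(cc_config_xml)
print(prefs_xml)
```

The client only picks these files up when it starts or is told to re-read its configuration, so on a disk image they would probably be written before the client is launched.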
||
|
|
Synapp.IO
Cruncher United States Joined: Sep 16, 2017 Post Count: 18 Status: Offline Project Badges:
|
Hello,
Thank you for the quick answer. Unfortunately I'm still a little bit lost.

If I understand correctly, you're saying that the image from which Google Cloud creates new instances (the "master"/"gold" image) should not contain an already set-up BOINC, but rather that I should link BOINC with WCG the first time such an instance starts up. Also, that I should change "report_results_immediately" to 1. These are great suggestions and I shall implement them promptly. In fact, this may explain why I'm seeing fewer results from these machines than I was expecting.

You are also correct that these instances "remain intact" after being stopped and could be re-used, as per the documentation (https://cloud.google.com/compute/docs/instanc...ible#preemption_process):

"""
Preempted instances still appear in your project, but you are not charged for the instance hours while it remains in a TERMINATED state. You can access and recover data from any persistent disks that are attached to the instance, but those disks still incur storage charges until you delete them.
"""

The trick is that I'm using preemptible instances together with Managed Instance Groups. This acts like a watchdog / scheduler, i.e. I set the number of instances I want, and if instances die, it tries to start new ones to take their place. I don't think the scheduler re-uses TERMINATED instances, but rather starts fresh ones (although I will try to clarify this with Google Support). So re-using older instances is not a very viable option.

Do you have any idea:
- whether having many devices associated with an account would be a problem for the WCG database? (Just a quick calculation: if I run 100 instances for one month, restarting every 24 hours, that would be roughly 3000 devices.)
- whether there is a way to permanently remove devices from the WCG database, or perhaps to "merge" multiple devices into one (even if it's after the fact; for example, I could try to periodically merge old instances into one)?
||
|
|
SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline |
I don't think registering 3K devices creates a problem. The way I'm reading your observations, it does imply that the clients are not re-used. Each time they are ended, the tasks in progress (threads per instance, times 100 instances) are left in limbo and are not resumed by the same client. With, say, 16 threads per instance, that is 1,600 tasks per day that will never be completed, so 48K tasks per month will turn No Reply.

There's code to abort unfinished tasks on shutdown and force a communication to tell the project these need immediate reassignment. But "You are also correct that these instances "remain intact" after being stopped and they could be re-used as per the documentation" suggests the instances are being restored, and with them the installed BOINC clients, i.e. the tasks in progress would resume. I'm not sure, though, how these task restarts will behave in such an environment; they could resume from an intermediate checkpoint or crash. Trial and error in that case.

No, there's no merge or delete facility. Creating/registering a new client with a unique number is permanent, as opposed to other projects that do have a merge facility, provided the old and new device/client credentials are identical.
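The back-of-the-envelope numbers in this exchange can be reproduced with a few lines of Python. This is only a sketch under the thread's own assumptions: 100 instances, each replaced once every 24 hours, a 30-day month, and the "say 16 threads" example with one in-progress task per thread at shutdown.

```python
INSTANCES = 100            # cluster size quoted in the thread
THREADS_PER_INSTANCE = 16  # the "say 16 threads per instance" example
DAYS_PER_MONTH = 30        # assumed month length
RESTARTS_PER_DAY = 1       # assumed: each instance is replaced once per 24 hours

# Every replacement instance registers as a new device.
devices_per_month = INSTANCES * RESTARTS_PER_DAY * DAYS_PER_MONTH

# Assumption: one task per thread is in progress when an instance ends.
tasks_lost_per_day = INSTANCES * THREADS_PER_INSTANCE * RESTARTS_PER_DAY
tasks_lost_per_month = tasks_lost_per_day * DAYS_PER_MONTH

print(devices_per_month, tasks_lost_per_day, tasks_lost_per_month)
# prints: 3000 1600 48000
```

Preemptions before the 24-hour limit would raise both figures, so these are lower bounds for a cluster that stays at full size.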
||
|
|
Synapp.IO
Cruncher United States Joined: Sep 16, 2017 Post Count: 18 Status: Offline Project Badges:
|
Could you please point me to the "code to abort unfinished tasks on shutdown and force a communication to tell the project these need immediate reassignment"?
Thank you, Attila
||
|
|
SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline |
Use abort_jobs_on_exit, per https://boinc.berkeley.edu/wiki/Client_configuration, which goes into cc_config.xml. The update can be forced by sending an update request through the boinccmd tool: https://boinc.berkeley.edu/wiki/Boinccmd_tool

I have no idea whether the abort triggers a communication by itself. The description suggests so with "If 1, abort jobs and update projects when client exits. Useful on grids where disk gets wiped after each run.", but I guess it needs to be allowed a little time before the instance is shut down.
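The forced update could be scripted roughly as follows, for example from a shutdown hook on the instance. This is a hedged sketch, not a tested recipe: the project URL is an assumption (check what `boinccmd --get_project_status` reports on a real client), the 20-second timeout is arbitrary, and the function names are made up for illustration. The abort_jobs_on_exit option itself is a separate entry in the <options> section of cc_config.xml.

```python
import subprocess

# Assumption: the URL under which the client is attached to the project.
PROJECT_URL = "http://www.worldcommunitygrid.org/"


def update_command(project_url=PROJECT_URL):
    """Build the boinccmd call that asks the client to contact the project."""
    return ["boinccmd", "--project", project_url, "update"]


def request_update(runner=subprocess.run, timeout=20):
    """Send the update request.

    `runner` is injectable so the call can be dry-run without BOINC installed.
    """
    return runner(update_command(), check=False, timeout=timeout)


# Dry run: print the command instead of executing it.
request_update(runner=lambda cmd, **kwargs: print(" ".join(cmd)))
```

On a real instance one would call `request_update()` with the default runner. boinccmd reads the RPC password from gui_rpc_auth.cfg in the current directory, so it may need to run from the BOINC data directory, and the client still needs a little time to finish the exchange before the VM goes away.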
||
|
|
Synapp.IO
Cruncher United States Joined: Sep 16, 2017 Post Count: 18 Status: Offline Project Badges:
|
Thank you again for all your help and advice. I've been crunching now for almost 30 days.
I do have a follow-up question though, if you have the time: 3 days ago I increased my cluster size (because I received confirmation that we can dedicate more resources to it). I see a corresponding increase in my "Results returned" graph, but no increase in the "Total Run Time" or "Points generated" graphs. Any idea why that might be? Based on the BOINC logs, those instances do seem to finish / upload WUs, so I'm not sure what's going on...
||
|
|
SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline |
Your daily history, in numbers, can be viewed at https://www.worldcommunitygrid.org/ms/viewMemberStatHistory.do

I assume all devices/clients are registered to the same member account? At times members have two accounts and suddenly results go off into the blue yonder.

On the Result Status pages you can filter per device to see if and when returned results go valid. At the start of a new client, many results will pass through the Pending Validation state until enough data is collected and the device is rated as reliable, from which point it's allowed to compute tasks alone, with no verification needed by a second party, except for incidental reverification of reliability (Mapping Cancer Markers always requires quorum 2).
----------------------------------------
[Edit 2 times, last edit by SekeRob* at Oct 21, 2017 11:18:39 AM]
||
|
|
Synapp.IO
Cruncher United States Joined: Sep 16, 2017 Post Count: 18 Status: Offline Project Badges:
|
Thanks for the info. On viewMemberStatHistory.do I see:
Date        Total run time   Points generated  Results returned
10/21/2017  0:001:08:19:36           5,884                29
10/20/2017  1:041:13:00:25       1,903,436             9,174
10/19/2017  1:018:20:46:03       1,786,841             9,018
10/18/2017  1:044:02:53:36       1,892,786             9,244
10/17/2017  1:014:04:20:42       1,925,279             6,626
10/16/2017  1:012:07:59:34       2,005,855             7,179
10/15/2017  1:015:00:33:44       1,968,455             6,743
10/14/2017  1:020:07:57:36       1,958,492             6,670
10/13/2017  1:016:20:17:13       1,939,777             6,712

As you can see the "no. results returned" jumps by ~3k on the 18th (due to the increase in cluster size), however "points generated" / "total runtime" stay the same. The quorum explanation for MCM seems like a very plausible explanation. I also checked and I have ~400 WUs "pending verification". So, perhaps once those are verified, the statistics will be updated.

Cheers.
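One way to look at these figures is per returned result rather than per day. The sketch below does that for the rows around the jump on the 18th. It assumes the run-time field is laid out as years:days:hours:minutes:seconds with 365-day years, which the thread does not state, and the function names are only illustrative.

```python
from datetime import timedelta

# Rows copied from the post: date, total run time, points, results returned.
ROWS = """\
10/20/2017 1:041:13:00:25 1,903,436 9,174
10/19/2017 1:018:20:46:03 1,786,841 9,018
10/18/2017 1:044:02:53:36 1,892,786 9,244
10/17/2017 1:014:04:20:42 1,925,279 6,626
10/16/2017 1:012:07:59:34 2,005,855 7,179
"""


def parse_runtime(text):
    """Parse 'y:ddd:hh:mm:ss' (assumed layout, 365-day years) to a timedelta."""
    years, days, hours, minutes, seconds = (int(p) for p in text.split(":"))
    return timedelta(days=years * 365 + days, hours=hours,
                     minutes=minutes, seconds=seconds)


def per_result(line):
    """Return (date, points per result, run-time hours per result)."""
    date, runtime, points, results = line.split()
    count = int(results.replace(",", ""))
    total_points = int(points.replace(",", ""))
    hours = parse_runtime(runtime).total_seconds() / 3600
    return date, total_points / count, hours / count


stats = [per_result(line) for line in ROWS.splitlines()]
for date, points_each, hours_each in stats:
    print(f"{date}  {points_each:7.1f} points/result  {hours_each:5.2f} h/result")
```

If both per-result figures drop on the day the result count jumps, that would suggest the extra results are shorter or lower-credit tasks rather than additional full-length work.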
||
|
|
SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline |
Very strange, and it's not adding up. Only valid results are counted, so a 'Pending' explosion does not seem to be the explanation (400 or so would be normal with your kind of daily result output). Another possibility is a switch in project selection, or a change in their runtimes. I'm not currently up to speed, but WCG projects sometimes have drastic duration changes, so that could explain the big rise in results without it translating into points/runtime.

Do check the Result Status (RS) pages and filter on Invalid. It would not explain the disparity either, as Invalid results give full runtime credit but just half points. Just in case the new instances are producing errors, filter on Errors as well. Something is fishy.

The field to monitor on your My Contribution page is the device installation count. If a device has returned a result, it will be added there. It should then have jumped 3 days ago.
----------------------------------------
[Edit 1 times, last edit by SekeRob* at Oct 21, 2017 4:46:30 PM]