Thread Status: Active
Total posts in this thread: 12
Synapp.IO
Cruncher
United States
Joined: Sep 16, 2017
Post Count: 18
Status: Offline
Re: How to avoid creating many devices for a cluster?

Back again, with some more info but unfortunately no solution :-|

First of all, thank you for your continued support. I guess you're doing this in your free time. If you know of any "official" support channel for WCG, please let me know, since I feel bad about taking up your time.

Now back to the issue at hand:

- The problem persists. It has now been 10 days since the cluster size increase, and while "Results returned" has stabilized at the new level, "Points Generated" and "Total Run Time" have still not changed (which suggests the cause isn't a delay in validating the finished WUs).

- my device installations are constantly increasing (because the instances in the cluster are cleaned up by Google Cloud after 24 hours and replaced with new ones).

- I prefer WUs for MCM and SCC, though I have checked "If there is no work available for the project(s) I have selected above, please send me work from another project." in my profile. So the machines are working mainly on MCM WUs.

- I logged in to a couple of such machines and looked at "boinc.log". From that it seems that it's finishing WUs (after ~20 hours of processing) and uploading them.

- I looked at the "My contribution" page and things look "normal", as far as I can tell. The instances all show points generated / results returned. I also checked the ratio of "points generated" to "results returned", and for all instances it's between 500 and 800 (my laptop is in the same range, for example).

- On the results page I have no "Invalid" results. I do have a lot of results in the "Error" state where it says Status: Detached, but I guess that's normal since it happens whenever a preemptible instance is cleaned up.

My next step is to update the "global_prefs_override.xml" for the cluster so that it downloads fewer WUs in advance (buffering doesn't make sense because the instances have a very good network connection all the time, and hopefully this will also reduce the number of "detached" WUs). In particular:

<suspend_cpu_usage>80.000000</suspend_cpu_usage>
<work_buf_min_days>0.000000</work_buf_min_days>
<work_buf_additional_days>0.000000</work_buf_additional_days>
<max_ncpus_pct>100.000000</max_ncpus_pct>
<disk_max_used_gb>0.000000</disk_max_used_gb>
<disk_max_used_pct>100.000000</disk_max_used_pct>
<disk_min_free_gb>0.000000</disk_min_free_gb>
<cpu_usage_limit>100.000000</cpu_usage_limit>
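
For a cluster image it may be easier to generate the override file from a script than to edit it by hand. Below is a minimal sketch, not an official tool: the tag names and values are the ones listed above, while the `<global_preferences>` root element and the helper name `build_override_xml` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Values taken from the list above. Everything else in this sketch
# (root element name, helper name) is an assumption.
OVERRIDES = {
    "suspend_cpu_usage": 80.0,
    "work_buf_min_days": 0.0,
    "work_buf_additional_days": 0.0,
    "max_ncpus_pct": 100.0,
    "disk_max_used_gb": 0.0,
    "disk_max_used_pct": 100.0,
    "disk_min_free_gb": 0.0,
    "cpu_usage_limit": 100.0,
}


def build_override_xml(prefs):
    """Return the preference tags wrapped in a single root element."""
    root = ET.Element("global_preferences")  # assumed root element
    root.text = "\n   "
    child = None
    for tag, value in prefs.items():
        child = ET.SubElement(root, tag)
        child.text = f"{value:.6f}"  # same six-decimal format as above
        child.tail = "\n   "
    if child is not None:
        child.tail = "\n"  # closing root tag goes on its own line
    return ET.tostring(root, encoding="unicode") + "\n"


if __name__ == "__main__":
    print(build_override_xml(OVERRIDES), end="")
```

The output would be written to "global_prefs_override.xml" in the client's data directory when the image is built.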


Also, I'm adding the following to cc_config.xml to perhaps improve the computation speed a little:

<lower_client_priority>0</lower_client_priority>
<no_priority_change>1</no_priority_change>
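
If the image already ships a cc_config.xml, the two tags can be merged into it instead of replacing the file. The sketch below assumes the usual layout of a `<cc_config>` root with an `<options>` child; the helper name `set_cc_options` and the sample input file are hypothetical.

```python
import xml.etree.ElementTree as ET


def set_cc_options(xml_text, options):
    """Return cc_config XML text with the given <options> values set."""
    root = ET.fromstring(xml_text)
    opts = root.find("options")
    if opts is None:
        # Assumption: option tags live under <cc_config><options>.
        opts = ET.SubElement(root, "options")
    for tag, value in options.items():
        node = opts.find(tag)
        if node is None:
            node = ET.SubElement(opts, tag)
        node.text = str(value)  # overwrite or create the tag
    return ET.tostring(root, encoding="unicode")


# Hypothetical existing file that only sets a log flag.
existing = "<cc_config><log_flags><task>1</task></log_flags></cc_config>"
updated = set_cc_options(
    existing,
    {"lower_client_priority": 0, "no_priority_change": 1},
)

if __name__ == "__main__":
    print(updated)
```

Parsing and rewriting the file this way leaves any other sections already in it untouched.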

I also looked into solutions to persist state between restarts of preemptible VMs, but I didn't find anything straightforward. My biggest stumbling block is the following: let's say I have shared storage with the structure /instances/boinc-1, /instances/boinc-2, etc. How does an instance know, when it comes up, which directory to use?

Again, these changes will probably improve the situation somewhat (a little more processing time and fewer detached WUs), but somehow I don't think they will solve the underlying issue.

I keep hammering away at this because I don't want all that processing power to be spent in vain...
[Oct 28, 2017 12:09:28 PM]
SekeRob
Master Cruncher
Joined: Jan 7, 2013
Post Count: 2741
Status: Offline
Re: How to avoid creating many devices for a cluster?

Don't worry about my time... if this is solved, I'll get a triple platinum support rating and that WCG mug which never ever arrived before ;O)))))

Strictly speaking, if you set the website device profile(s) to the values you want to plug into the override file, then you would not need to mess with the client prefs/install image. They'd get fetched from the project on first connect. The cc_config defaults are the best (though the manual does not state what they are for these 2 tags), so no need to change the priority settings.

<lower_client_priority>0</lower_client_priority> - I think the default is 0, i.e. do not change the 'normal' priority that a user program normally gets.
<no_priority_change>1</no_priority_change> - By default, science apps run at the lowest priority but still take all spare CPU cycles. Running them at normal priority, as the client does, could raise contention. Trial and error.

Pointing instances to paths so that they reload previously used installs, and multiple ones at that, is finicky, if not a challenge. Certainly, near shutdown, you'd have to write back the data directory AFTER shutting down the client but before killing the VM instance. You'd also have to keep some central index to ensure the same install is not loaded twice. This is out of my depth (I have no experience to be of any help here, but I know it's being done by some).
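
One way the "central index" idea could be sketched is with a lock file per directory on the shared storage: an instance walks the boinc-N directories and takes the first one whose lock it manages to create. This is only an illustration under assumptions. The function names, the lock file name `in_use.lock`, and the fixed slot count are all hypothetical, and it relies on the shared filesystem honouring exclusive file creation, which not every network filesystem does.

```python
import os
import socket
import tempfile


def claim_slot(base_dir, max_slots, owner=None):
    """Claim the first free boinc-N directory; return its path or None."""
    owner = owner or socket.gethostname()
    for n in range(1, max_slots + 1):
        slot = os.path.join(base_dir, f"boinc-{n}")
        os.makedirs(slot, exist_ok=True)
        lock = os.path.join(slot, "in_use.lock")  # hypothetical lock name
        try:
            # O_CREAT | O_EXCL fails if the lock already exists, so only
            # one instance can win a given slot.
            fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue  # slot taken, try the next one
        with os.fdopen(fd, "w") as f:
            f.write(owner)  # record who holds the slot
        return slot
    return None


def release_slot(slot):
    """Free a slot; call this after the client has stopped and state is saved."""
    os.remove(os.path.join(slot, "in_use.lock"))


if __name__ == "__main__":
    base = tempfile.mkdtemp()  # stand-in for the shared /instances directory
    first = claim_slot(base, max_slots=2, owner="vm-a")
    second = claim_slot(base, max_slots=2, owner="vm-b")
    print(first, second)
    release_slot(first)
```

A preempted VM may disappear without releasing its lock, so a real setup would also need some way to expire stale locks, for example by comparing the recorded owner against the list of live instances.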

You could write to Contact Us https://www.worldcommunitygrid.org/viewContactUs.do f.a.o. knreed. He's the master, is also on the development committee, and I've seen him contribute and commit code to Berkeley BOINC on GitHub under an alias.
[Oct 28, 2017 4:05:54 PM]