World Community Grid Forums
Thread Status: Active | Total posts in this thread: 169
OldChap
Veteran Cruncher, UK | Joined: Jun 5, 2009 | Post Count: 978 | Status: Offline
If anyone is concerned about running too many at once, then try this:
<app_config>
    <app>
        <name>fahv</name>
        <max_concurrent>8</max_concurrent>
    </app>
</app_config>

Where the number = actual threads. Try going to 24 in <ncpus>.

Edit: Don't forget to use preferences to increase the number of days of cache to cope.

[Edit 3 times, last edit by OldChap at Jun 9, 2015 8:37:53 PM]
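The <ncpus> part of that advice lives in cc_config.xml in the BOINC data directory, not in app_config.xml. A minimal sketch, assuming the 24 from the post above (the value is illustrative only); the client picks it up on restart or when the config files are re-read:

<cc_config>
    <options>
        <!-- report 24 CPUs to the scheduler regardless of the real core count;
             the default of -1 means "use the detected number of cores" -->
        <ncpus>24</ncpus>
    </options>
</cc_config>

The inflated core count is what deepens the cache (the scheduler hands out tasks per reported core), while max_concurrent above keeps the number actually running at the real core count.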
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
I just dropped out of this project, having also recently dropped out of MCM1. I tossed OET into the mix briefly before turning it off and am now running only UGM. I was returning around 2000 FAAH WUs a day, so this should help some of those trying to reach goals. I only dropped out of OET because, once you have proven a host to be reliable, you will get WUs that require a quorum of 1. So I let the various hosts gobble up 20 to 30 of them before removing that project. Once they are considered reliable I will probably return. Might as well make the most of the processing power, and there's no need to have WUs sent to a wingman when they are not required. Plus I will probably aim for 100 years on UGM anyway. I should see 100 years on MCM1 in the next few days as the results are validated when the wingmen send their results back. I'm expecting to be over 10 years on UGM by the end of the month; currently sitting at a little over 2 years.
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
There's absolutely no gain in actually running more units concurrently than the system has cores. As noted before, efficiency drops proportionally and overhead increases, so use app_config to control the number running.
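As a concrete illustration of that advice, a minimal app_config.xml sketch, assuming an 8-core host and the fahv app name used earlier in the thread; it caps the running count at the core count without any ncpus inflation:

<app_config>
    <app>
        <name>fahv</name>
        <!-- never run more of these tasks at once than there are real cores -->
        <max_concurrent>8</max_concurrent>
    </app>
</app_config>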
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
OK, now see what

"6/9/2015 10:55:14 PM Not requesting tasks: too many runnable tasks"

is doing, a mysterious message at first sight. Where 32 ncpus * 35 would be 1120, this message appeared when hitting 1003 units and a computed buffer depth of 10 days per core [though the setting is just 5]... 10 being the standard deadline. Need to cut back again.

Seem to remember in the past the message was more along the lines of "not requesting work because it won't finish in time", but with this simulation backdoor open, unexpected things could and do happen.

edit: The only one truly maxed again is the Linux box, but then it does them 60% faster than Windows.

[Edit 1 times, last edit by Former Member at Jun 9, 2015 9:10:10 PM]
OldChap
Veteran Cruncher, UK | Joined: Jun 5, 2009 | Post Count: 978 | Status: Offline
Increase cache in preferences by a day maybe?
The machine is by now confused about runtime.

[Edit 1 times, last edit by OldChap at Jun 9, 2015 9:36:13 PM]
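For anyone doing this locally rather than through the website preferences, the cache depth lives in global_prefs_override.xml in the BOINC data directory. A partial sketch with illustrative values only (the real file carries many more preference tags), which the Manager can re-read without a restart:

<global_preferences>
    <!-- keep at least this many days of work buffered... -->
    <work_buf_min_days>2.0</work_buf_min_days>
    <!-- ...plus up to this many additional days on top -->
    <work_buf_additional_days>1.0</work_buf_additional_days>
</global_preferences>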
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Nope, the "too many runnable tasks" message appeared on the fastest Windows machine, with a computed buffer depth of 4.1 days and 1007 tasks on the host. Something on the servers might be reacting with latency.
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
I'm using app_config to run only a number of WUs equal to the cores. I agree, any more than that and it's just hurry up and wait, and it probably causes extra overhead in the CPU to no gain. This is what's causing the 1000 WU limit:
Author: David Anderson <davea at ssl.berkeley.edu>
Date: Sun Jul 7 13:13:57 2013 -0700

    client: don't request work from a project w/ > 1000 runnable jobs

    Because of O(N^2) algorithms, the client becomes CPU-intensive
    when there are lots of jobs. This limit could be somewhat lower.

4d47e2f170ae638a0121c4a31cc4a9f54a75848a

diff --git a/client/work_fetch.cpp b/client/work_fetch.cpp
index a94779f..2745554 100644
--- a/client/work_fetch.cpp
+++ b/client/work_fetch.cpp
@@ -664,6 +664,16 @@ void WORK_FETCH::setup() {
         PROJECT* p = rp->project;
         p->sched_priority -= rp->estimated_flops_remaining()/max_queued_flops;
     }
+
+    // don't request work from projects w/ > 1000 runnable jobs
+    //
+    for (unsigned int i=0; i<gstate.projects.size(); i++) {
+        PROJECT* p = gstate.projects[i];
+        if (p->pwf.n_runnable_jobs > 1000 && !p->pwf.cant_fetch_work_reason) {
+            p->pwf.cant_fetch_work_reason = CANT_FETCH_WORK_TOO_MANY_RUNNABLE;
+        }
+    }
+
     std::sort(
         gstate.projects.begin(),
         gstate.projects.end()

(Note: the forum formatting ate the [i] index on the gstate.projects line; restored above.)

It seems to ignore the work cache setting. Mine are still set to 2.5 days and yet I have about 7 days of work on my slowest machine. They have been set this way for about 2 weeks now and it's maintaining the ~1000 per host per project... If someone had a 4-socket 8-core machine (not unusual these days), the 1000 per host would be the limit, not the 35 per core.

[Edit 1 times, last edit by Doneske at Jun 10, 2015 2:12:28 AM]
Eric_Kaiser
Veteran Cruncher, Germany (Hessen) | Joined: May 7, 2013 | Post Count: 1047 | Status: Offline
The condition above was not correct. I had up to 1300-something WUs on each of my servers, with concurrent set to 12 and ncpu set to 64.

The servers finished 450 WU/day and caches were set to 10+10.
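For anyone trying to picture that setup, a rough sketch of the two files as Eric describes them, with values taken from his post and the file layout assumed to match the earlier examples:

<cc_config>
    <options>
        <!-- pretend there are 64 CPUs so the scheduler allows a much deeper queue -->
        <ncpus>64</ncpus>
    </options>
</cc_config>

<app_config>
    <app>
        <name>fahv</name>
        <!-- but only ever run 12 of these tasks at the same time -->
        <max_concurrent>12</max_concurrent>
    </app>
</app_config>

At 450 WU/day against a ~1300-task queue, that is roughly a three-day backlog per server, comfortably inside the 10-day deadline even though the 10+10 cache setting itself is oversized.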
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Quoting Doneske above: "I'm using app_config to run only a number of WUs equal to the cores... If someone had a 4-socket 8-core machine (not unusual these days), the 1000 per host would be the limit, not the 35 per core."

(Now I remember, from that previous era.) In effect, as soon as the last work fetch takes the total in progress for a project over 1000 [not sure whether this works at the app level or the WCG total], the next connect is 'too many', no request. The rule is 35 * cores or 1000, whichever limit is hit first: one controlled by the server, the other by the client. See ncpus becoming too popular and getting addressed... it's really a deprecated function** as it was for simulating multi-core devices [some are beta testing with this it seems, probably running permanently with this elevated setting in anticipation].

The issue with thousands in cache is the client overhead of maintaining the list [your client_state.xml gets big], but more so the troublesome refreshing of the BOINC Manager GUI view, and that does eat CPU time to the point of crashing. This is another reason to use BOINCTasks, as it can summarize the buffer by project, app and status, so there is a single line for 'ready to start' saying there are 1005 on one host [4 days' worth].

Eric's 10+10 cache is pointless. Here the client will just go into 'won't finish in time, no work sent', a 20-day total cache when the deadline is 10. A more or less continuous traffic jam, with the machine being locked out of the repair stream. The rule having gone into clients after July 7, 2013 (?) would of course mean those from before would not stop at 1,000.

Remember, setting the buffer to the highest value now means that by the time the current last task is run, all subsequent work fetched and run will have sat in the buffer for the full cache depth. The art is to buffer up just before the true end of supply is near.

Meantime exp.164 has advanced to 896690. At 650K/day we'd need 3.4 more days to dry one out as of this moment. A bunch of the 888 [experiment 162] also dropped in between. The question on Exp. 165/166 remains open.

Edit: Seems Doneske's post has a missing closing tag for italics, so put one at the start of my reply.

Edit2: ** hmmm, on reflection probably not deprecated. Some have trouble getting their very high multi-core devices fully recognized by BOINC, so they use ncpus to get all cores crunching [Is this a bug in the website profile logic, where there are still both the N processors AND percent of processors settings... a conflict? Supposedly after client 6.2.28 or so the N processors setting was to be ignored, local prefs only giving %, then puzzling people who don't get all cores computing].

[Edit 2 times, last edit by Former Member at Jun 10, 2015 8:47:24 AM]
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
It's almost entertaining... the first connect after the in-progress count dropped below 1001 got 45 new tasks. Seen one call giving 98 for FAHV, i.e. there's some fluidity in how much can be buffered... 1098 is possible.
5232 World Community Grid 6/10/2015 9:49:59 AM Reporting 1 completed tasks
5233 World Community Grid 6/10/2015 9:49:59 AM Requesting new tasks for CPU
5234 World Community Grid 6/10/2015 9:49:59 AM [sched_op] CPU work request: 5032459.56 seconds; 0.00 devices
5235 World Community Grid 6/10/2015 9:50:08 AM Scheduler request completed: got 45 new tasks