World Community Grid Forums
Thread Status: Active | Total posts in this thread: 169
OldChap
Veteran Cruncher, UK | Joined: Jun 5, 2009 | Post Count: 978 | Status: Offline
If anyone is concerned about running too many at once, then try this:
<app_config>
    <app>
        <name>fahv</name>
        <max_concurrent>8</max_concurrent>
    </app>
</app_config>

Where the number = actual threads. Try going to 24 in <ncpus>.

Edit: Don't forget to use preferences to increase the number of days of cache to cope.

[Edit 3 times, last edit by OldChap at Jun 9, 2015 8:37:53 PM]
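The <ncpus> part of that advice lives in cc_config.xml in the BOINC data directory, not in app_config.xml. A minimal sketch, assuming the 24 from the post above (the value is illustrative only); the client picks it up on restart or when the config files are re-read:

<cc_config>
    <options>
        <!-- report 24 CPUs to the scheduler regardless of the real core count;
             the default of -1 means "use the detected number of cores" -->
        <ncpus>24</ncpus>
    </options>
</cc_config>

The inflated core count is what deepens the cache (the scheduler hands out tasks per reported core), while max_concurrent above keeps the number actually running at the real core count.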
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
I just dropped out of this project, having also recently dropped out of MCM1. I tossed OET into the mix briefly before turning it off and am now running only UGM. I was returning around 2000 FAAH WUs a day, so this should help some of those trying to reach goals. I only dropped out of OET because, once you have proven a host to be reliable, you will get WUs that require a quorum of 1. So I let the various hosts gobble up 20 to 30 of them before removing that project. Once they are considered reliable I will probably return. Might as well make the most of the processing power, and there's no need to have WUs sent to a wingman when they are not required. Plus I will probably aim for 100 years on UGM anyway. I should see 100 years on MCM1 in the next few days as the results are validated when the wingmen send their results back. I'm expecting to be over 10 years on UGM by the end of the month; currently sitting at a little over 2 years.
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
There's absolutely no gain in actually running more units concurrently than the system has cores. As noted before, efficiency drops proportionally and overhead increases, so use app_config to control the number running.
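As a concrete illustration of that advice, a minimal app_config.xml sketch, assuming an 8-core host and the fahv app name used earlier in the thread; it caps the running count at the core count without any ncpus inflation:

<app_config>
    <app>
        <name>fahv</name>
        <!-- never run more of these tasks at once than there are real cores -->
        <max_concurrent>8</max_concurrent>
    </app>
</app_config>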
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
OK, now see what

"6/9/2015 10:55:14 PM Not requesting tasks: too many runnable tasks"

is doing, a mysterious message at first sight. Where 32 ncpus * 35 would be 1120, this message appeared when hitting 1003 units and a computed buffer depth of 10 days per core [though the setting is just 5]... 10 being the standard deadline. Need to cut back again.

Seem to remember in the past the message was more along the lines of "not requesting work because it won't finish in time", but with this simulation backdoor open, unexpected things could and do happen.

edit: The only one truly maxed again is the Linux box, but then it does them 60% faster than Windows.

[Edit 1 times, last edit by Former Member at Jun 9, 2015 9:10:10 PM]
OldChap
Veteran Cruncher, UK | Joined: Jun 5, 2009 | Post Count: 978 | Status: Offline
Increase cache in preferences by a day maybe?
The machine is by now confused about runtime.

[Edit 1 times, last edit by OldChap at Jun 9, 2015 9:36:13 PM]
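For anyone doing this locally rather than through the website preferences, the cache depth lives in global_prefs_override.xml in the BOINC data directory. A partial sketch with illustrative values only (the real file carries many more preference tags), which the Manager can re-read without a restart:

<global_preferences>
    <!-- keep at least this many days of work buffered... -->
    <work_buf_min_days>2.0</work_buf_min_days>
    <!-- ...plus up to this many additional days on top -->
    <work_buf_additional_days>1.0</work_buf_additional_days>
</global_preferences>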
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Nope, the "too many runnable tasks" message appeared on the fastest Windows machine, with a computed buffer depth of 4.1 days and 1007 tasks on the host. Something on the servers might be reacting with latency.
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
I'm using app_config to run only a number of WUs equal to the cores. I agree, any more than that and it's just hurry up and wait, and it probably causes extra overhead in the CPU to no gain. This is what's causing the 1000 WU limit:
Author: David Anderson <davea at ssl.berkeley.edu>
Date: Sun Jul 7 13:13:57 2013 -0700

    client: don't request work from a project w/ > 1000 runnable jobs

    Because of O(N^2) algorithms, the client becomes CPU-intensive
    when there are lots of jobs. This limit could be somewhat lower.

4d47e2f170ae638a0121c4a31cc4a9f54a75848a

diff --git a/client/work_fetch.cpp b/client/work_fetch.cpp
index a94779f..2745554 100644
--- a/client/work_fetch.cpp
+++ b/client/work_fetch.cpp
@@ -664,6 +664,16 @@ void WORK_FETCH::setup() {
         PROJECT* p = rp->project;
         p->sched_priority -= rp->estimated_flops_remaining()/max_queued_flops;
     }
+
+    // don't request work from projects w/ > 1000 runnable jobs
+    //
+    for (unsigned int i=0; i<gstate.projects.size(); i++) {
+        PROJECT* p = gstate.projects[i];
+        if (p->pwf.n_runnable_jobs > 1000 && !p->pwf.cant_fetch_work_reason) {
+            p->pwf.cant_fetch_work_reason = CANT_FETCH_WORK_TOO_MANY_RUNNABLE;
+        }
+    }
+
     std::sort(
         gstate.projects.begin(),
         gstate.projects.end()

(Note: the forum formatting ate the [i] index on the gstate.projects line; restored above.)

It seems to ignore the work cache setting. Mine are still set to 2.5 days and yet I have about 7 days of work on my slowest machine. They have been set this way for about 2 weeks now and it's maintaining the ~1000 per host per project... If someone had a 4-socket 8-core machine (not unusual these days), the 1000 per host would be the limit, not the 35 per core.

[Edit 1 times, last edit by Doneske at Jun 10, 2015 2:12:28 AM]
Eric_Kaiser
Veteran Cruncher, Germany (Hessen) | Joined: May 7, 2013 | Post Count: 1047 | Status: Offline
The condition above was not correct. I had up to 1300-something WUs on each of my servers, with concurrent set to 12 and ncpu set to 64.

The servers finished 450 WU/day and caches were set to 10+10.
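For anyone trying to picture that setup, a rough sketch of the two files as Eric describes them, with values taken from his post and the file layout assumed to match the earlier examples:

<cc_config>
    <options>
        <!-- pretend there are 64 CPUs so the scheduler allows a much deeper queue -->
        <ncpus>64</ncpus>
    </options>
</cc_config>

<app_config>
    <app>
        <name>fahv</name>
        <!-- but only ever run 12 of these tasks at the same time -->
        <max_concurrent>12</max_concurrent>
    </app>
</app_config>

At 450 WU/day against a ~1300-task queue, that is roughly a three-day backlog per server, comfortably inside the 10-day deadline even though the 10+10 cache setting itself is oversized.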
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Quoting Doneske above: "I'm using app_config to run only a number of WUs equal to the cores... If someone had a 4-socket 8-core machine (not unusual these days), the 1000 per host would be the limit, not the 35 per core."

(Now I remember, from that previous era.) In effect, as soon as the last work fetch takes the total in progress for a project over 1000 [not sure whether this works at the app level or the WCG total], the next connect is 'too many', no request. The rule is 35 * cores or 1000, whichever limit is hit first: one controlled by the server, the other by the client. See ncpus becoming too popular and getting addressed... it's really a deprecated function** as it was for simulating multi-core devices [some are beta testing with this it seems, probably running permanently with this elevated setting in anticipation].

The issue with thousands in cache is the client overhead of maintaining the list [your client_state.xml gets big], but more so the troublesome refreshing of the BOINC Manager GUI view, and that does eat CPU time to the point of crashing. This is another reason to use BOINCTasks, as it can summarize the buffer by project, app and status, so there is a single line for 'ready to start' saying there are 1005 on one host [4 days' worth].

Eric's 10+10 cache is pointless. Here the client will just go into 'won't finish in time, no work sent', a 20-day total cache when the deadline is 10. A more or less continuous traffic jam, with the machine being locked out of the repair stream. The rule having gone into clients after July 7, 2013 (?) would of course mean those from before would not stop at 1,000.

Remember, setting the buffer to the highest value now means that by the time the current last task is run, all subsequent work fetched and run will have sat in the buffer for the full cache depth. The art is to buffer up just before the true end of supply is near.

Meantime exp.164 has advanced to 896690. At 650K/day we'd need 3.4 more days to dry one out as of this moment. A bunch of the 888 [experiment 162] also dropped in between. The question on Exp. 165/166 remains open.

Edit: Seems Doneske's post has a missing closing tag for italics, so put one at the start of my reply.

Edit2: ** hmmm, on reflection probably not deprecated. Some have trouble getting their very high multi-core devices fully recognized by BOINC, so they use ncpus to get all cores crunching [Is this a bug in the website profile logic, where there are still both the N processors AND percent of processors settings... a conflict? Supposedly after client 6.2.28 or so the N processors setting was to be ignored, local prefs only giving %, then puzzling people who don't get all cores computing].

[Edit 2 times, last edit by Former Member at Jun 10, 2015 8:47:24 AM]
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
It's almost entertaining... the first connect after the in-progress count dropped below 1001 got 45 new tasks. Seen one call giving 98 for FAHV, i.e. there's some fluidity in how much can be buffered... 1098 is possible.
5232 World Community Grid 6/10/2015 9:49:59 AM Reporting 1 completed tasks
5233 World Community Grid 6/10/2015 9:49:59 AM Requesting new tasks for CPU
5234 World Community Grid 6/10/2015 9:49:59 AM [sched_op] CPU work request: 5032459.56 seconds; 0.00 devices
5235 World Community Grid 6/10/2015 9:50:08 AM Scheduler request completed: got 45 new tasks