World Community Grid Forums
Thread Status: Active | Total posts in this thread: 34
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline

I think the client's performance starts degrading too, given that the manager and core client interact roughly once per second.
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline

The 1000 limit should be raised to at least 5000, or maybe 10000, but not removed altogether. IIRC, the reason it exists is to prevent a problem machine from executing thousands of work units in a short period with all of them erroring out; it's a sort of fail-safe mechanism. However, with the new AMD EPYC 7702 processor offering 64 cores and 128 threads, the 1000 limit is quite restrictive. Put two 7702s in a dual-socket system and that's 256 threads in one machine.

I'm running first-generation AMD Zen (32 cores per processor) in a dual-socket machine, and I can't run SCC1 on it now. That machine has 128 threads, but I only get 35 WUs per scheduler request, and a request happens every 2 minutes; most of the first 35 have completed before the next scheduler request fires, so I can't keep the machine busy exclusively on SCC1. I loaded it up with MIP1 (1064) WUs and it was empty in 48 hours. WUs per scheduler request, project limits, and BOINC client limits all need to be revisited.
----------------------------------------
[Edit 1 times, last edit by Doneske at Sep 19, 2019 2:59:47 PM]
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline

"that would be 256 threads"

Another hard restriction: 200 threads is the maximum BOINC supports, or to be more precise, 200 job slots. Those who can afford such hardware are in multi-client concurrent-install territory.
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline

I have run more than 200 job slots on the 128-thread system; the slot count reached 236. This was due to jobs getting preempted with "leave in memory" specified. Specifically, MCM1 was running when a bunch of FAH2 tasks started and preempted most of the MCM1 work. BOINC didn't say a word. Why would that restriction be there? Disk space? It seems useless...
----------------------------------------
Before running multiple clients concurrently, I would look into re-compiling the client to eliminate the restriction. Same with the 1000 hard limit in the client_state.h file. I've been meaning to give it a try but haven't gotten around to it yet.
[Edit 1 times, last edit by Doneske at Sep 19, 2019 5:05:34 PM]
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline

Maybe they removed the limit, but the last mention I can find on GitHub is in checkin_notes_2011.
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline

That looks like a query limit...

I just tested it:
- set ncpus to 256 in cc_config
- started the client and verified it read cc_config
- downloaded 256 jobs; they all started fine
- went to the slots directory; the highest directory number was 255
- stopped the client and reverted cc_config back to 128
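For reference, the override used in that test lives in cc_config.xml in the BOINC data directory (256 here matches the test above; set it back to the real thread count, or remove the element, when done):

```xml
<cc_config>
  <options>
    <!-- Report 256 CPUs to the client instead of the detected count -->
    <ncpus>256</ncpus>
  </options>
</cc_config>
```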
hchc
Veteran Cruncher | USA | Joined: Aug 15, 2006 | Post Count: 865 | Status: Offline

Regarding the 70 work units per core or thread limit:
----------------------------------------
Looks like the number of work units per core/thread is referenced via "max_jobs_exceeded" in /sched/sched_send.cpp:

    if (g_wreq->max_jobs_exceeded()) {

defined in /sched/sched_types.h:

    bool max_jobs_exceeded() {

and dependent on the max_jobs_on_host_proc_type_exceeded value.

Edit: Is /sched (the scheduler) built into the client? If so, then it's not server-side. I'll have to keep looking through the code unless someone can point me in the right direction.
[Edit 5 times, last edit by hchc at Sep 20, 2019 1:59:11 AM]
hchc
Veteran Cruncher | USA | Joined: Aug 15, 2006 | Post Count: 865 | Status: Offline

Regarding the 1000 total runnable jobs limit: like @Doneske said, it's in /client/client_state.h
----------------------------------------
and referenced in /client/work_fetch.cpp:

    if (p->pwf.n_runnable_jobs > WF_MAX_RUNNABLE_JOBS) {

This is hard-coded into the BOINC client, so changing it means changing the code for everyone.

Edited to Add: I opened Issue #3295 in the BOINC GitHub.
[Edit 1 times, last edit by hchc at Sep 20, 2019 4:15:52 AM]
hchc
Veteran Cruncher | USA | Joined: Aug 15, 2006 | Post Count: 865 | Status: Offline

Regarding the 200 concurrent tasks: I can't find that anywhere, but it's a problem for people with EPYC/Threadripper beasts who will have more than 200 threads going full steam.
----------------------------------------
Anyone know where this is defined?

Edit: Looks like @Doneske tested with 256 simulated CPUs, which all ran concurrently, so maybe this is not an issue.
[Edit 2 times, last edit by hchc at Sep 20, 2019 4:14:02 AM]
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline

The 1000 limit has been brought up a couple of times on the BOINC message boards, and I think David Anderson didn't want to change it. I compiled a BOINC client on a CentOS 7 system when I couldn't find a distributed client that would work, due to library differences or libraries missing entirely. The issue then becomes keeping it updated. Admittedly, the client probably doesn't need to be updated that often, but it would once in a while, and if you have a significant number of hosts, that becomes a chore. If I were more familiar with module mapping from the linker, it would be worth trying to find the constant in a binary module and just zapping it to a different value.

It may be worth bringing this up again, as AMD is changing the landscape with the high-core-count EPYC, Threadripper, and Ryzen parts, and Intel isn't far behind. I'm just wondering if the BOINC ecosystem is becoming slightly tiered, in the respect that there are still many, many systems under 32 cores but also a growing number of high-core-count systems entering the environment. Maybe there needs to be a parameter that can be passed at startup that lets the client cater to larger thread-count systems, such as --LARGE_THREAD_COUNT, telling the client to use larger limits on both the server and client side. By using a parameter, they wouldn't have to maintain different clients, and it would be off by default. Just thinking out loud.