World Community Grid - View Thread - Manually control order of workunits

World Community Grid Forums

Category: Active Research

Forum: Africa Rainfall Project

Thread: Manually control order of workunits

Thread Status: Active
Total posts in this thread: 16

This topic has been viewed 2725 times and has 15 replies

leloft
Cruncher
Joined: Jun 8, 2017
Post Count: 23
Status: Offline
Project Badges:

180 day badge for Outsmart Ebola Together

1 year badge for FightAIDS@Home - Phase 2

1 year badge for Microbiome Immunity Project

5 year badge for Africa Rainfall Project

1 year badge for OpenPandemics - COVID-19


Manually control order of workunits

Hello. My work cache contains more work than can be done before the deadlines. Specifically, it contains 48 ARP units with estimated run times of 72 hrs; 21 of these are being worked on, with estimated times to completion of between approx. 1 and 21 hrs, and the remaining ones have deadlines of approx. 96 hrs. However, the work cache also contains 23 OPN units (est. times of 6 hrs, deadlines 46 hrs). I have already manually aborted 3 (est. 72 hr) ARP units 48 hrs ahead of their deadlines in the hope that they will be picked up as stragglers.
Is there any way that I can manually prioritise the processing of the ARP units above the OPN ones to minimise the number of work units that have to get aborted?
I'm not sure how this overload happened, although I have been having problems with the device profiles not being enforced.
* edit: I have set 'no more work' via boinccmd until the backlog is cleared and set memory use to 100%.
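
For reference, the 'no more work' setting mentioned in the edit can be toggled from a shell roughly as follows. This is only a sketch: the function names and the BOINCCMD variable are made up for illustration, and the project URL is the one used in the boinccmd example later in this thread.

```shell
#!/bin/sh
# Sketch: stop or resume work fetch for one project via boinccmd.
# BOINCCMD and the function names are illustrative, not part of BOINC.
BOINCCMD="${BOINCCMD:-boinccmd}"
WCG_URL="http://www.worldcommunitygrid.org"

# Ask the client not to request new tasks from the project.
wcg_no_more_work() {
    "$BOINCCMD" --project "$WCG_URL" nomorework
}

# Allow the client to request new tasks again once the backlog is cleared.
wcg_allow_more_work() {
    "$BOINCCMD" --project "$WCG_URL" allowmorework
}
```

Call wcg_no_more_work while the backlog is being worked off and wcg_allow_more_work afterwards. Tasks already in the cache are not affected by either call.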

----------------------------------------
[Edit 1 times, last edit by leloft at Aug 10, 2021 9:09:00 AM]

[Aug 10, 2021 8:55:17 AM]

adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2089
Status: Offline
Project Badges:

5 year badge for Human Proteome Folding - Phase 2

90 day badge for Nutritious Rice for the World

2 year badge for Help Fight Childhood Cancer

2 year badge for Help Cure Muscular Dystrophy - Phase 2

14 day badge for Discovering Dengue Drugs - Together - Phase 2

180 day badge for The Clean Energy Project - Phase 2

1 year badge for Computing for Clean Water

1 year badge for Drug Search for Leishmaniasis

1 year badge for GO Fight Against Malaria

45 day badge for Computing for Sustainable Water

100 year badge for Mapping Cancer Markers

1 year badge for Uncovering Genome Mysteries

20 year badge for Outsmart Ebola Together

2 year badge for FightAIDS@Home - Phase 2

20 year badge for Smash Childhood Cancer

5 year badge for Microbiome Immunity Project

10 year badge for Africa Rainfall Project

50 year badge for OpenPandemics - COVID-19


Re: Manually control order of workunits

Is there any way that I can manually prioritise the processing of the ARP units above the OPN ones to minimise the number of work units that have to get aborted?

Yes.
You could use the file app_config.xml to limit the number of OPN tasks that run concurrently, like this:

$ cat > app_config.xml <<+
<app_config>
	<app>
		<name>opn1</name>
		<max_concurrent>5</max_concurrent>
	</app>
</app_config>
+

(This sets a limit of 5 concurrent opn1 tasks, as an example.)

Put the file app_config.xml into BOINC's subdirectory projects/www.worldcommunitygrid.org/ and force BOINC to re-read the config files, e.g. with the following command:

boinccmd --read_cc_config

Just for fun, if you have installed the file correctly, try running this command:

boinccmd --get_app_config http://www.worldcommunitygrid.org

(It will show BOINC's understanding of the file's contents.)
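
The installation steps above can be sketched as a small shell function. The function name and its arguments are illustrative only, and the BOINC data directory differs between systems (Debian packages commonly use /var/lib/boinc-client, but check your own installation).

```shell
#!/bin/sh
# Sketch: copy app_config.xml into the WCG project directory and ask the
# running client to re-read its config files.
# install_app_config is an illustrative name, not a BOINC tool.
install_app_config() {
    src="$1"        # path to the app_config.xml you wrote
    data_dir="$2"   # BOINC data directory (system-specific, an assumption)
    dest="$data_dir/projects/www.worldcommunitygrid.org"

    [ -f "$src" ] || { echo "no such file: $src" >&2; return 1; }
    [ -d "$dest" ] || { echo "project directory not found: $dest" >&2; return 1; }

    cp "$src" "$dest/app_config.xml" || return 1

    # Re-read the config files only if boinccmd is available on this machine.
    if command -v boinccmd >/dev/null 2>&1; then
        boinccmd --read_cc_config || echo "warning: could not reach the BOINC client" >&2
    fi
}
```

Usage would look like: install_app_config ./app_config.xml /var/lib/boinc-client (writing to the data directory may need root permissions).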

----------------------------------------
[Edit 1 times, last edit by adriverhoef at Aug 10, 2021 10:21:01 AM]

[Aug 10, 2021 10:00:40 AM]

leloft
Cruncher
Joined: Jun 8, 2017
Post Count: 23
Status: Offline
Project Badges:


Re: Manually control order of workunits

Thank you. That was far more straightforward than I had hoped! Very clear and helpful instructions.

[Aug 10, 2021 10:31:13 AM]

Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12146
Status: Offline
Project Badges:

1 year badge for Human Proteome Folding - Phase 2

45 day badge for Discovering Dengue Drugs - Together

14 day badge for Nutritious Rice for the World

180 day badge for Help Fight Childhood Cancer

90 day badge for Help Cure Muscular Dystrophy - Phase 2

5 year badge for The Clean Energy Project - Phase 2

90 day badge for Computing for Clean Water

180 day badge for GO Fight Against Malaria

20 year badge for Mapping Cancer Markers

5 year badge for Uncovering Genome Mysteries

5 year badge for Outsmart Ebola Together

5 year badge for FightAIDS@Home - Phase 2

2 year badge for Microbiome Immunity Project

10 year badge for OpenPandemics - COVID-19


Re: Manually control order of workunits

leloft

Going forward, to prevent a recurrence of the problem, you should amend the cache limits in your Device Profiles. Currently, ARP units are readily available (within an hour) and OPN & MCM are instantly available. There is no need to hold more than a few spares in excess of the numbers being crunched.

For instance, an 8-thread machine could have app_config.xml set to crunch 4 ARP, 3 OPN & 2 MCM. The recommended maximum for ARP is half of the threads, and letting the limits add up to 1 more than the total threads (here 4 + 3 + 2 = 9) allows for shortages of any one project.

Then the profile could be set to a maximum of 5 ARP, 4 OPN & 3 MCM so there is always 1 spare of each to allow for the time between completing a unit and the next one being downloaded.

For a different number of threads available, scale those figures up or down in proportion.
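
As an illustration of that scaling, the sketch below derives max_concurrent values from a thread count and prints a matching app_config.xml. The way the remainder is split between OPN and MCM is only one possible reading of "in proportion", the function name is made up, and mcm1 is an assumed app name (opn1 and arp1 are mentioned elsewhere in this thread); check the names against your own task list.

```shell
#!/bin/sh
# Sketch: derive max_concurrent values from the number of threads.
# Rule of thumb from the post: ARP gets at most half of the threads and the
# three limits add up to threads + 1. The OPN/MCM split is an assumption.
make_app_config() {
    threads="$1"
    # Require at least 2 threads so that no limit ends up as 0.
    [ "$threads" -ge 2 ] 2>/dev/null || { echo "need a thread count >= 2" >&2; return 1; }

    arp=$((threads / 2))          # half of the threads for ARP
    rest=$((threads + 1 - arp))   # what is left, including the 1 extra
    opn=$(((rest + 1) / 2))       # larger share of the remainder
    mcm=$((rest - opn))           # smaller share of the remainder

    cat <<EOF
<app_config>
    <app>
        <name>arp1</name>
        <max_concurrent>$arp</max_concurrent>
    </app>
    <app>
        <name>opn1</name>
        <max_concurrent>$opn</max_concurrent>
    </app>
    <app>
        <name>mcm1</name>
        <max_concurrent>$mcm</max_concurrent>
    </app>
</app_config>
EOF
}
```

For 8 threads this reproduces the 4 ARP, 3 OPN & 2 MCM example. On Linux, make_app_config "$(nproc)" > app_config.xml would generate a file for the current machine.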

The more you hold in cache, the less likely your machine is to be considered 'reliable' by WCG. A large cache also slows down the production of new ARP units and delays the validation of your wingmen's units.

If you have more than one machine then app_config.xml should be installed on each machine. The max_concurrent can be different on each machine or the same, but stick to the maximum of 50% of threads for ARP. You can have different profiles on different machines or use one for all machines.

Mike

[Aug 10, 2021 12:13:04 PM]

leloft
Cruncher
Joined: Jun 8, 2017
Post Count: 23
Status: Offline
Project Badges:


Re: Manually control order of workunits

If you have more than one machine then app_config.xml should be installed on each machine. The max_concurrent can be different on each machine or the same, but stick to the maximum of 50% of threads for ARP. You can have different profiles on different machines or use one for all machines.

Thank you. I have set up app_config.xml on the three machines that are using arp1, using the parameters you suggest for each.
Many thanks

[Aug 10, 2021 2:01:18 PM]

Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12146
Status: Offline
Project Badges:


Re: Manually control order of workunits

I should have mentioned that you have to activate app_config.xml in each machine by clicking on Options and then Read Config files each time you make a change.

Mike

[Aug 10, 2021 2:20:33 PM]

hiimebm
Senior Cruncher
United States
Joined: Oct 19, 2014
Post Count: 305
Status: Offline
Project Badges:

90 day badge for The Clean Energy Project - Phase 2

10 year badge for Mapping Cancer Markers

90 day badge for Uncovering Genome Mysteries

2 year badge for Outsmart Ebola Together

1 year badge for Africa Rainfall Project

2 year badge for OpenPandemics - COVID-19


Re: Manually control order of workunits

App_config would work but is not necessary here, since, as mentioned, you can control the max # of workunits for each individual project from your Device Profiles page. You may also want to set the queue to "0" days in the Manager software itself.
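
On a machine without the Manager GUI, a zero-day queue can probably be set with a local preferences override instead. The sketch below only writes the file for review; the function name is made up, and it assumes the standard global_prefs_override.xml tags work_buf_min_days and work_buf_additional_days.

```shell
#!/bin/sh
# Sketch: write a preferences override that asks for a zero-day work buffer.
# write_zero_queue_override is an illustrative name, not a BOINC tool.
write_zero_queue_override() {
    out="$1"   # where to write the file, e.g. ./global_prefs_override.xml
    cat > "$out" <<EOF
<global_preferences>
    <work_buf_min_days>0</work_buf_min_days>
    <work_buf_additional_days>0</work_buf_additional_days>
</global_preferences>
EOF
}
```

The resulting file would go into the BOINC data directory as global_prefs_override.xml, followed by boinccmd --read_global_prefs_override. Note that this replaces any local preference overrides already stored in that file.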

[Aug 10, 2021 2:23:59 PM]

Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12146
Status: Offline
Project Badges:


Re: Manually control order of workunits

Actually, app_config.xml is necessary because the other projects are so much shorter and there would be an imbalance if you hold a spare or spares in cache. ARP would hog the machine to the limit of its cache most of the time.

Mike

[Aug 11, 2021 1:38:49 AM]

leloft
Cruncher
Joined: Jun 8, 2017
Post Count: 23
Status: Offline
Project Badges:


Re: Manually control order of workunits

I have taken and implemented all the advice given over the last few days. I have just had to manually abort over a hundred ARP units that were sent to 2 (4-core) machines in the last few hours. Over 50 of them were downloaded even after I issued nomorework and updated the project. I was able to prevent the download of more work by suspending network activity. I have changed the profile of both machines to default (no ARP) while they chew their way through a few hundred OPN/MCM units.
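
For anyone in the same situation, suspending and restoring network activity from a shell can be sketched as follows. The function names and the BOINCCMD variable are made up for illustration.

```shell
#!/bin/sh
# Sketch: block or restore all network activity of the local BOINC client.
# BOINCCMD and the function names are illustrative, not part of BOINC.
BOINCCMD="${BOINCCMD:-boinccmd}"

# Stop all network activity, which also stops further task downloads.
boinc_network_off() {
    "$BOINCCMD" --set_network_mode never
}

# Return to preference-based network activity.
boinc_network_auto() {
    "$BOINCCMD" --set_network_mode auto
}
```

While the network is suspended, finished results cannot be uploaded or reported either, so this is only a stopgap.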

I am at a loss. How could this possibly have happened: the shared profile was set to 2 ARP, 2 OPN and 1 MCM, with a work cache of 1 day.

This is also a heads-up to the project admins: there are a hundred or so ARP units that have just been aborted. According to my results status, several of them appear to have been processed, but I only aborted the waiting and downloading ones.

I'd very much appreciate hearing from someone who is running a debian buster build of boinc 7.16.16!

[Aug 13, 2021 2:03:32 PM]

sam6861
Advanced Cruncher
Joined: Mar 31, 2020
Post Count: 107
Status: Offline
Project Badges:

45 day badge for FightAIDS@Home - Phase 2

180 day badge for Smash Childhood Cancer

5 year badge for OpenPandemics - COVID-19


Re: Manually control order of workunits

There may be some BOINC bugs with the use of max_concurrent that can cause constant fetches of up to 1000 tasks.
https://www.worldcommunitygrid.org/forums/wcg/viewthread_thread,43530
It is probably best to avoid BOINC's per-app max_concurrent, which has some bugs.

On this website, go to Settings, Device Manager, choose a profile, and scroll down to the project limits. Check all devices and profiles; some might be set to unlimited. After making any changes, press Save.

[Aug 13, 2021 5:11:38 PM]
