World Community Grid Forums
Category: Active Research Forum: Africa Rainfall Project Thread: Manually control order of workunits |
Thread Status: Active Total posts in this thread: 16
leloft
Cruncher Joined: Jun 8, 2017 Post Count: 23 Status: Offline Project Badges: |
Hello. My work cache contains more work than can be done before the deadlines. Specifically, it contains 48 ARP units with estimated times of 72 hrs; 21 of these are being worked on, with estimated times to completion of between approx. 1 and 21 hrs, and the remaining ones have deadlines of approx. 96 hrs. However, the work cache also contains 23 OPD units (est. times of 6h, deadlines 46h). I have already manually aborted 3 (est. 72hr) ARP units 48h ahead of their deadlines in the hope that they will be picked up as stragglers.
Is there any way that I can manually prioritise the processing of the ARP units above the OPD ones to minimise the number of work units that have to get aborted? I'm not sure how this overload happened, although I have been having problems with the device profiles not being enforced. * edit: I have set 'no more work' via boinccmd until the backlog is cleared and set memory use to 100%. [Edit 1 times, last edit by leloft at Aug 10, 2021 9:09:00 AM] |
||
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2089 Status: Offline Project Badges: |
"Is there any way that I can manually prioritise the processing of the ARP units above the OPD ones to minimise the number of work units that have to get aborted?"
Yes. You could make use of the file app_config.xml and limit the number of concurrent OPN tasks, like this:
$ cat > app_config.xml <<+
(Setting 5 OPN1 tasks as the limit, as an example.)
Put the file app_config.xml into BOINC's subdirectory projects/www.worldcommunitygrid.org/ and force a re-read of the config files, e.g. through the following command:
boinccmd --read_cc_config
Just for fun, if you have installed the file correctly, try running this command:
boinccmd --get_app_config http://www.worldcommunitygrid.org
(It will show BOINC's understanding of the file's contents.) [Edit 1 times, last edit by adriverhoef at Aug 10, 2021 10:21:01 AM] |
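A minimal sketch of what the contents of such a file might look like, using the limit of 5 from the example above. The short application name `opn1` is an assumption (check the names BOINC reports for your own tasks), and `PROJECT_DIR` is a hypothetical variable that should point at BOINC's projects/www.worldcommunitygrid.org/ subdirectory.

```shell
#!/bin/sh
# Sketch only: writes an app_config.xml that caps concurrent OPN1 tasks at 5.
# PROJECT_DIR is a hypothetical variable; set it to BOINC's
# projects/www.worldcommunitygrid.org/ subdirectory (defaults to the current directory).
PROJECT_DIR="${PROJECT_DIR:-.}"

# "opn1" is assumed to be the application's short name.
cat > "$PROJECT_DIR/app_config.xml" <<'EOF'
<app_config>
  <app>
    <name>opn1</name>
    <max_concurrent>5</max_concurrent>
  </app>
</app_config>
EOF

# Afterwards, ask the client to re-read its config files:
#   boinccmd --read_cc_config
```

Limiting the shorter OPN1 tasks this way leaves the remaining threads free for the ARP units, which is the effect asked for in the question.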
||
|
leloft
Cruncher Joined: Jun 8, 2017 Post Count: 23 Status: Offline Project Badges: |
Thank you. That was far more straightforward than I had hoped! Very clear and helpful instructions.
|
||
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12146 Status: Offline Project Badges: |
leloft
Going forward, to prevent a recurrence of the problem, you should amend the cache limits in your Device Profiles. Currently, ARP units are readily available (within an hour) and OPN & MCM are instantly available. There is no need to hold more than a few spares in excess of the numbers being crunched.
For instance, an 8-thread machine could have app_config.xml set to crunch 4 ARP, 3 OPN & 2 MCM. The recommended maximum for ARP is half of the threads, and a combined total of 1 more than the number of threads allows for shortages. Then the profile could be set to a maximum of 5 ARP, 4 OPN & 3 MCM, so there is always 1 spare of each to cover the time between completing a unit and the next one being downloaded. For a different number of threads, scale those figures up or down in proportion.
The more you hold in cache, the less likely your machine is to be considered 'reliable' by WCG. A large cache also slows down the production of new ARP units and delays the validation of your wingman's units.
If you have more than one machine then app_config.xml should be installed on each machine. The max_concurrent can be different on each machine or the same, but stick to the maximum of 50% of threads for ARP. You can have different profiles on different machines or use one for all machines.
Mike |
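The 8-thread example above could be written out roughly as follows. This is a sketch under assumptions: the short names `arp1`, `opn1` and `mcm1` are assumed, and `THREADS`, `ARP_MAX`, `OPN_MAX` and `MCM_MAX` are hypothetical variable names used only for illustration.

```shell
#!/bin/sh
# Sketch of the 8-thread example: 4 ARP, 3 OPN and 2 MCM running at once.
# The short names arp1, opn1 and mcm1 are assumptions; THREADS and the
# *_MAX variables are hypothetical names used only for this illustration.
THREADS=8
ARP_MAX=4
OPN_MAX=3
MCM_MAX=2

# Rule of thumb from the post: ARP at most half of the threads.
if [ "$ARP_MAX" -gt $((THREADS / 2)) ]; then
    echo "ARP_MAX exceeds half of the threads" >&2
    exit 1
fi

# Rule of thumb from the post: the combined total is one more than the threads.
TOTAL=$((ARP_MAX + OPN_MAX + MCM_MAX))
echo "total max_concurrent: $TOTAL (threads + 1 = $((THREADS + 1)))"

# Unquoted EOF so the variables above are expanded into the file.
cat > app_config.xml <<EOF
<app_config>
  <app><name>arp1</name><max_concurrent>$ARP_MAX</max_concurrent></app>
  <app><name>opn1</name><max_concurrent>$OPN_MAX</max_concurrent></app>
  <app><name>mcm1</name><max_concurrent>$MCM_MAX</max_concurrent></app>
</app_config>
EOF
```

For a machine with a different thread count, change `THREADS` and the three limits together; the check on `ARP_MAX` only guards the 50% rule and does not choose the other two values for you.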
||
|
leloft
Cruncher Joined: Jun 8, 2017 Post Count: 23 Status: Offline Project Badges: |
"If you have more than one machine then app_config.xml should be installed on each machine. The max_concurrent can be different on each machine or the same, but stick to the maximum of 50% of threads for ARP. You can have different profiles on different machines or use one for all machines."
Thank you. I have set up app_config.xml on the three machines that are using arp1, using the parameters you suggest for each. Many thanks |
||
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12146 Status: Offline Project Badges: |
I should have mentioned that you have to activate app_config.xml on each machine by clicking on Options and then Read Config files each time you make a change.
Mike |
||
|
hiimebm
Senior Cruncher United States Joined: Oct 19, 2014 Post Count: 305 Status: Offline Project Badges: |
App_config would work but is not necessary here, since, as mentioned, you can control the max # of workunits from each individual project from your Device Profiles page. You may also want to set the queue to "0" days in the Manager software itself. |
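On a machine without the Manager GUI, a queue of 0 days can probably be expressed through a local preferences override instead. The sketch below assumes the client reads `global_prefs_override.xml` from its data directory; `DATA_DIR` is a hypothetical variable standing in for that directory.

```shell
#!/bin/sh
# Sketch only: writes a global_prefs_override.xml that sets the work buffer to 0 days.
# DATA_DIR is a hypothetical variable; set it to the BOINC data directory
# (defaults to the current directory).
DATA_DIR="${DATA_DIR:-.}"

# Note: this overwrites any existing override file in DATA_DIR.
cat > "$DATA_DIR/global_prefs_override.xml" <<'EOF'
<global_preferences>
  <work_buf_min_days>0</work_buf_min_days>
  <work_buf_additional_days>0</work_buf_additional_days>
</global_preferences>
EOF

# Then have the client pick up the override:
#   boinccmd --read_global_prefs_override
```

If an override file already exists, it is safer to edit those two elements in place than to replace the whole file, since it may hold other local settings.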
||
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12146 Status: Offline Project Badges: |
Actually, app_config.xml is necessary because the other projects are so much shorter and there would be an imbalance if you hold a spare or spares in cache. ARP would hog the machine to the limit of its cache most of the time.
Mike |
||
|
leloft
Cruncher Joined: Jun 8, 2017 Post Count: 23 Status: Offline Project Badges: |
I have taken and implemented all the advice given over the last few days. I have just had to manually abort over a hundred ARP units that have been sent to 2 (4-core) machines in the last few hours. Over 50 of them were downloaded even after I issued nomorework and updated the project. I was able to prevent the download of more work by suspending network activity. I have changed the profile of both machines to default (no ARP) while they chew their way through a few hundred OPN/MCM units.
I am at a loss. How could this possibly have happened? The shared profile was set to 2 ARP, 2 OPN and 1 MCM, with a work cache of 1 day. This is also a heads-up to the project admins: there are a hundred or so ARP units that have just been aborted. According to my results status, several of them appear to be processed, but I only aborted the waiting and downloading ones. I'd very much appreciate hearing from someone who is running a Debian buster build of BOINC 7.16.16! |
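The two stop-gap steps described above (no more work, then suspending network activity) can be sketched with boinccmd as follows. The project URL is the one quoted earlier in the thread; `DRY_RUN`, `run` and `stop_fetching` are hypothetical names for this illustration, and by default the script only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch only: stop the client from fetching further work.
# PROJECT_URL is taken from the boinccmd example earlier in the thread.
PROJECT_URL="http://www.worldcommunitygrid.org"
# DRY_RUN=1 (the default) prints the commands; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print or execute a command depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Hypothetical helper: the two steps described in the post.
stop_fetching() {
    # Ask the client not to request new tasks from the project.
    run boinccmd --project "$PROJECT_URL" nomorework
    # Suspend all network activity, which also blocks further downloads.
    run boinccmd --set_network_mode never
}

stop_fetching
```

Suspending the network also holds back uploads of finished results, so it is only a temporary measure; `boinccmd --set_network_mode auto` and `boinccmd --project "$PROJECT_URL" allowmorework` should undo the two steps once the backlog is cleared.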
||
|
sam6861
Advanced Cruncher Joined: Mar 31, 2020 Post Count: 107 Status: Offline Project Badges: |
There may be some BOINC bugs with the use of max_concurrent that can cause constant fetches of up to 1000 tasks:
https://www.worldcommunitygrid.org/forums/wcg/viewthread_thread,43530
You could try avoiding BOINC's per-app max_concurrent, which has some bugs. Instead, use this website: Settings, Device Manager, choose a profile, and scroll down to the project limits. Check all devices and profiles; some might be set to unlimited. After making any changes, press Save. |
||
|
|