Thread Status: Active
Total posts in this thread: 4
This topic has been viewed 2206 times and has 3 replies
bluestang
Senior Cruncher
USA
Joined: Oct 1, 2010
Post Count: 274
Status: Offline
Aborting Stuck Tasks

It would be nice if there was a way for the server to abort stuck WUs when BOINC communicates back to the project.

Every now and then, I get an OPNG WU that will run for hours before I catch it and manually abort it. These are WUs that usually take less than 10 minutes if run concurrently with other WUs.

If there was a way to have them caught and aborted after running for a period of time with no increase in WU progress, or something like that, that would be great!
----------------------------------------
[Oct 27, 2021 2:47:57 PM]
Bryn Mawr
Senior Cruncher
Joined: Dec 26, 2018
Post Count: 384
Status: Offline
Re: Aborting Stuck Tasks

bluestang wrote:
> It would be nice if there was a way for the server to abort stuck WUs when BOINC communicates back to the project.
>
> Every now and then, I get an OPNG WU that will run for hours before I catch it and manually abort it. These are WUs that usually take less than 10 minutes if run concurrently with other WUs.
>
> If there was a way to have them caught and aborted after running for a period of time with no increase in WU progress, or something like that, that would be great!


I think that there is a time limit built into the WU, but it relies on the WU checkpointing: if the task is stuck, it will never get to check whether it is over the limit.
[Oct 27, 2021 4:27:00 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2346
Status: Offline
Re: Aborting Stuck Tasks

bluestang wrote:
> It would be nice if there was a way for the server to abort stuck WUs when BOINC communicates back to the project.

The WCG server doesn't have any clue about tasks getting or being stuck, because the BOINC client (also) doesn't know. The BOINC client is the link between the tasks on your computer and the WCG server.

The BOINC client isn't aware of stuck tasks, because it doesn't know about any definition of stuck. However, it does know about a time limit. When that limit is exceeded, your 'stuck' task will be aborted.

In the past, I've had GPU tasks with a time limit of 1.68 hours (about 100 minutes) and an expected duration of only 3 minutes that needed more than 100 minutes to run. Sure enough, they were aborted by the BOINC client after 100 minutes of runtime. That's the BOINC client saying "Your task is taking too long, I'm aborting it." It doesn't say "Your task is stuck, I'm aborting it." However, it's the BOINC client's way of saying "Your task is stuck, so I'm aborting it" (if you want to look at it that way ;-) ).

bluestang wrote:
> Every now and then, I get an OPNG WU that will run for hours before I catch it and manually abort it. These are WUs that usually take less than 10 minutes if run concurrently with other WUs.
>
> If there was a way to have them caught and aborted after running for a period of time with no increase in WU progress, or something like that, that would be great!


Then you need a program running in the background that checks every now and then whether there is a task that exceeds your limits, i.e. one that meets your definition of being stuck. 8-)
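
One possible definition of "stuck" is a task whose reported progress has not increased between two checks. The sketch below only illustrates that idea and is not taken from the BOINC or WCG documentation: the function names snapshot_progress and find_stalled are made up, and it assumes that `boinccmd --get_tasks` prints a "name:" line and a "fraction done:" line for each task, so check that against the output of your own client version first.

```shell
#!/bin/sh
# Sketch: flag tasks whose progress did not increase between two snapshots.
# Assumption: `boinccmd --get_tasks` prints "name: <task>" and
# "fraction done: <value>" lines for every task (verify on your client).

# Read `boinccmd --get_tasks` output on stdin and
# print one "task-name fraction-done" pair per task.
snapshot_progress() {
    awk '$1 == "name:" { name = $2 }
         $1 == "fraction" && $2 == "done:" { print name, $3 }'
}

# Compare an older snapshot file ($1) with a newer one ($2) and print the
# names of tasks that appear in both without any increase in progress.
find_stalled() {
    awk 'NR == FNR { prev[$1] = $2; next }
         ($1 in prev) && ($2 + 0 <= prev[$1] + 0) { print $1 }' "$1" "$2"
}
```

For example, one could save the output of `boinccmd --get_tasks | snapshot_progress` to a file at each check, run find_stalled on the previous and the current file, and only then decide what to abort. Note that the sketch does not look at the task state, so tasks that are waiting to run or have already finished would show no increase either.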

EDIT: added a simple program:

Say you run this script. It sleeps for 5 minutes between checks, then looks for any OPNG task that has been running for at least 10 minutes and, if it finds one, aborts it. With echoed comments. For your pleasure. :-) Use at your own risk. :-D

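# Every 5 minutes, abort any OPNG task that has been running for 10 minutes or more.
while sleep 300; do
    printf '[%s] ' "$(date)"
    # The inner loop expects one "elapsed-time task-name" pair per line.
    wcgresults -HNrB | grep OPNG |
    while read -r elapsed name; do
        case $elapsed in
            (0:0[0-9]:[0-5][0-9])
                # Elapsed time is 0:00:00 to 0:09:59: leave the task alone.
                echo "Steady as she goes, $name, elapsed time = $elapsed"
                ;;
            (*)
                echo "Whoa! Task $name needs attention, elapsed time = $elapsed, aborting it"
                boinccmd --task http://www.worldcommunitygrid.org/ "$name" abort
                ;;
        esac
    done
done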
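
The pattern in the script only works while the elapsed time is printed as H:MM:SS with a single-digit hour, and the 10-minute limit is baked into the pattern. As a hedged alternative, the elapsed time could be converted to seconds and compared numerically. The helper names to_seconds and over_limit below are hypothetical, and the H:MM:SS input format is an assumption inferred from the pattern above.

```shell
#!/bin/sh
# Sketch: compare an elapsed time numerically instead of by pattern.
# Assumption: the elapsed time has the form H:MM:SS (any number of hour digits).

# Convert an elapsed time such as "0:09:59" to whole seconds.
to_seconds() {
    echo "$1" | awk -F: '{ print int($1 * 3600 + $2 * 60 + $3) }'
}

# Succeed (exit status 0) when the elapsed time $1 is at or over
# the limit $2, which is given in seconds.
over_limit() {
    [ "$(to_seconds "$1")" -ge "$2" ]
}
```

Inside the read loop, a test such as `if over_limit "$elapsed" 600; then ...` could then take the place of the case statement, with 600 being the 10-minute limit in seconds.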

----------------------------------------
[Edit 2 times, last edit by adriverhoef at Oct 27, 2021 6:18:42 PM]
[Oct 27, 2021 4:47:07 PM]
bluestang
Senior Cruncher
USA
Joined: Oct 1, 2010
Post Count: 274
Status: Offline
Re: Aborting Stuck Tasks

Yeah, I wasn't sure exactly what could be done on the server side. I guess I was thinking of the "time limit" on a WU, as you guys mentioned. Thanks for clarifying what I was trying to suggest.

@adriverhoef Thank you very much for that script! I will give it a go tomorrow when I get back to my machines.

Also, is that a script for Linux or Windows?
----------------------------------------
[Edit 1 times, last edit by bluestang at Oct 28, 2021 2:08:42 PM]
[Oct 28, 2021 12:29:05 AM]