| World Community Grid Forums
Thread Status: Active Total posts in this thread: 9
jjhc
Cruncher Joined: Jan 5, 2018 Post Count: 2 Status: Offline Project Badges:
|
Hello:
My first time in this forum, apologies if I break any rules...

I've been running this software for a few weeks now. So far so good, but there is one oddity I have noticed. I have a task (MCM1_0139548_9816_1) which appears, to my untrained eye, to have gone off on its own into la-la land. It has been sitting at 93.750% complete for several days, is running at high priority, elapsed time is 96+ hours, time remaining is 5+ hours and slowly increasing (by about 1 second every 15 seconds elapsed), and its deadline was about 20 hours ago.

There are no error messages that I can find, although to be fair, I'm not sure that I have looked everywhere. There's nothing obvious in the event log (although this is tens of thousands of lines long so I may have missed something: is there any way to search it?), nothing in the stdoutdae file, and all the stderr files are empty. This is on a Windows 7 machine, all patched up except for the most recent round of fixes (the Spectre/Meltdown ones). The Results web page just says 'No Reply' for this task.

So my question is: what, if anything, should I do? Let the task run until something closes it down automatically? Abort it myself? Something else? If there is any more useful data I can supply, please let me know what this might be.

Thanks for any advice. Sorry if this info is in some FAQ somewhere - I did look but found nothing useful there.

Jonathan
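On the question of searching the event log: a minimal sketch, assuming the client's log is the plain-text `stdoutdae.txt` file in the BOINC data directory. The `C:\ProgramData\BOINC` path and the `search_log` helper are assumptions for illustration, so adjust the path to your own install.

```python
from pathlib import Path
from typing import Iterator, Tuple


def search_log(log_path: Path, needle: str) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, line) for every log line containing `needle`.

    The match is case-insensitive so a task name can be typed loosely.
    """
    needle = needle.lower()
    # errors="replace" keeps the scan going if the log holds odd bytes.
    with log_path.open("r", encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if needle in line.lower():
                yield number, line.rstrip("\n")


if __name__ == "__main__":
    # Assumed location of the BOINC data directory on Windows.
    log_file = Path(r"C:\ProgramData\BOINC\stdoutdae.txt")
    if log_file.exists():
        for number, line in search_log(log_file, "MCM1_0139548_9816_1"):
            print(f"{number}: {line}")
```

Searching for the task name rather than for the word "error" tends to be more useful here, since it shows everything the client logged about that one task.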
||
|
|
pcwr
Ace Cruncher England Joined: Sep 17, 2005 Post Count: 10903 Status: Offline Project Badges:
|
Welcome.
Have you tried to restart the BOINC service or reboot the computer? This will make the WU continue from a known good point.
Patrick
||
|
|
jjhc
Cruncher Joined: Jan 5, 2018 Post Count: 2 Status: Offline Project Badges:
|
I have not tried restarting BOINC. How does one do that?
Rebooting the PC is a multi-hour task and one I tend to avoid until it becomes necessary. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hi,
I have started seeing processes like the following in a ps output ...

boinc 11246 0.0 0.0 142276 28 ? SN 2017 1:25 \_ ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_7.36_x86_64-pc-linux-gnu -SettingsFile MCM1_0138935_1855.txt -DatabaseFile dataset-curatedOvarian_EarlyLate_v1.0.txt
boinc 13200 0.0 0.0 126424 24 ? SN Jan01 0:33 \_ ../../projects/www.worldcommunitygrid.org/wcgrid_fahb_bedam_7.18_x86_64-pc-linux-gnu -seed 319670693 -trickle 0 -upload 0 -wcgval 10000

They seem to be sitting there not using any CPU; one of them started sometime last year, and they seem to be stuck. I have noticed this on multiple machines and I have just been killing the processes when I see them. I wonder if it's the same thing you're seeing?

It's not really a problem from my point of view, and it looks like BOINC grabs more work when this happens anyway (so I have 14 processes running on a 12-core box when 2 are stuck). Just thought I would say.

Matt.
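For spotting these on several machines, here is a rough sketch that scans `ps aux`-style output for World Community Grid processes reporting 0.0 %CPU. The `idle_wcg_tasks` helper, the `wcgrid_` marker and the third sample row are assumptions made for illustration; the first two rows are taken from the output above (the first one shortened). Since ps reports %CPU averaged over the life of the process, a 0.0 on a long-lived task is a hint worth checking rather than proof that it is hung.

```python
from typing import List, NamedTuple


class PsRow(NamedTuple):
    pid: int
    cpu_percent: float
    start: str
    command: str


def idle_wcg_tasks(ps_output: str, marker: str = "wcgrid_") -> List[PsRow]:
    """Return rows whose command contains `marker` and whose %CPU is 0.0.

    Assumes the `ps aux` column order:
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    """
    rows = []
    for line in ps_output.splitlines():
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # blank line, header, or something unexpected
        command = fields[10]
        if marker not in command:
            continue
        try:
            cpu = float(fields[2])
        except ValueError:
            continue
        if cpu == 0.0:
            rows.append(PsRow(int(fields[1]), cpu, fields[8], command))
    return rows


# Two rows from the post (first one shortened) plus one hypothetical busy row.
SAMPLE = r"""
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
boinc 11246 0.0 0.0 142276 28 ? SN 2017 1:25 \_ ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_7.36_x86_64-pc-linux-gnu -SettingsFile MCM1_0138935_1855.txt
boinc 13200 0.0 0.0 126424 24 ? SN Jan01 0:33 \_ ../../projects/www.worldcommunitygrid.org/wcgrid_fahb_bedam_7.18_x86_64-pc-linux-gnu -seed 319670693 -trickle 0 -upload 0 -wcgval 10000
boinc 20001 98.5 0.3 150000 40000 ? RN 10:02 12:34 \_ ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_7.36_x86_64-pc-linux-gnu -SettingsFile HYPOTHETICAL_TASK.txt
"""

if __name__ == "__main__":
    for row in idle_wcg_tasks(SAMPLE):
        print(row.pid, row.start, row.command.split()[1])
```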
||
|
|
hchc
Veteran Cruncher USA Joined: Aug 15, 2006 Post Count: 865 Status: Offline Project Badges:
|
"They seem to be sitting there not using any CPU, one of them started last year sometime and seem to be stuck. I have noticed this on multiple machines and I have just been killing the processes when I see them."
If they started last year, they're almost certainly expired. Just manually abort them from the BOINC client, as opposed to killing the processes.
[Edit 1 times, last edit by hchc at Feb 1, 2018 3:10:37 PM] |
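On a headless Linux box, aborting through the client can be done with `boinccmd`, whose `--task` option takes a project URL, a task name and one of `suspend`, `resume` or `abort`. A small sketch that only builds the command; the `task_command` helper and the project URL shown are assumptions (check `boinccmd --get_tasks` for the URL and task names your client actually uses).

```python
from typing import List

ALLOWED_OPS = ("suspend", "resume", "abort")


def task_command(project_url: str, task_name: str, op: str) -> List[str]:
    """Build the argv for `boinccmd --task URL TASK OP` without running it."""
    if op not in ALLOWED_OPS:
        raise ValueError(f"unsupported task operation: {op!r}")
    return ["boinccmd", "--task", project_url, task_name, op]


if __name__ == "__main__":
    # Assumed project URL; the task name is derived from the ps output above.
    argv = task_command(
        "http://www.worldcommunitygrid.org/", "MCM1_0138935_1855_0", "abort"
    )
    print(" ".join(argv))
    # To actually run it: subprocess.run(argv, check=True)
```

Going through the client this way lets it record the task as aborted, whereas a killed process leaves the task in the client's queue.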
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I have discovered that when tasks seem to lock up, restarting the computer sometimes helps. Otherwise I email World Community Grid with the problem tasks.
|
||
|
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 2173 Status: Offline Project Badges:
|
More or less frequently I get one of those tasks; I am pretty sure I posted about this before, with no reaction from the techs.
Restarting doesn't help, as those WUs usually have a (near) 0% last checkpoint. If they continue, they usually end up in a computation error. I have gotten used to aborting any of those tasks if they run more than 25% of the average runtime on that host. The sad thing, however, is that I do not have (easy) access to all the hosts I crunch on, hence I don't know how many of them get held up by those "ghost tasks". If one of those tasks "hangs", others from the same project keep processing fine; I have not been able to identify any commonality among those faulty WUs... Ralf |
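A runtime rule like this is easy to automate once you have elapsed times per task. The sketch below is one possible reading of it, flagging tasks that have run more than 25% over the host's average (`factor=1.25`); that interpretation, the `overdue_tasks` helper and the task names are all assumptions, and the factor is a parameter so a different cutoff can be used.

```python
from statistics import mean
from typing import Dict, List, Sequence


def overdue_tasks(
    elapsed_by_task: Dict[str, float],
    past_runtimes: Sequence[float],
    factor: float = 1.25,
) -> List[str]:
    """Return task names whose elapsed hours exceed factor * average runtime.

    `past_runtimes` holds runtimes (hours) of tasks that finished normally
    on this host; `factor` is the assumed cutoff multiplier.
    """
    if not past_runtimes:
        raise ValueError("need at least one completed runtime to average")
    limit = factor * mean(past_runtimes)
    return sorted(
        name for name, elapsed in elapsed_by_task.items() if elapsed > limit
    )


if __name__ == "__main__":
    # Hypothetical numbers: average of 4 h and 6 h is 5 h, so the limit is 6.25 h.
    running = {"task_a": 6.0, "task_b": 6.5, "task_c": 96.0}
    print(overdue_tasks(running, [4.0, 6.0]))
```

Sorting the result just keeps the output stable from run to run; it does not rank the tasks by how far over the limit they are.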
||
|
|
Chris Doran
Cruncher Joined: Apr 28, 2007 Post Count: 4 Status: Offline Project Badges:
|
It's a problem that's been around for a year or more. Unlike others, I've never seen another task start once I've got a runaway, so it's a complete block on crunching.
This is what I do: In the BOINC Manager Tasks tab right pane, left click on the runaway to highlight it, then on Suspend in the left pane. Another task will start shortly with Progress % clocking up. Wait a minute or so, then Resume the original task. It may start running immediately or may wait until the second task completes. I've read that to get this trick to work, you need to go to Tools\Computing preferences\disk and memory usage and uncheck "Leave applications in memory whilst suspended", but don't know whether this is really necessary. You need to check for "runaway" tasks every few hours, at least daily. |
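For machines without the Manager open, the same suspend, wait, resume sequence can be sketched with `boinccmd --task URL TASK suspend|resume`. The `suspend_then_resume` helper, the 60-second default pause and the injectable `run`/`sleep` hooks are assumptions for illustration; whether the trick frees the runaway task is, as above, not guaranteed.

```python
import subprocess
import time
from typing import Callable, List, Optional


def suspend_then_resume(
    project_url: str,
    task_name: str,
    pause_seconds: float = 60.0,
    run: Optional[Callable[[List[str]], None]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[str]]:
    """Suspend a task, wait, then resume it; return the commands issued."""
    if run is None:
        # Default runner: call boinccmd and raise if it exits non-zero.
        def run(argv: List[str]) -> None:
            subprocess.run(argv, check=True)

    issued = []
    for op in ("suspend", "resume"):
        argv = ["boinccmd", "--task", project_url, task_name, op]
        run(argv)
        issued.append(argv)
        if op == "suspend":
            # Give the client time to start another task before resuming.
            sleep(pause_seconds)
    return issued
```

Passing a custom `run` that only prints the command is a safe way to do a dry run before letting it touch a live client.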
||
|
|
GaryWorster
Cruncher United States Joined: Apr 16, 2013 Post Count: 7 Status: Offline Project Badges:
|
I've been having runaway tasks lately as well. Frequently mine will get to 100%, but then just sit there with the Elapsed Time continuing to count up and the Remaining Time blank. Right now I have 7 runaway tasks continuing to increase the Elapsed Time with no Remaining Time left, all preventing new tasks from starting.
Suspending/Resuming hasn't seemed to help in the past, but maybe I haven't given it enough chance to recycle itself. I'll try unchecking the "Leave applications in memory whilst suspended" option as well. Thanks for the tip, Chris Doran. |
||
|
|
|