| World Community Grid Forums
Thread Status: Active Total posts in this thread: 18
hchc
Veteran Cruncher USA Joined: Aug 15, 2006 Post Count: 865 Status: Offline Project Badges:
|
I've had a few ARP1 tasks fail to start (typically when babysitting and suspending/unsuspending) and get stuck with "--" elapsed time. Restarting the BOINC service resolved the issue. Rebooting would also do the trick. Worst case just abort that stuck task. Have you guys tried restarting the BOINC service or BOINC altogether? That's worked for me with stuck ARP1 tasks.
|
||
|
|
PowerFactor
Ace Cruncher Joined: Dec 9, 2016 Post Count: 4033 Status: Offline Project Badges:
|
"Have you guys tried restarting the BOINC service or BOINC altogether? That's worked for me with stuck ARP1 tasks."
It looks like you guys are more creative than I am. I just abort these 0% stuck ARP tasks.
[Edit 1 times, last edit by thepeacemaker7 at Feb 17, 2020 2:09:25 AM] |
||
|
|
BladeD
Ace Cruncher USA Joined: Nov 17, 2004 Post Count: 28976 Status: Offline Project Badges:
|
"Have you guys tried restarting the BOINC service or BOINC altogether? That's worked for me with stuck ARP1 tasks."
"It looks like you guys are more creative than I am. I just abort these 0% stuck ARP tasks."
Yep! |
||
|
|
Seoulpowergrid
Veteran Cruncher Joined: Apr 12, 2013 Post Count: 823 Status: Offline Project Badges:
|
For me it seems to happen when my computer is sluggish (2~3 GB of RAM free) and needs a reboot: I'll see an ARP task running but with no elapsed time. A system reboot frees up the Win 7 box (yes, I need to upgrade; my Win 8(?) boxes don't seem to have this issue, just my 7) and the WU runs fine. The only issue is that I can't babysit the program often, so I sometimes notice it isn't really running only when there are less than 24 hours until the deadline, in which case I just need to abort it. Yes, I could still run it, but it would be given to someone else before I could complete it, and then two machines would be doing the crunching when one would be sufficient.
As there is no elapsed time, I don't think there is anything in the logs. Edit: On this machine the WU ARP1_0019868_003_1 is going fine, but ARP1_0017829-003_0 has no elapsed time. The WU ARP1_0019868_003_1 was going until I got some FAH2, then it paused and ARP1_0017829-003_0 tried to start but got no progress.
[Edit 1 times, last edit by Seoulpowergrid at Feb 19, 2020 5:50:02 AM] |
||
|
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2346 Status: Offline Project Badges:
|
"On this machine the WU ARP1_0019868_003_1 is going fine, but ARP1_0017829-003_0 has no elapsed time. The WU ARP1_0019868_003_1 was going until I got some FAH2, then it paused and ARP1_0017829-003_0 tried to start but got no progress."
The machine received a FAH2 task, then ARP1_0019868_003_1 was paused and the FAH2 task was started because of its 1-day deadline, is that right? If so, I'm wondering why ARP1_0017829-003_0 was trying to start. |
||
|
|
Seoulpowergrid
Veteran Cruncher Joined: Apr 12, 2013 Post Count: 823 Status: Offline Project Badges:
|
I added that info to show that the issue wasn't with all ARP WUs on this machine, just some of them. By naming the specific WUs, I hoped to help one of the admins better understand the problem.
To rephrase my previous statement: I was away from that computer for at least 36 hours, so I need to guess a little to fill in the gaps. At least one of the two ARP WUs was running and getting proper elapsed time. Then 8 FA@H tasks came in with the 24-hour deadline, so FA@H jumped the queue and everything else stopped, as this machine has 8 threads. As the FA@H tasks finished, they freed up slots for other WCG WUs to resume running. The ARP that was running fine before the FA@H tasks (ARP1_0019868_003_1) continued to run and finished without problem. The other ARP (ARP1_0017829-003_0) started at some point, I'm unsure when; with no elapsed time it could have been running for a few days for all I know, and there isn't a clear way for me to tell without the log files. The lack of elapsed time, and what was by then a short deadline, meant that even if I restarted my machine and got the WU to start correctly, there wasn't enough time for it to finish before it would be sent to a new wingman. |
||
|
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2346 Status: Offline Project Badges:
|
That clarifies the case, Seoulpowergrid, thanks!
Which reminds me: knowing the name of the workunit that kept running without elapsed time and knowing where to find the information, we can grep it! But what if we don't know the name(s) of the workunit(s) anymore? Still, there is a way …
[root]# grep aborted /var/log/messages-20200216 | sed 's/.*]://'
There are the names of the tasks! Ouch! That's 4 aborts. Still, we can check the names now …
[root]# for aborted in ARP1_0021515_002_2 ARP1_0032763_002_0 ARP1_0002259_003_0 ARP1_0031879_002_1;
Now we can focus on one task:
# less -j9 +/ARP1_0031879_002_1 /var/log/messages-20200216
(It didn't make me much wiser, but at least now we know when the tasks were started.) |
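The grep approach above could be wrapped up roughly as follows. This is only a sketch under assumptions: the function names `aborted_tasks` and `first_mentions` are made up for illustration, the task-name pattern is inferred from the names quoted in this thread, and it relies on the `-o` and `-m` options found in GNU and BSD grep. The actual log format on any given machine may differ.

```shell
#!/bin/sh
# Hypothetical helper: print the unique ARP1 task names found on log lines
# that mention "aborted". The name pattern is an assumption based on the
# task names quoted in this thread, not on any BOINC specification.
aborted_tasks() {
    logfile=$1
    grep 'aborted' "$logfile" \
        | grep -oE 'ARP1_[0-9]+[_-][0-9]+_[0-9]+' \
        | sort -u
}

# Hypothetical helper: for each aborted task, print the first log line that
# mentions it, which gives a rough idea of when the task first showed up.
first_mentions() {
    logfile=$1
    for task in $(aborted_tasks "$logfile"); do
        # -F treats the task name as a fixed string, -m1 stops at the first hit
        grep -m1 -F -- "$task" "$logfile"
    done
}
```

With these defined, something like `first_mentions /var/log/messages-20200216` would list one line per aborted task, assuming the client's messages end up in that file as they do in the post above.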
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Since the 'Event' log is stored on disk (stderrdae.txt, stdoutdae.txt, and stdoutdae.old for the client), maybe you can look in there to reconstruct what happened in terms of starting and suspending. Mine, set to very verbose and 3000 lines, has all event activity back to January 28, 2020.
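A minimal sketch of that kind of lookup, assuming the three files named above sit together in the client's data directory. The function name `task_history` is made up for illustration, and the directory is passed as an argument because its location varies by platform and install.

```shell
#!/bin/sh
# Hypothetical helper: print every line that mentions a task in the client's
# on-disk event logs, older file first, each line prefixed with its file name.
task_history() {
    task=$1
    datadir=$2
    for f in stdoutdae.old stdoutdae.txt stderrdae.txt; do
        # Skip files that do not exist (e.g. no rotated log yet)
        if [ -f "$datadir/$f" ]; then
            grep -F -- "$task" "$datadir/$f" | sed "s|^|$f: |"
        fi
    done
}
```

It would be called as `task_history ARP1_0017829-003_0 /path/to/boinc/data`, with the second argument replaced by wherever those files live on the machine in question. How much history comes back depends on the log size and verbosity settings of the client.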
|
||
|
|
|