Thread Status: Active
Total posts in this thread: 18
This topic has been viewed 4436 times and has 17 replies
hchc
Veteran Cruncher
USA
Joined: Aug 15, 2006
Post Count: 865
Status: Offline
Re: WU status running....but nothing!

I've had a few ARP1 tasks fail to start (typically when babysitting and suspending/unsuspending) and get stuck with "--" elapsed time. Restarting the BOINC service resolved the issue. Rebooting would also do the trick. Worst case just abort that stuck task.

Have you guys tried restarting the BOINC service or BOINC altogether? That's worked for me with stuck ARP1 tasks.
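For anyone on the command line, a rough way to spot such tasks is to filter the output of `boinccmd --get_tasks`. This is only a sketch: `stuck_tasks` is a made-up helper name, and it assumes the output labels each task with `name:`, `active_task_state:` and `fraction done:` lines, which may differ between client versions. A task that has only just started also reports zero progress, so treat the result as a list of candidates to check, not a verdict.

```shell
# stuck_tasks: read `boinccmd --get_tasks` style output on stdin and print
# the name of every task that is marked EXECUTING but shows zero progress.
# The field labels matched here are assumptions about the output format.
stuck_tasks() {
    awk '
        # A new task block begins at its "name:" line ("WU name:" is skipped
        # because the pattern requires "name:" right after the indentation).
        /^ *name: /              { name = $2; state = "" }
        /^ *active_task_state: / { state = $2 }
        /^ *fraction done: /     { if (state == "EXECUTING" && $3 + 0 == 0) print name }
    '
}
```

Typical use would be `boinccmd --get_tasks | stuck_tasks`, followed by a restart of the BOINC service or an abort of the listed task, as described above.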
----------------------------------------
  • i5-7500 (Kaby Lake, 4C/4T) @ 3.4 GHz
  • i5-4590 (Haswell, 4C/4T) @ 3.3 GHz
  • i5-3570 (Ivy Bridge, 4C/4T) @ 3.4 GHz

[Feb 17, 2020 1:33:29 AM]
PowerFactor
Ace Cruncher
Joined: Dec 9, 2016
Post Count: 4033
Status: Offline
Re: WU status running....but nothing!


Quote (hchc): Have you guys tried restarting the BOINC service or BOINC altogether? That's worked for me with stuck ARP1 tasks.


It looks like you guys are more creative than I am. I just abort these 0% stuck ARP tasks.
----------------------------------------
[Edit 1 times, last edit by thepeacemaker7 at Feb 17, 2020 2:09:25 AM]
[Feb 17, 2020 2:08:31 AM]
BladeD
Ace Cruncher
USA
Joined: Nov 17, 2004
Post Count: 28976
Status: Offline
Re: WU status running....but nothing!


Quote (hchc): Have you guys tried restarting the BOINC service or BOINC altogether? That's worked for me with stuck ARP1 tasks.

Quote (PowerFactor): It looks like you guys are more creative than I am. I just abort these 0% stuck ARP tasks.

Yep!
----------------------------------------
[Feb 17, 2020 11:48:09 AM]
Seoulpowergrid
Veteran Cruncher
Joined: Apr 12, 2013
Post Count: 823
Status: Offline
Re: WU status running....but nothing!

For me it seems to happen when my computer is sluggish (2~3 gigs of RAM free) and needs a reboot: I'll see an ARP running but with no elapsed time. A system reboot frees up the Win 7 box (yes, I need to upgrade; my Win 8(?) boxes don't seem to have this issue, just my 7) and the WU runs fine. The only issue is that I can't babysit the program often, so I sometimes notice it isn't really running when there are less than 24 hours until the deadline, in which case I just need to abort it. Yes, I could still run it, but it would be given to someone else before I could complete it, and then two machines would be doing the crunching when one would be sufficient.

As there is no elapsed time, I don't think there is anything in the logs.

Edit: On this machine the WU ARP1_0019868_003_1 is going fine, but ARP1_0017829-003_0 has no elapsed time. The WU ARP1_0019868_003_1 was going until I got some FAH2, then it paused and ARP1_0017829-003_0 tried to start but got no progress.
----------------------------------------

[Edit 1 times, last edit by Seoulpowergrid at Feb 19, 2020 5:50:02 AM]
[Feb 19, 2020 5:47:22 AM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2346
Status: Offline
Re: WU status running....but nothing!

Quote (Seoulpowergrid): On this machine the WU ARP1_0019868_003_1 is going fine, but ARP1_0017829-003_0 has no elapsed time. The WU ARP1_0019868_003_1 was going until I got some FAH2, then it paused and ARP1_0017829-003_0 tried to start but got no progress.

The machine received a FAH2 task, then ARP1_0019868_003_1 was paused and the FAH2 task was started because of its 1-day deadline, wasn't it? So I'm wondering why ARP1_0017829-003_0 was trying to start.
[Feb 19, 2020 10:16:31 AM]
Seoulpowergrid
Veteran Cruncher
Joined: Apr 12, 2013
Post Count: 823
Status: Offline
Re: WU status running....but nothing!

I added that info to show that the issue wasn't with all ARP WUs on this machine, only with some of them. By naming the specific WUs, maybe I can help one of the admins better understand the problem.

To rephrase my previous statement: I was away from that computer for at least 36 hours, so I need to guess a little to fill in the gaps. At least one of the two ARP WUs was running and getting proper elapsed time. Then 8 FA@H tasks came in with the 24-hour deadline, so FA@H jumped the queue and everything else stopped, as this machine has 8 threads. As the FA@H tasks finished, they freed up slots for other WCG WUs to resume running.

The ARP that was running fine before the FA@H tasks arrived (ARP1_0019868_003_1) continued to run and finished without a problem. The other ARP (ARP1_0017829-003_0) started at some point, but I'm unsure when. Without elapsed time it could have been "running" for a few days for all I know, and there isn't a clear way for me to tell without the log files. The lack of elapsed time, combined with what was by then a short deadline, meant that even if I restarted my machine and got the WU to start correctly, there wasn't enough time for it to finish before it would be sent to a new wingman.
----------------------------------------

[Feb 20, 2020 2:33:36 AM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2346
Status: Offline
Re: WU status running....but nothing!

That clarifies the case, Seoulpowergrid, thanks!

Which reminds me: if we know the name of the workunit that kept running without elapsed time, and we know where to find the information, we can grep it!
But what if we don't know the name(s) of the workunit(s) anymore? Still, there is a way …
[root]# grep aborted /var/log/messages-20200216 | sed 's/.*]://' 
10-Feb-2020 22:52:07 [World Community Grid] task ARP1_0021515_002_2 aborted by user
10-Feb-2020 22:52:07 [World Community Grid] task ARP1_0032763_002_0 aborted by user
10-Feb-2020 23:14:39 [World Community Grid] task ARP1_0002259_003_0 aborted by user
10-Feb-2020 23:14:39 [World Community Grid] task ARP1_0031879_002_1 aborted by user

There are the names of the tasks!

Ouch! That's 4 aborts. Still, we can check the names now …
[root]# for aborted in ARP1_0021515_002_2 ARP1_0032763_002_0 ARP1_0002259_003_0 ARP1_0031879_002_1
do
    grep "$aborted" /var/log/messages-20200209 | sed 's/.*]://'
done | grep Starting

05-Feb-2020 15:10:24 [World Community Grid] Starting task ARP1_0021515_002_2
06-Feb-2020 16:16:27 [World Community Grid] Starting task ARP1_0032763_002_0
[root]# for aborted in ARP1_0021515_002_2 ARP1_0032763_002_0 ARP1_0002259_003_0 ARP1_0031879_002_1
do
    grep "$aborted" /var/log/messages-20200216 | sed 's/.*]://'
done | grep Starting

10-Feb-2020 03:44:00 [World Community Grid] Starting task ARP1_0002259_003_0
10-Feb-2020 03:44:00 [World Community Grid] Starting task ARP1_0031879_002_1

Now we can focus on one task:
[root]# less -j9 +/ARP1_0031879_002_1 /var/log/messages-20200216
(It didn't make me much wiser, but at least now we know when the tasks were started.)
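The two steps above (find the aborted names, then find when each was started) could be rolled into a pair of small functions. This is a sketch only: `list_aborted` and `start_lines` are hypothetical names, and it assumes the client messages end up in syslog-style files where everything before the BOINC timestamp ends in `]:`, as the `sed` expression above implies. Passing several rotated log files at once avoids guessing which file holds the start message.

```shell
# list_aborted FILE... : print the unique names of tasks that the log marks
# as "aborted by user".
list_aborted() {
    grep -h 'aborted by user' "$@" |
        sed -n 's/.* task \([^ ]*\) aborted by user.*/\1/p' |
        sort -u
}

# start_lines FILE... : for each aborted task, print its "Starting task"
# line with the syslog prefix (everything up to the last "]:") removed.
start_lines() {
    list_aborted "$@" | while IFS= read -r task; do
        grep -h -F "Starting task $task" "$@" | sed 's/.*]: *//'
    done
}
```

For example: `start_lines /var/log/messages-20200209 /var/log/messages-20200216`.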
[Feb 21, 2020 4:29:30 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: WU status running....but nothing!

The 'Event' log is stored on disk, in stderrdae.txt, stdoutdae.txt and stdoutdae.old for the client. Mine is set to very verbose and 3000 lines, so it holds all event activity back to January 28, 2020. Maybe you can look in there to reconstruct what happened in terms of starting and suspending.
[Feb 21, 2020 4:58:09 PM]
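As a rough sketch of that lookup, the function below pulls every line that mentions one task out of the client's log files. `task_history` is a hypothetical name, and the data directory is passed in because its location depends on the installation (commonly /var/lib/boinc-client on Debian-style Linux packages or C:\ProgramData\BOINC on Windows, but check your own setup). It also assumes stdoutdae.old is the older, rotated copy of stdoutdae.txt.

```shell
# task_history DIR TASK : print every line mentioning TASK from the client
# log files in DIR, reading the rotated (older) file first.
# Files that do not exist are skipped silently.
task_history() {
    dir=$1
    task=$2
    for f in stdoutdae.old stdoutdae.txt; do
        [ -f "$dir/$f" ] && grep -F -- "$task" "$dir/$f"
    done
    return 0
}
```

For example: `task_history /var/lib/boinc-client ARP1_0017829-003_0`.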