Thread Status: Active
Total posts in this thread: 18
This topic has been viewed 4428 times and has 17 replies
BladeD
Ace Cruncher
USA
Joined: Nov 17, 2004
Post Count: 28976
Status: Offline
WU status running....but nothing!

The status says Running, but nothing is happening, i.e. the Elapsed time counter doesn't start!

Application: Africa Rainfall Project 7.27
Name: ARP1_0025618_002
State: Running
Received: 2/4/2020 2:20:47 AM
Report deadline: 2/11/2020 2:19:25 AM
Estimated computation size: 252,041 GFLOPs
CPU time: ---
CPU time since checkpoint: ---
Elapsed time: ---
Estimated time remaining: 19:18:45
Fraction done: 0.000%
Virtual memory size: 0 bytes
Working set size: 0 bytes
Directory: slots/20
Executable: wcgrid_arp1_wrf_7.27_windows_x86_64
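
The symptom in the properties above (state Running, no elapsed time, 0.000% done, 0 bytes of memory) can be written down as a simple check. This is only a rough sketch: it assumes the properties have been copied by hand into a dict, and the function name looks_stuck and the dict keys are made up for illustration, not part of BOINC.

```python
def looks_stuck(props):
    """Return True when a task reports Running but shows no sign of work."""
    # "---" (or a missing value) means the elapsed-time counter never started.
    no_time = props.get("Elapsed time", "---").strip("- ") == ""
    # "0.000%" -> 0.0
    no_progress = float(props.get("Fraction done", "0%").rstrip("%")) == 0.0
    # "0 bytes" -> "0"
    no_memory = props.get("Working set size", "0 bytes").split()[0] == "0"
    return props.get("State") == "Running" and no_time and no_progress and no_memory


# Values taken from the task properties posted above.
task = {
    "State": "Running",
    "Elapsed time": "---",
    "Fraction done": "0.000%",
    "Working set size": "0 bytes",
}
print(looks_stuck(task))  # prints True
```

A task that has only just been started may briefly look the same, so a check like this probably only means something after the task has been "running" for a while.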

----------------------------------------
[Feb 6, 2020 9:35:54 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: WU status running....but nothing!

I had that for months after a boot: jobs restarting but not using CPU time. They crashed after 30 seconds, though, one after another, which is some BOINC mechanism. A reinstall fixed the problem until the next boot, so I suspect permissions somehow broke at that time. I've now added myself explicitly to the boinc:project/boinc_master user groups, see https://www.tenforums.com/tutorials/88049-add...-groups-windows-10-a.html , JIC. For the last 2 monthly Windows cycles it has not reoccurred.
----------------------------------------
[Edit 1 times, last edit by Former Member at Feb 6, 2020 11:03:20 AM]
[Feb 6, 2020 10:57:06 AM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12594
Status: Offline
Re: WU status running....but nothing!

Rather than spending more time on a previously registered problem, I would abort the unit as you have not spent any time on it. Someone else will pick it up and probably not have the problem. Your wingman will probably be wondering when yours will finish.

Mike
[Feb 6, 2020 3:19:13 PM]
BladeD
Ace Cruncher
USA
Joined: Nov 17, 2004
Post Count: 28976
Status: Offline
Re: WU status running....but nothing!

Rather than spending more time on a previously registered problem, I would abort the unit as you have not spent any time on it. Someone else will pick it up and probably not have the problem. Your wingman will probably be wondering when yours will finish.

Mike

Yep, that's what I did.
----------------------------------------
[Feb 7, 2020 7:33:52 AM]
hchc
Veteran Cruncher
USA
Joined: Aug 15, 2006
Post Count: 865
Status: Offline
Re: WU status running....but nothing!

I've had a few ARP1 tasks fail to start (typically when babysitting and suspending/unsuspending) and get stuck with "--" elapsed time. Restarting the BOINC service resolved the issue. Rebooting would also do the trick. Worst case just abort that stuck task.
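
For anyone who prefers the command line to BOINC Manager for that last resort, here is a minimal sketch of aborting one task through boinccmd, using its `--task URL task_name abort` form. The helper names abort_command and abort_task are made up for illustration, and the project URL and task name are whatever `boinccmd --get_tasks` reports on your machine.

```python
import subprocess


def abort_command(project_url, task_name):
    """Build the boinccmd argument list that aborts one task."""
    return ["boinccmd", "--task", project_url, task_name, "abort"]


def abort_task(project_url, task_name):
    """Run the abort. Assumes boinccmd is on PATH and may talk to the local client."""
    subprocess.run(abort_command(project_url, task_name), check=True)
```

Nothing is executed until abort_task is called, so abort_command can be used first to print the command and check it by eye.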
----------------------------------------
  • i5-7500 (Kaby Lake, 4C/4T) @ 3.4 GHz
  • i5-4590 (Haswell, 4C/4T) @ 3.3 GHz
  • i5-3570 (Ivy Bridge, 4C/4T) @ 3.4 GHz

[Feb 7, 2020 9:03:24 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12594
Status: Offline
Re: WU status running....but nothing!

It has been tempting to hang on to problem units because of the lack of supply, but now that we have had over 2,000 returned in half a day, it is more feasible to abort problem units.

Mike
[Feb 7, 2020 10:52:30 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: WU status running....but nothing!

Just had two of these over the past few days. The task shows as running in the BOINC client but does not show up in ps -ef | grep wcgrid, which would indicate that BOINC didn't start the process. BOINC says it is in slot 28. After 2 hours of running, the slot directory looks like this:

-rw-r--r-- 1 boinc boinc 106 Feb 10 08:36 wcgrid_arp1_wrf_7.27_x86_64-pc-linux-gnu
-rw-r--r-- 1 boinc boinc 106 Feb 10 08:36 graphics_app
-rw-r--r-- 1 boinc boinc 87 Feb 10 08:36 arp1_background.tga
-rw-r--r-- 1 boinc boinc 22986 Feb 10 08:36 VEGPARM.TBL
-rw-r--r-- 1 boinc boinc 4399 Feb 10 08:36 SOILPARM.TBL
-rw-r--r-- 1 boinc boinc 1334 Feb 10 08:36 my_file_d01.txt
-rw-r--r-- 1 boinc boinc 29820 Feb 10 08:36 LANDUSE.TBL
-rw-r--r-- 1 boinc boinc 261 Feb 10 08:36 GENPARM.TBL
-rw-r--r-- 1 boinc boinc 87 Feb 10 08:36 Courier.txf
-rw-r--r-- 1 boinc boinc 87 Feb 10 08:36 Courier-Bold.txf
-rw-r--r-- 1 boinc boinc 87 Feb 10 08:36 arp1_wcg.tga
-rw-r--r-- 1 boinc boinc 87 Feb 10 08:36 arp1_twc.tga
-rw-r--r-- 1 boinc boinc 87 Feb 10 08:36 arp1_ibm.tga
-rw-r--r-- 1 boinc boinc 87 Feb 10 08:36 arp1_desc.tga
-rw-r--r-- 1 boinc boinc 87 Feb 10 08:36 arp1_delft.tga
-rw-r--r-- 1 boinc boinc 87 Feb 10 08:36 arp1_boinc.tga
-rw-r--r-- 1 boinc boinc 847552 Feb 10 08:36 RRTMG_LW_DATA
-rw-r--r-- 1 boinc boinc 749248 Feb 10 08:36 RRTM_DATA
-rw-r--r-- 1 boinc boinc 44501 Feb 10 08:36 MPTABLE.TBL
-rw-r--r-- 1 boinc boinc 8550 Feb 10 08:36 URBPARM_UZE.TBL
-rw-r--r-- 1 boinc boinc 11188 Feb 10 08:36 URBPARM.TBL
-rw-r--r-- 1 boinc boinc 680368 Feb 10 08:36 RRTMG_SW_DATA
-rw-r--r-- 1 boinc boinc 708 Feb 10 08:36 ozone_plev.formatted
-rw-r--r-- 1 boinc boinc 536 Feb 10 08:36 ozone_lat.formatted
-rw-r--r-- 1 boinc boinc 543744 Feb 10 08:36 ozone.formatted
-rw-r--r-- 1 boinc boinc 1785 Feb 10 08:36 namelist.input
-rw-r--r-- 1 boinc boinc 49385088 Feb 10 08:36 wrfbdy_d01
-rw-r--r-- 1 boinc boinc 659852 Feb 10 08:36 wrflowinp_d02
-rw-r--r-- 1 boinc boinc 659852 Feb 10 08:36 wrflowinp_d01
-rw-r--r-- 1 boinc boinc 659852 Feb 10 08:36 wrflowinp_d03
-rw-r--r-- 1 boinc boinc 17514256 Feb 10 08:36 wrfrst_d01.7z
-rw-r--r-- 1 boinc boinc 15502599 Feb 10 08:36 wrfrst_d02.7z
-rw-r--r-- 1 boinc boinc 14338586 Feb 10 08:36 wrfrst_d03.7z
-rw-r--r-- 1 boinc boinc 6319 Feb 10 08:36 init_data.xml
-rw-r--r-- 1 boinc boinc 12485444 Feb 10 08:36 wrfinput_d01

No stdout.txt is seen. Looks like the slot directory got loaded but execution was never transferred to the executable.
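
The two checks described above (no matching process, no stdout.txt in the slot) could be scripted roughly like this. It is only a sketch for Linux: it reads /proc instead of calling ps, the function names are made up, and the slot path /var/lib/boinc-client/slots/28 is an assumption about where the data directory lives, so adjust it to your install.

```python
from pathlib import Path


def find_processes(pattern, proc_root="/proc"):
    """Return PIDs whose command line contains `pattern` (Linux /proc layout)."""
    root = Path(proc_root)
    if not root.is_dir():
        return []  # not a Linux-style /proc, nothing to scan
    pids = []
    for entry in root.iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as "self"
        try:
            cmdline = (entry / "cmdline").read_bytes().replace(b"\0", b" ")
        except OSError:
            continue  # process exited or is not readable
        if pattern.encode() in cmdline:
            pids.append(int(entry.name))
    return sorted(pids)


def slot_has_output(slot_dir):
    """True if stdout.txt exists in the given slot directory."""
    return (Path(slot_dir) / "stdout.txt").is_file()


# Assumed path; the slot number is the one BOINC reports for the task.
print("matching PIDs:", find_processes("wcgrid"))
print("stdout.txt present:", slot_has_output("/var/lib/boinc-client/slots/28"))
```

An empty PID list together with a missing stdout.txt would match what is described above: the slot was populated but the executable never started.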

UPDATE: I turned off LAIM (leave applications in memory), suspended the task, then resumed it. It didn't make any difference. I shut down the client and restarted it, and the task started executing in the same slot (28). I've seen this on two machines, both with the 7.16.3 BOINC client from the costamagnagianfranco PPA on Ubuntu 19.10. I've seen a lot of segmentation faults on Ubuntu 19.10 with the 5.3 kernel and glibc 2.30. However, I have also noticed that the 7.16.3 BOINC client from the PPA doesn't honor preferences. So I don't know if this problem is related to ARP, the 7.16.3 client, or Ubuntu 19.10. I have not seen any segmentation faults on Ubuntu 19.04 but have seen client problems with 7.16.3 on 19.04. I'm in the process of migrating to CentOS 8 with the 7.16.1 client.
----------------------------------------
[Edit 2 times, last edit by Doneske at Feb 13, 2020 5:06:11 PM]
[Feb 13, 2020 4:52:02 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2346
Status: Offline
Re: WU status running....but nothing!

Just had two of these over the past few days. The task shows as running in the BOINC client but does not show up in ps -ef | grep wcgrid, which would indicate that BOINC didn't start the process. BOINC says it is in slot 28. After 2 hours of running, the slot directory looks like this:
<snip-snip-snip>
-rw-r--r-- 1 boinc boinc 17514256 Feb 10 08:36 wrfrst_d01.7z
-rw-r--r-- 1 boinc boinc 15502599 Feb 10 08:36 wrfrst_d02.7z
-rw-r--r-- 1 boinc boinc 14338586 Feb 10 08:36 wrfrst_d03.7z
<snip-snip-snip>

Did you also notice 0.000% of progress, Doneske?

I'm thinking of this possible scenario:
As soon as an ARP1 unit starts, its files are unpacked in the slots directory, and those three files should have been unpacked as well. They were not, which probably means that something got stuck and BOINC thinks the ARP1 unit is running smoothly while it is not. How can this happen? Maybe something else got in the way, pausing the ARP1 unit while it was still unpacking files and thereby interrupting the unpacking process. Then, sometime later, the ARP1 unit gets instructions to continue, but the unpacking doesn't resume, so the unit keeps sitting there, waiting for an unpacking step that never finishes because the unpacking process isn't running (anymore).

Again, this is all pure speculation.
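
If someone wants to test this guess on a stuck task before aborting it, a rough sketch like the one below would list which archives in the slot still lack an unpacked counterpart. It rests on an assumption I cannot confirm: that wrfrst_d01.7z unpacks to a file called wrfrst_d01 in the same directory. The function name unpacked_status is made up.

```python
from pathlib import Path


def unpacked_status(slot_dir):
    """Map each .7z archive in slot_dir to whether its unpacked file exists.

    Assumption: an archive NAME.7z unpacks to a file called NAME
    in the same slot directory.
    """
    slot = Path(slot_dir)
    return {
        archive.name: (slot / archive.stem).exists()
        for archive in sorted(slot.glob("*.7z"))
    }
```

Called on the slot of a stuck task, e.g. unpacked_status("slots/28") from inside the BOINC data directory, a result where every value is False would fit the scenario above, although it would not prove it.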
[Feb 14, 2020 11:54:38 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: WU status running....but nothing!

Yes, it was sitting at 0.000%.

I was trying to provide as much diagnostic information as I could before resetting or restarting the work unit. It's up to the techs at this point but unless it happens a significant number of times, they will just accept the casualties....
[Feb 14, 2020 1:28:44 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2346
Status: Offline
Re: WU status running....but nothing!

Yes, it was sitting at 0.000%.

I was trying to provide as much diagnostic information as I could before resetting or restarting the work unit.

You did a good job there. I experienced the same thing a few days ago, but failed to take a look into the slots directory before aborting the tasks that got stuck at 0.000%.
[Feb 14, 2020 2:05:26 PM]