| World Community Grid Forums
Thread Status: Active Total posts in this thread: 18
|
|
| Author |
|
|
BladeD
Ace Cruncher USA Joined: Nov 17, 2004 Post Count: 28976 Status: Offline Project Badges:
|
The status says Running, but nothing is happening, i.e. the elapsed time counter doesn't start!
----------------------------------------
Application: Africa Rainfall Project 7.27
Name: ARP1_0025618_002
State: Running
Received: 2/4/2020 2:20:47 AM
Report deadline: 2/11/2020 2:19:25 AM
Estimated computation size: 252,041 GFLOPs
CPU time: ---
CPU time since checkpoint: ---
Elapsed time: ---
Estimated time remaining: 19:18:45
Fraction done: 0.000%
Virtual memory size: 0 bytes
Working set size: 0 bytes
Directory: slots/20
Executable: wcgrid_arp1_wrf_7.27_windows_x86_64 |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I had that for months after a boot: jobs would restart but use no CPU time. They then crashed after 30 seconds, one after another, which is some BOINC mechanism. A reinstall fixed the problem until the next boot, so I suspect permissions somehow broke at that time. I've now added myself explicitly to the boinc:project/boinc_master user groups, see https://www.tenforums.com/tutorials/88049-add...-groups-windows-10-a.html , just in case. Over the last 2 monthly Windows cycles it has not recurred.
----------------------------------------[Edit 1 times, last edit by Former Member at Feb 6, 2020 11:03:20 AM] |
||
|
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline Project Badges:
|
Rather than spending more time on a previously registered problem, I would abort the unit as you have not spent any time on it. Someone else will pick it up and probably not have the problem. Your wingman will probably be wondering when yours will finish.
Mike |
||
|
|
BladeD
Ace Cruncher USA Joined: Nov 17, 2004 Post Count: 28976 Status: Offline Project Badges:
|
"Rather than spending more time on a previously registered problem, I would abort the unit as you have not spent any time on it. Someone else will pick it up and probably not have the problem. Your wingman will probably be wondering when yours will finish.
Mike"
Yep, that's what I did. |
||
|
|
hchc
Veteran Cruncher USA Joined: Aug 15, 2006 Post Count: 865 Status: Offline Project Badges:
|
I've had a few ARP1 tasks fail to start (typically when babysitting and suspending/unsuspending) and get stuck with "--" elapsed time. Restarting the BOINC service resolved the issue. Rebooting would also do the trick. Worst case just abort that stuck task.
|
||
|
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline Project Badges:
|
It has been tempting to hang on to problem units because of the lack of supply, but now we have had over 2,000 returned in a half day, it is more feasible to abort problem units.
Mike |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Just had two of these over the past few days. Shows running in BOINC client but does not show up when doing a ps -ef | grep wcgrid which would indicate that BOINC didn't start the process. BOINC says it is in slot 28. After 2 hours of running the slot directory looks like this:
----------------------------------------
-rw-r--r-- 1 boinc boinc 106 Feb 10 08:36 wcgrid_arp1_wrf_7.27_x86_64-pc-linux-gnu
-rw-r--r-- 1 boinc boinc 106 Feb 10 08:36 graphics_app
-rw-r--r-- 1 boinc boinc 87 Feb 10 08:36 arp1_background.tga
-rw-r--r-- 1 boinc boinc 22986 Feb 10 08:36 VEGPARM.TBL
-rw-r--r-- 1 boinc boinc 4399 Feb 10 08:36 SOILPARM.TBL
-rw-r--r-- 1 boinc boinc 1334 Feb 10 08:36 my_file_d01.txt
-rw-r--r-- 1 boinc boinc 29820 Feb 10 08:36 LANDUSE.TBL
-rw-r--r-- 1 boinc boinc 261 Feb 10 08:36 GENPARM.TBL
-rw-r--r-- 1 boinc boinc 87 Feb 10 08:36 Courier.txf
-rw-r--r-- 1 boinc boinc 87 Feb 10 08:36 Courier-Bold.txf
-rw-r--r-- 1 boinc boinc 87 Feb 10 08:36 arp1_wcg.tga
-rw-r--r-- 1 boinc boinc 87 Feb 10 08:36 arp1_twc.tga
-rw-r--r-- 1 boinc boinc 87 Feb 10 08:36 arp1_ibm.tga
-rw-r--r-- 1 boinc boinc 87 Feb 10 08:36 arp1_desc.tga
-rw-r--r-- 1 boinc boinc 87 Feb 10 08:36 arp1_delft.tga
-rw-r--r-- 1 boinc boinc 87 Feb 10 08:36 arp1_boinc.tga
-rw-r--r-- 1 boinc boinc 847552 Feb 10 08:36 RRTMG_LW_DATA
-rw-r--r-- 1 boinc boinc 749248 Feb 10 08:36 RRTM_DATA
-rw-r--r-- 1 boinc boinc 44501 Feb 10 08:36 MPTABLE.TBL
-rw-r--r-- 1 boinc boinc 8550 Feb 10 08:36 URBPARM_UZE.TBL
-rw-r--r-- 1 boinc boinc 11188 Feb 10 08:36 URBPARM.TBL
-rw-r--r-- 1 boinc boinc 680368 Feb 10 08:36 RRTMG_SW_DATA
-rw-r--r-- 1 boinc boinc 708 Feb 10 08:36 ozone_plev.formatted
-rw-r--r-- 1 boinc boinc 536 Feb 10 08:36 ozone_lat.formatted
-rw-r--r-- 1 boinc boinc 543744 Feb 10 08:36 ozone.formatted
-rw-r--r-- 1 boinc boinc 1785 Feb 10 08:36 namelist.input
-rw-r--r-- 1 boinc boinc 49385088 Feb 10 08:36 wrfbdy_d01
-rw-r--r-- 1 boinc boinc 659852 Feb 10 08:36 wrflowinp_d02
-rw-r--r-- 1 boinc boinc 659852 Feb 10 08:36 wrflowinp_d01
-rw-r--r-- 1 boinc boinc 659852 Feb 10 08:36 wrflowinp_d03
-rw-r--r-- 1 boinc boinc 17514256 Feb 10 08:36 wrfrst_d01.7z
-rw-r--r-- 1 boinc boinc 15502599 Feb 10 08:36 wrfrst_d02.7z
-rw-r--r-- 1 boinc boinc 14338586 Feb 10 08:36 wrfrst_d03.7z
-rw-r--r-- 1 boinc boinc 6319 Feb 10 08:36 init_data.xml
-rw-r--r-- 1 boinc boinc 12485444 Feb 10 08:36 wrfinput_d01
No stdout.txt is seen. It looks like the slot directory got loaded but execution was never transferred to the executable.
UPDATE: I turned off LAIM and suspended the task, then resumed the task. It didn't make any difference. I shut down the client and restarted it, and the task started execution in the same slot (28).
I've seen this on two machines, both with the 7.16.3 BOINC client from the costamagnagianfranco PPA on Ubuntu 19.10. I've seen a lot of segmentation faults on Ubuntu 19.10 with the 5.3 kernel and glibc 2.30. However, I have also noticed that the 7.16.3 BOINC client from the PPA doesn't honor preferences. So I don't know if this problem is related to ARP, the 7.16.3 client, or Ubuntu 19.10. I have not seen any segmentation faults on Ubuntu 19.04 but have seen client problems with 7.16.3 on 19.04. I'm in the process of migrating to CentOS 8 with the 7.16.1 client.
[Edit 2 times, last edit by Doneske at Feb 13, 2020 5:06:11 PM] |
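For anyone who wants to repeat the `ps -ef | grep wcgrid` check from a script, here is a minimal sketch in Python. It only parses the standard `ps -ef` column layout (UID, PID, PPID, C, STIME, TTY, TIME, CMD); the function names and the default search string `wcgrid` are illustrative choices, not part of BOINC or World Community Grid.

```python
import subprocess


def find_task_processes(ps_output, needle="wcgrid"):
    """Return (pid, command) pairs from `ps -ef` text whose command contains `needle`."""
    matches = []
    for line in ps_output.splitlines():
        # ps -ef prints 8 columns; the last one (CMD) may itself contain spaces.
        fields = line.split(None, 7)
        if len(fields) < 8 or fields[0] == "UID":
            continue  # skip the header row and anything malformed
        command = fields[7]
        # Ignore a grep that is itself searching for the needle.
        if needle in command and not command.startswith("grep "):
            matches.append((int(fields[1]), command))
    return matches


def running_task_processes(needle="wcgrid"):
    """Run `ps -ef` and report matching processes (Linux/Unix only)."""
    result = subprocess.run(
        ["ps", "-ef"], capture_output=True, text=True, check=True
    )
    return find_task_processes(result.stdout, needle)
```

If `running_task_processes()` returns an empty list while the BOINC Manager shows an ARP1 task as Running, that matches the symptom described above: the client believes the task is running, but no science application process exists.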
||
|
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2346 Status: Offline Project Badges:
|
"Just had two of these over the past few days. Shows running in BOINC client but does not show up when doing a ps -ef | grep wcgrid which would indicate that BOINC didn't start the process. BOINC says it is in slot 28. After 2 hours of running the slot directory looks like this:
<snip-snip-snip>
-rw-r--r-- 1 boinc boinc 17514256 Feb 10 08:36 wrfrst_d01.7z
<snip-snip-snip>"
Did you also notice 0.000% of progress, Doneske?
I'm thinking of this possible scenario: As soon as an ARP1 unit starts, its files will be unpacked in the slots directory and said three files should have been unpacked also. They are not. Which probably means that something got stuck and BOINC thinks that said ARP1 unit is smoothly running, while it is not.
How can this happen? Maybe something else got in the way, pausing the ARP1 unit while it was still unpacking files, thereby interrupting the unpacking process. Then, sometime later, the ARP1 unit gets instructions to continue, but the unpacking doesn't continue, so it keeps sitting there, waiting for the unpacking to finish, which doesn't happen, because the unpacking process isn't running (anymore).
Again, this is all pure speculation. |
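The scenario above suggests a quick check of a stuck slot directory: are there still `.7z` archives with no unpacked counterpart, and is `stdout.txt` missing? The sketch below does that in Python. It rests on an assumption that the thread does not confirm, namely that an archive such as `wrfrst_d01.7z` unpacks to a file named `wrfrst_d01` in the same directory. The function name is hypothetical.

```python
from pathlib import Path


def slot_unpack_status(slot_dir):
    """Summarise whether a BOINC slot directory looks like it never got going.

    ASSUMPTION: `name.7z` is expected to unpack to `name` in the same directory.
    """
    names = {entry.name for entry in Path(slot_dir).iterdir()}
    # Archives whose assumed unpacked counterpart is absent.
    pending = sorted(
        name
        for name in names
        if name.endswith(".7z") and name[: -len(".7z")] not in names
    )
    return {"pending_archives": pending, "has_stdout": "stdout.txt" in names}
```

It would be called with the path of the slot in question, for example the `slots/28` directory under the BOINC data directory (the exact location depends on the installation). A result listing all three `wrfrst_d0*.7z` archives as pending with `has_stdout` false would be consistent with the listing posted above, though it would not by itself prove why the unpacking stopped.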
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Yes, It was sitting at 0.000%.
I was trying to provide as much diagnostic information as I could before resetting or restarting the work unit. It's up to the techs at this point but unless it happens a significant number of times, they will just accept the casualties.... |
||
|
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2346 Status: Offline Project Badges:
|
"Yes, It was sitting at 0.000%.
I was trying to provide as much diagnostic information as I could before resetting or restarting the work unit."
You did a good job there. I experienced the same thing a few days ago, but failed to take a look into the slots directory before aborting the tasks that got stuck at 0.000%. |
||
|
|
|