gordonbb
Cruncher
Canada
Joined: May 14, 2019
Post Count: 19
Stuck Tasks

One of my systems has 3 OPN tasks stuck. The normal time to complete tasks for this system is estimated at 2:01:19.

HW Specs: AMD Ryzen 2600X; 16 GB DDR4-3200; GTX 1070 Ti; Ubuntu 18.04.5 LTS Desktop with current patches and the latest HWE stack.

I first saw this a few days ago and ended up aborting the tasks, as they just stayed at a fixed % complete and the load average on the system dropped from ~12 to 12-X, where X was the number of stuck tasks.

I noticed the same thing today. About two days ago we had a thunderstorm. Though this system is not on a UPS, it is connected to a surge suppressor on a UPS. Curiously, the systems on the load side of that UPS both rebooted, but this system did not (it shows an uptime of almost 5 days).

Here are the "Properties" of the three tasks that are currently stuck:
Application: OpenPandemics - COVID 19 7.17 
Name: OPN1_0057699_00361
State: Running
Received: 2021-07-23 7:46:05 PM
Report deadline: 2021-07-30 7:46:03 PM
Estimated computation size: 35,441 GFLOPs
CPU time: 01:11:00
CPU time since checkpoint: 00:04:00
Elapsed time: 2d 09:45:38
Estimated time remaining: 13:34:50
Fraction done: 80.964%
Virtual memory size: 193.34 MB
Working set size: 132.51 MB
Directory: slots/9
Process ID: 15646
Progress rate: 1.440% per hour
Executable: wcgrid_opn1_autodock_7.17_x86_64-pc-linux-gnu

Application: OpenPandemics - COVID 19 7.17
Name: OPN1_0057765_03282
State: Running
Received: 2021-07-24 2:18:09 PM
Report deadline: 2021-07-31 2:18:08 PM
Estimated computation size: 35,441 GFLOPs
CPU time: 01:32:47
CPU time since checkpoint: 00:00:58
Elapsed time: 1d 13:17:44
Estimated time remaining: 15:22:34
Fraction done: 70.808%
Virtual memory size: 162.94 MB
Working set size: 101.66 MB
Directory: slots/2
Process ID: 30603
Progress rate: 1.800% per hour
Executable: wcgrid_opn1_autodock_7.17_x86_64-pc-linux-gnu

Application: OpenPandemics - COVID 19 7.17
Name: OPN1_0057721_00322
State: Running
Received: 2021-07-24 1:36:05 AM
Report deadline: 2021-07-31 1:36:04 AM
Estimated computation size: 35,441 GFLOPs
CPU time: 01:00:35
CPU time since checkpoint: 00:01:31
Elapsed time: 2d 04:38:08
Estimated time remaining: 1d 18:02:30
Fraction done: 55.595%
Virtual memory size: 179.68 MB
Working set size: 118.68 MB
Directory: slots/10
Process ID: 14282
Progress rate: 1.080% per hour
Executable: wcgrid_opn1_autodock_7.17_x86_64-pc-linux-gnu
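
In all three tasks above the elapsed time is days while the CPU time is about an hour, which is the signature of a stall. A minimal sketch of flagging that from the "Properties" values, assuming the duration strings look like the ones shown here; the function names and both thresholds are hypothetical choices, not anything BOINC itself defines:

```python
import re


def parse_duration(text: str) -> int:
    """Convert 'HH:MM:SS' or 'Nd HH:MM:SS' (as shown in Properties) to seconds."""
    m = re.fullmatch(r"(?:(\d+)d\s+)?(\d+):(\d{2}):(\d{2})", text.strip())
    if m is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    days, hours, minutes, seconds = m.groups()
    return int(days or 0) * 86400 + int(hours) * 3600 + int(minutes) * 60 + int(seconds)


def looks_stuck(cpu_time: str, elapsed: str,
                min_gap_s: int = 3600, min_ratio: float = 0.5) -> bool:
    """Flag a task whose wall-clock time has run far ahead of its CPU time.

    min_gap_s and min_ratio are arbitrary illustrative thresholds.
    """
    cpu = parse_duration(cpu_time)
    wall = parse_duration(elapsed)
    return wall - cpu > min_gap_s and cpu / wall < min_ratio


# Values from the first task listed above (OPN1_0057699_00361).
print(looks_stuck("01:11:00", "2d 09:45:38"))
```

A healthy CPU task keeps the two times within seconds of each other, so any task where they diverge by hours is worth a look in htop.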

----------------------------------------

AMD - 2600x, 2 x 2700, 2700x, 3900x, 3950x, 2 x 5900x, 5950x
Intel - E3-1231v3, 9900K
NVidia - GTX 1060 6GB, 1660ti, 1070ti; RTX 2060, 2060s, 2070a, 5 x 2070s
----------------------------------------
[Edit 1 times, last edit by gordonbb at Jul 27, 2021 2:37:59 PM]
[Jul 27, 2021 2:34:45 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7693
Re: Stuck Tasks

The quickest and easiest thing to do is to reboot the system and see if that allows the tasks to resume their normal progress. If this does not work, please post roughly the first 30 lines of the log after the reboot; that may give a clue as to what is happening.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Jul 27, 2021 2:52:24 PM]
gordonbb
Cruncher
Canada
Joined: May 14, 2019
Post Count: 19
Re: Stuck Tasks

Thanks! That did the trick.
The 3 tasks are now progressing once again and curiously the Elapsed time is now showing "normal" values:

Name: OPN1_0057699_00361
CPU time: 01:09:56
Elapsed time: 01:10:15
Estimated time remaining: 00:23:00

Name: OPN1_0057765_03282
CPU time: 01:36:44
Elapsed time: 01:36:54
Estimated time remaining: 00:31:23

Name: OPN1_0057721_00322
CPU time: 01:05:06
Elapsed time: 01:05:13
Estimated time remaining: 00:50:57
[Jul 27, 2021 3:04:03 PM]
biini
Senior Cruncher
Finland
Joined: Jan 25, 2007
Post Count: 334
Re: Stuck Tasks

It seems they resumed from the last checkpoint. I also had a few of those a couple of months ago.
One obvious cause I found (on Windows) was that the GPU driver had been updated automatically.
----------------------------------------

rtx, xeon, i9, ryzen, rnd laptops
dAM0NES 1991 ppl interested in beer, amigas or electronic music
----------------------------------------
[Edit 1 times, last edit by biini at Jul 28, 2021 7:39:07 AM]
[Jul 28, 2021 7:38:22 AM]
nanoprobe
Master Cruncher
Classified
Joined: Aug 29, 2008
Post Count: 2998
Re: Stuck Tasks

Many times you can get stuck tasks progressing again without a reboot by suspending them for a few seconds and then resuming them.
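
A rough sketch of doing that suspend/resume from a script rather than from BOINC Manager, using the `boinccmd --task <project_url> <task_name> <op>` form. The helper names and the 10-second pause are hypothetical, and the project URL must be the master URL that `boinccmd --get_project_status` reports for your attachment. Depending on the install, boinccmd may also need to be run from the BOINC data directory or given the GUI RPC password.

```python
import subprocess
import time


def task_command(project_url: str, task_name: str, op: str) -> list[str]:
    """Build the boinccmd argument list for one task operation."""
    if op not in {"suspend", "resume"}:
        raise ValueError(f"unsupported operation: {op}")
    return ["boinccmd", "--task", project_url, task_name, op]


def nudge_task(project_url: str, task_name: str, pause_s: float = 10.0,
               run=subprocess.run, sleep=time.sleep) -> None:
    """Suspend a task, wait a few seconds, then resume it.

    run and sleep are injectable so the sequence can be dry-run.
    """
    run(task_command(project_url, task_name, "suspend"), check=True)
    sleep(pause_s)
    run(task_command(project_url, task_name, "resume"), check=True)


# Dry run: print the commands instead of executing them.
nudge_task("PROJECT_URL", "TASK_NAME", pause_s=0,
           run=lambda cmd, check: print(" ".join(cmd)))
```

Calling `nudge_task` without the `run` override executes the two commands for real.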
----------------------------------------
In 1969 I took an oath to defend and protect the U S Constitution against all enemies, both foreign and Domestic. There was no expiration date.


[Jul 28, 2021 1:54:22 PM]
ca05065
Senior Cruncher
Joined: Dec 4, 2007
Post Count: 326
Re: Stuck Tasks

When using the suspend / resume technique it is best to set LAIM (leave application in memory) off. This ensures that program and checkpoint files are read from disk. Remember to set LAIM on afterwards.
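
For anyone who prefers to flip LAIM outside BOINC Manager, here is a hedged sketch. It assumes the setting is carried by the `<leave_apps_in_memory>` element of `global_prefs_override.xml` in the BOINC data directory and that the client re-reads the file after `boinccmd --read_global_prefs_override`. The helper name is hypothetical, and it only transforms XML text; writing the file back is left to you.

```python
import xml.etree.ElementTree as ET


def set_laim(override_xml: str, enabled: bool) -> str:
    """Return override XML with <leave_apps_in_memory> set to 1 or 0.

    Other elements already present in the override are preserved.
    An empty string is treated as "no override file yet".
    """
    if override_xml.strip():
        root = ET.fromstring(override_xml)
    else:
        root = ET.Element("global_preferences")
    node = root.find("leave_apps_in_memory")
    if node is None:
        node = ET.SubElement(root, "leave_apps_in_memory")
    node.text = "1" if enabled else "0"
    return ET.tostring(root, encoding="unicode")


# Turn LAIM off before the suspend/resume; call again with True afterwards.
print(set_laim("", enabled=False))
```

Back up any existing override file first, since it may hold other local preferences.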
[Jul 28, 2021 3:46:20 PM]
gordonbb
Cruncher
Canada
Joined: May 14, 2019
Post Count: 19
Re: Stuck Tasks

Thanks - I'll try that next time. I had another task stick (OPN1_0058122_00445_0). I usually spot this when htop shows fewer than all threads in use; I then look in BOINC Manager to find the offending task.

A reboot again reset the stuck task's elapsed time, but looking closer in htop, the CPU% for the task was still at 0% after the reboot and the Progress in BOINC Manager was not increasing.

That one I aborted, and after the abort it showed "Computation Error" before reporting.

The logs are showing squat, but I have them at the default verbosity.

EDIT - Another one. After the reboot the system naturally picked up a few OPNG tasks for the GPU to crunch on. The last of these (OPNG_0068524_00030) got stuck at 27%, so I did as suggested: set LAIM to off, then suspended and resumed the task, and it ran to completion.

So I suspect I have a marginal core on this CPU. I'm using a -0.100V Vcore offset to under-volt the processor a "titch" so I'm going to remove that and see.

It's strange: this system has been chugging away for months with nary an issue. It did, however, recently have the ATI HD5870 replaced with a GTX 1070Ti and the RAM upgraded from 4x4GB DDR4-2400 to 2x8GB DDR4-3200, but I wiped and re-installed the OS (Ubuntu 18.04.5 LTS Desktop). The RAM and GPU were swapped in from another system that was also running OPN. "She who must be obeyed" has her birthday soon and wants a gaming system once again to play games with the kids (well, young adults), so this system is destined to run Windows 10 in a few days, once I get my 100 year badge.
[Edit 1 times, last edit by gordonbb at Jul 29, 2021 2:55:12 AM]
[Jul 29, 2021 2:24:11 AM]