Hanging sh.exe process

adrianmarsh
Cacti User
Posts: 437
Joined: Wed Aug 17, 2005 8:51 am
Location: UK

Post by adrianmarsh »

Every now and then (about once a week on average), I see a sh.exe process running on my server that brings the W2K AS 1 GHz CPU to a crawl. Killing the process seems to recover the machine. I've not yet had the chance to completely back-trace the process IDs to find the parent (as it's difficult to locate).

What I was able to do was a cygwin remote-ps call, which gave:

0 3584 0 ? 15:46:19 C:\cygwin\bin\sh.exe

This seems to suggest that sh.exe was a rogue process, which makes me think it's one of the scripts in Cacti (no other app on the server uses cygwin). I'll take a look through the scripts to see if I can find what spawns it. Has anyone else seen this?

A.
BSOD2600
Cacti Moderator
Posts: 12171
Joined: Sat May 08, 2004 12:44 pm
Location: USA

Post by BSOD2600 »

In the past, yes, and I set up a script that kills runaway processes every hour. Are you running the latest version of Cygwin?

Post by adrianmarsh »

cygwin dll : 1005.18.0.0


Not sure if this is the culprit yet, but it's the only script I have that seems to use sh.exe:

Process Explorer showed me the full exe path that I guess Cacti calls :

C:\cygwin\bin\sh.exe -c "perl C:/inetpub/wwwroot/cacti/scripts/w32_query_OperatingSystem.pl swodell38.uk-lab.lucent.com get TotalVisibleMemorySize,FreePhysicalMemory,TotalVirtualMemorySize,FreeVirtualMemory"

This one didn't hang, but as the w32 script uses WMI, maybe there's a dependency on WMI returning a result (and when it doesn't, sh.exe hangs).

Maybe cactid's script timeout (currently 25 s) doesn't catch this?

Next time I see it, I'll grab more data.

Post by BSOD2600 »

Hmm, yea... sounds like WMI could be the culprit. Next time SH is hung, see if there are any open connections to a server on the DCOM port (135).

Post by adrianmarsh »

Yep, definitely a sh.exe process spawned by cactid. Proc. Exp. showed it was the same script too. No open ports on that process though (no listening, connecting or otherwise).

Post by BSOD2600 »

Guess it might be time to run a script every so often that kills runaway cactid associated processes.

Post by adrianmarsh »

You said you'd had a script to do this in the past. Could you share it ?

Post by BSOD2600 »

Sure, nothing special

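Code:

@echo off
rem Kill runaway poller child processes (requires Sysinternals PsTools).
rem /d also switches drive if the current drive is not C:
cd /d "C:\Program Files\SysInternals\Pstools"
pskill perl.exe
pskill php.exe
pskill cmd.exe

You'd probably want to add in pskill sh.exe.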
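
The batch file above kills every matching process, including poller children that are still healthy. A more selective variant would only target processes that have been running longer than a polling cycle. The Python sketch below shows just that selection step. The `ProcInfo` record, the `select_runaways` name and the 300-second threshold are assumptions made for illustration; they are not part of PsTools or Cacti, and obtaining the process list is left out.

```python
from dataclasses import dataclass


@dataclass
class ProcInfo:
    """Hypothetical process record (not a PsTools or Cacti structure)."""
    pid: int
    name: str
    age_s: float  # seconds since the process started


def select_runaways(procs, names=("sh.exe", "perl.exe", "php.exe"), max_age_s=300.0):
    """Return PIDs of processes with a watched name that have run longer than max_age_s."""
    wanted = {n.lower() for n in names}
    return [p.pid for p in procs if p.name.lower() in wanted and p.age_s > max_age_s]
```

The returned PIDs could then be passed to a kill tool one at a time, leaving short-lived script runs untouched.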
BelgianViking
Cacti User
Posts: 97
Joined: Thu Mar 24, 2005 4:59 am
Location: Brussels, Belgium

Post by BelgianViking »

I'm having the same thing. My recently installed server (on older hardware) was running constantly at 100% CPU. I found 4 or 5 sh.exe processes running at +/- 20% each. I killed them and everything was back to normal.
Just want to add that I'm not running any WMI scripts yet, so it can't be related to that. Just doing SNMP and some perl scripts.
| Cacti 0.8.6g | MySQL 4.1.14 w Query Cache | Net-SNMP 5.2.1 | IIS 6 | fast-cgi | PHP 5.0.3 | RRDtool 1.2.9 | Windows 2003 Server SP1 | Cactid 0.8.6f |
| Dell 2450 - 2x P3 733 MHz, 1GB RAM |

Post by BSOD2600 »

adrianmarsh: Maybe you could add a timeout/self-terminate function to your WMI scripts and see if that helps.

BelgianViking: Same goes for you...

Are both of you using your scripts to poll machines with high latency? What are the average ping times to them?
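
As a rough illustration of the timeout/self-terminate idea, here is a minimal Python sketch of a wrapper that runs a command with a hard time limit. The scripts in this thread are Perl, so this only shows the pattern, not a drop-in fix. The `run_with_timeout` name is an assumption; `subprocess.run` and its `timeout` parameter are standard library.

```python
import subprocess
import sys


def run_with_timeout(argv, timeout_s):
    """Run argv directly (no shell). Return (exit_code, stdout), or (None, "") on timeout."""
    try:
        done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child and waits for it before raising.
        return None, ""
    return done.returncode, done.stdout


if __name__ == "__main__":
    # A child that sleeps longer than the limit is terminated, not left hanging.
    print(run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], 0.5))
```

Passing the command as a list avoids an intermediate shell, so the process that gets killed on timeout is the script itself rather than a wrapper around it.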

Post by adrianmarsh »

I'm not sure which client it's trying to reach; next time I'll see if it's in the run command.

Generally though, the Devices view says all of my current PC times are <2ms, and the highest average is 21.16ms, all with 99.5+ availability.

I'll take a look at the script too, though I'm no perl programmer..

A.
TheWitness
Developer
Posts: 17007
Joined: Tue May 14, 2002 5:08 pm
Location: MI, USA

Post by TheWitness »

Please try the SVN version under branch 0.8.6. Also, please state your cygwin dll version in your signature line.

Thanks,

TheWitness
True understanding begins only when we realize how little we truly understand...

Life is an adventure, let yours begin with Cacti!

Author of dozens of Cacti plugins and customization's. Advocate of LAMP, MariaDB, IBM Spectrum LSF and the world of batch. Creator of IBM Spectrum RTM, author of quite a bit of unpublished work and most of Cacti's bugs.
_________________
Official Cacti Documentation
GitHub Repository with Supported Plugins
Percona Device Packages (no support)
Interesting Device Packages


For those wondering, I'm still here, but lost in the shadows. Yearning for less bugs. Who want's a Cacti 1.3/2.0? Streams anyone?
ErikF
Posts: 4
Joined: Wed Oct 26, 2005 8:38 am

Post by ErikF »

I have the same problem with cactid 0.8.6f-1 and also with latest version from CVS.

It is hanging almost always when invoking a script, whereby it does not matter which kind of script it is (Win32-CMD file, Perl script, native Win32-command line program). It hangs even on the simplest Win32-CMD-script with the contents "@echo 1".

I installed cacti on 2 machines, both running WinXP SP2 with latest cygwin. So far, everything is working fine, pinging hosts, SNMP queries, MySQL queries succeed. But running a script is nearly impossible.

So, I grabbed the sources of cactid 0.8.6f-1, compiled them and started debugging. I concentrated on the functions "nft_popen" and "nft_pclose", as I suspected a race condition and/or deadlock might be hidden there. I am not an expert in dealing with Unix threads and processes, though, and I did not find any concrete problem. I did notice one very strange thing: the hang of SH.EXE occurs much more often when the "Poller Logging Level" is set to "DEBUG". When set to a lower level, scripts succeed more often (though still not always). This could be an indication of synchronization problems with the inherited file descriptors.

Additionally, when the child process hangs, the "nft_pclose" function stalls in the loop that invokes "waitpid". It stays in "waitpid" for a rather long time.
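
One common way to keep a reaping loop from stalling is to poll the child without blocking and kill it once a deadline passes. The Python sketch below shows that pattern. It is not cactid's code, which is C; the `reap_with_deadline` name and the polling interval are assumptions. `Popen.poll()` plays the role of `waitpid` with `WNOHANG`.

```python
import subprocess
import sys
import time


def reap_with_deadline(proc, deadline_s, poll_interval_s=0.05):
    """Wait for proc without blocking forever. Return (exit_code, was_killed)."""
    give_up_at = time.monotonic() + deadline_s
    while True:
        code = proc.poll()  # non-blocking check, comparable to waitpid(..., WNOHANG)
        if code is not None:
            return code, False
        if time.monotonic() >= give_up_at:
            break
        time.sleep(poll_interval_s)
    proc.kill()  # deadline passed: terminate the child instead of waiting on it
    return proc.wait(), True


if __name__ == "__main__":
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(5)"])
    print(reap_with_deadline(child, 0.5))
```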

Also, I traced all possible return values from the "pipe", "close", "dup2" and "select" function calls, but gathered no additional information. All of those functions return "success", and yet the child process hangs.

I am starting to think it's a problem on the cygwin side or in the behavior of SH.EXE. Because of this, and because I wanted more speed from my scripts, I changed the "nft_popen" function so that it does not invoke "/bin/sh" at all but executes the specified command directly. Actually, according to the documentation of "execve", there should be no need to start "/bin/sh" at all (please correct me if I am wrong, Unix is not "my home system"). The documentation says that if the specified program has the ".sh" extension, it is started via the shell anyway, so this would still be backward compatible with existing script command lines that specify a .sh file. You may want to consider changing this.

For me, there was a nice speedup from 2.5 sec. to 1.8 sec. after dropping "/bin/sh" and, more importantly, I have no more hanging child processes... But we will see; I have to test this for a longer period of time.
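
The idea of executing the command directly, rather than through "/bin/sh -c", can be sketched in Python as follows. This is not the modified nft_popen code, which was not posted in the thread. The `read_command_output` name is an assumption, and the 25-second default simply mirrors the timeout mentioned earlier in the thread.

```python
import shlex
import subprocess
import sys


def read_command_output(command_line, timeout_s=25.0):
    """Split command_line into argv and run it directly, with no intermediate shell."""
    # shlex.split() follows POSIX quoting rules, so backslashes act as escapes.
    # Paths are assumed to use forward slashes, as in C:/inetpub/wwwroot/...
    argv = shlex.split(command_line)
    done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    return done.stdout


if __name__ == "__main__":
    demo = shlex.join([sys.executable, "-c", "print('1')"])
    print(read_command_output(demo), end="")
```

Because no shell sits between the poller and the script, there is one process fewer to start per call and no shell left behind if the script stalls. Shell features such as pipes or redirection in the command line would no longer work.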

Post by TheWitness »

I'll give it a whirl. I was also thinking of an appropriate application of the kill command too.

TheWitness

Post by TheWitness »

Can you please post your modified nft_popen.c?

Thanks,

TheWitness