Hanging sh.exe process
- adrianmarsh
- Cacti User
- Posts: 437
- Joined: Wed Aug 17, 2005 8:51 am
- Location: UK
Every now and then (about once a week on average), I see an sh.exe process running on my server that brings the W2K AS 1 GHz CPU to a crawl. Killing the process seems to recover the machine. I've not yet had the chance to trace the process IDs back to the parent (it's difficult to locate the parent).
What I was able to do was a cygwin remote-ps call, which gave:
0 3584 0 ? 15:46:19 C:\cygwin\bin\sh.exe
That seems to suggest the sh.exe was a rogue process. This makes me think it's one of the scripts in Cacti (no other app on the server uses Cygwin). I'll take a look through the scripts to see if I can find what spawns it. Has anyone else seen this?
A.
In the past, yes, and I set up a script that kills runaway processes every hour. Are you running the latest version of Cygwin?
| Scripts: Monitor processes | RFC1213 MIB | DOCSIS Stats | Dell PowerEdge | Speedfan | APC UPS | DOCSIS CMTS | 3ware | Motorola Canopy |
| Guides: Windows Install | [HOWTO] Debug Windows NTFS permission problems |
| Tools: Windows All-in-one Installer |
- adrianmarsh
- Cacti User
- Posts: 437
- Joined: Wed Aug 17, 2005 8:51 am
- Location: UK
cygwin dll : 1005.18.0.0
Not sure if this is the culprit yet, but it's the only script I have that seems to use sh.exe:
Process Explorer showed me the full exe path that I guess Cacti calls:
C:\cygwin\bin\sh.exe -c "perl C:/inetpub/wwwroot/cacti/scripts/w32_query_OperatingSystem.pl swodell38.uk-lab.lucent.com get TotalVisibleMemorySize,FreePhysicalMemory,TotalVirtualMemorySize,FreeVirtualMemory"
This one didn't hang, but as the w32 script uses WMI, maybe there's a dependency on WMI returning a result (when it doesn't, the sh.exe hangs).
Maybe the timeout function of cactid doesn't catch this? (Currently 25s.)
Next time I see it, I'll grab more data.
Hmm, yea... sounds like WMI could be the culprit. Next time SH is hung, see if there are any open connections to a server on the DCOM port (135).
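A rough way to do that check, sketched in Python rather than anything Cacti ships: run `netstat -an` on the poller while sh.exe is hung and keep only the TCP rows whose remote endpoint is port 135. The function names are made up for illustration, and the parser assumes the Windows column order (protocol, local address, foreign address, state).

```python
import subprocess

DCOM_PORT = 135  # RPC endpoint mapper port used by DCOM/WMI


def dcom_connections(netstat_output, port=DCOM_PORT):
    """Return (local, remote, state) for TCP rows whose remote port matches.

    Assumes Windows-style `netstat -an` rows:
        TCP    <local addr:port>    <foreign addr:port>    <state>
    """
    matches = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, blank lines and non-TCP rows (UDP rows have no state).
        if len(fields) < 4 or fields[0].upper() != "TCP":
            continue
        local, remote, state = fields[1], fields[2], fields[3]
        # rpartition takes the text after the last colon as the port.
        if remote.rpartition(":")[2] == str(port):
            matches.append((local, remote, state))
    return matches


def current_dcom_connections():
    """Run `netstat -an` and filter its output (hypothetical helper)."""
    result = subprocess.run(
        ["netstat", "-an"], capture_output=True, text=True, check=True
    )
    return dcom_connections(result.stdout)
```

A row to a polled host's port 135 that lingers while sh.exe is stuck would point toward the WMI call rather than the shell itself. The local listener on 0.0.0.0:135 is ignored because only the remote side is compared.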
- adrianmarsh
- Cacti User
- Posts: 437
- Joined: Wed Aug 17, 2005 8:51 am
- Location: UK
Guess it might be time to run a script every so often that kills runaway cactid-associated processes.
- adrianmarsh
- Cacti User
- Posts: 437
- Joined: Wed Aug 17, 2005 8:51 am
- Location: UK
Sure, nothing special
You'd probably want to add in pskill sh.exe.
Code:
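@echo off
rem Kill stray poller child processes by image name (PsTools pskill).
rem Note: pskill <name> terminates every process with that image name,
rem not only the hung ones.
cd /d "C:\Program Files\SysInternals\Pstools"
pskill perl.exe
pskill php.exe
pskill cmd.exe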
-
- Cacti User
- Posts: 97
- Joined: Thu Mar 24, 2005 4:59 am
- Location: Brussels, Belgium
I'm having the same thing. My recently installed server (on older hardware) was running constantly at 100% CPU. I found 4 or 5 sh.exe processes running at +/- 20% each. I killed them and everything was back to normal.
Just want to add that I'm not running any WMI scripts yet, so it can't be related to that. Just doing SNMP and some perl scripts.
| Cacti 0.8.6g | MySQL 4.1.14 w Query Cache | Net-SNMP 5.2.1 | IIS 6 | fast-cgi | PHP 5.0.3 | RRDtool 1.2.9 | Windows 2003 Server SP1 | Cactid 0.8.6f |
| Dell 2450 - 2x P3 733 MHz, 1GB RAM |
adrianmarsh: Maybe you could add a timeout/self terminate function to your WMI scripts and see if that helps.
BelgianViking: Same goes for you...
Are both of you polling machines that have high latency with your scripts (what are the average ping times to them)?
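One way to get that behaviour without touching the Perl itself is a small wrapper that runs the script with a deadline and kills it when the deadline passes. The sketch below is in Python and is not part of Cacti; `run_with_timeout` is a made-up name, and the 25-second value just mirrors the cactid timeout mentioned earlier in the thread.

```python
import subprocess
import sys


def run_with_timeout(argv, timeout_s=25):
    """Run argv directly (no shell) and enforce a deadline.

    Returns (exit_code, stdout) on normal completion, or (None, "")
    if the child had to be killed because it ran past timeout_s.
    """
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
        return proc.returncode, out
    except subprocess.TimeoutExpired:
        proc.kill()          # terminate the runaway child
        proc.communicate()   # reap it so no zombie/handle is left behind
        return None, ""


# Quick self-check with a command that finishes immediately.
status, output = run_with_timeout([sys.executable, "-c", "print(1)"], timeout_s=25)
```

Note that `kill()` only hits the process that was started directly; anything that process spawns in turn is not covered, so a wrapper like this helps most when the script itself is the thing that hangs.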
- adrianmarsh
- Cacti User
- Posts: 437
- Joined: Wed Aug 17, 2005 8:51 am
- Location: UK
I'm not sure which client it's trying to reach; next time I'll see if it's in the run command.
Generally though, the Devices view says all of my current PC times are <2ms, and the highest average is 21.16ms, all with 99.5+ availability.
I'll take a look at the script too, though I'm no perl programmer..
A.
- TheWitness
- Developer
- Posts: 17007
- Joined: Tue May 14, 2002 5:08 pm
- Location: MI, USA
- Contact:
Please try the SVN version under branch 0.8.6. Also, please state your Cygwin DLL version in your signature line.
Thanks,
TheWitness
True understanding begins only when we realize how little we truly understand...
Life is an adventure, let yours begin with Cacti!
Author of dozens of Cacti plugins and customization's. Advocate of LAMP, MariaDB, IBM Spectrum LSF and the world of batch. Creator of IBM Spectrum RTM, author of quite a bit of unpublished work and most of Cacti's bugs.
_________________
Official Cacti Documentation
GitHub Repository with Supported Plugins
Percona Device Packages (no support)
Interesting Device Packages
For those wondering, I'm still here, but lost in the shadows. Yearning for less bugs. Who want's a Cacti 1.3/2.0? Streams anyone?
I have the same problem with cactid 0.8.6f-1 and also with the latest version from CVS.
It hangs almost every time it invokes a script, and it does not matter which kind of script it is (Win32 CMD file, Perl script, native Win32 command-line program). It hangs even on the simplest Win32 CMD script with the contents "@echo 1".
I installed Cacti on 2 machines, both running WinXP SP2 with the latest Cygwin. Everything else works fine: pinging hosts, SNMP queries and MySQL queries all succeed. But running a script is nearly impossible.
So I grabbed the sources of cactid 0.8.6f-1, compiled them and started debugging. I concentrated on the functions "nft_popen" and "nft_pclose", since I thought there might be a race condition and/or deadlock hidden there. I am not an expert in dealing with Unix threads and processes, though, and I did not find any concrete problem. I did notice one very strange thing: the hang of SH.EXE occurs much more often when the "Poller Logging Level" is set to "DEBUG". When it is set to a lower level, scripts succeed more often (though still not always). That could be an indication of synchronization problems with the inherited file descriptors.
Additionally, when the child process hangs, the "nft_pclose" function stalls in the loop that invokes "waitpid". It stays in "waitpid" for a rather long time.
I also traced the return values of the "pipe", "close", "dup2" and "select" function calls, which gave no additional information. All those functions return "success", and yet the child process hangs.
I am starting to think it's a problem on the Cygwin side or in the behavior of SH.EXE. Because of this, and because I wanted to get more speed out of my scripts, I changed the "nft_pclose" function to not invoke "/bin/sh" at all but to execute the specified command directly. Actually, according to the documentation of "execve" there should not be a need to start "/bin/sh" at all (please correct me if I am wrong, Unix is not "my home system"). The documentation says that if the specified program has the ".sh" extension, it is started via the shell anyway. So it would still be backward compatible with existing script command lines that specify a .sh file. You may want to consider changing this.
For me, there was a nice speedup from 2.5 sec. to 1.8 sec. after dropping "/bin/sh" and, more importantly, I have no more hanging child processes... But we will see; I have to test this for a longer period of time.
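For anyone who wants to see the idea without reading cactid's C, starting the program directly instead of handing the whole command line to "/bin/sh" can be sketched in Python. This is an illustration of the technique, not the change described above; `run_direct` is a hypothetical name. One caution on the "execve" point: as far as I know, on POSIX systems in general an interpreter is chosen from a `#!` first line in the file rather than from a ".sh" extension, so it would be worth checking how Cygwin behaves before relying on that.

```python
import shlex
import subprocess
import sys


def run_direct(command_line, timeout_s=25):
    """Split a command line into argv and start the program with no shell.

    Shell features (pipes, redirection, variable expansion) are NOT
    available this way; the words are passed to the program as-is.
    """
    argv = shlex.split(command_line)  # POSIX-style quoting rules
    return subprocess.run(
        argv,
        shell=False,          # no intermediate sh process is created
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )


# The command line quoted earlier in the thread, split into program + arguments.
example_argv = shlex.split(
    "perl C:/inetpub/wwwroot/cacti/scripts/w32_query_OperatingSystem.pl "
    "swodell38.uk-lab.lucent.com get "
    "TotalVisibleMemorySize,FreePhysicalMemory,TotalVirtualMemorySize,FreeVirtualMemory"
)

# Quick self-check using the running Python interpreter as the target program.
result = run_direct(shlex.join([sys.executable, "-c", "print(1)"]))
```

The trade-off is backward compatibility: any existing script command line that relies on the shell (a pipe, a redirect, a `$VAR`) would stop working once the shell is out of the path.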
- TheWitness
- Developer
- Posts: 17007
- Joined: Tue May 14, 2002 5:08 pm
- Location: MI, USA
- Contact:
I'll give it a whirl. I was also thinking of an appropriate application of the kill command too.
TheWitness
- TheWitness
- Developer
- Posts: 17007
- Joined: Tue May 14, 2002 5:08 pm
- Location: MI, USA
- Contact:
Can you please post your modified nft_popen.c?
Thanks,
TheWitness