Monitor Windows via WMI from Cacti on Linux

hapklaar · Post by **hapklaar** » Mon May 25, 2009 3:23 pm

First of all, kudos to claymen for this great script. It extends the usability of Cacti so much. Thanks!

I managed to create some graphs for monitoring the read and write latencies on our HP EVA. The only problem is that my graphs are full of gaps. For an unsuccessful poll, the Cacti log shows:

05/25/2009 10:00:57 PM - CMDPHP: Poller[0] Host[266] DS[4867] CMD: /usr/bin/php -q /usr/share/cacti/site/scripts/wmi.php -h 'andpm01' -u '/etc/cacti/auth.txt' -w 'Win32_PerfFormattedData_EVAPMEXT_HPEVAPhysicalDiskGroup' -n '' -k 'Name' -v 'ANEVA101 - DiskGroup 2' -c 'ReadLatencyus,WriteLatencyus', output: U

and for a successful poll of the exact same data source:

05/25/2009 09:55:52 PM - CMDPHP: Poller[0] Host[266] DS[4867] CMD: /usr/bin/php -q /usr/share/cacti/site/scripts/wmi.php -h 'andpm01' -u '/etc/cacti/auth.txt' -w 'Win32_PerfFormattedData_EVAPMEXT_HPEVAPhysicalDiskGroup' -n '' -k 'Name' -v 'ANEVA101 - DiskGroup 2' -c 'ReadLatencyus,WriteLatencyus', output: Name:ANEVA101_-_DiskGroup_2 ReadLatencyus:13735 WriteLatencyus:12733

Why do some polls return an invalid output? Could this be a timeout issue?
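
To make the difference between the two log lines concrete, here is a minimal sketch of how the script output could be classified. It assumes the output is either the single token U (the RRDtool marker for an unknown value) or a space-separated list of field:value pairs, as in the two polls quoted above. The function name parse_poll_output is hypothetical and is not part of wmi.php or Cacti.

```python
from typing import Dict, Optional


def parse_poll_output(output: str) -> Optional[Dict[str, str]]:
    """Split 'key:value key:value' output into a dict.

    Returns None when the output is empty, is the unknown marker 'U',
    or contains a token that is not a key:value pair.
    """
    text = output.strip()
    if not text or text == "U":
        return None
    fields = {}
    for token in text.split():
        key, sep, value = token.partition(":")
        if not (sep and key and value):
            return None  # malformed token, treat the whole poll as invalid
        fields[key] = value
    return fields


good = "Name:ANEVA101_-_DiskGroup_2 ReadLatencyus:13735 WriteLatencyus:12733"
print(parse_poll_output(good))
print(parse_poll_output("U"))
```

A poll that returns None here is one that leaves a gap in the graph.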

EDIT: added a graph to show the problem:

claymen · Post by **claymen** » Mon May 25, 2009 6:27 pm

Could be, but it's hard to tell from the Cacti logs you posted.

Set up the log file path and enable debug level 2. This will write out a heap of details about each run. Hopefully it will help you pinpoint what's causing the problem.

hapklaar · Post by **hapklaar** » Tue May 26, 2009 2:35 am

Not sure what to look for here. After setting the log level to debug, the line where the result should be is still the same, with no reason given for the 'U' as far as I can see.

05/26/2009 09:30:48 AM - CMDPHP: Poller[0] Host[266] DS[4868] WARNING: Result from CMD not valid.  Partial Result: U
05/26/2009 09:30:48 AM - CMDPHP: Poller[0] Host[266] DS[4868] CMD: /usr/bin/php -q /usr/share/cacti/site/scripts/wmi.php -h 'andpm01' -u '/etc/cacti/auth.txt' -w 'Win32_PerfFormattedData_EVAPMEXT_HPEVAPhysicalDiskGroup' -n '' -k 'Name' -v 'ANEVA101 - DiskGroup 3' -c 'ReadLatencyus,WriteLatencyus', output: U

Or were you not referring to the cacti log?

claymen · Post by **claymen** » Tue May 26, 2009 2:43 am

No, not the Cacti log but the actual wmi.php logs, which you enable by setting debug level 2. You will find a stack of log files in the path you specified.

hapklaar · Post by **hapklaar** » Tue May 26, 2009 5:25 am

Only one file is created there, for the only collection that is successful at the moment. No debug files are generated for the data sources that give "output: U".

EDIT: actually all datasources from this host stopped working, getting the following error: NTSTATUS: NT code 0xc002001b - NT code 0xc002001b

hapklaar · Post by **hapklaar** » Tue May 26, 2009 9:57 am

OK, I had to reboot the Windows host; it suddenly refused to answer any WMI requests.

Now it's working again, and for every missing data point in Cacti there is no entry in the WMI logging.

When I manually run the command in quick succession I sometimes get the error "NTSTATUS: NT code 0xc00706be - NT code 0xc00706be" or "NTSTATUS: NT code 0xc00706ba - NT code 0xc00706ba"

Thomas.Pacce · Post by **Thomas.Pacce** » Thu May 28, 2009 4:44 am

Why do I get broken graphs like the one attached?

I have set up this graph for a number of hosts; some display correctly whereas others are messed up.

claymen · Post by **claymen** » Thu May 28, 2009 6:54 am

hapklaar wrote:OK, I had to reboot the Windows host; it suddenly refused to answer any WMI requests.

Now it's working again, and for every missing data point in Cacti there is no entry in the WMI logging.

When I manually run the command in quick succession I sometimes get the error "NTSTATUS: NT code 0xc00706be - NT code 0xc00706be" or "NTSTATUS: NT code 0xc00706ba - NT code 0xc00706ba"

From memory, isn't that "RPC server unavailable"?
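
The codes quoted in this thread can be decoded by hand. The sketch below assumes the standard NTSTATUS bit layout (facility in bits 16 to 27, code in the low 16 bits) and the convention that facility 0x7 wraps a plain Win32 error number. The lookup tables only cover the three codes seen above, and the function name describe_ntstatus is hypothetical.

```python
# Win32 error numbers that appear wrapped inside the NTSTATUS values above.
WIN32_ERRORS = {
    1722: "RPC_S_SERVER_UNAVAILABLE (the RPC server is unavailable)",
    1726: "RPC_S_CALL_FAILED (the remote procedure call failed)",
}

# Native NTSTATUS values (not wrapped Win32 errors) seen in this thread.
NTSTATUS_NAMES = {
    0xC002001B: "RPC_NT_CALL_FAILED (the remote procedure call failed)",
}

FACILITY_NTWIN32 = 0x7  # facility used when a Win32 error is wrapped


def describe_ntstatus(code: int) -> str:
    """Return a short description for the NTSTATUS codes discussed here."""
    if code in NTSTATUS_NAMES:
        return NTSTATUS_NAMES[code]
    facility = (code >> 16) & 0x0FFF
    low_word = code & 0xFFFF
    if facility == FACILITY_NTWIN32:
        name = WIN32_ERRORS.get(low_word, "not in this table")
        return f"Win32 error {low_word}: {name}"
    return f"NTSTATUS 0x{code:08x}: not in this table"


for value in (0xC00706BA, 0xC00706BE, 0xC002001B):
    print(f"0x{value:08x} -> {describe_ntstatus(value)}")
```

On that reading, 0xc00706ba is indeed "RPC server unavailable", while 0xc00706be and 0xc002001b both report a failed remote procedure call.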

claymen · Post by **claymen** » Thu May 28, 2009 6:55 am

Thomas.Pacce wrote:Why do I get broken graphs like the one attached?

I have set up this graph for a number of hosts; some display correctly whereas others are messed up.

Not sure, mate. It looks like it's not getting results back properly. Again, if you set up the level 2 debugging to dump out the logs of what's going on, it might give you a better idea.

Debug level 2 dumps out a heap of info: all the inputs, the direct output, the exact command being run, basically everything you need to see what's going on.

hapklaar · Post by **hapklaar** » Thu May 28, 2009 6:22 pm

claymen wrote:
From memory isn't that RPC server unavailable?

It looks like it. However, if I try again right after that, I do get a result. This could be causing Thomas's problem as well. It occurs on multiple win2k3 hosts, by the way. Would it be possible to include a retry in the script? Or do you know what might cause this?

PS: Once every day the WMI service seems to crash, ever since I started monitoring our EVAs with this script. Only killing the process and starting the service gets it back on track.

claymen · Post by **claymen** » Thu May 28, 2009 6:32 pm

hapklaar wrote:
It looks like it. However, if I try again right after that, I do get a result. This could be causing Thomas's problem as well. It occurs on multiple win2k3 hosts, by the way. Would it be possible to include a retry in the script? Or do you know what might cause this?

PS: Once every day the WMI service seems to crash, ever since I started monitoring our EVAs with this script. Only killing the process and starting the service gets it back on track.

Adding a retry has the potential to blow out your poll time. Remember you have to get every result in under 5 minutes (by default), and adding a retry means every data source using the script could double its run time (or more, depending on the number of tries).

I'm not saying you can't, but there is a potential for it to cause other problems with your poller.
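
One way to reconcile the two concerns is a retry that is capped by both an attempt count and a wall-clock budget, so a flaky host cannot stretch the poll indefinitely. The following is only a sketch of that idea as an external wrapper, not a change to wmi.php itself. The names poll_with_retry and looks_valid, the default budget of 20 seconds and the 1 second pause are all assumptions chosen for illustration.

```python
import subprocess
import time


def looks_valid(output):
    """True when every whitespace-separated token is a key:value pair."""
    tokens = output.split()
    if not tokens:
        return False
    for token in tokens:
        key, sep, value = token.partition(":")
        if not (sep and key and value):
            return False
    return True


def poll_with_retry(command, max_attempts=2, budget_seconds=20.0,
                    pause_seconds=1.0, run=subprocess.run,
                    clock=time.monotonic, sleep=time.sleep):
    """Run command until it yields valid output, attempts run out,
    or the time budget is spent. Returns 'U' on failure."""
    deadline = clock() + budget_seconds
    for attempt in range(1, max_attempts + 1):
        remaining = deadline - clock()
        if remaining <= 0:
            break  # budget spent, give up rather than delay the poller
        try:
            result = run(command, capture_output=True, text=True,
                         timeout=remaining)
        except subprocess.TimeoutExpired:
            break  # this attempt used up the rest of the budget
        output = (result.stdout or "").strip()
        if result.returncode == 0 and looks_valid(output):
            return output
        if attempt < max_attempts:
            # Brief pause before retrying, never past the deadline.
            sleep(min(pause_seconds, max(0.0, deadline - clock())))
    return "U"
```

With this shape, the worst case per data source is bounded by budget_seconds regardless of how many attempts are allowed, which keeps the total poll time predictable against the 5 minute default mentioned above.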

hapklaar · Post by **hapklaar** » Sat May 30, 2009 7:46 am

Ok, but that's not really an issue for me as currently a total poll takes a little over 60 seconds. And that's with cmd.php.

I really can't figure out why that error pops up that often, was hoping you might...

claymen · Post by **claymen** » Sat May 30, 2009 7:53 am

hapklaar wrote:Ok, but that's not really an issue for me as currently a total poll takes a little over 60 seconds. And that's with cmd.php.

I really can't figure out why that error pops up that often, was hoping you might...

Wow, 60 seconds seems high.

We have the following and get a poll time of about 30 seconds:
10k+ data sources
500+ wmi data sources
450 hosts total

hapklaar · Post by **hapklaar** » Sat May 30, 2009 8:21 am

claymen wrote:
Wow, 60 seconds seems high.

We have the following and get a poll time of about 30 seconds:
10k+ data sources
500+ wmi data sources
450 hosts total

As long as it's under 300 secs, there's no problem.

Maybe you are using spine or another fast poller? Or maybe most of your hosts are local. Most of my 8k data sources are on remote (slow) sites.

claymen · Post by **claymen** » Sat May 30, 2009 8:35 am

hapklaar wrote:
As long as it's under 300 secs, there's no problem

Maybe you are using spine or another fast poller? Or maybe most of your hosts are local. Most of my 8k data sources are on remote (slow) sites.

True, so long as it's under 300 it works, but ideally the faster the better.

The more headroom you have, the better you can cope with spikes in poll time.

Our data sources span the entire country, but we are an ISP, so the links between them are all nice and fast. If they were slow our customers wouldn't be too impressed...
