Devices Skipping Polling Cycles

Post support questions that relate to the Windows 2003/2000/XP operating systems.

Moderators: Developers, Moderators

Post Reply
sfsrjstw
Posts: 9
Joined: Tue Oct 26, 2010 11:16 am

Devices Skipping Polling Cycles

Post by sfsrjstw »

I have recently installed 0.8.8a onto a Server 2008 Standard x64 running SP2. I have added in nearly all of my devices, but I've been noticing some odd behavior in graphs where there would just be no data displayed. I was trying to figuer out why this might be happening, and noticed that it seems that some of the devices are getting skipped when the polling process run. For one device which I have data gaps in the graph, the Cacti Log shows:

09/11/2012 08:20:27 AM - SPINE: Poller[0] Host[117] TH[1] Total Time: 20 Seconds
09/11/2012 08:20:27 AM - SPINE: Poller[0] Host[117] TH[1] DS[52] SNMP: v2: 140.104.11.62, dsname: traffic_in, oid: .1.3.6.1.2.1.31.1.1.1.6.22, value: 8503887633
09/11/2012 08:20:27 AM - SPINE: Poller[0] Host[117] TH[1] DS[52] SNMP: v2: 140.104.11.62, dsname: traffic_out, oid: .1.3.6.1.2.1.31.1.1.1.10.22, value: 105555786481
09/11/2012 08:20:27 AM - SPINE: Poller[0] Host[117] TH[1] DS[53] SNMP: v2: 140.104.11.62, dsname: traffic_out, oid: .1.3.6.1.2.1.31.1.1.1.10.24, value: 1819542241793
09/11/2012 08:20:27 AM - SPINE: Poller[0] Host[117] TH[1] DS[53] SNMP: v2: 140.104.11.62, dsname: traffic_in, oid: .1.3.6.1.2.1.31.1.1.1.6.24, value: 9744566089
09/11/2012 08:20:27 AM - SPINE: Poller[0] Host[117] TH[1] DS[54] SNMP: v2: 140.104.11.62, dsname: traffic_in, oid: .1.3.6.1.2.1.31.1.1.1.6.49, value: 3038148475274
09/11/2012 08:20:27 AM - SPINE: Poller[0] Host[117] TH[1] DS[54] SNMP: v2: 140.104.11.62, dsname: traffic_out, oid: .1.3.6.1.2.1.31.1.1.1.10.49, value: 179462330242
09/11/2012 08:20:27 AM - SPINE: Poller[0] Host[117] TH[1] NOTE: There are '6' Polling Items for this Host
09/11/2012 08:20:27 AM - SPINE: Poller[0] Host[117] TH[1] Recache DataQuery[12] OID: .1.3.6.1.2.1.1.3.0, output: 3047829067
09/11/2012 08:20:27 AM - SPINE: Poller[0] Host[117] TH[1] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 3047829067
09/11/2012 08:20:27 AM - SPINE: Poller[0] Host[117] Hostname[140.104.11.62] NOTICE: HOST EVENT: Host Returned from DOWN State
09/11/2012 07:35:46 AM - SPINE: Poller[0] Host[117] TH[1] Total Time: 41 Seconds
09/11/2012 07:35:46 AM - SPINE: Poller[0] Host[117] TH[1] NOTE: There are '6' Polling Items for this Host
09/11/2012 07:35:46 AM - SPINE: Poller[0] Host[117] Hostname[140.104.11.62] ERROR: HOST EVENT: Host is DOWN Message: Host did not respond to SNMP, ICMP: Ping timed out
09/11/2012 07:35:46 AM - SPINE: Poller[0] Host[117] SNMP Ping Error: Unknown error: 2
09/11/2012 07:30:26 AM - SPINE: Poller[0] Host[117] TH[1] Total Time: 20 Seconds
09/11/2012 07:30:26 AM - SPINE: Poller[0] Host[117] TH[1] DS[52] SNMP: v2: 140.104.11.62, dsname: traffic_in, oid: .1.3.6.1.2.1.31.1.1.1.6.22, value: 8503770647
09/11/2012 07:30:26 AM - SPINE: Poller[0] Host[117] TH[1] DS[52] SNMP: v2: 140.104.11.62, dsname: traffic_out, oid: .1.3.6.1.2.1.31.1.1.1.10.22, value: 105555238140
09/11/2012 07:30:26 AM - SPINE: Poller[0] Host[117] TH[1] DS[53] SNMP: v2: 140.104.11.62, dsname: traffic_out, oid: .1.3.6.1.2.1.31.1.1.1.10.24, value: 1819343422985
09/11/2012 07:30:26 AM - SPINE: Poller[0] Host[117] TH[1] DS[53] SNMP: v2: 140.104.11.62, dsname: traffic_in, oid: .1.3.6.1.2.1.31.1.1.1.6.24, value: 9740069115
09/11/2012 07:30:26 AM - SPINE: Poller[0] Host[117] TH[1] DS[54] SNMP: v2: 140.104.11.62, dsname: traffic_in, oid: .1.3.6.1.2.1.31.1.1.1.6.49, value: 3037813940975
09/11/2012 07:30:26 AM - SPINE: Poller[0] Host[117] TH[1] DS[54] SNMP: v2: 140.104.11.62, dsname: traffic_out, oid: .1.3.6.1.2.1.31.1.1.1.10.49, value: 179447823984
09/11/2012 07:30:26 AM - SPINE: Poller[0] Host[117] TH[1] NOTE: There are '6' Polling Items for this Host
09/11/2012 07:30:26 AM - SPINE: Poller[0] Host[117] TH[1] Recache DataQuery[12] OID: .1.3.6.1.2.1.1.3.0, output: 3047528980
09/11/2012 07:30:26 AM - SPINE: Poller[0] Host[117] TH[1] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 3047528980

Note there is a time skip from 7:35 to 8:20. I was doing my Cacti log fitering just using Host[117] to show me all entries for just this one host. Has anyone else seen this before? I did a seach, but didn't run across anything.
User avatar
BSOD2600
Cacti Moderator
Posts: 12171
Joined: Sat May 08, 2004 12:44 pm
Location: USA

Re: Devices Skipping Polling Cycles

Post by BSOD2600 »

The log contains the likely reason why it was skipped:
09/11/2012 08:20:27 AM - SPINE: Poller[0] Host[117] Hostname[140.104.11.62] NOTICE: HOST EVENT: Host Returned from DOWN State
09/11/2012 07:35:46 AM - SPINE: Poller[0] Host[117] Hostname[140.104.11.62] ERROR: HOST EVENT: Host is DOWN Message: Host did not respond to SNMP, ICMP: Ping timed out
09/11/2012 07:35:46 AM - SPINE: Poller[0] Host[117] SNMP Ping Error: Unknown error: 2
sfsrjstw
Posts: 9
Joined: Tue Oct 26, 2010 11:16 am

Re: Devices Skipping Polling Cycles

Post by sfsrjstw »

The host wasn't actually down though. It was a false positive. (A different issue I'm working on trying to track down). I did a ping of the host when Cacti first said it was down and it wasn't actually down. Even if it was down, shouldn't it have tried it again in 5 minutes when the next cycle ran, or does it skip it for a certain number of cycles? If it skips it, is that a value I can set somewhere and have overlooked?
User avatar
BSOD2600
Cacti Moderator
Posts: 12171
Joined: Sat May 08, 2004 12:44 pm
Location: USA

Re: Devices Skipping Polling Cycles

Post by BSOD2600 »

Both SNMP and ICMP pings were successful, when cacti (well spine) reported it as down? Could try changing the downtime detection method and/or thresholds.
sfsrjstw
Posts: 9
Joined: Tue Oct 26, 2010 11:16 am

Re: Devices Skipping Polling Cycles

Post by sfsrjstw »

Correct, both SNMP and ICMP pings were successful when spine reported them as down.

I have my detection/timeout settings as follows:

Detection: Ping or SNMP Uptime
Ping Method: ICMP Ping
Ping Timeout Value: 5000
Ping Retry Count: 3

SNMP Version: V2
SNMP Timeout: 5000
Max OID's: 50

Looking through the logs, I'm also noticing the following for the hosts which seem to be skipping cycles:

SPINE: Poller[0] Host[212] SNMP Ping Error: Unknown error: 2

I'm guessing that's probably why, but I don't have any ideas what the unknown error might be.
User avatar
BSOD2600
Cacti Moderator
Posts: 12171
Joined: Sat May 08, 2004 12:44 pm
Location: USA

Re: Devices Skipping Polling Cycles

Post by BSOD2600 »

This is spine 0.8.8a, correct?

Try verbose spine logging to see if there are any more useful details. Unknown error 2 appears to have been around for the past year.
spine -R -S -V 5 -f 117 -l 117
sfsrjstw
Posts: 9
Joined: Tue Oct 26, 2010 11:16 am

Re: Devices Skipping Polling Cycles

Post by sfsrjstw »

Correct...it is 0.8.8a using the Windows packaged installer, all other fuctions seem to be working fine.
Verbose from command line yeilds the following:

Code: Select all

C:\Spine>spine.exe -R -S -V 5 -f 225 -l 225
SPINE: Using spine config file [spine.conf]
09/13/2012 12:51:00 PM - SPINE: Poller[0] DEBUG: The path_php_server variable is
 c:/inetpub/wwwroot/cacti/script_server.php
09/13/2012 12:51:00 PM - SPINE: Poller[0] DEBUG: The path_cactilog variable is C
:/Inetpub/wwwroot/cacti/log/cacti.log
DEBUG: The log_destination variable is 4 (STDOUT)
DEBUG: The path_php variable is C:/php/php.exe
DEBUG: The availability_method variable is 4
DEBUG: The ping_recovery_count variable is 1
DEBUG: The ping_failure_count variable is 1
DEBUG: The ping_method variable is 1
DEBUG: The ping_retries variable is 3
DEBUG: The ping_timeout variable is 5000
DEBUG: The snmp_retries variable is 4
DEBUG: The log_perror variable is 1
DEBUG: The log_pwarn variable is 1
DEBUG: The boost_redirect variable is 0
DEBUG: The log_pstats variable is 1
DEBUG: The threads variable is 20
DEBUG: The polling interval is 300 seconds
DEBUG: The number of concurrent processes is 10
DEBUG: The script timeout is 45
DEBUG: The number of php script servers to run is 10
DEBUG: StartHost='225', EndHost='225', TotalPHPScripts='0
DEBUG: The PHP Script Server is Not Required
DEBUG: The Maximum SNMP OID Get Size is 50
Version 0.8.8a starting
DEBUG: MySQL is Thread Safe!
SPINE: Initializing Net-SNMP API
DEBUG: Issues with SNMP Header Version information, assuming old version of Net-
SNMP.
cygwin warning:
  MS-DOS style path detected: C:\php\extras\mibs
  Preferred POSIX equivalent is: /php/extras/mibs
  CYGWIN environment variable option "nodosfilewarning" turns off this warning.
  Consult the user's guide for more details about POSIX paths:
    http://cygwin.com/cygwin-ug-net/using.html#using-pathnames
SPINE: Initializing PHP Script Server(s)
NOTE: Spine will support multithread device polling.
NOTE: Spine is behaving in a 0.8.7g+ manner
DEBUG: Initial Value of Active Threads is 0
DEBUG: Valid Thread to be Created▒DEBUG: In Poller, About to Start Polling of Ho
st
DEBUG: The Value of Active Threads is 1
DEBUG: Valid Thread to be Created▒DEBUG: In Poller, About to Start Polling of Ho
st
DEBUG: The Value of Active Threads is 2
Host[0] TH[1] Total Time:     0 Seconds
Host[0] TH[1] DEBUG: HOST COMPLETE: About to Exit Host Polling Thread Function
DEBUG: The Value of Active Threads is 1
Host[225] DEBUG: Entering ICMP Ping
Host[225] DEBUG: ICMP Host Alive, Try Count:1, Time:0.0000 ms
Host[225] PING Result: ICMP: Host is Alive
Host[225] SNMP Result: SNMP not performed due to setting or ping result
Host[225] TH[1] RECACHE: Processing 2 items in the auto reindex cache for '140.1
04.10.64
Host[225] TH[1] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 74597910
Host[225] TH[1] Recache DataQuery[12] OID: .1.3.6.1.2.1.1.3.0, output: 74597910
Host[225] TH[1] NOTE: There are '4' Polling Items for this Host
Host[225] TH[1] DS[363] SNMP: v2: 140.104.10.64, dsname: traffic_in, oid: .1.3.6
.1.2.1.31.1.1.1.6.22, value: 3365015677
Host[225] TH[1] DS[363] SNMP: v2: 140.104.10.64, dsname: traffic_out, oid: .1.3.
6.1.2.1.31.1.1.1.10.22, value: 94413979920
Host[225] TH[1] DS[364] SNMP: v2: 140.104.10.64, dsname: traffic_in, oid: .1.3.6
.1.2.1.31.1.1.1.6.25, value: 100972576826
Host[225] TH[1] DS[364] SNMP: v2: 140.104.10.64, dsname: traffic_out, oid: .1.3.
6.1.2.1.31.1.1.1.10.25, value: 4725557932
Host[225] TH[1] Total Time: 0.015 Seconds
Host[225] TH[1] DEBUG: HOST COMPLETE: About to Exit Host Polling Thread Function

DEBUG: The Value of Active Threads is 0
DEBUG: Thread Cleanup Complete
DEBUG: PHP Script Server Pipes Closed
DEBUG: Allocated Variable Memory Freed
DEBUG: MYSQL Free & Close Completed
DEBUG: Net-SNMP Close Completed
Time: 0.3590 s, Threads: 20, Hosts: 2
I had to use another host as Host 117 was properly responding and graphing correctly, so I took a different one that it was reporting as down, but I was able to ping and do an SNMP query against at the time when editing the host. I'm thinking that my graph gaps are related to the unknown error 2. When I run it manually like this, I see the entry about the device returning from a downed state. I'm guessing that's because I'm forcing it to go manually check the device again, and it was off line before, but now it sees it on line.
User avatar
BSOD2600
Cacti Moderator
Posts: 12171
Joined: Sat May 08, 2004 12:44 pm
Location: USA

Re: Devices Skipping Polling Cycles

Post by BSOD2600 »

Hmm, guess you'll have to keep an eye on the cacti.log for more occurrences of unknown error 2. Then quickly run spine in debug mode to try and catch what's really going on.
sfsrjstw
Posts: 9
Joined: Tue Oct 26, 2010 11:16 am

Re: Devices Skipping Polling Cycles

Post by sfsrjstw »

Is there a better spine command than the one listed earlier that I should run when I catch a host as being down when it's really not? I've run the command given earlier as soon as the a device is showing down in the Cacti Logs, but it's always responding to the manual run of spine.
User avatar
BSOD2600
Cacti Moderator
Posts: 12171
Joined: Sat May 08, 2004 12:44 pm
Location: USA

Re: Devices Skipping Polling Cycles

Post by BSOD2600 »

Short of always running cacti with developer logging level enabled, I do not believe so (searching for other threads like these is where I found the spine command which TheWitness found helpful). Of course, with that high logging level, the logs can get very big fast...
sfsrjstw
Posts: 9
Joined: Tue Oct 26, 2010 11:16 am

Re: Devices Skipping Polling Cycles

Post by sfsrjstw »

I think I'll purge the log, flip it to developer level logging, and let it run for a cycle. If there are any down with the Unknown error 2, then I'll post the log file, and change the logging back to normal. Hopefully we can narrow down what's causing the false positives.
sfsrjstw
Posts: 9
Joined: Tue Oct 26, 2010 11:16 am

Re: Devices Skipping Polling Cycles

Post by sfsrjstw »

I purged the Cacti log and ran a cycle with poller logging set to Debug level. And had a host report with the Unknown error 2. The host ID is 185. A manual run of spine once the host reported as down is also shown. The manual run reported the device as up. Host 185 has it's checks configuered as follows:

Down Detection: Ping or SNMP
Ping Type: ICMP
Ping Timeout: 5000
Ping Retry: 3
SNMP Version: 2
SNMP Timeout: 5000

Code: Select all

C:\Spine>spine.exe -R -S -V 5 -f 185 -l 185
SPINE: Using spine config file [spine.conf]
09/19/2012 11:31:32 AM - SPINE: Poller[0] DEBUG: The path_php_server variable is
 c:/inetpub/wwwroot/cacti/script_server.php
09/19/2012 11:31:32 AM - SPINE: Poller[0] DEBUG: The path_cactilog variable is C
:/Inetpub/wwwroot/cacti/log/cacti.log
DEBUG: The log_destination variable is 4 (STDOUT)
DEBUG: The path_php variable is C:/php/php.exe
DEBUG: The availability_method variable is 4
DEBUG: The ping_recovery_count variable is 1
DEBUG: The ping_failure_count variable is 1
DEBUG: The ping_method variable is 1
DEBUG: The ping_retries variable is 3
DEBUG: The ping_timeout variable is 5000
DEBUG: The snmp_retries variable is 4
DEBUG: The log_perror variable is 1
DEBUG: The log_pwarn variable is 1
DEBUG: The boost_redirect variable is 0
DEBUG: The log_pstats variable is 1
DEBUG: The threads variable is 15
DEBUG: The polling interval is 300 seconds
DEBUG: The number of concurrent processes is 8
DEBUG: The script timeout is 60
DEBUG: The number of php script servers to run is 10
DEBUG: StartHost='185', EndHost='185', TotalPHPScripts='0
DEBUG: The PHP Script Server is Not Required
DEBUG: The Maximum SNMP OID Get Size is 50
Version 0.8.8a starting
DEBUG: MySQL is Thread Safe!
SPINE: Initializing Net-SNMP API
DEBUG: Issues with SNMP Header Version information, assuming old version of Net-
SNMP.
cygwin warning:
  MS-DOS style path detected: C:\php\extras\mibs
  Preferred POSIX equivalent is: /php/extras/mibs
  CYGWIN environment variable option "nodosfilewarning" turns off this warning.
  Consult the user's guide for more details about POSIX paths:
    http://cygwin.com/cygwin-ug-net/using.html#using-pathnames
SPINE: Initializing PHP Script Server(s)
NOTE: Spine will support multithread device polling.
NOTE: Spine is behaving in a 0.8.7g+ manner
DEBUG: Initial Value of Active Threads is 0
DEBUG: Valid Thread to be Created▒DEBUG: In Poller, About to Start Polling of Ho
st
DEBUG: The Value of Active Threads is 1
DEBUG: Valid Thread to be Created▒DEBUG: In Poller, About to Start Polling of Ho
st
DEBUG: The Value of Active Threads is 2
Host[0] TH[1] Total Time:     0 Seconds
Host[0] TH[1] DEBUG: HOST COMPLETE: About to Exit Host Polling Thread Function
DEBUG: The Value of Active Threads is 1
Host[185] DEBUG: Entering ICMP Ping
Host[185] DEBUG: ICMP Host Alive, Try Count:1, Time:16.0000 ms
Host[185] PING Result: ICMP: Host is Alive
Host[185] SNMP Result: SNMP not performed due to setting or ping result
Host[185] Hostname[140.104.10.81] NOTICE: HOST EVENT: Host Returned from DOWN St
ate
Host[185] TH[1] RECACHE: Processing 2 items in the auto reindex cache for '140.1
04.10.81
Host[185] TH[1] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 541154130
Host[185] TH[1] Recache DataQuery[12] OID: .1.3.6.1.2.1.1.3.0, output: 541154130

Host[185] TH[1] NOTE: There are '4' Polling Items for this Host
Host[185] TH[1] DS[280] SNMP: v2: 140.104.10.81, dsname: traffic_in, oid: .1.3.6
.1.2.1.31.1.1.1.6.34, value: 22204420922
Host[185] TH[1] DS[280] SNMP: v2: 140.104.10.81, dsname: traffic_out, oid: .1.3.
6.1.2.1.31.1.1.1.10.34, value: 345074652679
Host[185] TH[1] DS[281] SNMP: v2: 140.104.10.81, dsname: traffic_in, oid: .1.3.6
.1.2.1.31.1.1.1.6.49, value: 662743470919
Host[185] TH[1] DS[281] SNMP: v2: 140.104.10.81, dsname: traffic_out, oid: .1.3.
6.1.2.1.31.1.1.1.10.49, value: 56351613463
Host[185] TH[1] Total Time: 0.032 Seconds
Host[185] TH[1] DEBUG: HOST COMPLETE: About to Exit Host Polling Thread Function

DEBUG: The Value of Active Threads is 0
DEBUG: Thread Cleanup Complete
DEBUG: PHP Script Server Pipes Closed
DEBUG: Allocated Variable Memory Freed
DEBUG: MYSQL Free & Close Completed
DEBUG: Net-SNMP Close Completed
Time: 0.3280 s, Threads: 15, Hosts: 2
Attachments
cacti.txt
Cacti Log
(1.14 MiB) Downloaded 320 times
User avatar
BSOD2600
Cacti Moderator
Posts: 12171
Joined: Sat May 08, 2004 12:44 pm
Location: USA

Re: Devices Skipping Polling Cycles

Post by BSOD2600 »

What type of device is host 185? I notice from the log that several other hosts time out. Any similarities between them per chance?

Agreed, it does seem a bit odd. I'll see if the Spine author can investigate into this problem.

Code: Select all

09/19/2012 11:30:12 AM - SPINE: Poller[0] Host[185] DEBUG: Entering ICMP Ping
09/19/2012 11:30:17 AM - SPINE: Poller[0] Host[185] DEBUG: Exceeded Host Timeout, Retrying
09/19/2012 11:30:22 AM - SPINE: Poller[0] Host[185] DEBUG: Exceeded Host Timeout, Retrying
09/19/2012 11:30:27 AM - SPINE: Poller[0] Host[185] DEBUG: Exceeded Host Timeout, Retrying
09/19/2012 11:30:32 AM - SPINE: Poller[0] Host[185] DEBUG: Exceeded Host Timeout, Retrying
09/19/2012 11:30:32 AM - SPINE: Poller[0] Host[185] DEBUG: Entering SNMP Ping
09/19/2012 11:30:52 AM - SPINE: Poller[0] Host[185] SNMP Ping Error: Unknown error: 2
09/19/2012 11:30:52 AM - SPINE: Poller[0] Host[185] PING Result: ICMP: Ping timed out
09/19/2012 11:30:52 AM - SPINE: Poller[0] Host[185] Hostname[140.x] ERROR: HOST EVENT: Host is DOWN Message: Host did not respond to SNMP, ICMP: Ping timed out
09/19/2012 11:30:52 AM - SPINE: Poller[0] Host[185] TH[1] NOTE: There are '4' Polling Items for this Host
09/19/2012 11:30:52 AM - SPINE: Poller[0] Host[185] TH[1] Total Time:    40 Seconds
09/19/2012 11:30:52 AM - SPINE: Poller[0] Host[185] TH[1] DEBUG: HOST COMPLETE: About to Exit Host Polling Thread Function
sfsrjstw
Posts: 9
Joined: Tue Oct 26, 2010 11:16 am

Re: Devices Skipping Polling Cycles

Post by sfsrjstw »

Sorry for the dealy in posting the answer to your question. I'm seeing it on a variety of host types. Most are Nortel/Avaya switches; however, that's the majority of what I'm monitoring/graphing. I am also getting the false positives on Aruba AP's as well as some Unix servers. The Aruba AP's and Unix servers I'm not even doing SNMP queries against...only ping checkes. I've tried both UDP and ICMP in case there was a problem with one version of the ping check vs. the other, but it doesn't seem to make any difference.

Host 185 happens to be a Norte/Avaya 2550 switch.
Post Reply

Who is online

Users browsing this forum: No registered users and 5 guests