[solved] Spine SNMP timeout detected

alokvimawala
Posts: 12
Joined: Wed Jul 08, 2009 9:59 am

Post by alokvimawala »

I had posted this to the "Development Version" section, but seeing that the problem also occurs on the release version of spine, I am cross-posting it here.

I just migrated Cacti from a host running RHEL4 to Ubuntu 9.04.
Once all the prerequisites were satisfied and everything was installed properly, I started noticing a lot of these warnings:

Code: Select all

07/08/2009 09:55:26 AM - SPINE: Poller[0] Host[261] DS[2438] WARNING: SNMP timeout detected [2000 ms], ignoring host '10.194.150.9' 
I do not know if the issue existed on the old install, so the migration might not be relevant.

The timeout issue affects other hosts as well, but not all hosts.
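To see which hosts are affected and how often, the warnings can be tallied per host. This is only a rough sketch: the regular expression is modeled on the single warning line quoted above, and the function name `tally_timeouts` is made up for illustration. The log layout may differ between Spine versions.

```python
import re
from collections import Counter

# Pattern modeled on the warning line quoted in this post (an assumption:
# other Spine versions may format the message differently).
TIMEOUT_RE = re.compile(
    r"Host\[(?P<host_id>\d+)\] DS\[(?P<ds_id>\d+)\] "
    r"WARNING: SNMP timeout detected \[(?P<timeout_ms>\d+) ms\], "
    r"ignoring host '(?P<address>[^']+)'"
)


def tally_timeouts(lines):
    """Count SNMP timeout warnings per (host id, address) pair."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            key = (int(match.group("host_id")), match.group("address"))
            counts[key] += 1
    return counts


# Sample lines copied from this thread; the second one is not a timeout
# warning and is skipped by the pattern.
sample = [
    "07/08/2009 09:55:26 AM - SPINE: Poller[0] Host[261] DS[2438] WARNING: "
    "SNMP timeout detected [2000 ms], ignoring host '10.194.150.9'",
    "07/08/2009 11:26:14 AM - CMDPHP: Poller[0] Host[171] DS[913] WARNING: "
    "Result from SNMP not valid. Partial Result: U",
]

for (host_id, address), count in tally_timeouts(sample).items():
    print(host_id, address, count)
```

Passing an open log file instead of `sample` should work the same way, since the function only iterates over lines.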

I was running Spine version 0.8.7e and switched to the latest SVN code in the hope that it would fix the problem, but that did not help.

I am running the following versions of the relevant software:
Spine - Latest from SVN (revision 5160)
Cacti - 0.8.7e
rrdtool - 1.3.1
NET-SNMP - 5.4.1
php-snmp - 5.2.6
mysql - 5.0.75

The poller settings are as follows:
Maximum Concurrent Poller Processes: 5
Maximum Threads per Process: 10
Number of PHP Script Servers: 10
Script and Script Server Timeout Value: 25
The Maximum SNMP OID's Per SNMP Get Request: 10

In terms of troubleshooting, besides replacing the latest release of Spine with the latest version from SVN, I also tried running spine manually:

Code: Select all

cactiuser@hsg-axe:~$ spine -R -L 5 -H 261
SPINE: Using spine config file [/etc/spine.conf]
SPINE: Version 0.8.7e starting
07/08/2009 11:11:31 AM - SPINE: Poller[0] Host[261] DS[988] WARNING: SNMP timeout detected [2000 ms], ignoring host '10.194.150.9'
07/08/2009 11:11:31 AM - SPINE: Poller[0] Host[261] DS[989] WARNING: SNMP timeout detected [2000 ms], ignoring host '10.194.150.9'
07/08/2009 11:11:31 AM - SPINE: Poller[0] Host[261] DS[989] WARNING: SNMP timeout detected [2000 ms], ignoring host '10.194.150.9'
07/08/2009 11:11:31 AM - SPINE: Poller[0] Host[261] DS[990] WARNING: SNMP timeout detected [2000 ms], ignoring host '10.194.150.9'
07/08/2009 11:11:31 AM - SPINE: Poller[0] Host[261] DS[990] WARNING: SNMP timeout detected [2000 ms], ignoring host '10.194.150.9'
07/08/2009 11:11:31 AM - SPINE: Poller[0] Host[261] DS[991] WARNING: SNMP timeout detected [2000 ms], ignoring host '10.194.150.9'
07/08/2009 11:11:31 AM - SPINE: Poller[0] Host[261] DS[991] WARNING: SNMP timeout detected [2000 ms], ignoring host '10.194.150.9'
07/08/2009 11:11:31 AM - SPINE: Poller[0] Host[261] DS[2435] WARNING: SNMP timeout detected [2000 ms], ignoring host '10.194.150.9'
07/08/2009 11:11:31 AM - SPINE: Poller[0] Host[261] DS[2436] WARNING: SNMP timeout detected [2000 ms], ignoring host '10.194.150.9'
07/08/2009 11:11:31 AM - SPINE: Poller[0] Host[261] DS[2437] WARNING: SNMP timeout detected [2000 ms], ignoring host '10.194.150.9'
07/08/2009 11:11:39 AM - SPINE: Poller[0] Host[261] DS[2438] WARNING: SNMP timeout detected [2000 ms], ignoring host '10.194.150.9'
SPINE: Time: 16.1216 s, Threads: 10, Hosts: 2
I also ran tcpdump while the above command was executed and I have attached the output file along with this post.

I also switched the poller mechanism to cmd.php, but that also resulted in errors:

Code: Select all

07/08/2009 11:26:14 AM - CMDPHP: Poller[0] Host[171] DS[913] WARNING: Result from SNMP not valid. Partial Result: U
I am not sure what I can try next to fix the issue...

Thanks!
Attachments
tcp_dump_150-9.txt
extension changed to .txt so it could be uploaded
(698 Bytes) Downloaded 203 times
BSOD2600
Cacti Moderator
Posts: 12171
Joined: Sat May 08, 2004 12:44 pm
Location: USA

Post by BSOD2600 »

Devices timing out and devices returning invalid SNMP data are two separate issues. Read http://docs.cacti.net/manual:087:4_help.2_debugging

Do those devices respond in under 2000ms when you manually ping?
Does the cmd.php poller say they time out too or mark them as down?
alokvimawala
Posts: 12
Joined: Wed Jul 08, 2009 9:59 am

Post by alokvimawala »

This is what I actually get from CMDPHP:

Code: Select all

07/09/2009 01:23:27 PM - CMDPHP: Poller[0] Host[261] ERROR: HOST EVENT: Host is DOWN Message: Host did not respond to SNMP
This is different than what I had posted earlier:

Code: Select all

07/08/2009 11:26:14 AM - CMDPHP: Poller[0] Host[171] DS[913] WARNING: Result from SNMP not valid. Partial Result: U 
I apologize for that as the two hosts are different and I should have picked that up.

On the ping side of things, things look much better:

Code: Select all

cactiuser@hsg-axe:~$ ping 10.194.150.9 -c 4
PING 10.194.150.9 (10.194.150.9) 56(84) bytes of data.
64 bytes from 10.194.150.9: icmp_seq=1 ttl=253 time=1.26 ms
64 bytes from 10.194.150.9: icmp_seq=2 ttl=253 time=1.27 ms
64 bytes from 10.194.150.9: icmp_seq=3 ttl=253 time=1.27 ms
64 bytes from 10.194.150.9: icmp_seq=4 ttl=253 time=1.25 ms

--- 10.194.150.9 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 1.252/1.268/1.278/0.044 ms
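As a quick sanity check, the summary line from the ping output above can be compared against the 2000 ms SNMP timeout. This is a sketch only: `parse_rtt_summary` is a made-up helper, and it assumes the `rtt min/avg/max/mdev` summary format shown above. Note that ICMP round-trip time says nothing about how long the SNMP agent takes to answer, so a fast ping does not rule out a slow agent.

```python
import re

SNMP_TIMEOUT_MS = 2000  # timeout value reported in the Spine warnings


def parse_rtt_summary(ping_output):
    """Return (min, avg, max, mdev) in ms from the ping summary line, or None."""
    match = re.search(
        r"rtt min/avg/max/mdev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms",
        ping_output,
    )
    if match is None:
        return None
    return tuple(float(value) for value in match.groups())


# Summary line copied from the ping output above.
summary = "rtt min/avg/max/mdev = 1.252/1.268/1.278/0.044 ms"
rtt_min, rtt_avg, rtt_max, rtt_mdev = parse_rtt_summary(summary)

# True when even the slowest reply is far inside the SNMP timeout window.
print(rtt_max < SNMP_TIMEOUT_MS)
```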
As per the debugging guide, I reduced the number of OIDs per request to 1 for 10.194.150.9 and I will monitor to see if that helps.

Note: Cacti is currently using cmd.php as its poller.
alokvimawala
Posts: 12
Joined: Wed Jul 08, 2009 9:59 am

Post by alokvimawala »

Changing the Maximum OID's Per Get Request parameter from 10 to 1 moved the host in question from Down to Recovering. If this does the trick, I will run a SQL update against all devices that hit the timeout issue and set their Maximum OID's Per Get Request value to 1 as well.

I am assuming that this will negatively impact the time it takes to scan all the devices... is this correct?
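A rough sketch of what such a bulk update might look like. Everything here is an assumption to verify against the actual schema before touching a live database: the table name `host`, the columns `id` and `max_oids`, and the helper name `set_max_oids`. Cacti uses MySQL; an in-memory SQLite database stands in for it below so the example runs on its own, and the `?` placeholder style would need adjusting for a MySQL driver.

```python
import sqlite3


def set_max_oids(conn, host_ids, max_oids=1):
    """Set max_oids for the given host ids; return the number of rows changed."""
    if not host_ids:
        return 0  # nothing to update; avoids building an empty IN () clause
    placeholders = ",".join("?" for _ in host_ids)
    cursor = conn.execute(
        f"UPDATE host SET max_oids = ? WHERE id IN ({placeholders})",
        [max_oids, *host_ids],
    )
    conn.commit()
    return cursor.rowcount


# In-memory stand-in for the Cacti database (assumed, simplified schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE host (id INTEGER PRIMARY KEY, hostname TEXT, max_oids INTEGER)"
)
conn.executemany(
    "INSERT INTO host VALUES (?, ?, ?)",
    [(171, "example-host", 10), (261, "10.194.150.9", 10)],
)

# Only host 261 is changed; host 171 keeps its original value.
print(set_max_oids(conn, [261]))
```

Backing up the database first, and limiting the update to the hosts that actually logged timeouts, keeps the change easy to undo.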
BSOD2600
Cacti Moderator
Posts: 12171
Joined: Sat May 08, 2004 12:44 pm
Location: USA

Post by BSOD2600 »

alokvimawala wrote: I am assuming that this will negatively impact the time it takes to scan all the devices... is this correct?
Only Spine uses this parameter, IIRC.

Yes, in theory it can increase the time it takes to poll data from devices. No one has done any real benchmarking though to find out how big the difference is.
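To put rough numbers on the "in theory" part, here is a back-of-the-envelope sketch. It assumes a simplified model in which each get request carries up to the configured number of OIDs and the per-request round trip dominates; it is not a description of how Spine actually schedules requests. The count of 8 OIDs is also an assumption, taken from the 8 distinct data sources that host 261 showed in the log earlier in the thread.

```python
import math


def get_requests_needed(oid_count, max_oids_per_request):
    """Number of get requests needed to fetch oid_count OIDs in batches."""
    return math.ceil(oid_count / max_oids_per_request)


# Assumed: 8 OIDs for the host, one per data source seen in the log.
for max_oids in (10, 1):
    print(max_oids, get_requests_needed(8, max_oids))
```

Under that model, dropping from 10 OIDs per request to 1 turns a single round trip into eight for this host, which is where any extra polling time would come from.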
alokvimawala
Posts: 12
Joined: Wed Jul 08, 2009 9:59 am

Post by alokvimawala »

I changed max_oids to 1 for all hosts, and also set it to 1 under the Poller tab in Settings.
However, I am still getting the "SNMP timeout detected" error.

Surprisingly, the error went away for the 150.9 device, but it still persists for other devices.

Current settings are as follows:
System Wide
Maximum Threads per Process: 20
Number of PHP Script Servers: 10
Script and Script Server Timeout Value: 25
The Maximum SNMP OID's Per SNMP Get Request: 1
Host Specific
SNMP Timeout: 2000
Maximum OID's Per Get Request: 1
All the hosts defined in Cacti have the same timeout and max OID settings.

I am going to switch from spine back to cmd.php to be able to collect data.
alokvimawala
Posts: 12
Joined: Wed Jul 08, 2009 9:59 am

Post by alokvimawala »

OK... quick update.
CMD.PHP is working without any problems now.
I ended up deleting all datasources and graphs associated with problematic hosts and re-created them. This took care of the "Partial Result: U" messages.

I will try to switch the poller over to spine in a little bit and try again with the settings as per the previous post.

I did change Maximum Concurrent Poller Processes from 5 to 20 to speed up polling, so that it no longer takes almost 5 minutes to complete.
alokvimawala
Posts: 12
Joined: Wed Jul 08, 2009 9:59 am

Post by alokvimawala »

Another update:
I went through the list of hosts that cmd.php was having problems with and deleted all data sources (and related graphs) associated with the hosts.
After deletion of said data sources (and related graphs), I recreated them for those devices and cmd.php was happy again.
I am not really sure why this is the case but everything is happy with cmd.php again.
Once things were working with cmd.php, I switched the poller over to spine and things started working with spine again too.
I have been timeout error free all weekend :)

Thanks to BSOD2600 for the help and pointing me in the direction of the debugging document.