spine does not appear to honor SNMP retries setting?

Post by **Blargh** » Wed Sep 09, 2015 7:49 am

All,

I have a Cacti setup polling quite a few hosts, but some of them are pretty slow/obstinate about returning SNMP data, so I really need SNMP retries to work. However, as it stands right now, the spine poller does not appear to perform them (at least when using spine 0.8.8f and NET-SNMP version 5.4.3).

I have the poller set to run every five minutes, but on x2 and x7 (instead of x0 and x5), just to balance out against a different Cacti instance on the same ESX cluster. In the Cacti logs, I see:

09/09/2015 04:52:05 AM - SPINE: Poller[0] Host[188] TH[1] DS[299689] WARNING: SNMP timeout detected [5000 ms], ignoring host 'xxx.xxx.xxx.xxx'

(Yes, I have verified via command line tools the host is answering queries, community string is correct, etc., it's just a little slow and obstinate about it).

However, since it is logging it at (poller_start)+5 seconds, and the timeout is set for 5 seconds (yes, this is long, I've been troubleshooting), it's pretty clear the poller is giving up after the first timeout, and not honoring the SNMP retries parameter set in Cacti (right now set to 5).
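The timing argument above can be sketched in a few lines of C. This is only an illustration: the function names are made up, and it assumes each attempt waits the full timeout with a constant interval (if the library backs off between attempts, the real figure would only be longer).

```c
#include <assert.h>

/* Worst-case milliseconds before a request is declared timed out,
 * assuming a constant per-attempt timeout (an assumption, not
 * something verified against NET-SNMP). One initial attempt plus
 * `retries` resends. */
long expected_timeout_ms(long timeout_ms, int retries)
{
    assert(timeout_ms >= 0 && retries >= 0);
    return timeout_ms * (long)(retries + 1);
}

/* Number of resends that appear to have happened, given how long the
 * poller waited before logging the timeout. */
int observed_retries(long elapsed_ms, long timeout_ms)
{
    assert(timeout_ms > 0 && elapsed_ms >= 0);
    long attempts = elapsed_ms / timeout_ms;
    return attempts > 0 ? (int)(attempts - 1) : 0;
}
```

With a 5000 ms timeout, `expected_timeout_ms(5000, 5)` gives 30000 for the Cacti setting of 5 retries, while `observed_retries(5000, 5000)` gives 0, which matches a warning logged 5 seconds after the poller starts.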

Reproducing this is a little tricky - it's a bit random which devices won't respond on any given poll cycle, so the usual trick of running spine in debug/foreground mode doesn't work reliably. There are a large-ish number of hosts (about 160), so doing a full spine run in debug isn't something I'd like to try until every other option is exhausted.

I dug into the spine source code (0.8.8f) a little, and in snmp.c, at line 180, session.retries is hard set to 3. It does not reference the host's record for retries, which is a bug, but that isn't really what is hurting me right now. The real problem is that even the hard-coded retries don't seem to be happening at all: if I were getting 3 retries, the above timeout line would occur at 04:52:15 or later, not 04:52:05.

I added in a little logging to snmp_get_multi, just to verify, right after the "retry:" label:

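        /* Log the retries/timeout the session actually carries before sending. */
        debugsess = (struct snmp_session *)snmp_sess_session(current_host->snmp_session);
        /* retries is an int and timeout is a long (microseconds) in struct snmp_session */
        SPINE_LOG(("DEBUG: Session.retries = %i, session.timeout = %li", debugsess->retries, debugsess->timeout));
        status = snmp_sess_synch_response(current_host->snmp_session, pdu, &response);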

And I do get the expected:

DEBUG: Session.retries = 3, session.timeout = 5000000

At this point, I'm speculating that NET-SNMP's snmp_sess_synch_response doesn't honor or properly handle the retries parameter, but I'm out of time right now to dig into it more. I searched these forums and didn't see anything about this, so I'm not sure if something is just broken on my end, or if this is expected behavior for the library and the spine poller needs to handle SNMPERR_TIMEOUT differently. Has anyone seen this before, or have thoughts?
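If it does turn out that the poller has to handle timeouts itself, the general shape would be a retry loop around the synchronous request. The sketch below is not spine or NET-SNMP code: the status codes, the `request_fn` callback, and `slow_device` are all hypothetical stand-ins, used only to show the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in status codes; real code would use the library's own values. */
enum { REQ_OK = 0, REQ_TIMEOUT = 1, REQ_ERROR = 2 };

/* Hypothetical callback type: performs one synchronous request. */
typedef int (*request_fn)(void *ctx);

/* Call `send` until it stops timing out or `retries` resends are used up.
 * Returns the last status and stores the attempt count in *attempts. */
int request_with_retries(request_fn send, void *ctx, int retries, int *attempts)
{
    int status = REQ_ERROR;
    int tries = 0;

    assert(send != NULL);
    while (tries <= retries) {
        status = send(ctx);
        tries++;
        if (status != REQ_TIMEOUT) {
            break; /* success or a non-timeout error: stop retrying */
        }
    }
    if (attempts != NULL) {
        *attempts = tries;
    }
    return status;
}

/* Fake device for illustration: times out *ctx times, then answers. */
int slow_device(void *ctx)
{
    int *remaining = (int *)ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return REQ_TIMEOUT;
    }
    return REQ_OK;
}
```

A device that times out twice and then answers succeeds on the third attempt when 3 retries are allowed. In spine the callback would wrap the actual request, and it would likely need to rebuild or clone the PDU for each attempt, since the library appears to consume it on send.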
