A few graphs are missing data time after time

Post by marijonas »

Hi,

A few of my collections (GAUGE data sources) sometimes have missing data. It happens randomly, on something like 3-5 polls out of 10.

I collect the data with SNMP queries. My polling cycle is 1 minute and the poller takes ~35 seconds to complete all collections, so the runs don't overlap.
Initially I suspected the device was not returning data because it is very busy, but I also ran snmpwalk every second from the Cacti server to watch the values, and I got a response to every single query.
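
For reference, the check I ran was roughly this (the community string here is just a placeholder for the real one; the OID is the NetApp CPU busy value I graph):

while true; do
    snmpwalk -v2c -c public 10.245.208.77 .1.3.6.1.4.1.789.1.2.1.3
    sleep 1
done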

I also suspected that Cacti drops the value only when the response is 100 (I measure CPU load), but then I noticed that another graph is missing data as well; it also collects GAUGE values, and those vary from 0 to ~5.

The strange thing is that I only have these issues collecting from NetApp devices. All other devices (routers, switches) respond fine.

I run Cacti 0.8.7e on Linux (Red Hat)
Spine poller enabled
Net-SNMP 5.3.2.2

Post by BSOD2600 »

Turn the Cacti logging level up to MEDIUM and watch whether data is being returned during the gaps.
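
Something like this against the log will show whether spine actually handed back a value for that data source on each cycle (the log path and DS id are placeholders; use your own):

grep "DS\[NNN\]" /path/to/cacti/log/cacti.log | tail -n 20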

Post by marijonas »

Now I'm even more confused. The data in the logs and the data recorded in the RRD are completely different.


This is what I get in debug mode:

06/23/2010 05:26:24 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 59
06/23/2010 05:25:24 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 100
06/23/2010 05:24:23 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 74
06/23/2010 05:23:24 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 0
06/23/2010 05:22:23 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 64
06/23/2010 05:21:24 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 50
06/23/2010 05:20:23 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 67
06/23/2010 05:19:24 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 100
06/23/2010 05:18:23 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 0
06/23/2010 05:17:24 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 0
06/23/2010 05:16:23 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 60
06/23/2010 05:15:24 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 59
06/23/2010 05:14:23 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 92
06/23/2010 05:13:24 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 0
06/23/2010 05:12:24 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 83


And here is what I see in my RRD export (note that time runs in opposite directions in the two listings):

"2010-06-23 17:11:00","2.2800000000e+01"
"2010-06-23 17:12:00","4.9800000000e+01"
"2010-06-23 17:13:00","3.3200000000e+01"
"2010-06-23 17:14:00","5.5200000000e+01"
"2010-06-23 17:15:00","7.1650000000e+01"
"2010-06-23 17:16:00","5.9600000000e+01"
"2010-06-23 17:17:00","2.3000000000e+01"
"2010-06-23 17:18:00","0.0000000000e+00"
"2010-06-23 17:19:00","6.1666666667e+01"
"2010-06-23 17:20:00","8.0200000000e+01"
"2010-06-23 17:21:00","5.6516666667e+01"
"2010-06-23 17:22:00","5.8400000000e+01"
"2010-06-23 17:23:00","2.4533333333e+01"
"2010-06-23 17:24:00","4.4400000000e+01"
"2010-06-23 17:25:00","9.0033333333e+01"
"2010-06-23 17:26:00","7.5400000000e+01"
"2010-06-23 17:27:00","6.2083333333e+01"

Post by BSOD2600 »

1) It appears the NetApp device is the true culprit for the holes in the graphs.
2) The rrdtool data doesn't match because of how rrdtool normalizes updates. Details: http://www.vandenbogaerdt.nl/rrdtool/process.php
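
You can reproduce the effect with a throwaway RRD: update it a little after the step boundary and the stored value becomes a time-weighted mix of the two samples (a quick sketch, file and DS names made up):

rrdtool create test.rrd --start 1200000000 --step 60 DS:busy:GAUGE:120:0:100 RRA:AVERAGE:0.5:1:10
rrdtool update test.rrd 1200000023:0      # 23s past the step boundary
rrdtool update test.rrd 1200000084:100    # next poll, 61s later
rrdtool fetch test.rrd AVERAGE --start 1200000000 --end 1200000120
# the step ending at 1200000060 stores (23*0 + 37*100)/60 = ~61.67, not 0 or 100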

Post by marijonas »

It doesn't make any sense to me so far :)

1. I would blame the NetApp as well, except that I run an snmpwalk script on the same Cacti server which queries the same OID every second, and it gets data every single time. What I mostly noticed is that during the periods when CPU load is ~100, my snmpwalk query returns a value, but Cacti shows 0 there.

2. I use GAUGE in my data collection, so the RRD output should show the exact value it gets from the SNMP query, shouldn't it?

Please correct me if I'm wrong or missing something in this troubleshooting chain.

At one point I also suspected that my graph was pulling data from the wrong data source (e.g. a mistake in the graph template configuration), but I double-checked and it looks correctly configured. Also, I collect the same CPU load from some other NetApp devices, which are probably less busy, and those collections don't have any "holes" so far.

Could it be related to the maximum number of SNMP values per bulk request, or to the number of parameters monitored on one device?

Post by BSOD2600 »

marijonas wrote: 2. I use GAUGE in my data collection, so the RRD output should show the exact value it gets from the SNMP query, shouldn't it?
Not exactly. Please read the link I posted on how rrdtool compensates for the update time offset.
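
You can see it in your own numbers. Take the 17:19:00 slot: going by the log timestamps, your updates landed at 17:18:23 (value 0) and 17:19:24 (value 100), so the 17:18:00-17:19:00 step is covered by 23 seconds of 0 and 37 seconds of 100:

(23 * 0 + 37 * 100) / 60 = 61.67

which is exactly the 6.1666666667e+01 your RRD stores for 17:19:00, even though no single poll ever returned that value.
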
marijonas wrote: Could it be related to the maximum number of SNMP values per bulk request, or to the number of parameters monitored on one device?
Yes to both. Different SNMP devices/versions behave differently in both respects. Testing will reveal which config options work best for each device class.
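
In Cacti the relevant knob is the "Maximum OID's Per Get Request" setting on the device's edit page; drop it and see if the gaps go away. You can also test how the NetApp copes with bulk requests directly from the command line (the repetition counts below are just examples, community is a placeholder):

snmpbulkwalk -v2c -c public -Cr10 10.245.208.77 .1.3.6.1.4.1.789
snmpbulkwalk -v2c -c public -Cr60 10.245.208.77 .1.3.6.1.4.1.789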

Post by marijonas »

Hi,

I tried to debug the issue, but now I'm even more confused.
I suspected a badly built template, so I created another one from scratch with the very same OID and started collecting on two NetApp devices. This is what I got:

netApp-01: the original data source shows zeros all the time. The new data source collects data that looks to be at the same level I get from the command line with snmpwalk (it stays in the 60-80% range).

netApp-02: the original data source collected what we assumed was correct data, with a time-of-day pattern matching the load cycle, and the level was about the same as we get from the device's native tools (20-30% range). The new data source shows random numbers (with no time pattern) in the 60-80% range.

I turned on debug logging so I could check what the data sources receive, and it looks like Cacti is querying the right device with the right SNMP OID.

I attached the collection for netApp-02, both the original and the new one, for the same SNMP OID.

Any hints are more than welcome.
Attachments
Screenshot.png — identical OID collections

Post by BSOD2600 »

Looking at the poller cache, they're polling the exact same OIDs for the old and new templates, yet the results are wildly different?
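
If you want to double-check outside the GUI, the poller cache lives in the poller_item table; something like this shows the OID each data source actually polls (database/user names are placeholders, column names per the stock 0.8.7 schema, host id 23 is from your log):

mysql -u cactiuser -p cacti -e "SELECT local_data_id, rrd_name, arg1 FROM poller_item WHERE host_id = 23;"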

Post by marijonas »

Yes, that's exactly it.
They poll the very same OIDs, but the results are different.

Post by BSOD2600 »

I guess it's something with the devices then, since both data sources poll the same OID and you've verified the returned data in cacti.log. Time to contact the vendor.