A few graphs are missing data time after time

Post by marijonas »

Hi,

A few of my collections (GAUGE data sources) sometimes have missing data. It happens randomly, on something like 3-5 polls out of 10.

I collect the data with SNMP queries. My polling cycle is 1 minute and the poller takes ~35 seconds to complete all collections, so the runs don't overlap.
Initially I suspected the device was not returning data because it is very busy, but I also ran snmpwalk every second from the Cacti server to watch the values, and I got a response to every single query.
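
For reference, the check I ran was roughly this (the community string here is just a placeholder for the real one; the OID is the NetApp CPU busy value I graph):

while true; do
    snmpwalk -v2c -c public 10.245.208.77 .1.3.6.1.4.1.789.1.2.1.3
    sleep 1
done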

I also suspected that Cacti drops the value only when the response is 100 (I measure CPU load), but then I noticed that another graph is missing data as well; it also collects GAUGE values, and those vary from 0 to ~5.

The strange thing is that I only have these issues collecting from NetApp devices. All other devices (routers, switches) respond fine.

I run Cacti 0.8.7e on Linux (Red Hat)
Spine poller enabled
Net-SNMP 5.3.2.2

Post by BSOD2600 »

Turn the Cacti logging level up to MEDIUM and watch whether data is being returned during the gaps.
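
Something like this against the log will show whether spine actually handed back a value for that data source on each cycle (the log path and DS id are placeholders; use your own):

grep "DS\[NNN\]" /path/to/cacti/log/cacti.log | tail -n 20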

Post by marijonas »

Now I'm even more confused. The data in the logs and the data recorded in the RRD are completely different.


This is what I get in debug mode:

06/23/2010 05:26:24 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 59
06/23/2010 05:25:24 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 100
06/23/2010 05:24:23 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 74
06/23/2010 05:23:24 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 0
06/23/2010 05:22:23 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 64
06/23/2010 05:21:24 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 50
06/23/2010 05:20:23 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 67
06/23/2010 05:19:24 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 100
06/23/2010 05:18:23 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 0
06/23/2010 05:17:24 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 0
06/23/2010 05:16:23 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 60
06/23/2010 05:15:24 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 59
06/23/2010 05:14:23 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 92
06/23/2010 05:13:24 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 0
06/23/2010 05:12:24 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 83


And here is what I see in my RRD export (note that time runs in opposite directions in the two listings):

"2010-06-23 17:11:00","2.2800000000e+01"
"2010-06-23 17:12:00","4.9800000000e+01"
"2010-06-23 17:13:00","3.3200000000e+01"
"2010-06-23 17:14:00","5.5200000000e+01"
"2010-06-23 17:15:00","7.1650000000e+01"
"2010-06-23 17:16:00","5.9600000000e+01"
"2010-06-23 17:17:00","2.3000000000e+01"
"2010-06-23 17:18:00","0.0000000000e+00"
"2010-06-23 17:19:00","6.1666666667e+01"
"2010-06-23 17:20:00","8.0200000000e+01"
"2010-06-23 17:21:00","5.6516666667e+01"
"2010-06-23 17:22:00","5.8400000000e+01"
"2010-06-23 17:23:00","2.4533333333e+01"
"2010-06-23 17:24:00","4.4400000000e+01"
"2010-06-23 17:25:00","9.0033333333e+01"
"2010-06-23 17:26:00","7.5400000000e+01"
"2010-06-23 17:27:00","6.2083333333e+01"

Post by BSOD2600 »

1) It appears the NetApp device is the true culprit for the holes in the graphs.
2) The rrdtool data doesn't match because of how rrdtool normalizes updates. Details: http://www.vandenbogaerdt.nl/rrdtool/process.php
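
You can reproduce the effect with a throwaway RRD: update it a little after the step boundary and the stored value becomes a time-weighted mix of the two samples (a quick sketch, file and DS names made up):

rrdtool create test.rrd --start 1200000000 --step 60 DS:busy:GAUGE:120:0:100 RRA:AVERAGE:0.5:1:10
rrdtool update test.rrd 1200000023:0      # 23s past the step boundary
rrdtool update test.rrd 1200000084:100    # next poll, 61s later
rrdtool fetch test.rrd AVERAGE --start 1200000000 --end 1200000120
# the step ending at 1200000060 stores (23*0 + 37*100)/60 = ~61.67, not 0 or 100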

Post by marijonas »

It doesn't make any sense to me so far :)

1. I would blame the NetApp as well, except that I run an snmpwalk script on the same Cacti server which queries the same OID every second, and it gets data every single time. What I mostly noticed is that during the periods when CPU load is ~100, my snmpwalk query returns a value, but Cacti shows 0 there.

2. I use GAUGE in my data collection, so the RRD output should show the exact value it gets from the SNMP query, shouldn't it?

Please correct me if I'm wrong or missing something in this troubleshooting chain.

At one point I also suspected that my graph was pulling data from the wrong data source (e.g. a mistake in the graph template configuration), but I double-checked and it looks correctly configured. Also, I collect the same CPU load from some other NetApp devices, which are probably less busy, and those collections don't have any "holes" so far.

Could it be related to the maximum number of SNMP values per bulk request, or to the number of parameters monitored on one device?

Post by BSOD2600 »

marijonas wrote: 2. I use GAUGE in my data collection, so the RRD output should show the exact value it gets from the SNMP query, shouldn't it?
Not exactly. Please read the link I posted on how rrdtool compensates for the update time offset.
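
You can see it in your own numbers. Take the 17:19:00 slot: going by the log timestamps, your updates landed at 17:18:23 (value 0) and 17:19:24 (value 100), so the 17:18:00-17:19:00 step is covered by 23 seconds of 0 and 37 seconds of 100:

(23 * 0 + 37 * 100) / 60 = 61.67

which is exactly the 6.1666666667e+01 your RRD stores for 17:19:00, even though no single poll ever returned that value.
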
marijonas wrote: Could it be related to the maximum number of SNMP values per bulk request, or to the number of parameters monitored on one device?
Yes to both. Different SNMP devices/versions behave differently in both respects. Testing will reveal which config options work best for each device class.
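
In Cacti the relevant knob is the "Maximum OID's Per Get Request" setting on the device's edit page; drop it and see if the gaps go away. You can also test how the NetApp copes with bulk requests directly from the command line (the repetition counts below are just examples, community is a placeholder):

snmpbulkwalk -v2c -c public -Cr10 10.245.208.77 .1.3.6.1.4.1.789
snmpbulkwalk -v2c -c public -Cr60 10.245.208.77 .1.3.6.1.4.1.789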

Post by marijonas »

Hi,

I tried to debug the issue, but now I'm even more confused.
I suspected a badly built template, so I created another one from scratch with the very same OID and started collecting on two NetApp devices. This is what I got:

netApp-01: the original data source shows zeros all the time. The new data source collects data that looks to be at the same level I get from the command line with snmpwalk (it stays in the 60-80% range).

netApp-02: the original data source collected what we assumed was correct data, with a time-of-day pattern matching the load cycle, and the level was about the same as we get from the device's native tools (20-30% range). The new data source shows random numbers (with no time pattern) in the 60-80% range.

I turned on debug logging so I could check what the data sources receive, and it looks like Cacti is querying the right device with the right SNMP OID.

I attached the collection for netApp-02, both the original and the new one, for the same SNMP OID.

Any hints are more than welcome.
Attachments
Screenshot.png — identical OID collections

Post by BSOD2600 »

Looking at the poller cache, they're polling the exact same OIDs for the old and new templates, yet the results are wildly different?
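
If you want to double-check outside the GUI, the poller cache lives in the poller_item table; something like this shows the OID each data source actually polls (database/user names are placeholders, column names per the stock 0.8.7 schema, host id 23 is from your log):

mysql -u cactiuser -p cacti -e "SELECT local_data_id, rrd_name, arg1 FROM poller_item WHERE host_id = 23;"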

Post by marijonas »

Yes, that's exactly it.
They poll the very same OIDs, but the results are different.

Post by BSOD2600 »

I guess it's something with the devices then, since both data sources poll the same OID and you've verified the returned data in cacti.log. Time to contact the vendor.