Few graphs are missing data time after time
Hi,
A few of my collections (GAUGEs) intermittently have missing data. It happens at random, roughly 3-5 polls out of 10.
I collect data using SNMP queries. My polling cycle is 1 minute, and the poller takes ~35 seconds to finish all collections (so polling cycles do not overlap).
Initially I suspected the device was not returning data because it is very busy, but at the same time I ran an snmpwalk every second from the Cacti server, and I got a response to every single query.
I also suspected that Cacti fails to show data only when the response value is 100 (I measure CPU load), but then I noticed that one more graph is also missing data; it collects GAUGEs whose values vary from 0 to ~5.
The strange part is that I have issues only when collecting data from NetApp devices. All other devices (routers, switches) respond well.
I run Cacti 0.8.7e on Linux (Red Hat)
Spine enabled
Net-SNMP 5.3.2.2
Turn the cacti logging level to medium and watch if data is getting returned or not during the gaps.
| Scripts: Monitor processes | RFC1213 MIB | DOCSIS Stats | Dell PowerEdge | Speedfan | APC UPS | DOCSIS CMTS | 3ware | Motorola Canopy |
| Guides: Windows Install | [HOWTO] Debug Windows NTFS permission problems |
| Tools: Windows All-in-one Installer |
Now I'm more confused. The data in the logs and what is recorded in the rrd are completely different.
This is what I get in debug mode:
06/23/2010 05:26:24 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 59
06/23/2010 05:25:24 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 100
06/23/2010 05:24:23 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 74
06/23/2010 05:23:24 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 0
06/23/2010 05:22:23 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 64
06/23/2010 05:21:24 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 50
06/23/2010 05:20:23 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 67
06/23/2010 05:19:24 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 100
06/23/2010 05:18:23 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 0
06/23/2010 05:17:24 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 0
06/23/2010 05:16:23 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 60
06/23/2010 05:15:24 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 59
06/23/2010 05:14:23 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 92
06/23/2010 05:13:24 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 0
06/23/2010 05:12:24 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: 10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 83
and here is what I see in my rrd export (please note that the two listings run in opposite time order: the log is newest first, the export is oldest first):
"2010-06-23 17:11:00","2.2800000000e+01"
"2010-06-23 17:12:00","4.9800000000e+01"
"2010-06-23 17:13:00","3.3200000000e+01"
"2010-06-23 17:14:00","5.5200000000e+01"
"2010-06-23 17:15:00","7.1650000000e+01"
"2010-06-23 17:16:00","5.9600000000e+01"
"2010-06-23 17:17:00","2.3000000000e+01"
"2010-06-23 17:18:00","0.0000000000e+00"
"2010-06-23 17:19:00","6.1666666667e+01"
"2010-06-23 17:20:00","8.0200000000e+01"
"2010-06-23 17:21:00","5.6516666667e+01"
"2010-06-23 17:22:00","5.8400000000e+01"
"2010-06-23 17:23:00","2.4533333333e+01"
"2010-06-23 17:24:00","4.4400000000e+01"
"2010-06-23 17:25:00","9.0033333333e+01"
"2010-06-23 17:26:00","7.5400000000e+01"
"2010-06-23 17:27:00","6.2083333333e+01"
1) It appears that the NetApp device is the true culprit for the holes in the graphs.
2) The rrdtool data doesn't match because of how rrdtool handles updates. Details: http://www.vandenbogaerdt.nl/rrdtool/process.php
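To make point 2 concrete, here is a minimal Python sketch of the time-weighted averaging that rrdtool is described as applying to GAUGE updates that do not land exactly on a step boundary. It assumes a 60-second step and that each polled value holds for the whole period since the previous poll; the helper names `hms` and `gauge_step_averages` are made up for this illustration, and real rrdtool additionally applies heartbeat and unknown-data rules that are not modelled here. The poll times and values are taken from the SPINE log lines quoted earlier in the thread.

```python
def hms(h, m, s):
    """Seconds since midnight, to keep the arithmetic in plain integers."""
    return h * 3600 + m * 60 + s


def gauge_step_averages(updates, step=60):
    """Average GAUGE updates over step-aligned intervals.

    updates: list of (time_in_seconds, value), sorted by time.
    Each value is assumed to hold from the previous update until its own
    timestamp. Only intervals fully covered by updates are returned,
    keyed by the interval's end time.
    """
    totals = {}   # interval end -> sum of value * seconds
    covered = {}  # interval end -> seconds covered so far
    for (t0, _), (t1, value) in zip(updates, updates[1:]):
        start = t0
        while start < t1:
            bucket_end = (start // step + 1) * step
            piece_end = min(bucket_end, t1)
            seconds = piece_end - start
            totals[bucket_end] = totals.get(bucket_end, 0) + value * seconds
            covered[bucket_end] = covered.get(bucket_end, 0) + seconds
            start = piece_end
    return {end: totals[end] / step
            for end in sorted(totals) if covered[end] == step}


# Polls copied from the SPINE debug output (oldest first).
polls = [
    (hms(17, 12, 24), 83),
    (hms(17, 13, 24), 0),
    (hms(17, 14, 23), 92),
    (hms(17, 15, 24), 59),
    (hms(17, 16, 23), 60),
]

averages = gauge_step_averages(polls)
for end, avg in averages.items():
    print(f"{end // 3600:02d}:{end % 3600 // 60:02d}:00 {avg:.2f}")
# 17:14:00 55.20
# 17:15:00 71.65
# 17:16:00 59.60
```

These three results agree with the 17:14:00, 17:15:00 and 17:16:00 rows of the rrd export (5.52e+01, 7.165e+01, 5.96e+01), which suggests the export is a re-sampled view of the logged polls rather than different data.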
It doesn't make any sense to me so far.
1. I would blame the NetApp as well if I weren't running an snmpwalk script on the same Cacti server that queries the same OID every second, and it gets data every time. What I mostly noticed is that during periods when CPU load is ~100, my snmpwalk query returns a value, but Cacti shows 0 there.
2. I use GAUGE in my data collection, so the rrd output should show the exact value it gets from the SNMP query, shouldn't it?
Please correct me if I'm wrong or missing something in this troubleshooting chain.
At one point I also suspected that my graph was retrieving data from the wrong data source (e.g. a mistake in the graph template configuration), but I double-checked and it looks correctly configured. Also, I collect the same CPU load from some other NetApp devices, which are probably less busy, and those collections have no "holes" so far.
Could it be related to the maximum number of SNMP values in one snmpbulk request, or to the number of parameters monitored on one device?
marijonas wrote: 2. I use GAUGE in my data collection, so rrd output should show exact value as is gets from snmp query, shouldn't it?
Not exactly. Please read the link I posted on how rrdtool compensates for time update offset.
marijonas wrote: Can it be related with max number of snmp values on one snmpbulk or amount of parameters monitored on one device?
Yes to both. Different snmp devices/versions behave differently with both instances. Testing will reveal which config options work best for each device class.
Hi,
I tried to debug the issue, but now I'm even more confused.
I suspected a faulty template, so I created another one from scratch with the very same OID collection and started collecting on two NetApp devices. This is what I got:
netApp-01: the original data source shows zeros all the time. The new data source collects data at about the same level as I get from the command line using snmpwalk (it stays in the 60-80% range).
netApp-02: the original data source collected what we assumed was correct data, with timing patterns that reflected the load cycle by hour of day. The data level was also about the same as what we originally got from the device's native tools (20-30% range). The new data source shows random numbers (without a timing pattern) in the 60-80% range.
I turned on debug so I could check what the data sources receive from snmpwalk, and it looks like Cacti is querying the right device with the right SNMP OID.
I attached the collection for netApp-02 for both the original and the new data source, for the same SNMP OID.
Any hints are more than welcome.
Attachment: Screenshot.png (Identical OID collections)
Looking in the poller cache, they're polling the exact same OIDs for the old and new templates, yet the results are wildly different?
I guess it's something with the devices then, as they're polling the same OID and you've verified the returned data in cacti.log. Time to contact the vendor.
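For anyone who wants to double-check the logged values before going to the vendor, a small Python sketch along these lines could pull the polled values out of cacti.log and list the polls that came back as 0. The regular expression is inferred only from the SPINE debug lines quoted in this thread, not from any Cacti specification, and the function names are hypothetical.

```python
import re
from datetime import datetime

# Field layout inferred from the SPINE lines posted in this thread (assumption).
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} [AP]M) - SPINE: .*?"
    r"dsname: (?P<ds>\w+), oid: (?P<oid>[.\d]+), value: (?P<value>\S+)$"
)


def parse_spine_line(line):
    """Return (timestamp, dsname, oid, value) or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%m/%d/%Y %I:%M:%S %p")
    return ts, m.group("ds"), m.group("oid"), m.group("value")


def zero_polls(lines, dsname):
    """Timestamps of polls for dsname whose logged value is exactly "0"."""
    hits = []
    for line in lines:
        parsed = parse_spine_line(line)
        if parsed and parsed[1] == dsname and parsed[3] == "0":
            hits.append(parsed[0])
    return hits


# Two lines copied from the debug output posted above.
sample = [
    "06/23/2010 05:19:24 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: "
    "10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 100",
    "06/23/2010 05:18:23 PM - SPINE: Poller[0] Host[23] DS[2912] SNMP: v2: "
    "10.245.208.77, dsname: netappCPUBusy, oid: .1.3.6.1.4.1.789.1.2.1.3.0, value: 0",
]

for ts in zero_polls(sample, "netappCPUBusy"):
    print(ts.strftime("%H:%M:%S"))
```

Comparing those timestamps against the one-second snmpwalk output from the same server would show whether the zeros are really what the device returned to the poller at that moment.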