Bad SNMP Indexes - empty and/or duplicate graphs during automation

Post by **tophercer** » Thu Oct 22, 2020 12:53 pm

Hello,

I've got a server at my workplace that is monitoring a number of CMTSs and polling FEC data for each active interface. The intent behind the server is to give it the IPs of our CMTSs, automate the creation of graphs on each CMTS, then create thresholds with the thold plugin and set up automatic emails when FEC levels are problematic. For about 8 months I had the server up and running, and suddenly I ran into problem after problem in August. At the current time, the biggest issue that I'm experiencing is that individual CMTSs seemingly spontaneously stop graphing. Additionally, I sometimes (but not always) get duplicate graphs.

Here's an image showing it:
https://i.imgur.com/UjR5Ejq.png
The top left and bottom right graphs are the "originals". The top right and bottom left graphs were created automatically at the same time the originals stopped graphing. After a few days the duplicates stopped graphing as well, nothing graphed at all for a few more days, and then the originals suddenly started populating again.

The logfile gives a warning as follows:
12-Oct-2020 23:59:09 - POLLER: Poller[Main Poller] WARNING: You have 3 Devices with bad SNMP Indexes. Devices: Device[Bangor-CBR8], Device[Nazareth-CBR8], Device[Pburg-CBR8] totalling 1096 Data Sources. Please Either Re-Index, Delete or Disable these Data Sources.
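To line these warnings up against the automation schedule, it can help to pull the device names and data source count out of each log line. The following is a minimal sketch, assuming the warning always has the wording shown above; the function name `parse_bad_index_warning` is made up for illustration and is not part of Cacti.

```python
import re

# Pattern based only on the single warning line quoted in this thread;
# other Cacti versions may word the message differently (assumption).
WARNING_RE = re.compile(
    r"You have (?P<count>\d+) Devices with bad SNMP Indexes\. "
    r"Devices: (?P<devices>.*?) totalling (?P<ds>\d+) Data Sources"
)
DEVICE_RE = re.compile(r"Device\[([^\]]+)\]")


def parse_bad_index_warning(line):
    """Return the devices and data source count from one warning line, or None."""
    match = WARNING_RE.search(line)
    if match is None:
        return None
    return {
        "timestamp": line.split(" - ", 1)[0],
        "device_count": int(match.group("count")),
        "devices": DEVICE_RE.findall(match.group("devices")),
        "data_sources": int(match.group("ds")),
    }


SAMPLE = (
    "12-Oct-2020 23:59:09 - POLLER: Poller[Main Poller] WARNING: You have 3 "
    "Devices with bad SNMP Indexes. Devices: Device[Bangor-CBR8], "
    "Device[Nazareth-CBR8], Device[Pburg-CBR8] totalling 1096 Data Sources. "
    "Please Either Re-Index, Delete or Disable these Data Sources."
)

if __name__ == "__main__":
    print(parse_bad_index_warning(SAMPLE))
```

Running this over the whole log and noting the first timestamp at which each device appears gives a list that can be compared directly with the overnight automation times.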

My automation schedule runs overnight every night, with various times from 1am to 6am, so that they do not overlap even when they have a lot of graphs to populate. The timestamp of the first SNMP warning listing a particular CMTS matches up to the time when each CMTS would be re-scanned automatically. The timestamp of duplicate graphs (if any are generated) also lines up to the time that the automation is scheduled to run. If I re-index the devices as suggested by the warning message, sometimes nothing happens, and sometimes it will "flip" the polling back to the original graph. If a duplicate graph was created, that duplicate graph will stop populating as the original resumes. Also, sometimes the graphs will "flip" back on their own without any action on my part. I have not found any surefire way to force an older graph to begin populating again. Sometimes a duplicate graph isn't created on the same day that an original stops graphing, and sometimes it is. Sometimes a duplicate is created a few days later.

Upon checking the rrd files, I've found that new rrds are being generated, and the old ones are being "abandoned". The old ones are still present in the appropriate folder, but the last-modified date just sits there and shows them not changing. The number of rrds processed each polling cycle does not change from what it was before duplicates are created. The total number of graphs on each device doubles from what it should be.
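The "abandoned" files can be listed automatically by comparing each file's last-modified time against the polling interval. This is a rough sketch, not a Cacti tool: the function name, the 600-second threshold, and the directory passed on the command line are all assumptions to adjust for the actual install.

```python
import sys
import time
from pathlib import Path
import os


def split_rrds_by_age(rra_dir, max_age_seconds=600, now=None):
    """Split *.rrd files in rra_dir into (fresh, stale) lists of file names.

    A file counts as stale when its mtime is older than max_age_seconds.
    The threshold is a guess; pick something larger than the poller interval.
    """
    now = time.time() if now is None else now
    fresh, stale = [], []
    for path in sorted(Path(rra_dir).glob("*.rrd")):
        age = now - path.stat().st_mtime
        (stale if age > max_age_seconds else fresh).append(path.name)
    return fresh, stale


if __name__ == "__main__":
    # Usage: python stale_rrds.py /path/to/rra   (path is install-specific)
    fresh, stale = split_rrds_by_age(sys.argv[1])
    print(f"{len(fresh)} fresh, {len(stale)} stale")
    for name in stale:
        print("stale:", name)
```

If the stale count for a device roughly equals the fresh count, that matches the "doubling" described above: each old rrd has been replaced by a new one rather than updated.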

Other problems I've dealt with lately are SQL failures, gaps in the graphs, and poller timeouts, which all seem to be related and which suddenly stopped happening about a week ago, for no reason that I can find. I had thought that the SQL errors might be causing this issue as well (although the timestamps did not line up), but now I'm not sure. That said, I haven't actually seen any graphs "split" like this in the past few days, so perhaps they were related after all. If they are relevant, I can post some excerpts of those logs as well. At this point I have the thold plugin entirely disabled, because enabling it caused so many graph gaps. If I enable it again, I suspect it will trigger SQL failures again, which will be my next issue to tackle. I have not made any changes to this server in the past week, to try to collect as much data as possible, but I have not ruled out networking changes during that time frame.

I greatly appreciate any help that anybody can provide.

Post by **TheWitness** » Thu Oct 22, 2020 6:10 pm

You've likely picked an incorrect index. You should find the primary key, meaning whatever does not change about the object you are monitoring, and then ensure that it is both the sort field and the index for the Data Query. If you don't do that, your graphs will keep breaking.
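To illustrate why the choice of index matters, here is a small sketch in plain Python (not Cacti code). It assumes, purely for the example, that a device renumbers its interfaces so that two numeric indexes swap while the descriptive names stay the same. The field names `ifIndex` and `ifDescr`, the interface names, and the helper `rematch` are illustrative assumptions; which field is actually stable on a given CMTS has to be checked on the device.

```python
def rematch(key_field, old_rows, new_rows):
    """Map each old interface name to the name found under the same key later.

    old_rows / new_rows are lists of dicts describing the same interfaces
    before and after a hypothetical renumbering.
    """
    new_by_key = {row[key_field]: row for row in new_rows}
    return {
        row["ifDescr"]: new_by_key.get(row[key_field], {}).get("ifDescr")
        for row in old_rows
    }


# Hypothetical data: the two numeric indexes swap, the names do not change.
BEFORE = [
    {"ifIndex": 101, "ifDescr": "upstream0"},
    {"ifIndex": 102, "ifDescr": "upstream1"},
]
AFTER = [
    {"ifIndex": 101, "ifDescr": "upstream1"},
    {"ifIndex": 102, "ifDescr": "upstream0"},
]

if __name__ == "__main__":
    # Keyed by the volatile number, each graph now follows the wrong interface.
    print(rematch("ifIndex", BEFORE, AFTER))
    # Keyed by the stable name, each graph still follows the same interface.
    print(rematch("ifDescr", BEFORE, AFTER))
```

When the key is the value that changed, existing data sources end up pointing at the wrong object or at nothing at all, which is one plausible way to end up with abandoned originals and freshly created duplicates.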

If there is a re-index event that causes that "index" field (one that is not the primary key) to change, you need to set a trigger to either "index_count_changed", "uptime_goes_backward", or "verify all indexes" (which will always re-index). The latter adds more overhead, but if you cannot arrive at something where the primary key is fixed, and you have no other way to re-index, then you are stuck there.
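The difference between the three triggers can be sketched as a simple decision function. This is an illustration of the idea as described in this post, not Cacti's implementation; the function name, the dictionary layout, and the sample values are assumptions.

```python
def needs_reindex(method, prev, curr):
    """Decide whether a re-index should run, given two poll snapshots.

    prev / curr are dicts with 'uptime' (any increasing counter) and
    'indexes' (list of index values seen on the device).
    """
    if method == "uptime_goes_backward":
        # Only fires when the device appears to have restarted.
        return curr["uptime"] < prev["uptime"]
    if method == "index_count_changed":
        # Only fires when the number of indexes changes.
        return len(curr["indexes"]) != len(prev["indexes"])
    if method == "verify_all":
        # Per the post above, this option always re-indexes (more overhead).
        return True
    raise ValueError(f"unknown re-index method: {method}")


# Hypothetical case: no reboot, same number of indexes, but values changed.
PREV = {"uptime": 5000, "indexes": [101, 102]}
CURR = {"uptime": 5300, "indexes": [201, 202]}

if __name__ == "__main__":
    for method in ("uptime_goes_backward", "index_count_changed", "verify_all"):
        print(method, needs_reindex(method, PREV, CURR))
```

In this made-up case the first two triggers stay silent even though every index value changed, which is the situation where only the heavier "verify all" option would catch the change.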

I prefer to write little plugins for data like this that gather the data asynchronously, and then the Data Query simply graphs the data, which is always indexed by the primary key.

I hope that helps. I suggest you get more familiar with Data Queries. Script Server based Data Queries are a middle ground, as you have far more control than with plain SNMP.

Post by **tophercer** » Fri Oct 23, 2020 9:43 am

Sorry, I'm not sure what you mean. I don't know where I would go to pick an index. The devices that I have graphing were all automatically detected as the only IPs on the networks set up under the Automation section, using a sysDescr of "Linux". The graph rule selects for an h.description including "CBR8", an ifDesc including "upstream" and an ifAlias that is not empty. I see options in the data query to choose script data or script server data instead of SNMP data, but I don't think that's what you were referencing. I see options in the Device Defaults section of the Settings page to choose between "uptime", "index count", and "verify all". Uptime is what is currently selected. But I don't see that option within individual devices, or within the Device Template or Device Rule used to make them. Do I need to look at the xml file directly? Would it suffice to change the defaults to "verify all" and then delete and re-scan all devices? I'd hate to lose so much data, but if it will make things consistent going forward, I'll do it.

A side note I forgot to mention in the original post: this server is one of 3 running Cacti to monitor these devices, and the other two servers aren't experiencing this problem. This one was fine for 8 months without an issue, and only started acting up around the time I disabled the thold plugin to investigate the graph gaps and poller timeouts. The 3 CMTSs that have experienced this error so far did not have any major changes made to them around the time this issue started.

Post by **TheWitness** » Fri Oct 23, 2020 1:19 pm

https://github.com/Cacti/documentation/ ... Queries.md
