Bad SNMP Indexes - empty and/or duplicate graphs during automation

tophercer
Posts: 4
Joined: Thu Oct 22, 2020 11:55 am

Bad SNMP Indexes - empty and/or duplicate graphs during automation

Post by tophercer »

Hello,

I've got a server at my workplace that monitors a number of CMTSs and polls FEC data for each active interface. The intent is to give the server the IPs of our CMTSs, automate the creation of graphs on each CMTS, then create thresholds with the thold plugin and set up automatic emails when FEC levels are problematic. The server was up and running for about 8 months, and then in August I suddenly ran into problem after problem. The biggest issue at the moment is that individual CMTSs stop graphing, seemingly at random. Additionally, I sometimes (but not always) get duplicate graphs.

Here's an image showing it:
https://i.imgur.com/UjR5Ejq.png
The top left and bottom right graphs are the "originals". The top right and bottom left graphs were created automatically at the same time the originals stopped graphing. The duplicates stopped graphing as well after a few days, and nothing graphed for a few days, and then the originals suddenly started populating again.

The logfile gives a warning as follows:
12-Oct-2020 23:59:09 - POLLER: Poller[Main Poller] WARNING: You have 3 Devices with bad SNMP Indexes. Devices: Device[Bangor-CBR8], Device[Nazareth-CBR8], Device[Pburg-CBR8] totalling 1096 Data Sources. Please Either Re-Index, Delete or Disable these Data Sources.
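
To line these warnings up against the automation schedule, the device names and Data Source count can be pulled out of each log line. The following is a minimal sketch that assumes the warning always has the wording shown above; the function name and the regular expressions are illustrative and are not part of Cacti.

```python
import re

# Pattern written against the single warning line quoted above (an assumption:
# other Cacti versions may word the message differently).
WARNING_RE = re.compile(
    r"You have (?P<count>\d+) Devices with bad SNMP Indexes\. "
    r"Devices: (?P<devices>.*?) totalling (?P<sources>\d+) Data Sources"
)
DEVICE_RE = re.compile(r"Device\[([^\]]+)\]")


def parse_bad_index_warning(line):
    """Return (device_names, data_source_count), or None if the line does not match."""
    match = WARNING_RE.search(line)
    if match is None:
        return None
    devices = DEVICE_RE.findall(match.group("devices"))
    return devices, int(match.group("sources"))


line = (
    "12-Oct-2020 23:59:09 - POLLER: Poller[Main Poller] WARNING: You have 3 Devices "
    "with bad SNMP Indexes. Devices: Device[Bangor-CBR8], Device[Nazareth-CBR8], "
    "Device[Pburg-CBR8] totalling 1096 Data Sources. Please Either Re-Index, Delete "
    "or Disable these Data Sources."
)
print(parse_bad_index_warning(line))
# (['Bangor-CBR8', 'Nazareth-CBR8', 'Pburg-CBR8'], 1096)
```

Running this over the whole log and noting the first timestamp at which each device appears makes the comparison with the nightly automation runs easier.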

My automation schedule runs overnight every night, at staggered times from 1am to 6am, so that the runs do not overlap even when they have a lot of graphs to populate. The timestamp of the first SNMP warning that lists a particular CMTS matches the time at which that CMTS is automatically re-scanned. The timestamp of the duplicate graphs (if any are generated) also lines up with the time the automation is scheduled to run. If I re-index the devices as the warning message suggests, sometimes nothing happens, and sometimes it will "flip" the polling back to the original graph. If a duplicate graph was created, that duplicate will stop populating as the original resumes. Sometimes the graphs also "flip" back on their own without any action on my part. I have not found any surefire way to force an older graph to begin populating again. A duplicate graph is sometimes created on the same day that an original stops graphing, sometimes a few days later, and sometimes not at all.

Upon checking the RRD files, I've found that new RRDs are being generated and the old ones are being "abandoned". The old ones are still present in the appropriate folder, but their last-modified dates no longer advance. The number of RRDs processed in each polling cycle is the same as it was before the duplicates were created. The total number of graphs on each device is double what it should be.
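
The last-modified check described above can be scripted. This is a rough sketch, assuming the RRD files sit under a single directory tree and that "abandoned" means not modified within a chosen number of seconds; the function name and the one-hour threshold are arbitrary choices for illustration.

```python
import time
from pathlib import Path


def find_stale_rrds(rra_dir, max_age_seconds, now=None):
    """Return sorted relative paths of .rrd files not modified within max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    # rglob also descends into subdirectories, in case the files are not stored flat.
    for path in Path(rra_dir).rglob("*.rrd"):
        if now - path.stat().st_mtime > max_age_seconds:
            stale.append(str(path.relative_to(rra_dir)))
    return sorted(stale)


if __name__ == "__main__":
    # "." is a placeholder; point this at the directory that holds the RRD files.
    for name in find_stale_rrds(".", max_age_seconds=3600):
        print(name)
```

Comparing the stale list before and after an automation run shows which files were abandoned at that point.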

Other problems I've dealt with lately are SQL failures, gaps in the graphs, and poller timeouts, which all seem to be related and which suddenly stopped happening about a week ago, for no reason that I can find. I had thought that the SQL errors might be causing this issue as well (although the timestamps did not line up), but now I'm not sure. That said, I haven't actually seen any graphs "split" like this in the past few days, so perhaps they were related after all. If they are relevant, I can post some excerpts of those logs as well. At this point I have the thold plugin entirely disabled, because enabling it caused so many graph gaps. If I enable it again, I suspect it will trigger SQL failures again, which will be my next issue to tackle. I have not made any changes to this server in the past week, to try to collect as much data as possible, but I have not ruled out networking changes during that time frame.

I greatly appreciate any help that anybody can provide.
TheWitness
Developer
Posts: 17007
Joined: Tue May 14, 2002 5:08 pm
Location: MI, USA

Re: Bad SNMP Indexes - empty and/or duplicate graphs during automation

Post by TheWitness »

You've likely picked the wrong index. Find the primary key, meaning the attribute of the monitored object that never changes, and make sure it is both the sort field and the index for the Data Query. If you don't, your graphs will keep breaking.

If a re-index event causes that "index" field (one other than the primary key) to change, you need to set the re-index trigger to either "index_count_changed", "uptime_goes_backward", or "verify all indexes" (which will always re-index). The latter adds more overhead, but if you cannot arrive at something where the primary key is fixed, and you have no other way to re-index, then you are stuck with it.
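
To illustrate why an unstable index breaks graphs, here is a small sketch that compares two snapshots of an index table and reports which objects were renumbered. The index values and interface names are made up for the example, and the function is not part of Cacti.

```python
def find_shifted_indexes(before, after):
    """Compare two {snmp_index: stable_key} snapshots.

    Returns {stable_key: (old_index, new_index)} for every object whose
    SNMP index differs between the two snapshots.
    """
    old_by_key = {key: index for index, key in before.items()}
    new_by_key = {key: index for index, key in after.items()}
    shifted = {}
    for key, old_index in old_by_key.items():
        new_index = new_by_key.get(key)
        if new_index is not None and new_index != old_index:
            shifted[key] = (old_index, new_index)
    return shifted


# Hypothetical snapshots: the first upstream is renumbered from 1001 to 2001.
before = {"1001": "Cable1/0/0-upstream0", "1002": "Cable1/0/0-upstream1"}
after = {"2001": "Cable1/0/0-upstream0", "1002": "Cable1/0/0-upstream1"}
print(find_shifted_indexes(before, after))
# {'Cable1/0/0-upstream0': ('1001', '2001')}
```

Note that both snapshots contain two entries, so in this scenario a trigger that only watches the number of indexes would presumably not notice the change, while a Data Query indexed by the stable name would be unaffected.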

I prefer to write little plugins for data like this that gather the data asynchronously; the Data Query then simply graphs data that is always indexed by the primary key.
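
As a rough sketch of that approach, the script below answers Data Query requests from a cache file that a separate collector is assumed to refresh in the background. Everything specific here is an assumption for illustration: the cache path, the JSON layout, the field names, the "!" delimiter, and the `index` / `query <field>` / `get <field> <key>` argument convention, all of which should be checked against the Data Query documentation and the XML definition actually in use.

```python
import json
import sys

# Hypothetical cache written by a separate collector, laid out as
# {"<stable key>": {"<field>": <value>, ...}, ...}.
CACHE_PATH = "/tmp/fec_cache.json"
# Assumed output delimiter; it has to match the Data Query definition.
DELIMITER = "!"


def handle(args, cache):
    """Return output lines for 'index', 'query <field>' or 'get <field> <key>'."""
    if args[:1] == ["index"]:
        return sorted(cache)
    if len(args) == 2 and args[0] == "query":
        field = args[1]
        return [f"{key}{DELIMITER}{cache[key].get(field, '')}" for key in sorted(cache)]
    if len(args) == 3 and args[0] == "get":
        field, key = args[1], args[2]
        return [str(cache.get(key, {}).get(field, ""))]
    return []


if __name__ == "__main__":
    try:
        with open(CACHE_PATH) as cache_file:
            cache = json.load(cache_file)
    except FileNotFoundError:
        cache = {}
    print("\n".join(handle(sys.argv[1:], cache)))
```

Because every lookup is keyed by the stable name rather than by an SNMP index, renumbering on the device would not change which graph a value belongs to.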

I hope that helps. I suggest you get more familiar with Data Queries. Script Server based queries are a middle ground, as you have far more control than with plain SNMP.
True understanding begins only when we realize how little we truly understand...

Life is an adventure, let yours begin with Cacti!

Author of dozens of Cacti plugins and customization's. Advocate of LAMP, MariaDB, IBM Spectrum LSF and the world of batch. Creator of IBM Spectrum RTM, author of quite a bit of unpublished work and most of Cacti's bugs.
_________________
Official Cacti Documentation
GitHub Repository with Supported Plugins
Percona Device Packages (no support)
Interesting Device Packages


For those wondering, I'm still here, but lost in the shadows. Yearning for less bugs. Who want's a Cacti 1.3/2.0? Streams anyone?
tophercer
Posts: 4
Joined: Thu Oct 22, 2020 11:55 am

Re: Bad SNMP Indexes - empty and/or duplicate graphs during automation

Post by tophercer »

Sorry, I'm not sure what you mean. I don't know where I would go to pick an index. The devices that I have graphing were all automatically detected as the only IPs on the networks set up under the Automation section, using a sysDescr of "Linux". The graph rule selects for an h.description including "CBR8", an ifDescr including "upstream", and an ifAlias that is not empty. I see options in the data query to choose script data or script server data instead of SNMP data, but I don't think that's what you were referring to. I see options in the Device Defaults section of the Settings page to choose between "uptime", "index count", and "verify all". Uptime is what is currently selected. But I don't see that option within individual devices or within the Device Template or Device Rule used to make them. Do I need to look at the XML file directly? Would it suffice to change the defaults to "verify all" and then delete and re-scan all devices? I'd hate to lose so much data, but if it will make things consistent going forward, I'll do it.

A side note I forgot to mention in the original post: this server is one of 3 running Cacti to monitor these devices, and the other two servers aren't experiencing this problem. This one ran for 8 months without an issue and only started acting up around the time I disabled the thold plugin to investigate the graph gaps and poller timeouts. The 3 CMTSs that have shown this error so far had no major changes made to them around the time the issue started.
TheWitness
Developer
Posts: 17007
Joined: Tue May 14, 2002 5:08 pm
Location: MI, USA

Re: Bad SNMP Indexes - empty and/or duplicate graphs during automation

Post by TheWitness »
