Some graphs not populated, real time works, templates messed up I guess

Post general support questions here that do not specifically fall into the Linux or Windows categories.

Moderators: Developers, Moderators

Post Reply
nicolatron
Posts: 40
Joined: Mon Jan 28, 2008 4:11 am

Some graphs not populated, real time works, templates messed up I guess

Post by nicolatron »

Hello,

Recently my main traffic template was messed out, I guess I imported some F5 templates that includes dependencies and overwrite it.
When I noticed, the new template had a lot more items, with graphic percentages, graph totalling and stuff I don't need, also it didn't have the "turn bytes into bits, Make negative" CDEF that I use for output traffic. I use a very basic graph template, with output traffic as neggative, as I use to create aggregates on it.

I created a new Graph template with this in mind, and migrated all traffic graphs to that template.
graph_template.png
graph_template.png (116.16 KiB) Viewed 1482 times
Now many graphs are messed up. Many have different max [traffic_out] values defined on the rrd files, which seems to come from " SNMP - Interface Statistics " 'ifSpeed' field, with values like: 4294967295,
1000000000, depending on the interface.

In troubleshooting -> Data Sources debugging the recommend action for the "offending" graphs was to execute commands like:
/usr/bin/rrdtool tune /opt/cacti/rra/core-vss_traffic_in_291.rrd --maximum traffic_out:4294967295

So I thought to edit the data source "traffic_in" and "traffic_out" to hardcore a value adecuate to 10G interfaces, so all graphs have the same min and max values, then change all rrd accordingly.

So I changed the "Data Collections -> Data Queries" -> SNMP - Interface Statistics accordingly on my associated Graph Template "0 Interface - Traffic (bits/sec)" which has this Associated Data Templates:
associated_data_templates.png
associated_data_templates.png (34.93 KiB) Viewed 1482 times
In Templates->Data Source -> Interface - Traffic I disabed for both traffic_in and traffic_out the check to allow the user override on data source creation and set a max limit for both of them on 10000000000.
data_template.png
data_template.png (64.91 KiB) Viewed 1482 times
Now the graphs are not updated, on troubleshooting for the relevant data sources, I get no suggestion on fix. Just says "Waiting on Data Source Check to Complete" forever.

Looking at the Data source debug I see this:
data_source_info.png
data_source_info.png (73.67 KiB) Viewed 1482 times
My take on it ...It seems that after my changes to the data source max (not allowing user override, and thus not using the value returned on ifSpeed field) the way the data is treated has changed.
To me it seems the normal behaviour should be, on a given polling, take the value on the counter (the OID that is being polled), substract the value on the last polling, divide it by the number of seconds passed between polling, then compare it to the max and min values to see if it fits. Now it seems to just compare the value on the counter, finds it's higher than defined max and discard it.
Does this makes any sense?.
User avatar
TheWitness
Developer
Posts: 17059
Joined: Tue May 14, 2002 5:08 pm
Location: MI, USA
Contact:

Re: Some graphs not populated, real time works, templates messed up I guess

Post by TheWitness »

Many SNMP vendors return the incorrect ifHighSpeed and ifSpeed entries. We attempt to use ifHighSpeed regardless if it's set. If your vendor is not returning the correct ifHighSpeed, you should contact them and have them correct that. If that is not possible, setting the 'max' value to something other than |query_ifSpeed| is recommended. Then, you will need to "tune" your RRDfiles. Not much that can be done about that.

As to the other issues, we would need to know about your versions.
True understanding begins only when we realize how little we truly understand...

Life is an adventure, let yours begin with Cacti!

Author of dozens of Cacti plugins and customization's. Advocate of LAMP, MariaDB, IBM Spectrum LSF and the world of batch. Creator of IBM Spectrum RTM, author of quite a bit of unpublished work and most of Cacti's bugs.
_________________
Official Cacti Documentation
GitHub Repository with Supported Plugins
Percona Device Packages (no support)
Interesting Device Packages


For those wondering, I'm still here, but lost in the shadows. Yearning for less bugs. Who want's a Cacti 1.3/2.0? Streams anyone?
nicolatron
Posts: 40
Joined: Mon Jan 28, 2008 4:11 am

Re: Some graphs not populated, real time works, templates messed up I guess

Post by nicolatron »

Hello,

Thanks for your attention.

System is linux SLES 15 SP2
cacti-1.2.19
cacti-spine 1.2.19
mysql Ver 15.1 Distrib 10.4.22-MariaDB
RRDtool 1.7.0
NET-SNMP version: 5.7.3
php 7.4.6
I'm using boost. With one main cacti and 2 collectors (almost all of devices run on one of the collectors).
Only plugin I have installed (and enabled) is network-weathermap (phalek version for cacti-1x).

Just read when checking version, that I should remove php-snmp as I'm using snmp version 3. Just uninstalled it on all 3 servers.

Greetings,
Nico
nicolatron
Posts: 40
Joined: Mon Jan 28, 2008 4:11 am

Re: Some graphs not populated, real time works, templates messed up I guess

Post by nicolatron »

I keep having problems. And it has escalated.

I've pulled the plunge and set all traffic graph data sources trafic_in and traffic_out max "hardcoded" to 12500000000.
Changed all rrd accordingly.
find . -type f -name '*_traffic_*' -exec /usr/bin/rrdtool tune {} --maximum traffic_out:12500000000 \;
find . -type f -name '*_traffic_*' -exec /usr/bin/rrdtool tune {} --maximum traffic_in:12500000000 \;

The device is responding all right, with a sample:
OV01LNXVZ5:~ # snmpwalk_comms.sh isen .1.3.6.1.2.1.31.1.1.1.6.26769
SNMPv2-SMI::mib-2.31.1.1.1.6.26769 = Counter64: 2226435189767357

Real time graphs works.

But no graph is being created, rrds are not actualized.

Seems to me spine is not doing it's thing, for any of the devices with the problem, it seems to not gather snmp data at all. This is being executed from the poller in charge of device with id 188.

/usr/local/spine/bin/spine -R -V 5 -H 188
SPINE: Using spine config file [/etc/cacti/spine.conf]
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: The path_php_server variable is /opt/cacti/script_server.php
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: The path_cactilog variable is /opt/cacti/log/cacti.log
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: The log_destination variable is 1 (FILE)
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: The path_php variable is /usr/bin/php
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: The availability_method variable is 2
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: The ping_recovery_count variable is 3
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: The ping_failure_count variable is 2
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: The ping_method variable is 2
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: The ping_retries variable is 1
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: The ping_timeout variable is 400
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: The snmp_retries variable is 3
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: The log_perror variable is 1
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: The log_pwarn variable is 0
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: The boost_redirect variable is 1
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: The boost_rrd_update_enable variable is 1
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: The log_pstats variable is 0
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: The threads variable is 1
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: The polling interval is 300 seconds
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: The number of concurrent processes is 1
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: The script timeout is 25
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: The selective_device_debug variable is 333,38,188
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: The spine_log_level variable is 0
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: The number of php script servers to run is 1
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: The number of active data source profiles is 1
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: Device List to be polled='188', TotalPHPScripts='0'
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: The PHP Script Server is Not Required
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: The Maximum SNMP OID Get Size is 10
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: Selective Debug Devices 333,38,188
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: Total Connections made 1
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: Creating Local Connection Pool of 1 threads.
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: Creating Local Connection 0.
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: Total Connections made 2
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: Version 1.2.19 starting
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: MySQL is Thread Safe!
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: Spine running as 0 UID, 0 EUID
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: Spine is running as root.
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: Spine has got ICMP
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: Initializing Net-SNMP API
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: Issues with SNMP Header Version information, assuming old version of Net-SNMP.
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: Initializing PHP Script Server(s)
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] Spine will support multithread device polling.
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: Initial Value of Available Threads is 1 (0 outstanding)
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: Valid Thread to be Created (140597606397696)
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: Available Threads is 0 (1 outstanding)
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597635796736] WARNING: Device[0] polling sleeping while waiting for 1 Threads to End
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597606397696] DEBUG: Device[0] HT[1] In Poller, About to Start Polling
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597606397696] DEBUG: Traversing Local Connection Pool for free connection.
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597606397696] DEBUG: Checking Local Pool ID 0.
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597606397696] DEBUG: Allocating Local Pool ID 0.
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597606397696] Device[0] HT[1] Total Time: 0.0022 Seconds
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597606397696] DEBUG: Freeing Local Pool ID 0
2022-06-09 07:54:45 - SPINE: Poller[1] PID[9366] PT[140597606397696] DEBUG: Device[0] HT[1] DEBUG: HOST COMPLETE: About to Exit Device Polling Thread Function
2022-06-09 07:54:46 - SPINE: Poller[1] PID[9366] PT[140597635796736] The final count of Threads is 0
2022-06-09 07:54:46 - SPINE: Poller[1] PID[9366] PT[140597635796736] INFO: Thread complete for Device[0] and 0 to 0 sources
2022-06-09 07:54:46 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: Closing Local Connection Pool ID 0
2022-06-09 07:54:46 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: Thread Cleanup Complete
2022-06-09 07:54:46 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: PHP Script Server Pipes Closed
2022-06-09 07:54:46 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: Allocated Variable Memory Freed
2022-06-09 07:54:46 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: MYSQL Free & Close Completed
2022-06-09 07:54:46 - SPINE: Poller[1] PID[9366] PT[140597635796736] DEBUG: Net-SNMP Close Completed
2022-06-09 07:54:46 - SPINE: Poller[1] PID[9366] PT[140597635796736] Time: 0.5342 s, Threads: 1, Devices: 1

I've rebuilt poller cache a couple of times.

And if go to view poller cache, there are items for the device in question.

isen - Traffic - Bridge-Aggregation1 isen SNMP Version: 2, Community: xxx, OID: .1.3.6.1.2.1.31.1.1.1.6.26769
RRD: /opt/cacti/rra/isen_traffic_in_2213.rrd

I've noticed on the main cacti database there are 121 tables like this:
poller_output_boost_arch_1654718405
poller_output_boost_arch_1654718706
poller_output_boost_arch_1654719006
[...]
I guess that's normal.

I've disabled weathermap plugin, becouse althout it works has a tendency to flood my log with errors.

Now i'm seeing this in the log:
2022-06-09 08:25:04 - BOOST ERROR: Failed to retrieve archive table name

Did this:
./poller_boost.php --force --debug
DEBUG: Checking if Boost is ready to run.
DEBUG: Last Runtime was 2022-06-08 20:00:05 (1654711205).
DEBUG: Next Runtime is 2022-06-08 22:00:05 (1654718405).
DEBUG: Records Found:0, Max Threshold:1000000.
DEBUG: Time to Run Boost, Force Run is true!
DEBUG: Parallel Process Setup Begins.

Next runtime points to yesterday!
2022-06-08 22:00:05 (1654718405)

Now it's 2022-06-09 08:33:14 (spain time, gmt-2 i think)

Now I see this:
2022-06-09 08:25:04 - BOOST ERROR: Failed to retrieve archive table name
2022-06-09 08:25:04 - SYSTEM WARNING: Primary Admin account notifications disabled! Unable to send administrative Email.
2022-06-09 08:25:04 - BOOST SVR WARNING: Detected Poller Boost Overrun, Possible Boost Poller Crash
2022-06-09 08:25:04 - SYSTEM DSSTATS STATS: Type:HOURLY, Time:0.0252 RRDUser:0.0000 RRDSystem:0.0000 RRDReal:0.0000
2022-06-09 08:25:03 - SYSTEM STATS: Time:1.1287 Method:spine Processes:1 Threads:1 Hosts:0 HostsPerProcess:0 DataSources:0 RRDsProcessed:0
2022-06-09 08:23:30 - BOOST ERROR: Failed to retrieve archive table name
2022-06-09 08:23:30 - SYSTEM WARNING: Primary Admin account notifications disabled! Unable to send administrative Email.
2022-06-09 08:23:30 - BOOST SVR WARNING: Detected Poller Boost Overrun, Possible Boost Poller Crash
2022-06-09 08:23:30 - BOOST WARNING: Boost Poller forced by command line.
2022-06-09 08:20:05 - SYSTEM DSSTATS STATS: Type:HOURLY, Time:0.0100 RRDUser:0.0000 RRDSystem:0.0000 RRDReal:0.0000
2022-06-09 08:20:05 - BOOST ERROR: Failed to retrieve archive table name
2022-06-09 08:20:05 - SYSTEM WARNING: Primary Admin account notifications disabled! Unable to send administrative Email.
2022-06-09 08:20:05 - BOOST SVR WARNING: Detected Poller Boost Overrun, Possible Boost Poller Crash

Boost statistics also show next runtime will be in the past.
Captura-2022-06-09_08_34.png
Captura-2022-06-09_08_34.png (89.94 KiB) Viewed 1459 times
BTW, this is the environment where I removed one of the collectors yesterday. This boost error might be related to that, but the spine polling problem was happening before I removed the collector.

rebooted both servers and deleted all poller_output_boost_arch* tables on main poller database. Let's if that changes something...

Moved all devices to Main and spine is getting data now.

/usr/local/spine/bin/spine -R -V 1 -H 188
SPINE: Using spine config file [/etc/cacti/spine.conf]
Version 1.2.19 starting
2022-06-09 10:33:42 - SPINE: Poller[1] PID[4230] PT[139844555650816] DEBUG: Device[188] HT[1] In Poller, About to Start Polling
2022-06-09 10:33:42 - SPINE: Poller[1] PID[4230] PT[139844555650816] Device[188] DEBUG: Entering ICMP Ping
2022-06-09 10:33:42 - SPINE: Poller[1] PID[4230] PT[139844555650816] Device[188] DEBUG: Attempting to ping 10.96.110.53, seq 0 (Retry 0 of 1)
2022-06-09 10:33:42 - SPINE: Poller[1] PID[4230] PT[139844555650816] Device[188] DEBUG: ICMP Device Alive, Try Count:1, Time:3.1099 ms
2022-06-09 10:33:42 - SPINE: Poller[1] PID[4230] PT[139844555650816] Device[188] DEBUG: Entering SNMP Ping
2022-06-09 10:33:42 - SPINE: Poller[1] PID[4230] PT[139844555650816] Device[188] Updating Full System Information Table
2022-06-09 10:33:42 - SPINE: Poller[1] PID[4230] PT[139844555650816] DEBUG: Device[188] HT[1] RECACHE: Processing 1 items in the auto reindex cache for '10.96.110.53'
2022-06-09 10:33:42 - SPINE: Poller[1] PID[4230] PT[139844555650816] Device[188] HT[1] DQ[4] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 16274774700 < output: 16274782500)
2022-06-09 10:33:42 - SPINE: Poller[1] PID[4230] PT[139844555650816] Device[188] HT[1] NOTE: There are '140' Polling Items for this Device
2022-06-09 10:33:42 - SPINE: Poller[1] PID[4230] PT[139844555650816] Device[188] HT[1] DS[2171] TT[11.89] SNMP: v2: 10.96.110.53, dsname: traffic_in, oid: .1.3.6.1.2.1.31.1.1.1.6.1, value: 1861052715215273
2022-06-09 10:33:42 - SPINE: Poller[1] PID[4230] PT[139844555650816] Device[188] HT[1] DS[2171] TT[12.02] SNMP: v2: 10.96.110.53, dsname: traffic_out, oid: .1.3.6.1.2.1.31.1.1.1.10.1, value: 975880497816757
2022-06-09 10:33:42 - SPINE: Poller[1] PID[4230] PT[139844555650816] Device[188] HT[1] DS[2172] TT[12.09] SNMP: v2: 10.96.110.53, dsname: traffic_in, oid: .1.3.6.1.2.1.31.1.1.1.6.5, value: 2885633629142

Will wait a few polling cycles to see if everything starts to work again, then move them back to collector and see what happends.

In the meantime also changed poller from/to CMD.php/spine back and forth.

"Fixing" spine has seem to fix also boost problem and weird next schedule in the past.

Last Start Time: 2022-06-09 10:33:12
Last Run Duration: 50 seconds (1.00 percent of update frequency)
RRD Updates: 4,070
Peak Poller Memory: N/A
Max Poller Memory Allowed: 512 MBytes

Update Frequency: 2 Hours
Concurrent Processes: 2
Next Start Time: 2022-06-09 12:33:12
Maximum Records: 200,000 Records
Maximum Allowed Runtime: 20 Minutes

Update 20220609 11:18
Everything works when devices are polled from main.

If I move them to collector machine it fails, spine polls nothing and I get this error in log:
2022-06-09 10:48:04 - CMDPHP ERROR: A DB Exec Failed!, Error: Out of range value for column 'snmp_sysUpTimeInstance' at row 1
User avatar
TheWitness
Developer
Posts: 17059
Joined: Tue May 14, 2002 5:08 pm
Location: MI, USA
Contact:

Re: Some graphs not populated, real time works, templates messed up I guess

Post by TheWitness »

Well that out of range error is interesting. Wish we had @ notification like GitHub so I could get Sean Mancini to chime in. Why don't you open a ticket there.
True understanding begins only when we realize how little we truly understand...

Life is an adventure, let yours begin with Cacti!

Author of dozens of Cacti plugins and customization's. Advocate of LAMP, MariaDB, IBM Spectrum LSF and the world of batch. Creator of IBM Spectrum RTM, author of quite a bit of unpublished work and most of Cacti's bugs.
_________________
Official Cacti Documentation
GitHub Repository with Supported Plugins
Percona Device Packages (no support)
Interesting Device Packages


For those wondering, I'm still here, but lost in the shadows. Yearning for less bugs. Who want's a Cacti 1.3/2.0? Streams anyone?
Post Reply

Who is online

Users browsing this forum: No registered users and 1 guest