v1.2.0: RECACHE ASSERT FAILED, poller Invalid Response(s)


ringo
Posts: 13
Joined: Wed Jan 16, 2019 9:35 am

v1.2.0: RECACHE ASSERT FAILED, poller Invalid Response(s)

Post by ringo »

After upgrading from v1.1.38 to v1.2.0, I have major polling/graphing problems. I think I'm dealing with two problems, and I've been able to temporarily work around them by disabling device re-indexing.


First off, the poller seems to be improperly detecting device uptimes and therefore repeatedly running a recache on many devices when it isn't needed, causing high poller run times that exceed the maximum run time, early exits, and gaps in many graphs.

Second, when this device recache event takes place, it sometimes updates/changes the ifIndex numbers in the SQL table data_local (column snmp_index), which in turn, after a poller rebuild for that device, changes the OID value in the SQL table poller_item (column arg1). This then produces invalid SNMP queries and lost graph data.


An example of this:

First issue...

Here is a snippet of the Cacti logs showing the uptime comparisons; some compare values in two different formats:
POLLER: Poller[1] Device[1112] DQ[11] RECACHE ASSERT FAILED '945646764<109:10:51:41.74
POLLER: Poller[1] Device[901] DQ[11] RECACHE ASSERT FAILED '972878730<112:14:29:56.22
POLLER: Poller[1] Device[1179] DQ[11] RECACHE ASSERT FAILED '797702<0:2:16:55.59
POLLER: Poller[1] Device[937] DQ[11] RECACHE ASSERT FAILED '31:9:57:15.19<31:10:02:14.95
POLLER: Poller[1] Device[398] DQ[11] RECACHE ASSERT FAILED '158:9:55:26.33<158:10:00:26.96
POLLER: Poller[1] Device[944] DQ[11] RECACHE ASSERT FAILED '45:9:56:38.18<45:10:01:37.98
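Those two uptime formats are both SNMP TimeTicks: a raw centisecond counter versus the rendered 'days:hours:minutes:seconds.centiseconds' form. A rough sketch (my own code, not Cacti's) that normalizes both to ticks shows the mixed-format pairs above actually describe uptimes only a few minutes apart, so it's the comparison of the two representations that misfires:

```python
import re

def uptime_to_ticks(value):
    """Convert an SNMP sysUpTime string to integer centisecond ticks.
    Accepts raw TimeTicks ('945646764') or the formatted
    'days:hours:minutes:seconds.centiseconds' form ('109:10:51:41.74')."""
    m = re.fullmatch(r'(\d+):(\d+):(\d+):(\d+)\.(\d+)', value.strip())
    if m:
        d, h, mi, s, cs = (int(g) for g in m.groups())
        return (((d * 24 + h) * 60 + mi) * 60 + s) * 100 + cs
    return int(value)

# Device[1112]: the cached and current uptimes differ by only ~4 minutes...
print(uptime_to_ticks('945646764'))        # 945646764
print(uptime_to_ticks('109:10:51:41.74'))  # 945670174
# ...but comparing the two strings directly is meaningless:
print('945646764' < '109:10:51:41.74')     # False ('9' > '1' lexically)
```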

Looking at a few hosts, you can see the different formats in the SQL data:
mysql> select * from poller_reindex where host_id in (1110,1113,1053,871,886,714,693);
+---------+---------------+--------+---------+----+-----------------+--------------------+
| host_id | data_query_id | action | present | op | assert_value | arg1 |
+---------+---------------+--------+---------+----+-----------------+--------------------+
| 693 | 11 | 0 | 1 | < | 3656718 | .1.3.6.1.2.1.1.3.0 |
| 714 | 11 | 0 | 1 | < | 1386029135 | .1.3.6.1.2.1.1.3.0 |
| 871 | 11 | 0 | 1 | < | 9:10:14:44.56 | .1.3.6.1.2.1.1.3.0 |
| 886 | 11 | 0 | 1 | < | 504742323 | .1.3.6.1.2.1.1.3.0 |
| 1053 | 11 | 0 | 1 | < | 82:10:13:46.73 | .1.3.6.1.2.1.1.3.0 |
| 1110 | 11 | 0 | 1 | < | 112:12:41:30.80 | .1.3.6.1.2.1.1.3.0 |
| 1113 | 11 | 0 | 1 | < | 405:13:21:33.96 | .1.3.6.1.2.1.1.3.0 |
+---------+---------------+--------+---------+----+-----------------+--------------------+

Looking at a specific device, the assert_value in the poller_reindex table does seem to alternate between the two formats, e.g.:
+---------+---------------+--------+---------+----+--------------+--------------------+
| host_id | data_query_id | action | present | op | assert_value | arg1 |
+---------+---------------+--------+---------+----+--------------+--------------------+
| 1110 | 11 | 0 | 1 | < | 971115165 | .1.3.6.1.2.1.1.3.0 |
+---------+---------------+--------+---------+----+--------------+--------------------+
+---------+---------------+--------+---------+----+-----------------+--------------------+
| host_id | data_query_id | action | present | op | assert_value | arg1 |
+---------+---------------+--------+---------+----+-----------------+--------------------+
| 1110 | 11 | 0 | 1 | < | 112:22:51:33.41 | .1.3.6.1.2.1.1.3.0 |
+---------+---------------+--------+---------+----+-----------------+--------------------+

Whatever this issue is, it appears to be the cause of the excessive device re-indexing.



Second issue.....

When the reindex happens for a device, snmp_index values can change to invalid data:
SELECT * FROM data_local WHERE host_id=1315;
+-------+------------------+---------+---------------+------------+
| id | data_template_id | host_id | snmp_query_id | snmp_index |
+-------+------------------+---------+---------------+------------+
| 15899 | 54 | 1315 | 11 | 1/2/1 |
| 15898 | 54 | 1315 | 11 | 1/1/4 |
| 15897 | 54 | 1315 | 11 | 1/1/3 |
| 15896 | 54 | 1315 | 11 | 1/1/2 |
| 15895 | 54 | 1315 | 11 | 1/1/1 |
| 15900 | 54 | 1315 | 11 | 1/3/1 |
+-------+------------------+---------+---------------+------------+

After the rebuild_poller_cache run:
select host_id, arg1 from poller_item where host_id=1315;
+---------+--------------------------------+
| host_id | arg1 |
+---------+--------------------------------+
| 1315 | .1.3.6.1.2.1.31.1.1.1.10.1/3/1 |
| 1315 | .1.3.6.1.2.1.31.1.1.1.6.1/3/1 |
| 1315 | .1.3.6.1.2.1.31.1.1.1.10.1/1/1 |
| 1315 | .1.3.6.1.2.1.31.1.1.1.6.1/1/1 |
| 1315 | .1.3.6.1.2.1.31.1.1.1.10.1/1/2 |
| 1315 | .1.3.6.1.2.1.31.1.1.1.6.1/1/2 |
| 1315 | .1.3.6.1.2.1.31.1.1.1.10.1/1/3 |
| 1315 | .1.3.6.1.2.1.31.1.1.1.6.1/1/3 |
| 1315 | .1.3.6.1.2.1.31.1.1.1.10.1/1/4 |
| 1315 | .1.3.6.1.2.1.31.1.1.1.6.1/1/4 |
| 1315 | .1.3.6.1.2.1.31.1.1.1.10.1/2/1 |
| 1315 | .1.3.6.1.2.1.31.1.1.1.6.1/2/1 |
+---------+--------------------------------+
These OID values are incorrect, producing SNMP query failures, and the graphs stop working.
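For the interface data query specifically, a valid snmp_index is the numeric IF-MIB ifIndex; appending anything else to the base OID yields an OID the agent cannot resolve. A hypothetical sanity check (the base OID below is ifHCOutOctets, taken from the arg1 values above):

```python
import re

BASE_OID = '.1.3.6.1.2.1.31.1.1.1.10'  # ifHCOutOctets, as seen in arg1 above

def index_is_numeric(snmp_index):
    """For the interface data query, snmp_index should be an ifIndex,
    i.e. a plain positive integer."""
    return re.fullmatch(r'[0-9]+', snmp_index) is not None

for idx in ('200', '1/2/1', 'TenGigE0/1/0/3.100'):
    status = 'ok' if index_is_numeric(idx) else 'INVALID'
    print(f'{BASE_OID}.{idx}  {status}')
```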



To FIX this issue, I have had to disable the code in cmd.php that performs the ASSERT check (the part that sets "$assert_fail = true;"), which disables the excessive device re-indexing. Then I restored the data in the data_local table from the pre-upgrade backup. Finally, I rebuilt the poller cache. This restores the OIDs and graphing continues. Of course this is not a permanent fix, since re-indexing is now disabled.
Osiris
Cacti Guru User
Posts: 1424
Joined: Mon Jan 05, 2015 10:10 am

Re: v1.2.0: RECACHE ASSERT FAILED, poller Invalid Response(s

Post by Osiris »

Please update lib/snmp.php from the develop branch. This was recently re-reported. Better yet, use spine. It does not have these problems.
Before history, there was a paradise, now dust.
ringo
Posts: 13
Joined: Wed Jan 16, 2019 9:35 am

Re: v1.2.0: RECACHE ASSERT FAILED, poller Invalid Response(s

Post by ringo »

I did see the "update lib/snmp.php from the develop branch" post and already tried it, but that didn't clear up the problems. I haven't tried using spine yet; I'll give that a shot. Thanks.
ringo
Posts: 13
Joined: Wed Jan 16, 2019 9:35 am

Re: v1.2.0: RECACHE ASSERT FAILED, poller Invalid Response(s

Post by ringo »

I found instructions to compile spine and configure the SQL credentials, but nothing about what to do after that. I figured out that in Console > Configuration > Settings > Poller I can change the poller type to spine. Is that it? It didn't work so well...

Looks like a complete failure.

2019/01/16 22:45:01 - SYSTEM WARNING: Primary Admin account notifications disabled! Unable to send administrative Email.
2019/01/16 22:50:00 - POLLER: Poller[1] Maximum runtime of 298 seconds exceeded. Exiting.
2019/01/16 22:50:00 - SYSTEM WARNING: Primary Admin account notifications disabled! Unable to send administrative Email.
2019/01/16 22:50:00 - SNMPAGENT WARNING: No notification receivers configured for event: cactiNotifyPollerRuntimeExceeding (CACTI-MIB)
2019/01/16 22:50:00 - SYSTEM STATS: Time:298.3263 Method:spine Processes:10 Threads:15 Hosts:644 HostsPerProcess:65 DataSources:14900 RRDsProcessed:0
2019/01/16 22:50:03 - SYSTEM THOLD STATS: Time:0.7035 Tholds:0 TotalDevices:644 DownDevices:17 NewDownDevices:0
2019/01/16 22:55:01 - POLLER: Poller[1] Maximum runtime of 298 seconds exceeded. Exiting.
2019/01/16 22:55:01 - SYSTEM WARNING: Primary Admin account notifications disabled! Unable to send administrative Email.
2019/01/16 22:55:01 - SNMPAGENT WARNING: No notification receivers configured for event: cactiNotifyPollerRuntimeExceeding (CACTI-MIB)
2019/01/16 22:55:01 - SYSTEM STATS: Time:298.6826 Method:spine Processes:10 Threads:15 Hosts:644 HostsPerProcess:65 DataSources:14900 RRDsProcessed:0
2019/01/16 22:55:03 - SYSTEM THOLD STATS: Time:0.3850 Tholds:0 TotalDevices:644 DownDevices:17 NewDownDevices:0
2019/01/16 23:00:01 - POLLER: Poller[1] Maximum runtime of 298 seconds exceeded. Exiting.
2019/01/16 23:00:01 - SYSTEM WARNING: Primary Admin account notifications disabled! Unable to send administrative Email.
2019/01/16 23:00:01 - SNMPAGENT WARNING: No notification receivers configured for event: cactiNotifyPollerRuntimeExceeding (CACTI-MIB)
2019/01/16 23:00:01 - SYSTEM STATS: Time:298.5854 Method:spine Processes:10 Threads:15 Hosts:644 HostsPerProcess:65 DataSources:14900 RRDsProcessed:0

EDIT: Yep, confirmed. I have another gap in all my graphs now.
ringo
Posts: 13
Joined: Wed Jan 16, 2019 9:35 am

Re: v1.2.0: RECACHE ASSERT FAILED, poller Invalid Response(s

Post by ringo »

After trying some things and rebuilding the poller cache, my poller cache data has again been messed up, and I will have to restore from backup. Here's a small snippet for a single device and one interface:



from poller_item:
+---------+---------------------------------------------+---------------------+
| host_id | arg1 | last_updated |
+---------+---------------------------------------------+---------------------+
| 1298 | .1.3.6.1.2.1.31.1.1.1.6.TenGigE0/1/0/3.100 | 2019-01-17 10:35:04 |
| 1298 | .1.3.6.1.2.1.31.1.1.1.10.TenGigE0/1/0/3.100 | 2019-01-17 10:35:04 |
+---------+---------------------------------------------+---------------------+

from host_snmp_cache:
+---------+---------------+--------------+--------------------+------------+------------------------------+---------+---------------------+
| host_id | snmp_query_id | field_name | field_value | snmp_index | oid | present | last_updated |
+---------+---------------+--------------+--------------------+------------+------------------------------+---------+---------------------+
| 1298 | 11 | ifDescr | TenGigE0/1/0/3.100 | 200 | .1.3.6.1.2.1.2.2.1.2.200 | 1 | 2019-01-15 20:23:50 |
| 1298 | 11 | ifHighSpeed | 10000 | 200 | .1.3.6.1.2.1.31.1.1.1.15.200 | 1 | 2019-01-15 20:23:50 |
| 1298 | 11 | ifHwAddr | | 200 | .1.3.6.1.2.1.2.2.1.6.200 | 1 | 2019-01-15 20:23:50 |
| 1298 | 11 | ifIndex | 200 | 200 | .1.3.6.1.2.1.2.2.1.1.200 | 1 | 2019-01-15 20:23:50 |
| 1298 | 11 | ifName | TenGigE0/1/0/3.100 | 200 | .1.3.6.1.2.1.31.1.1.1.1.200 | 1 | 2019-01-15 20:23:50 |
| 1298 | 11 | ifOperStatus | Up | 200 | .1.3.6.1.2.1.2.2.1.8.200 | 1 | 2019-01-15 20:23:50 |
| 1298 | 11 | ifSpeed | 4294967295 | 200 | .1.3.6.1.2.1.2.2.1.5.200 | 1 | 2019-01-15 20:23:50 |
| 1298 | 11 | ifType | l2vlan | 200 | .1.3.6.1.2.1.2.2.1.3.200 | 1 | 2019-01-15 20:23:50 |
+---------+---------------+--------------+--------------------+------------+------------------------------+---------+---------------------+

The poller cache rebuild seems to have used the ifName value instead of the ifIndex. This obviously causes the SNMP query "invalid response(s)".
Osiris
Cacti Guru User
Posts: 1424
Joined: Mon Jan 05, 2015 10:10 am

Re: v1.2.0: RECACHE ASSERT FAILED, poller Invalid Response(s

Post by Osiris »

What Cacti version are you on? There were a few issues with Develop. The best way to solve them is to delete the data query and then re-add it. Restoring from backup might even be better.
ringo
Posts: 13
Joined: Wed Jan 16, 2019 9:35 am

Re: v1.2.0: RECACHE ASSERT FAILED, poller Invalid Response(s

Post by ringo »

Upgraded to Cacti v1.2.0.

I have restored some data from the SQL backup to get the graph data flowing again. However, I have had to disable the re-indexing check in cmd.php.

I've removed this 3-way if block:
if (($index_item['op'] == '=') && ($index_item['assert_value'] != trim($output))) {
...
} elseif (($index_item['op'] == '>') && ($index_item['assert_value'] < trim($output))) {
...
} elseif (($index_item['op'] == '<') && ($index_item['assert_value'] > trim($output))) {
...
}
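For reference, the asserts above model as the following sketch (Python for illustration, not the PHP itself; PHP's loose comparison rules differ, but the mixed-format problem is the same): the reindex fires whenever the stored assertion no longer holds against the fresh SNMP output, and both operands are strings.

```python
def assert_failed(op, assert_value, output):
    """Model of the cmd.php checks: return True when the stored
    assertion no longer holds, triggering a device reindex."""
    output = output.strip()
    if op == '=':
        return assert_value != output
    if op == '>':
        return assert_value < output
    if op == '<':
        return assert_value > output
    return False

# With op '<' on sysUpTime, a same-format pair behaves sensibly...
print(assert_failed('<', '945646764', '945670174'))        # False: uptime grew
# ...but a mixed-format pair trips the assert spuriously:
print(assert_failed('<', '945646764', '109:10:51:41.74'))  # True
```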

It will continue to graph as is, but if I re-enable that, it totally messes up the poller info as described previously and wrecks all my graphs. As best I can tell so far, something is using ifName values to populate the data_local table when it should be using ifIndex, because that value is then appended to the OID in the poller_item table. I have more code digging to do; I would love some hints.
Osiris
Cacti Guru User
Posts: 1424
Joined: Mon Jan 05, 2015 10:10 am

Re: v1.2.0: RECACHE ASSERT FAILED, poller Invalid Response(s

Post by Osiris »

You can re-enable it, but before you do, inspect the poller_reindex table and remove the bad entries. Also remember to update lib/snmp.php or start using spine first. Keep an eye out for 1.2.1; it has some important fixes.
ringo
Posts: 13
Joined: Wed Jan 16, 2019 9:35 am

Re: v1.2.0: RECACHE ASSERT FAILED, poller Invalid Response(s

Post by ringo »

I've already tried the updated snmp.php; it didn't help. Spine didn't work. I can't re-enable the assert check because it will corrupt the data; I've already tried a couple of times. Is there a way to export all the device/graph config and re-import it into a new install? I see export/import for templates, but that seems to be it.
Osiris
Cacti Guru User
Posts: 1424
Joined: Mon Jan 05, 2015 10:10 am

Re: v1.2.0: RECACHE ASSERT FAILED, poller Invalid Response(s

Post by Osiris »

Just focus on getting spine to work. I mean it's pretty simple. Just take your time.
ringo
Posts: 13
Joined: Wed Jan 16, 2019 9:35 am

Re: v1.2.0: RECACHE ASSERT FAILED, poller Invalid Response(s

Post by ringo »

Thanks for the tips. I'm thinking I may have different issues than the poller itself. I'll continue to troubleshoot. Now it's starting to look like part of my issue might be due to changes in the interface names on the interfaces being graphed. If a name changes, what does Cacti do? When a reindex is done on a device where an interface name has changed, does Cacti consider the graph gone because the name is different?
Osiris
Cacti Guru User
Posts: 1424
Joined: Mon Jan 05, 2015 10:10 am

Re: v1.2.0: RECACHE ASSERT FAILED, poller Invalid Response(s

Post by Osiris »

IfName is pretty static for 99+% of the devices out there. Make sure you test spine before you enable it; that is a common issue. Additionally, let us know your PHP version and whether or not php-snmp is installed.
ringo
Posts: 13
Joined: Wed Jan 16, 2019 9:35 am

Re: v1.2.0: RECACHE ASSERT FAILED, poller Invalid Response(s

Post by ringo »

Oops, when I said "interface names" in my last post, I meant the description configured on the interface in the device, so I guess that would be the SNMP ifAlias value? What does Cacti do when that value changes?

E.g., for a verbose query on a device, I get this:
Total: 0.540000, Delta: 0.010000, Checking for Sort Field change. No changes detected.
Total: 0.540000, Delta: 0.000000, Update Data Query Sort Cache complete
Total: 0.540000, Delta: 0.000000, Index Removal Detected! PreviousIndex: Port 1 - AirFiber 24ghz BH 2 Holl
Total: 0.540000, Delta: 0.010000, Remapping Graphs to their new Indexes
Total: 0.540000, Delta: 0.000000, Index Association with Local Data complete
Total: 0.570000, Delta: 0.030000, No Index Changes Detected, Skipping Re-Index and Poller Cache Re-population
Total: 0.570000, Delta: 0.000000, Automation Executing for Data Query complete
Total: 0.570000, Delta: 0.000000, Plugin hooks complete

The "Index Removal Detected" is, I think, a problem, and I'm trying to find out why the index was removed. The interface is NOT removed; obviously the device still has that interface. However, the description of that interface HAS indeed changed! I'm trying to figure out whether the removal is caused by the change of the description.
ringo
Posts: 13
Joined: Wed Jan 16, 2019 9:35 am

Re: v1.2.0: RECACHE ASSERT FAILED, poller Invalid Response(s

Post by ringo »

Sorry, this is getting long, but giving up is not really an option. The punch line on how to prevent the "Index Removal Detected" is at the bottom.

1 - lib/data_query.php does this (below), which I guess loads all the interface rows for the device from SQL into $local_data.

/* recalculate/change sort order */
$local_data = db_fetch_assoc_prepared('SELECT dl.id AS local_data_id, dl.host_id,
dl.snmp_query_id, dl.snmp_index,
did.data_input_field_id, did.data_template_data_id,
did.value AS query_index, "' . $old_sort_field . '" AS sort_field
FROM data_local AS dl
INNER JOIN data_template_data AS dtd
ON dl.id = dtd.local_data_id
INNER JOIN data_input_fields AS dif
ON dtd.data_input_id = dif.data_input_id
LEFT JOIN data_input_data AS did
ON dtd.id = did.data_template_data_id
AND dif.id = did.data_input_field_id
WHERE dif.data_name="index_value"
AND dl.snmp_query_id = ?
AND dl.host_id = ?',
array($snmp_query_id, $host_id));

DEBUG, for device in question: $old_sort_field="ifDescr", $snmp_query_id=11, and $host_id=155

2 - Then "foreach ($local_data as $data_source) {", i.e. for each SQL row found:
if (!$forced_type) {
// See if the index is in the host cache
$current_index = db_fetch_cell_prepared('SELECT snmp_index
FROM host_snmp_cache
WHERE host_id = ?
AND snmp_query_id = ?
AND field_name = ?
AND field_value = ?',
array($host_id, $snmp_query_id, $data_source['sort_field'], $data_source['query_index']));

DEBUG, for device/interface in question: host_id=155, snmp_query_id=11, field_name="ifDescr", and field_value='Port 1 - AirFiber 24ghz BH 2 Holl'

NOTE: So it is checking the host_snmp_cache table using field_name="ifDescr" and field_value='Port 1 - AirFiber 24ghz BH 2 Holl', and it's NOT finding anything.

I have found that $data_source['query_index'] comes from here:
mysql> select * from data_input_data where data_template_data_id=15422 and data_input_field_id=13;
+---------------------+-----------------------+---------+-----------------------------------+
| data_input_field_id | data_template_data_id | t_value | value |
+---------------------+-----------------------+---------+-----------------------------------+
| 13 | 15422 | | Port 1 - AirFiber 24ghz BH 2 Holl |
+---------------------+-----------------------+---------+-----------------------------------+


3 - Further down, because it wasn't found, snmp_index is set to "" in data_local. After the poller cache update, this leaves the OID with a blank index number (.1.3.6.1.2.1.31.1.1.1.6.), and therefore graphing stops/fails.
if ($current_index != '') {
...
} elseif ($data_source['snmp_index'] != '' && !$forced_type) {
// Found a deleted index, masking off to prevent issues
query_debug_timer_offset('data_query', __('Index Removal Detected! PreviousIndex: %s', $data_source['query_index']));

db_execute_prepared('UPDATE data_local
SET snmp_index = ""
WHERE id = ?',
array($data_source['local_data_id']));

$removed_ids[] = $data_source['local_data_id'];
}
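In other words, the remap in step 2 is a straight key lookup of (sort_field, stored query_index) against host_snmp_cache, and step 3 treats a miss as a deleted index. A tiny model of that logic (hypothetical names, not the actual Cacti code) shows why a changed description turns into a "removal":

```python
# Minimal model of the step-2 lookup: host_snmp_cache keyed by
# (field_name, field_value); the value is the current snmp_index.
host_snmp_cache = {
    ('ifDescr', 'Port 1'): '1',  # what the device reports after the change
}

def current_index(sort_field, query_index):
    # Mirrors the SELECT against host_snmp_cache: an exact match on
    # field_name and field_value, else '' (which step 3 treats as removed).
    return host_snmp_cache.get((sort_field, query_index), '')

# The data source still stores the old description, so the lookup
# misses and the index is masked off ("Index Removal Detected"):
print(current_index('ifDescr', 'Port 1 - AirFiber 24ghz BH 2 Holl'))  # ''
print(current_index('ifDescr', 'Port 1'))                             # '1'
```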



Punchline:
If I change that data_input_data table row:
+---------------------+-----------------------+---------+-----------------------------------+
| data_input_field_id | data_template_data_id | t_value | value |
| 13 | 15422 | | Port 1 - AirFiber 24ghz BH 2 Holl |
to this:
| 13 | 15422 | | Port 1 |

Then, in #2 above, it will query using: host_id=155, snmp_query_id=11, field_name="ifDescr", and field_value='Port 1'

...and it finds it:
mysql> select * from host_snmp_cache where host_id=155 AND snmp_query_id=11 AND field_name = 'ifDescr' AND field_value = 'Port 1';
+---------+---------------+------------+-------------+------------+------------------------+---------+---------------------+
| host_id | snmp_query_id | field_name | field_value | snmp_index | oid | present | last_updated |
+---------+---------------+------------+-------------+------------+------------------------+---------+---------------------+
| 155 | 11 | ifDescr | Port 1 | 1 | .1.3.6.1.2.1.2.2.1.2.1 | 1 | 2019-01-22 00:38:22 |
+---------+---------------+------------+-------------+------------+------------------------+---------+---------------------+
1 row in set (0.00 sec)

and it does NOT remove the index.

My data_input_data table has many rows whose 'value' column has the format 'Port # - Descr', and others with just 'Port #'. The data is the same in my SQL backups, so it has always been this way; I'm not sure why this has only become a problem now.
Osiris
Cacti Guru User
Posts: 1424
Joined: Mon Jan 05, 2015 10:10 am

Re: v1.2.0: RECACHE ASSERT FAILED, poller Invalid Response(s

Post by Osiris »

So, in the 1.2 series, ifDescr was removed as the primary sort field in favor of ifName, which is more reliable. In your case it appears to be working. What you need to do is edit the data source for that interface and manually map the correct ifName and its corresponding index. In those rare cases where the ifDescr changed, there was no way to magically "guess" the new mapping. Once all your devices are on ifName, you should not have this problem moving forward.