Thold log occasionally reporting thresholds of 0

Simiantality · Post by **Simiantality** » Wed Sep 05, 2012 12:52 pm

(Forgive me if this has been posted - I attempted searching and couldn't find anything.)

I work for a company (with yousillygoose) that uses Cacti to monitor several thousand devices. We are currently beginning a hardware migration and software version upgrade, and have noticed a small bug that I have been unable to squash myself, with my admittedly limited knowledge.

We are finding that some thresholds have been logged as breaching a high/low threshold of 0 on occasion.

Now, these thresholds are not actually alerting incorrectly - they are triggering correctly in every case, as far as I can tell. However, when they are reporting to rsyslog, they are claiming to be breaching a threshold of 0.

The syslog reporting is also inconsistent. The same threshold will sometimes report correctly (e.g. disk usage over 90% full) and sometimes incorrectly, and it can flip back and forth as often as every polling cycle.

This is not a common issue. I would estimate that well over 90% of the time, everything works perfectly. As a result, troubleshooting has been very difficult.

We are using Cacti ver 0.8.7i, with Thold 0.4.9-3, on CentOS 5.something. We have modified thold_functions.php just a tiny bit to help our Smarts system sort our Cacti alerts correctly. This was a simple addition of a few characters in the message logged at the end of the logger functions, like so:

Code:

	if (strval($breach_up) == 'ok') {
		syslog($syslog_level, "CACTIALERT " . $desc . ' restored to normal with ' . $currentval . ' at trigger ' . $trigger . ' out of ' . $triggerct);
	} else {
		syslog($syslog_level, "CACTIALERT " . $desc . ' went ' . ($breach_up ? 'above' : 'below') . ' threshold of ' . $threshld . " - source: cacti31");
	}

I doubt that this change is responsible for the issue, but it is certainly possible, so I have included it for the sake of completeness. Syslog itself doesn't seem to be at fault: I added a line of PHP that writes the value of $threshld to a text file, and the value being passed to syslog was already 0.

Has anyone encountered this already? Have I simply overlooked something that has already been fixed?

Thank you very much for your help. Let me know if I can provide any other relevant information.

Simiantality · Post by **Simiantality** » Mon Sep 10, 2012 9:33 am

So I believe I have identified the issue and pushed a possible change to alleviate it. This is not a perfect fix.

It looks to me like line 1361

Code:

$breach_up = ($item['thold_hi'] != '' && $currentval > $item['thold_hi']);

sets the $breach_up variable to true when a high threshold has been configured (any non-empty value) and the current value being monitored is above it.

However, I have observed that in certain circumstances, the threshold is not removed from an alert state when the current value is temporarily set to 0. My guess is that this occurs when the device does not respond to a single polling cycle. When this happens, the code erroneously triggers a 'restored to normal' state, and writes to the logger with line 1492:

Code:

logger($item['name'], 'ok', 0, $currentval, $warning_trigger, $item['thold_warning_fail_count'], $url);

or line 1512:

Code:

logger($item['name'], 'warning', 0, $currentval, $trigger, $item['thold_fail_count'], $url);
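To see why a literal 0 in these calls ends up in the log line, here is a minimal sketch of the message-building logic from the snippet in the first post. The function name build_thold_message is hypothetical, and the parameter order is only inferred from the logger calls quoted above; it returns the string instead of calling syslog so the result can be inspected.

```php
<?php
// Simplified stand-in for the logging snippet in the first post.
// build_thold_message is a hypothetical name, not a Thold function.
// Parameter order is assumed from the logger() calls quoted in this thread.
function build_thold_message($desc, $breach_up, $threshld, $currentval, $trigger, $triggerct) {
	if (strval($breach_up) == 'ok') {
		return 'CACTIALERT ' . $desc . ' restored to normal with ' . $currentval
			. ' at trigger ' . $trigger . ' out of ' . $triggerct;
	}
	// Any non-empty string other than 'ok' (for example 'warning') is truthy,
	// so it is reported as "above", and $threshld is printed exactly as passed.
	return 'CACTIALERT ' . $desc . ' went ' . ($breach_up ? 'above' : 'below')
		. ' threshold of ' . $threshld . ' - source: cacti31';
}

// A call shaped like the one on line 1512, with a literal 0 as the threshold.
$msg = build_thold_message('disk usage', 'warning', 0, 93, 3, 3);
// $msg is "CACTIALERT disk usage went above threshold of 0 - source: cacti31"
```

If this reading is right, the 0 in the log is not a value read from the threshold settings at all; it is simply the third argument passed by the caller.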

I have found that in our particular setup, I can modify the code to check only whether $item['thold_hi'] is set, without checking the current state, and pass the corresponding threshold into the logger function instead of the '0' value above, like so:

Code:

$temp_breach_up = ($item['thold_hi'] != '');
logger($item['name'], 'warning', ($temp_breach_up ? $item['thold_hi'] : $item['thold_low']), $currentval, $trigger, $item['thold_fail_count'], $url);

This seems to alleviate the issue described in my post above. As my entire experience in PHP has been limited to this particular task, and PHP is now my strongest programming language as a result, I'm not sure if my fix has broken something else.
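One possible gap in the $temp_breach_up expression: when both a high and a low threshold are configured, it always reports the high one, even if the value is actually below the low one. A hedged sketch of a variant that first checks which bound the current value violates, and only then falls back to the "high if set, else low" rule, might look like the following. The helper name pick_reported_threshold is hypothetical and not part of Thold; it only assumes that $item carries thold_hi and thold_low as in the lines quoted above.

```php
<?php
// Hypothetical helper, not part of Thold: choose which configured bound to
// report when the caller does not know the breach direction.
function pick_reported_threshold(array $item, $currentval) {
	$hi  = isset($item['thold_hi'])  ? $item['thold_hi']  : '';
	$low = isset($item['thold_low']) ? $item['thold_low'] : '';

	// Prefer the bound that the current value actually violates.
	if ($hi !== '' && $currentval > $hi) {
		return $hi;
	}
	if ($low !== '' && $currentval < $low) {
		return $low;
	}
	// Otherwise fall back to the rule from the post: high if set, else low.
	return ($hi !== '') ? $hi : $low;
}
```

The return value could then be passed as the third argument to logger in place of the ternary expression. This is untested against a real Thold install, so treat it as a starting point for discussion rather than a drop-in patch.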

Please critique my code and let me know where I might have made a mistake, or if my reasoning itself was erroneous.
