I'm trying to set up Time Based Thresholds and finding some unexpected behavior.
First, it appears that Cacti will send a "NORMAL" alert after a single breach, rather than waiting for the measured value to go above the High value or below the Low value the configured number of times within the measurement window. To illustrate:
I've created a threshold template against the "Unix - Logged in Users" data source, as shown below:
I'm using a 1 minute polling cycle.
If I understand things correctly, this threshold should enter the Warning state only if there are three polling cycles within 10 minutes where the number of logged in users is greater than 1. And it should enter the Alert state only if there are three polling cycles within 10 minutes where the number of logged in users is greater than 3. In both cases the breaches don't need to be consecutive; they could be spread throughout the 10 minute window. And I should only ever get the "NORMAL" alert after either of those conditions has happened.
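To make my reading of the intended behavior concrete, here is a minimal Python sketch. It is not Thold code; the function name `evaluate_window` and its parameters are names I made up for illustration. It assumes one sample per polling cycle, a window of 10 samples, a trigger of 3 breaches, and a strict "greater than" comparison.

```python
from collections import deque

def evaluate_window(samples, high, trigger=3, window=10):
    """For each polling cycle, report whether at least `trigger` of the
    last `window` samples were strictly above `high` (hypothetical model)."""
    recent = deque(maxlen=window)  # sliding window of breach flags
    states = []
    for value in samples:
        recent.append(value > high)
        states.append(sum(recent) >= trigger)
    return states

# Three non-consecutive breaches inside the window: triggered on the third one.
print(evaluate_window([0, 2, 0, 0, 2, 0, 2, 0], high=1))
# A single breach never reaches the trigger.
print(evaluate_window([0, 2, 0], high=1))
```

Under this reading, the first call only reports a triggered state from the seventh sample onward, and the second call never does.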
However I find that if I log into the server twice, so that the number of logged in users is 2, just long enough for 1 polling cycle to catch that number, and then I log out of both sessions, the next polling cycle sees 0 logged in users and a "NORMAL" alert is generated. No WARNING alert was generated, since the condition was not true for 3 out of the last 10 polling cycles.
I've tried this both with my production instance running Cacti 1.2.23 and Thold 1.5.2, as well as a test instance running 1.2.25 and the latest develop branch of Thold 1.8.
Looking at the code in thold_functions.php, within the latest develop branch, I think I see the reason why this is happening. On lines 3311 and 3380 there are IF statements which check whether the variable "alertstat" is nonzero. That variable gets set on line 2165 to the value of thold_alert in the thold_data table. That database value gets set in the previous polling cycle: if the previous polling cycle was a breach, then thold_alert gets set to either STAT_HI or STAT_LO on line 2999, and saved into the database on line 3146. If the previous cycle's value was a warning breach, the same thing happens on lines 3151 and 3305.
I'm thinking that the issue here is that thold_alert is being set based on the current value, when it should instead be set based on whether the number of failures / warning_failures exceed the trigger / warning_trigger. Does this make sense or am I missing something?
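Here is a simplified Python model of the difference, in case it helps. This is only my interpretation, not the actual Thold logic; `simulate`, `stored_flag` and `flag_follows_count` are hypothetical names, with `stored_flag` standing in for the saved thold_alert value. I use the label "ALERT" for the triggered message, but the same reasoning would apply to the warning level.

```python
def simulate(samples, high, trigger=3, window=10, flag_follows_count=False):
    """Return (cycle, message) pairs for a series of polled values.

    flag_follows_count=False models what I think happens today: the
    stored flag follows the current value. True models what I expected:
    the stored flag follows the breach count in the window.
    """
    flags = []           # breach flags for the last `window` cycles
    stored_flag = False  # stands in for the saved thold_alert value
    alert_sent = False
    messages = []
    for cycle, value in enumerate(samples):
        breached_now = value > high
        flags = (flags + [breached_now])[-window:]
        triggered = sum(flags) >= trigger

        if flag_follows_count:
            # State changes only when the count crosses the trigger.
            if triggered and not stored_flag:
                messages.append((cycle, "ALERT"))
            elif stored_flag and not triggered:
                messages.append((cycle, "NORMAL"))
            stored_flag = triggered
        else:
            if breached_now:
                if triggered and not alert_sent:
                    messages.append((cycle, "ALERT"))
                    alert_sent = True
                stored_flag = True   # set on any single breach
            else:
                if stored_flag:
                    messages.append((cycle, "NORMAL"))
                stored_flag = False
                alert_sent = False
    return messages

print(simulate([0, 2, 0], high=1))                           # [(2, 'NORMAL')]
print(simulate([0, 2, 0], high=1, flag_follows_count=True))  # []
```

The first call reproduces what I see in my test (a "NORMAL" with no preceding alert); the second is what I would have expected.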
The second thing I noticed, while reviewing the code for that first item, is that the code which sends ALERT / RE-ALERT messages is contained within the IF statement on line 2994 (if ($breach_up || $breach_down)). This means that alert messages will only be sent if the current value of the threshold is in breach. While this might work fine for the initial alert, since we would need the current polling cycle to get us over the trigger level, I believe it could lead to situations where the re-alert is not sent, for example:
If the current value of a graph is not in breach, but there are enough breaches in the measurement window to exceed the trigger, and we are at the re-alert cycle time: a re-alert should be sent, but I believe it would not be sent, since the current value is not in breach. If this is correct, then the logic around sending re-alerts might need to be moved outside of this IF statement. The same would be true for warning re-alerts.
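As a rough Python sketch of that scenario (again my own model with hypothetical names, not the Thold implementation, and the way I compute "re-alert is due" is an assumption):

```python
def should_realert(breached_now, failures, trigger,
                   cycles_since_alert, realert_every,
                   require_current_breach):
    """Decide whether a re-alert goes out this cycle (hypothetical model)."""
    # Assumption: a re-alert is due every `realert_every` cycles after the alert.
    due = (realert_every > 0
           and cycles_since_alert > 0
           and cycles_since_alert % realert_every == 0)
    over_trigger = failures >= trigger
    if require_current_breach:
        # Models re-alerts living inside the "current value is in breach" check.
        return breached_now and over_trigger and due
    # Models re-alerts evaluated independently of the current value.
    return over_trigger and due

# Current value is fine, but 4 breaches are still in the window and a re-alert is due.
print(should_realert(False, 4, 3, 10, 10, require_current_breach=True))   # False
print(should_realert(False, 4, 3, 10, 10, require_current_breach=False))  # True
```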
The third item is more of a question about the intended functionality. Because I've been using the "Unix - Logged in Users" data source for testing, I've been dealing with very controllable whole numbers, which might be unusual compared with most graphs in Cacti (part of the reason I chose it for testing). I noticed that if the value of the graph is exactly the same as the threshold level--for example if my threshold is 1 logged in user, and I log in exactly once to the server--the threshold is not considered breached.
This appears to be based on the fact that in the code, for example lines 2948-2951, we are using "less than" or "greater than" and not "less than or equal to" or "greater than or equal to". I just wanted to confirm this is the intended functionality? If so then in my test case I could simply make my threshold 0.9 instead of 1.
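A small Python illustration of the strict versus inclusive comparison I mean. The helper `is_breached` and its `inclusive` flag are hypothetical and only there to show the difference; as far as I can tell, the code only does the strict form.

```python
def is_breached(value, high=None, low=None, inclusive=False):
    """Compare a value against optional high/low levels (hypothetical helper)."""
    if inclusive:
        above = high is not None and value >= high
        below = low is not None and value <= low
    else:
        above = high is not None and value > high
        below = low is not None and value < low
    return above or below

print(is_breached(1, high=1))                  # False: exactly at the level
print(is_breached(1, high=0.9))                # True: the 0.9 workaround
print(is_breached(1, high=1, inclusive=True))  # True
```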
Intended behavior of Time Based Thresholds
Re: Intended behavior of Time Based Thresholds
Just replying since there have been no other replies yet. Anyone else using time-based thresholds and if so, have you noticed what I'm describing?
As I've been looking at the code and thinking about how it might be modified to address the issues I've described, another theoretical problem occurred to me. It may be an unusual use case, but it is possible to have both high and low thresholds set. What if the measured value in the graph swings above the high threshold and below the low threshold enough times within the measurement period? Would we expect that to trigger a high alarm, a low alarm, or both? Right now it doesn't appear that this circumstance is accounted for.
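To show the ambiguity with numbers, here is a tiny Python sketch. The function and the sample values are made up for illustration and do not reflect how Thold counts anything.

```python
def count_breaches(samples, high, low):
    """Count high and low breaches separately and combined (hypothetical)."""
    highs = sum(1 for v in samples if v > high)
    lows = sum(1 for v in samples if v < low)
    return highs, lows, highs + lows

# Value swings above high=4 and below low=1 within one window.
print(count_breaches([5, 0, 5, 0, 2], high=4, low=1))  # (2, 2, 4)
```

With a trigger of 3, neither side reaches the trigger on its own, but the combined count does, so the outcome depends on whether high and low breaches are counted together or separately.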
Looking for feedback on this either way, whether I'm misunderstanding something and things are working as expected, or whether things are indeed not working as expected. I plan to submit a bug but will wait a bit longer for any feedback first.