Threshold Alert / Warning - Some missing emails

ptaylor874
Posts: 45
Joined: Fri Jan 04, 2008 11:45 am

Post by ptaylor874 »

Last week, I set up an alert in Threshold for CPU utilization on our Firewall Management server.

It was configured with a Warning threshold of 30% for 10 minutes and an Alert threshold of 60% for 3 minutes. Please let me know if the way I configured it is not according to how it was designed.

We have a nightly process that runs on the server which takes the processor up close to 100% for about an hour. After three minutes, we see the email come out telling us about the problem. Two nights ago, after this process was done, we got the "NORMAL" email as well. Yesterday, we didn't get this normal email, so I enabled the DEBUG feature of THOLD. This morning, we also didn't get the normal email.

By filtering the log for "email", I found where it sent the Alert email, but there was no other email send attempt. Here's the related portion of the log. I've manually filtered out the only other Threshold set on this server. This first segment shows a THOLD check before the event, followed by the three checks where the threshold is breached; the email is sent on the third breach.

02/24/2012 01:00:03 AM - THOLD: Checking Threshold:'cpmhdq01w - CPU Usage - User [cpu_user]', Graph:'6350'
02/24/2012 01:00:04 AM - THOLD: Checking Threshold:  DS:cpu_user RRA_ID:8909 DATA_ID:24244 VALUE:6.8
02/24/2012 01:00:04 AM - THOLD: Threshold HI / Low check is normal HI:60  LOW: VALUE:6.8
02/24/2012 01:00:04 AM - SYSTEM THOLD STATS: Time:0.0078 Tholds:2 DownHosts:0
02/24/2012 01:01:03 AM - THOLD: Checking Threshold:'cpmhdq01w - CPU Usage - User [cpu_user]', Graph:'6350'
02/24/2012 01:01:05 AM - THOLD: Checking Threshold:  DS:cpu_user RRA_ID:8909 DATA_ID:24244 VALUE:84.8
02/24/2012 01:01:05 AM - THOLD: Threshold HI / Low check breached HI:60  LOW: VALUE:84.8
02/24/2012 01:01:05 AM - SYSTEM THOLD STATS: Time:0.0072 Tholds:2 DownHosts:0
02/24/2012 01:02:03 AM - THOLD: Checking Threshold:'cpmhdq01w - CPU Usage - User [cpu_user]', Graph:'6350'
02/24/2012 01:02:04 AM - THOLD: Checking Threshold:  DS:cpu_user RRA_ID:8909 DATA_ID:24244 VALUE:87.6667
02/24/2012 01:02:04 AM - THOLD: Threshold HI / Low check breached HI:60  LOW: VALUE:87.6667
02/24/2012 01:02:04 AM - SYSTEM THOLD STATS: Time:0.0079 Tholds:2 DownHosts:0
02/24/2012 01:03:03 AM - THOLD: Checking Threshold:'cpmhdq01w - CPU Usage - User [cpu_user]', Graph:'6350'
02/24/2012 01:03:04 AM - THOLD: Checking Threshold:  DS:cpu_user RRA_ID:8909 DATA_ID:24244 VALUE:91.4167
02/24/2012 01:03:04 AM - THOLD: Threshold HI / Low check breached HI:60  LOW: VALUE:91.4167
02/24/2012 01:03:04 AM - THOLD: Alerting is necessary
02/24/2012 01:03:04 AM - THOLD: Preparing to send email
02/24/2012 01:03:04 AM - THOLD: Sending email to 'firewallalertmailinglist@mydomain.com'
02/24/2012 01:03:04 AM - SYSTEM THOLD STATS: Time:0.1496 Tholds:2 DownHosts:0

From 01:03 until 01:56, it stays in the breach condition, then this happens:

02/24/2012 01:56:03 AM - THOLD: Checking Threshold:'cpmhdq01w - CPU Usage - User [cpu_user]', Graph:'6350'
02/24/2012 01:56:10 AM - THOLD: Checking Threshold:  DS:cpu_user RRA_ID:8909 DATA_ID:24244 VALUE:71.3607
02/24/2012 01:56:10 AM - THOLD: Threshold HI / Low check breached HI:60  LOW: VALUE:71.3607
02/24/2012 01:56:10 AM - SYSTEM THOLD STATS: Time:0.0085 Tholds:2 DownHosts:0
02/24/2012 01:57:02 AM - THOLD: Checking Threshold:'cpmhdq01w - CPU Usage - User [cpu_user]', Graph:'6350'
02/24/2012 01:57:04 AM - THOLD: Checking Threshold:  DS:cpu_user RRA_ID:8909 DATA_ID:24244 VALUE:55.0678
02/24/2012 01:57:04 AM - THOLD: Threshold HI / Low Warning check breached HI:30  LOW: VALUE:55.0678
02/24/2012 01:57:04 AM - SYSTEM THOLD STATS: Time:0.0080 Tholds:2 DownHosts:0
02/24/2012 01:58:03 AM - THOLD: Checking Threshold:'cpmhdq01w - CPU Usage - User [cpu_user]', Graph:'6350'
02/24/2012 01:58:05 AM - THOLD: Checking Threshold:  DS:cpu_user RRA_ID:8909 DATA_ID:24244 VALUE:55.6066
02/24/2012 01:58:05 AM - THOLD: Threshold HI / Low Warning check breached HI:30  LOW: VALUE:55.6066
02/24/2012 01:58:05 AM - SYSTEM THOLD STATS: Time:0.0082 Tholds:2 DownHosts:0
02/24/2012 01:59:03 AM - THOLD: Checking Threshold:'cpmhdq01w - CPU Usage - User [cpu_user]', Graph:'6350'
02/24/2012 01:59:06 AM - THOLD: Checking Threshold:  DS:cpu_user RRA_ID:8909 DATA_ID:24244 VALUE:55.1333
02/24/2012 01:59:06 AM - THOLD: Threshold HI / Low Warning check breached HI:30  LOW: VALUE:55.1333
02/24/2012 01:59:06 AM - SYSTEM THOLD STATS: Time:0.0081 Tholds:2 DownHosts:0
02/24/2012 02:00:03 AM - THOLD: Checking Threshold:'cpmhdq01w - CPU Usage - User [cpu_user]', Graph:'6350'
02/24/2012 02:00:09 AM - THOLD: Checking Threshold:  DS:cpu_user RRA_ID:8909 DATA_ID:24244 VALUE:54.7167
02/24/2012 02:00:09 AM - THOLD: Threshold HI / Low Warning check breached HI:30  LOW: VALUE:54.7167
02/24/2012 02:00:09 AM - SYSTEM THOLD STATS: Time:0.0083 Tholds:2 DownHosts:0
02/24/2012 02:01:02 AM - THOLD: Checking Threshold:'cpmhdq01w - CPU Usage - User [cpu_user]', Graph:'6350'
02/24/2012 02:01:07 AM - THOLD: Checking Threshold:  DS:cpu_user RRA_ID:8909 DATA_ID:24244 VALUE:23.8983
02/24/2012 02:01:07 AM - THOLD: Threshold HI / Low check is normal HI:60  LOW: VALUE:23.8983
02/24/2012 02:01:07 AM - SYSTEM THOLD STATS: Time:0.0081 Tholds:2 DownHosts:0
02/24/2012 02:02:02 AM - THOLD: Checking Threshold:'cpmhdq01w - CPU Usage - User [cpu_user]', Graph:'6350'
02/24/2012 02:02:07 AM - THOLD: Checking Threshold:  DS:cpu_user RRA_ID:8909 DATA_ID:24244 VALUE:11.35
02/24/2012 02:02:07 AM - THOLD: Threshold HI / Low check is normal HI:60  LOW: VALUE:11.35
02/24/2012 02:02:07 AM - SYSTEM THOLD STATS: Time:0.0081 Tholds:2 DownHosts:0
02/24/2012 02:03:02 AM - THOLD: Checking Threshold:'cpmhdq01w - CPU Usage - User [cpu_user]', Graph:'6350'
02/24/2012 02:03:06 AM - THOLD: Checking Threshold:  DS:cpu_user RRA_ID:8909 DATA_ID:24244 VALUE:11.2167
02/24/2012 02:03:06 AM - THOLD: Threshold HI / Low check is normal HI:60  LOW: VALUE:11.2167
02/24/2012 02:03:06 AM - SYSTEM THOLD STATS: Time:0.0080 Tholds:2 DownHosts:0
02/24/2012 02:04:03 AM - THOLD: Checking Threshold:'cpmhdq01w - CPU Usage - User [cpu_user]', Graph:'6350'
02/24/2012 02:04:05 AM - THOLD: Checking Threshold:  DS:cpu_user RRA_ID:8909 DATA_ID:24244 VALUE:5.918
02/24/2012 02:04:05 AM - THOLD: Threshold HI / Low check is normal HI:60  LOW: VALUE:5.918
02/24/2012 02:04:05 AM - SYSTEM THOLD STATS: Time:0.0080 Tholds:2 DownHosts:0

Summarizing the above:
At 01:57, it changes to the "Warning check breached" condition.
At 02:01, it returns to the normal condition. Several minutes passed, but no "Normal" email is sent.
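
The same summary can be pulled out of the log mechanically. The following is a minimal sketch, assuming the result lines always look like the "Threshold HI / Low ... check breached / is normal" lines in the excerpts above; the function names `classify` and `transitions` are made up for this illustration and are not part of THOLD.

```python
import re

# Matches only the THOLD result lines seen in the excerpts above (assumed format).
RESULT = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+ [AP]M) - THOLD: Threshold HI / Low "
    r"(?P<warning>Warning )?check (?P<result>breached|is normal)"
)

def classify(line):
    """Return (time, state) for a result line, or None for any other line."""
    m = RESULT.match(line)
    if m is None:
        return None
    if m.group("result") == "is normal":
        state = "normal"
    elif m.group("warning"):
        state = "warning breached"
    else:
        state = "alert breached"
    return m.group("time"), state

def transitions(lines):
    """Yield (time, old_state, new_state) each time the state changes."""
    previous = None
    for line in lines:
        parsed = classify(line)
        if parsed is None:
            continue  # not a result line (e.g. STATS or email lines)
        time, state = parsed
        if previous is not None and state != previous:
            yield time, previous, state
        previous = state

# A few lines copied from the second log excerpt.
sample = [
    "02/24/2012 01:56:10 AM - THOLD: Threshold HI / Low check breached HI:60  LOW: VALUE:71.3607",
    "02/24/2012 01:56:10 AM - SYSTEM THOLD STATS: Time:0.0085 Tholds:2 DownHosts:0",
    "02/24/2012 01:57:04 AM - THOLD: Threshold HI / Low Warning check breached HI:30  LOW: VALUE:55.0678",
    "02/24/2012 02:00:09 AM - THOLD: Threshold HI / Low Warning check breached HI:30  LOW: VALUE:54.7167",
    "02/24/2012 02:01:07 AM - THOLD: Threshold HI / Low check is normal HI:60  LOW: VALUE:23.8983",
]

for time, old, new in transitions(sample):
    print(f"{time}: {old} -> {new}")
```

Run against these sample lines, it reports the two state changes described above: alert breached to warning breached at 01:57, and warning breached to normal at 02:01.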

So, I think the problem has to do with the fact that it went from "breach" to "warning breach" condition. Perhaps also because it didn't stay in that condition for the full 10 minutes required to do a warning alert. I theorize that at some point in this process, whatever flag that's set to know that it's "breached" is cleared, and since no Warning email is sent, it thinks that it doesn't need to send a Normal condition email.

I've actually removed the Warning levels from this threshold, so I'm expecting that tomorrow it will work as expected. I'm guessing the few times this has worked correctly were when it went directly from greater than 60% CPU to less than 30% CPU within one poll cycle.

Is this a known issue?

Should I simply set up two thresholds, one with a Warning level, the other with an Alert level to ensure that they don't overlap? Can you set two thresholds against the same data source?

Thanks,
Paul
noname
Cacti Guru User
Posts: 1566
Joined: Thu Aug 05, 2010 2:04 am
Location: Japan

Re: Threshold Alert / Warning - Some missing emails

Post by noname »

ptaylor874 wrote: So, I think the problem has to do with the fact that it went from "breach" to "warning breach" condition. Perhaps also because it didn't stay in that condition for the full 10 minutes required to do a warning alert. I theorize that at some point in this process, whatever flag that's set to know that it's "breached" is cleared, and since no Warning email is sent, it thinks that it doesn't need to send a Normal condition email.
Your theory appears to be correct.
For a HI/LOW threshold on thold-v0.4.9-3, the "Normal" condition mail is sent in the following 2 cases:
- when the warning threshold was triggered (not merely breached)
- when the alert threshold was triggered (not merely breached)

However, whether it was triggered is judged by the failure counts ('thold_warning_fail_count' and 'thold_fail_count').
Unfortunately, the failure counter is reset when the status changes between "Alert" and "Warning", so the "Normal" condition mail is not sent because the previously triggered status has been cleared.
(It seems that the previous status is used to determine where Cacti should send the mail: "Alert Emails" or "Warning Emails".)

For details, please see the lines 1359-1532 in 'thold_functions.php'.

As a trial, I removed these lines (and the comma at the end of the preceding line) in thold_functions.php, and the "Normal" mail was sent successfully.
But it might have other (bad?) side effects. (I'm not sure.)

1413: thold_warning_fail_count=0
1477: thold_fail_count=0

So I think it perhaps needs a specific field that holds the previously triggered status of each threshold, or some other criterion to determine "what email to send".
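
To make the idea concrete, here is a simplified model of the behavior described in this thread, written in Python. It is only a sketch and not the actual logic of thold_functions.php: the function `simulate`, the `remember_trigger` switch, and the `triggered` flag are hypothetical, and the counter reset on a status change is assumed to work as described above. The two counters are named after 'thold_fail_count' and 'thold_warning_fail_count'.

```python
def simulate(values, hi=60.0, warn=30.0, alert_polls=3, warning_polls=10,
             remember_trigger=False):
    """Return the list of emails a simplified threshold check would send.

    remember_trigger=False models the reported behavior: "Normal" is decided
    from the failure counters alone. remember_trigger=True models the proposed
    change: a separate flag remembers that an email was sent.
    """
    fail_count = 0            # consecutive polls above the alert level
    warning_fail_count = 0    # consecutive polls in the warning band
    triggered = False         # hypothetical "an email was sent" flag
    emails = []
    for value in values:
        if value > hi:
            fail_count += 1
            warning_fail_count = 0   # assumed reset on status change
            if fail_count == alert_polls:
                emails.append("ALERT")
                triggered = True
        elif value > warn:
            warning_fail_count += 1
            fail_count = 0           # assumed reset on status change
            if warning_fail_count == warning_polls:
                emails.append("WARNING")
                triggered = True
        else:
            if remember_trigger:
                was_triggered = triggered
            else:
                was_triggered = (fail_count >= alert_polls
                                 or warning_fail_count >= warning_polls)
            if was_triggered:
                emails.append("NORMAL")
            fail_count = 0
            warning_fail_count = 0
            triggered = False
    return emails

# Rounded values from the logs above: alert, a few warning-level polls, normal.
night = [6.8, 84.8, 87.7, 91.4, 71.4, 55.1, 55.6, 55.1, 54.7, 23.9]

print(simulate(night))                          # counters only
print(simulate(night, remember_trigger=True))   # with the extra flag
print(simulate([6.8, 84.8, 87.7, 91.4, 23.9]))  # direct drop from alert to normal
```

In this model, the counters-only run sends the ALERT but never the NORMAL, because the alert counter is back at zero by the time the value drops below the warning level. With the extra flag, or when the value drops straight from above 60% to below 30%, the NORMAL is sent.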

// Sorry my English
ptaylor874
Posts: 45
Joined: Fri Jan 04, 2008 11:45 am

Re: Threshold Alert / Warning - Some missing emails

Post by ptaylor874 »

I'm pretty sure this is the problem in my original case.

I altered my settings so that I'm only using the Alert notification, not the Warning. Since I've done that, it has behaved exactly as expected. Saturday, Sunday, and this morning I received Alert notification emails when the processor utilization exceeded 60% for 3 minutes. Within an hour, the process completed, the processor utilization dropped off, and I received a "Normal" email.

While there may be cases for having both a Warning and Alert notification, I can do without it for the moment. If I do need that additional level, I think it would work fine if I added an additional Notification on the same device, except only using the Warning parameters, not the Alert parameters.