It was configured with a Warning threshold of 30% for 10 minutes and an Alert threshold of 60% for 3 minutes. Please let me know if this configuration is not how the feature was designed to be used.
We have a nightly process on the server that drives the processor close to 100% for about an hour. Three minutes in, we get the email telling us about the problem. Two nights ago, after this process finished, we got the "NORMAL" email as well. Yesterday, we didn't get the normal email, so I enabled the DEBUG feature of THOLD. This morning, we again didn't get the normal email.
By filtering the log for "email", I found where it sent the Alert email, but there was no other attempt to send an email. Here's the related portion of the log. I've manually removed the entries for the only other threshold set on this server. The first segment shows a THOLD check before the event, followed by the three checks where the threshold is breached; the email is sent on the third breach.
Code:
02/24/2012 01:00:03 AM - THOLD: Checking Threshold:'cpmhdq01w - CPU Usage - User [cpu_user]', Graph:'6350'
02/24/2012 01:00:04 AM - THOLD: Checking Threshold: DS:cpu_user RRA_ID:8909 DATA_ID:24244 VALUE:6.8
02/24/2012 01:00:04 AM - THOLD: Threshold HI / Low check is normal HI:60 LOW: VALUE:6.8
02/24/2012 01:00:04 AM - SYSTEM THOLD STATS: Time:0.0078 Tholds:2 DownHosts:0
02/24/2012 01:01:03 AM - THOLD: Checking Threshold:'cpmhdq01w - CPU Usage - User [cpu_user]', Graph:'6350'
02/24/2012 01:01:05 AM - THOLD: Checking Threshold: DS:cpu_user RRA_ID:8909 DATA_ID:24244 VALUE:84.8
02/24/2012 01:01:05 AM - THOLD: Threshold HI / Low check breached HI:60 LOW: VALUE:84.8
02/24/2012 01:01:05 AM - SYSTEM THOLD STATS: Time:0.0072 Tholds:2 DownHosts:0
02/24/2012 01:02:03 AM - THOLD: Checking Threshold:'cpmhdq01w - CPU Usage - User [cpu_user]', Graph:'6350'
02/24/2012 01:02:04 AM - THOLD: Checking Threshold: DS:cpu_user RRA_ID:8909 DATA_ID:24244 VALUE:87.6667
02/24/2012 01:02:04 AM - THOLD: Threshold HI / Low check breached HI:60 LOW: VALUE:87.6667
02/24/2012 01:02:04 AM - SYSTEM THOLD STATS: Time:0.0079 Tholds:2 DownHosts:0
02/24/2012 01:03:03 AM - THOLD: Checking Threshold:'cpmhdq01w - CPU Usage - User [cpu_user]', Graph:'6350'
02/24/2012 01:03:04 AM - THOLD: Checking Threshold: DS:cpu_user RRA_ID:8909 DATA_ID:24244 VALUE:91.4167
02/24/2012 01:03:04 AM - THOLD: Threshold HI / Low check breached HI:60 LOW: VALUE:91.4167
02/24/2012 01:03:04 AM - THOLD: Alerting is necessary
02/24/2012 01:03:04 AM - THOLD: Preparing to send email
02/24/2012 01:03:04 AM - THOLD: Sending email to 'firewallalertmailinglist@mydomain.com'
02/24/2012 01:03:04 AM - SYSTEM THOLD STATS: Time:0.1496 Tholds:2 DownHosts:0
Code:
02/24/2012 01:56:03 AM - THOLD: Checking Threshold:'cpmhdq01w - CPU Usage - User [cpu_user]', Graph:'6350'
02/24/2012 01:56:10 AM - THOLD: Checking Threshold: DS:cpu_user RRA_ID:8909 DATA_ID:24244 VALUE:71.3607
02/24/2012 01:56:10 AM - THOLD: Threshold HI / Low check breached HI:60 LOW: VALUE:71.3607
02/24/2012 01:56:10 AM - SYSTEM THOLD STATS: Time:0.0085 Tholds:2 DownHosts:0
02/24/2012 01:57:02 AM - THOLD: Checking Threshold:'cpmhdq01w - CPU Usage - User [cpu_user]', Graph:'6350'
02/24/2012 01:57:04 AM - THOLD: Checking Threshold: DS:cpu_user RRA_ID:8909 DATA_ID:24244 VALUE:55.0678
02/24/2012 01:57:04 AM - THOLD: Threshold HI / Low Warning check breached HI:30 LOW: VALUE:55.0678
02/24/2012 01:57:04 AM - SYSTEM THOLD STATS: Time:0.0080 Tholds:2 DownHosts:0
02/24/2012 01:58:03 AM - THOLD: Checking Threshold:'cpmhdq01w - CPU Usage - User [cpu_user]', Graph:'6350'
02/24/2012 01:58:05 AM - THOLD: Checking Threshold: DS:cpu_user RRA_ID:8909 DATA_ID:24244 VALUE:55.6066
02/24/2012 01:58:05 AM - THOLD: Threshold HI / Low Warning check breached HI:30 LOW: VALUE:55.6066
02/24/2012 01:58:05 AM - SYSTEM THOLD STATS: Time:0.0082 Tholds:2 DownHosts:0
02/24/2012 01:59:03 AM - THOLD: Checking Threshold:'cpmhdq01w - CPU Usage - User [cpu_user]', Graph:'6350'
02/24/2012 01:59:06 AM - THOLD: Checking Threshold: DS:cpu_user RRA_ID:8909 DATA_ID:24244 VALUE:55.1333
02/24/2012 01:59:06 AM - THOLD: Threshold HI / Low Warning check breached HI:30 LOW: VALUE:55.1333
02/24/2012 01:59:06 AM - SYSTEM THOLD STATS: Time:0.0081 Tholds:2 DownHosts:0
02/24/2012 02:00:03 AM - THOLD: Checking Threshold:'cpmhdq01w - CPU Usage - User [cpu_user]', Graph:'6350'
02/24/2012 02:00:09 AM - THOLD: Checking Threshold: DS:cpu_user RRA_ID:8909 DATA_ID:24244 VALUE:54.7167
02/24/2012 02:00:09 AM - THOLD: Threshold HI / Low Warning check breached HI:30 LOW: VALUE:54.7167
02/24/2012 02:00:09 AM - SYSTEM THOLD STATS: Time:0.0083 Tholds:2 DownHosts:0
02/24/2012 02:01:02 AM - THOLD: Checking Threshold:'cpmhdq01w - CPU Usage - User [cpu_user]', Graph:'6350'
02/24/2012 02:01:07 AM - THOLD: Checking Threshold: DS:cpu_user RRA_ID:8909 DATA_ID:24244 VALUE:23.8983
02/24/2012 02:01:07 AM - THOLD: Threshold HI / Low check is normal HI:60 LOW: VALUE:23.8983
02/24/2012 02:01:07 AM - SYSTEM THOLD STATS: Time:0.0081 Tholds:2 DownHosts:0
02/24/2012 02:02:02 AM - THOLD: Checking Threshold:'cpmhdq01w - CPU Usage - User [cpu_user]', Graph:'6350'
02/24/2012 02:02:07 AM - THOLD: Checking Threshold: DS:cpu_user RRA_ID:8909 DATA_ID:24244 VALUE:11.35
02/24/2012 02:02:07 AM - THOLD: Threshold HI / Low check is normal HI:60 LOW: VALUE:11.35
02/24/2012 02:02:07 AM - SYSTEM THOLD STATS: Time:0.0081 Tholds:2 DownHosts:0
02/24/2012 02:03:02 AM - THOLD: Checking Threshold:'cpmhdq01w - CPU Usage - User [cpu_user]', Graph:'6350'
02/24/2012 02:03:06 AM - THOLD: Checking Threshold: DS:cpu_user RRA_ID:8909 DATA_ID:24244 VALUE:11.2167
02/24/2012 02:03:06 AM - THOLD: Threshold HI / Low check is normal HI:60 LOW: VALUE:11.2167
02/24/2012 02:03:06 AM - SYSTEM THOLD STATS: Time:0.0080 Tholds:2 DownHosts:0
02/24/2012 02:04:03 AM - THOLD: Checking Threshold:'cpmhdq01w - CPU Usage - User [cpu_user]', Graph:'6350'
02/24/2012 02:04:05 AM - THOLD: Checking Threshold: DS:cpu_user RRA_ID:8909 DATA_ID:24244 VALUE:5.918
02/24/2012 02:04:05 AM - THOLD: Threshold HI / Low check is normal HI:60 LOW: VALUE:5.918
02/24/2012 02:04:05 AM - SYSTEM THOLD STATS: Time:0.0080 Tholds:2 DownHosts:0
At 01:57, it changes from the "check breached" (Alert) condition to the "Warning check breached" condition.
At 02:01, it returns to the normal condition. Several more minutes pass, but no "Normal" email is sent.
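For anyone who wants to pull the same state changes out of their own debug log, here is a minimal Python sketch. It assumes only the line format visible in the excerpts above; the function names `classify` and `transitions` are made up for illustration and are not part of THOLD or Cacti.

```python
import re

TS = r"(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} [AP]M)"

# Matches the per-check result lines shown in the log excerpts above.
CHECK_RE = re.compile(
    r"^" + TS + r" - THOLD: Threshold HI / Low "
    r"(?P<warn>Warning )?check (?P<result>breached|is normal)"
)
# Matches the line logged when an email is actually sent.
EMAIL_RE = re.compile(r"^" + TS + r" - THOLD: Sending email")


def classify(line):
    """Return (timestamp, state) for a check-result line, else None."""
    m = CHECK_RE.match(line)
    if not m:
        return None
    if m.group("result") == "is normal":
        state = "normal"
    elif m.group("warn"):
        state = "warning"
    else:
        state = "alert"
    return m.group("ts"), state


def transitions(lines):
    """Yield (timestamp, event) on each state change or email send."""
    previous = None
    for line in lines:
        email = EMAIL_RE.match(line)
        if email:
            yield email.group("ts"), "email"
            continue
        hit = classify(line)
        if hit and hit[1] != previous:
            previous = hit[1]
            yield hit
```

Feeding it the lines of the Cacti log (for example `transitions(open("cacti.log"))`, where the file name is a placeholder) gives a short timeline of normal / alert / warning changes with any "Sending email" events in between, which makes a missing "Normal" email easy to spot.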
So, I think the problem is related to the transition from the "breach" condition to the "warning breach" condition, and perhaps also to the fact that it didn't stay in the warning condition for the full 10 minutes required to send a Warning alert. My theory is that somewhere in this transition, whatever flag records that the threshold is "breached" gets cleared. Since no Warning email was ever sent, THOLD then concludes that it doesn't need to send a Normal email either.
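To make the theory concrete, here is a toy Python model of it. This is not THOLD's code; the `alerted` flag, the `clear_on_downgrade` switch, and the strict greater-than comparisons are all assumptions made purely to illustrate the suspected behavior.

```python
def simulate(values, hi=60, warn=30, hi_polls=3, warn_polls=10,
             clear_on_downgrade=True):
    """Return the emails a toy threshold checker would send, one poll per value.

    clear_on_downgrade=True models the theory from this post: dropping from
    the Alert band into the Warning band forgets that an alert was sent.
    """
    emails = []
    alerted = False          # hypothetical "an email is outstanding" flag
    hi_count = warn_count = 0
    for value in values:
        if value > hi:
            hi_count += 1
            warn_count = 0
            if hi_count == hi_polls and not alerted:
                emails.append("ALERT")
                alerted = True
        elif value > warn:
            hi_count = 0
            warn_count += 1
            if clear_on_downgrade:
                alerted = False      # the suspected bug: flag lost here
            if warn_count == warn_polls:
                emails.append("WARNING")
                alerted = True
        else:
            hi_count = warn_count = 0
            if alerted:
                emails.append("NORMAL")
                alerted = False
    return emails


# Per-poll values taken from the log excerpts in this post.
night = [6.8, 84.8, 87.6667, 91.4167, 71.3607,
         55.0678, 55.6066, 55.1333, 54.7167, 23.8983, 11.35]
```

With `night` as input, the model sends only the Alert email when `clear_on_downgrade=True` and sends Alert followed by Normal when it is `False`. A run that drops straight from above 60 to below 30 produces the Normal email in both modes, which would match the nights when it worked.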
I've removed the Warning levels from this threshold, so I expect the Normal email to be sent correctly tomorrow. I'm guessing the few times this worked were when the CPU went directly from above 60% to below 30% within one poll cycle.
Is this a known issue?
Should I simply set up two thresholds, one with only a Warning level and the other with only an Alert level, so that they don't overlap? Can you set two thresholds against the same data source?
Thanks,
Paul