Feature Request - Unusual Counter Wraparounds

Post by **bestpa** » Tue Apr 10, 2012 2:55 pm

Hi,

I absolutely love CACTI.

I run on various systems and collect many counters from a variety of ISP and telecom-style equipment.

One of the problems we run into is a data source whose counter resets from time to time, but not at the normal 32-bit or 64-bit integer limit. This can occur:
- when an application is restarted or reloaded (e.g. BIND zone statistics)
- when the only counter provided is reset on a timer (e.g. hourly zeroing of statistics, common on telecom MSSs)

As a temporary fix, we have built a "buffer" zone in an unrelated MySQL table, where the current poll delta is examined and added to an ever-increasing value. When Cacti calls an external script, that script presents a "TO-REPORT" value, which is not expected to wrap until the natural 64-bit rollover is reached.

Is it possible to identify, or flag, certain data sources as "abnormal", so that when a wraparound is detected, Cacti uses previous deltas to guesstimate what the delta might look like, instead of assuming a 32-bit or 64-bit wraparound?

Techniques such as Data-Source limits (NaN) create fugly gaps in our charts. The temporary fix we use is cumbersome at best.

Thanks!

Post by **gandalf** » Tue Apr 10, 2012 3:24 pm

Usually, filtering "invalid" updates is done by using a specific MAX value on the data source item.
R.

Post by **bestpa** » Tue Apr 10, 2012 4:00 pm

Thank you for your quick response.

Is it not true that if a DS limit is reached, the result is a "NaN", which will pollute the clean line drawn in a graph by creating a "gap" in the represented data? (example, graphing: 9933, 9948, 9999, NaN, 9893, etc).

This is not desirable for a clean-looking graph.

Also, please confirm that:
- a DS limit on a "GAUGE" is enforced on the gauge value retrieved from the poll (e.g. MAX 100 for a gauge reading a percentage from 0 to 100).
- a DS limit on a "COUNTER" is enforced as a maximum delta (e.g. if the limit is 10000, then any counter poll delta above 10000 is marked as NaN).

Thanks.

Post by **bestpa** » Wed Apr 11, 2012 9:05 am

I believe the answer to my request may lie in CDEFs. I am working with the RRDtool documentation to determine if I can represent a spike above a certain limit as "last-known-good".

http://linux.die.net/man/1/rrdtutorial

Working with CDEFs is better than poisoning an RRD with a DS Limit that might increase from time to time. (CDEF easier to modify than running rrdtool modify against an already-created RRD file).

Thoughts?

Post by **gandalf** » Wed Apr 11, 2012 10:13 am

bestpa wrote:Is it not true that if a DS limit is reached, the result is a "NaN", which will pollute the clean line drawn in a graph by creating a "gap" in the represented data? (example, graphing: 9933, 9948, 9999, NaN, 9893, etc).

This will indeed produce a NaN, as the real value can't be guessed.
This is discussed very often. But Tobi Oetiker's (and my personal) opinion is not to fake any number in.
IMHO, it is better to tackle those NaNs when creating the graph (see your comment on CDEFs)
R.

Post by **gandalf** » Wed Apr 11, 2012 10:16 am

bestpa wrote:I believe the answer to my request may lie in CDEFs. I am working with the RRDtool documentation to determine if I can represent a spike above a certain limit as "last-known-good".

http://linux.die.net/man/1/rrdtutorial

Working with CDEFs is better than poisoning an RRD with a DS Limit that might increase from time to time. (CDEF easier to modify than running rrdtool modify against an already-created RRD file).

Thoughts?

My recommendation is not to tackle the spikes during graphing (which is quite a valid approach) but to tackle the NaNs.
You may indeed tweak the data source's heartbeat to make rrdtool fill in the gaps automatically. But later, you won't know when such a spike occurred. So I do not recommend doing so.
I come back to writing NaNs into the rrd file and keeping the dropouts, as you will then know what really happened.
But you may of course try to fill the missing data with some AVERAGE value.

You may find that this is going to be a philosophical discussion

R.

Post by **bestpa** » Wed Apr 11, 2012 10:52 am

Thank you again for your assistance.

Philosophical indeed, but with some practical applications:

Here's the scenario:

I am polling a device that increments a counter, but resets the counter to 0 every hour. This is frustrating, because the rollover occurs on a timer, and not on a predictable counter upper-limit.

Here's an example of the polled data I am collecting every 5 minutes. I apologise for the length of the output, but it is necessary to see what I mean. Please review the data source "num.NUMBER"; the lines marked with asterisks are the polls where the counter has just been reset.

Code: Select all

stats@cacti1:~/MSS/script$ while $1; do ./mss_statistics_report.pl  ; echo ; sleep 300 ;  done
num.NUMBER:345445 per.NUMBER:100.00 num.ANSWERED:295345 per.ANSWERED:85.49 num.NOT:24048 per.NOT:6.96 
****num.NUMBER:18568 per.NUMBER:100.00 num.ANSWERED:16008 per.ANSWERED:86.21 num.NOT:1219 per.NOT:6.56 
num.NUMBER:49677 per.NUMBER:100.00 num.ANSWERED:42774 per.ANSWERED:86.10 num.NOT:3316 per.NOT:6.67 
num.NUMBER:80450 per.NUMBER:100.00 num.ANSWERED:69173 per.ANSWERED:85.98 num.NOT:5447 per.NOT:6.77 
num.NUMBER:110434 per.NUMBER:100.00 num.ANSWERED:94969 per.ANSWERED:85.99 num.NOT:7524 per.NOT:6.81 
num.NUMBER:140072 per.NUMBER:100.00 num.ANSWERED:120388 per.ANSWERED:85.94 num.NOT:9540 per.NOT:6.81 
num.NUMBER:169418 per.NUMBER:100.00 num.ANSWERED:145688 per.ANSWERED:85.99 num.NOT:11547 per.NOT:6.81 
num.NUMBER:198479 per.NUMBER:100.00 num.ANSWERED:170618 per.ANSWERED:85.96 num.NOT:13507 per.NOT:6.80 
num.NUMBER:227567 per.NUMBER:100.00 num.ANSWERED:195524 per.ANSWERED:85.91 num.NOT:15584 per.NOT:6.84 
num.NUMBER:256212 per.NUMBER:100.00 num.ANSWERED:220160 per.ANSWERED:85.92 num.NOT:17531 per.NOT:6.84 
num.NUMBER:284417 per.NUMBER:100.00 num.ANSWERED:244299 per.ANSWERED:85.89 num.NOT:19412 per.NOT:6.82 
num.NUMBER:312537 per.NUMBER:100.00 num.ANSWERED:268593 per.ANSWERED:85.93 num.NOT:21312 per.NOT:6.81 
num.NUMBER:340735 per.NUMBER:100.00 num.ANSWERED:292906 per.ANSWERED:85.96 num.NOT:23143 per.NOT:6.79 
*****num.NUMBER:17675 per.NUMBER:100.00 num.ANSWERED:15260 per.ANSWERED:86.33 num.NOT:1184 per.NOT:6.69 
num.NUMBER:46406 per.NUMBER:100.00 num.ANSWERED:39853 per.ANSWERED:85.87 num.NOT:3205 per.NOT:6.90 
num.NUMBER:74902 per.NUMBER:100.00 num.ANSWERED:64340 per.ANSWERED:85.89 num.NOT:5130 per.NOT:6.84 
num.NUMBER:103056 per.NUMBER:100.00 num.ANSWERED:88569 per.ANSWERED:85.94 num.NOT:7019 per.NOT:6.81 
num.NUMBER:131197 per.NUMBER:100.00 num.ANSWERED:112736 per.ANSWERED:85.92 num.NOT:8902 per.NOT:6.78 
num.NUMBER:159573 per.NUMBER:100.00 num.ANSWERED:136922 per.ANSWERED:85.80 num.NOT:10873 per.NOT:6.81 
num.NUMBER:188056 per.NUMBER:100.00 num.ANSWERED:161420 per.ANSWERED:85.83 num.NOT:12777 per.NOT:6.79 
num.NUMBER:216888 per.NUMBER:100.00 num.ANSWERED:186030 per.ANSWERED:85.77 num.NOT:14855 per.NOT:6.84 
num.NUMBER:244995 per.NUMBER:100.00 num.ANSWERED:210118 per.ANSWERED:85.76 num.NOT:16791 per.NOT:6.85 
num.NUMBER:273234 per.NUMBER:100.00 num.ANSWERED:234355 per.ANSWERED:85.77 num.NOT:18747 per.NOT:6.86 
num.NUMBER:301288 per.NUMBER:100.00 num.ANSWERED:258563 per.ANSWERED:85.81 num.NOT:20598 per.NOT:6.83 
num.NUMBER:329851 per.NUMBER:100.00 num.ANSWERED:283113 per.ANSWERED:85.83 num.NOT:22531 per.NOT:6.83 
*****num.NUMBER:17854 per.NUMBER:100.00 num.ANSWERED:15278 per.ANSWERED:85.57 num.NOT:1316 per.NOT:7.37 
num.NUMBER:47054 per.NUMBER:100.00 num.ANSWERED:40133 per.ANSWERED:85.29 num.NOT:3476 per.NOT:7.38 
num.NUMBER:76090 per.NUMBER:100.00 num.ANSWERED:64987 per.ANSWERED:85.40 num.NOT:5612 per.NOT:7.37 
num.NUMBER:105145 per.NUMBER:100.00 num.ANSWERED:89637 per.ANSWERED:85.25 num.NOT:7745 per.NOT:7.36 
num.NUMBER:134024 per.NUMBER:100.00 num.ANSWERED:114248 per.ANSWERED:85.24 num.NOT:9929 per.NOT:7.40

Because this data source is of type "COUNTER", the delta from the previously returned value is calculated on each poll. When the current value is less than the last one retrieved, the delta is calculated as a 32-bit rollover. This causes really large spikes in my graphs. The delta between "sane" polls should always be about 30000, give or take.
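
To illustrate where the spike comes from, here is a minimal Python sketch of the wraparound handling as I understand it for a COUNTER data source. The function name `counter_delta` is made up for illustration, and this is a simplified model, not RRDtool's actual code. The readings are taken from the output above.

```python
def counter_delta(prev, curr):
    """Simplified model of a COUNTER delta with wraparound assumed on decrease."""
    delta = curr - prev
    if delta < 0:
        # A decrease is assumed to be a 32-bit wraparound.
        delta += 2**32
    if delta < 0:
        # Still negative: assume a 64-bit wraparound instead.
        delta += 2**64 - 2**32
    return delta


# A normal poll, then the poll right after the hourly reset.
normal = counter_delta(18568, 49677)
after_reset = counter_delta(345445, 18568)
print(normal, after_reset)
```

With these readings the normal poll gives a delta of 31109, while the poll after the reset gives 4294640419, which is the spike that ends up in the graph.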

I now see three ways to combat a data source such as this:

1) If I were to enforce a data source limit of 30001, the value NaN would be stored. This would be acceptable, except for a couple of things. First, the limit of 30001 might change next week, and a limit of 33331 might then be more appropriate. I would have to modify the RRD with a new limit, which is very cumbersome. Second, when graphing, a NaN is represented as a gap in the data. An otherwise smooth line would be interrupted by whitespace. Ugly graph.
2) If I were to allow the very large delta to be stored in the RRD file, I would like a way to graph around it. So I would define a CDEF that says (pseudocode): "if this value is greater than 30001, then just continue the line of the graph using the last datapoint that falls within the range 0-30000". In my mind, this is a better solution because I'm OK with a graph that's not exactly accurate on those hourly datapoints. Also, a CDEF can easily be modified in CACTI, and because it is referenced by a graph template, a change would apply dynamically to all graphs that use it.
3) The most terrible option, which I have done in the past, is to buffer the data source based on my own rules and increment it forever, up to the 64-bit limit. When the external script is called from Cacti to update the RRD, it reports the "buffered" value.

To further explain the third option, here is a pseudocode table. I have implemented this in the past using a separately stored "to-present" value. This is a very difficult option and requires a lot of scripting (gathering, processing, storing) and then (retrieving the stored value, presenting it to Cacti).

Code: Select all

counter from device, true delta from last poll, represented counter towards RRD DS
0, 0, 0
100, 100, 100
200, 100, 200
300, 100, 300 
0, (assume 100), 400
100, 100, 500
200, 100, 600
300, 100, 700
0, (assume 100), 800
..., ..., 2^64
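
The table above can be sketched in a few lines of Python. This is only an illustration of the idea, assuming the whole series is available in memory; the name `buffered_counter` is hypothetical, and a real setup would still need to persist the running total between polls (the MySQL table mentioned earlier).

```python
def buffered_counter(readings):
    """Turn a counter that resets to 0 into an ever-increasing value.

    When a reading is lower than the previous one, a reset is assumed and
    the last known good delta is reused as a guess for the missing delta.
    """
    total = 0
    prev = None
    last_delta = 0
    represented = []
    for reading in readings:
        if prev is None:
            delta = 0                  # first poll: nothing to compare with
        elif reading >= prev:
            delta = reading - prev     # normal increment
            last_delta = delta
        else:
            delta = last_delta         # reset detected: assume the previous delta
        total += delta
        represented.append(total)
        prev = reading
    return represented


device = [0, 100, 200, 300, 0, 100, 200, 300, 0]
print(buffered_counter(device))
```

For the device column of the table this produces 0, 100, 200, 300, 400, 500, 600, 700, 800, matching the third column.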

I am looking for a shortcut that is easier than that method! It's just not scalable. CDEF seems to be the key for me, but I am not at all skilled in RPN.

Again, I am ok with "fudging" the graph a little bit here and there.

Post by **bestpa** » Thu Apr 12, 2012 8:13 am

Here is a CDEF that was given to me over on the RRDtool forums.

CDEF:cleanx=x,UN,PREV,x,IF

Its purpose is to replace unknown values (NaN) with the previous value, which is exactly what we need to fill in gaps.

I expect to replace the "UN" with a range in order to not have to put limitations on my DS (LIMIT). I will try this today and post my results.

Post by **gandalf** » Thu Apr 12, 2012 9:51 am

bestpa wrote:If I were to allow the very large delta to be stored in the RRD file, I would like a way to graph around it. So I would define a CDEF that says (pseudocode): "if this value is greater than 30001, then just continue the line of the graph using the last datapoint that falls within the range 0-30000". In my mind, this is a better solution because I'm OK with a graph that's not exactly accurate on those hourly datapoints. Also, a CDEF can easily be modified in CACTI, and because it is referenced by a graph template, a change would apply dynamically to all graphs that use it.

I agree that this makes most sense in your case and is more flexible
R.

Post by **bestpa** » Thu Apr 12, 2012 1:22 pm

I have found two CDEFs that work, depending on the result.

1) Draw NaN same as previous known good

CACTI: CDEF=CURRENT_DATA_SOURCE,UN,PREV,CURRENT_DATA_SOURCE,IF

2) Draw abnormally high value (>50000) as previous known good

CACTI: CDEF=CURRENT_DATA_SOURCE,50000,GT,PREV,CURRENT_DATA_SOURCE,IF
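
For anyone not fluent in RPN, here is a rough Python emulation of what the two CDEFs do to a series of values. It is only a sketch: the function names `fill_unknown` and `clip_spikes` are made up, and the one RRDtool behaviour it relies on is that PREV is the previous result of the same CDEF and is unknown at the first datapoint.

```python
import math


def fill_unknown(values):
    """Emulates x,UN,PREV,x,IF: draw NaN as the previous output value."""
    out = []
    prev = math.nan                    # PREV is unknown at the first datapoint
    for x in values:
        y = prev if math.isnan(x) else x
        out.append(y)
        prev = y
    return out


def clip_spikes(values, limit):
    """Emulates x,limit,GT,PREV,x,IF: draw values above limit as the previous output."""
    out = []
    prev = math.nan
    for x in values:
        y = prev if x > limit else x
        out.append(y)
        prev = y
    return out


print(fill_unknown([9933, 9948, 9999, math.nan, 9893]))
print(clip_spikes([30000, 29000, 4294640419, 31000], 50000))
```

Because PREV carries the previous output forward, several bad datapoints in a row all repeat the last good value. If the very first datapoint is bad, there is nothing to fall back on and it stays unknown.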

Documents that helped me understand RPN and CDEF:

http://oss.oetiker.ch/rrdtool/tut/rpntutorial.en.html
http://oss.oetiker.ch/rrdtool/doc/rrdgraph_rpn.en.html
http://oss.oetiker.ch/rrdtool/tut/cdeftutorial.en.html

Perhaps these CDEFs should be shipped as pre-configured CDEF definitions with future versions of CACTI. That is my "Feature Request".

I will leave it at that and thank you and the community (both here and RRDtool forums) for helping me understand.

Thank You.
