I've been having problems with spikes on "counter" graphs pretty much since I started using cacti. It was never a big deal, as I used to graph only a couple of counter sources (apache), and they were reset to 0 often (producing a spike due to "overflow" compensation).
Now, I'm graphing a LOT of counters, and I had the same problem. I changed them to DERIVE, and now, for the most part, they seem to be working ok. But then today, I noticed a major problem:
http://status.digitalorphans.org/graphs.php/ip_traffic
That big spike that's on there is evenly split between all protocols. That *never* happens. Usually http traffic saturates that graph. You can see this more closely if you click on "usage graphs" and browse through the other traffic graphs.
I started looking around on the rrdtool mailing list, and found this:
http://www.ee.ethz.ch/~slist/rrd-users/msg00090.html
This kind of makes sense.. is this something that can be integrated into cacti (and on what kind of timeline)? I can write a script to act as a middle man for cacti's data input, buffering the data and changing anything that's lower than it used to be to "U", but this is probably something that should be incorporated into cacti itself.
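A minimal sketch of the "middle man" idea described above, assuming the readings arrive as a plain list of integers. The function name filter_resets is hypothetical and this is not cacti code; it only illustrates replacing any reading that is lower than the previous one with "U".

```python
def filter_resets(samples):
    """Return the samples with "U" in place of any reading that dropped.

    A reading lower than the one before it is treated as a counter
    reset, so it is reported as unknown instead of being passed on.
    """
    previous = None  # last raw reading seen (the "buffer")
    filtered = []
    for value in samples:
        if previous is not None and value < previous:
            filtered.append("U")
        else:
            filtered.append(value)
        # Always remember the raw reading, so counting resumes
        # normally from the post-reset value.
        previous = value
    return filtered


print(filter_resets([63, 120, 170, 8, 55]))
```

How rrdtool then treats the interval following a "U" update is not covered by this sketch.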
counter resets
Hmm... I may be wrong, but I always thought this was already integrated into cacti. This is exactly what the minimum and maximum values in the DS settings are for. If a value is out of range, rrdtool will put "U" into the rrd file instead. The spikes are really high, so it's quite easy to set limits. With counters, one needs to remember that the rate is stored, not the counter value, so the limit should be related to the rate.
- bulek
But the problem lies in the way counters are handled.
If you use a COUNTER type, then anytime the value is lower, rrdtool assumes an overflow occurred (based on min/max values) and compensates for it. So basically, from what I understand, say you have a max value of 255 (1 byte). You get these data points: 63, 120, 170, 8, 55. rrdtool will then plot 57 (120-63), 50 (170-120), 93 (here, an overflow is detected, so 255 (max) - 170 + 8), and 47 (55-8).
That might be fine if you happen to be using something that overflows.. but perhaps when it went to 8, that was when the counter was reset on purpose.
If you use DERIVE, it avoids this overflow problem, but there is something else wrong (and I'm not quite sure what). Since DERIVE just measures the rate of change (the same as a counter does), when the counter is reset, it's measuring a rate change: if it's at 170 and goes to 8, it measures 8-170 = -162 (though I'm not sure how it handles negatives.. but technically, that's the derivative of that sequence).
I'm really not quite sure what is going on with my graph, as it put everything evenly spaced out, so any theories are welcome :p
But basically what that post is saying is that between the 170 and the 8 it should put in a "U" (unknown) value, because it is unknown: if the counter was reset, there's no way to know how high it actually went before being reset. For example, maybe it was reset while it was at 170, so 8 should be plotted on the graph, but maybe it got up to 220 before it was reset, in which case 58 should be plotted. There's no way for rrdtool to know.
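To make the three interpretations above concrete, here is a rough sketch that runs them over the same data points. It follows the arithmetic used in this post (max - previous + current on a drop), not rrdtool's actual source, and the function names are made up for illustration.

```python
MAX_VALUE = 255  # the 1-byte maximum assumed in the example above


def counter_deltas(samples, max_value=MAX_VALUE):
    """COUNTER-style: treat any drop as an overflow and compensate."""
    deltas = []
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            deltas.append(cur - prev)
        else:
            deltas.append(max_value - prev + cur)
    return deltas


def derive_deltas(samples):
    """DERIVE-style: plain difference, which can go negative."""
    return [cur - prev for prev, cur in zip(samples, samples[1:])]


def unknown_on_drop(samples):
    """The mailing-list suggestion: a drop yields unknown (None here)."""
    return [cur - prev if cur >= prev else None
            for prev, cur in zip(samples, samples[1:])]


samples = [63, 120, 170, 8, 55]
print(counter_deltas(samples))   # [57, 50, 93, 47]
print(derive_deltas(samples))    # [57, 50, -162, 47]
print(unknown_on_drop(samples))  # [57, 50, None, 47]
```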
Your example is wrong because counters typically use signed values. If you assume an 8-bit counter, it starts from 0, goes up to 127 and then overflows to -128. Rrdtool takes care of counter overflow very well, and you will not notice on the graph that it occurred. A counter reset (manual, device reboot, etc., which sets the counter to 0) is another story. Let me show it with another example.
Assumptions: 16-bit counter (-32768...0...32767), value growing constantly by 300 per poll (to simplify the rate calculation), polling period 5 min, which gives a rate of 1/sec.
Counter overflow:
[...]
rate = (32400 - 32100)/300 = 1
rate = (32700 - 32400)/300 = 1
rate = (-32536 - 32700 + 65536)/300 = 1 /* overflow compensation here */
rate = (-32236 - -32536)/300 = 1
[...]
Counter reset:
[...]
rate = (30300 - 30000)/300 = 1
rate = (30600 - 30300)/300 = 1
rate = (20 - 30600 + 65536)/300 = 116.52 /* overflow compensation here - wrong because it was counter reset */
rate = (320 - 20)/300 = 1
[...]
Then you can see a huge peak (116 compared to the normal 1). For DERIVE, overflow compensation does not occur, which means this type of DS is sensitive to both overflows and resets.
Now the most important part is the min and max values used with rrdcreate. If the calculated rate is outside the defined range (min and max), it is not stored in the RRA; NaN is used instead. Assuming a max value of 5 in the example above, my rrd file will not be updated with the 116 peak rate. It will have one NaN and then the normal rate.
So I suggest using COUNTER in cacti (though I'm not saying it is appropriate in all cases) and setting proper min and max values in the DS settings. I hope it's clearer now.
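As a rough illustration of the min/max idea, the sketch below reproduces the counter reset numbers from this post and then applies a max of 5. It assumes the 16-bit wrap and 300-second step used in the example; the function names are invented here and this is not how rrdtool is implemented internally.

```python
import math

WRAP = 65536  # 2**16, matching the 16-bit counter assumed above
STEP = 300    # polling period in seconds (5 min)


def counter_rate(prev, cur, step=STEP, wrap=WRAP):
    """Per-second rate with overflow compensation on a drop."""
    delta = cur - prev
    if delta < 0:
        delta += wrap  # assume the counter wrapped around
    return delta / step


def apply_limits(rate, minimum=0.0, maximum=5.0):
    """Keep the rate only if it is within [minimum, maximum]."""
    return rate if minimum <= rate <= maximum else math.nan


readings = [30000, 30300, 30600, 20, 320]  # reset after 30600
rates = [counter_rate(p, c) for p, c in zip(readings, readings[1:])]
stored = [apply_limits(r) for r in rates]

print(rates)   # third value is the bogus peak of about 116.52
print(stored)  # third value is nan, the others stay at 1.0
```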
- bulek
Replying to myself but...
I'd like to say that I thought again about your example with the unsigned counter, and I have to say it's also correct (sorry for the confusion). Anyway, the real conclusion of this topic is that rrdtool handles overflows of 32-bit and 64-bit counters correctly. It does not do a good job with other counters (8-bit, 16-bit, etc.) or with counter resets, which can happen at any time. The only automatic way to avoid graph peaks is to set the min and max values properly.
- bulek
As I mentioned in my followup, I rethought this, and all my signed/unsigned babbling was not important. Please disregard it and focus on the min/max solution to your problem. I have about 3500 graphs defined with almost 50000 graph items. They are graphing mostly counter type data sources. I was fighting graph peaks for quite a long time, and I finally solved the problem by setting proper limits on the collected data.
- bulek
Re: counter resets
How do you know what the best value for Max is?
I am seeing similar behavior on a Marconi VCI link that resets its counters when the circuit is rebuilt (which does happen quite a bit).
Just wondering what a good value to set here usually is?