Interface Traffic Graphs are wrong

icewalker · Post by **icewalker** » Wed Oct 01, 2008 7:45 am

OK, I know my Subject is a bit ... well definitive, but I know for a fact that the Interface Graph is incorrect for a Windows 2003 Server.

I've included the graph below. Perhaps I am just way off base with how the graph should work.

Regardless, if you look at the graph for the last 24 hours, it shows a "Total In" of 89.43 GB. There is one problem with this number: it is off by a factor of 3 to 4. At the very least, the Total In should say 268.17 GB. Where did I get this number? This is the amount of data sent to that system last night according to a separate report. I have verified that the data IS NOT compressed during transmission (as was my theory before now).

So, does anybody have any insight? Or should I just give up trying to get useful SNMP data out of Windows (I'm not a fan of Microsoft or their non-standard ways of doing things)?

BSOD2600 · Post by **BSOD2600** » Wed Oct 01, 2008 1:58 pm

1) You need to be aware that RRDTOOL averages data over time, which will affect your bandwidth summaries. You can help to alleviate this by increasing the amount of data (aka rows) which rrdtool stores. Read up about it in the rrdtool tutorial. There are also guides on the documentation site on how to increase Cacti's rrdtool history.

2) If you don't like Windows SNMP, then you're free to use Net-SNMP. http://forums.cacti.net/viewtopic.php?t=26151

icewalker · Post by **icewalker** » Wed Oct 01, 2008 3:32 pm

BSOD2600 wrote: 1) You need to be aware that RRDTOOL averages data over time, which will affect your bandwidth summaries. You can help to alleviate this by increasing the amount of data (aka rows) which rrdtool stores. Read up about it in the rrdtool tutorial. There are also guides on the documentation site on how to increase Cacti's rrdtool history.

Thanks for the info. I've been doing a lot of reading about RRDTool and how the Sum of the Averages is used. The problem with this method is that it definitely is not accurate for a 24 hour period.

My RRDTool setup is the default. So a 5 minute average over 1 day is what my graph is showing. That is about as detailed as the system will get at this time, unless I go to the 1 minute interval.

But since the Interface statistics are basically a counter, I would think the sum would be accurate. But my graph is definitely not accurate based on other data I have available. If it were close (within a few percent), I would not be worried and I would just move on, but the discrepancy is by several factors and that is just too big of an error for reporting purposes.

I've written a small script to collect the In/Out Octets in 5 minute increments so that I can compare the differences between them from 7:00 PM tonight to roughly 3:00 AM (the window for most traffic on this system) without using RRDTool.

If I do my math right, then my "average" should be very very close to what is reported in the rra for this system for each time stamp. And if I take the counter value from 7:00 PM and subtract it from the value at 3:00 AM, I should have my Total Bytes. Since RRDTool should be taking the difference between the counters at each time stamp and dividing by 300 to get the average over 5 minutes, the sum of those averages from 7:00 PM to 3:00 AM (each multiplied back by 300) should be the same (or very very very close).
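The check described above can be sketched in a few lines of Python. This is only an illustration, not the script mentioned in this post: the counter readings are made up, the function names are hypothetical, and it assumes the counter never wraps between polls.

```python
STEP = 300  # seconds between polls (the default 5 minute interval)

def total_from_counters(samples):
    """Total bytes as last reading minus first reading (valid only without wraps)."""
    return samples[-1] - samples[0]

def total_from_averages(samples, step=STEP):
    """Total bytes rebuilt from per-interval average rates, as a graph total would be."""
    rates = [(curr - prev) / step for prev, curr in zip(samples, samples[1:])]
    return sum(rate * step for rate in rates)

# Made-up octet counter readings, small enough that no 32-bit wrap occurs.
samples = [1_000_000, 4_000_000, 9_500_000, 12_000_000]

print(total_from_counters(samples))
print(total_from_averages(samples))
```

As long as every interval's delta is captured correctly, the two totals agree to within floating point rounding.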

BSOD2600 wrote: 2) If you don't like Windows SNMP, then you're free to use Net-SNMP. http://forums.cacti.net/viewtopic.php?t=26151

I was referring to Microsoft putting all the good stuff in WMI instead of SNMP. Basically, I have to install more than SNMP on my Windows Servers; I also have to install something to populate the SNMP data for me when Microsoft should have just done it anyway. Thanks for the link though, it did provide some good insight into what I will be doing next with regards to the Windows Systems in our Environment.

I'll post my results from my little experiment tomorrow.

Regards,

James

icewalker · Post by **icewalker** » Thu Oct 02, 2008 8:10 am

OK, so the graph isn't "wrong per se", just the way the data is collected and displayed. More detail below:

A cursory examination of the data I collected last night (outside of RRDTool) shows some interesting trends. Basically, I am thinking that the 32-bit SNMP counters are not adequate on this system at a 5 minute polling interval.

A 20 minute sampling, for example, shows the following information:

Date Time          ifInOctets  ifOutOctets
10/01/2008 19:05   1287451122       790004
10/01/2008 19:10   1908483356     48854492
10/01/2008 19:15   1715390055    256938675
10/01/2008 19:20    827968154    335170876
10/01/2008 19:25   3354898017    464151381

In just 15 minutes, the counters for ifInOctets cycled at least twice (the value drops between 19:10 and 19:15, and again between 19:15 and 19:20). The big question is, how many times did those counters really cycle?
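To make the wraps in the table above easier to see, here is a small Python sketch over the posted ifInOctets readings. The helper names are hypothetical, and it assumes at most one wrap per interval, which is exactly the assumption in doubt here, so each delta is only a lower bound.

```python
WRAP = 2 ** 32  # a 32-bit SNMP counter rolls over at this value

# ifInOctets readings from the table above, 19:05 through 19:25.
if_in_octets = [1287451122, 1908483356, 1715390055, 827968154, 3354898017]

def min_delta(prev, curr, wrap=WRAP):
    """Smallest byte count consistent with two readings (assumes at most one wrap)."""
    return (curr - prev) % wrap

def count_visible_wraps(readings):
    """Count intervals where the reading went down, i.e. wrapped at least once."""
    return sum(1 for prev, curr in zip(readings, readings[1:]) if curr < prev)

deltas = [min_delta(p, c) for p, c in zip(if_in_octets, if_in_octets[1:])]

print("visible wraps:", count_visible_wraps(if_in_octets))
print("minimum bytes per interval:", deltas)
print("minimum total bytes:", sum(deltas))
```

An interval that wrapped twice is indistinguishable from one that wrapped once, so the true total can be larger than this minimum by any multiple of 2^32 bytes per interval.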

Given the interface speed of 1 Gigabit/sec, it would be quite easy to wrap the counters multiple times (8-9 times, in fact, at full line rate) in a 5 minute period. And knowing that I send more than 250 GB through this interface every night, it is conceivable that the 32-bit SNMP counters on Windows 2003 Server (and Linux) are inadequate at high throughput and that they could have cycled more than once in a 5 minute period.

Using the Perf Mon tool on Windows 2003, I was able to determine that during the same sampling period above, I was peaking at 600 Mbps for several minutes. Maximum throughput over 5 minutes at 600 Mbps is ~21 GB (God I am hoping I'm doing my math right). My data report (the one that tells me that I sent 304.45 GB last night through this interface) shows that I in fact sent 13 GB in a 3 minute period. Knowing this, I can safely conclude that this particular data transfer would have forced the 32-bit SNMP counter to cycle at least 3 times between SNMP samplings.
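The arithmetic above can be double-checked with a short Python sketch. The function names are hypothetical, and it assumes "GB" in the 13 GB figure means 10^9 bytes; with binary gigabytes the wrap count comes out slightly higher.

```python
WRAP_BYTES = 2 ** 32  # capacity of a 32-bit octet counter

def bytes_at_rate(bits_per_second, seconds):
    """Bytes transferred at a constant bit rate over a period."""
    return bits_per_second / 8 * seconds

def wraps(total_bytes):
    """How many times a 32-bit octet counter rolls over for this many bytes."""
    return total_bytes / WRAP_BYTES

five_min_at_600mbps = bytes_at_rate(600e6, 300)

print(five_min_at_600mbps / 2 ** 30)       # about 20.95 (binary GB in 5 minutes)
print(wraps(five_min_at_600mbps))          # about 5.2 wraps per 5 minute poll
print(wraps(13e9))                         # about 3.0 wraps for 13 GB
print(wraps(bytes_at_rate(1e9, 300)))      # about 8.7 wraps at a full 1 Gbit/s
print(WRAP_BYTES / bytes_at_rate(1e9, 1))  # about 34 seconds per wrap at 1 Gbit/s
```

At a full 1 Gbit/s the counter wraps roughly every 34 seconds, so even 60 second polling could miss a wrap at sustained line rate.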

Reviewing the Graph attached to this message, you can indeed see that my Inbound Total is off by roughly a factor of 3.

True throughput > 304.45 GB
Reported Throughput = 91.38 GB

I would guess that this situation is also what BSOD2600 was hinting at earlier in this thread. The average would be adequate provided the sampling is fine enough. Obviously, a 5 minute average is not adequate on a high throughput device with a 32-bit counter. At this point, I have no solution to this problem besides sampling every minute. Unfortunately, that may not be a solution since I would eventually have a few hundred systems to monitor once I get this project off the ground. It is already taking 10 seconds to monitor 14 systems with the php poller. I can certainly switch to spine, but I would find it difficult to believe that it could handle the load once we bring everything online.

Does anybody else have a recommendation? Is it possible to get decent Windows 2003 Server statistics out of WMI? And for the record, the Windows 2003 Server in question is running 64-bit Windows with SNMP installed. So it is not a matter of running an appropriate OS. Even the 64-bit Linux systems only report 32-bit counters on the interface.

I look forward to any alternatives and suggestions.

Thanks

James

BSOD2600 · Post by **BSOD2600** » Thu Oct 02, 2008 2:16 pm

One of the downfalls of Windows is that even though it supports SNMPv2, it doesn't have 64-bit counters! As you've already noticed, the 32-bit counters roll over too fast on systems with lots of traffic, which is where faster polling in Cacti can help fill the gap.

I wouldn't recommend using WMI with cacti for a large installation, since as I'm sure you're already aware, it has a large time overhead due to authentication/polling/etc. The snmp-informant addons might provide additional data via SNMP which you might be interested in.

Tweaking your Cacti installation can result in it being able to poll thousands of devices in a short period of time. There is a metrics thread in the announcement forum, which will give you an idea of the giant systems users have, along with some ideas on optimization (threads, pollers, etc). Eventually, you might want to look into using the Boost plugin too.

icewalker · Post by **icewalker** » Thu Oct 09, 2008 12:15 pm

I apologize for the delay.

Let's suppose for a moment that I wish to poll every minute. And let us say that I'm willing to start over from scratch on all graphs. What would be the best course of action?

I've investigated altering the Data Template for Interface - Traffic and setting the "Hourly" option under Associated RRA's. Unfortunately, when I look at the data for an rrd, I get some weird results (i.e. the time stamps are 360 seconds apart instead of 300 seconds, and I was expecting 60 second intervals).

1223570520: 1.0243742826e+05 1.4302634013e+05
1223570880: 7.6489925664e+04 1.1376989234e+05
1223571240: 1.0166720348e+05 1.3207643704e+05
1223571600: 1.0549418648e+05 1.4706698424e+05
1223571960: 1.1711126194e+05 1.5297393792e+05
1223572320: nan nan

In conjunction with changing the Associated RRA, I also adjusted the "step" to 60 since we are polling every minute instead of every 5 minutes. I'm guessing my assumption was wrong?

I also deleted the rrd files when I made the change and started the graphs over for Traffic Stats.

Again, I'm still in the test phase, so I can do anything at this time. Lastly, I have read the docs, but they talk about modifying the rrd's. That is fine, but I'm going for the simple approach since I will ZERO out everything anyway when we go to production.

Thanks

James

icewalker · Post by **icewalker** » Thu Oct 09, 2008 12:53 pm

Call me stupid. I think I got it. I wasn't querying on the correct resolution, thus I was getting 360 second intervals.

rrdtool info r0b-tsmp01_traffic_in_78.rrd wrote:
filename = "r0b-tsmp01_traffic_in_78.rrd"
rrd_version = "0003"
step = 60
last_update = 1223574362
ds[traffic_in].type = "COUNTER"
ds[traffic_in].minimal_heartbeat = 600
ds[traffic_in].min = 0.0000000000e+00
ds[traffic_in].max = 1.0000000000e+09
ds[traffic_in].last_ds = "2988928139"
ds[traffic_in].value = 5.5462295082e+02
ds[traffic_in].unknown_sec = 0
ds[traffic_out].type = "COUNTER"
ds[traffic_out].minimal_heartbeat = 600
ds[traffic_out].min = 0.0000000000e+00
ds[traffic_out].max = 1.0000000000e+09
ds[traffic_out].last_ds = "3311791590"
ds[traffic_out].value = 1.4259016393e+02
ds[traffic_out].unknown_sec = 0
..
..
..

Then, when I query at the 60 second resolution:

rrdtool fetch r0b-tsmp01_traffic_in_78.rrd AVERAGE -r 60 -s -1h wrote:
            traffic_in         traffic_out
..
1223574120: 2.5069768581e+02 6.9262523427e+01
1223574180: 3.4495668150e+02 1.7829567180e+02
1223574240: 3.1260611111e+02 1.2093944444e+02
1223574300: 3.1689469868e+02 1.2078032015e+02
1223574360: 2.7797436325e+02 7.2120954895e+01
1223574420: 3.0757316029e+02 1.5225816940e+02
1223574480: nan nan

I blame the allergy meds and the resultant fog for my failure to see this before. If anybody has any better suggestions though, I'm all ears.

Thanks

James

Post by **TheWitness** » Thu Oct 09, 2008 8:26 pm

Hey, if you can send me some of your meds, it might help me forget about my 401k...

TheWitness

icewalker · Post by **icewalker** » Wed Oct 15, 2008 7:06 am

I've made some modifications to my Cacti Installation to try and get this issue figured out. The first and most important modification was to poll every minute.

The default graph for the last day on this system seems correct.

So I'm fine with that. Like I said, the numbers appear to be correct. But, when I click on the graph to get the RRD graphs for Hourly, Daily, Weekly, etc, an interesting thing happens; only the Hourly and Yearly Graphs show correct totals. And this is surprising, because you would think the Daily Graph (having been correct on the previous page) would be correct here, but that is not the case.

Maybe I'm just way off, but I would think that because the system was zeroed out less than a week ago, that the Weekly, Monthly, and Yearly graphs would all report the same total, which they don't.

Is this a function of how it averages? Is it something with RRDTool? I'm just trying to wrap my head around this, because I need to trust the numbers and right now, I don't trust them.

BSOD2600 · Post by **BSOD2600** » Wed Oct 15, 2008 1:39 pm

This is getting beyond my understanding on how rrdtool works... Have you read through the guides in http://docs.cacti.net/?q=node/75 ?

Moving to the general forum for better exposure, since this is not a Windows-specific problem.

Post by **gandalf** » Wed Oct 15, 2008 4:19 pm

From a first glance at this thread:

I'm happy that you managed to get around those COUNTER wraps. This is specifically dangerous when using the standard intervals of 300 sec and then switching to 60 sec polling. But it seems to me that you've managed this successfully.
The totaling has its own mysteries. To debug this, from my point of view the rrd file itself is required. Assuming that all updates are ok (COUNTER wraps successfully tackled), it is possible to calculate true totals, e.g. using rrdtool fetch or using advanced RRDTool VDEF based calculations. This definitely requires some time and some knowledge ...
Reinhard

icewalker · Post by **icewalker** » Thu Oct 16, 2008 7:02 am

I had the same thoughts. I'm already looking at using rrdtool to collect the data and do the math. It will take some time for me to learn how to do that. In the meantime, I'm going to post the rrd file for the system in question if somebody already has a quick script or something to perform the math.
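A quick script along these lines could do the math. This is a sketch only: it assumes the plain-text layout of `rrdtool fetch` shown earlier in the thread (timestamp, colon, one value per data source), the helper names are hypothetical, and the sample is the 60 second output posted above, not a full day of data.

```python
import math

def parse_fetch(text):
    """Parse `rrdtool fetch` text output into (timestamp, [values]) rows."""
    rows = []
    for line in text.splitlines():
        ts, sep, rest = line.partition(":")
        if not sep or not ts.strip().isdigit():
            continue  # skip the data source header, blanks, and ".." lines
        rows.append((int(ts), [float(v) for v in rest.split()]))
    return rows

def totals(rows, step):
    """Sum rate * step per data source, skipping unknown (nan) samples."""
    sums = [0.0] * len(rows[0][1])
    for _, values in rows:
        for i, value in enumerate(values):
            if not math.isnan(value):
                sums[i] += value * step
    return sums

sample = """\
            traffic_in         traffic_out
1223574120: 2.5069768581e+02 6.9262523427e+01
1223574180: 3.4495668150e+02 1.7829567180e+02
1223574240: 3.1260611111e+02 1.2093944444e+02
1223574300: 3.1689469868e+02 1.2078032015e+02
1223574360: 2.7797436325e+02 7.2120954895e+01
1223574420: 3.0757316029e+02 1.5225816940e+02
1223574480: nan nan
"""

rows = parse_fetch(sample)
step = rows[1][0] - rows[0][0]  # seconds between consecutive rows
total_in, total_out = totals(rows, step)
print(f"step={step}s in={total_in:.0f} bytes out={total_out:.0f} bytes")
```

Feeding it the output of a fetch over the graph's time range, at the resolution that range actually uses, would give totals to compare against the figures printed on the graph.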

icewalker · Post by **icewalker** » Mon Oct 20, 2008 1:34 pm

I'm still working on why these graphs are wrong. But what I did was compare the stock "Weekly (30 Minute Average)" graph to a custom report for the "Last Week" preset. I then grabbed the "command" used to generate the graph.

"Weekly (30 Minute Average)"

/usr/bin/rrdtool graph - \
--imgformat=PNG \
--start=-604800 \
--end=-360 \
--title="R0B-TSMP01 - Traffic - 172.17.2.52 (Broadcom NetXtr)" \
--rigid \
--base=1000 \
--height=120 \
--width=500 \
--alt-autoscale \
--vertical-label="bytes per second" \
--slope-mode \
--font TITLE:12: \
--font AXIS:8: \
--font LEGEND:10: \
--font UNIT:8: \
DEF:a="/var/www/html/cacti/rra/r0b-tsmp01_traffic_in_78.rrd":traffic_in:AVERAGE \
DEF:b="/var/www/html/cacti/rra/r0b-tsmp01_traffic_in_78.rrd":traffic_in:MAX \
DEF:c="/var/www/html/cacti/rra/r0b-tsmp01_traffic_in_78.rrd":traffic_out:AVERAGE \
DEF:d="/var/www/html/cacti/rra/r0b-tsmp01_traffic_in_78.rrd":traffic_out:MAX \
CDEF:cdeff=c,-1,* \
AREA:a#00CF00FF:"Inbound"  \
GPRINT:a:LAST:" Current\:%8.2lf %s"  \
GPRINT:a:AVERAGE:"Average\:%8.2lf %s"  \
GPRINT:b:MAX:"Maximum\:%8.2lf %s\n"  \
COMMENT:"Total In\:  616.56 GB\n"  \
AREA:cdeff#002A97FF:"Outbound"  \
GPRINT:c:LAST:"Current\:%8.2lf %s"  \
GPRINT:c:AVERAGE:"Average\:%8.2lf %s"  \
GPRINT:d:MAX:"Maximum\:%8.2lf %s\n"  \
COMMENT:"Total Out\: 89.99 GB"

"Last Week" preset

/usr/bin/rrdtool graph - \
--imgformat=PNG \
--start=1223921989 \
--end=1224526789 \
--title="R0B-TSMP01 - Traffic - 172.17.2.52 (Broadcom NetXtr)" \
--rigid \
--base=1000 \
--height=120 \
--width=500 \
--alt-autoscale \
COMMENT:"From 2008/10/13 14\:19\:49 To 2008/10/20 14\:19\:49\c" \
COMMENT:"  \n" \
--vertical-label="bytes per second" \
--slope-mode \
--font TITLE:12: \
--font AXIS:8: \
--font LEGEND:10: \
--font UNIT:8: \
DEF:a="/var/www/html/cacti/rra/r0b-tsmp01_traffic_in_78.rrd":traffic_in:AVERAGE \
DEF:b="/var/www/html/cacti/rra/r0b-tsmp01_traffic_in_78.rrd":traffic_in:MAX \
DEF:c="/var/www/html/cacti/rra/r0b-tsmp01_traffic_in_78.rrd":traffic_out:AVERAGE \
DEF:d="/var/www/html/cacti/rra/r0b-tsmp01_traffic_in_78.rrd":traffic_out:MAX \
CDEF:cdeff=c,-1,* \
AREA:a#00CF00FF:"Inbound"  \
GPRINT:a:LAST:" Current\:%8.2lf %s"  \
GPRINT:a:AVERAGE:"Average\:%8.2lf %s"  \
GPRINT:b:MAX:"Maximum\:%8.2lf %s\n"  \
COMMENT:"Total In\:  2.47 TB\n"  \
AREA:cdeff#002A97FF:"Outbound"  \
GPRINT:c:LAST:"Current\:%8.2lf %s"  \
GPRINT:c:AVERAGE:"Average\:%8.2lf %s"  \
GPRINT:d:MAX:"Maximum\:%8.2lf %s\n"  \
COMMENT:"Total Out\: 359.97 GB"

The major difference besides the Total In and Total Out is the Start and End times. The data for both graphs was gathered at the same time, so the graphs are supposed to cover essentially the same timescale.

I've noticed that the "--end=-360" is off compared to "--end=1224526789" for Weekly (30 Minute Average) and Last week preset, respectively. But this is only a 6 minute difference and there is no way that system pumped ~2 TB in 6 minutes.
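A quick Python check of the two commands above puts numbers on this. The dictionary layout is just for illustration, and it assumes 2.47 TB means 2470 GB, in line with the `--base=1000` option both commands use.

```python
# Figures copied from the two rrdtool graph commands above.
weekly = {"start": -604800, "end": -360, "in_gb": 616.56, "out_gb": 89.99}
preset = {"start": 1223921989, "end": 1224526789, "in_gb": 2470.0, "out_gb": 359.97}

def window_seconds(graph):
    """Length of the graphed time window in seconds."""
    return graph["end"] - graph["start"]

print(window_seconds(weekly), window_seconds(preset))  # 604440 604800
print(window_seconds(preset) - window_seconds(weekly))  # 360
print(preset["in_gb"] / weekly["in_gb"])    # about 4.0
print(preset["out_gb"] / weekly["out_gb"])  # about 4.0
```

Both totals differ by nearly the same factor of about 4, which looks more like a constant multiplier in the total calculation than like extra traffic in the 6 minute gap. That is only an observation from these numbers, not a confirmed cause.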

So I'm leaning toward the code that calculates the totals as the cause of the difference between the graphs. I'm off to go look at that now.

james

icewalker · Post by **icewalker** » Mon Oct 20, 2008 2:17 pm

I haven't narrowed down the code just yet, but I can definitely post differences in the output based on the URL used for the IMG tag.

http://localhost/cacti/graph_image.php?local_graph_id=67&rra_id=0&view_type=tree&graph_start=-604800&graph_end=-360

http://localhost/cacti/graph_image.php?local_graph_id=67&rra_id=0&view_type=tree&graph_start=1223924995&graph_end=1224529795

Looking closely at the graphs, the time period is almost the same (just off by a few minutes in each direction). But the total in and total out is definitely way way off!!!!

I've been looking through the code but I haven't found the difference yet. It's definitely in rrd.php (duh) and there appears to be a different section of code for when graph_start and graph_end are defined. But in this case, I defined them both and still got a different result. I'll keep looking but PHP is not my strength.

James

apitsos · Post by **apitsos** » Sun Apr 22, 2012 5:54 am

Hi there! It's been years since your last post, but I'd like to ask whether you finally found a solution...

I am having exactly the same problem with one of the machines I am tracking with Cacti. The strange thing is that the issue only affects one particular Windows 7 Pro machine; for all the other Windows machines (Server 2008) I am tracking with Cacti, the traffic is correct.

Thanks a lot in advance for your attention.
