NetApp Filer: graphing Performance Stats and IO's (template)

DangHuynh · Post by **DangHuynh** » Fri Feb 13, 2009 3:59 pm

Hi,

I'm trying to get latency data out of volume or LUN, but there is no data. Other read/write data is OK.

cacti-spine-0.8.7a-1.el5.rf
cacti-0.8.7c-1.el5.rf
rrdtool-1.2.29-1.el5.rf
NetApp Release 7.2.5.1

Does anyone have any idea?

Thanks,

Dang

filename = "storage1_read_ops_14272.rrd"
rrd_version = "0003"
step = 300
last_update = 1234558204
ds[read_ops].type = "COUNTER"
ds[read_ops].minimal_heartbeat = 600
ds[read_ops].min = 0.0000000000e+00
ds[read_ops].max = NaN
ds[read_ops].last_ds = "210982540"
ds[read_ops].value = 8.3401993355e+01
ds[read_ops].unknown_sec = 0
ds[write_ops].type = "COUNTER"
ds[write_ops].minimal_heartbeat = 600
ds[write_ops].min = 0.0000000000e+00
ds[write_ops].max = NaN
ds[write_ops].last_ds = "1502227036"
ds[write_ops].value = 6.9506976744e+02
ds[write_ops].unknown_sec = 0
ds[total_ops].type = "COUNTER"
ds[total_ops].minimal_heartbeat = 600
ds[total_ops].min = 0.0000000000e+00
ds[total_ops].max = NaN
ds[total_ops].last_ds = "1797812046"
ds[total_ops].value = 8.7536212625e+02
ds[total_ops].unknown_sec = 0
ds[avg_latency].type = "COUNTER"
ds[avg_latency].minimal_heartbeat = 600
ds[avg_latency].min = 0.0000000000e+00
ds[avg_latency].max = NaN
ds[avg_latency].last_ds = "U"
ds[avg_latency].value = NaN
ds[avg_latency].unknown_sec = 4
ds[read_latency].type = "COUNTER"
ds[read_latency].minimal_heartbeat = 600
ds[read_latency].min = 0.0000000000e+00
ds[read_latency].max = NaN
ds[read_latency].last_ds = "U"
ds[read_latency].value = NaN
ds[read_latency].unknown_sec = 4
ds[write_latency].type = "COUNTER"
ds[write_latency].minimal_heartbeat = 600
ds[write_latency].min = 0.0000000000e+00
ds[write_latency].max = NaN
ds[write_latency].last_ds = "U"
ds[write_latency].value = NaN
ds[write_latency].unknown_sec = 4
ds[other_latency].type = "COUNTER"
ds[other_latency].minimal_heartbeat = 600
ds[other_latency].min = 0.0000000000e+00
ds[other_latency].max = NaN
ds[other_latency].last_ds = "U"
ds[other_latency].value = NaN
ds[other_latency].unknown_sec = 4
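
In the dump above, the ops data sources have numeric `last_ds` values while every latency data source shows `"U"` (unknown). A minimal sketch for spotting this in a saved `rrdtool info` dump, assuming the text layout shown above; the function name `unknown_data_sources` is made up for illustration:

```python
import re

# Matches lines such as: ds[avg_latency].last_ds = "U"
DS_LINE = re.compile(r'^ds\[(?P<name>[^\]]+)\]\.(?P<field>\w+)\s*=\s*(?P<value>.+)$')


def unknown_data_sources(info_text):
    """Return the names of data sources whose last_ds is "U" (unknown)."""
    unknown = []
    for line in info_text.splitlines():
        match = DS_LINE.match(line.strip())
        if not match:
            continue
        value = match.group("value").strip().strip('"')
        if match.group("field") == "last_ds" and value == "U":
            unknown.append(match.group("name"))
    return unknown


# Three lines copied from the dump above.
sample = """\
ds[read_ops].last_ds = "210982540"
ds[avg_latency].last_ds = "U"
ds[read_latency].last_ds = "U"
"""
print(unknown_data_sources(sample))  # ['avg_latency', 'read_latency']
```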

sfrancis · Post by **sfrancis** » Tue Mar 10, 2009 6:40 pm

The latency is pulled via the API, not SNMP, so check that you configured the script with a user that has api-* and http-login permissions on the NetApp.

Or check out LogicMonitor, if you don't want to spend the time rolling your own Cacti graphing and alerting for automated NetApp monitoring (and load balancers, databases, etc).

jtman2003 · Post by **jtman2003** » Fri Mar 13, 2009 4:19 pm

I can run the verbose query, but there are still no graphs.

If I run the command by itself, it runs with no problems. I'm just wondering if I am missing something.

Nothing shows up in the logs as an error, and I am not getting HTTP errors.

jtman2003 · Post by **jtman2003** » Fri Mar 13, 2009 4:38 pm

Reloaded the poller cache and now I get a NaN total on my graphs.

So the NetApp script isn't running? But if I look in the logs, it shows that it ran.

rlund · Post by **rlund** » Mon Mar 16, 2009 10:16 am

wolf31o2 wrote:
adamshand wrote:
wolf31o2 wrote: NetApp Scripts/Templates on Git
This looks great, thanks for posting it. Any chance of a quick readme on what all the bits are for?

Cheers,
Adam.
It's quite simple. Copy the things under scripts to <path_cacti>/scripts, and copy the things under script_server and snmp_queries to their directories under <path_cacti>/resource. After that, you import the templates, which I need to update with my latest changes. In fact, I need to upload some newer scripts and such, too.

I'm planning on supporting everything that I can via several methods.

- SNMPv1 for ONTAP versions prior to 7.3
- SNMPv2/v3 using 64-bit counters for 7.3 and above
- ONTAP Manage API for people who prefer it
- SMI-S Agent scripts for SMI-S software

Of course, I'm open to any help anyone wants to give, and everything I've written is released under the GPLv2. I am adding an installer script to it, and I could use some help with documentation, too. I'd like for the installer to detect the available methods and do some initial setup based on that, so it should work out of the box for everybody, and all they should need to know is the IP addresses of their Filers and the location of their Cacti installation.

Let us know what we can do to help on the project.

Roger L

Twitter:rogerlund
Blog:http://rogerlunditblog.blogspot.com

gheppner · Post by **gheppner** » Fri Mar 27, 2009 1:03 pm

I'm finding that all the LUN stats give me accurate data (as verified on the filer itself), with the exception of average latency. These numbers do not look accurate at all.

For instance, doing a "lun stats -o" for a given LUN shows me average latencies around 7 or 8 ms. But Cacti is showing me data in the 100 - 200 (usec? ms?) area.

I'm also wondering if the latency is really being returned in microseconds. If you use netapp-ontapsdk-perf.pl and do a "lun counter-list" you get this for latency:

Counter Name = avg_latency Base Counter = total_ops Privilege_level = basic Unit = millisec

So, I guess two questions here. 1) Has anyone else verified the data you get with these templates is accurate, and 2) is it milliseconds or microseconds?

It seems to me that some sort of CDEF might be required to adjust the data, but I can't figure out what.

rlund · Post by **rlund** » Fri Mar 27, 2009 1:21 pm

I am having trouble getting the API working with my FAS3140 V7.2.6.1.

Does anyone know if you need a certain version of Data ONTAP for this to work?

gheppner · Post by **gheppner** » Mon Mar 30, 2009 1:31 pm

gheppner wrote: I'm finding that all the LUN stats give me accurate data (as verified on the filer itself), with the exception of average latency. These numbers do not look accurate at all.

For instance, doing a "lun stats -o" for a given LUN shows me average latencies around 7 or 8 ms. But Cacti is showing me data in the 100 - 200 (usec? ms?) area.

I'm also wondering if the latency is really being returned in microseconds. If you use netapp-ontapsdk-perf.pl and do a "lun counter-list" you get this for latency:

Counter Name = avg_latency Base Counter = total_ops Privilege_level = basic Unit = millisec

So, I guess two questions here. 1) Has anyone else verified the data you get with these templates is accurate, and 2) is it milliseconds or microseconds?

It seems to me that some sort of CDEF might be required to adjust the data, but I can't figure out what.

... Ok, after some additional investigation I've concluded the following:

1) the units returned by the API are in milliseconds, not microseconds.
2) the value returned by a call to avg_latency is not representative of the average latency per operation, but the avg latency of the total ops in a given polling period.

I added total_ops as a data source to the lun latency graph template, and then used a CDEF to divide the latency by the total ops. I now get values in the 3 - 8 ms range that are consistent with what the filer shows with lun stats -o -i 5 <lun name>.

I'm curious if anyone else using these templates has noticed what I've noticed, or if I'm way out in left field here.
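
The CDEF described above divides the latency data source by the total_ops data source. Both are stored as COUNTER, so RRDtool has already turned each into a per-second rate by the time the CDEF sees them. A rough sketch of that division, with made-up numbers and a hypothetical helper name (`per_op_latency`); in RRDtool's RPN notation the same thing would look roughly like `avg_latency,total_ops,/`, where the variable names are placeholders for whatever the graph template defines:

```python
def per_op_latency(latency_rate, ops_rate):
    """Divide the latency counter's rate by the ops counter's rate.

    Both arguments are the per-second rates RRDtool derives from the
    COUNTER data sources. Returns None when there is nothing to divide
    (unknown data or no operations in the interval).
    """
    if latency_rate is None or ops_rate is None or ops_rate <= 0:
        return None
    return latency_rate / ops_rate


# Illustrative values only: latency counter growing by 4500 units/s
# while the filer serves 900 ops/s.
print(per_op_latency(4500.0, 900.0))  # 5.0
print(per_op_latency(4500.0, 0.0))    # None
```

The result is in whatever unit the filer reports for the latency counter.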

jlindberg · Post by **jlindberg** » Mon Mar 30, 2009 5:11 pm

gheppner wrote: ... Ok, after some additional investigation I've concluded the following:

1) the units returned by the API are in milliseconds, not microseconds.
2) the value returned by a call to avg_latency is not representative of the average latency per operation, but the avg latency of the total ops in a given polling period.

I added total_ops as a data source to the lun latency graph template, and then used a CDEF to divide the latency by the total ops. I now get values in the 3 - 8 ms range that are consistent with what the filer shows with lun stats -o -i 5 <lun name>.

I'm curious if anyone else using these templates has noticed what I've noticed, or if I'm way out in left field here.

Hi gheppner,

I've been attacking the same problem with regard to volume latency numbers. They're just way out of range (like PetaMicroseconds). From reading up on the ONTAPI docs, it appears that you are on the right track, but they mention taking 2 samples at time T1 and T2 and then calculating latency as:

(latency_T2 - latency_T1) / (total_ops_T2 - total_ops_T1)

I took the netapp-ontapsdk-perf.pl script and hacked up a version to do 2 samples of volume avg_latency 10 seconds apart using the method above and the number very closely matches the CLI "stats show" output (volume latency is in microseconds).
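
The two-sample formula above can be sketched as follows. This is not the hacked-up Perl script mentioned in the post, only an illustration of the arithmetic; the `Sample` class, the function name, and the numbers are all assumptions:

```python
from dataclasses import dataclass


@dataclass
class Sample:
    latency: int    # cumulative avg_latency counter as returned by the API
    total_ops: int  # cumulative total_ops counter (the base counter)


def interval_latency(t1, t2):
    """(latency_T2 - latency_T1) / (total_ops_T2 - total_ops_T1).

    Returns None if no operations completed between the two samples.
    """
    delta_ops = t2.total_ops - t1.total_ops
    if delta_ops <= 0:
        return None
    return (t2.latency - t1.latency) / delta_ops


# Made-up samples taken 10 seconds apart.
t1 = Sample(latency=1_000_000, total_ops=2_000)
t2 = Sample(latency=1_040_000, total_ops=2_010)
print(interval_latency(t1, t2))  # 4000.0
```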

markdv · Post by **markdv** » Tue Mar 31, 2009 1:38 am

jlindberg wrote:
gheppner wrote: ... Ok, after some additional investigation I've concluded the following:

1) the units returned by the API are in milliseconds, not microseconds.
2) the value returned by a call to avg_latency is not representative of the average latency per operation, but the avg latency of the total ops in a given polling period.

I added total_ops as a data source to the lun latency graph template, and then used a CDEF to divide the latency by the total ops. I now get values in the 3 - 8 ms range that are consistent with what the filer shows with lun stats -o -i 5 <lun name>.

I'm curious if anyone else using these templates has noticed what I've noticed, or if I'm way out in left field here.
Hi gheppner,

I've been attacking the same problem with regard to volume latency numbers. They're just way out of range (like PetaMicroseconds). From reading up on the ONTAPI docs, it appears that you are on the right track, but they mention taking 2 samples at time T1 and T2 and then calculating latency as:

(latency_T2 - latency_T1) / (total_ops_T2 - total_ops_T1)

I took the netapp-ontapsdk-perf.pl script and hacked up a version to do 2 samples of volume avg_latency 10 seconds apart using the method above and the number very closely matches the CLI "stats show" output (volume latency is in microseconds).

I think gheppner's method is a lot easier. Though my math is rusty, I think his method is also mathematically correct. To try to verify, I spent a couple of minutes trying it in oocalc, replicating the math suggested by NetApp and what you get using an RRD and gheppner's suggestion, and it absolutely seems to yield the correct numbers.

I'm totally new to cacti and this forum btw. Been using munin to create some graphs for my filers but now I'm trying cacti because I think it would work and look much nicer.

jlindberg · Post by **jlindberg** » Tue Mar 31, 2009 9:35 am

markdv wrote: I think gheppner's method is a lot easier. Though my math is rusty, I think his method is also mathematically correct. To try to verify, I spent a couple of minutes trying it in oocalc, replicating the math suggested by NetApp and what you get using an RRD and gheppner's suggestion, and it absolutely seems to yield the correct numbers.

Yeah, you're right. After I thought about it some more, since Cacti is treating this as a counter it basically does the subtraction between intervals for the calculation so doing the CDEF method is much simpler than what I was contemplating.

I abandoned my idea and did what gheppner suggested and the numbers look good (although, as I indicated, volume latency is indeed in microseconds).
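
The reasoning above (a COUNTER data source already does the subtraction between polls, so the CDEF ratio matches the two-sample formula) can be checked numerically. A small sketch with invented counter values; `counter_rate` is a simplified stand-in for what RRDtool does with a COUNTER and ignores counter wraps:

```python
STEP = 300  # seconds between polls, matching "step = 300" in the RRD dump


def counter_rate(prev, curr, step=STEP):
    """Per-second rate for a COUNTER data source (wraps ignored)."""
    return (curr - prev) / step


# Invented cumulative counter readings at two consecutive polls.
lat_t1, lat_t2 = 9_000_000, 9_600_000
ops_t1, ops_t2 = 50_000, 50_150

# What the CDEF computes: ratio of the two stored rates.
cdef_value = counter_rate(lat_t1, lat_t2) / counter_rate(ops_t1, ops_t2)

# What the ONTAPI docs describe: ratio of the two raw deltas.
two_sample_value = (lat_t2 - lat_t1) / (ops_t2 - ops_t1)

print(cdef_value, two_sample_value)  # 4000.0 4000.0
```

The step cancels out of the ratio, which is why the polling interval does not matter for the CDEF approach.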

gheppner · Post by **gheppner** » Wed Apr 01, 2009 10:58 am

jlindberg wrote:
markdv wrote: I think gheppner's method is a lot easier. Though my math is rusty, I think his method is also mathematically correct. To try to verify, I spent a couple of minutes trying it in oocalc, replicating the math suggested by NetApp and what you get using an RRD and gheppner's suggestion, and it absolutely seems to yield the correct numbers.
Yeah, you're right. After I thought about it some more, since Cacti is treating this as a counter it basically does the subtraction between intervals for the calculation so doing the CDEF method is much simpler than what I was contemplating.

I abandoned my idea and did what gheppner suggested and the numbers look good (although, as I indicated, volume latency is indeed in microseconds).

Curious how you determined volume latency was in microseconds. If I pass "volume counter-list" to the Perl script, it returns the units as milliseconds also:

netapp-ontapsdk-perf.pl myfilerhead "username-ommited" 'password-ommited' volume counter-list

Counter Name = avg_latency Base Counter = total_ops Privilege_level = basic Unit = millisec
Counter Name = total_ops Base Counter = none Privilege_level = basic Unit = per_sec
Counter Name = read_data Base Counter = none Privilege_level = basic Unit = b_per_sec
Counter Name = read_latency Base Counter = read_ops Privilege_level = basic Unit = millisec
Counter Name = read_ops Base Counter = none Privilege_level = basic Unit = per_sec
Counter Name = write_data Base Counter = none Privilege_level = basic Unit = b_per_sec
Counter Name = write_latency Base Counter = write_ops Privilege_level = basic Unit = millisec

jlindberg · Post by **jlindberg** » Mon Apr 27, 2009 1:20 pm

gheppner wrote: Curious how you determined volume latency was in microseconds. If I pass "volume counter-list" to the Perl script, it returns the units as milliseconds also:

Hi again...

The "Unified Storage Performance Management Using Open Interfaces" design guide (3/7/2008 page 117) which I was originally using to work on the graph says that avg_latency, read_latency and write_latency units are in "USECS".

Further, comparing the numbers I was seeing from the poll against "stats show ... volume" (also in microseconds) confirmed the documentation.

Over the past several weeks I've been graphing volume latency data, the graph tracks with "stats show ... volume" data.

Put another way, if it really IS milliseconds, our response time is sucking badly at 4,000 ms rather than 4,000 us!

I just thought to try the netapp-ontapsdk-perf.pl query that you did, and here are my results... not sure why mine are different from yours.

Counter Name = avg_latency   Base Counter = total_ops Privilege_level = basic Unit = microsec
Counter Name = total_ops     Base Counter = none      Privilege_level = basic Unit = per_sec
Counter Name = read_data     Base Counter = none      Privilege_level = basic Unit = b_per_sec
Counter Name = read_latency  Base Counter = read_ops  Privilege_level = basic Unit = microsec
Counter Name = read_ops      Base Counter = none      Privilege_level = basic Unit = per_sec
Counter Name = write_data    Base Counter = none      Privilege_level = basic Unit = b_per_sec
Counter Name = write_latency Base Counter = write_ops Privilege_level = basic Unit = microsec
Counter Name = write_ops     Base Counter = none      Privilege_level = basic Unit = per_sec
Counter Name = other_latency Base Counter = other_ops Privilege_level = basic Unit = microsec
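
Since the two filers in this thread report different units for the same counters, one option is to read the unit from the counter-list output rather than hard-coding it. A minimal sketch, assuming the line layout shown above; the function names and the divisor table are illustrative and not part of the templates:

```python
import re

# One counter-list line, as printed by the script in the posts above.
COUNTER_LINE = re.compile(
    r"Counter Name = (?P<name>\S+)\s+"
    r"Base Counter = (?P<base>\S+)\s+"
    r"Privilege_level = (?P<priv>\S+)\s+"
    r"Unit = (?P<unit>\S+)"
)

# Divisors to bring a reported value to milliseconds (only the two
# unit strings seen in this thread are covered).
DIVISOR_TO_MS = {"millisec": 1, "microsec": 1000}


def latency_units(counter_list_text):
    """Map each *_latency counter name to the unit string it reports."""
    units = {}
    for line in counter_list_text.splitlines():
        match = COUNTER_LINE.search(line)
        if match and match.group("name").endswith("_latency"):
            units[match.group("name")] = match.group("unit")
    return units


def to_milliseconds(value, unit):
    """Convert a latency value to milliseconds based on its unit string."""
    return value / DIVISOR_TO_MS[unit]


listing = """\
Counter Name = avg_latency   Base Counter = total_ops Privilege_level = basic Unit = microsec
Counter Name = total_ops     Base Counter = none      Privilege_level = basic Unit = per_sec
"""
units = latency_units(listing)
print(units)                                         # {'avg_latency': 'microsec'}
print(to_milliseconds(4000, units["avg_latency"]))   # 4.0
```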

befrenchy · Post by **befrenchy** » Tue Feb 16, 2010 4:07 pm

I'm trying to figure out what I'm doing wrong. When I run the script manually, everything works great, but when I try to create new graphs for my filer, it shows "This data query returned 0 rows", and when I run it in debug mode I get the following:

+ Running data query [16].
+ Found type = '4 '[script query].
+ Found data query XML file at '/usr/share/cacti/site/resource/script_queries/query-netapp-ontapsdk-system.xml'
+ XML file parsed ok.
+ Executing script for list of indexes 'perl /usr/share/cacti/site/scripts/netapp-ontapsdk-perf.pl fasprs02 "USERNAME" "PASSWORD" system index'
+ Executing script query 'perl /usr/share/cacti/site/scripts/netapp-ontapsdk-perf.pl fasprs02 "USERNAME" "PASSWORD" system query index'
+ Found data query XML file at '/usr/share/cacti/site/resource/script_queries/query-netapp-ontapsdk-system.xml'
+ Found data query XML file at '/usr/share/cacti/site/resource/script_queries/query-netapp-ontapsdk-system.xml'
+ Found data query XML file at '/usr/share/cacti/site/resource/script_queries/query-netapp-ontapsdk-system.xml'

Any thoughts on what I might be doing wrong? This is a brand new setup as well. Let me know if you need more information.

eschoeller · Post by **eschoeller** » Tue Feb 16, 2010 6:20 pm

gheppner wrote: ... Ok, after some additional investigation I've concluded the following:

1) the units returned by the API are in milliseconds, not microseconds.
2) the value returned by a call to avg_latency is not representative of the average latency per operation, but the avg latency of the total ops in a given polling period.

I added total_ops as a data source to the lun latency graph template, and then used a CDEF to divide the latency by the total ops. I now get values in the 3 - 8 ms range that are consistent with what the filer shows with lun stats -o -i 5 <lun name>.

I'm curious if anyone else using these templates has noticed what I've noticed, or if I'm way out in left field here.

@gheppner:

Wow, I've been running these templates for months and had no idea the volume latencies were off by so much. Thanks for tracking this issue down. I only partially understand what you've done here, mostly because I haven't looked at this template in a long time... Is there a chance you can roll a new version of this template, or at least post some updated XML files to reflect the changes you've made? I'm also wondering how this will work against the old templates and RRDs I already have running.

Thanks!
