NetApp Filer: graphing Performance Stats and IO's (template)

DangHuynh
Posts: 1
Joined: Fri Feb 13, 2009 3:46 pm

No Latency Data - LUN or Volume

Post by DangHuynh »

Hi,

I'm trying to get latency data for a volume or LUN, but no data comes back. The other read/write data is OK.

cacti-spine-0.8.7a-1.el5.rf
cacti-0.8.7c-1.el5.rf
rrdtool-1.2.29-1.el5.rf
NetApp Release 7.2.5.1

Does anyone have any idea?

Thanks,

Dang

filename = "storage1_read_ops_14272.rrd"
rrd_version = "0003"
step = 300
last_update = 1234558204
ds[read_ops].type = "COUNTER"
ds[read_ops].minimal_heartbeat = 600
ds[read_ops].min = 0.0000000000e+00
ds[read_ops].max = NaN
ds[read_ops].last_ds = "210982540"
ds[read_ops].value = 8.3401993355e+01
ds[read_ops].unknown_sec = 0
ds[write_ops].type = "COUNTER"
ds[write_ops].minimal_heartbeat = 600
ds[write_ops].min = 0.0000000000e+00
ds[write_ops].max = NaN
ds[write_ops].last_ds = "1502227036"
ds[write_ops].value = 6.9506976744e+02
ds[write_ops].unknown_sec = 0
ds[total_ops].type = "COUNTER"
ds[total_ops].minimal_heartbeat = 600
ds[total_ops].min = 0.0000000000e+00
ds[total_ops].max = NaN
ds[total_ops].last_ds = "1797812046"
ds[total_ops].value = 8.7536212625e+02
ds[total_ops].unknown_sec = 0
ds[avg_latency].type = "COUNTER"
ds[avg_latency].minimal_heartbeat = 600
ds[avg_latency].min = 0.0000000000e+00
ds[avg_latency].max = NaN
ds[avg_latency].last_ds = "U"
ds[avg_latency].value = NaN
ds[avg_latency].unknown_sec = 4
ds[read_latency].type = "COUNTER"
ds[read_latency].minimal_heartbeat = 600
ds[read_latency].min = 0.0000000000e+00
ds[read_latency].max = NaN
ds[read_latency].last_ds = "U"
ds[read_latency].value = NaN
ds[read_latency].unknown_sec = 4
ds[write_latency].type = "COUNTER"
ds[write_latency].minimal_heartbeat = 600
ds[write_latency].min = 0.0000000000e+00
ds[write_latency].max = NaN
ds[write_latency].last_ds = "U"
ds[write_latency].value = NaN
ds[write_latency].unknown_sec = 4
ds[other_latency].type = "COUNTER"
ds[other_latency].minimal_heartbeat = 600
ds[other_latency].min = 0.0000000000e+00
ds[other_latency].max = NaN
ds[other_latency].last_ds = "U"
ds[other_latency].value = NaN
ds[other_latency].unknown_sec = 4
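
In a dump like the one above, the data sources that are not being updated are the ones with last_ds = "U". A minimal Python sketch for picking them out, assuming the rrdtool info output has been saved as text; the function name unknown_data_sources is made up for illustration:

```python
import re

# Matches lines such as: ds[avg_latency].last_ds = "U"
DS_LINE = re.compile(r'^ds\[(?P<name>[^\]]+)\]\.(?P<field>\w+)\s*=\s*(?P<value>.+)$')

def unknown_data_sources(info_text):
    """Return names of data sources whose last_ds is "U" (unknown)."""
    unknown = []
    for line in info_text.splitlines():
        match = DS_LINE.match(line.strip())
        if not match:
            continue  # header lines such as "step = 300" are skipped
        value = match.group("value").strip().strip('"')
        if match.group("field") == "last_ds" and value == "U":
            unknown.append(match.group("name"))
    return unknown

# Three lines copied from the dump above.
sample = '''\
ds[read_ops].last_ds = "210982540"
ds[avg_latency].last_ds = "U"
ds[read_latency].last_ds = "U"
'''
print(unknown_data_sources(sample))
```

On the full dump this would list the four latency data sources, which matches the symptom: the ops counters update, the latency counters never receive a value.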
sfrancis
Posts: 3
Joined: Tue Mar 10, 2009 6:31 pm

Post by sfrancis »

The latency is pulled via the API, not SNMP, so check that you configured the script with a user that has api-* and http-login permissions on the NetApp.

Or check out LogicMonitor if you don't want to spend the time rolling your own Cacti graphing and alerting for automated NetApp monitoring (and load balancers, databases, etc.).
jtman2003
Posts: 2
Joined: Fri Mar 13, 2009 4:16 pm

no graphs showing up

Post by jtman2003 »

I can run the verbose query, but there are still no graphs.

If I run the command by itself, it runs with no problems. Just wondering if I am missing something.

Nothing shows up in the logs as an error, and I am not getting HTTP errors.
jtman2003
Posts: 2
Joined: Fri Mar 13, 2009 4:16 pm

Update

Post by jtman2003 »

Reloaded the poller cache and now I get a NaN total on my graphs.

So the NetApp script isn't running? But if I look in the logs, it shows that it ran.
rlund
Posts: 16
Joined: Fri Apr 11, 2008 9:19 am

Re: SNMP versions

Post by rlund »

wolf31o2 wrote:
adamshand wrote:
wolf31o2 wrote: NetApp Scripts/Templates on Git
This looks great, thanks for posting it. Any chance of a quick readme on what all the bits are for?

Cheers,
Adam.
It's quite simple. Copy the things under scripts to <path_cacti>/scripts, and copy the things under script_server and snmp_queries to their directories under <path_cacti>/resource. After that, you import the templates, which I need to update with my latest changes. In fact, I need to upload some newer scripts and such, too.

I'm planning on supporting everything that I can via several methods.

- SNMPv1 for ONTAP versions prior to 7.3
- SNMPv2/v3 using 64-bit counters for 7.3 and above
- ONTAP Manage API for people who prefer it
- SMI-S Agent scripts for SMI-S software

Of course, I'm open to any help anyone wants to give, and everything I've written is released under the GPLv2. I am adding an installer script to it, and I could use some help with documentation, too. I'd like for the installer to detect the available methods and do some initial setup based on that, so it should work out of the box for everybody, and all they should need to know is the IP addresses of their Filers and the location of their Cacti installation.

Let us know what we can do to help on the project.

Roger L

Twitter:rogerlund
Blog:http://rogerlunditblog.blogspot.com
gheppner
Posts: 20
Joined: Thu Dec 04, 2008 5:10 pm

Post by gheppner »

I'm finding that all the LUN stats give me accurate data (as verified on the filer itself), with the exception of average latency. These numbers do not look accurate at all.

For instance, doing a "lun stats -o" for a given LUN shows me average latencies around 7 or 8 ms. But Cacti is showing me data in the 100 - 200 (usec? ms?) area.

I'm also wondering if the latency is really being returned in microseconds. If you use netapp-ontapsdk-perf.pl and do a "lun counter-list" you get this for latency:

Counter Name = avg_latency Base Counter = total_ops Privilege_level = basic Unit = millisec

So, I guess two questions here: 1) has anyone else verified that the data you get with these templates is accurate, and 2) is it milliseconds or microseconds?

It seems to me that some sort of CDEF might be required to adjust the data, but I can't figure out what.
rlund
Posts: 16
Joined: Fri Apr 11, 2008 9:19 am

Post by rlund »

I am having trouble getting the API working with my FAS3140 running 7.2.6.1.

Does anyone know if you need a certain version of Data ONTAP for this to work?
gheppner
Posts: 20
Joined: Thu Dec 04, 2008 5:10 pm

Post by gheppner »

... Ok, after some additional investigation I've concluded the following:

1) the units returned by the API are in milliseconds, not microseconds.
2) the value returned by a call to avg_latency is not representative of the average latency per operation, but the avg latency of the total ops in a given polling period.

I added total_ops as a data source to the lun latency graph template, and then used a CDEF to divide the latency by the total ops. I now get values in the 3 - 8 ms range that are consistent with what the filer shows with lun stats -o -i 5 <lun name>.
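
The division described above can be sketched as follows. This is only an illustration, not the actual template change: the rrdtool RPN string assumes the two graph DEFs are named avg_latency and total_ops, and latency_per_op is a hypothetical helper that mirrors the same arithmetic in Python.

```python
import math

# Raw rrdtool equivalent of the division, assuming the graph's DEFs are named
# avg_latency and total_ops. The IF guards intervals with no operations.
CDEF_PER_OP = "CDEF:latency_per_op=total_ops,0,EQ,0,avg_latency,total_ops,/,IF"

def latency_per_op(latency_rate, ops_rate):
    """Mirror the CDEF: divide the latency rate by the ops rate for one interval."""
    if math.isnan(latency_rate) or math.isnan(ops_rate):
        return math.nan  # an unknown sample stays unknown
    if ops_rate == 0:
        return 0.0       # idle interval: no operations, report zero latency
    return latency_rate / ops_rate

# Made-up rates: 3500 latency units/s spread over 700 ops/s.
print(latency_per_op(3500.0, 700.0))
```

In Cacti itself the CDEF is built in the web UI rather than typed as a raw string, so treat CDEF_PER_OP as a description of the arithmetic only.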

I'm curious if anyone else using these templates has noticed what I've noticed, or if I'm way out in left field here.
jlindberg
Posts: 3
Joined: Mon Mar 30, 2009 4:58 pm

Post by jlindberg »

gheppner wrote: ... Ok, after some additional investigation I've concluded the following:

1) the units returned by the API are in milliseconds, not microseconds.
2) the value returned by a call to avg_latency is not representative of the average latency per operation, but the avg latency of the total ops in a given polling period.

I added total_ops as a data source to the lun latency graph template, and then used a CDEF to divide the latency by the total ops. I now get values in the 3 - 8 ms range that are consistent with what the filer shows with lun stats -o -i 5 <lun name>.

I'm curious if anyone else using these templates has noticed what I've noticed, or if I'm way out in left field here.
Hi gheppner,

I've been attacking the same problem with regard to volume latency numbers. They're just way out of range (like PetaMicroseconds) :o. From reading up on the ONTAPI docs, it appears that you are on the right track, but they mention taking two samples at times T1 and T2 and then calculating latency as:

(latency_T2 - latency_T1) / (total_ops_T2 - total_ops_T1)

I took the netapp-ontapsdk-perf.pl script and hacked up a version to do 2 samples of volume avg_latency 10 seconds apart using the method above and the number very closely matches the CLI "stats show" output (volume latency is in microseconds).
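
The two-sample formula above can be written out as a small Python sketch. The function name and the counter values are made up for illustration; the result is in whatever unit the filer reports for the latency counter.

```python
def latency_between_samples(latency_t1, ops_t1, latency_t2, ops_t2):
    """Average latency per operation between two raw counter samples."""
    delta_ops = ops_t2 - ops_t1
    if delta_ops <= 0:
        return None  # no operations in the window, or a counter reset/wrap
    return (latency_t2 - latency_t1) / delta_ops

# Made-up counter values: latency accumulated 40000 units over 10 operations.
print(latency_between_samples(1_000_000, 500, 1_040_000, 510))
```

Returning None for an empty or negative window avoids a division by zero and keeps a counter reset from producing a nonsense value.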
markdv
Posts: 3
Joined: Tue Mar 31, 2009 1:01 am

Post by markdv »

I think gheppner's method is a lot easier. Though my math is rusty, I think his method is also mathematically correct. To try to verify, I spent a couple of minutes trying it in oocalc, replicating the math suggested by NetApp and what you get using an RRD and gheppner's suggestion, and it absolutely seems to yield the correct numbers.

I'm totally new to cacti and this forum btw. Been using munin to create some graphs for my filers but now I'm trying cacti because I think it would work and look much nicer. :)
jlindberg
Posts: 3
Joined: Mon Mar 30, 2009 4:58 pm

Post by jlindberg »

markdv wrote:I think gheppner's method is a lot easier. Though my math is rusty, I think his method is also mathematically correct. To try to verify, I spent a couple of minutes trying it in oocalc, replicating the math suggested by NetApp and what you get using an RRD and gheppner's suggestion, and it absolutely seems to yield the correct numbers.
Yeah, you're right. After I thought about it some more, since Cacti is treating this as a counter it basically does the subtraction between intervals for the calculation so doing the CDEF method is much simpler than what I was contemplating.

I abandoned my idea and did what gheppner suggested and the numbers look good (although, as I indicated, volume latency is indeed in microseconds).
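
Why the two approaches agree: a COUNTER data source stores (current - previous) / step, so dividing one stored rate by the other cancels the step and leaves the ratio of the deltas. A small Python check with made-up counter values (wrap handling is ignored, and counter_rate is a hypothetical helper):

```python
import math

STEP = 300  # seconds; matches "step = 300" in the RRD dump earlier in the thread

def counter_rate(previous, current, step=STEP):
    """Per-second rate as a COUNTER data source stores it (no wrap handling)."""
    return (current - previous) / step

lat_t1, lat_t2 = 1_000_000, 1_900_000   # made-up accumulated latency counter
ops_t1, ops_t2 = 5_000, 5_300           # made-up total_ops counter

# CDEF method: ratio of the two stored rates.
via_cdef = counter_rate(lat_t1, lat_t2) / counter_rate(ops_t1, ops_t2)
# Two-sample method: ratio of the raw deltas.
via_two_samples = (lat_t2 - lat_t1) / (ops_t2 - ops_t1)

print(via_cdef, via_two_samples, math.isclose(via_cdef, via_two_samples))
```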
gheppner
Posts: 20
Joined: Thu Dec 04, 2008 5:10 pm

Post by gheppner »

Curious how you determined volume latency was in microseconds. If I pass "volume counter-list" to the Perl script, it returns the units as milliseconds also:

netapp-ontapsdk-perf.pl myfilerhead "username-ommited" 'password-ommited' volume counter-list

Counter Name = avg_latency Base Counter = total_ops Privilege_level = basic Unit = millisec
Counter Name = total_ops Base Counter = none Privilege_level = basic Unit = per_sec
Counter Name = read_data Base Counter = none Privilege_level = basic Unit = b_per_sec
Counter Name = read_latency Base Counter = read_ops Privilege_level = basic Unit = millisec
Counter Name = read_ops Base Counter = none Privilege_level = basic Unit = per_sec
Counter Name = write_data Base Counter = none Privilege_level = basic Unit = b_per_sec
Counter Name = write_latency Base Counter = write_ops Privilege_level = basic Unit = millisec
jlindberg
Posts: 3
Joined: Mon Mar 30, 2009 4:58 pm

Post by jlindberg »

gheppner wrote:Curious how you determined volume latency was in microseconds. If I pass "volume counter-list" to the Perl script, it returns the units as milliseconds also:
Hi again...

The "Unified Storage Performance Management Using Open Interfaces" design guide (3/7/2008 page 117) which I was originally using to work on the graph says that avg_latency, read_latency and write_latency units are in "USECS".

Further, comparing the numbers I was seeing from the poll against "stats show ... volume" (also in microseconds) confirmed the documentation.

Over the past several weeks I've been graphing volume latency data, the graph tracks with "stats show ... volume" data.

Put another way, if it really IS milliseconds, our response time is sucking badly at 4,000 ms rather than 4,000 µs! :-)

I just thought to try the netapp-ontapsdk-perf.pl query that you did, and here are my results... not sure why mine are different from yours.

Counter Name = avg_latency   Base Counter = total_ops Privilege_level = basic Unit = microsec
Counter Name = total_ops     Base Counter = none      Privilege_level = basic Unit = per_sec
Counter Name = read_data     Base Counter = none      Privilege_level = basic Unit = b_per_sec
Counter Name = read_latency  Base Counter = read_ops  Privilege_level = basic Unit = microsec
Counter Name = read_ops      Base Counter = none      Privilege_level = basic Unit = per_sec
Counter Name = write_data    Base Counter = none      Privilege_level = basic Unit = b_per_sec
Counter Name = write_latency Base Counter = write_ops Privilege_level = basic Unit = microsec
Counter Name = write_ops     Base Counter = none      Privilege_level = basic Unit = per_sec
Counter Name = other_latency Base Counter = other_ops Privilege_level = basic Unit = microsec
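
Since the reported unit apparently differs between filers, one option is to read it from the counter-list output instead of hard-coding it. A minimal Python sketch, assuming the output format shown above; parse_counter_list and TO_MILLISEC_DIVISOR are made-up names, and only the two unit strings seen in this thread are handled:

```python
import re

# One "label = value" pair per match; labels are the four seen in the output above.
PAIR = re.compile(r'(Counter Name|Base Counter|Privilege_level|Unit)\s*=\s*(\S+)')

# Divisor that converts a reported latency unit to milliseconds.
TO_MILLISEC_DIVISOR = {"millisec": 1, "microsec": 1000}

def parse_counter_list(text):
    """Map counter name -> {'base': ..., 'privilege': ..., 'unit': ...}."""
    counters = {}
    for line in text.splitlines():
        fields = dict(PAIR.findall(line))
        if "Counter Name" in fields:
            counters[fields["Counter Name"]] = {
                "base": fields.get("Base Counter"),
                "privilege": fields.get("Privilege_level"),
                "unit": fields.get("Unit"),
            }
    return counters

# Two lines copied from the output above.
sample = (
    "Counter Name = avg_latency   Base Counter = total_ops Privilege_level = basic Unit = microsec\n"
    "Counter Name = total_ops     Base Counter = none      Privilege_level = basic Unit = per_sec\n"
)
counters = parse_counter_list(sample)
unit = counters["avg_latency"]["unit"]
# 4000 in the reported unit, expressed in milliseconds.
print(unit, 4000 / TO_MILLISEC_DIVISOR[unit])
```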
befrenchy
Posts: 4
Joined: Thu Dec 11, 2008 11:00 am

Data query returns 0 Rows

Post by befrenchy »

I'm trying to figure out what I'm doing wrong. When I run the script manually, everything works great, but when I try to create new graphs for my filer, it shows "This data query returned 0 rows", and when I run it in debug mode I get the following:

+ Running data query [16].
+ Found type = '4 '[script query].
+ Found data query XML file at '/usr/share/cacti/site/resource/script_queries/query-netapp-ontapsdk-system.xml'
+ XML file parsed ok.
+ Executing script for list of indexes 'perl /usr/share/cacti/site/scripts/netapp-ontapsdk-perf.pl fasprs02 "USERNAME" "PASSWORD" system index'
+ Executing script query 'perl /usr/share/cacti/site/scripts/netapp-ontapsdk-perf.pl fasprs02 "USERNAME" "PASSWORD" system query index'
+ Found data query XML file at '/usr/share/cacti/site/resource/script_queries/query-netapp-ontapsdk-system.xml'
+ Found data query XML file at '/usr/share/cacti/site/resource/script_queries/query-netapp-ontapsdk-system.xml'
+ Found data query XML file at '/usr/share/cacti/site/resource/script_queries/query-netapp-ontapsdk-system.xml'


Any thoughts on what I might be doing wrong? This is a brand new setup as well. Let me know if you need more information.
eschoeller
Cacti User
Posts: 234
Joined: Mon Dec 13, 2004 3:03 pm

Post by eschoeller »

gheppner wrote: ... Ok, after some additional investigation I've concluded the following:

1) the units returned by the API are in milliseconds, not microseconds.
2) the value returned by a call to avg_latency is not representative of the average latency per operation, but the avg latency of the total ops in a given polling period.

I added total_ops as a data source to the lun latency graph template, and then used a CDEF to divide the latency by the total ops. I now get values in the 3 - 8 ms range that are consistent with what the filer shows with lun stats -o -i 5 <lun name>.

I'm curious if anyone else using these templates has noticed what I've noticed, or if I'm way out in left field here.
@gheppner:

Wow, I've been running these templates for months and had no idea the volume latencies were off by so much. Thanks for tracking this issue down. I only partially understand what you've done here, mostly because I haven't looked at this template in a long time... Is there a chance you can roll a new version of this template, or at least post some updated XML files to reflect the changes you've made? I'm also wondering how this will work against the old templates and RRDs I already have running.

Thanks!