NetApp Filer: graphing Performance Stats and IO's (template)
No Latency Data - LUN or Volume
Hi,
I'm trying to get latency data out of volume or LUN, but there is no data. Other read/write data is OK.
cacti-spine-0.8.7a-1.el5.rf
cacti-0.8.7c-1.el5.rf
rrdtool-1.2.29-1.el5.rf
NetApp Release 7.2.5.1
Does anyone have any idea?
Thanks,
Dang
filename = "storage1_read_ops_14272.rrd"
rrd_version = "0003"
step = 300
last_update = 1234558204
ds[read_ops].type = "COUNTER"
ds[read_ops].minimal_heartbeat = 600
ds[read_ops].min = 0.0000000000e+00
ds[read_ops].max = NaN
ds[read_ops].last_ds = "210982540"
ds[read_ops].value = 8.3401993355e+01
ds[read_ops].unknown_sec = 0
ds[write_ops].type = "COUNTER"
ds[write_ops].minimal_heartbeat = 600
ds[write_ops].min = 0.0000000000e+00
ds[write_ops].max = NaN
ds[write_ops].last_ds = "1502227036"
ds[write_ops].value = 6.9506976744e+02
ds[write_ops].unknown_sec = 0
ds[total_ops].type = "COUNTER"
ds[total_ops].minimal_heartbeat = 600
ds[total_ops].min = 0.0000000000e+00
ds[total_ops].max = NaN
ds[total_ops].last_ds = "1797812046"
ds[total_ops].value = 8.7536212625e+02
ds[total_ops].unknown_sec = 0
ds[avg_latency].type = "COUNTER"
ds[avg_latency].minimal_heartbeat = 600
ds[avg_latency].min = 0.0000000000e+00
ds[avg_latency].max = NaN
ds[avg_latency].last_ds = "U"
ds[avg_latency].value = NaN
ds[avg_latency].unknown_sec = 4
ds[read_latency].type = "COUNTER"
ds[read_latency].minimal_heartbeat = 600
ds[read_latency].min = 0.0000000000e+00
ds[read_latency].max = NaN
ds[read_latency].last_ds = "U"
ds[read_latency].value = NaN
ds[read_latency].unknown_sec = 4
ds[write_latency].type = "COUNTER"
ds[write_latency].minimal_heartbeat = 600
ds[write_latency].min = 0.0000000000e+00
ds[write_latency].max = NaN
ds[write_latency].last_ds = "U"
ds[write_latency].value = NaN
ds[write_latency].unknown_sec = 4
ds[other_latency].type = "COUNTER"
ds[other_latency].minimal_heartbeat = 600
ds[other_latency].min = 0.0000000000e+00
ds[other_latency].max = NaN
ds[other_latency].last_ds = "U"
ds[other_latency].value = NaN
ds[other_latency].unknown_sec = 4
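A dump like the one above can be scanned for data sources that have never received a value. This is only a minimal Python sketch, assuming the text comes from `rrdtool info <file>` in the layout shown in this post; the function name `unknown_data_sources` is made up for illustration and is not part of Cacti or the NetApp templates.

```python
import re

# Matches lines such as: ds[read_latency].last_ds = "U"
DS_LINE = re.compile(r'^ds\[(?P<name>[^\]]+)\]\.(?P<field>\w+)\s*=\s*(?P<value>.*)$')


def unknown_data_sources(info_text):
    """Return the names of data sources whose last_ds is "U" (unknown)."""
    unknown = []
    for line in info_text.splitlines():
        match = DS_LINE.match(line.strip())
        if not match:
            continue
        value = match.group("value").strip().strip('"')
        if match.group("field") == "last_ds" and value == "U":
            unknown.append(match.group("name"))
    return unknown


# Three lines copied from the dump above.
sample = '''ds[read_ops].last_ds = "210982540"
ds[avg_latency].last_ds = "U"
ds[read_latency].last_ds = "U"
'''
print(unknown_data_sources(sample))
```

In the dump above this would flag all four latency data sources, which matches the symptom: the ops counters are being updated, but the latency values never reach the RRD.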
The latency is pulled via the API, not SNMP, so check that you configured the script with a user that has api-* and http-login permissions on the NetApp.
Or check out LogicMonitor, if you don't want to spend the time rolling your own Cacti graphing and alerting for automated NetApp monitoring (and load balancers, databases, etc.).
no graphs showing up
I can run the verbose query, but there are still no graphs.
If I run the command by itself, it runs with no problems. Just wondering if I am missing something.
Nothing shows up in the logs as an error, and I am not getting HTTP errors.
Re: SNMP versions
wolf31o2 wrote: NetApp Scripts/Templates on Git
adamshand wrote: This looks great, thanks for posting it. Any chance of a quick readme on what all the bits are for?
Cheers,
Adam.
wolf31o2 wrote: It's quite simple. Copy the things under scripts to <path_cacti>/scripts, and copy the things under script_server and snmp_queries to their directories under <path_cacti>/resource. After that, you import the templates, which I need to update with my latest changes. In fact, I need to upload some newer scripts and such, too.
I'm planning on supporting everything that I can via several methods.
- SNMPv1 for ONTAP versions prior to 7.3
- SNMPv2/v3 using 64-bit counters for 7.3 and above
- ONTAP Manage API for people who prefer it
- SMI-S Agent scripts for SMI-S software
Of course, I'm open to any help anyone wants to give, and everything I've written is released under the GPLv2. I am adding an installer script to it, and I could use some help with documentation, too. I'd like for the installer to detect the available methods and do some initial setup based on that, so it should work out of the box for everybody, and all they should need to know is the IP addresses of their Filers and the location of their Cacti installation.
Let us know what we can do to help on the project.
Roger L
Twitter:rogerlund
Blog:http://rogerlunditblog.blogspot.com
I'm finding that all the LUN stats give me accurate data (as verified on the filer itself), with the exception of average latency. These numbers do not look accurate at all.
For instance, doing a "lun stats -o" for a given LUN shows me average latencies around 7 or 8 ms. But Cacti is showing me data in the 100 - 200 (usec? ms?) area.
I'm also wondering if the latency is really being returned in microseconds. If you use netapp-ontapsdk-perf.pl and do a "lun counter-list" you get this for latency:
Counter Name = avg_latency Base Counter = total_ops Privilege_level = basic Unit = millisec
So, I guess I have two questions here: 1) has anyone else verified that the data you get with these templates is accurate, and 2) is it milliseconds or microseconds?
It seems to me that some sort of CDEF might be required to adjust the data, but I can't figure out what.
... Ok, after some additional investigation I've concluded the following:
1) the units returned by the API are in milliseconds, not microseconds.
2) the value returned by a call to avg_latency is not representative of the average latency per operation, but the avg latency of the total ops in a given polling period.
I added total_ops as a data source to the lun latency graph template, and then used a CDEF to divide the latency by the total ops. I now get values in the 3 - 8 ms range that are consistent with what the filer shows with lun stats -o -i 5 <lun name>.
I'm curious if anyone else using these templates has noticed what I've noticed, or if I'm way out in left field here.
Hi gheppner,
I've been attacking the same problem with regard to volume latency numbers. They're just way out of range (like PetaMicroseconds). From reading up on the ONTAPI docs it appears that you are on the right track, but they mention taking 2 samples at time T1 and T2 and then calculating latency as:
(latency_T2 - latency_T1) / (total_ops_T2 - total_ops_T1)
I took the netapp-ontapsdk-perf.pl script and hacked up a version to do 2 samples of volume avg_latency 10 seconds apart using the method above and the number very closely matches the CLI "stats show" output (volume latency is in microseconds).
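The two-sample calculation above can be sketched in a few lines of Python. This is an illustration only, not the hacked-up Perl script mentioned in this post; the function name and the sample numbers are invented, and counter wrap-around is not handled.

```python
def latency_per_op(latency_t1, ops_t1, latency_t2, ops_t2):
    """Average latency per operation between two samples of cumulative counters.

    Implements (latency_T2 - latency_T1) / (total_ops_T2 - total_ops_T1).
    Returns None when no operations completed between the samples,
    because the ratio is undefined in that case.
    """
    delta_ops = ops_t2 - ops_t1
    if delta_ops <= 0:
        return None
    return (latency_t2 - latency_t1) / delta_ops


# Invented counter readings taken a few seconds apart.
print(latency_per_op(1_000_000, 500, 1_040_000, 510))
```

The result is in whatever unit the latency counter uses (milliseconds or microseconds, depending on what counter-list reports for the object).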
I think gheppner's method is a lot easier. Though my math is rusty, I think his method is also mathematically correct. To try to verify, I spent a couple of minutes trying it in oocalc, replicating the math suggested by NetApp and what you get using an RRD and gheppner's suggestion, and it absolutely seems to yield the correct numbers.
I'm totally new to cacti and this forum btw. Been using munin to create some graphs for my filers but now I'm trying cacti because I think it would work and look much nicer.
Yeah, you're right. After I thought about it some more, since Cacti is treating this as a counter it basically does the subtraction between intervals for the calculation so doing the CDEF method is much simpler than what I was contemplating.
I abandoned my idea and did what gheppner suggested and the numbers look good (although, as I indicated, volume latency is indeed in microseconds).
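To make the reasoning concrete: when both values are stored as COUNTER data sources, each becomes a per-second rate over the same step, so dividing one rate by the other cancels the step and leaves the two-sample ratio. The Python sketch below uses invented numbers and a simplified `counter_rate` helper (a hypothetical name; real rrdtool COUNTER handling also deals with wrap-around and partial intervals, which are not modeled here).

```python
import math


def counter_rate(previous, current, step_seconds):
    """Simplified per-second rate, as a COUNTER data source would store it."""
    return (current - previous) / step_seconds


step = 300  # seconds, matching the RRD step shown earlier in the thread

# Invented cumulative readings at two consecutive polls.
latency_prev, latency_now = 1_000_000, 2_200_000
ops_prev, ops_now = 500, 800

# What the CDEF computes: latency rate divided by ops rate.
cdef_value = counter_rate(latency_prev, latency_now, step) / counter_rate(ops_prev, ops_now, step)

# What the two-sample formula computes directly.
two_sample = (latency_now - latency_prev) / (ops_now - ops_prev)

print(cdef_value, two_sample, math.isclose(cdef_value, two_sample))
```

One caveat with the CDEF approach is that an interval with no operations gives a division by zero, so the graph will have gaps or unknown values there.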
Curious how you determined volume latency was in microseconds. If I pass "volume counter-list" to the perl script, it returns the units as milliseconds also:
netapp-ontapsdk-perf.pl myfilerhead "username-ommited" 'password-ommited' volume counter-list
Counter Name = avg_latency Base Counter = total_ops Privilege_level = basic Unit = millisec
Counter Name = total_ops Base Counter = none Privilege_level = basic Unit = per_sec
Counter Name = read_data Base Counter = none Privilege_level = basic Unit = b_per_sec
Counter Name = read_latency Base Counter = read_ops Privilege_level = basic Unit = millisec
Counter Name = read_ops Base Counter = none Privilege_level = basic Unit = per_sec
Counter Name = write_data Base Counter = none Privilege_level = basic Unit = b_per_sec
Counter Name = write_latency Base Counter = write_ops Privilege_level = basic Unit = millisec
Hi again...
The "Unified Storage Performance Management Using Open Interfaces" design guide (3/7/2008 page 117) which I was originally using to work on the graph says that avg_latency, read_latency and write_latency units are in "USECS".
Further, comparing the numbers I was seeing from the poll against "stats show ... volume" (also in microseconds) confirmed the documentation.
Over the past several weeks I've been graphing volume latency data, the graph tracks with "stats show ... volume" data.
Put another way, if it really IS milliseconds, our response time is sucking badly at 4,000 ms rather than 4,000 µs!
I just thought to try the netapp-ontapsdk-perf.pl query that you did, and here are my results. Not sure why mine are different from yours.
Counter Name = avg_latency Base Counter = total_ops Privilege_level = basic Unit = microsec
Counter Name = total_ops Base Counter = none Privilege_level = basic Unit = per_sec
Counter Name = read_data Base Counter = none Privilege_level = basic Unit = b_per_sec
Counter Name = read_latency Base Counter = read_ops Privilege_level = basic Unit = microsec
Counter Name = read_ops Base Counter = none Privilege_level = basic Unit = per_sec
Counter Name = write_data Base Counter = none Privilege_level = basic Unit = b_per_sec
Counter Name = write_latency Base Counter = write_ops Privilege_level = basic Unit = microsec
Counter Name = write_ops Base Counter = none Privilege_level = basic Unit = per_sec
Counter Name = other_latency Base Counter = other_ops Privilege_level = basic Unit = microsec
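Since the two counter-list outputs in this thread disagree on the unit, one option is to read the unit from the output instead of hard-coding it in the graph. Below is a rough Python sketch that assumes the one-counter-per-line layout as pasted here (the script's real output may be laid out differently); `parse_counter_list` and `to_milliseconds` are hypothetical helper names, not part of the templates.

```python
import re

# One counter per line, in the layout pasted in this thread.
COUNTER_LINE = re.compile(
    r"Counter Name = (?P<name>\S+)\s+Base Counter = (?P<base>\S+)\s+"
    r"Privilege_level = (?P<priv>\S+)\s+Unit = (?P<unit>\S+)"
)


def parse_counter_list(text):
    """Map each counter name to its base counter and unit."""
    counters = {}
    for match in COUNTER_LINE.finditer(text):
        counters[match.group("name")] = {
            "base": match.group("base"),
            "unit": match.group("unit"),
        }
    return counters


def to_milliseconds(value, unit):
    """Normalize a latency value to milliseconds based on the reported unit."""
    if unit == "millisec":
        return float(value)
    if unit == "microsec":
        return value / 1000.0
    raise ValueError(f"unexpected latency unit: {unit}")


sample = (
    "Counter Name = avg_latency Base Counter = total_ops Privilege_level = basic Unit = microsec\n"
    "Counter Name = total_ops Base Counter = none Privilege_level = basic Unit = per_sec\n"
)
counters = parse_counter_list(sample)
print(counters["avg_latency"], to_milliseconds(4000, counters["avg_latency"]["unit"]))
```

With the unit taken from the filer's own answer, the same graph definition would scale correctly whether the object reports millisec or microsec.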
Data query returns 0 Rows
I'm trying to figure out what I'm doing wrong. When I run the script manually, everything works great, but when I try to create new graphs for my filer, it shows "This data query returned 0 rows", and when I run it in debug mode I get the following:
+ Running data query [16].
+ Found type = '4 '[script query].
+ Found data query XML file at '/usr/share/cacti/site/resource/script_queries/query-netapp-ontapsdk-system.xml'
+ XML file parsed ok.
+ Executing script for list of indexes 'perl /usr/share/cacti/site/scripts/netapp-ontapsdk-perf.pl fasprs02 "USERNAME" "PASSWORD" system index'
+ Executing script query 'perl /usr/share/cacti/site/scripts/netapp-ontapsdk-perf.pl fasprs02 "USERNAME" "PASSWORD" system query index'
+ Found data query XML file at '/usr/share/cacti/site/resource/script_queries/query-netapp-ontapsdk-system.xml'
+ Found data query XML file at '/usr/share/cacti/site/resource/script_queries/query-netapp-ontapsdk-system.xml'
+ Found data query XML file at '/usr/share/cacti/site/resource/script_queries/query-netapp-ontapsdk-system.xml'
Any thoughts on what I might be doing wrong? This is a brand new setup as well. Let me know if you need more information.
@gheppner:
Wow, I've been running these templates for months and had no idea the volume latencies were off by so much. Thanks for tracking this issue down. I only partially understand what you've done here, mostly because I haven't looked at this template in a long time... Is there a chance you can roll a new version of this template, or at least post some updated XML files to reflect the changes you've made? I'm also wondering how this will work against the old templates and RRDs I already have running.
Thanks!