Host Templates for HP LeftHand Storage System (P4500, DL320)
-
- Cacti User
- Posts: 234
- Joined: Mon Dec 13, 2004 3:03 pm
I have completed an initial version for two host templates. One for HP LeftHand Systems and one for HP LeftHand Clusters. Please see the following URL for the template homepage and more information:
http://docs.cacti.net/usertemplate:host:hp:lefthand
This is my first attempt at using the new template repository. Please keep all discussions and troubleshooting questions related to this template here, in this thread.
Thanks!
Just deployed this for my P4300 solution and it works just as advertised. Very nice to finally have a good view of the system load on the SAN!
However, it seems I do have a bit of a problem with the CPU and FAN graphs; the scale is not in any temperature format I know of and they seem suspiciously stable...
See below.
Otto
- Attachments
- Capture1.JPG (52.81 KiB)
- Capture2.JPG (50.97 KiB)
-
- Cacti User
- Posts: 234
- Joined: Mon Dec 13, 2004 3:03 pm
YUP! Tell me about it!!
Not sure if you read through the 'Information' section on the docs page, but I mentioned that many of the graphs will be completely broken because the data HP is presenting via SNMP is completely bogus in some cases.
Originally, they were using the old Compaq CPQ HEALTH stuff which worked about 50% of the time. There was some agent running on those boxes that would die eventually and the temperature data would drop off. Since these are appliances, there was no way to restart that agent - to fix the temperature readings the whole darn appliance had to be restarted.
Now, with the newer P4300 and P4500 units, they've stopped using the CPQ objects altogether, and they've added in some stats directly to the LeftHand object tree. It's bizarre, on my older DL320s units, both the CPQ Health and the new LeftHand object report CPU temperatures correctly ... but the new P4500 units have no joy for CPU Temp and FAN speed.
What's interesting is that none of this changes with SAN iQ updates. The DL320s units continue to use CPQ, and the new P4300/4500 units don't, so there must be some logic in their startup scripts that tells the appliance what software to run based on the underlying hardware. I can only imagine how much of a mess that is.
I gave up waiting for HP to answer my support case. When they asked if I had the most up-to-date MIBs, I realized I was probably going to wait a long time for a real answer. I needed to move on to other things, so I did my best to wrap the template set up and get it out the door, despite the fact that a good chunk of it was broken. I should probably list all the non-working graphs on the docs page so folks don't fret over them.
-
- Cacti User
- Posts: 234
- Joined: Mon Dec 13, 2004 3:03 pm
Yes, I did read that section, but I didn't make the connection with the fan and CPU graphs. Sorry about that.
For my implementation, the Cluster graphs seem to match what I see in the CMC, except for the Cache Hits which seems to be a different value (% in CMC, rather than Operations).
I can't tell about the Node graphs, though; the RAID graphs seem pretty out of whack, as you say, but they are plotting _something_, so I think they might be useful in a historical sense to keep track of increased load over time.
In any case, great job so far!
This is a significant improvement over the CMC, even with the current limitations.
-
- Cacti User
- Posts: 234
- Joined: Mon Dec 13, 2004 3:03 pm
Yes, the cache hits graph is in # of operations, whereas the CMC reports it as a % ... namely (CACHE_HITS/(READ_IOPS + WRITE_IOPS))*100.
I thought about re-creating this but it gets tricky because there is no TOTAL_IOPS metric, so I'd have to write a CDEF to accomplish this, and a data template would end up being re-used, which would create another data source with the same information as the RAID plot. In short, it would add some overhead to the templates. I may consider doing this.
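For reference, the CMC-style percentage itself is easy to compute outside Cacti; here is a minimal sketch with made-up placeholder counter values (in practice they would come from snmpget against the relevant CACHE_HITS and IOPS objects):

```shell
#!/bin/sh
# Reproduce the CMC-style cache hit percentage:
#   (CACHE_HITS / (READ_IOPS + WRITE_IOPS)) * 100
# The counter values below are placeholders, not real device output.
CACHE_HITS=750
READ_IOPS=600
WRITE_IOPS=400

# awk provides the floating-point math that plain sh lacks
PCT=$(awk -v h="$CACHE_HITS" -v r="$READ_IOPS" -v w="$WRITE_IOPS" \
    'BEGIN { printf "%.1f", (h / (r + w)) * 100 }')
echo "Cache hit rate: ${PCT}%"
```

Inside Cacti the same arithmetic would live in a CDEF on the graph, which avoids polling a second copy of the IOPS data.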
We have a whole team of people working on our LeftHand infrastructure right now because we've been experiencing some significant problems. I walked in today and a bunch of folks were buzzing about new patches that were released several days ago, but HP hadn't told us about them yet.
http://bizsupport1.austin.hp.com/bizsup ... ntver=true
Basically, every problem I outlined in several support cases has been directly addressed by these recent patches (8/24 & 8/25). That makes me very happy. We haven't applied these patches yet, and it's not up to me to decide when to do that; we are in a change-freeze period.
I'm not sure if you're able to easily apply patches or not, but give it a shot when you can and let me know if things start working correctly. One of the patches re-enables the HP Insight Manager CPQ objects. If that is truly the case, I'll most certainly be releasing a new version of the template that adds a LOT more statistics ... all the way down to the disk level.
Stay tuned ...
-
- Posts: 2
- Joined: Thu Nov 18, 2010 7:38 am
Re: Host Templates for HP LeftHand Storage System (P4500, DL
@eschoeller
Great job! I just registered to say THANKS for this work!
I would be more than happy and interested if you implemented a per-initiator (IOPS) option to reflect all the view capabilities of the CMC. And let us know how it goes with the support call.
Peter
-
- Cacti User
- Posts: 234
- Joined: Mon Dec 13, 2004 3:03 pm
Re: Host Templates for HP LeftHand Storage System (P4500, DL
Thanks Peter!
I'll certainly work on per-initiator stats once we get our LH nodes patched and I get the next release posted. The additional metrics will mostly leverage the objects in the CPQ OID tree which look specifically at the RAID controller in each of these devices. It provides *very* detailed information, all the way down to every SMART attribute for every disk ... which is very useful for predictive fault analysis. I've had this working on our DL320s units for some time now and it works great, but I didn't release it because the CPQ objects weren't available on P4500 units. The patches are supposed to address that ...
As always, stay tuned!
-
- Posts: 5
- Joined: Fri Oct 15, 2010 3:53 am
Re: Host Templates for HP LeftHand Storage System (P4500, DL
Great template! One thing I have noticed: since upgrading the firmware on the nodes to v9.0.00.3561.0, it fails to report the CPU temperature and fan speeds.
-
- Cacti User
- Posts: 234
- Joined: Mon Dec 13, 2004 3:03 pm
Re: Host Templates for HP LeftHand Storage System (P4500, DL
Bummer, sorry that happened.
What are the contents of the .1.3.6.1.4.1.9804.3.1.1.2.1.15 object? An example query below ...
snmpwalk -m ALL -t 90 -v2c -c 'COMMUNITY' IPADDRESS .1.3.6.1.4.1.9804.3.1.1.2.1.15
Re: Host Templates for HP LeftHand Storage System (P4500, DL
That object is now marked obsolete in the MIB.
It refers to the infoFanTable object (.1.3.6.1.4.1.9804.3.1.1.2.1.111), but the fan-speed attribute of that table always returns 0.
Fan status seems to work correctly, though (.1.3.6.1.4.1.9804.3.1.1.2.1.111.1.91) - so you can at least trigger alerts when a fan fails.
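Since the status column still works, a simple poll-and-filter script can cover the alerting use case. A sketch follows; the healthy-status string "normal" is an assumption here, so check what your SAN iQ version actually returns before relying on it:

```shell
#!/bin/sh
# Filter fan-status lines and flag anything not reporting "normal".
# The "normal" healthy-status string is an assumption; verify against
# the values your firmware actually returns.
check_fans() {
    while read -r line; do
        case "$line" in
            *normal*) : ;;                  # healthy fan, ignore
            *) echo "FAN ALERT: $line" ;;   # anything else raises an alert
        esac
    done
}

# In practice the input would come from the status column mentioned above
# (COMMUNITY and IPADDRESS are placeholders):
#   snmpwalk -v2c -c 'COMMUNITY' IPADDRESS .1.3.6.1.4.1.9804.3.1.1.2.1.111.1.91 | check_fans
```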
We (LogicMonitor.com) had to modify our LeftHand SAN monitoring templates for this and a few other things that changed with the new release.
Re: Host Templates for HP LeftHand Storage System (P4500, DL
Hi guys,
I am new to Cacti and have a question about this template. I was able to successfully import the cacti_host_template_hp_lefthand_system.xml file. I also was able to create devices using this template.
How do I import (or upload) the lefthand-*.xml (clusters, raid, temp, volumes) into Cacti? What are these used for?
Thanks,
-John
-
- Cacti User
- Posts: 234
- Joined: Mon Dec 13, 2004 3:03 pm
Re: Host Templates for HP LeftHand Storage System (P4500, DL
The data queries rely on those files in order to function. They need to be placed in the resource/snmp_queries subdirectory of your Cacti installation; for example, /usr/local/cacti/resource/snmp_queries, depending on where you installed Cacti. You'd typically use a file transfer mechanism like scp to get the files onto your Cacti server.
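Something along these lines would stage the files once they are on the server; the Cacti path varies per installation, so it is passed in as an argument here:

```shell
#!/bin/sh
# Stage the lefthand-*.xml data-query files into Cacti's
# resource/snmp_queries directory. Run from the directory holding
# the XML files; pass your Cacti install path as the argument.
stage_queries() {
    dest="$1/resource/snmp_queries"
    mkdir -p "$dest"
    cp lefthand-*.xml "$dest"/
}

# Example (adjust the path to your Cacti install):
#   stage_queries /usr/local/cacti
```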
-
- Cacti User
- Posts: 234
- Joined: Mon Dec 13, 2004 3:03 pm
Re: Host Templates for HP LeftHand Storage System (P4500, DL
@sfrancis
So, what's your deal - you just troll the open source Cacti templates and then roll those into some commercial product?
Re: Host Templates for HP LeftHand Storage System (P4500, DL
Hi!
First off: thanks for making these templates! They've already helped us a lot to troubleshoot some outstanding issues!
I'm having one problem though: I think my volume throughput data is incorrect.
The attached graph shows a write throughput of about 118 K (I assume per second). (I disabled the stacking on the write throughput, because that didn't really make sense to me.)
Now, I believe the volume is reading and writing a lot more than 100 KB/sec.
If I use snmpget to retrieve the counters at 10 second intervals, calculate the difference and then divide by 10 seconds, I get something like this:
iostat on the host itself also shows a high amount of MB being read/written:
So I think there is something going wrong with the data query?
The values in the rrd are consistent with those in the graph. :/
Any ideas?
Code:
Old: 25453781 MB , current: 25454170 MB, Diff: 389 MB
389 MB (38 MB/s)
Old: 25454170 MB , current: 25454177 MB, Diff: 7 MB
7 MB (0 MB/s)
Old: 25454177 MB , current: 25454269 MB, Diff: 92 MB
92 MB (9 MB/s)
Old: 25454269 MB , current: 25454500 MB, Diff: 231 MB
231 MB (23 MB/s)
Old: 25454500 MB , current: 25454861 MB, Diff: 361 MB
361 MB (36 MB/s)
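That by-hand calculation can be sketched as a small helper; the actual counter OID isn't shown in the post, so sampling is left abstract:

```shell
#!/bin/sh
# Turn two MB counter samples into an MB/s rate, as calculated by hand
# above. In practice old/new would be two successive snmpget readings
# of the volume's MB counter, taken $3 seconds apart.
rate_mb_per_s() {
    # $1 = old counter (MB), $2 = new counter (MB), $3 = interval (seconds)
    awk -v old="$1" -v new="$2" -v t="$3" \
        'BEGIN { printf "%d", (new - old) / t }'
}

# First sample from the output above: 389 MB over 10 s
rate_mb_per_s 25453781 25454170 10    # prints 38
```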
Code:
[root@logging-prod ~]# iostat -m vdb 10
Linux 2.6.18-194.8.1.el5 (logging-prod.server.eu) 04/13/2011
avg-cpu: %user %nice %system %iowait %steal %idle
16.51 0.02 20.72 4.40 0.00 58.36
Device: tps MB_read/s MB_wrtn/s MB_read MB_wrtn
vdb 139.41 4.40 6.63 16911716 25473977
avg-cpu: %user %nice %system %iowait %steal %idle
10.82 0.05 17.32 41.99 0.00 29.82
Device: tps MB_read/s MB_wrtn/s MB_read MB_wrtn
vdb 653.10 11.13 52.96 111 529
avg-cpu: %user %nice %system %iowait %steal %idle
4.58 0.03 8.33 6.00 0.00 81.07
Device: tps MB_read/s MB_wrtn/s MB_read MB_wrtn
vdb 147.60 3.62 3.80 36 37
avg-cpu: %user %nice %system %iowait %steal %idle
4.82 0.02 6.12 17.80 0.00 71.23
Device: tps MB_read/s MB_wrtn/s MB_read MB_wrtn
vdb 470.43 23.91 8.08 239 80
avg-cpu: %user %nice %system %iowait %steal %idle
7.73 0.00 10.85 49.69 0.00 31.73
Device: tps MB_read/s MB_wrtn/s MB_read MB_wrtn
vdb 670.20 36.50 9.18 364 91