Problem with cluster disk partitions

victorantunes · Post by **victorantunes** » Fri May 10, 2013 8:01 pm

Hello,

We have a SQL Server Cluster that consists of 3 physical nodes, and then several virtual instances distributed across this environment.

The problem is that, for some reason I haven't figured out yet, all the graphs created under the "Used Space" Graph Template currently show the behavior displayed in the screenshots, whether the host I'm graphing is a virtual instance or a physical node.

I'm using SNMP v2.

Does anyone have an idea what might be causing this?

phalek · Post by **phalek** » Sat May 11, 2013 6:12 am

Out of curiosity, what is the size of the disks?

Maybe one of these may help you:

http://docs.cacti.net/usertemplate:data ... disk_usage

or this one:

http://docs.cacti.net/usertemplate:data ... disk_usage

BSOD2600 · Post by **BSOD2600** » Mon May 13, 2013 11:59 am

Graphs look valid to me (neither the used nor the total counters are missing). So the question is: what is your SQL db doing with those partitions to cause such large data usage swings? Backups?

victorantunes · Post by **victorantunes** » Mon May 20, 2013 8:14 pm

Sorry for the long delay.

I've applied the templates from phalek's 1st link. Thanks for that.

So far, the graphs are behaving well, both on old template and new, so I've spent some time investigating and I seem to have discovered something interesting.

This issue only seems to occur when instances fail over from one node to another. In the vast majority of cases, that happens in a planned manner, for example during Windows Updates, when the physical nodes are restarted one at a time and the instances are moved between nodes in the meantime. Other examples include other forms of planned or unplanned downtime.

In all those cases of failover, this issue seems to appear. However, when there's no failover, the graphs appear to be fine. I've yet to determine if this issue is related to something like a node-instance preference. For example: instance A's graphs are only displayed correctly when it's being hosted on node B, and so on.

And BSOD, I've monitored the usage rates and the graphs actually are wrong. The second graph, for example, is for a 440GB disk, and when it (wrongly) displays a smaller total, the usage also decreases. It's a standard production database; there are no data usage swings like that. The graphs are wrong.

We also have a few active/active application clusters and that problem doesn't happen there, so I'm guessing the instances find it strange when they're shipped to another node and can't figure out how to map their disk volumes, and thus my problem.

Sorry for the long post. Ideas, anyone?

phalek · Post by **phalek** » Tue May 21, 2013 12:17 am

That sounds like an old issue with how SNMP may represent the disks.

Basically the disks have indexes, e.g.

Disk 1 = index.0
Disk 2 = index.1
Disk 3 = index.2

But updates or restarts "may" change this order to e.g.

Disk 1 = index.0
Disk 3 = index.1
Disk 2 = index.2

Cacti only matches the index number and nothing else, as it's unaware of the changes that happened on the system.

This should probably only happen with virtual disks, e.g. iSCSI, as physically attached ones usually keep their order.

Now to fix this, you will have to figure out something else to use as an index. I did this sometimes by creating a script and creating my own index.
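To make the index problem concrete, here is a minimal sketch of the idea of keying disks by something other than the SNMP index. It is not phalek's script and not Cacti code. The two snapshots are hypothetical: only the disk ordering and the 440GB figure come from this thread, and the other sizes, the `by_label` and `moved_labels` helpers, and the use of the disk label as the stable key are assumptions made for illustration.

```python
# Hypothetical snapshots of an SNMP storage table: index -> (label, total_gb, used_gb).
# Only the ordering and the 440 GB size come from the thread; the rest is made up.
before_failover = {
    0: ("Disk 1", 100, 40),
    1: ("Disk 2", 440, 300),
    2: ("Disk 3", 200, 50),
}
after_failover = {
    0: ("Disk 1", 100, 40),
    1: ("Disk 3", 200, 50),
    2: ("Disk 2", 440, 300),
}


def by_label(table):
    """Re-key a table by the disk label instead of the SNMP index."""
    keyed = {}
    for index, (label, total_gb, used_gb) in table.items():
        if label in keyed:
            # A label that repeats cannot serve as a unique index.
            raise ValueError(f"duplicate label {label!r}")
        keyed[label] = {"index": index, "total_gb": total_gb, "used_gb": used_gb}
    return keyed


def moved_labels(old, new):
    """Return the labels whose SNMP index differs between two snapshots."""
    old_keyed, new_keyed = by_label(old), by_label(new)
    return sorted(
        label
        for label in old_keyed.keys() & new_keyed.keys()
        if old_keyed[label]["index"] != new_keyed[label]["index"]
    )


# Reading by the fixed index 1 silently switches to another disk after the reorder,
index_view = (before_failover[1][1], after_failover[1][1])
# while reading by label keeps following the same disk.
label_view = (
    by_label(before_failover)["Disk 2"]["total_gb"],
    by_label(after_failover)["Disk 2"]["total_gb"],
)

print("total by index 1:", index_view)
print("total by label 'Disk 2':", label_view)
print("moved:", moved_labels(before_failover, after_failover))
```

The label only works as a key if it is unique per host and survives the failover unchanged, which would need to be checked against the real SNMP output.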

BSOD2600 · Post by **BSOD2600** » Tue May 21, 2013 1:25 pm

phalek wrote: Cacti only matches the index number, nothing else as it's unaware of the changes that happened on the system.

Then the re-indexing method should be changed from Uptime to either of the two other options so the new indexes are picked up.
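As a rough illustration of why the choice of re-index trigger matters, the sketch below compares three kinds of checks against a reordered disk table. It is a simplified model, not Cacti's implementation: the function names, the sample data, and the assumption that the reported uptime keeps increasing across the reorder are all hypothetical, and the last point may not hold on every cluster.

```python
def uptime_went_backwards(old, new):
    """Uptime-style check: fires only if the reported uptime decreased."""
    return new["uptime"] < old["uptime"]


def index_count_changed(old, new):
    """Count-style check: fires only if the number of indexes changed."""
    return len(new["disks"]) != len(old["disks"])


def any_field_changed(old, new):
    """Field-style check: fires if any index now maps to a different value."""
    return new["disks"] != old["disks"]


# Hypothetical polls before and after a reorder. Assumption: the uptime
# reported to the poller did not reset when the disks changed order.
old = {"uptime": 500_000, "disks": {0: "Disk 1", 1: "Disk 2", 2: "Disk 3"}}
new = {"uptime": 500_300, "disks": {0: "Disk 1", 1: "Disk 3", 2: "Disk 2"}}

triggers = {
    "uptime_went_backwards": uptime_went_backwards(old, new),
    "index_count_changed": index_count_changed(old, new),
    "any_field_changed": any_field_changed(old, new),
}

for name, fired in triggers.items():
    print(f"{name}: {'re-index' if fired else 'no re-index'}")
```

Under these assumptions only the field comparison notices the swap, because the uptime and the number of disks both look unchanged.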

victorantunes · Post by **victorantunes** » Tue May 21, 2013 1:41 pm

@phalek
Uhm, that makes sense.

I've never tackled scripting related to disk indexes. Do you still have some of those scripts you made? Would you be willing to share them, part of their logic or the resources I must go after, or at least point me in the right direction?

@BSOD
I've wondered about that, but I wasn't sure yet, so I haven't thoroughly tested that option. I'm guessing "Verify All Fields" would be most suitable in this case, right?

Aside from removing, re-adding and reloading the query, is it necessary to perform any other action? i.e. deleting .rrd files, etc?

Thanks for the input

BSOD2600 · Post by **BSOD2600** » Tue May 21, 2013 1:44 pm

victorantunes wrote: I'm guessing "Verify All Fields" would be most suitable in this case, right?

Aside from removing, re-adding and reloading the query, is it necessary to perform any other action? i.e. deleting .rrd files, etc?

Yea, sounds like [Verify All Fields] is the best option for this device. Remove the query and re-add it with the different re-index method; no other action should be required except possibly a poller cache clear.

victorantunes · Post by **victorantunes** » Tue May 21, 2013 2:15 pm

I've changed the re-index method. I'm guessing we'll perform a failover within the next couple days and see how it works.

Will post updates on this as they occur.

Thanks a ton, guys.
