Distributed Cacti - Ideas
Possible Solution.
After giving it some thought, here is a possible (yet extremely ugly and long) solution using the current version of Cacti, with no modifications to the code or plugins.
Assume we have a master server and a set of poller servers.
On the master, we do all initial creation and device discovery: templates are set up, devices are discovered, graphs are created.
Once this is done, we disable the device on the master server, putting it into the disabled state and turning off polling.
Next, the first distributed poller is set up (poller1).
We clone the master server's database into a new database that poller1 will use, so the new poller is an identical mirror of the master, including its own (separate) copy of the database. Once that is complete, we either remove (easier) or disable every device in the poller that will not be polled by that server.
[At this point in the architecture, a device should only be enabled on the server/database it is to be polled by.]
To tie it all together, we just have the pollers do their rrdtool updates over the network via a network-mapped drive, SAN, etc. So on the pollers, we have mapped the /RRA directory to the master server's repository of all RRAs.
Even though the device is disabled on the master, the graphs should still be viewable, and since both the master server and the poller have the correct associations, mappings, and templates for that device, the device will be properly updated by the poller and viewed on the master server.
Now, assuming this even works, the biggest problem I see is changes going forward after the initial creation: you will need to do all the work on the master server, then re-clone the database for the poller and make all the appropriate changes again (disabling all devices not polled by that poller) for most changes that happen to a device.
So while this may not be workable for installations with many devices, it might work well for places like mine with, say, a small number of devices (40 or so), but each with thousands of interfaces.
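To make the "disable everything this poller doesn't own" step concrete, it could be scripted roughly like the sketch below against poller1's cloned database. The host table and its disabled column are standard Cacti; the my_hosts.txt assignment file and the credentials are just placeholders for illustration.

```php
<?php
// Sketch: in poller1's cloned database, disable every host this poller does
// not own. The "host" table and disabled='on' convention are Cacti's; the
// "my_hosts.txt" file (one hostname per line) is an invented assignment list.
$db = mysqli_connect('localhost', 'cactiuser', 'cactipw', 'cacti_poller1');

$mine = array_filter(array_map('trim', file('my_hosts.txt')));
$list = implode("','", array_map(function ($h) use ($db) {
    return mysqli_real_escape_string($db, $h);
}, $mine));

mysqli_query($db, "UPDATE host SET disabled = 'on' WHERE hostname NOT IN ('$list')");
echo mysqli_affected_rows($db) . " hosts disabled on this poller\n";
```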
Hey Guys,
Just to chuck my 2 cents in, really... I understand the need to make Cacti scale well, I do. However, from my point of view, I am watching this thread for the following reason.
I have my MySQL server doing two-way replication between itself and a backup, and I have my RRDs sitting on it, rsync'd to my backup every 30 minutes (I can't afford SAN space).
My problem has always been random NFS corruption with the files mounted onto my frontend/poller machine.
For me, I have a shared poller/frontend, but I prefer to keep the "database" content on one server that's bigger, has more RAM and more reliable disks, and is backed up more often.
Is there "rrdtool server" support anywhere? Then my database server becomes a true IP-based database engine: I connect to MySQL on a TCP port and to the rrdtool store on a TCP port, and move on.
I know I can get another machine and run 1x MySQL, 1x poller/RRD, 1x frontend, but that's just more hardware and I still get stupid NFS corruption.
I think that with the Boost image cache and buffering it would work fine on small to medium deployments?
Am I barking up the wrong tree and this has been explained before?
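To sketch what I mean by an "rrdtool server" (purely an illustration of the idea, not anything I know to exist): a tiny listener on the storage box that accepts update lines over TCP and hands them to the local rrdtool binary. The port, line protocol, and rra path below are all made up.

```php
<?php
// Toy sketch of an "rrdtool server": accept lines of the form
//   update <file.rrd> <timestamp:v1:v2:...>
// over TCP and apply them with the local rrdtool binary.
// No authentication or error handling - illustration only.
$server = stream_socket_server('tcp://0.0.0.0:42217', $errno, $errstr);

while ($conn = stream_socket_accept($server, -1)) {      // -1 = wait forever
    while (($line = fgets($conn)) !== false) {
        if (preg_match('/^update\s+(\S+)\s+(\S+)$/', trim($line), $m)) {
            // Confine writes to the local RRA directory (path is an example).
            $file = '/var/www/cacti/rra/' . basename($m[1]);
            exec('rrdtool update ' . escapeshellarg($file) . ' ' . escapeshellarg($m[2]));
        }
    }
    fclose($conn);
}
```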
I think that the rsync idea "may" be viable, but I don't believe there is a need for MySQL replication unless you have some other content there that updates more frequently.
I am still a ways off from implementing Boost v2. I want to first test the limits of Boost v1.x; I am thinking it's around 5k-7k hosts and maybe 200k-400k graphs.
Once I get a system that big, I will likely publish my findings, and possibly have to release Boost v1.3 to address scalability concerns.
TheWitness
We have a distributed poller working that is much simpler than what seems to be going around here.
It's not the prettiest thing, but all it does is divide the hosts up between the pollers by modifying poller.php slightly and using a configuration file.
My config file has a list of each of the pollers, like this:
box1
box2
box3
poller.php counts the number of pollers in the file (3 in this example), divides the number of hosts in Cacti by that number, and then polls only its own slice of the hosts (so box3 would poll only the last third of all hosts); a rough sketch of that change follows.
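Very roughly, the modification amounts to something like this. It is only a sketch: the config file name, the hostname matching, and the way poller.php really builds its host list are simplified or assumed.

```php
<?php
// Sketch of the host-slicing logic added to poller.php.
// Assumes "pollers.conf" holds one poller hostname per line (box1, box2, box3)
// and that $host_ids stands in for the ordered host list Cacti would normally poll.
$pollers = array_values(array_filter(array_map('trim', file('pollers.conf'))));
$slot    = array_search(trim(php_uname('n')), $pollers);   // this box's position, 0-based

if ($slot === false) {
    die("This machine is not listed in pollers.conf\n");
}

$host_ids   = range(1, 7000);                               // stand-in for the real list from MySQL
$slice_size = (int) ceil(count($host_ids) / count($pollers));
$my_hosts   = array_slice($host_ids, $slot * $slice_size, $slice_size);

// poller.php then restricts its work (and the cactid host range) to $my_hosts.
```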
It's not a perfect solution by any means.
For instance, in our company most of the devices with tons of data sources (CMTSs are a pain) fall in the first few hosts,
so the first poller tends to work harder, though it doesn't seem to be doing much more work than the other two.
A big problem is that the stats are all messed up, and poller.php tends to run for the whole 296 seconds even though cactid ended long ago.
Of course, even with 7000 devices being polled, the poller has not been much of a problem (we use the distributed poller because we were approaching the 4-minute mark, and execs don't like things being that close).
rrdtool over NFS is definitely the bottleneck.
At some point we're hoping to develop an rrdtool version that uses a database to store the data rather than files.
This would put all the work on the pollers (at which point we would probably add several thousand more devices).
marnues wrote: we have a distributed poller working that is much simpler than what seems to be going around here [...]
Would you mind posting a walkthrough and examples of your work?
marnues,
You should evaluate Boost, as it will solve your 4-minute issue. We should dialog.
TheWitness
We currently use an HA clustered MySQL Cacti solution, using:
- MySQL 5.0 real-time clustering/replication
- GFS on a 14-disk RAID 5 array (HP/Compaq MSA500) with two servers SCSI-attached (DL380 G3)
- a custom poller that divides polling amongst the active nodes
- a load-balanced web front end with Cisco CSS switches (sessions are synchronized via a GFS share)
- all data sources polled at 1-minute intervals, with 1-minute granularity kept for 60 days
We have a custom keepalive script that polls all nodes (via ping and SNMP), checks that they are running mysql, httpd, ndbd, etc., and then marks them active or dead.
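(Roughly, the keepalive check looks something like the sketch below; the poller_nodes table, its columns, and the ports checked are placeholders for illustration, not our actual script.)

```php
<?php
// Sketch of a keepalive pass: ping each node, verify key daemons accept
// connections, then mark it active or dead. The "poller_nodes" table and the
// port numbers are examples only.
$db    = mysqli_connect('localhost', 'cactiuser', 'cactipw', 'cacti');
$nodes = array('node1', 'node2');

foreach ($nodes as $node) {
    $alive = true;

    // 1. Basic reachability.
    exec('ping -c 1 -W 2 ' . escapeshellarg($node), $out, $rc);
    if ($rc !== 0) {
        $alive = false;
    }

    // 2. Required services must answer (MySQL, httpd, cluster mgmt - example ports).
    foreach (array(3306, 80, 1186) as $port) {
        $s = @fsockopen($node, $port, $errno, $errstr, 2);
        if ($s === false) { $alive = false; break; }
        fclose($s);
    }

    mysqli_query($db, sprintf(
        "UPDATE poller_nodes SET status = '%s' WHERE hostname = '%s'",
        $alive ? 'active' : 'dead',
        mysqli_real_escape_string($db, $node)
    ));
}
```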
Currently we are polling 4000 elements (1700 RRD files, 8.5 GB) in less than 20 seconds with both nodes active.
We used Dolphin SCI cards for better MySQL performance, but we have disabled them because of stability issues.
Our only single point of failure is the MySQL arbitrator, but in MySQL 5.1 you should be able to have a redundant arbitrator.
Our next step is to move to a 4-way cluster using DL380 G4 64-bit servers with an HP/Compaq MSA1000 array.
Before we had the MSA array, we were using DRBD, which is a cost-effective, network-based, block-level, real-time disk mirroring solution (but not as fast).
Please let me know if anyone is interested in more detail...
ADesimone
Interesting.
TheWitness
adesimone wrote: We currently use an HA clustered MySQL Cacti solution [...] Please let me know if anyone is interested in more detail...
I'd love to see your work on the custom pollers, including your code changes.
I'd like to do the same thing using several poller machines all hooked into a single SAN environment.
I posted an idea I've been bouncing around in my head for a year or two in the requested-features forum and got a reply referring me to this thread, so I'll post my ideas here too, for what they're worth.
I admit I've only skimmed the documentation for Boost (and it sounds pretty great, especially combined with some of the stuff I've read in this thread, like being able to store RRDs in more than one directory). But one way we might be able to make Cacti distributed in a (I think?) really simple way could be as follows.
Run multiple stand-alone Cacti servers in diverse areas. Then, on a "master" Cacti server, if one could add a "referral" element to the graph tree, sort of like a header element but one that refers to a header or tree on one of the remote servers, you could get one master view of all the graphs and graph trees on the remote servers. You'd want to be able to add elements to the master's tree like "remote header", "remote tree", and "remote graph".
So each Cacti server can poll whatever devices are nearby (or, for scalability reasons, whichever devices you want to make it responsible for). Each Cacti server would have its own storage for RRDs, its own MySQL server, etc., and serving up the graphs would be distributed as well.
Admittedly, this is a very simple-minded approach, but I like simple. It wouldn't require NFS over a WAN/VPN to function, or mirrored MySQL databases, or any of that. Of course, it means that when new devices are added to one of the remote Cacti servers and graphs are created and all that, there might be an extra step to add some "remote" elements to the master's graph trees. (Shrug.) But if the master had a well-organized tree structure, you could delegate the polling, collection, and storage for whole branches of a centralized tree to any number of other servers.
You could even distribute authority over maintaining the Cacti network then. The junior PFYs in remote offices might have privileges to modify stuff on the Cacti server in their branch office, but not on the master server, or perhaps not on other branch offices' Cacti servers.
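Just to make the "remote graph" idea concrete: such a tree item could boil down to the master simply proxying the image from the branch-office install, along the lines of the sketch below. The remote URL is an example, and it assumes the remote server allows guest viewing of graph_image.php; nothing like this exists today.

```php
<?php
// Sketch: a "remote graph" tree item on the master proxies the PNG from a
// branch-office Cacti. Assumes the remote install permits anonymous access
// to graph_image.php; hostname and IDs below are examples only.
$remote_base    = 'http://cacti-branch1.example.com/cacti';
$local_graph_id = 42;   // the graph's ID on the *remote* server
$rra_id         = 0;    // which RRA/consolidation to render

$png = file_get_contents(sprintf(
    '%s/graph_image.php?local_graph_id=%d&rra_id=%d',
    $remote_base, $local_graph_id, $rra_id
));

header('Content-Type: image/png');
echo $png;
```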
I was even thinking perhaps this would be a good excuse for me to edjimicate myself on the plugin API... if I can unwedge a little spare time...
Brent
distributed poller
I am interested in your custom pollers, would you care to post it up?
Thanks,
luckyksc
I would be very interested in seeing bbice's idea implemented. It would be very helpful when you have to manage multiple sites over not-so-stable VPN links.
I think it could probably be done with two plugins:
1) A plugin that exports the tree items and other necessary data (an XML file that updates on the poller's schedule?) to some URL.
2) A plugin that allows arbitrary data to be placed on graph trees (documents, configuration files, weathermaps, and other Cacti installations). I'm pretty sure this is already being worked on, which is great. Then, if you add the type "Cacti Installation" and just have a text-entry box where you can list its IP or hostname, you could add other instances!
Of course, auto-discovery would be sweet too, but that's kind of out of scope and a little more difficult to do.
What do you think?
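For (1), the export could be as simple as dumping the graph tree items to an XML file on each poller run, roughly as sketched below. graph_tree_items is Cacti's own table, but the column list, output path, and credentials are assumptions for illustration.

```php
<?php
// Sketch for idea (1): dump graph tree items to XML so a master install can
// fetch them over HTTP. Column names assumed from Cacti 0.8.x; output path is arbitrary.
$db  = mysqli_connect('localhost', 'cactiuser', 'cactipw', 'cacti');
$res = mysqli_query($db, 'SELECT id, graph_tree_id, local_graph_id, title, order_key FROM graph_tree_items');

$xml = new SimpleXMLElement('<tree_items/>');
while ($row = mysqli_fetch_assoc($res)) {
    $item = $xml->addChild('item');
    foreach ($row as $k => $v) {
        $item->addAttribute($k, (string) $v);
    }
}

// Written somewhere under the web root so the master can pull it over HTTP.
file_put_contents('/var/www/cacti/export/tree_items.xml', $xml->asXML());
```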
Here are the 'cacti cluster' files. Please read the whitepaper and README.first before anything else.
ADesimone
Attachment: cacti_cluster.zip (428.77 KiB)
Hi !
Here's my small patch for Cacti 0.8.7b that implements a master/slaves arch:
http://forums.cacti.net/viewtopic.php?p=127122
Cheers,
Ludo.