New Cacti Architecture (0.8.8) - RFC Response Location
- oxo-oxo
- Cacti User
- Posts: 126
- Joined: Thu Aug 30, 2007 11:35 am
- Location: Silkeborg, Denmark
- Contact:
oxo-oxo thoughts in progress
This will be edited in time so as not to fill up the thread
- will contain my input as one post.
Please remember to keep the date of the PDF up to date, or give it a version number, as I will need a reference if things change ...
MySQL
If MySQL is the DB, one should consider circular replication versus cluster.
A good link is
http://www.onlamp.com/pub/a/onlamp/2006 ... ation.html
The code changes to support circular replication are "small"
(I'll try and make a summary here at some stage)
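In the meantime, here is a minimal sketch of the my.cnf settings a two-node circular (master-master) replication ring needs, along the lines of the onlamp article; the values are illustrative only, not anything prescribed by the RFC:
[code]
# node A -- my.cnf fragment (illustrative values)
[mysqld]
server-id                = 1
log-bin                  = mysql-bin
log-slave-updates                 # re-log replicated events so they travel around the ring
auto_increment_increment = 2      # two nodes in the ring
auto_increment_offset    = 1      # node A generates odd auto-increment ids

# node B -- my.cnf fragment
[mysqld]
server-id                = 2
log-bin                  = mysql-bin
log-slave-updates
auto_increment_increment = 2
auto_increment_offset    = 2      # node B generates even ids, so the two never collide
[/code]
The auto_increment settings keep auto-generated ids on the two nodes from colliding, which is presumably part of why the code changes can stay "small".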
Owen Brotherwood, JN Data A/S, Denmark.
- adrianmarsh
- Cacti User
- Posts: 437
- Joined: Wed Aug 17, 2005 8:51 am
- Location: UK
Well.. my 2 cents so far after looking at the diags..
I like the idea of de-centralizing for larger environments, but I'd also like to ensure that it doesn't get too complicated for the smaller ones.
The major issue I see is the RRD files.. so I can see why you'd introduce a secondary cluster of databases to handle that.. but would that be duplication of the RRAs, or just spreading the files around (still one file only)?
So here are my suggestions.
Just as the front-end->graphs and poller->data sources are essentially separate, I would be inclined to separate the main Cacti DB from the pollers as much as possible.
So here are my thoughts on polling.
The main DB will of course have a poller_item table.
There will be a field that determines which poller has which item.
Any time a new poller_item entry is added, the back-end assigns it to a poller and pushes this information.
This way the pollers have a local copy of poller_item.
I would assume a standard round-robin assignment and then make it modular for plugins.
I know my company will be abusing poller assignments based on the criticality of the data since we already have plans to do this.
Then I would have the pollers write to the RRD files themselves.
This keeps the front-end->graphs and pollers->data-sources separate.
That might not be intentional, but it's how I've been interpreting the DB, and it makes the most sense to me.
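To make the assignment idea concrete, here is a hedged sketch (in Python, purely for illustration; Cacti itself is PHP, and none of these names are real Cacti APIs) of a round-robin default with a pluggable strategy that a plugin could swap out:
[code]
# Hypothetical sketch: assign new poller_item rows to pollers.
# Default is plain round-robin; a plugin could register another strategy.
from itertools import cycle

def round_robin(pollers):
    """Default strategy: cycle through the known pollers endlessly."""
    pool = cycle(pollers)
    return lambda item: next(pool)

def by_criticality(pollers):
    """Example plugin strategy: pin critical items to a dedicated poller."""
    def assign(item):
        if item.get("critical"):
            return pollers[0]                      # the 'reliable' poller
        return pollers[hash(item["id"]) % len(pollers)]
    return assign

assign = round_robin(["poller1", "poller2", "poller3"])
for item_id in range(6):
    print(item_id, "->", assign({"id": item_id}))
[/code]
The back-end would run something like assign() whenever a poller_item row is created, store the result in the new field, and push the row to that poller's local copy.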
I guess the biggest reason I'm pushing this is that my company wants to redo the front-end almost entirely and use a full db backend for the data sources (we don't want lossy graphs).
The more modular Cacti is, the more code we can keep and less code we have to write.
Plus we can continue calling it Cacti. =D
Info:
Hosts: 12855
Data Sources: 152188
Graphs: 151450
Desired Hosts is somewhere on the order of 20,000.
We only have about 7000 active hosts because our pollers lose too much important data if we turn them all up.
Currently we are running:
2 Web Servers
4 Pollers/RRDTool writers (same boxes, since they use different ethernet ports and the pollers don't put much load on the server)
F5 load balancer
Blue Arc storage array
1 DB, which seems to be running fine, though poller_item entries seem to disappear here and there
Sadly we're still on 0.8.6h, since our lab is fubar'd and we don't want to upgrade until everything is tested, which means that if something I suggested breaks the plugin architecture, let me know.
I am also glad to see the direction that cacti is taking.
Here is a whitepaper I started to write describing our current cacti implementation (so it is not entirely complete). Our goal was to have a semi-symmetrical, linearly-scaled architecture that would provide data and polling resiliency. This whitepaper uses the existing Cacti 0.8.6 architecture - nowhere near as complex (or advanced) as this proposed architecture.
The polling work is divided up based on the number of active nodes in the cluster, and they all write to shared GFS storage on an MSA1000 (connected via FC). This is a 14-disk RAID 0+1 array. For database replication, we originally tried MySQL Cluster, but it proved to be too slow (even with SCI cards), so we went with master-master replication. We don't replicate the poller_output, poller_command, poller_reindex, or poller_time tables. We use Cisco IOS SLB for least-connected load balancing.
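For anyone curious, excluding those volatile tables from replication is just a few my.cnf lines on each node; a sketch, assuming the schema lives in a database named "cacti":
[code]
# my.cnf fragment (illustrative): keep the high-churn poller tables local
[mysqld]
replicate-ignore-table = cacti.poller_output
replicate-ignore-table = cacti.poller_command
replicate-ignore-table = cacti.poller_reindex
replicate-ignore-table = cacti.poller_time
[/code]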
We do 1-minute polling and store 60 days of 1-minute data, so RRDs get very large (currently 10GB for 6000 elements).
For the future of cacti, I would like to see similar design considerations for redundancy alongside scalability - and the ability to take advantage of the redundant components for back-end load balancing and job division. I like the idea of distributed RRD storage (with API access instead of disk access?), as disk I/O will still be a limitation with shared storage.
One other performance issue I have seen with our current setup is with the load balancing: it seems that Linux caches a lot of RRD data in RAM (we have 3GB per node), so whichever data a specific node gathers and writes to an RRD, it can serve up much faster via the cacti front end (vs. a node that must pull from disk). So there may be some merit to embedded load balancing in the application - maybe an algorithm to send specific requests to specific nodes that may have a cache of the data - just a thought (although I am a big fan of hardware load balancers).
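As a rough illustration of that cache-affinity idea (a sketch only; the node names are made up and this is not how Cacti routes requests today), you could deterministically map each RRD path to the node that both polls and serves it:
[code]
# Sketch: route a graph request to the node that likely has the RRD
# in its page cache, by hashing the RRD path over the node list.
import hashlib

NODES = ["node1", "node2", "node3"]   # hypothetical cluster members

def node_for_rrd(path):
    digest = hashlib.md5(path.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

print(node_for_rrd("rra/router1_traffic_in_42.rrd"))   # always the same node
[/code]
The same mapping would drive poller assignment, so writer and reader coincide; a hardware load balancer can't easily do this, since it doesn't know which node holds which file.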
So that is my 2 cents...
ADesimone
- Attachments
- Cacti cluster whitepaper v3.pdf (391.25 KiB)
Hi Aurelio,
adesimone wrote:
I am also glad to see the direction that cacti is taking. Here is a whitepaper I started to write describing our current cacti implementation ...
....
So that is my 2 cents...
ADesimone
Great article and a good first sign for this Cacti architecture thread.
I'm impressed by the idea of using DRBD and SLB as a redundancy and load-sharing model. I'd never have thought that it was applicable to Cacti.
But where can we find the "cluster-modified" Cacti code?
Regarding poller performance, you should read this article by David Plonka, pointed out here by fmangeant:
http://forums.cacti.net/viewtopic.php?t ... ght=plonka and
http://www.usenix.org/event/lisa07/tech/plonka.html
We don't have the challenge of 1-minute polling, and our recovery service agreement is 1 business day. But we do have the challenge of handling up to 10,000 devices with more than 150,000 data sources (currently 2,000 devices / 40,000 DS) within 5 minutes, distributed globally with latencies of up to 300-400 ms. So we need a distributed solution with poller engines on each continent and a centralized database for minimal maintenance overhead.
We work with two "semi-symmetrical" boxes (one database/polling box, one httpd/administration/redundancy box).
For the moment I don't have enough time to discuss these topics in depth, but I will follow with interest.
We had a similar discussion at our 2.CCC.eu conference, and hopefully all of these ideas will lead to the best solution for the future of Cacti.
Thanks for your contribution. Excellent.
Frizz and BownieBraun from Cologne
Cacti 0.8.6j | Cactid 0.8.6j | RRDtool 1.2.23 |
SuSe 9.x | PHP 4.4.4 | MySQL 5.0.27 | IHS 2.0.42.1
Come and join the 3.CCC.eu
http://forums.cacti.net/viewtopic.php?t=27908
I posted the 'cacti cluster' files under the 'distributed cacti' topic - http://forums.cacti.net/viewtopic.php?p=120908
Thanks for the links.. I will try the new rrdtool and see how it affects performance.
ADesimone
-
- Cacti User
- Posts: 59
- Joined: Tue Dec 19, 2006 4:35 pm
adrianmarsh wrote:
I like the idea of de-centralizing for larger environments, but I'd also like to ensure that it doesn't get too complicated for the smaller ones.
I completely agree. I think it would be a good idea to break this down into multiple levels of redundancy/load balancing. For a medium-size business like the one where I work, we only really need one central control server (MySQL, RRDtool, cacti web) with remote polling nodes. Obviously much larger installations need what is outlined in the RFC document. I'd say keep the RFC as the ultimate vision for redundancy, but keep in mind that a lot of users can only implement a smaller system.
some feedback
Great idea to improve cacti. The polling of the devices (and plugins) is really the bottleneck. And the trend is to get statistics under 5 minutes ... meaning even more polling.
Web services
I agree with the others: the companies which will need this design already have load balancers.
But the point is that we need some L7 status (healthcheck state) from Cacti, in order to load-balance better than with a simple round-robin, session, or port check.
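As a hedged sketch of what such an L7 healthcheck could look like (Python standard library only; check_poller_state() is a placeholder for whatever the real check would be):
[code]
# Minimal L7 health endpoint a load balancer could probe.
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_poller_state():
    # Placeholder: a real check might compare the poller's last
    # finish time in the DB against the polling interval.
    return True

class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        ok = check_poller_state()
        self.send_response(200 if ok else 503)   # 503 pulls the node out of rotation
        self.end_headers()
        self.wfile.write(b"OK" if ok else b"FAIL")

HTTPServer(("", 8080), Health).serve_forever()
[/code]
The load balancer would then route only to nodes answering 200, instead of guessing from a TCP port check.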
My feedback will be more about the plugins' needs:
1/ First a question: do you want to implement this with MySQL database replication? If yes, the plugin community will need some SQL interface functions which know where the active SQL server is (or which server has the data, if there is no replication).
2/ poller_bottom and other hooks: these could run anywhere, BUT for the future, and because some plugins need a lot of resources, I would prefer to have the possibility to select a poller group.
2b/ How does the group decide which node starts "poller_bottom" (for example) in case of failure? When master/slave work like HSRP, it could be that they don't see each other and both start the poller ... but perhaps that is not really a problem, because the goal of this design is performance.
3/ The plugins will need some functions to retrieve, add, etc. the RRD files from the active/reachable RRD server of the group - in fact, I think the same functions you will need to create the graphs in Cacti (the load balancer cannot know where the RRD file is).
PS: I will give some comments later ...
Jean-Michel
cacti 0.8.7e | cmd & cactid (cactid 0.8.x) | Linux | MySQL Ver 14.7 Distrib 4.1.12, for Win32 | PHP v5.2.6 | Apache v2.x | Thold | Plugin Architecture | plugin "configuration manager" http://cactiusers.org/forums/topic257.html | plugin "IP subnet calculator IPv4 / IPV6" http://forums.cacti.net/viewtopic.php?t=15428 | plugin banner http://docs.cacti.net/userplugin:banner | Net-SNMP 5.5.2 | cygwin 1.5.18 of 02.07.2005
I'm going to add more of the same here; the main reason I want multiple pollers is so the platform can scale sideways. The current monolithic nature of the architecture forces us to resolve growth issues by placing cacti on a more powerful server. It would be easier to manage, and cheaper, if we could just throw another cheap server into the mix for polling and data collection as we need them.
Anyways, on to architecture; I would agree that we should look at a modular design that makes sense without overcomplicating the product.
Originally I had been a fan of the idea of breaking the project's roles out to dedicated servers, i.e. MySQL servers, spine servers, and web servers. However, this makes the product more of a one-off from the main cacti branch, which would make it harder for the developers to support.
The whitepaper suggests a better approach, in which each cacti server is the base build of cacti plus some tweaks, which sounds good. I also like the idea of using GFS; however, I have not run this level of load on GFS, so I don't know how well it would work, but it does give me a pet project to try next week.
Just a quick thought:
What about using XML-RPC or SOAP for communications and possibly RRD file sync? I would think this could become a universal interface to all the pieces of the "new" Cacti. It would allow for more modularity, and it seems you could eliminate many of the sync issues (such as syncing RRD files across multiple servers, grabbing data from multiple servers, etc.).
Allow the pollers to communicate to the DB server or RRD files via SOAP or XML-RPC. Have plugins use XML-RPC to access any data from Cacti.
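To show how little plumbing that needs, here is a hedged sketch using Python's standard library (the method name store_sample and its arguments are made up, not an actual Cacti interface):
[code]
# Central server: expose a single RPC method that pollers push samples to.
from xmlrpc.server import SimpleXMLRPCServer

def store_sample(local_data_id, timestamp, value):
    # A real implementation would hand this to rrdtool / the DB.
    print(local_data_id, timestamp, value)
    return True

server = SimpleXMLRPCServer(("", 8000))
server.register_function(store_sample)
server.serve_forever()
[/code]
A poller would then call something like xmlrpc.client.ServerProxy("http://central:8000").store_sample(42, 1190000000, "123.4"), and plugins could use the same endpoint instead of raw DB or disk access.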
I hope this made sense.
I will add more later as I think about it. Great topic for discussion!
-Stephen
www.cactiexchange.com -- Hosted Tools for the Cacti Community:
Cacti Perl Tool
Cacti XML Tool
Search OID-to-Names
[url=http://www.google.com/search?num=100&hl=en&safe=off&domains=forums.cacti.net&q=cacti&sitesearch=forums.cacti.net]Cacti Forums Google Index -- Better than the main page version![/url]
For scalability, have you considered/looked at sending the jobs/polls to pollers using some form of algorithmic hashing? This could happen after an "are you there" check to see which pollers are up.
One step up from that: the check includes a load field, where the poller returns a value that indicates how many "jobs" it can take.
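Roughly like this (a sketch only; the capacity numbers and poller names are invented, and a real implementation would have to handle the slot list changing as pollers come and go):
[code]
# Hash each job onto the live pollers, weighted by self-reported capacity.
import hashlib

def alive_pollers():
    # Placeholder for the "are you there" check: each poller answers
    # with how many jobs it is willing to take.
    return {"poller1": 100, "poller2": 50, "poller3": 100}

def assign(job_id):
    pollers = alive_pollers()
    # Build a slot list proportional to capacity, then hash into it.
    slots = [p for p, cap in sorted(pollers.items()) for _ in range(cap)]
    h = int(hashlib.md5(str(job_id).encode()).hexdigest(), 16)
    return slots[h % len(slots)]

print(assign(12345))
[/code]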
Just a thought; otherwise, LGTM.
Multiple pollers would be the best fit for me.
I could keep the existing high-availability hardware for the 'have to monitor' devices,
and use lower-cost hardware for the 'nice to know' things.
In practice (for me) the user interface is the least used part.
There are over 70 people that can log in, but there have never been more than 3 or 4 concurrent users.
If there is a priority, I would suggest expanding collection capacity first.
I would keep the web interface on the 'main' server and free up resources with external pollers.
With one-minute polling and longer high-resolution retention in the RRAs, hardware limits start to become a problem.
For now, moving to an external MySQL server was a big help for me.
Database space requirements are small.
If you have an old server and a bunch of 9GB SCSI drives lying around, I would recommend it for short-term breathing room.
Good topic, I will keep an eye on this out of interest, even though I don't think I'll have a lot to add.
My main reason for wanting "scalability" is so I can put a poller on a client site and have it feed back to our central office, where alerts could be raised and trends analysed, rather than having one Cacti server polling devices over the internet/dedicated links to our clients.
At the moment I am putting a new Cacti install into a major client (for us, only about 20 servers / 40 monitored nodes), and would love to put a cacti box into each of our clients (about 70) and monitor it all from a central location...
As for front-end load balancing, I am a great fan of hardware load balancing, but would steer clear of F5/Alteon; have a look at loadbalancer.org for a more cost-friendly solution...
Argon0
No longer a n00by, probably, by now an 0ldby
Now Head of Technology at RSCH, back to the prickly subject of Monitorring....
- TheWitness
- Developer
- Posts: 17007
- Joined: Tue May 14, 2002 5:08 pm
- Location: MI, USA
- Contact:
Very good comments thus far. I will have more feedback in the next month or so.
Regards,
Larry
True understanding begins only when we realize how little we truly understand...
Life is an adventure, let yours begin with Cacti!
Author of dozens of Cacti plugins and customizations. Advocate of LAMP, MariaDB, IBM Spectrum LSF and the world of batch. Creator of IBM Spectrum RTM, author of quite a bit of unpublished work and most of Cacti's bugs.
_________________
Official Cacti Documentation
GitHub Repository with Supported Plugins
Percona Device Packages (no support)
Interesting Device Packages
For those wondering, I'm still here, but lost in the shadows. Yearning for fewer bugs. Who wants a Cacti 1.3/2.0? Streams, anyone?
- Howie
- Cacti Guru User
- Posts: 5508
- Joined: Thu Sep 16, 2004 5:53 am
- Location: United Kingdom
- Contact:
argon0 wrote:
My main reason for wanting "scalability" is so I can put a Poller on a client site, and have it feed back to our central office where alerts could be raised, and trends analysed. Rather than having to have one Cacti server polling devices over the internet/dedicated links to our clients.
Which brings up another variable: in some situations, where it's the actual load on a central collector that's the issue, it would be OK to have Cacti level the load between collectors in the event of a failure; but in others (like this one, perhaps), there may be only one collector that has the correct network access. For example, it may be behind a firewall at the end of a VPN on the client's private address space. Worse, the same addresses may also be valid in some other client's address space, so some kind of poller affinity needs to be available for Devices: which pollers can see this device?
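A hedged sketch of what that affinity constraint could look like at assignment/failover time (the field and function names are hypothetical, not Cacti schema):
[code]
# Only ever assign a device to a poller that is both alive and allowed
# to reach it; an empty/absent allowed list means "any poller".
def eligible_pollers(device, alive):
    allowed = device.get("allowed_pollers") or alive
    return [p for p in allowed if p in alive]

device = {"hostname": "client-fw", "allowed_pollers": ["poller-clientA"]}
print(eligible_pollers(device, ["poller1", "poller-clientA"]))
[/code]
Load-leveling on failure would then operate only within each device's eligible set, which for the VPN case above may be a set of one.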
Weathermap 0.98a is out! & QuickTree 1.0. Superlinks is over there now (and built-in to Cacti 1.x).
Some Other Cacti tweaks, including strip-graphs, icons and snmp/netflow stuff.
(Let me know if you have UK DevOps or Network Ops opportunities, too!)