New Cacti Architecture (0.8.8) - RFC
rony (Developer/Forum Admin):
I know from past discussions that this is an issue we have identified, but I'm glad someone brought it up in this thread.
Tony Roman
Experience is what causes a person to make new mistakes instead of old ones.
There are only 3 ways to complete a project: Good, Fast or Cheap; pick two.
With age comes wisdom; what you choose to do with it determines whether or not you are wise.
I can see that you put some thought into this. I'm attempting to as well.
I work in what is essentially a NOC environment, and for us alerting and prediction is everything.
A couple of questions/comments/observations about the proposed architecture:
Alerting: Where would be the best place to tap in to do alerting in your proposed layout? I would imagine a tap into the RRD data pipeline could provide threshold and aberrant-behavior triggers.
Do you have any specific existing protocols in mind for the RRD update service? XML-RPC, SOAP, ...?
I'm running the patch-day scenario in my head, where each machine is taken out of the equation one at a time and everything looks OK until we get to the RRD update servers. I'm curious: does the DB cache the polled data until the update server comes back online, then serialize the data to the RRD server to fill out the RRDs? How would alerting work in this scenario? Maybe tee some of the data in a sort of RAID scenario, where you could lose one RRD server and still have updates and alerting continue. You would need a sync mechanism to bring the short RRDs up to par and a redistribution mechanism to enable adding another server to the cluster. On the httpd side you would need a way to determine which of the two RRDs to display, based on last update.
- Attachment: rrd raid.jpg (rrd1-3 are the servers; colors represent individual RRD files housed on each server)
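A minimal sketch of the kind of tap described above: a threshold check sitting directly in front of rrdtool update, so every value is inspected on its way into the RRD pipeline. The data source names, limits and file path are invented for illustration, and aberrant-behavior detection is left out.

```python
import subprocess

# Hypothetical per-data-source limits: ds_name -> (low, high); None disables a bound.
THRESHOLDS = {
    "traffic_in": (None, 8.0e8),   # alert above roughly 800 Mbit/s
    "cpu_load":   (None, 12.0),
}

def check_thresholds(ds_name, value, alert):
    low, high = THRESHOLDS.get(ds_name, (None, None))
    if low is not None and value < low:
        alert("%s below threshold: %s < %s" % (ds_name, value, low))
    if high is not None and value > high:
        alert("%s above threshold: %s > %s" % (ds_name, value, high))

def update_rrd(rrd_path, ds_name, timestamp, value, alert=print):
    """The tap: check the value, then hand it to the RRD update service."""
    check_thresholds(ds_name, float(value), alert)
    subprocess.run(["rrdtool", "update", rrd_path, "%d:%s" % (timestamp, value)], check=True)

# e.g. update_rrd("/var/lib/cacti/rra/router_traffic_in_42.rrd", "traffic_in", 1244023200, 950000000)
```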
I'm glad this topic is coming up for discussion. I definitely like the idea of having multiple pollers, because we would like to have pollers in multiple geographic locations to keep as much of the polling traffic "local" as possible (less SNMP traffic going over WAN/VPN connections and/or crossing high-latency paths).
One thing we'll want to be careful about, particularly with as many separate DB instances as I see in the diagram, is that SQL database performance (at least in my experience) suffers horribly over high-latency links. This is especially true for apps doing lots of small DB queries instead of a few big queries. (The last time I tested this, it didn't matter what flavor of DB server I used either: accessing a server over a WAN circuit with 30 ms of latency dropped the number of transactions per second I could do from tens of thousands to hundreds.)
I'd initially thought the way to spread my polling load out over multiple areas would be to have separate stand-alone Cacti instances (say, one in each major office monitoring nearby hosts), but with hooks in the code (maybe a new plugin?) to be able to add references to graphs, hosts, and trees from remote Cacti servers.
But provided that having poller groups separated from the rest of the architecture by medium- to high-latency links doesn't cause problems, this new design looks even better.
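To make the round-trip point above concrete, here is a toy sketch contrasting row-at-a-time inserts with a single batched insert; over a 30 ms link the first pays the latency once per row, the second roughly once per batch. The table name, credentials and the pymysql dependency are my own choices, not anything from the proposal.

```python
import pymysql

conn = pymysql.connect(host="db.example.com", user="cacti", password="secret", database="cacti")
rows = [(1, "12345"), (2, "67890"), (3, "424242")]   # made-up (ds_id, value) samples

def insert_one_by_one(rows):
    # one network round trip per row: ~30 ms each over the WAN link described above
    with conn.cursor() as cur:
        for ds_id, value in rows:
            cur.execute("INSERT INTO sample_results (ds_id, value) VALUES (%s, %s)",
                        (ds_id, value))
    conn.commit()

def insert_batched(rows):
    # executemany lets the driver send the whole batch at once: roughly one round trip per batch
    with conn.cursor() as cur:
        cur.executemany("INSERT INTO sample_results (ds_id, value) VALUES (%s, %s)", rows)
    conn.commit()
```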
Wow, I'm late to the party on this one.
A slight tweak to the proposed design that would help address those issues:
Add a new logical service, let's call it "Poller Control." The poller controller could exist on its own hardware, or hitch a ride on existing hardware with one of the other logical services.
Whenever Cacti configuration changes are made (whether through the CLI or the web interface), one file per poller is created from the database and provided to the poller controller. The poller controller then pushes those config files out to the relevant pollers. To save bandwidth, something like rsync might even be employed to keep poller configs up to date.
The individual pollers then read the config files passed to them by the poller controller and execute their polling as specified.
Instead of passing the results of the polling directly to the database, the pollers write out the results to a CSV file (or whatever is appropriate), which is then passed back to the poller controller. In the event the poller controller can't be reached, the CSV files are cached on the poller until the poller controller becomes reachable again.
The poller controller takes the output (CSV) from the pollers and inserts the data into the database.
That should remove any need for true "real-time" communication between the remote pollers and the rest of the infrastructure, which should solve any latency or link-related problems. Further, all communication between the pollers and the central infrastructure is limited to file transfers of one kind or another, which makes managing the security aspects very simple.
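To make that concrete, here is a rough sketch of the poller-side half of the idea. Everything in it is an assumption made for illustration (the config format, paths and the rsync transport are not part of the proposal), but it shows the cycle: read the pushed config, poll, spool the results as CSV, and ship the spool to the controller, leaving anything unsent cached for the next run.

```python
#!/usr/bin/env python
"""Sketch of a remote poller working under a "Poller Control" service (hypothetical)."""
import csv, glob, os, subprocess, time

CONFIG_PATH = "/etc/cacti-poller/targets.conf"    # pushed to us by the poller controller
SPOOL_DIR = "/var/spool/cacti-poller"             # local cache of results not yet shipped
CONTROLLER_URI = "controller.example.com:/var/spool/cacti-results/"

def read_targets(path):
    """One target per line: host_id|data_source_id|command."""
    with open(path) as fh:
        return [line.rstrip("\n").split("|", 2) for line in fh if line.strip()]

def poll(targets):
    """Run each command and collect (data_source_id, epoch, value) rows."""
    rows = []
    for host_id, ds_id, command in targets:
        out = subprocess.run(command, shell=True, capture_output=True, text=True)
        rows.append((ds_id, int(time.time()), out.stdout.strip()))
    return rows

def write_spool(rows):
    path = os.path.join(SPOOL_DIR, "results-%d.csv" % int(time.time()))
    with open(path, "w", newline="") as fh:
        csv.writer(fh).writerows(rows)

def ship_spool():
    """Push every cached CSV to the controller; on failure, keep caching and retry next cycle."""
    for path in sorted(glob.glob(os.path.join(SPOOL_DIR, "results-*.csv"))):
        rc = subprocess.run(["rsync", "--remove-source-files", path, CONTROLLER_URI]).returncode
        if rc != 0:
            break   # controller unreachable: leave the files in the spool

if __name__ == "__main__":
    write_spool(poll(read_targets(CONFIG_PATH)))
    ship_spool()
```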
This is an interesting direction to me. I've been tackling the "remote poller" issue by using a Nagios server with remote NRPE clients to poll data and insert it into RRD files, from which I use Cacti to generate graphs and present the data. This project may allow me to greatly simplify that setup while increasing overall reliability. I'm excited!
Andrew
bbice wrote: One thing we'll want to be careful about, particularly with as many separate DB instances as I see in the diagram, is that SQL database performance (at least in my experience) suffers horribly over high-latency links. This is especially true for apps doing lots of small DB queries instead of a few big queries.
I agree -- this was the concern that jumped out at me when looking at the diagram. I think the remote poller stations need to be as autonomous as possible and should not communicate directly with the central database servers. Not only is latency a major concern, but security is an issue as well. It would be great to have the ability to locate a poller somewhere and not have to worry about setting up a VPN to avoid exposing MySQL logins.
Master/slave Cacti patch
Hi!
Here's my small patch for Cacti 0.8.7b that implements a master/slave architecture:
http://forums.cacti.net/viewtopic.php?p=127122
It modifies only poller.php. The slaves send data by inserting poller_output rows into the master database over TCP port 3306.
Cheers,
Ludo.
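Not the patch itself (see the link above), but roughly the mechanism it describes, sketched with pymysql. The poller_output columns follow the stock 0.8.7 schema as far as I recall; the host name, credentials and sample rows are invented.

```python
import pymysql

# Slave side: push this polling cycle's results straight into the master's poller_output table.
conn = pymysql.connect(host="cacti-master.example.com", port=3306,
                       user="cacti", password="secret", database="cacti")

results = [  # (local_data_id, rrd_name, time, output) - sample values only
    (101, "traffic_in",  "2009-06-01 12:00:00", "1234567"),
    (101, "traffic_out", "2009-06-01 12:00:00", "7654321"),
]

with conn.cursor() as cur:
    cur.executemany(
        "INSERT INTO poller_output (local_data_id, rrd_name, time, output) "
        "VALUES (%s, %s, %s, %s)",
        results)
conn.commit()
conn.close()
```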
Hello,
I would like your opinion on this architecture.
Do you think it is possible to realise this architecture?
For the master/slave poller I want to use this patch: http://forums.cacti.net/viewtopic.php?p=127122
Well... performance for RRD updates over NFS is very poor (tried with a NetApp 3050).
@All
My spec: 2,200 appliances, 50,000 data sources and 40,000 graphs
Polling time with spine: ~130 seconds
Dual quad-core Xeon 3 GHz, 4 GB RAM
A distributed architecture is a good thing, BUT FIRST we need to optimize Cacti. In fact I use a dual quad-core Xeon and my problem is that one core is always at 100% while the others sleep.
Spine is multithreaded, but the other processes (and plugins) are single-threaded. That means MySQL is only ever accessed by one PHP process, and since MySQL allows at most one CPU per client process, one core does the job while the others sleep.
Vince
Such is the problem with PHP. PHP is supposedly thread-safe, but most of the modules used with it are not.
http://www.php.net/manual/en/install.unix.apache2.php
Warning
We do not recommend using a threaded MPM in production with Apache 2. Use the prefork MPM instead, or use Apache 1. For information on why, read the related FAQ entry on using Apache2 with a threaded MPM
gandalf (Developer):
killshoot wrote: A distributed architecture is a good thing, BUT FIRST we need to optimize Cacti. In fact I use a dual quad-core Xeon and my problem is that one core is always at 100% while the others sleep.
There will be an optimization in 0.8.7c to distribute data sources equally between different threads. It does not yet take into account the different runtimes of different data sources.
Reinhard
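For illustration only (this is not the actual spine or poller code), the difference between the two strategies looks roughly like this: equal counts per thread versus a greedy assignment that always hands the next data source to the currently least-loaded thread.

```python
import heapq

def round_robin(data_sources, threads):
    """Equal number of data sources per thread, ignoring how long each one takes."""
    buckets = [[] for _ in range(threads)]
    for i, ds in enumerate(data_sources):
        buckets[i % threads].append(ds)
    return buckets

def runtime_aware(ds_runtimes, threads):
    """Greedy: largest expected runtime first, each to the currently least-loaded thread.
    ds_runtimes is a list of (data_source, expected_runtime_seconds)."""
    heap = [(0.0, i) for i in range(threads)]          # (accumulated runtime, thread index)
    buckets = [[] for _ in range(threads)]
    for ds, rt in sorted(ds_runtimes, key=lambda x: -x[1]):
        load, i = heapq.heappop(heap)
        buckets[i].append(ds)
        heapq.heappush(heap, (load + rt, i))
    return buckets
```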
Hello all,
Distributed polling was one of the topics discussed during the 3rd European Cacti Community Conference this weekend.
I tried to put together the key points of our brainstorming. Most ideas have already been around in this thread, but I thought it best to give a complete wrap-up of our discussion today, even if that means repeating a lot of things.
The input below came from all participants of the conference, but of course I take all responsibility for errors or missing points.
Scenarios
The overall idea would be to have one central server and a number of agents that just poll hosts.
The central server is used for the graphical interface, meaning configuring Cacti and accessing graphs, and for keeping the RRD files.
The polling agents just poll data (as the name already suggests) and report the results back to the central server.
This mechanism can be useful in several scenarios, for example:
- Remote probes
Imagine a data center with your application server and your Cacti server. The application is accessed from remote branches, and you have to monitor it from the user's point of view, not just from the central system.
A polling agent can be placed in the remote branch to achieve that. This can be helpful for troubleshooting or even be covered by your SLAs.
- Latency issues
Imagine a central Cacti server monitoring devices that are far away or, in network lingo, reachable only over a high-latency connection.
For only a few devices this might be workable, but with a growing number of devices it can slow down the overall performance of your Cacti installation.
- Remote networks
Sometimes there is a need to monitor devices in a network that, due to routing constraints, is not easily reachable, e.g. networks with overlapping IP address schemes when using private IP addresses.
Accessing a small number of devices can be achieved through NAT or a tunnel, but this gets difficult with a large number of devices.
Pros and Cons
For all the problems mentioned above there is a quite simple and straightforward solution: just set up a Cacti instance where you need it and everything is fine.
The advantage the concept of distributed polling agents has to deliver would therefore be to simplify the configuration and maintenance of your complete infrastructure.
There is only one graphical interface for configuration and administration of your whole environment, and only one place to maintain templates and scripts.
(To be honest, I haven't made up my mind about that question. Even if some administrative tasks become easier, there is still work involved in keeping the agents up and running.
There might be a point during the discussion when it turns out that these efforts are bigger than running several independent Cacti instances.)
Keeping the RRD files on the central server also means that the graphs can only be accessed on this server.
A distributed repository of RRD files would add much more complexity to the equation. If users from remote locations need access to the graphical interface, a simple reverse proxy on the agent might already do the job.
Distributed polling as described here is not fit to improve overall performance, as the I/O performance for updating the RRD files is still a limiting factor.
It also can't be used to build an HA environment for Cacti, as most components are not fault-tolerant.
Concept and Requirements
On the agent
- a poller (spine)
- a repository of the "to do's" for the poller, e.g. a MySQL table with a subset of the poller_command table of the central instance. This could be achieved with MySQL replication. It could be helpful to create one poller_command table per polling agent on the central server.
- the correct environment, e.g. scripts have to be synchronized with the central instance and functioning on every agent.
- a way to report the results back to the central server, e.g. using an RRD server or a MySQL table replicated back to the central server.
- a cache for the configuration and polling results in case the connection to the central server is disrupted for a longer period.
On the central server
- agent configuration (ID, IP address, maybe more?)
- agent status overview (up, down, maybe more?)
- mapping of a host to one or more polling agents. For scenarios 2 and 3 one agent would be enough; for scenario 1 more agents would be desirable. Perhaps a default mapping based on the host template (a rough sketch of such a mapping follows below).
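A rough sketch of that mapping, for illustration only: the agent table and the host.agent_id column are pure invention (nothing like this exists in stock Cacti), the poller_item columns follow the 0.8.x schema as far as I remember, and pymysql plus the credentials are my own choices.

```python
import pymysql

# Invented additions to the central schema:
#   agent(id, hostname, ip_address, status)   -- one row per polling agent
#   host.agent_id                              -- which agent polls this host
WORKLIST_SQL = """
    SELECT pi.local_data_id, pi.rrd_name, pi.hostname, pi.arg1
    FROM poller_item AS pi
    JOIN host AS h ON h.id = pi.host_id
    WHERE h.agent_id = %s
"""

def export_worklists(agent_ids):
    """Build one 'to do' list per agent; replication or a file push would ship it out."""
    conn = pymysql.connect(host="localhost", user="cacti", password="secret", database="cacti")
    worklists = {}
    with conn.cursor() as cur:
        for agent_id in agent_ids:
            cur.execute(WORKLIST_SQL, (agent_id,))
            worklists[agent_id] = cur.fetchall()
    conn.close()
    return worklists
```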
Dirk
Hello, the answer is a simple yes: the server hosting the UI is also keeping the database. The approach is just aiming at a distributed polling mechanism, not an HA cluster; in that regard your setup is more advanced. The modified poller script you mentioned in your earlier post already covers most of the requirements for distributed polling. I'm just wondering about two points. First, when connecting directly from the remote poller to the central database server to write the results into the database, you might run into latency problems again; depending on the WAN link and the number of data sources polled, that is a bigger or smaller problem. The second point is the caching of information in case of an outage of the connection between the central site and the remote agent. Currently that data would be lost.
gandalf wrote: There will be an optimization in 0.8.7c to distribute data sources equally between different threads.
Very good news! Do you have a release date for 0.8.7c?
Regards,
Vince
Humor me by reading and commenting on these thoughts:
Some of your thinking is way too centralized to ever be scalable.
The only thing that needs to be "central" is the user interface. And that's purely to make the experience as pleasant as possible while browsing the graphs. And of course to facilitate a way of _finding_ all the graphs in the first place.
Needed "layers":
1. FRONT-END
2. MULTIPLE BACK-ENDS (or just one for that matter in a _small_ setup)
Think of it as either a centralized sharded approach or a distributed approach where you delegate poller/storage to decentralized local machines in different zones (can be for both performance and network-access reasons).
The FRONT-END:
- purely for browsing, and for configuration control of the back-ends via (preferably) an RPC API.
- the spider connecting it all: a database which keeps track of which back-end hosts a certain host and/or data source. Either links to the back-ends directly (IMG SRC=...) or proxies the requests (proxying is definitely the best approach, since it would also help increase security by NOT letting the end user talk directly to the back-end).
The BACK-END:
- pollers (can be centrally triggered if you wish, but I prefer them to be centrally configured only; if the central FRONT-END server breaks, we don't want to stop polling from all our distributed back-ends).
- RRD storage
- DB back-ends for whatever is needed.
DEPENDENCIES etc.:
All back-ends are standalone as far as operations go. Only configuration changes and modifications are centrally managed.
Why would all RRD files need to be stored on the same central server? It's not scalable... EVER...
If your argument is "I want all RRDs in the same filesystem," then my response is: set up a hashed directory structure with NFS mounts from your central server to all the BACK-ENDs. This is for reads or special management cases only, of course.
I hope some of it makes sense since this was a very quick write-up.
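A minimal sketch of the proxying idea: the front end keeps a map of which back-end owns which graph and fetches the rendered image on the user's behalf, so the browser only ever talks to the front end. The map, the URLs and the graph_image.php query string are assumptions made for illustration.

```python
from urllib.request import urlopen

# Hypothetical front-end lookup table: local graph id -> owning back-end base URL.
GRAPH_BACKEND = {
    101: "http://backend-eu1.example.com/cacti",
    202: "http://backend-us1.example.com/cacti",
}

def fetch_graph_png(local_graph_id, rra_id=0):
    """Proxy a graph-image request to whichever back-end hosts that graph."""
    base = GRAPH_BACKEND[local_graph_id]
    url = "%s/graph_image.php?local_graph_id=%d&rra_id=%d" % (base, local_graph_id, rra_id)
    with urlopen(url) as resp:     # the front end fetches; the end user never reaches the back-end
        return resp.read()         # raw PNG bytes, ready to stream back to the client
```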