New Cacti Architecture (0.8.8) - RFC Response Location

oxo-oxo · Post by **oxo-oxo** » Mon Dec 17, 2007 4:52 am

This post will be edited over time so as not to fill up the thread
- it will contain all my input as one post.
Please remember to keep the date of the PDF up to date, or give it a version number, as I will need a reference if things change ...

MySQL
If MySQL is the DB, one should consider circular replication versus a cluster.
A good link is
http://www.onlamp.com/pub/a/onlamp/2006 ... ation.html
The code changes to support circular replication are "small"
(I'll try and make a summary here at some stage)

adrianmarsh · Post by **adrianmarsh** » Mon Dec 17, 2007 1:28 pm

Well.. my 2 cents so far after looking at the diags..

I like the idea of de-centralizing for larger environments, but I'd also like to ensure that it doesn't get too complicated for the smaller ones.

The major issue I see is the RRD files.. so I can see why you'd introduce a secondary cluster of databases to handle that.. but would that be duplication of the RRAs or just spreading the files around (still one file only) ?

marnues · Post by **marnues** » Mon Dec 17, 2007 1:30 pm

So here are my suggestions.
Just as the front-end->graphs and poller->data sources are essentially separate, I would be inclined to separate the main Cacti DB from the pollers as much as possible.
So here are my thoughts on polling.

The main DB will of course have a poller_item table.
There will be a field that determines which poller has which item.
Any time a new poller_item entry is added, the back-end assigns it to a poller and pushes this information.
This way the pollers have a local copy of poller_item.
I would assume a standard round-robin assignment and then make it modular for plugins.
I know my company will be abusing poller assignments based on the criticality of the data since we already have plans to do this.
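The assignment scheme described above can be sketched roughly as follows. This is only an illustration, not Cacti code: the function names, the poller ids, and the idea of passing the strategy around as a plain callable are all assumptions made for the example. A plugin would supply its own callable (for instance one that groups items by criticality) in place of `round_robin`.

```python
from itertools import cycle
from typing import Callable, Dict, List

# Hypothetical plugin interface: a strategy maps (item ids, poller ids)
# to {item id: poller id}. Nothing here is an existing Cacti API.
AssignmentStrategy = Callable[[List[int], List[str]], Dict[int, str]]


def round_robin(item_ids: List[int], poller_ids: List[str]) -> Dict[int, str]:
    """Default strategy: hand poller_item entries to the pollers in turn."""
    if not poller_ids:
        raise ValueError("at least one poller is required")
    return dict(zip(item_ids, cycle(poller_ids)))


def local_copy(assignment: Dict[int, str], poller_id: str) -> List[int]:
    """Items one poller would hold in its local copy of poller_item."""
    return sorted(item for item, owner in assignment.items() if owner == poller_id)


strategy: AssignmentStrategy = round_robin  # a plugin could replace this
assignment = strategy([101, 102, 103, 104, 105], ["poller-a", "poller-b"])
print(local_copy(assignment, "poller-a"))  # [101, 103, 105]
print(local_copy(assignment, "poller-b"))  # [102, 104]
```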

Then I would have the pollers store to the rrdfiles themselves.
This keeps the front-end->graphs and pollers->data-sources separate.
That might not be intentional, but it's how I've been interpreting the DB and it makes the most sense to me.

I guess the biggest reason I'm pushing this is that my company wants to redo the front-end almost entirely and use a full db backend for the data sources (we don't want lossy graphs).
The more modular Cacti is, the more code we can keep and less code we have to write.
Plus we can continue calling it Cacti. =D

Info:
Hosts: 12855
Data Sources: 152188
Graphs: 151450

Desired Hosts is somewhere on the order of 20,000.
We only have about 7000 active hosts because our pollers lose too much important data if we turn them all up.

Currently we are running:
2 Web Servers
4 Pollers/RRDTool writers (same boxes, since they use different ethernet ports and the pollers don't put much load on the server)
F5 load balancer
Blue Arc storage array
1 DB, which seems to be running fine, though poller_item entries seem to disappear here and there

Sadly, we're still on 0.8.6h since our lab is fubar'd and we don't want to upgrade until everything is tested,
which means that if something I suggested breaks the plugin architecture, let me know.

adesimone · Post by **adesimone** » Wed Dec 19, 2007 2:02 am

I am also glad to see the direction that cacti is taking.

Here is a whitepaper I started to write describing our current cacti implementation (so it is not entirely complete). Our goal was to have a semi-symmetrical, linearly-scaled architecture that would provide data and polling resiliency. This whitepaper uses the existing cacti 0.8.6 architecture - nowhere near as complex (or advanced) as this proposed architecture.

The polling work is divided up based on the number of active nodes in the cluster, and they all write to shared GFS storage on an MSA1000 (connected via FC). This is a 14-disk RAID0+1 array. For database replication, we originally tried MySQL Cluster, but it proved to be too slow (even with SCI cards) so we went with master-master replication. We don't replicate the poller_output, poller_command, poller_reindex, or poller_time tables. We use Cisco IOS SLB for least-connected load-balancing.
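As a rough illustration of dividing the polling work by the number of active nodes: the sketch below simply deals the hosts out to whichever nodes are currently up, and recomputes the shares when a node drops out. The whitepaper's actual mechanism is not shown in this thread, so the function name, node names, and the modulo scheme are assumptions.

```python
from typing import Dict, List


def divide_hosts(host_ids: List[int], active_nodes: List[str]) -> Dict[str, List[int]]:
    """Split hosts evenly across whichever cluster nodes are currently active."""
    if not active_nodes:
        raise ValueError("no active nodes")
    nodes = sorted(active_nodes)  # every node must agree on the ordering
    shares: Dict[str, List[int]] = {node: [] for node in nodes}
    for index, host_id in enumerate(sorted(host_ids)):
        shares[nodes[index % len(nodes)]].append(host_id)
    return shares


hosts = [1, 2, 3, 4, 5, 6]
print(divide_hosts(hosts, ["node1", "node2", "node3"]))
# {'node1': [1, 4], 'node2': [2, 5], 'node3': [3, 6]}
print(divide_hosts(hosts, ["node1", "node3"]))  # node2 has dropped out
# {'node1': [1, 3, 5], 'node3': [2, 4, 6]}
```

Because each node can compute the same result from the same list of active nodes, no central coordinator is needed in this sketch. The trade-off is that most hosts change owner whenever the node count changes.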

We do 1-minute polling and store 60-days of 1-minute data so RRDs get very large (currently 10GB for 6000 elements).

For the future of cacti, I would like to see similar design considerations for redundancy alongside scalability - and the ability to take advantage of the redundant components for back-end load-balancing and job division. I like the idea of distributed RRD storage (with API access instead of disk access?), as the disk I/O will still be a limitation with shared storage.

One other performance issue I have seen with our current setup is with the load-balancing - it seems that Linux caches a lot of RRD data in RAM (we have 3GB per node) - so whichever data a specific node gathers and writes to an RRD, it can serve up much faster via the cacti front end (vs a node that must pull from disk). So there may be some merit to embedded load-balancing in the application - maybe an algorithm to send specific requests to specific nodes that may have a cache of the data - just a thought (although I am a big fan of hardware load-balancers).
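A minimal sketch of that kind of cache-aware routing, assuming the application keeps a (hypothetical) map from each RRD file to the node that last wrote it. The names `pick_node` and `writer_of`, and the CRC-based fallback, are illustrative choices and not part of Cacti.

```python
import zlib
from typing import Dict, Set


def pick_node(rrd_path: str, writer_of: Dict[str, str], alive: Set[str]) -> str:
    """Prefer the node that wrote the RRD, since it probably has it cached."""
    if not alive:
        raise RuntimeError("no front-end nodes available")
    preferred = writer_of.get(rrd_path)
    if preferred in alive:
        return preferred
    # Fall back to a stable hash so repeated requests for the same file
    # keep hitting (and warming) the same node's cache.
    nodes = sorted(alive)
    return nodes[zlib.crc32(rrd_path.encode()) % len(nodes)]


writer_of = {"rra/host1_traffic.rrd": "node2"}  # made-up example data
print(pick_node("rra/host1_traffic.rrd", writer_of, {"node1", "node2"}))  # node2
```

If the writing node is down, the request still lands on a live node, just without the cache benefit on the first hit.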

So that is my 2 cents...

ADesimone

Frizz · Post by **Frizz** » Wed Dec 19, 2007 9:48 am

adesimone wrote:I am also glad to see the direction that cacti is taking.

Here is a whitepaper I started to write describing our current cacti implementation (so it is not entirely complete). Our goal was to have a semi-symmetrical, linearly-scaled architecture that would provide data and polling resiliency. This whitepaper uses the existing cacti 0.8.6 architecture - nowhere near as complex (or advanced) as this proposed architecture.
....
....

So that is my 2 cents...

ADesimone

Hi Aurelio,
Great article, and a good first sign for this Cacti architecture thread.
I'm impressed by the idea of using DRBD and SLB as a redundancy and load-sharing model. I had never thought that it would be applicable to Cacti.
But where can we find the "cluster-modified" Cacti code?

Regarding poller performance you should read this article
from David Plonka notified here by fmangeant http://forums.cacti.net/viewtopic.php?t ... ght=plonka and
http://www.usenix.org/event/lisa07/tech/plonka.html

We do not face the challenge of 1-minute polling or a recovery service agreement of 1 business day. But we do have the challenge of handling up to 10,000 devices with more than 150,000 data sources (currently 2,000 devices/40,000 data sources) within 5 minutes, distributed globally with latencies of up to 300-400 ms. So we need a distributed solution with poller engines on each continent and a centralized database for minimal maintenance overhead.
We work with two "semi-symmetrical" boxes (one database/polling box, one httpd/administration/redundancy box).

For the moment I don't have enough time to discuss these topics in depth, but I will follow with interest.
We had a similar discussion at our 2.CCC.eu conference and hopefully all of these ideas will bring up the best solution for the future of Cacti.
Thanks for your contribution. Excellent.

Frizz and BownieBraun from Cologne

adesimone · Post by **adesimone** » Thu Dec 20, 2007 12:47 am

I posted the 'cacti cluster' files under the 'distributed cacti' topic - http://forums.cacti.net/viewtopic.php?p=120908

Thanks for the links.. I will try the new rrdtool and see how it affects performance.

ADesimone

MagicOneXXX · Post by **MagicOneXXX** » Thu Dec 20, 2007 11:16 am

I like the idea of de-centralizing for larger environments, but I'd also like to ensure that it doesn't get too complicated for the smaller ones.

I completely agree. I think it would be a good idea to break this down into multiple levels of redundancy/load balancing. For a medium-size business like the one where I work, we only really need one central control server (MySQL, RRDtool, cacti web) with remote polling nodes. Obviously much larger installations need what is outlined in the RFC document. I'd say keep the RFC as the ultimate vision for redundancy, but keep in mind that a lot of users can only implement a smaller system.

pepj · Post by **pepj** » Fri Dec 21, 2007 4:58 pm

Great idea to improve cacti. The polling of the devices (and plugins) is really the bottleneck. And the trend is to collect statistics at intervals shorter than 5 minutes ... which again means more polling.

Web services
I agree with the others: the companies that will need this design already have load balancers.
But the question is whether we need some L7 status (health-check state) from cacti in order to load-balance better than a simple round-robin, session, or port check.

My feedback is more about what plugins need:
1/ First a question: do you want to implement this with MySQL database replication? If yes, the plugin community will need some SQL interface functions that know where the active SQL server is (or which server has the data, if there is no replication).
2/ poller_bottom and other hooks: these could run anywhere, BUT for the future, and because some plugins need a lot of resources, I would prefer to have the possibility to select a poller group.
2b/ How does the group decide to start the "poller_bottom" hook (for example) in case of failure? When master/slave work like HSRP, it could be that they don't see each other and both start the poller ... but that is perhaps not really a problem, because the goal of this design is performance.
3/ Plugins will need some functions to retrieve, add, ... the RRD files from the active/reachable RRD server of the group. In fact, I think these are the same functions you will need to create the graphs in cacti (the load balancer cannot know where the RRD file is).
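To make point 1/ a little more concrete, here is a hedged sketch of what such a helper could look like from a plugin's point of view. The function names and host names are invented for the example. The reachability check is passed in as a parameter so that a real probe (a plain TCP connect to MySQL's default port 3306 is shown) can be swapped for whatever health check the final design provides.

```python
import socket
from typing import Callable, Iterable, Optional


def tcp_probe(host: str, port: int = 3306, timeout: float = 1.0) -> bool:
    """Crude liveness check: can we open a TCP connection to the MySQL port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def find_active_server(
    candidates: Iterable[str],
    is_reachable: Callable[[str], bool] = tcp_probe,
) -> Optional[str]:
    """Return the first candidate that answers, or None if none do."""
    for host in candidates:
        if is_reachable(host):
            return host
    return None


# Demonstration with a fake probe, so no network access is needed.
up = {"db2.example.net"}
print(find_active_server(["db1.example.net", "db2.example.net"], lambda h: h in up))
```

A TCP connect only shows that something is listening. It says nothing about replication state, so a real implementation would need a stronger check.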

PS: I will give some comments later ...

rcaston · Post by **rcaston** » Thu Dec 27, 2007 11:59 pm

I'm going to add more of the same here; the main reason I want multiple pollers is so the platform can scale sideways. The current monolithic nature of the architecture forces us to resolve growth issues by placing cacti on a more powerful server. It would be easier to manage, and cheaper - if we could just throw another cheap server into the mix for polling and data collection as we need them.

Anyways, on to architecture; I would agree that we should look at a modular design that makes sense without overcomplicating the product.

Originally I had been a fan of the idea of breaking the project's roles out to separate servers; i.e. mysql servers, spine servers, and web servers. However, this makes the product more of a one-off from the main cacti branch, which would make it harder for the developers to support.

The whitepaper suggests a better approach, in which each cacti server is the base build of cacti plus some tweaks, which sounds good. I also like the idea of using GFS; I have not run this level of load on GFS, so I don't know how well it would work, but it does give me a pet project to try next week.

shull · Post by **shull** » Wed Jan 02, 2008 6:30 pm

Just a quick thought:

What about using XML-RPC or SOAP for communications and possibly RRD file sync? I would think this could become a universal interface to all pieces of the "new" Cacti. It would allow for more modularity, and it seems you could eliminate many of the sync issues (such as syncing RRD files across multiple servers, grabbing data from multiple servers, etc.).

Allow the pollers to communicate to the DB server or RRD files via SOAP or XML-RPC. Have plugins use XML-RPC to access any data from Cacti.
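As a small illustration of the XML-RPC idea, the sketch below uses Python's standard `xmlrpc` modules to expose one call that a remote poller (or a plugin) could use to fetch its work. The method name `get_poller_items`, the in-memory table, and the sample data are assumptions made up for the example. Nothing here is an existing Cacti interface.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical in-memory stand-in for the central poller_item table.
POLLER_ITEMS = {
    "poller-a": [{"host": "10.0.0.1", "snmp_oid": ".1.3.6.1.2.1.1.3.0"}],
    "poller-b": [],
}


def get_poller_items(poller_id):
    """Return the items assigned to one poller (empty list if unknown)."""
    return POLLER_ITEMS.get(poller_id, [])


# Central side: serve the function on a free local port.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(get_poller_items)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Poller side: fetch work over XML-RPC instead of reading the DB directly.
port = server.server_address[1]
with ServerProxy(f"http://127.0.0.1:{port}/") as proxy:
    items = proxy.get_poller_items("poller-a")
print(items)

server.shutdown()
server.server_close()
```

The same pattern would apply to reading RRD data through a remote call, at the cost of an extra network round trip per request.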

I hope this made sense.

I will add more later as I think about it. Great topic for discussion!

-Stephen

Morgan · Post by **Morgan** » Fri Jan 04, 2008 7:14 am

For scalability, have you considered/looked at sending the jobs/polls to pollers using some form of algorithmic hashing? This could happen after an "are you there" check to see which pollers are available.

One step up from that: the check includes a load field, where the poller returns a value that indicates how many "jobs" it can take.
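One possible reading of this suggestion, sketched with rendezvous (highest-random-weight) hashing: that technique is chosen here purely for illustration, since the post does not name a specific algorithm. The `capacity` dictionary stands for the answers to the "are you there" check, mapping each poller that responded to the number of jobs it said it can take. All names are hypothetical.

```python
import hashlib
from typing import Dict, List


def score(job: str, poller: str) -> int:
    """Stable pseudo-random weight for a (job, poller) pair."""
    digest = hashlib.sha256(f"{job}|{poller}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def assign(jobs: List[str], capacity: Dict[str, int]) -> Dict[str, str]:
    """Place each job on the highest-scoring poller that still has room."""
    remaining = dict(capacity)  # do not mutate the caller's dict
    placement: Dict[str, str] = {}
    for job in jobs:
        candidates = [p for p, free in remaining.items() if free > 0]
        if not candidates:
            raise RuntimeError("all responding pollers are full")
        chosen = max(candidates, key=lambda p: score(job, p))
        placement[job] = chosen
        remaining[chosen] -= 1
    return placement


# Only pollers that answered the check appear here, with their load field.
print(assign(["job1", "job2", "job3"], {"poller-a": 2, "poller-b": 1}))
```

With hashing of this kind, a job tends to stay on the same poller between runs as long as that poller keeps responding and has capacity, which limits reshuffling when the set of pollers changes.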

Just a thought; otherwise, lgtm.

wjm · Post by **wjm** » Fri Jan 04, 2008 9:13 am

Multiple pollers would be the best fit for me.
I can keep the existing high-availability hardware for the 'have to monitor' devices
and use lower-cost hardware for the 'nice to know' things.

In practice (for me) the user interface is the least used part.
There are over 70 people who can log in, but there have never been more than 3 or 4 concurrent users.

If there is a priority, I would suggest expanding collection capacity first.
I would keep the web interface on the 'main' server and free up resources with external pollers.
With one-minute polling and longer resolution in the RRAs, hardware limits start to become a problem.

For now, moving to an external mysql server was a big help for me.
Database space requirements are small.
If you have an old server and a bunch of 9Gig SCSI drives lying around, I would recommend it for short-term breathing room.

argon0 · Post by **argon0** » Mon Jan 14, 2008 7:34 am

Good topic, I will keep an eye on this out of interest, even though I don't think I'll have a lot to add.

My main reason for wanting "scalability" is so I can put a Poller on a client site, and have it feed back to our central office where alerts could be raised, and trends analysed. Rather than having to have one Cacti server polling devices over the internet/dedicated links to our clients.

At the moment I am putting a new Cacti install into a major client (major for us: only about 20 servers/40 monitored nodes), and would love to put a cacti box into each of our clients (about 70) and monitor it all from a central location...

As for front end loadbalancing I am a great fan of Hardware loadbalancing - but would steer clear of F5/Alteon - have a look at loadbalancer.org for a more cost-friendly solution...

Argon0

Post by **TheWitness** » Tue Jan 15, 2008 11:09 pm

Very good comments thus far. I will have more feedback in the next month or so.

Regards,

Larry

Howie · Post by **Howie** » Wed Jan 16, 2008 3:45 am

argon0 wrote:My main reason for wanting "scalability" is so I can put a Poller on a client site, and have it feed back to our central office where alerts could be raised, and trends analysed. Rather than having to have one Cacti server polling devices over the internet/dedicated links to our clients.

Which brings up another variable - in some situations, where it's actual load on a central collector that's an issue, it would be OK to have Cacti level the load between collectors in the event of a failure, but in others (like this one perhaps), there may only be one collector that has the correct network access. For example, it may be behind a firewall at the end of a VPN on the client's private address space. Worse, the same addresses may also be valid in some other client's address space, so some kind of poller affinity needs to be available for Devices: which pollers can see this device?
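A minimal sketch of such poller affinity, under the assumption that each device is keyed by (client, address) so that overlapping private addresses stay distinct, and that each device lists the pollers able to reach it in order of preference. The data structure, function name, and sample values are invented for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

DeviceKey = Tuple[str, str]  # (client, address): addresses may repeat across clients

# Hypothetical affinity table: which pollers can see each device.
AFFINITY: Dict[DeviceKey, List[str]] = {
    ("client-a", "192.168.1.10"): ["poller-client-a"],
    ("client-b", "192.168.1.10"): ["poller-client-b"],
    ("office", "10.1.1.5"): ["poller-hq-1", "poller-hq-2"],
}


def choose_poller(device: DeviceKey, alive: Set[str]) -> Optional[str]:
    """Fail over only among pollers that can actually reach the device."""
    for poller in AFFINITY.get(device, []):
        if poller in alive:
            return poller
    return None  # unreachable: better to report a gap than poll the wrong host


alive = {"poller-client-b", "poller-hq-2"}
print(choose_poller(("office", "10.1.1.5"), alive))        # poller-hq-2
print(choose_poller(("client-b", "192.168.1.10"), alive))  # poller-client-b
print(choose_poller(("client-a", "192.168.1.10"), alive))  # None
```

Load levelling after a failure then happens only inside each device's affinity list, so a device behind a client VPN is never handed to a collector that would reach a different host at the same address.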
