New Cacti Architecture (0.8.8) - RFC Response Location

TheWitness
Developer
Posts: 17007
Joined: Tue May 14, 2002 5:08 pm
Location: MI, USA
Contact:

New Cacti Architecture (0.8.8) - RFC Response Location

Post by TheWitness »

All,

Please submit your RFC comments here. Thanks for your participation. I will attach newer versions of the RFC and provide feedback in this post.

Regards,

TheWitness
Attachments
Cacti Multiple Poller Design v1.0.pdf
(338.71 KiB) Downloaded 6889 times
Last edited by TheWitness on Mon Feb 09, 2009 10:34 pm, edited 2 times in total.
True understanding begins only when we realize how little we truly understand...

Life is an adventure, let yours begin with Cacti!

Author of dozens of Cacti plugins and customizations. Advocate of LAMP, MariaDB, IBM Spectrum LSF and the world of batch. Creator of IBM Spectrum RTM, author of quite a bit of unpublished work and most of Cacti's bugs.
_________________
Official Cacti Documentation
GitHub Repository with Supported Plugins
Percona Device Packages (no support)
Interesting Device Packages


For those wondering, I'm still here, but lost in the shadows. Yearning for fewer bugs. Who wants a Cacti 1.3/2.0? Streams anyone?
gandalf
Developer
Posts: 22383
Joined: Thu Dec 02, 2004 2:46 am
Location: Muenster, Germany
Contact:

Post by gandalf »

Hi Larry,

thank you for opening this long-overdue discussion. Besides "automation" (yes, I know ...), this was the most deeply discussed topic at 2.CCC.eu.

But personally, I have a problem with this picture. It shows very nicely where everything happens from the data point of view. (BTW: I suppose it would be better to use either poller_item _or_ poller_cache in both the picture and the text. Personally, I prefer poller_item, as that is the name of the real table.)

But I am not sure I fully understand where the application logic lies. In other words: what about the "workflow"?

I suppose that each Poller Group is governed by a local crontab (or a "real" daemon) that fetches data from the db server (either directly or via table replication). Output is stored in a local poller_output table and replicated to the db server? What would be the criteria for associating a host/data source with a specific Poller Group? I suppose that clock synchronization (ntp?) would be required, as the local time used for poller_output would be used for the rrdtool update. What about timezones? Where should hooks like "poller_bottom" be executed?

And another part of the logic would lie with the http servers, for sure (the console, aka administration). But where does the rrdtool update logic lie? With the database server? Or with the RRDtool servers? Would RRDfile Update Groups equal Poller Groups? If not, what would be the criteria this time?

The rrdfile data pipeline would surely be used for graphing. But a lot of plugins currently require access to the rrd files as well. So there would be more than graphing only. When running from load-balanced http instances, either graph caching will fail or the cache must reside on the RRDfile Servers. As there are two different pipes to the RRDfile Servers, perhaps synchronization between updates and graphing (rrdtool fetch) is necessary.

Those are my first thoughts. Surely more will follow.

Reinhard
TheWitness
Developer
Posts: 17007
Joined: Tue May 14, 2002 5:08 pm
Location: MI, USA
Contact:

Post by TheWitness »

gandalf wrote:thank you for opening this long-overdue discussion. Besides "automation" (yes, I know ...), this was the most deeply discussed topic at 2.CCC.eu.
Yes, long overdue.
gandalf wrote:But personally, I have a problem with this picture.
RFCs often start that way.
gandalf wrote:It shows very nicely where everything happens from the data point of view. (BTW: I suppose it would be better to use either poller_item _or_ poller_cache in both the picture and the text. Personally, I prefer poller_item, as that is the name of the real table.)
I literally "slapped" this together. I will correct that in v2.
gandalf wrote:But I am not sure I fully understand where the application logic lies. In other words: what about the "workflow"?
Yes, after sending it out, I realized I left that out. Basically, the poller will, by default, use the main server's poller_item table for its list of poller items. If for some reason the main server is not reachable, it will use its local copy and store the poller_output data locally.

The same is intended for the poller_output table. Central server first. The remote pollers will be provided instructions to "update/synchronize" their local poller_item table periodically (i.e., when things change). Those synchronizations would not happen any more often than every 5 minutes.

If the central server is not available, then each poller will cache the updates in its local poller_output table until the connection is available again, then dump them sequentially, by date, to the central server. By doing so, no data will be lost.
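
To make that concrete, here is a rough sketch in Python (purely illustrative, not the implementation: the two DB-API connections, the poll_items() callback and the poller_output columns are assumptions, not the final design):

Code:

    # Illustrative only: "central" is the main Cacti database, "local" is the
    # poller's own copy; poll_items() returns rows of
    # (local_data_id, rrd_name, time, output).
    INSERT_OUTPUT = ("INSERT INTO poller_output "
                     "(local_data_id, rrd_name, time, output) "
                     "VALUES (%s, %s, %s, %s)")

    def run_poll_cycle(central, local, poll_items):
        """Poll from the central poller_item list when reachable; otherwise
        fall back to the local copy and buffer output locally."""
        try:
            cur = central.cursor()
            cur.execute("SELECT * FROM poller_item")
            items, out_conn = cur.fetchall(), central
            flush_buffered_output(local, central)      # drain any backlog first
        except Exception:                              # central server unreachable
            cur = local.cursor()
            cur.execute("SELECT * FROM poller_item")
            items, out_conn = cur.fetchall(), local    # buffer until it returns
        rows = poll_items(items)                       # the actual SNMP/script work
        out_cur = out_conn.cursor()
        out_cur.executemany(INSERT_OUTPUT, rows)
        out_conn.commit()

    def flush_buffered_output(local, central):
        """Push locally buffered output to the central server, oldest first."""
        cur = local.cursor()
        cur.execute("SELECT local_data_id, rrd_name, time, output "
                    "FROM poller_output ORDER BY time ASC")
        rows = cur.fetchall()
        if rows:
            central.cursor().executemany(INSERT_OUTPUT, rows)
            central.commit()
            cur.execute("DELETE FROM poller_output")
            local.commit()
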
gandalf wrote:What would be the criteria for associating a host/data source with a specific Poller Group?
There would have to be modifications to the poller_output table, or another table, to keep track of when it is time to poll things. That is still TBD until I have more feedback.
gandalf wrote:I suppose that clock synchronization (ntp?) would be required, as the local time used for poller_output would be used for the rrdtool update.
Of course...
gandalf wrote:What about timezones?
You tell me... :)
gandalf wrote:Where should hooks like "poller_bottom" be executed?
I need feedback from people like Howie and yourself to determine "where" that should be. So, see my comment above. It's an RFC you know ;)
gandalf wrote:And another part of the logic would lie with the http servers, for sure (the console, aka administration). But where does the rrdtool update logic lie? With the database server? Or with the RRDtool servers?
The RRDfile Services will be asynchronous and will run as daemons. They will process items from the poller_output table as they come in and handle other requests in other threads. The main poller_output table, with some minor modifications, will be used to drive the RRD updates.
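
Roughly, I picture each RRDfile Service as a loop like the following sketch (illustrative only; the apply_updates() helper, the batch size and the column names are placeholders, not the final design):

Code:

    # Placeholder daemon loop: drain poller_output in batches and hand the rows
    # to apply_updates() (e.g. bulk rrdtool updates), then delete what was
    # processed.  Assumes a DB-API connection; not the final schema.
    import threading
    import time

    def start_rrd_service(conn, apply_updates, batch_size=5000, idle_sleep=5):
        def worker():
            cur = conn.cursor()
            while True:
                cur.execute("SELECT local_data_id, rrd_name, time, output "
                            "FROM poller_output LIMIT %s", (batch_size,))
                rows = cur.fetchall()
                if not rows:
                    time.sleep(idle_sleep)          # nothing queued; wait a bit
                    continue
                apply_updates(rows)                 # write the RRD files
                cur.executemany("DELETE FROM poller_output WHERE local_data_id=%s "
                                "AND rrd_name=%s AND time=%s",
                                [(r[0], r[1], r[2]) for r in rows])
                conn.commit()
        threading.Thread(target=worker, daemon=True).start()
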
gandalf wrote:Would RRDfile Update Groups equal Poller Groups? If not, what would be the criteria this time?
No.
gandalf wrote:The rrdfile data pipeline would surely be used for graphing. But a lot of plugins currently require access to the rrd files as well.
I need feedback from plugin developers as to "how" they would like this to work.
gandalf wrote:So there would be more than graphing only. When running from load-balanced http instances, either graph caching will fail or the cache must reside on the RRDfile Servers.
I expect the RRDtool Services to handle this. Each graph will know, in advance, which server it needs to talk to.

Regards,

TheWitness
gandalf
Developer
Posts: 22383
Joined: Thu Dec 02, 2004 2:46 am
Location: Muenster, Germany
Contact:

Post by gandalf »

TheWitness wrote:
gandalf wrote:It shows very nicely where everything happens from the data point of view. (BTW: I suppose it would be better to use either poller_item _or_ poller_cache in both the picture and the text. Personally, I prefer poller_item, as that is the name of the real table.)
I literally "slapped" this together. I will correct that in v2.
I suppose separating data logic and workflow is the better way. Otherwise I fear the picture will become too crowded.

From the current design, the database server seems to define the limit of this architecture. While the Poller and RRDfile Servers are scalable, as are the http servers, the database server exists only once. As I understand it, the second one is for failover only.
So, if there's a central poller_item table as well as a central poller_output table, their update/delete performance will be crucial. I suppose you're thinking of memory tables, as boost uses them. And then, like boost does with rrdtool bulk updates, there's the SQL bulk insert that will squeeze out some more performance, correct?

Reinhard
TheWitness
Developer
Posts: 17007
Joined: Tue May 14, 2002 5:08 pm
Location: MI, USA
Contact:

Post by TheWitness »

Yes, memory tables have an I/O rate in excess of 40k updates per second, so even though they use a table-locking mechanism, we are safe. I was also thinking that making this a separate database altogether would help other subsystems' performance and simplify backup.
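
For illustration, the boost-style bulk insert Reinhard mentions could look something like this sketch (table and column names are assumptions; the MEMORY engine is a property of the table itself, so it does not appear in the code):

Code:

    # Sketch of a boost-style bulk insert: one multi-row INSERT per chunk
    # instead of one INSERT per polled value.  Assumes a DB-API connection and
    # the poller_output columns shown; not the final schema.
    def bulk_insert_poller_output(conn, rows, chunk_size=1000):
        prefix = ("INSERT INTO poller_output "
                  "(local_data_id, rrd_name, time, output) VALUES ")
        cur = conn.cursor()
        for start in range(0, len(rows), chunk_size):
            chunk = rows[start:start + chunk_size]
            values = ", ".join(["(%s, %s, %s, %s)"] * len(chunk))
            cur.execute(prefix + values, [v for row in chunk for v in row])
        conn.commit()
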

What do you think about that?

TheWitness
melchandra
Cacti User
Posts: 311
Joined: Tue Jun 29, 2004 12:52 pm
Location: Indiana

Post by melchandra »

Forgive me if this seems to be a silly question.

Is there some reason why the pollers don't directly update the RRD Files? Does it have to go through the database?

Is there a way for the Pollers to be sent both the host and OID information to query, as well as the host information for the Remote RRD Update Service, so they could query and then pass the data directly to the RRD storage devices?
Dave
Howie
Cacti Guru User
Posts: 5508
Joined: Thu Sep 16, 2004 5:53 am
Location: United Kingdom
Contact:

Post by Howie »

gandalf wrote:The rrdfile data pipeline would surely be used for graphing. But a lot of plugins currently require access to the rrd files as well. So there would be more than graphing only. When running from load-balanced http instances, either graph caching will fail or the cache must reside on the RRDfile Servers. As there are two different pipes to the RRDfile Servers, perhaps synchronization between updates and graphing (rrdtool fetch) is necessary.
Once updates have been made scalable, is there a requirement for actual load-balancing of HTTP frontends? I can see why you might want HA, but does anyone really have so many concurrent users that a single server can't cope? All I ever see are queries about the polling limitations...

With an HA setup instead, cache-sharing isn't really necessary. Really, if you have 1000 users all hitting the same graph then you are still reducing the number of graph-drawing operations from 1000 to 2, which is probably enough. If they aren't all hitting the same graphs, then graph caching isn't going to help anyway.

(I don't hit any of these limitations with my own modest needs, so I'm just curious really. Despite what it says under my name on the left, I'm just a lowly user that talks a lot :-) )
Weathermap 0.98a is out! & QuickTree 1.0. Superlinks is over there now (and built-in to Cacti 1.x).
Some Other Cacti tweaks, including strip-graphs, icons and snmp/netflow stuff.
(Let me know if you have UK DevOps or Network Ops opportunities, too!)
TheWitness
Developer
Posts: 17007
Joined: Tue May 14, 2002 5:08 pm
Location: MI, USA
Contact:

Post by TheWitness »

Howie wrote:With an HA setup instead, cache-sharing isn't really necessary. Really, if you have 1000 users all hitting the same graph then you are still reducing the number of graph-drawing operations from 1000 to 2, which is probably enough. If they aren't all hitting the same graphs, then graph caching isn't going to help anyway.
This sort of answers the "should we invent it ourselves" question on load balancing. HA for sure, but there are already technologies for that. So, I suspect your answer would be to leave that out of the design. The capability will be there for HA, but it's always optional.

TheWitness
TheWitness
Developer
Posts: 17007
Joined: Tue May 14, 2002 5:08 pm
Location: MI, USA
Contact:

Post by TheWitness »

melchandra wrote:Forgive me if this seems to be a silly question.

Is there some reason why the pollers don't directly update the RRD Files? Does it have to go through the database?

Is there a way for the Pollers to be sent both the host and OID information to query, as well as the host information for the Remote RRD Update Service, so they could query and then pass the data directly to the RRD storage devices?
Yes, absolutely. When you have lots of them, the disk I/O required is astounding. By batching them, you can reduce I/O wait by 80-90% over time. The database provides that batching for us.
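
To illustrate the batching (a sketch only, with an assumed row format): rrdtool update accepts many timestamp:value arguments in a single call, so the queued rows can be grouped per file and flushed with one invocation each instead of one per value.

Code:

    # Illustration of batched updates: group queued samples by RRD file and feed
    # each file a single "rrdtool update" call carrying many timestamp:value
    # pairs.  The (rrd_path, timestamp, value) row format is an assumption.
    import subprocess
    from collections import defaultdict

    def flush_batched_updates(rows):
        by_file = defaultdict(list)
        for rrd_path, timestamp, value in rows:
            by_file[rrd_path].append(f"{timestamp}:{value}")
        for rrd_path, samples in by_file.items():
            # one process spawn and one file open per batch, not per sample
            subprocess.run(["rrdtool", "update", rrd_path, *samples], check=True)
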

TheWitness
ben_c
Cacti User
Posts: 203
Joined: Mon May 14, 2007 8:12 pm
Location: Melbourne, Australia.

Post by ben_c »

Great document, guys. I need a little bit more time to analyze it.

But it is the direction Cacti needs to start heading. I know the current limitations all too well from using it in a large enterprise (50,000+ data sources).
Howie
Cacti Guru User
Posts: 5508
Joined: Thu Sep 16, 2004 5:53 am
Location: United Kingdom
Contact:

Post by Howie »

TheWitness wrote:This sort of answers the "should we invent it ourselves" question on load balancing. HA for sure, but there are already technologies for that. So, I suspect your answer would be to leave that out of the design. The capability will be there for HA, but it's always optional.
Indeed. I'd say stick to the architectural stuff required to support it (or at least not break it :-) ). HA solutions are usually either platform-specific (CARP, MS NLB, ultramonkey) or external (CSS, Alteon etc) anyway.
ben_c
Cacti User
Posts: 203
Joined: Mon May 14, 2007 8:12 pm
Location: Melbourne, Australia.

Post by ben_c »

TheWitness wrote:
Howie wrote:With an HA setup instead, cache-sharing isn't really necessary. Really, if you have 1000 users all hitting the same graph then you are still reducing the number of graph-drawing operations from 1000 to 2, which is probably enough. If they aren't all hitting the same graphs, then graph caching isn't going to help anyway.
This sort of answers the "should we invent it ourselves" question on load balancing. HA for sure, but there are already technologies for that. So, I suspect your answer would be to leave that out of the design. The capability will be there for HA, but it's always optional.

TheWitness
I tend to agree; built-in application-level load balancing is never an ideal situation in my opinion. If you really need it, buy some hardware to do the job properly (F5, Alteon, etc.).
Howie
Cacti Guru User
Posts: 5508
Joined: Thu Sep 16, 2004 5:53 am
Location: United Kingdom
Contact:

Post by Howie »

TheWitness wrote:
gandalf wrote:Where should hooks like "poller_bottom" be executed?
I need feedback from people like Howie and yourself to determine "where" that should be. So, see my comment above. It's an RFC you know ;)
If I understand the layout correctly, it would have to be on the 'master' poller, as the remote pollers report back in to that one. Of the 'big' plugins I can think of that actually use Cacti data * (reportit, thold, weathermap), thold and wmap can both already use poller_output instead of looking at rrd files directly. I don't really see any way around it for reportit, though... if the rrd files were physically distributed onto different machines, then there would need to be some sort of aggregate-view or 'run this on all the rrd servers' API (a rough sketch of that idea follows below).

Actually, thold would probably work either on the local pollers or at the central location, since it works with one DS at a time, but it would be easier to maintain in the centre.

* I don't use Manage, MACTrack or Discovery, but as far as I know they don't deal with rrd data, do they?
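
To make that aggregate-view idea a little more concrete, here is a purely hypothetical sketch (the server list and the use of ssh are assumptions, not a proposed API):

Code:

    # Hypothetical "run this on all the rrd servers" helper, for illustration
    # only: fan an rrdtool fetch out to every RRDfile server and return the raw
    # results keyed by server, so a plugin like reportit could aggregate them.
    import subprocess

    def fetch_from_all(rrd_servers, rrd_path, cf="AVERAGE", start="-1h"):
        results = {}
        for host in rrd_servers:
            out = subprocess.run(
                ["ssh", host, "rrdtool", "fetch", rrd_path, cf, "--start", start],
                capture_output=True, text=True, check=True)
            results[host] = out.stdout     # raw fetch output, keyed by server
        return results
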
Alice
Cacti User
Posts: 111
Joined: Tue Oct 28, 2003 4:54 pm
Location: Bucharest, RO.

Post by Alice »

It's a little late, even for me, so please excuse all the aberrations I'm about to write :)

WHY do we want multiple pollers?

a) Backup - have all the data available, even if one of the servers is not available, or can't request data from all the devices for some reason (network outage)
b) Speed - one server can't handle all the required devices
c) Combined (a+b)

IMHO a different approach is needed for every one of the three situations listed.
I THINK that speed can be solved "locally" (4 separate servers: web, poller, DB and RRD updater).

Facts: 10676 DS, 9742 graphs, 10669 RRD files totalling 1.3GB
Polling time: ~15 seconds
Two servers, one DB, one for the rest

Backup, on the other hand, is something else.

What if we run two somehow completely "independent" Cacti instances, both querying the same hosts, and somehow synchronize them?
Something like replication for MySQL (I really don't know how this works) and rsync for the RRD files?

Hmm, nice shit :D
Error in posting

DEBUG MODE

SQL Error : 1064 You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'WHERE forum_id = 7' at line 3

UPDATE forums SET forum_posts = forum_posts + 1, forum_last_post_id = WHERE forum_id = 7

Line : 423
File : functions_post.php
http://www.x-graphs.com - X-graphs :: All kinds of graphs
TheWitness
Developer
Posts: 17007
Joined: Tue May 14, 2002 5:08 pm
Location: MI, USA
Contact:

Post by TheWitness »

Yeah, you get that when you sit on a post too long. Back button, copy, back button, repost, paste, post.

TheWitness