How much is too much?
First off, I'd like to say great product. It's about time someone developed worthy competition for MRTG.
Anyhow, we just installed the program and everything looked fine at first. We added two routers, each with approx. 400 interfaces that we need to monitor. The graphing and monitoring work, but cmd.php seems to take so long to process that it either times out or dies completely. The processes stay idle, and new processes get added every 5 minutes (due to the cron job). After about 30 minutes the server is almost completely unreachable and the graphing breaks.
My question is this... How many devices or interfaces is this program designed for? What is the max that somebody has been monitoring thus far? Any thoughts on a workaround?
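One stopgap for the pile-up described above is to make each cron run skip itself when the previous run is still going. This is only a sketch under assumptions: it is Python rather than anything cacti ships, it uses the Unix-only fcntl module, and the lock path and the run_once wrapper are hypothetical names. It does not make polling faster, it only stops idle pollers from stacking up.

```python
import fcntl
import os
import tempfile

# Hypothetical lock path; any file writable by the cron user would do.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "poller.lock")


def try_acquire(lock_path=LOCK_PATH):
    """Return an open, locked file if no other run holds the lock, else None."""
    handle = open(lock_path, "w")
    try:
        # Non-blocking exclusive lock: fails at once if another run holds it.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def run_once(poll):
    """Call poll() only if no earlier run is still active; report whether it ran."""
    lock = try_acquire()
    if lock is None:
        return False  # previous cycle still running, skip this one
    try:
        poll()
        return True
    finally:
        fcntl.flock(lock, fcntl.LOCK_UN)
        lock.close()
```

A skipped cycle leaves a gap in the graphs, so this trades missing samples for a server that stays reachable.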
I knew someone would ask this question eventually. Cacti's current method of gathering data using the PHP binary is by no means optimal. I think multithreading is the key here.
I have a friend who has written a similar web application for network monitoring. Once he hit the 300 second overlap point, he integrated multithreading within the PHP script. This helped some, but not enough. He then rewrote the code in C using pthreads and this made a DRAMATIC difference. I mean like from 300 seconds down to ~60.
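To illustrate why threading helps when most of the time is spent waiting on the network, here is a rough sketch. It is not the friend's code and not cacti's poller: it is Python, and fake_query is a made-up stand-in that just sleeps instead of sending a real SNMP request. The names, the worker count and the delay are all assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_query(target, delay=0.02):
    """Stand-in for one SNMP request; the sleep simulates network latency."""
    time.sleep(delay)
    return target, 42  # dummy counter value


def poll_serial(targets):
    """Query targets one after another, as a single-threaded poller would."""
    return dict(fake_query(t) for t in targets)


def poll_threaded(targets, workers=20):
    """Query targets concurrently so that the waits overlap."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(fake_query, targets))


def timed(func, targets):
    """Return (result, elapsed seconds) for one polling pass."""
    start = time.perf_counter()
    result = func(targets)
    return result, time.perf_counter() - start
```

With waits that overlap, the total time is roughly the slowest query per batch rather than the sum of all queries. The CPU cost of parsing and updating RRDs is not reduced by this.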
Following these same steps is probably the best way to improve cacti's data gathering capability. I am not the most proficient at C, but I probably need the practice anyway.
For reference, at my largest installation I have 135 data sources. So far everything is kept nicely under 300 seconds. However, I'm sure this would overflow if something went down and caused mass timeouts, which in my mind is when the graphs are the most useful.
I would like to work this issue out sometime, so if anyone has any other solutions or experience, please post them here.
-Ian
That's a problem shared by most of the data collectors in the RRD/MRTG world. It was much worse with MRTG because RATEUP would also recreate the graphs every 5 minutes.
I know that there has been a great deal of work on multithreaded data collection over at the discussion boards for NMIS (another RRD front-end). Someone had patched their Perl code to be able to multithread, but the drawback has been big CPU spikes on the collecting machine.
The other workaround/solution has been for remote collector daemons doing some of the processing and updating the main RRDTOOL box. That might work for servers but not routers.
I know the FPING utility will ping many hosts simultaneously. Perhaps scripting FPING for CACTI might reduce the collection overhead a bit - maybe piping the results to a file and updating individual RRDs via a PERL/PHP parsing routine could help.
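The parsing half of that idea could look roughly like the sketch below. It is Python rather than Perl/PHP, it does not run fping itself, and it assumes the plain per-host lines of the form "host is alive" / "host is unreachable". The sample text and the function name are made up for illustration.

```python
# Made-up sample of captured output, one line per host.
SAMPLE_OUTPUT = """\
10.0.0.1 is alive
10.0.0.2 is unreachable
10.0.0.3 is alive
"""


def parse_reachability(text):
    """Map each host to True (alive) or False (anything else)."""
    status = {}
    for line in text.splitlines():
        parts = line.split()
        # Only accept lines shaped like "<host> is <state>".
        if len(parts) >= 3 and parts[1] == "is":
            status[parts[0]] = parts[2] == "alive"
    return status
```

Each entry in the resulting mapping could then be turned into one update of that host's RRD.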
For large installations, it might make sense to have more than one box set up to do collecting only and then copying the .RRD files to the main CACTI box on a timed basis. I know that commercial products like Vital Signs Network require up to 4 boxes to do data collection - and it costs $300K to poll 500 devices.
I've seen client sites get pounded with traffic by overzealous Openview polling.
Some products, like NT4 servers without the proper service packs, will eventually crash when they get hit by constant SNMP requests.
(Of course, it usually takes less than that to crash them.) Another, less desirable option might be to reduce the polling frequency to every ten minutes.
Well, the problem is not how often they are queried. It's how long the queries take to complete. We are just monitoring too much for the script to handle.
The only obvious alternative is what RaX posted: moving cmd.php to a threaded C program that handles the SNMP queries and collects the data. Of course, I have no practical experience with C, so I am not going to be much help to anyone on that discussion. However, I am quite familiar with PHP and shell scripting. I plan to hack apart cmd.php sometime soon and see if there is anything I can do to speed it up.
Anyone with any other ideas, please contribute.
Well, IMHO I don't think that Perl is going to be able to accomplish this task anywhere near as fast as PHP, but it's worth a shot.
If it's ported over to C then we would be in business, as it's got all the threads worked out and is a lot faster than PHP or Perl...
RaX, what's up with your friend who has already accomplished this? Is the code so different that it cannot be modified to suit us?
Hi!
PHP supports SNMP through its SNMP module. The Windows binary of PHP includes SNMP support; on other platforms PHP needs to be recompiled with it.
As far as I can see, cacti invokes the snmpget command and processes its output. That is extremely slow. To get even a single SNMP value, cacti creates a new process, starts snmpget in it, then processes its standard output. This takes ages compared to a built-in SNMP function (starting a program versus calling a function).
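The gap between starting a program and calling a function is easy to measure. The sketch below is Python, not PHP, and it does not touch SNMP at all: the spawned child is just a second Python interpreter printing a number, used as a stand-in for running an external command and parsing its output. All names are hypothetical.

```python
import subprocess
import sys
import time


def in_process_value():
    """Stand-in for a built-in library call that returns a value directly."""
    return 1


def spawned_value():
    """Stand-in for the external-command route: start a process, parse stdout."""
    out = subprocess.run(
        [sys.executable, "-c", "print(1)"],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(out.stdout.strip())


def average_seconds(func, repeat):
    """Average wall-clock time of one call to func over `repeat` calls."""
    start = time.perf_counter()
    for _ in range(repeat):
        func()
    return (time.perf_counter() - start) / repeat
```

Comparing average_seconds(spawned_value, 10) with average_seconds(in_process_value, 10) shows the per-call cost of process creation, which is paid again for every value polled.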
The main problem with MRTG was not the slowness of SNMP but the slowness of file handling (MRTG data files were simple text files). This is why Tobi used RRAs in rrdtool, which is much faster. In addition, MRTG regenerated its graphs every 5 minutes. Tobi solved this by splitting data collection from graph generation, so graphs can now be generated only when requested.
Doing SNMP requests in parallel (multithreading) is required for large datasets, but the main problem today is the slowness of cacti's current SNMP handling!
Tamas (khazy@mit.bme.hu)
This is a problem with all the front ends that I have used. Cricket does a decent job and is written in Perl, but eventually it falls behind too.
I have recently started to use mon (http://kernel.org/software/mon/) and it appears to have a good design that compensates for the polling/checking times. It may be possible to reuse portions of this code (written in Perl) to build a better collector for cacti.
Or better yet:
It would benefit all rrdtool users (no matter which front-end you use) to come together and work on a design for a common collection daemon that would be general purpose enough to work for all (or most) of the existing front-ends. Maybe someone is already doing this? I think that writing this in C would be best (for performance reasons), but it may cause serious portability problems between *nix systems and Windows.
-ab