Possible scalability troubles
Hey all,
I was recently turned on to Cacti by a friend. I really like it, especially Ver 0.8.
Here's my trouble:
I have about 150 graphs being generated on a RH 7.3 box w/ a single PIII 550 and 384MB RAM. This takes just over 10 minutes to complete when run from the command line. It isn't loading up the processor or memory while it runs, but it does push the load average up to about 2-3. Most of the time seems to be spent waiting for the SNMP queries to return from the target hosts (3-5 secs per query). The cron job runs every 5 minutes and just stacks them up. Overnight the load average reached 81.46 (but the box held up) and I had about 50 copies of cmd.php running.
My first thought was to set the cron job to run every 15 minutes rather than every 5 minutes. But now none of the graphs are showing any data. If I run the command manually a few times in a row, they do update.
Any thoughts as to why I am not getting any data with this time interval? Does it have to do with the X-Files Factor?
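For reference, the likely explanation is RRDtool's heartbeat rather than the X-Files Factor: Cacti creates its RRDs with a 300-second step and typically a 600-second heartbeat, so updates arriving 15 minutes apart are all stored as unknown. The X-Files Factor only controls what fraction of unknown primary data points an archive tolerates during consolidation. A quick check (the file path and DS name below are placeholders):
Code:
# inspect the step and per-DS heartbeat of an affected RRD
rrdtool info /path/to/cacti/rra/some_graph.rrd | grep -E 'step|minimal_heartbeat'
# raising the heartbeat above 900s (here for a DS named ds0) would let
# 15-minute updates register instead of being recorded as unknown
rrdtool tune /path/to/cacti/rra/some_graph.rrd --heartbeat ds0:1800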
I have only configured about a third of the hosts on my network. So this is going to get even worse. Any suggestions would be great!
Thanks!
Chester-
Hi... similar problem here...
... only it took a lot more graphs to get there... 388 in all... I just started to configure one server at a time... thereby learning about the capabilities of cacti, snmp etc. ...
Now I am not even at 40% of all the graphs I wanted to have for monitoring. We have about 60 W2K servers, 20 Suns and a lot of routers.
The Cacti server hardware is a Compaq PL 580 with a 500 MHz CPU and 1GB RAM... Cacti is installed on SuSE Linux 8.2. The CPU definitely has a problem, whereas the memory usage slowly climbs to about 450-520 MB and seems to stay at that level. By now I constantly have between 5-9 php processes. The webserver gets slow to respond...
I deleted about 60 of my datasources. The system slowly recovers but stays at a high level; it is a pity to live with such compromises...
Any ideas on how to improve performance?
greetings, Uwe, Switzerland
Some ideas have been discussed here:
http://www.raxnet.net/board/viewtopic.php?t=897
The discussion drifted far from the initial topic and is based on the 0.68a version; however, the default 0.8 distribution's polling engine is not so different.
- bulek
This is definitely a scalability issue that is being addressed. If you grab version 0.8.1 (or a new pre-release), check out the 'cactid/' directory. A threaded C poller already exists and is functional for the most part. It is also *really* fast; I had it polling over 3000 items in under 10 seconds (on the same LAN segment, but still).
It is definitely not production quality yet, but I would certainly be up for some feedback.
-Ian
I'll check out cactid on 0.8.1. I tried to compile the one that came with 0.8, but it failed, and I know so little about compiling that I just left it.
BTW: The delays I get seem to be more of an SNMP get delay than a processing power problem. If I do an snmpget from the command line, it takes 3-5 seconds to return with a value. Multiply that by 150 graphs and you get most of the 10 minutes it takes. So I am wondering if it's just a poll delay problem, rather than processing load.
Anyone have any ideas on speeding up my SNMP replies? I'm scouring my DNS right now to see if it's lagging on reverse lookups or anything.
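One quick way to test the DNS theory is to time the same query by hostname and by raw IP (the hostname, address, and OID below are just examples):
Code:
# sysUpTime via the resolver, then via a raw address
time snmpget -v1 -c public switch1.example.com .1.3.6.1.2.1.1.3.0
time snmpget -v1 -c public 192.168.1.10 .1.3.6.1.2.1.1.3.0
# a large gap points at DNS; note the agent itself can also stall
# doing reverse lookups of the poller's address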
Mark Chester
IT Manager
Web Associates
805.782.4692
Here's a question/suggestion: Can multiple GETs be performed in a single connection to an SNMP device? If so could you write the poller to gather all SNMP GETs per host into a single chunk and grab them all at once? That would eliminate the delays caused by creating a new SNMP session for every element of data.
Mark Chester
IT Manager
Web Associates
805.782.4692
Koyaanisqatsi wrote:Here's a question/suggestion: Can multiple GETs be performed in a single connection to an SNMP device? If so could you write the poller to gather all SNMP GETs per host into a single chunk and grab them all at once? That would eliminate the delays caused by creating a new SNMP session for every element of data.
Based on my knowledge you can do this with an snmpbulkget. I think this has to be supported by the SNMP device that you are polling, though. It is also not supported by PHP's internal SNMP functions at this point.
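For concreteness, a bulk request from the command line might look like this (net-snmp's snmpbulkget; the host, community, and OIDs are examples, and the agent must speak SNMPv2c):
Code:
# fetch 10 rows each of two interface counters in a single request/response
snmpbulkget -v2c -c public -Cn0 -Cr10 switch1.example.com ifInOctets ifOutOctets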
I am hoping that using pthreads with cactid will eliminate all of the issues created by SNMP delay. It is linked directly against [ucd|net]-snmp, so creating a new session uses up minimal resources.
-Ian
Hello again....
I read the postings recommended by bulek, but I have to admit that this setup would be a bit difficult to implement... I was astonished when I read about the environment he works in... many thousands of graphs etc. I think I would not get there even if I found all possible datasources...
Then I tried to compile cactid (0.8.1) this morning with several combinations of ./configure --with-mysql=path, but it exits with an error "mysql..."...
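As Uwe's later post in this thread confirms, the usual cause of this configure failure is missing MySQL (and SSL) development headers; a quick check might be (commands for an RPM-based system like SuSE 8.2):
Code:
# see whether the MySQL client headers are installed at all
rpm -qa | grep -i mysql
# then point configure at the install prefix
./configure --with-mysql=/usr && make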
I do not know if I am comparing the right things here. Just an example from my nagios test installation, which now runs on a normal PC and will be transferred to this dedicated Linux monitoring server: with it I constantly check about 900 sources incl. snmp requests... I see many dozens of nagios processes running at all times without any noticeable load...
My idea was to create links from certain nagios events to the corresponding cacti graphs... What should I do now with my planned nagios/cacti/rrd combination...? I am at a loss right now... Well, well... I will see if there is a solution...
greetings, Uwe, Zurich, Switzerland
Like Ian mentioned, the best way would be to use the multithreaded poller. Since I see some people are not able to compile it, I think it is worth trying a quick and dirty solution based on multiple copies of cmd.php.
Look at the beginning of cmd.php (v. 0.8.1); there is a line:
Code:
$polling_items = db_fetch_assoc("select * from data_input_data_cache");
Here cacti gets all data sources to be polled and runs scripts or snmp queries one by one. If you can divide your data sources between, let's say, four different cmd.php pollers, then you get parallel polling to some extent. The number of such cmd.php pollers would depend on your machine size and the number of your data sources.
Just to give you an example: let's say we have 178 data sources. The machine is slow, so you can estimate that it is able to poll about 50 data sources within 5 minutes. This means we need four pollers. I copy cmd.php into four copies with the line mentioned above changed:
Code:
cmd1.php: $polling_items = db_fetch_assoc("select * from data_input_data_cache limit 0,50");
cmd2.php: $polling_items = db_fetch_assoc("select * from data_input_data_cache limit 50,50");
cmd3.php: $polling_items = db_fetch_assoc("select * from data_input_data_cache limit 100,50");
cmd4.php: $polling_items = db_fetch_assoc("select * from data_input_data_cache limit 150,50");
Ok... now I remove the standard cmd.php from cron and put our four new pollers in instead. This should work very well (however, I have not tried this specific solution yet) - the benefit is parallel snmp polling (you save time) and parallel rrd file updates (which become a performance problem at some point too). The only thing I am not sure about is logging - cacti may have a problem writing to the same log in parallel.
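For illustration, the four cron entries might look like this (the php binary path, cacti location, and cactiuser account are assumptions, not from bulek's post):
Code:
# /etc/crontab - run all four partial pollers every 5 minutes
*/5 * * * * cactiuser php /var/www/cacti/cmd1.php > /dev/null 2>&1
*/5 * * * * cactiuser php /var/www/cacti/cmd2.php > /dev/null 2>&1
*/5 * * * * cactiuser php /var/www/cacti/cmd3.php > /dev/null 2>&1
*/5 * * * * cactiuser php /var/www/cacti/cmd4.php > /dev/null 2>&1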
Now about snmpget. The SNMP standard allows several OIDs in one snmpget query. This, however, is implemented neither in the snmpget tool we are using nor in PHP's internal snmp support. The only way to speed up snmp polling in some cases is to use snmpbulkwalk. I, for example, noticed that I query almost all interfaces of a given device with snmpget one by one. It takes time (especially for WAN links). If I query the whole interface table with snmpbulkwalk at once, it sometimes takes 40 times less time (!). What could be done is to separate snmp polling from cacti: such a separate poller could poll devices via a few snmpbulkwalk queries and put the results into some mysql table. Cacti, on the other side, could just read these values from that table (instead of snmp polling). A great speedup, plus the possibility of distributed pollers on several machines.
The last thing is the hanging cmd.php pollers that Koyaanisqatsi mentioned. The new cacti has the following line at the beginning of cmd.php:
Code:
ini_set("max_execution_time", "0");
This means that the poller will keep running even if a new copy is started by cron (if the first one did not finish within 5 minutes). Sometimes that's good, but sometimes it results in many cmd.php copies saturating memory. You may try changing the value to 300; then the poller will be killed after 5 minutes.
- bulek
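As a rough illustration of the snmpbulkwalk speedup bulek describes (the host and community string are examples, and the agent must support SNMPv2c):
Code:
# one bulk walk pulls the whole interface table in a handful of packets
time snmpbulkwalk -v2c -c public router1.example.com ifTable
# versus one full round-trip per counter with plain gets
time snmpget -v1 -c public router1.example.com ifInOctets.1
time snmpget -v1 -c public router1.example.com ifOutOctets.1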
I played around with nice and cactid yesterday. I have NO PROBLEMS running cactid 0.8.1, at least none that I could see. However, since I didn't know if I was risking the integrity of my RRDs and MySQL db by using cactid, I didn't let it run long enough to see any data on the graphs. It also pushed my load average up to 10, where it sat as long as cactid ran.
Cactid finishes polling in 140 odd seconds, whereas cmd.php takes 500-600 seconds. But it doesn't appear to be paying attention to its polling interval in the cactid.conf file. As soon as it finishes a round, it starts another.
I set the nice value for cmd.php to -18 to push it way up in the priority queue. That didn't have as much effect as I hoped, because most of the time spent is not on processing - it's waiting for the snmpgets to return.
Ian, I'm very interested in cactid. Let me know if I can help you test it and speed along the development process. I'd also like to get from you a rough list of command line args and conf file options.
Mark Chester
IT Manager
Web Associates
805.782.4692
I like Bulek's idea of using multiple cmd.php processes, so I took it a step further. I think a lot of the delay that develops from overlapping processes comes from multiple processes hitting the same host with SNMP queries at once. So I modified the select statement like so:
Code:
select * from data_input_data_cache where management_ip="<host_ip>"
And then I create cmd_<host>.php files and put them in the cron. This ensures that all SNMP queries to each host are serial, while several hosts are being polled at a time. I am running 8 at once. We'll see what happens.
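A hypothetical way to stamp out the per-host copies (the host list, paths, and the exact select line are illustrative, not Mark's actual setup):
Code:
# rewrite the poller's select statement once per host
for ip in 10.0.0.1 10.0.0.2 10.0.0.3; do
    sed 's/from data_input_data_cache/from data_input_data_cache where management_ip="'"$ip"'"/' \
        /var/www/cacti/cmd.php > "/var/www/cacti/cmd_$ip.php"
done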
[Several hours later]
Well, this looks like a good workaround for me. I created separate cmd-<host>.php files for each of my servers and run them individually from cron. Running in parallel, they all finish within a couple of minutes total. I'm also logging the output of each one, so I can see how long they run. It is pushing my load average up pretty high though (almost 10); it drops to about 1 by the time the next round comes. So I need to watch it closely for a few days.
I did find that my main ethernet switch (48 ports) was replying to SNMP queries REALLY slowly and taking longer than 5 minutes to complete just that one host, so I removed it from cacti altogether.
Last edited by Koyaanisqatsi on Thu Jun 12, 2003 6:40 pm, edited 2 times in total.
Mark Chester
IT Manager
Web Associates
805.782.4692
Hello
I managed to ./configure and make cactid. The MySQL and SSL development libraries were missing on my SuSE 8.2.
Then: ./configure --with-mysql=/usr and hey presto, it worked!
I started the daemon in a console and could therefore see, line by line and for every request, which data are collected and how long the snmp answers took to arrive.
It is indeed the case that some snmp answers take a while to come in, while others come back so fast it's a pleasure to see... Why? All in all it takes about 240 sec. to finish for about 320 datasources. It then gives my Linux a rest for about a minute... Only I find that the CPU (avg. 80%) and run queue (2-8) are still heavily loaded... I will let it run over night and see how things look tomorrow...
Yes, with cactid I get error messages on the error/discard collector in the console window. I do not know if that was the case with cmd.php too.
greetings, Uwe, Zürich, Switzerland