Poller performance and more

Post general support questions here that do not specifically fall into the Linux or Windows categories.

Moderators: Developers, Moderators

Jacket
Posts: 41
Joined: Tue Sep 30, 2008 11:38 pm
Location: Sofia, Bulgaria

Poller performance and more

Post by Jacket »

Hi, first of all thanks for this great peace of software. I'm using it to monitor my network of managed switches and it workes perfectly (till one point...).
The problem now is the performance of the poller - it runs for about 140 sec. I have about 1300 devices with 36 DS's per device. This makes a total of about 47000 DS's. I'm using cacti 0.8.7d (with structured rra path) and spine. Rrdtool is version 1.2.28.
The specs of the machine are:
CPU: Core2Duo E8600
RAM: 2G
HDD: 2xWDC WD1500HLFS @10k RPM in RAID
Supermicro mobo
4 poller processes, and 10 threads per process.
The hard drive makes about 1500-2500 tps when rrdtool is writing.
I was looking at what the poller does and noticed that when it's collecting the data from the network it's not writing anything to the hard drives. Collecting takes a good 40-50 secs, and that time i think maybe used more optimal to write the data, instead of waiting and then thrashing the drives at once. Can we do that with cacti?
Now you may say that 140 secs is normal with that many data sources, but i have a huge problem with 5 minute poller cronjobs. The problem is that 90% of my equipment although being all Gigabit still uses only 32bit counters, and you know how fast they flip, and i can't measure anything beyond 115MBps, sux eh? :(
Can i optimize that poller time somehow, any ideas?
I noticed that mysql uses 100% cpu time when poller is collecting data, but only on one core.
Also at about an hour poller uptime goes up to 250-300 secs, and then drops again :(
The last problem i have is graph creation performance. At first when i was creating graphs for one host (9 graphs) it took not more than a 500ms. Now the graphs are created for more than a 10 seconds, and with mysql eating 100% cpu at this time. Is this normal, or i have some problem?
User avatar
schurzi
Posts: 42
Joined: Sat Oct 11, 2008 8:45 am
Location: Germany
Contact:

Post by schurzi »

Hi Jacket,

maybe your biggest problem ist the write IO on the Disks.
I also would suggest that you put more RAM into the box.

As for MySQL Tuning, there are two variables, which may helb you in an instant:
query_cache_size and thread_cache_size

You can modify these Values online, without restarting MySQL (as of MySQL 5.0).
I suggest you set the query_cache to 32MB and the thread_cache to (poller processes * poller threads)

So you should login to your MySQL with the admin account and execute these two lines:

Code: Select all

set global query_cache_size=32*1024*1024;
set global thread_cache_size=40;
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
cigamit
Developer
Posts: 3369
Joined: Thu Apr 07, 2005 3:29 pm
Location: B/CS Texas
Contact:

Post by cigamit »

This is what the boost plugin was made to help with.
Jacket
Posts: 41
Joined: Tue Sep 30, 2008 11:38 pm
Location: Sofia, Bulgaria

Post by Jacket »

Thank you both for the replies :)
I had allready set query_cache_size to 32M, and thread_cache_size to 8 - now it's 40, as schurzi advised (4 processes, and 10 threads). That didn't do much. The thing that amazed me most was that cacti db didn't have indexes in the tables. Is there a reason for that?
So i created the indexes - http://bugs.cacti.net/view.php?id=1333
And suddenly all my problems had gone away :))
From 100s poller time had decreased to 50-60s, and no more lags when creating graphs :) Also poller started writing to rrd's from the moment it starts. No more mysql clogging the cpu also.
I say 100s, because this morning i decided, that i don't need graphs for packets/sec on each port, so a i've halved my data sources.
Then the next problem appeared... i decided to switch to 1 minute poller interval, and realized that i cannot change the step of my rrd's... galndaf said that somewere here.
So i decided to do a clean start :) - deleted all graphs/ds's and hosts (it's faster to delete graphs with ds's first, and then the hosts), reimported my old templates with 60sec step and 120 sec heartbeat, and imported my hosts again.
I have created a little php script that first injects the hosts in the cacti.host talbe, then associates graphs to the new hosts, and the submits forms every second to create the graphs :) So no problem to readd 1200 devices :) And now when creating of graphs takes less than a second, it's much more easy and faster.
What a surprise when i added all my devices and look at the log file - 29s poller uptime!! :) I still cannot believe it. Dunno what went wrong with the old install, but now i'm just so happy... :)
I'll try the boost plugin later. I've always thought that it was only for speeding up the viewing of graphs, not for poller performance, but maybe i was wrong.

Here are some graphs to see my performance now :)
If you want i can post my complete mysql setup.
Attachments
1.png
1.png (36.41 KiB) Viewed 6771 times
2.png
2.png (32.59 KiB) Viewed 6771 times
3.png
3.png (28.23 KiB) Viewed 6771 times
cigamit
Developer
Posts: 3369
Joined: Thu Apr 07, 2005 3:29 pm
Location: B/CS Texas
Contact:

Post by cigamit »

Boost is made for both. It has graph caching for speeding up the drawing of graphs, but its core function is for speeding up your polling. I have seen people go from 300 second polling times to 15 seconds by using it. Just be sure you have lots of memory, tweak MySQL to allow larger memory tables, larger mysql strings, and to ofcourse use the memory table for it.

In Essence, what it does is this. Normally cacti gets a batch of pollings, and writes them to disk as it goes. This causes a good bit of disk seeking for all the small bits it writes jumping all over the disk. Boost instead takes the data and stores it in memory for a specified amount of time, and then will write it all at once, which helps greatly with the disk I/O. Graphs, when viewed, are auto-updated on the fly before viewing.
Jacket
Posts: 41
Joined: Tue Sep 30, 2008 11:38 pm
Location: Sofia, Bulgaria

Post by Jacket »

10x, that sounds great :) I was thinking of a way to make one ramdrive for the rra/ dir and dumping it to a hdd at about an hour interval for backup (i have serious battery backup), but there seems to be no point now, as this plugin does allmost the same thing :)
I will surely upgrade to 8gig ram ASAP and will give it a try. I'll post the results here as soon as i it's done.

P.S. i noticed my poller is spiking to about 40-45 seconds at about 20 minutes interval.
This is my 3rd and biggest cacti, and the other two are doing the same thing with spine. With cmd.php there are no such spikes, but poller time is greater ofcourse :)
Did someone else noticed that?
Attachments
poller.png
poller.png (36.97 KiB) Viewed 6732 times
cigamit
Developer
Posts: 3369
Joined: Thu Apr 07, 2005 3:29 pm
Location: B/CS Texas
Contact:

Post by cigamit »

Do you have any crons running at a 20 minute interval? My polling times are ~2-3 seconds of each other using spine.
User avatar
schurzi
Posts: 42
Joined: Sat Oct 11, 2008 8:45 am
Location: Germany
Contact:

Post by schurzi »

maybe it is from the aggregation functions in the RRDs?
if you have a rra of 5 or 20 minutes defined this could be caused by the additional IO for the updates.

I could not observe such spikes for my install, but your poller time matches with the IO patterns I see on our storage.
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
Jacket
Posts: 41
Joined: Tue Sep 30, 2008 11:38 pm
Location: Sofia, Bulgaria

Post by Jacket »

Sooooo... today i upgraded my machine to 8gig ram and installed the boost plugin (memory resident). You can see the results in the graphics below :)

Code: Select all

02/25/2009 07:49:16 PM - SYSTEM STATS: Time:14.7831 Method:spine Processes:4 Threads:10 Hosts:1496 HostsPerProcess:374 DataSources:30570 RRDsProcessed:0
02/25/2009 07:48:56 PM - SYSTEM BOOST STATS: Time:160.6598 RRDUpdates:1756345
Looking good isn't it? :)
Thank you verry much for you help. I didn't even dream about runtimes like that :)

However i have a little problem now when viewing graphs. They are generated pretty slowly, and i see mysql eating the cpu again. I checked poller_output_boost table, and didn't saw any indexes there too. Can this be the problem again? What if i create one A_I index field? Will this backfire on me? For instant eat my memory more?
Is there any other way of speeding up the generation of graphs?

This is my boost status:
Current Boost Status

Boost On Demand Updating: Idle
Total Data Sources: 30571
Total Boost Records: 467682

Boost Storage Statistics

Database Engine: MEMORY
Current Boost Table Size: 919 MBytes
Avg Bytes/Record: 545 Bytes
Max Record Length: 113 Bytes
Max Allowed Boost Table Size: 4 GBytes
Estimated Maximum Records: 7615188 Records

Previous Runtime

Last Start Time: 2009-02-25 19:46:16
Last Run Duration: 160.6598 Seconds
RRD Updates: 1756345
Peak Poller Memory: 98.47 MBytes
Max Poller Memory Allowed: 2048 MBytes

Run Time Configuration

Update Frequency: 1 Hour
Next Start Time: 2009-02-25 20:46:16
Maximum Records: 2000000 Records
Maximum Allowed Runtime: 20 Minutes

Boost Server Details

Server Config Status: Disabled
Multiprocess Server: Multiple Process
Update Timeout: 2 Seconds
Server/Port: localhost@9050
Authorized Update Web Servers: 127.0.0.1
RRDtool Binary Used: /usr/bin/rrdtool

Image Caching

Image Cacing Status: Enabled
Cache Directory: /usr/local/cacti/tmp
Cached Files: 48 Files
Cached Files Size: 1 MBytes
Attachments
4.png
4.png (30.46 KiB) Viewed 6676 times
3.png
3.png (27.27 KiB) Viewed 6676 times
2.png
2.png (40.73 KiB) Viewed 6676 times
1.png
1.png (37.04 KiB) Viewed 6676 times
Jacket
Posts: 41
Joined: Tue Sep 30, 2008 11:38 pm
Location: Sofia, Bulgaria

Post by Jacket »

cigamit wrote:Do you have any crons running at a 20 minute interval? My polling times are ~2-3 seconds of each other using spine.
No, i thought about that, cheked, and didn't found any aditional cronjobs.
I think schurzi maybe right. Anyway, the spikes are gone since i installed the boost plugin, which (i think) means it has something to do with IO. Now i get one little spike every hour, when boost dumps the poller output to the hdd's, but that's pretty normal :)
This can be seen here:
Attachments
5.png
5.png (32.14 KiB) Viewed 6672 times
User avatar
schurzi
Posts: 42
Joined: Sat Oct 11, 2008 8:45 am
Location: Germany
Contact:

Post by schurzi »

regarding indexes and memory table i like to point out this blog entry:
http://www.mysqlperformanceblog.com/200 ... ry-tables/

As i see in the installation script, the boost tables have a primary key, which is as goog as an index.

But this key uses hash indexing instead of btree. Maybe that is keeping your mysql process bussy.
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
User avatar
schurzi
Posts: 42
Joined: Sat Oct 11, 2008 8:45 am
Location: Germany
Contact:

Post by schurzi »

you can change the index with following commands (based on boost 2.4 install script):

Code: Select all

ALTER TABLE poller_output_boost DROP PRIMARY KEY;
ALTER TABLE poller_output_boost ADD PRIMARY KEY USING BTREE (`local_data_id`,`rrd_name`,`time`);
I'm not exactly sure if it will make things better or worse. But you may give it a try.
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
Jacket
Posts: 41
Joined: Tue Sep 30, 2008 11:38 pm
Location: Sofia, Bulgaria

Post by Jacket »

It's 1000% better now :) Thanks man!
It freed about 500mb of ram instantly, but now it's being slowly consumed again. Look at the last 30 minutes of the attached graph. Not a problem, because i have plenty of free memory for now :)
Now the graphs are being generated allmost instantly, and i can hardly see mysql doing anything.

As a conclusion i think both cacti and boost tables need optimizations by default. I'm curious what the developers might say about this. Ofcourse i'm not even close to being an expert in mysql, but from what i've seen so far, 90% of my performance problems was caused by mysql, and was resolved by just adding/altering indexes.
Attachments
6.png
6.png (27.74 KiB) Viewed 6654 times
User avatar
schurzi
Posts: 42
Joined: Sat Oct 11, 2008 8:45 am
Location: Germany
Contact:

Post by schurzi »

cool, nice to hear that :)

I've been looking a bit into the Boost code and the generated SQL satements. So far I found following clues:

There are two indexes on the poller_output_boost table, both are hash indexes.
PRIMARY KEY (`local_data_id`,`rrd_name`,`time`),
KEY `time_local_data_id` (`time`,`local_data_id`)

I've taken all the SELECT queries, wich seemed relevant from boost code an ran some explains on them. None of the queries was able to use a Index. Then I altered the indexes to BTREE and every query could use an matching index.

For the index "time_local_data_id" the SQL would be:

Code: Select all

ALTER TABLE poller_output_boost DROP INDEX time_local_data_id;
ALTER TABLE poller_output_boost ADD INDEX time_local_data_id USING BTREE (`time`,`local_data_id`);
Here are the queries:
(1st collumn shows output with has, 2nd with btree index)

Code: Select all

explain SELECT MIN(local_data_id) FROM poller_output_boost WHERE time<NOW();
+----+-------------+---------------------+------+--------------------+------+---------+------+-------+-------------+
| id | select_type | table               | type | possible_keys      | key  | key_len | ref  | rows  | Extra       |
+----+-------------+---------------------+------+--------------------+------+---------+------+-------+-------------+
|  1 | SIMPLE      | poller_output_boost | ALL  | time_local_data_id | NULL | NULL    | NULL | 14624 | Using where |
+----+-------------+---------------------+------+--------------------+------+---------+------+-------+-------------+
+----+-------------+---------------------+-------+--------------------+--------------------+---------+------+-------+-------------+
| id | select_type | table               | type  | possible_keys      | key                | key_len | ref  | rows  | Extra       |
+----+-------------+---------------------+-------+--------------------+--------------------+---------+------+-------+-------------+
|  1 | SIMPLE      | poller_output_boost | range | time_local_data_id | time_local_data_id | 8       | NULL | 14624 | Using where |
+----+-------------+---------------------+-------+--------------------+--------------------+---------+------+-------+-------------+



explain SELECT DISTINCT local_data_id FROM poller_output_boost WHERE time<NOW() ORDER BY local_data_id LIMIT 10,1;
+----+-------------+---------------------+------+--------------------+------+---------+------+-------+----------------------------------------------+
| id | select_type | table               | type | possible_keys      | key  | key_len | ref  | rows  | Extra                                        |
+----+-------------+---------------------+------+--------------------+------+---------+------+-------+----------------------------------------------+
|  1 | SIMPLE      | poller_output_boost | ALL  | time_local_data_id | NULL | NULL    | NULL | 14624 | Using where; Using temporary; Using filesort |
+----+-------------+---------------------+------+--------------------+------+---------+------+-------+----------------------------------------------+
+----+-------------+---------------------+-------+--------------------+--------------------+---------+------+-------+----------------------------------------------+
| id | select_type | table               | type  | possible_keys      | key                | key_len | ref  | rows  | Extra                                        |
+----+-------------+---------------------+-------+--------------------+--------------------+---------+------+-------+----------------------------------------------+
|  1 | SIMPLE      | poller_output_boost | range | time_local_data_id | time_local_data_id | 8       | NULL | 14624 | Using where; Using temporary; Using filesort |
+----+-------------+---------------------+-------+--------------------+--------------------+---------+------+-------+----------------------------------------------+



explain SELECT * FROM poller_output_boost WHERE time<=NOW() ORDER BY local_data_id ASC, time ASC, rrd_name ASC;
+----+-------------+---------------------+------+--------------------+------+---------+------+-------+-----------------------------+
| id | select_type | table               | type | possible_keys      | key  | key_len | ref  | rows  | Extra                       |
+----+-------------+---------------------+------+--------------------+------+---------+------+-------+-----------------------------+
|  1 | SIMPLE      | poller_output_boost | ALL  | time_local_data_id | NULL | NULL    | NULL | 14624 | Using where; Using filesort |
+----+-------------+---------------------+------+--------------------+------+---------+------+-------+-----------------------------+
+----+-------------+---------------------+-------+--------------------+--------------------+---------+------+-------+-----------------------------+
| id | select_type | table               | type  | possible_keys      | key                | key_len | ref  | rows  | Extra                       |
+----+-------------+---------------------+-------+--------------------+--------------------+---------+------+-------+-----------------------------+
|  1 | SIMPLE      | poller_output_boost | range | time_local_data_id | time_local_data_id | 8       | NULL | 14623 | Using where; Using filesort |
+----+-------------+---------------------+-------+--------------------+--------------------+---------+------+-------+-----------------------------+




explain SELECT * FROM poller_output_boost WHERE local_data_id=72 ORDER BY local_data_id ASC, time ASC, rrd_name ASC;
+----+-------------+---------------------+------+---------------+------+---------+------+-------+-----------------------------+
| id | select_type | table               | type | possible_keys | key  | key_len | ref  | rows  | Extra                       |
+----+-------------+---------------------+------+---------------+------+---------+------+-------+-----------------------------+
|  1 | SIMPLE      | poller_output_boost | ALL  | PRIMARY       | NULL | NULL    | NULL | 10968 | Using where; Using filesort |
+----+-------------+---------------------+------+---------------+------+---------+------+-------+-----------------------------+
+----+-------------+---------------------+------+---------------+---------+---------+-------+------+-----------------------------+
| id | select_type | table               | type | possible_keys | key     | key_len | ref   | rows | Extra                       |
+----+-------------+---------------------+------+---------------+---------+---------+-------+------+-----------------------------+
|  1 | SIMPLE      | poller_output_boost | ref  | PRIMARY       | PRIMARY | 3       | const |    1 | Using where; Using filesort |
+----+-------------+---------------------+------+---------------+---------+---------+-------+------+-----------------------------+



explain SELECT * FROM poller_output_boost WHERE time<=NOW() AND local_data_id BETWEEN 20 AND 40 ORDER BY local_data_id ASC, time ASC, rrd_name ASC;
+----+-------------+---------------------+------+----------------------------+------+---------+------+-------+-----------------------------+
| id | select_type | table               | type | possible_keys              | key  | key_len | ref  | rows  | Extra                       |
+----+-------------+---------------------+------+----------------------------+------+---------+------+-------+-----------------------------+
|  1 | SIMPLE      | poller_output_boost | ALL  | PRIMARY,time_local_data_id | NULL | NULL    | NULL | 14624 | Using where; Using filesort |
+----+-------------+---------------------+------+----------------------------+------+---------+------+-------+-----------------------------+
+----+-------------+---------------------+-------+----------------------------+---------+---------+------+------+-----------------------------+
| id | select_type | table               | type  | possible_keys              | key     | key_len | ref  | rows | Extra                       |
+----+-------------+---------------------+-------+----------------------------+---------+---------+------+------+-----------------------------+
|  1 | SIMPLE      | poller_output_boost | range | PRIMARY,time_local_data_id | PRIMARY | 3       | NULL |    1 | Using where; Using filesort |
+----+-------------+---------------------+-------+----------------------------+---------+---------+------+------+-----------------------------+



explain SELECT * FROM poller_output_boost WHERE  local_data_id BETWEEN 20 AND 40 ORDER BY local_data_id ASC, time ASC, rrd_name ASC;
+----+-------------+---------------------+------+---------------+------+---------+------+-------+-----------------------------+
| id | select_type | table               | type | possible_keys | key  | key_len | ref  | rows  | Extra                       |
+----+-------------+---------------------+------+---------------+------+---------+------+-------+-----------------------------+
|  1 | SIMPLE      | poller_output_boost | ALL  | PRIMARY       | NULL | NULL    | NULL | 14624 | Using where; Using filesort |
+----+-------------+---------------------+------+---------------+------+---------+------+-------+-----------------------------+
+----+-------------+---------------------+-------+---------------+---------+---------+------+------+-----------------------------+
| id | select_type | table               | type  | possible_keys | key     | key_len | ref  | rows | Extra                       |
+----+-------------+---------------------+-------+---------------+---------+---------+------+------+-----------------------------+
|  1 | SIMPLE      | poller_output_boost | range | PRIMARY       | PRIMARY | 3       | NULL |    1 | Using where; Using filesort |
+----+-------------+---------------------+-------+---------------+---------+---------+------+------+-----------------------------+
Within these queries is one in particular, which it a bit misguided:

Code: Select all

explain SELECT * FROM poller_output_boost WHERE time<=NOW() AND local_data_id BETWEEN 20 AND 40 ORDER BY local_data_id ASC, time ASC, rrd_name ASC;
+----+-------------+---------------------+------+----------------------------+------+---------+------+-------+-----------------------------+
| id | select_type | table               | type | possible_keys              | key  | key_len | ref  | rows  | Extra                       |
+----+-------------+---------------------+------+----------------------------+------+---------+------+-------+-----------------------------+
|  1 | SIMPLE      | poller_output_boost | ALL  | PRIMARY,time_local_data_id | NULL | NULL    | NULL | 14624 | Using where; Using filesort |
+----+-------------+---------------------+------+----------------------------+------+---------+------+-------+-----------------------------+
+----+-------------+---------------------+-------+----------------------------+---------+---------+------+------+-----------------------------+
| id | select_type | table               | type  | possible_keys              | key     | key_len | ref  | rows | Extra                       |
+----+-------------+---------------------+-------+----------------------------+---------+---------+------+------+-----------------------------+
|  1 | SIMPLE      | poller_output_boost | range | PRIMARY,time_local_data_id | PRIMARY | 3       | NULL |    1 | Using where; Using filesort |
+----+-------------+---------------------+-------+----------------------------+---------+---------+------+------+-----------------------------+
This query uses the wrong indes, I think if it would use the index "time_local_data_id" it could perform better.

Maybe we should point theWittness on these findings, so he can run tests with some more data and see if it is realy better.
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
User avatar
TheWitness
Developer
Posts: 17062
Joined: Tue May 14, 2002 5:08 pm
Location: MI, USA
Contact:

Post by TheWitness »

With memory tables when you perform an alter command, all memory consumed by previous hole is freed. This has nothing to do about system performance.

The hash vs btree issue is only relevant for key with value that have low cardinality (aka don't vary a lot or all the same). Since I'm on a blackberry tight now, i'll have to look at the other analysis later.

TheWitness
True understanding begins only when we realize how little we truly understand...

Life is an adventure, let yours begin with Cacti!

Author of dozens of Cacti plugins and customization's. Advocate of LAMP, MariaDB, IBM Spectrum LSF and the world of batch. Creator of IBM Spectrum RTM, author of quite a bit of unpublished work and most of Cacti's bugs.
_________________
Official Cacti Documentation
GitHub Repository with Supported Plugins
Percona Device Packages (no support)
Interesting Device Packages


For those wondering, I'm still here, but lost in the shadows. Yearning for less bugs. Who want's a Cacti 1.3/2.0? Streams anyone?
Post Reply

Who is online

Users browsing this forum: No registered users and 5 guests