wierd polling time creep

MagicOneXXX · Post by **MagicOneXXX** » Thu Apr 12, 2007 4:09 pm

lately I've been having a weird problem with cacti. It runs stable until I add another host and start graphing it. Once I do, my system stats slowly creep up from ~17 seconds into the 100's. Doing a rebuild of the poller cache resets my time to ~17 seconds, but it continues to creep up again.

Here is an example of the 'poller creep':

04/12/2007 04:54:23 PM - SYSTEM STATS: Time:17.8759 Method:cactid Processes:13 Threads:10 Hosts:55 HostsPerProcess:5 DataSources:1818 RRDsProcessed:934
04/12/2007 04:55:27 PM - SYSTEM STATS: Time:22.1139 Method:cactid Processes:13 Threads:10 Hosts:55 HostsPerProcess:5 DataSources:1814 RRDsProcessed:932
04/12/2007 04:56:29 PM - SYSTEM STATS: Time:24.1165 Method:cactid Processes:13 Threads:10 Hosts:55 HostsPerProcess:5 DataSources:1815 RRDsProcessed:934
04/12/2007 04:57:28 PM - SYSTEM STATS: Time:22.0784 Method:cactid Processes:13 Threads:10 Hosts:55 HostsPerProcess:5 DataSources:1817 RRDsProcessed:934
04/12/2007 04:58:28 PM - SYSTEM STATS: Time:23.0596 Method:cactid Processes:13 Threads:10 Hosts:55 HostsPerProcess:5 DataSources:1815 RRDsProcessed:933
04/12/2007 04:59:26 PM - SYSTEM STATS: Time:21.4659 Method:cactid Processes:13 Threads:10 Hosts:55 HostsPerProcess:5 DataSources:1818 RRDsProcessed:934
04/12/2007 05:00:30 PM - SYSTEM STATS: Time:25.5943 Method:cactid Processes:13 Threads:10 Hosts:55 HostsPerProcess:5 DataSources:1832 RRDsProcessed:941
04/12/2007 05:01:35 PM - SYSTEM STATS: Time:30.5847 Method:cactid Processes:13 Threads:10 Hosts:55 HostsPerProcess:5 DataSources:1833 RRDsProcessed:943
04/12/2007 05:02:32 PM - SYSTEM STATS: Time:27.8288 Method:cactid Processes:13 Threads:10 Hosts:55 HostsPerProcess:5 DataSources:1835 RRDsProcessed:943
04/12/2007 05:03:33 PM - SYSTEM STATS: Time:29.0359 Method:cactid Processes:13 Threads:10 Hosts:55 HostsPerProcess:5 DataSources:1833 RRDsProcessed:942
04/12/2007 05:04:34 PM - SYSTEM STATS: Time:31.3829 Method:cactid Processes:13 Threads:10 Hosts:55 HostsPerProcess:5 DataSources:1836 RRDsProcessed:943
04/12/2007 05:05:35 PM - SYSTEM STATS: Time:30.5828 Method:cactid Processes:13 Threads:10 Hosts:55 HostsPerProcess:5 DataSources:1832 RRDsProcessed:941
04/12/2007 05:06:36 PM - SYSTEM STATS: Time:30.6497 Method:cactid Processes:13 Threads:10 Hosts:55 HostsPerProcess:5 DataSources:1833 RRDsProcessed:943
04/12/2007 05:07:43 PM - SYSTEM STATS: Time:39.5587 Method:cactid Processes:13 Threads:10 Hosts:55 HostsPerProcess:5 DataSources:1835 RRDsProcessed:943
04/12/2007 05:08:45 PM - SYSTEM STATS: Time:40.2732 Method:cactid Processes:13 Threads:10 Hosts:55 HostsPerProcess:5 DataSources:1833 RRDsProcessed:942

Etc. on up. This occurs with different hosts, so it's not the host i'm trying to graph.

Any thoughts?

BSOD2600 · Post by **BSOD2600** » Thu Apr 12, 2007 6:46 pm

I gather you have the 1 min polling patch installed? Any plugins too?
I also see the number of datasources is changing with the polling times -- are devices going up/down or you polling different ones at different time intervals?

MagicOneXXX · Post by **MagicOneXXX** » Fri Apr 13, 2007 8:15 am

Here is my system info:

Cacti Version - 0.8.6i
Plugin Architecture - 1.0
Poller Type - Cactid v0.8.6i
Server Info - Linux 2.6.16.21-0.8-smp
Web Server - Apache/2.2.0 (Linux/SUSE)
PHP - 5.1.2
PHP Extensions - libxml, xml, standard, SimpleXML, SPL, session, Reflection, date, pcre, apache2handler, gd, mysql, mysqli, snmp, sockets
MySQL - 5.0.18
RRDTool - 1.2.12
SNMP - 5.3.0.1
Plugins
Thresholds (thold - v0.3.0)
Device Tracking (mactrack - v0.0.1b)
Network Discovery (discovery - v0.7)
Network Tools (tools - v0.2)
Device Monitoring (monitor - v0.7)
NTop Viewer (ntop - v0.1)
PHP Network Weathermap (weathermap - v0.9pre2)
Report Creator (reports - v0.1b)
Update Checker (update - v0.3)
Host Info (hostinfo - v0.1)

Mactrack polling is disabled at the moment.

Hardware info:
Intel Xeon 2.8GHz HT
512MB Ram

For the most part, we're using 1 minute polling for about 95% of our devices. For a few devices, over in another country, we use 5 minute polling. That's why there is a little variance.

We poll quite a few switches, so we're pulling back 24-48 data sources per device. We are also pulling back 2 core switches that each have 5 blades and somewhere around 100 ports.

I thought RAM might be the issue, but i'd expect these sort of results only if the poller went above 60 seconds and polling started to 'stumble' upon itself.

Any thoughts?

MagicOneXXX · Post by **MagicOneXXX** » Fri Apr 13, 2007 10:57 am

I've been monitoring the server before and after I add another device, and discovered that after adding the device, additional cron processes spawn. This seems to cause more php processes to spawn as well.

I've tried this a couple of times - disabling the device and letting the server 'stabilize' again - then re-enabling the device and watching the additonal processes spawn.

MagicOneXXX · Post by **MagicOneXXX** » Fri Apr 13, 2007 3:54 pm

I've done more digging, and ended up pushing the poller cron results (php -q poller.php) to a log file. here is a snippet of that log that looks interesting:

OK u:0.06 s:0.63 r:23.75
OK u:0.06 s:0.64 r:24.07
OK u:0.06 s:0.64 r:24.30
OK u:0.06 s:0.65 r:24.41
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^$
OK u:0.08 s:1.40 r:45.18
OK u:0.08 s:1.41 r:45.35
OK u:0.08 s:1.41 r:45.53
OK u:0.08 s:1.42 r:45.71
OK u:0.08 s:1.42 r:45.93

There is a huge jump in stats with the odd ^@ stuff between it. I don't think this is normal. The ending stats of this particular run are:

OK u:0.10 s:2.58 r:71.93

I'm guessing the r: is the total runtime?

Also, the system stats seem to be posted mid-run:

OK u:0.06 s:0.32 r:17.50
OK u:0.06 s:0.32 r:17.54
OK u:0.06 s:0.32 r:17.55
04/13/2007 04:50:32 PM - SYSTEM STATS: Time:29.2292 Method:cactid Processes:13 Threads:10 Hosts:54 HostsPerProcess:5 DataSources:1786 RRDsProcessed:918
OK u:0.06 s:0.32 r:17.56
OK u:0.06 s:0.32 r:17.57
OK u:0.06 s:0.32 r:17.58
OK u:0.06 s:0.32 r:17.58

Which seems odd to me. Shouldn't it be posted at the end of the run?

Also, if i open the log file in the beginning of a run, the file starts out with:

OK u:0.08 s:1.55 r:45.34
OK u:0.08 s:1.56 r:45.45
OK u:0.08 s:1.56 r:45.59
OK u:0.08 s:1.56 r:45.71
OK u:0.08 s:1.56 r:45.75
OK u:0.08 s:1.56 r:45.78
OK u:0.08 s:1.56 r:45.93
OK u:0.08 s:1.56 r:45.97
OK u:0.08 s:1.57 r:46.05

and if I open it again mid-run, i get the funky

OK u:0.00 s:0.01 r:2.21
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^$
OK u:0.08 s:1.55 r:45.34

in the log.

Combine this with the fact that there are multiple cron jobs that spawn, i'm guessing there is some double-polling going on.

Thoughts?

MagicOneXXX · Post by **MagicOneXXX** » Fri Apr 13, 2007 4:49 pm

Here is the results of a ps aux:

root 16948 0.0 0.1 1820 540 ? Ss 17:44 0:00 /usr/sbin/cron
root 16951 0.0 0.1 2420 824 ? S 17:45 0:00 /usr/sbin/cron
wwwrun 16952 0.0 0.2 2524 1104 ? Ss 17:45 0:00 /bin/bash -c /usr/bin/php5 -q /srv/www/htdocs/cacti/poller.php > /var/log/cacti/poller.log 2>&1
wwwrun 16953 1.7 2.3 52812 12068 ? D 17:45 0:01 /usr/bin/php5 -q /srv/www/htdocs/cacti/poller.php
root 17404 0.0 0.1 2420 824 ? S 17:46 0:00 /usr/sbin/cron
wwwrun 17405 0.0 0.2 2524 1100 ? Ss 17:46 0:00 /bin/bash -c /usr/bin/php5 -q /srv/www/htdocs/cacti/poller.php > /var/log/cacti/poller.log 2>&1
wwwrun 17406 6.5 2.4 53340 12436 ? S 17:46 0:01 /usr/bin/php5 -q /srv/www/htdocs/cacti/poller.php

and here is my crontab:

SHELL=/bin/bash
PATH=/usr/bin:/usr/sbin:/sbin:/bin:/usr/lib/news/bin:/usr/local/bin
#
# check scripts in cron.hourly, cron.daily, cron.weekly, and cron.monthly
#
*/1 * * * * wwwrun /usr/bin/php5 -q /srv/www/htdocs/cacti/poller.php > var/log/cacti/poller.log 2>&1

Def. a double poll. Another cron process is spawned right after the first, and both spawn a polling cycle, causing a double poll.

Suggestions?

Post by **gandalf** » Mon Apr 16, 2007 6:02 am

Please chack all crontabs
- /etc/crontab
- /etc/cron.d/cacti
- crontabs of users root/cactiuser
Reinhard

MagicOneXXX · Post by **MagicOneXXX** » Mon Apr 16, 2007 8:44 am

No go

I checked all my cron schedulers; I removed everything that was in cron.daily (mostly logrotate stuff) and restarted cron. A cron process still spawns right after it starts running the poller.

See above for what's in my crontab.

Anyone else experience this odd issue?

MagicOneXXX · Post by **MagicOneXXX** » Mon Apr 16, 2007 9:28 am

Ok, so I killed all of the cron jobs and running php processes, waited a bit, then ran 'time /usr/bin/php5 /srv/www/htdocs/cacti/poller.php'. The poller ends up reporting a total time of about 7 seconds, but the process continues on and ends up running a total of 1min 10 seconds. I think this may be causing an overrun, so another cron process spawns before the last one finished.

Why is the poller reporting such low process times, when the total time to poll is really much higher?

fmangeant · Post by **fmangeant** » Mon Apr 16, 2007 9:43 am

Hi

can you try with 2 processes and 15 theads ?

You can see this thread from Reinhard : http://forums.cacti.net/viewtopic.php?p=97180#97180

MagicOneXXX · Post by **MagicOneXXX** » Mon Apr 16, 2007 10:02 am

I set my poller settings to what is recommended, but to no avial. Poller still takes over 1 minute to complete. Poller reports it takes alot less:

OK u:0.06 s:0.29 r:7.65
OK u:0.06 s:0.29 r:7.66
OK u:0.06 s:0.29 r:7.68
OK u:0.06 s:0.29 r:7.71
04/16/2007 10:50:13 AM - SYSTEM STATS: Time:11.8671 Method:cactid Processes:4 Threads:15 Hosts:54 HostsPerProcess:14 DataSources:1784 RRDsProcessed:917
OK u:0.06 s:0.29 r:7.72
OK u:0.06 s:0.29 r:7.73
OK u:0.06 s:0.29 r:7.73
OK u:0.06 s:0.29 r:7.74
OK u:0.06 s:0.29 r:7.75

And I don't even have that many datasources/hosts!

Only bottleneck I can think of is RAM (Disks are 15k rpm) which is low at 512mb (but, free is telling me it still has 373MB ram free for programs. Most of that is going to disk caching). RAM is on order to up it to 1.5G, but that will take a few weeks to get in.

This looks alot more like a performance issue now.

Thanks thus far for all your suggestions!

Post by **TheWitness** » Mon Apr 16, 2007 9:59 pm

Make sure you don't have extra poller processes running. When you were running 13 process, I just about died. Keep in mind that if you run script server, you also get X PHP Script Server processes and who know's how many scripts running.

This can overwhelm a system quick. If your poller finishes in 11 seconds (as it should) and you continue to receive "OK" messages after the poller completes, this is RRDtool hanging up on RRDupdates.

Please keep in mind that your RRA folder MUST or should never be NFS mounted. If it is, figure something else out, please.

TheWitness

MagicOneXXX · Post by **MagicOneXXX** » Tue Apr 17, 2007 8:33 am

Thanks for the advice, TheWitness! I didn't know what the limits should be on the concurrent processes, as there isn't anything listed for recommendations in the settings page (at least in 0.8.6i).

According to the Disk I/O, the server is writing about 33mb per minute to disk, which it should have no problem doing with 15k rpm SCSI drives. What else could possibly bottleneck rrdtool?

FYI, the drives are reiserfs.

Post by **TheWitness** » Tue Apr 17, 2007 10:00 am

Here is "how" you test things:

1) Run poller.php outside of CRON. All of your RRDupdates will appear on the screen
2) While running, watch for "cactid" processes while using top or ps -ef | grep cactid.
3) When the cactid processes end, time the time it takes for all the "OK" messages to clear off of the screen.

The time difference between the last cactid process completing and the last "OK" message will be disk/rrdupdate delay.

Let me know what you find. Could be you have excellent disks and no disk cache (aka poor drivers), which will make the best disk look like crap.

TheWitness

MagicOneXXX · Post by **MagicOneXXX** » Tue Apr 17, 2007 11:20 am

here are my stats:

Cactid time: 11.3245 seconds
Total poller.php time: 1m5s

Cacti

wierd polling time creep

wierd polling time creep

System info

another observation

Poller output results

ps aux results

no go

over time?

Tried without success

thanks

Who is online