pollers hanging

esproul
Posts: 17
Joined: Wed Sep 14, 2005 11:50 am
Location: Baltimore, MD

pollers hanging

Post by esproul »

Recently I had a problem on my Cacti box where it ran out of memory. That appeared to be mysql's fault, but ever since resetting the box I have been noticing problems with the poller. I am not sure whether the mysql crash caused this, or whether this is what caused mysql to consume so much memory in the first place.

It doesn't happen on every poll, but after leaving it overnight I had about 50 poller processes hung, all on the same snmpget to the same device:

Code:

cacti    27755     1  0 09:35 ?        00:00:00 /usr/bin/php -q /opt/cacti-0.8.6g/cmd.php 0 5
cacti    27758 27755  0 09:35 ?        00:00:00 /usr/bin/php /opt/cacti-0.8.6g/script_server.php cmd
cacti    27778 27755  0 09:35 ?        00:00:00 /usr/bin/snmpget -O vt -c   <comm> -v 2c -t 1 -r 3 <IP>:161 .1.3.6.1.2.1.1.3.0
This is a CentOS 4.1 box running 0.8.6g with net-snmp 5.1.2, php 4.3.9, and mysql 4.1.12. I am monitoring four switches-- two Foundry BigIron 4000s and two Cisco Catalyst 2924s. The above device is one of the Foundry switches. The OID it hangs on is the uptime (SNMPv2-MIB::sysUpTime.0). Running the same snmpget on the command line is always successful and speedy, so I am fairly certain the device is not being slow to respond.
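For reference, here is the manual test-- a sketch using the same flags the poller passes, with <comm> and <IP> elided as above, wrapped in time to confirm the device answers quickly:

Code:

# The same snmpget the poller runs, timed by hand; <comm>/<IP> as redacted above.
time /usr/bin/snmpget -O vt -c <comm> -v 2c -t 1 -r 3 <IP>:161 .1.3.6.1.2.1.1.3.0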

I have 1191 data sources and 403 RRDs. The poller usually takes anywhere from 40 to 60 seconds.

Looking back through the logs, I do see a number of "Maximum runtime exceeded" errors before yesterday's crash. That makes me lean toward a problem with the poller, rather than a mysql bug. The weird thing is that it was running absolutely fine for a week before the problems started, with no changes to the devices being monitored.
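To get a sense of how often this happens, I can count the errors in the poller log-- a sketch; the log path is an assumption based on the install prefix shown in the ps output above:

Code:

# Count "Maximum runtime" errors in the Cacti log; path is an assumption.
grep -c 'Maximum runtime' /opt/cacti-0.8.6g/log/cacti.log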

Can anyone suggest a direction for further investigation?
Thanks,
Eric
rony
Developer/Forum Admin
Posts: 6022
Joined: Mon Nov 17, 2003 6:35 pm
Location: Michigan, USA

Post by rony »

Run a check/repair on the cacti database.

Make sure the poller is not running and issue the following SQL query on the cacti database:

Code:

truncate poller_output;
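For the check/repair step, something like this should work-- a sketch, assuming a local MySQL server and an account with privileges on the cacti database:

Code:

# Check every table in the cacti database and repair any marked as damaged.
# mysqlcheck ships with MySQL; it will prompt for the password.
mysqlcheck --auto-repair -u root -p cacti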
Tony Roman
Experience is what causes a person to make new mistakes instead of old ones.
There are only 3 ways to complete a project: Good, Fast, or Cheap; pick two.
With age comes wisdom; what you choose to do with it determines whether or not you are wise.
esproul
Posts: 17
Joined: Wed Sep 14, 2005 11:50 am
Location: Baltimore, MD

Post by esproul »

rony wrote:Run a check/repair on the cacti database.
Thanks for the tip. Looks like it didn't need to do anything:

Code:

mysql> truncate poller_output;
Query OK, 0 rows affected (0.01 sec)
Eric
esproul
Posts: 17
Joined: Wed Sep 14, 2005 11:50 am
Location: Baltimore, MD

Post by esproul »

Hmmm... still happening. The poller_output table is empty, but I have three stalled pollers in the last 30 minutes. Same device, same OID.

Code:

cacti    28290     1  0 11:50 ?        00:00:00 /usr/bin/php -q /opt/cacti-0.8.6g/cmd.php 0 5
cacti    28293 28290  0 11:50 ?        00:00:00 /usr/bin/php /opt/cacti-0.8.6g/script_server.php cmd
cacti    28313 28290  0 11:50 ?        00:00:00 /usr/bin/snmpget -O vt -c           -v 2c -t 1 -r 3 <IP>:161 .1.3.6.1.2.1.1.3.0
cacti    29597     1  0 12:00 ?        00:00:00 /usr/bin/php -q /opt/cacti-0.8.6g/cmd.php 0 5
cacti    29600 29597  0 12:00 ?        00:00:00 /usr/bin/php /opt/cacti-0.8.6g/script_server.php cmd
cacti    29620 29597  0 12:00 ?        00:00:00 /usr/bin/snmpget -O vt -c           -v 2c -t 1 -r 3 <IP>:161 .1.3.6.1.2.1.1.3.0
cacti    30907     1  0 12:10 ?        00:00:00 /usr/bin/php -q /opt/cacti-0.8.6g/cmd.php 0 5
cacti    30910 30907  0 12:10 ?        00:00:00 /usr/bin/php /opt/cacti-0.8.6g/script_server.php cmd
cacti    30930 30907  0 12:10 ?        00:00:00 /usr/bin/snmpget -O vt -c           -v 2c -t 1 -r 3 <IP>:161 .1.3.6.1.2.1.1.3.0
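One way to see what a stuck snmpget is blocked on-- a sketch, using a PID from the listing above; strace is standard on CentOS:

Code:

# Attach to a hung snmpget and watch its system calls; a process sitting
# in recvfrom()/select() is waiting on the network, not on CPU.
strace -p 28313

# Its open file descriptors will also show the UDP socket it is waiting on.
ls -l /proc/28313/fd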
rony
Developer/Forum Admin
Posts: 6022
Joined: Mon Nov 17, 2003 6:35 pm
Location: Michigan, USA

Post by rony »

What is your host down detection method?
esproul
Posts: 17
Joined: Wed Sep 14, 2005 11:50 am
Location: Baltimore, MD

Post by esproul »

rony wrote:What is your host down detection method?
SNMP only. Should I change it to SNMP+Ping?

Eric
rony
Developer/Forum Admin
Posts: 6022
Joined: Mon Nov 17, 2003 6:35 pm
Location: Michigan, USA

Post by rony »

No, just curious-- there are some issues with the ping-only detection method...

Still thinking....
esproul
Posts: 17
Joined: Wed Sep 14, 2005 11:50 am
Location: Baltimore, MD

Post by esproul »

Well, I went ahead and set it to SNMP+Ping, using UDP ping. Even though there are still some maximum-runtime errors, the poller seems to be cleaning itself up OK now.

The hardware is pretty modest, so perhaps that is contributing to the issue. It's a single-CPU P3-450, but it has 1GB of RAM and SCSI disks.
TheWitness
Developer
Posts: 17007
Joined: Tue May 14, 2002 5:08 pm
Location: MI, USA

Post by TheWitness »

It's more likely a problematic script or one with a large timeout.

TheWitness
esproul
Posts: 17
Joined: Wed Sep 14, 2005 11:50 am
Location: Baltimore, MD

Post by esproul »

TheWitness wrote:It's more likely a problematic script or one with a large timeout.
I am not using any scripts, only SNMP queries with the default SNMP settings (500ms timeout). I don't see any upward trend in the cacti log, either. The poller either takes 40-60 seconds and finishes normally, or hangs for more than 296 seconds and then exits. There does not seem to be a pattern to the hangs-- sometimes it happens every other interval; other times it can go 10 or more intervals with no problem.

All the graphs for the problem switch are now riddled with gaps due to these timeouts. But every time I test the queries with snmpget/snmpwalk, I have no problems. An identical switch sitting next to it is having no trouble at all. CPU usage on both switches is identical, at no more than 2% utilization. Both are Foundry BigIron 4000s with 72 10/100 ports each, and not all of the ports are in use. This feels like one of those annoying "once in a while" issues that is difficult to pin down.

I guess I will try some things on the server and see whether the switch is giving inconsistent responses.
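My plan is a simple loop that repeats the exact query the poller issues and logs any failure-- a sketch, with <comm> and <IP> elided as above:

Code:

# Run the poller's query once per second for an hour; log a timestamp
# whenever snmpget fails, to catch intermittent non-responses.
for i in $(seq 1 3600); do
    snmpget -O vt -c <comm> -v 2c -t 1 -r 3 <IP>:161 .1.3.6.1.2.1.1.3.0 \
        > /dev/null 2>&1 || date '+%F %T snmpget failed' >> /tmp/snmp-test.log
    sleep 1
done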

Thanks,
Eric