pollers hanging

esproul
Posts: 17
Joined: Wed Sep 14, 2005 11:50 am
Location: Baltimore, MD

pollers hanging

Post by esproul »

Recently I had a problem on my Cacti box where it ran out of memory. That appeared to be mysql's fault, but ever since resetting the box I have been noticing problems with the poller. I am not sure whether the mysql crash caused this, or whether this is what caused mysql to consume so much memory in the first place.

It doesn't happen on every poll, but after leaving it overnight I had about 50 poller processes hung, all on the same snmpget to the same device:

Code:

cacti    27755     1  0 09:35 ?        00:00:00 /usr/bin/php -q /opt/cacti-0.8.6g/cmd.php 0 5
cacti    27758 27755  0 09:35 ?        00:00:00 /usr/bin/php /opt/cacti-0.8.6g/script_server.php cmd
cacti    27778 27755  0 09:35 ?        00:00:00 /usr/bin/snmpget -O vt -c   <comm> -v 2c -t 1 -r 3 <IP>:161 .1.3.6.1.2.1.1.3.0
This is a CentOS 4.1 box running 0.8.6g with net-snmp 5.1.2, php 4.3.9, and mysql 4.1.12. I am monitoring four switches-- two Foundry BigIron 4000s and two Cisco Catalyst 2924s. The above device is one of the Foundry switches. The OID it hangs on is the uptime (SNMPv2-MIB::sysUpTime.0). Running the same snmpget on the command line is always successful and speedy, so I am fairly certain the device is not being slow to respond.
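For reference, here is the manual test-- a sketch using the same flags the poller passes, with <comm> and <IP> elided as above, wrapped in time to confirm the device answers quickly:

Code:

# The same snmpget the poller runs, timed by hand; <comm>/<IP> as redacted above.
time /usr/bin/snmpget -O vt -c <comm> -v 2c -t 1 -r 3 <IP>:161 .1.3.6.1.2.1.1.3.0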

I have 1191 data sources and 403 RRDs. The poller usually takes anywhere from 40 to 60 seconds.

Looking back through the logs, I do see a number of "Maximum runtime exceeded" errors before yesterday's crash. That makes me lean toward a problem with the poller, rather than a mysql bug. The weird thing is that it was running absolutely fine for a week before the problems started, with no changes to the devices being monitored.
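To get a sense of how often this happens, I can count the errors in the poller log-- a sketch; the log path is an assumption based on the install prefix shown in the ps output above:

Code:

# Count "Maximum runtime" errors in the Cacti log; path is an assumption.
grep -c 'Maximum runtime' /opt/cacti-0.8.6g/log/cacti.log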

Can anyone suggest a direction for further investigation?
Thanks,
Eric
rony
Developer/Forum Admin
Posts: 6022
Joined: Mon Nov 17, 2003 6:35 pm
Location: Michigan, USA

Post by rony »

Run a check/repair on the cacti database.

Make sure the poller is not running and issue the following SQL query on the cacti database:

Code:

truncate poller_output;
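For the check/repair step, something like this should work-- a sketch, assuming a local MySQL server and an account with privileges on the cacti database:

Code:

# Check every table in the cacti database and repair any marked as damaged.
# mysqlcheck ships with MySQL; it will prompt for the password.
mysqlcheck --auto-repair -u root -p cacti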
Tony Roman
Experience is what causes a person to make new mistakes instead of old ones.
There are only 3 ways to complete a project: Good, Fast, or Cheap; pick two.
With age comes wisdom; what you choose to do with it determines whether or not you are wise.
esproul
Posts: 17
Joined: Wed Sep 14, 2005 11:50 am
Location: Baltimore, MD

Post by esproul »

rony wrote:Run a check/repair on the cacti database.
Thanks for the tip. Looks like it didn't need to do anything:

Code:

mysql> truncate poller_output;
Query OK, 0 rows affected (0.01 sec)
Eric
esproul
Posts: 17
Joined: Wed Sep 14, 2005 11:50 am
Location: Baltimore, MD

Post by esproul »

Hmmm... still happening. The poller_output table is empty, but I have three stalled pollers in the last 30 minutes. Same device, same OID.

Code:

cacti    28290     1  0 11:50 ?        00:00:00 /usr/bin/php -q /opt/cacti-0.8.6g/cmd.php 0 5
cacti    28293 28290  0 11:50 ?        00:00:00 /usr/bin/php /opt/cacti-0.8.6g/script_server.php cmd
cacti    28313 28290  0 11:50 ?        00:00:00 /usr/bin/snmpget -O vt -c           -v 2c -t 1 -r 3 <IP>:161 .1.3.6.1.2.1.1.3.0
cacti    29597     1  0 12:00 ?        00:00:00 /usr/bin/php -q /opt/cacti-0.8.6g/cmd.php 0 5
cacti    29600 29597  0 12:00 ?        00:00:00 /usr/bin/php /opt/cacti-0.8.6g/script_server.php cmd
cacti    29620 29597  0 12:00 ?        00:00:00 /usr/bin/snmpget -O vt -c           -v 2c -t 1 -r 3 <IP>:161 .1.3.6.1.2.1.1.3.0
cacti    30907     1  0 12:10 ?        00:00:00 /usr/bin/php -q /opt/cacti-0.8.6g/cmd.php 0 5
cacti    30910 30907  0 12:10 ?        00:00:00 /usr/bin/php /opt/cacti-0.8.6g/script_server.php cmd
cacti    30930 30907  0 12:10 ?        00:00:00 /usr/bin/snmpget -O vt -c           -v 2c -t 1 -r 3 <IP>:161 .1.3.6.1.2.1.1.3.0
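One way to see what a stuck snmpget is blocked on-- a sketch, using a PID from the listing above; strace is standard on CentOS:

Code:

# Attach to a hung snmpget and watch its system calls; a process sitting
# in recvfrom()/select() is waiting on the network, not on CPU.
strace -p 28313

# Its open file descriptors will also show the UDP socket it is waiting on.
ls -l /proc/28313/fd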
rony
Developer/Forum Admin
Posts: 6022
Joined: Mon Nov 17, 2003 6:35 pm
Location: Michigan, USA

Post by rony »

What is your host down detection method?
esproul
Posts: 17
Joined: Wed Sep 14, 2005 11:50 am
Location: Baltimore, MD

Post by esproul »

rony wrote:What is your host down detection method?
SNMP only. Should I change it to SNMP+Ping?

Eric
rony
Developer/Forum Admin
Posts: 6022
Joined: Mon Nov 17, 2003 6:35 pm
Location: Michigan, USA

Post by rony »

No, just curious-- there are some issues with the ping-only detection method...

Still thinking....
esproul
Posts: 17
Joined: Wed Sep 14, 2005 11:50 am
Location: Baltimore, MD

Post by esproul »

Well, I went ahead and set it to SNMP+Ping, using UDP ping. Even though there are still some maximum-runtime errors, the poller seems to be cleaning itself up OK now.

The hardware is pretty modest, so perhaps that is contributing to the issue. It's a single-CPU P3-450, but it has 1GB of RAM and SCSI disks.
TheWitness
Developer
Posts: 17007
Joined: Tue May 14, 2002 5:08 pm
Location: MI, USA

Post by TheWitness »

It's more likely a problematic script or one with a large timeout.

TheWitness
esproul
Posts: 17
Joined: Wed Sep 14, 2005 11:50 am
Location: Baltimore, MD

Post by esproul »

TheWitness wrote:It's more likely a problematic script or one with a large timeout.
I am not using any scripts, only SNMP queries with the default SNMP settings (500ms timeout). I don't see any upward trend in the cacti log, either. The poller either takes 40-60 seconds and finishes normally, or hangs for more than 296 seconds and then exits. There does not seem to be a pattern to the hangs-- sometimes it happens every other interval; other times it can go 10 or more intervals with no problem.

All the graphs for the problem switch are now riddled with gaps due to these timeouts. But every time I test the queries with snmpget/snmpwalk, I have no problems. An identical switch sitting next to it is having no trouble at all. CPU usage on both switches is identical, at no more than 2% utilization. Both are Foundry BigIron 4000s with 72 10/100 ports each, and not all of the ports are in use. This feels like one of those annoying "once in a while" issues that is difficult to pin down.

I guess I will try some things on the server and see whether the switch is giving inconsistent responses.
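My plan is a simple loop that repeats the exact query the poller issues and logs any failure-- a sketch, with <comm> and <IP> elided as above:

Code:

# Run the poller's query once per second for an hour; log a timestamp
# whenever snmpget fails, to catch intermittent non-responses.
for i in $(seq 1 3600); do
    snmpget -O vt -c <comm> -v 2c -t 1 -r 3 <IP>:161 .1.3.6.1.2.1.1.3.0 \
        > /dev/null 2>&1 || date '+%F %T snmpget failed' >> /tmp/snmp-test.log
    sleep 1
done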

Thanks,
Eric