Okay, I have a 1.2GHz machine with 1GB of RAM, running Cacti 0.8.7a with Spine 0.8.7a. At first the issue was just one gap in most of the graphs at one specific time of day, every day. Now it seems to have degraded to several gaps randomly throughout the day. Running "top" shows several php and spine processes running. There's only one cron entry set up, but I do think one run sometimes overlaps into the next. I have several of these in the Cacti log: "WARNING: Result from SNMP not valid. Partial Result: ..." and quite often this: "WARNING: Poller Output Table not Empty. Potential Data Source Issues for Data Sources"
A reboot of the machine kills off the extra processes and lowers the number of gaps, but they never entirely go away.
Here's a line from a successful poll:
SYSTEM STATS: Time:14.4864 Method:spine Processes:1 Threads:15 Hosts:10 HostsPerProcess:10 DataSources:384 RRDsProcessed:165
Turning logging up to HIGH, I get this:
POLLER: Poller[0] NOTE: Cron is configured to run too often! The Poller Interval is '60' seconds, with a minimum Cron period of '60' seconds, but only 180 seconds have passed since the poller last ran.
10/21/2008 11:06:14 AM - POLLER: Poller[0] NOTE: Poller Int: '60', Cron Int: '300', Time Since Last: '180', Max Runtime '298', Poller Runs: '5'
The polling interval is every minute and cron runs every 5 minutes.
(I've read this message can mean there's more than one cron entry, but I've checked /etc/cron.d, the users' crontabs and /etc/crontab, and it's only mentioned in /etc/crontab.)
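For reference, a single standard Cacti polling entry in /etc/crontab (the system crontab, so it includes a user field) looks roughly like the line below; the user name and install path here are assumptions, so adjust them to your box:
Code:
*/5 * * * * cactiuser php /var/www/html/cacti/poller.php > /dev/null 2>&1
If an equivalent line also lives in a user's crontab or under /etc/cron.d, the poller gets launched twice and the runs overlap.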
Something else that seems weird to me: when viewing the poller cache there seem to be duplicate entries, except that one number in the OID differs:
rtr1 - Errors - 208.x.x.x - Gi5/3 SNMP Version: 2, Community: Nagix, OID: .1.3.6.1.2.1.2.2.1.13.51
RRD: /var/www/html/rra/6509-stldist-rtr1_errors_in_470.rrd
rtr1 - Errors - 208.x.x.x - Gi5/3 SNMP Version: 2, Community: Nagix, OID: .1.3.6.1.2.1.2.2.1.19.51
RRD: /var/www/html/rra/6509-stldist-rtr1_errors_in_470.rrd
rtr1 - Errors - 208.x.x.x - Gi5/3 SNMP Version: 2, Community: Nagix, OID: .1.3.6.1.2.1.2.2.1.14.51
RRD: /var/www/html/rra/6509-stldist-rtr1_errors_in_470.rrd
rtr1 - Errors - 208.x.x.x - Gi5/3 SNMP Version: 2, Community: Nagix, OID: .1.3.6.1.2.1.2.2.1.20.51
RRD: /var/www/html/rra/6509-stldist-rtr1_errors_in_470.rrd
I don't know whether this could be part of the problem, but I'm including it in case it means something to somebody.
There are also 327 data sources.
I hope I've included enough information. Hope somebody can help!
Cacti degrading...any hope?
- TheWitness
- Developer
Well, your system is not weak. However, there are a few notes of interest:
1) Those poller items are not duplicates. Look closely at the OIDs (there's a quick snmpget sketch after this list).
2) The cron sync issue was a bug corrected in 0.8.7b. It still happens from time to time if your cron start time varies a lot. This can be remediated and should be. Right now we only allow 5 seconds for cron to launch the process. I have increased it to 10 and that seems to help. The change would be in poller.php (search for the number 5 and you will eventually track it down).
3) If you are using spine, the "Poller Output Table not Empty" warnings may be from an anomaly that just received a bug ticket the other day, where some error counters from a 4-counter set are missing. I am still exploring what to do with this issue. It's a vendor-specific corner case.
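To expand on 1): those OIDs are adjacent columns of the standard IF-MIB ifTable for the same interface index (51). Assuming the standard MIB applies here, .13 is ifInDiscards, .14 is ifInErrors, .19 is ifOutDiscards and .20 is ifOutErrors, so those four poller cache entries are four different counters feeding the same errors graph, not duplicates. A quick manual check with Net-SNMP, using the community and masked address from the listing above, would look something like:
Code:
snmpget -v 2c -c Nagix 208.x.x.x .1.3.6.1.2.1.2.2.1.14.51 .1.3.6.1.2.1.2.2.1.20.51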
TheWitness
True understanding begins only when we realize how little we truly understand...
Life is an adventure, let yours begin with Cacti!
Author of dozens of Cacti plugins and customizations. Advocate of LAMP, MariaDB, IBM Spectrum LSF and the world of batch. Creator of IBM Spectrum RTM, author of quite a bit of unpublished work and most of Cacti's bugs.
_________________
Official Cacti Documentation
GitHub Repository with Supported Plugins
Percona Device Packages (no support)
Interesting Device Packages
For those wondering, I'm still here, but lost in the shadows. Yearning for fewer bugs. Who wants a Cacti 1.3/2.0? Streams anyone?
- oxo-oxo
- Cacti User
Spine ran out of time and exited: a possible cause of the gaps ...
- over to TheWitness ...
Code:
/* get current time and exit program if time limit exceeded */
if (poller_counter >= 20) {
    current_time = get_time_as_double();

    if ((current_time - begin_time + 6) > poller_interval) {
        SPINE_LOG(("ERROR: Spine Timed Out While Processing Hosts Internal\n"));
        canexit = 1;
        break;
    }

    poller_counter = 0;
} else {
    poller_counter++;
}
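In other words, every 20 hosts the loop re-checks the wall clock; once the elapsed time plus a 6-second safety margin exceeds the poller interval, Spine logs that timeout error, sets canexit and breaks out, leaving the rest of the hosts unpolled for that cycle. With a 60-second poller interval that margin is not much, and a handful of SNMP timeouts on one host can be enough to trip it, which would show up as exactly these kinds of gaps.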
Owen Brotherwood, JN Data A/S, Denmark.