[SOLVED] Spine 0.8.7a Timed Out



User avatar
rcaston
Cacti User
Posts: 204
Joined: Tue Jan 06, 2004 7:47 pm
Location: US-Dallas, TX
Contact:

[SOLVED] Spine 0.8.7a Timed Out

Post by rcaston »

01/11/2008 05:09:48 PM - SPINE: Poller[0] ERROR: Spine Timed Out While Processing Hosts Internal
01/11/2008 05:09:49 PM - SYSTEM STATS: Time:294.9240 Method:spine Processes:8 Threads:20 Hosts:103 HostsPerProcess:13 DataSources:167023 RRDsProcessed:78246

Any thoughts?


*************** SOLUTION **********************************
Using Rheinhard's "PollPerf" plugin, I was able to determine that a small number of hosts were causing the poller to hang for the full poll cycle. Examining those hosts revealed one item that made them different from every other host being polled: a data query based on a Perl script called qospol.pl.

This script was slow enough that, given my number of devices, it caused the poll to take too long, so it timed out every cycle.

For anyone experiencing poller timeouts and excessively long poller times, I recommend using this plugin to narrow your focus to the specific hosts causing the problem; that leaves a much smaller set of items to examine to find the root cause.
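A minimal sketch (not PollPerf itself) of the kind of per-host timing that points the finger at a slow script-based data query; the host list, script path, and arguments below are placeholders, not values from this install:

Code: Select all

#!/usr/bin/env python3
# Rough per-host timing of a script-based data query; not PollPerf itself.
# HOSTS, SCRIPT and the argument list are illustrative guesses only.
import subprocess
import time

HOSTS = ["192.168.2.100", "192.168.2.101"]       # placeholder addresses
SCRIPT = "/var/www/cacti/scripts/qospol.pl"      # assumed path to the slow script

timings = []
for host in HOSTS:
    start = time.monotonic()
    # Run the data query roughly the way the poller would; arguments are guesses.
    subprocess.run(["perl", SCRIPT, host, "index"],
                   capture_output=True, timeout=300)
    timings.append((time.monotonic() - start, host))

# The slowest hosts are the first place to look when the cycle overruns 300 s.
for elapsed, host in sorted(timings, reverse=True):
    print(f"{elapsed:8.2f}s  {host}")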
Last edited by rcaston on Fri Jan 18, 2008 10:41 am, edited 8 times in total.
User avatar
rcaston
Cacti User
Posts: 204
Joined: Tue Jan 06, 2004 7:47 pm
Location: US-Dallas, TX
Contact:

Post by rcaston »

Thinking the problem was simply that I have too many data sources being polled within a 300-second window, I disabled most of the larger devices and got the following:

Jan 11 18:54:56 vsrvr-nms-01-cacti Cacti[9890]: SYSTEM: STATS: Time:294.3332 Method:spine Processes:6 Threads:15 Hosts:65 HostsPerProcess:11 DataSources:31544 RRDsProcessed:10548
Jan 11 18:59:56 vsrvr-nms-01-cacti Cacti[13125]: SYSTEM: STATS: Time:294.7467 Method:spine Processes:6 Threads:15 Hosts:65 HostsPerProcess:11 DataSources:31544 RRDsProcessed:10542

Even with 1/7 of the RRDs being processed, the spine poller cannot finish in under 300 seconds.

Something is causing it to hang.
Last edited by rcaston on Tue Jan 15, 2008 4:00 pm, edited 2 times in total.
bbice
Cacti User
Posts: 71
Joined: Mon May 13, 2002 6:53 pm

Post by bbice »

Do you have any devices that are flagged as down? If so, maybe disable them.

I've got several devices on the other side of a firewall that I can (at least right now) only monitor via TCP. One day I noticed that if one of those systems was down, cactid gathered data for everything that was up but still took almost the full 5 minutes before giving up. I never bothered to try cmd.php instead; I simply disabled the down device temporarily.

Maybe it is something to do with the way the firewall is configured? (Dropping packets rather than rejecting them, I mean.) (shrug)
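A back-of-the-envelope sketch of why a silently dropped host hurts so much: every get request sits through the full timeout and every retry before giving up. The timeout, retry, and request counts below are illustrative guesses, not settings from this install:

Code: Select all

# One unreachable host behind a packet-dropping firewall eats the poll window
# because nothing ever answers; all figures here are illustrative guesses.
snmp_timeout_s = 5         # per-request SNMP timeout
snmp_retries = 3           # retries after the first attempt
requests_to_host = 20      # get requests the poller would issue to that host

# Each request waits the full timeout for every attempt before giving up.
worst_case = requests_to_host * snmp_timeout_s * (1 + snmp_retries)
print(f"Time spent waiting on the dead host: {worst_case} s of a 300 s cycle")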
User avatar
rcaston
Cacti User
Posts: 204
Joined: Tue Jan 06, 2004 7:47 pm
Location: US-Dallas, TX
Contact:

Post by rcaston »

bbice wrote: Do you have any devices that are flagged as down? If so, maybe disable them.

I've got several devices on the other side of a firewall that I can (at least right now) only monitor via TCP. One day I noticed that if one of those systems was down, cactid gathered data for everything that was up but still took almost the full 5 minutes before giving up. I never bothered to try cmd.php instead; I simply disabled the down device temporarily.

It's good advice, but all my devices are showing as 'up'.

From a debug perspective, all I ever really notice is that I get a large number of errors about "Partial Results" during polling.

Code: Select all

01/11/2008 02:35:50 PM - SPINE: Poller[0] Host[94] DS[62792] SNMP: v2: 192.168.2.100, dsname: traffic_in, oid: .1.3.6.1.2.1.31.1.1.1.6.2609, value: U
With each poll ending with the dreaded:

Code: Select all

SPINE: Poller[0] ERROR: Spine Timed Out While Processing Hosts Internal
myfreeke
Cacti User
Posts: 82
Joined: Tue Dec 04, 2007 10:24 pm

Post by myfreeke »

It's a spine bug.
User avatar
rcaston
Cacti User
Posts: 204
Joined: Tue Jan 06, 2004 7:47 pm
Location: US-Dallas, TX
Contact:

Post by rcaston »

myfreeke wrote: It's a spine bug.

I found the spine thread about the new spine.c and compiled it. It helped some, but did not solve the issue.

So the new spine.c did not fix this. :cry:

Instead of always timing out at 294-299 seconds, it can now finish around 260 seconds, but when it does, it skips the processing of a lot of data sources. See the graph below.

Code: Select all

Jan 15 13:00:01 SYSTEM: STATS: Time:299.0197 Method:spine Processes:6 Threads:30 Hosts:98 HostsPerProcess:17 DataSources:158038 RRDsProcessed:73819
Jan 15 13:04:29 SYSTEM: STATS: Time:268.1955 Method:spine Processes:6 Threads:30 Hosts:98 HostsPerProcess:17 DataSources:158038 RRDsProcessed:73302
Jan 15 13:09:28 SYSTEM: STATS: Time:267.2372 Method:spine Processes:6 Threads:30 Hosts:98 HostsPerProcess:17 DataSources:158038 RRDsProcessed:73302
Jan 15 13:15:01 SYSTEM: STATS: Time:299.3925 Method:spine Processes:6 Threads:30 Hosts:98 HostsPerProcess:17 DataSources:158038 RRDsProcessed:73819
Jan 15 13:19:29 SYSTEM: STATS: Time:267.9404 Method:spine Processes:6 Threads:30 Hosts:98 HostsPerProcess:17 DataSources:158038 RRDsProcessed:73301
Jan 15 13:24:27 SYSTEM: STATS: Time:265.9304 Method:spine Processes:6 Threads:30 Hosts:98 HostsPerProcess:17 DataSources:158038 RRDsProcessed:73302
Jan 15 13:30:01 SYSTEM: STATS: Time:299.2016 Method:spine Processes:6 Threads:30 Hosts:98 HostsPerProcess:17 DataSources:158038 RRDsProcessed:73819
Attachment: spinec2.JPG (poller runtime, before and after new spine.c)
User avatar
TheWitness
Developer
Posts: 17007
Joined: Tue May 14, 2002 5:08 pm
Location: MI, USA
Contact:

Post by TheWitness »

Rodney, do this math:

(Average Host Latency * Data Sources for Host) / (Max OIDs Per Get Request) = XX Seconds

Do that for all of your 1XX hosts. Then take the max, min, and average. Multiply the average by 17 (your hosts per process) and see what you get.
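A quick sketch of that arithmetic, using made-up latency and data-source figures rather than numbers from this install:

Code: Select all

# Sketch of the math above; latency and data-source figures are made-up
# placeholders, not measurements from this install.
hosts = {
    # host: (average latency in seconds, data sources, max OIDs per get request)
    "router-a": (0.050, 1800, 60),
    "router-b": (0.200, 2500, 60),
    "switch-c": (0.020,  900, 60),
}

def estimate_seconds(latency, data_sources, max_oids_per_get):
    # (Average Host Latency * Data Sources for Host) / (Max OIDs Per Get Request)
    return latency * data_sources / max_oids_per_get

estimates = [estimate_seconds(*v) for v in hosts.values()]
avg = sum(estimates) / len(estimates)
print(f"min={min(estimates):.1f}s  max={max(estimates):.1f}s  avg={avg:.1f}s")
# With 17 hosts handled per process, a rough serial worst case per process is:
print(f"avg * 17 = {avg * 17:.1f}s  (compare against the 300 s poll window)")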

Also, the more OID errors you get, the more repolling has to go on. Invalid OIDs require a repoll of the remaining OIDs over and over again until there are no more errors. It's inefficient as all get out. So, if you are getting lots of unknowns, this totally destroys your scalability.

Here is the example:

Say your max OIDs per get request is 65.

Say in request 1 you had 15 invalid OIDs, so 50 good and 15 bad.

The way SNMP works, when you get back an error you have to pull out the bad OID and then poll again. So for this one request it takes roughly 15 polls to get the complete answer. If there had been no invalid OIDs in that time, you would have polled almost 1000 data sources.
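The same example in number form, using the figures above (65 OIDs per get, 15 invalid in the first request); the one-bad-OID-removed-per-error behaviour described above is what the sketch assumes:

Code: Select all

# Repoll penalty from the example above: 65 OIDs per get, 15 invalid.
max_oids_per_get = 65
bad_oids = 15

# Each error forces one bad OID to be pulled out and the request re-issued,
# so finishing this one batch costs roughly one poll per bad OID.
polls_spent = bad_oids                  # ~15 polls for a single batch of 65 OIDs
clean_capacity = polls_spent * max_oids_per_get
print(f"~{polls_spent} polls for one batch of {max_oids_per_get} OIDs; "
      f"the same polls against a clean host cover ~{clean_capacity} data sources")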

Regards,

Larry
