[SOLVED] Spine 0.8.7a Timed Out

User avatar
rcaston
Cacti User
Posts: 204
Joined: Tue Jan 06, 2004 7:47 pm
Location: US-Dallas, TX
Contact:

[SOLVED] Spine 0.8.7a Timed Out

Post by rcaston »

01/11/2008 05:09:48 PM - SPINE: Poller[0] ERROR: Spine Timed Out While Processing Hosts Internal
01/11/2008 05:09:49 PM - SYSTEM STATS: Time:294.9240 Method:spine Processes:8 Threads:20 Hosts:103 HostsPerProcess:13 DataSources:167023 RRDsProcessed:78246

Any thoughts?


*************** SOLUTION **********************************
Using Rheinhard's "PollPerf" plugin, I was able to determine that a small number of hosts were causing the poller to hang for the full poll cycle. Examining those hosts revealed one item that set them apart from every other host being polled: a data query based on a perl script called qospol.pl.

Given my number of devices, that script was slow enough to push the poll past its window, so it timed out every cycle.

For anyone experiencing poller timeouts or excessively long poller runtimes, I recommend using this plugin to narrow your focus to the specific hosts causing the problem. That leaves a much smaller set of items to examine for the root cause.
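
As a rough back-of-the-envelope sketch of how a slow per-host script query adds up (every number below is a placeholder, not a measurement from this system):

Code: Select all

# Rough estimate (not part of Cacti): a script-based data query that runs once
# per host per cycle adds roughly hosts * seconds_per_run / threads of serial
# time.  All values here are made-up placeholders -- plug in your own numbers.
hosts_using_script = 12     # hypothetical count of hosts with the script query
seconds_per_run    = 8.0    # hypothetical wall time of one script run
concurrent_threads = 4      # hypothetical threads able to run it in parallel

added_seconds = hosts_using_script * seconds_per_run / concurrent_threads
print(f"The script query adds roughly {added_seconds:.0f}s to each 300s cycle")

Even a few seconds per run multiplies out quickly once enough hosts carry the same data query.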
Last edited by rcaston on Fri Jan 18, 2008 10:41 am, edited 8 times in total.
User avatar
rcaston
Cacti User
Posts: 204
Joined: Tue Jan 06, 2004 7:47 pm
Location: US-Dallas, TX
Contact:

Post by rcaston »

Thinking the problem was simply that I have too many data sources to poll within a 300-second window, I disabled most of the larger devices and got the following:

Jan 11 18:54:56 vsrvr-nms-01-cacti Cacti[9890]: SYSTEM: STATS: Time:294.3332 Method:spine Processes:6 Threads:15 Hosts:65 HostsPerProcess:11 DataSources:31544 RRDsProcessed:10548
Jan 11 18:59:56 vsrvr-nms-01-cacti Cacti[13125]: SYSTEM: STATS: Time:294.7467 Method:spine Processes:6 Threads:15 Hosts:65 HostsPerProcess:11 DataSources:31544 RRDsProcessed:10542

Even with 1/7th of the RRDs being processed, the spine poller cannot finish in under 300 seconds.

Something is causing it to hang.
Last edited by rcaston on Tue Jan 15, 2008 4:00 pm, edited 2 times in total.
bbice
Cacti User
Posts: 71
Joined: Mon May 13, 2002 6:53 pm

Post by bbice »

Do you have any devices that are flagged as down? If so, maybe disable them.

I've got several devices on the other side of a firewall that I can (at least right now) only monitor via TCP. One day I noticed that if one of those systems was down, cactid gathered data for everything that was up, but it also took almost the full 5 minutes before giving up. I never bothered to try cmd.php instead; I simply disabled the down device temporarily.

Maybe something to do with the way the firewall is configured? (Dropping packets rather than rejecting them, I mean.) (shrug)
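
As a rough illustration of why a silent drop hurts so much more than a reject (the timeout, retry, and request counts below are assumptions, not anyone's real settings):

Code: Select all

# Hypothetical numbers -- substitute your own SNMP Timeout / Retries settings.
# A firewall that drops packets forces the poller to sit through every retry;
# a reject (ICMP unreachable) fails almost immediately.
snmp_timeout_ms   = 500    # per-request timeout (assumption)
snmp_retries      = 3      # retries after the first attempt (assumption)
requests_per_host = 40     # get requests needed to cover the host's OIDs (assumption)

wasted = (snmp_timeout_ms / 1000.0) * (1 + snmp_retries) * requests_per_host
print(f"One silently dropped host can eat ~{wasted:.0f}s of a thread's 300s window")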
User avatar
rcaston
Cacti User
Posts: 204
Joined: Tue Jan 06, 2004 7:47 pm
Location: US-Dallas, TX
Contact:

Post by rcaston »

bbice wrote:Do you have any devices that are flagged as down? If so, maybe disable them.
It's good advice but all my devices are showing as 'up'.

From a debug perspective, all I ever really notice is a large number of "Partial Results" errors during polling.

Code: Select all

01/11/2008 02:35:50 PM - SPINE: Poller[0] Host[94] DS[62792] SNMP: v2: 192.168.2.100, dsname: traffic_in, oid: .1.3.6.1.2.1.31.1.1.1.6.2609, value: U
Each poll ends with the dreaded:

Code: Select all

SPINE: Poller[0] ERROR: Spine Timed Out While Processing Hosts Internal
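
A minimal sketch for narrowing that down (it assumes the SPINE log line format shown above and a default cacti.log path): tally how many "value: U" results each host returns, so the worst offenders float to the top.

Code: Select all

# Count "value: U" (unknown / partial result) lines per host in the poller log.
# The log path and line format are assumptions based on the snippets above.
import re
from collections import Counter

HOST = re.compile(r"Host\[(\d+)\]")
unknowns = Counter()

with open("/var/www/cacti/log/cacti.log") as log:   # path is an assumption
    for line in log:
        if "value: U" not in line:
            continue
        m = HOST.search(line)
        if m:
            unknowns[m.group(1)] += 1

for host_id, count in unknowns.most_common(10):
    print(f"Host[{host_id}]: {count} unknown values")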
myfreeke
Cacti User
Posts: 82
Joined: Tue Dec 04, 2007 10:24 pm

Post by myfreeke »

It's a spine bug.
User avatar
rcaston
Cacti User
Posts: 204
Joined: Tue Jan 06, 2004 7:47 pm
Location: US-Dallas, TX
Contact:

Post by rcaston »

myfreeke wrote:It's a spine bug.

I found the spine thread about the new spine.c and compiled it. It helped some, but it did not solve the issue.

So the new spine.c did not fix this. :cry:

Instead of always timing out at 294-299 seconds, the poller can now finish around 260 seconds, but when it does, it skips processing a lot of data sources. See the graph below.

Code: Select all

Jan 15 13:00:01 SYSTEM: STATS: Time:299.0197 Method:spine Processes:6 Threads:30 Hosts:98 HostsPerProcess:17 DataSources:158038 RRDsProcessed:73819
Jan 15 13:04:29 SYSTEM: STATS: Time:268.1955 Method:spine Processes:6 Threads:30 Hosts:98 HostsPerProcess:17 DataSources:158038 RRDsProcessed:73302
Jan 15 13:09:28 SYSTEM: STATS: Time:267.2372 Method:spine Processes:6 Threads:30 Hosts:98 HostsPerProcess:17 DataSources:158038 RRDsProcessed:73302
Jan 15 13:15:01 SYSTEM: STATS: Time:299.3925 Method:spine Processes:6 Threads:30 Hosts:98 HostsPerProcess:17 DataSources:158038 RRDsProcessed:73819
Jan 15 13:19:29 SYSTEM: STATS: Time:267.9404 Method:spine Processes:6 Threads:30 Hosts:98 HostsPerProcess:17 DataSources:158038 RRDsProcessed:73301
Jan 15 13:24:27 SYSTEM: STATS: Time:265.9304 Method:spine Processes:6 Threads:30 Hosts:98 HostsPerProcess:17 DataSources:158038 RRDsProcessed:73302
Jan 15 13:30:01 SYSTEM: STATS: Time:299.2016 Method:spine Processes:6 Threads:30 Hosts:98 HostsPerProcess:17 DataSources:158038 RRDsProcessed:73819
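
A minimal sketch (the log path and line format are assumptions based on the lines above) that pulls the runtime and RRDsProcessed numbers out of cacti.log, so cycles that skip data sources stand out:

Code: Select all

# Scan cacti.log for poller stats lines and flag cycles whose RRDsProcessed
# falls below the best cycle seen so far (i.e. data sources were skipped).
import re

STATS = re.compile(r"Time:(?P<time>[\d.]+).*RRDsProcessed:(?P<rrd>\d+)")
LOGFILE = "/var/www/cacti/log/cacti.log"   # path is an assumption

best = 0
with open(LOGFILE) as log:
    for line in log:
        if "STATS" not in line:
            continue
        m = STATS.search(line)
        if not m:
            continue
        rrd = int(m.group("rrd"))
        best = max(best, rrd)
        flag = "  <-- skipped data sources?" if rrd < best else ""
        print(f"runtime={float(m.group('time')):7.1f}s  RRDsProcessed={rrd}{flag}")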
Attachment: spinec2.JPG (poller runtime, before and after the new spine.c)
User avatar
TheWitness
Developer
Posts: 16997
Joined: Tue May 14, 2002 5:08 pm
Location: MI, USA
Contact:

Post by TheWitness »

Rodney, do this math:

(Average Host Latency * Data Sources for Host) / (Max OIDs Per Get Request) = XX Seconds

Do that for all your 1XX hosts. Then find the:

Max, Min, and Average. Multiply the Average by 17 and what do you get?
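
As a rough sketch of that arithmetic (the latency, data source counts, and max-OIDs setting below are placeholders; substitute the real values for each host):

Code: Select all

# Estimate per-host poll time as (latency * data sources) / max OIDs per get.
# Every value below is a placeholder, not a real measurement.
hosts = [
    {"latency_s": 0.05, "data_sources": 1200},   # placeholder host
    {"latency_s": 0.30, "data_sources": 4500},   # placeholder host
    {"latency_s": 0.02, "data_sources": 300},    # placeholder host
]
max_oids_per_get = 60                            # placeholder poller setting

estimates = [h["latency_s"] * h["data_sources"] / max_oids_per_get
             for h in hosts]
avg = sum(estimates) / len(estimates)
print(f"min={min(estimates):.1f}s  max={max(estimates):.1f}s  avg={avg:.1f}s")
print(f"avg * 17 hosts per process = {avg * 17:.1f}s per poller process")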

Also, the more OID errors you get, the more repolling has to go on. Invalid OIDs force the remaining OIDs to be repolled over and over again until there are no more errors. It's inefficient as all get out. So, if you are getting lots of unknowns, it totally destroys your scalability.

Here is the example:

Say your Max OIDs per get request is 65.

Say in request 1 you had 15 invalid OIDs, so 50 good and 15 bad.

The way SNMP works, when you get back an error you have to pull out the bad OID and then poll again. So for this one request you end up making 15 polls to get the complete answer. If you had no invalid OIDs in that time, you could have polled almost 1,000 data sources instead.
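
Putting numbers on that example (the figures come straight from the example above, nothing measured):

Code: Select all

# Illustrate the repoll penalty from the example: 65 OIDs per get request,
# 15 of them invalid.  Each error response names one bad OID, which has to be
# stripped before the request is retried.
max_oids_per_get = 65
invalid_oids     = 15

round_trips = 1 + invalid_oids
wasted_capacity = invalid_oids * max_oids_per_get
print(f"{round_trips} round trips instead of 1; those {invalid_oids} retries "
      f"could have covered ~{wasted_capacity} good OIDs instead")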

Regards,

Larry