Spine suddenly gone to 298sec, long delay with no logging..

Howie
Cacti Guru User
Posts: 5508
Joined: Thu Sep 16, 2004 5:53 am
Location: United Kingdom

Post by Howie »

I moved to a new (quad-core, 8GB RAM, 15K SAS) Cacti server yesterday, and dropped my poll times from 220sec back down to 30sec, which was very nice :-)

Today, without any special change, it's flipped up to 298 seconds.

Looking in the logs, with DEBUG level logging, I get this:

Code: Select all

12/17/2008 04:26:15 PM - SPINE: Poller[0] Host[301] DS[6653] SCRIPT: /usr/bin/perl /var/www/docs/cacti/scripts/ha7net.pl 10.1.1.22 get value 2B00000080A6DB26, output: 15.1875
12/17/2008 04:26:15 PM - SPINE: Poller[0] Time: 73.6301 s, Threads: 15, Hosts: 56
12/17/2008 04:30:00 PM - POLLER: Poller[0] Maximum runtime of 298 seconds exceeded. Exiting.
12/17/2008 04:30:00 PM - SYSTEM STATS: Time:298.6729 Method:spine Processes:4 Threads:15 Hosts:220 HostsPerProcess:55 DataSources:13262 RRDsProcessed:4010

What is going on in that 3.5 minutes? Nothing is logged at all! Is the SPINE Time line the last thing that SPINE outputs? (so could it be a plugin or something else?)

Switching to CMD polling gives me 150 second polling, which is weird.
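
A silent stretch like this can be located mechanically by comparing the timestamps of consecutive log lines. The following is only a minimal sketch: it assumes every line starts with a `MM/DD/YYYY HH:MM:SS AM/PM - ` stamp as in the excerpt above, and the function names and the 60-second threshold are made up for illustration.

```python
from datetime import datetime

TS_FORMAT = "%m/%d/%Y %I:%M:%S %p"  # e.g. 12/17/2008 04:26:15 PM


def parse_timestamp(line):
    """Return the datetime at the start of a log line, or None if absent."""
    stamp, sep, _ = line.partition(" - ")
    if not sep:
        return None
    try:
        return datetime.strptime(stamp.strip(), TS_FORMAT)
    except ValueError:
        return None


def find_gaps(lines, threshold_seconds=60):
    """Yield (gap_seconds, previous_line, next_line) for each silent period."""
    previous = None
    for line in lines:
        when = parse_timestamp(line)
        if when is None:
            continue  # skip lines without a recognisable timestamp
        if previous is not None:
            gap = (when - previous[0]).total_seconds()
            if gap >= threshold_seconds:
                yield gap, previous[1], line
        previous = (when, line)


sample = [
    "12/17/2008 04:26:15 PM - SPINE: Poller[0] Time: 73.6301 s, Threads: 15, Hosts: 56",
    "12/17/2008 04:30:00 PM - POLLER: Poller[0] Maximum runtime of 298 seconds exceeded. Exiting.",
]
gaps = list(find_gaps(sample))
for gap, before, after in gaps:
    print(f"{gap:.0f}s of silence between:\n  {before}\n  {after}")
```

On the two lines from the excerpt this reports a gap of 225 seconds between the SPINE summary line and the POLLER timeout line.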
Last edited by Howie on Sun Dec 28, 2008 6:54 am, edited 1 time in total.
Weathermap 0.98a is out! & QuickTree 1.0. Superlinks is over there now (and built-in to Cacti 1.x).
Some Other Cacti tweaks, including strip-graphs, icons and snmp/netflow stuff.
(Let me know if you have UK DevOps or Network Ops opportunities, too!)

Post by Howie »

Update: Disabling all plugins didn't help.

I'm just going to try switching back to cmd.php again to make sure it consistently works fast enough.

Post by Howie »

Yep - with CMD.PHP I get better performance... and it's definitely affecting rrd updates.
Attachments: spine-timeouts.png (26.82 KiB), sample.png (17.49 KiB)
fmerrill
Posts: 12
Joined: Tue Oct 28, 2008 7:56 pm
Location: North Carolina

Post by fmerrill »

Have you tried checking to see how many crons are running when this is happening?
When I have seen similar in the past, I would have 3-4 running. I would stop the crond daemon, stop the poller, then kill the crons, then wait until it all settled, then restart the cron daemon, and restart the poller, and it would clear up.

Probably not the issue here, but thought I'd interject it anyway.
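
For checking how many pollers are running at once, a rough sketch is to count the matching lines in `ps` output. It assumes a Linux-style `ps -eo pid,args` and that the poller shows up with `poller.php` in its command line; the sample process list is invented for illustration.

```python
import subprocess


def count_matching(ps_output, needle):
    """Count the lines of ps output whose command line contains needle."""
    return sum(1 for line in ps_output.splitlines() if needle in line)


def running_pollers(needle="poller.php"):
    """Ask ps for all processes and count the ones that look like pollers."""
    result = subprocess.run(
        ["ps", "-eo", "pid,args"], capture_output=True, text=True, check=True
    )
    return count_matching(result.stdout, needle)


# Invented sample output: two overlapping poller runs.
sample_ps = """  PID COMMAND
 1201 /usr/sbin/crond
 4410 php /var/www/docs/cacti/poller.php
 4475 php /var/www/docs/cacti/poller.php
"""
print(count_matching(sample_ps, "poller.php"))
```

More than one match during a single polling interval would suggest that runs are overlapping.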

Post by Howie »

Nope - definitely just the one poller... It had been running OK for a day before this suddenly appeared.

I've just tried Spine 0.8.7c beta 3, with the same results, so back to cmd.php for the moment.
dainiookas
Posts: 34
Joined: Fri Dec 05, 2008 5:49 am
Location: Vilnius, Lithuania

Post by dainiookas »

I have the same problem and have to solve it somehow - cannot use cmd.php because there are
What I have managed to find out is that it hangs while querying some hosts. As far as I know it hangs on unreachable hosts, which I think should be fixed in 0.8.7c. The problem is that when I compile that version I get a buffer overflow error and cannot start it :(

Post by Howie »

Spine starts OK, and collects some data OK. Same thing with the current spine beta too.

I thought I'd found the answer this morning when I saw that the firewall that gives us VPN access to one of the sites we monitor was hitting its session limit. I've replaced that firewall with a bigger one, and retested. No session limits, but still the problem, and still nothing useful in the logs.

Code: Select all

12/22/2008 12:32:14 PM - SYSTEM STATS: Time:133.2594 Method:cmd.php Processes:4 Threads:N/A Hosts:220 HostsPerProcess:55 DataSources:13264 RRDsProcessed:4656
12/22/2008 12:36:46 PM - SYSTEM STATS: Time:104.5887 Method:cmd.php Processes:4 Threads:N/A Hosts:220 HostsPerProcess:55 DataSources:13264 RRDsProcessed:4652
12/22/2008 12:42:45 PM - SYSTEM STATS: Time:163.8084 Method:cmd.php Processes:4 Threads:N/A Hosts:220 HostsPerProcess:55 DataSources:13264 RRDsProcessed:4656

12/22/2008 12:50:09 PM - SYSTEM STATS: Time:307.4751 Method:spine Processes:4 Threads:15 Hosts:220 HostsPerProcess:55 DataSources:13264 RRDsProcessed:2615
12/22/2008 12:55:00 PM - SYSTEM STATS: Time:299.0212 Method:spine Processes:4 Threads:15 Hosts:220 HostsPerProcess:55 DataSources:13264 RRDsProcessed:1242
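
To compare runs like these side by side, the SYSTEM STATS lines can be split into key/value pairs. This is a small sketch that assumes the `Key:Value` layout shown in the lines above; the function name is made up.

```python
def parse_system_stats(line):
    """Split a SYSTEM STATS log line into a dict of string values."""
    _, _, payload = line.partition("SYSTEM STATS: ")
    stats = {}
    for token in payload.split():
        key, sep, value = token.partition(":")
        if sep:
            stats[key] = value
    return stats


cmd_line = (
    "12/22/2008 12:32:14 PM - SYSTEM STATS: Time:133.2594 Method:cmd.php "
    "Processes:4 Threads:N/A Hosts:220 HostsPerProcess:55 "
    "DataSources:13264 RRDsProcessed:4656"
)
spine_line = (
    "12/22/2008 12:55:00 PM - SYSTEM STATS: Time:299.0212 Method:spine "
    "Processes:4 Threads:15 Hosts:220 HostsPerProcess:55 "
    "DataSources:13264 RRDsProcessed:1242"
)
cmd_stats = parse_system_stats(cmd_line)
spine_stats = parse_system_stats(spine_line)
for stats in (cmd_stats, spine_stats):
    print(stats["Method"], stats["Time"], stats["RRDsProcessed"])
```

Values stay as strings because some fields are not numeric (for example `Threads:N/A` under cmd.php).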


Post by dainiookas »

I can tell you what's happening because I'm fighting with the same thing...
It hangs on dead devices, or on devices that do not answer the SNMP or UDP ping (depending on which kind you have set).

At least I think so, because I found a few devices in a big list which were DOWN for >1000 polls, and after disabling them spine is able to work again...
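
Picking out long-dead devices from a big list is easier with a query than by eye. The sketch below uses an in-memory SQLite table as a stand-in; the table layout and the column names `is_down` and `polls_down` are hypothetical and would need to be mapped onto whatever the real Cacti database stores.

```python
import sqlite3

# Stand-in table; the column names are assumptions for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE host ("
    "id INTEGER PRIMARY KEY, description TEXT, "
    "is_down INTEGER, polls_down INTEGER)"
)
conn.executemany(
    "INSERT INTO host VALUES (?, ?, ?, ?)",
    [
        (17, "branch-router", 1, 1450),
        (42, "lab-switch", 1, 1200),
        (88, "core-fw", 0, 0),
        (91, "printer", 1, 12),
    ],
)


def long_down_hosts(conn, min_polls=1000):
    """Return (id, description, polls_down) for hosts down longer than min_polls."""
    cursor = conn.execute(
        "SELECT id, description, polls_down FROM host "
        "WHERE is_down = 1 AND polls_down > ? "
        "ORDER BY polls_down DESC",
        (min_polls,),
    )
    return cursor.fetchall()


for row in long_down_hosts(conn):
    print(row)
```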

Post by Howie »

Code: Select all

12/23/2008 03:10:58 PM - SYSTEM STATS: Time:56.7870 Method:spine Processes:4 Threads:15 Hosts:210 HostsPerProcess:53 DataSources:13194 RRDsProcessed:5086
That was only 4 down devices. Removing those has cleared this.

So, now how do I defend against this stuff in the future? Things do go down in the real world, and I can't have it kill the cacti server...

Post by dainiookas »

Howie wrote:

Code: Select all

12/23/2008 03:10:58 PM - SYSTEM STATS: Time:56.7870 Method:spine Processes:4 Threads:15 Hosts:210 HostsPerProcess:53 DataSources:13194 RRDsProcessed:5086
That was only 4 down devices. Removing those has cleared this.

So, now how do I defend against this stuff in the future? Things do go down in the real world, and I can't have it kill the cacti server...
That's a really good question which bothers me a lot as well... Didn't find a real solution yet :(

Post by Howie »

Hmm, well that ran OK for a while, but fell apart over the weekend, so it's back to cmd.php

It seems to me that if the poller sorted its queries so that it polled known-down hosts last, then at least the 'damage' to known-working ones would be limited. It would mean that a recovered host might never be detected as back up because of other down hosts, but that beats perfectly OK stuff not being polled.
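
As a sketch of that ordering idea (this is not how spine or cmd.php currently work, and the `Host` type is hypothetical), sorting on the down flag first puts known-working hosts at the front of the queue:

```python
from dataclasses import dataclass


@dataclass
class Host:
    host_id: int
    is_down: bool  # status remembered from the previous polling cycle


def polling_order(hosts):
    """Known-up hosts first (False sorts before True), then by host id."""
    return sorted(hosts, key=lambda h: (h.is_down, h.host_id))


hosts = [Host(1, True), Host(2, False), Host(3, True), Host(4, False)]
ordered_ids = [h.host_id for h in polling_order(hosts)]
print(ordered_ids)  # [2, 4, 1, 3]
```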
gandalf
Developer
Posts: 22383
Joined: Thu Dec 02, 2004 2:46 am
Location: Muenster, Germany

Post by gandalf »

That's a nice approach. But unfortunately, the load distribution currently performed only accepts a "range" of device IDs, not a "list". While it would be possible to perform an SQL ORDER BY "<host down status>", it would currently not be possible to provide a poller process with the resulting "list of devices".
That would touch cmd.php as well as spine. But it's worth a discussion. And it may help on a distributed cacti system, where I would expect each polling thread to get its tasks from a specific poller table on a "per-process basis".
E.g. a table holding rows for
- poller process id
- host to be polled (along with its status?)
- data source to be polled (in case we turn to a data source based polling any time)
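
A rough sketch of what such a per-process table could look like, with hosts handed out as an explicit list instead of an ID range. Everything here (the row type, the field names, the round-robin assignment) is a hypothetical illustration, not an existing Cacti structure.

```python
from typing import NamedTuple


class PollerTask(NamedTuple):
    process_id: int   # poller process that owns this row
    host_id: int      # host to be polled
    host_status: str  # status carried along with the host


def build_poller_table(ordered_hosts, processes):
    """Deal an already ordered host list out to the processes round-robin."""
    return [
        PollerTask(index % processes, host_id, status)
        for index, (host_id, status) in enumerate(ordered_hosts)
    ]


def tasks_for(table, process_id):
    """Host ids assigned to one poller process, in polling order."""
    return [row.host_id for row in table if row.process_id == process_id]


# Up hosts first, known-down hosts last; the ids are not a contiguous range.
ordered_hosts = [(2, "up"), (4, "up"), (7, "up"), (1, "down"), (3, "down")]
table = build_poller_table(ordered_hosts, processes=2)
print(tasks_for(table, 0))  # [2, 7, 3]
print(tasks_for(table, 1))  # [4, 1]
```

Because each process reads its own rows, the order within a process is free to follow any sort key, which a plain first-id/last-id range cannot express.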

Reinhard

Post by Howie »

So what are other people doing? I don't have that many devices, or that many down devices (maybe 5-10 out of 300, depending on the day).

Is there a timeout I can lower that I've missed somewhere, or is it safe to turn the threads way up? (I assume that it's threads blocking that is my problem)
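
To reason about the thread question, a toy model helps: each host is handed to whichever thread frees up first, and a down host holds its thread for the full timeout. The thread count (15) and hosts per process (55) come from the logs above; the per-host durations of 1 second and 120 seconds are invented, so treat the numbers as an illustration only.

```python
import heapq


def poll_runtime(durations, threads):
    """Total runtime when each host goes to the thread that frees up first."""
    free_at = [0.0] * threads
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available thread
        heapq.heappush(free_at, start + duration)
    return max(free_at)


healthy = [1.0] * 55                   # assumed: 1 s per responsive host
with_down = [120.0] * 4 + [1.0] * 51   # assumed: 4 hosts blocking for 120 s

print(poll_runtime(healthy, threads=15))
print(poll_runtime(with_down, threads=15))
print(poll_runtime(with_down, threads=4))
```

In this model the run can never finish faster than the slowest single host, so extra threads keep healthy hosts from queueing behind blocked ones but do not shorten the blocking itself; that takes a lower timeout.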

Post by gandalf »

It is best to ensure that
- downed host detection catches all severe "host down problems". This check is performed first, and if it fails, the target is skipped
- your scripts (which may fail even if the host is up, e.g. because a service is down) are given a meaningful timeout parameter. You may hand over the SNMP timeout as a first step, but pay attention: sometimes timeouts are given in seconds, sometimes in milliseconds. I have already run into this, so I know what I'm talking about.
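
One way to give a script a hard timeout and keep the units explicit is a small wrapper. This is only a sketch: the wrapper name and the `unit` argument are made up, and returning `None` on timeout is an arbitrary choice for the example.

```python
import subprocess
import sys


def run_with_timeout(argv, timeout, unit="ms"):
    """Run a script and return its stripped stdout, or None if it timed out."""
    if unit == "ms":
        seconds = timeout / 1000.0
    elif unit == "s":
        seconds = float(timeout)
    else:
        raise ValueError(f"unknown timeout unit: {unit!r}")
    try:
        done = subprocess.run(
            argv, capture_output=True, text=True, timeout=seconds
        )
    except subprocess.TimeoutExpired:
        return None  # the child process is killed by subprocess.run
    return done.stdout.strip()


# A stand-in "script" that hangs, given a 200 ms budget.
slow = [sys.executable, "-c", "import time; time.sleep(5)"]
print(run_with_timeout(slow, 200, unit="ms"))

# A stand-in script that answers immediately, given 5 seconds.
fast = [sys.executable, "-c", "print('15.1875')"]
print(run_with_timeout(fast, 5, unit="s"))
```

Passing the unit alongside the number makes a seconds/milliseconds mix-up fail loudly or at least show up in the call site, instead of silently producing a timeout 1000 times too long.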
Reinhard

Post by Howie »

gandalf wrote: It is best to ensure that
- downed host detection catches all severe "host down problems". This check is performed first, and if it fails, the target is skipped
- your scripts (which may fail even if the host is up, e.g. because a service is down) are given a meaningful timeout parameter. You may hand over the SNMP timeout as a first step, but pay attention: sometimes timeouts are given in seconds, sometimes in milliseconds. I have already run into this, so I know what I'm talking about.
Reinhard
I'll check into those today, but wouldn't they both affect cmd.php and spine equally? I'm seeing much more stable performance with cmd.php. Spine is faster when it works, of course.