SPINE: Poller[0] ERROR: Spine Timed Out While Processing Hos

kingaru1 · Post by **kingaru1** » Thu Feb 02, 2012 8:46 am

Hello,

Well ... running latest version Cacti + latest spine.
having great issues with graphs and rrd not being created for required hosts and or graphs being broken.
It looks like the issue arises when I add more than certain number of items for monitoring.
Cron set for 1 min. Cron interval and Grpah intervals set for 1 min as well.
Keep getting constantly "SPINE: Poller[0] ERROR: Spine Timed Out While Processing Hosts Internal " in the log

Can somebody please help with this?

Thanks
K.

kingaru1 · Post by **kingaru1** » Mon Feb 06, 2012 4:33 am

C'mon guys ... I don't believe that I'm so unique in the issue ...
Why collects data and does not draw graph or stops creating rrd files at the certain poin!?
Also I dont think this is the issue comes from Spine ... switching to cmd.php does not fix the problem but just transforms error in to a warning "POLLER: Poller[0] WARNING: Poller Output Table not Empty. Issues Found: 8, Data Sources: blah blah blah "
Could you please at least let me know at least where to dig in? I searched through the posts regarding similar issue but all issues/fixes going back to 2005-2006 ... several releases back ... where all issues were caused by some sort bugs and should be fixed unfortunately it looks like not all of them fixed therefore I need help to help you and myself to fix it ...

Regards
K.

P.S. Just come up in the log " POLLER: Poller[0] WARNING: There are '1' detected as overrunning a polling process, please investigate" what is that?
P.P.S Cacti runs on UBUNTU server 11.10

Post by **gandalf** » Mon Feb 06, 2012 3:57 pm

The overrun issue is thrown in cases, when the poller does not complete within a poller interval (300 sec by default). This often occurs in case of unresponsive targets and/or insane timeouts (e.g. default timeouts) of commands used within scripts.
R.

kingaru1 · Post by **kingaru1** » Tue Feb 07, 2012 8:07 am

gandalf wrote:The overrun issue is thrown in cases, when the poller does not complete within a poller interval (300 sec by default). This often occurs in case of unresponsive targets and/or insane timeouts (e.g. default timeouts) of commands used within scripts.
R.

Thanks a lot for the feedback. At least someone is reacted at hopefully we can find a solution. Yes, besides general poling traffic of the interfaces via standard way, I also run some of my scripts to pull additional stats. I get all relies from ALL hosts without any problems. Id does not take more than 40 seconds to complete pooling for any of the devices that I have on the list. I.e. for some of the devices it maybe 10 seconds for some more remote 40 (absolute maximum) therefore if my pooling interval 60 seconds that should not be a problem. Correct? Is there a way that I can debug this? IN the log I see following as well

02/07/2012 02:01:00 PM - POLLER: Poller[0] WARNING: Poller Output Table not Empty. Issues Found: 3, Data Sources: aaaa(DS[3628]), bbb(DS[3628]), ccc(DS[3631])
02/07/2012 02:01:00 PM - POLLER: Poller[0] WARNING: There are '1' detected as overrunning a polling process, please investigate
02/07/2012 02:01:00 PM - POLLER: Poller[0] NOTE: Poller Int: '60', Cron Int: '60', Time Since Last: '59', Max Runtime '58', Poller Runs: '1'
02/07/2012 02:01:00 PM - SYSTEM STATS: Time:58.1966 Method:cmd.php Processes:10 Threads:N/A Hosts:41 HostsPerProcess:5 DataSources:357 RRDsProcessed:161
02/07/2012 02:01:00 PM - POLLER: Poller[0] Maximum runtime of 58 seconds exceeded. Exiting.

So what is the the runtime that I'm exceeded???? I did not exceeded this time for any of the hosts... or are you taking aggregated time for all hosts? If this is the case then I think it is wrong. pooler should be counted separately for each individual device not for all of them together ...

K

Post by **gandalf** » Tue Feb 07, 2012 3:40 pm

kingaru1 wrote: Id does not take more than 40 seconds to complete pooling for any of the devices that I have on the list. I.e. for some of the devices it maybe 10 seconds for some more remote 40 (absolute maximum) therefore if my pooling interval 60 seconds that should not be a problem. Correct?

In general, this is not correct. In fact, you can't exactly determine, which data sources are polled by which polling process. If I remember correctly, we try to poll a single host by exactly one polling process. If it happens, that two "slow" devices are polled by the same process, polling will time out. And in fact, a response time of 40 sec. is really sloooow, especially when trying to use 60 sec polling interval.
You may want to use multi-threading for that host (personally, I never used that).
You may run spine in debug level 3 to get polling times for each host. This may help for debugging
R.

kingaru1 · Post by **kingaru1** » Wed Oct 31, 2012 12:10 pm

gandalf wrote:
kingaru1 wrote: Id does not take more than 40 seconds to complete pooling for any of the devices that I have on the list. I.e. for some of the devices it maybe 10 seconds for some more remote 40 (absolute maximum) therefore if my pooling interval 60 seconds that should not be a problem. Correct?
In general, this is not correct. In fact, you can't exactly determine, which data sources are polled by which polling process. If I remember correctly, we try to poll a single host by exactly one polling process. If it happens, that two "slow" devices are polled by the same process, polling will time out. And in fact, a response time of 40 sec. is really sloooow, especially when trying to use 60 sec polling interval.
You may want to use multi-threading for that host (personally, I never used that).
You may run spine in debug level 3 to get polling times for each host. This may help for debugging
R.

OK ... it took me a while to get back to this issue but I made some tests and found that there is no a single host thread that takes more than a 50-55 seconds to complete.
I have 155 hosts configured in cacti and each host runs 6 threads. In some worse cases it takes 50-52 seconds for some of the threads to completed , so I _never_ exceed 60 seconds pooling time for any running [TH] (thread). I assume cacti must empty Output table for the each thread up on completion it's pooling tasks, however I feel that it happens only when the entire pooling process (that includes several threads) for the host is completed which may be explaining the issue. I also I have noticed that the each thread includes several "poling" items. Due to different network conditions or network device utilization the responses for the each individual item in the thread can be delayed therefore it slows down the whole thread and makes processing of other items from the same thread to be delayed as well. Yes, this may not be a big issue if you perform mooring of devices within a 40ms range with limited number of SNMP objects that you control however as soon as you will require to monitor devices which are 150+ ms away from you and you need to monitor quite few interfaces and/or SNMP objects you can easily exceed 40 or even 60 sec for the single device which I think is nothing abnormal as soon as you don't exceed 60 seconds interval for an individual SNMP object.

Regards
Igor

Post by **gandalf** » Thu Nov 01, 2012 5:36 am

There are some polling timeouts that you may change under "Settings". Does this help you?
R.

kingaru1 · Post by **kingaru1** » Wed Nov 07, 2012 8:46 am

gandalf wrote:There are some polling timeouts that you may change under "Settings". Does this help you?
R.

I have script and server timeout values set for 120 and it does not help

Regards
Igor

kingaru1 · Post by **kingaru1** » Mon Nov 19, 2012 2:06 am

kingaru1 wrote:
gandalf wrote:There are some polling timeouts that you may change under "Settings". Does this help you?
R.

I have script and server timeout values set for 120 and it does not help

Regards
Igor

So I guess we ran out of any ideas how to resolve this since there are no any other suggestions ?

Regards
Igor

dar1k · Post by **dar1k** » Tue Nov 20, 2012 1:56 am

I have the same problem. My spine works unstable. every 5-10 minute I got timeout error:

crontab 5 min.
poller interval 5 min.
Maximum Concurrent Poller Processes 4
Maximum Threads per Process 40
Number of PHP Script Servers 10
Script and Script Server Timeout Value 25
The Maximum SNMP OID's Per SNMP Get Request 40

11/20/2012 02:20:34 PM - SYSTEM THOLD STATS: Time:0.0878 Tholds:39 DownHosts:0
11/20/2012 02:20:34 PM - SYSTEM STATS: Time:32.0892 Method:spine Processes:4 Threads:40 Hosts:1294 HostsPerProcess:324 DataSources:1618 RRDsProcessed:1280
11/20/2012 02:18:00 PM - SYSTEM THOLD STATS: Time:0.0881 Tholds:39 DownHosts:0
11/20/2012 02:18:00 PM - SYSTEM STATS: Time:178.8975 Method:spine Processes:4 Threads:40 Hosts:1294 HostsPerProcess:324 DataSources:1618 RRDsProcessed:1280
11/20/2012 02:10:33 PM - SYSTEM THOLD STATS: Time:0.0872 Tholds:39 DownHosts:0
11/20/2012 02:10:33 PM - SYSTEM STATS: Time:32.0221 Method:spine Processes:4 Threads:40 Hosts:1294 HostsPerProcess:324 DataSources:1618 RRDsProcessed:1280

11/20/2012 03:00:27 PM - SYSTEM THOLD STATS: Time:0.0166 Tholds:0 DownHosts:0
11/20/2012 03:00:27 PM - SYSTEM STATS: Time:26.1608 Method:spine Processes:4 Threads:40 Hosts:1298 HostsPerProcess:325 DataSources:0 RRDsProcessed:0
11/20/2012 03:00:01 PM - SPINE: Poller[0] ERROR: Spine Timed Out While Processing Hosts Internal
11/20/2012 03:00:01 PM - POLLER: Poller[0] WARNING: There are '1' detected as overrunning a polling process, please investigate
11/20/2012 03:00:00 PM - SYSTEM THOLD STATS: Time:0.0167 Tholds:0 DownHosts:0
11/20/2012 03:00:00 PM - SYSTEM STATS: Time:298.5146 Method:spine Processes:4 Threads:40 Hosts:1298 HostsPerProcess:325 DataSources:0 RRDsProcessed:0
11/20/2012 03:00:00 PM - POLLER: Poller[0] Maximum runtime of 298 seconds exceeded. Exiting.
11/20/2012 02:50:33 PM - SYSTEM THOLD STATS: Time:0.1463 Tholds:39 DownHosts:0
11/20/2012 02:50:33 PM - SYSTEM STATS: Time:31.7913 Method:spine Processes:4 Threads:40 Hosts:1294 HostsPerProcess:324 DataSources:1618 RRDsProcessed:1280
11/20/2012 02:45:33 PM - SYSTEM THOLD STATS: Time:0.0884 Tholds:39 DownHosts:0
11/20/2012 02:45:33 PM - SYSTEM STATS: Time:31.7094 Method:spine Processes:4 Threads:40 Hosts:1294 HostsPerProcess:324 DataSources:1618 RRDsProcessed:1280

Why such big difference in runtime?

Post by **gandalf** » Tue Nov 20, 2012 5:22 pm

This looks like two pollers are running in parallel. Examine your crontabs, please (see e.g. 2nd link of my sig)
R.

dar1k · Post by **dar1k** » Tue Nov 20, 2012 11:21 pm

I have only one crontab:

Code: Select all

[root@cactus ~]# cat /etc/cron.d/cacti
*/5 * * * *     cactiuser       /usr/bin/php -q /var/www/cacti/poller.php > /var/www/cacti/log/poller.log 2>&1
[root@cactus ~]#

paulgevers · Post by **paulgevers** » Fri Nov 23, 2012 4:52 pm

dar1k wrote:I have only one crontab:

Code: Select all

[root@cactus ~]# cat /etc/cron.d/cacti
*/5 * * * *     cactiuser       /usr/bin/php -q /var/www/cacti/poller.php > /var/www/cacti/log/poller.log 2>&1
[root@cactus ~]#

And maybe one in /etc/crontab ?

dar1k · Post by **dar1k** » Sun Nov 25, 2012 11:43 pm

paulgevers wrote:
dar1k wrote:I have only one crontab:
Code: Select all
[root@cactus ~]# cat /etc/cron.d/cacti
*/5 * * * *     cactiuser       /usr/bin/php -q /var/www/cacti/poller.php > /var/www/cacti/log/poller.log 2>&1
[root@cactus ~]#
And maybe one in /etc/crontab ?

Code: Select all

[root@cactus ~]# cat /etc/crontab
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
HOME=/

# For details see man 4 crontabs

# Example of job definition:
# .---------------- minute (0 - 59)
# |  .------------- hour (0 - 23)
# |  |  .---------- day of month (1 - 31)
# |  |  |  .------- month (1 - 12) OR jan,feb,mar,apr ...
# |  |  |  |  .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# |  |  |  |  |
# *  *  *  *  * user-name command to be executed

[root@cactus ~]#

I've changed poller type from spine to cmd.php, but cmd did not detect down hosts if:
1. Downed Device Detection is icmp ping
2. Downed Device Detection is ping or snmp uptime and host response only snmp it detects as down.

kingaru1 · Post by **kingaru1** » Mon Nov 26, 2012 8:40 am

dar1k wrote:I have the same problem. My spine works unstable. every 5-10 minute I got timeout error:

crontab 5 min.
poller interval 5 min.
Maximum Concurrent Poller Processes 4
Maximum Threads per Process 40
Number of PHP Script Servers 10
Script and Script Server Timeout Value 25
The Maximum SNMP OID's Per SNMP Get Request 40

11/20/2012 02:20:34 PM - SYSTEM THOLD STATS: Time:0.0878 Tholds:39 DownHosts:0
11/20/2012 02:20:34 PM - SYSTEM STATS: Time:32.0892 Method:spine Processes:4 Threads:40 Hosts:1294 HostsPerProcess:324 DataSources:1618 RRDsProcessed:1280
11/20/2012 02:18:00 PM - SYSTEM THOLD STATS: Time:0.0881 Tholds:39 DownHosts:0
11/20/2012 02:18:00 PM - SYSTEM STATS: Time:178.8975 Method:spine Processes:4 Threads:40 Hosts:1294 HostsPerProcess:324 DataSources:1618 RRDsProcessed:1280
11/20/2012 02:10:33 PM - SYSTEM THOLD STATS: Time:0.0872 Tholds:39 DownHosts:0
11/20/2012 02:10:33 PM - SYSTEM STATS: Time:32.0221 Method:spine Processes:4 Threads:40 Hosts:1294 HostsPerProcess:324 DataSources:1618 RRDsProcessed:1280

11/20/2012 03:00:27 PM - SYSTEM THOLD STATS: Time:0.0166 Tholds:0 DownHosts:0
11/20/2012 03:00:27 PM - SYSTEM STATS: Time:26.1608 Method:spine Processes:4 Threads:40 Hosts:1298 HostsPerProcess:325 DataSources:0 RRDsProcessed:0
11/20/2012 03:00:01 PM - SPINE: Poller[0] ERROR: Spine Timed Out While Processing Hosts Internal
11/20/2012 03:00:01 PM - POLLER: Poller[0] WARNING: There are '1' detected as overrunning a polling process, please investigate
11/20/2012 03:00:00 PM - SYSTEM THOLD STATS: Time:0.0167 Tholds:0 DownHosts:0
11/20/2012 03:00:00 PM - SYSTEM STATS: Time:298.5146 Method:spine Processes:4 Threads:40 Hosts:1298 HostsPerProcess:325 DataSources:0 RRDsProcessed:0
11/20/2012 03:00:00 PM - POLLER: Poller[0] Maximum runtime of 298 seconds exceeded. Exiting.
11/20/2012 02:50:33 PM - SYSTEM THOLD STATS: Time:0.1463 Tholds:39 DownHosts:0
11/20/2012 02:50:33 PM - SYSTEM STATS: Time:31.7913 Method:spine Processes:4 Threads:40 Hosts:1294 HostsPerProcess:324 DataSources:1618 RRDsProcessed:1280
11/20/2012 02:45:33 PM - SYSTEM THOLD STATS: Time:0.0884 Tholds:39 DownHosts:0
11/20/2012 02:45:33 PM - SYSTEM STATS: Time:31.7094 Method:spine Processes:4 Threads:40 Hosts:1294 HostsPerProcess:324 DataSources:1618 RRDsProcessed:1280

Why such big difference in runtime?

Try to reduce number of processes to default 1 ...

Igor

Cacti

SPINE: Poller[0] ERROR: Spine Timed Out While Processing Hos

SPINE: Poller[0] ERROR: Spine Timed Out While Processing Hos

Re: SPINE: Poller[0] ERROR: Spine Timed Out While Processing

Re: SPINE: Poller[0] ERROR: Spine Timed Out While Processing

Re: SPINE: Poller[0] ERROR: Spine Timed Out While Processing

Re: SPINE: Poller[0] ERROR: Spine Timed Out While Processing

Re: SPINE: Poller[0] ERROR: Spine Timed Out While Processing

Re: SPINE: Poller[0] ERROR: Spine Timed Out While Processing

Re: SPINE: Poller[0] ERROR: Spine Timed Out While Processing

Re: SPINE: Poller[0] ERROR: Spine Timed Out While Processing

Re: SPINE: Poller[0] ERROR: Spine Timed Out While Processing

Re: SPINE: Poller[0] ERROR: Spine Timed Out While Processing

Re: SPINE: Poller[0] ERROR: Spine Timed Out While Processing

Re: SPINE: Poller[0] ERROR: Spine Timed Out While Processing

Re: SPINE: Poller[0] ERROR: Spine Timed Out While Processing

Re: SPINE: Poller[0] ERROR: Spine Timed Out While Processing

Who is online