Problems after migrating to new server.

Post by **uno** » Wed Jan 19, 2011 8:27 am

Hi.

We have migrated our Cacti setup to new hardware. The OS has changed from 32-bit to 64-bit. We are using Cacti 0.8.7g with spine, and I have recompiled spine on the new server. The problem is that some graphs are not updated, and we get errors and warnings in the Cacti log. The graphs that are not updating include both script queries and SNMP queries.

These are the errors we get in the cacti log:

01/19/2011 02:10:04 PM - PHPSVR: Poller[0] ERROR: Input Expected, Script Server Terminating
01/19/2011 02:10:04 PM - PHPSVR: Poller[0] ERROR: Input Expected, Script Server Terminating
01/19/2011 02:10:04 PM - PHPSVR: Poller[0] ERROR: Input Expected, Script Server Terminating
01/19/2011 02:10:04 PM - PHPSVR: Poller[0] ERROR: Input Expected, Script Server Terminating
01/19/2011 02:10:04 PM - PHPSVR: Poller[0] ERROR: Input Expected, Script Server Terminating
01/19/2011 02:10:04 PM - PHPSVR: Poller[0] ERROR: Input Expected, Script Server Terminating
01/19/2011 02:10:03 PM - SPINE: Poller[0] ERROR: SS[1] PHP Script Server communications lost. Restarting PHP Script Server
01/19/2011 02:10:03 PM - SPINE: Poller[0] ERROR: SS[0] PHP Script Server communications lost. Restarting PHP Script Server
01/19/2011 02:10:03 PM - SPINE: Poller[0] ERROR: SS[9] PHP Script Server communications lost. Restarting PHP Script Server
01/19/2011 02:10:03 PM - SPINE: Poller[0] ERROR: SS[6] PHP Script Server communications lost. Restarting PHP Script Server
01/19/2011 02:10:03 PM - SPINE: Poller[0] ERROR: SS[3] PHP Script Server communications lost. Restarting PHP Script Server

We did not have these issues on the old hardware, and I can't figure out the reason for them. I tried using cmd.php instead for some time, and it seems to work better for some graphs, but then we get a lot of SNMP timeouts instead, which leads to 50% of the graphs not being updated.

Post by **uno** » Wed Jan 19, 2011 8:59 am

I see this in the log as well:

01/19/2011 02:45:29 PM - SPINE: Poller[0] Time: 27.6522 s, Threads: 20, Hosts: 5
01/19/2011 02:50:00 PM - POLLER: Poller[0] Maximum runtime of 298 seconds exceeded. Exiting.
01/19/2011 02:50:00 PM - SYSTEM STATS: Time:299.5331 Method:spine Processes:10 Threads:20 Hosts:102 HostsPerProcess:11 DataSources:3865 RRDsProcessed:1891
01/19/2011 02:50:00 PM - POLLER: Poller[0] NOTE: Poller Int: '300', Cron Int: '300', Time Since Last: '299', Max Runtime '298', Poller Runs: '1'
01/19/2011 02:50:00 PM - POLLER: Poller[0] WARNING: There are '1' detected as overrunning a polling process, please investigate
01/19/2011 02:50:00 PM - SPINE: Poller[0] NOTE: Spine did not detect multithreaded device polling.

It seems like it's not doing anything for almost 5 minutes, until the maximum runtime is exceeded. I also tried with debug logging enabled, but it doesn't log anything during that time.
We also get a segmentation fault during each run. How can I find the cause of this?

01/19/2011 02:55:04 PM - SPINE: Poller[0] FATAL: Spine Encountered a Segmentation Fault (Spine thread)

Post by **uno** » Wed Jan 19, 2011 9:34 am

I think I found the problem. There was one PHP script, for monitoring MSSQL servers, that did not work because php-mssql is not available as a package for the new OS, and I have not fixed it yet.
It seems like these script servers hung, and as a result some other checks never ran.

The only problem in the log now is:
01/19/2011 03:30:01 PM - POLLER: Poller[0] WARNING: There are '1' detected as overrunning a polling process, please investigate

What is this, and what can I do about it?

Post by **uno** » Thu Jan 20, 2011 2:26 am

I sometimes still get a poll that doesn't finish in time, which results in gaps in some graphs. Is there a way to find out why the polling isn't finished in time?

01/20/2011 07:40:28 AM - SPINE: Poller[0] Time: 26.6360 s, Threads: 50, Hosts: 14
01/20/2011 07:45:00 AM - POLLER: Poller[0] Maximum runtime of 298 seconds exceeded. Exiting.
01/20/2011 07:45:00 AM - SYSTEM STATS: Time:298.2993 Method:spine Processes:5 Threads:50 Hosts:101 HostsPerProcess:21 DataSources:3863 RRDsProcessed:1374
01/20/2011 07:45:00 AM - POLLER: Poller[0] NOTE: Poller Int: '300', Cron Int: '300', Time Since Last: '299', Max Runtime '298', Poller Runs: '1'
01/20/2011 07:45:00 AM - POLLER: Poller[0] WARNING: There are '2' detected as overrunning a polling process, please investigate

A successful run looks like this:

01/20/2011 07:45:30 AM - SPINE: Poller[0] Time: 29.4593 s, Threads: 50, Hosts: 14
01/20/2011 07:45:30 AM - SYSTEM STATS: Time:30.1157 Method:spine Processes:5 Threads:50 Hosts:101 HostsPerProcess:21 DataSources:3863 RRDsProcessed:2224
01/20/2011 07:50:01 AM - POLLER: Poller[0] NOTE: Poller Int: '300', Cron Int: '300', Time Since Last: '301', Max Runtime '298', Poller Runs: '1'

So it seems like there is some script or poll that is hanging, which somehow stops a lot of the RRDs from being updated. How can I see which one is the problem?

On a related topic, what are good thread, script server, and timeout settings for spine on a quad-core Xeon (E5405)?

Post by **gandalf** » Fri Jan 21, 2011 12:34 pm

I'd start researching the scripts.
You may poll devices one-by-one using spine to test for the "bad" one.
R.
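
The one-by-one approach suggested above can be scripted so that the slowest or hanging host stands out. The following is only a minimal sketch in Python: `time_command` and `poll_hosts_one_by_one` are hypothetical helper names, and the command used to poll a single host is deliberately left to the caller, since the exact spine invocation depends on the installation (check `spine --help` on your build).

```python
import subprocess
import time


def time_command(cmd, timeout_s):
    """Run cmd and return (elapsed_seconds, status).

    status is 'ok' (exit code 0), 'failed' (non-zero exit code)
    or 'timeout' (the command was killed after timeout_s seconds).
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        status = "ok" if proc.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        status = "timeout"
    return time.monotonic() - start, status


def poll_hosts_one_by_one(host_ids, build_cmd, timeout_s=60):
    """Poll each host separately and return results, slowest first.

    build_cmd is a caller-supplied function mapping a host id to the
    argument list of the command that polls only that host.
    """
    results = []
    for host_id in host_ids:
        elapsed, status = time_command(build_cmd(host_id), timeout_s)
        results.append((host_id, elapsed, status))
    return sorted(results, key=lambda r: r[1], reverse=True)


# Hypothetical usage. The spine path and arguments below are assumptions,
# not taken from this thread; adapt them to your installation.
#
# build_cmd = lambda host_id: ["/path/to/spine", str(host_id), str(host_id)]
# for host_id, elapsed, status in poll_hosts_one_by_one(range(1, 103), build_cmd):
#     print(host_id, round(elapsed, 1), status)
```

Hosts that report `timeout`, or that take far longer than the rest, are the first candidates for a closer look at their scripts. For a problem that only appears a few times per day, the loop would have to be repeated many times before it catches a bad run.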

Post by **uno** » Wed Jan 26, 2011 5:54 am

Hi. The scripts worked fine on the old server, and they work most of the time. Maybe 5 polls per day exceed the maximum runtime, which causes a number of RRDs to not be updated. It's usually quite a high number of RRDs that are not updated, as if the poller is hanging and isn't able to finish them all in time. For example, all graphs for one host might have gaps in them, despite using completely different poll methods.
Since all graphs/data sources work most of the time, it's quite difficult to pin it down by polling one host at a time.
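
Because the failures are intermittent, one way to narrow them down is to pull the overrunning polls out of the Cacti log and see when they happen and how many RRDs were missed. The sketch below is an illustration in Python that assumes the `SYSTEM STATS` line format shown in the log excerpts earlier in this thread; `parse_stats` and `slow_runs` are hypothetical names, and the 298-second threshold is the "Max Runtime" value from those excerpts.

```python
import re

# Matches lines such as:
# 01/20/2011 07:45:00 AM - SYSTEM STATS: Time:298.2993 Method:spine ...
STATS_RE = re.compile(
    r"^(?P<stamp>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} [AP]M)"
    r" - SYSTEM STATS: (?P<fields>.*)$"
)


def parse_stats(line):
    """Return a dict for a SYSTEM STATS line, or None for any other line."""
    match = STATS_RE.match(line.strip())
    if match is None:
        return None
    # The fields are space-separated Key:Value pairs.
    fields = dict(pair.split(":", 1) for pair in match.group("fields").split())
    return {
        "stamp": match.group("stamp"),
        "time": float(fields["Time"]),
        "datasources": int(fields["DataSources"]),
        "rrds": int(fields["RRDsProcessed"]),
    }


def slow_runs(lines, max_runtime=298.0):
    """Return the parsed polls whose total time reached max_runtime."""
    runs = []
    for line in lines:
        stats = parse_stats(line)
        if stats is not None and stats["time"] >= max_runtime:
            runs.append(stats)
    return runs


# Hypothetical usage; the log path is an assumption.
#
# with open("/path/to/cacti.log") as log:
#     for run in slow_runs(log):
#         print(run["stamp"], run["time"], run["rrds"], "of", run["datasources"])
```

If the overrunning polls cluster around particular times of day, that points towards something external, such as a backup job or a device that is periodically unreachable, rather than a script that is broken all the time.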