Hi.
We have migrated our Cacti setup to new hardware, and the OS has changed from 32-bit to 64-bit. We are using Cacti 0.8.7g with spine, which I have recompiled on the new server. The problem is that some graphs are not updated, and we get errors and warnings in the Cacti log. The graphs that are not updating include both script queries and SNMP queries.
These are the errors we get in the cacti log:
01/19/2011 02:10:04 PM - PHPSVR: Poller[0] ERROR: Input Expected, Script Server Terminating
01/19/2011 02:10:04 PM - PHPSVR: Poller[0] ERROR: Input Expected, Script Server Terminating
01/19/2011 02:10:04 PM - PHPSVR: Poller[0] ERROR: Input Expected, Script Server Terminating
01/19/2011 02:10:04 PM - PHPSVR: Poller[0] ERROR: Input Expected, Script Server Terminating
01/19/2011 02:10:04 PM - PHPSVR: Poller[0] ERROR: Input Expected, Script Server Terminating
01/19/2011 02:10:04 PM - PHPSVR: Poller[0] ERROR: Input Expected, Script Server Terminating
01/19/2011 02:10:03 PM - SPINE: Poller[0] ERROR: SS[1] PHP Script Server communications lost. Restarting PHP Script Server
01/19/2011 02:10:03 PM - SPINE: Poller[0] ERROR: SS[0] PHP Script Server communications lost. Restarting PHP Script Server
01/19/2011 02:10:03 PM - SPINE: Poller[0] ERROR: SS[9] PHP Script Server communications lost. Restarting PHP Script Server
01/19/2011 02:10:03 PM - SPINE: Poller[0] ERROR: SS[6] PHP Script Server communications lost. Restarting PHP Script Server
01/19/2011 02:10:03 PM - SPINE: Poller[0] ERROR: SS[3] PHP Script Server communications lost. Restarting PHP Script Server
We did not have these issues on the old hardware, and I can't figure out the reason for this. I tried using cmd.php instead for some time, and it seems to work better for some graphs, but then we get a lot of SNMP timeouts instead, which leads to 50% of the graphs not being updated.
Problems after migrating to new server.
Re: Problems after migrating to new server.
I see this in the log as well:
01/19/2011 02:45:29 PM - SPINE: Poller[0] Time: 27.6522 s, Threads: 20, Hosts: 5
01/19/2011 02:50:00 PM - POLLER: Poller[0] Maximum runtime of 298 seconds exceeded. Exiting.
01/19/2011 02:50:00 PM - SYSTEM STATS: Time:299.5331 Method:spine Processes:10 Threads:20 Hosts:102 HostsPerProcess:11 DataSources:3865 RRDsProcessed:1891
01/19/2011 02:50:00 PM - POLLER: Poller[0] NOTE: Poller Int: '300', Cron Int: '300', Time Since Last: '299', Max Runtime '298', Poller Runs: '1'
01/19/2011 02:50:00 PM - POLLER: Poller[0] WARNING: There are '1' detected as overrunning a polling process, please investigate
01/19/2011 02:50:00 PM - SPINE: Poller[0] NOTE: Spine did not detect multithreaded device polling.
It seems like it's not doing anything for almost 5 minutes, until the timeout is exceeded. I also tried with debug logging enabled, but it doesn't log anything during that time.
We also have a segmentation fault during each run. How can I find the cause of this?
01/19/2011 02:55:04 PM - SPINE: Poller[0] FATAL: Spine Encountered a Segmentation Fault (Spine thread)
Re: Problems after migrating to new server.
I think I found the problem. There was one php script, for monitoring MSSQL servers, that did not work because php-mssql is not available as a package for the new OS, and I have not fixed it yet.
It seems like these script servers hung, and then some other tests never ran.
The only problem in the log now is:
01/19/2011 03:30:01 PM - POLLER: Poller[0] WARNING: There are '1' detected as overrunning a polling process, please investigate
What is this, and what can I do about it?
Re: Problems after migrating to new server.
I sometimes still get a poll that doesn't finish in time, which results in gaps in some graphs. Is there a way to find out why the polling isn't finished in time?
01/20/2011 07:40:28 AM - SPINE: Poller[0] Time: 26.6360 s, Threads: 50, Hosts: 14
01/20/2011 07:45:00 AM - POLLER: Poller[0] Maximum runtime of 298 seconds exceeded. Exiting.
01/20/2011 07:45:00 AM - SYSTEM STATS: Time:298.2993 Method:spine Processes:5 Threads:50 Hosts:101 HostsPerProcess:21 DataSources:3863 RRDsProcessed:1374
01/20/2011 07:45:00 AM - POLLER: Poller[0] NOTE: Poller Int: '300', Cron Int: '300', Time Since Last: '299', Max Runtime '298', Poller Runs: '1'
01/20/2011 07:45:00 AM - POLLER: Poller[0] WARNING: There are '2' detected as overrunning a polling process, please investigate
A successful run looks like this:
01/20/2011 07:45:30 AM - SPINE: Poller[0] Time: 29.4593 s, Threads: 50, Hosts: 14
01/20/2011 07:45:30 AM - SYSTEM STATS: Time:30.1157 Method:spine Processes:5 Threads:50 Hosts:101 HostsPerProcess:21 DataSources:3863 RRDsProcessed:2224
01/20/2011 07:50:01 AM - POLLER: Poller[0] NOTE: Poller Int: '300', Cron Int: '300', Time Since Last: '301', Max Runtime '298', Poller Runs: '1'
So it seems like there is some script or poll that is hanging, which somehow stops a lot of the RRDs from being updated. How can I see which one is the problem?
On a related topic, what are good threads, script servers and timeout settings for spine on a quad core xeon (E5405)?
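One way to spot the incomplete runs without reading the log by eye is to pull the SYSTEM STATS lines out of it. The following is only a rough sketch in Python: it assumes the line format shown in the excerpts above, and the function name `overrunning_runs` and the 298-second default (taken from the "Max Runtime '298'" note in the log) are illustrative choices, not part of Cacti.

```python
import re

# Matches the SYSTEM STATS line format seen in the log excerpts above.
STATS_RE = re.compile(
    r"^(?P<stamp>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} [AP]M) - SYSTEM STATS: "
    r"Time:(?P<time>[\d.]+) .*?"
    r"DataSources:(?P<ds>\d+) RRDsProcessed:(?P<rrds>\d+)"
)


def overrunning_runs(lines, max_runtime=298.0):
    """Return one dict per poller run whose runtime reached max_runtime.

    max_runtime defaults to the value reported in the log above; adjust it
    if your poller interval differs (assumption, not read from Cacti).
    """
    flagged = []
    for line in lines:
        match = STATS_RE.match(line.strip())
        if not match:
            continue  # not a SYSTEM STATS line
        run = {
            "stamp": match["stamp"],
            "seconds": float(match["time"]),
            "data_sources": int(match["ds"]),
            "rrds_processed": int(match["rrds"]),
        }
        if run["seconds"] >= max_runtime:
            flagged.append(run)
    return flagged
```

Feeding it the lines of the Cacti log (for example `overrunning_runs(open(path))`, where `path` is wherever your log file lives) gives the timestamps of the runs that hit the limit, along with how many RRDs each one managed to process.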
- gandalf
- Developer
Re: Problems after migrating to new server.
I'd start researching the scripts.
You may poll devices one-by-one using spine to test for the "bad" one.
R.
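Polling devices one-by-one can be scripted. Below is a minimal Python sketch, not a tested procedure: the spine path is a placeholder, and the options used (`--readonly`, `--stdout`, `--verbosity`, plus the first and last host ID as positional arguments) are assumed to match your spine build, so check them against `spine --help` first. The helper names `spine_command` and `time_hosts` are made up for this example.

```python
import subprocess
import time


def spine_command(host_id, spine_path="/usr/local/spine/bin/spine", verbosity=3):
    """Build a spine invocation that polls exactly one host ID.

    The path and the flags are assumptions; verify with `spine --help`.
    """
    return [
        spine_path,
        "--readonly",               # assumed: do not write results to the DB
        "--stdout",                 # assumed: log to standard output
        f"--verbosity={verbosity}",
        str(host_id),               # first host ID
        str(host_id),               # last host ID (same, so one host only)
    ]


def time_hosts(host_ids, timeout=60, run=subprocess.run):
    """Poll each host separately; return (host_id, seconds, status), slowest first."""
    results = []
    for host_id in host_ids:
        start = time.monotonic()
        try:
            proc = run(spine_command(host_id), capture_output=True,
                       text=True, timeout=timeout)
            # A negative return code means spine was killed by a signal,
            # which is how a segmentation fault would show up here.
            status = "ok" if proc.returncode == 0 else f"exit {proc.returncode}"
        except subprocess.TimeoutExpired:
            status = "timeout"
        results.append((host_id, time.monotonic() - start, status))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

Calling `time_hosts(range(1, 103))` would walk through host IDs 1 to 102 (the ID range is a guess, take the real IDs from the Cacti device list). Hosts that time out or exit abnormally are the first candidates to look at.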
Re: Problems after migrating to new server.
Hi. The scripts worked fine on the old server, and they work most of the time. It's maybe 5 polls per day that exceed the timeout and cause a number of RRDs to not be updated. It's usually quite a high number of RRDs that are not updated, as if the poller hangs and isn't able to finish them all in time. For example, all graphs for one host might have gaps in them, despite using completely different polling methods.
Since all graphs/datasources work most of the time, it's quite difficult to pin it down by polling one host at a time.
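For failures that only happen a few times per day, it may help to check whether they cluster at particular hours (backups, cron jobs, and so on) before testing hosts individually. The Python sketch below is only an illustration: it assumes the timestamp format and the "Maximum runtime of ... exceeded" message seen in the log excerpts earlier in the thread, and `overruns_by_hour` is a made-up name.

```python
from collections import Counter
from datetime import datetime


def overruns_by_hour(lines):
    """Count 'Maximum runtime ... exceeded' log lines per hour of day (0-23)."""
    counts = Counter()
    for line in lines:
        if "Maximum runtime of" not in line or "exceeded" not in line:
            continue
        # The timestamp is everything before the first " - " separator.
        stamp = line.split(" - ", 1)[0].strip()
        when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
        counts[when.hour] += 1
    return counts
```

If most of the overruns land in the same hour or two, whatever else runs on the server or the network at that time is worth a look.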