How to debug script server operation?

monachus
Posts: 42
Joined: Mon Sep 06, 2004 1:27 am
Location: New York, NY

How to debug script server operation?

Post by monachus »

System: Ubuntu 10.04, Cacti 0.8.7e, Boost.

We started using the script server a few weeks ago, when our poller execution time consistently exceeded 60s. Ever since activating it, every Cacti run has logged this:

Code:

08/06/2011 11:34:09 AM - SPINE: Poller[0] ERROR: SS[9] PHP Script Server communications lost.  Restarting PHP Script Server
08/06/2011 11:34:09 AM - SPINE: Poller[0] ERROR: SS[9] PHP Script Server communications lost.  Restarting PHP Script Server
08/06/2011 11:34:09 AM - SPINE: Poller[0] ERROR: SS[7] PHP Script Server communications lost.  Restarting PHP Script Server
08/06/2011 11:34:10 AM - SPINE: Poller[0] ERROR: SS[8] PHP Script Server communications lost.  Restarting PHP Script Server
08/06/2011 11:34:10 AM - SPINE: Poller[0] ERROR: SS[9] PHP Script Server communications lost.  Restarting PHP Script Server
08/06/2011 11:34:10 AM - SPINE: Poller[0] ERROR: SS[2] PHP Script Server communications lost.  Restarting PHP Script Server
<repeats for the number of configured script servers>
08/06/2011 11:34:11 AM - SYSTEM STATS: Time:9.9739 Method:spine Processes:1 Threads:40 Hosts:537 HostsPerProcess:537 DataSources:4194 RRDsProcessed:0

I thought this might be the result of a change I made to the ss_get_mysql_stats.php script so that it would actually use the script server, but the errors still appear after deleting all of the graphs that used that script. We're also getting intermittent gaps in the graphs for "Host MIB - Available Disk Space," which also uses the script server.

My guess is that something is hanging or killing the script server, so that it either never returns or times out and gets killed by something else. I'd like to debug what's happening inside it, to see whether a sub-script is throwing an exception or what it's stuck on when it dies or is killed. Is it possible to get more insight into its operation?
Adrian Goins - President / CEO
Arces Network, LLC
http://www.arces.net
noname
Cacti Guru User
Posts: 1566
Joined: Thu Aug 05, 2010 2:04 am
Location: Japan

Re: How to debug script server operation?

Post by noname »

For example (though I'm not sure whether this will help), the following change makes Spine record slightly more detailed logs.

In 'cacti-spine-0.8.7e/php.c', add "SPINE_LOG_DEBUG(..);" as follows:

Code:

char *php_cmd(const char *php_command, int php_process) {
        ...

        /* send command to the script server */
        retry:
        bytes = write(php_processes[php_process].php_write_fd, command, strlen(command));

        SPINE_LOG_DEBUG(("DEBUG_EXTRA: SS[%i] Command:'%s' Result:%ibytes Retries:%i", php_process, php_command, bytes, retries));

        /* if write status is <= 0 then the script server may be hung */
        if (bytes <= 0) {
                result_string = strdup("U");
                SPINE_LOG(("ERROR: SS[%i] PHP Script Server communications lost.  Restarting PHP Script Server", php_process));
                ...
}

Then recompile Spine and replace the installed spine executable with the newly built one (= 'cacti-spine-0.8.7e/.libs/spine').

Sample log:
...
08/08/2011 06:20:07 PM - SPINE: Poller[0] DEBUG_EXTRA: SS[0] Command:'/var/www/cacti/scripts/ss_poller.php ss_poller' Result:53bytes Retries:0
08/08/2011 06:20:07 PM - PHPSVR: Poller[0] DEBUG: INC: '/var/www/cacti/scripts/ss_poller.php' FUNC: 'ss_poller' PARMS: ''
08/08/2011 06:20:07 PM - SPINE: Poller[0] Host[1] Description[Localhost] DS[57699] Graphs['Local - Cacti Poller Statistics - Runtime'] SS[0] SERVER: /var/www/cacti/scripts/ss_poller.php ss_poller, output: Time:6.8322 Method:spine Processes:5 Threads:5 Hosts:12 HostsPerProcess:3 DataSources:279 RRDsProcessed:86
08/08/2011 06:20:07 PM - SPINE: Poller[0] DEBUG_EXTRA: SS[1] Command:'/var/www/cacti/scripts/ss_poller.php ss_recache' Result:54bytes Retries:0
08/08/2011 06:20:07 PM - PHPSVR: Poller[0] DEBUG: INC: '/var/www/cacti/scripts/ss_poller.php' FUNC: 'ss_recache' PARMS: ''
...
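
One thing the added debug line cannot show is why write() failed, since it logs only the byte count. Below is a minimal sketch of how errno could be captured alongside it. ss_write_logged is a hypothetical helper, not a function in Spine, and whether Spine already ignores SIGPIPE is an assumption I have not checked against the source.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not part of Spine): write a command to a pipe and
 * describe the outcome in msg. Returns bytes written, or -1 on error. */
ssize_t ss_write_logged(int fd, const char *command, char *msg, size_t msg_len) {
    ssize_t bytes;
    int saved_errno;

    assert(command != NULL && msg != NULL && msg_len > 0);

    /* Process-wide setting: without it, writing to a pipe whose reader has
     * died kills the writer with SIGPIPE instead of returning EPIPE. */
    signal(SIGPIPE, SIG_IGN);

    /* Retry only when the call was interrupted by a signal. */
    do {
        bytes = write(fd, command, strlen(command));
    } while (bytes < 0 && errno == EINTR);

    if (bytes < 0) {
        saved_errno = errno; /* keep it before any other library call */
        snprintf(msg, msg_len, "write failed: errno=%d (%s)",
                 saved_errno, strerror(saved_errno));
    } else {
        snprintf(msg, msg_len, "wrote %zd bytes", bytes);
    }
    return bytes;
}
```

Logging a message like this next to the existing "communications lost" error would distinguish a script server that has died (EPIPE) from a descriptor that was already closed (EBADF).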
monachus
Posts: 42
Joined: Mon Sep 06, 2004 1:27 am
Location: New York, NY

Re: How to debug script server operation?

Post by monachus »

Awesome! I'll spin up a new binary and give it a shot. I'm suuuuuper frustrated with Cacti right now - it's running fine, fine fine, fine, omg poller exceeded 58s!!!, fine, fine, fine, fine, fine, omg it did it again!!! - and it's all with script server stuff. We've been using Cacti for years and years and I feel like we might have outgrown it. I can't have it visible to my clients and have to explain why there are gaps in graphs every few weeks.
Adrian Goins - President / CEO
Arces Network, LLC
http://www.arces.net
monachus
Posts: 42
Joined: Mon Sep 06, 2004 1:27 am
Location: New York, NY

Re: How to debug script server operation?

Post by monachus »

More information:

Found out yesterday that while Cacti thinks it's killing off the script servers, it isn't. Hundreds of them continue running while Cacti spawns more and more until the machine eventually runs out of RAM. We're now getting this in the logs:

Code:

08/09/2011 04:07:02 AM - POLLER: Poller[0] WARNING: Poller Output Table not Empty.  Issues Found: 3210, Data Sources: traffic_in(DS[363]), traffic_out(DS[363]), traffic_in(DS[368]), traffic_out(DS[368]), traffic_in(DS[373]), traffic_out(DS[373]), traffic_in(DS[378]), traffic_out(DS[378]), traffic_in(DS[383]), traffic_out(DS[383]), traffic_in(DS[409]), traffic_out(DS[409]), traffic_in(DS[410]), traffic_out(DS[410]), traffic_in(DS[414]), traffic_out(DS[414]), traffic_in(DS[415]), traffic_out(DS[415]), traffic_in(DS[419]), traffic_out(DS[419]), traffic_in(DS[420]), Additional Issues Remain.  Only showing first 20
Our graphs have again returned to looking like a gap-toothed grandma.

I'm in a quandary - we have too many graphs to not use the script server, but when we use the script server, it intermittently doesn't work. How can we sort this out so that it works reliably?
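
On the script servers that keep running after Cacti believes it has killed them: in general, a parent can only be sure a child is gone once it has signalled it and waitpid() has returned for it. The sketch below shows the usual POSIX pattern (SIGTERM, a short grace period, then SIGKILL, reaping with waitpid). stop_and_reap is a hypothetical name and this is not how Spine is written; it only illustrates what a reliable shutdown would have to do.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper (not Spine's code): stop a child process and reap it.
 * Returns 0 if it exited after SIGTERM, 1 if SIGKILL was needed, -1 on error. */
int stop_and_reap(pid_t pid, int grace_ms) {
    const struct timespec tick = {0, 10 * 1000 * 1000}; /* 10 ms */
    int waited_ms = 0;
    int status;
    pid_t r;

    assert(pid > 0 && grace_ms >= 0);

    /* Ask politely first. */
    if (kill(pid, SIGTERM) < 0 && errno != ESRCH) return -1;

    /* Poll without blocking until the grace period runs out. */
    while (waited_ms <= grace_ms) {
        r = waitpid(pid, &status, WNOHANG);
        if (r == pid) return 0;                 /* exited and reaped */
        if (r < 0 && errno != EINTR) return -1; /* e.g. not our child */
        nanosleep(&tick, NULL);
        waited_ms += 10;
    }

    /* Still alive: SIGKILL cannot be caught or ignored. */
    if (kill(pid, SIGKILL) < 0 && errno != ESRCH) return -1;
    while ((r = waitpid(pid, &status, 0)) < 0 && errno == EINTR) { }
    return (r == pid) ? 1 : -1;
}
```

A return value of 1 showing up regularly would suggest the child is hung rather than merely slow, which is the case worth digging into.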

Has anyone looked at offloading the polling to something like Gearmand for distributed processing on remote hosts? Seems like it would be fairly easy to push polling into queues on a Gearman server and let a bunch of clients do the work. We're using it for other services (like Nagios and some in-house stuff), but I don't have the C chops to write it for Spine.
Adrian Goins - President / CEO
Arces Network, LLC
http://www.arces.net