Cacti 0.8.7 cmd.php to spine, several hosts show down.

nvetro · Post by **nvetro** » Wed Feb 06, 2008 1:07 pm

OK, I set it to 4. It is on a dual-CPU SPARC machine. I think they are dual core or maybe quad core, probably dual, so I put it to 4. I'll let it run a bit and report back in a few.

nvetro · Post by **nvetro** » Wed Feb 06, 2008 1:48 pm

No go, it still shows a bunch of hosts down and only a few up. What else have you guys got?

nvetro · Post by **nvetro** » Wed Feb 06, 2008 1:52 pm

Just for laughs, here is a screenshot of the settings on a 'downed' device that shows down when using spine but is fine using cmd.php:

nvetro · Post by **nvetro** » Wed Feb 06, 2008 4:23 pm

Anyone?

Post by **TheWitness** » Wed Feb 06, 2008 4:56 pm

Run an ethereal/wireshark capture during polling and send it to me by PM.

TheWitness

nvetro · Post by **nvetro** » Wed Feb 06, 2008 7:53 pm

TheWitness, I am working on getting that to you ASAP using wireshark.

Post by **TheWitness** » Fri Feb 08, 2008 7:24 pm

Thank you. I will be back in Detroit on Saturday and likely working on Cacti 0.8.8, and a few other things on Sunday.

TheWitness

frankfegert · Post by **frankfegert** » Sat Feb 09, 2008 9:51 am

nvetro wrote: Anyone?

Ah, another SPARC/Solaris fellow? Thank god, I almost believed I was alone with that.

Did you upgrade from an earlier version of Cacti, or is this a fresh install? If it's an upgrade: I had some problems with the SNMP security level. Although I'm using "authNoPriv", some hosts had "snmp_priv_protocol" != "[None]" in the cacti.host MySQL table, which causes an invalid command line for the snmpwalk/snmpget command. To verify, connect to the DB with:

Code:

mysql -u <yourcactiuser> -p cacti

and verify that "snmp_priv_protocol" is set according to your needs:

Code:

select id,hostname,snmp_version,snmp_priv_protocol from host;

For me there were some "MD5" entries, where there should have been "[None]".
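
To avoid scanning the full host list by eye, the check can be narrowed to only the rows that look wrong. A minimal sketch, assuming every SNMP v3 host is meant to use authNoPriv (so anything other than "[None]" is suspect). The function name `priv_mismatch_query` is made up for illustration; the table and column names are the ones from the query in this post.

```shell
# Hypothetical helper: prints a query that lists only SNMP v3 hosts whose
# privacy protocol is something other than "[None]".
# Assumption: all v3 hosts are supposed to be authNoPriv.
priv_mismatch_query() {
  cat <<'SQL'
SELECT id, hostname, snmp_version, snmp_priv_protocol
FROM host
WHERE snmp_version = 3
  AND snmp_priv_protocol <> '[None]';
SQL
}

# Usage (same login as in this post):
#   priv_mismatch_query | mysql -u <yourcactiuser> -p cacti
```

An empty result would suggest the privacy protocol is not the problem on that install.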

Regards,

Frank

nvetro · Post by **nvetro** » Mon Feb 11, 2008 9:47 am

Frank,

Thanks for the reply! So I ran the SQL query and everything looks to be in order: the snmp_version is 3, and the snmp_priv_protocol values are all [None]. In the query output, the details of a "down" host also match the details of an "up" host. What else ya got? I really don't want to install that packet monitoring app on this box, as it's a production machine and we do not have another box to accomplish this.

TheWitness, how else can I provide you with more information? Also, for the record, several hosts that are "up" are on the same subnet as several hosts that are "down". I do not think it's a network issue, as we changed the network location of this box before to rule that out.

frankfegert · Post by **frankfegert** » Tue Feb 12, 2008 3:22 am

nvetro wrote: What else ya got?

Please pick the ID of a host shown as down. This is the number in the "... Host[62] ..." logfile output. If you're running spine under its own user, switch to that user via 'su - <cactiuser>'. Then run spine on the command line:

Code:

spine -f <number> -l <number> -R -S -V 5

and post the output. Also the output of:

Code:

truss -f spine -f <number> -l <number> -R -S -V 5

could be helpful, but is usually very verbose.

nvetro · Post by **nvetro** » Tue Feb 12, 2008 12:32 pm

I sent you a PM with the outputs of those two commands in an attachment. Let me know if you need anything else, this issue has really stumped me.

crimsonstone · Post by **crimsonstone** » Tue Feb 12, 2008 2:08 pm

I've had this issue in the past, and while I never came up with the root issue, I noticed that changing the downed host detection on the down hosts from "SNMP" to "Ping and SNMP" (or vice-versa) cleared up the issue.

/shrug

frankfegert · Post by **frankfegert** » Tue Feb 12, 2008 3:05 pm

nvetro wrote: I sent you a PM with the outputs of those two commands in an attachment. Let me know if you need anything else, this issue has really stumped me.

Sorry, the truss'd output file and spine debugging output file are basically the same. Something went wrong with your truss run.

But this is odd:

Code:

...
DEBUG: SQL:'SELECT id, hostname, snmp_community, snmp_version, snmp_username, snmp_password, snmp_auth_protocol, snmp_priv_passphrase, snmp_priv_protocol, snmp_context, snmp_port, snmp_timeout, max_oids, availability_method, ping_method, ping_port, ping_timeout, ping_retries, status, status_event_count, status_fail_date, status_rec_date, status_last_error, min_time, max_time, cur_time, avg_time, total_polls, failed_polls, availability  FROM host WHERE id=72'
DEBUG: The Value of Active Threads is 1
Host[72] SNMP Result: Host responded to SNMP
DEBUG: SQL:'UPDATE host SET status='2', status_event_count='1', status_fail_date='2008-02-06 13:05:00', status_rec_date='2008-02-12 12:19', status_last_error='Host did not respond to SNMP', min_time='0.492100', max_time='2194.970000', cur_time='3.013850', avg_time='742.754782', total_polls='17159', failed_polls='1930', availability='88.7523' WHERE id='72''
DEBUG: SQL:'SELECT data_query_id, action, op, assert_value, arg1 FROM poller_reindex WHERE host_id=72'
Host[72] Host has no information for recache.
DEBUG: SQL:'SELECT snmp_port, count(snmp_port) FROM poller_item WHERE host_id=72 AND rrd_next_step < 0 GROUP BY snmp_port'
DEBUG: SQL:'SELECT action, hostname, snmp_community, snmp_version, snmp_username, snmp_password, rrd_name, rrd_path, arg1, arg2, arg3, local_data_id, rrd_num, snmp_port, snmp_timeout, snmp_auth_protocol, snmp_priv_passphrase, snmp_priv_protocol, snmp_context  FROM poller_item WHERE host_id=72 and rrd_next_step <=0 ORDER by snmp_port'
DEBUG: SQL:'UPDATE poller_item SET rrd_next_step=rrd_next_step-300 WHERE host_id=72'
DEBUG: SQL:'UPDATE poller_item SET rrd_next_step=rrd_step-300 WHERE rrd_next_step < 0 and host_id=72'
...
Host[72] DS[1436] SNMP: v3: 216.105.160.80, dsname: traffic_out, oid: .1.3.6.1.2.1.2.2.1.16.2, value: 267631185
Host[72] DS[1436] SNMP: v3: 216.105.160.80, dsname: traffic_in, oid: .1.3.6.1.2.1.2.2.1.10.2, value: 1600107098

So you're getting SNMP results, but the host status is set to "2", which is "recovering". Notice the values of "min_time" (0.4921) and "max_time" (2194.97). Can you please make sure this isn't a timeout issue in the SNMP ping by setting the "Ping Timeout Value" higher?
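
To compare these numbers across several hosts, the status and timing fields can be pulled out of a saved spine debug log. A rough sketch, assuming the "UPDATE host SET ..." lines look like the one quoted in this post. `host_update_summary` is just an illustrative name, and only basic sed is used so it should not need GNU tools.

```shell
# Hypothetical helper: reads spine debug output on stdin and prints one
# summary line per "UPDATE host SET ..." statement it finds.
# Assumption: the field order matches the log excerpt quoted in this post.
host_update_summary() {
  sed -n "s/.*UPDATE host SET status='\([0-9]*\)'.*min_time='\([0-9.]*\)'.*max_time='\([0-9.]*\)'.*cur_time='\([0-9.]*\)'.*WHERE id='\([0-9]*\)'.*/id=\5 status=\1 min=\2 max=\3 cur=\4/p"
}

# Usage: host_update_summary < spine-debug.log   (file name is a placeholder)
```

Lines that are not host updates are dropped, so the output stays short even with -V 5 logging.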

Regards,

Frank

nvetro · Post by **nvetro** » Tue Feb 12, 2008 7:56 pm

OK, I raised the SNMP timeout from 500ms (default) and the ping timeout value from 400ms (default) to 1500ms (1.5 seconds) for 8 hosts which were down. Let's see if this fixes the issue. I don't think it will, because if I do an snmpwalk from the CLI it will time out. That's something up on the host side and not Cacti, wouldn't you say?

nvetro · Post by **nvetro** » Tue Feb 12, 2008 8:45 pm

OK, I THINK it's fixed. All hosts are currently "recovering". Here is what I did: increasing the SNMP timeout for each host didn't do anything, and increasing the ping timeout for each host didn't do anything. What DID do something was changing the host detection method for each host to Ping & SNMP. It has the default ms now, default port (23) and default protocol (udp). All coming up now, I'll report back in a bit.
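
For anyone comparing a fixed host against one that is still down, the detection settings can be listed straight from the database. A small sketch: the columns are the ones spine selects from the host table in the debug output earlier in the thread, and `availability_query` is a made-up name. It makes no claim about what the numeric codes mean, beyond status 2 being "recovering" as noted earlier in the thread.

```shell
# Hypothetical helper: prints a query showing each host's status next to
# its downed-host detection and timeout settings.
availability_query() {
  cat <<'SQL'
SELECT id, hostname, status, availability_method,
       ping_method, ping_port, ping_timeout, snmp_timeout
FROM host
ORDER BY status, id;
SQL
}

# Usage: availability_query | mysql -u <yourcactiuser> -p cacti
```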
