monitor successful but host still down

krypsys · Post by **krypsys** » Tue Jun 03, 2008 4:05 pm

Running v0.8.7a with several hosts up and running OK, but I'm having trouble getting POWERWARE UPS units to report 'up' in the device list. SNMP shows successful values coming back, yet the device status stays 'down.' Same deal with TCP ping: it reports 'TCP ping success', but the device still shows a 'down' status.

this is consistent with all three Powerware UPS devices we have in production.

Any thoughts would be most appreciated.

Post by **gandalf** » Tue Jun 03, 2008 4:37 pm

What is the downed host detection method used for that device? If it's SNMP, please do a "snmpwalk -c ... -v 1 <target> system" and post results
Reinhard

krypsys · Post by **krypsys** » Tue Jun 03, 2008 4:42 pm

when set to SNMP, I see the host information (contact, etc) in the upper right of the device information (and see the traffic by sniffing), but the device still shows down.

when set to 'tcp ping' it shows 'tcp success' but host still down.

right now, it's set to SNMP.

per your request
snmpwalk -c [mycommunity] -v 1 192.168.2.19 system
Timeout: No Response from 192.168.2.19

ah yes...that doesn't look good...
what 'r u thinking?

Post by **gandalf** » Wed Jun 04, 2008 1:57 pm

Not good, yes. Target is not responding, so cacti won't graph.
Please see first link of my sig
Reinhard

krypsys · Post by **krypsys** » Wed Jun 04, 2008 2:16 pm

in troubleshooting today I tried that command again this morning and am getting a response now...

user@syslog:~# snmpwalk -c [mycommunity] -v1 192.168.2.19 system
SNMPv2-MIB::sysDescr.0 = STRING: ConnectUPS Web/SNMP Card V3.11
SNMPv2-MIB::sysObjectID.0 = OID: SNMPv2-SMI::enterprises.534.1
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (2792252309) 323 days, 4:15:23.09
SNMPv2-MIB::sysContact.0 = STRING: User Name
SNMPv2-MIB::sysName.0 = STRING: ConnectUPS Web/SNMP Card
SNMPv2-MIB::sysLocation.0 = STRING: Some, Where
SNMPv2-MIB::sysServices.0 = INTEGER: 72

only the host is still 'Down' with SNMP set as the monitor...I was reading through most of the information on your link, but not real clear to me what my next troubleshooting steps should be here.

wild!?

Any advice?
With sincere appreciation.

Post by **gandalf** » Wed Jun 04, 2008 2:41 pm

Which downed host detection method is used, please?
Reinhard

krypsys · Post by **krypsys** » Wed Jun 04, 2008 3:07 pm

SNMP, amigo.
Right now it's set to SNMP.
I tried TCP PING, yesterday, which came back 'successful' on the GUI, but the device still showed 'down'

grrrr...

krypsys · Post by **krypsys** » Wed Jun 04, 2008 3:58 pm

to further troubleshoot I started capturing packets and see this:

15:50:17.273151 192.168.1.45.38000 > 192.168.2.19.161: C=[mycommunity] GetNextRequest(21) .0.1 (DF)
15:50:17.273151 192.168.1.45.38000 > 192.168.2.19.161: C=[mycommunity] GetNextRequest(21) .0.1 (DF)

but notice, there is no response from the UPS back to the cacti server (.1.45)...

so, I manually ran this command.

user@syslog:~# snmpwalk -c [mycommunity] -v1 192.168.2.19 .0.1
Timeout: No Response from 192.168.2.19

HOWEVER, the 'system' one still works OK.
user@syslog:~# snmpwalk -c [mycommunity] -v1 192.168.2.19 system
SNMPv2-MIB::sysDescr.0 = STRING: ConnectUPS Web/SNMP Card V3.11
SNMPv2-MIB::sysObjectID.0 = OID: SNMPv2-SMI::enterprises.534.1
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (2792874556) 323 days, 5:59:05.56
SNMPv2-MIB::sysContact.0 = STRING: User Name
SNMPv2-MIB::sysName.0 = STRING: ConnectUPS Web/SNMP Card
SNMPv2-MIB::sysLocation.0 = STRING: Some, Where
SNMPv2-MIB::sysServices.0 = INTEGER: 72

So, at this point, I think it's related to the ".0.1 (DF)" tag...where is it getting this from? What is it 'walking' for that value? What is that value?

make sense?

hkspvt · Post by **hkspvt** » Wed Jun 04, 2008 4:04 pm

I'm having the same problem. A single one of my hosts is complaining that it's down, so graphs are not populating. I can both ping and run SNMP walks from the cacti server, but Cacti still shows it as down.

This is in Cacti 0.8.7b running on FreeBSD 7.0, Apache 2.2, PHP 5.

I currently have the "Downed Host Detection" set to SNMP. The top left corner (SNMP Information) populates, but the device still shows down. If I set the method to Ping, the Ping Results show "Cannot connect to host" - which I cannot replicate manually.

I've even tried wiping the host out of cacti and re-adding it, to no avail.

-HKS

Post by **gandalf** » Wed Jun 04, 2008 4:08 pm

snmpgetnext for ".0.1" should return system information. Please run snmpgetnext for OID ".0.1" manually from the CLI and post the results.
Reinhard

krypsys · Post by **krypsys** » Wed Jun 04, 2008 4:18 pm

user@syslog:~# snmpgetnext -c [mycommunity] -v 1 192.168.2.19 .0.1
Timeout: No Response from 192.168.2.19.

to validate the command is ok:
user@syslog:~# snmpgetnext -c [mycommunity] -v 1 192.168.2.19 sysUpTime
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (2792984186) 323 days, 6:17:21.86
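
For anyone reading the uptime line above: SNMP TimeTicks count hundredths of a second, so the raw number in parentheses and the "323 days, ..." text are the same value. A minimal Python sketch of the conversion (the function name `format_timeticks` is made up for illustration, it is not part of Cacti or Net-SNMP):

```python
def format_timeticks(ticks: int) -> str:
    """Render SNMP TimeTicks (hundredths of a second) as 'D days, H:MM:SS.hh'."""
    seconds, hundredths = divmod(ticks, 100)   # 100 ticks per second
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    days, hours = divmod(hours, 24)
    return f"{days} days, {hours}:{minutes:02d}:{seconds:02d}.{hundredths:02d}"


if __name__ == "__main__":
    # Raw value taken from the snmpgetnext output in this post.
    print(format_timeticks(2792984186))
```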

MORE INFORMATION
I just ran GETIF on the host and walked the MIB and, sure enough, I do not see a .0.1 value at all.

.0 = ccitt
.0.0 = ccitt.zeroDotZero
.0.0 = ccitt.nullOID

which is very strange, since I would expect the value .nullOID to be the .0.1 since it's beneath the zeroDotZero entry.

Any thoughts?

Querying .0.0 yields the same 'no response'

hkspvt · Post by **hkspvt** » Wed Jun 04, 2008 4:21 pm

.0.xxxx are assigned by the ITU-T and are not generally used in the context of the Internet. Is there a prefix we're missing somewhere?
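
To illustrate the point about prefixes: the first number of an OID selects one of three root arcs, and everything a typical SNMP agent serves lives under .1.3.6.1 (iso.org.dod.internet). A small Python sketch, with helper names (`root_arc`, `is_internet_oid`) that are hypothetical and only for illustration:

```python
# Root arcs of the OID tree; arc 0 was named "ccitt" before it became "itu-t".
ROOT_ARCS = {0: "itu-t (formerly ccitt)", 1: "iso", 2: "joint-iso-itu-t"}

# iso(1).org(3).dod(6).internet(1)
INTERNET_PREFIX = (1, 3, 6, 1)


def parse_oid(text: str) -> tuple:
    """Turn '.1.3.6.1' or '1.3.6.1' into a tuple of integers."""
    return tuple(int(part) for part in text.strip(".").split("."))


def root_arc(oid: str) -> str:
    """Name the root arc an OID belongs to."""
    return ROOT_ARCS.get(parse_oid(oid)[0], "unknown")


def is_internet_oid(oid: str) -> bool:
    """True if the OID sits under .1.3.6.1."""
    return parse_oid(oid)[:4] == INTERNET_PREFIX


if __name__ == "__main__":
    for oid in (".0.1", ".1.3.6.1.2.1.1.3.0"):
        print(oid, root_arc(oid), is_internet_oid(oid))
```

So .0.1 is outside the Internet subtree, which fits the observation that no object with that exact OID shows up in a MIB walk.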

-HKS

Post by **gandalf** » Thu Jun 05, 2008 1:39 pm

Please find module ./lib/ping.php and locate function ping_snmp. You may want to change the magic numbers in there
Reinhard

krypsys · Post by **krypsys** » Thu Jun 05, 2008 2:43 pm

I think my statement earlier, about .0.0 not existing, may have been misplaced...continuing to dive into this, I noticed that the POWERWARE UPS was not responding to the GETIF application's SNMPGET request, just as it was not responding to the CACTI's request.

The Cacti request looks like this:
14:28:23.648114 192.168.1.45.4586 > 192.168.2.19.161: C= GetNextRequest(23) .0.1

Where the GetIf request looks like this:
14:28:23.648114 192.168.1.49.4586 > 192.168.2.19.161: C= GetNextRequest(23) .0.0

In the end, both result in the POWERWARE not responding, so my earlier statement about .0.1 being 'nonexistent' was premature. That is, it may or may not exist on the POWERWARE UPS...all I know is that when asked by either application, it does not reply, whereas every other device we have replies fine to the .0.1 request.

So what .0.1 actually is remains an open question...here is an observation, however, from another client:

"user@syslog:~# snmpwalk -c [mycommunity] -v1 192.168.12.1 .0.1 -O n
.1.3.6.1.2.1.1.1.0 = STRING: Intermec Technologies AP"

So, when I set the OID value to .0.1, the client responds with the .1.3.6.1.2.1.1.1.0 value.

Just an observation...I'm not a SNMP master here (obviously).
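
That observation matches how GetNext is defined: the agent returns the first object whose OID sorts after the one requested, so the requested OID does not have to exist. Below is a minimal Python sketch of that ordering rule against a toy in-memory MIB. The table contents and the function names are assumptions for illustration only; this is not how Cacti or any real agent is implemented, and it does not explain why the Powerware card sends no reply at all.

```python
from bisect import bisect_right


def parse_oid(text: str) -> tuple:
    """Turn '.1.3.6.1.2.1.1.1.0' into a tuple of integers."""
    return tuple(int(part) for part in text.strip(".").split("."))


def get_next(mib: dict, oid: tuple):
    """Return (oid, value) of the first entry strictly after `oid`, or None.

    Python compares tuples element by element, which matches the
    lexicographic ordering SNMP uses for OIDs.
    """
    keys = sorted(mib)
    index = bisect_right(keys, oid)
    if index == len(keys):
        return None  # walked off the end of the MIB
    return keys[index], mib[keys[index]]


# Toy MIB: only the sysDescr string is taken from the post above,
# the other values are placeholders.
TOY_MIB = {
    parse_oid(".1.3.6.1.2.1.1.1.0"): "Intermec Technologies AP",  # sysDescr.0
    parse_oid(".1.3.6.1.2.1.1.2.0"): "placeholder sysObjectID",   # sysObjectID.0
    parse_oid(".1.3.6.1.2.1.1.3.0"): 12345,                       # sysUpTime.0
}

if __name__ == "__main__":
    # .0.1 is not in the table, yet GetNext still has an answer: sysDescr.0
    print(get_next(TOY_MIB, parse_oid(".0.1")))
```

In other words, asking for the next object after .0.1 is just a cheap way of asking for "the first thing you have", which on most agents is sysDescr.0.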

As for Gandalf's request, I have been trying to isolate which part of the ping_snmp function I need to massage and am at a loss...my PHP version is 5.2, so the OID variable should (always) get set to .1.3.6.1.2.1.1.3.0, but I can't isolate where the call to .0.1 gets made in the SNMP query.

Could use some more advice. My development background here is weak, but I'm trying!

Sincere appreciation on this. I'm excited to be getting into the details of this problem!

krypsys · Post by **krypsys** » Thu Jun 05, 2008 5:00 pm

what's strange is the unit stays 'down' even when I change the monitor. When I change it to TCP port 23, on a tcpdump, I immediately see a success:

16:56:53.948099 192.168.1.45.37452 > 192.168.2.19.23: S 1791524515:1791524515(0) win 5840 <mss 1460,sackOK,timestamp 499136484 0,nop,wscale 5> (DF)
16:56:53.949448 192.168.2.19.23 > 192.168.1.45.37452: S 1719005:1719005(0) ack 1791524516 win 8192 <mss 1440>
16:56:53.964901 192.168.1.45.37452 > 192.168.2.19.23: . ack 1 win 5840 (DF)
16:56:53.964917 192.168.1.45.37452 > 192.168.2.19.23: F 1:1(0) ack 1 win 5840 (DF)
16:56:53.979112 192.168.2.19.23 > 192.168.1.45.37452: FP 1:266(265) ack 2 win 8190
16:56:53.979928 192.168.2.19.23 > 192.168.1.45.37452: F 267:267(0) ack 2 win 8190
16:56:53.980962 192.168.2.19.23 > 192.168.1.45.37452: F 268:268(0) ack 2 win 8190
16:56:53.989182 192.168.1.45.37452 > 192.168.2.19.23: R 1791524517:1791524517(0) win 0 (DF)
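
The trace above (SYN, SYN/ACK, ACK, then an immediate FIN from the poller) is what a plain "connect and close" check looks like on the wire. A minimal Python sketch of such a TCP ping, useful for reproducing the test from the Cacti server by hand; the function name `tcp_ping` and the 2-second timeout are assumptions, and this is not Cacti's own code (that lives in ./lib/ping.php):

```python
import socket


def tcp_ping(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        # create_connection performs the three-way handshake; leaving the
        # with-block closes the socket, which sends the FIN seen in the trace.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Address and port are the ones from the capture in this post.
    print(tcp_ping("192.168.2.19", 23))
```

If this returns True from the poller host while the device still shows 'down', the check itself is working and the problem is more likely in what the poller does on its regular cycle.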

however, if I leave it alone for the normal 5s period, I see this:

16:55:15.834731 192.168.1.45.56660 > 192.168.2.19.33439: S 1700466563:1700466563(0) win 5840 <mss 1460,sackOK,timestamp 499126671 0,nop,wscale 5> (DF)

port 33439!? That's not the port I told it to monitor!?

This behaviour is so bizarre!
