Problem after system update in large network (500 hosts)
-
- Posts: 8
- Joined: Wed Apr 11, 2012 8:45 am
- Location: Poland
- Contact:
Problem after system update in large network (500 hosts)
I updated the CentOS distribution from version 5 to 6, and now I have a lot of problems with Cacti (I always use the current version of Cacti - now 0.8.8a).
The first problem is that I have a lot of errors in the Cacti log file about lost communication. I tried to debug this and figure out the source of my problems, but without success. I did not have this problem on CentOS 5. Now the system has 60-70 zombie processes ([php] <defunct>) and a high load average of ~18.
The second problem is that after a few hours I get gaps in the graphs and only a reboot helps in this situation (it is probably caused by the high load), but I had the same situation before the update.
Could you help me find the reasons for this situation?
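For reference, this is roughly how I am counting the defunct processes (plain ps/grep, nothing Cacti-specific):
# count the defunct [php] poller children; the [d]efunct pattern keeps grep from matching itself
ps -ef | grep '[d]efunct' | wc -l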
- Attachments
- cacti.log.zip - cacti log (2.62 MiB)
- error.JPG - errors in log (229.61 KiB)
- Technical Support page.png - Configuration page (972.05 KiB)
- poller.JPG - Poller setup (157.31 KiB)
Regards,
Bartek Chojnacki
- gandalf
- Developer
- Posts: 22383
- Joined: Thu Dec 02, 2004 2:46 am
- Location: Muenster, Germany
- Contact:
Re: Problem after system update in large network (500 hosts)
It looks like you're using 1 minute polling and the poller time overflows that limit. Correct?
R.
-
- Posts: 8
- Joined: Wed Apr 11, 2012 8:45 am
- Location: Poland
- Contact:
Re: Problem after system update in large network (500 hosts)
Yes, I use a one-minute poller interval and a five-minute cron interval, but changing the poller interval does not improve the situation. With a 5-minute poller interval I lose even more graphs, the problem with lost PHP Script Server communication remains, and the zombie processes still exist :|
Regards,
Bartek Chojnacki
-
- Posts: 8
- Joined: Wed Apr 11, 2012 8:45 am
- Location: Poland
- Contact:
Re: Problem after system update in large network (500 hosts)
Attached is the Cacti log file from when I changed the polling interval. With a 5-minute polling interval the problem with lost connections still existed, I had a very high load, and a lot of zombie processes:
top - 09:45:23 up 2 days, 22:18, 1 user, load average: 21.64, 10.69, 10.46
Tasks: 380 total, 89 running, 270 sleeping, 0 stopped, 21 zombie
top - 09:45:35 up 2 days, 22:19, 1 user, load average: 33.62, 13.90, 11.51
Tasks: 374 total, 69 running, 265 sleeping, 0 stopped, 40 zombie
top - 09:46:02 up 2 days, 22:19, 1 user, load average: 29.27, 14.62, 11.82
Tasks: 222 total, 6 running, 203 sleeping, 0 stopped, 13 zombie
And when I used the 5-minute polling interval, half of the graphs were empty, so now I have gone back to the 1-minute interval.
- Attachments
- cacti.log.5min.zip - 5 minute polling interval (8.09 KiB)
Regards,
Bartek Chojnacki
- gandalf
- Developer
- Posts: 22383
- Joined: Thu Dec 02, 2004 2:46 am
- Location: Muenster, Germany
- Contact:
Re: Problem after system update in large network (500 hosts)
You did recompile spine, then?
R.
-
- Posts: 8
- Joined: Wed Apr 11, 2012 8:45 am
- Location: Poland
- Contact:
Re: Problem after system update in large network (500 hosts)
Recompile? The last recompilation I did was for version 0.8.8a (only with the directory prefixes --prefix=/usr and --sysconfdir=/etc). I didn't recompile spine to change the time interval. I just changed the interval in the Cacti menu, shut the poller down (removed it from cron and killed it), ran php rebuild_poller_cache.php, and added it to cron again...
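For reference, this is roughly the sequence I ran (the /usr/share/cacti install path and the /etc/cron.d/cacti location are just what I use here, adjust to your layout):
# kill any pollers still running after commenting out the cron entry
pkill -f poller.php
# rebuild the poller cache after changing the interval in Settings -> Poller
cd /usr/share/cacti
php cli/rebuild_poller_cache.php
# re-enable the poller in cron (5-minute example, running as the cacti user)
echo '*/5 * * * * cacti php /usr/share/cacti/poller.php > /dev/null 2>&1' > /etc/cron.d/cacti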
Regards,
Bartek Chojnacki
- gandalf
- Developer
- Posts: 22383
- Joined: Thu Dec 02, 2004 2:46 am
- Location: Muenster, Germany
- Contact:
Re: Problem after system update in large network (500 hosts)
I meant recompiling after the OS change.
R.
-
- Posts: 8
- Joined: Wed Apr 11, 2012 8:45 am
- Location: Poland
- Contact:
Re: Problem after system update in large network (500 hosts)
Of course I did.
Regards,
Bartek Chojnacki
- gandalf
- Developer
- Posts: 22383
- Joined: Thu Dec 02, 2004 2:46 am
- Location: Muenster, Germany
- Contact:
Re: Problem after system update in large network (500 hosts)
Well, you have quite a large amount of scripted stuff. In your case, this will have the most impact on your Cacti installation. SNMP is not the key here. So it comes down to optimizing those (script server) scripts (queries).
You may run spine with verbosity=3 to learn which hosts are responding slowly. This way, you may get an idea of where the bottleneck resides.
R.
Re: Problem after system update in large network (500 hosts)
I'm willing to bet that it is hosts using ss_host_disk. I had the same problem and worked around it by deleting all of the data sources polling (hmib Available Disk). It's not a solution that is an option for everyone, but if you want to get the box back stable fast and can do without the disk statistics, then go to Data Sources, filter by the hmib Available Disk template, and delete them. There is some sort of regression with ss_host_disk or the script server function that causes this. I worked on this for a few hours Friday night with thewitness, but I have yet to find a solution other than not gathering as many disk statistics using the script server.
I can get 8 sec polling times without all the Windows disk stats.
05/21/2012 09:03:10 AM - SYSTEM STATS: Time:8.9264 Method:spine Processes:2 Threads:15 Hosts:948 HostsPerProcess:474 DataSources:35865 RRDsProcessed:0
But after adding a few hosts' disk stats:
05/21/2012 01:19:24 PM - SYSTEM STATS: Time:22.8256 Method:spine Processes:2 Threads:15 Hosts:948 HostsPerProcess:474 DataSources:35982 RRDsProcessed:0
Re: Problem after system update in large network (500 hosts)
Another thing that is different from what I am used to seeing with Cacti is that when I have multiple down hosts, the polling time takes a lot longer as well. For instance, if I have 6 hosts down at any one time, my polling time will jump from 8 sec to 25-30 sec. I can check which hosts are the culprits by using spine and sorting the timed results.
# spine -R -S -V 3 > spine.out
# grep Total spine.out | sort -k 5
Host[524] TH[1] Total Time: 18 Seconds
Host[798] TH[1] Total Time: 18 Seconds
Host[799] TH[1] Total Time: 18 Seconds
Host[800] TH[1] Total Time: 18 Seconds
Host[801] TH[1] Total Time: 18 Seconds
Host[447] TH[1] Total Time: 12 Seconds
Host[730] TH[1] Total Time: 5 Seconds
Host[1112] TH[1] Total Time: 1 Seconds
Host[1169] TH[1] Total Time: 1 Seconds
Host[1047] TH[1] Total Time: 0.1 Seconds
Host[1069] TH[1] Total Time: 0.1 Seconds
Host[1256] TH[1] Total Time: 0.1 Seconds
Then I can confirm that the top 6 hosts are actually down but are holding up the polling process for a significant amount of time. You can imagine that if for some reason there were 30+ down devices, the poller time would overrun and cause issues.
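For what it's worth, a numeric reverse sort on the time field puts the slowest hosts first (assuming GNU sort; field 5 is the seconds value in spine's Total Time lines):
# list the ten slowest hosts from the spine output captured above
grep "Total Time" spine.out | sort -k5,5 -rn | head -n 10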
- gandalf
- Developer
- Posts: 22383
- Joined: Thu Dec 02, 2004 2:46 am
- Location: Muenster, Germany
- Contact:
Re: Problem after system update in large network (500 hosts)
MrRat wrote:Another thing that is different from what I am used to seeing with Cacti is that when I have multiple down hosts, the polling time takes a lot longer as well. For instance, if I have 6 hosts down at any one time, my polling time will jump from 8 sec to 25-30 sec. I can check which hosts are the culprits by using spine and sorting the timed results.
# spine -R -S -V 3 > spine.out
Hmm, downed host detection should take care of that. Which method are you using? Which timeouts and which retries?
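If it helps, you can also pull the per-device settings straight from the database; a rough sketch assuming the stock Cacti 0.8.x host table and a database called cacti:
# show down devices with their availability method, timeout and retries
# (status = 1 should be "down" in 0.8.x; adjust the user/db names to your install)
mysql -u cactiuser -p cacti -e "SELECT id, description, availability_method, ping_method, ping_timeout, ping_retries FROM host WHERE status = 1;"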
R.
- gandalf
- Developer
- Posts: 22383
- Joined: Thu Dec 02, 2004 2:46 am
- Location: Muenster, Germany
- Contact:
Re: Problem after system update in large network (500 hosts)
MrRat wrote:I'm willing to bet that it is hosts using ss_host_disk. I had the same problem and worked around it by deleting all of the data sources polling (hmib Available Disk). It's not a solution that is an option for everyone, but if you want to get the box back stable fast and can do without the disk statistics, then go to Data Sources, filter by the hmib Available Disk template, and delete them. There is some sort of regression with ss_host_disk or the script server function that causes this. I worked on this for a few hours Friday night with thewitness, but I have yet to find a solution other than not gathering as many disk statistics using the script server.
I can get 8 sec polling times without all the Windows disk stats.
05/21/2012 09:03:10 AM - SYSTEM STATS: Time:8.9264 Method:spine Processes:2 Threads:15 Hosts:948 HostsPerProcess:474 DataSources:35865 RRDsProcessed:0
But after adding a few hosts' disk stats:
05/21/2012 01:19:24 PM - SYSTEM STATS: Time:22.8256 Method:spine Processes:2 Threads:15 Hosts:948 HostsPerProcess:474 DataSources:35982 RRDsProcessed:0
Windows disk via SNMP is baaaad. But I want to recommend my pure SNMP replacement for the disk space stuff. See the 4th link of my sig. It's way faster compared to the script server.
R.
Re: Problem after system update in large network (500 hosts)
I see this on devices that use (ping or SNMP) for sure; I'll have to confirm that it also happens for devices using (ping and SNMP).
I saw it on a single down device just now.
05/23/2012 01:24:32 PM - SYSTEM HMIB STATS: time:20.2100 processes:50 hosts:318
05/23/2012 01:24:17 PM - SYSTEM SYSLOG STATS:Time:6.13 Deletes:0 Incoming:291 Removes:0 XFers:10439 Alerts:0 Alarms:0 Reports:0
05/23/2012 01:24:11 PM - SYSTEM THOLD STATS: Time:0.0085 Tholds:0 TotalHosts:943 DownHosts:1 NewDownHosts:0
05/23/2012 01:24:11 PM - SYSTEM STATS: Time:9.5638 Method:spine Processes:2 Threads:15 Hosts:944 HostsPerProcess:472 DataSources:39417 RRDsProcessed:0
05/23/2012 12:57:57 PM - SYSTEM HMIB STATS: time:20.5700 processes:50 hosts:317
05/23/2012 12:57:51 PM - SYSTEM SYSLOG STATS:Time:14.1 Deletes:0 Incoming:531 Removes:0 XFers:20786 Alerts:0 Alarms:0 Reports:0
05/23/2012 12:57:37 PM - SYSTEM THOLD STATS: Time:13.7606 Tholds:0 TotalHosts:943 DownHosts:1 NewDownHosts:1
05/23/2012 12:57:23 PM - SYSTEM STATS: Time:21.3651 Method:spine Processes:2 Threads:15 Hosts:944 HostsPerProcess:472 DataSources:39409 RRDsProcessed:0
05/23/2012 12:57:21 PM - SPINE: Poller[0] Host[500] Hostname[10.9.75.44] ERROR: HOST EVENT: Host is DOWN Message: Host did not respond to SNMP, ICMP: Ping timed out
05/23/2012 12:57:03 PM - SYSTEM BOOST STATS: Time:112.6200 RRDUpdates:1039088
05/23/2012 12:45:32 PM - SYSTEM HMIB STATS: time:20.5000 processes:50 hosts:318
05/23/2012 12:45:23 PM - SYSTEM SYSLOG STATS:Time:12.04 Deletes:0 Incoming:452 Removes:0 XFers:17050 Alerts:0 Alarms:0 Reports:0
05/23/2012 12:45:11 PM - SYSTEM THOLD STATS: Time:0.0084 Tholds:0 TotalHosts:943 DownHosts:1 NewDownHosts:0
05/23/2012 12:45:11 PM - SYSTEM STATS: Time:9.7753 Method:spine Processes:2 Threads:15 Hosts:944 HostsPerProcess:472 DataSources:39410 RRDsProcessed:0