Problem after system update in large network (500 hosts)

Post support questions that directly relate to Linux/Unix operating systems.

Moderators: Developers, Moderators

bchojnacki
Posts: 8
Joined: Wed Apr 11, 2012 8:45 am
Location: Poland
Contact:

Problem after system update in large network (500 hosts)

Post by bchojnacki »

I updated my CentOS distribution from version 5 to 6, and now I have a lot of problems with Cacti (I always use the current version of Cacti, now 0.8.8a).

The first problem is that I have a lot of errors in the Cacti log about lost communication. I tried to debug this and figure out the source of the problem, but without success. I didn't have this problem on CentOS 5. The system now has 60-70 zombie processes ([php] <defunct>) and a high load average of ~18.
The second problem is that after a few hours I get gaps in the graphs, and only a reboot helps (it is probably caused by the high load), but I had the same situation before the update.

Could you help me find the reasons for this situation?
Attachments
cacti.log.zip
cacti log
(2.62 MiB) Downloaded 52 times
errors in log
error.JPG (229.61 KiB) Viewed 1865 times
Configuration page
Technical Support page.png (972.05 KiB) Viewed 1865 times
Poller setup
poller.JPG (157.31 KiB) Viewed 1865 times
Regards,
Bartek Chojnacki
gandalf
Developer
Posts: 22383
Joined: Thu Dec 02, 2004 2:46 am
Location: Muenster, Germany
Contact:

Re: Problem after system update in large network (500 hosts)

Post by gandalf »

It looks like you're using 1 minute polling and the poller time overflows that limit. Correct?
R.
bchojnacki
Posts: 8
Joined: Wed Apr 11, 2012 8:45 am
Location: Poland
Contact:

Re: Problem after system update in large network (500 hosts)

Post by bchojnacki »

Yes, I use a one-minute poller interval and a 5-minute cron interval, but changing the poller interval does not improve the situation. At a 5-minute poller interval I lose even more graphs, the "PHP Script Server communication lost" problem remains, and the zombie processes still exist :|
Regards,
Bartek Chojnacki
bchojnacki
Posts: 8
Joined: Wed Apr 11, 2012 8:45 am
Location: Poland
Contact:

Re: Problem after system update in large network (500 hosts)

Post by bchojnacki »

Attached is the Cacti log file from when I changed the polling interval to 5 minutes. The lost-connection problem still existed, the load was very high, and there were a lot of zombie processes:

->
top - 09:45:23 up 2 days, 22:18, 1 user, load average: 21.64, 10.69, 10.46
Tasks: 380 total, 89 running, 270 sleeping, 0 stopped, 21 zombie

->
top - 09:45:35 up 2 days, 22:19, 1 user, load average: 33.62, 13.90, 11.51
Tasks: 374 total, 69 running, 265 sleeping, 0 stopped, 40 zombie

->
top - 09:46:02 up 2 days, 22:19, 1 user, load average: 29.27, 14.62, 11.82
Tasks: 222 total, 6 running, 203 sleeping, 0 stopped, 13 zombie

And when I used the 5-minute polling interval, half of the graphs were empty, so I have now gone back to the 1-minute interval.
Attachments
cacti.log.5min.zip
5 minute polling interval
(8.09 KiB) Downloaded 63 times
Regards,
Bartek Chojnacki
gandalf
Developer
Posts: 22383
Joined: Thu Dec 02, 2004 2:46 am
Location: Muenster, Germany
Contact:

Re: Problem after system update in large network (500 hosts)

Post by gandalf »

You did recompile spine, then?
R.
bchojnacki
Posts: 8
Joined: Wed Apr 11, 2012 8:45 am
Location: Poland
Contact:

Re: Problem after system update in large network (500 hosts)

Post by bchojnacki »

Recompile? The last recompile I did was for version 0.8.8a (only with the directory prefixes --prefix=/usr and --sysconfdir=/etc). I didn't recompile spine to change the time interval. I just changed the interval in the Cacti menu, shut the poller down (removed it from cron and killed it), ran php rebuild_poller_cache.php, and added it to cron again.
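The procedure I followed, as a rough shell sketch (the Cacti CLI path and the cron line shown are examples, not my exact setup; adjust them for your install):

```shell
# Sketch of the interval-change procedure described above.
# Paths and the cron entry are assumptions; adjust for your install.

# 1. Remove the poller entry from cron (here: edit root's crontab by hand)
crontab -e                         # delete the poller.php line

# 2. Kill any poller / script-server processes still running
pkill -f 'poller.php'
pkill -f 'script_server.php'

# 3. Rebuild the poller cache so the new interval takes effect
cd /usr/share/cacti/cli            # assumed CLI directory
php rebuild_poller_cache.php

# 4. Re-add the poller to cron, e.g. (example entry):
# */5 * * * * cacti php /usr/share/cacti/poller.php > /dev/null 2>&1
```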
Regards,
Bartek Chojnacki
gandalf
Developer
Posts: 22383
Joined: Thu Dec 02, 2004 2:46 am
Location: Muenster, Germany
Contact:

Re: Problem after system update in large network (500 hosts)

Post by gandalf »

I meant recompiling spine after the OS change.
R.
bchojnacki
Posts: 8
Joined: Wed Apr 11, 2012 8:45 am
Location: Poland
Contact:

Re: Problem after system update in large network (500 hosts)

Post by bchojnacki »

Of course I did.
Regards,
Bartek Chojnacki
gandalf
Developer
Posts: 22383
Joined: Thu Dec 02, 2004 2:46 am
Location: Muenster, Germany
Contact:

Re: Problem after system update in large network (500 hosts)

Post by gandalf »

Well, you have quite a large amount of scripted stuff. In your case, this will have the most impact on your Cacti installation. SNMP is not the key here. So it comes back to optimizing those (script server) scripts (queries).
You may run spine with verbosity=3 to learn which host is responding slowly. That way, you may get an idea of where the bottleneck resides.
R.
MrRat
Cacti User
Posts: 135
Joined: Thu Jan 07, 2010 10:33 am

Re: Problem after system update in large network (500 hosts)

Post by MrRat »

I'm willing to bet that it is the hosts using ss_host_disk. I had the same problem and worked around it by deleting all of the data sources polling hmib Available Disk. It's not a solution that is an option for everyone, but if you want to get the box stable again fast and can do without the disk statistics, go to Data Sources, filter by the template "hmib Available Disk", and delete them. There is some sort of regression in ss_host_disk or the script server function that causes this. I worked on this for a few hours Friday night with thewitness, but I have yet to find a solution other than not gathering as many disk statistics via the script server.

I can get 8-second polling times without all the Windows disk stats:
05/21/2012 09:03:10 AM - SYSTEM STATS: Time:8.9264 Method:spine Processes:2 Threads:15 Hosts:948 HostsPerProcess:474 DataSources:35865 RRDsProcessed:0

But after adding disk stats for a few hosts:
05/21/2012 01:19:24 PM - SYSTEM STATS: Time:22.8256 Method:spine Processes:2 Threads:15 Hosts:948 HostsPerProcess:474 DataSources:35982 RRDsProcessed:0
MrRat
Cacti User
Posts: 135
Joined: Thu Jan 07, 2010 10:33 am

Re: Problem after system update in large network (500 hosts)

Post by MrRat »

Another thing that is different from what I am used to seeing with Cacti is that when I have multiple down hosts, the polling time takes a lot longer too. For instance, if I have 6 hosts down at any one time, my polling time will jump from 8 sec to 25-30 sec. I can check which hosts are the culprits by using spine and sorting the timed results.
# spine -R -S -V 3 > spine.out

# grep Total spine.out | sort -rn -k 5
Host[524] TH[1] Total Time: 18 Seconds
Host[798] TH[1] Total Time: 18 Seconds
Host[799] TH[1] Total Time: 18 Seconds
Host[800] TH[1] Total Time: 18 Seconds
Host[801] TH[1] Total Time: 18 Seconds
Host[447] TH[1] Total Time: 12 Seconds
Host[730] TH[1] Total Time: 5 Seconds
Host[1112] TH[1] Total Time: 1 Seconds
Host[1169] TH[1] Total Time: 1 Seconds
Host[1047] TH[1] Total Time: 0.1 Seconds
Host[1069] TH[1] Total Time: 0.1 Seconds
Host[1256] TH[1] Total Time: 0.1 Seconds

Then I can confirm that the top 6 hosts are actually down but are holding up the polling process for some amount of time. You can imagine that if for some reason 30+ devices were down, the poller time would overrun and cause issues.
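A back-of-the-envelope calculation shows why a handful of down hosts can add tens of seconds. The timeout and retry values below are hypothetical (they are not from this thread; check your own Poller settings and per-device SNMP options):

```shell
# How long one unreachable host can hold a spine thread.
# The timeout/retry values are hypothetical examples.
PING_MS=400;  PING_RETRIES=1
SNMP_MS=500;  SNMP_RETRIES=3

# With "ping and snmp" availability checking, a down host must exhaust both:
STALL_MS=$(( PING_MS * (PING_RETRIES + 1) + SNMP_MS * (SNMP_RETRIES + 1) ))
echo "${STALL_MS} ms per down host"    # 2800 ms with these numbers

# Six such hosts landing on the same thread stall it for roughly:
echo "$(( 6 * STALL_MS / 1000 )) s"    # ~16 s, in the ballpark of the jump above
```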
gandalf
Developer
Posts: 22383
Joined: Thu Dec 02, 2004 2:46 am
Location: Muenster, Germany
Contact:

Re: Problem after system update in large network (500 hosts)

Post by gandalf »

MrRat wrote: Another thing that is different from what I am used to seeing with Cacti is that when I have multiple down hosts, the polling time takes a lot longer too. For instance, if I have 6 hosts down at any one time, my polling time will jump from 8 sec to 25-30 sec. I can check which hosts are the culprits by using spine and sorting the timed results.
# spine -R -S -V 3 > spine.out
Hmm, downed host detection should take care of that. Which method are you using? Which timeouts and which retries?
R.
gandalf
Developer
Posts: 22383
Joined: Thu Dec 02, 2004 2:46 am
Location: Muenster, Germany
Contact:

Re: Problem after system update in large network (500 hosts)

Post by gandalf »

MrRat wrote: I'm willing to bet that it is the hosts using ss_host_disk. I had the same problem and worked around it by deleting all of the data sources polling hmib Available Disk. It's not a solution that is an option for everyone, but if you want to get the box stable again fast and can do without the disk statistics, go to Data Sources, filter by the template "hmib Available Disk", and delete them. There is some sort of regression in ss_host_disk or the script server function that causes this. I worked on this for a few hours Friday night with thewitness, but I have yet to find a solution other than not gathering as many disk statistics via the script server.

I can get 8-second polling times without all the Windows disk stats:
05/21/2012 09:03:10 AM - SYSTEM STATS: Time:8.9264 Method:spine Processes:2 Threads:15 Hosts:948 HostsPerProcess:474 DataSources:35865 RRDsProcessed:0

But after adding disk stats for a few hosts:
05/21/2012 01:19:24 PM - SYSTEM STATS: Time:22.8256 Method:spine Processes:2 Threads:15 Hosts:948 HostsPerProcess:474 DataSources:35982 RRDsProcessed:0
Windows disk via SNMP is baaaad. But I want to recommend my pure-SNMP replacement for the disk space stuff; see the 4th link of my sig. It's way faster compared to the script server.
R.
MrRat
Cacti User
Posts: 135
Joined: Thu Jan 07, 2010 10:33 am

Re: Problem after system update in large network (500 hosts)

Post by MrRat »

I see this on devices that use "ping or snmp" for sure; I'll have to confirm that it also happens on devices using "ping and snmp".
I saw it on a single down device just now:
05/23/2012 01:24:32 PM - SYSTEM HMIB STATS: time:20.2100 processes:50 hosts:318
05/23/2012 01:24:17 PM - SYSTEM SYSLOG STATS:Time:6.13 Deletes:0 Incoming:291 Removes:0 XFers:10439 Alerts:0 Alarms:0 Reports:0
05/23/2012 01:24:11 PM - SYSTEM THOLD STATS: Time:0.0085 Tholds:0 TotalHosts:943 DownHosts:1 NewDownHosts:0
05/23/2012 01:24:11 PM - SYSTEM STATS: Time:9.5638 Method:spine Processes:2 Threads:15 Hosts:944 HostsPerProcess:472 DataSources:39417 RRDsProcessed:0
05/23/2012 12:57:57 PM - SYSTEM HMIB STATS: time:20.5700 processes:50 hosts:317
05/23/2012 12:57:51 PM - SYSTEM SYSLOG STATS:Time:14.1 Deletes:0 Incoming:531 Removes:0 XFers:20786 Alerts:0 Alarms:0 Reports:0
05/23/2012 12:57:37 PM - SYSTEM THOLD STATS: Time:13.7606 Tholds:0 TotalHosts:943 DownHosts:1 NewDownHosts:1
05/23/2012 12:57:23 PM - SYSTEM STATS: Time:21.3651 Method:spine Processes:2 Threads:15 Hosts:944 HostsPerProcess:472 DataSources:39409 RRDsProcessed:0
05/23/2012 12:57:21 PM - SPINE: Poller[0] Host[500] Hostname[10.9.75.44] ERROR: HOST EVENT: Host is DOWN Message: Host did not respond to SNMP, ICMP: Ping timed out
05/23/2012 12:57:03 PM - SYSTEM BOOST STATS: Time:112.6200 RRDUpdates:1039088
05/23/2012 12:45:32 PM - SYSTEM HMIB STATS: time:20.5000 processes:50 hosts:318
05/23/2012 12:45:23 PM - SYSTEM SYSLOG STATS:Time:12.04 Deletes:0 Incoming:452 Removes:0 XFers:17050 Alerts:0 Alarms:0 Reports:0
05/23/2012 12:45:11 PM - SYSTEM THOLD STATS: Time:0.0084 Tholds:0 TotalHosts:943 DownHosts:1 NewDownHosts:0
05/23/2012 12:45:11 PM - SYSTEM STATS: Time:9.7753 Method:spine Processes:2 Threads:15 Hosts:944 HostsPerProcess:472 DataSources:39410 RRDsProcessed:0