Problem after system update in large network (500 hosts)
-
- Posts: 8
- Joined: Wed Apr 11, 2012 8:45 am
- Location: Poland
- Contact:
Problem after system update in large network (500 hosts)
I updated the CentOS distribution from version 5 to 6, and now I have a lot of problems with Cacti (I always use the current version of Cacti - now 0.8.8a).
The first problem is that I have a lot of errors in the Cacti log file about lost communication. I tried to debug this and figure out the source of my problems, but without success. I did not have this problem on CentOS 5. Now the system has 60-70 zombie processes ([php] <defunct>) and a high load average of ~18.
The second problem is that after a few hours I get gaps in the graphs and only a reboot helps in this situation (it is probably caused by the high load), but I had the same situation before the update.
Could you help me find the reasons for this situation?
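For reference, this is roughly how I am counting the defunct processes (plain ps/grep, nothing Cacti-specific):
# count the defunct [php] poller children; the [d]efunct pattern keeps grep from matching itself
ps -ef | grep '[d]efunct' | wc -l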
- Attachments
- cacti.log.zip - cacti log (2.62 MiB)
- error.JPG - errors in log (229.61 KiB)
- Technical Support page.png - Configuration page (972.05 KiB)
- poller.JPG - Poller setup (157.31 KiB)
Regards,
Bartek Chojnacki
- gandalf
- Developer
- Posts: 22383
- Joined: Thu Dec 02, 2004 2:46 am
- Location: Muenster, Germany
- Contact:
Re: Problem after system update in large network (500 hosts)
It looks like you're using 1 minute polling and the poller time overflows that limit. Correct?
R.
-
- Posts: 8
- Joined: Wed Apr 11, 2012 8:45 am
- Location: Poland
- Contact:
Re: Problem after system update in large network (500 hosts)
Yes, I use a one-minute poller interval and a five-minute cron interval, but changing the poller interval does not improve the situation. With a 5-minute poller interval I lose even more graphs, the problem with lost PHP Script Server communication remains, and the zombie processes still exist :|
Regards,
Bartek Chojnacki
-
- Posts: 8
- Joined: Wed Apr 11, 2012 8:45 am
- Location: Poland
- Contact:
Re: Problem after system update in large network (500 hosts)
Attached is the Cacti log file from when I changed the polling interval. With a 5-minute polling interval the problem with lost connections still existed, I had a very high load, and a lot of zombie processes:
top - 09:45:23 up 2 days, 22:18, 1 user, load average: 21.64, 10.69, 10.46
Tasks: 380 total, 89 running, 270 sleeping, 0 stopped, 21 zombie
top - 09:45:35 up 2 days, 22:19, 1 user, load average: 33.62, 13.90, 11.51
Tasks: 374 total, 69 running, 265 sleeping, 0 stopped, 40 zombie
top - 09:46:02 up 2 days, 22:19, 1 user, load average: 29.27, 14.62, 11.82
Tasks: 222 total, 6 running, 203 sleeping, 0 stopped, 13 zombie
And when I used the 5-minute polling interval, half of the graphs were empty, so now I have gone back to the 1-minute interval.
- Attachments
- cacti.log.5min.zip - 5 minute polling interval (8.09 KiB)
Regards,
Bartek Chojnacki
- gandalf
- Developer
- Posts: 22383
- Joined: Thu Dec 02, 2004 2:46 am
- Location: Muenster, Germany
- Contact:
Re: Problem after system update in large network (500 hosts)
You did recompile spine, then?
R.
-
- Posts: 8
- Joined: Wed Apr 11, 2012 8:45 am
- Location: Poland
- Contact:
Re: Problem after system update in large network (500 hosts)
Recompile? The last recompilation I did was for version 0.8.8a (only with the directory prefixes --prefix=/usr and --sysconfdir=/etc). I didn't recompile spine to change the time interval. I just changed the interval in the Cacti menu, shut the poller down (removed it from cron and killed it), ran php rebuild_poller_cache.php, and added it to cron again...
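For reference, this is roughly the sequence I ran (the /usr/share/cacti install path and the /etc/cron.d/cacti location are just what I use here, adjust to your layout):
# kill any pollers still running after commenting out the cron entry
pkill -f poller.php
# rebuild the poller cache after changing the interval in Settings -> Poller
cd /usr/share/cacti
php cli/rebuild_poller_cache.php
# re-enable the poller in cron (5-minute example, running as the cacti user)
echo '*/5 * * * * cacti php /usr/share/cacti/poller.php > /dev/null 2>&1' > /etc/cron.d/cacti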
Regards,
Bartek Chojnacki
- gandalf
- Developer
- Posts: 22383
- Joined: Thu Dec 02, 2004 2:46 am
- Location: Muenster, Germany
- Contact:
Re: Problem after system update in large network (500 hosts)
I meant recompiling after the OS change.
R.
-
- Posts: 8
- Joined: Wed Apr 11, 2012 8:45 am
- Location: Poland
- Contact:
Re: Problem after system update in large network (500 hosts)
Of course I did.
Regards,
Bartek Chojnacki
- gandalf
- Developer
- Posts: 22383
- Joined: Thu Dec 02, 2004 2:46 am
- Location: Muenster, Germany
- Contact:
Re: Problem after system update in large network (500 hosts)
Well, you have quite a large amount of scripted stuff. In your case, this will have the most impact on your Cacti installation. SNMP is not the key here. So it comes down to optimizing those (script server) scripts (queries).
You may run spine with verbosity=3 to learn which hosts are responding slowly. This way, you may get an idea of where the bottleneck resides.
R.
Re: Problem after system update in large network (500 hosts)
I'm willing to bet that it is hosts using ss_host_disk. I had the same problem and worked around it by deleting all of the data sources polling (hmib Available Disk). It's not a solution that is an option for everyone, but if you want to get the box back stable fast and can do without the disk statistics, then go to Data Sources, filter by the hmib Available Disk template, and delete them. There is some sort of regression with ss_host_disk or the script server function that causes this. I worked on this for a few hours Friday night with thewitness, but I have yet to find a solution other than not gathering as many disk statistics using the script server.
I can get 8 sec polling times without all the Windows disk stats.
05/21/2012 09:03:10 AM - SYSTEM STATS: Time:8.9264 Method:spine Processes:2 Threads:15 Hosts:948 HostsPerProcess:474 DataSources:35865 RRDsProcessed:0
But after adding a few hosts' disk stats:
05/21/2012 01:19:24 PM - SYSTEM STATS: Time:22.8256 Method:spine Processes:2 Threads:15 Hosts:948 HostsPerProcess:474 DataSources:35982 RRDsProcessed:0
Re: Problem after system update in large network (500 hosts)
Another thing that is different from what I am used to seeing with Cacti is that when I have multiple down hosts, the polling time takes a lot longer as well. For instance, if I have 6 hosts down at any one time, my polling time will jump from 8 sec to 25-30 sec. I can check which hosts are the culprits by using spine and sorting the timed results.
# spine -R -S -V 3 > spine.out
# grep Total spine.out | sort -k 5
Host[524] TH[1] Total Time: 18 Seconds
Host[798] TH[1] Total Time: 18 Seconds
Host[799] TH[1] Total Time: 18 Seconds
Host[800] TH[1] Total Time: 18 Seconds
Host[801] TH[1] Total Time: 18 Seconds
Host[447] TH[1] Total Time: 12 Seconds
Host[730] TH[1] Total Time: 5 Seconds
Host[1112] TH[1] Total Time: 1 Seconds
Host[1169] TH[1] Total Time: 1 Seconds
Host[1047] TH[1] Total Time: 0.1 Seconds
Host[1069] TH[1] Total Time: 0.1 Seconds
Host[1256] TH[1] Total Time: 0.1 Seconds
Then I can confirm that the top 6 hosts are actually down but are holding up the polling process for a significant amount of time. You can imagine that if for some reason there were 30+ down devices, the poller time would overrun and cause issues.
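For what it's worth, a numeric reverse sort on the time field puts the slowest hosts first (assuming GNU sort; field 5 is the seconds value in spine's Total Time lines):
# list the ten slowest hosts from the spine output captured above
grep "Total Time" spine.out | sort -k5,5 -rn | head -n 10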
- gandalf
- Developer
- Posts: 22383
- Joined: Thu Dec 02, 2004 2:46 am
- Location: Muenster, Germany
- Contact:
Re: Problem after system update in large network (500 hosts)
MrRat wrote:Another thing that is different from what I am used to seeing with Cacti is that when I have multiple down hosts, the polling time takes a lot longer as well. For instance, if I have 6 hosts down at any one time, my polling time will jump from 8 sec to 25-30 sec. I can check which hosts are the culprits by using spine and sorting the timed results.
# spine -R -S -V 3 > spine.out
Hmm, downed host detection should take care of that. Which method are you using? Which timeouts and which retries?
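If it helps, you can also pull the per-device settings straight from the database; a rough sketch assuming the stock Cacti 0.8.x host table and a database called cacti:
# show down devices with their availability method, timeout and retries
# (status = 1 should be "down" in 0.8.x; adjust the user/db names to your install)
mysql -u cactiuser -p cacti -e "SELECT id, description, availability_method, ping_method, ping_timeout, ping_retries FROM host WHERE status = 1;"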
R.
- gandalf
- Developer
- Posts: 22383
- Joined: Thu Dec 02, 2004 2:46 am
- Location: Muenster, Germany
- Contact:
Re: Problem after system update in large network (500 hosts)
MrRat wrote:I'm willing to bet that it is hosts using ss_host_disk. I had the same problem and worked around it by deleting all of the data sources polling (hmib Available Disk). It's not a solution that is an option for everyone, but if you want to get the box back stable fast and can do without the disk statistics, then go to Data Sources, filter by the hmib Available Disk template, and delete them. There is some sort of regression with ss_host_disk or the script server function that causes this. I worked on this for a few hours Friday night with thewitness, but I have yet to find a solution other than not gathering as many disk statistics using the script server.
I can get 8 sec polling times without all the Windows disk stats.
05/21/2012 09:03:10 AM - SYSTEM STATS: Time:8.9264 Method:spine Processes:2 Threads:15 Hosts:948 HostsPerProcess:474 DataSources:35865 RRDsProcessed:0
But after adding a few hosts' disk stats:
05/21/2012 01:19:24 PM - SYSTEM STATS: Time:22.8256 Method:spine Processes:2 Threads:15 Hosts:948 HostsPerProcess:474 DataSources:35982 RRDsProcessed:0
Windows disk via SNMP is baaaad. But I want to recommend my pure SNMP replacement for the disk space stuff. See the 4th link of my sig. It's way faster compared to the script server.
R.
Re: Problem after system update in large network (500 hosts)
I see this on devices that use (ping or SNMP) for sure; I'll have to confirm that it also happens for devices using (ping and SNMP).
I saw it on a single down device just now.
05/23/2012 01:24:32 PM - SYSTEM HMIB STATS: time:20.2100 processes:50 hosts:318
05/23/2012 01:24:17 PM - SYSTEM SYSLOG STATS:Time:6.13 Deletes:0 Incoming:291 Removes:0 XFers:10439 Alerts:0 Alarms:0 Reports:0
05/23/2012 01:24:11 PM - SYSTEM THOLD STATS: Time:0.0085 Tholds:0 TotalHosts:943 DownHosts:1 NewDownHosts:0
05/23/2012 01:24:11 PM - SYSTEM STATS: Time:9.5638 Method:spine Processes:2 Threads:15 Hosts:944 HostsPerProcess:472 DataSources:39417 RRDsProcessed:0
05/23/2012 12:57:57 PM - SYSTEM HMIB STATS: time:20.5700 processes:50 hosts:317
05/23/2012 12:57:51 PM - SYSTEM SYSLOG STATS:Time:14.1 Deletes:0 Incoming:531 Removes:0 XFers:20786 Alerts:0 Alarms:0 Reports:0
05/23/2012 12:57:37 PM - SYSTEM THOLD STATS: Time:13.7606 Tholds:0 TotalHosts:943 DownHosts:1 NewDownHosts:1
05/23/2012 12:57:23 PM - SYSTEM STATS: Time:21.3651 Method:spine Processes:2 Threads:15 Hosts:944 HostsPerProcess:472 DataSources:39409 RRDsProcessed:0
05/23/2012 12:57:21 PM - SPINE: Poller[0] Host[500] Hostname[10.9.75.44] ERROR: HOST EVENT: Host is DOWN Message: Host did not respond to SNMP, ICMP: Ping timed out
05/23/2012 12:57:03 PM - SYSTEM BOOST STATS: Time:112.6200 RRDUpdates:1039088
05/23/2012 12:45:32 PM - SYSTEM HMIB STATS: time:20.5000 processes:50 hosts:318
05/23/2012 12:45:23 PM - SYSTEM SYSLOG STATS:Time:12.04 Deletes:0 Incoming:452 Removes:0 XFers:17050 Alerts:0 Alarms:0 Reports:0
05/23/2012 12:45:11 PM - SYSTEM THOLD STATS: Time:0.0084 Tholds:0 TotalHosts:943 DownHosts:1 NewDownHosts:0
05/23/2012 12:45:11 PM - SYSTEM STATS: Time:9.7753 Method:spine Processes:2 Threads:15 Hosts:944 HostsPerProcess:472 DataSources:39410 RRDsProcessed:0