Code: Select all
[root@netinfo01 log]# uname -a
Linux netinfo01.ns.XXX.YYY 2.6.32-220.4.1.el6.x86_64 #1 SMP Thu Jan 19 14:50:54 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
One interesting thing is that some graphs stopped between 23:00 on Wednesday and 00:00 on Thursday, while others stopped at around 05:00 on Wednesday. Both of these would have been at times when I was not actively working on the machine.
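To pin down exactly when each graph stopped updating, I can check the RRD files directly rather than eyeballing the graphs. A minimal sketch, assuming the default rra directory under my cacti path and a placeholder filename:

Code: Select all
# most recently written RRDs sort to the top
ls -lt /var/www/html/stats/rra | head
# epoch timestamp of the last update for one data source
rrdtool last /var/www/html/stats/rra/some_datasource_123.rrd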
I've also noticed the following entries in the logs:
Code: Select all
05/25/2012 05:15:01 PM - SPINE: Poller[0] FATAL: Spine Encountered a Segmentation Fault (Spine thread)
05/25/2012 05:15:01 PM - SPINE: Poller[0] ERROR: The System Lacked the Resources to Create a Thread
05/25/2012 05:15:01 PM - SPINE: Poller[0] ERROR: The System Lacked the Resources to Create a Thread
05/25/2012 05:15:01 PM - SPINE: Poller[0] ERROR: The System Lacked the Resources to Create a Thread
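My reading of "The System Lacked the Resources to Create a Thread" is that pthread_create() is failing inside spine, which would point at either a per-user process/thread cap or memory exhaustion. A rough sketch of the checks I'm running to see which limit is in play (EL6 also ships a default nproc cap in /etc/security/limits.d/90-nproc.conf that I want to rule out):

Code: Select all
# per-user limits for the account spine/poller runs under
su - cactiuser -c 'ulimit -u -v'   # max user processes and virtual memory
# system-wide ceilings
cat /proc/sys/kernel/threads-max
cat /proc/sys/kernel/pid_max
# how close the box is to those limits right now
ps -eLf | wc -l                    # total threads on the system
ps -L -u cactiuser | wc -l         # threads owned by cactiuser

A top snapshot taken shortly after those errors: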
Code: Select all
top - 17:32:04 up 115 days, 2:25, 2 users, load average: 0.00, 0.00, 0.00
Tasks: 1107 total, 3 running, 1104 sleeping, 0 stopped, 0 zombie
Cpu(s): 31.0%us, 5.7%sy, 0.0%ni, 63.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 4057416k total, 3929348k used, 128068k free, 20292k buffers
Swap: 8388600k total, 1394616k used, 6993984k free, 158168k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23635 mysql 20 0 695m 24m 4592 S 23.7 0.6 22:06.31 mysqld
31438 cactiuse 20 0 162m 16m 6176 R 11.2 0.4 0:13.94 php
31558 root 20 0 15888 2068 972 R 0.3 0.1 0:00.12 top
1 root 20 0 19272 744 560 S 0.0 0.0 0:04.93 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
4 root 20 0 0 0 0 R 0.0 0.0 0:00.01 ksoftirqd/0
5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
6 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
7 root 20 0 0 0 0 S 0.0 0.0 0:00.40 events/0
8 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuset
9 root 20 0 0 0 0 S 0.0 0.0 0:00.11 khelper
10 root 20 0 0 0 0 S 0.0 0.0 0:00.00 netns
11 root 20 0 0 0 0 S 0.0 0.0 0:00.00 async/mgr
12 root 20 0 0 0 0 S 0.0 0.0 0:00.00 pm
13 root 20 0 0 0 0 S 0.0 0.0 0:00.00 sync_supers
14 root 20 0 0 0 0 S 0.0 0.0 0:00.00 bdi-default
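That snapshot shows 1107 tasks, almost no free memory, and roughly 1.3 GB of swap in use on a 4 GB box, so something appears to accumulate over time. A few follow-up checks to see where the processes and memory are going (the ps output formats are just what I find readable):

Code: Select all
# process count per user -- who owns most of those 1107 tasks
ps -eo user= | sort | uniq -c | sort -rn | head
# biggest memory consumers
ps aux --sort=-%mem | head -15
# leftover php/spine processes that never exited
ps -u cactiuser -o pid,stat,etime,args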
An strace of the running poller.php process shows it waiting for something, until it gets killed by the next 5-minute poller run.
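For reference, this is roughly how I'm attaching to the stuck poller (the PID below is just an example):

Code: Select all
# find the poller PID without matching the grep itself
ps -u cactiuser -o pid,etime,args | grep '[p]oller.php'
# attach, follow forked children, and log syscalls to a file
strace -f -tt -p 31438 -o /tmp/poller.strace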
I activated the 'domains' and 'spikekill' plugins during the day on Wednesday, but I've deactivated both of them since, to eliminate them as variables while I work on this larger problem.
poller.php is running as 'cactiuser', and cactiuser owns all of the files in the cacti directory structure (/var/www/html/stats).
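To double-check that, the following should list anything under the tree that is not owned by cactiuser:

Code: Select all
find /var/www/html/stats ! -user cactiuser -ls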
So... at this point, I'm just trying to get a handle on what's happening, and what I can do to fix it / keep it from happening again.