Original topic, which I enjoyed a lot, as it taught me a few things:
viewtopic.php?t=54036&start=30
Current status:
Code: Select all
root@cacti-2022-loaded:/var/www/html/cli# time /usr/bin/spine -C /var/www/html/spine.conf --poller 1 --first 878 --last 4327 --mibs
SPINE: Using spine config file [/var/www/html/spine.conf]
Version 1.2.21 starting
2022-06-14 00:01:45 - SPINE: Poller[1] PID[2086310] PT[139845542202240] ERROR: Device[2057] HT[2] polling timed out while acquiring Available Thread Lock
2022-06-14 00:01:46 - SPINE: Poller[1] PID[2086310] PT[139845542202240] ERROR: Device[1886] HT[2] polling timed out while acquiring Available Thread Lock
2022-06-14 00:01:52 - SPINE: Poller[1] PID[2086310] PT[139845542202240] ERROR: Device[2396] HT[2] polling timed out while acquiring Available Thread Lock
2022-06-14 00:01:57 - SPINE: Poller[1] PID[2086310] PT[139845542202240] ERROR: Device[1285] HT[2] polling timed out while acquiring Available Thread Lock
2022-06-14 00:02:06 - SPINE: Poller[1] PID[2086310] PT[139845542202240] ERROR: Device[2327] HT[1] polling timed out while acquiring Available Thread Lock
2022-06-14 00:02:09 - SPINE: Poller[1] PID[2086310] PT[139845542202240] ERROR: Device[3360] HT[2] polling timed out while acquiring Available Thread Lock
2022-06-14 00:02:10 - SPINE: Poller[1] PID[2086310] PT[139845542202240] ERROR: Device[4096] HT[2] polling timed out while acquiring Available Thread Lock
FATAL: Spine Encountered a Segmentation Fault
Generating backtrace...0 line(s)...
real 1m41.842s
user 0m2.511s
sys 0m3.193s
root@cacti-2022-loaded:/var/www/html/cli#
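Side note: since spine's own backtrace comes out empty, re-running the same command under gdb might give a usable trace of that segfault (a sketch only; ideally spine would be rebuilt with debug symbols first):
Code: Select all
# run the exact same spine invocation under gdb and wait for the SIGSEGV
gdb --args /usr/bin/spine -C /var/www/html/spine.conf --poller 1 --first 878 --last 4327 --mibs
(gdb) run
(gdb) bt full                 # full backtrace of the crashing thread
(gdb) thread apply all bt     # backtraces of all spine threads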
How to reproduce:
* use 5-minute polling (1-minute polling makes it even easier)
* add approximately 3k hosts or more
* make sure they are slow to respond, e.g. the first 250 devices respond in roughly 10,000 milliseconds on average
* make sure about 250 of the hosts are down (a rough emulation sketch follows right below)
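To emulate the slow and down devices in a lab, something along these lines should do. This is only a sketch (the interface name and subnet are made up); tc/netem runs on the box hosting the emulated snmpd instances, iptables on the Cacti server:
Code: Select all
# on the server hosting the emulated snmpd instances:
# delay all outgoing packets (i.e. the SNMP replies) by ~10 seconds
tc qdisc add dev eth0 root netem delay 10000ms

# on the Cacti server: make one /24 worth of emulated devices look "down"
# by silently dropping outgoing SNMP requests to them
iptables -A OUTPUT -p udp -d 192.0.2.0/24 --dport 161 -j DROP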
Results are, uhm, well, catastrophic:
* the user cannot tell how many devices have not been graphed
* the user cannot alter many devices in bulk (there is no bulk management at this scale, so you cannot e.g. reduce the SNMP timeout and skip the devices responding slower than 3,000 milliseconds; see the SQL sketch right after this list)
* there is no simple exit from this situation other than disabling the hosts that are down, and even that does not always help
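The only bulk workaround I can think of is going straight to the database, which is exactly the kind of thing the GUI should be doing for me. A rough sketch against the Cacti 1.2 schema; the column names (snmp_timeout, avg_time, both in milliseconds) are from memory, so verify them on your install and back up the database first:
Code: Select all
# back up the host table before touching it
mysqldump cacti host > /root/host_table_backup.sql

# drop the SNMP timeout to 3000 ms for enabled hosts that respond
# slower than 3000 ms on average
mysql cacti -e "UPDATE host SET snmp_timeout = 3000 WHERE disabled = '' AND avg_time > 3000;"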
Other remarks and observations:
* the configured number of spine threads does not really matter; at this scale of 3k devices the Cacti system becomes unusable.
* in comparison, I maxed out the Cacti server's resources by emulating approximately 30k devices on an old, otherwise empty server with 2x Xeons. Spine EASILY sucks in roughly 25k devices * 256 interfaces, in/out bits, every 5 minutes (well, the NVMe disk burns at 30k IOPS at peak times). So I can easily get 20x more devices, but only because that setup is very artificial and DC-oriented, with no surprising delays introduced. I think I could push it further; I would just need to split snmpd across a few more servers, which should not be a big deal. However, that is not the point; the point is testing a real-life use case, and I will leave that for another winter evening.
How to work around this issue:
* disable all hosts which are down; this mostly helps spine complete its cycle. Of course it only works one way, as enabling some of those devices again would kill the installation (see the bulk-disable sketch right after this list)
* sometimes shortening the SNMP timeout helps (I rely on SNMP only; I cannot say much about the UDP or ICMP availability checks, as those are mostly blocked in my environment). But that does not really help either, because I genuinely need my, let's say, 5-second timeout for polling.
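For completeness, this is roughly what I mean by disabling all down hosts in bulk; again straight SQL against the host table (status = 1 should mean "down" in Cacti 1.2, but verify on your version and back up first):
Code: Select all
# remember which hosts we are about to disable, so they can be re-enabled later
mysql -N cacti -e "SELECT id FROM host WHERE status = 1 AND disabled = '';" > /tmp/disabled_down_hosts.txt

# disable every host currently marked as down
mysql cacti -e "UPDATE host SET disabled = 'on' WHERE status = 1 AND disabled = '';"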
My ideas for how to resolve this, in order of preference:
* introduce automated host disabling based on status. I'd say I need a rule: if a host is down for more than 5 poller cycles, disable the host with a DESC (a rough cron sketch of this logic follows after this list)
** but I also need the inverse rule: re-enable a host if it has been disabled with that DESC for longer than 10 poller cycles.
* introduce a separate poller that checks hosts before they are polled. This could essentially be an almost separate spine process with a different configuration. Such a process could mark hosts as poll / do-not-poll, so the main poller knows which hosts to skip before the run even starts. This should be a toggleable option in the Cacti configuration, named "Relaxed down host processing" or similar.
* introduce overall limits and control over the maximum number of processes and threads spawned for an installation (if that is not already bounded by number of processes * number of threads)
* increase the possible number of threads for spine to >100 and try to brute-force past the problem (how that would work and what the side effects would be, I have no idea)
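To illustrate the first idea, here is a very rough cron sketch of the disable/re-enable rule I have in mind (run once per poller cycle). It assumes host.status = 1 means down and that status_event_count counts consecutive down cycles, and it tracks the disable time in a small state file instead of the DESC marker; treat all of that as assumptions about the schema, not a ready recipe:
Code: Select all
#!/bin/bash
# Sketch: disable hosts down for >5 cycles, re-enable them 10 cycles later.
STATE=/var/lib/cacti/auto_disabled    # one "host_id disable_epoch" pair per line
CYCLE=300                             # poller cycle length in seconds
NOW=$(date +%s)
mkdir -p "$(dirname "$STATE")"; touch "$STATE"

# 1) disable hosts that have been down for more than 5 consecutive cycles
for id in $(mysql -N cacti -e "SELECT id FROM host WHERE disabled = '' AND status = 1 AND status_event_count > 5;"); do
    mysql cacti -e "UPDATE host SET disabled = 'on' WHERE id = $id;"
    echo "$id $NOW" >> "$STATE"
done

# 2) re-enable hosts that we disabled more than 10 cycles ago
: > "$STATE.new"
while read -r id ts; do
    if [ $((NOW - ts)) -gt $((10 * CYCLE)) ]; then
        mysql cacti -e "UPDATE host SET disabled = '' WHERE id = $id;"
    else
        echo "$id $ts" >> "$STATE.new"
    fi
done < "$STATE"
mv "$STATE.new" "$STATE"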
This is why I say Cacti is not able to cope even at this comparatively small scale under real-life conditions. Prove me wrong.