Original topic, which I enjoyed a lot, as it taught me a few things:
viewtopic.php?t=54036&start=30
Current status:
Code: Select all
root@cacti-2022-loaded:/var/www/html/cli# time /usr/bin/spine -C /var/www/html/spine.conf --poller 1 --first 878 --last 4327 --mibs
SPINE: Using spine config file [/var/www/html/spine.conf]
Version 1.2.21 starting
2022-06-14 00:01:45 - SPINE: Poller[1] PID[2086310] PT[139845542202240] ERROR: Device[2057] HT[2] polling timed out while acquiring Available Thread Lock
2022-06-14 00:01:46 - SPINE: Poller[1] PID[2086310] PT[139845542202240] ERROR: Device[1886] HT[2] polling timed out while acquiring Available Thread Lock
2022-06-14 00:01:52 - SPINE: Poller[1] PID[2086310] PT[139845542202240] ERROR: Device[2396] HT[2] polling timed out while acquiring Available Thread Lock
2022-06-14 00:01:57 - SPINE: Poller[1] PID[2086310] PT[139845542202240] ERROR: Device[1285] HT[2] polling timed out while acquiring Available Thread Lock
2022-06-14 00:02:06 - SPINE: Poller[1] PID[2086310] PT[139845542202240] ERROR: Device[2327] HT[1] polling timed out while acquiring Available Thread Lock
2022-06-14 00:02:09 - SPINE: Poller[1] PID[2086310] PT[139845542202240] ERROR: Device[3360] HT[2] polling timed out while acquiring Available Thread Lock
2022-06-14 00:02:10 - SPINE: Poller[1] PID[2086310] PT[139845542202240] ERROR: Device[4096] HT[2] polling timed out while acquiring Available Thread Lock
FATAL: Spine Encountered a Segmentation Fault
Generating backtrace...0 line(s)...
real 1m41.842s
user 0m2.511s
sys 0m3.193s
root@cacti-2022-loaded:/var/www/html/cli#
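Side note: since spine's own backtrace comes out empty, re-running the same command under gdb might give a usable trace of that segfault (a sketch only; ideally spine would be rebuilt with debug symbols first):
Code: Select all
# run the exact same spine invocation under gdb and wait for the SIGSEGV
gdb --args /usr/bin/spine -C /var/www/html/spine.conf --poller 1 --first 878 --last 4327 --mibs
(gdb) run
(gdb) bt full                 # full backtrace of the crashing thread
(gdb) thread apply all bt     # backtraces of all spine threads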
How to reproduce:
* use 5-minute polling (1-minute polling makes it even easier)
* add approximately 3k hosts or more
* make sure they are slow to respond, e.g. the first 250 devices respond in roughly 10,000 milliseconds on average
* make sure about 250 of the hosts are down (a rough emulation sketch follows right below)
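To emulate the slow and down devices in a lab, something along these lines should do. This is only a sketch (the interface name and subnet are made up); tc/netem runs on the box hosting the emulated snmpd instances, iptables on the Cacti server:
Code: Select all
# on the server hosting the emulated snmpd instances:
# delay all outgoing packets (i.e. the SNMP replies) by ~10 seconds
tc qdisc add dev eth0 root netem delay 10000ms

# on the Cacti server: make one /24 worth of emulated devices look "down"
# by silently dropping outgoing SNMP requests to them
iptables -A OUTPUT -p udp -d 192.0.2.0/24 --dport 161 -j DROP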
Results are, uhm, well, catastrophic:
* the user cannot tell how many devices have not been graphed
* the user cannot alter many devices in bulk (there is no bulk management at this scale, so you cannot e.g. reduce the SNMP timeout and skip the devices responding slower than 3,000 milliseconds; see the SQL sketch right after this list)
* there is no simple exit from this situation other than disabling the hosts that are down, and even that does not always help
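The only bulk workaround I can think of is going straight to the database, which is exactly the kind of thing the GUI should be doing for me. A rough sketch against the Cacti 1.2 schema; the column names (snmp_timeout, avg_time, both in milliseconds) are from memory, so verify them on your install and back up the database first:
Code: Select all
# back up the host table before touching it
mysqldump cacti host > /root/host_table_backup.sql

# drop the SNMP timeout to 3000 ms for enabled hosts that respond
# slower than 3000 ms on average
mysql cacti -e "UPDATE host SET snmp_timeout = 3000 WHERE disabled = '' AND avg_time > 3000;"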
Other remarks and observations:
* the configured number of spine threads does not really matter; at this scale of 3k devices the Cacti system becomes unusable.
* in comparison, I maxed out the Cacti server's resources by emulating approximately 30k devices on an old, otherwise empty server with 2x Xeons. Spine EASILY sucks in roughly 25k devices * 256 interfaces, in/out bits, every 5 minutes (well, the NVMe disk burns at 30k IOPS at peak times). So I can easily get 20x more devices, but only because that setup is very artificial and DC-oriented, with no surprising delays introduced. I think I could push it further; I would just need to split snmpd across a few more servers, which should not be a big deal. However, that is not the point; the point is testing a real-life use case, and I will leave that for another winter evening.
How to work around this issue:
* disable all hosts which are down; this mostly helps spine complete its cycle. Of course it only works one way, as enabling some of those devices again would kill the installation (see the bulk-disable sketch right after this list)
* sometimes shortening the SNMP timeout helps (I rely on SNMP only; I cannot say much about the UDP or ICMP availability checks, as those are mostly blocked in my environment). But that does not really help either, because I genuinely need my, let's say, 5-second timeout for polling.
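For completeness, this is roughly what I mean by disabling all down hosts in bulk; again straight SQL against the host table (status = 1 should mean "down" in Cacti 1.2, but verify on your version and back up first):
Code: Select all
# remember which hosts we are about to disable, so they can be re-enabled later
mysql -N cacti -e "SELECT id FROM host WHERE status = 1 AND disabled = '';" > /tmp/disabled_down_hosts.txt

# disable every host currently marked as down
mysql cacti -e "UPDATE host SET disabled = 'on' WHERE status = 1 AND disabled = '';"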
My ideas for how to resolve this, in order of preference:
* introduce automated host disabling based on status. I'd say I need a rule: if a host is down for more than 5 poller cycles, disable the host with a DESC (a rough cron sketch of this logic follows after this list)
** but I also need the inverse rule: re-enable a host if it has been disabled with that DESC for longer than 10 poller cycles.
* introduce a separate poller that checks hosts before they are polled. This could essentially be an almost separate spine process with a different configuration. Such a process could mark hosts as poll / do-not-poll, so the main poller knows which hosts to skip before the run even starts. This should be a toggleable option in the Cacti configuration, named "Relaxed down host processing" or similar.
* introduce overall limits and control over the maximum number of processes and threads spawned for an installation (if that is not already bounded by number of processes * number of threads)
* increase the possible number of threads for spine to >100 and try to brute-force past the problem (how that would work and what the side effects would be, I have no idea)
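To illustrate the first idea, here is a very rough cron sketch of the disable/re-enable rule I have in mind (run once per poller cycle). It assumes host.status = 1 means down and that status_event_count counts consecutive down cycles, and it tracks the disable time in a small state file instead of the DESC marker; treat all of that as assumptions about the schema, not a ready recipe:
Code: Select all
#!/bin/bash
# Sketch: disable hosts down for >5 cycles, re-enable them 10 cycles later.
STATE=/var/lib/cacti/auto_disabled    # one "host_id disable_epoch" pair per line
CYCLE=300                             # poller cycle length in seconds
NOW=$(date +%s)
mkdir -p "$(dirname "$STATE")"; touch "$STATE"

# 1) disable hosts that have been down for more than 5 consecutive cycles
for id in $(mysql -N cacti -e "SELECT id FROM host WHERE disabled = '' AND status = 1 AND status_event_count > 5;"); do
    mysql cacti -e "UPDATE host SET disabled = 'on' WHERE id = $id;"
    echo "$id $NOW" >> "$STATE"
done

# 2) re-enable hosts that we disabled more than 10 cycles ago
: > "$STATE.new"
while read -r id ts; do
    if [ $((NOW - ts)) -gt $((10 * CYCLE)) ]; then
        mysql cacti -e "UPDATE host SET disabled = '' WHERE id = $id;"
    else
        echo "$id $ts" >> "$STATE.new"
    fi
done < "$STATE"
mv "$STATE.new" "$STATE"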
This is why I say Cacti is not able to cope even at this comparatively small scale under real-life conditions. Prove me wrong.