2 Devices: "Polling timed out while waiting for 4 Threads to End" - Little Help on Poller Statistics?
Afternoon,
My history of hiccups and issues aside, my poller stats were regularly down around 6 seconds.
I logged in today to check on things and noticed that my poller sessions were regularly timing out. However, the devices that seem to be causing the timeouts are odd candidates: one is a 10-OID device on my LAN that responds to an SNMP test in under 0.2s.
The other is an ICMP ping-only device, with no SNMP in use.
The intropage (which is new, and could be wrong) shows only one device > 1s, rather than the two that are timing out.
The log itself looks similar. Everything seems to resolve in ~3 seconds, then I get alerts about my LAN device and an ICMP ping device.
Any suggestions where I should be looking to solve this?
Re: 2 Devices: "Polling timed out while waiting for 4 Threads to End" - Little Help on Poller Statistics?
DNS problem? The latest intropage (develop branch) has a test for DNS resolving.
It seems that you have 6 spine processes. It is better to use fewer processes and add threads (Console -> Data Collection -> Data Collectors).
Let the Cacti grow!
Re: 2 Devices: "Polling timed out while waiting for 4 Threads to End" - Little Help on Poller Statistics?
Not DNS; it's an IP only for ping, see below.
I can reduce that process count, but it's been 6 for quite some time w/o having this issue.
Why does it suddenly need to be reduced?
Re: 2 Devices: "Polling timed out while waiting for 4 Threads to End" - Little Help on Poller Statistics?
I reduced the count to 2, and it made no difference.
Now it says it's waiting on more threads, and different devices are reported as timing out.
Re: 2 Devices: "Polling timed out while waiting for 4 Threads to End" - Little Help on Poller Statistics?
I also tried going the other direction and ridiculously cranked up the processes/threads. No change:
Re: 2 Devices: "Polling timed out while waiting for 4 Threads to End" - Little Help on Poller Statistics?
Try increasing the log level (Configuration -> Settings -> General tab) and check the Cacti log.
Let the Cacti grow!
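Once the log level is raised, the timeout lines can be pulled out of the log from the shell. This is only a minimal sketch: the log path below is an assumption (adjust it to your install), and the search string is simply the message from the thread title.

```shell
#!/bin/sh
# Assumption: default-style log location; override CACTI_LOG for your install.
CACTI_LOG="${CACTI_LOG:-/var/www/html/cacti/log/cacti.log}"

# Print the most recent lines containing the timeout message.
# $1: log file, $2: how many lines to show (default 20)
recent_timeouts() {
    grep -F "Polling timed out" "$1" | tail -n "${2:-20}"
}

# Count how many timeout lines the log currently holds.
count_timeouts() {
    grep -cF "Polling timed out" "$1"
}

if [ -r "$CACTI_LOG" ]; then
    echo "Timeout lines in $CACTI_LOG: $(count_timeouts "$CACTI_LOG")"
    recent_timeouts "$CACTI_LOG" 20
else
    echo "Cannot read $CACTI_LOG; set CACTI_LOG to your log file" >&2
fi
```

Comparing the timestamps of these lines with the poller statistics lines around them should show whether the same devices appear on every cycle.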
Re: 2 Devices: "Polling timed out while waiting for 4 Threads to End" - Little Help on Poller Statistics?
Good idea; here's what I got out of that:
A couple of devices fill the log. The warnings are the precursor to the red "Timed Out" message. However, one of the devices is an "ICMP Ping Only" device.
I enabled device debugging and demonstrated it as well; I don't see any obvious issues here?
Re: 2 Devices: "Polling timed out while waiting for 4 Threads to End" - Little Help on Poller Statistics?
Try using the cmd.php poller for a few poller cycles (Configuration -> Settings -> Poller -> Poller Type).
Which Cacti and spine versions?
You can try:
chmod +s /path/to/spine
Utilities -> System Utilities -> Rebuild Poller Cache
For SNMP devices, change Bulk Walk Maximum Repetitions to "Auto Detect/Set on first Re-Index" and do a re-index.
Let the Cacti grow!
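The cache rebuild and re-index can also be run from the shell instead of the web UI. A rough sketch, assuming Cacti lives under /var/www/html/cacti (a guess, override CACTI_DIR) and that the stock cli/rebuild_poller_cache.php and cli/poller_reindex_hosts.php scripts are present; check each script's --help for the options your version supports.

```shell
#!/bin/sh
# Assumption: install path; override CACTI_DIR for your system.
CACTI_DIR="${CACTI_DIR:-/var/www/html/cacti}"
REBUILD="$CACTI_DIR/cli/rebuild_poller_cache.php"
REINDEX="$CACTI_DIR/cli/poller_reindex_hosts.php"

if [ -f "$REBUILD" ] && [ -f "$REINDEX" ]; then
    # Rebuild the poller cache for all devices.
    php "$REBUILD"
    # Re-index every device (--id=all as I understand the script's usage).
    php "$REINDEX" --id=all
else
    echo "Cacti CLI scripts not found under $CACTI_DIR" >&2
fi
```

Run it as the user the poller runs as, so file ownership in the log and rra directories is not changed.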
Re: 2 Devices: "Polling timed out while waiting for 4 Threads to End" - Little Help on Poller Statistics?
Same issue:
macan wrote: Which cacti and spine versions?
1.2.25 across the board.
macan wrote: You can try:
chmod +s /path/to/spine
This is done as part of the process of building spine.
macan wrote: Utilities-> system utilities -> rebuild poller cache
for snmp devices change Bulk Walk Maximum Repetitions to auto detect/set on first re-index and do reindex
I have already performed the first, and the second is the default; I have only changed it on one device as part of troubleshooting this issue.
Re: 2 Devices: "Polling timed out while waiting for 4 Threads to End" - Little Help on Poller Statistics?
Any network issues? Server restart? OS/package update? I ran into a similar thing after a PHP upgrade.
Let the Cacti grow!
Re: 2 Devices: "Polling timed out while waiting for 4 Threads to End" - Little Help on Poller Statistics?
No network issues. This is a large infrastructure and I've used Cacti for more than a decade.
If you can tell me how to troubleshoot the error more, I'm happy to do so.
At the moment all I see is that it says the poller limit is being overrun. If I debug the device specifically, it finishes polling in under 5 seconds, always. So I can't even find what the poller is complaining about.
Let's narrow down to one device:
This is an ICMP device. No SNMP polling occurs. The ping is always faster than 1 second, and the only script it runs for graphing is also a ping. So why does this device constantly appear in the log as having timed out?
Here's the timeout error in the log:
So, I went and performed a device specific debug on it here, which shows no issues:
So how do I troubleshoot this?
- TheWitness
- Developer
Re: 2 Devices: "Polling timed out while waiting for 4 Threads to End" - Little Help on Poller Statistics?
Advanced Ping has a lot of scalability issues when you have so many hosts. The other thing is how long it takes spine to initialize with so many threads. What is your script server count? How many script server calls do you use?
One option would be to reduce the ping count. I think the default in the script is 20 samples; you might want to move it to 10, for example. That'll speed things up.
I almost think that we need to move Advanced Ping to its own plugin, as the polling times make it really difficult to complete in the 5 minute cycle. For me, it's an easy thing; just making the time to do it is the issue.
To answer some of those questions: years ago, I wrote a 'showproc' script. It's been refined over the years to do other things. Below is a copy. Root just needs super access. Put it in root's home directory and mark it executable.
#!/bin/sh
# showproc: refresh a summary of database and poller activity every 2 seconds.
while true; do
    clear
    echo "---------------------------------------------------------"
    printf "Uptime:"
    uptime
    # Collect metrics
    SLEEPING=`mysql -e "show processlist" | grep Sleep | wc -l`
    RUNNING=`mysql -e "show processlist" | grep -v Sleep | wc -l`
    TOTAL=`expr $SLEEPING + $RUNNING`
    ITEMS=`mysql -e "SELECT TABLE_ROWS FROM information_schema.TABLES WHERE TABLE_NAME='poller_output'" | grep -v TABLE_ROWS`
    PROCS=`mysql -e "SELECT count(*) FROM processes WHERE tasktype='boost'" cacti | grep -v count`
    SPINE=`mysql -e "SELECT count(*) FROM poller_time WHERE end_time='0000-00-00'" cacti | grep -v count`
    HISTORY=`mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_history_list_length'" | grep -v Value | sed -e 's/Innodb_history_list_length//g' | tr -d " \t"`
    # Echo status
    echo "---------------------------------------------------------"
    # hostname(1) is used because $HOSTNAME is not set by every /bin/sh
    echo "Host: `hostname`"
    echo "Pending Poller Items: $ITEMS"
    echo "---------------------------------------------------------"
    echo "Running Processes: $RUNNING"
    echo "Sleeping Processes: $SLEEPING"
    echo "---------------------------------------------------------"
    echo "Total Processes: $TOTAL"
    echo "History: $HISTORY"
    echo "Spine Processes: $SPINE"
    # Quoted so the test still works when the query returns nothing
    if [ -z "$PROCS" ]; then
        echo "Boost Processes: 0"
    else
        echo "Boost Processes: $PROCS"
    fi
    echo ""
    echo "Memory Topology"
    echo "---------------------------------------------------------"
    free -g
    echo ""
    #echo "Poller Stats"
    #ps -ef | egrep "(php|spine|poller.php)" | grep -v "script_server" | grep -v grep
    #echo "---------------------------------------------------------"
    #echo "Comm Stats"
    #echo "---------------------------------------------------------"
    #mysql -e "SHOW GLOBAL STATUS" | egrep "(Com_update|Com_select|Com_insert|Com_set_option|Com_delete|wsrep_cluster_size|Aborted_clients|Uptime)" | awk '{printf("%-25s %10i\n", $1, $2)}'
    echo "MariaDB Processes"
    mysql -e "SELECT id, user, state, ROUND(time_ms/1000,1) AS time, SUBSTRING(REPLACE(REPLACE(REPLACE(info, '\n', ' '), ' ', ' '), '\t', ' '),1,245) AS info FROM processlist WHERE info NOT LIKE '%processlist%' ORDER BY time DESC LIMIT 35" information_schema
    running=`mysql -e 'SELECT COUNT(*) AS running FROM cacti.poller_time WHERE end_time="0000-00-00"' | grep -v running`
    if [ "$running" -ne "0" ]; then
        echo ""
        echo "Running Collector Threads"
        mysql -e "SELECT * FROM poller_time WHERE end_time='0000-00-00'" cacti
        echo ""
        echo "Last Hosts Inserted"
        # Depends on a last_updated column in poller_output_boost; check your schema if MariaDB reports an unknown column.
        mysql -e "SELECT h.id, h.description, b.last_updated, COUNT(b.local_data_id) AS data_sources FROM poller_output_boost AS b INNER JOIN data_local AS dl ON b.local_data_id = dl.id INNER JOIN host AS h ON h.id=dl.host_id WHERE b.last_updated = (SELECT max(last_updated) FROM poller_output_boost) GROUP BY h.id ORDER BY b.last_updated DESC, data_sources DESC LIMIT 5" cacti
    else
        echo ""
        echo "Poller Not running"
    fi
    sleep 2
done
True understanding begins only when we realize how little we truly understand...
Life is an adventure, let yours begin with Cacti!
Author of dozens of Cacti plugins and customization's. Advocate of LAMP, MariaDB, IBM Spectrum LSF and the world of batch. Creator of IBM Spectrum RTM, author of quite a bit of unpublished work and most of Cacti's bugs.
_________________
Official Cacti Documentation
GitHub Repository with Supported Plugins
Percona Device Packages (no support)
Interesting Device Packages
For those wondering, I'm still here, but lost in the shadows. Yearning for less bugs. Who want's a Cacti 1.3/2.0? Streams anyone?
Re: 2 Devices: "Polling timed out while waiting for 4 Threads to End" - Little Help on Poller Statistics?
TheWitness wrote (Mon Oct 30, 2023 8:10 am): Advanced Ping has a lot of scalability issues when you have so many hosts. The other thing is how long it takes spine to initialize with so many threads. What is your script server count? How many script server calls do you use?
I had mine set at 5, and it hasn't really changed.
TheWitness wrote: One option would be to reduce the ping count. I think the default in the script is 20 samples; you might want to move it to 10, for example. That'll speed things up.
I'll test this, but the question of note is why did it change? When we were doing DB troubleshooting a month ago (two?) it was wrapping up all hosts in 6-7 seconds, with Advanced Ping.
TheWitness wrote: I almost think that we need to move Advanced Ping to its own plugin, as the polling times make it really difficult to complete in the 5 minute cycle. For me, it's an easy thing; just making the time to do it is the issue.
I totally understand the free time issue, and I have been liking that features added over the years are being moved into managed, separate code bases.
TheWitness wrote: To answer some of those questions: years ago, I wrote a 'showproc' script. It's been refined over the years to do other things. Below is a copy. Root just needs super access. Put it in root's home directory and mark it executable.
Thanks for the block of code. I'll get some time myself and jump into cacti and see what it reveals.
Thanks again!
Re: 2 Devices: "Polling timed out while waiting for 4 Threads to End" - Little Help on Poller Statistics?
While running the shell script, it seems to constantly throw an "Unknown Column" error on b.last_updated.
I recorded a gif, but once a minute I get a single refresh that has some poller data in it, so it's a lot of nothingness. My values all seem pretty trivial:
However, the rest of the minute, I see 0 pending items but two long-running collector threads:
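One way to chase the "Unknown Column" error is to list the columns that poller_output_boost actually has on this install and adjust the "Last Hosts Inserted" query to match. A minimal sketch, assuming the database is named cacti and the mysql client picks up credentials the same way the showproc script does; which column holds the timestamp on a given version is something to verify from the output, not something this snippet knows.

```shell
#!/bin/sh
# Assumption: database name; override DB if yours differs.
DB="${DB:-cacti}"
COLUMNS_SQL="SHOW COLUMNS FROM poller_output_boost"

if command -v mysql >/dev/null 2>&1; then
    # Lists column names and types so the timestamp column can be identified.
    mysql -e "$COLUMNS_SQL" "$DB"
else
    echo "mysql client not found in PATH" >&2
fi
```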
Re: 2 Devices: "Polling timed out while waiting for 4 Threads to End" - Little Help on Poller Statistics?
For those two devices, try testing spine at the command line using the -H host_id option.
Before history, there was a paradise, now dust.
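A rough sketch of that test, timing each host separately. The spine path and the host ids 12 and 34 are placeholders, not values from this thread. The -R (read-only), -S (log to stdout) and -V (verbosity) options are taken from spine's usage text as I remember it, so confirm them with spine --help first.

```shell
#!/bin/sh
# Assumption: spine install path; override SPINE_BIN for your system.
SPINE_BIN="${SPINE_BIN:-/usr/local/spine/bin/spine}"

# Build the command line for polling a single host id without writing to the DB.
spine_cmd() {
    echo "$SPINE_BIN -R -S -V 3 -H $1"
}

# Placeholder host ids; replace with the ids of the two devices that time out.
for host_id in 12 34; do
    cmd=$(spine_cmd "$host_id")
    echo "== $cmd"
    if [ -x "$SPINE_BIN" ]; then
        start=$(date +%s)
        $cmd
        end=$(date +%s)
        echo "host $host_id took $((end - start))s"
    else
        echo "spine not found at $SPINE_BIN; skipping"
    fi
done
```

If a single-host run finishes in a second or two while the full poller run still reports the host as timed out, the delay is more likely in thread or script server contention than in the device itself.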