Graphs with Holes

knebb · Post by **knebb** » Sun Apr 03, 2016 4:44 pm

Hi all,

Does anyone have an idea what I should look for? For a single host my graphs have holes at more or less regular intervals (see attached pic).

I do not see any CPU spikes or IO waits on the monitored host, nor on the Cacti host. All other hosts are fine; the data is missing only for this single host.

This is cacti.log:

Code: Select all

04/03/2016 11:36:23 PM - CACTID: Poller[0] Host[175] PING Result: UDP: Host is Alive
04/03/2016 11:36:23 PM - CACTID: Poller[0] Host[175] SNMP Result: SNMP not performed due to setting or ping result
04/03/2016 11:36:23 PM - CACTID: Poller[0] DEBUG: MySQL Insert ID '159': 'update host set status='3', status_event_count='0', status_fail_date='2016-03-25 23:47:00', status_rec_date='2016-03-24 09:26:00', status_last_error='Host did not respond to SNMP, UDP: Ping timed out', min_time='0.194080', max_time='15003.350500', cur_time='0.231980', avg_time='5.386709', total_polls='4162', failed_polls='6', availability='99.8558' where id='175''
04/03/2016 11:36:23 PM - CACTID: Poller[0] DEBUG: MySQL Query ID '222': 'SELECT data_query_id, action, op, assert_value, arg1 FROM poller_reindex WHERE host_id=175'
04/03/2016 11:36:23 PM - CACTID: Poller[0] Host[175] RECACHE: Processing 2 items in the auto reindex cache for 'host.domain.com'
[...]
04/03/2016 11:36:24 PM - CACTID: Poller[0] Host[175] DS[3503] SNMP: v2: host.domain.com, dsname: cpu_idle, oid: .1.3.6.1.4.1.2021.11.53.0, value: 1608684
04/03/2016 11:36:24 PM - CACTID: Poller[0] Host[175] DS[3504] SNMP: v2: host.domain.com, dsname: cpu_interrupt, oid: .1.3.6.1.4.1.2021.11.56.0, value: 1
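
For readers decoding the `update host` statement in the log above: the `availability` value looks like the percentage of successful polls. That formula is an assumption and is not taken from Cacti's source, but a quick check in Python reproduces the logged figure from `total_polls` and `failed_polls`:

```python
# Values copied from the 'update host' statement in the log excerpt.
total_polls = 4162
failed_polls = 6

# Assumed formula: share of polls that did not fail, in percent.
availability = 100.0 * (total_polls - failed_polls) / total_polls

print(f"{availability:.4f}")  # prints 99.8558, matching availability='99.8558'
```

If that reading is right, the host status check failed only 6 times, so the holes in the graphs are probably not explained by the ping/SNMP availability check alone.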

Any hints on where to look?

Thanks!

/KNEBB

tgrtjake · Post by **tgrtjake** » Mon Apr 04, 2016 8:46 am

Maybe increase Ping Timeout Value and SNMP Timeout on the monitored host?

knebb · Post by **knebb** » Mon Apr 04, 2016 2:31 pm

tgrtjake wrote:Maybe increase Ping Timeout Value and SNMP Timeout on the monitored host?

I will do so. But if it works it does not explain why it worked fine before!

I will report if it helps.

knebb · Post by **knebb** » Sun Apr 10, 2016 11:24 am

Hi,

I want to give an update as promised.

I increased the timeout values by multiplying them by 10. It got better, but the holes did not go away. Worse, from Friday evening on I did not have ANY values in my graphs.

Since noon today everything is as smooth as before, even when I decrease the timeout values to the previous ones.

And now another piece of information for troubleshooting:

Since Sunday of last week my Internet connection had been very unreliable: it was available for 5-20 minutes and then broke down. Reconnecting took between 1 and 5 minutes. On Friday the connection broke down completely and I had no Internet for 2.5 days! Since noon today the Internet is back again, and stable.

If you compare both, you will notice that the Cacti graphs for this single host were broken while the Internet connection was unstable. There was no data at all while the Internet was gone. And all has been fine since the Internet is back again.

So it has something to do with Internet access.

I verified this morning snmpd was running on the target host and it replied when queried by snmpwalk from the Cacti host.

So obviously something wants a connection to the Internet... Is it Cacti querying for some unknown MIBs? Or is it the OS (Raspbian) on the target host? However, why does Cacti not record any values even though snmpwalk runs fine?

Seriously confused...

/KNEBB

micke2k · Post by **micke2k** » Mon Apr 11, 2016 2:21 am

Hi,

Most likely it's Cacti not being able to reach hosts that are queried through the Internet connection. If you have 10 retries configured, then all other hosts will fail as well, because it will spend all its time trying to reach a downed host and have no time left to poll the rest of the hosts. Keep retries to 1 or 2.

Do you have any advanced ping/smokeping enabled for Internet checks?

What is your poller interval? Can you show the SystemStats in the log during these errors?

knebb · Post by **knebb** » Mon Apr 11, 2016 8:57 am

Hi,

It looks like I did not state this clearly.

The Cacti host and the target host are on the same subnet! There is no Internet connection needed for both to see each other!

And my SNMP retries are set to 3 (now 2).

Well, the failing host was indeed the last one added. May I assume it will also be queried last? If so, that could be the explanation, even though all other hosts were running fine.

But this does not explain why it partially helped to increase the polling timeout...

I will increase the number of polling threads and cut off the Internet again; we will see what happens.

Greetings

/KNEBB

phalek · Post by **phalek** » Mon Apr 11, 2016 9:06 am

What is the overall polling time of your Cacti server?

Also, is this happening during every polling cycle:

Code: Select all

04/03/2016 11:36:23 PM - CACTID: Poller[0] Host[175] RECACHE: Processing 2 items in the auto reindex cache for 'host.domain.com'

knebb · Post by **knebb** » Mon Apr 11, 2016 9:15 am

Hi,

phalek wrote:What is the overall polling time of your Cacti server?

How do I find out?

Also, is this happening during every polling cycle :
Code: Select all
04/03/2016 11:36:23 PM - CACTID: Poller[0] Host[175] RECACHE: Processing 2 items in the auto reindex cache for 'host.domain.com'

What does this mean?

Thanks!

/KNEBB

phalek · Post by **phalek** » Mon Apr 11, 2016 9:25 am

Go to:

Code: Select all

Console -> System Utilities -> Technical Support

and check the "Last Run Statistics", e.g.:

Code: Select all

Last Run Statistics	Time:169.9096 Method:spine Processes:1 Threads:30 Hosts:146 HostsPerProcess:146 DataSources:3626 RRDsProcessed:1236

Alternatively go to:

Code: Select all

Console -> System Utilities -> View Cacti Log File

Then sort/filter by "SYSTEM STATS".

For the Recache, do this:

Code: Select all

Console -> System Utilities -> View Cacti Log File

Then sort/filter by "RECACHE".

knebb · Post by **knebb** » Mon Apr 11, 2016 10:48 am

Code: Select all

04/11/2016 04:02:37 AM - SYSTEM STATS: Time:156.1584 Method:spine Processes:1 Threads:4 Hosts:45 HostsPerProcess:45 DataSources:1309 RRDsProcessed:805
[...]
04/09/2016 01:14:55 AM - SYSTEM STATS: Time:294.1995 Method:spine Processes:1 Threads:4 Hosts:45 HostsPerProcess:45 DataSources:1309 RRDsProcessed:237

The first one is from today, with the Internet connection alive. The second one is from the day when the Internet was down completely. As I have a polling interval of five minutes, I would say it is plausible that there was not enough time to poll the last host, the one which is affected, because the run took 294 sec, i.e. nearly the full 5 minutes. Am I right?
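
The comparison above (a poller run of about 294 s against a 300 s interval) can also be checked mechanically. The following is a minimal Python sketch that parses the two SYSTEM STATS lines quoted in this post and flags runs that come close to the polling interval. The function names and the 95% threshold are made up for illustration; the 300 s interval is the five-minute setting mentioned in the post.

```python
import re

# Matches "SYSTEM STATS" lines in the format shown in the cacti.log excerpt.
STATS_RE = re.compile(
    r"^(?P<stamp>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} [AP]M)"
    r" - SYSTEM STATS: (?P<fields>.*)$"
)

def parse_stats(line):
    """Return (timestamp, dict of Key:Value fields), or None for other lines."""
    m = STATS_RE.match(line.strip())
    if m is None:
        return None
    fields = dict(item.split(":", 1) for item in m.group("fields").split())
    return m.group("stamp"), fields

def near_overrun(fields, interval_s=300.0, margin=0.95):
    """True when the run used at least `margin` of the polling interval.

    The 0.95 margin is an arbitrary illustrative threshold.
    """
    return float(fields["Time"]) >= interval_s * margin

lines = [
    "04/11/2016 04:02:37 AM - SYSTEM STATS: Time:156.1584 Method:spine Processes:1 Threads:4 Hosts:45 HostsPerProcess:45 DataSources:1309 RRDsProcessed:805",
    "04/09/2016 01:14:55 AM - SYSTEM STATS: Time:294.1995 Method:spine Processes:1 Threads:4 Hosts:45 HostsPerProcess:45 DataSources:1309 RRDsProcessed:237",
]

for line in lines:
    stamp, fields = parse_stats(line)
    print(stamp, fields["Time"], fields["RRDsProcessed"], near_overrun(fields))
```

Only the second line is flagged, and it is also the one where RRDsProcessed drops from 805 to 237.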

Here is the head of the RECACHE entries- I am going to check some docs what a RECACHE means...

Code: Select all

# grep "RECACHE" cacti.log| head
04/11/2016 04:02:37 AM - PCOMMAND: Poller[0] Host[105] RECACHE: Recache for Host, data query #1
04/11/2016 04:02:37 AM - PCOMMAND: Poller[0] Host[105] RECACHE: Recache successful.
04/11/2016 04:02:37 AM - RECACHE STATS: RecacheTime:0.2753 HostsRecached:1
04/11/2016 04:05:01 AM - CACTID: Poller[0] Host[25] RECACHE: Processing 2 items in the auto reindex cache for '10.101.0.10'
04/11/2016 04:05:01 AM - CACTID: Poller[0] Host[47] RECACHE: Processing 2 items in the auto reindex cache for 'cacti'
04/11/2016 04:05:02 AM - CACTID: Poller[0] Host[74] RECACHE: Processing 2 items in the auto reindex cache for 'backup'
04/11/2016 04:05:02 AM - CACTID: Poller[0] Host[77] RECACHE: Processing 2 items in the auto reindex cache for 'my'
04/11/2016 04:05:02 AM - CACTID: Poller[0] Host[90] RECACHE: Processing 2 items in the auto reindex cache for 'inf'
04/11/2016 04:05:02 AM - CACTID: Poller[0] Host[105] RECACHE: Processing 1 items in the auto reindex cache for 'ab3'
04/11/2016 04:05:02 AM - CACTID: Poller[0] Host[107] RECACHE: Processing 1 items in the auto reindex cache for 'ab2'

phalek · Post by **phalek** » Mon Apr 11, 2016 11:05 am

knebb wrote:The first one is from today, with the Internet connection alive. The second one is from the day when the Internet was down completely. As I have a polling interval of five minutes, I would say it is plausible that there was not enough time to poll the last host, the one which is affected, because the run took 294 sec, i.e. nearly the full 5 minutes. Am I right?

Yes indeed.

knebb · Post by **knebb** » Mon Apr 11, 2016 11:08 am

phalek wrote:
The first one is from today, with the Internet connection alive. The second one is from the day when the Internet was down completely. As I have a polling interval of five minutes, I would say it is plausible that there was not enough time to poll the last host, the one which is affected, because the run took 294 sec, i.e. nearly the full 5 minutes. Am I right?
Yes indeed.

Will it help to increase the number of threads for Spine?

phalek · Post by **phalek** » Mon Apr 11, 2016 11:37 am

Well, yes. Each thread polls a device and will do so until it has finished or the timeout occurs. So you may actually want to decrease the timeout of some devices to a value where you know polling works when everything is fine. If you keep high timeouts and the device is not reachable, the spine thread will wait until that timeout is reached before proceeding with the next device.

It's a matter of playing around with different settings to find the ones that fit best for you.
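
To get a feel for this trade-off, here is a rough back-of-the-envelope model in Python. It is not how Spine schedules its work internally. It only assumes that every unreachable device blocks one thread for timeout times retries, and that such devices are spread evenly over the threads. The function name and the example numbers are hypothetical.

```python
import math

def worst_case_blocked_seconds(unreachable, threads, timeout_ms, retries):
    """Rough upper bound on time lost to unreachable devices in one cycle.

    Assumption (not taken from Spine's source): each unreachable device
    blocks one thread for timeout * retries, and unreachable devices are
    spread evenly across the available threads.
    """
    per_device_s = (timeout_ms / 1000.0) * retries
    devices_per_thread = math.ceil(unreachable / threads)
    return devices_per_thread * per_device_s

# Hypothetical example: 8 unreachable devices, 5 s timeout, 3 retries.
print(worst_case_blocked_seconds(8, threads=4, timeout_ms=5000, retries=3))  # 30.0
print(worst_case_blocked_seconds(8, threads=8, timeout_ms=5000, retries=3))  # 15.0
# The same devices with a tenfold shorter timeout and 4 threads.
print(worst_case_blocked_seconds(8, threads=4, timeout_ms=500, retries=3))   # 3.0
```

Under these assumptions, shortening the timeout reduces the blocked time far more than adding threads does, which matches the advice to keep timeouts and retries low.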

knebb · Post by **knebb** » Mon Apr 11, 2016 5:36 pm

Hi,

thanks for all your tips!

I increased the number of max threads for Spine from 4 to 8 and under default conditions it looks much better:

Code: Select all

04/11/2016 11:32:30 PM - SYSTEM STATS: Time:149.3865 Method:spine Processes:1 Threads:4 Hosts:45 HostsPerProcess:45 DataSources:1309 RRDsProcessed:805
04/11/2016 11:36:37 PM - SYSTEM STATS: Time:95.5160 Method:spine Processes:1 Threads:8 Hosts:45 HostsPerProcess:45 DataSources:1309 RRDsProcessed:805
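
The two SYSTEM STATS lines above can be compared with a few lines of Python. This is only an illustrative calculation on the two logged Time values; the 300 s interval is the five-minute polling interval mentioned earlier in the thread, and the variable names are made up.

```python
INTERVAL_S = 300.0  # five-minute polling interval from earlier in the thread

# threads -> Time value from the two SYSTEM STATS lines
runs = {4: 149.3865, 8: 95.5160}

speedup = runs[4] / runs[8]

for threads, seconds in runs.items():
    headroom = INTERVAL_S - seconds
    print(f"{threads} threads: {seconds:.1f} s used, {headroom:.1f} s headroom")

print(f"speedup from doubling the threads: {speedup:.2f}x")
```

Doubling the threads gives a speedup of roughly 1.56x rather than 2x, so part of the run does not appear to scale with the thread count.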

Additionally I decreased the retry value from 3 to 2.

So I assume polling will cope fine with the next connection issue.

Thanks!

/KNEBB
