Lots of cuts in graph!!!


User avatar
winni
Posts: 24
Joined: Wed Aug 22, 2012 6:35 am
Location: Germany
Contact:

Re: Lots of cuts in graph!!!

Post by winni »

Hi Tyler,

No, I'm no longer using Boost because I came to the same conclusion - it made things worse on our server instead of better. So I de-activated, uninstalled and deleted the plugin.

All the best,
El Winni :)
User avatar
winni
Posts: 24
Joined: Wed Aug 22, 2012 6:35 am
Location: Germany
Contact:

Re: Lots of cuts in graph!!!

Post by winni »

I managed to identify at least one culprit for the slow responses from the web server: The superlinks plugin! When I disabled it, the response times from the web server went back to very acceptable speeds.

It's probably not the plugin itself that causes the problem, but the remote web pages that it displays as tabs in Cacti. There is a decimator appliance involved that uses Java applets to display spectrum graphs, and this is probably what brutally slowed down the machine.

However, that still does not account for the gaps in the graphs, I think.

UPDATE: Unfortunately, it --IS-- the superlinks plugin itself that causes the slow response time of the web server. Even when I disable all external web pages in superlinks, so that only the plugin itself is still installed but not loading any pages, the web server remains slow. When I disable the plugin, the web server flies again. So it must be some code in the plugin that needs to be optimized.
idle
Cacti User
Posts: 77
Joined: Wed May 26, 2004 10:49 am
Location: Barcelona
Contact:

Re: Lots of cuts in graph!!!

Post by idle »

winni wrote:I'd really appreciate any more detailed information about your working setup, maybe it will help us to fix our system.
Hello, winni.
I'm afraid you got me slightly wrong.
I didn't mean to say that I never had any gaps at all. I had some over a few hours or days, as in your first pic. I guess those are related to network issues, not to Cacti itself.
As for the "bar code style" gaps, I have never seen those.
In the end I figured it out: it was a lack of resources.

P.S. Sorry for the pause.
phalek wrote:I honestly urge everyone running a cacti system to install the Cacti poller template
Thanks for the advice.
User avatar
winni
Posts: 24
Joined: Wed Aug 22, 2012 6:35 am
Location: Germany
Contact:

Re: Lots of cuts in graph!!!

Post by winni »

Just as an update, when I use rrdtool to dump the contents of an affected graph, it clearly shows that there are no values stored for the times when Cacti shows "gaps".
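
For reference, this is roughly how I check an affected RRD file; the file name and timestamp below are just one example from our setup:

Code: Select all

# Dump the RRD to XML, then look at the rows around the suspect minutes
# to see whether a value was stored or not (adjust path and timestamp).
rrdtool dump /var/www/rra/mod_sm_47_-_free_to_use_ebno_5509.rrd > /tmp/ebno_5509.xml
grep -B 2 -A 2 "2014-02-17 23:07" /tmp/ebno_5509.xml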
User avatar
phalek
Developer
Posts: 2838
Joined: Thu Jan 31, 2008 6:39 am
Location: Kressbronn, Germany
Contact:

Re: Lots of cuts in graph!!!

Post by phalek »

At one of my customers I noticed that spine sometimes "forgets" to poll a device. I was not able to reproduce this issue locally, but you may want to enable a higher debug level for the Cacti log and check whether the log contains the SNMP get requests for these devices or whether they are skipped for some other reason.
Greetings,
Phalek
---
Need more help ? Read the Cacti documentation or my new Cacti 1.x Book
Need on-site support ? Look here Cacti Workshop
Need professional Cacti support ? Look here CereusService
---
Plugins : CereusReporting
User avatar
winni
Posts: 24
Joined: Wed Aug 22, 2012 6:35 am
Location: Germany
Contact:

Re: Lots of cuts in graph!!!

Post by winni »

Hi phalek, I have now set the logging level to DEBUG and will let it run for 24 hours. I will report back tomorrow with some more findings - or lack thereof. ;-)
User avatar
phalek
Developer
Posts: 2838
Joined: Thu Jan 31, 2008 6:39 am
Location: Kressbronn, Germany
Contact:

Re: Lots of cuts in graph!!!

Post by phalek »

just be careful about the growth/size of that log ... I managed to fill up a disk due to this ;-)
Greetings,
Phalek
---
Need more help ? Read the Cacti documentation or my new Cacti 1.x Book
Need on-site support ? Look here Cacti Workshop
Need professional Cacti support ? Look here CereusService
---
Plugins : CereusReporting
User avatar
winni
Posts: 24
Joined: Wed Aug 22, 2012 6:35 am
Location: Germany
Contact:

Re: Lots of cuts in graph!!!

Post by winni »

Yep, that log has already reached a gigantic size. The good news is that I found an example! :)

Image

The portion of the dumped RRD file for this graph looks like this, and as you can see, there are two NaNs where there should be values:

Code: Select all


                        <!-- 2014-02-17 23:06:00 UTC / 1392678360 --> <row><v>7.6966666667e+01</v></row>
                        <!-- 2014-02-17 23:07:00 UTC / 1392678420 --> <row><v>NaN</v></row>
                        <!-- 2014-02-17 23:08:00 UTC / 1392678480 --> <row><v>NaN</v></row>
                        <!-- 2014-02-17 23:09:00 UTC / 1392678540 --> <row><v>7.6000000000e+01</v></row>
The Cacti log for the host in question looks like this for that time window:

Code: Select all

						
---- THIS LOOKS NORMAL! ----

02/17/2014 11:06:01 PM - SPINE: Poller[0] Host[350] DEBUG: Entering SNMP Ping
02/17/2014 11:06:01 PM - SPINE: Poller[0] Host[350] SNMP Result: Host responded to SNMP
02/17/2014 11:06:01 PM - SPINE: Poller[0] Host[350] TH[1] RECACHE: Processing 1 items in the auto reindex cache for '192.168.123.156'
02/17/2014 11:06:01 PM - SPINE: Poller[0] Host[350] TH[1] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 1628323164
02/17/2014 11:06:01 PM - SPINE: Poller[0] Host[350] TH[1] NOTE: There are '6' Polling Items for this Host
02/17/2014 11:06:01 PM - SPINE: Poller[0] Host[350] TH[1] DS[5507] SNMP: v2: 192.168.123.156, dsname: ber, oid: 1.3.6.1.4.1.6247.24.1.3.2.1.0, value: 0
02/17/2014 11:06:01 PM - SPINE: Poller[0] Host[350] TH[1] DS[5510] SNMP: v2: 192.168.123.156, dsname: DemodAcqSweepWidth, oid: .1.3.6.1.4.1.6247.24.1.2.3.11
02/17/2014 11:06:01 PM - SPINE: Poller[0] Host[350] TH[1] DS[5508] SNMP: v2: 192.168.123.156, dsname: buffer, oid: 1.3.6.1.4.1.6247.24.1.3.2.2.0, value: 50
02/17/2014 11:06:01 PM - SPINE: Poller[0] Host[350] TH[1] DS[5509] SNMP: v2: 192.168.123.156, dsname: ebno, oid: 1.3.6.1.4.1.6247.24.1.3.2.5.0, value: 77
02/17/2014 11:06:01 PM - SPINE: Poller[0] Host[350] TH[1] DS[5511] SNMP: v2: 192.168.123.156, dsname: offset, oid: 1.3.6.1.4.1.6247.24.1.3.2.3.0, value: 740
02/17/2014 11:06:01 PM - SPINE: Poller[0] Host[350] TH[1] DS[5512] SNMP: v2: 192.168.123.156, dsname: level, oid: .1.3.6.1.4.1.6247.24.1.3.2.4.0, value: -72
02/17/2014 11:06:01 PM - SPINE: Poller[0] Host[350] TH[1] Total Time: 0.052 Seconds
02/17/2014 11:06:01 PM - SPINE: Poller[0] Host[350] TH[1] DEBUG: HOST COMPLETE: About to Exit Host Polling Thread Function


---- NO VALUES HERE! ----

02/17/2014 11:07:02 PM - SPINE: Poller[0] Host[350] DEBUG: Entering SNMP Ping
02/17/2014 11:07:02 PM - SPINE: Poller[0] Host[350] SNMP Result: Host responded to SNMP
02/17/2014 11:07:02 PM - SPINE: Poller[0] Host[350] TH[1] RECACHE: Processing 1 items in the auto reindex cache for '192.168.123.156'
02/17/2014 11:07:02 PM - SPINE: Poller[0] Host[350] TH[1] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 1628329222						

				
---- HERE IT POLLS VALUES, BUT DOES NOT ADD THEM TO THE GRAPH! ----
		
02/17/2014 11:08:02 PM - SPINE: Poller[0] Host[350] DEBUG: Entering SNMP Ping
02/17/2014 11:08:02 PM - SPINE: Poller[0] Host[350] SNMP Result: Host responded to SNMP
02/17/2014 11:08:02 PM - SPINE: Poller[0] Host[350] TH[1] RECACHE: Processing 1 items in the auto reindex cache for '192.168.123.156'
02/17/2014 11:08:02 PM - SPINE: Poller[0] Host[350] TH[1] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 1628335224
02/17/2014 11:08:02 PM - SPINE: Poller[0] Host[350] TH[1] NOTE: There are '6' Polling Items for this Host
02/17/2014 11:08:02 PM - SPINE: Poller[0] Host[350] TH[1] DS[5507] SNMP: v2: 192.168.123.156, dsname: ber, oid: 1.3.6.1.4.1.6247.24.1.3.2.1.0, value: 0
02/17/2014 11:08:02 PM - SPINE: Poller[0] Host[350] TH[1] DS[5510] SNMP: v2: 192.168.123.156, dsname: DemodAcqSweepWidth, oid: .1.3.6.1.4.1.6247.24.1.2.3.11
02/17/2014 11:08:02 PM - SPINE: Poller[0] Host[350] TH[1] DS[5508] SNMP: v2: 192.168.123.156, dsname: buffer, oid: 1.3.6.1.4.1.6247.24.1.3.2.2.0, value: 50
02/17/2014 11:08:02 PM - SPINE: Poller[0] Host[350] TH[1] DS[5509] SNMP: v2: 192.168.123.156, dsname: ebno, oid: 1.3.6.1.4.1.6247.24.1.3.2.5.0, value: 75
02/17/2014 11:08:02 PM - SPINE: Poller[0] Host[350] TH[1] DS[5511] SNMP: v2: 192.168.123.156, dsname: offset, oid: 1.3.6.1.4.1.6247.24.1.3.2.3.0, value: 740
02/17/2014 11:08:02 PM - SPINE: Poller[0] Host[350] TH[1] DS[5512] SNMP: v2: 192.168.123.156, dsname: level, oid: .1.3.6.1.4.1.6247.24.1.3.2.4.0, value: -73
02/17/2014 11:08:02 PM - SPINE: Poller[0] Host[350] TH[1] Total Time: 0.049 Seconds
02/17/2014 11:08:02 PM - SPINE: Poller[0] Host[350] TH[1] DEBUG: HOST COMPLETE: About to Exit Host Polling Thread Function


---- HERE ARE VALUES IN THE GRAPH, ALTHOUGH AN SNMP ERROR FOR THE HOST IS REPORTED! ----

02/17/2014 11:09:01 PM - SPINE: Poller[0] Host[350] DEBUG: Entering SNMP Ping
02/17/2014 11:09:03 PM - SPINE: Poller[0] Host[350] SNMP Ping Error: Unknown error: 2
02/17/2014 11:09:03 PM - SPINE: Poller[0] Host[350] SNMP Result: Host did not respond to SNMP
02/17/2014 11:09:03 PM - SPINE: Poller[0] Host[350] TH[1] NOTE: There are '6' Polling Items for this Host
02/17/2014 11:09:03 PM - SPINE: Poller[0] Host[350] TH[1] Total Time:     2 Seconds
02/17/2014 11:09:03 PM - SPINE: Poller[0] Host[350] TH[1] DEBUG: HOST COMPLETE: About to Exit Host Polling Thread Function

I presume that at 11:09, the values that were originally retrieved one minute earlier are added to the graph/RRD file.

This was a randomly chosen host/graph; we have several others with the same randomly occurring problem.
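
In case it helps anyone reproduce this, this is roughly how I fish such cases out of the (huge) debug log; the log path is just an example for our installation:

Code: Select all

# Cycles in which the host answered the SNMP ping ...
grep -c "Host\[350\] SNMP Result: Host responded to SNMP" /var/www/cacti/log/cacti.log

# ... versus cycles in which spine actually reported polling items for it:
grep -c "Host\[350\].*Polling Items for this Host" /var/www/cacti/log/cacti.log

# If the second count is lower, some cycles were silently skipped, like the 11:07 one above.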
User avatar
phalek
Developer
Posts: 2838
Joined: Thu Jan 31, 2008 6:39 am
Location: Kressbronn, Germany
Contact:

Re: Lots of cuts in graph!!!

Post by phalek »

There should be some rrd update commands in that log as well. Look for the ones containing the same DS numbers as reported in the SNMP gets (DS[xxx]).
You should be able to compare the actual updated numbers with what the SNMP request returns to match up the entries.
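For example, roughly like this (log path, host ID and data source ID are just placeholders for whatever you are looking at):

Code: Select all

# SNMP values spine fetched for one data source:
grep "Host\[350\].*DS\[5509\]" /var/www/cacti/log/cacti.log

# rrdtool update commands the poller issued for the same data source:
grep "CACTI2RRD.*_5509.rrd" /var/www/cacti/log/cacti.log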
Greetings,
Phalek
---
Need more help ? Read the Cacti documentation or my new Cacti 1.x Book
Need on-site support ? Look here Cacti Workshop
Need professional Cacti support ? Look here CereusService
---
Plugins : CereusReporting
User avatar
winni
Posts: 24
Joined: Wed Aug 22, 2012 6:35 am
Location: Germany
Contact:

Re: Lots of cuts in graph!!!

Post by winni »

Hi again,

I deleted the original log file too soon and had to wait almost half a day until the gaps in the graphs appeared again. So this is a new try:


Image


Let's concentrate on the first gap that occurs between 12:05 AM and 12:08 AM:

Image


Below are the log file entries between 12:05 AM and 12:08 AM for "Host[350]" and the related data source "DS[5509]". There are more graphs for this host, but I focus only on the graph depicting the Eb/No values, which are stored in the file /var/www/rra/mod_sm_47_-_free_to_use_ebno_5509.rrd.

In the 12:06 cycle, it seems as if the values were properly fetched and the RRD file was properly updated, but the graph shows a gap nonetheless.

In the polling cycle at 12:07, no values for the host were fetched and no "rrdtool update" command was issued.

In the 12:08 cycle, according to the log, I'd say that everything worked normally, but the RRD file shows "NaN" for this time slot.


The RRD dump for that time window looks like this:

Code: Select all

<!-- 2014-02-19 00:05:00 UTC / 1392768300 --> <row><v>8.0016666667e+01</v></row>
<!-- 2014-02-19 00:06:00 UTC / 1392768360 --> <row><v>8.1933333333e+01</v></row>
<!-- 2014-02-19 00:07:00 UTC / 1392768420 --> <row><v>NaN</v></row>
<!-- 2014-02-19 00:08:00 UTC / 1392768480 --> <row><v>NaN</v></row>
<!-- 2014-02-19 00:09:00 UTC / 1392768540 --> <row><v>8.1000000000e+01</v></row>
These are the Cacti log excerpts:

Code: Select all

02/19/2014 12:05:02 AM - SPINE: Poller[0] Host[350] DEBUG: Entering SNMP Ping
02/19/2014 12:05:02 AM - SPINE: Poller[0] Host[350] SNMP Result: Host responded to SNMP
02/19/2014 12:05:02 AM - SPINE: Poller[0] Host[350] TH[1] RECACHE: Processing 1 items in the auto reindex cache for '192.168.123.156'
02/19/2014 12:05:02 AM - SPINE: Poller[0] Host[350] TH[1] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 1637316709
02/19/2014 12:05:02 AM - SPINE: Poller[0] Host[350] TH[1] NOTE: There are '6' Polling Items for this Host
02/19/2014 12:05:02 AM - SPINE: Poller[0] Host[350] TH[1] DS[5507] SNMP: v2: 192.168.123.156, dsname: ber, oid: 1.3.6.1.4.1.6247.24.1.3.2.1.0, value: 0
02/19/2014 12:05:02 AM - SPINE: Poller[0] Host[350] TH[1] DS[5510] SNMP: v2: 192.168.123.156, dsname: DemodAcqSweepWidth, oid: .1.3.6.1.4.1.6247.24.1.2.3.11.0, value: 32
02/19/2014 12:05:02 AM - SPINE: Poller[0] Host[350] TH[1] DS[5508] SNMP: v2: 192.168.123.156, dsname: buffer, oid: 1.3.6.1.4.1.6247.24.1.3.2.2.0, value: 50
02/19/2014 12:05:02 AM - SPINE: Poller[0] Host[350] TH[1] DS[5509] SNMP: v2: 192.168.123.156, dsname: ebno, oid: 1.3.6.1.4.1.6247.24.1.3.2.5.0, value: 80
02/19/2014 12:05:02 AM - SPINE: Poller[0] Host[350] TH[1] DS[5511] SNMP: v2: 192.168.123.156, dsname: offset, oid: 1.3.6.1.4.1.6247.24.1.3.2.3.0, value: 7500
02/19/2014 12:05:02 AM - SPINE: Poller[0] Host[350] TH[1] DS[5512] SNMP: v2: 192.168.123.156, dsname: level, oid: .1.3.6.1.4.1.6247.24.1.3.2.4.0, value: -73
02/19/2014 12:05:02 AM - SPINE: Poller[0] Host[350] TH[1] Total Time:  0.03 Seconds
02/19/2014 12:05:02 AM - SPINE: Poller[0] Host[350] TH[1] DEBUG: HOST COMPLETE: About to Exit Host Polling Thread Function
02/19/2014 12:05:02 AM - POLLER: Poller[0] CACTI2RRD: /usr/bin/rrdtool update /var/www/rra/mod_sm_47_-_free_to_use_ebno_5509.rrd --template ebno 1392768302:80


02/19/2014 12:06:01 AM - SPINE: Poller[0] Host[350] DEBUG: Entering SNMP Ping
02/19/2014 12:06:01 AM - SPINE: Poller[0] Host[350] SNMP Result: Host responded to SNMP
02/19/2014 12:06:01 AM - SPINE: Poller[0] Host[350] TH[1] RECACHE: Processing 1 items in the auto reindex cache for '192.168.123.156'
02/19/2014 12:06:01 AM - SPINE: Poller[0] Host[350] TH[1] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 1637322663
02/19/2014 12:06:01 AM - SPINE: Poller[0] Host[350] TH[1] NOTE: There are '6' Polling Items for this Host
02/19/2014 12:06:01 AM - SPINE: Poller[0] Host[350] TH[1] DS[5507] SNMP: v2: 192.168.123.156, dsname: ber, oid: 1.3.6.1.4.1.6247.24.1.3.2.1.0, value: 0
02/19/2014 12:06:01 AM - SPINE: Poller[0] Host[350] TH[1] DS[5510] SNMP: v2: 192.168.123.156, dsname: DemodAcqSweepWidth, oid: .1.3.6.1.4.1.6247.24.1.2.3.11.0, value: 32
02/19/2014 12:06:01 AM - SPINE: Poller[0] Host[350] TH[1] DS[5508] SNMP: v2: 192.168.123.156, dsname: buffer, oid: 1.3.6.1.4.1.6247.24.1.3.2.2.0, value: 50
02/19/2014 12:06:01 AM - SPINE: Poller[0] Host[350] TH[1] DS[5509] SNMP: v2: 192.168.123.156, dsname: ebno, oid: 1.3.6.1.4.1.6247.24.1.3.2.5.0, value: 82
02/19/2014 12:06:01 AM - SPINE: Poller[0] Host[350] TH[1] DS[5511] SNMP: v2: 192.168.123.156, dsname: offset, oid: 1.3.6.1.4.1.6247.24.1.3.2.3.0, value: 7500
02/19/2014 12:06:01 AM - SPINE: Poller[0] Host[350] TH[1] DS[5512] SNMP: v2: 192.168.123.156, dsname: level, oid: .1.3.6.1.4.1.6247.24.1.3.2.4.0, value: -72
02/19/2014 12:06:01 AM - SPINE: Poller[0] Host[350] TH[1] Total Time: 0.035 Seconds
02/19/2014 12:06:01 AM - SPINE: Poller[0] Host[350] TH[1] DEBUG: HOST COMPLETE: About to Exit Host Polling Thread Function
02/19/2014 12:06:02 AM - POLLER: Poller[0] CACTI2RRD: /usr/bin/rrdtool update /var/www/rra/mod_sm_47_-_free_to_use_ebno_5509.rrd --template ebno 1392768361:82


02/19/2014 12:07:02 AM - SPINE: Poller[0] Host[350] DEBUG: Entering SNMP Ping
02/19/2014 12:07:02 AM - SPINE: Poller[0] Host[350] SNMP Result: Host responded to SNMP
02/19/2014 12:07:02 AM - SPINE: Poller[0] Host[350] TH[1] RECACHE: Processing 1 items in the auto reindex cache for '192.168.123.156'
02/19/2014 12:07:02 AM - SPINE: Poller[0] Host[350] TH[1] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 1637328724


02/19/2014 12:08:02 AM - SPINE: Poller[0] Host[350] DEBUG: Entering SNMP Ping
02/19/2014 12:08:02 AM - SPINE: Poller[0] Host[350] SNMP Result: Host responded to SNMP
02/19/2014 12:08:02 AM - SPINE: Poller[0] Host[350] TH[1] RECACHE: Processing 1 items in the auto reindex cache for '192.168.123.156'
02/19/2014 12:08:02 AM - SPINE: Poller[0] Host[350] TH[1] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 1637334723
02/19/2014 12:08:02 AM - SPINE: Poller[0] Host[350] TH[1] NOTE: There are '6' Polling Items for this Host
02/19/2014 12:08:02 AM - SPINE: Poller[0] Host[350] TH[1] DS[5507] SNMP: v2: 192.168.123.156, dsname: ber, oid: 1.3.6.1.4.1.6247.24.1.3.2.1.0, value: 0
02/19/2014 12:08:02 AM - SPINE: Poller[0] Host[350] TH[1] DS[5510] SNMP: v2: 192.168.123.156, dsname: DemodAcqSweepWidth, oid: .1.3.6.1.4.1.6247.24.1.2.3.11.0, value: 32
02/19/2014 12:08:02 AM - SPINE: Poller[0] Host[350] TH[1] DS[5508] SNMP: v2: 192.168.123.156, dsname: buffer, oid: 1.3.6.1.4.1.6247.24.1.3.2.2.0, value: 50
02/19/2014 12:08:02 AM - SPINE: Poller[0] Host[350] TH[1] DS[5509] SNMP: v2: 192.168.123.156, dsname: ebno, oid: 1.3.6.1.4.1.6247.24.1.3.2.5.0, value: 81
02/19/2014 12:08:02 AM - SPINE: Poller[0] Host[350] TH[1] DS[5511] SNMP: v2: 192.168.123.156, dsname: offset, oid: 1.3.6.1.4.1.6247.24.1.3.2.3.0, value: 7500
02/19/2014 12:08:02 AM - SPINE: Poller[0] Host[350] TH[1] DS[5512] SNMP: v2: 192.168.123.156, dsname: level, oid: .1.3.6.1.4.1.6247.24.1.3.2.4.0, value: -72
02/19/2014 12:08:02 AM - SPINE: Poller[0] Host[350] TH[1] Total Time: 0.037 Seconds
02/19/2014 12:08:02 AM - SPINE: Poller[0] Host[350] TH[1] DEBUG: HOST COMPLETE: About to Exit Host Polling Thread Function
02/19/2014 12:08:02 AM - POLLER: Poller[0] CACTI2RRD: /usr/bin/rrdtool update /var/www/rra/mod_sm_47_-_free_to_use_ebno_5509.rrd --template ebno 1392768482:81
User avatar
phalek
Developer
Posts: 2838
Joined: Thu Jan 31, 2008 6:39 am
Location: Kressbronn, Germany
Contact:

Re: Lots of cuts in graph!!!

Post by phalek »

So we have 2 issues here:

1) It skipped one poll although it checked the device for reachability (12:07).
2) It didn't update the RRD file although it said it did (12:08).

The bad thing is that this is quite hard to analyze.

Was there anything different in the Cacti stats for these 3 polls? Hosts, items, RRD files ...? It just looks like it didn't pick up any polling items for the 12:07 run and didn't make the actual system call in the 12:08 one.

Both are probably handled in spine, so we may have to get some additional debugging in there ... :-(
Greetings,
Phalek
---
Need more help ? Read the Cacti documentation or my new Cacti 1.x Book
Need on-site support ? Look here Cacti Workshop
Need professional Cacti support ? Look here CereusService
---
Plugins : CereusReporting
User avatar
winni
Posts: 24
Joined: Wed Aug 22, 2012 6:35 am
Location: Germany
Contact:

Re: Lots of cuts in graph!!!

Post by winni »

I was afraid that this wouldn't be something trivial. Like I've said, this is just one randomly chosen example; those gaps appear in almost all graphs.

And no, as far as I can see, there was nothing different in the Cacti stats for these three polls regarding hosts, items and rrd files.

I don't want to switch back to the PHP poller, because I doubt that it can handle this many hosts in a one-minute polling cycle.

How can we properly debug spine?
User avatar
phalek
Developer
Posts: 2838
Joined: Thu Jan 31, 2008 6:39 am
Location: Kressbronn, Germany
Contact:

Re: Lots of cuts in graph!!!

Post by phalek »

By looking at the C code ...

poller.c at line 859 prints out the following log message:

Code: Select all

		
SPINE_LOG_MEDIUM(("Host[%i] TH[%i] NOTE: There are '%i' Polling Items for this Host", host_id, host_thread, num_rows));
We see this at 12:06 and 12:08, but it is missing at 12:07.

It only gets there if num_rows > 0, so let's look at that:

num_rows is based on either query5 or query1, depending on the "poller_interval" (not sure what this is, but the code checks whether it's 0, in which case query1 is used, otherwise query5).

As we didn't see any other log entries, num_rows should contain something, but it can also be 0.

Code: Select all

Query1:

SELECT action, hostname, snmp_community,
       snmp_version, snmp_username, snmp_password,
       rrd_name, rrd_path, arg1, arg2, arg3, local_data_id,
       rrd_num, snmp_port, snmp_timeout,
       snmp_auth_protocol, snmp_priv_passphrase, snmp_priv_protocol, snmp_context
FROM poller_item
WHERE host_id=%i AND poller_id=%i
ORDER BY snmp_port %s

Code: Select all

Query5:

SELECT action, hostname, snmp_community,
       snmp_version, snmp_username, snmp_password,
       rrd_name, rrd_path, arg1, arg2, arg3, local_data_id,
       rrd_num, snmp_port, snmp_timeout,
       snmp_auth_protocol, snmp_priv_passphrase, snmp_priv_protocol, snmp_context
FROM poller_item
WHERE host_id=%i AND rrd_next_step <= 0
ORDER BY snmp_port %s
I assume that there may be instances when these queries do not return any entries, hence num_rows is 0 and the whole polling part is skipped.
Spine doesn't print any log entry when num_rows is 0; it only closes the MySQL connection. So if we add another SPINE_LOG_MEDIUM in there, we should be able to identify these cases.

(which would mean building a special spine version ...)
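
Roughly, that special version would be built like this; the version number and paths are just examples and need to match your installation:

Code: Select all

# Get the spine source that matches the installed version and find the spot
# quoted above (the num_rows branch around line 859 of poller.c):
tar xzf cacti-spine-0.8.8a.tar.gz
cd cacti-spine-0.8.8a
grep -n "Polling Items for this Host" poller.c

# Add an extra SPINE_LOG_MEDIUM(...) call for the num_rows == 0 case there,
# then rebuild and swap in the new binary:
./configure
make
sudo make install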

I'll have to check the rrd update later ... just finished watching the FCB/AFC game ... ;-)
Greetings,
Phalek
---
Need more help ? Read the Cacti documentation or my new Cacti 1.x Book
Need on-site support ? Look here Cacti Workshop
Need professional Cacti support ? Look here CereusService
---
Plugins : CereusReporting
stefanbrudny
Cacti User
Posts: 130
Joined: Thu Jan 19, 2012 11:52 am

Re: Lots of cuts in graph!!!

Post by stefanbrudny »

Hello, my 2 cents on the "bar code" style gaps; I'm not too mad about a single loss a day (yet).

In my experience, with 7k graphs and 34k data sources, the bar code pattern usually comes from a bottleneck in the system. As said before, there are many things to check:

* Disk I/O. Currently that is my problem, but I'm on VMware, so there are still some resources here and there to ask for. Also, VMware 5.1 is neurotic: sometimes it just does something and you suddenly have no resources. Generally, no hypervisor is better than having a hypervisor here, IMHO.
* Network infrastructure. Cacti says my average traffic is about 2 Mbit/s, but every minute, when the SNMP traffic is transferred, it increases to 30-50 Mbit/s. That's quite a lot, and sometimes bad network equipment just falls behind (lost UDP on poor links).
* Poller processes and threads. I have 4 processes and 40 threads. Other combinations may do better depending on how many hosts are polled, how many data sources there are, etc. You know your setup best.
* Other cron jobs generally affect the performance. That sounds obvious, but in my opinion Cacti is just sensitive to it, and maybe that's by design.
* Overall network state. My network is large, and its condition seems to change the polling time a little.
* Thresholds. I have some thousands of thresholds with alerts. I'm starting to suspect that the more thresholds are over their limits, the more resources are gone. That's strange, since Postfix only sends a few mails.
* Maybe too deep, but what about kernel optimization? I have some OVH/RHEL-built kernels, and from time to time they perform worse than previous ones.
* MySQL. I got a very nice bar code pattern when our new and brave Linux admin tweaked MySQL a little.
* Caches in the system. Usually the first graphs go fine until the I/O accumulates; simple enough, further bar code is the result.

And so on, you know.
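
One quick way to see whether the poller itself is the bottleneck is to watch its own stats lines in the Cacti log; the log path below is just an example, point it at your own cacti.log:

Code: Select all

# The Time: value is the total poller runtime per cycle. With 1-minute polling
# it should stay well below 60 seconds, otherwise samples start to get dropped.
grep "SYSTEM STATS" /var/www/cacti/log/cacti.log | tail -n 20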