Lots of cuts in graph!!!
Re: Lots of cuts in graph!!!
Hi Tyler,
No, I'm no longer using Boost because I came to the same conclusion - it made things worse on our server instead of better. So I de-activated, uninstalled and deleted the plugin.
All the best,
El Winni
Re: Lots of cuts in graph!!!
I managed to identify at least one culprit for the slow responses from the web server: The superlinks plugin! When I disabled it, the response times from the web server went back to very acceptable speeds.
It's probably not the plugin itself that causes the problem, but the remote webpages that it displays as tabs in Cacti. There is a decimator appliance involved that uses Java applets to display spectrum graphs, and this is probably what brutally slowed down the machine.
However, that still does not account for the gaps in the graphs, I think.
UPDATE: Unfortunately, it --IS-- the superlinks plugin itself that causes the slow response time of the web server. Even when I disable all external web pages in superlinks, so that only the plugin itself is still installed but not loading any pages, the web server remains slow. When I disable the plugin, the web server flies again. So it must be some code in the plugin that needs to be optimized.
Re: Lots of cuts in graph!!!
Hello,
winni.winni wrote: I'd really appreciate any more detailed information about your working setup, maybe it will help us to fix our system.
I'm afraid you got me a bit wrong.
I didn't mean to say that I never had any gaps at all. I had some over a few hours or days, as in your first pic. I guess that was related to a network issue, not to Cacti itself.
As for the "bar code style" gaps, those I have never seen.
Finally I figured it out: it was a lack of resources.
P.S. Sorry for the pause.
Thanks for the advice.
phalek wrote: I honestly urge everyone running a Cacti system to install the Cacti poller template
Re: Lots of cuts in graph!!!
Just as an update, when I use rrdtool to dump the contents of an affected graph, it clearly shows that there are no values stored for the times when Cacti shows "gaps".
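In case anyone wants to run the same check on their own files: a minimal sketch, using the path of one of my affected data sources as a stand-in (adjust it to yours).
Code:
rrdtool dump /var/www/rra/mod_sm_47_-_free_to_use_ebno_5509.rrd | grep -c NaN      # rough count of empty rows
rrdtool dump /var/www/rra/mod_sm_47_-_free_to_use_ebno_5509.rrd | grep NaN | head  # show the first few; the timestamps are on the same line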
- phalek
- Developer
Re: Lots of cuts in graph!!!
At one of my customers I realized that spine sometimes "forgets" to poll a device. I was not able to reproduce this issue locally, but you may want to enable a higher debugging level for the Cacti log and check whether the log contains the GET requests for these devices or whether they are just skipped for some other reason.
Greetings,
Phalek
---
Need more help ? Read the Cacti documentation or my new Cacti 1.x Book
Need on-site support ? Look here Cacti Workshop
Need professional Cacti support ? Look here CereusService
---
Plugins : CereusReporting
Re: Lots of cuts in graph!!!
Hi phalek, I have now set the logging level to DEBUG and will let it run for 24 hours. I will report back tomorrow with some more findings - or lack thereof.
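To keep that huge log manageable, I plan to filter it per host, roughly like this (assuming the default log location under the Cacti directory; adjust the path to your install):
Code:
grep 'Host\[350\]' /var/www/log/cacti.log | less                     # everything spine logged for this one device
grep 'Host\[350\]' /var/www/log/cacti.log | grep -c 'Polling Items'  # how many cycles actually fetched values for it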
- phalek
- Developer
Re: Lots of cuts in graph!!!
just be careful about the growth/size of that log ... I managed to fill up a disk due to this
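A rough housekeeping sketch, assuming the log sits in the default location (adjust the path to your install):
Code:
ls -lh /var/www/log/cacti.log                                          # keep an eye on the size
gzip -c /var/www/log/cacti.log > /tmp/cacti-debug-$(date +%F).log.gz   # keep a compressed copy for later analysis
: > /var/www/log/cacti.log                                             # then truncate the live file in place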
Greetings,
Phalek
---
Need more help ? Read the Cacti documentation or my new Cacti 1.x Book
Need on-site support ? Look here Cacti Workshop
Need professional Cacti support ? Look here CereusService
---
Plugins : CereusReporting
Re: Lots of cuts in graph!!!
Yep, that log has already reached a gigantic size. The good news is that I found an example!
The portion of the dumped RRD file for this graph looks like this, and as you can see, there are two NaNs where there should be values:
Code:
<!-- 2014-02-17 23:06:00 UTC / 1392678360 --> <row><v>7.6966666667e+01</v></row>
<!-- 2014-02-17 23:07:00 UTC / 1392678420 --> <row><v>NaN</v></row>
<!-- 2014-02-17 23:08:00 UTC / 1392678480 --> <row><v>NaN</v></row>
<!-- 2014-02-17 23:09:00 UTC / 1392678540 --> <row><v>7.6000000000e+01</v></row>
The Cacti log for the host in question looks like this for that time window:
Code:
---- THIS LOOKS NORMAL! ----
02/17/2014 11:06:01 PM - SPINE: Poller[0] Host[350] DEBUG: Entering SNMP Ping
02/17/2014 11:06:01 PM - SPINE: Poller[0] Host[350] SNMP Result: Host responded to SNMP
02/17/2014 11:06:01 PM - SPINE: Poller[0] Host[350] TH[1] RECACHE: Processing 1 items in the auto reindex cache for '192.168.123.156'
02/17/2014 11:06:01 PM - SPINE: Poller[0] Host[350] TH[1] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 1628323164
02/17/2014 11:06:01 PM - SPINE: Poller[0] Host[350] TH[1] NOTE: There are '6' Polling Items for this Host
02/17/2014 11:06:01 PM - SPINE: Poller[0] Host[350] TH[1] DS[5507] SNMP: v2: 192.168.123.156, dsname: ber, oid: 1.3.6.1.4.1.6247.24.1.3.2.1.0, value: 0
02/17/2014 11:06:01 PM - SPINE: Poller[0] Host[350] TH[1] DS[5510] SNMP: v2: 192.168.123.156, dsname: DemodAcqSweepWidth, oid: .1.3.6.1.4.1.6247.24.1.2.3.11
02/17/2014 11:06:01 PM - SPINE: Poller[0] Host[350] TH[1] DS[5508] SNMP: v2: 192.168.123.156, dsname: buffer, oid: 1.3.6.1.4.1.6247.24.1.3.2.2.0, value: 50
02/17/2014 11:06:01 PM - SPINE: Poller[0] Host[350] TH[1] DS[5509] SNMP: v2: 192.168.123.156, dsname: ebno, oid: 1.3.6.1.4.1.6247.24.1.3.2.5.0, value: 77
02/17/2014 11:06:01 PM - SPINE: Poller[0] Host[350] TH[1] DS[5511] SNMP: v2: 192.168.123.156, dsname: offset, oid: 1.3.6.1.4.1.6247.24.1.3.2.3.0, value: 740
02/17/2014 11:06:01 PM - SPINE: Poller[0] Host[350] TH[1] DS[5512] SNMP: v2: 192.168.123.156, dsname: level, oid: .1.3.6.1.4.1.6247.24.1.3.2.4.0, value: -72
02/17/2014 11:06:01 PM - SPINE: Poller[0] Host[350] TH[1] Total Time: 0.052 Seconds
02/17/2014 11:06:01 PM - SPINE: Poller[0] Host[350] TH[1] DEBUG: HOST COMPLETE: About to Exit Host Polling Thread Function
---- NO VALUES HERE! ---
02/17/2014 11:07:02 PM - SPINE: Poller[0] Host[350] DEBUG: Entering SNMP Ping
02/17/2014 11:07:02 PM - SPINE: Poller[0] Host[350] SNMP Result: Host responded to SNMP
02/17/2014 11:07:02 PM - SPINE: Poller[0] Host[350] TH[1] RECACHE: Processing 1 items in the auto reindex cache for '192.168.123.156'
02/17/2014 11:07:02 PM - SPINE: Poller[0] Host[350] TH[1] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 1628329222
---- HERE IT POLLS VALUES, BUT DOES NOT ADD THEM TO THE GRAPH! ----
02/17/2014 11:08:02 PM - SPINE: Poller[0] Host[350] DEBUG: Entering SNMP Ping
02/17/2014 11:08:02 PM - SPINE: Poller[0] Host[350] SNMP Result: Host responded to SNMP
02/17/2014 11:08:02 PM - SPINE: Poller[0] Host[350] TH[1] RECACHE: Processing 1 items in the auto reindex cache for '192.168.123.156'
02/17/2014 11:08:02 PM - SPINE: Poller[0] Host[350] TH[1] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 1628335224
02/17/2014 11:08:02 PM - SPINE: Poller[0] Host[350] TH[1] NOTE: There are '6' Polling Items for this Host
02/17/2014 11:08:02 PM - SPINE: Poller[0] Host[350] TH[1] DS[5507] SNMP: v2: 192.168.123.156, dsname: ber, oid: 1.3.6.1.4.1.6247.24.1.3.2.1.0, value: 0
02/17/2014 11:08:02 PM - SPINE: Poller[0] Host[350] TH[1] DS[5510] SNMP: v2: 192.168.123.156, dsname: DemodAcqSweepWidth, oid: .1.3.6.1.4.1.6247.24.1.2.3.11
02/17/2014 11:08:02 PM - SPINE: Poller[0] Host[350] TH[1] DS[5508] SNMP: v2: 192.168.123.156, dsname: buffer, oid: 1.3.6.1.4.1.6247.24.1.3.2.2.0, value: 50
02/17/2014 11:08:02 PM - SPINE: Poller[0] Host[350] TH[1] DS[5509] SNMP: v2: 192.168.123.156, dsname: ebno, oid: 1.3.6.1.4.1.6247.24.1.3.2.5.0, value: 75
02/17/2014 11:08:02 PM - SPINE: Poller[0] Host[350] TH[1] DS[5511] SNMP: v2: 192.168.123.156, dsname: offset, oid: 1.3.6.1.4.1.6247.24.1.3.2.3.0, value: 740
02/17/2014 11:08:02 PM - SPINE: Poller[0] Host[350] TH[1] DS[5512] SNMP: v2: 192.168.123.156, dsname: level, oid: .1.3.6.1.4.1.6247.24.1.3.2.4.0, value: -73
02/17/2014 11:08:02 PM - SPINE: Poller[0] Host[350] TH[1] Total Time: 0.049 Seconds
02/17/2014 11:08:02 PM - SPINE: Poller[0] Host[350] TH[1] DEBUG: HOST COMPLETE: About to Exit Host Polling Thread Function
---- HERE ARE VALUES IN THE GRAPH, ALTHOUGH AN SNMP ERROR FOR THE HOST IS REPORTED! ----
02/17/2014 11:09:01 PM - SPINE: Poller[0] Host[350] DEBUG: Entering SNMP Ping
02/17/2014 11:09:03 PM - SPINE: Poller[0] Host[350] SNMP Ping Error: Unknown error: 2
02/17/2014 11:09:03 PM - SPINE: Poller[0] Host[350] SNMP Result: Host did not respond to SNMP
02/17/2014 11:09:03 PM - SPINE: Poller[0] Host[350] TH[1] NOTE: There are '6' Polling Items for this Host
02/17/2014 11:09:03 PM - SPINE: Poller[0] Host[350] TH[1] Total Time: 2 Seconds
02/17/2014 11:09:03 PM - SPINE: Poller[0] Host[350] TH[1] DEBUG: HOST COMPLETE: About to Exit Host Polling Thread Function
I presume that at 11:09, the values that were originally retrieved one minute earlier are added to the graph/RRD file.
This was a randomly chosen host/graph; we have several others with the same randomly occurring problem.
- phalek
- Developer
Re: Lots of cuts in graph!!!
There should be some rrd update commands in that log as well. Look for the ones containing the same DS numbers as reported in the SNMP gets (DS[xxx]).
You should be able to compare the actual updated numbers with what the SNMP request returns to match up the entries.
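Something along these lines should line the two up, the DS number being the one from your log excerpt; the update lines are tagged CACTI2RRD, and the .rrd file name normally ends with the same local data id (the log path is an assumption, adjust it):
Code:
grep 'DS\[5509\]' /var/www/log/cacti.log | grep 'value:'     # what spine fetched for that data source
grep 'CACTI2RRD' /var/www/log/cacti.log | grep '_5509.rrd'   # the rrdtool update calls actually issued for it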
Greetings,
Phalek
---
Need more help ? Read the Cacti documentation or my new Cacti 1.x Book
Need on-site support ? Look here Cacti Workshop
Need professional Cacti support ? Look here CereusService
---
Plugins : CereusReporting
Re: Lots of cuts in graph!!!
Hi again,
I deleted the original log file too soon and had to wait almost half a day until the gaps in the graphs appeared again. So this is a new try:
Let's concentrate on the first gap that occurs between 12:05 AM and 12:08 AM:
Below are the log file entries between 12:05 AM and 12:08 AM for "Host[350]" and the related data source "DS[5509]". There are more graphs for this host, but I focus only on the graph depicting the Eb/No values, which are stored in the file /var/www/rra/mod_sm_47_-_free_to_use_ebno_5509.rrd.
In the 12:06 minute, it seems as if the values were properly fetched and the rrd file was properly updated, but the graph shows a gap nonetheless.
In the polling cycle at 12:07, no values for the host were fetched and no "rrdtool update" command was issued.
In the 12:08 cycle, according to the log, I'd say that everything worked normally, but the RRD file shows "NaN" for this time slot.
The RRD dump for that time window looks like this:
Code:
<!-- 2014-02-19 00:05:00 UTC / 1392768300 --> <row><v>8.0016666667e+01</v></row>
<!-- 2014-02-19 00:06:00 UTC / 1392768360 --> <row><v>8.1933333333e+01</v></row>
<!-- 2014-02-19 00:07:00 UTC / 1392768420 --> <row><v>NaN</v></row>
<!-- 2014-02-19 00:08:00 UTC / 1392768480 --> <row><v>NaN</v></row>
<!-- 2014-02-19 00:09:00 UTC / 1392768540 --> <row><v>8.1000000000e+01</v></row>
These are the Cacti log excerpts:
Code:
02/19/2014 12:05:02 AM - SPINE: Poller[0] Host[350] DEBUG: Entering SNMP Ping
02/19/2014 12:05:02 AM - SPINE: Poller[0] Host[350] SNMP Result: Host responded to SNMP
02/19/2014 12:05:02 AM - SPINE: Poller[0] Host[350] TH[1] RECACHE: Processing 1 items in the auto reindex cache for '192.168.123.156'
02/19/2014 12:05:02 AM - SPINE: Poller[0] Host[350] TH[1] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 1637316709
02/19/2014 12:05:02 AM - SPINE: Poller[0] Host[350] TH[1] NOTE: There are '6' Polling Items for this Host
02/19/2014 12:05:02 AM - SPINE: Poller[0] Host[350] TH[1] DS[5507] SNMP: v2: 192.168.123.156, dsname: ber, oid: 1.3.6.1.4.1.6247.24.1.3.2.1.0, value: 0
02/19/2014 12:05:02 AM - SPINE: Poller[0] Host[350] TH[1] DS[5510] SNMP: v2: 192.168.123.156, dsname: DemodAcqSweepWidth, oid: .1.3.6.1.4.1.6247.24.1.2.3.11.0, value: 32
02/19/2014 12:05:02 AM - SPINE: Poller[0] Host[350] TH[1] DS[5508] SNMP: v2: 192.168.123.156, dsname: buffer, oid: 1.3.6.1.4.1.6247.24.1.3.2.2.0, value: 50
02/19/2014 12:05:02 AM - SPINE: Poller[0] Host[350] TH[1] DS[5509] SNMP: v2: 192.168.123.156, dsname: ebno, oid: 1.3.6.1.4.1.6247.24.1.3.2.5.0, value: 80
02/19/2014 12:05:02 AM - SPINE: Poller[0] Host[350] TH[1] DS[5511] SNMP: v2: 192.168.123.156, dsname: offset, oid: 1.3.6.1.4.1.6247.24.1.3.2.3.0, value: 7500
02/19/2014 12:05:02 AM - SPINE: Poller[0] Host[350] TH[1] DS[5512] SNMP: v2: 192.168.123.156, dsname: level, oid: .1.3.6.1.4.1.6247.24.1.3.2.4.0, value: -73
02/19/2014 12:05:02 AM - SPINE: Poller[0] Host[350] TH[1] Total Time: 0.03 Seconds
02/19/2014 12:05:02 AM - SPINE: Poller[0] Host[350] TH[1] DEBUG: HOST COMPLETE: About to Exit Host Polling Thread Function
02/19/2014 12:05:02 AM - POLLER: Poller[0] CACTI2RRD: /usr/bin/rrdtool update /var/www/rra/mod_sm_47_-_free_to_use_ebno_5509.rrd --template ebno 1392768302:80
02/19/2014 12:06:01 AM - SPINE: Poller[0] Host[350] DEBUG: Entering SNMP Ping
02/19/2014 12:06:01 AM - SPINE: Poller[0] Host[350] SNMP Result: Host responded to SNMP
02/19/2014 12:06:01 AM - SPINE: Poller[0] Host[350] TH[1] RECACHE: Processing 1 items in the auto reindex cache for '192.168.123.156'
02/19/2014 12:06:01 AM - SPINE: Poller[0] Host[350] TH[1] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 1637322663
02/19/2014 12:06:01 AM - SPINE: Poller[0] Host[350] TH[1] NOTE: There are '6' Polling Items for this Host
02/19/2014 12:06:01 AM - SPINE: Poller[0] Host[350] TH[1] DS[5507] SNMP: v2: 192.168.123.156, dsname: ber, oid: 1.3.6.1.4.1.6247.24.1.3.2.1.0, value: 0
02/19/2014 12:06:01 AM - SPINE: Poller[0] Host[350] TH[1] DS[5510] SNMP: v2: 192.168.123.156, dsname: DemodAcqSweepWidth, oid: .1.3.6.1.4.1.6247.24.1.2.3.11.0, value: 32
02/19/2014 12:06:01 AM - SPINE: Poller[0] Host[350] TH[1] DS[5508] SNMP: v2: 192.168.123.156, dsname: buffer, oid: 1.3.6.1.4.1.6247.24.1.3.2.2.0, value: 50
02/19/2014 12:06:01 AM - SPINE: Poller[0] Host[350] TH[1] DS[5509] SNMP: v2: 192.168.123.156, dsname: ebno, oid: 1.3.6.1.4.1.6247.24.1.3.2.5.0, value: 82
02/19/2014 12:06:01 AM - SPINE: Poller[0] Host[350] TH[1] DS[5511] SNMP: v2: 192.168.123.156, dsname: offset, oid: 1.3.6.1.4.1.6247.24.1.3.2.3.0, value: 7500
02/19/2014 12:06:01 AM - SPINE: Poller[0] Host[350] TH[1] DS[5512] SNMP: v2: 192.168.123.156, dsname: level, oid: .1.3.6.1.4.1.6247.24.1.3.2.4.0, value: -72
02/19/2014 12:06:01 AM - SPINE: Poller[0] Host[350] TH[1] Total Time: 0.035 Seconds
02/19/2014 12:06:01 AM - SPINE: Poller[0] Host[350] TH[1] DEBUG: HOST COMPLETE: About to Exit Host Polling Thread Function
02/19/2014 12:06:02 AM - POLLER: Poller[0] CACTI2RRD: /usr/bin/rrdtool update /var/www/rra/mod_sm_47_-_free_to_use_ebno_5509.rrd --template ebno 1392768361:82
02/19/2014 12:07:02 AM - SPINE: Poller[0] Host[350] DEBUG: Entering SNMP Ping
02/19/2014 12:07:02 AM - SPINE: Poller[0] Host[350] SNMP Result: Host responded to SNMP
02/19/2014 12:07:02 AM - SPINE: Poller[0] Host[350] TH[1] RECACHE: Processing 1 items in the auto reindex cache for '192.168.123.156'
02/19/2014 12:07:02 AM - SPINE: Poller[0] Host[350] TH[1] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 1637328724
02/19/2014 12:08:02 AM - SPINE: Poller[0] Host[350] DEBUG: Entering SNMP Ping
02/19/2014 12:08:02 AM - SPINE: Poller[0] Host[350] SNMP Result: Host responded to SNMP
02/19/2014 12:08:02 AM - SPINE: Poller[0] Host[350] TH[1] RECACHE: Processing 1 items in the auto reindex cache for '192.168.123.156'
02/19/2014 12:08:02 AM - SPINE: Poller[0] Host[350] TH[1] Recache DataQuery[1] OID: .1.3.6.1.2.1.1.3.0, output: 1637334723
02/19/2014 12:08:02 AM - SPINE: Poller[0] Host[350] TH[1] NOTE: There are '6' Polling Items for this Host
02/19/2014 12:08:02 AM - SPINE: Poller[0] Host[350] TH[1] DS[5507] SNMP: v2: 192.168.123.156, dsname: ber, oid: 1.3.6.1.4.1.6247.24.1.3.2.1.0, value: 0
02/19/2014 12:08:02 AM - SPINE: Poller[0] Host[350] TH[1] DS[5510] SNMP: v2: 192.168.123.156, dsname: DemodAcqSweepWidth, oid: .1.3.6.1.4.1.6247.24.1.2.3.11.0, value: 32
02/19/2014 12:08:02 AM - SPINE: Poller[0] Host[350] TH[1] DS[5508] SNMP: v2: 192.168.123.156, dsname: buffer, oid: 1.3.6.1.4.1.6247.24.1.3.2.2.0, value: 50
02/19/2014 12:08:02 AM - SPINE: Poller[0] Host[350] TH[1] DS[5509] SNMP: v2: 192.168.123.156, dsname: ebno, oid: 1.3.6.1.4.1.6247.24.1.3.2.5.0, value: 81
02/19/2014 12:08:02 AM - SPINE: Poller[0] Host[350] TH[1] DS[5511] SNMP: v2: 192.168.123.156, dsname: offset, oid: 1.3.6.1.4.1.6247.24.1.3.2.3.0, value: 7500
02/19/2014 12:08:02 AM - SPINE: Poller[0] Host[350] TH[1] DS[5512] SNMP: v2: 192.168.123.156, dsname: level, oid: .1.3.6.1.4.1.6247.24.1.3.2.4.0, value: -72
02/19/2014 12:08:02 AM - SPINE: Poller[0] Host[350] TH[1] Total Time: 0.037 Seconds
02/19/2014 12:08:02 AM - SPINE: Poller[0] Host[350] TH[1] DEBUG: HOST COMPLETE: About to Exit Host Polling Thread Function
02/19/2014 12:08:02 AM - POLLER: Poller[0] CACTI2RRD: /usr/bin/rrdtool update /var/www/rra/mod_sm_47_-_free_to_use_ebno_5509.rrd --template ebno 1392768482:81
- phalek
- Developer
Re: Lots of cuts in graph!!!
So we have 2 issues here:
1) It skipped one poll although it checked the device for reachability (12:07)
2) It didn't update the rrd file although it said so (12:08)
The bad thing is that this is quite hard to analyze.
Was there anything different in the Cacti stats for these 3 polls (hosts, items, rrd files, ...)? It just looks like it didn't pick up any polling items for the 12:07 run and didn't make the actual system call in the 12:08 one.
Both are probably being handled in spine, so we may have to get some additional debugging in there ...
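As a first check you could also compare the poller's own statistics for those three runs and ask the RRD what it really received; a quick sketch with the file path from your post (the log path again being an assumption):
Code:
grep 'SYSTEM STATS' /var/www/log/cacti.log | grep '12:0[5-9]'          # Hosts/DataSources/RRDsProcessed per run
rrdtool lastupdate /var/www/rra/mod_sm_47_-_free_to_use_ebno_5509.rrd  # the last value that actually reached the RRD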
Greetings,
Phalek
---
Need more help ? Read the Cacti documentation or my new Cacti 1.x Book
Need on-site support ? Look here Cacti Workshop
Need professional Cacti support ? Look here CereusService
---
Plugins : CereusReporting
Re: Lots of cuts in graph!!!
I was afraid that this wouldn't be something trivial. Like I said, this is just one randomly chosen example of where it happens; those gaps appear in almost all graphs.
And no, as far as I can see, there was nothing different in the Cacti stats for these three polls regarding hosts, items and rrd files.
I don't want to switch back to the PHP poller, because I doubt that it can handle this many hosts in a one-minute polling cycle.
How can we properly debug spine?
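Would it help to run spine by hand against just this one host with full verbosity? Something like the following, if I read the spine help right (install paths and flag spellings may differ between spine versions, so please correct me):
Code:
# assumption: spine installed under /usr/local/spine; check `spine --help` for the exact options
/usr/local/spine/bin/spine --conf=/usr/local/spine/etc/spine.conf \
    --readonly --stdout --verbosity=5 350 350    # poll only Host[350] and print everything to the console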
- phalek
- Developer
Re: Lots of cuts in graph!!!
by looking at the C code ...
poller.c at line 859 prints out the following log entry:
Code:
SPINE_LOG_MEDIUM(("Host[%i] TH[%i] NOTE: There are '%i' Polling Items for this Host", host_id, host_thread, num_rows));
This is the entry we see at 12:06 and 12:08, but it is missing at 12:07.
It only goes in there if "num_rows > 0", so let's look at that:
num_rows is based on either query5 or query1, depending on "poller_interval" (not sure what this is exactly, but it checks whether it is "0" -> query1 or something else -> query5).
As we didn't see any other log entries, num_rows should contain something, but it can also be 0.
Code:
Query1:
"SELECT action, hostname, snmp_community, "
"snmp_version, snmp_username, snmp_password, "
"rrd_name, rrd_path, arg1, arg2, arg3, local_data_id, "
"rrd_num, snmp_port, snmp_timeout, "
"snmp_auth_protocol, snmp_priv_passphrase, snmp_priv_protocol, snmp_context "
" FROM poller_item"
" WHERE host_id=%i AND poller_id=%i"
" ORDER BY snmp_port %s"
Code:
Query5:
"SELECT action, hostname, snmp_community, "
"snmp_version, snmp_username, snmp_password, "
"rrd_name, rrd_path, arg1, arg2, arg3, local_data_id, "
"rrd_num, snmp_port, snmp_timeout, "
"snmp_auth_protocol, snmp_priv_passphrase, snmp_priv_protocol, snmp_context "
" FROM poller_item"
" WHERE host_id=%i and rrd_next_step <=0"
" ORDER by snmp_port %s"
I assume that there may be instances when these queries do not return any entries; num_rows is then 0 and the whole polling part is skipped.
Spine doesn't print out any log entry when num_rows is 0 but only closes the MySQL connection. So if we add another SPINE_LOG_MEDIUM in there, we should be able to identify these cases.
(which would mean creating a special spine version ...)
I'll have to check the rrd update later ... just finished watching the FCB/AFC game ...
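In the meantime you could check what these queries actually return for Host[350] right after a gap; a quick sketch, with the column names taken from the queries above (database name and credentials are whatever your install uses):
Code:
mysql -u cactiuser -p cacti -e \
  "SELECT local_data_id, rrd_name, rrd_num, rrd_next_step
     FROM poller_item
    WHERE host_id = 350;"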
Greetings,
Phalek
---
Need more help ? Read the Cacti documentation or my new Cacti 1.x Book
Need on-site support ? Look here Cacti Workshop
Need professional Cacti support ? Look here CereusService
---
Plugins : CereusReporting
- Cacti User
Re: Lots of cuts in graph!!!
Hello, here are my 2 cents on the barcode style. I'm not too bothered by a single loss a day (yet).
In my experience with 7k graphs and 34k data sources, barcode gaps are usually caused by a bottleneck in the system. As said before, there are many things to check:
* Disk I/O. Currently that is my problem, but I'm on VMware, so there are still some resources here and there to ask for (see the quick check sketched after this list). Also, VMware 5.1 is neurotic: sometimes it just does something and you have no resources left. Generally, no hypervisor is better than a hypervisor here, imho.
* Network infrastructure. Cacti says my average traffic is 2 Mbit/s. Well, each minute when it polls, the SNMP traffic increases to 30-50 Mbit/s. That's quite a lot, and sometimes bad network equipment just falls behind (lost UDP on poor links).
* Poller settings. I have 4 processes and 40 threads. Other combinations may do better depending on how many hosts are polled, how many data sources there are, etc. You know your setup better.
* Other cron jobs generally affect the performance. That's a newbie point, but in my opinion Cacti is just sensitive, and maybe that is by design.
* Overall network state. My network is large, and it seems its condition changes the polling time a little.
* Thresholds. I have some thousands of thresholds with alerts. I'm starting to suspect that the more thresholds are over their limits, the more resources are gone. That's strange, since postfix only sends a few mails.
* Maybe too deep, but how about kernel optimization? I have some OVH / RHEL kernels, and from time to time a new one performs worse than the previous one.
* MySQL. I got very nice barcode graphs when our new and brave Linux admin tweaked MySQL a little.
* Caches in the system. Usually the first graphs go fine until the I/O accumulates; simple enough, barcode is the result further on.
And so on, you know.
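For the disk I/O point, this is roughly how I watch it while the poller runs (needs the sysstat package for iostat):
Code:
iostat -x 1 60 > /tmp/iostat-during-poll.txt &   # per-device utilisation and await for one minute
vmstat 1 60 > /tmp/vmstat-during-poll.txt &      # overall CPU, memory and I/O wait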