[HOWTO] Debug NaN's in your graphs

gandalf
Developer
Posts: 22383
Joined: Thu Dec 02, 2004 2:46 am
Location: Muenster, Germany

Post by gandalf »

Please find the current release of this HowTo on the Cacti Documentation Site at http://docs.cacti.net/node/283
Cacti users sometimes complain about NaN's in their graphs. Unfortunately, there are several possible causes. The following is the step-by-step procedure I recommend for debugging them.

To debug the NaN's:

1. Check Cacti Log File
Please have a look at your cacti log file. Usually, you'll find it at <path_cacti>/log/cacti.log. Otherwise, see "Settings -> Paths". Check for this kind of error:

Code: Select all

CACTID: Host[...] DS[....] WARNING: SNMP timeout detected [500 ms], ignoring host '........'
For "reasonable" timeouts, this may be related to an snmpbulkwalk issue. To change this, see "Settings -> Poller" and lower the value for The Maximum SNMP OID's Per SNMP Get Request. Start at a value of 1 and increase it again if the poller starts working. Some agents don't have the horsepower to deliver that many OID's at a time, so we can reduce the number for those older/underpowered devices.
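
To see which hosts are affected most, the timeout warnings can be summarized per host. This is only a minimal sketch: the function name snmp_timeouts_per_host is made up for this example, and it assumes the log lines look like the warning shown above.

```shell
# Summarize SNMP timeout warnings per host from a Cacti log.
# Hypothetical helper; assumes each warning is one line containing
# "SNMP timeout detected" and a Host[...] token.
# Usage: snmp_timeouts_per_host /path/to/cacti.log
snmp_timeouts_per_host() {
    # Keep only the timeout warnings, extract the Host[...] token,
    # count occurrences and print "Host[...] <count>", busiest host first.
    grep -F 'SNMP timeout detected' "$1" \
        | grep -o 'Host\[[^]]*]' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $2, $1 }'
}
```

Hosts that show up with a high count are the first candidates for a lower Maximum SNMP OID's Per SNMP Get Request value.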

2. Check Basic Data Gathering:
For scripts, run them as cactiuser from the command line to check basic functionality. E.g. for a perl script named your-perl-script.pl with parameters "p1 p2", under *nix this would look like:

Code: Select all

su - cactiuser
/full/path/to/perl your-perl-script.pl p1 p2
... (check output)
For snmp, snmpget the _exact_ OID you're asking for, using the same community string and snmp version as defined within cacti. For an OID of .1.3.6.1.4.something, a community string of "very-secret" and version 2c for target host "target-host", this would look like:

Code: Select all

snmpget -c very-secret -v 2c target-host .1.3.6.1.4.something
.... (check output)
3. Check cacti's poller:
First, note the poller you're using. From crontab, it's _always_ poller.php that's executed, but you may configure cmd.php _or_ cactid from "Settings".
Now, clear ./log/cacti.log (or rename it to get a fresh start).
Then, change "Settings -> Poller Logging Level" to DEBUG for _one_ polling cycle. You may rename this log as well, to keep subsequent polling cycles from adding more to it.
Now, find the host/data source in question. The Host[<id>] is given numerically, the <id> being a specific number for that host. Find this <id> from the Devices menu when editing the host: the url contains a string like &id=<id>.
Check whether the output is as expected. If not, check your script (e.g. /full/path/to/perl). If ok, proceed to the next step.
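
A DEBUG log gets long quickly, so it helps to cut it down to the one host you care about. A minimal sketch, assuming every relevant line carries the Host[<id>] token described above; the function name host_log_lines is made up for this example.

```shell
# Print only the log lines that mention Host[<id>].
# Hypothetical helper, not part of Cacti.
# Usage: host_log_lines <id> /path/to/cacti.log
host_log_lines() {
    # -F treats the pattern as a fixed string, so the brackets are literal
    # and Host[5] does not match Host[15] or Host[51].
    grep -F "Host[$1]" "$2"
}
```

For example, "host_log_lines 5 ./log/cacti.log | less" shows everything the poller logged for the host with id 5.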

4. Check MySQL updating
In most cases, this step may be skipped. You may want to return to it if the next one fails (e.g. no rrdtool update to be found).
From the debug log, please find the MySQL update statement for that host concerning table poller_output. On very rare occasions, this will fail. So please copy that sql statement and paste it into a mysql session started from cli. This may as well be done from some tool like phpmyadmin. Check the sql return code.

5. Check rrd file updating
Down in the same log, you should find some

Code: Select all

rrdtool update <filename> --template ...
You should find exactly one update statement for each file.
RRD files should be created by the poller. If it does not create them, it will not fill them either. If it does, please check your Poller Cache from Utilities and search for your target. Does the query show up here?
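
Counting the update statements per file by hand is tedious. The following sketch does it for you, under the assumption that each statement sits on a single log line in the form "rrdtool update <filename> ..." and that file names contain no spaces. The function name rrd_updates_per_file is made up for this example.

```shell
# Count "rrdtool update" statements per rrd file in a Cacti debug log.
# Hypothetical helper; assumes one statement per line and file names
# without spaces.
# Usage: rrd_updates_per_file /path/to/cacti.log
rrd_updates_per_file() {
    awk '{
        # Look for the word "update" right after a token ending in "rrdtool";
        # the token that follows is taken as the rrd file name.
        for (i = 2; i < NF; i++) {
            if ($i == "update" && $(i - 1) ~ /rrdtool$/) {
                count[$(i + 1)]++
            }
        }
    }
    END {
        for (f in count) print count[f], f
    }' "$1" | sort -k 2
}
```

Each output line is "<count> <filename>". For a log that covers exactly one polling cycle, any count other than 1 deserves a closer look, and a file that is missing from the list was not updated at all.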

6. Check rrd file numbers
You may be wondering about this step if the former one was ok. But due to the data source's MINIMUM and MAXIMUM definitions, it is possible that valid updates for rrd files are suppressed because MINIMUM was not reached or MAXIMUM was exceeded.
Assuming you've got a valid rrdtool update in step 5, perform a

Code: Select all

rrdtool fetch <rrd file> AVERAGE
and look at the last 10-20 lines. If you find NaN's there, perform

Code: Select all

rrdtool info <rrd file>
and check the ds[...].min and ds[...].max entries, e.g.

Code: Select all

ds[loss].min = 0.0000000000e+00
ds[loss].max = 1.0000000000e+02
In this example, MINIMUM = 0 and MAXIMUM = 100. For a ds[...].type = "GAUGE", verify that e.g. the number returned by the script does not exceed ds[...].max (the same holds for ds[...].min, respectively).
If you run into this, please do not only update the data source definition within the Data Template, but also perform a

Code: Select all

rrdtool tune <rrd file> --maximum <ds-name>:<new ds maximum>
for all existing rrd files belonging to that Data Template.
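
The min/max comparison can also be scripted. This is a minimal sketch: it reads "rrdtool info" output on stdin and tells you whether a given value would be accepted. The function name ds_value_in_range is made up for this example, and it assumes that an unlimited bound is printed as NaN.

```shell
# Check a value against the ds[...].min / ds[...].max of a data source.
# Hypothetical helper; reads "rrdtool info" output from stdin.
# Usage: rrdtool info <rrd file> | ds_value_in_range <ds-name> <value>
# Exit status: 0 = within range, 1 = out of range, 2 = ds not found.
ds_value_in_range() {
    awk -v ds="$1" -v val="$2" '
        # Lines look like: ds[loss].min = 0.0000000000e+00
        $1 == ("ds[" ds "].min") { min = $3; have_min = 1 }
        $1 == ("ds[" ds "].max") { max = $3; have_max = 1 }
        END {
            if (!have_min && !have_max) { print "unknown ds " ds; exit 2 }
            # Assumption: NaN means no limit was set for that bound.
            if (have_min && tolower(min) !~ /nan/ && val + 0 < min + 0) {
                print "below min " min; exit 1
            }
            if (have_max && tolower(max) !~ /nan/ && val + 0 > max + 0) {
                print "above max " max; exit 1
            }
            print "ok"
        }'
}
```

With the example above, a script output of 150 for the data source "loss" would be reported as above the maximum, which matches the NaN you see in the rrd file.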

7. Check rrdtool graph statement
The last resort is to check that the correct data sources are used. Go to Graph Management and select your Graph. Enable DEBUG Mode to see the whole rrdtool graph statement. You should notice the DEF statements. They specify the rrd file and data source to be used. Check that all of them are as wanted.

Miscellaneous
Up to current cacti 0.8.6h, table poller_output may grow beyond reasonable size. This is commonly due to php.ini's default memory setting of 8 MB. Change this to at least 64 MB.
To check this, please run the following sql from the mysql cli (or phpmyadmin or the like)

Code: Select all

select count(*) from poller_output;
If the result is huge, you may get rid of those entries with

Code: Select all

truncate table poller_output;
As of current SVN code for upcoming cacti 0.9, I saw measures were taken on both issues (memory size, truncating poller_output).

RPM Installation?
Most rpm installations will set up the crontab entry now. If you've followed the installation instructions to the letter (which you should always do ;-) ), you may now have two pollers running. That's not a good thing, though. Most rpm installations will set up cron in /etc/cron.d/cacti.
Now, please check all your crontabs, especially /etc/crontab and the crontabs of users root and cactiuser. Leave only one poller entry across all of them. Personally, I've chosen /etc/cron.d/cacti to avoid problems when updating rpm's. Most often, you won't remember this item when updating lots of rpm's, so I felt more secure putting it here. And I've made some slight modifications, see

Code: Select all

*/5 * * * *     cactiuser       /usr/bin/php /var/www/html/cacti/poller.php > /var/local/log/poller.log 2>&1
This will produce a file /var/local/log/poller.log, which includes some additional information from each poller run, such as rrdtool errors. It occupies only a few bytes and will be overwritten each time.
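
To check for duplicate poller entries, you can count the active poller.php lines across your crontab files. A minimal sketch; the function name count_poller_entries is made up for this example, and the file list you pass in depends on your system.

```shell
# Count active (non-comment) crontab lines that call poller.php.
# Hypothetical helper; reads the given files, or stdin if none are given.
# Usage: count_poller_entries /etc/crontab /etc/cron.d/cacti
count_poller_entries() {
    # Drop comment lines first, then count the lines mentioning poller.php.
    cat "$@" 2>/dev/null \
        | grep -v '^[[:space:]]*#' \
        | grep -c 'poller\.php'
}
```

Per-user crontabs can be piped in, e.g. "crontab -u cactiuser -l | count_poller_entries". The counts from all sources should add up to exactly 1.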


Please comment if these instructions are difficult to understand or to follow. If you find other aspects worth checking, I'd like to hear from you, too.

Happy cactiing
Reinhard

Added new chapter 3 on MySQL debugging
Added new chapter on "Miscellaneous" stuff
Added new chapter on "RPM Installation" for crontab related issues
Added new chapter on Max OID get requests, courtesy http://forums.cacti.net/viewtopic.php?t=17839
Last edited by gandalf on Wed Jun 18, 2008 1:31 pm, edited 9 times in total.
BSOD2600
Cacti Moderator
Posts: 12171
Joined: Sat May 08, 2004 12:44 pm
Location: USA

Post by BSOD2600 »

Great guide as usual :-). Might want to add it to your signature...

Post by gandalf »

BSOD2600 wrote:Great guide as usual :-). Might want to add it to your signature...
If only there was enough space left in it :roll:
Reinhard

Post by BSOD2600 »

Get rony to increase the limit. I had to remove the coloring from mine to make the text fit...
rony
Developer/Forum Admin
Posts: 6022
Joined: Mon Nov 17, 2003 6:35 pm
Location: Michigan, USA

Post by rony »

Get more creative! :P
Tony Roman
Experience is what causes a person to make new mistakes instead of old ones.
There are only 3 ways to complete a project: Good, Fast or Cheap, pick two.
With age comes wisdom, what you choose to do with it determines whether or not you are wise.

Post by gandalf »

Wow, you've got it. Suppose, I've to start soon ... :wink: :lol: :D
Reinhard
srikrishnak
Posts: 3
Joined: Sun Oct 01, 2006 11:02 pm

Post by srikrishnak »

Great guide. I was puzzled about NaN, and thanks to Google I was brought to the right place.
Thx a ton
Criggie
Posts: 16
Joined: Sat Jul 21, 2007 4:30 am
Location: Christchurch, New Zealand

Another cause

Post by Criggie »

Hi - I found this guide most useful, and with it I fixed all but one of my non-working graphs.

The other graph was a script that would return 0 for values, rather than NAN.

It was running smartctl which was complaining "you are not root" so a quick chmod +s on the script fixed that problem.

Secondly, the script was taking several seconds to run. So cacti was logging a "U" for unparseable in the debug output, and was recording NAN. So my fix there was to make the script run faster - it has to complete in less than one second, and the age of my box makes that hard.

Post by gandalf »

Thanks. I will add your comments to http://docs.cacti.net/node/283 as this is the only supported version of this HowTo, ATM
Reinhard
inetquestion
Cacti User
Posts: 67
Joined: Wed Feb 01, 2006 11:13 am
Location: Charlotte NC

Post by inetquestion »

I've been using your script to debug a problem over the past week or so and thought I would give you some additional items which may help others...

My problem was that the values being returned from the data input method were in the wrong format for the data template. I was getting a mixture of integers and floating point numbers. Data sources which were set to gauge would accept either type. Those set to counter were rejecting updates for floating point numbers. Modifying the original script so that only integers were returned fixed the issue. The only way I was able to find this was to pull the "rrdtool update...." statement from the logs and run it manually from the command line. This showed me exactly what the errors were. There may be another way I could have gotten this info, but I still don't know that part....
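
A script feeding a counter data source could guard against this before it prints its result. The following is only a sketch of such a guard, not what my script actually does; the function name is_integer is made up for this example.

```shell
# Succeed only if the argument consists of digits alone
# (an unsigned integer such as a counter data source expects).
# Hypothetical helper for use inside a data input script.
is_integer() {
    case "$1" in
        '' | *[!0-9]*) return 1 ;;  # empty, or contains a non-digit
        *) return 0 ;;
    esac
}
```

A data input script could call "is_integer "$value" || echo "not an integer: $value" >&2" before printing its result, so that a stray floating point number shows up when running the script by hand instead of only as a rejected update.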

Regards,

-Inet

Post by gandalf »

I will add this as well. But please find the current docs at the second link of my signature
Reinhard
pato
Posts: 17
Joined: Thu Sep 09, 2010 10:35 am

Re: [HOWTO] Debug NaN's in your graphs

Post by pato »

There is one point not covered in the debug steps: wrong settings under Data Templates -> select your template -> Associated RRA's. If the template has "ds[wlc_5min_cpu].minimal_heartbeat = 120" set (in this case a Wlan Controller template), but the Associated RRA's also include 1 Minute, this will only create NaN entries. Delete the affected device, including all graphs, change all Data Templates that have wrong Associated RRA's settings and add the device again.
That fixed it in my case.
leoncyn
Posts: 10
Joined: Wed Nov 23, 2016 2:06 pm

Debug NaN's in your graphs Windows

Post by leoncyn »

Hello everybody! I have seen a lot of information on solving this problem on Linux; is there some advice for Windows 7? My problem is that the graphs don't show any data, and the information from the devices shows Current: NaN, Average: NaN, Maximum: NaN.
Attachments
poller_today.jpg (333.95 KiB)
cacti_error_nan.jpg (251.29 KiB)
cigamit
Developer
Posts: 3369
Joined: Thu Apr 07, 2005 3:29 pm
Location: B/CS Texas

Re: Debug NaN's in your graphs Windows

Post by cigamit »

leoncyn wrote: Hello everybody! I have seen a lot of information on solving this problem on Linux; is there some advice for Windows 7? My problem is that the graphs don't show any data, and the information from the devices shows Current: NaN, Average: NaN, Maximum: NaN.
Windows doesn't report "Load Average", so you will be unable to graph it; you will want to graph CPU usage instead.
pbud70
Posts: 5
Joined: Mon Oct 10, 2016 7:25 pm

Re: [HOWTO] Debug NaN's in your graphs

Post by pbud70 »

Hi, I've been bashing my head against this issue for a while now... time to get some help.

I'm reading data from a database into variables, and then dropping that into the rrd. Graphs aren't working in cacti, which is failing because the rra aggregations are all returning nan. I've worked my way through a whole bunch of stuff to no avail. So... here's where I'm currently at:

cron runs a script every 5 minutes. This script basically does:
rrdtool updatev /var/www/mrtg/applications/test.rrd N:$rttime >> /var/log/rttime.log 2>&1

(I'm using updatev right now to try and get more information about what's going on).

the RRD is defined:
/usr/bin/rrdtool create \
/var/www/mrtg/applications/test.rrd \
--step 300 \
DS:rttime:GAUGE:600:0:U \
RRA:AVERAGE:0.5:1:500 \
RRA:AVERAGE:0.5:1:600 \
RRA:AVERAGE:0.5:6:700 \
RRA:AVERAGE:0.5:24:775 \
RRA:AVERAGE:0.5:288:797 \
RRA:AVERAGE:0.5:1440:820 \
RRA:MAX:0.5:1:500 \
RRA:MAX:0.5:1:600 \
RRA:MAX:0.5:6:700 \
RRA:MAX:0.5:24:775 \
RRA:MAX:0.5:288:797 \
RRA:MAX:0.5:1440:820 \


$rttime is populated by querying a sql server database and returning the floating value to the variable.
I'm using mssql 0.6.2 from https://www.npmjs.com/package/sql-cli to talk to the sql server.

So the command string ends up looking like

$SQLBIN -s "$DBHOST" -d "$DBNAME" -u "$DBUSER" -p "$DBPASSWD" -q "$QUERY1" -f csv | sed 's/"//g' | cut -d , -f 1,3 > /tmp/output.txt

which gets me useful fields. Populating the variable happens with:

rttime=`sed -n "$i p" /tmp/output.txt | cut -d , -f 2`

which will give something like:
2.0166666666666666

If I then update the rrd:
rrdtool updatev /var/www/mrtg/applications/test.rrd N:$rttime >> /var/log/rttime.log 2>&1

rrdtool updatev /var/www/mrtg/applications/test.rrd N:0
return_value = 0
[1502287200]RRA[AVERAGE][1]DS[rttime] = NaN
...

It puts the data into the rrd, but the rras don't get populated:
filename = "/var/www/mrtg/applications/test.time.rrd"
rrd_version = "0003"
step = 300
last_update = 1502250603
header_size = 4120
ds[rttime].index = 0
ds[rttime]= "GAUGE"
ds[rttime].minimal_heartbeat = 600
ds[rttime].min = 0.0000000000e+00
ds[rttime].max = 1.0000000000e+03
ds[rttime].last_ds = "1.4166666666666667
ds[rttime].value = NaN
ds[rttime].unknown_sec = 3
rra[0].cf = "AVERAGE"
rra[0].rows = 500
rra[0].cur_row = 301
rra[0].pdp_per_row = 1
rra[0].xff = 5.0000000000e-01
rra[0].cdp_prep[0].value = NaN
rra[0].cdp_prep[0].unknown_datapoints = 0
...

valid data in this case is likely to range from 0 - 10, and will be floating point. But the doco says GAUGE should be OK with float?

Now... I have some graphs using almost exactly the same process, but where the data returned is always an integer - and they work fine. Maybe this is the problem? I've just modified the script to only return 2dp.

Nope... still getting NaN for all the averages.

Where do I look next?

Thanks!