BIND 8/9 based DNS monitoring

ritter · Post by **ritter** » Tue Apr 16, 2002 8:57 pm

Is anyone currently using Cacti to monitor Bind 8 or 9?

If not, I am probably going to write an input script and send it to raX for review so that everyone can make use of it. The only RRDtool-oriented DNS monitor I have seen so far is for ORCA, and I am not sure whether it is being maintained, much less whether it supports BIND 9.

Nicholas

Post by **raX** » Wed Apr 17, 2002 5:57 pm

All scripts that I have previously seen for this purpose use the output of 'named.stats'.

You can force Bind to dump its stats like:

rndc stats

Then you'd probably have to parse the file it outputs. The contents of that file are where I get confused; I'm not quite sure what to make of the data.

-Ian

ritter · Post by **ritter** » Thu Apr 18, 2002 10:16 am

I have a book that outlines and describes the different parts of the file, so that gets us over one hurdle.

For a long time I had the same problem of seeing the named stats but not being able to decipher them. This book cleared it up for me... granted, the book is, in general, not written very well.

Anyway, so I assume that it is ok for me to develop it and send it off to you, or would you prefer I send you the stats definitions so that you can do it?

Nicholas

Post by **raX** » Fri Apr 19, 2002 1:30 pm

If it is not a huge task, and if you do not mind, you can take care of writing the BIND script, since you have had more exposure to that information than I have. Otherwise you can just send me the information; it will probably just take longer that way.

-Ian

ritter · Post by **ritter** » Mon Apr 22, 2002 8:46 am

No problem, the script should be pretty easy to write provided that the data key is known.

What I think I am going to do is write the script to take arguments that name the data keys desired for graphing. There are so many different types of DNS queries, for example, that I want the script to fetch only the values desired by the admin.

This would allow one script to be used to build different data sets. Example: one graph to handle PTR and A, one graph to handle SOA and NS, etc.

gwynnebaer · Post by **gwynnebaer** » Mon Apr 22, 2002 4:11 pm

Here's a pretty good script I wrote based on this request. I just wrote it but it's pretty solid. It requires ndc so it is likely a unix-only script.

If you want to port it to win32, send me patches. (Heck, send me patches anyway.)

Code: Select all

#!/usr/bin/perl

# bind-stats.pl - a script to return bind-related statistical information
#                 Author: Matt Groener, gwynnebaer@hotmail.com

# Use built-in option syntax
use Getopt::Std;

# use $opt_d to override default named.stats dir location
getopt('d');

$STATFILE = $opt_d ? "$opt_d/named.stats" : '/var/named/named.stats';
$MEMFILE  = $opt_d ? "$opt_d/named.memstats" : '/var/named/named.memstats';
$cmd_ndc  = '/usr/sbin/ndc -q stats > /dev/null 2>&1';

# Generate stats now (this could be turned off and run via cron as well)
unlink($STATFILE,$MEMFILE);
qx($cmd_ndc);
$status = $?;
die "Failed command: $cmd_ndc: EXIT_CODE: $status" if $status;

# Die unless we can locate the stats file
if (!open(STATS,$STATFILE)) {
        die "Failed to open $STATFILE: $!\n";
}

# Parse the stats file
while (<STATS>) {
        next if /^[\-\+]/;
        chomp();
        if (/Legend/) { $start_legend++; next; }
        if (/Global/) { $start_legend--; $start_global++; next; }
        if ($start_legend) {
                push(@legend,split());
        } elsif ($start_global) {
                @global = split();
                for (0..$#legend) { $hash{lc($legend[$_])} = $global[$_]; }
                last;
        } else {
                @data = split();
                next if $data[1] =~ /^\d+$/;
                # break up the data and build hash of data
                /time since/i && do { $hash{lc($data[3])} = $data[0]; next; };
                /^\d+\s+.*\s+quer/i && do { $hash{lc($data[1])} = $data[0]; next; };
        }
}
close (STATS);

# print out stats or usage
if (@ARGV) {
        foreach $argv (@ARGV) {
                push(@output,$hash{lc($argv)}) if defined $hash{lc($argv)};
        }
        print "@output";
} else {
        print "Usage: $0 [-d statsdir] args\n\n       where args is one of:\n       ";
        foreach $argv (sort keys %hash) {
                print $argv;
                $incr++;
                if ($incr == 13) {
                        print "\n       ";
                        $incr = 0;
                } else {
                        print " ";
                }
        }
        print "\n\n";
}

Remember that you must remove the current stats files before re-running ndc; otherwise it will just append, causing bad (or at least inaccurate) data.

PS: raX, I will keep copies of everything I write until we can get some sort of posting location up.

Here is a shamelessly stolen snippet from the O'Reilly DNS/BIND book about what the stats mean:

-gwynnebaer

To get the statistics from your name server, send the version 4 name server an ABRT signal (on many systems, called IOT):

% kill -ABRT `cat /etc/named.pid`

Or send a version 8 name server an ILL signal instead of ABRT:

% kill -ILL `cat /etc/named.pid`

(The process ID is stored in /var/run/named.pid on an SVR4 filesystem.) Wait a few seconds and look at the file /usr/tmp/named.stats (or /var/tmp/named.stats). A version 8 name server leaves the file named.stats in its current directory (/usr/local/named in most of our examples). If the statistics are not dumped to this file, your server may not have been compiled with STATS defined and, thus, may not be collecting statistics.

Following are the statistics from one of Paul Vixie's name servers. These statistics came from a 4.9.3 name server. An 8.1.2 name server has all of the same items as below except RNotNsQ and the items are arranged in a different order. If your name server is newer than 8.1.2, the statistics may not look at all like those shown here - the BIND statistics may be replaced with the DNS server and resolver MIB extensions defined in RFC 1611 and RFC 1612.

+++ Statistics Dump +++ (800708260) Wed May 17 03:57:40 1995
746683 time since boot (secs)
392768 time since reset (secs)
14 Unknown query types
268459 A queries
3044 NS queries
5680 CNAME queries
11364 SOA queries
1008934 PTR queries
44 HINFO queries
680367 MX queries
2369 TXT queries
40 NSAP queries
27 AXFR queries
8336 ANY queries
++ Name Server Statistics ++
(Legend)
RQ RR RIQ RNXD RFwdQ
RFwdR RDupQ RDupR RFail RFErr
RErr RTCP RAXFR RLame ROpts
SSysQ SAns SFwdQ SFwdR SDupQ
SFail SFErr SErr RNotNsQ SNaAns
SNXD
(Global)
1992938 112600 0 19144 63462 60527 194 347 3420 0 5 2235 27 35289 0
14886 1927930 63462 60527 107169 10025 119 0 1785426 805592 35863
[15.255.72.20]
485 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 485 0 0 0 0 0 0 0 485 0
[15.255.152.2]
441 137 0 1 2 108 0 0 0 0 0 0 0 0 0 13 439 85 7 84 0 0 0 0 431 0
[15.255.152.4]
770 89 0 1 4 69 0 0 0 0 0 0 0 0 0 14 766 68 5 7 0 0 0 0 755 0
... <lots of entries deleted>

Let's look at these statistics one line at a time:

746683 time since boot (secs)
This is how long the local name server has been running. To convert to days, divide by 86400 (60x60x24, the number of seconds in a day). This server has been running for about 8.5 days.

392768 time since reset (secs)
This is how long the local name server has run since the last HUP signal - i.e., the last time it loaded its database. You'll probably see this number differ from the time since boot only if the server is a primary master name server for some zone. Name servers that are slaves for a zone automatically pick up new data with zone transfers and are not usually sent HUP signals. Since this server has been reset, it is probably a primary master name server for some zone.

14 Unknown query types
This name server received 14 queries for data of a type the name server didn't recognize. Either someone is experimenting with new types, or there is a defective implementation somewhere.

268459 A queries
There have been 268459 address lookups. Address queries are normally the most common type of query.

3044 NS queries
There have been 3044 name server queries. Internally, name servers generate NS queries when they are trying to look up servers for the root domain. Externally, applications like dig or nslookup can also be used to look up NS records.

5680 CNAME queries
Some versions of sendmail make CNAME queries in order to canonicalize a mail address (replace an alias with the canonical name). Other versions of sendmail use ANY queries instead (we'll get to those shortly). Otherwise, the CNAME lookups are most likely from dig or nslookup.

11364 SOA queries
SOA queries are made by slave name servers to check if their zone data are current. If the data are not current, an AXFR query follows to cause the zone transfer. Since this set of statistics does show AXFR queries, we can conclude that slave name servers load zone data from this server.

1008934 PTR queries
The pointer queries map addresses to names. Many kinds of software look up IP addresses: inetd, rlogind, rshd, network management software, and network tracing software.

44 HINFO queries
The host-information queries are most likely from someone interactively looking up HINFO records.

680367 MX queries
Mail exchanger queries are made by mailers like sendmail as part of the normal electronic mail delivery process.

2369 TXT queries
Some application must be making text queries for this number to be this large. It might be a tool like Harvest, which is an information search and retrieval technology developed at the University of Colorado.

40 NSAP queries
This is a relatively new data type used to map domain names to OSI Network Service Access Point addresses.

27 AXFR queries
An AXFR query is made by a slave name server to cause a zone transfer.

8336 ANY queries
ANY queries request data of any type for a name. This query type is used most often by sendmail. Since sendmail looks up CNAME, MX, and address records for a mail destination, it will make a query for ANY data type so that all the resource records are cached right away at the local name server.

The rest of the statistics are kept on a per-host basis. If you look over the list of hosts your name server has exchanged packets with, you'll find out just how garrulous your name server is - you'll see hundreds or even thousands of hosts in the list. While the size of the list is impressive, the statistics themselves are only somewhat interesting. We will explain all of the statistics, even the ones with zero counts, although you'll probably only find a handful of the statistics useful. To make the statistics easier to read, you'll need a tool to expand the statistics because the output format is rather compact. We wrote a tool, called bstat, to do just this. Here's what its output looks like:

hpcvsop.cv.hp.com
485 queries received
485 responses sent to this name server
485 queries answered from our cache
relay.hp.com
441 queries received
137 responses received
1 negative response received
2 queries for data not in our cache or authoritative data
108 responses from this name server passed to the querier
13 system queries sent to this name server
439 responses sent to this name server
85 queries sent to this name server
7 responses from other name servers sent to this name server
84 duplicate queries sent to this name server
431 queries answered from our cache
hp.com
770 queries received
89 responses received
1 negative response received
4 queries for data not in our cache or authoritative data
69 responses from this name server passed to the querier
14 system queries sent to this name server
766 responses sent to this name server
68 queries sent to this name server
5 responses from other name servers sent to this name server
7 duplicate queries sent to this name server
755 queries answered from our cache

In the raw statistics (not the bstat output), each host IP address is followed by a table of counts. The column heading for this table is the cryptic legend at the beginning. The legend is broken into several lines, but the host statistics are all on a single line. In the following section, we'll explain briefly what each column means as we look at the statistics for one of the hosts this server conversed with - 15.255.152.2 (relay.hp.com). For the sake of our explanation, we'll first show you the column heading from the legend (e.g., RQ) followed by the count for this column for relay.

RQ 441
RQ is the count of queries received from relay. These queries were made because relay needed information about a domain served by this name server.

RR 137
RR is the count of responses received from relay. These are responses to queries made from this name server. Don't try to correlate this number to RQ, because they are not related. RQ counts questions asked by relay; RR counts answers relay gave to this name server (because this name server asked relay for information).

RIQ 0
RIQ is the count of inverse queries received from relay. Inverse queries were originally intended to map addresses to names, but that function is now handled by PTR records. Older versions of nslookup use an inverse query on startup, so you may see a nonzero RIQ count.

RNXD 1
RNXD is the count of "no such domain" answers received from relay.

RFwdQ 2
RFwdQ is the count of queries received (RQ) from relay that needed further processing before they could be answered. This count is much higher for hosts that configure their resolver (with resolv.conf) to send all queries to your name server.

RFwdR 108
RFwdR is the count of responses received (RR) from relay that answered the original query and were passed back to the application that made the query.

RDupQ 0
RDupQ is the count of duplicate queries from relay. You'll only see duplicates when the resolver is configured (with resolv.conf) to query this name server.

RDupR 0
RDupR is the count of duplicate responses from relay. A response is a duplicate when the name server can no longer find the original query in its list of pending queries that caused the response.

RFail 0
RFail is the count of SERVFAIL responses from relay. A SERVFAIL response indicates some sort of server failure. Server failure responses often occur because the remote server read a db file and found a syntax error. Any queries for data in that zone (the one from the erroneous db file) will result in a server failure answer from the remote name server. This is probably the most common bad response. Server failure responses also occur when the remote name server tries to allocate more memory and can't, or when the remote name server's zone data expire.

RFErr 0
RFErr is the count of FORMERR responses from relay. FORMERR means that the remote name server said the local name server's query had a format error.

RErr 0
RErr is the count of errors that weren't either SERVFAIL or FORMERR.

RTCP 0
RTCP is the count of queries received on TCP connections from relay. (Most queries use UDP.)

RAXFR 0
RAXFR is the count of zone transfers initiated. The 0 count indicates that relay is not a slave for any zones served by this name server.

RLame 0
RLame is the count of lame delegations received. If this count is not 0, it means that some zone is delegated to the name server at this IP address, and the name server is not authoritative for the zone.

ROpts 0
ROpts is the count of packets received with IP options.

SSysQ 13
SSysQ is the count of system queries sent to relay. System queries are queries initiated by the local name server. Most system queries will go to root name servers, because system queries are used to keep up-to-date on who the root name servers are. But system queries are also used to find out the address of a name server if the address record timed out before the name server record did. Since relay is not a root name server, these queries must have been sent for the latter reason.

SAns 439
SAns is the count of answers sent to relay. This name server answered 439 out of the 441 (RQ) queries relay sent to it. I wonder what happened to the 2 queries it didn't answer...

SFwdQ 85
SFwdQ is the count of queries that were sent (forwarded) to relay when the answer was not in this name server's zone data or cache.

SFwdR 7
SFwdR is the count of responses from some name server that were sent (forwarded) to relay.

SDupQ 84
SDupQ is the count of the duplicate queries sent to relay. It's not as bad as it looks, though. The duplicate count is incremented if the query was sent to any other name server first. So, relay might have answered all the queries it received the first time it received them, and the query still counted as a duplicate because it was sent to some other name server before relay.

SFail 0
SFail is the count of SERVFAIL responses sent to relay.

SFErr 0
SFErr is the count of FORMERR responses sent to relay.

SErr 0
SErr is the count of sendto() system calls that failed when the destination was relay.

RNotNsQ 0
RNotNsQ is the count of queries received that were not from port 53, the name server port. Prior to version 8, all name server queries would come from port 53. Any queries from ports other than 53 came from a resolver. Now, name servers will query from ports other than 53, which makes this statistic useless since you can no longer distinguish resolver queries from name server queries. Hence, version 8 dropped RNotNsQ from its statistics.

SNaAns 431
SNaAns is the count of nonauthoritative answers sent to relay. Out of the 439 answers (SAns) sent to relay, 431 were from cached data.

SNXD 0
SNXD is the count of "no such domain" answers sent to relay.
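
The column-to-count pairing walked through above can be sketched in a few lines of shell and awk. This is only an illustration, not the book's bstat tool: `label_counts` is a hypothetical helper name, and it assumes the legend has already been joined onto a single line of whitespace-separated names.

```shell
#!/bin/sh
# label_counts LEGEND COUNTS  (hypothetical helper, not part of BIND or bstat)
# Pairs each legend column name with the count in the same position and
# prints one "NAME COUNT" pair per line. Both arguments are single lines of
# whitespace-separated fields.
label_counts() {
    awk -v legend="$1" -v counts="$2" 'BEGIN {
        n = split(legend, name, " ")
        m = split(counts, value, " ")
        if (n != m) exit 1          # column count mismatch: refuse to guess
        for (i = 1; i <= n; i++) print name[i], value[i]
    }'
}

# Legend and the 15.255.152.2 (relay.hp.com) line from the sample dump.
LEGEND="RQ RR RIQ RNXD RFwdQ RFwdR RDupQ RDupR RFail RFErr RErr RTCP RAXFR RLame ROpts SSysQ SAns SFwdQ SFwdR SDupQ SFail SFErr SErr RNotNsQ SNaAns SNXD"
RELAY="441 137 0 1 2 108 0 0 0 0 0 0 0 0 0 13 439 85 7 84 0 0 0 0 431 0"
label_counts "$LEGEND" "$RELAY"
```

Run against the relay line, this prints pairs such as `RQ 441`, `SAns 439` and `SNaAns 431`, matching the counts discussed above. A version 8 dump has no RNotNsQ column, so the legend string would have to come from the dump itself rather than being hard-coded as it is here.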

Is this name server "healthy"? How do you know what "healthy" operation is? From this one snapshot, we really couldn't say if the name server is healthy. You have to watch the statistics generated by your server over a period of time to get a feel for what sorts of numbers are normal for your configuration. These numbers will vary markedly among servers, depending on the mix of applications generating lookups, the type of server (primary, slave, caching-only), and the level in the domain tree it is serving.

One thing to watch for in the statistics is how many queries per second your server receives. Take the number of queries received (from the "Global" statistics) and divide by the number of seconds the name server has been running. This server received 1992938 queries in 746683 seconds, or approximately 2.7 queries per second - a pretty busy server. If the number you come up with for your server seems out of line, look at which hosts are making all the queries and decide if it makes sense for them to be making all those queries. At some point you may decide that you need more servers to handle the load; we cover that situation in the next chapter.
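
The queries-per-second arithmetic above is easy to script. A minimal sketch, assuming the two inputs (RQ from the Global line and the time since boot) have already been pulled out of the dump; `rate` is a hypothetical helper name:

```shell
#!/bin/sh
# rate QUERIES SECONDS  (hypothetical helper)
# Prints the average number of queries per second to two decimal places.
rate() {
    awk -v q="$1" -v s="$2" 'BEGIN {
        if (s <= 0) { print "0.00"; exit }   # avoid dividing by zero
        printf "%.2f\n", q / s
    }'
}

# Figures from the sample dump: 1992938 queries in 746683 seconds.
rate 1992938 746683    # prints 2.67
```

For graphing, RRDtool can derive the same kind of rate itself if the raw counter is stored in a COUNTER data source, so a helper like this is mainly useful for quick checks at the shell.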

ritter · Post by **ritter** » Tue Apr 23, 2002 8:43 am

This script is awesome!

Thanks for posting it, it does exactly what we need. I added two lines of code while demoing the script...

The two lines are Perl system() calls that delete the preexisting stats files before the rest of the code runs. I hope you don't mind too much.

Are you using this script to do BIND monitoring in cacti currently?

The only thing that needs to be done now then is to have the steps documented to configure cacti to use the script, which should be hard....

Nick

ritter · Post by **ritter** » Tue Apr 23, 2002 10:42 am

OK, I took the script posted in this topic, modified it to delete the stats files before regenerating them, then created a data input following the script's requirements.

At this point I set up a data input to use the script to fetch just PTR queries. When I call the script in the shell with the same command path (perl <script name> <script args>), the script works great.

I then created a data source using the data input, but didn't create graphs just yet.

I got the following error the first time the script ran:

sh: -c: line 1: syntax error near unexpected token `<PTR>'
sh: -c: line 1: `perl /var/www/html/cacti/scripts/nsstats.pl <PTR>'
X-Powered-By: PHP/4.0.6
Content-type: text/html

I assumed this was a problem of the angle brackets not being stripped out somewhere, such as in the data source. I checked and found the angle brackets only in the data input, where they should be (right?).

Without any modification, the second time the script ran it spat out the default garble, as if it didn't understand the command option, which further suggests the angle brackets are being submitted along with the command option data.

Just to check, I ran the ping.pl script with the command options both in angle brackets and without, and got the same errors. Is the data input parsing the problem?

Nick

ritter · Post by **ritter** » Tue Apr 23, 2002 10:55 am

Sorry, dumb error on my part....dunno how I missed it....oh well.

I think I now have one data source working. I was getting errors when using multiple data sources, but I am going to see if I made the same error.

Nick

ritter · Post by **ritter** » Tue Apr 23, 2002 11:24 am

Although I have not queried more than just PTR stats through the script, I have a question about implementation and scalability.

Say that a user wants to track errors in one RRD/graph, PTR and A in another, and NS/SOA in yet another.

This would involve having multiple data inputs to the same script, just with different arguments, which is easy. The issue is how to deal with the script hitting the stats files multiple times. Left the way I edited the script, each call would delete and then recreate the stats files. If issued at the same time, the calls would cause havoc, not to mention the disk I/O issues.

Any ideas on how to make this more scalable so as to give more freedom to graph data in groups? Or should cacti's abilities be used for graphing and grouping?

gwynnebaer · Post by **gwynnebaer** » Tue Apr 23, 2002 3:00 pm

The code snippet from the script:

Code: Select all

# Generate stats now (this could be turned off and run via cron as well) 
unlink($STATFILE,$MEMFILE);
qx($cmd_ndc);

actually first removes the files, and then regenerates the stats files, so you shouldn't need any system calls to first delete the stat files (perl is good like that).

The issue of I/O and repetitive calls to ndc on my first stab at it could be solved this way:

1. run ndc -q stats from cron before we run cmd.php from cron.
2. remove the unlink statements and the ndc call from the script
3. make all the calls to this script you need from cmd.php

I would suggest a simple shell wrapper to first remove, then regenerate the stats files to fulfill #1 above:

Code: Select all

#!/bin/sh -

rm -f /var/named/named.stats /var/named/named.memstats
ndc -q stats
exit 0

This would run stats every five minutes, and you would save the processor hit on bind and the box itself (since you only care about the stats within the 5-minute window anyway).

-gwynnebaer
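
If there is any chance of the cron wrapper above overlapping with itself, or with a poller run that still regenerates the files, a lock directory is one common way to serialize the work. This is a sketch under assumptions: `with_lock` and the lock path are hypothetical, not part of Cacti or BIND, and the approach relies only on `mkdir` failing when the directory already exists.

```shell
#!/bin/sh
# with_lock LOCKDIR COMMAND [ARGS...]  (hypothetical helper)
# Runs COMMAND only if LOCKDIR can be created. mkdir either creates the
# directory or fails, so two callers cannot both obtain the lock.
# Returns COMMAND's exit status, or 75 if the lock is already held.
with_lock() {
    lockdir=$1
    shift
    if mkdir "$lockdir" 2>/dev/null; then
        "$@"
        status=$?
        rmdir "$lockdir"
        return "$status"
    fi
    return 75
}

# Example call, commented out because it needs ndc and write access to
# /var/named; the lock path is an assumption:
# with_lock /var/named/.stats.lock sh -c \
#     'rm -f /var/named/named.stats /var/named/named.memstats; ndc -q stats'
```

A run that is killed partway through leaves the lock directory behind, so a real deployment would also want a trap or a staleness check. Callers that fail to get the lock can simply read whatever stats file is already there.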

ritter · Post by **ritter** » Tue Apr 23, 2002 8:13 pm

That is what I ended up doing earlier today. I wrote a shell script that runs every four minutes and deletes, then recreates, the stats files. I could have used five minutes, but on a long-running DNS server I sometimes find that ndc stats takes a while.

Anyway, I have a sampling of DNS traffic monitoring set up now if anyone wants to see it. I created a general DNS data input that takes just one argument (the variable to be monitored), then graphed various RRDs together. Cacti is the most wonderful frontend of any out there!

Here is the URL:

http://lfcnms.lfc.edu/cacti/graph_view.php?action=tree

Let me know if you can't reach the site, I may need to add a firewall rule to allow off campus web hits to the box.

Anyway, I can document everything I have done plus the shell script, although the script posted above would work just the same.

Nick

gwynnebaer · Post by **gwynnebaer** » Tue Apr 23, 2002 8:41 pm

ritter wrote: Here is the URL:

http://lfcnms.lfc.edu/cacti/graph_view.php?action=tree

Let me know if you can't reach the site, I may need to add a firewall rule to allow off campus web hits to the box.

A better URL would be:

http://lfcnms.lfc.edu/cacti/graph_view. ... on=preview

But looks great. Just seeing your data makes me have a whole new view of DNS queries. Amazing! I wonder what my data will show...

-gwynnebaer

Guest · Post by **Guest** » Wed Apr 24, 2002 1:21 am

I grab the DNS queries with just 2 lines of PERL, short and sweet.

Code: Select all

system("/usr/bin/sudo /usr/sbin/rndc stats");
($dns_queries) = (`/bin/cat /var/named/named.stats` =~ /success (\d+)/);

I added that script to UCD SNMP so I can poll all of our DNS servers remotely. On that note, I do ALL data collection via SNMP.

Brent Meshier
Global Transport Logistics, Inc.
http://www.gtlogistics.com/

ablyler · Post by **ablyler** » Sun Apr 28, 2002 4:08 pm

Code: Select all

+++ Statistics Dump +++ (1020033800)
success 13
referral 0
nxrrset 0
nxdomain 10
recursion 22
failure 0
--- Statistics Dump --- (1020033800)

I have bind 9.2.0, and my named.stats only contains the above. Any ideas on how I can expand this?

Thanks,
Andy
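
The BIND 9.2 dump above is a flat list of name/value pairs, so it can be read without the legend handling that the BIND 8 script needs. A minimal sketch, assuming the format is exactly as posted and that the file may hold more than one dump (in which case the last one wins); `bind9_stat` is a hypothetical helper that mimics the argument style of the Perl script earlier in the thread:

```shell
#!/bin/sh
# bind9_stat FILE KEY [KEY...]  (hypothetical helper)
# Prints the requested counters from the last dump in FILE, space-separated,
# in the order asked for. Unknown keys are skipped. FILE may be "-" for stdin.
bind9_stat() {
    file=$1
    shift
    awk -v want="$*" '
        /^\+\+\+ Statistics Dump/ { split("", val); next }   # new dump: reset
        /^--- Statistics Dump/    { next }                   # end marker
        NF == 2                   { val[tolower($1)] = $2 }  # name and value
        END {
            n = split(want, key, " ")
            out = ""
            for (i = 1; i <= n; i++) {
                k = tolower(key[i])
                if (k in val) out = out (out == "" ? "" : " ") val[k]
            }
            print out
        }' "$file"
}

# Demo with the dump posted above, fed on stdin.
bind9_stat - success nxdomain recursion <<'EOF'
+++ Statistics Dump +++ (1020033800)
success 13
referral 0
nxrrset 0
nxdomain 10
recursion 22
failure 0
--- Statistics Dump --- (1020033800)
EOF
```

The demo prints `13 10 22`. This only reads the six counters that are present; it does not make BIND 9.2 report per-query-type counts like the BIND 8 dump does.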
