My graphs stop updating

remm · Post by **remm** » Wed May 19, 2004 4:21 am

I don't understand why, but sometimes my graphs stop updating. When I delete the host along with its Data Sources and then recreate them, updating restarts. But I hope that's not the best solution!
I think my RRD files are no longer being updated, because after I recreate the host and data sources there is a gap between when the graph stops and when it restarts.
I hope I've been clear.

Does anyone have a lead?
thx

Vince22 · Post by **Vince22** » Thu May 20, 2004 10:52 am

I have the same problem and I don't understand it either!
Some of the graphs for a device are updated, but not all of them.
I changed the SNMP timeout, but nothing changed.
Help us!

Post by **raX** » Tue May 25, 2004 12:53 am

Are either of you using cactid to gather data for the graphs? If so, do you see any "stuck" cactid processes if you run 'ps ax' at the command line? Random holes in the graph usually indicate that the poller is either not running or is dying prematurely.
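If you want to script the 'ps ax' check, a minimal Python sketch could look like this. It assumes the BSD-style 'ps ax' column layout found on Linux (PID TTY STAT TIME COMMAND) and looks for the two poller names mentioned in this thread; the function names are hypothetical, not part of Cacti.

```python
import subprocess

# Poller names mentioned in this thread; adjust if yours differs.
POLLER_NAMES = ("cactid", "cmd.php")


def find_pollers(ps_output, names=POLLER_NAMES):
    """Return (pid, command) pairs for 'ps ax' lines that mention a poller."""
    matches = []
    for line in ps_output.splitlines()[1:]:  # skip the PID TTY STAT TIME COMMAND header
        fields = line.split(None, 4)  # the 5th field is the full command line
        if len(fields) < 5:
            continue
        pid, command = int(fields[0]), fields[4]
        if any(name in command for name in names):
            matches.append((pid, command))
    return matches


def report_pollers():
    """Run 'ps ax' on this host and print any poller processes it finds."""
    output = subprocess.run(["ps", "ax"], capture_output=True, text=True, check=True).stdout
    pollers = find_pollers(output)
    for pid, command in pollers:
        print(pid, command)
    if len(pollers) > 1:
        print("More than one poller process is running; an earlier run may be stuck.")
    return pollers
```

Call report_pollers() on the Cacti host a few times between polling cycles; a process that never goes away is a candidate for the "stuck" poller described above.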

-Ian

remm · Post by **remm** » Thu May 27, 2004 5:40 am

I'm not using cactid; I'm using the crontab.
Vince22, I don't think we have the same problem, because in my case, when the graphs stop updating, they never start again. I have the same kind of graphs for some of my routers, but I think that is normal.

Lux · Post by **Lux** » Thu May 27, 2004 6:35 am

Remm, if it were my router I wouldn't question the blank periods; I would, however, question the periods where you have nearly constant 1% utilization. Does your graph show 0% at the same time every day? Notice that your router reports no utilization during non-business hours (1800-0800).

Mike

remm · Post by **remm** » Thu May 27, 2004 8:12 am

My graphs don't show 0% at the same time every day, even though traffic between 18:00 and 08:00 is almost non-existent.
CPU usage is about 1% during nearly all business hours; I've seen only one peak since I started monitoring it.

oharel · Post by **oharel** » Wed Aug 04, 2004 2:03 pm

I have exactly the same problem, except that it happened all at once, and only with the Cisco CPU and memory graphs!
Attached is an example...
This is what I did to try to restore the graphs, to no avail:
cleared the poller cache
removed the two graphs from the device and re-added them
checked permissions on the RRDs
Here is the rrdtool info output:
filename = "snmp_alex_acca_5min_cpu_88.rrd"
rrd_version = "0001"
step = 300
last_update = 1091645648
ds[5min_cpu].type = "GAUGE"
ds[5min_cpu].minimal_heartbeat = 600
ds[5min_cpu].min = 0.0000000000e+00
ds[5min_cpu].max = 1.0000000000e+02
ds[5min_cpu].last_ds = "UNKN"
ds[5min_cpu].value = 0.0000000000e+00
ds[5min_cpu].unknown_sec = 248
rra[0].cf = "AVERAGE"
rra[0].rows = 700
rra[0].pdp_per_row = 6
rra[0].xff = 5.0000000000e-01
rra[0].cdp_prep[0].value = NaN
rra[0].cdp_prep[0].unknown_datapoints = 4
rra[1].cf = "AVERAGE"
rra[1].rows = 17280
rra[1].pdp_per_row = 1
rra[1].xff = 5.0000000000e-01
rra[1].cdp_prep[0].value = NaN
rra[1].cdp_prep[0].unknown_datapoints = 0
rra[2].cf = "AVERAGE"
rra[2].rows = 775
rra[2].pdp_per_row = 24
rra[2].xff = 5.0000000000e-01
rra[2].cdp_prep[0].value = NaN
rra[2].cdp_prep[0].unknown_datapoints = 10
rra[3].cf = "AVERAGE"
rra[3].rows = 797
rra[3].pdp_per_row = 288
rra[3].xff = 5.0000000000e-01
rra[3].cdp_prep[0].value = 5.5556000000e+02
rra[3].cdp_prep[0].unknown_datapoints = 119
rra[4].cf = "MAX"
rra[4].rows = 700
rra[4].pdp_per_row = 6
rra[4].xff = 5.0000000000e-01
rra[4].cdp_prep[0].value = NaN
rra[4].cdp_prep[0].unknown_datapoints = 4
rra[5].cf = "MAX"
rra[5].rows = 17280
rra[5].pdp_per_row = 1
rra[5].xff = 5.0000000000e-01
rra[5].cdp_prep[0].value = NaN
rra[5].cdp_prep[0].unknown_datapoints = 0
rra[6].cf = "MAX"
rra[6].rows = 775
rra[6].pdp_per_row = 24
rra[6].xff = 5.0000000000e-01
rra[6].cdp_prep[0].value = NaN
rra[6].cdp_prep[0].unknown_datapoints = 10
rra[7].cf = "MAX"
rra[7].rows = 797
rra[7].pdp_per_row = 288
rra[7].xff = 5.0000000000e-01
rra[7].cdp_prep[0].value = 1.2000000000e+01
rra[7].cdp_prep[0].unknown_datapoints = 119
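In the rrdtool info output above, step = 300 and minimal_heartbeat = 600. RRDtool treats an interval as unknown when two updates arrive more than minimal_heartbeat seconds apart, or when the update itself supplies an unknown value; either way the fetch output shows nan. Note that last_ds is "UNKN" here, which suggests the file is still being updated, just with unknown values. A simplified Python model of that rule (it ignores how RRDtool splits intervals into step-sized primary data points and the xff consolidation setting; the function name is hypothetical):

```python
# Value taken from the rrdtool info output above (step is 300).
MINIMAL_HEARTBEAT = 600


def unknown_intervals(update_times, values, heartbeat=MINIMAL_HEARTBEAT):
    """Return (start, end) intervals a simplified RRDtool model marks unknown.

    An interval between two consecutive updates is unknown if the updates are
    more than `heartbeat` seconds apart, or if the newer update supplied an
    unknown value (None here, "U" on the rrdtool update command line).
    """
    intervals = []
    for i in range(1, len(update_times)):
        prev, cur = update_times[i - 1], update_times[i]
        if cur - prev > heartbeat or values[i] is None:
            intervals.append((prev, cur))
    return intervals
```

In this model, a poller that keeps writing "U" produces gaps just like a poller that stops running, which is why a current file timestamp alone does not prove the data is good.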

rrdtool fetch shows:
1091607300: 1.0380000000e+01
1091607600: 1.1376666667e+01
1091607900: 1.2000000000e+01
1091608200: 1.2000000000e+01
1091608500: 1.2000000000e+01
1091608800: 1.2000000000e+01
1091609100: 1.2000000000e+01
1091609400: 1.2000000000e+01
1091609700: 1.2000000000e+01
1091610000: nan
1091610300: nan
1091610600: nan
1091610900: nan
1091611200: nan
1091611500: nan
1091611800: nan
1091612100: nan
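To locate where a gap starts without reading timestamps by hand, a short Python sketch can parse the rrdtool fetch lines and convert each gap's first timestamp to a date. It assumes the "timestamp: value" line format shown above; header lines without a colon are skipped, and the function names are hypothetical.

```python
import math
from datetime import datetime, timezone


def parse_fetch(text):
    """Parse 'timestamp: value' lines from rrdtool fetch output."""
    rows = []
    for line in text.splitlines():
        if ":" not in line:
            continue  # header with data source names, or a blank line
        stamp, value = line.split(":", 1)
        rows.append((int(stamp), float(value)))  # float() accepts "nan"
    return rows


def gap_starts(rows):
    """Return the timestamps at which a run of NaN values begins."""
    starts = []
    previous_was_nan = False
    for stamp, value in rows:
        is_nan = math.isnan(value)
        if is_nan and not previous_was_nan:
            starts.append(stamp)
        previous_was_nan = is_nan
    return starts


def describe_gaps(text):
    """Return UTC start times for every gap in the fetch output."""
    return [
        datetime.fromtimestamp(stamp, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
        for stamp in gap_starts(parse_fetch(text))
    ]
```

For the output above it reports one gap starting at 1091610000, which is 2004-08-04 09:00:00 UTC. Comparing that time with the poller log is a quick way to see what changed when the graphs stopped.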

This happened only on Cisco devices, and only on these specific graphs.
It must be something I did, but what?

--harel

Post by **raX** » Wed Aug 04, 2004 11:22 pm

After clearing your poller cache, can you still find references to the "broken" Cisco templates in there? The cause of your problem really depends on whether Cacti is actually trying to poll data for these graphs or not.

If Cacti is actually re-creating the .rrd files for these templates after you delete them, that is a good sign.

Do you see any strange output when you run cmd.php/cactid?

-Ian

oharel · Post by **oharel** » Thu Aug 05, 2004 4:04 am

Hi Rax,

raX wrote:

After clearing your poller cache, can you still find references to the "broken" Cisco templates in there? The cause of your problem really depends on whether Cacti is actually trying to poll data for these graphs or not.

After clearing the poller cache, I can see:
Data Source: SNMP - Alex - Gw - 5 Minute CPU
RRD: /var/www/htdocs/cacti-0.8.5a/rra/snmp_alex_gw_5min_cpu_42.rrd
Action: 0, OID: (Host: 172.16.0.1, Community: mycom)

raX wrote:

If Cacti is actually re-creating the .rrd files for these templates after you delete them, that is a good sign.

It seems the RRD file is being updated, because it has the same timestamp as the other RRD files.

I tried creating a new router device to graph only the CPU on that router. Same result...

As for the cmd.php output: yes, definitely. This was not there before:
Missing object name
USAGE: snmpget [OPTIONS] AGENT OID [OID]... Version: 5.1.1
Web: http://www.net-snmp.org/
Email: net-snmp-coders@lists.sourceforge.net

OPTIONS:
-h, --help display this help message
-H display configuration file directives understood
-v 1|2c|3 specifies SNMP version to use
-V, --version display package version number
SNMP Version 1 or 2c specific
-c COMMUNITY set the community string
SNMP Version 3 specific
-a PROTOCOL set authentication protocol (MD5|SHA)
-A PASSPHRASE set authentication protocol pass phrase
-e ENGINE-ID set security engine ID (e.g. 800000020109840301)
-E ENGINE-ID set context engine ID (e.g. 800000020109840301)
-l LEVEL set security level (noAuthNoPriv|a
etc.

This appears every so often, right before the normal output for a Cisco router or switch.

I tried snmpwalk and snmpget to check whether the CPU OID had somehow become corrupted, and I get the value just fine:
root@dublin:/var/www/htdocs/cacti# snmpwalk -v1 -c mycom 172.16.0.1:161 .1.3.6.1.4.1.9.2.1.58.0
SNMPv2-SMI::enterprises.9.2.1.58.0 = INTEGER: 4
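"Missing object name" followed by the usage text is what net-snmp's snmpget prints when it is called without an OID, and the OID field in the poller cache entry above appears to be empty. A minimal Python sketch of a guard that refuses to build an snmpget command when the OID is blank; build_snmpget_args and snmpget are hypothetical helpers, not part of Cacti.

```python
import shutil
import subprocess


def build_snmpget_args(host, community, oid, version="1", port=161):
    """Build an snmpget argument list, rejecting an empty OID up front.

    Calling snmpget without an OID makes net-snmp print "Missing object name"
    followed by its usage text, as in the cmd.php output above.
    """
    if not oid or not oid.strip():
        raise ValueError(f"empty OID for host {host}; check the data source and poller cache")
    return ["snmpget", "-v", version, "-c", community, f"{host}:{port}", oid.strip()]


def snmpget(host, community, oid):
    """Run snmpget (if it is installed) and return its standard output."""
    if shutil.which("snmpget") is None:
        raise RuntimeError("snmpget not found in PATH")
    args = build_snmpget_args(host, community, oid)
    return subprocess.run(args, capture_output=True, text=True, check=True).stdout
```

With a valid OID such as .1.3.6.1.4.1.9.2.1.58.0 the arguments match the manual query above; with an empty OID the guard fails loudly instead of producing a usage message in the poller output.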

It seems strange that all the memory and CPU graphs stopped at the same time.

Does this help?

--harel

melchandra · Post by **melchandra** » Thu Aug 05, 2004 8:28 am

remm wrote: I'm not using cactid; I'm using the crontab.
Vince22, I don't think we have the same problem, because in my case, when the graphs stop updating, they never start again. I have the same kind of graphs for some of my routers, but I think that is normal.

It almost looks like you have set the graph maximum to 1 or something. With autoscale set, the graph almost always (in my experience, anyway) leaves a bit of room above the highest peak.

Mika · Post by **Mika** » Thu Aug 05, 2004 2:13 pm

The same thing happened to me as to "oharel": all my traffic graphs stopped collecting data.
If I delete that data collection and set it up again, it collects correctly.
I've checked the poller cache: there are no records left for those stopped traffic measurements. Clearing the poller cache didn't help.
By the way, all these graphs stopped at the same time.

oharel · Post by **oharel** » Thu Aug 05, 2004 2:39 pm

Melchandra wrote:

It almost looks like you have set the graph maximum to 1 or something

Nope, it is set to autoscale.

Mika wrote:

The same thing happened to me as to "oharel": all my traffic graphs stopped collecting data.

Nope, just my Cisco CPU and memory graphs. On the same routers and switches, the interfaces are polled correctly.

--harel

Anacronik12 · Post by **Anacronik12** » Mon Aug 09, 2004 10:42 pm

Hello,

I sometimes get a hole in my graphs.

1092105300: 5.2142857143e-02
1092105600: 3.1600000000e-02
1092105900: 2.6900000000e-02
1092106200: nan
1092106500: nan
1092106800: nan
1092107100: nan
1092107400: nan
1092107700: nan
1092108000: nan
1092108300: nan
1092108600: nan
1092108900: nan
1092109200: nan

Suddenly, nothing gets recorded at all. But if I run php cmd.php manually, I don't see any problem.
I have a lot of graphs (1500).
I don't know how to debug this :-(

Tks

melchandra · Post by **melchandra** » Tue Aug 10, 2004 10:09 am

Hmm. 1500 graphs. I'd almost bet my Cacti installation that your problem is that cmd.php takes longer than 5 minutes to run. That causes serious problems, because the next run starts before the previous one has finished. Switch to cactid.
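This can be checked from run times: if cron starts cmd.php every 300 seconds and a run lasts longer than that, the next run starts while the previous one is still going. A minimal Python sketch, assuming you have recorded each run's start time and duration yourself (for example from a wrapper that logs timestamps); the input format and function names are hypothetical.

```python
def overlapping_runs(runs):
    """Return (start, next_start) pairs where a run was still active when the next began.

    `runs` is a list of (start_time, duration_seconds) pairs, one per poller run.
    """
    ordered = sorted(runs)
    overlaps = []
    for (start, duration), (next_start, _) in zip(ordered, ordered[1:]):
        if start + duration > next_start:
            overlaps.append((start, next_start))
    return overlaps


def max_concurrency(runs):
    """Return the largest number of poller runs active at the same moment."""
    events = []
    for start, duration in runs:
        events.append((start, 1))
        events.append((start + duration, -1))
    active = peak = 0
    # At equal times, an end (-1) sorts before a start (+1), so back-to-back runs don't count as overlapping.
    for _, delta in sorted(events):
        active += delta
        peak = max(peak, active)
    return peak
```

A peak above 1 means several copies of the poller were running at once, which is the situation described above.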

oharel · Post by **oharel** » Tue Aug 10, 2004 5:00 pm

That sounds right to me, somehow.
I searched the forum but didn't find anything like this; if there is such a thread, please point me to it.
Anyway:
One fine morning, cmd.php stopped updating the RRDs every 5 minutes. I had to schedule it to run every minute for it to do the job properly.
Here is the old crontab entry:

*/5 * * * * php /var/www/htdocs/cacti/cmd.php > /tmp/cacti_log 2>&1

And here is the new one:

* * * * * php /var/www/htdocs/cacti/cmd.php > /tmp/cacti_log 2>&1

It takes a lot more than 5 minutes to run, but that was already the case before it stopped working.
The output of cmd.php is the same whether it runs every 5 minutes or every minute...
Also, if I change the cmd.php schedule back to */5, the only way I can get it to run again is to reboot the machine.
I am running Cacti 0.8.5a on Slackware 9.
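Running cmd.php every minute while a single run takes more than 5 minutes means several copies can run at once. One common way to prevent that is a lock, so a new run exits immediately if the previous one is still active. A minimal Python sketch using fcntl.flock (Unix only); the lock file path and the helper names are hypothetical, and the cmd.php path is the one from the crontab above.

```python
import fcntl
import subprocess
import sys

LOCK_PATH = "/tmp/cacti_poller.lock"  # hypothetical lock file location
POLLER_CMD = ["php", "/var/www/htdocs/cacti/cmd.php"]  # path from the crontab above


def try_lock(path):
    """Open `path` and take a non-blocking exclusive flock; return the file or None."""
    lock_file = open(path, "w")
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock_file.close()  # another run holds the lock
        return None
    return lock_file  # keep this object open: closing it releases the lock


def run_poller(lock_path=LOCK_PATH, command=POLLER_CMD):
    """Run the poller unless a previous run still holds the lock; return its exit code."""
    lock_file = try_lock(lock_path)
    if lock_file is None:
        print("previous poller run still active; skipping", file=sys.stderr)
        return None
    with lock_file:  # the lock is released when the file is closed
        return subprocess.run(command).returncode
```

Call run_poller() from a small script started by cron. On systems that have the util-linux flock command, the same idea fits directly in the crontab line (flock -n LOCKFILE COMMAND). A lock only stops runs from piling up; if one run takes longer than 5 minutes, polling is still too slow for the interval.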

Any ideas, besides moving to cactid?

-harel
