After a reboot only half of CPU monitored?

knebb · Post by **knebb** » Thu Dec 04, 2008 1:57 pm

Yohoo!

I have a strange issue here.

I rebooted one of my Linux boxes (CentOS 5.2) and after the reboot it appears the box has only half of the CPUs.

As you can see the CPU values dropped down to 200 instead of 400 for the box.

But I still see all four CPUs in /proc/cpuinfo and in top. Apart from the reboot I haven't changed anything in Cacti; even the Cacti server wasn't rebooted.

I'm polling the .1.3.6.1.4.1.2021.11.54.0 MIB and I'm getting the following with snmpwalk:

Code:

[root@nas ~]# snmpwalk localhost  -v2c -c XXXX .1.3.6.1.4.1.2021.11
UCD-SNMP-MIB::ssIndex.0 = INTEGER: 1
UCD-SNMP-MIB::ssErrorName.0 = STRING: systemStats
UCD-SNMP-MIB::ssSwapIn.0 = INTEGER: 0
UCD-SNMP-MIB::ssSwapOut.0 = INTEGER: 0
UCD-SNMP-MIB::ssIOSent.0 = INTEGER: 1236
UCD-SNMP-MIB::ssIOReceive.0 = INTEGER: 606
UCD-SNMP-MIB::ssSysInterrupts.0 = INTEGER: 841
UCD-SNMP-MIB::ssSysContext.0 = INTEGER: 678
UCD-SNMP-MIB::ssCpuUser.0 = INTEGER: 0
UCD-SNMP-MIB::ssCpuSystem.0 = INTEGER: 1
UCD-SNMP-MIB::ssCpuIdle.0 = INTEGER: 36
UCD-SNMP-MIB::ssCpuRawUser.0 = Counter32: 8890
UCD-SNMP-MIB::ssCpuRawNice.0 = Counter32: 8771
UCD-SNMP-MIB::ssCpuRawSystem.0 = Counter32: 77037
UCD-SNMP-MIB::ssCpuRawIdle.0 = Counter32: 3361344
UCD-SNMP-MIB::ssCpuRawWait.0 = Counter32: 812719
UCD-SNMP-MIB::ssCpuRawKernel.0 = Counter32: 71227
UCD-SNMP-MIB::ssCpuRawInterrupt.0 = Counter32: 1379
UCD-SNMP-MIB::ssIORawSent.0 = Counter32: 41850680
UCD-SNMP-MIB::ssIORawReceived.0 = Counter32: 85315110
UCD-SNMP-MIB::ssRawInterrupts.0 = Counter32: 71959845
UCD-SNMP-MIB::ssRawContexts.0 = Counter32: 195189837
UCD-SNMP-MIB::ssCpuRawSoftIRQ.0 = Counter32: 4431
UCD-SNMP-MIB::ssRawSwapIn.0 = Counter32: 0
UCD-SNMP-MIB::ssRawSwapOut.0 = Counter32: 28

What confuses me are the high values in the result. But the other Linux boxes deliver similarly high values, and they show the correct value in Cacti...

So does anyone have an idea what's going wrong here?

Post by **gandalf** » Sun Dec 07, 2008 2:23 pm

Wow, that's nice. Never saw this before.
Please first use snmpwalk to walk against those OIDs for CPU metrics (find the correct OIDs from Settings -> View Poller Cache -> Filter for the host).
Please be aware that net-snmp (at least) has some trouble with CPU data. The latest net-snmp (5.4) changed the CPU measures.
Reinhard

knebb · Post by **knebb** » Mon Dec 08, 2008 5:11 pm

gandalf wrote:Wow, that's nice. Never saw this before.
Please first use snmpwalk to walk against those OIDs for CPU metrics (find the correct OIDs from Settings -> View Poller Cache -> Filter for the host).

Ehm, how would that differ from the snmpwalk above? Also, I'm still running an older version of Cacti (another topic, can't upgrade) where I don't see these log entries. What exactly are you looking for?

Please be aware that net-snmp (at least) has some trouble with CPU data. The latest net-snmp (5.4) changed the CPU measures.

What could this mean for my issue? I'm using net-snmp-5.3.1-24.el5_2.2 from CentOS 5.

gruad23 · Post by **gruad23** » Tue Dec 09, 2008 2:34 pm

I also made some modifications to my ucd/net CPU template to monitor IO stats and, like the screenshot above, my graph shows a total value of much more than 100%. So I guess the vertical label "Percent" is wrong in the original graph too.

But what am I actually looking at here?

In my understanding 100% should be the total CPU bandwidth, which is split into
user/nice/system/iowait/hard-irq/soft-irq/stolen/idle.

The original graph template and data sources don't have any percent calculation. They just read and show the following SNMP values (counters):

ssCpuRawUser.0
ssCpuRawNice.0
ssCpuRawSystem.0

I added the following as a stack to the graph:

ssCpuRawWait.0

and you can see what I got. How can I now get the "real percentage", which is naturally limited to a maximum of 100%? I looked at the multi-CPU templates as well, and there is no "percentage calculation" in there either.

I'd like to graph values that can be compared to the %-values I get when I run the top-command.

thnx
peter

PS: In my graph there is a lot of IO wait because I copied from a virtual disk to an NFS share, both on the same hard disk.

gruad23 · Post by **gruad23** » Tue Dec 09, 2008 3:19 pm

OK, I solved my question. It shows more than 100% because I've got a quad-CPU system and the raw values are simply added up, so it reaches 400%. And the graph template doesn't need a percent calculation, because the counters are time-based (ticks, typically 1/100 s) and the Cacti poller graphs them per second, so things turn out fine.
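For values that can be compared to what top shows, one option is to divide each per-second rate by the sum of all rates. The sketch below is only an illustration of that arithmetic: the function name `to_percent` and the sample rates are made up, and in Cacti the same idea would have to be expressed as a CDEF rather than Python.

```python
def to_percent(rates: dict[str, float]) -> dict[str, float]:
    """Scale per-second tick rates so that all CPU states together add up to 100."""
    total = sum(rates.values())
    if total == 0:
        # No ticks counted at all (e.g. identical samples): avoid dividing by zero.
        return {name: 0.0 for name in rates}
    return {name: 100.0 * value / total for name, value in rates.items()}


# Hypothetical per-second rates on a quad-CPU box; they add up to 400.
rates = {"user": 40.0, "nice": 0.0, "system": 20.0, "wait": 100.0, "idle": 240.0}
percent = to_percent(rates)
print(percent)
```

Dividing by the measured total instead of by a fixed CPU count keeps the result at 100 even when the agent reports fewer ticks than expected, which is convenient for graphing but would also hide the kind of problem discussed in this thread.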

To the OP: it seems you have a quad-CPU system as well, and after the reboot it looks like only a dual-CPU system. Maybe there are some wrong calculations in your template, or maybe something changed in your system when you rebooted. New kernel, new VM...?

knebb · Post by **knebb** » Wed Dec 10, 2008 6:11 am

gruad23 wrote: To the OP: it seems you have a quad-CPU system as well, and after the reboot it looks like only a dual-CPU system. Maybe there are some wrong calculations in your template, or maybe something changed in your system when you rebooted. New kernel, new VM...?

I don't know what happened here. The monitored machine above is a physical one and has been rebooted. As far as I can see, the kernel did not change (yum runs automatically).

My template shouldn't be wrong, because it monitors all other machines (even those with multiple CPUs) correctly. And when the issue appeared, Cacti had been neither touched nor rebooted.

So it's really confusing. I tried to figure out what's going wrong, but snmpwalk shows raw values, and I don't know how Cacti converts them internally.

gruad23 · Post by **gruad23** » Thu Dec 11, 2008 8:02 pm

knebb wrote:
I don't know what happened here. The monitored machine above is a physical one and has been rebooted. As far as I can see, the kernel did not change (yum runs automatically).
...
So it's really confusing. I tried to figure out what's going wrong, but snmpwalk shows raw values, and I don't know how Cacti converts them internally.

First I would check whether your system really still sees 4 CPUs by checking /proc/cpuinfo. Maybe two cores got disabled in the BIOS or whatever. (My Dell PowerEdge can do that.)
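A quick way to do that check from a script is to count the "processor" entries in /proc/cpuinfo. This is a minimal sketch assuming the usual Linux layout of that file; the helper name `count_processors` is made up for the example.

```python
import os


def count_processors(cpuinfo_text: str) -> int:
    """Count the 'processor : N' entries in the text of a Linux /proc/cpuinfo."""
    return sum(
        1
        for line in cpuinfo_text.splitlines()
        if line.split(":")[0].strip() == "processor"
    )


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as fh:
            print("processors in /proc/cpuinfo:", count_processors(fh.read()))
    except OSError:
        print("/proc/cpuinfo is not available on this system")
    # For comparison: the number of CPUs Python itself detects.
    print("os.cpu_count():", os.cpu_count())
```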

As for snmpwalk and Cacti: Cacti reads the counters every 5 minutes:

Code:

ssCpuRawUser.0 = Counter32: 132970
ssCpuRawNice.0 = Counter32: 0
ssCpuRawSystem.0 = Counter32: 990871
ssCpuRawIdle.0 = Counter32: 13040085
ssCpuRawWait.0 = Counter32: 45369
ssCpuRawKernel.0 = Counter32: 0
ssCpuRawInterrupt.0 = Counter32: 196
ssCpuRawSoftIRQ.0 = Counter32: 52796

The types are Counter32, so Cacti calculates the difference from the last value (5 minutes ago) and divides it by 300 (5 minutes = 300 seconds) to get the value per second, and displays that. No further calculations should be done.

The calculation would be wrong if you messed with settings in Cacti like the heartbeat value of your data source, but I doubt you did this ....
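The difference-and-divide step described above can be sketched as follows. This is only an illustration, not Cacti's or RRDtool's actual code: the function name `counter32_rate` is made up, the second sample value is hypothetical, and the modulo step assumes that a decrease means the 32-bit counter wrapped around exactly once (a counter reset after a reboot would look the same).

```python
COUNTER32_MODULUS = 2 ** 32  # a Counter32 wraps back to 0 after 2**32 - 1


def counter32_rate(previous: int, current: int, interval_seconds: float = 300.0) -> float:
    """Per-second rate between two Counter32 samples taken interval_seconds apart."""
    # The modulo turns a negative difference (wrap-around) into the real increase.
    delta = (current - previous) % COUNTER32_MODULUS
    return delta / interval_seconds


# ssCpuRawUser from the walk above, plus a hypothetical sample 5 minutes later.
print(counter32_rate(132970, 135970))  # 3000 ticks in 300 s -> 10.0
```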

knebb · Post by **knebb** » Mon Dec 15, 2008 3:28 pm

gruad23 wrote: First I would check if your system really still sees 4 cpus by checking /proc/cpuinfo.

This was the first thing I checked. And yes, it still reports four CPUs.

The calculation would be wrong if you messed with settings in Cacti like the heartbeat value of your data source, but I doubt you did this ....

You're right, I didn't change anything.

And all other SMP boxes still report the right values according to the # of CPUs.

Meanwhile I deleted the whole CPU graph and rebuilt it from scratch. Still the same result: only approx. 200 is shown. Interestingly enough, my CDEF function used to sum up to almost exactly 400. Now the total is always just above 200 (ca. 215).

What else can I do to get rid of this behavior?

Do you need some additional information?

knebb · Post by **knebb** » Mon Dec 15, 2008 5:15 pm

gandalf wrote:Wow, that's nice. Never saw this before.

Me neither.

Please be aware that net-snmp (at least) has some trouble with CPU data. The latest net-snmp (5.4) changed the CPU measures.

Funny that it changes from time to time. I checked my yum.log: the latest snmp update was installed approx. a month before the issue appeared. And there have been reboots since then, so why does it change suddenly?

Anyway: I checked some output from snmpwalk and it looks like you're on the right track. When I calculate all differences over the 5-minute period and divide them by 300 seconds, I get a total of ~214, which seems to match the graph value.

So it looks like Cacti is doing fine, but the net-snmp data is bogus.

As you recommended, I'll see if I can upgrade to 5.4 through rpmforge. Additional question: what can cause this issue?

knebb · Post by **knebb** » Tue Dec 16, 2008 3:49 am

gandalf wrote:Wow, that's nice. Never saw this before.
Please be aware that net-snmp (at least) has some trouble with CPU data. The latest net-snmp (5.4) changed the CPU measures.
Reinhard

I upgraded to net-snmp-5.4.x.svn200812050230-1.1.i386.rpm.

The same issue!

Still not the full calculation.

Any clues?

knebb · Post by **knebb** » Tue Dec 16, 2008 6:01 am

Talking to myself now, hoping someone can help me.

As already stated, I installed net-snmp-5.4.x but the issue remains.
Now I ran an snmpwalk every five minutes and checked the output.

So I calculated the difference between consecutive values of each counter and added them up. The sum was then divided by 300 (secs). The result always stays at approx. 210. Sometimes it may be 215 or so. But I'd expect the sum to be 400.
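The check described above can be written down as a short script. Treat it as a sketch under assumptions: the "before" values are taken from the first snmpwalk in this thread, the "after" values are invented purely so that the total lands near the observed 214, and the helper name `total_tick_rate` is made up. Only user, nice, system, idle and wait are summed, because in that first walk ssCpuRawSystem (77037) equals ssCpuRawKernel + ssCpuRawInterrupt + ssCpuRawSoftIRQ (71227 + 1379 + 4431), which suggests the system counter already contains those three on that agent. Other agents or net-snmp versions may behave differently.

```python
# CPU states that are summed; see the note above about possible double counting.
CPU_STATES = ("ssCpuRawUser", "ssCpuRawNice", "ssCpuRawSystem", "ssCpuRawIdle", "ssCpuRawWait")


def total_tick_rate(before, after, interval_seconds=300.0, states=CPU_STATES):
    """Sum the Counter32 increases of all states and convert to ticks per second."""
    total_delta = sum((after[name] - before[name]) % 2 ** 32 for name in states)
    return total_delta / interval_seconds


# "before": values from the first snmpwalk in this thread.
before = {"ssCpuRawUser": 8890, "ssCpuRawNice": 8771, "ssCpuRawSystem": 77037,
          "ssCpuRawIdle": 3361344, "ssCpuRawWait": 812719}
# "after": hypothetical values 5 minutes later, invented for this example.
after = {"ssCpuRawUser": 9190, "ssCpuRawNice": 8771, "ssCpuRawSystem": 77937,
         "ssCpuRawIdle": 3421344, "ssCpuRawWait": 815719}

rate = total_tick_rate(before, after)
print(rate)        # 214.0 ticks per second in total
print(rate / 100)  # 2.14 CPUs' worth of ticks, assuming 100 ticks per second per CPU
```

With 100 ticks per second per CPU, four busy-or-idle CPUs should account for about 400 ticks per second, so a total near 214 means the agent is reporting only a little more than two CPUs' worth of time.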

So again, it doesn't seem to be Cacti. Is there another place to ask about the net-snmp issue?

Post by **gandalf** » Sat Jan 03, 2009 6:53 am

knebb wrote: So again, it doesn't seem to be Cacti. Is there another place to ask about the net-snmp issue?

Sure: net-snmp-users mailing list
Reinhard

knebb · Post by **knebb** » Sat Jan 03, 2009 7:24 am

gandalf wrote:
knebb wrote: So again, it doesn't seem to be Cacti. Is there another place to ask about the net-snmp issue?
Sure: net-snmp-users mailing list
Reinhard

Got no answer there, either.

So still unresolved....

Post by **gandalf** » Sat Jan 03, 2009 7:32 am

That's bad. Those are THE experts.
Reinhard
