CPU utilisation - user procs disappeared ?

stucky101 · Post by **stucky101** » Sun Dec 02, 2007 12:49 am

Guys

The below graph is from an oracle database on a PE 6850 that serves as a backend for siebel.
Look at the area between 8am and 10am. We had a major problem with some queries that were causing the clients to hang.
The problem is this : usually you get a high count of user procs since all oracle procs run under the user 'oracle'.
Assuming that "system procs" are considered the ones running under root I don't understand the graph.
When I logged on at around 8.30 and did a top I saw tons of oracle procs using up most of the cpu.

Q1 - Why do I see an increase in system procs instead of user procs ?
Q2 - Why are all the user procs completely gone in that time period (no blue graph at all)

How can there be no user procs at all within this time period ?

I saw this only once before, 2 years ago, on 2 nodes of an oracle RAC system. Same exact thing - after a reboot the blue came back.

I'm trying to interpret this for the dba's but I'm not sure I can.
Can anyone help me ?

DB runs Oracle 10g on RHEL4-U5 using net-snmp to poll. The PE 6850 has 4 sockets - each dual-cored. HT per core is turned off so linux sees 8 procs.

Thanks

--stucky

Post by **gandalf** » Sun Dec 02, 2007 4:40 am

Please zoom into the graph at exactly the timespan of that problem and re-post the graph
Reinhard

stucky101 · Post by **stucky101** » Sun Dec 02, 2007 6:33 pm

Gandalf

D'oh - I totally forgot about cacti's zooming capabilities - thx for reminding me.
Attached are 2 zooms
1. The weird one.
2. A normal one from a day later.
I'm not sure it reveals anything more, other than that something was very different on day one.
Could it be that if system procs were higher than user procs the red would overshadow the blue ?
Not that this could ever happen on a busy database with 300 plus oracle procs running but if it was the case I wonder how cacti would graph it.
I've been poking through other graphs and I can't find any that don't show at least a little blue except this one we're talking about.

stucky101 · Post by **stucky101** » Sun Dec 02, 2007 6:58 pm

Ok I looked at values and they don't make sense to me.

According to my last post the system procs were in fact higher than user procs.
1. I don't see how that could be when a top showed oracle totally hogging the cpu.
2. I attached another graph from when we had bad queries before and they look totally like I'd expect them to look.
What I don't understand here though is that all 3 values for system and user procs are nearly identical - yet we have red graphs shortly under 40% and blue graphs in the 80% range.
I'm probably not reading this right. Hope this info helps.

Post by **gandalf** » Mon Dec 03, 2007 7:46 am

Please visit that graph at Graph Management and switch to DEBUG mode. Please post the whole rrdtool graph statement
Reinhard

stucky101 · Post by **stucky101** » Mon Dec 03, 2007 1:57 pm

RRDTool Command:

/usr/bin/rrdtool graph - \
--imgformat=PNG \
--start=-86400 \
--end=-300 \
--title="SBLPRD 2 - CPU Usage" \
--rigid \
--base=1000 \
--height=120 \
--width=500 \
--alt-autoscale-max \
--lower-limit=0 \
--vertical-label="percent" \
--slope-mode \
DEF:a="/var/www/html/cacti-0.8.6j/rra/sblprd_2_cpu_system_875.rrd":cpu_system:AVERAGE \
DEF:b="/var/www/html/cacti-0.8.6j/rra/sblprd_2_cpu_user_876.rrd":cpu_user:AVERAGE \
DEF:c="/var/www/html/cacti-0.8.6j/rra/sblprd_2_cpu_nice_874.rrd":cpu_nice:AVERAGE \
CDEF:cdefbc=TIME,1196707863,GT,a,a,UN,0,a,IF,IF,TIME,1196707863,GT,b,b,UN,0,b,IF,IF,TIME,1196707863,GT,c,c,UN,0,c,IF,IF,+,+ \
AREA:a#FF0000:"System" \
GPRINT:a:LAST:"Current\:%8.2lf %s" \
GPRINT:a:AVERAGE:"Average\:%8.2lf %s" \
GPRINT:a:MAX:"Maximum\:%8.2lf %s\n" \
AREA:b#0000FF:"User":STACK \
GPRINT:b:LAST:" Current\:%8.2lf %s" \
GPRINT:b:AVERAGE:"Average\:%8.2lf %s" \
GPRINT:b:MAX:"Maximum\:%8.2lf %s\n" \
AREA:c#00FF00:"Nice":STACK \
GPRINT:c:LAST:" Current\:%8.2lf %s" \
GPRINT:c:AVERAGE:"Average\:%8.2lf %s" \
GPRINT:c:MAX:"Maximum\:%8.2lf %s\n" \
LINE1:cdefbc#000000:"Total" \
GPRINT:cdefbc:LAST:" Current\:%8.2lf %s" \
GPRINT:cdefbc:AVERAGE:"Average\:%8.2lf %s" \
GPRINT:cdefbc:MAX:"Maximum\:%8.2lf %s"

RRDTool Says:

OK

Post by **gandalf** » Mon Dec 03, 2007 3:07 pm

Ok, those are AREA/STACKs as should be. No error here. You may of course dump the rrd file's contents (e.g. using dataquery, else rrdtool fetch) to make sure the rrd file does not contain other numbers.
Aah, stop, wait, ...
Is this a multi-core CPU? In that case, the "user" value may have exceeded 100, which is the (wrong) default MAXIMUM of the proc data source. Raise the MAXIMUM to "number of cores * 100" and apply the same change to all existing rrd files of this type using "rrdtool tune".
Reinhard
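
A minimal sketch of the "rrdtool tune" step described above, assuming the data source names and rrd file paths from the graph statement earlier in the thread and an 8-CPU box. The helper names (ds_max, tune_max) and the DRY_RUN switch are made up for illustration; by default the script only prints the commands so they can be checked before anything is changed.

```shell
#!/usr/bin/env bash
# Sketch: raise the DS maximum on existing rrd files to "logical CPUs * 100".
# DRY_RUN=1 (the default here) only prints the commands instead of running them.

# Largest value a summed CPU data source can reach for a given CPU count.
ds_max() {
    echo $(( $1 * 100 ))
}

# Print (or run) one "rrdtool tune" call for an rrd file and data source name.
tune_max() {
    local rrd_file=$1 ds_name=$2 cpus=$3
    local cmd=(rrdtool tune "$rrd_file" --maximum "${ds_name}:$(ds_max "$cpus")")
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}

# The three files from the graph statement, assuming 8 logical CPUs.
RRA=/var/www/html/cacti-0.8.6j/rra
tune_max "$RRA/sblprd_2_cpu_system_875.rrd" cpu_system 8
tune_max "$RRA/sblprd_2_cpu_user_876.rrd"   cpu_user   8
tune_max "$RRA/sblprd_2_cpu_nice_874.rrd"   cpu_nice   8
```

Tuning only affects samples stored from then on; values already recorded as unknown stay unknown. The maximum in the Cacti data template presumably needs the same change so that newly created rrd files start with the right limit.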

fmangeant · Post by **fmangeant** » Tue Dec 04, 2007 10:53 am

gandalf wrote: Aah, stop, wait, ...
Is this a multi-core CPU? In that case, the "user" value may have exceeded 100, which is the (wrong) default MAXIMUM of the proc data source. Raise the MAXIMUM to "number of cores * 100" and apply the same change to all existing rrd files of this type using "rrdtool tune".

Hi

Reinhard is right: with Net-SNMP < 5.4 on Linux boxes, CPU usage goes from 0 to 100 x the number of processors.

You can use the template in my signature for 2, 4 and 8 CPU systems.
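
To make the effect concrete, here is a small sketch of how an rrd data source with a MAXIMUM treats an incoming sample: rrdtool regards anything outside the min/max range as unknown, so nothing is drawn for it. The function name stored_value and the sample numbers are illustrative assumptions, not values taken from the graphs in this thread, and the sketch uses whole numbers only.

```shell
#!/usr/bin/env bash
# Sketch: what ends up in the rrd for a sample, given the DS maximum.
# A sample above the maximum is stored as unknown (shown here as "U").
stored_value() {
    local sample=$1 ds_max=$2
    if [ "$sample" -gt "$ds_max" ]; then
        echo U
    else
        echo "$sample"
    fi
}

# Hypothetical: 8 CPUs, each about 85% busy in user mode, reported as a sum.
stored_value $(( 8 * 85 )) 100   # prints U: the user area vanishes
stored_value 35 100              # prints 35: a smaller system value still fits
stored_value $(( 8 * 85 )) 800   # prints 680 once the maximum is raised
```

That would fit the symptom: on a quiet day the summed user value stays under 100 and is graphed, while during a heavy load it crosses 100 and is dropped, leaving only the system area visible.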

stucky101 · Post by **stucky101** » Tue Dec 04, 2007 10:08 pm

Thanks everybody - I'm trying the 2/4/8 way templates right now.
The PE 6850 has 8 cores but HT can be turned on per core so linux sees
16 cpus. Your templates refer to cores only right ?
I know HT is not the same thing as a core but then again linux doesn't make a distinction. It sees it all as a cpu.
Thoughts ?

PS: Time to update the default templates for cacti maybe ? I mean who runs single core systems these days ?

fmangeant · Post by **fmangeant** » Wed Dec 05, 2007 1:37 am

To find out how many CPUs your Linux system sees, run this:

$ grep -c ^processor /proc/cpuinfo

stucky101 · Post by **stucky101** » Wed Dec 05, 2007 3:22 am

Well if you really go by what linux sees as a cpu then it's 16. I already knew that cause when you do a 'top' and press '1' it shows 16 cpu's there (your test confirms it too), but I thought you were referring to cores only and this box has 8. I always thought a core is real whereas HT is not considered quite the same as a core. Then again linux can't seem to distinguish.
This means I need another template then right ?
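
On the cores versus HT question, a rough sketch for comparing the two counts. It assumes the kernel exposes "physical id" and "core id" lines in /proc/cpuinfo, which not every kernel does; where they are missing the core count comes out as 0. The function names are made up for illustration. For the data source maximum, the logical count is the one that matters, since that is what the summed CPU values scale with.

```shell
#!/usr/bin/env bash
# Sketch: count logical CPUs and distinct physical cores from a cpuinfo-style file.

# Logical CPUs: one "processor" line each (HT siblings included).
count_logical() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# Physical cores: distinct (physical id, core id) pairs.
count_cores() {
    awk -F: '
        /^physical id/ { phys = $2 }
        /^core id/     { seen[phys "," $2] = 1 }
        END            { n = 0; for (k in seen) n++; print n }
    ' "${1:-/proc/cpuinfo}"
}

if [ -r /proc/cpuinfo ]; then
    echo "logical CPUs:   $(count_logical)"
    echo "physical cores: $(count_cores)"
fi
```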

fmangeant · Post by **fmangeant** » Wed Dec 05, 2007 4:06 am

stucky101 wrote:This means I need another template then right ?

Yes

I'll try to post a template tomorrow.

stucky101 · Post by **stucky101** » Thu Dec 06, 2007 11:05 pm

Guys what do you think about this thread ?

http://forums.cacti.net/viewtopic.php?t ... &start=120

Sure tempting to get each cpu graphed separately as well...

stucky101 · Post by **stucky101** » Fri Dec 07, 2007 8:55 pm

Nah, never mind. I tried it and it's a pain. Some of those don't import at all and the others require too much manual mangling.
Besides, they look messy on a 16-way box anyway.
I'd rather wait for your 16-way template.

thanks

--stucky

stucky101 · Post by **stucky101** » Tue Dec 11, 2007 1:03 pm

Guys

Any news on the 16-way template ?
Also I have found that the new templates don't graph quite the way the old ones did.
It used to adjust the max based on the average of the utilisation.
Since I applied the new graphs it always shows the 100% mark even if utilisation is very low.
Doesn't make for as good a graph in my opinion. Can I adjust that so it graphs like before except with the correct number of cores ?
