Determine threshold for CPU

02-24-2017

I'm writing an application that should display whether a system is running "fine" (normal activity) or has reached a critical level, and indicate this in a graphical interface using a green-yellow-red color scheme. The servers in question run AIX on POWER (it shouldn't differ much across the various UNIX systems, but the architecture is worth noting). The solution will be applied both to single servers with 100% CPU capacity and to clusters, which allow utilization of more than 100%.

I'm well aware that thresholds like these are most commonly determined through a lot of trial and error and testing, but I would like to decide which thresholds are most appropriate with some facts to back them up.

This leads me to the following question: how do I set these thresholds in a theoretical way? For example, if it should turn red and raise a critical warning at 90%, why 90%? Why not 85%?
There may also be spikes in CPU usage, so should it only indicate critical after 2 minutes of usage above 85%?
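
One way to picture the "2 minutes above 85%" idea is a check that only fires when every sample in a sliding window is over the limit. The following is a minimal sketch under assumptions: the class name `SustainedThreshold`, the 5-second sampling interval and the sample readings are all made up for illustration, and the 85% / 2 minute figures are just the ones from the question, not a recommendation.

```python
from collections import deque


class SustainedThreshold:
    """Report a breach only when every sample in a time window exceeds the limit."""

    def __init__(self, limit_pct, window_s, interval_s):
        # Number of consecutive samples needed to cover the whole window.
        self.needed = max(1, window_s // interval_s)
        self.limit_pct = limit_pct
        # Old samples fall off the left once the window is full.
        self.samples = deque(maxlen=self.needed)

    def update(self, cpu_pct):
        """Record one sample; return True if the breach has lasted the full window."""
        self.samples.append(cpu_pct)
        return (len(self.samples) == self.needed
                and all(s > self.limit_pct for s in self.samples))


# 85% for 2 minutes, sampled every 5 seconds -> 24 consecutive samples.
check = SustainedThreshold(limit_pct=85, window_s=120, interval_s=5)

# Hypothetical readings: a short spike, one dip, then sustained load.
readings = [97] * 10 + [40] + [90] * 24
alerts = [check.update(r) for r in readings]
```

With these readings the short spike never raises an alert, and the check only fires once the 24th consecutive high sample arrives. Whether "all samples above the limit" or "average above the limit" is the better rule is a design choice that depends on the workload.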

My main question is: are there any algorithms or past works that have done something similar? Any research papers or books that you know of? I've tried to research this a bit without much success; most of what I could find was related to the x86 architecture and not POWER. Even if the two architectures differ a bit, there are also many similarities, so some methods may work for both.

ttl_aix

02-24-2017

The truth is that only the admin responsible for the machine can tell you. It depends on knowing what runs on the machine and how, so the answer will necessarily not be the same on another machine. Furthermore, the contention will differ too: one machine may be a memory hog, another a CPU-intensive consumer because of laborious calculation algorithms.
A generalist threshold will only be valid on a generalist box.

vbe

02-24-2017

There are other considerations too, such as whether your CPU allocation is fixed or variable. It might sound odd, but an LPAR can (as a configuration choice) use more CPU than its allocation if capacity is available on the whole server because other LPARs are not fully using theirs (or indeed some is unallocated). You also need to know whether you have a share of processors or whole CPUs allocated. That can really skew the figures too.

You would need to better clarify what you have.

What output do you get from something like vmstat 5 3?

Robin

rbatte1

03-16-2017

vmstat measures over a given interval and then reports the average CPU usage for that interval.
That means your check must wait until the interval has finished.
For example:

Code:

vmstat 5 2

The second value line is the average over the 5-second interval.
(The first value line is the average since the system was booted, which is not very useful.)
"Normal" thresholds for usr%, system% and iowait% are 75, 55 and 30 for warning and 90, 70 and 40 for critical.
Another measurement is the load average (loadavg), which reflects the run queue length. The run queue gets longer when the system is too busy to run tasks as soon as they are scheduled.
The advantage of the loadavg is that the system provides the measurement interval; there are even 3 intervals: 1 minute, 5 minutes and 15 minutes.
The command-line tool for this is uptime.
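
For an application rather than a shell script, the same three figures that uptime prints can be read with Python's `os.getloadavg()`. The sketch below is only an illustration: the function names are made up, and the `warn=1.0` / `crit=2.0` load-per-CPU ratios are placeholder assumptions, not values from this thread. Which CPU count to divide by (logical CPUs, virtual processors, entitlement) is itself a judgment call on an LPAR.

```python
import os


def classify_load(load, ncpu, warn=1.0, crit=2.0):
    """Map a load average to green/yellow/red using load per CPU.

    warn and crit are placeholder ratios, not recommended values.
    """
    per_cpu = load / max(1, ncpu)
    if per_cpu >= crit:
        return "red"
    if per_cpu >= warn:
        return "yellow"
    return "green"


def current_status():
    """Classify the 5-minute load average of the local machine."""
    # os.getloadavg() returns the 1, 5 and 15 minute averages.
    _one, five, _fifteen = os.getloadavg()
    # os.cpu_count() may return None, so fall back to 1.
    return classify_load(five, os.cpu_count() or 1)
```

Using the 5-minute figure already smooths out short spikes, which is one reason to prefer it over the 1-minute figure for a red/yellow/green display.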
In the "infrastructure monitoring" sub-forum I have provided some Nagios plugin scripts that work on many platforms. Even if you do not have Nagios, you can see the commands in the code: check_load5.sh uses uptime and check_cpu_stats.sh uses vmstat.
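
To show how the vmstat figures and the warning/critical thresholds quoted above could feed a green-yellow-red display, here is a rough sketch. It is not taken from the Nagios scripts. The function names are hypothetical, and `SAMPLE` is invented text in the general shape of vmstat output (the numbers are made up); real output has extra lines and columns on some systems, so treat the parsing as an assumption to verify on your own machines.

```python
# Thresholds quoted in the post: (warning, critical) per CPU column.
THRESHOLDS = {"us": (75, 90), "sy": (55, 70), "wa": (30, 40)}


def parse_cpu_columns(vmstat_output):
    """Return us/sy/id/wa from the last data line of vmstat-style output."""
    lines = [ln.split() for ln in vmstat_output.strip().splitlines()]
    # The column header is the line that names both "us" and "id".
    header = next(fields for fields in lines if "us" in fields and "id" in fields)
    # "sy" also appears under "faults", so anchor on "us" and count from there.
    start = header.index("us")
    values = lines[-1][start:start + 4]
    return dict(zip(("us", "sy", "id", "wa"), map(float, values)))


def classify(cpu):
    """Return red if any column is critical, yellow if any is at warning level."""
    level = "green"
    for column, (warn, crit) in THRESHOLDS.items():
        if cpu[column] >= crit:
            return "red"
        if cpu[column] >= warn:
            level = "yellow"
    return level


# Invented sample: first data line is since boot, second is the interval average.
SAMPLE = """
kthr    memory              page              faults        cpu
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
 1  0 51234 10456   0   0   0   0    0   0  12  340  98  3  1 96  0
 2  0 51240 10450   0   0   0   0    0   0  25  910 210 80 12  6  2
"""

status = classify(parse_cpu_columns(SAMPLE))
```

Only the last line is read, which matches the advice above to ignore the first (since-boot) value line.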

MadeInGermany

03-17-2017

What is a machine? In OpenStack terms, is the machine the host or the virtual machine?

100% of what? With POWER virtualization, is that 100% of a processor or of the entitlement (which can get as high as 2000%, yes 2000, although 1000% is the more typical ridiculous number)?

Or are you looking at an lcpu (logical CPU) percentage? 25% lcpu could mean 100% of all the virtual processors, operating in single-threaded 'scheduling'.

The other thing to be aware of is that AIX stats are based on the PURR (processor utilization resource register), which is a processor (hardware) counter, not a time-based metric. A program like vmstat might say 95% user plus 5% system while the partition is using only 1% of the physical processor (i.e. the physical usage was 1%, and of that 1%, 95% was user time, the "user%").
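
The arithmetic behind those two points can be sketched as follows. The function names are hypothetical, and the figures are assumptions chosen to match the examples above: an entitlement of 0.05 processors with 1.0 physical processor consumed is one combination that works out to about 2000%.

```python
def entitlement_pct(physc, entitlement):
    """Physical processors consumed, as a percentage of the LPAR's entitlement."""
    return 100.0 * physc / entitlement


def physical_share(physc, mode_pct):
    """Physical processors attributable to one mode (user, system, ...)."""
    return physc * mode_pct / 100.0


# Assumed figures: a tiny entitlement with a whole processor consumed.
worst_case = entitlement_pct(physc=1.0, entitlement=0.05)   # about 2000

# The example above: 95% user, but only 1% of a physical processor in use.
user_physical = physical_share(physc=0.01, mode_pct=95)     # about 0.0095
```

So a "95% user" reading can correspond to less than one hundredth of a processor, which is why the percentage alone is a poor input for a threshold.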

So working from data alone can be very difficult. You will need advice from someone who knows the expected workload and the reasons for the "virtual" sizing decisions.

It is a great ambition, but the meaning of the variables is difficult to define. As in all things performance, there is a sauce called "it depends" that flavors the numbers you observe.

MichaelFelt
