I'm writing an application that should display whether a system is running “fine” (normal activity) or has reached a critical level, and indicate this through a graphical interface using a green-yellow-red color scheme. The server machines in question are running AIX (it shouldn't differ much across UNIX systems, though it is important to note that they use POWER). The solution will be applied both to single server machines with 100% (CPU) capacity and to clusters, which allow utilization of more than 100%.
I'm well aware that thresholds like these are most commonly determined through a lot of trial and error and testing, but I would like to decide on the most appropriate thresholds with some facts to back them up.
This leads me to the following question: how do I set these thresholds in a theoretical way? By thresholds I mean, for example, “if it should turn red and alert with a critical warning at 90%, then why?” and “Why not 85%?”.
There are also possible spikes in the CPU usage, so should it only indicate critical after 2 minutes of usage above 85%?
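One way to picture the "only after 2 minutes above 85%" idea is a sliding window over recent samples. The following is a minimal sketch, not a recommendation of those numbers: the class name `SustainedAlarm`, the 5-second sampling interval, and the rule "every sample in the window must exceed the threshold" are all assumptions made for illustration.

```python
from collections import deque


class SustainedAlarm:
    """Signal critical only when usage stays above a threshold for a whole window.

    Hypothetical helper: the threshold, window length and sampling interval
    are illustrative defaults taken from the question, not tuned values.
    """

    def __init__(self, threshold=85.0, window_seconds=120, sample_interval=5):
        self.threshold = threshold
        # Number of samples that together cover the window (24 for 120s / 5s).
        self.samples = deque(maxlen=window_seconds // sample_interval)

    def update(self, cpu_percent):
        """Record one sample; return True if the whole window is above threshold."""
        self.samples.append(cpu_percent)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and min(self.samples) > self.threshold
```

A single low sample resets the condition, so short spikes never turn the indicator red; a less strict variant could compare the window average against the threshold instead.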
My main question is: are there any algorithms or past works that have done something similar? Any research papers or books that you know of? I've tried to research this a bit without much success; most of what I could find was related to the x86 architecture and not POWER. Even if the two architectures differ a bit, there are also many similarities, so some methods may work with both.
The truth is that only the admin responsible for the machine could tell you... It depends on knowing what runs on a machine and how, so it will necessarily not be the same on another... Furthermore, the contention will differ too, one box being a memory hog, the other a CPU-intensive consumer because of laborious calculation algorithms...
A generalist threshold will only be valid on a generalist box...
There are other considerations too, such as whether your CPU allocation is fixed or variable. It might sound odd, but an LPAR can (as a configuration choice) use more CPU if it is available on the whole server and other LPARs are not fully using theirs (or indeed there is some unallocated). You also need to know whether you have a share of processors or whole CPUs allocated. That can really skew the figures too.
You would need to better clarify what you have.
What output do you get from something like vmstat 5 3?
vmstat measures a certain interval, then you get the average CPU usage from that interval.
That means your check must wait until the interval is finished.
For example
The second value line is the average from the 5 seconds interval.
(The first value line is the average since the system was booted - not very useful.)
"Normal" thresholds for usr%, system%, and iowait% are 75, 55, 30 for warning and 90, 70, 40 for critical.
Another measurement is the loadavg, which reflects the run queue length. The run queue gets longer when more tasks are ready to run than the CPUs can serve on schedule.
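To illustrate how those warning and critical values could map onto a green-yellow-red indicator, here is a rough sketch that reads the last value line of captured vmstat output. The function names and the `SAMPLE` text are made up for illustration (the sample is not real measurement data), and the parser assumes the header line names the CPU columns `us`, `sy` and `wa`; the exact column layout differs between platforms, so check it against your own output.

```python
# Thresholds quoted above: usr%, system%, iowait%.
WARN = {"us": 75, "sy": 55, "wa": 30}
CRIT = {"us": 90, "sy": 70, "wa": 40}

# Made-up, Linux-style sample purely for illustration (not real data).
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 100000  2000  30000    0    0     5     3  100  200 10  5 84  1  0
 0  0      0 100000  2000  30000    0    0     0     0  120  250 80 10  5  5  0
"""


def parse_last_sample(vmstat_text):
    """Return us/sy/wa from the last value line of vmstat output."""
    lines = [ln.split() for ln in vmstat_text.strip().splitlines() if ln.strip()]
    header = next(tokens for tokens in lines if "us" in tokens and "wa" in tokens)
    values = lines[-1]
    # Some layouts also have an "sy" (syscalls) column before the CPU block,
    # so search for each name starting at the "us" column.
    start = header.index("us")
    return {name: float(values[header.index(name, start)]) for name in ("us", "sy", "wa")}


def classify(sample):
    """Map one sample to 'green', 'yellow' or 'red' using the thresholds above."""
    if any(sample[key] >= CRIT[key] for key in CRIT):
        return "red"
    if any(sample[key] >= WARN[key] for key in WARN):
        return "yellow"
    return "green"
```

With the illustrative sample, `classify(parse_last_sample(SAMPLE))` yields yellow, because the last line shows 80% user time, which is above the warning value but below the critical one.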
The advantage of the loadavg is that the system provides the measurement interval; there are even 3 intervals: 1 minute, 5 minutes, 15 minutes.
The command line tool for this is uptime.
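As a sketch of how an application might consume this, the snippet below extracts the three load averages from a line of uptime output and scales one of them by the number of CPUs. The regular expression assumes the common "load average: a, b, c" wording, and dividing by the CPU count is only one possible normalization; both are assumptions, not something the plugins mentioned in this thread are known to do.

```python
import re


def parse_load_averages(uptime_text):
    """Extract the 1, 5 and 15 minute load averages from uptime output."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_text
    )
    if match is None:
        raise ValueError("no load averages found in uptime output")
    return tuple(float(group) for group in match.groups())


def load_per_cpu(load, cpu_count):
    """Scale a load average by CPU count so boxes of different size compare."""
    return load / cpu_count
```

For example, a 1-minute load of 1.50 on a 4-CPU system gives 0.375 per CPU. Which CPU count to use (physical, virtual, or logical) is itself a decision on a virtualized POWER system.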
In the "infrastructure monitoring" sub-forum I have provided some Nagios-plugin-scripts that work on many platforms. Even if you do not have Nagios, you can see the commands in the code. Actually the check_load5.sh uses uptime and the check_cpu_stats.sh uses vmstat.
What is a machine? In "Openstack" terms - is the machine the host, or the virtual machine?
100% of what? On POWER virtualization - 100% of a processor, or of entitlement (which can get as high as 2000% - yes 2000! although 1000 is the more typical ridiculous number.)
Or are you looking at an lcpu percentage: 25% lcpu could mean 100% of all the virtual processors, operating in single-threaded 'scheduling'.
The other thing to be aware of is that AIX stats are based on the PURR (processor utilization resource register), which is a processor (hardware) counter, not a time-based metric. A program like vmstat might say 95% user plus 5% system while the partition is using only 1% of the physical processor (i.e. the physical usage was 1%, and 95% of that 1% was "user%").
So, working from data alone can be very difficult. You will need advice from someone who knows the expected workload and the reasons for the "virtual" sizing decisions.
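The arithmetic behind those two points can be sketched as follows. The function names are hypothetical, and the figures of 0.1 entitled processors and 2.0 consumed processors are invented purely to show how a number like 2000% can arise; they are not taken from any real configuration.

```python
def absolute_share(relative_pct, physical_pct):
    """Convert a relative figure (e.g. 95% user) into a share of the physical
    processor, given how much of that processor was actually consumed."""
    return relative_pct * physical_pct / 100.0


def percent_of_entitlement(consumed_processors, entitled_processors):
    """Express consumed physical processors as a percentage of entitlement."""
    return consumed_processors / entitled_processors * 100.0


# 95% user while only 1% of the physical processor is in use:
user_share = absolute_share(95.0, 1.0)            # 0.95% of the processor

# Illustrative only: 0.1 processors entitled, 2.0 processors consumed.
entitlement_pct = percent_of_entitlement(2.0, 0.1)  # roughly 2000%
```

A threshold such as "red at 90%" therefore means very different things depending on which of these percentages the application is actually reading.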
Great ambition - difficult to define the meaning of the variables - as in all things performance - there is a sauce called "it depends" that flavors the numbers you see/observe.