Understanding Linux Load Average – Part 1
Posted by Harald van Breederode on April 23, 2012
A frequently asked question in my classroom is “What is the meaning of load average and when is it too high?”. This may sound like an easy question, and I really thought it was, but recently I discovered that things aren’t always as easy as they seem. In this first post of a three-part series I will explain what the Linux load average means and how to diagnose load averages that may seem too high.
Obtaining the current load average is as simple as issuing the uptime command:

$ uptime
 21:49:05 up 11:33,  1 user,  load average: 10.52, 6.03, 3.78
But what is the meaning of these 3 numbers? Basically, the load average is the run-queue utilization averaged over the last minute, the last 5 minutes and the last 15 minutes. The run-queue is a list of processes waiting for a resource to become available inside the Linux operating system. The example above indicates that, averaged over the last minute, there were 10.52 processes on the run-queue waiting to be scheduled.
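These numbers can also be read straight from /proc/loadavg, which is where uptime gets them. A quick sketch (standard Linux; the fourth and fifth fields are a bonus that uptime doesn’t show):

```shell
# The first three fields of /proc/loadavg are the 1-, 5- and 15-minute
# load averages; the fourth is runnable/total scheduling entities and
# the fifth is the PID of the most recently created process.
read one five fifteen runq lastpid < /proc/loadavg
echo "1-min: $one  5-min: $five  15-min: $fifteen  runnable/total: $runq"
```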
The questions are of course: Which processes are on the run-queue? And what are they waiting for? Why not find the answer to these questions by performing a series of experiments?
CPU utilization and load average
To be able to perform the necessary experiments I wrote a few shell scripts to generate various types of load on my Linux box. The first experiment is to start one CPU load process, on an otherwise idle system, and watch its effect on the load average using the sar command:
$ load-gen cpu
Starting 1 CPU load process.
$ sar -q 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com)        04/21/2012

09:06:54 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
09:07:24 PM         1       290      0.39      0.09      0.15
09:07:54 PM         1       290      0.63      0.18      0.18
09:08:24 PM         1       290      0.77      0.26      0.20
09:08:54 PM         1       290      0.86      0.33      0.22
09:09:24 PM         1       290      0.97      0.40      0.25
09:09:54 PM         1       288      0.98      0.46      0.28
Average:            1       290      0.77      0.29      0.21
The sar output above reports the load average 6 times at an interval of 30 seconds. It shows that there was constantly 1 process on the run-queue, with the result that the 1-minute load average slowly climbs to a value of 1 and then stabilizes there. The 5-minute load average will continue to climb for a few more minutes and will also stabilize at a value of 1, and the same is true for the 15-minute load average, assuming the run-queue utilization remains the same.
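This climb-and-stabilize behaviour follows from how the kernel maintains these figures: roughly every 5 seconds it folds the current run-queue length into an exponentially damped average. (The kernel actually uses fixed-point constants internally; the floating-point factors below are a simplification.) A sketch of the 1-minute average tracking one constantly busy process:

```shell
# Simplified model of the 1-minute load average: every 5-second tick,
# load = load*e + n*(1-e), with e = exp(-5/60) and n the run-queue
# length (a constant 1 here, as in the experiment above).
awk 'BEGIN {
  e = exp(-5/60); load = 0; n = 1
  for (t = 5; t <= 300; t += 5) {          # simulate 5 minutes
    load = load * e + n * (1 - e)
    if (t % 60 == 0) printf "after %3ds: %.2f\n", t, load
  }
}'
# prints 0.63, 0.86, 0.95, 0.98, 0.99 -- the same curve sar showed above
```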
The next step is to take a look at the CPU utilization to check if there is a correlation between it and the load average. While measuring the load average, I also had sar report the CPU utilization.
$ sar -u 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com)        04/21/2012

09:06:54 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
09:07:24 PM     all     50.48      0.00      0.65      0.00      0.00     48.87
09:07:54 PM     all     50.40      0.00      0.48      0.02      0.00     49.10
09:08:24 PM     all     50.03      0.00      0.57      0.02      0.00     49.39
09:08:54 PM     all     49.97      0.00      0.52      0.00      0.00     49.52
09:09:24 PM     all     50.10      0.00      0.52      0.02      0.00     49.37
09:09:54 PM     all     50.23      0.00      0.55      0.02      0.00     49.21
Average:        all     50.20      0.00      0.55      0.01      0.00     49.24
This shows that overall the system was roughly spending 50% of its time running user processes and the other 50% was spent doing nothing. Thus only half of the machine’s capacity was used to run the CPU load which caused a load average of 1. Isn’t that strange? Not if you know that the machine is equipped with two processors. While one CPU was busy running the load the other CPU was idle resulting in an overall CPU utilization of 50%.
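Because the load average is not normalized for the number of processors, a quick sanity check is to divide it by the CPU count. A sketch (nproc comes with GNU coreutils; on older systems grep -c processor /proc/cpuinfo does the same):

```shell
# Compare the 1-minute load average with the number of online CPUs.
# A per-CPU value near 1 means the processors are kept fully busy --
# assuming, as in this part, that only CPU demand feeds the load average.
cpus=$(nproc)
load=$(cut -d' ' -f1 /proc/loadavg)
awk -v l="$load" -v c="$cpus" 'BEGIN { printf "load per CPU: %.2f\n", l / c }'
```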
Personally I prefer using sar to peek around in a busy Linux system, but other people tend to use top for the same thing. This is what top had to report about the situation we are studying with sar:
$ top -bi -d30 -n7
top - 21:09:55 up 10:54,  1 user,  load average: 0.98, 0.46, 0.28
Tasks: 188 total,   2 running, 186 sleeping,   0 stopped,   0 zombie
Cpu(s): 50.2%us,  0.5%sy,  0.0%ni, 49.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3074820k total,  2539340k used,   535480k free,   218600k buffers
Swap:  5144568k total,        0k used,  5144568k free,  1160120k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
27348 hbreeder  20   0 63836 1068  908 R 99.8  0.0   3:00.31 busy-cpu
27354 hbreeder  20   0 12756 1184  836 R  0.0  0.0   0:00.12 top
The -bi command line options tell top to run in batch mode, instead of full-screen mode, and to ignore idle processes. The -d30 and -n7 options instruct top to produce 7 sets of output with a delay of 30 seconds between them. The output above is the last of the 7 sets top produced.
Besides everything we already discovered by looking at the various sar outputs, top gives us useful information about the processes consuming CPU time, as well as about physical and virtual memory usage. It is interesting to see that the busy-cpu process consumes 99.8% CPU while the overall CPU utilization is slightly over 50%, leaving 49% idle time.
The explanation for this is that top reports an averaged CPU utilization in the header section of its output, while the per-process CPU utilization is not averaged over the total number of processors.
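In other words, top’s per-process %CPU is measured against a single processor, while the Cpu(s) header line is averaged over all of them (most top versions can toggle this so-called Irix mode interactively; check your top man page for the details). Dividing by the number of CPUs reconciles the two figures:

```shell
# 99.8% of one CPU, spread over the machine's 2 CPUs, is ~49.9% --
# matching the ~50% overall utilization reported by sar and top above.
awk 'BEGIN { printf "%.1f%% of one CPU = %.1f%% overall on 2 CPUs\n", 99.8, 99.8 / 2 }'
```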
We can verify this statement by using the -P ALL command line option to make sar report the CPU utilization on a per-processor basis as well as the averaged values.
$ sar -P ALL -u 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com)        04/21/2012

09:06:54 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
09:07:24 PM     all     50.48      0.00      0.65      0.00      0.00     48.87
09:07:24 PM       0      0.97      0.00      1.27      0.00      0.00     97.77
09:07:24 PM       1     99.97      0.00      0.03      0.00      0.00      0.00

09:07:24 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
09:07:54 PM     all     50.41      0.00      0.48      0.02      0.00     49.09
09:07:54 PM       0      0.83      0.00      0.97      0.00      0.00     98.20
09:07:54 PM       1    100.00      0.00      0.00      0.00      0.00      0.00

09:07:54 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
09:08:24 PM     all     50.03      0.00      0.57      0.02      0.00     49.38
09:08:24 PM       0     75.89      0.00      0.87      0.00      0.00     23.24
09:08:24 PM       1     24.17      0.00      0.27      0.03      0.00     75.53

09:08:24 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
09:08:54 PM     all     49.95      0.00      0.52      0.00      0.00     49.53
09:08:54 PM       0     81.03      0.00      0.77      0.00      0.00     18.21
09:08:54 PM       1     18.91      0.00      0.23      0.00      0.00     80.86

09:08:54 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
09:09:24 PM     all     50.11      0.00      0.52      0.02      0.00     49.36
09:09:24 PM       0     57.05      0.00      0.93      0.03      0.00     41.99
09:09:24 PM       1     43.12      0.00      0.17      0.03      0.00     56.68

09:09:24 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
09:09:54 PM     all     50.23      0.00      0.55      0.02      0.00     49.21
09:09:54 PM       0     19.94      0.00      0.97      0.00      0.00     79.09
09:09:54 PM       1     80.56      0.00      0.07      0.00      0.00     19.37

Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:        all     50.20      0.00      0.55      0.01      0.00     49.24
Average:          0     39.28      0.00      0.96      0.01      0.00     59.75
Average:          1     61.12      0.00      0.13      0.01      0.00     38.74
This output confirms that most of the time only one of the two available processors was busy resulting in an overall averaged CPU utilization of 50.2%.
The next experiment is to add a second CPU load process to the still running first CPU load process. This will increase the number of processes on the run-queue from 1 to 2. What effect will this have on the load average?
$ load-gen cpu
Starting 1 CPU load process.
$ sar -q 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com)        04/21/2012

09:09:55 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
09:10:25 PM         2       291      1.38      0.60      0.33
09:10:55 PM         2       291      1.62      0.74      0.38
09:11:25 PM         2       291      1.77      0.86      0.43
09:11:55 PM         2       291      1.86      0.96      0.48
09:12:25 PM         2       291      1.91      1.06      0.53
09:12:55 PM         2       291      1.95      1.15      0.57
Average:            2       291      1.75      0.90      0.45
The output above shows that the number of processes on the run-queue is now indeed 2, and that the load average is climbing towards a value of 2 as a result. Because there are now 2 processes hogging the CPU, we can expect the overall averaged CPU utilization to be close to 100%. The top output below confirms this:
$ top -bi -d30 -n7
top - 21:12:55 up 10:57,  1 user,  load average: 1.95, 1.15, 0.57
Tasks: 189 total,   3 running, 186 sleeping,   0 stopped,   0 zombie
Cpu(s): 99.3%us,  0.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3074820k total,  2540968k used,   533852k free,   218756k buffers
Swap:  5144568k total,        0k used,  5144568k free,  1160212k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
27377 hbreeder  20   0 63836 1064  908 R 99.4  0.0   2:59.45 busy-cpu
27348 hbreeder  20   0 63836 1068  908 R 98.8  0.0   5:59.18 busy-cpu
27383 hbreeder  20   0 12756 1188  836 R  0.1  0.0   0:00.13 top
Please note that top reports 2 processes each using nearly 100% CPU time. Using sar we can verify that both processors are now indeed fully utilized.
$ sar -P ALL -u 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com)        04/21/2012

09:09:55 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
09:10:25 PM     all     99.22      0.00      0.78      0.00      0.00      0.00
09:10:25 PM       0     98.60      0.00      1.40      0.00      0.00      0.00
09:10:25 PM       1     99.83      0.00      0.17      0.00      0.00      0.00

09:10:25 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
09:10:55 PM     all     99.32      0.00      0.68      0.00      0.00      0.00
09:10:55 PM       0     98.70      0.00      1.30      0.00      0.00      0.00
09:10:55 PM       1     99.90      0.00      0.10      0.00      0.00      0.00

09:10:55 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
09:11:25 PM     all     99.28      0.00      0.72      0.00      0.00      0.00
09:11:25 PM       0     98.70      0.00      1.30      0.00      0.00      0.00
09:11:25 PM       1     99.90      0.00      0.10      0.00      0.00      0.00

09:11:25 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
09:11:55 PM     all     99.27      0.00      0.73      0.00      0.00      0.00
09:11:55 PM       0     98.67      0.00      1.33      0.00      0.00      0.00
09:11:55 PM       1     99.87      0.00      0.13      0.00      0.00      0.00

09:11:55 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
09:12:25 PM     all     99.25      0.00      0.75      0.00      0.00      0.00
09:12:25 PM       0     98.60      0.00      1.40      0.00      0.00      0.00
09:12:25 PM       1     99.90      0.00      0.10      0.00      0.00      0.00

09:12:25 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
09:12:55 PM     all     99.32      0.00      0.68      0.00      0.00      0.00
09:12:55 PM       0     98.77      0.00      1.23      0.00      0.00      0.00
09:12:55 PM       1     99.90      0.00      0.10      0.00      0.00      0.00

Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:        all     99.27      0.00      0.73      0.00      0.00      0.00
Average:          0     98.67      0.00      1.33      0.00      0.00      0.00
Average:          1     99.88      0.00      0.12      0.00      0.00      0.00
The final experiment is to add 3 additional CPU load processes to check if we can force the load average to go up any further now that we are already consuming all available CPU resources on the system.
$ load-gen cpu 3
Starting 3 CPU load processes.
$ top -bi -d30 -n7
top - 21:21:59 up 11:06,  1 user,  load average: 4.91, 3.47, 2.41
Tasks: 193 total,   6 running, 186 sleeping,   0 stopped,   0 zombie
Cpu(s): 99.0%us,  0.7%sy,  0.3%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3074820k total,  2570552k used,   504268k free,   219180k buffers
Swap:  5144568k total,        0k used,  5144568k free,  1160512k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
27408 hbreeder  20   0 63836 1068  908 R 39.9  0.0   4:09.41 busy-cpu
27377 hbreeder  20   0 63836 1064  908 R 39.8  0.0   8:42.65 busy-cpu
27348 hbreeder  20   0 63836 1068  908 R 39.6  0.0  10:09.95 busy-cpu
27477 hbreeder  20   0 63836 1064  908 R 39.4  0.0   1:11.19 busy-cpu
27436 hbreeder  20   0 63836 1064  908 R 38.9  0.0   2:39.25 busy-cpu
27483 hbreeder  20   0 12756 1192  836 R  0.1  0.0   0:00.13 top
We managed to drive the load average up to 5 ;-) Because there are only 2 processors available in the system and there are 5 processes fighting for CPU time, each process gets only 40% of the available 200% of CPU time.
Conclusion
Based on all these experiments we can conclude that CPU utilization clearly influences the load average of a Linux system. If the load average is above the total number of processors in the system, we could conclude that the system is overloaded, but this assumes that nothing else influences the load average. Is CPU utilization indeed the only factor that drives the Linux load average? Stay tuned for part two!
-Harald
Raheel Syed said
Reblogged this on Raheel's Blog.
doktersil said
Clear article Harald! I like it very much. Enough information for a DBA without complicating things! Cheers, Simone Pedroso
djeday84 said
can u provide scripts to reproduce load ?
Harald van Breederode said
Hi,
The CPU load script contains the following 4 lines of code:
while :
do
:
done
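For completeness, a self-contained wrapper in the spirit of load-gen might look as follows. This is only a sketch: the function name, message and argument handling are guesses, since the actual script was never posted.

```shell
# Hypothetical load-gen-style helper (bash): start N busy loops in the
# background; defaults to 1. Stop them later with e.g. kill %1 %2 ...
load_gen() {
  local count=${1:-1} i
  echo "Starting $count CPU load process(es)."
  for ((i = 0; i < count; i++)); do
    while :; do :; done &    # the 4-line busy loop above, backgrounded
  done
}
```

Calling load_gen 2 would then mimic two runs of load-gen cpu.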
kevinclosson said
Hi Harald,
Good topic. One little nit, though. That can’t be your script because ; (semicolon) is not a shell builtin…or do you have a script or a.out in your path called “;” :-)
Harald van Breederode said
Hi Kevin,
You are absolutely right! I made a typo while entering the comment. The ; should be a :. I’ve corrected it right away.
-Harald
Narendra said
Harald,
Excellent. Clear yet simple for somebody like me who is new to Linux to understand.
Can’t wait for subsequent parts.
Thanks in advance
Log Buffer #269, A Carnival of the Vanities for DBAs | The Pythian Blog said
[…] is the meaning of load average and when is it too high? Harald van Breederode has the […]
Amir Hameed said
It seems that the CPU run-queue is reported differently on Linux than on Solaris. If I am interpreting it correctly, on Linux the run-queue shows the actual number of running processes. On Solaris, the run-queue shows the number of processes that are not running yet but are waiting to be put on a CPU. I ran the same test, as shown above, on my Solaris server and the run-queue starts to show a value greater than zero only when the number of load processes exceeds the number of CPUs on the server.
Harald van Breederode said
Hi Amir,
Yes, that is correct. On Linux the run-queue shows the number of running (and waiting) processes. I haven’t verified your statement about Solaris but I believe it is indeed true. I haven’t looked at Solaris for a very long time ;-) But if my memory serves me correctly, interpreting the load average on Solaris is quite different.
-Harald
dincer salih kurnaz said
Hi
I use atop and iostat to measure the load.
Dinçer
Tom Bouwman said
Here is another good article on load of Linux systems:
http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages
Harald van Breederode said
Hi Tom,
Thanx for the pointer to another load average article. However, that article states that the load average is only affected by CPU utilization, which is clearly not the case. This is a common misunderstanding, hence my postings on this subject.
-Harald
Fabrice Bacchella said
I don’t understand why people keep using such a broken indicator. It adds values that count in units (the number of processes waiting for cores) to values that can count in tens or hundreds (the number of processes waiting for I/O). So on the same machine a value of 10 can be a non-issue (10 I/Os waiting for disks) or a big load (10 processes waiting for CPU).
It was designed for computers with one core and one IDE disk, both components being mono-tasked. Those times are long gone.
Harald van Breederode said
Hi Fabrice,
Thanx for sharing your opinion. Besides telling me what I shouldn’t use, you may as well tell me what to use instead.
-Harald
Fabrice Bacchella said
vmstat/iostat are useful tools. There are a lot of values in /proc too. There are just two of them that one should never use, because they were never updated for anything beyond IDE and mono-core computers:
- load average.
- svctm in iostat -x.
But they are wrong only on Linux. BSD/Solaris get them right.
Veritas Volume Manager (VxVM) command line examples | IT World said
[…] Understanding Linux Load Average – Part 1 […]
dong ma said
Hi, I come from China. Reading this series of blog posts has made things very clear in my mind. But why don’t you upload your shell scripts? I hope you see this.
Unix Load Average « Oracle Mine…. said
[…] Unix Load Average […]
Understanding Linux CPU Load 资料汇总 | 系统技术非业余研究 said
[…] Understanding Linux Load Average 谢谢 @jametong 参考:part1 part2 […]
kumarkarthiknk said
As per your blog, no matter how many CPUs you have the load is the same? I.e. adding 1 CPU load process will increase the load by one, whether on a dual-core server or a 64-core server?
1 cpu 1 process = load 1 ,
2 cpu 2 processes = load 2,
3 cpu 6 process = load 3
But,
This blog says it the other way http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages =>
1 cpu 1 process = load 1 ,
2 cpu 2 processes = load 1,
3 cpu 6 process = load 2
Am I mistaken? Please advise.
Harald van Breederode said
Hi,
I think you misinterpret both articles. I see no difference between the two postings.
The load average will increase with each running process, i.e. 1 proc = load 1; 2 procs = load 2; 6 procs = load 6, but in order to determine if your CPUs are fully utilized you need to divide the load by the number of CPUs.
-Harald
Ali MEZGANI said
Reblogged this on Mezgani blog – A Linux System Engineer Blog and commented:
Very nice article on how to diagnose load average issues
System Load Average and cpu cores. | urahero said
[…] http://www.linuxjournal.com/article/9001 https://prutser.wordpress.com/2012/04/23/understanding-linux-load-average-part-1 […]
System Load Average and cpu cores,Run Queue Part 2 | urahero said
[…] http://www.linuxjournal.com/article/9001 https://prutser.wordpress.com/2012/04/23/understanding-linux-load-average-part-1 […]
Unix/Linux:How can average run queue length be 1 but load average is almost zero – Unix Questions said
[…] Hi, thanks for responding! Its true that the sar man page doesn't say the run queue length is average. However, what makes me doubt that it see's itself is why then when the load script runs does it not always see 2 at least. Also, what i see is contrary to the experiments i have seen where others are doing a similar thing e.g. prutser.wordpress.com/2012/04/23/… […]
What is Processor Use and Load Averages In CentOS Servers said
[…] Understanding Linux Load Averages […]
Alona said
Hi
Does somebody know how I can grep from sar -q only the 15-minute average and create output in this format: “average for 15mins: 0.01”?
thanks in advance
Harald van Breederode said
Hi Alona,
SAR will collect its data every 10 minutes by default, but this can easily be changed to every 15 minutes in /etc/cron.d/sysstat. Hope this helps.
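As for the output format itself, a small awk filter over the sar -q output gets there. Sketched below on a sample Average line rather than live sar data:

```shell
# ldavg-15 is the last field of sar -q's Average line. In real use,
# replace the sample printf with:  sar -q | awk '/^Average:/ { ... }'
printf 'Average: 2 291 1.75 0.90 0.45\n' |
awk '/^Average:/ { printf "average for 15mins: %s\n", $NF }'
# prints: average for 15mins: 0.45
```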
-Harald
Chakravarthi Thangavelu said
Very nice article. It is really helpful for understanding the concepts. Thanks Harald.