The Dutch Prutser's Blog

By: Harald van Breederode

  • Disclaimer

    The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Archive for the ‘Linux’ Category

Understanding Linux Load Average – Part 3

Posted by Harald van Breederode on May 28, 2012

In part 1 we performed a series of experiments to explore the relation between CPU utilization and Linux load average. We concluded that the load average is influenced by processes running on or waiting for the CPU. Based on experiments in part 2 we came to the conclusion that processes that are performing disk I/O also influence the load average on a Linux system. In this posting we will do another experiment to find out if the Linux load average is also affected by processes performing network I/O.

Network I/O and load average

To check whether a correlation exists between processes performing network I/O and the load average, we will start 10 processes generating network I/O on an otherwise idle system and collect various performance-related statistics using the sar command. Note: my load-gen script uses the ping command to generate network I/O.
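The load-gen script itself isn't published on this blog, but a stripped-down version of its network mode could look something like the sketch below (not the exact script I use). A flood ping requires root privileges, which would also explain why the ping processes show up under the root user in the top output later on.

#!/bin/bash
# Hypothetical sketch of "load-gen network <count>" (not the actual script):
# fork <count> ping processes flooding localhost to generate loopback traffic.
count=${2:-1}
echo "Starting ${count} network load processes."
for i in $(seq 1 "${count}"); do
    ping -f localhost > /dev/null 2>&1 &   # -f (flood) normally requires root
done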

$ load-gen network 10
Starting 10 network load processes.
$ sar -n DEV 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com) 	04/21/2012

09:38:01 PM     IFACE   rxpck/s   txpck/s   rxbyt/s   txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
09:38:31 PM        lo  88953.60  88953.60 135920963.87 135920963.87      0.00      0.00      0.00
09:38:31 PM      eth1      0.13      0.17     11.33     62.33      0.00      0.00      0.00
09:38:31 PM      eth0      0.00      0.00      0.00      0.00      0.00      0.00      0.00

09:38:31 PM     IFACE   rxpck/s   txpck/s   rxbyt/s   txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
09:39:01 PM        lo  89295.13  89295.13 136442626.93 136442626.93      0.00      0.00      0.00
09:39:01 PM      eth1      0.03      0.03      2.60     48.33      0.00      0.00      0.00
09:39:01 PM      eth0      0.00      0.00      0.00      0.00      0.00      0.00      0.00

09:39:01 PM     IFACE   rxpck/s   txpck/s   rxbyt/s   txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
09:39:31 PM        lo  89364.38  89364.38 136548566.91 136548566.91      0.00      0.00      0.00
09:39:31 PM      eth1      0.10      0.10      7.34     47.30      0.00      0.00      0.03
09:39:31 PM      eth0      0.00      0.00      0.00      0.00      0.00      0.00      0.00

09:39:31 PM     IFACE   rxpck/s   txpck/s   rxbyt/s   txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
09:40:01 PM        lo  89410.80  89410.80 136619365.60 136619365.60      0.00      0.00      0.00
09:40:01 PM      eth1      0.03      0.03      2.60     48.33      0.00      0.00      0.00
09:40:01 PM      eth0      0.00      0.00      0.00      0.00      0.00      0.00      0.00

09:40:01 PM     IFACE   rxpck/s   txpck/s   rxbyt/s   txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
09:40:31 PM        lo  89502.30  89502.30 136759314.53 136759314.53      0.00      0.00      0.00
09:40:31 PM      eth1      0.23      0.27     20.60     59.33      0.00      0.00      0.00
09:40:31 PM      eth0      0.00      0.00      0.00      0.00      0.00      0.00      0.00

09:40:31 PM     IFACE   rxpck/s   txpck/s   rxbyt/s   txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
09:41:01 PM        lo  89551.52  89551.52 136834718.24 136834718.24      0.00      0.00      0.00
09:41:01 PM      eth1      0.03      0.03      2.60     48.35      0.00      0.00      0.00
09:41:01 PM      eth0      0.00      0.00      0.00      0.00      0.00      0.00      0.00

Average:        IFACE   rxpck/s   txpck/s   rxbyt/s   txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
Average:           lo  89346.27  89346.27 136520905.51 136520905.51      0.00      0.00      0.00
Average:         eth1      0.09      0.11      7.85     52.33      0.00      0.00      0.01
Average:         eth0      0.00      0.00      0.00      0.00      0.00      0.00      0.00 

The above output shows that the lo interface sent and received almost 90 thousand packets per second, amounting to roughly 136 million bytes of traffic per second. The other two interfaces had virtually no traffic at all, because my network load processes are pinging localhost. Let’s have a look at the CPU utilization before we examine the run-queue utilization and load average.

$ sar -P ALL -u 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com) 	04/21/2012

09:38:01 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:38:31 PM       all     13.90      0.00     86.10      0.00      0.00      0.00
09:38:31 PM         0     13.60      0.00     86.40      0.00      0.00      0.00
09:38:31 PM         1     14.17      0.00     85.83      0.00      0.00      0.00

09:38:31 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:39:01 PM       all     13.82      0.00     86.18      0.00      0.00      0.00
09:39:01 PM         0     13.30      0.00     86.70      0.00      0.00      0.00
09:39:01 PM         1     14.37      0.00     85.63      0.00      0.00      0.00

09:39:01 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:39:31 PM       all     13.84      0.00     86.16      0.00      0.00      0.00
09:39:31 PM         0     13.30      0.00     86.70      0.00      0.00      0.00
09:39:31 PM         1     14.37      0.00     85.63      0.00      0.00      0.00

09:39:31 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:40:01 PM       all     13.82      0.00     86.18      0.00      0.00      0.00
09:40:01 PM         0     14.10      0.00     85.90      0.00      0.00      0.00
09:40:01 PM         1     13.53      0.00     86.47      0.00      0.00      0.00

09:40:01 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:40:31 PM       all     13.75      0.00     86.25      0.00      0.00      0.00
09:40:31 PM         0     14.27      0.00     85.73      0.00      0.00      0.00
09:40:31 PM         1     13.20      0.00     86.80      0.00      0.00      0.00

09:40:31 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:41:01 PM       all     13.55      0.00     86.45      0.00      0.00      0.00
09:41:01 PM         0     13.83      0.00     86.17      0.00      0.00      0.00
09:41:01 PM         1     13.27      0.00     86.73      0.00      0.00      0.00

Average:          CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:          all     13.78      0.00     86.22      0.00      0.00      0.00
Average:            0     13.73      0.00     86.27      0.00      0.00      0.00
Average:            1     13.82      0.00     86.18      0.00      0.00      0.00
 

On average the CPUs spent 14% of their time running code in user mode and 86% running code in kernel mode. This is because the Linux kernel has to work quite hard to handle this amount of network traffic. The question is of course: what effect does this have on the load average?

$ sar -q 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com) 	04/21/2012

09:38:01 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
09:38:31 PM        10       319      4.03      1.93      1.86
09:39:01 PM        10       319      6.46      2.72      2.12
09:39:31 PM        10       319      7.85      3.41      2.37
09:40:01 PM        10       319      8.69      4.04      2.61
09:40:31 PM        10       319      9.14      4.59      2.84
09:41:01 PM        10       313      9.55      5.12      3.07
Average:           10       318      7.62      3.63      2.48
 

The above sar output shows that the run-queue was constantly occupied by 10 processes and that the 1-minute load average slowly climbed towards 10, as one might expect by now ;-) This could be an indication that the load average is influenced by processes performing network I/O. But maybe the ping processes are simply using large amounts of CPU time and thereby forcing the load average up. To figure this out, we will take a look at the top output.

top - 21:41:02 up 11:25,  1 user,  load average: 9.51, 5.19, 3.11
Tasks: 215 total,   9 running, 206 sleeping,   0 stopped,   0 zombie
Cpu(s): 13.6%us, 34.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi, 52.4%si,  0.0%st
Mem:   3074820k total,  2567640k used,   507180k free,   221652k buffers
Swap:  5144568k total,        0k used,  5144568k free,  1161696k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
30118 root      20   0  8124  768  640 R  1.1  0.0   0:00.32 ping               
30121 root      20   0  8124  768  640 R  0.9  0.0   0:00.27 ping               
30126 root      20   0  8124  772  640 R  0.5  0.0   0:00.15 ping               
30127 root      20   0  8124  772  640 R  0.5  0.0   0:00.15 ping               
30134 root      20   0  8124  772  640 R  0.4  0.0   0:00.13 ping               
30135 root      20   0  8124  768  640 R  0.4  0.0   0:00.11 ping               
30136 root      20   0  8124  764  640 R  0.4  0.0   0:00.11 ping               
30139 root      20   0  8124  768  640 R  0.2  0.0   0:00.05 ping               
27675 hbreeder  20   0 12864 1212  836 R  0.1  0.0   0:00.15 top                
 

It is clear from the above output that the individual ping processes are not using huge amounts of CPU time, which eliminates their CPU consumption as the driving force behind the load average. The output also reveals that the high overall CPU utilization is mainly caused by handling software interrupts: 52.4% in this case.
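If you want to see where that softirq time goes, mpstat (shipped in the same sysstat package as sar) reports a %soft column per CPU, and on recent kernels /proc/softirqs contains per-CPU counters for the individual softirq types such as NET_RX and NET_TX. For example:

$ mpstat -P ALL 30 1      # watch the %soft column per CPU
$ cat /proc/softirqs      # per-CPU softirq counters (NET_RX, NET_TX, ...)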

Conclusion

Based on this experiment we can conclude that processes performing network I/O have an effect on the Linux Load Average. And based on the experiments in the previous two postings we concluded that processes running on, or waiting for, the CPU and processes performing disk I/O also have an effect on the Linux Load Average. Thus the 3 factors that drive the Load Average on a Linux system are processes that are on the run-queue because they:

  • Run on, or are waiting for, the CPU

  • Perform disk I/O

  • Perform network I/O

Summary

The Linux load average is driven by the three factors mentioned above, but how does one interpret a load average that seems too high? The first step is to look at the CPU utilization. If it isn’t 100% and the load average is above the number of CPUs in the system, the load average is primarily driven by processes performing disk I/O, network I/O, or a combination of both. Finding the processes responsible for most of the I/O isn’t straightforward, because there aren’t many tools available to assist you. A very useful tool is iotop, but it doesn’t seem to work on Oracle Linux 5; it does work on Oracle Linux 6, however. Another tool is atop, but it requires one or more kernel patches to be useful.
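If your sysstat version is recent enough, and the kernel has per-task I/O accounting enabled, pidstat can also report per-process disk I/O and is another way to track down the I/O-heavy processes:

$ pidstat -d 30 6         # per-process kB read and written per second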

If the CPU utilization is 100% and the load average is above the number of CPUs in the system, the load average is either driven entirely by processes running on, or waiting for, the CPU, or by a combination of such processes and processes performing I/O (which in turn could be a mix of disk and network I/O). Top is an easy way to verify whether CPU utilization alone is responsible for the current load average or whether the other two factors play a role as well. Knowing your system helps a lot when it comes to troubleshooting performance problems, and taking performance baselines with sar is always a good thing to do.
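On most distributions the sysstat package already collects such baselines via cron into /var/log/sa, so you can go back to the data of a quiet day and compare. A quick sketch, assuming the default naming scheme where saDD is the day of the month:

$ sar -q -f /var/log/sa/sa21      # run-queue and load average for the 21st
$ sar -u -f /var/log/sa/sa21      # CPU utilization for the same day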
-Harald

Posted in Linux | 24 Comments »

Understanding Linux Load Average – Part 2

Posted by Harald van Breederode on May 5, 2012

In part 1 we performed a series of experiments to explore the relationship between CPU utilization and Linux load average. We came to the conclusion that CPU utilization clearly influences the load average. In part 2 we will continue our experiments and find out whether disk I/O also influences the Linux load average.

Disk I/O and load average

The first experiment is to start 2 processes performing disk I/O on an otherwise idle system and to measure the amount of I/O issued, the load average and the CPU utilization using the sar command. BTW: my load-gen script uses the dd command to generate disk I/O.
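As with the network case, the load-gen script isn't published here, but a stripped-down version of its disk mode could look something like the sketch below (not the exact script I use): it forks the requested number of dd writers doing small synchronous writes, which would fit the write-only, small-block pattern in the sar output.

#!/bin/bash
# Hypothetical sketch of "load-gen disk <count>" (not the actual script):
# fork <count> dd processes writing zeros to temporary files with small
# synchronous writes.
count=${2:-1}
echo "Starting ${count} disk load processes."
for i in $(seq 1 "${count}"); do
    dd if=/dev/zero of=/tmp/diskload.$$.$i bs=1k count=1000000 oflag=dsync \
       > /dev/null 2>&1 &
done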

$ load-gen disk 2
$ sar -b 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com) 	04/21/2012

09:26:59 PM       tps      rtps      wtps   bread/s   bwrtn/s
09:27:29 PM  24881.59      0.00  24881.59      0.00  46776.38
09:27:59 PM  26894.34      0.00  26894.34      0.00  50553.80
09:28:29 PM  26772.19      0.10  26772.09      0.80  50327.21
09:28:59 PM  27366.10      0.00  27366.10      0.00  51438.92
09:29:29 PM  25126.06      0.20  25125.86      1.61  47241.72
09:29:59 PM  22815.88      0.00  22815.88      0.00  42897.59
Average:     25643.28      0.05  25643.23      0.40  48207.04

The -b command line option given to sar tells it to report disk I/O statistics. The above output tells us that on average 48207 blocks per second were written to disk and almost nothing was read. What effect does this have on the load average?

$ sar -q 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com) 	04/21/2012

09:26:59 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
09:27:29 PM         4       293      0.89      1.70      1.75
09:27:59 PM         3       293      1.33      1.73      1.76
09:28:29 PM         1       293      1.59      1.75      1.77
09:28:59 PM         3       293      1.89      1.81      1.78
09:29:29 PM         3       293      1.93      1.83      1.79
09:29:59 PM         2       291      2.02      1.86      1.80
Average:            3       293      1.61      1.78      1.77

The run-queue utilization varies because a process is not on the run-queue while it is waiting for its I/O to complete. However, the 1-minute load average climbs towards the number of processes issuing disk I/O (2 in this case), which is a clear indication that performing disk I/O influences the load average. Before jumping to conclusions, we also have to take a look at the CPU utilization.

$ sar -P ALL -u 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com) 	04/21/2012

09:26:59 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:27:29 PM       all      0.89      0.00     27.62     58.21      0.00     13.28
09:27:29 PM         0      0.93      0.00     41.46     57.57      0.00      0.03
09:27:29 PM         1      0.84      0.00     14.28     58.81      0.00     26.07

09:27:29 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:27:59 PM       all      0.76      0.00     30.93     59.04      0.00      9.27
09:27:59 PM         0      0.70      0.00     45.57     53.63      0.00      0.10
09:27:59 PM         1      0.82      0.00     16.59     64.37      0.00     18.22

09:27:59 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:28:29 PM       all      0.84      0.00     30.19     58.33      0.00     10.64
09:28:29 PM         0      0.70      0.00     44.67     54.56      0.00      0.07
09:28:29 PM         1      0.97      0.00     16.25     61.96      0.00     20.82

09:28:29 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:28:59 PM       all      0.79      0.00     30.48     57.33      0.00     11.39
09:28:59 PM         0      0.67      0.00     45.35     53.98      0.00      0.00
09:28:59 PM         1      0.94      0.00     16.47     60.48      0.00     22.10

09:28:59 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:29:29 PM       all      0.81      0.00     26.43     55.30      0.00     17.46
09:29:29 PM         0      0.70      0.00     40.33     58.93      0.00      0.03
09:29:29 PM         1      0.90      0.00     13.53     51.93      0.00     33.64

09:29:29 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:29:59 PM       all      0.62      0.00     22.09     53.08      0.00     24.21
09:29:59 PM         0      0.57      0.00     35.11     64.33      0.00      0.00
09:29:59 PM         1      0.64      0.00     10.17     42.84      0.00     46.35

Average:          CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:          all      0.78      0.00     27.93     56.86      0.00     14.43
Average:            0      0.71      0.00     42.08     57.16      0.00      0.04
Average:            1      0.85      0.00     14.51     56.58      0.00     28.06

On average the CPU was about 29% busy, 14% idle, and 57% of the time was spent waiting for I/O completion (IOWAIT). During IOWAIT the CPU is actually idle, but because there is at least 1 outstanding I/O it is reported as IOWAIT. IOWAIT is a commonly misunderstood CPU state! See the mpstat manual page for a good description of the various CPU states on a Linux system.

What does top have to report?

top - 21:30:00 up 11:14,  1 user,  load average: 2.02, 1.86, 1.80
Tasks: 191 total,   1 running, 190 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.6%us, 15.5%sy,  0.0%ni, 24.3%id, 53.1%wa,  0.9%hi,  5.7%si,  0.0%st
Mem:   3074820k total,  2550432k used,   524388k free,   220404k buffers
Swap:  5144568k total,        0k used,  5144568k free,  1160732k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
27576 hbreeder  20   0 63168  612  512 D 13.9  0.0   0:04.19 dd                 
27575 hbreeder  20   0 63168  612  512 D 13.1  0.0   0:03.95 dd                 
27542 hbreeder  20   0 12756 1188  836 R  0.1  0.0   0:00.23 top                 

The top output above confirms that although the CPU utilization is quite low, the load average is about 2. Since there was nothing else running on my system during this period we can safely conclude that the Linux load average is also influenced by the number of processes issuing disk I/O.
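The mechanism behind this is that Linux counts not only runnable processes but also processes in uninterruptible sleep (state D, which usually means they are waiting for disk I/O) when it calculates the load average. You can spot those processes directly, for example:

$ ps -eo state,pid,comm | awk '$1 == "D"'   # D = uninterruptible sleep
$ vmstat 5                                  # the "b" column counts blocked (D-state) processes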

Demystifying IOWAIT

As stated above, IOWAIT is basically the same as idle time; the only difference is that during IOWAIT there is at least 1 outstanding I/O. Had there been another process (or thread) waiting for the CPU, it would have used the CPU, and the time would have been reported as one of the other CPU states such as USER, NICE or SYSTEM. We can easily prove this with another experiment: starting 2 CPU load processes while the 2 disk load processes are still running.

$ load-gen cpu 2
$ sar -P ALL -u 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com) 	04/21/2012

09:30:00 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:30:30 PM       all     63.29      0.00     36.41      0.30      0.00      0.00
09:30:30 PM         0     30.11      0.00     69.29      0.60      0.00      0.00
09:30:30 PM         1     96.47      0.00      3.53      0.00      0.00      0.00

09:30:30 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:31:00 PM       all     61.33      0.00     38.52      0.15      0.00      0.00
09:31:00 PM         0     28.28      0.00     71.42      0.30      0.00      0.00
09:31:00 PM         1     94.40      0.00      5.60      0.00      0.00      0.00

09:31:00 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:31:30 PM       all     62.60      0.00     37.40      0.00      0.00      0.00
09:31:30 PM         0     29.67      0.00     70.33      0.00      0.00      0.00
09:31:30 PM         1     95.53      0.00      4.47      0.00      0.00      0.00

09:31:30 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:32:00 PM       all     64.29      0.00     35.71      0.00      0.00      0.00
09:32:00 PM         0     30.84      0.00     69.16      0.00      0.00      0.00
09:32:00 PM         1     97.70      0.00      2.30      0.00      0.00      0.00

09:32:00 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:32:30 PM       all     63.77      0.00     36.23      0.00      0.00      0.00
09:32:30 PM         0     31.40      0.00     68.60      0.00      0.00      0.00
09:32:30 PM         1     96.13      0.00      3.87      0.00      0.00      0.00

09:32:30 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:33:00 PM       all     63.49      0.00     36.51      0.00      0.00      0.00
09:33:00 PM         0     31.97      0.00     68.03      0.00      0.00      0.00
09:33:00 PM         1     95.00      0.00      5.00      0.00      0.00      0.00

Average:          CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:          all     63.13      0.00     36.80      0.08      0.00      0.00
Average:            0     30.38      0.00     69.47      0.15      0.00      0.00
Average:            1     95.87      0.00      4.13      0.00      0.00      0.00

Voila, the IOWAIT is gone! If one of the disk I/O load processes has to wait for I/O completion, one of the CPU load processes can use the CPU and therefore the system does not enter the IOWAIT state. Isn’t that a nice way to “tune” a system that is “suffering” high amounts of IOWAIT? ;-)

To further prove that disk I/O is just another factor that influences the Linux load average, just as CPU utilization does, we can take a look at the sar run-queue statistics:

$ sar -q 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com) 	04/21/2012

09:30:00 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
09:30:30 PM         4       295      2.80      2.06      1.87
09:31:00 PM         4       295      3.27      2.24      1.93
09:31:30 PM         4       295      3.70      2.44      2.01
09:32:00 PM         4       295      3.82      2.59      2.07
09:32:30 PM         4       295      3.89      2.72      2.13
09:33:00 PM         4       289      3.93      2.84      2.19
Average:            4       294      3.57      2.48      2.03

The load average has increased by 2. Hopefully this matches your expectation. If not, please read part 1 again!

Does the current load average have an impact on how well the system is performing? We can check this by looking at the amount of I/O the system is able to perform, again using sar:

$ sar -b 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com) 	04/21/2012

09:30:00 PM       tps      rtps      wtps   bread/s   bwrtn/s
09:30:30 PM  19960.52      0.00  19960.52      0.00  37556.49
09:31:00 PM  20967.83      0.00  20967.83      0.00  39447.60
09:31:30 PM  20959.73      0.00  20959.73      0.00  39421.50
09:32:00 PM  20634.82      0.10  20634.72      0.80  38803.77
09:32:30 PM  20117.79      0.00  20117.79      0.00  37849.28
09:33:00 PM  19797.07      0.10  19796.97      0.80  37237.61
Average:     20406.31      0.03  20406.28      0.27  38386.08

The above output shows that, under the current load average of 4, the system is able to write on average 38386 blocks per second to disk. This is considerably less than the average of 48207 blocks per second we were able to write earlier, with a load average of 2.

We will take a look at the corresponding top output before we come to a conclusion.

top - 21:33:01 up 11:17,  1 user,  load average: 3.93, 2.84, 2.19
Tasks: 195 total,   5 running, 190 sleeping,   0 stopped,   0 zombie
Cpu(s): 63.7%us, 19.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  3.0%hi, 13.8%si,  0.0%st
Mem:   3074820k total,  2553332k used,   521488k free,   221216k buffers
Swap:  5144568k total,        0k used,  5144568k free,  1160828k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
27581 hbreeder  20   0 63836 1064  908 R 82.3  0.0   1:47.34 busy-cpu           
27582 hbreeder  20   0 63836 1068  908 R 59.4  0.0   2:18.79 busy-cpu           
27621 hbreeder  20   0 63168  616  512 R 23.9  0.0   0:07.19 dd                 
27620 hbreeder  20   0 63168  616  512 R 23.8  0.0   0:07.14 dd                 
27588 hbreeder  20   0 12756 1192  836 R  0.0  0.0   0:00.13 top                

The above top output confirms that the load average is about 4, the CPUs are fully utilized and that there is no IOWAIT. It also shows the processes that are active on the system.

Conclusion

So far we have proven that both CPU utilization and disk I/O influence the load average on a Linux system. The question remains whether they are the only factors driving the load average, or whether there are other factors to consider.
Stay tuned for part three to get the answer!
-Harald

Posted in Linux | 9 Comments »

Understanding Linux Load Average – Part 1

Posted by Harald van Breederode on April 23, 2012

A frequently asked question in my classroom is “What is the meaning of load average and when is it too high?”. This may sound like an easy question, and I really thought it was, but recently I discovered that things aren’t always as easy as they seem. In this first post of a three-part series I will explain what the Linux load average means and how to diagnose load averages that may seem too high.

Obtaining the current load average is as simple as issuing the uptime command:

$ uptime
21:49:05 up 11:33,  1 user,  load average: 10.52, 6.03, 3.78

But what do these 3 numbers mean? Basically, the load average is the run-queue utilization averaged over the last minute, the last 5 minutes and the last 15 minutes. The run-queue is a list of processes waiting for a resource to become available inside the Linux operating system. The example above indicates that, measured over the last minute, there were on average 10.52 processes on the run-queue waiting to be scheduled.
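These three numbers come straight from the kernel and can also be read from /proc/loadavg, which is where uptime, top and sar ultimately get them. Besides the three load averages, that file also shows the number of currently runnable scheduling entities versus the total number, and the PID of the most recently created process:

$ cat /proc/loadavg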

The questions are of course: Which processes are on the run-queue? And what are they waiting for? Why not find the answer to these questions by performing a series of experiments?

CPU utilization and load average

To be able to perform the necessary experiments I wrote a few shell scripts to generate various types of load on my Linux box. The first experiment is to start one CPU load process, on an otherwise idle system, and watch its effect on the load average using the sar command:

$ load-gen cpu
Starting 1 CPU load process.
$ sar -q 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com) 	04/21/2012

09:06:54 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
09:07:24 PM         1       290      0.39      0.09      0.15
09:07:54 PM         1       290      0.63      0.18      0.18
09:08:24 PM         1       290      0.77      0.26      0.20
09:08:54 PM         1       290      0.86      0.33      0.22
09:09:24 PM         1       290      0.97      0.40      0.25
09:09:54 PM         1       288      0.98      0.46      0.28
Average:            1       290      0.77      0.29      0.21 

The above sar output reported the load average 6 times with an interval of 30 seconds. It shows that there was constantly 1 process on the run-queue, with the result that the 1-minute load average slowly climbs to a value of 1 and then stabilizes there. The 5-minute load average will continue to climb for a few more minutes and will also stabilize at 1, and the same is true for the 15-minute load average, assuming the run-queue utilization remains the same.
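The CPU load processes started by load-gen show up later in the top output under the name busy-cpu. That script isn't published here either, but a stripped-down version of such a CPU burner could be nothing more than a tight loop that keeps one CPU busy in user mode, for example:

#!/bin/bash
# Hypothetical sketch of a "busy-cpu" worker (not the actual script):
# spin in a tight loop so one CPU is kept fully busy, mostly in user mode.
i=0
while :; do
    i=$(( (i + 1) % 1000000 ))
done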

The next step is to take a look at the CPU utilization to check whether there is a correlation between it and the load average. While measuring the load average with sar, I also had sar running to report the CPU utilization.

$ sar -u 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com) 	04/21/2012

09:06:54 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:07:24 PM       all     50.48      0.00      0.65      0.00      0.00     48.87
09:07:54 PM       all     50.40      0.00      0.48      0.02      0.00     49.10
09:08:24 PM       all     50.03      0.00      0.57      0.02      0.00     49.39
09:08:54 PM       all     49.97      0.00      0.52      0.00      0.00     49.52
09:09:24 PM       all     50.10      0.00      0.52      0.02      0.00     49.37
09:09:54 PM       all     50.23      0.00      0.55      0.02      0.00     49.21
Average:          all     50.20      0.00      0.55      0.01      0.00     49.24

This shows that overall the system spent roughly 50% of its time running user processes and the other 50% doing nothing. Thus only half of the machine’s capacity was needed to run the CPU load that caused a load average of 1. Isn’t that strange? Not if you know that the machine is equipped with two processors. While one CPU was busy running the load, the other CPU was idle, resulting in an overall CPU utilization of 50%.

Personally I prefer using sar to peek around in a busy Linux system but other people tend to use top for the same thing. This is what top had to report about the situation we are studying using sar:

$ top -bi -d30 -n7
top - 21:09:55 up 10:54,  1 user,  load average: 0.98, 0.46, 0.28
Tasks: 188 total,   2 running, 186 sleeping,   0 stopped,   0 zombie
Cpu(s): 50.2%us,  0.5%sy,  0.0%ni, 49.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3074820k total,  2539340k used,   535480k free,   218600k buffers
Swap:  5144568k total,        0k used,  5144568k free,  1160120k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
27348 hbreeder  20   0 63836 1068  908 R 99.8  0.0   3:00.31 busy-cpu           
27354 hbreeder  20   0 12756 1184  836 R  0.0  0.0   0:00.12 top                

The -bi command line options given to top tell it to go into batch mode, instead of full-screen mode, and to ignore idle processes. The -d30 and -n7 options instruct top to produce 7 sets of output with a delay of 30 seconds between them. The output above is the last of the 7 sets top produced.
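Because batch mode simply writes plain text to standard output, the same invocation is also handy for capturing a top baseline to a file for later comparison, for example:

$ top -bi -d30 -n7 > /tmp/top-baseline.txt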

Besides everything we already discovered by looking at the various sar outputs, top gives us useful information about the processes consuming CPU time as well as information about physical and virtual memory usage. It is interesting to see that the busy-cpu process consumes 99.8% CPU while the overall CPU utilization is slightly over 50%, leaving about 49% idle time.

The explanation is that top reports CPU utilization averaged over all processors in the header section of its output, while the per-process CPU utilization is not averaged over the total number of processors.

We can verify this statement by using the -P ALL command line option to make sar report the CPU utilization on a per processor basis as well as the averaged values.

$ sar -P ALL -u 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com) 	04/21/2012

09:06:54 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:07:24 PM       all     50.48      0.00      0.65      0.00      0.00     48.87
09:07:24 PM         0      0.97      0.00      1.27      0.00      0.00     97.77
09:07:24 PM         1     99.97      0.00      0.03      0.00      0.00      0.00

09:07:24 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:07:54 PM       all     50.41      0.00      0.48      0.02      0.00     49.09
09:07:54 PM         0      0.83      0.00      0.97      0.00      0.00     98.20
09:07:54 PM         1    100.00      0.00      0.00      0.00      0.00      0.00

09:07:54 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:08:24 PM       all     50.03      0.00      0.57      0.02      0.00     49.38
09:08:24 PM         0     75.89      0.00      0.87      0.00      0.00     23.24
09:08:24 PM         1     24.17      0.00      0.27      0.03      0.00     75.53

09:08:24 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:08:54 PM       all     49.95      0.00      0.52      0.00      0.00     49.53
09:08:54 PM         0     81.03      0.00      0.77      0.00      0.00     18.21
09:08:54 PM         1     18.91      0.00      0.23      0.00      0.00     80.86

09:08:54 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:09:24 PM       all     50.11      0.00      0.52      0.02      0.00     49.36
09:09:24 PM         0     57.05      0.00      0.93      0.03      0.00     41.99
09:09:24 PM         1     43.12      0.00      0.17      0.03      0.00     56.68

09:09:24 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:09:54 PM       all     50.23      0.00      0.55      0.02      0.00     49.21
09:09:54 PM         0     19.94      0.00      0.97      0.00      0.00     79.09
09:09:54 PM         1     80.56      0.00      0.07      0.00      0.00     19.37

Average:          CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:          all     50.20      0.00      0.55      0.01      0.00     49.24
Average:            0     39.28      0.00      0.96      0.01      0.00     59.75
Average:            1     61.12      0.00      0.13      0.01      0.00     38.74

This output confirms that most of the time only one of the two available processors was busy resulting in an overall averaged CPU utilization of 50.2%.

The next experiment is to add a second CPU load process to the still running first CPU load process. This will increase the number of processes on the run-queue from 1 to 2. What effect will this have on the load average?

$ load-gen cpu
Starting 1 CPU load process.
$ sar -q 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com) 	04/21/2012

09:09:55 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
09:10:25 PM         2       291      1.38      0.60      0.33
09:10:55 PM         2       291      1.62      0.74      0.38
09:11:25 PM         2       291      1.77      0.86      0.43
09:11:55 PM         2       291      1.86      0.96      0.48
09:12:25 PM         2       291      1.91      1.06      0.53
09:12:55 PM         2       291      1.95      1.15      0.57
Average:            2       291      1.75      0.90      0.45
 

The output above shows that the number of processes on the run-queue is now indeed 2 and that the load average is climbing to a value of 2 as a result of this. Because there are now 2 processes hogging the CPU we can expect that the overall averaged CPU utilization is close to 100%. The top output below confirms this:

$ top -bi -d30 -n7
top - 21:12:55 up 10:57,  1 user,  load average: 1.95, 1.15, 0.57
Tasks: 189 total,   3 running, 186 sleeping,   0 stopped,   0 zombie
Cpu(s): 99.3%us,  0.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3074820k total,  2540968k used,   533852k free,   218756k buffers
Swap:  5144568k total,        0k used,  5144568k free,  1160212k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
27377 hbreeder  20   0 63836 1064  908 R 99.4  0.0   2:59.45 busy-cpu           
27348 hbreeder  20   0 63836 1068  908 R 98.8  0.0   5:59.18 busy-cpu           
27383 hbreeder  20   0 12756 1188  836 R  0.1  0.0   0:00.13 top                

Please note that top reports 2 processes using nearly 100% CPU time. Using sar we can verify that indeed both processors are now fully utilized.

$ sar -P ALL -u 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com) 	04/21/2012

09:09:55 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:10:25 PM       all     99.22      0.00      0.78      0.00      0.00      0.00
09:10:25 PM         0     98.60      0.00      1.40      0.00      0.00      0.00
09:10:25 PM         1     99.83      0.00      0.17      0.00      0.00      0.00

09:10:25 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:10:55 PM       all     99.32      0.00      0.68      0.00      0.00      0.00
09:10:55 PM         0     98.70      0.00      1.30      0.00      0.00      0.00
09:10:55 PM         1     99.90      0.00      0.10      0.00      0.00      0.00

09:10:55 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:11:25 PM       all     99.28      0.00      0.72      0.00      0.00      0.00
09:11:25 PM         0     98.70      0.00      1.30      0.00      0.00      0.00
09:11:25 PM         1     99.90      0.00      0.10      0.00      0.00      0.00

09:11:25 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:11:55 PM       all     99.27      0.00      0.73      0.00      0.00      0.00
09:11:55 PM         0     98.67      0.00      1.33      0.00      0.00      0.00
09:11:55 PM         1     99.87      0.00      0.13      0.00      0.00      0.00

09:11:55 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:12:25 PM       all     99.25      0.00      0.75      0.00      0.00      0.00
09:12:25 PM         0     98.60      0.00      1.40      0.00      0.00      0.00
09:12:25 PM         1     99.90      0.00      0.10      0.00      0.00      0.00

09:12:25 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:12:55 PM       all     99.32      0.00      0.68      0.00      0.00      0.00
09:12:55 PM         0     98.77      0.00      1.23      0.00      0.00      0.00
09:12:55 PM         1     99.90      0.00      0.10      0.00      0.00      0.00

Average:          CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:          all     99.27      0.00      0.73      0.00      0.00      0.00
Average:            0     98.67      0.00      1.33      0.00      0.00      0.00
Average:            1     99.88      0.00      0.12      0.00      0.00      0.00 

The final experiment is to add 3 additional CPU load processes to check if we can force the load average to go up any further now that we are already consuming all available CPU resources on the system.

$ load-gen cpu 3
Starting 3 CPU load processes.
$ top -bi -d30 -n7
top - 21:21:59 up 11:06,  1 user,  load average: 4.91, 3.47, 2.41
Tasks: 193 total,   6 running, 186 sleeping,   0 stopped,   0 zombie
Cpu(s): 99.0%us,  0.7%sy,  0.3%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3074820k total,  2570552k used,   504268k free,   219180k buffers
Swap:  5144568k total,        0k used,  5144568k free,  1160512k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
27408 hbreeder  20   0 63836 1068  908 R 39.9  0.0   4:09.41 busy-cpu           
27377 hbreeder  20   0 63836 1064  908 R 39.8  0.0   8:42.65 busy-cpu           
27348 hbreeder  20   0 63836 1068  908 R 39.6  0.0  10:09.95 busy-cpu           
27477 hbreeder  20   0 63836 1064  908 R 39.4  0.0   1:11.19 busy-cpu           
27436 hbreeder  20   0 63836 1064  908 R 38.9  0.0   2:39.25 busy-cpu           
27483 hbreeder  20   0 12756 1192  836 R  0.1  0.0   0:00.13 top                

We managed to drive the load average up to 5 ;-) Because there are only 2 processors available in the system and 5 processes fighting for CPU time, each process gets only about 40% of the available 200% of CPU time.

Conclusion

Based on all these experiments we can conclude that CPU utilization clearly influences the load average of a Linux system. If the load average is above the total number of processors in the system, we could conclude that the system is overloaded, but this assumes that nothing else influences the load average. Is CPU utilization indeed the only factor that drives the Linux load average? Stay tuned for part two!
-Harald

Posted in Linux | 22 Comments »

 