The Dutch Prutser's Blog

By: Harald van Breederode

  • Disclaimer

    The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Understanding Linux Load Average – Part 1

Posted by Harald van Breederode on April 23, 2012

A frequently asked question in my classroom is “What is the meaning of load average and when is it too high?”. This may sound like an easy question, and I really thought it was, but recently I discovered that things aren’t always as easy as they seem. In this first post of a three-part series I will explain what the Linux load average means and how to diagnose load averages that may seem too high.

The current load average is easily obtained by issuing the uptime command:

$ uptime
21:49:05 up 11:33,  1 user,  load average: 10.52, 6.03, 3.78

But what is the meaning of these three numbers? Basically, the load average is the run-queue utilization averaged over the last minute, the last 5 minutes and the last 15 minutes. The run-queue is a list of processes waiting for a resource to become available inside the Linux operating system. The example above indicates that, measured over the last minute, there were on average 10.52 processes on the run-queue waiting to be scheduled.
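The same three numbers can also be read directly from /proc/loadavg, whose first three fields are the 1, 5 and 15 minute load averages. The following is only a small sketch; the function name show_loadavg is made up for this illustration, and the optional file argument exists purely so the function can be tried on a saved copy of the file.

```shell
# Print the three load averages from a /proc/loadavg-style file.
# Fields 1-3 of that file are the 1, 5 and 15 minute load averages.
show_loadavg() {
    # $1: file to read; defaults to /proc/loadavg (Linux-specific)
    awk '{ printf "1min=%s 5min=%s 15min=%s\n", $1, $2, $3 }' "${1:-/proc/loadavg}"
}
```

Calling show_loadavg without arguments on a Linux system reports the current values.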

The questions are of course: Which processes are on the run-queue? And what are they waiting for? Why not find the answer to these questions by performing a series of experiments?

CPU utilization and load average

To be able to perform the necessary experiments I wrote a few shell scripts to generate various types of load on my Linux box. The first experiment is to start one CPU load process, on an otherwise idle system, and watch its effect on the load average using the sar command:

$ load-gen cpu
Starting 1 CPU load process.
$ sar -q 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com) 	04/21/2012

09:06:54 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
09:07:24 PM         1       290      0.39      0.09      0.15
09:07:54 PM         1       290      0.63      0.18      0.18
09:08:24 PM         1       290      0.77      0.26      0.20
09:08:54 PM         1       290      0.86      0.33      0.22
09:09:24 PM         1       290      0.97      0.40      0.25
09:09:54 PM         1       288      0.98      0.46      0.28
Average:            1       290      0.77      0.29      0.21 

The sar output above reports the load average 6 times at an interval of 30 seconds. It shows that there was constantly 1 process on the run-queue, causing the 1 minute load average to climb slowly to a value of 1 and then stabilize there. The 5 minute load average will continue to climb for a few more minutes and will also stabilize at a value of 1, and the same is true for the 15 minute load average, assuming the run-queue utilization remains the same.
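The slow climb can be illustrated with a small simulation. Linux maintains each load average as an exponentially damped moving average that is updated roughly every 5 seconds. The sketch below mimics the 1 minute average in floating point (the kernel itself uses fixed-point arithmetic), assuming the average starts at 0 and the number of runnable processes stays constant. The function name simulate_ldavg1 is made up for this illustration.

```shell
# Simulate the 1 minute load average in floating point.
# Assumptions: the average starts at 0, is updated every 5 seconds,
# and the number of runnable processes never changes.
simulate_ldavg1() {
    # $1: number of runnable processes, $2: elapsed time in seconds
    awk -v n="$1" -v t="$2" 'BEGIN {
        decay = exp(-5 / 60)        # share of the previous value that is kept
        load = 0
        for (s = 5; s <= t; s += 5)
            load = load * decay + n * (1 - decay)
        printf "%.2f\n", load
    }'
}
```

With one runnable process, simulate_ldavg1 1 30 prints 0.39 and simulate_ldavg1 1 60 prints 0.63, which is close to the first two ldavg-1 samples reported by sar.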

The next step is to take a look at the CPU utilization to check if there is a correlation between it and the load average. While one sar was measuring the load average, I had a second sar running to report the CPU utilization.

$ sar -u 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com) 	04/21/2012

09:06:54 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:07:24 PM       all     50.48      0.00      0.65      0.00      0.00     48.87
09:07:54 PM       all     50.40      0.00      0.48      0.02      0.00     49.10
09:08:24 PM       all     50.03      0.00      0.57      0.02      0.00     49.39
09:08:54 PM       all     49.97      0.00      0.52      0.00      0.00     49.52
09:09:24 PM       all     50.10      0.00      0.52      0.02      0.00     49.37
09:09:54 PM       all     50.23      0.00      0.55      0.02      0.00     49.21
Average:          all     50.20      0.00      0.55      0.01      0.00     49.24

This shows that, overall, the system spent roughly 50% of its time running user processes and the other 50% doing nothing. Thus only half of the machine’s capacity was used to run the CPU load that caused a load average of 1. Isn’t that strange? Not if you know that the machine is equipped with two processors. While one CPU was busy running the load, the other CPU was idle, resulting in an overall CPU utilization of 50%.

Personally I prefer using sar to peek around in a busy Linux system but other people tend to use top for the same thing. This is what top had to report about the situation we are studying using sar:

$ top -bi -d30 -n7
top - 21:09:55 up 10:54,  1 user,  load average: 0.98, 0.46, 0.28
Tasks: 188 total,   2 running, 186 sleeping,   0 stopped,   0 zombie
Cpu(s): 50.2%us,  0.5%sy,  0.0%ni, 49.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3074820k total,  2539340k used,   535480k free,   218600k buffers
Swap:  5144568k total,        0k used,  5144568k free,  1160120k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
27348 hbreeder  20   0 63836 1068  908 R 99.8  0.0   3:00.31 busy-cpu           
27354 hbreeder  20   0 12756 1184  836 R  0.0  0.0   0:00.12 top                

The -bi command line option tells top to run in batch mode, instead of full-screen mode, and to ignore idle processes. The -d30 and -n7 options instruct top to produce 7 sets of output with a delay of 30 seconds between them. The output above is the last of the 7 sets of output top produced.

Besides everything we already discovered by looking at the various sar outputs, top gives us useful information about the processes consuming CPU time as well as information about physical and virtual memory usage. It is interesting to see that the busy-cpu process consumes 99.8% CPU while the overall CPU utilization is only slightly over 50%, leaving 49% of idle time.

The explanation for this is that top reports an averaged CPU utilization in the header section of its output while the per process CPU utilization is not averaged over the total number of processors.

We can verify this statement by using the -P ALL command line option to make sar report the CPU utilization on a per processor basis as well as the averaged values.

$ sar -P ALL -u 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com) 	04/21/2012

09:06:54 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:07:24 PM       all     50.48      0.00      0.65      0.00      0.00     48.87
09:07:24 PM         0      0.97      0.00      1.27      0.00      0.00     97.77
09:07:24 PM         1     99.97      0.00      0.03      0.00      0.00      0.00

09:07:24 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:07:54 PM       all     50.41      0.00      0.48      0.02      0.00     49.09
09:07:54 PM         0      0.83      0.00      0.97      0.00      0.00     98.20
09:07:54 PM         1    100.00      0.00      0.00      0.00      0.00      0.00

09:07:54 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:08:24 PM       all     50.03      0.00      0.57      0.02      0.00     49.38
09:08:24 PM         0     75.89      0.00      0.87      0.00      0.00     23.24
09:08:24 PM         1     24.17      0.00      0.27      0.03      0.00     75.53

09:08:24 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:08:54 PM       all     49.95      0.00      0.52      0.00      0.00     49.53
09:08:54 PM         0     81.03      0.00      0.77      0.00      0.00     18.21
09:08:54 PM         1     18.91      0.00      0.23      0.00      0.00     80.86

09:08:54 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:09:24 PM       all     50.11      0.00      0.52      0.02      0.00     49.36
09:09:24 PM         0     57.05      0.00      0.93      0.03      0.00     41.99
09:09:24 PM         1     43.12      0.00      0.17      0.03      0.00     56.68

09:09:24 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:09:54 PM       all     50.23      0.00      0.55      0.02      0.00     49.21
09:09:54 PM         0     19.94      0.00      0.97      0.00      0.00     79.09
09:09:54 PM         1     80.56      0.00      0.07      0.00      0.00     19.37

Average:          CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:          all     50.20      0.00      0.55      0.01      0.00     49.24
Average:            0     39.28      0.00      0.96      0.01      0.00     59.75
Average:            1     61.12      0.00      0.13      0.01      0.00     38.74

This output confirms that most of the time only one of the two available processors was busy resulting in an overall averaged CPU utilization of 50.2%.
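The arithmetic behind the "all" row is simply the mean of the per-processor rows. As a small sketch (the function name avg_pct is made up for this illustration), the averaged %user value can be recomputed from the two per-CPU averages:

```shell
# Average a list of per-CPU utilization percentages.
avg_pct() {
    # arguments: one percentage per processor
    printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.2f\n", sum / NR }'
}
```

Running avg_pct 39.28 61.12 prints 50.20, the %user value sar reports for all processors together.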

The next experiment is to add a second CPU load process to the still running first CPU load process. This will increase the number of processes on the run-queue from 1 to 2. What effect will this have on the load average?

$ load-gen cpu
Starting 1 CPU load process.
$ sar -q 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com) 	04/21/2012

09:09:55 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
09:10:25 PM         2       291      1.38      0.60      0.33
09:10:55 PM         2       291      1.62      0.74      0.38
09:11:25 PM         2       291      1.77      0.86      0.43
09:11:55 PM         2       291      1.86      0.96      0.48
09:12:25 PM         2       291      1.91      1.06      0.53
09:12:55 PM         2       291      1.95      1.15      0.57
Average:            2       291      1.75      0.90      0.45
 

The output above shows that the number of processes on the run-queue is now indeed 2 and that the load average is climbing to a value of 2 as a result of this. Because there are now 2 processes hogging the CPU we can expect that the overall averaged CPU utilization is close to 100%. The top output below confirms this:

$ top -bi -d30 -n7
top - 21:12:55 up 10:57,  1 user,  load average: 1.95, 1.15, 0.57
Tasks: 189 total,   3 running, 186 sleeping,   0 stopped,   0 zombie
Cpu(s): 99.3%us,  0.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3074820k total,  2540968k used,   533852k free,   218756k buffers
Swap:  5144568k total,        0k used,  5144568k free,  1160212k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
27377 hbreeder  20   0 63836 1064  908 R 99.4  0.0   2:59.45 busy-cpu           
27348 hbreeder  20   0 63836 1068  908 R 98.8  0.0   5:59.18 busy-cpu           
27383 hbreeder  20   0 12756 1188  836 R  0.1  0.0   0:00.13 top                

Please note that top reports 2 processes using nearly 100% CPU time. Using sar we can verify that indeed both processors are now fully utilized.

$ sar -P ALL -u 30 6
Linux 2.6.32-300.20.1.el5uek (roger.example.com) 	04/21/2012

09:09:55 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:10:25 PM       all     99.22      0.00      0.78      0.00      0.00      0.00
09:10:25 PM         0     98.60      0.00      1.40      0.00      0.00      0.00
09:10:25 PM         1     99.83      0.00      0.17      0.00      0.00      0.00

09:10:25 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:10:55 PM       all     99.32      0.00      0.68      0.00      0.00      0.00
09:10:55 PM         0     98.70      0.00      1.30      0.00      0.00      0.00
09:10:55 PM         1     99.90      0.00      0.10      0.00      0.00      0.00

09:10:55 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:11:25 PM       all     99.28      0.00      0.72      0.00      0.00      0.00
09:11:25 PM         0     98.70      0.00      1.30      0.00      0.00      0.00
09:11:25 PM         1     99.90      0.00      0.10      0.00      0.00      0.00

09:11:25 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:11:55 PM       all     99.27      0.00      0.73      0.00      0.00      0.00
09:11:55 PM         0     98.67      0.00      1.33      0.00      0.00      0.00
09:11:55 PM         1     99.87      0.00      0.13      0.00      0.00      0.00

09:11:55 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:12:25 PM       all     99.25      0.00      0.75      0.00      0.00      0.00
09:12:25 PM         0     98.60      0.00      1.40      0.00      0.00      0.00
09:12:25 PM         1     99.90      0.00      0.10      0.00      0.00      0.00

09:12:25 PM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:12:55 PM       all     99.32      0.00      0.68      0.00      0.00      0.00
09:12:55 PM         0     98.77      0.00      1.23      0.00      0.00      0.00
09:12:55 PM         1     99.90      0.00      0.10      0.00      0.00      0.00

Average:          CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:          all     99.27      0.00      0.73      0.00      0.00      0.00
Average:            0     98.67      0.00      1.33      0.00      0.00      0.00
Average:            1     99.88      0.00      0.12      0.00      0.00      0.00 

The final experiment is to add 3 additional CPU load processes to check if we can force the load average to go up any further now that we are already consuming all available CPU resources on the system.

$ load-gen cpu 3
Starting 3 CPU load processes.
$ top -bi -d30 -n7
top - 21:21:59 up 11:06,  1 user,  load average: 4.91, 3.47, 2.41
Tasks: 193 total,   6 running, 186 sleeping,   0 stopped,   0 zombie
Cpu(s): 99.0%us,  0.7%sy,  0.3%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3074820k total,  2570552k used,   504268k free,   219180k buffers
Swap:  5144568k total,        0k used,  5144568k free,  1160512k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
27408 hbreeder  20   0 63836 1068  908 R 39.9  0.0   4:09.41 busy-cpu           
27377 hbreeder  20   0 63836 1064  908 R 39.8  0.0   8:42.65 busy-cpu           
27348 hbreeder  20   0 63836 1068  908 R 39.6  0.0  10:09.95 busy-cpu           
27477 hbreeder  20   0 63836 1064  908 R 39.4  0.0   1:11.19 busy-cpu           
27436 hbreeder  20   0 63836 1064  908 R 38.9  0.0   2:39.25 busy-cpu           
27483 hbreeder  20   0 12756 1192  836 R  0.1  0.0   0:00.13 top                

We managed to drive the load average up to 5 ;-) Because there are only 2 processors available in the system and there are 5 processes fighting for CPU time, each process will only get 40% of the available 200% CPU time.
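The 40% figure follows from dividing the available CPU time evenly. The sketch below shows that calculation under the assumptions that all processes are single-threaded, purely CPU-bound and of equal priority, so the scheduler treats them alike. The function name fair_share_pct is made up for this illustration.

```shell
# Expected %CPU per process when CPU-bound processes share the processors evenly.
# Assumptions: single-threaded processes of equal priority, no other load.
fair_share_pct() {
    # $1: number of processors, $2: number of CPU-bound processes
    awk -v cpus="$1" -v procs="$2" 'BEGIN {
        share = cpus * 100 / procs
        if (share > 100) share = 100    # one single-threaded process cannot use more than one CPU
        printf "%.1f\n", share
    }'
}
```

For the situation above, fair_share_pct 2 5 prints 40.0, while fair_share_pct 2 1 and fair_share_pct 2 2 both print 100.0, matching the earlier experiments.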

Conclusion

Based on all these experiments we can conclude that CPU utilization clearly influences the load average of a Linux system. If the load average is above the total number of processors in the system, we could conclude that the system is overloaded, but this assumes that nothing else influences the load average. Is CPU utilization indeed the only factor that drives the Linux load average? Stay tuned for part two!
-Harald

Posted in Linux | 30 Comments »

Ksplice in action

Posted by Harald van Breederode on September 24, 2011

On July 21, 2011 Oracle announced that it had acquired Ksplice. With Ksplice, users can update the Linux kernel while it is running, that is, without a reboot or any other disruption. As of September 15, 2011 Ksplice is available, at no additional charge, to new and existing Oracle Premier Support customers on the Unbreakable Linux Network (ULN).

Updating the Linux kernel while it is running sounded like an impossible mission to me, and I was really keen to see this in action with my own “eyes” ;-) Yesterday I gave it a try, and in this posting I will share my first experience with you.

The installation of Ksplice is a very easy process which took only a few minutes and can be performed while the system is up and running. It does however require a ULN account for obvious reasons ;-)

Before updating my system, let’s have a look at when the system was booted and which kernel it is running, and confirm that I have an Oracle database running while the kernel is being updated to a new version:

$ who -b
         system boot  2011-09-23 18:52
$ uname -r
2.6.32-200.16.1.el5uek
$ pgrep -lf smon
6037 ora_smon_v1120
 

The above output shows that my system is running a 2.6.32-200.16.1.el5uek kernel. The “-uek” indicates an Oracle Unbreakable Enterprise Kernel which is a pre-requisite for using Ksplice on Oracle Linux.

And now, let’s update the currently running Linux kernel to the latest version using Ksplice:

$ sudo uptrack-upgrade -y
The following steps will be taken:
Install [694jrs5f] Clear garbage data on the kernel stack when handling signals.
Install [zfm9vkzx] CVE-2011-2491: Local denial of service in NLM subsystem.
Install [gxqj9ojz] CVE-2011-2492: Information leak in bluetooth implementation.
Install [hojignhn] CVE-2011-2495: Information leak in /proc/PID/io.
Install [fa05bhhk] CVE-2011-2497: Buffer overflow in the Bluetooth subsystem.
Install [04wcg4oc] CVE-2011-2517: Buffer overflow in nl80211 driver.
Install [xjzxf6c1] CVE-2011-2695: Off-by-one errors in the ext4 filesystem.
Install [oqz3q8m2] CVE-2011-1576: Denial of service with VLAN packets and GRO.
Installing [694jrs5f] Clear garbage data on the kernel stack when handling signals.
Installing [zfm9vkzx] CVE-2011-2491: Local denial of service in NLM subsystem.
Installing [gxqj9ojz] CVE-2011-2492: Information leak in bluetooth implementation.
Installing [hojignhn] CVE-2011-2495: Information leak in /proc/PID/io.
Installing [fa05bhhk] CVE-2011-2497: Buffer overflow in the Bluetooth subsystem.
Installing [04wcg4oc] CVE-2011-2517: Buffer overflow in nl80211 driver.
Installing [xjzxf6c1] CVE-2011-2695: Off-by-one errors in the ext4 filesystem.
Installing [oqz3q8m2] CVE-2011-1576: Denial of service with VLAN packets and GRO.
Your kernel is fully up to date.
Effective kernel version is 2.6.32-200.19.1.el5uek

Note: Although the product is called Ksplice, the service it provides is known as uptrack.

The result of running the uptrack-upgrade command is that my system is now running kernel version 2.6.32-200.19.1.el5uek, and it happened without a reboot or even stopping the running Oracle database! The output also shows that the running kernel was updated by installing small chunks of code, each corresponding to a patch that was applied to the kernel source code when the new kernel version was put together.

The output below shows that the system was neither rebooted nor was the running database restarted.

$ who -b
         system boot  2011-09-23 18:52
$ pgrep -lf smon
6037 ora_smon_v1120
$ uname -r
2.6.32-200.16.1.el5uek

It may be a bit confusing that uname -r still reports kernel version 2.6.32-200.16.1.el5uek while in reality the kernel version is 2.6.32-200.19.1.el5uek. According to the documentation this is expected behaviour, and there is an uptrack-uname command available to report the kernel version that is actually running, as shown below:

$ uptrack-uname -r
2.6.32-200.19.1.el5uek
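Comparing the two version strings is a quick way to tell whether live updates are in effect. The function below is only a sketch with a made-up name, describe_kernel; it takes the two versions as arguments so it does not itself depend on uptrack-uname being installed.

```shell
# Describe whether live updates have changed the effective kernel version.
describe_kernel() {
    # $1: version reported by uname -r, $2: version reported by uptrack-uname -r
    if [ "$1" = "$2" ]; then
        echo "no live updates in effect: $1"
    else
        echo "booted $1, effectively running $2"
    fi
}
```

On a system with Ksplice installed it could be called as describe_kernel "$(uname -r)" "$(uptrack-uname -r)".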

In case you want to know which updates were applied to the running kernel the uptrack-show command is your friend:

$ sudo uptrack-show
Installed updates:
[694jrs5f] Clear garbage data on the kernel stack when handling signals.
[zfm9vkzx] CVE-2011-2491: Local denial of service in NLM subsystem.
[gxqj9ojz] CVE-2011-2492: Information leak in bluetooth implementation.
[hojignhn] CVE-2011-2495: Information leak in /proc/PID/io.
[fa05bhhk] CVE-2011-2497: Buffer overflow in the Bluetooth subsystem.
[04wcg4oc] CVE-2011-2517: Buffer overflow in nl80211 driver.
[xjzxf6c1] CVE-2011-2695: Off-by-one errors in the ext4 filesystem.
[oqz3q8m2] CVE-2011-1576: Denial of service with VLAN packets and GRO.

Effective kernel version is 2.6.32-200.19.1.el5uek
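If you only care about which security fixes are in place, the CVE identifiers can be filtered out of this listing. The sketch below assumes the output format shown above, with one update per line in the form "[id] CVE-number: description"; the function name list_cves is made up for this illustration.

```shell
# Extract the CVE identifiers from uptrack-show output read on standard input.
# Assumes lines of the form "[id] CVE-YYYY-NNNN: description"; other lines are skipped.
list_cves() {
    sed -n 's/^\[[^]]*] \(CVE-[0-9]*-[0-9]*\):.*/\1/p'
}
```

It would be used as sudo uptrack-show | list_cves, which prints one CVE identifier per line and skips updates that carry no CVE number.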

If, for whatever reason, you want to remove the updates that were applied to the running kernel, it is good to know that this can also be performed without a reboot or any other service disruption by running the uptrack-remove command.

$ sudo uptrack-remove -y --all
The following steps will be taken:
Remove [oqz3q8m2] CVE-2011-1576: Denial of service with VLAN packets and GRO.
Remove [xjzxf6c1] CVE-2011-2695: Off-by-one errors in the ext4 filesystem.
Remove [04wcg4oc] CVE-2011-2517: Buffer overflow in nl80211 driver.
Remove [fa05bhhk] CVE-2011-2497: Buffer overflow in the Bluetooth subsystem.
Remove [hojignhn] CVE-2011-2495: Information leak in /proc/PID/io.
Remove [gxqj9ojz] CVE-2011-2492: Information leak in bluetooth implementation.
Remove [zfm9vkzx] CVE-2011-2491: Local denial of service in NLM subsystem.
Remove [694jrs5f] Clear garbage data on the kernel stack when handling signals.
Removing [oqz3q8m2] CVE-2011-1576: Denial of service with VLAN packets and GRO.
Removing [xjzxf6c1] CVE-2011-2695: Off-by-one errors in the ext4 filesystem.
Removing [04wcg4oc] CVE-2011-2517: Buffer overflow in nl80211 driver.
Removing [fa05bhhk] CVE-2011-2497: Buffer overflow in the Bluetooth subsystem.
Removing [hojignhn] CVE-2011-2495: Information leak in /proc/PID/io.
Removing [gxqj9ojz] CVE-2011-2492: Information leak in bluetooth implementation.
Removing [zfm9vkzx] CVE-2011-2491: Local denial of service in NLM subsystem.
Removing [694jrs5f] Clear garbage data on the kernel stack when handling signals.

All the previously applied updates are taken out, in reverse order, which basically reverts the system back to its original state. The output below shows that this indeed happened without a reboot or stopping the running Oracle database:

$ who -b
         system boot  2011-09-23 18:52
$ pgrep -lf smon
6037 ora_smon_v1120
$ uname -r
2.6.32-200.16.1.el5uek
$ uptrack-uname -r
2.6.32-200.16.1.el5uek
$ sudo uptrack-show
Installed updates:
None

Effective kernel version is 2.6.32-200.16.1.el5uek

Cool, isn’t it? I am impressed!

Please read this Ksplice technical paper for some background information on the Ksplice technology.

Please keep in mind that Ksplice will only update the running kernel in memory and does not install a new kernel RPM. It does re-apply the updates automatically after a system reboot and will also check for new updates on a regular basis. Ksplice can download and install new updates automatically whenever they become available ensuring your kernel is always up-to-date!
-Harald

Posted in Linux | Leave a Comment »

Password file maintenance in a Data Guard environment

Posted by Harald van Breederode on June 13, 2011

In a previous posting I wrote about password file maintenance in a clustered ASM and RAC environment.
This article raised another question: Is there anything specific about password file maintenance in a Data Guard environment?

Yes, updating a password file in a Data Guard environment isn’t as straightforward as one might think. In this posting I will shed some light on this subject and show you how to properly update a password file in a Data Guard environment.

Let’s get started by taking a look at my Data Guard setup before diving into password file maintenance procedures.

DGMGRL> show configuration;

Configuration - PeppiEnKokki

  Protection Mode: MaxPerformance
  Databases:
    peppi - Primary database
    kokki - Physical standby database

Fast-Start Failover: DISABLED

Configuration Status:
SUCCESS

This output shows that I have a primary database called peppi, a physical standby database called kokki and that the overall status of my Data Guard configuration is healthy.

Note: peppi runs on host prutser and kokki runs on host el5.

Revealing the problem

The first question is: Are updates to the password file on the primary database propagated to the standby database(s)? We can easily figure this out by triggering an update to the password file on peppi followed by querying the password file on kokki.

SQL> connect sys/oracle@peppi as sysdba
Connected.
SQL> select * from v$pwfile_users;

USERNAME                       SYSDB SYSOP SYSAS
------------------------------ ----- ----- -----
SYS                            TRUE  TRUE  FALSE

Currently there is only one entry in the password file on peppi. By granting SYSDBA to SYSTEM we can trigger an update to the password file as shown below:

SQL> grant sysdba to system;

Grant succeeded.

SQL> select * from v$pwfile_users;

USERNAME                       SYSDB SYSOP SYSAS
------------------------------ ----- ----- -----
SYS                            TRUE  TRUE  FALSE
SYSTEM                         TRUE  FALSE FALSE

There are now two entries in the password file on the primary database. How many entries are there on the standby database?

SQL> connect sys/oracle@kokki as sysdba
Connected.
SQL> select * from v$pwfile_users;

USERNAME                       SYSDB SYSOP SYSAS
------------------------------ ----- ----- -----
SYS                            TRUE  TRUE  FALSE

The output above makes clear that updates to the password file on the primary database are not propagated to the standby database.

Before continuing, we take away SYSDBA from SYSTEM because we don’t want SYSTEM to become too powerful, do we?

SQL> connect sys/oracle@peppi as sysdba
Connected.
SQL> revoke sysdba from system;

Revoke succeeded.

Does this affect Data Guard?

The next question is: Does this affect Data Guard? The primary database sends its redo to its standby database(s), and this redo transport is authenticated by the password of the SYS user, or another user if configured, of the primary database. That is, the primary database logs into a standby database by using the password stored in the password file for the user who ships the redo, which is most likely SYS. If the (encrypted) passwords of the primary database and the standby database(s) don’t match, redo transport will (eventually) be in trouble.

To demonstrate this behavior we will change the SYS password on peppi and see if or how it affects kokki.

SQL> alter user sys identified by prutser;

User altered.
SQL> connect sys/prutser@peppi as sysdba
Connected.
SQL> connect sys/prutser@kokki as sysdba
ERROR:
ORA-01017: invalid username/password; logon denied


Warning: You are no longer connected to ORACLE.

The above output shows (again) that an update to the password file on the primary database is not automatically propagated to the standby database(s).

SQL> connect sys/oracle@kokki as sysdba
Connected.

As a matter of fact, we can still connect to kokki using the old SYS password as shown above.

The next question is: Is redo transport still possible now that the passwords are no longer the same? Let’s switch a few logs on the primary database and see what happens.

SQL> connect sys/prutser@peppi as sysdba
Connected.
SQL> alter system switch logfile;

System altered.

SQL> alter system switch logfile;

System altered.

SQL> alter system switch logfile;

System altered.

DGMGRL> show configuration;

Configuration - PeppiEnKokki

  Protection Mode: MaxPerformance
  Databases:
    peppi - Primary database
    kokki - Physical standby database

Fast-Start Failover: DISABLED

Configuration Status:
SUCCESS

According to the above output everything is still fine. Apparently the password change didn’t affect the redo transport. This is because the primary database was already logged into the standby database and as long as this connection remains open there is no need for the standby database to re-authenticate the incoming redo. However if we disable and re-enable redo transport it becomes clear that we indeed have a problem as shown below.

DGMGRL> edit database kokki set property LogShipping=off;
Property "logshipping" updated

DGMGRL> edit database kokki set property LogShipping=on;
Property "logshipping" updated

DGMGRL> show configuration;

Configuration - PeppiEnKokki

  Protection Mode: MaxPerformance
  Databases:
    peppi - Primary database
      Error: ORA-16778: redo transport error for one or more databases

    kokki - Physical standby database

Fast-Start Failover: DISABLED

Configuration Status:
ERROR

It is pretty clear that redo transport has ceased, as indicated by the ORA-16778. It is important to realize that a problem caused by updating the password file on the primary database doesn’t necessarily show up immediately but may show up much later in time.

What is an ORA-16778 anyway?

$ oerr ora 16778
16778, 00000, "redo transport error for one or more databases"
// *Cause:  The redo transport service was unable to send redo data to one
//          or more standby databases.
// *Action: Check the Data Guard broker log and Oracle alert log for
//          more details. Query the LogXptStatus property to see the
//          errors.

Because we just changed the SYS password on the primary database we already know what caused the ORA-16778. So the question is: How do we fix this?

In search of a solution

Maybe we can simply re-create the password file on kokki with the new SYS password? Let’s give it a try:

el5$ rm $ORACLE_HOME/dbs/orapwv1120

el5$ orapwd file=$ORACLE_HOME/dbs/orapwv1120 password=prutser

SQL> connect sys/prutser@kokki as sysdba
Connected.

That seems to work! The question is of course: Does Data Guard agree with my optimism?

SQL> connect sys/prutser@peppi as sysdba
Connected.
SQL> alter system switch logfile;

System altered.

SQL> alter system switch logfile;

System altered.

SQL> alter system switch logfile;

System altered.

DGMGRL> show configuration;

Configuration - PeppiEnKokki

  Protection Mode: MaxPerformance
  Databases:
    peppi - Primary database
      Error: ORA-16778: redo transport error for one or more databases

    kokki - Physical standby database

Fast-Start Failover: DISABLED

Configuration Status:
ERROR

Hmmm, this doesn’t look too good, does it? Maybe we need to disable and re-enable redo shipment?

DGMGRL> edit database kokki set property LogShipping=off;
Property "logshipping" updated

DGMGRL> edit database kokki set property LogShipping=on;
Property "logshipping" updated

DGMGRL> show configuration;

Configuration - PeppiEnKokki

  Protection Mode: MaxPerformance
  Databases:
    peppi - Primary database
      Error: ORA-16778: redo transport error for one or more databases

    kokki - Physical standby database

Fast-Start Failover: DISABLED

Configuration Status:
ERROR

No, this didn’t help either. Re-creating the password file on the standby database is not the solution. The reason for this is the way the passwords are encrypted in the password file. Even if the passwords are the same, the result of the encryption is not.
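A byte-for-byte comparison is one way to tell a re-created password file from a real copy of the primary's file. The sketch below assumes both files are available as local paths, for example after fetching the standby's file to a scratch directory; the function name compare_pwfiles is made up for this illustration.

```shell
# Report whether two password files are byte-for-byte identical.
# Assumption: both files are reachable as local paths.
compare_pwfiles() {
    # $1: password file of the primary, $2: password file of the standby
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
    fi
}
```

A result of "different" only says that the standby's file is not a copy of the primary's file; it does not say which entries differ.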

How about copying the password file from the primary database to the standby database?

The solution

We will copy the password file from peppi to kokki and see if it works:

$ scp $ORACLE_HOME/dbs/orapwv1120 el5:$ORACLE_HOME/dbs/orapwv1120
orapwv1120                                    100% 1536     1.5KB/s   00:00    

SQL> connect sys/prutser@kokki as sysdba
Connected.

Again we can connect to kokki with the updated SYS password, but is Data Guard happy now?

SQL> connect sys/prutser@peppi as sysdba
Connected.
SQL> alter system switch logfile;

System altered.

SQL> alter system switch logfile;

System altered.

SQL> alter system switch logfile;

System altered.

DGMGRL> show configuration;

Configuration - PeppiEnKokki

  Protection Mode: MaxPerformance
  Databases:
    peppi - Primary database
      Error: ORA-16778: redo transport error for one or more databases

    kokki - Physical standby database

Fast-Start Failover: DISABLED

Configuration Status:
ERROR

Hmmm, again Data Guard doesn’t look like a happy camper. But things are not what they seem. If we wait long enough Data Guard will re-open its archive destinations and redo will automatically flow again. We can trigger this by disabling and re-enabling redo shipment:

DGMGRL> edit database kokki set property LogShipping=off;
Property "logshipping" updated

DGMGRL> edit database kokki set property LogShipping=on;
Property "logshipping" updated

DGMGRL> show configuration;

Configuration - PeppiEnKokki

  Protection Mode: MaxPerformance
  Databases:
    peppi - Primary database
    kokki - Physical standby database

Fast-Start Failover: DISABLED

Configuration Status:
SUCCESS

Finally Data Guard is back in business because the primary database is able to log in to its standby database(s) and thereby successfully ship redo to them.
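With more than one standby database the copy has to be repeated for every standby host. The sketch below only prints the scp commands instead of running them, so they can be reviewed first. It assumes the password file has the same path on every host, as it does in my setup; the function name print_pwfile_copy_cmds is made up for this illustration.

```shell
# Print (but do not run) one scp command per standby host.
# Assumption: the password file lives at the same path on every host.
print_pwfile_copy_cmds() {
    # $1: path of the password file, remaining arguments: standby host names
    pwfile=$1
    shift
    for host in "$@"; do
        echo "scp $pwfile $host:$pwfile"
    done
}
```

For my configuration this would be called as print_pwfile_copy_cmds $ORACLE_HOME/dbs/orapwv1120 el5, after which redo shipment can be disabled and re-enabled as shown above.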

In summary

In order to keep Data Guard going, we must copy the password file from the primary database to the standby database(s) after an update is made to the password file on the primary database. Happy Data Guarding ;-)
-Harald

Posted in Oracle | 28 Comments »