Oracle8i Designing and Tuning for Performance Release 2 (8.1.6) Part Number A76992-01 |
|
This chapter offers guidelines for tuning instance recovery.
This chapter contains the following sections:
Instance and crash recovery are the automatic application of redo log records to Oracle data blocks after a crash or system failure. If a single instance database crashes, or if all instances of an Oracle Parallel Server configuration crash, then Oracle performs instance recovery at the next startup. If one or more instances of an Oracle Parallel Server configuration crash, then a surviving instance performs recovery.
Instance and crash recovery occur in two phases. In phase one, Oracle applies all committed and uncommitted changes in the redo log files to the affected datablocks. In phase two, Oracle applies information in the rollback segments to undo changes made by uncommitted transactions to the data blocks.
During normal operations, Oracle's DBWn processes periodically write dirty buffers, or buffers that have in-memory changes, to disk. Periodically, Oracle records the highest system change number (SCN) of all changes to blocks, such that all data blocks with changes below that SCN have been written to disk by DBWn. This SCN is the checkpoint.
Records that Oracle appends to the redo log file after the change record that the checkpoint refers to are changes that Oracle has not yet written to disk. If a failure occurs, then only redo log records containing changes at SCNs higher than the checkpoint need to be replayed during recovery.
The duration of recovery processing is directly influenced by the number of data blocks that have changes at SCNs higher than the SCN of the checkpoint. For example, Oracle recovers a redo log with 100 entries affecting one data block faster than it recovers a redo log with 10 entries for 10 different data blocks. This is because for each log record processed during recovery, the corresponding data block (if it is not already in memory) must be read from disk by Oracle, so that the change represented by the redo log entry can be applied to that block.
The principal means of balancing the duration of instance recovery and daily performance is by influencing how aggressively Oracle advances the checkpoint. If you force Oracle to keep the checkpoint only a few blocks behind the most recent redo log record, then you minimize the number of blocks Oracle processes during recovery.
The trade-off for having minimal recovery time, however, is increased performance overhead for normal database operations. If daily operational efficiency is more important than minimizing recovery time, then decreasing the frequency of writes to the datafiles increases instance recovery time.
There are several methods for tuning instance and crash recovery to keep the duration of recovery within user-specified bounds. For example:
The Oracle8i Enterprise Edition also offers fast-start fault recovery functionality to control instance recovery. This reduces the roll forward time by making it bounded and predictable, and it also eliminates the time required perform rollback. The foundation of fast-start fault recovery is fast-start checkpointing architecture. Instead of the conventional periodic checkpointing, as performed in earlier versions of Oracle, fast-start checkpointing occurs continuously, advancing the checkpoint time as blocks are written. Fast-start checkpointing always writes the oldest modified block first, ensuring that every write allows the checkpoint time to be advanced. Administrators specify a target (bounded) time to complete the roll forward phase of recovery, and Oracle automatically varies the checkpoint writes to meet that target.
During recovery, Oracle performs two main tasks:
Fast-start checkpointing eliminates bulk writes and the resultant I/O spikes that occur with conventional checkpointing, yielding smooth and fast ongoing performance. Continuous advancement reduces roll forward by half, compared to conventional checkpoints at the same transaction rate. Administrators can specify a bound on the time to do roll forward, rather than specifying the frequency of checkpoints.
You can use three initialization parameters to influence how aggressively Oracle advances the checkpoint, as shown in Table 24-1:
Set the initialization parameter LOG_CHECKPOINT_TIMEOUT
to a value n (where n is an integer) to require that the latest checkpoint position follow the most recent redo block by no more than n seconds. In other words, at most, n seconds worth of logging activity can occur between the most recent checkpoint position and the end of the redo log. This forces the checkpoint position to keep pace with the most recent redo block
You can also interpret LOG_CHECKPOINT_TIMEOUT
as specifying an upper bound on the time a buffer can be dirty in the cache before DBWn must write it to disk. For example, if you set LOG_CHECKPOINT_TIMEOUT
to 60, then no buffers remain dirty in the cache for more than 60 seconds. The default value for LOG_CHECKPOINT_TIMEOUT
is 1800.
Set the initialization parameter LOG_CHECKPOINT_INTERVAL
to a value n (where n is an integer) to require that the checkpoint position never follow the most recent redo block by more than n blocks. In other words, at most n redo blocks can exist between the checkpoint position and the last block written to the redo log. In effect, you are limiting the amount of redo blocks that can exist between the checkpoint and the end of the log.
Oracle limits the maximum value of LOG_CHECKPOINT_INTERVAL
to 90% of the smallest log to ensure that the checkpoint advances into the current log before that log fills and a log switch is attempted.
LOG_CHECKPOINT_INTERVAL
is specified in redo blocks. Redo blocks are the same size as operating system blocks. Use the LOG_FILE_SIZE_REDO_BLKS
column in V$INSTANCE_RECOVERY
to see the number of redo blocks corresponding to 90% of the size of the smallest log file.
Set this parameter to n, where n is an integer limiting to n the number of buffers that Oracle processes during crash or instance recovery. Because the number of I/Os to be processed during recovery correlates closely to the duration of recovery, the FAST_START_IO_TARGET
parameter gives you the most precise control over the duration of recovery.
FAST_START_IO_TARGET
advances the checkpoint, because DBWn uses the value of FAST_START_IO_TARGET
to determine how much writing to do. Assuming that users are making many updates to the database, a low value for this parameter forces DBWn to write changed buffers to disk. As the changed buffers are written to disk, the checkpoint advances.
The smaller the value of FAST_START_IO_TARGET
, the better the recovery performance, because fewer blocks require recovery. If you use smaller values for this parameter, however, then you impose higher overhead during normal processing, because DBWn must write more buffers to disk more frequently.
See Also:
For more information, see "Estimating Recovery Time" and "Calculating Performance Overhead". For more information on tuning checkpoints, see Chapter 20, "Tuning I/O". For more information about initialization parameters, see the Oracle8i Reference. |
The size of a redo log file directly influences checkpoint performance. The smaller the size of the smallest log, the more aggressively Oracle writes dirty buffers to disk to ensure the position of the checkpoint has advanced to the current log before that log completely fills. Forcing the checkpoint to advance into the current log before it fills ensures that Oracle will not need to wait for the checkpoint to advance out of a redo log file before it can be reused. Oracle enforces this behavior by ensuring the number of redo blocks between the checkpoint and the most recent redo record is less than 90% of the size of the smallest log.
If your redo logs are small compared to the number of changes made against the database, then Oracle must switch logs frequently. If the value of LOG_CHECKPOINT_INTERVAL
is less than 90% of the size of the smallest log, then the size of the smallest log file does not influence checkpointing behavior.
Although you specify the number and sizes of online redo log files at database creation, you can alter the characteristics of your redo log files after startup. Use the ADD
LOGFILE
clause of the ALTER
DATABASE
statement to add a redo log file and specify its size, or the DROP
LOGFILE
clause to drop a redo log.
The size of the redo log appears in the LOG_FILE_SIZE_REDO_BLKS
column of the V$INSTANCE_RECOVERY
dynamic performance. This value shows how the size of the smallest online redo log is affecting checkpointing. By increasing or decreasing the size of your online redo logs, you indirectly influence the frequency of checkpoint writes.
See Also:
For information on using the |
Besides setting initialization parameters and sizing your redo log files, you can also influence checkpoints with SQL statements. ALTER
SYSTEM
CHECKPOINT
directs Oracle to record a checkpoint for the node, and ALTER
SYSTEM
CHECKPOINT
GLOBAL
directs Oracle to record a checkpoint for every node in a cluster.
SQL-induced checkpoints are heavyweight. This means that Oracle records the checkpoint in a control file shared by all the redo threads. Oracle also updates the datafile headers. SQL-induced checkpoints move the checkpoint position to the point that corresponded to the end of the log when the statement was initiated. These checkpoints can adversely affect performance, because the additional writes to the datafiles increase system overhead.
Use the V$INSTANCE_RECOVERY
view to see your current recovery parameter settings. You can also use statistics from this view to calculate which parameter has the greatest influence on checkpointing. V$INSTANCE_RECOVERY
contains the columns shown in Table 24-2.
The value appearing in the TARGET_REDO_BLKS
column equals a value appearing in another column in the view. This other column corresponds to the parameter or log file that is determining the maximum number of redo blocks that Oracle processes during recovery. The setting for the parameter in this column is imposing the heaviest requirement on redo block processing.
For example, assume your initialization parameter settings are as follows:
FAST_START_IO_TARGET = 1000 LOG_CHECKPOINT_TIMEOUT = 1800 # default LOG_CHECKPOINT_INTERVAL = 0# default: disabled interval checkpointing
You execute the following query:
SELECT * FROM V$INSTANCE_RECOVERY;
Oracle responds with the following:
1 row selected.
As you can see by the values in the last three columns, the FAST_START_IO_TARGET
parameter places heavier recovery demands on Oracle than the other two parameters. It requires that Oracle process no more than 4215 redo blocks during recovery. The LOG_FILE_SIZE_REDO_BLKS
column indicates that Oracle can process up to 55,296 blocks during recovery, so the log file size is not the heaviest influence on checkpointing.
The TARGET_REDO_BLKS
column shows the smallest value of the last five columns. This shows the parameter or condition that exerts the heaviest requirement for Oracle checkpointing. In this example, the FAST_START_IO_TARGET
parameter is the strongest influence with a value of 4215.
Assume you make several updates to the database and query V$INSTANCE_RECOVERY
three hours later. Oracle responds with the following:
1 row selected.
FAST_START_IO_TARGET
is still exerting the strongest influence over checkpointing behavior, although the number of redo blocks corresponding to this target has changed dramatically. This change is not due to a change in FAST_START_IO_TARGET
or the corresponding RECOVERY_ESTIMATED_IOS
. Instead, this indicates that operations requiring I/O in the event of recovery are more frequent in the redo log, so fewer redo blocks now correspond to the same FAST_START_IO_TARGET
.
Assume you decide that FAST_START_IO_TARGET
is placing an excessive limit on the maximum number of redo blocks that Oracle processes during recovery. You adjust FAST_START_IO_TARGET
to 8000, set LOG_CHECKPOINT_TIMEOUT
to 60, and perform several updates. You reissue the query to V$INSTANCE_RECOVERY
and Oracle responds with:
1 row selected.
Because the TARGET_REDO_BLKS
column value of 6707 corresponds to the value in the LOG_CHKPT_TIMEOUT_REDO_BLKS
column, LOG_CHECKPOINT_TIMEOUT
is now exerting the most influence over checkpointing behavior.
Use statistics from the V$INSTANCE_RECOVERY
view to estimate recovery time using the following formula:
For example, if RECOVERY_ESTIMATED_IOS
is 2500, and the maximum number of writes your system performs is 500 per second, then recovery time is 5 seconds. Note the following restrictions:
V$FILESTAT
view provides information on the number of physical reads and writes performed since the instance started. Measure these values over a set time interval, and then divide this by the time interval to estimate your system's maximum I/Os per second. The following query can be used to measure the total I/Os since the instance started:
SELECT sum(PHYBLKRD+PHYBLKWRT) FROM v$filestat;
FAST_START_IO_TARGET
is both enabled and when this parameter is the determining influence on checkpointing behavior.
To adjust recovery time, change the initialization parameter that has the most influence over checkpointing. Use the V$INSTANCE_RECOVERY
view as described in "Monitoring Instance Recovery" to determine which parameter to adjust. Then, either adjust the parameter to decrease or increase recovery time as required.
For example, assume as in "Determining the Strongest Checkpoint Influence" that your initialization parameter settings are the following:
FAST_START_IO_TARGET = 1000 LOG_CHECKPOINT_TIMEOUT = 1800 # default LOG_CHECKPOINT_INTERVAL = 0 # default: disabled interval checkpointing
You execute the following query:
SELECT * FROM V$INSTANCE_RECOVERY;
Oracle responds with the following:
1 row selected.
You calculate recovery time using the formula, where RECOVERY_ESTIMATED_IOS
is 1025 and the maximum I/Os per second the system can perform is 500:
You decide you can afford slightly more than 2.05 seconds of recovery time: constant access to the data is not critical. You increase the value for the parameter FAST_START_IO_TARGET
to 2000 and perform several updates. You then reissue the query and Oracle displays:
1 row selected.
Recalculate recovery time using the same formula:
You have increased your recovery time by 1.96 seconds. If you can afford more time, then repeat the procedure until you arrive at an acceptable recovery time.
To calculate performance overhead, use the V$SYSSTAT
view. For example, assume you execute the following query:
SELECT NAME, VALUE FROM V$SYSSTAT WHERE NAME IN ( 'PHYSICAL READS', 'PHYSICAL WRITES',);
Oracle responds with the following:
NAME VALUE physical reads 2376 physical writes 14932 physical writes non checkpoint 11165 3 rows selected.
The first row shows the number of data blocks retrieved from disk. The second row shows the number of data blocks written to disk. The last row shows the value of the number of writes to disk that would occur if you turned off checkpointing.
Use this data to calculate the overhead imposed by setting the FAST_START_IO_TARGET
initialization parameter. To effectively measure the percentage of extra writes, mark the values for these statistics at different times, t_1
and t_2
. Use the following formula where the variables stand for the following:
Calculate the percentage of extra I/Os generated by fast-start checkpointing using this formula:
It can take some time for database statistics to stabilize after instance startup or dynamic initialization parameter modification. After such events, wait until all blocks age out of the buffer cache at least once before taking measurements.
If the percentage of extra I/Os is too high, then increase the value for FAST_START_IO_TARGET
. Adjust this parameter until you get an acceptable value for the RECOVERY_ESTIMATED_IOS
in V$INSTANCE_RECOVERY
as described in "Determining the Strongest Checkpoint Influence".
The number of extra writes caused by setting FAST_START_IO_TARGET
to a non-zero value is application-dependent. An application that repeatedly modifies the same buffers incurs a higher write penalty because of fast-start checkpointing than an application that does not. The extra write penalty is not dependent on cache size.
As an example, assume your initialization parameter settings are:
FAST_START_IO_TARGET = 2000 LOG_CHECKPOINT_TIMEOUT = 1800 # default LOG_CHECKPOINT_INTERVAL = 0 # default: disabled interval checkpointing
After the statistics stabilize, you issue this query on V$SYSSTAT
:
SELECT NAME, VALUE FROM V$SYSSTAT WHERE NAME IN ('PHYSICAL READS', 'PHYSICAL WRITES', 'PHYSICAL WRITES NON CHECKPOINT');
Oracle responds with:
Name Value physical reads 2376 physical writes 14932 physical writes non checkpoint 11165 3 rows selected.
After making updates for a few hours, you re-issue the query and Oracle responds with:
Name Value physical reads 3011 physical writes 17467 physical writes non checkpoint 13231 3 rows selected.
Substitute the values from your select statements in the formula as described to determine how much performance overhead you are incurring:
[((17467 - 14932) - (13231 - 11165)) / ((3011 - 2376) + (17467 - 14932))] x 100% = 14.8%
As the result indicates, enabling fast-start checkpointing generates about 15% more I/O than would be required had you not enabled fast-start checkpointing. After calculating the extra I/O, you decide you can afford more system overhead if you decrease recovery time.
To decrease recovery time, reduce the value for the parameter FAST_START_IO_TARGET
to 1000. After items in the buffer cache age out, calculate V$SYSSTAT
statistics across a second interval to determine the new performance overhead. Query V$SYSSTAT
:
SELECT NAME, VALUE FROM V$SYSSTAT WHERE NAME IN ('PHYSICAL READS', 'PHYSICAL WRITES', 'PHYSICAL WRITES NON CHECKPOINT');
Oracle responds with:
Name Value physical reads 4652 physical writes 28864 physical writes non checkpoint 21784 3 rows selected.
After making updates, re-issue the query and Oracle responds with:
Name Value physical reads 6000 physical writes 35394 physical writes non checkpoint 26438 3 rows selected.
Calculate how much performance overhead you are incurring using the values from your two SELECT
statements:
[(35394 - 28864) - (26438 - 21784)) / ((6000 - 4652) + (35394 - 28864))] x 100% = 23.8%
After changing the parameter, the percentage of I/Os performed by Oracle is now about 24% more than it would be if you disabled fast-start checkpointing.
The work required to do roll forward processing is proportional to the rate of change to the database (update transactions per second) and the time between which consistent snapshots, or checkpoints, of the database are made. The work required to do roll back is proportional to the number and size of uncommitted transactions when the system fault occurred. The total recovery time is the sum of time to do roll forward and the time to do roll back.
Besides using checkpoints to tune instance recovery, you can also use a variety of parameters to control Oracle's behavior during the rolling forward and rolling back phases of instance recovery. In some cases, you can parallelize operations and thereby increase recovery efficiency.
This section contains the following topics:
Use parallel block recovery to tune the roll forward phase of recovery. Parallel block recovery uses a division of labor approach to allocate different processes to different data blocks during the roll forward phase of recovery. For example, during recovery the redo log is read, and blocks that require redo application are parsed out. These blocks are subsequently distributed evenly to all recovery slaves to be read into the buffer cache. Crash, instance, and media recovery of many datafiles on different disk drives are good candidates for parallel block recovery.
Use the RECOVERY_PARALLELISM
initialization parameter to specify the number of concurrent recovery processes for instance or media recovery operations. Because crash recovery occurs at instance startup, this parameter is useful for specifying the number of processes to use for crash recovery. The value of this parameter is also the default number of processes used for media recovery if you do not specify the PARALLEL
clause of the RECOVER
statement. To use parallel processing, the value of RECOVERY_PARALLELISM
must be greater than 1 and cannot exceed the value of the PARALLEL_MAX_SERVERS
parameter. Parallel block recovery requires a minimum of eight recovery processes for it to be more effective than serial recovery.
Recovery is usually I/O bound on reads to data blocks. Consequently, parallelism at the block level may only help recovery performance if it increases total I/Os. In other words, parallelism at the block level by-passes operating system restrictions on asynchronous I/Os. Performance on systems with efficient asynchronous I/O typically does not improve significantly with parallel block recovery.
During the second phase of instance recovery, Oracle rolls back uncommitted transactions. Oracle uses two features, fast-start on-demand rollback and fast-start parallel rollback, to increase the efficiency of this recovery phase.
This section contains the following topics:
Using the fast-start on-demand rollback feature, Oracle automatically allows new transactions to begin immediately after the roll forward phase of recovery completes. Should a user attempt to access a row that is locked by a dead transaction, Oracle rolls back only those changes necessary to complete the transaction, in other words, it rolls them back on demand. Consequently, new transactions do not have to wait until all parts of a long transaction are rolled back.
In fast-start parallel rollback, the background process SMON acts as a coordinator and rolls back a set of transactions in parallel using multiple server processes. Essentially, fast-start parallel rollback is to rolling back what parallel block recovery is to rolling forward.
Fast-start parallel rollback is mainly useful when a system has transactions that run a long time before committing, especially parallel INSERT
, UPDATE
, and DELETE
operations. When SMON discovers that the amount of recovery work is above a certain threshold, it automatically begins parallel rollback by dispersing the work among several parallel processes: process 1 rolls back one transaction, process 2 rolls back a second transaction, and so on. The threshold is the point at which parallel recovery becomes cost-effective, in other words, when parallel recovery takes less time than serial recovery.
One special form of fast-start parallel rollback is intra-transaction recovery. In intra-transaction recovery, a single transaction is divided among several processes. For example, assume 8 transactions require recovery with one parallel process assigned to each transaction. The transactions are all similar in size except for transaction 5, which is quite large. This means it takes longer for one process to roll this transaction back than for the other processes to roll back their transactions.
In this situation, Oracle automatically begins intra-transaction recovery by dispersing transaction 5 among the processes: process 1 takes one part, process 2 takes another part, and so on.
You control the number of processes involved in transaction recovery by setting the parameter FAST_START_PARALLEL_ROLLBACK
to one of three values:
In Oracle Parallel Server, you can perform fast-start parallel rollback on each instance. Within each instance, you can perform parallel rollback on transactions that are:
After a rollback segment is online for a given instance, only this instance can perform parallel rollback on transactions on that segment.
Monitor the progress of fast-start parallel rollback by examining the V$FAST_START_SERVERS
and V$FAST_START_TRANSACTIONS
tables. V$FAST_START_SERVERS
provides information about all recovery processes performing fast-start parallel rollback. V$FAST_START_TRANSACTIONS
contains data about the progress of the transactions.
See Also:
For more information on fast-start parallel rollback in an Oracle Parallel Server environment, see Oracle8i Parallel Server Administration, Deployment, and Performance. For more information about initialization parameters, see the Oracle8i Reference. |
|
![]() Copyright © 1996-2000, Oracle Corporation. All Rights Reserved. |
|