20
Tuning I/O

This chapter explains how to avoid input/output (I/O) bottlenecks that could prevent Oracle from performing at its maximum potential.

This chapter contains the following sections:

Understanding I/O Problems

The performance of many software applications is inherently limited by disk input/output (I/O). Often, CPU activity must be suspended while I/O activity completes. Such an application is said to be I/O bound. Oracle is designed so that performance is not limited by I/O.

Tuning I/O can enhance performance if a disk containing database files is operating at its capacity. However, tuning I/O cannot help performance in CPU bound cases--or cases in which your computer's CPUs are operating at their capacity.

This section introduces I/O performance issues. It covers:

Tuning I/O: Top Down and Bottom Up

When designing a new system, you should analyze I/O needs from the top down, determining what resources you require in order to achieve the desired performance.

For an existing system, you should approach I/O tuning from the bottom up:

Determine the number of disks on the system.
Determine the number of disks that are being used by Oracle.
Determine the type of I/O that your system performs.
Ascertain whether the I/Os are going to the file system or to raw devices.
Determine how to spread objects over multiple disks, using either manual striping or striping software.
Calculate the level of performance you can expect.

Analyzing I/O Requirements

This section explains how to determine your system's I/O requirements.

Calculate the total throughput your application requires.

To begin, figure out the number of reads and writes involved in each transaction, and distinguish the objects against which each operation is performed.

In an OLTP application, for example, each transaction might involve:
- 1 read from object A.
- 1 read from object B.
- 1 write to object C.
So, one transaction requires 2 reads and 1 write, all to different objects.
Define the I/O performance target for this application by specifying the number of transactions per second (tps) that the system must support.

With this example, the designer might specify that 100 tps constitutes an acceptable level of performance. To achieve this, the system must be able to perform 300 I/Os per second:
- 100 reads from object A.
- 100 reads from object B.
- 100 writes to object C.
Determine the number of disks needed to achieve this level of performance.

To do this, ascertain the number of I/Os that each disk can perform per second. This number depends on three factors:
- The speed of your particular disk hardware.
- Whether the I/Os needed are reads or writes.
- Whether you are using the file system or raw devices.
In general, disk speed tends to have the following characteristics:

Table 20-1 Relative Disk Speed

Disk Speed File System Raw Devices

Reads per second

fast

slow

Writes per second

slow

fast
Write the relative speed per operation of your disks in a chart like the one shown in Table 20-2:

Table 20-2 Disk I/O Analysis Worksheet

Disk Speed File System Raw Devices

Reads per second

Writes per second

The disks in the current example have characteristics as shown in Table 20-3:

Table 20-3 Sample Disk I/O Analysis

Disk Speed File System Raw Devices

Reads per second

50

45

Writes per second

20

50

Calculate the number of disks you need to achieve your I/O performance target using a chart like the one shown in Table 20-4:

Table 20-4 Disk I/O Requirements Worksheet

Object	If Stored on File System			If Stored on Raw Devices
Object	R/W Needed per Sec.	Disk R/W Capabil. per Sec.	Disks Needed	R/W Needed per Sec.	Disk R/W Capabil. per Sec.	Disks Needed
A
B
C
Disks Req'd

Table 20-5 shows the values from this example:

Table 20-5 Sample Disk I/O Requirements

Object	If Stored on File System			If Stored on Raw Devices
Object	R/W Needed per Sec.	Disk R/W Capabil. per Sec.	Disks Needed	R/W Needed per Sec.	Disk R/W Capabil. per Sec.	Disks Needed
A	100 reads	50 reads	2 disks	100 reads	45 reads	3 disks
B	100 reads	50 reads	2 disks	100 reads	45 reads	3 disks
C	100 writes	20 writes	5 disks	100 writes	50 writes	2 disks
Disks Req'd			9 disks			8 disks

Planning File Storage

This section explains the following:

How to determine the types of I/O operations required by your application.
How to choose between file system and raw devices for your database files.

Design Approach

Use the following approach to design file storage:

Identify the operations required by your application.
Test the performance of your system's disks and controllers for the different operations required by your application.
Finally, evaluate what kind of disk and controller layout gives you the best performance for the operations that predominate in your application.

These steps are described in detail under the following headings.

Identifying the Required Read/Write Operations

Evaluate your application to determine how often it requires each type of I/O operation (sequential read, sequential write, random read, and random write).

Table 20-6 shows the types of read and write operations performed by each of the background processes, by foreground processes, and by parallel execution servers.

Table 20-6 Read/Write Operations Performed by Oracle Processes

Operation	Process
Operation	LGWR	DBWn	ARCH	SMON	PMON	CKPT	Foreground	PQ Processes
Sequential Read			X	X		X	X	X
Sequential Write	X		X			X	X	X
Random Read				X			X
Random Write		X

In this discussion, a sample application might involve 50% random reads, 25% sequential reads, and 25% random writes.

Sequential I/O

Sequential I/O is characterized by high data rates. For example, a single DSS type I/O may access hundreds of blocks. Sequential access is efficient, because these accesses allow data prefetches and cause limited head positioning. This provides high throughputs.

Because DSS systems may not do a large number of transactions per second, it is better to estimate the size of the I/O in terms of bytes per second. For example:

(estimate # of physical blocks in transaction * Oracle block size) = byte/second

Using this value and the theoretical limits for disk and controller throughputs can help you determine the number of drives/controllers to implement.

Note:
Most disk drives can handle 50-70 I/O per second and can transfer approximately 5Mb/sec.

The goal in optimizing sequential I/O is to maximize throughput by involving the maximum number of disks in the I/O request. The more disks involved, the greater aggregate throughput. For example:

4 disks/array @ 5Mb/second = (20 Mb/second)I/O call

Random I/O

Random I/O is characterized by high I/O rate in OLTP. It requires frequent seeks with small I/O sizes.

The following example determines the application load for OLTP (gets the number and size of the transactions):

(# of blocks accessed/transaction) * (# of transactions/second) = blocks/second

Using this value and the theoretical limits for disk and controller throughputs can help you determine the number of drives per controllers to implement.

The goal in optimizing random I/O is to reduce disk hot spots and limit seek times.

Testing the Performance of Your Disks

This section illustrates relative performance of read/write operations by a particular test system.

Note:
Values provided in this example do not constitute a general rule. They were generated by an actual UNIX test system using particular disks. These figures differ significantly for different platforms and different disks! To make accurate judgments, test your own system using an approach similar to the one demonstrated in this section. Or, contact your system vendor for information on disk performance for the different operations.

Table 20-7 shows the speed of sequential read in milliseconds per I/O on a test system.

Table 20-7 Block Size and Speed of Sequential Read (Sample Data)

Block Size	Speed of Sequential Read on:
Block Size	Raw Device	UNIX File System (UFS)
512 bytes	1.4	0.4
1KB	1.4	0.3
2KB	1.5	0.6
4KB	1.6	1.0
8KB	2.7	1.5
16KB	5.1	3.7
32KB	10.1	8.1
64KB	20.0	18.0
128KB	40.4	36.1
256KB	80.7	61.3

Doing research like this helps determine the correct stripe size. In this example, it takes at most 5.3 milliseconds to read 16KB. If your data is in chunks of 256KB, then you could stripe the data over 16 disks (as described) and maintain this low read time.

By contrast, if all your data is on one disk, then read time would be 80 milliseconds. Thus, the test results show that on this particular set of disks, things look quite different from what might be expected. It is sometimes beneficial to have a smaller stripe size, depending on the size of the I/O.

Table 20-8 shows the speed of sequential write in milliseconds per I/O on the test system.

Table 20-8 Block Size and Speed of Sequential Write (Sample Data)

Block Size	Speed of Sequential Write on:
Block Size	Raw Device	UNIX File System (UFS)
512 bytes	11.2	17.9
1KB	11.7	18.3
2KB	11.6	19.0
4KB	12.3	19.8
8KB	13.5	21.8
16KB	16.0	35.3
32KB	19.3	62.2
64KB	31.5	115.1
128KB	62.5	221.8
256KB	115.6	429.0

Table 20-9 shows the speed of random read in milliseconds per I/O on the test system.

Table 20-9 Block Size and Speed of Random Read (Sample Data)

Block Size	Speed of Random Read on:
Block Size	Raw Device	UNIX File System (UFS)
512 bytes	12.3	15.5
1KB	12.0	14.1
2KB	13.4	15.0
4KB	13.9	15.3
8KB	15.4	14.4
16KB	19.1	39.7
32KB	25.7	39.9
64KB	38.1	40.2
128KB	64.3	62.2
256KB	115.7	91.2

Table 20-10 shows the speed of random write in milliseconds per I/O on the test system.

Table 20-10 Block Size and Speed of Random Write (Sample Data)

Block Size	Speed of Random Write on:
Block Size	Raw Device	UNIX File System (UFS)
512 bytes	12.3	40.7
1KB	12.0	41.4
2KB	12.6	41.6
4KB	13.8	41.4
8KB	14.8	32.8
16KB	17.7	45.6
32KB	24.8	71.6
64KB	38.0	123.8
128KB	74.4	230.3
256KB	137.4	441.5

Evaluate Disk Layout Options

Knowing the types of operations that predominate in your application and the speed with which your system can process the corresponding I/Os, you can choose the disk layout that maximizes performance.

For example, with the sample application and test system described previously, the UNIX file system is a good choice. With random reads predominating (50% of all I/O operations), 8KB is good block size. Furthermore, the UNIX file system in this example processes sequential reads (25% of all I/O operations) almost twice as fast as raw devices, given an 8KB block size.

Note:
Figures shown in the preceding example differ significantly on different platforms, and with different disks! To plan effectively, test I/O performance on your own system.

Choosing Data Block Size

Table data in the database is stored in data blocks. This section describes how to allocate space within data blocks for best performance.

With single block I/O (random read), minimize the number of reads required to retrieve the desired data. How you store the data determines whether this performance objective is achieved. It depends on two factors: storage of the rows and block size.

Tests have proven that matching the database block size to the UNIX file system (UFS) block size provides the most predictable and efficient performance. On UNIX systems, the block size of an existing file system can be determined using the df -g command.

Having the database block size greater than the UFS block size, or having the UFS block size greater than database block size, may yield inconsistent performance based on how your operating system and external I/O subsystem manage data pre-fetching and the coalescing of multiple I/Os.

See Also:
For more information, see the example in "Testing the Performance of Your Disks".

Figure 20-1 illustrates the suitability of various block sizes to online transaction processing (OLTP) or decision support (DSS) applications.

Figure 20-1 Block Size and Application Type

Block Size Advantages and Disadvantages

Table 20-11 lists the advantages and disadvantages of different block sizes.

Table 20-11 Block Size Advantages and Disadvantages

Block Size	Advantages	Disadvantages
Small (2KB-4KB)	Reduces block contention. Good for small rows, or lots of random access.	Has relatively large overhead. You may end up storing only a small number of rows, depending on the size of the row.
Medium (8KB)	If rows are medium size, then you can bring a number of rows into the buffer cache with a single I/O. With 2KB or 4KB block size, you may only bring in a single row.	Space in the buffer cache is wasted if you are doing random access to small rows and have a large block size. For example, with an 8KB block size and 50 byte row size, you are wasting 7,950 bytes in the buffer cache when doing random access.
Large (16KB-32KB)	There is relatively less overhead; thus, there is more room to store useful data. Good for sequential access or very large rows.	Large block size is not good for index blocks used in an OLTP type environment, because they increase block contention on the index leaf blocks.

Evaluating Device Bandwidth

The number of I/Os a disk can perform depends on whether the operations involve reading or writing to objects stored on raw devices or on the file system. This affects the number of disks you must use to achieve the desired level of performance.

I/O Tuning Tips

When performing I/O tuning, remember that I/O service time reported by the operating system is not necessarily the total time taken to process the I/O. The goal of tuning I/O is to minimize waits, such that the response time equals service time plus wait time.

An I/O queue consists of two stages: one stage is in the device driver, and the second stage is in the device itself. When the I/O is waiting to be serviced by the SCSI bus or disk, it is actually waiting in the device driver queue. This is the true wait queue. The time on the wait queue is the wait queue time. When the I/O is truly being service by the disk unit, then it is on the active or run queue. The time to physically process the I/O is the service time.

The goal in tuning I/O is to minimize waits and increase throughput. Disk time can encounter the following bottlenecks:

Waiting for access to the I/O bus.
Waiting for other queued disk requests.
Seek time (to find the correct track in cylinder).
Rotation time (to find the correct sector in track).
I/O transfer time.

Tips for Tuning the I/O Adapter

Spread disks over several I/O adapters, so that one I/O adapter is not overworked. Also, spread I/O bound files across several disks and several I/O adapters using appropriate RAID implementation. Striping allows single I/O requests to be serviced by several drives, and it allows multiple I/Os to occur in parallel.
Make sure that there are not too many adapters on the same system bus. Exceeding the bandwidth of the system bus creates large CPU waits.
Understand your applications. Make sure that the number of I/O operations (or the I/O size) of your application does not exceed the theoretical limit of the adapter or a given disk.

Dissecting the I/O Path

This section explains the I/O path, so that you can analyze and tune I/O bottlenecks. The I/O path follows these steps:

A user process issues an I/O call (read or write).
The I/O is placed on an available CPU's dispatch queue. An available CPU picks up the request and context switches the user process.
The CPU checks the local cache to see if the requested data block is there. If it is, then it is a cache hit, and the I/O request is complete.
If the block does not exist in the cache or main memory, then it takes a major page fault (gets page from disk), and issues an I/O call against the appropriate device driver. The device driver builds a set of SCSI commands against an I/O unit.
The operating system (device driver) sends an I/O request through system bus to I/O controller (host bus adapter).
The host bus adapter (HBA) arbitrates for bus access. When the I/O request's device is ready, it is selected, and the I/O statement is prepared to be sent to the target.
If the target unit can satisfy the request from its cache, then it transfers the data back and disconnects. This is a disk cache hit. For a cache miss, the target disconnects and tries to service the request.
The I/O request is placed in the target's queue table on its adapter where it may be sorted and merged with other I/O requests. This is possible only if the disk unit supports tag queuing.
After the I/O is picked off the queue, it is serviced by computing the physical address and seeking to the correct sector, read or write. If it is a read operation, then data is placed in the target cache. Write operations signal a completion by sending an interrupt signal.
The target controller reconnects with the I/O bus to transfer the data (with reads).
The HBA sends an interrupt to the operating system, marking the end of the I/O request.

The following table explains where the wait components lay with respect to the I/O path.

Steps 1 - 5, 11

These steps are handled by the operating system. The time required for these operations is limited by access time to the HBA. Slowness can be attributed to CPU contention or I/O bus contention. A heavily loaded CPU is not able to service an I/O request. Review vmstat statistics for runnable process, high system time, and excessive context switch.

Steps 5, 6, 10, 11

These two steps involve the I/O adapter. A faster adapter propagates and manages I/O requests faster. An overloaded I/O bus may cause I/O requests to process slowly. Review IOSTAT statistics for high percent busy combined with large AVWAIT (or AVSERV, if available).

Steps 7 - 9

These two steps are handled by the disk drive. Limiting factors can be seek times, rotational delays, and data transfer times. These disk operations are mechanical in nature; therefore, they consume the largest chunk of time in an I/O call. Newer disks have improved seek times, rotational speeds, and data transfer rates.
Mechanical time is considered wasted time, because no data (productive work) is transferred during this time. The goal is to minimize this time by acquiring disks with larger caches and by using disks with tag queuing.
To minimize the mechanical overhead of an I/O, spread the I/O request across several disks using an appropriate stripe size under a RAID implementation. An incorrect stripe size can cause hot disks or multiple physical I/Os per logical I/O.
Additionally, use a raw interface (raw devices) or direct I/O when possible. Raw devices allow unbuffered I/O and can utilize kernalized asynchronous I/O. Raw interfaces can also be implemented using a volume manager. Finally, check to see if your operating system provides direct I/O support on file system-based files. Direct I/O has proven to be helpful for I/Os that involve sequential reads and writes.

See Also:
For more information on removing I/O contention, see "Reducing Disk Contention by Distributing I/O".

Detecting I/O Problems

This section describes two tasks to perform if you suspect a problem with I/O usage:

Oracle compiles file I/O statistics that reflect disk access to database files. These statistics report only the I/O utilization of Oracle sessions--yet every process affects the available I/O resources. Tuning non-Oracle factors can thus improve performance.

Checking System I/O Utilization

Use operating system monitoring tools to determine what processes are running on the system as a whole, and to monitor disk access to all files. Remember that disks holding datafiles and redo log files may also hold files that are not related to Oracle. Try to reduce any heavy access to disks that contain database files. Access to non-Oracle files can be monitored only through operating system facilities rather than through the V$FILESTAT view.

Tools, such as sar -d, on many UNIX systems let you examine the I/O statistics for your entire system. (Some UNIX-based platforms have an iostat command.) On NT systems, use Performance Monitor.

Checking Oracle I/O Utilization

This section identifies the views and processes that provide Oracle I/O statistics. It also shows how to check statistics using V$FILESTAT.

Dynamic Performance Views for I/O Statistics

Table 20-12 shows dynamic performance views to check for I/O statistics relating to Oracle database files, log files, archive files, and control files.

Table 20-12 Where to Find Statistics about Oracle Files

File Type	Where to Find Statistics
Database Files	V$FILESTAT, V$SYSTEM_EVENT, V$SESSION_EVENT
Log Files	V$SYSSTAT, V$SYSTEM_EVENT, V$SESSION_EVENT
Archive Files	V$SYSTEM_EVENT, V$SESSION_EVENT
Control Files	V$SYSTEM_EVENT, V$SESSION_EVENT

Table 20-13 lists which file types processes write to.

Table 20-13 File Throughput Statistics for Oracle Processes

File	Process
File	LGWR	DBWn	ARCH	SMON	PMON	CKPT	Foreground	PQ Process
Database Files		X		X	X	X	X	X
Log Files	X
Archive Files			X
Control Files	X	X	X	X	X	X	X	X

V$SYSTEM_EVENT can be queried by event to show the total number of I/Os and average duration by type of I/O (read/write). With this, you can determine which types of I/O are too slow. If there are Oracle-related I/O problems, then tune them. But, if your process is not consuming the available I/O resources, then some other process is. Go back to the system to identify the process that is using up so much I/O, and determine why. Then tune this process.

Note:
Different types of I/O in Oracle require different tuning approaches. Tuning I/O for data warehousing applications that perform large sequential reads is different from tuning I/O for OLTP applications that perform random reads and writes.

See Also:
"Planning File Storage".

Checking Oracle Datafile I/O with V$FILESTAT

Examine disk access to database files through the dynamic performance view V$FILESTAT. This view shows the following information for database I/O (but not for log file I/O):

Number of physical reads and writes.
Number of blocks read and written.
Total I/O time for reads and writes.

By default, this view is available only to the user SYS and to users granted SELECT ANY TABLE system privilege, such as SYSTEM. The following column values reflect the number of disk accesses for each datafile:

PHYWRTS

The number of writes to each database file.

Use the following query to monitor these values over some period of time while your application is running:

SELECT NAME, PHYRDS, PHYWRTS
FROM V$DATAFILE df, V$FILESTAT fs
WHERE df.FILE# = fs.FILE#;

This query also retrieves the name of each datafile from the dynamic performance view V$DATAFILE. Sample output might look like this:

NAME                                             PHYRDS    PHYWRTS
-------------------------------------------- ---------- ----------
/oracle/ora70/dbs/ora_system.dbf                   7679       2735
/oracle/ora70/dbs/ora_temp.dbf                       32        546

The PHYRDS and PHYWRTS columns of V$FILESTAT can also be obtained through SNMP.

The total I/O for a single disk is the sum of PHYRDS and PHYWRTS for all the database files managed by the Oracle instance on that disk. Determine this value for each of your disks. Also, determine the rate at which I/O occurs for each disk by dividing the total I/O by the interval of time over which the statistics were collected.

Note:
Although Oracle records read and write times accurately, a database that is running on UFS may not reflect true disk accesses. For example, the read times may not always reflect a true disk read, but rather a UFS cache hit. However, read and write times should be accurate for raw devices. Additionally, write times are only recorded per batch, with all blocks in the same batch given the same time after the completion of the write I/O.

Solving I/O Problems

The rest of this chapter describes various techniques of solving I/O problems:

Reducing Disk Contention by Distributing I/O

This section describes how to reduce disk contention.

What Is Disk Contention?

Disk contention occurs when multiple processes try to access the same disk simultaneously. Most disks have limits on both the number of accesses and the amount of data they can transfer per second. When these limits are reached, processes may have to wait to access the disk.

In general, consider the statistics in the V$FILESTAT view and your operating system facilities. Consult your hardware documentation to determine the limits on the capacity of your disks. Any disks operating at or near full capacity are potential sites for disk contention. For example, 60 or more I/Os per second may be excessive for some disks on VMS or UNIX operating systems.

In addition, review V$SESSION_EVENT for the following events: db file sequential read, db file scattered read, db file single write, and db file parallel write. These are all events corresponding to I/Os performed against the data file headers, control files, or data files. If any of these wait events correspond to high Average Time, then investigate the I/O contention using sar or iostat. Look for busy waits on the device. Examine the file statistics to determine which file is associated with the high I/O.

To reduce the activity on an overloaded disk, move one or more of its heavily accessed files to a less active disk. Apply this principle to each of your disks until they all have roughly the same amount of I/O. This is known as distributing I/O.

Separating Datafiles and Redo Log Files

Oracle processes constantly access datafiles and redo log files. If these files are on common disks, then there is potential for disk contention. Place each datafile on a separate disk. Multiple processes can then access different files concurrently without disk contention.

Place each set of redo log files on a separate disk with no other activity. Redo log files are written by the Log Writer process (LGWR) when a transaction is committed. Information in a redo log file is written sequentially. This sequential writing can take place much faster if there is no concurrent activity on the same disk. Dedicating a separate disk to redo log files usually ensures that LGWR runs smoothly with no further tuning attention. Performance bottlenecks related to LGWR are rare.

See Also:
For information on tuning LGWR, see the section "Detecting Contention for Redo Log Buffer Latches".

Dedicating separate disks to datafiles and mirroring redo log files are important safety precautions. These steps ensure that the datafiles and the redo log files cannot both be lost in a single disk failure. Mirroring redo log files ensures that a redo log file cannot be lost in a single disk failure.

In order to prevent I/O contention between the archiver process and LGWR (when using multi-membered groups), make sure that archiver reads and LGWR writes are separated. For example, if your system has 4 groups with 2 members, then the following scenario should used to separate disk access:

4 groups x 2 members each = 8 logfiles labeled: 1a, 1b, 2a, 2b, 3a, 3b, 4a, 4b.

This requires at least 4 disks, plus one disk for archived files.

Figure 20-2 illustrates how redo members should be distributed across disks to minimize contention.

Figure 20-2 Distributing Redo Members Across Disks

In this example, LGWR switched out of log group 1 (member 1a and 1b) and is now writing to log group2 (2a and 2b). Concurrently, the archiver process is reading from the group 1 and writing to its archive destination. Note how the redo log files are isolated from contention.

Note:
Mirroring redo log files, or maintaining multiple copies of each redo log file on separate disks, does not slow LGWR considerably. LGWR writes to each disk in parallel and waits until each part of the parallel write is complete. Because the time required to perform a single-disk write may vary, increasing the number of copies increases the likelihood that one of the single-disk writes in the parallel write takes longer than average. A parallel write does not take longer than the longest possible single-disk write. There may also be some overhead associated with parallel writes on your operating system.

Striping Table Data

Striping, or spreading a large table's data across separate datafiles on separate disks, can also help to reduce contention.

See Also:
This strategy is fully discussed in the section "Striping Disks".

Separating Tables and Indexes

It is not necessary to separate a frequently used table from its index. During the course of a transaction, the index is read first, and then the table is read. Because these I/Os occur sequentially, the table and index can be stored on the same disk without contention. However, for very high OLTP systems, separating indexes from tables may be required.

Split indexes and tables into separate tablespaces to minimize disk head movement and parallelize I/O. Both reads happen faster because one disk head is on the index data and the other is on the table data.

The idea of separating objects accessed simultaneously applies to indexes as well. For example, if a SQL statement uses two indexes at the same time, then performance is improved by having each index on a separate disk.

Also, avoid having several heavily accessed tables on the same disk. This requires strong knowledge of the application access patterns.

The use of partitioned tables and indexes can improve performance of operations in a data warehouse. Divide a large table or index into multiple physical segments residing in different tablespaces. All tables that contain large object datatypes should be placed into a separate tablespace as well.

Reducing Disk I/O Unrelated to Oracle

If possible, eliminate I/O unrelated to Oracle on disks that contain database files. This measure is especially helpful in optimizing access to redo log files. Not only does this reduce disk contention, it also allows you to monitor all activity on such disks through the dynamic performance table V$FILESTAT.

Striping Disks

This section describes:

Purpose of Striping

Striping divides a large table's data into small portions and stores these portions in separate datafiles on separate disks. This permits multiple processes to access different portions of the table concurrently without disk contention. Striping is particularly helpful in optimizing random access to tables with many rows. Striping can either be done manually (described below), or through operating system striping utilities.

I/O Balancing and Striping

Benchmark tuners in the past tried hard to ensure that the I/O load was evenly balanced across the available devices. Currently, operating systems are providing the ability to stripe a heavily used container file across many physical devices. However, such techniques are productive only where the load redistribution eliminates or reduces some form of queue.

If the wait service time exists, along with high percentage busy on a drive, then I/O distribution may be required. Where larger numbers of physical drives are available, consider dedicating two drives to carrying redo logs (two because redo logs should always be mirrored either by the operating system or using Oracle redo log group features). Because redo logs are written serially, drives dedicated to redo log activity normally require limited head movement. This significantly accelerates log writing.

When archiving, it is beneficial to use extra disks so that LGWR and ARCH do not compete for the same read/write head. This is achieved by placing logs on alternating drives.

Mirroring can also be a cause of I/O bottlenecks. The process of writing to each mirror is normally done in parallel, and does not cause a bottleneck. However, if each mirror is striped differently, then the I/O is not completed until the slowest mirror member is finished. To avoid I/O problems, stripe using the same number of disks for the destination database, or the copy, as you used for the source database.

For example, if you have 160KB of data striped over 8 disks, but the data is mirrored onto only one disk, then regardless of how quickly the data is processed on the 8 disks, the I/O is not completed until 160KB has been written onto the mirror disk. It might thus take 20.48 milliseconds to write the database, but 137 milliseconds to write the mirror.

Striping Disks Manually

To stripe disks manually, you need to relate an object's storage requirements to its I/O requirements.

Begin by evaluating an object's disk storage requirements by checking:
- The size of the object.
- The size of the disk.
For example, if an object requires 5GB in Oracle storage space, then you need one 5GB disk or two 4GB disks to accommodate it. On the other hand, if the system is configured with 1GB or 2GB disks, then the object may require 5 or 3 disks, respectively.
Compare to this the application's I/O requirements, as described in "Analyzing I/O Requirements". You must take the larger of the storage requirement and the I/O requirement.

For example, if the storage requirement is 5 disks (1GB each), and the I/O requirement is 2 disks, then your application requires the higher value: 5 disks.

Create a tablespace with the CREATE TABLESPACE statement. Specify the datafiles in the DATAFILE clause. Each of the files should be on a different disk. For example:

CREATE TABLESPACE stripedtabspace
   DATAFILE 'file_on_disk_1' SIZE 1GB,
      'file_on_disk_2' SIZE 1GB,
      'file_on_disk_3' SIZE 1GB,
      'file_on_disk_4' SIZE 1GB,
      'file_on_disk_5' SIZE 1GB;

Then, create the table with the CREATE TABLE statement. Specify the newly created tablespace in the TABLESPACE clause.

Also specify the size of the table extents in the STORAGE clause. Store each extent in a separate datafile. The table extents should be slightly smaller than the datafiles in the tablespace to allow for overhead. For example, when preparing for datafiles of 1GB (1024MB), you can set the table extents to be 1023MB. For example:

CREATE TABLE stripedtab (
   col_1  NUMBER(2),
   col_2  VARCHAR2(10) )
   TABLESPACE stripedtabspace
   STORAGE ( INITIAL 1023MB  NEXT 1023MB
      MINEXTENTS 5  PCTINCREASE 0 );

(Alternatively, you can stripe a table by entering an ALTER TABLE ALLOCATE EXTENT statement with DATAFILE 'datafile' SIZE 'size'.)

These steps result in the creation of table STRIPEDTAB. STRIPEDTAB has 5 initial extents, each of size 1023MB. Each extent takes up one of the datafiles named in the DATAFILE clause of the CREATE TABLESPACE statement. Each of these files is on a separate disk. The 5 extents are all allocated immediately, because MINEXTENTS is 5.

Striping Disks with Operating System Software

As an alternative to striping disks manually, use operating system utilities or third-party tools, such as logical volume managers, or use hardware-based striping.

With utilities or hardware-based striping mechanisms, the main factors to consider are stripe size, number of disks to stripe across (which defines the stripe width), and the level of concurrency (or level of I/O activity). These factors are affected by the Oracle block size and the database access methods.

Table 20-14 Minimum Stripe Size

Disk Access	Minimum Stripe Size
Random reads and writes	The minimum stripe size is twice the Oracle block size.
Sequential reads	The minimum stripe size is twice the value of `DB_FILE_MULTIBLOCK_READ_COUNT`.

Table 20-15 Typical Stripe Size

Concurrency	I/O Size	Typical Stripe Size
Low	Small	k * DB_BLOCK_SIZE
Low	Large	k * DB_BLOCK_SIZE
High	Small	k * DB_BLOCK_SIZE
High	Large	k * DB_BLOCK_SIZE * DB_FILE_MULTI_BLOCK_READ_COUNT

Where k = 2,3,4...

In striping, uniform access to the data is assumed. If the stripe size is too large, then a hot spot may appear on one disk or on a small number of disks. Avoid this by reducing the stripe size, thus spreading the data over more disks.

Consider an example in which 100 rows of fixed size are evenly distributed over 5 disks, with each disk containing 20 sequential rows. If your application only requires access to rows 35 through 55, then only 2 disks must perform the I/O. At a high rate of concurrency, the system may not be able to achieve the desired level of performance.

Correct this problem by spreading rows 35 through 55 across more disks. In the current example, if there were two rows per block, then we could place rows 35 and 36 on the same disk, and rows 37 and 38 on a different disk. Taking this approach, we could spread the data over all the disks and I/O throughput would improve.

Striping and RAID

Redundant arrays of inexpensive disks (RAID) configurations provide improved data reliability. However, I/O performance depends on which RAID configuration is implemented.

Below are the most widely used RAID configurations:

RAID 1: Provides good reliability and good read rates; however, writes may be costly.
RAID 0+1: Provides good reliability and better read and write performance than RAID 1.

RAID 5: Provides good reliability. Sequential reads benefit the most. Writes performance may suffer with RAID 5. This configuration is not recommended for heavy write applications.

Note:
Although RAID 0 provides the best read and write performance, it is not a true RAID system, because it does not allow for redundancy. Oracle recommends that you do not place production database files on RAID 0 systems.

Optimal stripe size is a function of three things:

Size of I/O requests to the array.
Concurrency of I/O requests to the array.
The physical stripe boundaries matching the block size boundaries.

Striping is a good tool for balancing I/O across two or more disks in an array. However, keep in mind the following techniques:

On high concurrency arrays, you must ensure that no single I/O request gets broken up into more than one physical I/O call. Failing to do this multiplies the number of physical I/O requests performed in your system, which in turn causes exponential degradation in your system I/O response times.
On low concurrency arrays, you must ensure that no single I/O visits any disk twice. To fail here causes the same performance penalty as the one described above.

Avoiding Dynamic Space Management

When you create an object, such as a table or rollback segment, Oracle allocates space in the database for the data. This space is called a segment. If subsequent database operations cause the data volume to increase and exceed the space allocated, then Oracle extends the segment. Dynamic extension then reduces performance.

This section discusses:

Detecting Dynamic Extension

Dynamic extension causes Oracle to execute SQL statements in addition to those SQL statements issued by user processes. These SQL statements are called recursive calls because Oracle issues these statements itself. Recursive calls are also generated by these activities:

Misses on the data dictionary cache.
Firing of database triggers.
Execution of Data Definition Language (DDL) statements.
Execution of SQL statements within stored procedures, functions, packages, and anonymous PL/SQL blocks.
Enforcement of referential integrity constraints.

Examine the RECURSIVE CALLS statistic through the dynamic performance view V$SYSSTAT. By default, this view is available only to user SYS and to users granted the SELECT ANY TABLE system privilege, such as SYSTEM. Use the following query to monitor this statistic over a period of time:

SELECT NAME, VALUE
FROM V$SYSSTAT
WHERE NAME = 'recursive calls';

Oracle responds with something similar to the following:

NAME                                                         VALUE
------------------------------------------------------- ----------
recursive calls                                             626681

If Oracle continues to make excessive recursive calls while your application is running, then determine whether these recursive calls are due to an activity, other than dynamic extension, that generates recursive calls. If you determine that the recursive calls are caused by dynamic extension, then reduce this extension by allocating larger extents.

Allocating Extents

Follow these steps to avoid dynamic extension:

Determine the maximum size of your object.
Choose storage parameter values so that Oracle allocates extents large enough to accommodate all your data when you create the object.

Larger extents tend to benefit performance for the following reasons:

Blocks in a single extent are contiguous, so one large extent is more contiguous than multiple small extents. Oracle can read one large extent from disk with fewer multiblock reads than would be required to read many small extents. Therefore, make sure that the extent size is a multiple of DB_FILE_MULTI_BLOCK_READ_COUNT.
Segments with larger extents are less likely to be extended.

However, because large extents require more contiguous blocks, Oracle may have difficulty finding enough contiguous space to store them. To determine whether to allocate only a few large extents or many small extents, evaluate the benefits and drawbacks of each in consideration of plans for the growth and use of your objects.

Automatically re-sizable datafiles can also cause problems with dynamic extension. Instead, manually allocate more space to a datafile during times when the system is relatively inactive.

Evaluating Unlimited Extents

Even though an object may have unlimited extents, this does not mean that having a large number of small extents is acceptable. For optimal performance you may decide to reduce the number of extents.

Extent maps list all extents for a particular segment. The number of extents entries per Oracle block depends on operating system block size and platform. Although an extent is a data structure inside Oracle, the size of this data structure depends on the platform. Accordingly, this affects the number of extents Oracle can store in a single operating system block. Typically, this value is as follows:

Table 20-16 Block Size and Maximum Number of Extents (Typical Values)

Block Size (KB)	Maximum Number of Extents
2	121
4	255
8	504
16	1032
32	2070

For optimal performance, you should be able to read the extent map with a single I/O. Performance degrades if multiple I/Os are necessary for a full table scan to get the extent map.

Avoid dynamic extension in dictionary-mapped tablespaces. For dictionary-mapped tablespaces, do not let the number of extents exceed 1,000. If extent allocation is local, then do not have more than 2,000 extents. Having too many extents reduces performance when dropping or truncating tables.

Evaluating Multiple Extents

This section explains various ramifications of using multiple extents.

You cannot put large segments into single extents, because of file size and file system size limitations. When you enable segments to allocate new extents over time, you can take advantage of faster, less expensive disks.
For a table that is never full-table scanned, it makes no difference in terms of query performance whether the table has one extent or multiple extents.
The performance of searches using an index is not affected by the index having one extent or multiple extents.
Using more than one extent in a table, cluster, or temporary segment does not affect the performance of full scans on a multi-user system.
Using more than one extent in a table, cluster, or temporary segment does not materially affect the performance of full scans on a dedicated single-user batch processing system, if the extents are properly sized and if the application is designed to avoid expensive DDL operations.
If extent sizes are appropriately matched to the I/O size, then the performance cost of having many extents in a segment is minimized.
For rollback segments, many extents are preferable to few extents. Having many extents reduces the number of recursive SQL calls to perform dynamic extent allocations on the segments.

Avoiding Dynamic Space Management in Rollback Segments

The size of rollback segments can affect performance. Rollback segment size is determined by the rollback segment's storage parameter values. Your rollback segments must be large enough to hold the rollback entries for your transactions. As with other objects, you should avoid dynamic space management in rollback segments.

Use the SET TRANSACTION statement to assign transactions to rollback segments of the appropriate size based on the recommendations in the following sections. If you do not explicitly assign a transaction to a rollback segment, then Oracle automatically assigns it to a rollback segment.

For example, the following statement assigns the current transaction to the rollback segment OLTP_13:

SET TRANSACTION USE ROLLBACK SEGMENT oltp_13

Note:
If you are running multiple concurrent copies of the same application, then be careful not to assign the transactions for all copies to the same rollback segment. This leads to contention for that rollback segment.

Also, monitor the shrinking, or dynamic deallocation, of rollback segments based on the OPTIMAL storage parameter.

See Also:
For information on choosing values for this parameter, monitoring rollback segment shrinking, and adjusting the OPTIMAL parameter, see Oracle8i Administrator's Guide.

For Long Queries

Assign large rollback segments to transactions that modify data that is concurrently selected by long queries. Such queries may require access to rollback segments to reconstruct a read-consistent version of the modified data. The rollback segments must be large enough to hold all the rollback entries for the data while the query is running.

For Long Transactions

Assign large rollback segments to transactions that modify large amounts of data. A large rollback segment can improve the performance of such a transaction, because the transaction generates large rollback entries. If a rollback entry does not fit into a rollback segment, then Oracle extends the segment. Dynamic extension reduces performance and should be avoided whenever possible.

For OLTP Transactions

OLTP applications are characterized by frequent concurrent transactions, each of which modifies a small amount of data. Assign OLTP transactions to small rollback segments, provided that their data is not concurrently queried. Small rollback segments are more likely to remain stored in the buffer cache where they can be accessed quickly. A typical OLTP rollback segment might have 2 extents, each approximately 10 kilobytes in size. To best avoid contention, create many rollback segments and assign each transaction to its own rollback segment.

Reducing Migrated and Chained Rows

If an UPDATE statement increases the amount of data in a row so that the row no longer fits in its data block, then Oracle tries to find another block with enough free space to hold the entire row. If such a block is available, then Oracle moves the entire row to the new block. This is called migrating a row. If the row is too large to fit into any available block, then Oracle splits the row into multiple pieces and stores each piece in a separate block. This is called chaining a row. Rows can also be chained when they are inserted.

Dynamic space management, especially migration and chaining, is detrimental to performance:

UPDATE statements that cause migration and chaining perform poorly.
Queries that select migrated or chained rows must perform more I/O.

Identify migrated and chained rows in a table or cluster using the ANALYZE statement with the LIST CHAINED ROWS clause. This statement collects information about each migrated or chained row and places this information into a specified output table.

The definition of a sample output table named CHAINED_ROWS appears in a SQL script available on your distribution medium. The common name of this script is UTLCHN1.SQL, although its exact name and location varies depending on your platform. Your output table must have the same column names, datatypes, and sizes as the CHAINED_ROWS table.

You can also detect migrated or chained rows by checking the TABLE FETCH CONTINUED ROW column in V$SYSSTAT. Increase PCTFREE to avoid migrated rows. If you leave more free space available in the block, then the row has room to grow. You can also reorganize or re-create tables and indexes with high deletion rates.

Note:
PCTUSED is not the opposite of PCTFREE; PCTUSED controls space management.

To reduce migrated and chained rows in an existing table, follow these steps:

Use the ANALYZE statement to collect information about migrated and chained rows. For example:
```
ANALYZE TABLE order_hist LIST CHAINED ROWS;
```

Query the output table:

SELECT *
FROM CHAINED_ROWS
WHERE TABLE_NAME = 'ORDER_HIST';

OWNER_NAME  TABLE_NAME  CLUST... HEAD_ROWID          TIMESTAMP
----------  ----------  -----... ------------------  ---------
SCOTT       ORDER_HIST       ... AAAAluAAHAAAAA1AAA  04-MAR-96
SCOTT       ORDER_HIST       ... AAAAluAAHAAAAA1AAB  04-MAR-96
SCOTT       ORDER_HIST       ... AAAAluAAHAAAAA1AAC  04-MAR-96

The output lists all rows that are either migrated or chained.

If the output table shows that you have many migrated or chained rows, then you can eliminate migrated rows with the following steps:

Create an intermediate table with the same columns as the existing table to hold the migrated and chained rows:

CREATE TABLE int_order_hist
   AS SELECT *
      FROM order_hist
      WHERE ROWID IN
         (SELECT HEAD_ROWID
            FROM CHAINED_ROWS
            WHERE TABLE_NAME = 'ORDER_HIST');

Delete the migrated and chained rows from the existing table:

DELETE FROM order_hist
   WHERE ROWID IN
      (SELECT HEAD_ROWID
         FROM CHAINED_ROWS
         WHERE TABLE_NAME = 'ORDER_HIST');

Insert the rows of the intermediate table into the existing table:
```
INSERT INTO order_hist
```
```
SELECT *
FROM int_order_hist;
```
Drop the intermediate table:
```
DROP TABLE int_order_history;
```

Delete the information collected in step 1 from the output table:

DELETE FROM CHAINED_ROWS

WHERE TABLE_NAME = 'ORDER_HIST';

Use the ANALYZE statement again and query the output table.

Any rows that appear in the output table are chained. You can eliminate chained rows only by increasing your data block size. It may not be possible to avoid chaining in all situations. Chaining is often unavoidable with tables that have a LONG column or long CHAR or VARCHAR2 columns.

Retrieval of migrated rows is resource intensive; therefore, all tables subject to UPDATE should have their distributed free space set to allow enough space within the block for the likely update.

Modifying the SQL.BSQ File

The SQL.BSQ file runs when you issue the CREATE DATABASE statement. This file contains the table definitions that make up the Oracle server. The views you use as a DBA are based on these tables. Oracle recommends that you strictly limit modifications to SQL.BSQ.

If necessary, you can increase the value of the following storage parameters: INITIAL, NEXT, MINEXTENTS, MAXEXTENTS, PCTINCREASE, FREELISTS, FREELIST GROUPS, and OPTIMAL.

See Also:
For complete information about these parameters, see Oracle8i SQL Reference.
With the exception of PCTINCREASE, do not decrease the setting of a storage parameter to a value below the default. (If the value of MAXEXTENTS is large, then you can lower the value for PCTINCREASE or even set it to zero.)

No other changes to SQL.BSQ are supported. In particular, you should not add, drop, or rename a column.

Note:
Oracle may add, delete, or change internal data dictionary tables from release to release. For this reason, modifications you make are not carried forward when the dictionary is migrated to later releases.

Using Locally-Managed Tablespaces

A tablespace that manages its own extents maintains a bitmap in each datafile to keep track of the free or used status of blocks in that datafile. Each bit in the bitmap corresponds to a block or a group of blocks. When an extent is allocated or freed for reuse, Oracle changes the bitmap values to show the new status of the blocks. These changes do not generate rollback information, because they do not update tables in the data dictionary (except for special cases such as tablespace quota information).

Locally-managed tablespaces have the following advantages over dictionary-managed tablespaces:

Local management of extents avoids recursive space management operations, which can occur in dictionary-managed tablespaces if consuming or releasing space in an extent results in another operation that consumes or releases space in a rollback segment or data dictionary table.
Local management of extents automatically tracks adjacent free space, eliminating the need to coalesce free extents.

The sizes of extents that are managed locally can be determined automatically by the system. Alternatively, all extents can have the same size in a locally-managed tablespace.

See Also:
For more information on locally-managed tablespaces, see Oracle8i Concepts and Oracle8i Administrator's Guide. For more information on the statements for specifying space management, see Oracle8i SQL Reference.

Tuning Sorts

There is a trade-off between performance and memory usage. For best performance, most sorts should occur in memory; sorts written to disk adversely affect performance. If the sort area size is too large, then too much memory may be used. If the sort area size is too small, then sorts may need to be written to disk which, as, mentioned, can severely degrade performance.

This section describes:

Sorting to Memory

The default sort area size is adequate to hold all the data for most sorts. However, if your application often performs large sorts on data that does not fit into the sort area, then you may want to increase the sort area size. Large sorts can be caused by any SQL statement that performs a sort on many rows.

Recognizing Large Sorts

Oracle collects statistics that reflect sort activity and stores them in the dynamic performance view V$SYSSTAT. By default, this view is available only to the user SYS and to users granted the SELECT ANY TABLE system privilege. These statistics reflect sort behavior:

SORTS(MEMORY)

The number of sorts small enough to be performed entirely in sort areas without I/O to temporary segments on disk.

SORTS(DISK)

The number of sorts too large to be performed entirely in the sort area, requiring I/O to temporary segments on disk.

Use the following query to monitor these statistics over time:

SELECT NAME, VALUE
FROM V$SYSSTAT
WHERE NAME IN ('SORTS (MEMORY)', 'SORTS (DISK)');

The output of this query might look like this:

NAME                                                         VALUE
------------------------------------------------------- ----------
SORTS(MEMORY)                                                  965
SORTS(DISK)                                                      8

The information in V$SYSSTAT can also be obtained through the Simple Network Management Protocol (SNMP).

Increasing SORT_AREA_SIZE to Avoid Sorting to Disk

SORT_AREA_SIZE is a dynamically modifiable initialization parameter that specifies the maximum amount of memory to use for each sort. If a significant number of sorts require disk I/O to temporary segments, then your application's performance may benefit from increasing the size of the sort area. In this case, increase the value of SORT_AREA_SIZE.

The maximum value of this parameter depends on your operating system. You need to determine what size SORT_AREA_SIZE makes sense. If you set SORT_AREA_SIZE to an adequately large value, then most sorts should not need to go to disk (unless, for example, you are sorting a 10-gigabyte table).

Performance Benefits of Large Sort Areas

As mentioned, increasing sort area size decreases the chances that sorts go to disk. Therefore, with a larger sort area, most sorts process quickly without I/O.

When Oracle writes sort operations to disk, it writes out partially sorted data in sorted runs. After all the data has been received by the sort, Oracle merges the runs to produce the final sorted output. If the sort area is not large enough to merge all the runs at once, then subsets of the runs are merged in several merge passes. If the sort area is larger, then there are fewer, longer runs produced. A larger sort area also means the sort can merge more runs in one merge pass.

Performance Trade-offs for Large Sort Areas

Increasing sort area size causes each Oracle sort process to allocate more memory. This increase reduces the amount of memory for private SQL and PL/SQL areas. It can also affect operating system memory allocation and may induce paging and swapping. Before increasing the size of the sort area, be sure enough free memory is available on your operating system to accommodate a larger sort area.

If you increase sort area size, then consider decreasing the value for the SORT_AREA_RETAINED_SIZE parameter. This parameter controls the lower limit to which Oracle reduces the size of the sort area when Oracle completes some or all of a sort process. That is, Oracle reduces the size of the sort area after the sort has started sending the sorted data to the user or to the next part of the query. A smaller retained sort area reduces memory usage but causes additional I/O to write and read data to and from temporary segments on disk.

Sorting to Disk

Sort writes to disk directly bypass the buffer cache. If you sort to disk, then make sure that PCTINCREASE is set to zero for the tablespace used for sorting. Also, INITIAL and NEXT should be the same size. This reduces fragmentation of the tablespaces used for sorting. You set these parameters using the STORAGE clause of ALTER TABLE.

See Also:
For more information on PCTINCREASE, see Oracle8i Concepts.

Optimizing Sort Performance with Temporary Tablespaces

Optimize sort performance by performing sorts in temporary tablespaces. To create temporary tablespaces, use the CREATE TABLESPACE or ALTER TABLESPACE statements with the TEMPORARY keyword.

Normally, a sort may require many space allocation calls to allocate and deallocate temporary segments. If you specify a tablespace as TEMPORARY, then Oracle caches one sort segment in that tablespace for each instance requesting a sort operation. This scheme bypasses the normal space allocation mechanism and greatly improves performance of medium-sized sorts that cannot be done completely in memory.

You cannot use the TEMPORARY keyword with tablespaces containing permanent objects such as tables or rollback segments.

Striping Temporary Tablespaces

Stripe the temporary tablespace over many disks, preferably using an operating system striping tool. For example, if you only stripe the temporary tablespace over 2 disks with a maximum of 50 I/Os per second on each disk, then Oracle can only perform 100 I/Os per second. This restriction could lengthen the duration of sort operations.

For the previous example, you could accelerate sort processing fivefold if you striped the temporary tablespace over 10 disks. This would enable 500 I/Os per second.

Using SORT_MULTIBLOCK_READ_COUNT

Another way to improve sort performance using temporary tablespaces is to tune the parameter SORT_MULTIBLOCK_READ_COUNT. For temporary segments, SORT_MULTIBLOCK_READ_COUNT has nearly the same effect as the parameter DB_FILE_MULTIBLOCK_READ_COUNT.

Increasing the value of SORT_MULTIBLOCK_READ_COUNT forces the sort process to read a larger section of each sort run from disk to memory during each merge pass. This also forces the sort process to reduce the merge width, or number of runs, that can be merged in one merge pass. This may increase in the number of merge passes.

Because each merge pass produces a new sort run to disk, an increase in the number of merge passes causes an increase in the total amount of I/O performed during the sort. Carefully balance increases in I/O throughput obtained by increasing the SORT_MULTIBLOCK_READ_COUNT parameter with possible increases in the total amount of I/O performed.

Using NOSORT to Create Indexes Without Sorting

One cause of sorting is the creation of indexes. Creating an index for a table involves sorting all rows in the table based on the values of the indexed columns. Oracle also allows you to create indexes without sorting. If the rows in the table are loaded in ascending order, then you can create the index faster without sorting.

The NOSORT Clause

To create an index without sorting, load the rows into the table in ascending order of the indexed column values. Your operating system may provide a sorting utility to sort the rows before you load them. When you create the index, use the NOSORT clause on the CREATE INDEX statement. For example, this CREATE INDEX statement creates the index EMP_INDEX on the ENAME column of the emp table without sorting the rows in the EMP table:

CREATE INDEX emp_index
   ON emp(ename)
   NOSORT;

Note:
Specifying NOSORT in a CREATE INDEX statement negates the use of PARALLEL INDEX CREATE, even if PARALLEL (DEGREE n) is specified.

When to Use the NOSORT Clause

Presorting your data and loading it in order may not always be the fastest way to load a table.

If you have a multiple-CPU computer, then you may be able to load data faster using multiple processors in parallel, each processor loading a different portion of the data. To take advantage of parallel processing, load the data without sorting it first. Then create the index without the NOSORT clause.
If you have a single-CPU computer, then you should sort your data before loading, if possible. Then create the index with the NOSORT clause.

GROUP BY NOSORT

Sorting can be avoided when performing a GROUP BY operation when you know that the input data is already ordered, so that all rows in each group are clumped together. This may be the case if the rows are being retrieved from an index that matches the grouped columns, or if a sort-merge join produces the rows in the right order. ORDER BY sorts can be avoided in the same circumstances. When no sort takes place, the EXPLAIN PLAN output indicates GROUP BY NOSORT.

Tuning Checkpoint Activity

Checkpointing is an operation that Oracle performs automatically. This section explains the following:

How Checkpoints Affect Performance

Aggressive checkpointing will write dirty buffers to the datafiles more quickly and can reduce instance recovery time in the event of an instance failure. If checkpointing is fairly aggressive, then replaying the redo records in the redo log between the current checkpoint position and the end of the log involves processing relatively few data blocks. This means that the roll-forward phase of recovery will be fairly short.

However, aggressive checkpointing can reduce run-time performance, because checkpointing causes DBWn processes to perform I/O. The overhead associated with checkpointing is usually small.

Adjusting Checkpointing Activity

Adjust your checkpointing activity based on your performance concerns. If you are more concerned with efficient run-time performance than recovery time, then set checkpointing to be less aggressive.

If you are more concerned with having fast instance recovery than with achieving optimal run-time performance, then increase the checkpointing interval.

Checkpointing behavior can be influenced by the following parameters:

Set the value of the LOG_CHECKPOINT_INTERVAL initialization parameter (in multiples of physical block size) to be larger than the size of your largest redo log file.
Set the value of the LOG_CHECKPOINT_TIMEOUT initialization parameter to zero. This value eliminates time-based checkpoints.
Set the value of FAST_START_IO_TARGET to zero to disable fast-start checkpointing. This is described below under the heading, "Fast-Start Checkpointing".

In addition to setting these parameters, also consider the size of your log files. Maintaining small log files can increase checkpoint activity and reduce performance.

Fast-Start Checkpointing

The fast-start checkpointing feature limits the number of dirty buffers and thereby limits the amount of time required for instance recovery. If Oracle must process an excessive number of I/O operations to perform instance recovery, then performance can be adversely affected. You can control this overhead by setting an appropriate value for the parameter FAST_START_IO_TARGET.

Note:
Fast-start checkpointing is only available with the Oracle8i Enterprise Edition.
Oracle recommends using fast-start checkpointing to control the duration of the "roll-forward" phase of recovery. This behavior is controlled by the FAST_START_IO_TARGET parameter. The parameter, DB_BLOCK_MAX_DIRTY_TARGET, is an Oracle8 parameter used to provide more limited control over roll-forward duration, and it is included in Oracle8i only for backward compatibility.

FAST_START_IO_TARGET limits the number of I/O operations that Oracle should allow for instance recovery. If the number of operations required for recovery at any point in time exceeds this limit, then Oracle writes dirty buffers to disk until the number of I/O operations needed for instance recovery is reduced to the limit set by FAST_START_IO_TARGET.

You can control the duration of instance recovery, because the number of operations required to recover indicates how much time this process takes. Disable this aspect of checkpointing by setting FAST_START_IO_TARGET to zero (0).

See Also:
For more information on the FAST_START_IO_TARGET parameter and the trade-off between performance and instance recovery time, see Chapter 24, "Tuning Instance Recovery Performance".

Tuning LGWR and DBWR I/O

This section describes how to tune I/O for the log writer and database writer background processes.

Tuning LGWR I/O

Applications with many INSERTs or with LONG/RAW activity may benefit from tuning LGWR I/O. The size of each I/O write depends on the size of the log buffer which is set by the initialization parameter LOG_BUFFER. Therefore, it is important to choose the right log buffer size. LGWR starts writing if the buffer is one third full, or when it is posted by a foreground process such as a COMMIT. Too large a log buffer size might delay the writes. Too small a log buffer might also be inefficient, resulting in frequent, small I/Os.

If the average size of the I/O becomes quite large, then the log file could become a bottleneck. To avoid this problem, you can stripe the redo log files, going in parallel to several disks. You must use an operating system striping tool, because manual striping is not possible in this situation.

Stripe size is also important. You can figure an appropriate value by dividing the average redo I/O size by the number of disks over which you want to stripe the buffer.

Review V$SYSSTAT or the UTLBSTAT report for the following:

Log buffer space: This is time spent waiting for space in the log buffer. This is an indication that the buffers are being filled up faster than LGWR is writing. This may also indicate disk I/O contention. If the count is very high, then increase LGWR buffers and investigate disk I/O contention where the redo logs reside.
Log file space/switch: This is the time Oracle spent waiting for the space on disk for LGWR to complete the write of log buffers to the redo log. This may be an indication to increase log buffers.

For example:

SELECT a.VALUE / DECODE(b.VALUE, 0, 1, b.VALUE) 
FROM V$SYSSTAT a, V$SYSSTAT b 
WHERE a.NAME = 'redo size' AND b.NAME = 'user commits';

This provides the average number redo records per commit. Now you must determine the average number of commits per second and multiply it by the average number redo records per commit (calculated above). This provides the max log buffers to setup.

The following events uniquely identify the specific wait:

Log buffer space
Log file switch (checkpoint incomplete)
Log file switch (archiving needed)
Log file switch (clearing log file)
Log file switch completion
Switch logfile statement

Pre-Oracle 7.3 tuning required the following to tune log file switch counts:

If log file/switch = log space free requests, then add more log_buffers
If log file/switch is high and there is a significant difference between background checkpoints started and background checkpoints completed, then consider reviewing the checkpoint frequency and the log file sizes.

Tuning DBWR I/O

This section describes the following issues of tuning DBWR I/O:

Multiple Database Writer (DBWR) Processes and Database Slaves

Multiple database writer processes are useful when a buffer cache is so large that one DBWn process running full-time cannot keep up with the load. However, for large transaction rate systems that have many CPUs, you can enable multiple database writers to handle the load.

Using the DB_WRITER_PROCESSES initialization parameter, you can create multiple database writer processes (from DBW0 to DBW9). Database I/O slaves provide non-blocking, asynchronous requests to simulate asynchronous I/O.

I/O slaves for DBWR are allocated immediately following database open, when the first I/O request is made. The DBWR continues to do all the DBWR-related work (scanning LRU). When the DBWR process initiates the I/O, the DBWR I/O slave simply does the I/O on behalf of DBWR. The writing of the batch is parallelized between the I/O slaves.

The main DBWR process, which is I/O issuing process, looks for an idle I/O slave. If one is available, then that I/O slave gets a post. If there are no idle slaves, then the I/O issuer spawns one. If the allowed number of slaves have been spawned, then the issuer waits and tries again to find an idle slave.

If the asynchronous I/O code of the platform has bugs or is not efficient, then asynchronous I/O can be disabled on a device type. However, multiple I/O slaves only parallelize the writing of the batch between the DBWR I/O slaves. In contrast, you can parallelize the gathering as well as the writing of buffers with the multiple DBWR feature. Therefore, from the throughput standpoint, N DBWR processes should deliver more throughput than one DBWR process with the same number of I/O slaves.

Multiple writer processes (and I/O slaves) are advanced features that are intended for heavy OLTP processing. Implement this feature only if the database environment requires such I/O throughput. For example, if asynchronous I/O is available, then it may be wise to disable I/O slaves and run with a single DBWR in asynchronous I/O mode.

Note:
Review the current throughput, and examine possible bottlenecks to determine if it is feasible to implement these features.

If it is determined that there is a need for multiple writer processes or slave processes, then determine which option to use. Although both implementations of DBWR processes may be beneficial, the general indicator rule on which option to use depends on the availability of asynchronous I/O (from the operating system) and the number of CPUs.

The number of CPUs is also indirectly related to the number LRU latch sets. To determine whether to use multiple DBWn processes or database slaves, follow these guidelines:

First, determine if DBWR is keeping up the write requests. Review the V$SYSTEM_EVENT view for significant numbers of 'free buffer' waits. Large values may indicate that users wants to read a buffer, but they cannot because there are too many dirty buffers in the cache. If you do not see free buffer waits, then DBWR is not a problem.
If DBWR is keeping up with the write requests, then use I/O slaves only if asynchronous I/O is not supported (or is working improperly) on the operating system for the types of files that you are using. Start with the number of I/O slaves equal to the average number of disks that a typical file spans.
If you are using a RAID device with a large write back cache, then you can reduce your I/O slaves significantly, because I/Os to the RAID cache are much faster than I/Os to the disks.

Use multiple DBWR (DB_WRITER_PROCESSES) when one DBWR cannot keep up. This is generally only necessary for large SMP systems with a lot of I/O activity. Typically, there should be no more than one DBWR for every 8 CPUs. Some systems can use multiple DBWR processes with asynchronous I/O enabled if they have very high transaction rates, are not CPU starved, and the operating system's async I/O works effectively.

Note:
Implementing DB_IO_SLAVES or multiple writer processes creates some overhead cost. Enabling these features requires that extra shared memory be allocated for I/O buffers and request queues.

Internal Write Batch Size

Database writer (DBWn) processes use the internal write batch size, which is set to the lowest of the following three values (A, B, or C):

Value A is calculated as follows:
Value B is the port-specific limit. (See your Oracle platform-specific documentation.)
Value C is one-fourth the value of DB_BLOCK_BUFFERS.

Setting the internal write batch size too large may result in uneven response times.

For best results, you can influence the internal write batch size by changing the parameter values by which Value A is calculated. Take the following approach:

Determine the files to which you must write, and the number of disks on which those files reside.
Determine the number of I/Os you can perform against these disks.
Determine the number of writes that your transactions require.
Make sure that you have enough disks to sustain this rate.

LRU Latches with a Single Buffer Pool

When you have multiple database writer DBWn processes and only one buffer pool, the buffer cache is divided up among the processes by LRU (least recently used) latches; each LRU latch is for one LRU list.

The default value of the DB_BLOCK_LRU_LATCHES parameter is 50% of the number of CPUs in the system. You can adjust this value to be equal to, or a multiple of, the number of CPUs. The objective is to cause each DBWn process to have the same number of LRU lists, so that they have equivalent loads.

For example, if you have 2 database writer processes and 4 LRU lists (4 latches), then the DBWn processes obtain latches in a round-robin fashion. DBW0 obtains latch 1, DBW1 obtains latch 2, then DBW0 obtains latch 3 and DBW1 obtains latch 4. Similarly, if your system has 8 CPUs and 3 DBWn processes, then you should have 9 latches.

LRU Latches with Multiple Buffer Pools

If you are using multiple buffer pools and multiple database writer (DBWn) processes, then the number of latches in each pool (DEFAULT, KEEP, and RECYCLE) should be equal to, or a multiple of, the number of processes. This is recommended so that each DBWn process is equally loaded.

Note:
When there are multiple buffer pools, each buffer pool has a contiguous range of LRU latches.

Consider the example in Figure 20-3 where there are 3 DBWn processes and 2 latches for each of the 3 buffer pools, for a total of 6 latches. Each buffer pool would obtain a latch in round robin fashion.

Figure 20-3 LRU Latches with Multiple Buffer Pools: Example 1

The DEFAULT buffer pool has 500 buffers for each LRU list. The RECYCLE buffer pool has 250 buffers for each LRU list. The KEEP buffer pool has 100 buffers for each LRU.

DBW0 gets latch 1 (500) and latch 4 (250) for 750.
DBW1 gets latch 2 (500) and latch 6 (100) for 600.
DBW2 gets latch 3 (250) and latch 5 (100) for 350.

Thus, the load carried by each of the DBWn processes differs, and performance suffers. If, however, there are 3 latches in each pool, then the DBWn processes have equal loads, and performance is optimized.

The different buffer pools have different rates of block replacement. Ordinarily, blocks are rarely modified in the KEEP pool and frequently modified in the RECYCLE pool. This means that you need to write out blocks more frequently from the RECYCLE pool than from the KEEP pool. As a result, owning 100 buffers from one pool is not the same as owning 100 buffers from the other pool. To be perfectly load balanced, each DBWn process should have the same number of LRU lists from each type of buffer pool.

A well-configured system might have 3 DBWn processes and 9 latches, with 3 latches in each buffer pool.

Figure 20-4 LRU Latches with Multiple Buffer Pools: Example 2

The DEFAULT buffer pool has 500 buffers for each LRU list. The RECYCLE buffer pool has 250 buffers for each LRU list. The KEEP buffer pool has 100 buffers for each LRU list.

DBW0 gets latch 1 (500) and latch 4 (250) and latch 7 (100) for 850.
DBW1 gets latch 2 (500) and latch 5 (250) and latch 8 (100) for 850.
DBW2 gets latch 3 (500) and latch 6 (250) and latch 9 (100) for 850.

Tuning Backup and Restore Operations

The primary goal of backup and restore tuning is to create an adequate flow of data between disk and storage device. Tuning backup and restore operations involve the following tasks:

Locating the Source of Bottlenecks

Backups and restore operations have three distinct components:

Read the input (disk or tape).
Process data by validating blocks and copying them from the input to the output buffer.
Write the output to tape or disk.

It is unlikely that all three of these perform at the same speed. Therefore, the slowest of these components is the bottleneck.

Types of I/O

Oracle backup and restore uses two types of I/O: disk and tape. When performing a backup, the input files are read using disk I/O, and the output backup file is written using either disk or tape I/O. When performing restores, these roles reverse. Both disk and tape I/O can be synchronous or asynchronous; each is independent of the other.

Measuring Synchronous and Asynchronous I/O Rates

When using synchronous I/O, you can easily determine how much time backup jobs require, because devices only perform one I/O task at a time. When using asynchronous I/O, it is more difficult to measure the bytes-per-second rate for the following reasons:

Asynchronous processing implies that more than one task occurs at a time.
Oracle I/O uses a polling, rather than an interrupt, mechanism to determine when each I/O request completes. Because the backup or restore process is not immediately notified of I/O completion by the operating system, you cannot determine the duration of each I/O.

Using Recovery Manager

Recovery Manager (RMAN) is an Oracle tool that allows you to back up, copy, restore, and recover datafiles, control files, and archived redo logs. You can invoke RMAN as a command-line utility from the operating system prompt or use the GUI-based Enterprise Manager Backup Manager.

RMAN automates many of the backup and recovery tasks that were formerly performed manually. For example, instead of requiring you to locate appropriate backups for each datafile, copy them to the correct place using operating system commands, and choose which archived logs to apply, RMAN manages all these tasks automatically.

RMAN provides several parameters that allow you to tune backup and recovery operations. These are discussed in the following sections.

Allocating Disk Buffers

There are two different buffers: disk buffers and tape buffers. They can be different sizes. When RMAN backs up from disk, it allocates four disk buffers for each input datafile. You cannot alter the number of buffers that RMAN allocates.

The size of the disk buffers is controlled by the DB_FILE_DIRECT_IO_COUNT initialization parameter. This is the number of blocks per buffer. The default is 64.

The size of each buffer is equal to the product of the following initialization parameters:

DB_BLOCK_SIZE * DB_FILE_DIRECT_IO_COUNT

For example, if DB_BLOCK_SIZE = 2048 and DB_FILE_DIRECT_IO_COUNT =64, then each disk buffer is 128K. In this example, the total size of the buffers for each datafile is 128K * 4, or 512K. There are 4 buffers allocated for each datafile in the backup set.

If you want to know the total size of the buffers allocated in your backup, then multiply this total by the number of datafiles being accessed by the channel, and then multiply by the number of channels. You should also add a little extra to account for the control structures.

Note:
On some platforms, the most efficient I/O buffer size may be more than 128KB.

You can reduce the size of the buffers by lowering DB_FILE_DIRECT_IO_COUNT, but the number of buffers remains at 4 per file.

Figure 20-5 Disk Buffer Allocation

Allocating Tape Buffers

Oracle allocates 4 buffers per channel for the tape writers (or reads if doing a restore). There are usually 64K each. Therefore, to size this, multiply by 4, and then multiply by the number of channels.

You can change the size of each tape buffer using the parms parameter of the ALLOCATE CHANNEL statement. Set blksize to the desired size of each buffer. For example:

allocate channel foo type 'sbt_tape' parms="blksize=16384"

RMAN allocates the tape buffers in the SGA or the PGA. If you set the initialization parameter BACKUP_TAPE_IO_SLAVES = true, then RMAN allocates them from the SGA. If you set the parameter to FALSE, then RMAN allocates the buffers in the PGA.

If you use I/O slaves, then you should use the LARGE_POOL_SIZE initialization parameter to set aside some SGA memory that is dedicated to holding these large memory allocations. Hence, the RMAN I/O buffers do not compete with the library cache for SGA memory.

Figure 20-6 Tape Buffer Allocation

Synchronous vs. Asynchronous I/O

When RMAN reads or writes data, the action is either synchronous or asynchronous. When the I/O is synchronous, a server process can perform only one task at a time. When the I/O is asynchronous, a server process can begin one task, and other processes can perform other tasks while the initial process waits for the task to complete.

You can set initialization parameters that determine the type of I/O. If you set BACKUP_TAPE_IO_SLAVES to true, then the I/O is asynchronous. Otherwise, the I/O is synchronous.

Figure 20-7 illustrates synchronous I/O in a backup to tape. The following steps occur:

A server process writes blocks to a tape buffer.
The tape process writes data to tape. While the tape process is writing, the server process must remain idle.
The tape process returns a message to the server process stating that it has completed writing.
The server process can initiate a new task.

Figure 20-7 Synchronous I/O

Figure 20-8 illustrates asynchronous I/O in a backup to tape. The following steps occur:

A server process writes blocks to a tape buffer.
The tape process writes data to tape. While the tape process is writing, other server processes are free to perform tasks.
Two spawned server processes write to tape buffers as the initial tape process writes to tape.

Figure 20-8 Asynchronous I/O

Allocating Channels

When you allocate a channel, RMAN lets you set various channel limit parameters that apply to operations performed by the allocated server session. You can use these parameters to do the following:

Force RMAN to create multiple backup pieces.
Prevent RMAN from consuming too much disk bandwidth.
Prevent RMAN from opening too many input files at once.

You can specify the following parameters:

Parameter Description

kbytes

Specifies the maximum size of a backup piece created on a channel. Use this parameter to force RMAN to create multiple backup pieces in a backup set. RMAN creates each backup piece with a size no larger than the value specified in the parameter.

readrate

Specifies the maximum number of buffers per second read from each input datafile. You can use this parameter to prevent RMAN from consuming too much bandwidth during backups.
For example, set readrate to 12. If each input buffer is 128K, then the read rate for each input datafile is 12 * 128 or about 1.5 Mb per second. If each SCI drive delivers 3 Mb per second, then RMAN leaves some disk bandwidth available to the online system.

maxopenfiles

Determines the maximum number of input files that a backup or copy can have open at a given time. Set this parameter to prevent RMAN from attempting to open a number of files greater than the upper limit for your operating system.

Allocating Input Files

The BACKUP statement lets you set parameters that influence how RMAN selects files for input into backup sets. You may need to set these parameter to do the following:

Prevent RMAN from write a single backup set to multiple volumes.
Prevent RMAN from reading from too many disks at once.
Keep a tape drive streaming during a backup.

You can specify the following parameters:

Parameter Description

filesperset

Specifies the maximum number of files to place in a backup set. RMAN divides the number of input files by the number of channels to calculate the number of files per backup set. Use this parameter to force RMAN to create multiple backup sets.
For example, if you have 50 input datafiles and 2 channels, you can set filesperset = 5 to create 10 backup sets. This action can prevent you from splitting a backup set among multiple tapes.

diskratio

Specifies the number of drives to include in the backup.
For example, assume your datafiles are located on 5 disks, the disks supply data at 10 bytes/second, and the tape drive requires 20 bytes/second to set keep streaming. If you set diskratio to 2, then RMAN reads from 2 drives at a time, thereby spreading the backup load.

Note:
The number of datafiles accessed by a channel can be controlled by setting filesperset on the BACKUP statement or by entering a SET LIMIT CHANNEL ... maxopenfiles=n statement before the BACKUP statement.

Using Incremental Backups

An incremental backup is an RMAN backup in which only modified blocks are backed up. Incremental backups are not necessarily faster than full backups, because Oracle still reads the entire datafile to take an incremental backup. If tape drives are locally attached, then incremental backups could be faster. You must consider how much bandwidth there is for reading the disks compared to the bandwidth for writing to the tapes. If tape bandwidth is limited compared to disk, then incremental backups could help.

In an incremental backup, if only a few blocks have changed, then you need to input many buffers from the datafile before you accumulate enough blocks to fill a buffer and write to tape. So, it is possible that the tape drive is not streaming. Streaming means that the tape drive is 100% busy. If a tape drive is not kept streaming, then it becomes inefficient, because after each write, the tape drive must stop and rewind a little bit.

If you use a large filesperset in the filesperset parameter, then you can scan many datafiles in parallel, the output buffers for the tape drive are filled quickly, and you can write them frequently enough to keep the tape drive streaming.

For an incremental backup, filesperset=50 might be a good number. But, for a full or incremental level=0 backup, filesperset should be smaller, such as 4 or 8.

RMAN Performance Tips

To get the best performance for a backup, follow these suggestions:

Do not set the readrate parameter. This is intended to slow down a backup, so that you can run it in the background during production hours.
Set the BACKUP_TAPE_IO_SLAVES initialization parameter to TRUE.
If you set BACKUP_TAPE_IO_SLAVES to TRUE, then the tape buffers are allocated from the SGA. Therefore, allocate a LARGE_POOL for them. You can control the buffer size with the parms clause on the allocate channel statement:
Increase the size of the disk reads by setting the DB_FILE_DIRECT_IO_COUNT initialization parameter. Use the maxopenfiles parameter to control how many datafiles are opened simultaneously on each channel.
If your datafiles are in a UNIX file system, then try setting BACKUP_DISK_IO_SLAVES to 4. This spawns processes to read datafiles in parallel, simulating asynchronous I/O. In this case, the datafile buffers are allocated from the SGA. The default value for this parameter is 0, meaning I/O slaves are not used at all, and the buffers come from the process local memory, not the SGA.
For incremental backups, use a higher filesperset than for full backups. Set it high enough so that the tape drive is kept streaming. Filesperset should be less than or equal to maxopenfiles. Try making FILESPERSET=MAXOPENFILES. Use a value of 10 to start, and raise this if the tape drive does not stream. BACKUP_DISK_IO_SLAVES may be necessary.

Using Fixed Views to Monitor Bottlenecks

Use the V$BACKUP_SYNC_IO and V$BACKUP_ASYNC_IO views to determine the source of backup or restore bottlenecks and to determine the progress of backup jobs.

V$BACKUP_SYNC_IO contains rows when the I/O is synchronous to the process (or thread on some platforms) performing the backup. V$BACKUP_ASYNC_IO contains rows when the I/O is asynchronous. Asynchronous I/O is obtained either with I/O processes or because it is supported by the underlying operating system.

Identifying Bottlenecks with Synchronous I/O

With synchronous I/O, it is difficult to identify specific bottlenecks, because all synchronous I/O is a bottleneck to the process. The only way to tune synchronous I/O is to compare the bytes-per-second rate with the device's maximum throughput rate. If the bytes-per-second rate is lower than that device specifies, then consider tuning that part of the backup/restore process. Use the DISCRETE_BYTES_PER_SECOND column in the V$BACKUP_SYNC_IO view to see the I/O rate.

Identifying Bottlenecks with Asynchronous I/O

If the combination of LONG_WAITS and SHORT_WAITS is a significant fraction of IO_COUNT, then the file indicated in V$BACKUP_SYNCH_IO and V$BACKUP_ASYNCH_IO is probably a bottleneck. Some platforms' implementation of asynchronous I/O can cause the caller to wait for I/O completion when performing a non-blocking poll for I/O. Because this behavior can vary among platforms, the V$BACKUP_ASYNC_IO view shows the total time for both "short" and "long" waits.

Long waits are the number of times the backup/restore process told the operating system to wait until an I/O was complete. Short waits are the number of times the backup/restore process made an operating system call to poll for I/O completion in a non-blocking mode. Both types of waits the operating system should respond immediately.

If the SHORT_WAIT_TIME_TOTAL column is equal to or greater than the LONG_WAIT_TIME_TOTAL column, then your platform probably blocks for I/O completion when performing non-blocking I/O polling. In this case, the SHORT_WAIT_TIME_TOTAL represents real I/O time for this file. If the SHORT_WAIT_TIME_TOTAL is low compared to the total time for this file, then the delay is most likely caused by other factors, such as process swapping. If possible, tune your operating system so that the I/O wait time appears up in the LONG_WAIT_TIME_TOTAL column.

Columns Common to V$BACKUP_SYNC_IO and V$BACKUP_ASYNC_IO

Table 20-17 lists columns and their descriptions that are common to the V$BACKUP_SYNC_IO and V$BACKUP_ASYNC_IO views.

Table 20-17 Common Columns of V$BACKUP_SYNC_IO and V$BACKUP_ASYNC_IO

Column	Description
SID	Oracle SID of the session doing the backup or restore.
SERIAL	Usage counter for the SID doing the backup or restore.
USE_COUNT	A counter you can use to identify rows from different backup sets. Each time a new set of rows is created in `V$BACKUP_SYNC_IO` or `V$BACKUP_ASYNC_IO`, they have a `USE_COUNT` that is greater than the previous rows. The `USE_COUNT` is the same for all rows used by each backup or restore operation.
DEVICE_TYPE	Device type where the file is located (typically `DISK` or `SBT_TAPE`).
TYPE	`INPUT`: The file(s) are being read. `OUTPUT`: The file(s) are being written. `AGGREGATE`: This row represents the total I/O counts for all `DISK` files involved in the backup or restore.
STATUS	`NOT` `STARTED`: This file has not been opened yet. `IN` `PROGRESS`: This file is currently being read or written. `FINISHED`: Processing for this file is complete.
FILENAME	The name of the backup file being read or written.
SET_COUNT	The `SET_COUNT` of the backup set being read or written.
SET_STAMP	The `SET_STAMP` of the backup set being read or written.
BUFFER_SIZE	Size of the buffers being used to read/write this file in bytes.
BUFFER_COUNT	The number of buffers being used to read/write this file.
TOTAL_BYTES	The total number of bytes to be read or written for this file if known. If not known, this column is null.
OPEN_TIME	Time this file was opened. If `TYPE` = '`AGGREGATE`', then this is the time that the first file in the aggregate was opened.
CLOSE_TIME	Time this file was closed. If `TYPE` = '`AGGREGATE`', then this is the time that the last file in the aggregate was closed.
ELAPSED_TIME	The length of time expressed in 100ths of seconds that the file was open.
MAXOPENFILES	The number of concurrently open `DISK` files. This value is only present in rows where `TYPE` = '`AGGREGATE`'.
BYTES	The number of bytes read or written so far.
EFFECTIVE_BYTES_PER_ SECOND	The I/O rate achieved with this device during the backup. It is the number of bytes read or written divided by the elapsed time. This value is only meaningful for the component of the backup system causing a bottleneck. If this component is not causing a bottleneck, then the speed measured by this column actually reflects the speed of some other, slower, component of the system.
IO_COUNT	The number of I/Os performed to this file. Each request is to read or write one buffer, of size `BUFFER_SIZE`.

Columns Specific to V$BACKUP_SYNC_IO

Table 20-18 lists columns specific to the V$BACKUP_SYNC_IO view.

Table 20-18 Columns Specific to V$BACKUP_SYNC_IO

Column	Description
IO_TIME_TOTAL	The total time required to perform I/O for this file expressed in 100ths of seconds.
IO_TIME_MAX	The maximum time taken for a single I/O request.
DISCRETE_BYTES_PER_ SECOND	The average transfer rate for this file. This is based on measurements taken at the start and end of each individual I/O request. This value should reflect the real speed of this device.

Columns Specific to V$BACKUP_ASYNC_IO

Table 20-19 lists columns specific to the V$BACKUP_ASYNC_IO view.

Table 20-19 Columns Specific to V$BACKUP_ASYNC_IO

Column	Description
READY	The number of asynchronous requests for which a buffer was immediately ready for use.
SHORT_WAITS	The number of times a buffer was not immediately available, but then a buffer became available after doing a non-blocking poll for I/O completion. The reason non-blocking waits are timed is because some implementations of asynchronous I/O may wait for an I/O to complete even when the request is supposed to be non-blocking.
SHORT_WAIT_TIME_TOTAL	The total time expressed in 100ths of seconds, taken by non-blocking polls for I/O completion.
SHORT_WAIT_TIME_MAX	The maximum time taken for a non-blocking poll for I/O completion, in 100ths of seconds.
LONG_WAITS	The number of times a buffer was not immediately available, and only became available after issuing a blocking wait for an I/O to complete.
LONG_WAIT_TIME_TOTAL	The total time expressed in 100ths of seconds taken by blocking waits for I/O completion.
LONG_WAIT_TIME_MAX	The maximum time taken for a blocking wait for I/O completion expressed in 100ths of seconds.

Improving Backup Throughput

In optimally tuned backups, tape components should create the only bottleneck. You should keep the tape and its device streaming, or constantly rotating. If the tape is not streaming, then the data flow to the tape may be inadequate.

This section contains the following topics to maintain streaming by improving backup throughput:

Understanding Factors Affecting Data Transfer Rates

The rate at which the host sends data to keep the tape streaming depends on the following factors:

The raw capacity of the tape device.
Compression.

Tape device raw capacity is the smallest amount of data required to keep the tape streaming.

Compression is implemented either in the tape hardware or by the media management software. If you do not use compression, then the raw capacity of the tape device keeps it streaming. If you use compression, then the amount of data that must be sent to stream the tape is the raw device capacity multiplied by the compression factor. The compression factor varies for different types of data.

Determining If Tape is Streaming for Synchronous I/O

To determine whether your tape is streaming when the I/O is synchronous, query the EFFECTIVE_BYTES_PER_SECOND column in the V$BACKUP_SYNC_IO view.

Table 20-20 V$BACKUP_SYNC_IO View

If EFFECTIVE_BYTES_PER_SECOND is:	Then:
Less than the raw capacity of the hardware	The tape is not streaming.
More than the raw capacity of the hardware	The tape may be streaming, depending on the compression ratio of the data.

Determining If Tape is Streaming for Asynchronous I/O

If the I/O is asynchronous, then the tape is streaming if the combination of LONG_WAITS and SHORT_WAITS is a significant fraction of I/O count. Place more importance on SHORT_WAITS if the time indicated in the SHORT_WAIT_TIME_TOTAL column is equal to or greater than the LONG_WAIT_TIME_TOTAL column.

Increasing Throughput to Enable Tape Streaming

If the tape is not streaming, then the basic strategy is to supply more bytes-per-second to the tape. Modify this strategy depending on the how many blocks Oracle must read from disk and how many disks Oracle must access.

Spreading I/O Across Multiple Disks

Using the DISKRATIO parameter of the BACKUP statement to distribute backup I/O across multiple volumes, specify how many disk drives RMAN uses to distribute file reads when backing up multiple concurrent datafiles. For example, assume that your system uses 10 disks. The disks supply data at 10 byes per second, and the tape drive requires 50 bytes per second to keep streaming. In this case, set DISKRATIO to 5 to spread the backup load onto 5 disks.

When setting DISKRATIO, spread the I/O over only as many disks as needed to keep the tape streaming: any more can increase the time it takes to recover a single file and provides no performance benefit. If you do not specify DISKRATIO, but you do specify FILESPERSET, then DISKRATIO defaults to FILESPERSET. If neither is specified, then DISKRATIO defaults to 4.

Backing Up Empty Files or Files with Few Changes

When performing a full backup of files that are largely empty, or when performing an incremental backup when few blocks have changed, you may not be able to supply data to the tape fast enough to keep it streaming.

In this case, you can achieve optimal performance by the following:

Use the highest possible value for the MAXOPENFILES parameter of the Recovery Manager SET LIMIT CHANNEL statement.

See Also:
For more information about the RMAN SET statement, see Oracle8i Backup and Recovery Guide.
Use asynchronous disk I/O. This takes advantage of asynchronous read-ahead that fills input buffers from one file while processing data from others.

Backing Up Full Files

When you perform a full backup of files that are mostly full and the tape is not streaming, you can improve performance in several ways, as shown in Table 20-21:

Table 20-21 Throughput Performance Improvement Methods

Method	Consequence
Set `DBWR_IO_SLAVES` `=` `TRUE`	Allocates additional processes for each disk channel, and these processes simulate asynchronous I/O. Try setting this to 3 or 4, and set the `LARGE_POOL_SIZE` parameter accordingly.
Set `BACKUP_TAPE_ IO_SLAVES` = `TRUE`	Simulates asynchronous tape I/O by spawning multiple processes to divide the work for the backup or restore operation. If you do not set this parameter, then all I/O to the tape layer is synchronous, which means that no other work can be done until the tape is done writing. `BACKUP_TAPE_IO_SLAVES` requires that the buffers for the respective disk or tape I/O be allocated from the shared memory (SGA), so that they can be shared between two processes. Therefore, you should allocate a large enough SGA. If you set this parameter, then also set `LARGE_POOL_SIZE`.
Set `LARGE_POOL_SIZE`	When attempting to get shared buffers for I/O slaves, Oracle does the following: If `LARGE_POOL_SIZE` is set, then Oracle attempts to get memory from the large pool. If this value is not large enough, then Oracle does not try to get buffers from the shared pool. If `LARGE_POOL_SIZE` is not set, then Oracle attempts to get memory from the shared pool. If Oracle cannot get enough memory, then it obtains I/O buffer memory from local process memory and writes a message to the `alert`.`log` file indicating that synchronous I/O is used for this backup.
Increase `DB_FILE_ DIRECT_IO_COUNT`	Causes RMAN to use larger buffers for disk I/O. The default buffer size used for backup and restore disk I/O is `DB_FILE_DIRECT_IO_COUNT` * `DB_BLOCK_SIZE`. The default value for `DB_FILE_DIRECT_IO_COUNT` is 64, so if `DB_BLOCK_SIZE` is 2048, then the buffer size is 128KB. On some platforms, the most efficient I/O buffer size may be more than 128KB. You can increase the `DB_FILE_DIRECT_IO_COUNT`, but the number of buffers per file stays fixed at 4.
Make sure the RMAN parameters `MAXOPENFILES` and `FILESPERSET` are not too low	Increases the number of files that RMAN can process at one time. Using default buffer sizes, each concurrently open file uses 512KB of process memory (or SGA large pool memory, if I/O processes are used) for buffers. The number of concurrent files should be just enough to keep the tape streaming. You must derive the correct number by trial and error, because unused block compression greatly affects the amount of disk data that is sent to the tape drive. If your tape drives are slower than your disk drives, then a value of 1 for `MAXOPENFILES` should be sufficient. If you want a high `FILESPERSET`, but you want to limit the memory allocated for buffers, then use `SETLIMIT`. For example: run{ allocate channel foo type disk; SETLIMIT CHANNEL foo maxopenfiles=4 backup database...} `SETLIMIT` limits the channels to have at most 4 open files. So, you could allocate 16 buffers for the input files and 4 buffers for the backup set. The default is 10, which may be high for your system.
READRATE	The `READRATE` parameter specifies units of buffers per second. For example, if your buffers are 128K and `READRATE` is 12, then RMAN is limited to reading 12*128K bytes per second from each datafile going into a backup set. You should test to find a value that improves performance of your queries while still letting RMAN complete the backup in a reasonable amount of time.
Increase the number of channels	Increases parallelism. Each channel must write to different filesystems, which should be on different disks. One channel per tape drive ensures that during restore, the files are read with the same sequence and timing as when they were created. You must specify the `FORMAT` parameter on the `ALLOCATE` `CHANNEL` statement. For example: run{ allocate channel foo1 type disk format '/filesys1/oracle_ backups/%d/%u_%p'; allocate channel foo2 type disk format '/filesys2/oracle_ backups/%d/%u_%p'; ...}

Configuring the Large Pool

You can optionally configure the large pool so that Oracle has a separate pool from which it can request large memory allocations. This prevents competition with other subsystems for the same memory.

As Oracle allocates more shared pool memory for the multi-threaded server session memory, the amount of shared pool memory available for the shared SQL cache decreases. If you allocate session memory from another area of shared memory, then Oracle can use the shared pool primarily for caching shared SQL and not incur the performance overhead from shrinking the shared SQL cache.

For I/O server processes and backup and restore operations, Oracle allocates buffers that are a few hundred kilobytes in size. Although the shared pool may be unable to satisfy this request, the large pool is able to do so. The large pool does not have an LRU list; Oracle does not attempt to age memory out of the large pool.

Use the LARGE_POOL_SIZE parameter to configure the large pool. To see in which pool (shared pool or large pool) the memory for an object resides, see the column POOL in V$SGASTAT.

Disk Speed	File System	Raw Devices
Reads per second	fast	slow
Writes per second	slow	fast

SORTS(MEMORY)	The number of sorts small enough to be performed entirely in sort areas without I/O to temporary segments on disk.
SORTS(DISK)	The number of sorts too large to be performed entirely in the sort area, requiring I/O to temporary segments on disk.

20Tuning I/O