8
Gathering Statistics

This chapter explains why statistics are important for the cost-based optimizer, and how to gather and use statistics.

This chapter contains the following sections:

Understanding Statistics

The cost-based optimization approach uses statistics to calculate the selectivity of predicates and to estimate the cost of each execution plan. Selectivity is the fraction of rows in a table that the SQL statement's predicate chooses. The optimizer uses the selectivity of a predicate to estimate the cost of a particular access method and to determine the optimal join order.

Statistics quantify the data distribution and storage characteristics of tables, columns, indexes, and partitions. The optimizer uses these statistics to estimate how much I/O and memory are required to execute a SQL statement using a particular execution plan. The statistics are stored in the data dictionary, and they can be exported from one database and imported into another (for example, to transfer production statistics to a test system to simulate the real environment, even though the test system may only have small samples of data).

You must gather statistics on a regular basis to provide the optimizer with information about schema objects. New statistics should be gathered after a schema object's data or structure are modified in ways that make the previous statistics inaccurate. For example, after loading a significant number of rows into a table, you should collect new statistics on the number of rows. After updating data in a table, you do not need to collect new statistics on the number of rows but you might need new statistics on the average row length.

Statistics can be generated with the ANALYZE statement or with the package DBMS_STATS.

The statistics generated include the following:

Table statistics
- Number of rows
- Number of blocks
- Number of empty blocks
- Average row length
Column statistics
- Number of distinct values (NDV) in column
- Number of nulls in column
- Data distribution (histogram)
Index statistics
- Number of leaf blocks
- Levels
- Clustering factor

Generating Statistics

Because the cost-based approach relies on statistics, you should generate statistics for all tables and clusters and all types of indexes accessed by your SQL statements before using the cost-based approach. If the size and data distribution of your tables change frequently, then you should generate these statistics regularly to ensure the statistics accurately represent the data in the tables.

Oracle generates statistics using the following techniques:

Estimation based on random data sampling
Exact computation
User-defined statistics collection methods

To perform an exact computation, Oracle requires enough space to perform a scan and sort of the table. If there is not enough space in memory, then temporary space may be required. For estimations, Oracle requires enough space to perform a scan and sort of only the rows in the requested sample of the table. For indexes, computation does not take up as much time or space, so it is best to perform a full computation.

Some statistics are always computed exactly, such as the number of data blocks currently containing data in a table or the depth of an index from its root block to its leaf blocks.

Use estimation for tables and clusters rather than computation, unless you need exact values. Because estimation rarely sorts, it is often much faster than computation, especially for large tables.

To estimate statistics, Oracle selects a random sample of data. You can specify the sampling percentage and whether sampling should be based on rows or blocks.

Row sampling reads rows without regard to their physical placement on disk. This provides the most random data for estimates, but it can result in reading more data than necessary. For example, in the worst case a row sample might select one row from each block, requiring a full scan of the table or index.
Block sampling reads a random sample of blocks and uses all of the rows in those blocks for estimates. This reduces the amount of I/O activity for a given sample size, but it can reduce the randomness of the sample if rows are not randomly distributed on disk. Block sampling is not available for index statistics.

When you generate statistics for a table, column, or index, if the data dictionary already contains statistics for the object, then Oracle updates the existing statistics. Oracle also invalidates any currently parsed SQL statements that access the object.

The next time such a statement executes, the optimizer automatically chooses a new execution plan based on the new statistics. Distributed statements issued on remote databases that access the analyzed objects use the new statistics the next time Oracle parses them.

When you associate a statistics type with a column or domain index, Oracle calls the statistics collection method in the statistics type if you analyze the column or domain index.

Statistics for Partitioned Schema Objects

Partitioned schema objects may contain multiple sets of statistics. They can have statistics which refer to the entire schema object as a whole (global statistics), they can have statistics which refer to an individual partition, and they can have statistics which refer to an individual subpartition of a composite partitioned object.

Unless the query predicate narrows the query to a single partition, the optimizer uses the global statistics. Because most queries are not likely to be this restrictive, it is most important to have accurate global statistics. Intuitively, it may seem that generating global statistics from partition-level statistics should be straightforward; however, this is only true for some of the statistics. For example, it is very difficult to figure out the number of distinct values for a column from the number of distinct values found in each partition because of the possible overlap in values. Therefore, actually gathering global statistics with the DBMS_STATS package is highly recommended, rather than calculating them with the ANALYZE statement.

Note:
Oracle currently does not gather global histogram statistics.

Using the ANALYZE Statement

The ANALYZE statement can generate statistics for cost-based optimization. However, using ANALYZE for this purpose is not recommended because of various restrictions, for example:

ANALYZE always runs serially.
ANALYZE calculates global statistics for partitioned tables and indexes instead of gathering them directly. This can lead to inaccuracies for some statistics, such as the number of distinct values.
- For partitioned tables and indexes, ANALYZE gathers statistics for the individual partitions and then calculates the global statistics from the partition statistics.
- For composite partitioning, ANALYZE gathers statistics for the subpartitions and then calculates the partition statistics and global statistics from the subpartition statistics.
ANALYZE cannot overwrite or delete some of the values of statistics that were gathered by DBMS_STATS.

ANALYZE can gather additional information that is not used by the optimizer, such as information about chained rows and the structural integrity of indexes, tables, and clusters. DBMS_STATS does not gather this information.

See Also:
For detailed information about the ANALYZE statement, see Oracle8i SQL Reference.

Using the DBMS_STATS Package

The PL/SQL package DBMS_STATS lets you generate and manage statistics for cost-based optimization. You can use this package to gather, modify, view, and delete statistics. You can also use this package to store sets of statistics.

The DBMS_STATS package can gather statistics on indexes, tables, columns, and partitions, as well as statistics on all schema objects in a schema or database. It does not gather cluster statistics--you can use DBMS_STATS to gather statistics on the individual tables instead of the whole cluster.

The statistics-gathering operations can run either serially or in parallel. Whenever possible, DBMS_STATS calls a parallel query to gather statistics with the specified degree of parallelism; otherwise, it calls a serial query or the ANALYZE statement. Index statistics are not gathered in parallel.

For partitioned tables and indexes, DBMS_STATS can gather separate statistics for each partition as well as global statistics for the entire table or index. Similarly, for composite partitioning DBMS_STATS can gather separate statistics for subpartitions, partitions, and the entire table or index. Depending on the SQL statement being optimized, the optimizer may choose to use either the partition (or subpartition) statistics or the global statistics.

DBMS_STATS gathers statistics only for cost-based optimization; it does not gather other statistics. For example, the table statistics gathered by DBMS_STATS include the number of rows, number of blocks currently containing data, and average row length but not the number of chained rows, average free space, or number of unused data blocks.

Note:
Currently, the DBMS_STATS package does not call statistics collection methods associated with individual columns. Use the ANALYZE statement to gather such information.

Gathering Statistics with the DBMS_STATS Package

Table 8-1 lists the procedures in the DBMS_STATS package for gathering statistics:

Table 8-1 Statistics Gathering Procedures in the DBMS_STATS Package

Procedure	Description
GATHER_INDEX_STATS	Collects index statistics.
GATHER_TABLE_STATS	Collects table, column, and index statistics.
GATHER_SCHEMA_STATS	Collects statistics for all objects in a schema.
GATHER_DATABASE_STATS	Collects statistics for all objects in a database.

See Also:
For syntax and examples of all DBMS_STATS procedures, see Oracle8i Supplied PL/SQL Packages Reference.

Gathering Index Statistics

Oracle can gather some statistics automatically while creating or rebuilding a B*-tree or bitmap index. The COMPUTE STATISTICS option of CREATE INDEX or ALTER INDEX ... REBUILD enables this gathering of statistics.

Note:
COMPUTE STATISTICS always gathers exact statistics.

The statistics that Oracle gathers for the COMPUTE STATISTICS option depend on whether the index is partitioned or nonpartitioned.

For a nonpartitioned index, Oracle gathers index, table, and column statistics while creating or rebuilding the index. In a concatenated-key index, the column statistics refer only to the leading column of the key.
For a partitioned index, Oracle does not gather any table or column statistics while creating the index or rebuilding its partitions.
- While creating a partitioned index, Oracle gathers index statistics for each partition and for the entire index. If the index uses composite partitioning, then Oracle also gathers statistics for each subpartition.
- While rebuilding a partition or subpartition of an index, Oracle gathers index statistics only for that partition or subpartition.

To ensure correctness of the statistics Oracle always uses base tables when creating an index with the COMPUTE STATISTICS option, even if another index is available that could be used to create the index.

If you do not use the COMPUTE STATISTICS clause, or if you have made major DML changes, then use the DBMS_STATS.GATHER_INDEX_STATS procedure to collect index statistics. The GATHER_INDEX_STATS procedure does not run in parallel.

Using this procedure is equivalent to running the following:

ANALYZE INDEX [ownname.]indname [PARTITION partname] COMPUTE STATISTICS |
ESTIMATE STATISTICS SAMPLE estimate_percent PERCENT

Gathering New Optimizer Statistics

Before gathering new statistics for a particular schema, use the DBMS_STATS.EXPORT_SCHEMA_STATS procedure to extract and save existing statistics. Then, use DBMS_STATS.GATHER_SCHEMA_STATS to gather new statistics. You can implement both of these with a single call to the GATHER_SCHEMA_STATS procedure.

If key SQL statements experience significant performance degradation, then either gather statistics again using a larger sample size, or perform the following steps:

Use DBMS_STATS.EXPORT_SCHEMA_STATS to save the new statistics.
Use DBMS_STATS.IMPORT_SCHEMA_STATS to restore the old statistics. The application is now ready to run again.

You may want to use the new statistics if they result in improved performance for the majority of SQL statements, and if the number of problem SQL statements is small. In this case, do the following:

Create a stored outline for each problematic SQL statement using the old statistics.

See Also:
Stored outlines are pre-compiled execution plans that Oracle can use to mimic proven application performance characteristics. For more information, see Chapter 10, "Using Plan Stability".

Use DBMS_STATS.IMPORT_SCHEMA_STATS to restore the new statistics. Your application is now ready to run with the new statistics. However, you will continue to achieve the previous performance levels for the problem SQL statements.

Gathering Automated Statistics

You can automatically gather statistics or create lists of tables that have stale or no statistics.

To automatically gather statistics, run the DBMS_STATS.GATHER_SCHEMA_STATS and DBMS_STATS.GATHER_DATABASE_STATS procedures with the OPTIONS and objlist parameters. Use the following values for the options parameter:

GATHER STALE

Gathers statistics on tables with stale statistics.

GATHER

Gathers statistics on all tables. (default)

GATHER EMPTY

Gathers statistics only on tables without statistics.

LIST STALE

Creates a list of tables with stale statistics.

LIST EMPTY

Creates a list of tables that do not have statistics.

The objlist parameter identifies an output parameter for the LIST STALE and LIST EMPTY options. The objlist parameter is of type DBMS_STATS.OBJECTTAB.

Designating Tables for Monitoring and Automated Statistics Gathering

To automatically gather statistics for a particular table, enable the monitoring attribute using the MONITORING keyword. This keyword is part of the CREATE TABLE and ALTER TABLE statement syntax.

After it is enabled, Oracle monitors the table for DML activity. This includes the approximate number of inserts, updates, and deletes for that table since the last time statistics were gathered. Oracle uses this data to identify tables with stale statistics.

View the data Oracle obtains from monitoring these tables by querying the USER_TAB_MODIFICATIONS view.

Note:
There may be a few hours delay while Oracle propagates information to this view.

To disable monitoring of a table, use the NOMONITORING keyword.

See Also:
For more information about the CREATE TABLE and ALTER TABLE syntax and the MONITORING and NOMONITORING keywords, see the Oracle8i SQL Reference.

Enabling Automated Statistics Gathering

The GATHER STALE option only gathers statistics for tables that have stale statistics and for which you have enabled the MONITORING attribute. To enable monitoring for tables, use the MONITORING keyword of the CREATE TABLE and ALTER TABLE statements, as described in "Designating Tables for Monitoring and Automated Statistics Gathering".

The GATHER STALE option maintains up-to-date statistics for the cost-based optimizer. Using this option at regular intervals also avoids the overhead associated with using the ANALYZE statement on all tables at one time. The GATHER option can incur much more overhead, because this option generally gathers statistics for a greater number of tables than GATHER STALE.

Use a script or job scheduling tool for the GATHER_SCHEMA_STATS and GATHER_DATABASE_STATS procedures to establish a frequency of statistics collection that is appropriate for your application. The frequency of collection intervals should balance the task of providing accurate statistics for the optimizer against the processing overhead incurred by the statistics collection process.

Creating Lists of Tables with Stale or No Statistics

You can use the GATHER_SCHEMA_STATS and GATHER_DATABASE_STATS procedures to create a list of tables with stale statistics. Use this list to identify tables for which you want to manually gather statistics.

You can also use these procedures to create a list of tables with no statistics. Use this list to identify tables for which you want to gather statistics, either automatically or manually.

Preserving Versions of Statistics

You can preserve versions of statistics for tables by specifying the stattab, statid, and statown parameters in the DBMS_STATS package. Use stattab to identify a destination table for archiving previous versions of statistics. Further identify these versions using statid to denote the date and time the version was made. Use statown to identify a destination schema if it is different from the schema(s) of the actual tables. You must first create such a table using the CREATE_STAT_TABLE procedure of the DBMS_STATS package.

Statistics Data

Statistics includes the following data:

Physical description of tables, for example, columns:
- NUM_ROWS
- BLOCKS
- AVGRLEN
Descriptions of attributes, for example, columns:
- DISTINCT VALUES
- LOWVAL
- HIGHVAL

Data Distribution

These attributes help you determine how the data is distributed across your tables. The optimizer assumes that the data is uniformly distributed. The actual data distribution in your tables can be easily analyzed by viewing the appropriate dictionary table, as described in DBA_TABLES for tables and DBA_TAB_COLUMNS for column statistics.

Attribute Skew

Histograms can be used to determine attribute skew. Descriptions of available access methods, for example, columns:

HEIGHT
LEAF BLOCK
DISTINCT KEYS
CLUSTERING FACTOR
LEAF BLOCKS PER KEY
DB BLOCKS PER KEY

See Also:
For more information on histograms, see "Using Histograms".

Missing Statistics

When statistics do not exist, the optimizer uses the following default values. Table 8-2 shows the defaults you can expect when statistics are missing.

Table 8-2 Default Table and Index Values When Statistics are Missing

Statistic	Default Value Used by Optimizer
Tables Cardinality Avg. row len No. of blocks Remote cardinality Remote average row length	100 rows 20 bytes 100 2000 rows 100 bytes
Indexes Levels Leaf blocks Leaf blocks/key Data blocks/key Distinct keys Clustering factor	1 25 1 1 100 800 (8*no. of blocks)

Statistic

Default Value Used by Optimizer

Tables

Cardinality
Avg. row len
No. of blocks
Remote cardinality
Remote average row length

100 rows

20 bytes

100

2000 rows

100 bytes

Indexes

Levels
Leaf blocks
Leaf blocks/key
Data blocks/key
Distinct keys
Clustering factor

100

800 (8*no. of blocks)

Using Statistics

This section provides guidelines on how to use and view the statistics. This includes:

Managing Statistics

This section describes statistics tables and lists the views that display information about statistics stored in the data dictionary.

Statistics Tables

The DBMS_STATS package enables you to store statistics in a statistics table. You can transfer the statistics for a column, table, index, or schema into a statistics table and subsequently restore those statistics to the data dictionary. The optimizer does not use statistics that are stored in a statistics table.

Statistics tables enable you to experiment with different sets of statistics. For example, you can back up a set of statistics before you delete them, modify them, or generate new statistics. You can then compare the performance of SQL statements optimized with different sets of statistics, and if the statistics stored in a table give the best performance, you can restore them to the data dictionary.

A statistics table can keep multiple distinct sets of statistics, or you can create multiple statistics tables to store distinct sets of statistics separately.

Viewing Statistics

You can use the DBMS_STATS package to view the statistics stored in the data dictionary or in a statistics table.

You can also query these data dictionary views for statistics in the data dictionary:

USER_TABLES, ALL_TABLES, and DBA_TABLES
USER_TAB_COLUMNS, ALL_TAB_COLUMNS, and DBA_TAB_COLUMNS
USER_INDEXES, ALL_INDEXES, and DBA_INDEXES
USER_CLUSTERS and DBA_CLUSTERS
USER_TAB_PARTITIONS, ALL_TAB_PARTITIONS, and DBA_TAB_PARTITIONS
USER_TAB_SUBPARTITIONS, ALL_TAB_SUBPARTITIONS, and DBA_TAB_SUBPARTITIONS
USER_IND_PARTITIONS, ALL_IND_PARTITIONS, and DBA_IND_PARTITIONS
USER_IND_SUBPARTITIONS, ALL_IND_SUBPARTITIONS, and DBA_IND_SUBPARTITIONS
USER_PART_COL_STATISTICS, ALL_PART_COL_STATISTICS, and DBA_PART_COL_STATISTICS
USER_SUBPART_COL_STATISTICS, ALL_SUBPART_COL_STATISTICS, and DBA_SUBPART_COL_STATISTICS

See Also:
For information on the statistics in these views, see Oracle8i Reference.

Verifying Table Statistics

To verify that the table statistics are available, execute the following statement against DBA_TABLES:

SQL> SELECT TABLE_NAME, NUM_ROWS, BLOCKS, AVG_ROW_LEN,
       TO_CHAR(LAST_ANALYZED, 'MM/DD/YYYY HH24:MI:SS')
     FROM DBA_TABLES
     WHERE TABLE_NAME IN ('SO_LINES_ALL','SO_HEADERS_ALL');

This returns the following typical data:

TABLE_NAME                 NUM_ROWS   BLOCKS   AVH_ROW_LEN   LAST_ANALYZED
------------------------   --------   -------  -----------   -------------
SO_HEADERS_ALL             1632264    207014            449  07/29/1999 00:59:51
SO_LINES_ALL              10493845   1922196            663  07/29/1999 01:16:09

Verifying Index Statistics

To verify that index statistics are available and assist you in determining which are the best indexes to use in your application, execute the following statement against the dictionary DBA_INDEXES table:

SQL> SELECT INDEX_NAME "NAME", NUM_ROWS, DISTINCT_KEYS "DISTINCT",
            LEAF_BLOCKS, CLUSTERING_FACTOR "CF", 
            AVG_LEAF_BLOCKS_PER_KEY "ALFBPKEY"
     FROM DBA_INDEXES
     WHERE TABLE_NAME ="AP_INVOICES_ALL"
     ORDER BY INDEX_NAME;

This returns the following typical data:

NAME                     NUM_ROWS DISTINCT LEAF_BLOCKS       CF   ALFBKEY
---------------------- ---------- -------- -----------  -------   -------
AP_INVOICES_N1             18941     80712      17060    431230         1
AP_INVOICES_N3             14995         2      21403    186450     10701
AP_INVOICES_N4             13196    439859      18669   2889596         1
AP_INVOICES_N5              9734       291      24145   1543140        82
AP_INVOICES_N6             18816   1567987      22708   2579791         1
AP_INVOICES_N9              9216         3      23271    264048      7757
AP_INVOICES_U1             10892   2861077      17074    342793         1
AP_INVOICES_U2             17176   3084212      28910   2499547         1

Optimizer Index Determination Criteria

The optimizer uses the following criteria when determining which index to use:

Number of rows in the index (cardinality)
Number of distinct keys. These define the selectivity of the index.
Level or height of the index. This indicates how deeply the data 'probe' must search in order to find the data.
Number of leaf blocks. This is the number of I/Os needed to find the desired rows of data.
Clustering factor (CF). This is the colocation amount of the index block relative to data blocks. The higher the CF the more likely the optimizer is to select this index.

Usage Notes

Use the following notes to assist you in deciding whether you have chosen an appropriate index for your table, data, and query:

DISTINCT

Consider index ap_invoices_n3, the number of distinct keys, is 2. The resulting selectivity based on index ap_invoices_n3 is poor, and the optimizer is not likely to use this index. Using this index fetches 50% of the data in the table. In this case, a full table scan is cheaper than using index ap_invoices_n3.

Index Cost Tie

The optimizer uses alphabetic determination: If the optimizer determines that the selectivity, cost, and cardinality of two finalist indexes is the same, then it uses the two indexes' names as the deciding factor. It chooses the index with name beginning with a lower alphabetic letter or number.

Verifying Column Statistics

To verify that column statistics are available, execute the following statement against the dictionary's DBA_TAB_COLUMNS view:

SQL> SELECT COLUMN_NAME, NUM_DISTINCT, NUM_NULLS, NUM_BUCKETS, DENSITY

FROM DBA_TAB_COLUMNS
WHERE TABLE_NAME ="PA_EXPENDITURE_ITEMS_ALL"
ORDER BY COLUMN_NAME;

This returns the following data:

COLUMN_NAME                    NUM_DISTINCT  NUM_NULLS NUM_BUCKETS    DENSITY 
------------------------------ ------------ ---------- ----------- ---------- 
BURDEN_COST                            4300      71957           1 .000232558 
BURDEN_COST_RATE                        675    7376401           1 .001481481 
CONVERTED_FLAG                            1   16793903           1          1 
COST_BURDEN_DISTRIBUTED_FLAG              2      15796           1         .5 
COST_DISTRIBUTED_FLAG                     2          0           1         .5 
COST_IND_COMPILED_SET_ID                 87    6153143           1 .011494253 
EXPENDITURE_ID                      1171831          0           1 8.5337E-07 
TASK_ID                                8648          0           1 .000115634 
TRANSFERRED_FROM_EXP_ITEM_ID        1233787   15568891           1 8.1051E-07

Verifying column statistics are especially important for the following schema:

Join conditions
When the WHERE clause includes a column(s) with a bind variable, for example:
```
column x  = :variable_y
```

In these cases, the stored column statistics can be used to get a representative Cardinality estimation for the given expression.

Consider the data returned in the above example.

NUM_DISTINCT Column Statistic

Low

Column CONVERTED_FLAG: NUM_DISTINCT = 1. In this case this column has only one value. If in the WHERE clause, then there is a bind variable on column CONVERTED_FLAG = :variable_y, say. If CONVERTED_FLAG is low, as the case in this example, then this leads to poor selectivity and CONVERTED_FLAG is a poor candidate to be used as the index.

Column COST_BURDEN_DISTRIBUTED_FLAG: NUM_DISTINCT = 2. Likewise, this is low. COST_BURDEN_DISTRIBUTED_FLAG is not a good candidate for index unless there is much skew. If there is data skew of, say, 90%, then 90% of the data has one particular value and 10% of the data has another value. If the query only needs to access the 10%, then a histogram would be needed on that column in order for the optimizer to recognize the skew and utilize an index on this column.

High

NUM_DISTINCT is more than one million for column EXPEDITURE_ID. If there is a bind variable on column EXPENDITURE_ID, then this leads to high selectivity (implying high density of data on this column). In other words, EXPENDITURE_ID is a good candidate to be used as the index.

NUM_NULL Column Statistic

NUM_NULLS indicates the number of null statistics.

Low

For example, if a single column index has few null, such as the COST_DISTRIBUTED_FLAG column, then if this column is used as the index, the resulting data set will be large.

High

If there are many nulls on a particular column, such as the CONVERTED_FLAG column, then if this column is used as the index, the resulting data set will be small. This means that COST_DISTRIBUTED_FLAG would be a more appropriate column to index.

DENSITY Column Statistic

This indicates how dense the values of that column are.

Column Statistics and Join Methods

Column statistics are useful to help determine the most efficient join method, which, in turn, is also based on the number of rows returned.

Using Histograms

The cost-based optimizer uses data value histograms to get accurate estimates of the distribution of column data. A histogram partitions the values in the column into bands, so that all column values in a band fall within the same range. Histograms provide improved selectivity estimates in the presence of data skew, resulting in optimal execution plans with nonuniform data distributions.

One of the fundamental capabilities of the cost-based optimizer is determining the selectivity of predicates that appear in queries. Selectivity estimates are used to decide when to use an index and the order in which to join tables. Most attribute domains (a table's columns) are not uniformly distributed.

The cost-based optimizer uses height-based histograms on specified attributes to describe the distributions of nonuniform domains. In a height-based histogram, the column values are divided into bands so that each band contains approximately the same number of values. The useful information that the histogram provides, then, is where in the range of values the endpoints fall.

See Also:
For more information, see "Types of Histograms".

Consider a column C with values between 1 and 100 and a histogram with 10 buckets. If the data in C is uniformly distributed, then the histogram would look like this, where the numbers are the endpoint values:

The number of rows in each bucket is one tenth the total number of rows in the table. Four-tenths of the rows have values between 60 and 100 in this example of uniform distribution.

If the data is not uniformly distributed, then the histogram might look like this:

In this case, most of the rows have the value 5 for the column. In this example, only 1/10 of the rows have values between 60 and 100.

When to Use Histograms

Histograms can affect performance and should be used only when they substantially improve query plans. In general, you should create histograms on columns that are frequently used in WHERE clauses of queries and have a highly skewed data distribution. For many applications, it is appropriate to create histograms for all indexed columns because indexed columns typically are the columns most often used in WHERE clauses.

Histograms are persistent objects, so there is a maintenance and space cost for using them. You should compute histograms only for columns that you know have highly skewed data distribution. For uniformly distributed data, the cost-based optimizer can make fairly accurate guesses about the cost of executing a particular statement without the use of histograms.

Histograms, like all other optimizer statistics, are static. They are useful only when they reflect the current data distribution of a given column. (The data in the column can change as long as the distribution remains constant.) If the data distribution of a column changes frequently, you must recompute its histogram frequently.

Histograms are not useful for columns with the following characteristics:

All predicates on the column use bind variables.
The column data is uniformly distributed.
The column is not used in WHERE clauses of queries.
The column is unique and is used only with equality predicates.

Creating Histograms

You generate histograms by using the DBMS_STATS package or the ANALYZE statement. You can generate histograms for columns of a table or partition. Histogram statistics are not collected in parallel.

For example, to create a 10-bucket histogram on the SAL column of the emp table, issue the following statement:

EXECUTE DBMS_STATS.GATHER_TABLE_STATS 
('scott','emp', METHOD_OPT => 'FOR COLUMNS SIZE 10 sal');

The SIZE keyword declares the maximum number of buckets for the histogram. You would create a histogram on the SAL column if there was an unusually high number of employees with the same salary and few employees with other salaries. You can also collect histograms for a single partition of a table.

See Also:
For more information on the DBMS_STATS package, see Oracle8i Supplied PL/SQL Packages Reference.

Choosing the Number of Buckets for a Histogram

The default number of buckets for a histogram is 75. This value provides an appropriate level of detail for most data distributions. However, because the number of buckets in the histogram, also known as 'the sampling rate', and the data distribution both affect a histogram's usefulness, you may need to experiment with different numbers of buckets to obtain optimal results.

If the number of frequently occurring distinct values in a column is relatively small, then set the number of buckets to be greater than that number.

Types of Histograms

There are two types of histograms:

Height-Based Histograms

Height-based histograms place approximately the same number of values into each range, so that the endpoints of the range are determined by how many values are in that range.

Consider that a table's query results in the following four sample values: 4, 18, 30, and 35.

For a height-based histogram, we consider each of these values to occupy a portion of one bucket, in proportion to their size. The resulting selectivity is computed with the following formula:

S = Height(35)/ Height(4 + 18 + 30 + 35)

Value-Based Histograms

Consider the same four sample values in the example above. In a value-based histogram a bucket is used to represent each of the four distinct values. In other words, one bucket represents 4, one bucket represents 18, another represents 30, and another represents 35. The resulting selectivity is computed with the following formula:

S = [#rows (35)/(#rows(4) + #rows(18) + #rows(30) + #rows(35))]/ #buckets

If there are many different values anticipated for a particular column of your table, it is preferable to use the value-based histogram rather than the height-based histogram. This is because if there is much data skew in the height, then the skew can offset the selectivity calculation and give a non-representative selectivity value.

Histogram Example

The following example illustrates the use of a histogram in order to improve the execution plan and demonstrate the skewed behavior of the s6 indexed column.

UPDATE so_lines l 
SET open_flag=null, 
     s6=10, 
     s6_date=sysdate, 
WHERE l.line_type_code in ('REGULAR','DETAIL','RETURN') AND 
      l.open_flag||'' = 'Y'AND NVL(l.shipped_quantity, 0)=0 OR 
      NVL(l.shipped_quantity, 0) != 0 AND 
      l.shipped_quantity +NVL(l.cancelled_quantity, 0)= l.ordered_quantity)) AND 
      l.s6=18

This query shows the skewed distribution of data values for s6. In this case, there are two distinct non-null values: 10 and 18. The majority of the rows consists of s6 = 10 (1,589,464), while a small amount of rows consist of s6 = 18 (13,091).

S6:      COUNT(*)
======================
10        1,589,464
18           13,091
NULL         21,889

The selectivity of column s6, where s6 = 18:

S = 13,091 / (13,091 + 1,589,464) = 0.008

If No Histogram is Used: Then the selectivity of column s6 is assumed to be 50%, uniformly distributed across 10 and 18. This is not selective; therefore, s6 is not an ideal choice for use as an index.
If a Histogram is Used: Then the data distribution information is stored in the dictionary. This allows the optimizer to use this information and compute the correct selectivity based on the data distribution. In the above example, the selectivity, based on the histogram data, is 0.008. This a relatively high, or good, selectivity, which indicates to the optimizer to use an index on column s6 in the execution plan.

Viewing Histograms

You can view histogram information with the following data dictionary views:

USER_HISTOGRAMS, ALL_HISTOGRAMS, and DBA_HISTOGRAMS
USER_PART_HISTOGRAMS, ALL_PART_HISTOGRAMS, and DBA_PART_HISTOGRAMS
USER_SUBPART_HISTOGRAMS, ALL_SUBPART_HISTOGRAMS, and DBA_SUBPART_HISTOGRAMS
TAB_COLUMNS

Number of Rows

View the following DBA_HISTOGRAMS dictionary table for the number of buckets; i.e., the number of rows, for each column:

ENDPOINT_NUMBER
ENDPOINT_VALUE

See Also:
For column descriptions of data dictionary views, as well as histogram use and restrictions, see Oracle8i Reference.

Verifying Histogram Statistics

To verify that histogram statistics are available, execute the following statement against the dictionary's DBA_HISTOGRAMS table:

SQL> SELECT ENDPOINT_NUMBER, ENDPOINT_VALUE
     FROM DBA_HISTOGRAMS
     WHERE TABLE_NAME ="SO_LINES_ALL" AND COLUMN_NAME="S2";

This returns the following typical data:

ENDPOINT_NUMBER   ENDPOINT_VALUE
---------------  ---------------
          1365                 4
          1370                 5
          2124                 8
          2228                18

Consider the difference between two ENDPOINT_NUMBER values, such as 1370 - 1365 = 5. This indicates that 5 values are represented in the bucket containing the endpoint 1365.

If endpoint numbers are very different, then this implies the use of more buckets, where one row corresponds to one bucket.

GATHER STALE	Gathers statistics on tables with stale statistics.
GATHER	Gathers statistics on all tables. (default)
GATHER EMPTY	Gathers statistics only on tables without statistics.
LIST STALE	Creates a list of tables with stale statistics.
LIST EMPTY	Creates a list of tables that do not have statistics.

8 Gathering Statistics

Understanding Statistics

Generating Statistics

Statistics for Partitioned Schema Objects

Using the ANALYZE Statement

Using the DBMS_STATS Package

Gathering Statistics with the DBMS_STATS Package

Table 8-1 Statistics Gathering Procedures in the DBMS_STATS Package

Gathering Index Statistics

Gathering New Optimizer Statistics

Gathering Automated Statistics

Designating Tables for Monitoring and Automated Statistics Gathering

Enabling Automated Statistics Gathering

Creating Lists of Tables with Stale or No Statistics

Preserving Versions of Statistics

Statistics Data

Data Distribution

Attribute Skew

Missing Statistics

Table 8-2 Default Table and Index Values When Statistics are Missing

Using Statistics

Managing Statistics

Statistics Tables

Viewing Statistics

Verifying Table Statistics

Verifying Index Statistics

Optimizer Index Determination Criteria

Usage Notes

DISTINCT

Index Cost Tie

Verifying Column Statistics

NUM_DISTINCT Column Statistic

Low

High

NUM_NULL Column Statistic

Low

High

DENSITY Column Statistic

Column Statistics and Join Methods

Using Histograms

When to Use Histograms

Creating Histograms

Choosing the Number of Buckets for a Histogram

Types of Histograms

Height-Based Histograms

Value-Based Histograms

Histogram Example

Viewing Histograms

Number of Rows

Verifying Histogram Statistics

8
Gathering Statistics