Oracle8i JDBC Developer's Guide and Reference Release 3 (8.1.7), Part Number A83724-01
After a brief overview, this section describes how the Oracle JDBC drivers perform character set conversions, how NLS is handled by the OCI, Thin, and server-side internal drivers, the NLS class files shipped with the drivers, and the bind size restrictions that the Thin driver imposes on CHAR and VARCHAR2 data.
Oracle's JDBC drivers support NLS (National Language Support). NLS lets you retrieve data from, or insert data into, a database in any character set that Oracle supports. If the client and the server use different character sets, then the driver provides the support to perform the conversions between the database character set and the client character set.
For more information on NLS, NLS environment variables, and the character sets that Oracle supports, see the Oracle8i National Language Support Guide. See the Oracle8i Reference for more information on the database character set and how it is created.
Here are a few examples of commonly used Java methods for JDBC that rely heavily on NLS character set conversion:
- The java.sql.ResultSet methods getString() and getUnicodeStream() return values from the database as Java strings and as a stream of Unicode characters, respectively (see the sketch following this list).
- The oracle.sql.CLOB method getCharacterStream() returns the contents of a CLOB as a Unicode stream.
- The oracle.sql.CHAR methods getString(), toString(), and getStringWithReplacement() convert the following data to strings:
  - getString(): Converts the sequence of characters represented by the CHAR object to a string and returns a Java String object.
  - toString(): Identical to getString(), but if the character set is not recognized, then toString() returns a hexadecimal representation of the CHAR data.
  - getStringWithReplacement(): Identical to getString(), except that characters which have no Unicode representation in the character set of this CHAR object are replaced by a default replacement character.
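For example, the following minimal sketch (the connect string, user, password, table, and column names are assumed for illustration only) retrieves character data with getString(), relying on the driver to convert from the database character set to UCS-2:

```java
// Minimal sketch; the connect string, user, password, table, and column are assumed.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class NlsGetStringExample {
    public static void main(String[] args) throws Exception {
        // Register the Oracle JDBC driver (8.1.7-era class name).
        Class.forName("oracle.jdbc.driver.OracleDriver");

        Connection conn = DriverManager.getConnection(
            "jdbc:oracle:thin:@myhost:1521:orcl", "scott", "tiger");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT ename FROM emp");

        while (rs.next()) {
            // getString() returns a Java String in UCS-2; the driver performs
            // the conversion from the database character set.
            System.out.println(rs.getString(1));
        }

        rs.close();
        stmt.close();
        conn.close();
    }
}
```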
The techniques that the Oracle JDBC drivers use to perform character set conversion for Java applications depend on the character set the database uses. The simplest case is where the database uses the US7ASCII or WE8ISO8859P1 character set. In this case, the driver converts the data directly from the database character set to UCS-2, which is used in Java applications, and vice versa.

If you are working with databases that employ a non-US7ASCII or non-WE8ISO8859P1 character set (for example, Japanese or Korean), then the driver converts the data first to UTF-8 (this step does not apply to the server-side internal driver), then to UCS-2. For example, the driver always converts CHAR and VARCHAR2 data in a non-US7ASCII, non-WE8ISO8859P1 character set. It does not convert RAW data.
If you are using the JDBC OCI driver, then NLS is handled as in any other Oracle client situation. The client character set, language, and territory settings are in the NLS_LANG environment variable, which is set at client-installation time.

Note that there are also server-side settings for these parameters, determined during database creation. So, when performing character set conversion, the JDBC OCI driver has to take three factors into consideration:

- The database character set and NLS settings on the server
- The client character set, specified in the NLS_LANG environment variable
- UCS-2, the character set used by Java
The JDBC OCI driver transfers the data from the server to the client in the character set of the database. Depending on the value of the NLS_LANG environment variable, the driver handles character set conversions in one of two ways:

- If NLS_LANG is not specified, or specifies the US7ASCII or WE8ISO8859P1 character set, then the JDBC OCI driver uses Java to convert the character set from US7ASCII or WE8ISO8859P1 directly to UCS-2, or the reverse.
- If NLS_LANG specifies a non-US7ASCII or non-WE8ISO8859P1 character set, then the driver changes the value of the NLS_LANG parameter on the client to UTF-8. This happens automatically and does not require any user intervention. OCI uses the NLS_LANG setting to convert the data from the database character set to UTF-8; the JDBC driver then converts the UTF-8 data to UCS-2.
If you are using the JDBC Thin driver, then there is presumably no Oracle client installation, so NLS conversions must be handled differently.

The Thin driver obtains the language and territory settings (NLS_LANGUAGE and NLS_TERRITORY) from the Java locale in the JVM user.language property. The date format (NLS_DATE_FORMAT) is set according to the territory setting.
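As a rough sketch (the property values shown are only examples, set at JVM startup with options such as -Duser.language=fr -Duser.region=FR), the following code inspects the locale settings that the Thin driver works from:

```java
// Illustrative sketch only; property values depend on how the JVM was started,
// for example: java -Duser.language=fr -Duser.region=FR MyApp
import java.util.Locale;

public class ThinDriverLocaleCheck {
    public static void main(String[] args) {
        // The Thin driver derives NLS_LANGUAGE and NLS_TERRITORY from these settings.
        System.out.println("user.language  = " + System.getProperty("user.language"));
        System.out.println("user.region    = " + System.getProperty("user.region"));
        System.out.println("default locale = " + Locale.getDefault());
    }
}
```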
If the database character set is US7ASCII or WE8ISO8859P1, then the data is transferred to the client without any conversion. The driver then converts the character set to UCS-2 in Java.

If the database character set is something other than US7ASCII or WE8ISO8859P1, then the server first translates the data to UTF-8 before transferring it to the client. On the client, the JDBC Thin driver converts the data to UCS-2 in Java.
If your JDBC code running in the server accesses the database, then the JDBC server-side internal driver performs a character set conversion based on the database character set. The target character set of all Java programs is UCS-2.
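For instance, the following sketch (the table and column names are assumed for illustration) shows Java code running inside the server that reads character data through the server-side internal driver; the data it reads is converted from the database character set to UCS-2:

```java
// Illustrative sketch; the table and column names are assumed.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ServerSideNlsExample {
    public static String firstEname() throws Exception {
        // The server-side internal driver connects to the session in which
        // the Java code is running; no host or port is involved.
        Connection conn = DriverManager.getConnection("jdbc:default:connection:");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT ename FROM emp");
        String name = rs.next() ? rs.getString(1) : null;  // converted to UCS-2
        rs.close();
        stmt.close();
        return name;
    }
}
```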
The Oracle JDBC class files, classes12.zip and classes111.zip, provide NLS support for the Thin and OCI drivers. The files contain all the necessary classes to provide complete NLS support for all Oracle character sets for CHAR, VARCHAR, LONGVARCHAR, and CLOB type data not retrieved or inserted as part of an Oracle object or collection type.

However, in the case of the CHAR and VARCHAR data portion of Oracle objects and collections, the JDBC class files provide support for only a small set of commonly used character sets.
To provide support for all NLS character sets, the Oracle8i JDBC driver installation includes two additional files: nls_charset12.zip for JDK 1.2.x and nls_charset11.zip for JDK 1.1.x. The OCI and Thin drivers require these files to support all Oracle character sets for CHAR and VARCHAR data in Oracle object types and collections. To obtain this support, you must add the appropriate nls_charset*.zip file to your CLASSPATH.
It is important to note that the nls_charset*.zip files are very large, because they must support a large number of character sets. To save space, you might want to keep only the classes you need from the nls_charset*.zip file. If you want to do this, follow these steps:

1. Unzip the nls_charset*.zip file.
2. Keep only the character set conversion classes that you need and discard the rest.
3. Add the remaining class files to your CLASSPATH.
The character set extension class files are named in the following format:

CharacterConverter<OracleCharacterSetId>.class

where <OracleCharacterSetId> is the hexadecimal representation of the Oracle character set ID that corresponds to a character set name.
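As a small illustration of the naming scheme (this helper is not part of the Oracle API, and the character set ID used is arbitrary), a class file name can be derived from a character set ID as follows:

```java
// Not part of the Oracle API; a sketch of how the extension class file names are formed.
public class ConverterClassName {
    static String converterClassFor(int oracleCharacterSetId) {
        // CharacterConverter<hexadecimal character set ID>.class
        return "CharacterConverter"
                + Integer.toHexString(oracleCharacterSetId).toUpperCase()
                + ".class";
    }

    public static void main(String[] args) {
        // 0x36A is an arbitrary illustrative ID, not a specific Oracle character set.
        System.out.println(converterClassFor(0x36A));  // CharacterConverter36A.class
    }
}
```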
If the database character set is neither ASCII (US7ASCII) nor ISO-LATIN-1 (WE8ISO8859P1), then the Thin driver must impose size restrictions for CHAR and VARCHAR2 bind parameters that are more restrictive than normal database size limitations. This is necessary to allow for data expansion during conversion.
The Thin driver checks CHAR or VARCHAR2 bind sizes when the setXXX() method is called. If the data size exceeds the size restriction, then the driver throws a SQL exception (ORA-17070 "Data size bigger than max size for this type") from the setXXX() call, as the sketch after this list illustrates. This limitation is necessary to avoid the chance of data corruption whenever an NLS conversion occurs and increases the length of the data. This limitation is enforced when you are doing all the following:

- Using the JDBC Thin driver
- Binding CHAR or VARCHAR2 datatypes
- Using a database whose character set is neither ASCII (US7ASCII) nor ISO-Latin-1 (WE8ISO8859P1)
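The following sketch (the table, column, and data sizes are assumed, and the actual limit depends on the database character set) shows a bind that can exceed the Thin driver restriction and the resulting exception:

```java
// Illustrative sketch; the table, column, and sizes are assumed.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BindSizeExample {
    public static void main(String[] args) throws Exception {
        Class.forName("oracle.jdbc.driver.OracleDriver");
        Connection conn = DriverManager.getConnection(
            "jdbc:oracle:thin:@myhost:1521:orcl", "scott", "tiger");
        PreparedStatement ps =
            conn.prepareStatement("INSERT INTO notes (body) VALUES (?)");

        // Build a value larger than the restricted bind size; for example, more
        // than 1333 UTF-8 bytes when the database character set has an NLS ratio of 3.
        StringBuffer big = new StringBuffer();
        for (int i = 0; i < 2000; i++) {
            big.append('A');
        }

        try {
            ps.setString(1, big.toString());   // the size check happens here
            ps.executeUpdate();
        } catch (SQLException e) {
            // ORA-17070: Data size bigger than max size for this type
            System.out.println("Bind rejected: " + e.getMessage());
        } finally {
            ps.close();
            conn.close();
        }
    }
}
```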
As previously discussed, when the database character set is neither US7ASCII nor WE8ISO8859P1, the Thin driver converts Java UCS-2 characters to UTF-8 encoding bytes for CHAR or VARCHAR2 binds. The UTF-8 encoding bytes are then transferred to the database, and the database converts the UTF-8 encoding bytes to the database character set encoding.

This conversion to the database character set encoding might result in a size increase. The NLS ratio for a database character set indicates the maximum possible expansion in converting from UTF-8 to the character set:
NLS ratio = (maximum possible value of) [(size in database character set) / (size in UTF-8)]
Table 18-1 shows the database size limitations for CHAR and VARCHAR2 data, and the Thin driver size restriction formulas for CHAR and VARCHAR2 binds. Database limits are in bytes. Formulas determine the maximum size of the UTF-8 encoding, in bytes.

The formulas guarantee that after the data is converted from UTF-8 to the database character set, the size will not exceed the database maximum size.
The number of UCS-2 characters that can be supported is determined by the number of bytes per character in the data. All ASCII characters are one byte long in UTF-8 encoding. Other character types can be two or three bytes long.
Table 18-2 lists the NLS ratios of some common server character sets, then shows the Thin driver maximum bind sizes for CHAR and VARCHAR2 data for each character set, as determined by using the NLS ratio in the appropriate formula. Again, maximum bind sizes are for UTF-8 encoding, in bytes.
Server Character Set | NLS Ratio | Thin Driver Max VARCHAR2 Bind Size (UTF-8 bytes) | Thin Driver Max CHAR Bind Size (UTF-8 bytes)
---|---|---|---
 | 1 | 4000 | 2000
 | 2 | 2000 | 2000
 | 3 | 1333 | 1333
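As a worked illustration (using the 4000-byte VARCHAR2 database limit): for a server character set with an NLS ratio of 3, the Thin driver limits a VARCHAR2 bind to 4000 / 3 ≈ 1333 bytes of UTF-8 data, so that even if every byte expands by the full ratio during conversion to the database character set, the converted value still fits within the 4000-byte VARCHAR2 maximum. This matches the 1333-byte VARCHAR2 entry in the last row of Table 18-2.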
Copyright © 1996-2000, Oracle Corporation. All Rights Reserved.