D
Glossary

ASCII

American Standard Code for Information Interchange. A common encoded 7-bit character set for English. ASCII includes the letters A-Z and a-z, as well as digits, punctuation symbols, and control characters. The Oracle character set name for this is US7ASCII.

Binary Sorting

Sorting of character strings based on their binary coded value representations.
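
As a minimal sketch of binary sorting, the following Python example orders strings by the byte values of their encoded form. The helper name binary_sort and the choice of UTF-8 as the encoding are assumptions made for illustration only.

```python
def binary_sort(strings, encoding="utf-8"):
    """Sort strings by the binary values of their encoded bytes."""
    return sorted(strings, key=lambda s: s.encode(encoding))

words = ["zebra", "Apple", "apple", "Zebra", "étude"]
binary_order = binary_sort(words)

# Uppercase letters (0x41-0x5A) sort before lowercase letters (0x61-0x7A),
# and the accented letter (lead byte 0xC3 in UTF-8) sorts last.
print(binary_order)
```

Note that the result groups all uppercase words before all lowercase words, which is usually not what a linguistic sort would produce.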

Case Conversion

Case conversion refers to changing a character from its uppercase to lowercase form, or vice versa.

Character

A character is an abstract element of a text. A character is different from a glyph (font glyph), which is a specific instance of a character. For example, the first character of the English uppercase alphabet can be printed (or displayed) as A, A, A, etc., depending on the typeface. All these different forms are different glyphs, but they represent the same character. A character, a character code, and a glyph are related as follows.

character --(encoding)--> character code --(font)--> glyph

When we have the first character of the English uppercase alphabet in computer memory, we actually have a number (or a character code). The character code is 0x41 if we are using the ASCII encoding scheme, 0xc1 if we are using the EBCDIC encoding scheme, or some other number if we are using a different encoding scheme. When we print (or display) this character, we use a font. We have to choose a font for the ASCII encoding scheme (or a font for a superset of the ASCII encoding scheme) if we are using the ASCII encoding scheme, or a font for the EBCDIC encoding scheme if we are using the EBCDIC encoding scheme. The character is then printed (or displayed) as A, A, A, or some other form. All these different forms are different glyphs, but they represent the same character.

Character Code

A character code is a number which represents a specific character. In order for computers to handle a character, we need a specific number which is assigned to that character. The number (or the character code) depends on what encoding scheme we are using. For example, the first character of the English upper-case Alphabet has the character code 0x41 for the ASCII encoding scheme, but the same character has the character code 0xc1 for the EBCDIC encoding scheme. (See "character" also.)
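
The dependence of a character code on the encoding scheme can be sketched in Python as follows. This is an illustration only: the helper character_code is hypothetical, and Python's cp037 codec is assumed here as one representative member of the EBCDIC family (other EBCDIC code pages may differ for some characters).

```python
def character_code(ch: str, encoding: str) -> int:
    """Return the single-byte character code of ch under the given encoding."""
    encoded = ch.encode(encoding)
    if len(encoded) != 1:
        raise ValueError(f"{ch!r} is not a single-byte character in {encoding}")
    return encoded[0]

# Same character, two encoding schemes, two different character codes.
ascii_code = character_code("A", "ascii")    # ASCII
ebcdic_code = character_code("A", "cp037")   # assumption: cp037 stands in for EBCDIC

print(hex(ascii_code), hex(ebcdic_code))
```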

Character Set

A character set is a set of characters for a specific language (or languages). There can be many different character sets just for one language.

Sometimes, a character set doesn't imply any specific character encoding scheme.

In this manual, a character set generally implies a specific character encoding scheme, which is how a number (or a character code) is assigned to each character of the character set. Therefore, in this manual the term character set generally means the same as encoded character set.

Character String

A character string is a serial string of characters.

A character string can also contain no characters. Such a character string is called a "null string", and its number of characters is 0 (zero).

Coded Character Set

Same as encoded character set.

An independent unit used to represent data, such as a letter, a letter with a diacritical mark, a digit, ideograph, punctuation, or symbol.

Character Classification

Character classification information provides details about the type of character associated with each legal character code; that is, whether it is an alphabetic, uppercase, lowercase, punctuation, control, or space character, etc.
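
As a rough illustration of character classification, the following Python sketch reports the classes named above for a single character. It relies on Python's built-in Unicode tables rather than on any Oracle NLS data, and the function name classify is an assumption made for this example.

```python
import unicodedata

def classify(ch: str) -> dict:
    """Report basic classification flags for one character."""
    category = unicodedata.category(ch)  # Unicode general category, e.g. "Lu", "Po", "Cc"
    return {
        "alphabetic": ch.isalpha(),
        "uppercase": ch.isupper(),
        "lowercase": ch.islower(),
        "punctuation": category.startswith("P"),
        "control": category == "Cc",
        "space": ch.isspace(),
    }

for sample in ("A", "a", "!", " ", "\t"):
    print(repr(sample), classify(sample))
```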

Character Encoding Scheme

A character encoding scheme is a rule that assigns numbers (or character codes) to all characters in a character set. We also use the shortened term encoding scheme (or encoding method, or just encoding).

Character Set Conversion

Conversion from one encoded character set to another.
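
A character set conversion can be sketched as decoding bytes from the source encoded character set and re-encoding them in the target one. The Python example below is a simplified illustration, not Oracle's conversion mechanism; the function name convert_character_set is hypothetical, and Python's latin-1 and utf-8 codecs are assumed as stand-ins for the source and target character sets.

```python
def convert_character_set(data: bytes, source: str, target: str) -> bytes:
    """Convert encoded text from one encoded character set to another."""
    text = data.decode(source)   # character codes -> abstract characters
    return text.encode(target)   # abstract characters -> new character codes

latin1_bytes = b"caf\xe9"        # "café" in ISO 8859-1 (one byte per character)
utf8_bytes = convert_character_set(latin1_bytes, "latin-1", "utf-8")

print(latin1_bytes, "->", utf8_bytes)
```

The accented letter occupies one byte before the conversion and two bytes after it, so the byte length of a string can change even though its characters do not.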

Client Character Set

The encoded character set which the client uses. A client character set can differ from the database server character set, in which case, character set conversion must occur.

Collation

Ordering of character strings in a given alphabet in a linguistic sort order or a binary sort order.

Combining Character

A character that graphically combines with a preceding base character. These characters are not used in isolation. They include such characters as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras.

Composite Character

A single character which can be represented by a composite character sequence. This type of character is found in the scripts of Thai, Lao, Vietnamese, and Korean Hangul, as well as many Latin characters used in European languages.

Composite Character Sequence

A character sequence consisting of a base character followed by one or more combining characters. This is also referred to as a combining character sequence.
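
The relationship between a composite character and its composite character sequence can be illustrated with Python's unicodedata module. The Unicode normalization forms (NFD and NFC) used here are not described in this glossary; they are assumed only as a convenient way to split and rejoin the character.

```python
import unicodedata

composite = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, a single composite character

# Decompose into a base character followed by a combining character.
sequence = unicodedata.normalize("NFD", composite)
base, mark = sequence[0], sequence[1]

# A nonzero combining class identifies the second element as a combining character.
is_combining = unicodedata.combining(mark) != 0

# Recompose the sequence into the single composite character.
recomposed = unicodedata.normalize("NFC", sequence)

print([hex(ord(c)) for c in sequence], is_combining, recomposed == composite)
```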

Database Character Set

The encoded character set in which text stored in the database is represented. This includes CHAR, VARCHAR2, LONG, and CLOB column values and all SQL and PL/SQL text stored in the database.

Diacritical Mark

A mark added to a letter that usually provides information about pronunciation or stress.

DBCS

DBCS stands for Double-Byte (Coded) Character Set. However, this term should be used carefully. (Use the term multibyte (coded) character set when appropriate.) See "double-byte" also.

Double byte

Double-byte (or doublebyte or double byte) means two bytes. However, this term should be used carefully. (Use the term multibyte when appropriate.) For many characters of many languages, double-byte is not enough (this is especially true for UTF8 encoding of Unicode).

EBCDIC

Extended Binary Coded Decimal Interchange Code. EBCDIC is a family of encoded character sets used mostly on IBM systems.

Encoded Character Set

An encoded character set is a character set with an associated character encoding scheme.

An encoded character set specifies how a number (or a character code) is assigned to each character of the character set based on a character encoding scheme.

Encoding

Encoding Method or Encoding scheme. Same as Character Encoding Scheme.

Encoding Scheme

See "Character Encoding Scheme".

EUC

Extended UNIX Codes. A common encoding method used on Asian UNIX systems. It combines up to four different encoded character sets in a single data stream.

Euro

The new unit of currency used by participating member states of the European Union.

Font

An ordered collection of character glyphs which provides a graphical representation of characters within a character set.

Glyph

A glyph (font glyph) is a specific instance of a character. A character can have many different glyphs. For example, the first character of the English upper-case Alphabet can be printed (or displayed) as A, A, A, etc.

All these different forms are different glyphs, but they represent the same character. (See "character" also.)

Ideograph

A symbol representing an idea. Chinese is an example of an ideographic system.

Internationalization

The process of making software flexible enough to be used in many different linguistic and cultural environments. Internationalization should not be confused with localization, which is the process of preparing software for use in one specific locale.

ISO

International Organization for Standardization.

ISO/IEC 10646

A universal character set standard defining the characters of most major scripts used in the modern world. In 1993, ISO adopted Unicode version 1.1 as ISO/IEC 10646-1:1993. ISO/IEC 10646 has two formats: UCS2 is a 2-byte fixed-width format and UCS4 is a 4-byte fixed-width format. There are three levels of implementation, all relating to support for composite characters. Level 1 requires no composite character support, level 2 requires support for specific scripts (including most of the Unicode scripts such as Arabic, Thai, etc.), and level 3 requires unrestricted support for composite characters in all languages.

ISO Currency

The 3-letter abbreviation used to denote a local currency, which is based on the ISO 4217 standard. For example, "USD" represents the United States Dollar.

ISO 8859

A family of 8-bit encoded character sets. The most common one is ISO 8859-1 (also known as Latin-1), and is used for Western European languages.

Latin-1

Formally known as the ISO 8859-1 character set standard. An 8-bit extension to ASCII which adds 128 characters covering the most common Latin characters used in Western Europe. The Oracle character set name for this is WE8ISO8859P1. See also "ISO 8859".

Linguistic Index

An index built on a linguistic collation order.

Linguistic Sorting

Sorting of strings based on requirements from a locale instead of based on the binary representation of the strings.

Local Currency

The currency symbol used in a country or region. For example, "$" represents the United States Dollar.

Locale

A collection of information about the linguistic and cultural preferences of a particular region. Typically, a locale consists of language, territory, character set, linguistic, and calendar information defined in NLS data files.

Localization

The process of providing language- or culture-specific information for software systems. Translation of an application's user interface would be an example of localization. Localization should not be confused with internationalization, which is the process of generalizing software so it can handle many different linguistic and cultural conventions.

Monolingual Support

Support for only one language.

Multibyte

Multi-byte (or multibyte or multi byte) means two or more bytes.

When we assign character codes to all characters for a specific language (or a group of languages), one byte (8 bits) can represent 256 different characters. Two bytes (16 bits) can represent up to 65,536 different characters. However, two bytes are still not enough to represent all the characters for many languages. We use 3 bytes or 4 bytes for those characters.

One example is UTF8 encoding of Unicode. In UTF8, there are a lot of 2-byte and 3-byte characters.

Another example is Traditional Chinese language used in Taiwan. It has more than 80,000 different characters. We are using 4 bytes for some of those characters under some character encoding schemes used in Taiwan.

Multibyte Character

A multibyte character is a character whose character code consists of two or more bytes under a certain character encoding scheme. Note that the same character may have a different character code when the character encoding scheme is different. Without knowing which character encoding scheme we are using, we cannot tell which character is a multibyte character. For example, Japanese Hankaku-Katakana (half-width Katakana) characters are one byte in the JA16SJIS encoded character set, two bytes in JA16EUC, and three bytes in UTF8. See "single-byte character" also.
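
The half-width Katakana example can be checked with a short Python sketch. The Python codec names shift_jis, euc_jp, and utf-8 are assumed here as approximate counterparts of the Oracle character sets JA16SJIS, JA16EUC, and UTF8; the mapping between the two naming systems is an assumption, not something this manual defines.

```python
# U+FF71 HALFWIDTH KATAKANA LETTER A
hankaku_a = "\uff71"

# Assumed Python counterparts of the Oracle character set names.
codecs_by_oracle_name = {
    "JA16SJIS": "shift_jis",
    "JA16EUC": "euc_jp",
    "UTF8": "utf-8",
}

# Number of bytes the same character occupies under each encoding scheme.
byte_lengths = {
    oracle_name: len(hankaku_a.encode(codec))
    for oracle_name, codec in codecs_by_oracle_name.items()
}

print(byte_lengths)
```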

Multibyte Character String

A multibyte character string is a character string that consists of one of the following:

No character (the character string is called a "null string" in this case)
One or more single-byte characters
A mixture of one or more single-byte characters and one or more multibyte characters
One or more multibyte characters

Theoretically, we can exclude single-byte character strings (character strings including only single-byte characters) from the list above. However, it's probably more convenient for software to handle single-byte character strings as one type of multibyte character strings.

NCHAR Character Set

An alternate character set from the database character set that can be specified for NCHAR, NVARCHAR2, and NCLOB columns. NCHAR character sets, unlike the database character set, can support fixed-width multibyte character sets. Care must be taken when selecting an NCHAR character set, since its character repertoire must be included in the database character set as well.

Net8

Net8 enables two or more computers that run the Oracle server to exchange data through a third-party network. It is independent of the communications protocol.

NLS

National Language Support. NLS allows users to interact with the database in their native languages. It also allows applications to run in different linguistic and cultural environments.

NLSDATA

A general phrase referring to the contents in many files with .nlb suffixes. These files contain data that the NLSRTL library uses to provide specific NLS support.

NLSRTL

National Language Support Run-Time Library. This library is responsible for providing locale-independent algorithms for internationalization. The locale-specific information (i.e., NLSDATA) is read by the NLSRTL library during run-time.

Replacement Character

A character used during character conversion when the desired character is not available in the target character set. For example, "?" is often used as Oracle's default replacement character.
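
The effect of a replacement character can be sketched in Python, whose "replace" error handler also substitutes "?" when encoding. This only mirrors the behavior described above; it is not Oracle's conversion code, and the sample string is made up for the example.

```python
source_text = "caf\u00e9 \u20ac"  # "café €": two characters are outside 7-bit ASCII

# Characters missing from the target character set become "?".
converted = source_text.encode("ascii", errors="replace")

print(converted)
```

The substitution loses information: the original characters cannot be recovered from the converted data.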

Restricted Multilingual Support

Multilingual support that is restricted to a group of related languages rather than all languages. A family of similar languages, such as the Western European languages, can be represented with, for example, ISO 8859-1. In this case, however, Thai could not be added.

SQL*Net

Now called Net8. Net8 enables two or more computers that run the Oracle server to exchange data through a third-party network. It is independent of the communications protocol.

Script

A collection of related graphic symbols used in a writing system. Some scripts are used to represent multiple languages, and some languages use multiple scripts. Examples of scripts include Latin, Arabic, and Han.

Server Character Set

The character set used by the database server.

Single-byte

Single-byte (or singlebyte or single byte) means one byte. One byte usually consists of 8 bits. When we assign character codes to all characters for a specific language, one byte (8 bits) can represent 256 different characters.

Single-byte character

A single-byte character is a character whose character code consists of one byte under a certain character encoding scheme. Note that the same character may have a different character code when the character encoding scheme is different. Without knowing which character encoding scheme we are using, we cannot tell which character is a single-byte character. For example, the euro currency symbol is one byte in the WE8MSWIN1252 encoded character set, two bytes in UCS2, and three bytes in UTF8. See "multibyte character" also.

Single-byte Character String

A single-byte character string is a character string that consists of one of the following:

No character (the character string is called a "null string" in this case)
One or more single-byte characters

UCS-2

UCS stands for "Universal Multiple-Octet Coded Character Set". It is a 1993 ISO and IEC standard character set. See "UCS2".

UCS2

Fixed-width 16-bit Unicode. Each character occupies 16 bits of storage. The Latin-1 characters are the first 256 code points in this standard, so it can be viewed as a 16-bit extension of Latin-1.
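
The claim that UCS2 extends Latin-1 can be checked with a small Python sketch. Python has no dedicated UCS2 codec, so utf-16-be is assumed here as an equivalent for these 256 code points (the two agree wherever no surrogate pairs are involved); the function name is hypothetical.

```python
def latin1_is_prefix_of_ucs2() -> bool:
    """Check that each Latin-1 byte value equals its 16-bit code point."""
    for value in range(256):
        ch = bytes([value]).decode("latin-1")
        # The code point must equal the Latin-1 byte value ...
        if ord(ch) != value:
            return False
        # ... and the 16-bit form is that value padded with a zero high byte.
        if ch.encode("utf-16-be") != bytes([0x00, value]):
            return False
    return True

print(latin1_is_prefix_of_ucs2())
```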

UCS4

Fixed-width 32-bit Unicode. Each character occupies 32 bits of storage. The UCS2 characters are the first 65,536 code points in this standard, so it can be viewed as a 32-bit extension of UCS2. This is also sometimes referred to as ISO-10646. ISO-10646 is a standard that specifies up to 2,147,483,648 characters in 32768 planes, of which the first plane is the UCS2 set. The ISO standard also specifies transformations between different encodings.

Unicode

Unicode is a type of universal character set, a collection of 64K characters encoded in a 16-bit space. It encodes nearly every character in just about every existing character set standard, covering most written scripts used in the world. It is owned and defined by Unicode Inc. Unicode is a canonical encoding, which means its values can be passed around in different locales. However, it does not guarantee a round-trip conversion between it and every Oracle character set without information loss.

Unicode Codepoint

A 16-bit binary value that can represent a unit of encoded text for processing and interchange. Every point between U+0000 and U+FFFF is a code point. The term is interchangeable with code element, code position, and code value.

Unicode Mapping Between UCS and UTF Formats

The following shows how different Unicode-related character sets relate to one another in terms of character code value ranges:

UCS2                       UTF8           Description
0x0000 - 0x007F            0x00 - 0x7F    Single bytes
0x0080 - 0x07FF            0xC0 - 0xDF    2-byte sequence leaders (5+6 bits)
0x0800 - 0xFFFF            0xE0 - 0xEF    3-byte sequence leaders (4+6+6 bits)
                           0x80 - 0xBF    Follower bytes (6 bits each)

UCS4                       UTF8           Description
0x00000000 - 0x0000007F    0x00 - 0x7F    Single bytes
0x00000080 - 0x000007FF    0xC0 - 0xDF    2-byte sequence leaders (5+6 bits)
0x00000800 - 0x0000FFFF    0xE0 - 0xEF    3-byte sequence leaders (4+6+6 bits)
0x00010000 - 0x001FFFFF    0xF0 - 0xF7    4-byte sequence leaders (3+6+6+6 bits)
0x00200000 - 0x03FFFFFF    0xF8 - 0xFB    5-byte sequence leaders (2+6+6+6+6 bits)
0x04000000 - 0x7FFFFFFF    0xFC - 0xFD    6-byte sequence leaders (1+6+6+6+6+6 bits)
                           0x80 - 0xBF    Follower bytes (6 bits each)
                           0xFE - 0xFF    Reserved or unused

UCS4                       UTF16              Description
0x00000000 - 0x0000FFFF    0x0000 - 0xFFFF    Same as UCS2
0x00010000 - 0x0010FFFF    0xD800 - 0xDBFF    High surrogate ((x-0x10000)>>10)&0x3FF
                           0xDC00 - 0xDFFF    Low surrogate (x-0x10000)&0x3FF
0x00110000 - 0x7FFFFFFF                       Not mapped to UTF16

Unrestricted Multilingual Support

Being able to use as many languages as desired. A universal character set, such as Unicode, helps to provide unrestricted multilingual support because it supports a very large character repertoire, encompassing most modern languages of the world.

UTF-8

A variable-width encoding of UCS2 which uses sequences of 1, 2, or 3 bytes per character. Characters from 0-127 (the 7-bit ASCII characters) are encoded with one byte, characters from 128-2047 require two bytes, and characters from 2048-65535 require three bytes. The Oracle character set name for this is UTF8 (for the Unicode 2.1 standard). The standard has left room for expansion to support the UCS4 characters with sequences of 4, 5, and 6 bytes per character.
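
The 1-, 2-, and 3-byte rules above can be sketched as a small Python encoder for code points in the UCS2 range. This is an illustration of the bit layout only; the function name utf8_encode_ucs2 is an assumption, and the longer 4- to 6-byte forms are deliberately left out.

```python
def utf8_encode_ucs2(code_point: int) -> bytes:
    """Encode one UCS2 code point (0x0000-0xFFFF) as 1, 2, or 3 UTF-8 bytes."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("code point outside the UCS2 range")
    if code_point <= 0x7F:
        # 7 bits: a single byte, identical to ASCII.
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 5 + 6 bits: leader 110xxxxx, follower 10xxxxxx.
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    # 4 + 6 + 6 bits: leader 1110xxxx, two followers 10xxxxxx.
    return bytes([0xE0 | (code_point >> 12),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

for cp in (0x41, 0xE9, 0x20AC):
    print(hex(cp), utf8_encode_ucs2(cp).hex())
```

Each leader byte announces the sequence length through its high bits, and every follower byte carries six payload bits, which is why follower bytes always fall in the range 0x80 - 0xBF.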

UTF-16

An extension to UCS2 that allows for pairs of UCS2 code points to represent extended characters from the UCS4 set. UCS2 has ranges of code points allocated for high (leading) and low (trailing) surrogates that support UTF16 encodings.
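
The way a pair of surrogates represents one extended character can be sketched in Python as follows. The function name to_surrogate_pair is hypothetical; the arithmetic follows the usual UTF-16 rule of subtracting 0x10000 and splitting the remaining 20 bits into two 10-bit halves.

```python
from typing import Tuple

def to_surrogate_pair(x: int) -> Tuple[int, int]:
    """Split a code point in 0x10000-0x10FFFF into high and low surrogates."""
    if not 0x10000 <= x <= 0x10FFFF:
        raise ValueError("code point is not representable as a surrogate pair")
    offset = x - 0x10000                      # 20-bit value
    high = 0xD800 + ((offset >> 10) & 0x3FF)  # top 10 bits -> 0xD800-0xDBFF
    low = 0xDC00 + (offset & 0x3FF)           # bottom 10 bits -> 0xDC00-0xDFFF
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))
```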

Wide Character

A fixed-width character format that is well-suited for extensive text processing because it allows for data to be processed in consistent fixed-width chunks. Wide characters are intended for supporting internal character processing, and are therefore implementation-dependent.
