Oracle8i National Language Support Guide Release 2 (8.1.6) Part Number A76966-01
American Standard Code for Information Interchange. A common encoded 7-bit character set for English. ASCII includes the letters A-Z and a-z, as well as digits, punctuation symbols, and control characters. The Oracle character set name for this is US7ASCII.
Sorting of character strings based on their binary coded value representations.
Case conversion refers to changing a character from its uppercase to lowercase form, or vice versa.
A character is an abstract element of a text. A character is different from a glyph (font glyph), which is a specific instance of a character. For example, the first character of the English uppercase alphabet can be printed (or displayed) as the letter A in many different typefaces and styles. All these different forms are different glyphs, but they represent the same character. A character, a character code, and a glyph are related as follows.
character --(encoding)--> character code --(font)--> glyph
When we have the first character of the English uppercase alphabet in computer memory, we actually have a number (or a character code). The character code is 0x41 if we are using the ASCII encoding scheme, 0xc1 if we are using the EBCDIC encoding scheme, or some other number if we are using a different encoding scheme. When we print (or display) this character, we use a font. The font must match the encoding scheme: a font for the ASCII encoding scheme (or for a superset of it) if we are using ASCII, or a font for the EBCDIC encoding scheme if we are using EBCDIC. The character is then printed (or displayed) as the letter A in one typeface or another. All these different forms are different glyphs, but they represent the same character.
A character code is a number which represents a specific character. In order for computers to handle a character, we need a specific number which is assigned to that character. The number (or the character code) depends on what encoding scheme we are using. For example, the first character of the English upper-case Alphabet has the character code 0x41 for the ASCII encoding scheme, but the same character has the character code 0xc1 for the EBCDIC encoding scheme. (See "character" also.)
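The dependence of the character code on the encoding scheme can be checked with a short Python sketch. It assumes that Python's "ascii" and "cp037" codecs are acceptable stand-ins for the ASCII and EBCDIC encoding schemes; cp037 is only one member of the EBCDIC family, and neither codec is an Oracle component.

```python
# Encode the same character under two encoding schemes and compare the codes.
# Assumption: "cp037" (one EBCDIC code page) represents "the EBCDIC scheme".
character = "A"

ascii_code = character.encode("ascii")[0]    # first (and only) byte under ASCII
ebcdic_code = character.encode("cp037")[0]   # first (and only) byte under EBCDIC cp037

print(f"ASCII  character code: 0x{ascii_code:02x}")
print(f"EBCDIC character code: 0x{ebcdic_code:02x}")
```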
A character set is a set of characters for a specific language (or languages). There can be many different character sets just for one language.
Sometimes, a character set doesn't imply any specific character encoding scheme.
In this manual, a character set generally implies a specific character encoding scheme, which is how a number (or a character code) is assigned to each character of the character set. Therefore, the meaning of the term character set is generally the same as encoded character set in this manual.
A character string is a serial string of characters.
A character string can also contain no characters at all. Such a character string is called a "null string", and its number of characters is 0 (zero).
Same as encoded character set.
An independent unit used to represent data, such as a letter, a letter with a diacritical mark, a digit, ideograph, punctuation, or symbol.
Character classification information provides details about the type of character associated with each legal character code; that is, whether it is an alphabetic, uppercase, lowercase, punctuation, control, or space character, etc.
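As an illustration only, the following Python sketch derives this kind of classification from the Unicode data that ships with Python rather than from Oracle's NLS data files. The function name `classify` and the class labels are hypothetical.

```python
import unicodedata

def classify(ch):
    """Return coarse classes for one character (labels are illustrative)."""
    classes = []
    if ch.isalpha():
        classes.append("alphabetic")
    if ch.isupper():
        classes.append("uppercase")
    if ch.islower():
        classes.append("lowercase")
    if ch.isdigit():
        classes.append("digit")
    if ch.isspace():
        classes.append("space")
    category = unicodedata.category(ch)  # two-letter Unicode general category
    if category.startswith("P"):
        classes.append("punctuation")
    if category == "Cc":
        classes.append("control")
    return classes

for sample in ("A", "a", "7", "!", "\t"):
    print(repr(sample), classify(sample))
```

A character can belong to several classes at once; the tab character, for example, is both a space and a control character.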
A character encoding scheme is a rule that assigns numbers (or character codes) to all characters in a character set. We also use the shortened term encoding scheme (or encoding method, or just encoding).
Conversion from one encoded character set to another.
The encoded character set which the client uses. A client character set can differ from the database server character set, in which case, character set conversion must occur.
Ordering of character strings in a given alphabet in a linguistic sort order or a binary sort order.
A character that graphically combines with a preceding base character. These characters are not used in isolation. They include such characters as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras.
A single character which can be represented by a composite character sequence. This type of character is found in the scripts of Thai, Lao, Vietnamese, and Korean Hangul, as well as many Latin characters used in European languages.
A character sequence consisting of a base character followed by one or more combining characters. This is also referred to as a combining character sequence.
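The relationship between a composite character and its composite character sequence can be sketched with Python's standard unicodedata module. This uses Unicode normalization (NFD and NFC), which is an assumption of the example and is not described in this manual.

```python
import unicodedata

composite = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE as a single character

# Decompose into a base character followed by a combining character.
sequence = unicodedata.normalize("NFD", composite)
for ch in sequence:
    kind = "combining" if unicodedata.combining(ch) else "base"
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} ({kind})")

# Recompose the sequence into the single composite character.
recomposed = unicodedata.normalize("NFC", sequence)
print(recomposed == composite)
```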
The encoded character set in which text stored in the database is represented. This includes CHAR, VARCHAR2, LONG, and CLOB column values and all SQL and PL/SQL text stored in the database.
A mark added to a letter that usually provides information about pronunciation or stress.
DBCS stands for Double-Byte (Coded) Character Set. However, this term should be used carefully. (Use the term multibyte (coded) character set when appropriate.) See "double-byte" also.
Double-byte (or doublebyte or double byte) means two bytes. However, this term should be used carefully. (Use the term multibyte when appropriate.) For many characters of many languages, double-byte is not enough (this is especially true for UTF8 encoding of Unicode).
Extended Binary Coded Decimal Interchange Code. EBCDIC is a family of encoded character sets used mostly on IBM systems.
An encoded character set is a character set with an associated character encoding scheme.
An encoded character set specifies how a number (or a character code) is assigned to each character of the character set based on a character encoding scheme.
Encoding Method or Encoding scheme. Same as Character Encoding Scheme.
See "Character Encoding Schemes".
Extended UNIX Codes. A common encoding method used on Asian UNIX systems. It combines up to four different encoded character sets in a single data stream.
The new unit of currency used by participating member states of the European Union.
An ordered collection of character glyphs which provides a graphical representation of characters within a character set.
A glyph (font glyph) is a specific instance of a character. A character can have many different glyphs. For example, the first character of the English uppercase alphabet can be printed (or displayed) as the letter A in many different typefaces and styles.
All these different forms are different glyphs, but they represent the same character. (See "character" also.)
A symbol representing an idea. Chinese is an example of an ideographic system.
The process of making software flexible enough to be used in many different linguistic and cultural environments. Internationalization should not be confused with localization, which is the process of preparing software for use in one specific locale.
International Organization for Standardization (often called the International Standards Organization).
A universal character set standard defining the characters of most major scripts used in the modern world. In 1993, ISO adopted Unicode version 1.1 as ISO/IEC 10646-1:1993. ISO/IEC 10646 has two formats: UCS2 is a 2-byte fixed-width format and UCS4 is a 4-byte fixed-width format. There are three levels of implementation, all relating to support for composite characters. Level 1 requires no composite character support, level 2 requires support for specific scripts (including most of the Unicode scripts such as Arabic, Thai, etc.), and level 3 requires unrestricted support for composite characters in all languages.
The 3-letter abbreviation used to denote a local currency, which is based on the ISO 4217 standard. For example, "USD" represents the United States Dollar.
A family of 8-bit encoded character sets. The most common one is ISO 8859-1 (also known as Latin-1), and is used for Western European languages.
Formally known as the ISO 8859-1 character set standard. An 8-bit extension to ASCII which adds 128 characters covering the most common Latin characters used in Western Europe. The Oracle character set name for this is WE8ISO8859P1. See also "ISO 8859".
An index built on a linguistic collation order.
Sorting of strings based on requirements from a locale instead of based on the binary representation of the strings.
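The difference between a binary sort and a linguistic sort can be sketched in Python. The linguistic key below is a deliberately crude stand-in (ignore accents and case first, then break ties by the original string); it is an assumption for illustration and does not reproduce any Oracle linguistic sort definition. The helper names are hypothetical.

```python
import unicodedata

words = ["zebra", "Zürich", "apple", "Äpfel", "Apple"]

def binary_key(word):
    # Binary sort: compare the encoded byte values (UTF-8 assumed here).
    return word.encode("utf-8")

def simplified_linguistic_key(word):
    # Crude approximation: strip combining marks and fold case for the
    # primary comparison, then fall back to the original string.
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), word)

binary_order = sorted(words, key=binary_key)
linguistic_order = sorted(words, key=simplified_linguistic_key)

print("binary:    ", binary_order)
print("linguistic:", linguistic_order)
```

In the binary order, all uppercase ASCII letters sort before lowercase ones and accented letters sort last, which is rarely what a user of a given locale expects.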
The currency symbol used in a country or region. For example, "$" represents the United States Dollar.
A collection of information regarding the linguistic and cultural preferences from a particular region. Typically, a locale consists of language, territory, character set, linguistic, and calendar information defined in NLS data files.
The process of providing language- or culture-specific information for software systems. Translation of an application's user interface would be an example of localization. Localization should not be confused with internationalization, which is the process of generalizing software so it can handle many different linguistic and cultural conventions.
Support for only one language.
Multi-byte (or multibyte or multi byte) means two or more bytes.
When we assign character codes to all characters for a specific language (or a group of languages), one byte (8 bits) can represent 256 different characters. Two bytes (16 bits) can represent up to 65,536 different characters. However, two bytes are still not enough to represent all the characters for many languages. We use 3 bytes or 4 bytes for those characters.
One example is UTF8 encoding of Unicode. In UTF8, there are a lot of 2-byte and 3-byte characters.
Another example is the Traditional Chinese language used in Taiwan. It has more than 80,000 different characters. Some character encoding schemes used in Taiwan use 4 bytes for some of those characters.
A multibyte character is a character whose character code consists of two or more bytes under a certain character encoding scheme. Note that the same character may have a different character code under a different character encoding scheme. Without knowing which character encoding scheme we are using, we cannot tell which character is a multibyte character. For example, Japanese Hankaku-Katakana (half-width Katakana) characters are one byte in the JA16SJIS encoded character set, two bytes in JA16EUC, and three bytes in UTF8. See "single-byte character" also.
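The Hankaku-Katakana example can be reproduced with a short Python sketch. It assumes that Python's "shift_jis", "euc_jp", and "utf-8" codecs behave like the Oracle character sets JA16SJIS, JA16EUC, and UTF8 for this character; they are stand-ins, not the Oracle conversion tables.

```python
# HALFWIDTH KATAKANA LETTER A, one of the Hankaku-Katakana characters.
hankaku_a = "\uff71"

# Count the bytes in the character code under each encoding scheme.
byte_lengths = {
    codec: len(hankaku_a.encode(codec))
    for codec in ("shift_jis", "euc_jp", "utf-8")
}

for codec, length in byte_lengths.items():
    print(f"{codec}: {length} byte(s)")
```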
A multibyte character string is a character string which consists of one of the below.
(The character string is called "null string" in this case.)
Theoretically, we can exclude single-byte character strings (character strings including only single-byte characters) from the list above. However, it's probably more convenient for software to handle single-byte character strings as one type of multibyte character strings.
An alternate character set from the database character set that can be specified for NCHAR, NVARCHAR2, and NCLOB columns. NCHAR character sets, unlike the database character set, can support fixed-width multibyte character sets. Care must be taken when selecting an NCHAR character set, since its character repertoire must be included in the database character set as well.
Net8 enables two or more computers that run the Oracle server to exchange data through a third-party network. It is independent of the communications protocol.
National Language Support. NLS allows users to interact with the database in their native languages. It also allows applications to run in different linguistic and cultural environments.
A general phrase referring to the contents in many files with .nlb suffixes. These files contain data that the NLSRTL library uses to provide specific NLS support.
National Language Support Run-Time Library. This library is responsible for providing locale-independent algorithms for internationalization. The locale-specific information (i.e., NLSDATA) is read by the NLSRTL library during run-time.
A character used during character conversion when the desired character is not available in the target character set. For example, "?" is often used as Oracle's default replacement character.
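A conversion that falls back to a replacement character can be sketched in Python. The example uses Python's own codec machinery with `errors="replace"`, which substitutes "?" when encoding; it only imitates the behavior described above and is not Oracle's conversion routine.

```python
# The euro sign does not exist in ISO 8859-1 (Latin-1), so converting this
# text to that character set forces a replacement.
text = "Price: 10 \u20ac"

converted = text.encode("latin-1", errors="replace")
print(converted)

# The replacement loses information: decoding does not restore the euro sign.
round_trip = converted.decode("latin-1")
print(round_trip == text)
```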
Multilingual support which is restricted to a group of related languages, rather than all languages. Similar language families, such as Western European languages, can be represented with, for example, ISO 8859-1. In this case, however, Thai could not be added.
Now called Net8. Net8 enables two or more computers that run the Oracle server to exchange data through a third-party network. It is independent of the communications protocol.
A collection of related graphic symbols used in a writing system. Some scripts are used to represent multiple languages, and some languages use multiple scripts. Examples of scripts include Latin, Arabic, and Han.
The character set used by the database server.
Single-byte (or singlebyte or single byte) means one byte. One byte usually consists of 8 bits. When we assign character codes to all characters for a specific language, one byte (8 bits) can represent 256 different characters.
A single-byte character is a character whose character code consists of one byte under a certain character encoding scheme. Note that the same character may have a different character code under a different character encoding scheme. Without knowing which character encoding scheme we are using, we cannot tell which character is a single-byte character. For example, the euro currency symbol is one byte in the WE8MSWIN1252 encoded character set, two bytes in UCS2, and three bytes in UTF8. See "multibyte character" also.
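Whether a character is single-byte therefore has to be asked per encoding scheme. The Python sketch below does this for the euro currency symbol, assuming that the "cp1252", "utf-16-be", and "utf-8" codecs correspond to WE8MSWIN1252, UCS2, and UTF8 for this character. The helper name `is_single_byte` is hypothetical.

```python
def is_single_byte(ch, codec):
    """True if the character code of ch is exactly one byte under codec."""
    return len(ch.encode(codec)) == 1

euro = "\u20ac"  # EURO SIGN

for codec in ("cp1252", "utf-16-be", "utf-8"):
    length = len(euro.encode(codec))
    print(f"{codec}: {length} byte(s), single-byte: {is_single_byte(euro, codec)}")
```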
A single-byte character string is a character string which consists of one of the below.
(The character string is called "null string" in this case.)
UCS stands for "Universal Multiple-Octet Coded Character Set". It is a 1993 ISO and IEC standard character set. See "UCS2".
Fixed-width 16-bit Unicode. Each character occupies 16 bits of storage. The Latin-1 characters are the first 256 code points in this standard, so it can be viewed as a 16-bit extension of Latin-1.
Fixed-width 32-bit Unicode. Each character occupies 32 bits of storage. The UCS2 characters are the first 65,536 code points in this standard, so it can be viewed as a 32-bit extension of UCS2. This is also sometimes referred to as ISO-10646. ISO-10646 is a standard that specifies up to 2,147,483,648 characters in 32768 planes, of which the first plane is the UCS2 set. The ISO standard also specifies transformations between different encodings.
Unicode is a type of universal character set, a collection of 64K characters encoded in a 16-bit space. It encodes nearly every character in just about every existing character set standard, covering most written scripts used in the world. It is owned and defined by Unicode Inc. Unicode is a canonical encoding, which means its values can be passed around in different locales. However, it does not guarantee a round-trip conversion between it and every Oracle character set without information loss.
A 16-bit binary value that can represent a unit of encoded text for processing and interchange. Every point between U+0000 and U+FFFF is a code point. The term is interchangeable with code element, code position, and code value.
The following shows how different Unicode-related character sets relate to one another in terms of character code value ranges:
Being able to use as many languages as desired. A universal character set, such as Unicode, helps to provide unrestricted multilingual support because it supports a very large character repertoire, encompassing most modern languages of the world.
A variable-width encoding of UCS2 which uses sequences of 1, 2, or 3 bytes per character. Characters from 0-127 (the 7-bit ASCII characters) are encoded with one byte, characters from 128-2047 require two bytes, and characters from 2048-65535 require three bytes. The Oracle character set name for this is UTF8 (for the Unicode 2.1 standard). The standard has left room for expansion to support the UCS4 characters with sequences of 4, 5, and 6 bytes per character.
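The byte-length ranges above translate directly into a small function. The following Python sketch (the name `utf8_length` is hypothetical) covers only the UCS2 range described here and compares its answer with Python's own UTF-8 encoder for a few sample characters.

```python
def utf8_length(code_point):
    """Number of UTF-8 bytes for a code point in the range 0-65535."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("only code points 0-65535 are handled in this sketch")
    if code_point <= 127:      # 7-bit ASCII characters
        return 1
    if code_point <= 2047:     # 128-2047
        return 2
    return 3                   # 2048-65535

# Compare with Python's encoder for a few non-surrogate code points.
for cp in (0x41, 0xE9, 0x20AC):
    actual = len(chr(cp).encode("utf-8"))
    print(f"U+{cp:04X}: predicted {utf8_length(cp)}, actual {actual}")
```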
An extension to UCS2 that allows for pairs of UCS2 code points to represent extended characters from the UCS4 set. UCS2 has ranges of code points allocated for high (leading) and low (trailing) surrogates that support UTF16 encodings.
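How a pair of surrogates represents one extended character can be sketched as follows. The arithmetic (subtract 0x10000, then split the result into two 10-bit halves added to 0xD800 and 0xDC00) comes from the Unicode definition of UTF-16 and is not spelled out in this manual; the function name `to_surrogate_pair` is hypothetical.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the surrogate-encoded range")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # leading surrogate: top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # trailing surrogate: bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")

# Cross-check against Python's UTF-16 encoder (big-endian, no byte order mark).
encoded = chr(0x1F600).encode("utf-16-be")
print(encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big"))
```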
A fixed-width character format that is well-suited for extensive text processing because it allows for data to be processed in consistent fixed-width chunks. Wide characters are intended for supporting internal character processing, and are therefore implementation-dependent.
Copyright © 1996-2000, Oracle Corporation. All Rights Reserved.