Oracle8i interMedia Text Reference Release 2 (8.1.6) Part Number A77063-01
Indexing, 5 of 11
Use the lexer preference to specify the language of the text to be indexed. To create a lexer preference, you must use one of the following lexer objects: BASIC_LEXER, MULTI_LEXER, CHINESE_VGRAM_LEXER, JAPANESE_VGRAM_LEXER, or KOREAN_LEXER.
Use the BASIC_LEXER object to identify tokens for creating Text indexes for English and all other supported single-byte languages.
The BASIC_LEXER is also used to enable base-letter conversion, composite word indexing, case-sensitive indexing and alternate spelling for single-byte languages that have extended character sets.
In English, you can use the BASIC_LEXER to enable theme indexing.
The BASIC_LEXER supports all single-byte character sets plus UTF8.
BASIC_LEXER has the following attributes:
For continuation, specify the characters that indicate a word continues on the next line and should be indexed as a single token. The most common continuation characters are hyphen '-' and backslash '\'.
For numgroup, specify a single character that, when it appears in a string of digits, indicates that the digits are groupings within a larger single unit.
For example, comma ',' might be defined as the numgroup character because it often indicates a grouping of thousands when it appears in a string of digits.
For numjoin, specify the characters that, when they appear in a string of digits, cause Oracle to index the string of digits as a single unit or word.
For example, period '.' can be defined as a numjoin character because it often serves as a decimal point when it appears in a string of digits.
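The two numeric attributes described above can be set together. The following is a minimal sketch, not a definitive configuration; the preference name num_lexer is hypothetical and chosen only for illustration.

```sql
begin
  -- 'num_lexer' is a hypothetical preference name used only for this sketch.
  ctx_ddl.create_preference('num_lexer', 'BASIC_LEXER');
  -- Treat comma as the grouping character inside strings of digits.
  ctx_ddl.set_attribute('num_lexer', 'numgroup', ',');
  -- Treat period as a joining character (decimal point) inside strings of digits.
  ctx_ddl.set_attribute('num_lexer', 'numjoin', '.');
end;
/
```

With these settings, and going by the attribute descriptions above, a string of digits such as 1,234.56 would be expected to be indexed as a single unit.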
For printjoins, specify the non-alphanumeric characters that, when they appear anywhere in a word (beginning, middle, or end), are processed as alphanumeric and included with the token in the Text index. This includes printjoins that occur consecutively.
For example, if the hyphen '-' and underscore '_' characters are defined as printjoins, terms such as pseudo-intellectual and _file_ are stored in the Text index as pseudo-intellectual and _file_.
For punctuations, specify the non-alphanumeric characters that, when they appear at the end of a word, indicate the end of a sentence. The defaults are period '.', question mark '?', and exclamation point '!'.
Characters that are defined as punctuations are removed from a token before text indexing; however, if a punctuations character is also defined as a printjoins character, the character is removed only when it is the last character in the token.
For example, if the period (.) is defined as both a printjoins and a punctuations character, the following transformations take place during both indexing and querying:
Token | Indexed Token |
---|---|
.doc | .doc |
dog.doc | dog.doc |
dog..doc | dog..doc |
dog. | dog |
dog... | dog.. |
In addition, BASIC_LEXER uses punctuations characters in conjunction with newline and whitespace characters to determine sentence and paragraph delimiters for sentence/paragraph searching.
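A lexer preference in which the period is both a printjoins and a punctuations character, as in the example above, might be set up as follows. This is a sketch under stated assumptions: the preference name dot_lexer is hypothetical, and the punctuations value simply restates the documented defaults.

```sql
begin
  -- 'dot_lexer' is a hypothetical preference name used only for this sketch.
  ctx_ddl.create_preference('dot_lexer', 'BASIC_LEXER');
  -- Keep periods that appear inside or at the start of a token (for example, dog.doc).
  ctx_ddl.set_attribute('dot_lexer', 'printjoins', '.');
  -- Period, question mark, and exclamation point mark the end of a sentence.
  ctx_ddl.set_attribute('dot_lexer', 'punctuations', '.?!');
end;
/
```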
For skipjoins, specify the non-alphanumeric characters that, when they appear within a word, identify the word as a single token; however, the characters are not stored with the token in the Text index.
For example, if the hyphen character '-' is defined as a skipjoins character, the word pseudo-intellectual is stored in the Text index as pseudointellectual.
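The hyphen example above could be configured as in the following sketch. The preference name skip_lexer is hypothetical.

```sql
begin
  -- 'skip_lexer' is a hypothetical preference name used only for this sketch.
  ctx_ddl.create_preference('skip_lexer', 'BASIC_LEXER');
  -- Join the parts of a hyphenated word into one token and drop the hyphen,
  -- so pseudo-intellectual is stored as pseudointellectual.
  ctx_ddl.set_attribute('skip_lexer', 'skipjoins', '-');
end;
/
```

Choosing skipjoins rather than printjoins for the hyphen is a design choice: printjoins keeps the hyphen in the stored token, while skipjoins removes it.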
For startjoins, specify the characters that, when encountered as the first character in a token, explicitly identify the start of the token. The character, as well as any other startjoins characters that immediately follow it, is included in the Text index entry for the token. In addition, the first startjoins character in a string of startjoins characters implicitly ends the previous token.
For endjoins, specify the characters that, when encountered as the last character in a token, explicitly identify the end of the token. The character, as well as any other endjoins characters that immediately follow it, is included in the Text index entry for the token.
The following rules apply to both startjoins and endjoins:
For whitespace, specify the characters that are treated as blank spaces between tokens. BASIC_LEXER uses whitespace characters in conjunction with punctuations and newline characters to identify character strings that serve as sentence delimiters for sentence/paragraph searching.
The predefined, default values for whitespace are 'space' and 'tab'; these values cannot be changed. Specifying characters as whitespace characters adds to these defaults.
For newline, specify the characters that indicate the end of a line of text. BASIC_LEXER uses newline characters in conjunction with punctuations and whitespace characters to identify character strings that serve as paragraph delimiters for sentence/paragraph searching.
The only valid values for newline are NEWLINE and CARRIAGE_RETURN (for carriage returns). The default is NEWLINE.
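As an illustration of the whitespace and newline settings, the sketch below adds one extra whitespace character and switches the newline value. The preference name ws_lexer and the choice of the vertical bar as an additional whitespace character are assumptions made only for this example.

```sql
begin
  -- 'ws_lexer' is a hypothetical preference name used only for this sketch.
  ctx_ddl.create_preference('ws_lexer', 'BASIC_LEXER');
  -- Treat the vertical bar as whitespace in addition to the fixed
  -- defaults (space and tab).
  ctx_ddl.set_attribute('ws_lexer', 'whitespace', '|');
  -- Use carriage returns, rather than the default NEWLINE, as the end-of-line marker.
  ctx_ddl.set_attribute('ws_lexer', 'newline', 'CARRIAGE_RETURN');
end;
/
```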
For base_letter, specify whether characters that have diacritical marks (umlauts, cedillas, acute accents, etc.) are converted to their base form before being stored in the Text index. The default is NO (base-letter conversion disabled).
For mixed_case, specify whether the lexer stores the tokens in Text index entries exactly as they appear in the text (YES) or converts them to all uppercase (NO). The default is NO (tokens converted to all uppercase).
For composite, specify whether composite word indexing is disabled or enabled for either GERMAN or DUTCH text. The default is DEFAULT (composite word indexing disabled).
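Base-letter conversion and case-sensitive indexing are both switched on through BASIC_LEXER attributes (base_letter and mixed_case). The following sketch enables the two together; the preference name accent_lexer is hypothetical.

```sql
begin
  -- 'accent_lexer' is a hypothetical preference name used only for this sketch.
  ctx_ddl.create_preference('accent_lexer', 'BASIC_LEXER');
  -- Convert characters with diacritical marks to their base form before indexing.
  ctx_ddl.set_attribute('accent_lexer', 'base_letter', 'YES');
  -- Store tokens exactly as they appear instead of converting them to uppercase.
  ctx_ddl.set_attribute('accent_lexer', 'mixed_case', 'YES');
end;
/
```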
For index_themes, specify YES to index theme information in English. This makes ABOUT queries more precise. The index_themes and index_text attributes cannot both be NO.
If you use the BASIC_LEXER and specify no value for index_themes, this attribute defaults to NO.
For theme_language, specify which knowledge base to use for theme generation when index_themes is set to YES. When index_themes is NO, setting this attribute has no effect. The default is AUTO, which instructs the system to set this attribute according to the language of the environment.
For index_text, specify YES to index word information. The index_themes and index_text attributes cannot both be NO.
The default is YES.
For alternate_spelling, specify either GERMAN, DANISH, or SWEDISH to enable alternate spelling in one of these languages. By default, alternate spelling is enabled in all three languages. You can specify NONE for no alternate spelling.
See Also: For more information about the alternate spelling conventions Oracle uses, see Appendix F, "Alternate Spelling Conventions".
The following example sets printjoins characters and disables theme indexing with the BASIC_LEXER:

    begin
      ctx_ddl.create_preference('mylex', 'BASIC_LEXER');
      -- Keep underscore and hyphen as part of the indexed token.
      ctx_ddl.set_attribute('mylex', 'printjoins', '_-');
      -- Index words only, not themes.
      ctx_ddl.set_attribute('mylex', 'index_themes', 'NO');
      ctx_ddl.set_attribute('mylex', 'index_text', 'YES');
    end;

To create the index with no theme-indexing and with printjoins characters set as above, issue the following statement:

    create index myindex on mytable ( docs )
      indextype is ctxsys.context
      parameters ( 'LEXER mylex' );

Use the MULTI_LEXER object to index text columns that contain documents of different languages. For example, you can use this lexer to index a text column that stores English, German, and Japanese documents.
This lexer has no attributes.
You create a multi-lexer preference with CTX_DDL.CREATE_PREFERENCE and then add language-specific lexers to the multi-lexer preference with CTX_DDL.ADD_SUB_LEXER. To index a multi-language table, the base table must also have a language column, which you specify when you create the index with CREATE INDEX. The following example shows these steps.
Create the multi-language table with a primary key, a text column, and a language column as follows:

    create table globaldoc (
      doc_id number primary key,
      lang   varchar2(3),
      text   clob
    );

Assume that the table holds mostly English documents, with the occasional German or Japanese document. To handle the three languages, you must create three sub-lexers, one for English, one for German, and one for Japanese:

    begin
      -- English sub-lexer with theme indexing.
      ctx_ddl.create_preference('english_lexer','basic_lexer');
      ctx_ddl.set_attribute('english_lexer','index_themes','yes');
      ctx_ddl.set_attribute('english_lexer','theme_language','english');
      -- German sub-lexer with composite words, mixed case, and alternate spelling.
      ctx_ddl.create_preference('german_lexer','basic_lexer');
      ctx_ddl.set_attribute('german_lexer','composite','german');
      ctx_ddl.set_attribute('german_lexer','mixed_case','yes');
      ctx_ddl.set_attribute('german_lexer','alternate_spelling','german');
      -- Japanese sub-lexer.
      ctx_ddl.create_preference('japanese_lexer','japanese_vgram_lexer');
    end;

Create the multi-lexer preference:

    begin
      ctx_ddl.create_preference('global_lexer', 'multi_lexer');
    end;

Since the stored documents are mostly English, make the English lexer the default using CTX_DDL.ADD_SUB_LEXER:

    begin
      ctx_ddl.add_sub_lexer('global_lexer','default','english_lexer');
    end;

Now add the German and Japanese lexers for their respective languages with CTX_DDL.ADD_SUB_LEXER. Assume also that the language column holds ISO 639-2 language codes, so add those codes (ger and jpn) as alternate values:

    begin
      ctx_ddl.add_sub_lexer('global_lexer','german','german_lexer','ger');
      ctx_ddl.add_sub_lexer('global_lexer','japanese','japanese_lexer','jpn');
    end;

Now create the index globalx, specifying the multi-lexer preference and the language column in the parameter string as follows:

    create index globalx on globaldoc(text)
      indextype is ctxsys.context
      parameters ('lexer global_lexer language column lang');

At query time, the multi-lexer examines the language setting and uses the sub-lexer preference for that language to parse the query. If the language is not set, then the default lexer is used.
In all other respects, the query is parsed and run as usual. Since the index contains tokens from multiple languages, such a query can return documents in several languages. To limit your query to a given language, use a structured clause on the language column.
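A structured clause on the language column might look like the following sketch, which uses the globaldoc table and globalx index from the example above. The query term 'Auto' is hypothetical, and 'ger' is assumed to be the value stored in the lang column for German documents.

```sql
-- Return only German documents that match the query term.
-- 'Auto' is a hypothetical search term chosen for illustration.
select doc_id
  from globaldoc
 where contains(text, 'Auto') > 0
   and lang = 'ger';
```

Without the condition on lang, the same CONTAINS predicate could also return matching documents in the other indexed languages.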
The CHINESE_VGRAM_LEXER object identifies tokens in Chinese text for creating Text indexes. It has no attributes.
You can use this lexer if your database character set is one of the following:
The JAPANESE_VGRAM_LEXER object identifies tokens in Japanese for creating Text indexes. It has no attributes.
You can use this lexer if your database character set is one of the following:
The KOREAN_LEXER object identifies tokens in Korean text for creating Text indexes.
You can use this lexer if your database character set is one of the following:
When you use the KOREAN_LEXER, specify the following boolean attributes:
Sentence and paragraph sections are not supported with the Korean lexer.
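A Korean lexer preference is created and used in the same way as the other lexer preferences. The sketch below sets no lexer attributes and assumes a hypothetical table korean_docs with a text column named text; the preference and index names are also hypothetical.

```sql
begin
  -- 'korean_lex' is a hypothetical preference name used only for this sketch.
  ctx_ddl.create_preference('korean_lex', 'KOREAN_LEXER');
end;
/

-- 'korean_docs' and its 'text' column are assumed to exist for this example.
create index koreanx on korean_docs (text)
  indextype is ctxsys.context
  parameters ('LEXER korean_lex');
```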
Copyright © 1996-2000, Oracle Corporation. All Rights Reserved.