NAME

dtsrindex - Load inverted index for document objects

SYNOPSIS

dtsrindex -ddbname [-tetxstr] [-h0 | -hhashsz] [-rrecdots] [-bbatchsz] [-ccachesz] [-iinbufsz] file

DESCRIPTION

dtsrindex is the second of a pair of programs that load a database with document data from an input fzk file. dtsrload loads document header information and, optionally, the documents themselves. dtsrindex parses words from document text and loads them into the inverted index files. Word parsing is performed in the language and linguistic codeset specified for the database. The inverted index contains the search terms used for subsequent online queries.

An fzk file can be generated by dtsrhan, manually with a text editor, or by a special application program created for the purpose. Typically the same fzk file is used for dtsrload and dtsrindex. However, this is not required, and there are situations where it may not be desirable. If the same fzk file is not used by both programs, the one used for dtsrindex must represent the same objects in the same order. Only the unique key line and the text portions of the file are used by this program. (See dtsrfzkfiles(4) for information about DtSearch fzk files.)

A document's unique key in the fzk file must already exist in the database (that is, dtsrload must be executed before dtsrindex). If any words are already indexed for the unique document key, indicating that dtsrload "updated" the document, then the newly parsed words from the current fzk file will completely replace the previously indexed words.

When duplicate record ids are encountered in a single fzk file, only the first occurrence of the document is indexed into the database; later occurrences are discarded. Since this is exactly the same discard order as dtsrload, the same fzk file can be used for both programs. Record ids already seen are tracked during execution in a hash table.
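The first-occurrence rule can be illustrated with a small shell sketch. This is not how dtsrindex is implemented; it only shows the idea of remembering seen ids in a hash table (here, an awk associative array). The function name first_occurrences and the one-id-per-line input are assumptions made for the illustration, not part of the fzk format.

```shell
# Hypothetical helper: read one record id per line on stdin and print
# each id only the first time it appears. Later duplicates are dropped.
first_occurrences() {
    # seen[] is an awk associative array (a hash table keyed by id).
    # The pattern is true only when the id has not been counted yet.
    awk '!seen[$0]++'
}
```

For example, feeding the ids a, b, a, c, b (one per line) through first_occurrences leaves a, b, c in their original order.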

dtsrindex performs two passes. In the first pass, dtsrindex constructs an inverted index in memory of all the words it parses from the fzk file. Since the index is built in memory, it is possible to run out of memory for very large fzk files. For this reason very large fzk files are processed in batches. Execution time in the first pass depends on the size of the fzk file.

In the second pass, dtsrindex merges the information in the memory index into the database's disk inverted index. Execution time in the second pass depends on both the size of the incoming fzk file and the overall size of the database.
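The in-memory index built in the first pass can be pictured as a map from each word to the documents that contain it. The following is a toy sketch only. It assumes a simplified input of one document per line (a hypothetical record id followed by its words), not the real fzk format, and it does not model batching, language-specific word parsing, or the second-pass merge into the database files. The function name build_index is invented for the illustration.

```shell
# Hypothetical helper: build a tiny inverted index from stdin.
# Input lines:  <record-id> <word> <word> ...
# Output lines: <word>: <record-id> <record-id> ...   (sorted by word)
build_index() {
    awk '{
        for (i = 2; i <= NF; i++) {
            key = $i SUBSEP $1
            if (key in seen) continue      # word already listed for this document
            seen[key] = 1
            if ($i in postings)
                postings[$i] = postings[$i] " " $1
            else
                postings[$i] = $1
        }
    }
    END {
        # Array traversal order is unspecified in awk, so sort afterwards.
        for (w in postings) print w ": " postings[w]
    }' | sort
}
```

The larger the input, the larger the postings array grows, which is the reason the real program flushes its memory index to disk in batches.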

If dtsrindex is interrupted in the first pass, it can be reexecuted without database damage. However, if it is interrupted in the second pass, the database will be corrupted. Database backups are always recommended.

Caution:

To prevent database corruption, execute dtsrindex only after all users of a preexisting database have exited their search programs. For a single fzk file, dtsrload must be executed immediately before dtsrindex so that dtsrindex can map the words it indexes to the correct internal database addresses. Only after both programs successfully complete execution may users again be allowed to perform online searches of the database.
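The required ordering can be captured in a small wrapper. This is a minimal sketch, assuming that dtsrload accepts the same -ddbname option and file operand as dtsrindex (see dtsrload(1)) and that it reports failure with a nonzero exit status. The function name load_and_index is hypothetical. The sketch is deliberately conservative: indexing is skipped if dtsrload returns any nonzero status.

```shell
# Hypothetical helper: load, then immediately index, the same fzk file.
# Usage: load_and_index dbname file
load_and_index() {
    db=$1
    fzk=$2
    # dtsrload must run first so the document keys exist in the database.
    # Stop here, passing its status back, if it reports any problem.
    dtsrload "-d$db" "$fzk" || return $?
    # Option values are appended directly to the option name, no space.
    dtsrindex "-d$db" "$fzk"
}
```

Calling load_and_index mydb batch1 would run dtsrload -dmydb batch1 followed by dtsrindex -dmydb batch1. Keeping users out of the database while both programs run remains the operator's responsibility.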

OPTIONS

The following options are available:

Note:

If an option takes a value, the value must be directly appended to the option name without white space.

-ddbname: Specifies the 1 to 8 ASCII character name of the database to be updated. If an optional directory path is not prepended to the database name, dtsrindex will attempt to open the database from the current working directory. File name extensions for database files are automatically appended.
-tetxstr: Specifies the end of document text delimiter string. The default document separator in an fzk file is an ASCII form feed character followed by an ASCII line feed ('\f\n'). For certain multibyte languages it may be more convenient to specify a non-ASCII string as the document delimiter.
-h0: Instructs dtsrindex to not check for duplicate record ids. This option should not be specified unless it is certain that there are no duplicate ids in the fzk file.
-hhashsz: Sets the duplicate record id hash table size to hashsz. The default is 3000. dtsrindex will execute more efficiently if the specified table size is larger than the number of documents in the fzk file.
-rrecdots: Instructs dtsrindex to print a progress character to stdout for every recdots documents processed during the first pass. The default is 20.
-bbatchsz: Sets the batch size to batchsz. The default is 10000. The batch size is the maximum number of records processed in Pass 1 before the in-memory index is copied to disk in Pass 2. Larger batch sizes significantly improve execution time in Pass 2, but require exponentially larger amounts of memory. The default batch size has been optimized for moderately fast machines with large amounts of memory.
-ccachesz: Sets the number of 1024 byte cache pages used by the DtSearch Database Management System to cachesz. The default is 64. The cache size affects memory paging performance for word b-trees. cachesz should be greater than or equal to 16, in even powers of 2. The default is usually sufficient.
-iinbufsz: Sets the size of the input line buffer to inbufsz. The default is 1024 bytes. This buffer is used only for reading the four ASCII header lines for each document in an fzk file. (The text portion of each document is parsed on the fly a word at a time.) Increasing inbufsz may be appropriate for very large abstracts, but the default is sufficient in most cases.

OPERANDS

The required input file name (file) identifies the file to be processed by dtsrindex. It can optionally include a path prefix, either from root or relative to the current working directory. If a file name extension is not specified, dtsrindex assumes a default extension of .fzk.

ENVIRONMENT VARIABLES

None.

RESOURCES

None.

ACTIONS/MESSAGES

None.

RETURN VALUES

The return values are as follows:

0: dtsrindex completed successfully.
1: dtsrindex successfully recovered from an error. This occurs when one or more documents were discarded because of a partially invalid fzk file format, duplicate record ids, or empty record text.
>1: dtsrindex encountered a fatal error.
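
A calling script can branch on these values. The sketch below maps an exit status to a short description; the function name describe_status and the message wording are illustrative assumptions, and only the three status classes come from the list above.

```shell
# Hypothetical helper: describe a dtsrindex exit status.
# Typical use, right after the command has run:
#     dtsrindex -dmydb batch1
#     describe_status $?
describe_status() {
    case $1 in
        0) echo "success" ;;
        1) echo "completed, but some documents were discarded" ;;
        *) echo "fatal error" ;;
    esac
}
```

A status of 1 is worth logging rather than ignoring, since it means at least one document in the fzk file was not indexed.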

FILES

dtsrindex reads the specified fzk file and opens all the database and related language files for the specified database name.

dtsrindex updates the following database files:

dbname.d21
dbname.d22
dbname.d23
dbname.k21
dbname.k22
dbname.k23
dbname.d99

EXAMPLES

Index all words in the fzk file named batch1.fzk in the current working directory into database mydb.

dtsrindex -dmydb batch1

Load database mydb with the documents specified in the fzk file /u/dtsearch/jpndocs.1. Three ASCII plus signs at the bottom of each document signal the end of document text and the beginning of the next fzk file record.

dtsrindex -dmydb -t+++ /u/dtsearch/jpndocs.1