NAME

dtsrhanfile — Describes the format and syntax of DtSearch han files

SYNOPSIS

filename.han

DESCRIPTION

Han files are the user-generated profile files for dtsrhan. They identify fields in incoming text from which output fzk file fields can be constructed. The data from han files is loaded into memory by dtsrhan at initialization time. dtsrhan and han files have not been internationalized; han files may contain only ASCII characters.

General Format

All identifiers must begin with a letter, and must be composed entirely of alphanumerics and/or the underscore.

Observe the following points when using "strings":

•: If an identifying string contains quotes, use a backslash to create the quote. Example:

"this string \"contains\" quotes"

: would find the string this string "contains" quotes.
•: The above point makes it necessary to use double backslashes to create a single backslash. Example:

"this string has a \\ backslash"

: would find the string this string has a \ backslash.
•: More generally, a backslash in any string causes the next character to be included without exception. Thus, the string "this is \a test" ends up as this is a test. The backslash is dropped, and the next character is embedded in the string. Escaping is only needed in the two cases described above, but can be used for any character.
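
The escaping rule described above can be sketched in a few lines of Python. This is only an illustration of the stated behavior, not the dtsrhan parser: the function name unescape_han_string is hypothetical, the input is assumed to be the text between the enclosing double quotes, and leaving a trailing lone backslash unchanged is an assumption, since the format description does not cover that case.

```python
def unescape_han_string(raw: str) -> str:
    """Resolve backslash escapes in the text between the double quotes.

    Hypothetical helper: a backslash is dropped and the character that
    follows it is kept literally, whatever that character is.
    """
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        if ch == "\\" and i + 1 < len(raw):
            # Drop the backslash, keep the next character as-is.
            out.append(raw[i + 1])
            i += 2
        else:
            # Ordinary character, or a trailing lone backslash
            # (kept unchanged here; this case is an assumption).
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    # Raw literals keep the backslashes exactly as typed in a han file.
    print(unescape_han_string(r'this string \"contains\" quotes'))
    print(unescape_han_string(r'this string has a \\ backslash'))
    print(unescape_han_string(r'this is \a test'))
```

Because every escaped character is handled the same way, the quote and backslash cases need no special treatment in the sketch.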

Individual Line Syntax

# ... | blank line: Han file comment. Any line beginning with a pound sign in the first column, or any blank line, is discarded.
line identifier = physical_line_number: Defines a line with a physical line number in the record. physical_line_number must be a number.
line identifier = column_number, "string": Defines a line using a column number and a 'signature' string that should appear at that column. column_number can be a number, or * for 'any column'. "string" should be a string that occurs on the line in question. It is possible to define complex signatures using multiple clauses.
field identifier = line_identifier, "string", offset, length: Defines a field based on a declared line, a string found on that line, the offset from the first letter of the string, and the length of the field.
: line_identifier is an identifier declared with the line directive (see above).
: "string" is a string for relative positioning, where a field will follow a string that may not always occur in the same position on a line. If it is known that the field will always be in the same position, an empty string ("") may be used. string must be enclosed in double quotes.
: offset must be a number, identifying the offset from the first character in the string. It starts at position 1, not 0, and may be negative.
: length represents the length of the field. It may be a number, or it may be one of two special tokens:

eow: End of word. The field will begin at offset and continue until the next white-space character.
eoln: End of line. The field will begin at offset and continue to the end of the line.

: An identifier string beginning with 3 uppercase M's ("MMM...") will be considered an English month name string. At run time, if the first 3 chars of the field's value equal the first three chars of an English month name, the value string will be translated to a two character string of digits in the range "01" to "12". For example, if field MMMmymonth had an original value of "April ", it will be translated to "04" before use.
: In the case where a line identifier is associated with multiple lines in a single document, the field value will be determined from the last occurrence of the line within the record.
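
The field rules above (relative positioning, 1-based offset, the eow and eoln lengths, and "MMM" month translation) can be illustrated with a short Python sketch. It is not dtsrhan's implementation. The names extract_field and translate_month are hypothetical, only offsets of 1 or more are modelled because the exact arithmetic for negative offsets is not spelled out here, and comparing month prefixes case-insensitively and returning unmatched values unchanged are both assumptions.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def extract_field(line, anchor, offset, length):
    """Cut a field out of `line` relative to the string `anchor`.

    `offset` is 1-based from the first character of `anchor`;
    `length` is a number, "eow" or "eoln". Returns None when the
    anchor string does not occur on the line.
    """
    if offset < 1:
        # Negative offsets are legal in han files but not modelled here.
        raise ValueError("this sketch only handles offsets >= 1")
    pos = line.find(anchor)  # an empty anchor matches at index 0
    if pos < 0:
        return None
    rest = line[pos + offset - 1:]
    if length == "eoln":
        return rest
    if length == "eow":
        end = 0
        while end < len(rest) and not rest[end].isspace():
            end += 1
        return rest[:end]
    return rest[:int(length)]


def translate_month(identifier, value):
    """Map an English month name to "01".."12" for "MMM" identifiers."""
    if not identifier.startswith("MMM"):
        return value
    prefix = value[:3].lower()  # case-insensitive match is an assumption
    for number, name in enumerate(MONTHS, start=1):
        if prefix == name.lower():
            return f"{number:02d}"
    return value  # unmatched values pass through (assumption)


if __name__ == "__main__":
    line = "Date: April 17, 1994"
    month = extract_field(line, "Date: ", 7, "eow")
    print(month, translate_month("MMMmymonth", month))
```

With the sample line, an offset of 7 from the start of "Date: " lands on the first letter of the month name, and eow stops at the following space.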
constant identifier = "string": Defines a constant field that can be used in abstracts and keys. The identifier is defined exactly the same as a field identifier. The value must be enclosed in double quotes.
date = null | field_id [+ field_id] ...: Defines the document date for each document. It will be converted into a correctly formatted fzk file date line.
: null specifies undated documents. Undated documents always qualify for searches irrespective of date qualifiers in DtSearchQuery.
: field_id is an identifier declared using the field or constant directives (see above). "MMM" fields are often useful for date assemblies.
: Multiple fields may be concatenated into a date.
: After concatenation, the assembled date must be of the following format: YYYYMMDDhhmm (exactly 12 digits). For example, 199404171701 is April 17, 1994 at 5:01 pm, and 200405031000 is May 3, 2004 at 10:00 am (10 o'clock).
: Dates before 1900 or after 5995 are invalid.
: If date is not specified or is invalid, a generated date based on the current date and time will be used, but an invalid date will also generate an error message.
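
The date rules above can be summarized in a small Python sketch, assuming the field values have already been extracted as strings. It is an illustration only: the function name assemble_date and its (date, error) return shape are hypothetical, the null case is not modelled, and only the checks stated above are applied (12 digits, year between 1900 and 5995), so calendar validity such as month 13 is not tested.

```python
from datetime import datetime


def assemble_date(parts, now=None):
    """Concatenate field values into a YYYYMMDDhhmm date string.

    Returns (date_string, error_message). When no fields are given or
    the result is invalid, the date is generated from `now`; only the
    invalid case also reports an error message.
    """
    now = now or datetime.now()
    fallback = now.strftime("%Y%m%d%H%M")
    if not parts:
        return fallback, None  # date not specified: no error

    candidate = "".join(parts)
    if len(candidate) != 12 or not (candidate.isascii() and candidate.isdigit()):
        return fallback, f"invalid date {candidate!r}: need exactly 12 digits"

    year = int(candidate[:4])
    if year < 1900 or year > 5995:
        return fallback, f"invalid date {candidate!r}: year out of range"
    return candidate, None


if __name__ == "__main__":
    # Year, translated "MMM" month, day, and time fields joined in order.
    print(assemble_date(["1994", "04", "17", "1701"]))
    print(assemble_date(["1899", "01", "01", "0000"]))
```

Passing a fixed `now` value makes the fallback deterministic, which is convenient when checking a han profile against sample records.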
key = field_id [+ field_id] ... | time | count: Defines the unique database key for each record in a fzk file.
: field_id is a field identifier declared using the field or constant directives.
: Multiple fields may be concatenated into a key.
: time is a special keyword used to generate keys based on the current run date and time, plus a sequential count suffix.
: count is a special keyword used to generate keys based on a sequential count of records.
upper: Specifies that keys written by dtsrhan are converted entirely to upper case. Without this directive, mixed-case keys are allowed.
keychar = A | B | ...Z: Defines the character used to categorize keys for DtSearch. It must be an uppercase ASCII alphabetic character.
delimiter = line_identifier, bottom: Defines the end of text (ETX) delimiter that will separate records.
: line_identifier is an identifier declared with the line directive.
: bottom is required. It specifies that the ETX will occur at the bottom of each record. Top of record delimiters are not supported.
image = all | none: Defines whether the document image retrieved by DtSearchRetrieve is to contain all or none of the record, prior to application of imageinclude or imageexclude directives later in the han file. It defaults to all.
imageinclude = line_identifier [- line_identifier]: Defines a line (or range of lines) to be included in the image. line_identifier is an identifier declared with the line directive.
imageexclude = line_identifier [- line_identifier]: Defines a line (or range of lines) to be excluded from the image. line_identifier is an identifier declared with the line directive.
abstract = field(s) field_identifier [+ field_identifier]...: Defines the abstract to be placed into the fzk file. It is created from the concatenations of fields. field_identifier is an identifier declared with the field directive.
delblanklines = true | false: Determines if blank lines are to be removed from the record image or not. It defaults to false.

Example

The sample han file shown here describes a text file containing a concatenated set of man page documents.