huffcode(user cmd)

huffcode — Create optimized DtSearch compression/decompression tables

huffcode [-llit_thresh | -l-] [-o] huffile [textfile]

huffcode creates optimized DtSearch compression/decompression tables.

Documents stored in a DtSearch database text repository can first be compressed using a Huffman text compression algorithm. The algorithm provides optimal compression only with preanalysis of the statistical distribution of bytes in the database corpus. huffcode analyzes a text corpus and generates DtSearch compression and decompression tables. It is provided as a convenience utility for database developers who want to optimize offline storage requirements. Compression is not used in databases that are created without the ability to store text in a DtSearch repository.
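
The following fragment is only an illustrative sketch of generic Huffman code-length construction, not huffcode or DtSearch source, and its byte frequencies are invented. It shows why the preanalysis matters: the more a byte dominates the corpus statistics, the shorter the bit code it can be assigned.

/*
 * Generic Huffman sketch (illustration only; not the DtSearch
 * implementation).  Builds a Huffman tree over made-up byte counts
 * with simple O(n^2) pair selection and prints each byte's code
 * length, showing how code lengths follow the byte distribution.
 */
#include <stdio.h>

#define NSYM    256            /* one symbol per possible byte value */
#define MAXNODE (2 * NSYM)     /* leaves plus internal nodes */

int main(void)
{
    long weight[MAXNODE] = {0};
    int  parent[MAXNODE] = {0};
    int  alive[MAXNODE]  = {0};
    int  nnodes = NSYM;
    int  i;

    /* Hypothetical frequency counts standing in for a corpus analysis. */
    weight[' '] = 1800; weight['e'] = 1200; weight['t'] = 900;
    weight['a'] = 800;  weight['q'] = 3;    weight['z'] = 1;

    for (i = 0; i < NSYM; i++)
        alive[i] = (weight[i] > 0);     /* only observed bytes get codes */

    /* Repeatedly merge the two lightest live nodes into a new node. */
    for (;;) {
        int lo1 = -1, lo2 = -1;
        for (i = 0; i < nnodes; i++) {
            if (!alive[i])
                continue;
            if (lo1 < 0 || weight[i] < weight[lo1]) {
                lo2 = lo1;
                lo1 = i;
            } else if (lo2 < 0 || weight[i] < weight[lo2]) {
                lo2 = i;
            }
        }
        if (lo2 < 0)                    /* only the root remains */
            break;
        weight[nnodes] = weight[lo1] + weight[lo2];
        alive[nnodes]  = 1;
        alive[lo1] = alive[lo2] = 0;
        parent[lo1] = parent[lo2] = nnodes;
        nnodes++;
    }

    /* A leaf's code length equals its depth in the finished tree. */
    for (i = 0; i < NSYM; i++) {
        int len = 0, n = i;
        if (weight[i] == 0)
            continue;
        while (parent[n] != 0) {        /* parent 0 means "no parent" here */
            len++;
            n = parent[n];
        }
        printf("byte '%c': %ld occurrences -> %d-bit code\n",
               i, weight[i], len);
    }
    return 0;
}

With these sample counts the space character receives a 1-bit code and the rarest bytes receive 5-bit codes, which is the effect that huffcode's corpus analysis is meant to exploit.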

huffcode reads a text file as input and writes out ophuf.huf (the compression, or "encode", table) and ophuf.c (the decompression, or "decode", table). ophuf.huf is an external ASCII file that also retains the statistical information on how it was generated. huffcode can be executed repeatedly against different text samples, continually accumulating results. For a small or static text corpus, the entire corpus can be fed into huffcode for optimal Huffman compression. For large or dynamic databases, the typical practice is to feed huffcode representative text samples.
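
As an illustration only, and not the huffcode implementation or the .huf file format, the following sketch gathers the kind of per-byte statistics described above from a single text sample named on the command line:

/*
 * Byte-frequency preanalysis sketch (not huffcode itself).
 * Counts how often each byte value occurs in one text file and
 * prints the non-zero counts.
 */
#include <stdio.h>

int main(int argc, char **argv)
{
    long counts[256] = {0};
    FILE *fp;
    int  c, i;

    if (argc != 2) {
        fprintf(stderr, "usage: %s textfile\n", argv[0]);
        return 1;
    }
    if ((fp = fopen(argv[1], "rb")) == NULL) {
        perror(argv[1]);
        return 1;
    }
    while ((c = getc(fp)) != EOF)
        counts[c]++;                    /* one counter per byte value */
    fclose(fp);

    for (i = 0; i < 256; i++)
        if (counts[i] > 0)
            printf("byte 0x%02x occurs %ld times\n", (unsigned)i, counts[i]);
    return 0;
}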

The Huffman code tables are created once for each API instance (not once per database) before any documents are loaded. The only program that reads the encode table, an external file, is dtsrload. The ophuf.huf file generated by huffcode should replace the provided default file before the first run of dtsrload for any database to be accessed by a particular API instance. The decode table, a C module, should be compiled and linked into the application code ahead of the API library to override the default decode module in the library. Huf files and decode modules are not user-editable.

It is imperative that the encode and decode tables reflect identical byte statistics to prevent decode errors. The first line of ophuf.huf includes a long integer value named HCTREE_ID. Each execution of huffcode generates a new, unique hctree_id integer. dtsrload loads this integer into the database configuration and status record when it loads the first document into a new database. Thereafter, each execution of dtsrload for that database confirms that the same hctree_id is used for each document compression; dtsrload aborts if the hctree_id in ophuf.huf does not match the value recorded for the database by previous executions.

hctree_id is also stored as a variable in the decode module ophuf.c. DtSearchInit will not open any database listed in the ocf file whose hctree_id, as stored in its configuration and status record, does not match the value in the decode module. The dtsrdbrec utility will print the hctree_id value for any database.
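
The sketch below is not dtsrload or DtSearchInit code; it only illustrates the consistency check described above. It assumes nothing about the .huf layout beyond what is stated here, namely that the first line carries a long integer hctree_id, so it simply scans that line for the first integer and compares it with an expected value supplied on the command line.

/*
 * Illustrative hctree_id check (not DtSearch source).  The exact
 * first-line layout of a .huf file is an assumption; this sketch
 * takes the first integer it finds on that line as the hctree_id.
 */
#include <stdio.h>
#include <stdlib.h>

/* Return the hctree_id found on the first line of hufname, or -1. */
static long read_hctree_id(const char *hufname)
{
    FILE *fp = fopen(hufname, "r");
    char  line[512];
    char *p;
    long  id = -1;

    if (fp == NULL)
        return -1;
    if (fgets(line, sizeof line, fp) != NULL) {
        for (p = line; *p != '\0'; p++) {
            if (*p == '-' || (*p >= '0' && *p <= '9')) {
                id = strtol(p, NULL, 10);
                break;
            }
        }
    }
    fclose(fp);
    return id;
}

int main(int argc, char **argv)
{
    long expected, actual;

    if (argc != 3) {
        fprintf(stderr, "usage: %s huffile.huf expected_id\n", argv[0]);
        return 2;
    }
    expected = strtol(argv[2], NULL, 10);
    actual   = read_hctree_id(argv[1]);
    if (actual != expected) {
        fprintf(stderr, "hctree_id mismatch: file has %ld, expected %ld\n",
                actual, expected);
        return 1;    /* dtsrload would abort in the analogous situation */
    }
    printf("hctree_id %ld matches\n", actual);
    return 0;
}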

The following options are available:

Note:

If an option takes a value, the value must be directly appended to the option name without white space.

-llit_thresh
Sets the literal character's minimum threshold to the integer specified by lit_thresh.
This Huffman algorithm implements a pseudo-character called the literal character. It represents all characters whose frequency is so low that no Huffman translation will be attempted. This reduces the maximum length of the coded bit string when the text corpus contains many zero- or low-frequency bytes. For example, pure ASCII text files only occasionally contain byte values less than 32 (control characters) and rarely contain values greater than 127 (high-order bit turned on). The lit_thresh value specifies the literal character's threshold count. After counting is completed, any character in the encode table occurring with a frequency less than or equal to lit_thresh will be coded with the literal character (see the sketch following this option list).
If this option and the -l- option are both omitted, the default is -l0, meaning that literal coding is provided only for bytes that never occur (counts of zero).
-l-
Disables literal character encoding. Disabling literal character encoding in corpora with unbalanced byte frequency distributions will lead to extremely long bit string codes. Most natural-language text corpora have highly unbalanced frequency distributions, so this option is not recommended for most DtSearch applications.
If this option and the -llit_thresh option are both omitted, the default is -l0, meaning that literal coding is provided only for bytes that never occur (counts of zero).
-o
Suppresses the overwrite prompt. It preauthorizes erasure and reinitialization of the decode module.
textfile
Specifies an optional input file of text that is representative of the entire text corpus of the databases. It should contain bytes in the same relative abundances as they occur in the documents of the entire corpus. Since huffcode can be executed repeatedly with different document textfiles, it is possible to analyze the entire actual corpus if it is small enough or static.
If textfile is not specified, the byte frequencies in the currently loaded tables are not changed, and the Huffman codes are recomputed from the existing frequencies. This is useful for examining the relative merits of different literal character thresholds.
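
The sketch referred to in the -llit_thresh description above is an illustration rather than huffcode source; its counts and threshold are invented. It shows how a lit_thresh value splits the byte counts into bytes that would receive their own Huffman code and bytes that would be folded into the literal character.

/*
 * Literal-threshold sketch (illustration only).  Counts at or below
 * lit_thresh are handled by the literal pseudo-character; counts
 * above it earn a dedicated Huffman code.
 */
#include <stdio.h>

#define NSYM 256

int main(void)
{
    long counts[NSYM] = {0};
    long lit_thresh = 200;              /* as if -l200 had been given */
    int  coded = 0, literal = 0, i;

    /* Made-up counts for a few byte values; most others stay at zero. */
    counts[' '] = 7000; counts['e'] = 5000; counts['q'] = 150;
    counts['\t'] = 12;

    for (i = 0; i < NSYM; i++) {
        if (counts[i] > lit_thresh)
            coded++;                    /* gets its own Huffman code */
        else
            literal++;                  /* encoded via the literal character */
    }
    printf("%d bytes coded directly, %d handled by the literal character\n",
           coded, literal);
    return 0;
}

With the default of -l0, only bytes whose counts remain zero after the analysis would fall into the literal group.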

The required huffile argument is the base file name of the encode table, excluding the .huf extension. dtsrload expects huffile to be ophuf. Similarly, the decode module will be named huffile.c.

At the beginning of each new execution, huffcode tries to open the encode table file and continue byte frequency counting from the last run. If the huf file represented by huffile does not exist, the table's counts are initialized to zeroes. The decode module is recomputed fresh each run, whether it existed before or not.
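
The following sketch illustrates that reuse behavior without touching the real .huf format: it keeps running totals in a plain "byte count" text file of its own invention, continues from that file when it exists, starts from zero otherwise, and rewrites the file after folding in a new text sample.

/*
 * Accumulation sketch (not huffcode; the count-file format here is
 * invented).  Loads prior per-byte totals if the count file exists,
 * adds the byte frequencies of a new sample, and rewrites the file.
 */
#include <stdio.h>

int main(int argc, char **argv)
{
    long counts[256] = {0};
    FILE *fp;
    int  c, i;

    if (argc != 3) {
        fprintf(stderr, "usage: %s countfile textfile\n", argv[0]);
        return 1;
    }

    /* Continue from the previous run when the count file exists. */
    if ((fp = fopen(argv[1], "r")) != NULL) {
        int  byte;
        long n;
        while (fscanf(fp, "%d %ld", &byte, &n) == 2)
            if (byte >= 0 && byte < 256)
                counts[byte] = n;
        fclose(fp);
    }

    /* Add the byte frequencies of the new sample. */
    if ((fp = fopen(argv[2], "rb")) == NULL) {
        perror(argv[2]);
        return 1;
    }
    while ((c = getc(fp)) != EOF)
        counts[c]++;
    fclose(fp);

    /* Rewrite the count file with the accumulated totals. */
    if ((fp = fopen(argv[1], "w")) == NULL) {
        perror(argv[1]);
        return 1;
    }
    for (i = 0; i < 256; i++)
        fprintf(fp, "%d %ld\n", i, counts[i]);
    fclose(fp);
    return 0;
}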

None.

None.

None.

The return values are as follows:

0
huffcode completed successfully.
>0
huffcode encountered an error.

huffcode reads the encode table huffile.huf if it exists, and reads textfile if one is specified. It writes huffile.huf and huffile.c.

huffcode ophuf foo.txt

This command reads ophuf.huf, if it exists, and initializes the internal byte count table with its byte frequency counts. If ophuf.huf does not exist, the internal byte counts are initialized to zeros. The encoding table in the original huf file is discarded. The text file foo.txt is then read and its individual byte frequencies are added to the internal byte count table. Next, ophuf.huf is written out with an encoding scheme based on the current byte counts and with a literal character encoding all bytes that have zero frequency. Finally, if the decode module ophuf.c already exists, a prompt requesting permission to overwrite it is written to stdout; if an affirmative response is read from stdin, a new version corresponding to the new ophuf.huf is written out.

huffcode -l200 -o myappl

This command reads myappl.huf and initializes the internal byte count table with its byte frequency counts. Since no textfile argument is specified, the only action is to build new coding tables from the existing frequency counts in myappl.huf. The new tables use a literal character implementation in which only bytes that occur more than 200 times are given their own encoding; all other bytes are encoded with the literal character. After the new encoding tables are generated, myappl.huf is written out. The decode module myappl.c is also written out, without prompting, whether or not it already exists.

dtsrcreate(1), dtsrdbrec(1), dtsrload(1), DtSrAPI(3), DtSearch(5)

