NAME
huffcode — Create optimized DtSearch compression/decompression tables

SYNOPSIS
huffcode [-llit_thresh | -l- ] [-o] huffile [textfile]

DESCRIPTION
huffcode creates optimized DtSearch compression/decompression tables. Documents stored in a DtSearch database text repository can first be compressed using a Huffman text compression algorithm. The algorithm provides optimal compression only with preanalysis of the statistical distribution of bytes in the database corpus. huffcode analyzes a text corpus and generates DtSearch compression and decompression tables. It is provided as a convenience utility for database developers who want to optimize offline storage requirements. Compression is not used in databases created without the ability to store text in a DtSearch repository.

huffcode reads a text file as input and writes out ophuf.huf (the compression or "encode" table) and ophuf.c (the decompression or "decode" table). ophuf.huf is an external ASCII file that also retains the statistical information on how it was generated, so huffcode can be executed repeatedly against different text samples, continually accumulating results. For a small or static text corpus, the entire corpus can be fed into huffcode for optimal Huffman compression. For large or dynamic databases, the typical practice is to feed huffcode representative text samples.

The Huffman code tables are created once for each API instance (not once per database) before any documents are loaded. The only program that reads the encode table, an external file, is dtsrload. The ophuf.huf file generated by huffcode should be used instead of the provided default file prior to the first run of dtsrload for any database to be accessed by a particular API instance. The decode table, a C module, should be compiled and linked into the application code ahead of the API library to override the default decode module in the library. Huf files and decode modules are not user editable.
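The corpus analysis described above amounts to tallying how often each of the 256 byte values occurs. The following is a minimal sketch of that step in C; count_bytes is an illustrative name and not part of huffcode itself, which additionally persists these counts in the .huf file between runs.

```c
#include <stdio.h>

#define NBYTES 256

/* Add the frequency of every byte in the stream to counts[].
 * huffcode accumulates counts like these across runs by rereading
 * the statistics it saved in the .huf file. */
void count_bytes(FILE *fp, unsigned long counts[NBYTES])
{
    int c;
    while ((c = fgetc(fp)) != EOF)
        counts[c]++;
}
```

A Huffman code built from such counts assigns shorter bit strings to the more frequent bytes, which is why representative samples matter: skewed samples produce skewed code lengths.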
HCTREE_ID
It is imperative that the encode and decode tables reflect identical byte statistics, or decode errors will result. The first line of ophuf.huf includes a long integer value named HCTREE_ID. Each execution of huffcode generates a new, unique hctree_id integer. dtsrload loads this integer into the database configuration and status record when it loads the first document into a new database. Thereafter, each execution of dtsrload for that database confirms that the same hctree_id is used for each document compression; it aborts if the hctree_id in ophuf.huf does not match the value recorded for the database by previous executions. hctree_id is also stored as a variable in the decode module ophuf.c. DtSearchInit will not open any database listed in the ocf file whose hctree_id, as stored in its configuration and status record, does not match the value in the decode module. The dtsrdbrec utility will print the hctree_id value for any database.

OPTIONS
The following options are available:

Note:
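The consistency check above can be pictured as a simple comparison between the id recorded in the database and the id compiled into the linked decode module. This is a hedged sketch only; the variable and function names are illustrative, and the real decode module (ophuf.c) defines its own hctree_id symbol.

```c
/* Stand-in for the unique id huffcode writes into the decode module. */
long hctree_id = 4221977L;

/* Return 1 if the id stored in a database's configuration and status
 * record matches the linked decode module, else 0.  In the real API,
 * a mismatch means DtSearchInit refuses to open the database. */
int hctree_id_matches(long db_stored_id)
{
    return db_stored_id == hctree_id;
}
```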
If an option takes a value, the value must be directly appended to the option name without white space.
OPERANDS
The required input file name (huffile) is the base file name of the encode table, excluding the .huf extension. dtsrload expects huffile to be ophuf. Similarly, the decode module will be named huffile.c. At the beginning of each new execution, huffcode tries to open the encode table file and continue byte frequency counting from the last run. If the huf file represented by huffile does not exist, the table's counts are initialized to zeroes. The decode module is recomputed fresh each run, whether or not it existed before.

ENVIRONMENT VARIABLES
None.

RESOURCES
None.

ACTIONS/MESSAGES
None.

RETURN VALUES
The return values are as follows:
FILES
huffcode reads the specified huffile. It also reads textfile if it is specified. It writes to huffile.huf and huffile.c.

EXAMPLES
The following command reads ophuf.huf, if it exists, and initializes the internal byte count table with its byte frequency counts. If ophuf.huf does not exist, the internal byte counts will be initialized to zeros. The encoding table in the original huf file will be discarded. The text file foo.txt will be read and its individual byte frequencies added to the internal byte count table. Then ophuf.huf will be written out, with an encoding scheme based on the current byte counts, and with a literal character encoding all bytes that have zero frequency. Finally, if the decode module ophuf.c already exists, a prompt requesting permission to overwrite it will be written to stdout and, if an affirmative response is read from stdin, a new version corresponding to the new ophuf.huf will be written out.

huffcode ophuf foo.txt
The following command reads myappl.huf and initializes the internal byte count table with its byte frequency counts. Since no textfile argument is specified, the only possible action is to build different coding tables from the existing frequency counts in myappl.huf. The new tables will be based on a literal character implementation in which only bytes that occur more than 200 times are given their own encoding; all other bytes are encoded with the literal character. After the new encoding tables are generated, myappl.huf will be written out. The decode module myappl.c will also be written out, without prompting, whether or not it already exists.

huffcode -l200 -o myappl
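Based on the behavior described in the examples above, the -l literal threshold partitions the byte values: bytes seen more often than the threshold get their own Huffman code, and everything else shares one literal escape encoding. The sketch below assumes that reading of the option and uses illustrative function names, not huffcode internals.

```c
/* Assumed -l rule: a byte earns its own code only if its frequency
 * exceeds lit_thresh (e.g. 200 for -l200). */
int gets_own_code(unsigned long freq, unsigned long lit_thresh)
{
    return freq > lit_thresh;
}

/* Count how many of the 256 byte values would receive their own code
 * under a given threshold. */
int coded_byte_count(const unsigned long counts[256], unsigned long lit_thresh)
{
    int i, n = 0;
    for (i = 0; i < 256; i++)
        if (counts[i] > lit_thresh)
            n++;
    return n;
}
```

A higher threshold yields a smaller code tree at the cost of emitting more bytes through the longer literal escape sequence.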
SEE ALSO
dtsrcreate(1), dtsrdbrec(1), dtsrload(1), DtSrAPI(3), DtSearch(5)