mifluz(3)            FreeBSD Library Functions Manual            mifluz(3)
mifluz - C++ library to use and manage inverted indexes
#include <mifluz.h>
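int main()
{
   // Read the configuration and set up the mifluz context.
   Configuration* config = WordContext::Initialize();
   // Create the object that manages the inverted index.
   WordList* words = new WordList(*config);
   ...
   delete words;
   // Release the mifluz context.
   WordContext::Finish();
   return 0;
}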
The purpose of mifluz is to provide a C++ library to build and query a
full-text inverted index. It is dynamically updatable, scalable (indexes of up
to 1 TB), uses a controlled amount of memory, shares index files and memory
cache among processes or threads and compresses index files to 50% of the raw
data. The structure of the index is configurable at runtime and allows
inclusion of relevance ranking information. The query functions do not require
loading all the occurrences of a searched term. They consume very few
resources and many searches can be run in parallel.
The file management library used in mifluz is a modified Berkeley
DB (www.sleepycat.com) version 3.1.14.
- Configuration
- reads the configuration file and manages it in memory.
- WordContext
- reads the configuration and sets up the mifluz context.
- WordCursor
- abstract class to search and retrieve entries in a WordList object.
- WordCursorOne
- search and retrieve entries in a WordListOne object.
- WordDBInfo
- inverted index usage environment.
- WordDict
- manage and use an inverted index dictionary.
- WordKey
- inverted index key.
- WordKeyInfo
- information on the key structure of the inverted index.
- WordList
- abstract class to manage and use an inverted index file.
- WordListOne
- manage and use an inverted index file.
- WordMonitor
- monitoring classes activity.
- WordRecord
- inverted index record.
- WordRecordInfo
- information on the record structure of the inverted index.
- WordReference
- inverted index occurrence.
- WordType
- defines a word in terms of allowed characters, length etc.
- htdb_dump
- dump the content of an inverted index in Berkeley DB fashion.
- htdb_load
- load the content of an inverted index in Berkeley DB fashion.
- htdb_stat
- displays statistics for Berkeley DB environments.
- mifluzdict
- dump the dictionary of an inverted index.
- mifluzdump
- dump the content of an inverted index.
- mifluzload
- load the content of an inverted index.
- mifluzsearch
- search the content of an inverted index.
The format of the configuration file read by WordContext::Initialize is:
keyword: value
Comments may be added on lines starting with a #. The default configuration
file is read from the file pointed to by the MIFLUZ_CONFIG environment
variable, or ~/.mifluz, or /etc/mifluz.conf, in this order. If no
configuration file is available, built-in defaults are used. Here is an example
configuration file:
wordlist_extend: true
wordlist_cache_size: 10485760
wordlist_page_size: 32768
wordlist_compress: 1
wordlist_wordrecord_description: NONE
wordlist_wordkey_description: Word/DocID 32/Flags 8/Location 16
wordlist_monitor: true
wordlist_monitor_period: 30
wordlist_monitor_output: monitor.out,rrd
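The file format described above is simple enough to read by hand. The following is a minimal sketch of such a reader, not the parser used by the Configuration class. The names trim and parse_config are hypothetical, and the handling of indented comments, blank lines and lines without a colon is an assumption.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper, not part of the mifluz API: strip leading and
// trailing blanks from a string.
static std::string trim(const std::string& s)
{
    const char* blanks = " \t\r\n";
    const std::string::size_type first = s.find_first_not_of(blanks);
    if (first == std::string::npos) return "";
    const std::string::size_type last = s.find_last_not_of(blanks);
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: read "keyword: value" lines into a map.
// Assumptions: blank lines and lines whose first non-blank character
// is '#' are skipped, and lines without a colon are ignored.
std::map<std::string, std::string> parse_config(std::istream& in)
{
    std::map<std::string, std::string> values;
    std::string line;
    while (std::getline(in, line)) {
        const std::string text = trim(line);
        if (text.empty() || text[0] == '#') continue;
        // Split on the first colon only, so values may contain colons.
        const std::string::size_type colon = text.find(':');
        if (colon == std::string::npos) continue;
        const std::string key = trim(text.substr(0, colon));
        if (key.empty()) continue;
        values[key] = trim(text.substr(colon + 1));
    }
    return values;
}
```

Feeding the example file to parse_config through a std::ifstream or std::istringstream gives a map in which, for instance, wordlist_page_size maps to the string 32768. Values are kept as strings; converting them to numbers or booleans is left to the caller.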
- wordlist_allow_numbers {true|false} (default false)
- A digit is considered a valid character within a word if this
configuration parameter is set to true; otherwise it is an error to
insert a word containing digits. See the Normalize method for more
information.
- wordlist_cache_inserts {true|false} (default false)
- If true, all Insert calls are cached in memory. When the WordList
object is closed or a different access method is called, the cached entries
are flushed to the inverted index.
- wordlist_cache_max <bytes> (default 0)
- Maximum size of the cumulated cache files generated when doing bulk
insertion with the BatchStart() function. When this limit is
reached, the cache files are all merged into the inverted index. A value
of 0 means no limit. See WordList(3) for the rationale behind
cache file handling.
- wordlist_cache_size <bytes> (default 500K)
- Berkeley DB cache size (see the Berkeley DB documentation). The cache makes
a huge difference in performance. It must be at least 2% of the expected total
data size. Note that if compression is activated the data size is eight
times larger than the actual file size. In this case the cache must be
scaled to 2% of the data size, not 2% of the file size. See Cache
tuning in the mifluz guide for more hints. See WordList(3) for the
rationale behind cache file handling.
- wordlist_compress {true|false} (default false)
- Activate compression of the index. The resulting index is eight times
smaller than the uncompressed index.
- wordlist_env_dir <directory> (default .)
- Only valid if wordlist_env_share is set to true. Specifies the
directory in which the sharable environment will be created. All inverted
indexes specified with a non-absolute pathname will be created relative to
this directory.
- wordlist_env_share {true|false} (default false)
- If true, a sharable environment is opened, or created if none exists.
- wordlist_env_skip {true|false} (default false)
- If true, no environment is created at all. This must never be used if a
WordList object is created. It may be useful if only WordKey
objects are used, for instance.
- wordlist_extend {true|false} (default false)
- If true, maintain a reference count of unique words. The
Noccurrence method gives access to this count.
- wordlist_locale <locale> (default C)
- Set the locale of the program to locale. See setlocale(3) for more
information.
- wordlist_lowercase {true|false} (default true)
- If a word contains uppercase letters, it is converted to lowercase if this
configuration parameter is true; otherwise it is left untouched.
- wordlist_maximum_word_length <number> (default 25)
- The maximum length of a word. See the Normalize method for more
information.
- wordlist_minimum_word_length <number> (default 3)
- The minimum length of a word. See the Normalize method for more
information.
- wordlist_monitor {true|false} (default false)
- If true, create a WordMonitor instance to gather statistics and
build reports.
- wordlist_monitor_output <file>[,{rrd|readable}] (default stderr)
- Print reports to file instead of the default stderr. If the
type is set to rrd, the output is suitable for the
benchmark-report script. Otherwise it is a (hardly :-) readable
string.
- wordlist_monitor_period <sec> (default 0)
- If the value sec is a positive integer, set a timer to print
reports every sec seconds. The timer is set using the ALRM signal
and will fail if the calling application already has a handler on that
signal.
- wordlist_page_size <bytes> (default 8192)
- Berkeley DB page size (see the Berkeley DB documentation).
- wordlist_truncate {true|false} (default true)
- If a word is longer than wordlist_maximum_word_length allows,
it is truncated if this configuration parameter is true; otherwise
it is considered an invalid word.
- wordlist_valid_punctuation [characters] (default none)
- A list of punctuation characters that may appear in a word. These
characters will be removed from the word before insertion in the
index.
- wordlist_verbose <number> (default 0)
- Set the verbosity level of the WordList class.
1 walk logic
2 walk logic details
3 walk logic lots of details
- wordlist_wordkey_description <desc> (no default)
- Describe the structure of the inverted index key. In the <desc> format
below, Word is a mandatory keyword, while name and bits are values
that must be replaced.
Word bits/name bits [/...]
The name is an alphanumeric symbolic name for the key
field, and bits is the number of bits required to store this
field. Note that all values are stored as unsigned integers (unsigned
int). Example:
Word 8/Document 16/Location 8
- wordlist_wordkey_document [field ...] (default none)
- A white space separated list of field numbers that define a document. The
field number list must not contain gaps. For instance 1 2 3 is valid but 1
3 4 is not valid. This configuration parameter is not used by the mifluz
library but may be used by a query application to define the semantics of a
document. In response to a query, the application will return a list of
results in which only distinct documents will be shown.
- wordlist_wordkey_location field (default none)
- A single field number that contains the position of a word in a given
document. This configuration parameter is not used by the mifluz library
but may be used by a query application.
- wordlist_wordrecord_description {NONE|DATA|STR} (no default)
- NONE: the record is empty
DATA: the record contains an integer (unsigned int)
STR: the record contains a string (String)
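To make the wordlist_wordkey_description format more concrete, here is a minimal sketch that splits a description into named fields and adds up their bit widths. It is not the parser used by WordKeyInfo. The KeyField type and the parse_key_description and total_bits functions are hypothetical names, and treating a missing bit count as 0 is an assumption made so that descriptions such as the one in the example configuration file can also be read.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical type, not the mifluz WordKeyInfo class: one key field.
struct KeyField {
    std::string name;
    unsigned bits;   // 0 when the description gives no bit count
};

// Hypothetical helper: split "Word 8/Document 16/Location 8" on '/'
// and read a name followed by an optional bit count from each part.
std::vector<KeyField> parse_key_description(const std::string& desc)
{
    std::vector<KeyField> fields;
    std::istringstream parts(desc);
    std::string part;
    while (std::getline(parts, part, '/')) {
        std::istringstream tokens(part);
        KeyField field;
        field.bits = 0;
        if (!(tokens >> field.name)) continue;   // skip empty segments
        unsigned bits = 0;
        if (tokens >> bits) field.bits = bits;   // bit count is optional here
        fields.push_back(field);
    }
    return fields;
}

// Sum of the bit widths of all fields in the description.
unsigned total_bits(const std::vector<KeyField>& fields)
{
    unsigned sum = 0;
    for (std::vector<KeyField>::size_type i = 0; i < fields.size(); ++i)
        sum += fields[i].bits;
    return sum;
}
```

For the description Word 8/Document 16/Location 8 this yields three fields whose widths add up to 32 bits. A query application could use the same field list to map the field numbers given in wordlist_wordkey_document and wordlist_wordkey_location to names.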
MIFLUZ_CONFIG: file name of the configuration file read by WordContext(3).
Defaults to ~/.mifluz or /usr/etc/mifluz.conf.
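The lookup order for the configuration file can be sketched as follows. This is an illustration only, not the code in WordContext::Initialize. The function names config_candidates and first_readable are hypothetical, and falling through to the next candidate when a file cannot be opened is an assumption. The system-wide path is a parameter because this page names both /etc/mifluz.conf and /usr/etc/mifluz.conf.

```cpp
#include <cstdlib>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: build the ordered list of candidate files from
// the value of MIFLUZ_CONFIG, the home directory and a system-wide path.
// Null or empty values are skipped.
std::vector<std::string> config_candidates(const char* env_value,
                                           const char* home,
                                           const std::string& system_path)
{
    std::vector<std::string> paths;
    if (env_value && *env_value) paths.push_back(env_value);
    if (home && *home) paths.push_back(std::string(home) + "/.mifluz");
    paths.push_back(system_path);
    return paths;
}

// Hypothetical helper: return the first candidate that can be opened
// for reading, or an empty string if none can (in which case the
// built-in defaults would apply).
std::string first_readable(const std::vector<std::string>& paths)
{
    for (std::vector<std::string>::size_type i = 0; i < paths.size(); ++i) {
        std::ifstream file(paths[i].c_str());
        if (file) return paths[i];
    }
    return "";
}
```

A caller would typically pass std::getenv("MIFLUZ_CONFIG") and std::getenv("HOME") as the first two arguments, then hand the result of config_candidates to first_readable.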
Loic Dachary loic@gnu.org
The Ht://Dig group http://dev.htdig.org/
htdb_dump(1), htdb_stat(1), htdb_load(1), mifluzdump(1), mifluzload(1),
mifluzsearch(1), mifluzdict(1), WordContext(3), WordList(3), WordDict(3),
WordListOne(3), WordKey(3), WordKeyInfo(3), WordType(3), WordDBInfo(3),
WordRecordInfo(3), WordRecord(3), WordReference(3), WordCursor(3),
WordCursorOne(3), WordMonitor(3), Configuration(3)