Construct a new classifier. The filename arguments to the constructor
must refer to files containing tables of n-gram probabilities for
languages (language models). These tables can be generated using the
trainlid(1) utility program.
Identify the language of a text given in $string. The identify()
method returns the value specified in the _LANG field of the
probabilities table of the language in which the text is most likely
written (see WARNINGS below).
Internally, the identify() method calls the calculate() method.
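A minimal sketch of how the constructor and identify() fit together.
The helper name identify_text and the model file names in the comment
are made up for illustration; the module is loaded lazily only so that
the snippet compiles on systems where Lingua::Ident is not installed.

```perl
use strict;
use warnings;

# Hypothetical helper: build a classifier from the given language model
# files (as produced by trainlid(1)) and identify the language of $text.
sub identify_text {
    my ($text, @model_files) = @_;

    # Loaded at run time so this file compiles without the module.
    require Lingua::Ident;

    my $classifier = Lingua::Ident->new(@model_files);
    return $classifier->identify($text);
}

# Example call; the file names are placeholders, not files shipped
# with the distribution:
# my $lang = identify_text("Das ist ein Test.", "data.de", "data.en");
```

The return value is whatever the _LANG field of the winning model
contains, so its form depends on how the models were trained.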
Calculate the probabilities for a text to be in the languages known to
the classifier. This method returns a reference to an array. The
array represents a table of languages and the probability for each
language. Each array element is a reference to an array containing
two elements: the language name and the associated probability. For
example, you may get something like this:
[['en.iso-8859-1', -450.804230119916], ...]
The elements are sorted in descending order by probability. You can
use this data to assess the reliability of the categorization and make
your own decision using application-specific metrics.
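One possible application-specific metric is the gap between the best
and the second-best entry. The sketch below works on a hard-coded table
in the shape described above; the numbers are made up, and the margin()
helper is not part of the module. The negative values suggest that the
figures are log probabilities, which is assumed here.

```perl
use strict;
use warnings;

# Made-up table in the shape returned by calculate():
# [language name, score] pairs, best candidate first.
my $result = [
    ['de.iso-8859-1', -320.5],
    ['en.iso-8859-1', -450.8],
];

# Hypothetical metric: difference between the best and the second-best
# score. Returns undef when there are fewer than two candidates.
sub margin {
    my ($table) = @_;
    return unless @$table >= 2;
    return $table->[0][1] - $table->[1][1];
}

my ($best, $score) = @{ $result->[0] };
my $margin = margin($result);
printf "%s (score %.2f, margin %.2f)\n", $best, $score, $margin;
```

A small margin means the two best candidates are nearly tied; what
counts as small depends on the text length and on the models, so any
threshold has to be chosen for the application.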
When neither a trigram nor a bigram is found, the calculation deviates
slightly from the formula given by Dunning (1994). According to
Dunning's formula, one would estimate the probability as:
p = log(1/#alph)
where #alph is the size of the alphabet of a particular language.
This penalizes different language models with different values because
the alphabet sizes of the languages differ.
However, the size of the alphabet is much larger for Asian languages
than for European languages. For example, for the sample data in the
Lingua::Ident distribution trainlid(1) reports #alph = 127 for zh.big5
vs. #alph = 31 for de.iso-8859-1. This means that Asian languages are
penalized much more heavily than European languages when an estimation must
be made.
To use the same penalty for all languages, calculate() now uses the
average of all alphabet sizes instead.
NOTE: This has only been lightly tested so far; feedback is welcome.
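The effect of averaging can be checked with a little arithmetic. The
sketch below uses only the two alphabet sizes quoted above, whereas
calculate() averages over all loaded models; the natural logarithm is
an assumption, and penalty() is a helper written for this example, not
a function of the module.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Alphabet sizes reported by trainlid(1) for the sample data.
my %alph = ('zh.big5' => 127, 'de.iso-8859-1' => 31);

# Penalty according to Dunning's formula: p = log(1/#alph).
sub penalty {
    my ($size) = @_;
    return log(1 / $size);
}

# Per-language penalties differ from model to model ...
printf "%-14s %.3f\n", $_, penalty($alph{$_}) for sort keys %alph;

# ... whereas the average alphabet size yields one shared penalty.
my $avg = sum(values %alph) / scalar(keys %alph);
printf "%-14s %.3f\n", 'average', penalty($avg);
```

With these two sizes the individual penalties are roughly -4.84 for
zh.big5 and -3.43 for de.iso-8859-1, while the shared penalty for the
average size of 79 is roughly -4.37.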