This module computes the similarity of two text documents or strings by
searching for literal word token overlaps. This just means that it
determines how many word tokens are identical between the two
strings. Various scores are computed based on the number of shared
words, and the length of the strings.
At present similarity measurements are made between entire files or
strings, and finer granularity is not supported. Files are treated as
one long input string, so overlaps can be found across sentence and
paragraph boundaries.
Files are first converted into strings by getSimilarity(), then
getSimilarityStrings() does the actual processing. It counts the number
of overlaps (matching words) and finds the longest common subsequences
(phrases) between the two strings. However, the measures other than
lesk do not use the information about phrasal matches.
Text::Similarity::Overlaps returns the F-measure, which is a normalized
value between 0 and 1. Normalization can be turned off by specifying
--no-normalize, in which case the raw_score is returned, which is simply
the number of words that overlap between the two strings.
In addition, Overlaps returns the cosine, E-measure, precision, recall,
Dice coefficient, and Lesk scores in the allScores table.
precision = raw_score / length_file_2
recall = raw_score / length_file_1
F-measure = 2 * precision * recall / (precision + recall)
Dice = 2 * raw_score / (sum of string lengths)
E-measure = 1 - F-measure
Cosine = raw_score / sqrt (length_file_1 * length_file_2)
Lesk = sum of the squares of the length of phrasal matches
(normalized by dividing by the product of the string lengths)
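The formulas above can be sketched in a few lines of Python. This is only an illustration, not the module's own code: the function name overlap_scores and the dictionary keys are hypothetical, and the handling of empty inputs (returning 0 rather than dividing by zero) is an assumption. Cosine is computed here as raw_score / sqrt(length_1 * length_2), the form that gives 1 when both inputs contain exactly the same words.

```python
import math
from typing import Dict


def overlap_scores(raw_score: int, length_1: int, length_2: int) -> Dict[str, float]:
    """Compute the word-overlap scores from a raw score and two lengths.

    raw_score is the number of matching words; length_1 and length_2 are
    the word counts of the first and second input.
    """
    # Guard against empty inputs (assumption: scores default to 0).
    precision = raw_score / length_2 if length_2 else 0.0
    recall = raw_score / length_1 if length_1 else 0.0

    if precision + recall > 0:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0

    total = length_1 + length_2
    dice = 2 * raw_score / total if total else 0.0

    if length_1 and length_2:
        cosine = raw_score / math.sqrt(length_1 * length_2)
    else:
        cosine = 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "dice": dice,
        "e_measure": 1 - f_measure,
        "cosine": cosine,
    }


# Two 4-word inputs that share all 4 words.
scores = overlap_scores(raw_score=4, length_1=4, length_2=4)
print(scores["f_measure"], scores["dice"], scores["cosine"], scores["e_measure"])
# 1.0 1.0 1.0 0.0
```

Substituting precision and recall into the F-measure formula gives 2 * raw_score / (length_1 + length_2), which is why the F-measure and Dice values agree for any input.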
The raw_score is simply the number of matching words between the two
inputs, without respect to their order. Note that matches are literal
and must be exact, so "cat" and "cats" do not match. This corresponds to
the idea of the intersection between the two strings.
None of these measures (except lesk) considers the order of the matches.
For those measures, "jim bit the dog" and "the dog bit jim" are considered
exact matches and attain the highest possible matching score: a
raw_score of 4 if not normalized, and 1 if the score is normalized (in
which case the F-measure is returned).
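The order-insensitive raw_score can be illustrated as a multiset intersection of the word tokens. The sketch below is an assumption about the counting, not the module's implementation: the name raw_score_sketch is hypothetical, tokens are split on whitespace only, and a repeated word is counted at most as many times as it appears in both inputs.

```python
from collections import Counter


def raw_score_sketch(text_1: str, text_2: str) -> int:
    """Count words shared by both inputs, ignoring word order."""
    counts_1 = Counter(text_1.split())
    counts_2 = Counter(text_2.split())
    # Counter & Counter keeps the minimum count of each shared word.
    shared = counts_1 & counts_2
    return sum(shared.values())


print(raw_score_sketch("jim bit the dog", "the dog bit jim"))  # 4
print(raw_score_sketch("cat", "cats"))                         # 0
```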
lesk is different in that it looks for phrasal matches and scores them
more highly. The lesk measure is based on the measure of the same name
included in WordNet::Similarity. There it is used to match the
overlapping text found in the gloss entries of the lexical database /
dictionary WordNet in order to measure semantic relatedness.
The lesk measure finds the length of each phrasal overlap and squares it.
It then sums those squares and, if the score is normalized, divides the
sum by the product of the lengths of the strings. For example:
the dog bit jim
jim bit the dog
The raw_score is 4, since the two strings are made up of identical
words (just in different orders). The F-measure is equal to 1, as are
the cosine and the Dice coefficient. In fact, the F-measure and the
Dice Coefficient are always equivalent, but both are presented since
some users may be more familiar with one formulation versus the other.
The raw_lesk score is 2^2 + 1 + 1 = 6, because "the dog" is a phrasal
match between the strings and thus contributes its length squared to
the raw_lesk score. The normalized lesk score is 0.375, which is 6 /
(4 * 4), or the raw_lesk score divided by the product of the lengths of
the two strings. Note that the normalized lesk score has a maximum value
of 1: if each of the two strings has n words, then their maximum
overlap is n words, which receives a raw_lesk score of n^2, which is
then divided by the product of the string lengths, which is again n^2.
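The lesk arithmetic in this example can be reproduced with a small sketch. It assumes a greedy strategy: repeatedly take the longest run of words common to both strings, add the square of its length, and blank out the matched words so they cannot match again. Whether the module finds phrases in exactly this way is not stated here, and the names longest_common_phrase and lesk_scores are hypothetical.

```python
from typing import List, Optional, Tuple


def longest_common_phrase(
    a: List[Optional[str]], b: List[Optional[str]]
) -> Tuple[int, int, int]:
    """Return (length, start_in_a, start_in_b) of the longest shared run."""
    best = (0, 0, 0)
    # prev[j] holds the length of the common run ending at a[i-2], b[j-1].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # None marks a word already consumed by an earlier match.
            if a[i - 1] is not None and a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best[0]:
                    best = (curr[j], i - curr[j], j - curr[j])
        prev = curr
    return best


def lesk_scores(text_1: str, text_2: str) -> Tuple[int, float]:
    """Return (raw_lesk, normalized_lesk) for two whitespace-tokenized strings."""
    a: List[Optional[str]] = list(text_1.split())
    b: List[Optional[str]] = list(text_2.split())
    len_1, len_2 = len(a), len(b)
    raw = 0
    while True:
        n, i, j = longest_common_phrase(a, b)
        if n == 0:
            break
        raw += n * n
        # Blank out the matched phrase in both strings.
        a[i:i + n] = [None] * n
        b[j:j + n] = [None] * n
    normalized = raw / (len_1 * len_2) if len_1 and len_2 else 0.0
    return raw, normalized


print(lesk_scores("the dog bit jim", "jim bit the dog"))  # (6, 0.375)
```

Two identical 3-word strings match as a single phrase of length 3, so lesk_scores("a b c", "a b c") gives a raw score of 9 and a normalized score of 1.0, the maximum.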
There is some cleaning of text performed automatically, which includes
removal of most punctuation except embedded apostrophes and
underscores. All text is made lower case. This occurs both for file and
string input.
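A rough approximation of this cleaning step is sketched below. The exact rules the module applies are not spelled out here, so the regular expressions are assumptions: punctuation other than apostrophes and underscores is replaced by a space, apostrophes are kept only when they sit between two word characters, and the function name clean_text is hypothetical.

```python
import re


def clean_text(text: str) -> str:
    """Lower-case text and strip most punctuation (illustrative only)."""
    text = text.lower()
    # Replace everything except word characters (which include the
    # underscore), whitespace, and apostrophes with a space.
    text = re.sub(r"[^\w\s']", " ", text)
    # Drop apostrophes that are not embedded between two word characters.
    text = re.sub(r"(?<!\w)'|'(?!\w)", " ", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


print(clean_text("The dog's bone, isn't it?"))  # the dog's bone isn't it
print(clean_text("'Hello' big_world!"))         # hello big_world
```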
Copyright (C) 2004-2008 by Jason Michelizzi and Ted Pedersen
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA