

 
Manual Reference Pages  DISTANCE (3)
NAME
distance
 compute the distance between two pieces of data
CONTENTS
Synopsis
Description
Levenshtein Distances
Damerau Distance
Needlemanwunsch Distance
Hamming Distances
Bloom Filter Distances
Jaccard Distance
Minkowski Distance
Return Values
Bugs
Author
See Also
SYNOPSIS
.Fd #include <distance.h>
int
levenshtein_d const void *d1 size_t len1 const void *d2 size_t len2
int
damerau_d const void *d1 size_t len1 const void *d2 size_t len2
double
needleman_wunsch_d const void *d1 size_t len1 const void *d2 size_t len2 matrix *m
int
hamming_d const void *d1 size_t len1 const void *d2 size_t len2
void
bloom_create const void *data size_t len const void *digest size_t digest_len
double
bloom_d const void *digest1 const void *digest2 size_t digest_len
float
jaccard_d const void *d1 size_t len1 const void *d2 size_t len2
float
minkowski_d const void *d1 size_t len1 const void *d2 size_t len2 int power
MANHATTAN_D const void *d1 size_t len1 const void *d2 size_t len2
EUDCLID_D const void *d1 size_t len1 const void *d2 size_t len2
DESCRIPTION
The
distance
library is used to compare pieces of data for similarity. Specifically, it
contains a number of methods to find the "edit distance" between inputs,
or the number of differences between them. These differences are calculated
using various mechanisms.
Inputs are operated on pairwise as
*s
and
*t
and the edit distance value is returned.
LEVENSHTEIN DISTANCES
Levenshtein distance (LD) is a measure of the similarity between two
inputs, which we will refer to as the source
s
and the target
input
t.
The distance is the number of deletions, insertions, or substitutions
required to transform
s
into
t.
For example,
If s is "test" and t is "test", then LD(s,t) = 0, because no
transformations are needed. The strings are already identical.
If s is "test" and t is "tent", then LD(s,t) = 1, because one
substitution (change "s" to "n") is sufficient to transform s into t.
The greater the Levenshtein distance, the more different the inputs are.
Levenshtein distance is named after the Russian scientist Vladimir
Levenshtein, who devised the algorithm in 1965. If you can’t spell or
pronounce Levenshtein, the metric is also sometimes called edit distance.
DAMERAU DISTANCE
The Damerau distance is almost identical to the Levenshtein distance but
can tolerate adjascent characters that have been swapped, a common
spelling mistake or typographic error. For example, if s is "abcd" and t
is "acbd", then DD(s,t) is 0 because of the transposition of "b" and "c".
Other costs found in the Levenshtein distance are identical.
NEEDLEMANWUNSCH DISTANCE
The Levenshtein distance algorithm assumes that the cost of all
insertions or conversions is equal. However, in some scenarios this
may not be desirable and may mask the acceptable distances between
inputs.
A modified Levenshtein distance algorithm, also known as the NeedlemanWunsch
distance algorithm, accepts a cost matrix as
an extra input. This matrix structure contains two cost matricies
for each pair of characters to convert from and to. The costs of
inserting this character and converting between characters is
listed in this matrix. The structure of the matrix is as follows:
struct matrix {
char input[255];
float conversion[255][255];
float insertion[255][255];
};
These values should be assigned before the distance algorithm is
used.
HAMMING DISTANCES
The Hamming distance H is defined only for inputs of the same length.
For two inputs
s
and
t,
H(s, t) is the number of places in which the two string differ, i.e.,
have different characters.
BLOOM FILTER DISTANCES
A Bloom Filter (BF) is a method of encoding a variable length piece of
input data to a fixed length signature. The basic idea is to map
attributes of the data to specific positions in a binary vector. First,
the result vector is initialized to all 0’s. Next, a series of hash
functions are applied to the data to map some part of the input data
to a position in the result vector. Thus, if a is the input data,
and m is length of the result vector, then hash(a) produces a value in the
range 0...m1.
A similar method encoding data into a fixed length result vector can be
accomplished using a crytographic digest. A digest function like MD5
is designed such that the difference between any two digests does not in
any way correlate to similar differences in input messages. This is
an important property in certain applications however it may also be
desirable that the difference between digests corresponds to
differences in the input messages. Because a Bloom filter is derived from
attributes of the input message, the difference between any two Bloom
Filter digests corresponds to differences in the input data.
This Bloom filter implementation is meant take any byte sequence as input
so a generic hash function is used. The input stream is divided into
chunks and hashed using a 32 bit adler modulo m. In order to get the most
complete coverage, the chunk N+1 includes all but one of the bytes in
chunk N and one byte not included in chunk N. (i.e. the chunk window
slides to right one byte per hash.)
The difference between any two Bloom filter digests is computed by taking
the logical and of the two digests and computing the number of bit of
similarity. The similar count is then divided by maximum of the total
number of bit in either digest. This acts to normalize the bit count for
large and small signatures. The similarity is then turned into a distance
by take 1  normalized bit count.
JACCARD DISTANCE
The Jaccard similarity between two strings is the ratio of the size of their
intersection to the size of their union. This distance is 1 minus this
similarity sscore. Indentical strings have a Jacard distance of 0,
completely dissimilar strings have a score of 1. The two inputs must be
of the same size.
MINKOWSKI DISTANCE
The Minkowski distance between two strings is the geometric distance between
two inputs and uses a variable scaling factor,
power.
When this value is 1 the Minkowski distance is equal to the Manhattan
distance. When
power
is 2 it yields the Euclidian distance between two strings. When this
value grows, the value of the Minkowski distance tends towards a
Chebyshev result. Therefore by inceasing
power,
one can place more numerical value on the largest distance (in terms of
elements in the two vectors in question). Two macros are provided for
these modifications to the Minkowski distance,
EUCLID_D
and
MANHATTAN_D.
A disadvantage of the Minkowski method is that if one element in the
vectors has a wider range than the other elements then that large
range may ’dilute’ the distances of the smallrange elements.
RETURN VALUES
Each fuction returns the calculated distance between the two inputs.
A distance of 0 indicates that the strings are the same. A distance
less than 0 indicates that an error has occurred. For the Hamming
and Jaccard distances, this is because the two inputs are of different
sizes.
BUGS
The
minkowski_d
implementation is possibly broken, as are the
MANHATTAN_D
and
EUCLID_D
macros.
AUTHOR
Lorenzo Seidenari wrote the Levenshtein distance implementation.
Jose Nazario
<jose@monkey.org>
wrote the implementations of the Damerau, Hamming and Jaccard distance
algorithms along with the NeedlemanWunsch Levenshtein distance
implementation here.
Evan Cooke
adapted the adler32 routine from Jeanloup Gailly and Mark Adler used in the
bloom_d
function, which he also wrote.
SEE ALSO
.Rs
Binary codes capable of correcting deletions, insertions and reversals
.Re
.Rs
A technique for computer detection and correction of spelling errors
.Re
.Rs
A general method applicable to the search for similarities in the amino acid sequence of two proteins
.Re
.Rs
Space/time tradeoffs in hash coding with allowable errors
.Re
.Rs
Error Detecting and Error Correcting Codes
.Re
.Rs
Etude comparative de la distribution florale dans une portion des Alpes et du Jura
.Re
Visit the GSP FreeBSD Man Page Interface. Output converted with manServer 1.07. 