Manual Reference Pages  -  TEXT_SIMILARITY.PL (1)


NAME

text_similarity.pl - Measure the pair-wise similarity between files or strings

SYNOPSIS



 text_similarity.pl --type Text::Similarity::Overlaps --normalize
                         --string '.......this is one' '????this is two'

 text_similarity.pl --type Text::Similarity::Overlaps --no-normalize
                         --string '.......this is one' '????this is two'

 text_similarity.pl --type Text::Similarity::Overlaps
                         --string 'sir winston churchill' 'Churchill, Winston Sir'

 text_similarity.pl --type Text::Similarity::Overlaps ../GPL.txt ../FDL.txt

 text_similarity.pl --verbose --type Text::Similarity::Overlaps ../GPL.txt ../FDL.txt

 text_similarity.pl --verbose --stoplist stoplist.txt --type Text::Similarity::Overlaps
                        ../GPL.txt ../FDL.txt

 text_similarity.pl [[--verbose] [--stoplist=FILE] [--no-normalize] [--string]
                        --type=TYPE | --help | --version] FILE1 FILE2



DESCRIPTION

This script is a simple command-line interface to the Text::Similarity Perl modules. A method for computing similarity must be specified via the --type option, and then that method is used to measure the similarity of two strings or two files.

Text::Similarity::Overlaps measures similarity by counting the number of words that overlap (match) between the two inputs, without regard to order. So, all of the following strings would have the same pairwise similarity (they would each have a raw score of 4 relative to each other, meaning that 4 words are overlapping or matching).



 winston churchill was here
 here was winston churchill
 winston was here churchill



By default, Text::Similarity::Overlaps returns a normalized F-measure between 0 and 1. Specify --no-normalize to get the raw score instead, or --verbose to also report the other overlap-based scores.
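To make the relationship between the raw score and the normalized F-measure concrete, here is a minimal Python sketch of an order-insensitive word overlap. It is not the Text::Similarity::Overlaps implementation: the names tokenize and similarity are hypothetical, and the text cleanup (lowercasing, dropping punctuation) and the choice of which input serves as the precision denominator are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text, drop punctuation, and split on whitespace (assumed cleanup)."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def similarity(a, b, normalize=True):
    """Return the F-measure of the word overlap, or the raw count if normalize is False."""
    words_a, words_b = tokenize(a), tokenize(b)
    # Multiset intersection: a word counts as often as it occurs in both inputs.
    raw = sum((Counter(words_a) & Counter(words_b)).values())
    if not normalize:
        return raw
    if raw == 0:
        return 0.0
    precision = raw / len(words_b)  # assumption: second input is the denominator
    recall = raw / len(words_a)     # assumption: first input is the denominator
    return 2 * precision * recall / (precision + recall)


strings = [
    "winston churchill was here",
    "here was winston churchill",
    "winston was here churchill",
]
for other in strings[1:]:
    print(similarity(strings[0], other, normalize=False), similarity(strings[0], other))
```

In this sketch each pair of the three example strings shares all four words, so the raw score is 4 and the F-measure is 1.0. Because the F-measure is symmetric in precision and recall, swapping the two inputs does not change the normalized result.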

OPTIONS

--type=TYPE  The type of text similarity measure. Valid values include:



    Text::Similarity::Overlaps



--stoplist=FILE  The name of a file containing stop words. The ./sample directory provides stop lists in two formats: one plain word per line (stoplist.txt) and one regular expression per line (stoplist-nsp.regex). A stop-word file may also mix the two formats.

--no-normalize  Do not normalize scores. Normally, scores are normalized so that they range from 0 to 1. Using this option will give you a raw score instead.

--string  Input will be provided on the command line as strings, not files.

--verbose  Show all the matches that are found between the files, their length and frequency, as well as precision, recall, F-measure, E-measure, Cosine, and the Dice Coefficient.

--help  Show a detailed help message.

--version  Show version information.
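The scores listed under --verbose can all be derived from a raw overlap count and the two input lengths. The Python sketch below uses the common textbook definitions; the function name verbose_scores is hypothetical, and whether the script computes each score with exactly these formulas (in particular, E-measure as 1 minus F-measure) is an assumption.

```python
import math


def verbose_scores(raw, len1, len2):
    """Derive overlap-based scores from a raw overlap count and two input lengths in words."""
    if raw == 0:
        return {"precision": 0.0, "recall": 0.0, "f_measure": 0.0,
                "e_measure": 1.0, "cosine": 0.0, "dice": 0.0}
    precision = raw / len2  # assumption: second input is the denominator
    recall = raw / len1     # assumption: first input is the denominator
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "e_measure": 1 - f_measure,              # assumed balanced E-measure
        "cosine": raw / math.sqrt(len1 * len2),
        "dice": 2 * raw / (len1 + len2),         # algebraically equal to f_measure here
    }


for name, value in verbose_scores(2, 3, 4).items():
    print(f"{name}: {value:.4f}")
```

Under these definitions the Dice coefficient and the balanced F-measure coincide, since both reduce to 2 * raw / (len1 + len2).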

AUTHORS



 Ted Pedersen, University of Minnesota, Duluth
 tpederse at d.umn.edu

 Jason Michelizzi

 Ying Liu, University of Minnesota, Twin Cities
 liux0395 at umn.edu



Last modified by: $Id: text_similarity.pl,v 1.1.1.1 2013/06/26 02:38:12 tpederse Exp $

BUGS

--compfile does not work and seems to cause a hang (tdp 3/21/08)

COPYRIGHT AND LICENSE

Copyright (C) 2004-2010, Jason Michelizzi, Ted Pedersen and Ying Liu

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

perl v5.20.3 TEXT_SIMILARITY (1) 2013-06-26
