|
NAME
Lingua::Stem::En - Porter's stemming algorithm for 'generic' English

SYNOPSIS
    use Lingua::Stem::En;
    my $stems = Lingua::Stem::En::stem({ -words => $word_list_reference,
                                         -locale => 'en',
                                         -exceptions => $exceptions_hash,
                                       });
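
As a slightly fuller illustration of the call above, the following sketch passes a concrete word list and an exceptions hash. The sample words and the contents of %exceptions are assumptions made up for this example; the sketch also assumes that a word found in the exceptions hash is mapped directly to the given value instead of being run through the Porter rules.

```perl
use strict;
use warnings;
use Lingua::Stem::En;

# Sample input (made up for this example).
my @words = qw(running stemmer organise mice);

# Hypothetical exception table: words listed here are assumed to
# bypass the Porter rules and map straight to the given stem.
my %exceptions = ( mice => 'mouse' );

my $stems = Lingua::Stem::En::stem({
    -words      => \@words,
    -locale     => 'en',
    -exceptions => \%exceptions,
});

# The result is an array reference parallel to the input list.
for my $i (0 .. $#words) {
    print "$words[$i] => $stems->[$i]\n";
}
```

The fully qualified name Lingua::Stem::En::stem is used here because the export of 'stem' into the caller's namespace is optional.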

DESCRIPTION
This routine applies the Porter Stemming Algorithm to its parameters, returning the stemmed words. It is derived from the C program "stemmer.c" as found in freewais and elsewhere, which contains these notes:

    Purpose: Implementation of the Porter stemming algorithm documented
             in: Porter, M.F., "An Algorithm For Suffix Stripping,"
             Program 14 (3), July 1980, pp. 130-137.
    Provenance: Written by B. Frakes and C. Cox, 1986.

I have re-interpreted areas that use Frakes and Cox's "WordSize" function. My version may misbehave on short words starting with "y", but I can't think of any examples.

The step numbers correspond to Frakes and Cox, and are probably in Porter's article (which I've not seen).

Porter's algorithm still has rough spots (e.g. current/currency, -ings words), which I've not attempted to cure, although I have added support for the British -ise suffix.

CHANGES
1999.06.15 - Changed to '.pm' module, moved into Lingua::Stem namespace,
optionalized the export of the 'stem' routine
into the caller's namespace, added named parameters
1999.06.24 - Switch core implementation of the Porter stemmer to
the one written by Jim Richardson <jimr@maths.usyd.edu.au>
2000.08.25 - 2.11 Added stemming cache
2000.09.14 - 2.12 Fixed *major* :( implementation error of Porter's algorithm
Error was entirely my fault - I completely forgot to include
rule sets 2,3, and 4 starting with Lingua::Stem 0.30.
-- Jerilyn Franz
2003.09.28 - 2.13 Corrected documentation error pointed out by Simon Cozens.
2005.11.20 - 2.14 Changed rule declarations to conform to Perl style convention
for 'private' subroutines. Changed Exporter invocation to more
portable 'require' vice 'use'.
2006.02.14 - 2.15 Added ability to pass word list by 'handle' for in-place stemming.
2009.07.27 - 2.16 Documentation Fix
2020.06.20 - 2.30 Version renumber for module consistency.
2020.09.26 - 2.31 Fix for Latin1/UTF8 issue in documentation
METHODS

NOTES
This code is almost entirely derived from the Porter 2.1 module written by Jim Richardson.

SEE ALSO
Lingua::Stem

AUTHOR
Jim Richardson, University of Sydney jimr@maths.usyd.edu.au or http://www.maths.usyd.edu.au:8000/jimr.html

Integration in Lingua::Stem by Jerilyn Franz, FreeRun Technologies, <cpan@jerilyn.info>

COPYRIGHT
Jim Richardson, University of Sydney
Jerilyn Franz, FreeRun Technologies

This code is freely available under the same terms as Perl.

BUGS

TODO
|