Manual Reference Pages  -  LINGUA::STEM::EN (3)


NAME

Lingua::Stem::En - Porter’s stemming algorithm for ’generic’ English


SYNOPSIS



    use Lingua::Stem::En;
    my $stems   = Lingua::Stem::En::stem({ -words => $word_list_reference,
                                        -locale => 'en',
                                    -exceptions => $exceptions_hash,
                                     });



DESCRIPTION

This routine applies the Porter Stemming Algorithm to its parameters, returning the stemmed words.

It is derived from the C program stemmer.c as found in freewais and elsewhere, which contains these notes:



   Purpose:    Implementation of the Porter stemming algorithm documented
               in: Porter, M.F., "An Algorithm For Suffix Stripping,"
               Program 14 (3), July 1980, pp. 130-137.
   Provenance: Written by B. Frakes and C. Cox, 1986.



I have re-interpreted areas that use Frakes and Cox’s WordSize function. My version may misbehave on short words starting with y, but I can’t think of any examples.

The step numbers correspond to Frakes and Cox, and are probably in Porter’s article (which I’ve not seen). Porter’s algorithm still has rough spots (e.g. current/currency, -ings words), which I’ve not attempted to cure, although I have added support for the British -ise suffix.

CHANGES



 1999.06.15 - Changed to .pm module, moved into Lingua::Stem namespace,
              optionalized the export of the stem routine
              into the caller's namespace, added named parameters

 1999.06.24 - Switch core implementation of the Porter stemmer to
              the one written by Jim Richardson <jimr@maths.usyd.edu.au>

 2000.08.25 - 2.11 Added stemming cache

 2000.09.14 - 2.12 Fixed *major* :( implementation error of Porter's algorithm
              Error was entirely my fault - I completely forgot to include
              rule sets 2,3, and 4 starting with Lingua::Stem 0.30.
              -- Benjamin Franz

 2003.09.28 - 2.13 Corrected documentation error pointed out by Simon Cozens.

 2005.11.20 - 2.14 Changed rule declarations to conform to Perl style convention
              for private subroutines. Changed Exporter invocation to the more
              portable 'require' instead of 'use'.

 2006.02.14 - 2.15 Added ability to pass word list by handle for in-place stemming.

 2009.07.27 - 2.16 Documentation Fix



METHODS

stem({ -words => \@words, -locale => 'en', -exceptions => \%exceptions });

Stems a list of passed words using the rules of US English. Returns an anonymous array reference to the stemmed words.

Example:



  my @words         = ( 'wordy', 'another' );
  my $stemmed_words = Lingua::Stem::En::stem({ -words => \@words,
                                              -locale => 'en',
                                          -exceptions => \%exceptions,
                          });
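
The example above leaves %exceptions undeclared. As a minimal sketch, assuming the exceptions hash maps a lowercase word to the exact replacement that should be returned for it (bypassing the Porter rules), it could be set up like this. The word 'mice' and its replacement 'mouse' are illustrative only.

```perl
use strict;
use warnings;
use Lingua::Stem::En;

# Assumption: keys are lowercase words, values are the stems to return
# for them instead of running the Porter rules.
my %exceptions = ( 'mice' => 'mouse' );

my @words   = ( 'mice', 'wordy' );
my $stemmed = Lingua::Stem::En::stem({ -words      => \@words,
                                       -locale     => 'en',
                                       -exceptions => \%exceptions,
                                     });

# The first entry comes from the exceptions hash, the second from the stemmer.
print join(' ', @$stemmed), "\n";
```

Because the words are passed by value here, @words itself is left unchanged.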



If the first element of @words is a list reference, then the stemming is performed ’in place’ on that list (modifying the passed list directly instead of copying it to a new array).

This is only useful if you do not need to keep the original list. If you do need to keep the original list, use the normal semantics of having ’stem’ return a new list instead. That is faster than making your own copy and then using the ’in place’ semantics, since the primary difference between ’in place’ and ’by value’ stemming is the creation of a copy of the original list. If you don’t need the original list, then ’in place’ stemming is about 60% faster.

Example of ’in place’ stemming:



  my $words         = [ 'wordy', 'another' ];
  my $stemmed_words = Lingua::Stem::En::stem({ -words => [$words],
                          -locale => 'en',
                      -exceptions => \%exceptions,
                      });



The ’in place’ mode returns a reference to the original list with the words stemmed.
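
The difference between the two calling styles can be checked side by side. The following is a small sketch rather than a benchmark; the variable names are illustrative, and the -exceptions parameter is omitted on the assumption that it is optional.

```perl
use strict;
use warnings;
use Lingua::Stem::En;

# By value: @original is left untouched and a new list is returned.
my @original = ( 'wordy', 'another' );
my $by_value = Lingua::Stem::En::stem({ -words => \@original, -locale => 'en' });

# In place: the list reference is wrapped in an outer array reference,
# so it becomes the first element of the -words list.
my $words    = [ 'wordy', 'another' ];
my $in_place = Lingua::Stem::En::stem({ -words => [$words], -locale => 'en' });

print "original kept: @original\n";
print "in place:      @$words\n";

# Comparing references numerically tests whether they point at the same array.
print "same list\n" if $in_place == $words;
```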

stem_caching({ -level => 0|1|2 });

Sets the level of stem caching.

’0’ means ’no caching’. This is the default level.

’1’ means ’cache per run’. This caches stemming results during a single
call to ’stem’.

’2’ means ’cache indefinitely’. This caches stemming results until
either the process exits or the ’clear_stem_cache’ method is called.

clear_stem_cache;

Clears the cache of stemmed words.
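
As a rough sketch of how the caching calls fit together, the following turns on indefinite caching for several batches that share words, then clears the cache and restores the default level. It assumes both routines are called as fully qualified functions, in the same way as ’stem’; the batches are illustrative.

```perl
use strict;
use warnings;
use Lingua::Stem::En;

# Level 2: keep stemming results across calls to stem().
Lingua::Stem::En::stem_caching({ -level => 2 });

my @results;
for my $batch ( [ 'wordy', 'another' ], [ 'another', 'words' ] ) {
    my $stems = Lingua::Stem::En::stem({ -words => $batch, -locale => 'en' });
    push @results, $stems;
    print join(' ', @$stems), "\n";
}

# Release the cached stems and return to the default of no caching.
Lingua::Stem::En::clear_stem_cache();
Lingua::Stem::En::stem_caching({ -level => 0 });
```

Caching should only affect speed, not results: the stem for a repeated word such as 'another' is the same whether or not it was served from the cache.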

NOTES

This code is almost entirely derived from the Porter 2.1 module written by Jim Richardson.

SEE ALSO



 Lingua::Stem



AUTHOR



  Jim Richardson, University of Sydney
  jimr@maths.usyd.edu.au or http://www.maths.usyd.edu.au:8000/jimr.html

  Integration in Lingua::Stem by
  Benjamin Franz, FreeRun Technologies,
  snowhare@nihongo.org or http://www.nihongo.org/snowhare/



COPYRIGHT

Jim Richardson, University of Sydney

Benjamin Franz, FreeRun Technologies

This code is freely available under the same terms as Perl.

BUGS

TODO



perl v5.20.3 LINGUA::STEM::EN (3) 2010-04-29
