Lingua::Stem::En - Porter’s stemming algorithm for ’generic’ English



    use Lingua::Stem::En;
    my $stems   = Lingua::Stem::En::stem({ -words => $word_list_reference,
                                        -locale => 'en',
                                    -exceptions => $exceptions_hash,
                                     });


This routine applies the Porter Stemming Algorithm to its parameters, returning the stemmed words.

It is derived from the C program stemmer.c as found in freewais and elsewhere, which contains these notes:

   Purpose:    Implementation of the Porter stemming algorithm documented
               in: Porter, M.F., "An Algorithm For Suffix Stripping,"
               Program 14 (3), July 1980, pp. 130-137.
   Provenance: Written by B. Frakes and C. Cox, 1986.

I have re-interpreted areas that use Frakes and Cox’s WordSize function. My version may misbehave on short words starting with y, but I can’t think of any examples.

The step numbers correspond to Frakes and Cox, and are probably in Porter’s article (which I’ve not seen). Porter’s algorithm still has rough spots (e.g. current/currency, -ings words), which I’ve not attempted to cure, although I have added support for the British -ise suffix.
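As an illustration of what a single numbered step looks like, here is a minimal sketch of step 1a as given in Porter's paper (SSES -> SS, IES -> I, SS -> SS, S -> nothing). It is not taken from this module's source; the function name step_1a_sketch is hypothetical, and the real implementation applies several further steps with measure conditions that are not shown here.

```perl
use strict;
use warnings;

# Sketch of Porter step 1a. Rules are tried longest suffix first,
# and only the first rule that matches is applied.
sub step_1a_sketch {
    my ($word) = @_;
    return $word if $word =~ s/sses$/ss/;   # caresses -> caress
    return $word if $word =~ s/ies$/i/;     # ponies   -> poni
    return $word if $word =~ /ss$/;         # caress   -> caress (unchanged)
    $word =~ s/s$//;                        # cats     -> cat
    return $word;
}

print join(' ', map { step_1a_sketch($_) } qw(caresses ponies caress cats)), "\n";
# prints "caress poni caress cat"
```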


 1999.06.15 - Changed to .pm module, moved into Lingua::Stem namespace,
              optionalized the export of the stem routine
              into the callers namespace, added named parameters

 1999.06.24 - Switch core implementation of the Porter stemmer to
              the one written by Jim Richardson

 2000.08.25 - 2.11 Added stemming cache

 2000.09.14 - 2.12 Fixed *major* :( implementation error of Porter’s algorithm
              Error was entirely my fault - I completely forgot to include
              rule sets 2,3, and 4 starting with Lingua::Stem 0.30.
              -- Benjamin Franz

 2003.09.28 - 2.13 Corrected documentation error pointed out by Simon Cozens.

 2005.11.20 - 2.14 Changed rule declarations to conform to Perl style convention
              for private subroutines. Changed Exporter invocation to more
              portable require vice use.

 2006.02.14 - 2.15 Added ability to pass word list by handle for in-place stemming.

 2009.07.27 - 2.16 Documentation fix


stem({ -words => \@words, -locale => 'en', -exceptions => \%exceptions });

Stems a list of passed words using the rules of US English. Returns an anonymous array reference to the stemmed words.


  my @words         = ( 'wordy', 'another' );
  my $stemmed_words = Lingua::Stem::En::stem({ -words => \@words,
                                              -locale => 'en',
                                          -exceptions => \%exceptions,
                                           });

If the first element of @words is a list reference, then the stemming is performed ’in place’ on that list (modifying the passed list directly instead of copying it to a new array).

This is only useful if you do not need to keep the original list. If you do need to keep the original list, use the normal semantic of having ’stem’ return a new list instead. That is faster than making your own copy and then using the ’in place’ semantics, since the primary difference between ’in place’ and ’by value’ stemming is the creation of a copy of the original list. If you don’t need the original list, then the ’in place’ stemming is about 60% faster.

Example of ’in place’ stemming:

  my $words         = [ 'wordy', 'another' ];
  my $stemmed_words = Lingua::Stem::En::stem({ -words => [$words],
                          -locale => 'en',
                      -exceptions => \%exceptions,
                       });

The ’in place’ mode returns a reference to the original list with the words stemmed.
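The two calling conventions can be compared side by side. The following is a small sketch that assumes Lingua::Stem::En is installed and passes an empty exceptions hash; it only checks which list is modified and which reference comes back, as described above.

```perl
use strict;
use warnings;
use Lingua::Stem::En;

# By value: @words is left untouched and a new list is returned.
my @words    = ('wordy', 'another');
my $by_value = Lingua::Stem::En::stem({ -words      => \@words,
                                        -locale     => 'en',
                                        -exceptions => {},
                                      });

# In place: the list reference is wrapped in another array reference,
# so it becomes the first element of the -words list.
my $list     = [ 'wordy', 'another' ];
my $in_place = Lingua::Stem::En::stem({ -words      => [$list],
                                        -locale     => 'en',
                                        -exceptions => {},
                                      });

print "original kept: @words\n";
print "by value:      @$by_value\n";
print "in place:      @$list\n";
print "in-place call returned the original list\n" if $in_place == $list;
```

Comparing the two references with == checks identity, not contents, which is what distinguishes the two modes.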

stem_caching({ -level => 0|1|2 });

Sets the level of stem caching.

’0’ means ’no caching’. This is the default level.

’1’ means ’cache per run’. This caches stemming results during a single
call to ’stem’.

’2’ means ’cache indefinitely’. This caches stemming results until
either the process exits or the ’clear_stem_cache’ method is called.

clear_stem_cache;

Clears the cache of stemmed words.
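The difference between the three caching levels can be pictured with a small self-contained model. This is not the module's implementation: toy_stem, toy_stem_word, toy_stem_caching and toy_clear_cache are hypothetical stand-ins, and the "stemmer" merely strips a trailing s so that the amount of real work done at each level can be counted.

```perl
use strict;
use warnings;

my %cache;            # persistent cache, used only at level 2
my $cache_level = 0;  # 0 = none, 1 = per call, 2 = until cleared
my $stem_calls  = 0;  # counts real stemming work, for demonstration

# Hypothetical stand-in for the real stemmer: strips a trailing "s".
sub toy_stem_word    { my ($w) = @_; $stem_calls++; $w =~ s/s$//; return $w }
sub toy_stem_caching { my ($level) = @_; $cache_level = $level }
sub toy_clear_cache  { %cache = () }

sub toy_stem {
    my @words = @_;
    my %run_cache;    # discarded when this call returns (level 1)
    my $store = $cache_level == 2 ? \%cache
              : $cache_level == 1 ? \%run_cache
              :                     undef;
    my @out;
    for my $w (@words) {
        if ($store) {
            $store->{$w} = toy_stem_word($w) unless exists $store->{$w};
            push @out, $store->{$w};
        } else {
            push @out, toy_stem_word($w);
        }
    }
    return \@out;
}

my @input = qw(cats cats dogs);
for my $level (0, 1, 2) {
    toy_clear_cache();
    toy_stem_caching($level);
    $stem_calls = 0;
    toy_stem(@input) for 1 .. 2;    # two separate calls
    print "level $level: $stem_calls stemming operations\n";
}
# level 0: 6, level 1: 4, level 2: 2
```

In this model, level 1 only pays off when a single call contains repeated words, while level 2 trades memory for speed across calls until the cache is cleared.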


This code is almost entirely derived from the Porter 2.1 module written by Jim Richardson.




  Jim Richardson, University of Sydney

  Integration in Lingua::Stem by
  Benjamin Franz, FreeRun Technologies


This code is freely available under the same terms as Perl.



