Manual Reference Pages - LINGUA::EN::SUMMARIZE (3)
Lingua::EN::Summarize - A simple tool for summarizing bodies of English text.
use Lingua::EN::Summarize;
my $summary = summarize( $text ); # Easy, no? :-)
my $summary = summarize( $text, maxlength => 500 ); # 500-byte summary
my $summary = summarize( $text, filter => 'html' ); # Strip HTML formatting
my $summary = summarize( $text, wrap => 75 ); # Wrap output to 75 col.
This is a simple module which makes an unscientific effort at
summarizing English text. It recognizes simple patterns which look like
statements, abridges them, and concatenates them into something vaguely
resembling a summary. It needs more work on large bodies of text, but
it seems to have a decent effect on small inputs at the moment.
Lingua::EN::Summarize exports one function, summarize(), which takes
the text to summarize as its first argument, and any number of optional
directives in name => value form. The options it'll take are:
maxlength: Specifies the maximum length, in bytes, of the generated
summary.
wrap: Pretty-prints the summary output by wrapping it to the number of
columns which you specify.
filter: Passes the text through a filter before handing it to the
summarizer. Currently, only two filters are implemented: "html", which
uses HTML::TreeBuilder and HTML::FormatText to strip all HTML formatting
from a document, and "easyhtml", which quickly (and less accurately)
strips all HTML from a document using a simple regular expression, if
you don't have the above-mentioned modules. An "email" filter, for
converting mail and news messages to easily-summarizable text, is in the
works for the next version.
Unlike the HTML::Summarize module (which is quite interesting, and worth
a look), this module considers its input to be plain English text, and
doesn't try to gather any information from the formatting. Thus, without
any cues from the document's format, the scheme that HTML::Summarize
uses isn't applicable here. The current scheme goes something like this:
"Filter the text according to the user's filter option. Split the text
into discrete sentences with the Text::Sentence module, then further
split them into clauses on commas and semicolons. Keep only the ones
that have a (subject very-simple-verb object) structure. Construct the
summary out of the first sentences in the list, staying within the
maxlength limit, or under 30% of the size of the original text,
whichever is smaller."
Needless to say, this is a very simple and not terribly universally
effective scheme, but it's good enough for a first draft, and I'll bang
on it more later. Like I said, it's not a scientific approach to the
problem, but it's better than nothing (and often better than
HTML::Summarize!), and I don't really need A.I. quality output from it.
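To make the quoted scheme a little more concrete, here is a rough,
self-contained sketch of the same steps using only core Perl. It is not
the module's actual code. The names summarize_sketch, strip_html_naive
and split_sentences_naive are hypothetical, the tag-stripping regex only
stands in for the "easyhtml" filter, a naive regex split stands in for
Text::Sentence, the list of "very simple" verbs is a guess, and Text::Wrap
is assumed for the wrap option.

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

# Assumption: a "very simple verb" is one of these; the real module's
# pattern is not documented here.
my $SVO = qr/^(?:\w+\s+){1,3}(?:is|are|was|were|has|have|had)\s+\S/i;

# Hypothetical stand-in for the "easyhtml" filter: drop anything that
# looks like a tag. Quick, and easily fooled by "<" or ">" in text.
sub strip_html_naive {
    my ($text) = @_;
    $text =~ s/<[^>]*>//g;
    return $text;
}

# Hypothetical stand-in for Text::Sentence: split after ., ! or ?.
sub split_sentences_naive {
    my ($text) = @_;
    $text =~ s/^\s+|\s+$//g;
    return grep { length($_) } split /(?<=[.!?])\s+/, $text;
}

sub summarize_sketch {
    my ($text, %opt) = @_;

    # Step 1: filter the text according to the filter option.
    $text = strip_html_naive($text)
        if defined $opt{filter} && $opt{filter} eq 'easyhtml';

    # The budget is maxlength or 30% of the text, whichever is smaller.
    # Assumption: the 30% is measured after filtering.
    my $budget = int(0.3 * length($text));
    $budget = $opt{maxlength}
        if defined $opt{maxlength} && $opt{maxlength} < $budget;

    # Steps 2 and 3: split into sentences, then into clauses on commas
    # and semicolons, keeping only clauses that look like statements.
    my @kept;
    for my $sentence (split_sentences_naive($text)) {
        for my $clause (split /\s*[,;]\s*/, $sentence) {
            $clause =~ s/[.!?]+$//;
            push @kept, ucfirst($clause) . '.' if $clause =~ $SVO;
        }
    }

    # Step 4: take statements from the front until the budget runs out.
    my $summary = '';
    for my $statement (@kept) {
        my $candidate = length($summary) ? "$summary $statement" : $statement;
        last if length($candidate) > $budget;
        $summary = $candidate;
    }

    # Optional: wrap the result to the requested number of columns.
    if ($opt{wrap}) {
        local $Text::Wrap::columns = $opt{wrap};
        $summary = wrap('', '', $summary);
    }
    return $summary;
}

my $sample = "The sky is blue, or so people say. "
           . "Running quickly through the forest! "
           . "Cats are small animals; dogs bark. "
           . ("Filler words pad the input so that the thirty percent "
              . "budget is roomy enough. ") x 3;

print summarize_sketch($sample), "\n";
print summarize_sketch($sample, maxlength => 20), "\n";
```

Clauses such as "or so people say" and "dogs bark" are dropped because
they don't match the subject, simple verb, object pattern, which is also
why a scheme like this tends to miss a lot on larger or more varied text.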
Dennis Taylor, <email@example.com>
perl v5.20.3    SUMMARIZE (3)    2001-02-20