HTML::Summary - generate a summary from a web page
use HTML::Summary;
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new;
$tree->parse( $document );
$tree->eof;    # tell the parser the document is complete
my $summarizer = HTML::Summary->new(
LENGTH => 200,
USE_META => 1,
);
my $summary = $summarizer->generate( $tree );
$summarizer->option( 'USE_META' => 1 );
my $length = $summarizer->option( 'LENGTH' );
if ( $summarizer->meta_used() ) {
# do something
}
The "HTML::Summary" module produces summaries from the textual content
of web pages. It does so using the location heuristic, which determines the
value of a given sentence based on its position and status within the
document; for example, headings, section titles and opening paragraph
sentences may be favoured over other textual content. A LENGTH option can be
used to restrict the length of the summary produced.
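The location heuristic described above can be sketched in plain Perl. This is only an illustration, not the module's actual implementation: the weights in %TYPE_WEIGHT, the position decay, and the helper names score_sentence() and summarize() are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical weights, chosen for illustration only (not taken from HTML::Summary).
my %TYPE_WEIGHT = ( heading => 3.0, first_sentence => 2.0, body => 1.0 );

# Score a sentence from its status (type) and its position in the document.
sub score_sentence {
    my ( $type, $index ) = @_;
    my $weight = $TYPE_WEIGHT{$type} // $TYPE_WEIGHT{body};
    return $weight / ( 1 + $index );    # earlier sentences score higher
}

# Greedily keep the best-scoring sentences that fit in $max_length,
# then return them joined in their original document order.
sub summarize {
    my ( $sentences, $max_length ) = @_;

    my @scored;
    for my $i ( 0 .. $#{$sentences} ) {
        my $s = $sentences->[$i];
        push @scored, {
            index => $i,
            text  => $s->{text},
            score => score_sentence( $s->{type}, $i ),
        };
    }

    my @chosen;
    my $used = 0;
    for my $s ( sort { $b->{score} <=> $a->{score} || $a->{index} <=> $b->{index} } @scored ) {
        # length() counts characters; for ASCII text this equals bytes.
        my $cost = length( $s->{text} ) + ( @chosen ? 1 : 0 );    # +1 for the joining space
        next if $used + $cost > $max_length;
        push @chosen, $s;
        $used += $cost;
    }

    return join ' ', map { $_->{text} } sort { $a->{index} <=> $b->{index} } @chosen;
}

my @sentences = (
    { type => 'heading',        text => 'Perl Modules' },
    { type => 'first_sentence', text => 'CPAN hosts reusable code.' },
    { type => 'body',           text => 'Some of it is very old.' },
    { type => 'heading',        text => 'Installing' },
);

print summarize( \@sentences, 50 ), "\n";
```

With a limit of 50, the sketch keeps both headings and the opening sentence and drops the lower-scoring body sentence, which would not fit.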
The constructor, new(), accepts the following attributes:
- VERBOSE
- Generate verbose messages to STDERR.
- LENGTH
- Maximum length of summary (in bytes). Default is 500.
- USE_META
- Flag telling the summarizer whether to use the content of the
"<META>" tag in the page header, if one is present,
instead of generating a summary from the body text. Note that
setting USE_META overrides LENGTH: the summary provided by the
"<META>" tag is returned in full, even if it is longer than
LENGTH bytes. Default is 0 (no).
my $summarizer = HTML::Summary->new(LENGTH => 200);
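The precedence between USE_META and LENGTH can be illustrated with a small stand-alone sketch. The helper pick_summary() and its meta and body arguments are hypothetical and not part of HTML::Summary; plain truncation with substr() stands in for the module's real sentence selection.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Summary).
# Returns ( $summary, $meta_used ).
sub pick_summary {
    my (%args) = @_;
    my $length = $args{LENGTH} // 500;    # documented default

    # USE_META wins: a non-empty META description is returned in full,
    # even when it is longer than LENGTH.
    if ( $args{USE_META} && defined $args{meta} && length $args{meta} ) {
        return ( $args{meta}, 1 );
    }

    # Otherwise fall back to the body text, limited to LENGTH.
    my $body = $args{body} // '';
    return ( substr( $body, 0, $length ), 0 );
}

my ( $summary, $meta_used ) = pick_summary(
    USE_META => 1,
    LENGTH   => 10,
    meta     => 'A description longer than ten bytes.',
    body     => 'Body text of the page.',
);

print "$summary\n";
print "META used: $meta_used\n";
```

Here the META text is returned untruncated despite LENGTH being 10; with USE_META set to 0, or with no META text available, the sketch falls back to the length-limited body text.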
The option() method gets or sets HTML::Summary configuration options.
my $length = $summarizer->option( 'LENGTH' );
$summarizer->option( 'USE_META' => 1 );
The generate() method takes an HTML::Element object and generates a summary from it.
my $tree = HTML::TreeBuilder->new;
$tree->parse( $document );
$tree->eof;    # tell the parser the document is complete
my $summary = $summarizer->generate( $tree );
The meta_used() method returns 1 if the META tag description was used to generate the summary.
if ( $summarizer->meta_used() ) {
# do something ...
}
See also: HTML::TreeBuilder, Text::Sentence, Lingua::JA::Jcode, Lingua::JA::Jtruncate.
<https://github.com/neilb/HTML-Summary>
This module was originally whipped up by Neil Bowers and Tony Rose. It was then
developed and maintained by Ave Wrigley and Tony Rose.
Neil Bowers is currently maintaining the HTML-Summary distribution.
Neil Bowers <neilb@cpan.org>
Copyright (c) 1997 Canon Research Centre Europe (CRE). All rights reserved.
This is free software; you can redistribute it and/or modify it under the same
terms as the Perl 5 programming language system itself.