NAME

HTML::ExtractContent - An HTML content extractor with scoring heuristics

SYNOPSIS

    use HTML::ExtractContent;
    use LWP::UserAgent;

    # Fetch a page
    my $agent = LWP::UserAgent->new;
    my $res = $agent->get('http://www.example.com/');

    # Extract the main content and print it as plain text
    my $extractor = HTML::ExtractContent->new;
    $extractor->extract($res->decoded_content);
    print $extractor->as_text;

DESCRIPTION

HTML::ExtractContent is a module for extracting content from HTML with scoring heuristics. It guesses which block of HTML looks like content, using scores that depend on the number of punctuation marks and the length of the non-tag text. It also guesses whether the content ends in the current block or continues into the next one.

METHODS
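The scoring idea in the DESCRIPTION can be illustrated with a minimal, self-contained sketch. It is not the module's actual implementation: the helper names `score_block` and `best_block`, the formula (text length multiplied by one plus the punctuation count), and the sample blocks are all assumptions made for illustration. The guess about whether content continues into the next block is left out.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::ExtractContent): score one HTML block.
# Assumed formula: (length of text outside tags) * (1 + number of punctuation marks).
sub score_block {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>//g;      # drop tags, keep the text between them
    $text =~ s/\s+/ /g;                      # collapse runs of whitespace
    $text =~ s/^ | $//g;                     # trim leading and trailing space
    my $punct = () = $text =~ /[.,!?;:]/g;   # count punctuation marks
    return length($text) * (1 + $punct);
}

# Hypothetical helper: return the block with the highest score.
sub best_block {
    my @blocks = @_;
    my ($best, $best_score) = (undef, -1);
    for my $block (@blocks) {
        my $score = score_block($block);
        ($best, $best_score) = ($block, $score) if $score > $best_score;
    }
    return $best;
}

# Sample input, invented for this sketch: a navigation bar, a paragraph, a footer.
my @blocks = (
    '<div><a href="/">Home</a> <a href="/about">About</a></div>',
    '<div><p>Perl is a general-purpose language. It is often used for text processing, and it has strong regex support.</p></div>',
    '<div>Copyright 2008</div>',
);

print best_block(@blocks), "\n";
```

Under these assumptions the paragraph block wins, because link-only navigation and short footers carry little text and almost no punctuation.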
ACKNOWLEDGMENT

Hiromichi Kishi contributed to the development of this module as a pair-programming partner.

The implementation of this module is based on the Ruby module ExtractContent by Nakatani Shuyo.

AUTHOR

INA Lintaro <tarao at cpan.org>

COPYRIGHT

Copyright (C) 2008 INA Lintaro / Hatena. All rights reserved.

Copyright of the original implementation

Copyright (c) 2007/2008 Nakatani Shuyo / Cybozu Labs Inc. All rights reserved.

LICENCE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

<http://rubyforge.org/projects/extractcontent/>