NAME

HTML::ContentExtractor - extract the main content from a web page by analysing the DOM tree

VERSION

Version 0.03

SYNOPSIS

    use LWP::UserAgent;
    use HTML::ContentExtractor;

    my $extractor = HTML::ContentExtractor->new();
    my $agent     = LWP::UserAgent->new;
    my $url       = 'http://sports.sina.com.cn/g/2007-03-23/16572821174.shtml';
    my $res       = $agent->get($url);
    my $HTML      = $res->decoded_content();

    $extractor->extract($url, $HTML);
    print $extractor->as_html();
    print $extractor->as_text();

DESCRIPTION

Web pages often contain clutter (such as ads, unnecessary images, and extraneous links) around the body of an article that distracts the user from the actual content. This module reduces that noise and thereby identifies the content-rich regions of a page.

A web page is first parsed by an HTML parser, which corrects the markup and builds a DOM (Document Object Model) tree. The tree is then walked in a depth-first traversal, during which noise nodes are identified and removed, leaving the main content. Three kinds of nodes are pruned (a sketch of the ratio heuristic follows below):

    * Useless nodes (script, style, etc.) are removed.

    * Container nodes (table, div, etc.) whose link/text ratio exceeds a
      threshold are removed. The link/text ratio is the ratio of the number
      of links to the number of non-linked words.

    * Nodes that contain any string from a predefined spam-string list are
      removed.

Please note that the input HTML should be encoded in UTF-8 (as should the spam words), so the module can handle web pages in any language (I've used it to process English, Chinese, and Japanese web pages).
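The link/text ratio is the heart of the pruning step. The following standalone sketch illustrates the idea using HTML::TreeBuilder; the function names, the 0.5 threshold, and the STDIN input are illustrative assumptions and are not part of this module's API or implementation.

    # A minimal, self-contained sketch of link/text-ratio pruning, written
    # against HTML::TreeBuilder. All names and the 0.5 threshold are
    # assumptions for illustration; none of this is HTML::ContentExtractor's
    # actual code.
    use strict;
    use warnings;
    use HTML::TreeBuilder;

    sub word_count {
        my ($text) = @_;
        my @words = split ' ', $text;   # awk-style split on whitespace
        return scalar @words;
    }

    # Ratio of link count to non-linked word count inside a container node;
    # a high value suggests a navigation bar or ad block rather than prose.
    sub link_text_ratio {
        my ($node) = @_;
        my @links        = $node->look_down(_tag => 'a');
        my $linked_words = 0;
        $linked_words += word_count($_->as_text) for @links;
        my $non_linked = word_count($node->as_text) - $linked_words;
        return scalar(@links) / ($non_linked || 1);   # avoid division by zero
    }

    my $html = do { local $/; <STDIN> };              # page source on STDIN
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # Collect first, delete afterwards, so we never call methods on a node
    # whose ancestor has already been destroyed. Nested containers are not
    # treated specially in this sketch.
    my @prune = grep { link_text_ratio($_) > 0.5 }
                $tree->look_down(sub { $_[0]->tag eq 'table'
                                    or $_[0]->tag eq 'div' });
    $_->delete for @prune;

    print $tree->as_HTML;
    $tree->delete;   # HTML::Element trees must be freed explicitly

A container full of navigation links scores far above 0.5 and is pruned, while an article body with occasional inline links scores well below it; the real module applies its own threshold and node types.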
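Because the module expects UTF-8 input, pages served in other encodings should be converted before extraction. A minimal sketch using the core Encode module; whether extract() wants UTF-8 octets or Perl character strings is an assumption here, so check the module source if results look wrong:

    use Encode qw(encode);
    use LWP::UserAgent;
    use HTML::ContentExtractor;

    my $agent = LWP::UserAgent->new;
    my $res   = $agent->get('http://sports.sina.com.cn/g/2007-03-23/16572821174.shtml');

    # decoded_content() yields Perl characters regardless of the page's
    # original charset; re-encode them as UTF-8 octets for the extractor.
    my $utf8_html = encode('UTF-8', $res->decoded_content());

    my $extractor = HTML::ContentExtractor->new();
    $extractor->extract($res->base, $utf8_html);
    print $extractor->as_text();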
AUTHOR

Zhang Jun, "<jzhang533 at gmail.com>"

COPYRIGHT & LICENSE

Copyright 2007 Zhang Jun, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.