Manual Reference Pages  -  HTML::CONTENTEXTRACTOR (3)


HTML::ContentExtractor - extract the main content from a web page by analyzing the DOM tree!



Version 0.03


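    use HTML::ContentExtractor;
    use LWP::UserAgent;

    my $extractor = HTML::ContentExtractor->new();
    my $agent = LWP::UserAgent->new;

    my $url = '...';    # URL of the page to process
    my $res = $agent->get($url);
    my $HTML = $res->decoded_content();

    # Run the extraction before reading the results.
    $extractor->extract($url, $HTML);
    print $extractor->as_html();
    print $extractor->as_text();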


Web pages often contain clutter (such as ads, unnecessary images and extraneous links) around the body of an article that distracts a user from the actual content. This module reduces the noise content in web pages and thus identifies the content-rich regions.

A web page is first parsed by an HTML parser, which corrects the markup and creates a DOM (Document Object Model) tree. The DOM tree is then traversed depth-first, and noise nodes are identified and removed, leaving the main content. Three kinds of nodes are removed: useless nodes (script, style, etc.); container nodes (table, div, etc.) whose link/text ratio is higher than the threshold, where the link/text ratio is the ratio of the number of links to the number of non-linked words; and nodes that contain any string in the predefined spam string list.
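The link/text ratio test described above can be illustrated with a minimal sketch. It is not the module's implementation: it applies regular expressions to an HTML fragment instead of walking a DOM tree, and the helper name `link_text_ratio`, the sample fragments, and the choice to count an empty text as one word are all assumptions made for illustration. Only the 0.05 threshold is taken from this page, where it is the documented default.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::ContentExtractor): number of links
# divided by the number of words that are not inside a link.
sub link_text_ratio {
    my ($fragment) = @_;

    # Count opening <a ...> tags.
    my $links = () = $fragment =~ /<a\b[^>]*>/gi;

    # Drop linked text, then any remaining tags, and count the words left.
    (my $unlinked = $fragment) =~ s{<a\b[^>]*>.*?</a>}{ }gis;
    $unlinked =~ s/<[^>]+>/ /g;
    my @words = split ' ', $unlinked;

    # Assumption: treat "no unlinked words" as one word to avoid dividing by zero.
    return $links / (scalar(@words) || 1);
}

my $threshold = 0.05;    # documented default of link_text_ratio()

# Made-up sample fragments: a navigation bar and an article-like block.
my %fragments = (
    navigation => '<div><a href="/a">Home</a> | <a href="/b">News</a></div>',
    article    => '<div>' . join(' ', ('word') x 40) . ' <a href="/x">more</a></div>',
);

for my $name (sort keys %fragments) {
    my $ratio = link_text_ratio($fragments{$name});
    printf "%-10s ratio=%.3f %s\n", $name, $ratio,
        $ratio > $threshold ? 'remove' : 'keep';
}
```

A navigation bar is nearly all links, so its ratio is far above the threshold, while a block of running text with a single link stays below it.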

Note that the input HTML must be encoded in UTF-8 (as must the spam words); this lets the module handle web pages in any language (I’ve used it to process English, Chinese, and Japanese web pages).
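If a page arrives in another character set, it can be converted with the core Encode module before extraction. The sketch below is only an illustration: the helper name `to_utf8` and the sample string are assumptions. This page does not say whether `extract` expects UTF-8 octets or an already decoded character string (the synopsis passes the result of `decoded_content()`), so check which form works for your input.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: re-encode raw page bytes from a known charset to UTF-8.
sub to_utf8 {
    my ($octets, $charset) = @_;
    my $chars = decode($charset, $octets);    # bytes -> Perl characters
    return encode('UTF-8', $chars);           # characters -> UTF-8 bytes
}

# Made-up sample: "café" encoded as ISO-8859-1 (one byte for the e-acute).
my $latin1 = "<p>caf\xE9</p>";
my $utf8   = to_utf8($latin1, 'ISO-8859-1');

printf "%d bytes in, %d bytes out\n", length($latin1), length($utf8);
```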
$e = HTML::ContentExtractor->new(%options);
    Constructs a new HTML::ContentExtractor object. The optional %options hash can be used to set the options listed below.

$e->table_tags(\@tags);
    Gets or sets the table tags array. These tags are treated as container tags.

$e->ignore_tags(\@tags);
    Gets or sets the ignore tags array. Elements with these tags will be removed.

$e->spam_words(\@strings);
    Gets or sets the spam words list. Elements that contain any of these strings will be removed.

$e->link_text_ratio($ratio);
    Gets or sets the link/text ratio threshold. The default is 0.05.

$e->min_text_len($len);
    Gets or sets the minimum text length. The default is 20. If the length of an element's text is less than this value, the element will be removed.

$e->extract($url,$HTML);
    Performs the extraction. Note that the input $HTML must be encoded in UTF-8.

$e->as_html();
    Returns the extraction result in HTML format.

$e->as_text();
    Returns the extraction result in text format.
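The accessors above can be combined to tune an extractor before running it. The following is a sketch that relies only on the calls documented on this page; the sample HTML, the spam string, the threshold values, and the `http://example.com/article.html` address are made up for illustration, and the output depends on the module's internal rules, so none is shown.

```perl
use strict;
use warnings;
use HTML::ContentExtractor;

# Made-up sample page: a navigation bar, an article paragraph, and an ad block.
my $html = <<'HTML';
<html><body>
<div><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About</a></div>
<div><p>This paragraph stands in for the body of an article and is long enough
to pass the minimum text length set below.</p></div>
<div>Advertisement</div>
</body></html>
HTML

my $e = HTML::ContentExtractor->new();

# Illustrative settings, not recommendations.
$e->spam_words(['Advertisement']);    # remove elements containing this string
$e->link_text_ratio(0.1);             # allow slightly more links than the 0.05 default
$e->min_text_len(30);                 # remove elements with fewer than 30 characters of text

# The first argument is a placeholder URL for the sample page.
$e->extract('http://example.com/article.html', $html);

my $text = $e->as_text();
print $text;
```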


Zhang Jun, <jzhang533 at>


Copyright 2007 Zhang Jun, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

perl v5.20.3 HTML::CONTENTEXTRACTOR (3) 2007-06-23
