Manual Reference Pages  -  HTML::EXTRACTMAIN (3)

NAME

HTML::ExtractMain - Extract the main content of a web page

VERSION

Version 0.62

SYNOPSIS



    use HTML::ExtractMain qw( extract_main_html );

    my $html = <<END;
    <div id="header">Header</div>
    <div id="nav"><a href="/">Home</a></div>
    <div id="body">
        <p>Foo</p>
        <p>Baz</p>
    </div>
    <div id="footer">Footer</div>
    END

    my $main_html = extract_main_html($html);
    if (defined $main_html) {
        # do something with $main_html here
        # $main_html is <div id="body"><p>Foo</p><p>Baz</p></div>
    }



EXPORT

extract_main_html is exported on request; nothing is exported by default.
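
As a small illustration of what optional export means in practice, the sketch below loads the module with an empty import list and calls the function by its fully qualified name. It assumes the function is an ordinary subroutine in the HTML::ExtractMain package, which is the usual arrangement for Exporter-based modules; the sample HTML is made up for the example.

```perl
use strict;
use warnings;

# Load the module without importing anything into this namespace.
use HTML::ExtractMain ();

# Illustrative input only.
my $html = '<div id="nav"><a href="/">Home</a></div>'
         . '<div id="body"><p>Foo</p><p>Baz</p></div>';

# With no import list, the function is reachable only by its full name.
my $main_html = HTML::ExtractMain::extract_main_html($html);

print defined $main_html ? "found main content\n" : "nothing found\n";
```

Requesting the name explicitly, as in the SYNOPSIS, lets you call extract_main_html without the package prefix.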

FUNCTIONS

    extract_main_html

extract_main_html takes HTML content and uses the Readability algorithm to detect the main body of the page, usually skipping headers, footers, navigation, etc.

It takes a single argument, either an HTML string, or an HTML::TreeBuilder tree. (If passed a tree, the tree will be modified and destroyed.)

If the HTML’s main content is found, it’s returned as an XHTML snippet. The returned HTML will not look like what you put in. (Source formatting, e.g. indentation, will be removed, and you may get back XHTML when you put in HTML.)

If no most relevant block of content is found, extract_main_html returns undef.
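
The two points above (a tree argument is consumed, and the result may be undef) can be sketched as follows. The helper name main_or_fallback and the sample HTML are hypothetical and exist only for this example; HTML::TreeBuilder's new_from_content constructor is used to build the tree.

```perl
use strict;
use warnings;
use HTML::ExtractMain qw( extract_main_html );
use HTML::TreeBuilder;

# Hypothetical helper: return the extracted main content,
# or $fallback when no main block is detected.
sub main_or_fallback {
    my ($html, $fallback) = @_;

    # Build the tree ourselves. extract_main_html modifies and
    # destroys it, so $tree must not be reused after the call.
    my $tree = HTML::TreeBuilder->new_from_content($html);

    my $main_html = extract_main_html($tree);

    # undef signals that no most relevant block was found.
    return $main_html // $fallback;
}

# Illustrative input only.
my $html = '<html><body><div id="header">Header</div>'
         . '<div id="body"><p>Foo</p><p>Baz</p></div>'
         . '<div id="footer">Footer</div></body></html>';

print main_or_fallback($html, $html), "\n";
```

Falling back to the original HTML is only one possible policy; a caller could just as well skip the page or log a warning when undef comes back.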

AUTHOR

Anirvan Chatterjee, <anirvan at cpan.org>

BUGS

Please report any bugs or feature requests to bug-html-extractmain at rt.cpan.org, or through the web interface at <http://rt.cpan.org/NoAuth/ReportBug.html?Queue=HTML-ExtractMain>. I will be notified, and then you’ll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.



    perldoc HTML::ExtractMain



You can also look for information at:
o RT: CPAN’s request tracker

<http://rt.cpan.org/NoAuth/Bugs.html?Dist=HTML-ExtractMain>

o AnnoCPAN: Annotated CPAN documentation

<http://annocpan.org/dist/HTML-ExtractMain>

o CPAN Ratings

<http://cpanratings.perl.org/d/HTML-ExtractMain>

o Search CPAN

<http://search.cpan.org/dist/HTML-ExtractMain/>

SEE ALSO

o HTML::Feature
o HTML::ExtractContent

ACKNOWLEDGEMENTS

The Readability algorithm is ported from Arc90’s JavaScript original, built as part of the excellent Readability application, online at <http://lab.arc90.com/experiments/readability/>, repository at <http://code.google.com/p/arc90labs-readability/>.

COPYRIGHT & LICENSE

Copyright 2009-2010 Anirvan Chatterjee, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

perl v5.20.3 HTML::EXTRACTMAIN (3) 2010-12-15
