Manual Reference Pages  -  HTML::EXTRACTCONTENT (3)


NAME

HTML::ExtractContent - An HTML content extractor with scoring heuristics


SYNOPSIS



 use HTML::ExtractContent;
 use LWP::UserAgent;

 # Fetch the page; the URL must be passed as a quoted string.
 my $agent = LWP::UserAgent->new;
 my $res = $agent->get('http://www.example.com/');

 # decoded_content returns a decoded character string.
 my $extractor = HTML::ExtractContent->new;
 $extractor->extract($res->decoded_content);
 print $extractor->as_text;



DESCRIPTION

HTML::ExtractContent is a module for extracting content from HTML using scoring heuristics. It guesses which block of HTML looks like content by scoring each block on the number of punctuation marks and the length of the non-tag text it contains. It also guesses whether the content ends in that block or continues into the next one.
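
To make the idea of scoring concrete, here is a toy sketch that ranks blocks by non-tag text length and punctuation count. The function name toy_score and its formula are assumptions made for illustration only; they are not the module's actual scoring rules, which this page does not spell out.

```perl
use strict;
use warnings;

# Toy score for one block of HTML: text length weighted by punctuation count.
# This is NOT the module's real formula, only an illustration of the idea.
sub toy_score {
    my ($block) = @_;
    (my $text = $block) =~ s/<[^>]*>//g;      # drop tags, keep non-tag text
    $text =~ s/\s+/ /g;                        # collapse runs of whitespace
    my $punct = () = $text =~ /[.,!?;:]/g;     # count punctuation marks
    return length($text) * (1 + $punct);
}

my @blocks = (
    '<div><a href="/">Home</a> <a href="/about">About</a></div>',
    '<div><p>This is the article body. It has sentences, commas, and more.</p></div>',
);

# Pick the block with the highest toy score.
my ($best) = sort { toy_score($b) <=> toy_score($a) } @blocks;
print "$best\n";
```

In this sketch the navigation-like block scores low because it has little text and no punctuation, while the prose block scores high.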

METHODS

new


 $extractor = HTML::ExtractContent->new;



Creates a new HTML::ExtractContent instance.

extract


 $extractor->extract($html);



Extracts content from $html. $html must have its UTF-8 flag on.
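
If the HTML arrives as raw octets rather than through something like decoded_content, it probably needs decoding first. The following is a minimal sketch using the core Encode module; the sample bytes are made up, and it assumes the input is UTF-8 encoded.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw octets as they might arrive from a socket or a file read without layers.
my $bytes = "<p>caf\xC3\xA9</p>";

# decode() turns UTF-8 octets into a Perl character string.
my $html = decode('UTF-8', $bytes);

printf "octets: %d, characters: %d\n", length($bytes), length($html);
print "UTF-8 flag is ", (utf8::is_utf8($html) ? "on" : "off"), "\n";

# $html could now be passed to $extractor->extract($html).
```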

as_text


 $extractor->extract($html)->as_text;



Returns the extracted content as plain text, with all tags removed.

as_html


 $extractor->extract($html)->as_html;



Returns the extracted content as HTML text. Note that the returned text is neither fully tagged nor valid HTML: it does not contain tags such as <html>, and it may have block tags that are opened but not closed, or closed but not opened. This method is intended for cases where you need to analyse the markup, for example the link tags in the text.
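
As an illustration of analysing link tags in such a fragment, the sketch below pulls href values out with a regular expression. The helper fragment_links and the sample fragment are hypothetical; a regex only handles quoted href attributes, so a real HTML parser may be the better choice for anything beyond a quick look.

```perl
use strict;
use warnings;

# Collect href values from <a> tags in a possibly unbalanced HTML fragment.
# A regex is enough for a sketch; it only handles quoted href attributes.
sub fragment_links {
    my ($fragment) = @_;
    my @links;
    while ($fragment =~ /<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1/gis) {
        push @links, $2;
    }
    return @links;
}

# Hypothetical fragment of the kind as_html might return: no <html>,
# and a </div> with no matching opening tag.
my $fragment = '<p>See <a href="http://www.example.com/a">A</a> and '
             . "<a class='x' href='/b'>B</a>.</p></div>";

print "$_\n" for fragment_links($fragment);
```

With the module installed, the same helper could presumably be applied to the result of $extractor->extract($html)->as_html.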

ACKNOWLEDGMENT

Hiromichi Kishi contributed to the development of this module as a pair-programming partner.

The implementation of this module is based on the Ruby module ExtractContent by Nakatani Shuyo.

AUTHOR

INA Lintaro <tarao at cpan.org>

COPYRIGHT

Copyright (C) 2008 INA Lintaro / Hatena. All rights reserved.

    Copyright of the original implementation

Copyright (c) 2007/2008 Nakatani Shuyo / Cybozu Labs Inc. All rights reserved.

LICENCE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

<http://rubyforge.org/projects/extractcontent/>


perl v5.20.3 HTML::EXTRACTCONTENT (3) 2015-03-10
