
Manual Reference Pages  -  HTML::LINKEXTRACTOR (3)



NAME

HTML::LinkExtractor - Extract links from an HTML document



DESCRIPTION

HTML::LinkExtractor is used for extracting links from HTML. It is very similar to HTML::LinkExtor, except that besides getting the URL, you also get the link-text.

Example (please run the examples):

    use HTML::LinkExtractor;
    use Data::Dumper;

    my $input = q{If <a href=""> I am a LINK!!! </a>};
    my $LX = new HTML::LinkExtractor();

    $LX->parse(\$input);

    print Dumper($LX->links);
    __END__
    # the above example will yield
    $VAR1 = [
              {
                '_TEXT' => '<a href=""> I am a LINK!!! </a>',
                'href' => bless(do{\(my $o = '')}, 'URI::http'),
                'tag' => 'a'
              }
            ];

HTML::LinkExtractor will also correctly extract nested link-type tags.


    ## the demo
    perl file.html otherfile.html

    ## or if the module is installed, but you don't know where

    perl -MHTML::LinkExtractor -e" system $^X, $INC{q{HTML/}} "
    perl -MHTML::LinkExtractor -e' system $^X, $INC{q{HTML/}} '

    ## or

    use HTML::LinkExtractor;
    use LWP::Simple qw( get );    # get() is exported by LWP::Simple, not LWP

    my $base = '';                # the base URL of the site goes here
    my $html = get($base.'/recent');
    my $LX = new HTML::LinkExtractor();

    $LX->parse(\$html);

    print qq{<base href="$base">\n};

    for my $Link( @{ $LX->links } ) {
    ## new modules are linked  by /author/NAME/Dist
        if( $$Link{href}=~ m{^\/author\/\w+} ) {
            print $$Link{_TEXT}."\n";
        }
    }

    undef $LX;

    ## or

    use HTML::LinkExtractor;
    use Data::Dumper;

    my $input = q{If <a href=""> I am a LINK!!! </a>};
    my $LX = new HTML::LinkExtractor(
        sub {
            ## called once per link with ( $LX, $linkHashref )
            print Data::Dumper::Dumper(@_);
        },
    );

    $LX->parse(\$input);

    #### Calculate the total size of a web-page
    #### adds up the sizes of all the images and stylesheets and stuff

    use strict;
    use LWP::Simple;    # exports get() and head()
    use HTML::LinkExtractor;
    my $url  = shift or die "usage: $0 URL\n";
    my $html = get($url);
    my $Total = length $html;
    print "initial size $Total\n";
    my $LX = new HTML::LinkExtractor(
        sub {
            my( $X, $tag ) = @_;
            ## skip tags like <a>, which link to other pages
            unless( grep {$_ eq $tag->{tag} } @HTML::LinkExtractor::TAGS_IN_NEED ) {
                print "$$tag{tag}\n";
                for my $urlAttr ( @{$HTML::LinkExtractor::TAGS{$$tag{tag}}} ) {
                    if( exists $$tag{$urlAttr} ) {
                        ## head() returns the document length as its 2nd value
                        my $size = (head( $$tag{$urlAttr} ))[1];
                        $Total += $size if $size;
                        print "adding $size\n" if $size;
                    }
                }
            }
        },
        $url,    # base URL, so relative links become absolute before head()
    );

    $LX->parse(\$html);

    print "The total size of \n$url\n is $Total bytes\n";


METHODS

$LX->new([\&callback, [$baseUrl, [1]]])

Accepts 3 arguments, all of which are optional. If for example you want to pass a $baseUrl, but don’t want to have a callback invoked, just put undef in place of a subref.

This is the only class method.

1. a callback (a sub reference, as in sub{}, or \&sub) which is to be called each time a new LINK is encountered (for @HTML::LinkExtractor::TAGS_IN_NEED this means after the closing tag is encountered).

The callback receives an object reference ($LX) and a link hashref.

2. a base URL (passed to URI->new, so it’s up to you to make sure it’s valid), which is used to convert all relative URIs to absolute ones.

    $ALink{href} = URI->new_abs( $ALink{href}, $base );

3. A boolean (just stick with 1). See the example in DESCRIPTION. Normally, you’d get back _TEXT that looks like

    _TEXT => '<a href=""> I am a LINK!!! </a>',

If you turn this option on, you’ll get the following instead

    _TEXT => ' I am a LINK!!! ',

The private utility function _stripHTML does this by using HTML::TokeParser’s get_trimmed_text method.

You can turn this feature on and off by using $LX->strip(undef || 0 || 1)
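The three constructor arguments can be combined as in the following minimal sketch. It assumes the module is installed; the HTML snippet and the base URL (www.example.com) are hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

# Hypothetical input and base URL, for illustration only.
my $html = q{<p>See <a href="/docs/intro.html">the <b>intro</b></a>.</p>};
my $base = 'http://www.example.com/';

# No callback (undef in its place), a base URL, and the strip option on.
my $LX = HTML::LinkExtractor->new( undef, $base, 1 );
$LX->parse( \$html );

for my $link ( @{ $LX->links } ) {
    # href is a URI object; interpolating it yields the absolute URL.
    print "$$link{tag}: $$link{href} => $$link{_TEXT}\n";
}
```

Passing undef for the callback keeps the default behaviour of collecting every link, so the results are read back through $LX->links after parsing.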

$LX->parse( $filename || *FILEHANDLE || \$FileContent )

Each time you call parse, you should pass it a $filename, a *FILEHANDLE, or a \$FileContent.

Each time you call parse a new HTML::TokeParser object is created and stored in $this->{_tp}.

You shouldn’t need to mess with the TokeParser object.


$LX->links()

Only after you call parse will this method return anything. This method returns a reference to an ArrayOfHashes, which basically looks like (Data::Dumper output)

    $VAR1 = [ { tag => 'img', src => 'image.png' }, ];

Please note that if you provide a callback this array will be empty.
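The difference between collecting links and handing them to a callback can be seen in a short sketch. The input string is hypothetical, and the "|| []" guard is a defensive assumption in case links returns nothing when a callback is in use.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

# Hypothetical input, for illustration only.
my $html = q{<a href="one.html">One</a> <img src="two.png">};

# Without a callback, links accumulate and are read back via links().
my $collector = HTML::LinkExtractor->new();
$collector->parse( \$html );
my @collected = @{ $collector->links || [] };
printf "collected %d link(s)\n", scalar @collected;

# With a callback, each link hashref is handed to the sub instead.
my @seen;
my $streamer = HTML::LinkExtractor->new(
    sub {
        my ( $lx, $link ) = @_;
        push @seen, $$link{tag};
    }
);
$streamer->parse( \$html );
my @kept = @{ $streamer->links || [] };
printf "callback saw %d link(s), links() holds %d\n",
    scalar @seen, scalar @kept;
```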

$LX->strip( [ 0 || 1 ])

If you pass in undef (or nothing), returns the state of the option. Passing in a true or false value sets the option.

If you wanna know what the option does see $LX->new([\&callback, [$baseUrl, [1]]])
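As a rough illustration of the getter/setter behaviour described above, the sketch below parses the same hypothetical snippet with the option off and then on. A fresh object is used for each pass so the two results cannot mix.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

# Hypothetical input, for illustration only.
my $html = q{<a href="page.html"> Some <i>text</i> </a>};

my %text_for;
for my $flag ( 0, 1 ) {
    my $LX = HTML::LinkExtractor->new();
    $LX->strip($flag);                      # set the option
    my $state = $LX->strip ? 'on' : 'off';  # no argument: query the option
    $LX->parse( \$html );
    my ($link) = @{ $LX->links || [] };
    $text_for{$state} = $$link{_TEXT};
    print "strip $state: [$text_for{$state}]\n";
}
```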

WHAT’S A LINK-type tag

Take a look at %HTML::LinkExtractor::TAGS to see what I consider to be link-type-tag.

Take a look at @HTML::LinkExtractor::VALID_URL_ATTRIBUTES to see all the possible tag attributes which can contain URI’s (the links!!)

Take a look at @HTML::LinkExtractor::TAGS_IN_NEED to see the tags for which the _TEXT attribute is provided, like <a href="#"> TEST </a>
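One way to take that look is to print the package variables directly. This is a small sketch that assumes the module is installed and that each %TAGS value is either an array reference of attribute names or undefined (hence the "|| []" guard).

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

# Every link-type tag with the attributes that may hold a URI.
for my $tag ( sort keys %HTML::LinkExtractor::TAGS ) {
    my $attrs = $HTML::LinkExtractor::TAGS{$tag} || [];
    printf "%-12s %s\n", $tag, join( ', ', @$attrs );
}

# Tags for which the _TEXT attribute is provided.
my @in_need = @HTML::LinkExtractor::TAGS_IN_NEED;
print "tags with _TEXT: @in_need\n";
```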

    How can that be?!?!

I took a look at %HTML::Tagset::linkElements and the following URLs

    And the special cases

    !doctype  is really a process instruction, but is still listed
    in %TAGS with url as the attribute


    <meta HTTP-EQUIV="Refresh" CONTENT="5; URL=">
    If there is a valid url, url is set as the attribute.
    The meta tag has no attributes listed in %TAGS.
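The refresh special case amounts to pulling a URL out of the CONTENT string. The sketch below shows that idea with core Perl only; the helper name refresh_url, its regular expression, and the example URL are assumptions made for illustration, not the module’s own code.

```perl
use strict;
use warnings;

# Hypothetical helper: return the URL part of a meta refresh CONTENT
# value such as "5; URL=http://www.example.com/", or undef if none.
sub refresh_url {
    my ($content) = @_;
    return unless defined $content;
    if ( $content =~ /^\s*\d+\s*;\s*url\s*=\s*(["']?)(.+?)\1\s*$/i ) {
        return $2;
    }
    return;
}

for my $content ( '5; URL=http://www.example.com/next.html', '5' ) {
    my $url = refresh_url($content);
    print defined $url
        ? "url attribute: $url\n"
        : "no url in '$content'\n";
}
```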


SEE ALSO

HTML::LinkExtor, HTML::TokeParser, HTML::Tagset.


AUTHOR

D.H. (PodMaster)

BUGS

Please use to report bugs.

Just go to to see a bug list and/or report new ones.


LICENSE

Copyright (c) 2003, 2004 by D.H. (PodMaster). All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. The LICENSE file contains the full text of the license.


perl v5.20.3 LINKEXTRACTOR (3) 2005-01-07
