HTML::Parser::Simple(3) User Contributed Perl Documentation HTML::Parser::Simple(3)

HTML::Parser::Simple - Parse nice HTML files without needing a compiler

Synopsis

        #!/usr/bin/env perl

        use strict;
        use warnings;

        use HTML::Parser::Simple;

        # -------------------------

        # Method 1:

        my($p) = HTML::Parser::Simple -> new
        (
                input_file  => 'data/s.1.html',
                output_file => 'data/s.2.html',
        );

        $p -> parse_file;

        # Method 2:

        $p = HTML::Parser::Simple -> new;

        $p -> parse_file('data/s.1.html', 'data/s.2.html');

        # Method 3:

        $p = HTML::Parser::Simple -> new;

        print $p -> parse('<html>...</html>') -> traverse($p -> root) -> result;

Of course, these can be abbreviated by using method chaining. E.g. Method 2 could be:

        HTML::Parser::Simple -> new -> parse_file('data/s.1.html', 'data/s.2.html');

See scripts/parse.html.pl and scripts/parse.xhtml.pl.

"HTML::Parser::Simple" is a pure Perl module.

It parses HTML V 4 files, and generates a tree of nodes, with 1 node per HTML tag.

The data associated with each node is documented in the "FAQ".

See also HTML::Parser::Simple::Attributes and HTML::Parser::Simple::Reporter.

This module is available as a Unix-style distro (*.tgz).

See <http://savage.net.au/Perl-modules.html> for details.

See <http://savage.net.au/Perl-modules/html/installing-a-module.html> for help on unpacking and installing.

Constructor and Initialization

new(...) returns an object of type "HTML::Parser::Simple".

This is the class constructor.

Usage: "HTML::Parser::Simple -> new".

This method takes a hash of options.

Call "new()" as "new(option_1 => value_1, option_2 => value_2, ...)".

Available options (each one of which is also a method):

o input_file => $a_file_name
This takes the file name, including the path, of the input file.

Default: '' (the empty string).

o output_file => $a_file_name
This takes the file name, including the path, of the output file.

Default: '' (the empty string).

o verbose => $Boolean
This takes either a 0 or a 1.

Selects how many progress messages are written: 0 for fewer, 1 for more.

Default: 0.

o xhtml => $Boolean
This takes either a 0 or a 1.

0 means do not accept an XML declaration, such as <?xml version="1.0" encoding="UTF-8"?>, at the start of the input file, nor the other XHTML features explained next.

1 means accept XHTML input.

Default: 0.

The only XHTML changes to this code, so far, are:

o Accept the XML declaration
E.g.: <?xml version="1.0" standalone='yes'?>.
o Accept attribute names containing the ':' char
E.g.: <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">.

Returns a hashref where the keys are the names of block-level HTML tags.

The corresponding values in the hashref are just 1.

Typical keys: address, form, p, table, tr.

Note: Some keys, e.g. tr, are also returned by "self_close()".

Returns the Tree::Simple object which the parser calls the current node.

Returns the nesting depth of the current tag.

The method is just here in case you need it.

Returns a hashref where the keys are the names of HTML tags of type empty.

The corresponding values in the hashref are just 1.

Typical keys: area, base, input, wbr.

Returns a hashref where the keys are the names of HTML tags of type inline.

The corresponding values in the hashref are just 1.

Typical keys: a, em, img, textarea.

Gets or sets the input file name used by "parse_file($input_file_name, $output_file_name)".

Note: The parameters passed in to "parse_file($input_file_name, $output_file_name)" take precedence over the input_file and output_file parameters passed in to "new()", and over the internal values set with "input_file($in_file_name)" and "output_file($out_file_name)".

'input_file' is a parameter to "new()". See "Constructor and Initialization" for details.

Print $msg to STDERR if "new()" was called as "new(verbose => 1)", or if "$p -> verbose(1)" was called.

Otherwise, print nothing.

This is the constructor. See "Constructor and Initialization" for details.

Returns the type of the most recently created node: 'global', 'head', or 'body'.

See the first question in the "FAQ" for details.

Gets or sets the output file name used by "parse_file($input_file_name, $output_file_name)".

Note: The parameters passed in to "parse_file($input_file_name, $output_file_name)" take precedence over the input_file and output_file parameters passed in to "new()", and over the internal values set with "input_file($in_file_name)" and "output_file($out_file_name)".

'output_file' is a parameter to "new()". See "Constructor and Initialization" for details.

Returns the invocant. Thus "$p -> parse" returns $p. This allows for method chaining. See the "Synopsis".

Parses the string of HTML in $html, and builds a tree of nodes.

After calling "$p -> parse($html)", you must call "$p -> traverse($p -> root)" before calling "$p -> result".

Alternately, use "$p -> parse_file", which calls all these methods for you.

Note: "parse()" may be called directly or via "parse_file()".

Returns the invocant. Thus "$p -> parse_file" returns $p. This allows for method chaining. See the "Synopsis".

Parses the HTML in the input file, and writes the result to the output file.

"parse_file()" calls "parse($html)" and "traverse($node)", using "$p -> root" for $node.

Note: The parameters passed in to "parse_file($input_file_name, $output_file_name)" take precedence over the input_file and output_file parameters passed in to "new()", and over the internal values set with "input_file($in_file_name)" and "output_file($out_file_name)".

Lastly, the parameters passed in to "parse_file($input_file_name, $output_file_name)" are used to update the internal values set with the input_file and output_file parameters passed in to "new()", or set with calls to "input_file($in_file_name)" and "output_file($out_file_name)".
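
The precedence rule above can be sketched without the module at all. This is only an illustration: the %stored hash stands in for the values set via "new()" or the accessors, the helper resolve_file_names() and the file names are hypothetical, and treating an undefined parameter as "not supplied" is an assumption.

```perl
use strict;
use warnings;

# Stand-in for the stored input_file/output_file values; not the module.
my %stored = (input_file => 'new.in.html', output_file => 'new.out.html');

# Hypothetical helper mirroring the documented precedence.
sub resolve_file_names
{
    my($input_file_name, $output_file_name) = @_;

    # Explicit parameters win, and then replace the stored values.
    $stored{input_file}  = $input_file_name  if (defined $input_file_name);
    $stored{output_file} = $output_file_name if (defined $output_file_name);

    return ($stored{input_file}, $stored{output_file});
}

print join(' -> ', resolve_file_names() ), "\n";
print join(' -> ', resolve_file_names('call.in.html', 'call.out.html') ), "\n";
print join(' -> ', resolve_file_names() ), "\n";
```

The third call shows the "Lastly" paragraph: once explicit names have been passed in, they are what later calls see.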

Returns the string which is the result of the parse.

See scripts/parse.html.pl.

Returns the Tree::Simple object which the parser calls the root of the tree of nodes.

Returns a hashref where the keys are the names of HTML tags of type self close.

The corresponding values in the hashref are just 1.

Typical keys: dd, dt, p, tr.

Note: Some keys, e.g. tr, are also returned by "block()".
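
Since these methods return hashrefs whose values are all 1, classifying a tag is a plain hash lookup. A minimal sketch follows, using hand-built hashrefs which hold only the typical keys quoted on this page (the real methods return larger sets). The helper kinds_of() is hypothetical.

```perl
use strict;
use warnings;

# Hand-built stand-ins for the hashrefs returned by block() and self_close().
# Only the typical keys quoted in this page are included.
my $block      = { map { $_ => 1 } qw(address form p table tr) };
my $self_close = { map { $_ => 1 } qw(dd dt p tr) };

# Hypothetical helper: list the categories a tag belongs to.
sub kinds_of
{
    my($tag) = @_;
    $tag     = lc $tag; # Tag names are stored in lower-case.

    my(@kind);

    push @kind, 'block'      if ($$block{$tag});
    push @kind, 'self_close' if ($$self_close{$tag});

    return @kind;
}

for my $tag (qw(tr form dd span) )
{
    my($kind) = join(', ', kinds_of($tag) ) || 'neither';

    print "$tag: $kind\n";
}
```

With these subsets, tr is reported under both headings, which is the overlap the note above mentions.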

Returns a string to be used as a regexp, to capture tags and their optional attributes.

It does not return qr/$s/; it just returns $s.

This regexp takes one of two forms, depending on the state of the xhtml option. See "xhtml($Boolean)".

The regexp has four (4) sets of capturing parentheses:

o 1 for the whole tag and attribute and trailing / combination
E.g.: <(....)>
o 1 for the tag itself
E.g.: <(img)...>
o 1 for the optional attributes of the tag
E.g.: <img (src="/graph.svg" alt="A graph")>
o 1 for the optional trailing / of the tag
E.g.: <img ... (/)>
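
To show how a caller might use a string with these four captures, here is a sketch built around a simplified stand-in regexp. It is not the module's own regexp: it was written for this example, and, following the illustrations above, it places capture 1 inside the angle brackets. Because the method returns a plain string, the string is compiled with qr// before use.

```perl
use strict;
use warnings;

# A simplified stand-in for the string returned by tagged_attribute().
# It is NOT the module's own regexp; it only mirrors the 4 captures.
my($s) = q{^<((\w+)((?:\s+[-\w:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^>\s]+))?)*)\s*(/?))>};

# The method returns a plain string, so compile it before use.
my($re) = qr/$s/;

my($html)    = q{<img src="/graph.svg" alt="A graph" />};
my(@capture) = $html =~ $re;
my(@label)   = ('Whole tag', 'Tag name', 'Attributes', 'Trailing /');

print "$label[$_]: '$capture[$_]'\n" for 0 .. $#capture;
```

Note that the attributes capture keeps its leading space, which matches the way the attributes string is described in the "FAQ".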

Returns the invocant. Thus "$p -> traverse" returns $p. This allows for method chaining. See the "Synopsis".

Traverses the tree of nodes, starting at $node.

You normally call this as "$p -> traverse($p -> root)", to ensure all nodes are visited.

See the "Synopsis" for sample code.

Or, see scripts/traverse.file.pl, which uses HTML::Parser::Simple::Reporter, and calls "traverse($node)" via "traverse_file($input_file_name)" in HTML::Parser::Simple::Reporter.

Gets or sets the verbose parameter.

'verbose' is a parameter to "new()". See "Constructor and Initialization" for details.

Gets or sets the xhtml parameter.

If you call this after object creation, the trigger feature of Moos is used to call "tagged_attribute()" so as to correctly set the regexp which recognises xhtml.

'xhtml' is a parameter to "new()". See "Constructor and Initialization" for details.

FAQ

The data of each node is a hash ref. The keys/values of this hash ref are:
o attributes
This is the string of HTML attributes associated with the HTML tag.

Attributes are stored in lower-case.

So, <table align = 'center' summary = 'Body'> will have an attributes string of " align = 'center' summary = 'body'".

Note the leading space.

o content
This is an arrayref of bits and pieces of content.

Consider this fragment of HTML:

<p>I did <i>not</i> say I <i>liked</i> debugging.</p>

When parsing 'I did ', the number of child nodes (of <p>) is 0, since <i> has not yet been detected.

So, 'I did ' is stored in the 0th element of the arrayref belonging to <p>.

Likewise, 'not' is stored in the 0th element of the arrayref belonging to the node <i>.

Next, ' say I ' is stored in the 1st element of the arrayref belonging to <p>, because it follows the 1st child node (<i>).

Likewise, ' debugging.' is stored in the 2nd element of the arrayref belonging to <p>.

This way, the input string can be reproduced by successively outputting the elements of the arrayref of content interspersed with the contents of the child nodes (processed recursively).

Note: If you are processing this tree, never forget that there can be content after the last child node has been closed, but before the current node is closed.

Note: The DOCTYPE declaration is stored as the 0th element of the content of the root node.

o depth
The nesting depth of the tag within the document.

The root is at depth 0, '<html>' is at depth 1, '<head>' and '<body>' are at depth 2, and so on.

It is just there in case you need it.

o name
The name of the HTML tag. So, the tag '<html>' will mean the name is 'html'.

Tag names are stored in lower-case.

The root of the tree is called 'root', and holds the DOCTYPE, if any, as content.

The root has the node 'html' as the only child, of course.

o node_type
This holds 'global' before '<head>' and between '</head>' and '<body>', and after '</body>'.

It holds 'head' for all nodes from '<head>' to '</head>', and holds 'body' from '<body>' to '</body>'.

It is just there in case you need it.
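
The interleaving of the content arrayref with the child nodes, described under 'content' above, can be sketched with plain hashes. This is an illustration only: in the module the children belong to a Tree::Simple node, whereas the 'children' key and the rebuild() function here are hypothetical.

```perl
use strict;
use warnings;

# Plain-hash stand-ins for tree nodes. In the module itself the children
# belong to a Tree::Simple node; the 'children' key here is illustrative.
my $tree =
{
    name       => 'p',
    attributes => '',
    content    => ['I did ', ' say I ', ' debugging.'],
    children   =>
    [
        {name => 'i', attributes => '', content => ['not'],   children => []},
        {name => 'i', attributes => '', content => ['liked'], children => []},
    ],
};

# Emit content[0], child 0, content[1], child 1, ..., then any content
# which follows the last child.
sub rebuild
{
    my($node)    = @_;
    my(@content) = @{$node->{content} };
    my(@child)   = @{$node->{children} };

    # The attributes string carries its own leading space.
    my($html) = "<$node->{name}$node->{attributes}>";

    for my $i (0 .. $#child)
    {
        $html .= $content[$i] // '';
        $html .= rebuild($child[$i]);
    }

    # Content after the last child node, but before the closing tag.
    $html .= join('', grep { defined } @content[scalar(@child) .. $#content]);

    return "$html</$node->{name}>";
}

print rebuild($tree), "\n";
```

The final join handles the case flagged in the note under 'content': text which appears after the last child has been closed.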

Tags are stored in lower-case, in a tree managed by Tree::Simple.

Attributes are stored in the same case as in the original HTML.

The root of the tree is returned by "root()".

They are treated as content. This includes the prefix '<!--' and the suffix '-->'.

It is treated as content belonging to the root of the tree.

It is treated as content belonging to the root of the tree.

No, never.

Up to V 4.

Make yourself a nice cup of tea, and then fix your page.

No.

For example, if you feed in a HTML page without the title tag, this module does not care.

There are various ways.
o See scripts/parse.html.pl
o By installing HTML::Revelation, of course!
Sample output:

<http://savage.net.au/Perl-modules/html/CreateTable.html>.

Preferably, see the previous question, or...

Suggested steps:

Note: There are quite a few files involved. Proceed with caution.

o Select a HTML file to test
Call this input.html.
o Run input.html thru reveal.pl
Reveal.pl ships with HTML::Revelation.

Call the output file output.1.html.

o Run input.html thru parse.html.pl
parse.html.pl ships with HTML::Parser::Simple.

Call the output file parsed.html.

o Run parsed.html thru reveal.pl
Call the output file output.2.html.
o Compare output.1.html and output.2.html
If they match, or even if they don't match, you're finished.

No, never.

Help with quirks: <http://www.quirksmode.org/sitemap.html>.

Yes. If your HTML file is not nice, the interpretation of tag nesting will not match your preconceptions.

In such cases, do not seek to fix the code. Instead, fix your (faulty) preconceptions, and fix your HTML file.

The 'a' tag, for example, is defined to be an inline tag, but the 'div' tag is a block-level tag.

I do not define 'a' to be inline; others do, e.g. <http://www.w3.org/TR/html401/> and hence HTML::Tagset.

Inline means:

        <a href = "#NAME"><div class = 'global_toc_text'>NAME</div></a>

will not be parsed as an 'a' containing a 'div'.

The 'a' tag will be closed before the 'div' is opened. So, the result will look like:

        <a href = "#NAME"></a><div class = 'global_toc_text'>NAME</div>

To achieve what was presumably intended, use 'span':

        <a href = "#NAME"><span class = 'global_toc_text'>NAME</span></a>

Some people (*cough* *cough*) have had to redo their entire websites due to this very problem.

Of course, this is just one of a vast set of possible problems.

You have been warned.
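
The 'a' and 'div' behaviour described above comes down to one rule: opening a block-level tag first closes any inline tags still open. A simplified sketch of that rule follows. It is not the module's code; the tag sets are small illustrative subsets, nest() is a hypothetical helper, and text content is left out.

```perl
use strict;
use warnings;

# Illustrative subsets of the tag sets; the module's own sets are larger.
my %block  = map { $_ => 1 } qw(div p table);
my %inline = map { $_ => 1 } qw(a em span);

# Hypothetical helper. Takes tokens such as 'a', 'div', '/div', '/a' and
# returns the tags as they would be emitted under the rule.
sub nest
{
    my(@token) = @_;

    my(@stack, @out);

    for my $token (@token)
    {
        if ($token =~ m|^/(.+)|)
        {
            my($name) = $1;

            # Ignore an end tag whose start tag is no longer open.
            next unless (grep { $_ eq $name } @stack);

            # Close everything opened since the matching start tag.
            while (@stack)
            {
                my($open) = pop @stack;

                push @out, "</$open>";

                last if ($open eq $name);
            }
        }
        else
        {
            # A block-level tag closes any inline tags still open.
            if ($block{$token})
            {
                push @out, '</' . pop(@stack) . '>' while (@stack && $inline{$stack[-1]});
            }

            push @stack, $token;
            push @out, "<$token>";
        }
    }

    return join('', @out);
}

print nest(qw(a div /div /a) ), "\n";   # 'a' is closed before 'div' opens.
print nest(qw(a span /span /a) ), "\n"; # 'span' is inline, so it nests.
```

In the first call the later '/a' token finds no open 'a' and is dropped, which is why the output has the same shape as the re-nested example above.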

During testing, Tree::Fast crashed, so I replaced it with Tree and everything worked. Spooky.

Late news: Tree does not cope with an arrayref stored in the metadata, so I have switched to Tree::DAG_Node.

Stop press: As an experiment I switched to Tree::Simple. Since it also works I will just keep using it.

o The API
That name sounds like a pure Perl version of the same API as used by HTML::Parser.

But the 2 APIs are not, and are not meant to be, compatible.

o The tie-in
Some people might falsely assume HTML::Parser can automatically fall back to HTML::Parser::PurePerl in the absence of a compiler.

o The sophisticated way
As always with OO code, sub-class! In this case, you write a new version of the traverse() method.

See HTML::Parser::Simple::Reporter, for example. It overrides "traverse($node)".

o The crude way
Alternately, implement another method in your sub-class, e.g. process(), which recurses like traverse(). Then call parse() and process().

I edit with UltraEdit. That means, in general, leading 4-space tabs.

All vertical alignment within lines is done manually with spaces.

Perl::Critic is off the agenda.

For the 2012 Google Code-in, I had a quick look at 122 class-building classes, and decided Moos was suitable, given it is pure-Perl and has the trigger feature I needed.

See <http://savage.net.au/Module-reviews/html/gci.2012.class.builder.modules.html>.

This Perl HTML parser has been converted from a JavaScript one written by John Resig.

<http://ejohn.org/files/htmlparser.js>.

Well done John!

Note also the comments published here:

<http://groups.google.com/group/envjs/browse_thread/thread/edd9033b9273fa58>.

<https://github.com/ronsavage/HTML-Parser-Simple>

Email the author, or log a bug on RT:

<https://rt.cpan.org/Public/Dist/Display.html?Name=HTML::Parser::Simple>.

"HTML::Parser::Simple" was written by Ron Savage <ron@savage.net.au> in 2009.

Home page: <http://savage.net.au/index.html>.

Australian copyright (c) 2009 Ron Savage.

        All Programs of mine are 'OSI Certified Open Source Software';
        you can redistribute them and/or modify them under the terms of
        The Artistic License, a copy of which is available at:
        http://www.opensource.org/licenses/index.html
2015-01-25 perl v5.32.1
