NAME

HTML::Parser::Simple - Parse nice HTML files without needing a compiler

Synopsis

        #!/usr/bin/env perl

        use strict;
        use warnings;

        use HTML::Parser::Simple;

        # -------------------------

        # Method 1:

        my($p) = HTML::Parser::Simple -> new
        (
                input_file  => 'data/s.1.html',
                output_file => 'data/s.2.html',
        );

        $p -> parse_file;

        # Method 2:

        my($p) = HTML::Parser::Simple -> new;

        $p -> parse_file('data/s.1.html', 'data/s.2.html');

        # Method 3:

        my($p) = HTML::Parser::Simple -> new;

        print $p -> parse('<html>...</html>') -> traverse($p -> root) -> result;

Of course, these can be abbreviated by using method chaining. E.g. Method 2 could be:

        HTML::Parser::Simple -> new -> parse_file('data/s.1.html', 'data/s.2.html');

See scripts/parse.html.pl and scripts/parse.xhtml.pl.

Description

"HTML::Parser::Simple" is a pure Perl module.

It parses HTML V 4 files, and generates a tree of nodes, with 1 node per HTML tag.

The data associated with each node is documented in the "FAQ".

See also HTML::Parser::Simple::Attributes and HTML::Parser::Simple::Reporter.

Distributions

This module is available as a Unix-style distro (*.tgz).

See <http://savage.net.au/Perl-modules.html> for details.

See <http://savage.net.au/Perl-modules/html/installing-a-module.html> for help on unpacking and installing.

Constructor and initialization

new(...) returns an object of type "HTML::Parser::Simple".

This is the class constructor.

Usage: "HTML::Parser::Simple -> new".

This method takes a hash of options.

Call "new()" as "new(option_1 => value_1, option_2 => value_2, ...)".

Available options (each one of which is also a method):

o input_file => $a_file_name

This takes the file name, including the path, of the input file.

Default: '' (the empty string).

o output_file => $a_file_name

This takes the file name, including the path, of the output file.

Default: '' (the empty string).

o verbose => $Boolean

This takes either a 0 or a 1.

Controls whether more (1) or fewer (0) progress messages are written.

Default: 0.

o xhtml => $Boolean

This takes either a 0 or a 1.

0 means do not accept an XML declaration, such as <?xml version="1.0" encoding="UTF-8"?> at the start of the input file, and some other XHTML features, explained next.

1 means accept XHTML input.

Default: 0.

The only XHTML changes to this code, so far, are:

o Accept the XML declaration: E.g.: <?xml version="1.0" standalone='yes'?>.
o Accept attribute names containing the ':' char: E.g.: <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">.
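
The two changes above can be pictured with a pair of patterns. This is only a sketch: the patterns and the names is_valid_name() and has_xml_declaration() are made up for illustration, and the regexps actually used by this module may differ.

```perl
use strict;
use warnings;

# Hypothetical patterns, written for illustration only.
my %name_re = (
    0 => qr/[a-zA-Z_][-\w]*/,    # xhtml => 0: no ':' in attribute names
    1 => qr/[a-zA-Z_:][-\w:]*/,  # xhtml => 1: ':' is allowed
);
my $xml_declaration_re = qr/\A\s*<\?xml\b.*?\?>/s;

# True if $name is an acceptable attribute name for the given xhtml setting.
sub is_valid_name {
    my ($xhtml, $name) = @_;
    my $re = $name_re{$xhtml};
    return $name =~ /\A$re\z/ ? 1 : 0;
}

# True if the input starts with an XML declaration.
sub has_xml_declaration {
    my ($html) = @_;
    return $html =~ $xml_declaration_re ? 1 : 0;
}

for my $xhtml (0, 1) {
    print "xhtml => $xhtml: xml:lang is ",
        (is_valid_name($xhtml, 'xml:lang') ? 'accepted' : 'rejected'), "\n";
}
```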

Methods

block()

Returns a hashref where the keys are the names of block-level HTML tags.

The corresponding values in the hashref are just 1.

Typical keys: address, form, p, table, tr.

Note: Some keys, e.g. tr, are also returned by "self_close()".

current_node()

Returns the Tree::Simple object which the parser calls the current node.

depth()

Returns the nesting depth of the current tag.

The method is just here in case you need it.

empty()

Returns a hashref where the keys are the names of HTML tags of type empty.

The corresponding values in the hashref are just 1.

Typical keys: area, base, input, wbr.

inline()

Returns a hashref where the keys are the names of HTML tags of type inline.

The corresponding values in the hashref are just 1.

Typical keys: a, em, img, textarea.

input_file($in_file_name)

Gets or sets the input file name used by "parse_file($input_file_name, $output_file_name)".

Note: The parameters passed in to "parse_file($input_file_name, $output_file_name)", take precedence over the input_file and output_file parameters passed in to "new()", and over the internal values set with "input_file($in_file_name)" and "output_file($out_file_name)".

'input_file' is a parameter to "new()". See "Constructor and initialization" for details.

log($msg)

Print $msg to STDERR if "new()" was called as "new(verbose => 1)", or if "$p -> verbose(1)" was called.

Otherwise, print nothing.

new()

This is the constructor. See "Constructor and initialization" for details.

node_type()

Returns the type of the most recently created node: global, head, or body.

See the first question in the "FAQ" for details.

output_file($out_file_name)

Gets or sets the output file name used by "parse_file($input_file_name, $output_file_name)".

Note: The parameters passed in to "parse_file($input_file_name, $output_file_name)", take precedence over the input_file and output_file parameters passed in to "new()", and over the internal values set with "input_file($in_file_name)" and "output_file($out_file_name)".

'output_file' is a parameter to "new()". See "Constructor and initialization" for details.

parse($html)

Returns the invocant. Thus "$p -> parse" returns $p. This allows for method chaining. See the "Synopsis".

Parses the string of HTML in $html, and builds a tree of nodes.

After calling "$p -> parse($html)", you must call "$p -> traverse($p -> root)" before calling "$p -> result".

Alternately, use "$p -> parse_file", which calls all these methods for you.

Note: "parse()" may be called directly or via "parse_file()".

parse_file($input_file_name, $output_file_name)

Returns the invocant. Thus "$p -> parse_file" returns $p. This allows for method chaining. See the "Synopsis".

Parses the HTML in the input file, and writes the result to the output file.

"parse_file()" calls "parse($html)" and "traverse($node)", using "$p -> root" for $node.

Note: The parameters passed in to "parse_file($input_file_name, $output_file_name)", take precedence over the input_file and output_file parameters passed in to "new()", and over the internal values set with "input_file($in_file_name)" and "output_file($out_file_name)".

Lastly, the parameters passed in to "parse_file($input_file_name, $output_file_name)" are used to update the internal values set with the input_file and output_file parameters passed in to "new()", or set with calls to "input_file($in_file_name)" and "output_file($out_file_name)".

result()

Returns the string which is the result of the parse.

See scripts/parse.html.pl.

root()

Returns the Tree::Simple object which the parser calls the root of the tree of nodes.

self_close()

Returns a hashref where the keys are the names of HTML tags of type self close.

The corresponding values in the hashref are just 1.

Typical keys: dd, dt, p, tr.

Note: Some keys, e.g. tr, are also returned by "block()".
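
A quick way to list the tags which appear in both hashrefs is sketched below. The two hashrefs here are stand-ins holding only the typical keys quoted in this document. With a real parser object you would presumably use "$p -> block" and "$p -> self_close" instead.

```perl
use strict;
use warnings;

# Stand-in hashrefs built from the typical keys listed in this document.
my $block      = { map { $_ => 1 } qw(address form p table tr) };
my $self_close = { map { $_ => 1 } qw(dd dt p tr) };

# Keep each block-level tag which is also a self-closing tag.
my @both = sort grep { $self_close->{$_} } keys %$block;

print "In both: @both\n";
```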

tagged_attribute()

Returns a string to be used as a regexp, to capture tags and their optional attributes.

It does not return qr/$s/; it just returns $s.

This regexp takes one of two forms, depending on the state of the xhtml option. See "xhtml($Boolean)".

The regexp has four (4) sets of capturing parentheses:

o 1 for the whole tag and attribute and trailing / combination: E.g.: <(....)>
o 1 for the tag itself: E.g.: <(img)...>
o 1 for the optional attributes of the tag: E.g.: <img (src="/graph.svg" alt="A graph")>
o 1 for the optional trailing / of the tag: E.g.: <img ... (/)>
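
To show how such a string is used, here is a sketch with a simplified stand-in pattern. The pattern in $s is an assumption written for this example. It is not the string returned by "tagged_attribute()", which handles more cases. Only the layout of the four sets of capturing parentheses follows the list above.

```perl
use strict;
use warnings;

# A simplified stand-in for the string returned by tagged_attribute().
# Capture 1: everything between < and >. Capture 2: the tag name.
# Capture 3: the attributes, if any. Capture 4: the trailing /, if any.
my $s = q{<((\w+)((?:\s+[\w:-]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^>\s]+))?)*)\s*(/?))>};

# The method returns a plain string, so compile it before use.
my $re = qr/$s/;

my $tag = '<img src="/graph.svg" alt="A graph" />';
my ($all, $name, $attributes, $slash) = $tag =~ $re;

print "Tag:        $name\n";
print "Attributes:$attributes\n";
print "Trailing /: ", ($slash ? 'yes' : 'no'), "\n";
```

Note that the attributes capture starts with a space, as does the attributes string described in the "FAQ".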

traverse($node)

Returns the invocant. Thus "$p -> traverse" returns $p. This allows for method chaining. See the "Synopsis".

Traverses the tree of nodes, starting at $node.

You normally call this as "$p -> traverse($p -> root)", to ensure all nodes are visited.

See the "Synopsis" for sample code.

Or, see scripts/traverse.file.pl, which uses HTML::Parser::Simple::Reporter, and calls "traverse($node)" via "traverse_file($input_file_name)" in HTML::Parser::Simple::Reporter.

verbose($Boolean)

Gets or sets the verbose parameter.

'verbose' is a parameter to "new()". See "Constructor and initialization" for details.

xhtml($Boolean)

Gets or sets the xhtml parameter.

If you call this after object creation, the trigger feature of Moos is used to call "tagged_attribute()" so as to correctly set the regexp which recognises xhtml.

'xhtml' is a parameter to "new()". See "Constructor and initialization" for details.

FAQ

What is the format of the data stored in each node of the tree?

The data of each node is a hash ref. The keys/values of this hash ref are:

o attributes

This is the string of HTML attributes associated with the HTML tag.

Attributes are stored in lower-case.

So, <table align = 'center' summary = 'Body'> will have an attributes string of " align = 'center' summary = 'body'".

Note the leading space.

o content

This is an arrayref of bits and pieces of content.

Consider this fragment of HTML:

        <p>I did <i>not</i> say I <i>liked</i> debugging.</p>

When parsing 'I did ', the number of child nodes (of <p>) is 0, since <i> has not yet been detected.

So, 'I did ' is stored in the 0th element of the arrayref belonging to <p>.

Likewise, 'not' is stored in the 0th element of the arrayref belonging to the node <i>.

Next, ' say I ' is stored in the 1st element of the arrayref belonging to <p>, because it follows the 1st child node (<i>).

Likewise, ' debugging' is stored in the 2nd element of the arrayref belonging to <p>.

This way, the input string can be reproduced by successively outputting the elements of the arrayref of content interspersed with the contents of the child nodes (processed recursively).

Note: If you are processing this tree, never forget that there can be content after the last child node has been closed, but before the current node is closed.

Note: The DOCTYPE declaration is stored as the 0th element of the content of the root node.

o depth

The nesting depth of the tag within the document.

The root is at depth 0, '<html>' is at depth 1, '<head>' and '<body>' are at depth 2, and so on.

It is just there in case you need it.

o name

The name of the HTML tag.

So, the tag '<html>' will mean the name is 'html'.

Tag names are stored in lower-case.

The root of the tree is called 'root', and holds the DOCTYPE, if any, as content.

The root has the node 'html' as the only child, of course.

o node_type

This holds 'global' before '<head>' and between '</head>' and '<body>', and after '</body>'.

It holds 'head' for all nodes from '<head>' to '</head>', and holds 'body' from '<body>' to '</body>'.

It is just there in case you need it.
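
The way the content arrayref interleaves with the child nodes can be sketched in a few lines. This sketch makes two assumptions. The 'debugging' fragment is taken to be a paragraph (p) with two child elements (i), and each node is modelled as a plain hashref with a made-up 'children' key. In the real tree the children are held by Tree::Simple, not by the node's data.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a parsed fragment. The 'children' key exists
# only in this sketch.
my $p_node = {
    name     => 'p',
    content  => ['I did ', ' say I ', ' debugging.'],
    children => [
        { name => 'i', content => ['not'],   children => [] },
        { name => 'i', content => ['liked'], children => [] },
    ],
};

# Rebuild the text of a node: content element 0 comes before the first
# child, and content element $i + 1 comes after child number $i.
sub render {
    my ($node)   = @_;
    my @content  = @{ $node->{content} };
    my @children = @{ $node->{children} };
    my $text     = "<$node->{name}>" . ($content[0] // '');

    for my $i (0 .. $#children) {
        $text .= render($children[$i]);
        $text .= $content[$i + 1] // '';
    }

    return $text . "</$node->{name}>";
}

print render($p_node), "\n";
```

The last content element is output after the last child, which is the case the note about trailing content warns you not to forget.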

How are tags and attributes handled?

Tags are stored in lower-case, in a tree managed by Tree::Simple.

Attributes are stored in the same case as in the original HTML.

The root of the tree is returned by "root()".

How are HTML comments handled?

They are treated as content. This includes the prefix '<!--' and the suffix '-->'.

How is DOCTYPE handled?

It is treated as content belonging to the root of the tree.

How is the XML declaration handled?

It is treated as content belonging to the root of the tree.

Does this module handle all HTML pages?

No, never.

Which versions of HTML does this module handle?

Up to V 4.

What do I do if this module does not handle my HTML page?

Make yourself a nice cup of tea, and then fix your page.

Does this validate the HTML input?

No.

For example, if you feed in a HTML page without the title tag, this module does not care.

How do I view the output HTML?

There are various ways.

o See scripts/parse.html.pl
o By installing HTML::Revelation, of course!: Sample output:
<http://savage.net.au/Perl-modules/html/CreateTable.html>.

How do I test this module (or my file)?

Preferably, see the previous question, or...

Suggested steps:

Note: There are quite a few files involved. Proceed with caution.

o Select a HTML file to test: Call this input.html.
o Run input.html thru reveal.pl: Reveal.pl ships with HTML::Revelation.
Call the output file output.1.html.
o Run input.html thru parse.html.pl: parse.html.pl ships with HTML::Parser::Simple.
Call the output file parsed.html.
o Run parsed.html thru reveal.pl: Call the output file output.2.html.
o Compare output.1.html and output.2.html: If they match, or even if they don't match, you're finished.
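
The last step can be done with the core module File::Compare. This sketch covers only the comparison. How reveal.pl and parse.html.pl are run is not shown, and the name same_output() is made up for this example.

```perl
use strict;
use warnings;

use File::Compare qw(compare);

# Return true if the two files have identical contents.
# compare() returns 0 for equal files, 1 for different files, -1 on error.
sub same_output {
    my ($file_1, $file_2) = @_;
    my $status = compare($file_1, $file_2);

    die "Cannot compare $file_1 and $file_2: $!\n" if $status < 0;

    return $status == 0;
}

# Usage: perl compare.pl output.1.html output.2.html
if (@ARGV == 2) {
    print same_output(@ARGV) ? "Match\n" : "Differ\n";
}
```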

Will you implement a 'quirks' mode to handle my special HTML file?

No, never.

Help with quirks: <http://www.quirksmode.org/sitemap.html>.

Is there anything I should be aware of?

Yes. If your HTML file is not nice, the interpretation of tag nesting will not match your preconceptions.

In such cases, do not seek to fix the code. Instead, fix your (faulty) preconceptions, and fix your HTML file.

The 'a' tag, for example, is defined to be an inline tag, but the 'div' tag is a block-level tag.

I do not define 'a' to be inline; others do, e.g. <http://www.w3.org/TR/html401/> and hence HTML::Tagset.

Inline means:

        <a href = "#NAME"><div class = 'global_toc_text'>NAME</div></a>

will not be parsed as an 'a' containing a 'div'.

The 'a' tag will be closed before the 'div' is opened. So, the result will look like:

        <a href = "#NAME"></a><div class = 'global_toc_text'>NAME</div>

To achieve what was presumably intended, use 'span':

        <a href = "#NAME"><span class = 'global_toc_text'>NAME</span></a>

Some people (*cough* *cough*) have had to redo their entire websites due to this very problem.

Of course, this is just one of a vast set of possible problems.

You have been warned.
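
The re-nesting described above can be modelled in a few lines. This is a simplified sketch, not the code of this module. It works on a hand-made list of tokens rather than on HTML text, it ignores attributes, and the hashes %block and %inline are small stand-ins for the hashrefs returned by "block()" and "inline()".

```perl
use strict;
use warnings;

# Small stand-ins for the hashrefs returned by block() and inline().
my %block  = map { $_ => 1 } qw(div p table);
my %inline = map { $_ => 1 } qw(a em span);

# Re-nest a flat list of open/close/text tokens, closing any open inline
# tags before a block-level tag is opened.
sub renest {
    my @tokens = @_;
    my (@stack, @out);

    for my $token (@tokens) {
        my ($type, $value) = @$token;

        if ($type eq 'open') {
            if ($block{$value}) {
                push @out, '</' . pop(@stack) . '>'
                    while @stack && $inline{ $stack[-1] };
            }

            push @stack, $value;
            push @out, "<$value>";
        }
        elsif ($type eq 'close') {
            # Skip a close tag whose element has already been closed.
            next unless grep { $_ eq $value } @stack;

            push @out, '</' . pop(@stack) . '>' while $stack[-1] ne $value;
            pop @stack;
            push @out, "</$value>";
        }
        else {
            push @out, $value;
        }
    }

    return join '', @out;
}

my $html = renest(
    [open  => 'a'],   [open  => 'div'], [text => 'NAME'],
    [close => 'div'], [close => 'a'],
);

print "$html\n";
```

Replacing 'div' with 'span' in the token list leaves the nesting untouched, since 'span' is inline.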

Why did you use Tree::Simple but not Tree or Tree::Fast or Tree::DAG_Node?

During testing, Tree::Fast crashed, so I replaced it with Tree and everything worked. Spooky.

Late news: Tree does not cope with an arrayref stored in the metadata, so I have switched to Tree::DAG_Node.

Stop press: As an experiment I switched to Tree::Simple. Since it also works I will just keep using it.

Why is this module not called HTML::Parser::PurePerl?

o The API: That name sounds like a pure Perl version of the same API as used by HTML::Parser.
But the 2 APIs are not, and are not meant to be, compatible.
o The tie-in: Some people might falsely assume HTML::Parser can automatically fall back to HTML::Parser::PurePerl in the absence of a compiler.

How do I output my own stuff while traversing the tree?

o The sophisticated way: As always with OO code, sub-class! In this case, you write a new version of the traverse() method.
See HTML::Parser::Simple::Reporter, for example. It overrides "traverse($node)".
o The crude way: Alternately, implement another method in your sub-class, e.g. process(), which recurses like traverse(). Then call parse() and process().
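
A recursive routine in the style of the crude way might look like the sketch below. It assumes the data hashref of each node is available via Tree::Simple's getNodeValue(), and that its children come from getAllChildren(). TinyNode is a made-up stand-in which mimics just those 2 methods, so that the sketch runs without any modules installed. The name outline() is also made up.

```perl
use strict;
use warnings;

# A hypothetical stand-in which mimics only the 2 Tree::Simple methods
# used by outline().
package TinyNode {
    sub new {
        my ($class, $value, @children) = @_;
        return bless { value => $value, children => \@children }, $class;
    }
    sub getNodeValue   { return $_[0]{value} }
    sub getAllChildren { return @{ $_[0]{children} } }
}

# Collect 1 line per node, indented by the node's depth.
sub outline {
    my ($node, $lines) = @_;
    $lines //= [];

    my $data = $node->getNodeValue;
    push @$lines, ('  ' x $data->{depth}) . $data->{name};

    outline($_, $lines) for $node->getAllChildren;

    return $lines;
}

my $root = TinyNode->new(
    { name => 'root', depth => 0 },
    TinyNode->new(
        { name => 'html', depth => 1 },
        TinyNode->new({ name => 'head', depth => 2 }),
        TinyNode->new({ name => 'body', depth => 2 }),
    ),
);

print "$_\n" for @{ outline($root) };
```

With a real parser object, the equivalent call would presumably be outline($p -> root), made after "$p -> parse($html)".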

How is the source formatted?

I edit with UltraEdit. That means, in general, leading 4-space tabs.

All vertical alignment within lines is done manually with spaces.

Perl::Critic is off the agenda.

Why did you choose Moos?

For the 2012 Google Code-in, I had a quick look at 122 class-building classes, and decided Moos was suitable, given it is pure-Perl and has the trigger feature I needed.

See <http://savage.net.au/Module-reviews/html/gci.2012.class.builder.modules.html>.

Credits

This Perl HTML parser has been converted from a JavaScript one written by John Resig.

<http://ejohn.org/files/htmlparser.js>.

Well done John!

Note also the comments published here:

<http://groups.google.com/group/envjs/browse_thread/thread/edd9033b9273fa58>.

Repository

<https://github.com/ronsavage/HTML-Parser-Simple>

Support

Email the author, or log a bug on RT:

<https://rt.cpan.org/Public/Dist/Display.html?Name=HTML::Parser::Simple>.

Author

"HTML::Parser::Simple" was written by Ron Savage <ron@savage.net.au> in 2009.

Home page: <http://savage.net.au/index.html>.

Copyright

        All Programs of mine are 'OSI Certified Open Source Software';
        you can redistribute them and/or modify them under the terms of
        The Artistic License, a copy of which is available at:
        http://www.opensource.org/licenses/index.html