NAME

HTML::TableContentParser - Do interesting things with the contents of tables.

SYNOPSIS

  use HTML::TableContentParser;
  my $p = HTML::TableContentParser->new();
  my $html = read_html_from_somewhere();
  my $tables = $p->parse( $html );
  for my $t (@$tables) {
    for my $r (@{$t->{rows}}) {
      print 'Row:';
      for my $c (@{$r->{cells}}) {
        print " [$c->{data}]";
      }
      print "\n";
    }
  }

DESCRIPTION

This package parses tables out of HTML. The parse() method returns a reference to an array containing the tables found.

Tables appear in the output in the order in which they are encountered. If a table is nested inside a cell of another table, it will appear after the containing table in the output, and any connection between the two will be lost. As of version 0.200_01, the appearance of a nested table should not cause any truncation of the containing table.

The following tags are processed by this module: "<table>", "<caption>", "<tr>", "<th>", and "<td>". In the return from the parse() method, each tag is represented by a hash reference whose keys are the tag's attribute names and whose values are the corresponding attribute values. In addition, the following keys will be provided:

"<table>"

caption: the "<caption>" tag, if any
headers: a reference to an array containing all the "<th>" tags, in the order encountered
rows: a reference to an array containing all the "<tr>" tags, in the order encountered

"<caption>"

data: the content of the "<caption>" tag

"<tr>"

cells: a reference to an array containing all the "<td>" tags, in the order encountered, with "undef" representing any "<th>" tags encountered. Trailing "undef" values will be dropped, and the entire key will be absent unless actual "<td>" tags are found in the row.
Note that prior to version 0.299_01, "<th>" tags were not represented at all.
headers: new with version 0.299_01, this is a reference to an array containing all the "<th>" tags in the row, in the order encountered, with "undef" representing any "<td>" tags. Trailing "undef" values will be dropped, and the entire key will be absent unless actual "<th>" tags are found in the row.
It is the understanding of the current author (TRW) that in valid HTML "<th>" tags must occur inside a "<tr>" element, so they need to be recognized there, rather than (or in addition to) in isolation.

"<th>"

data: the content of the "<th>" tag

"<td>"

data: the content of the "<td>" tag
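
The structure described above can be walked as in the following sketch. It does not call the parser: $tables is a hand-built sample in the documented shape, filled in only for the keys the sketch uses, and row_values() is a hypothetical helper (not part of this module) that merges the sparse "{headers}" and "{cells}" arrays of a row back into one positional list.

```perl
use strict;
use warnings;
use List::Util qw( max );

# Hand-built sample in the documented shape; real data would come from
# parse(). Only the keys used below are filled in.
my $tables = [
    {
        border  => '1',                       # an attribute of <table>
        caption => { data => 'Stooges' },
        rows    => [
            # A row holding only <th> tags has no {cells} key.
            { headers => [ { data => 'Role' }, { data => 'Name' } ] },
            # <th> followed by <td>: each array has undef where the
            # other kind of tag sits, and trailing undefs are dropped.
            {
                headers => [ { data => 'Leader' } ],
                cells   => [ undef, { data => 'Moe Howard' } ],
            },
        ],
    },
];

# Hypothetical helper: merge a row's {headers} and {cells} by position.
sub row_values {
    my ($row) = @_;
    my @cells   = @{ $row->{cells}   || [] };    # either key may be absent
    my @headers = @{ $row->{headers} || [] };
    my $width   = max( scalar @cells, scalar @headers );
    return map {
        my $tag = $headers[$_] || $cells[$_];
        defined $tag ? $tag->{data} : undef;
    } 0 .. $width - 1;
}

for my $table (@$tables) {
    print "Caption: $table->{caption}{data}\n" if $table->{caption};
    for my $row ( @{ $table->{rows} || [] } ) {
        print join( ' | ', map { defined $_ ? $_ : '' } row_values($row) ),
            "\n";
    }
}
```

Checking for the presence of each key before dereferencing it matters here, because "{cells}" and "{headers}" are both documented as absent when a row contains no tags of that kind.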

METHODS

This module is a subclass of HTML::Parser. It provides only one new method, classic(), which is an accessor for the attribute of the same name. The following inherited (or overridden) methods may profitably be called by the user.

new

 my $p = HTML::TableContentParser->new();

This static method instantiates the parser object. The only supported argument is

classic

If this argument is set to 1, "<th>" tags are handled in the pre-0.299_01 way. That is, the "<tr>" hash will not contain a "{headers}" key, and its "{cells}" key will not contain any "undef" values corresponding to "<th>" elements.

If this argument is set to 0, you get the behavior documented for 0.299_01 and after.

If this argument is "undef" or omitted, the value of $HTML::TableContentParser::CLASSIC is used.

No other values are supported -- that is, the author reserves them, and the behavior when you use them may change without warning.

classic

This method returns the value of the "classic" attribute, whether specified or defaulted.

parse

 my $tables = $p->parse( $html );

This method parses the given HTML. The return is a reference to an array containing all the tables found.

GLOBALS

The following global variables, properly localized, can be used to modify the behavior of this module.

$HTML::TableContentParser::CLASSIC

This variable provides the default value of the "classic" argument to new(), and is subject to the same restrictions.
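
As a minimal sketch of what "properly localized" can look like, the following sets the default inside a block with local, so the previous value is restored when the block exits. It assumes that new() takes its argument as a name/value pair (classic => 1), which the text above does not spell out; classic_setting() is a hypothetical helper used only for illustration.

```perl
use strict;
use warnings;
use HTML::TableContentParser;

# Hypothetical helper: build a parser and report its classic attribute.
# Assumption: new() accepts name/value pairs such as classic => 1.
sub classic_setting {
    my @args = @_;
    return HTML::TableContentParser->new(@args)->classic();
}

# Explicit argument: the global is not consulted.
my $explicit = classic_setting( classic => 1 );

my $defaulted;
{
    # local limits the change to this block (and code called from it).
    local $HTML::TableContentParser::CLASSIC = 1;
    $defaulted = classic_setting();    # argument omitted, global used
}
# Here the global has its previous value again.

print "explicit=$explicit defaulted=$defaulted\n";
```

Passing classic to new() is the narrower of the two options, since it affects only the object being constructed; the localized global is mainly useful when the call to new() is buried in code that cannot easily be changed.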

$HTML::TableContentParser::DEBUG

If set to 1, this variable causes debug output to be written to STDERR (via "warn()"). Setting it to any true value (including 1) is unsupported, in the sense that the behavior of this module in response to a true value is explicitly undocumented and can change without notice.

EXPORTS

Nothing.

CAVEATS, BUGS, and TODO

The "rowspan" and "colspan" attributes are reported as keys in the cell's hash, but are otherwise ignored. That is,

 <tr><td colspan="2">Moe</td><td>Howard</td></tr>

occupies three columns in the HTML table, but only two entries are made in the "{cells}" value of the hash that represents this row.
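
Callers who need one entry per column can expand the row themselves. The sketch below is one possible approach, not something this module provides: expand_colspan() is a hypothetical helper, $row is a hand-built stand-in for the parsed form of the row shown above, and "rowspan" is not handled at all.

```perl
use strict;
use warnings;

# Hypothetical helper: expand a row's {cells} into one entry per column,
# repeating each cell's data across the columns it spans.
sub expand_colspan {
    my ($row) = @_;
    my @grid;
    for my $cell ( @{ $row->{cells} || [] } ) {
        if ( !defined $cell ) {
            # Placeholder for a <th>; its own colspan is not visible
            # from {cells}, so it is treated as a single column here.
            push @grid, undef;
            next;
        }
        my $span = $cell->{colspan} || 1;
        $span = 1 unless $span =~ /\A[1-9][0-9]*\z/;   # ignore bad values
        push @grid, ( $cell->{data} ) x $span;
    }
    return @grid;
}

# Hand-built stand-in for the parsed form of the row shown above.
my $row = {
    cells => [
        { colspan => '2', data => 'Moe' },
        { data    => 'Howard' },
    ],
};

my @grid = expand_colspan($row);
print scalar(@grid), " columns: @grid\n";
```

Repeating the data in every spanned column is only one convention; padding the extra columns with "undef" would be an equally reasonable choice, depending on how the grid is consumed.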

AUTHOR

Simon Drabble <sdrabble@cpan.org>

Thomas R. Wyant, III wyant at cpan dot org

COPYRIGHT AND LICENSE

This program is free software; you can redistribute it and/or modify it under the same terms as Perl 5.10.0. For more details, see the full text of the licenses in the directory LICENSES.

This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.