HTML::TableContentParser(3) User Contributed Perl Documentation HTML::TableContentParser(3)

HTML::TableContentParser - Do interesting things with the contents of tables.

  use HTML::TableContentParser;
  my $p = HTML::TableContentParser->new();
  my $html = read_html_from_somewhere();
  my $tables = $p->parse( $html );
  for my $t (@$tables) {
    for my $r (@{$t->{rows}}) {
      print 'Row:';
      for my $c (@{$r->{cells}}) {
        print " [$c->{data}]";
      }
      print "\n";
    }
  }

This package parses tables out of HTML. The return from the parse is a reference to an array containing the tables found.

Tables appear in the output in the order in which they are encountered. If a table is nested inside a cell of another table, it will appear after the containing table in the output, and any connection between the two will be lost. As of version 0.200_01, the appearance of a nested table should not cause any truncation of the containing table.
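The ordering rule for nested tables can be checked with a small script. This is only a sketch: the "id" attributes and the sample markup are made up for illustration, and it relies on the statement below that tag attributes appear as hash keys.

```perl
use strict;
use warnings;
use HTML::TableContentParser;

# Illustrative markup: an "inner" table nested in a cell of "outer".
my $html = <<'HTML';
<table id="outer">
  <tr><td>before<table id="inner"><tr><td>nested</td></tr></table></td></tr>
</table>
HTML

my $tables = HTML::TableContentParser->new()->parse( $html );

# Both tables are returned as siblings in one flat list; nothing in
# the result records that "inner" was found inside "outer".
print scalar( @$tables ), " table(s) found\n";
print join( ', ', map { defined $_->{id} ? $_->{id} : '(no id)' } @$tables ), "\n";
```

Going by the description above, the containing table should be listed first. If the relationship between the two tables matters, it has to be recovered by other means.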

The following tags are processed by this module: "<table>", "<caption>", "<tr>", "<th>", and "<td>". In the return from the parse method, each tag is represented by a hash reference, having the tag's attributes as keys, and the attribute values as values. In addition, the following keys will be provided:

"<table>"
    caption
        the "<caption>" tag, if any
    headers
        a reference to an array containing all the "<th>" tags, in the order encountered
    rows
        a reference to an array containing all the "<tr>" tags, in the order encountered

"<caption>"
    data
        the content of the "<caption>" tag

"<tr>"
    cells
        a reference to an array containing all the "<td>" tags, in the order encountered, with "undef" representing any "<th>" tags encountered. Trailing "undef" values will be dropped, and the entire key will be absent unless actual "<td>" tags are found in the row.

        Note that prior to version 0.299_01, "<th>" tags were not represented at all.

    headers
        new with version 0.299_01, this is a reference to an array containing all the "<th>" tags in the row, in the order encountered, with "undef" representing any "<td>" tags. Trailing "undef" values will be dropped, and the entire key will be absent unless actual "<th>" tags are found in the row.

        It is the understanding of the current author (TRW) that in valid HTML "<th>" tags must occur inside a "<tr>" element, so they need to be recognized there, rather than (or in addition to) in isolation.

"<th>"
    data
        the content of the "<th>" tag

"<td>"
    data
        the content of the "<td>" tag
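
Because "{cells}" and "{headers}" can each contain "undef" placeholders, and either key can be absent, code that walks a row needs to guard against both. The following sketch merges the two arrays into one list of texts per row. The helper name row_texts() is hypothetical, and the sample row is built by hand to match the structure described above rather than produced by the parser.

```perl
use strict;
use warnings;

# Return the text at every position in a row, taking the <td> entry
# where there is one and the <th> entry at the same index otherwise.
sub row_texts {
    my ( $row ) = @_;
    my @cells   = @{ $row->{cells}   || [] };   # key may be absent
    my @headers = @{ $row->{headers} || [] };   # key may be absent
    my $width   = @cells > @headers ? @cells : @headers;
    my @texts;
    for my $i ( 0 .. $width - 1 ) {
        my $tag = $cells[$i] || $headers[$i];
        push @texts, defined $tag ? $tag->{data} : undef;
    }
    return @texts;
}

# Hand-built row shaped like the result for
#   <tr><th>Name</th><td>Moe</td><td>Howard</td></tr>
my $row = {
    headers => [ { data => 'Name' } ],   # trailing undefs dropped
    cells   => [ undef, { data => 'Moe' }, { data => 'Howard' } ],
};

print join( ' | ', row_texts( $row ) ), "\n";   # prints "Name | Moe | Howard"
```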

This module is a subclass of HTML::Parser. It provides only one new method, classic(), which is an accessor for the attribute of the same name. The following inherited (or overridden) methods may profitably be called by the user.

 my $p = HTML::TableContentParser->new();

This static method instantiates the parser object. The only supported argument is

classic
    If this argument is set to 1, "<th>" tags are handled in the pre-0.299_01 way. That is, the "<tr>" hash will not contain a "{headers}" key, and its "{cells}" key will not contain any "undef" values corresponding to "<th>" elements.

    If this argument is set to 0, you get the behavior documented for 0.299_01 and after.

    If this argument is "undef" or omitted, the value of $HTML::TableContentParser::CLASSIC is used.

    No other values are supported -- that is, the author reserves them, and the behavior when you use them may change without warning.
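
The difference between the two settings can be seen by parsing the same row both ways. This sketch assumes that new() accepts the argument as a name/value pair ("classic => 1"), which the text above does not spell out. The helper first_row() and the sample markup are illustrative only.

```perl
use strict;
use warnings;
use HTML::TableContentParser;

my $html = '<table><tr><th>Name</th><td>Moe</td></tr></table>';

# Parse $html with the given classic setting and return its first row.
# Assumption: new() takes its argument as a name/value pair.
sub first_row {
    my ( $classic ) = @_;
    my $p = HTML::TableContentParser->new( classic => $classic );
    return $p->parse( $html )->[0]{rows}[0];
}

for my $classic ( 0, 1 ) {
    my $row = first_row( $classic );
    printf "classic=%d: %d cell slot(s), headers key %s\n",
        $classic,
        scalar @{ $row->{cells} || [] },
        exists $row->{headers} ? 'present' : 'absent';
}
```

If the description above holds, the non-classic parse should report an extra "undef" slot in "{cells}" for the "<th>" and a "{headers}" key, while the classic parse should report neither.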

The classic() method returns the value of the "classic" attribute, whether specified or defaulted.

 my $tables = $p->parse( $html );

This method parses the given HTML. The return is a reference to an array containing all the tables found.

The following global variables, properly localized, can be used to modify the behavior of this module.

$HTML::TableContentParser::CLASSIC provides the default value of the "classic" argument to new(), and is subject to the same restrictions.
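
"Properly localized" here means using Perl's "local", so the previous value is restored when the enclosing scope exits. A minimal sketch, with a hypothetical wrapper name parse_classic():

```perl
use strict;
use warnings;
use HTML::TableContentParser;

# Parse $html with classic handling of <th>, without changing the
# default seen by any other code in the program.
sub parse_classic {
    my ( $html ) = @_;
    # local restores the previous value when this sub returns
    local $HTML::TableContentParser::CLASSIC = 1;
    my $p = HTML::TableContentParser->new();   # picks up the localized default
    return $p->parse( $html );
}

my $tables = parse_classic( '<table><tr><th>a</th><td>b</td></tr></table>' );
print exists $tables->[0]{rows}[0]{headers}
    ? "headers key present\n"
    : "headers key absent\n";
```

Both the constructor call and the parse are kept inside the localized scope, so the sketch does not depend on when the module reads the variable.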

The debug variable, if set to 1, causes debug output to STDERR (via "warn()"). Setting it to any true value (including 1) is unsupported in the sense that the behavior of this module in response to any true value is explicitly undocumented, and can change without notice.

This module exports nothing.

The "rowspan" and "colspan" attributes are reported but ignored. That is,

 <tr><td colspan="2">Moe</td><td>Howard</td></tr>

occupies three columns in the HTML table, but only two entries are made in the "{cells}" value of the hash that represents this row.
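
Callers that need column-aligned data can expand the spans themselves, since the "colspan" attribute is still reported on the cell hash. The sketch below handles "colspan" only, not "rowspan". The helper name expand_colspan() is hypothetical, and the sample row is hand-built to match the example above.

```perl
use strict;
use warnings;

# Expand a row's {cells} so that a cell with colspan="N" fills N slots.
# Missing or malformed colspan values are treated as 1.
sub expand_colspan {
    my ( $row ) = @_;
    my @out;
    for my $cell ( @{ $row->{cells} || [] } ) {
        # An undef entry marks a <th>; it keeps a single empty slot.
        my $span = ref $cell ? $cell->{colspan} : undef;
        $span = 1 unless defined $span && $span =~ /\A[1-9][0-9]*\z/;
        my $text = ref $cell ? $cell->{data} : undef;
        push @out, ( $text ) x $span;
    }
    return @out;
}

# Hand-built row shaped like the result for
#   <tr><td colspan="2">Moe</td><td>Howard</td></tr>
my $row = {
    cells => [ { colspan => '2', data => 'Moe' }, { data => 'Howard' } ],
};

my @columns = expand_colspan( $row );
print scalar( @columns ), " columns: @columns\n";   # prints "3 columns: Moe Moe Howard"
```

Repeating the text in every spanned slot is one choice. Leaving the extra slots "undef" would work equally well if the caller only needs the positions to line up.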

This module is a very specific tool to address a very specific problem. One of the following modules may better address your needs.

HTML::Parser. This is a general HTML parser, which forms the basis for this module.

HTML::TreeBuilder. This is a general HTML parser, with methods to search and traverse the parse tree once generated.

Mojo::DOM in the Mojolicious distribution. This is a general HTML/XML DOM parser, with methods to search the parse tree using CSS selectors.

Simon Drabble <sdrabble@cpan.org>

Thomas R. Wyant, III wyant at cpan dot org

Copyright (C) 2002 Simon Drabble

Copyright (C) 2017-2018 Thomas R. Wyant, III

This program is free software; you can redistribute it and/or modify it under the same terms as Perl 5.10.0. For more details, see the full text of the licenses in the directory LICENSES.

This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.

2018-02-01 perl v5.32.1
