Manual Reference Pages  -  HTML::TABLEPARSER (3)

NAME

HTML::TableParser - Extract data from an HTML table

SYNOPSIS



  use HTML::TableParser;

  @reqs = (
           {
            id => 1.1,                    # id for embedded table
            hdr => \&header,              # function callback
            row => \&row,                 # function callback
            start => \&start,             # function callback
            end => \&end,                 # function callback
            udata => { Snack => 'Food' }, # arbitrary user data
           },
           {
            id => 1,                      # table id
            cols => [ 'Object Type',
                      qr/object/ ],       # column name matches
            obj => $obj,                  # method callbacks
           },
          );

  # create parser object
  $p = HTML::TableParser->new( \@reqs,
                   { Decode => 1, Trim => 1, Chomp => 1 } );
  $p->parse_file( 'foo.html' );


  # function callbacks
  sub start {
    my ( $id, $line, $udata ) = @_;
    #...
  }

  sub end {
    my ( $id, $line, $udata ) = @_;
    #...
  }

  sub header {
    my ( $id, $line, $cols, $udata ) = @_;
    #...
  }

  sub row  {
    my ( $id, $line, $cols, $udata ) = @_;
    #...
  }



DESCRIPTION

HTML::TableParser uses HTML::Parser to extract data from an HTML table. The data is returned via a series of user defined callback functions or methods. Specific tables may be selected either by matching a unique table id or by matching against the column names. Multiple (even nested) tables may be parsed in a document in one pass.

    Table Identification

Each table is given a unique id, relative to its parent, based upon its order and nesting. The first top level table has id 1, the second 2, etc. The first table nested in table 1 has id 1.1, the second 1.2, etc. The first table nested in table 1.1 has id 1.1.1, etc. These, as well as the tables’ column names, may be used to identify which tables to parse.
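
The numbering scheme can be sketched in plain Perl. The snippet below is an illustration only, not the module's own code: table_ids() is a hypothetical helper that takes a list of 'open' and 'close' events (one per <table> and </table> tag) and derives the id each table would receive under the rules above.

```perl
use strict;
use warnings;

# Assign ids to tables given a stream of 'open'/'close' events.
# table_ids() is a hypothetical helper, not part of HTML::TableParser.
sub table_ids {
    my @events   = @_;
    my @counters = (0);    # number of tables seen so far at each nesting depth
    my @path;              # id components of the currently open tables
    my @ids;

    for my $event (@events) {
        if ( $event eq 'open' ) {
            push @path, ++$counters[-1];
            push @ids, join( '.', @path );
            push @counters, 0;    # fresh counter for tables nested in this one
        }
        else {
            pop @counters;
            pop @path;
        }
    }
    return @ids;
}

my @ids = table_ids(
    qw( open open open close close open close close open close ) );
print join( ' ', @ids ), "\n";    # prints "1 1.1 1.1.1 1.2 2"
```

The first table contains two nested tables (1.1 and 1.2), the first of which contains a further table (1.1.1); the second top level table gets id 2.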

    Data Extraction

As the parser traverses a selected table, it will pass data to user provided callback functions or methods after it has digested particular structures in the table. All functions are passed the table id (as described above), the line number in the HTML source where the table was found, and a reference to any table specific user provided data.
Table Start: The start callback is invoked when a matched table has been found.
Table End: The end callback is invoked after a matched table has been parsed.
Header: The hdr callback is invoked after the table header has been read in. Some tables do not use the <th> tag to indicate a header, so this function may not be called. It is passed the column names.
Row: The row callback is invoked after a row in the table has been read. It is passed the column data.
Warn: The warn callback is invoked when a non-fatal error occurs during parsing. Fatal errors croak.
New: This is the class method to call to create a new object when HTML::TableParser is supposed to create new objects upon table start.

    Callback API

Callbacks may be functions or methods or a mixture of both. In the latter case, an object must be passed to the constructor. (More on that later.)

The callbacks are invoked as follows:



  start( $tbl_id, $line_no, $udata );

  end( $tbl_id, $line_no, $udata );

  hdr( $tbl_id, $line_no, \@col_names, $udata );

  row( $tbl_id, $line_no, \@data, $udata );

  warn( $tbl_id, $line_no, $message, $udata );

  new( $tbl_id, $udata );
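
As a rough end-to-end illustration of these signatures, the sketch below parses a small in-memory document and records each callback as it fires. The HTML and the @log array are made up for the example. It calls parse() followed by eof(), the latter being inherited from HTML::Parser.

```perl
use strict;
use warnings;
use HTML::TableParser;

# Sample document; the table and its contents are made up for illustration.
my $html = <<'END_HTML';
<html><body>
<table>
  <tr><th>Name</th><th>Qty</th></tr>
  <tr><td> apple </td><td>3</td></tr>
  <tr><td> pear </td><td>5</td></tr>
</table>
</body></html>
END_HTML

my @log;    # records each callback invocation in order

my @reqs = (
    {
        id    => 1,    # first top level table
        start => sub { my ( $id, $line, $udata ) = @_; push @log, "start $id" },
        hdr   => sub { my ( $id, $line, $cols, $udata ) = @_; push @log, "hdr @$cols" },
        row   => sub { my ( $id, $line, $cols, $udata ) = @_; push @log, "row @$cols" },
        end   => sub { my ( $id, $line, $udata ) = @_; push @log, "end $id" },
    },
);

my $p = HTML::TableParser->new( \@reqs, { Trim => 1 } );
$p->parse($html);
$p->eof;    # inherited from HTML::Parser; signals end of document

print "$_\n" for @log;
```

With Trim enabled, the padding around the cell contents should be removed before the row callback sees the data.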



    Data Cleanup

There are several cleanup operations that may be performed automatically:
Chomp: chomp() the data.
Decode: Run the data through HTML::Entities::decode.
DecodeNBSP: Normally HTML::Entities::decode changes a non-breaking space into a character which doesn’t seem to be matched by Perl’s whitespace regexp. Setting this attribute changes the HTML nbsp character to a plain ol’ blank.
Trim: Remove leading and trailing white space.
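
The effect of these options on a single cell can be approximated in plain Perl. This is a sketch, not the module's implementation: clean_cell() is a hypothetical helper, the order of operations is an assumption, and the text is assumed to be already entity-decoded so that a non-breaking space is the single character U+00A0.

```perl
use strict;
use warnings;

# clean_cell() is a hypothetical helper mimicking the effect of the
# Chomp, DecodeNBSP and Trim options on one cell's text.
sub clean_cell {
    my ( $text, %opt ) = @_;
    chomp $text              if $opt{Chomp};         # drop one trailing newline
    $text =~ tr/\x{A0}/ /    if $opt{DecodeNBSP};    # nbsp becomes a plain blank
    $text =~ s/^\s+|\s+$//g  if $opt{Trim};          # strip leading/trailing space
    return $text;
}

my $raw     = "\x{A0}  42 km/s \n";
my $cleaned = clean_cell( $raw, Chomp => 1, DecodeNBSP => 1, Trim => 1 );
print "[$cleaned]\n";    # prints "[42 km/s]"
```

Converting the non-breaking space before trimming is what lets the whitespace pattern remove it.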

    Data Organization

Column names are derived from cells delimited by the <th> and </th> tags. Some tables have header cells which span one or more columns or rows to make things look nice. HTML::TableParser determines the actual number of columns used and provides column names for each column, repeating names for spanned columns and concatenating spanned rows and columns. For example, if the table header looks like this:



 +----+--------+----------+-------------+-------------------+
 |    |        | Eq J2000 |             | Velocity/Redshift |
 | No | Object |----------| Object Type |-------------------|
 |    |        | RA | Dec |             | km/s |  z  | Qual |
 +----+--------+----------+-------------+-------------------+



The columns will be:



  No
  Object
  Eq J2000 RA
  Eq J2000 Dec
  Object Type
  Velocity/Redshift km/s
  Velocity/Redshift z
  Velocity/Redshift Qual



Row data are derived from cells delimited by the <td> and </td> tags. Cells which span more than one column or row are handled correctly, i.e. the values are duplicated in the appropriate places.
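
One way to arrive at the column names in the header example is sketched below. It is not the module's own code: column_names() is a hypothetical helper, and each header cell is represented as a made-up [ text, colspan, rowspan ] triple rather than parsed from HTML.

```perl
use strict;
use warnings;

# column_names() is a hypothetical helper, not the module's own code.
# Each header row is a list of [ text, colspan, rowspan ] cells.
sub column_names {
    my @rows = @_;
    my @names;       # $names[$col] = header texts covering that column
    my @occupied;    # $occupied[$row][$col] is true if a cell covers the slot

    for my $r ( 0 .. $#rows ) {
        my $col = 0;
        for my $cell ( @{ $rows[$r] } ) {
            my ( $text, $colspan, $rowspan ) = @$cell;
            $col++ while $occupied[$r][$col];    # skip slots taken by row spans
            for my $c ( $col .. $col + $colspan - 1 ) {
                push @{ $names[$c] }, $text;     # repeat name across spanned columns
                $occupied[$_][$c] = 1 for $r .. $r + $rowspan - 1;
            }
            $col += $colspan;
        }
    }
    return map { join ' ', @$_ } @names;
}

my @header = (
    [
        [ 'No',          1, 2 ], [ 'Object',            1, 2 ],
        [ 'Eq J2000',    2, 1 ],
        [ 'Object Type', 1, 2 ], [ 'Velocity/Redshift', 3, 1 ],
    ],
    [
        [ 'RA',   1, 1 ], [ 'Dec', 1, 1 ],
        [ 'km/s', 1, 1 ], [ 'z',   1, 1 ], [ 'Qual', 1, 1 ],
    ],
);

my @cols = column_names(@header);
print "$_\n" for @cols;
```

A cell spanning two rows contributes its text once, while a cell spanning several columns contributes its text to each of them, which reproduces the eight names listed above.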

METHODS

new


   $p = HTML::TableParser->new( \@reqs, \%attr );



This is the class constructor. It is passed a list of table requests as well as attributes which specify defaults for common operations. Table requests are documented in Table Requests.

The %attr hash provides default values for some of the table request attributes, namely the data cleanup operations ( Chomp, Decode, Trim ), and the multi match attribute MultiMatch, i.e.,



  $p = HTML::TableParser->new( \@reqs, { Chomp => 1 } );



will set Chomp on for all of the table requests, unless overridden by them. The data cleanup operations are documented above; MultiMatch is documented in Table Requests.

Decode defaults to on; all of the others default to off.

parse_file

This is the same function as in HTML::Parser.

parse

This is the same function as in HTML::Parser.

Table Requests

A table request is a hash used by HTML::TableParser to determine which tables are to be parsed, the callbacks to be invoked, and any data cleanup. There may be multiple requests processed by one call to the parser; each table is associated with a single request (even if several requests match the table).

A single request may match several tables; however, unless the MultiMatch attribute is specified for that request, it will be used for the first matching table only.

A table request which matches a table id of DEFAULT will be used as a catch-all request, and will match all tables not matched by other requests. Please note that tables are compared to the requests in the order that the latter are passed to the new() method; place the DEFAULT request last for proper behavior.

    Identifying tables to parse

HTML::TableParser needs to be told which tables to parse. This can be done by matching table ids or column names, or a combination of both. The table request hash elements dedicated to this are:
id: This indicates a match on table id. It can take one of these forms:
exact match


  id => $match
  id => 1.2



Here $match is a scalar which is compared directly to the table id.

regular expression


  id => $re
  id => qr/1\.\d+\.2/



$re is a regular expression, which must be constructed with the qr// operator.

subroutine


  id => \&my_match_subroutine
  id => sub { my ( $id, $oids ) = @_ ;
           $oids->[0] > 3 && $oids->[1] < 2 }



Here id is assigned a coderef to a subroutine which returns true if the table matches, false if not. The subroutine is passed two arguments: the table id as a scalar string (e.g. 1.2.3) and the table id as an arrayref (e.g. $oids = [ 1, 2, 3 ]).

id may be passed an array containing any combination of the above:



  id => [ 1.2, qr/1\.\d+\.2/, sub { ... } ]



Elements in the array may be preceded by a modifier indicating the action to be taken if the table matches on that element. The modifiers and their meanings are:
'-': If the id matches, it is explicitly excluded from being processed by this request.
'--': If the id matches, it is skipped by all requests.
'+': If the id matches, it will be processed by this request. This is the default action.

An example:



  id => [ '-', 1.2, 'DEFAULT' ]



indicates that this request should be used for all tables, except for table 1.2.



  id => [ '--', 1.2 ]



indicates that table 1.2 is just plain skipped altogether.
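
The way a single id list with modifiers might be evaluated can be sketched as follows. id_action() is a hypothetical helper, not the module's own code, and it rests on two assumptions: a modifier applies only to the element that immediately follows it, and the first matching element decides the outcome. For simplicity it treats DEFAULT as matching any id within the list.

```perl
use strict;
use warnings;

# id_action() is a hypothetical helper, not the module's own code. It walks
# an id list and returns the action ('+', '-' or '--') of the first element
# that matches $id, or undef when nothing matches.
sub id_action {
    my ( $id, @spec ) = @_;
    my @oids   = split /\./, $id;
    my $action = '+';

    for my $elem (@spec) {
        if ( !ref($elem) && $elem =~ /^(?:\+|--?)$/ ) {
            $action = $elem;    # remember the modifier for the next element
            next;
        }
        my $hit =
            ref($elem) eq 'CODE'   ? $elem->( $id, \@oids )
          : ref($elem) eq 'Regexp' ? scalar( $id =~ $elem )
          : $elem eq 'DEFAULT'     ? 1
          :                          $id eq $elem;
        return $action if $hit;
        $action = '+';          # assumption: a modifier covers one element
    }
    return;
}

my @spec = ( '-', 1.2, 'DEFAULT' );
print "$_: ", ( id_action( $_, @spec ) // 'no match' ), "\n"
  for qw( 1 1.2 2.1 );
```

Under these assumptions the sketch reports '-' for table 1.2 and '+' for the others. Note that Perl stringifies the numeric literal 1.10 as 1.1, so ids with trailing zeros are safer written as quoted strings.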

cols: This indicates a match on column names. It can take one of these forms:
exact match


  cols => $match
  cols => 'Snacks01'



Here $match is a scalar which is compared directly to the column names. If any column matches, the table is processed.

regular expression


  cols => $re
  cols => qr/Snacks\d+/



$re is a regular expression, which must be constructed with the qr// operator. Again, a successful match against any column name causes the table to be processed.

subroutine


  cols => \&my_match_subroutine
  cols => sub { my ( $id, $oids, $cols ) = @_ ;
                ... }



Here cols is assigned a coderef to a subroutine which returns true if the table matches, false if not. The subroutine is passed three arguments: the table id as a scalar string (e.g. 1.2.3), the table id as an arrayref (e.g. $oids = [ 1, 2, 3 ]), and the column names, as an arrayref (e.g. $cols = [ 'col1', 'col2' ]). This option gives the calling routine the ability to make arbitrary selections based upon table id and columns.

cols may be passed an arrayref containing any combination of the above:



  cols => [ 'Snacks01', qr/Snacks\d+/, sub { ... } ]



Elements in the array may be preceded by a modifier indicating the action to be taken if the table matches on that element. They are the same as the table id modifiers mentioned above.

colre: This is deprecated, and is present for backwards compatibility only. An arrayref containing the regular expressions to match, or a scalar containing a single regular expression.

More than one of these may be used for a single table request. A request may match more than one table. By default a request is used only once (even the DEFAULT id match!). Set the MultiMatch attribute to enable multiple matches per request.

When attempting to match a table, the following steps are taken:
1. The table id is compared to the requests which contain an id match. The first such match is used (in the order given in the passed array).
2. If no explicit id match is found, column name matches are attempted. The first such match is used (in the order given in the passed array).
3. If no column name match is found (or there were none requested), the first request which matches an id of DEFAULT is used.
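
The three steps can be sketched in plain Perl. This is an illustration under simplifying assumptions, not the module's implementation: pick_request() and matches() are hypothetical helpers, each request holds at most one string or qr// object under id and cols, the name key is made up for display, and the single-use rule governed by MultiMatch is left out.

```perl
use strict;
use warnings;

# matches() and pick_request() are hypothetical helpers, not the module's code.
sub matches {
    my ( $spec, $value ) = @_;
    return ref($spec) eq 'Regexp' ? scalar( $value =~ $spec ) : $value eq $spec;
}

sub pick_request {
    my ( $reqs, $id, $cols ) = @_;

    # 1. explicit id matches, in the order the requests were given
    for my $req (@$reqs) {
        next unless defined $req->{id} && $req->{id} ne 'DEFAULT';
        return $req if matches( $req->{id}, $id );
    }

    # 2. column name matches; any matching column selects the request
    for my $req (@$reqs) {
        next unless defined $req->{cols};
        return $req if grep { matches( $req->{cols}, $_ ) } @$cols;
    }

    # 3. fall back to the first DEFAULT request
    for my $req (@$reqs) {
        return $req if defined $req->{id} && $req->{id} eq 'DEFAULT';
    }
    return;
}

my @reqs = (
    { name => 'by-cols',  cols => qr/^Object/ },
    { name => 'by-id',    id   => '1.1' },
    { name => 'fallback', id   => 'DEFAULT' },
);

for my $table ( [ '1.1', ['Object Type'] ], [ '2', [ 'No', 'Object' ] ], [ '3', ['Snack'] ] ) {
    my ( $id, $cols ) = @$table;
    my $req = pick_request( \@reqs, $id, $cols );
    print "$id -> $req->{name}\n";
}
```

Table 1.1 goes to the id request even though the column request is listed first and would also match, because id matches are tried before column matches.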

    Specifying the data callbacks

Callback functions are specified with the callback attributes start, end, hdr, row, and warn. They should be set to code references, i.e.



  %table_req = ( ..., start => \&start_func, end => \&end_func )



To use methods, specify the object with the obj key, and the method names via the callback attributes, which should be set to strings. If you don’t specify method names they will default to (you guessed it) start, end, hdr, row, and warn.



  $obj = SomeClass->new();
  # ...
  %table_req_1 = ( ..., obj => $obj );
  %table_req_2 = ( ..., obj => $obj, start => 'start',
                             end => 'end' );



You can also have HTML::TableParser create a new object for you for each table by specifying the class attribute. By default the constructor is assumed to be the class new() method; if not, specify it using the new attribute:



  use MyClass;
  %table_req = ( ..., class => 'MyClass', new => 'mynew' );



To use a function instead of a method for a particular callback, set the callback attribute to a code reference:



  %table_req = ( ..., obj => $obj, end => \&end_func );



You don’t have to provide all the callbacks. You should not use both obj and class in the same table request.

HTML::TableParser automatically determines if your object or class has one of the required methods. If you wish it not to use a particular method, set it equal to undef. For example,



  %table_req = ( ..., obj => $obj, end => undef )



indicates the object’s end method should not be called, even if it exists.
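
A minimal sketch of method callbacks follows. RowCollector is a made-up class for this example; because its hdr and row methods use the default method names, only obj is given in the request. The rows are copied on the assumption that the parser may reuse the array it passes in.

```perl
use strict;
use warnings;
use HTML::TableParser;

# RowCollector is a hypothetical class used only for this example.
package RowCollector;

sub new { return bless { cols => [], rows => [] }, shift }

# Method callbacks receive the object first, then the usual arguments.
sub hdr {
    my ( $self, $id, $line, $cols, $udata ) = @_;
    $self->{cols} = [@$cols];
}

sub row {
    my ( $self, $id, $line, $cols, $udata ) = @_;
    push @{ $self->{rows} }, [@$cols];    # copy the row data
}

package main;

my $collector = RowCollector->new;
my $p = HTML::TableParser->new( [ { id => 1, obj => $collector } ],
    { Trim => 1 } );

$p->parse( '<table><tr><th>Snack</th><th>Price</th></tr>'
      . '<tr><td>chips</td><td>2</td></tr></table>' );
$p->eof;    # inherited from HTML::Parser

print join( ',', @{ $collector->{cols} } ), "\n";
print join( ',', @$_ ), "\n" for @{ $collector->{rows} };
```

The class defines no start, end, or warn methods, so those callbacks are simply not invoked.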

You can specify arbitrary data to be passed to the callback functions via the udata attribute:



  %table_req = ( ..., udata => \%hash_of_my_special_stuff )
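
For instance, one callback can serve several requests and use udata to keep their results apart. The sketch below is illustrative: the HTML, the %fruit and %veg hashes, and the collect_row() function are all made up for the example.

```perl
use strict;
use warnings;
use HTML::TableParser;

# Made-up document with two top level tables.
my $html = <<'END_HTML';
<table><tr><th>Fruit</th></tr><tr><td>apple</td></tr><tr><td>pear</td></tr></table>
<table><tr><th>Veg</th></tr><tr><td>leek</td></tr></table>
END_HTML

# One row callback shared by both requests; udata says where to store things.
sub collect_row {
    my ( $id, $line, $cols, $udata ) = @_;
    push @{ $udata->{rows} }, $cols->[0];
}

my %fruit = ( rows => [] );
my %veg   = ( rows => [] );

my @reqs = (
    { id => 1, row => \&collect_row, udata => \%fruit },
    { id => 2, row => \&collect_row, udata => \%veg },
);

my $p = HTML::TableParser->new( \@reqs );
$p->parse($html);
$p->eof;    # inherited from HTML::Parser

print "fruit: @{ $fruit{rows} }\n";
print "veg:   @{ $veg{rows} }\n";
```

Because udata is passed as a reference, whatever the callback stores in it is visible to the caller after parsing.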



    Specifying Data cleanup operations

Data cleanup operations may be specified uniquely for each table. The available keys are Chomp, Decode, Trim. They should be set to a non-zero value if the operation is to be performed.

    Other Attributes

The MultiMatch key is used when a request is capable of handling multiple tables in the document. Ordinarily, a request will process a single table only (even DEFAULT requests). Set it to a non-zero value to allow the request to handle more than one table.
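
The difference can be seen by running the same catch-all request with and without MultiMatch. The sketch below is illustrative: the HTML and the started_tables() helper are made up, and a fresh request and parser are built for each run.

```perl
use strict;
use warnings;
use HTML::TableParser;

# Made-up document: two top level tables, the first with a nested table.
my $html = <<'END_HTML';
<table>
  <tr><th>Outer</th></tr>
  <tr><td>
    <table>
      <tr><th>Inner</th></tr>
      <tr><td>x</td></tr>
    </table>
  </td></tr>
</table>
<table>
  <tr><th>Second</th></tr>
  <tr><td>y</td></tr>
</table>
END_HTML

# Parse the document with one DEFAULT request and return the ids of the
# tables for which the start callback fired.
sub started_tables {
    my ($multi) = @_;
    my @seen;
    my $req = {
        id         => 'DEFAULT',
        MultiMatch => $multi,
        start      => sub { my ( $id, $line, $udata ) = @_; push @seen, $id },
    };
    my $p = HTML::TableParser->new( [$req] );
    $p->parse($html);
    $p->eof;    # inherited from HTML::Parser
    return @seen;
}

my @once = started_tables(0);
my @all  = started_tables(1);
print "without MultiMatch: @once\n";
print "with MultiMatch:    @all\n";
```

Going by the description above, the first run should report only the first table, while the second should report all three, including the nested one.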

LICENSE

This software is released under the GNU General Public License. You may find a copy at



   http://www.fsf.org/copyleft/gpl.html



AUTHOR

Diab Jerius (djerius@cpan.org)

SEE ALSO

HTML::Parser, HTML::TableExtract.


perl v5.20.3 HTML::TABLEPARSER (3) 2014-08-22