Adds to the internal hash of tags which never contain any out-of-tag content. This hash is %YAPE::HTML::EMPTY, and contains the following tag names: area, base, br, hr, img, input, link, meta, and param. Deletion from this hash must be done manually. Adding to this hash automatically adds to the %OPEN hash, described next.
Adds to the internal hash of tags which do not require a closing tag. This hash is %YAPE::HTML::OPEN, and contains the following tag names: area, base, br, dd, dt, hr, img, input, li, link, meta, p, and param. Deletion from this hash must be done manually.
There is a subtle difference between empty and open tags. For example, the <AREA> tag contains a few attributes, but there is no text associated with it (nor any other tags), and it is therefore empty; the <LI> tag, on the other hand, can contain text and other tags but does not require a closing tag, and it is therefore open.
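The relationship between the two hashes can be modeled in a few lines. This is a minimal sketch, not the module's own code: the lexical %EMPTY and %OPEN are stand-ins seeded with the tag names listed above, and add_empty, add_open, tag_kind, the label 'closed', and the extra tag wbr are all hypothetical names introduced for illustration.

```perl
use strict;
use warnings;

# Stand-ins for %YAPE::HTML::EMPTY and %YAPE::HTML::OPEN (assumption:
# plain lookup hashes keyed by lowercase tag name).
my %EMPTY = map { $_ => 1 } qw( area base br hr img input link meta param );
my %OPEN  = map { $_ => 1 } qw( area base br dd dt hr img input li link meta p param );

# Hypothetical helper: registering an empty tag also registers it as open.
sub add_empty {
    for my $tag (map { lc $_ } @_) {
        $EMPTY{$tag} = 1;
        $OPEN{$tag}  = 1;
    }
    return;
}

# Hypothetical helper: registering an open tag leaves %EMPTY untouched.
sub add_open {
    $OPEN{ lc $_ } = 1 for @_;
    return;
}

# Classify a tag: 'empty' (no content, no closing tag), 'open' (content
# allowed, closing tag optional), or 'closed' (closing tag required).
sub tag_kind {
    my $tag = lc shift;
    return 'empty' if $EMPTY{$tag};
    return 'open'  if $OPEN{$tag};
    return 'closed';
}

add_empty('wbr');    # hypothetical extra tag, for illustration only
printf "%-5s => %s\n", $_, tag_kind($_) for qw( area li table wbr );
```

Checking %EMPTY before %OPEN matters here, because every empty tag is also present in the open hash.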
use YAPE::HTML qw( MyExt::Mod );
If supplied no arguments, the module is loaded normally, and the node classes are given the proper inheritance (from YAPE::HTML::Element). If you supply a module (or list of modules), import will automatically include them (if needed) and set up their node classes with the proper inheritance; that is, it will append YAPE::HTML to @MyExt::Mod::ISA, and YAPE::HTML::xxx to each node class's @ISA (where xxx is the name of the specific node class).
It also copies the %OPEN and %EMPTY hashes, as well as the OPEN() and EMPTY() functions, into the MyExt::Mod namespace. This process is designed to save you from having to place @ISA assignments all over the place.
It also copies the %SSI hash. Altering this hash is not recommended, so it has no public interface (you have to fiddle with it yourself). It exists to ensure an SSI is valid.
my $p = YAPE::HTML->new($HTML, $strict);
Creates a YAPE::HTML object, using the contents of the $HTML string as its HTML to parse. The optional second argument determines whether this parser instance will demand strict comment parsing and require all tags to be closed with a closing tag or a / at the end of the tag (<HR />). Any true value (except for the special string -NO_STRICT) will turn strict parsing on. This is off by default. (This could be considered a bug.)
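The rule for the second argument can be restated as a small predicate. This is a sketch of the documented behavior only; strict_requested is a hypothetical helper, not a function provided by YAPE::HTML, and the sample values in the loop are arbitrary.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the rule described above.
sub strict_requested {
    my ($flag) = @_;
    return 0 unless $flag;                # undef, 0 and '' leave strict parsing off
    return 0 if $flag eq '-NO_STRICT';    # the one true value that still means "off"
    return 1;                             # any other true value turns it on
}

for my $flag (undef, 0, 1, 'yes', '-NO_STRICT') {
    my $shown = defined($flag) ? "'$flag'" : 'undef';
    printf "%-13s => %s\n", $shown, strict_requested($flag) ? 'strict' : 'not strict';
}
```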
my $text = $p->chunk($len);
Returns the next $len characters in the input string; $len defaults to 30 characters. This is useful for figuring out why a parsing error occurs.
my $done = $p->done;
my $errstr = $p->error;
Returns the parser error message.
my $coderef = $p->extract(...);
Returns a code reference that returns the next object that matches the criteria given in the arguments. This is a fundamental feature of the module, and it is described in detail under Extracting Sections.
my $string = $p->display(...);
Returns a string representation of the entire content. It calls the parse method in case there is more data that has not yet been parsed. This calls the fullstring method on the root nodes. Check the YAPE::HTML::Element docs on the arguments to fullstring.
my $node = $p->next;
my $node = $p->parse;
Calls next until all the data has been parsed.
my $attr = $p->quote($string);
Returns a quoted string, suitable for use as an attribute. It turns any embedded " characters into &quot;. This can also be called as a raw function:
my $attr = YAPE::HTML::quote($string);
my $root = $p->root;
Returns an array reference holding the root of the tree structure. For documents that contain multiple top-level tags, this will have more than one element.
my $state = $p->state;
Returns the current state of the parser. It is one of the following values: close(TAG), comment, done, dtd, error, open(TAG), pi, ssi, text, text(script), or text(xmp). The open and close states contain the name of the element in parentheses (e.g. open(img)). Tag names, as well as the names of attributes, are converted to lowercase. The state of text(script) refers to text found inside a <SCRIPT> element, and likewise for text(xmp).
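Because some states carry a name in parentheses, code that inspects the state may want to split it into two parts. The following is a minimal sketch operating on plain strings; split_state is a hypothetical helper and not part of YAPE::HTML.

```perl
use strict;
use warnings;

# Hypothetical helper: split a state string such as "open(img)" into its
# kind ("open") and the optional name in parentheses ("img").
# Returns an empty list if the string does not look like a state.
sub split_state {
    my ($state) = @_;
    my ($kind, $detail) = $state =~ /^(\w+)(?:\(([^)]+)\))?$/
        or return;
    return ($kind, $detail);
}

for my $state ('open(img)', 'close(table)', 'text(script)', 'comment', 'done') {
    my ($kind, $detail) = split_state($state);
    printf "%-13s kind=%-8s detail=%s\n", $state, $kind, $detail // '(none)';
}
```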
my $HTMLnode = $p->top;
Returns the first <HTML> node it finds in the tree structure.
YAPE::HTML allows comprehensive extraction of tags, text, comments, DTDs, PIs, and SSIs, using a simple, yet rich, syntax:
my $extor = $parser->extract( TYPE => [ REQS ], ... );
TYPE can be either the name of a tag ("table"), a regular expression that matches tags (qr/^t[drh]$/), or a special string to match all tags (-TAG), all text (-TEXT), all comments (-COMMENT), all DTDs (-DTD), all PIs (-PI), and all SSIs (-SSI).
REQS varies from element to element:
o -TAG, -DTD, -PI, -SSI
A list of attributes that the tag/DTD/PI/SSI must have.
o -TEXT, -COMMENT
A list of strings and regexes that the content of the text/comment must have or match.
Here are some example uses:
o all tags starting with h
my $extor = $parser->extract(qr/^h/ => []);
o all tags with an align attribute
my $extor = $parser->extract(-TAG => ['align']);
o all text containing the word japhy
my $extor = $parser->extract(-TEXT => [qr/\bjaphy\b/i]);
o tags involving links
my $extor = $parser->extract(
  a    => ['href'],
  area => ['href'],
  base => ['href'],
  body => ['background'],
  img  => ['src'],
  # ...
);
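The three ways of writing TYPE for tags (a literal name, a regular expression, or -TAG) can be illustrated with a small stand-alone model. This is only a sketch of the matching rule as described above, not YAPE::HTML's implementation; type_matches_tag is a hypothetical helper, the case-insensitive comparison is an assumption, and REQS checking is left out entirely.

```perl
use strict;
use warnings;

# Hypothetical model: does a tag name satisfy a TYPE given to extract()?
sub type_matches_tag {
    my ($type, $tag) = @_;
    if (ref($type) eq 'Regexp') {            # e.g. qr/^t[drh]$/
        return $tag =~ $type ? 1 : 0;
    }
    return 1 if $type eq '-TAG';             # special string: every tag
    return lc($type) eq lc($tag) ? 1 : 0;    # literal tag name
}

# TYPE specs paired with a label for printing.
my @types = (
    [ 'table',        '"table"'      ],
    [ qr/^t[drh]$/,   'qr/^t[drh]$/' ],
    [ '-TAG',         '-TAG'         ],
);

for my $tag (qw( table td h1 )) {
    my @hits = map { $_->[1] } grep { type_matches_tag($_->[0], $tag) } @types;
    printf "%-5s matched by: %s\n", $tag, join(', ', @hits);
}
```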
This is a list of special features of YAPE::HTML.
o On-the-fly cleaning of HTML
A tag that should have been closed but was left open is closed upon request for output. In addition, tags that are left dangling open at the end of an HTML document get closed.
If strict checking is off, the only error you'll receive from mismatched HTML tags is a closing tag out-of-place.
This is a listing of things to add to future versions of this module.
o HTML entity translation (via HTML::Entities no doubt)
Add a flag to the fullstring method of objects, -EXPAND, which will display &...; HTML escapes as the character representing them.
o Toggle case of output (lower/upper case)
Add a flag to the fullstring method of objects, -UPPER, which will display tag and attribute names in uppercase.
o Super-strict syntax checking
DTD-like strictness with regard to nesting of elements; for example, <LI> is not allowed to be outside an <OL> or <UL> element.
o Make it faster, of course
There's probably some inherent slowness to this method, but it works. And it supports the robust extract method.
o Combine CLOSED and IMPLICIT
Make three constants, CLOSED_NO, CLOSED_YES, and CLOSED_IMPL.
Following is a list of known or reported bugs.
o Inheritance fixed again (fixed in 1.11)
o Inheritance was fouled up (fixed in 1.10)
o The above features aren't in here yet. ;)
o Strict syntax-checking is not on by default.
o This documentation might be incomplete.
o DTD, PI, and SSI support is incomplete.
o Probably need some more test cases.
o SSI conditional tags don't contain content.
Visit YAPE's web site at http://www.pobox.com/~japhy/YAPE/.
The YAPE::HTML::Element documentation, for information on the node classes.
Jeff "japhy" Pinyan CPAN ID: PINYAN firstname.lastname@example.org http://www.pobox.com/~japhy/
perl v5.20.3, HTML(3), 2001-02-06