YAPE::Regex - Yet Another Parser/Extractor for Regular Expressions
This document refers to YAPE::Regex version 4.00.
use YAPE::Regex;
use strict;
my $regex = qr/reg(ular\s+)?exp?(ression)?/i;
my $parser = YAPE::Regex->new($regex);
# here is the tokenizing part
while (my $chunk = $parser->next) {
  # ...
}
The "YAPE" hierarchy of modules is an attempt at a unified means of
parsing and extracting content. It maintains a generic interface to
promote simplicity and reusability. The API aims to be powerful yet simple.
The modules tokenize their input (tokenization can be intercepted) and build
trees, so that specific nodes can be extracted.
This module is yet another (?) parser and tree-builder for Perl regular
expressions. It builds a tree out of a regex, but at the moment, the extent of
the extraction tool for the tree is quite limited (see "Extracting
Sections"). However, the tree can be useful to extension modules.
In addition to the base class, "YAPE::Regex", there is the auxiliary
class "YAPE::Regex::Element" (common to all "YAPE" base
classes) that holds the individual nodes' classes. The node classes are
documented in that module's documentation.
- "use YAPE::Regex;"
- "use YAPE::Regex qw( MyExt::Mod );"
If no arguments are supplied, the module is loaded normally, and the node
classes are given the proper inheritance (from
"YAPE::Regex::Element"). If you supply a module (or list of
modules), "import" will automatically include them (if needed)
and set up their node classes with the proper inheritance. That
is, it will append "YAPE::Regex" to @MyExt::Mod::ISA, and
"YAPE::Regex::xxx" to each node class's @ISA (where
"xxx" is the name of the specific node class).
package MyExt::Mod;
use YAPE::Regex 'MyExt::Mod';
# does the work of:
# @MyExt::Mod::ISA = 'YAPE::Regex'
# @MyExt::Mod::text::ISA = 'YAPE::Regex::text'
# ...
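As a rough illustration of the inheritance wiring described above, here is a standalone sketch in plain Perl. It is not the module's actual "import" code: the "Demo::Base" classes and the "setup_inheritance" helper are hypothetical stand-ins, used only to show what appending to each @ISA achieves.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for a base class and two of its node classes.
package Demo::Base;
sub new  { return bless {}, shift }
sub kind { return 'base parser' }

package Demo::Base::text;
sub describe { return 'text node' }

package Demo::Base::anchor;
sub describe { return 'anchor node' }

package main;

# Append $base to the extension's @ISA, and "<base>::xxx" to the @ISA of
# each extension node class "xxx".
sub setup_inheritance {
    my ($base, $ext, @node_classes) = @_;
    no strict 'refs';    # symbolic references are needed to reach @ISA by name
    push @{"${ext}::ISA"}, $base;
    push @{"${ext}::${_}::ISA"}, "${base}::$_" for @node_classes;
    return;
}

setup_inheritance('Demo::Base', 'MyExt::Mod', qw(text anchor));

print 'MyExt::Mod'->new->kind, "\n";         # method found in Demo::Base
print 'MyExt::Mod::text'->describe, "\n";    # method found in Demo::Base::text
```

Because the parent is appended rather than assigned, anything the extension already placed in its own @ISA stays in front and is searched first.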
- "my $p = YAPE::Regex->new($REx);"
Creates a "YAPE::Regex" object, using the contents of $REx as a
regular expression. If $REx isn't already a compiled regex, "new"
attempts to compile it (using "qr//"). If the regex contains an error,
the compilation fails, but the parser carries on as if it had succeeded
and reports the bad token when it reaches it in the course of parsing.
- "my $text = $p->chunk($len);"
Returns the next $len characters in the input string; $len defaults to 30
characters. This is useful for figuring out why a parsing error
occurred.
- "my $done = $p->done;"
Returns true if the parser is done with the input string, and false
otherwise.
- "my $errstr = $p->error;"
Returns the parser error message.
- "my $backref = $p->extract;"
Returns a code reference that, each time it is called, returns the next
back-reference in the regex. For more information on enhancements in
upcoming versions of this module, check "Extracting Sections".
- "my $str = $p->display(...);"
Returns a string representation of the entire content. It first calls the
"parse" method in case there is more data that has not yet been
parsed, then calls the "fullstring" method on the root nodes.
See the "YAPE::Regex::Element" docs for the arguments to
"fullstring".
- "my $node = $p->next;"
Returns the next token, or "undef" if there is no valid token.
There will be an error message (accessible with the "error"
method) if there was a problem in the parsing.
- "my $node = $p->parse;"
Calls "next" until all the data has been parsed.
- "my $node = $p->root;"
Returns the root node of the tree structure.
- "my $state = $p->state;"
Returns the current state of the parser. It is one of the following values:
"alt", "anchor", "any", "backref",
"capture(N)", "Cchar", "class", "close",
"code", "comment", "cond(TYPE)",
"ctrl", "cut", "done", "error",
"flags", "group", "hex", "later",
"lookahead(neg|pos)", "lookbehind(neg|pos)",
"macro", "named", "oct", "slash",
"text", and "utf8hex".
For "capture(N)", N is the number of the captured pattern.
For "cond(TYPE)", TYPE is either a number representing
the back-reference that the conditional depends on, or the string
"assert".
For "lookahead" and "lookbehind", the value in parentheses is "neg"
or "pos", depending on the type of assertion.
- "my $node = $p->top;"
Synonymous with "root".
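The methods above can be combined into a small tokenizing loop. The following is a minimal sketch, assuming that "state" describes the token most recently returned by "next" and that "done" is true only after a clean parse; the "walk" helper is a hypothetical name, and what the second call prints depends on how the parser reports the unclosed group.

```perl
use strict;
use warnings;
use YAPE::Regex;

# Hypothetical helper: tokenize one regex, printing the parser state
# after each token. Returns 1 on a clean parse and 0 otherwise.
sub walk {
    my ($regex) = @_;
    my $p = YAPE::Regex->new($regex);

    while (my $node = $p->next) {
        # state() names the kind of token; ref($node) is its node class
        printf "%-16s %s\n", $p->state, ref $node;
    }

    if ($p->done) {
        print "parsed: ", $p->display, "\n";
        return 1;
    }

    # Not done: show the message and the input that follows the bad token
    printf "error: %s (near '%s')\n", $p->error, $p->chunk(20);
    return 0;
}

walk(qr/reg(ular\s+)?exp?(ression)?/i);    # an already-compiled regex
walk('foo(bar');                           # a string with an unclosed group
```

Passing a length to "chunk" keeps the error line short; without an argument it returns up to 30 characters.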
While extraction of nodes is the goal of the "YAPE" modules, the
author is at a loss for words as to what needs to be extracted from a regex.
At the current time, all the "extract" method does is allow you
access to the regex's set of back-references:
my $extor = $parser->extract;
while (my $backref = $extor->()) {
  # ...
}
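The loop above could be filled in as follows to print each captured group. This is only a sketch: it assumes the iterator yields the capture nodes in order, that "fullstring" can be called without arguments, and that calling "parse" first is harmless (it is done here so that the whole regex has been seen before extraction starts).

```perl
use strict;
use warnings;
use YAPE::Regex;

my $parser = YAPE::Regex->new(qr/reg(ular\s+)?exp?(ression)?/i);
$parser->parse;    # make sure the whole regex has been tokenized

my $next_capture = $parser->extract;
my $n = 0;
while (my $capture = $next_capture->()) {
    # fullstring() is described in the YAPE::Regex::Element docs
    printf "\$%d: %s\n", ++$n, $capture->fullstring;
}
```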
"japhy" is very open to suggestions on the approach to node
extraction (both how the API should look and what it should
offer). Preliminary ideas include extraction keywords like the output of
Perl's -Dr switch (or the "re" module's "debug" option).
- "YAPE::Regex::Explain"
Presents an explanation of a regular expression, node by node.
- "YAPE::Regex::Reverse" (Not released)
Reverses the nodes of a regular expression.
This is a listing of things to add to future versions of this module.
- Create a robust "extract" method
Open to suggestions.
Following is a list of known or reported bugs.
- "use charnames ':full'"
To understand "\N{...}" properly, you must be using Perl 5.6.0 or
higher. However, the parser only knows how to resolve full names (those
made available by "use charnames ':full'"). There might be an option in
the future to specify a class name.
See the "YAPE::Regex::Element" documentation for information on the node
classes. See also "Text::Balanced", Damian Conway's excellent module,
which is used for matching "(?{ ... })" and "(??{ ... })"
blocks.
The original author is Jeff "japhy" Pinyan (CPAN ID: PINYAN).
Gene Sullivan (gsullivan@cpan.org) is a co-maintainer.
This module is free software; you can redistribute it and/or modify it under the
same terms as Perl itself. See perlartistic.