NAME
"Parse::Lex" - Generator of lexical analyzers - moving pointer inside text

SYNOPSIS
    require 5.005;

    use Parse::Lex;
    @token = (
      qw(
         ADDOP    [-+]
         LEFTP    [\(]
         RIGHTP   [\)]
         INTEGER  [1-9][0-9]*
         NEWLINE  \n
        ),
      qw(STRING),   [qw(" (?:[^"]+|"")* ")],
      qw(ERROR  .*), sub {
        die qq!can\'t analyze: "$_[1]"!;
      }
    );

    Parse::Lex->trace;  # Class method
    $lexer = Parse::Lex->new(@token);
    $lexer->from(\*DATA);
    print "Tokenization of DATA:\n";

    TOKEN:while (1) {
      $token = $lexer->next;
      if (not $lexer->eoi) {
        print "Line $.\t";
        print "Type: ", $token->name, "\t";
        print "Content:->", $token->text, "<-\n";
      } else {
        last TOKEN;
      }
    }

    __END__
    1+2-5
    "a multiline
    string with an embedded "" in it"
    an invalid string with a "" in it"

DESCRIPTION
The classes "Parse::Lex" and "Parse::CLex" create lexical analyzers. They use different analysis techniques:

1. "Parse::Lex" steps through the analysis by moving a pointer within the character strings to be analyzed (use of pos() together with "\G"),

2. "Parse::CLex" steps through the analysis by consuming the data recognized (use of "s///").

Analyzers of the "Parse::CLex" class do not allow the use of anchoring in regular expressions. In addition, the subclasses of "Parse::Token" are not implemented for this type of analyzer.

A lexical analyzer is specified by means of a list of tokens passed as arguments to the new() method. Tokens are instances of the "Parse::Token" class, which comes with "Parse::Lex". The definition of a token usually comprises two arguments: a symbolic name (like "INTEGER"), followed by a regular expression. If a sub ref (anonymous subroutine) is given as third argument, it is called when the token is recognized. Its arguments are the "Parse::Token" instance and the string recognized by the regular expression. The anonymous subroutine's return value is used as the new string contents of the "Parse::Token" instance.

The order in which the lexical analyzer examines the regular expressions is determined by the order in which these expressions are passed as arguments to the new() method. The token returned by the lexical analyzer corresponds to the first regular expression which matches (this strategy is different from that used by Lex, which returns the longest match possible out of all that can be recognized).

The lexical analyzer can recognize tokens which span multiple records. If the definition of the token comprises more than one regular expression (placed within a reference to an anonymous array), the analyzer reads as many records as required to recognize the token (see the documentation for the "Parse::Token" class). When the start pattern is found, the analyzer looks for the end, and if necessary, reads more records.
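
The two analysis techniques and the first-match rule described above can be sketched in core Perl. This is only an illustration, not the module's actual implementation: the rule list @rules and the subroutines scan_by_pointer() and scan_by_consuming() are hypothetical names introduced here.

```perl
use strict;
use warnings;

# Hypothetical rule list: [ name, regex ] pairs, tried in the order given.
my @rules = (
    [ ADDOP   => qr/[-+]/        ],
    [ INTEGER => qr/[1-9][0-9]*/ ],
    [ SPACE   => qr/\s+/         ],
);

# Technique 1: move a pointer through the string with pos() and \G.
# The input string itself is never modified.
sub scan_by_pointer {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;
    POSITION: while (pos($text) < length($text)) {
        for my $rule (@rules) {
            my ($name, $re) = @$rule;
            # /c keeps pos() unchanged when a rule fails to match
            if ($text =~ /\G($re)/gc) {
                push @tokens, "$name:$1" unless $name eq 'SPACE';
                next POSITION;          # first rule that matches wins
            }
        }
        die "can't analyze at offset " . pos($text) . "\n";
    }
    return @tokens;
}

# Technique 2: consume the recognized text from the front with s///.
# The buffer shrinks each time a token is recognized.
sub scan_by_consuming {
    my ($buffer) = @_;
    my @tokens;
    CHUNK: while (length($buffer)) {
        for my $rule (@rules) {
            my ($name, $re) = @$rule;
            if ($buffer =~ s/^($re)//) {
                push @tokens, "$name:$1" unless $name eq 'SPACE';
                next CHUNK;
            }
        }
        die "can't analyze: $buffer\n";
    }
    return @tokens;
}

print join(' ', scan_by_pointer('12 + 7-3')), "\n";
print join(' ', scan_by_consuming('12 + 7-3')), "\n";
```

Both calls print the same token list. In the consuming variant of this sketch, the start of the buffer is always the current position, so an anchor such as "^" no longer refers to the start of the original input. Because the first matching rule wins, a general rule placed before a more specific one hides the specific one.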

No backtracking is done in case of failure.

The analyzer can be used to analyze an isolated character string or a stream of data coming from a file handle. At the end of the input data the analyzer returns a "Parse::Token" instance named "EOI" (End Of Input).

Start Conditions
You can associate start conditions with the token-recognition rules that comprise your lexical analyzer (this is similar to what Flex provides). When start conditions are used, the rule which succeeds is no longer necessarily the first rule that matches.

A token symbol may be preceded by a start condition specifier for the associated recognition rule. For example:

    qw(C1:TERMINAL_1 REGEXP), sub { # associated action
    },
    qw(TERMINAL_2 REGEXP), sub { # associated action
    },

Symbol "TERMINAL_1" will be recognized only if start condition "C1" is active. Start conditions are activated/deactivated using the start(CONDITION_NAME) and end(CONDITION_NAME) methods.

start('INITIAL') resets the analysis automaton.

Start conditions can be combined using AND/OR operators as follows:

    C1:SYMBOL      condition C1
    C1:C2:SYMBOL   condition C1 AND condition C2
    C1,C2:SYMBOL   condition C1 OR condition C2
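
One way to read the AND/OR notation above is sketched below in core Perl. The subroutines parse_label() and rule_applies() are hypothetical helpers written for this illustration and are not part of "Parse::Lex". The sketch assumes that ":" separates groups that must all hold and that "," separates alternatives within a group, and it ignores the inclusive/exclusive distinction.

```perl
use strict;
use warnings;

# Split a rule label such as "C1:C2:SYMBOL" or "C1,C2:SYMBOL" into the
# symbol and its condition groups. Each ":"-separated field is one group
# (AND between groups); "," separates alternatives inside a group (OR).
sub parse_label {
    my ($label) = @_;
    my @fields = split /:/, $label;
    my $symbol = pop @fields;                       # last field is the symbol
    my @groups = map { [ split /,/, $_ ] } @fields;
    return ($symbol, \@groups);
}

# True when every group contains at least one active condition.
# $active is a hash ref whose keys are the active condition names.
sub rule_applies {
    my ($groups, $active) = @_;
    for my $group (@$groups) {
        return 0 unless grep { $active->{$_} } @$group;
    }
    return 1;    # a label without conditions always applies in this sketch
}

my %active = (C1 => 1);    # assume only condition C1 is active
for my $label (qw(SYMBOL C1:SYMBOL C1:C2:SYMBOL C1,C2:SYMBOL)) {
    my ($symbol, $groups) = parse_label($label);
    printf "%-14s %s\n", $label,
        rule_applies($groups, \%active) ? 'applies' : 'skipped';
}
```

With only C1 active, the AND form C1:C2:SYMBOL is skipped while the OR form C1,C2:SYMBOL applies.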

There are two types of start conditions: inclusive and exclusive, which are declared by class methods inclusive() and exclusive() respectively. With an inclusive start condition, all rules are active regardless of whether or not they are qualified with the start condition. With an exclusive start condition, only the rules qualified with the start condition are active; all other rules are deactivated.

Example (borrowed from the documentation of Flex):

    use Parse::Lex;
    @token = (
      'EXPECT', 'expect-floats', sub {
        $lexer->start('expect');
        $_[1]
      },
      'expect:FLOAT', '\d+\.\d+', sub {
        print "found a float: $_[1]\n";
        $_[1]
      },
      'expect:NEWLINE', '\n', sub {
        $lexer->end('expect') ;
        $_[1]
      },
      'NEWLINE2', '\n',
      'INT', '\d+', sub {
        print "found an integer: $_[1] \n";
        $_[1]
      },
      'DOT', '\.', sub {
        print "found a dot\n";
        $_[1]
      },
    );
    Parse::Lex->exclusive('expect');
    $lexer = Parse::Lex->new(@token);
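
The effect of declaring "expect" inclusive or exclusive can be illustrated with a small core-Perl model of the rule list above. This is a sketch of the described behaviour, not the module's code: active_rules() is a hypothetical helper, and it assumes that an unqualified rule stays active unless the current condition is exclusive.

```perl
use strict;
use warnings;

# Rule names from the example above; 'cond' is the start condition that
# qualifies a rule (undef for an unqualified rule).
my @rules = (
    { name => 'EXPECT',   cond => undef    },
    { name => 'FLOAT',    cond => 'expect' },
    { name => 'NEWLINE',  cond => 'expect' },
    { name => 'NEWLINE2', cond => undef    },
    { name => 'INT',      cond => undef    },
    { name => 'DOT',      cond => undef    },
);

# Names of the rules that would be tried, in order, while $current is the
# active condition. $exclusive is a hash ref of conditions declared exclusive.
sub active_rules {
    my ($current, $exclusive) = @_;
    my @names;
    for my $rule (@rules) {
        my $cond = $rule->{cond};
        if (defined $cond) {
            # qualified rule: active only under its own condition
            push @names, $rule->{name} if $cond eq $current;
        }
        elsif (!$exclusive->{$current}) {
            # unqualified rule: switched off by an exclusive condition
            push @names, $rule->{name};
        }
    }
    return @names;
}

print "INITIAL:            @{[ active_rules('INITIAL', {}) ]}\n";
print "expect (inclusive): @{[ active_rules('expect', {}) ]}\n";
print "expect (exclusive): @{[ active_rules('expect', { expect => 1 }) ]}\n";
```

In this model, only FLOAT and NEWLINE remain when "expect" is exclusive. In the INITIAL state the FLOAT rule is not tried, so an input such as 3.14 would be split by the INT and DOT rules.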

The special start condition "ALL" is always satisfied.

Methods

ERROR HANDLING
To handle cases where no token is recognized, you can define a specific token at the end of the list of tokens that comprise your lexical analyzer. If searching for this token succeeds, it is then possible to call an error handling function:

    qw(ERROR  (?s:.*)), sub {
      print STDERR "ERROR: buffer content->", $_[0]->lexer->buffer, "<-\n";
      die qq!can\'t analyze: "$_[1]"!;
    }
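
The pattern used for the ERROR token above differs from a plain ".*" in how much of the remaining text it captures. The following core-Perl sketch shows the difference on a hypothetical two-line buffer; the variable names are chosen for this illustration only.

```perl
use strict;
use warnings;

# Text assumed to be left in the buffer when no other rule matched.
my $rest = "?? oops\nsecond line";

my ($plain)  = $rest =~ /\A(.*)/;        # "." stops at the newline
my ($single) = $rest =~ /\A((?s:.*))/;   # (?s:...) lets "." cross newlines

printf "plain:  %d of %d characters\n", length($plain),  length($rest);
printf "single: %d of %d characters\n", length($single), length($rest);
```

With ".*" the error text ends at the first newline; with "(?s:.*)" it covers everything that remains. Either way the rule matches almost any input, which is why it belongs at the end of the token list, where it is tried only after every other rule has failed.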

EXAMPLES
ctokenizer.pl - Scan a stream of data using the "Parse::CLex" class.

tokenizer.pl - Scan a stream of data using the "Parse::Lex" class.

every.pl - Use of the "every" method.

sexp.pl - Interpreter for prefix arithmetic expressions.

sexpcond.pl - Interpreter for prefix arithmetic expressions, using conditions.

BUGS
Analyzers of the "Parse::CLex" class do not allow the use of regular expressions with anchoring.

SEE ALSO
"Parse::Token", "Parse::LexEvent", "Parse::YYLex".

AUTHOR
Philippe Verdret. Documentation translated to English by Vladimir Alexiev and Ocrat.

ACKNOWLEDGMENTS
Version 2.0 owes much to suggestions made by Vladimir Alexiev. Ocrat has significantly contributed to improving this documentation. Thanks also to the numerous people who have sent me bug reports and occasionally fixes.

REFERENCES
Friedl, J.E.F. Mastering Regular Expressions. O'Reilly & Associates, 1996.

Mason, T. & Brown, D. Lex & Yacc. O'Reilly & Associates, Inc., 1990.

FLEX - A Scanner generator (available at ftp://ftp.ee.lbl.gov/ and elsewhere)

COPYRIGHT
Copyright (c) 1995-1999 Philippe Verdret. All rights reserved. This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.