GSP
Quick Navigator

Search Site

Unix VPS
A - Starter
B - Basic
C - Preferred
D - Commercial
MPS - Dedicated
Previous VPSs
* Sign Up! *

Support
Contact Us
Online Help
Handbooks
Domain Status
Man Pages

FAQ
Virtual Servers
Pricing
Billing
Technical

Network
Facilities
Connectivity
Topology Map

Miscellaneous
Server Agreement
Year 2038
Credits
 

USA Flag

 

 

Man Pages
Marpa::Recognizer(3) User Contributed Perl Documentation Marpa::Recognizer(3)

Marpa::Recognizer - Marpa Recognizer Objects

    my $recce = Marpa::Recognizer->new( { grammar => $grammar } );

    my @tokens = (
        [ 'Number', 42 ],
        [ 'Multiply', ],
        [ 'Number', 1 ],
        [ 'Add', ],
        [ 'Number', 7 ],
    );

    $recce->tokens( \@tokens );

The "Marpa::Recognizer::new" constructor creates a recognizer object from a precomputed Marpa grammar object. The "Marpa::Recognizer::tokens" method recognizes tokens.

Pedantically, recognition and evaluation are distinct things. Recognition is determining whether there is a parse. Evaluation is determining the value of a parse. The Marpa::Recognizer handles both recognition and evaluation.

In traditional parsing, location is position in a token stream. This document will assume that Marpa is using the traditional, token-stream, model of the input, unless it states otherwise. Marpa supports other input models.

The current parse location, or current earleme, is the location at which the next input is expected. Intuitively, the current earleme can be thought of as the recognizer's current position. Alternative input models, and the reasons for the term earleme itself, are discussed in the document on alternative input models.

Recognition may reach a point where the recognizer is exhausted. A recognizer is exhausted if there is no parse at the current location, and if no possible additional input could produce parses at any subsequent location. If a recognizer is not exhausted, it is active.

The evaluators look for a parse from the parse start location to the parse end location. By default, the parse start location is the beginning of the input. By default, the parse end location is the end of the input. In other words, the default is to parse the entire input.

When an evaluator is called using the default parse end location, the evaluator will find no parses in an exhausted recognizer. It may be possible for an exhausted recognizer to produce a valid parse if the parse end location is set to a location before the end of input.

The "new" method's arguments are references to hashes of named arguments. In each key/value pair of these hashes, the key is the argument name, and the hash value is the value of the argument. The "new" method either returns a new recognizer object or throws an exception. The named arguments are described in a section below.

Indicates that input is finished. Calling "end_input" is not necessary or useful in the recognizer's "default" mode. In "stream" mode, the "end_input" method is needed to indicate the end of input.

The "end_input" method takes no arguments. The "end_input" method returns a Perl true value on success. On failure, it throws an exception.

Calling "end_input" multiple times on the same recognizer object is harmless, but useless. The second and subsequent calls will return a Perl true, but will have no effect. Once "end_input" has been called, calls to "tokens" on the same recognizer object will cause an exception to be thrown.

The "set" method's arguments are references to hashes of named arguments. The "set" method can be used to set or change named arguments after the recognizer has been created. Details of the named arguments are below.

The recognizer generates a considerable amount of internal data while building tables. These internal data are no longer needed when input is finished and the recognizer's tables are complete.

The "strip" method deletes much of the recognizer's internal data, but leaves the tables and other data needed for the evaluation phase. Stripping a recognizer greatly reduces the amount of memory it uses. Attempting to strip a recognizer before input is finished will cause an exception.

    my @tokens = (
        [ 'Number', 42 ],
        [ 'Multiply', ],
        [ 'Number', 1 ],
        [ 'Add', ],
        [ 'Number', 7 ],
    );

    $recce->tokens( \@tokens );

The "tokens" method takes two arguments. The first is a reference to an array of token descriptors. The second, optional, argument is a reference to an index into that array, used when the call is interactive.

Tokens descriptors are also references to arrays. The elements in each token descriptor are, in order:

  • Token name

    Required. Must be the name of a valid terminal symbol. For details about terminal symbols, see "Terminals" in Marpa::Grammar.

  • Token value

    Optional. Can be any Perl scalar. If omitted, the token value is a Perl "undef".

  • Token length

    Optional, and not used in the traditional, token-stream, model of the input. Token length is described in the document on alternative input models.

  • Token offset

    Optional, and not used in the traditional, token-stream, model of the input. Token offset is described in the document on alternative input models.

In scalar context, if the recognizer remains active, the "tokens" method returns the number of the current earleme. In array context, if the recognizer remains active, the "tokens" method returns an array of two elements. The first element of the returned array is the number of the current earleme. The second element is a reference to a list of expected tokens.

The list of expected tokens returned by the "tokens" method is used for prediction-driven lexing. It is an array of strings. The strings are the names of the symbols that are valid as token names at the current parse location. For detail on how to use the list of expected tokens, see the section on interactive input.

When a call to the "tokens" method in scalar context exhausts the recognizer, the "tokens" method returns "undef". When a call to the "tokens" method in array context exhausts the recognizer, the "tokens" method returns the empty array.

The value mutator evaluates and returns a parse result. It is described in its own section.

Returns a Perl true when its argument is the name of a terminal symbol. Otherwise, returns a Perl false. Not often needed, but in special sitations a lexer may find this the most convenient way to determine if a symbol is a terminal.

The return value of the status method is that same as that of the "tokens" method. For more details, see the description of "tokens" method.

The "status" method can be used to check the status of the recognizer without calling the "tokens" method. The "tokens" method is the only way to get the list of expected tokens at earleme 0, before the first call of the "tokens" method.

    print $recce->show_earley_sets()
        or die "print failed: $ERRNO";

Returns a multi-line string listing every Earley item in every Earley set. "show_earley_sets" requires knowledge of Marpa internals to interpret.

For debugging grammars, users will want to use "show_progress" instead. "show_progress" contains the information necessary for debugging grammars and interpreting parse progress.

    print $recce->show_progress()
        or die "print failed: $ERRNO";

Returns a string describing the progress of the parse. With no arguments, the string contains reports for the last completed earleme, which is typically the one of most interest. With a non-negative argument N, the string contains reports for earleme N.

With two numeric arguments, N and M, the arguments are interpreted as a range of earlemes and the returned string contains reports for all earlemes from N to M. The first argument must be non-negative. If the second argument is a negative integer, "-M", it indicates the Mth earleme from the last. In other words, -1 is the last earleme, -2 the next to last, etc. The call "$recce->show_progress(0, -1)" will print progress reports for the entire parse.

"show_progress" is an important tool for debugging application grammars. It can also be used to track the progress of a parse or to investigate how a parse works. Details are in the document on debugging Marpa grammars.

The "grammar" named argument is required. Its value must be a precomputed Marpa grammar object.

The "mode" named argument is optional. If present, it must be a string, either ""default"" or ""stream"".

In "stream" mode, the "tokens" method may be called repeatedly. To indicate that input is finished, it is necessary to call the "end_input" method.

In "default" mode, only one call to the "tokens" method is allowed for each recognizer object. The input is automatically finished after that one call. In "default" mode, the "end_input" method is not useful. As the name indicates, "default" mode is the default.

The value must be a string: either ""none"" or ""constant"". When the value is ""none"", Marpa returns the parse results in arbitrary order. When the value is ""constant"", Marpa allows the user to control the order in which parse results are returned by specifying ranking actions which assign values to rules and tokens.

The default is for parse results to be returned in arbitrary order. For details, see the section on ranking in the document on semantics.

The "too_many_earley_items" argument is optional. If specified, it sets the Earley item warning threshold. If an Earley set becomes larger than the Earley item warning threshold, a warning is printed to the trace file handle.

Marpa parses from any BNF, and can handle grammars and inputs which produce large Earley sets. But parsing that involves large Earley sets can be slow. Large Earley sets are something most applications can, and will wish to, avoid.

By default, Marpa calculates an Earley item warning threshold based on the size of the grammar. The default threshold will never be less than 100. If the Earley item warning threshold is set to 0, warnings about large Earley sets are turned off. For details about Earley sets, see the implementation document.

A boolean. If true, causes each Earley set to be written to the trace file handle as it is completed. For details about Earley sets, see the implementation document.

The value is a file handle. Traces and warning messages go to the trace file handle. By default the trace file handle is inherited from the grammar used to create the recognizer.

Very handy in debugging, and often useful even when the problem is not in the lexing. The value is a trace level. When the trace level is 0, tracing of terminals is off. This is the default.

At a trace level of 1 or higher, Marpa traces terminals as they are accepted or rejected by the recognizer. At a trace level of 2 or higher, Marpa traces the terminals expected at every earleme. Practical grammars often expect a large number of different terminals at many locations, so the output from a trace level of 2 can be voluminous.

The value is a boolean. Warnings are written to the trace file handle. By default, the recognizer's warnings are on. Usually, an application will want to leave them on.

    RECCE_RESPONSE: for ( my $token_ix = 0;; ) {

        my ( $current_earleme, $expected_tokens ) =
            $recce->tokens( \@tokens, \$token_ix );

        last RECCE_RESPONSE if $token_ix > $#tokens;

        fix_things( \@tokens, $expected_tokens )
            or die q{Don't know how to fix things};

    } ## end for ( my $token_ix = 0;; )

Marpa allows prediction-driven lexing. In other words, Marpa can tell the lexer which symbols are acceptable as tokens at the next location in the parse. This can be very useful.

Interactive input takes place, like all input, via the "tokens" method. If a call of the "tokens" method passes a second argument, that call of the "tokens" method is interactive. Interactive calls to the "tokens" method are allowed only on recognizers that are in "stream" mode.

In interactive input, the first argument is a reference to a token descriptor array, and the second argument is a reference to a token descriptor index. The "tokens" method will read token descriptors starting at the token descriptor index. Token descriptors before the token descriptor index are ignored.

An unexpected token descriptor is a token descriptor whose token name is not the name of one of the terminal symbols expected at that point in the input. The token represented by an unexpected token descriptor is an unexpected token. In non-interactive calls to the "tokens" method, an unexpected token descriptor causes an exception to be thrown.

In interactive calls, the "tokens" method returns leaving the token descriptor index pointing to the first unexpected token descriptor. If there were no unexpected token descriptors, the token descriptor index will be set to one past the end of the token descriptor array.

The behavior of the token descriptor index is designed to facilitate loops like the one in the example above. In that loop, the token descriptor index is initialized to zero, and is never directly modified after that. The loop ends when the token descriptor index points past the end of the token descriptor array.

The return values for interactive calls to "tokens" are the same as the return values for non-interactive calls. Interactive calls to "tokens" can be made in scalar context, but most often applications will want to use the list of expected tokens, which is only returned in array context.

Recognizers in "stream" mode are very flexible. Both interactive and non-interactive "tokens" method calls can be made to the same recognizer. A single recognizer can read input from any number of token descriptor arrays. Between calls to "tokens", the application is allowed to edit the token descriptor array, and to change the token descriptor index.

This flexiblity is possible because the recognizer is stateless with respect to token descriptor arrays and their token descriptor indexes. The "tokens" method does not keep track of which token descriptor arrays it has seen. It does not keep track of which token descriptors it has already read. It does not keep track of where the token descriptor indexes have been in the past. Statelessness means that, regardless of what an application does with its token descriptor arrays, it is impossible to make them inconsistent with the recognizer's internal data.

Marpa's HTML parser, Marpa::HTML, is an example of how this flexibility can help with a non-trivial, real-life application. When an interactive "tokens" call returns due to an invalid token name, Marpa::HTML tries to fix things in two ways.

First, it looks at the expected tokens list for the name of a token that will work as a "virtual" token. If Marpa::HTML finds an acceptable virtual token name, it will create a token descriptor for it "on the fly". In effect, Marpa::HTML supplies tokens that the parser wants but which are missing in the physical input. Marpa::HTML inserts the virtual tokens into the input using non-interactive "tokens" method calls. After inserting the virtual tokens, Marpa::HTML loops back to the main, interactive, "tokens" call, resuming the original input at the point where it left off.

Second, under some circumstances Marpa::HTML will change the next token descriptor in the token descriptor array to fit the parser's expectations. Marpa::HTML then loops back to the main interactive call to "tokens". When the main interactive call to "tokens" resumes, the token descriptor index is unchanged, but points to an altered token descriptor.

    my $value_ref = $recce->value;
    my $value = $value_ref ? ${$value_ref} : 'No Parse';

The "value" method call evaluates and returns a parse result. Its arguments are zero or more hashes of named arguments. It returns a reference to the value of the next parse result, or undef if there are no more parse results.

These are the named arguments available to the "value" method call:

The "value" method's "end" named argument specifies the parse end location. The default is for the parse to end where the input did, so that the parse returned is of the entire input.

The "value" method's "closures" named argument is a reference to a hash. In each key/value pair of this hash, the key must be an action name. The hash value must be a CODE ref.

Sources of action names include

  • The "action" properties of rules
  • The "default_action" named argument of grammars
  • The "lhs" properties of rules
  • The "ranking_action" properties of rules
  • For its "new" method, the "action_object" named argument of grammars

When an action name is a key in the "closures" named argument, the usual action resolution mechanism of the semantics is bypassed. A common use of the "closures" named argument is to allow anonymous subroutines to be semantic actions. For more details, see the document on semantics.

The value must be an integer. If it is greater than zero, the evaluator will return no more than that number of parse results. If it is zero, there will be no limit on the number of parse results returned. The default is for there to be no limit.

Marpa allows extremely ambiguous grammars. "max_parses" can be used if the user only wants to see the first few parse results of an ambiguous parse. "max_parses" is also useful to limit CPU usage and output length when testing and debugging.

The value must be a string: either ""none"" or ""constant"". When the value is ""none"", Marpa returns the parse results in arbitrary order. When the value is ""constant"", Marpa allows the user to control the order in which parse results are returned by specifying ranking actions. The default is for parse results to be returned in arbitrary order. For details, see the section on parse order in the semantics document.

The "value" method's "trace_actions" named argument is a boolean. If the boolean value is true, Marpa traces the resolution of action names to Perl closures. A boolean value of false turns tracing off, which is the default. Traces are written to the trace file handle.

The "value" method's "trace_values" named argument is a numeric trace level. If the numeric trace level is 1, Marpa traces values as they are computed in the evaluation stack. A trace level of 0 turns value tracing off, which is the default. Traces are written to the trace file handle.

Copyright 2007-2010 Jeffrey Kegler, all rights reserved. Marpa is free software under the Perl license. For details see the LICENSE file in the Marpa distribution.
2022-04-09 perl v5.32.1

Search for    or go to Top of page |  Section 3 |  Main Index

Powered by GSP Visit the GSP FreeBSD Man Page Interface.
Output converted with ManDoc.