Marpa::Advanced::Implementation - Marpa Implementation
This document describes the Marpa data structures that are useful for tracing
and debugging grammars. It assumes that the reader knows the Marpa API, and
that she has a general knowledge of parsing. All the examples in this document
assume that none of the data has been stripped.
Marpa rewrites grammars, adding internal symbols and rules in the process. These
rewritings do not affect the semantics, but they do show up when you examine
the internals.
Marpa's internal symbols have
tags at the end, enclosed in square
brackets. This means that all Marpa internal symbols end in a right square
bracket.
Many parsers add their own start rule and their own start symbol to grammars.
Marpa is no exception. The new internal start symbol is the original start
symbol with ""[']"" suffixed. An example of a Marpa
internal start symbol is ""Expression[']"". If the grammar
allows a null parse, there will also be a Marpa internal nulling start symbol,
with ""['][]"" suffixed. An example of a Marpa internal
nulling start symbol would be ""Script['][]"".
A symbol is
nulled if it produces the empty sentence.
Nulling
symbols are those which are
always nulled.
Nullable symbols
are those which are
sometimes nulled.
Non-nullable symbols are
those which are
never nulled.
Pedantically, all nulling symbols are also nullable symbols. A
proper
nullable is any nullable symbol which is not a nulling symbol. In other
words, a proper nullable is a symbol that might or might not be nulled.
Nullable symbols were a problem in previous versions of Earley parsers. Aycock
and Horspool 2002 outlined a new approach for dealing with them. I use their
ideas with modifications of my own.
Marpa rewrites its grammar to eliminate proper nullables. It does this by
turning every proper nullable into two symbols: a non-nullable variant and a
nulling variant. The non-nullable variant keeps the original symbol's name,
but is no longer allowed to appear in places where it might be nulled.
The name of the nulling variant is that of the original symbol with the nulling
tag (""[]"") suffixed. When the nulling variant is used,
it must be nulled.
The newly introduced nulling symbols will not appear on any left hand sides,
with one exception: grammars that allow a null parse will have a nulling start
rule. Except for the nulling start symbol, Marpa marks nulling symbols
internally and recognizes them directly, without the need for empty rules.
Rules with proper nullables on the RHS have to be replaced with new rules
covering every possible combination of the non-nullable and nulling variants.
That rewrite is called the CHAF rewrite, and is described in the next section.
To deal with the splitting of RHS proper nullables into two symbols, one
non-nullable and one nulling, Aycock and Horspool created new rules covering
all possible RHS combinations of non-nullable and nulling symbols. This
factoring is exponential in the theoretical worst case, and I believe
would create efficiency issues in practice as well.
A result due to Chomsky shows that any grammar can be rewritten as a grammar
with at most two symbols on the right hand side. Relaxing Chomsky's rewrite to
allow right hand sides with any number of symbols, but at most two proper
nullables, produces a rewrite I call CHAF (Chomsky-Horspool-Aycock Form).
CHAF changes the worst case to linear, and in practical cases lowers the
multiplier. Here's an example of a CHAF rewrite. First, the rule:
{ lhs => 'statement',
rhs => [
qw/optional_whitespace expression
optional_whitespace optional_modifier
optional_whitespace/
]
}
This rule contains four instances of proper nullables, illustrating my fear that
grammars of practical interest will have lots of proper nullables on right
hand sides. "optional_whitespace" and "optional_modifier"
are both proper nullables and "optional_whitespace" appears three
times.
Here's is the output from "show_rules", showing what Marpa does with
this rule:
0: statement -> optional_whitespace expression optional_whitespace optional_modifier optional_whitespace
/* !used */
13: statement -> optional_whitespace expression statement[R0:2] /* vrhs real=2 */
14: statement -> optional_whitespace expression optional_whitespace[] optional_modifier[] optional_whitespace[]
15: statement -> optional_whitespace[] expression statement[R0:2] /* vrhs real=2 */
16: statement -> optional_whitespace[] expression optional_whitespace[] optional_modifier[] optional_whitespace[]
17: statement[R0:2] -> optional_whitespace statement[R0:3] /* vlhs vrhs real=1 */
18: statement[R0:2] -> optional_whitespace optional_modifier[] optional_whitespace[] /* vlhs real=3 */
19: statement[R0:2] -> optional_whitespace[] statement[R0:3] /* vlhs vrhs real=1 */
20: statement[R0:3] -> optional_modifier optional_whitespace /* vlhs real=2 */
21: statement[R0:3] -> optional_modifier optional_whitespace[] /* vlhs real=2 */
22: statement[R0:3] -> optional_modifier[] optional_whitespace /* vlhs real=2 */
Rule 0 is the original rule. Because Marpa has rewritten it, the rule is marked
""!used"", telling later stages in the precomputation to
ignore it. Marpa breaks Rule 0 up into three pieces, each with no more than
two proper nullables. Marpa then eliminates the proper nullables in each piece
by factoring.
Factoring is so called because the proper nullables are
"factored out" into their nulling and non-nulling variants.
To
factor a piece, Marpa first rewrites it into multiple rules, one for
each possible combination of nulled and non-nulled symbols. Marpa then
replaces each proper nullable with its nulling or non-nullable variant, as
appropriate. There are two symbol variants for each proper nullable, and a
maximum of two proper nullables for each piece. That means that each piece
will be factored into, at most, four rules.
In the example in the above display, the original rule (Rule 0) was broken into
three pieces. Rule 0 had 5 RHS symbols, and the three pieces contain,
respectively, the first two RHS symbols; the third RHS symbol; and the last
two (that is, 4th and 5th) RHS symbols.
The first piece of Rule 0 is factored into four rules: Rules 13 to 16. The
second piece is factored into three rules: Rules 17 to 19. The third and last
piece of Rule 0 is also factored into three rules: Rules 20 to 22.
When a rule is broken into pieces, the left hand side of the first piece is the
left hand side of the original rule. New symbols are introduced to be the left
hand sides of the other pieces. The names of the new LHS symbols are formed by
suffixing a tag to the name of the original left hand side. The suffixed tag
begins with a capital "R" and identifies the location of the piece
in the original rule. For example, the tag ""[R0:3]""
indicates that this symbol is the left hand side for the piece of Rule 0 which
begins at right hand symbol 3 (the first symbol is symbol 0).
When a new LHS symbol is created for a piece, the worst case is that the new LHS
is also a proper nullable. This worst case occurs when all the original
symbols in the piece for the new LHS and all symbols in all subsequent pieces
are proper nullables. When a newly created LHS symbol is a proper nullable, it
counts against the limit of two proper nullables per piece, and it must also
be factored, just like the proper nullables in the original rule.
Rule 0 is such a worst case. The last three symbols of the right hand side are
all proper nullables. That means that all the symbols in the last two pieces
of the original rule are proper nullables. Therefore both of the newly created
LHS symbols are proper nullables.
The original Rule 0 has 4 proper nullables. After the CHAF rewrite, there are 6
proper nullables: the original 4 plus the 2 symbols newly created to serve as
left hand sides. This is why, in order to have at most 2 proper nullables per
piece, the original rule must to be divided into 3 pieces.
Even though Rule 0 is a CHAF worst case, CHAF's factoring result in 10 rules.
The original Aycock-Horspool factoring (NNF) resulted in 16 rules. For cases
other than the worst, and for situations with more than 4 proper nullables,
the advantage of CHAF becomes overwhelming. For 5 proper nullables, there
would be 13 rules for CHAF, versus 32 for NNF. For 6 proper nullables, there
would be 16 rules for CHAF, versus 64 for NNF.
CHAF preserves user semantics. Marpa, when it splits the rule into pieces and
factors the pieces, inserts logic to gather and preserve the values of child
nodes. The user's semantic actions see the original child values as if the
CHAF rewrite had never occurred.
Internally, Marpa converts productions specified as sequences into BNF
productions. The conversion is done in a standard way. For example,
{
lhs => 'statements',
rhs => [qw/statement/],
separator => 'comma',
min => 1
}
becomes
1: statements -> statement[Seq:1-*][Sep:comma] /* vrhs discard_sep real=0 */
2: statements -> statement[Seq:1-*][Sep:comma] comma /* vrhs discard_sep real=1 */
3: statement[Seq:1-*][Sep:comma] -> statement /* vlhs real=1 */
4: statement[Seq:1-*][Sep:comma] -> statement[Seq:1-*][Sep:comma] comma statement /* vlhs vrhs real=2 */
In the added symbol, the tag ""[Seq:1-*]"" indicates this is
a symbol for a sequence of from 1 to an infinite number of symbols and the tag
""[Sep:comma]"" that it is "comma" separated.
Here's another example, this time of a sequence without a separator. The rule
{ lhs => 'block', rhs => [qw/statements/], min => 0 },
is rewritten by Marpa as follows:
5: block -> /* empty !used nullable */
6: block -> statements[Seq:1-*] /* vrhs real=0 */
7: statements[Seq:1-*] -> statements /* vlhs real=1 */
8: statements[Seq:1-*] -> statements[Seq:1-*] statements /* vlhs vrhs real=1 */
Note that rule 5, the empty rule with the "block" symbol as its LHS,
is marked ""!used"". "block" is a proper
nullable, and rules from sequence conversion undergo the same rewriting of
proper nullables as any other rules. In this case a nulling variant of
"block" ("block[]") was created. That made the empty rule
created by the sequence conversion useless, and that is why it was marked
""!used"".
Creating the Earley sets is not parsing in the strict pedantic sense. Creating
the Earley sets is
recognition -- determining whether or not a grammar
recognizes an input. During recognition, Marpa creates
Earley items and
puts them into Earley sets. The input is recognized if an Earley item of the
correct form is in the Earley set located at the end of parsing. In Marpa,
parsing in the strict sense takes place after the recognition phase is done,
using the Earley sets built by the recognizer.
Every Earley item has a
current earleme. An Earley item's current earleme
is also called its
dot earleme, its
current location, or its
dot location. An
Earley set is the set of all the Earley items
with the same dot location.
In the default, token-stream, input model, the
earleme location is
exactly the same as the zero-based token stream position. Marpa allows other
input models, and in those models the term
earleme can take on other
meanings. User interested in the alternative input models should look at
Marpa::Advanced::Models.
In addition to its
current earleme, each Earley item has a
AHFA
state, and an
origin. The origin of an Earley item can also be
called its
start location. Here's a representation of an Earley item,
as you might see it in debugging output from Marpa:
S4@2-3
This Earley item is for AHFA state 4 (that is the ""S4""
part). The ""@2-3"" part says that this Earley item
originates at earleme 2, and is current at earleme 3. The number of an Earley
item's current earleme is always the same as number of the Earley set that
contains the item.
(Note to experts: Those familiar with Earley parsing will note that
"S4@2-3" does not look like a traditional Earley item. AHFA states
are not traditional -- they come from Aycock and Horspool. The use of the
at-sign as punctuation is from Grune and Jacobs. And I have added the current
earleme to the notation.)
To understand what is going on in the Earley items, it will be necessary to
understand AHFA's (Aycock-Horspool Finite Automata). Let's look at an Earley
item:
S5@0-3
This states that this item is for AHFA state 5; that it originates at earleme 0;
and that it is currently at (or has its dot position at) earleme 3. We can get
a description of the AHFA states using the "show_AHFA" method. (For
those who like the big picture, the entire "show_AHFA" output, along
with the other trace and debugging outputs for this example, is in the
appendices.)
Here's what "show_AHFA" says about AHFA state 5:
S5: 2,8
Expression -> Term .
Term -> Term . Add Term
<Add> => S8; S9
The first line of "show_AHFA"'s description of AHFA state 5 begins
with its name: "S5". The two numbers on the first line after the
AHFA state name are the numbers of the
LR(0) NFA states from which this
AHFA state was formed. They will not be mentioned again in this document, and
may be ignored.
Marpa uses AHFA states to track the parse. Every AHFA state has one or more
LR(0) items. In the above display, the second and third lines are
LR(0) items.
LR(0) items are rules with a dot
added to indicate how far recognition has proceeded into the rule.
A dot between two symbols indicates that all symbols before the dot have been
recognized, but that none of the symbols after the dot have been recognized. A
dot at the beginning of an
LR(0) item indicates that none of its
symbols have yet been recognized. A dot at the end indicates that the rule is
complete -- that all of the rule's symbols have been recognized.
The location of the dot is called the
dot position. The symbol before the
dot position in an
LR(0) item is called the
pre-dot symbol. The
symbol after the dot position in an
LR(0) item is called the
post-dot symbol.
The last line in the above display shows the
transitions from AHFA state
S5. It says that, when Marpa sees an "Add" token, it will transition
both to AHFA state S8 and to AHFA state S9. We will use both transitions in
our examples.
It's important not to confuse Earley items with
LR(0) items. Earley items
are built from one or more
LR(0) items. In traditional Earley parsing,
each Earley item contained one and only one
LR(0) item. This made the
internals simpler, but was not as efficient. Marpa combines
LR(0) items
into AHFA states based on the ideas in Aycock and Horspool 2002.
Marpa places an important restriction on
LR(0) items. The post-dot symbol
is
never a nullable symbol. In a completed rule, the dot position will
be at the very end of a rule, after all symbols. When an
LR(0) item is
not a completed rule, the post-dot symbol will always be non-nullable.
Intuitively, it may be helpful to think of Marpa as recognizing nulling
symbols instantly, whenever they are encountered, so that the dot position
never stops in front of a nulling symbol.
Because nulling symbols are never post-dot symbols, the dot position is tracked
in terms of the non-nulling symbols only. Nulling symbols are skipped over as
if they didn't exist. This document speaks of the
non-null symbol
position in a rule, or simply of the
non-null position. Recognition
of a rule starts with the dot position at the
first non-null position.
An
LR(0) item with the dot position at the first non-null position is
called a
rule initialization.
Beginning from the rule initialization, as each post-dot symbol is recognized,
the dot position advances to the
next non-null position. Once all of
the zero or more symbols are recognized, the
LR(0) item becomes a
completed rule. When an
LR(0) item is not a completed rule, it
is called an
incomplete LR(0) item, or an
incomplete
rule.
Every AHFA state contains one or more
LR(0) items. This means that every
AHFA state is a set of statements about rule recognition. An Earley item is
present in the Earley sets if and only if every rule in every
LR(0)
item in the AHFA state of the Earley item has been recognized as far as its
dot position. The origin of each
LR(0) item corresponds to the origin
of the Earley item, and the dot position of each
LR(0) item corresponds
to the dot location of the Earley item.
New items come into the Earley sets in four ways: scanning, prediction,
completion, and initialization.
Scanning is the addition of Earley
items to represent the recognition of symbols directly in the input.
Rule
completion (or simply
completion) occurs when one of the rules of
the grammar is fully recognized, or "completed". Rule completions
add Earley items to represent the recognition of symbols that are not found
directly in the input, but that are found indirectly by applying the rules of
the grammar.
Prediction adds Earley items to represent rules that might start at a
given location.
Initialization adds one or two Earley items to Earley
set 0 to represent the start rules.
Scanning adds Earley items to indicate which tokens have been recognized in the
input, and where. Let's look at the case where we are at earleme 3; we are
about to recognize an "Add" token; and we have in our Earley set 3
the Earley item S5@0-3. S5@0-3 is the Earley item we already looked at.
The current earleme for S5@0-3 is earleme 3. Here, again, is the information for
AHFA state S5, the AHFA state in Earley item S5@0-3.
S5: 2,8
Expression -> Term .
Term -> Term . Add Term
<Add> => S8; S9
Marpa knows that it can instruct the lexer to look for a "Add" token
at earleme 3, because in AHFA state S5, there is a transition on an
"Add" token. In this example, an "Add" token corresponds
to a plus sign (""+"").
Marpa's state machine is not fully deterministic, it is quasi-deterministic.
That is, Marpa's state machine is not a DFA (Deterministic Finite Automata). A
DFA state, for any given token, has at most one transition. Marpa's AHFA
states, for any given token, can have up to two transitions. In the case of
AHFA state S5, on an "Add" token there are two transitions: one to
AHFA state S8 and one to AHFA state S9. We will use both transitions in our
examples.
The transition to AHFA state S9 is an example of adding a
predicted Earley
item to an Earley set. The transition to AHFA state S8 is an example of
adding a
scanned Earley item to an Earley set. At this point, we are
discussing scanning, so the first transition that we will look at is the
transition to AHFA state 8.
S8: 9
Term -> Term Add . Term
<Term> => S12
Which Earley set does the new Earley item for AHFA state S8 go into? In this
example, the "Add" token has a length (in earlemes) of 1. The
current earleme for S5@0-3 is earleme 3. Since 3+1 equals 4, the current
earleme for the new Earley item will be earleme 4.
The Earley item containing the transition and from-state used to create a
scanned Earley item is called the
predecessor of the scanned Earley
item. In this example the predecessor of the scanned Earley item is the Earley
item S5@0-3. The origin of a scanned Earley item is always the same as the
origin of its predecessor.
Collecting what we've determined so far, the new, scanned, Earley item will have
its dot location at earleme 4, will have AHFA state S8, and will have its
origin at earleme 0. This means that the new, scanned, Earley item will be
S8@0-4, and will look like this:
S8@0-4 [p=S5@0-3; s=Add; t=\'+']
Any item which is added to an Earley set based on a AHFA state transition from a
predecessor, is called that predecessor's
successor. In this example,
S8@0-4 is the successor of S5@0-3.
In the current example, the token values are "42",
""*"", "7", ""+" and
"1". The current earleme for Earley item S8@0-4 is earleme 4,
indicating that the 4th token has just been seen. The fourth token value is a
plus sign (""+""), which the lexer of this example read as
an "Add" token.
After the name of the Earley item is an
Earley item source choice, shown
in brackets. When the context makes the meaning clear, an
Earley item
source choice may be called a
source or a
source choice. In
this example, the source choice is "[p=S5@0-3; s=Add; t=\'+']". This
parse is not ambiguous, so all Earley items have at most one source choice
entry.
In this case the source choice entry is for a
token choice. Token choices
are shown as "[p="
predecessor";
s="
token_name"; t="
token_value"]"
3-tuples. The token choice "[p=S5@0-3; s=Add; t=\'+']" indicates
that the scanned token was an "Add" symbol, that the scanned token
value is "'+'", and that the predecessor of S8@0-4 is S5@0-3.
Token Lengths
Above, we calculated the location of the new Earley item assuming that the
"Add" token had a length, in earlemes, of 1. Where did this token
length come from?
A length of 1 is the default length for a token. In fact, in the default,
token-stream, model of input, a token length of 1 is the
only token
length possible. When other, non-standard, input models are used, tokens can
be variable length. For details on alternative input models, see
Marpa::Advanced::Models.
In the previous section, we looked at one of the two transitions from AHFA state
S5 at earleme 3 on an "Add" token. The transition we looked at was
to AHFA state S8. Now we will look at the other transition, to AHFA state S9.
Here is AHFA state S9:
S9: predict; 3,5,7,11
Term -> . Factor
Factor -> . Number
Term -> . Term Add Term
Factor -> . Factor Multiply Factor
<Factor> => S3
<Number> => S4
<Term> => S13
In the first line of the "show_AHFA" output for AHFA state S9,
immediately after the name of the state, comes the string "predict".
This means that S9 is a prediction state, and that the new Earley item we add
as a result of the transition to S9 will be a predicted Earley item.
The current location of a predicted Earley item is determined in the same way as
the current location of a scanned item. We take the current location of the
Earley item containing the transition and add the length of the token causing
the transition. In this case the transition is in S5@0-3, whose current
location is 3. The token is the same as it was in the previous section: an
"Add" token with a token length of 1. As before, 3+1 equals 4, so
that the new current location will be 4.
Predicted Earley items determine their start location, or origin, in a special
way. The origin of a predicted Earley item is always the same as its current
location. So the origin of this new, predicted, Earley item will be 4.
Now that we know its AHFA state, current location and origin, we are in a
position to add the predicted Earley item. It goes into Earley set 4, and
looks like this:
S9@4-4
Note that no source choice is shown. Source choices are not recorded for
predicted items.
More about Prediction States
Prediction states exist because the presence of incomplete
LR(0) items at
an earleme location gives rise to rule initializations at that same earleme
location. When an
LR(0) item with a post-dot symbol is present at a
given location, we call that symbol an
expected symbol at that
location. At any given location, for every expected symbol, we can also expect
the rule initializations for all the rules with that expected symbol on the
left-hand side.
Any rule initialization we introduce will have a post-dot symbol. If that
post-dot symbol is not already an expected symbol, it may cause further rule
initializations to be expected at the same location. In other words, rule
initializations recurse.
The AHFA states invented by Aycock and Horspool
handle these recursive rule initializations efficiently. Above we looked at AHFA
states S8 and S9. In the following paragraphs, we'll examine the recursive
sequence of rule initializations as it originates in S8 and is collected in
S9.
The AHFA's invented by Aycock and Horspool pair every transition to a prediction
state with the transition to a kernel (or non-prediction) state. In this case,
S8 and S9 are a pair, where S8 is the kernel state. S8 contains only one
LR(0) item:
Term -> Term Add . Term
In this
LR(0) item the post-dot symbol is "Term". S9, the
prediction AHFA state paired with kernel AHFA state S8, will contain all the
rule initializations caused by the expectation of a "Term" symbol.
There are two rules with "Term" on their left hand side. Their rule
initializations are
Term -> . Term Add Term
Term -> . Factor
Rule initializations recurse. The rule initializations above have two post-dot
symbols: "Term" and "Factor". "Term" is already
an expected symbol, but "Factor" is a new expected symbol. When
"Factor" is an expected symbol at an earleme location, all rules
with "Factor" on their left hand side will have rule initializations
at that same earleme location. There are two such rules. Here are their rule
initializations:
Factor -> . Factor Multiply Factor
Factor -> . Number
The post-dot symbols in these rule initializations are "Factor" and
"Number". "Factor" is already an expected symbol.
"Number" is new as an expected symbol, but there is no rule in our
grammar with "Number" on its left hand side.
The recursion is finished. We have added all the rule initializations for the
expected symbols found in previous steps. There are no new expected symbols in
this recursion step. That means there will be no more new rule
initializations, and therefore no more new expected symbols.
Starting with the expected symbol in AHFA state S8, we found four rule
initializations. The prediction state paired with S8 is S9. Let's look at it
again:
S9: predict; 3,5,7,11
Term -> . Factor
Factor -> . Number
Term -> . Term Add Term
Factor -> . Factor Multiply Factor
<Factor> => S3
<Number> => S4
<Term> => S13
We see that, indeed, S9 contains all four of the rule initializations that we
found in this section.
Above, we saw a scanned Earley item, added as the result of recognizing a symbol
directly in the input. New symbols are also recognized indirectly, when rules
are completed. When a rule is recognized all the way to the end, so that the
dot location is after the rightmost right hand side symbol, then the left hand
side symbol of that rule is also recognized.
As a reminder, an
LR(0) item with the dot at the end of the rule is
called a
completed rule. Whenever an Earley item contains a completed
rule, that indicates that the left hand side symbol of the completed rule was
recognized. The left hand side symbol of a completed rule is a
completed
symbol.
Earley items with completed symbols can cause other Earley items to be added to
the Earley sets. Let's look at one example. S4@2-3, in Earley set 3, contains
a completed symbol. Here is what AHFA state 4 looks like:
S4: 6
Factor -> Number .
Every completed symbol has an origin and a current earleme. The origin and the
current earleme of a completed symbol are exactly the same as the origin and
the current earleme of the Earley item that completes that symbol. In this
case, the Earley item completing the symbol "Factor" is S4@2-3 so
that "Factor" is a completed symbol with origin at 2 and current
earleme at 3.
An Earley item that completes a symbol is called a
completion Earley
item, or simply a
completion. As a result of recognizing a
completion Earley item, Marpa may add one or more
completion effect Earley
items to the Earley sets. A
completion effect Earley item is often
called simply a
completion effect, or just an
effect.
When a completion effect Earley item is added to the Earley sets, the completion
Earley item which caused the effect is called a
completion cause Earley
item. A
completion cause Earley item is often called simply a
completion cause, or just an
cause.
For a completion effect Earley item to be added to the Earley sets, a completion
cause is necessary but not sufficient. There must also be a matching
predecessor Earley item. A
matching predecessor Earley item is an
Earley item whose origin is the same as the origin of the completed symbol,
and whose the post-dot symbol is the completed symbol.
In our example, "Factor" is a completed symbol with origin 2. We will
add an Earley item as a completion effect if we find a matching predecessor:
an Earley item in Earley set 2 which has "Factor" as one of the
post-dot symbols.
In this example we find two matching predecessors: S6@0-2 and S7@2-2. Completion
effects are added for both of these. We'll look only at the completion effect
that has S6@0-2 as its matching predecessor. Here is AHFA state 6.
S6: 13
Factor -> Factor Multiply . Factor
<Factor> => S10
The transition for "Factor" in AHFA state S6 is to S10, so the new
completion effect will have AHFA state S10. A completion effect always has the
same current earleme as its completion cause (in this case S4@2-3). The origin
of a completion effect is always the same as the origin of its matching
predecessor (in this case S6@0-2). The means our new new completion effect
will have AHFA state S10, origin 0 and current earleme 3. Here it is:
S10@0-3 [p=S6@0-2; c=S4@2-3]
Note the source choice in brackets after the name: "[p=S6@0-2;
c=S4@2-3]". Previously we saw a token source choice. This is another kind
of source choice, a
completion source choice. This source choice says
the reason for S10@0-3 to be in the Earley sets is the completion cause S4@2-3
and its matching predecessor, S6@0-2.
When a parse is ambiguous, Earley items can have multiple source choices. In
this example, the parse is unambiguous, and all scanned Earley items and all
completion effect Earley items will have one and only one source choice.
Completions recurse. If a completion effect contains a completed symbol, and if
that completion effect has a matching predecessor, then it is also a
completion cause.
The completion effect that we just added (S10@0-3) is a completion cause as
well. It is a step in a recursive sequence of completions. Earley item S10@0-3
has AHFA state S10. AHFA state S10 contains a completed rule:
S10: leo-c; 14
Factor -> Factor Multiply Factor .
(Ignore the "leo-c" notation for now. It indicates that S10 is what's
called a Leo completion. The Leo logic is dealt in another document.)
The completed symbol is "Factor", with origin 0 and current earleme 3.
Looking for matching predecessors at earleme 0, we find one: S1@0-0. Here is
what AHFA state 1 looks like:
S1: predict; 1,3,5,7,11
Expression -> . Term
Term -> . Factor
Factor -> . Number
Term -> . Term Add Term
Factor -> . Factor Multiply Factor
<Factor> => S3
<Number> => S4
<Term> => S5
The completed symbol is "Factor" and its transition in S1 is to S3, so
we add S3@0-3 to Earley set 3.
S3@0-3 [p=S1@0-0; c=S10@0-3]
In our scanning example, we referred to S5@0-3. One more step in the sequence of
completions and we will see S5@0-3 again. The origin of S3@0-3 is at earleme
0. "Term" is only the completed symbol in AHFA state 3.
S3: 4,12
Term -> Factor .
Factor -> Factor . Multiply Factor
<Multiply> => S6; S7
S1@0-0 contains an
LR(0) item which has "Term" as the post-dot
symbol. In AHFA state S1, the transition on "Term" is to AHFA state
S5. This means S5@0-3 will be added to Earley set 3 as a completion effect,
with S3@0-3 as the cause and S1@0-0 as the matching predecessor:
S5@0-3 [p=S1@0-0; c=S3@0-3]
We call a completion effect Earley item the
successor of its matching
predecessor. For example, S5@0-3 is the successor of S1@0-0. The matching
predecessor of an Earley item is often simply called its
predecessor.
For example, S1@0-0 is the predecessor of S5@0-3.
One or two Earley items are put into Earley set 0 to start things off. In our
example, the initial Earley items are
S0@0-0
S1@0-0
S0@0-0 will always be present in Earley set 0. AHFA state 0 contains the start
rules, with the dot pointer at the beginning. Only Earley set 0 will contain
an Earley item for AHFA state 0.
S1@0-0 contains the rules predicted by S0@0-0. For some very simple grammars
S1@0-0 is not necessary.
As of this writing, this section is not a complete presentation of Marpa's
Generation 3 evaluator. Instead, it focuses on the issues relevant to users
using Marpa's advanced trace accessors.
As we have seen, Marpa uses AHFA states in its Earley items, and these can
contain more than one dotted rule. To produce an actual parse result, Marpa
must find the dotted rules which are applicable for that particular parse. A
dotted rule that applies in a particular parse result is called an
applicable dotted rule or an
applicable LR(0)
item.
The search for applicable dotted rules begins with the start Earley item. There
will always be exactly one dotted rule in a start Earley item, and that one
dotted rule will always be applicable to the current parse result. In other
words, in any parse result, the dotted rule in its start Earley item is an
applicable dotted rule. The search for the other dotted rules applicable to
that parse result proceeds top-down, starting with the applicable dotted rule
in the start Earley item.
If the parse is not ambiguous, there will be exactly one applicable dotted rule
for every Earley item. If the parse is ambiguous, at some points there may be
more than one applicable dotted rule. Ambiguous parsing is discussed in
greater depth in the next section.
To find applicable dotted rules while proceeding top-down, Marpa must be able to
do two things. First, given a pairing of a completion effect with its
completion cause, and an applicable dotted rule in the completion effect,
Marpa must be able to determine the applicable dotted rules in the completion
cause. A dotted rule in a completion cause is
applicable if and only if
the LHS symbol of the dotted rule in the completion cause is the same as the
first non-null symbol before the dot position of the applicable dotted rule in
the completion effect.
Second, given a pairing of a successor Earley item with its predecessor Earley
item, and an applicable dotted rule in the successor, Marpa must be able to
determine the applicable dotted rules in the predecessor Earley item. Recall
that a dotted rule is a duple containing a production of the grammar and a dot
position in that production. A dotted rule in a predecessor is
applicable if and only if
- •
- its production is the same as the production of the applicable dotted rule
in the successor, and
- •
- its dot position is exactly one non-null symbol earlier than the dot
position of the applicable dotted rule in the successor.
The parse in this example is not ambiguous. Marpa allows ambiguous parses. When
a parse is ambiguous, the procedure for adding Earley items is the same as
above, except when Marpa attempts to add the same Earley item twice.
Two Earley items are considered "the same" if they have the same AHFA
state, the same origin, and the same current earleme. Marpa will never add the
same Earley item twice.
In cases where Marpa might add the same Earley item twice, Marpa instead adds
another source choice to the already existing Earley item. This means that
some Earley items might have multiple source choices. It is possible for an
Earley item to end up with
- •
- Both a token source choice and a completion source choice;
- •
- Multiple token source choices;
- •
- Multiple completion source choices;
- •
- Multiple token source choices and multiple completion source choices;
A parse is ambiguous if and only if it has one or more points of ambiguity. A
point of ambiguity is an Earley item with two or more source choices,
or an Earley item with two or more applicable dotted rules.
Marpa calculates parse results using fairly traditional methods. Marpa first
derives a pre-order tree from the Earley sets. Marpa then evaluates the
pre-order tree bottom-up, using an evaluation stack.
The evaluation stack is initialized to empty. Every time Marpa evaluates a tree
node, it pushes the result onto the evaluation stack. When a tree node has
child values, they are popped from the evaluation stack. The calculation of
the parse result is complete when Marpa pushes the value of the root tree-node
onto the evaluation stack. This value becomes the value of the parse. The
"trace_values" output in the appendix traces the calculation of the
parse result in our example.
In some very important cases, the naive implementation of the traditional method
of evaluating a parse tree is extremely wasteful. Grammars of great practical
interest often contain very long sequences. A C language "translation
unit" is defined as sequence of definitions and declarations, and that
sequence can be extremely long. A Perl lexical unit is also a sequence, again,
often quite long.
When a sequence is expressed as BNF rules, most of the rule instances used to
implement a long sequence will have no significant semantics of their own.
Popping and pushing the symbols for each of these rule instances is wasteful.
It is more efficient to leave the values of the sequence items on the
evaluation stack until the sequence is complete.
To optimize sequences and other cases where symbols are best left on the stack,
Marpa marks some of its internal symbols as "virtual". A symbol is
virtual if it has no "real" semantics. In the presence of
virtual symbols, Marpa skips much of the usual popping and pushing of values
on the evaluation stack.
Marpa uses three special rule properties to track virtual symbols:
"vlhs", "vrhs" and "real". "vlhs"
means that the left hand side is a virtual symbol. "vrhs" means that
the right hand side contains a virtual symbol. "real" is an integer,
indicating the number of real (that is, non-virtual) symbols on the right hand
side.
Marpa uses the ideas from an algorithm developed by Joop Leo to deal with right
recursion. The Leo logic is complex enough to merit its own example, and so is
presented in a separate document.
As mentioned, in the strict sense, Earley's algorithm is just a recognizer. The
input is recognized successfully if there is an Earley item for a completed
start rule state in the Earley set at the end of parsing. A
completed start
rule state is a AHFA state containing a completed start rule. A completed
start rule is an
LR(0) item for a start rule with a dot position at the
end of the rule.
A
start rule is any rule that has Marpa's internal start symbol on its
left hand side. There are at most two start rules in any Marpa grammar: the
empty start rule and the
non-empty start rule. As a consequence,
there are at most two completed start rule states in any Marpa grammar: the
empty completed start rule state and the
non-empty completed start
rule state.
The empty start rule is the rule with Marpa's internal start symbol on its left
hand side and an empty right hand side. There is an empty start rule in a
Marpa grammar if and only if that Marpa grammar allows the null parse. (Saying
that a grammar allows the null parse is the same as saying that a grammar
recognizes the zero-length, or empty, token string.) In grammars which allow a
null parse, the start state, AHFA state 0, is the empty completed start rule
state.
The non-empty start rule is the rule with Marpa's internal start symbol on its
left hand side and exactly one symbol on its right hand side -- the grammar's
original start symbol. AHFA state S2 is the non-empty completed start rule
state for our example parse:
S2: leo-c; 16
Expression['] -> Expression .
A parse is
successful at earleme N if it contains an
Earley item for a completed start rule state with a dot earleme of
N.
The parse in our example is successful at earleme location 5, because the
following Earley item appears in Earley set 5:
S2@0-5 [p=S0@0-0; c=S5@0-5]
A parse is
successful if that parse is successful at its end of parsing
location. Marpa's default idea of the end of parsing location depends on its
input model. (The end of parsing location can be also be explicitly set by the
user.)
The parse in our example reads five tokens, uses the default input model, and
uses the default end of parsing. That puts the end of parsing for our example
parse at earleme location 5. Since parsing is successful at location 5, the
parse in our example is successful.
Below are the code and the trace outputs for the example used in this document.
my $grammar = Marpa::Grammar->new(
{ start => 'Expression',
actions => 'My_Actions',
default_action => 'first_arg',
strip => 0,
rules => [
{ lhs => 'Expression', rhs => [qw/Term/] },
{ lhs => 'Term', rhs => [qw/Factor/] },
{ lhs => 'Factor', rhs => [qw/Number/] },
{ lhs => 'Term', rhs => [qw/Term Add Term/], action => 'do_add' },
{ lhs => 'Factor',
rhs => [qw/Factor Multiply Factor/],
action => 'do_multiply'
},
],
}
);
$grammar->precompute();
my $recce = Marpa::Recognizer->new( { grammar => $grammar } );
my @tokens = (
[ 'Number', 42 ],
[ 'Multiply', q{*} ],
[ 'Number', 1 ],
[ 'Add', q{+} ],
[ 'Number', 7 ],
);
$recce->tokens( \@tokens );
sub My_Actions::do_add {
my ( undef, $t1, undef, $t2 ) = @_;
return $t1 + $t2;
}
sub My_Actions::do_multiply {
my ( undef, $t1, undef, $t2 ) = @_;
return $t1 * $t2;
}
sub My_Actions::first_arg { shift; return shift; }
my $value_ref = $recce->value();
my $value = $value_ref ? ${$value_ref} : 'No Parse';
0: Expression, lhs=[0] rhs=[5] terminal
1: Term, lhs=[1 3] rhs=[0 3] terminal
2: Factor, lhs=[2 4] rhs=[1 4] terminal
3: Number, lhs=[] rhs=[2] terminal
4: Add, lhs=[] rhs=[3] terminal
5: Multiply, lhs=[] rhs=[4] terminal
6: Expression['], lhs=[5] rhs=[]
0: Expression -> Term
1: Term -> Factor
2: Factor -> Number
3: Term -> Term Add Term
4: Factor -> Factor Multiply Factor
5: Expression['] -> Expression /* vlhs real=1 */
Start States: S0; S1
S0: 15
Expression['] -> . Expression
<Expression> => S2; leo(Expression['])
S1: predict; 1,3,5,7,11
Expression -> . Term
Term -> . Factor
Factor -> . Number
Term -> . Term Add Term
Factor -> . Factor Multiply Factor
<Factor> => S3
<Number> => S4
<Term> => S5
S2: leo-c; 16
Expression['] -> Expression .
S3: 4,12
Term -> Factor .
Factor -> Factor . Multiply Factor
<Multiply> => S6; S7
S4: 6
Factor -> Number .
S5: 2,8
Expression -> Term .
Term -> Term . Add Term
<Add> => S8; S9
S6: 13
Factor -> Factor Multiply . Factor
<Factor> => S10; leo(Factor)
S7: predict; 5,11
Factor -> . Number
Factor -> . Factor Multiply Factor
<Factor> => S11
<Number> => S4
S8: 9
Term -> Term Add . Term
<Term> => S12; leo(Term)
S9: predict; 3,5,7,11
Term -> . Factor
Factor -> . Number
Term -> . Term Add Term
Factor -> . Factor Multiply Factor
<Factor> => S3
<Number> => S4
<Term> => S13
S10: leo-c; 14
Factor -> Factor Multiply Factor .
S11: 12
Factor -> Factor . Multiply Factor
<Multiply> => S6; S7
S12: leo-c; 10
Term -> Term Add Term .
S13: 8
Term -> Term . Add Term
<Add> => S8; S9
Last Completed: 5; Furthest: 5
Earley Set 0
S0@0-0
S1@0-0
Earley Set 1
S4@0-1 [p=S1@0-0; s=Number; t=\42]
S3@0-1 [p=S1@0-0; c=S4@0-1]
S5@0-1 [p=S1@0-0; c=S3@0-1]
S2@0-1 [p=S0@0-0; c=S5@0-1]
Earley Set 2
S6@0-2 [p=S3@0-1; s=Multiply; t=\'*']
S7@2-2
Earley Set 3
S4@2-3 [p=S7@2-2; s=Number; t=\1]
S10@0-3 [p=S6@0-2; c=S4@2-3]
S11@2-3 [p=S7@2-2; c=S4@2-3]
S3@0-3 [p=S1@0-0; c=S10@0-3]
S5@0-3 [p=S1@0-0; c=S3@0-3]
S2@0-3 [p=S0@0-0; c=S5@0-3]
Earley Set 4
S8@0-4 [p=S5@0-3; s=Add; t=\'+']
S9@4-4
Earley Set 5
S4@4-5 [p=S9@4-4; s=Number; t=\7]
S3@4-5 [p=S9@4-4; c=S4@4-5]
S12@0-5 [p=S8@0-4; c=S3@4-5]
S13@4-5 [p=S9@4-4; c=S3@4-5]
S5@0-5 [p=S1@0-0; c=S12@0-5]
S2@0-5 [p=S0@0-0; c=S5@0-5]
Pushed value from a12 T@0-1_Number: Number = \42
Popping 1 values to evaluate a12 T@0-1_Number, rule: 2: Factor -> Number
Calculated and pushed value: 42
Pushed value from a10 R4:1@0-1T@1-2_Multiply: Multiply = \'*'
Pushed value from a9 T@2-3_Number: Number = \1
Popping 1 values to evaluate a9 T@2-3_Number, rule: 2: Factor -> Number
Calculated and pushed value: 1
Popping 3 values to evaluate a8 R4:2@0-2F2@2-3, rule: 4: Factor -> Factor Multiply Factor
Calculated and pushed value: 42
Popping 1 values to evaluate a7 F4@0-3, rule: 1: Term -> Factor
Calculated and pushed value: 42
Pushed value from a5 R3:1@0-3T@3-4_Add: Add = \'+'
Pushed value from a4 T@4-5_Number: Number = \7
Popping 1 values to evaluate a4 T@4-5_Number, rule: 2: Factor -> Number
Calculated and pushed value: 7
Popping 1 values to evaluate a3 F2@4-5, rule: 1: Term -> Factor
Calculated and pushed value: 7
Popping 3 values to evaluate a2 R3:2@0-4F1@4-5, rule: 3: Term -> Term Add Term
Calculated and pushed value: 49
Popping 1 values to evaluate a1 F3@0-5, rule: 0: Expression -> Term
Calculated and pushed value: 49
New Virtual Rule: a0 F0@0-5, rule: 5: Expression['] -> Expression
Symbol count is 1, now 1 rules
Copyright 2007-2010 Jeffrey Kegler, all rights reserved. Marpa is free software
under the Perl license. For details see the LICENSE file in the Marpa
distribution.