Manual Reference Pages  MARPA::ADVANCED::IMPLEMENTATION (3)
.ds Aq ’
NAME
Marpa::Advanced::Implementation  Marpa Implementation
CONTENTS
OVERVIEW
This document describes
the Marpa data structures that
are useful for
tracing and debugging
grammars.
It assumes that the reader
knows
the Marpa API,
and that she
has a general knowledge of parsing.
All the examples
in this document
assume that none of the data has been stripped.
GRAMMAR REWRITING
Marpa rewrites grammars,
adding internal symbols and rules in the process.
These rewritings do not affect the semantics,
but they do show up when you examine the internals.
Marpa’s internal symbols have <B>tagsB> at the end,
enclosed in square brackets.
This means that all Marpa internal symbols end in a right square bracket.
Adding a Start Rule
Many parsers add their own start rule
and their own start symbol
to grammars.
Marpa is no exception.
The new internal start symbol is the original start symbol with "[] suffixed.
An example of a Marpa internal start symbol is Expression[].
If the grammar allows a null parse, there will also be a Marpa
internal nulling
start symbol,
with [][] suffixed.
An example of a Marpa internal nulling start symbol would be Script[][]".
Elminating Proper Nullable Symbols
A symbol is <B>nulledB> if it produces the empty sentence.
<B>Nulling symbolsB> are those which are always nulled.
<B>Nullable symbolsB> are those which are sometimes nulled.
<B>Nonnullable symbolsB> are those which are never nulled.
Pedantically, all nulling symbols are also nullable symbols.
A <B>proper nullableB> is any nullable symbol which is not a nulling symbol.
In other words,
a proper nullable is a symbol that
might or might not be nulled.
Nullable symbols were a problem in previous versions of Earley parsers.
Aycock and Horspool 2002
outlined a new approach for dealing with them.
I use their ideas with modifications of my own.
Marpa rewrites its grammar to eliminate proper nullables.
It does this by turning every proper nullable into two symbols:
a nonnullable variant and a nulling variant.
The nonnullable variant keeps the original symbol’s name,
but is no longer allowed to appear in places
where it might be nulled.
The name of the nulling variant
is that of the original symbol with the nulling tag ("[]") suffixed.
When the nulling variant is used,
it must be nulled.
The newly introduced nulling symbols will not appear on any left hand sides,
with one exception:
grammars that allow a null parse
will have a nulling start rule.
Except for the nulling start symbol,
Marpa marks nulling symbols internally and
recognizes them directly,
without the need for empty rules.
Rules with proper nullables on the RHS
have to be replaced with
new rules covering every possible combination of
the nonnullable and nulling variants.
That rewrite is called the CHAF rewrite,
and is described in the next section.
CHAF Rewriting
To deal with the splitting of RHS proper nullables into two symbols,
one nonnullable and one nulling,
Aycock and Horspool created new rules covering all possible RHS combinations
of nonnullable and nulling symbols.
This <B>factoringB> is exponential in the theoretical worst case,
and I believe would create efficiency issues in practice as well.
A result due to Chomsky shows that any grammar can be rewritten
as a grammar with at most two symbols on the right hand side.
Relaxing Chomsky’s rewrite to allow right hand sides with any number of symbols,
but at most two proper nullables,
produces a rewrite I call CHAF (ChomskyHorspoolAycock Form).
CHAF changes the worst case to linear, and in practical cases lowers
the multiplier. Here’s an example of a CHAF rewrite.
First, the rule:
{ lhs => statement,
rhs => [
qw/optional_whitespace expression
optional_whitespace optional_modifier
optional_whitespace/
]
}
This rule contains four instances
of proper nullables,
illustrating my fear
that grammars of practical interest will
have lots of proper nullables
on right hand sides.
optional_whitespace
and optional_modifier are both
proper nullables and
optional_whitespace
appears three times.
Here’s is the output from show_rules, showing what Marpa does with this rule:
0: statement > optional_whitespace expression optional_whitespace optional_modifier optional_whitespace
/* !used */
13: statement > optional_whitespace expression statement[R0:2] /* vrhs real=2 */
14: statement > optional_whitespace expression optional_whitespace[] optional_modifier[] optional_whitespace[]
15: statement > optional_whitespace[] expression statement[R0:2] /* vrhs real=2 */
16: statement > optional_whitespace[] expression optional_whitespace[] optional_modifier[] optional_whitespace[]
17: statement[R0:2] > optional_whitespace statement[R0:3] /* vlhs vrhs real=1 */
18: statement[R0:2] > optional_whitespace optional_modifier[] optional_whitespace[] /* vlhs real=3 */
19: statement[R0:2] > optional_whitespace[] statement[R0:3] /* vlhs vrhs real=1 */
20: statement[R0:3] > optional_modifier optional_whitespace /* vlhs real=2 */
21: statement[R0:3] > optional_modifier optional_whitespace[] /* vlhs real=2 */
22: statement[R0:3] > optional_modifier[] optional_whitespace /* vlhs real=2 */
Rule 0 is the original rule. Because Marpa has rewritten it,
the rule is marked "!used",
telling later stages in the precomputation to ignore it.
Marpa breaks Rule 0 up into three pieces,
each with no more than two proper nullables.
Marpa then eliminates the proper nullables in each piece by factoring.
<B>FactoringB> is so called because the proper nullables are
factored out into their nulling and nonnulling variants.
To <B>factorB> a piece, Marpa first rewrites it into multiple rules,
one for each possible combination of nulled and nonnulled symbols.
Marpa then replaces each proper nullable
with its nulling or nonnullable variant, as appropriate.
There are two symbol variants for each proper nullable,
and a maximum of two proper nullables for each piece.
That means that each piece will be factored into, at most,
four rules.
In the example in the above display,
the original rule (Rule 0) was broken into three pieces.
Rule 0 had 5 RHS symbols,
and the three pieces contain, respectively, the first two RHS symbols;
the third RHS symbol;
and the last two (that is, 4th and 5th) RHS symbols.
The first piece of Rule 0 is factored into four rules: Rules 13 to 16.
The second piece is factored into three rules: Rules 17 to 19.
The third and last piece of Rule 0 is also factored into three rules:
Rules 20 to 22.
When a rule is broken into pieces, the
left hand side of the first piece is the left hand
side of the original rule.
New symbols are introduced to be the left hand sides
of the other pieces.
The names of the new LHS symbols are formed
by suffixing a tag to
the name of the original left hand side.
The suffixed tag begins with a capital R and identifies the
location of the piece in the original rule.
For example,
the tag "[R0:3]" indicates that this
symbol is the left hand side for the piece of Rule 0 which begins at right hand symbol 3
(the first symbol is symbol 0).
When a new LHS symbol is created for a piece,
the worst case is that the new LHS is
also a proper nullable.
This worst case occurs when
all the original symbols in the
piece for the new LHS
and all symbols in all subsequent pieces are proper nullables.
When a newly created LHS symbol is a proper nullable,
it counts against the limit of two proper nullables per piece,
and it must also be factored,
just like the proper nullables in the original rule.
Rule 0 is such a worst case.
The last three symbols
of the right hand side are all
proper nullables.
That means that all the symbols in the last two pieces of the original rule
are proper nullables.
Therefore both of the newly created LHS symbols are proper nullables.
The original Rule 0 has 4 proper nullables.
After the CHAF rewrite,
there are 6 proper nullables:
the original 4 plus the 2 symbols newly created to serve as left hand sides.
This is why, in order to have at most 2 proper nullables per piece,
the original rule must to be divided into 3 pieces.
Even though Rule 0 is a CHAF worst case,
CHAF’s factoring result in 10 rules.
The original AycockHorspool factoring (NNF)
resulted in 16 rules.
For cases other than the worst,
and for situations with
more than 4 proper nullables, the advantage of CHAF becomes overwhelming.
For 5 proper nullables, there would be 13 rules for CHAF,
versus 32 for NNF.
For 6 proper nullables, there would be 16 rules for CHAF,
versus 64 for NNF.
CHAF preserves user semantics.
Marpa,
when it splits the rule into
pieces and factors the pieces,
inserts logic to gather and preserve the values of child nodes.
The user’s semantic actions
see the original child values
as if the CHAF rewrite had never occurred.
Converting Sequence Productions to BNF
Internally, Marpa converts productions specified as sequences into BNF productions.
The conversion is done in a standard way. For example,
{
lhs => statements,
rhs => [qw/statement/],
separator => comma,
min => 1
}
becomes
1: statements > statement[Seq:1*][Sep:comma] /* vrhs discard_sep real=0 */
2: statements > statement[Seq:1*][Sep:comma] comma /* vrhs discard_sep real=1 */
3: statement[Seq:1*][Sep:comma] > statement /* vlhs real=1 */
4: statement[Seq:1*][Sep:comma] > statement[Seq:1*][Sep:comma] comma statement /* vlhs vrhs real=2 */
In the added symbol, the tag "[Seq:1*] indicates this is a symbol for a sequence
of from 1 to an infinite number of symbols
and the tag [Sep:comma]" that it is
comma separated.
Here’s another example, this time of a sequence without a separator.
The rule
{ lhs => block, rhs => [qw/statements/], min => 0 },
is rewritten by Marpa as follows:
5: block > /* empty !used nullable */
6: block > statements[Seq:1*] /* vrhs real=0 */
7: statements[Seq:1*] > statements /* vlhs real=1 */
8: statements[Seq:1*] > statements[Seq:1*] statements /* vlhs vrhs real=1 */
Note that rule 5, the empty rule with the block symbol as its LHS,
is marked "!used".
block is a proper nullable, and rules from sequence conversion
undergo the same rewriting of proper nullables as any other rules.
In this case a nulling variant of block (block[])
was created.
That made the empty rule created by the sequence conversion
useless,
and that is why it was marked "!used".
EARLEY SETS
Creating the Earley sets is
not parsing in the strict pedantic sense.
Creating the Earley sets is <B>recognitionB> —
determining whether or not a grammar recognizes an input.
During recognition,
Marpa creates <B>Earley itemsB> and puts them into Earley sets.
The input is recognized if
an Earley item of the correct form is in the Earley set
located at the end of parsing.
In Marpa,
parsing in the strict sense takes place after
the recognition phase is done,
using the Earley sets built by the recognizer.
Every Earley item has a <B>current earlemeB>.
An Earley item’s current earleme is also called its <B>dot earlemeB>,
its <B>current locationB>, or its <B>dot locationB>.
An <B>Earley setB>
is the set of all the Earley items with the same dot location.
In the default, tokenstream, input model,
the <B>earlemeB> location is exactly the same as the
zerobased token stream position.
Marpa allows other input models,
and in those models the term <B>earlemeB> can take on
other meanings.
User interested in the alternative input models
should look at
Marpa::Advanced::Models.
In addition to its <B>current earlemeB>,
each Earley item has a <B>AHFA stateB>, and an <B>originB>.
The origin of an Earley item can also be called its <B>start locationB>.
Here’s a representation of an Earley item, as you might see it in
debugging output from Marpa:
S4@23
This Earley item is for AHFA state 4 (that is the "S4 part).
The @23" part says that this Earley item
originates at earleme 2, and is current at earleme 3.
The number of an Earley item’s current earleme
is always
the same as number of the Earley set that contains the item.
(Note to experts:
Those familiar with Earley parsing will note that S4@23 does not look like
a traditional Earley item.
AHFA states are not traditional —
they come from Aycock and Horspool.
The use of the atsign as punctuation is from
Grune and Jacobs.
And I have added the current earleme to the notation.)
AHFA STATES
To understand what is going on in the Earley items,
it will be necessary to understand AHFA’s (AycockHorspool Finite Automata).
Let’s look at an Earley item:
S5@03
This states that this item is for AHFA state 5;
that it originates at earleme 0;
and that it is currently at
(or has its dot position at)
earleme 3.
We can get a description of the AHFA states
using the show_AHFA method.
(For those who like the big picture,
the entire show_AHFA output,
along with the other
trace and debugging outputs for this example,
is in the appendices.)
Here’s what show_AHFA says about AHFA state 5:
S5: 2,8
Expression > Term .
Term > Term . Add Term
<Add> => S8; S9
The first line of show_AHFA’s description
of AHFA state 5 begins with its name: S5.
The two numbers on the first line
after the AHFA state name
are the numbers of the LR(0) NFA states
from which this AHFA state was formed.
They
will not be mentioned again in this document,
and may be ignored.
Marpa uses AHFA states to track the parse.
Every AHFA state has one or more LR(0) items.
In the above display,
the second and third lines are LR(0) items.
<B>B>LR<B>(0) itemsB> are rules with a dot added to indicate
how far recognition has proceeded into the rule.
A dot between two symbols indicates that all symbols
before the dot
have been recognized,
but that none of the symbols after the dot have been recognized.
A dot at the beginning of an LR(0) item
indicates that none of its symbols have yet been recognized.
A dot at the end indicates that the rule is <B>completeB>
— that all of the rule’s symbols have been recognized.
The location of the dot is called the <B>dot positionB>.
The symbol before the dot position in an LR(0) item is called the <B>predot symbolB>.
The symbol after the dot position in an LR(0) item is called the <B>postdot symbolB>.
The last line in the above display shows the <B>transitionsB>
from AHFA state S5.
It says that, when Marpa sees an Add token,
it will transition both to AHFA state S8
and to AHFA state S9.
We will use both transitions in our examples.
It’s important not to confuse Earley items with LR(0) items.
Earley items are built from one or more LR(0) items.
In traditional Earley parsing, each Earley item contained one and only one LR(0) item.
This made the internals simpler, but was not as efficient.
Marpa combines LR(0) items into AHFA states
based on the ideas in Aycock and Horspool 2002.
Marpa places an important restriction on LR(0) items.
The postdot symbol is <B>neverB> a nullable symbol.
In a completed rule,
the dot position will be at the very end of a rule, after all
symbols.
When an LR(0) item is not a completed rule,
the postdot symbol will always be nonnullable.
Intuitively, it may be helpful to think of Marpa as
recognizing nulling symbols instantly,
whenever they are encountered,
so that the dot position never stops in front of a nulling symbol.
Because nulling symbols are never postdot symbols,
the dot position is tracked
in terms of the nonnulling symbols only.
Nulling symbols are skipped over as if they didn’t exist.
This document speaks of the <B>nonnull symbol positionB> in a rule,
or simply of the <B>nonnull positionB>.
Recognition of a rule starts with the dot position
at the <B>first nonnull positionB>.
An LR(0) item with the dot position at the first nonnull position
is called a <B>rule initializationB>.
Beginning from the rule initialization,
as each postdot symbol is recognized,
the dot position advances to the <B>next nonnull positionB>.
Once all of the zero or more symbols are recognized,
the LR(0) item becomes a <B>completed ruleB>.
When an LR(0) item is not a completed rule,
it is called an <B>incompleteB> LR(0) item,
or an <B>incomplete ruleB>.
Every AHFA state contains one or more LR(0) items.
This means that every AHFA state
is a set of statements about rule recognition.
An Earley item is present in the Earley sets if and only if
every rule in every LR(0) item in the AHFA state of the Earley item has
been recognized as far as its dot position.
The origin of each LR(0) item corresponds to the origin of the Earley item,
and the dot position of each LR(0) item corresponds
to the dot location of the Earley item.
HOW EARLEY SETS ARE BUILT
New items come into the Earley sets in four ways:
scanning, prediction, completion, and initialization.
<B>ScanningB> is the addition of Earley items
to represent the recognition of
symbols directly in the input.
<B>Rule completionB>
(or simply <B>completionB>)
occurs when one of the rules
of the grammar is fully recognized, or completed.
Rule completions add Earley items to
represent the recognition of symbols that are
not found directly in the input,
but that are found indirectly by applying the rules of the grammar.
<B>PredictionB> adds Earley items
to represent rules that might
start at a given location.
<B>InitializationB> adds one or two Earley items to Earley set 0
to represent the start rules.
Scanning
Scanning adds Earley items to indicate
which tokens have been recognized in the
input, and where.
Let’s look at the case where we are at earleme 3;
we are about to recognize an Add token;
and we have in our Earley set 3
the Earley item S5@03.
S5@03 is the Earley item we
already looked at.
The current earleme for
S5@03 is earleme 3.
Here, again, is the information for
AHFA state S5,
the AHFA state
in Earley item S5@03.
S5: 2,8
Expression > Term .
Term > Term . Add Term
<Add> => S8; S9
Marpa knows that it can instruct the lexer to look for a Add token at earleme 3,
because in AHFA state S5,
there is a transition on an Add token.
In this example, an Add token corresponds to a plus sign ("+").
Marpa’s state machine is
not fully deterministic, it is quasideterministic.
That is, Marpa’s state machine is not a DFA
(Deterministic Finite Automata).
A DFA state,
for any given token,
has at most one transition.
Marpa’s AHFA states,
for any given token,
can have up to two transitions.
In the case of AHFA state S5,
on an Add token
there are two transitions:
one to AHFA state S8
and one to AHFA state S9.
We will use both transitions in our examples.
The transition to
AHFA state S9 is
an example of adding a <B>predicted Earley itemB>
to an Earley set.
The transition to
AHFA state S8 is
an example of adding a <B>scanned Earley itemB>
to an Earley set.
At this point, we are discussing scanning,
so the first transition that we will look at
is the transition to AHFA state 8.
S8: 9
Term > Term Add . Term
<Term> => S12
Which Earley set does the new Earley item for AHFA state S8 go into?
In this example, the Add token has a length
(in earlemes) of 1.
The current earleme for S5@03 is earleme 3.
Since 3+1 equals 4,
the current earleme for the new Earley item will be earleme 4.
The Earley item containing the transition and fromstate
used to create a scanned Earley item
is called the <B>predecessorB> of the scanned Earley item.
In this example the predecessor of the scanned Earley item
is the Earley item
S5@03.
The origin of a scanned Earley item is always the same
as the origin of its predecessor.
Collecting what
we’ve determined so far, the new, scanned, Earley item
will have its dot location at earleme 4,
will have AHFA state S8,
and will have its origin at earleme 0.
This means that the new, scanned, Earley item will be S8@04,
and will look like this:
S8@04 [p=S5@03; s=Add; t=\+]
Any item which is added to an Earley set based on a AHFA state transition
from a predecessor,
is called that predecessor’s <B>successorB>.
In this example, S8@04 is the successor of S5@03.
In the current example, the token values are
"42,
*,
7,
+ and
"1.
The current earleme for
Earley item S8@04 is earleme 4,
indicating that the 4th token has just been
seen.
The fourth token value is a plus sign (+"),
which the lexer of this example read
as an Add token.
After the name of the Earley item is an
<B>Earley item source choiceB>, shown in brackets.
When the context makes the
meaning clear, an <B>Earley item source choiceB>
may be called a <B>sourceB> or a <B>source choiceB>.
In this example, the source choice is
[p=S5@03; s=Add; t=\+].
This parse is not ambiguous,
so all Earley items have at most one source choice entry.
In this case the source choice entry is for a <B>token choiceB>.
Token choices are shown as [p=predecessor; s=token_name; t=token_value] 3tuples.
The token choice
[p=S5@03; s=Add; t=\+]
indicates that
the scanned token was an Add symbol,
that the scanned token value is "+",
and that
the predecessor of S8@04 is
S5@03.
Token Lengths
Above, we calculated the location of the new Earley item assuming that the
Add token had a length, in earlemes, of 1.
Where did this token length come from?
A length of 1 is the default length for a token.
In fact, in the default, tokenstream, model of input,
a token length of 1 is the <B>onlyB> token length possible.
When other, nonstandard, input models are used, tokens
can be variable length.
For details on alternative input models,
see Marpa::Advanced::Models.
Prediction
In the previous section, we looked at one of
the two transitions from AHFA state S5 at earleme 3
on an Add token.
The transition we looked at was to AHFA state S8.
Now we will look at the other transition, to AHFA state S9.
Here is AHFA state S9:
S9: predict; 3,5,7,11
Term > . Factor
Factor > . Number
Term > . Term Add Term
Factor > . Factor Multiply Factor
<Factor> => S3
<Number> => S4
<Term> => S13
In the first line of the show_AHFA output for AHFA state S9,
immediately after the name of the state, comes the string "predict".
This means that S9 is a prediction state,
and that the new Earley item we add as a result of the transition
to S9 will be a predicted Earley item.
The current location of a predicted Earley item is determined in
the same way as the current location of a scanned item.
We take the current location of the Earley item containing the
transition and add the length of the token causing the transition.
In this case the transition is in S5@03, whose current location is 3.
The token is the same as it was in the previous section: an Add
token with a token length of 1.
As before, 3+1 equals 4, so that the new current location will be 4.
Predicted Earley items determine
their start location,
or origin,
in a special way.
The origin of a predicted Earley item is always the same as
its current location.
So the origin of this new,
predicted, Earley item will be 4.
Now that we know its AHFA state, current location and origin,
we are in a position to add the predicted Earley item.
It goes into Earley set 4, and looks like this:
S9@44
Note that no source choice is shown.
Source choices are not recorded for predicted items.
More about Prediction States
Prediction states exist because
the presence of incomplete LR(0) items at an earleme location
gives rise to rule initializations
at that same earleme location.
When
an LR(0) item with a postdot symbol is present at a given
location,
we call that symbol an <B>expected symbolB> at that location.
At any given location,
for every expected symbol,
we can also expect
the rule initializations for all the rules
with that expected symbol on the lefthand side.
Any rule initialization we introduce will have a postdot symbol.
If that postdot symbol is not already an expected symbol,
it may cause further rule initializations to be expected
at the same location.
In other words, rule initializations recurse.
The AHFA states invented by
Aycock and Horspool
handle these recursive rule initializations
efficiently.
Above we looked at AHFA states S8 and S9.
In the following paragraphs,
we’ll examine the recursive sequence of rule
initializations as it
originates in S8
and is collected in S9.
The AHFA’s invented by
Aycock and Horspool pair
every transition to a prediction state with the
transition to a kernel (or nonprediction) state.
In this case, S8 and S9 are a pair,
where S8 is the kernel state.
S8 contains only one LR(0) item:
Term > Term Add . Term
In this LR(0) item the postdot symbol is Term.
S9, the prediction AHFA state
paired with kernel AHFA state S8,
will contain
all the rule initializations
caused by the expectation
of a Term symbol.
There are two rules with
Term on their left hand side.
Their rule initializations are
Term > . Term Add Term
Term > . Factor
Rule initializations recurse.
The rule initializations above have two postdot symbols:
Term and Factor.
Term is already an expected symbol,
but Factor is a new expected symbol.
When Factor is an expected symbol
at an earleme location,
all rules with
Factor on their left hand side
will have rule initializations
at that same earleme location.
There are two such rules.
Here are their rule initializations:
Factor > . Factor Multiply Factor
Factor > . Number
The postdot symbols in these rule initializations
are Factor and Number.
Factor is already an expected symbol.
Number is new as an expected symbol,
but there is no rule in our grammar with
Number on its left hand side.
The recursion is finished.
We have added all the rule initializations for
the expected symbols found in previous steps.
There are no new expected symbols in this recursion
step.
That means there will be no more new rule initializations,
and therefore no more new expected symbols.
Starting with the expected symbol in AHFA state S8,
we found four rule initializations.
The prediction state paired with S8 is S9.
Let’s look at it again:
S9: predict; 3,5,7,11
Term > . Factor
Factor > . Number
Term > . Term Add Term
Factor > . Factor Multiply Factor
<Factor> => S3
<Number> => S4
<Term> => S13
We see that, indeed,
S9 contains all four of the rule
initializations
that we found in this section.
Completion
Above, we saw a scanned Earley item,
added as the result
of recognizing a symbol directly in the input.
New symbols are also recognized indirectly,
when rules are completed.
When a rule is recognized all the way to the end,
so that the dot location is
after the rightmost right hand side symbol,
then the left hand side symbol of that rule
is also recognized.
As a reminder,
an LR(0) item with the dot at the end of the rule
is called a <B>completed ruleB>.
Whenever an Earley item contains a completed rule,
that indicates that the left hand side symbol of the completed rule
was recognized.
The left hand side symbol of a completed rule is
a <B>completed symbolB>.
Earley items with completed symbols
can cause other Earley items to be added
to the Earley sets.
Let’s look at one example.
S4@23, in Earley set 3,
contains a completed symbol.
Here is what AHFA state 4 looks like:
S4: 6
Factor > Number .
Every completed symbol has an origin and a current earleme.
The origin and the current earleme of a completed symbol
are exactly the same as the
origin and the current earleme
of the Earley item that completes
that symbol.
In this case,
the Earley item completing the symbol Factor
is S4@23 so that
Factor is a completed symbol with
origin at 2 and current earleme at 3.
An Earley item that completes a symbol is called a <B>completion Earley itemB>,
or simply a <B>completionB>.
As a result of recognizing a completion Earley item,
Marpa may add
one or more <B>completion effect Earley itemsB> to the Earley sets.
A <B>completion effect Earley itemB> is often called simply
a <B>completion effectB>, or just an <B>effectB>.
When a completion effect Earley item is added to the Earley sets,
the completion Earley item which caused the effect is
called a <B>completion cause Earley itemB>.
A <B>completion cause Earley itemB> is often called simply
a <B>completion causeB>, or just an <B>causeB>.
For a completion effect Earley item to be added to the Earley sets,
a completion cause is necessary
but not sufficient.
There must also be a matching predecessor Earley item.
A <B>matching predecessorB> Earley item
is an Earley item
whose origin is the same as the origin of the completed symbol,
and whose the postdot symbol is the completed symbol.
In our example, Factor is a completed symbol
with origin 2.
We will add an Earley item as a completion effect if we find a matching
predecessor:
an Earley item in
Earley set 2 which has Factor as one of the postdot symbols.
In this example we find two matching predecessors:
S6@02 and S7@22.
Completion effects are added for both of these.
We’ll look only at the completion effect
that has
S6@02 as its
matching predecessor.
Here is AHFA state 6.
S6: 13
Factor > Factor Multiply . Factor
<Factor> => S10
The transition for Factor in AHFA state S6
is to S10, so the new completion effect will have
AHFA state S10.
A completion effect always has the same current earleme
as its completion cause (in this case S4@23).
The origin of a completion effect is always the
same as the origin of its matching predecessor (in this case S6@02).
The means our new
new completion effect
will have AHFA state S10, origin 0 and current earleme 3.
Here it is:
S10@03 [p=S6@02; c=S4@23]
Note the source choice in brackets after the name:
[p=S6@02; c=S4@23].
Previously we saw a token source choice.
This is another kind of source choice, a <B>completion source choiceB>.
This source choice says the reason for S10@03 to be in the Earley sets
is the completion cause S4@23
and its matching predecessor,
S6@02.
When a parse is ambiguous,
Earley items can have multiple source choices.
In this example,
the parse is unambiguous,
and all scanned Earley items
and all completion effect Earley items will
have one and only one source choice.
Completions recurse.
If a completion effect
contains a completed symbol,
and if that completion effect has a matching predecessor,
then it is also a completion cause.
The completion effect that we just added (S10@03)
is a completion cause as well.
It is a step in a recursive sequence of completions.
Earley item S10@03 has AHFA state S10.
AHFA state S10 contains a completed rule:
S10: leoc; 14
Factor > Factor Multiply Factor .
(Ignore the leoc notation for now.
It indicates that S10 is what’s called a Leo completion.
The Leo logic is dealt in another
document.)
The completed symbol is Factor,
with origin
0 and current earleme 3.
Looking for
matching predecessors at earleme 0,
we find one:
S1@00.
Here is what AHFA state 1 looks like:
S1: predict; 1,3,5,7,11
Expression > . Term
Term > . Factor
Factor > . Number
Term > . Term Add Term
Factor > . Factor Multiply Factor
<Factor> => S3
<Number> => S4
<Term> => S5
The completed symbol is Factor and its transition
in S1 is to S3, so we add S3@03 to Earley set 3.
S3@03 [p=S1@00; c=S10@03]
In our scanning example,
we referred to S5@03.
One more step in the sequence of completions
and we will see
S5@03 again.
The origin of S3@03 is at earleme 0.
Term is only the completed symbol in AHFA state 3.
S3: 4,12
Term > Factor .
Factor > Factor . Multiply Factor
<Multiply> => S6; S7
S1@00 contains an LR(0) item
which has Term as the postdot
symbol.
In AHFA state S1, the transition on Term
is to AHFA state S5.
This means S5@03 will be
added to Earley set 3 as a completion effect,
with S3@03 as the cause and S1@00 as the matching predecessor:
S5@03 [p=S1@00; c=S3@03]
We call a completion effect Earley item the
<B>successorB> of its matching predecessor.
For example,
S5@03 is the successor
of S1@00.
The matching predecessor of an Earley item is often
simply called
its <B>predecessorB>.
For example,
S1@00 is the predecessor of
S5@03.
Initialization
One or two Earley items are put into Earley set 0 to start things off.
In our example, the initial Earley items are
S0@00
S1@00
S0@00 will always be present in Earley set 0.
AHFA state 0 contains the start rules, with the dot pointer at the beginning.
Only Earley set 0 will contain an Earley item for AHFA state 0.
S1@00 contains the rules predicted by S0@00.
For some very simple grammars S1@00 is not necessary.
EVALUATION
As of this writing,
this section is not a complete presentation of Marpa’s
Generation 3 evaluator.
Instead,
it focuses on the issues relevant to users
using Marpa’s advanced trace accessors.
Applicable Dotted Rules
As we have seen, Marpa uses AHFA states in its Earley items,
and these can contain more than one dotted rule.
To produce an actual parse result, Marpa must
find the dotted rules which are applicable for
that particular parse.
A dotted rule that applies in a particular parse result
is called an <B>applicable dotted ruleB>
or an <B>applicable B>LR<B>(0) itemB>.
The search for applicable dotted rules begins with the start
Earley item.
There will always be exactly one dotted rule in
a start Earley item,
and that one dotted rule will always be applicable
to the current parse result.
In other words,
in any parse result,
the dotted rule in its start Earley item
is an applicable dotted rule.
The search for the other dotted rules applicable to that parse result
proceeds topdown,
starting with
the applicable dotted rule in the start Earley item.
If the parse is not ambiguous, there will be exactly one
applicable dotted rule for every Earley item.
If the parse is ambiguous,
at some points there may be more than one applicable dotted
rule.
Ambiguous parsing is discussed in greater depth in the
next section.
To find applicable dotted rules
while proceeding topdown,
Marpa must be able to do two things.
First,
given a pairing of a completion effect
with its completion cause,
and an applicable dotted rule in the completion effect,
Marpa must be able to
determine the applicable dotted rules in the completion cause.
A dotted rule in a completion cause is <B>applicableB>
if and only if the LHS symbol of the dotted rule
in the completion cause
is the same
as the first nonnull symbol
before the dot position of the applicable dotted rule
in the completion effect.
Second,
given a pairing of a successor Earley item
with its predecessor Earley item,
and an applicable dotted rule in the successor,
Marpa must be able to
determine the applicable dotted rules in the predecessor
Earley item.
Recall that a dotted rule is a duple containing a
production of the grammar and a dot position in that production.
A dotted rule in a predecessor is <B>applicableB>
if and only if

o

its production is the same as the
production of the
applicable dotted rule
in the successor, and

o

its dot position is exactly one nonnull symbol earlier
than the dot position of the applicable dotted rule
in the successor.


Ambiguous Parsing
The parse in this example is not ambiguous.
Marpa allows ambiguous parses.
When a parse is ambiguous, the procedure for adding
Earley items is the same as above,
except when Marpa attempts to add the same
Earley item twice.
Two Earley items are considered the same
if they have the same AHFA state,
the same origin,
and the same current earleme.
Marpa will never add the same Earley item twice.
In cases where Marpa might add the same Earley item twice,
Marpa instead adds another source choice to the
already existing Earley item.
This means that
some Earley items might have multiple source choices.
It is possible for an
Earley item to end up with

o

Both a token source choice and a completion source choice;

o

Multiple token source choices;

o

Multiple completion source choices;

o

Multiple token source choices and multiple completion source choices;


A parse is ambiguous if and only if it has one or more points of ambiguity.
A <B>point of ambiguityB> is
an Earley item with two or more
source choices,
or an Earley item with
two or more applicable dotted rules.
Calculating the Parse Result
Marpa calculates parse results using fairly traditional methods.
Marpa first derives a preorder tree from the Earley sets.
Marpa then evaluates the preorder tree bottomup,
using an evaluation stack.
The evaluation stack is initialized to empty.
Every time Marpa evaluates a tree node,
it pushes the result onto the evaluation stack.
When a tree node has child values,
they are popped from the evaluation stack.
The calculation of the parse result is complete
when Marpa
pushes the value of the root treenode
onto the evaluation stack.
This value
becomes the value of the parse.
The trace_values output in the
appendix
traces the calculation of the parse result
in our example.
Optimizations
In some very important cases,
the naive implementation
of the traditional method of evaluating a parse
tree is extremely wasteful.
Grammars of great practical interest often contain very long sequences.
A C language translation unit is defined as sequence of definitions
and declarations, and that sequence
can be extremely long.
A Perl lexical unit is also a sequence, again,
often quite long.
When a sequence is expressed as BNF rules,
most of the rule instances used to implement
a long sequence
will have
no significant semantics of their own.
Popping and pushing the symbols for each
of these rule instances
is wasteful.
It is more efficient to leave the values of
the sequence items
on the evaluation stack until the sequence is complete.
To optimize
sequences
and other cases
where symbols
are best left on the stack,
Marpa marks some of its internal symbols as virtual.
A symbol is <B>virtualB> if it has no real semantics.
In the presence of virtual symbols,
Marpa skips much of the usual popping and pushing of values
on the evaluation stack.
Marpa uses three
special rule properties to track virtual symbols:
vlhs, vrhs and real.
vlhs means that the left hand side is a virtual symbol.
vrhs means that the right hand side contains
a virtual symbol.
real is an integer, indicating the number of real (that is, nonvirtual)
symbols on the right hand side.
THE LEO LOGIC
Marpa uses the ideas from an algorithm developed
by Joop Leo to deal
with right recursion.
The Leo logic is complex enough to merit its own
example, and so is presented in a separate
document.
DETERMINING WHETHER A PARSE IS SUCCESSFUL
As mentioned, in the strict sense,
Earley’s algorithm is just a recognizer.
The input is recognized successfully if
there is
an Earley item for a completed start rule state
in the Earley set at
the end of parsing.
A <B>completed start rule stateB> is a AHFA state containing a completed start rule.
A completed start rule is an LR(0) item for a start rule
with a dot position at the end of the rule.
A <B>start ruleB> is any rule
that has Marpa’s internal start symbol on
its left hand side.
There are at most two start rules in any Marpa grammar:
the <B>empty start ruleB>
and the <B>nonempty start ruleB>.
As a consequence, there are at most two completed start rule states
in any Marpa grammar:
the <B>empty completed start rule stateB>
and the <B>nonempty completed start rule stateB>.
The empty start rule is the rule with Marpa’s internal start symbol
on its left hand side and an empty right hand side.
There is an empty start rule in a Marpa grammar if and only if
that Marpa grammar allows the null parse.
(Saying that a grammar allows the null parse is the same
as saying that a grammar recognizes the zerolength,
or empty, token string.)
In grammars which allow a null parse,
the start state, AHFA state 0,
is the empty completed start rule state.
The nonempty start rule is the rule with Marpa’s internal start symbol
on its left hand side and exactly one symbol on its right hand side —
the grammar’s original start symbol.
AHFA state S2 is the nonempty completed start rule state
for our example parse:
S2: leoc; 16
Expression[] > Expression .
A parse is <B>successful at earleme B>N<B>B> if it
contains an Earley item
for a completed start rule state
with a dot earleme of N.
The parse in our example is successful at earleme location 5,
because
the following Earley item appears
in Earley set 5:
S2@05 [p=S0@00; c=S5@05]
A parse is <B>successfulB> if that parse is
successful at its end of parsing location.
Marpa’s default idea of the end of parsing location
depends on its input model.
(The end of parsing location
can be also be explicitly set by the user.)
The parse in our example
reads five tokens,
uses the default input model,
and uses the default end of parsing.
That puts the end of parsing for our
example parse at earleme location 5.
Since parsing is successful at location 5,
the parse in our example is successful.
APPENDIX: THE EXAMPLE
Below are the code and the trace outputs
for the example used in this
document.
Code for the example
my $grammar = Marpa::Grammar>new(
{ start => Expression,
actions => My_Actions,
default_action => first_arg,
strip => 0,
rules => [
{ lhs => Expression, rhs => [qw/Term/] },
{ lhs => Term, rhs => [qw/Factor/] },
{ lhs => Factor, rhs => [qw/Number/] },
{ lhs => Term, rhs => [qw/Term Add Term/], action => do_add },
{ lhs => Factor,
rhs => [qw/Factor Multiply Factor/],
action => do_multiply
},
],
}
);
$grammar>precompute();
my $recce = Marpa::Recognizer>new( { grammar => $grammar } );
my @tokens = (
[ Number, 42 ],
[ Multiply, q{*} ],
[ Number, 1 ],
[ Add, q{+} ],
[ Number, 7 ],
);
$recce>tokens( \@tokens );
sub My_Actions::do_add {
my ( undef, $t1, undef, $t2 ) = @_;
return $t1 + $t2;
}
sub My_Actions::do_multiply {
my ( undef, $t1, undef, $t2 ) = @_;
return $t1 * $t2;
}
sub My_Actions::first_arg { shift; return shift; }
my $value_ref = $recce>value();
my $value = $value_ref ? ${$value_ref} : No Parse;
CWshow_symbols Output
0: Expression, lhs=[0] rhs=[5] terminal
1: Term, lhs=[1 3] rhs=[0 3] terminal
2: Factor, lhs=[2 4] rhs=[1 4] terminal
3: Number, lhs=[] rhs=[2] terminal
4: Add, lhs=[] rhs=[3] terminal
5: Multiply, lhs=[] rhs=[4] terminal
6: Expression[], lhs=[5] rhs=[]
CWshow_rules Output
0: Expression > Term
1: Term > Factor
2: Factor > Number
3: Term > Term Add Term
4: Factor > Factor Multiply Factor
5: Expression[] > Expression /* vlhs real=1 */
CWshow_AHFA Output
Start States: S0; S1
S0: 15
Expression[] > . Expression
<Expression> => S2; leo(Expression[])
S1: predict; 1,3,5,7,11
Expression > . Term
Term > . Factor
Factor > . Number
Term > . Term Add Term
Factor > . Factor Multiply Factor
<Factor> => S3
<Number> => S4
<Term> => S5
S2: leoc; 16
Expression[] > Expression .
S3: 4,12
Term > Factor .
Factor > Factor . Multiply Factor
<Multiply> => S6; S7
S4: 6
Factor > Number .
S5: 2,8
Expression > Term .
Term > Term . Add Term
<Add> => S8; S9
S6: 13
Factor > Factor Multiply . Factor
<Factor> => S10; leo(Factor)
S7: predict; 5,11
Factor > . Number
Factor > . Factor Multiply Factor
<Factor> => S11
<Number> => S4
S8: 9
Term > Term Add . Term
<Term> => S12; leo(Term)
S9: predict; 3,5,7,11
Term > . Factor
Factor > . Number
Term > . Term Add Term
Factor > . Factor Multiply Factor
<Factor> => S3
<Number> => S4
<Term> => S13
S10: leoc; 14
Factor > Factor Multiply Factor .
S11: 12
Factor > Factor . Multiply Factor
<Multiply> => S6; S7
S12: leoc; 10
Term > Term Add Term .
S13: 8
Term > Term . Add Term
<Add> => S8; S9
CWshow_earley_sets Output
Last Completed: 5; Furthest: 5
Earley Set 0
S0@00
S1@00
Earley Set 1
S4@01 [p=S1@00; s=Number; t=\42]
S3@01 [p=S1@00; c=S4@01]
S5@01 [p=S1@00; c=S3@01]
S2@01 [p=S0@00; c=S5@01]
Earley Set 2
S6@02 [p=S3@01; s=Multiply; t=\*]
S7@22
Earley Set 3
S4@23 [p=S7@22; s=Number; t=\1]
S10@03 [p=S6@02; c=S4@23]
S11@23 [p=S7@22; c=S4@23]
S3@03 [p=S1@00; c=S10@03]
S5@03 [p=S1@00; c=S3@03]
S2@03 [p=S0@00; c=S5@03]
Earley Set 4
S8@04 [p=S5@03; s=Add; t=\+]
S9@44
Earley Set 5
S4@45 [p=S9@44; s=Number; t=\7]
S3@45 [p=S9@44; c=S4@45]
S12@05 [p=S8@04; c=S3@45]
S13@45 [p=S9@44; c=S3@45]
S5@05 [p=S1@00; c=S12@05]
S2@05 [p=S0@00; c=S5@05]
CWtrace_values Output
Pushed value from a12 T@01_Number: Number = \42
Popping 1 values to evaluate a12 T@01_Number, rule: 2: Factor > Number
Calculated and pushed value: 42
Pushed value from a10 R4:1@01T@12_Multiply: Multiply = \*
Pushed value from a9 T@23_Number: Number = \1
Popping 1 values to evaluate a9 T@23_Number, rule: 2: Factor > Number
Calculated and pushed value: 1
Popping 3 values to evaluate a8 R4:2@02F2@23, rule: 4: Factor > Factor Multiply Factor
Calculated and pushed value: 42
Popping 1 values to evaluate a7 F4@03, rule: 1: Term > Factor
Calculated and pushed value: 42
Pushed value from a5 R3:1@03T@34_Add: Add = \+
Pushed value from a4 T@45_Number: Number = \7
Popping 1 values to evaluate a4 T@45_Number, rule: 2: Factor > Number
Calculated and pushed value: 7
Popping 1 values to evaluate a3 F2@45, rule: 1: Term > Factor
Calculated and pushed value: 7
Popping 3 values to evaluate a2 R3:2@04F1@45, rule: 3: Term > Term Add Term
Calculated and pushed value: 49
Popping 1 values to evaluate a1 F3@05, rule: 0: Expression > Term
Calculated and pushed value: 49
New Virtual Rule: a0 F0@05, rule: 5: Expression[] > Expression
Symbol count is 1, now 1 rules
LICENSE AND COPYRIGHT
Copyright 20072010 Jeffrey Kegler, all rights reserved.
Marpa is free software under the Perl license.
For details see the LICENSE file in the Marpa distribution.
perl v5.20.3  MARPA::ADVANCED::IMPLEMENTATION (3)  20160403 
Visit the GSP FreeBSD Man Page Interface. Output converted with manServer 1.07. 