LL(1) Parsing
0. VocabularyPermalink
In this post we use the following abbreviation, symbols, and operators:
means for two sets- Grammar
is a map of where could be:- a single terminal
- a single variable
- an RHS sequence as in any production
- a single terminal
is a map of where- PPT stands for Predictive Parsing Table
1. If you know your grammar is Permalink
Our logic should follow this state diagram on writing an
Let’s start with the easy case: suppose you already know your grammar is
If your grammar is simple or straightforward, you can write the parser directly by looking at the productions. E.g.:
S : 'a' X ;
| 'b' Y ;
The parser is like:
def S():
if lookahead() == 'a':
match('a')
X()
elif lookahead() == 'b':
match('b')
Y()
else:
raise error()
If your grammar is not so straightforward, like:
S : AX ;
| BY ;
A : CD ;
B : EF ;
...
then you need PPT since you cannot infer when to match A
or when to match B
simply from the lookahead and productions.
def S():
if lookahead() == ???:
A()
X()
elif lookahead() == ???:
B()
Y()
else:
raise error()
2. If you don’t know if your grammar is Permalink
But what if you’re not sure whether your grammar is
The verification can also be done with a PPT.
3. How to compute PPTPermalink
In 2 out of the 3 cases we listed above, we need to compute PPT. It needs do some groundwork, specifically, the
3.1 SetPermalink
3.1.1 DefinitionsPermalink
Definition: Given CFG
Definition: If we only consider single symbols
Nullable property and
Some textbooks define a property called nullable as:
while some other define a
Some textbooks also prefer excluding
If
3.1.2 Pilot Algorithm: computing set for a single symbol, when other sets are knownPermalink
We can use the following algorithm:
where:
for-else
works like in Python- If
, then the innerfor
loop breaks, and - You can also initialize
to contain all so Line (1) and (2) can be removed
Explanation on the block starting from Line (3):
- if
, then , and then . Therefore , and consequently - if
is nullable, then . Logic falls back to above. - if
is not nullable, then we don’t have to consider since can never be the leftmost symbol for , w.r.t. this particular production.
If you have a grammar without any
However in real world,
3.1.3 Algorithm: computing sets for all symbolsPermalink
Now we want to compute all
3.1.3.1 A Flawed IntuitionPermalink
An intuition is to call compute_first_set_for_symbol
compute_first_set_for_symbol
This intuition has a big flaw: it cannot handle circular dependencies. Note that:
- Circular dependencies means endless recurisive calls
- We usually elimiates Left Recursion beforehand but it is only a special form of circular dependencies
Some circular dependencies are direct and obvious, like:
A -> B a
B -> A b | c
Some are involved with “skip-ahead” and less obvious, like:
A -> B C a
B -> ε | d
C -> A e | f
where compute_first_set_for_symbol
3.1.3.2 Fixed-Point ApproachPermalink
How it works:
- Go through every production, compute every
where is the LHS of the production - Repeat until
is not changed anymore
This approach cannot eliminate circular dependencies, but terimiates in time when the fixed-point is reached.
3.1.4 Algorithm: computing set for any RHS sequencePermalink
To construct PPT, we need
3.2 SetPermalink
Definition: Given CFG
I.e.
We can follow the fixed-point strategy of Section 3.1.3.2 and implement a single function to compute all
We need pre-computed collection
Note that always
The algorithm is shown below:
Explanation:
- Line (1): if
, then string appears in some sentential form. Apply the production , and string appears in the corresponding sentential form. Therefore - Line (2)(3)(4):
- If
, logic falls back to Line (1) - If
, then , which means . Plus should be reachable from , therefore there should be a sentential form , and in the end
- If
Note the directions of
sets: find production like , update sets: find production like , update
3.3 Construct PPT from and SetsPermalink
PPT
If you have a grammar without any
- All the three algorithms on
sets are way simpler- See Section 3.1.2 as an example
- No need to compute
sets at all! - Line (1)(2)(3) in above algorithm can be removed
Howeve as we mentioned in Section 3.1.2, in real world,
3.4 Prove your grammar is Permalink
Corollary:
To write parsers for ambiguous grammars, sometimes we impose prioritization of certain derivation. This is like prioritizing certain production involved in a PPT conflict.
3.5 ExamplePermalink
See LL(1) Parser Visualization, by Zak Kincaid and Shaowei Zhu.
S -> E
E -> T E' // Expression
E' -> + T E' | ε
T -> F T' // Term
T' -> * F T' | ε
F -> ( E ) | id // Factor
Variable | ||
---|---|---|
Variable / Terminal | ||||||
---|---|---|---|---|---|---|
def S():
E()
def E():
T()
E_prime()
def T():
F()
T_prime()
def T_prime():
if lookahead() in "$+)": # T' -> ε
return
elif lookahead() == "*":
consume('*')
F()
T_prime()
else:
raise ValueError()
def F():
if lookahead() == "id":
consume("id")
return
elif lookahead() == "(":
consume("(")
E()
consume(")")
else:
raise ValueError()
def E_prime():
if lookahead() in "$)": # E' -> ε
return
elif lookahead() == "+":
consume('+')
T()
E_prime()
else:
raise ValueError()
4. Write your parserPermalink
Just follow the script skeleton in Section 1.
Other top-down parser implementation techniques can be found in Top-Down Parsers: Recursive Descent, Predictive, and More.
留下评论