13 分钟阅读

0. VocabularyPermalink

In this post we use the following abbreviation, symbols, and operators:

  • AB means A:=AB for two sets A,B
  • Grammar G=(V,Σ,S,P)
  • Φ is a map of αFIRST(α) where α could be:
    • a single terminal aΣ
    • a single variable XV
    • an RHS sequence as in any production (Xα)P
  • Λ is a map of XFOLLOW(X) where XV
  • PPT stands for Predictive Parsing Table

1. If you know your grammar is LL(1)Permalink

Our logic should follow this state diagram on writing an LL(1) parser:

Let’s start with the easy case: suppose you already know your grammar is LL(1).

If your grammar is simple or straightforward, you can write the parser directly by looking at the productions. E.g.:

S : 'a' X ;
  | 'b' Y ;

The parser is like:

def S():
    if lookahead() == 'a':
        match('a')
        X()
    elif lookahead() == 'b':
        match('b')
        Y()
    else:
        raise error()

If your grammar is not so straightforward, like:

S : AX ;
  | BY ;
A : CD ;
B : EF ;
...

then you need PPT since you cannot infer when to match A or when to match B simply from the lookahead and productions.

def S():
    if lookahead() == ???:
        A()
        X()
    elif lookahead() == ???:
        B()
        Y()
    else:
        raise error()

2. If you don’t know if your grammar is LL(1)Permalink

But what if you’re not sure whether your grammar is LL(1)? Before writing a parser, you’ll need to verify that your grammar satisfies the LL(1) condition.

The verification can also be done with a PPT.

3. How to compute PPTPermalink

In 2 out of the 3 cases we listed above, we need to compute PPT. It needs do some groundwork, specifically, the FIRST and FOLLOW sets.

  • FIRST(α(VΣ))={xΣx is the first terminal derivable from α. I.e. αx}
  • FOLLOW(AV)={xΣx is the first terminal that follows A in a sentential form. I.e. SAx}

3.1 FIRST SetPermalink

3.1.1 DefinitionsPermalink

Definition: Given CFG G=(V,Σ,P,S) and a sequence of symbols α(VΣ), the FIRST set of α is the set of terminals that begin the sentential forms derivable from α. Plus, if α can derive ε, then εFIRST(α).

Definition: If we only consider single symbols XVΣ, we can define:

FIRST(X)={{X}if XΣ{x(XxβxΣβ(VΣ))}if XV and Xε{x(XxβxΣβ(VΣ))}{ε}if XV and Xε

FIRST sets on α(VΣ) is similar to the cases of XV.

Nullable property and Derives-ε function

Some textbooks define a property called nullable as:

αεα is nullable

while some other define a Derives-ε function for the same purpose:

αεDerives-ε(α)=True

Some textbooks also prefer excluding ε from the definition of FIRST(α).

If α=Y1Y2Yk, these statements are equivalent:

  1. Y1Y2Ykε
  2. Y1Y2Yk is nullable
  3. Derives-ε(Y1Y2Yk)=True
  4. εL(Y1Y2Yk)
  5. εFIRST(Y1Y2Yk)
  6. εFIRST(Y1)FIRST(Y2)FIRST(Yk)=i=0kFIRST(Yi)

3.1.2 Pilot Algorithm: computing FIRST set for a single symbol, when other FIRST sets are knownPermalink

We can use the following algorithm:

global Φ= a map of known FIRST sets except X'sprocedure compute_first_set_for_symbol(XVΣ)(1)if XΣ(2)return Φ[X]={X}// if XVΦ[X]={ε} if (Xε)P else {}(3)for each (XY1Y2Yk)Pfor i=1 to kΦ[X](Φ[Yi]{ε})if εΦ[Yi]break else: /* This else-clause executes only if the loop terminates normally without breaking, meaning every Yi is nullable. */ Φ[X]{ε}return Φ[X]

where:

  • for-else works like in Python
  • If YiΣ, then the inner for loop breaks, and εΦ[X]
  • You can also initialize Φ to contain all Φ[a]={a},aΣ so Line (1) and (2) can be removed

Explanation on the block starting from Line (3):

  • if bΦ[Y1], then Y1b, and then XbY2Yk. Therefore bΦ[X], and consequently Φ[Y1]Φ[X]
  • if Y1 is nullable, then XY2Yk. Logic falls back to above.
  • if Y1 is not nullable, then we don’t have to consider Φ[Y2] since Y2 can never be the leftmost symbol for X, w.r.t. this particular production.

If you have a grammar without any ε-production, this algorithm becomes way simpler:

global Φ= a map of known FIRST sets except X'sprocedure compute_first_set_for_symbol(XVΣ)if XΣreturn {X}// if XVΦ[X]={}for each (XY1Y2Yk)PΦ[X]Φ[Y1]return Φ[X]

However in real world, ε-production are not always elimimated.

3.1.3 Algorithm: computing FIRST sets for all symbolsPermalink

Now we want to compute all FIRST sets at the grammar level with a single function.

3.1.3.1 A Flawed IntuitionPermalink

An intuition is to call compute_first_set_for_symbol(S,Φ={}) with modification: if Φ[Yi] is unknow, recursively call compute_first_set_for_symbol(Yi,Φ).

This intuition has a big flaw: it cannot handle circular dependencies. Note that:

  1. Circular dependencies means endless recurisive calls
  2. We usually elimiates Left Recursion beforehand but it is only a special form of circular dependencies

Some circular dependencies are direct and obvious, like:

A -> B a
B -> A b | c

Some are involved with “skip-ahead” and less obvious, like:

A -> B C a
B -> ε | d
C -> A e | f

where Φ[A] depends on Φ[B] and Φ[C] (because B is nullable), and Φ[C] depends on Φ[A]. Therefore compute_first_set_for_symbol(A,) will end up calling itself.

3.1.3.2 Fixed-Point ApproachPermalink

How it works:

  1. Go through every production, compute every Φ(X) where X is the LHS of the production
  2. Repeat until Φ is not changed anymore
global Φ={x{x}xΣ}Φ{XηXV,η={ε} if (Xε)P else {}}procedure compute_first_sets_for_V()repeat until Φ is not changedfor each (XY1Y2Yk)Pfor i=1 to kΦ[X](Φ[Yi]{ε})if εΦ[Yi]break else: /* This else-clause executes only if the loop terminates normally without breaking, meaning every Yi is nullable. */ Φ[X]{ε}return Φ

This approach cannot eliminate circular dependencies, but terimiates in time when the fixed-point is reached.

3.1.4 Algorithm: computing FIRST set for any RHS sequencePermalink

To construct PPT, we need Φ[α] for each RHS α appearing in a production. This algorithm is similar to Section 3.1.2:

global Φ=compute_first_sets_for_V()procedure compute_first_set_for_RHS(α=Y1Y2Yk)for i=1 to kΦ[α](Φ[Yi]{ε})if εΦ[Yi]break else: /* This else-clause executes only if the loop terminates normally without breaking, meaning every Yi is nullable. */ Φ[α]{ε}return Φ[α]

3.2 FOLLOW SetPermalink

Definition: Given CFG G=(V,Σ,P,S) and a single variable AV, the FOLLOW set of A is the set of terminals that can appear immediately to the right of A in some sentential form. Plus, if A is the rightmost symbol in some sentential form, then the EOF symbol FOLLOW(A).

I.e. AV

FOLLOW(A)={t(S+αAtβtΣ) or (SαAt=)}.

We can follow the fixed-point strategy of Section 3.1.3.2 and implement a single function to compute all FOLLOW sets at once at the grammar level.

We need pre-computed collection Φ of FIRST sets in the computation, and we initialize the collection Λ of FOLLOW sets as: Λ[S]={} since we always have SS.

Note that always εFOLLOW()

The algorithm is shown below:

global Φ=compute_first_sets_for_V()global Λ={S{}}{X{}XV,XS}procedure compute_follow_sets_for_V()repeat until Λ is not changedfor each (YαX)P(1)Λ[X]Λ[Y]for each (YαXβ)P(2)Λ[X](Φ[β]{ε})(3)if εΦ[β](4)Λ[X]Λ[Y]return Λ

Explanation:

  • Line (1): if bΛ[Y], then string Yy appears in some sentential form. Apply the production YαX, and string αXb appears in the corresponding sentential form. Therefore bΛ[X]
  • Line (2)(3)(4):
    • If εΦ[β], logic falls back to Line (1)
    • If bΦ[β], then βb, which means YαXb. Plus Y should be reachable from S, therefore there should be a sentential form SYαXb, and in the end bΛ[X]

Note the directions of X and Y in the two algorithms:

  • FIRST sets: find production like XY1Yk, update Φ[X]Φ[Yi]
  • FOLLOW sets: find production like YX, update Λ[X]Λ[Y]

3.3 Construct PPT from FIRST and FOLLOW SetsPermalink

PPT T is a table whose row-index is V, column-index is Σ, and each cell is a production P.

global Φ={αcompute_first_set_for_RHS(α)(Xα)P}global Λ=compute_follow_sets_for_V()procedure construct_PPT()init T as an empty |V|×|Σ| tablefor each (Xα)Pfor each terminal pΦ[α]T[X,p]=(Xα)(1)if εΦ[α](2)for each terminal qΛ[X](3)T[X,q]=(Xα)return T

If you have a grammar without any ε-production, the whole thing becomes way simpler:

  1. All the three algorithms on FIRST sets are way simpler
  2. No need to compute FOLLOW sets at all!
  3. Line (1)(2)(3) in above algorithm can be removed

Howeve as we mentioned in Section 3.1.2, in real world, ε-production are not always elimimated.

3.4 Prove your grammar is LL(1)Permalink

Corollary: G is a LL(1) grammar T is conflict-free There is at most 1 production in any cell of T.

To write parsers for ambiguous grammars, sometimes we impose prioritization of certain derivation. This is like prioritizing certain production involved in a PPT conflict.

3.5 ExamplePermalink

See LL(1) Parser Visualization, by Zak Kincaid and Shaowei Zhu.

S -> E
E -> T E'          // Expression
E' -> + T E' | ε
T -> F T'          // Term
T' -> * F T' | ε
F -> ( E ) | id    // Factor
Variable FIRST FOLLOW
S {(,id}  
E {(,id} {),}
E {+,ε} {),}
T {(,id} {+,),}
T {,ε} {+,),}
F {(,id} {+,,),}
Variable / Terminal + ( ) id
S       SE   SE
E       ETE   ETE
E Eε E+TE     Eε  
T       TFT   TFT
T Tε Tε TFT   Tε  
F       F(E)   Fid
def S():
    E()

def E():
    T()
    E_prime()

def T():
    F()
    T_prime()

def T_prime():
    if lookahead() in "$+)":  # T' -> ε
        return
    elif lookahead() == "*":
        consume('*')
        F()
        T_prime()
    else:
        raise ValueError()

def F():
    if lookahead() == "id":
        consume("id")
        return
    elif lookahead() == "(":
        consume("(")
        E()
        consume(")")
    else:
        raise ValueError()

def E_prime():
    if lookahead() in "$)":  # E' -> ε
        return
    elif lookahead() == "+":
        consume('+')
        T()
        E_prime()
    else:
        raise ValueError()
    

4. Write your LL(1) parserPermalink

Just follow the script skeleton in Section 1.

Other top-down parser implementation techniques can be found in Top-Down Parsers: Recursive Descent, Predictive, and More.

分类:

更新时间:

留下评论