12 分钟阅读

Reference:

1. Intuition & ExamplePermalink

LR(1) 有一个很明显的问题就是 item/state space 的爆炸式增长;LALR(1) 的想法是 “to merge LR(1) states with the same core by combining their lookaheads, thus lowering the total count of states”.

这里需要正式定义一下 core:

Definition: The core of an LR(1) state is its LR(0) items (i.e. the item set with their lookaheads dropped).

注意与 kernel item 区分。

所以如果两个 LR(1) states 的 core 相同,LALR(1) 认为它们就可以合并,合并的方式是把相同 item 的 lookaheads 取 union.

举个例子:

S' -> S
S -> XX
X -> aX
X -> b

“State#1” in red font represents the accept state. States with the same non-white backgraound color are mergeable by LALR(1).

LALR(1) 认为 State#3 + State#6State#4 + State#7State#8 + State#9 可以合并,于是得到:

---
config:
  layout: elk
  look: handDrawn
---
flowchart LR
    subgraph I0["State#0"]
        direction TB
        i00["$$S' \to \cdot S, \Finv$$"] ~~~
        i01["$$S \to \cdot XX, \Finv$$"] ~~~
        i02["$$X \to \cdot aX, \lbrace a, b \rbrace$$"] ~~~
        i03["$$X \to \cdot b, \lbrace a, b \rbrace$$"]
    end

    subgraph I1["State#1"]
        direction TB
        i10["$$S' \to S \cdot, \Finv$$"]
    end

    subgraph I2["State#2"]
        direction TB
        i20["$$S \to X \cdot X, \Finv$$"] ~~~
        i21["$$X \to \cdot aX, \Finv$$"] ~~~
        i22["$$X \to \cdot b, \Finv$$"]
    end

    subgraph I36["State#36"]
        direction TB
        i30["$$X \to a \cdot X, \lbrace a, b, \Finv\rbrace$$"] ~~~
        i31["$$X \to \cdot aX, \lbrace a, b, \Finv\rbrace$$"] ~~~
        i32["$$X \to \cdot b, \lbrace a, b, \Finv\rbrace$$"]
    end

    subgraph I47["State#47"]
        direction TB
        i40["$$X \to b \cdot, \lbrace a, b, \Finv\rbrace$$"]
    end

    subgraph I5["State#5"]
        direction TB
        i50["$$S \to XX \cdot, \Finv$$"]
    end

    subgraph I89["State#89"]
        direction TB
        i80["$$X \to aX \cdot, \lbrace a, b, \Finv\rbrace$$"]
    end

    I0 -->|$$a$$| I36
    I0 -->|$$b$$| I47
    I0 -->|$$S$$| I1
    I0 -->|$$X$$| I2
    
    I2 -->|$$a$$| I36
    I2 -->|$$b$$| I47
    I2 -->|$$X$$| I5

    I36 -->|$$a$$| I36
    I36 -->|$$b$$| I47
    I36 -->|$$X$$| I89

    style I1 color:#f66

    style I36 fill:#E8DAEF

    style I47 fill:#D4E6F1

    style I89 fill:#D4EFDF

合并后是有可能直接变成 SLR(1) 的。

2. Conflicts Emerged after Merging / Partially LALR(1)?Permalink

合并 state 必然会增加 conflict 的 possibility:

  • conflicting 的两个 items 如果是分开的两个 states 中的,就没事
  • 一合并就必然出事

Theorem: When a conflict arises after state merging, we say the grammar is not LALR(1).

按照 LR Parsing #7: LR(0) vs SLR(1) vs LR(1) - 4. The Expressive Power Perspective 的思路,有:

Grammar Possibility of Conflicts so a conflict-free grammar must be … structually Expressive Power
LR(0) 🎲🎲🎲🎲 High 🚫🚫🚫🚫 most restrictive 👑 Low
SLR(1) 🎲🎲🎲 🚫🚫🚫 👑👑
LALR(1) 🎲🎲 🚫🚫 👑👑👑
LR(1) 🎲 Low 🚫 least restrictive 👑👑👑👑 High

为了避免出现 conflict,我有一个 rough 的 idea:只合并那些不会有 conflict 的 states;有 conflict 的 states 我们不合并就好了。这么一来就得到了一种 half-LR(1)-half-LALR(1) 的形态。

学界的工作做得比我这个 rough idea 更细致,比如 IELR(1),这里就不展开了。

3. Parsing Table ConstructionPermalink

Same with LR(1)’s.

4. State-Merging AlgorithmsPermalink

4.0 老算法改造Permalink

之前的算法 有这么几个问题:

  1. GOTO(1) 职责上稍微有点不清晰:
    • 它要负责创建新的 closure
    • 同时它又是 TGOTO 填表的依据
  2. repeat until C is not changed 没有一个具体的、高效的实现手段

我们可以这么改造一下:

// compute the canonical collection of item sets for grammar Gprocedure CC(1)(G) -> Set[Set[Item]]:i=[SS,]// initial kernel item from the dummy production I=CLOSURE(1)({i})// initial item setC={I}// initial canonical collectionUQ=Queue(). append(I)// unvisited item setsrepeat until UQ is emptyI=UQ. popleft()// visiting the first item set in the queuefor each symbol XJ=GOTO(1)(I,X)if J=continueif JC// J is not empty, and new to C// should be marked as unvisitedC. add(J)UQ. append(J)TGOTO[I,X]=Jreturn C

TACTION 填表放到 CC(1) 结束之后再做。

4.1 Brute-ForcePermalink

即在 CC(1) 结束之后再去 scan C 寻找可以合并的 item sets,而且还要去改动 TGOTO,程序写起来很麻烦。

但是如果是做练习题,这个方法还行,而且 TGOTO 有个规律可以用:

  1. 假设我们构建好了 LR(1)C={I0,I1,,In}
  2. For each core C, find all item sets having that core, and compute their union, say Jx=κx where κx={IiCcore(Ii)=the x-th core in C}
  3. 假设 step 2 的结果是 C={J0,J1,,Jm}:
    • TGOTO
      • 如果 Jp=IiIk,那么 X, GOTO(1)(Ii,X),GOTO(1)(Ik,X) 也应该有相同的 core,所以你肯定能找到一个 Jq=GOTO(1)(Ii,X)GOTO(1)(Ik,X)C. 于是我们可以删除 TGOTO(Ii,X),TGOTO(Ik,X),然后添加 TGOTO(Jp,X)=Jq
      • TGOTO(?,?)=Ii or  or Ik 都要改成 TGOTO(?,?)=Jp
    • TACTIONLR(1) 的方法一致

4.2 A simple algorithm: step-by-step mergingPermalink

First described by Anderson et al. in 1973.

// compute the canonical collection of item sets for grammar Gprocedure CC_LALR(1.1)(G) -> Set[Set[Item]]:i=[SS,]// initial kernel item from the dummy production I=CLOSURE(1)({i})// initial item setC={I}// initial canonical collectionUQ=Queue(). append(I)// unvisited item setsrepeat until UQ is emptyI=UQ. popleft()// visiting the first item set in the queuefor each symbol XJ=GOTO(1)(I,X)if J=continueif KC such that core(K)=core(J)// 📌 able to mergeK=K. deep\_copy()K=K. merge(J)if K=K// no change after mergingcontinue// discard J and KC. remove(K)C. add(K)// since it has chnged, it may lead to new and different statesUQ. append(K)TGOTO[I,X]=Kreturn C

Notes on 📌:

  • 你只可能找出 only one such K,如果有多个的话,它们理应已经被 merge 了
  • 这里你可能需要一个 map(core, item_set),查找比较方便

Cons:

  • still generates almost all LR(1) states (现实中 LR(1) 大几千个 states 被合并成 LALR(1) 几百个 states 的情况是很常见的,> 10:1 ratio)
  • K 需要 revisit 的频率很高,还是有很多重复的计算

4.3 The Channel Algorithm (used by yacc)Permalink

Described in YACC: Yet Another Compiler-Compiler by Stephen C. Johnson; detailed by Aho, Sethi and Ullman in the Dragon Book.

4.3.0 Intuition: A channel is a passage in the the LR(0) NFA that carries over lookaheads (among items)Permalink

With this example grammar:

// A non-LR(0) grammar for differences of numbers
S -> E
E -> E - T
E -> T
T -> n
T -> (E)

the book Parsing Techniques constructed a NFA for it:

and lookaheads are carried over by two types of channels within the NFA:

  • is like a placeholder for lookaheads
  • dotted lines represent “propagated” channel
  • dashed lines represent “spontaneous” channel

But in reality you don’t need to run the Channel Algorithm on NFAs like this. Actually it’s easier to work on DFAs.

The skeleton of the Channel Algorithm is like:

  1. Compute all LR(0) kernel items
  2. Compute the lookaheads (by channels) for those kernel items, making them LALR(1) kernel items
  3. Expand those LALR(1) kernel items into LALR(1) item sets by CLOSURE(1)

4.3.1 Compute LR(0) Kernel ItemsPermalink

Definition:

  • Kernel items: the initial item [SS], plus all items whose dots are not at the left end
  • Non-kernel items: all other items with their dots at the left end, except for [SS]

An item set may have k>1 kernel items.

You can use the CC procedure of LR Parsing #2: Structural Encoding of LR(0) Parsing DFA to compute all LR(0) items and then remove the non-kernel ones. Otherwise you can modify the procedure so that every kernel item is marked whenever it’s created.

4.3.2 Lookahead Determining AlgorithmPermalink

首先我们要定义两种不同的 channel,或者说两种不同的 lookahead-attachment (to kernels) 的形式。

强烈不建议参考 Dragon Book 的 Example 4.61 下的两个 bullet points,那根本就不是 formal definition,是针对 Example 4.61 的特殊情况的讨论。而且也不要试图去 interpret,因为你很难确定它讲的 “regardless of a”、”only because” 这些词是什么意思。请直接跳过去 Algorithm 4.62

参考 Dragon Book 的 Algorithm 4.62. Let:

  • be a symbol Σ.
  • I be an LR(0) item set
  • ker(I) be the set of kernel items of I
  • X be a symbol
  • GOTO(I,X)=J
  • SL:Set[Tuple(I,ϕi,J,ϕj,b)] is the result for “spontaneously generated lookaheads”, where:
    • ϕiker(I) is a kernel item of I
    • ϕjker(J) is a kernel item of J
    • b is a lookahead symbol
    • one such entry means: lookahead b is spontaneously generated by I (or more specially ϕi) for ϕj
  • PL:Set[Tuple(I,ϕi,J,ϕj)] is the result for “propagated lookaheads” where:
    • one such entry means: lookaheads propagate from ϕi to ϕj

Textbooks use # instead of (LaTeX \lozenge). I don’t like escaping it all the time in Markdown so I prefer . This symbol is often called dummy lookahead or universal lookahead. I also would like to call it placebo lookahead.

// Algorithm 4.62: determine lookahead channels of I on input Xprocedure LAChan(I,X):J=GOTO(1)(I,X)PL=Set()SL=Set()for each kernel item ϕi=[Aαβ]ker(I)patch ϕi to a LALR(1) item ϕi=[Aαβ,]let Iϕi=CLOSURE(1)({ϕi})if ψi=[BγXδ,]Iϕi// then certainly ϕj=[BγXδ,]Jlet ϕj=[BγXδ,]PL .add(Tuple(I,ϕi,J,ϕj))if ψi=[BγXδ,b]Iϕi and b// then certainly ϕj=[BγXδ,b]Jlet ϕj=[BγXδ,b]SL .add(Tuple(I,ϕi,J,ϕj,b))return PL,SL

You don’t need to keep X in the results since we know IXJ.

Why the dummy lookahead works? Can I replace it with some aΣ? 这里就涉及到了 “propagated” vs “spontaneous” 的核心问题:

如果我们是 ϕi=[Aαβ,a] 然后找到了 ϕj=[BγXδ,b]J,那么 ab“propagated” 还是 “spontaneous” 要看 “a 是否决定b 的值”。

举个例子:

  • 比如说按 CLOSURE(1),我们可能会有 bFIRST(βa),然后假设这个 b 一路传到了 ϕj unchanged
  • 此时我们能说 “a 决定b 的值” 吗?不能,因为还要继续拆:
    • 如果 β is not nullable,那么 bFIRST(β),与 a 无关,应该算 “spontaneous”
    • 如果 β is nullable,那么 bFIRST(β){a},与 a 有关;如果最终我们得到了 b=a (in ϕj),那么这应该算 “propagated”

这里就涉及了一个问题:在 β is not nullable 时,bFIRST(β) 也是有可能最终得到 b=a (in ϕj) 的,所以你从 b=?a 这个关系上是无法得出 “propagated or spontaneous?” 的结论的。用 dummy lookahead 来 test 就完美解决了这个问题,因为:

  • aΣ,β(ΣV), 不管你怎么操作都不会得出一个 Σ
    • 换言之 只可能是 “propagated”
    • 注意这里要考虑到 IXJ 实际包含了 3 步:
      1. CLOSURE(1) to get I
      2. shift X like [X][X]
      3. CLOSURE(1) to get J
  • on the other hand,如果你拿到一个 lookahead b,那它的值肯定不是被 决定的,所以一定是 “spontaneous”

Special Case: is “spontaneous” for [SS]

4.3 Construct Channel GraphPermalink

假设我们有:

---
config:
  layout: elk
  look: handDrawn
---
flowchart LR
    subgraph I["State $$\;I$$"]
        direction TB
        i0["$$\phi_i = [A \to \alpha \cdot \beta, \ell_i]$$"]
    end

    subgraph J["State $$\;J$$"]
        direction TB
        j0["$$\phi_j = [B \to \gamma X \cdot \delta, \ell_j = ?]$$"]
    end

    I -->|$$X$$| J

已知 i,求 j。我们可以初始化 j=:

  • (I,ϕi,J,ϕj)PL, 则 ji
  • (I,ϕi,J,ϕj,b)SL, 则 j{b}

所以我们所有的 kernel items 构成一个 graph:

  • vertex 形如 (ϕi,i)
  • edge 即是 PL
  • SL 可以理解成:
    • a visitor that generates lookahead b when it reaches vertex (ϕj,j)
    • a initializer of all values
      • 因为这个 graph 本质就是我们的 LALR(1) DFA,所有的 edges 都已知,所以我们一开始就可以把 “spontaneous” lookaheads 都 assign 给对应的 kernel items
  • ϕ0=[SS,] 构成 source vertex

所以剩下的工作就是根据 PL edges 把所有的 都填满就可以了,你用 DFS 或者 BFS 都行,但要注意这仍然是一个 fixed-point 问题,需要多次 DFS 或者 BFS 直到 没有更新为止。因为:

  • 你可能先处理了 (ϕi,i)(ϕj,j),有 ji
  • 但后续可能又有 (ϕk,k)(ϕi,i),有 ik,那这个新来的 k 的内容你也要更新给 j

Dragon Book 是用这么一个 table 来记录这个 fixed-point 的计算过程的:

其中 INIT 就是 SL,后续的 PASS 就根据 PL 来做。

4.4 The Relations Algorithm (Omitted)Permalink

Designed by DeRemer and Pennello (Efficient computation of LALR(1) lookahead sets).

See Section 9.7.1.3, Parsing Techniques.

4.5 LALR-by-SLR Technique (Omitted)Permalink

See Section 9.7.1.4, Parsing Techniques.

分类:

更新时间:

留下评论