LR Parsing #8: LALR(1) as Approximation of LR(1)

July 31, 2025 · 12 min read


1. Intuition & Example

An obvious problem with $LR(1)$ is the explosive growth of its item/state space. The idea of $LALR(1)$ is “to merge $LR(1)$ states with the same core by combining their lookaheads, thus lowering the total count of states”.

The core needs a formal definition here:

Definition: The core of an $L R (1)$ state is its $L R (0)$ items (i.e. the item set with their lookaheads dropped). $◼$

Do not confuse the core with kernel items.

So if two $LR(1)$ states have the same core, $LALR(1)$ considers them mergeable. Merging takes, for each shared item, the union of its lookaheads.
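The definition of the core and the merge rule can be sketched in a few lines of Python. This is only an illustration: the tuple encoding of items, the helper names (`core`, `merge`, `lookaheads`, `make_state`) and the use of `"$"` for the end marker are assumptions of this sketch, not something the post prescribes.

```python
from collections import defaultdict

# Assumed encoding: an LR(1) item is (lhs, rhs, dot, lookahead),
# and a state is a frozenset of such items.

def core(state):
    """Drop the lookaheads: what remains is the set of LR(0) items."""
    return frozenset((lhs, rhs, dot) for lhs, rhs, dot, _ in state)

def merge(s1, s2):
    """Merge two states with the same core.

    With one lookahead per item tuple, plain set union is already the
    per-item union of lookaheads.
    """
    if core(s1) != core(s2):
        raise ValueError("states with different cores cannot be merged")
    return s1 | s2

def lookaheads(state):
    """Group the lookaheads by LR(0) item, for inspection."""
    grouped = defaultdict(set)
    for lhs, rhs, dot, la in state:
        grouped[(lhs, rhs, dot)].add(la)
    return dict(grouped)

def make_state(las):
    """Build the item set {X -> a.X, X -> .aX, X -> .b} with the given lookaheads."""
    items = [("X", ("a", "X"), 1), ("X", ("a", "X"), 0), ("X", ("b",), 0)]
    return frozenset(item + (la,) for item in items for la in las)

state3 = make_state({"a", "b"})   # lookaheads a, b
state6 = make_state({"$"})        # lookahead is the end marker
state36 = merge(state3, state6)
print(lookaheads(state36))
```

Here `state3` and `state6` share a core, so the merge succeeds and every item ends up with the lookaheads a, b and the end marker.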

An example:

S' -> S
S -> XX
X -> aX
X -> b

“State#1” in red font represents the accept state. States with the same non-white background color are mergeable by $LALR(1)$.

$LALR(1)$ considers State#3 + State#6, State#4 + State#7, and State#8 + State#9 mergeable, which gives:

---
config:
  layout: elk
  look: handDrawn
---
flowchart LR
    subgraph I0["State#0"]
        direction TB
        i00["$$S' \to \cdot S, \Finv$$"] ~~~
        i01["$$S \to \cdot XX, \Finv$$"] ~~~
        i02["$$X \to \cdot aX, \lbrace a, b \rbrace$$"] ~~~
        i03["$$X \to \cdot b, \lbrace a, b \rbrace$$"]
    end

    subgraph I1["State#1"]
        direction TB
        i10["$$S' \to S \cdot, \Finv$$"]
    end

    subgraph I2["State#2"]
        direction TB
        i20["$$S \to X \cdot X, \Finv$$"] ~~~
        i21["$$X \to \cdot aX, \Finv$$"] ~~~
        i22["$$X \to \cdot b, \Finv$$"]
    end

    subgraph I36["State#36"]
        direction TB
        i30["$$X \to a \cdot X, \lbrace a, b, \Finv\rbrace$$"] ~~~
        i31["$$X \to \cdot aX, \lbrace a, b, \Finv\rbrace$$"] ~~~
        i32["$$X \to \cdot b, \lbrace a, b, \Finv\rbrace$$"]
    end

    subgraph I47["State#47"]
        direction TB
        i40["$$X \to b \cdot, \lbrace a, b, \Finv\rbrace$$"]
    end

    subgraph I5["State#5"]
        direction TB
        i50["$$S \to XX \cdot, \Finv$$"]
    end

    subgraph I89["State#89"]
        direction TB
        i80["$$X \to aX \cdot, \lbrace a, b, \Finv\rbrace$$"]
    end

    I0 -->|$$a$$| I36
    I0 -->|$$b$$| I47
    I0 -->|$$S$$| I1
    I0 -->|$$X$$| I2
    
    I2 -->|$$a$$| I36
    I2 -->|$$b$$| I47
    I2 -->|$$X$$| I5

    I36 -->|$$a$$| I36
    I36 -->|$$b$$| I47
    I36 -->|$$X$$| I89

    style I1 color:#f66

    style I36 fill:#E8DAEF

    style I47 fill:#D4E6F1

    style I89 fill:#D4EFDF

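After merging, the automaton may turn out to be exactly the $SLR(1)$ one. That is the case here: every reduce item ends up with the FOLLOW set of its left-hand side as its lookaheads.

2. Conflicts Emerging after Merging / Partially $LALR(1)$?

Merging states necessarily increases the possibility of conflicts:

- two conflicting items are harmless as long as they sit in two separate states
- once those states are merged, the conflict surfaces

Only reduce/reduce conflicts can be introduced this way: shift actions depend on the core alone, so a shift/reduce conflict in a merged state would already exist in one of the original $LR(1)$ states.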

Definition: If a conflict arises after state merging, the grammar is not $LALR(1)$. $◼$

Following the reasoning of LR Parsing #7: LR(0) vs SLR(1) vs LR(1) - 4. The Expressive Power Perspective, we have:

| Grammar | Possibility of Conflicts | So a conflict-free grammar must be … structurally | Expressive Power |
|---|---|---|---|
| $LR(0)$ | 🎲🎲🎲🎲 High | 🚫🚫🚫🚫 most restrictive | 👑 Low |
| $SLR(1)$ | 🎲🎲🎲 | 🚫🚫🚫 | 👑👑 |
| $LALR(1)$ | 🎲🎲 | 🚫🚫 | 👑👑👑 |
| $LR(1)$ | 🎲 Low | 🚫 least restrictive | 👑👑👑👑 High |

To avoid conflicts, I have a rough idea: merge only those states whose merging causes no conflict, and leave the others unmerged. This yields a half-$LR(1)$-half-$LALR(1)$ hybrid.

Academic work is far more refined than this rough idea, e.g. $IELR(1)$; I won't go into it here.

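3. Parsing Table Construction

Same as for $LR(1)$.

4. State-Merging Algorithms

4.0 Reworking the Old Algorithm

The previous algorithm has a few problems:

- ${GOTO}^{(1)}$ has somewhat blurred responsibilities:
  - it is responsible for creating new closures
  - at the same time it is the basis for filling in $T_{GOTO}$
- “repeat until $C$ is not changed” has no concrete, efficient implementation

We can rework it as follows: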

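$$
\begin{aligned}
& \text{// compute the canonical collection of item sets for grammar } G^{'} \\
& \textbf{procedure } {CC}^{(1)} (G^{'}) \to \text{Set[Set[Item]]}: \\
& \quad i = [S^{'} \to \cdot S, Ⅎ] \quad \text{// initial kernel item from the dummy production} \\
& \quad I = {CLOSURE}^{(1)} (\lbrace i \rbrace) \quad \text{// initial item set} \\
& \quad C = \lbrace I \rbrace \quad \text{// initial canonical collection} \\
& \quad UQ = \text{Queue}().\text{append}(I) \quad \text{// unvisited item sets} \\
& \quad \textbf{repeat until } UQ \text{ is empty}: \\
& \quad\quad I = UQ.\text{popleft}() \quad \text{// visiting the first item set in the queue} \\
& \quad\quad \textbf{for each } \text{symbol } X: \\
& \quad\quad\quad J = {GOTO}^{(1)} (I, X) \\
& \quad\quad\quad \textbf{if } J = \emptyset: \\
& \quad\quad\quad\quad \textbf{continue} \\
& \quad\quad\quad \textbf{if } J \notin C: \\
& \quad\quad\quad\quad \text{// } J \text{ is not empty, and new to } C \\
& \quad\quad\quad\quad \text{// should be marked as unvisited} \\
& \quad\quad\quad\quad C.\text{add}(J) \\
& \quad\quad\quad\quad UQ.\text{append}(J) \\
& \quad\quad\quad T_{GOTO}[I, X] = J \\
& \quad \textbf{return } C
\end{aligned}
$$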

Filling in $T_{ACTION}$ is deferred until ${CC}^{(1)}$ has finished.
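As a concrete reading of the worklist procedure, here is a minimal Python sketch for the grammar of Section 1. The tuple encoding of items, the function names and the use of `"$"` for the end marker are assumptions of this sketch; `first` additionally assumes that the grammar has no ε-productions, which holds for this grammar.

```python
from collections import deque

GRAMMAR = [("S'", ("S",)), ("S", ("X", "X")), ("X", ("a", "X")), ("X", ("b",))]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}
SYMBOLS = ["S", "X", "a", "b"]
END = "$"  # stands for the end marker

def first(symbol, seen=frozenset()):
    """FIRST of a single symbol, assuming there are no epsilon-productions."""
    if symbol not in NONTERMINALS:
        return {symbol}
    seen = seen | {symbol}
    result = set()
    for lhs, rhs in GRAMMAR:
        if lhs == symbol and rhs[0] not in seen:
            result |= first(rhs[0], seen)
    return result

def closure(items):
    """LR(1) closure of a set of (lhs, rhs, dot, lookahead) items."""
    items, work = set(items), list(items)
    while work:
        lhs, rhs, dot, la = work.pop()
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            beta = rhs[dot + 1:]
            # Without nullable symbols, FIRST(beta la) is FIRST(beta[0]) or {la}.
            las = first(beta[0]) if beta else {la}
            for l2, r2 in GRAMMAR:
                for b in las:
                    new = (l2, r2, 0, b)
                    if l2 == rhs[dot] and new not in items:
                        items.add(new)
                        work.append(new)
    return frozenset(items)

def goto(state, X):
    moved = {(l, r, d + 1, la) for l, r, d, la in state if d < len(r) and r[d] == X}
    return closure(moved) if moved else frozenset()

def canonical_collection():
    start = closure({("S'", ("S",), 0, END)})
    C, goto_table = {start}, {}
    unvisited = deque([start])          # the queue UQ
    while unvisited:
        I = unvisited.popleft()
        for X in SYMBOLS:
            J = goto(I, X)
            if not J:
                continue
            if J not in C:              # new item set: mark it as unvisited
                C.add(J)
                unvisited.append(J)
            goto_table[(I, X)] = J
    return C, goto_table

C, goto_table = canonical_collection()
print(len(C), len(goto_table))
```

Because each item set enters the queue exactly once, the loop terminates without the vague “repeat until nothing changes” test.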

4.1 Brute-Force

That is, after ${CC}^{(1)}$ has finished, scan $C$ for mergeable item sets and then patch $T_{GOTO}$ accordingly. The program is tedious to write.

For pen-and-paper exercises, though, this method is fine, and $T_{GOTO}$ follows a pattern you can exploit:

1. Suppose we have built the $LR(1)$ collection $C = \lbrace I_{0}, I_{1}, \dots, I_{n} \rbrace$
2. For each distinct core among the item sets of $C$, find all item sets having that core and compute their union, say $J_{x} = \bigcup κ_{x}$ where $κ_{x} = \lbrace I_{i} \in C \mid core(I_{i}) = \text{the } x\text{-th core in } C \rbrace$
3. Suppose the result of step 2 is $C^{'} = \lbrace J_{0}, J_{1}, \dots, J_{m} \rbrace$:
   - $T_{GOTO} ⟹$
     - If $J_{p} = I_{i} \cup \dots \cup I_{k}$, then $\forall X$, ${GOTO}^{(1)}(I_{i}, X), \dots, {GOTO}^{(1)}(I_{k}, X)$ must also share the same core, so you can always find a $J_{q} = {GOTO}^{(1)}(I_{i}, X) \cup \dots \cup {GOTO}^{(1)}(I_{k}, X) \in C^{'}$. We can therefore delete $T_{GOTO}(I_{i}, X), \dots, T_{GOTO}(I_{k}, X)$ and add $T_{GOTO}(J_{p}, X) = J_{q}$
       - Section 1 is a good example
     - Every $T_{GOTO}(?, ?) = I_{i} \text{ or } \dots \text{ or } I_{k}$ must be changed to $T_{GOTO}(?, ?) = J_{p}$
   - $T_{ACTION} ⟹$ same method as for $LR(1)$

4.2 A simple algorithm: step-by-step merging

First described by Anderson et al. in 1973.

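$$
\begin{aligned}
& \text{// compute the canonical collection of item sets for grammar } G^{'} \\
& \textbf{procedure } {CC\_LALR}^{(1.1)} (G^{'}) \to \text{Set[Set[Item]]}: \\
& \quad i = [S^{'} \to \cdot S, Ⅎ] \quad \text{// initial kernel item from the dummy production} \\
& \quad I = {CLOSURE}^{(1)} (\lbrace i \rbrace) \quad \text{// initial item set} \\
& \quad C = \lbrace I \rbrace \quad \text{// initial canonical collection} \\
& \quad UQ = \text{Queue}().\text{append}(I) \quad \text{// unvisited item sets} \\
& \quad \textbf{repeat until } UQ \text{ is empty}: \\
& \quad\quad I = UQ.\text{popleft}() \quad \text{// visiting the first item set in the queue} \\
& \quad\quad \textbf{for each } \text{symbol } X: \\
& \quad\quad\quad J = {GOTO}^{(1)} (I, X) \\
& \quad\quad\quad \textbf{if } J = \emptyset: \\
& \quad\quad\quad\quad \textbf{continue} \\
& \quad\quad\quad \textbf{if } \exists K \in C \text{ such that } core(K) = core(J): \quad \text{// 📌 able to merge} \\
& \quad\quad\quad\quad K^{'} = K.\text{deep\_copy}().\text{merge}(J) \\
& \quad\quad\quad\quad \textbf{if } K^{'} \neq K: \quad \text{// merging added lookaheads} \\
& \quad\quad\quad\quad\quad \text{replace } K \text{ with } K^{'} \text{ in } C \text{ and in every } T_{GOTO} \text{ entry} \\
& \quad\quad\quad\quad\quad \text{// since it has changed, it may lead to new and different states} \\
& \quad\quad\quad\quad\quad UQ.\text{append}(K^{'}) \\
& \quad\quad\quad\quad T_{GOTO}[I, X] = K^{'} \quad \text{// equals } K \text{ when nothing changed} \\
& \quad\quad\quad \textbf{else}: \quad \text{// } J \text{ has a brand-new core} \\
& \quad\quad\quad\quad C.\text{add}(J) \\
& \quad\quad\quad\quad UQ.\text{append}(J) \\
& \quad\quad\quad\quad T_{GOTO}[I, X] = J \\
& \quad \textbf{return } C
\end{aligned}
$$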

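Notes on 📌:

- There can be only one such $K$; if there were several, they would already have been merged
- A map(core, item_set) is handy here, since it makes the lookup easy

Cons:

- still generates almost all $LR(1)$ states (in practice it is common for several thousand $LR(1)$ states to be merged into a few hundred $LALR(1)$ states, a ratio of more than 10:1)
- $K^{'}$ needs to be revisited very often, so much of the computation is still repeated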
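A minimal Python sketch of step-by-step merging for the Section 1 grammar follows. It is one possible reading, not the original formulation: states are stored in the map(core, item_set) mentioned in the notes, and the GOTO table and the queue hold cores instead of item sets, so no entry goes stale when a state gains lookaheads. All names, the tuple encoding and `"$"` for the end marker are assumptions; `first` assumes a grammar without ε-productions.

```python
from collections import deque

GRAMMAR = [("S'", ("S",)), ("S", ("X", "X")), ("X", ("a", "X")), ("X", ("b",))]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}
SYMBOLS = ["S", "X", "a", "b"]
END = "$"  # stands for the end marker

def first(symbol, seen=frozenset()):
    """FIRST of a single symbol, assuming there are no epsilon-productions."""
    if symbol not in NONTERMINALS:
        return {symbol}
    seen = seen | {symbol}
    result = set()
    for lhs, rhs in GRAMMAR:
        if lhs == symbol and rhs[0] not in seen:
            result |= first(rhs[0], seen)
    return result

def closure(items):
    items, work = set(items), list(items)
    while work:
        lhs, rhs, dot, la = work.pop()
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            beta = rhs[dot + 1:]
            las = first(beta[0]) if beta else {la}
            for l2, r2 in GRAMMAR:
                for b in las:
                    new = (l2, r2, 0, b)
                    if l2 == rhs[dot] and new not in items:
                        items.add(new)
                        work.append(new)
    return frozenset(items)

def goto(state, X):
    moved = {(l, r, d + 1, la) for l, r, d, la in state if d < len(r) and r[d] == X}
    return closure(moved) if moved else frozenset()

def core(state):
    return frozenset((lhs, rhs, dot) for lhs, rhs, dot, _ in state)

def lalr_collection():
    start = closure({("S'", ("S",), 0, END)})
    states = {core(start): start}        # map(core, item_set)
    goto_table = {}                      # (core, symbol) -> core
    unvisited = deque([core(start)])
    while unvisited:
        key = unvisited.popleft()
        I = states[key]                  # latest, possibly merged, version
        for X in SYMBOLS:
            J = goto(I, X)
            if not J:
                continue
            k = core(J)
            K = states.get(k)
            if K is None:                # brand-new core
                states[k] = J
                unvisited.append(k)
            elif not J <= K:             # merging adds lookaheads: revisit
                states[k] = K | J
                unvisited.append(k)
            goto_table[(key, X)] = k
    return states, goto_table

states, goto_table = lalr_collection()
print(len(states), len(goto_table))
```

Keying everything by core is a design choice of this sketch: it replaces the “remove $K$, add $K^{'}$” bookkeeping with a single dictionary update.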

4.3 The Channel Algorithm (used by yacc)

Described in YACC: Yet Another Compiler-Compiler by Stephen C. Johnson; detailed by Aho, Sethi and Ullman in the Dragon Book.

4.3.0 Intuition: A channel is a passage in the LR(0) NFA that carries over lookaheads (among items)

With this example grammar:

// A non-LR(0) grammar for differences of numbers
S -> E
E -> E - T
E -> T
T -> n
T -> (E)

the book Parsing Techniques constructs an NFA for it:

and lookaheads are carried over by two types of channels within the NFA:

- $◻$ is like a placeholder for lookaheads
- dotted lines represent “propagated” channels
- dashed lines represent “spontaneous” channels

But in reality you don’t need to run the Channel Algorithm on NFAs like this. It is actually easier to work on DFAs.

The skeleton of the Channel Algorithm is:

1. Compute all $LR(0)$ kernel items
2. Compute the lookaheads (by channels) for those kernel items, making them $LALR(1)$ kernel items
3. Expand those $LALR(1)$ kernel items into $LALR(1)$ item sets by ${CLOSURE}^{(1)}$

4.3.1 Compute $LR(0)$ Kernel Items

Definition:

Kernel items: the initial item $[S^{'} \to \cdot S]$ , plus all items whose dots are not at the left end
Non-kernel items: all other items with their dots at the left end, except for $[S^{'} \to \cdot S]$

$◼$

An item set may have $k > 1$ kernel items.

You can use the $CC$ procedure of LR Parsing #2: Structural Encoding of LR(0) Parsing DFA to compute all $L R (0)$ items and then remove the non-kernel ones. Otherwise you can modify the procedure so that every kernel item is marked whenever it’s created.
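Filtering out the non-kernel items takes a single predicate. The sketch below assumes LR(0) items are encoded as `(lhs, rhs, dot)` tuples and that the augmented start symbol is named `S'` with a single production; the names `is_kernel` and `kernel` are hypothetical.

```python
START = "S'"  # assumed name of the augmented start symbol

def is_kernel(item):
    """Kernel items: the initial item, plus items whose dot is not at the left end."""
    lhs, rhs, dot = item
    return dot > 0 or lhs == START

def kernel(item_set):
    return frozenset(item for item in item_set if is_kernel(item))

# The LR(0) item set reached from the initial state on X in the Section 1 grammar.
state2 = {("S", ("X", "X"), 1), ("X", ("a", "X"), 0), ("X", ("b",), 0)}
print(kernel(state2))
```

Only `S -> X.X` survives here; the two items with the dot at the left end can be regenerated by closure whenever they are needed.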

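4.3.2 Lookahead Determining Algorithm

First we need to define the two kinds of channels, or in other words the two ways in which lookaheads get attached to kernel items.

I strongly advise against relying on the two bullet points under Example 4.61 of the Dragon Book: they are not a formal definition at all, but a discussion of the special case in Example 4.61. Don't try to interpret them either, because it is hard to pin down what phrases like “regardless of $a$” and “only because” mean. Skip straight to Algorithm 4.62.

Following Algorithm 4.62 of the Dragon Book, let: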

- $◊$ be a symbol $\notin Σ$
- $I$ be an $LR(0)$ item set
- $\ker(I)$ be the set of kernel items of $I$
- $X$ be a symbol
- $GOTO(I, X) = J$
- $SL : Set[Tuple(I, ϕ_{i}, J, ϕ_{j}, b)]$ be the result for “spontaneously generated lookaheads”, where:
  - $ϕ_{i} \in \ker(I)$ is a kernel item of $I$
  - $ϕ_{j} \in \ker(J)$ is a kernel item of $J$
  - $b$ is a lookahead symbol
  - one such entry means: lookahead $b$ is spontaneously generated by $I$ (or more specifically $ϕ_{i}$) for $ϕ_{j}$
- $PL : Set[Tuple(I, ϕ_{i}, J, ϕ_{j})]$ be the result for “propagated lookaheads”, where:
  - one such entry means: lookaheads propagate from $ϕ_{i}$ to $ϕ_{j}$

Textbooks use $#$ instead of $◊$ (LaTeX \lozenge). I don’t like escaping it all the time in Markdown so I prefer $◊$ . This symbol is often called dummy lookahead or universal lookahead. I also would like to call it placebo lookahead.

$$
\begin{aligned}
& \text{// Algorithm 4.62: determine lookahead channels of } I \text{ on input } X \\
& \textbf{procedure } LAChan(I, X): \\
& \quad J = {GOTO}(I, X) \quad \text{// } I \text{ and } J \text{ are } LR(0) \text{ item sets} \\
& \quad PL = \text{Set}() \\
& \quad SL = \text{Set}() \\
& \quad \textbf{for each } \text{kernel item } ϕ_{i} = [A \to α \cdot β] \in \ker(I): \\
& \quad\quad \text{patch } ϕ_{i} \text{ to an } LALR(1) \text{ item } ϕ_{i}^{'} = [A \to α \cdot β, ◊] \\
& \quad\quad \text{let } I_{ϕ_{i}} = {CLOSURE}^{(1)} (\lbrace ϕ_{i}^{'} \rbrace) \\
& \quad\quad \textbf{for each } ψ_{i} = [B \to γ \cdot X δ, ◊] \in I_{ϕ_{i}}: \\
& \quad\quad\quad \text{// then certainly } \exists ϕ_{j} = [B \to γ X \cdot δ] \in \ker(J) \\
& \quad\quad\quad PL.\text{add}(\text{Tuple}(I, ϕ_{i}, J, ϕ_{j})) \\
& \quad\quad \textbf{for each } ψ_{i} = [B \to γ \cdot X δ, b] \in I_{ϕ_{i}} \text{ with } b \neq ◊: \\
& \quad\quad\quad \text{// then certainly } \exists ϕ_{j} = [B \to γ X \cdot δ] \in \ker(J) \\
& \quad\quad\quad SL.\text{add}(\text{Tuple}(I, ϕ_{i}, J, ϕ_{j}, b)) \\
& \quad \textbf{return } PL, SL
\end{aligned}
$$

You don’t need to keep $X$ in the results since we know $I \overset{X}{\to} J$ .
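The procedure can be sketched in Python for the Section 1 grammar. The names (`la_chan`, `DUMMY`), the tuple encoding of items and the omission of $I$ and $J$ from the result tuples (the caller already knows both) are assumptions of this sketch; `first` assumes a grammar without ε-productions.

```python
GRAMMAR = [("S'", ("S",)), ("S", ("X", "X")), ("X", ("a", "X")), ("X", ("b",))]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}
DUMMY = "◊"  # dummy lookahead, not a grammar symbol

def first(symbol, seen=frozenset()):
    """FIRST of a single symbol, assuming there are no epsilon-productions."""
    if symbol not in NONTERMINALS:
        return {symbol}
    seen = seen | {symbol}
    result = set()
    for lhs, rhs in GRAMMAR:
        if lhs == symbol and rhs[0] not in seen:
            result |= first(rhs[0], seen)
    return result

def closure(items):
    """LR(1) closure of a set of (lhs, rhs, dot, lookahead) items."""
    items, work = set(items), list(items)
    while work:
        lhs, rhs, dot, la = work.pop()
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            beta = rhs[dot + 1:]
            las = first(beta[0]) if beta else {la}  # DUMMY survives only via {la}
            for l2, r2 in GRAMMAR:
                for b in las:
                    new = (l2, r2, 0, b)
                    if l2 == rhs[dot] and new not in items:
                        items.add(new)
                        work.append(new)
    return frozenset(items)

def la_chan(kernel_items, X):
    """kernel_items: LR(0) kernel items (lhs, rhs, dot) of I.

    Returns PL as a set of (phi_i, phi_j) and SL as a set of (phi_i, phi_j, b).
    """
    PL, SL = set(), set()
    for phi_i in kernel_items:
        for lhs, rhs, dot, la in closure({phi_i + (DUMMY,)}):
            if dot < len(rhs) and rhs[dot] == X:
                phi_j = (lhs, rhs, dot + 1)     # kernel item of J = GOTO(I, X)
                if la == DUMMY:
                    PL.add((phi_i, phi_j))      # lookahead is inherited
                else:
                    SL.add((phi_i, phi_j, la))  # lookahead is generated on the spot
    return PL, SL

k0 = ("S'", ("S",), 0)          # kernel of the initial state
print(la_chan({k0}, "a"))       # only spontaneous lookaheads a and b
print(la_chan({k0}, "X"))       # only a propagation edge
```

On `a`, the initial state generates the lookaheads a and b spontaneously for `X -> a.X`; on `X`, it merely passes its own lookaheads on to `S -> X.X`.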

Why does the dummy lookahead $◊$ work? Can I replace it with some $a \in Σ$? This touches the core of “propagated” vs “spontaneous”:

Suppose we start with $ϕ_{i} = [A \to α \cdot β, a]$ and then find $ϕ_{j} = [B \to γ X \cdot δ, b] \in J$. Whether $a \to b$ is “propagated” or “spontaneous” depends on whether $a$ determines the value of $b$.

For example:

- By ${CLOSURE}^{(1)}$ we may get $b \in FIRST(β a)$; suppose this $b$ is then carried all the way to $ϕ_{j}$ unchanged
- Can we now say that $a$ determines the value of $b$? No, because there are two more cases to distinguish:
  - If $β$ is not nullable, then $b \in FIRST(β)$, which has nothing to do with $a$, so this should count as “spontaneous”
  - If $β$ is nullable, then $b \in FIRST(β) \cup \lbrace a \rbrace$, which does involve $a$; if we end up with $b = a$ (in $ϕ_{j}$), this should count as “propagated”

This raises a problem: even when $β$ is not nullable, $b \in FIRST(β)$ may still end up as $b = a$ (in $ϕ_{j}$), so the relation $b \overset{?}{=} a$ cannot tell “propagated” from “spontaneous”. Testing with the dummy lookahead $◊$ solves this problem neatly, because:

- $\forall a \in Σ, \forall β \in (Σ \cup V)^{*}$, no matter how you proceed, you can never derive $◊$, because $◊ \notin Σ$
  - in other words, $◊$ can only be “propagated”
  - note that $I \overset{X}{\to} J$ actually consists of 3 steps:
    1. ${CLOSURE}^{(1)}$ to get $I$
    2. shift $X$ like $[\cdot X] \to [X \cdot]$
    3. ${CLOSURE}^{(1)}$ to get $J$
- on the other hand, if you get a lookahead $b \neq ◊$, its value was certainly not determined by $◊$, so it must be “spontaneous”

Special Case: $Ⅎ$ is “spontaneous” for $[S^{'} \to \cdot S]$

4.3.3 Construct Channel Graph

Suppose we have:

---
config:
  layout: elk
  look: handDrawn
---
flowchart LR
    subgraph I["State $$\;I$$"]
        direction TB
        i0["$$\phi_i = [A \to \alpha \cdot \beta, \ell_i]$$"]
    end

    subgraph J["State $$\;J$$"]
        direction TB
        j0["$$\phi_j = [B \to \gamma X \cdot \delta, \ell_j = ?]$$"]
    end

    I -->|$$X$$| J

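Given $ℓ_{i}$, find $ℓ_{j}$. We can initialize $ℓ_{j} = \emptyset$ and then, writing $ℓ_{j} ⊎ x$ for the in-place update $ℓ_{j} \leftarrow ℓ_{j} \cup x$:

- if $(I, ϕ_{i}, J, ϕ_{j}) \in PL$, then $ℓ_{j} ⊎ ℓ_{i}$
- if $(I, ϕ_{i}, J, ϕ_{j}, b) \in SL$, then $ℓ_{j} ⊎ \lbrace b \rbrace$

So all our kernel items form a graph:

- a vertex has the form $(ϕ_{i}, ℓ_{i})$
- the edges are exactly $PL$
- $SL$ can be understood as:
  - a visitor that generates lookahead $b$ when it reaches vertex $(ϕ_{j}, ℓ_{j})$
  - an initializer of all $ℓ$ values
    - because this graph is essentially our $LALR(1)$ DFA and all of its edges are known, we can assign the “spontaneous” lookaheads to their kernel items right at the start
- $ϕ_{0} = [S^{'} \to \cdot S, Ⅎ]$ is the source vertex

What remains is to fill in every $ℓ$ along the $PL$ edges. Either DFS or BFS works, but note that this is still a fixed-point problem: the traversal has to be repeated until no $ℓ$ changes any more, because:

- you may first process $(ϕ_{i}, ℓ_{i}) \to (ϕ_{j}, ℓ_{j})$, performing $ℓ_{j} ⊎ ℓ_{i}$
- but later there may be $(ϕ_{k}, ℓ_{k}) \to (ϕ_{i}, ℓ_{i})$, performing $ℓ_{i} ⊎ ℓ_{k}$, and the newly arrived content of $ℓ_{k}$ must then be passed on to $ℓ_{j}$ as well

The Dragon Book records this fixed-point computation in a table:

where INIT is $SL$, and each subsequent PASS follows $PL$.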
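The INIT/PASS computation can be sketched with a worklist in place of repeated full passes; both reach the same fixed point. The function name `propagate`, the vertex labels and the hand-derived `init` and `PL` data (for the Section 1 grammar, with `"$"` as the end marker) are assumptions of this sketch.

```python
from collections import deque

def propagate(init, PL):
    """init: vertex -> spontaneous lookaheads (the INIT column).
    PL: iterable of (source, target) propagation edges.
    Returns vertex -> complete lookahead set."""
    la = {v: set(s) for v, s in init.items()}
    succ = {}
    for src, dst in PL:
        succ.setdefault(src, []).append(dst)
        la.setdefault(src, set())
        la.setdefault(dst, set())
    work = deque(la)
    while work:
        v = work.popleft()
        for w in succ.get(v, []):
            if not la[v] <= la[w]:   # w is missing something v has
                la[w] |= la[v]
                work.append(w)       # w changed, so its successors need a revisit
    return la

# Kernel items of the Section 1 grammar, labelled (state, item).
v0, v1, v2 = ("I0", "S' -> .S"), ("I1", "S' -> S."), ("I2", "S -> X.X")
v36, v47 = ("I36", "X -> a.X"), ("I47", "X -> b.")
v5, v89 = ("I5", "S -> XX."), ("I89", "X -> aX.")

init = {v0: {"$"}, v36: {"a", "b"}, v47: {"a", "b"}}
PL = [(v0, v1), (v0, v2), (v2, v5), (v2, v36), (v2, v47),
      (v36, v36), (v36, v47), (v36, v89)]

la = propagate(init, PL)
for vertex, las in la.items():
    print(vertex, sorted(las))
```

The result matches the merged automaton of Section 1: the kernel items of State#36, State#47 and State#89 carry a, b and the end marker, while the others carry only the end marker.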

4.4 The Relations Algorithm (Omitted)

See Section 9.7.1.3, Parsing Techniques.

4.5 LALR-by-SLR Technique (Omitted)

See Section 9.7.1.4, Parsing Techniques.
