Appetizer #2 Before Parsing: CFG Disambiguation (with intro to Parse Tree & Ambiguity)
Parse TreePermalink
DefinitionPermalink
Derivation Tree is a better name
Parse Tree 的另一个名字 – Derivation Tree – 明显是一个更好的名字,因为这样一来我们就很能自然地把 derivation 和 tree 联系起来。
我们在 parsing 的时候,parse tree 一般都是
Characteristics: For CFG
- Every interior/internal node is labeled by a non-terminal
- Every leaf node is labeled by
which is either a terminal, a terminal, or , i.e.- A leaf labeled by
must be the only child of its parent.
- A leaf labeled by
- If an internal node is labeled by
with children labeled by (from left to right) ,then must be a production .- Or it represents the application of production
in the corresponding derivation.
- Or it represents the application of production
Definition: Yield (产物) or frontier of a parse tree is the concatenated string of leaf labels (left–right).
Derivation Parse Tree Sentential Form YieldPermalink
Theorem: CFG
Proof: (
Base Case:
Induction Step: suppose the assumption holds for all
For
There must be a non-terminal
(
Base Case: If the parse tree has only
Induction Step: suppose the assumption holds for all parse tree with
Now if we have a parse tree with
- It’s certain that the parse tree for
has internal nodes - Similarly the parse tree for
has internal nodes
Therefore there exist derivation
这个定理再结合 “任何 derivation 都有等价的 leftmost derivation 和 rightmost derivation”,我们可以认为意下 5 个命题等价:
,或者说 是一个 sentential form (derived from ) parse tree 以 为 root 且 yield 为
AmbiguityPermalink
DefinitionPermalink
Definition: A grammar
“判定任意给定的 CFG
Definition: Given a language
Example:
Sentential Form 与 Derivation 与 Parse Tree 的数量关系Permalink
我们先区分一下 derivation:
- 我们把
这样省略了中间步骤的 derivation 称为 abstract derivation,用 表示 - 我们把
这样列出了中间每一步的 derivation 称为 concrete derivation,用 - abstract derivation 的本质是 a set of concrete derivations, i.e.
同时,考虑到:
- An abstract derivation cannot reveal the order of production application, just like a parse tree cannot reveal the order of its node expansion/creation.
- A concrete derivation shows a fixed order of production application, while a topological sort of a parse tree is also a fix order of its node expansion/creation.
我们也区分一下 “parse tree” 和 “parse tree 上的 topological sort”:
- 我们用
表示一棵具体的 parse tree - 我们用
表示 上的一个 topological sort
考虑 Sentential Form 与 Derivation 的关系时不需要考虑 grammar
考虑 Derivation 与 Parse Tree 的数量关系时需要考量 grammar
When
flowchart LR
W@{ shape: braces, label: $$w$$ }
W --- DA@{ shape: tag-rect, label: "$$𝔇(S,w)$$" }
DA --- DC1@{ shape: div-rect, label: "$$𝔡_1(S,w)$$" }
DA --- DC2@{ shape: div-rect, label: "$$𝔡_2(S,w)$$" }
DA --- DC3@{ shape: div-rect, label: "$$\\cdots$$" }
DA --- DC4@{ shape: div-rect, label: "$$𝔡_n(S,w)$$" }
DC1 --- T1[/"$$𝔱_1(S,w)$$"\]
DC2 --- T2[/"$$𝔱_2(S,w)$$"\]
DC3 --- T3[/"$$\\cdots$$"\]
DC4 --- T4[/"$$𝔱_n(S,w)$$"\]
T1 --- T@{ shape: tri, label: "$$𝔗(S,w)$$" }
T2 --- T
T3 --- T
T4 --- T
When
flowchart LR
W@{ shape: braces, label: $$w$$ }
W --- DA@{ shape: tag-rect, label: "$$𝔇(S,w)$$" }
DA --- DC1@{ shape: div-rect, label: "$$𝔡_1(S,w)$$" }
DA --- DC2@{ shape: div-rect, label: "$$\\cdots$$" }
DA --- DC3@{ shape: div-rect, label: "$$𝔡_k(S,w)$$" }
DA --- DC4@{ shape: div-rect, label: "$$𝔡_{k+1}(S,w)$$" }
DA --- DC5@{ shape: div-rect, label: "$$\\cdots$$" }
DA --- DC6@{ shape: div-rect, label: "$$𝔡_n(S,w)$$" }
DC1 --- T1[/"$$𝔱_1(S,w)$$"\]
DC2 --- T2[/"$$\\cdots$$"\]
DC3 --- T3[/"$$𝔱_k(S,w)$$"\]
DC4 --- T4[/"$$𝔱_{k+1}(S,w)$$"\]
DC5 --- T5[/"$$\\cdots$$"\]
DC6 --- T6[/"$$𝔱_n(S,w)$$"\]
T1 --- TT1@{ shape: tri, label: "$$𝔗_1(S,w)$$" }
T2 --- TT1
T3 --- TT1
T4 --- TT2@{ shape: tri, label: "$$𝔗_2(S,w)$$" }
T5 --- TT2
T6 --- TT2
Ambiguous Grammar Example #1: arithmetic operator precedencePermalink
假设我们有这么个语法:
这个语法对句型
%%{init: { 'fontFamily': 'monospace' } }%%
flowchart TD
subgraph #2
E'[$$E$$]
E' --- E1'[$$E$$]
E' --- Times'@{ shape: dbl-circ, label: $$\ast$$ }
E' --- E2'[$$E$$]
E1' --- E3'[$$E$$] --- I1'[$$I$$] --- a1'@{ shape: dbl-circ, label: $$a$$ }
E1' --- Plus'@{ shape: dbl-circ, label: $$+$$ }
E1' --- E4'[$$E$$] --- I2'[$$I$$] --- a2'@{ shape: dbl-circ, label: $$a$$ }
E2' --- I3'[$$I$$] --- a3'@{ shape: dbl-circ, label: $$a$$ }
end
subgraph #1
E[$$E$$]
E --- E1[$$E$$]
E --- Plus@{ shape: dbl-circ, label: $$+$$ }
E --- E2[$$E$$]
E1 --- I1[$$I$$] --- a1@{ shape: dbl-circ, label: $$a$$ }
E2 --- E3[$$E$$] --- I2[$$I$$] --- a2@{ shape: dbl-circ, label: $$a$$ }
E2 --- Times@{ shape: dbl-circ, label: $$\ast$$ }
E2 --- E4[$$E$$] --- I3[$$I$$] --- a3@{ shape: dbl-circ, label: $$a$$ }
end
Ambiguous Grammar Example #2: dangling ELSE problemPermalink
假设我们有这么个语法:
那么句子
用 python 更好理解一些:
# Block #1
if E1:
if E2:
S1
else:
S2
# Block #2
if E1:
if E2:
S1
else:
S2
换言之就是以下二式的区别:
你写成 python 有 indent 保证,但是输入给 parser 的时候往往是
DisambiguationPermalink
Method #1: Rewrite the GrammarPermalink
有些文法的歧义性,可以通过重新设计文法来消除。
比如 Example #1 的文法可以修改为:
flowchart TD
E[$$E$$]
E --- E1[$$E$$]
E --- Plus@{ shape: dbl-circ, label: $$+$$ }
E --- T[$$T$$]
E1 --- I1[$$I$$] --- a1@{ shape: dbl-circ, label: $$a$$ }
T --- T1[$$T$$] --- I2[$$I$$] --- a2@{ shape: dbl-circ, label: $$a$$ }
T --- Times@{ shape: dbl-circ, label: $$\ast$$ }
T --- T2[$$T$$] --- I3[$$I$$] --- a3@{ shape: dbl-circ, label: $$a$$ }
比如 Example #2 的文法可以修改为:
Method #2: Syntactic Predicates / Alternative PrecedencePermalink
指像 ANTLR 之类的有 special instructions 或者 rules 可以指定优先匹配哪个 alternative production.
比如 ANTLR 有 “First Written, Fist Match” 原则 (i.e. greedy choice),比如:
expr: expr '+' expr # Add
| expr '*' expr # Mul
| INT # Int
;
INT: [0-9]+;
当输入为 INT + INT * INT
时,因为 expr '+' expr
写在 expr '*' expr
前面,所以 expr '+' expr
的优先级更高,于是会优先匹配 INT + INT
.
留下评论