8 分钟阅读

Parse TreePermalink

DefinitionPermalink

Derivation Tree is a better name

Parse Tree 的另一个名字 – Derivation Tree – 明显是一个更好的名字,因为这样一来我们就很能自然地把 derivation 和 tree 联系起来。

我们在 parsing 的时候,parse tree 一般都是 Sw where w is a sentence 的情形;但更 general 的 Aw where w is a sentential form 的情况,我们构建的 tree 也算是 parse tree。换句话说,不要把 parse tree 限定成 Sw 这么一种形式。

Characteristics: For CFG G=(V,T,P,S), a parse tree (or derivation tree) of G is a tree satisfying the following conditions:

  1. Every interior/internal node is labeled by a non-terminal AV
  2. Every leaf node is labeled by x which is either a terminal, a terminal, or ε, i.e. xVT{ε}
    • A leaf labeled by ε must be the only child of its parent.
  3. If an internal node is labeled by A with children labeled by (from left to right) X1,X2,,Xn,then AX1X2Xn must be a production P.
    • Or it represents the application of production AX1X2Xn in the corresponding derivation.

Definition: Yield (产物) or frontier of a parse tree is the concatenated string of leaf labels (left–right).

Derivation Parse Tree Sentential Form YieldPermalink

Theorem: CFG G has a derivation Aα a parse tree with root labeled A and a yield of α.

Proof: () Proof by induction on the number of steps k in the derivation Akα.

Base Case: k=1. If Aα1α2αn, then a parse tree with height 1:

Induction Step: suppose the assumption holds for all k-step derivation Akβ.

For k+1-step derivation Akβα, there a parse tree with height k, root labeled A, and a yield of β.

There must be a non-terminal X in β and a production Xγ, who together derive βα . We can add Xγ to the above parse tree. The new tree is valid.

() Proof by induction on the number of internal nodes k of the parse tree.

Base Case: If the parse tree has only 1 internal node, then it must be the root, and the height of the parse tree must be 1. This means Aα1α2αn must be a production, and the derivation holds.

Induction Step: suppose the assumption holds for all parse tree with k internal nodes.

Now if we have a parse tree with k+1 internal nodes 的 parse tree like this:

  • It’s certain that the parse tree for AX1X2Xn has k internal nodes
  • Similarly the parse tree for Xiαi has k internal nodes

Therefore there exist derivation AX1X2Xn and i,Xiαi, and the combined derivation Aα1α2αn holds.

这个定理再结合 “任何 derivation 都有等价的 leftmost derivation 和 rightmost derivation”,我们可以认为意下 5 个命题等价:

  1. wL(A),或者说 w 是一个 sentential form (derived from A)
  2. parse tree 以 A 为 root 且 yield 为 w
  3. Aw
  4. Almw
  5. Armw

AmbiguityPermalink

DefinitionPermalink

Definition: A grammar G=(V,T,P,S) is said to be ambiguous (歧义的) if wL(G) for which derivation Sw two different parse trees.

“判定任意给定的 CFG G 是否歧义” 是一个不可判定问题。

Definition: Given a language L, if every CFG of L is ambiguous, then L is inherently ambiguous (固有歧义)

Example: L={aibjcki=j or j=k}

L 中任何形为 anbncn 的串,总会有两棵 parse tree,所以 L 是固有歧义的

Sentential Form 与 Derivation 与 Parse Tree 的数量关系Permalink

我们先区分一下 derivation:

  • 我们把 Sw 这样省略了中间步骤的 derivation 称为 abstract derivation,用 D(S,w) 表示
  • 我们把 Sw 这样列出了中间每一步的 derivation 称为 concrete derivation,用 di(S,w)
  • abstract derivation 的本质是 a set of concrete derivations, i.e. D(S,w)={d1(S,w),d2(S,w),,dn(S,w)}

同时,考虑到:

  • An abstract derivation cannot reveal the order of production application, just like a parse tree cannot reveal the order of its node expansion/creation.
  • A concrete derivation shows a fixed order of production application, while a topological sort of a parse tree is also a fix order of its node expansion/creation.

我们也区分一下 “parse tree” 和 “parse tree 上的 topological sort”:

  • 我们用 T(S,w) 表示一棵具体的 parse tree
  • 我们用 ti(S,w) 表示 T(S,w) 上的一个 topological sort

考虑 Sentential Form 与 Derivation 的关系时不需要考虑 grammar G 的 ambiguity,因为一个 sentential form w 一定只可能有一个 abstract derivation,且一定可以有多个 concrete derivation,无论 G 是否 ambiguous:

考虑 Derivation 与 Parse Tree 的数量关系时需要考量 grammar G 的 ambiguity.

When G is unambiguous:

flowchart LR
    W@{ shape: braces, label: $$w$$ }
    W --- DA@{ shape: tag-rect, label: "$$𝔇(S,w)$$" }
    DA --- DC1@{ shape: div-rect, label: "$$𝔡_1(S,w)$$" }
    DA --- DC2@{ shape: div-rect, label: "$$𝔡_2(S,w)$$" }
    DA --- DC3@{ shape: div-rect, label: "$$\\cdots$$" }
    DA --- DC4@{ shape: div-rect, label: "$$𝔡_n(S,w)$$" }

    DC1 --- T1[/"$$𝔱_1(S,w)$$"\]
    DC2 --- T2[/"$$𝔱_2(S,w)$$"\]
    DC3 --- T3[/"$$\\cdots$$"\]
    DC4 --- T4[/"$$𝔱_n(S,w)$$"\]

    T1 --- T@{ shape: tri, label: "$$𝔗(S,w)$$" }
    T2 --- T
    T3 --- T
    T4 --- T

When G is ambiguous (suppose there are 2 parse trees for the abstract derivation below):

flowchart LR
    W@{ shape: braces, label: $$w$$ }
    W --- DA@{ shape: tag-rect, label: "$$𝔇(S,w)$$" }
    DA --- DC1@{ shape: div-rect, label: "$$𝔡_1(S,w)$$" }
    DA --- DC2@{ shape: div-rect, label: "$$\\cdots$$" }
    DA --- DC3@{ shape: div-rect, label: "$$𝔡_k(S,w)$$" }
    DA --- DC4@{ shape: div-rect, label: "$$𝔡_{k+1}(S,w)$$" }
    DA --- DC5@{ shape: div-rect, label: "$$\\cdots$$" }
    DA --- DC6@{ shape: div-rect, label: "$$𝔡_n(S,w)$$" }

    DC1 --- T1[/"$$𝔱_1(S,w)$$"\]
    DC2 --- T2[/"$$\\cdots$$"\]
    DC3 --- T3[/"$$𝔱_k(S,w)$$"\]
    DC4 --- T4[/"$$𝔱_{k+1}(S,w)$$"\]
    DC5 --- T5[/"$$\\cdots$$"\]
    DC6 --- T6[/"$$𝔱_n(S,w)$$"\]

    T1 --- TT1@{ shape: tri, label: "$$𝔗_1(S,w)$$" }
    T2 --- TT1
    T3 --- TT1
    T4 --- TT2@{ shape: tri, label: "$$𝔗_2(S,w)$$" }
    T5 --- TT2
    T6 --- TT2

Ambiguous Grammar Example #1: arithmetic operator precedencePermalink

假设我们有这么个语法:

EE+EEEIIa

这个语法对句型 a+aa 就是歧义的(体现为运算优先级的不同)

%%{init: { 'fontFamily': 'monospace' } }%%
flowchart TD
    subgraph #2
        E'[$$E$$]
        E' --- E1'[$$E$$]
        E' --- Times'@{ shape: dbl-circ, label: $$\ast$$ }
        E' --- E2'[$$E$$]
    
        E1' --- E3'[$$E$$] --- I1'[$$I$$] --- a1'@{ shape: dbl-circ, label: $$a$$ }
        E1' --- Plus'@{ shape: dbl-circ, label: $$+$$ }
        E1' --- E4'[$$E$$] --- I2'[$$I$$] --- a2'@{ shape: dbl-circ, label: $$a$$ }
    
        E2' --- I3'[$$I$$] --- a3'@{ shape: dbl-circ, label: $$a$$ }
    end

    subgraph #1
        E[$$E$$]
        E --- E1[$$E$$]
        E --- Plus@{ shape: dbl-circ, label: $$+$$ }
        E --- E2[$$E$$]
    
        E1 --- I1[$$I$$] --- a1@{ shape: dbl-circ, label: $$a$$ }
    
        E2 --- E3[$$E$$] --- I2[$$I$$] --- a2@{ shape: dbl-circ, label: $$a$$ }
        E2 --- Times@{ shape: dbl-circ, label: $$\ast$$ }
        E2 --- E4[$$E$$] --- I3[$$I$$] --- a3@{ shape: dbl-circ, label: $$a$$ }
    end

Ambiguous Grammar Example #2: dangling ELSE problemPermalink

假设我们有这么个语法:

stmtif expr then stmtif expr then stmt else stmtother_stmt

那么句子 w=if E1 then if E2 then S1 else S2 就是歧义的。

用 python 更好理解一些:

# Block #1
if E1:
    if E2:
        S1
    else:
        S2

# Block #2
if E1:
    if E2:
        S1
else:
    S2

换言之就是以下二式的区别:

w=if E1 then if E2 then S1else S2stmtw=if E1 then if E2 then S1stmtelse S2stmt

你写成 python 有 indent 保证,但是输入给 parser 的时候往往是 if E1 then if E2 then  这种 flattened 的形式。

DisambiguationPermalink

Method #1: Rewrite the GrammarPermalink

有些文法的歧义性,可以通过重新设计文法来消除

比如 Example #1 的文法可以修改为:

EE+TTITTTIIa
flowchart TD
    E[$$E$$]
    E --- E1[$$E$$]
    E --- Plus@{ shape: dbl-circ, label: $$+$$ }
    E --- T[$$T$$]

    E1 --- I1[$$I$$] --- a1@{ shape: dbl-circ, label: $$a$$ }

    T --- T1[$$T$$] --- I2[$$I$$] --- a2@{ shape: dbl-circ, label: $$a$$ }
    T --- Times@{ shape: dbl-circ, label: $$\ast$$ }
    T --- T2[$$T$$] --- I3[$$I$$] --- a3@{ shape: dbl-circ, label: $$a$$ }

比如 Example #2 的文法可以修改为:

stmtmatched_stmtunmatched_stmtmatched_stmtif expr then matched_stmt else matched_stmtother_stmtunmatched_stmtif expr then stmtif expr then matched_stmt else unmatched_stmt

Method #2: Syntactic Predicates / Alternative PrecedencePermalink

指像 ANTLR 之类的有 special instructions 或者 rules 可以指定优先匹配哪个 alternative production.

比如 ANTLR 有 “First Written, Fist Match” 原则 (i.e. greedy choice),比如:

expr: expr '+' expr   # Add
    | expr '*' expr   # Mul
    | INT             # Int
    ;

INT: [0-9]+;

当输入为 INT + INT * INT 时,因为 expr '+' expr 写在 expr '*' expr 前面,所以 expr '+' expr 的优先级更高,于是会优先匹配 INT + INT.

分类:

更新时间:

留下评论