7 minute read

本文是 Parsing Expression Grammars: A Recognition-Based Syntactic Foundation by Bryan Ford@MIT 的读文笔记。这篇论文佶屈聱牙,我这里用人话总结一下。

另外涉及 TS/TDPL and gTS/GTDPL 的部分这里直接略过,我暂时用不上。这几个缩写的混乱程度一看就知不好惹:

  • TS: TMG Recognition Scheme
    • TMG: TransMoGrifier, a recursive descent compiler-compiler developed by Robert M. McClure
  • TPDL: TopDown Parsing Language
  • gTS: generalized TS
  • GTPDL: Generalized TDPL

1. The Chomsky Hierarchy is GenerativePermalink

论文教的这节历史课还是蛮不错的:

Chomsky’s generative system of grammars, from which the ubiquitous context-free grammars (CFGs) and regular expressions (REs) arise, was originally designed as a formal tool for modelling and analyzing natural (human) languages. Due to their elegance and expressive power, computer scientists adopted generative grammars for describing machine-oriented languages as well.

这个 generative 的 nature 挪用到 programming languages 时还是存在水土不服的:

The ability of a CFG to express ambiguous syntax is an important and powerful tool for natural languages. Unfortunately, this power gets in the way when we use CFGs for machine-oriented languages that are intended to be precise and unambiguous. Ambiguity in CFGs is difficult to avoid even when we want to, and it makes general CFG parsing an inherently super-linear-time problem.

虽然我们有 DCFG 这个武器,但在 programming language design 这个领域,ambiguity 还是很容易被引入的。论文给了几个例子 (但要注意它写于 2004,所以不确定现在的解决方案):

  1. vector<vector<float>> MyMatrix; 连续两个 > 可能会被识别成 input stream 的 >> operator (幽默 C++!)
  2. the “dangling ELSE” problem
  3. C++ 某些 token sequence 可能同时满足 <definition><statement>
  4. The syntax of lambda abstractions, let expressions, and conditionals in Haskell is unresolvably ambiguous in the CFG paradigm.

一般的解决方法有两种:

  1. rewrite your CFG and make it unambiguous
  2. 使用人为的 rules 去规定 parsing 的行为

使用第二种方法一般是出于迫不得已,比如:(1) rewrite 的工作量太大;(2) 理论上 rewrite 也无法解决某些 unambiguous 的情况。且它有一个后果就是:你这么搞了以后,你的 grammar 理论上就不再是一个合法的 CFG 了。

Many sensible syntactic constructs are inherently ambiguous when expressed in a CFG, commonly leading language designers to abandon syntactic formality and rely on informal metarules to solve these problems.

上面 4 个例子的解决方案是:

  1. workaround: 写成 vector<vector<float> > MyMatrix;,加一个空格;或者用 PEG 写法
  2. rewrite
  3. an informal meta-rule that such a sequence is always interpreted as a <definition> if possible
  4. an informal “longest match” meta-rule

2. PEGs are (somewhat) RecognitivePermalink

论文的用词是 recognition-based,我就直接用 recognitive 这个词了。

Simple languages can be expressed easily in either paradigm.

  • Generative: {waw=(aa)n}
  • Recognitive: {wa|w|mod2=0}

文中直接把 PEG 称为 recognition-based,有点不好理解,我觉得可以从两个角度考虑:

  1. 我们回忆一下 PEG 的全称:Parsing Expression Grammars。可见 PEG 就是为了 parsing 设计的,那 parsing 自然就是 recognition 的过程
  2. recognitive grammar 看上去需要有某种 “predicate” 能直接判断出是否有 wL。但 PEG 只有两个 syntactic predicates &e!e,其余的 production/rule 也是形如 Definition <- Identifier LEFTARROW Expression 这样 generative 的形式。从 grammar definition 的角度,我觉得称 PEG “有 recognitive 的成分” 是更准确的理解

3. Formal Definition of PEGsPermalink

3.1 GrammarsPermalink

Definition: A parsing expression grammar (PEG) is a 4-tuple G=(VN,VT,R,eS), where:

  • VN is a finite set of nonterminal symbols
  • VT is a finite set of terminal symbols
    • VNVT=
  • R is a finite set of rules
    • Each rule rR is like Ae where AVN and e is a parsing expression.
  • eS is a parsing expression termed “the start expression”.

3.2 Parsing ExpressionsPermalink

Definition: We define parsing expressions inductively as follows:

  1. empty string ε is a parsing expression
  2. any terminal aVT is a parsing expression
  3. any nonterminal AVN is a parsing expression
  4. sequence e1e2 is a parsing expression, if e1 and e2 both are
  5. prioritized choice e1/e2 is a parsing expression, if e1 and e2 both are
  6. zero-or-more repetition e is a parsing expression, if e is
  7. not-predicate !e is a parsing expression, if e is

基于上述定义,我们可以扩展得到:

  1. character class (我不明白它为啥不叫 terminal class) [a1a2an], where aiVT, after expanding the range, is equivalent to a1/a2//an
    • . is a character class containing all the terminals in VT
  2. option e? is equivalent to e/ε
  3. one-or-more repetition e+ is equivalent to ee
  4. and-predicate &e is equivalent to !(!e)

在 PEG 的 context 下,我们一般也把 parsing expression 简称为 expression

Definition: Expression set E(G) of PEG G is the set containing:

  1. the start expression eS,
  2. all expressions used in all rules, and
  3. all sub-expressions of those expressions

文中很神奇地没有定义什么叫做 “sub-expression”,只是举了个例子:如果 e1/e2 是 expression,那么 e1e2 就是它的 sub-expressions.

3.3 RulesPermalink

Axiom: AVN, there is exactly one e such that AeR.

Lemma: Therefore R:VNE(G) is a function from nonterminals to expressions, and we can write R(A)=e such that AeR.

3.4 Parsing / Reduction / RecognitionPermalink

文中没有 formally 定义 derivation,而是 formally 定义了 recognition,本质就是 parsing/reduction 的过程。

我觉得它的符号设计有点烂,我这里就不搬运了,可以直接看文章 3.3 Interpretation of a Grammar

这里我就简单说一下 PEG 几个特殊的 reduction 规则。

3.4.1 e1/e2 与 backtrackingPermalink

e1/e2//en 的 matching 规则是:

  1. 首先 match e1 (即看 wVT 能否 reduce 到 e1),若不能 match,再去 match e2,依此类推
  2. 如果某个 ei match 成功,那么直接 break,整个 matching 结束,后续的 ei+1//en 都不会被考虑

这里 “再去 match e2” 就会涉及到 backtracking 操作:

The choice expression e1/e2 first attempts pattern e1, then attempts e2 from the same starting point if e1 fails.

这个 “from the same starting point” 要求 parser 需要 backtrack,即回退到 parsing 起始的位置,再去 match e2

我们在 LL(1) parser 中是看不到这种 backtrack 操作的,所以这也解释了某些地方会把 “lookahead k” 和 “是否有 backtracking” 作为 parser 的两种特征,比如:

In addition, many practical top-down parsing libraries and toolkits, including the popular ANTLR and the PARSEC combinator library for Haskell, provide backtracking capabilities that conform to this model in practice, if perhaps unintentionally.

3.4.2 e1e2 与 backtrackingPermalink

sequence 的 matching 规则比较直接:如果 e1 match,那么就继续 match e2,依此类推

只考虑两个 sub-expressions,即 e1e2 时,match 失败的情况有两种:

  1. e1 match 失败
  2. e1 match 成功,但 e2 match 失败

这两种情况下,parser 都需要 backtrack 到 parsing e1 起始的位置

3.4.3 &e!e 与 backtrackingPermalink

这俩 predicates 的 backtracking 你可以有两种理解方式:

  1. 它们 match 成功后 (或者说 unconditionally,即无论 match 成功、失败) 都要 backtracks to the starting point
  2. 它们只 match 但不 consume token

举几个例子:

  • EndOfFile <- !. 表示 EOF 不能是任何字符
  • Primary <- Identifier !LEFTARROW 表示 Primary 可以是一个 Identifier,但这个 Identifier 后面不能接一个 <
    • 即排除 x <- y 这种类似 assignment 的形式
    • parser match 到 Identifier 之后,需要在确定后面没有 < 之后,才能做 reduction Identifier -> Primary
  • A <- "foo" &("bar") 表示 A 能 match "foobar" 中的 "foo"

3.4.4 Greedy e?, e, and e+Permalink

The ?, , and + operators behave as in common regular expression syntax, except that they are “greedy” rather than nondeterministic. The option expression e? unconditionally “consumes” the text matched by e if e succeeds, and the repetition expressions e and e+ always consume as many successive matches of e as possible. The expression aa for example can never match any string. Longest-match parsing is almost always the desired behavior where options or repetition occur in practical machine-oriented languages. Many forms of non-greedy behavior are still available in PEGs when desired, however, through the use of predicates.

这个特性就完美替代了 1. The Chomsky Hierarchy is Generative 中提到的 “longest match” 的 meta-rule

3.4.5 Quirk: Match 和 Fail 的中间态Permalink

这一节还是得用文章 3.3 Interpretation of a Grammar 的符号。

当我们写 (e,w)+w 时,我们的意思是:

  • e is a parsing expression (to start with in the parsing/recognition process)
  • wVT is the input text/string to be parsed/recognized
  • wVT is a substring of w that is successfully matched and consumed in the parsing/recognition process
  • 合起来的意思就是:我们要 parse w,选择的初始 expression 是 e,经过一步或多步 (可能还用到了其他的 expression),最终成功 match 了 w 中的 w 部分,w 后面的部分待续

当我们写 (e,w)+ 时,我们的意思是:

  • w such that (e,w)+w
  • wVN 从 expression e 开始的 matching 失败 (where is a special symbol that indicates failure)

Definition:

  • If wVT that (e,w)+w, we say that e matches w in G
  • If (e,w)+, we say that e fails on x in G
  • Match set of expression e in G is defined as MG(e)={we matches w in G}

Definition:

  • If expression e either matches or fails on a string wVT, we say e handles w
  • If grammar G’s start expression eS handles w, we say G handles w
  • G is complete if it handles wVT

这里就有一个疑问:从 “handle” 的定义来看,似乎存在 “neither matches nor fails on w” 的情况?的确是有的,但更容易理解的说法是 “matching will never terminate”,本质是因为存在 grammar 有 “无限递归” 的毛病。如 GRAMMAR::PEG(N) 0.1 “Grammar operations and usage” 所说:

A grammar is wellformed if it is not left-recursive. Such grammars are also complete, which means that they always succeed or fail on all input sentences. For an incomplete grammar on the other hand, input sentences exist for which an attempt to match them against the grammar will not terminate.

注意前面 3.4.4 Greedy e?, e, and e+ 中的例子:expression e=aa 去 match input like aaa:由于 greedy 特性,这个 matching 应该判定为 failure 而不是 never terminate

4. PELsPermalink

证明部分我略过了,需要时直接看原文。

Definition: The language of a PEG G=(VN,VT,R,eS) is defined as L(G)={xVTeS matches x}.

Note that:

eS only needs to succeed on input string x for x to be included in L(G); eS need not consume all of string x… This definition contrasts with TS and gTS, in which partially consumed input strings are excluded from the language and classified as partial-acceptance failures.

Definition: A language L over an alphabet VT is a parsing expression language (PEL) PEG G whose language is L.

Theorem: The class of PELs includes (some) non-context-free languages.

无法相信作者在文中没有标定是 “includes some” 还是 “include all”!根据 Computational Model for Parsing Expression Grammars:

From the 60’s it is known that PELs contain DCFLs as a subclass and some non-context-free languages like anbncn as well.

Theorem: Given an arbitrary PEG G, it’s generally undecidable whether its language L(G) is empty.

Categories:

Updated:

Comments