PEG (Parsing Expression Grammars)
本文是 Parsing Expression Grammars: A Recognition-Based Syntactic Foundation by Bryan Ford@MIT 的读文笔记。这篇论文佶屈聱牙,我这里用人话总结一下。
另外涉及 TS/TDPL and gTS/GTDPL 的部分这里直接略过,我暂时用不上。这几个缩写的混乱程度一看就知不好惹:
- TS: TMG Recognition Scheme
- TMG: TransMoGrifier, a recursive descent compiler-compiler developed by Robert M. McClure
- TPDL: TopDown Parsing Language
- gTS: generalized TS
- GTPDL: Generalized TDPL
1. The Chomsky Hierarchy is GenerativePermalink
论文教的这节历史课还是蛮不错的:
Chomsky’s generative system of grammars, from which the ubiquitous context-free grammars (CFGs) and regular expressions (REs) arise, was originally designed as a formal tool for modelling and analyzing natural (human) languages. Due to their elegance and expressive power, computer scientists adopted generative grammars for describing machine-oriented languages as well.
这个 generative 的 nature 挪用到 programming languages 时还是存在水土不服的:
The ability of a CFG to express ambiguous syntax is an important and powerful tool for natural languages. Unfortunately, this power gets in the way when we use CFGs for machine-oriented languages that are intended to be precise and unambiguous. Ambiguity in CFGs is difficult to avoid even when we want to, and it makes general CFG parsing an inherently super-linear-time problem.
虽然我们有 DCFG 这个武器,但在 programming language design 这个领域,ambiguity 还是很容易被引入的。论文给了几个例子 (但要注意它写于 2004,所以不确定现在的解决方案):
vector<vector<float>> MyMatrix;
连续两个>
可能会被识别成 input stream 的>>
operator (幽默 C++!)- the “dangling ELSE” problem
- C++ 某些 token sequence 可能同时满足
<definition>
和<statement>
- The syntax of lambda abstractions,
let
expressions, and conditionals in Haskell is unresolvably ambiguous in the CFG paradigm.
一般的解决方法有两种:
- rewrite your CFG and make it unambiguous
- 使用人为的 rules 去规定 parsing 的行为
使用第二种方法一般是出于迫不得已,比如:(1) rewrite 的工作量太大;(2) 理论上 rewrite 也无法解决某些 unambiguous 的情况。且它有一个后果就是:你这么搞了以后,你的 grammar 理论上就不再是一个合法的 CFG 了。
Many sensible syntactic constructs are inherently ambiguous when expressed in a CFG, commonly leading language designers to abandon syntactic formality and rely on informal metarules to solve these problems.
上面 4 个例子的解决方案是:
- workaround: 写成
vector<vector<float> > MyMatrix;
,加一个空格;或者用 PEG 写法 - rewrite
- an informal meta-rule that such a sequence is always interpreted as a
<definition>
if possible - an informal “longest match” meta-rule
2. PEGs are (somewhat) RecognitivePermalink
论文的用词是 recognition-based,我就直接用 recognitive 这个词了。
Simple languages can be expressed easily in either paradigm.
- Generative:
- Recognitive:
文中直接把 PEG 称为 recognition-based,有点不好理解,我觉得可以从两个角度考虑:
- 我们回忆一下 PEG 的全称:Parsing Expression Grammars。可见 PEG 就是为了 parsing 设计的,那 parsing 自然就是 recognition 的过程
- recognitive grammar 看上去需要有某种 “predicate” 能直接判断出是否有
。但 PEG 只有两个 syntactic predicates&e
、!e
,其余的 production/rule 也是形如Definition <- Identifier LEFTARROW Expression
这样 generative 的形式。从 grammar definition 的角度,我觉得称 PEG “有 recognitive 的成分” 是更准确的理解
3. Formal Definition of PEGsPermalink
3.1 GrammarsPermalink
Definition: A parsing expression grammar (PEG) is a 4-tuple
is a finite set of nonterminal symbols is a finite set of terminal symbols is a finite set of rules- Each rule
is like where and is a parsing expression.
- Each rule
is a parsing expression termed “the start expression”.
3.2 Parsing ExpressionsPermalink
Definition: We define parsing expressions inductively as follows:
- empty string
is a parsing expression - any terminal
is a parsing expression - any nonterminal
is a parsing expression - sequence
is a parsing expression, if and both are - prioritized choice
is a parsing expression, if and both are - zero-or-more repetition
is a parsing expression, if is -predicate is a parsing expression, if is
基于上述定义,我们可以扩展得到:
- character class (我不明白它为啥不叫 terminal class)
, where , after expanding the range, is equivalent to is a character class containing all the terminals in
- option
is equivalent to - one-or-more repetition
is equivalent to -predicate is equivalent to
在 PEG 的 context 下,我们一般也把 parsing expression 简称为 expression
Definition: Expression set
- the start expression
, - all expressions used in all rules, and
- all sub-expressions of those expressions
文中很神奇地没有定义什么叫做 “sub-expression”,只是举了个例子:如果
3.3 RulesPermalink
Axiom:
Lemma: Therefore
3.4 Parsing / Reduction / RecognitionPermalink
文中没有 formally 定义 derivation,而是 formally 定义了 recognition,本质就是 parsing/reduction 的过程。
我觉得它的符号设计有点烂,我这里就不搬运了,可以直接看文章 3.3 Interpretation of a Grammar。
这里我就简单说一下 PEG 几个特殊的 reduction 规则。
3.4.1 与 backtrackingPermalink
- 首先 match
(即看 能否 reduce 到 ),若不能 match,再去 match ,依此类推 - 如果某个
match 成功,那么直接break
,整个 matching 结束,后续的 都不会被考虑
这里 “再去 match
The choice expression
first attempts pattern , then attempts from the same starting point if fails.
这个 “from the same starting point” 要求 parser 需要 backtrack,即回退到 parsing 起始的位置,再去 match
我们在
In addition, many practical top-down parsing libraries and toolkits, including the popular ANTLR and the PARSEC combinator library for Haskell, provide backtracking capabilities that conform to this model in practice, if perhaps unintentionally.
3.4.2 与 backtrackingPermalink
sequence 的 matching 规则比较直接:如果
只考虑两个 sub-expressions,即
match 失败 match 成功,但 match 失败
这两种情况下,parser 都需要 backtrack 到 parsing
3.4.3 、 与 backtrackingPermalink
这俩 predicates 的 backtracking 你可以有两种理解方式:
- 它们 match 成功后 (或者说 unconditionally,即无论 match 成功、失败) 都要 backtracks to the starting point
- 它们只 match 但不 consume token
举几个例子:
EndOfFile <- !.
表示 EOF 不能是任何字符Primary <- Identifier !LEFTARROW
表示Primary
可以是一个Identifier
,但这个Identifier
后面不能接一个<
- 即排除
x <- y
这种类似 assignment 的形式 - parser match 到
Identifier
之后,需要在确定后面没有<
之后,才能做 reductionIdentifier -> Primary
- 即排除
A <- "foo" &("bar")
表示A
能 match"foobar"
中的"foo"
3.4.4 Greedy , , and Permalink
The
, , and operators behave as in common regular expression syntax, except that they are “greedy” rather than nondeterministic. The option expression unconditionally “consumes” the text matched by if succeeds, and the repetition expressions and always consume as many successive matches of as possible. The expression for example can never match any string. Longest-match parsing is almost always the desired behavior where options or repetition occur in practical machine-oriented languages. Many forms of non-greedy behavior are still available in PEGs when desired, however, through the use of predicates.
这个特性就完美替代了 1. The Chomsky Hierarchy is Generative 中提到的 “longest match” 的 meta-rule
3.4.5 Quirk: Match 和 Fail 的中间态Permalink
这一节还是得用文章 3.3 Interpretation of a Grammar 的符号。
当我们写
is a parsing expression (to start with in the parsing/recognition process) is the input text/string to be parsed/recognized is a substring of that is successfully matched and consumed in the parsing/recognition process- 合起来的意思就是:我们要 parse
,选择的初始 expression 是 ,经过一步或多步 (可能还用到了其他的 expression),最终成功 match 了 中的 部分, 后面的部分待续
当我们写
such that- 即
从 expression 开始的 matching 失败 (where is a special symbol that indicates failure)
Definition:
- If
that , we say that matches in - If
, we say that fails on in - Match set of expression
in is defined as
Definition:
- If expression
either matches or fails on a string , we say handles - If grammar
’s start expression handles , we say handles is complete if it handles
这里就有一个疑问:从 “handle” 的定义来看,似乎存在 “neither matches nor fails on
A grammar is wellformed if it is not left-recursive. Such grammars are also complete, which means that they always succeed or fail on all input sentences. For an incomplete grammar on the other hand, input sentences exist for which an attempt to match them against the grammar will not terminate.
注意前面 3.4.4 Greedy
4. PELsPermalink
证明部分我略过了,需要时直接看原文。
Definition: The language of a PEG
Note that:
…
only needs to succeed on input string for to be included in ; need not consume all of string … This definition contrasts with TS and gTS, in which partially consumed input strings are excluded from the language and classified as partial-acceptance failures.
Definition: A language
Theorem: The class of PELs includes (some) non-context-free languages.
无法相信作者在文中没有标定是 “includes some” 还是 “include all”!根据 Computational Model for Parsing Expression Grammars:
From the 60’s it is known that PELs contain DCFLs as a subclass and some non-context-free languages like
as well.
Theorem: Given an arbitrary PEG
Comments