Lark’s implementation of computing FIRST and FOLLOW sets
OverviewPermalink
Lark is a parsing toolkit for Python and it has a straightforward implementation of computing
Symbol
ClassesPermalink
Suppose we have a grammar
Lark uses the following 3 classes for symbols:
Symbol
Terminal(Symbol)
NonTerminal(Symbol)
class Symbol(Serialize):
name: str
is_term: ClassVar[bool] = NotImplemented
def __init__(self, name: str) -> None:
self.name = name
class Terminal(Symbol):
is_term: ClassVar[bool] = True
# filter_out: boolean flag of whether this terminal should be excluded from the parse tree
#
# When filter_out=True:
# - The terminal is matched during parsing but removed from the final parse tree
# - Useful for structural punctuation like parentheses, commas, semicolons, braces
# - Keeps parse trees clean by removing syntactic noise while preserving semantic content
# - Example: In "func(a, b)", the parentheses and comma would be filtered out
def __init__(self, name: str, filter_out: bool = False) -> None:
self.name = name
self.filter_out = filter_out
class NonTerminal(Symbol):
is_term: ClassVar[bool] = False
Rule
ClassPermalink
Lark uses class Rule
to represent productions of
class Rule:
origin: Symbol # Left-hand side (non-terminal), e.g. A
expansion: List[Symbol] # Right-hand side, e.g. [B, C, D]
alias: Optional[str] # Optional label for the rule in the parse tree
order: int # Priority for disambiguation
Note that alias
is for readability in parse trees.
Lark support the following grammar format:
(*
Question mark (?) at the beginning indicates that this rule will be inlined
if they have a single child, after filtering.
*)
?expr: term "+" term -> add
| term
where -> add
indicate the alias
for this production. Therefore we can construct the following Rule
object:
Rule(origin=NonTerminal('expr'),
expansion=[NonTerminal('term'), Terminal('+'), NonTerminal('term')],
alias='add')
And the application of this rule in a parse tree can be represented as:
___'add'___
/ | \
'term' '+' 'term'
where the name of the root of this sub tree is no longer 'expr'
ImplementationPermalink
def update_set(set1, set2):
if not set2 or set1 > set2:
return False
copy = set(set1)
set1 |= set2
return set1 != copy
def calculate_sets(rules):
"""Calculate FOLLOW sets.
Adapted from: http://lara.epfl.ch/w/cc09:algorithm_for_first_and_follow_sets"""
symbols = {sym for rule in rules for sym in rule.expansion} | {rule.origin for rule in rules}
# foreach grammar rule X ::= Y(1) ... Y(k)
# if k=0 or {Y(1),...,Y(k)} subset of NULLABLE then
# NULLABLE = NULLABLE union {X}
# for i = 1 to k
# if i=1 or {Y(1),...,Y(i-1)} subset of NULLABLE then
# FIRST(X) = FIRST(X) union FIRST(Y(i))
# for j = i+1 to k
# if i=k or {Y(i+1),...Y(k)} subset of NULLABLE then
# FOLLOW(Y(i)) = FOLLOW(Y(i)) union FOLLOW(X)
# if i+1=j or {Y(i+1),...,Y(j-1)} subset of NULLABLE then
# FOLLOW(Y(i)) = FOLLOW(Y(i)) union FIRST(Y(j))
# until none of NULLABLE,FIRST,FOLLOW changed in last iteration
NULLABLE = set()
FIRST = {}
FOLLOW = {}
for sym in symbols:
FIRST[sym]={sym} if sym.is_term else set() # 📌 Key Step 1
FOLLOW[sym]=set()
# Calculate NULLABLE and FIRST
changed = True
while changed:
changed = False
for rule in rules:
if set(rule.expansion) <= NULLABLE: # 📌 Key Step 2
if update_set(NULLABLE, {rule.origin}):
changed = True
for i, sym in enumerate(rule.expansion): # 📌 Key Step 3
if set(rule.expansion[:i]) <= NULLABLE:
if update_set(FIRST[rule.origin], FIRST[sym]):
changed = True
else:
break
# Calculate FOLLOW
changed = True
while changed:
changed = False
for rule in rules:
for i, sym in enumerate(rule.expansion):
if i==len(rule.expansion)-1 or set(rule.expansion[i+1:]) <= NULLABLE: # 📌 Key Step 4
if update_set(FOLLOW[sym], FOLLOW[rule.origin]):
changed = True
for j in range(i+1, len(rule.expansion)): # 📌 Key Step 5
if set(rule.expansion[i+1:j]) <= NULLABLE:
if update_set(FOLLOW[sym], FIRST[rule.expansion[j]]):
changed = True
return FIRST, FOLLOW, NULLABLE
update_set(set1, set2)
function updates set1
with set2
. If set1
does not increase after the update, it’s a False
update; otherwise it’s a True
update.
Unlike we have discussed in LL(1) Parsing, Lark excludes
Here we use the following notations for the simplicity of our discussion:
(union and assignment)
Also note that Python supports the following set operators:
s1 < s2
test ifs1 <= s2
test ifs1 == s2
test ifs1 > s2
test ifs1 >= s2
test ifs1 | s2
unions1 & s2
intersections1 - s2
differences1 ^ s2
symmetric difference
Explanation on 📌 key steps:
- Init
- Given a production like
, if is nullable, then is also nullable - Given the same production
:- when
i == 0
, condition always holds sincerule.expansion[:0]
is empty; therefore always perform - when
i == 1
, if is nullable, perform - when
i == 2
, if is nullable, perform - ……
- when
i == k-1
, if is nullable, perform
- when
- Given a production like
:- when
i == 0
, if is nullable, perform - when
i == 1
, if is nullable, perform - when
i == 2
, if is nullable, perform - ……
- when
i == k-1
, always perform
- when
- Given the same production
:- when
i == 0
,1 <= j <= k-1
- when
j == 1
, condition always holds sincerule.expansion[1:1]
is empty; therefore always perform - when
j == 2
, if is nullable, perform - when
j == 3
, if is nullable, perform - ……
- when
j == k-1
, if is nullable, perform
- when
- when
i == 1
,2 <= j <= k-1
- when
j == 2
, condition always holds sincerule.expansion[2:2]
is empty; therefore always perform - when
j == 3
, if is nullable, perform - when
j == 4
, if is nullable, perform - ……
- when
j == k-1
, if is nullable, perform
- when
- ……
- when
i == k-2
,k-1 <= j <= k-1
- when
j == k-1
, condition always holds sincerule.expansion[(k-1):(k-1)]
is empty; therefore always perform
- when
- when
i == k-1
,j
’s range is empty
- when
留下评论