Lark’s implementation of computing FIRST and FOLLOW sets

July 5, 2025 5 分钟阅读

OverviewPermalink

Lark is a parsing toolkit for Python and it has a straightforward implementation of computing $FIRST$ and $FOLLOW$ sets.

`Symbol` ClassesPermalink

Suppose we have a grammar $G = (V, Σ, S, P)$ .

Lark uses the following 3 classes for symbols:

Symbol $: V \cup Σ$
Terminal(Symbol) $: Σ$
NonTerminal(Symbol) $: V$

class Symbol(Serialize):
    name: str
    is_term: ClassVar[bool] = NotImplemented

    def __init__(self, name: str) -> None:
        self.name = name


class Terminal(Symbol):
    is_term: ClassVar[bool] = True

    # filter_out: boolean flag of whether this terminal should be excluded from the parse tree
    #
    # When filter_out=True:
    #   - The terminal is matched during parsing but removed from the final parse tree
    #   - Useful for structural punctuation like parentheses, commas, semicolons, braces
    #   - Keeps parse trees clean by removing syntactic noise while preserving semantic content
    #   - Example: In "func(a, b)", the parentheses and comma would be filtered out
    def __init__(self, name: str, filter_out: bool = False) -> None:
        self.name = name
        self.filter_out = filter_out


class NonTerminal(Symbol):
    is_term: ClassVar[bool] = False

`Rule` ClassPermalink

Lark uses class Rule to represent productions of $P$ . It’s main attributes are:

class Rule:
    origin: Symbol           # Left-hand side (non-terminal), e.g. A
    expansion: List[Symbol]  # Right-hand side, e.g. [B, C, D]
    alias: Optional[str]     # Optional label for the rule in the parse tree
    order: int               # Priority for disambiguation

Note that alias is for readability in parse trees.

Lark support the following grammar format:

(* 
    Question mark (?) at the beginning indicates that this rule will be inlined 
    if they have a single child, after filtering. 
*)
?expr: term "+" term -> add
     | term

where -> add indicate the alias for this production. Therefore we can construct the following Rule object:

Rule(origin=NonTerminal('expr'), 
     expansion=[NonTerminal('term'), Terminal('+'), NonTerminal('term')], 
     alias='add')

And the application of this rule in a parse tree can be represented as:

    ___'add'___
   /     |     \
'term'  '+'  'term'

where the name of the root of this sub tree is no longer 'expr'

ImplementationPermalink

def update_set(set1, set2):
    if not set2 or set1 > set2:
        return False

    copy = set(set1)
    set1 |= set2
    return set1 != copy

def calculate_sets(rules):
    """Calculate FOLLOW sets.

    Adapted from: http://lara.epfl.ch/w/cc09:algorithm_for_first_and_follow_sets"""
    symbols = {sym for rule in rules for sym in rule.expansion} | {rule.origin for rule in rules}

    # foreach grammar rule X ::= Y(1) ... Y(k)
    # if k=0 or {Y(1),...,Y(k)} subset of NULLABLE then
    #   NULLABLE = NULLABLE union {X}
    # for i = 1 to k
    #   if i=1 or {Y(1),...,Y(i-1)} subset of NULLABLE then
    #     FIRST(X) = FIRST(X) union FIRST(Y(i))
    #   for j = i+1 to k
    #     if i=k or {Y(i+1),...Y(k)} subset of NULLABLE then
    #       FOLLOW(Y(i)) = FOLLOW(Y(i)) union FOLLOW(X)
    #     if i+1=j or {Y(i+1),...,Y(j-1)} subset of NULLABLE then
    #       FOLLOW(Y(i)) = FOLLOW(Y(i)) union FIRST(Y(j))
    # until none of NULLABLE,FIRST,FOLLOW changed in last iteration

    NULLABLE = set()
    FIRST = {}
    FOLLOW = {}
    for sym in symbols:
        FIRST[sym]={sym} if sym.is_term else set()  # 📌 Key Step 1
        FOLLOW[sym]=set()

    # Calculate NULLABLE and FIRST
    changed = True
    while changed:
        changed = False

        for rule in rules:
            if set(rule.expansion) <= NULLABLE:  # 📌 Key Step 2
                if update_set(NULLABLE, {rule.origin}):
                    changed = True

            for i, sym in enumerate(rule.expansion):  # 📌 Key Step 3
                if set(rule.expansion[:i]) <= NULLABLE:
                    if update_set(FIRST[rule.origin], FIRST[sym]):
                        changed = True
                else:
                    break

    # Calculate FOLLOW
    changed = True
    while changed:
        changed = False

        for rule in rules:
            for i, sym in enumerate(rule.expansion):
                if i==len(rule.expansion)-1 or set(rule.expansion[i+1:]) <= NULLABLE:  # 📌 Key Step 4
                    if update_set(FOLLOW[sym], FOLLOW[rule.origin]):
                        changed = True

                for j in range(i+1, len(rule.expansion)):  # 📌 Key Step 5
                    if set(rule.expansion[i+1:j]) <= NULLABLE:
                        if update_set(FOLLOW[sym], FIRST[rule.expansion[j]]):
                            changed = True

    return FIRST, FOLLOW, NULLABLE

update_set(set1, set2) function updates set1 with set2. If set1 does not increase after the update, it’s a False update; otherwise it’s a True update.

Unlike we have discussed in LL(1) Parsing, Lark excludes $ε$ from $FIRST$ sets. Instead, Lark makes a new set $NULLABLE$ for nullable variables.

Here we use the following notations for the simplicity of our discussion:

$Φ \equiv FIRST$
$Λ \equiv FOLLOW$
$N \equiv NULLABLE$
$A ⊎ B \equiv (A \leftarrow A \cup B)$ (union and assignment)

Also note that Python supports the following set operators:

s1 < s2 $\Rightarrow$ test if $s_{1} \subset s_{2}$
s1 <= s2 $\Rightarrow$ test if $s_{1} \subseteq s_{2}$
s1 == s2 $\Rightarrow$ test if $s_{1} = s_{2}$
s1 > s2 $\Rightarrow$ test if $s_{1} \supset s_{2}$
s1 >= s2 $\Rightarrow$ test if $s_{1} \supseteq s_{2}$
s1 | s2 $\Rightarrow$ union $s_{1} \cup s_{2}$
s1 & s2 $\Rightarrow$ intersection $s_{1} \cap s_{2}$
s1 - s2 $\Rightarrow$ difference $s_{1} ∖ s_{2}$
s1 ^ s2 $\Rightarrow$ symmetric difference $s_{1} ⊖ s_{2}$

Explanation on 📌 key steps:

Init $Φ [a] = {a}, \forall a \in Σ$
Given a production like $X \to Y_{1} Y_{2} \dots Y_{k}$ , if $\forall Y_{i}$ is nullable, then $X$ is also nullable
Given the same production $X \to Y_{1} Y_{2} \dots Y_{k}$ :
- when i == 0, condition always holds since rule.expansion[:0] is empty; therefore always perform $Φ [X] ⊎ Φ [Y_{1}]$
- when i == 1, if $Y_{1}$ is nullable, perform $Φ [X] ⊎ Φ [Y_{2}]$
- when i == 2, if $Y_{1} Y_{2}$ is nullable, perform $Φ [X] ⊎ Φ [Y_{3}]$
- ……
- when i == k-1, if $Y_{1} Y_{2} \dots Y_{k - 1}$ is nullable, perform $Φ [X] ⊎ Φ [Y_{k}]$
Given a production like $Y \to X_{1} X_{2} \dots X_{k}$ :
- when i == 0, if $X_{2} \dots X_{k}$ is nullable, perform $Λ [X_{1}] ⊎ Λ [Y]$
- when i == 1, if $X_{3} \dots X_{k}$ is nullable, perform $Λ [X_{2}] ⊎ Λ [Y]$
- when i == 2, if $X_{4} \dots X_{k}$ is nullable, perform $Λ [X_{3}] ⊎ Λ [Y]$
- ……
- when i == k-1, always perform $Λ [X_{k}] ⊎ Λ [Y]$
Given the same production $Y \to X_{1} X_{2} \dots X_{k}$ :
- when i == 0, 1 <= j <= k-1
  - when j == 1, condition always holds since rule.expansion[1:1] is empty; therefore always perform $Λ [X_{1}] ⊎ Φ [X_{2}]$
  - when j == 2, if $X_{2}$ is nullable, perform $Λ [X_{1}] ⊎ Φ [X_{3}]$
  - when j == 3, if $X_{2} X_{3}$ is nullable, perform $Λ [X_{1}] ⊎ Φ [X_{4}]$
  - ……
  - when j == k-1, if $X_{2} \dots X_{k - 1}$ is nullable, perform $Λ [X_{1}] ⊎ Φ [X_{k}]$
- when i == 1, 2 <= j <= k-1
  - when j == 2, condition always holds since rule.expansion[2:2] is empty; therefore always perform $Λ [X_{2}] ⊎ Φ [X_{3}]$
  - when j == 3, if $X_{3}$ is nullable, perform $Λ [X_{2}] ⊎ Φ [X_{4}]$
  - when j == 4, if $X_{3} X_{4}$ is nullable, perform $Λ [X_{2}] ⊎ Φ [X_{5}]$
  - ……
  - when j == k-1, if $X_{3} \dots X_{k - 1}$ is nullable, perform $Λ [X_{2}] ⊎ Φ [X_{k}]$
- ……
- when i == k-2, k-1 <= j <= k-1
  - when j == k-1, condition always holds since rule.expansion[(k-1):(k-1)] is empty; therefore always perform $Λ [X_{k - 1}] ⊎ Φ [X_{k}]$
- when i == k-1, j’s range is empty

X Facebook LinkedIn Bluesky

Lark’s implementation of computing FIRST and FOLLOW sets

OverviewPermalink

`Symbol` ClassesPermalink

`Rule` ClassPermalink

ImplementationPermalink

分享

留下评论

猜您还喜欢

LR Parsing #5: Intuition Revisited

LR Parsing #4: Runtime Encoding of LR(0)/SLR(1) Parsing DFA (How to Construct the Parsing Tables)

LR Parsing #3: Simulation of the Parsing DFA (Configuration / Shift-Reduce / Structure of Parsing Table)

LR Parsing #2: Structural Encoding of LR(0) Parsing DFA

OverviewPermalink

Symbol ClassesPermalink

Rule ClassPermalink

ImplementationPermalink

分享

留下评论

猜您还喜欢

LR Parsing #5: Intuition Revisited

LR Parsing #4: Runtime Encoding of LR(0)/SLR(1) Parsing DFA (How to Construct the Parsing Tables)

LR Parsing #3: Simulation of the Parsing DFA (Configuration / Shift-Reduce / Structure of Parsing Table)

LR Parsing #2: Structural Encoding of LR(0) Parsing DFA

`Symbol` ClassesPermalink

`Rule` ClassPermalink