5 分钟阅读

OverviewPermalink

Lark is a parsing toolkit for Python and it has a straightforward implementation of computing FIRST and FOLLOW sets.

Symbol ClassesPermalink

Suppose we have a grammar G=(V,Σ,S,P).

Lark uses the following 3 classes for symbols:

  • Symbol :VΣ
  • Terminal(Symbol) :Σ
  • NonTerminal(Symbol) :V
class Symbol(Serialize):
    name: str
    is_term: ClassVar[bool] = NotImplemented

    def __init__(self, name: str) -> None:
        self.name = name


class Terminal(Symbol):
    is_term: ClassVar[bool] = True

    # filter_out: boolean flag of whether this terminal should be excluded from the parse tree
    #
    # When filter_out=True:
    #   - The terminal is matched during parsing but removed from the final parse tree
    #   - Useful for structural punctuation like parentheses, commas, semicolons, braces
    #   - Keeps parse trees clean by removing syntactic noise while preserving semantic content
    #   - Example: In "func(a, b)", the parentheses and comma would be filtered out
    def __init__(self, name: str, filter_out: bool = False) -> None:
        self.name = name
        self.filter_out = filter_out


class NonTerminal(Symbol):
    is_term: ClassVar[bool] = False

Rule ClassPermalink

Lark uses class Rule to represent productions of P. It’s main attributes are:

class Rule:
    origin: Symbol           # Left-hand side (non-terminal), e.g. A
    expansion: List[Symbol]  # Right-hand side, e.g. [B, C, D]
    alias: Optional[str]     # Optional label for the rule in the parse tree
    order: int               # Priority for disambiguation

Note that alias is for readability in parse trees.

Lark support the following grammar format:

(* 
    Question mark (?) at the beginning indicates that this rule will be inlined 
    if they have a single child, after filtering. 
*)
?expr: term "+" term -> add
     | term

where -> add indicate the alias for this production. Therefore we can construct the following Rule object:

Rule(origin=NonTerminal('expr'), 
     expansion=[NonTerminal('term'), Terminal('+'), NonTerminal('term')], 
     alias='add')

And the application of this rule in a parse tree can be represented as:

    ___'add'___
   /     |     \
'term'  '+'  'term'

where the name of the root of this sub tree is no longer 'expr'

ImplementationPermalink

def update_set(set1, set2):
    if not set2 or set1 > set2:
        return False

    copy = set(set1)
    set1 |= set2
    return set1 != copy

def calculate_sets(rules):
    """Calculate FOLLOW sets.

    Adapted from: http://lara.epfl.ch/w/cc09:algorithm_for_first_and_follow_sets"""
    symbols = {sym for rule in rules for sym in rule.expansion} | {rule.origin for rule in rules}

    # foreach grammar rule X ::= Y(1) ... Y(k)
    # if k=0 or {Y(1),...,Y(k)} subset of NULLABLE then
    #   NULLABLE = NULLABLE union {X}
    # for i = 1 to k
    #   if i=1 or {Y(1),...,Y(i-1)} subset of NULLABLE then
    #     FIRST(X) = FIRST(X) union FIRST(Y(i))
    #   for j = i+1 to k
    #     if i=k or {Y(i+1),...Y(k)} subset of NULLABLE then
    #       FOLLOW(Y(i)) = FOLLOW(Y(i)) union FOLLOW(X)
    #     if i+1=j or {Y(i+1),...,Y(j-1)} subset of NULLABLE then
    #       FOLLOW(Y(i)) = FOLLOW(Y(i)) union FIRST(Y(j))
    # until none of NULLABLE,FIRST,FOLLOW changed in last iteration

    NULLABLE = set()
    FIRST = {}
    FOLLOW = {}
    for sym in symbols:
        FIRST[sym]={sym} if sym.is_term else set()  # 📌 Key Step 1
        FOLLOW[sym]=set()

    # Calculate NULLABLE and FIRST
    changed = True
    while changed:
        changed = False

        for rule in rules:
            if set(rule.expansion) <= NULLABLE:  # 📌 Key Step 2
                if update_set(NULLABLE, {rule.origin}):
                    changed = True

            for i, sym in enumerate(rule.expansion):  # 📌 Key Step 3
                if set(rule.expansion[:i]) <= NULLABLE:
                    if update_set(FIRST[rule.origin], FIRST[sym]):
                        changed = True
                else:
                    break

    # Calculate FOLLOW
    changed = True
    while changed:
        changed = False

        for rule in rules:
            for i, sym in enumerate(rule.expansion):
                if i==len(rule.expansion)-1 or set(rule.expansion[i+1:]) <= NULLABLE:  # 📌 Key Step 4
                    if update_set(FOLLOW[sym], FOLLOW[rule.origin]):
                        changed = True

                for j in range(i+1, len(rule.expansion)):  # 📌 Key Step 5
                    if set(rule.expansion[i+1:j]) <= NULLABLE:
                        if update_set(FOLLOW[sym], FIRST[rule.expansion[j]]):
                            changed = True

    return FIRST, FOLLOW, NULLABLE

update_set(set1, set2) function updates set1 with set2. If set1 does not increase after the update, it’s a False update; otherwise it’s a True update.

Unlike we have discussed in LL(1) Parsing, Lark excludes ε from FIRST sets. Instead, Lark makes a new set NULLABLE for nullable variables.

Here we use the following notations for the simplicity of our discussion:

  • ΦFIRST
  • ΛFOLLOW
  • NNULLABLE
  • AB(AAB) (union and assignment)

Also note that Python supports the following set operators:

  • s1 < s2 test if s1s2
  • s1 <= s2 test if s1s2
  • s1 == s2 test if s1=s2
  • s1 > s2 test if s1s2
  • s1 >= s2 test if s1s2
  • s1 | s2 union s1s2
  • s1 & s2 intersection s1s2
  • s1 - s2 difference s1s2
  • s1 ^ s2 symmetric difference s1s2

Explanation on 📌 key steps:

  1. Init Φ[a]={a},aΣ
  2. Given a production like XY1Y2Yk, if Yi is nullable, then X is also nullable
  3. Given the same production XY1Y2Yk:
    • when i == 0, condition always holds since rule.expansion[:0] is empty; therefore always perform Φ[X]Φ[Y1]
    • when i == 1, if Y1 is nullable, perform Φ[X]Φ[Y2]
    • when i == 2, if Y1Y2 is nullable, perform Φ[X]Φ[Y3]
    • ……
    • when i == k-1, if Y1Y2Yk1 is nullable, perform Φ[X]Φ[Yk]
  4. Given a production like YX1X2Xk:
    • when i == 0, if X2Xk is nullable, perform Λ[X1]Λ[Y]
    • when i == 1, if X3Xk is nullable, perform Λ[X2]Λ[Y]
    • when i == 2, if X4Xk is nullable, perform Λ[X3]Λ[Y]
    • ……
    • when i == k-1, always perform Λ[Xk]Λ[Y]
  5. Given the same production YX1X2Xk:
    • when i == 0, 1 <= j <= k-1
      • when j == 1, condition always holds since rule.expansion[1:1] is empty; therefore always perform Λ[X1]Φ[X2]
      • when j == 2, if X2 is nullable, perform Λ[X1]Φ[X3]
      • when j == 3, if X2X3 is nullable, perform Λ[X1]Φ[X4]
      • ……
      • when j == k-1, if X2Xk1 is nullable, perform Λ[X1]Φ[Xk]
    • when i == 1, 2 <= j <= k-1
      • when j == 2, condition always holds since rule.expansion[2:2] is empty; therefore always perform Λ[X2]Φ[X3]
      • when j == 3, if X3 is nullable, perform Λ[X2]Φ[X4]
      • when j == 4, if X3X4 is nullable, perform Λ[X2]Φ[X5]
      • ……
      • when j == k-1, if X3Xk1 is nullable, perform Λ[X2]Φ[Xk]
    • ……
    • when i == k-2, k-1 <= j <= k-1
      • when j == k-1, condition always holds since rule.expansion[(k-1):(k-1)] is empty; therefore always perform Λ[Xk1]Φ[Xk]
    • when i == k-1, j’s range is empty

分类:

更新时间:

留下评论