Recursive Descent Parser | Barış Özmen

First heard in syllabus of Dabeaz (David Beazley) Write a Compiler Course Notebook Also see ((How to Write a (Lisp) Interpreter (in Python)) (by Peter Norvig)(archived))

A recursive descent parser is a top-down parsing technique where each non-terminal in a grammar corresponds to a function in the parser. The parser starts at the top-level rule (usually the start symbol) and recursively calls functions for each sub-rule, consuming tokens from the input when they match expected patterns. - chatgpt

Uses a set of mutually recursive functions to process the input.

Limitations: - left recursion - Direct left recursion (A → Aα | β) causes infinite recursion and must be eliminated via transformation.
- Indirect left recursion (A → Bα, B → Aβ) also needs resolution.

Dabeaz implementation from Python Cookbook

David Beazley (Dabeaz) code in Python Cookbook:

You should start by having a formal specification of the grammar in the form of BNF (Backus-Naur Form (BNF)) or EBNF (Extended Backus-Naur Form (EBNF)). The grammar of algebraic expressions: expr ::= expr + term | expr - term | term term ::= term * factor | term / factor | factor factor ::= ( expr ) | NUM

import re
import collections

# Token specification
NUM = r'(?P<NUM>\d+)'
PLUS = r'(?P<PLUS>\+)'
MINUS = r'(?P<MINUS>-)'
TIMES = r'(?P<TIMES>\*)'
DIVIDE = r'(?P<DIVIDE>/)'
LPAREN = r'(?P<LPAREN>\()'
RPAREN = r'(?P<RPAREN>\))'
WS = r'(?P<WS>\s+)'

master_pat = re.compile('|'.join([NUM, PLUS, MINUS, TIMES,
DIVIDE, LPAREN, RPAREN, WS]))

# Tokenizer
Token = collections.namedtuple('Token', ['type','value'])

def generate_tokens(text):
    scanner = master_pat.scanner(text)
    for m in iter(scanner.match, None):
        tok = Token(m.lastgroup, m.group())
        if tok.type != 'WS':
            yield tok

# Parser
class ExpressionEvaluator:
    '''
    Implementation of a recursive descent parser. Each method
    implements a single grammar rule. Use the ._accept() method
    to test and accept the current lookahead token. Use the ._expect()
    method to exactly match and discard the next token on on the input
    (or raise a SyntaxError if it doesn't match).
    '''
    def parse(self,text):
        self.tokens = generate_tokens(text)
        self.tok = None   # Last symbol consumed
        self.nexttok = None  # Next symbol tokenized
        self._advance() # Load first lookahead token
        return self.expr()

    def _advance(self):
        'Advance one token ahead'
        self.tok, self.nexttok = self.nexttok, next(self.tokens, None)

    def _accept(self,toktype):
        'Test and consume the next token if it matches toktype'
        if self.nexttok and self.nexttok.type == toktype:
            self._advance()
            return True
        else:
            return False

    def _expect(self,toktype):
        'Consume next token if it matches toktype or raise SyntaxError'
        if not self._accept(toktype):
            raise SyntaxError('Expected ' + toktype)
        # Grammar rules follow

    def expr(self):
        "expression ::= term { ('+'|'-') term }*"
        exprval = self.term()
        while self._accept('PLUS') or self._accept('MINUS'):
            op = self.tok.type
            right = self.term()
            if op == 'PLUS':
                exprval += right
            elif op == 'MINUS':
                exprval -= right
        return exprval

    def term(self):
        "term ::= factor { ('*'|'/') factor }*"
        termval = self.factor()
        while self._accept('TIMES') or self._accept('DIVIDE'):
            op = self.tok.type
            right = self.factor()
            if op == 'TIMES':
                termval *= right
            elif op == 'DIVIDE':
                termval /= right
        return termval

    def factor(self):
        "factor ::= NUM | ( expr )"
        if self._accept('NUM'):
            return int(self.tok.value)
        elif self._accept('LPAREN'):
            exprval = self.expr()
            self._expect('RPAREN')
            return exprval
        else:
            raise SyntaxError('Expected NUMBER or LPAREN')

How it works on terminal: ```bash

e = ExpressionTreeBuilder() e.parse('2 + 3') ('+', 2, 3) e.parse('2 + 3 * 4') ('+', 2, ('', 3, 4)) e.parse('2 + (3 + 4) * 5') ('+', 2, ('', ('+', 3, 4), 5)) e.parse('2 + 3 + 4') ('+', ('+', 2, 3), 4)

Nevertheless, the overall idea of writing a recursive descent parser is generally simple. To start, you take every grammar rule and you turn it into a function or method. Thus, if your grammar looks like this:

expr ::= term { ('+'|'-') term }*
term ::= factor { ('*'|'/') factor }*
factor ::= '(' expr ')'
    | NUM

You start by turning it into a set of methods like this: python class ExpressionEvaluator: ... def expr(self): ... def term(self): ... def factor(self): ...

Key Characteristics

Top-Down Parsing
- It begins at the start symbol and expands productions recursively.
- Works best for LL(k) grammars (left-to-right parsing, leftmost derivation, k-token lookahead).
Backtracking (if naive)
- If a rule fails, the parser may revert to an earlier state and try another rule.
- Simple implementations suffer from exponential backtracking, making them impractical for ambiguous grammars.
Handling of Left Recursion
- Direct left recursion (A → Aα | β) causes infinite recursion and must be eliminated via transformation.
- Indirect left recursion (A → Bα, B → Aβ) also needs resolution.
Lookahead (LL(k))
- Typically uses one-token lookahead (LL(1)) but can be extended to LL(k) for more complex grammars.
- With predictive parsing, backtracking is avoided if the lookahead uniquely determines the next rule.
Mutual Recursion Between Functions
- Each grammar rule is implemented as a recursive function.
- Sub-rules are handled by calling appropriate functions.
Efficiency Considerations
- Left factoring is needed if multiple rules share common prefixes.
- Packrat parsing (memoization) can be applied to eliminate redundant computations.