
Tutorial @ PyCon US 2025. May 15, 2025. Pittsburgh, PA, USA.

The source code repository for this tutorial is available on GitHub at:

1. Introduction

In this tutorial, you will learn the fundamental principles of creating a language processor by hand, focusing specifically on recursive descent parsers. A recursive descent parser is a program used for analyzing the structure of some input language through a series of functions that represent its grammatical rules.

Knowing how to write a parser from scratch is incredibly useful for anyone interested in understanding how programming languages work under the hood. It also provides valuable insights into compiler design, interpreters, and any form of automatic text analysis. By mastering this skill, you can create your own mini-languages, customize existing ones, or simply satisfy your curiosity about the inner workings of language processors.

Throughout this tutorial, you will:

  • Gain a clear understanding of what a recursive descent parser is and how it operates.

  • Learn, step by step, how to build your own language processor in Python from the ground up.

  • Improve your problem-solving skills by designing and implementing parsing algorithms.

  • Explore practical examples and hands-on exercises to solidify your understanding.

By the end of the tutorial, you will have the knowledge and confidence to craft your own language processor and appreciate the elegance and power of recursive descent parsing.

2. Core Concepts

2.1. Tokenizer

A tokenizer is a process that takes raw input (like text or code) and breaks it down into a stream of basic, meaningful units called tokens (e.g., keywords, identifiers, symbols). These tokens then serve as the input for a parser.
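For illustration, here is a minimal sketch of the idea (not the tokenizer we will build later, which lives inside the parser class): splitting raw text on whitespace already yields a usable stream of word tokens.

def tokenize(text):
    # Break the raw input into a list of whitespace-separated tokens.
    return text.split()

print(tokenize('the man hit the ball'))  # ['the', 'man', 'hit', 'the', 'ball']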

2.2. Parser

A parser is a computer program that takes a stream of tokens and analyzes its arrangement based on a defined set of rules known as a grammar. It checks if the input follows these rules and, if it does, often converts it into a more organized form, such as a tree, that illustrates the input’s structure and meaning. Essentially, a parser is a program that understands the rules of a language or data format and can analyze input to determine its structure and meaning.

2.3. Grammar

A grammar is a set of rules that defines how to combine basic symbols to create valid structures in a language. This could be how words form sentences in a natural language, or how keywords and symbols form code in a programming language.

2.4. Terminal Symbol

Terminal symbols are the fundamental, indivisible units of the language. These are what we previously referred to as tokens. They are the actual things you see or use, like individual words in a natural language (“the”, “cat”, “runs”) or specific characters and keywords in a programming language (if, while, x, 10). You can’t break them down further using the grammar’s rules.

2.5. Nonterminal Symbol

Nonterminal symbols are labels or names that represent larger structural concepts or categories within the grammar. In a natural language, these might be “sentence”, “noun-phrase”, or “verb-phrase”. In programming, they could be “statement”, “expression”, or “function-call”. Nonterminals are defined by the grammar’s rules in terms of other terminals and nonterminals.

2.6. Production Rule

A production rule is a single rule within a grammar. It specifies how a nonterminal symbol can be formed by other terminal or nonterminal symbols. For example, a rule in a simple natural language grammar might be “a noun-phrase can be an article followed by a noun”.

Unless explicitly specified otherwise, the initial production rule presented in a grammar functions as its entry point.

2.7. EBNF

EBNF stands for Extended Backus-Naur form. It’s a language used to define languages. More specifically, it’s a notation that allows a specific way of writing down grammar rules, providing a clear and organized method to describe how terminal and nonterminal symbols can be combined. It uses special symbols to indicate things like repetition or optional parts, making the grammar rules of a formal language easier to read and understand.

The production “a noun-phrase can be an article followed by a noun” could be expressed in EBNF as:

noun-phrase = article , noun ;

Similarly, the production “a noun can be one of the words man, ball, woman, or table” could be expressed as:

noun = "man" | "ball" | "woman" | "table" ;

In these examples, the nonterminals are noun-phrase, article, and noun, and the terminals are "man", "ball", "woman", and "table".

The specifics of EBNF will be discussed in the section on EBNF Syntax.

2.8. Exercise A ★

Consider the following EBNF grammar:

sentence    = noun-phrase , verb-phrase , ? end of input ? ;
noun-phrase = article , noun ;
verb-phrase = verb , noun-phrase ;
article     = "the" | "a" ;
noun        = "man" | "ball" | "woman" | "table" ;
verb        = "hit" | "took" | "saw" | "liked" ;

Answer the following questions:

  • What are all the terminal symbols?

  • What are all the nonterminal symbols?

  • How many productions are there in this grammar?

3. EBNF Syntax

3.1. Syntax Summary

The precise syntax for EBNF is formally specified by the international standard ISO/IEC 14977:1996. This tutorial utilizes a subset of this standard, summarized in the following table.

Table 1. EBNF Syntax Summary (each entry gives the EBNF element, its syntax, a description, and notes for mapping it to Python code)

Terminal
  Syntax: "a"
  Description: An indivisible unit of the language, a.k.a. token.
  Python mapping: Call the expect() method to check if the current input token matches the terminal and, upon a match, consume it.

Nonterminal
  Syntax: a
  Description: A name representing a category defined by a production rule.
  Python mapping: Call the parsing function that is specifically created to handle the production rule of the nonterminal.

Definition
  Syntax: =
  Description: Separates the left-hand side (the nonterminal symbol being defined) from the right-hand side (the definition of that nonterminal in terms of terminals and other nonterminals) of a production rule.
  Python mapping: Define a parsing function whose body contains the logic to parse the sequence of terminals and nonterminals according to the rule’s structure.

Termination
  Syntax: ;
  Description: Ends a production rule.
  Python mapping: N/A

Concatenation
  Syntax: a , b
  Description: Element a followed by element b.
  Python mapping: Sequentially call the expect() method for each terminal or the corresponding parsing function for each nonterminal, in the order they appear in the production rule.

Alternative
  Syntax: a | b
  Description: Either element a or element b.
  Python mapping: Provide conditional logic (using if/elif/else or match/case statements) to check the current input token and then execute the parsing logic corresponding to the chosen alternative.

Grouping
  Syntax: ( a )
  Description: Groups elements to override precedence.
  Python mapping: Ensure the code’s logic is executed according to the desired order.

Optional
  Syntax: [ a ]
  Description: Zero or one occurrence of a.
  Python mapping: Use an if statement that checks whether the current input token matches the beginning of the optional element; if it does, execute the parsing logic for that element, otherwise skip it and continue parsing.

Repetition
  Syntax: { a }
  Description: Zero or more repetitions of a.
  Python mapping: Use a while loop that keeps executing the parsing logic for the repeated element as long as the current input token matches the beginning of that element. The loop terminates when the element is no longer found.

Special Sequence
  Syntax: ? a ?
  Description: User-defined construct with an implementation-specific meaning. Examples: ? integer ?, ? variable ?, ? end of input ?.
  Python mapping: Handle these special tokens using the expect() method.

Comment
  Syntax: (* ... *)
  Description: Allows including human-readable remarks or notes in the grammar.
  Python mapping: N/A

3.2. Operator Precedence

The EBNF operator precedence rules are as follows (strongest to weakest):

  1. Optional ([ … ]), Repetition ({ … })

  2. Concatenation (,)

  3. Alternative (|)

As mentioned before, you can use parentheses to override these precedence rules.
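As a quick illustration (using a made-up phrase rule, not part of any grammar in this tutorial), consider these two definitions:

phrase = "good" , "morning" | "night" ;
phrase = "good" , ( "morning" | "night" ) ;

Because concatenation binds tighter than alternative, the first form is read as ( "good" , "morning" ) | "night": it matches either the two words "good morning" or the single word "night". The second form uses parentheses to accept both "good morning" and "good night".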

4. Steps to Build a Recursive Descent Parser

Recursive descent parsing is a top-down technique where each nonterminal in the grammar corresponds to a function designed to match the input according to that nonterminal’s production rules. The “recursive” aspect arises from these functions calling each other (or themselves) to traverse the grammar’s structure during the matching process.

To begin building our first recursive descent parser, we will follow the steps outlined in this section. We’ll start by creating an empty Python file named english_subset.py, which will serve as the container for all the necessary Python code for our parser implementation.

4.1. Defining the Initial Grammar

The most important step is to have a well-defined grammar for the language the parser will recognize.

Before we can effectively implement a recursive descent parser, it’s crucial to prepare the grammar by addressing two potential issues:

  • Eliminate left recursion: Left-recursive rules, where a nonterminal directly or indirectly calls itself at the beginning of a production, can lead to infinite loops in top-down parsers. These rules must be rewritten to allow the parser to make progress.

    For example, consider the following left-recursive definition of a list of nouns:

    list-of-nouns = list-of-nouns , "," , noun | noun ;
    noun          = "cat" | "dog" | "bird" ;

    To eliminate this left recursion, we can rewrite the rule as:

    list-of-nouns = noun , { "," , noun } ;
    noun          = "cat" | "dog" | "bird" ;

    As demonstrated here, the repetition construct ({ … }) is a common technique for transforming left-recursive rules into a structure that avoids infinite loops (a Python sketch of this mapping appears right after this list).

  • Perform left factoring: When multiple grammar rules share a common initial sequence (a common prefix), factoring out this prefix can simplify parsing by delaying the choice between alternatives until after the shared part is recognized.

    Here’s an example of a grammar rule with a common prefix:

    sentence = "the" , "cat" , verb | "the" , "dog" , verb ;
    verb     = "sleeps" | "eats" ;

    By applying left factoring, this can be transformed into:

    sentence = "the" , ("cat" | "dog") , verb ;
    verb     = "sleeps" | "eats" ;

For all the grammars used in this tutorial, explicit left recursion elimination or left factoring will not be necessary.

For our main example, we will be using the grammar from Exercise A, which is reproduced below for your convenience.

sentence    = noun-phrase , verb-phrase , ? end of input ? ;
noun-phrase = article , noun ;
verb-phrase = verb , noun-phrase ;
article     = "the" | "a" ;
noun        = "man" | "ball" | "woman" | "table" ;
verb        = "hit" | "took" | "saw" | "liked" ;

For clarity and reference, it’s helpful to place a copy of this grammar as comments in the source file.

4.2. Create the ParserError class

This class will be used to signal errors during the parsing process, indicating that the input does not conform to the expected grammar or contains some other logical inconsistency.

class ParserError(Exception): (1)

    def __init__(self, message): (2)
        super().__init__(f'PARSER ERROR DETECTED:\n{message}') (3)
1 Defines a custom exception class named ParserError that inherits from the base Exception class.
2 Defines the instance initializer method for the ParserError class, taking an error message as input.
3 Calls the initializer of the parent Exception class with a formatted error message.

4.3. Create the Parser class

This class will contain the methods for parsing each nonterminal in the grammar and will manage the parsing state (like the current position in the input).

class Parser:

    EOI = '? end of input ?' (1)
1 Defines a class constant EOI representing the end of input marker.

4.4. Implement basic tokenization functionality

An instance of the Parser class is in charge of tokenizing the input source into words, which it then uses to step through and verify the input against expected patterns, signaling errors when a mismatch occurs.

The tokenization process relies on the coordinated execution of the following three methods:

  1. __init__: Initializes the parser with the input source and gets the first token.

        # NOTE: This is part of the Parser class.
    
        def __init__(self, source):
            self._tokens = iter(source.split() + [self.EOI]) (1)
            self.advance() (2)
    1 Splits the source into a list of words, appends the EOI marker at the end, and creates an iterator[1].
    2 Initializes the parser by advancing to the first token.
  2. advance: Advances the parser to the next token from the input.

        # NOTE: This is part of the Parser class.
    
        def advance(self):
            try: (1)
                self._current = next(self._tokens) (2)
            except StopIteration: (3)
                self._current = None (4)
    1 Begins a try block to handle potential end of input.
    2 Retrieves the next token from the iterator and stores it in _current.
    3 Catches the exception raised when the iterator is exhausted.
    4 Sets _current to None to indicate the end of input.
  3. expect: Checks if the current token matches an expected value and advances.

        # NOTE: This is part of the Parser class.
    
        def expect(self, expected_tokens): (1)
            if self._current not in expected_tokens: (2)
                token_found_str = f'"{self._current}"' (3)
                expected_tokens_str = \
                    ", ".join([f'"{token}"' for token in expected_tokens]) (4)
                message = (f'Found the token: {token_found_str}\n' (5)
                           f'But was expecting one of: {expected_tokens_str}')
                raise ParserError(message) (6)
            original_current = self._current (7)
            self.advance() (8)
            return original_current (9)
    1 Takes a parameter which is a list containing the expected tokens.
    2 Checks if the current token is not one of the expected tokens.
    3 Creates a string representation of the found token, enclosed in double quotes.
    4 Creates a single string containing the expected tokens separated by commas and enclosed in double quotes.
    5 Constructs a multiline error message.
    6 Raises a ParserError exception with the constructed error message.
    7 Stores the current token before advancing.
    8 Moves to the next token in the input stream.
    9 Returns the matched (and consumed) token. While not used in the current example, this value can be useful in more advanced parsing tasks.
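As a quick sanity check of these three methods working together, here is a throwaway snippet (assuming the Parser class as defined so far):

p = Parser('the man hit the ball')
print(p.expect(['the', 'a']))  # prints: the  (the token is consumed and the parser advances)
print(p.expect(['man']))       # prints: man
# p.expect(['table'])          # would raise ParserError, since the current token is now "hit"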

4.5. Create and implement parsing functions for each nonterminal

For each nonterminal in the grammar (sentence, noun-phrase, verb-phrase, article, noun, and verb), create a corresponding instance method in the Parser class, using the same name with hyphens replaced by underscores (for example, noun_phrase for noun-phrase). Implement the logic to match the sequence of tokens according to the production rules of the corresponding nonterminal. This involves calling other parsing functions for nonterminals in the rule, or using the expect() method for terminal values.

Example for the nonterminal sentence:

    # NOTE: This is part of the Parser class.

    # sentence = noun-phrase , verb-phrase , ? end of input ? ;
    def sentence(self):
        self.noun_phrase()
        self.verb_phrase()
        self.expect([self.EOI])
Always call the expect() method with a list as argument, even if it contains only one token, such as the end of input marker (EOI) here.

Example for the nonterminal article:

    # NOTE: This is part of the Parser class.

    # article = "the" | "a" ;
    def article(self):
        self.expect(["the", "a"])

4.6. Exercise B ★

Finish implementing the remaining parsing functions: noun-phrase, verb-phrase, noun, and verb.

4.7. Execute the Parser

Instantiate the Parser class with the input string and initiate parsing by calling the grammar’s root nonterminal function (sentence() in our example). Use a try/except block to catch ParserError exceptions, which are raised by the expect() method when the input doesn’t conform to the grammar.

All this can be done by adding the following code at the end of the english_subset.py file:

if __name__ == '__main__':
    parser_example = Parser('the man hit the ball')
    try:
        parser_example.sentence()
        print('Syntax OK!')
    except ParserError as e:
        print(e)
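Running the file (for example, with python english_subset.py) should print Syntax OK!, since the sample sentence conforms to the grammar; an input that violates the grammar would instead print the message carried by the ParserError exception.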

4.8. Exercise C ★

  • Verify the parser’s correct operation by testing it with another valid string.

  • To understand its response to errors, test the parser with an invalid input.

4.9. Exercise D ★★

The following grammar is an extension of the grammar from Exercise A:

sentence     = noun-phrase , verb-phrase , ? end of input ? ;
noun-phrase  = article , { adjective } , noun ;                (* UPDATED *)
verb-phrase  = [ adverb ] , verb , [ noun-phrase ] ;           (* UPDATED *)
article      = "the" | "a" ;
noun         = "man" | "ball" | "woman" | "table" ;
verb         = "hit" | "took" | "saw" | "liked" ;
adjective    = "digital" | "virtual" | "cyber" | "pixelated" ; (* NEW *)
adverb       = "algorithmically" | "securely" | "wirelessly" | "recursively" ; (* NEW *)

One of the main differences from the original grammar is that this new version incorporates the optional ([ … ]) and repetition ({ … }) operators. Modify the parser built so far in order to incorporate all these new extensions. Check the notes on the EBNF Syntax Summary table to see how to map these operators into Python code.
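As a hint for the optional operator, here is a sketch using a made-up salutation rule (deliberately not one from the exercise grammar), again written as a hypothetical method of a parser class like ours:

# salutation = [ "good" ] , "morning" ;   (hypothetical rule, for illustration only)
def salutation(self):
    if self._current == 'good':  # the optional part is present
        self.advance()           # consume it
    self.expect(['morning'])     # the mandatory part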

Ensure the parser correctly recognizes the following valid strings:

  • a woman saw

  • the virtual digital man algorithmically hit a pixelated ball

However, the parser should reject these invalid strings:

  • the cyber table wirelessly

  • every woman recursively saw a digital ball

5. Writing an Expression Evaluator

Our goal in this section is to implement a simple numerical expression evaluator capable of evaluating expressions like:

(14 + 8) + x * 7

Assuming that variable x equals 11, this expression evaluates to 99.

5.1. Operator Precedence

When evaluating mathematical and programming expressions, the concept of operator precedence is fundamental. Precedence defines the priority of different operators, determining which operations are performed before others. For instance, multiplication generally has higher precedence than addition. Without clear rules of precedence (or the use of parentheses to explicitly define order), the evaluation of expressions with multiple operators would be ambiguous and could lead to incorrect results. Understanding and correctly implementing operator precedence is therefore essential for building a functional expression evaluator.

5.2. The Initial Grammar

We will use the following grammar to define the expressions our evaluator can handle:

Expression Grammar, Version 1
start = expression , ? end of input ? ;
expression = additive ;
additive = multiplicative , { "+" , multiplicative } ;
multiplicative = primary , { "*" , primary } ;
primary = ? integer ? | ? variable ? | "(" , expression , ")" ;

The terminal symbols ? integer ? and ? variable ? in this grammar are defined as follows:

  • ? integer ?: Represents a sequence of one or more digits (0 - 9). Examples include 0, 42, and 1729.

  • ? variable ?: Represents a sequence of one or more letters from the English alphabet (a - z, A - Z). Examples include x, spam, and Ramanujan.

The above grammar establishes operator precedence through its hierarchical structure. Operations defined lower in the grammar have higher precedence because the parser will attempt to reduce those parts of the expression first. For example, primary is the lowest level and contains the most basic units (integers, variables, and parenthesized expressions). Next, multiplicative operates on primary using the * operator, giving multiplication the next level of precedence. Finally, additive operates on multiplicative using the + operator, resulting in the lowest precedence among the defined operators. Parenthesized expressions within primary can override this default precedence.
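For example, with this grammar the expression 2 + 3 * 4 is grouped as 2 + (3 * 4) and evaluates to 14 rather than 20, because the 3 * 4 part is matched by a single multiplicative before the enclosing additive combines it with 2.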

5.3. The Code

The following code, which largely incorporates what we’ve already written, now utilizes an environment — a dictionary holding defined variables and their values — to enable variable lookups during parsing. The changes from the original program are highlighted below.

File: expression.py
# start = expression , ? end of input ? ;
# expression = additive ;
# additive = multiplicative , { "+" , multiplicative } ;
# multiplicative = primary , { "*" , primary } ;
# primary = ? integer ? | ? variable ? | "(" , expression , ")" ;

import re (1)


class ParserError(Exception):

    def __init__(self, message):
        super().__init__(f'PARSER ERROR DETECTED:\n{message}')


class ExpressionParser:

    EOI = '? end of input ?'

    def __init__(self, source, environment): (2)
        self._environment = environment (3)
        self._tokens = iter(re.findall(r'[a-z]+|\d+|\S', source, re.I) (4)
                            + [self.EOI])
        self.advance()

    def advance(self):
        try:
            self._current = next(self._tokens)
        except StopIteration:
            self._current = None

    def expect(self, expected_tokens):
        if self._current not in expected_tokens:
            token_found_str = f'"{self._current}"'
            expected_tokens_str = \
                ", ".join([f'"{token}"' for token in expected_tokens])
            message = (f'Found the token: {token_found_str}\n'
                       f'But was expecting one of: {expected_tokens_str}')
            raise ParserError(message)
        original_current = self._current
        self.advance()
        return original_current

    # The code from this point forward is the implementation of the
    # grammar we are defining.

    def start(self): (5)
        result = self.expression()
        self.expect([self.EOI])
        return result

    def expression(self):
        return self.additive()

    def additive(self):
        result = self.multiplicative()
        while self._current == '+':
            self.advance()
            result += self.multiplicative()
        return result

    def multiplicative(self):
        result = self.primary()
        while self._current == '*':
            self.advance()
            result *= self.primary()
        return result

    def primary(self):
        current = self._current
        if current.isdigit(): (6)
            result = int(self._current) (7)
            self.advance()
            return result (8)
        if current.isalpha(): (9)
            if current not in self._environment: (10)
                raise ParserError( (11)
                    f'Variable "{current}" not found in '
                    f'environment {self._environment}')
            self.advance()
            return self._environment[current] (12)
        if current == '(': (13)
            self.advance()
            result = self.expression()
            self.expect([')'])
            return result
        token_found_str = f'"{current}"' (14)
        message = (f'Found the token: {token_found_str}\n'
                   f'But was expecting an integer, variable or "("')
        raise ParserError(message)


if __name__ == '__main__': (15)
    expression = '(14 + 8) + x * 7'
    environment = {'x': 11}
    evaluator = ExpressionParser(expression, environment)
    try:
        result = evaluator.start()
        print(f'{expression} = {result}')
    except ParserError as e:
        print(e)
1 Imports the re module, which provides support for regular expressions[2].
2 Defines the initializer for the ExpressionParser class taking the input source string and a dictionary containing variable names and their values.
3 Initializes the parser’s environment with the provided dictionary of variables.
4 Creates an iterator over the tokens found in the source string using a regular expression.

The re.findall() function in Python’s re module is used to find all non-overlapping matches of a given regular expression pattern in a string. It returns these matches as a list of strings. The re.I flag makes the regular expression matching case-insensitive, so it will match both uppercase and lowercase letters.

Understanding How The Regular Expression Works

The regular expression used here, [a-z]+|\d+|\S, has three parts:

  • [a-z]+: This part is designed to match sequences of one or more letters.
    The character set [a-z] defines a range that includes any single letter from 'a' through 'z'. The + symbol that follows this character set is a quantifier, meaning it requires the preceding element (in this case, a letter) to occur one or more times consecutively. Therefore, [a-z]+ will successfully match words composed entirely of letters, such as x, spam, or Ramanujan.

  • \d+: This element targets sequences of one or more digits.
    The special character class \d is a shorthand for matching any single digit from 0 to 9. Similar to the previous part, the + quantifier ensures that the pattern matches one or more consecutive digits. Consequently, \d+ will match whole numbers represented as strings, such as 0, 42, and 1729.

  • \S: This final part is designed to match any single character that is not a whitespace character. The uppercase \S is the negated version of the lowercase \s (which matches whitespace characters like spaces, tabs, newlines, etc.). Therefore, \S will match any individual character that is visible and not considered a space, tab, or newline. This is useful for capturing individual symbols, punctuation marks, or any other non-whitespace characters that might appear in the input string.

The pipe symbol (|) in the regular expression acts as an OR operator, allowing the regular expression to match any substring that satisfies at least one of the patterns separated by the pipes.
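For instance, applying this pattern to the sample expression from the beginning of this section yields the following token list (before the end-of-input marker is appended):

import re

tokens = re.findall(r'[a-z]+|\d+|\S', '(14 + 8) + x * 7', re.I)
print(tokens)  # ['(', '14', '+', '8', ')', '+', 'x', '*', '7']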

5 Defines the grammar’s starting rule.
6 Checks if the current token is a sequence of digits, indicating an integer.
7 Converts the digit token to an integer.
8 Returns the resulting integer value.
9 Identifies if the current token is alphabetic, treating it as a variable name.
10 Determines if this variable name is not found in the parser’s environment, implying it’s undefined.
11 If the variable is undefined, raises a ParserError exception.
12 Returns the value associated with the current variable name from the environment.
13 If the current token is an opening parenthesis, the parser advances, recursively evaluates the inner expression, expects a closing parenthesis, and returns the result of the inner expression.
14 If the current token doesn’t match an integer, a variable, or an opening parenthesis, the code proceeds to format an error message and then raises a ParserError to indicate a syntax error.
15 Sets up an expression with a simple environment, creates an ExpressionParser, attempts to evaluate the expression and print the result, and catches any ParserError exceptions to print the error message.
For the evaluator to work correctly, each production function must return an integer value representing the result of the parsed expression.
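Running expression.py as-is should print:

(14 + 8) + x * 7 = 99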

5.4. Exercise E ★★

Modify the ExpressionParser class to implement the following grammar, which includes support for subtraction (-), division (/), and remainder (%) operations.

Expression Grammar, Version 2
start = expression , ? end of input ? ;
expression = additive ;
additive = multiplicative , { ("+" | "-") , multiplicative } ;  (* UPDATED *)
multiplicative = primary , { ("*" | "/" | "%" ) , primary } ;   (* UPDATED *)
primary = ? integer ? | ? variable ? | "(" , expression , ")" ;
In your Python code, implement division using the floor division operator (//) in order to get integer results.

Verify the parser’s expected behavior by testing it with several valid and invalid expressions. For example, try it with this expression:

11 * 2 / (15 - 10) + 20 % 3

The result should be 6.

5.5. Exercise F ★

The following mathematical problem has been at the center of significant online controversy for several years:

Figure 1. Viral math problem. Source: https://x.com/

Address the following questions:

  • What’s the correct solution for this problem?

  • How does our parser behave when processing it?

5.6. Exercise G ★★

Extend the ExpressionParser class to support the following grammar, which incorporates the exponentiation (^) operator.

Expression Grammar, Version 3
start = expression , ? end of input ? ;
expression = additive ;
additive = multiplicative , { ("+" | "-") , multiplicative } ;
multiplicative = power , { ("*" | "/" | "%" ) , power } ;   (* UPDATED *)
power = primary , [ "^" , power ] ;                         (* NEW *)
primary = ? integer ? | ? variable ? | "(" , expression , ")" ;
Perform exponentiation in your Python code using the power operator (**) to obtain the desired results.

Ensure the parser behaves as expected by testing it with a variety of valid and invalid expressions. For example, try it with this expression:

11 * t + 2 ^ t ^ 2 - t + 1

If t is 3, the result is 543.

5.7. Exercise H ★★

The following expression grammar has been updated to include the unary negation operator. All operators defined before this were binary operators, meaning they operate on two values. In contrast, the new unary negation operator works with just one value.

Expression Grammar, version 4
start = expression , ? end of input ? ;
expression = additive ;
additive = multiplicative , { ("+" | "-") , multiplicative } ;
multiplicative = negation , { ("*" | "/" | "%" ) , negation } ; (* UPDATED *)
negation = "-" , negation | power ;                             (* NEW *)
power = primary , [ "^" , power ] ;
primary = ? integer ? | ? variable ? | "(" , expression , ")" ;

Perform the following:

  1. Update the ExpressionParser class to implement this new grammar.

  2. Determine the precedence of the unary negation operator in relation to the other defined operators.

  3. Validate the parser’s correct behavior by testing it with various expressions. For example, if you try:

    --(-p + -p) + - 2 ^ 4

    where p is 2, the result is -20.

5.8. Operator Associativity

As previously discussed, operator precedence governs the order of evaluation, with multiplicative operators taking precedence over additive operators. When operators share the same precedence level, associativity defines the direction of evaluation:

  • Left-associative means operators are evaluated from left to right.
    For instance, a - b - c is interpreted as (a - b) - c. The +, -, *, /, and % operators in our grammar are left-associative.

  • Right-associative means operators are evaluated from right to left.
    For example, a ^ b ^ c is interpreted as a ^ (b ^ c). In our grammar, the exponentiation and unary negation are right-associative operators.

Parentheses provide a mechanism to explicitly override both precedence and associativity, ensuring the desired order of operations.

In the expression grammars we’ve seen so far, left associativity for operators like +, -, *, /, and % is achieved through the iterative structure created by the { … } (repetition) construct in their respective rules, which plays the role a left-recursive rule otherwise would. For example, the additive production first matches a multiplicative, and then iteratively matches subsequent + or - operators followed by another multiplicative, naturally leading to left-to-right evaluation.

On the other hand, right associativity for the exponentiation (^) and unary negation (-) operators is achieved by the recursive structure on the right side in their corresponding rules. This structure forces the parser to group these operations from right to left.
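For instance, here is a minimal sketch of the right-associative power rule from Expression Grammar, Version 3, written as it might appear inside the ExpressionParser class (one possible shape, not the only correct solution to Exercise G):

# power = primary , [ "^" , power ] ;
def power(self):
    result = self.primary()
    if self._current == '^':
        self.advance()
        result = result ** self.power()  # the recursive call groups the rest to the right
    return result

Because the recursive call handles everything to the right of the first ^ before the exponentiation is applied, a ^ b ^ c is evaluated as a ^ (b ^ c).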

5.9. Exercise I ★★★

This is a challenging exercise that will test your understanding of parsing techniques, operator precedence and associativity, logical expressions, truth table generation, output formatting, and function implementation in Python.

If you solve it correctly, I hereby grant you bragging rights during the whole PyCon US 2025 conference 😁

5.9.1. Title

Truth Table Generator Function for a Logical Expression Language

5.9.2. Objective

Write a Python function that evaluates logical expressions involving boolean variables and prints their truth tables.

5.9.3. Function Specification

You need to implement a Python function that adheres to the following specification:

def generate_truth_table(expression):
    """
    Evaluates a logical expression string and prints its truth table.

    Parameter:
        expression: A string containing the logical expression to
        evaluate. For example: "x AND (y OR NOT z) XOR x"
    """
    # Your implementation here
    pass

The function should support a logical expression language with the following features:

  • Boolean Variables: Variables represented by single lowercase letters (e.g., x, y, a, b). Each variable can take one of two boolean values: 1 (true) or 0 (false).

  • Logical Operators: The following logical operators must be supported:

    • AND: (Logical AND) Is true if and only if both of its operands are true.

    • OR: (Logical OR) Is true if and only if at least one of its operands is true.

    • NOT: (Logical NOT) Is true if its single operand is false, and false if its single operand is true.

    • XOR: (Logical Exclusive OR) Is true if and only if exactly one of its operands is true.

    • EQV: (Logical Equivalence) Is true if and only if both of its operands have the same boolean value.

    • IMP: (Logical Implication) Is false only when the first operand is true and the second operand is false; otherwise, it is true.

    • NAND: (Logical NOT AND) Is true if and only if at least one of its operands is false.

    • NOR: (Logical NOT OR) Is true if and only if both of its operands are false.

  • Operator Precedence: The NOT operator possesses a higher precedence than all other operators. All remaining operators share the same precedence level.

  • Operator Associativity: All operators are left-associative, except for NOT, which is a unary prefix operator that evaluates from right to left.

  • Parentheses: Parentheses ( and ) can be used to override the default precedence and associativity.

5.9.4. Function Requirements

The generate_truth_table function should:

  1. Input: Accept a logical expression as a string argument (expression).

  2. Variable Identification: Identify all the unique boolean variables present in the input expression.

  3. Truth Table Generation and Printing:

    • Generate all possible boolean assignments for the identified variables.

    • Evaluate the input logical expression for each of these truth assignments.

    • Print a truth table to the standard output in a clear and readable format. The table should include:

      • A header row listing all the variables in alphabetical order, followed by the original expression, followed by the string “Result”.

      • Subsequent rows representing each truth assignment, with the corresponding boolean values for each variable, the original expression with variable values substituted, and the resulting boolean value of the expression.

5.9.5. Example

For the input expression:

x AND (y OR NOT z) XOR x

Calling generate_truth_table('x AND (y OR NOT z) XOR x') should print the following to the standard output:

x y z | x AND (y OR NOT z) XOR x | Result
------|--------------------------|-------
0 0 0 | 0 AND (0 OR NOT 0) XOR 0 | 0
0 0 1 | 0 AND (0 OR NOT 1) XOR 0 | 0
0 1 0 | 0 AND (1 OR NOT 0) XOR 0 | 0
0 1 1 | 0 AND (1 OR NOT 1) XOR 0 | 0
1 0 0 | 1 AND (0 OR NOT 0) XOR 1 | 0
1 0 1 | 1 AND (0 OR NOT 1) XOR 1 | 1
1 1 0 | 1 AND (1 OR NOT 0) XOR 1 | 0
1 1 1 | 1 AND (1 OR NOT 1) XOR 1 | 0

5.9.6. Implementation Notes

  • You will need to implement a parser to understand the structure of the input expression, taking into account operator precedence and associativity. Begin by defining the grammar for this logical expression language.

  • Upon encountering a parser error, such as a grammar mismatch or variables with more than one character, the function should print a descriptive error message to the standard output indicating the issue.

  • You will need a mechanism to generate all possible boolean assignments for the identified variables.

    The itertools.product() function can come in handy for this purpose (see the short example right after this list).
  • You will need to implement the logic for each of the boolean operators.
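For example, this snippet prints every truth assignment for three variables, in the same order used in the truth table shown earlier:

from itertools import product

for values in product((0, 1), repeat=3):
    print(values)
# (0, 0, 0)
# (0, 0, 1)
# ...
# (1, 1, 1)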

Good luck!

 — Always look on the bright side of life.

6. Additional Resources

6.1. Books

6.2. Online Resources

7. License and Credits

  • Copyright © 2025 by Ariel Ortiz.

  • This work is licensed under a CC BY-NC-SA 4.0 license.

  • Free use of the source code presented here is granted under the terms of the GPL version 3 License.

  • The author gratefully acknowledges the helpful contributions of Gemini in the writing and review process of these notes.

  • This document was prepared using the Asciidoctor text processor.

  • Icons by www.flaticon.com.


1. An iterator is an object that enables sequential access to elements of a collection, one at a time, using the next() function.
2. A regular expression pattern is a sequence of characters that defines a search pattern, used to match, locate, and manipulate text.