A meta approach to implementing programming languages

Published on 2018-12-02.

28 May 2019: Added link to next article in Putting it together.

How does the computer know what to do with the following expression?

1+2*3

How does it know how to recognize the sequence of characters as an arithmetic expression? How does it know that 2*3 should be computed first? How does it translate it to instructions that execute on the CPU?

In this article I present a metalanguage that I've developed that can help answer those questions.

Metalanguages

Metalanguages are used to reason about languages. In metalanguages you can make statements about statements in a different language.

The metalanguage I've developed is called RLMeta. It is inspired by a metalanguage from the sixties called META II. I wanted to develop my own version of META II to understand it deeply. RLMeta is also inspired by OMeta (another META II derivative).

RLMeta is a programming language in which you write grammars. Grammars have rules that specify how to match objects from an input stream and specify what should happen when objects are matched. The RLMeta compiler translates grammars into programs that recognize the objects specified in the grammar and evaluates the semantic actions when the objects are matched.

Overview of RLMeta compiler.

Interpreting expressions

How can RLMeta be used to give meaning to arithmetic expressions of the kind presented in the introductory example? Here is a grammar:

  1. calculator
  2. calculator.rlmeta
Calculator {
  expression =
    | additive
  additive =
    | multitive:x '+' additive:y -> add(x y)
    | multitive
  multitive =
    | digit:x '*' multitive:y    -> mul(x y)
    | digit
  digit =
    | '0'-'9':x                  -> int(x)
}

This grammar is called Calculator. It has four rules. The first rule says that an expression is an additive. The second rule says that an additive is either a multitive followed by the character '+' followed by another additive or just a multitive. The second case is only tried if the first does not match. The third rule says that a multitive is either a digit followed by the character '*' followed by another multitive or just a digit. The fourth rule says that a digit is a character in the range 0-9. The : followed by a name binds the result of a match to a variable. The expressions to the right of -> are semantic actions. They specify what should happen on a match. They can refer to variables. In this grammar they say that whenever an additive is matched, call the host language function add with the left and right side, and whenever a multitive is matched, call the host language function mul with the left and right side, and whenever a digit is matched, call the host language function int with the digit character. The host language function int converts a digit character to an integer and the add and mul functions perform addition and multiplication.

This grammar describes how to recognize an arithmetic expression in a sequence of characters. Precedence is encoded by the order of the rules. An additive is x1 + x2 + x3 + .. where the xes are multitives. Therefore multiplication is performed before addition. It gives meaning to the expression by calling host language functions when parts are matched.

When the calculator grammar is fed to the RLMeta compiler, a program is output that is an interpreter for arithmetic expressions.

Overview of calculator compilation.

More specifically, this program is a Python class that implements interpretation of arithmetic expressions. The class depends on a support library and also on the host language functions that were called from the grammar (add, mul, and int). Host language refers to the language that grammars are compiled to. In this case Python. The pieces must be assembled to form an executable program. Here is a template for the final Python file that implements a read-eval-print loop (REPL) for arithmetic expressions:

  1. calculator
  2. compile.sh
  3. python file template
from operator import add, mul

$support_py

$calculator_py

if __name__ == "__main__":
    calculator = Calculator()
    while True:
        line = raw_input("> ")
        result = calculator.run("expression", line)
        print(result)

First the host language functions are imported (int is always available). Then the support library and the compiled calculator grammar snippets are inserted. Finally the main method which is the REPL is implemented. Compiled grammars are used by instantiating them and calling their run method with the name of the rule and the input object. This template is rendered with a Bash script:

  1. calculator
  2. compile.sh
#!/bin/bash

set -e

cd "$(dirname "$0")"

support_py=$(python ../rlmeta/rlmeta.py --support)
calculator_py=$(python ../rlmeta/rlmeta.py < calculator.rlmeta)

cat <<EOF
<<python file template>>
EOF

First the set -e directive ensures that the script exits as soon as there is an error. Then the directory is changed to the one where the compile script lives. Then the output of two calls to the RLMeta compiler (rlmeta.py) are captured in two variables. The RLMeta compiler reads a grammar from stdin and writes a Python class to stdout. If the --support flag is given, it writes the support library to stdout instead. The command < file syntax redirects the contents of the file to the command's stdin. Finally the Python file template is rendered and written to stdout using a here document which has access to the previously captured variables.

Example usage on the command line:

$ python <(./calculator/compile.sh)
> 1+2*3
7

The compile script writes a Python file to stdout. The <(command) syntax is process substitution and turns the output of the command into a temporary file which can then be run with Python.

In this example, the input stream to the calculator becomes a list of characters:

['1', '+', '2', '*', '3']

When the calculator matches the expression, the following host language functions will be called:

add(int('1'), mul(int('2'), int('3')))

Compiling expressions

The previous example relied on host language functions to perform addition and multiplication. The meaning of an expression was defined in terms of the meaning of Python functions. To understand what an expression means, you need to understand how Python implements those functions. The next example compiles an expression down to a kind of assembly language that eliminates the need for Python.

For this compilation, two grammars are written: a parser and a code generator. The parser looks similar to the calculator grammar but instead of evaluating the expression, it creates an abstract syntax tree (AST) describing the expression:

  1. expression
  2. parser.rlmeta
Parser {
  expression =
    | additive
  additive =
    | multitive:x '+' additive:y -> ["add" x y]
    | multitive
  multitive =
    | digit:x '*' multitive:y    -> ["mul" x y]
    | digit
  digit =
    | '0'-'9':x                  -> ["digit" x]
}

The bracket notation in the semantic actions creates lists. Nodes in the AST are represented as lists where the first item is a string denoting the type of node.

The code generator takes as input an AST from the parser and generates assembly language code for an imaginary stack machine:

  1. expression
  2. codegenerator.rlmeta
CodeGenerator {
  ast =
    | ["add" ast:x ast:y] -> { x y "add"     "\n" }
    | ["mul" ast:x ast:y] -> { x y "mul"     "\n" }
    | ["digit" .:x]       -> {     "push " x "\n" }
}

This grammar has only one rule: ast. It says that an AST is either a list that starts with the string 'add', or a list that starts with the string 'mul', or a list that starts with the string 'digit'. The add and mul cases recursively match AST nodes as their left and right side whereas the digit matches anything (.) which is the digit stored in the AST node. The semantic actions in this grammar generate string output which is denoted by the curly braces. When a digit AST node is matched, the string 'push [digit]\n' is generated. It instructs the stack machine to push the given digit to the stack. For add and mul, instructions for the operands are first output followed by an 'add\n' or 'mul\n' instruction. They instruct the stack machine to pop two numbers off the stack, add or multiply them, and push the result.

This grammar describes how to recognize and AST in a sequence of objects. It gives meaning to the AST by generating assembly language code when AST nodes are matched.

When the expression grammars are fed to the RLMeta compiler, programs are output whose combination is a compiler for arithmetic expressions.

Overview of expression compilation.

Here is a template for the final Python file that implements a REPL for arithmetic expression compilation:

  1. expression
  2. compile.sh
  3. python file template
import sys

$support_py

$parser_py

$codegenerator_py

if __name__ == "__main__":
    parser = Parser()
    codegenerator = CodeGenerator()
    while True:
        line = raw_input("> ")
        ast = parser.run("expression", line)
        assembly = codegenerator.run("ast", ast)
        sys.stdout.write(assembly)

First the sys module is imported because the main method needs it. Then the support library, compiled parser grammar, and compiled code generator grammar snippets are inserted. Finally the main method which is the REPL is implemented. The output of the parser is fed to the code generator and its output is finally written to stdout. This template is rendered with a Bash script:

  1. expression
  2. compile.sh
#!/bin/bash

set -e

cd "$(dirname "$0")"

support_py=$(python ../rlmeta/rlmeta.py --support)
parser_py=$(python ../rlmeta/rlmeta.py < parser.rlmeta)
codegenerator_py=$(python ../rlmeta/rlmeta.py < codegenerator.rlmeta)

cat <<EOF
<<python file template>>
EOF

First the set -e directive ensures that the script exits as soon as there is an error. Then the directory is changed to the one where the compile script lives. Then the output of three calls to the RLMeta compiler are captured in three variables. Finally the Python file template is rendered and written to stdout.

Example usage on the command line:

$ python <(./expression/compile.sh)
> 1+2*3
push 1
push 2
push 3
mul
add
> 1*2+3
push 1
push 2
mul
push 3
add

In the first example, the input stream to the parser becomes a list of characters (same as for the calculator):

['1', '+', '2', '*', '3']

The input stream to the code generator becomes a list with a single object which is a list (the root AST node):

[
    [
        'add',
        ['digit', '1'],
        [
            'mul',
            ['digit', '2'],
            ['digit', '3']
        ]
    ]
]

The generated assembly language code is much closer to CPU instructions than the Python-based interpreter. A grammar could be written to convert these assembly instructions to assembly instructions of a real CPU. But I will not do that here. The point is that the meaning of an expression can be described by a series of transformations that eventually output machine instructions.

RLMeta implementation

So far I've just given informal descriptions of how RLMeta works. To fully understand how arithmetic expressions are evaluated and compiled, you need to understand how RLMeta is implemented.

The RLMeta compiler translates a grammar to a Python class. This translation is implemented in RLMeta itself. A grammar is, like an expression, translated in two steps: the parser translates grammar syntax to an AST and the code generator translates the AST to a Python class. The generated Python class depends on a support library.

RLMeta compiler internals illustrated.

The RLMeta compiler thus comprises three pieces: the parser, the code generator, and the support library.

Parser

This section defines the parser:

  1. rlmeta
  2. parser.rlmeta
Parser {
  <<rules>>
}

Grammar

The top level syntactic element is a grammar. A grammar has a name followed by rules enclosed in curly braces. When this is matched, a Grammar AST node is created containing the name of the grammar and the rule AST nodes:

  1. rlmeta
  2. parser.rlmeta
  3. rules
grammar =
  | name:x space '{' rule*:ys space '}' -> ["Grammar" x ~ys]

Throughout the parser, space is ignored. As a rule of thumb, it is inserted before matching a character sequence (and not before matching other rules).

The * operator after rule means match the preceding expression zero or more times. The result is a list.

The ~ operator in the semantic action means splice the list in-line into the enclosing list. The Grammar AST node thus has the name as element one, the first rule AST node as element two, the second rule AST node as element three, and so on.

Rule

A rule has a name followed by an equal sign followed by a choice. When this is matched, a Rule AST node is created containing the name and the choice AST node:

  1. rlmeta
  2. parser.rlmeta
  3. rules
rule =
  | name:x space '=' choice:y -> ["Rule" x y]

Choice

A choice has sequences separated by vertical bars. Optionally the first sequence can start with a vertical bar to allow all sequence lines to look the same. When this is matched, an Or AST node is created containing the sequence AST nodes:

  1. rlmeta
  2. parser.rlmeta
  3. rules
choice =
  | (space '|')?
    sequence:x (space '|' sequence)*:xs -> ["Or" x ~xs]

The ? operator means match the preceding expression zero or one time.

The result of xs is a list of sequences since the expression inside parenthesis returns the last match (which is a sequence).

Sequence

A sequence has one or more expressions. When this is matched, a Scope AST node is created containing an And AST node containing the expression AST nodes:

  1. rlmeta
  2. parser.rlmeta
  3. rules
sequence =
  | expr:x expr*:xs -> ["Scope" ["And" x ~xs]]

Expression

An expression is one of the following sequences:

  1. rlmeta
  2. parser.rlmeta
  3. rules
expr =
  | expr1:x space ':' name:y -> ["Bind" y x]
  | expr1

The first sequence is an expression of level 1 followed by a colon followed by a name. When this is matched, a Bind AST node is created containing the name and the expression AST node.

The second sequence is an expression of level 1.

Expression of level 1

An expression of level 1 is one of the following sequences:

  1. rlmeta
  2. parser.rlmeta
  3. rules
expr1 =
  | expr2:x space '*' -> ["Star" x]
  | expr2:x space '?' -> ["Or" x ["And"]]
  | space '!' expr2:x -> ["Not" x]
  | expr2

The first sequence is an expression of level 2 followed by an asterisk. When this is matched, a Star AST node is created containing the expression AST node.

The second sequence is an expression of level 2 followed by a question mark. When this is matched, an Or AST node is created containing the expression AST node and an empty And AST node. There is no dedicated AST node for the ? operator, but it is equivalent to matching the expression or zero expressions and'ed.

The third sequence is an exclamation mark followed by an expression of level 2. When this is matched, a Not AST node is created containing the expression AST node.

The fourth sequence is an expression of level 2.

Expression of level 2

An expression of level 2 is one of the following sequences:

  1. rlmeta
  2. parser.rlmeta
  3. rules
expr2 =
  | space '->' hostExpr:x        -> ["SemanticAction" x]
  | name:x !(space '=')          -> ["MatchRule" x]
  | space char:x '-' char:y      -> ["MatchRange" x y]
  | space string:x               -> ["MatchString" x]
  | space charseq:x              -> ["MatchCharseq" x]
  | space '.'                    -> ["MatchAny"]
  | space '(' choice:x space ')' -> x
  | space '[' expr*:xs space ']' -> ["MatchList" ["And" ~xs]]

The first sequence is the characters '->' followed by a host expression. When this is matched, a SemanticAction AST node is created containing the expression AST node.

The second sequence is a name that is not followed by an equal sign (otherwise it would also match the start of a rule). When this is matched, a MatchRule AST node is created containing the name.

The third sequence is a character followed by a dash followed by another character. When this is matched, a MatchRange AST node is created containing the two characters.

The fourth sequence is a string. When this is matched, a MatchString AST node is created containing the string.

The fifth sequence is a character sequence. When this is matched, a MatchCharseq AST node is created containing the character sequence.

The sixth sequence is a dot. When this is matched, a MatchAny AST node is created.

The seventh sequence is an open parenthesis followed by a choice followed by a closing parenthesis. When this is matched, the choice AST node is returned.

The eighth sequence is an open bracket followed by expressions followed by a closing bracket. When this is matched, a MatchList AST node is created containing an And AST node containing the expressions.

Host expression

A host expression is one of the following sequences:

  1. rlmeta
  2. parser.rlmeta
  3. rules
hostExpr =
  | space string:x                           -> ["String" x]
  | space '[' hostExprListItem*:xs space ']' -> ["List" ~xs]
  | space '{' buildExpr*:xs space '}'        -> ["Builder" ~xs]
  | name:x space '(' hostExpr*:ys space ')'  -> ["FnCall" x ~ys]
  | name:x                                   -> ["VarLookup" x]

The first sequence is a string. When this is matched, a String AST node is created containing the string.

The second sequence is an open bracket followed by host expression list items followed by a closing bracket. When this is matched, a List AST node is created containing the list item AST nodes.

A list item is either a host expression preceded by the ~ operator, in which case a ListItemSplice AST node is created containing the expression, or a host expression:

  1. rlmeta
  2. parser.rlmeta
  3. rules
hostExprListItem =
  | space '~' hostExpr:x -> ["ListItemSplice" x]
  | hostExpr

The third sequence is an open curly brace followed by build expressions followed by a closing curly brace. When this is matched, a Builder AST node is created containing the expression AST nodes.

A build expression is either a greater than character, in which case an IndentBuilder AST node is created, or a less than character, in which case a DedentBuilder AST node is created, or a host expression.

  1. rlmeta
  2. parser.rlmeta
  3. rules
buildExpr =
  | space '>' -> ["IndentBuilder"]
  | space '<' -> ["DedentBuilder"]
  | hostExpr

The fourth sequence is a name followed by an open parenthesis followed by host expressions followed by a closing parenthesis. When this is matched, a FnCall AST node is created containing the name and expression AST nodes.

The fifth sequence is a name. When this is matched, a VarLookup AST node is created containing the name.

Character related

Character related rules capture strings, character sequences, and single characters. Inside all of them a few escape codes are possible. When this is matched, the characters inside the delimiters are joined together to create the string:

  1. rlmeta
  2. parser.rlmeta
  3. rules
string    = '"'  (!'"'  innerChar)*:xs '"'  -> join(xs)
charseq   = '\'' (!'\'' innerChar)*:xs '\'' -> join(xs)
char      = '\''  !'\'' innerChar  :x  '\'' -> x
innerChar = '\\' escape | .
escape    = '\\' -> "\\" | '\'' -> "'"
          | '"'  -> "\"" | 'n'  -> "\n"

Name

A name has at least one alphabetic character followed by any number of alphanumeric characters. When this is matched, the individual characters are joined together to create a string:

  1. rlmeta
  2. parser.rlmeta
  3. rules
name      = space nameStart:x nameChar*:xs -> join([x ~xs])
nameStart = 'a'-'z' | 'A'-'Z'
nameChar  = 'a'-'z' | 'A'-'Z' | '0'-'9'

Space

A space is any number of space characters or newlines:

  1. rlmeta
  2. parser.rlmeta
  3. rules
space = (' ' | '\n')*

Code generator

This section defines the code generator and the support library:

  1. rlmeta
  2. codegenerator.rlmeta
CodeGenerator {
  <<rules>>
}
  1. rlmeta
  2. support.py
<<classes>>

Note on target language

The choice of Python as the target language for code generation is an implementation detail. A different target language could easily be used. Say for example that you would like to use RLMeta in a web browser. In that case JavaScript must be used as the target language. RLMeta could be ported to JavaScript by modifying the code generator and the support library.

Parsing algorithm

The parsing algorithm that RLMeta implements is based on parsing expression grammars (PEG), but is extended to match arbitrary objects, not just characters. Another way to describe the parsing algorithm is that it is a recursive descent parser with backtracking an memoization. Details of the algorithm is shown in the remainder of this section.

Structure of code generator

The code generator has two main rules: ast and astFnBody:

  1. rlmeta
  2. codegenerator.rlmeta
  3. rules
ast =
  <<ast>>
  | astFnBody:x -> { "(lambda:\n" > x < "\n)" }
astFnBody =
  <<astFnBody>>

Sometimes generated code for an AST node should be wrapped in a lambda. Those AST nodes are added to the astFnBody rule. The astFnBody rule is not strictly needed, but without it, many rules would have to wrap its output in a lambda.

The greater than and less than characters in the string output expression cause an indent and a dedent like this:

(lambda:
    x
)

Grammar

When a Grammar AST node is matched, a Python class inheriting _Grammar is generated:

  1. rlmeta
  2. codegenerator.rlmeta
  3. rules
  4. ast
| ["Grammar" .:x ast*:ys] -> { "class " x "(_Grammar):\n" > ys < }

The name of the class is the same as the name of the grammar and the child AST nodes are assumed to generate methods in the class.

The _Grammar class is defined in the support library:

  1. rlmeta
  2. support.py
  3. classes
class _Grammar(object):

    <<_Grammar>>

Names of support classes start with underscore to not collide with generated grammar names (which can not contain underscores).

Rule

When a Rule AST node is matched, a Python method with a name prefixed with _rule_ is generated:

  1. rlmeta
  2. codegenerator.rlmeta
  3. rules
  4. ast
| ["Rule" .:x ast:y] -> { "\ndef _rule_" x "(self):\n" > "return " y "()\n" < }

The method name ends with the same name as the rule. The child AST node is assumed to generate a matcher. A matcher is a function that tries to match objects from the input stream and return a semantic action if it succeeds or raise an exception if it fails. That function is called from the generated method and the semantic action is returned.

Or

When an Or AST node is matched, a matcher that calls the built-in _or method is generated:

  1. rlmeta
  2. codegenerator.rlmeta
  3. rules
  4. astFnBody
| ["Or" astItems:x] -> { "self._or([" x "])" }

Rules to generate a list of items:

  1. rlmeta
  2. codegenerator.rlmeta
  3. rules
astItems = astItem*:xs -> { "\n" > xs < }
astItem  = ast:x       -> { x ",\n"     }

The generated string is wrapped in a lambda because the code is added to the astFnBody rule and will thus look like this:

(lambda:
    self._or([
        matcher1,
        matcher2,
        ...
    ])
)

The _or method expects a list of matchers which are tried in sequence:

  1. rlmeta
  2. support.py
  3. classes
  4. _Grammar
def _or(self, matchers):
    original_stream = self._stream
    for matcher in matchers:
        try:
            return matcher()
        except _MatchError:
            self._stream = original_stream
    original_stream.fail("no choice matched")

The result of the first succeeding matcher is returned. The input stream is stored in _stream. Streams are immutable, so resetting the stream upon failure is just a matter of saving and restoring _stream. _MatchError is the name of the exception that is raised when a match fails. Streams have a fail method that generates that exception and adds context to it that is useful for error reporting.

Scope

When a Scope AST node is matched, a matcher that creates a new scope is generated:

  1. rlmeta
  2. codegenerator.rlmeta
  3. rules
  4. astFnBody
| ["Scope" ast:x] -> { "(lambda _vars:\n" > x < "()\n)(_Vars())" }

A scope is a set of variables that do not interfere with variables in other scopes. The name _vars is used to refer to variables in the current scope. The child AST node is assumed to generate a matcher which is called to return a semantic action. The generated string will thus look like this:

(lambda:
    (lambda _vars:
        matcher()
    )(_Vars())
)

The _Vars class is a subclass of a Python dictionary:

  1. rlmeta
  2. support.py
  3. classes
class _Vars(dict):

    <<_Vars>>

And

When an And AST node is matched, a matcher that calls the built-in _and method is generated:

  1. rlmeta
  2. codegenerator.rlmeta
  3. rules
  4. astFnBody
| ["And" astItems:x] -> { "self._and([" x "])" }

The _and method expects a list of matchers which are called in sequence:

  1. rlmeta
  2. support.py
  3. classes
  4. _Grammar
def _and(self, matchers):
    result = None
    for matcher in matchers:
        result = matcher()
    return result

The result of the last matcher is returned.

Bind

When a Bind AST node is matched, a matcher that binds the name to a value in the current scope is generated:

  1. rlmeta
  2. codegenerator.rlmeta
  3. rules
  4. astFnBody
| ["Bind" .:x ast:y] -> { "_vars.bind(" repr(x) ", " y "())" }

The child AST node is assumed to generate a matcher which is called to make the value a semantic action.

The bind method stores and returns a value with the given name:

  1. rlmeta
  2. support.py
  3. classes
  4. _Vars
def bind(self, name, value):
    self[name] = value
    return value

Star

When a Star AST node is matched, a matcher that calls the built-in _star method is generated:

  1. rlmeta
  2. codegenerator.rlmeta
  3. rules
  4. astFnBody
| ["Star" ast:x] -> { "self._star(" x ")" }

The _star method expects a matcher and calls it for as long as it succeeds:

  1. rlmeta
  2. support.py
  3. classes
  4. _Grammar
def _star(self, matcher):
    result = []
    while True:
        original_stream = self._stream
        try:
            result.append(matcher())
        except _MatchError:
            self._stream = original_stream
            return _SemanticAction(lambda: [x.eval() for x in result])

The return value is a semantic action. When evaluated, it returns a list of all match results evaluated.

A semantic action is a wrapper for a function:

  1. rlmeta
  2. support.py
  3. classes
class _SemanticAction(object):

    def __init__(self, fn):
        self.fn = fn

    def eval(self):
        return self.fn()

Its eval method calls the function. The reason for wrapping semantic actions in functions is to prevent them from being evaluated before a parse is complete. If a parse fails, no semantic actions are evaluated.

Not

When a Not AST node is matched, a matcher that calls the built-in _not method is generated:

  1. rlmeta
  2. codegenerator.rlmeta
  3. rules
  4. astFnBody
| ["Not" ast:x] -> { "self._not(" x ")" }

The _not method expects a matcher and succeeds if that matcher fails:

  1. rlmeta
  2. support.py
  3. classes
  4. _Grammar
def _not(self, matcher):
    original_stream = self._stream
    try:
        matcher()
    except _MatchError:
        return _SemanticAction(lambda: None)
    else:
        original_stream.fail("match found")
    finally:
        self._stream = original_stream

It never consumes any input. The original stream is always reset.

Semantic action

When a SemanticAction AST node is matched, a matcher that creates a _SemanticAction is generated:

  1. rlmeta
  2. codegenerator.rlmeta
  3. rules
  4. astFnBody
| ["SemanticAction" ast:x] -> { "_SemanticAction(lambda: " x ")" }

The child AST node is assumed to generate a Python expression that will be returned when the semantic action is evaluated.

Match rule

When a MatchRule AST node is matched, a matcher that calls the built-in _match_rule is generated:

  1. rlmeta
  2. codegenerator.rlmeta
  3. rules
  4. astFnBody
| ["MatchRule" .:x] -> { "self._match_rule(" repr(x) ")"}

The _match_rule method expects the name of the rule to call:

  1. rlmeta
  2. support.py
  3. classes
  4. _Grammar
def _match_rule(self, rule_name):
    key = (rule_name, self._stream.position())
    if key in self._memo:
        result, _, self._stream = self._memo[key]
    else:
        start = self._stream
        result = getattr(self, "_rule_{}".format(rule_name))()
        end = self._stream
        self._memo[key] = (result, start, end)
    return result

If the given rule has been matched at the current position before, the memoized result is returned and the input stream is changed to where the previous match ended. If there has been no previous match, the rule is matched by calling the method. The result of the match is stored in the memoization table for later retrieval.

Match range

When a MatchRange AST node is matched, a matcher that calls the built-in _match_range is generated:

  1. rlmeta
  2. codegenerator.rlmeta
  3. rules
  4. astFnBody
| ["MatchRange" .:x .:y] -> { "self._match_range(" repr(x) ", " repr(y) ")" }

The _match_range method expects two objects defining a range to match:

  1. rlmeta
  2. support.py
  3. classes
  4. _Grammar
def _match_range(self, start, end):
    original_stream = self._stream
    next_objext, self._stream = self._stream.next()
    if next_objext >= start and next_objext <= end:
        return _SemanticAction(lambda: next_objext)
    else:
        original_stream.fail(
            "expected range {!r}-{!r} but found {!r}".format(start, end, next_objext)
        )

If the next object from the input stream is in that range, it succeeds, otherwise it fails.

Match string

When a MatchString AST node is matched, a matcher that calls the built-in _match_string is generated:

  1. rlmeta
  2. codegenerator.rlmeta
  3. rules
  4. astFnBody
| ["MatchString" .:x] -> { "self._match_string(" repr(x) ")" }

The _match_string method expects the string to match:

  1. rlmeta
  2. support.py
  3. classes
  4. _Grammar
def _match_string(self, string):
    original_stream = self._stream
    next_object, self._stream = self._stream.next()
    if next_object == string:
        return _SemanticAction(lambda: string)
    else:
        original_stream.fail(
            "expected {!r} but found {!r}".format(string, next_object)
        )

If the next object from the input stream is that string, it succeeds, otherwise it fails.

Match character sequence

When a MatchCharseq AST node is matched, a matcher that calls the built-in _match_charseq is generated:

  1. rlmeta
  2. codegenerator.rlmeta
  3. rules
  4. astFnBody
| ["MatchCharseq" .:x] -> { "self._match_charseq(" repr(x) ")" }

The _match_charseq method expects a string with characters to match:

  1. rlmeta
  2. support.py
  3. classes
  4. _Grammar
def _match_charseq(self, charseq):
    for char in charseq:
        original_stream = self._stream
        next_object, self._stream = self._stream.next()
        if next_object != char:
            original_stream.fail(
                "expected {!r} but found {!r}".format(char, next_object)
            )
    return _SemanticAction(lambda: charseq)

If the next objects from the input stream are those characters, it succeeds, otherwise it fails.

Match any

When a MatchAny AST node is matched, the built-in _match_any is generated:

  1. rlmeta
  2. codegenerator.rlmeta
  3. rules
  4. ast
| ["MatchAny"] -> { "self._match_any" }

The _match_any method expects no arguments and always matches the next object from the input stream:

  1. rlmeta
  2. support.py
  3. classes
  4. _Grammar
def _match_any(self):
    next_object, self._stream = self._stream.next()
    return _SemanticAction(lambda: next_object)

It only fails if there are no more objects.

Match list

When a MatchList AST node is matched, a matcher that calls the built-in _match_list is generated:

  1. rlmeta
  2. codegenerator.rlmeta
  3. rules
  4. astFnBody
| ["MatchList" ast:x] -> { "self._match_list(" x ")" }

The _match_list method expects a matcher that should match the contents of the list:

  1. rlmeta
  2. support.py
  3. classes
  4. _Grammar
def _match_list(self, matcher):
    original_stream = self._stream
    next_object, next_stream = self._stream.next()
    if isinstance(next_object, list):
        self._stream = self._stream.nested(next_object)
        matcher()
        if self._stream.is_at_end():
            self._stream = next_stream
            return _SemanticAction(lambda: next_object)
    original_stream.fail("list match failed")

If the next object is a list, a new stream is created with the nested call that contains all child objects. It is set to be the input stream, and the matcher is then called. The matcher must match all child objects, or the match fails.

String

When a String AST node is matched, a Python string is generated:

  1. rlmeta
  2. codegenerator.rlmeta
  3. rules
  4. ast
| ["String" .:x] -> { repr(x) }

List

When a List AST node is matched, a Python list is generated:

  1. rlmeta
  2. codegenerator.rlmeta
  3. rules
  4. ast
| ["List" astList:x] -> { x }
  1. rlmeta
  2. codegenerator.rlmeta
  3. rules
astList = astListItem*:xs    -> { "(" xs "[])" }
astListItem =
  | ["ListItemSplice" ast:x] -> {     x  "+"   }
  | ast:x                    -> { "[" x "]+"   }

The Python list is generated by concatenating sub-lists. If an item should be spliced, it is assumed to be a list already and is not wrapped in brackets.

Builder

When a Builder AST node is matched, a _Builder is generated:

  1. rlmeta
  2. codegenerator.rlmeta
  3. rules
  4. ast
| ["Builder" astItems:x] -> { "_Builder.create([" x "])" }

The child AST nodes are assumed to generate Python expressions.

When an IndentBuilder AST node is matched, an _IndentBuilder is generated:

  1. rlmeta
  2. codegenerator.rlmeta
  3. rules
  4. ast
| ["IndentBuilder"] -> { "_IndentBuilder()" }

When a DedentBuilder AST node is matched, a _DedentBuilder is generated:

  1. rlmeta
  2. codegenerator.rlmeta
  3. rules
  4. ast
| ["DedentBuilder"] -> { "_DedentBuilder()" }

All builders inherit from _Builder:

  1. rlmeta
  2. support.py
  3. classes
class _Builder(object):

    <<_Builder>>

A builder can build a string with the build_string method:

  1. rlmeta
  2. support.py
  3. classes
  4. _Builder
def build_string(self):
    output = _Output()
    self.write(output)
    return output.value

All builders must implement the write method that is passed an instance of _Output. The _Output class has functionality to build a string with appropriate indentation:

  1. rlmeta
  2. support.py
  3. classes
class _Output(object):

    def __init__(self):
        self.value = ""
        self.indentation = 0

    def write(self, value):
        for ch in value:
            if self.value and ch != "\n" and self.value[-1] == "\n":
                self.value += "    "*self.indentation
            self.value += ch

The create method of a builder creates an instance of a builder depending on the type of object passed in:

  1. rlmeta
  2. support.py
  3. classes
  4. _Builder
@classmethod
def create(self, item):
    if isinstance(item, _Builder):
        return item
    elif isinstance(item, list):
        return _ListBuilder([_Builder.create(x) for x in item])
    else:
        return _AtomBuilder(item)

A _ListBuilder calls the write method on all child builders:

  1. rlmeta
  2. support.py
  3. classes
class _ListBuilder(_Builder):

    def __init__(self, builders):
        self.builders = builders

    def write(self, output):
        for builder in self.builders:
            builder.write(output)

An _AtomBuilder converts its object to a string and writes it to the output:

  1. rlmeta
  2. support.py
  3. classes
class _AtomBuilder(_Builder):

    def __init__(self, atom):
        self.atom = atom

    def write(self, output):
        output.write(str(self.atom))

An _IndentBuilder changes the indentation of the output:

  1. rlmeta
  2. support.py
  3. classes
class _IndentBuilder(_Builder):

    def write(self, output):
        output.indentation += 1

A _DedentBuilder changes the indentation of the output:

  1. rlmeta
  2. support.py
  3. classes
class _DedentBuilder(_Builder):

    def write(self, output):
        output.indentation -= 1

Function call

When a FnCall AST node is matched, a Python function call is generated:

  1. rlmeta
  2. codegenerator.rlmeta
  3. rules
  4. ast
| ["FnCall" .:x astItems:y] -> { x "(" y ")" }

The child AST nodes are assumed to generate Python expressions.

Variable lookup

When a VarLookup AST node is matched, a Python expression that looks up the variable and evaluates it is generated:

  1. rlmeta
  2. codegenerator.rlmeta
  3. rules
  4. ast
| ["VarLookup" .:x] -> { "_vars.lookup(" repr(x) ").eval()" }

The lookup method returns the stored value:

  1. rlmeta
  2. support.py
  3. classes
  4. _Vars
def lookup(self, name):
    return self[name]

Every time a variable is referenced, the eval method is called. Consider this grammar:

AGrammar {
  foo = bar:x -> { x x }
  bar = .     -> a_side_effect()
}

If evaluating x has a side effect, it will be executed twice. The eval method of _SemanticAction could be modified so that the function is only called once if needed.

Final support methods

Grammars in Python have a single entry point, run, which expects the name of the rule to match and the input object:

  1. rlmeta
  2. support.py
  3. classes
  4. _Grammar
def run(self, rule_name, input_object):
    self._memo = _Memo()
    self._stream = _Stream.from_object(self._memo, input_object)
    result = self._match_rule(rule_name).eval()
    if isinstance(result, _Builder):
        return result.build_string()
    else:
        return result

It initializes the memoization table and the input stream, matches the rule, and returns the evaluated result. If the result is a builder, the string that the builder builds is returned.

The memoization table is a subclass of a Python dictionary:

  1. rlmeta
  2. support.py
  3. classes
class _Memo(dict):

    def __init__(self):
        dict.__init__(self)
        self._latest_stream = _ObjectStream(self, [], position=-1)
        self._latest_message = ""

    def describe(self):
        items = []
        for (rule_name, _), (_, start, end) in self.items():
            if end > start:
                items.append((rule_name, start, end))
        items.sort(key=lambda item: (item[2].position(), item[1].position()))
        message = []
        for item in items:
            message.append("matched {: <20} {} -> {}\n".format(*item))
        message.append("\n")
        message.append("ERROR: {}: {}\n".format(
            self._latest_stream,
            self._latest_message
        ))
        return "".join(message)

    def fail(self, stream, message):
        if stream.position() >= self._latest_stream.position():
            self._latest_stream = stream
            self._latest_message = message
        raise _MatchError(self)

Its describe method returns a string that describes what has been matched so far and what the latest error message was. It is used for error reporting. It keeps track of match failures via the fail method which is always called to raise a _MatchError.

A _MatchError is a Python exception that also has a reference to a memoization table so that the describe method can be exposed:

  1. rlmeta
  2. support.py
  3. classes
class _MatchError(Exception):

    def __init__(self, memo):
        Exception.__init__(self)
        self._memo = memo

    def describe(self):
        return self._memo.describe()

A _Stream is an immutable object that has a list of objects:

  1. rlmeta
  2. support.py
  3. classes
class _Stream(object):

    @classmethod
    def from_object(cls, memo, input_object):
        if isinstance(input_object, basestring):
            return _CharStream(memo, list(input_object))
        else:
            return _ObjectStream(memo, [input_object])

    def __init__(self, memo, objects):
        self._memo = memo
        self._objects = objects

    def fail(self, message):
        self._memo.fail(self, message)

    def next(self):
        if self.is_at_end():
            self.fail("not eof")
        next_object = self._objects[0]
        return (
            next_object,
            self._advance(next_object, self._objects[1:]),
        )

    def is_at_end(self):
        return len(self._objects) == 0

Its next method returns a tuple with the next object and the next stream. There are two kinds of streams: character streams and object streams. They are subclasses of _Stream and implement the position and _advance methods. The appropriate stream is chosen in the from_object method.

A character steam stores the position as a line + column:

  1. rlmeta
  2. support.py
  3. classes
class _CharStream(_Stream):

    def __init__(self, memo, objects, line=1, column=1):
        _Stream.__init__(self, memo, objects)
        self._line = line
        self._column = column

    def position(self):
        return (self._line, self._column)

    def _advance(self, next_object, objects):
        if next_object == "\n":
            return _CharStream(self._memo, objects, self._line+1, 1)
        else:
            return _CharStream(self._memo, objects, self._line, self._column+1)

    def __str__(self):
        return "L{:03d}:C{:03d}".format(self._line, self._column)

An object stream stores the position as a tuple of indices:

  1. rlmeta
  2. support.py
  3. classes
class _ObjectStream(_Stream):

    def __init__(self, memo, objects, parent=(), position=0):
        _Stream.__init__(self, memo, objects)
        self._parent = parent
        self._position = position

    def position(self):
        return self._parent + (self._position,)

    def nested(self, input_object):
        return _ObjectStream(self._memo, input_object, self._parent+(self._position,))

    def _advance(self, next_object, objects):
        return _ObjectStream(self._memo, objects, self._parent, self._position+1)

    def __str__(self):
        return "[{}]".format(", ".join(str(x) for x in self.position()))

The nested method will only be called on object streams since character streams do not nest.

Putting it together

Here is a template for the final Python file that implements the RLMeta compiler:

  1. rlmeta
  2. compile.sh
  3. python file template
import sys

SUPPORT = $support_py_string

$support_py

$parser_py

$codegenerator_py

join = "".join

def compile_grammar(grammar):
    parser = Parser()
    code_generator = CodeGenerator()
    return code_generator.run("ast", parser.run("grammar", grammar))

if __name__ == "__main__":
    if "--support" in sys.argv:
        sys.stdout.write(SUPPORT)
    else:
        try:
            sys.stdout.write(compile_grammar(sys.stdin.read()))
        except _MatchError as e:
            sys.stderr.write(e.describe())
            sys.exit(1)

First the sys module is imported because the main method needs it. Then the support library snippet is stored in a variable so that it can be output when the --support flag is given. Then the support library, compiled parser grammar, and compiled code generator grammar snippets are inserted. Then the host language function join is defined. Then a function to compile a grammar is defined. Finally the main method that reads a grammar from stdin and writes a Python class to stdout is implemented. If an error occurs, it is written to stderr using the exception's describe method. This template is rendered with a Bash script:

  1. rlmeta
  2. compile.sh
#!/bin/bash

set -e

rlmeta_compiler="$(pwd)/$1"

cd "$(dirname "$0")"

to_python_string() {
    python -c 'import sys; sys.stdout.write(repr(sys.stdin.read()))'
}

support_py=$(cat support.py)
support_py_string=$(to_python_string < support.py)
parser_py=$(python "$rlmeta_compiler" < parser.rlmeta)
codegenerator_py=$(python "$rlmeta_compiler" < codegenerator.rlmeta)

cat <<EOF
<<python file template>>
EOF

First the set -e directive ensures that the script exits as soon as there is an error. Then a variable containing the path to the RLMeta compiler is defined. The path must be given to the compile script as the first argument. Then the directory is changed to the one where the compile script lives. Then a function is defined that turns its stdin into a Python string with correct quoting. Then the support library is captured in a variable. Then the output of passing the support library to to_python_string is captured in a variable. Then the output of two calls to the RLMeta compiler are captured in two variables. Finally the Python file template is rendered and written to stdout.

If the compile script is run with the current RLMeta compiler (rlmeta/rlmeta.py), a new Python file is output that is exactly the same as the rlmeta/rlmeta.py file. It can be seen by diffing the two files:

$ diff <(./rlmeta/compile.sh rlmeta/rlmeta.py) rlmeta/rlmeta.py && echo EQUAL
EQUAL

The rlmeta.py file never needs to be changed manually. The next version can always be produced using the previous version. Sometimes the next version must be produced in steps and intermediate compilers must be created. That is why the path to the compiler must be given to the compile script. (More on modifying RLMeta in the next article.) If you want to see what the rlmeta.py file looks like, it is here.

Bootstrapping

In the previous section, a version of the RLMeta compiler was needed to compile the RLMeta compiler:

parser_py=$(python "$rlmeta_compiler" < parser.rlmeta)
codegenerator_py=$(python "$rlmeta_compiler" < codegenerator.rlmeta)

But how can the RLMeta compiler be run before it exists? How was the first version of rlmeta.py created? This is a bootstrapping problem. In this case I solved it by translating the parser and the code generator to Python code manually according the rules specified in the grammars. I did manually what the compiler would do.

I started translating the parser that looks like this:

Parser {
  ...
}

This is matched by the grammar rule in the parser and turned into a Grammar AST node:

grammar =
  | name:x space '{' rule*:ys space '}' -> ["Grammar" x ~ys]

The Grammar AST node is then turned into a Python class definition by the code generator:

| ["Grammar" .:x ast*:ys] -> { "class " x "(_Grammar):\n" > ys < }

The parser is thus turned into the following Python code:

class Parser(_Grammar):
    ...

I then went on to translate the rules in the parser. The first rule is grammar:

grammar =
  | name:x space '{' rule*:ys space '}' -> ["Grammar" x ~ys]

It is matched by the rule rule in the parser and turned into a Rule AST node:

rule =
  | name:x space '=' choice:y -> ["Rule" x y]

The Rule AST node is then turned into a Python method definition by the code generator:

| ["Rule" .:x ast:y] -> { "\ndef _rule_" x "(self):\n" > "return " y "()\n" < }

The grammar rule is thus turned into the following Python code:


def _rule_grammar(self):
    return ...()

The body of the grammar rule is matched by the choice rule in the parser and turned into an Or AST node:

choice =
  | (space '|')?
    sequence:x (space '|' sequence)*:xs -> ["Or" x ~xs]

The Or AST node is then turned into a Python lambda expression by the code generator:

| ["Or" astItems:x] -> { "self._or([" x "])" }

The body of the grammar rule is thus turned into the following Python code:

(lambda:
    self._or([...])
)

I continued this process until all dots had been expanded. Then I did the same for all remaining rules. Finally I repeated the process for the code generator. Once I had the manually translated versions of parser.py and codegenerator.py, I could temporarily replace

parser_py=$(python "$rlmeta_compiler" < parser.rlmeta)
codegenerator_py=$(python "$rlmeta_compiler" < codegenerator.rlmeta)

with

parser_py=$(cat parser.py)
codegenerator_py=$(cat codegenerator.py)

to create the initial version of rlmeta.py.

With the initial version of rlmeta.py I could run the compile script to generate the second version of rlmeta.py. The second version did not match the first version exactly. The complete diff can be seen in the commit Get rid of bootstrapped compiler. Mostly I had used a different character for strings and forgotten some commas that were not strictly necessary. Once the second version replaced the first, the compile script reproduced rlmeta.py exactly. At this point the manually translated versions could be discarded. The RLMeta compiler was bootstrapped. This was a tremendously rewarding experience.

Did the first version work on the first run? No. In the translation process I noticed incorrect behavior in the grammars that I had to fix. And some manual translations were done incorrectly. But all fixes were relatively minor. Before the translation I had also carefully debugged the grammars in my head to decrease the risk of them having bugs.

To avoid making mistakes in the manual translation process, I created snippets for the Vim text editor for each AST node. So when I encountered a Rule AST node, I could type "rule", hit tab, and the following snippet would be inserted placing the cursor where the rule name should be inserted:

def _rule_${1:name}(self):
    return ${2:matcher}()

I could then type the name ("grammar" for example), hit tab, write the name of the next AST node, and hit tab. If the next AST node was Or for example, the following snipped would be inserted, placing the cursor inside the list:

(lambda:
    self._or([
        ${1:matchers}
    ])
)

The Vim snippets saved me a lot of typing and made manual translation feasible. Getting all commas and parenthesis right would have been difficult otherwise.

Implementing programming languages

When I read Structure and Interpretation of Computer Programs I remember thinking that it described what programming was all about. The following quote from chapter 4 I think summarizes it:

However, as we confront increasingly complex problems, we will find that Lisp, or indeed any fixed programming language, is not sufficient for our needs. We must constantly turn to new languages in order to express our ideas more effectively. Establishing new languages is a powerful strategy for controlling complexity in engineering design; we can often enhance our ability to deal with a complex problem by adopting a new language that enables us to describe (and hence to think about) the problem in a different way, using primitives, means of combination, and means of abstraction that are particularly well suited to the problem at hand.

In this article I set out to explain how a metalanguage could be used to give meaning to arithmetic expressions. But RLMeta is more powerful than that. It can be used to implement itself and also many other programming languages because matching and transforming is at the heart of programming language implementation. Its small implementation, just over 400 lines of code, also makes it feasible to understand and modify:

   51 parser.rlmeta
   33 codegenerator.rlmeta
  278 support.py
   45 compile.sh
  407 total

I hope this article inspires you to experiment with implementing programming languages so that you can solve complex problems elegantly.

Resources

I was helped by the following resources when implementing RLMeta:

Code listings for RLMeta

parser.rlmeta

Parser {
  grammar =
    | name:x space '{' rule*:ys space '}'      -> ["Grammar" x ~ys]
  rule =
    | name:x space '=' choice:y                -> ["Rule" x y]
  choice =
    | (space '|')?
      sequence:x (space '|' sequence)*:xs      -> ["Or" x ~xs]
  sequence =
    | expr:x expr*:xs                          -> ["Scope" ["And" x ~xs]]
  expr =
    | expr1:x space ':' name:y                 -> ["Bind" y x]
    | expr1
  expr1 =
    | expr2:x space '*'                        -> ["Star" x]
    | expr2:x space '?'                        -> ["Or" x ["And"]]
    | space '!' expr2:x                        -> ["Not" x]
    | expr2
  expr2 =
    | space '->' hostExpr:x                    -> ["SemanticAction" x]
    | name:x !(space '=')                      -> ["MatchRule" x]
    | space char:x '-' char:y                  -> ["MatchRange" x y]
    | space string:x                           -> ["MatchString" x]
    | space charseq:x                          -> ["MatchCharseq" x]
    | space '.'                                -> ["MatchAny"]
    | space '(' choice:x space ')'             -> x
    | space '[' expr*:xs space ']'             -> ["MatchList" ["And" ~xs]]
  hostExpr =
    | space string:x                           -> ["String" x]
    | space '[' hostExprListItem*:xs space ']' -> ["List" ~xs]
    | space '{' buildExpr*:xs space '}'        -> ["Builder" ~xs]
    | name:x space '(' hostExpr*:ys space ')'  -> ["FnCall" x ~ys]
    | name:x                                   -> ["VarLookup" x]
  hostExprListItem =
    | space '~' hostExpr:x                     -> ["ListItemSplice" x]
    | hostExpr
  buildExpr =
    | space '>'                                -> ["IndentBuilder"]
    | space '<'                                -> ["DedentBuilder"]
    | hostExpr
  string    = '"'  (!'"'  innerChar)*:xs '"'   -> join(xs)
  charseq   = '\'' (!'\'' innerChar)*:xs '\''  -> join(xs)
  char      = '\''  !'\'' innerChar  :x  '\''  -> x
  innerChar = '\\' escape | .
  escape    = '\\' -> "\\" | '\'' -> "'"
            | '"'  -> "\"" | 'n'  -> "\n"
  name      = space nameStart:x nameChar*:xs   -> join([x ~xs])
  nameStart = 'a'-'z' | 'A'-'Z'
  nameChar  = 'a'-'z' | 'A'-'Z' | '0'-'9'
  space     = (' ' | '\n')*
}

codegenerator.rlmeta

CodeGenerator {
  ast =
    | ["Grammar" .:x ast*:ys]   -> { "class " x "(_Grammar):\n" > ys <                   }
    | ["Rule" .:x ast:y]        -> { "\ndef _rule_" x "(self):\n" > "return " y "()\n" < }
    | ["MatchAny"]              -> { "self._match_any"                                   }
    | ["String" .:x]            -> { repr(x)                                             }
    | ["List" astList:x]        -> { x                                                   }
    | ["Builder" astItems:x]    -> { "_Builder.create([" x "])"                          }
    | ["IndentBuilder"]         -> { "_IndentBuilder()"                                  }
    | ["DedentBuilder"]         -> { "_DedentBuilder()"                                  }
    | ["FnCall" .:x astItems:y] -> { x "(" y ")"                                         }
    | ["VarLookup" .:x]         -> { "_vars.lookup(" repr(x) ").eval()"                  }
    | astFnBody:x               -> { "(lambda:\n" > x < "\n)" }
  astFnBody =
    | ["Or" astItems:x]         -> { "self._or([" x "])"                                 }
    | ["Scope" ast:x]           -> { "(lambda _vars:\n" > x < "()\n)(_Vars())"           }
    | ["And" astItems:x]        -> { "self._and([" x "])"                                }
    | ["Bind" .:x ast:y]        -> { "_vars.bind(" repr(x) ", " y "())"                  }
    | ["Star" ast:x]            -> { "self._star(" x ")"                                 }
    | ["Not" ast:x]             -> { "self._not(" x ")"                                  }
    | ["SemanticAction" ast:x]  -> { "_SemanticAction(lambda: " x ")"                    }
    | ["MatchRule" .:x]         -> { "self._match_rule(" repr(x) ")"                     }
    | ["MatchRange" .:x .:y]    -> { "self._match_range(" repr(x) ", " repr(y) ")"       }
    | ["MatchString" .:x]       -> { "self._match_string(" repr(x) ")"                   }
    | ["MatchCharseq" .:x]      -> { "self._match_charseq(" repr(x) ")"                  }
    | ["MatchList" ast:x]       -> { "self._match_list(" x ")"                           }
  astItems = astItem*:xs        -> { "\n" > xs <                                         }
  astItem  = ast:x              -> { x ",\n"                                             }
  astList  = astListItem*:xs    -> { "(" xs "[])"                                        }
  astListItem =
    | ["ListItemSplice" ast:x]  -> {     x  "+"                                          }
    | ast:x                     -> { "[" x "]+"                                          }
}

support.py

class _Grammar(object):

    def _or(self, matchers):
        original_stream = self._stream
        for matcher in matchers:
            try:
                return matcher()
            except _MatchError:
                self._stream = original_stream
        original_stream.fail("no choice matched")

    def _and(self, matchers):
        result = None
        for matcher in matchers:
            result = matcher()
        return result

    def _star(self, matcher):
        result = []
        while True:
            original_stream = self._stream
            try:
                result.append(matcher())
            except _MatchError:
                self._stream = original_stream
                return _SemanticAction(lambda: [x.eval() for x in result])

    def _not(self, matcher):
        original_stream = self._stream
        try:
            matcher()
        except _MatchError:
            return _SemanticAction(lambda: None)
        else:
            original_stream.fail("match found")
        finally:
            self._stream = original_stream

    def _match_rule(self, rule_name):
        key = (rule_name, self._stream.position())
        if key in self._memo:
            result, _, self._stream = self._memo[key]
        else:
            start = self._stream
            result = getattr(self, "_rule_{}".format(rule_name))()
            end = self._stream
            self._memo[key] = (result, start, end)
        return result

    def _match_range(self, start, end):
        original_stream = self._stream
        next_objext, self._stream = self._stream.next()
        if next_objext >= start and next_objext <= end:
            return _SemanticAction(lambda: next_objext)
        else:
            original_stream.fail(
                "expected range {!r}-{!r} but found {!r}".format(start, end, next_objext)
            )

    def _match_string(self, string):
        original_stream = self._stream
        next_object, self._stream = self._stream.next()
        if next_object == string:
            return _SemanticAction(lambda: string)
        else:
            original_stream.fail(
                "expected {!r} but found {!r}".format(string, next_object)
            )

    def _match_charseq(self, charseq):
        for char in charseq:
            original_stream = self._stream
            next_object, self._stream = self._stream.next()
            if next_object != char:
                original_stream.fail(
                    "expected {!r} but found {!r}".format(char, next_object)
                )
        return _SemanticAction(lambda: charseq)

    def _match_any(self):
        next_object, self._stream = self._stream.next()
        return _SemanticAction(lambda: next_object)

    def _match_list(self, matcher):
        original_stream = self._stream
        next_object, next_stream = self._stream.next()
        if isinstance(next_object, list):
            self._stream = self._stream.nested(next_object)
            matcher()
            if self._stream.is_at_end():
                self._stream = next_stream
                return _SemanticAction(lambda: next_object)
        original_stream.fail("list match failed")

    def run(self, rule_name, input_object):
        self._memo = _Memo()
        self._stream = _Stream.from_object(self._memo, input_object)
        result = self._match_rule(rule_name).eval()
        if isinstance(result, _Builder):
            return result.build_string()
        else:
            return result

class _Vars(dict):

    def bind(self, name, value):
        self[name] = value
        return value

    def lookup(self, name):
        return self[name]

class _SemanticAction(object):

    def __init__(self, fn):
        self.fn = fn

    def eval(self):
        return self.fn()

class _Builder(object):

    def build_string(self):
        output = _Output()
        self.write(output)
        return output.value

    @classmethod
    def create(self, item):
        if isinstance(item, _Builder):
            return item
        elif isinstance(item, list):
            return _ListBuilder([_Builder.create(x) for x in item])
        else:
            return _AtomBuilder(item)

class _Output(object):

    def __init__(self):
        self.value = ""
        self.indentation = 0

    def write(self, value):
        for ch in value:
            if self.value and ch != "\n" and self.value[-1] == "\n":
                self.value += "    "*self.indentation
            self.value += ch

class _ListBuilder(_Builder):

    def __init__(self, builders):
        self.builders = builders

    def write(self, output):
        for builder in self.builders:
            builder.write(output)

class _AtomBuilder(_Builder):

    def __init__(self, atom):
        self.atom = atom

    def write(self, output):
        output.write(str(self.atom))

class _IndentBuilder(_Builder):

    def write(self, output):
        output.indentation += 1

class _DedentBuilder(_Builder):

    def write(self, output):
        output.indentation -= 1

class _Memo(dict):

    def __init__(self):
        dict.__init__(self)
        self._latest_stream = _ObjectStream(self, [], position=-1)
        self._latest_message = ""

    def describe(self):
        items = []
        for (rule_name, _), (_, start, end) in self.items():
            if end > start:
                items.append((rule_name, start, end))
        items.sort(key=lambda item: (item[2].position(), item[1].position()))
        message = []
        for item in items:
            message.append("matched {: <20} {} -> {}\n".format(*item))
        message.append("\n")
        message.append("ERROR: {}: {}\n".format(
            self._latest_stream,
            self._latest_message
        ))
        return "".join(message)

    def fail(self, stream, message):
        if stream.position() >= self._latest_stream.position():
            self._latest_stream = stream
            self._latest_message = message
        raise _MatchError(self)

class _MatchError(Exception):

    def __init__(self, memo):
        Exception.__init__(self)
        self._memo = memo

    def describe(self):
        return self._memo.describe()

class _Stream(object):

    @classmethod
    def from_object(cls, memo, input_object):
        if isinstance(input_object, basestring):
            return _CharStream(memo, list(input_object))
        else:
            return _ObjectStream(memo, [input_object])

    def __init__(self, memo, objects):
        self._memo = memo
        self._objects = objects

    def fail(self, message):
        self._memo.fail(self, message)

    def next(self):
        if self.is_at_end():
            self.fail("not eof")
        next_object = self._objects[0]
        return (
            next_object,
            self._advance(next_object, self._objects[1:]),
        )

    def is_at_end(self):
        return len(self._objects) == 0

class _CharStream(_Stream):

    def __init__(self, memo, objects, line=1, column=1):
        _Stream.__init__(self, memo, objects)
        self._line = line
        self._column = column

    def position(self):
        return (self._line, self._column)

    def _advance(self, next_object, objects):
        if next_object == "\n":
            return _CharStream(self._memo, objects, self._line+1, 1)
        else:
            return _CharStream(self._memo, objects, self._line, self._column+1)

    def __str__(self):
        return "L{:03d}:C{:03d}".format(self._line, self._column)

class _ObjectStream(_Stream):

    def __init__(self, memo, objects, parent=(), position=0):
        _Stream.__init__(self, memo, objects)
        self._parent = parent
        self._position = position

    def position(self):
        return self._parent + (self._position,)

    def nested(self, input_object):
        return _ObjectStream(self._memo, input_object, self._parent+(self._position,))

    def _advance(self, next_object, objects):
        return _ObjectStream(self._memo, objects, self._parent, self._position+1)

    def __str__(self):
        return "[{}]".format(", ".join(str(x) for x in self.position()))

compile.sh

#!/bin/bash

set -e

rlmeta_compiler="$(pwd)/$1"

cd "$(dirname "$0")"

to_python_string() {
    python -c 'import sys; sys.stdout.write(repr(sys.stdin.read()))'
}

support_py=$(cat support.py)
support_py_string=$(to_python_string < support.py)
parser_py=$(python "$rlmeta_compiler" < parser.rlmeta)
codegenerator_py=$(python "$rlmeta_compiler" < codegenerator.rlmeta)

cat <<EOF
import sys

SUPPORT = $support_py_string

$support_py

$parser_py

$codegenerator_py

join = "".join

def compile_grammar(grammar):
    parser = Parser()
    code_generator = CodeGenerator()
    return code_generator.run("ast", parser.run("grammar", grammar))

if __name__ == "__main__":
    if "--support" in sys.argv:
        sys.stdout.write(SUPPORT)
    else:
        try:
            sys.stdout.write(compile_grammar(sys.stdin.read()))
        except _MatchError as e:
            sys.stderr.write(e.describe())
            sys.exit(1)
EOF

What is Rickard working on and thinking about right now?

Every month I write a newsletter about just that. You will get updates about my current projects and thoughts about programming, and also get a chance to hit reply and interact with me. Subscribe to it below.