Published on 2019-09-07.
The first article about RLMeta has this example grammar implementing a simple calculator:
Calculator { expression = | additive additive = | multitive:x '+' additive:y -> add(x y) | multitive multitive = | digit:x '*' multitive:y -> mul(x y) | digit digit = | '0'-'9':x -> int(x) }
It works, but if we were to add a second case to support subtraction, it would break because subtraction is a left associate operator but the calculator creates a right associative parse. In this article I show how it would brake and how it can be solved.
Operator associativity has to do with the order that operators are evaluated. The following expression can be evaluated in two ways:
1-2-3
Either as this (if the operator is left associative):
(1-2)-3
Or as this (if the operator is right associative):
1-(2-3)
Let's see how the calculator evaluates expressions.
The following grammar works like the calculator but only supports subtraction:
Calculator { expr = digit:x '-' expr:y -> sub(x y) | digit digit = '0'-'9':x -> int(x) }
The sub
function is implemented in two ways. The first creates an AST node representing a subtraction:
def sub(left, right): return ["-", left, right]
The second evaluates the subtraction expression:
def sub(left, right): return left - right
Let's see how the calculator handles the following two expressions:
1-2 1-2-3
Below are the ASTs and the evaluated results:
['-', 1, 2] => -1 ['-', 1, ['-', 2, 3]] => 2
For an expression with only two numbers, the calculator works as expected. But with more numbers, the parse is incorrect. The calculator parses 1-2-3
as 1-(2-3)
which is incorrect because subtraction is left associative and should thus be parsed as (1-2)-3
.
The original calculator does not have this problem because it only supports operators that evaluate the same no matter if parsed as left associative or right associative.
If we strip the semantic actions from the expr
rule it is easier to see what creates this right associative parse:
expr = digit '-' expr | digit
The topmost expression will be parsed as a digit followed by an arbitrary complex expression: (digit - (...))
.
Let's see if we can get a left associative parse instead.
In order to get a left associate parse (that we need for subtraction) we would like to write the expr
rule like this instead:
expr = expr '-' digit | digit
The whole grammar would then look like this:
Calculator { expr = expr:x '-' digit:y -> sub(x y) | digit digit = '0'-'9':x -> int(x) }
Unfortunately, this will not work with the parsing algorithm that RLMeta uses. In order to parse an expr
it first has to parse an expr
and so on, and it will get stuck in an infinite loop. The parsing algorithm is based on recursive descent parsing and the first choice in the expr
rule will thus be translated into something like this:
def expr(): x = expr() atom("-") y = digit() return sub(x, y)
We need to handle left associative operators differently.
The right associative parse that we saw earlier can be rewritten using a repetition:
expr = digit '-' expr | digit
It then becomes this:
expr = digit ('-' digit)*
It parses the same expressions, but it does not create a right associative parse. What does it create? Something that consist of a digit and a list of something. We can turn this into a parse tree using a function that we call leftAssoc
. Here is the complete grammar:
Calculator { expr = digit:x (op:y digit:z -> [y z])*:xs -> leftAssoc(x xs) digit = '0'-'9':x -> int(x) op = '-' -> makeSub() }
In the semantic action for expr
, x
is bound to a digit, and xs
is bound to a list of pairs consisting of an operator and a digit: [[op, digit], [op, digit], ...]
. The operator is a function that takes two arguments: the left hand side and the right hand side. (makeSub
returns the sub
function. It is used because currently global variables can not be referenced from semantic actions. Only global functions.) The function leftAssoc
turns this into a left associative tree:
def leftAssoc(expr, items): while items: op, rhs = items.pop(0) expr = op(expr, rhs) return expr
Here is how the expression 1-2-3
is parsed:
leftAssoc
is called with expr=1
and items=[[sub, 2], [sub, 3]]
.op=sub
and rhs=2
expr=sub(1, 2)
op=sub
and rhs=3
expr=sub(sub(1, 2), 3)
sub(sub(1, 2), 3)
is returnedLet's verify that the calculator handles the following expressions as intended:
1-2 1-2-3
Below are the ASTs and the evaluated results:
['-', 1, 2] => -1 ['-', ['-', 1, 2], 3] => -4
And indeed it does. We have now successfully parsed a left associate operator using RLMeta.
We can handle a right associative parse in RLMeta with a recursive rule as we have seen before:
expr = digit '-' expr | digit
However, we can also obtain a right associative parse by parsing operators as a list and then converting them into an AST with a rightAssoc
function. Here is a grammar that supports only exponentiation which is right associative:
Calculator { expr = digit:x (op:y digit:z -> [y z])*:xs -> rightAssoc(x xs) digit = '0'-'9':x -> int(x) op = '^' -> makePow() }
It has the exact same structure as the previous calculator, only now a different function is called to create the parse tree. The rightAssoc
function is also similar in structure to the leftAssoc
function, but the loop is replaced with an if statement followed by a recursive call:
def rightAssoc(expr, items): if items: op, rhs = items.pop(0) expr = op(expr, rightAssoc(rhs, items)) return expr
Here is how the expression 3^2^2
is parsed:
rightAssoc
is called with expr=3
and items=[[pow, 2], [pow, 2]]
op=pow
and rhs=2
expr=pow(3, rightAssoc(2, [[pow, 2]])
rightAssoc
is called with expr=2
and items=[[pow, 2]]
op=pow
and rhs=2
expr=pow(2, rightAssoc(2, [])
rightAssoc
is called with expr=2
and items=[]
2
is returnedpow(2, 2)
pow(3, pow(2, 2))
The two pow
functions are defined like this:
def pow(left, right): return ["^", left, right]
def pow(left, right): return left ** right
Let's verify that the calculator handles the following expressions as intended:
3^2 3^2^2
Below are the ASTs and the evaluated results:
['^', 3, 2] => 9 ['^', 3, ['^', 2, 2]] => 81
And indeed it does.
Now that we know how to handle both left and right associative operators, we can extend the calculator to handle more operators:
Calculator { expr = expr1 expr1 = expr2:x (op1:y expr2:z -> [y z])*:xs -> leftAssoc(x xs) expr2 = expr3:x (op2:y expr3:z -> [y z])*:xs -> leftAssoc(x xs) expr3 = digit:x (op3:y digit:z -> [y z])*:xs -> rightAssoc(x xs) digit = '0'-'9':x -> int(x) op1 = | '+' -> makeAdd() | '-' -> makeSub() op2 = | '*' -> makeMul() | '/' -> makeDiv() op3 = | '^' -> makePow() }
Different levels of expressions are used to handle operators having different precedence (multiplication is done before subtraction for example).
The additional arithmetic functions are defined like this:
def add(left, right): return ["+", left, right] def mul(left, right): return ["*", left, right] def div(left, right): return ["/", left, right]
def add(left, right): return left + right def mul(left, right): return left * right def div(left, right): return left / right
Let's see how the calculator handles the following expression:
1+2-3^2^2-5
Below is the AST and the evaluated result:
['-', ['-', ['+', 1, 2], ['^', 3, ['^', 2, 2]]], 5] => -83
If we enter the same expression in the Python prompt, but replace ^
with **
as Python uses, we get the same result:
>>> 1+2-3**2**2-5 -83
We can now parse complicated expressions correctly. However, if there are many precedence levels in a grammar, it might become difficult to read. Can we do better?
Here is the calculator grammar from the previous section rewritten in a cleaner way:
Calculator { expr = digit:x (op:y digit:z -> [y z])*:xs -> parseOps(x xs) digit = '0'-'9':x -> int(x) op = | '+' -> Op(makeAdd() "1" "left") | '-' -> Op(makeSub() "1" "left") | '*' -> Op(makeMul() "2" "left") | '/' -> Op(makeDiv() "2" "left") | '^' -> Op(makePow() "3" "right") }
In this version all operators are parsed as a list and then handed over to the parseOps
function. The operators also return an Op
object instead of just a function to evaluate the operator. The Op
object has information about the operator's precedence (given as a string only because RLMeta can not express integers in semantic actions) and associativity:
class Op(object): def __init__(self, fn, prec, assoc): self.fn = fn self.prec = int(prec) self.assoc = assoc
The parseOps
function uses that information to create a parse tree:
def parseOps(expr, items, min_level=0): while items and items[0][0].prec >= min_level: op, rhs = items.pop(0) if op.assoc == "left": next_level = op.prec + 1 else: next_level = op.prec expr = op.fn(expr, parseOps(rhs, items, next_level)) return expr
It combines the functionality of leftAssoc
and rightAssoc
and additionally also handles precedence. The increment of level if the operator is left associative ensures that only operators of higher precedence are parsed in the recursive call. If the next operator is of the same precedence, the recursive call will return immediately, and parseOps
will behave like leftAssoc
.
Here is how the expression 1+2+3*4 is parsed:
parseOps
is called with expr=1
, items=[[OpAdd, 2], [OpAdd, 3], [OpMul, 4]]
and min_level=0
prec >= 0
, the loop is enteredop=OpAdd
and rhs=2
next_level=2
expr=add(1, parseOps(2, [[OpAdd, 3], [OpMul, 4]], 2))
parseOps
is called with expr=2
, items=[[OpAdd, 3], [OpMul, 4]]
and min_level=2
prec < 2
, the loop is not entered, and 2
is returnedexpr=add(1, 2)
op=OpAdd
and rhs=3
next_level=2
expr=add(add(1, 2), parseOps(3, [[OpMul, 4]], 2))
parseOps
is called with expr=3
, items=[[OpMul, 4]]
and min_level=2
prec >= 2
, the loop is enteredop=OpMul
and rhs=4
next_level=3
expr=mul(3, parseOps(4, [], 3))
parseOps
is called with expr=4
, items=[]
and min_level=3
4
is returnedmul(3, 4)
add(add(1, 2), mul(3, 4))
Let's see how the calculator handles the following expressions:
1+2-3 1+2-3^2^2-5
Below are the ASTs and the evaluated results:
['-', ['+', 1, 2], 3] => 0 ['-', ['-', ['+', 1, 2], ['^', 3, ['^', 2, 2]]], 5] => -83
This looks correct. We now have a calculator whose grammar is more cleanly written and only uses one support function instead of two.
All three operator parser functions have a similar structure to them. Here they are for comparison:
def leftAssoc(expr, items): while items: op, rhs = items.pop(0) expr = op(expr, rhs) return expr
def rightAssoc(expr, items): if items: op, rhs = items.pop(0) expr = op(expr, rightAssoc(rhs, items)) return expr
def parseOps(expr, items, min_level=0): while items and items[0][0].prec >= min_level: op, rhs = items.pop(0) if op.assoc == "left": next_level = op.prec + 1 else: next_level = op.prec expr = op.fn(expr, parseOps(rhs, items, next_level)) return expr
The following articles helped me understand how to handle left associative operators in recursive descent parsers:
I used the following Bash script to run the examples:
compile() { echo "import sys" echo "import pprint" rlmeta --support rlmeta < "$1" cat "support.py" cat "$2" echo "makeAdd = lambda: add" echo "makeSub = lambda: sub" echo "makeMul = lambda: mul" echo "makeDiv = lambda: div" echo "makePow = lambda: pow" echo "try:" echo " for expr in sys.stdin.read().splitlines():" echo " pprint.pprint(Calculator().run('expr', expr), width=20)" echo "except _MatchError as e:" echo " sys.stderr.write(e.describe())" } ( while read -e expr; do echo "$expr" | python <(compile "$1" "ast.py") echo "=>" echo "$expr" | python <(compile "$1" "eval.py") echo "" done ) | head -n-1
It reads expressions from stdin one per line. Then it evaluates them both using the AST function and the eval function and prints the result. The make*
functions are also defined here to just return the respective function.
What is Rickard working on and thinking about right now?
Every month I write a newsletter about just that. You will get updates about my current projects and thoughts about programming, and also get a chance to hit reply and interact with me. Subscribe to it below.