Published on 2020-05-20.
This is a work in progress that will change. Like to see it finished? Let me know by sending me an email.In Parsing left associative operators using RLMeta we looked at how to parse a subset of mathematical expressions. In this article we will take it further and compile such expressions to x86 machine code. Thus creating a real compiler for expressions.
The compiler is implemented by transforming an expression in different stages.
Parser -> Abstract codegen -> X86 codegen -> GAS -> Assembler
The parser turns an expression into an AST, which the abstract code generator turns it into instructions for an imaginary stack machine, which the X86 code generator turns into assembly instructions for the x86 architecture. The final stage either turns those instructions into textual assembly instructions suitable for compilation with the GNU assembler (GAS) or assembles them straight into machine code.
The style of this article is inspired by Chains of meaning in the STEPS system (mirror) where a compiler is explained by explaining all stages in it. Examples are shown throughout what the input and output of the different stages are.
The parser is the same as in the previous article. It turns an expression into an AST:
Parser { expr = digit:x (op:y digit:z -> [y z])*:xs -> parseOps(x xs) digit = '0'-'9':x -> int(x) op = | '+' -> Op(makeAstNode("ADD") 1 "left") | '-' -> Op(makeAstNode("SUB") 1 "left") | '*' -> Op(makeAstNode("MUL") 2 "left") | '/' -> Op(makeAstNode("DIV") 2 "left") | '^' -> Op(makeAstNode("POW") 3 "right") }
The support functions are also defined similarly:
class Op(object): def __init__(self, fn, prec, assoc): self.fn = fn self.prec = prec self.assoc = assoc
def makeAstNode(name): def op(left, right): return [name, left, right] return op
def parseOps(expr, items, min_level=0): while items and items[0][0].prec >= min_level: op, rhs = items.pop(0) if op.assoc == "left": next_level = op.prec + 1 else: next_level = op.prec expr = op.fn(expr, parseOps(rhs, items, next_level)) return expr
Parsing the simple expression
3
looks like this:
3 =V========== Parser ============ 3
Parsing the slightly more complicated expression
1+2*3
looks like this:
1+2*3 =V========== Parser ============ ['ADD', 1, ['MUL', 2, 3]]
The abstract code generator turns an AST into instructions for an imaginary stack machine:
AbstractCodeGen { expr = | [.:name expr:left expr:right] -> [~right ~left name] | .:leaf -> ["PUSH" leaf] }
The right hand side of an expression is generated before the left hand side. This order does not matter, but it turns out that it's a bit easier to generate x86 instructions if the right hand side comes first.
Generating stack machine instructions for the simple expression
3
looks like this:
3 =V========== Parser ============ 3 =V===== AbstractCodeGen ======== ['PUSH', 3]
Generating stack machine instructions for the slightly more complicated expression
1+2*3
looks like this:
1+2*3 =V========== Parser ============ ['ADD', 1, ['MUL', 2, 3]] =V===== AbstractCodeGen ======== ['PUSH', 3, 'PUSH', 2, 'MUL', 'PUSH', 1, 'ADD']
The X86 code generator turns instructions for the imaginary stack machine into assembly instructions for the x86 architecture.
An assembly instruction is represented as a list with its mnemonic and operands:
["mnemonic" op1 op2 ..]
For an instruction performing a binary operation, op1
is the destination and op2
is the source. The binary operation performed is therefore the following:
op1 = op1 `operation` op2
An operand can be either a constant, a register, or a memory location.
rdi
is the stack pointer where the stack grows downwards (64-bit integer)eax
is a TOS (top of stack) register (32-bit signed integer)TODO: explain what registers and so on that we use
X86CodeGen { expr = [%*:xs] -> [~~xs ["ret"]] <<instructions>> }
ADD = -> [ ["add" ["reg" "eax"] ["addr" "rdi"]] ["sub" ["reg" "rdi"] ["const" 4 ]] ]
SUB = -> [ ["sub" ["reg" "eax"] ["addr" "rdi"]] ["sub" ["reg" "rdi"] ["const" 4 ]] ]
MUL = -> [ ["imul" ["reg" "eax"] ["addr" "rdi"]] ["sub" ["reg" "rdi"] ["const" 4 ]] ]
DIV = -> [ ["cdq" ] ["idiv" "long" ["addr" "rdi"] ] ["sub" ["reg" "rdi"] ["const" 4]] ]
POW = -> [ ]
PUSH = .:x -> [ ["add" ["reg" "rdi"] ["const" 4 ]] ["mov" ["addr" "rdi"] ["reg" "eax"]] ["mov" ["reg" "eax"] ["const" x ]] ]
Generating assembly instructions for the simple expression
3
looks like this:
3 =V========== Parser ============ 3 =V===== AbstractCodeGen ======== ['PUSH', 3] =V======== X86CodeGen ========== [['add', ['reg', 'rdi'], ['const', 4]], ['mov', ['addr', 'rdi'], ['reg', 'eax']], ['mov', ['reg', 'eax'], ['const', 3]], ['ret']]
Generating assembly instructions for the slightly more complicated expression
1+2*3
looks like this:
1+2*3 =V========== Parser ============ ['ADD', 1, ['MUL', 2, 3]] =V===== AbstractCodeGen ======== ['PUSH', 3, 'PUSH', 2, 'MUL', 'PUSH', 1, 'ADD'] =V======== X86CodeGen ========== [['add', ['reg', 'rdi'], ['const', 4]], ['mov', ['addr', 'rdi'], ['reg', 'eax']], ['mov', ['reg', 'eax'], ['const', 3]], ['add', ['reg', 'rdi'], ['const', 4]], ['mov', ['addr', 'rdi'], ['reg', 'eax']], ['mov', ['reg', 'eax'], ['const', 2]], ['imul', ['reg', 'eax'], ['addr', 'rdi']], ['sub', ['reg', 'rdi'], ['const', 4]], ['add', ['reg', 'rdi'], ['const', 4]], ['mov', ['addr', 'rdi'], ['reg', 'eax']], ['mov', ['reg', 'eax'], ['const', 1]], ['add', ['reg', 'eax'], ['addr', 'rdi']], ['sub', ['reg', 'rdi'], ['const', 4]], ['ret']]
The GAS generator turns assembly instructions into textual assembly instructions suitable for compilation with the GNU assembler (GAS): gcc -nostdlib file.s
:
GasGen { expr = [instr*:xs] -> { ".global expr\n" "expr:\n" xs } instr = | [mnemonic:x op:target op:source] -> { x " " source "," target "\n" } | [mnemonic:x op:arg] -> { x " " arg "\n" } | [mnemonic:x] -> { x "\n" } op = | ["reg" .:name] -> { "%" name } | ["addr" .:name] -> { "(%" name ")" } | ["const" .:value] -> { "$" value } mnemonic = .:x size:y -> pad(x y) size = "long" -> "l" | -> "" }
def pad(text, suffix): return (text+suffix).ljust(7)
#include <stdlib.h> #include <stdio.h> int expr(void* mem); int main() { void* mem = (void*)malloc(32*sizeof(int)); int result = expr(mem); printf("%d\n", result); }
Generating GAS instructions (and compiling and running them) for the simple expression
3
looks like this:
3 =V========== Parser ============ 3 =V===== AbstractCodeGen ======== ['PUSH', 3] =V======== X86CodeGen ========== [['add', ['reg', 'rdi'], ['const', 4]], ['mov', ['addr', 'rdi'], ['reg', 'eax']], ['mov', ['reg', 'eax'], ['const', 3]], ['ret']] =V========== GasGen ============ .global expr expr: add $4,%rdi mov %eax,(%rdi) mov $3,%eax ret =V===== GccCompileAndRun ======= 3
Generating GAS instructions (and compiling and running them) for the slightly more complicated expression
1+2*3
looks like this:
1+2*3 =V========== Parser ============ ['ADD', 1, ['MUL', 2, 3]] =V===== AbstractCodeGen ======== ['PUSH', 3, 'PUSH', 2, 'MUL', 'PUSH', 1, 'ADD'] =V======== X86CodeGen ========== [['add', ['reg', 'rdi'], ['const', 4]], ['mov', ['addr', 'rdi'], ['reg', 'eax']], ['mov', ['reg', 'eax'], ['const', 3]], ['add', ['reg', 'rdi'], ['const', 4]], ['mov', ['addr', 'rdi'], ['reg', 'eax']], ['mov', ['reg', 'eax'], ['const', 2]], ['imul', ['reg', 'eax'], ['addr', 'rdi']], ['sub', ['reg', 'rdi'], ['const', 4]], ['add', ['reg', 'rdi'], ['const', 4]], ['mov', ['addr', 'rdi'], ['reg', 'eax']], ['mov', ['reg', 'eax'], ['const', 1]], ['add', ['reg', 'eax'], ['addr', 'rdi']], ['sub', ['reg', 'rdi'], ['const', 4]], ['ret']] =V========== GasGen ============ .global expr expr: add $4,%rdi mov %eax,(%rdi) mov $3,%eax add $4,%rdi mov %eax,(%rdi) mov $2,%eax imul (%rdi),%eax sub $4,%rdi add $4,%rdi mov %eax,(%rdi) mov $1,%eax add (%rdi),%eax sub $4,%rdi ret =V===== GccCompileAndRun ======= 7
Generating GAS instructions (and compiling and running them) for the expression that uses all operators
2^4/2*3-4+5
looks like this:
2^4/2*3-4+5 =V========== Parser ============ ['ADD', ['SUB', ['MUL', ['DIV', ['POW', 2, 4], 2], 3], 4], 5] =V===== AbstractCodeGen ======== ['PUSH', 5, 'PUSH', 4, 'PUSH', 3, 'PUSH', 2, 'PUSH', 4, 'PUSH', 2, 'POW', 'DIV', 'MUL', 'SUB', 'ADD'] =V======== X86CodeGen ========== [['add', ['reg', 'rdi'], ['const', 4]], ['mov', ['addr', 'rdi'], ['reg', 'eax']], ['mov', ['reg', 'eax'], ['const', 5]], ['add', ['reg', 'rdi'], ['const', 4]], ['mov', ['addr', 'rdi'], ['reg', 'eax']], ['mov', ['reg', 'eax'], ['const', 4]], ['add', ['reg', 'rdi'], ['const', 4]], ['mov', ['addr', 'rdi'], ['reg', 'eax']], ['mov', ['reg', 'eax'], ['const', 3]], ['add', ['reg', 'rdi'], ['const', 4]], ['mov', ['addr', 'rdi'], ['reg', 'eax']], ['mov', ['reg', 'eax'], ['const', 2]], ['add', ['reg', 'rdi'], ['const', 4]], ['mov', ['addr', 'rdi'], ['reg', 'eax']], ['mov', ['reg', 'eax'], ['const', 4]], ['add', ['reg', 'rdi'], ['const', 4]], ['mov', ['addr', 'rdi'], ['reg', 'eax']], ['mov', ['reg', 'eax'], ['const', 2]], ['cdq'], ['idiv', 'long', ['addr', 'rdi']], ['sub', ['reg', 'rdi'], ['const', 4]], ['imul', ['reg', 'eax'], ['addr', 'rdi']], ['sub', ['reg', 'rdi'], ['const', 4]], ['sub', ['reg', 'eax'], ['addr', 'rdi']], ['sub', ['reg', 'rdi'], ['const', 4]], ['add', ['reg', 'eax'], ['addr', 'rdi']], ['sub', ['reg', 'rdi'], ['const', 4]], ['ret']] =V========== GasGen ============ .global expr expr: add $4,%rdi mov %eax,(%rdi) mov $5,%eax add $4,%rdi mov %eax,(%rdi) mov $4,%eax add $4,%rdi mov %eax,(%rdi) mov $3,%eax add $4,%rdi mov %eax,(%rdi) mov $2,%eax add $4,%rdi mov %eax,(%rdi) mov $4,%eax add $4,%rdi mov %eax,(%rdi) mov $2,%eax cdq idivl (%rdi) sub $4,%rdi imul (%rdi),%eax sub $4,%rdi sub (%rdi),%eax sub $4,%rdi add (%rdi),%eax sub $4,%rdi ret =V===== GccCompileAndRun ======= 1
The assembler turns assembly instructions into machine code:
Assembler { expr = [instr*:xs] -> [~~xs] instr = [%:x] -> x add = | reg64:r const:i -> [0x48 0x83 modRmDirect(r) ~littleEndian(i 1)] | reg32:r addr:m -> [] sub = | reg64:r const:i -> [] | reg32:r addr:m -> [] imul = | reg32:r addr:m -> [] cdq = -> [] idiv = | "long" addr:m -> [] mov = | addr:m reg32:r -> [0x89 modRmAddr(m r)] | reg32:r const:i -> [add(0xb8 r) ~littleEndian(i 4)] ret = -> [0xc3] addr = ["addr" reg64n:n] -> n reg64 = ["reg" reg64n:n] -> n reg64n = | "rdi" -> 7 reg32 = ["reg" reg32n:n] -> n reg32n = | "eax" -> 0 const = ["const" .:i] -> i }
def modRmDirect(register): return 0xc0 | register
def modRmAddr(addrRegister, desinationRegister): return 0x00 | (desinationRegister << 3) | addrRegister
def add(x, y): return x + y
def littleEndian(number, numBytes): if number < 0 or number >= 2**(8*numBytes): raise ValueError("{} is not in range [0, max]".format(number)) values = [] for i in range(numBytes): values.append(number & 0xff) number >>= 8 return values
import ctypes import mmap
libc = ctypes.cdll.LoadLibrary(None)
fn_mmap = libc.mmap fn_mmap.restype = ctypes.c_void_p fn_mmap.argtypes = ( ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int, ctypes.c_int, ctypes.c_int, ctypes.c_size_t, )
code = "".join([chr(x) for x in machine_code]) code_address = fn_mmap( None, len(code), mmap.PROT_READ|mmap.PROT_WRITE|mmap.PROT_EXEC, mmap.MAP_PRIVATE|mmap.MAP_ANONYMOUS, -1, 0 ) ctypes.memmove(code_address, code, len(code))
fn_malloc = libc.malloc fn_malloc.restype = ctypes.c_void_p fn_malloc.argtypes = (ctypes.c_size_t,)
expr_fn_type = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_void_p) expr_fn = ctypes.cast(code_address, expr_fn_type)
result = expr_fn(fn_malloc(1024))
Generating machine code (and running it) for the simple expression
3
looks like this:
3 =V========== Parser ============ 3 =V===== AbstractCodeGen ======== ['PUSH', 3] =V======== X86CodeGen ========== [['add', ['reg', 'rdi'], ['const', 4]], ['mov', ['addr', 'rdi'], ['reg', 'eax']], ['mov', ['reg', 'eax'], ['const', 3]], ['ret']] =V======== Assembler =========== [72, 131, 199, 4, 137, 7, 184, 3, 0, 0, 0, 195] =V======== X86Runner =========== 3
Generating machine code (and running it) for the slightly more complicated expression
1+2*3
looks like this:
1+2*3 =V========== Parser ============ ['ADD', 1, ['MUL', 2, 3]] =V===== AbstractCodeGen ======== ['PUSH', 3, 'PUSH', 2, 'MUL', 'PUSH', 1, 'ADD'] =V======== X86CodeGen ========== [['add', ['reg', 'rdi'], ['const', 4]], ['mov', ['addr', 'rdi'], ['reg', 'eax']], ['mov', ['reg', 'eax'], ['const', 3]], ['add', ['reg', 'rdi'], ['const', 4]], ['mov', ['addr', 'rdi'], ['reg', 'eax']], ['mov', ['reg', 'eax'], ['const', 2]], ['imul', ['reg', 'eax'], ['addr', 'rdi']], ['sub', ['reg', 'rdi'], ['const', 4]], ['add', ['reg', 'rdi'], ['const', 4]], ['mov', ['addr', 'rdi'], ['reg', 'eax']], ['mov', ['reg', 'eax'], ['const', 1]], ['add', ['reg', 'eax'], ['addr', 'rdi']], ['sub', ['reg', 'rdi'], ['const', 4]], ['ret']] =V======== Assembler =========== [72, 131, 199, 4, 137, 7, 184, 3, 0, 0, 0, 72, 131, 199, 4, 137, 7, 184, 2, 0, 0, 0, 72, 131, 199, 4, 137, 7, 184, 1, 0, 0, 0, 195] =V======== X86Runner =========== 1
Generating machine code (and running it) for the expression that uses all operators
2^4/2*3-4+5
looks like this:
2^4/2*3-4+5 =V========== Parser ============ ['ADD', ['SUB', ['MUL', ['DIV', ['POW', 2, 4], 2], 3], 4], 5] =V===== AbstractCodeGen ======== ['PUSH', 5, 'PUSH', 4, 'PUSH', 3, 'PUSH', 2, 'PUSH', 4, 'PUSH', 2, 'POW', 'DIV', 'MUL', 'SUB', 'ADD'] =V======== X86CodeGen ========== [['add', ['reg', 'rdi'], ['const', 4]], ['mov', ['addr', 'rdi'], ['reg', 'eax']], ['mov', ['reg', 'eax'], ['const', 5]], ['add', ['reg', 'rdi'], ['const', 4]], ['mov', ['addr', 'rdi'], ['reg', 'eax']], ['mov', ['reg', 'eax'], ['const', 4]], ['add', ['reg', 'rdi'], ['const', 4]], ['mov', ['addr', 'rdi'], ['reg', 'eax']], ['mov', ['reg', 'eax'], ['const', 3]], ['add', ['reg', 'rdi'], ['const', 4]], ['mov', ['addr', 'rdi'], ['reg', 'eax']], ['mov', ['reg', 'eax'], ['const', 2]], ['add', ['reg', 'rdi'], ['const', 4]], ['mov', ['addr', 'rdi'], ['reg', 'eax']], ['mov', ['reg', 'eax'], ['const', 4]], ['add', ['reg', 'rdi'], ['const', 4]], ['mov', ['addr', 'rdi'], ['reg', 'eax']], ['mov', ['reg', 'eax'], ['const', 2]], ['cdq'], ['idiv', 'long', ['addr', 'rdi']], ['sub', ['reg', 'rdi'], ['const', 4]], ['imul', ['reg', 'eax'], ['addr', 'rdi']], ['sub', ['reg', 'rdi'], ['const', 4]], ['sub', ['reg', 'eax'], ['addr', 'rdi']], ['sub', ['reg', 'rdi'], ['const', 4]], ['add', ['reg', 'eax'], ['addr', 'rdi']], ['sub', ['reg', 'rdi'], ['const', 4]], ['ret']] =V======== Assembler =========== [72, 131, 199, 4, 137, 7, 184, 5, 0, 0, 0, 72, 131, 199, 4, 137, 7, 184, 4, 0, 0, 0, 72, 131, 199, 4, 137, 7, 184, 3, 0, 0, 0, 72, 131, 199, 4, 137, 7, 184, 2, 0, 0, 0, 72, 131, 199, 4, 137, 7, 184, 4, 0, 0, 0, 72, 131, 199, 4, 137, 7, 184, 2, 0, 0, 0, 195] =V======== X86Runner =========== 2
Here is all source code needed to turn expressions into x86 machine code. Code to run machine code is dependent on situation, so it is not included here.
Parser:
Parser { expr = digit:x (op:y digit:z -> [y z])*:xs -> parseOps(x xs) digit = '0'-'9':x -> int(x) op = | '+' -> Op(makeAstNode("ADD") 1 "left") | '-' -> Op(makeAstNode("SUB") 1 "left") | '*' -> Op(makeAstNode("MUL") 2 "left") | '/' -> Op(makeAstNode("DIV") 2 "left") | '^' -> Op(makeAstNode("POW") 3 "right") }
Abstract code generator:
AbstractCodeGen { expr = | [.:name expr:left expr:right] -> [~right ~left name] | .:leaf -> ["PUSH" leaf] }
X86 code generator:
X86CodeGen { expr = [%*:xs] -> [~~xs ["ret"]] ADD = -> [ ["add" ["reg" "eax"] ["addr" "rdi"]] ["sub" ["reg" "rdi"] ["const" 4 ]] ] SUB = -> [ ["sub" ["reg" "eax"] ["addr" "rdi"]] ["sub" ["reg" "rdi"] ["const" 4 ]] ] MUL = -> [ ["imul" ["reg" "eax"] ["addr" "rdi"]] ["sub" ["reg" "rdi"] ["const" 4 ]] ] DIV = -> [ ["cdq" ] ["idiv" "long" ["addr" "rdi"] ] ["sub" ["reg" "rdi"] ["const" 4]] ] POW = -> [ ] PUSH = .:x -> [ ["add" ["reg" "rdi"] ["const" 4 ]] ["mov" ["addr" "rdi"] ["reg" "eax"]] ["mov" ["reg" "eax"] ["const" x ]] ] }
Assembler:
Assembler { expr = [instr*:xs] -> [~~xs] instr = [%:x] -> x add = | reg64:r const:i -> [0x48 0x83 modRmDirect(r) ~littleEndian(i 1)] | reg32:r addr:m -> [] sub = | reg64:r const:i -> [] | reg32:r addr:m -> [] imul = | reg32:r addr:m -> [] cdq = -> [] idiv = | "long" addr:m -> [] mov = | addr:m reg32:r -> [0x89 modRmAddr(m r)] | reg32:r const:i -> [add(0xb8 r) ~littleEndian(i 4)] ret = -> [0xc3] addr = ["addr" reg64n:n] -> n reg64 = ["reg" reg64n:n] -> n reg64n = | "rdi" -> 7 reg32 = ["reg" reg32n:n] -> n reg32n = | "eax" -> 0 const = ["const" .:i] -> i }
Support functions:
class Op(object): def __init__(self, fn, prec, assoc): self.fn = fn self.prec = prec self.assoc = assoc def makeAstNode(name): def op(left, right): return [name, left, right] return op def parseOps(expr, items, min_level=0): while items and items[0][0].prec >= min_level: op, rhs = items.pop(0) if op.assoc == "left": next_level = op.prec + 1 else: next_level = op.prec expr = op.fn(expr, parseOps(rhs, items, next_level)) return expr def modRmDirect(register): return 0xc0 | register def modRmAddr(addrRegister, desinationRegister): return 0x00 | (desinationRegister << 3) | addrRegister def add(x, y): return x + y def littleEndian(number, numBytes): if number < 0 or number >= 2**(8*numBytes): raise ValueError("{} is not in range [0, max]".format(number)) values = [] for i in range(numBytes): values.append(number & 0xff) number >>= 8 return values
In this article I use a version of RLMeta that builds upon the VM based version from memoizing failures, but has the following changes:
~
operators.%
operator.I will not explain how I made those changes here. The full source code is available on GitHub.
I used the following script to run the examples:
set -e compile() { echo "import sys" echo "import pprint" python rlmeta/rlmeta.py --support cat "support.py" python rlmeta/rlmeta.py < parser.rlmeta python rlmeta/rlmeta.py < abstractcodegen.rlmeta python rlmeta/rlmeta.py < x86codegen.rlmeta python rlmeta/rlmeta.py < gas.rlmeta python rlmeta/rlmeta.py < assembler.rlmeta echo "main()" } python <(compile) "$@"
def main(): grammars = { "parser": Parser(), "acodegen": AbstractCodeGen(), "xcodegen": X86CodeGen(), "gas": GasGen(), "assembler": Assembler(), "xrunner": X86Runner(), } try: expr = sys.stdin.read().strip() first = True for grammar_name in sys.argv[1:]: if grammar_name.startswith("@"): with open(grammar_name[1:], "w") as f: f.write(str(expr)) continue grammar = grammars[grammar_name] print_expr(expr) print_box(grammar.__class__.__name__) expr = grammar.run("expr", expr) print_expr(expr) except _MatchError as e: sys.stderr.write(e.describe())
def print_box(name): print("") print("=V{}==".format(" {} ".format(name).center(28, "="))) print("")
def print_expr(expr): if isinstance(expr, str): print(expr.strip()) else: pprint.pprint(expr, width=20)
set -e cat | bash compile.sh parser acodegen xcodegen gas @expr.s gcc driver.c expr.s -o expr echo "" echo "=V===== GccCompileAndRun =======" echo "" ./expr
set -e bash compile.sh parser acodegen xcodegen assembler xrunner
class X86Runner(object): def run(self, name, machine_code): <<run machine code>> return result
<<compiler>>
What is Rickard working on and thinking about right now?
Every month I write a newsletter about just that. You will get updates about my current projects and thoughts about programming, and also get a chance to hit reply and interact with me. Subscribe to it below.