Published on 12 February 2022.
A while ago I created a poster to showcase RLMeta. The version of RLMeta on the poster is based on the version from the memoizing failures article, but I made it smaller and more beautiful to better fit the poster. To be able to finish the poster, I had to stop making changes and put the source code on the poster. That was difficult because I felt the need for it to be perfect. Eventually I did stop polishing, and left a few items unresolved.
Almost immediately after I finished the poster, I started working on a second version. Initially, my plan was to make a second version of the poster. I started to fix the unresolved items and I was making progress. But somehow imperfections kept creeping in. It felt like a never ending game of chasing perfection. That’s when I decided that a second poster was probably not going to be worth it. But I still liked the new version of RLMeta.
Instead, I decided to attempt to present the new version in the style of a code walk through. In other words, another way to showcase RLMeta that is also a bit more practical. Compared to the poster version, this version could also be more easily improved because rendering the blog post is automatic whereas creating the layout of a poster requires manual work every time the source code changes. I also wanted to experiment with the walk through format because I thought it could be something worth putting into the README of a project.
The rest of this blog post consists of the walk through of the new version of RLMeta and a section on the most important changes from the poster version and motivations for them.
In order to follow along on this walk through, you need to download the version of RLMeta from here: rlmeta-poster-2.zip.
The zip file consists of the source code for the RLMeta compiler, a make script, and the compiler itself (rlmeta.py):
$ tree --dirsfirst
.
├── src
│   ├── assembler.rlmeta
│   ├── codegenerator.rlmeta
│   ├── main.py
│   ├── parser.rlmeta
│   └── support.py
├── make.py
└── rlmeta.py
1 directory, 7 files
The size of the source code is quite small:
$ wc -l src/*
  39 src/assembler.rlmeta
  57 src/codegenerator.rlmeta
  26 src/main.py
  60 src/parser.rlmeta
 240 src/support.py
 422 total
The compiler can be created from this source code only. We will see how later in this walk through.
Before we dive into how the RLMeta compiler is created, let’s explore RLMeta by writing a small, but real, program in it.
What types of programs can we write in RLMeta?
In RLMeta, we write grammars. Grammars have rules that specify how to match objects from an input stream and specify what should happen when objects are matched.
Let’s write a grammar that counts the number of objects in an input stream and produces a report:
$ cat object_counter.rlmeta
ObjectCounter {
  count = .*:xs -> { "number of objects = " len(xs) }
}
The main function of the RLMeta compiler is to transform grammars into Python code. If invoked without arguments, the compiler reads a grammar from stdin and writes Python code to stdout:
$ cat object_counter.rlmeta | python rlmeta.py
class ObjectCounter(Grammar):
    rules = {
        'count': 0
    }
    code = [
        PUSH_SCOPE,
        LIST_START,
        BACKTRACK,
        10,
        MATCH,
        'any',
        lambda x: True,
        LIST_APPEND,
        COMMIT,
        2,
        LIST_END,
        BIND,
        'xs',
        ACTION,
        lambda self: join(['number of objects = ', self.lookup('len')(self.lookup('xs'))]),
        POP_SCOPE,
        RETURN
    ]
This is equivalent to using the --compile command with a value of -, which stands for stdin:
$ cat object_counter.rlmeta | python rlmeta.py --compile - | head -n3
class ObjectCounter(Grammar):
    rules = {
        'count': 0
The file can also be specified directly like this:
$ python rlmeta.py --compile object_counter.rlmeta | head -n3
class ObjectCounter(Grammar):
    rules = {
        'count': 0
Don’t worry about understanding the generated code. We will explore it more later. Just note that the generated class inherits from a class called Grammar and that it uses some constants like PUSH_SCOPE and LIST_START. These things are defined in a support library which can be generated by the RLMeta compiler with the --support command:
$ python rlmeta.py --support | grep '^\(class\|def\)'
class VM:
def PUSH_SCOPE(vm):
def POP_SCOPE(vm):
def BACKTRACK(vm):
def COMMIT(vm):
def CALL(vm):
def CALL_(vm, pc):
def RETURN(vm):
def MATCH(vm):
def MATCH_(vm, fn, message):
def MATCH_CALL_RULE(vm):
def LIST_START(vm):
def LIST_APPEND(vm):
def LIST_END(vm):
def BIND(vm):
def ACTION(vm):
def PUSH_STREAM(vm):
def POP_STREAM(vm):
def FAIL(vm):
def FAIL_(vm, fail_message):
class SemanticAction(object):
class MatchError(Exception):
class Grammar(object):
class Runtime(dict):
class Counter(object):
def splice(depth, item):
def concat(lists):
def join(items, delimiter=""):
def indent(text, prefix=" "):
def compile_chain(grammars, source):
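Two of these helpers, join and indent, show up in almost every generated semantic action. Based on their definitions in the support library (listed in full at the end of this post), they behave roughly like this:
# Sketch of the two formatting helpers used by generated semantic actions.
def join(items, delimiter=""):
    # Flattens nested lists and joins everything into a single string.
    return delimiter.join(
        join(item, delimiter) if isinstance(item, list) else str(item)
        for item in items
    )

def indent(text, prefix="    "):
    # Prefixes every line of the text; used by the > < markers in grammars.
    return "".join(prefix+line for line in text.splitlines(True))

print(join(["number of objects = ", 6]))     # number of objects = 6
print(indent("rules = {\n'count': 0\n}\n"))  # every line prefixed with four spaces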
To create a complete program, we also have to write a main function that instantiates the ObjectCounter grammar and invokes its count rule.
Here is an example that passes stdin as the input stream to the count rule and prints the result to stdout:
$ cat object_counter_main.py
if __name__ == "__main__":
    import sys
    sys.stdout.write(ObjectCounter().run("count", sys.stdin.read()))
The --copy command of the RLMeta compiler can be used to copy this main file, as is, to the output.
Combining these pieces into a single compile command, we get this:
$ python rlmeta.py --support --compile object_counter.rlmeta --copy object_counter_main.py > object_counter.py
It will perform all commands in the given order and write all generated code concatenated into a single file.
Note that the support library comes before the grammar so that Grammar is defined by the time ObjectCounter is evaluated.
The object counter source code has now been compiled into a standalone Python program that can be run like this:
$ echo 'hello' | python object_counter.py
number of objects = 6
$ echo 'this is longer' | python object_counter.py
number of objects = 15
The counts include the trailing newline that echo adds. So programs in RLMeta are written mainly in grammar files, with some support functions written in Python. The RLMeta compiler can process all these files to produce a single Python file, which is the compiled program.
Now that we have an understanding of RLMeta, let’s look at the command that compiles the RLMeta compiler itself from the source code:
$ python rlmeta.py --embed SUPPORT src/support.py --support --compile src/parser.rlmeta --compile src/codegenerator.rlmeta --compile src/assembler.rlmeta --copy src/main.py > rlmeta-raw.py
The first command, --embed SUPPORT src/support.py, tells the compiler to generate a Python variable named SUPPORT containing the contents of the file src/support.py. The --embed command is the last command of the compiler that we have not yet seen. (The RLMeta compiler needs the support library in a variable so that it can generate it later with the --support command.)
Next, the --support command tells the compiler to generate the support library that is embedded in it.
The --compile ... commands tell the compiler to compile the given grammar files.
The last command, --copy src/main.py, tells the compiler to copy the main file verbatim, similar to what we did with the main file in the object counter.
The make script can be called with the --compile command to perform this exact function:
$ ./make.py --compile > rlmeta-compile.py
Compiling rlmeta using rlmeta.py
  O-----------------O
  | RLMeta compiled |
~~|     itself!     |
  O-----------------O
And all these files are exactly the same:
$ md5sum rlmeta.py rlmeta-compile.py rlmeta-raw.py
92396155e85e24fb45cb3e58e160e89e rlmeta.py
92396155e85e24fb45cb3e58e160e89e rlmeta-compile.py
92396155e85e24fb45cb3e58e160e89e rlmeta-raw.py
Thus, the RLMeta compiler reproduced itself exactly from the source code.
Let’s now look at how all commands of the RLMeta compiler are implemented. Here is the main function:
if __name__ == "__main__":
    import sys
    def read(path):
        if path == "-":
            return sys.stdin.read()
        with open(path) as f:
            return f.read()
    args = sys.argv[1:] or ["--compile", "-"]
    while args:
        command = args.pop(0)
        if command == "--support":
            sys.stdout.write(SUPPORT)
        elif command == "--copy":
            sys.stdout.write(read(args.pop(0)))
        elif command == "--embed":
            sys.stdout.write("{} = {}\n".format(
                args.pop(0),
                repr(read(args.pop(0)))
            ))
        elif command == "--compile":
            sys.stdout.write(compile_chain(
                [(Parser, "file"), (CodeGenerator, "asts"), (Assembler, "asts")],
                read(args.pop(0))
            ))
        else:
            sys.exit("ERROR: Unknown command '{}'".format(command))
It parses the command line arguments and handles the processing of all commands.
The --compile command is the most complex of them all. It calls the compile_chain function, which runs the given grammars/rules in order (in this case the input will first be parsed, then passed to the code generator, and finally passed to the assembler) and prints a pretty error message to stderr upon failure:
def compile_chain(grammars, source):
    import os
    import sys
    import pprint
    for grammar, rule in grammars:
        try:
            source = grammar().run(rule, source)
        except MatchError as e:
            marker = "<ERROR POSITION>"
            if os.isatty(sys.stderr.fileno()):
                marker = f"\033[0;31m{marker}\033[0m"
            if isinstance(e.stream, str):
                stream_string = e.stream[:e.pos] + marker + e.stream[e.pos:]
            else:
                stream_string = pprint.pformat(e.stream)
            sys.exit("ERROR: {}\nPOSITION: {}\nSTREAM:\n{}".format(
                e.message,
                e.pos,
                indent(stream_string)
            ))
    return source
This function might be useful for other RLMeta programs as well. That is why it’s included in the support library and not only in the main file.
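For example, a small sketch of how another RLMeta program could reuse it (assuming the ObjectCounter grammar from earlier has been compiled together with the support library so that both names are in scope):
# Sketch: run a single grammar/rule chain through compile_chain.
# On a MatchError it exits with the pretty error message shown above.
report = compile_chain([(ObjectCounter, "count")], "hello")
print(report)  # number of objects = 5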
Let’s now follow a compilation of an example grammar to learn more about how a grammar file is turned into Python code. Here it is:
$ cat example.rlmeta
Example {
  main = .
}
And this is what it compiles to:
$ python rlmeta.py --compile example.rlmeta
class Example(Grammar):
    rules = {
        'main': 0
    }
    code = [
        PUSH_SCOPE,
        MATCH,
        'any',
        lambda x: True,
        POP_SCOPE,
        RETURN
    ]
The transformations that the grammar goes through are defined in the main function:
[(Parser, "file"), (CodeGenerator, "asts"), (Assembler, "asts")],
So first the grammar file is passed to the file rule of the parser:
file =
  | (space grammar)*:xs space !. -> xs
It in turn calls the grammar rule to parse all grammars in the file:
grammar =
  | name:x space '{' rule*:ys space '}' -> ["Grammar" x ~ys]
This rule matches the name, the open curly brace, a set of rules, and the closing curly brace. It will then return an AST that looks like this:
["Grammar",
"Example",
... ]
All grammar AST nodes are handed off to the asts rule in the code generator:
asts = ast*:xs !. -> xs
It in turn calls the ast rule to process each AST node:
ast = [%:x] -> x
The ast rule treats the first argument in the AST as a rule name and calls that rule. In this case Grammar:
Grammar = .:x ast*:ys -> ["Grammar" x ~~ys]
The code generator creates a new AST node representing a grammar. But this AST node is slightly different and meant to be processed by the assembler. The result is this:
["Grammar",
"Example",
for consumption by assembler ...
... ast nodes ]
This AST node and all the others that the code generator produces are passed to the asts rule in the assembler:
asts = ast*:xs !. -> { xs }
It in turn calls the ast rule:
ast = [%:x] -> x
Which does the same trick again, now invoking the Grammar rule (in the assembler) which looks like this:
Grammar = .:x ast*:ys -> list():rules
                      -> list():code
                      -> dict():labels
                      -> list():patches
                      -> ys
                      -> run("asts" patches)
                      -> { "class " x "(Grammar):\n" >
                             "rules = {\n" > join(rules ",\n") < "\n}\n"
                             "code = [\n" > join(code ",\n") < "\n]\n"
                           < }
This rule can be read as follows:
- create a variable rules which is a list
- create a variable code which is a list
- create a variable labels which is a dictionary
- create a variable patches which is a list
- evaluate ys (the rules of the grammar)
- interpret the patches variable as a list of AST nodes and process them with the asts rule of this grammar
- format the final Python class using the collected rules and code

The generated code from our example looks like this:
class Example(Grammar):
    rules = {
        ...
    }
    code = [
        ...
    ]
To understand how the rules and code sections are generated, we just have to follow a few more transformations.
Let’s look at one more and see how the rule in our example grammar is transformed.
First, the rule is parsed by the rule rule in the parser:
rule =
  | name:x space '=' choice:y -> ["Rule" x y]
First the name is matched, then the equals sign, and then an expression representing the body of the rule.
In our case, this rule produces this AST node:
["Rule",
"main",
... ]
That node is going to be processed by the Rule rule in the code generator:
Rule = .:x ast:y -> [["Rule" x]
                     ~y
                     ["OpCode" "RETURN"]]
Generating an AST node that looks like this:
["Rule", "main"],
[
...,"OpCode", "RETURN"]
[ ]
Here we can see that the AST from the code generator looks a bit more like assembly code than a representation of the syntax in the grammar.
The first child in this AST node is going to be handled by the Rule rule in the assembler:
Rule = .:x -> add(rules { repr(x) ": " len(code) })
           -> set(labels x len(code))
It does two things:
- adds an entry to the rules list
- adds an entry to the labels dictionary to map a label to an index in the code list

At this point, the variables have the following values.
rules = [
    "'main': 0",
]
labels = {
    'main': 0,
}
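The add and set functions used here are not assembler magic; they are small helpers that the Grammar base class puts into the runtime (see Grammar.run in the support library). In plain Python, the Rule step above amounts to roughly this:
# Plain-Python sketch of the Rule assembler step for 'main'.
# rules, labels and code stand in for the dynamically scoped variables
# created by the Grammar rule; here they start out empty.
rules, labels, code = [], {}, []
rules.append("{}: {}".format(repr("main"), len(code)))  # add(rules { repr(x) ": " len(code) })
labels["main"] = len(code)                               # set(labels x len(code))
print(rules)   # ["'main': 0"]
print(labels)  # {'main': 0}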
The second child in the AST node is going to be handled by the OpCode rule in the assembler:
OpCode = .:x -> add(code x)
It adds the given op code to the code list, giving it this value:
code = [
    ...,
    "RETURN",
]
When the rules and code variables are expanded, the resulting class looks like this:
class Example(Grammar):
    rules = {
        'main': 0
    }
    code = [
        ...,
        RETURN
    ]
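Once the support library and this class end up in the same file, the grammar can be used like any other grammar. A small sketch (the result is the matched object, since the rule has no semantic action):
# Sketch: running the Example grammar on a two-character input.
# main matches a single object and, with no semantic action, returns it.
print(Example().run("main", "ab"))  # a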
Hopefully you are now comfortable following transformations yourself to understand how a compilation is done.
When the make script is called without arguments, it performs a meta compilation and runs a few tests:
$ ./make.py
Compiling rlmeta using rlmeta.py
Writing rlmeta1.py
Test: Has its own support library
Test: Disallow semantic action in the middle
ERROR: expected }
POSITION: 22
STREAM:
Grammar { x = . -> [] <ERROR POSITION>. }
Test: Call unknown rule foo
Moving rlmeta1.py -> rlmeta.py
  O-----------------O
  | RLMeta compiled |
~~|     itself!     |
  O-----------------O
The point of a meta compilation is to create a new version of RLMeta that can still reproduce itself from the source code.
In the output above, we can see that it compiled RLMeta and wrote the result to rlmeta1.py. In this case, since it is exactly the same as rlmeta.py, the compilation stopped there and a few more tests were run using this compiler. But if we make changes to the source code, rlmeta1.py will most likely not be exactly the same as rlmeta.py, and a few more compilations might be needed. I’ve written about the details of meta compilation in a previous blog post.
So the purpose of the make script is to ease meta compilations and also to run a test suite on the newly generated metacompiler before accepting it.
The make script can also be used to perform a single compilation of RLMeta with the --compile argument, as we saw earlier.
This section explains the most important changes in this version of RLMeta compared to the poster version.
First of all, I wanted to work on the unresolved items which were the following:
In the poster article, I also had a few notes about future versions:
The smaller it is, the easier it is to understand and therefore extend. The more flexible it is to extend, the better. If I make another poster version, it would therefore focus on being smaller and more flexible. Since all successive versions of RLMeta have been faster than the ones before, performance is also important. But small size, clarity, and flexibility come first.
I used these guidelines to decide if certain changes should go into the new version or not.
One interesting thing to note is that the guidelines are sometimes contradictory. Writing clear code might mean more lines of code, which makes the code base larger. Perhaps that’s also why I got stuck chasing perfection. I thought I made something easier to read, but it ended up costing 10 extra lines of code. Should I include it?
One thing that I left in the poster version that still annoyed me was that labels were generated at match time, not at semantic action evaluation time. It would not produce incorrect results. At worst, some labels end up not being used because the counter value was captured in a rule that later failed. But dealing with labels at match time does not make sense. It should really happen at semantic action evaluation time.
Here is what the Not rule looks like in the poster version:
Not = ast:x #:a #:b -> { "I('BACKTRACK', " b ")\n"
                         x
                         "I('COMMIT', " a ")\n"
                         "LABEL(" a ")\n"
                         "I('FAIL', 'no match expected')\n"
                         "LABEL(" b ")\n" }
Here is what the Not rule looks like after the change:
Not = ast:x -> label():a -> label():b
            -> { "I('BACKTRACK', " b ")\n"
                 x
                 "I('COMMIT', " a ")\n"
                 "LABEL(" a ")\n"
                 "I('FAIL', 'no match expected')\n"
                 "LABEL(" b ")\n" }
This change puts label generation where it belongs, in semantic actions, and thus makes the implementation more clear. The VM is no longer concerned with labels. It is only concerned with matching. This change required a bit of rework of how semantic actions work. Previously only one expression was allowed:
<match expression> -> <semantic action expression>
Now multiple expressions are allowed:
<match expression> -> <semantic action expression>:x
                   -> <semantic action expression>
                   -> <semantic action expression>
The results of expressions can also be bound to names, which subsequent expressions can refer to. label is such a variable; it is set internally to a function that generates increasing integers starting at 0.
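The support library binds label to a small counter object (the Counter class in the listing at the end of this post), so each call yields the next integer:
# Minimal sketch of the counter behind label(), mirroring Counter in the support library.
class Counter(object):
    def __init__(self):
        self.value = 0
    def __call__(self):
        result = self.value
        self.value += 1
        return result

label = Counter()
print(label(), label(), label())  # 0 1 2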
The implementation of this change also increases the flexibility of RLMeta. For example, it is now possible to write a semantic action that generates code in different sections like this:
ExampleBuffers {
  program = ast:x -> []:header
                  -> { "# HEADER\n"
                       header
                       "# BODY\n"
                       x }
  ast = [%:x] -> x
  Program = ast*
  Function = .:name -> add(header { "def " name "\n" })
                    -> { name "()\n" }
}
The expression []:header creates a list and assigns it to the variable header. When x is evaluated in the next step, the semantic action for the Function rule will be evaluated, which can then access the header variable defined earlier. These variables are not lexically scoped, but dynamically scoped. If a variable is defined at runtime, it will be accessible. It also means that the Function rule cannot be run without program being run first, or the header variable will not be defined.
Here is an example AST representing a program:
AST = [
    ['Program',
        ['Function', 'foo'],
        ['Function', 'bar']
    ]
]
When the program rule is run on the example input, the following is output:
$ python example_buffers.py
# HEADER
def foo
def bar
# BODY
foo()
bar()
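For completeness, here is a sketch of a main file that could produce this output. It assumes example_buffers.py was created by compiling the ExampleBuffers grammar together with the support library and this main file, just like the object counter earlier:
# Hypothetical main file for the ExampleBuffers example.
if __name__ == "__main__":
    import sys
    AST = [
        ['Program',
            ['Function', 'foo'],
            ['Function', 'bar']
        ]
    ]
    sys.stdout.write(ExampleBuffers().run("program", AST))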
In summary, this change is as follows:
- The label syntax (#) in the parser is removed.
- Semantic actions instead use the label function to generate labels.

(The complete initial diff for this change can be found on GitHub.)
The increased clarity and flexibility come at a price: the size increases and the performance drops.
The parser and the code generator are mostly the same. The greatest addition is in the support library, which is expected when semantic action evaluation becomes more complex. The drop in performance is likely due to more function calls when evaluating semantic actions. Even though size and performance got worse, I believe the clarity and flexibility gain is worth it.
To compile the poster version of RLMeta, you ran the following command:
./compile.sh rlmeta.py
In a way, the compiler could not compile itself, but relied on a Bash script for gluing things together. It would call the rlmeta.py compiler for certain tasks and use Bash and Python for other tasks.
As we have already seen, the new version of RLMeta compiles itself like this:
python rlmeta.py \
    --embed SUPPORT src/support.py \
    --support \
    --compile src/parser.rlmeta \
    --compile src/codegenerator.rlmeta \
    --compile src/assembler.rlmeta \
    --copy src/main.py \
    > rlmeta.py
The rlmeta.py compiler now has support (via --embed and --copy) for doing what the Bash script previously did.
This makes the compiler slightly larger, but it feels so much cleaner.
In addition, the extra features are useful when writing programs in RLMeta. Those programs can now also be compiled with a single command, and there is no need to concatenate different pieces together.
(The complete diff for this change can be found on GitHub.)
The third thing that annoyed me in the poster version was the readability of the code generator. For example, the Not rule looked like this:
Not = ast:x -> label():a -> label():b
            -> { "I('BACKTRACK', " b ")\n"
                 x
                 "I('COMMIT', " a ")\n"
                 "LABEL(" a ")\n"
                 "I('FAIL', 'no match expected')\n"
                 "LABEL(" b ")\n" }
It generates a string which contains Python code that calls functions to create “assembly” code. So part of the compilation is actually happening at runtime here. It is mixed and messy.
The new Not rule looks like this:
Not = ast:x -> label():a -> label():b
            -> [["OpCode" "BACKTRACK"]
                ["Target" b]
                ~x
                ["OpCode" "COMMIT"]
                ["Target" a]
                ["Label" a]
                ["OpCode" "FAIL"]
                ["Value" "no match"]
                ["Label" b]]
Instead of outputting Python code directly, it now generates abstract assembly code. Then a new third pass, the assembler, turns those instructions into Python code as well as resolves label positions. So no more compilation at runtime.
This reads better because the purpose of the code generator is now a bit narrower. It can focus on one thing and leave the rest to the assembler.
Adding another pass also opens up the possibility of doing peephole optimizations on the abstract assembly code before the assembler turns the instructions into Python code.
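No such pass exists in this version, but as a hypothetical sketch, it could be a function that rewrites the code generator's list of instruction nodes before the assembler sees them, for example dropping a PUSH_SCOPE that is immediately followed by a POP_SCOPE:
# Hypothetical peephole pass (not part of RLMeta) over the abstract assembly nodes.
# It removes PUSH_SCOPE/POP_SCOPE pairs that enclose no instructions.
def peephole(nodes):
    result = []
    for node in nodes:
        if (result
                and result[-1] == ["OpCode", "PUSH_SCOPE"]
                and node == ["OpCode", "POP_SCOPE"]):
            result.pop()  # the pair has no effect, so drop both instructions
        else:
            result.append(node)
    return result

print(peephole([["Rule", "x"],
                ["OpCode", "PUSH_SCOPE"],
                ["OpCode", "POP_SCOPE"],
                ["OpCode", "RETURN"]]))
# [['Rule', 'x'], ['OpCode', 'RETURN']]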
In the poster version, the virtual machine was written as a single function with one loop like this:
def vm(instructions, labels, start_rule, stream):
    ...
    while True:
        name, arg1, arg2 = instructions[pc]
        if name == "PUSH_SCOPE":
            scope_stack.append(scope)
            scope = {}
            pc += 1
            continue
        elif name == "BACKTRACK":
            ...
        ...
It was written like that to be as fast as possible. It avoided function calls. It avoided class variable lookups by avoiding classes. All variables used were defined locally in the vm function. Because function calls could not be used, some code was also duplicated.
I decided that I would not consider performance at all, and instead try to write the VM as clearly as I could. I ended up with a VM class to hold some state and instruction functions that operate on an instance of a VM:
class VM:
    def __init__(self, code, rules):
        ...
    ...
def PUSH_SCOPE(vm):
    vm.scope_rest = (vm.scope, vm.scope_rest)
    vm.scope = {}
def BACKTRACK(vm):
    ...
...
As I noted earlier, I’m not sure I am happy with this result. I’m not convinced that it reads better. The biggest upside is that since function calls are now allowed, part of the VM can be expressed more clearly without repetition.
Before I ended up with this VM, I experimented with a language for writing virtual machines that compiled to Python code. You could define instructions and the arguments they took and define macros for code re-use. It was basically a small macro language on top of Python. It looked something like this:
def vm(code, rules, start_rule, stream):
    action = SemanticAction(None)
    pc = rules[start_rule]
    call_backtrack_stack = []
    stream, stream_rest = (stream, None)
    pos, pos_rest = (0, tuple())
    scope, scope_rest = (None, None)
    fail_message = None
    latest_fail_message, latest_fail_pos = (None, tuple())
    memo = {}
    definstruction PUSH_SCOPE():
        scope_rest = (scope, scope_rest)
        scope = {}
And here is how macros were used:
definstruction FAIL(arg_message):
    fail_message = (arg_message,)
    #FAIL
defmacro FAIL:
    ...
When this was compiled, something similar to the vm function above was generated: a single function that was intended to run as fast as possible. But you could still write the VM quite clearly.
I liked the result of that, but it introduced yet another language and made compilation and metacompilation more complicated. For that reason, I decided against it.
Perhaps another approach would be to consider the VM a separate piece, not to be included in compilations and metacompilations. But they are also strongly connected. Say, for example, that an optimizer decides to output a new VM instruction; then the VM has to change.
I am not entirely clear about the interface here between the VM and the rest of the compiler.
Another feature that was added in this version was the ability to run a grammar rule recursively from a semantic action.
This was initially needed to implement recursive macros in the VM language mentioned in the previous section, but it made its way into RLMeta to support the patching of assembly instructions.
When a Target instruction is encountered, the patches list is populated with a command:
Target = .:x -> add(patches ["Patch" len(code) x])
             -> add(code "placeholder")
These commands are then evaluated by running the asts rule on the patches list. This starts another parse on the given stream.
-> run("asts" patches)
The new parse has access to all the runtime variables that the semantic action that invokes it has. That is why a Patch instruction can modify the code array and insert the correct index there instead of the placeholder:
Patch = .:x .:y -> set(code x get(labels y))
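In plain Python terms, a single Patch command is just an item assignment into code using the index recorded in labels (the values below are made up for illustration):
# Sketch of one Patch command: set(code x get(labels y)) in plain Python.
code = ["PUSH_SCOPE", "placeholder", "RETURN"]  # index 1 holds an unresolved target
labels = {"end": 2}                             # recorded earlier by a Label node
index, label = 1, "end"                         # the two values carried by the Patch node
code[index] = labels[label]
print(code)  # ['PUSH_SCOPE', 2, 'RETURN']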
Many more small changes were made. Here are a few notes about them.
- Various renames to make intention more clear and reformats to improve readability.
- Various clean ups in the parser:
  - Only allow semantic actions at the very end of a rule.
  - Make sure the whole file is parsed so that junk after a grammar results in an error.
- Adapt to Python 3.
On the one hand, I’m quite happy with the improvements to RLMeta that I was able to make. The code feels more clear and flexible. Definitely a better version of RLMeta.
On the other hand, this article turned out to have the same problem as the poster. It just kept growing and growing, and at some point I had to stop working on it, leave some issues unresolved, and call the article finished. For example, I am not happy with how the new VM looks: a mix of classes, functions, and helpers.
I decided to set up a repo on GitHub for RLMeta where it can continue to be improved.
I plan for it to contain the base version of RLMeta, which is the minimal version that is able to compile itself and maintain properties such as being flexible, easy to extend, and easy to understand. Then I want to include examples as well to show how RLMeta can be used and how you can extend it in various ways.
Here is all the source code and also the make script for this version of RLMeta.
Parser {
  file =
    | (space grammar)*:xs space !. -> xs
  grammar =
    | name:x space '{' rule*:ys space '}' -> ["Grammar" x ~ys]
  rule =
    | name:x space '=' choice:y -> ["Rule" x y]
  choice =
    | (space '|')?
      sequence:x (space '|' sequence)*:xs -> ["Or" x ~xs]
  sequence =
    | expr*:xs maybeAction:ys -> ["Scope" ["And" ~xs ~ys]]
  expr =
    | expr1:x space ':' name:y -> ["Bind" y x]
    | expr1
  expr1 =
    | expr2:x space '*' -> ["Star" x]
    | expr2:x space '?' -> ["Or" x ["And"]]
    | space '!' expr2:x -> ["Not" x]
    | space '%' -> ["MatchCallRule"]
    | expr2
  expr2 =
    | name:x !(space '=') -> ["MatchRule" x]
    | space char:x '-' char:y -> ["MatchObject" ["Range" x y]]
    | space '\'' (!'\'' matchChar)*:xs '\'' -> ["And" ~xs]
    | space '.' -> ["MatchObject" ["Any"]]
    | space '(' choice:x space ')' -> x
    | space '[' expr*:xs space ']' -> ["MatchList" ["And" ~xs]]
  matchChar =
    | innerChar:x -> ["MatchObject" ["Eq" x]]
  maybeAction =
    | actionExpr:x -> [["Action" x]]
    | -> []
  actionExpr =
    | space '->' hostExpr:x
      (space ':' name | -> ""):y actionExpr:z -> ["Set" y x z]
    | space '->' hostExpr:x -> x
  hostExpr =
    | space string:x -> ["String" x]
    | space '[' hostListItem*:xs space ']' -> ["List" ~xs]
    | space '{' formatExpr*:xs space '}' -> ["Format" ~xs]
    | var:x space '(' hostExpr*:ys space ')' -> ["Call" x ~ys]
    | var:x
  hostListItem =
    | space '~'*:ys hostExpr:x -> ["ListItem" len(ys) x]
  formatExpr =
    | space '>' formatExpr*:xs space '<' -> ["Indent" ["Format" ~xs]]
    | hostExpr
  var =
    | name:x !(space '=') -> ["Lookup" x]
  string = '"' (!'"' innerChar)*:xs '"' -> { xs }
  char = '\'' !'\'' innerChar :x '\'' -> x
  innerChar = '\\' escape | .
  escape = '\\' -> "\\" | '\'' -> "'"
         | '"' -> "\"" | 'n' -> "\n"
  name = space nameStart:x nameChar*:xs -> { x xs }
  nameStart = 'a'-'z' | 'A'-'Z'
  nameChar = 'a'-'z' | 'A'-'Z' | '0'-'9'
  space = (' ' | '\n')*
}
CodeGenerator {
  Grammar = .:x ast*:ys -> ["Grammar" x ~~ys]
  Rule = .:x ast:y -> [["Rule" x]
                       ~y
                       ["OpCode" "RETURN"]]
  Or =
    | ast:x Or:y -> label():a -> label():b
                 -> [["OpCode" "BACKTRACK"]
                     ["Target" a]
                     ~x
                     ["OpCode" "COMMIT"]
                     ["Target" b]
                     ["Label" a]
                     ~y
                     ["Label" b]]
    | ast
  Scope = ast:x -> [["OpCode" "PUSH_SCOPE"]
                    ~x
                    ["OpCode" "POP_SCOPE"]]
  And = ast*:xs -> [~~xs]
  Bind = .:x ast:y -> [~y
                       ["OpCode" "BIND"]
                       ["Value" x]]
  Star = ast:x -> label():a -> label():b
                -> [["OpCode" "LIST_START"]
                    ["Label" a]
                    ["OpCode" "BACKTRACK"]
                    ["Target" b]
                    ~x
                    ["OpCode" "LIST_APPEND"]
                    ["OpCode" "COMMIT"]
                    ["Target" a]
                    ["Label" b]
                    ["OpCode" "LIST_END"]]
  Not = ast:x -> label():a -> label():b
              -> [["OpCode" "BACKTRACK"]
                  ["Target" b]
                  ~x
                  ["OpCode" "COMMIT"]
                  ["Target" a]
                  ["Label" a]
                  ["OpCode" "FAIL"]
                  ["Value" "no match"]
                  ["Label" b]]
  MatchCallRule = -> [["OpCode" "MATCH_CALL_RULE"]]
  MatchRule = .:x -> [["OpCode" "CALL"]
                      ["Target" x]]
  MatchObject = .:x -> [["OpCode" "MATCH"]
                        x]
  MatchList = ast:x -> [["OpCode" "PUSH_STREAM"]
                        ~x
                        ["OpCode" "POP_STREAM"]]
  Action = .:x -> [["OpCode" "ACTION"]
                   ["Action" x]]
  asts = ast*:xs !. -> xs
  ast = [%:x] -> x
}
Assembler {
  Grammar = .:x ast*:ys -> list():rules
                        -> list():code
                        -> dict():labels
                        -> list():patches
                        -> ys
                        -> run("asts" patches)
                        -> { "class " x "(Grammar):\n" >
                               "rules = {\n" > join(rules ",\n") < "\n}\n"
                               "code = [\n" > join(code ",\n") < "\n]\n"
                             < }
  Rule = .:x -> add(rules { repr(x) ": " len(code) })
             -> set(labels x len(code))
  Label = .:x -> set(labels x len(code))
  Target = .:x -> add(patches ["Patch" len(code) x])
               -> add(code "placeholder")
  Patch = .:x .:y -> set(code x get(labels y))
  OpCode = .:x -> add(code x)
  Value = .:x -> add(code repr(x))
  Eq = .:x -> add(code repr(x))
           -> add(code { "lambda x: x == " repr(x) })
  Range = .:x .:y -> add(code repr({"range " repr(x) "-" repr(y)}))
                  -> add(code { "lambda x: " repr(x) " <= x <= " repr(y) })
  Any = -> add(code repr("any"))
        -> add(code "lambda x: True")
  Action = ast:x -> add(code {"lambda self: " x})
  Set = .:x ast:y ast:z -> { "self.bind(" repr(x) ", " y ", lambda: " z ")" }
  String = .:x -> repr(x)
  List = astList:x -> { "concat([" x "])" }
  ListItem = .:x ast:y -> { "splice(" repr(x) ", " y ")" }
  Format = astList:x -> { "join([" x "])" }
  Indent = ast:x -> { "indent(" x ", "
                      "self.lookup('indentprefix'))" }
  Call = ast:x astList:y -> { x "(" y ")" }
  Lookup = .:x -> { "self.lookup(" repr(x) ")" }
  asts = ast*:xs !. -> { xs }
  astList = ast*:xs -> join(xs ", ")
  ast = [%:x] -> x
}
class VM:
    def __init__(self, code, rules):
        self.code = code
        self.rules = rules
    def run(self, start_rule, stream):
        self.action = SemanticAction(None)
        self.pc = self.rules[start_rule]
        self.call_backtrack_stack = []
        self.stream, self.stream_rest = (stream, None)
        self.pos, self.pos_rest = (0, tuple())
        self.scope, self.scope_rest = (None, None)
        self.latest_fail_message, self.latest_fail_pos = (None, tuple())
        self.memo = {}
        while True:
            result = self.pop_arg()(self)
            if result:
                return result
    def pop_arg(self):
        code = self.code[self.pc]
        self.pc += 1
        return code
def PUSH_SCOPE(vm):
    vm.scope_rest = (vm.scope, vm.scope_rest)
    vm.scope = {}
def POP_SCOPE(vm):
    vm.scope, vm.scope_rest = vm.scope_rest
def BACKTRACK(vm):
    vm.call_backtrack_stack.append((
        vm.pop_arg(), vm.stream, vm.stream_rest, vm.pos, vm.pos_rest, vm.scope, vm.scope_rest
    ))
def COMMIT(vm):
    vm.call_backtrack_stack.pop()
    vm.pc = vm.pop_arg()
def CALL(vm):
    CALL_(vm, vm.pop_arg())
def CALL_(vm, pc):
    key = (pc, vm.pos_rest+(vm.pos,))
    if key in vm.memo:
        if vm.memo[key][0] is None:
            FAIL_(vm, vm.memo[key][1])
        else:
            vm.action, vm.stream, vm.stream_rest, vm.pos, vm.pos_rest = vm.memo[key]
    else:
        vm.call_backtrack_stack.append((vm.pc, key))
        vm.pc = pc
def RETURN(vm):
    if not vm.call_backtrack_stack:
        return vm.action
    vm.pc, key = vm.call_backtrack_stack.pop()
    vm.memo[key] = (vm.action, vm.stream, vm.stream_rest, vm.pos, vm.pos_rest)
def MATCH(vm):
    object_description = vm.pop_arg()
    fn = vm.pop_arg()
    MATCH_(vm, fn, ("expected {}", object_description))
def MATCH_(vm, fn, message):
    if vm.pos >= len(vm.stream) or not fn(vm.stream[vm.pos]):
        FAIL_(vm, message)
    else:
        vm.action = SemanticAction(vm.stream[vm.pos])
        vm.pos += 1
        return True
def MATCH_CALL_RULE(vm):
    if MATCH_(vm, lambda x: x in vm.rules, ("expected rule name",)):
        CALL_(vm, vm.rules[vm.action.value])
def LIST_START(vm):
    vm.scope_rest = (vm.scope, vm.scope_rest)
    vm.scope = []
def LIST_APPEND(vm):
    vm.scope.append(vm.action)
def LIST_END(vm):
    vm.action = SemanticAction(vm.scope, lambda self: [x.eval(self.runtime) for x in self.value])
    vm.scope, vm.scope_rest = vm.scope_rest
def BIND(vm):
    vm.scope[vm.pop_arg()] = vm.action
def ACTION(vm):
    vm.action = SemanticAction(vm.scope, vm.pop_arg())
def PUSH_STREAM(vm):
    if vm.pos >= len(vm.stream) or not isinstance(vm.stream[vm.pos], list):
        FAIL_(vm, ("expected list",))
    else:
        vm.stream_rest = (vm.stream, vm.stream_rest)
        vm.pos_rest = vm.pos_rest + (vm.pos,)
        vm.stream = vm.stream[vm.pos]
        vm.pos = 0
def POP_STREAM(vm):
    if vm.pos < len(vm.stream):
        FAIL_(vm, ("expected end of list",))
    else:
        vm.stream, vm.stream_rest = vm.stream_rest
        vm.pos, vm.pos_rest = vm.pos_rest[-1], vm.pos_rest[:-1]
        vm.pos += 1
def FAIL(vm):
    FAIL_(vm, (vm.pop_arg(),))
def FAIL_(vm, fail_message):
    fail_pos = vm.pos_rest+(vm.pos,)
    if fail_pos >= vm.latest_fail_pos:
        vm.latest_fail_message = fail_message
        vm.latest_fail_pos = fail_pos
    call_backtrack_entry = tuple()
    while vm.call_backtrack_stack:
        call_backtrack_entry = vm.call_backtrack_stack.pop()
        if len(call_backtrack_entry) == 7:
            break
        else:
            vm.memo[call_backtrack_entry[1]] = (None, fail_message)
    if len(call_backtrack_entry) != 7:
        raise MatchError(
            vm.latest_fail_message[0].format(*vm.latest_fail_message[1:]),
            vm.latest_fail_pos[-1],
            vm.stream
        )
    (vm.pc, vm.stream, vm.stream_rest, vm.pos, vm.pos_rest, vm.scope, vm.scope_rest) = call_backtrack_entry
class SemanticAction(object):
    def __init__(self, value, fn=lambda self: self.value):
        self.value = value
        self.fn = fn
    def eval(self, runtime):
        self.runtime = runtime
        return self.fn(self)
    def bind(self, name, value, continuation):
        self.runtime = self.runtime.set(name, value)
        return continuation()
    def lookup(self, name):
        if name in self.value:
            return self.value[name].eval(self.runtime)
        else:
            return self.runtime[name]
class MatchError(Exception):
    def __init__(self, message, pos, stream):
        Exception.__init__(self)
        self.message = message
        self.pos = pos
        self.stream = stream
class Grammar(object):
    def run(self, rule, stream, runtime={}):
        return Runtime(self, dict(runtime, **{
            "label": Counter(),
            "indentprefix": "    ",
            "list": list,
            "dict": dict,
            "add": lambda x, y: x.append(y),
            "get": lambda x, y: x[y],
            "set": lambda x, y, z: x.__setitem__(y, z),
            "len": len,
            "repr": repr,
            "join": join,
        })).run(rule, stream)
class Runtime(dict):
    def __init__(self, grammar, values):
        dict.__init__(self, dict(values, run=self.run))
        self.grammar = grammar
    def set(self, key, value):
        return Runtime(self.grammar, dict(self, **{key: value}))
    def run(self, rule, stream):
        return VM(self.grammar.code, self.grammar.rules).run(rule, stream).eval(self)
class Counter(object):
    def __init__(self):
        self.value = 0
    def __call__(self):
        result = self.value
        self.value += 1
        return result
def splice(depth, item):
    if depth == 0:
        return [item]
    else:
        return concat([splice(depth-1, subitem) for subitem in item])
def concat(lists):
    return [x for xs in lists for x in xs]
def join(items, delimiter=""):
    return delimiter.join(
        join(item, delimiter) if isinstance(item, list) else str(item)
        for item in items
    )
def indent(text, prefix="    "):
    return "".join(prefix+line for line in text.splitlines(True))
def compile_chain(grammars, source):
    import os
    import sys
    import pprint
    for grammar, rule in grammars:
        try:
            source = grammar().run(rule, source)
        except MatchError as e:
            marker = "<ERROR POSITION>"
            if os.isatty(sys.stderr.fileno()):
                marker = f"\033[0;31m{marker}\033[0m"
            if isinstance(e.stream, str):
                stream_string = e.stream[:e.pos] + marker + e.stream[e.pos:]
            else:
                stream_string = pprint.pformat(e.stream)
            sys.exit("ERROR: {}\nPOSITION: {}\nSTREAM:\n{}".format(
                e.message,
                e.pos,
                indent(stream_string)
            ))
    return source
if __name__ == "__main__":
    import sys
    def read(path):
        if path == "-":
            return sys.stdin.read()
        with open(path) as f:
            return f.read()
    args = sys.argv[1:] or ["--compile", "-"]
    while args:
        command = args.pop(0)
        if command == "--support":
            sys.stdout.write(SUPPORT)
        elif command == "--copy":
            sys.stdout.write(read(args.pop(0)))
        elif command == "--embed":
            sys.stdout.write("{} = {}\n".format(
                args.pop(0),
                repr(read(args.pop(0)))
            ))
        elif command == "--compile":
            sys.stdout.write(compile_chain(
                [(Parser, "file"), (CodeGenerator, "asts"), (Assembler, "asts")],
                read(args.pop(0))
            ))
        else:
            sys.exit("ERROR: Unknown command '{}'".format(command))
#!/usr/bin/env python
import os
import subprocess
import sys
def make_next_version():
    final_compiler = meta_compile_rlmeta()
    test(final_compiler)
    mv(final_compiler, "rlmeta.py")
def meta_compile_rlmeta():
    compiler = "rlmeta.py"
    content = read(compiler)
    for i in range(4):
        next_compiler = "rlmeta{}.py".format(i+1)
        next_content = compile_rlmeta(compiler)
        log("Writing {}".format(next_compiler))
        write(next_compiler, next_content)
        if next_content == content:
            return next_compiler
        compiler = next_compiler
        content = next_content
    fail("Unable to produce metacompiler.")
def compile_rlmeta(rlmeta):
    log("Compiling rlmeta using {}".format(rlmeta))
    return run_rlmeta(rlmeta, [
        "--embed", "SUPPORT", "src/support.py",
        "--support",
        "--compile", "src/parser.rlmeta",
        "--compile", "src/codegenerator.rlmeta",
        "--compile", "src/assembler.rlmeta",
        "--copy", "src/main.py",
    ])
def test(rlmeta):
    log("Test: Has its own support library")
    assert run_rlmeta(rlmeta, ["--support"]) == read("src/support.py")
    log("Test: Disallow semantic action in the middle")
    run_rlmeta(rlmeta, [], b"Grammar { x = . -> [] . }", expect_failure=True)
    log("Test: Call unknown rule foo")
    assert test_grammar(
        rlmeta,
        b"Grammar { x = % | . }",
        b"print(compile_chain([(Grammar, 'x')], ['foo']))"
    ) == b"foo\n"
def test_grammar(rlmeta, grammar, main_code):
    compiled = run_rlmeta(rlmeta, ["--support", "--compile", "-"], grammar)
    total = compiled + main_code
    process = subprocess.Popen(
        ["python"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE
    )
    stdout, _ = process.communicate(total)
    return stdout
def run_rlmeta(rlmeta, args, stdin=b"", expect_failure=False):
    process = subprocess.Popen(
        ["python", rlmeta]+args,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE
    )
    stdout, _ = process.communicate(stdin)
    if expect_failure:
        if process.returncode == 0:
            fail("Expected failure")
    else:
        if process.returncode != 0:
            fail("Expected success")
    return stdout
def mv(src, dest):
    log("Moving {} -> {}".format(src, dest))
    os.remove(dest)
    os.rename(src, dest)
def cleanup():
    for path in [
        "rlmeta1.py",
        "rlmeta2.py",
        "rlmeta3.py",
        "rlmeta4.py",
    ]:
        if os.path.exists(path):
            log("Deleting {}".format(path))
            os.remove(path)
def read(path):
    with open(path, "rb") as f:
        return f.read()
def write(path, content):
    with open(path, "wb") as f:
        return f.write(content)
def log(message):
    sys.stderr.write(color(f"{message}\n", "33"))
def success(message):
    sys.stderr.write(color(f"{message}\n", "32"))
def fail(message):
    sys.exit(color(f"ERROR: {message}", "31"))
def color(message, color):
    if os.isatty(sys.stderr.fileno()):
        return f"\033[0;{color}m{message}\033[0m"
    else:
        return message
if __name__ == "__main__":
    cleanup()
    if sys.argv[1:] == ["--compile"]:
        sys.stdout.buffer.write(compile_rlmeta("rlmeta.py"))
    else:
        make_next_version()
        cleanup()
        success("  O-----------------O")
        success("  | RLMeta compiled |")
        success("~~|     itself!     |")
        success("  O-----------------O")