Building an Interpreter with the RPython Toolchain¶
This repository contains introductory material on building a simple interpreter with RPython.
The original slides are also available.
Contents:
PyPy¶
Besides being the most prominent example of an interpreter written in RPython, having been the interpreter for which RPython was developed, we will also use PyPy in order to actually translate our own interpreter, which we will discuss later.
Below are brief instructions for installing and running PyPy.
Installing PyPy¶
PyPy can be installed by following the instructions at:
or:
OS X¶
$ brew install pypy
Windows¶
Windows binaries are available at the above link.
Linux¶
PyPy should be available in your Linux distro’s package manager.
Alternatively, tarballs are available at:
https://github.com/squeaky-pl/portable-pypy
which should work across distributions.
Running PyPy¶
As with CPython, the REPL is accessible via pypy
at a terminal.
Feel free to play around a bit if you’ve never done so already, whereby you’ll (hopefully!) notice nothing amiss.
CyCy¶
CyCy is a small C interpreter built during a 48 hour hackathon at Magnetic.
It certainly does not interpret an appreciable amount of C, nor does it run particularly quickly, but it’s a small step beyond the example interpreter (which we’ll also see soon).
It is highly recommended to resist temptation on directly copying its layout (or the layout of the example interpreter) line for line. Reading it briefly and using it as a reference when stuck will allow you to proceed without (much) frustration while still taking time to play with RPython enough to experience it.
Installing CyCy¶
Clone the CyCy repo which can be found at https://github.com/Magnetic/CyCy:
$ git clone https://github.com/Magnetic/cycy
$ cd cycy
Optional: Create a Virtualenv¶
Create a virtualenv which you will install CyCy into. If you have no other preference on location, create one within the CyCy repo checkout:
$ python -m pip install -U virtualenv
$ virtualenv -p pypy venv
which will create a virtualenv (a directory) named venv
in the checkout,
and will use the PyPy interpreter you’ve installed inside it.
Install CyCy as you would a normal Python package:
$ venv/bin/pip install -e .
which should fetch all of the necessary dependencies.
Run venv/bin/rpython --help
to confirm that the toolchain was successfully
installed, and you’re on your way.
Alternate: Without a Virtualenv¶
If you do not wish to use a virtualenv, you can install CyCy and its dependencies globally:
$ pip install --user -e .
Follow along with the rest of the document by removing venv/bin/
from the
beginning of commands, since your binaries will be available globally.
Note
If after running the above command you cannot run rpython --help
,
you likely do not have ~/.local/bin
on your shell’s $PATH
.
Follow the instructions of your shell on adding it (typically either
to .bash_profile
or .zshenv
if you’re using bash or zsh
respectively).
Running & Translating CyCy¶
Take a quick look at the contents of the directories in front of you. We will cover the types of components in them as we develop our own interpreter, but it will be useful (and perhaps somewhat irresistible) to peek at what reasonably small amount of code is there.
At a high level, the translation process will take the code you have and translate it into an executable you can run standalone.
Before we do so, let’s prove that the interpreter can simply be run as a normal program on top of Python. Running:
$ venv/bin/pypy -m cycy
should present you with a CyCy REPL. Here you certainly will notice things amiss (bonus points if you simply crash the interpreter).
But! If you run something simple, like:
CC-> int main (void) { return 2 + 23; }
you should see:
25
outputted below. There are various slightly more complicated things for which CyCy manages to implement support for so far.
Note
For both CyCy and the interpreter we will attempt to build, you may
find it helpful to install the rlwrap
*nix utility and to use
it as a wrapper around the REPL:
$ rlwrap venv/bin/pypy -m cycy
will give you a REPL with readline support provided via rlwrap.
You can find it in homebrew or via your Linux package manager.
Now that we can run CyCy on top of Python, let’s use the RPython toolchain to create an executable that is independent of the Python runtime.
From the same directory (the root of the checkout)::
$ venv/bin/rpython --output=cycy-nojit cycy/target.py
will produce a long stream of output that you might find interesting to stare out.
After around 2 minutes of churning, out should pop an executable in the current directory which you can run via:
$ ./cycy-nojit
which should produce a REPL with (roughly) the same behavior as before.
This executable depends neither on Python nor the RPython toolchain:
$ file ./cycy-nojit
$ otool -L ./cycy-nojit
which should show you some basic information on the executable (and
specifically that it does not in fact try to link against Python). On
Linux, use ldd
in place of otool -L
.
Some Example Interpreters¶
As we move forward with our own interpreter, it will be helpful to have other interpreters that we can reference if and when we are stuck, either with design or implementation.
To start with, besides CyCy which we have already cloned, let’s retrieve two more interpreters written with RPython.
First, let’s clone the “official” RPython example interpreter. It lives on BitBucket.
Note
To clone it, you’ll need the mercurial VCS. If you do not already have mercurial, you can install it as any other Python package via
$ pip install –user mercurial
In this instance personal recommendation is to install mercurial globally, but feel free to install it into a virtualenv as well if you wish.
Execute:
$ hg clone https://bitbucket.org/pypy/example-interpreter
to grab a copy of it. Take the same quick flip through the source code as you did through CyCy. You’ll hopefully notice some surface-level similarities because CyCy was written via the same strategy of occasional glancing at the example interpreter for inspiration.
As we progress through writing our own interpreter, you now have at least a pair of interpreters to reference. In order of complexity, we have:
- the example interpreter implementing a toy language similar to our own
- CyCy implementing a small subset of C
Let’s also clone the Topaz and PyPy interpreters as well. These interpreters are full-blown (real-world) projects with all of the considerations that brings, so they come with huge additional levels of complexity. Try not to be flustered by it, we clone them now for two reasons:
- familiarity with Python might in some way provide additional guidance if you can manage to find the associated implementation in PyPy
- both repos are large – we might want to take quick looks at PyPy later in the day, so we’ll clone them up front. Feel free to leave the clone running in the background while we proceed.
You can find Topaz at https://github.com/topazproject/topaz
Other Resources¶
There are a number of other resources that will be useful as we work:
- the RPython documentation which you’ll likely want to keep open for reference throughout the tutorial
- the #pypy IRC channel on Freenode which you might like to join, especially if you have an issue that we can’t figure out in person. See http://webchat.freenode.net/ to connect if you do not already have an IRC client.
RPython¶
There is an introduction to RPython in the RPython language documentation, which explains what subset of Python constitutes valid RPython.
It is not critical that you read that page from top to bottom (yet, probably not even at all at least for today).
The most important things to remember from the start are:
- You cannot mix objects of “un-unifyable” types in the same collection.
Lists must contain all strs, or all instances of a class. It does not
have to be the same exact class, the classes just have to be unifyable,
which means that there is some common base class between them (at the
RPython level this cannot be
object
, so you can’t mix ints with instances of your RPython class). Relatedly, you can mixNone
with some types, but not others. See the documentation page above for details. - Most of the (Python) standard library (and similarly external modules) are not valid RPython. There is a separate, smaller RPython standard library. With a few exceptions, it comprises the external RPython code you have available.
- Write simple code and it is likely to be easily converted to RPython even if it is invalid. If your code is complex, it will likely need untangling once you try and translate it.
Unfortunately the most important rule though is the one at the top of the documentation, that if it translates, it’s valid, and if it ain’t, it isn’t.
Two Specific Notes¶
Besides the above, there are two specific things which you might find surprising initially that are worth learning upfront.
Firstly, the RPython toolchain will expect an entry point similar to
the one expected when writing C code. By default it will expect a
function called main
in the file you pass to the rpython
binary
(explained shortly). The function should return an int
as main
would in C. This is where the translation process which we’ll discuss
(as well as your interpreter) will begin.
Secondly, the assert
statement in RPython has special semantics. Whereas in
Python it is generally used to describe invariants that should never be false
in a piece of code, in RPython it additionally makes an assertion of that
invariant to the toolchain itself so that the toolchain can make use of that
information.
As a quick example, you’ll likely soon notice once we begin writing our parser that the toolchain will complain during translation if for instance you have:
class Parent(object):
...
class ChildA(Parent):
attr_only_on_this_child = 12
class ChildB(Parent):
...
and you try and access attr_only_on_this_child
even if you only pass in
instances of ChildA
. You will need to tell the toolchain that you are
guaranteeing that only instances of ChildA
will be there, and not
ChildB
, and the way to do so is by saying assert isinstance(myinstance,
ChildA)
, which signals that exact fact.
A Start¶
The “language” we’ll implement is exceedingly simple. We’ll aim for a small initial goal:
- variable assignment
- string and integer literals
- simple output to stdout
We’ll call our interpreter (and language) Snap. Don’t bother Googling it.
The simple syntax for our language will lead us to aim directly for the following small program:
foo = 2 + 7
bar = "cat" + "dog"
print(foo)
print(bar)
where we will want:
$ snap example.snap
containing the above program to put the appropriate output on stdout.
We’ll tackle our interpreter in 3 phases, a small parser, a small corresponding bytecode compiler, and then a small corresponding bytecode interpreter.
After that we’ll embellish!
Initial Setup¶
Fork the repo containing these instructions, which you can use to house your interpreter (feel free to just use the local clone if you prefer). You can use the CyCy repo as a guide (or any of the other interpreters) for all of the basic steps below. We will assume a similar layout for the remainder of this session.
It lives at https://github.com/Julian/BuildingAnInterpreter.
We’ll be creating a fairly typical Python package, despite its purpose,
so create the package layout you are comfortable with, along with a
virtualenv for your project (be sure to use -p pypy
if you do).
Install the rpython
toolchain which we’ll shortly need from PyPI.
We’ll be writing tests! Install your favorite test runner as well, and follow along with the next step, where we finally start writing our parser.
A Snap Parser¶
Our first goal is to be able to successfully parse our goal into an
AST. An AST is a
programmatic representation of the structure of our program (which you
may have encountered via the ast
module in Python).
Create a test file named test_example.py
in your Snap package’s test
folder, and create a test case within it. The test case will guide us along the
way towards at least being able to parse our example program. We’ll handle
interpreting it soon.
We wish to parse the example program we had earlier:
foo = 2 + 7
bar = "cat" + "dog"
print(foo)
print(bar)
into an AST. There are numerous ways to represent the above source code as an AST. Feel free to be a bit creative, but as a sampling, we will represent the first sample of this piece of code as the following (RPython) AST:
Assign("foo", BinaryOperation("+", Integer(2), Integer(7)))
See if you can write a (failing) test case that asserts that the result of parsing the above source should produce that AST. You’ll need to create a few classes corresponding to the various node types in your AST.
Note
Hooray! Tests don’t need to be valid RPython, they won’t run within your interpreter, so you have freedom to write them however you’d like.
We’ll follow roughly typical TDD (if you’re unfamiliar, don’t worry too much about it), so create the function that will parse your AST and make your example test pass by simply giving it a fake implementation that simply returns the (overall) AST we want for our example program.
We’ll now build out our parser in small chunks until we successfully can parse the above program.
RPly and an Actual Start to Our Parser¶
We need a way to actually parse our source code into the AST we’ve
just designed. There are a number of paths forward. The most obvious
is to implement a parser ourselves “by hand”, but there are a number
of tools that exist to make this easier. The RPython standard library
has the rpython.rlib.parsing
module, but we’ll use the RPly library, which has
a slightly nicer API.
There are only two pages of documentation, so have a quick read through them to see how you will likely want to proceed. You’ll likely find consulting the CyCy parser helpful minutes as well but try and wait until you’re off the ground.
You may want to consult additional resources on EBNF notation if this is your first exposure to it. For our purposes superficial knowledge should suffice until you want to extend the parser (later) as well, so it can also wait.
Break down the things you will need to parse into smaller chunks, write unit tests for the (small AST) that will result from parsing them, and then implement enough of a parser to make your tests pass, extending your implementation at each step. For example, you might want to take the following path:
- Parse integer literals
- Parse sums of integers
- Parse assignment of an integer
- Parse the assignment of a sum
- ...
slowly building up your parser via your unit tests.
Translation¶
Let’s not forget about translation! Generally speaking, you should not necessarily attempt to translate your code not after change or commit (it’s tedious and long as you’ll soon notice). Rely on your tests to check that your code works, then simply periodically attempt to translate, and fix any invalid RPython you’ve introduced since the last time.
Let’s set up enough to be able to translate our project and check if it’s valid RPython. Recall that translation only operates on code paths that are actually reachable from your entry point, which we’re about to define, and only on objects after import-time, where you’re able to fully utilize Python.
Note
If you forgot about the type unification requirement of RPython, you’ll likely see your first translation error with your AST nodes!
These errors are cryptic, but if you stare at them and trust them for long enough, they often are clear enough to point you in the right direction.
See if you can solve your error.
More About non-RPython¶
With the above note again in mind about translation operating only on code paths your interpreter will traverse (and not ones used only during tests or debugging), it’s useful to set yourself up for easy debugging of your parser.
Give your AST nodes a helpful __repr__
and see what you can do in
regular Python to help yourself as you go along.
You’ve likely already implemented __eq__
and __ne__
methods on
your nodes to get your tests to pass. These too do not need to be valid
RPython since they likely will only be used in tests.
A Snap Compiler¶
We have the foundation of a parser, which means we now can convert source code into programmatic representations of an AST.
Our objective now is to be able to compile an AST into bytecode. Our bytecode, roughly speaking, is similar to bytecode you likely will have encountered at least in passing with Python.
It is a sequence of bytes which program (what will become) our virtual machine, in an analogous way to how machine code is a sequence of bytes that programs our CPU.
Our compiler will take an AST and walk it to produce a single stream of bytecode that encapsulates the instructions we need to execute.
Designing bytecode languages is an art unto itself, but our fundamental goal during compilation is to create a simple language that manipulates a stack which we will maintain as our VM.
For example, here is a human-readable rendering of what adding two numbers might look like for our VM:
LOAD_CONST 0
LOAD_CONST 1
BINARY_ADD
The interpretation of these three instructions are:
- load a particular constant, whose ID is 0 (we’ll maintain constants in an array, so this will be an index specifying the constant within the array), then push its value onto our stack
- do the same for another constant with ID 1
- pop the top two entries off the stack, perform addition, and push the result back onto the stack
The non-human readable version should use simple bytes to represent the actual instructions ultimately, which the VM will then dispatch on.
First Steps¶
Your compiler should operate on an AST, and ultimately will produce bytecode: a simple sequence of bytes which our upcoming VM will execute.
In fairly typical OO style, generally it will simply delegate to each of your AST nodes for compilation, in which case each node will know how to compile itself by producing a sequence of our bytecode that it corresponds to.
We’ll need to store state during our compilation process. We need at least a place to store variables and one to store constants that the bytecode we generate will reference.
For a concrete example, a simple constant node like the one we’ll need for integers or strings should compile itself by simply registering itself as a constant on an object that is keeping context for the compilation. It should then emit a bytecode that when executed will retrieve the corresponding constant from the context object.
In the bytecode we will emit, loading this constant for use on the stack will consist of accessing an appropriate element within a registered constants list by index.
Compiler Tests¶
Write your first compiler test! There are (at least) two ways to test your compiler. One is in integration with your parser, by asserting that some inputted source code parses and then compiles into an expected piece of bytecode. The other is to test direct compilation of an AST.
Try out both methods and see which you find easier to read. You may also find it convenient to have a way to produce human-readable bytecode dumping for your bytecode, which should produce output similar to our first example above, or to what you’d get out of the dis module.
An Interpreter¶
Finally! It’s time to actually execute some code.
We have bytecode, which essentially is a tape-like sequence of instructions that we will interpret. We’ve casted our (potentially) complicated language into a sequence of simple stack manipulation operations.
Some of you will find interpretation to be the most “fun” part of the process. We need to implement the appropriate stack manipulation for each bytecode we wish to interpret.
The Main Loop¶
Interpreting bytecode will take place within a main loop similar to the CPython VM’s main loop or PyPy’s main loop. A giant loop that simply performs whatever bytecode instruction is at the current position we’re processing.
A program counter should keep track of that position within our bytecode. Our interpreter will read sequentially through the bytecode sequence, possibly moving the program counter to some other position (if you implement a jump or conditional expression in a later exercise).
Along with a stack (an RPython list), we’ll implement our plan.
For each bytecode instruction that your compiler produces, implement the appropriate stack manipulation.
Note
Depending on the particular bytecode instruction, you may find it difficult at this stage to write tests without simply making tests that assert about the internal state of your stack.
Try this out.
You might find it more reasonable as you progress to write tests that use the print bytecode mentioned below instead once you have a working entry point.
Print¶
Until you implement full support for function calls, it will likely be useful to special-case the print() instruction by giving it its own bytecode.
Note that you cannot use the sys
module in RPython for the most part, nor
do you have access to open
. You may write to stdout directly via
os.write
by passing in an fd of 1
for stdout.
You may also want to check out the streamio
module from the RPython
standard library which can provide some provisional file-like support for
RPython.
Once you have the ability to print values, you can begin print-debugging your own interpreter!
Wrapper Objects¶
Much like Python, our toy Snap language allows you to print objects that aren’t necessarily strings.
It becomes useful to start applying OOP techniques to objects at your language level and not just at the RPython level for your interpreter. For example, you may have an integer object which represents a Snap integer within your runtime.
The convention is to call objects like these W_Integer
, for example, where
the W_
prefix indicates that this object is a wrapper object.
Once you have wrapper objects, you can begin to encapsulate Snap
features on each wrapper object. A W_Integer
for example may have
a .to_string
method in RPython, which returns a W_String
Snap
string. Your PRINT
bytecode might then be implemented by simply
delegating to this method in order to produce a string, which you can
implement polymorphically on each Snap type you might have.
If you begin to implement method calls in Snap, your to_string
method might
further become a Snap-accessible method that you can call on your Snap objects
directly if so chosen.
A REPL¶
As a somewhat fun additional exercise, let’s implement an interactive interpreter which parses, compiles and executes your code in the same way the Python interpreter does.
Most of the work here is straightforward.
You do not have access to the code
(Python) stdlib module, so it will be up
to you to manually implement a simple execution loop.
Simply repeatedly read a line, and run it through the appropriate steps within your interpreter.
Here specifically you might want to take a look at the streamio
module if
you did not before.
Final Thoughts¶
To come full circle, verify that you can run the example that we set out to interpret.
It should produce our expected output.
Going Further¶
At this point you likely have a number of further ideas for extension.
Implement them!
For some further inspiration:
- Loops! Function calls! Methods! More operators. Extend your parser, compiler and interpreter to support these, turning Snap into an actual programming language.
- Add some more types!
- We haven’t even mentioned how to add a JIT to your interpreter. The RPython docs cover the basics. Your task will be to inform the RPython toolchain about a number of the fundamentals within your interpreter, such as where your main loop starts, and what the lifecycles of some of your variables and objects are within your interpreter.