The LLVM compiler toolkit makes it easy to implement a compiler, and the LLVM Tutorial (called "Kaleidoscope") is a good place to start.
This example compiler provides an alternative introduction, with a bit more focus on design details like the use of a visitor pattern for syntax traversal. It consists of the following components:
-
A regexp-based lexer, employing re2c to generate an efficient state machine.
-
A recursive-descent parser with an operator-precedence strategy for parsing expressions with infix operators.
-
Well-engineered abstract syntax classes that provide a good foundation for extending the source language.
-
A typechecker that supports simple overloading (without implicit type conversions). A key aspect of the typechecker is that it resolves lexical scoping, linking variable references and function calls to the corresponding definitions. This allows subsequent passes to operate without any knowledge of scoping rules.
-
A simple code generator that contructs LLVM IR.
-
Optimization and JIT code generation using off-the-shelf LLVM components.
The source language is a subset of C with the following grammar.
Prog -> FuncDef+
FuncDef -> Type FuncId ( VarDecl* ) Seq
Type -> bool | int
FuncId -> Id | operator BinaryOp
VarDecl -> Type Id
Seq -> { Stmt* }
Stmt -> Id = Exp ;
| Id ( Args ) ;
| VarDecl ;
| Seq
| return Exp ;
| if ( Exp ) Stmt
| if ( Exp ) Stmt else Stmt
| while ( Exp ) Stmt
Args -> Exp
| Exp , Args
Exp -> true | false
| Num
| Id
| Id ( Args )
| ( Exp )
| UnaryOp Exp
| Exp BinaryOp Exp
UnaryOp -> - | !
BinaryOp -> * | / | %
| + | -
| < | <= | > | >=
| == | !=
Notation:
Prog -> FuncDef+
indicates that a program consists of one or more function definitions.Exp -> true | false | ...
indicates that an expression can be atrue
orfalse
keyword, etc.VarDecl*
indicates zero or more variable declarations- Other punctuation characters are program literals (e.g. parentheses, braces, semicolon, comma).
Example:
int main(int x)
{
int sum = 0;
int i = 1;
while (i <= x)
{
sum = sum + i;
i = i + 1;
}
}
Here is an overview of the source files:
main.cpp
: calls the lexer, parser, typechecker, and code generatorToken.h
: lexical tokens, e.g. constants, identifiers, and keywords.Lexer.re
: regular expressions for lexical tokens (compiled by re2c)TokenStream.h
: adapter that calls Lexer to produce a stream of tokens.Parser.cpp
: recursive descent parser, which reads token stream and produces a syntax tree.Exp.h Stmt.h VarDecl FuncDef.h Program.h
: syntax trees for expressions, statements, functions, etc.Visitor.h
: visitor pattern for syntax traversalPrinter.h
: print syntax tree using VisitorTypechecker.h
: a typechecker that supports overloading.Scope.h
: scoped symbol table used by the typechecker.Builtins.h
: declarations of built-in operatorsCodegen.cpp
: generates LLVM IR from syntax treeSimpleJit.h
: encapsulates LLVM ORC JIT engine
Prerequisites:
-
CMake 3.1+: https://cmake.org/download/
-
LLVM 7.0: http://releases.llvm.org/download.html
-
re2c 1.2: https://github.com/skvadrik/re2c/releases
- (Note that building re2c under Windows is easiest via cygwin or mingw.)
The Weekend Compiler project is built with CMake. From the command line, the locations of LLVM and re2c can be specified as follows:
cmake -D LLVM_ROOT_DIR=C:/LLVM-7.1 -D RE2C_EXE=C:/re2c-1.2.1/bin/re2c.exe
Alternatively these locations can be specified via cmake-gui.