Cómo escribir un compilador en Go: una guía rápida

¡Los compiladores son increíbles! ? ? ? Combinan teoría y aplicación y tocan muchos temas relacionados con el software, como el análisis sintáctico y la construcción del lenguaje. En esencia, los compiladores son un programa que hace que un programa sea legible por la computadora.

La inspiración para esto vino de un curso de compiladores que tomé el otoño pasado y mi amor por Go.

Esta es la guía que desearía tener al comenzar mi viaje hacia los compiladores. Hay muchos libros, videos y tutoriales sobre cómo crear compiladores. El objetivo de esta publicación es lograr un equilibrio entre proporcionar un ejemplo no trivial de algunas de las cosas que un compilador puede hacer mientras evita quedarse atascado en la maleza. ?

El resultado será un compilador que puede ejecutar un pequeño lenguaje inventado. Para verificar y ejecutar el proyecto final, consulte las instrucciones a continuación. ?

Nota: Recuerde que Go es estricto con respecto a las rutas absolutas al ejecutar este

cd $GOPATH/src/github.com/Lebonescogit clone //github.com/Lebonesco/go-compiler.gitcd go-compilergo test -vgo run main.go ./examples/math.bx

Esquema del compilador

  • Lexer / Analizador
  • Generador AST
  • Verificador de tipo
  • Codigo de GENERACION

El idioma

El objetivo de esta publicación es que se familiarice con los compiladores lo más rápido posible, por lo que mantendremos el lenguaje simple. Por Tipos vamos a trabajar con strings, integersy bools. Tendrá declaraciones que incluyen func, if, else, let, y return. Esto debería ser suficiente para divertirse trabajando con algunas de las complejidades de un compilador.

El primer compilador que construí, lo completé en el transcurso de dos meses y tomé miles de líneas de código. Tomé algunos atajos en esta publicación para mostrarte los fundamentos clave.

Dos componentes comunes que le faltan a nuestro lenguaje son classesy arrays. Estos añaden complicaciones adicionales para las que no tenemos tiempo ahora. Si resulta que la gente realmente quiere saber cómo manejar estos elementos, escribiré un seguimiento.

Algún código de ejemplo:

func add(a Int, b int) Int { return a + b;}
func hello(name String) String { return "hello:" + " " + name;}
let num = add(1, 2);let phrase = string hello("Jeff");let i = int 0;let result = "";
if (i == 2) { result = hello("cat");} else { result = hello("dog");}
PRINT(result);

Configuración rápida

El único paquete exterior que necesitamos es gocc, lo que ayudará a construir el analizador léxico y analizador.

Para que funcione:

go get github.com/goccmack/gocc

Asegúrese de que la carpeta bin donde se encuentra gocc esté en su PATH:

export PATH=$GOPATH/bin:$PATH

Nota: Si tiene problemas en esta etapa, intente ejecutar go envpara asegurarse de que su $GOROOTy $GOPATHesté correctamente asignado.

Genial, profundicemos en el código.

Construyendo el Lexer

El trabajo del lexer es leer el programa y generar un flujo de tokens que consume el analizador. Cada uno Tokencontiene lo typeque representa el token en el idioma y la cadena Literalde ese token.

Para identificar las piezas del programa usaremos expresiones regulares. gocc luego convertirá estas expresiones regulares en un DFA ( Autómatas finitos deterministas ) que teóricamente puede ejecutarse en tiempo lineal.

La notación que usaremos es BNF (forma Backus – Naur ). No confunda esto con EBNF (forma extendida Backus-Naur ) o ABNF (forma aumentada Backus-Naur ) que tienen algunas características adicionales. Tenga esto en cuenta cuando mire otros ejemplos en línea que podrían estar usando otras formas que brinden más azúcar sintáctico.

Comencemos con lo básico y definamos cómo se verá stringsy integersse verá.

Tenga en cuenta que:

  • Todos los nombres de tokens deben estar en minúsculas
  • Cualquier clave precedida por '!' será ignorado
  • Cualquier clave precedida por '_' no se convertirá en un token
  • Cualquier expresión delimitada por '{' expression'}' puede repetirse 0 o más veces

En el siguiente ejemplo ignoramos todos los espacios en blanco y hemos definido una inty string_literaltoken.

Cada idioma tiene incorporadas keywordspalabras reservadas que brindan una funcionalidad especial. Pero, ¿cómo sabrá el lexer si a stringes keywordun usuario creado identifier? Maneja esto dando preferencia al orden en que se definen los tokens. Por lo tanto, definamos a keywordscontinuación.

Terminaremos esto agregando la puntuación necesaria para las expresiones.

¡Frio! Veamos si esto realmente está haciendo lo que queremos con algunas pruebas unitarias . Siéntase libre de pegar esta parte en su IDE. ?

Nota: En general, es una buena práctica en Ir colocar archivos de prueba en el subdirectorio correspondiente, pero para simplificar, coloco todas las pruebas en la raíz.

Para probar nuestro escáner, ejecute:

$ gocc grammer.bnf$ go test -v=== RUN TestToken--- PASS: TestToken (0.00s)PASSok github.com/Lebonesco/compiler 0.364s

Genial, ahora tenemos un trabajo Lexer?

Construyendo el analizador

Construir el Parseres similar a cómo construimos el Lexer. Construiremos un conjunto de secuencias de elementos que cuando coinciden correctamente con un flujo de tokens producen una gramática. Esto también se ejecutará en tiempo lineal al convertir internamente nuestra gramática NFA ( Autómata no determinista ) a DFA ( Autómata finito determinista ).

Mantengamos las cosas simples. ¿Cuál es realmente nuestro programa? Bueno, es un grupo de Statementsen el que cada uno Statementpuede contener un conjunto de Statementsy / o Expressions. Este parece un buen lugar para comenzar nuestra gramática.

Below is the beginning recursive hierarchy of the program. Statements is a sequence of zero or more Statements and Functions is a list of functions. Our languages requires functions to be defined before other Statement types. This will reduce some headache during the type checking phase. empty is a keyword in BNF that represents an empty space.

But wait! What is a Statement? It’s a unit of code that doesn’t return a value. This includes: if, let, and return statements. This is opposed to Expressions which do return values. We will get to those next.

Below is our grammar for Statement and Function. A StatementBlock is a larger abstraction that encapsulates a list of Statements with braces {}.

Next lets build out our Expression which handles all infix operations such as +, -, *, <, >, ==, and, and or.

Almost done with a fully working grammar! Let’s wrap things up by defining our parameter insertion. Each function can accept any amount of values. When defining a function we need to label the argument types in the signature while a called function can accept any amount of Expressions.

Now that our parser is completed let’s add some code to our driver, main.go.

As we progress through the compiler we will add more functionality to this driver.

One of the things challenging about building a parser is that there’re many different ways to define the grammar. ?

We won’t really know how well we did until we get into the next section which uses the output we just generated. The difficulty of building the static type checker will be strongly influenced by our grammar design. Keep your fingers crossed ?.

Test Parser

We’ll keep this simple because at this point our parser can still produce false positives. Once we start working on the AST we can check its accuracy.

go test -v=== RUN TestParser--- PASS: TestParser (0.00s)=== RUN TestToken--- PASS: TestToken (0.00s)PASSok github.com/Lebonesco/go-compiler 7.295s

Congrats ? ? ?! You now have a fully working Lexer and Parser. Time to move onto the AST (Abstract Syntax Tree).

Abstract Syntax Tree

The best way to understand an abstract syntax tree is in relation to a parse tree which is what we generated in the last post. A parse tree represents each part of the program that is matched in our grammar.

Por el contrario, un AST solo contiene la información relacionada con la verificación de tipos y la generación de código, y omite cualquier otro contenido adicional que se utilice al analizar el texto.

No se preocupe si esa definición no tiene sentido en este momento. Siempre aprendo mejor codificando, ¡así que saltemos a eso!

Cree un directorio nuevo y dos archivos nuevos. ast.gocontendrá nuestras funciones generadoras de AST y types.gotendrá los tipos de nodo de árbol .

mkdir astcd asttouch ast.gotouch types.go

Like with the parse tree, we will define our structure from top to bottom. Every node will either be a Statement or Expression. Go isn’t object oriented so we’ll use a composition pattern utilizing interface and struct to represent our node categories. Our AST will return a Program node that contains the rest of the program. From now on, any struct we created with a TokenLiteral() method is a node. If that node has a statementNode() method it will fit the Statement interface and if it has a expressionNode() method it’s an Expression.

In addition, we’ll be adding json tags to each struct field to make it easier when we convert our AST into json for testing purposes.

Now let’s start building our Statement structs based off of the Statement and Node interfaces.

Note:json:"-" means that the field will be omitted from our json.

Next we need some hooks to connect our nodes and parser. The code below allows our Statement nodes to fit the Node and Statement interfaces.

We then need the hooks that will be called by the parser.

As you can see, most of our code is checking and casting our input type.

These hooks will then be called by the parser in grammar.bnf. To access these functions we’ll need to import "github.com/Lebonesco/go-compiler/ast.

Now when a piece of grammar is identified, it calls an AST hook and passes in the necessary data to construct a node.

As you could imagine, there is a lot of flexibility in how you want to generate your AST. There are design strategies that reduce the amount of unique nodes in the AST . However, having more node types will make your life easier when we get to the typechecker and code generation. ?

This part has a lot of code. However, it’s not very difficult and mostly all the same. The error handling in Go can feel a bit tedious, but in the long run it’ll be worth it when we make a silly mistake. Safety First ?

We’re not going to do anything too crazy with our error handling because I want to save on lines of code. However, if you feel so inclined you can add more descriptive and useful errors.

Let’s move on to Expressions!

Fit them into the Node and Expression interfaces.

And create the Expression hooks.

Finally, insert the hooks into the parser.

Test AST Generator

The test cases for the AST generator are the most tedious to write. But trust me, this is not a part you want to mess up on. If you have problems here, those bugs will rollover into your type checker and code generator. ?

In my opinion, if code doesn’t have tests it’s broken

In our AST test we will construct what our final result should look like. To avoid comparing elements such as tokens, that could create false negatives, we convert our result and expected output into json before comparing with a deep comparison function, reflect.DeepEqual().

Run Test:

go test -v=== RUN TestAST--- PASS: TestAST (0.00s)=== RUN TestParser--- PASS: TestParser (0.00s)=== RUN TestToken--- PASS: TestToken (0.00s)PASSok github.com/Lebonesco/go-compiler 9.020s

Half way to a working compiler! ? While you give this post some ? ? ? don’t forget to give yourself a hand for making it this far.

Let’s add some more code to the driver.

Now onto my favorite part, the Type Checker.

Type Checker

Our type checker will make sure that users don’t write code that conflicts with our statically typed language.

The core backbone of our type checker will be a combination of data structures that track identifier types, initialization, and valid type operations. This can get vastly more complicated once we start dealing with different levels of scope and classes. However, we’re keeping it as simple as possible. ?

To start:

touch checker_test.gomkdir checkercd checkertouch checker.gotouch environment.go

environment.go will contain all of our helper functions that will be used by our checker and code generator to access and manipulate our set of variables and corresponding types. Our environment is simple so this will be straight forward.

We’ll begin by setting all of our constant values including operation types, variable types, and mapping of each type to its valid methods.

Note: If we had classes in our language our checker would handle this third part rather than us doing it by hand.

The rest of environment.go are basic getters and setters that handle identifiers and functions.

The foundation of our type checker will be a single dispatch function, checker(), that takes in a Node and fires the corresponding checker function

I felt like saving lines of code so we’ll be using a global environment to store our variable types.

Note: This wouldn’t be possible if we allowed Block Statements and Functions to have their own scope or if we were abiding by best practices.

Now eval Statements. Block Statements are the only statement in which we return a type in order to handle the case when it is inside a function. If there is a Return Statement inside the Block Statement its type is returned. Otherwise, the Nothing_Type is returned.

Onto evaluating Expressions. The most complicated part will be evalFunctionCall() because it could either be a built in function such as PRINT() or user defined.

Note: Currently, there is only one builtin function. However, more could be easily added given the framework that we’ve built.

Awesome! Let’s add a small snippet to our driver.

Test Type Checker

go test -v=== RUN TestAST--- PASS: TestAST (0.00s)=== RUN TestOperations--- PASS: TestOperations (0.00s)=== RUN TestIdents--- PASS: TestIdents (0.00s)=== RUN TestFunctions--- PASS: TestFunctions (0.00s)=== RUN TestParser--- PASS: TestParser (0.00s)=== RUN TestToken--- PASS: TestToken (0.00s)PASSok github.com/Lebonesco/go-compiler 9.020s

I made some deliberate choices to leave things out of this language such as classes, for loops, and function scope. Most of the complexities that arise from these occur in the checker and code generator. If I added those elements this post would be a lot lot longer. ?

Code Generation

We have finally made it to the last stage! ? ? ?

In order to handle the most cases with the least amount of hassle every expression will be assigned to a temporary variable. It might make our generated code look bloated, but it will solve for any nested expressions.

Bloated code won’t have any impact on final program speed because the optimizer will remove any redundancy when we compile our final generated C++ code.

Our generator will look similar to the type checker. The main difference is that we’ll be passing down a buffer to store the generated code.

While I chose to compile into C++, you can substitute in any language . The main purpose of this Go Compiler Guide was to enable you to be able to understand the pieces well enough to go out and create your own.

To start:

touch gen_test.gomkdir gencd gentouch gen.go

We’ll begin by importing all of the necessary packages and defining three utility functions, write() to write generated code to a buffer, check()to do error handling, and freshTemp()to generate unique variable names for temporary variables we create on the fly.

Note: It’s generally bad practice to use panic()for normal error handling in Go, but I’m tired of writing if statements.

Similar to the checker, our generator has a core dispatch function that accepts a Node and calls the corresponding gen function.

Let’s generate some Statements. genProgram() generates necessary headers and main() function.

Generating Expressions will look very similar to the code above. The main difference is that a temp variable is returned representing that expression. This helps us handle more complex Expression nesting.

The final piece of code will be our C++ Builtin types. Without this nothing will work.

Test Code Generator

Bringing It All Together

We’re now going to combine our lexer, parser, AST generator, type checker, and code generator into a final runnable program, main.go.

Note: I’m running this on a Windows so my C++ compiles into main.exe. If this doesn’t work for you remove the .exe extension.

To find some test programs to run go to github.com/Lebonesco/go-compiler/examples.

go run main.go ./example/function.bxhello Jeff3

And there you have it! We have completed a fully working compiler in Go!

Thank you for taking the time to read this article.

If you found it helpful or interesting please let me know ???.