Understanding Compilers – featuring Swift!
Have you ever hit “compile” and wondered what was actually happening to produce those 0’s and 1’s from lines and lines of source code? As a Swift developer and a compiler design enthusiast, I used to wonder what was inside the mysterious “black box” of a compiler. If you’re curious, too— or think you can benefit from a little extra technical knowledge— this article will give a very high-level overview of the structure of compilers and the different stages involved in converting source program to machine code.
And if you’re a little intimidated by the more specific language, don’t be. Though there will be certain details that are relevant to Swift, most of this blog is for any computer science enthusiast.
For the purpose of this article, I have created my own language with a BNF grammar and will be using examples of code relevant to this language.
Before getting into the details of a compiler I think it’s essential to know the difference between a compiler and an interpreter. Both translate the source code to machine code, but in different ways.
A compiler and interpreter are both different forms of a translator. A translator is nothing but a program that translates a source language to a target language while maintaining the logical structure of the code.
Compilers and interpreters are similar in structure, the only difference between the two is that the interpreter directly executes the instructions in the source program, whereas the compiler translates the instructions into efficient machine code.
An interpreter directly executes instructions written in a programming or scripting language without previously converting them to an object/machine code. Examples of interpreted languages are Perl and Python.
A compiler takes the source program and converts it into assembly/object code, which is typically stored in a file .s or .asm file. Object code (also referred to as binary code) can be directly executed by the machine after linking. Examples of compiled programming languages include C++ and Swift. So, for Swift developers, understanding how compilers work helps us gain a deeper understanding of our coding environment.
What a compiler really is
A compiler is essentially a program that takes in human-readable source code and converts it into machine code. Machine code or machine language is generally referred to as binary code, which is nothing but 0’s and 1’s. The source program is just a stream of single characters that are read one at a time. A compiler has many moving parts, which include a lexical analyzer, a parser/tokenizer, and a code generator. Some compilers have additional code optimizers which can be a different step or be a part of code generation.
A Swift compiler in specific can be divided into 2 main parts the
– Frontend which contains the lexical analyzer, parser, and semantic analysis
– Backend which is mainly optimizations and code generation.
This documentation by Apple gives a brief description of some of the terms mentioned above.
What does a compiler do?
A compiler has several steps to be completed before the source code can be executed:
– The source code is read character by character.
– The characters are grouped into numbers, words, and operations, which are known as tokens.
– The sorted characters are read and compared to some grammar which is specific to the language of the source program, in order to generate an AST/parse tree.
– The tree generated previously is used to convert the source code to binary code.
The Lexical Analyzer
The process of splitting the source code into individual characters is called lexical analysis or tokenization. The main focus of the lexical analyzer is to convert source code into tokens which is nothing but the grouping of characters together to form variables, identifiers, and symbols.
A lexical token is a sequence of characters that can be treated as a unit in the grammar of the programming languages.
For the purpose of this article let us assume that token numbers for our language are assigned based on these rules.
The lexical analyzer uses the above rules to construct a DFA and generates tokens based on this DFA.
DFA or Deterministic Finite Automaton is a state machine where, for a particular input character, which is nothing but a token in our stream of tokens, the machine goes to one state only.
Let’s say our source code contains the below lines:
BEGIN INTEGER: X; INTEGER: Y; IF X < Y THEN X := X+1; END.
The lexical analyzer creates new entries in the symbol table and stores the tokens in the table using the DFA. The lexer reads each character and uses the DFA to generate a token. The above code is tokenized as below:
The Swift language also has a lexical structure it follows. The lexical structure of Swift describes which sequence of characters forms tokens, something very similar to our made-up language but a little more complex. These tokens form the building blocks of Swift. The lexical structure used by Swift can be found here.
What is a symbol table?
A symbol table is a data structure that is created and maintained by the compiler in order to store various information regarding the variables, functions, objects, classes, etc.
Some of the functionalities performed by the symbol table may include:
– Maintaining the names and types of all entities in an organized and structured manner
– Verifying if a variable has already been declared
– Determining if an entity is in scope
After the lexical analysis process is completed for the above code the symbol table looks as below:
Symbol tables can be implemented in multiple ways. In this example, there is a single table and the scope is maintained with numbers, but another way of implementing this would be to have multiple symbol tables, one for each scope.
Why is the symbol table important?
During runtime, the references in the symbol table are identified and resolved. Also, if a variable is encountered that is not part of the current scope, then an error is displayed.
The parser is the heart of the compiler and responsible for validating the syntactic correctness of the source code. It takes in the string of tokens and verifies if it adheres to certain patterns. Some things the parser is responsible for validating include balanced parenthesis, the format of if-else and for statements, and the presence of a semi-colon at the end of each statement. When things aren’t in the right order the parser detects the anomaly and complains.
How does this work? The parser uses a grammar that is specified for the language. This grammar is usually a Backus-Naur Form (BNF) or Extended Backus-Naur Form (EBNF) grammar.
Backus-Naur notation (BNF) is a formal mathematical way to describe the syntax of a programming language.
For the purpose of this article, let us assume it’s a BNF grammar with the following rules:
The parser generates an abstract syntax tree which indicates the correct order the source code is parsed.
In computer science, an abstract syntax tree (AST), or just syntax tree, is a tree representation of the abstract syntactic structure of source code written in a programming language.
Below is a video that generates an AST for the IF-THEN statement using the above grammar.
Let’s use the above-generated AST and iterate over the tokens generated in the lexical analysis phase. Consider the statement with if-then: the tokens generated by the lexer are as follows 9, 1, 6, 1, 10, 1, 6, 1, 4, 2, 16. According to the grammar of the if-then statement the “if token” is generated first, which is token number 9. We can see that the lexer also generated 9. The next token expected by the grammar is an “id token” to represent the variable which is nothing but token number 1. This way the AST is used to verify the semantics of the source code.
Now, let’s consider a scenario where there is an error in the syntax. Consider the statement “INTEGER X;” The AST for a declaration statement is as follows:
Now let us assume that the source code had the semi-colon missing and the tokens generated by the lexer is as follows: 3, 1. Now, according to the grammar token, number 3 is expected to represent the type of the variable and token number 1 is expected to represent the variable name, which the lexer has generated for us. Now the parser expects a token 16 to represent the semi-colon as specified in the grammar. Since the source code does not contain this, the parser detects this anomaly and complains. This is how syntax errors are detected by compilers!
The Swift compiler follows a similar path using grammar to construct an AST. Let us consider a basic example in Swift for type grammar. The grammar specified for Swift for type is as follows:
GRAMMAR OF A TYPE type → array-type type → dictionary-type type → function-type type → type-identifier type → tuple-type type → optional-type type → implicitly-unwrapped-optional-type type → protocol-composition-type type → metatype-type type → Any type → Self type → ( type )
I personally find the best way to understand a language is to take an example line of code and deconstruct it using the language grammar. The entire Swift grammar can be found in this documentation.
The Code Generator
Code generation is the final stage of compilation. The code generator takes in the parse tree and generates the machine code equivalent to the source code. Since ASTs contain branches that are nothing but a description of the source program, it is easy to step through the branches and sequentially generate machine code.
Though most current compilers use optimization processes to enhance the code, the machine code generated is some object code of lower level programming language. Some compilers follow a slightly different compilation path where intermediate code is generated before an optimization process is applied, and then the target code is generated.
Swift compilers follow something similar to the above process. The Swift compiler generates an intermediate language suitable for further analysis and optimization of Swift code. This intermediate code is called Swift Intermediate Language (SIL) and the process is called SIL generation. Swift compiler also performs optimization on this intermediate code and this process is called SIL Optimizations. This mainly involves Swift-specific optimizations to the program, examples being Automatic Reference Counting optimizations, devirtualization, and generic specialization. The last step is target code generation which, for Swift, is LLVM IR Generation where the SIL is translated to LLVM.
So what happens once the source code is converted to target code?
After the machine language has been produced, it is written to a new assembly file which has a .s or .asm file format. The file is then converted to binary by the assembler. The binary code is then written to a new file called an object file with file extension .o.
By understanding the compiler, we change the way we perceive computers and programming languages. This knowledge also enables us to code more efficiently. For example:
– An essential but time-consuming part of software development is debugging. Whether it is a compilation error or runtime error, understanding how compilers work helps to narrow down erring code much more easily.
– Understanding compilers also helps to write code that is more efficient. Consider a switch statement vs if-else. We may know that a switch statement is the more efficient between the two, but we better understand why if we have an idea of how the code is converted and executed in binary. An if-else is more time-consuming as it includes a condition to be computed, whereas a switch statement is nothing but a jump statement.
You don’t need to know the precise inner-workings of a compiler to be a good programmer, but this knowledge can be extremely illuminating. Especially as Swift developers, if we want to form a deeper understanding of our craft, knowing how compilers function is a key factor in our progress.