Programming languages under the hood

How does a programming language work under the hood

Héla Ben Khalfallah
Photo by Angel Monsanto III on Unsplash

Have you ever wondered what happens under the hood of a programming language? How does the compiler manage to properly understand our code? How do machines (computers) execute our code correctly and exactly?

“Properly, correctly, and exactly” mean without errors, without losses, and without modifications.

To answer these questions, we’ll build a small programming language called H# and go through all the steps that a compiler ordinarily performs.

But before that, let’s start from the end of the story: what kind of code can a machine (computer) understand?

Machine implies hardware, and hardware implies electrical circuits. The electrical circuits inside a computer can have just two states: on and off.

So how can we link software to hardware?
It’s easy, thanks to two magical numbers (bits), ‘1’ and ‘0’, which correspond to the on and off states of circuits in a computer (Boolean algebra).

Machine code consists of a sequence of simple computer instructions, with each instruction expressed as a string of binary digits or bits (i.e., 1s and 0s).

It is important to note that the machine can be your computer or a virtual machine that runs on your computer, like the Java Virtual Machine, since a virtual machine uses the resources of its host and executes inside it.

However, it’s rare for anyone to program in numerical machine code these days due to the difficulty in working with it.

Thus, to make the developer’s job easier, programming languages offer different layers of high-level abstraction above this lowest level of representation.

As the first level of abstraction, we can cite assembly code:

Programming language abstraction levels (Image by the author)

At this level, we understand what the machine (computer) understands, and I think it’s clear what a compiler has to do, don’t you?

The role of the compiler is to make the human world and the machine world compatible. To let the machine understand the instructions (the wishes) of the human. Perfect!

Before going inside the compiler, let’s start by defining and presenting the programming language that we are going to develop!

We are going to develop a small programming language called H#:

  • Like a human language, a programming language is formed by a dictionary and a grammar.
  • The dictionary defines the set of words (tokens) that make up a language.
  • The grammar checks whether the assembly of words is correct or not.

Our language will initially only allow:

  • defining typed variables
  • performing simple arithmetic operations
  • defining and using functions
  • displaying a message in the console

Other specifications:

  • the word delimiter will be a space
  • the statement delimiter will be the semicolon ;
  • each statement should start on a new line
  • the language is case sensitive
  • comments start with //
  • indentation isn’t significant

Here is an example of a program written in H#:

int value1 = 5;
int value2 = 2;
// calculate the sum of two integers
int function sum(x: int, y: int) {
return x + y;
}
int result = sum(value1, value2);
print('The result is : ', result);

H# will be designed on top of the C language; the final output will be assembly code:

H# black box (Image by the author)

Dictionary definition (tokens)

Tokens are our dictionary of accepted words.

A practical representation can be a map or a dictionary:
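For instance, in Python, a sketch of the H# dictionary could map each lexeme to a token kind (the token names here are my own illustration, not a fixed H# specification):

```python
# Hypothetical H# token dictionary: lexeme -> token kind.
# Keyword and symbol names are illustrative, not a fixed spec.
KEYWORDS = {
    "int": "TYPE",
    "function": "FUNCTION",
    "return": "RETURN",
    "print": "PRINT",
}

SYMBOLS = {
    "=": "ASSIGN",
    "+": "PLUS",
    "-": "MINUS",
    "*": "TIMES",
    "/": "DIVIDE",
    ";": "SEMICOLON",
    ",": "COMMA",
    ":": "COLON",
    "(": "LPAREN",
    ")": "RPAREN",
    "{": "LBRACE",
    "}": "RBRACE",
}

def classify(word):
    """Return the token kind for a single word, or IDENTIFIER/NUMBER."""
    if word in KEYWORDS:
        return KEYWORDS[word]
    if word in SYMBOLS:
        return SYMBOLS[word]
    if word.isdigit():
        return "NUMBER"
    return "IDENTIFIER"
```

Any word that is not in the dictionary and is not a number falls back to an identifier, which the later phases will have to validate.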

Grammar definition

Rule 1: variable declaration

Variable declaration syntax (Image by the author)

Rule 2: function declaration

Function declaration syntax (Image by the author)

Rule 3: function call

Function call syntax (Image by the author)
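To make Rule 1 concrete, here is a rough sketch of the variable declaration rule as a Python regular expression; the exact pattern is an assumption on my part, not the full H# grammar:

```python
import re

# Rule 1 sketch: TYPE IDENTIFIER '=' NUMBER ';'
# e.g. "int value1 = 5;" -- illustrative only, not the complete grammar.
VAR_DECL = re.compile(r"^(int)\s+([A-Za-z_]\w*)\s*=\s*(\d+)\s*;$")

def is_valid_var_decl(statement):
    """Check one statement against the variable declaration rule."""
    return VAR_DECL.match(statement) is not None
```

A real parser would work over the token stream rather than raw text, but the regular expression shows the shape of the rule.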

Error definition

Lexical errors:

Error 1: unknown identifier.

Error 2: reserved keywords.

Syntax errors:

Error 1: declaration mismatch.

Error 2: missing an opening or closing parenthesis.

We’re ready to understand what’s going on inside a compiler!

A compiler is a tool that translates software written in a human-oriented language into a machine-oriented language.

Compiler black box (Image by the author)

Typical “source” languages might be C, C++, Fortran, Java, or ML. The “target” language is usually the instruction set of some processor.

Fundamental characteristics of a compiler:

  • it must preserve the meaning of the program being compiled.
  • it should improve the input (e.g., remove dead code).
  • it must do its job with acceptable performance: speed, energy consumption, and output size.

During its work, the compiler goes through these steps:

Compiler steps (Image by the author)
  • Program analysis (frontend): produces an intermediate representation (IR).
  • Optimization: transforms the IR into an equivalent IR that runs more efficiently.
  • Machine targeting (backend): transforms the IR into native code (machine code).

Let’s dive deeper!

Before starting the technical part, I want to ask you to translate this sentence for your French friend: “The compiler is a magic and an amazing tool.” How would you proceed?

Most of us would suggest steps like these:

  • divide the sentence into words.
  • a word boundary can be a space or any punctuation mark.
  • find the equivalent of each word in a dictionary.
  • put the words together to form the sentence.
  • check that the syntax and meaning are correct.
  • review the sentence and rephrase if necessary.
  • read the phrase.

The compiler will do the same steps to understand the input language:

Compiler frontend steps (Image by the author)

The steps look like how we translate the sentence above, don’t they?

Let’s dive deeper into each step together!

Scanner

The scanner will analyze the source program’s character stream and turn it into a token stream. It’s like finding a word in a dictionary character by character. It also removes white space and comments.

Errors that might occur during scanning are called lexical errors:

  • The scanner will detect all unknown words and throw an unknown identifier error.
  • If a keyword is misused, the scanner will throw a reserved keyword error.

The simplest recognizing algorithm for a scanner is a character-by-character analyzer.

For example, suppose we have this input source code:

int value1 = 5;
int value2 = 2;

// calculate the sum of two integers
int function sum(x: int, y: int) {
return x + y;
}

After the scanning phase, the output will be:

Source code tokens that split up input (Image by the author)
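Such a character-by-character scanner could be sketched in Python like this (the token kinds and error handling are my own simplifications of H#):

```python
KEYWORDS = {"int", "function", "return", "print"}
SYMBOLS = set("=+-*/;,:(){}")

def scan(source):
    """Split a source string into (kind, lexeme) tokens, skipping
    whitespace and // comments. Raises on unknown characters."""
    tokens, i = [], 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():
            i += 1
        elif source.startswith("//", i):            # skip comment to end of line
            while i < len(source) and source[i] != "\n":
                i += 1
        elif ch in SYMBOLS:
            tokens.append(("SYMBOL", ch))
            i += 1
        elif ch.isdigit():                          # integer literal
            j = i
            while j < len(source) and source[j].isdigit():
                j += 1
            tokens.append(("NUMBER", source[i:j]))
            i = j
        elif ch.isalpha() or ch == "_":             # keyword or identifier
            j = i
            while j < len(source) and (source[j].isalnum() or source[j] == "_"):
                j += 1
            word = source[i:j]
            kind = "KEYWORD" if word in KEYWORDS else "IDENTIFIER"
            tokens.append((kind, word))
            i = j
        else:
            raise ValueError(f"lexical error: unknown character {ch!r}")
    return tokens
```

Running it on `int value1 = 5;` yields a KEYWORD, an IDENTIFIER, two SYMBOLs, and a NUMBER, which is exactly the token stream the parser will consume next.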

These cases will generate lexical errors:

integer value1 = 5; // unknown token "integer"
int function = 5; // "function" is a reserved keyword
INT value1 = 5; // unknown token "INT", H# is case sensitive
println(); // unknown token "println"

Perfect. At this level the compiler knows all the words we used in our source code and we are sure that all the words used exist in the token dictionary!

💡 Tips: Lex and Flex are tools for generating scanners.

However, this is not enough, we need to ensure that the assembly of tokens is correct. It’s like making sure that the assembly of a sentence is grammatically correct (exp: subject + verb). This will be the role of the parser. Let’s take a look!

Parser

The scanner checks the correctness of the vocabulary; the parser checks the correctness of the grammar.

That is, the scanner combines symbols into tokens, and the parser combines tokens to form sentences.

The parser has the main responsibility for recognizing the syntax, determining whether the program being compiled is a valid sentence in the syntactic model of the programming language.

Well, but how can the parser do this? To understand, let me ask: if you were in school and I asked you to calculate the operation (17 + 19) * (8 + 96), how would you do it?

A simple way to calculate it is this:

Calculation tree (Image by the author)

Indirectly you have parsed the expression like a compiler: you have identified the tokens (operands and operators) and their relationships!

The parser will do the same thing: it converts our tokens into a tree that represents the actual structure of the code.

AST representation in GCC: http://icps.u-strasbg.fr/~pop/gcc-ast.html

Practically, each node in the AST is stored as an object with named fields, many of whose values are themselves nodes in the tree.
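As an illustration, the expression (17 + 19) * (8 + 96) from the calculation tree above could be stored as nested node objects in Python (a hand-built sketch, not GCC’s actual representation):

```python
class Num:
    """Leaf node: an integer literal."""
    def __init__(self, value):
        self.value = value
    def eval(self):
        return self.value

class BinOp:
    """Inner node with named fields; 'left' and 'right' are themselves nodes."""
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right
    def eval(self):
        left, right = self.left.eval(), self.right.eval()
        return {"+": left + right, "*": left * right}[self.op]

# (17 + 19) * (8 + 96) as a tree: the '*' node is the root,
# and each '+' node is a subtree, exactly like the drawing above.
tree = BinOp("*", BinOp("+", Num(17), Num(19)), BinOp("+", Num(8), Num(96)))
```

Evaluating the tree bottom-up, the way you did by hand, gives 36 * 104.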

Errors that might occur during parsing are called syntax errors.

These cases will generate syntax errors:

int j = 4 * (6 - x; // missing a closing parenthesis
int i = /5; // missing the first operand
int 42 = x * 3 // missing a semicolon, and 42 can't be a variable name
int value1 = 1, value2 = 1; // the language definition states that we must have one declaration per line

🔵 Good to know: some JavaScript tools like Babel and ESLint can manipulate the AST. You can also visualize the AST of any JavaScript code using the AST Explorer.

💡 Tips: Yacc and Bison are tools for generating parsers.

Semantic analyzer

During semantic analysis, we need to check legality rules, and in doing so, we link the pieces of the syntax tree (by resolving identifier references, inserting cast operations for implicit coercions, etc.) to form a semantic graph.

Rules that we can verify in this phase:

  • Multiple declarations of a variable within a scope.
  • Referencing a variable before its declaration.
  • Referencing an identifier that has no declaration.
  • Too many arguments in a method call.
  • Not enough arguments in a method call.
  • Type mismatches.

The type checker checks the static semantics of each AST node:

  • it verifies that the construct is legal and meaningful (that all identifiers involved are declared, that types are correct, and so on).
  • if the construct is semantically correct, the type checker “decorates” the AST node, adding type or symbol table information to it.
  • if a semantic error is discovered, a suitable error message is issued.
  • type checking depends purely on the semantic rules of the source language; it is independent of the compiler’s target machine.

Errors occurring during this phase are called static semantic errors.
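A deliberately minimal Python sketch of two of the legality rules above (multiple declarations, use before declaration) could look like this:

```python
def check_declarations(events):
    """events: list of ('decl', name) or ('use', name) pairs in
    program order. Returns a list of semantic error messages."""
    declared, errors = set(), []
    for kind, name in events:
        if kind == "decl":
            if name in declared:
                errors.append(f"multiple declarations of '{name}'")
            declared.add(name)
        elif kind == "use":
            if name not in declared:
                errors.append(f"'{name}' referenced before declaration")
    return errors
```

A real semantic analyzer would walk the AST and track nested scopes, but the symbol-table idea is the same: record declarations, then check every reference against them.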

Let’s sum up!

Scanner:

  • input: source code.
  • output: tokens.
  • purpose: vocabulary verification.
  • lexical errors: unknown token, reserved keywords, …

Parser:

  • input: tokens.
  • output: AST.
  • purpose: syntax verification (sentence formulation).
  • syntax errors: missing a closing parenthesis, missing a semicolon, incorrect variable name, …

Semantic analyzer:

  • input: AST.
  • output: Annotated AST.
  • purpose: semantic verification (meaning).
  • semantic errors: type errors, declaration errors, argument errors, reference errors, etc.

Intermediate Representations

The intermediate representation is a machine- and language-independent version of the original source code.

The use of an intermediate representation provides advantages: increased abstraction, a cleaner separation between the frontend and the backend, and new possibilities for retargeting/cross-compilation.

IR: machine and language independent (Image by the author)

Intermediate representations also lend themselves to supporting advanced compiler optimizations and most optimizations are performed on this form of code.
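To make this concrete, the H# statement int result = sum(value1, value2); might lower to a three-address-code style IR; the instruction names below are a hand-written illustration, not the output of any real compiler:

```python
# Each instruction: (operation, destination, operand1, operand2).
# Temporaries t0, t1, ... stand in for "virtual" registers.
ir = [
    ("load",  "t0", "value1", None),
    ("load",  "t1", "value2", None),
    ("add",   "t2", "t0", "t1"),       # inlined body of sum: x + y
    ("store", "result", "t2", None),
]

def run(ir, memory):
    """Interpret the tiny IR over a dict-based memory; temporaries live in regs."""
    regs = {}
    for op, dst, a, b in ir:
        if op == "load":
            regs[dst] = memory[a]
        elif op == "add":
            regs[dst] = regs[a] + regs[b]
        elif op == "store":
            memory[dst] = regs[a]
    return memory
```

Notice that nothing in this form mentions H# syntax or a particular CPU, which is exactly what makes the IR machine- and language-independent.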

Compiler backend steps (Image by the author)

The purpose of the backend phases is to generate the machine code:

  • Instruction selection is a compiler phase that translates the compiler’s intermediate representation of programs into a sequence of target-dependent machine instructions, optimizing for various compiler objectives (e.g., speed and space).
  • Instruction scheduling: to produce code that runs quickly, the code generator may need to rearrange operations to reflect the target machine and its specific performance constraints.
  • Register allocation: when selecting instructions, the compiler ignores the fact that the target machine has a limited set of registers. Compiling may create more “virtual” register demands than the “real” hardware supports. The register allocator must map the “virtual” registers to the “real” registers of the target machine. It determines, at each point in the program, which values will reside in registers and which register will hold each of those values. If the allocator cannot keep a value in a register for its entire lifetime, the value must be stored in memory for part or all of its lifetime.
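As a rough illustration of instruction selection, each IR operation can be mapped to a target-style instruction template (the mnemonics below are invented pseudo-assembly, not real code generation):

```python
# Template per IR operation; {dst}, {a}, {b} are filled from each tuple.
TEMPLATES = {
    "load":  "mov {dst}, [{a}]",
    "add":   "add {dst}, {a}, {b}",
    "store": "mov [{dst}], {a}",
}

def select(ir):
    """Translate (op, dst, a, b) IR tuples into assembly-like strings."""
    return [TEMPLATES[op].format(dst=dst, a=a, b=b) for op, dst, a, b in ir]
```

A production backend would also schedule these instructions and allocate real registers, as described above; this sketch only shows the template-matching idea.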

In this article, we took an overview of how programming languages work, from language definition to runtime.

A programming language is governed by: vocabulary, grammar, and semantics.

The role of the compiler is to ensure that the code that the machine will execute respects the definition of the language and the hardware characteristics.

Inside the compiler, the code goes through several stages: lexical analysis, syntactic analysis, then semantic analysis. Correctness is a mandatory characteristic of a compiler.

In addition to correctness, the compiler must guarantee performance and perform optimizations to improve runtime code execution.
