Language Processors

A compiler is a program that reads a program written in a source language and translates it into an equivalent program in a target language. One of its important roles is to report any errors detected during compilation
Compared to interpreters, a compiler must first compile the program into an executable file before the user can run it, so it compiles first and accepts inputs afterwards. An interpreter, on the other hand, accepts the inputs while reading the source program, evaluates the program statement by statement, and produces the output as it goes. Compilers often generate better target code that runs significantly faster than interpreted programs, but interpreters usually give better error diagnostics, because they execute the program statement by statement.
There are also hybrid compilers that combine an interpreter with a traditional compiler: a translator first translates the program into a low-level intermediate language, which is then evaluated in an interpreted way

From source to binary

  1. A program may be divided into multiple files; it is the preprocessor's duty to combine them. The preprocessor also expands macros.
  2. The modified program is fed into the compiler, which generates an assembly-language program; the assembler then processes the assembly code and produces relocatable machine code.
  3. Large programs are compiled in multiple parts, so the linker resolves external memory addresses, since code in one file may refer to a symbol in another file. The loader then puts the executable object files into memory for execution

Compiler structure

Lexical Analysis(词法分析)

Lexical analysis is the first phase of a compiler. It reads the stream of characters and groups them into meaningful sequences called lexemes. For each lexeme the lexer produces a token of the form \langle\text{token-name},\text{attr-value}\rangle, where the token name is an abstract symbol that is used during syntax analysis (the next phase), and the second component points to an entry in the symbol table, which is needed for semantic analysis and codegen
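As a minimal sketch of this grouping (the token names, the toy grammar, and the symbol-table layout below are all illustrative, not from any real compiler):

```python
import re

# Toy lexer: groups characters into lexemes and emits
# <token-name, attribute-value> pairs.
TOKEN_SPEC = [
    ("NUM",  r"\d+"),
    ("ID",   r"[A-Za-z_]\w*"),
    ("OP",   r"[+\-*/=]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(source, symbol_table):
    tokens = []
    for m in MASTER.finditer(source):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "SKIP":
            continue
        if kind == "ID":
            # the attribute value points to the symbol-table entry
            attr = symbol_table.setdefault(lexeme, {"name": lexeme})
            tokens.append(("id", attr))
        elif kind == "NUM":
            tokens.append(("number", int(lexeme)))
        else:
            tokens.append((lexeme, None))  # each operator is its own token name
    return tokens

table = {}
print(tokenize("position = initial + rate * 60", table))
```

Each identifier token carries a pointer into the symbol table rather than the lexeme itself, which is what later phases consult.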

Syntax Analysis(语法分析)

This is the second phase of a compiler, also called parsing. The first components of the tokens generated by the previous phase are used to build a tree-like intermediate representation that shows the grammatical structure of the token stream; it is typically a syntax tree

An interior node of the syntax tree represents an operation, and the children of that node represent the arguments of the operation
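For an expression such as position = initial + rate * 60, this node/children structure can be sketched as follows (the Node class and its printed form are illustrative):

```python
# Minimal syntax-tree nodes: each interior node stores an operator and
# its children store the operands.
class Node:
    def __init__(self, op, *children):
        self.op, self.children = op, children

    def __repr__(self):
        if not self.children:
            return str(self.op)
        return f"({self.op} " + " ".join(map(repr, self.children)) + ")"

# Tree for: position = initial + rate * 60
# '*' binds tighter than '+', so (rate * 60) is a subtree of '+'.
tree = Node("=", Node("position"),
            Node("+", Node("initial"),
                 Node("*", Node("rate"), Node(60))))
print(tree)   # (= position (+ initial (* rate 60)))
```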

Semantic Analysis(语义分析)

The semantic analyzer uses the syntax tree and the symbol table to check the program for semantic consistency; it also gathers type information and saves it in the syntax tree or the symbol table for later use during intermediate codegen.

An important part of semantic analysis is type checking: the compiler checks that each operator has matching operands; for example, a string cannot be added to an array of integers

Coercion is an implicit type conversion. For example, the add operator may be applied to either a pair of integers or a pair of floats; if one operand is an integer and the other is a float, the compiler may coerce the integer into a float. The underlying operation then behaves as if an explicit type conversion had been applied to the integer operand.
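A sketch of a type checker that inserts such a coercion (the expression shape, the type names, and the inttofloat marker are all assumptions of this sketch):

```python
# Type checking with coercion on a small expression tree.
# Leaves are Python ints/floats; interior nodes are (op, left, right).
def check(node):
    """Return (type, rewritten node), inserting int-to-float coercions."""
    if isinstance(node, int):
        return "int", node
    if isinstance(node, float):
        return "float", node
    op, left, right = node
    lt, lv = check(left)
    rt, rv = check(right)
    if lt == rt:
        return lt, (op, lv, rv)
    if {lt, rt} == {"int", "float"}:
        # coerce the integer operand, as if the compiler had inserted
        # an explicit inttofloat conversion
        if lt == "int":
            lv = ("inttofloat", lv)
        else:
            rv = ("inttofloat", rv)
        return "float", (op, lv, rv)
    raise TypeError(f"cannot apply {op} to {lt} and {rt}")

t, tree = check(("+", ("*", 2, 3), 1.5))
print(t, tree)   # float ('+', ('inttofloat', ('*', 2, 3)), 1.5)
```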

Intermediate Codegen(中间表示生成)

During compilation there may be one or more intermediate representations of the source program; the syntax tree is one of them, at a relatively high level. After syntax and semantic analysis, the compiler often generates a low-level IR that can be thought of as a program for an abstract machine (e.g. a stack machine or a register machine). The two important properties of an IR are:

  1. Easy to produce
  2. Easy to translate into the target machine language
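A sketch of lowering a syntax tree into three-address code, a common low-level IR in which every instruction has at most one operator (the tree shape and the temporary-naming scheme are assumptions of this sketch):

```python
import itertools

def gen(node, code, temps=None):
    """Emit three-address instructions for node; return the result name."""
    if temps is None:
        temps = itertools.count(1)
    if not isinstance(node, tuple):
        return str(node)               # leaf: a name or a constant
    op, left, right = node
    l = gen(left, code, temps)
    r = gen(right, code, temps)
    temp = f"t{next(temps)}"           # fresh compiler-generated temporary
    code.append(f"{temp} = {l} {op} {r}")
    return temp

code = []
gen(("+", "initial", ("*", "rate", "60")), code)
for line in code:
    print(line)
# t1 = rate * 60
# t2 = initial + t1
```

Note how each instruction is simple enough to map almost directly onto a machine instruction, which is the "easy to translate" property above.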

Code Optimization(机器无关优化)

The code optimization here is machine independent: it works on the IR, trying to improve its quality so that better target code can be generated. Machine-independent optimization often involves transformations that depend heavily on control-flow and data-flow analysis, such as constant folding(常量折叠) and copy propagation(复写传播)
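A sketch of these two optimizations over straight-line three-address code, represented here as (dest, left, op, right) tuples with op = None for a plain copy (this IR encoding is an assumption of the sketch):

```python
# Constant folding + copy propagation over a basic block.
def optimize(code):
    out, known = [], {}             # known: dest -> propagated value
    for dest, left, op, right in code:
        # copy propagation: replace operands whose value is known
        left = known.get(left, left)
        right = known.get(right, right)
        if op is None:              # plain copy: dest = left
            known[dest] = left
            out.append((dest, left, None, None))
        elif isinstance(left, int) and isinstance(right, int):
            # constant folding: evaluate at compile time
            value = {"+": left + right, "*": left * right}[op]
            known[dest] = value
            out.append((dest, value, None, None))
        else:
            out.append((dest, left, op, right))
    return out

code = [("x", 2, None, None),       # x = 2
        ("y", "x", "*", 3),         # y = x * 3
        ("z", "y", "+", "w")]       # z = y + w
print(optimize(code))
# x = 2; y folds to 6; z = 6 + w
```

A real optimizer would also run dead-code elimination afterwards to drop the now-unused copies; that pass is omitted here.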

Codegen(代码生成)

The codegen phase takes the IR as input and produces a program in the target language as output. If the target language is machine code, registers or memory locations are selected for each of the variables used by the program, and the IR is translated into equivalent sequences of real machine instructions; the generated code consults the symbol table to map variables to registers and memory addresses
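A sketch of this last translation step for a hypothetical two-register load/store machine (the mnemonics LD, ADD, MUL, ST and the fixed register assignment are illustrative; a real code generator would allocate registers much more carefully):

```python
# Naive codegen: each three-address instruction (dest, left, op, right)
# becomes load / compute / store on registers R1 and R2.
def codegen(three_addr):
    asm = []
    for dest, left, op, right in three_addr:
        asm.append(f"LD  R1, {left}")
        asm.append(f"LD  R2, {right}")
        asm.append(f"{'ADD' if op == '+' else 'MUL'} R1, R1, R2")
        asm.append(f"ST  {dest}, R1")
    return asm

for line in codegen([("t1", "rate", "*", "60"),
                     ("t2", "initial", "+", "t1")]):
    print(line)
```

Notice the redundant store/reload of t1 between the two instructions; eliminating such traffic is exactly what register allocation is for.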

Symbol-Table Management(符号表管理)

The symbol table records the names of variables and functions and collects information about each of them: for a variable, it may record the allocated storage, type, and scope; for a function, the number and types of its arguments, the method of passing each argument (by reference or by value), and the return type
The symbol table is an associative container that uses the names as keys and the attributes above as values; it should be designed so that the record for each name can be found quickly
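A sketch of such a table as a dictionary keyed by name, with one table per scope chained to its parent so that lookup searches enclosing scopes (the attribute fields are illustrative):

```python
# Symbol table: name -> attribute record, one table per scope.
class SymbolTable:
    def __init__(self, parent=None):
        self.entries, self.parent = {}, parent

    def define(self, name, **attrs):          # e.g. type, storage, scope
        self.entries[name] = attrs

    def lookup(self, name):
        """Find the record for a name, searching enclosing scopes."""
        table = self
        while table is not None:
            if name in table.entries:
                return table.entries[name]
            table = table.parent
        raise KeyError(name)

globals_ = SymbolTable()
globals_.define("count", type="int", storage=4)
locals_ = SymbolTable(parent=globals_)
locals_.define("x", type="float", storage=8)
print(locals_.lookup("count"))   # found in the enclosing scope
```

Python's dict gives the fast name-to-record lookup the text asks for; a compiler in C might use a hash table for the same reason.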

The front-end of a compiler contains Lexical Analysis, Syntax Analysis, Semantic Analysis and Intermediate Codegen

The back-end of a compiler contains Code Optimization(optional) and Codegen

A well-designed frontend and backend should be decoupled: the same frontend can be combined with different backends, and the same backend with different frontends

The classification of programming languages

The programming languages that have been invented so far can be classified in various ways

  1. By generation
    1. First-generation: machine languages, which talk to the machine directly in binary encodings
    2. Second-generation: assembly languages
    3. Third-generation: high-level languages like C/C++/C#/Java/Fortran
    4. Fourth-generation: domain-specific languages like SQL
    5. Fifth-generation: logic- and constraint-based languages like Prolog
  2. By the way languages represent computations
    1. Imperative: languages that specify how a computation is to be done, like C/C++/Java/C#; these languages typically involve statements and notations for changing state
    2. Declarative: languages that specify what computations are to be done, like ML/Haskell/Prolog
  3. von Neumann languages: languages whose computational model is based on the von Neumann computer architecture
  4. Object-oriented languages: languages that support object-oriented programming
  5. Scripting languages: interpreted languages designed for "glue code"

The requirements of compiler optimizations


  1. The optimization must be correct, which means it preserves the semantics of the original program; this is the most important requirement
  2. The optimization must improve the performance of many programs to be effective and meaningful; if an optimization only shows its brilliance in a few circumstances, it is mostly useless and should not be made. Besides performance, error reporting and debuggability also matter
  3. The compilation time must be kept reasonable
  4. The engineering effort required by the optimization must be manageable

There are two major optimization targets tied to computer architecture: parallelism and the memory hierarchy. Parallelism exists at several levels, processor-level and instruction-level: the machine can issue instructions in parallel when possible, operate on a vector of data at the same time (called vectorization), and run multiple operations in parallel. Techniques have also been found to automatically turn a sequential procedure into multiprocessor code.
The memory hierarchy consists of several storage levels arranged like a pyramid: the topmost level is the fastest but smallest, and moving down the pyramid, speed decreases while capacity grows. Several compiler techniques address the hierarchy, such as register allocation; allocating variables to registers in a sensible way is one of the most important problems in optimizing a program. Memory and physical devices are managed by the system and the hardware, yet they are sometimes not efficient enough, and performance can be improved by rearranging the layout of data and the order of instructions accessing the data
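As a sketch of why access order matters for the memory hierarchy (the array size is illustrative; in CPython the timing difference is modest, but in C with a contiguous array it is dramatic):

```python
# The same reduction with two traversal orders over a row-major
# 2-D array: the row-order loop visits memory sequentially, the
# column-order loop jumps to a different row on every step. This is
# the access pattern that loop-reordering optimizations improve.
N = 300
matrix = [[1] * N for _ in range(N)]

def sum_row_major(m):
    total = 0
    for i in range(N):
        for j in range(N):
            total += m[i][j]      # consecutive elements of one row
    return total

def sum_col_major(m):
    total = 0
    for j in range(N):
        for i in range(N):
            total += m[i][j]      # a different row on every step
    return total

# both orders compute the same result; only memory behaviour differs
print(sum_row_major(matrix) == sum_col_major(matrix) == N * N)  # True
```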

Env and states

An environment is a mapping from names to variables (locations in the store)
A state is a mapping from variables to their values

The binding from names to locations is mostly dynamic; the same name may bind to different variables in different scopes

The binding from variables to values is also dynamic, because we cannot know the value at a particular location until we run the program. Constants are an exception: the binding between a constant variable and its value is known at compile time
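The two mappings can be sketched as two dictionaries, with locations modelled as numbered cells (the cell numbering and helper names are illustrative):

```python
# environment: name -> location; state: location -> value
store = {}                       # the state
env = {}                         # the environment
next_loc = 0

def declare(name):
    """Bind a name to a fresh location (an environment update)."""
    global next_loc
    env[name] = next_loc
    next_loc += 1

def assign(name, value):
    """Change the value at the name's location (a state update)."""
    store[env[name]] = value

declare("x")
assign("x", 1)
declare("x")        # a new declaration: same name, new location
assign("x", 2)
# the two declarations of x occupy different locations in the store
print(env["x"], store)   # 1 {0: 1, 1: 2}
```

Note that the second declare shadows the first binding without restoring it; a real implementation would stack bindings per scope, as the symbol-table sketch earlier does.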

Name, Identifiers, and Variables

Name mostly refers to a compile-time name, while variable mostly refers to the runtime location denoted by a name

An identifier is a string of characters that refers to an entity, such as an object, a procedure, a class, or a type.

All identifiers are names, but not all names are identifiers: x.y is a name but not an identifier; x is both an identifier and a name, and so is y, but x.y is a qualified name, not an identifier

A variable refers to a particular location of the store (e.g. in memory). A variable may be declared more than once, and each such declaration results in a different store location, so the same identifier does not necessarily refer to the same variable; this happens, for example, in a recursive procedure.
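The recursive case can be observed directly (here id() stands in for the storage location, and the list wrapper is just a way to get a fresh object per activation):

```python
# Each recursive activation declares its own local, so the same
# identifier denotes a different location in each call.
locations = []

def countdown(n):
    local = [n]                  # a fresh object for this activation
    locations.append(id(local))
    if n > 0:
        countdown(n - 1)

countdown(2)
# three activations were live at once, each with its own location
print(len(set(locations)))       # 3
```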

Procedures, Functions, and Methods

All callable subprograms are procedures; strictly, procedures have no return value, and a function is a procedure with a return value, so a function returning void can be treated as a procedure. Methods, heavily used in object-oriented languages, are functions or procedures associated with a class

Declaration and Definition

A declaration tells us the type and the existence of a thing, while a definition tells us about its value and behaviour

Dynamic scope and Lexical scope

We say a scope resolution policy is dynamic if it is based on factors that can be known only at runtime. Under dynamic scope, a use of a name x refers to the variable declared in the most recently called, not-yet-terminated procedure that declares x. Dynamic dispatch and macro expansion are the widely known uses of the dynamic policy

A scope resolution policy is lexical (static) if it is based on the lexical structure of the source program; almost all languages resolve variable scopes lexically
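The two policies can be contrasted in one sketch: Python itself is lexically scoped, so the dynamic policy is simulated below with an explicit stack of call-time bindings (the stack and helper names are illustrative):

```python
x = "global"

def lexical_show():
    return x            # resolved from the program text: the global x

def lexical_caller():
    x = "caller-local"  # has no effect on lexical_show
    return lexical_show()

# dynamic scope simulation: a stack of frames, one per live call
dynamic_stack = [{"x": "global"}]

def dynamic_lookup(name):
    """Search the most recently called, not-yet-terminated frames."""
    for frame in reversed(dynamic_stack):
        if name in frame:
            return frame[name]
    raise NameError(name)

def dynamic_caller():
    dynamic_stack.append({"x": "caller-local"})   # this call's bindings
    try:
        return dynamic_lookup("x")
    finally:
        dynamic_stack.pop()

print(lexical_caller())   # global
print(dynamic_caller())   # caller-local
```

Under the lexical policy the caller's local x is invisible to the callee; under the dynamic policy the callee sees whatever the most recent live caller bound.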

Call-by-val and Call-by-ref

Under call-by-value, the actual parameter is evaluated (if it is an expression) or copied (if it is a variable), and the copy is placed in the location of the formal parameter, so changes to the formal parameter do not affect the actual parameter. Under call-by-reference, the address of the actual parameter is passed as the value of the formal parameter, so changes made through the formal parameter are visible to the caller.

If two variables point to the same location, so that changes made to one of them affect the other, then each of them is an alias of the other
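A sketch of the contrast (Python passes object references by value, which makes both behaviours easy to show: rebinding a parameter acts like call-by-value, while mutating a shared object exposes aliasing, the visible effect of reference semantics):

```python
def rebind(param):
    param = [99]         # rebinding: invisible to the caller

def mutate(param):
    param[0] = 99        # mutation through an alias: visible

original = [1, 2, 3]
rebind(original)
print(original)          # [1, 2, 3] -- unchanged, like call-by-value

alias = original         # alias and original name the same location
mutate(alias)
print(original)          # [99, 2, 3] -- the change shows through both names
```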