Three-address code

The compilers do not treat the part of the program as set of characters, instead, these "set of characters" are treated as a unit, called token, in count+1, there will be three tokens, specifically, the count will be treated as a token of type id instead of a set of characters.
The intermediate codegen will based on the AST:

The feature of the syntax trees make them particularly suitable in intermediate code generation, consider the two leaves of a node as operands and the node itself as operator, it is easily to construct a three-address code like x=y\ op\ z where y and z are the operands, op is the operator, and x is the location where the computation result stores. A three-address instruction carries out at most one operation, such as comparison, computation or branch.

Context-free grammar(上下文无关文法)

See this article(上下文无关文法)
The analogy here is that the name of the token will be the terminals in the corresponding CFG of the programming language. The terminology token and terminal is synonymous to each other under this context. Another concern is that the empty strings are denoted by \epsilon not \Lambda

If an operands with the same(precedence, actually) operators on its both sides belongs to the left one, the operator is said to be left associative, for example, plus sign is left associative because in an arithmetic expression(incomplete) +x+, x will be resolved by the left + unless a pair of parenthesis is involved. The opposite of left associative is right associative, the assignment operator = and exponential operator are often right associative, they have different parse trees, the former one's parse tree growns towards bottom-left, and the later one is bottom-right

If an operator a has the privilege to takes operands before another operator b, we say that a has higher precedence than b, for example, \ast has higher precedence than +

If we want to create CFG for a language with built-in precedence

  1. Build the atomic part of the expression, for example in arithmetic expressions, a single digit and a expression that is surrounded by a pair of parenthesis are atomic, they can not be torn apart in anyway, which means they have the highest precedence in the system, for exmaple, a digit itself is definitely atomic, it has the highest precedence, to itself. If an expression, no matter what, is surrounded by parenthese, then it can never be "seized" by any operator outside of the parentheses, so, both digits and parenthesized expression can be considered as the atomic expressions, they have the highest precedence, so we add a production for them: factor\rightarrow\ digit\ |\ (?), the question mark "?" is because for now we don't know what can be put in to the parentheses yet.
  2. If we want to build a grammar with n-levels of precedence, we need to make an observation:
    • For every expression with lower precedence, they cannot "seize" the expression that are bounded by operators of higher precedence, for example, in 3+4\ast5^6 the 5^6 is treated as an atomic unit to the multiplication sign, it cannot "seizes" 5 from the exponential, 4\ast5^6 is treated as an atomic unit to the plus sign, it cannot "seizes" the 4 from multiplication, now let a=5^6, b=4\ast a, the original expression can be shown as 3+ba where a has the highest procedence, b takes the second place, as for the plus sign, since the addition and subtraction have the lowest precedence in the arithmetic system, so the cannot be considered as a whole, because any left-associative operator can "seize" their operands, for example if we put another + at the left of the whole expression +3+ab, then the 3 immediately becomes the operand of the left +

    so the basic idea is, for every operator at the same level of precedence, we write a similar production like the one we've made for the factor, where the right side of the production should become \text{current-level-of-precedence } op \text{ higher-level-of-precedence}, of course the order of the two operands depends on the associativity.

  3. From the last step we know we need to build a production set for operators that have one-level-lower precedence than factor, which, in arithmetic expressions, are multiplication and division, so we add:
    term\rightarrow\ term\ast factor\ |\ term/factor\ |\ factor

    the last production means that an expression with higher precedence can be derived from an expression with lower precedence, but not vice versa because of the "unit" we've mentioned, an expression with lower precedence can derives higher precedence because that they could be consisted of the expression of higher precedence, but the higher precedence are treated as units like we've talked in step 2, so a higher precedence can never be consisted of lower precedence, just imagine that it's impossible

  4. Now we get to the lowest precedence in the hierarchy, the addition and subtraction, we do like what we've said in step 2:
    expr\rightarrow\ expr+term\ |\ expr-term\ |\ term
  5. Finally, we need to consider what need to be filled into the question mark we leaved in the first step. since any valid arithmetic expression can be placed into a pair of parentheses, so the content of the question mark must be able to derive everything in this langauge, which is, apperantly, expr, because it has the lowest precedence, and only lowest precedence can generates higher or the same precedence, so expr can derive all other productions and itself, but the other productions cannot derive expr, so we need to put an expr to replace the question mark:
    factor\rightarrow\ digit\ |\ (expr)

So far from now we've finished inventing a CFG for a language with built-in precedence, the whole structure of this CFG is similar to an Ouroboros(because the expr is clearly a self-embedded nonterminal), it is easy to find that this algorithm is iterative, so with any level of precedence we just need to repeat the steps of this algorithm

The keyword allows us to recognize statements because most statements start with a keyword or special character(not all of them, of course)

Syntax-Directed Translation(语法制导翻译)

"Attribute" under this sense means any quantity associated with a programming construct(i.e. a node in parse tree), includes but not limited to:

  1. Data type of the expressions
  2. Location of the first instruction generated by this programming construct
  3. Number of instructions in the generated code

Syntax directed translation scheme is a notation for attaching program fragments to the productions of the grammar so that they will be executed in the order of how the program text is parsed and produce the corresponding result that will be used as the translation of the original program

A syntax directed definition, includes

  1. For each grammar symbol a set of attributes
  2. For each production a set of rules called semantic rules to compute the value of the attributes associated with that symbol

We use x.y denotes the attribute y of nonterminal x, a parse tree with attribute value on each node is called annotated parse tree

An attribute is said to be synthesized, if its value on a node N in the parse tree depends on the attributes values of N and the children of N, one of the systhesized attributes' property is that they can be evaluated during a bottom-up traversal of the tree

Semantic Rule(语义规则)

In the figure above, the t is a attribute with string value that bounded to the nonterminals, represent the postfix form of the subtree start from the corresponding nonterminal, we can used another table-based syntax directed definition to represents such structures:

where || stands for string concatenation, for every production there is a semantic rule tells how the attribute t is evaluated

Such a SDD(abbreviation of syntax directed definition) that the expressions in the right side of the semantic rule has the same order as its production with some additional string is called simple SDD

The evaluation order of the subexpressions in the semantic rules of SDD does not matter, as long as you ensure that the root expression is being evaluated only after all of its subexpressions' evaluation

Semantic Action(语义动作)

The semantic rules of the SDD is not limited to only produce strings, you can embed some program fragments that will be executed when the node is encountered, the semantic rules in such case will be called as semantic actionssemantic action will be treated as an extra children of that nonterminal according to its place in the production rule and executed when it is traversed

Parser(语法分析器)

For every CFG there exist a parsing algorithm that can parses it in O(n^3) time(the CYK Algorithm), mostly it won't be employed since it's too slow

There're generally two classes of parsers: bottom-up and top-down, they're named by the order that the nodes in the parse tree are generated. The current terminal that is being scanned by the parser is called lookahead, in the parsing procedure, if we find that the terminal corresponds to the input, we advance both the tree node and the input from left to right, if we find that some production is unsuitable, we need to go back and try another production, this is called backtracking.

Recursive-descent parsing and Predictive parsing(递归下降分析和预测分析)

Recursive-descent parsing is a top-down approach to process the input, there is a simple form of RDP called predictive parsing, it is a procedure that "predicts" what productions are about to be used and list all the nonterminal and terminals in that production in advance, which implicitly generates a parse tree, the predictive parsing does not require backtracking, because it already listed all the expecting tokens before the input string contacts, we don't need to try the productions one by one; for production stmt\rightarrow\boldsymbol{for(}\ optexpr\ ;\ optexpr\ ;\ optexpr\ \boldsymbol{)}\ stmt, we can use the predictive parser

It should be noticed that match procedure are only for terminals

Let \alpha be the deriving string, we used \text{FIRST}(\alpha) to denote the set of terminals that at the first place of the one or more substition results from \alpha, if \alpha is nullable, then \epsilon\in\text{FIRST}(\alpha), in the example above, the \text{FIRST}(\alpha)=\lbrace \boldsymbol{for}\rbrace, the \text{FIRST} must be considered is there is more than one live productions from a nonterminal A, for instance, A\rightarrow \alpha and A\rightarrow \beta, the \text{FIRST}(\alpha) and \text{FIRST}(\beta) is required to be disjoint, otherwise the predictive parser cannot deduce which production is to be used when encountering their common terminals

The application of \epsilon-production is a bit tricky, suppose for nonterminal optexpr there exist production optexpr\rightarrow \epsilon\ |\ \boldsymbol{expr}, once we hit such a situation that there is no terminals for optexpr, for example, in the input string \boldsymbol{for(;expr;expr)}\ other, the should be an optexpr after the first (, but the truth is nothing is presented, the input string skips it and proceed to the ;, in such circumstance we should check if the lookahead symbol matches any of the non-\epsilon-productions of nonterminal optexpr, in this example we check if the lookahead=\boldsymbol{expr}, apparently they are not equal since the lookahead=\boldsymbol{;}, so we just return from optexpr() and do nothing, we skipped the optexpr nonterminal, but the lookahead stays intact; if the lookahead indeed match any production, we then use that production and advance the lookahead, we do nothing means that an \epsilon-production is applied

The things the a predictive parser should undertake for a particular nonterminal A can be listed in a list:

  1. Examine which production of A should be used by checking the \text{FIRST}(\alpha) is the same as lookahead where \alpha is the body of the current examining production, if the \text{FIRST} set is intersected in multiple non-\epsilon-productions, then we cannot used predictive parser for this grammar, the \epsilon-production is used if the lookahead doesn't match all the others

  2. Call the procedure that corresponding to the \text{FIRST}(\alpha) we've just detected, in that procedure all the terminal and nonterminals will be listed from left to right to simulates a derivation, if any of the symbols(terminal or nonterminal) is mismatched with lookahead, the procedure should crash or a syntax error should be reported

Since the SDT(abbr for syntax-directed translators)s are build upon SDDs, we can form a SDT from a predictive parser, first we build a predictive parser for those production and omit those semantic action, then, we add those semantic actions to the right place in the procedures of the predictive parser(where "right place" is self-explainable, since the procedures is almost the same as the productions).

Left Recursion and Right recursion(左递归和右递归)

A production is said to be left-recursive, if the right side starts with the same nonterminal as the left side, such productions will causing an infinitely high parse tree that will grow downward left, and lookahead will never move during the recursion(because lookahead only advanced in match and match is only for terminals); The right recursion is opposite at left recursion, a production is said to be right-recursive if the right side ends with the same nonterminal as the left side. a right-recursive production will causing an downward right parse tree with infinitely height

AST(抽象语法树)

Abstract syntax tree is similar to parse tree, but the nodes in parse tree are nonterminals, and in AST are program structures, the node of AST represent an operator and its children represent operands, the operands and the operator is not necessarily be some widely-known binary operatos, they just need to be "treated" like operators with operands

it is clear that the syntax tree does not have something like expr or expr.t on it, they're replaced by real structures

When we convert a grammar in to a form that facilitates parsing, the semantic actions inside will be treated as terminals

Lexical Analysis

A token is consists of name and attributes, it is not the same as lexeme, since the later one is only a string, and token is an object, however, there do have a relationship between them, a particular lexeme can be recognized by the lexer and then create a specific token for this lexeme, the lexeme, of course, also can be an attribute of the token

  • The lexer often skip blanks
  • The lexer is just for lexical analyzing, no matter the lexeme is semantically right or wrong, it just produce tokens
  • A token object typically has a tag field to identify the kind of this token, like number or identifier or boolean constants.
  • A string table will be hold for identifiers and reserved words during the lexical analysis, if we meet a lexeme we put it into the table with the token it corresponds unless there is already a token with the same lexeme in the table, in such case we return that token directly
  • The lexeme of integers is computed to real integers during the lexical analysis

Symbol Table

Symbol table entries are used by lexical, syntax and semantical analysis. Compared to the string table in the lexical analysis, it is more often that only parser can decide whether to use old entry or create a new one in symbol table, the lexer can only return tokens with lexeme

Symbol tables can be use a form of a stack, where the topmost element is the symbol table of current block, the second is the outer block, etc. It can also be implemented by a tree(because every nesting level could be more than one block, so there may be two symbol tables at the same nesting level, which will form a tree), or, anything that can holds the order of the symbol tables from the nested block to the outmost block

A new symbol can be added to a symbol table by using semantic actions on the declaration nodes, when a node that declares a variable is analyzed, the semantic action will puts its information to the symbol table, and it will be queried out from the symbol table when encountered a production such as factor\rightarrow \boldsymbol{id} which declares a use of that variable

Intermediate Codegen

It is not necessary for compiler to yields IR(AST is a kind of IR with higher level, by the way) while building a solid syntax tree, it could generates IR while pretending to be doing so, which means some sort of simulation, the nodes and their attrs will be computed along with the data structure while parsing, but disappear after the parsing, the algorithm is to find a way to generates the syntax tree first, and then modify this method to let it generates the IR.

The compiler's front end will finishes the static checking including syntactic and semantic analysis before the intermediate codegen

要点整理

  1. 三地址代码是一种相较于高级语言更低级的中间代码,包含四个部分:临时变量t,操作符op,操作数o_1和操作数o_2,一般有着t=o_1\ op\ o_2的形式,但是op并不一定是中缀运算符,只要是语义上符合"带有两个操作数的运算符"的都符合该标准,因此while, for, if也可以转换为这种形式
  2. 上下文无关文法,一种短语结构文法,详见之前的一篇文章
  3. 文法内建的运算符优先级和结合性是由生成的语法树决定的,在遍历的过程中更低的节点要比高的节点更早求值,可以利用这一点让高优先级运算符在语法树中总是比低优先级的预算服更低来做到,算法是针对每一个优先级创建一个非终结符和对应的产生式,然后对不可分割的原子表达式创建一个非终结符factor和对应的产生规则,因此对于n个级别的优先级需要n+1个非终结符
  4. 语法制导翻译指通过语法来导引翻译的顺序,也就是说翻译将会随着对于语法树的遍历和解析而完成,通常会通过在语法树中内嵌语义动作来在遍历的同时生成代码
  5. 自顶向下分析和自底向上分析的区别在于从前者从语法树的根开始尝试推导出整棵树,后者从语法树的叶子开始尝试演绎的推导出根
  6. 递归下降分析,指的是在分析过程中针对同一种结构递归的调用某个解析过程的分析方法
  7. 预测分析器,指的是一种特殊的分析方法,这种方法并不会针对某个节点靠回溯尝试所有的产生式,而是预先预测出这个节点的产生式,并且尝试去对产生式中的每一个终结符或非终结符进行匹配,使用这种方法需要在产生式体的开头部分有一个能够让分析器识别应当使用那一个产生式的模式串,一个推导中的字符串\alpha中可能出现的这样的"开头部分"的集合用\text{FIRST}(\alpha)表示,\text{FIRST}(\alpha)\text{FIRST}(\beta)应该是不相交的,否则分析器将不知道选择哪一个产生式
  8. 如果在分析过程中我们什么都不做,则视为应用了\epsilon产生式
  9. 左递归指的是右侧开头是一个和左侧相同的非终结符的产生式,这种产生式无法使用预测分析器分析,因为预测分析器只会在遇到终结符的时候更新lookahead变量,因此左递归会导致分析器陷入无限递归最终爆栈。针对这种情况需要使用某些算法将左递归转换为等价的右递归,叫做左递归消除
  10. 抽象语法树是一种和语法树近似的树形数据结构,不同在于语法树的节点是非终结符,想要知道每个非终结符表示了什么只能通过附加属性做到,而抽象语法树的每一个节点分别代表了运算符和运算数
  11. 符号表会在词法,语法和语义分析阶段被用到,一般用一个树形结构表示符号表,每一个嵌套作用域层级都会在树的末端添加一个属于该作用域的符号表,在解析x的时候会从最下层的符号表开始找,找到最近的包含x定义的符号表,并使用这个符号表中的x,当退出一个作用域的时候应当删除掉该作用域挂载到树上的符号表,符号表的添加和删除可以依靠在进入和退出作用域的产生式上附加语义动作实现
  12. 中间代码生成依赖于语法树,但是并不一定真的会生成一个语法树,中间代码生成同样可以靠语义动作所构建的抽象语法树做到,对抽象语法树的节点归约即可得到三地址形式的中间代码
  13. 编译器的静态检查也会在中间代码生成的阶段发生,它包含两点,第一是语法检查,第二是类型检查

行成于思,毁于随