2021-06-05Download source code
here
The grammar provided for the experiment is as
follows:
\begin{array}{lcl}
program & \to & block\\
block & \to & \{\;decls\;stmts\;\}\\
decls & \to & decls\;decl\;|\;\varepsilon\\
decl & \to & type\;\textbf{id}\;;\\
type & \to &
type\;[\;\textbf{num}\;]\;|\;\textbf{basic}\\
stmts & \to & stmts\;stmt\;|\;\varepsilon\\
stmt & \to & \textbf{id}=expr\;;\\
& | & \textbf{if}\;(\;bool\;)\;stmt\\
& | &
\textbf{if}\;(\;bool\;)\;stmt\;\textbf{else}\;stmt\\
& | & \textbf{while}\;(\;bool\;)\;stmt\\
& | &
\textbf{do}\;stmt\;\textbf{while}\;(\;bool\;)\;;\\
& | & \textbf{break}\;;\\
bool & \to & bool\;||\;join\;|\;join\\
join & \to &
join\;\&\&\;equality\;|\;equality\\
equality & \to &
equality==rel\;|\;equality\;\text{!=}\;rel\;|\;rel\\
rel & \to & expr<expr\\
& | & expr<=expr\\
& | & expr>expr\\
& | & expr>=expr\\
& | & expr\\
expr & \to & expr+term\\
& | & expr - term\\
& | & term\\
term & \to & term\,*\,unary\\
& | & term\;/\;unary\\
& | & unary\\
unary & \to & !\;unary\;|\;-unary\;|\;factor\\
factor & \to &
(\;expr\;)\;|\;\textbf{id}\;|\;\textbf{num}
\end{array}
where \textbf{basic}
includes four types: int, float, char, and bool.
1 Grammar Rewriting
1.1 Error Correction
Since array types are defined in type but not reflected in
assignment expressions, array types are not implemented
here, and decl is
modified to:
decl\to \textbf{basic}\;\textbf{id}\;;
Since expr only
involves four arithmetic operations, negative numbers,
and negation, without evaluation of boolean expressions
and logical operations, ||, \&\&, >, >=, <, and <= are also included in
the operations of expr.
Since the final factor can only derive \textbf{num} (i.e.,
integers), while \textbf{basic} also supports
floating-point, character, and boolean types, a new
terminal symbol \textbf{const} needs to be
defined to represent all constants.
1.2 Conversion
to Ambiguous Grammar
Since ambiguous grammar is more concise and
consistent with most syntax-directed definitions for
intermediate code generation given in textbooks, the
original grammar can be converted into an equivalent
ambiguous grammar. First, modify bool as follows:
\begin{array}{lcl}
bool & \to & bool\;||\;bool\\
& | & bool\;\&\&\;bool\\
& | & !\;bool\\
& | & (\;bool\;)\\
& | & expr<expr\\
& | & expr<=expr\\
& | & expr>expr\\
& | & expr>=expr\\
& | & expr\\
\end{array}
Then modify expr as
follows:
\begin{array}{lcl}
expr & \to & expr\;||\;expr\\
& | & expr\;\&\&\;expr\\
& | & expr<expr\\
& | & expr<=expr\\
& | & expr>expr\\
& | & expr>=expr\\
& | & expr+expr\\
& | & expr-expr\\
& | & expr\,*\,expr\\
& | & expr\;/\;expr\\
& | & !\;expr\\
& | & -\;expr\\
& | & (\;expr\;)\\
& | & \textbf{id}\\
& | & \textbf{const}
\end{array}
The shift-reduce and reduce-reduce conflicts of the
ambiguous grammar are resolved by specifying the
precedence and associativity of operators. The
precedence and associativity are specified in
grammar.y as follows:
%left OR
%left AND
%left EQ NE
%left LT LE GT GE
%left '+' '-'
%left '*' '/'
%right INV NOT
where INV stands for negation (negative numbers), and
NOT stands for logical negation.
1.3 Error Recovery
To prevent yacc from exiting immediately when a
syntax error is detected, but to continue translation
and find remaining possible errors, an error production
is added to stmt:
stmt\to\textbf{error}\;;
When encountering an error, Yacc will keep looking
ahead until it encounters a semicolon (i.e., the end of
the erroneous statement), then reduce the preceding
error to \textbf{error}, and continue
translating the subsequent content to achieve error
recovery.
1.4 Synthesized
Attributes
Since yacc adopts a bottom-up LR translation scheme,
each non-terminal and terminal symbol is set to have the
following synthesized attributes, defined as the
yystack structure in
types.h:
typedef struct yystack
{
char name[MAXLEN];
constant_val val;
type type;
int addr;
goto_list *t_list;
goto_list *f_list;
goto_list *n_list;
} yystack;
#define YYSTYPE yystack
where:
name | Name of the terminal symbol \textbf{id} |
val | Constant value of the terminal symbol \textbf{const} |
type | Type of terminal and non-terminal symbols (int,
float, char, bool) |
addr | Address of the temporary variable corresponding to
the non-terminal symbol |
t_list | List of jump instructions to be backfilled with
target addresses when the condition is true
(corresponding to truelist in textbooks) |
f_list | List of jump instructions to be backfilled with
target addresses when the condition is false
(corresponding to falselist in textbooks) |
n_list | List of jump instructions to be backfilled with
target addresses at the end of the statement
(corresponding to nextlist in textbooks) |
2 Lexical Analysis
In lexer.l, for the \textbf{basic} terminal
symbol, set the corresponding type attribute and return
it:
int { yylval.type = INT; return BASIC; }
float { yylval.type = FLOAT; return BASIC; }
char { yylval.type = CHAR; return BASIC; }
bool { yylval.type = BOOL; return BASIC; }
For the constant \textbf{const}, also set the
corresponding type and parse the constant value using
different methods according to different types. Integers
and floating-point numbers are implemented through
atoi and atof, character types
take the second character of the string (the first is a
single quote), and boolean types are set directly:
true { yylval.type = BOOL; yylval.val.boolean = true; return CONST; }
false { yylval.type = BOOL; yylval.val.boolean = false; return CONST; }
{num} { yylval.type = INT; yylval.val.num = atoi(yytext); return CONST; }
{real} { yylval.type = FLOAT; yylval.val.real = atof(yytext); return CONST; }
{char} { yylval.type = CHAR; yylval.val.ch = yytext[1]; return CONST; }
For \textbf{id},
directly copy the matching result to the name attribute,
and its type can only be determined during syntax
analysis.
{id} { strcpy(yylval.name, yytext); return ID; }
The regular expressions for matching identifiers,
integers, floating-point numbers, and characters are
defined as follows, supporting the E notation for
floating-point numbers.
id [A-Za-z][0-9A-Za-z]*
num [0-9]+
real [0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?
char '.'
3 Identifiers and
Constants
3.1 Symbol Table
Identifiers, symbol tables, and related functions are
defined in types.h as follows:
typedef struct symbol
{
char name[MAXLEN];
type type;
int addr;
int width;
struct symbol *next;
} symbol;
typedef struct symbol_list
{
symbol *head;
struct symbol_list *prev;
int begin;
int end;
} symbol_list;
symbol *new_symbol(symbol_list *list, char *name, type type);
symbol *find_symbol(symbol_list *list, char *name);
symbol *find_local_symbol(symbol_list *list, char *name);
symbol_list *new_symbol_list(symbol_list *prev, int addr);
void print_symbol_list(symbol_list *list);
void delete_symbol_list(symbol_list *list);
Among them, the identifier structure records the
name, type, address, and memory width occupied by the
identifier, and records the pointer to the next symbol
to form a linked list. Since identifier definitions can
appear in different blocks, and the innermost
block can reference all
defined identifiers in the current and upper blocks, it is necessary to
save the pointer prev to the upper-level
symbol table of the symbol table.
Dynamic creation and deletion of symbol tables at
different levels are realized by modifying the
production of block:
\begin{array}{ll}
block\to \{\;begin\;decls\;stmts\;end\;\}\\
begin\to\varepsilon & \{\;slist =
new\_symbol\_list();\;\}\\
end\to\varepsilon &
\{\;delete\_symbol\_list(slist);\;\}
\end{array}
where slist is a global variable
representing the symbol table of the outermost block currently.
The specific implementation in grammar.y
is as follows: when creating a symbol table for the
first time, it is necessary to specify the starting
address of the symbol table. Here,
SYMBOL_BEGIN is used to represent it, which
is 1000. Subsequent symbol tables start from the end
address of the upper-level symbol table. Before the
symbol table is released, print_symbol_list
is called to print the current symbol table.
block : '{' begin decls stmts end '}'
begin :
{
if (slist == NULL)
slist = new_symbol_list(NULL, SYMBOL_BEGIN);
else
slist = new_symbol_list(slist, slist->end);
}
end :
{
symbol_list *slist2 = slist;
if (error == 0)
print_symbol_list(slist2);
slist = slist2->prev;
delete_symbol_list(slist2);
}
3.2 Constant Table
The constant table is similar to the symbol table,
but only a global constant table clist is
needed, starting from address number 3000
(CONSTANT_BEGIN), and there is no need to
save hierarchical relationships. The relevant function
definitions are as follows:
constant *new_constant(constant_list *list, constant_val val, type type);
constant *find_constant(constant_list *list, constant_val val, type type);
constant_list *new_constant_list(int addr);
void print_constant_list(constant_list *list);
void delete_constant_list(constant_list *list);
3.3
Identifier Declaration and Reference
Identifiers are added to the symbol table when the
declaration is parsed during syntax analysis, because
only then can the type of the identifier be known.
decl\to\textbf{basic}\;\textbf{id}\;;\qquad\{\;new\_symbol(slist,\,\textbf{id}.name,\,\textbf{basic}.type);\;\}
The new_symbol function creates a new
identifier, specifies the address of the new identifier
through the current symbol table information, and
determines the width occupied by the new identifier
through the type. When inserting into the symbol table,
it is also necessary to consider whether there are
duplicate defined identifiers. Since identifiers with
the same name in the inner layer will override those in
the outer layer, only the current layer’s symbol table
needs to be searched, which is realized by the
find_local_symbol function. The specific
implementation of the semantic rules in
grammar.y is as follows:
decl : BASIC ID ';'
{
if (find_local_symbol(slist, $2.name) == NULL)
new_symbol(slist, $2.name, $1.type);
else
{
char msg[MAXLEN];
sprintf(msg, "duplicate declearation of symbol \"%s\"", $2.name);
yyerror(msg);
}
}
When referencing an identifier, assign the attributes
of the identifier to the current non-terminal
symbol:
expr\to\textbf{id}\qquad\begin{array}{l}\{\;expr.type=\mathbf{id}.type;\\\;\;expr.addr=\textbf{id}.addr;\;\}\end{array}
Of course, it is also necessary to check whether the
symbol has been defined; otherwise, an error is
reported, but a new symbol can be inserted immediately
for error recovery processing. The specific
implementation is as follows:
expr : ID
{
symbol *id = find_symbol(slist, $1.name);
if (id == NULL)
{
char msg[MAXLEN];
sprintf(msg, "undeclared symbol \"%s\"", $1.name);
yyerror(msg);
id = new_symbol(slist, $1.name, INT);
}
$$.type = id->type;
$$.addr = id->addr;
}
3.4 Constant Reference
For constants, they need to be added to the global
constant table clist when they appear, and
their attributes are assigned to non-terminal
symbols:
expr\to\textbf{const}\qquad\begin{array}{l}\{\;expr.type=\mathbf{const}.type;\\\;\;expr.addr=new\_constant(clist,\,\textbf{const}.val,\,\textbf{const}.type);\;\}\end{array}
To address the problem of repeated references to the
same constant, a check for duplicate constants is added
inside the new_constant function. When both
the constant value and type are the same, the original
constant is returned directly. The specific
implementation is similar to the reference of
identifiers.
4 Quadruples
and Three-Address Code
4.1 Quadruple Table
A quadruple consists of an operator op, parameters
arg1, arg2, and result result (arg3). The structures of
quadruples and quadruple tables are defined in
types.h as follows:
typedef struct quadruple
{
int op;
int arg1;
int arg2;
int arg3;
char code[4 * MAXLEN];
} quadruple;
typedef struct quadtable
{
quadruple *buf;
int size;
int bufsize;
} quadtable;
quadtable *new_quadtable();
void print_quadtable(quadtable *table);
void delete_quadtable(quadtable *table);
Among them, the code field of the
quadruple is used to store the three-address code. The
buf of the quadruple table is used to store
quadruples. The code for inserting quadruples into the
quadruple table and generating the intermediate code
corresponding to the quadruple is as follows.
4.2 Incremental
Translation
The gen() function
in the textbook is implemented using incremental
translation. It is defined as follows, where arg1, arg2,
and arg3 are all memory addresses of parameters.
void gen(quadtable *table, symbol_list *slist, constant_list *clist, int op, int arg1, int arg2, int arg3)
The specific implementation steps of this function
are as follows. First, the size of the quadruple table
uses dynamic memory allocation. If the size is
insufficient, realloc is performed during
insertion:
if (table->size == table->bufsize)
{
table->bufsize += MAXLEN;
table->buf = realloc(table->buf, table->bufsize);
}
Then, assign values in sequence and insert the
current quadruple. If the values of arg1, arg2, and arg3
are -1, it means that the parameter is not used or the
parameter is to be backfilled.
quadruple *t = table->buf + table->size;
t->op = op;
t->arg1 = arg1;
t->arg2 = arg2;
t->arg3 = arg3;
After insertion, it is necessary to generate the
intermediate code corresponding to the quadruple. The
address of the intermediate code is the size of the
current quadruple table plus the starting address of the
code CODE_BEGIN, which is set to 100 here.
First, use the arg_name function to search
the symbol table and constant table, converting arg1,
arg2, and arg3 from memory addresses to corresponding
symbol names or string representations of constants. If
the value of arg is -1, it corresponds to an empty
string for subsequent backfilling.
int addr = table->size + CODE_BEGIN;
char code1[MAXLEN];
char code2[MAXLEN];
char code3[MAXLEN];
arg_name(code1, slist, clist, t->arg1);
arg_name(code2, slist, clist, t->arg2);
arg_name(code3, slist, clist, t->arg3);
Then, generate different forms of three-address
intermediate code according to different operators.
Since the backpatching method is used, the intermediate
code uses one label per line instead of forms like L0
and L1.
switch (t->op)
{
case movi:
case movf:
case movc:
case movb:
sprintf(t->code, "%d:\t%s = %s\n", addr, code3, code1);
break;
case addi:
case addf:
sprintf(t->code, "%d:\t%s = %s + %s\n", addr, code3, code1, code2);
break;
...
case and:
sprintf(t->code, "%d:\t%s = %s && %s\n", addr, code3, code1, code2);
break;
...
case eq:
sprintf(t->code, "%d:\t%s = %s == %s\n", addr, code3, code1, code2);
break;
...
case invi:
case invf:
sprintf(t->code, "%d:\t%s = -%s\n", addr, code3, code1);
break;
case not:
sprintf(t->code, "%d:\t%s = !%s\n", addr, code3, code1);
break;
case ftoi:
case ctoi:
case btoi:
sprintf(t->code, "%d:\t%s = (int)%s\n", addr, code3, code1);
break;
...
case jmp:
sprintf(t->code, "%d:\tgoto %s\n", addr, code3);
break;
...
case jle:
sprintf(t->code, "%d:\tif %s <= %s goto %s\n", addr, code1, code2, code3);
break;
}
Here, an enumeration type of operators is defined.
The mov series is for assignment operations, the add
series for addition operations, the inv series for
negation operations, the to series for type conversion
operations, and the j series for jump and branch
instructions. The suffixes i, f, c, and b correspond to
int, float, char, and bool types respectively. If a
branch instruction needs to be backfilled, since the
string corresponding to address -1 is an empty string,
the newline character can be overwritten after the
original string and content can be appended.
5 Expression
Translation
5.1 Temporary
Variables
Used to implement \textbf{new}\;Temp() in the
textbook. First, maintain a global variable
temp_count to represent the subscript of
the current temporary variable. Then generate a
temporary variable name, add it to the current symbol
table in the same way as new_symbol, and
finally return the address of the generated temporary
variable:
int new_temp(symbol_list *list, type type)
{
symbol *id = malloc(sizeof(symbol));
sprintf(id->name, "_t%d", temp_count++);
id->type = type;
id->width = type_width[type];
id->addr = list->end;
list->end += id->width;
id->next = list->head;
list->head = id;
return id->addr;
}
5.2 Type Conversion
Used to implement the widen() function and max() function in the
textbook.
The following enumeration type and conversion matrix
conv are defined in types.h,
corresponding to the op of type conversion. For example,
conv[INT][FLOAT] = conv[0][1] = itof, which
means the instruction for converting an integer to a
floating-point number. -1 means no conversion is
needed.
typedef enum type
{
INT,
FLOAT,
CHAR,
BOOL
} type;
int conv[4][4] = {
{-1, itof, itoc, itob},
{ftoi, -1, ftoc, ftob},
{ctoi, ctof, -1, ctob},
{btoi, btof, btoc, -1}};
Then, write the widen function. This
function first determines whether conversion is needed
according to the types before and after conversion. If
no conversion is needed, it directly returns the
original parameter address; otherwise, it adds a new
intermediate variable through new_temp and
generates a type conversion intermediate code through
gen, returning the address of the
intermediate variable.
int widen(quadtable *table, symbol_list *slist, constant_list *clist, int addr, type type1, type type2)
{
int op = conv[type1][type2];
if (op != -1)
{
int arg3 = new_temp(slist, type2);
gen(table, slist, clist, op, addr, -1, arg3);
return arg3;
}
return addr;
}
Next, the max array is defined to
indicate the type of the result when numerical values of
two types perform four arithmetic operations, which is
used to provide a basis for the above type
conversion.
int max[4][4] = {
{INT, FLOAT, INT, INT},
{FLOAT, FLOAT, FLOAT, FLOAT},
{INT, FLOAT, CHAR, INT},
{INT, FLOAT, INT, BOOL}
};
For example,
max[CHAR][FLOAT] = MAX[2][1] = FLOAT, which
means the result of the operation between a character
type and a floating-point type should be a
floating-point type.
5.3 Four Arithmetic
Operations
Take addition as an example:
expr\to
expr_1+expr_2\qquad\begin{array}{l}\{\;expr.type=max(expr_1.type,\,expr_2.type);\\
\;\;expr_1.addr=widen(expr_1.addr,\,expr_1.type,\,expr.type);\\
\;\;expr_2.addr=widen(expr_2.addr,\,expr_2.type,\,expr.type);\\
\;\;expr.addr=\textbf{new}\;Temp();\\
\;\;gen(expr.addr\;'\texttt{=}'\;expr_1.addr\;'\texttt{+}'\;expr_2.addr);\;\}\end{array}
First, determine the type of the result through the
max array. Then, perform type conversion on
the source operands through the widen
function (if needed). Create a new intermediate variable
to save the result of the expression, and assign the
address of this intermediate variable to the address
attribute of the non-terminal expr. Finally, incrementally
translate the statement through the gen
function to generate intermediate code. The specific
implementation is as follows:
expr : expr '+' expr
{
$$.type = max[$1.type][$3.type];
int arg1 = widen(table, slist, clist, $1.addr, $1.type, $$.type);
int arg2 = widen(table, slist, clist, $3.addr, $3.type, $$.type);
int arg3 = new_temp(slist, $$.type);
gen(table, slist, clist, _add[$$.type], arg1, arg2, arg3);
$$.addr = arg3;
}
Among them, the _add array is used to
determine which addition instruction to use (i.e., addi
or addf) according to the type. Subtraction,
multiplication, and division are similar, except for the
different operators in gen.
5.4 Boolean Operations
Since boolean operations here appear in the
production of expr
instead of bool, only
the value of boolean operations needs to be considered.
Similar to four arithmetic operations, but expr.type is specified as
BOOL in advance instead of being determined
using the max array.
5.5 Unary
Operators and Parentheses
For logical negation operations, the type is also set
to BOOL in advance, while for negation
(negative number) operations, the type is set to the
maximum of the operand type and INT,
because for floating-point numbers, negation remains a
floating-point number, and for integers, characters, and
boolean types, negation becomes a positive number. At
the same time, since there is only one operand,
arg2 needs to be set to -1 when using
gen for incremental translation.
For parentheses, it is simpler to directly assign the
attributes to the non-terminal symbol on the left side
of the production as they are:
expr\to(\;expr_1\;)\qquad\begin{array}{l}\{\;expr.type=expr_1.type;\\\;\;expr.addr=expr_1.addr;\;\}\end{array}
The specific implementation is as follows:
expr : '!' expr %prec NOT
{
$$.type = BOOL;
int arg1 = widen(table, slist, clist, $2.addr, $2.type, $$.type);
int arg3 = new_temp(slist, $$.type);
gen(table, slist, clist, not, arg1, -1, arg3);
$$.addr = arg3;
}
| '-' expr %prec INV
{
$$.type = max[$2.type][INT];
int arg1 = widen(table, slist, clist, $2.addr, $2.type, $$.type);
int arg3 = new_temp(slist, $$.type);
gen(table, slist, clist, _inv[$$.type], arg1, -1, arg3);
$$.addr = arg3;
}
| '(' expr ')'
{
$$.type = $2.type;
$$.addr = $2.addr;
}
5.6 Assignment
Statements
After the expression calculation is completed, it may
be necessary to assign it to a defined identifier.
However, type conversion is still involved here, and the
type of the expression needs to be converted to the type
of the identifier itself (if needed) through
widen, and the intermediate code for
assignment is generated:
stmt\to\textbf{id}=expr\qquad\{\;gen(\textbf{id}\;'\texttt{=}'\;widen(expr.addr,\,expr.type.\,\textbf{id}.type);\;\}
In addition, it is necessary to consider whether the
identifier has been defined. If it is not found in all
symbol tables, an error is reported, but a new
identifier can be created immediately for error
recovery. The specific implementation is as follows:
stmt : ID '=' expr ';'
{
$$.n_list = NULL;
symbol *id = find_symbol(slist, $1.name);
if (id == NULL)
{
char msg[MAXLEN];
sprintf(msg, "undeclared symbol \"%s\"", $1.name);
yyerror(msg);
id = new_symbol(slist, $1.name, INT);
}
int arg1 = widen(table, slist, clist, $3.addr, $3.type, id->type);
int arg3 = id->addr;
gen(table, slist, clist, _mov[id->type], arg1, -1, arg3);
}
Among them, _mov is used to determine
which move (assignment) instruction to use according to
the type, including movi,
movf, movc, and
movb.
6 Control Flow
Translation
6.1 Backpatching
Technology
The goto_list type is defined in
types.h for the backpatching table
structure, i.e., t_list,
f_list, n_list (corresponding
to truelist, falselist, nextlist in textbooks). It
stores the address list of jump instructions to be
backfilled with target addresses. Related functions are
defined, where the first two new_goto_list
and merge_goto_list correspond to the makelist() and merge() functions in the
textbook respectively.
typedef struct goto_list
{
int addr;
struct goto_list *next;
} goto_list;
goto_list *new_goto_list(int addr);
goto_list *merge_goto_list(goto_list *list1, goto_list *list2);
void delete_goto_list(goto_list *list);
At the same time, the backpatching function backpatch() in the textbook
is implemented as follows: for a given backpatching
table and backpatching address, access the quadruple in
the quadruple table, modify the destination address
(third parameter arg3) of its jump
instruction, and modify the generated intermediate code
by appending the string representation of the
backpatching address:
void backpatch(quadtable *table, goto_list *list, int addr)
{
while (list != NULL)
{
quadruple *t = table->buf + list->addr;
t->arg3 = addr;
sprintf(t->code + strlen(t->code) - 1, "%d\n", addr);
list = list->next;
}
}
6.2 Boolean
Expressions
For boolean expressions, it is necessary to maintain
truelist and falselist for the
non-terminal symbol bool. Boolean expressions are
translated using the “short-circuit” method of boolean
operators, so the production needs to be further
appropriately rewritten:
\begin{array}{ll}
bool\to bool_1\;||\;M\;bool_2 &
\begin{array}{l}\{\;backpatch(bool_1.falselist,\,M.addr);\\\;\;bool.truelist=merge(bool_1.truelist,\,bool_2.truelist);\\\;\;bool.falselist=bool_2.falselist;\;\}\end{array}\\\\
bool\to bool_1\;\&\&\;M\;bool_2 &
\begin{array}{l}\{\;backpatch(bool_1.truelist,\,M.addr);\\\;\;bool.truelist=bool_2.truelist;\\\;\;bool.falstlist=merge(bool_1.falselist,\,bool_2.falstlist);\;\}\end{array}\\\\
bool\to\;!\;bool_1 &
\begin{array}{l}\{\;bool.truelist=bool_1.falselist;\\\;\;bool.falselist=bool_1.truelist;\;\}\end{array}\\\\
bool\to(\;bool_1\;) &
\begin{array}{l}\{\;bool.truelist=bool_1.truelist;\\\;\;bool.falselist=bool_1.falselist;\;\}\end{array}\\\\
M\to\varepsilon & \{\;M.addr=nextinstr;\;\}
\end{array}
where nextinstr is
the address of the next instruction, i.e., the length of
the current quadruple table plus the starting address of
the code segment
table->size + CODE_BEGIN. The marker
symbol M is introduced
to record the starting address of the code corresponding
to the second boolean expression of the AND and OR
operations.
Taking the OR operation as an example to explain the
“short-circuit” method. First, backfill the jump address
where the condition of bool_1 is false to the
starting address of bool_2, because if the first
part of the OR operation is false, the second part may
still be true, and the result cannot be obtained
directly. However, the jump address where the condition
of bool_1 is true is
merged with that of bool_2, both being the jump
address where the condition of bool is true. In this way,
when bool_1 is indeed
true, the judgment of bool_2 can be skipped and
jump directly. The specific implementation is as
follows:
M :
{ $$.addr = table->size + CODE_BEGIN; }
bool : bool OR M bool
{
backpatch(table, $1.f_list, $3.addr);
$$.t_list = merge_goto_list($1.t_list, $4.t_list);
$$.f_list = $4.f_list;
}
| bool AND M bool
{
backpatch(table, $1.t_list, $3.addr);
$$.t_list = $4.t_list;
$$.f_list = merge_goto_list($1.f_list, $4.f_list);
}
| '!' bool %prec NOT
{
$$.t_list = $2.f_list;
$$.f_list = $2.t_list;
}
| '(' bool ')'
{
$$.t_list = $2.t_list;
$$.f_list = $2.f_list;
}
6.3 Conditional
Expressions
Boolean operations can also derive conditional
expressions. Conditional expressions are divided into
two types: one uses relational operators \textbf{rel}, i.e., ==, !=, >, >=, <, <=; the other directly
evaluates the expression as a boolean type. The
translation is as follows:
\begin{array}{ll}
bool\to expr_1\;\textbf{rel}\;expr_2 &
\begin{array}{l}\{\;type=max(expr_1.type,\,expr_2.type);\\\;\;expr_1.addr=widen(expr_1.addr,\,expr_1.type,\,type);\\\;\;expr_2.addr=widen(expr_2.addr,\,expr_2.type,\,type);\\\;\;bool.truelist=makelist(nextinstr);\\\;\;bool.falselist=makelist(nextinstr+1);\\\;\;gen('\texttt{if}'\;expr_1.addr\;\textbf{rel}\;expr_2.addr\;'\texttt{goto
\_}');\\\;\;gen('\texttt{goto
\_}');\;\}\end{array}\\\\
bool\to expr &
\begin{array}{l}\{\;type=max[expr_1.type][\textbf{bool}];\\\;\;expr_1.addr=widen(expr_1.addr,\,expr_1.type,\,type);\\\;\;bool.truelist=makelist(nextinstr);\\\;\;bool.falselist=makelist(nextinstr+1);\\\;\;gen('\texttt{if}'\;expr_1.addr\;'\texttt{==}'\;\textbf{true}\;'\texttt{goto
\_}');\\\;\;gen('\texttt{goto
\_}');\;\}\end{array}
\end{array}
For the first case, it is also necessary to convert
the parameters on both sides of the relational operator
to the same type through max and
widen before comparison. Then add the
goto statement where the condition is true
to the truelist
backpatching table, and add the next (condition not met)
goto statement to the falselist backpatching
table.
For the second case, first convert the expression to
the bool type and compare it with the constant
true for equality; the rest is similar to
the first case. The specific implementations of the two
cases are as follows, with the first case taking greater
than or equal to as an example:
bool : expr GE expr
{
int arg1 = widen(table, slist, clist, $1.addr, $1.type, max[$1.type][$3.type]);
int arg2 = widen(table, slist, clist, $3.addr, $3.type, max[$1.type][$3.type]);
$$.t_list = new_goto_list(table->size + CODE_BEGIN);
$$.f_list = new_goto_list(table->size + CODE_BEGIN + 1);
gen(table, slist, clist, jge, arg1, arg2, -1);
gen(table, slist, clist ,jmp, -1, -1, -1);
}
| expr
{
int arg1 = widen(table, slist, clist, $1.addr, $1.type, BOOL);
int arg2 = new_constant(clist, (constant_val)true, BOOL)->addr;
$$.t_list = new_goto_list(table->size + CODE_BEGIN);
$$.f_list = new_goto_list(table->size + CODE_BEGIN + 1);
gen(table, slist, clist, jeq, arg1, arg2, -1);
gen(table, slist, clist ,jmp, -1, -1, -1);
}
6.4 Branch Statements
In addition to boolean expressions, it is also
necessary to maintain nextlist for general
expression statements, i.e., the list of jump
instructions to be backfilled with target addresses at
the end of the statement. To this end, the
above-mentioned marker M needs to be introduced, and
a new marker N is
added:
\begin{array}{ll}
stmt\to\textbf{if}\;(\;bool\;)\;M\;stmt_1 &
\begin{array}{l}\{\;backpatch(bool.truelist,\,M.addr);\\\;\;stmt.nextlist=merge(bool.falstlist,\,stmt_1.nextlist);\;\}\end{array}\\\\
stmt\to\textbf{if}\;(\;bool\;)\;M_1\;stmt_1\;N\;\textbf{else}\;M_2\;stmt_2
&
\begin{array}{l}\{\;backpatch(bool.truelist,\,M_1.addr);\\\;\;backpatch(bool.falselist,\,M_2.addr);\\\;\;stmt.nextlist=merge(stmt_1.nextlist,\,merge(N.nextlist,\,stmt_2.nextlist));\;\}\end{array}\\\\
M\to\varepsilon & \{\;M.addr=nextinstr;\;\}\\\\
N\to\varepsilon &
\begin{array}{l}\{\;N.nextlist=makelist(nextinstr);\\\;\;gen('\texttt{goto
\_}');\;\}\end{array}
\end{array}
Taking if-else as an example. Backfill the address
where the condition of the boolean expression is true to
the starting address of stmt_1, and backfill the
address where the condition of the boolean expression is
false to the starting address of stmt_2. At the same time,
generate a jump instruction through the marker N, so that after stmt_1 is executed, it can
jump to the end of the if-else statement instead of
entering the next else statement. Finally, merge the
backpatching tables of the addresses of the next
instructions of stmt_1,
N, and stmt_2 to form the
backpatching table of the address of the next
instruction of stmt.
The specific implementation is as follows:
N :
{
$$.n_list = new_goto_list(table->size + CODE_BEGIN);
gen(table, slist, clist, jmp, -1, -1, -1);
}
stmt : IF '(' bool ')' M stmt
{
backpatch(table, $3.t_list, $5.addr);
$$.n_list = merge_goto_list($3.f_list, $6.n_list);
}
| IF '(' bool ')' M stmt N ELSE M stmt
{
backpatch(table, $3.t_list, $5.addr);
backpatch(table, $3.f_list, $9.addr);
$$.n_list = merge_goto_list(merge_goto_list($6.n_list, $7.n_list), $10.n_list);
}
It should be noted here that the production of N needs to be placed before
stmt. Due to the
original ambiguity problem of the if-else statement
(shift-reduce conflict between shifting else and
reducing a single if statement), yacc will default to
shifting, thus correctly resolving the ambiguity of
else. However, after introducing N, it becomes a reduce-reduce
conflict between reducing N and reducing a single if
statement. When handling such conflicts, yacc determines
the priority according to the order of productions, so
the production of N
needs to be placed first to ensure that N is reduced first and the
next else statement is shifted.
6.5 Loop Statements
In addition to conditional judgments that are
basically similar to branch statements, to implement
break in loops, it is necessary to additionally define a
global stack break_lists to store the
addresses of break statements in loops at all levels at
the beginning of the loop. Therefore, a marker Q needs to be introduced
before each loop, and the role of this marker is to push
a new breaklist onto
the stack. Once a break is encountered, the top of the
stack is merged with the address of the newly generated
goto instruction. At the end of the loop, the top of the
stack is popped and added to the nextlist of stmt.
\begin{array}{ll}
stmt\to\textbf{while}\;Q\;M_1\;(\;bool\;)\;M_2\;stmt_1
&
\begin{array}{l}\{\;backpatch(stmt_1.nextlist,\,M_1.addr);\\\;\;backpatch(bool.truelist,\,M_2.addr);\\\;\;stmt.nextlist=merge(bool.falselist,\,breaklists.pop());\\\;\;gen('\texttt{goto}'\;M_1.addr);\;\}\end{array}\\\\
stmt\to\textbf{do}\;Q\;M_1\;stmt_1\;M_2\;\textbf{while}\;(\;bool\;)\;;
&
\begin{array}{l}\{\;backpatch(bool.truelist,\,M_1.addr);\\\;\;backpatch(stmt_1.nextlist,\,M_2.addr);\\\;\;stmt.nextlist=merge(bool.falstlist,\,breaklists.pop());\;\}\end{array}\\\\
stmt\to\textbf{break}; &
\begin{array}{l}\{\;stmt.nextlist=\textbf{null};\\\;\;breaklists.top=merge(breaklists.top,\,makelist(nextinstr));\\\;\;gen('\texttt{goto
\_}');\;\}\end{array}\\\\
Q\to\varepsilon &
\{\;breaklists.push(\textbf{null});\;\}
\end{array}
At the same time, for while loops, an additional goto
instruction needs to be generated at the end of the loop
to jump to the beginning of the loop. However, do-while
does not need this; it is directly determined according
to the truelist of the
boolean expression. The specific implementation is as
follows:
Q :
{ break_lists[tos++] = NULL; }
stmt : WHILE Q M '(' bool ')' M stmt
{
backpatch(table, $8.n_list, $3.addr);
backpatch(table, $5.t_list, $7.addr);
$$.n_list = merge_goto_list($5.f_list, break_lists[--tos]);
gen(table, slist, clist, jmp, -1, -1, $3.addr);
}
| DO Q M stmt WHILE M '(' bool ')' ';'
{
backpatch(table, $8.t_list, $3.addr);
backpatch(table, $4.n_list, $6.addr);
$$.n_list = merge_goto_list($8.f_list, break_lists[--tos]);
}
| BREAK ';'
{
$$.n_list = NULL;
if (tos == 0)
yyerror("\"break\" statement doesn't match any loop");
else
{
break_lists[tos - 1] = merge_goto_list(break_lists[tos - 1], new_goto_list(table->size + CODE_BEGIN));
gen(table, slist, clist, jmp, -1, -1, -1);
}
}
6.6 Backpatching
nextlist
The nextlist
generated by the above control flow statements
temporarily does not determine the address of the next
statement of the statement, so it is recorded and left
for backfilling later. Therefore, at the end of each
statement, the starting address of the next statement is
recorded through M, and
the backpatching operation is performed:
stmts\to
stmts\;stmt_1\;M\qquad\{\;backpatch(stmt_1.nextlist,\,M.addr);\;\}
The specific implementation is as follows:
stmts : stmts stmt M
{ backpatch(table, $2.n_list, $3.addr); }
At the same time, for other stmt productions, such as
stmt\to\textbf{id} and
stmt\to block, their
nexlist should be set
to null to prevent errors caused by uninitialized
pointers during merge_goto_list.
7 Running Results
7.1 Compilation
Compile in the terminal using the following commands,
where yacc uses the -v command
to output state transition graph information to
y.output:
lex lexer.l
yacc -v -d grammar.y
cc types.c y.tab.c -ly -ll
During yacc compilation, it prompts 5
shift-reduce conflicts and 22 reduce-reduce conflicts,
and the specific conditions of the conflicts can be
viewed in y.output. The 5 shift-reduce
conflicts are conflicts between reducing bool\to expr and shifting
AND, OR, and shifting should be prioritized, so the
default action is taken. The 21 reduce-reduce conflicts
are caused by some of the same productions between bool and expr. Since the production of
bool is placed before
that of expr, bool is prioritized for
reduction, i.e., jump code is generated first instead of
evaluating boolean expressions. There is another
reduce-reduce conflict caused by the introduction of
marker N due to the
ambiguity of if-else mentioned above, and the solution
is to place the production N in front to prioritize
reducing N and shifting
else.
If the following test sample is saved as
test.c:
{
int i;
int sum;
i = 2;
while (i <= 100)
{
sum = sum + i;
i = i + 2;
}
}
Run the following command to get the output
result:
./a.out < test.c

The program prints the symbol table, global constant
table, quadruple table in hierarchical order, and
finally prints the intermediate code corresponding to
the quadruples.
7.2 Symbol
Table and Constant Table Test
For the following program:
{
int a; float b; char c;
a = 137; b = 3.14159;
if (a == 137)
{
bool d; int e;
d = true; e = 137;
}
else
{
char d; float e;
d = '*'; e = 3.14159;
}
c = '*';
}
The program outputs the following symbol table and
constant table. It can be seen that the three blocks are
divided into 3 symbol tables, and duplicate symbol names
in different blocks have no impact. The constant table
also does not insert the same constant repeatedly:

7.3 Expression
Translation Test
Use the following test code involving operator
precedence, boolean expression evaluation, and type
conversion:
{
int a;
float b;
bool result;
a = 1 * 1;
b = a / 2 + 1;
result = b != 1.5 && a == 1;
}
The translation result is as follows:

It can be seen that for integer division such as
a / 2, the translation is correct, and the
result is still an integer until it is assigned to b,
which is then converted to a floating-point number.
7.4 Control Flow
Translation Test
Use the following test code involving translation of
if, if-else, while, do-while, break, and reflection of
short-circuit code:
{
int i; int j; float sum;
i = 0; j = 0;
while(true)
{
i = i + 1;
do
{
sum = sum + 1.0 / i;
if (sum > 3) break;
else j = j + 1;
} while (j < 100);
if (i >= 100 || sum > 3) break;
}
}
The result is as follows:

It can be seen that these goto
statements are all correctly backfilled.
7.5 Error Recovery
Test
Test with the following error-prone program: first,
duplicate definition of a in the third
line, missing semicolon in the fourth line, reference to
undefined variable x in the sixth line, and
break not inside any loop statement:
{
int a; int b;
float a;
a = 1
b = 0;
if (x < 1)
{
a = a - 1;
break;
}
}
The program output is as follows, showing all
errors:

Download source code
here