CS 348 Lab #7: Compiling Simple Expressions

Lab #7: Compiling Simple Expressions
Assigned: Thu 18 Nov 2002
Due: ?? May 2002

Contents

The problem description
Informal semantics
Some sample expressions
Code generation

The problem description
You should write a compiler that reads in an infix expression in the simple language we have been working on for the last couple of labs. Your compiler should produces PC-231 assembly code (or any other assembly or machine code of your choice) that will evaluate the expression when run. If you use PC-231 assembly language, you may use my assembler (asm231) to convert the assembly code into machine language and my simulator (sim231) to evaluate the resulting machine code (these are command-line calls on Gemini and Hudson).
If you use any other assembly or machine language, you should have a simulator (or chip!) available from the Collins 411 lab on which to run demos.
You should already have a parser and static analyzer (type and variable-initialization checker) written from the last two assignments. These should produce a type-decorated tree (for valid inputs) which you can traverse in order to produce assembly code (this is the "back end" of the compiler). I give detailed hints on how to approach the problem below.
It may be difficult to complete the compiler for the whole language all at once. I recommend taking a progressive approach, so that you implement only a few language features at a time. Adding fancier features in "layers" will allow you to keep a more-or-less complete working program at various checkpoints along the way, so that you have something to turn in should a deadline arrive.
See this section of Lab 2 for the expression grammar and this section of Lab #5 for typing rules.

More informal semantics
We took a brief look at informal semantics for each of the different expression forms in Lab #2. Here is a more detailed description of each, along with some remarks relevant to code generation (all remarks are made under the assumption that you'll use the PC-231 for your compilation target):

integer literals (sequences of digits only) mean exactly what you would expect, i.e., they are just non-negative, decimal numerals. Recall that the PC-231 has only 11 bits plus a sign bit to work with, and thus can only handle integers whose absolute value is smaller than 2048 (i.e., -2047 to +2047). You might want to generate an error upon seeing a numerals whose value is larger than 2047 (again, you need only handle non-negative literals). Note that literal values will need to be held in a RAM location or (much better!) in some register. Smaller values can be put in a register with a SET command, but larger values will need a DATA command and then a COPY (usually).
Examples: "0", "1", "10", "235", "2047".

Boolean literals are just "T" or "F" and have as their meanings the usual truth-values (true and false). You may want to represent these values as right-justified, single-bit values 1 and 0, as this will make it easier to implement logical operations on them.
Examples: "T", "F",.

variables are essentially storage locations on the PC-231 (either in RAM or in registers) interpreted as holding integer values. This means that they can be set, using assignment, or read, just by mentioning their names as part of an expression. Variable names are lowercase only and may not include digits. You may want to enhance your symbol table from Lab 2 to keep track of where variables are being stored (unless you move them around, e.g. from registers to RAM, in order to free up temporary registers).
Examples: "x", "y", "abc", "xyz".

the READ token is a simple means of allowing our expressions to involve user input. When the READ token is evaluated, the effect should be to read a decimal integer or boolean value in from the user, using the decimal device ("DD") or ASCII device ("AD") of the PC-231. The decimal reader provided with the sim231 simulator takes care of all the details of number conversion (including negative numbers), so you need only generate machine code which performs the read itself. For Boolean values, you'll need to generate code which does appropriate checks and perhaps conversion (e.g., into a single 0 or 1 bits).

assignment expressions are similar to those used in C, C++ and Java. Their constituents are a variable and an expression: the expression should be evaluated and the value stored in the variable. Note that the value of the assignment expression is the value of its constituent expression, i.e., the same value which is stored in the variable.
Examples: "x=3", "y=y++", "z=READ+x", "2*(n=READ+1)".

the postfix operators include just the increment (++) and decrement (--) operators familiar from C, C++ and Java and a similar negation operator (~~). Note that these operators can only be applied to variables, not to arbitrary expressions. Just as in C and Java, when these operators are written postfix, the meaning is that the variable's value is modified after it's value is returned as the result of the expression. In other words, in order to evaluate the expression, we first stash the current value of the variable for the result, then operate on the value of the variable, and then return the stashed value as the result of the expression as a whole.
Examples: "x++", "x=y++", "x+++y++", "p~~".

the binary operators include the usual multiplication, addition and subtraction operators, plus relational and logical operators for producing or operating on Boolean values. Evaluation is done from right to left, i.e., the right argument is evaluated first, then the left. This is important because of the interaction of variables with assignment and the postfix operators: consider the expression "x+x+x=6". According to the precedence and associativity, rules this should clearly be parsed the same as "x+(x+(x=6))", but this does not address the issue of how the expression should be evaluated. Assume for example that the value of x was 4 prior to evaluating this expression: then we might evaluate left-to-right and return results equivalent to "4+4+6" or right-to-left and return results equivalent to "6+6+6". Here we choose right-to-left order.
Logical operators can be implemented either directly with the underlying "AND" operation of the PC-231 or using arithmetic (or a combination of the two). Multiplication is not supported directly on the PC-231, so you'll need to implement this as a sub-routine call or in terms of SHIFT instructions.
Examples: "2+3", "2*(3+x)", "10-x*2+y=3".
the IF-THEN-ELSE construct .
Examples: "x++", "x=y++", "x+++y++", "p~~".

Some sample expressions
Here is a set of sample expressions you might want to test your compiler on, along with the results they should produce. Note that several of the expressions involve the READ construct, and therefore depend on the values that a user types in at run time.

(2+3)*2
produces: 10 (addition is done first, then multiplication).

2+3*7-1
produces: 22 (multiplication is done first, then subtraction, and finally the addition).

2*x*x + 3*x=7 + 2
produces: 121

READ + READ * 2
produces: 37 (assuming the first (rightmost) READ yields 15 and the second (leftmost) READ yields 7).

x*x+++10-x=3
produces: 19 (note that the expression parses as "(x*(x++))+(10-(x=3))"; the value of the rightmost x is 3, the value of the middle x is also 3, and the value of the final x is 4 because of the ++).

2*x+++10-x=3+x=READ
produces: 8 (assuming the value read is 5; note that the expression parses out as "(2*(x++))+(10-((x=3)+(x=READ)))"; the result then follows from right-to-left evaluation. x is left with the value 4).

Code generation
Code generation is largely a process of traversing the parse tree in a bottom-up, right-to-left order, producing code to implement each of the operators, literals or variable references encountered along the way.

You will also need some general strategies for handling registers and RAM locations. In order to generate code for an integer literal, you will need to put it into some register, and you will need to know which ones are free for this purpose. Assuming that register R3, for example, is free, you might generate code like this for the integer literal 5:
	DATA 5
	COPY DR,R3
You will also need to know which register the result of a given sub-tree is stored in, so that the node above it can access it properly. For example, if the integer literal 5 from above occurs as a sub-tree of an addition operator node, you will want to generate code such as the following:
	DATA 5		; from above
	COPY DR,R3	; from above
	
	ADD R2,R3	; assuming other argument was in R2
			; result of the "+" node is in R3
			; and now R2 is free again
As mentioned in above, you can use a free register list in order to keep track of register usage. This list can then be modified as you pass from node to node in the tree. In the example above, you would remove R3 from the list as you passed through the node for the integer literal 5, and then add R2 back into the free list after passing through the addition operator node.
It is possible that a straight-forward use of registers will leave you with no free registers at certain points in the process, especially for larger expressions. If this happens, make sure you are using the free list properly, but you need not go the length of "spilling" registers out into RAM locations. In other words, it is OK if your algorithm fails for some large expressions, as long as you are doing your best to free registers as you go.
Variables can be handled by setting aside a RAM location for them and then loading them into a register and proceeding as above (i.e., noting the register used in the node somehow). For example, an expression such as "x=(x+2)" might compile as follows:
			; code for int lit "2"
	DATA 2		; int literal 2
	COPY DR,R0	; R0 now in use
	
			; code for var "x"
	DATA #x		; label for var x storage
	LOAD R1,DR	; load x using DR indirectly
	
			; code for op "+"
	ADD R0,R1	; free up R0, result in R1
	
			; code for assignment to "x"
	DATA #x		; set up for store using DR
	STORE R1,DR	; store x, result in R1
	
	. . .		; etc., etc.
			
	HALT		; end of program
	
#x	CONST 0		; placeholder for x
#y	CONST 0		; placeholder for y
	. . .		; etc., etc.
Handling the multiplication operation will be a bit tougher than the rest: you will need to jump out of the main line of evaluation to a "subroutine" which implements multiplication for you, and then jump back when it is finished. I have a multiplication routine in this file which you can use to get started, but note that you will need to do some extra work to make sure that you have left registers for the multiplier to use, a register for the result, and a couple of jump registers for the "call" and "return".
The various other kinds of expressions are much easier to handle. You should probably try generating machine code for some simple expressions by hand until you are comfortable with the techniques before trying to code it up for the compiler. I will add some more detailed hints on code generation strategies for the various expression types, including multiplication, later in the week.
At the root of your tree, you should remember to generate a WRITE instruction which will print out the final value you have computed. Forgetting this step leads to a disappointing anti-climax in which you may have computed the right value, but can't tell, since it never gets printed!