CS 348 Lab #3: Parsing a Simple Expression Language

Lab #3: Parsing a Simple Expression Language
Assigned: Thu 31 Jan 2002
Due: Thu 14 Feb 2002

Contents

The problem description
The expression grammar
Precedence and associativity
Informal semantics
Some sample expressions
Parsing strategy

Newsflash:
Here are some sample expressions for Lab 3. All of these should work OK (i.e., they are valid). I'll post another file of invalid ones later. (Remember that you need not give good error messages, but you should reject all invalid inputs and properly echo (parenthesized) valid inputs.)
And here are some examples of bad inputs, i.e., expressions which are not valid.

The problem description
For this lab, you will write a parser that reads in an infix expression, including variables and assignment operators, and produces an internal abstract syntax tree (AST) which you will use to re-write the expression (i.e., to "pretty-print" it). Re-printing the tree is intended mostly as a way of ensuring that you are parsing and constructing it correctly; in subsequent labs you will use this parser to compile the same expression language into machine code for the PC-231 architecture. In working on this lab, you should take into account the fact that you will be re-using the code (especially the AST representation) in that later lab: plan ahead so that your work will require minimal changes later on.
The specific expression language you will work with is described below in more detail using a context-free grammar written in extended BNF syntax (I will discuss BNF in lab and lecture).
In addition to planning out the structure and implementation of trees you will use to represent expressions, your major problem is to decide how to parse the input. Several natural choices come to mind:

use a hand-coded parser based on techniques you may a;ready know (e.g., recursive descent parsing): this may require you to "massage" the grammar a bit (so be sure you really have the right language when you're done!);

use a parser generator such as YACC or javacc: again, a certain amount of modification to the grammar might be required;

use an informal, ad-hoc technique suitable for very simple grammars (and therefore easy to verify as correct): I have provided a summary description of such a technique in a separate section below.

Although this lab is not concerned with semantic issues, you will probably be able to figure out roughly what each of the language constructs is intended to mean: keeping these intuitions in mind will help you plan for static semantics (e.g., type-checking) and code generation phases later on. Note that there are two types of expressions (and of values) in the language, namely integer and boolean, although this distinction is not enforced at the level of the grammar given here.

The expression grammar
We will use a rich set of expressions as our input language in order to better approximate the kinds of language constructs which are used in more realistic programming languages. The grammar given below is fairly abstract, in that it does not specify the precedence or associativity of the operations involved. Precedence rules are given following the grammar (this allows us to focus more clearly on expression structure, which you need to mirror in the structure of your trees). The lexical structure of variables, literals and operators is also given in an extended BNF style: if you use a parser generator or other such tool, you may want to split off lexical issues into a separate part suitable as input to a scanner generator. <
<expr>  ::=  <int-lit>
         |   <bool-lit>
         |   <var>
         |   READ
         |   <var> = <expr>
         |   <var> <post-op>
         |   <expr> <bin-op> <expr>
         |   IF <expr> THEN <expr> ELSE <expr> END
         |   ( <expr> )

<bool-lit> ::=  T | F					{can also result from a boolean READ}

<int-lit>  ::=  <digit>+				{i.e., non-empty sequences of digits}
<digit>    ::=  0 | 1 | ... | 9

<var>      ::=  <alpha>+				{note: lowercase only, and no digits}
<alpha>    ::=  a | b | ... | z

<bin-op>   ::=  <arith-op> | <rel-op> | <bool-op>
<arith-op> ::=  +  | - | *				{no division}
<rel-op>   ::=  == | < | >				{defined for numbers AND booleans!}
<bool-op>  ::=  ^  | |					{note dual use of "|"}

<post-op>  ::=  ++ | -- | ~~
A few notes on the syntax:

Variable names are lowercase only and may not include digits; all "keywords" are uppercase only.

The boolean literals stand for "true" (T) and "false" (F).

Numerals are written in decimal; there is no provision for negative numerals (although they can be easily simulated with expressions).

The logical operators are for conjunction (or "and", written "^") and disjunction (or "or", written "|"); there's an unfortunate clash between the object language use of the ASCII "pipe" symbol (|) and it's meta-language use as part of BNF syntax.

Precedence and associativity
Assignment has the highest precedence, so that it is always performed first (i.e., "x=3+2" is interpreted as "(x=3)+2" rather than "x=(3+2)". Postfix operators only apply to variables, so they are clearly performed immediately on the variable given (i.e., there is no potential conflict since expressions such as "3*x++" could never be interpreted as "(3*x)++" anyway). Multiplication takes precedence over addition and subtraction; the latter two are of the same precedence. Similarly, conjunction (^) takes precedence over disjunction (|), but all the relational operators (==, < and >) have the same precedence. Operators of the same precedence associate to the right, so that "2+3-4" is the same as "2+(3-4)". Among the different "types" of operators, arithmetic ones have higher precedence than relational ones, which have higher precedence than boolean ones. For example, the expression:
2+3*5 > x ^ y<7 | x-3 == 2
would be interpreted as:
(((2+(3*5)) > x) ^ (y<7)) | ((x-3) == 2)
Informal semantics
As stated above, we will not be interpreting the syntax of expressions in this assignment, but it will nevertheless be useful to have some ideas about meaning in mind as you work to help prepare for later labs. To that end, here is a quick description of the intended meanings of each form of expression:

integer literals (sequences of digits only) mean exactly what you would expect, i.e., they denote non-negative numbers.

boolean literals T and F just stand for the boolean values true and false.

variables are as usual readable, writable, named locations holding (in this case) integer or boolean values.

the READ token is a simple means of allowing our expressions to involve user input: the value of a READ comes from what the user types in. Both integer and boolean values might be read using a READ expression (here is one place where type-checking comes in).

assignment expressions are similar to those used in C, C++ and Java. Their constituents are a variable and an expression: the expression should be evaluated and the value stored in the variable.

the postfix operators are just the increment (++) and decrement (--) operators familiar from C, C++ and Java, along with a boolean negation operator (~~) just for fun. Note that these operators can only be applied to variables, not to arbitrary expressions.

the binary operators are just the usual multiplication, addition and subtraction operators (for integer arguments), the relational operators for less than (<), greater than (>) and equality comparison (==), and the logical operators for conjunction and disjunction (^ and |).

the IF-THEN-ELSE construct allows for conditionals at the expression level (i.e., it is not a statement). The value of the (necessarily boolean) first expression determines which of the other expressions will be used for the value of the whole.

Some sample expressions
Here is a set of sample expressions you might want to test your parser on, along with their fully parenthesized forms.

2+3*5
produces: (2+(3*5)).

2+3*7-1
produces: (2+((3*7)-1)).

2*x*x + 3*x=7 + 2
produces: ((2*(x*x))+((3*(x=7))+2))

READ + READ * 2
produces: (READ + (READ * 2)).

x*x+++10-x=3
produces: ((x*(x++))+(10-(x=3))).

Basic plan of action
As described above, your job is to read in an expression (as a String), then parse it into a tree form. You should put some work into designing the tree class(es) you will use to represent the expressions (idea: use a general Node class to capture common structural properties, then use inheritance to make sub-classes corresponding to more specific kinds of nodes).
Error-checking is an important part of the assignment: you should be able to reject any expressions which do not fit the syntax given in this lab write-up. The minimum requirement is to just reject the ill-formed expressions: of course, it's better if you can describe to the user just what has gone wrong, but (as you'll find out), giving good feedback can be quite difficult with syntax errors.
The last step in the process is to print the expression out again: this should be pretty easy if you get the tree structure write (in fact, you could have the Node class support the toString method so that Nodes can just be mentioned in output statements and then automatically printed). Note however that you will want at least 2 variations on the printed form: one which exploits the precedence and associativity rules to minimize the use of parentheses and one which uses full parenthesization to make the structure completely explicit. (Note that these are the two forms used in the examples above.)
It may be convenient to convert the input string into tokens (or "chunks") before trying to apply a parsing strategy (note: this is a different use of the term "token" than Java uses in "StringTokenizer", although the latter may be helpful). You can either cull out tokens as a separate phase or just handle them dynamically as part of the parsers input routines. See the section Parsing strategy below for details on token handling.
You will probably want to work through several example expressions by hand, using the parsing strategy of your choice, in order to make sure you understand the structures and process involved.

Parsing strategy
As described above, there are several approaches you might take to parsing the input, including recursive descent, a parser generator or more ad-hoc methods. The following section describes one ad-hoc method which will work for the simple grammar here (it is based on operator-precedence parsing, but presented here more informally).
Even if you use one of the other strategies, you may want to read through this section in order to get a feel for some of the parsing issues this grammar presents.
The first stage in parsing, conceptually at least, is to break the input string up into tokens. Our expression language has been designed to make this phase very simple: in most cases, characters in the input can be categorized and handled in a general way (e.g., alphabetic characters and digits can be consumed one after the other and combined to produce a variable or literal). White space can be essentially ignored, since variable names and integer literals can never occur immediately adjacent to another instance of the same token class. The "+", "-" and "=" characters should signal a one-character "look ahead", since two adjacent occurrences should be taken as a postfix increment (++) or decrement (--) operator, or a relational operator (==), respectively. (A similar approach will handle the various ambiguous prefixes of boolean literals and keywords). Note that assignment can be treated as a binary operator for tokenization purposes.
To continue with larger syntactic structures, infix expressions can be parsed from a series of tokens using two stacks, and the basic algorithm can be modified to accommodate postfix operators and assignment as they appear in our language. One stack is used to hold parse trees under construction (we'll call this the argument stack), the other to hold operators (and left parentheses, for matching purposes; we'll call this the operator stack).
As we read in each new token (from left to right), we either push the token (or a related tree) onto one of the stacks, or we reduce the stacks by combining an operator with some arguments. Along the way, it will be helpful to maintain a search state which tells us whether we should see an argument or operator next (the search state is a refinement which helps us to reject certain ill-formed expressions: if you don't see how to use it at first, leave it out until later).
The basic rules for handling individual tokens can be summarized as follows:

upon seeing a variable or literal, push it onto the tree stack (possible after wrapping it up as a leaf node). The search state can be changed from "argument" to "operator".

upon seeing a postfix operator, check that there is a variable leaf node on the argument stack, then pop it off and push a postfix operation node (built from the variable and the postfix operator) back onto the argument stack. The search state is still for an operator.

upon seeing a binary operator, we must check the status of the operator stack: if it is empty, we simply push the new operator. If it has some operator on top, we compare the precedence of the two operators and push the new one if it has higher precedence (or equal precedence: this will force right associativity). Otherwise we can reduce the two stacks before continuing with the same input operator. In any case, pushing an operator means we should now search for an argument, whereas if we continue with the same operator the search state should clearly stay as "operator".

upon seeing a left parenthesis, we simply push it on the operator stack; the search should have been for an argument and should remain so.

upon seeing a right parenthesis, we have two cases: if there is a left parenthesis on the operator stack, we can "cancel" the pair and continue searching for an operator. If the operator stack contains some other operator on top, we should reduce the stacks and continue with the right parenthesis as input again.

upon seeing the end of the input, we should have one of the following situations: if there is only one tree on the argument stack and the operator stack is empty, we should finish and return the single tree as the result. If there are more trees and/or operators, we should attempt to reduce the stacks until they reach the first form.

Exercise: what is the appropriate way to treat the precedence of the left parenthesis as it sits on the operator stack? Try a short example to see what it should be.
Handling the if-then-else construct: we can extend the scheme described above to handle the "mix-fix" if-then-else construct by pushing an "IF" marker on the operator stack, then ultimately matching it against a corresponding "END" keyword. There are a number of ways to handle the intervening keywords and expressions (e.g., pushing keyword markers on the operator stack); just be sure that you only allow well-formed expressions.
The rules for reduction are a bit simpler: if we see a binary operator on top of the operator stack, we should have two trees on top of the argument stack, both representing expressions. We pop the operator and two trees off, combining them into a single tree node, which is then pushed back on the argument stack. Note that the trees on the argument stack represent the right and left arguments, respectively: this is the opposite of what you might expect. There is a special case when the operator is an assignment: we should then check that the second tree on the argument stack is just a variable leaf node before building and pushing the combined tree.

Any other combinations (e.g., no operator on top of the operator stack, or not enough trees on the argument stack, or no variable to go with an assignment operator) should be considered an error.