CS 451/LLC Lab 5: Monadic Parsing

We have been reading and working with Graham Hutton and Erik Meijer's monadic parser technique and library in lecture: now it's time to put what you've learned to work on something a little more substantial. For this second problem of the lab, you will write a parser for a subset of the Java language. In order to keep the work reasonable, we will use a very small subset, including just a smattering of statement and expression forms. The key to maximizing the value of the assignment is to explore and use the parsing library to make the job especially straightforward: look to examples from lecture, from the paper linked on the course homepage and to the Powerpoint and video tutorials for examples.

We will use the following grammar as our basis for parsing Java statements and expressions:

statement ::= variable = expression ;
            | return expression ;
            | if ( expression ) statement
            | if ( expression ) statement else statement
            | { statement ... statement }

expression ::= numeral
             | variable
             | expression operator expression
             | variable ( expression , ... , expression )

operator ::=  *  |  /  |  +  |  -  |  <  |  >  |  ==  |  &&  |  ||

numeral ::= digit +

variable ::= loweralpha alphanumeric *
Note that the grammatical meta-variables here are statement, expression, variable, numeral, operator, digit, loweralpha (lowercase alphabetic letters) and alphanumeric (alphabetic letters or numeric digits). Terminal "keywords" and punctuation used in the grammar include return, if, else, parentheses, commas, braces, semi-colons, the assignment operator (=) and the various operators listed. The grammatical "meta-punctuation" includes the vertical bar for or (|), the BNF production symbol (::=), ellipsis (...) and the extended BNF post-fix repetition operators (* and +), which signify "zero or more" and "one or more" repetitions, respectively. The ellipsis is used when it is more clear than a specification using repetition operators would be: statements carry their own semi-colons where needed and expressions are overtly separated by commas (the former will sometimes lead to a semi-colon before a final brace, and some statements won't terminate in a semi-colon).

(In particular, note that no square or curly braces are used as meta-punctuation here: they are used as terminal characters in the language.)

NB: much of the intent in the expression grammar is hidden here by using a relatively abstract, ambiguous specification. In particular, there is no specification of association or precedence: we will assume that all operators are left-associating and that they have the usual Java/other languages/mathematical precedences, namely: multiplication and division of equal precedence higher than addition and subtraction, in turn of equal precedence but higher than the relational operators (less than, greater than and equal to), in turn of higher precedence than logical disjunction (&&), and finally with logical conjunction (||) lowest precedence of all.

(Are there any similar conflicts which must be resolved among statements? That is, are there any lingering ambiguities left unresolved by the grammar? (Yes!) What are they? How should they be resolved? Look to your knowledge of Java (if any) or to references (if none) to see if you can see what is usually done here: can you get your parser to do the accepted thing?)

In order to be very clear about the abstract structures you should be parsing, I give you the following Haskell algebraic datatype definitions:

data JStmt = Set Var JExpr | Ret JExpr | Ife JExpr JStmt JStmt | If JExpr JStmt | JBlock [JStmt]  
             deriving Show

data JExpr = Lit Integer | Occ Var | App Op JExpr JExpr | Call Var [JExpr]  deriving Show

data Op = Mul | Div | Add | Sub | Lth | Gth | Eq | Or | And  deriving Show

type Var = String
Note that there are separate datatypes for statements (JStmt), expressions (JExpr) and operators (Op); the last is merely a flat "alphabet" of (as yet) uninterpreted symbols. The statement and expression types, on the other hand, are recursive (not quite mutually so, though). Note also that there are no parentheses or other such ambiguity-resolving devices reflected in the "abstract grammar" which constitutes the dataype definitions: by the time we reach this level of representation, the embedding hierarchy can be reflected directly, without need for any ancillary techniques. But your parser should produce terms whose structure nevertheless reflects the intended association and precedence rules (for example) specified above.

Another note about the datatypes: the App constructor for expressions takes an Op argument as its first argument, then constructs an expression, as a kind of parameterized binary operator, from two sub-expressions. So, when you are parsing an operator, you might want to return a partially constructed value like (App Add), of type Jexpr -> Jexpr -> Jexpr which can then be applied to two sub-expression arguments.

In order to complete your parser, you will need to have read and understood (at leat most of) the paper by Graham Hutton and Erik Meijer listed on the course homepage, and you will need the following (slightly modified) version of the parsing module used in Hutton's textbook:

(Note that this is a "literate Haskell" file, with a .lhs suffix; it should load normally in Hugs or GHC, but the sense of code and comments is inverted in the file. Also note that this is not the (non-world-readable) file Parsing.hs that some of the rest of the code in the course code directory uses.

Output: here is some sample output.

Lab3> try expr "x>2 && x < y"

App And (App Gth (Occ "x") (Lit 2)) (App Lth (Occ "x") (Occ "y"))

Lab3> try stmt " { x = 7; return x + 2; } "

JBlock [SET "x" (Lit 7),RET (App Add (Occ "x") (Lit 2))]

Lab3> try stmt " { x = y + 2 * 7; if ( x>2 && x < y) return x; else return x / 2; } "

JBlock [ SET "x" (App Add (Occ "y") (App Mul (Lit 2) (Lit 7))),
         IFE (App And (App Gth (Occ "x") (Lit 2)) (App Lth (Occ "x") (Occ "y")))
             (RET (Occ "x"))
             (RET (App Div (Occ "x") (Lit 2)))
Note that you won't necessarily get the nice line-breaks in the last example (... although ... there is an app a Haskell library for that.)