exam will be three hours, closed book/notes and cumulative (but stressing second half of course)
typical format: some T/F, some short answer, a few longer "conceptual" questions
Topics from the first half of the course (architecture); see previous review sheet!
Numeral systems: binary, hexadecimal, etc.
Data Representation
Boolean Logic
Gates and Circuits
Computer Organization
Instruction Set Architecture and Assembly Programming
Overview of language translation
translation takes place in stages, with distinct data structures associated with each stage (see chart from lecture)
several appropriate theories associated with the various stages can help manage complexity and suggest the "right" way to do things (but we concentrate on practical issues and a small, simple language)
separation into phases makes the process much easier to understand, but they can be combined in practice (making only one or two passes over the code, for example)
the meaning of a program can either be implemented dynamically, as we process the code (interpretation or evaluation) or statically, by way of translation to another form (compilation)
Regular languages and tokenization
lexical analysis or scanning: in this phase, a character string is broken into "chunks" called tokens
each token represents a separate, atomic component of syntax or semantics (the syntactic ones, such as parentheses and other delimiters, will ultimately be removed during parsing)
tokens matter during parsing mainly for their classification (e.g., literal or variable), but during code generation also for their content (e.g., 147 or x)
some tokens, such as variables or literals, might be entered into a symbol table during this phase of processing or the next
theoretically, we define a class of regular languages, capturing sequential repetition, as a means for describing token structure
regular languages can be defined in terms of finite automata (better for machine implementation) or in terms of regular expressions (better for human reading and writing)
scanner generators essentially translate regular expressions into an automaton (but with "hooks" to insert code to track symbols, etc.)
Context-free grammars and parsing techniques
parsing: this phase involves the recognition of hierarchical phrase structure in the language (phrases and sub-phrases, e.g., statements, expressions, etc.)
we describe the hierarchical structure of possible forms using a context-free grammar, which uses variables (or non-terminals) to express structure and terminals (actual pieces of language text) for final content
each context-free grammar describes a language, or set of strings, based on possible expansions starting from its start symbol and ending in a string of terminal symbols
CFGs may be abstract, intended only to describe the possible structure, or more concrete, in order to facilitate different parsing techniques
in typical usage, we use two grammars which are equivalent, in the sense that they correspond to the same set of strings, but one of which is more abstract
Parsing techniques:
top-down (start with main non-terminal) versus bottom-up (start with "leaves")
recursive descent: top-down, one "function" per non-terminal, branch on token seen, consume parts consecutively
LL parser: top-down, expands non-terminals from left based on lookahead, using a table to determine next expansion, stack to hold expansion against input matching
LR parser: bottom-up, reduces sequences of terms and non-terms on stack, replacing with non-terminals, also shifts input onto stack (table controls shift versus reduce)
shunting yard: two stacks, one for operands and one for operators, plus a boolean to track whether seeking operand or operator next
(shunting yard can transform infix to postfix using just one stack, but can also build trees, or even values if statically determinable)
parser generator programs take a grammar (usually modified) as input and provide a parsing program as output
pros: easier than hand-generated parser, more likely to correspond exactly to the grammar
cons: high learning curve, grammar often has to be "massaged" to fit the technique used (reducing confidence in correctness)
Term representation and interpretation (evaluation)
terms (syntactic phrases) are naturally represented as trees with each node identifying a specific form of expression, and its children representing its immediate sub-phrases
in OO languages, it is natural to use a class/sub-class hierarchy to help organize phrases by types
interpretation proceeds as a pass over the tree, in some order determined by the language semantics, with appropriate actions being taken during the traversal (i.e., dynamically)
be careful to distinguish the syntactic issues of precedence and association with the semantic issues of order of evaluation
a parse tree is one which represents a parse exactly from a grammar: it usually contains a lot of redundant information based on grammar structure
an abstract syntax tree represents only the phrases and sub-phrase structure of interest, without the "residue" of grammar artifacts
Abstract machines and intermediate representations
in order to make code generation easier, we often choose an intermediate representation corresponding to some abstract machine which has features set in between our language and the actual target machine
typical abstract machine features include stacks for evaluating expressions or method calls, environments for variable look-up, or 3-address codes for representing unitary arithmetic and logic operations
an abstract machine can either be implemented with an interpreter itself (as in Java's JVM) or can be used as the basis for further processing
abstract machines and intermediate representations allow for a more flexible back-end to the compiler, since it is easier to re-target their implementation for different actual machines
the main distinguishing feature of an abstract machine used for these purposes is that its code is likely to have a more linear (non-hierarchical) form
Code generation and optimization
code generation follows the same plan as interpretation (an ordered tree traversal), but now generating pieces of code rather than actually performing semantic actions dynamically
when is it compilation (versus interpretation)?:
eliminate tree-like phrase structures for linear object code (with jumps);
typical problems involve keeping track of run-time resources (e.g., registers and RAM locations), mapping names to their numeric equivalents and determining proper sequencing of events
we usually depend on support from a run-time system for such services as dynamic memory allocation, communication with I/O devices, etc.
many kinds of optimizations can be used to increase performance: algebraic rewriting (distributivity, etc.), compile-time evaluation (when arguments are known), choice of faster storage options (registers vs. RAM), etc.
Java compilation
high-level Java language code is compiled to a byte-code language (for the Java Virtual Machine, or JVM)
the JVM can be implemented easily on any number of architectures as an interpreter for the byte-codes
(this approach allows for one compiler, easy cross-platform distribution of compiled code, and quick re-targeting of new architectures)
JVM is a stack-based machine with quite high-level architecture (built-in threads, garbage collection, method call/return, etc.)
JVM has separate instructions for a number of different data types (int, float, double, etc.)