CS 348 Lab #1: Language Processing and Statistics

Lab #1: Language Processing and Statistics
Assigned: Thu 17 Jan 2002
Due: Thu 24 Jan 2002

[From the class home page]
Thu 24 Jan: Sample input files for Lab #1 are available here on the web or (better) through the L drive in the classes/cs348/inputs directory (there are 9 files in all). You should access the files directly through the file system or make copies on your H drive, as browsers will mess up the formatting (especially tabs). For your demo, be prepared to show how your program handles these files as input.

Basic problem description
Your task is to read in a file containing a program (written in some language, not unlike Java) and to report back on the contents of this program file in terms of various statistics (as detailed below). Note that you need not analyze the file completely for correctness of syntax (we will not even fully specify the language under consideration); rather, you need only attend to certain basic patterns such as keywords and variable names, numbers, comments and strings. These entities correspond to what are generally known as tokens, in some sense the "words" of program texts. Typically, the first stage of a language processor will discover and serve up a stream of such tokens to be used by later stages which analyze the syntactic structure of the program: your code need not be so elaborate to complete this assignment, but you are well advised to use the opportunity to work in that direction, as you will benefit from reusing your code in later assignments.
The main goals of this assignment are:

to get you back into the swing of Java programming after a long vacation;
to get you into the habit of thinking of (and working with) programs as input files in their own right;
and to provide you with a basic program framework for reading and processing program files and for finding and distinguishing tokens.

Detailed definitions
A string literal is any sequence of characters between quotation marks. Some characters in a string (tabs, newlines, quotation marks and backslashes) may be "escaped" using a backslash, as in Java. For example, the string literal "abc\txyz\n\"hello\"\\n" contains 17 characters, since:

\t represents a tab
\n represents a newline
\" represents a quotation mark (contained inside the string)
\\ represents a backslash
and the final n is not escaped.
For the purposes of the current assignment, we'll consider only these forms of escape sequences (Java actually defines a few more).
[NB: if a backslash precedes any other character, consider the whole (i.e., the two characters just mentioned) to be one character, but always just a part of the string. In other words, unlike Java, we will not worry about invalid escape sequences (this will make things easier). Thanks and a hat tip to Joel for pointing out the ambiguity here.]
[NB: yet another ambiguity: what about non-terminated strings? In particular, what if we see a newline in the middle of a string? Again, Java handles this differently, but let's make things easier and just say that a string continues until the closing quotation mark (or the end of the file, I suppose). Specifically, then, string literals will be able to span several lines, and the newline characters inside them count as characters inside the string. Thanks and a hat tip to Toby for pointing out this ambiguity.]
A comment is either an end-of-line comment, which starts out with a pair of forward slashes (//) and continues to the next end-of-line character, or a bounded comment, which includes all characters, even newlines, between matched pairs of the form /* ... */. Note that an attempt to "nest" one bounded comment within another fails, since the first occurrence of */ will close the comment.
A word is a sequence of characters beginning with an alphabetic letter (either upper or lower case) and continuing for as many consecutive characters as are either letters or digits (0 through 9).
A numeral is a sequence of characters beginning with a digit and continuing for as many characters as are also digits (we won't bother with floating point literals, i.e., those written in scientific notation).
Note: under the Windows operating system, spurious "carriage return" characters may be added to a file (such as the input file) when it is saved. We may therefore need to liberalize our definition to include these characters as being part of the newline code.
A block is any sequence of characters contained between two bounding braces ({ ... }).
Your program will need to detect the presence of various token forms as described above and report on their properties. Note, however, that you will treat strings and comments differently from the "code" parts of the program: specifically, you will detect and count words, numerals and blocks, but only when these appear outside comments or strings. Similarly, strings and their delimiters are only significant when they appear outside the bodies of comments, and comments and their delimiters are only significant outside the bodies of strings.
For example, in this code:
	// Print a quote mark (")
	emit("\"");
		
	// Print some stars and slashes (/* */)
	emit("/* */");
the quotation mark in the first line does not count as the start of a string, since it appears in a comment. The string in the second line is of length one (just a single, albeit escaped, quotation mark appears within). The "/* */" sequence on the third line does not count as a bounded comment because it appears in the middle of another (end-of-line) comment. And finally, the similar "/* */" sequence in the next line does not count as a bounded comment because it is in a string.
For each input file (you might read only a single file per run of your program, see below) you should detect and report on each of the following as they appear (plus compute and report some statistics at the end):

each word appearing in the code (i.e., not in comments or strings);
each numeral appearing in the code;
each string literal appearing in the code, along with its size (in characters not including the delimiters);
the opening and closing of each block (in the code);
each comment along with its type and size (in characters, not including the delimiters);
the total number of lines in the file (as measured by newlines or newline/carriage return combinations);
the total numbers of words, numerals and string literals appearing in the code.

By way of example, the following input:
	for(int i=0; i<5; i++) {
		
		// Print a quote mark (")
		emit("\"");
		
		// Print some stars and slashes (/* */)
		emit("/* */");
	}
might produce this output:
Processing file "sample1" ...

	* word in code (for);
	* word in code (int);
	* word in code (i);
	* numeral in code (0);
	
	...
	
	* opening block;
	* end-of-line comment (23 chars);
	* word in code (emit);
	* string literal (1 chars)
	
	...
	
	* closing block;
	* end of file.
	
	Total lines: 7
	Total words: 7
	Total numerals: 2
	Total string literals: 2
	
NB: you might want to turn off detailed reporting for some categories and just look at totals and "focal" categories while debugging.

Basic plan of action
Read in the file, one line at a time, and run through each line character by character. At any point in the process you can be in any of a few different "states" (in code, in comment, in string) and can see any of a few categories of characters (quote mark, digit, etc.). Each combination of these things will have a different significance (e.g., quote mark ignored in comments, starts string literal in code, ends string literal while in string literal (unless it's escaped!)).
When you have seen a specific construct (such as word, numeral, etc.) you can take an "action" such as reporting it and/or updating any relevant statistics. You might want to use a nested switch statement (or similar) to trigger the actions based on current state and input category. Another (better) way to organize this is through an explicit table indexed by (say) state numbers and input category numbers (the items in the tables could be members of an interface which provides an "action" method).
A higher-level approach based on Java tools might be to use the StringTokenizer to do more of this work for you, but it could be tricky: getting the tools to stop and start at the right places might take some effort. There are even more sophisticated tools available (lexer generators, etc.), but they are probably to difficult to learn in the time available (you are free to use them if you like, though).