Lab #6: Text Concordance

For this assignment you will implement a more realistic, medium-sized application, and add some file IO to the mix.

In particular, you should write a concordance program: this is a text-processing application in which you present a sorted index of all the words which appear in a text, along with the frequency of their occurrence. A concordance can be used in simple studies of content and authorship (sometimes an author can be identified by patterns of relative frequency of usage; fuller grammatical analyses are better, though).

The basic ideas

First, develop a concordance function which works on strings. You can use a number of different functions which can be "pipe-lined" together with composition (the ".") to give you the final concordance program.

Write a function wordsPwhich will break up a string into words; for starters this can be just the Prelude function words, but see below for why (and how) you will want to enhance (or re-write) it.
Use a variation on your functions from the last lab to insert items into a binary search tree. Your function (say, insertFreq) should be sensitive to frequency: rather than just items stored in the tree, you will keep pairs of items and counts. If a word is inserted into a tree it already occurs in, you just increase the stored frequency by one. If a word is inserted into a tree for the first time, just make a new node and use a count of 1. In some sense, this will be very much like the "list-ified" version of tree insertion, except that you use 1 and (1+) instead of [x] and (x:) for the Tip and "found" cases (where x is the item involved). (For those into abstract algebra, there is a homomorphic relationship between the two, which uses length to "count" the lists back to integers.)
(The use of a tree is motivated by the desire that insertions be fairly quick, useful when we get a much larger data set.).
You can fold this function across a list of words to build a complete tree (Question: do you use a foldl or a foldr? Hint: it depends on the types, and perhaps is more efficient one way than the other.
Write (or steal from lecture files!) a function which will traverse a tree and produce a list, i.e., which will flatten the tree. The order of the results for this function are just the same as for the tree, i.e., they will naturally be sorted by word, but it won't matter in the long run-- see the next item for why.
Use the sortBy function from the List module to sort the list of pairs (word, frequency) by frequency. (Just put "import List" on a separate line after your module declaration in your file.) You will need to pass a suitable comparison function to sortBy to get it to work; in other words, you will need to pass a function that performs the comparison based on the frequency rather than the word itself. Note also that we want the highest frequency first, so you may need to invert the usual order of arguments. You will want to use the (Prelude) function compare, which returns an Ordering.

The big picture

By putting together these ingredients you can make a function which will take in a string and produce a list of words and frequencies sorted by frequency.

Example and refinements

For example, if you call this basic function wordFreq, you might see it work like this:

> wordFreq "a bee and a b saw a bee be"
[("a",3),("bee",2),("and",1),("b",1),("be",1),("saw",1)]

This list can be printed out in a nice format, using unlines and a little helper function (you write this), so that the output might look like this:

	a:    3
	bee:  2
	and:  1
	b:    1
	be:   1
	saw:  1

(Don't worry about getting the spacing pretty, although tabs can help some in this regard.)

How does your program handle punctuation? It should probably be thrown away, but be careful about certain cases, e.g., commas and periods at the ends of words. Without a certain care, you will end up counting these as different words, due to the presence of spurious punctuation. Look at the Prelude functions span, break and dropWhile for ways to handle this problem (hint: use the predicate isAlpha to find words as such, less spacing and punctuation). Work on the overall concordance problem first, but try to refine this part of it if you have time.

What about input? You can read text from a large file, then process it using the function f using code as follows:

fileFreq filename = do { text <- readFile filename; concord text }

(Here I have made filename a parameter to the function, so you would then call it with something like fileFreq "wizardofOz.txt".) (In this example, concord is meant to be the name of your overall function which takes a String and prints out its frequency count information.)

Lastly, if you want some good test data, you can get longer data files from any number of sources; there is a small on-line library of works in the public domain at the literature.org website. (Just download text from there and save it into a file.)