Lab #5 Text Concordance

For this assignment you will implement a more realistic, medium-sized application, and add some file IO to the mix.

In particular, you should write a concordance program: this is a text-processing application in which you present a sorted index of all the words which appear in a text, along with the frequency of their occurrence. A concordance can be used in simple studies of content and authorship (sometimes an author can be identified by patterns of relative frequency of usage; fuller grammatical analyses would be better, of course, but would also be considerably more complex, so we will stick with this simpler alternative for now).

The basic ideas

First, develop a concordance function which works on strings. You can use a number of different functions which can be "pipe-lined" together with composition (the ".") to give you the final concordance program.

  1. Write a function wordsP which will break up a string into words; for starters this can be just the Prelude function words, but see below for why (and how) you will want to enhance (or re-write) it. (In the diagram below I have written it words? in order to indicate that you might adjust the function used as you go.)

  2. Use a variation on the standard insertion function to put items into a binary search tree. Your function (say, insertFreq) should be sensitive to frequency: rather than just items stored in the tree, you will keep pairs of items and counts. If a word is inserted into a tree it already occurs in, you just increase the stored frequency by one. If a word is inserted into a tree for the first time, just make a new node and use a count of 1.

    (The use of a tree is motivated by the desire that insertions be fairly quick, useful when we get a much larger data set.).

    You can fold this function across a list of words to build a complete tree (Question: do you use a foldl or a foldr? Hint: it depends on the types, and perhaps is more efficient one way than the other.

  3. Write (or steal from lecture files!) a function which will traverse a tree and produce a list, i.e., which will flatten the tree. The order of the results for this function are just the same as for the tree, i.e., they will naturally be sorted by word, but it won't matter in the long run-- see the next item for why.

  4. Use the sortBy function from the List module to sort the list of pairs (word, frequency) by frequency. (Just put "import List" on a separate line after the module declaration in your file.) You will need to pass a suitable comparison function to sortBy to get it to work; in other words, you will need to pass a function that performs the comparison based on the frequency rather than the word itself. Note also that we want the highest frequency first, so you may need to invert the usual order of arguments. You will want to consider the (Prelude) function compare, which returns a value of type Ordering.

  5. Finally, as described below, you can use an appropriate function to print out the word/frequency pairs in a nice, readable format (this is called printConc in the diagram below).

The big picture

By putting together these ingredients you can make a function which will take in a string and produce a list of words and frequencies (as pairs), sorted by frequency.

Example and refinements

For example, if you call this basic function wordFreq, you might see it work like this:
> wordFreq "a bee and a b saw a bee be"

This list can be printed out in a nice format, using unlines and a little helper function (you write the helper function!), so that the output might look like this:

	a:    3
	bee:  2
	and:  1
	b:    1
	be:   1
	saw:  1

(Don't worry about getting the spacing exactly right, although tabs can be of some help in this regard.)

How will your program handle punctuation? It should probably be thrown away, but be careful about certain cases, e.g., commas and periods at the ends of words. Without a certain care, you will end up counting these as different words, due to the presence of spurious punctuation. Look at the Prelude functions span, break and dropWhile for ways to handle this problem (hint: use the predicate isAlpha to find words as such, less spacing and punctuation). Work on the overall concordance problem first, but try to refine this part of it if you have time.

What about input? You can read text from a large file, then process it using the function concord with code like the following:

fileFreq filename = do { text <- readFile filename; concord text }
(Here I have made filename a parameter to the function, so you would then call it with something like fileFreq "wizardofOz.txt".) (In this example, concord is meant to be the name of your overall function which takes a String and prints out its frequency count information.)

Lastly, if you want some good test data, you can get longer data files from any number of sources; there are small on-line libraries of works in the public domain at the or Project Gutenberg" websites (these are just two examples). Just download text from one of these sites, either a single chapter or a whole book, and save it into a text file (don't use something like a PDF).