For this assignment you will implement a more realistic, medium-sized application, and add some file IO to the mix.
In particular, you should write a concordance program: this is a text-processing application in which you present a sorted index of all the words which appear in a text, along with the frequency of their occurrence. A concordance can be used in simple studies of content and authorship (sometimes an author can be identified by patterns of relative frequency of usage; fuller grammatical analyses would be better, of course, but would also be considerably more complex, so we will stick with this simpler alternative for now).
First, develop a concordance function which works on strings. You can use a number of different functions which can be "pipe-lined" together with composition (the ".") to give you the final concordance program.
wordsP which will break up a string into words;
for starters this can be just the Prelude function words,
but see below for why (and how) you will want to enhance (or re-write) it.
(In the diagram below I have written it words? in order to
indicate that you might adjust the function used as you go.)
insertFreq) should be
sensitive to frequency: rather than just items stored in
the tree, you will keep pairs of items and counts. If a word is
inserted into a tree it already occurs in, you just increase the
stored frequency by one. If a word is inserted into a tree for the first
time, just make a new node and use a count of 1.
(The use of a tree is motivated by the desire that insertions be fairly quick, useful when we get a much larger data set.).
You can fold this function across a list of words to build a complete tree
(Question: do you use a foldl or a foldr?
Hint: it depends on the types, and perhaps is more efficient one way than the other.
flatten the tree.
The order of the results for this function are just the same as for the tree, i.e.,
they will naturally be sorted by word, but it won't matter in the long run--
see the next item for why.
sortBy function from the List module to
sort the list of pairs (word, frequency) by frequency.
(Just put "import List" on a separate line after
the module declaration in your file.) You will need to pass a
suitable comparison function to sortBy to get it to work; in other
words, you will need to pass a function that performs the comparison based
on the frequency rather than the word itself. Note also that we want the
highest frequency first, so you may need to invert the usual order of
arguments. You will want to consider the (Prelude) function compare,
which returns a value of type Ordering.
printConc in the diagram below).
By putting together these ingredients you can make a function which will take in a string and produce a list of words and frequencies (as pairs), sorted by frequency.
wordFreq, you might
see it work like this:
> wordFreq "a bee and a b saw a bee be"
[("a",3),("bee",2),("and",1),("b",1),("be",1),("saw",1)]
This list can be printed out in a nice format, using unlines and a
little helper function (you write the helper function!),
so that the output might look like this:
a: 3
bee: 2
and: 1
b: 1
be: 1
saw: 1
(Don't worry about getting the spacing exactly right, although tabs can be of some help in this regard.)
How will your program handle punctuation? It should probably be thrown
away, but be careful about certain cases, e.g., commas and periods at
the ends of words. Without a certain care, you will end up counting these
as different words, due to the presence of spurious punctuation. Look at
the Prelude functions span, break and
dropWhile
for ways to handle this problem (hint: use the predicate isAlpha
to find words as such, less spacing and punctuation). Work on the overall
concordance problem first, but try to refine this part of it if you have time.
What about input? You can read text from a large file, then process it using the
function concord with code like the following:
fileFreq filename = do { text <- readFile filename; concord text }
(Here I have made filename a parameter to the function, so you would then call it with
something like fileFreq "wizardofOz.txt".)
(In this example, concord is meant to be the name of your overall function which takes a
String and prints out its frequency count information.)
Lastly, if you want some good test data, you can get longer data files from any number of sources; there are small on-line libraries of works in the public domain at the literature.org or Project Gutenberg" websites (these are just two examples). Just download text from one of these sites, either a single chapter or a whole book, and save it into a text file (don't use something like a PDF).