For this assignment you will implement a more realistic, medium-sized application, and add some file IO to the mix.
In particular, you should write a concordance program: this is a text-processing application in which you present a sorted index of all the words which appear in a text, along with the frequency of their occurrence. A concordance can be used in simple studies of content and authorship (sometimes an author can be identified by patterns of relative frequency of usage; fuller grammatical analyses are better, though).
First, develop a concordance function which works on strings. You can use a number of different functions which can be "pipe-lined" together with composition (the ".") to give you the final concordance program.
wordsP
which will break up a string into words; for starters
this can be just the Prelude function words
, but see below for
why (and how) you will want to enhance (or re-write) it.
insertFreq
) should be
sensitive to frequency: rather than just items stored in
the tree, you will keep pairs of items and counts. If a word is
inserted into a tree it already occurs in, you just increase the
stored frequency by one. If a word is inserted into a tree for the first
time, just make a new node and use a count of 1. In some sense, this will
be very much like the "list-ified" version of tree insertion, except that you
use 1 and (1+) instead of [x] and (x:) for the Tip and "found" cases (where
x is the item involved). (For those into abstract algebra, there is a homomorphic
relationship between the two, which uses length to "count" the lists back to
integers.)
(The use of a tree is motivated by the desire that insertions be fairly quick, useful when we get a much larger data set.).
You can fold this function across a list of words to build a complete tree
(Question: do you use a foldl
or a foldr
?
Hint: it depends on the types, and perhaps is more efficient one way than the other.
flatten
the tree.
The order of the results for this function are just the same as for the tree, i.e.,
they will naturally be sorted by word, but it won't matter in the long run--
see the next item for why.
sortBy
function from the List
module to
sort the list of pairs (word, frequency) by frequency.
(Just put "import List" on a separate line after
your module declaration in your file.) You will need to pass a
suitable comparison function to sortBy
to get it to work; in other
words, you will need to pass a function that performs the comparison based
on the frequency rather than the word itself. Note also that we want the
highest frequency first, so you may need to invert the usual order of
arguments. You will want to use the (Prelude) function compare
,
which returns an Ordering.
By putting together these ingredients you can make a function which will take in a string and produce a list of words and frequencies sorted by frequency.
wordFreq
, you might
see it work like this:
> wordFreq "a bee and a b saw a bee be"
[("a",3),("bee",2),("and",1),("b",1),("be",1),("saw",1)]
This list can be printed out in a nice format, using unlines
and a
little helper function (you write this), so that the output might look like this:
a: 3
bee: 2
and: 1
b: 1
be: 1
saw: 1
(Don't worry about getting the spacing pretty, although tabs can help some in this regard.)
How does your program handle punctuation? It should probably be thrown
away, but be careful about certain cases, e.g., commas and periods at
the ends of words. Without a certain care, you will end up counting these
as different words, due to the presence of spurious punctuation. Look at
the Prelude functions span
, break
and dropWhile
for ways to handle this problem (hint: use the predicate isAlpha
to find words as such, less spacing and punctuation). Work on the overall
concordance problem first, but try to refine this part of it if you have time.
What about input? You can read text from a large file, then process it using the function f using code as follows:
fileFreq filename = do { text <- readFile filename; concord text }
(Here I have made filename
a parameter to the function, so you would then call it with
something like fileFreq "wizardofOz.txt"
.)
(In this example, concord
is meant to be the name of your overall function which takes a
String and prints out its frequency count information.)
Lastly, if you want some good test data, you can get longer data files from any number of sources; there is a small on-line library of works in the public domain at the literature.org website. (Just download text from there and save it into a file.)