Lab #2: Phonebook comparison of lists and dictionaries

Introduction: purposes of the lab

[For a black-and-white version of this lab, suitable for people with certain visual disabilities, see this link.]

The purposes of this lab are to:

develop your modular design skills: you will develop some timing tools (the StopWatch class), a class that uses randomization (the PhoneNum class), and a larger testing application that compares multiple data structures in terms of their performance;

have you experience some actual, real-world program timings: although mathematical tools like “big-O notation” are very useful, they have a theoretical flavor when compared with actual, running programs (there’s nothing like waiting a long time for some code to run, wondering if there might be an infinite loop, to drive home some of the lessons about algorithm and program timing);

compare the performance of Python lists, (your own) linked lists, and Python dictionaries: one of the main purposes of this course is to help you develop the skills and intuitions to choose data structures well for various purposes in your programming work—to see these three different structures in a real-world example should help you build those skills;

and, finally, just to practice coding in Python some more!

Below you will find:

an introduction to the overall project;
a recommended module structure to help you get started (and some sample code is provided!);
a review of the Names module that I have provided for you;
some notes and sample code concerning the standard Python random library;
ideas for the PhoneNum class that you will provide (based on Names);
some notes and sample code concerning the standard Python datetime library;
a[n unfinished draft of] notes on a StopWatch class that you will write;
a review of the testing harness or main body for the project;
some sample output; and
some questions to think about and address in demo.

The overall `TimingLab` module

Dictionaries in Python match up keys and values together in an overall structure that contains several such “pairs”, with just a colon (:) to separate them, and curly braces around the whole structure. On the other hand, lists can contain many types of values—but in particular, they can contain pairs of values: two values put together with parentheses and a comma. In fact, there is a close correspondence between dictionaries and lists of pairs, at least in terms of their Python syntax:

{ 'a':1 , 'b':2 } # a dictionary, with keys and values [('a',1), ('b',2) ] # a list [] of pairs (), with left and right parts

If you know lists, their methods, and their syntax especially well, and if you are a bit unfamiliar with dictionaries and their methods, you might be tempted to use a list-of-pairs rather than a dictionary. After all, one of the main things we do with dictionaries is to look up a value based on the corresponding key:

>>> d = {'a':1, 'b':2} # a simple dictionary >>> d['b'] # looking up the value 2 based on the key 'b' 2

If you were to use a list-of-pairs instead, you would merely need to write a function to lookup a right part in a list-of-pairs by finding the corresponding left part. This shouldn’t be too hard: just write a loop to run down the list and look for the correct left part (as specified in a parameter by the caller) and when you find it, return the corresponding right part. (Of course you must raise an exception or similar if the left part being sought is not available.)

Task: Write a function lookup(key,lops) where key is sought on the left sides of pairs in the list-of-pairs lops. Then return the right part of the pair as the result (and if no key-pair is found, you may pass, return None, or raise an exception (such as KeyError).
>>> lop = [('a',1), ('b',2)] >>> lookup(lop,'b') 2
(Note that you should in particular return the resulting right part of the pair, not just print it: printing something out is great for a user or programmer at the console, but programs will in general need to get a result back as a returned value for it to be useful to whichever part of the program needs the value next.)

At this point, you should recall some of the charts from our textbook, the ones which compare the performance of dictionaries to lists: although lists are more familiar to many beginning programmers, dictionaries seem to outperform them in significant ways: many operations that are O(n) on lists are just O(1) on dictionaries!

A third option for making and using a ‘lookup structure’ would be to use your own linked lists, as implemented in lab, to “host” the pairs in the list-of-pairs. We know that for some operations, at least, linked lists can outperform native Python lists. (Spoiler alert! As you might guess, however, this is not one of those cases: trolling through a list, item by item, searching for something, is bound to take time proportional to O(n), just because we may, in general, have to look all the way down the list, one item at a time. The fact that dictionaries can do something essentially equivalent in just O(1) time is the unexpected, amazing thing!)

So, in this lab, let’s check out the performance of these three options: (1) a regular Python dictionary, or (2) a “roll-your-own” one built from either a Python list-of-pairs, or (3) one built from a linked list-of-pairs. Our approach will be to write a short “test harness”: this code will generate some random-sample based data and then drive it through a few performance tests, so as to compare the various options. We can time the runs of the various sample sets we generate, and report those timings out at the end, in order to make comparisons. We’ll write a short (but useful!) StopWatch class in order to make the timings. Along the way, we'll have use to generate random names and phone numbers (from the Names and PhoneNum modules), using random number methods from the random library. Finally, we’ll also use the datetime and timedelta classes from the datetime library to implement our StopWatch class for for making the timings, thus practicing the use of standard libraries as well.

For the purposes of a “back-story” to provide context for the lab, we’ll go to the phonebook for inspiration: we’ll assume that the (key,value) pairs we will be working with are thus names and phone numbers, with just a little bit of structure to them. We will then generate sample lists-of-pairs and dictionaries of a given size: call these the general population, or “genpop”. Their generation will depend on the Python random library indirectly, since the Names and PhoneNum modules will use some random utilities. We will also create a smaller sample population of names to do our testing: for each name in the sample population, we will look it up in the genpop list or dictionary, and capture timing data on the results. By using the same sample and general populations for the dictionary and two list-of-pairs approaches, we will get a fair comparison.

What about the actual code that we are timing? It will be just the time it takes to look up a phone number using a name, since this is the aspect of the competing data structures that we want to compare. For the list-of-pairs examples, you can use the lookup function you wrote above, and for the dictionaries you can just use indexing ([ ]) or the dictionary's .get() method. (This is a little easier than using indexing; see the section called Getting information from a dictionary at Leticia Portella's blog post on her favorite python tricks for an explanation of why. By the way: she has other great stuff on her blog; why don't you take a look around?)

Recommended module structure

How should we build a project like this? Let’s use Python’s object features to structure things. Below is a suggested overall design, presented in a diagram, where each rectangle corresponds to a class or module:

You don't have to use this specific structure, but if you don't, try to make sure you do so from a place of knowledge, i.e., try to understand my recommendations and why I make them before creating your own design.)

In this diagram, the blue modules/classes are the ones you will write yourself: StopWatch, PhoneNum and the main TimingLab. The solid green one (Names) is supplied by me, as an example; see below. And the dashed green ones are standard Python libraries. (Diagrams like this are often useful in understanding the structure of code like this: in the “real world” developers sometimes use tools that allow them to draw similar pictures and then actually generate code templates and files automatically from the diagrams.)

The `Names` module

As alluded to above, we will need a way to generate names for our phonebook general populations, both the general one which represents the overall “book”, and the more specific ones which represent the test population for lookup.

In order to generate the names, I am providing a definition of a module of code for you, as both a resource and a model of coding style. In particular, the file Names.py shows how you can use the features and functions of the random library to generate random names.

See here for the actual code file and here for a colorful HTML version (with the help of www.hilite.me).

How does this code work? Overall, note that it does not implement an actual python class (more about that later), but rather just a single variable (alphabet), a few convenient functions (name, firstname, and lastname, and a short bit of test code to make sure everything works (called, unsurprisingly, testrandom is imported.

The idea is that the name function generates random strings of letters in lengths that can be specified in parameters—it is the real workhorse of the module. The firstname and lastname functions are then just “convenience functions” which call the name function with specific parameters (they are intended to return names of lengths that are typical of first and last names, with the former being a bit shorter, and the latter a bit longer).

The name function is perhaps easiest read from the inside out, so to speak —in particular, note that it is a single return statement which calls several nested methods and functions “layered” one on top of another. On the very inside (indented furthest to the right) is a call to the randrange function from the random library: this will return a parameter k which is essentially a “die roll” that returns a random number between the minimum and maximum specified.

This parameter k is then used to get some random choices from the alphabet, namely k of them. So the result of the random.choices call is just a list of k letters chosen at random. (Note that this scheme will not lead to very pronounceable names … .) Since the letters are in a python list (as returned by random.choices), we will want to convert them into a string: for this we can use the python join method, joining in particular with an empty string between the letters. (In general, you might want tabs, commas, etc., in between a bunch of strings when you join them, so the separation string is specified as the one on which the join method is called.) Finally, we use the python capitalize method, called on the resulting string of k letters, in order to make it look like a typical name.

Now the convenience functions firstname and lastname will return names within specified length ranges. These are the functions your code will likely call, in order to build a final name to be looked up in the “phonebook“.

But here is an interesting conundrum: the functions let you create first and last names, but they don’ provide you with a whole name! How will you add these names into your dictionaries (as values to be looked up) and as “right parts” of pairs for your various lists-of-pairs? One idea would be to just “jam” the names together to create a longer string, say with a space in between, and call that a ‘whole’ name. But consider the longer game, the way a more realistic program ought to work in a more realistic environment. What is likely to happen is that names are likely to grow more complex, to include middle names, to allow for hyphenated names, various prefixes and suffixes (such as “Mc” or “Jr.”), to account for various cultural differences (such as different orders of family and personal parts of the name), not to mention nicknames, etc.

What I would like to suggest is that you consider converting my little Names.py module into one which provides a real python class, say the name class. That way, if and when the program grew into something more realistic, you would have an easier route to these various generalizations, should they turn out to be needed. In particular, you could make sub-classes, or just add features, in order to allow for a more complex structure to both the names and to the various methods and functions which would be available to a client for manipulating them.

This would mean that your dictionaries values, and the right parts of your pairs, would be actual objects of the name class, rather than just strings jammed together, or pairs of strings, or whatever other ad-hoc solution you might come up with. In other words, there is an extra “little lesson” of design embedded in the lab here: you can design a better alternative to simple or separate strings and use it to protect yourself against the complexity of later changes. This is a hard kind of lesson to learn on the job in the Real World, so perhaps better to practice in a smaller, gentler context like this lab.

So, see if you can come up with a simple class that uses my code (feel free to copy and paste, just credit me in a comment) to allow a client to get a new name object, presumably by calling the __init__() method of the class. It needn’t be especially complicated, really just a wrapper around my code, if you will. The point is not to introduce all (or any!) of the aforementioned complexities: the point is to make something simple, as appropriate for the current program, but also to plan for future growth by making careful design choices early.

Note: In order to use python classes and objects for the names in this program, you will need to compare two name objects to see if they are equal (and python itself will need to be able to do this for you in order to implement dictionary lookup and perhaps a few other things). So, how do you make sure that python can compare two of your objects (ones whose class you defined) for equality? As usual with python, there is a standard method, using the usual “underscores” naming style, that you must define in your class. The particular method name this time is __eq__(…): python will call this method in order to compare objects of your class to see if they are equal.

How will you compare two names? Well, for our purposes here, if the two first-name parts are equal (as strings) and the two last-name parts are equal, then the two names themselves are equal. (In a more realistic version, a lot of tricky situations would need to be handled: 'Tom''Thomas, etc.)

As for the actual mechanics of defining the __eq__(…) method, see this python tutorial on __eq__.

The `random` library

In the Names.py file I have provided, we use functions and methods from the standard python random library. This library can help us generate the random data (names and phone numbers) that will form the basis of our tests. In order to help you in confronting what is a fairly complex library, this section provides some further notes on random. (Note that we will not use any particularly sophisticated techniques in our lab program: a more serious treatment would likely involve data science issues of distributions of real-world data, more attention to probability and statistics, and even the use of more accurate timing tools for that part. But the approach described here will be sufficient for the purposes of learning about data structures.)

You can find the official documentation on the random library here at python.org. (Although we will only use a few simple utilities from this library, it is worth looking around a bit at the documentation more broadly, as this is the sort of resource you will be using from now on in making use of Python’s extensive library tools.) Note in particular that, due to the deterministic nature of most modern computer hardware, we cannot generate truly random numbers from Python on a typical machine (whatever that would even mean: perhaps some quantum feature?). In any case, as with almost all random libraries, in any language, these “random numbers” and related phenomena are really only pseudo-random. In Python this means that they are generated in a mathematical fashion based around the idea of a seed value, which is manipulated in successive turns to generate sequences of numbers, that then have fairly good properties relative to theoretical assumptions and the usual real-world phenomena, such as coins or dice. In any case, the initial seed values are usually taken from computer clock time, but can also be specified by the user—the latter approach allows one to run the same calculations with what amount to the same sequence of “dice rolls” multiple times in a row, which is useful in experimental contexts.

But we will use only a few simple utilities, and the default use of seeds and the like will be perfectly adequate for our needs. One of the functions we will use simply generates a random number in some range: the range is specified in the usual Python fashion, and the function simply returns a (pseudo-)random number (whole number) within the specified range (which carries with it the usual oddity of not including the terminal value in the range). We can call either one of the following:

>>> r = random.randrange(start, stop) # a random number start ≤ r ≤ stop-1 >>> r = random.randint(start, stop) # a random number start ≤ r ≤ stop

(Note that the second form does not have the usual range behavior with regard to the stop terminal value. Note also that an optional step parameter is also allowed for the first form, although we will not need it here.)

Another random utility we will use is one which generates a list of random choices from some initial population of possibilities. It happens to return results “with replacement”, which is technical terminology in probability and statistics: it just means that when choosing the next value in the list, we consider all the original possibilities, including any already chosen, so that occasional duplications may occur in the context of the list.

In particular, this is the choices function, which is documented on the page cited above as follows:

random.choices(population, weights=None, *, cum_weights=None, k=1)

Note that there are multiple parameters, that all but the first are optional, and that the optional ones are specified by Python’s named parameter feature (so that they can be given in any order, and some skipped, etc.) We will not need anywhere near the full generality of this function, but just a version of the call with a population and a specification of the k parameter. This will allow us to get a list of (pseudo-)randomly values from a set: in the case above with names, these are alphabetic letters; for your numbers case below, they will be digits or numbers. We can make a call like this (see the sample file Names.py and the section below that describes it):

random.choices( alphabet, k = random.randrange(min,max) ))

This call will get us k alphabetic letters in a list, where the value k is itself a random number between min and max. (Again, see the file and the description for more context.)

The `PhoneNum` class

In the Names.py file above, I wrote code to generate names as just two strings … but as we discussed above, it would probably be better to use something like a real class, and then supply methods to extract the parts when necessary. In this next section, we discuss a similar module that you will write, in order to manifest phone numbers for the other half of our general population of phonebooks and our sample populations for lookup. You could of course represent phone numbers as simple 10-digit numbers, or as triples of three integers: area code, exchange, and number proper. But when we write out phone numbers, we want them to get formatted something like this: '(503) 555-6767'. And as for the names above, we should probably make a better design decision in anticipation of program growth, even if we never extend this simple lab.

One way to do better would be to make the basic triples of ints into a new object class, one with three instance variables. We could then use the standard .__str__() and .__repr__() methods to format them these numbers for printing, etc., in the usual way, but retain the flexibility of a structure with access methods, etc., in anticipation of program growth.

So this class can (for now) be quite simple: the main task will be to format the integers so that they have enough leading zeros to look natural: we don't want '(30)-7-98' as the format, we want rather '(030)-007-0098' (presuming there really are valid area codes and exchanges that start with zeroes). Note that python allows you to say f'{56:04}' and get the padded string '0056' in return, for example.

Of course, you should also set something up to allow you to get a random phone number, in much the same way you will have modified my names code to allow for a random name to be returned, i.e., as the result of an __init__(), so that you can get new random phone numbers through an object construction like pn = PhoneNumer(). (Note that will need to use the random library for these purposes, but you can basically just mimic the code I provided for names.)

Before you try to actually write out your phonenumber class, think about what you will need: three integer variables, an init method, a string method, … and anything else? In a Real World application, we might want to be able to extract area codes, etc., separately—we won’t need that functionality here, but it would be easy to provide with a few “getter” methods.

The `datetime` library

Just as we used the random library to form a basis for generating random names and phone numbers, we ill appeal to the standard python datetime library for tools to implement our StopWatch class. There are actually two types of time-related data implemented in this library that we will need, called datetime (like the library itself) and timedelta. The datetime class implements a notion of an absolute point on the “global” timeline that we all live in. In particular, these values are number-like, but with an “origin” placed at the beginning of the day on January first, 1990. (Other important standard time origins used widely in computing include the Unix or Epoch time origin of January 1st, 1970, and the Universal Time Code (UTC) origin of January 1st, 1972.)

A datetime value denotes a moment in time, an instantaneous point on the timeline. But the timedelta type represents instead a measure of time, in the sense of the “length” of a time interval, i.e., the amount of time which passes between one point and another on the timeline. (Famously, time itself passes in only one direction—well, at least for most of us …). This interpretation means that although both of these types are sort of like numbers, in fact only certain arithmetic operations are defined between various combinations of datetime and timedelta values. In particular, subtracting two datetime values yields a timedelta value; two of the latter can be added or subtracted (as this makes sense for intervals), and they can even be multiplied by numbers (since it makes sense to consider an interval which is several times as long as another interval). But you cannot add or multiply together two datetime values, nor multiply them by numbers, as those operations don’t really make sense for absolute points on the timeline.

As you can see, the issues in representing time-related values are a bit more complex and subtle than you might have guessed at first—you can read more about these types and the operations available on them in the official documentation for the datetime standard library. (Just note that the description runs to almost 50 pages!)

The `StopWatch` class

[This section still needs some details … ]

You should write this class, based on your preliminary designs from the past week or so in lab. The design is up to you, but it should probably have at least some methods to start, stop, reset, read (as in: get the time), and print (as in the standard .__str__() and .__repr__() methods).

I made mine so that starting without resetting accumulates the time, but other options could be useful, too.

I ended up using datetime.timedeltas for my StopWatch class, as they can be added and subtracted from each other, and obtained by subtracting datetimes in the first place.

The testing harness or main body of the program

OK, with the above preliminaries out of the way, we are finally ready to discuss how we will actually set up the comparative tests for our various implementations of lookup code, namely: python dictionaries, python native lists (of pairs), and your own linked lists (again, of pairs). Recall that for each implementation, we will want to try looking up some sample or test cases in a bigger structure or “phonebook”. So, we will need to generate two populations for each test run, the general population (say: “genpop”) and the smaller sample population.

Since we want the lookup tasks to be at least moderately realistic, let’s agree that the sample populations will have both successful and some (fewer) unsuccessful lookups. We can ensure this by first generating the general population data, and then drawing the sample population mostly from the general one we have generated. Since we don’t need to look up every name in order to test things out, let’s also agree that the samples will constitute some standard percentage of the general population, say something like 5% or 10%.

But we might also include some names to look up that we know won’t be found: we can’t guarantee that fresh random names will fit this characterization, but it’s very unlikely that we would happen to generate the same names twice. So we could add another say 2% of the size of the genpop of names that are newly-generated and unlikely to be found: this will more realistically simulate real-world situations where unfound names are usually a possibility. (In a more sophisticated version of this testing, we would need to more about the actual intended uses and data patterns, but again, these simple assumptions are reasonable for a lab exercise.)

Note that we should try to test the various data structure implementations with the same test data in roder to be fair. This means that we should generate our genpop data first, then place it in each structure in turn, and next generate our sample population data (in part drawn from the genpop data), and then finally test the same genpop and samples across each of the three implementations structures. One way to do this would be to generate into python lists for simplicity, then convert the data from the python lists into dictionary form and linked list form. (And note that we are not too worried here about how long the conversion process takes, just the lookups at the end, once everything is generated.)

To be more concrete, then, you might develop your populations as follows:

generate a (python) list of random names (of some size n, where n is usually a power of 10);
generate a (python) list of random phone numbers (again, of size n);
create from these lists the 3 versions of the general population or phone book;
now create a sample population basis (this will be just a list of names) by taking some subset (say 5 or 10%) of the genpop names list and adding some new (presumably different, and therefore unfound) names;
finally, running those sample names against the various versions of the general population by looking them up (as appropriate per data structure) and timing.

In a bit more detail …

In order to generate some phonebooks and sample populations, it is convenient to first generate the “base data”, a list of names and a list of phone numbers, first, and only then populate the data structures in question (dictionary, python list, linked list). Why? If we generate the names and numbers directly into a dictionary, for example, the python standard random library will not be able to extract a sample population from it.

Assuming we have a variable n that corresponds to the desired number of names and phonenums:

First generate the “base” lists:

namelist = ... # generate n names numlist = ... # generate n phone numbers

Next, populate the phonebooks from these lists:

genpopdict = ... # put values from namelist and numlist into a dict genpopplist = ... # put values from namelist and numlist into a python list of pairs genpopllist = ... # put values from namelist and numlist into a linked list of pairs

Now, we can generate a sample population as follows: we want some standard fraction (say 5% or so) of the names from the original namelist, plus a fsmaller bunch of “fresh” names (not expected to be found in the phonebook) which we might then “shuffle” (using a random library call) into a final list of names to use for sampling or testing purposes.

samppop = ... # choose 5% of namelist + add some more (0.2%?) of "fresh names" sampop.shuffle(...) # check the random libraries fro the correct call

Now we can try looking up the sample names in each of the three phonebooks (dict, plist, and llist):

for name in samppop: # lookup n in genpopdict for name in samppop: # lookup n in genpopplist for name in samppop: # lookup n in genpopllist

Some sample output

Here is some sample output from my own implementation:

    5 samples from python list population of     100 took  0:00:00.000016
    5 samples from linked list population of     100 took  0:00:00.000033
    5 samples from python dict population of     100 took  0:00:00.000004

   52 samples from python list population of    1000 took  0:00:00.001730
   52 samples from linked list population of    1000 took  0:00:00.002051
   52 samples from python dict population of    1000 took  0:00:00.000014

  520 samples from python list population of   10000 took  0:00:00.136822
  520 samples from linked list population of   10000 took  0:00:00.139762
  520 samples from python dict population of   10000 took  0:00:00.000101

 5200 samples from python list population of  100000 took  0:00:15.184087
 5200 samples from linked list population of  100000 took  0:00:14.958131
 5200 samples from python dict population of  100000 took  0:00:00.001354

Some other questions

Try timing inside the loop versus outside the loop; does it make a difference? Can you speculate why?
Try timing the generation of your genpop list versus your genpop dict, for various sizes. Are they similar or different? Do the timings vary in the same way as they do for the lookup?
We made objects out of three-part phone numbers, but not out of two-part names; should we have? Will lookup for lists and .get() for dictionaries work if we pass them objects as “keys”? There are some tricky issues with equality here: how can we make things work intuitively?
Dictionaries sound great! Maybe we should make a PhoneBook class which wraps up this sort of dictionary and gives it some useful methods. It could even print out in the usual format, sorted alphabetically by last name … can we therefore just print the dictionary out directly, using its .__str__() function? Why or why not?

Lab #2: Phonebook comparison of lists and dictionaries

Introduction: purposes of the lab

The overall TimingLab module

Recommended module structure

The Names module

The random library

The PhoneNum class

The datetime library

The StopWatch class