CS 241 final review notes Fall 2019

CS 241: Data Structures—Final Exam Review Notes

Important notes!

click on the triangles to the left to expand or contract sections!

use the text size feature in your browser to make this text larger and more readable!

About the exam

the final exam will take place on Wednesday, December 11, 2019, from 2-5 pm, in the usual lecture room (Ford 204)

the exam will be designed so that you should be able to complete it in one-and-a-half to two hours

you will not have access to books, notes, computers, networks or neighbors during the exam

(you may use scratch paper and it will be supplied if needed; you are encouraged to ask your instructor for clarifications during the exam if necessary)

In preparing for the exam, you should consult:

your textbook (see especially chapters and sections mentioned below)

your notes from lecture and lab—these are likely to be especially helpful

the home page for the course and various links, examples, etc., provided there

… and also especially the lab assignments linked from the home page, as well as your own programs

the sample quizzes on the homepage

Format: the exam will consist of several kinds of questions; typically:

10-15 True-or-False questions at 1-2 points each

8-10 short answer questions at 4-6 points each

2-4 longer answer questions at 8-15 points each

These review notes are not exhaustive, but are meant as a guide to coverage—see other sources for details!

The final exam is comprehensive!

But there will naturally be some emphasis on material from after the midterm

For more information on the midterm, see the Midterm Review Notes

… but here are the main subject areas again (with book chapters) for the midterm material:

General background to Data Structures and modularity

Chapter 1, section 1.1; also Chapter 3, section 3.1

Math review and recursion

Chapter 1, sections 1.2-1.3

Java review and new features

Chapter 1, sections 1.4-1.5

Review—see external notes!

the Java ontology page linked from the home page

Algorithm analysis and asymptotics (O notation)

Chapter X, sections y-z

Arrays, ArrayLists and linked lists

Chapter 3, sections 3.2-3.5

Stacks and Queues

Chapter 3, sections 3.6-3.7

Infix and Postfix expressions

Chapter 3, sections 3.6.3

Priority queues and binary heaps

See Chapter 6 of the text (esp. 6.1through 6.4)

Abstraction levels

a priority queue is an ADT (abstract data type) supporting insertion of prioritized items, and removal of the highest priority item

(priority queues based on integer orderings can either be based on minimum or maximum items as “high priority”)

a binary heap is an implementation idea for priority queues using left-complete binary trees

a binary heap may be implemented in an array, giving especially fast access to the children and parents of a “node”

Linear implementations

there are obvious, but ultimately naïve, implementations of a priority queue using a linked list (say)

like a normal queue, we add to one end (the tail of the list) and remove from the other

we can either choose to honor the priority ordering on:

the way in (search for proper position on insertion)

or the way our (search for highest priority item on removal)

either way, the naïve implementations lead to an O(1) versus O(n) split for insertion and removal

(in other words, one of the two operations will be especially expensive in time)

the binary heap, by contrast, provides both operations in O(log n) time: faster and more consistent

Binary heaps (as trees)

the binary heap implementation has both a structural aspect and an ordering aspect

we can store the items in a left-complete binary tree (filled up all the way to the bottom row and then from the left)

the heap order: the values in the descendants of every node are lower priority than the value in the node

NOTE: there is no left/right aspect; as long as all descendants are lower priority, the heap order is satisfied

this is a “looser” order than in a binary search tree, whence the name “heap”

for both operations: restore structure first, then heap order

to insert: add at next position in bottom row (last in level-order traversal), then “bubble up” to right place

to remove(Min): take from root, then move last place item to root & bubble down

both operations take O(log n) time (with good constants) since the tree is guaranteed to be a minimum height

in both cases, we can save even more time by swapping “holes” rather than actual value

(this saves time because we make only one re-assignment per move until the end)

to build the original heap (buildHeap): start bubble up each node starting from first with child (in reverse level-order traversal)

Binary heap in an array

we can avoid allocating actual tree nodes by making a left-complete binary tree in an array

store the items in positions indexed starting from 1 in level-order

every node’s parent is in index n/2 (rounding down)

every node’s children are in indices 2n and 2n+1

these operations (involving multiplication and division by 2) can be implemented very quickly as bit shifts

Applications of priority queues

printer queues: in the old days, printers were shared by so many that priority was used to determine printing order

priority queues are used as “internal” structures in the implementation of many advanced algorithms (esp. for graphs)

sorting! (priority queues are used in the so-called “heapsort” algorithm)

Other variations on priority queues and heaps

you are not responsible for the exam for variations, but you should know there are many (d-heaps, leftist heaps, binomial queues, fibonacci heaps, etc.)

most variations support extra operations well, such as merging two priority queues

Hashing and hash tables

See Chapter 5 of the text, especially sections 5.1-5.4

Introduction

Dictionary interface: look values up by key

often the key is a field in a larger value object

can be implemented in O(log n) time using a BST (see above)

… but it would be better to have O(1)!

solution: keep items in an array

but under what index?

basic idea: chop up and mash around the keys (i.e., as numbers, using their bits)

we want a “messy”, random-seeming (but repeatable!) process which distributes keys well

unfortunately, it may happen that two keys collide under the hash function

use the modulus operator (%) to limit the hash results to the table size (-1)

what do we want from hash function?

speed!

coverage: all slots (not limited in magnitude or ability to hit), even distribution

lack of collisions (as much as possible)

collision resolution strategies

separate chaining: use a list of items hanging off each array entry

internal addressing/probing: try to insert at nearby slots if the preferred one is filled

… but there are problems due to clustering (“clumps” of filled entries, which tend to grow in size)

linear probing: try successive entries (but primary clustering)

quadratic probing: square the “attempt number” to get an offset (but secondary clustering)

best bets for hashing

well-chosen table size (about twice as much as there is data, and use a prime number)

use a well-distributed hash function

Sorting

See Chapter 7 of the text, especially sections 7.1-7.8

Introduction to sorting

one of the most common operations

ordered data supports binary search (fast!)

comparison interfaces in Java

Comparable—internal/“natural”: this.compareTo(that)

Comparator—external: compare(this, that)

both return an integer: negative (this is less), zero, positive

Naïve sorting

selection sort: next ordered item selected and added physically next in the result

insertion sort: next physical item inserted into proper place in the ordered result

bubble sort: repeatedly swap items, “bubbling up” to the boundary

average- vs. best- vs. worst-case

time bounds: O(n2)

Better sorting

tree sort: put items into a binary search tree, then traverse

heap sort: put items into a heap, then pull them all out (with repeated deleteMin)

quick sort: split items using pivot, then recursively sort the two parts

merge sort: split items into tow halves, recursively sort, then merge back together

time bounds: O(n log n) in average case (but can be bad for sorted data)

hard bounds on comparison-based sorting: no comparison-based sort can do better than O(n log n)

A rational reconstruction of (many) sorts

hard-easy division: either hard-split/easy-join or easy-split/hard-join (you have to do the work somewhere)

one-many versus half-and-half split: the latter is better, since we get the power of divide-and conquer (logarithmic times)

see the Visual Guide to Sorting Algorithms

Disjoint Sets

See Chapter 8 of the text (esp. 8.1through 8.5)

Relations and equivalence relations

a relation R is just a set of pairs of elements of some common set S

when a pair is in the set, we say that the elements x, y of S are related by it: x R y

for example, in a set of numbers, n < m would be a relation, so e.g. 2 < 3

an equivalence relation is one which has these properties:

reflexive: when a R a for every element a of S

Partitions

a partition of a set S is just a way of “breaking up” S into a collection of subsets

(so they should join together to completely “cover” S)

every equivalence relation determines a partition: related elements are placed in the same subset

every partition determines an equivalence relation: elements in the same subset are deemed related

the subset in which a given element sits (in a partition) is called its equivalence class

Dynamic equivalence problem

we imagine an unknown equivalence relation (or partition)

over time, we both receive queries (called “find”s) and information (called “union”s)

find asks which subset (equivalence class) an element is in

find takes elements and returns a “name” for the subset

union tells us that (new information) two elements are in fact related (so in the same subset)

the algorithm is dynamic and on-line: we discover and report over time, but accurately for the moment

initially, no elements are related (so each is in its own, lonely little subset 😢 )

Simple algorithm

use a forest (collection) of “upward-pointing” trees

for a find, proceed up the tree to root, and report root’s name

for a union, join one tree’s root to the other’s

rather than actual trees (with pointers), we can just use an array of indices

Smart unions and path compression

rather than choosing which root to merge “into” randomly, we can

perform unions by size: make smaller tree a sub-tree of the larger (yields O(M), where M = # of operations)

perform unions by height: make the shallower tree a sub-tree of the deeper one

with path compression, we merge nodes into the target root as we climb the “tree” on a find operation

Graphs and graph algorithms

See Chapter 9 of the text (esp. 9.1through 9.3)

basic definitions

vertices and edges

may keep extra info in nodes or on edges (often “weights” or “costs”, also colors, etc.)

adjacency

directed and undirected graphs

paths and path lengths

loops (self), and cycles

connected graph

strongly and weakly connected for digraphs

complete graph

Graph representation

as an adjacency matrix O(|V|^2)

… but probably sparse, so try adjacency list O(V*E)

Graph applications

airport travel

may be non-symmetric

may want to optimize relative to weights, etc.

(also see, e.e., colorings of nodes)

Topoological sort

list of edges that respects adjacency (for directed graph)

i.e., if a path from vi to vj, then vj appears after vi in the ordering

example of pre-requisite structure

doesn’t work with cycles, of course, since no way to respect the ordering

(no good starting point for cycle)

recursive algorithm:

start with “root” (no incoming edges)

print & remove

repeat

better:

keep “root” (-ish) in a “box"

when deleting edge, decrement the indegree of adjacent nodes

see pseudocode in Weiss p. 365

note: graph and in-degree computations already in place (is this fair?)

note how comparison of O(V^2) to O(V+E) is dependent on sparseness of graph

Shortest path; tech: single-source, shortest-path, for weighted (usual) or unweighted graphs

weighted path length (sum of weights) vs. unweighted (only path length)

path from (vi,vj) has cost cij

note possibility of negative edge weights and thus negative-cost cycles

interesting fact (Weiss p. 367):

“Currently there are no algorithms in which finding the path from s to one vertex is any faster (by more than a constant factor) than finding the path from s to all vertices.”

unweighted version: O(|E|+|V|)

ignore actual path, focus on length only: can recover path just by keeping track as we go

start by marking (shortest path to) source as length 0

use breadth-first traversal to increase (minimally) the markers on (starting with adjacent to s) nodes

keep a table with three pieces of info per node

boolean known

distance from s (starts at infinity, hopefully goes down)

bookkeeping for final path emission (printing)

weighted version: O(|E| log |V|)