I. Sorting algorithms

   A. Introduction
      1. Sorting: a fundamental problem
         a. many applications in computing either directly concern sorting of data (e.g., pre-sorting address labels to take advantage of postal discounts) or use sorting as an integral component (e.g., when finding duplicates)
      2. Sorting and other algorithms
         a. often the performance of other algorithms, e.g., searching for data based on a key, can be improved when the data is first sorted
      3. Case analysis and input data
         a. it is especially important to consider the characteristics of input data when choosing sorting algorithms: for example, many sorts break down in performance if the input data is already sorted (or "anti-sorted")
      4. Simple, sophisticated and special-case sorting
         a. many "naive" sorting algorithms can be classified as simple because they are based on a progressive boundary technique that tends to give O(n^2) running times
         b. more sophisticated algorithms tend to use a recursive structure (even if indirectly) which makes about O(n log n) comparisons between items
         c. occasionally we can make use of special-case algorithms which take advantage of particular kinds of input structure (e.g., radix sort)
      5. Theoretical limitations of comparison-based sorts
         a. in fact, we can prove a theoretical result about comparison-based techniques which shows that O(n log n) is the best we can achieve under those circumstances

   B. Simple sorting algorithms
      1. General character of simple sorts
         a. with simple sorting techniques, we imagine a boundary separating the sorted from the unsorted parts of the data; the boundary is incrementally advanced until all data is sorted
         b. each incremental adjustment usually involves, at least potentially, comparison between the "new item" and all of the items in the sorted area
      2. Insertion sort
         a. with insertion sort, on each increment we move the "new item" from the unsorted area into its correct position in the sorted area, shifting items over as needed (sketches of the simple sorts follow this section)
      3. Better insertion sorts
         a. in order to improve the efficiency of insertion sort, we can use a better-than-linear method, such as binary search, for finding the correct position for the new item among the sorted ones (note however that we must still shift items over to make room)
      4. Bubble sort
         a. with bubble sort, we move new items into the boundary area by repeatedly comparing adjacent items in the unsorted area, swapping them if they are out of order, until the extreme value (i.e., either the minimum or the maximum) moves into the sorted area
      5. Selection sort
         a. with selection sort, we find the extreme element in the unsorted area and exchange it with the item in the "new" position
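
   A minimal Python sketch of insertion sort as described in B.2 (the name insertion_sort and the in-place, list-of-comparables interface are illustrative choices, not from the text):

       def insertion_sort(a):
           """Sort list a in place. The boundary between the sorted prefix
           a[:i] and the unsorted suffix advances one item per pass."""
           for i in range(1, len(a)):
               new_item = a[i]            # first item of the unsorted area
               j = i
               while j > 0 and a[j - 1] > new_item:
                   a[j] = a[j - 1]        # shift larger items over
                   j -= 1
               a[j] = new_item            # drop into the correct position

   The "better insertion sort" of B.3 amounts to finding the position with binary search (Python's bisect module does this); the comparison count drops to O(log n) per item, but the O(n) shifting per pass remains.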
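
   A corresponding sketch of bubble sort (B.4), again with illustrative names; each pass bubbles the maximum of the unsorted area up to the boundary:

       def bubble_sort(a):
           """Sort list a in place; the sorted area grows from the right."""
           n = len(a)
           for i in range(n - 1):
               for j in range(n - 1 - i):     # unsorted area is a[:n-i]
                   if a[j] > a[j + 1]:        # adjacent pair out of order
                       a[j], a[j + 1] = a[j + 1], a[j]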
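
   And a sketch of selection sort (B.5); here a single exchange per pass moves the minimum of the unsorted area into the "new" position:

       def selection_sort(a):
           """Sort list a in place; the sorted area grows from the left."""
           n = len(a)
           for i in range(n - 1):
               m = i
               for j in range(i + 1, n):      # scan the unsorted area
                   if a[j] < a[m]:
                       m = j                  # index of the extreme element
               a[i], a[m] = a[m], a[i]        # exchange into position i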
   C. Quicksort (analysis)
      1. Basic form of quicksort
         a. to quicksort a collection of items, we follow this basic structure:
            i. pick a particular item, called the pivot;
            ii. partition all of the items in the collection into those which are less than the pivot and those which are greater;
            iii. recursively quicksort the two partitions.
      2. Best-case analysis
         a. under optimum conditions, the two partitions are of roughly equal size; each recursive call is on half the size of the original, so the recursion is O(log n) levels deep, each level with linear overhead, for a total time which is O(n log n)
      3. Worst-case analysis
         a. in the worst case, the partition will be completely "lopsided" and the recursion will be O(n) levels deep, each with linear overhead, for a total time which is O(n^2)
      4. Average-case analysis (hard)
         a. see text pp. 238-240

   D. Quicksort (details)
      1. Choosing the pivot
         a. the preferred method of choosing a pivot is the median-of-three approach, which takes the median of the first, last and middle values in the array
         b. this approach has the advantage that it is fairly quick and simple, and does not degrade when the input is sorted or anti-sorted
      2. Partitioning: burning the candle at both ends
         a. in order to partition the values in an array, we can move two counters from the ends of the array toward the middle, swapping values when we find items which are too large on the left and too small on the right (relative to the pivot value)
      3. Keys equal to the pivot
         a. by symmetrically stopping both counters on (and swapping) values equal to the pivot, we can actually achieve a more balanced partition when there are many pivot-equal values
      4. Optimizing for small arrays
         a. there is some extra overhead expense to recursive calls; studies show that cutting off the recursion and using (e.g.) insertion sort for arrays of size roughly 10 improves performance (a combined sketch of sections C and D appears at the end of this outline)

   E. Mergesort
      1. (see discussion in the book)

   F. Shellsort
      1. Basic form of shellsort
         a. repeatedly sort dispersed sub-arrays whose elements are separated by some fixed gap, decreasing the gap over time (sketch at the end of this outline)
      2. Shellsort in the limit
         a. the final phase of shellsort should use a gap of 1; this degenerates to insertion sort, but on a nearly sorted array (assuming the previous phases have done some work)
      3. Performance of shellsort
         a. measurements and analysis show that shellsort can improve over insertion sort to about O(n^r) for various fractions r between 1 and 2, depending on the gap sequence

   G. Heapsort
      1. Basic form of heapsort
         a. insert each item into a priority queue, using the sort key as the priority; then extract each item using delete_min, storing the results one after the other (sketch at the end of this outline)
      2. Analysis of heapsort
         a. assuming we can support both insertion and delete_min in O(log n) time (idea: use trees), we can achieve O(n log n) time with heapsort

   H. Radix sort
      1. (to be discussed later)
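
   Putting sections C and D together, here is one possible Python sketch of quicksort with a median-of-three pivot, a two-counter partition whose counters both stop on pivot-equal keys, and an insertion-sort cutoff for small sub-arrays; the names (quicksort, median_of_three, CUTOFF) and the cutoff value of 10 are illustrative, not the text's exact listing:

       CUTOFF = 10    # below this size, plain insertion sort is cheaper (D.4)

       def insertion_sort_range(a, lo, hi):
           """Insertion-sort the slice a[lo..hi] in place."""
           for i in range(lo + 1, hi + 1):
               new_item = a[i]
               j = i
               while j > lo and a[j - 1] > new_item:
                   a[j] = a[j - 1]
                   j -= 1
               a[j] = new_item

       def median_of_three(a, lo, hi):
           """Order a[lo], a[mid], a[hi]; the median lands at a[mid] and is
           stashed at a[hi-1], leaving a[lo] and a[hi] as sentinels (D.1)."""
           mid = (lo + hi) // 2
           if a[mid] < a[lo]:
               a[lo], a[mid] = a[mid], a[lo]
           if a[hi] < a[lo]:
               a[lo], a[hi] = a[hi], a[lo]
           if a[hi] < a[mid]:
               a[mid], a[hi] = a[hi], a[mid]
           a[mid], a[hi - 1] = a[hi - 1], a[mid]
           return a[hi - 1]

       def quicksort(a, lo=0, hi=None):
           if hi is None:
               hi = len(a) - 1
           if hi - lo + 1 <= CUTOFF:
               insertion_sort_range(a, lo, hi)   # cut off the recursion (D.4)
               return
           pivot = median_of_three(a, lo, hi)
           i, j = lo, hi - 1                     # burn the candle at both ends (D.2)
           while True:
               i += 1
               while a[i] < pivot:               # stops on keys == pivot (D.3)
                   i += 1
               j -= 1
               while a[j] > pivot:               # likewise on the right
                   j -= 1
               if i >= j:
                   break
               a[i], a[j] = a[j], a[i]
           a[i], a[hi - 1] = a[hi - 1], a[i]     # restore pivot between partitions
           quicksort(a, lo, i - 1)               # recursively sort both sides (C.1)
           quicksort(a, i + 1, hi)

   Usage: quicksort(xs) sorts the list xs in place. Because median_of_three leaves a[lo] <= pivot <= a[hi], the inner scans cannot run off the ends of the sub-array.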
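
   A Python sketch of shellsort (section F), using Shell's original halving gap sequence for simplicity; other gap sequences give the better exponents mentioned in F.3, and the name shellsort is again an illustrative choice:

       def shellsort(a):
           """Sort list a in place by gap-insertion sorts with shrinking gaps;
           the final gap-1 pass is insertion sort on a nearly sorted array (F.2)."""
           gap = len(a) // 2
           while gap > 0:
               for i in range(gap, len(a)):
                   new_item = a[i]
                   j = i
                   while j >= gap and a[j - gap] > new_item:
                       a[j] = a[j - gap]   # shift within the dispersed sub-array
                       j -= gap
                   a[j] = new_item
               gap //= 2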
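
   A Python sketch of heapsort in the priority-queue formulation of section G, using the standard-library heapq module as the priority queue; note this version returns a new list, whereas the classic in-place variant builds the heap inside the array itself:

       import heapq

       def heapsort(a):
           """n insertions then n delete_mins, each O(log n): O(n log n) total."""
           pq = []
           for item in a:
               heapq.heappush(pq, item)    # insertion: O(log n)
           return [heapq.heappop(pq) for _ in range(len(pq))]   # delete_min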