
C4 Sorting

Dated: 13-11-2024

Sorting is crucial for efficiency in large software systems and is a well-studied problem with provable lower bounds.
Given an array \(A[1 .. n]\), sorting reorders its elements into increasing or decreasing order.
Sorting can also apply to objects, using a key value from a totally ordered domain, where any two elements \(x\) and \(y\) satisfy exactly one of \(x < y\), \(x = y\), or \(x > y\).
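
For example, sorting objects by such a key in Python might look like this (an illustrative sketch; the record values are made up):

records = [("carol", 3), ("alice", 1), ("bob", 2)]
# Sort by the key field; the keys come from a totally ordered domain (ints).
by_key = sorted(records, key=lambda r: r[1])   # [('alice', 1), ('bob', 2), ('carol', 3)]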

Slow Sorting Algorithms

There are several well-known slow \(O(n^2)\) sorting algorithms.

Bubble Sort

  • Scan the array.
  • Whenever two consecutive items are found that are out of order, swap them.
  • Repeat until all consecutive items are in order.
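
A minimal Python sketch of bubble sort (0-based indexing; illustrative, not from the original notes):

def bubble_sort(A):
    swapped = True
    while swapped:                       # repeat until a full pass makes no swaps
        swapped = False
        for i in range(len(A) - 1):
            if A[i] > A[i + 1]:          # consecutive items out of order
                A[i], A[i + 1] = A[i + 1], A[i]
                swapped = True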

Insertion Sort

  • Assume that \(A[1..i − 1]\) have already been sorted.
  • Insert \(A[i]\) into its proper position in this subarray.
  • Create this position by shifting all larger elements to the right.
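
A corresponding Python sketch of insertion sort (0-based, illustrative):

def insertion_sort(A):
    for i in range(1, len(A)):           # A[0..i-1] is already sorted
        key = A[i]
        j = i - 1
        while j >= 0 and A[j] > key:     # shift larger elements to the right
            A[j + 1] = A[j]
            j -= 1
        A[j + 1] = key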

Selection Sort

  • Assume that \(A[1..i − 1]\) contain the \(i − 1\) smallest elements in sorted order.
  • Find the smallest element in \(A[i..n]\).
  • Swap it with \(A[i]\).
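
And selection sort as a Python sketch (0-based, illustrative):

def selection_sort(A):
    for i in range(len(A)):
        # Find the index of the smallest element in A[i..n-1] and swap it into place.
        m = min(range(i, len(A)), key=lambda k: A[k])
        A[i], A[m] = A[m], A[i]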

Sorting in \(O(n \log n)\) Time

Now we will study heapsort and quicksort.

Heaps

A heap is a left-complete binary tree with the heap order property: in a min heap, the parent’s key is smaller than or equal to its children’s keys, making the root the smallest.
In a max heap, the parent’s key is larger than or equal to its children’s keys, with the root being the largest.

Figure: A min-heap

The number of nodes in a complete binary tree of height \(h\) is

\[n = 2^0 + 2^1 + 2^2 + \ldots + 2^h = \sum_{i = 0}^h 2^i = 2^{h + 1} - 1\]

\(h\) in terms of \(n\) is

\[h = (\log(n + 1)) - 1 \approx \log n \in \Theta(\log n)\]

Heaps are stored in arrays using level-order traversal, leveraging the left-complete binary tree structure for pointer-free access via simple arithmetic.

\[\text{left}(i): \text{returns } 2i \text{, index of left child of node } i\]
\[\text{right}(i): \text{returns } 2i + 1 \text{, the right child}\]
\[\text{parent}(i) : \text{returns } \left\lfloor \frac i 2 \right\rfloor \text{, the parent of } i\]

The root is at position \(1\) of the array.
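
These computations are simple enough to sketch directly; the helper names below mirror the formulas above (1-based indexing, illustrative):

def left(i): return 2 * i        # index of the left child of node i
def right(i): return 2 * i + 1   # index of the right child
def parent(i): return i // 2     # index of the parent; the root is at index 1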

Heapsort Algorithm

A max heap is built from \(A[1 .. n]\), and the maximum is repeatedly extracted.
The root is replaced with the last leaf, and the heapify procedure restores heap order.

heapSort(array A, int n)
    buildHeap(A, n)
    m <- n
    while(m >= 2)
    do
        swap(A[1], A[m])
        m <- m - 1
        heapify(A, 1, m)

Figure: Example of heap sort

Heapify Procedure

Heapify restores heap order by sifting an out-of-place element down the tree, swapping it with its larger child until it is larger than both children or reaches a leaf.
Given heap \(A\), index \(i\) and size \(m\), find \(A[max]\) (largest of \(A[i]\) and its children), swap if needed, recurse on \(A[max]\).

heapify(array A, int i, int m)
    l <- left(i)
    r <- right(i)
    max <- i
    if (l <= m) and (A[l] > A[max])
    then
        max <- l
    if (r <= m) and (A[r] > A[max])
    then
        max <- r
    if (max != i)
    then
        swap(A[i], A[max])
        heapify(A, max, m)

Analysis of Heapify

Calling heapify on the root takes \(O(\log n)\) time due to at most \(\Theta(\log n)\) levels and \(O(1)\) work per level.
On leaves, it completes in \(\Theta(1)\).

Buildheap

To build a heap, start with the array \(A\) in its given order.
Begin Heapify from the last non-leaf node \((\left\lfloor\frac n 2\right\rfloor)\), as leaves are already in heap order.
Heapify assumes the subtrees of node \(i\) are already in heap order, so we must work bottom-up rather than start at the root.

buildHeap(array A, int n)
    for i <- n / 2 downto 1
    do
        heapify(A, i, n)
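
A runnable Python translation of the three procedures above, assuming 0-based indexing (so the child indices shift to \(2i + 1\) and \(2i + 2\)); this is an illustrative sketch:

def heapify(A, i, m):
    # Sift A[i] down within the heap region A[0..m-1].
    l, r = 2 * i + 1, 2 * i + 2
    largest = i
    if l < m and A[l] > A[largest]:
        largest = l
    if r < m and A[r] > A[largest]:
        largest = r
    if largest != i:
        A[i], A[largest] = A[largest], A[i]
        heapify(A, largest, m)

def build_heap(A):
    n = len(A)
    for i in range(n // 2 - 1, -1, -1):  # last non-leaf down to the root
        heapify(A, i, n)

def heap_sort(A):
    build_heap(A)
    for m in range(len(A) - 1, 0, -1):
        A[0], A[m] = A[m], A[0]          # move the current maximum to the end
        heapify(A, 0, m)                 # restore heap order on A[0..m-1]

data = [9, 3, 5, 6, 2, 1]
heap_sort(data)                          # data becomes [1, 2, 3, 5, 6, 9]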

Analysis of Buildheap

Assume \(n = 2^{h + 1} - 1\) for a tree of height \(h\).
In a left-complete binary tree, level \(j < h\) has \(2^j\) nodes, and level \(h\) has at most \(2^h\) nodes.
The work done by buildHeap depends on traversing and heapifying nodes across all levels.

Figure: Total work performed in buildHeap

At level \(h\) (the leaves), there are \(2^h\) nodes, which need no heapifying.
At level \(h - 1\), there are \(2^{h - 1}\) nodes, each of which may shift down \(1\) level.
Generally, \(j\) levels above the leaves there are \(2^{h - j}\) nodes, each potentially shifting down \(j\) levels.
Counting from bottom to top, level by level, the total time is

\[T(n) = \sum_{j=0}^{h} j2^{h-j} = \sum_{j=0}^{h} j\frac{2^h}{2^j}\]

We can factor out the \(2^h\) term

\[T(n) = 2^h \sum_{j=0}^{h} \frac{j}{2^j}\]

Recall the geometric series for any constant \(|x| < 1\).

\[\sum_{j=0}^{\infty} x^j = \frac{1}{1-x}\]

Taking the derivative [1] with respect to \(x\)

\[\sum_{j=0}^{\infty} jx^{j-1} = \frac{1}{(1-x)^2}\]

And then multiplying by \(x\), we have

\[\sum_{j=0}^{\infty} jx^j = \frac{x}{(1-x)^2}\]

Then plug in \(x = \frac 1 2\)

\[\sum_{j=0}^{\infty} \frac{j}{2^j} = \frac{\frac 1 2}{(1 - \frac 1 2)^2} = \frac{\frac 1 2}{\frac 1 4} = 2\]
\[T(n) = 2^h \sum_{j=0}^{h} \frac{j}{2^j}\]
\[\le 2^h \sum_{j=0}^{\infty} \frac{j}{2^j}\]
\[\le 2^h \cdot 2 = 2^{h + 1}\]
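
A quick numerical sanity check of the series value used above (illustrative):

# Partial sums of j / 2^j approach 2.
print(sum(j / 2**j for j in range(200)))   # prints a value equal to 2 up to rounding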

Recall that \(n = 2^{h + 1} - 1\).
Therefore,

\[T(n) \le n + 1 \in O(n)\]

buildHeap takes \(\Theta(n)\) time: the calculation above gives \(O(n)\), and any algorithm that must access all \(n\) elements takes \(\Omega(n)\).
Intuitively, buildHeap is fast because most nodes of a complete binary tree lie near the bottom, where heapify does little or no work.
The number of nodes at the bottom three levels alone is

\[2^h + 2^{h-1} + 2^{h-2} = \frac{n}{2} + \frac{n}{4} + \frac{n}{8} = \frac{7n}{8} = 0.875n\]

Most nodes (about 87.5%) in a complete binary tree are in the 3 lowest levels.
Algorithms like buildHeap must focus on efficiency at these levels where most of the tree's weight lies.

Analysis of Heapsort

Heapsort calls buildHeap in \(\Theta(n)\) time and then extracts \(n\) elements, each extraction taking \(O(\log n)\) for heapify.
Thus, heapSort is \(O(n \log n)\).
Comparison-based sorting algorithms, including heapSort and mergeSort, cannot run faster than \(\Omega(n \log n)\).

Quicksort

Quicksort is one of the fastest sorting algorithms in practice and is widely used.
It follows the divide and conquer strategy.

quickSort(array A, int p, int r)
    if (r > p)
    then
        i <- a random index from [p .. r]
        swap A[i] with A[p]
        q <- partition(A, p, r)
        quickSort(A, p, q - 1)
        quickSort(A, q + 1, r)

Partition Algorithm

The partition algorithm divides array \(A[p..r]\) into three parts:

  1. \(A[p..q−1]\) with elements \(<\) pivot \(x\)
  2. \(A[q] = x\)
  3. \(A[q+1..r]\) with elements \(\ge x\).

The pivot is initially \(A[p]\), but if another pivot selection method is used, we swap it with \(A[p]\).
The invariant during the scan is that \(A[p] = x\), \(A[p+1..q]\) holds elements \(< x\), \(A[q+1..s−1]\) holds elements \(\ge x\), and \(A[s..r]\) holds unexamined values.

partition (array A, int p, int r)
    x <- A[p]
    q <- p
    for s <- p + 1 to r
    do 
        if (A[s] < x)
        then
            q <- q + 1
            swap A[q] and A[s]
    swap A[p] with A[q]
    return q
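
The same two procedures as a runnable Python sketch (0-based indexing, random pivot as in quickSort above; illustrative):

import random

def partition(A, p, r):
    x = A[p]                      # pivot
    q = p
    for s in range(p + 1, r + 1):
        if A[s] < x:              # grow the region of elements < x
            q += 1
            A[q], A[s] = A[s], A[q]
    A[p], A[q] = A[q], A[p]       # place the pivot between the two regions
    return q

def quick_sort(A, p=0, r=None):
    if r is None:
        r = len(A) - 1
    if r > p:
        i = random.randint(p, r)  # random pivot index
        A[i], A[p] = A[p], A[i]
        q = partition(A, p, r)
        quick_sort(A, p, q - 1)
        quick_sort(A, q + 1, r)

data = [10, 5, 13, 2, 8]
quick_sort(data)                  # data becomes [2, 5, 8, 10, 13]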

Figure: Trace of the partitioning algorithm

Quick Sort Example

Figure: Example of quick sort

This figure shows the quicksort algorithm; each row is the array after one step.
The array is first partitioned using \(10\) as the pivot.
The left portion is partitioned around \(5\), and the right around \(13\).
\(10\) ends up in its final sorted position.
The process repeats recursively, sorting the array.

Figure: Quick sort

Analysis of Quicksort

Quicksort's running time depends on pivot selection.
If the pivot is poorly chosen, the partition can be unbalanced, leading to a worst-case time of \(O(n^2)\).
With random pivot selection, the expected time is \(O(n \log n)\); balanced partitions are far more likely than degenerate ones.

Worst case Analysis of Quick Sort

To analyze the worst-case performance of quicksort, we use a recurrence.
Given an array of size \(n\), if the pivot has rank \(q\), partitioning takes \(\Theta(n)\) time.
The two recursive calls are made on subarrays \(A[1 : q − 1]\) (size \(q − 1\)) and \(A[q + 1 : n]\) (size \(n − q\)).
Writing the \(\Theta(n)\) partitioning cost simply as \(n\), we get

\[T(n) = T(q-1) + T(n-q) + n\]

The worst case occurs by maximizing over all possible values of \(q\).

\[ T(n) = \begin{cases} 1 \quad \text{ if } n \le 1 \\ \max_{1 \le q \le n} (T(q - 1) + T(n - q) + n) \quad \text{ otherwise} \end{cases} \]

Recurrences with max’s and min’s are tricky, but the worst case often occurs at the extremes or middle.
By testing \(q = 1\), \(q = n\), and \(q = \frac n 2\), we find that the worst case occurs at the extremes.
Expanding the recurrence for \(q = 1\), we get:

\[T(n) \leq T(0) + T(n-1) + n\]
\[= 1 + T(n - 1) + n\]
\[= T(n - 1) + (n + 1)\]
\[= T(n − 2) + n + (n + 1)\]
\[= T(n-k) + \sum_{i=-1}^{k-2} (n-i)\]

For the basis \(T(1) = 1\), we set \(k = n - 1\) and get

\[T(n) \leq T(1) + \sum_{i=-1}^{n-3} (n-i)\]
\[= 1 + (3 + 4 + 5 + \ldots + (n − 1) + n + (n + 1))\]
\[\le \sum_{i=1}^{n+1} i = \frac{(n+1)(n+2)}{2} \in \Theta(n^2)\]
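
The worst-case recurrence can also be evaluated numerically against the quadratic bound (an illustrative check):

# T(n) = max over q of T(q-1) + T(n-q) + n, with T(0) = T(1) = 1.
N = 50
T = [1, 1]
for n in range(2, N + 1):
    T.append(max(T[q - 1] + T[n - q] + n for q in range(1, n + 1)))
print(all(T[n] <= (n + 1) * (n + 2) // 2 for n in range(N + 1)))   # True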

Average case Analysis of Quicksort

In the average case, quicksort runs in \(\Theta(n \log n)\) time.
The analysis doesn't depend on input distribution, but on random pivot choices.
Assuming all elements are distinct, each pivot choice has an equal probability of \(\frac 1 n\).
The average running time is denoted by \(T(n)\), representing the time for a list of size \(n\).
So we can modify the above recurrence to compute an average rather than a max

\[ T(n) = \begin{cases} 1 \quad \text{ if } n \le 1 \\ \frac 1 n \sum_{q = 1}^n (T(q - 1) + T(n - q) + n) \quad \text{ otherwise} \end{cases} \]

The time \(T(n)\) is the weighted sum of the times taken for various choices of \(q\).

\[ \begin{align} T(n) &= \frac 1 n (T(0) + T(n - 1) + n) \\ &+ \frac 1 n (T(1) + T(n − 2) + n) \\ &+ \frac 1 n (T(2) + T(n − 3) + n) \\ &+ \ldots + \frac 1 n (T(n − 1) + T(0) + n) \end{align} \]

We have not seen such a recurrence before.
To solve it, expansion is possible but it is rather tricky.
We will attempt a constructive induction to solve it.
We expect a \(\Theta(n \log n)\) bound.
Let us assume that \(T(n) \le cn \ln n\) for \(n \ge 2\), where \(c\) is a constant.
For the base case \(n = 2\), we have

\[T(n) = \frac{1}{2} \sum_{q=1}^{2} (T(q-1) + T(2-q) + 2)\]
\[= \frac 1 2 \left( (T(0) + T(1) + 2) + (T(1) + T(0) + 2) \right)\]
\[= \frac 8 2 = 4\]

We want this to be at most \(2c \ln 2\)

\[T(2) \le 2c \ln 2\]
\[4 \le 2c \ln 2\]
\[c \ge \frac 4 {2 \ln 2} \approx 2.88\]

In the induction step, we assume \(n \ge 3\) and that \(T(n_0) \le c n_0 \ln n_0\) for all \(2 \le n_0 < n\).
We aim to prove this for \(T(n)\).
We expand \(T(n)\) and move the factor \(n\) outside of the sum [1].

\[T(n) = \frac{1}{n} \sum_{q=1}^{n} (T(q-1) + T(n-q) + n)\]
\[=n + \frac{1}{n} \sum_{q=1}^{n} (T(q-1) + T(n-q))\]
\[= n + \frac{1}{n} \sum_{q=1}^{n} T(q-1) + \frac{1}{n} \sum_{q=1}^{n} T(n-q)\]

The two sums [1] add the same values, \(T(0) + T(1) + \ldots + T(n − 1)\), but in different orders.
We replace them with \(2 \sum_{q = 0}^{n - 1} T(q)\) and handle \(T(0)\) and \(T(1)\) separately, as they don't follow the formula.

\[T(n) = \frac{2}{n} \left(\sum_{q=0}^{n-1} T(q) \right) + n\]
\[=\frac{2}{n}\left(T(0)+T(1)+\sum_{q=2}^{n-1}T(q)\right)+n\]

Applying the induction hypothesis for \(2 \le q < n\), we have

\[T(n) = \frac{2}{n}\left(T(0) + T(1) + \sum_{q=2}^{n-1} T(q)\right) + n\]
\[\leq \frac{2}{n}\left(1 + 1 + \sum_{q=2}^{n-1} cq \ln q\right) + n\]
\[= \frac{2c}{n} \left(\sum_{q=2}^{n-1} q \ln q \right) + n + \frac{4}{n}\]
\[\because S(n) = \sum_{q=2}^{n-1} q \ln q\]
\[\because \sum_{i=a}^{b-1} f(i) \leq \int_{a}^{b} f(x)\, dx \text{ for nondecreasing } f\]
\[\therefore S(n) = \sum_{q=2}^{n-1} q \ln q \leq \int_2^n x \ln x\, dx\]

Performing integration [2] by parts, we have

\[\int_{2}^{n} x \ln x dx = \left[\frac{x^2}{2} \ln x - \frac{x^2}{4}\right]_{x=2}^{n}\]
\[= \left(\frac{n^2}{2} \ln n - \frac{n^2}{4}\right) - (2 \ln 2 - 1)\]
\[\leq \frac{n^2}{2} \ln n - \frac{n^2}{4}\]
\[S(n) = \sum_{q=2}^{n-1} q \ln q \leq \frac{n^2}{2} \ln n - \frac{n^2}{4}\]

Substituting back into the expression for \(T(n)\), we get

\[T(n) \le \frac{2c}{n}\left(\frac{n^2}{2} \ln n - \frac{n^2}{4}\right) + n + \frac{4}{n}\]
\[= cn \ln n - \frac{cn}{2} + n + \frac{4}{n}\]
\[= cn \ln n + n\left(1 - \frac{c}{2}\right) + \frac{4}{n}\]

To finish the proof, we want all of this to be at most \(cn \ln n\).
For this to happen, we need to select \(c\) such that

\[n\left(1-\frac{c}{2}\right)+\frac{4}{n}\le0\]

If we select \(c = 3\) and use the fact that \(n \ge 3\), we get

\[n\left(1-\frac{c}{2}\right)+\frac{4}{n}=\frac{4}{n}-\frac{n}{2}\]
\[\leq \frac{4}{3} - \frac{3}{2} = -\frac{1}{6} \leq 0\]

From the basis case, we have \(c \ge 2.88\).
Choosing \(c = 3\) satisfies all constraints.
Thus

\[T(n) \le 3n \ln n \in O(n \log n)\]
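
As a sanity check, the average-case recurrence can be evaluated numerically and compared against \(3n \ln n\) (illustrative sketch):

import math

# T(n) = n + (2/n) * (T(0) + ... + T(n-1)), with T(0) = T(1) = 1.
N = 10000
T = [1.0, 1.0]
prefix = 2.0                                 # running sum T(0) + ... + T(n-1)
for n in range(2, N + 1):
    T.append(n + 2.0 * prefix / n)
    prefix += T[n]
print(all(T[n] <= 3 * n * math.log(n) for n in range(2, N + 1)))   # True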

In-place, Stable Sorting

An in-place sorting algorithm uses no extra storage, and a stable sorting algorithm keeps duplicate elements in the same relative order.

Unsorted

9 3 3' 5 6 5' 2 1 3''

Stable Sort

1 2 3 3' 3'' 5 5' 6 9

Unstable

1 2 3' 3 3'' 5' 5 6 9
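
Stability can be observed directly in Python, whose built-in sorted is stable; the primed tags below distinguish the duplicates (illustrative):

items = [(9, ""), (3, ""), (3, "'"), (5, ""), (6, ""), (5, "'"), (2, ""), (1, ""), (3, "''")]
stable = sorted(items, key=lambda t: t[0])
# The 3, 3', 3'' and the 5, 5' keep their original relative order.
print(stable)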

Bubble sort, insertion sort, and selection sort are in-place algorithms.
Bubble and insertion sort can be stable, but selection sort cannot without modifications.
Mergesort is stable but not in-place, requiring extra storage.
Quicksort and heapsort are both in-place but not stable.

Lower Bounds for Sorting

The best sorting algorithms so far are \(O(n \log n)\).
If sorting relies solely on comparisons, \(\Omega(n \log n)\) comparisons are required in the worst case.
All the algorithms we've seen are comparison-based.
Consider sorting three numbers \(a_1\), \(a_2\), \(a_3\).
There are \(3! = 6\) possible permutations:

\[(a_1, a_2, a_3)\]
\[(a_1, a_3, a_2)\]
\[(a_3, a_2, a_1)\]
\[(a_3, a_1, a_2)\]
\[(a_2, a_1, a_3)\]
\[(a_2, a_3, a_1)\]

Exactly one of these permutations gives the numbers in sorted order.
A comparison-based algorithm defines a decision tree.

Figure: Decision tree

For \(n\) elements, there are \(n!\) possible permutations.
The height of the decision tree is \(T(n)\), the worst-case number of comparisons; any path from the root to a leaf represents one sequence of comparisons.
A binary tree of height \(T(n)\) has at most \(2^{T(n)}\) leaves, so a comparison-based sorting algorithm can distinguish at most \(2^{T(n)}\) outcomes.

\[2^{T(n)} \ge n!\]
\[T(n) \ge \log(n!)\]

We can use Stirling’s approximation for \(n!\)

\[n! \ge \sqrt{2\pi n} \left(\frac{n}{e}\right)^n\]
\[T(n) \geq \log \left(\sqrt{2\pi n} \left(\frac{n}{e}\right)^n\right)\]
\[= \log(\sqrt{2\pi n}) + n \log n - n \log e \in \Omega(n \log n)\]
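
The bound can be checked numerically; since \(\ln n! = \operatorname{lgamma}(n + 1)\), \(\log_2 n!\) is easy to compute (illustrative):

import math

for n in (10, 100, 1000):
    log2_factorial = math.lgamma(n + 1) / math.log(2)     # log2(n!)
    print(n, round(log2_factorial), round(n * math.log2(n)))
# log2(n!) tracks n*log2(n) - n*log2(e), i.e., it is Theta(n log n).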

Theorem

Any comparison-based sorting algorithm has worst-case running time \(\Omega(n \log n)\).

References


  1. Read more about sum
  2. Read more about integration