External Memory Merge-Sort

COMP207 Lectures 2021-11-16

This part is not required for the exam.

We can perform faster joins provided that the tables are sorted or indexed. In order to sort the data quickly we can do this inside of memory, instead of on the disk.

Merge Sort

Divide input in two parts P1, P2
Sort P1 & P2 recursively
While P1 or P2 is not empty:
	Add smallest remaining element from P1 or P2 to output

External Memory Merge Sort

Divide input in M parts P1, P2, ..., PM
Sort P1, P2, ..., PM recursively
While not all Pi are empty:
	Add smallest remaining element from any part Pi to output

$M$ is the number of disk blocks that fit in RAM

Additionally, the following are true:

The RAM holds the smallest blocks from each part.
On each level or recursion we scan through all blocks once ($O(\frac NB)$) disk operation where $B$ is the block size.
We split in M buckets in each level or recursion until the remainder is below MB (where it fits in RAM and can be sorted directly): $\log_M(\frac M {MB})$ levels.
The total number of disk operations is $O(\frac N {B\log_M(\frac N {MB})})$

Comparison

In practice, since hard disks are slow, the running time is mostly spent on disk operations. Therefore, the standard running time is:

\[O(\frac N {B\log_M(\frac N {MB})})\]

where $D$ is the time of a disk operation.

Quick-sort (and other internal memory sorting algorithms) use the following order of time:

\[O(\frac N {B\log_2(N)\times D})\]

External memory sorting is much faster on inputs that are much larger than main memory.