UoL CS Notes

Computing Joins Using Sorting

COMP207 Lectures 2021-11-16

Equijoins

An equijoin:

\[R\bowtie_{A=B}\]

is defined as:

\[\sigma_{A=B}(R\times S)\]

$A,B$ are the join attributes.

You can see an example on the following tables:

Stores

code city

12345 1

67910 2

code	city
12345	1
67910	2

Employees

name	depart
Oscar	12345
Janice	67910
David	67910

$\text{Stores}\bowtie_\text{code=depart}\text{Employees}$

code	city	name	depart
1235	1	Oscar	12345
67910	2	Janice	679810
67910	2	David	678910

Merging (Merge Sort)

If $R$ is sorted on $A$ and $S$ is sorted on $B$, then $R\bowtie_{A=B}S$ can be computed with one pass over $R$ and $S$ + run time equal to the size of the output.

With sorted tables we can just step through the columns in the constraint ($A$ and $B$) while one is smaller than the other and just print out when the constraint is matched.

This gives a runtime of:

\[O\lvert R\rvert +\lvert S\rvert +\text{size of output}\]

Merging with Duplicates in Column $A$

If there are duplicates in the left table then you can just loop through the right table again for the duplicate values.

Algorithm

You can use the following algorithm to compute this:

Sort $R$ on $A$
Sort $S$ on $B$
Merge the sorted $R$ and $S$

Running time $O(\lvert R\rvert \times \log_2\lvert R\rvert)$
Running time $O(\lvert S\rvert \times \log_2\lvert S\rvert)$
Running time O\lvert R\rvert +\lvert S\rvert +\text{size of output}

Therefore the typical running time is:

\[O(\lvert R\rvert\log_2\lvert R\rvert +\lvert S\rvert\log_2\lvert S\rvert)\]

This is much faster than nested loop join but is the same in the worse case scenario.

Running Time vs. Disk Access

$B$ - Size of a disk block (512/4k bytes)
$M$ - Number of disk blocks that fit into available RAM

Algorithm	No. of Elementary Operations	No. of Disk Accesses
Reading a relation $R$.	$O(\lvert R\rvert)$	$O(\frac{\lvert R \rvert}B)$
Sorting $R$ on attrubte $A$	$O(\lvert R\rvert\log_2\lvert R \rvert)$	$O(\frac{\lvert R\rvert}B\log_M\frac{\lvert R \rvert}B)$