Computing Joins Using Sorting
Equijoins
An equijoin:
\[R\bowtie_{A=B}\]is defined as:
\[\sigma_{A=B}(R\times S)\]$A,B$ are the join attributes.
You can see an example on the following tables:
-
Stores
code city 12345 1 67910 2 -
Employees
name depart Oscar 12345 Janice 67910 David 67910 -
$\text{Stores}\bowtie_\text{code=depart}\text{Employees}$
code city name depart 1235 1 Oscar 12345 67910 2 Janice 679810 67910 2 David 678910
Merging (Merge Sort)
If $R$ is sorted on $A$ and $S$ is sorted on $B$, then $R\bowtie_{A=B}S$ can be computed with one pass over $R$ and $S$ + run time equal to the size of the output.
With sorted tables we can just step through the columns in the constraint ($A$ and $B$) while one is smaller than the other and just print out when the constraint is matched.
This gives a runtime of:
\[O\lvert R\rvert +\lvert S\rvert +\text{size of output}\]Merging with Duplicates in Column $A$
If there are duplicates in the left table then you can just loop through the right table again for the duplicate values.
Algorithm
You can use the following algorithm to compute this:
Sort $R$ on $A$
Sort $S$ on $B$
Merge the sorted $R$ and $S$
- Running time $O(\lvert R\rvert \times \log_2\lvert R\rvert)$
- Running time $O(\lvert S\rvert \times \log_2\lvert S\rvert)$
- Running time O\lvert R\rvert +\lvert S\rvert +\text{size of output}
Therefore the typical running time is:
\[O(\lvert R\rvert\log_2\lvert R\rvert +\lvert S\rvert\log_2\lvert S\rvert)\]This is much faster than nested loop join but is the same in the worse case scenario.
Running Time vs. Disk Access
- $B$ - Size of a disk block (512/4k bytes)
- $M$ - Number of disk blocks that fit into available RAM
Algorithm | No. of Elementary Operations | No. of Disk Accesses |
---|---|---|
Reading a relation $R$. | $O(\lvert R\rvert)$ | $O(\frac{\lvert R \rvert}B)$ |
Sorting $R$ on attrubte $A$ | $O(\lvert R\rvert\log_2\lvert R \rvert)$ | $O(\frac{\lvert R\rvert}B\log_M\frac{\lvert R \rvert}B)$ |