SQL Queries (Optional Part)

COMP207 Lectures 2021-10-05

These aren’t required to form a valid query and are extensions to standard queries.

SQL queries have the following form:

SELECT [] FROM [] WHERE [] GROUP BY [] HAVING [] ORDER BY []

The first two parts are required to have a valid query.

`WHERE`

This is already mostly covered in SQL Data Manipulation Language (DML).

`IN`

In addition to providing a list like here you can also use sub-queries:

DELETE FROM Students WHERE name IN (SELECT name FROM Lecturers);

you can also do the same for a SELECT query:

SELECT * FROM Students WHERE name IN (SELECT name FROM Lecturers);

`EXISTS`

This is a generalisation of IN. With EXISTS you can write a sub-query and if it returns something, the current row will be kept, otherwise thrown out. You can use attributes from the current row in the sub-query:

SELECT *
FROM Students
WHERE EXISTS ( SELECT 1
	FROM Lecturers
	WHERE Students.name=Lecturers.name AND Students.last_name=Lecturers.last_name);

It is common to use the constant 1 when we just want to have an output or not.

Semi-Join

Most types on joins are done in the FROM clause. The only difference is semi-joins that are done from WHERE clauses.

A left semi-join between two tables $A$ and $B$, finds the subset of row in $A$ that would be used to form some row of the natural join of $A$ and $B$.

A right semi-join would find the subset of $B$ instead.

These are done using EXISTS (or IN, if there is only one shared attribute).

Example

We will complete a left semi-join for the following tables (Employees to Transactions):

birthday	first_name	family_name	e_id
1990-11-10	Anne	Smith	1
200-02-05	David	Jones	2
1995-05-09	William	Taylor	3

t_id	c_id	e_id
1	3	1
2	6	1
3	19	3

We would use the following query:

SELECT *
FROM Employees E
WHERE EXISTS (SELECT 1
	FROM Transaction T
	WHERE E.e_id=T.e_id);

This gives the following output:

birthday	first_name	family_name	e_id
1990-11-10	Anne	Smith	1
1995-05-09	William	Taylor	3

Notice that only one Anne row is output even though two transactions where done under here e_id. This is as this method can only remove rows.

`GROUP BY`

GROUP BY allows you to count transactions for each employee rather than just for one by using a WHERE clause. For the following tables:

birthday	first_name	family_name	e_id
1990-11-10	Anne	Smith	1
200-02-05	David	Jones	2
1995-05-09	William	Taylor	3

t_id	c_id	e_id
1	3	1
2	6	1
3	19	3

You could count the number of transactions for all e_id using the following query:

SELECT e_id, COUNT(t_id)
FROM Transactions
GROUP BY e_id;

This would return the following table:

e_id	COUNT(t_id)
1	2
3	1

Notice how the behaviour of COUNT changes due to the iteration.

Complications

When using GROUP BY you can only include in SELECT attributes that we GROUP BY or aggregates.

There are a couple of ways to overcome this:

Including additional attributes into the GROUP BY statement:

  SELECT first_name, e_id, COUNT(t_id)
  FROM Employees NATURAL JOIN Transactions
  GROUP BY first_name, e_id;

This seems like my preferred method.

Including the additional attribute in an aggregate:
```
  SELECT MIN(first_name), e_id, COUNT(t_id)
  FROM Employees NATURAL JOIN Transactions
  GROUP BY e_id;
```
This works as each e_id will have on associated first_name, taking the MIN will return the only entry.

These both give the following output where the first_name included too:

first_name	e_id	COUNT(t_id)
Anne	1	2
William	3	1

Intuitive Method

Intuitively GROUP BY works like this:

Do the previous part of the query until GROUP BY (except ignoring the part in SELECT)
Split the resulting table into sub-tables:
1. One sub-table for each value of the variables in GROUP BY. (If you had first_name and e_id, yo would have a sub-tale for each pair of first_name and e_id).
  
  This means that if you had two employees called Anne you would have two sub-tables, one for each Anne, identified by their e_id.

This means that for the following query you would get the following table (prior to SELECTing the required rows):

...
FROM Employees NATURAL JOIN Transactions
GROUP BY first_name, e_id;

first_name	e_id	sub-table(birthday)	sub-table(family_name)	sub-table(c_id)	sub-table(t_id)
Anne	1	1990-11-10	Smith	3	1
”	”	1990-11-10	Smith	6	2
William	3	1995-05-09	Taylor	19	3

As the table is grouped by the first_name and e_id these are pulled out of the table and redundant rows are removed. Aggregates are then calculated on the sub-tables of each group.

`Having`

This is just WHERE but done after a GROUP BY. This means that it doesn’t affect GROUP BY and can be used on aggregates on the sub-tables in its expressions.

For the following query:

SELECT first_name,e_id,COUNT(t_id)
FROM Employees NATURAL JOIN Transactions
GROUP BY first_name, e_id
HAVING MIN(c_id)<5;

you would get the following output:

first_name	e_id	COUNT(t_id)
Anne	1	2

This is for the same table as used on GROUP BY.

`ORDER BY`

This defines how the output is sorted:

SELECT first_name
FROM Employees
ORDER BY family_name;

first_name
David
Anne
William

You can also have more than one attribute in ORDER BY. Then it would order those attributes in order.

It defaults to ascending order. You can also do descending by writing DESC after the attribute you want to sort.