COMP207 (Databases)
This is a basic technique for data-mining.
Market-Basket Data
Data can be described by:
- A set of items $I$.
- A set of baskets $B$:
- Each basket $b\in B$ is a subset of $I$.
For example:
| Purchase ID | Items Bought |
| --- | --- |
| 101 | milk, bread, cookies, juice |
| 792 | milk, juice |
| 1130 | milk, bread, eggs |
| 1735 | bread, cookies, coffee |
In this table:
- $I=$ all unique items that were bought.
- {bread, coffee, eggs, juice, milk}
- $B=$ the set of baskets represented by each purchase ID.
Frequent-Itemset Mining
Suppose that, for the previous table, we are asked:
Which items occur frequently together in a basket?
Given a set of items $I$ and a set of baskets $B$:
- The support of a subset $J$ of $I=$ frequency with which the items in $J$ occur together in a basket in $B$.
Therefore to calculate the support of a set of items $J$ we can use:
\[\frac{\text{number of baskets in }B\text{ containing all items in }J}{\text{number of baskets in }B}\]
The support of {milk, juice} $=\frac 2 4=0.5$ as the set is contained in two out of the four baskets.
There are additional examples of calculating support at the end of the lecture.
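As a rough illustration (not from the slides), the support calculation can be written directly in Python; the baskets below are the ones from the example table:

baskets = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "bread", "eggs"},
    {"bread", "cookies", "coffee"},
]

def support(itemset, baskets):
    # Fraction of baskets that contain every item in the itemset.
    containing = sum(1 for b in baskets if itemset <= b)
    return containing / len(baskets)

print(support({"milk", "juice"}, baskets))  # 0.5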
Frequent Itemsets
A subset $J$ of $I$ is frequent if its support is at least $s$:
- $s$ - Support threshold specified by the user.
There are examples of classifying frequent itemsets and frequent pairs at the end of the lecture.
Complex Tables
Sometimes we have tables with additional columns which make the baskets hard to define:
| Purchase ID | Customer ID | Items Bought |
| --- | --- | --- |
| 101 | A | milk, bread, cookies, juice |
| 792 | B | milk, juice |
| 1130 | A | milk, bread, eggs |
| 1735 | C | bread, cookies, coffee |
Customer Baskets
We could define the baskets per customer:
- This would mean one basket per Customer ID.
- Only add distinct items to the basket of the customer.
We can also do this with respect to any of the other columns, such as Purchase ID.
This would generate the following table:
| Customer ID | Items Bought |
| --- | --- |
| A | milk, bread, cookies, juice, eggs |
| B | milk, juice |
| C | bread, cookies, coffee |
We then define all other notions using this new table and the following language:
- The support of a set $J$ with respect to the customers - The support of $J$ computed using the new set of baskets.
- $J$ is frequent with respect to the customers - If its support with respect to the customers is at least the support threshold.
There is an additional example on slide 8 of the lecture with generic item names.
Association Rules
This is a variant of frequent-item mining queries.
They take the form:
\[\{i_1,i_2,\ldots,i_n\}\Rightarrow j\]
For example: $\{\text{milk}\}\Rightarrow \text{juice}$, i.e. customers who buy milk also tend to buy juice.
Properties of Association Rules
The support of $\{i_1,i_2,\ldots,i_n,j\}$ must be high, otherwise the association won't have any meaning.
We can calculate confidence like so:
\[\frac{\text{support of }\{i_1,i_2,\ldots,i_n,j\}}{\text{support of }\{i_1,i_2,\ldots,i_n\}}\]
This is the percentage of people who bought all of the items vs those who just bought the indicator items.
- This should also be high.
- Should differ significantly from fraction of baskets containing $j$.
We could phrase this association like:
67% of all customers who bought milk also bought juice.
There are examples of calculating confidence starting at slide 11 of the lecture.
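As a hedged sketch (illustrative only, not from the slides), the confidence of the rule {milk} ⇒ juice on the example baskets can be computed like this in Python:

baskets = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "bread", "eggs"},
    {"bread", "cookies", "coffee"},
]

def support(itemset):
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(lhs, rhs):
    # Support of lhs ∪ {rhs} divided by support of lhs.
    return support(lhs | {rhs}) / support(lhs)

print(confidence({"milk"}, "juice"))  # 2/3 ≈ 0.67, i.e. the 67% above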
Finding Association Rules
- Focus on association rules with high support:
- At least the support threshold $s$.
- Compute the set $F$ of all itemsets $J$ with support $\geq s$.
- From $F$ it is easy to obtain the association rules:
- For each set $J$ in $F$ and each $j$ in $J$, consider the rule $J\setminus\{j\}\Rightarrow j$.
- Compute the confidence for $J\setminus\{j\}\Rightarrow j$ and compare it with the fraction of baskets containing $j$.
A-Priori Algorithm
The goal is to compute all itemsets $J$ with support $\geq s$. The idea is that:
\[\text{If }J\text{ has support }\geq s\text{, then all subsets of }J\text{ have support }\geq s\text{.}\]
Using this we compute frequent itemsets from smaller ones:
- $F_1:=$ itemsets $\{i\}$ with support $\geq s$.
- $F_2:=$ itemsets $\{i,j\}$ with $\{i\},\{j\}\in F_1$ and support $\geq s$.
- $F_3:=$ itemsets $\{i,j,k\}$ with $\{i,j\},\{i,k\},\{j,k\}\in F_2$ and support $\geq s$.
- …
- Input:
- The set of items $I$.
- The set of baskets $B$.
- Max size $q$.
- Support threshold $s$.
- Output - Subsets $J$ of $I$ with $\lvert J\rvert\leq q$ and support $\geq s$.
C1 := {{i} : i∈I}
for k = 1 to q do
Fk := {J ∈ Ck : J has support ≥ s}
if k = q then stop
Ck+1 := {J ⊆ I : |J| = k + 1 and all subsets of J of size k occur in Fk}
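A minimal Python sketch of the pseudocode above, assuming baskets are given as a list of sets (itemsets are represented as frozensets; all names and the toy data are illustrative):

from itertools import combinations

def apriori(items, baskets, q, s):
    # Returns all itemsets J with |J| <= q and support >= s.
    def support(J):
        return sum(1 for b in baskets if J <= b) / len(baskets)

    result = []
    C = [frozenset([i]) for i in items]            # C1
    for k in range(1, q + 1):
        F = [J for J in C if support(J) >= s]      # Fk
        result.extend(F)
        if k == q:
            break
        F_set = set(F)
        # Ck+1: (k+1)-itemsets all of whose k-subsets occur in Fk.
        C = list({A | B for A in F for B in F
                  if len(A | B) == k + 1
                  and all(frozenset(c) in F_set for c in combinations(A | B, k))})
    return result

print(apriori(["milk", "bread", "juice"],
              [{"milk", "juice"}, {"milk", "bread", "juice"}, {"bread"}],
              q=2, s=0.5))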
Data mining is a more general form of OLAP. It refers to the discovery of patterns or knowledge in data.
Applications of Data-Mining
- Deviation Detection - Identifying anomalies such as intruders.
- Link Analysis - Trying to discover links between attributes.
- These can be used to make association rules.
- Predictive Modelling - Trying to predict future behaviour of certain attributes in the data based on past behaviour.
- Database Segmentation - Grouping data by similar behaviour.
- Group customers based on spending habits.
- Group customers based on their reaction to a marketing campaign.
Types of Discovered Knowledge
- Association Rules:
- People who buy nappies buy beer.
- Classification Hierarchies:
- Classification of mutual funds based on performance data characteristics such as growth, income and stability.
- Sequential Patterns:
- If a patient underwent cardiac bypass surgery for blocked arteries and an aneurysm and later developed high blood urea within a year of surgery, they are likely to suffer from kidney failure within the next 18 months.
- Clustering:
- Group treatment data on a disease based on similarity of side effects.
Consider that a big company wants to analyse its product sales. They have a database with the following schema:
Sales(productNo, date, store_name, price)
Stores(name, city, country, phone)
and will run the following query:
SELECT country, AVG(price)
FROM Sales, Stores
WHERE Sales.store_name = Stores.name AND
date >= '2021-01-01'
GROUP BY country;
This query reads most of the data in the database, which will block large parts of it:
- This should be avoided on DBMSs serving many users.
Data Warehouses
These are specific facilities made to support data analysis.
flowchart LR
a[(A)] & b[(B)] & c[(C)] --> etl[Extract, Transform, Load] --> dw
dw[(Data Warehouse)] --- da[Data Analyst]
The warehouse isn’t constantly updated. Only every day or so.
Online Analytic Processing (OLAP)
Online analytic processing is the process of analysing complex data stored in a data warehouse.
This is in contrast to online transaction processing (OLTP), which covers traditional DBMS tasks:
- Queries and updates that can be executed fast.
- Only affect a small portion of the database.
Fact Tables & Data Cubes
In OLAP applications, there is typically a unique fact table that:
- Represents events & objects of interest for the analysis.
Data cubes allow us to represent the fact table as points in a multi-dimensional space. For example, for the fact table:
Sales(productNo, date, store, price)
we can have axes of productNo, date and store, where each point holds a specific price.
There are examples of slicing and dicing data cubes using star schema at the end of the slides.
Star Schema
A star schema consists of:
- A unique fact table that contains all the points in the data cube.
- Dimension Tables - Describe the values along each axis.
From the following schema:
Products(productNo, type, model)
Days(date, day, week, month, year)
Stores(name, city, country, phone)
you can make the following star schema:
Sales(productNo, date, store, price)
This has the dependent attribute price.
A star schema describes a database consisting of:
- A fact table $R(A_1,\ldots,A_n,B_1,\ldots,B_m)$ where:
- $A_1,\ldots,A_n$ are called dimensions.
- $B_1,\ldots,B_m$ are called dependent attributes.
- Dimension table $D_i(A_i\ldots)$ for each dimension $A_i$.
Characteristics of Star Schema
Unlike typical databases, star schemas are denormalised:
- Main data in one table (fact table).
- Rest of the data can be joined with the fact table very quickly.
As a result we get the following:
- Queries don't require many joins.
- Performance gains, especially when processing queries that would require many joins in a normalised database.
- Faster and easier aggregation of data.
Document Stores
This is a database that stores documents (typically in JSON), each associated with an ObjectID.
JSON vs XML
We can represent the following XML:
<?xml version="1.0" standalone="yes"?>
<students>
<student>
<name>Anna</name>
<number>20171989</number>
<programme>G402</programme>
</student>
<student>
<name>John</name>
<number>20174378</number>
<programme>G702</programme>
</student>
…
</students>
as the following JSON file:
{"students":[
{"name": "Anna",
"number":20171989,
"programme":"G402"},
{"name": "John",
"number":20174378,
"programme":"G702"},
…
]}
- {} are used for objects.
- [] are used for arrays.
MongoDB
This DBMS stores documents in a binary variant of JSON (BSON):
MongoDB Properties
Sharding (horizontal fragmentation):
- Collections are split into horizontal fragments.
- Based on the shard key:
- Indexed Field
- Exists in all documents.
Replication:
- Horizontal fragments of collections are replicated.
- Master-slave approach:
- Primary copy (master) and secondary copies (replicas).
- Updates - On master, then delegated to replicas.
- Reads - Delegated to master (default) or to any replica.
Transaction Support:
- ACID is supported in the newest version (from 2020).
- Isolation is not fully supported, as only snapshot isolation is provided.
Column Stores
Column stores are laid out like so:
Student:

| row key | name: first | name: last | contact: address | contact: city | contact: postcode | contact: email |
| --- | --- | --- | --- | --- | --- | --- |
| 123 | Anna | Collins | … | … | … | … |

(name and contact are column families; first, last, address, city, postcode and email are the columns within them.)
The following are different to normal tables:
Hbase
This column store implementation uses the Hadoop Distributed File System (HDFS):
Hbase Properties
Hbase uses two levels of fragmentation:
- Top Level - Rows are divided into regions (ranges of rows).
- Bottom Level - Regions store different column families in different nodes (computers).
There is no transaction support.
Each item has a timestamp, and one can access past versions of the database (if set up):
> get 'STUDENT', 'row1'
COLUMN CELL
name:first timestamp=1511139129384, value=Anna
name:last timestamp=1511139140333, value=…
Graph Databases
These store data as a graph with:
- Objects and attributes on the nodes.
- Relations on the arcs.
Data is then accessed using an SQL-like path query language.
This is the simplest type of database with only two columns, one for the key and one for the value.
They have a simple access mechanism:
- find(k) - Returns the local value for key $k$.
- write(k, v) - Inserts value $v$ under key $k$.
Distributed Storage
Each key-value pair $(k,v)$ is stored at some node:
- Each key $k$ is assigned (e.g. via a hash function) to an integer between 0 and $2^n-1$, where $n$ is large enough to hold all nodes and duplicates.
- Distribute the nodes over some of the integers (typically randomly). This creates a ring of nodes.
- If $(k,v)$ is assigned to integer $i$, it is stored at the node following $i$ on the ring.
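A minimal sketch of this ring placement, assuming a hash function maps both keys and node names onto the integers $0,\ldots,2^n-1$ (node names and the key are illustrative):

import hashlib
from bisect import bisect_right

N = 16                       # the ring has 2^N positions
RING_SIZE = 2 ** N

def ring_position(name):
    # Hash a key or node name to an integer on the ring.
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % RING_SIZE

nodes = ["node-A", "node-B", "node-C"]
positions = sorted((ring_position(n), n) for n in nodes)

def node_for_key(key):
    # A pair (k, v) is stored at the first node clockwise from its ring position.
    pos = ring_position(key)
    idx = bisect_right([p for p, _ in positions], pos)
    return positions[idx % len(positions)][1]

print(node_for_key("user:42"))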
Scalability via Horizontal Fragmentation
New nodes can be added to the ring easily:
- Add nodes to free range(s) and move key-value pairs appropriately.
- Automatic horizontal fragmentation.
Replication
Replication is used to ensure availability.
Replicas (copies of the key-value pairs) are stored on consecutive nodes on the ring in clockwise order.
Eventual Consistency
DynamoDB and Voldemort allow multiple versions of a data item to be present at the same time (versioning):
- If a newer version of a data item is not yet available at a node, the older version is used.
- This is fine for many applications, like a shopping cart.
If this isn’t good enough we can assign a vector clock to each version of an item $X$:
- A list (vector) of pairs (node, timestamp):
- node - The node that has written $X$.
- timestamp - The local time on the node when the item $X$ was written.
- Use the clock to decide if version $V_1$ originated from version $V_2$:
- $V_1$ originated from $V_2$ if, for all nodes in $V_2$'s clock, the corresponding timestamp is $\leq$ the timestamp in $V_1$.
- Check that all the numbers in the vector clock are bigger than, or equal to, those in the previous clock.
- If the versions are incomparable (multiple values changed), return all possibilities.
- This is the case if some numbers are bigger and some are smaller between the versions.
- Incomparability between versions is resolved at read-time.
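A minimal sketch of the "originated from" check, with each vector clock represented as a dict from node name to timestamp (an absent node counts as timestamp 0; names are illustrative):

def originated_from(v1, v2):
    # True if v1 originated from v2: every timestamp in v2's clock is <= the
    # corresponding timestamp in v1's clock.
    return all(v1.get(node, 0) >= ts for node, ts in v2.items())

def compare(v1, v2):
    if originated_from(v1, v2):
        return "v1 is the newer version (or equal)"
    if originated_from(v2, v1):
        return "v2 is the newer version (or equal)"
    return "incomparable - return both versions and resolve at read-time"

print(compare({"A": 2, "B": 1}, {"A": 1, "B": 1}))  # v1 is newer
print(compare({"A": 2, "B": 0}, {"A": 1, "B": 3}))  # incomparable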
Property Implementation
- Scalability:
- Adding new nodes to increase capacity is easy.
- Automatic horizontal fragmentation.
- Availability & Fault-Tolerance:
- Due to replication.
- Can retrieve value for a key, even if a few nodes storing values for that key fail.
- High Performance:
- Retrieving the value for a key is fast:
- Apply the hash function to determine the node, then ask the node.
- The same is true for writing.
These are non-relational databases that generally cater to web applications:
- Very fast access, with millions of users in parallel.
- Fault Tolerance
- Semi-structured
- Flexibility in the type of data stored.
- Full ACID compliance can sometimes be relaxed.
NoSQL Example
Consider that you have the following setup:
flowchart
ws[Web Service] <--> nsql[NoSQL Database] & rd[Relational Database]
nsql --- n([User Profiles, Shopping Cart])
rd --- r([Inventory, Transactions])
We can store data that must be compliant on the relational database and high volume, non-compliant data can be stored on the noSQL database.
In the NoSQL database we can use tables like so:
| Key | Value |
| --- | --- |
| User1 | {"name":"John Smith", "items":{ "1":{"name":"product 1","quantity":3} …} } |
They have the following properties:
- Simple Queries:
- Return the value for key $k$.
- Fast reads/writes via index on keys.
- ACID can sometimes be relaxed.
NoSQL Characteristics
NoSQL databases are often distributed:
- Nodes run on commodity hardware.
- Standard network.
They are designed to guarantee scalability, availability and high performance.
They often give up ACID to improve performance.
The CAP Theorem
This theorem says we can only choose two of:
- Consistency
- Availability
- Partition-Tolerance
Single node DBMS can have:
- Consistency and availability (partition-tolerance is not an issue for a single node).
NoSQL can have:
- Availability and partition-tolerance (typically giving up strong consistency).
BASE
NoSQL databases drop ACID and often provide the weaker guarantee of:
- Basically Available
- Soft State
- Eventually Consistent
Soft state & eventual consistency mean that the database might occasionally be inconsistent but will eventually be consistent after propagation occurs.
NoSQL Classification
- Key-Value Stores
- Efficient lookups and insertions of key-value pairs.
- Document Stores
- Similar to key-value stores, where the value is a semi-structured datatype.
- Lookups can involve values from the documents.
- Column Stores
- Graph Databases
We can convert the other way by using XML Schema.
To convert an XML database to SQL we have the following approaches:
Storing XML in an Attribute
We can use the XML data type (XML or XMLType) to store raw XML in a serialised form.
Creating Tables Using XML Type
We can use the following SQL:
CREATE TABLE XMLStaff (
docNo CHAR(4),
docDate DATE,
staffData XML
);
INSERT INTO XMLStaff VALUES (
'D001',
'2004-12-01',
XML('
<STAFF branchNo = "B005">
<STAFFNO>SL21</STAFFNO>
<POSITION>Manager</POSITION>
<DOB>1945-10-01</DOB>
<SALARY>30000</SALARY> </STAFF>
')
);
Schema-Independent Representation
This method allows us to store the XML as a tree inside of a table:
- This creates the issue of making the tree difficult to search.
- To solve this we can use a denormalised index in a second table that links nodes and their parents.
From this we then get the following schema-independent table:
| nodeID | nodeType | nodeName | nodeData | parentID | rootID |
| --- | --- | --- | --- | --- | --- |
| 0 | Document | STAFFLIST | | | 0 |
| 1 | Element | STAFFLIST | | 0 | 0 |
| 2 | Element | STAFF | | 1 | 0 |
| 3 | Element | STAFFNO | | 2 | 0 |
| 4 | Text | | SL21 | 3 | 0 |
| 5 | Element | NAME | | 2 | 0 |
| 6 | Element | FNAME | | 5 | 0 |
| 7 | Text | | John | 6 | 0 |
| 8 | Element | LNAME | | 5 | 0 |
| 9 | Text | | White | 8 | 0 |
and its denormalised index:
| path | nodeID | parentID |
| --- | --- | --- |
| /STAFFLIST | 1 | 0 |
| STAFFLIST | 1 | 0 |
| STAFFLIST/STAFF | 2 | 1 |
| STAFF | 2 | 1 |
| /STAFFLIST/STAFF/NAME | 5 | 2 |
| STAFFLIST/STAFF/NAME | 5 | 2 |
| STAFF/NAME | 5 | 2 |
| NAME | 5 | 2 |
- The XML is decomposed into its constituent elements and the data is distributed over a number of attributes in one or more relations.
- Storing a shredded document may make it easier to index the values of some elements, provided these elements are placed into their own attributes.
- It is also possible to add some additional data relating to the hierarchical nature of the XML, making it possible to recompose the original structure and ordering.
- This allows the XML to be updated.
With this approach we also have to create an appropriate database structure.
Order By
This clause is included in the following order:
| Clause | Requirement |
| --- | --- |
| for & let | Any number, interleaved. Some interpreters require at least one let. |
| where | Optional |
| order by | Optional |
| return | Mandatory |
and we can use it like so:
let $doc := doc("mydoc.xml")
for $b in $doc/books/book
for $author in $b/author
order by $author descending
return <pair>{$b/title}, {$author}</pair>
Group By
This clause is inserted in the following order:

| Clause | Requirement |
| --- | --- |
| for & let | Any number, interleaved. Some interpreters require at least one let. |
| where | Optional |
| group by | Optional |
| order by | Optional |
| return | Mandatory |
and we can use it like so:
let $doc := doc("mydoc.xml")
for $m in $doc/university/student/module
group by $mod:=$m
return <pair>{$mod}, {count($m)}</pair>
We can only aggregate by variables that are grouped (in this case only $m, not $m/...).
Other aggregate functions include count, sum, avg, min and max.
Distinct-Values
This converts the enclosed elements to strings, and filters so that they are distinct.
We can use it on each element $m in the for loop individually:
let $doc := doc("mydoc.xml")
for $m in $doc/university/student/module
return distinct-values($m)
This only removes duplicate module elements per-student.
or we can use it on the whole result:
distinct-values(let $doc := doc("mydoc.xml")
for $m in $doc/university/student/module
return $m)
This is a method of running queries on XML. They are an extension of XPath and therefore every XPath is an XQuery expression.
For additional XQuery examples see this lecture video
Syntax
For the following tree:
graph
$doc -->|students|a[ ]
a -->|student|$s
$s -->|name| ''Anna''
$s -->|module| ''COMP207''
an XQuery could look like the following:
let $doc := doc("students.xml")
for $s in $doc/students/student
where $s/module = "COMP207"
return $s/name
This is case sensitive.
FLWR Expressions
| Clause | Requirement |
| --- | --- |
| for & let | Any number, interleaved. Some interpreters require at least one let. |
| where | Optional |
| return | Mandatory |
Therefore we can get the following smallest expressions depending on the implementation:
return <greeting>Hello World!</greeting>
let $hello := <greeting>Hello World!</greeting>
return $hello
Clauses
XPaths generally take the following form:
/axis1::E1[C1]/axis2::E2[C2]
- axis are navigation axes.
- C are conditions.
- E are names of a tag or an attribute, or *.
Navigation Axes
| Axis | Shorthand | Description |
| --- | --- | --- |
| attribute | @ | An attribute |
| child | Default (child:: can be omitted) | Any child |
| descendant | //E instead of /descendant::E | Any proper descendant |
| descendant-or-self | | Any descendant |
| ancestor | | Any proper ancestor |
| following-sibling | | Any sibling to the right |
| preceding-sibling | | Any sibling to the left |
| parent | .. | The parent |
| self | . | Yourself |
A proper descendant or ancestor is at least one node away (not self).
Conditions
The idea is that if the condition is true, we follow the path further.
- You can use comparators like =, != and >=.
- A value can be a relative path expression or any constant.
- You can combine conditions by using and and or.
//book[category="CS"]/title
You can also write E[i], which is true if the current node is the i-th child of its parent:
//*[category="CS" or category="Scifi"][1]
This would return the first element in "CS" or "Scifi".
You can also use functions like last() in place of i.
Boolean Conversions
We can also include additional XPaths and strings in the condition:
XPath allows us to write queries that return a set of values or nodes from an XML document:
- Values
- Nodes
- Document Node
- Element Node
- Attributes
- Inside the tags of elements.
Path Expressions
They take the following format:
\[/E_1/E_2/E_3/\ldots/E_n\]
This would select all nodes reachable from the root by following the edges in the order of the path.
The result is returned in the document order.
Relative Path Expressions
You can miss out the first slash to evaluate relative to another node.
Attributes
The path expressions can be extended so that we can return attribute values:
Wildcards
A wildcard *
represents any tag name or attribute name:
/students/student/module/@*
This would return all attributes of the module elements.
This is the second way of defining the format of an XML file.
Comparison to DTD
XML schema should be used if you want more precision on the schema than offered by DTD.
Positives:
- Defined in XML.
- Give bounds on number of occurrences.
- Can do unordered sets.
- Can do more precise references.
- More complex simple data types.
Negative:
Syntax
To get an XML schema version of the following DTD:
<!ELEMENT lecturers (lecturer+)>
<!ELEMENT lecturer (name, phone?, email?, teaches*)>
<!ELEMENT teaches (code, title)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT code (#PCDATA)>
<!ELEMENT title (#PCDATA)>
we could write something like the following:
<?xml version="1.0" encoding="utf-8" ?>
<!-- anything prefixed with xs: is an object -->
<!-- also means we are defining a "University" schema -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="University">
<xs:element name="lecturers">
<xs:complexType>
<xs:sequence>
<xs:element name="lecturer" type="lecturerType" minOccurs="1" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
<xs:complexType name="lecturerType">
<xs:sequence>
<xs:element name="name" type="xs:string"/>
<xs:element name="phone" type="xs:string" minOccurs="0"/>
<xs:element name="email" type="xs:string" minOccurs="0"/>
<xs:element name="teaches" type="teachesType" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
<xs:complexType name="teachesType">
<xs:sequence>
<xs:element name="code" type="xs:string"/>
<xs:element name="title" type="xs:string"/>
</xs:sequence>
</xs:complexType>
<xs:simpleType name="phoneType">
<xs:restriction base="xs:string">
<xs:pattern value="07[0-9]{3} [0-9]{6}"/>
</xs:restriction>
</xs:simpleType>
...
</xs:element>
</xs:schema>
Convert Database to XML
There is an example of converting a table to an XML document at the end of the lecture slides
There are two ways to define the format of XML files. This lecture explains how to use DTD.
DTDs contain information about the structure of an XML document:
- Elements that may occur in the document.
- Sub-elements that an element may have.
- Attributes that an element may have
Structure
They are included at the beginning of the XML document between:
<?xml ... standalone="no" ?>
and the root element.
They are included between the following tags:
<!DOCTYPE lecturers [
...
]>
where lecturers
is the name of the root tag.
Elements
The elements of a DTD has the following syntax:
<!--There must be 1 or more lecturer below the root element-->
<!ELEMENT lecturers (lecturer+)>
<!--Name is required, phone and email are optional, zero or more occurrences of teaches-->
<!ELEMENT lecturer (name, phone?, email?, teaches*)>
<!ELEMENT teaches (code, title)>
<!--Includes special symbol for text data-->
<!ELEMENT name (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT code (#PCDATA)>
<!ELEMENT title (#PCDATA)>
To parse this we identify the rules for the elements that can occur in the XML document (inside the parentheses):
- Names with no qualifying punctuation must occur exactly once.
- Options for repetitions are:
- * - Zero or more.
- + - One or more.
- ? - Either zero or one.
Defining Attributes
To define the following attribute:
<module code="COMP105" title="Programming Paradigms" />
We would first define the module element itself without any child elements (e.g. <!ELEMENT module EMPTY>), then define the attributes as follows:
<!ATTLIST module code CDATA #IMPLIED>
<!ATTLIST module title CDATA #IMPLIED>
The value after the # can be some of the following:
- #IMPLIED - Optional.
- #REQUIRED - Required.
- Some value (e.g. COMPXXX) defining a default string.
- #FIXED and some value - Constant.
Attribute Datatypes
You can use the following datatypes:
Element Identity, IDs & IDREFs
ID allows a unique key to be associated with an element.
IDREF allows an element to refer to another element with the designated key and attribute type.
IDREFS allows an element to refer to multiple elements.
To loosely model the relationships:
- Department has Lecturers.
- Department has a head who is a Lecturer.
we would write:
<!ATTLIST lecturer staffNo ID #REQUIRED>
<!ATTLIST department staff IDREFS #IMPLIED>
<!ATTLIST department head IDREF #IMPLIED>
Document Validity
There are two levels of document processing: checking that a document is well-formed, and checking that it is valid (conforms to its DTD).
XML
XML files look similar to HTML files. This is an XML document representing the graph from last lecture:
<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<lecturers>
<lecturer>
<name>J. Fearnley</name>
<phone>7954…</phone>
<teaches>
<code>COMP105</code>
<title>Progr. Paradigms</title>
</teaches>
<memberOf>Economics &amp; …</memberOf>
</lecturer>
...
</lecturers>
There are the following structures:
- Tags - Enclosed between < and >.
- Open with <...>.
- Close with </...>.
- Element - The opening, closing tags and all data in-between.
- Attributes:
- Elements can have attributes (name/value pairs), which are put into the element's opening tag:
<lecturer name="J. Fearnley">
<module code="COMP105" title="Programming Paradigms" />
</lecturer>
- Sub-elements should be used if there is a one-to-many relationship.
Nodes with Multiple Parents
These don’t exist in XML:
- There is a notion of references like symbolic links in a file-system.
Other Features
- There can be an optional document type definition or XML schema at the start of the XML document.
- Entity References - Can serve as shortcuts to often repeated text or to distinguish reserved characters from content.
- Comments - Enclosed in
<!-- ... -->
tags.
- CDATA - Sections of character data that the parser should not interpret as markup.
- Processing Instructions - Can be used to provide information to the application.
Semi-structured data lies between fully structured data (like relational databases) and unstructured data (like arbitrary data files).
- Semi-structured data doesn’t have to fit to a schema.
- Programs have to know how to read & interpret the data.
- Semi-structured data is self describing and flexible.
Semi-Structured Data Model
Semi-structured data is typically a tree with the following properties:
- Leaf Nodes - Have associated data.
- Inner Nodes - Have edges going to other nodes:
- Root:
- Has no incoming edges.
- Each node is reachable from the root.
graph
root -->|lecturer| a[ ] & b[ ]
a -->|name| c[ ]
a -->|teaches| d[ ]
d -->|taughtBy| c[J. Fearnley]
d -->|code| e[COMP105]
d -->|title| f[Progr. Paradigms]
b --> ...
Applications
They are a popular form for storing and sharing data on the web.
Also are used for the storage of documents such as:
- Word processor documents.
- Spreadsheets.
- Vector Graphics.
Implementations
- XML (eXtensible Markup Language)
- JSON (JavaScript Object Notation)
- Simple Key-Value Relationships:
- Correspond to very simple XML/JSON documents.
- Graphs
- A general form of semi-structured data.
Often problems involving large datasets can be divided into many small problems that can be solved in parallel and then have their results combined.
MapReduce is a programming framework that allows us to implement two methods:
- Map - The computation on the smaller chunks of data.
- Reduce - How the results are combined to the final result.
MapReduce is not a universal parallelism framework; not every problem that is parallelisable can be expressed and solved nicely with MapReduce.
Counting Words in a Document
Given a collection of text files, for each word $w$, count how many times $w$ occurs in the files.
To do this we input the data like so:
("file1.txt","<file contents>"),("file2.txt","<file contents>"),("file3.txt","<file contents>"),...
These are grouped into (key, value) pairs.
The MapReduce computation will then return:
(<word1>,<count>),(<word2>,<count>),(<word3>,<count>),...
The Map Function
The map function is applied to a single key/value pair and produces a list of zero or more key/value pairs:
Map(String filepath, String contents):
for each word w in contents:
output pair (w, "1")
The Reduce function will handle adding together all of the key/value pairs so we don’t have to do that at this stage.
Grouping
After the Map function has been applied to every input pair, we group all values by key:
("hello",1),("world",1)("hello",1),("hello",1)
this would then go to:
("hello", (1, 1, 1),("world",1)
Copy & Merge
The grouped outputs from each text file are copied to the reducers and merged, completing the grouping across all files.
The Reduce Function
The Reduce function takes a key and a list of values as input and outputs a list of key/value pairs:
Reduce(String word, Iterator<String>values):
int count = 0
for each v in values:
count = count + parseInt(v)
output pair (word,toString(count))
Each Reduce call applies to a single key and its list of values, not the whole output. This enables more parallel computation.
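As a rough single-machine illustration (no real distribution or parallelism), the whole Map → group → Reduce pipeline for word counting can be simulated in Python; the input files are made up:

from collections import defaultdict

def map_fn(filepath, contents):
    # Emit a (word, "1") pair for every word in the file.
    return [(w, "1") for w in contents.split()]

def reduce_fn(word, values):
    # Sum the counts collected for one word.
    return [(word, str(sum(int(v) for v in values)))]

files = {"file1.txt": "hello world hello", "file2.txt": "hello"}  # toy input

grouped = defaultdict(list)
for path, contents in files.items():
    for key, value in map_fn(path, contents):
        grouped[key].append(value)

output = [pair for word, values in grouped.items() for pair in reduce_fn(word, values)]
print(output)  # [('hello', '3'), ('world', '1')]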
Additional Examples
There are additional examples on the following in this lecture video:
- Matrix Multiplication (page rank)
- Relational Algebra:
- Selection $\sigma_c(R)$
- Natural Join $R\bowtie S$
Distributed File System
MapReduce operates on its own distributed file system:
- All the data is distributed over the nodes.
- Replication is used for fault-tolerance.
It appears to the user as if it was a regular file system.
Input files can be very large. To deal with this, MapReduce splits these into chunks (16-64 MB) and provides an ID for each chunk in a key/value pair.
Implementation
Map returns a list of key/value pairs:
Map(String key, String value)
Reduce returns a list of key/value pairs:
Reduce(String key, Iterator<String> values)
In practice there is also some additional code to set:
- Locations of input and output files.
- Tuning Parameters:
- Number of Machines
- Memory per Map
If we don’t have all the information we need for a query at a node, we need to request information from other sites:
- This is very slow.
- Joins are the most expensive operation.
Joins
We can often send less than the whole table over the network when completing joins.
Semijoins ($\ltimes$)
We can deduce that:
\[R\ltimes S=R\bowtie \pi_{\text{common attributes of }R\text{ and }S}(S)\]
$R\ltimes S$ is the set of all tuples in $R$ that NATURAL JOIN
at least one tuple in $S$.
Therefore we can save on network communications by only sending the part of $S$ that is needed.
Semijoin Reduction
graph LR
a[("A (stores R)")] ---|network| b[("B (stores S)")]
We can then use this to create an optimised method to calculate $R\bowtie S$ at site $B$ (which stores $S$).
- Site $B$ sends $S':=\pi_{\text{common attributes of }R\text{ and }S}(S)$ to site $A$.
- Site $A$ sends $R':=R\ltimes S(=R\bowtie S')$ to site $B$.
- Site $B$ outputs $R'\bowtie S$.
There is an example, with attribute names, at the end of the lecture.
The cost of this communication is:
\[\lvert S'\rvert\times(\text{size of tuple in }S')+\lvert R'\rvert\times(\text{size tuple in }R')\]
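A minimal Python sketch of the three steps, with relations as lists of dicts (the attribute names and data are illustrative, not from the slides):

# R is stored at site A, S at site B; 'store' is the only common attribute.
R = [{"prod": 1, "store": "Liv"}, {"prod": 2, "store": "Man"}]
S = [{"store": "Liv", "country": "UK"}]
common = {"store"}

# Step 1: site B sends S' = projection of S onto the common attributes.
S_proj = [{a: t[a] for a in common} for t in S]

# Step 2: site A computes and sends R' = R ⋉ S (tuples of R matching some tuple of S').
R_semi = [r for r in R if any(all(r[a] == s[a] for a in common) for s in S_proj)]

# Step 3: site B joins R' with S to produce R ⋈ S.
result = [{**r, **s} for r in R_semi for s in S if all(r[a] == s[a] for a in common)]
print(result)  # [{'prod': 1, 'store': 'Liv', 'country': 'UK'}]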
Efficiency
This method is sometimes more efficient than just exchanging the whole table depending on:
- Is the projection much smaller than the full relation?
- Is the size of the semijoin much smaller than $R$?
In general, this solution is optimal if:
\[\lvert\pi_\text{common attributes}(S)\rvert+\lvert R\ltimes S\rvert <\lvert R\rvert\]
In the worst case you can send twice the size of $R$.
Concurrency Control
CC in DDBMS Using Locks
Locks for concurrency control can be used in many different ways:
| | No Backups | Backups |
| --- | --- | --- |
| One Authority | One computer does all the locks. | There are backups if the primary fails. |
| Many Authorities | Database items can have different computers in charge of their locks. | Many authorities with backups. |
There are various downsides to these methods:
- One Computer - If it fails you need to restart the whole system; there can also be too many transactions requiring locks.
- Backup of One Computer - Must be synced with the primary computer.
- Many Authorities - Must be synced with the primary computer.
- Many Authorities with Backups - Must be synced with the primary computer.
CC in DDBMS using Voting
- Each site with a copy of an item has a local lock that it can grant to transactions for that item.
- If a transaction gets over half the local locks for an item, it has a global lock on the item.
- If so, it must tell the sites with a copy that it has the lock.
- If it takes too long, it must stop trying to get the lock.
This method is much more distributed but requires more communication.
Recovery in DDBMS
Violation of Atomicity
Even though atomicity is enforced at individual locations it can be violated globally.
This is because nodes can fail during the execution of a global transaction.
Two-Phase Commit Protocol
This is the protocol that is used to enforce atomicity in global transactions.
This is not related to 2PL.
There is a single coordinator on the network that decides when local transactions can commit.
Logging is handled at each node locally:
Messages sent and received from other nodes are logged too.
The two phases are as follows:
- Decide When to Commit or Abort
- Send prepare T to ask the nodes if they want to commit.
- At each node decide whether to commit or abort:
- If commit, go into the precommitted state and send back ready T.
- Only the coordinator can abort once a node is in a precommitted state.
- If abort, send back don't commit T and abort the local transaction.
- Commit or Abort
- If all nodes respond ready T:
- Send commit T to all the nodes.
- If some node responds don't commit T or there is a timeout waiting for a reply:
- Send abort T to all nodes.
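A minimal sketch of the coordinator's decision in phase 2, with node replies represented as strings and a timeout represented as None (a real implementation would also write the log records described below):

def two_phase_commit_decision(replies):
    # Commit only if every node answered "ready T"; any "don't commit T" or
    # timeout (None) leads to an abort.
    if replies and all(r == "ready T" for r in replies):
        return "send commit T to all nodes"
    return "send abort T to all nodes"

print(two_phase_commit_decision(["ready T", "ready T"]))          # commit
print(two_phase_commit_decision(["ready T", "don't commit T"]))   # abort
print(two_phase_commit_decision(["ready T", None]))               # abort (timeout)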
Two-Phase Logging
The logging for phase 1 is like so:
sequenceDiagram
participant T0 as T0 (Coordinator)
note over T0:log prepare T
T0 ->> T1: prepare T
alt ready
note over T1:flush T1 logs to enter precommitted state
note over T1:log ready T
T1 ->> T0:ready T
else abort
note over T1: log don't commit T
T1 ->> T0:don't commit T
note over T1:abort T1
end
alt means that you can take either path in the sequence diagram.
The logging for phase 2 is like so:
sequenceDiagram
participant T0 as T0 (Coordinator)
alt all ready
note over T0: log commit T
T0 ->> T1:commit T
else don't commit/timeout
note over T0: log abort T
T0 ->> T1:abort T
end
Three-Phase Commit Protocol
In the two-phase commit protocol, if the coordinator and another node fail, durability could be broken when aborting the transaction.
To fix this we can split phase 2 into two parts:
- Prepare to Commit
- Send the decision (commit/abort) to all nodes.
- Nodes go into prepare-to-commit state.
- Commit
This lecture covers transparency:
- Things that the DDBMS keeps hidden from people accessing the database.
- Fragmentation and replication are examples of the things that are handled transparently.
Fragmentation
This is where the database is split into different fragments that can then be stored at different nodes.
Horizontal Fragmentation (Sharding)
(The table is split row-wise into Fragment 1, Fragment 2, Fragment 3 and Fragment 4.)
The fragments may overlap.
- Each tuple could be stored at a different site.
- The way to get the original table back is to compute the UNION of the fragmented tables.
Vertical Fragmentation
The fragments typically overlap.
- Certain attributes are stored at different sites.
- You can get the original table back by completing the natural join of the distributed tables.
Users don’t see the fragments, just the full relations.
Replication
Different sites can store copies of data that belongs to other sites:
- Consider that a central office holds all the data from the branches.
- This improves resilience.
- Consider that a branch holds data about suppliers from the central office.
- This improves efficiency.
Full Replication
This means that each fragment is stored at every site:
- There are no longer any fragments.
- Faster query answering.
- Very slow updates.
No Replication
This is where each fragment is stored at a unique site:
- Crashes are a big problem.
Partial Replication
This is where a combination of the above are used:
- Limit the number of copies of each fragment.
- Replicate only the fragments that are useful.
Transparency
DDBMSs ensure that users do not need to know certain facts when creating queries. Transparency can be implemented at different levels.
- Fragmentation Transparency
- Fragmentation is transparent to users.
- Users pose queries against the entire database.
- The distributed DBMS translates this into a query plan that fetches the required information from the appropriate nodes.
- Replication Transparency
- Ability to store copies of data items and fragments at different sites.
- Replication is transparent to the user.
- Location Transparency
- The location where data is stored is transparent to the user.
- Naming Transparency
- A given name (of a relation) has the same meaning everywhere in the system.
There are also many more types of transparency.
We want to have databases at several sites, which only hold data that is primarily relevant to them.
Distributed Databases
They are a:
- Collection of multiple logically interrelated databases.
- Distributed over a computer network.
graph LR
A ---|network link| B --- C["C (node/site)"] --- A
- The databases are connected by a network link.
- Nodes/sites are where the databases are stored.
Advantages
Performance Improvements:
- Answer queries faster by distributing tasks over the nodes.
- Reduces CPU time, disk accesses, communication cost and more at individual nodes.
Scalability:
- Easier extension of the system for capacity and performance.
- Just add a new node.
Resilience:
- Data can be replicated at geographically separate sites.
- Catastrophic failures don’t affect the entire system
A physical query plan adds information required to execute the optimised query plan such as:
- Which algorithm to use for execution of operators.
- How to pass information from one operator to the other.
- Good order for computing joins, unions, etc.
- Additional operations such as sorting.
Converting a Logical Plan to a Physical Plan
We can generate many different physical query plans from an optimised logical query plan. We need to choose the physical plan with the lowest cost estimate in terms of:
- Time
- Disk Accesses
- Memory
- Communication
- …
Estimating Cost of Execution
We should mainly focus on the number of disk accesses as they are usually the most expensive operation.
It can be influenced by:
We can estimate these from parameters of the database such as:
- Size of relations.
- Number of distinct items per attribute per relation.
- Computed exactly from the database.
You can’t execute the query, you have to rely on statistics gathered from data.
Estimate Size of Selection
Estimate for the size of $\sigma_{A=a}(R)$:
\[\frac{\lvert R\rvert}{\text{number of distinct values in column }A \text{ of relation }R}\]
Estimate for the number of block required to store $\sigma_{A=a}(R)$:
\[\frac{\text{number of blocks for }R}{\text{number of distinct values in column }A\text{ of relation }R}\]
This is a good estimate if the values in column $A$ of $R$ occur equally often, but it can be bad otherwise.
Example
Assume that:
- Stores contains 80 tuples, stored in 20 blocks.
- There are exactly 4 distinct values for the city attribute of Stores.
To estimate the size of $\sigma_\text{city=Liv}(\text{Stores})$:
\[\frac{80}4=20\text{ tuples}\]
To estimate for the number of blocks that are required to store $\sigma_\text{city=Liv}(\text{Stores})$:
\[\frac{20}4=5\text{ blocks}\]
Estimate Size of Joins
To estimate $R\bowtie S$ we can do an estimate based on the size of $R$ and $S$ and a number of distinct values in common attributes:
\[\frac{\lvert R\rvert\times\lvert S\rvert}{\text{max. number of distinct values for }A\text{ in }R\text{ or }S}\]
This assumes that $A$ is the only common attribute.
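For instance, with illustrative numbers (not from the slides): if $\lvert R\rvert=1000$, $\lvert S\rvert=500$, and $A$ has 100 distinct values in $R$ and 50 in $S$, the estimate is
\[\frac{1000\times 500}{\max(100,50)}=\frac{500000}{100}=5000\text{ tuples.}\]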
There are two methods for passing information between different stages of the query plan:
- Materialisation - Write intermediate results to disk.
- Pipelining (stream based processing):
- Passes the tuples of one operation directly to the next operation without using disk.
- An extra buffer for each pair of adjacent operations holds the tuples passing from one operation to the next.
The initial query plan, generated from a query, may not be the optimal one. We can complete the following to make it run faster.
How to Optimise
Queries are evaluated from the bottom up therefore:
- Efficiency depends on the size of intermediate results.
We should rewrite the initial query plan so that intermediate results are as small as possible.
We can use relational algebra equivalence laws to make query plans with smaller intermediate results.
Heuristics
-
Push selections as far down the tree as possible.
flowchart LR
subgraph inefficient
11["π depart, name"] --> 12[σ worksAt=depart] --> 13["⨉"] --> 14[Stores] & 15[Emplyees]
end
subgraph better
21["π depart, name"] --> 22[σ worksAt=depart AND city=Liv] --> 23["⨉"] --> 24[σ city=Liv] & 25[Emplyees]
24 --> Stores
end
inefficient --> better
This gets rid of many irrelevant tuples very early during execution.
-
Push projections as far down the tree as possible, or insert projections where appropriate.
flowchart LR
subgraph inefficient
11["π depart, name"] --> 12[σ worksAt=depart AND city=Liv] --> 13["⨉"] --> 14[σ city=Liv] & 15[Emplyees]
14 --> 16[Stores]
end
subgraph better
21["π depart, name"] --> 22[σ worksAt=depart AND city=Liv] --> 23["⨉"] --> 24[π worksAt] & 25[π depart, name]
24 --> 26[σ city=Liv]
26 --> 27[Stores]
25 --> 28[Employees]
end
inefficient --> better
-
If possible, introduce equijoins for $\times$ followed by $\sigma$.
flowchart LR
subgraph inefficient
11["π depart, name"] --> 12[σ worksAt=depart AND city=Liv] --> 13["⨉"] --> 14[π worksAt] & 15[π depart, name]
14 --> 16[σ city=Liv]
16 --> 17[Stores]
15 --> 18[Employees]
end
subgraph better
21["π depart, name"] --> 23["⋈ worksAt=depart"] --> 24[π worksAt] & 25[π depart, name]
24 --> 26[σ city=Liv]
26 --> 27[Stores]
25 --> 28[Employees]
end
inefficient --> better
$\bowtie$ is faster than a $\times$ followed by a $\sigma$.
There are also many more heuristics that can be used to optimise the query plan.
Deleting Values
To delete a value/pointer pair:
- Find the leaf that should contain the value.
- If not there then we are done.
- Remove the value/pointer pair.
- Let $C$ be the current node.
-
Let $x$ be:
\[x=
\begin{cases}
2 & \text{if } C\text{ is root}\\
\lceil\frac{n+1}2\rceil & \text{if } C \text{ is an internal node}\\
\lfloor\frac{n+1} 2\rfloor & \text{if } C \text{ is leaf}
\end{cases}\]
- If $C$ has at least $x$ pointers - Fix ancestors (if necessary) and you are done.
- If $C$ is the root but not a leaf - Remove it (and let the child of the root be the new root).
-
Otherwise, check if an adjacent node has at least $x+1$ pointers.
A node is adjacent to another if they are at the same depth and their common ancestor has no descendant at that depth between them.
- If so - Take one, fix ancestors (if necessary) and you are done.
- Otherwise - Merge with sibling and go to line 3 with the parent as the current node.
The running time is:
\[O(h\times\log_2n)\]
where $h$ is the height of the B+ tree.
The real running time is:
\[O(h\times D)\]
where $D$ is the time for a disk operation.
Properties of B+ Tree Indexes
There is a visualisation tool for B+ trees available at https://www.cs.usfca.edu/~galles/visualization/BPlusTree.html.
Single vs Multi-Level Indexes
- Single Level Index - Stores in a single list.
- Multi-Level Index - Distributes across different layers.
B+ Tree Leaves (Idea)
| $a_1$ | $a_2$ | $\ldots$ | $a_n$ | |
| --- | --- | --- | --- | --- |
| Points to tuples with value $a_1$. | Points to tuples with value $a_2$. | | Points to tuples with value $a_n$. | Next Leaf |
$n$ is chosen such that a node fits into a single disk block:
- Disk Block Size - 512 byte
- Values - 4 byte integers
- Pointers - 8 bytes
With the above example $n=42$ as:
\[\begin{aligned}
512&\geq42(8+4)+8\\
512&<43(8+4)+8
\end{aligned}\]
B+ Tree Inner Nodes (Idea)
| $a_1$ | $a_2$ | $\ldots$ | $a_n$ | |
| --- | --- | --- | --- | --- |
| Points to a node for values $<a_1$. | Points to nodes for values $\geq a_1$ and $<a_2$. | | Points to nodes for values $\geq a_{n-1}$ and $<a_n$. | Points to a node for values $\geq a_n$. |
Pointers point to B+ tree nodes at the level below. $n$ is chosen as before.
B+ Tree Leaves (Actually)
- Not all the fields have to be used.
- Fields are filled from left to right.
| $a_1$ | $a_2$ | $\ldots$ | $a_i$ | Unused | Unused |
| --- | --- | --- | --- | --- | --- |
| Points to tuples with value $a_1$. | Points to tuples with value $a_2$. | | Points to tuples with value $a_i$. | Points nowhere | Next Leaf |
Ensure that at least $\lfloor \frac{n+1}2\rfloor$ pointers are used (unless this is the only leaf).
To follow with the online tool, we count the next leaf pointer as a pointer, even if there is none.
B+ Tree Nodes (Actually)
- Not all the fields have to be used.
- Fields are filled from left to right.
| $a_1$ | $a_2$ | $\ldots$ | $a_i$ | Unused | Unused |
| --- | --- | --- | --- | --- | --- |
| Points to a node for values $<a_1$. | Points to nodes for values $\geq a_1$ and $<a_2$. | | Points to nodes for values $\geq a_{i-1}$ and $<a_i$. | Points to a node for values $\geq a_i$. | Points nowhere |
Where:
- $a_i$ - The smallest value in the child subtree reached via the pointer immediately to its right.
Ensure that at least $\lceil \frac{n+1}2\rceil$ pointers are used (the root must use $\geq 2$ pointers).
B+ Tree Index
A B+ tree index of the prime numbers through 47 looks like the following:
Searching Values
To find the pointer to the rows with value $v$:
- Start at the root of the B+ tree.
- While the current node is a non-leaf node:
- If $v<a_1$, proceed to the first child of the node.
- Otherwise, find the largest $i$ with $a_i\leq v$ and proceed to the child node just to the right of $a_i$.
- If the current node is a leaf:
- If $v$ occurs in the leaf, follow the associated pointer.
- If $v$ does not occur in the leaf, return “$v$ doesn’t exist in the index”.
The running time is:
\[O(h\times\log_2n)\]
where $h$ is the height of the B+ tree.
The real running time is:
\[O(h\times D)\]
where $D$ is the time for a disk operation.
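A minimal Python sketch of this search, with nodes represented as dicts (a simplified in-memory structure, not how a DBMS lays out disk blocks; data is illustrative):

# Node format (sketch): {"leaf": bool, "keys": [...], "children": [...]}.
# In a leaf, children[i] stands in for the pointer to the rows with value keys[i].

def bptree_search(node, v):
    while not node["leaf"]:
        keys = node["keys"]
        if v < keys[0]:
            node = node["children"][0]
        else:
            # Largest i with keys[i] <= v; follow the pointer just to its right.
            i = max(j for j, k in enumerate(keys) if k <= v)
            node = node["children"][i + 1]
    if v in node["keys"]:
        return node["children"][node["keys"].index(v)]
    return None  # "v doesn't exist in the index"

leaf1 = {"leaf": True, "keys": [2, 3, 5], "children": ["rows(2)", "rows(3)", "rows(5)"]}
leaf2 = {"leaf": True, "keys": [7, 11], "children": ["rows(7)", "rows(11)"]}
root = {"leaf": False, "keys": [7], "children": [leaf1, leaf2]}
print(bptree_search(root, 5))   # rows(5)
print(bptree_search(root, 6))   # None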
Inserting Values
To insert a new value/pointer pair:
- Find the leaf that should contain the value.
- If the leaf is not full, insert the key value pair at a suitable location.
- If the leaf is full:
- Split the leaf to make space for the new value/pointer pair and move half of the pointers to the new node.
- Insert the value/pointer pair
- Connect the leaf to a suitable parent node (which might incur the creation of a new node).
The running time is:
\[O(h\times\log_2n)\]
where $h$ is the height of the B+ tree.
The real running time is:
\[O(h\times D)\]
where $D$ is the time for a disk operation.
Selection can be performed faster if we know where to find the rows that are being selected. One way of doing this is by using indexes.
Index
Given the values for one or more attributes of a relation $R$, provides quick access to the tuples with these values.
Consider that we have the following data that we want to index:
Students:

| id | name | programme |
| --- | --- | --- |
| 1234 | Anna | G401 |
| 2345 | Ben | G701 |
| 3456 | Chloe | G401 |
| 4567 | Dave | G401 |
We can use the following index table, sorted by value:

| value | pointers to rows |
| --- | --- |
| G401 | 1234, 3456, 4567 |
| G701 | 2345 |
You should have an index for each attribute that you want to select.
There are two different types of index:
- Secondary - Only points to the location of records on disk. Always dense.
- Primary - In addition, defines how data is sorted on disk. Good when attributes involve primary keys.
Using indexes
To complete a selection using an index:
- Find the entry that you are selecting in the index (G401).
- Visit all rows in Students whose ids occur in the index entry for G401.
This has a running time of:
\[O(\log d+\text{size of output})\]
You can use the following datatypes as indexes:
- Hash-Index - Good for equality.
- B+ Tree - Good for ranges.
You can create an index in SQL using the following command:
CREATE INDEX
ON Students USING btree -- default, can use hash
(programme, year); -- the columns that you want to index
This part is not required for the exam.
We can perform faster joins provided that the tables are sorted or indexed. In order to sort the data quickly we can do this inside of memory, instead of on the disk.
Merge Sort
Divide input in two parts P1, P2
Sort P1 & P2 recursively
While P1 or P2 is not empty:
Add smallest remaining element from P1 or P2 to output
External Memory Merge Sort
Divide input in M parts P1, P2, ..., PM
Sort P1, P2, ..., PM recursively
While not all Pi are empty:
Add smallest remaining element from any part Pi to output
- $M$ is the number of disk blocks that fit in RAM
Additionally, the following are true:
- The RAM holds the smallest blocks from each part.
- On each level of recursion we scan through all blocks once: $O(\frac NB)$ disk operations, where $B$ is the block size.
- We split into $M$ parts on each level of recursion until a part is smaller than $M\cdot B$ (where it fits in RAM and can be sorted directly): $\log_M(\frac N{M\cdot B})$ levels.
- The total number of disk operations is $O(\frac NB\log_M(\frac N{M\cdot B}))$.
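A minimal sketch of the merge phase in Python, assuming the sorted parts have already been produced (heapq supplies the "smallest remaining element from any part"; in a real DBMS each run would be read block by block from disk):

import heapq

def merge_runs(sorted_parts):
    # Seed the heap with the first element of every non-empty run.
    heap = [(part[0], i, 0) for i, part in enumerate(sorted_parts) if part]
    heapq.heapify(heap)
    output = []
    while heap:
        value, i, j = heapq.heappop(heap)
        output.append(value)
        if j + 1 < len(sorted_parts[i]):
            heapq.heappush(heap, (sorted_parts[i][j + 1], i, j + 1))
    return output

print(merge_runs([[1, 4, 9], [2, 3, 8], [5, 6, 7]]))  # [1, 2, ..., 9]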
Comparison
In practice, since hard disks are slow, the running time is mostly spent on disk operations. Therefore, the standard running time is:
\[O\left(\frac NB\log_M\left(\frac N{M\cdot B}\right)\times D\right)\]
where $D$ is the time of a disk operation.
Quick-sort (and other internal memory sorting algorithms) use the following order of time:
\[O\left(\frac NB\log_2(N)\times D\right)\]
External memory sorting is much faster on inputs that are much larger than main memory.
Equijoins
An equijoin:
\[R\bowtie_{A=B}S\]
is defined as:
\[\sigma_{A=B}(R\times S)\]
$A,B$ are the join attributes.
You can see an example on the following tables:
Merging (Merge Sort)
If $R$ is sorted on $A$ and $S$ is sorted on $B$, then $R\bowtie_{A=B}S$ can be computed with one pass over $R$ and $S$ + run time equal to the size of the output.
With sorted tables we can just step through the columns in the constraint ($A$ and $B$) while one is smaller than the other and just print out when the constraint is matched.
This gives a runtime of:
\[O(\lvert R\rvert +\lvert S\rvert +\text{size of output})\]
Merging with Duplicates in Column $A$
If there are duplicates in the left table then you can just loop through the right table again for the duplicate values.
Algorithm
You can use the following algorithm to compute this:
Sort $R$ on $A$ - running time $O(\lvert R\rvert \times \log_2\lvert R\rvert)$.
Sort $S$ on $B$ - running time $O(\lvert S\rvert \times \log_2\lvert S\rvert)$.
Merge the sorted $R$ and $S$ - running time $O(\lvert R\rvert +\lvert S\rvert +\text{size of output})$.
Therefore the typical running time is:
\[O(\lvert R\rvert\log_2\lvert R\rvert +\lvert S\rvert\log_2\lvert S\rvert)\]
This is much faster than a nested loop join, but is the same in the worst case.
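A minimal Python sketch of the merge step on already-sorted inputs, handling duplicate values in column $A$ by re-scanning the matching block of $S$ (relations are lists of dicts; the data is illustrative):

def merge_join(R, S, A, B):
    # R is sorted on attribute A, S is sorted on attribute B.
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i][A] < S[j][B]:
            i += 1
        elif R[i][A] > S[j][B]:
            j += 1
        else:
            v, j_start = R[i][A], j
            while i < len(R) and R[i][A] == v:
                j = j_start
                while j < len(S) and S[j][B] == v:
                    out.append({**R[i], **S[j]})
                    j += 1
                i += 1
    return out

R = [{"A": 1, "x": "a"}, {"A": 2, "x": "b"}, {"A": 2, "x": "c"}]
S = [{"B": 2, "y": "p"}, {"B": 3, "y": "q"}]
print(merge_join(R, S, "A", "B"))  # both A=2 tuples of R joined with the B=2 tuple of S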
Running Time vs. Disk Access
- $B$ - Size of a disk block (512/4k bytes)
- $M$ - Number of disk blocks that fit into available RAM
| Algorithm | No. of Elementary Operations | No. of Disk Accesses |
| --- | --- | --- |
| Reading a relation $R$ | $O(\lvert R\rvert)$ | $O(\frac{\lvert R \rvert}B)$ |
| Sorting $R$ on attribute $A$ | $O(\lvert R\rvert\log_2\lvert R \rvert)$ | $O(\frac{\lvert R\rvert}B\log_M\frac{\lvert R \rvert}B)$ |
Executing Query Plans
A query plan tells us exactly how to compute the result of a query. On a particular tree we should proceed from bottom to top:
graph
sig[σ department=worksAt] --> times["⨉"]
times --> pid[π department] & piw[π worksAt,name]
pid --> Stores
piw --> Employees
- Compute an intermediate result for each node.
- For a leaf labelled with relation $R$, the intermediate result is $R$.
- For an inner node labelled with operator $op$, get the intermediate result by applying $op$ to the children's intermediate results.
- Result of the query - intermediate result of the root.
We go from bottom to top, applying the operation on the last table(s).
How to Apply an Operator
- Computing $\sigma_\text{condition}(R)$
- We have to read the entire file $R$. If the intermediate result is sorted or has an index, there are methods of doing this without reading the whole table.
- Computing $\pi_\text{attribute list}(R)$
- Similar to $\sigma$, read $R$ only once.
Naïve Computation of Joins
We can use the following nested loop algorithm:
Compute $R\bowtie S$:
for each tuple $r$ in $R$:
for each tuple $s$ in $S$:
if $r$ and $s$ have the same value for all common attributes:
output $r\bowtie s$
This algorithm is slow as for each tuple $r$ in $R$ we have to read the entire relation $S$.
The running time is:
\[O(\lvert R\rvert\times\lvert S\rvert)\]
Where:
- $\lvert R\rvert$ - Number of tuples in $R$.
- $\lvert S\rvert$ - Number of tuples in $S$.
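A direct Python sketch of this nested loop join, with relations as lists of dicts and the common attributes found by name (illustrative data):

def nested_loop_join(R, S):
    common = set(R[0]) & set(S[0]) if R and S else set()
    out = []
    for r in R:                  # for each tuple r in R ...
        for s in S:              # ... read the entire relation S again
            if all(r[a] == s[a] for a in common):
                out.append({**r, **s})
    return out

R = [{"worksAt": "Liv", "name": "Anna"}, {"worksAt": "Man", "name": "Ben"}]
S = [{"worksAt": "Liv", "department": "Toys"}]
print(nested_loop_join(R, S))  # [{'worksAt': 'Liv', 'name': 'Anna', 'department': 'Toys'}]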
The query compiler and execution engine are responsible for:
- Transforming SQL queries into sequences of database operations.
- Executing these operations.
Query Processing
Query processing is completed in the following order:
graph
a[ ] -->|SQL Query| pqp[Parse Query & Pre-process] -->|"Logical Query Plan (relational algebra)"| olqp[Optimise Logical Query Plan] -->|Optimised Logical Query Plan| spqp[Select Physical Query Plan] -->|Physical Query Plan| qe[Query Execution]
Relational Algebra
The following set of operations can be applied to relations to compute new relations:
- Selection ($\sigma$)
- Projection ($\pi$)
- Cartesian Product ($\times$)
- Union ($\cup$)
- Renaming ($\rho$)
- Natural Join ($\bowtie$)
- Semijoin ($\ltimes$)
These are covered in prior lectures.
Query Plan
A relation algebra expression that is obtained from an SQL query is also called a (logical) query plan.
The following query:
SELECT department, name
FROM Stores, Employees
WHERE department=worksAt AND city='Liverpool';
can generate the following query plan:
\[\pi_\text{department, name}(\sigma_\text{department=worksAt AND city='Liverpool'}(\text{Stores}\times\text{Employees}))\]
You can also represent this as a tree:
graph
sig[σ department=worksAt] --> times["⨉"]
times --> pid[π department] & piw[π worksAt,name]
pid --> Stores
piw --> Employees
- Leaves - Input Relations
- Inner Nodes - Operators
Equivalent Query Plans
There are typically many different query plans. The DBMS aims to select a best possible query plan.
Relational algebra is better suited than SQL for this:
-
Can use equivalence laws of relational algebra to generate a query plan for the same query that can be executed faster.
For example:
\[\sigma_\text{condition}(R_1\bowtie R_2)=\sigma_\text{condition}(R_1)\bowtie\sigma_\text{condition}(R_2)\]
graph
subgraph Slow
sig[σ condition] --> bow["⋈"] --> R1 & R2
end
subgraph Parallel
bow2["⋈"] --> sig1[σ condition] & sig2[σ condition]
sig1 --> R12[R1]
sig2 --> R22[R2]
end
This method detects and resolves deadlocks in strict 2PL systems.
This implements the idea that each transaction should be executed instantly when it is started.
This makes it equivalent to a serial schedule that has the transactions in the order of their start times.
Timestamps
Each transaction $T$ is assigned a unique integer $\text{TS}(T)$ when it starts. This is the timestamp of $T$.
If $T_1$ arrived earlier than $T_2$, we require $\text{TS}(T_1)<\text{TS}(T_2)$.
This is different to last time as we assign a new timestamp even after a restart.
Additionally, for each database item $X$, we need to maintain:
- Read Time of $X$ - $\text{RT}(X)$ - The timestamp of the youngest transaction that read $X$.
- Write Time of $X$ - $\text{WT}(X)$ - The timestamp of the youngest transaction that wrote $X$.
| Action | Timestamp | Read Time | Write Time |
| --- | --- | --- | --- |
| | | $\text{RT}(X)=0$ | $\text{WT}(X)=0$ |
| $r_2(X)$ | $\text{TS}(T_2)=2$ | $\text{RT}(X)=2$ | $\text{WT}(X)=0$ |
| $r_1(X)$ | $\text{TS}(T_1)=1$ | $\text{RT}(X)=2$ | $\text{WT}(X)=0$ |
| $r_3(X)$ | $\text{TS}(T_3)=3$ | $\text{RT}(X)=3$ | $\text{WT}(X)=0$ |
| $w_2(X)$ | | $\text{RT}(X)=3$ | $\text{WT}(X)=2$ |
| $w_4(X)$ | $\text{TS}(T_4)=4$ | $\text{RT}(X)=3$ | $\text{WT}(X)=4$ |
Read Requests
If $T_1$ requests to read $X$:
- Abort and restart $T_1$ if $\text{WT}(X)>\text{TS}(T_1)$.
- Else, grant request.
Write Requests
If $T_1$ requests to write $X$:
- Abort and restart $T_1$ if $\text{RT}(X)>\text{TS}(T_1)$ or $\text{WT}(X)>\text{TS}(T_1)$.
- Else, grant request.
A full example of a timestamp-based schedule is given in the lecture.
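A minimal sketch of just the two request rules, with RT and WT kept in dictionaries and the outcome returned as a string ("abort and restart" would trigger a restart with a new timestamp; names are illustrative):

RT, WT = {}, {}  # read time and write time per item

def read_request(ts, X):
    # Abort if a younger transaction has already written X.
    if WT.get(X, 0) > ts:
        return "abort and restart"
    RT[X] = max(RT.get(X, 0), ts)
    return "granted"

def write_request(ts, X):
    # Abort if a younger transaction has already read or written X.
    if RT.get(X, 0) > ts or WT.get(X, 0) > ts:
        return "abort and restart"
    WT[X] = ts
    return "granted"

print(read_request(2, "X"))   # granted, RT(X) becomes 2
print(write_request(1, "X"))  # abort and restart, since RT(X)=2 > 1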
Properties of Timestamp Based Scheduling
Good properties:
- Enforces conflict-serialisable schedules.
- Deadlocks don’t occur.
Bad properties:
- Transactions may have to be aborted and restarted, which wastes work.
- The schedules produced are not strict (see below).
Ensuring Strictness
Schedules enforced by timestamp based schedulers are not strict.
An additional condition can be added to enforce a strict schedule:
Delay read or write requests on $X$ until the youngest transaction that wrote $X$ before has committed or aborted.
Multiversion Concurrency Control (MVCC)
This is a variant of timestamp scheduling used by many DBMSs. It is the same idea as the timestamp based approach, but you keep multiple versions of the database:
- This means that write operations do not overwrite each other; instead $w_i(X)$ creates a new version of $X$ with its own timestamp $\text{TS}(T_i)$.
- Whenever you read, you read the latest version before your timestamp.
This means that a transaction only needs to restart if it tries to write and the read timestamp is later than your timestamp.
The rules are:
- For Writes - Abort and restart $T_1$ if $\text{RT}(X)>\text{TS}(T_1)$; otherwise, grant the request.
- For Reads - Always granted.
There is also a strict variant, where you delay reads until the transaction you read from commits.
There is a full example with MVCC in the lecture.
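The multiversion idea can be sketched as below, assuming each item keeps a list of (write-timestamp, value) versions plus a read timestamp per version; all names here are my own, not from the lecture.

```python
# Illustrative multiversion store for a single item X.

versions = {"X": [(0, 10)]}    # X has an initial version, written "at time 0"
read_time = {("X", 0): 0}      # RT of each version of X

def mv_read(ts, x):
    """Reads are always granted: return the latest version written at or before ts."""
    wt, value = max((v for v in versions[x] if v[0] <= ts), key=lambda v: v[0])
    read_time[(x, wt)] = max(read_time.get((x, wt), 0), ts)
    return value

def mv_write(ts, x, value):
    """Abort only if a younger transaction has read the version this write would follow."""
    wt, _ = max((v for v in versions[x] if v[0] <= ts), key=lambda v: v[0])
    if read_time.get((x, wt), 0) > ts:
        raise Exception("abort and restart")   # RT(X) > TS(T), as in the write rule above
    versions[x].append((ts, value))            # otherwise create a new version of X

print(mv_read(2, "X"))   # T2 reads the initial version: 10
mv_write(3, "X", 99)     # T3 creates a new version of X with timestamp 3
print(mv_read(2, "X"))   # T2 still sees 10, the latest version before its timestamp
```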
There are three main approaches for deadlock detection:
Timestamps
Each transaction $T$ is assigned a unique integer $\text{TS}(T)$ upon arrival. This is the timestamp of $T$.
If $T_1$ arrived earlier than $T_2$, we require $\text{TS}(T_1)<\text{TS}(T_2)$:
- If two transactions arrive at the same time then the order doesn't matter.
- If a transaction is restarted after an abort, then the timestamp stays the same.
In this method, timestamps do not change even after a restart.
We then use the order of the timestamps to decide which can wait and which must abort to prevent deadlock.
Wait-Die Scheme
Older transactions always wait for unlocks.
graph LR
T1 -->|waits for| T2
$T_1$ requests an item that is locked by $T_2$.
Therefore there can be two cases:
- $T_1$ is older than $T_2$ - $T_1$ is allowed to wait further for $T_2$ to unlock.
- $T_1$ is younger than $T_2$ - $T_1$ is rolled back and dies.
Only older transactions are allowed to wait so no cyclic dependencies are created.
Wound-Wait Scheme
Older transactions never wait for unlocks.
graph LR
T1 -->|waits for| T2
- $T_1$ is older than $T_2$ - $T_2$ is rolled back unless it has finished (it is wounded).
- $T_1$ is younger than $T_2$ - $T_1$ allowed to wait further for $T_2$ to unlock.
Only younger transactions are allowed to wait so no cyclic dependencies are created.
This works because, at all times, the oldest transaction is always allowed to proceed, so any finite number of transactions eventually finishes under wound-wait.
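A small sketch of the two schemes, assuming `ts` is the timestamp (arrival order) of each transaction; the function names are mine.

```python
# A transaction with timestamp ts_requesting asks for an item currently
# locked by one with timestamp ts_holding. Lower timestamp = older transaction.

def wait_die(ts_requesting, ts_holding):
    """Older requesters wait; younger requesters die (abort, keeping their timestamp)."""
    return "wait" if ts_requesting < ts_holding else "die (abort requester)"

def wound_wait(ts_requesting, ts_holding):
    """Older requesters wound (abort) the holder; younger requesters wait."""
    return "wound (abort holder)" if ts_requesting < ts_holding else "wait"

# T1 (timestamp 1, older) requests an item locked by T2 (timestamp 2, younger):
print(wait_die(1, 2))    # wait
print(wound_wait(1, 2))  # wound (abort holder)

# T2 (younger) requests an item locked by T1 (older):
print(wait_die(2, 1))    # die (abort requester)
print(wound_wait(2, 1))  # wait
```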
Cascading rollbacks are slow and risk breaking durability, so we want to avoid them.
Recoverable schedules can still require cascading rollbacks.
Cascadeless Schedules
A schedule is cascadeless if each transaction in it reads only values that were written by transactions that have already committed.
No dirty reads = no cascading rollbacks.
Additionally, for recoverable schedules, the log records have to reach the disk in the right order.
Cascadeless schedules are always recoverable, but not necessarily serialisable.
Strict Schedules
A schedule is strict if each transaction in it reads and writes only values that were written by transactions that have already committed.
For example:
\[S_1:w_2(X);c_2;w_1(X);r_1(X);c_1;\]
Strict Two-Phase Locking (2PL)
This is the most popular variant of 2PL as it enforces:
- Conflict-serialisability.
- Strict Schedules
The following constraints must be followed:
- A transaction $T$ must not release any lock that allows $T$ to write data until:
- $T$ has committed or aborted, and
- The commit/abort log record has been written to disk.
This means that:
- With simple locking no locks can be released until the conditions are met.
- With shared/exclusive locks just exclusive locks can’t be released.
For examples of strict 2PL with simple and shared/exclusive locks see the lecture.
Strict 2PL & Deadlocks
With this method there is still a risk of deadlocks. We can deal with this by:
- Detecting deadlocks & fixing them.
- Enforcing deadlock-free schedules.
Enforcing deadlock-free schedules is an issue, as such schedules aren't necessarily (strict) 2PL.
Schedule Categorisation
Schedules can be categorised in the following hierarchy:
- Serial
- 2PL Strict Schedules
-
Cascadeless Schedules
You can have a cascadeless schedule that is not strict.
-
Conflict Serialisable Schedules
These may or may not be cascadeless (including children).
This lecture aims to fix the Concurrency & Partial Execution Example. We said that there are no good solutions to recover from this problem.
Dirty Reads
This is where the isolation property is not enforced in order to gain efficiency:
- Spend less time preventing dirty reads.
- Gain more parallelism by executing more transactions that would otherwise have to wait.
You can set this in the DBMS:
SET TRANSACTION READ WRITE
ISOLATION LEVEL READ UNCOMMITTED;
You can also set the first value to READ ONLY
.
You can also set the second value to:
READ COMMITTED
REPEATABLE READ
SERIALIZABLE
Cascading Rollback
If a transaction $T$ aborts then we need to:
-
Find all transactions that have read items written by $T$.
This is very slow and we want to avoid it.
-
Recursively abort all transactions that have read items written by an aborted transaction.
If we do not abort these then we break isolation. If we do abort them then we can break durability.
Recoverable Schedules
The problem for durability with cascading rollbacks arises when a transaction $T_2$ reads data written by some transaction $T_1$, $T_2$ commits, and afterwards $T_1$ aborts.
We can still do cascading rollback, but only active transactions can be forced to abort.
A schedule is recoverable if the following is true:
- If a transaction $T_1$ commits and has read an item $X$ that was written before by a different transaction $T_2$,
- then $T_2$ must commit before $T_1$ commits.
There is also the following additional requirement: the commit log records must reach the disk in the order in which the transactions commit.
Example
This is an example of a recoverable schedule:
\[S_1: w_2(X);w_1(Y);w_1(X);\mathbf{r_2(Y)};w_2(Y);\mathbf{c_1};c_2;\]
This is however not serialisable.
By making checkpoints we can enable faster recovery and make the log files use less space.
Simple Checkpointing
This method is applicable for undo logging.
The idea is that we checkpoint the log periodically:
- Every $m$ minutes or after $t$ transactions since the last checkpoint.
- No need to undo transactions before the checkpoint.
The part before the checkpoint can be discarded from the log.
Procedure
- Stop accepting new transactions
- Wait until all active transactions finish and have written
COMMIT
or ABORT
record to the log.
- Flush the log to disk.
- Write a log record
<CHECKPOINT>
.
- Flush the log to disk again.
- Resume accepting transactions.
Recovery with Simple Checkpoints
To complete recovery just step back through the undo log up to the checkpoint just as before.
ARIES Checkpoints
This style of checkpoints has the following two requirements:
- Undo/redo logging is used.
- Transactions do not write to buffers before they are sure they want to commit.
Procedure
- Write
<CHECKPOINT(T1, T2,…)>
in log and flush it.
T1, T2,…
are the transactions in progress (i.e. not committed and not aborted).
- Write the content of the buffer to disk (i.e. output).
- Write
<END CHECKPOINT>
in log and flush it.
Recovery with ARIES Checkpoints
This is the same as recovering with an undo/redo log except:
- Only redo part of committed transactions in
T1,T2,…
after <CHECKPOINT(T1,T2,…)>
.
- Then undo all of uncommitted transactions in
T1,T2,…
including before <CHECKPOINT(T1,T2,…)>
.
You must stop after having found a <CHECKPOINT(T1,T2,…)>
with corresponding <END CHECKPOINT>
and <START Ti>
for all mentioned transactions that are uncommitted.
This covers tutorial 4.
Add links to answers to this tutorial.
Locks
See the answers for tutorial 4 for this question.
Logging & Recovery
- Recovery with undo-logging:
- Change the value of $X$ on disk back to 10 and call abort for $T_1$ and $T_2$.
- Change:
on disk and call abort for $T_1$.
-
Same as 2 but also change:
- No change all transactions completed successfully.
- Values on disk:
- $X$ might be on disk.
- All values from $T_2$ must be on disk and $X$ and $Z$ might be on disk from $T_1$.
- All values from $T_2$ must be on disk and $X$, $V$ and $Z$ might be on disk from $T_1$.
- All values must be on disk.
- For the schedules in the tutorial:
- The following transactions need to be rolled back:
- $S_1$ - $T_3$
- $S_2$ - $T_2$
- $S_3$ - None
- $S_4$ - $T_3$
You should roll back any transactions with dirty reads. You should also roll back any other transaction that reads a write created by a transaction that has had a dirty read.
Dirty reads are where a transaction is allowed to read a row that has been modified by another transaction which has not committed yet.
- This question is tedious due to the number of combinations. See the answers.
Redo Logging
Logs activities with the goal of restoring committed transactions (ignoring incomplete
transactions).
Log Records:
- Same records as before and…
- New meaning of <T, X, v>: transaction $T$ has updated database item $X$ and $v$ is the new value of $X$ (for undo logging, $v$ was the old value).
Redo Logging Procedure
- $T$ first writes all log records for all updates.
- $T$ writes
<COMMIT T>
to the log on disk.
- $T$ writes all committed updates to disk.
Redo logging has the two following fundamental principles:
- <COMMIT T> occurs in the log only when the log contains complete information on $T$.
- If <COMMIT T> does not occur in the log, then $T$ has not written anything to the database on disk.
There is also the following syntax:
outcom(X)
- Outputs the last committed value from buffer to disk.
Redo Logging Example
| Time | Transaction | $X$ (Local) | $Y$ (Local) | $X$ (Buffer) | $Y$ (Buffer) | $X$ (Database) | $Y$ (Database) | Log Record Written |
|---|---|---|---|---|---|---|---|---|
| 0 | | | | | | 1 | 10 | <START T> |
| 1 | read(X) | 1 | | 1 | | 1 | 10 | |
| 2 | X := X * 2 | 2 | | 1 | | 1 | 10 | |
| 3 | write(X) | 2 | | 2 | | 1 | 10 | <T, X, 2> |
| 4 | read(Y) | 2 | 10 | 2 | 10 | 1 | 10 | |
| 5 | Y := Y * 2 | 2 | 20 | 2 | 10 | 1 | 10 | |
| 6 | write(Y) | 2 | 20 | 2 | 20 | 1 | 10 | <T, Y, 20> |
| 7 | | | | | | | | <COMMIT T> |
| 8 | flush_log | | | | | | | (Log buffer is written to disk.) |
| 9 | outcom(X) | 2 | 20 | 2 | 20 | 2 | 10 | ($X$ is written from buffer to database.) |
| 10 | outcom(Y) | 2 | 20 | 2 | 20 | 2 | 20 | ($Y$ is written from buffer to database.) |
Recovery with Redo Logging
This is essentially the reverse of undo logging:
- Identify all the transactions with a
COMMIT
log record.
- Traverse the log from the first to the last item.
- If we see
<T, X, v>
and $T$ has a COMMIT
log record, then change the value of $X$ on disk to $v$.
- For each incomplete transaction $T$, write
<ABORT T>
into the log on disk.
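This forward pass can be sketched as below; the log is modelled as a simple list of tuples and the record shapes are illustrative only.

```python
# Illustrative redo log: records are tuples such as
# ("START", "T"), ("UPDATE", "T", "X", 2)  -- 2 is the NEW value of X,
# ("COMMIT", "T"), ("ABORT", "T").

def redo_recover(log, disk):
    committed = {r[1] for r in log if r[0] == "COMMIT"}
    started = {r[1] for r in log if r[0] == "START"}

    # Forward pass: redo every update made by a committed transaction.
    for rec in log:
        if rec[0] == "UPDATE" and rec[1] in committed:
            _, t, item, new_value = rec
            disk[item] = new_value

    # Incomplete transactions are abandoned; write <ABORT T> for each of them.
    finished = committed | {r[1] for r in log if r[0] == "ABORT"}
    for t in started - finished:
        log.append(("ABORT", t))
    return disk

log = [("START", "T"), ("UPDATE", "T", "X", 2), ("UPDATE", "T", "Y", 20), ("COMMIT", "T"),
       ("START", "U"), ("UPDATE", "U", "Z", 5)]
print(redo_recover(log, {"X": 1, "Y": 10, "Z": 0}))  # X and Y redone, Z untouched, <ABORT U> appended
```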
Undo/Redo Logging
This combines the atomicity and durability of undo and redo logging.
Log Records:
- Same as before, but <T, X, v> is replaced by:
- <T, X, v, w> - Transaction $T$ has updated the value of database item $X$, and the old and new values of $X$ are $v$ and $w$ respectively.
Undo/Redo Logging Procedure
- Write all log records for all updates to database items first.
- Then write updates to disk.
<COMMIT T>
can then be written to disk before or after all changes have been written to disk.
Recovery needs to process the log in both directions.
Undo/Redo Logging Example
Consider the following schedule that needs to be logged using undo/redo logging:
| $T_1$ | $T_2$ |
|---|---|
| read(X) | |
| X := X * 2 | |
| write(X) | |
| | read(X) |
| read(Y) | |
| | X := X * 3 |
| | write(X) |
| Y := Y + Y | |
| write(Y) | |
We can put this in the following table:
| Time | Transaction $T_1$ | Transaction $T_2$ | Local $T_1$ $X$ | Local $T_1$ $Y$ | Local $T_2$ $X$ | Local $T_2$ $Y$ | Buffer $X$ | Buffer $Y$ | Disk $X$ | Disk $Y$ | Buffer Log |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | | | | | | | 1 | 2 | <START T1> |
| 1 | read(X) | | 1 | | | | 1 | | 1 | 2 | |
| 2 | X := X * 2 | | 2 | | | | 1 | | 1 | 2 | |
| 3 | write(X) | | 2 | | | | 2 | | 1 | 2 | <T1, X, 1, 2> |
| 4 | | | 2 | | | | 2 | | 1 | 2 | <START T2> |
| 5 | | read(X) | 2 | | 2 | | 2 | | 1 | 2 | |
| 6 | read(Y) | | 2 | 2 | 2 | | 2 | 2 | 1 | 2 | |
| 7 | | X := X * 3 | 2 | 2 | 6 | | 2 | 2 | 1 | 2 | |
| 8 | | write(X) | 2 | 2 | 6 | | 6 | 2 | 1 | 2 | <T2, X, 2, 6> |
| 9 | Y := X + Y | | 2 | 4 | 6 | | 6 | 2 | 1 | 2 | |
| 10 | write(Y) | | 2 | 4 | 6 | | 6 | 4 | 1 | 2 | <T1, Y, 2, 4> |
| 11 | | | 2 | 4 | 6 | | 6 | 4 | 1 | 2 | <COMMIT T1> |
| 12 | flush_log | | 2 | 4 | 6 | | 6 | 4 | 1 | 2 | (Log is written to disk) |
| 13 | output(X) | | 2 | 4 | 6 | | 6 | 4 | 6 | 2 | |
| 14 | output(Y) | | 2 | 4 | 6 | | 6 | 4 | 6 | 4 | |
| 15 | | | 2 | 4 | 6 | | 6 | 4 | 6 | 4 | <COMMIT T2> |
| 16 | | flush_log | 2 | 4 | 6 | | 6 | 4 | 6 | 4 | (Logs since last flush are written to disk) |
| 17 | | output(X) | 2 | 4 | 6 | | 6 | 4 | 6 | 4 | No change |
Undo without Redo (Force)
Undo ensures atomicity.
We can ensure durability by using force:
-
This forces the writing of updates to disk before commit.
No force is the opposite of this.
-
Force is expensive in disk operations.
Redo without Undo (No Steal)
Redo ensures durability.
We can ensure atomicity by using no steal:
-
No steal means that uncommitted data may not overwrite committed data on disk.
Steal means this is not required.
-
No steal is expensive to ensure.
Ensuring Atomicity & Durability
We could ensure atomicity and durability without a log by using no steal and force, however this is:
- Hard and expensive to ensure.
- Must write every change to disk made by the transaction while performing the commit.
In practice we use undo/redo as it is cheap and ensures atomicity and durability.
Logging
The following activities should be written to the log so that a desired database state can be recovered later:
- Starts of transactions, commits and aborts.
- Modification of database items.
This should be implemented in such a way that it works even if the system fails during logging.
You can use the following techniques:
- Undo Logging (Maintains atomicity.)
- Redo Logging (Maintains durability.)
- Combinations of the two which allow for atomicity and durability.
Logging Transaction Syntax
A DBMS can be modelled like so:
graph
Transaction <--> Buffers
lm[Log Manager] --> Buffers
Buffers <--> db[(Database & Log)]
The following additional transaction syntax is required for logging:
write(X)
- Writes changes only to the buffer.
output(X)
- Copy database item $X$ from the buffer to the database.
flush_log
- Write log entries that are currently residing in main memory (buffer) to the log.
Undo Logging
This logs the database activities with the goal of restoring a previous state:
To do this we should have log entries for the following:
- <START T> - Transaction $T$ has started.
- <COMMIT T> - Transaction $T$ has committed.
- <ABORT T> - Transaction $T$ has aborted.
- <T, X, v> - Transaction $T$ has updated database item $X$, and the old value of $X$ was $v$.
Undo Logging Procedure
- If transaction $T$ updates database item $X$, and the old value was $v$, then
<T, X, v>
must be written to the log on disk before $X$ is written to the disk.
- If transaction $T$ commits, then
<COMMIT T>
must be written to the disk as soon as all database elements changed by $T$ are written to the disk.
This is represented in the following transaction:
| Transaction | Comments |
|---|---|
| read(X) | |
| X := X*2 | |
| write(X) | |
| read(Y) | |
| Y := Y*2 | |
| write(Y) | |
| flush_log | Writes all log entries for updates to disk. |
| output(X) | Writes updates for $X$ to disk. |
| output(Y) | Writes updates for $Y$ to disk. |
| flush_log | Writes the <COMMIT T> record to disk |
Example
This is a full example of the previous table:
| Time | Transaction | $X$ (Local) | $Y$ (Local) | $X$ (Buffer) | $Y$ (Buffer) | $X$ (Database) | $Y$ (Database) | Log Record Written |
|---|---|---|---|---|---|---|---|---|
| 0 | | | | | | 1 | 10 | <START T> |
| 1 | read(X) | 1 | | 1 | | 1 | 10 | |
| 2 | X := X * 2 | 2 | | 1 | | 1 | 10 | |
| 3 | write(X) | 2 | | 2 | | 1 | 10 | <T, X, 1> |
| 4 | read(Y) | 2 | 10 | 2 | 10 | 1 | 10 | |
| 5 | Y := Y * 2 | 2 | 20 | 2 | 10 | 1 | 10 | |
| 6 | write(Y) | 2 | 20 | 2 | 20 | 1 | 10 | <T, Y, 10> |
| 7 | flush_log | 2 | 20 | 2 | 20 | 1 | 10 | |
| 8 | output(X) | 2 | 20 | 2 | 20 | 2 | 10 | |
| 9 | output(Y) | 2 | 20 | 2 | 20 | 2 | 20 | |
| 10 | | 2 | 20 | 2 | 20 | 2 | 20 | <COMMIT T> |
| 11 | flush_log | 2 | 20 | 2 | 20 | 2 | 20 | |
There could be several failure scenarios:
- <COMMIT T> occurs in the log on disk: $T$ has committed successfully (no recovery is needed).
- <COMMIT T> doesn't occur in the log on disk and the error occurred after output(X): we must undo all updates to database items that were written to disk. For each log record <T, X, v> on disk, replace $X$ on disk by $v$.
- <COMMIT T> doesn't occur in the log on disk and flush_log wasn't run: nothing needs to be done, as no log records for the transaction were written to the disk yet.
Recovery with Undo Logs
If an error occurs we should traverse the log backwards and take the following actions. If the current entry is:
<COMMIT T>
- $T$ was committed successfully.
<ABORT T>
- $T$ was manually aborted.
<T, X, v>
- If $T$ has not finished, change the value of $X$ on disk to $v$.
Then write <ABORT T> for each uncommitted transaction $T$ that has not previously aborted, and call flush_log.
Dealing with ABORT
This is similar to recovery with undo logs but focuses on a single transaction:
- Assume $T$ aborts.
- Traverse the undo log from the last to the first item.
- If we see
<T, X, v>
, change the value of $X$ on disk back to $v$.
Reasons For Transaction Abortions
Errors while executing transactions:
- Violation of integrity constraints, other run-time errors.
- e.g. multiple people booked into a seat that should relate to only one customer.
Deadlocks:
- Concurrency control requests to abort transactions.
- When using two-phase locking.
Explicit Request:
- User says
ROLLBACK;
after they START TRANSACTION;
.
Reasons Beyond the DBMS’s Control
Media Failure:
- The medium holding the database becomes partially or completely unreadable.
Can be guarded against by using archives and controlled redundancy with RAID.
Catastrophic Events:
- The medium holding the database is destroyed.
Can be guarded against by having archives at safe, different locations.
System Failures:
- Information about the active transaction’s state is lost.
- Could be due to power failures or software errors.
Basic locks are very simple, but not very precise:
- If we just want to read an item we still need a full lock, even though read-read conflicts don't happen.
2PL Issues
2PL ensures conflict-serialisability, but might lead to:
- Deadlocks - Where transactions are forced to wait forever.
- Other issues to be shown later.
Deadlocks in 2PL
If an object is already locked by one transaction and another transaction also wants a lock on the same object then there is a risk of deadlock.
This is because all the lock operations occur in the first phase, so there is no chance that the object can be unlocked in order for the second transaction to use it.
Different Lock Modes
To fix the deadlock issue we can implement different lock modes that can allow read only access.
Shared & Exclusive Locks
Shared Locks (Read Lock):
- Requested by transactions to read an item $X$.
- Granted to several transactions at the same time.
The operation can be represented as s-lock(X)
.
Exclusive Locks (Write Lock):
- Requested by transactions to write an item $X$.
- Granted to at most one transaction at a time.
The operation can be represented as x-lock(X)
.
Additional Rules:
- A shared lock on $X$ is only granted if no other transaction holds an exclusive lock on $X$.
- Exclusive lock on $X$ is granted only if no other transaction holds a lock of any kind on $X$.
Schedules with Shared & Exclusive Locks
We can use the following shorthand notation to write schedules:
- $sl_i(X)$ - Transaction $i$ requests a shared lock for item $X$.
- $xl_i(X)$ - Transaction $i$ requests an exclusive lock for item $X$.
- $u_i(X)$ - Transaction $i$ releases all locks on item $X$.
For the following transactions:
| $T_1$ | $T_2$ |
|---|---|
| s-lock(X) | s-lock(X) |
| read(X) | read(X) |
| unlock(X) | x-lock(X) |
| | write(X) |
| | unlock(X) |
We could write the following schedule:
\[S: \mathbf{sl_1(X)};r_1(X);\mathbf{sl_2(X)};r_2(X);\mathbf{u_1(X);xl_2(X)};w_2(X);\mathbf{u_2(X)};\]
An individual transaction may hold both a shared lock and an exclusive lock for the same item $X$.
Upgrading Locks Issues
A shared lock on $X$ can be upgraded later to an exclusive lock on $X$. You can use this to be friendly to other transactions.
There is still a risk of deadlocks using this method.
Update Locks
These avoid deadlock by giving an indication that you want to write later but not right now.
Update Lock:
- Requested by a transaction to read an item.
- May be upgraded to an exclusive lock later (shared locks can no longer be upgraded).
- Granted to at most one transaction at a time.
The operation can be represented as u-lock(X)
or $ul_i(X)$.
Inter-Transaction Lock Policy
You can check if you can use a particular type of lock using the following table:
| | Shared | Update | Exclusive |
|---|---|---|---|
| Shared | Yes | Yes | No |
| Update | No | No | No |
| Exclusive | No | No | No |
The lock you request is on the top and the locks other transactions hold for the same object are on the side.
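The table can be read as a simple compatibility check, sketched below; the dictionary layout and function name are mine, not from the lecture.

```python
# compatible[requested][held] is True if the requested lock can be granted while
# another transaction holds a lock of type `held` on the same item.
compatible = {
    "S": {"S": True,  "U": False, "X": False},
    "U": {"S": True,  "U": False, "X": False},
    "X": {"S": False, "U": False, "X": False},
}

def can_grant(requested, locks_held_by_others):
    return all(compatible[requested][held] for held in locks_held_by_others)

print(can_grant("U", ["S"]))  # True: an update lock can join existing shared locks
print(can_grant("S", ["U"]))  # False: no new shared locks once an update lock is held
print(can_grant("X", []))     # True: exclusive lock on an unlocked item
```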
2PL with Different Lock Modes
To incorporate shared, update and exclusive locks with 2PL we just use the same rules as before, but applied to all locks:
- All lock operations (shared, update and exclusive) precede all unlocks.
This still guarantees conflict-serialisability.
Locks with Multiple Granularity
DBMSs may use locks at different levels of granularity (from coarse to fine):
- Relations (coarsest)
- Disk blocks
- Tuples (finest)
Here are some examples of how granular locks might be used:
SELECT name FROM Student WHERE studentID = 123456;
A shared lock on the tuple would suffice.
SELECT avg(salary) FROM Employee;
A shared lock on the relation might be necessary.
Trade-Offs
Locking at too coarse a granularity:
- Low Overhead
- Don't need to store as much information.
- Lower Degree of Concurrency
- May cause unnecessary delays.
Locking at too fine a granularity:
- High Overhead
- Need to keep track of all locked items.
- High Degree of Concurrency
You should pay attention to the hierarchy to guarantee conflict-serialisability. The following should not happen:
- A transaction holds a shared lock for a tuple, while another transaction holds an exclusive lock for the relation.
Intention Locks (Warning Locks)
With this method we use only shared and exclusive locks. These are made for locking with multiple granularity.
We gain the following intention locks:
- IS - Intention to request a shared lock on a sub-item.
- IX - Intention to request an exclusive lock on a sub-item.
We have the following rules:
-
If a transaction wants to lock an item $X$, it must first put an intention lock on the super-items of $X$.
This is according to the granularity hierarchy above.
- For shared locks, the intention locks are IS.
- For exclusive locks, the intention locks are IX.
For example if I want to request a shared lock on a tuple I must then:
- Request IS on the parent block.
- Request IS on the block’s parent relation.
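A sketch of this rule, assuming the hierarchy relation → block → tuple and that intention locks are taken top-down before the real lock; the function and names are illustrative only.

```python
# Illustrative multiple-granularity locking: before locking an item, place the
# matching intention lock (IS for S, IX for X) on each of its super-items.

def locks_to_request(path_from_relation_to_item, mode):
    """mode is 'S' or 'X'; returns the sequence of (item, lock type) requests."""
    intention = "IS" if mode == "S" else "IX"
    *super_items, item = path_from_relation_to_item
    return [(s, intention) for s in super_items] + [(item, mode)]

# Shared lock on one tuple: IS on the relation, IS on the block, then S on the tuple.
print(locks_to_request(["Student", "block 7", "tuple 123456"], "S"))
# [('Student', 'IS'), ('block 7', 'IS'), ('tuple 123456', 'S')]
```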
Policy For Granting Locks (Including Intention Locks)
You can check if you can use a particular type of lock using the following table:
| | Shared (S) | Exclusive (X) | IS | IX |
|---|---|---|---|---|
| Shared (S) | Yes | No | Yes | No |
| Exclusive (X) | No | No | No | No |
| IS | Yes | No | Yes | Yes |
| IX | No | No | Yes | Yes |
The lock you request is on the top. You can grant the lock only if the types of locks held by other transactions are “yes” from the left.
See Locking Conflict-Serialisable Schedules for how two phase locking works.
If $S$ is a schedule containing only 2PL transactions, then $S$ is conflict-serialisable.
Claim: We can move all operations of the transaction $T_i$ that unlocks first to the beginning of the schedule (using swaps of consecutive non-conflicting operations).
No formal definition is given in this proof so it is better to watch the video (proof starts at 1:33).
Transaction Scheduling in a DBMS
Transaction scheduling looks like the following inside a DBMS:
graph LR
so[Stream of Operations] -->|Operations of Transactions| Scheduler -->|"(Conflict) Serialisable Schedule"| b[(Buffers)]
Scheduler -->|Execute or Delay Request| Scheduler
This needs to be completed live, as transactions come into the system.
Using Locks
This is a simple version of locking. There are more advanced versions in other lectures.
Simple Locking Mechanism
A transaction has to lock an item before it accesses it.
Locks are requested from and granted by the scheduler:
- Each item is locked by at most one transaction at a time.
- Transactions wait until a lock can be granted.
Each lock has to be released (unlocked) eventually.
flowchart
subgraph T1
l["lock(x)"]
r["read(x)"]
u["unlock(x)"]
end
subgraph T2
l2["lock(x)"]
r2["read(x)"]
w["write(x)"]
u2["unlock(x)"]
end
T1 & T2 --> Scheduler
Scheduler --> b[(Buffers)]
A transaction is blocked from acquiring a lock unless the item is free.
Schedules With Simple Locks
We can extend the syntax for schedules by including the following two operations:
- $l_i(X)$ - Transaction $i$ requests a lock for item $X$.
- $u_i(X)$ - Transaction $i$ unlocks item $X$.
The previous graph could then be shown in the following shorthand:
\[S: \mathbf{l_1(X)};r_1(X);\mathbf{u_1(X);l_2(X);}r_2(X);w_2(X);\mathbf{u_2(X)};\]
The rules for simple locking are:
-
For each $r_i(X)/w_i(X)$ there is an earlier $l_i(X)$ without any $u_i(X)$ occurring between $l_i(X)$ and $r_i(X)/w_i(X)$.
A lock must be held when reading or writing an item.
- For each $l_i(X)$ there is a later $u_i(X)$.
-
If $l_i(X)$ comes before $l_j(X)$, then $u_i(X)$ occurs between $l_i(X)$ and $l_j(X)$.
An object can only be locked by a single transaction at a time.
Not Serialisable
There are transactions that, when combined with this locking technique, produce different results compared to when they are executed serially.
This means the method may not produce serialisable schedules.
Two-Phase Locking (2PL)
We can change the simple locking mechanism so that it guarantees conflict-serialisability.
The condition is:
- In each transaction, all lock operations should precede all unlocks.
Therefore the transaction takes place in two phases:
- Phase 1 - Request locks and possibly other read/write operations.
- Phase 2 - Unlock and possibly other read/write operations.
A transaction is 2PL if all locks are completed before any unlocks.
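A small check of this condition for one transaction's sequence of operations can be sketched as below; the operation encoding is my own.

```python
# Operations for one transaction, e.g. [("lock", "X"), ("read", "X"), ("unlock", "X"), ...]

def is_two_phase(operations):
    """2PL: once the transaction has unlocked anything, it may not request another lock."""
    seen_unlock = False
    for op, item in operations:
        if op == "unlock":
            seen_unlock = True
        elif op == "lock" and seen_unlock:
            return False
    return True

print(is_two_phase([("lock", "X"), ("lock", "Y"), ("read", "X"),
                    ("unlock", "X"), ("write", "Y"), ("unlock", "Y")]))  # True
print(is_two_phase([("lock", "X"), ("unlock", "X"),
                    ("lock", "Y"), ("unlock", "Y")]))                    # False: a lock after an unlock
```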
This tutorial relates to the questions here.
Question 1
The short hand notation would be:
\[S_a: r_1(X);w_1(X);c_1;r_2(X);w_2(X);c_2;\]
The schedule is a serial schedule as the transactions are run one after the other.
The following ACID properties are broken:
- No ACID properties are broken as there are no writes to the databases mid transaction.
Question 2
- This breaks:
- Atomicity as $T_1$ is not complete.
- Maybe consistency if strong consistency is used.
- Isolation as the transaction is not serialisable.
- Durability breaks as $T_1$ is not complete.
- If $T_1$ is rolled back then the database is left in an inconsistent state due to writes caused by $T_2$.
- If we rolled both transactions back then durability is broken as we change the data committed by $T_2$.
Question 3
-
$S_3$:
graph LR
T3 --> T1
T2 --> T3
T2 --> T1
No cycles so it is conflict-serialisable.
-
$S_4$:
graph LR
T3 --> T1
T2 --> T3 & T1
T1 --> T3
There is a cycle between $T_3$ and $T_1$ so it is not conflict-serialisable.
Question 4
A transaction can’t be in conflict with itself. Therefore a cycle must involve at least two transactions and so has length greater than 1.
Converting Schedules to Precedence Graphs
The precedence graph for a schedule $S$ is defined as follows:
- It is a directed graph.
- Its nodes are the transactions that occur in $S$.
- It has an edge from transaction $T_i$ to transaction $T_j$ if there is a conflicting pair of operations $op_1$ and $op_2$ in $S$ such that:
- $op_1$ appears before $op_2$ in $S$.
- $op_1$ belongs to transaction $T_i$.
- $op_2$ belongs to transaction $T_j$.
Example
Convert the following schedule to a precedence graph:
\[S: r_2(X); r_1(Y); w_2(X); r_2(Y); r_3(X); w_1(Y); w_3(X); w_2(Y);\]
Edges are added for conflicting pairs of operations (operations from different transactions, on the same item, at least one of which is a write, so they cannot be swapped without changing the behaviour of at least one transaction):
graph LR
T1 --> T2 --> T3
T2 --> T1
Testing Conflict-Serialisability
- Construct the precedence graph for $S$.
- If the precedence graph has no cycles, then $S$ is conflict-serialisable.
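Both steps can be sketched in a few lines of Python; the schedule encoding as (transaction, operation, item) triples and the helper names are mine, not from the lecture.

```python
# A schedule is a list of (transaction, op, item) triples, e.g. (1, "r", "X") for r1(X).

def precedence_graph(schedule):
    edges = set()
    for i, (ti, op_i, item_i) in enumerate(schedule):
        for tj, op_j, item_j in schedule[i + 1:]:
            conflict = ti != tj and item_i == item_j and "w" in (op_i, op_j)
            if conflict:
                edges.add((ti, tj))  # an operation of ti must come before one of tj
    return edges

def has_cycle(edges):
    nodes = {n for e in edges for n in e}
    def reachable(start, goal, seen=()):
        return any(b == goal or (b not in seen and reachable(b, goal, seen + (b,)))
                   for a, b in edges if a == start)
    return any(reachable(n, n) for n in nodes)

# S: r2(X); r1(Y); w2(X); r2(Y); r3(X); w1(Y); w3(X); w2(Y)
S = [(2, "r", "X"), (1, "r", "Y"), (2, "w", "X"), (2, "r", "Y"),
     (3, "r", "X"), (1, "w", "Y"), (3, "w", "X"), (2, "w", "Y")]
edges = precedence_graph(S)
print(edges)             # includes (1, 2) and (2, 1), among others
print(has_cycle(edges))  # True, so S is not conflict-serialisable
```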
Examples
-
$S: r_1(X); w_1(X);$ $r_2(X); w_2(X);$ $r_1(Y); w_1(Y);$ $r_2(Y); w_2(Y)$:
graph LR
T1 --> T2
There are no cycles so $S$ is conflict-serialisable.
-
$S: r_2(X); r_1(Y);$ $w_2(X); r_2(Y);$ $r_3(X); w_1(Y);$ $w_3(X); w_2(Y)$:
graph LR
T1 --> T2 --> T3
T2 --> T1
There is a cycle, so $S$ is not conflict-serialisable.
Meaning of Precedence Graphs
The following graph:
graph LR
T1 --> T2
means that an operation in $T_1$ must occur before an operation in $T_2$.
This means that in a cycle you are unable to determine a valid serial order of the transactions, as they are co-dependent.
graph LR
T1 --> T2
T2 --> T1
For a full proof by contradiction see the slides and lecture video.
Constructing a Serial Schedule
Consider that you have the following schedule and you want to make it serial:
\[S: r_1(Y), r_3(Y), r_1(X), r_2(X), w_2(X), r_3(Z), w_3(Z), r_1(Z), w_1(Y), r_2(Z)\]
This is the equivalent precedence graph:
graph LR
T1 --> T2
T3 --> T2
T3 --> T1
To find the serial schedule:
- Find a transaction with no incoming edges (only outgoing edges).
- Put this transaction next in your schedule and remove it, along with its edges, from the graph.
- Repeat until no transactions remain.
Following these rules yields the following serial schedule:
\[S': r_3(Y), r_3(Z), w_3(Z), r_1(Y), r_1(X), r_1(Z), w_1(Y), r_2(X), w_2(X), r_2(Z)\]
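This procedure is a topological sort of the precedence graph; a sketch under the same edge representation as the earlier Python sketch follows.

```python
def serial_order(transactions, edges):
    """Repeatedly pick a transaction with no incoming edges; works when the graph is acyclic."""
    order, remaining, edges = [], set(transactions), set(edges)
    while remaining:
        t = next(t for t in sorted(remaining)
                 if not any(b == t for a, b in edges if a in remaining))
        order.append(t)                              # T has no predecessors left, so it goes next
        remaining.remove(t)
        edges = {(a, b) for a, b in edges if a != t} # drop its outgoing edges
    return order

# Precedence graph for the schedule above: T1 -> T2, T3 -> T2, T3 -> T1.
print(serial_order([1, 2, 3], {(1, 2), (3, 2), (3, 1)}))  # [3, 1, 2], matching S'
```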
This is how we calculate quickly how much concurrency we can allow while satisfying isolation and consistency.
Conflict-Serialisability
This is a stronger form of serialisability that is used by most commercial DBMSs. It is based on the notion of a conflict:
A conflict in a schedule is a pair of operations, from different transaction, that cannot be swapped without changing the behaviour of at least one of the transactions.
\[r_1(X); \underbrace{w_1(X); r_2(X)}_\text{conflict}; w_2(X); r_1(Y); w_1(Y);\]
Look for pairs of operations on the same item from different transactions where at least one of them is a write; a read followed by a read is okay.
Conflicts
A conflict in a schedule is a pair of operations:
- From different transactions.
- That access the same item.
- At least one of them is a write operation.
Conflict-Equivalency
Two schedules $S$ and $S’$ are conflict-equivalent if $S’$ can be obtained from $S$ by swapping any number of: consecutive, non-conflicting, operations from different transactions.
Example
You want to order the transactions to make it serial:
\[\begin{aligned}
S:& r_1(X); w_1(X); w_2(X); \mathbf{r_2(Y); r_1(Y)}; w_1(Z)\\
& r_1(X); w_1(X); \mathbf{w_2(X); r_1(Y)}; r_2(Y); w_1(Z)\\
& r_1(X); w_1(X); r_1(Y); w_2(X); \mathbf{r_2(Y); w_1(Z)}\\
& r_1(X); w_1(X); r_1(Y); \mathbf{w_2(X); w_1(Z)}; r_2(Y)\\
S':& r_1(X); w_1(X); r_1(Y); w_1(Z); w_2(X); r_2(Y)
\end{aligned}\]
A schedule is conflict-serialisable if it is conflict-equivalent to a serial schedule.
Hierarchy of Schedules
All of the types of schedules shown so far are represented below:
flowchart
subgraph "All Schedules (Concurrent Schedules)"
subgraph Serialisable Schedules
subgraph Conflict-Serialisable Schedules
ss[Serial Schedules]
end
end
end
Inclusions in the diagram are strict.
Types of Schedules
So far we have seen two extremes of scheduling:
Serial Schedules
- Execute correctly.
- Maintain consistency of the database.
- Inefficient in multi-user environments (because transactions have to wait until others are finished).
Concurrent Schedules
- May not execute correctly.
- May not guarantee consistency of the database or isolation.
- Efficient in multi-user environments (if you get asked to do something, just do it!).
Concurrent Schedules
Concurrent schedules can be faster than serial ones but they are not always equivalent.
In a serial schedule you have to wait until one transaction is over before you can start another.
However, by running two transactions simultaneously you may get erroneous results due to the interactions between the two transactions.
Concurrent schedules do not guarantee consistency or isolation.
Serialisable Schedules
A schedule $S$ is serialisable if there is a serial schedule $S’$ that has the same effect as $S$ on every initial database state:
| Schedule $S$ | |
|---|---|
| read(X) | |
| X := X + N | |
| | read(X) |
| | X := X + M |
| | write(X) |
| | commit |
| read(Y) | |
| Y := Y + N | |
| write(Y) | |
| commit | |

| Equivalent Serial Schedule $S’$ | |
|---|---|
| read(X) | |
| X := X + N | |
| read(Y) | |
| Y := Y + N | |
| write(Y) | |
| commit | |
| | read(X) |
| | X := X + M |
| | write(X) |
| | commit |
Serialisable schedules guarantee:
- Isolation.
- Consistency (provided each individual transaction preserves consistency).
The problem is that they are difficult to test:
- They don’t depend only on reads, writes, and commits, but also on the non-database operations.
- Non-database operations can be complex.
This lecture will precisely cover the four ACID properties.
A - Atomicity
A transaction is an atomic unit of processing:
- An indivisible unit of execution.
- Executed in its entirety or not at all.
Deals with failure:
- User aborts a transaction (cancel button).
- System aborts a transaction (deadlock).
- Transaction aborts itself (unexpected database state).
- System crashes, network failure and more.
There are two possible outcomes:
- Abort - An error prevented full execution:
- We undo the work done up to the error point. The system re-creates the database state as it was before the start of the aborted transaction.
- Commit - No error and the entire transaction executes:
- The system is updated correctly.
C - Consistency
A correct execution of the transaction must take the database from one consistent state to another consistent state.
There are two definitions of this property:
-
Weak Version - Transaction may not violate constraints.
This is the typical definition. Constraints are things such as relations and data-types.
-
Strong Version - It should correctly transform the database state to reflect the effect of a real world event. (ISO/IEC 10026-1:1998)
Due to the two definitions this can be hard to verify.
Transactions & Consistency
We can assume that transactions preserve consistency as:
Transactions (if run on their own) always transform a consistent database state into another consistent database state.
Transactions can only produce one of two outcomes:
- Commit - Execution was successful and the database is left in a consistent state.
- Abort - Execution was not successful and we need to restore the database to the state it was in before.
I - Isolation
A schedule satisfies isolation if and only if ($\iff$) it is serialisable:
- A serialisable schedule has the same effect as some serial schedule, so each transaction behaves as if it were running on its own and consistency is preserved.
SQL Isolation Levels
The SQL standard divides isolation into several levels:
- Read Uncommitted (no isolation) - It is fine if you read data which has not been committed.
- Read Committed - Everything you read must have been committed before you can see it.
-
Repeatable Read - Above and if you read the same thing twice in a transaction, you must get the same return value.
Other transactions don’t affect this transaction’s reads.
- Serialisable - Above and the schedule must be equivalent to a serial schedule.
Isolation level 4 (serialisable) is the default meaning of isolation.
Some implementations have more levels.
You can set the isolation level with the following statement:
SET TRANSACTION READ WRITE
ISOLATION LEVEL READ UNCOMMITTED;
TRANSACTION
can also be set to READ ONLY
.
D - Durability
Once a transaction commits and changes the database, these changes cannot be lost because of a subsequent failure.
The changes could still be overwritten by a later update.
We redo the transaction if there are any problems after the update.
Durability deals with things like media failure.
ACID & Relational DBMS Components
There are several components in a DBMS:
flowchart LR
ua([User/Application]) -->tm[Transaction Manager]
tm --> cc[Concurrency Control]
tm --> lr[Logging & Recovery]
lr <--> b[(Buffers)]
b <--> s[(Storage)]
Two of which are used to satisfy ACID:
- Atomicity - Via recovery control (Logging & Recovery)
- Consistency - Via scheduler (Concurrency Control)
- Isolation - Via scheduler (Concurrency Control)
- Durability - Via recovery control (Logging & Recovery)
Concurrency Example
Consider the concurrency example that was shown before. The specific ordering of the statements made it such that a seat was double-booked.
There are a few issues with how the transactions interacted with each other:
- The two transactions are not isolated from each-other.
- Additionally, the outcome is not consistent with the real world.
To solve this issue we could:
- Undo the second transaction when it tries to do the second operation.
Partial Execution Example
Consider a bank transfer from one account to another. A failure, mid-transaction, could leave the database in an unacceptable state.
There is an issue with how only parts of the transaction get executed:
- A transaction should be atomic (indivisible).
- Maybe not consistent with the real world as the account is not debited properly.
To solve the issue we could:
- Undo the transaction when the computer comes back online.
Concurrency & Partial Execution Example
Consider the following turn of events:
sequenceDiagram
User 1 ->> Database:Add £100 to '456'
User 2 ->> Database:Does '456' have £100?
User 2 ->> Database:Add £100 to '789'
User 2 ->> Database:Subtract £100 from '456'
note over Database: Failure, subsequent interactions don't complete.
User 1 ->> Database:Subtract £100 from '123'
- User 1 has account
123
.
- User 2 has account
456
.
This leaves User 1 with £100 more than expected.
Problems
There is an issue with how parts of the first transaction get executed:
- A transaction should be atomic.
There is an issue with how the two transactions interact:
- Consistency issue as user 2 had £100 before user 1 has committed.
- The two transactions are not isolated from each-other.
Solutions
-
We could do nothing and the bank would lose £100:
This would result from the atomicity issue.
-
We could undo the first transaction, but not the second. This would leave the second transaction invalid if there are insufficient funds:
This results from the isolation issue.
This would give a consistency issue.
-
We could undo both transactions. This would make the users confused as the state is not preserved.
This is an issue as we expect that any change we make is durable.
ACID Properties
All of these properties together are called:
- Atomicity
- Consistency
- Isolation
- Durability (or permanency)
Translating SQL into Low-Level Operations
Consider an example table with the following schema:
Employees(e_id, first_name, last_name, birth_day, salary)
From this we could run the following two statements:
SELECT salary
FROM Employees
WHERE e_id = 1234;
UPDATE Employees
SET salary = salary * 1.1
WHERE e_id = 1234;
These statements could be made into the following three transaction operations:
read(e_id = 1234, salary)
salary = salary * 1.1
write(e_id = 1234, salary)
There are two database operations (read()
and write()
) and a calculation.
Basic Operations
There are two basic operations of database transactions:
read(X)
- Reads a database item X
into a program variable, also called X
:
- Find the address of the disk block that contains item
X
.
- Copy that disk block into a buffer in main memory (provided that disk block isn't already in some main memory buffer).
- Copy item
X
from the buffer to the program variable X
.
write(X)
- Writes the value of program variable X
into the database item named X
:
- Find the address of the disk block that contains item
X
.
- Copy that disk block into a buffer in main memory (if that disk block is not already in some main memory buffer).
- Copy item
X
from the program variable X
into its correct location in the buffer.
- Store the updated block from the buffer back to disk either immediately or at some later point in time.
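These two operations can be sketched against a toy buffer pool as below; the disk/buffer dictionaries and block lookup are illustrative only, not a real DBMS API.

```python
# Toy model: the "disk" maps block addresses to blocks (dicts of items);
# the buffer pool holds in-memory copies of some blocks.
disk = {"block0": {"X": 5, "Y": 7}}
buffer_pool = {}                              # block address -> in-memory copy of the block
block_of = {"X": "block0", "Y": "block0"}     # which block each item lives in
local = {}                                    # the transaction's program variables

def read(item):
    addr = block_of[item]                     # 1. find the block containing the item
    if addr not in buffer_pool:               # 2. bring it into a buffer if needed
        buffer_pool[addr] = dict(disk[addr])
    local[item] = buffer_pool[addr][item]     # 3. copy the item into the program variable

def write(item):
    addr = block_of[item]
    if addr not in buffer_pool:
        buffer_pool[addr] = dict(disk[addr])
    buffer_pool[addr][item] = local[item]     # copy the program variable into the buffer
    # The updated block is written back to disk immediately or at some later point.

read("X")
local["X"] = local["X"] * 2
write("X")
print(buffer_pool, disk)   # the buffer holds X = 10, the disk still holds X = 5
```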
Schedules
Schedules hold many transactions for execution. The operations making up the transactions are then executed in the order given by the schedule:
- It must preserve that the operations in each transaction happen in the right order.
There are two types:
- Serial Schedules - Executes the transactions one after another.
- Concurrent Schedules - Can interleave operations from the transactions in the schedule.
Serial Schedule Example
Consider the following two transactions:
Begin(T1)
read(X);
X := X + 100;
write(X);
read(Y);
Y := Y + 50
write(Y);
commit;
End(T1)
Begin(T2)
read(X);
read(Y);
X := X + Y;
write(X);
commit;
End(T2)
Shorthand
We could write this as the following shorthand:
\[S_a:r_1(X);w_1(X);r_1(Y);w_1(Y);c_1; r_2(X); r_2(Y); w_2(X); c_2;\]
The following symbols are used:
- $S_{\text{id}}$ = Schedule (
id
is the schedule ID)
- $r_i(X)$ =
read(X)
in transaction i
- $w_i(X)$ =
write(X)
in transaction i
- $c_i$ =
commit
in transaction i
- $a_i$ = abort (
rollback
) in transaction i
Concurrent Schedule Example
The following transactions:
Begin(T1)
read(X);
X := X + 100;
write(X);
read(Y);
Y := Y + 50
write(Y);
commit;
End(T1)
Begin(T2)
read(X);
read(Y);
X := X + Y;
write(X);
commit;
End(T2)
can be represented in many ways in a concurrent schedule:
\[S_b: r_1(X); w_1(X); r_2(X); r_1(Y); w_1(Y); c_1; r_2(Y); w_2(X); c_2;\]
In addition all serial schedules are types of concurrent schedules.
The order of operations must still reflect that of the order of the underlying transactions.
These are a sequence of SQL statements one after the other. There may be issues if transactions are running in parallel or if a single transaction fails.
Concurrency
If multiple people are interacting with a database then you may run into bugs like the following:
sequenceDiagram
User 1-)Database:Which seats on flight 123 are available?
User 2-)Database:Which seats on flight 123 are available?
User 1-)Database:Book seat 14B.
User 2-)Database:Book seat 14B.
This might lead to an inconsistent database.
If “User 1” completed their transaction all in one go then this wouldn’t be a problem.
Transactions Solution
This allows a series of statements to be grouped together and run in order:
START TRANSACTION;
SELECT seatNo
FROM Flights
WHERE flightNo = 123 AND date = '2020-10-30'
AND seatStatus = 'available';
UPDATE Flights
SET seatStatus = 'occupied'
WHERE flightNo = 123 AND date = '2020-10-30'
AND seatNo = '14B';
COMMIT;
START TRANSACTION;
is often not required.
Before COMMIT;
all changes are tentative.
By running a transaction we ensure serialisable behaviour:
- This is like the whole transaction is run in isolation from other transactions.
Partial Execution
If a transaction fails half way through, this could leave the database in an unacceptable state; for example, if an error occurred half way through a bank transfer.
Transactions Solution
To fix this SQL allows us to state that a transaction must be executed atomically (as a whole or not at all):
START TRANSACTION;
UPDATE Accounts
SET balance = balance + 100
WHERE accountNo = 456;
UPDATE Accounts
SET balance = balance - 100
WHERE accountNo = 123;
COMMIT;
Transaction Syntax
A transaction is a sequence of SQL statements wrapped in:
START TRANSACTION;
-- SQL Statements
COMMIT; /* or ROLLBACK; */
ROLLBACK;
aborts the transaction and enacts no changes.
This tutorial wasn’t that great. It would be much better if this was on a live database so I’ve got no idea if any of these are actually right.
Part 1
-
Average grade:
SELECT AVG(grade)
FROM Enrolment
WHERE s_id=1;
-
Years of COMP207 teaching:
SELECT COUNT(year)
FROM CoursesInYear
WHERE code='COMP207';
-
Students in COMP207 in 2021:
SELECT COUNT(s_id)
FROM Enrolment
WHERE code='COMP207' AND year='2021';
Simple Joins
-
Lecturers teach me, sorted by birthday:
SELECT DISTINCT first_name, last_name
FROM Lecturers L, CoursesInYear CIY, Enrolment E
WHERE E.code = CIY.code AND CIY.l_id = L.l_id AND E.s_id = 1
ORDER BY L.birthday;
-
Names of courses you attend, sorted by name:
SELECT name
FROM Enrolment E, Courses C
WHERE E.code=C.code
ORDER BY name;
-
Names of courses Rasmus Ibsen-Jensen has taught:
SELECT DISTINCT name
FROM CoursesInYear ciy, Lecturers l, Courses c
WHERE ciy.code=c.code AND l.l_id=ciy.l_id
AND first_name='Rasmus' AND last_name='Ibsen-Jensen'
ORDER BY c.code;
Unions
-
What names of students, lecturers and courses exist:
SELECT first_name AS name
FROM Students
UNION
SELECT first_name AS name
FROM Lecturers
UNION
SELECT name AS name
FROM Courses
-
List all lecturers and students ordered by first name and last name:
SELECT first_name, last_name
FROM Students
UNION
SELECT first_name, last_name
FROM Lecturers
ORDER BY first_name, last_name;
Order of Execution
SQL queries have the following order of execution:
1. FROM
2. WHERE
3. GROUP BY
4. HAVING
5. ORDER BY
6. SELECT
Views
Views are saved queries that you can use as virtual tables.
Consider the following tables:
- Employees

| birthday | first_name | family_name | e_id |
|---|---|---|---|
| 1990-11-10 | Anne | Smith | 1 |
| 2000-02-05 | David | Jones | 2 |
| 1995-05-09 | William | Taylor | 3 |

- Transactions

| t_id | c_id | e_id |
|---|---|---|
| 1 | 3 | 1 |
| 2 | 6 | 1 |
| 3 | 19 | 3 |
Say we completed a complex query and we want to save it. We can add a line to the query for the table to be saved as a VIEW
:
CREATE VIEW Employee_transaction_count AS
SELECT first_name, e_id, COUNT(t_id)
FROM Employees NATURAL JOIN Transactions
GROUP BY first_name, e_id;
This would save the following VIEW:

| first_name | e_id | COUNT(t_id) |
|---|---|---|
| Anne | 1 | 2 |
| William | 3 | 1 |
Views as Virtual Tables
To query a VIEW
you would do it like any other query:
SELECT first_name, COUNT(t_id)
FROM Employee_transaction_count;
This would create the following result:
| first_name | COUNT(t_id) |
|---|---|
| Anne | 2 |
| William | 1 |
Restrictions of Views
Relational Algebra
This is mathematics over relational tables. As relations are sets, relational algebra uses set semantics: there is no order and there are no duplicates.
This is very important for optimisation.
Projection ($\pi$)
$\pi_\text{attribute list}(R)=$ restricts $R$ to attributes in the attribute list.
As an example we can have the following expression:
\[\pi_\text{family_name,birthday} (\text{Employees})\]
This would produce the following output:
| family_name | birthday |
|---|---|
| Smith | 1990-11-10 |
| Jones | 2000-02-05 |
| Taylor | 1995-05-09 |
This is the same as the following SQL query:
SELECT DISTINCT family_name, birthday
FROM Employees;
We are using DISTINCT
due to the set semantics of the expression.
Renaming ($\rho$)
$\rho_{A1\rightarrow B1,A2\rightarrow B2,\ldots}(R)=$ renames attribute $A1$ to $B1$, attribute $A2$ to $B2$ and so on.
As an example we can have the following expression:
\[\rho_{\text{birthday}\rightarrow\text{bday}}(\text{Employees})\]
This would give the following table:
| bday | first_name | family_name |
|---|---|---|
| 1990-11-10 | Anne | Smith |
| 2000-02-05 | David | Jones |
| 1995-05-09 | William | Taylor |
This is the same as the following SQL query:
SELECT DISTINCT birthday AS bday,first_name,family_name
FROM Employees;
Selection ($\sigma$)
$\sigma_\text{condition}(R) =$ set of all tuples in $R$ that satisfy the condition.
As an example we can have the following expression:
\[\sigma_\text{first_name='Anne'}(\text{Employees})\]
Which gives the following table:
| birthday | first_name | family_name |
|---|---|---|
| 1990-11-10 | Anne | Smith |
This is the same as the following SQL query:
SELECT DISTINCT *
FROM Employees
WHERE first_name='Anne';
In SQL you should remember that you select using the WHERE
keyword.{:.info}
Cartesian Product ($\times$)
$R_1\times R_2=$ pairs each tuple in $R_1$ with each tuple in $R_2$.
Consider the following tables:
- Modules

| code | year | sem |
|---|---|---|
| COMP105 | 1 | 1 |
| COMP201 | 2 | 1 |

- Lecturers

| name | module |
|---|---|
| John | COMP105 |
| Sebastian | COMP201 |
An example relational algebra expression could be:
\[\text{Modules}\times\text{Lecturers}\]
Which gives the following table:
| code | year | sem | name | module |
|---|---|---|---|---|
| COMP105 | 1 | 1 | John | COMP105 |
| COMP105 | 1 | 1 | Sebastian | COMP201 |
| COMP201 | 2 | 1 | John | COMP105 |
| COMP201 | 2 | 1 | Sebastian | COMP201 |
This is the same as the following query:
SELECT DISTINCT *
FROM Modules, Lecturers;
Natural Join ($\bowtie$)
$R_1\bowtie R_2=$ pairs each tuple in $R_1$ with each tuple in $R_2$ that matches on all common attributes.
As an example we could have the following expression:
\[\text{Employees}\bowtie\text{Transactions}\]
This would give the following table:
| birthday | first_name | family_name | e_id | t_id | c_id |
|---|---|---|---|---|---|
| 1990-11-10 | Anne | Smith | 1 | 1 | 3 |
| 1990-11-10 | Anne | Smith | 1 | 2 | 6 |
| 1995-05-09 | William | Taylor | 3 | 3 | 19 |
This is the same as the following SQL query:
SELECT DISTINCT *
FROM Employees NATURAL JOIN
Transactions;
Additional Functions
- Left/right semi-join ($\ltimes/\rtimes)$ - Returns the rows from the left/right table that would be matched to a row on the other side in a natural join.
GROUP BY
- Grouping and aggregation are part of extended relational algebra but are complex to define.
See SQL Queries (Optional Part) for additional examples of semi-join and GROUP BY
.
Combined Example
With the following schema:
Consider the following relational algebra example:
\[\rho_{a\rightarrow z}(R\times S)\bowtie T\]
You could write this in the following SQL query:
SELECT *
FROM (SELECT a AS z, b, c, d
FROM (SELECT *
FROM R, S)) NATURAL JOIN T;
This tutorial is focused on the tutorial 1 exercises which can be found here.
Part 1
-
Schemas:
Students(first_name, last_name, birthday, s_id)
Enrolments(s_id, course_code, year)
-
Students - 4 columns by 5 rows
Enrolments - 3 columns by 9 rows
-
Creating tables:
CREATE TABLE Students (
first_name VARCHAR(20),
last_name VARCHAR(20),
birthday DATE,
s_id INT,
CONSTRAINT PK_Students PRIMARY KEY(s_id)
);
CREATE TABLE Enrolments (
s_id INT,
course_code CHAR(7),
year INT,
CONSTRAINT FK_Enrolments FOREIGN KEY(s_id) REFERENCES Students(s_id)
);
-
Deleting enrolment:
DELETE FROM Enrolments WHERE s_id=4 AND course_code='COMP201';
-
Deleting column:
ALTER TABLE Students DROP COLUMN birthday;
-
Changing names:
UPDATE Students SET first_name='Leah' WHERE s_id=2;
Part 2
- Works
- Fails - Longer than 20 characters.
- Fails - Identical
s_id
.
- Fails - No parent in
Students
.
- Works
- Fails - No primary key (
s_id
) set.
- Fails - Missing
VALUES
command.
- Fails - Incorrect attribute names.
- Fails - Orphans left in
Enrollments
.
- Fails - No parent in
Students
.
- Fails - Orphans left in
Enrollments
.
- Fails - Deletes all entries leaving orphans in
Enrollments
.
- Works
- Works - Deletes all entries
This is basically the same as in mathematical sets. Consider the following table:
| birthday | first_name | family_name | e_id |
|---|---|---|---|
| 1990-11-10 | Anne | Smith | 1 |
| 2000-02-05 | David | Jones | 2 |
| 1995-05-09 | William | Taylor | 3 |
Executing the following query would give the following table:
SELECT first_name AS name
FROM Employees
UNION
SELECT family_name
FROM Employees;
| name |
|---|
| Anne |
| David |
| William |
| Smith |
| Jones |
| Taylor |
The name of the column comes from the first SELECT
statement.
The things in the UNION
must have similar data-types and also the same number of columns.
Similar Data-Types
Different implementations have different definitions of what classes as a “similar data-type”. Generally you can mix most things but you should check the docs for your specific implementation.
These aren’t required to form a valid query and are extensions to standard queries.
SQL queries have the following form:
SELECT [] FROM [] WHERE [] GROUP BY [] HAVING [] ORDER BY []
The first two parts are required to have a valid query.
WHERE
This is already mostly covered in SQL Data Manipulation Language (DML).
IN
In addition to providing a list like here you can also use sub-queries:
DELETE FROM Students WHERE name IN (SELECT name FROM Lecturers);
You can also do the same for a SELECT
query:
SELECT * FROM Students WHERE name IN (SELECT name FROM Lecturers);
EXISTS
This is a generalisation of IN
. With EXISTS
you can write a sub-query and if it returns something, the current row will be kept, otherwise thrown out. You can use attributes from the current row in the sub-query:
SELECT *
FROM Students
WHERE EXISTS ( SELECT 1
FROM Lecturers
WHERE Students.name=Lecturers.name AND Students.last_name=Lecturers.last_name);
It is common to use the constant 1
when we just want to have an output or not.
Semi-Join
Most types of joins are done in the FROM
clause. The only exception is semi-joins, which are done in the WHERE
clause.
A left semi-join between two tables $A$ and $B$ finds the subset of rows in $A$ that would be used to form some row of the natural join of $A$ and $B$.
A right semi-join would find the subset of $B$ instead.
These are done using EXISTS
(or IN
, if there is only one shared attribute).
Example
We will complete a left semi-join for the following tables (Employees to Transactions):
| birthday | first_name | family_name | e_id |
|---|---|---|---|
| 1990-11-10 | Anne | Smith | 1 |
| 2000-02-05 | David | Jones | 2 |
| 1995-05-09 | William | Taylor | 3 |

| t_id | c_id | e_id |
|---|---|---|
| 1 | 3 | 1 |
| 2 | 6 | 1 |
| 3 | 19 | 3 |
We would use the following query:
SELECT *
FROM Employees E
WHERE EXISTS (SELECT 1
FROM Transactions T
WHERE E.e_id=T.e_id);
This gives the following output:
| birthday | first_name | family_name | e_id |
|---|---|---|---|
| 1990-11-10 | Anne | Smith | 1 |
| 1995-05-09 | William | Taylor | 3 |
Notice that only one Anne row is output even though two transactions were done under her e_id. This is because this method can only remove rows.
GROUP BY
GROUP BY
allows you to count transactions for each employee, rather than just for one employee selected with a WHERE
clause. For the following tables:
| birthday | first_name | family_name | e_id |
|---|---|---|---|
| 1990-11-10 | Anne | Smith | 1 |
| 2000-02-05 | David | Jones | 2 |
| 1995-05-09 | William | Taylor | 3 |

| t_id | c_id | e_id |
|---|---|---|
| 1 | 3 | 1 |
| 2 | 6 | 1 |
| 3 | 19 | 3 |
You could count the number of transactions for all e_id
using the following query:
SELECT e_id, COUNT(t_id)
FROM Transactions
GROUP BY e_id;
This would return the following table:

| e_id | COUNT(t_id) |
|---|---|
| 1 | 2 |
| 3 | 1 |
Notice how the behaviour of COUNT
changes due to the grouping.
Complications
When using GROUP BY
you can only include in SELECT
attributes that appear in the GROUP BY
clause, or aggregates.
There are a couple of ways to overcome this:
-
Including additional attributes into the GROUP BY
statement:
SELECT first_name, e_id, COUNT(t_id)
FROM Employees NATURAL JOIN Transactions
GROUP BY first_name, e_id;
This seems like my preferred method.
-
Including the additional attribute in an aggregate:
SELECT MIN(first_name), e_id, COUNT(t_id)
FROM Employees NATURAL JOIN Transactions
GROUP BY e_id;
This works as each e_id
will have one associated first_name
; taking the MIN
will return the only entry.
These both give the following output, with the first_name
included too:
| first_name | e_id | COUNT(t_id) |
|---|---|---|
| Anne | 1 | 2 |
| William | 3 | 1 |
Intuitive Method
Intuitively GROUP BY
works like this:
- Do the previous part of the query until
GROUP BY
(except ignoring the part in SELECT
)
- Split the resulting table into sub-tables:
-
One sub-table for each value of the variables in GROUP BY
. (If you had first_name
and e_id
, you would have a sub-table for each pair of first_name
and e_id
).
This means that if you had two employees called Anne you would have two sub-tables, one for each Anne, identified by their e_id
.
This means that for the following query you would get the following table (prior to SELECT
ing the required rows):
...
FROM Employees NATURAL JOIN Transactions
GROUP BY first_name, e_id;
| first_name | e_id | sub-table(birthday) | sub-table(family_name) | sub-table(c_id) | sub-table(t_id) |
|---|---|---|---|---|---|
| Anne | 1 | 1990-11-10 | Smith | 3 | 1 |
| ” | ” | 1990-11-10 | Smith | 6 | 2 |
| William | 3 | 1995-05-09 | Taylor | 19 | 3 |
As the table is grouped by the first_name
and e_id
these are pulled out of the table and redundant rows are removed. Aggregates are then calculated on the sub-tables of each group.
Having
This is just WHERE
but done after a GROUP BY
. This means that it doesn’t affect GROUP BY
and can be used on aggregates on the sub-tables in its expressions.
For the following query:
SELECT first_name,e_id,COUNT(t_id)
FROM Employees NATURAL JOIN Transactions
GROUP BY first_name, e_id
HAVING MIN(c_id)<5;
you would get the following output:
| first_name | e_id | COUNT(t_id) |
|---|---|---|
| Anne | 1 | 2 |
This is for the same tables as used in the GROUP BY
section.
ORDER BY
This defines how the output is sorted:
SELECT first_name
FROM Employees
ORDER BY family_name;
| first_name |
|---|
| David |
| Anne |
| William |
You can also have more than one attribute in ORDER BY
; it will then sort by those attributes in the order given.
It defaults to ascending order. You can also sort in descending order by writing DESC
after the attribute you want to sort by.
SQL queries have the following form:
SELECT [] FROM [] WHERE [] GROUP BY [] HAVING [] ORDER BY []
The first two parts are required to have a valid query.
Basic Query
For the following table:
| birthday | first_name | family_name |
|---|---|---|
| 1990-11-10 | Anne | Smith |
| 2000-02-05 | David | Jones |
| 1995-05-09 | William | Taylor |
The most basic query returns all the entries in a given table.
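A minimal sketch of such a query, assuming the table above is the Employees table used throughout these notes:
SELECT * FROM Employees;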
SELECT
SELECT defines what is output, letting you make four kinds of modifications to it:
- Projection ($\pi$)
- DISTINCT
- Renaming ($\rho$)
- Creating new columns.
The symbols will be explained in the SQL misc. lecture.
You can also combine these modifications as you like.
Projection ($\pi$)
Projection allows you to select the attributes you want to keep.
You use it by writing a list of attributes:
SELECT family_name, birthday FROM Employees;
The order of the attributes matters.
DISTINCT
DISTINCT is for removing duplicate rows. If we select normally like this:
SELECT first_name FROM Employees;
then we will get duplicates if two entries share the same value for the attribute.
We can use the following:
SELECT DISTINCT first_name FROM Employees;
to only get unique entries for a projection.
Renaming ($\rho$)
This allows you to rename attributes. You use it by writing AS and then the new name after the attribute:
SELECT birthday, first_name, family_name AS surname FROM Employees;
This would rename the family_name attribute to surname in the output table.
Creating New Columns
You can also create new columns by applying arithmetic to existing columns. You would typically also give the new column a name.
For the following table:
| name         | price | number |
|--------------|-------|--------|
| 2L Cola      | 3.00  | 20     |
| Banana       | 0.10  | 120    |
| Toilet Paper | 2.00  | 0      |
running this query:
SELECT name, price, number, price * number AS total_cost FROM Items;
would give the following result:
| name         | price | number | total_cost |
|--------------|-------|--------|------------|
| 2L Cola      | 3.00  | 20     | 60.00      |
| Banana       | 0.10  | 120    | 12.00      |
| Toilet Paper | 2.00  | 0      | 0.00       |
Other operations are +, -, / and %.
You can also do aggregates, like sums or counts over the output table. For example, to sum up all the items in the shop you could run the following query:
SELECT SUM(number) FROM Items;
This would produce the following table:

| SUM(number) |
|-------------|
| 140         |
You can also use COUNT, AVG, MIN and MAX.
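For instance, a small sketch (not from the lecture) applying several aggregates to the Items table at once:
SELECT COUNT(name), AVG(price), MIN(price), MAX(price)
FROM Items;
For the Items table above this would count 3 items, with an average price of 1.70, a minimum of 0.10 and a maximum of 3.00.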
FROM
FROM can contain many input tables, which can be combined in various ways. You can combine them with:
- Cross Product ($\times$)
- Natural Join ($\bowtie$)
Cross Product ($\times$)
For the following tables:
| birthday   | first_name | family_name |
|------------|------------|-------------|
| 1990-11-10 | Anne       | Smith       |
| 2000-02-05 | David      | Jones       |
| 1995-05-09 | William    | Taylor      |
| name    | price | number |
|---------|-------|--------|
| 2L Cola | 3.00  | 20     |
| Banana  | 0.10  | 120    |
with the following query:
SELECT first_name, name FROM Employees, Items;
you will get the following output table:
| first_name | name    |
|------------|---------|
| Anne       | 2L Cola |
| Anne       | Banana  |
| David      | 2L Cola |
| David      | Banana  |
| William    | 2L Cola |
| William    | Banana  |
This is every combination of rows, in the order of Employees $\times$ Items.
It is called the product as the size of the resultant table is $3\times2=6$.
Natural Join ($\bowtie$)
The two tables to be joined should have at least one attribute in common. For the following tables:
| birthday   | first_name | family_name | e_id |
|------------|------------|-------------|------|
| 1990-11-10 | Anne       | Smith       | 1    |
| 2000-02-05 | David      | Jones       | 2    |
| 1995-05-09 | William    | Taylor      | 3    |
| t_id | c_id | e_id |
|------|------|------|
| 1    | 3    | 1    |
| 2    | 6    | 1    |
| 3    | 19   | 3    |
with the following query:
SELECT * FROM Employees NATURAL JOIN Transactions;
you will get the following output table:
| birthday   | first_name | family_name | e_id | t_id | c_id |
|------------|------------|-------------|------|------|------|
| 1990-11-10 | Anne       | Smith       | 1    | 1    | 3    |
| 1990-11-10 | Anne       | Smith       | 1    | 2    | 6    |
| 1995-05-09 | William    | Taylor      | 3    | 3    | 19   |
This joins each pair of rows that agree on their common attributes, extending each row with the other table's attributes. Note that David does not appear because he has no matching transaction.
Formally, you take the cross product, remove all rows where the common attributes don't match, and then keep only one column for each common attribute.
This means that the following query is the same as the last natural join:
SELECT birthday, first_name, family_name, Employees.e_id AS e_id, t_id, c_id
FROM Employees, Transactions
WHERE Employees.e_id = Transactions.e_id;
Using Tablename.attribute is how you reference an attribute that multiple tables have in common.
You can also assign shorthands to table names:
SELECT birthday, first_name, family_name, E.e_id AS e_id, t_id, c_id
FROM Employees E, Transactions T
WHERE E.e_id = T.e_id;
Natural join has some issues: when the shared attribute names carry different meanings in the two tables (or no matching records exist), rows won't match and the output can be empty.
Queries are not covered in this lecture and will have their own notes.
Insert
For the following table:
| name | number   | programme |
|------|----------|-----------|
| Anna | 20171989 | G402      |
| John | 20174378 | G702      |
If you want to insert a value you would use the following command:
INSERT INTO Students VALUES('Oliver', 20171112, 'G402');
' is used to denote the start and end of a string, not ".
To insert a tuple with some attributes missing you can use the following:
INSERT INTO Students(programme, name) VALUES('G702', 'Danny');
The unused attribute will be filled with null.
In SQL, -- is used to denote comments.
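A tiny sketch (not from the lecture) showing both comment placements:
-- this whole line is a comment and is ignored
SELECT * FROM Students; -- a comment can also follow a statement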
Delete
To remove the entries that match a simple condition you can do the following:
DELETE FROM Students WHERE name='John';
If you write DELETE FROM Students; you will remove all the entries from the table.
WHERE Clauses
For comparisons you can use the following: =, <, <=, >=, >
- For "not equal" you can use <> or !=.
You can also use some boolean operations:
- AND
- OR
- NOT
- BETWEEN
- E.g. Price BETWEEN 10 AND 20.
- LIKE
- _ matches a single character.
- % matches any number of characters.
For combinations you can use the following:
... WHERE name='Oliver' AND programme='G402';
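As a further sketch (not from the lecture, using the Students table from the Insert section; the numeric range is made up), several of these operators can be combined in one WHERE clause:
SELECT * FROM Students
WHERE name LIKE 'J%'
AND number BETWEEN 20170000 AND 20179999;
For the original sample data this would match John.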
IN
To specify a set of values to match against you can use the IN keyword:
DELETE FROM Students WHERE name IN('John','Sebastian');
Updating
To update the contents of a row you can do the following:
UPDATE Students SET programme='G402' WHERE name='Oliver';
You can also update values programmatically:
UPDATE Students SET number=number+1 WHERE name='Oliver';
This covers how to create, alter and delete databases, tables and their attributes.
Create a Database
Databases are similar to folders: they are initially empty and can be filled with tables.
CREATE DATABASE databasename;
The commands in SQL are not case-sensitive. Conventionally, they are in upper-case.
Creating Tables
CREATE TABLE Table_name (
column1 datatype,
column2 datatype,
column3 datatype
);
You can use any kind of whitespace to separate keywords.
The schema of this table would be:
Table_name(column1, column2, column3)
Datatypes

| Datatype   | Description                                                              |
|------------|--------------------------------------------------------------------------|
| INT        | Integers.                                                                |
| FLOAT      | Floating-point numbers.                                                  |
| CHAR(x)    | Fixed-length string of length x; shorter values are padded with spaces.  |
| VARCHAR(x) | Variable-length string of maximum length x.                              |
| DATE       | Date in the format YYYY-MM-DD.                                           |
| DATETIME   | Date and time in the format YYYY-MM-DD HH:MI:SS.                         |
| XML        | For XML files.                                                           |
| BLOB       | For binary files.                                                        |
There are also other data-types for other precisions.
Example
For the following table:
| birthday   | first_name | family_name |
|------------|------------|-------------|
| 1990-11-10 | Anne       | Smith       |
| 2000-02-05 | David      | Jones       |
You would use the following command to create an appropriate table:
CREATE TABLE Employees (
birthday DATE,
first_name VARCHAR(100),
family_name VARCHAR(100)
);
The schema of this table would be:
Employees(birthday, first_name, family_name)
Unique
You can specify that you want some attributes to be unique per entry:
CREATE TABLE Employees (
birthday DATE,
first_name VARCHAR(100),
family_name VARCHAR(100),
CONSTRAINT UC_Employees UNIQUE(birthday, first_name)
);
UC_Employees is just a name so that we can refer to the constraint later.
Primary Key
You can specify how the data should be sorted physically in storage by defining a primary key.
CREATE TABLE Employees (
birthday DATE,
first_name VARCHAR(100),
family_name VARCHAR(100),
CONSTRAINT PK_Employees PRIMARY KEY(birthday, first_name)
);
- There can only be one primary key constraint per table.
- The primary key attributes must be unique for each row.
Foreign Key
A foreign key is used to link two tables together explicitly:
- It points from some attributes in one table (the child table) to the primary key of another (the parent table).
- It ensures that the values, in the attributes of the foreign key in the child table, must also be in the parent table.
- Can be used to create custom data-types, by having a table with all the values you want to allow.
- Create the parent table:
CREATE TABLE Employees (
e_id INT,
first_name VARCHAR(100),
family_name VARCHAR(100),
CONSTRAINT PK_Employees PRIMARY KEY(e_id)
);
- Create a child table referencing the parent table:
CREATE TABLE Transaction (
emp_id INT,
CONSTRAINT FK_Transactions FOREIGN KEY (emp_id)
REFERENCES Employees(e_id)
);
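As an illustrative sketch (the value 99 is made up), the foreign key means an insert into the child table only succeeds if the referenced employee exists:
-- rejected unless an employee with e_id 99 exists in Employees
INSERT INTO Transaction VALUES(99);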
Deleting
Deleting is called DROP in SQL. To delete a database you would use:
DROP DATABASE databasename;
To delete a table:
DROP TABLE Table_name;
You can also remove unique, primary key or foreign keys. You should look this up if required.
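As a rough sketch (the exact syntax varies between implementations, so treat this as an assumption to check against your DBMS), dropping the unique constraint defined earlier could look like this:
ALTER TABLE Employees DROP CONSTRAINT UC_Employees;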
Modifying Tables
To alter a table by adding an attribute:
ALTER TABLE Employees ADD email VARCHAR(100);
To alter an existing attribute:
ALTER TABLE Employees MODIFY email VARCHAR(200);
This allows longer emails under the email attribute.
The MODIFY keyword varies depending on the implementation.
To remove a column you must modify the table:
ALTER TABLE Employees DROP COLUMN email;
Relational Databases
- Data is organised into tables.
- Tables have a table name.
- Tables have attributes (columns).
- All elements in a column are of the same datatype.
- Tables have tuples (rows).
- Each table has a schema. This lists the attributes of a table, e.g. Employees(birthday, first_name, family_name).
Tables
Tables typically have:
- Few columns that are fixed.
- A variable number of rows.
Rows
Each row corresponds to an entity, or a relationship between entities.
- All rows in a table correspond to the same kind of entity or relationship.
- This could be a table of many employees' details, or a table of many transactions.
- If two rows are identical they represent the same item, not two distinct identical items.
SQL
Relational databases are accessed using SQL (Structured Query Language). The standard is updated every few years, but implementations may not follow it exactly.
SQL Parts
- Data Definition Language (DDL)
- Create/alter/delete databases, tables and their attributes.
- Data Manipulation Language (DML)
- Add/remove/update and query rows in tables.
- Transact-SQL
- Completing a sequence of SQL statements.
Running Example - Supermarket
The example for the course is a fake supermarket called CS Store. It is initially based in Liverpool but will become a national company.
Relational DBMS Components
A DBMS is a database management system. Its main components, and how they interact, are shown in the diagram below.
flowchart TD
QC <-.-> B
ua[User/Application] --> QC[Query Compiler] --> EE[Execution Engine] <--> IFRM[Index/File/Record Manager] <--> BMSM[Buffer manager & Storage Manager]
ua --> TM[Transaction Manager] --> LR[Logging & Recovery] & CC[Concurrency Control]
LR <-.-> B[(Buffers)]
EE <-.-> B
BMSM <--> S[(Storage)]
BMSM <--> B
BMSM <--> LR
EE --> LR
subgraph Query Processing - Week 7
QC
EE
IFRM
end
subgraph Transaction Management - Week 3-5
TM
LR
CC
end
EE <--> CC