Data Analysis

Consider that a big company wants analyse its product sales. They have a database with the following schema:

Sales(productNo, date, store_name, price)
Stores(name, city, country, phone)

and will run the following query:

SELECT country, AVG(price)
	FROM Sales, Stores
	WHERE Sales.store_name = Stores.name AND
		date >= '2021-01-01'
	GROUP BY country;

This requires most of the data in the database which will block large parts of the database:

Data Warehouses

These are specific facilities made to support data analysis.

flowchart LR
a[(A)] & b[(B)] & c[(C)] --> etl[Extract, Transform, Load] --> dw
dw[(Data Warehouse)] --- da[Data Analyst]

The warehouse isn’t constantly updated. Only every day or so.

Online analytic processing is the process of analysing complex data store in data warehouse.

This is in contrast to online transaction processing (OLTP) which are traditional DBMS tasks:

In OLAP applications, there is typically a unique fact table that:

Data cubes allow us to represent an attribute as a point in 3D space. For example, for the fact table:

Sales(productNo, date, store, price)

we can have axes of:

which will point to a specific price.

There are examples of slicing and dicing data cubes using star schema at the end of the slides.

These are:

From the following schema:

Products(productNo, type, model)
Days(date, day, week, month, year)
Stores(name, city, country, phone)

you can make the following star schema:

Sales(productNo, date, store, price)

This has the dependant attribute of price.

A star schema describes a database consisting of:

A fact table $R(A_1,\ldots,A_n,B_1,\ldots,B_m)$ where:
- $A_1,\ldots,A_n$ are called dimensions.
- $B_1,\ldots,B_m$ are called dependent attributes.
Dimension table $D_i(A_i\ldots)$ for each dimension $A_i$.

Unlike typical databases star schema have a denormalised schema:

As a result we get the following:

Queries don’t requires many joins.
Performance gains, especially when processing queries that would require many joins in a normalised database.
Faster and easier aggregation of data.