Market Basket Model

COMP207 Lectures 2021-12-07

This is a basic technique for data-mining.

Market-Basket Data

Data can be described by:

A set of items $I$.
A set of baskets $B$:
- Each basket $b\in B$ is a subset of $I$.

For example:

Purchase ID	Items Bought
101	milk, bread, cookies, juice
792	milk, juice
1130	milk, bread, eggs
1735	bread, cookies, coffee

In this table:

$I=$ all unique items that were bought.
- {bread, coffee, eggs, juice, milk}
$B=$ the set of baskets represented by each purchase ID.
- {$b_1,b_2,b_3,b_4$}

Frequent-Itemset Mining

Consider for the previous table we are asked:

Which items occur frequently together in a basket?

Given a set of items $I$ and a set of baskets $B$:

The support of a subset $J$ of $I=$ frequency with which the items in $J$ occur together in a basket in $B$.

Therefore to calculate the support of a set of items $J$ we can use:

\[\frac{\text{number of baskets in }B\text{ containing all items in }J}{\text{number of baskets in }B}\]

The support of {milk, juice} $=\frac 2 4=0.5$ as the set is contained in two out of the four baskets.

There are additional examples of calculating support at the end of the lecture.

Frequent Itemsets

A subset $J$ of $I$ is frequent if its support is at least $s$:

$s$ - Support threshold specified by the user.

There are examples of classifying frequent itemsets and frequent pairs at the end of the lecture.

Complex Tables

Sometimes we have tables with additional columns which make the baskets hard to define:

Purchase ID	Customer ID	Items Bought
101	A	milk, bread, cookies, juice
792	B	milk, juice
1130	A	milk, bread, eggs
1735	C	bread, cookies, coffee

Customer Baskets

We could define the baskets as per customer:

This would mean one basket per Customer ID.
- Only add distinct items to the basket of the customer.

We can also do this with respect to any of the other basket columns, such as Purchase ID.

This would generate the following table:

Customer ID	Items Bought
A	milk, bread, cookies, juice, eggs
B	milk, juice
C	bread, cookies, coffee

We then define all other notion using this new table and the following language:

The support of a set $J$ with respect to the customers.

This is the support of $J$ computed using the new set of baskets.

$J$ is frequent with respect to the customers.

If its support with respect to the customers is at least the support threshold.

There is an additional example on slide 8 of the lecture with generic item names.

Association Rules

This is a variant of frequent-item mining queries.

They take the form:

\[\{i_1,i_2,\ldots,i_n\}\Rightarrow j\]

For example:

Customers who buy nappies frequently also buy beer:
\[\{\text{nappies}\}\Rightarrow \text{beer}\]

Properties of Association Rules

The support of ($i_1,i_2,\ldots,i_n,j$) must be high otherwise the association won’t have any meaning.

We can calculate confidence like so:

\[\frac{\text{support of }\{i_1,i_2,\ldots,i_n,j\}}{\text{support of }\{i_1,i_2,\ldots,i_n\}}\]

This is the percentage of people who bought all of the items vs those who just bought the indicator items.

This should also be high.
Should differ significantly from fraction of baskets containing $j$.

We could phrase this association like:

67% of all customers who bought milk also bought juice.

There are examples of calculating confidence starting at slide 11 of the lecture.

Finding Association Rules

Focus on assocation rules with high support:
- At lease the support threshold $s$.
Compute the set $F$ of all itemsets $J$ with support $\geq s$.
From $F$ it is easy to obtain the association rules:
- For each set $J$ in $F$ and each $j$ in $J$, consider the rule $\{j\}\Rightarrow j$.
- Compute the confidence for $\{j\}\Rightarrow j$ and compare with the percent of baskets containing $j$

A-Priori Algorithm

The goal is compute all the itemset $J$ with support $\geq s$. The idea is that:

\[\text{If }J\text{ has support }\geq s\text{, then all subsets of }J\text{ have support }\geq s\text{.}\]

Using this we compute frequent itemsets from smaller ones:

$F_1:=$ itemsets $\{i\}$ with support $\geq s$.
$F_2:=$ itemsets $\{i,j\}$ with $\{i\},\{j\}\in F_1$ and support $\geq s$.
$F_3:=$ itemsets $\{i,j,k\}$ with $\{i,j\},\{i,k\},\{j,k\}\in F_1$ and support $\geq s$.
…

A-Priori Algorithm Formal Method

Input:
- The set of items $I$.
- The set of baskets $B$.
- Max size $q$.
- Support threshold $s$.
Output - Subsets $J$ of $I$ with $\lvert J\rvert\leq q$ and support $\geq s$.

C1 := {{i} : i∈I}
for k = 1 to q do
	Fk := {J ∈ Ck : J has support ≥ s}
	if k = q then stop
	Ck+1 := {J ⊆ I : |j| = k + 1 and all subsets of J of size k occur in Fk}