Market Basket Model
This is a basic technique for data-mining.
Market-Basket Data
Data can be described by:
- A set of items $I$.
- A set of baskets $B$:
- Each basket $b\in B$ is a subset of $I$.
For example:
Purchase ID | Items Bought |
---|---|
101 | milk, bread, cookies, juice |
792 | milk, juice |
1130 | milk, bread, eggs |
1735 | bread, cookies, coffee |
In this table:
- $I=$ all unique items that were bought.
- {bread, coffee, eggs, juice, milk}
- $B=$ the set of baskets represented by each purchase ID.
- {$b_1,b_2,b_3,b_4$}
Frequent-Itemset Mining
Consider for the previous table we are asked:
Which items occur frequently together in a basket?
Given a set of items $I$ and a set of baskets $B$:
- The support of a subset $J$ of $I=$ frequency with which the items in $J$ occur together in a basket in $B$.
Therefore to calculate the support of a set of items $J$ we can use:
\[\frac{\text{number of baskets in }B\text{ containing all items in }J}{\text{number of baskets in }B}\]The support of {milk, juice} $=\frac 2 4=0.5$ as the set is contained in two out of the four baskets.
There are additional examples of calculating support at the end of the lecture.
Frequent Itemsets
A subset $J$ of $I$ is frequent if its support is at least $s$:
- $s$ - Support threshold specified by the user.
There are examples of classifying frequent itemsets and frequent pairs at the end of the lecture.
Complex Tables
Sometimes we have tables with additional columns which make the baskets hard to define:
Purchase ID | Customer ID | Items Bought |
---|---|---|
101 | A | milk, bread, cookies, juice |
792 | B | milk, juice |
1130 | A | milk, bread, eggs |
1735 | C | bread, cookies, coffee |
Customer Baskets
We could define the baskets as per customer:
- This would mean one basket per
Customer ID
.- Only add distinct items to the basket of the customer.
We can also do this with respect to any of the other basket columns, such as Purchase ID
.
This would generate the following table:
Customer ID | Items Bought |
---|---|
A | milk, bread, cookies, juice, eggs |
B | milk, juice |
C | bread, cookies, coffee |
We then define all other notion using this new table and the following language:
The support of a set $J$ with respect to the customers.
This is the support of $J$ computed using the new set of baskets.
$J$ is frequent with respect to the customers.
If its support with respect to the customers is at least the support threshold.
There is an additional example on slide 8 of the lecture with generic item names.
Association Rules
This is a variant of frequent-item mining queries.
They take the form:
\[\{i_1,i_2,\ldots,i_n\}\Rightarrow j\]For example:
-
Customers who buy nappies frequently also buy beer:
\[\{\text{nappies}\}\Rightarrow \text{beer}\]
Properties of Association Rules
The support of ($i_1,i_2,\ldots,i_n,j$) must be high otherwise the association won’t have any meaning.
We can calculate confidence like so:
\[\frac{\text{support of }\{i_1,i_2,\ldots,i_n,j\}}{\text{support of }\{i_1,i_2,\ldots,i_n\}}\]This is the percentage of people who bought all of the items vs those who just bought the indicator items.
- This should also be high.
- Should differ significantly from fraction of baskets containing $j$.
We could phrase this association like:
67% of all customers who bought milk also bought juice.
There are examples of calculating confidence starting at slide 11 of the lecture.
Finding Association Rules
- Focus on assocation rules with high support:
- At lease the support threshold $s$.
- Compute the set $F$ of all itemsets $J$ with support $\geq s$.
- From $F$ it is easy to obtain the association rules:
- For each set $J$ in $F$ and each $j$ in $J$, consider the rule $\{j\}\Rightarrow j$.
- Compute the confidence for $\{j\}\Rightarrow j$ and compare with the percent of baskets containing $j$
A-Priori Algorithm
The goal is compute all the itemset $J$ with support $\geq s$. The idea is that:
\[\text{If }J\text{ has support }\geq s\text{, then all subsets of }J\text{ have support }\geq s\text{.}\]Using this we compute frequent itemsets from smaller ones:
- $F_1:=$ itemsets $\{i\}$ with support $\geq s$.
- $F_2:=$ itemsets $\{i,j\}$ with $\{i\},\{j\}\in F_1$ and support $\geq s$.
- $F_3:=$ itemsets $\{i,j,k\}$ with $\{i,j\},\{i,k\},\{j,k\}\in F_1$ and support $\geq s$.
- …
A-Priori Algorithm Formal Method
- Input:
- The set of items $I$.
- The set of baskets $B$.
- Max size $q$.
- Support threshold $s$.
- Output - Subsets $J$ of $I$ with $\lvert J\rvert\leq q$ and support $\geq s$.
C1 := {{i} : i∈I}
for k = 1 to q do
Fk := {J ∈ Ck : J has support ≥ s}
if k = q then stop
Ck+1 := {J ⊆ I : |j| = k + 1 and all subsets of J of size k occur in Fk}