Naive Bayes Classifier - 1
Supervised Learning of Classifiers
Given:
- A set $\mathcal X$ of possible instances to be classified:
- Emails
- Hand-written digits
- Image windows
- A finite set $\mathcal Y$ of classes:
- $\{\text{spam, not spam}\}$
- Training data $(x_1,L(x_1)),\ldots,(x_m,L(x_m))\in \mathcal X \times \mathcal Y$. $L(x_i)$ is called the label of $x_i$.
- A class of functions $\mathcal F$ from $\mathcal X$ to $\mathcal Y$ from which the classifier $f$ is selected.
Aim
Compute a classifier $f\in \mathcal F$ such that:
\[f(x_i)\approx L(x_i)\]for all training data $(x_i,L(x_i))$, and such that for a new $x\in\mathcal X$:
\[f(x)=y\]is a good prediction for the class of $x$.
In other words, instead of a human programming the classification function by hand, the learner constructs a function from the training data that behaves comparably.
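As a picture of this setup in code, here is a minimal sketch of the types involved, assuming instances are represented as dictionaries of feature values; all names are illustrative:

```python
from typing import Callable, Dict, List, Tuple

Instance = Dict[str, str]                    # an element of X, e.g. a day described by its features
Label = str                                  # an element of Y, e.g. "spam" / "not spam"
TrainingData = List[Tuple[Instance, Label]]  # the pairs (x_i, L(x_i))
Classifier = Callable[[Instance], Label]     # any f: X -> Y; learning selects one from F
```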
Towards the Naive Bayes Classifier
One wants to learn a classifier $f:\mathcal X \rightarrow \mathcal Y$
We have training data:
\[(x_1,L(x_1)),\ldots,(x_m,L(x_m))\]For a new $x\in\mathcal X$, the classifier predicts $f(x)=y\in\mathcal Y$ if:
\[P(Y=y\vert x)\geq P(Y=y'\vert x)\]for all $y'\in \mathcal Y$.
That is, one takes the maximally probable hypothesis; the condition ranges over all $y'\in \mathcal Y$, in particular every $y'$ distinct from $y$.
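Written as a maximum over the classes, the prediction rule is:
\[f(x)=\arg\max_{y\in\mathcal Y} P(Y=y\vert x)\]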
Thus, in the derivation of the naive Bayes’ classifier we introduce a random variable $Y$ that takes values in $\mathcal Y$.
We also make the (very common) assumption that every element of $\mathcal X$ is given by values $e_1,\ldots,e_n$ of features $E_1,\ldots,E_n$. We model these as random variables as well. Thus $P(Y=y\vert x)$ stands for:
\[P(Y=y\vert E_1=e_1,\ldots,E_n=e_n)\]Our training data then takes the form:
\[(e^1_1,\ldots,e^1_n,y_1),\ldots,(e^m_1,\ldots,e^m_n,y_m)\]
Example - Playing Tennis
We assume we know the weather on a particular day, and want to predict whether Peter plays tennis on that day.
The classes are that Peter plays tennis and that he does not play tennis.
Also assume we have collected data on the weather on days on which Peter plays or does not play Tennis.
The features used when collecting the weather data are:
- The feature outlook with values: sunny, overcast, rain.
- The feature temp with values: hot, mild, cool.
- The feature humidity with values: high, normal.
- The feature wind with values: weak, strong.
In this example an instance is a day, and its features are the ones listed above.
Assume we want to predict whether Peter plays Tennis on a day for which:
- outlook=sunny
- temp=cool
- humidity=high
- wind=strong
In what follows we write:
\[P(\text{sunny}\vert \text{Yes})\]for
\[P(\text{outlook}=\text{sunny}\vert\text{PlaysTennis}=\text{Yes})\]and so on for the remaining random variables and probabilities.
We would like to estimate:
\[P(Y=y\vert E_1=e_1,\ldots,E_n=e_n)\]which in the example is:
\[P(\text{Yes}\vert\text{sunny, cool, high, strong})\]But how can we do this directly if in the training data there is no entry for a day on which the outlook is sunny, the temperature is cool, the humidity is high, and the wind is strong?
Towards the Naive Bayes Classifier
We use Bayes’ theorem:
\[\begin{aligned} &P(Y=y\vert E_1=e_1,\ldots,E_n=e_n)\\ &=\frac{P(E_1=e_1,\ldots,E_n=e_n\vert Y=y)P(Y=y)}{P(E_1=e_1,\ldots,E_n=e_n)} \end{aligned}\]As we only want to compare probabilities, it is sufficient to estimate:
\[P(E_1=e_1,\ldots,E_n=e_n\vert Y=y)P(Y=y)\]for all $y\in \mathcal Y$.
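This works because the denominator $P(E_1=e_1,\ldots,E_n=e_n)$ does not depend on $y$, so the class with the largest posterior is also the class with the largest numerator:
\[\arg\max_{y\in\mathcal Y} P(Y=y\vert E_1=e_1,\ldots,E_n=e_n)=\arg\max_{y\in\mathcal Y} P(E_1=e_1,\ldots,E_n=e_n\vert Y=y)\,P(Y=y)\]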
Estimating Probabilities
Thus we want to estimate:
\[P(\text{sunny, cool, high, strong}\vert\text{Yes})P(\text{Yes})\]and
\[P(\text{sunny, cool, high, strong}\vert\text{No})P(\text{No})\]We can estimate $P(\text{Yes})$ by dividing the number of days on which Peter plays tennis by the total number of days; and we can estimate $P(\text{No})$ by dividing the number of days on which Peter does not play tennis by the total number of days.
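A minimal sketch of this counting estimate for the priors, using a hypothetical list of labels (one per training day):

```python
from collections import Counter

# Hypothetical labels, one per training day
labels = ["Yes", "No", "Yes", "Yes", "No"]

counts = Counter(labels)
total = len(labels)
# e.g. P(Yes) = (number of days Peter plays) / (total number of days)
prior = {y: counts[y] / total for y in counts}
print(prior)   # {'Yes': 0.6, 'No': 0.4}
```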
But as there are no entries for days on which the outlook is sunny, the temperature is cool, the humidity is high, and the wind is strong, how can we estimate $P(\text{sunny, cool, high, strong}\vert\text{Yes})$?
Assume Conditional Independence
Then we can “estimate” $P(\text{sunny, cool, high, strong}\vert\text{Yes})$
by $P(\text{sunny}\vert\text{Yes})P(\text{cool}\vert\text{Yes})P(\text{high}\vert\text{Yes})P(\text{strong}\vert\text{Yes})$
and $P(\text{sunny, cool, high, strong}\vert\text{No})$
by $P(\text{sunny}\vert\text{No})P(\text{cool}\vert\text{No})P(\text{high}\vert\text{No})P(\text{strong}\vert\text{No})$.
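Formally, conditional independence of the features given the class means that the joint likelihood factorises into a product of per-feature likelihoods:
\[P(E_1=e_1,\ldots,E_n=e_n\vert Y=y)=\prod_{i=1}^{n}P(E_i=e_i\vert Y=y)\]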
Probabilities such as $P(\text{sunny}\vert\text{Yes})$ can be estimated by dividing the number of days on which Peter plays tennis on which it is sunny by the total number of days on which Peter plays tennis.
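A minimal counting sketch of such an estimate, using hypothetical training records (the data and names are illustrative, not the actual table of the example):

```python
# Hypothetical training records: (feature-value dict, class label)
data = [
    ({"outlook": "sunny", "wind": "strong"}, "Yes"),
    ({"outlook": "sunny", "wind": "weak"},   "No"),
    ({"outlook": "rain",  "wind": "weak"},   "Yes"),
]

def cond_prob(feature, value, label):
    """Estimate P(feature = value | PlaysTennis = label) by counting."""
    label_days = [x for x, y in data if y == label]
    matching = [x for x in label_days if x[feature] == value]
    return len(matching) / len(label_days)

print(cond_prob("outlook", "sunny", "Yes"))   # 1 of the 2 "Yes" days is sunny -> 0.5
```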
Prediction
Using the training data we estimate:
$V_{\text{Yes}}=P(\text{sunny}\vert\text{Yes})P(\text{cool}\vert\text{Yes})P(\text{high}\vert\text{Yes})P(\text{strong}\vert\text{Yes})P(\text{Yes})$
and
$V_{\text{No}}=P(\text{sunny}\vert\text{No})P(\text{cool}\vert\text{No})P(\text{high}\vert\text{No})P(\text{strong}\vert\text{No})P(\text{No})$
We estimate, for example,
- $P(\text{Yes})=\frac{9}{14}$
- $P(\text{No})=\frac{5}{14}$
- $P(\text{strong}\vert\text{Yes})=\frac{3}{9}$
and obtain:
\[V_{\text{Yes}}=0.0053,\quad V_{\text{No}}=0.0274\]Thus, the naive Bayes’ classifier predicts that Peter will not play tennis.
This does assume that the feature variables are conditionally independent given the class, which is generally not true, as the elements of the weather are not independent. This is usually not a problem, since the resulting values are only estimates used to compare the classes.
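Putting the pieces together, here is a minimal sketch of the prediction step; the probability tables are hypothetical placeholders, and in practice they would be filled in by the counting estimates described above:

```python
def predict(prior, likelihood, observed):
    """Return the class y maximising P(Y=y) * prod_i P(E_i=e_i | Y=y),
    together with the unnormalised scores used for the comparison."""
    scores = {}
    for y in prior:
        score = prior[y]
        for value in observed:
            score *= likelihood[y][value]
        scores[y] = score
    return max(scores, key=scores.get), scores

# Hypothetical probability tables; in practice they come from the counting
# estimates described above, not from these made-up numbers.
prior = {"Yes": 0.6, "No": 0.4}
likelihood = {
    "Yes": {"sunny": 0.3, "cool": 0.4, "high": 0.3, "strong": 0.3},
    "No":  {"sunny": 0.5, "cool": 0.2, "high": 0.7, "strong": 0.6},
}

label, scores = predict(prior, likelihood, ["sunny", "cool", "high", "strong"])
print(label)   # "No" for these illustrative numbers
```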