Naive Bayes Classifier - 2
Suppose we want to predict the class of the instance with features $(e_1,\ldots,e_n)$.
We assume $E_1,\ldots,E_n$ are independent given $Y$ and estimate:
\[V_y=P(Y=y)\prod^n_{i=1}P(E_i=e_i\vert Y=y)\]
for all $y\in \mathcal Y$. (The $\prod$ symbol denotes the product of the terms $P(E_i=e_i\vert Y=y)$ over $i=1,\ldots,n$.) The probabilities are estimated as follows:
- $P(Y=y)$ is estimated by the number of instances labelled with $y$ in the training data divided by the total number of instances in the training data.
- $P(E_i=e_i\vert Y=y)$ is estimated by the number of instances with feature value $e_i$ labelled with $y$ in the training data divided by the number of instances labelled with $y$ in the training data (see the worked numbers below).
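For concreteness, here is a hypothetical instance of these counting estimates, in the spirit of the spam example referenced below; the numbers are invented purely for illustration:
\[
P(Y=\text{spam})=\frac{\#\{\text{spam instances}\}}{\#\{\text{all instances}\}}=\frac{3}{10},\qquad
P(E_1=\text{``free''}\vert Y=\text{spam})=\frac{\#\{\text{spam instances with }e_1=\text{``free''}\}}{\#\{\text{spam instances}\}}=\frac{2}{3}.
\]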
We then set:
\[f(e_1,\ldots,e_n)=y\]
for the $y$ for which $V_y$ is maximal (i.e., we take the maximum-probability hypothesis).
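The following is a minimal Python sketch of the estimation and prediction steps above, for categorical features. The function names (`train_nb`, `predict_nb`) and the toy weather data are illustrative assumptions, not taken from the source.

```python
from collections import Counter

def train_nb(X, y):
    """Estimate P(Y=y) and P(E_i=e_i | Y=y) by counting, as described above."""
    class_counts = Counter(y)
    prior = {c: class_counts[c] / len(y) for c in class_counts}
    # feature_counts[(i, v, c)] = number of instances labelled c
    # whose i-th feature has value v
    feature_counts = Counter()
    for xs, c in zip(X, y):
        for i, v in enumerate(xs):
            feature_counts[(i, v, c)] += 1
    def cond_prob(i, v, c):
        return feature_counts[(i, v, c)] / class_counts[c]
    return prior, cond_prob

def predict_nb(xs, prior, cond_prob):
    """Return the y maximising V_y = P(Y=y) * prod_i P(E_i=e_i | Y=y)."""
    scores = {}
    for c, p in prior.items():
        v = p
        for i, val in enumerate(xs):
            v *= cond_prob(i, val, c)  # an unseen (value, class) pair gives V_y = 0
        scores[c] = v
    return max(scores, key=scores.get)

# Toy training data: features (outlook, windy), label = play
X = [("sunny", "no"), ("sunny", "yes"), ("rainy", "yes"), ("rainy", "no")]
y = ["yes", "no", "no", "yes"]
prior, cond_prob = train_nb(X, y)
print(predict_nb(("sunny", "no"), prior, cond_prob))  # -> "yes"
```

Note that with these raw counts, a feature value never seen with a class in the training data forces $V_y=0$ for that class; in practice this is commonly mitigated with smoothing of the counts.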
See slide 27 onwards for an additional example about filtering spam emails.
Summary
- A supervised learning algorithm based on Bayes' Theorem.
- It is called naive because it is assumed that the features are independent of each other, given the classification.
- The Naive Bayes classifier works surprisingly well in practice, even when the features are obviously not independent given the classification.