Naive Bayes Classifier - 2
Suppose we want to predict the class of the instance with features $(e_1,\ldots,e_n)$.
We assume $E_1,\ldots,E_n$ are independent given $Y$ and estimate:
\[V_y=P(Y=y)\prod^n_{i=1}P(E_i=e_i\vert Y=y)\]
for all $y\in \mathcal Y$. (The $\prod$ symbol denotes the product of the terms $P(E_i=e_i\vert Y=y)$ over $i=1,\ldots,n$.) The probabilities are estimated as follows:
- $P(Y=y)$ is estimated by the number of instances labelled with $y$ in the training data divided by the total number of instances in the training data.
- $P(E_i=e_i\vert Y=y)$ is estimated by the number of instances with feature value $e_i$ labelled with $y$ in the training data divided by the number of instances labelled with $y$ in the training data (see the worked numbers below).
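For concreteness, here is a hypothetical instance of these counting estimates, in the spirit of the spam example referenced below; the numbers are invented purely for illustration:
\[
P(Y=\text{spam})=\frac{\#\{\text{spam instances}\}}{\#\{\text{all instances}\}}=\frac{3}{10},\qquad
P(E_1=\text{``free''}\vert Y=\text{spam})=\frac{\#\{\text{spam instances with }e_1=\text{``free''}\}}{\#\{\text{spam instances}\}}=\frac{2}{3}.
\]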
We then set:
\[f(e_1,\ldots,e_n)=y\]
for the $y$ for which $V_y$ is maximal (i.e., we take the maximum-probability hypothesis).
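The following is a minimal Python sketch of the estimation and prediction steps above, for categorical features. The function names (`train_nb`, `predict_nb`) and the toy weather data are illustrative assumptions, not taken from the source.

```python
from collections import Counter

def train_nb(X, y):
    """Estimate P(Y=y) and P(E_i=e_i | Y=y) by counting, as described above."""
    class_counts = Counter(y)
    prior = {c: class_counts[c] / len(y) for c in class_counts}
    # feature_counts[(i, v, c)] = number of instances labelled c
    # whose i-th feature has value v
    feature_counts = Counter()
    for xs, c in zip(X, y):
        for i, v in enumerate(xs):
            feature_counts[(i, v, c)] += 1
    def cond_prob(i, v, c):
        return feature_counts[(i, v, c)] / class_counts[c]
    return prior, cond_prob

def predict_nb(xs, prior, cond_prob):
    """Return the y maximising V_y = P(Y=y) * prod_i P(E_i=e_i | Y=y)."""
    scores = {}
    for c, p in prior.items():
        v = p
        for i, val in enumerate(xs):
            v *= cond_prob(i, val, c)  # an unseen (value, class) pair gives V_y = 0
        scores[c] = v
    return max(scores, key=scores.get)

# Toy training data: features (outlook, windy), label = play
X = [("sunny", "no"), ("sunny", "yes"), ("rainy", "yes"), ("rainy", "no")]
y = ["yes", "no", "no", "yes"]
prior, cond_prob = train_nb(X, y)
print(predict_nb(("sunny", "no"), prior, cond_prob))  # -> "yes"
```

Note that with these raw counts, a feature value never seen with a class in the training data forces $V_y=0$ for that class; in practice this is commonly mitigated with smoothing of the counts.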
See slide 27 onwards for an additional example about filtering spam emails.
Summary
- A supervised learning algorithm based on Bayes' Theorem.
- It is called naive because it is assumed that the features are independent of each other, given the classification.
- The Naive Bayes classifier works surprisingly well in practice, even when the features are obviously not independent given the classification.