# Naive Bayes

Naive Bayes is a simple probabilistic classifier based on Bayes' theorem and strong independence assumptions. It is often applied to text classification tasks with word frequencies as features (e.g., spam classification).
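
As a concrete example of that setup, the sketch below uses scikit-learn (CountVectorizer for word counts, MultinomialNB for the classifier) on a handful of made-up messages; the messages and labels are illustrative only.

```python
# A minimal sketch of spam classification with word-count features,
# using scikit-learn's multinomial Naive Bayes. Toy data is illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = [
    "win a free prize now",
    "limited offer click here",
    "meeting rescheduled to monday",
    "lunch tomorrow with the team",
]
labels = ["spam", "spam", "ham", "ham"]

# Turn each message into a vector of word counts (the features).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

# Fit the classifier and score a new message.
clf = MultinomialNB()
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["free prize meeting"])))
```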

# Underlying Assumption

The primary assumption of the model (making it “naive”) is that each feature is independent of all other features, given the target label.
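
Formally, writing the features as x_1, ..., x_n and the class label as c, the assumption says that

$P(x_i | c, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i | c), \forall{i}$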

# Model Formulation

The overarching goal of classification is to estimate the probability of a particular class label given the data. That is,

$P(c | \mathbf{x}) = P(c | x_1,\dots,x_n), \forall{c \in C}$

Naively, we could estimate this probability directly from a training set of m labeled examples (x_i, y_i): for each data point x seen in our training set, count up the associated class labels:

$P(c|\mathbf{x}) = \frac{P(\mathbf{x} \cap c)}{P(\mathbf{x})} = \frac{\sum_{i=1}^m{[\![ \mathbf{x}_i = \mathbf{x} \land y_i = c]\!]}}{\sum_{i=1}^m{[\![ \mathbf{x}_i = \mathbf{x}]\!]}}$
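
To make this counting estimator concrete, here is a tiny Python sketch on a made-up dataset of binary feature vectors; the data and function name are illustrative only.

```python
# Direct counting estimate of P(c | x) on a tiny, made-up discrete dataset.
from collections import Counter

training_x = [(1, 0), (1, 0), (0, 1), (1, 0), (0, 1)]
training_y = ["spam", "spam", "ham", "ham", "spam"]

def estimate_posterior(x, c):
    """Fraction of training examples equal to x that carry label c."""
    matches = [y for xi, y in zip(training_x, training_y) if xi == x]
    if not matches:
        return None  # no training example matches x exactly
    return Counter(matches)[c] / len(matches)

print(estimate_posterior((1, 0), "spam"))  # 2/3
print(estimate_posterior((1, 1), "spam"))  # None: x never seen in training
```

The second query already shows the failure mode discussed next: a feature vector that never appears in the training set cannot be scored at all.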

However, this method fails in almost every real-world scenario, because we rarely have many samples with exactly the same x value (imagine a continuous setting, where we may never see the same x twice). The counts do converge to the correct probabilities, but the estimate is very poor without an extremely large amount of data. Instead, we can estimate these probabilities using Bayes' rule:

$P(c|\mathbf{x}) = \frac{P(\mathbf{x}|c)P(c)}{P(\mathbf{x})}$

The overall goal is to find the class label c that is most likely given the data. So for a given test point x, we want to use the above equation to find the value of c that maximizes P(c|x). We can ignore the denominator, since it does not depend on c and is therefore the same for every class; the value of c that maximizes P(c|x) also maximizes P(x|c)P(c).
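
In other words, the classifier's decision rule is

$\hat{c} = \arg\max_{c \in C}{P(c|\mathbf{x})} = \arg\max_{c \in C}{P(\mathbf{x}|c)P(c)}$

Turning to the numerator, we know that it is the joint probability over x and c: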

$P(\mathbf{x} \cap c) = P(\mathbf{x}|c)P(c)$

which can be expanded using the chain rule of probability:

$P(\mathbf{x} \cap c) = P(x_1,\dots,x_n,c) = P(c)P(x_1|c)\cdots P(x_n|x_1,\dots,x_{n-1},c)$

Depending on the number of features, this expanded joint probability can be very difficult to compute and estimate. So we employ a “naive” assumption of conditional independence, assuming that each feature is independent of all other features given the class label. This simplifies the prior joint expansion, since P(x_i|x_1,...,x_{i-1},c) = P(x_i|c) when x_i is conditionally independent of the other features given c:

$P(\mathbf{x} \cap c) = P(x_1,\dots,x_n,c) = P(c)P(x_1|c)\cdots P(x_n|c) = P(c)\prod_{i=1}^n{P(x_i|c)}$

Substituting this back into Bayes' rule (and dropping the constant denominator P(x)) gives the final decision rule

$\hat{c} = \arg\max_{c \in C}{P(c)\prod_{i=1}^n{P(x_i|c)}}$

and we have a classifier: predict the class whose prior, multiplied by the per-feature likelihoods, is largest.
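
To tie the derivation together, here is a minimal from-scratch sketch of a Bernoulli-style Naive Bayes classifier over binary (word-presence) features. The toy data, function names, and the add-one (Laplace) smoothing are assumptions for the example; the scoring itself is the rule above, computed in log space so that multiplying many small probabilities does not underflow.

```python
# A minimal from-scratch sketch of Naive Bayes over binary features.
# Toy data is illustrative only. Probabilities are combined in log space,
# and add-one (Laplace) smoothing is used when estimating P(x_i | c)
# so unseen feature values do not zero out the product.
import math
from collections import Counter

def fit(X, y):
    """Estimate P(c) and P(x_j = 1 | c) from binary feature vectors X."""
    m = len(y)
    class_counts = Counter(y)
    n_features = len(X[0])
    # feature_counts[c][j] = number of examples of class c with x_j = 1
    feature_counts = {c: [0] * n_features for c in class_counts}
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            feature_counts[yi][j] += v
    priors = {c: count / m for c, count in class_counts.items()}
    # Laplace-smoothed conditional probabilities P(x_j = 1 | c)
    likelihoods = {
        c: [(feature_counts[c][j] + 1) / (class_counts[c] + 2)
            for j in range(n_features)]
        for c in class_counts
    }
    return priors, likelihoods

def predict(x, priors, likelihoods):
    """Return argmax_c of log P(c) + sum_j log P(x_j | c)."""
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for j, v in enumerate(x):
            p = likelihoods[c][j]            # P(x_j = 1 | c)
            score += math.log(p if v else 1 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy training data: each row is a vector of binary word-presence features.
X = [(1, 1, 0), (1, 0, 0), (0, 1, 1), (0, 0, 1)]
y = ["spam", "spam", "ham", "ham"]
priors, likelihoods = fit(X, y)
print(predict((1, 0, 0), priors, likelihoods))  # expected: "spam"
```

Taking logs does not change which class attains the maximum, so the log-space score selects the same class as P(c) multiplied by the product of the P(x_i|c).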