1 Objective

Construct a na¨ıve Bayes classifier to classify email messages as spam or not spam (“ham”). A Bayesian decision rule chooses the hypothesis that maximizes P(Spam|x) vs P(∼Spam|x) for email x.

Use any computer language you like. I recommend using Python as it includes ready access to a number of powerful packages and libraries for natural language processing (NLP). I include a few Python 3.6 excerpts to get you started. I also describe a several tools from the NLTK (natural language toolkit) to pre-process data to improve classification.

2 Na¨ıve (conditionally independent) classification

Suppose that you have a dataset {xN }. Each xk ∈ {xN } is a separate email for this assignment. Each of the N data points xk = (f1, f2, . . . , fn) ∈ Pattern space = X where f1, f2, . . . are called features. You extract features from each data point. Features in an email may include the list of “words” (tokens) in the message body. The goal in Bayesian classification is to compute two probabilities

P(Spam|x) and P(∼Spam|x) for each email. It classifies each email as “spam” or “not spam” by choosing the hypothesis with higher probability.

Na¨ıve Bayes assumes that features for x are independent given its class. P(Spam|x) is difficult to compute in general. Expand with the definition of conditional probability P(Spam|x) = P(Spam ∩ x) P(x) . (1) Look at the denominator P(x). P(x) equals the probability of a particular email given the universe of all possible emails. This is very difficult to calculate. But it is just a number between 0 and 1 since it is a probability. It just “normalizes” P(Spam ∩x). Now look at the numerator P(Spam ∩x). First expand x into its features {fn}. Each feature is an event that can occur or not (i.e. the word is in an email or not). So P(Spam ∩ x) P(x) ∝ P(Spam ∩ x) = P(Spam ∩ f1 ∩ · · · ∩ fn) (2) = P(Spam) · P(f1 ∩ · · · ∩ fn|Spam) (3) Apply the multiplication theorem (HW2, 1.c) to the second term to give P(f1 ∩ · · · ∩ fn|Spam) = P(Spam) · P(f1|Spam) · P(f2|Spam ∩ f1)· · · P(fn|Spam ∩ fn−1 ∩ · · · f2 ∩ f1). (4) But now you are still stuck computing a big product of complicated conditional probabilities. Na¨ıve Bayes classification makes an assumption that features are conditionally independent. This means that P(fj |fk ∩ Spam) = P(fj |Spam) (5) if j 6= k. This means that the probability you observe one feature (i.e. word) is independent of observing another word given the email is spam. This is a na¨ıve assumption and weakens your model. But you can now simplify the above to