Data Mining/Machine Learning Assignment Tutoring

Naïve Bayes Classifier

1 Objective

Construct a naïve Bayes classifier to classify email messages as spam or not spam (“ham”). A Bayesian decision rule compares P(Spam|x) with P(∼Spam|x) for each email x and chooses the hypothesis with the larger posterior probability.

Use any computer language you like. I recommend Python, as it includes ready access to a number of powerful packages and libraries for natural language processing (NLP). I include a few Python 3.6 excerpts to get you started. I also describe several tools from the NLTK (Natural Language Toolkit) for pre-processing the data to improve classification.
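For example, a minimal pre-processing pipeline with NLTK might look like the sketch below. It assumes the nltk package is installed and that the 'punkt' and 'stopwords' resources have been downloaded; the exact pipeline (stemming, stopword removal, and so on) is a design choice, not a requirement:

    # Sketch of one possible NLTK pre-processing pipeline.
    # Assumes: pip install nltk, then nltk.download('punkt') and nltk.download('stopwords').
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    def preprocess(text):
        """Lower-case, tokenize, drop stopwords and non-alphabetic tokens, then stem."""
        stop = set(stopwords.words('english'))
        stemmer = PorterStemmer()
        tokens = nltk.word_tokenize(text.lower())
        return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop]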

2 Naïve (conditionally independent) classification

Suppose that you have a dataset {x_1, ..., x_N}; each x_k is a separate email for this assignment. Each of the N data points is a tuple x_k = (f_1, f_2, ..., f_n) in the pattern space X, where f_1, f_2, ..., f_n are called features. You extract features from each data point; features in an email may include the list of “words” (tokens) in the message body. The goal in Bayesian classification is to compute two probabilities

P(Spam|x) and P(∼Spam|x) for each email x, and to classify the email as “spam” or “not spam” by choosing the hypothesis with the higher probability.

Naïve Bayes assumes that the features of x are independent given its class. P(Spam|x) is difficult to compute in general, so expand it with the definition of conditional probability:

    P(Spam|x) = P(Spam ∩ x) / P(x).    (1)

Look at the denominator P(x): it is the probability of a particular email within the universe of all possible emails, which is very difficult to calculate. But it is just a number between 0 and 1, since it is a probability, and it merely “normalizes” P(Spam ∩ x). Now look at the numerator P(Spam ∩ x). First expand x into its features f_1, ..., f_n. Each feature is an event that either occurs or does not (i.e. the word is in the email or it is not). So

    P(Spam ∩ x) / P(x) ∝ P(Spam ∩ x) = P(Spam ∩ f_1 ∩ ··· ∩ f_n)    (2)
                                      = P(Spam) · P(f_1 ∩ ··· ∩ f_n | Spam).    (3)

Apply the multiplication theorem (HW2, 1.c) to the second factor to give

    P(f_1 ∩ ··· ∩ f_n | Spam) = P(f_1 | Spam) · P(f_2 | Spam ∩ f_1) ··· P(f_n | Spam ∩ f_{n−1} ∩ ··· ∩ f_1).    (4)

But now you are still stuck computing a long product of complicated conditional probabilities. Naïve Bayes classification therefore assumes that the features are conditionally independent given the class:

    P(f_j | f_k ∩ Spam) = P(f_j | Spam)  for j ≠ k.    (5)

In other words, the probability of observing one feature (i.e. word) is independent of observing another word, given that the email is spam. This is a naïve assumption and it weakens the model, but it collapses equation (4) into a product of simple per-word probabilities:

    P(Spam ∩ x) = P(Spam) · P(f_1 | Spam) · P(f_2 | Spam) ··· P(f_n | Spam).
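To make the resulting decision rule concrete, here is a minimal sketch in Python. The names (spam_counts, ham_counts, the priors, the totals) and the add-one smoothing are illustrative assumptions, not part of the assignment statement:

    # Sketch: score one class under the naive Bayes assumption,
    # i.e. P(C) * prod_i P(f_i | C), per equations (1)-(5) above.
    def class_score(words, prior, counts, total, vocab_size):
        """counts maps word -> count in the training emails of this class.
        Add-one smoothing keeps unseen words from zeroing the product."""
        score = prior
        for w in words:
            score *= (counts.get(w, 0) + 1) / (total + vocab_size)
        return score

    # Decision rule: pick the hypothesis with the larger score, e.g.
    # label = ('spam' if class_score(words, p_spam, spam_counts, spam_total, V)
    #                  > class_score(words, p_ham, ham_counts, ham_total, V)
    #          else 'ham')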

Text Data for Sentiment Analysis

1 Overview

In this assignment you will implement the Naive Bayes algorithm with maximum-likelihood and MAP solutions, and evaluate it using cross-validation on the task of sentiment analysis (i.e., identifying positive/negative product reviews).
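As a rough sketch of one way to set up the cross-validation split (the fold count, shuffling, and seed below are design choices, not requirements):

    import random

    def k_fold_indices(n, k=10, seed=0):
        """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
        idx = list(range(n))
        random.Random(seed).shuffle(idx)
        return [idx[i::k] for i in range(k)]

    # For fold j: test on folds[j], train on the union of the other folds,
    # then average the accuracy over the k runs.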

2 Text Data for Sentiment Analysis

We will be using the “Sentiment Labelled Sentences Data Set”, which includes sentences labelled with sentiment (1 for positive, 0 for negative) extracted from three domains: imdb.com, amazon.com, and yelp.com. These form the 3 datasets for the assignment.

Each dataset is given in a single file, where each example occupies one line of that file. Each example is given as a list of space-separated words, followed by a tab character (\t), followed by the label, and then by a newline (\n). Here is an example of the format from the yelp dataset (illustrative; the sentences in your copy of the file may differ):

    Wow... Loved this place.\t1
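A minimal loader for this format might look like the following sketch (the file name in the usage comment is an assumption):

    def load_dataset(path):
        """Read one 'words<TAB>label' example per line.
        Returns a list of (list of words, int label) pairs."""
        examples = []
        with open(path) as f:
            for line in f:
                text, label = line.rstrip('\n').split('\t')
                examples.append((text.split(), int(label)))
        return examples

    # e.g. data = load_dataset('yelp_labelled.txt')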

3 Implementation

3.1 Naive Bayes for Text Categorization

In this assignment you will implement “Naive Bayes for text categorization” as discussed in class. In our application every “document” is one sentence as explained above. The description in this section assumes that a dataset has been split into separate train and test sets.

Given a training set for Naive Bayes you need to parse each example and record the counts for class and for word given class for all the necessary combinations. These counts constitute the learning process since they determine the prediction of Naive Bayes (for both maximum likelihood and MAP solutions).
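A minimal sketch of that counting step (the container names are placeholders):

    from collections import Counter, defaultdict

    def train_counts(examples):
        """class_counts[c] = number of training sentences with label c;
        word_counts[c][w] = occurrences of word w in sentences of class c."""
        class_counts = Counter()
        word_counts = defaultdict(Counter)
        for words, label in examples:
            class_counts[label] += 1
            for w in words:
                word_counts[label][w] += 1
        return class_counts, word_counts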

Now, given the test set, you parse each example, calculate the score for each class, and test the prediction. Note that products of small numbers (probabilities) quickly lead to underflow problems, so you should work with sums of log probabilities instead of products of probabilities. Recall that a · b · c > d · e · f iff log a + log b + log c > log d + log e + log f, so working with the logarithms is sufficient. However, unless your programming environment handles infinity natively, you will need to handle ln(0) := −∞ as a special case in your code.

Important point for prediction: If a word in a test example did not appear in the training set at all (i.e., in any of the classes), then simply skip that word when calculating the scores for this example. However, if the word appeared with some class but not the other, then use the counts you have (zero for one class, non-zero for the other).
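Putting the last two points together, a prediction routine might look like the sketch below. Here m is a hypothetical smoothing parameter: m = 0 gives the maximum-likelihood estimate and m = 1 a Laplace/MAP-style estimate. Words unseen in every class are skipped, and ln(0) is handled as −∞:

    import math

    def predict(words, class_counts, word_counts, m=1):
        """Return the argmax class using sums of log probabilities."""
        vocab = set()
        for c in word_counts:
            vocab.update(word_counts[c])
        n_total = sum(class_counts.values())
        best_class, best_score = None, None
        for c in class_counts:
            score = math.log(class_counts[c] / n_total)
            n_words = sum(word_counts[c].values())
            for w in words:
                if w not in vocab:
                    continue            # unseen in every class: skip the word
                p = (word_counts[c][w] + m) / (n_words + m * len(vocab))
                score += math.log(p) if p > 0 else float('-inf')
            if best_score is None or score > best_score:
                best_class, best_score = c, score
        return best_class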

Sentiment Analytics

Term Project


The term project for this class, due the last week of the semester, consists of a presentation demonstrating the use of the Natural Language Processing techniques you’ve learned in the course.  You will choose a data set, and use the tools of your choice to analyze the data.  In your presentation, you will describe the data and present the results of your analysis in the categories shown below.  You may use screen shots, diagrams, tables, and text in your slides.  You will also need to record audio of yourself presenting each slide of your report.  This project accounts for approximately one third of your grade in this class, so be sure to do the best work you can do.


Data:


You may choose any data set you would like to work on, as long as it contains at least 1000 distinct unstructured texts.  This can be a collection of Twitter data, blog posts, e-mails, news reports, or similar data.  The following links contain

Frequency

CPSC2620—Fall 2015

Assignment 4    Due: November 6, 2015, 11:55 pm

1. Write a program freq.cc which reads in a list of words and produces two lists of output.

• The first list is the list of distinct words in the file, together with the number of times each word occurs in the input. The words should first be converted to lower case (write a helper function to convert a character to its lower-case equivalent and use transform from the STL). This list should be sorted in “dictionary order” based on the words. Suppose the list of words is:

    abcd Computer science computer games

The output should look like this (the exact format is up to you):

    Word      Frequency
    --------- ---------
    abcd      1
    computer  2
    games     1
    science   1

• The second list is the list of distinct words sorted in decreasing frequency. Words with the same frequency should be listed in “dictionary order.” For the list above, the output should look like:

    Frequency Word
    --------- ---------
    2         computer
    1         abcd
    1         games
    1         science

You may assume that the words are separated by white space.
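The assignment itself calls for C++ (freq.cc), but since the rest of this document uses Python, here is a Python sketch of the required behaviour only, to use as a reference while writing the C++ version:

    from collections import Counter

    def word_frequencies(words):
        """Count lower-cased words; return both required orderings."""
        counts = Counter(w.lower() for w in words)
        by_word = sorted(counts.items())                       # dictionary order
        by_freq = sorted(counts.items(), key=lambda wc: (-wc[1], wc[0]))
        return by_word, by_freq

    # e.g. word_frequencies("abcd Computer science computer games".split())
    # -> by_freq: [('computer', 2), ('abcd', 1), ('games', 1), ('science', 1)]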