Text Data for Sentiment Analysis

1 Overview

In this assignment you will implement the Naive Bayes algorithm with maximum likelihood and MAP solutions and evaluate it using cross validation on the task of sentiment analysis (as in identifying positive/negative product reviews).

2 Text Data for Sentiment Analysis

We will be using the “Sentiment Labelled Sentences Data Set”1 that includes sentences labelled with sentiment (1 for positive and 0 for negative) extracted from three domains imdb.com, amazon.com, yelp.com. These form 3 datasets for the assignment.

Each dataset is given in a single file, where each example is in one line of that file. Each such example is given as a list of space separated words, followed by a tab character (\t), followed by the label, and then by a newline (\n). Here is an example from the yelp dataset:

3 Implementation

3.1 Naive Bayes for Text Categorization

In this assignment you will implement “Naive Bayes for text categorization” as discussed in class. In our application every “document” is one sentence as explained above. The description in this section assumes that a dataset has been split into separate train and test sets.

Given a training set for Naive Bayes you need to parse each example and record the counts for class and for word given class for all the necessary combinations. These counts constitute the learning process since they determine the prediction of Naive Bayes (for both maximum likelihood and MAP solutions).

Now, given the test set, you parse each example, calculate the scores for each class and test the prediction. Note that products of small numbers (probabilities) will quickly lead to underflow problems. Due to that you should work with sum of log probabilities instead of product of probabilities. Recall that a · b · c > d · e · f iff log a + log b + log c > log d + log e + log f so that working with the logarithms is sufficient. However, note that unless your programming environment handles infinity natively, will need to handle ln(0) , −∞ as a special case in your code.

Important point for prediction: If a word in a test example did not appear in the training set at all (i.e. in any of the classes), then simply skip that word when calculating the score for this example. However, if the word did appear with some class but not the other then use the counts you have (zero for one class but non zero for the other).

Leave a Reply

Your email address will not be published. Required fields are marked *