Sentiment classifier

In this task, you will train a sentiment classifier that detects positive and negative sentiment in English texts. In machine learning, the choice of learning method and hyperparameters plays a major role, so you will compare several approaches.

Prepare dataset

The “Large Movie Review Dataset” by Maas et al. consists of 50,000 movie reviews from IMDb. Download the dataset (https://ai.stanford.edu/~amaas/data/sentiment/) and unzip the archive. We are interested in the positive and negative reviews in the training and test sets. Copy the data into the directory structure expected by scikit-learn:

|– test
|  |– neg
|  |  |– 0_2.txt
|  |  |– 10000_4.txt
|  |  …
|  `– pos
|     |– 0_10.txt
|     |– 10000_7.txt
|     …
`– train
   |– neg
   |  |– 0_3.txt
   |  |– 10000_4.txt
   |  …
   `– pos
      |– 0_9.txt
      |– 10000_8.txt
      …
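
The copy step can be scripted. The following is a minimal sketch, assuming the archive unpacks into a folder whose train and test subfolders each contain pos and neg directories. The function name copy_labeled_reviews and both paths are hypothetical and chosen for illustration only.

```python
import shutil
from pathlib import Path


def copy_labeled_reviews(source_root, target_root):
    """Copy only the pos and neg folders of the train and test splits.

    Anything else in the source folders is left behind.
    """
    source_root = Path(source_root)
    target_root = Path(target_root)
    for split in ("train", "test"):
        for label in ("pos", "neg"):
            # dirs_exist_ok (Python 3.8+) allows re-running the script
            # without failing on an existing target directory.
            shutil.copytree(
                source_root / split / label,
                target_root / split / label,
                dirs_exist_ok=True,
            )
```

A call such as copy_labeled_reviews("aclImdb", "data") would then produce the layout shown above under data, provided the unpacked archive is located at aclImdb.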

Train and evaluate models

– Load the training set and the test set.

– Write a function evaluate_pipeline(pipeline, train_set, test_set) that takes a scikit-learn pipeline, a training set, and a test set. The function should train the pipeline on the training set, evaluate it on the test set using the F1 score (sklearn.metrics.f1_score), and return the score rounded to four decimal places.

– Define the following pipelines, then train and evaluate each of them with the evaluate_pipeline function. Output the evaluation results.

  – Naive Bayes classifier on Tf-Idf values of all words.

  – Naive Bayes classifier on Tf-Idf values of all words occurring in at least 2 documents.

  – Naive Bayes classifier on L2-normalized frequencies of all words.

  – Naive Bayes classifier on L2-normalized frequencies of all words occurring in at least 2 documents.

  – Linear Support Vector classifier on Tf-Idf values of all words.

  – Linear Support Vector classifier on Tf-Idf values of all words occurring in at least 2 documents.

  – Linear Support Vector classifier on L2-normalized frequencies of all words.

  – Linear Support Vector classifier on L2-normalized frequencies of all words occurring in at least 2 documents.

  – Linear Support Vector classifier on Tf-Idf values of all words and word bigrams occurring in at least 2 documents.

  – Linear Support Vector classifier on L2-normalized frequencies of all words and word bigrams occurring in at least 2 documents.
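
One possible way to approach these steps is sketched below. It is not a reference solution and rests on several assumptions: the data is loaded with sklearn.datasets.load_files, MultinomialNB stands in for "Naive Bayes", LinearSVC for the linear support vector classifier, and "L2-normalized frequencies" is read as TfidfVectorizer with use_idf=False and norm="l2". The helper names build_pipelines and main, and the data path, are hypothetical.

```python
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def evaluate_pipeline(pipeline, train_set, test_set):
    """Fit on the training set, return the test-set F1 score (4 decimals)."""
    pipeline.fit(train_set.data, train_set.target)
    predictions = pipeline.predict(test_set.data)
    # load_files sorts folder names, so neg is label 0 and pos is label 1;
    # f1_score then reports F1 for the positive class by default.
    return round(float(f1_score(test_set.target, predictions)), 4)


def build_pipelines():
    """Return (description, pipeline) pairs in the order of the list above."""
    pipelines = []
    classifiers = (("Naive Bayes", MultinomialNB), ("Linear SVC", LinearSVC))
    for clf_name, make_clf in classifiers:
        for use_idf in (True, False):
            for min_df in (1, 2):
                # Assumption: use_idf=False with norm="l2" yields
                # L2-normalized term frequencies.
                vectorizer = TfidfVectorizer(
                    use_idf=use_idf, norm="l2", min_df=min_df
                )
                features = "Tf-Idf" if use_idf else "L2-normalized frequencies"
                description = f"{clf_name}, {features}, min_df={min_df}"
                pipelines.append(
                    (description, make_pipeline(vectorizer, make_clf()))
                )
    # The two bigram variants are only requested for the linear SVC.
    for use_idf in (True, False):
        vectorizer = TfidfVectorizer(
            use_idf=use_idf, norm="l2", min_df=2, ngram_range=(1, 2)
        )
        features = "Tf-Idf" if use_idf else "L2-normalized frequencies"
        description = f"Linear SVC, {features}, words + bigrams, min_df=2"
        pipelines.append((description, make_pipeline(vectorizer, LinearSVC())))
    return pipelines


def main(data_dir="data"):
    """Load both splits and print one score per pipeline."""
    # data_dir is a hypothetical path to the directory structure shown earlier.
    train_set = load_files(f"{data_dir}/train", encoding="utf-8")
    test_set = load_files(f"{data_dir}/test", encoding="utf-8")
    for description, pipeline in build_pipelines():
        score = evaluate_pipeline(pipeline, train_set, test_set)
        print(f"{score:.4f}  {description}")
```

Calling main() runs all ten configurations in sequence. A combination of CountVectorizer and Normalizer would be an alternative reading of "L2-normalized frequencies"; the single-vectorizer form is used here only to keep the ten configurations uniform.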
