In this task, you train a sentiment classifier that detects positive and negative sentiment in (English) texts. In machine learning, the choice of learning methods and hyperparameters plays a major role, so you compare different approaches.
Prepare dataset
The “Large Movie Review Dataset” by Maas et al. consists of 50,000 movie reviews from IMDb. Download the dataset (https://ai.stanford.edu/~amaas/data/sentiment/) and unzip the archive. We are only interested in the positive and negative reviews in the training and test sets. Copy the data into the directory structure expected by scikit-learn:
|– test
|  |– neg
|  |  |– 0_2.txt
|  |  |– 10000_4.txt
|  |  …
|  `– pos
|     |– 0_10.txt
|     |– 10000_7.txt
|     …
`– train
   |– neg
   |  |– 0_3.txt
   |  |– 10000_4.txt
   |  …
   `– pos
      |– 0_9.txt
      |– 10000_8.txt
      …
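The copy step can be sketched as follows. This is a minimal sketch, not a prescribed solution: the helper names copy_labeled_reviews and load_split are hypothetical, and it assumes that the unzipped archive contains train and test folders that each hold neg and pos subfolders, and that the review files are UTF-8 encoded. It uses sklearn.datasets.load_files, which reads one subfolder per class, so the layout above can be loaded directly.

```python
import shutil
from pathlib import Path

from sklearn.datasets import load_files


def copy_labeled_reviews(source_root, target_root):
    """Copy only the neg/pos folders of the train and test splits.

    Hypothetical helper: any other folders in the archive are left behind.
    """
    source_root, target_root = Path(source_root), Path(target_root)
    for split in ("train", "test"):
        for label in ("neg", "pos"):
            src = source_root / split / label
            dst = target_root / split / label
            # dirs_exist_ok (Python 3.8+) lets the copy be re-run safely.
            shutil.copytree(src, dst, dirs_exist_ok=True)
    return target_root


def load_split(root, split):
    """Load one split ('train' or 'test') as a scikit-learn Bunch.

    The Bunch exposes .data (list of texts), .target (class indices)
    and .target_names (folder names in sorted order).
    Assumption: the files decode as UTF-8.
    """
    return load_files(
        str(Path(root) / split),
        categories=["neg", "pos"],
        encoding="utf-8",
    )
```

Assuming the archive was unzipped to a folder named aclImdb and the target folder is called data (both names are placeholders), copy_labeled_reviews("aclImdb", "data") followed by load_split("data", "train") and load_split("data", "test") would yield the two sets. Because load_files sorts the folder names, neg is mapped to class 0 and pos to class 1.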
Train and evaluate models
– Load the training set and the test set.
– Write a function evaluate_pipeline(pipeline, train_set, test_set) that takes a scikit-learn pipeline, a training set, and a test set. The function should train the pipeline on the training set, evaluate it on the test set using the F1 score (sklearn.metrics.f1_score), and return the score rounded to four decimal places.
– Define the following pipelines, then train and evaluate each of them with the evaluate_pipeline function. Output the evaluation results.
  – Naive Bayes classifier on Tf-Idf values of all words.
  – Naive Bayes classifier on Tf-Idf values of all words occurring in at least 2 documents.
  – Naive Bayes classifier on L2-normalized frequencies of all words.
  – Naive Bayes classifier on L2-normalized frequencies of all words occurring in at least 2 documents.
  – Linear Support Vector classifier on Tf-Idf values of all words.
  – Linear Support Vector classifier on Tf-Idf values of all words occurring in at least 2 documents.
  – Linear Support Vector classifier on L2-normalized frequencies of all words.
  – Linear Support Vector classifier on L2-normalized frequencies of all words occurring in at least 2 documents.
  – Linear Support Vector classifier on Tf-Idf values of all words and word bigrams occurring in at least 2 documents.
  – Linear Support Vector classifier on L2-normalized frequencies of all words and word bigrams occurring in at least 2 documents.
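One possible shape for the steps above is sketched below. It is only one reading of the task, with several assumptions: the data sets are objects with .data and .target attributes (as returned by sklearn.datasets.load_files), "Naive Bayes" is taken to mean MultinomialNB, "L2-normalized frequencies" is realized as TfidfVectorizer with use_idf=False and norm="l2", and "at least 2 documents" as min_df=2. The helper build_pipelines and the pipeline names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def evaluate_pipeline(pipeline, train_set, test_set):
    """Fit on the training set, return the test-set F1 score (4 decimals)."""
    pipeline.fit(train_set.data, train_set.target)
    predicted = pipeline.predict(test_set.data)
    # Default f1_score settings: binary F1 for the class labeled 1.
    return round(float(f1_score(test_set.target, predicted)), 4)


def build_pipelines():
    """Return a dict {description: unfitted pipeline} for the ten variants."""
    pipelines = {}
    classifiers = (("Naive Bayes", MultinomialNB), ("Linear SVC", LinearSVC))
    for clf_name, make_clf in classifiers:
        for use_idf in (True, False):       # Tf-Idf vs. plain L2-normalized tf
            for min_df in (1, 2):           # all words vs. words in >= 2 docs
                weighting = "Tf-Idf" if use_idf else "L2 tf"
                name = f"{clf_name}, {weighting}, min_df={min_df}"
                vectorizer = TfidfVectorizer(
                    use_idf=use_idf, norm="l2", min_df=min_df
                )
                pipelines[name] = make_pipeline(vectorizer, make_clf())
    # The two bigram variants are only requested for the linear SVC.
    for use_idf in (True, False):
        weighting = "Tf-Idf" if use_idf else "L2 tf"
        name = f"Linear SVC, {weighting}, min_df=2, words+bigrams"
        vectorizer = TfidfVectorizer(
            use_idf=use_idf, norm="l2", min_df=2, ngram_range=(1, 2)
        )
        pipelines[name] = make_pipeline(vectorizer, LinearSVC())
    return pipelines
```

With train_set and test_set loaded, the results could then be printed with a loop such as: for name, pipeline in build_pipelines().items(): print(name, evaluate_pipeline(pipeline, train_set, test_set)). Building the vectorizer inside the loop gives every pipeline its own unfitted instance, so no fitted state is shared between runs.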