# Naïve Bayes Classifier

1 Objective

Construct a naïve Bayes classifier to classify email messages as spam or not spam (“ham”). A Bayesian decision rule chooses the hypothesis that maximizes P(Spam|x) versus P(∼Spam|x) for email x.

Use any computer language you like. I recommend Python, as it includes ready access to a number of powerful packages and libraries for natural language processing (NLP). I include a few Python 3.6 excerpts to get you started, and I also describe several tools from the NLTK (Natural Language Toolkit) for pre-processing data to improve classification.

2 Naïve (conditionally independent) classification

Suppose that you have a dataset {x1, . . . , xN}. Each xk is a separate email for this assignment. Each of the N data points xk = (f1, f2, . . . , fn) ∈ pattern space X, where f1, f2, . . . are called features. You extract features from each data point. Features in an email may include the list of “words” (tokens) in the message body. The goal in Bayesian classification is to compute two probabilities

P(Spam|x) and P(∼Spam|x) for each email. It classifies each email as “spam” or “not spam” by choosing the hypothesis with higher probability.

Naïve Bayes assumes that the features of x are independent given its class. P(Spam|x) is difficult to compute in general, so expand it with the definition of conditional probability:

P(Spam|x) = P(Spam ∩ x) / P(x). (1)

Look at the denominator P(x): the probability of a particular email given the universe of all possible emails. This is very difficult to calculate. But it is just a number between 0 and 1, since it is a probability; it only “normalizes” P(Spam ∩ x). Now look at the numerator P(Spam ∩ x). First expand x into its features {fk}. Each feature is an event that can occur or not (i.e. the word is in an email or not). So

P(Spam ∩ x) / P(x) ∝ P(Spam ∩ x) = P(Spam ∩ f1 ∩ ⋯ ∩ fn) (2)
= P(Spam) · P(f1 ∩ ⋯ ∩ fn | Spam). (3)

Apply the multiplication theorem (HW2, 1.c) to the second factor to give

P(f1 ∩ ⋯ ∩ fn | Spam) = P(f1|Spam) · P(f2|Spam ∩ f1) ⋯ P(fn|Spam ∩ fn−1 ∩ ⋯ ∩ f1). (4)

But now you are still stuck computing a big product of complicated conditional probabilities. Naïve Bayes classification assumes that the features are conditionally independent:

P(fj|fk ∩ Spam) = P(fj|Spam) (5)

if j ≠ k. That is, the probability that you observe one feature (i.e. word) is independent of observing another word, given that the email is spam. This is a naïve assumption and weakens your model, but it lets you simplify the above to

P(Spam|x) ∝ P(Spam) · P(f1|Spam) · P(f2|Spam) ⋯ P(fn|Spam). (6)
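Under the conditional-independence assumption, the whole decision rule fits in a few lines of Python. The prior and word likelihoods below are invented placeholders for illustration, not estimates from any dataset:

```python
import math

# Hypothetical, hand-picked probabilities for illustration only.
p_spam = 0.4                                        # P(Spam)
p_word_given_spam = {"free": 0.30, "meeting": 0.02}
p_word_given_ham = {"free": 0.03, "meeting": 0.20}

def log_posterior(words, prior, likelihoods):
    """Unnormalized log P(class | words) under the naive Bayes assumption."""
    score = math.log(prior)
    for w in words:
        if w in likelihoods:          # skip words we have no estimate for
            score += math.log(likelihoods[w])
    return score

email = ["free", "meeting"]
spam_score = log_posterior(email, p_spam, p_word_given_spam)
ham_score = log_posterior(email, 1 - p_spam, p_word_given_ham)
print("spam" if spam_score > ham_score else "ham")
```

Working in log space avoids underflow from multiplying many small probabilities, a point the second assignment below makes explicit.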

# Text Data for Sentiment Analysis

1 Overview

In this assignment you will implement the Naive Bayes algorithm with maximum likelihood and MAP solutions and evaluate it using cross validation on the task of sentiment analysis (as in identifying positive/negative product reviews).

2 Text Data for Sentiment Analysis

We will be using the “Sentiment Labelled Sentences Data Set”, which includes sentences labelled with sentiment (1 for positive and 0 for negative) extracted from three domains: imdb.com, amazon.com, and yelp.com. These form the 3 datasets for the assignment.

Each dataset is given in a single file, where each example is in one line of that file. Each such example is given as a list of space separated words, followed by a tab character (\t), followed by the label, and then by a newline (\n). Here is an example from the yelp dataset:
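This line format can be parsed by splitting on the tab character; a minimal sketch (the sample line here is invented for illustration, not taken from the yelp file):

```python
def parse_line(line):
    """Split one 'words<TAB>label<NEWLINE>' example into (tokens, label)."""
    text, label = line.rstrip("\n").split("\t")
    return text.split(" "), int(label)

tokens, label = parse_line("the food was great\t1\n")
print(tokens, label)   # → ['the', 'food', 'was', 'great'] 1
```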

3 Implementation

3.1 Naive Bayes for Text Categorization

In this assignment you will implement “Naive Bayes for text categorization” as discussed in class. In our application every “document” is one sentence as explained above. The description in this section assumes that a dataset has been split into separate train and test sets.

Given a training set for Naive Bayes you need to parse each example and record the counts for class and for word given class for all the necessary combinations. These counts constitute the learning process since they determine the prediction of Naive Bayes (for both maximum likelihood and MAP solutions).

Now, given the test set, you parse each example, calculate the scores for each class and test the prediction. Note that products of small numbers (probabilities) will quickly lead to underflow problems. Because of this, you should work with sums of log probabilities instead of products of probabilities. Recall that a · b · c > d · e · f iff log a + log b + log c > log d + log e + log f, so working with the logarithms is sufficient. However, note that unless your programming environment handles infinity natively, you will need to handle ln(0) = −∞ as a special case in your code.

Important point for prediction: If a word in a test example did not appear in the training set at all (i.e. in any of the classes), then simply skip that word when calculating the score for this example. However, if the word did appear with some class but not the other then use the counts you have (zero for one class but non zero for the other).
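Putting these pieces together, the count-based training and log-probability scoring described above might look like this (maximum-likelihood version, with unseen words skipped as required; the toy data and names are invented):

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (tokens, label). The class counts and
    word-given-class counts are the entire learned model."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for tokens, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
    return class_counts, word_counts

def score(tokens, c, class_counts, word_counts):
    """Sum of log probabilities for class c (maximum likelihood, no smoothing)."""
    vocab = set().union(*word_counts.values())
    n_c = sum(word_counts[c].values())
    s = math.log(class_counts[c] / sum(class_counts.values()))
    for w in tokens:
        if w not in vocab:            # unseen in any class: skip the word
            continue
        count = word_counts[c][w]
        s += math.log(count / n_c) if count else float("-inf")  # ln(0) -> -inf
    return s

train_set = [(["good", "movie"], 1), (["bad", "movie"], 0)]
cc, wc = train(train_set)
pred = max([0, 1], key=lambda c: score(["good", "film"], c, cc, wc))
print(pred)   # "film" is unseen and skipped; "good" selects class 1
```

The MAP (add-one smoothed) variant only changes the `count / n_c` estimate.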

# network

Question 1 (Delay, 18%).

As shown in the figure below, a file of size F = 1000 + S bytes is transmitted on an end-to-end connection over four links, where S is the last three digits of your student number. For example, if your student number is 490123456, then S = 456 and F = 1456 bytes.

Each link is 100 km. The signal propagation speed is 2 × 108 m/s. Assume that a header of 40 bytes is added to each packet. The bandwidth of all links is R = 1 Mbps at the beginning. The nodes use the store-and-forward scheme. (Ignore processing delays at each node.)

(0) What is your student number? Warning: If you use another student’s number as S value to answer the question, the following sub-questions will not be marked and you will get 0 in Question 1.

(1) How long does it take to transmit the file if the whole file is transmitted as a single packet? Now assume that the bandwidths of links B–C and D–E become 0.5 Mbps, and answer (2)–(4).

(2) Repeat (1).

(3) We would like to break the file into smaller packets to decrease the overall delay in the store-and-forward scheme. Assume that each time you break the file to make a new packet, you have to add 40 bytes as the header of the new packet. Repeat (2) when we break the file into N = 4 packets.

(4) What should be the optimal size of the packets to have the minimum overall delay to deliver the whole file? Find the overall delay. Hint: Since the link B − C has a smaller bandwidth compared with A − B, packets could be queued for some time!
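As a sanity check for sub-question (1): with store-and-forward over four identical links, one packet incurs four transmission delays plus four propagation delays. A sketch using the example value S = 456 from the problem statement (substitute your own S):

```python
F = 1456              # file size in bytes (example student number ending in 456)
header = 40           # header bytes added per packet
L = (F + header) * 8  # packet size in bits
R = 1e6               # link bandwidth in bits/s
d_link = 100e3        # link length in metres
v = 2e8               # propagation speed in m/s
hops = 4

t_trans = L / R       # transmission delay per link
t_prop = d_link / v   # propagation delay per link
total = hops * (t_trans + t_prop)
print(f"{total * 1000:.3f} ms")
```

For (2)–(4) the slower B–C and D–E links and any queueing at the bottleneck change the per-link transmission terms, so the sum must be built link by link rather than multiplied by 4.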

# Sheep & Wolves

In this mini-project, you’ll implement an agent that can solve the Sheep and Wolves problem for an arbitrary number of initial wolves and sheep. You will submit the code for solving the problem to the Mini-Project 1 assignment in Gradescope. You will also submit a report describing your agent to Canvas. Your grade will be based on a combination of your report (50%) and your agent’s performance (50%).

The Sheep and Wolves problem is identical to the Guards & Prisoners problem from the lecture, except that it makes more semantic sense why the wolves can be alone (they have no sheep to eat). Ignore for a moment the absurdity of wolves needing to outnumber sheep in order to overpower them. Maybe it’s baby wolves vs. adult rams.

As a reminder, the problem goes like this: you are a shepherd tasked with getting sheep and wolves across a river for some reason. If the wolves ever outnumber the sheep on either side of the river, the wolves will overpower and eat the sheep. You have a boat, which can only take one or two animals in it at a time, and must have at least one animal in it because you’ll get lonely (and because the problem is trivial otherwise). How do you move all the animals from one side of the river to the other?

In the original Sheep & Wolves (or Guards & Prisoners) problem, we specified there were 3 sheep and 3 wolves; here, though, your agent should be able to solve the problem for an arbitrary number of initial sheep and wolves. You may assume that the initial state of the problem will follow those rules (e.g. we won’t give you more wolves than sheep to start). However, not every initial state will be solvable; there may be combinations of sheep and wolves that cannot be solved.

You will return a list of moves that will solve the problem, or an empty list if the problem is unsolvable based on the initial set of Sheep and Wolves. You will also submit a brief report describing your approach.
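One standard way to produce such a move list is breadth-first search over (sheep on the left, wolves on the left, boat side) states; a minimal sketch (the function name and move representation are illustrative, not a required interface):

```python
from collections import deque

def solve(sheep, wolves):
    """Return a shortest list of (sheep, wolves) boat moves, or [] if unsolvable."""
    def safe(s, w):
        # Wolves may not outnumber sheep on either bank (unless no sheep there).
        return (s == 0 or s >= w) and (sheep - s == 0 or sheep - s >= wolves - w)

    start = (sheep, wolves, 1)                 # boat starts on the left bank
    goal = (0, 0, 0)
    moves = [(1, 0), (2, 0), (0, 1), (0, 2), (1, 1)]
    parent = {start: None}
    queue = deque([start])
    while queue:
        s, w, b = queue.popleft()
        if (s, w, b) == goal:                  # reconstruct the move list
            path, node = [], (s, w, b)
            while parent[node] is not None:
                node, move = parent[node]
                path.append(move)
            return path[::-1]
        sign = -1 if b == 1 else 1             # boat carries animals across
        for ds, dw in moves:
            ns, nw, nb = s + sign * ds, w + sign * dw, 1 - b
            if 0 <= ns <= sheep and 0 <= nw <= wolves and safe(ns, nw):
                if (ns, nw, nb) not in parent:
                    parent[(ns, nw, nb)] = ((s, w, b), (ds, dw))
                    queue.append((ns, nw, nb))
    return []                                  # frontier exhausted: unsolvable

print(len(solve(3, 3)))   # the classic 3-and-3 puzzle takes 11 crossings
```

Because BFS explores states level by level, the first path found is also a shortest one, and an exhausted frontier proves unsolvability.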

# Sentiment classifier

In this task, you train a sentiment classifier that detects positive and negative sentiment in (English) texts. In machine learning, the choice of learning methods and hyperparameters plays a major role, so you compare different approaches.

Prepare dataset

The “Large Movie Review Dataset” by Maas et al. consists of 50,000 movie reviews from IMDb. Download the dataset (https://ai.stanford.edu/~amaas/data/sentiment/) and unzip the archive. We are interested in the positive and negative reviews in the training and test sets. Copy the data to the directory structure expected by Scikit-learn:

    |-- test
    |   |-- neg
    |   |   |-- 0_2.txt
    |   |   |-- 10000_4.txt
    |   |   ...
    |   `-- pos
    |       |-- 0_10.txt
    |       |-- 10000_7.txt
    |       ...
    `-- train
        |-- neg
        |   |-- 0_3.txt
        |   |-- 10000_4.txt
        |   ...
        `-- pos
            |-- 0_9.txt
            |-- 10000_8.txt
            ...

Train and evaluate models

- Load the training set and the test set.
- Write a function evaluate_pipeline(pipeline, train_set, test_set) that takes a scikit-learn pipeline, a training set, and a test set. The function should train the pipeline on the training set, evaluate it on the test set with the F1 score (sklearn.metrics.f1_score), and return the score rounded to four decimal places.
- Define the following pipelines, then train and evaluate them with the evaluate_pipeline function. Output the evaluation results.

- Naive Bayes classifier on Tf-Idf values of all words.
- Naive Bayes classifier on Tf-Idf values of all words occurring in at least 2 documents.
- Naive Bayes classifier on L2-normalized frequencies of all words.
- Naive Bayes classifier on L2-normalized frequencies of all words occurring in at least 2 documents.
- Linear Support Vector classifier on Tf-Idf values of all words.
- Linear Support Vector classifier on Tf-Idf values of all words occurring in at least 2 documents.
- Linear Support Vector classifier on L2-normalized frequencies of all words.
- Linear Support Vector classifier on L2-normalized frequencies of all words occurring in at least 2 documents.
- Linear Support Vector classifier on Tf-Idf values of all words and word bigrams occurring in at least 2 documents.
- Linear Support Vector classifier on L2-normalized frequencies of all words and word bigrams occurring in at least 2 documents.
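One possible shape for evaluate_pipeline and the first of these pipelines, using standard scikit-learn components (the exact composition and parameters are yours to choose):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def evaluate_pipeline(pipeline, train_set, test_set):
    """Train on train_set, predict on test_set, return F1 rounded to 4 places."""
    X_train, y_train = train_set
    X_test, y_test = test_set
    pipeline.fit(X_train, y_train)
    return round(f1_score(y_test, pipeline.predict(X_test)), 4)

# Naive Bayes classifier on Tf-Idf values of all words.
nb_tfidf = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
```

For the remaining variants: TfidfVectorizer(min_df=2) restricts the vocabulary to words occurring in at least 2 documents, TfidfVectorizer(use_idf=False) yields L2-normalized raw frequencies, ngram_range=(1, 2) adds word bigrams, and sklearn.svm.LinearSVC replaces MultinomialNB for the Support Vector pipelines.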

# network

In this project, you will use Software Defined Networking (SDN) principles to create a configurable firewall using an OpenFlow-enabled switch. The Software Defined Networking function allows you to programmatically control the flow of traffic on the network.

This project will start with a review of Mininet (this was first used in the optional Simulating Networks project). This review will explain the basic concepts of Mininet and the functionality you may need to complete this project.

The next phase will involve examining network traffic using Wireshark. This will allow you to see the header contents that will be important in building the code necessary to implement the firewall as well as the necessary ruleset you will create to test the firewall.

After this, you will need to perform two tasks that need to be conducted in parallel:

1. You will create a configuration file ruleset that describes certain types of traffic that should be blocked or allowed between individual hosts and networks. You will define this “ruleset” using packet header parameters such as Source IP Address, Destination Port Number, IP Protocol, and Destination MAC Address (there are more parameters; these are given as examples). Your ruleset will contain instructions on whether certain traffic should be blocked or allowed. By default, all traffic will be allowed. You will need to specify “routes” that need to be blocked and any specific exceptions to the block that you want to allow.

2. You will create Python code that will take the parameters of the configuration from the first task above and create a flow policy object using the POX OpenFlow SDN framework. Please start early on this project, especially if you are unfamiliar with Python APIs.
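The ruleset file format is yours to design. As one hypothetical shape, a comma-separated file of (action, source IP, destination port, protocol) rules could be parsed like this before being translated into POX flow-table entries; the column layout here is an assumption, not a required format:

```python
import csv
import io

def load_ruleset(fileobj):
    """Parse hypothetical 'action,src_ip,dst_port,protocol' lines into rule dicts."""
    rules = []
    for row in csv.reader(fileobj):
        if not row or row[0].startswith("#"):   # skip blank lines and comments
            continue
        action, src_ip, dst_port, proto = [field.strip() for field in row]
        rules.append({
            "action": action,        # "block", or "allow" for an exception
            "src_ip": src_ip,
            "dst_port": int(dst_port),
            "protocol": proto,
        })
    return rules

sample = io.StringIO("# action,src,dport,proto\nblock,10.0.0.2,80,tcp\n")
rules = load_ruleset(sample)
print(rules)
```

Keeping the parsing separate from the POX-specific code makes the ruleset easy to test before wiring it to the switch.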

Part 0: Project References

You will find the following resources useful in completing this project. It is recommended that you review these resources before starting the project.

# processfile

Project Goals

In this project, you will be developing a simple Java application (processfile) using an agile, test-driven process involving multiple deliverables. While you will receive one grade for the entire project, each deliverable must be completed by its own due date, and all deliverables will contribute to the overall project grade.

Specification of the processfile

Utility processfile is a simple command-line utility written in Java with the following specification:

Summary

processfile allows for simple text manipulation of the content of a file.

Syntax

processfile OPTIONS FILE

Description

Program processfile performs basic text transformations on lines of text from an input FILE. Unless the -f option (see below) is specified, the program writes transformed text to stdout and errors/usage messages to stderr. The FILE parameter is required and must be the last parameter. OPTIONS may be zero or more of the following and may occur in any order:

● -f Edit file in place. The program overwrites the input file with transformed text instead of writing to stdout.

● -n Add line numbers to each line, unpadded starting from 1.

● -s Keep only the lines containing the given string.

● -r Replaces the first instance of string1 in each line with string2.

● -g Used with the -r flag ONLY; replaces all occurrences of string1 in each line with string2.

# VISUAL TEXT CORRECTION

Project Overview

This project will be built based on the paper Visual Text Correction written by Amir Mazaheri and Mubarak Shah [1]. In this project, given a one-sentence description and its corresponding video clip, our proposed framework will detect the inaccurate word in the description and replace it with a more proper one. Topics such as constructing a more realistic dataset and improving an existing NLP result will be explored and discussed in our project.

Summary of previous work

The foundation of our project is built on the paper written by Amir Mazaheri and Mubarak Shah. In that paper, a system utilizes two modules, the inaccuracy detection module and the correct word prediction module, to find the most inaccurate word in a video description and replace it with a more proper one. The inaccuracy detection module uses a convolutional n-gram network and an LSTM to extract the textual information, and uses a gating module to extract the visual information. A combination of the two produces the prediction of the most inaccurate word. For the correct word prediction module, the system uses text encoding and video encoding to predict the replacement word.

Potential solutions

Approach 1: Generate a more realistic inaccurate description:

The original false dataset is created only according to the frequency of words, by swapping words with the same part of speech in the sentence. As a result, some extremely unrealistic examples like “swimming in the kitchen” are added to the dataset. Thus, one of our approaches is to improve the false dataset by adding more metrics to its construction. We can construct the false dataset by replacing the original word with words that cohere better with the location and movement. In addition, the possible replacements can be organized in a tree structure from which we generate the false examples.

# Object-Oriented Programming

1 Preparation

A Zip archive containing the files for the assignment is provided in Minerva. When you unzip it, you should see three files of earthquake data, along with four .java files. One of these, QuakeException.java, is complete and ready for you to use; the other three files contain empty Java classes.

There are ‘To Do’ comments inside the classes that summarise the code that you need to write, and detailed step-by-step instructions are provided below.

Before you begin programming, take some time to study the sample data. You can use a spreadsheet application to do this. Each row in the dataset represents a single earthquake. The different columns represent various parameters measured for an earthquake. You don’t need to know what most of these are, although further details can be found at the USGS Earthquake Hazards Program website if you are interested. The only columns relevant to this assignment are: latitude, longitude, depth and magnitude.

Note that filenames consist of two parts: a severity level (often expressed in terms of earthquake magnitude) and a time period. The two parts are separated by the underscore character. Thus, 2.5_day.csv contains details of earthquakes of magnitude 2.5 or greater, recorded over a single day; 4.5_week.csv records earthquakes of magnitude 4.5 or greater, over a one-week period; and significant_month.csv records ‘significant’ earthquakes recorded during a one-month period.

2 Quake Class

This class encapsulates the details of a single earthquake.

1. Edit Quake.java in a text editor. Add to the class fields that represent the latitude, longitude, depth and magnitude of an earthquake.

2. Create a constructor for Quake that takes a single String parameter. This string represents all of the data provided by the USGS for a single earthquake, with each value separated from the others by commas. To see real examples of what this string looks like, open one of the data files in a text editor (not in a spreadsheet), and examine any of the lines after the first. Note: you can use a method of the String class to help you implement this. Your constructor should perform some basic validation, checking that latitude is within the allowed range of −90.0 ◦ to 90.0 ◦ , and that longitude is within the allowed range of −180.0 ◦ to 180.0 ◦ . You should throw an instance of QuakeException, containing an appropriate error message, if latitude or longitude are invalid.

3. Write simple ‘getter’ methods for each of the fields. These methods should simply return the values of the fields.

4. Write a toString method for Quake. This should generate and return a string representation of the Quake object. The string should contain magnitude, depth, latitude and longitude, in that order, formatted like this example: M5.0, 12.6 km, (35.4975°, 141.0217°) Note: the ‘degrees’ symbol can be generated with the Unicode escape sequence \u00b0. As you complete each step of implementing this class, remove the corresponding ‘To Do’ comment and check that your code compiles. Once you’ve written the constructor and getter methods, you can test them by writing a small program that creates a Quake object from a string. You can use one of the lines from the given data files as the string.