The term project for this class, due the last week of the semester, consists of a presentation demonstrating the use of the Natural Language Processing techniques you’ve learned in the course. You will choose a data set and use the tools of your choice to analyze it. In your presentation, you will describe the data and present the results of your analysis in the categories shown below. You may use screenshots, diagrams, tables, and text in your slides. You will also need to record audio of yourself presenting each slide of your report. This project accounts for approximately one-third of your grade in this class, so be sure to do your best work.
You may choose any data set you would like to work on, as long as it contains at least 1000 distinct unstructured texts. This can be a collection of Twitter data, blog posts, e-mails, news reports, or similar data. The following links contain
Assignment 4
Due: November 6, 2015, 11:55 pm
1. Write a program freq.cc which reads in a list of words and produces two lists of output.
• The first list is the list of distinct words in the file as well as the number of times each word occurs in the input. The words should first be converted to lower case (write a helper function to convert a character to its lower case equivalent and use transform in STL). This list should be sorted in “dictionary order” based on the words. If the list of words is:
abcd Computer science computer games
The output should look like (the exact format is up to you):
Word             Frequency
---------------  ---------
abcd             1
computer         2
games            1
science          1
• The second list is the list of distinct words sorted in decreasing frequency. Words with the same frequency should be listed in “dictionary order.” For the list above, the output should look like:
Frequency  Word
---------  ---------------
2          computer
1          abcd
1          games
1          science
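One possible shape for the two lists is sketched below. It is only a sketch under stated assumptions, not a reference solution: the function names (to_lower_char, count_words, list_by_word, list_by_frequency) and the column widths are made up for illustration, and “dictionary order” is taken to mean the ordinary std::string comparison on the lower-cased words. A std::map holds the counts because it keeps its keys sorted, and std::stable_sort is used for the second list so that words with equal frequency stay in that sorted order.

```cpp
#include <algorithm>
#include <cctype>
#include <iomanip>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Helper (hypothetical name): lower-case equivalent of one character.
// The cast to unsigned char avoids undefined behavior in std::tolower
// when char is signed and the value is negative.
char to_lower_char(char c) {
    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// Read whitespace-separated words, lower-case each one with std::transform,
// and count occurrences. std::map iterates its keys in sorted order.
std::map<std::string, int> count_words(std::istream& in) {
    std::map<std::string, int> counts;
    std::string word;
    while (in >> word) {
        std::transform(word.begin(), word.end(), word.begin(), to_lower_char);
        ++counts[word];
    }
    return counts;
}

// First list: words in sorted order with their frequencies.
std::string list_by_word(const std::map<std::string, int>& counts) {
    std::ostringstream out;
    out << std::left << std::setw(16) << "Word" << "Frequency\n";
    for (const auto& entry : counts)
        out << std::left << std::setw(16) << entry.first << entry.second << '\n';
    return out.str();
}

// Second list: decreasing frequency, ties kept in sorted word order.
std::string list_by_frequency(const std::map<std::string, int>& counts) {
    typedef std::pair<std::string, int> Row;
    std::vector<Row> rows(counts.begin(), counts.end());
    // stable_sort preserves the map's word order among equal frequencies.
    std::stable_sort(rows.begin(), rows.end(),
                     [](const Row& a, const Row& b) { return a.second > b.second; });
    std::ostringstream out;
    out << std::left << std::setw(12) << "Frequency" << "Word\n";
    for (const auto& row : rows)
        out << std::left << std::setw(12) << row.second << row.first << '\n';
    return out.str();
}
```

To turn this into a complete program, a main function could call count_words(std::cin) once and then write list_by_word and list_by_frequency of the result to std::cout. Since the exact output format is left open, the widths passed to std::setw can be changed freely.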
You may assume that the words are separated by white