Machine Learning

You may use any programming language you like (Python, C++, C, Java…). All programming must be
done individually from first principles. You are only permitted to use existing tools for simple linear algebra
such as matrix multiplication/inversion. Do NOT use any toolkit that performs machine learning
functions and do NOT collaborate with your classmates. Cite any resources that were used.
In this project you will practice the basics of Machine Learning Classification by creating a K-NN classifier
for two datasets. You will also practice how to describe, evaluate, and write up a report on
the classifier performance.
It is expected that your project report may require 2 pages per dataset if you are good about making
interesting figures and making them not too large, or 3 pages if your figures are big.
Datasets: The project will explore two datasets: the famous MNIST dataset of very small pictures of
handwritten numbers, and a dataset that explores the prevalence of diabetes in a Native American tribe
named the Pima. You can access the datasets here:
1. https://www.kaggle.com/uciml/pima-indians-diabetes-database
2. https://www.kaggle.com/c/digit-recognizer/data
Programming Task: For each dataset, you must create a K-NN classifier that is built from the training
data, and evaluate and report on the classifier performance.
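As a point of reference, the core of K-NN can be sketched in a few lines using only the standard library. This is a minimal sketch, not a required design: the function names (`euclidean`, `knn_predict`), the brute-force search, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3, distance=euclidean):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Brute force: rank every training index by its distance to the query.
    order = sorted(range(len(train_X)), key=lambda i: distance(train_X[i], query))
    # Count the labels of the k closest points, nearest first.
    votes = Counter(train_y[i] for i in order[:k])
    # On a tied vote, most_common keeps first-seen order, so the nearer label wins.
    return votes.most_common(1)[0][0]


# Toy data, made up for illustration: two well-separated clusters.
train_X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_X, train_y, [0.5, 0.5], k=3))
print(knn_predict(train_X, train_y, [5.5, 5.5], k=3))
```

Passing the distance function as a parameter makes it easy to swap metrics later without touching the voting logic.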
(30 points) Dataset details: Describe the data and include some simple visualizations (for images, a few
examples from each category; for other data, perhaps some scatter plots or histograms that show a big picture of
the data). Describe your training/test split for K-NN and justify your choices.
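One common way to produce a training/test split is a seeded random shuffle followed by a cut. The sketch below assumes a 20% test fraction and a helper name (`train_test_split`) chosen purely for illustration; the assignment leaves the actual split and its justification to you.

```python
import random


def train_test_split(rows, labels, test_fraction=0.2, seed=0):
    """Shuffle indices reproducibly, then hold out `test_fraction` of them for testing."""
    indices = list(range(len(rows)))
    # A private Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_X = [rows[i] for i in train_idx]
    train_y = [labels[i] for i in train_idx]
    test_X = [rows[i] for i in test_idx]
    test_y = [labels[i] for i in test_idx]
    return train_X, train_y, test_X, test_y


# Toy usage with made-up rows: 10 samples give 8 for training and 2 for testing.
rows = [[float(i)] for i in range(10)]
labels = [i % 2 for i in range(10)]
train_X, train_y, test_X, test_y = train_test_split(rows, labels)
print(len(train_X), len(test_X))
```

Fixing the seed means the same split can be regenerated when the report's numbers need to be reproduced.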
(15 points) Algorithm Description: K-NN is a very clear algorithm, so here describe any data
preprocessing, feature scaling, distance metrics, or other choices that you made.
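To make these terms concrete, here is a hedged sketch of one possible scaling step and three possible distance metrics. Min-max scaling and the Euclidean, Manhattan, and Chebyshev distances are example choices, not ones the assignment prescribes, and all function names are hypothetical.

```python
import math


def fit_min_max(train_X):
    """Learn per-feature (min, max) from the training rows only."""
    return [(min(col), max(col)) for col in zip(*train_X)]


def apply_min_max(X, ranges):
    """Map each feature to [0, 1] relative to the training range.

    Constant features map to 0.0; test values outside the training
    range will fall outside [0, 1].
    """
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, (lo, hi) in zip(row, ranges)]
        for row in X
    ]


def euclidean(a, b):
    """L2 distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """L1 distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    """L-infinity distance: largest absolute difference over all features."""
    return max(abs(x - y) for x, y in zip(a, b))


# Toy usage with made-up numbers whose features have very different ranges.
train_X = [[0.0, 10.0], [10.0, 30.0]]
ranges = fit_min_max(train_X)
print(apply_min_max([[5.0, 20.0]], ranges))
```

Scaling matters for K-NN because a feature with a large numeric range would otherwise dominate every distance. Fitting the ranges on the training rows alone avoids leaking test information into the model.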
(45 points) Algorithm Results: Show the accuracy of your algorithm. In the case of the Pima Dataset,
show accuracy with tables showing false positives, false negatives, true positives, and true negatives. For the
Pima Dataset, use three different distance metrics and compare the results.
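The four table entries can be tallied directly from the true and predicted labels. The sketch below assumes that label 1 marks the positive (diabetic) class; the function names and the toy labels are invented for illustration.

```python
def binary_confusion(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a two-class problem."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            # Predicted positive: correct if the truth is also positive.
            counts["TP" if truth == positive else "FP"] += 1
        else:
            # Predicted negative: a miss if the truth was positive.
            counts["FN" if truth == positive else "TN"] += 1
    return counts


def accuracy(counts):
    """Fraction of all predictions that were correct."""
    total = sum(counts.values())
    return (counts["TP"] + counts["TN"]) / total


# Toy labels, made up for illustration.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
counts = binary_confusion(y_true, y_pred)
print(counts, accuracy(counts))
```

Running the same tally once per distance metric gives one table per metric, which makes the comparison straightforward.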
In the case of the MNIST digits, show the complete confusion matrix. Choose a single digit to measure
accuracy and show how that accuracy varies as a function of K.
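A multi-class confusion matrix is a table whose entry in row r, column c counts the samples of true class r that were predicted as class c. The sketch below assumes integer labels 0 to 9 and reads "accuracy for a single digit" as the fraction of that digit's samples classified correctly. That is one possible interpretation, and the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, n_classes=10):
    """Build an n_classes x n_classes table: rows are true labels, columns are predictions."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for truth, pred in zip(y_true, y_pred):
        matrix[truth][pred] += 1
    return matrix


def digit_accuracy(matrix, digit):
    """Fraction of samples whose true label is `digit` that were predicted as `digit`."""
    row_total = sum(matrix[digit])
    # Guard against a digit that never appears in the test labels.
    return matrix[digit][digit] / row_total if row_total else 0.0


# Toy three-class example with made-up labels.
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 2, 2, 2, 1]
m = confusion_matrix(y_true, y_pred, n_classes=3)
for row in m:
    print(row)
print(digit_accuracy(m, 1))
```

Recomputing `digit_accuracy` from the predictions obtained at each value of K yields the points for the accuracy-versus-K plot.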
(10 points) Runtime: Describe the run-time of your algorithm and also share the actual “wall-clock”
time that it took to compute your results.
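For the wall-clock part, a monotonic high-resolution timer such as Python's `time.perf_counter` is one reasonable tool. The wrapper below is a minimal sketch; the name `timed` and the stand-in workload are assumptions for illustration. For the analytical part, note that a brute-force K-NN compares each query against every training point, so classifying one query costs time proportional to the number of training points times the number of features, plus the cost of selecting the K nearest.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload, made up for illustration; replace with the real evaluation call.
result, seconds = timed(sum, range(1_000_000))
print(f"result={result}, wall-clock={seconds:.4f} s")
```

Reporting the hardware and dataset sizes alongside the measured time makes the number meaningful to a reader.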
