You may use any programming language you like (Python, C++, C, Java…). All programming must be
done individually from first principles. You are only permitted to use existing tools for simple linear algebra
such as matrix multiplication/inversion. Do NOT use any toolkit that performs machine learning
functions and do NOT collaborate with your classmates. Cite any resources that were used.
In this project you will practice the basics of Machine Learning Classification by creating a K-NN classifier
for two datasets. You will also practice how to describe, evaluate, and write up a report on
the classifier performance.
It is expected that your project report may require 2 pages per dataset if you keep your figures
interesting and not too large, or 3 pages if your figures are big.
Datasets: The project will explore two datasets: the famous MNIST dataset of very small pictures of
handwritten numbers, and a dataset that explores the prevalence of diabetes in a Native American tribe
named the Pima. You can access the datasets here:
1. https://www.kaggle.com/uciml/pima-indians-diabetes-database
2. https://www.kaggle.com/c/digit-recognizer/data
Programming Task: For each dataset, you must create a K-NN classifier from the training data,
then evaluate and report on the classifier performance.
(30 points) Dataset details: Describe the data and provide some simple visualizations (for images, a few
examples from each category; for other data, perhaps some scatter plots or histograms that show a big picture of
the data). Describe your training/test split for K-NN and justify your choices.
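As an illustration of the training/test split, here is a minimal sketch in Python using only NumPy. The function name `train_test_split`, the 20% hold-out fraction, and the fixed seed are assumptions made for this example, not requirements of the assignment; whatever split you choose still needs to be justified in your report.

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Shuffle the row indices once, then hold out the first n_test of them."""
    rng = np.random.default_rng(seed)          # fixed seed so the split is reproducible
    order = rng.permutation(len(X))            # random ordering of row indices
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```

A fixed seed makes the reported numbers repeatable. Every row lands in exactly one of the two sets.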
(15 points) Algorithm Description: K-NN is a very clear algorithm, so here describe any data
preprocessing, feature scaling, distance metrics, or other choices that you made.
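To make the items above concrete, the following is one possible sketch of feature scaling followed by a brute-force K-NN vote, written with NumPy only. The names `standardize` and `knn_predict`, the choice of z-score scaling, the Euclidean metric, and the tie-breaking rule are all assumptions for illustration; other choices are equally valid if you describe them.

```python
import numpy as np

def standardize(X_train, X_test):
    """Z-score both sets using statistics from the training set only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # constant columns: avoid division by zero
    return (X_train - mean) / std, (X_test - mean) / std

def knn_predict(X_train, y_train, X_test, k=3):
    """Label each test row by majority vote among its k nearest training rows."""
    preds = np.empty(len(X_test), dtype=y_train.dtype)
    for i, x in enumerate(X_test):
        dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))   # Euclidean distance to every training row
        nearest = np.argsort(dists, kind="stable")[:k]      # indices of the k closest rows
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds[i] = labels[np.argmax(counts)]                # vote ties go to the smallest label
    return preds
```

Scaling with training-set statistics only keeps information about the test set out of the classifier.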
(45 points) Algorithm Results: Show the accuracy of your algorithm. In the case of the Pima Dataset,
show accuracy with tables showing false positives, false negatives, true positives, and true negatives. For the
Pima Dataset, use three different distance metrics and compare the results.
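As a sketch of what this could look like, the example below defines three common distance metrics and a helper that counts the four outcomes for a binary problem. The particular metrics (Euclidean, Manhattan, Chebyshev), the function names, and the convention that label 1 is the positive class are assumptions for this example; you are free to pick other metrics.

```python
import numpy as np

def euclidean(A, x):
    """L2 distance from every row of A to the point x."""
    return np.sqrt(((A - x) ** 2).sum(axis=1))

def manhattan(A, x):
    """L1 distance: sum of absolute coordinate differences."""
    return np.abs(A - x).sum(axis=1)

def chebyshev(A, x):
    """L-infinity distance: largest absolute coordinate difference."""
    return np.abs(A - x).max(axis=1)

def binary_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives (assumes `positive` marks the positive class)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return {
        "TP": int(np.sum((y_pred == positive) & (y_true == positive))),
        "FP": int(np.sum((y_pred == positive) & (y_true != positive))),
        "FN": int(np.sum((y_pred != positive) & (y_true == positive))),
        "TN": int(np.sum((y_pred != positive) & (y_true != positive))),
    }
```

Accuracy is then (TP + TN) divided by the total count, and one such table per metric makes the comparison straightforward.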
In the case of the MNIST digits, show the complete confusion matrix. Choose a single digit to measure
accuracy and show how that number varies as a function of K.
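A minimal NumPy sketch of the confusion matrix and a per-digit accuracy follows. The function names are hypothetical, and "accuracy for a single digit" is read here as one-vs-rest accuracy, (TP + TN) / total; that reading is an assumption, so state whichever definition you use.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=10):
    """Rows are true labels, columns are predicted labels."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)  # add 1 per (true, pred) pair
    return cm

def digit_accuracy(cm, digit):
    """One-vs-rest accuracy for `digit`, computed from the confusion matrix."""
    tp = cm[digit, digit]
    fp = cm[:, digit].sum() - tp   # other digits predicted as `digit`
    fn = cm[digit, :].sum() - tp   # `digit` predicted as something else
    tn = cm.sum() - tp - fp - fn
    return (tp + tn) / cm.sum()
```

Recomputing the matrix for each value of K and recording `digit_accuracy` for the chosen digit gives the accuracy-versus-K curve.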
(10 points) Runtime: Describe the run-time of your algorithm and also share the actual “wall-clock”
time that it took to compute your results.
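For the wall-clock measurement, one option in Python is `time.perf_counter`. The sketch below times a brute-force distance computation, which costs on the order of (test rows × training rows × features) operations. The helper names `timed` and `pairwise_sq_dists` are assumptions for this example, and the matrix-product formulation is just one way to compute the distances.

```python
import time
import numpy as np

def pairwise_sq_dists(X_test, X_train):
    """Squared Euclidean distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2."""
    a2 = (X_test ** 2).sum(axis=1)[:, None]    # column of squared norms of test rows
    b2 = (X_train ** 2).sum(axis=1)[None, :]   # row of squared norms of training rows
    d2 = a2 - 2.0 * (X_test @ X_train.T) + b2
    return np.maximum(d2, 0.0)                 # clamp tiny negatives caused by round-off

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

Reporting the machine used alongside the measured seconds makes the timing easier to interpret.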