Vertebral Column Data Set

This biomedical data set was built by Dr. Henrique da Mota during a medical residency
period in Lyon, France. Each patient is represented by six biomechanical attributes
derived from the shape and orientation of the pelvis and lumbar spine (in this order):
pelvic incidence, pelvic tilt, lumbar lordosis angle, sacral slope, pelvic radius, and
grade of spondylolisthesis. The following convention is used for the class labels:
DH (Disk Hernia), SL (Spondylolisthesis), NO (Normal), and AB (Abnormal). In this
exercise, we focus only on a binary classification task with NO=0 and AB=1.
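As a minimal sketch of the encoding described above, the snippet below turns one record into six numeric attributes plus a binary class. The file layout (six whitespace-separated numbers followed by the label) and the names FEATURES, LABEL_TO_CLASS, and parse_row are assumptions made for illustration; check them against the file you actually download.

```python
# Attribute order as listed in the data set description.
FEATURES = [
    "pelvic_incidence",
    "pelvic_tilt",
    "lumbar_lordosis_angle",
    "sacral_slope",
    "pelvic_radius",
    "grade_of_spondylolisthesis",
]

# Binary convention used in this exercise: NO=0, AB=1.
LABEL_TO_CLASS = {"NO": 0, "AB": 1}


def parse_row(line):
    """Split one whitespace-separated record into (features, class).

    Assumes six numeric columns followed by a label column.
    """
    *values, label = line.split()
    if len(values) != len(FEATURES):
        raise ValueError(
            f"expected {len(FEATURES)} attributes, got {len(values)}"
        )
    return [float(v) for v in values], LABEL_TO_CLASS[label]


# Made-up record for illustration only; it is not taken from the data set.
features, cls = parse_row("50.0 10.0 40.0 40.0 120.0 5.0 AB")
```

A label outside NO/AB raises a KeyError here, which is a cheap way to notice that the three-class file was loaded by mistake.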
(a) Download the Vertebral Column Data Set from:
https://archive.ics.uci.edu/ml/datasets/Vertebral+Column
(b) Pre-processing and exploratory data analysis:
i. Make scatterplots of the independent variables in the dataset. Use color to
show Classes 0 and 1.
ii. Make boxplots for each of the independent variables. Use color to show
Classes 0 and 1 (see ISLR p. 129).
iii. Select the first 70 rows of Class 0 and the first 140 rows of Class 1 as the
training set and the rest of the data as the test set.
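The split in (b)iii can be sketched as follows. This is one possible reading, assuming the features are already in a NumPy array X and the 0/1 classes in y; the function name split_first_rows and the synthetic stand-in data (with class sizes chosen only for illustration) are not part of the assignment.

```python
import numpy as np


def split_first_rows(X, y, n_class0=70, n_class1=140):
    """Put the first n_class0 rows of Class 0 and the first n_class1 rows
    of Class 1 into the training set; all remaining rows form the test set.
    The original row order is preserved in both parts."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=int)
    idx0 = np.flatnonzero(y == 0)  # row positions of Class 0
    idx1 = np.flatnonzero(y == 1)  # row positions of Class 1
    train_idx = np.sort(np.concatenate([idx0[:n_class0], idx1[:n_class1]]))
    test_mask = np.ones(len(y), dtype=bool)
    test_mask[train_idx] = False
    return X[train_idx], y[train_idx], X[test_mask], y[test_mask]


# Synthetic stand-in data; replace with the real features and classes.
rng = np.random.default_rng(0)
y_demo = np.array([1] * 210 + [0] * 100)
X_demo = rng.normal(size=(310, 6))
X_train, y_train, X_test, y_test = split_first_rows(X_demo, y_demo)
```

Keeping the original row order in the training set matters later, because the learning-curve step again asks for the "first" rows of each class within that training set.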
(c) Classification using KNN on Vertebral Column Data Set
i. Write code for k-nearest neighbors with Euclidean metric (or use a software
package).
ii. Test all the data in the test set with k nearest neighbors. Take decisions
by majority polling. Plot train and test errors in terms of k for
k ∈ {208, 205, ..., 7, 4, 1} (in reverse order). You are welcome to use smaller
increments of k. Which k, denoted k*, is the most suitable among those values?
Calculate the confusion matrix, true positive rate, true negative rate, precision,
and F1-score when k = k*.
iii. Since the computation time depends on the size of the training set, one may
use only a subset of the training set. Plot the best test error rate, which
is obtained by some value of k, against the size of the training set, when the
size of the training set is N ∈ {10, 20, 30, ..., 210}. Note: for each N, select
your training set by choosing the first ⌊N/3⌋ rows of Class 0 and the first
N − ⌊N/3⌋ rows of Class 1 in the training set you created in 1(b)iii. Also, for
each N, select the optimal k from a set starting from k = 1, increasing by 5.
For example, if N = 200, the optimal k is selected from {1, 6, 11, ..., 196}.
This plot is called a Learning Curve.
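The steps in (c) can be sketched with plain NumPy as below. This is a rough illustration under stated assumptions, not a reference solution: the helper names (knn_predict, error_rate, binary_metrics, training_subset) are made up here, a tied vote is sent to Class 0, ties between equally good k go to the first k tried, and the data are two synthetic, well-separated Gaussian blobs standing in for the real training and test sets. Plotting is left out.

```python
import numpy as np


def knn_predict(X_train, y_train, X_query, k):
    """Majority vote among the k nearest training rows (Euclidean metric)."""
    diff = X_query[:, None, :] - X_train[None, :, :]
    dist2 = np.sum(diff * diff, axis=2)  # squared distances, same ordering
    nearest = np.argsort(dist2, axis=1, kind="stable")[:, :k]
    votes = y_train[nearest].sum(axis=1)  # number of Class 1 neighbors
    # Assumption: a tied vote (possible for even k) goes to Class 0.
    return (2 * votes > k).astype(int)


def error_rate(y_true, y_pred):
    return float(np.mean(y_true != y_pred))


def binary_metrics(y_true, y_pred):
    """Confusion matrix [[TN, FP], [FN, TP]] and derived rates."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))

    def ratio(a, b):
        return a / b if b else float("nan")

    return {
        "confusion": np.array([[tn, fp], [fn, tp]]),
        "tpr": ratio(tp, tp + fn),
        "tnr": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


def training_subset(X_train, y_train, n):
    """First floor(n/3) Class 0 rows and first n - floor(n/3) Class 1 rows."""
    idx0 = np.flatnonzero(y_train == 0)[: n // 3]
    idx1 = np.flatnonzero(y_train == 1)[: n - n // 3]
    idx = np.concatenate([idx0, idx1])
    return X_train[idx], y_train[idx]


# Synthetic stand-in data: two well-separated blobs in six dimensions.
rng = np.random.default_rng(0)
X0 = rng.normal(0.0, 1.0, size=(100, 6))
X1 = rng.normal(10.0, 1.0, size=(210, 6))
X_tr = np.vstack([X0[:70], X1[:140]])
y_tr = np.array([0] * 70 + [1] * 140)
X_te = np.vstack([X0[70:], X1[140:]])
y_te = np.array([0] * 30 + [1] * 70)

# (c)ii: train and test error for k = 208, 205, ..., 4, 1.
ks = list(range(208, 0, -3))
errors = {
    k: (
        error_rate(y_tr, knn_predict(X_tr, y_tr, X_tr, k)),
        error_rate(y_te, knn_predict(X_tr, y_tr, X_te, k)),
    )
    for k in ks
}
k_star = min(ks, key=lambda k: errors[k][1])  # first k with lowest test error
report = binary_metrics(y_te, knn_predict(X_tr, y_tr, X_te, k_star))

# (c)iii: best test error for each training-set size N.
learning_curve = {}
for n in range(10, 211, 10):
    X_sub, y_sub = training_subset(X_tr, y_tr, n)
    learning_curve[n] = min(
        error_rate(y_te, knn_predict(X_sub, y_sub, X_te, k))
        for k in range(1, n, 5)
    )
```

The dictionaries errors and learning_curve hold the values to plot against k and N respectively. Note that the training error at k = 1 is zero whenever no two training rows coincide, since each row is its own nearest neighbor, so test error is the quantity to use when choosing k*.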
Let us further explore some variants of KNN.
