Vertebral Column Data Set

This biomedical data set was built by Dr. Henrique da Mota during a medical residency
period in Lyon, France. Each patient is represented by six biomechanical attributes
derived from the shape and orientation of the pelvis and lumbar spine (in this order):
pelvic incidence, pelvic tilt, lumbar lordosis angle, sacral slope, pelvic radius, and
grade of spondylolisthesis. The following convention is used for the class labels:
Disk Hernia (DH), Spondylolisthesis (SL), Normal (NO), and Abnormal (AB). In this
exercise, we focus only on a binary classification task with NO=0 and AB=1.

(a) Download the Vertebral Column Data Set from:
https://archive.ics.uci.edu/ml/datasets/Vertebral+Column
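
One possible way to read the downloaded data is sketched below. It assumes a plain-text file with one patient per line: six whitespace-separated numbers followed by the label NO or AB. That layout, the function name load_vertebral, and the two sample rows are assumptions made for illustration and are not taken from the data set.

```python
import io

# Label convention from the exercise: NO=0, AB=1.
LABELS = {"NO": 0, "AB": 1}

def load_vertebral(lines):
    """Parse rows of six floats plus a label into (X, y) lists.

    Assumes whitespace-separated columns; blank or malformed rows are skipped.
    """
    X, y = [], []
    for line in lines:
        parts = line.split()
        if len(parts) != 7 or parts[6] not in LABELS:
            continue
        X.append([float(v) for v in parts[:6]])
        y.append(LABELS[parts[6]])
    return X, y

# Made-up rows in the assumed format (not real patient data).
sample = io.StringIO(
    "1.0 2.0 3.0 4.0 5.0 6.0 AB\n"
    "\n"
    "6.0 5.0 4.0 3.0 2.0 1.0 NO\n"
)
X, y = load_vertebral(sample)
```

For a real file, an open file handle can be passed in place of the StringIO object.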

(b) Pre-Processing and Exploratory data analysis:

i. Make scatterplots of the independent variables in the dataset. Use color to
show Classes 0 and 1.

ii. Make boxplots for each of the independent variables. Use color to show
Classes 0 and 1 (see ISLR p. 129).

iii. Select the first 70 rows of Class 0 and the first 140 rows of Class 1 as the
training set and the rest of the data as the test set.
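
The split described above can be sketched as follows. The function name split_first_rows and the representation of each example as a (row, label) pair are illustrative choices, not part of the assignment.

```python
def split_first_rows(X, y, n_class0=70, n_class1=140):
    """Put the first n_class0 rows of Class 0 and the first n_class1 rows
    of Class 1 (in file order) into the training set; the rest is test."""
    quota = {0: n_class0, 1: n_class1}
    taken = {0: 0, 1: 0}
    train, test = [], []
    for row, label in zip(X, y):
        if taken[label] < quota[label]:
            taken[label] += 1
            train.append((row, label))
        else:
            test.append((row, label))
    return train, test

# Toy data: three rows per class, with quotas of 1 (Class 0) and 2 (Class 1).
toy_X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
toy_y = [1, 1, 1, 0, 0, 0]
toy_train, toy_test = split_first_rows(toy_X, toy_y, n_class0=1, n_class1=2)
```

With the default quotas the training set has 210 rows, which matches the largest training-set size used later in the exercise.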

(c) Classification with k-nearest neighbors:

i. Write code for k-nearest neighbors with Euclidean metric (or use a software
package).
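
A minimal sketch of k-nearest neighbors with the Euclidean metric, using only the standard library. The function names are illustrative, and the tie-breaking rule (the smaller label wins when the vote is tied) is an assumption, since the exercise does not specify one.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def knn_predict(train_X, train_y, query, k):
    """Predict the label of `query` by majority vote among its k nearest
    training points. Vote ties go to the smaller label (an assumption)."""
    order = sorted(range(len(train_X)),
                   key=lambda i: euclidean(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    top = max(votes.values())
    return min(label for label, count in votes.items() if count == top)

# Toy example: two clusters in the plane.
toy_X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
toy_y = [0, 0, 1, 1]
near_origin = knn_predict(toy_X, toy_y, [0.2, 0.1], k=3)
near_cluster = knn_predict(toy_X, toy_y, [5.5, 5.0], k=1)
```

Sorting all distances costs O(n log n) per query, which is adequate for a few hundred rows; a library implementation would be the natural choice for larger data.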

ii. Test all the data in the test set with k-nearest neighbors. Take decisions
by majority vote. Plot train and test errors in terms of k for
k ∈ {208, 205, ..., 7, 4, 1} (in reverse order). You are welcome to use smaller
increments of k. Which k is the most suitable k among those values? Denote it
k*. Calculate the confusion matrix, true positive rate, true negative rate,
precision, and F-score when k = k*.
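
The grid of k values and the requested metrics can be sketched as below. The sketch treats AB=1 as the positive class and reads "F-score" as the F1 score (the harmonic mean of precision and true positive rate), which is an assumption. The helper names and the toy label lists are illustrative only.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def safe_div(num, den):
    """Division that returns 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def summary_metrics(tp, fp, tn, fn):
    tpr = safe_div(tp, tp + fn)        # true positive rate (recall)
    tnr = safe_div(tn, tn + fp)        # true negative rate
    precision = safe_div(tp, tp + fp)
    f1 = safe_div(2 * precision * tpr, precision + tpr)
    return {"tpr": tpr, "tnr": tnr, "precision": precision, "f1": f1}

# k = 208, 205, ..., 4, 1 as listed in the exercise.
k_grid = list(range(208, 0, -3))

# Toy labels, not results from the data set.
counts = confusion_counts([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
metrics = summary_metrics(*counts)
```

The error rate for a given k is (fp + fn) divided by the number of examples, so the same counts can feed the train and test error curves.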

iii. Since the computation time depends on the size of the training set, one may
use only a subset of the training set. Plot the best test error rate, which
is obtained by some value of k, against the size of the training set, when the
size of the training set is N ∈ {10, 20, 30, ..., 210}. Note: for each N, select
your training set by choosing the first ⌊N/3⌋ rows of Class 0 and the first
N − ⌊N/3⌋ rows of Class 1 in the training set you created in 1(b)iii. Also, for
each N, select the optimal k from a set starting from k = 1, increasing by 5.
For example, if N = 200, the optimal k is selected from {1, 6, 11, ..., 196}.
This plot is called a Learning Curve.
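
The subset sizes and candidate k values for the learning curve can be sketched as follows, taking N/3 rounded down to an integer. The function names and the (row, label) pair representation are illustrative assumptions.

```python
def subset_sizes(n):
    """Rows to take from Class 0 and Class 1 for a training set of size n."""
    n0 = n // 3          # N/3 rounded down
    return n0, n - n0

def first_n_subset(train, n):
    """Keep the first n0 Class 0 pairs and the first n1 Class 1 pairs of
    `train`, a list of (row, label) pairs in their original order."""
    n0, n1 = subset_sizes(n)
    class0 = [pair for pair in train if pair[1] == 0][:n0]
    class1 = [pair for pair in train if pair[1] == 1][:n1]
    return class0 + class1

def candidate_ks(n):
    """k = 1, 6, 11, ... up to at most n."""
    return list(range(1, n + 1, 5))

# Training-set sizes N = 10, 20, ..., 210.
sizes = list(range(10, 211, 10))

# Toy training set: four rows per class, reduced to N = 6.
toy_train = [([float(i)], i % 2) for i in range(8)]
toy_subset = first_n_subset(toy_train, 6)
```

For each N in sizes, the learning curve would plot the lowest test error over candidate_ks(N), with the classifier trained on first_n_subset(train, N).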

Let us further explore some variants of KNN.
