• 1 Introduction
  • 2 Set up Environment
  • 3 Initial Exploration / Visualization
  • 4 Data Overview
  • 5 K-Nearest Neighbor
    • 5.1 Train Model
    • 5.2 Confusion Matrix
    • 5.3 Computational Complexity Analysis
  • 6 KD Tree kNN Algorithm
    • 6.1 Train Model
    • 6.2 Confusion Matrix
  • 7 Evaluate and Compare Models
  • 8 KD Tree kNN Model Optimization
    • 8.1 Principal Component Analysis (PCA)
      • 8.1.1 Rationale
      • 8.1.2 Analysis
      • 8.1.3 Principal Component Results
      • 8.1.4 Scree Plot
      • 8.1.5 Re-train Model and Confusion Matrix
    • 8.2 K-fold Cross Validation
  • 9 Final Model
  • 10 Conclusion

1 Introduction

This is an Exploratory Data Analysis of the classic MNIST dataset. The data is available to download at Kaggle. The goal is to correctly classify digits (0-9) from a database of tens of thousands of gray-scale, handwritten images.

In this notebook, I will be practicing several data science algorithmic and visualization techniques.

2 Set up Environment

Load packages and import data files

# Load Packages
# vis
library(knitr)    # dynamic report generation
library(ggplot2)  # visualization
library(scales)   # scale functions for visualization
library(gridExtra) # multi-panel plots

# wrangle
library(EBImage)  # image manipulation/processing
library(reshape2) # data frame manipulation
library(kableExtra) # HTML table manipulation
library(magrittr) # pipe commands

# model
library(class)    # kNN classification
suppressMessages(library(nabor))    # kd tree implementation of kNN
suppressMessages(library(caret))    # confusionMatrix 

# Import MNIST training & test datasets
train <- read.csv("../input/train.csv")
train$label <- as.factor(train$label)
test <- read.csv("../input/test.csv")

3 Initial Exploration / Visualization

Referring to train.csv, the documentation states that each image is 28 pixels in height and 28 pixels in width. Furthermore, each pixel has an integer pixel-value associated with it (0-255), indicating the lightness or darkness of that pixel. Higher numbers are shaded darker.

With this information, we can quickly plot the pixel gray-scale intensity values to obtain an image of the digit.

# Create 28x28 matrix containing the pixel gray-scale intensity values
m <- matrix(unlist(train[12,-1]), nrow = 28, byrow = T)
# Plot the image
image(m,col=grey.colors(255), main = paste0("label = ",train[12,1]))
Figure 1

4 Data Overview

The images need to be rotated along the x-axis to be readable. The EBImage package is great for image processing and spatial transformations. Let’s visualize a subset of the data.

# set up plot
par(mai=rep(0,4)) # no margins

# layout the plots into a matrix w/ 12 columns, by row
layout(matrix(1:144, ncol=12, byrow=T))

for (i in 1:144) {
  m <- rotate(matrix(unlist(train[i,-1]), 28,28, byrow = T), 270)
  image(m,col=grey.colors(255), xaxt='n', yaxt='n')
}
Figure 2

What do these images tell us?

  • Image Size: All of the images in the dataset are the same 28x28 size. This is great. We want all images to be normalized to the same size.
  • Image Resolution: The digits are low resolution, gray-scale images.
  • No two hand-drawn digits are identical: As in natural handwriting, no two digits are perfectly identical, even when drawn by the same individual. Images of the same digit can be superimposed on one another and they still won’t perfectly align.
  • Digit variation: The images appear to be drawn by different individuals.
  • Digit typography: There are differences in how digits are stylized. For example, the number 7 can be drawn with or without a horizontal line through it. The number 4 can be drawn with a “tophat”. The number 2 can be drawn with a loop towards the bottom. The number 9 can be drawn with a straight vertical bar or with a curve. This further adds to the variation.

This is a lot to consider. Fortunately, we have over 40,000 labeled digits to train our supervised model.

Imagine each digit’s 28x28 matrix of pixels as a set of features. The classification model that immediately comes to mind (and certainly the simplest and most intuitive) is K-Nearest Neighbor (kNN).

That is, this set of features can be mapped onto a feature space and subsequently used to classify new digits based on their proximity to the k nearest training examples in that space.

The strategy for setting up the data for modeling is as follows:

  • Convert the 28x28 matrix of pixel-values into a 784-element vector. Each training instance will consist of a 784-element vector and an associated class label.
  • We do not need to transform the data as we did before for visualization. The kNN algorithm interprets the digits as a set of feature vectors.
  • Classification is done by calculating the Euclidean distance of the testing instance to every training instance.
  • A label is assigned to the testing instance based on the most common class among its k nearest examples in the training set (a hand-rolled sketch of this follows the list).
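To make the last two steps concrete, the Euclidean distance between a test vector \(x\) and a training vector \(y\) is \(d(x, y) = \sqrt{\sum_{j=1}^{784}(x_j - y_j)^2}\). Below is a minimal, hand-rolled sketch of the procedure on a small subset of the training data (purely illustrative and far too slow for the full dataset; the object names here are made up for the example, and the class and nabor packages used later do this efficiently):

# Hand-rolled kNN for a single "test" digit -- a sketch, not the real model.
# Assumes `train` has been loaded as in Section 2 (label in column 1).
train_px  <- as.matrix(train[1:1000, -1])   # 1,000 training digits, 784 pixels each
train_lbl <- train[1:1000, 1]
test_px   <- unlist(train[1001, -1])        # treat row 1001 as an unseen digit

# Euclidean distance from the test vector to every training vector
dists <- sqrt(colSums((t(train_px) - test_px)^2))

# k = 5: majority vote among the five closest training digits
k <- 5
nearest <- order(dists)[1:k]
names(which.max(table(train_lbl[nearest])))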

5 K-Nearest Neighbor

5.1 Train Model

Split the training data into a smaller training set and validation set. The validation set will be used to optimize the kNN model.

set.seed(42)
# we will split the data using a 75/25, training/validation split
sample_rows <- sample(nrow(train), nrow(train)*0.75)

trainingSet <- train[sample_rows,]
validationSet <- train[-sample_rows,]

# function to track time in seconds
current_time<-function(display_time_elapse=FALSE){
  if (display_time_elapse){
    return((Sys.time()))
  } else {
   return(as.numeric(Sys.time()))
  }
}

# train kNN model and record run time
start <- current_time(TRUE)
kNN_predictions <- class::knn(train = trainingSet[,-1], test = validationSet[,-1], cl = trainingSet[,1], k = 1)
class_kNN_time <- current_time(TRUE) - start
class_kNN_time
## Time difference of 36.08147 mins

The kNN algorithm from the class package took a little over half an hour to train the model. Let’s look at the accuracy of the classifier.

5.2 Confusion Matrix

# construct confusion matrix
cfm <- confusionMatrix(kNN_predictions, validationSet$label)

# confusion matrix visualization
ggplotConfusionMatrix <- function(m,model){
  mytitle <- paste(model, "Confusion Matrix:", "Accuracy", percent_format()(m$overall[1]),
                   "Kappa", percent_format()(m$overall[2]))
  p <-
    ggplot(data = as.data.frame(m$table),
      aes(x = Prediction, y = Reference)) +
      ylab("Actual Label") + 
      xlab("Predicted Label") +
      geom_tile(aes(fill = log(Freq)), color = "white") +
      scale_fill_gradient(low = "white", high = "steelblue") +
      geom_text(aes(x = Prediction, y = Reference, label = Freq)) +
      theme(legend.position = "right") +
      scale_x_discrete(limits = rev(levels(as.data.frame(m$table)$Prediction))) +
      theme(legend.background = element_rect(color = 'black', fill = 'white', linetype='solid')) +
      ggtitle(mytitle) + theme(plot.title = element_text(hjust = 0.5))
  return(p)
}

# plot confusion matrix
ggplotConfusionMatrix(cfm, "Vanilla kNN")
Figure 3

5.3 Computational Complexity Analysis

The kNN algorithm from the class package achieved an accuracy of \(96.4\%\) when \(k = 1\). The kNN model for the MNIST database has a fairly high baseline accuracy. Next, let’s discuss the execution time of the model. The computational complexity of the kNN algorithm with \(n\) pictures, \(d\) pixels, and \(k\) nearest neighbors is \[ \boxed{O(nd)} + O(n\log n) + O(k)\]

kNN is generally slow at the testing portion \(O(nd)\) of the algorithm. This bottleneck comes from having to compute the Euclidean distance of the test instance to every instance in the training set. I read from this blog post that we can improve the time complexity of the algorithm by implementing a KD tree.
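To put rough numbers on this bottleneck before turning to the KD tree: with roughly \(n = 31{,}500\) training images (the 75% split above) and \(d = 784\) pixels, a single query involves about \(31{,}500 \times 784 \approx 2.5 \times 10^{7}\) squared pixel differences, and classifying all \(10{,}500\) validation digits involves on the order of \(2.6 \times 10^{11}\) of them, which is why the brute-force run above took over half an hour.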

The KD tree is a space-partitioning data structure that dramatically improves the runtime performance of the classifier. The training instances are stored in a binary tree for faster lookup times. Consequently, unlike before, the test instance is not compared to all training examples. Instead, the test instance is compared to \(m \ll n\) potential near neighbors. Building the tree takes \(O(n\log n)\) time on average, and a single nearest-neighbor search then takes \(O(\log n)\) on average, where \(n\) is the number of pictures.

The nabor package stores the training instances in a KD tree and uses it to perform the nearest-neighbor search. We will use the nabor package to re-train the model.
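Before re-training on the full data, here is a toy sketch of the nabor::knn interface we will rely on below (the data values are made up purely for illustration): it returns a list whose nn.idx matrix holds, for each query row, the row indices of its k nearest neighbors in the data matrix, and whose nn.dists matrix holds the corresponding distances.

# Toy illustration of the nabor::knn return structure (not MNIST data)
toy_train <- matrix(c(0, 0,
                      1, 1,
                      10, 10), ncol = 2, byrow = TRUE)
toy_query <- matrix(c(0.9, 1.2), ncol = 2)
res <- nabor::knn(data = toy_train, query = toy_query, k = 2)
res$nn.idx    # row numbers of the 2 nearest training rows (closest first)
res$nn.dists  # their Euclidean distances from the query

Because nn.idx contains only row indices, we will write a small helper below to map them back to class labels by majority vote.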

6 KD Tree kNN Algorithm

6.1 Train Model

# train KD Tree kNN model and track runtime
start <- current_time(TRUE)
kd_treeModel <- nabor::knn(data = trainingSet[,-1], query = validationSet[,-1], k = 1)
nabor_time <- current_time(TRUE) - start
nabor_time
## Time difference of 6.907541 mins

The kNN KD tree algorithm from the nabor package took significantly less time to train and classify the data than the vanilla kNN model. Let’s look at the accuracy of the classifier.

6.2 Confusion Matrix

# Convert KD tree neighbour indices into predicted class labels:
# for each query row, take a majority vote among the labels of its
# k nearest training instances (ties broken by the first maximum).
get_KD_tree_prediction_labels<-function(kd_tree, trainData){

  get_k_nearest_votes<-function(i,trainData,kd_tree){
    # labels of the k nearest training instances for query i
    k_nearest_votes <- trainData[c(kd_tree[i,]), 1]
    tt <- table(k_nearest_votes)
    most_common_label <- names(tt[which.max(tt)])
    return(most_common_label)
  }

  count <- nrow(kd_tree)

  # pre-allocate the prediction vector
  labeled_predictions <- character(count)

  for (i in 1:count){
    labeled_predictions[i] <- get_k_nearest_votes(i,trainData,kd_tree)
  }
  labeled_predictions <- factor(labeled_predictions, levels = 0:9)

  return(labeled_predictions)
}

kd_tree_predictions <- get_KD_tree_prediction_labels(kd_treeModel$nn.idx,trainingSet)

cfm <- confusionMatrix(kd_tree_predictions, validationSet$label)

ggplotConfusionMatrix(cfm, "KD Tree kNN")
Figure 4

7 Evaluate and Compare Models

The results from Figures \(3\) and \(4\) above show that the KD Tree kNN model has a better runtime than the Vanilla kNN algorithm whilst maintaining the same level of accuracy. The confusion matrices of the two models are identical.

As a proof of concept, we can plot the computational time complexity of both models with several \(n\) test instance sizes.

xnums <- c(10,50,100,250,375,500,750,1000,1250,1500,1750,2000, 2500, 3000)
vanilla_knn <- rep(NA, length(xnums))
knn_kd_tree <- rep(NA, length(xnums))

get_execution_time<-function(n){

  start_time <- current_time()
  predictions<-class::knn(trainingSet[,-1], validationSet[c(1:n),-1], trainingSet[,1], k = 1)
  end_time <- current_time()
  vanilla_knn_time <- end_time - start_time

  start_time <- current_time()
  kd_treeModel <- nabor::knn(trainingSet[,-1], validationSet[c(1:n),-1], k = 1)
  end_time <- current_time()
  knn_kd_time <- end_time - start_time
  output <- list(vanilla_knn_time, knn_kd_time)
  return(output)  # runtime in seconds
}

for(i in xnums){
  n <- match(i, xnums)
  execution_times <- unlist(get_execution_time(i))
  vanilla_knn[n] <- execution_times[1]
  knn_kd_tree[n] <- execution_times[2]
}

time_df <- data.frame(xnums,vanilla_knn, knn_kd_tree)
time_df_long <- melt(time_df, id="xnums")

ggplot(data=time_df_long, aes(x=xnums, y=value, color=variable)) +
  geom_line() + geom_point() +
  xlab("Test Input Size (n)") +
  ylab("Time (seconds)") +
  ggtitle("Computational Time Complexity") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_color_discrete(name = "ML Model", labels = c("Vanilla kNN", "KD Tree kNN")) +
  theme(legend.background = element_rect(color = 'black', fill = 'white', linetype='solid'))
Figure 5

The KD tree kNN model performs faster than the Vanilla implementation of kNN as the number of testing instances is increased.

Let’s optimize the KD Tree kNN model performance.

8 KD Tree kNN Model Optimization

  • Objective: Reduce the dimensionality of the feature space by performing a Principal Component Analysis.
  • Objective: Improve the accuracy of the model by optimizing for k in kNN. We will perform k-fold cross validation.

8.1 Principal Component Analysis (PCA)

8.1.1 Rationale

I read from a blog post that reducing the number of features can dramatically improve the classification outcome. In the tutorial, the author trained a model with \(18\) features for binary classification. The first model was a neural network using all features. This model achieved an accuracy of \(54.6\%\). Next, he performed a PCA on the dataset to reduce the \(18\) features to \(8\) principal components. He made a new dataset with the \(8\) principal components and rebuilt the neural network. The new model achieved an accuracy of \(97.7\%\). It would be interesting to see if we can similarly improve the accuracy of our KD tree kNN model by PCA.

Principal Component Analysis is a statistical technique for transforming a set of observations of possibly correlated variables into a set of linearly uncorrelated variables called principal components. By retaining principal components that explain a high proportion of the total variance, we effectively remove redundant features whilst minimizing the amount of information lost. PCA is not only a great dimensionality reduction technique, it can also be used for data compression to speed up the learning algorithm.
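Concretely, if \(x \in \mathbb{R}^{784}\) is a centered pixel vector and \(w_1, \dots, w_m\) are the eigenvectors of the data’s covariance (or correlation) matrix ordered by decreasing eigenvalue \(\lambda_1 \geq \lambda_2 \geq \dots\), then PCA replaces \(x\) with the score vector \[ z = \left(w_1^{\top}x,\; w_2^{\top}x,\; \dots,\; w_m^{\top}x\right) \in \mathbb{R}^{m}, \] and the proportion of total variance explained by the first \(m\) components is \(\sum_{i=1}^{m}\lambda_i \,/\, \sum_{i=1}^{784}\lambda_i\). These are exactly the “Variance Explained” and “Cumulative Variance Explained” quantities computed in the analysis below.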

A related topic for discussion is a phenomenon called the Curse of Dimensionality, which arises in high-dimensional feature spaces. As the number of features increases, the amount of data we need to generalize accurately grows exponentially. When there are too many features in the input space, the available training data becomes sparse relative to the size of that space.
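To put a rough number on this: if we partitioned each of the \(784\) pixel features into just two bins, a grid over the full feature space would contain \(2^{784} \approx 10^{236}\) cells, while we have only on the order of \(4 \times 10^{4}\) training images, so almost every cell would be empty.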

In our classification problem, each digit is represented by 784 features (28x28 resolution pixel values). Let’s perform a principal component analysis to reduce the number of features.

8.1.2 Analysis

To begin the dimension reduction, we can remove pixels from the training data that are zero (pure background) across all images. These empty pixels carry no information for classification, so we can confidently remove them.

removeEmptyPixels <- function(trainData){
  # identify pixel columns that are zero in every image
  allmisscols <- apply(trainData, 2, function(x) all(x == 0))
  colswithallmiss <- names(allmisscols[allmisscols > 0])
  # drop those columns from the data frame
  trainDataFrame <- trainData[, !names(trainData) %in% colswithallmiss]
  return(trainDataFrame)
}

trainDataFrame<-removeEmptyPixels(trainingSet)

Let’s perform the principal component analysis and construct a data frame of the results.

# principal components analysis using correlation matrix
computePrincipalComponents<-function(trainData){
  pca_df <- princomp(trainData[,-1], scores = T, cor=T)

  # Compute the eigenvalues of principal components from the standard deviation
  eig <- as.vector((pca_df$sdev)^2)

  # compute variances of principal components
  variance <- eig*100/sum(eig)

  # compute the cumulative variance
  cumvar <- cumsum(variance)

  df <- data.frame(PC = c(1:length(eig)), Eigenvalue = eig,"Variance Explained" = variance, "Cumulative Variance Explained" = cumvar)

  return(df)
}

princomp_df<-computePrincipalComponents(trainDataFrame)

8.1.3 Principal Component Results

For dimension reduction, we will retain principal components with eigenvalues greater than 1. Eigenvalues > 1 (the Kaiser criterion) is a good heuristic rule of thumb. See Figure 6.

knitr::include_graphics("../input/Reference.png")
Figure 6

Let’s make a dataframe of the filtered principal components.

# exclude rows from dataframe where eigenvalue < 1
new_princomp_df <- princomp_df[princomp_df$Eigenvalue > 1,]

We can generate an HTML table of the filtered principal components.

kable(new_princomp_df) %>%
  kable_styling(c("striped", "bordered", "responsive")) %>%
  add_header_above(c("Principal Component Analysis Results" = 4)) %>%
  row_spec(0, bold = T, color = "white", background = "#666") %>%
  scroll_box(height = "400px")
Principal Component Analysis Results
PC Eigenvalue Variance.Explained Cumulative.Variance.Explained
1 40.785439 5.7769744 5.776974
2 29.203806 4.1365164 9.913491
3 26.760640 3.7904590 13.703950
4 20.778514 2.9431323 16.647082
5 18.106538 2.5646654 19.211748
6 15.742120 2.2297620 21.441509
7 13.807888 1.9557915 23.397301
8 12.463289 1.7653384 25.162639
9 11.089303 1.5707228 26.733362
10 10.037146 1.4216920 28.155054
11 9.671364 1.3698816 29.524936
12 8.673066 1.2284796 30.753415
13 8.033350 1.1378682 31.891284
14 7.928454 1.1230105 33.014294
15 7.430107 1.0524230 34.066717
16 7.151772 1.0129988 35.079716
17 6.758783 0.9573347 36.037051
18 6.627955 0.9388038 36.975854
19 6.450296 0.9136396 37.889494
20 6.302166 0.8926581 38.782152
21 5.913115 0.8375517 39.619704
22 5.737276 0.8126453 40.432349
23 5.482075 0.7764979 41.208847
24 5.374899 0.7613172 41.970164
25 5.204479 0.7371783 42.707343
26 4.974165 0.7045560 43.411899
27 4.897580 0.6937082 44.105607
28 4.688801 0.6641361 44.769743
29 4.478469 0.6343441 45.404087
30 4.432175 0.6277869 46.031874
31 4.337247 0.6143410 46.646215
32 4.260045 0.6034058 47.249621
33 4.097587 0.5803948 47.830016
34 4.065460 0.5758442 48.405860
35 4.015096 0.5687104 48.974570
36 3.852994 0.5457499 49.520320
37 3.831807 0.5427488 50.063069
38 3.751939 0.5314361 50.594505
39 3.597863 0.5096124 51.104117
40 3.524305 0.4991933 51.603311
41 3.452628 0.4890407 52.092351
42 3.387091 0.4797579 52.572109
43 3.321853 0.4705174 53.042627
44 3.308902 0.4686831 53.511310
45 3.244350 0.4595396 53.970849
46 3.200576 0.4533394 54.424189
47 3.154652 0.4468345 54.871023
48 3.146432 0.4456702 55.316693
49 3.067260 0.4344561 55.751150
50 3.040402 0.4306518 56.181801
51 2.971235 0.4208548 56.602656
52 2.932732 0.4154011 57.018057
53 2.900722 0.4108671 57.428924
54 2.841400 0.4024646 57.831389
55 2.799752 0.3965655 58.227954
56 2.790456 0.3952487 58.623203
57 2.740900 0.3882294 59.011433
58 2.686503 0.3805245 59.391957
59 2.666955 0.3777556 59.769713
60 2.634287 0.3731285 60.142841
61 2.541794 0.3600275 60.502869
62 2.523658 0.3574587 60.860327
63 2.467274 0.3494722 61.209800
64 2.459200 0.3483286 61.558128
65 2.439964 0.3456040 61.903732
66 2.374353 0.3363106 62.240043
67 2.368244 0.3354453 62.575488
68 2.353653 0.3333786 62.908867
69 2.276250 0.3224150 63.231282
70 2.262620 0.3204844 63.551766
71 2.235628 0.3166612 63.868427
72 2.219180 0.3143314 64.182759
73 2.199981 0.3116120 64.494371
74 2.173328 0.3078369 64.802207
75 2.152361 0.3048670 65.107074
76 2.134864 0.3023887 65.409463
77 2.107172 0.2984664 65.707930
78 2.080476 0.2946850 66.002615
79 2.052091 0.2906645 66.293279
80 2.045284 0.2897003 66.582979
81 2.038149 0.2886896 66.871669
82 2.004339 0.2839006 67.155569
83 1.994818 0.2825521 67.438122
84 1.989158 0.2817504 67.719872
85 1.974063 0.2796123 67.999484
86 1.963117 0.2780619 68.277546
87 1.941347 0.2749783 68.552524
88 1.924542 0.2725980 68.825122
89 1.913162 0.2709862 69.096108
90 1.905256 0.2698663 69.365975
91 1.868284 0.2646295 69.630604
92 1.849384 0.2619524 69.892557
93 1.832798 0.2596030 70.152160
94 1.815964 0.2572188 70.409379
95 1.787880 0.2532408 70.662619
96 1.778162 0.2518643 70.914484
97 1.758033 0.2490131 71.163497
98 1.747077 0.2474613 71.410958
99 1.717969 0.2433384 71.654296
100 1.688094 0.2391069 71.893403
101 1.670366 0.2365958 72.129999
102 1.660331 0.2351744 72.365173
103 1.650670 0.2338060 72.598979
104 1.648016 0.2334301 72.832409
105 1.621478 0.2296711 73.062080
106 1.611681 0.2282834 73.290364
107 1.596074 0.2260728 73.516437
108 1.586211 0.2246758 73.741113
109 1.571774 0.2226309 73.963744
110 1.551543 0.2197653 74.183509
111 1.535090 0.2174349 74.400944
112 1.508619 0.2136853 74.614629
113 1.499523 0.2123970 74.827026
114 1.492858 0.2114529 75.038479
115 1.472022 0.2085017 75.246981
116 1.443340 0.2044392 75.451420
117 1.438527 0.2037574 75.655177
118 1.413418 0.2002009 75.855378
119 1.396139 0.1977533 76.053131
120 1.387957 0.1965944 76.249726
121 1.382484 0.1958193 76.445545
122 1.361487 0.1928452 76.638390
123 1.349303 0.1911194 76.829510
124 1.338809 0.1896329 77.019143
125 1.315599 0.1863455 77.205488
126 1.303301 0.1846036 77.390092
127 1.293498 0.1832150 77.573307
128 1.285998 0.1821527 77.755459
129 1.266967 0.1794571 77.934916
130 1.261629 0.1787011 78.113618
131 1.253372 0.1775315 78.291149
132 1.234522 0.1748614 78.466010
133 1.228912 0.1740669 78.640077
134 1.220501 0.1728755 78.812953
135 1.208063 0.1711137 78.984067
136 1.196307 0.1694485 79.153515
137 1.186434 0.1680502 79.321565
138 1.177051 0.1667211 79.488286
139 1.161138 0.1644671 79.652754
140 1.143044 0.1619043 79.814658
141 1.132914 0.1604694 79.975127
142 1.121481 0.1588500 80.133977
143 1.115372 0.1579847 80.291962
144 1.106983 0.1567965 80.448758
145 1.103441 0.1562948 80.605053
146 1.090436 0.1544527 80.759506
147 1.081244 0.1531507 80.912656
148 1.071122 0.1517170 81.064373
149 1.061629 0.1503723 81.214746
150 1.055994 0.1495743 81.364320
151 1.053775 0.1492599 81.513580
152 1.044849 0.1479956 81.661575
153 1.038726 0.1471283 81.808704
154 1.032706 0.1462756 81.954979
155 1.024454 0.1451069 82.100086
156 1.013906 0.1436128 82.243699
157 1.010948 0.1431938 82.386893
158 1.002890 0.1420524 82.528945
159 1.001135 0.1418038 82.670749
dim(new_princomp_df)
## [1] 159   4

Previously we had \(784\) features to train the classification model. After applying the Eigenvalues > 1 rule, we are left with \(159\) principal components. This is a huge dimensionality reduction.

8.1.4 Scree Plot

We can visualize the variance explained by each principal component with a Scree Plot. For brevity, we will plot the percent variance of the first \(10\) principal components.

# Print the scree plot
ggplotScreePlot<- function(df){
  df <- df[1:10,]
  p <- 
    ggplot(df)  + 
    geom_bar(aes(x=PC, y=Variance.Explained),stat="identity", fill="tan1", color="sienna3")+
    geom_point(aes(x=PC, y=df$Variance.Explained ),stat="identity")+
    geom_line(aes(x=PC, y=df$Variance.Explained ),stat="identity") +
    geom_text(aes(label=paste0(round(Variance.Explained, 2),"%"), x=PC, y=Variance.Explained), color="black", vjust = -0.4, hjust = -0.15) +
    ylab("Percent of Variance Explained") + xlab("Principal Components") +
    ggtitle("PCA: Scree Plot") + theme(plot.title = element_text(hjust = 0.5))
  return(p)
}

ggplotScreePlot(new_princomp_df)
Figure 7

8.1.5 Re-train Model and Confusion Matrix

Now that we’ve reduced the number of principal components, we can retrain the model.

# Principal Component Analysis
pca.result <- princomp(trainingSet[,-1], scores = T)
train.pca <- pca.result$scores
test.pca <- predict(pca.result, validationSet[,-1])

num.principal.components <- 1:nrow(new_princomp_df)
start <- current_time(TRUE)
kd_treeModel1 <- nabor::knn(data = train.pca[,num.principal.components], query = test.pca[,num.principal.components], k=1)
nabor_time <- current_time(TRUE) - start
nabor_time
## Time difference of 1.095808 mins
kd_treeModel2 <- nabor::knn(data = train.pca[,c(1:49)], query = test.pca[,c(1:49)], k=1)

Reducing the number of features in the model dramatically speeds up the runtime performance of the classification algorithm.

Notice that I created two models. kd_treeModel1 uses the first \(159\) principal components; these are the principal components where Eigenvalues > 1. kd_treeModel2 uses the first \(49\) principal components. I determined this number empirically. Let’s plot the Confusion Matrices.

kd_tree_predictions1 <- get_KD_tree_prediction_labels(kd_treeModel1$nn.idx,trainingSet)
kd_tree_predictions2 <- get_KD_tree_prediction_labels(kd_treeModel2$nn.idx,trainingSet)

cfm1 <- confusionMatrix(kd_tree_predictions1, validationSet$label)
cfm2 <- confusionMatrix(kd_tree_predictions2, validationSet$label)

p1<-ggplotConfusionMatrix(cfm1, "159 Principal Components")
p2<-ggplotConfusionMatrix(cfm2, "49 Principal Components")
grid.arrange(p1, p2, nrow = 2)
Figure 8

Recall the accuracy before was \(96.4\%\). Using the first \(159\) principal components achieved an accuracy of \(96.7\%\). Using the first \(49\) principal components achieved an accuracy of \(97.2\%\).

Note that there are a number of heuristics that we could have used to determine the number of principal components to retain in the model. For instance, we could have taken the first k eigenvectors that captured at least \(85\%\) of the total cumulative variance.

Another heuristic criterion is the Elbow Method. We can create a cumulative variance plot where the y-axis is the cumulative proportion of variance explained and the x-axis is the number of principal components. The number of principal components to retain would be at the point of diminishing returns, i.e. the point where little additional variance is gained by retaining more components.
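As a quick illustration of these two alternatives, here is a sketch that reuses the princomp_df frame computed earlier (the plot styling is incidental):

# 85% cumulative-variance rule: smallest number of components reaching 85%
min(which(princomp_df$Cumulative.Variance.Explained >= 85))

# Elbow method: plot cumulative variance against the number of components
# and look for the point of diminishing returns
ggplot(princomp_df, aes(x = PC, y = Cumulative.Variance.Explained)) +
  geom_line() +
  xlab("Number of Principal Components") +
  ylab("Cumulative Variance Explained (%)") +
  ggtitle("Cumulative Variance Plot") +
  theme(plot.title = element_text(hjust = 0.5))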

For our analysis, we chose the first k eigenvectors that had Eigenvalues > 1. This heuristic criterion captured the first \(159\) principal components. These principal components account for \(82.6\%\) of the total cumulative variance.

However, as we observed earlier, for high-dimensional problems these heuristics may not necessarily be optimal. In the end, we determined empirically that the first 49 principal components achieved the highest accuracy in the model.
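For reference, a minimal sketch of how such an empirical sweep might look, reusing train.pca, test.pca and the helpers defined above (the grid of candidate values is illustrative, not the exact search I ran):

# Sweep candidate numbers of principal components and record validation accuracy
pc_grid <- c(25, 35, 49, 75, 100, 159)
pc_accuracy <- sapply(pc_grid, function(n_pc) {
  model <- nabor::knn(data = train.pca[, 1:n_pc], query = test.pca[, 1:n_pc], k = 1)
  preds <- get_KD_tree_prediction_labels(model$nn.idx, trainingSet)
  confusionMatrix(preds, validationSet$label)$overall[["Accuracy"]]
})
data.frame(n_pc = pc_grid, accuracy = pc_accuracy)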

Let’s fine-tune the model by optimizing for k in kNN. We will use 5-fold cross validation.

8.2 K-fold Cross Validation

  • Objective: Evaluate the accuracy of the kNN classifier with different values of k by cross validation. Optimize the kNN model by selecting the k with the lowest misclassification error.

# generate a randomized sample group column in the training data
train$gp <- sample(nrow(train), replace = F)/nrow(train)
# create equal k-fold subsets of the training data
fold_one <- subset(train, train$gp <= 0.2)
fold_two <- subset(train, train$gp > 0.2 & train$gp <= 0.4)
fold_three <- subset(train, train$gp > 0.4 & train$gp <= 0.6)
fold_four <- subset(train, train$gp > 0.6 & train$gp <= 0.8)
fold_five <- subset(train, train$gp > 0.8 & train$gp <= 1)

ldf <- list(fold_one, fold_two, fold_three, fold_four, fold_five)

# k values in kNN
K <- c(1,2,3,4,5,8,10,20,50,100,200)

get_misclassification_error<-function(trainFold.pca,testFold.pca, trainFold, testFold,k_val,prcomp.used = 1:49){

  kd_treeModel <- nabor::knn(data = trainFold.pca[,prcomp.used], query = testFold.pca[,prcomp.used], k=k_val)
  kd_tree_predictions <- get_KD_tree_prediction_labels(kd_treeModel$nn.idx,trainFold)
  cfm <- confusionMatrix(kd_tree_predictions, factor(testFold$label, levels = c(0:9)))
  return(1-cfm$overall[1][["Accuracy"]])
}

# my_folds[[i]][m] holds the misclassification error for test fold i at the
# m-th value of K; k[m] records the corresponding value of k
my_folds <- rep(list(NA), 5)
m <- 1

k <- rep(NA, length(K))
for(k_val in K){
  for(i in 1:length(ldf)){
    trainFold<-do.call("rbind", ldf[-i])
    testFold<- do.call("rbind", ldf[i])

    pca.result <- princomp(trainFold[,-c(1,ncol(trainFold))], scores = T)
    train.pca <- pca.result$scores
    test.pca <- predict(pca.result, testFold[,-c(1,ncol(testFold))])

    l <- get_misclassification_error(train.pca, test.pca, trainFold, testFold, k_val)
    my_folds[[i]][m] <- l
  }
  k[m] <- k_val
  m <- m + 1
}

Let’s make a dataframe of the cross validation results.

df1 <- data.frame(k = k, test_fold_one = my_folds[[1]], test_fold_two = my_folds[[2]], test_fold_three = my_folds[[3]],
                  test_fold_four = my_folds[[4]], test_fold_five = my_folds[[5]])

df1 <- cbind(df1, average = rowMeans(df1[,-1]))
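Before rendering the full table, we can pull the best k straight out of this data frame:

# k with the lowest average misclassification error across the five folds
df1$k[which.min(df1$average)]  # k = 4 for the results tabulated below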

We can generate an HTML table of the validation results.

kable(df1) %>%
  kable_styling(c("striped", "bordered")) %>%
  add_header_above(c("Nearest Neighbors" = 1, "Misclassification Error (1 - Accuracy)" = 6))
Misclassification Error (1 - Accuracy) by Nearest Neighbors (k)
k test_fold_one test_fold_two test_fold_three test_fold_four test_fold_five average
1 0.0270238 0.0267857 0.0270238 0.0289286 0.0292857 0.0278095
2 0.0284524 0.0329762 0.0296429 0.0334524 0.0334524 0.0315952
3 0.0240476 0.0271429 0.0261905 0.0270238 0.0280952 0.0265000
4 0.0227381 0.0265476 0.0250000 0.0283333 0.0271429 0.0259524
5 0.0257143 0.0272619 0.0269048 0.0279762 0.0264286 0.0268571
8 0.0260714 0.0275000 0.0278571 0.0286905 0.0280952 0.0276429
10 0.0278571 0.0292857 0.0308333 0.0313095 0.0297619 0.0298095
20 0.0333333 0.0345238 0.0342857 0.0377381 0.0340476 0.0347857
50 0.0457143 0.0447619 0.0436905 0.0479762 0.0451190 0.0454524
100 0.0579762 0.0563095 0.0566667 0.0597619 0.0569048 0.0575238
200 0.0704762 0.0720238 0.0736905 0.0740476 0.0719048 0.0724286

The minimum misclassification error \((1-accuracy)\) occurs when \(k=4\). With this, we can finally make predictions on our test dataset.

9 Final Model

Let’s re-import the data and classify the test digits. For our submission, we will use the nabor::knn kd tree model with \(k=4\) and the first \(49\) principal components.

train <- read.csv("../input/train.csv")
train$label <- as.factor(train$label)
test <- read.csv("../input/test.csv")

trainData <- removeEmptyPixels(train)

# Compute the principal components
pca.result <- princomp(trainData[,-1], scores = T)
train.pca <- pca.result$scores
test.pca <- predict(pca.result, test)

# number of principal components
num.principal.components <- 1:49
# train model
kd_treeModel <- nabor::knn(data = train.pca[,num.principal.components], query = test.pca[,num.principal.components], k=4)

# get predictions
Label <- get_KD_tree_prediction_labels(kd_treeModel$nn.idx,train)

# write predictions to file
ImageId = c(1:length(Label))
submission <- data.frame(ImageId,Label)
write.csv(submission, file = "../mnist_kdtree_kNN_submission.csv", row.names = F)

10 Conclusion

Our final model reached a maximum accuracy of ~\(97\%\) on the test dataset.

In summary, the default class::knn classifier was optimized for both runtime performance and accuracy. The training set was a relatively high dimensional dataset: there were \(784\) features (pixel values) for each digit. I applied principal component analysis to compress the features into \(49\) principal components, which explained just over \(55\%\) of the cumulative variance. Next, I used the nabor package to improve the computational performance of class::knn. The nabor::knn function speeds up the nearest-neighbor search by storing the training instances in a KD tree data structure. Lastly, the model was optimized for \(k\) in kNN by using 5-fold cross validation; the minimum misclassification error \((1-accuracy)\) occurred at \(k=4\). The final model achieved a maximum accuracy of \(97\%\) on the test dataset.

Future Study

There are several other classifiers and deep learning models (e.g. simple neural networks and convolutional neural networks) that we could have used for this classification problem. LeCun et al. (1998) published a paper, Gradient-Based Learning Applied to Document Recognition, that explores the use of more advanced models for handwriting recognition. These are all great topics for future study.