• 1 Introduction
  • 2 Set up Environment
  • 3 Initial Exploration / Visualization
  • 4 Data Overview
  • 5 K-Nearest Neighbor
    • 5.1 Train Model
    • 5.2 Confusion Matrix
    • 5.3 Computational Complexity Analysis
  • 6 KD Tree kNN Algorithm
    • 6.1 Train Model
    • 6.2 Confusion Matrix
  • 7 Evaluate and Compare Models
  • 8 KD Tree kNN Model Optimization
    • 8.1 Principal Component Analysis (PCA)
      • 8.1.1 Rationale
      • 8.1.2 Analysis
      • 8.1.3 Principal Component Results
      • 8.1.4 Scree Plot
      • 8.1.5 Re-train Model and Confusion Matrix
    • 8.2 K-fold Cross Validation
  • 9 Final Model
  • 10 Conclusion

1 Introduction

This is an Exploratory Data Analysis of the classic MNIST dataset. The data is available to download at Kaggle. The goal is to correctly classify digits (0-9) from a database of tens of thousands of gray-scale, handwritten images.

In this notebook, I will be practicing several data science algorithmic and visualization techniques.

2 Set up Environment

Load packages and import data files

# Load Packages
# vis
library(knitr)    # dynamic report generation
library(ggplot2)  # visualization
library(scales)   # scale functions for visualization
library(gridExtra) # multi-panel plots

# wrangle
library(EBImage)  # image manipulation/processing
library(reshape2) # data frame manipulation
library(kableExtra) # HTML table manipulation
library(magrittr) # pipe commands

# model
library(class)    # kNN classification
suppressMessages(library(nabor))    # kd tree implementation of kNN
suppressMessages(library(caret))    # confusionMatrix 

# Import MNIST training & test datasets
train <- read.csv("../input/train.csv")
train$label <- as.factor(train$label)
test <- read.csv("../input/test.csv")

3 Initial Exploration / Visualization

Referring to train.csv, the documentation states that each image is 28 pixels in height and 28 pixels in width. Furthermore, each pixel has an integer pixel-value associated with it (0-255), indicating the lightness or darkness of that pixel. Higher numbers are shaded darker.

With this information, we can quickly plot the pixel gray-scale intensity values to obtain an image of the digit.

# Create 28x28 matrix containing the pixel gray-scale intensity values
m <- matrix(unlist(train[12,-1]), nrow = 28, byrow = T)
# Plot the image
image(m,col=grey.colors(255), main = paste0("label = ",train[12,1]))
Figure 1

4 Data Overview

The images need to be rotated along the x-axis to be readable. The EBImage package is great for image processing and spatial transformations. Let’s visualize a subset of the data.

# set up plot
par(mai=rep(0,4)) # no margins

# layout the plots into a matrix w/ 12 columns, by row
layout(matrix(1:144, ncol=12, byrow=T))

for (i in 1:144) {
  m <- rotate(matrix(unlist(train[i,-1]), 28,28, byrow = T), 270)
  image(m,col=grey.colors(255), xaxt='n', yaxt='n')
}
Figure 2

What do these images tell us?

  • Image Size: All of the images in the dataset are the same 28x28 size. This is great. We want all images to be normalized to the same size.
  • Image Resolution: The digits are low resolution, gray-scale images.
  • No two hand-drawn digits are identical: As in natural handwriting, no two digits are perfectly identical, even when drawn by the same individual. Images of the same digit can be superimposed on one another and they still won’t perfectly align.
  • Digit variation: The images appear to be drawn by different individuals.
  • Digit typography: There are differences in how digits are stylized. For example, the number 7 can be drawn with or without a horizontal line through it. The number 4 can be drawn with a “tophat”. The number 2 can be drawn with a loop towards the bottom. The number 9 can be drawn with a straight vertical bar or with a curve. This further adds to the variation.

This is a lot to consider. Fortunately, we have over 40,000 labeled digits to train our supervised model.

Imagine each digit’s 28x28 matrix of pixels as a set of features. The classification model that immediately comes to mind (and certainly the simplest and most intuitive) is K-Nearest Neighbor (kNN).

That is, this set of features can be mapped onto a feature space and subsequently used to classify new digits based on their proximity to the k nearest training examples in that space.

The strategy for setting up the data for modeling is as follows:

  • Convert the 28x28 matrix of pixel-values into a 784-element vector. Each training instance will consist of a 784-element vector and an associated class label.
  • We do not need to transform the data as we did before for visualization. The kNN algorithm interprets the digits as a set of feature vectors.
  • Classification is done by calculating the Euclidean distance of the testing instance to every training instance.
  • A label is assigned to the testing instance based on the most common class among its k nearest examples in the training set (a hand-rolled sketch of this follows the list).
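To make the last two steps concrete, the Euclidean distance between a test vector \(x\) and a training vector \(y\) is \(d(x, y) = \sqrt{\sum_{j=1}^{784}(x_j - y_j)^2}\). Below is a minimal, hand-rolled sketch of the procedure on a small subset of the training data (purely illustrative and far too slow for the full dataset; the object names here are made up for the example, and the class and nabor packages used later do this efficiently):

# Hand-rolled kNN for a single "test" digit -- a sketch, not the real model.
# Assumes `train` has been loaded as in Section 2 (label in column 1).
train_px  <- as.matrix(train[1:1000, -1])   # 1,000 training digits, 784 pixels each
train_lbl <- train[1:1000, 1]
test_px   <- unlist(train[1001, -1])        # treat row 1001 as an unseen digit

# Euclidean distance from the test vector to every training vector
dists <- sqrt(colSums((t(train_px) - test_px)^2))

# k = 5: majority vote among the five closest training digits
k <- 5
nearest <- order(dists)[1:k]
names(which.max(table(train_lbl[nearest])))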

5 K-Nearest Neighbor

5.1 Train Model

Split the training data into a smaller training set and validation set. The validation set will be used to optimize the kNN model.

set.seed(42)
# we will split the data using a 75/25, training/validation split
sample_rows <- sample(nrow(train), nrow(train)*0.75)

trainingSet <- train[sample_rows,]
validationSet <- train[-sample_rows,]

# function to track time in seconds
current_time<-function(display_time_elapse=FALSE){
  if (display_time_elapse){
    return((Sys.time()))
  } else {
   return(as.numeric(Sys.time()))
  }
}

# train kNN model and record run time
start <- current_time(TRUE)
kNN_predictions <- class::knn(train = trainingSet[,-1], test = validationSet[,-1], cl = trainingSet[,1], k = 1)
class_kNN_time <- current_time(TRUE) - start
class_kNN_time
## Time difference of 36.08147 mins

The kNN algorithm from the class package took a little over half an hour to train the model. Let’s look at the accuracy of the classifier.

5.2 Confusion Matrix

# construct confusion matrix
cfm <- confusionMatrix(kNN_predictions, validationSet$label)

# confusion matrix visualization
ggplotConfusionMatrix <- function(m,model){
  mytitle <- paste(model, "Confusion Matrix:", "Accuracy", percent_format()(m$overall[1]),
                   "Kappa", percent_format()(m$overall[2]))
  p <-
    ggplot(data = as.data.frame(m$table),
      aes(x = Prediction, y = Reference)) +
      ylab("Actual Label") + 
      xlab("Predicted Label") +
      geom_tile(aes(fill = log(Freq)), color = "white") +
      scale_fill_gradient(low = "white", high = "steelblue") +
      geom_text(aes(x = Prediction, y = Reference, label = Freq)) +
      theme(legend.position = "right") +
      scale_x_discrete(limits = rev(levels(as.data.frame(m$table)$Prediction))) +
      theme(legend.background = element_rect(color = 'black', fill = 'white', linetype='solid')) +
      ggtitle(mytitle) + theme(plot.title = element_text(hjust = 0.5))
  return(p)
}

# plot confusion matrix
ggplotConfusionMatrix(cfm, "Vanilla kNN")
Figure 3

5.3 Computational Complexity Analysis

The kNN algorithm from the class package achieved an accuracy of \(96.4\%\) when \(k = 1\). The kNN model for the MNIST database has a fairly high baseline accuracy. Next, let’s discuss the execution time of the model. The computational complexity of the kNN algorithm with \(n\) pictures, \(d\) pixels, and \(k\) nearest neighbors is \[ \boxed{O(nd)} + O(n\log n) + O(k)\]

kNN is generally slow at the testing portion \(O(nd)\) of the algorithm. This bottleneck comes from having to compute the Euclidean distance of the test instance to every instance in the training set. I read from this blog post that we can improve the time complexity of the algorithm by implementing a KD tree.
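To put rough numbers on this bottleneck before turning to the KD tree: with roughly \(n = 31{,}500\) training images (the 75% split above) and \(d = 784\) pixels, a single query involves about \(31{,}500 \times 784 \approx 2.5 \times 10^{7}\) squared pixel differences, and classifying all \(10{,}500\) validation digits involves on the order of \(2.6 \times 10^{11}\) of them, which is why the brute-force run above took over half an hour.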

The KD tree is a space-partitioning data structure that dramatically improves the runtime performance of the classifier. The training instances are stored in a binary tree for faster lookup times. Consequently, unlike before, the test instance is not compared to all training examples. Instead, the test instance is compared to \(m \ll n\) potential near neighbors. Building the tree takes \(O(n\log n)\) time on average, and a single nearest-neighbor search then takes \(O(\log n)\) on average, where \(n\) is the number of pictures.

The nabor package stores the training instances in a KD tree and uses it to perform the nearest-neighbor search. We will use the nabor package to re-train the model.
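Before re-training on the full data, here is a toy sketch of the nabor::knn interface we will rely on below (the data values are made up purely for illustration): it returns a list whose nn.idx matrix holds, for each query row, the row indices of its k nearest neighbors in the data matrix, and whose nn.dists matrix holds the corresponding distances.

# Toy illustration of the nabor::knn return structure (not MNIST data)
toy_train <- matrix(c(0, 0,
                      1, 1,
                      10, 10), ncol = 2, byrow = TRUE)
toy_query <- matrix(c(0.9, 1.2), ncol = 2)
res <- nabor::knn(data = toy_train, query = toy_query, k = 2)
res$nn.idx    # row numbers of the 2 nearest training rows (closest first)
res$nn.dists  # their Euclidean distances from the query

Because nn.idx contains only row indices, we will write a small helper below to map them back to class labels by majority vote.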

6 KD Tree kNN Algorithm

6.1 Train Model

# train KD Tree kNN model and track runtime
start <- current_time(TRUE)
kd_treeModel <- nabor::knn(data = trainingSet[,-1], query = validationSet[,-1], k = 1)
nabor_time <- current_time(TRUE) - start
nabor_time
## Time difference of 6.907541 mins

The kNN KD tree algorithm from the nabor package took significantly less time to train and classify the data than the vanilla kNN model. Let’s look at the accuracy of the classifier.

6.2 Confusion Matrix

# Convert KD tree neighbour indices into predicted class labels:
# for each query row, take a majority vote among the labels of its
# k nearest training instances (ties broken by the first maximum).
get_KD_tree_prediction_labels<-function(kd_tree, trainData){

  get_k_nearest_votes<-function(i,trainData,kd_tree){
    # labels of the k nearest training instances for query i
    k_nearest_votes <- trainData[c(kd_tree[i,]), 1]
    tt <- table(k_nearest_votes)
    most_common_label <- names(tt[which.max(tt)])
    return(most_common_label)
  }

  count <- nrow(kd_tree)

  # pre-allocate the prediction vector
  labeled_predictions <- character(count)

  for (i in 1:count){
    labeled_predictions[i] <- get_k_nearest_votes(i,trainData,kd_tree)
  }
  labeled_predictions <- factor(labeled_predictions, levels = 0:9)

  return(labeled_predictions)
}

kd_tree_predictions <- get_KD_tree_prediction_labels(kd_treeModel$nn.idx,trainingSet)

cfm <- confusionMatrix(kd_tree_predictions, validationSet$label)

ggplotConfusionMatrix(cfm, "KD Tree kNN")
Figure 4

7 Evaluate and Compare Models

The results from Figures \(3\) and \(4\) above show that the KD Tree kNN model has a better runtime than the Vanilla kNN algorithm whilst maintaining the same level of accuracy. The confusion matrices of the two models are identical.

As a proof of concept, we can plot the computational time complexity of both models with several \(n\) test instance sizes.

xnums <- c(10,50,100,250,375,500,750,1000,1250,1500,1750,2000, 2500, 3000)
vanilla_knn <- rep(NA, length(xnums))
knn_kd_tree <- rep(NA, length(xnums))

get_execution_time<-function(n){

  start_time <- current_time()
  predictions<-class::knn(trainingSet[,-1], validationSet[c(1:n),-1], trainingSet[,1], k = 1)
  end_time <- current_time()
  vanilla_knn_time <- end_time - start_time

  start_time <- current_time()
  kd_treeModel <- nabor::knn(trainingSet[,-1], validationSet[c(1:n),-1], k = 1)
  end_time <- current_time()
  knn_kd_time <- end_time - start_time
  output <- list(vanilla_knn_time, knn_kd_time)
  return(output)  # runtime in seconds
}

for(i in xnums){
  n <- match(i, xnums)
  execution_times <- unlist(get_execution_time(i))
  vanilla_knn[n] <- execution_times[1]
  knn_kd_tree[n] <- execution_times[2]
}

time_df <- data.frame(xnums,vanilla_knn, knn_kd_tree)
time_df_long <- melt(time_df, id="xnums")

ggplot(data=time_df_long, aes(x=xnums, y=value, color=variable)) +
  geom_line() + geom_point() +
  xlab("Test Input Size (n)") +
  ylab("Time (seconds)") +
  ggtitle("Computational Time Complexity") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_color_discrete(name = "ML Model", labels = c("Vanilla kNN", "KD Tree kNN")) +
  theme(legend.background = element_rect(color = 'black', fill = 'white', linetype='solid'))
Figure 5

The KD tree kNN model performs faster than the Vanilla implementation of kNN as the number of testing instances is increased.

Let’s optimize the KD Tree kNN model performance.

8 KD Tree kNN Model Optimization

  • Objective: Reduce the dimensionality of the feature space by performing a Principal Component Analysis.
  • Objective: Improve the accuracy of the model by optimizing for k in kNN. We will perform k-fold cross validation.

8.1 Principal Component Analysis (PCA)

8.1.1 Rationale

I read from a blog post that reducing the number of features can dramatically improve the classification outcome. In the tutorial, the author trained a model with \(18\) features for binary classification. The first model was a neural network using all features. This model achieved an accuracy of \(54.6\%\). Next, he performed a PCA on the dataset to reduce the \(18\) features to \(8\) principal components. He made a new dataset with the \(8\) principal components and rebuilt the neural network. The new model achieved an accuracy of \(97.7\%\). It would be interesting to see if we can similarly improve the accuracy of our KD tree kNN model by PCA.

Principal Component Analysis is a statistical technique for transforming a set of observations of possibly correlated variables into a set of linearly uncorrelated variables called principal components. By retaining principal components that explain a high proportion of the total variance, we effectively remove redundant features whilst minimizing the amount of information lost. PCA is not only a great dimensionality reduction technique, it can also be used for data compression to speed up the learning algorithm.
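Concretely, if \(x \in \mathbb{R}^{784}\) is a centered pixel vector and \(w_1, \dots, w_m\) are the eigenvectors of the data’s covariance (or correlation) matrix ordered by decreasing eigenvalue \(\lambda_1 \geq \lambda_2 \geq \dots\), then PCA replaces \(x\) with the score vector \[ z = \left(w_1^{\top}x,\; w_2^{\top}x,\; \dots,\; w_m^{\top}x\right) \in \mathbb{R}^{m}, \] and the proportion of total variance explained by the first \(m\) components is \(\sum_{i=1}^{m}\lambda_i \,/\, \sum_{i=1}^{784}\lambda_i\). These are exactly the “Variance Explained” and “Cumulative Variance Explained” quantities computed in the analysis below.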

A related topic for discussion is a phenomenon called the Curse of Dimensionality, which arises in high-dimensional feature spaces. As the number of features increases, the amount of data we need to generalize accurately grows exponentially. When there are too many features in the input space, the available training data becomes sparse relative to the size of that space.
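To put a rough number on this: if we partitioned each of the \(784\) pixel features into just two bins, a grid over the full feature space would contain \(2^{784} \approx 10^{236}\) cells, while we have only on the order of \(4 \times 10^{4}\) training images, so almost every cell would be empty.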

In our classification problem, each digit is represented by 784 features (28x28 resolution pixel values). Let’s perform a principal component analysis to reduce the number of features.

8.1.2 Analysis

To begin the dimension reduction, we can remove pixels from the training data that are zero (pure background) across all images. These empty pixels carry no information for classification, so we can confidently remove them.

removeEmptyPixels <- function(trainData){
  # identify pixel columns that are zero in every image
  allmisscols <- apply(trainData, 2, function(x) all(x == 0))
  colswithallmiss <- names(allmisscols[allmisscols > 0])
  # drop those columns from the data frame
  trainDataFrame <- trainData[, !names(trainData) %in% colswithallmiss]
  return(trainDataFrame)
}

trainDataFrame<-removeEmptyPixels(trainingSet)

Let’s perform the principal component analysis and construct a data frame of the results.

# principal components analysis using correlation matrix
computePrincipalComponents<-function(trainData){
  pca_df <- princomp(trainData[,-1], scores = T, cor=T)

  # Compute the eigenvalues of principal components from the standard deviation
  eig <- as.vector((pca_df$sdev)^2)

  # compute variances of principal components
  variance <- eig*100/sum(eig)

  # compute the cumulative variance
  cumvar <- cumsum(variance)

  df <- data.frame(PC = c(1:length(eig)), Eigenvalue = eig,"Variance Explained" = variance, "Cumulative Variance Explained" = cumvar)

  return(df)
}

princomp_df<-computePrincipalComponents(trainDataFrame)

8.1.3 Principal Component Results

For dimension reduction, we will retain principal components with eigenvalues greater than 1. Eigenvalues > 1 (the Kaiser criterion) is a good heuristic rule of thumb. See Figure 6.

knitr::include_graphics("../input/Reference.png")
Figure 6

Let’s make a dataframe of the filtered principal components.

# exclude rows from dataframe where eigenvalue < 1
new_princomp_df <- princomp_df[princomp_df$Eigenvalue > 1,]

We can generate an HTML table of the filtered principal components.

kable(new_princomp_df) %>%
  kable_styling(c("striped", "bordered", "responsive")) %>%
  add_header_above(c("Principal Component Analysis Results" = 4)) %>%
  row_spec(0, bold = T, color = "white", background = "#666") %>%
  scroll_box(height = "400px")
Principal Component Analysis Results
PC Eigenvalue Variance.Explained Cumulative.Variance.Explained
1 40.785439 5.7769744 5.776974
2 29.203806 4.1365164 9.913491
3 26.760640 3.7904590 13.703950
4 20.778514 2.9431323 16.647082
5 18.106538 2.5646654 19.211748
6 15.742120 2.2297620 21.441509
7 13.807888 1.9557915 23.397301
8 12.463289 1.7653384 25.162639
9 11.089303 1.5707228 26.733362
10 10.037146 1.4216920 28.155054
11 9.671364 1.3698816 29.524936
12 8.673066 1.2284796 30.753415
13 8.033350 1.1378682 31.891284
14 7.928454 1.1230105 33.014294
15 7.430107 1.0524230 34.066717
16 7.151772 1.0129988 35.079716
17 6.758783 0.9573347 36.037051
18 6.627955 0.9388038 36.975854
19 6.450296 0.9136396 37.889494
20 6.302166 0.8926581 38.782152
21 5.913115 0.8375517 39.619704
22 5.737276 0.8126453 40.432349
23 5.482075 0.7764979 41.208847
24 5.374899 0.7613172 41.970164
25 5.204479 0.7371783 42.707343
26 4.974165 0.7045560 43.411899
27 4.897580 0.6937082 44.105607
28 4.688801 0.6641361 44.769743
29 4.478469 0.6343441 45.404087
30 4.432175 0.6277869 46.031874
31 4.337247 0.6143410 46.646215
32 4.260045 0.6034058 47.249621
33 4.097587 0.5803948 47.830016
34 4.065460 0.5758442 48.405860
35 4.015096 0.5687104 48.974570
36 3.852994 0.5457499 49.520320
37 3.831807 0.5427488 50.063069
38 3.751939 0.5314361 50.594505
39 3.597863 0.5096124 51.104117
40 3.524305 0.4991933 51.603311
41 3.452628 0.4890407 52.092351
42 3.387091 0.4797579 52.572109
43 3.321853 0.4705174 53.042627
44 3.308902 0.4686831 53.511310
45 3.244350 0.4595396 53.970849
46 3.200576 0.4533394 54.424189
47 3.154652 0.4468345 54.871023
48 3.146432 0.4456702 55.316693
49 3.067260 0.4344561 55.751150
50 3.040402 0.4306518 56.181801
51 2.971235 0.4208548 56.602656
52 2.932732 0.4154011 57.018057
53 2.900722 0.4108671 57.428924
54 2.841400 0.4024646 57.831389
55 2.799752 0.3965655 58.227954
56 2.790456 0.3952487 58.623203
57 2.740900 0.3882294 59.011433
58 2.686503 0.3805245 59.391957
59 2.666955 0.3777556 59.769713
60 2.634287 0.3731285 60.142841
61 2.541794 0.3600275 60.502869
62 2.523658 0.3574587 60.860327
63 2.467274 0.3494722 61.209800
64 2.459200 0.3483286 61.558128
65 2.439964 0.3456040 61.903732
66 2.374353 0.3363106 62.240043
67 2.368244 0.3354453 62.575488
68 2.353653 0.3333786 62.908867
69 2.276250 0.3224150 63.231282
70 2.262620 0.3204844 63.551766
71 2.235628 0.3166612 63.868427
72 2.219180 0.3143314 64.182759
73 2.199981 0.3116120 64.494371
74 2.173328 0.3078369 64.802207
75 2.152361 0.3048670 65.107074
76 2.134864 0.3023887 65.409463
77 2.107172 0.2984664 65.707930
78 2.080476 0.2946850 66.002615
79 2.052091 0.2906645 66.293279
80 2.045284 0.2897003 66.582979
81 2.038149 0.2886896 66.871669
82 2.004339 0.2839006 67.155569
83 1.994818 0.2825521 67.438122
84 1.989158 0.2817504 67.719872
85 1.974063 0.2796123 67.999484
86 1.963117 0.2780619 68.277546
87 1.941347 0.2749783 68.552524
88 1.924542 0.2725980 68.825122
89 1.913162 0.2709862 69.096108
90 1.905256 0.2698663 69.365975
91 1.868284 0.2646295 69.630604
92 1.849384 0.2619524 69.892557
93 1.832798 0.2596030 70.152160
94 1.815964 0.2572188 70.409379
95 1.787880 0.2532408 70.662619
96 1.778162 0.2518643 70.914484
97 1.758033 0.2490131 71.163497
98 1.747077 0.2474613 71.410958
99 1.717969 0.2433384 71.654296
100 1.688094 0.2391069 71.893403
101 1.670366 0.2365958 72.129999
102 1.660331 0.2351744 72.365173
103 1.650670 0.2338060 72.598979
104 1.648016 0.2334301 72.832409
105 1.621478 0.2296711 73.062080
106 1.611681 0.2282834 73.290364
107 1.596074 0.2260728 73.516437
108 1.586211 0.2246758 73.741113
109 1.571774 0.2226309 73.963744
110 1.551543 0.2197653 74.183509
111 1.535090 0.2174349 74.400944
112 1.508619 0.2136853 74.614629
113 1.499523 0.2123970 74.827026
114 1.492858 0.2114529 75.038479
115 1.472022 0.2085017 75.246981
116 1.443340 0.2044392 75.451420
117 1.438527 0.2037574 75.655177
118 1.413418 0.2002009 75.855378
119 1.396139 0.1977533 76.053131
120 1.387957 0.1965944 76.249726
121 1.382484 0.1958193 76.445545
122 1.361487 0.1928452 76.638390
123 1.349303 0.1911194 76.829510
124 1.338809 0.1896329 77.019143
125 1.315599 0.1863455 77.205488
126 1.303301 0.1846036 77.390092
127 1.293498 0.1832150 77.573307
128 1.285998 0.1821527 77.755459
129 1.266967 0.1794571 77.934916
130 1.261629 0.1787011 78.113618
131 1.253372 0.1775315 78.291149
132 1.234522 0.1748614 78.466010
133 1.228912 0.1740669 78.640077
134 1.220501 0.1728755 78.812953
135 1.208063 0.1711137 78.984067
136 1.196307 0.1694485 79.153515
137 1.186434 0.1680502 79.321565
138 1.177051 0.1667211 79.488286
139 1.161138 0.1644671 79.652754
140 1.143044 0.1619043 79.814658
141 1.132914 0.1604694 79.975127
142 1.121481 0.1588500 80.133977
143 1.115372 0.1579847 80.291962
144 1.106983 0.1567965 80.448758
145 1.103441 0.1562948 80.605053
146 1.090436 0.1544527 80.759506
147 1.081244 0.1531507 80.912656
148 1.071122 0.1517170 81.064373
149 1.061629 0.1503723 81.214746
150 1.055994 0.1495743 81.364320
151 1.053775 0.1492599 81.513580
152 1.044849 0.1479956 81.661575
153 1.038726 0.1471283 81.808704
154 1.032706 0.1462756 81.954979
155 1.024454 0.1451069 82.100086
156 1.013906 0.1436128 82.243699
157 1.010948 0.1431938 82.386893
158 1.002890 0.1420524 82.528945
159 1.001135 0.1418038 82.670749
dim(new_princomp_df)
## [1] 159   4

Previously we had \(784\) features to train the classification model. After applying the Eigenvalues > 1 rule, we are left with \(159\) principal components. This is a huge dimensionality reduction.

8.1.4 Scree Plot

We can visualize the variance explained by each principal component with a Scree Plot. For brevity, we will plot the percent variance of the first \(10\) principal components.

# Print the scree plot
ggplotScreePlot<- function(df){
  df <- df[1:10,]
  p <- 
    ggplot(df)  + 
    geom_bar(aes(x=PC, y=Variance.Explained),stat="identity", fill="tan1", color="sienna3")+
    geom_point(aes(x=PC, y=df$Variance.Explained ),stat="identity")+
    geom_line(aes(x=PC, y=df$Variance.Explained ),stat="identity") +
    geom_text(aes(label=paste0(round(Variance.Explained, 2),"%"), x=PC, y=Variance.Explained), color="black", vjust = -0.4, hjust = -0.15) +
    ylab("Percent of Variance Explained") + xlab("Principal Components") +
    ggtitle("PCA: Scree Plot") + theme(plot.title = element_text(hjust = 0.5))
  return(p)
}

ggplotScreePlot(new_princomp_df)
Figure 7

8.1.5 Re-train Model and Confusion Matrix

Now that we’ve reduced the number of principal components, we can retrain the model.

# Principal Component Analysis
pca.result <- princomp(trainingSet[,-1], scores = T)
train.pca <- pca.result$scores
test.pca <- predict(pca.result, validationSet[,-1])

num.principal.components <- 1:nrow(new_princomp_df)
start <- current_time(TRUE)
kd_treeModel1 <- nabor::knn(data = train.pca[,num.principal.components], query = test.pca[,num.principal.components], k=1)
nabor_time <- current_time(TRUE) - start
nabor_time
## Time difference of 1.095808 mins
kd_treeModel2 <- nabor::knn(data = train.pca[,c(1:49)], query = test.pca[,c(1:49)], k=1)

Reducing the number of features in the model dramatically speeds up the runtime performance of the classification algorithm.

Notice that I created two models. kd_treeModel1 uses the first \(159\) principal components; these are the principal components where Eigenvalues > 1. kd_treeModel2 uses the first \(49\) principal components. I determined this number empirically. Let’s plot the Confusion Matrices.

kd_tree_predictions1 <- get_KD_tree_prediction_labels(kd_treeModel1$nn.idx,trainingSet)
kd_tree_predictions2 <- get_KD_tree_prediction_labels(kd_treeModel2$nn.idx,trainingSet)

cfm1 <- confusionMatrix(kd_tree_predictions1, validationSet$label)
cfm2 <- confusionMatrix(kd_tree_predictions2, validationSet$label)

p1<-ggplotConfusionMatrix(cfm1, "159 Principal Components")
p2<-ggplotConfusionMatrix(cfm2, "49 Principal Components")
grid.arrange(p1, p2, nrow = 2)
Figure 8

Recall the accuracy before was \(96.4\%\). Using the first \(159\) principal components achieved an accuracy of \(96.7\%\). Using the first \(49\) principal components achieved an accuracy of \(97.2\%\).

Note that there are a number of heuristics that we could have used to determine the number of principal components to retain in the model. For instance, we could have taken the first k eigenvectors that captured at least \(85\%\) of the total cumulative variance.

Another heuristic criterion is the Elbow Method. We can create a cumulative variance plot where the y-axis is the cumulative proportion of variance explained and the x-axis is the number of principal components. The number of principal components to retain would be at the point of diminishing returns, i.e. the point where little additional variance is gained by retaining more components.
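As a quick illustration of these two alternatives, here is a sketch that reuses the princomp_df frame computed earlier (the plot styling is incidental):

# 85% cumulative-variance rule: smallest number of components reaching 85%
min(which(princomp_df$Cumulative.Variance.Explained >= 85))

# Elbow method: plot cumulative variance against the number of components
# and look for the point of diminishing returns
ggplot(princomp_df, aes(x = PC, y = Cumulative.Variance.Explained)) +
  geom_line() +
  xlab("Number of Principal Components") +
  ylab("Cumulative Variance Explained (%)") +
  ggtitle("Cumulative Variance Plot") +
  theme(plot.title = element_text(hjust = 0.5))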

For our analysis, we chose the first k eigenvectors that had Eigenvalues > 1. This heuristic criterion captured the first \(159\) principal components. These principal components account for \(82.6\%\) of the total cumulative variance.

However, as we observed earlier, for high-dimensional problems these heuristics may not necessarily be optimal. In the end, we determined empirically that the first 49 principal components achieved the highest accuracy in the model.
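For reference, a minimal sketch of how such an empirical sweep might look, reusing train.pca, test.pca and the helpers defined above (the grid of candidate values is illustrative, not the exact search I ran):

# Sweep candidate numbers of principal components and record validation accuracy
pc_grid <- c(25, 35, 49, 75, 100, 159)
pc_accuracy <- sapply(pc_grid, function(n_pc) {
  model <- nabor::knn(data = train.pca[, 1:n_pc], query = test.pca[, 1:n_pc], k = 1)
  preds <- get_KD_tree_prediction_labels(model$nn.idx, trainingSet)
  confusionMatrix(preds, validationSet$label)$overall[["Accuracy"]]
})
data.frame(n_pc = pc_grid, accuracy = pc_accuracy)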

Let’s fine-tune the model by optimizing for k in kNN. We will use 5-fold cross validation.

8.2 K-fold Cross Validation

  • Objective: Evaluate the accuracy of the kNN classifier with different values of k by cross validation. Optimize the kNN model by selecting the k with the lowest misclassification error.

# generate a randomized sample group column in the training data
train$gp <- sample(nrow(train), replace = F)/nrow(train)
# create equal k-fold subsets of the training data
fold_one <- subset(train, train$gp <= 0.2)
fold_two <- subset(train, train$gp > 0.2 & train$gp <= 0.4)
fold_three <- subset(train, train$gp > 0.4 & train$gp <= 0.6)
fold_four <- subset(train, train$gp > 0.6 & train$gp <= 0.8)
fold_five <- subset(train, train$gp > 0.8 & train$gp <= 1)

ldf <- list(fold_one, fold_two, fold_three, fold_four, fold_five)

# k values in kNN
K <- c(1,2,3,4,5,8,10,20,50,100,200)

get_misclassification_error<-function(trainFold.pca,testFold.pca, trainFold, testFold,k_val,prcomp.used = 1:49){

  kd_treeModel <- nabor::knn(data = trainFold.pca[,prcomp.used], query = testFold.pca[,prcomp.used], k=k_val)
  kd_tree_predictions <- get_KD_tree_prediction_labels(kd_treeModel$nn.idx,trainFold)
  cfm <- confusionMatrix(kd_tree_predictions, factor(testFold$label, levels = c(0:9)))
  return(1-cfm$overall[1][["Accuracy"]])
}

# my_folds[[i]][m] holds the misclassification error for test fold i at the
# m-th value of K; k[m] records the corresponding value of k
my_folds <- rep(list(NA), 5)
m <- 1

k <- rep(NA, length(K))
for(k_val in K){
  for(i in 1:length(ldf)){
    trainFold<-do.call("rbind", ldf[-i])
    testFold<- do.call("rbind", ldf[i])

    pca.result <- princomp(trainFold[,-c(1,ncol(trainFold))], scores = T)
    train.pca <- pca.result$scores
    test.pca <- predict(pca.result, testFold[,-c(1,ncol(testFold))])

    l <- get_misclassification_error(train.pca, test.pca, trainFold, testFold, k_val)
    my_folds[[i]][m] <- l
  }
  k[m] <- k_val
  m <- m + 1
}

Let’s make a dataframe of the cross validation results.

df1 <- data.frame(k = k, test_fold_one = my_folds[[1]], test_fold_two = my_folds[[2]], test_fold_three = my_folds[[3]],
                  test_fold_four = my_folds[[4]], test_fold_five = my_folds[[5]])

df1 <- cbind(df1, average = rowMeans(df1[,-1]))
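Before rendering the full table, we can pull the best k straight out of this data frame:

# k with the lowest average misclassification error across the five folds
df1$k[which.min(df1$average)]  # k = 4 for the results tabulated below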

We can generate an HTML table of the validation results.

kable(df1) %>%
  kable_styling(c("striped", "bordered")) %>%
  add_header_above(c("Nearest Neighbors" = 1, "Misclassification Error (1 - Accuracy)" = 6))
Misclassification Error (1 - Accuracy) by Nearest Neighbors (k)
k test_fold_one test_fold_two test_fold_three test_fold_four test_fold_five average
1 0.0270238 0.0267857 0.0270238 0.0289286 0.0292857 0.0278095
2 0.0284524 0.0329762 0.0296429 0.0334524 0.0334524 0.0315952
3 0.0240476 0.0271429 0.0261905 0.0270238 0.0280952 0.0265000
4 0.0227381 0.0265476 0.0250000 0.0283333 0.0271429 0.0259524
5 0.0257143 0.0272619 0.0269048 0.0279762 0.0264286 0.0268571
8 0.0260714 0.0275000 0.0278571 0.0286905 0.0280952 0.0276429
10 0.0278571 0.0292857 0.0308333 0.0313095 0.0297619 0.0298095
20 0.0333333 0.0345238 0.0342857 0.0377381 0.0340476 0.0347857
50 0.0457143 0.0447619 0.0436905 0.0479762 0.0451190 0.0454524
100 0.0579762 0.0563095 0.0566667 0.0597619 0.0569048 0.0575238
200 0.0704762 0.0720238 0.0736905 0.0740476 0.0719048 0.0724286

The minimum misclassification error \((1-accuracy)\) occurs when \(k=4\). With this, we can finally make predictions on our test dataset.

9 Final Model

Let’s re-import the data and classify the test digits. For our submission, we will use the nabor::knn kd tree model with \(k=4\) and the first \(49\) principal components.

train <- read.csv("../input/train.csv")
train$label <- as.factor(train$label)
test <- read.csv("../input/test.csv")

trainData <- removeEmptyPixels(train)

# Compute the principal components
pca.result <- princomp(trainData[,-1], scores = T)
train.pca <- pca.result$scores
test.pca <- predict(pca.result, test)

# number of principal components
num.principal.components <- 1:49
# train model
kd_treeModel <- nabor::knn(data = train.pca[,num.principal.components], query = test.pca[,num.principal.components], k=4)

# get predictions
Label <- get_KD_tree_prediction_labels(kd_treeModel$nn.idx,train)

# write predictions to file
ImageId = c(1:length(Label))
submission <- data.frame(ImageId,Label)
write.csv(submission, file = "../mnist_kdtree_kNN_submission.csv", row.names = F)

10 Conclusion

Our final model reached a maximum accuracy of ~\(97\%\) on the test dataset.

In summary, the default class::knn classifier was optimized for both runtime performance and accuracy. The training set was a relatively high dimensional dataset: there were \(784\) features (pixel values) for each digit. I applied principal component analysis to compress the features into \(49\) principal components, which explained just over \(55\%\) of the cumulative variance. Next, I used the nabor package to improve the computational performance of class::knn. The nabor::knn function speeds up the nearest-neighbor search by storing the training instances in a KD tree data structure. Lastly, the model was optimized for \(k\) in kNN by using 5-fold cross validation; the minimum misclassification error \((1-accuracy)\) occurred at \(k=4\). The final model achieved a maximum accuracy of \(97\%\) on the test dataset.

Future Study

There are several other classifiers and deep learning models (e.g. simple neural networks and convolutional neural networks) that we could have used for this classification problem. LeCun et al. (1998) published a paper, Gradient-Based Learning Applied to Document Recognition, that explores the use of more advanced models for handwriting recognition. These are all great topics for future study.