This is an exploratory data analysis of the classic MNIST dataset, available for download on Kaggle. The goal is to correctly classify handwritten digits (0-9)
from a database of tens of thousands of gray-scale images.
In this notebook, I will practice several data science algorithms and visualization techniques.
Load packages and import data files
# Load Packages
# vis
library(knitr) # dynamic report generation
library(ggplot2) # visualization
library(scales) # scale functions for visualization
library(gridExtra) # multi-panel plots
# wrangle
library(EBImage) # image manipulation/processing
library(reshape2) # data frame manipulation
library(kableExtra) # HTML table manipulation
library(magrittr) # pipe commands
# model
library(class) # kNN classification
suppressMessages(library(nabor)) # kd tree implementation of kNN
suppressMessages(library(caret)) # confusionMatrix
# Import MNIST training & test datasets
train <- read.csv("../input/train.csv")
train$label <- as.factor(train$label)
test <- read.csv("../input/test.csv")
Referring to train.csv, the documentation states that each image is 28 pixels in height and 28 pixels in width. Furthermore, each pixel has an integer pixel-value associated with it (0-255), indicating the lightness or darkness of that pixel. Higher numbers are shaded darker.
With this information, we can quickly plot the pixel gray-scale intensity values to obtain an image of the digit.
# Create 28x28 matrix containing the pixel gray-scale intensity values
m = matrix(unlist(train[12,-1]),nrow = 28, byrow = T)
# Plot the image
image(m,col=grey.colors(255), main = paste0("label = ",train[12,1]))
Figure 1
The images need to be rotated along the x-axis to be readable. The EBImage package is great for image processing and spatial transformations. Let’s visualize a subset of the data.
# set up plot
par(mai=rep(0,4)) # no margins
# layout the plots into a matrix w/ 12 columns, by row
layout(matrix(1:144, ncol=12, byrow=T))
for (i in 1:144) {
  m <- rotate(matrix(unlist(train[i,-1]), 28, 28, byrow = T), 270)
  image(m, col = grey.colors(255), xaxt = 'n', yaxt = 'n')
}
Figure 2
What do these images tell us? Every image is the same 28x28 size. This is great; we want all images to be normalized to the same size. Still, there is a lot to consider. Fortunately, we have over 40,000 labeled digits to train our supervised model.
Imagine that each digit's 28x28 matrix of pixels represents a set of features. The classification model that immediately comes to mind (and certainly the simplest and most intuitive) is K-Nearest Neighbor (kNN).
That is, each digit's set of features can be mapped onto a feature space, and a new digit can then be classified by its proximity to its k nearest neighbors in that space.
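As a quick toy sketch of the idea (the two 2-D clusters below are made up for illustration and are not part of the MNIST data), class::knn assigns each test point the label of its nearest training point:
# Toy illustration only: two simulated 2-D clusters ("a" around 0, "b" around 3),
# classified by the label of the single nearest training point.
set.seed(1)
toy_train <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),  # ten points for class "a"
                   matrix(rnorm(20, mean = 3), ncol = 2))  # ten points for class "b"
toy_labels <- factor(rep(c("a", "b"), each = 10))
toy_test <- rbind(c(0.2, -0.1), c(2.9, 3.1))
class::knn(train = toy_train, test = toy_test, cl = toy_labels, k = 1)
# the first test point should be labeled "a" and the second "b"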
The strategy for setting up the data for modeling is as follows: split the training data into a smaller training set and a validation set, with the validation set used to optimize the kNN model.
set.seed(42)
# we will split the data using a 75/25, training/validation split
sample_rows <- sample(nrow(train), nrow(train)*0.75)
trainingSet <- train[sample_rows,]
validationSet <- train[-sample_rows,]
# function to track time in seconds
current_time <- function(display_time_elapse = FALSE){
  if (display_time_elapse){
    return(Sys.time())
  } else {
    return(as.numeric(Sys.time()))
  }
}
# train kNN model and record run time
start <- current_time(TRUE)
kNN_predictions <- class::knn(train = trainingSet[,-1], test = validationSet[,-1], cl = trainingSet[,1], k = 1)
class_kNN_time <- current_time(TRUE) - start
class_kNN_time
## Time difference of 36.08147 mins
The kNN algorithm from the class package took a little over half an hour to train the model and classify the validation set. Let’s look at the accuracy of the classifier.
# construct confusion matrix
cfm <- confusionMatrix(kNN_predictions, validationSet$label)
# confusion matrix visualization
ggplotConfusionMatrix <- function(m, model){
  mytitle <- paste(model, "Confusion Matrix:", "Accuracy", percent_format()(m$overall[1]),
                   "Kappa", percent_format()(m$overall[2]))
  p <-
    ggplot(data = as.data.frame(m$table),
           aes(x = Prediction, y = Reference)) +
    ylab("Actual Label") +
    xlab("Predicted Label") +
    geom_tile(aes(fill = log(Freq)), color = "white") +
    scale_fill_gradient(low = "white", high = "steelblue") +
    geom_text(aes(x = Prediction, y = Reference, label = Freq)) +
    theme(legend.position = "right") +
    scale_x_discrete(limits = rev(levels(as.data.frame(m$table)$Prediction))) +
    theme(legend.background = element_rect(color = 'black', fill = 'white', linetype = 'solid')) +
    ggtitle(mytitle) + theme(plot.title = element_text(hjust = 0.5))
  return(p)
}
# plot confusion matrix
ggplotConfusionMatrix(cfm, "Vanilla kNN")
Figure 3
The kNN algorithm from the class package achieved an accuracy of \(96.4\%\) when \(k = 1\). The kNN model for the MNIST database has a fairly high baseline accuracy. Next, let’s discuss the execution time of the model. The computational complexity of the kNN algorithm with \(n\) pictures, \(d\) pixels, and \(k\) nearest neighbors is \[ \boxed{O(nd)} + O(n\log n) + O(k)\]
kNN is generally slow at the testing portion, the \(O(nd)\) term, of the algorithm. This bottleneck comes from having to compute the Euclidean distance from the test instance to every instance in the training set. I read from this blog post that we can improve the time complexity of the algorithm by implementing a KD tree.
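To put the \(O(nd)\) term in perspective, here is a rough, illustrative count of the pixel-level operations implied by our 75/25 split (the row counts are approximate):
# Rough count of pixel-level difference operations for brute-force kNN:
# every validation image is compared to every training image across all
# pixel dimensions.
n_train <- nrow(trainingSet)      # roughly 31,500 rows with the 75/25 split
n_test  <- nrow(validationSet)    # roughly 10,500 rows
d       <- ncol(trainingSet) - 1  # 784 pixel features (label column excluded)
n_train * n_test * d              # on the order of 2.6e11 operations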
The KD tree is a space-partitioning data structure that dramatically improves the runtime performance of the classifier. The training instances are stored in a binary tree for faster lookups. Consequently, unlike before, the test instance is not compared to every training example; instead it is compared to \(m \ll n\) potential near neighbors. Building the tree takes \(O(n\log n)\) time, and the insert and search operations have an average time complexity of \(O(\log n)\), where \(n\) is the number of pictures.
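As a small sketch of the interface we will rely on below (toy 2-D points, not MNIST), nabor::knn returns a list whose nn.idx element holds the row indices of the nearest training points and whose nn.dists element holds the corresponding distances:
# Toy data: three 2-D points and one query point that sits nearest to row 2
toy_data  <- matrix(c(0, 0,  1, 1,  5, 5), ncol = 2, byrow = TRUE)
toy_query <- matrix(c(0.9, 1.2), ncol = 2)
nabor::knn(data = toy_data, query = toy_query, k = 2)
# $nn.idx contains the indices of the two closest rows of toy_data (2, then 1);
# $nn.dists contains the matching Euclidean distances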
The nabor package stores the training instances in a KD tree, and the kNN model is approximated. We will use the nabor package to re-train the model.
# train KD Tree kNN model and track runtime
start <- current_time(TRUE)
kd_treeModel <- nabor::knn(data = trainingSet[,-1], query = validationSet[,-1], k = 1)
nabor_time <- current_time(TRUE) - start
nabor_time
## Time difference of 6.907541 mins
The kNN KD tree algorithm from the nabor package took significantly less time to train and classify the data than the vanilla kNN model. Let’s look at the accuracy of the classifier.
# Convert KD tree neighbor indices into predicted labels by majority vote
get_KD_tree_prediction_labels <- function(kd_tree, trainData){
  # For test instance i, look up the labels of its k nearest training
  # neighbors (row i of kd_tree) and return the most common one
  get_k_nearest_votes <- function(i, trainData, kd_tree){
    k_nearest_votes <- trainData[c(kd_tree[i,]), 1]
    tt <- table(k_nearest_votes)
    most_common_label <- names(tt[which.max(tt)])
    return(most_common_label)
  }
  count <- nrow(kd_tree)
  labeled_predictions <- NA
  for (i in 1:count){
    labeled_predictions[i] <- get_k_nearest_votes(i, trainData, kd_tree)
  }
  # Use all ten digit levels so the factor lines up with the true labels
  labeled_predictions <- factor(labeled_predictions, levels = 0:9)
  return(labeled_predictions)
}
kd_tree_predictions <- get_KD_tree_prediction_labels(kd_treeModel$nn.idx,trainingSet)
cfm <- confusionMatrix(kd_tree_predictions, validationSet$label)
ggplotConfusionMatrix(cfm, "KD Tree kNN")
Figure 4
The results from Figures \(3\) and \(4\) above show that the KD Tree kNN model runs much faster than the vanilla kNN algorithm whilst maintaining the same level of accuracy. The confusion matrices of the two models are identical.
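One quick, optional way to check that claim is to compare the two prediction vectors element-wise (both are still in memory from the steps above):
# Compare the label sequences directly (both factors use the digit labels 0-9)
all(as.character(kNN_predictions) == as.character(kd_tree_predictions))
# should return TRUE if the two models agree on every validation image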
As a proof of concept, we can measure and plot the runtime of both models for several test-instance sizes \(n\).
xnums <- c(10,50,100,250,375,500,750,1000,1250,1500,1750,2000, 2500, 3000)
vanilla_knn <- rep(NA, length(xnums))
knn_kd_tree <- rep(NA, length(xnums))
get_execution_time <- function(n){
  # vanilla kNN from the class package
  start_time <- current_time()
  predictions <- class::knn(trainingSet[,-1], validationSet[c(1:n),-1], trainingSet[,1], k = 1)
  end_time <- current_time()
  vanilla_knn_time <- end_time - start_time
  # KD tree kNN from the nabor package
  start_time <- current_time()
  kd_treeModel <- nabor::knn(trainingSet[,-1], validationSet[c(1:n),-1], k = 1)
  end_time <- current_time()
  knn_kd_time <- end_time - start_time
  output <- list(vanilla_knn_time, knn_kd_time)
  return(output) # runtime in seconds
}
for(i in xnums){
  n <- match(i, xnums)
  execution_times <- unlist(get_execution_time(i))
  vanilla_knn[n] <- execution_times[1]
  knn_kd_tree[n] <- execution_times[2]
}
time_df <- data.frame(xnums,vanilla_knn, knn_kd_tree)
time_df_long <- melt(time_df, id="xnums")
ggplot(data = time_df_long, aes(x = xnums, y = value, color = variable)) +
  geom_line() + geom_point() +
  xlab("Test Input Size (n)") +
  ylab("Time (seconds)") +
  ggtitle("Computational Time Complexity") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_color_discrete(name = "ML Model", labels = c("Vanilla kNN", "KD Tree kNN")) +
  theme(legend.background = element_rect(color = 'black', fill = 'white', linetype = 'solid'))
Figure 5
The KD tree kNN model performs increasingly faster than the vanilla implementation of kNN as the number of test instances increases.
Let’s optimize the KD Tree kNN model performance.
I read from a blog post that reducing the number of features can dramatically improve the classification outcome. In the tutorial, the author trained a model with \(18\) features for binary classification. The first model was a neural network using all features. This model achieved an accuracy of \(54.6\%\). Next he performed a PCA on the dataset to reduce the \(18\) features to \(8\) principal components. He made a new dataset with the \(8\) principal components and re-built the neural network. The new model achieved an accuracy of \(97.7\%\). It would be interesting to see if we can similarly improve the accuracy of our KD tree kNN model by PCA.
Principal Component Analysis is a statistical technique for transforming a set of observations of possibly correlated variables into a set of linearly uncorrelated variables called principal components. By retaining principal components that explain a high proportion of the total variance, we effectively remove redundant features whilst minimizing the amount of information lost. PCA is not only a great dimensionality reduction technique, it can also be used for data compression to speed up the learning algorithm.
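Here is a minimal, self-contained sketch of the idea on simulated data (not the MNIST pixels): two strongly correlated variables collapse onto essentially a single principal component.
# Illustration only: simulate two strongly correlated variables and run PCA
set.seed(7)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)   # nearly a copy of x1
toy_pca <- princomp(cbind(x1, x2), cor = TRUE)
summary(toy_pca)                  # Comp.1 carries almost all of the variance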
A related topic for discussion is a phenomenon called the Curse of Dimensionality, which usually occurs in highly dimensional feature spaces. As the number of features increases, the amount of data we need to generalize accurately grows exponentially. When there are too many features in the input space, the available training data becomes sparse.
In our classification problem, each digit is represented by 784 features (28x28 resolution pixel values). Let’s perform a principal component analysis to reduce the number of features.
To begin the dimension reduction, we can remove pixels that are white space across all training images. These pixels correspond to the background of the images and do not help in the classification, so we can confidently remove them.
# Drop pixel columns that are zero across every image (pure background)
removeEmptyPixels <- function(trainData){
  allmisscols <- apply(trainData, 2, function(x) all(x == 0))
  colswithallmiss <- names(allmisscols[allmisscols > 0])
  trainDataFrame <- trainData[, !names(trainData) %in% colswithallmiss]
  return(trainDataFrame)
}
trainDataFrame<-removeEmptyPixels(trainingSet)
Let’s perform the principal component analysis and construct a data frame of the results.
# principal components analysis using correlation matrix
computePrincipalComponents <- function(trainData){
  pca_df <- princomp(trainData[,-1], scores = T, cor = T)
  # Compute the eigenvalues of principal components from the standard deviation
  eig <- as.vector((pca_df$sdev)^2)
  # compute variances of principal components
  variance <- eig*100/sum(eig)
  # compute the cumulative variance
  cumvar <- cumsum(variance)
  df <- data.frame(PC = c(1:length(eig)), Eigenvalue = eig, "Variance Explained" = variance, "Cumulative Variance Explained" = cumvar)
  return(df)
}
princomp_df<-computePrincipalComponents(trainDataFrame)
For dimension reduction, we will retain principal components with eigenvalues greater than 1. Eigenvalues > 1 is a good heuristic rule of thumb. See Figure 6.
knitr::include_graphics("../input/Reference.png")
Figure 6
Let’s make a dataframe of the filtered principal components.
# exclude rows from dataframe where eigenvalue < 1
new_princomp_df <- princomp_df[princomp_df$Eigenvalue > 1,]
We can generate an HTML table of the filtered principal components.
kable(new_princomp_df) %>%
  kable_styling(c("striped", "bordered", "responsive")) %>%
  add_header_above(c("Principal Component Analysis Results" = 4)) %>%
  row_spec(0, bold = T, color = "white", background = "#666") %>%
  scroll_box(height = "400px")
PC | Eigenvalue | Variance.Explained | Cumulative.Variance.Explained |
---|---|---|---|
1 | 40.785439 | 5.7769744 | 5.776974 |
2 | 29.203806 | 4.1365164 | 9.913491 |
3 | 26.760640 | 3.7904590 | 13.703950 |
4 | 20.778514 | 2.9431323 | 16.647082 |
5 | 18.106538 | 2.5646654 | 19.211748 |
6 | 15.742120 | 2.2297620 | 21.441509 |
7 | 13.807888 | 1.9557915 | 23.397301 |
8 | 12.463289 | 1.7653384 | 25.162639 |
9 | 11.089303 | 1.5707228 | 26.733362 |
10 | 10.037146 | 1.4216920 | 28.155054 |
11 | 9.671364 | 1.3698816 | 29.524936 |
12 | 8.673066 | 1.2284796 | 30.753415 |
13 | 8.033350 | 1.1378682 | 31.891284 |
14 | 7.928454 | 1.1230105 | 33.014294 |
15 | 7.430107 | 1.0524230 | 34.066717 |
16 | 7.151772 | 1.0129988 | 35.079716 |
17 | 6.758783 | 0.9573347 | 36.037051 |
18 | 6.627955 | 0.9388038 | 36.975854 |
19 | 6.450296 | 0.9136396 | 37.889494 |
20 | 6.302166 | 0.8926581 | 38.782152 |
21 | 5.913115 | 0.8375517 | 39.619704 |
22 | 5.737276 | 0.8126453 | 40.432349 |
23 | 5.482075 | 0.7764979 | 41.208847 |
24 | 5.374899 | 0.7613172 | 41.970164 |
25 | 5.204479 | 0.7371783 | 42.707343 |
26 | 4.974165 | 0.7045560 | 43.411899 |
27 | 4.897580 | 0.6937082 | 44.105607 |
28 | 4.688801 | 0.6641361 | 44.769743 |
29 | 4.478469 | 0.6343441 | 45.404087 |
30 | 4.432175 | 0.6277869 | 46.031874 |
31 | 4.337247 | 0.6143410 | 46.646215 |
32 | 4.260045 | 0.6034058 | 47.249621 |
33 | 4.097587 | 0.5803948 | 47.830016 |
34 | 4.065460 | 0.5758442 | 48.405860 |
35 | 4.015096 | 0.5687104 | 48.974570 |
36 | 3.852994 | 0.5457499 | 49.520320 |
37 | 3.831807 | 0.5427488 | 50.063069 |
38 | 3.751939 | 0.5314361 | 50.594505 |
39 | 3.597863 | 0.5096124 | 51.104117 |
40 | 3.524305 | 0.4991933 | 51.603311 |
41 | 3.452628 | 0.4890407 | 52.092351 |
42 | 3.387091 | 0.4797579 | 52.572109 |
43 | 3.321853 | 0.4705174 | 53.042627 |
44 | 3.308902 | 0.4686831 | 53.511310 |
45 | 3.244350 | 0.4595396 | 53.970849 |
46 | 3.200576 | 0.4533394 | 54.424189 |
47 | 3.154652 | 0.4468345 | 54.871023 |
48 | 3.146432 | 0.4456702 | 55.316693 |
49 | 3.067260 | 0.4344561 | 55.751150 |
50 | 3.040402 | 0.4306518 | 56.181801 |
51 | 2.971235 | 0.4208548 | 56.602656 |
52 | 2.932732 | 0.4154011 | 57.018057 |
53 | 2.900722 | 0.4108671 | 57.428924 |
54 | 2.841400 | 0.4024646 | 57.831389 |
55 | 2.799752 | 0.3965655 | 58.227954 |
56 | 2.790456 | 0.3952487 | 58.623203 |
57 | 2.740900 | 0.3882294 | 59.011433 |
58 | 2.686503 | 0.3805245 | 59.391957 |
59 | 2.666955 | 0.3777556 | 59.769713 |
60 | 2.634287 | 0.3731285 | 60.142841 |
61 | 2.541794 | 0.3600275 | 60.502869 |
62 | 2.523658 | 0.3574587 | 60.860327 |
63 | 2.467274 | 0.3494722 | 61.209800 |
64 | 2.459200 | 0.3483286 | 61.558128 |
65 | 2.439964 | 0.3456040 | 61.903732 |
66 | 2.374353 | 0.3363106 | 62.240043 |
67 | 2.368244 | 0.3354453 | 62.575488 |
68 | 2.353653 | 0.3333786 | 62.908867 |
69 | 2.276250 | 0.3224150 | 63.231282 |
70 | 2.262620 | 0.3204844 | 63.551766 |
71 | 2.235628 | 0.3166612 | 63.868427 |
72 | 2.219180 | 0.3143314 | 64.182759 |
73 | 2.199981 | 0.3116120 | 64.494371 |
74 | 2.173328 | 0.3078369 | 64.802207 |
75 | 2.152361 | 0.3048670 | 65.107074 |
76 | 2.134864 | 0.3023887 | 65.409463 |
77 | 2.107172 | 0.2984664 | 65.707930 |
78 | 2.080476 | 0.2946850 | 66.002615 |
79 | 2.052091 | 0.2906645 | 66.293279 |
80 | 2.045284 | 0.2897003 | 66.582979 |
81 | 2.038149 | 0.2886896 | 66.871669 |
82 | 2.004339 | 0.2839006 | 67.155569 |
83 | 1.994818 | 0.2825521 | 67.438122 |
84 | 1.989158 | 0.2817504 | 67.719872 |
85 | 1.974063 | 0.2796123 | 67.999484 |
86 | 1.963117 | 0.2780619 | 68.277546 |
87 | 1.941347 | 0.2749783 | 68.552524 |
88 | 1.924542 | 0.2725980 | 68.825122 |
89 | 1.913162 | 0.2709862 | 69.096108 |
90 | 1.905256 | 0.2698663 | 69.365975 |
91 | 1.868284 | 0.2646295 | 69.630604 |
92 | 1.849384 | 0.2619524 | 69.892557 |
93 | 1.832798 | 0.2596030 | 70.152160 |
94 | 1.815964 | 0.2572188 | 70.409379 |
95 | 1.787880 | 0.2532408 | 70.662619 |
96 | 1.778162 | 0.2518643 | 70.914484 |
97 | 1.758033 | 0.2490131 | 71.163497 |
98 | 1.747077 | 0.2474613 | 71.410958 |
99 | 1.717969 | 0.2433384 | 71.654296 |
100 | 1.688094 | 0.2391069 | 71.893403 |
101 | 1.670366 | 0.2365958 | 72.129999 |
102 | 1.660331 | 0.2351744 | 72.365173 |
103 | 1.650670 | 0.2338060 | 72.598979 |
104 | 1.648016 | 0.2334301 | 72.832409 |
105 | 1.621478 | 0.2296711 | 73.062080 |
106 | 1.611681 | 0.2282834 | 73.290364 |
107 | 1.596074 | 0.2260728 | 73.516437 |
108 | 1.586211 | 0.2246758 | 73.741113 |
109 | 1.571774 | 0.2226309 | 73.963744 |
110 | 1.551543 | 0.2197653 | 74.183509 |
111 | 1.535090 | 0.2174349 | 74.400944 |
112 | 1.508619 | 0.2136853 | 74.614629 |
113 | 1.499523 | 0.2123970 | 74.827026 |
114 | 1.492858 | 0.2114529 | 75.038479 |
115 | 1.472022 | 0.2085017 | 75.246981 |
116 | 1.443340 | 0.2044392 | 75.451420 |
117 | 1.438527 | 0.2037574 | 75.655177 |
118 | 1.413418 | 0.2002009 | 75.855378 |
119 | 1.396139 | 0.1977533 | 76.053131 |
120 | 1.387957 | 0.1965944 | 76.249726 |
121 | 1.382484 | 0.1958193 | 76.445545 |
122 | 1.361487 | 0.1928452 | 76.638390 |
123 | 1.349303 | 0.1911194 | 76.829510 |
124 | 1.338809 | 0.1896329 | 77.019143 |
125 | 1.315599 | 0.1863455 | 77.205488 |
126 | 1.303301 | 0.1846036 | 77.390092 |
127 | 1.293498 | 0.1832150 | 77.573307 |
128 | 1.285998 | 0.1821527 | 77.755459 |
129 | 1.266967 | 0.1794571 | 77.934916 |
130 | 1.261629 | 0.1787011 | 78.113618 |
131 | 1.253372 | 0.1775315 | 78.291149 |
132 | 1.234522 | 0.1748614 | 78.466010 |
133 | 1.228912 | 0.1740669 | 78.640077 |
134 | 1.220501 | 0.1728755 | 78.812953 |
135 | 1.208063 | 0.1711137 | 78.984067 |
136 | 1.196307 | 0.1694485 | 79.153515 |
137 | 1.186434 | 0.1680502 | 79.321565 |
138 | 1.177051 | 0.1667211 | 79.488286 |
139 | 1.161138 | 0.1644671 | 79.652754 |
140 | 1.143044 | 0.1619043 | 79.814658 |
141 | 1.132914 | 0.1604694 | 79.975127 |
142 | 1.121481 | 0.1588500 | 80.133977 |
143 | 1.115372 | 0.1579847 | 80.291962 |
144 | 1.106983 | 0.1567965 | 80.448758 |
145 | 1.103441 | 0.1562948 | 80.605053 |
146 | 1.090436 | 0.1544527 | 80.759506 |
147 | 1.081244 | 0.1531507 | 80.912656 |
148 | 1.071122 | 0.1517170 | 81.064373 |
149 | 1.061629 | 0.1503723 | 81.214746 |
150 | 1.055994 | 0.1495743 | 81.364320 |
151 | 1.053775 | 0.1492599 | 81.513580 |
152 | 1.044849 | 0.1479956 | 81.661575 |
153 | 1.038726 | 0.1471283 | 81.808704 |
154 | 1.032706 | 0.1462756 | 81.954979 |
155 | 1.024454 | 0.1451069 | 82.100086 |
156 | 1.013906 | 0.1436128 | 82.243699 |
157 | 1.010948 | 0.1431938 | 82.386893 |
158 | 1.002890 | 0.1420524 | 82.528945 |
159 | 1.001135 | 0.1418038 | 82.670749 |
dim(new_princomp_df)
## [1] 159 4
Previously we had \(784\) features to train the classification model. After applying the Eigenvalues > 1 rule, we are left with \(159\) principal components. This is a huge dimensionality reduction.
We can visualize the variance explained by each principal component with a Scree Plot. For brevity, we will plot the percent variance of the first \(10\) principal components.
# Print the scree plot
ggplotScreePlot <- function(df){
  df <- df[1:10,]
  p <-
    ggplot(df) +
    geom_bar(aes(x = PC, y = Variance.Explained), stat = "identity", fill = "tan1", color = "sienna3") +
    geom_point(aes(x = PC, y = Variance.Explained)) +
    geom_line(aes(x = PC, y = Variance.Explained)) +
    geom_text(aes(label = paste0(round(Variance.Explained, 2), "%"), x = PC, y = Variance.Explained),
              color = "black", vjust = -0.4, hjust = -0.15) +
    ylab("Percent of Variance Explained") + xlab("Principal Components") +
    ggtitle("PCA: Scree Plot") + theme(plot.title = element_text(hjust = 0.5))
  return(p)
}
ggplotScreePlot(new_princomp_df)
Figure 7
Now that we’ve reduced the number of principal components, we can retrain the model.
# Principal Component Analysis
pca.result <- princomp(trainingSet[,-1], scores = T)
train.pca <- pca.result$scores
test.pca <- predict(pca.result, validationSet[,-1])
num.principal.components <- 1:nrow(new_princomp_df)
start <- current_time(TRUE)
kd_treeModel1 <- nabor::knn(data = train.pca[,num.principal.components], query = test.pca[,num.principal.components], k=1)
nabor_time <- current_time(TRUE) - start
nabor_time
## Time difference of 1.095808 mins
kd_treeModel2 <- nabor::knn(data = train.pca[,c(1:49)], query = test.pca[,c(1:49)], k=1)
Reducing the number of features in the model dramatically speeds up the runtime performance of the classification algorithm.
Notice that I created two models. kd_treeModel1 uses the first \(159\) principal components; these are the principal components where Eigenvalues > 1. kd_treeModel2 uses the first \(49\) principal components, a number I determined empirically. Let’s plot the confusion matrices.
kd_tree_predictions1 <- get_KD_tree_prediction_labels(kd_treeModel1$nn.idx,trainingSet)
kd_tree_predictions2 <- get_KD_tree_prediction_labels(kd_treeModel2$nn.idx,trainingSet)
cfm1 <- confusionMatrix(kd_tree_predictions1, validationSet$label)
cfm2 <- confusionMatrix(kd_tree_predictions2, validationSet$label)
p1<-ggplotConfusionMatrix(cfm1, "159 Principal Components")
p2<-ggplotConfusionMatrix(cfm2, "49 Principal Components")
grid.arrange(p1, p2, nrow = 2)
Figure 8
Recall the accuracy before was \(96.4\%\). Using the first \(159\) principal components achieved an accuracy of \(96.7\%\). Using the first \(49\) principal components achieved an accuracy of \(97.2\%\).
Note that there are a number of heuristics that we could have used to determine the number of principal components to retain in the model. For instance, we could have taken the first k eigenvectors that captured at least \(85\%\) of the total cumulative variance.
Another heuristic criterion is the Elbow Method. We can create a cumulative variance plot where the y-axis is the cumulative proportion of variance explained and the x-axis is the number of principal components. The number of principal components to retain would be at the point of diminishing returns; that is, the point where little additional variance is gained by retaining more components.
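Both of these alternatives can be sketched directly from the princomp_df data frame built above; the snippet below is illustrative only, with \(85\%\) used as the example cutoff:
# 1) Variance-threshold rule: smallest number of components reaching 85%
k_85 <- min(which(princomp_df$Cumulative.Variance.Explained >= 85))
k_85
# 2) Elbow method: look for the bend in the cumulative variance curve
ggplot(princomp_df, aes(x = PC, y = Cumulative.Variance.Explained)) +
  geom_line() +
  xlab("Number of Principal Components") +
  ylab("Cumulative Variance Explained (%)")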
For our analysis, we chose the first k eigenvectors that had Eigenvalues > 1. This heuristic criterion captured the first \(159\) principal components, which account for \(82.6\%\) of the total cumulative variance.
However, as we observed earlier, for high dimensionality problems these heuristics may not necessarily be optimal. In the end, we determined empirically that the first 49 principal components achieved the highest accuracy in the model.
Let’s fine-tune the model by optimizing for k in kNN. We will use 5-fold cross validation and select the k with the lowest misclassification error.
# generate a randomized sample group column in the training data
train$gp <- sample(nrow(train), replace = F)/nrow(train)
# create equal k-fold subsets of the training data
fold_one <- subset(train, train$gp <= 0.2)
fold_two <- subset(train, train$gp > 0.2 & train$gp <= 0.4)
fold_three <- subset(train, train$gp > 0.4 & train$gp <= 0.6)
fold_four <- subset(train, train$gp > 0.6 & train$gp <= 0.8)
fold_five <- subset(train, train$gp > 0.8 & train$gp <= 1)
ldf <- list(fold_one, fold_two, fold_three, fold_four, fold_five)
# k values in kNN
K <- c(1,2,3,4,5,8,10,20,50,100,200)
get_misclassification_error <- function(trainFold.pca, testFold.pca, trainFold, testFold, k_val, prcomp.used = 1:49){
  kd_treeModel <- nabor::knn(data = trainFold.pca[,prcomp.used], query = testFold.pca[,prcomp.used], k = k_val)
  kd_tree_predictions <- get_KD_tree_prediction_labels(kd_treeModel$nn.idx, trainFold)
  cfm <- confusionMatrix(kd_tree_predictions, factor(testFold$label, levels = c(0:9)))
  return(1 - cfm$overall[1][["Accuracy"]])
}
my_folds <- rep(list(NA), 5)
m <- 1
k <- rep(NA, length(K))
for(k_val in K){
  for(i in 1:length(ldf)){
    # hold out fold i for testing and train on the remaining four folds
    trainFold <- do.call("rbind", ldf[-i])
    testFold <- do.call("rbind", ldf[i])
    # recompute the principal components on the training folds only
    pca.result <- princomp(trainFold[,-c(1,ncol(trainFold))], scores = T)
    train.pca <- pca.result$scores
    test.pca <- predict(pca.result, testFold[,-c(1,ncol(testFold))])
    l <- get_misclassification_error(train.pca, test.pca, trainFold, testFold, k_val)
    my_folds[[i]][m] <- l
  }
  k[m] <- k_val
  m <- m + 1
}
Let’s make a dataframe of the cross validation results.
df1 <- data.frame(k = k, test_fold_one = my_folds[[1]], test_fold_two = my_folds[[2]], test_fold_three = my_folds[[3]],
test_fold_four = my_folds[[4]], test_fold_five = my_folds[[5]])
df1 <- cbind(df1, average = rowMeans(df1[,-1]))
We can generate an HTML table of the validation results.
kable(df1) %>%
  kable_styling(c("striped", "bordered")) %>%
  add_header_above(c("Nearest Neighbors" = 1, "Misclassification Error (1 - Accuracy)" = 6))
k | test_fold_one | test_fold_two | test_fold_three | test_fold_four | test_fold_five | average |
---|---|---|---|---|---|---|
1 | 0.0270238 | 0.0267857 | 0.0270238 | 0.0289286 | 0.0292857 | 0.0278095 |
2 | 0.0284524 | 0.0329762 | 0.0296429 | 0.0334524 | 0.0334524 | 0.0315952 |
3 | 0.0240476 | 0.0271429 | 0.0261905 | 0.0270238 | 0.0280952 | 0.0265000 |
4 | 0.0227381 | 0.0265476 | 0.0250000 | 0.0283333 | 0.0271429 | 0.0259524 |
5 | 0.0257143 | 0.0272619 | 0.0269048 | 0.0279762 | 0.0264286 | 0.0268571 |
8 | 0.0260714 | 0.0275000 | 0.0278571 | 0.0286905 | 0.0280952 | 0.0276429 |
10 | 0.0278571 | 0.0292857 | 0.0308333 | 0.0313095 | 0.0297619 | 0.0298095 |
20 | 0.0333333 | 0.0345238 | 0.0342857 | 0.0377381 | 0.0340476 | 0.0347857 |
50 | 0.0457143 | 0.0447619 | 0.0436905 | 0.0479762 | 0.0451190 | 0.0454524 |
100 | 0.0579762 | 0.0563095 | 0.0566667 | 0.0597619 | 0.0569048 | 0.0575238 |
200 | 0.0704762 | 0.0720238 | 0.0736905 | 0.0740476 | 0.0719048 | 0.0724286 |
The minimum misclassification error \((1-accuracy)\) occurs when \(k=4\). With this, we can finally make predictions on our test dataset.
Let’s re-import the data and classify the test digits. For our submission, we will use the nabor::knn KD tree model with \(k=4\) and the first \(49\) principal components.
train <- read.csv("../input/train.csv")
train$label <- as.factor(train$label)
test <- read.csv("../input/test.csv")
trainData <- removeEmptyPixels(train)
# Compute the principal components
pca.result <- princomp(trainData[,-1], scores = T)
train.pca <- pca.result$scores
test.pca <- predict(pca.result, test)
# number of principal components
num.principal.components <- 1:49
# train model
kd_treeModel <- nabor::knn(data = train.pca[,num.principal.components], query = test.pca[,num.principal.components], k=4)
# get predictions
Label <- get_KD_tree_prediction_labels(kd_treeModel$nn.idx,train)
# write predictions to file
ImageId = c(1:length(Label))
submission <- data.frame(ImageId,Label)
write.csv(submission, file = "../mnist_kdtree_kNN_submission.csv", row.names = F)
Our final model reached a maximum accuracy of ~\(97\%\) on the test dataset.
In summary, the default class::knn classifier was optimized for both runtime performance and accuracy. The training set was a relatively high dimensional dataset: there were \(784\) features (the pixel values) for each digit. I applied principal component analysis to compress the features into \(49\) principal components, which together explained over \(55\%\) of the cumulative variance in the model. Next, I used the nabor package to improve the computational complexity of class::knn. The nabor::knn function approximates the kNN algorithm by storing the training instances in a KD tree data structure. Lastly, the model was optimized for \(k\) in kNN using 5-fold cross validation. The minimum misclassification error \((1-accuracy)\) occurs at \(k=4\). The final model achieved a maximum accuracy of \(97\%\) on the test dataset.
Future Study
There are several classifiers and deep learning models (e.g., simple neural networks and convolutional neural networks) that we could have used for this classification problem. LeCun et al. (1998) published a paper, Gradient-Based Learning Applied to Document Recognition, that explores the use of more advanced models for handwriting recognition. These are all great topics for future study.