Use SVC to Classify Handwritten Digits in Scikit-learn: A Beginner Example


Classifying handwritten digits

The Modified National Institute of Standards and Technology (MNIST) database is a
collection of 70,000 images of handwritten digits. The digits were sampled from
documents written by employees of the US Census Bureau and American high school
students. The images are grayscale and measure 28 x 28 pixels. Let's inspect some
of the images using the following script:

>>> import matplotlib.pyplot as plt
>>> # fetch_mldata was removed in scikit-learn 0.22; newer releases provide
>>> # fetch_openml instead.
>>> from sklearn.datasets import fetch_mldata
>>> import matplotlib.cm as cm
>>> digits = fetch_mldata('MNIST original', data_home='data/mnist').data
>>> counter = 1
>>> for i in range(1, 4):
...     for j in range(1, 6):
...         plt.subplot(3, 5, counter)
...         # Each image is stored as a flat vector of 784 pixel values.
...         plt.imshow(digits[(i - 1) * 8000 + j].reshape((28, 28)),
...                    cmap=cm.Greys_r)
...         plt.axis('off')
...         counter += 1
...
>>> plt.show()


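First, we load the data. scikit-learn provides the fetch_mldata convenience function,
which downloads the data set if it is not already cached under data_home. Then we
create a subplot for each of five instances of the digits zero, one, and two. The
script produces a figure with three rows of five images, one row per digit.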
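The index arithmetic in the plotting loop can be checked without downloading any data. The following is a minimal sketch; the helper name subplot_plan and its stride parameter are hypothetical, and the sketch assumes, as the script does, that the images are stored in order of their labels so that jumping 8,000 positions lands on the next digit.

```python
# Reproduce the loop from the plotting script and record, for each subplot
# position, which image index would be drawn there.
# Assumption: images are ordered by label, so a stride of 8000 reaches the
# next digit. subplot_plan is a hypothetical helper, not part of scikit-learn.
def subplot_plan(rows=3, cols=5, stride=8000):
    plan = []
    counter = 1
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            plan.append((counter, (i - 1) * stride + j))
            counter += 1
    return plan


plan = subplot_plan()
for position, index in plan:
    print('subplot %2d shows image %5d' % (position, index))
```

Each row of the 3 x 5 grid draws five consecutive images, and each new row starts 8,000 images further into the data set.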

The MNIST data set is partitioned into a training set of 60,000 images and a test
set of 10,000 images. The data set is commonly used to evaluate a variety of machine
learning models; it is popular because little preprocessing is required. Let's use
scikit-learn to build a classifier that can predict the digit depicted in an image.

First, we import the necessary classes and functions:

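from sklearn.datasets import fetch_mldata
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import scale
# sklearn.cross_validation and sklearn.grid_search were removed in
# scikit-learn 0.20; newer releases provide train_test_split and
# GridSearchCV in sklearn.model_selection.
from sklearn.cross_validation import train_test_split
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report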

The script will fork additional processes during grid search, which requires

execution from a __main__ block.

if __name__ == '__main__':
    data = fetch_mldata('MNIST original', data_home='data/mnist')
    X, y = data.data, data.target
    # Rescale pixel intensities from [0, 255] to [-1, 1].
    X = X / 255.0 * 2 - 1


The block first loads the data using the fetch_mldata convenience function. It then
rescales the pixel intensities from the range [0, 255] to the range [-1, 1], so that
every feature lies on the same scale around zero. We then split the preprocessed
data into training and test sets using the following line of code:

    X_train, X_test, y_train, y_test = train_test_split(X, y)
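The scaling and splitting steps can be tried on a small synthetic array before running them on the full data set. The following is a sketch under stated assumptions: the random array is a stand-in for MNIST, and train_test_split is imported from sklearn.model_selection, its location in current scikit-learn releases, rather than from the sklearn.cross_validation module used in this chapter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for MNIST: 40 "images" of 784 pixel intensities in [0, 255].
rng = np.random.RandomState(0)
X = rng.randint(0, 256, size=(40, 784)).astype(np.float64)
y = np.arange(40) % 10

# The same rescaling as in the chapter: 0 maps to -1 and 255 maps to 1.
X_scaled = X / 255.0 * 2 - 1

# With no test_size argument, train_test_split holds out 25% of the instances.
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state=0)
print(X_scaled.min(), X_scaled.max())
print(X_train.shape, X_test.shape)
```

The default 25 percent hold-out is consistent with the 17,500 test instances (a quarter of 70,000) reported in the chapter's classification report.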

Next, we instantiate an SVC, or support vector classifier, object. This object exposes
an API like that of scikit-learn's other estimators; the classifier is trained using
the fit method, and predictions are made using the predict method. If you consult the
documentation for SVC, you will find that the estimator requires more hyperparameters
than most of the other estimators we discussed. It is common for more powerful
estimators to require more hyperparameters.

The most interesting hyperparameters for SVC are set by the kernel, gamma, and C
keyword arguments. The kernel keyword argument specifies the kernel to be used;
scikit-learn provides implementations of the linear, polynomial, sigmoid, and radial
basis function (RBF) kernels. The degree keyword argument should also be set when
the polynomial kernel is used. C controls regularization; it plays a role similar to
the lambda hyperparameter we used for logistic regression, except that larger values
of C mean weaker regularization. The keyword argument gamma is the kernel coefficient
for the sigmoid, polynomial, and RBF kernels. Setting these hyperparameters can be
challenging, so we tune them by grid searching with the following code:

    # These statements continue the body of the __main__ block.
    pipeline = Pipeline([
        ('clf', SVC(kernel='rbf', gamma=0.01, C=100))
    ])
    print(X_train.shape)
    parameters = {
        'clf__gamma': (0.01, 0.03, 0.1, 0.3, 1),
        'clf__C': (0.1, 0.3, 1, 3, 10, 30),
    }
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=2,
                               verbose=1, scoring='accuracy')
    # Search on the first 10,000 training instances to limit the running time.
    grid_search.fit(X_train[:10000], y_train[:10000])
    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('%s: %r' % (param_name, best_parameters[param_name]))
    predictions = grid_search.predict(X_test)
    print(classification_report(y_test, predictions))
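Because the full search takes hours, it can help to rehearse the same pattern on a much smaller problem first. The sketch below rests on several assumptions that are not part of the chapter's script: it uses the 8 x 8 digits bundled with scikit-learn (load_digits) instead of MNIST, a reduced 2 x 2 grid, and the sklearn.model_selection module of current scikit-learn releases. It also counts the candidates in the chapter's grid with ParameterGrid.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# The grid from the chapter: 5 gamma values x 6 C values.
full_grid = {
    'clf__gamma': (0.01, 0.03, 0.1, 0.3, 1),
    'clf__C': (0.1, 0.3, 1, 3, 10, 30),
}
n_candidates = len(ParameterGrid(full_grid))
# Number of candidates, and number of fits when each is scored on 3 folds.
print(n_candidates, n_candidates * 3)

# Assumption: the bundled 8 x 8 digits stand in for MNIST so the search is fast.
digits = load_digits()
X = digits.data / 16.0 * 2 - 1  # pixel values in this data set range from 0 to 16
y = digits.target

# Assumption: a reduced grid, chosen only to keep the example quick.
pipeline = Pipeline([('clf', SVC(kernel='rbf'))])
small_grid = {'clf__gamma': (0.01, 0.1), 'clf__C': (1, 10)}
search = GridSearchCV(pipeline, small_grid, cv=3, scoring='accuracy')
search.fit(X, y)
print(search.best_params_)
print('Best score: %0.3f' % search.best_score_)
```

The cv=3 argument is passed explicitly because the default number of folds differs between scikit-learn releases.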

The following is the output of the preceding script:

Fitting 3 folds for each of 30 candidates, totalling 90 fits
[Parallel(n_jobs=2)]: Done 1 jobs | elapsed: 7.7min
[Parallel(n_jobs=2)]: Done 50 jobs | elapsed: 201.2min
[Parallel(n_jobs=2)]: Done 88 out of 90 | elapsed: 304.8min remaining: 6.9min
[Parallel(n_jobs=2)]: Done 90 out of 90 | elapsed: 309.2min finished
Best score: 0.966
Best parameters set:
clf__C: 3
clf__gamma: 0.01

             precision    recall  f1-score   support

        0.0       0.98      0.99      0.99      1758
        1.0       0.98      0.99      0.98      1968
        2.0       0.95      0.97      0.96      1727
        3.0       0.97      0.95      0.96      1803
        4.0       0.97      0.98      0.97      1714
        5.0       0.96      0.96      0.96      1535
        6.0       0.98      0.98      0.98      1758
        7.0       0.97      0.96      0.97      1840
        8.0       0.95      0.96      0.96      1668
        9.0       0.96      0.95      0.96      1729

avg / total       0.97      0.97      0.97     17500
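The figures in the report can be related to one another with a few lines of arithmetic. The following sketch recomputes the F1 score of one class from its precision and recall, and the support-weighted average of the per-class F1 scores. It works from the rounded values printed in the report, so individual results can differ from the report in the last digit; the function name f1 is a hypothetical helper.

```python
# Per-class F1 scores and supports, copied from the classification report.
f1_scores = [0.99, 0.98, 0.96, 0.96, 0.97, 0.96, 0.98, 0.97, 0.96, 0.96]
supports = [1758, 1968, 1727, 1803, 1714, 1535, 1758, 1840, 1668, 1729]


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# F1 for the digit 3, from its reported precision and recall.
f1_three = f1(0.97, 0.95)

# The "avg / total" row weights each class by its support.
total = sum(supports)
weighted_f1 = sum(score * n for score, n in zip(f1_scores, supports)) / total

print('F1 for digit 3: %0.2f' % f1_three)
print('Weighted average F1: %0.2f over %d instances' % (weighted_f1, total))
```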

The best model has an average F1 score of 0.97; this score can be increased further by

training on more than the first ten thousand instances.
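The effect of training set size can be explored cheaply on a smaller data set. The following is only a sketch under stated assumptions: it uses the 8 x 8 digits bundled with scikit-learn rather than MNIST, the gamma and C values are picked for illustration rather than tuned, and the imports follow current scikit-learn releases. It does not reproduce the chapter's experiment.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumption: the bundled 8 x 8 digits stand in for MNIST.
digits = load_digits()
X = digits.data / 16.0 * 2 - 1  # rescale pixel values from [0, 16] to [-1, 1]
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train on progressively larger prefixes of the training set and score each
# model on the same held-out test set.
scores = {}
for n in (100, 400, len(X_train)):
    clf = SVC(kernel='rbf', gamma=0.1, C=3)  # illustrative values, not tuned
    clf.fit(X_train[:n], y_train[:n])
    scores[n] = clf.score(X_test, y_test)
    print('%5d training instances: accuracy %0.3f' % (n, scores[n]))
```

Comparing the printed accuracies shows how much the classifier gains from the additional instances on this smaller problem.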