Understanding Kernels and the Kernel Trick in Machine Learning: A Beginner Guide

Kernels and the kernel trick

Recall that the perceptron separates the instances of the positive class from the

instances of the negative class using a hyperplane as a decision boundary. The

decision boundary is given by the following equation:

⟨w, x⟩ + b = 0

Predictions are made using the following function:

f(x) = sign(h(x)), where h(x) = ⟨w, x⟩ + b

Note that previously we expressed the inner product ⟨w, x⟩ as wᵀx. To be consistent with the notational conventions used for support vector machines, we will adopt the former notation in this chapter.
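As a concrete illustration, the perceptron's decision function takes only a few lines of NumPy. The weight vector and intercept below are arbitrary values chosen for the example, not learned parameters:

```python
import numpy as np

# Illustrative weight vector and intercept; in practice these are learned.
w = np.array([1.0, -2.0])
b = 0.5

def f(x):
    """Predict the class of x as sign(<w, x> + b)."""
    return np.sign(np.dot(w, x) + b)

print(f(np.array([3.0, 1.0])))  # <w, x> + b = 3 - 2 + 0.5 = 1.5, so 1.0
print(f(np.array([0.0, 1.0])))  # <w, x> + b = -2 + 0.5 = -1.5, so -1.0
```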

While the proof is beyond the scope of this chapter, we can write the model

differently. The following expression of the model is called the dual form. The

expression we used previously is the primal form:

f(x) = ⟨w, x⟩ + b = Σᵢ αᵢyᵢ⟨xᵢ, x⟩ + b

The most important difference between the primal and dual forms is that the primal

form computes the inner product of the model parameters and the test instance's

feature vector, while the dual form computes the inner product of the training

instances and the test instance's feature vector. Shortly, we will exploit this property

of the dual form to work with linearly inseparable classes. First, we must formalize

our definition of mapping features to higher-dimensional spaces.
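To make the primal/dual relationship concrete, the sketch below uses made-up training instances and dual coefficients αᵢ (in practice the alphas come from training) and checks that the two forms of f(x) agree:

```python
import numpy as np

# Made-up training instances, labels, and dual coefficients (alphas);
# in practice the alphas are learned during training.
X = np.array([[1.0, 2.0], [3.0, 1.0], [0.0, -1.0]])
y = np.array([1.0, -1.0, 1.0])
alpha = np.array([0.5, 0.25, 1.0])
b = 0.1

# Primal form: recover w = sum_i alpha_i y_i x_i, then evaluate <w, x> + b.
w = (alpha * y) @ X

def f_primal(x):
    return np.dot(w, x) + b

# Dual form: sum_i alpha_i y_i <x_i, x> + b. Only inner products between
# the training instances and the test instance are needed.
def f_dual(x):
    return np.sum(alpha * y * (X @ x)) + b

x_test = np.array([2.0, 2.0])
print(np.isclose(f_primal(x_test), f_dual(x_test)))  # True
```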

In the section on polynomial regression in Chapter 2, Linear Regression, we mapped

features to a higher-dimensional space in which they were linearly related to the

response variable. The mapping increased the number of features by creating

quadratic terms from combinations of the original features. These synthetic features

allowed us to express a nonlinear function with a linear model. In general, a

mapping is given by the following expression:

φ : ℝᵈ → ℝᴰ, x ↦ φ(x)
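For instance, a quadratic mapping from ℝ² to ℝ³ can be sketched as follows; this particular φ, which includes a √2 cross term, is the mapping used in the worked kernel example later in this section:

```python
import numpy as np

def phi(x):
    # Map (x1, x2) in R^2 to (x1^2, x2^2, sqrt(2) * x1 * x2) in R^3.
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x = np.array([4.0, 9.0])
print(phi(x))        # [16. 81. 50.911...]
print(phi(x).shape)  # (3,): mapped from R^2 (d=2) to R^3 (D=3)
```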


The plot on the left in the following figure shows the original feature space of a linearly inseparable data set. The plot on the right shows that the data is linearly separable after mapping to a higher-dimensional space.

Let's return to the dual form of our decision boundary, and the observation that

the feature vectors appear only inside of a dot product. We could map the data to a

higher-dimensional space by applying the mapping to the feature vectors as follows:

f(x) = Σᵢ αᵢyᵢ⟨xᵢ, x⟩ + b

f(x) = Σᵢ αᵢyᵢ⟨φ(xᵢ), φ(x)⟩ + b

As noted, this mapping allows us to express more complex models, but it

introduces computation and generalization problems. Mapping the feature

vectors and computing their dot products can require a prohibitively large

amount of processing power.

Observe in the second equation that while we have mapped the feature vectors to a

higher-dimensional space, the feature vectors still only appear as a dot product. The

dot product is scalar; we do not require the mapped feature vectors once this scalar

has been computed. If we can use a different method to produce the same scalar as

the dot product of the mapped vectors, we can avoid the costly work of explicitly

computing the dot product and mapping the feature vectors.


Fortunately, there is such a method called the kernel trick. A kernel is a function

that, given the original feature vectors, returns the same value as the dot product of

its corresponding mapped feature vectors. Kernels do not explicitly map the feature

vectors to a higher-dimensional space, or calculate the dot product of the mapped

vectors. Kernels produce the same value through a different series of operations that

can often be computed more efficiently. Kernels are defined more formally in the

following equation:

K(x, z) = ⟨φ(x), φ(z)⟩

Let's demonstrate how kernels work. Suppose that we have two feature vectors,

x and z:

x = (x₁, x₂)

z = (z₁, z₂)

In our model, we wish to map the feature vectors to a higher-dimensional space

using the following transformation:

φ(x) = x²

The dot product of the mapped, normalized feature vectors is equivalent to:

⟨φ(x), φ(z)⟩ = ⟨(x₁², x₂², √2·x₁x₂), (z₁², z₂², √2·z₁z₂)⟩

The kernel given by the following equation produces the same value as the dot

product of the mapped feature vectors:

K(x, z) = ⟨x, z⟩² = (x₁z₁ + x₂z₂)² = x₁²z₁² + 2x₁z₁x₂z₂ + x₂²z₂²

K(x, z) = ⟨φ(x), φ(z)⟩


Let's plug in values for the feature vectors to make this example more concrete:

x = (4, 9)

z = (3, 3)

K(x, z) = (4·3)² + 2·4·3·9·3 + (9·3)² = 144 + 648 + 729 = 1521

⟨φ(x), φ(z)⟩ = ⟨(4², 9², √2·4·9), (3², 3², √2·3·3)⟩ = 144 + 729 + 648 = 1521
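We can check this arithmetic numerically. The sketch below computes the quadratic kernel and the dot product of the explicitly mapped vectors, using the φ defined above:

```python
import numpy as np

def phi(v):
    # Explicit quadratic mapping: (v1, v2) -> (v1^2, v2^2, sqrt(2) * v1 * v2).
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

def K(x, z):
    # Quadratic kernel: the squared inner product of the original vectors.
    return np.dot(x, z) ** 2

x = np.array([4.0, 9.0])
z = np.array([3.0, 3.0])

print(K(x, z))                 # 1521.0
print(np.dot(phi(x), phi(z)))  # 1521.0 (up to floating-point rounding)
```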

The kernel K(x, z) produced the same value as the dot product ⟨φ(x), φ(z)⟩ of the mapped feature vectors, but it never explicitly mapped the feature vectors to the higher-dimensional space, and it required fewer arithmetic operations. This example used only two-dimensional feature vectors; data sets with even a modest number of features can result in mapped feature spaces with massive dimensions. scikit-learn provides several commonly used kernels, including the polynomial, sigmoid, Gaussian, and linear kernels. Polynomial kernels are given by the following equation:

K(x, x′) = (1 + ⟨x, x′⟩)ᵏ

Quadratic kernels, or polynomial kernels where k is equal to 2, are commonly used in

natural language processing.
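As a quick sketch of a quadratic kernel in practice, the classic XOR problem is not linearly separable in the original feature space, but a degree-2 polynomial kernel separates it. The data and hyperparameter values here are illustrative choices, not tuned settings:

```python
import numpy as np
from sklearn.svm import SVC

# XOR labels are not linearly separable in the original feature space,
# but a quadratic kernel separates them.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 1, 1, 0])

clf = SVC(kernel='poly', degree=2, coef0=1, C=100)
clf.fit(X, y)
print(clf.predict(X))  # should recover the XOR labels [0 1 1 0]
```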

The sigmoid kernel is given by the following equation. γ and r are hyperparameters

that can be tuned through cross-validation:

K(x, x′) = tanh(γ⟨x, x′⟩ + r)
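A hand-written version of this kernel can be checked against scikit-learn's implementation in sklearn.metrics.pairwise, where r is exposed as coef0. The vectors and hyperparameter values below are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.metrics.pairwise import sigmoid_kernel

# Arbitrary vectors and hyperparameter values for illustration.
x = np.array([[1.0, 2.0]])
z = np.array([[0.5, -1.0]])
gamma, r = 0.5, 1.0

by_hand = np.tanh(gamma * np.dot(x[0], z[0]) + r)
by_sklearn = sigmoid_kernel(x, z, gamma=gamma, coef0=r)[0, 0]
print(np.isclose(by_hand, by_sklearn))  # True
```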

The Gaussian kernel is a good first choice for problems requiring nonlinear models. The Gaussian kernel is a radial basis function. A decision boundary that is a hyperplane in the mapped feature space is similar to a decision boundary that is a hypersphere in the original space. The feature space produced by the Gaussian kernel can have an infinite number of dimensions, a feat that would be impossible with an explicit mapping. The Gaussian kernel is given by the following equation:

K(x, x′) = e^(−‖x − x′‖² / σ²)

σ is a hyperparameter; scikit-learn exposes the equivalent parameter γ = 1/σ² and writes the kernel as e^(−γ‖x − x′‖²). It is always important to scale the features when using support vector machines, but feature scaling is especially important when using the Gaussian kernel.
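A minimal sketch of this advice, assuming scikit-learn's make_circles as a stand-in data set: scale the features in a pipeline, then fit an SVM with the Gaussian (RBF) kernel. The gamma and C values are arbitrary starting points, not tuned hyperparameters:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# make_circles produces concentric circles: a standard linearly
# inseparable toy problem that the Gaussian (RBF) kernel handles well.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Scale the features before applying the Gaussian kernel, as noted above.
clf = make_pipeline(StandardScaler(), SVC(kernel='rbf', gamma=1.0, C=1.0))
clf.fit(X, y)
print(clf.score(X, y))
```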

Choosing a kernel can be challenging. Ideally, a kernel will measure the similarity

between instances in a way that is useful to the task. While kernels are commonly

used with support vector machines, they can also be used with any model that can

be expressed in terms of the dot product of two feature vectors, including logistic

regression, perceptrons, and principal component analysis. In the next section, we

will address the second problem caused by mapping to high-dimensional feature

spaces: generalization.