**support vectors**

The following figure depicts instances from two linearly separable classes and three possible decision boundaries. All of the decision boundaries separate the training instances of the positive class from the training instances of the negative class, and a perceptron could learn any of them. Which of these decision boundaries is most likely to perform best on test data?

From this visualization, it is intuitive that the dotted decision boundary is the best. The solid decision boundary is near many of the positive instances. The test set could contain a positive instance that has a slightly smaller value for the first explanatory variable, $x_1$; this instance would be classified incorrectly. The dashed decision boundary is farther away from most of the training instances; however, it is near one of the positive instances and one of the negative instances. The following figure provides a different perspective on evaluating decision boundaries:

Assume that the line plotted is the decision boundary for a logistic regression classifier. The instance labeled **A** is far from the decision boundary; it would be predicted to belong to the positive class with a high probability. The instance labeled **B** would still be predicted to belong to the positive class, but the probability would be lower, as the instance is closer to the decision boundary. Finally, the instance labeled **C** would be predicted to belong to the positive class with a low probability; even a small change to the training data could change the predicted class. The most confident predictions are for the instances that are farthest from the decision boundary. We can estimate the confidence of a prediction using its **functional margin**. The functional margin of the training set is given by the following equations:

$$\mathrm{funct} = \min_i \, y_i f\left(x_i\right), \qquad f(x) = \langle w, x \rangle + b$$

In the preceding formulae, $y_i$ is the true class of the instance. The functional margin is large for instance **A** and small for instance **C**. If **C** were misclassified, the functional margin would be negative. The instances for which the functional margin is equal to one are called **support vectors**. These instances alone are sufficient to define the decision boundary; the other instances are not required to predict the class of a test instance. Related to the functional margin is the **geometric margin**, or the maximum width of the band that separates the support vectors. The geometric margin is equal to the normalized functional margin. It is necessary to normalize the functional margins, as they can be scaled arbitrarily by rescaling $w$, which is problematic for training. When $w$ is a unit vector, the geometric margin is equal to the functional margin. We can now formalize our definition of the best decision boundary as the one with the greatest geometric margin. The model parameters that maximize the geometric margin can be found by solving the following constrained optimization problem:

$$\min_{w,b} \, \frac{1}{2} \langle w, w \rangle \qquad \text{subject to: } y_i \left( \langle w, x_i \rangle + b \right) \ge 1$$
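To make the definitions concrete, the following sketch computes the functional margins $y_i f(x_i)$ for a hand-picked weight vector on toy data, and flags the support vectors (the instances whose functional margin equals one). The data, weights, and bias here are illustrative assumptions, not taken from the text:

```python
import numpy as np

# Toy linearly separable data (hypothetical, for illustration only).
X = np.array([[3.0, 3.0], [4.0, 4.0], [1.0, 1.0], [0.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])  # true classes

# A hand-picked separating hyperplane f(x) = <w, x> + b.
w = np.array([0.5, 0.5])
b = -2.0

# Functional margin of each instance: y_i * f(x_i).
f = X @ w + b
margins = y * f

# The functional margin of the training set is the minimum over instances.
funct = margins.min()

# Support vectors are the instances whose functional margin equals one.
support_vectors = X[np.isclose(margins, 1.0)]

print(margins)          # per-instance functional margins
print(funct)            # functional margin of the training set
print(support_vectors)  # instances lying on the margin
```

For this particular choice of $w$ and $b$, two instances sit exactly on the margin; rescaling $w$ and $b$ by any positive constant would scale all the functional margins by that constant, which is why the normalized (geometric) margin is the quantity worth maximizing.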

A useful property of support vector machines is that this optimization problem is convex; it has a single local minimum that is also the global minimum. While the proof is beyond the scope of this chapter, the previous optimization problem can be written using the dual form of the model to accommodate kernels as follows:

$$W(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K\left(x_i, x_j\right)$$

$$\text{subject to: } \sum_{i=1}^{n} \alpha_i y_i = 0, \qquad \alpha_i \ge 0$$
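As a numerical illustration, the dual objective can be evaluated directly with NumPy for a linear kernel. The toy data and the candidate $\alpha$ below are my own assumptions; for these particular points, this $\alpha$ happens to be the maximizer, and $W(\alpha)$ equals $\frac{1}{2}\langle w, w \rangle$ for the primal solution recovered via $w = \sum_i \alpha_i y_i x_i$:

```python
import numpy as np

# Toy data (hypothetical) and candidate dual variables alpha.
X = np.array([[3.0, 3.0], [4.0, 4.0], [1.0, 1.0], [0.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.25, 0.0, 0.25, 0.0])  # satisfies sum(alpha * y) == 0

# Linear kernel matrix K[i, j] = <x_i, x_j>.
K = X @ X.T

# Dual objective:
#   W(alpha) = sum_i alpha_i
#            - 1/2 * sum_{i,j} alpha_i alpha_j y_i y_j K(x_i, x_j)
W = alpha.sum() - 0.5 * (alpha * y) @ K @ (alpha * y)

# Verify the equality constraint sum_i alpha_i y_i == 0.
print(W, (alpha * y).sum())
```

Only the instances with nonzero $\alpha_i$ (the support vectors) contribute to the recovered weight vector, which is what makes the other training instances unnecessary at prediction time.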

Finding the parameters that maximize the geometric margin, subject to the constraints that all of the positive instances have functional values $f(x_i)$ of at least 1 and all of the negative instances have functional values of at most -1 (equivalently, that every instance has a functional margin $y_i f(x_i)$ of at least 1), is a quadratic programming problem. This problem is commonly solved using an algorithm called **Sequential Minimal Optimization** (**SMO**). The SMO algorithm breaks the optimization problem down into a series of the smallest possible subproblems, which are then solved analytically.
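The smallest possible subproblem optimizes just two multipliers at a time, because the equality constraint couples any change in one multiplier to a change in another. The sketch below is a simplified variant of SMO (a random choice of the second multiplier instead of Platt's heuristics, a soft-margin bound `C`, a linear kernel, and toy data of my own), intended only to show the shape of the algorithm:

```python
import numpy as np

def simplified_smo(X, y, C=10.0, tol=1e-3, max_passes=20, seed=0):
    """Simplified SMO: repeatedly solve two-multiplier subproblems analytically."""
    rng = np.random.default_rng(seed)
    n = len(y)
    K = X @ X.T                      # linear kernel matrix
    alpha = np.zeros(n)
    b = 0.0
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            # Prediction error on instance i: f(x_i) - y_i.
            E_i = (alpha * y) @ K[:, i] + b - y[i]
            # Optimize only if instance i violates the KKT conditions.
            if (y[i] * E_i < -tol and alpha[i] < C) or (y[i] * E_i > tol and alpha[i] > 0):
                j = rng.choice([k for k in range(n) if k != i])
                E_j = (alpha * y) @ K[:, j] + b - y[j]
                a_i_old, a_j_old = alpha[i], alpha[j]
                # Bounds keeping both multipliers in [0, C] while
                # preserving the constraint sum(alpha * y) == 0.
                if y[i] != y[j]:
                    L, H = max(0, a_j_old - a_i_old), min(C, C + a_j_old - a_i_old)
                else:
                    L, H = max(0, a_i_old + a_j_old - C), min(C, a_i_old + a_j_old)
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if L == H or eta >= 0:
                    continue
                # Analytic solution of the two-variable subproblem.
                alpha[j] = np.clip(a_j_old - y[j] * (E_i - E_j) / eta, L, H)
                if abs(alpha[j] - a_j_old) < 1e-5:
                    continue
                alpha[i] += y[i] * y[j] * (a_j_old - alpha[j])
                # Update the bias so the KKT conditions hold for the pair.
                b1 = b - E_i - y[i] * (alpha[i] - a_i_old) * K[i, i] \
                     - y[j] * (alpha[j] - a_j_old) * K[i, j]
                b2 = b - E_j - y[i] * (alpha[i] - a_i_old) * K[i, j] \
                     - y[j] * (alpha[j] - a_j_old) * K[j, j]
                if 0 < alpha[i] < C:
                    b = b1
                elif 0 < alpha[j] < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b

X = np.array([[3.0, 3.0], [4.0, 4.0], [1.0, 1.0], [0.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, b = simplified_smo(X, y)
w = (alpha * y) @ X                  # recover primal weights from the duals
preds = np.sign(X @ w + b)
print(alpha, w, b, preds)
```

On this separable toy set, only the two instances nearest the opposite class end up with nonzero multipliers, matching the earlier observation that the support vectors alone define the decision boundary.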

**Binary classification with the perceptron**

Let's work through a toy classification problem. Suppose that you wish to separate adult cats from kittens. Only two explanatory variables are available in your dataset: the proportion of the day that the animal was asleep and the proportion of the day that the animal was grumpy. […]