As we saw in Chapters 1 and 2, a good way to reduce overfitting is to regularize the

model (i.e., to constrain it): the fewer degrees of freedom it has, the harder it will be

for it to overfit the data. For example, a simple way to regularize a polynomial model

is to reduce the number of polynomial degrees.

For a linear model, regularization is typically achieved by constraining the weights of

the model. We will now look at Ridge Regression, Lasso Regression, and Elastic Net,

which implement three different ways to constrain the weights.

**Ridge Regression**

*Ridge Regression* (also called *Tikhonov regularization*) is a regularized version of Linear Regression: a *regularization term* equal to $\alpha \sum_{i=1}^{n} \theta_i^2$ is added to the cost function.

This forces the learning algorithm to not only fit the data but also keep the model

weights as small as possible. Note that the regularization term should only be added

to the cost function during training. Once the model is trained, you want to evaluate

the model’s performance using the unregularized performance measure.

It is quite common for the cost function used during training to be

different from the performance measure used for testing. Apart

from regularization, another reason why they might be different is

that a good training cost function should have optimization-

friendly derivatives, while the performance measure used for test‐

ing should be as close as possible to the final objective. A good

example of this is a classifier trained using a cost function such as

the log loss (discussed in a moment) but evaluated using precision/

recall.

The hyperparameter *α* controls how much you want to regularize the model. If *α* = 0

then Ridge Regression is just Linear Regression. If *α* is very large, then all weights end up very close to zero and the result is a flat line going through the data's mean. Equation 4-8 presents the Ridge Regression cost function.^{11}

11 It is common to use the notation *J*(*θ*) for cost functions that don't have a short name; we will often use this notation throughout the rest of this book. The context will make it clear which cost function is being discussed.

12 Norms are discussed in Chapter 2.

13 A square matrix full of 0s except for 1s on the main diagonal (top-left to bottom-right).

*Equation 4-8. Ridge Regression cost function*

$$J(\theta) = \operatorname{MSE}(\theta) + \alpha \frac{1}{2} \sum_{i=1}^{n} \theta_i^2$$

Note that the bias term *θ*0 is not regularized (the sum starts at *i* = 1, not 0). If we define **w** as the vector of feature weights (*θ*1 to *θ*n), then the regularization term is simply equal to $\frac{1}{2}(\lVert \mathbf{w} \rVert_2)^2$, where $\lVert \cdot \rVert_2$ represents the ℓ2 norm of the weight vector.^{12}

For Gradient Descent, just add *α***w** to the MSE gradient vector (Equation 4-6).
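To make this concrete, here is a minimal NumPy sketch of Batch Gradient Descent with the *α***w** term added to the MSE gradient. The toy data and the hyperparameter values are made up purely for illustration:

```python
import numpy as np

np.random.seed(42)
X = 2 * np.random.rand(100, 1)                 # toy linear data, for illustration
y = 4 + 3 * X + np.random.randn(100, 1)

m = len(X)
X_b = np.c_[np.ones((m, 1)), X]                # add x0 = 1 (bias feature)
alpha, eta, n_epochs = 1.0, 0.1, 1000
theta = np.random.randn(2, 1)

for epoch in range(n_epochs):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)   # plain MSE gradient
    gradients[1:] += alpha * theta[1:]              # add alpha * w (bias term not regularized)
    theta = theta - eta * gradients
```

After enough epochs, `theta` settles near the minimum of the regularized cost, with the slope shrunk toward zero relative to the unregularized fit.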

It is important to scale the data (e.g., using a `StandardScaler`)

before performing Ridge Regression, as it is sensitive to the scale of

the input features. This is true of most regularized models.

Figure 4-17 shows several Ridge models trained on some linear data using different *α*

values. On the left, plain Ridge models are used, leading to linear predictions. On the

right, the data is first expanded using `PolynomialFeatures(degree=10)`, then it is

scaled using a `StandardScaler`, and finally the Ridge models are applied to the result‐

ing features: this is Polynomial Regression with Ridge regularization. Note how

increasing *α* leads to flatter (i.e., less extreme, more reasonable) predictions; this

reduces the model’s variance but increases its bias.

As with Linear Regression, we can perform Ridge Regression either by computing a

closed-form equation or by performing Gradient Descent. The pros and cons are the

same. Equation 4-9 shows the closed-form solution (where **A** is the *n* × *n* *identity*

*matrix*^{13} except with a 0 in the top-left cell, corresponding to the bias term).

14 Alternatively you can use the `Ridge` class with the `"sag"` solver. Stochastic Average GD is a variant of SGD.

For more details, see the presentation “Minimizing Finite Sums with the Stochastic Average Gradient Algo‐

rithm” by Mark Schmidt et al. from the University of British Columbia.

*Figure 4-17. Ridge Regression*

*Equation 4-9. Ridge Regression closed-form solution*

$$\hat{\theta} = \left( \mathbf{X}^T \mathbf{X} + \alpha \mathbf{A} \right)^{-1} \mathbf{X}^T \mathbf{y}$$

Here is how to perform Ridge Regression with Scikit-Learn using a closed-form solu‐

tion (a variant of Equation 4-9 using a matrix factorization technique by André-Louis

Cholesky):

```python
>>> from sklearn.linear_model import Ridge
>>> ridge_reg = Ridge(alpha=1, solver="cholesky")
>>> ridge_reg.fit(X, y)
>>> ridge_reg.predict([[1.5]])
array([[ 1.55071465]])
```
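Equation 4-9 is simple enough to check directly with NumPy. This is only an illustrative sketch with made-up toy data; note that Scikit-Learn's `Ridge` handles the intercept slightly differently (by centering the data), so the numbers will not match its output exactly:

```python
import numpy as np

np.random.seed(42)
X = 2 * np.random.rand(100, 1)          # toy linear data, for illustration
y = 4 + 3 * X + np.random.randn(100, 1)

X_b = np.c_[np.ones((100, 1)), X]       # add x0 = 1 (bias feature)
alpha = 1.0
A = np.identity(2)
A[0, 0] = 0                              # 0 in the top-left cell: the bias term is not regularized

# theta = (X^T X + alpha A)^(-1) X^T y; solve() is more stable than computing the inverse
theta = np.linalg.solve(X_b.T @ X_b + alpha * A, X_b.T @ y)
print(theta)
```

Comparing `theta` to the unregularized least-squares solution shows the characteristic effect of Ridge: the slope is pulled toward zero.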

And using Stochastic Gradient Descent:^{14}

```python
>>> from sklearn.linear_model import SGDRegressor
>>> sgd_reg = SGDRegressor(penalty="l2")
>>> sgd_reg.fit(X, y.ravel())
>>> sgd_reg.predict([[1.5]])
array([ 1.13500145])
```

The `penalty` hyperparameter sets the type of regularization term to use. Specifying

`"l2"` indicates that you want SGD to add a regularization term to the cost function

equal to half the square of the ℓ2 norm of the weight vector: this is simply Ridge

Regression.

**Lasso Regression**

*Least Absolute Shrinkage and Selection Operator Regression* (simply called *Lasso*

*Regression*) is another regularized version of Linear Regression: just like Ridge

Regression, it adds a regularization term to the cost function, but it uses the ℓ1 norm

of the weight vector instead of half the square of the ℓ2 norm (see Equation 4-10).

*Equation 4-10. Lasso Regression cost function*

$$J(\theta) = \operatorname{MSE}(\theta) + \alpha \sum_{i=1}^{n} \lvert \theta_i \rvert$$

Figure 4-18 shows the same thing as Figure 4-17 but replaces Ridge models with

Lasso models and uses smaller *α* values.

*Figure 4-18. Lasso Regression*

An important characteristic of Lasso Regression is that it tends to completely elimi‐

nate the weights of the least important features (i.e., set them to zero). For example,

the dashed line in the right plot on Figure 4-18 (with *α* = 10^{-7}) looks quadratic, almost

linear: all the weights for the high-degree polynomial features are equal to zero. In

other words, Lasso Regression automatically performs feature selection and outputs a

*sparse model* (i.e., with few nonzero feature weights).
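You can observe this sparsity directly. The following sketch uses made-up quadratic data and mirrors the Figure 4-18 setup (polynomial features, scaling, then Lasso), then inspects the learned weights:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

np.random.seed(42)
X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(100, 1)   # toy quadratic data

model = make_pipeline(PolynomialFeatures(degree=10, include_bias=False),
                      StandardScaler(),
                      Lasso(alpha=0.1))
model.fit(X, y.ravel())

coefs = model.named_steps["lasso"].coef_
print(coefs)   # many of the high-degree weights come out exactly 0
```

With this setup, most of the ten polynomial weights are set exactly to zero: Lasso has effectively selected a small subset of the features.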

You can get a sense of why this is the case by looking at Figure 4-19: on the top-left

plot, the background contours (ellipses) represent an unregularized MSE cost func‐

tion (*α* = 0), and the white circles show the Batch Gradient Descent path with that

cost function. The foreground contours (diamonds) represent the ℓ1 penalty, and the

triangles show the BGD path for this penalty only (*α* → ∞). Notice how the path first reaches *θ*1 = 0, then rolls down the gutter until it reaches *θ*2 = 0. On the top-right plot, the contours represent the same cost function plus an ℓ1 penalty with *α* = 0.5. The global minimum is on the *θ*2 = 0 axis. BGD first reaches *θ*2 = 0, then rolls down the gutter until it reaches the global minimum. The two bottom plots show the same thing but use an ℓ2 penalty instead. The regularized minimum is closer to *θ* = 0 than the unregularized minimum, but the weights do not get fully eliminated.

15 You can think of a subgradient vector at a nondifferentiable point as an intermediate vector between the gradient vectors around that point.

*Figure 4-19. Lasso versus Ridge regularization*

On the Lasso cost function, the BGD path tends to bounce across

the gutter toward the end. This is because the slope changes

abruptly at *θ*2 = 0. You need to gradually reduce the learning rate in

order to actually converge to the global minimum.

The Lasso cost function is not differentiable at *θ**i* = 0 (for *i* = 1, 2, ⋯, *n*), but Gradient

Descent still works fine if you use a *subgradient vector* **g**^{15} instead when any *θ**i* = 0.

Equation 4-11 shows a subgradient vector equation you can use for Gradient Descent

with the Lasso cost function.

*Equation 4-11. Lasso Regression subgradient vector*

$$g(\theta, J) = \nabla_{\theta}\, \operatorname{MSE}(\theta) + \alpha \begin{pmatrix} \operatorname{sign}(\theta_1) \\ \operatorname{sign}(\theta_2) \\ \vdots \\ \operatorname{sign}(\theta_n) \end{pmatrix} \quad \text{where } \operatorname{sign}(\theta_i) = \begin{cases} -1 & \text{if } \theta_i < 0 \\ \phantom{+}0 & \text{if } \theta_i = 0 \\ +1 & \text{if } \theta_i > 0 \end{cases}$$
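As a rough sketch (toy data and hyperparameter values made up for illustration), Batch Gradient Descent with this subgradient vector looks like the following; note that with a constant learning rate it only bounces around near the minimum rather than converging exactly, as discussed above:

```python
import numpy as np

np.random.seed(42)
X = 2 * np.random.rand(100, 1)          # toy linear data, for illustration
y = 4 + 3 * X + np.random.randn(100, 1)

m = len(X)
X_b = np.c_[np.ones((m, 1)), X]         # add x0 = 1 (bias feature)
alpha, eta, n_epochs = 0.1, 0.05, 2000
theta = np.random.randn(2, 1)

for epoch in range(n_epochs):
    mse_grad = 2 / m * X_b.T @ (X_b @ theta - y)
    subgrad = np.sign(theta)            # sign(theta_i): -1, 0, or +1
    subgrad[0] = 0                      # the bias term is not regularized
    theta = theta - eta * (mse_grad + alpha * subgrad)
```

The `np.sign` call implements exactly the three-case sign function: it returns 0 at 0, which is a valid subgradient there.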

Here is a small Scikit-Learn example using the `Lasso` class. Note that you could

instead use an `SGDRegressor(penalty="l1")`.

```python
>>> from sklearn.linear_model import Lasso
>>> lasso_reg = Lasso(alpha=0.1)
>>> lasso_reg.fit(X, y)
>>> lasso_reg.predict([[1.5]])
array([ 1.53788174])
```

**Elastic Net**

Elastic Net is a middle ground between Ridge Regression and Lasso Regression. The

regularization term is a simple mix of both Ridge and Lasso’s regularization terms,

and you can control the mix ratio *r*. When *r* = 0, Elastic Net is equivalent to Ridge

Regression, and when *r* = 1, it is equivalent to Lasso Regression (see Equation 4-12).

*Equation 4-12. Elastic Net cost function*

$$J(\theta) = \operatorname{MSE}(\theta) + r\alpha \sum_{i=1}^{n} \lvert \theta_i \rvert + \frac{1 - r}{2}\, \alpha \sum_{i=1}^{n} \theta_i^2$$
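Equation 4-12 translates almost line for line into code. The helper below is hypothetical (it is not part of Scikit-Learn, whose `ElasticNet` uses a slightly different scaling); it simply evaluates the cost for a given parameter vector, leaving the bias *θ*0 out of both penalties:

```python
import numpy as np

def elastic_net_cost(theta, X_b, y, alpha=0.1, r=0.5):
    """Equation 4-12: MSE plus an r-weighted mix of l1 and l2 penalties.

    theta[0] is the bias term and is excluded from both penalties.
    """
    mse = np.mean((X_b @ theta - y) ** 2)
    l1 = np.sum(np.abs(theta[1:]))           # Lasso-style penalty
    l2 = 0.5 * np.sum(theta[1:] ** 2)        # Ridge-style penalty
    return mse + r * alpha * l1 + (1 - r) * alpha * l2
```

Setting `r=0` leaves only the Ridge term, and `r=1` leaves only the Lasso term, matching the equivalences stated above.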

So when should you use Linear Regression, Ridge, Lasso, or Elastic Net? It is almost

always preferable to have at least a little bit of regularization, so generally you should

avoid plain Linear Regression. Ridge is a good default, but if you suspect that only a

few features are actually useful, you should prefer Lasso or Elastic Net since they tend

to reduce the useless features’ weights down to zero as we have discussed. In general,

Elastic Net is preferred over Lasso since Lasso may behave erratically when the num‐

ber of features is greater than the number of training instances or when several fea‐

tures are strongly correlated.

Here is a short example using Scikit-Learn’s `ElasticNet` (`l1_ratio` corresponds to

the mix ratio *r*):

```python
>>> from sklearn.linear_model import ElasticNet
>>> elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
>>> elastic_net.fit(X, y)
>>> elastic_net.predict([[1.5]])
array([ 1.54333232])
```

**Early Stopping**

A very different way to regularize iterative learning algorithms such as Gradient

Descent is to stop training as soon as the validation error reaches a minimum. This is

called *early stopping*. Figure 4-20 shows a complex model (in this case a high-degree

Polynomial Regression model) being trained using Batch Gradient Descent. As the

epochs go by, the algorithm learns and its prediction error (RMSE) on the training set

naturally goes down, and so does its prediction error on the validation set. However,

after a while the validation error stops decreasing and actually starts to go back up.

This indicates that the model has started to overfit the training data. With early stop‐

ping you just stop training as soon as the validation error reaches the minimum. It is

such a simple and efficient regularization technique that Geoffrey Hinton called it a

“beautiful free lunch.”

*Figure 4-20. Early stopping regularization*

With Stochastic and Mini-batch Gradient Descent, the curves are

not so smooth, and it may be hard to know whether you have

reached the minimum or not. One solution is to stop only after the

validation error has been above the minimum for some time (when

you are confident that the model will not do any better), then roll

back the model parameters to the point where the validation error

was at a minimum.

Here is a basic implementation of early stopping:

```python
from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# max_iter=1 with warm_start=True makes each call to fit() run a single
# epoch, continuing from the previous weights
sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True, penalty=None,
                       learning_rate="constant", eta0=0.0005)

minimum_val_error = float("inf")
best_epoch = None
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train_poly_scaled, y_train)  # continues where it left off
    y_val_predict = sgd_reg.predict(X_val_poly_scaled)
    val_error = mean_squared_error(y_val, y_val_predict)
    if val_error < minimum_val_error:
        minimum_val_error = val_error
        best_epoch = epoch
        best_model = deepcopy(sgd_reg)  # deepcopy keeps the fitted weights; clone() would discard them
```

Note that with `warm_start=True`, when the `fit()` method is called, it just continues

training where it left off instead of restarting from scratch.
