As we saw in Chapters 1 and 2, a good way to reduce overfitting is to regularize the
model (i.e., to constrain it): the fewer degrees of freedom it has, the harder it will be
for it to overfit the data. For example, a simple way to regularize a polynomial model
is to reduce the number of polynomial degrees.
For a linear model, regularization is typically achieved by constraining the weights of
the model. We will now look at Ridge Regression, Lasso Regression, and Elastic Net,
which implement three different ways to constrain the weights.
Ridge Regression (also called Tikhonov regularization) is a regularized version of Linear
Regression: a regularization term equal to $\alpha \sum_{i=1}^{n} \theta_i^2$ is added to the cost function.
This forces the learning algorithm to not only fit the data but also keep the model
weights as small as possible. Note that the regularization term should only be added
to the cost function during training. Once the model is trained, you want to evaluate
the model’s performance using the unregularized performance measure.
It is quite common for the cost function used during training to be
different from the performance measure used for testing. Apart
from regularization, another reason why they might be different is
that a good training cost function should have optimization-
friendly derivatives, while the performance measure used for test‐
ing should be as close as possible to the final objective. A good
example of this is a classifier trained using a cost function such as
the log loss (discussed in a moment) but evaluated using precision/recall.
The hyperparameter α controls how much you want to regularize the model. If α = 0
then Ridge Regression is just Linear Regression. If α is very large, then all weights end
up very close to zero and the result is a flat line going through the data’s mean. Equation
4-8 presents the Ridge Regression cost function.11
11 It is common to use the notation J(θ) for cost functions that don’t have a short name; we will often use this
notation throughout the rest of this book. The context will make it clear which cost function is being discussed.
12 Norms are discussed in Chapter 2.
13 A square matrix full of 0s except for 1s on the main diagonal (top-left to bottom-right).
Equation 4-8. Ridge Regression cost function
$$J(\boldsymbol{\theta}) = \mathrm{MSE}(\boldsymbol{\theta}) + \alpha \frac{1}{2} \sum_{i=1}^{n} \theta_i^2$$
Note that the bias term θ0 is not regularized (the sum starts at i = 1, not 0). If we
define w as the vector of feature weights (θ1 to θn), then the regularization term is
simply equal to $\frac{1}{2}(\lVert \mathbf{w} \rVert_2)^2$, where $\lVert \cdot \rVert_2$ represents the ℓ2 norm of the weight vector.12
For Gradient Descent, just add αw to the MSE gradient vector (Equation 4-6).
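The gradient step described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the synthetic data, the learning rate `eta`, the value of `alpha`, and the number of epochs are all assumptions chosen for the example, not values from the text.

```python
import numpy as np

# Synthetic linear data (assumed for illustration only)
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=m)

X_b = np.c_[np.ones((m, 1)), X]      # add x0 = 1 so theta[0] is the bias term
alpha, eta, n_epochs = 1.0, 0.1, 2000  # hypothetical hyperparameters
theta = np.zeros(X_b.shape[1])

for epoch in range(n_epochs):
    mse_grad = 2 / m * X_b.T @ (X_b @ theta - y)   # gradient of the MSE term
    penalty_grad = alpha * np.r_[0.0, theta[1:]]   # alpha * w, with 0 for the unregularized bias
    theta -= eta * (mse_grad + penalty_grad)

print("bias and weight after Ridge Gradient Descent:", theta)
```

The only difference from plain Batch Gradient Descent is the `penalty_grad` term; the first entry is kept at zero so the bias term is not shrunk.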
It is important to scale the data (e.g., using a StandardScaler)
before performing Ridge Regression, as it is sensitive to the scale of
the input features. This is true of most regularized models.
Figure 4-17 shows several Ridge models trained on some linear data using different α
values. On the left, plain Ridge models are used, leading to linear predictions. On the
right, the data is first expanded using PolynomialFeatures(degree=10), then it is
scaled using a StandardScaler, and finally the Ridge models are applied to the result‐
ing features: this is Polynomial Regression with Ridge regularization. Note how
increasing α leads to flatter (i.e., less extreme, more reasonable) predictions; this
reduces the model’s variance but increases its bias.
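The three steps described for the right-hand plot (polynomial expansion, scaling, then Ridge) can be chained in a Scikit-Learn pipeline. The sketch below is not the code behind Figure 4-17: the synthetic data, the helper name `fit_poly_ridge`, and the two α values are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Small noisy linear dataset (assumed for illustration only)
rng = np.random.default_rng(42)
X = 3 * rng.random((20, 1))
y = 1 + 0.5 * X[:, 0] + rng.normal(scale=0.5, size=20)

def fit_poly_ridge(alpha):
    """Hypothetical helper: expand to degree 10, scale, then fit Ridge."""
    model = make_pipeline(
        PolynomialFeatures(degree=10, include_bias=False),
        StandardScaler(),
        Ridge(alpha=alpha),
    )
    return model.fit(X, y)

weight_norms = {}
for alpha in (1e-5, 1.0):  # hypothetical alpha values
    model = fit_poly_ridge(alpha)
    weights = model.named_steps["ridge"].coef_
    weight_norms[alpha] = np.linalg.norm(weights)
    print(f"alpha={alpha}: weight norm = {weight_norms[alpha]:.3f}")
```

Comparing the weight norms gives a numerical view of the flattening effect: a larger α should yield a smaller weight vector and therefore less extreme predictions.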
As with Linear Regression, we can perform Ridge Regression either by computing a
closed-form equation or by performing Gradient Descent. The pros and cons are the
same. Equation 4-9 shows the closed-form solution (where A is the (n + 1) × (n + 1) identity
matrix13 except with a 0 in the top-left cell, corresponding to the bias term).
14 Alternatively you can use the Ridge class with the "sag" solver. Stochastic Average GD is a variant of SGD.
For more details, see the presentation “Minimizing Finite Sums with the Stochastic Average Gradient Algo‐
rithm” by Mark Schmidt et al. from the University of British Columbia.
Figure 4-17. Ridge Regression
Equation 4-9. Ridge Regression closed-form solution
$$\hat{\boldsymbol{\theta}} = \left(\mathbf{X}^T \mathbf{X} + \alpha \mathbf{A}\right)^{-1} \mathbf{X}^T \mathbf{y}$$
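As a rough check of the closed-form expression, it can be evaluated directly with NumPy. The data below is synthetic and the value of `alpha` is an assumption. `np.linalg.solve` is used instead of an explicit matrix inverse, which is a common numerical choice rather than something the text prescribes.

```python
import numpy as np

# Synthetic data with three features (assumed for illustration only)
rng = np.random.default_rng(0)
m = 50
X = rng.normal(size=(m, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=m)

X_b = np.c_[np.ones((m, 1)), X]   # add x0 = 1 for the bias term
alpha = 1.0                       # hypothetical regularization strength

A = np.eye(X_b.shape[1])          # identity matrix sized to match X_b's columns
A[0, 0] = 0.0                     # 0 in the top-left cell: the bias is not regularized

# Solve (X^T X + alpha A) theta = X^T y
theta_ridge = np.linalg.solve(X_b.T @ X_b + alpha * A, X_b.T @ y)
theta_ols = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)   # alpha = 0 for comparison

print("Ridge:", theta_ridge)
print("Plain Linear Regression:", theta_ols)
```

Setting `alpha` to 0 recovers the Normal Equation of plain Linear Regression, which is why the second solve serves as a baseline.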
Here is how to perform Ridge Regression with Scikit-Learn using a closed-form solution
(a variant of Equation 4-9 using a matrix factorization technique by André-Louis
Cholesky):
>>> from sklearn.linear_model import Ridge
>>> ridge_reg = Ridge(alpha=1, solver="cholesky")
>>> ridge_reg.fit(X, y)
And using Stochastic Gradient Descent:14
>>> from sklearn.linear_model import SGDRegressor
>>> sgd_reg = SGDRegressor(penalty="l2")
>>> sgd_reg.fit(X, y.ravel())
The penalty hyperparameter sets the type of regularization term to use. Specifying
"l2" indicates that you want SGD to add a regularization term to the cost function
equal to half the square of the ℓ2 norm of the weight vector: this is simply Ridge
Regression.
Least Absolute Shrinkage and Selection Operator Regression (simply called Lasso
Regression) is another regularized version of Linear Regression: just like Ridge
Regression, it adds a regularization term to the cost function, but it uses the ℓ1 norm
of the weight vector instead of half the square of the ℓ2 norm (see Equation 4-10).
Equation 4-10. Lasso Regression cost function
$$J(\boldsymbol{\theta}) = \mathrm{MSE}(\boldsymbol{\theta}) + \alpha \sum_{i=1}^{n} |\theta_i|$$
Figure 4-18 shows the same thing as Figure 4-17 but replaces Ridge models with
Lasso models and uses smaller α values.
Figure 4-18. Lasso Regression
An important characteristic of Lasso Regression is that it tends to completely elimi‐
nate the weights of the least important features (i.e., set them to zero). For example,
the dashed line in the right plot on Figure 4-18 (with $\alpha = 10^{-7}$) looks quadratic, almost
linear: all the weights for the high-degree polynomial features are equal to zero. In
other words, Lasso Regression automatically performs feature selection and outputs a
sparse model (i.e., with few nonzero feature weights).
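The feature-selection effect can be observed on a small synthetic problem. In the sketch below, only the first two of ten features influence the target. The data, the `alpha` value, and the comparison with Ridge are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Ten features, but only the first two matter (assumed for illustration only)
rng = np.random.default_rng(42)
m, n = 200, 10
X = rng.normal(size=(m, n))
true_w = np.zeros(n)
true_w[:2] = [3.0, -2.0]
y = X @ true_w + 0.5 * rng.normal(size=m)

lasso = Lasso(alpha=0.2).fit(X, y)   # hypothetical alpha
ridge = Ridge(alpha=0.2).fit(X, y)

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("Lasso weights:", np.round(lasso.coef_, 3))
print("Ridge weights:", np.round(ridge.coef_, 3))
print("exact zeros: Lasso =", n_zero_lasso, ", Ridge =", n_zero_ridge)
```

Ridge shrinks the useless weights toward zero without eliminating them, whereas Lasso is expected to set most of them to exactly zero.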
You can get a sense of why this is the case by looking at Figure 4-19: on the top-left
plot, the background contours (ellipses) represent an unregularized MSE cost function
(α = 0), and the white circles show the Batch Gradient Descent path with that
cost function. The foreground contours (diamonds) represent the ℓ1 penalty, and the
triangles show the BGD path for this penalty only (α → ∞). Notice how the path first
[…] On the top-right plot, the contours represent the same cost function plus an ℓ1
penalty with α = 0.5. The global minimum is on the θ2 = 0 axis. BGD first reaches
θ2 = 0, then rolls down the gutter until it reaches the global minimum. The two bottom
plots show the same thing but use an ℓ2 penalty instead. The regularized minimum is
closer to θ = 0 than the unregularized minimum, but the weights do not get fully eliminated.
15 You can think of a subgradient vector at a nondifferentiable point as an intermediate vector between the
gradient vectors around that point.
Figure 4-19. Lasso versus Ridge regularization
On the Lasso cost function, the BGD path tends to bounce across
the gutter toward the end. This is because the slope changes
abruptly at θ2 = 0. You need to gradually reduce the learning rate in
order to actually converge to the global minimum.
The Lasso cost function is not differentiable at θi = 0 (for i = 1, 2, ⋯, n), but Gradient
Descent still works fine if you use a subgradient vector g15 instead when any θi = 0.
Equation 4-11 shows a subgradient vector equation you can use for Gradient Descent
with the Lasso cost function.
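Equation 4-11. Lasso Regression subgradient vector
$$g(\boldsymbol{\theta}, J) = \nabla_{\boldsymbol{\theta}} \mathrm{MSE}(\boldsymbol{\theta}) + \alpha \begin{pmatrix} \operatorname{sign}(\theta_1) \\ \operatorname{sign}(\theta_2) \\ \vdots \\ \operatorname{sign}(\theta_n) \end{pmatrix} \quad \text{where } \operatorname{sign}(\theta_i) = \begin{cases} -1 & \text{if } \theta_i < 0 \\ 0 & \text{if } \theta_i = 0 \\ +1 & \text{if } \theta_i > 0 \end{cases}$$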
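The subgradient above translates almost directly into NumPy, since `np.sign` returns 0 at 0. The following sketch assumes synthetic data in which the second feature is useless. The helper name `lasso_subgradient`, the value of `alpha`, and the decaying learning schedule are all hypothetical choices for illustration.

```python
import numpy as np

# Two features; only the first one influences y (assumed for illustration only)
rng = np.random.default_rng(42)
m = 200
X = rng.normal(size=(m, 2))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=m)
X_b = np.c_[np.ones(m), X]   # add x0 = 1 for the bias term

def lasso_subgradient(theta, X_b, y, alpha):
    """MSE gradient plus alpha * sign(theta_i) for the feature weights."""
    mse_grad = 2 / len(y) * X_b.T @ (X_b @ theta - y)
    signs = np.sign(theta)   # -1, 0, or +1, as in the sign function above
    signs[0] = 0.0           # the bias term is not regularized
    return mse_grad + alpha * signs

alpha, eta0, n_steps = 0.5, 0.1, 5000   # hypothetical settings
theta = np.zeros(3)
for t in range(n_steps):
    eta = eta0 / (1 + 0.01 * t)         # gradually reduce the learning rate
    theta -= eta * lasso_subgradient(theta, X_b, y, alpha)

print("bias, weight 1, weight 2:", theta)
```

With a constant learning rate the weight of the useless feature would keep bouncing around zero. The decaying schedule is what lets it settle very close to zero.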
Here is a small Scikit-Learn example using the Lasso class. Note that you could
instead use an SGDRegressor(penalty="l1").
>>> from sklearn.linear_model import Lasso
>>> lasso_reg = Lasso(alpha=0.1)
>>> lasso_reg.fit(X, y)
Elastic Net is a middle ground between Ridge Regression and Lasso Regression. The
regularization term is a simple mix of both Ridge and Lasso’s regularization terms,
and you can control the mix ratio r. When r = 0, Elastic Net is equivalent to Ridge
Regression, and when r = 1, it is equivalent to Lasso Regression (see Equation 4-12).
Equation 4-12. Elastic Net cost function
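$$J(\boldsymbol{\theta}) = \mathrm{MSE}(\boldsymbol{\theta}) + r\alpha \sum_{i=1}^{n} |\theta_i| + \frac{1-r}{2} \alpha \sum_{i=1}^{n} \theta_i^2$$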
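To make the role of the mix ratio concrete, the regularization term of Equation 4-12 can be computed on its own. The function name `elastic_net_penalty` and the example weight vector are hypothetical and serve only to illustrate the two limiting cases.

```python
import numpy as np

def elastic_net_penalty(w, alpha, r):
    """Regularization term of the Elastic Net cost for feature weights w."""
    l1_term = np.sum(np.abs(w))        # Lasso part: l1 norm of w
    l2_term = 0.5 * np.sum(w ** 2)     # Ridge part: half the squared l2 norm of w
    return r * alpha * l1_term + (1 - r) * alpha * l2_term

w = np.array([0.5, -2.0, 0.0])   # example feature weights (bias excluded)
alpha = 0.1

ridge_term = elastic_net_penalty(w, alpha, r=0.0)   # pure Ridge penalty
lasso_term = elastic_net_penalty(w, alpha, r=1.0)   # pure Lasso penalty
mixed_term = elastic_net_penalty(w, alpha, r=0.5)   # halfway between the two
print(ridge_term, lasso_term, mixed_term)
```

Because the term is linear in r, any intermediate ratio gives a penalty that lies between the pure Ridge and pure Lasso values.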
So when should you use Linear Regression, Ridge, Lasso, or Elastic Net? It is almost
always preferable to have at least a little bit of regularization, so generally you should
avoid plain Linear Regression. Ridge is a good default, but if you suspect that only a
few features are actually useful, you should prefer Lasso or Elastic Net since they tend
to reduce the useless features’ weights down to zero as we have discussed. In general,
Elastic Net is preferred over Lasso since Lasso may behave erratically when the num‐
ber of features is greater than the number of training instances or when several fea‐
tures are strongly correlated.
Here is a short example using Scikit-Learn’s ElasticNet (l1_ratio corresponds to
the mix ratio r):
>>> from sklearn.linear_model import ElasticNet
>>> elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
>>> elastic_net.fit(X, y)
A very different way to regularize iterative learning algorithms such as Gradient
Descent is to stop training as soon as the validation error reaches a minimum. This is
called early stopping. Figure 4-20 shows a complex model (in this case a high-degree
Polynomial Regression model) being trained using Batch Gradient Descent. As the
epochs go by, the algorithm learns and its prediction error (RMSE) on the training set
naturally goes down, and so does its prediction error on the validation set. However,
after a while the validation error stops decreasing and actually starts to go back up.
This indicates that the model has started to overfit the training data. With early stop‐
ping you just stop training as soon as the validation error reaches the minimum. It is
such a simple and efficient regularization technique that Geoffrey Hinton called it a
“beautiful free lunch.”
Figure 4-20. Early stopping regularization
With Stochastic and Mini-batch Gradient Descent, the curves are
not so smooth, and it may be hard to know whether you have
reached the minimum or not. One solution is to stop only after the
validation error has been above the minimum for some time (when
you are confident that the model will not do any better), then roll
back the model parameters to the point where the validation error
was at a minimum.
Here is a basic implementation of early stopping:
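from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# max_iter=1 with warm_start=True runs a single epoch per fit() call;
# the constant learning rate and the eta0 value are assumed settings
sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True, penalty=None,
                       learning_rate="constant", eta0=0.0005)
minimum_val_error = float("inf")
best_epoch = None
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train_poly_scaled, y_train)  # continues where it left off
    y_val_predict = sgd_reg.predict(X_val_poly_scaled)
    val_error = mean_squared_error(y_val, y_val_predict)
    if val_error < minimum_val_error:
        minimum_val_error = val_error
        best_epoch = epoch
        # deepcopy keeps the learned weights; clone() would return an unfitted copy
        best_model = deepcopy(sgd_reg)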
Note that with warm_start=True, when the fit() method is called, it just continues
training where it left off instead of restarting from scratch.
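The strategy mentioned earlier for noisy validation curves (wait until the validation error has stayed above its minimum for some time, then roll back) can be sketched with plain NumPy. Everything specific here is an assumption for illustration: the synthetic quadratic data, the helper names `make_features` and `mse`, the `patience` value, and the mini-batch settings.

```python
import numpy as np

# Noisy quadratic data (assumed for illustration only)
rng = np.random.default_rng(42)
m = 100
x = rng.uniform(-3, 3, size=m)
y = 0.5 * x**2 + x + 2 + rng.normal(size=m)

def make_features(x, mean=None, std=None, degree=10):
    """Polynomial features up to `degree`, standardized, plus a bias column."""
    P = np.column_stack([x**d for d in range(1, degree + 1)])
    if mean is None:
        mean, std = P.mean(axis=0), P.std(axis=0)
    return np.c_[np.ones(len(x)), (P - mean) / std], mean, std

X_train, mean, std = make_features(x[:80])
X_val, _, _ = make_features(x[80:], mean, std)   # reuse training statistics
y_train, y_val = y[:80], y[80:]

def mse(theta, X, y):
    return np.mean((X @ theta - y) ** 2)

eta, batch_size, patience, max_epochs = 0.002, 10, 50, 2000  # hypothetical settings
theta = np.zeros(X_train.shape[1])
best_val_error, best_epoch, best_theta = float("inf"), None, None
val_errors = []

for epoch in range(max_epochs):
    order = rng.permutation(len(y_train))
    for start in range(0, len(order), batch_size):   # one pass of Mini-batch GD
        idx = order[start:start + batch_size]
        grad = 2 / len(idx) * X_train[idx].T @ (X_train[idx] @ theta - y_train[idx])
        theta -= eta * grad
    val_error = mse(theta, X_val, y_val)
    val_errors.append(val_error)
    if val_error < best_val_error:
        best_val_error, best_epoch, best_theta = val_error, epoch, theta.copy()
    elif epoch - best_epoch >= patience:
        break   # no improvement for `patience` epochs: stop training

theta = best_theta   # roll back to the parameters with the lowest validation error
print("stopped after epoch", epoch, "- best epoch:", best_epoch)
```

A larger `patience` makes the stopping rule more tolerant of noise in the validation curve, at the cost of extra epochs that are eventually discarded by the rollback.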