If you perform high-degree Polynomial Regression, you will likely fit the training
data much better than with plain Linear Regression. For example, Figure 4-14 applies
a 300-degree polynomial model to the preceding training data, and compares the
result with a pure linear model and a quadratic model (2nd-degree polynomial).
Notice how the 300-degree polynomial model wiggles around to get as close as possi‐
ble to the training instances.
Figure 4-14. High-degree Polynomial Regression
Of course, this high-degree Polynomial Regression model is severely overfitting the
training data, while the linear model is underfitting it. The model that will generalize
best in this case is the quadratic model. It makes sense since the data was generated
using a quadratic model, but in general you won’t know what function generated the
data, so how can you decide how complex your model should be? How can you tell
that your model is overfitting or underfitting the data?
performance. If a model performs well on the training data but generalizes poorly
according to the cross-validation metrics, then your model is overfitting. If it per‐
forms poorly on both, then it is underfitting. This is one way to tell when a model is
too simple or too complex.
Another way is to look at the learning curves: these are plots of the model’s perfor‐
mance on the training set and the validation set as a function of the training set size.
To generate the plots, simply train the model several times on different sized subsets
of the training set. The following code defines a function that plots the learning
curves of a model given some training data:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
def plot_learning_curves(model, X, y):
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
train_errors, val_errors = , 
for m in range(1, len(X_train)):
y_train_predict = model.predict(X_train[:m])
y_val_predict = model.predict(X_val)
plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")
Let’s look at the learning curves of the plain Linear Regression model (a straight line;
lin_reg = LinearRegression()
plot_learning_curves(lin_reg, X, y)
Figure 4-15. Learning curves
data: when there are just one or two instances in the training set, the model can fit
them perfectly, which is why the curve starts at zero. But as new instances are added
to the training set, it becomes impossible for the model to fit the training data per‐
fectly, both because the data is noisy and because it is not linear at all. So the error on
the training data goes up until it reaches a plateau, at which point adding new instan‐
ces to the training set doesn’t make the average error much better or worse. Now let’s
look at the performance of the model on the validation data. When the model is
trained on very few training instances, it is incapable of generalizing properly, which
is why the validation error is initially quite big. Then as the model is shown more
training examples, it learns and thus the validation error slowly goes down. However,
once again a straight line cannot do a good job modeling the data, so the error ends
up at a plateau, very close to the other curve.
These learning curves are typical of an underfitting model. Both curves have reached
a plateau; they are close and fairly high.
If your model is underfitting the training data, adding more train‐
ing examples will not help. You need to use a more complex model
or come up with better features.
Now let’s look at the learning curves of a 10th-degree polynomial model on the same
data (Figure 4-16):
from sklearn.pipeline import Pipeline
polynomial_regression = Pipeline((
("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
plot_learning_curves(polynomial_regression, X, y)
These learning curves look a bit like the previous ones, but there are two very impor‐
• The error on the training data is much lower than with the Linear Regression
• There is a gap between the curves. This means that the model performs signifi‐
cantly better on the training data than on the validation data, which is the hall‐
mark of an overfitting model. However, if you used a much larger training set,
the two curves would continue to get closer.
10 This notion of bias is not to be confused with the bias term of linear models.
Figure 4-16. Learning curves for the polynomial model
One way to improve an overfitting model is to feed it more training
data until the validation error reaches the training error.
The Bias/Variance Tradeoff
An important theoretical result of statistics and Machine Learning is the fact that a
model’s generalization error can be expressed as the sum of three very different
This part of the generalization error is due to wrong assumptions, such as assum‐
ing that the data is linear when it is actually quadratic. A high-bias model is most
likely to underfit the training data.10
This part is due to the model’s excessive sensitivity to small variations in the
training data. A model with many degrees of freedom (such as a high-degree pol‐
ynomial model) is likely to have high variance, and thus to overfit the training
This part is due to the noisiness of the data itself. The only way to reduce this
part of the error is to clean up the data (e.g., fix the data sources, such as broken
sensors, or detect and remove outliers).
Increasing a model’s complexity will typically increase its variance and reduce its bias.
Conversely, reducing a model’s complexity increases its bias and reduces its variance.
This is why it is called a tradeoff.
Gated recurrent units A gated recurrent unit (GRU) is a type of recurrent block that was introduced in 2014 by Kyunghyun Cho et al. (https://arxiv.org/abs/1406.1078, https://arxiv.org/abs/ 1412.3555), as an improvement over LSTM (see the following diagram). A GRU unit usually has similar or better performance to a LSTM, but it does so with fewer parameters […]
Behavioral cloning with PyTorch In this section, we'll implement something fun, a behavioral cloning example with PyTorch. We'll do this with the help of the CarRacing-v0 OpenAI Gym environment, displayed as follows: In the CarRacing-v0 environment, the agent is a racing car. A birds-eye view is used the whole time This example contains multiple Python […]
In short, since your main task is to select a learning algorithm and train it on some data, the two things that can go wrong are “bad algorithm” and “bad data.” Let’s start with examples of bad data. Insufficient Quantity of Training Data For a toddler to learn what an apple is, all it takes […]
Using Keras to classify images of objects With Keras, it's easy to create neural nets, but it's also easy to download test datasets. Let's try to use the CIFAR-10 (Canadian Institute For Advanced Research, https://www.cs. toronto.edu/~kriz/cifar.html) dataset instead of MNIST. It consists of 60,000 32x32 RGB images, divided into 10 classes of objects, namely: airplanes, […]
Another kind of constraint that often leads to good feature extraction is sparsity: by adding an appropriate term to the cost function, the autoencoder is pushed to reduce the number of active neurons in the coding layer. For example, it may be pushed to have on average only 5% significantly active neurons in the coding […]
5 “Auto-Encoding Variational Bayes,” D. Kingma and M. Welling (2014). Another important category of autoencoders was introduced in 2014 by Diederik Kingma and Max Welling,5 and has quickly become one of the most popular types of autoencoders: variational autoencoders. They are quite different from all the autoencoders we have discussed so far, in partic‐ ular: […]