In Chapter 1, we looked at a simple regression model of life satisfaction: life_satisfaction = θ0 + θ1 × GDP_per_capita.
This model is just a linear function of the input feature GDP_per_capita. θ0 and θ1 are
the model’s parameters.
More generally, a linear model makes a prediction by simply computing a weighted
sum of the input features, plus a constant called the bias term (also called the intercept
term), as shown in Equation 4-1.
Equation 4-1. Linear Regression model prediction
ŷ = θ0 + θ1x1 + θ2x2 + ⋯ + θnxn
• ŷ is the predicted value.
• n is the number of features.
• xi is the ith feature value.
• θj is the jth model parameter (including the bias term θ0 and the feature weights
θ1, θ2, ⋯, θn).
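To make this concrete, here is a minimal NumPy sketch of Equation 4-1; the parameter and feature values are made up purely for illustration:
import numpy as np

theta = np.array([4.0, 3.0, -0.5])   # hypothetical parameters: bias θ0 and weights θ1, θ2
x = np.array([1.2, 0.7])             # hypothetical feature values x1, x2
y_hat = theta[0] + theta[1] * x[0] + theta[2] * x[1]   # weighted sum of the features plus the bias (Equation 4-1)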
1 It is often the case that a learning algorithm will try to optimize a different function than the performance
measure used to evaluate the final model. This is generally because that function is easier to compute, because
it has useful differentiation properties that the performance measure lacks, or because we want to constrain
the model during training, as we will see when we discuss regularization.
Equation 4-2. Linear Regression model prediction (vectorized form)
ŷ = hθ(x) = θT · x
• θ is the model’s parameter vector, containing the bias term θ0 and the feature
weights θ1 to θn.
• θT is the transpose of θ (a row vector instead of a column vector).
• x is the instance’s feature vector, containing x0 to xn, with x0 always equal to 1.
• θT · x is the dot product of θT and x.
• hθ is the hypothesis function, using the model parameters θ.
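In the vectorized form this becomes a single dot product; here is a minimal sketch with the same made-up values (note the extra x0 = 1 prepended to the feature vector):
theta = np.array([4.0, 3.0, -0.5])   # parameter vector containing θ0, θ1, θ2
x = np.array([1.0, 1.2, 0.7])        # feature vector with x0 = 1 prepended
y_hat = theta.dot(x)                 # θT · x; for 1-D NumPy arrays, dot() is the inner product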
Okay, that’s the Linear Regression model, so now how do we train it? Well, recall that
training a model means setting its parameters so that the model best fits the training
set. For this purpose, we first need a measure of how well (or poorly) the model fits
the training data. In Chapter 2 we saw that the most common performance measure
of a regression model is the Root Mean Square Error (RMSE) (Equation 2-1). Therefore, to train a Linear Regression model, you need to find the value of θ that minimizes the RMSE. In practice, it is simpler to minimize the Mean Square Error (MSE)
than the RMSE, and it leads to the same result (because the value that minimizes a
function also minimizes its square root).1
The MSE of a Linear Regression hypothesis hθ on a training set X is calculated using Equation 4-3.
Equation 4-3. MSE cost function for a Linear Regression model
MSE(X, hθ) = (1/m) ∑i=1..m (θT · x(i) − y(i))²
Most of these notations were presented in Chapter 2 (see “Notations” on page 38).
The only difference is that we write hθ instead of just h in order to make it clear that
the model is parametrized by the vector θ. To simplify notations, we will just write
MSE(θ) instead of MSE(X, hθ).
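As a quick illustration, the MSE can be computed in a couple of NumPy lines. This sketch assumes X_b is the training matrix with the bias column x0 = 1 already added (as in the code later in this chapter), theta is a candidate parameter vector, and y is the vector of targets:
predictions = X_b.dot(theta)             # θT · x(i) for every instance, as one matrix product
mse = np.mean((predictions - y) ** 2)    # Equation 4-3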
2 The demonstration that this returns the value of θ that minimizes the cost function is outside the scope of this book.
To find the value of θ that minimizes the cost function, there is a closed-form solution
—in other words, a mathematical equation that gives the result directly. This is called
the Normal Equation (Equation 4-4).2
Equation 4-4. Normal Equation
θ̂ = (XT · X)⁻¹ · XT · y
• θ̂ is the value of θ that minimizes the cost function.
• y is the vector of target values containing y(1) to y(m).
Let’s generate some linear-looking data to test this equation on (Figure 4-1):
import numpy as np
X = 2 * np.random.rand(100, 1)              # 100 instances, one feature, uniform in [0, 2)
y = 4 + 3 * X + np.random.randn(100, 1)     # linear target plus standard Gaussian noise
Figure 4-1. Randomly generated linear dataset
Now let’s compute θ using the Normal Equation. We will use the inv() function from
NumPy’s Linear Algebra module (np.linalg) to compute the inverse of a matrix, and
the dot() method for matrix multiplication:
X_b = np.c_[np.ones((100, 1)), X] # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)  # Normal Equation (Equation 4-4)
The actual function that we used to generate the data is y = 4 + 3x1 + Gaussian noise.
Let’s see what the equation found:
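In a Python shell you would simply inspect theta_best; the trailing digits are omitted here, and your exact values will differ since the data was generated with random noise:
>>> theta_best
array([[ 3.865...],
       [ 3.139...]])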
We would have hoped for θ0 = 4 and θ1 = 3 instead of θ0 = 3.865 and θ1 = 3.139. Close enough, but the noise made it impossible to recover the exact parameters of the original function.
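As a side note, you do not have to invert XT · X yourself: NumPy's least-squares solver or pseudoinverse computes the same solution and is numerically more robust. A minimal sketch:
theta_best_lstsq, residuals, rank, sv = np.linalg.lstsq(X_b, y, rcond=None)   # least-squares solution
theta_best_pinv = np.linalg.pinv(X_b).dot(y)                                  # same result via the Moore-Penrose pseudoinverse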
Now you can make predictions using θ̂:
>>> X_new = np.array([[0], [2]])
>>> X_new_b = np.c_[np.ones((2, 1)), X_new] # add x0 = 1 to each instance
>>> y_predict = X_new_b.dot(theta_best)
Let’s plot this model’s predictions (Figure 4-2):
import matplotlib.pyplot as plt

plt.plot(X_new, y_predict, "r-")
plt.plot(X, y, "b.")
plt.axis([0, 2, 0, 15])
plt.show()
Figure 4-2. Linear Regression model predictions
The equivalent code using Scikit-Learn looks like this:
>>> from sklearn.linear_model import LinearRegression
>>> lin_reg = LinearRegression()
>>> lin_reg.fit(X, y)
>>> lin_reg.intercept_, lin_reg.coef_
(array([ 4.21509616]), array([[ 2.77011339]]))
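Once fitted, the model's predict() method produces the same kind of predictions as the manual dot product earlier; for example, reusing the X_new array defined above:
>>> lin_reg.predict(X_new)   # ≈ [[4.215], [9.755]], given the intercept and coefficient above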
The Normal Equation computes the inverse of XT · X, which is an (n + 1) × (n + 1) matrix (where n is the number of features). The computational complexity of inverting such a matrix is typically about O(n^2.4) to O(n^3), depending on the implementation. In other words, if you double the number of features, you multiply the computation time by roughly 2^2.4 ≈ 5.3 to 2^3 = 8.
The Normal Equation gets very slow when the number of features
grows large (e.g., 100,000).
On the positive side, this equation is linear with regard to the number of instances in
the training set (it is O(m)), so it handles large training sets efficiently, provided they
can fit in memory.
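If you want to observe this scaling behavior yourself, here is a small, self-contained timing script (a hypothetical experiment, not from the text); the absolute timings depend on your machine and NumPy build, so only the growth rate is meaningful:
import time
import numpy as np

m = 1000                                    # fixed number of training instances
for n in (100, 200, 400, 800):              # number of features, doubling each step
    X = np.random.rand(m, n)
    y = np.random.rand(m, 1)
    X_b = np.c_[np.ones((m, 1)), X]         # add the bias column x0 = 1
    t0 = time.perf_counter()
    theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
    print(n, round(time.perf_counter() - t0, 4))   # time should grow roughly as n^2.4 to n^3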
Also, once you have trained your Linear Regression model (using the Normal Equation or any other algorithm), predictions are very fast: the computational complexity is linear with regard to both the number of instances you want to make predictions
on and the number of features. In other words, making predictions on twice as many
instances (or twice as many features) will just take roughly twice as much time.
Now we will look at very different ways to train a Linear Regression model, better
suited for cases where there are a large number of features, or too many training
instances to fit in memory.