Understand Linear Regression in Machine Learning: A Complete Guide for Beginners

In Chapter 1, we looked at a simple regression model of life satisfaction: life_satisfaction = θ0 + θ1 × GDP_per_capita.

This model is just a linear function of the input feature GDP_per_capita. θ0 and θ1 are

the model’s parameters.

More generally, a linear model makes a prediction by simply computing a weighted

sum of the input features, plus a constant called the bias term (also called the intercept

term), as shown in Equation 4-1.

Equation 4-1. Linear Regression model prediction

$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$

ŷ is the predicted value.

n is the number of features.

xi is the ith feature value.

θj is the jth model parameter (including the bias term θ0 and the feature weights θ1, θ2, ⋯, θn).
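As a minimal sketch of Equation 4-1, the prediction for a single instance can be written as an explicit weighted sum. The helper name `predict_one` and the parameter and feature values are assumptions made up for illustration only.

```python
def predict_one(theta, x):
    """Return theta[0] + theta[1]*x[0] + ... + theta[n]*x[n-1].

    theta: list of n + 1 parameters, bias term first.
    x: list of n feature values.
    """
    if len(theta) != len(x) + 1:
        raise ValueError("theta must have exactly one more entry than x")
    y_hat = theta[0]  # start from the bias term
    for weight, feature in zip(theta[1:], x):
        y_hat += weight * feature  # add each weighted feature
    return y_hat


# Illustrative values (not taken from any dataset)
theta = [4.0, 3.0, -2.0]  # bias, weight of feature 1, weight of feature 2
x = [1.5, 0.5]            # feature 1, feature 2
y_hat = predict_one(theta, x)
print(y_hat)  # 4 + 3*1.5 - 2*0.5 = 7.5
```

The loop makes the "weighted sum plus bias" structure explicit; in practice the same computation is usually done with a single vector operation.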

1 It is often the case that a learning algorithm will try to optimize a different function than the performance

measure used to evaluate the final model. This is generally because that function is easier to compute, because

it has useful differentiation properties that the performance measure lacks, or because we want to constrain

the model during training, as we will see when we discuss regularization.

This can be written much more concisely using a vectorized form, as shown in Equation 4-2.

Equation 4-2. Linear Regression model prediction (vectorized form)

$$\hat{y} = h_\theta(\mathbf{x}) = \theta^T \cdot \mathbf{x}$$

θ is the model’s parameter vector, containing the bias term θ0 and the feature

weights θ1 to θn.

θ^T is the transpose of θ (a row vector instead of a column vector).

x is the instance’s feature vector, containing x0 to xn, with x0 always equal to 1.

θ^T · x is the dot product of θ^T and x.

hθ is the hypothesis function, using the model parameters θ.
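The vectorized form can be sketched with NumPy as follows. The values are made up for illustration; the only convention taken from the text is that x0 = 1 is prepended to the feature vector so that the bias term is handled by the same dot product.

```python
import numpy as np

# Illustrative parameter column vector: bias term first, then feature weights
theta = np.array([[4.0], [3.0], [-2.0]])

# One instance with two features; prepend x0 = 1 and make it a column vector
features = np.array([1.5, 0.5])
x = np.r_[1.0, features].reshape(-1, 1)

# Equation 4-2: y_hat = theta^T . x (a 1x1 matrix, converted to a scalar)
y_hat = theta.T.dot(x).item()
print(y_hat)

# The same idea for several instances at once: one row per instance
X = np.array([[1.5, 0.5],
              [0.0, 1.0]])
X_b = np.c_[np.ones((len(X), 1)), X]  # add x0 = 1 to each instance
y_hats = X_b.dot(theta)               # shape (2, 1), one prediction per row
print(y_hats)
```

Stacking instances as rows turns the per-instance dot product into a single matrix product, which is the layout used in the rest of this section.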

Okay, that’s the Linear Regression model, so now how do we train it? Well, recall that

training a model means setting its parameters so that the model best fits the training

set. For this purpose, we first need a measure of how well (or poorly) the model fits

the training data. In Chapter 2 we saw that the most common performance measure

of a regression model is the Root Mean Square Error (RMSE) (Equation 2-1). Therefore, to train a Linear Regression model, you need to find the value of θ that minimizes the RMSE. In practice, it is simpler to minimize the Mean Square Error (MSE) than the RMSE, and it leads to the same result (because the value that minimizes a nonnegative function also minimizes its square root).1
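The claim that MSE and RMSE share the same minimizer can be checked numerically. The sketch below is only an illustration under simplifying assumptions: the data is synthetic, the bias is held fixed at 4, and only a grid of candidate slopes is searched.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
x = 2 * rng.random(50)
y = 4 + 3 * x + rng.standard_normal(50)

# Candidate values for the slope theta1 (bias fixed at 4 for simplicity)
candidates = np.linspace(0.0, 6.0, 601)

# Cost of each candidate under both measures
mse = np.array([np.mean((4 + t * x - y) ** 2) for t in candidates])
rmse = np.sqrt(mse)

best_mse = candidates[np.argmin(mse)]
best_rmse = candidates[np.argmin(rmse)]
print(best_mse, best_rmse)  # the two minimizers coincide
```

Because the square root is monotonically increasing on nonnegative values, it changes the cost values but not their ordering, so the best candidate is the same under both measures.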

The MSE of a Linear Regression hypothesis hθ on a training set X is calculated using

Equation 4-3.

Equation 4-3. MSE cost function for a Linear Regression model

$$\mathrm{MSE}(\mathbf{X}, h_\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \theta^T \cdot \mathbf{x}^{(i)} - y^{(i)} \right)^2$$

Most of these notations were presented in Chapter 2 (see “Notations” on page 38).

The only difference is that we write hθ instead of just h in order to make it clear that the model is parametrized by the vector θ. To simplify the notation, we will just write MSE(θ) instead of MSE(X, hθ).
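A minimal sketch of the MSE cost function in NumPy might look like this. The function name `mse` and the tiny dataset are assumptions chosen for illustration; `X_b` is assumed to already contain the leading column of ones (x0 = 1).

```python
import numpy as np


def mse(theta, X_b, y):
    """Mean squared error of the predictions X_b . theta against targets y.

    X_b: array of shape (m, n + 1), first column all ones.
    theta: array of shape (n + 1, 1).
    y: array of shape (m, 1).
    """
    errors = X_b.dot(theta) - y  # one prediction error per instance
    return float(np.mean(errors ** 2))


# Three illustrative instances lying exactly on the line y = 4 + 3x
X_b = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])
y = np.array([[4.0], [7.0], [10.0]])

perfect = mse(np.array([[4.0], [3.0]]), X_b, y)  # true parameters
off = mse(np.array([[4.0], [2.0]]), X_b, y)      # slope too small
print(perfect, off)  # 0.0 and 5/3 (errors 0, -1, -2)
```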

2 The demonstration that this returns the value of θ that minimizes the cost function is outside the scope of this


To find the value of θ that minimizes the cost function, there is a closed-form solution

—in other words, a mathematical equation that gives the result directly. This is called

the Normal Equation (Equation 4-4).2

Equation 4-4. Normal Equation

$$\hat{\theta} = \left( \mathbf{X}^T \cdot \mathbf{X} \right)^{-1} \cdot \mathbf{X}^T \cdot \mathbf{y}$$

θ̂ is the value of θ that minimizes the cost function.

y is the vector of target values containing y(1) to y(m).
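For readers who want some intuition, here is a brief sketch of where the Normal Equation comes from, not a full proof. It assumes that X is the matrix whose rows are the transposed feature vectors of the training instances (each starting with x0 = 1) and that X^T · X is invertible.

```latex
% MSE written in matrix form
\mathrm{MSE}(\theta) = \frac{1}{m} \left( \mathbf{X}\theta - \mathbf{y} \right)^T \left( \mathbf{X}\theta - \mathbf{y} \right)

% Gradient with respect to theta
\nabla_\theta \, \mathrm{MSE}(\theta) = \frac{2}{m} \, \mathbf{X}^T \left( \mathbf{X}\theta - \mathbf{y} \right)

% Setting the gradient to zero at the minimizer
\mathbf{X}^T \mathbf{X} \, \hat{\theta} = \mathbf{X}^T \mathbf{y}
\quad \Longrightarrow \quad
\hat{\theta} = \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{y}
```

Since the MSE is a convex function of θ, the point where the gradient vanishes is a global minimum.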

Let’s generate some linear-looking data to test this equation on (Figure 4-1):

import numpy as np

X = 2 * np.random.rand(100, 1)

y = 4 + 3 * X + np.random.randn(100, 1)

Figure 4-1. Randomly generated linear dataset

Now let’s compute θ using the Normal Equation. We will use the inv() function from

NumPy’s Linear Algebra module (np.linalg) to compute the inverse of a matrix, and

the dot() method for matrix multiplication:

X_b = np.c_[np.ones((100, 1)), X] # add x0 = 1 to each instance

theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

The actual function that we used to generate the data is y = 4 + 3x1 + Gaussian noise.

Let’s see what the equation found:

>>> theta_best

array([[ 4.21509616],

[ 2.77011339]])

We would have hoped for θ0 = 4 and θ1 = 3 instead of θ0 = 4.215 and θ1 = 2.770. Close enough, but the noise made it impossible to recover the exact parameters of the original function.
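Explicitly inverting a matrix can be numerically fragile, for example when features are nearly collinear. As a hedged alternative, which is not the method used in the text above, the sketch below solves the same least-squares problem with NumPy's `np.linalg.lstsq` and `np.linalg.pinv` and checks that all three approaches agree. The seed and the data are assumptions for illustration, so the numbers will differ from the output shown earlier.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.standard_normal((100, 1))
X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 to each instance

# Normal Equation with an explicit inverse, as in the text
theta_normal = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

# Least-squares solver: returns the solution, residuals, rank, singular values
theta_lstsq, residuals, rank, sv = np.linalg.lstsq(X_b, y, rcond=None)

# Moore-Penrose pseudoinverse of X_b applied to y
theta_pinv = np.linalg.pinv(X_b).dot(y)

print(theta_normal.ravel())
print(theta_lstsq.ravel())
print(theta_pinv.ravel())
```

On a well-conditioned problem like this one the three results match to floating-point precision; the solver-based variants are mainly useful when X^T · X is singular or close to it.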

Now you can make predictions using θ:

>>> X_new = np.array([[0], [2]])

>>> X_new_b = np.c_[np.ones((2, 1)), X_new] # add x0 = 1 to each instance

>>> y_predict = X_new_b.dot(theta_best)

>>> y_predict

array([[ 4.21509616],

[ 9.75532293]])

Let’s plot this model’s predictions (Figure 4-2):

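import matplotlib.pyplot as plt

plt.plot(X_new, y_predict, "r-")  # model predictions (red line)
plt.plot(X, y, "b.")              # training data (blue dots)
plt.axis([0, 2, 0, 15])
plt.show()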


Figure 4-2. Linear Regression model predictions

The equivalent code using Scikit-Learn looks like this:

>>> from sklearn.linear_model import LinearRegression

>>> lin_reg = LinearRegression()

>>> lin_reg.fit(X, y)

>>> lin_reg.intercept_, lin_reg.coef_

(array([ 4.21509616]), array([[ 2.77011339]]))

>>> lin_reg.predict(X_new)

array([[ 4.21509616],

[ 9.75532293]])

Computational Complexity

The Normal Equation computes the inverse of X^T · X, which is an (n + 1) × (n + 1) matrix (where n is the number of features). The computational complexity of inverting such a matrix is typically about O(n^2.4) to O(n^3) (depending on the implementation). In other words, if you double the number of features, you multiply the computation time by roughly 2^2.4 ≈ 5.3 to 2^3 = 8.
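The scaling arithmetic can be reproduced in a few lines. This is only a back-of-the-envelope sketch: the helper name `slowdown` is hypothetical, and the result is the asymptotic factor implied by the exponent, not a measured running time.

```python
def slowdown(feature_ratio, exponent):
    """Factor by which inversion time grows when the number of
    features is multiplied by feature_ratio, for cost O(n**exponent)."""
    return feature_ratio ** exponent


# Doubling the number of features
double_low = slowdown(2, 2.4)   # about 5.3
double_high = slowdown(2, 3.0)  # exactly 8.0

# Ten times as many features
tenfold_low = slowdown(10, 2.4)   # about 251
tenfold_high = slowdown(10, 3.0)  # exactly 1000.0

print(round(double_low, 1), double_high)
print(round(tenfold_low), tenfold_high)
```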

The Normal Equation gets very slow when the number of features

grows large (e.g., 100,000).

On the positive side, this equation is linear with regard to the number of instances in the training set (it is O(m)), so it handles large training sets efficiently, provided they can fit in memory.

Also, once you have trained your Linear Regression model (using the Normal Equation or any other algorithm), predictions are very fast: the computational complexity is linear with regard to both the number of instances you want to make predictions on and the number of features. In other words, making predictions on twice as many instances (or twice as many features) takes roughly twice as much time.

Now we will look at very different ways to train a Linear Regression model, better

suited for cases where there are a large number of features, or too many training

instances to fit in memory.