In Chapter 1, we looked at a simple regression model of life satisfaction: *life_satisfaction* = *θ*0 + *θ*1 × *GDP_per_capita*.

This model is just a linear function of the input feature `GDP_per_capita`. *θ*0 and *θ*1 are the model’s parameters.

More generally, a linear model makes a prediction by simply computing a weighted

sum of the input features, plus a constant called the *bias term* (also called the *intercept*

*term*), as shown in Equation 4-1.

*Equation 4-1. Linear Regression model prediction*

$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$

• *ŷ* is the predicted value.

• *n* is the number of features.

• *x*_{i} is the i^{th} feature value.

• *θ*_{j} is the j^{th} model parameter (including the bias term *θ*0 and the feature weights *θ*1, *θ*2, ⋯, *θ*_{n}).
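As a minimal sketch of Equation 4-1, the weighted sum can be written out term by term in plain Python. The parameter and feature values below are hypothetical, chosen only to make the arithmetic easy to follow, and `predict` is an illustrative helper name rather than a library function.

```python
# Hypothetical values, chosen only for illustration.
theta = [4.0, 3.0, -2.0]   # theta_0 (bias term), theta_1, theta_2
x = [1.5, 0.5]             # x_1, x_2 for a single instance


def predict(theta, x):
    """Return theta_0 + theta_1 * x_1 + ... + theta_n * x_n."""
    y_hat = theta[0]                        # start from the bias term
    for theta_j, x_j in zip(theta[1:], x):  # pair each weight with its feature
        y_hat += theta_j * x_j              # add the weighted feature value
    return y_hat


y_hat = predict(theta, x)
print(y_hat)  # 4 + 3 * 1.5 - 2 * 0.5 = 7.5
```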

1 It is often the case that a learning algorithm will try to optimize a different function than the performance

measure used to evaluate the final model. This is generally because that function is easier to compute, because

it has useful differentiation properties that the performance measure lacks, or because we want to constrain

the model during training, as we will see when we discuss regularization.

This can be written much more concisely using a vectorized form, as shown in Equation 4-2.

*Equation 4-2. Linear Regression model prediction (vectorized form)*

$$\hat{y} = h_{\theta}(\mathbf{x}) = \theta^{T} \cdot \mathbf{x}$$

• *θ* is the model’s *parameter vector*, containing the bias term *θ*0 and the feature weights *θ*1 to *θ*_{n}.

• *θ*^{T} is the transpose of *θ* (a row vector instead of a column vector).

• **x** is the instance’s *feature vector*, containing *x*0 to *x*_{n}, with *x*0 always equal to 1.

• *θ*^{T} · **x** is the dot product of *θ*^{T} and **x**.

• *h*_{θ} is the hypothesis function, using the model parameters *θ*.
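The vectorized form can be sketched with NumPy column vectors. The values are the same hypothetical ones as in a hand-computed weighted sum (bias 4, weights 3 and -2), with *x*0 = 1 prepended to the feature vector so that the bias term is handled by the dot product itself.

```python
import numpy as np

# Hypothetical values, chosen only for illustration.
theta = np.array([[4.0], [3.0], [-2.0]])  # column vector: theta_0, theta_1, theta_2
x = np.array([[1.0], [1.5], [0.5]])       # column vector: x_0 = 1, then x_1, x_2

# theta^T . x is a 1x1 matrix; .item() extracts the scalar prediction.
y_hat = (theta.T @ x).item()
print(y_hat)  # 7.5

# For several instances stacked as rows (each starting with x_0 = 1),
# a single matrix product gives all predictions at once.
X_b = np.array([[1.0, 1.5, 0.5],
                [1.0, 0.0, 0.0]])
y_hat_all = X_b @ theta                   # shape (2, 1)
```

Stacking instances as rows is the layout the rest of this section uses for `X_b`.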

Okay, that’s the Linear Regression model, so now how do we train it? Well, recall that training a model means setting its parameters so that the model best fits the training set. For this purpose, we first need a measure of how well (or poorly) the model fits the training data. In Chapter 2 we saw that the most common performance measure of a regression model is the Root Mean Square Error (RMSE) (Equation 2-1). Therefore, to train a Linear Regression model, you need to find the value of *θ* that minimizes the RMSE. In practice, it is simpler to minimize the Mean Square Error (MSE) than the RMSE, and it leads to the same result (because the value that minimizes a function also minimizes its square root).^{1}

The MSE of a Linear Regression hypothesis *h*_{θ} on a training set **X** is calculated using Equation 4-3.

*Equation 4-3. MSE cost function for a Linear Regression model*

$$\mathrm{MSE}(\mathbf{X}, h_{\theta}) = \frac{1}{m} \sum_{i=1}^{m} \left( \theta^{T} \cdot \mathbf{x}^{(i)} - y^{(i)} \right)^{2}$$

Most of these notations were presented in Chapter 2 (see “Notations” on page 38). The only difference is that we write *h*_{θ} instead of just *h* in order to make it clear that the model is parametrized by the vector *θ*. To simplify notations, we will just write MSE(*θ*) instead of MSE(**X**, *h*_{θ}).
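As a rough illustration of Equation 4-3, here is a sketch of MSE(*θ*) in NumPy. The function name `mse` and the tiny three-instance dataset are assumptions made for this example; the data lie exactly on the line *y* = 1 + 2*x*1, so the cost is zero for the exact parameters and positive for any others.

```python
import numpy as np


def mse(theta, X_b, y):
    """Mean of the squared errors (theta^T . x^(i) - y^(i))^2 over all m instances."""
    m = len(X_b)
    errors = X_b @ theta - y          # one prediction error per instance
    return float((errors ** 2).sum() / m)


# Hypothetical toy data: each row of X_b is [x_0 = 1, x_1].
X_b = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])
y = np.array([[1.0], [3.0], [5.0]])

theta_exact = np.array([[1.0], [2.0]])  # matches the data exactly
theta_off = np.array([[0.0], [2.0]])    # bias term is off by 1

print(mse(theta_exact, X_b, y))  # 0.0
print(mse(theta_off, X_b, y))    # 1.0 (every error is -1)
```

Taking the square root of either value gives the corresponding RMSE, and the ordering of the two parameter vectors is unchanged.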

2 The demonstration that this returns the value of *θ* that minimizes the cost function is outside the scope of this

book.

To find the value of *θ* that minimizes the cost function, there is a *closed-form solution*

—in other words, a mathematical equation that gives the result directly. This is called

the *Normal Equation* (Equation 4-4).^{2}

*Equation 4-4. Normal Equation*

$$\hat{\theta} = \left( \mathbf{X}^{T} \cdot \mathbf{X} \right)^{-1} \cdot \mathbf{X}^{T} \cdot \mathbf{y}$$

• *θ̂* is the value of *θ* that minimizes the cost function.

• **y** is the vector of target values containing *y*^{(1)} to *y*^{(m)}.

Let’s generate some linear-looking data to test this equation on (Figure 4-1):

`import numpy as np`

`X = 2 * np.random.rand(100, 1)`

`y = 4 + 3 * X + np.random.randn(100, 1)`

*Figure 4-1. Randomly generated linear dataset*

Now let’s compute *θ* using the Normal Equation. We will use the `inv()` function from

NumPy’s Linear Algebra module (`np.linalg`) to compute the inverse of a matrix, and

the `dot()` method for matrix multiplication:

`X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 to each instance`

`theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)`

The actual function that we used to generate the data is *y* = 4 + 3*x*1 + Gaussian noise.

Let’s see what the equation found:

`>>> theta_best`

`array([[ 4.21509616],`

`       [ 2.77011339]])`

We would have hoped for *θ*0 = 4 and *θ*1 = 3 instead of *θ*0 = 4.215 and *θ*1 = 2.770. Close enough, but the noise made it impossible to recover the exact parameters of the original function.
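Explicitly inverting a matrix is not the only way to solve this least-squares problem. As a sketch, NumPy also provides `np.linalg.lstsq()` and `np.linalg.pinv()`, which avoid forming the inverse directly and should agree with the Normal Equation on well-conditioned data like this. The fixed seed and the `default_rng` generator are assumptions added here so that the example is reproducible; they are not part of the code above.

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed: an assumption, for reproducibility
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.standard_normal((100, 1))
X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 to each instance

# Normal Equation with an explicit inverse.
theta_normal = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Least-squares solver: returns the solution plus diagnostic values.
theta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X_b, y, rcond=None)

# Moore-Penrose pseudoinverse of X_b applied to y.
theta_pinv = np.linalg.pinv(X_b) @ y

print(np.allclose(theta_normal, theta_lstsq))  # True
print(np.allclose(theta_normal, theta_pinv))   # True
```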

Now you can make predictions using *θ*:

`>>> X_new = np.array([[0], [2]])`

`>>> X_new_b = np.c_[np.ones((2, 1)), X_new]  # add x0 = 1 to each instance`

`>>> y_predict = X_new_b.dot(theta_best)`

`>>> y_predict`

`array([[ 4.21509616],`

`       [ 9.75532293]])`

Let’s plot this model’s predictions (Figure 4-2):

`import matplotlib.pyplot as plt`

`plt.plot(X_new, y_predict, "r-")  # model predictions as a red line`

`plt.plot(X, y, "b.")  # training instances as blue dots`

`plt.axis([0, 2, 0, 15])`

`plt.show()`

*Figure 4-2. Linear Regression model predictions*

The equivalent code using Scikit-Learn looks like this:

`>>> from sklearn.linear_model import LinearRegression`

`>>> lin_reg = LinearRegression()`

`>>> lin_reg.fit(X, y)`

`>>> lin_reg.intercept_, lin_reg.coef_`

`(array([ 4.21509616]), array([[ 2.77011339]]))`

`>>> lin_reg.predict(X_new)`

`array([[ 4.21509616],`

`       [ 9.75532293]])`

**Computational Complexity**

The Normal Equation computes the inverse of **X**^{T} · **X**, which is an *n* × *n* matrix (where *n* is the number of features). The *computational complexity* of inverting such a matrix is typically about *O*(*n*^{2.4}) to *O*(*n*^{3}) (depending on the implementation). In other words, if you double the number of features, you multiply the computation time by roughly 2^{2.4} ≈ 5.3 to 2^{3} = 8.
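The scaling arithmetic above can be checked with a short sketch. The helper `slowdown` is a hypothetical name introduced for this example; it ignores constant factors and lower-order terms, so it gives only a rough asymptotic estimate, not a timing prediction.

```python
def slowdown(feature_ratio, exponent):
    """Rough factor by which inversion time grows when n is multiplied by feature_ratio."""
    return feature_ratio ** exponent


# Doubling the number of features under the two complexity estimates.
for exponent in (2.4, 3.0):
    print(exponent, round(slowdown(2, exponent), 1))  # 2.4 -> 5.3, then 3.0 -> 8.0
```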

The Normal Equation gets very slow when the number of features

grows large (e.g., 100,000).

On the positive side, this equation is linear with regard to the number of instances in the training set (it is *O*(*m*)), so it handles large training sets efficiently, provided they can fit in memory.

Also, once you have trained your Linear Regression model (using the Normal Equation or any other algorithm), predictions are very fast: the computational complexity is linear with regard to both the number of instances you want to make predictions on and the number of features. In other words, making predictions on twice as many instances (or twice as many features) will take roughly twice as much time.

Now we will look at very different ways to train a Linear Regression model, better

suited for cases where there are a large number of features, or too many training

instances to fit in memory.
