**Batch Gradient Descent**

To implement Gradient Descent, you need to compute the gradient of the cost function with regard to each model parameter θ_j. In other words, you need to calculate how much the cost function will change if you change θ_j just a little bit. This is called a *partial derivative*. It is like asking “what is the slope of the mountain under my feet if I face east?” and then asking the same question facing north (and so on for all other dimensions, if you can imagine a universe with more than three dimensions). Equation 4-5 computes the partial derivative of the cost function with regard to parameter θ_j, noted ∂MSE(θ)/∂θ_j.

*Equation 4-5. Partial derivatives of the cost function*

$$
\frac{\partial}{\partial \theta_j}\,\mathrm{MSE}(\boldsymbol{\theta}) = \frac{2}{m}\sum_{i=1}^{m}\left(\boldsymbol{\theta}^{T}\cdot\mathbf{x}^{(i)} - y^{(i)}\right)x_j^{(i)}
$$
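Equation 4-5 is easy to sanity-check numerically. The sketch below compares the analytical partial derivative with a central finite-difference estimate; the synthetic data generation and the helper names (`mse`, `partial_derivative`) are illustrative assumptions, not from the text:

```python
import numpy as np

np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)              # random feature values
y = 4 + 3 * X + np.random.randn(m, 1)     # linear target plus Gaussian noise
X_b = np.c_[np.ones((m, 1)), X]           # add x0 = 1 to each instance

def mse(theta):
    errors = X_b.dot(theta) - y
    return float((errors ** 2).mean())

def partial_derivative(theta, j):
    # Equation 4-5: (2/m) * sum_i (theta^T x_i - y_i) * x_ij
    errors = X_b.dot(theta) - y
    return float(2 / m * (errors * X_b[:, [j]]).sum())

theta = np.random.randn(2, 1)
eps = 1e-6
for j in range(2):
    tp, tm = theta.copy(), theta.copy()
    tp[j, 0] += eps
    tm[j, 0] -= eps
    numeric = (mse(tp) - mse(tm)) / (2 * eps)   # central difference
    assert abs(numeric - partial_derivative(theta, j)) < 1e-4
```

Because the MSE is quadratic in θ, the central difference matches the analytical derivative up to floating-point error, which makes this a cheap way to catch sign or indexing mistakes.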

Instead of computing these partial derivatives individually, you can use Equation 4-6 to compute them all in one go. The gradient vector, noted ∇_θ MSE(θ), contains all the partial derivatives of the cost function (one for each model parameter).

*Equation 4-6. Gradient vector of the cost function*

$$
\nabla_{\boldsymbol{\theta}}\,\mathrm{MSE}(\boldsymbol{\theta}) =
\begin{pmatrix}
\frac{\partial}{\partial \theta_0}\mathrm{MSE}(\boldsymbol{\theta}) \\
\frac{\partial}{\partial \theta_1}\mathrm{MSE}(\boldsymbol{\theta}) \\
\vdots \\
\frac{\partial}{\partial \theta_n}\mathrm{MSE}(\boldsymbol{\theta})
\end{pmatrix}
= \frac{2}{m}\,\mathbf{X}^{T}\cdot(\mathbf{X}\cdot\boldsymbol{\theta} - \mathbf{y})
$$

⁶ Eta (*η*) is the 7th letter of the Greek alphabet.
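To see that the vectorized form in Equation 4-6 agrees with Equation 4-5 applied one parameter at a time, here is a small check (the synthetic data and variable names are illustrative assumptions):

```python
import numpy as np

np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]           # add x0 = 1 to each instance

theta = np.random.randn(2, 1)
errors = X_b.dot(theta) - y

# Equation 4-6: every partial derivative in a single matrix product
gradient_vector = 2 / m * X_b.T.dot(errors)

# Equation 4-5: one parameter at a time
per_parameter = np.array([2 / m * (errors * X_b[:, [j]]).sum()
                          for j in range(X_b.shape[1])]).reshape(-1, 1)

assert np.allclose(gradient_vector, per_parameter)
```

The matrix product delegates the sum over all m instances to optimized linear-algebra code, which is why the vectorized form is both shorter and faster than the explicit loop.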

Notice that this formula involves calculations over the full training set **X** at each Gradient Descent step! This is why the algorithm is called *Batch Gradient Descent*: it uses the whole batch of training data at every step. As a result it is terribly slow on very large training sets (but we will see much faster Gradient Descent algorithms shortly). However, Gradient Descent scales well with the number of features; training a Linear Regression model when there are hundreds of thousands of features is much faster using Gradient Descent than using the Normal Equation.

Once you have the gradient vector, which points uphill, just go in the opposite direction to go downhill. This means subtracting ∇_θ MSE(θ) from θ. This is where the learning rate *η* comes into play:⁶ multiply the gradient vector by *η* to determine the size of the downhill step (Equation 4-7).

*Equation 4-7. Gradient Descent step*

$$
\boldsymbol{\theta}^{(\text{next step})} = \boldsymbol{\theta} - \eta\,\nabla_{\boldsymbol{\theta}}\,\mathrm{MSE}(\boldsymbol{\theta})
$$

Let’s look at a quick implementation of this algorithm:

```python
import numpy as np  # X_b and y are defined earlier in the chapter

eta = 0.1  # learning rate
n_iterations = 1000
m = 100

theta = np.random.randn(2, 1)  # random initialization

for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients
```

```python
>>> theta
array([[4.21509616],
       [2.77011339]])
```

Hey, that’s exactly what the Normal Equation found! Gradient Descent worked perfectly. But what if you had used a different learning rate `eta`? Figure 4-8 shows the first 10 steps of Gradient Descent using three different learning rates (the dashed line represents the starting point).
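That agreement is easy to check end to end. This sketch reruns Batch Gradient Descent on synthetic data generated as in the chapter’s earlier example (the exact parameter values depend on the random seed, so they are not guaranteed to match the printout above) and compares the result with the Normal Equation solution computed via `np.linalg`:

```python
import numpy as np

np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]           # add x0 = 1 to each instance

# Batch Gradient Descent, as in the chapter's code
eta, n_iterations = 0.1, 1000
theta = np.random.randn(2, 1)             # random initialization
for iteration in range(n_iterations):
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients

# Normal Equation: theta = (X^T X)^-1 X^T y
theta_normal = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

assert np.allclose(theta, theta_normal, atol=1e-4)
```

After 1,000 steps at this learning rate the remaining distance to the closed-form optimum is negligible, which is what the identical printed values in the text reflect.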

*Figure 4-8. Gradient Descent with various learning rates*

On the left, the learning rate is too low: the algorithm will eventually reach the solution, but it will take a long time. In the middle, the learning rate looks pretty good: in just a few iterations, it has already converged to the solution. On the right, the learning rate is too high: the algorithm diverges, jumping all over the place and actually getting further and further away from the solution at every step.

To find a good learning rate, you can use grid search (see Chapter 2). However, you may want to limit the number of iterations so that grid search can eliminate models that take too long to converge.
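A crude version of that idea caps the iteration count so that learning rates that converge too slowly (or diverge) score badly; the candidate values, the helper name `final_mse`, and the synthetic data are arbitrary choices for illustration:

```python
import numpy as np

np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]           # add x0 = 1 to each instance

def final_mse(eta, n_iterations=200):
    """Run Batch GD for a fixed budget and return the final training MSE."""
    theta = np.zeros((2, 1))
    for _ in range(n_iterations):
        gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
        theta = theta - eta * gradients
    errors = X_b.dot(theta) - y
    return float((errors ** 2).mean())

# Tiny grid search: the capped budget penalizes too-slow learners,
# and divergent learning rates produce an enormous final MSE
candidates = [0.001, 0.01, 0.1, 0.5]
best_eta = min(candidates, key=final_mse)
```

With a fixed budget, a too-small learning rate is still far from the optimum when time runs out, and a too-large one has blown up, so the middle of the grid wins.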

You may wonder how to set the number of iterations. If it is too low, you will still be far away from the optimal solution when the algorithm stops; but if it is too high, you will waste time while the model parameters no longer change. A simple solution is to set a very large number of iterations but to interrupt the algorithm when the gradient vector becomes tiny, that is, when its norm drops below a tiny number ϵ (called the *tolerance*), because this happens when Gradient Descent has (almost) reached the minimum.
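That stopping rule takes only a couple of extra lines. A minimal sketch under the same synthetic-data assumptions (the tolerance value and iteration cap are arbitrary illustrations):

```python
import numpy as np

np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]           # add x0 = 1 to each instance

eta = 0.1
tolerance = 1e-7          # epsilon: stop when the gradient norm drops below this
max_iterations = 100_000  # deliberately very large; we expect to break early

theta = np.random.randn(2, 1)
for iteration in range(max_iterations):
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients
    if np.linalg.norm(gradients) < tolerance:
        break  # (almost) at the minimum
```

On this problem the loop typically exits after a few hundred iterations, a small fraction of the cap, which is exactly the point: the cap only guards against pathological cases.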

**Convergence Rate**

When the cost function is convex and its slope does not change abruptly (as is the case for the MSE cost function), it can be shown that Batch Gradient Descent with a fixed learning rate has a *convergence rate* of *O*(1/iterations). In other words, if you divide the tolerance ϵ by 10 (to have a more precise solution), then the algorithm will have to run about 10 times more iterations.
