 Machine Learning
An Introduction to Batch Gradient Descent: A Beginner's Guide

To implement Gradient Descent, you need to compute the gradient of the cost function with regard to each model parameter θ_j. In other words, you need to calculate how much the cost function will change if you change θ_j just a little bit. This is called a partial derivative. It is like asking "what is the slope of the mountain under my feet if I face east?" and then asking the same question facing north (and so on for all other dimensions, if you can imagine a universe with more than three dimensions). Equation 4-5 computes the partial derivative of the cost function with regard to parameter θ_j, noted $\frac{\partial}{\partial \theta_j}\,\mathrm{MSE}(\theta)$.

Equation 4-5. Partial derivatives of the cost function

$$\frac{\partial}{\partial \theta_j}\,\mathrm{MSE}(\theta) = \frac{2}{m}\sum_{i=1}^{m}\left(\theta^T \mathbf{x}^{(i)} - y^{(i)}\right) x_j^{(i)}$$
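To build intuition for Equation 4-5, here is a small sanity check: the analytical partial derivative should match a finite-difference estimate obtained by nudging θ_j slightly and measuring how the MSE changes. This sketch assumes synthetic linear training data (the variables `X_b` and `y` are generated here for illustration, not taken from this excerpt):

```python
import numpy as np

np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)              # single input feature
y = 4 + 3 * X + np.random.randn(m, 1)     # linear data with Gaussian noise
X_b = np.c_[np.ones((m, 1)), X]           # add x0 = 1 to each instance

def mse(theta):
    errors = X_b.dot(theta) - y
    return (errors ** 2).mean()

theta = np.random.randn(2, 1)
j = 1  # which parameter to differentiate with respect to

# Analytical partial derivative from Equation 4-5
partial = 2 / m * ((X_b.dot(theta) - y) * X_b[:, [j]]).sum()

# Numerical estimate: nudge theta_j a little and measure the change in MSE
eps = 1e-6
theta_plus = theta.copy()
theta_plus[j, 0] += eps
numerical = (mse(theta_plus) - mse(theta)) / eps

print(partial, numerical)  # the two values should agree closely
```

The finite-difference estimate is exactly the "change θ_j just a little bit" idea from the text, made concrete.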

Instead of computing these gradients individually, you can use Equation 4-6 to compute them all in one go. The gradient vector, noted $\nabla_\theta\,\mathrm{MSE}(\theta)$, contains all the partial derivatives of the cost function (one for each model parameter).

6 Eta (η) is the 7th letter of the Greek alphabet.

Equation 4-6. Gradient vector of the cost function

$$\nabla_\theta\,\mathrm{MSE}(\theta) = \begin{pmatrix} \frac{\partial}{\partial \theta_0}\,\mathrm{MSE}(\theta) \\ \frac{\partial}{\partial \theta_1}\,\mathrm{MSE}(\theta) \\ \vdots \\ \frac{\partial}{\partial \theta_n}\,\mathrm{MSE}(\theta) \end{pmatrix} = \frac{2}{m}\,\mathbf{X}^T \left(\mathbf{X}\,\theta - \mathbf{y}\right)$$
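You can check that the vectorized form of Equation 4-6 really is just the per-parameter partials of Equation 4-5 stacked into one vector. This is a sketch using synthetic data generated here (not part of the excerpt):

```python
import numpy as np

np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]           # add x0 = 1 to each instance
theta = np.random.randn(2, 1)

# One partial derivative per parameter (Equation 4-5), stacked into a vector
partials = np.array([
    [2 / m * ((X_b.dot(theta) - y) * X_b[:, [j]]).sum()]
    for j in range(X_b.shape[1])
])

# The whole gradient in one matrix expression (Equation 4-6)
gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)

print(np.allclose(partials, gradients))  # → True
```

The matrix expression is both shorter and much faster than the loop, since NumPy performs the whole computation in optimized vectorized code.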

Notice that this formula involves calculations over the full training set X at each Gradient Descent step! This is why the algorithm is called Batch Gradient Descent: it uses the whole batch of training data at every step. As a result, it is terribly slow on very large training sets (but we will see much faster Gradient Descent algorithms shortly). However, Gradient Descent scales well with the number of features; training a Linear Regression model when there are hundreds of thousands of features is much faster using Gradient Descent than using the Normal Equation.

Once you have the gradient vector, which points uphill, just go in the opposite direction to go downhill. This means subtracting $\nabla_\theta\,\mathrm{MSE}(\theta)$ from θ. This is where the learning rate η comes into play:6 multiply the gradient vector by η to determine the size of the downhill step (Equation 4-7).

Equation 4-7. Gradient Descent step

$$\theta^{(\text{next step})} = \theta - \eta\,\nabla_\theta\,\mathrm{MSE}(\theta)$$

Let's look at a quick implementation of this algorithm (X_b and y are the training data prepared earlier in the chapter, with X_b including the bias column x0 = 1):

```python
import numpy as np

eta = 0.1  # learning rate
n_iterations = 1000
m = 100  # number of training instances

theta = np.random.randn(2, 1)  # random initialization

for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients
```

```python
>>> theta
array([[4.21509616],
       [2.77011339]])
```
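The claim that Gradient Descent lands on the same solution as the Normal Equation is easy to verify. Here is a self-contained sketch that regenerates synthetic training data (since the chapter's `X_b` and `y` are not in this excerpt) and compares the two solutions:

```python
import numpy as np

np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]           # add x0 = 1 to each instance

# Batch Gradient Descent
eta, n_iterations = 0.1, 1000
theta = np.random.randn(2, 1)             # random initialization
for iteration in range(n_iterations):
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients

# Normal Equation: closed-form solution of the same least-squares problem
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

print(theta.ravel())       # iterative solution
print(theta_best.ravel())  # closed-form solution; should match closely
```

After 1,000 steps with a well-chosen learning rate, the two parameter vectors agree to many decimal places.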

Hey, that's exactly what the Normal Equation found! Gradient Descent worked perfectly. But what if you had used a different learning rate eta? Figure 4-8 shows the first 10 steps of Gradient Descent using three different learning rates (the dashed line represents the starting point).

Figure 4-8. Gradient Descent with various learning rates

On the left, the learning rate is too low: the algorithm will eventually reach the solution, but it will take a long time. In the middle, the learning rate looks pretty good: in just a few iterations, it has already converged to the solution. On the right, the learning rate is too high: the algorithm diverges, jumping all over the place and actually getting further and further away from the solution at every step.

To find a good learning rate, you can use grid search (see Chapter 2). However, you may want to limit the number of iterations so that grid search can eliminate models that take too long to converge.
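The three regimes from Figure 4-8 (too low, good, too high) can be reproduced with a small learning-rate sweep under a fixed iteration budget, which is essentially what a grid search over eta would do. This sketch uses synthetic data generated here; the specific eta values are illustrative choices, not from the text:

```python
import numpy as np

np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]           # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

def run_gd(eta, n_iterations=100):
    """Run Batch Gradient Descent and return the distance to the optimum."""
    theta = np.random.randn(2, 1)
    for _ in range(n_iterations):
        gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
        theta = theta - eta * gradients
    return np.linalg.norm(theta - theta_best)

results = {eta: run_gd(eta) for eta in (0.02, 0.1, 0.5)}
print(results)
# eta = 0.02 is still far from the optimum after 100 steps (too slow);
# eta = 0.1 gets close; eta = 0.5 overshoots on every step and diverges,
# so its distance to the optimum grows enormously.
```

Capping `n_iterations` is exactly the trick the text suggests: slow or divergent learning rates are exposed without waiting for full convergence.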

You may wonder how to set the number of iterations. If it is too low, you will still be far away from the optimal solution when the algorithm stops; but if it is too high, you will waste time while the model parameters do not change anymore. A simple solution is to set a very large number of iterations but to interrupt the algorithm when the gradient vector becomes tiny, that is, when its norm becomes smaller than a tiny number ϵ (called the tolerance), because this happens when Gradient Descent has (almost) reached the minimum.
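That stopping rule is a small change to the earlier training loop: set a generous iteration cap, but break out as soon as the gradient norm drops below the tolerance. A sketch, again on synthetic data generated here (the cap and tolerance values are illustrative assumptions):

```python
import numpy as np

np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]           # add x0 = 1 to each instance

eta = 0.1
epsilon = 1e-8            # tolerance on the gradient norm
max_iterations = 100_000  # generous upper bound, rarely reached

theta = np.random.randn(2, 1)
for iteration in range(max_iterations):
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
    if np.linalg.norm(gradients) < epsilon:
        break             # gradient is tiny: (almost) at the minimum
    theta = theta - eta * gradients

print(iteration)  # stops after far fewer than max_iterations steps
```

The algorithm terminates as soon as it has (almost) reached the minimum, so the large cap costs nothing in practice.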

Convergence Rate

When the cost function is convex and its slope does not change abruptly (as is the case for the MSE cost function), it can be shown that Batch Gradient Descent with a fixed learning rate has a convergence rate of O(1/iterations). In other words, if you divide the tolerance ϵ by 10 (to have a more precise solution), then the algorithm will have to run about 10 times more iterations.