Batch Gradient Descent
To implement Gradient Descent, you need to compute the gradient of the cost function with regard to each model parameter θj. In other words, you need to calculate how much the cost function will change if you change θj just a little bit. This is called a partial derivative. It is like asking "What is the slope of the mountain under my feet if I face east?" and then asking the same question facing north (and so on for all other dimensions, if you can imagine a universe with more than three dimensions). Equation 4-5 computes the partial derivative of the cost function with regard to parameter θj, noted ∂MSE(θ)/∂θj.
Equation 4-5. Partial derivatives of the cost function
\frac{\partial}{\partial \theta_j} \mathrm{MSE}(\boldsymbol{\theta}) = \frac{2}{m} \sum_{i=1}^{m} \left( \boldsymbol{\theta}^T \mathbf{x}^{(i)} - y^{(i)} \right) x_j^{(i)}
Instead of computing these gradients individually, you can use Equation 4-6 to compute them all in one go. The gradient vector, noted ∇θMSE(θ), contains all the partial derivatives of the cost function (one for each model parameter).
Equation 4-6. Gradient vector of the cost function

\nabla_{\boldsymbol{\theta}} \mathrm{MSE}(\boldsymbol{\theta}) = \frac{2}{m} \mathbf{X}^T \left( \mathbf{X} \boldsymbol{\theta} - \mathbf{y} \right)
Notice that this formula involves calculations over the full training set X, at each Gradient Descent step! This is why the algorithm is called Batch Gradient Descent: it uses the whole batch of training data at every step. As a result, it is terribly slow on very large training sets (but we will see much faster Gradient Descent algorithms shortly). However, Gradient Descent scales well with the number of features; training a Linear Regression model when there are hundreds of thousands of features is much faster using Gradient Descent than using the Normal Equation.
Once you have the gradient vector, which points uphill, just go in the opposite direction to go downhill. This means subtracting ∇θMSE(θ) from θ. This is where the learning rate η (eta, the seventh letter of the Greek alphabet) comes into play: multiply the gradient vector by η to determine the size of the downhill step (Equation 4-7).
Equation 4-7. Gradient Descent step
\boldsymbol{\theta}^{(\text{next step})} = \boldsymbol{\theta} - \eta \, \nabla_{\boldsymbol{\theta}} \mathrm{MSE}(\boldsymbol{\theta})
Let’s look at a quick implementation of this algorithm (X_b and y are the training data prepared earlier in the chapter):

eta = 0.1        # learning rate
n_iterations = 1000
m = 100          # number of training instances

theta = np.random.randn(2, 1)  # random initialization

for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)  # Equation 4-6
    theta = theta - eta * gradients                  # Equation 4-7
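As a quick sanity check, you can compare the parameters that Gradient Descent converges to against a closed-form least-squares solution. The sketch below is self-contained: it re-creates synthetic linear data in the same style as earlier in the chapter (the specific seed and coefficients here are illustrative, not the book's exact values):

```python
import numpy as np

np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)                 # random inputs in [0, 2)
y = 4 + 3 * X + np.random.randn(m, 1)        # linear data plus Gaussian noise
X_b = np.c_[np.ones((m, 1)), X]              # add x0 = 1 to each instance

# Batch Gradient Descent, exactly as above
eta, n_iterations = 0.1, 1000
theta = np.random.randn(2, 1)                # random initialization
for iteration in range(n_iterations):
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients

# Closed-form least-squares solution for comparison
theta_closed_form = np.linalg.lstsq(X_b, y, rcond=None)[0]
print(np.allclose(theta, theta_closed_form, atol=1e-4))
```

With this well-conditioned problem and learning rate, 1,000 iterations are more than enough for both solutions to agree to several decimal places.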
Hey, that’s exactly what the Normal Equation found! Gradient Descent worked perfectly. But what if you had used a different learning rate eta? Figure 4-8 shows the first 10 steps of Gradient Descent using three different learning rates (the dashed line represents the starting point).
Figure 4-8. Gradient Descent with various learning rates
On the left, the learning rate is too low: the algorithm will eventually reach the solution, but it will take a long time. In the middle, the learning rate looks pretty good: in just a few iterations, it has already converged to the solution. On the right, the learning rate is too high: the algorithm diverges, jumping all over the place and actually getting further and further away from the solution at every step.
To find a good learning rate, you can use grid search (see Chapter 2). However, you
may want to limit the number of iterations so that grid search can eliminate models
that take too long to converge.
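A minimal version of this idea (a sketch, not the book's code; the candidate learning rates and iteration budget are illustrative) is to run Gradient Descent with each candidate rate for a capped number of iterations, bail out early on divergence, and keep the rate that reaches the lowest MSE:

```python
import numpy as np

np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]

def bgd(eta, n_iterations=500):
    """Batch Gradient Descent with a capped iteration budget."""
    theta = np.zeros((2, 1))
    for _ in range(n_iterations):
        gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
        theta = theta - eta * gradients
        if not np.all(np.isfinite(theta)):  # diverged: stop wasting time
            break
    return theta

def mse(theta):
    return float(np.mean((X_b.dot(theta) - y) ** 2))

# Try a few candidate learning rates and keep the best one
candidates = [0.005, 0.02, 0.1, 0.5]
results = {eta: mse(bgd(eta)) for eta in candidates}
best_eta = min(results, key=results.get)
```

Here a rate that is too low leaves the model short of the minimum when the budget runs out, and a rate that is too high blows up the MSE, so the capped budget penalizes both failure modes at once.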
You may wonder how to set the number of iterations. If it is too low, you will still be far away from the optimal solution when the algorithm stops, but if it is too high, you will waste time while the model parameters do not change anymore. A simple solution is to set a very large number of iterations but to interrupt the algorithm when the gradient vector becomes tiny—that is, when its norm becomes smaller than a tiny number ϵ (called the tolerance)—because this happens when Gradient Descent has (almost) reached the minimum.
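This interruption strategy can be sketched as follows (self-contained, with the same style of synthetic data as earlier; the tolerance and iteration cap are illustrative values, not prescriptions):

```python
import numpy as np

np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]

eta = 0.1
epsilon = 1e-6            # tolerance on the gradient norm
max_iterations = 100_000  # very large cap; we expect to stop far sooner

theta = np.random.randn(2, 1)
for iteration in range(max_iterations):
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
    if np.linalg.norm(gradients) < epsilon:  # tiny gradient: (almost) at the minimum
        break
    theta = theta - eta * gradients

print(iteration)  # far fewer iterations than max_iterations
```

On this small problem the loop stops after a few hundred iterations, so the very large cap costs nothing while guaranteeing termination even with a poorly chosen tolerance.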
When the cost function is convex and its slope does not change abruptly (as is the case for the MSE cost function), it can be shown that Batch Gradient Descent with a fixed learning rate has a convergence rate of O(1/iterations). In other words, if you divide the tolerance ϵ by 10 (to have a more precise solution), then the algorithm will have to run about 10 times more iterations.