Understanding Mini-batch Gradient Descent: A Beginner Tutorial

The last Gradient Descent algorithm we will look at is called Mini-batch Gradient Descent. It is quite simple to understand once you know Batch and Stochastic Gradient Descent: at each step, instead of computing the gradients based on the full training set (as in Batch GD) or based on just one instance (as in Stochastic GD), Mini-batch GD computes the gradients on small random sets of instances called mini-batches. The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance boost from hardware optimization of matrix operations, especially when using GPUs.

8 While the Normal Equation can only perform Linear Regression, the Gradient Descent algorithms can be used to train many other models, as we will see.
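To make the idea concrete, here is a minimal NumPy sketch of Mini-batch GD for Linear Regression. The data, variable names, and hyperparameter values (n_epochs, batch_size, eta) are illustrative choices for this sketch, not values taken from the text:

```python
import numpy as np

rng = np.random.default_rng(42)
m = 100                                    # number of training instances
X = 2 * rng.random((m, 1))                 # single feature in [0, 2)
y = 4 + 3 * X[:, 0] + 0.1 * rng.standard_normal(m)   # y ≈ 4 + 3x plus noise

X_b = np.c_[np.ones((m, 1)), X]            # add bias input x0 = 1
theta = rng.standard_normal(2)             # random initialization

n_epochs = 50
batch_size = 20
eta = 0.1                                  # fixed learning rate (illustrative)

for epoch in range(n_epochs):
    indices = rng.permutation(m)           # reshuffle the training set each epoch
    for start in range(0, m, batch_size):
        batch = indices[start:start + batch_size]
        xi, yi = X_b[batch], y[batch]
        # MSE gradient computed on the mini-batch only, not the full set
        gradients = 2 / len(batch) * xi.T @ (xi @ theta - yi)
        theta -= eta * gradients

print(theta)   # should land close to [4, 3]
```

Because each update works on a small matrix of instances rather than a single row, the gradient computation vectorizes well, which is where the hardware speedup mentioned above comes from.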

The algorithm’s progress in parameter space is less erratic than with SGD, especially with fairly large mini-batches. As a result, Mini-batch GD will end up walking around a bit closer to the minimum than SGD. But, on the other hand, it may be harder for it to escape from local minima (in the case of problems that suffer from local minima, unlike Linear Regression, as we saw earlier). Figure 4-11 shows the paths taken by the three Gradient Descent algorithms in parameter space during training. They all end up near the minimum, but Batch GD’s path actually stops at the minimum, while both Stochastic GD and Mini-batch GD continue to walk around. However, don’t forget that Batch GD takes a lot of time to take each step, and Stochastic GD and Mini-batch GD would also reach the minimum if you used a good learning schedule.
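One common form of learning schedule simply divides a constant by the current step count, so steps start large and shrink as training progresses. The t0 and t1 values below are illustrative assumptions, not values prescribed by the text:

```python
t0, t1 = 5, 50                  # schedule hyperparameters (illustrative)

def learning_schedule(t):
    """Return a learning rate that decays with the step count t."""
    return t0 / (t + t1)

print(learning_schedule(0))     # large-ish steps early in training
print(learning_schedule(5000))  # tiny steps late in training
```

Inside the SGD or Mini-batch GD loop, you would call `learning_schedule(step)` before each update instead of using a fixed `eta`; the shrinking steps are what let the walk settle down at the minimum instead of bouncing around it.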

Figure 4-11. Gradient Descent paths in parameter space

Let’s compare the algorithms we’ve discussed so far for Linear Regression8 (recall that m is the number of training instances and n is the number of features); see Table 4-1.

Table 4-1. Comparison of algorithms for Linear Regression

| Algorithm       | Large m | Out-of-core support | Large n | Hyperparams | Scaling required | Scikit-Learn     |
|-----------------|---------|---------------------|---------|-------------|------------------|------------------|
| Normal Equation | Fast    | No                  | Slow    | 0           | No               | LinearRegression |
| Batch GD        | Slow    | No                  | Fast    | 2           | Yes              | n/a              |
| Stochastic GD   | Fast    | Yes                 | Fast    | ≥2          | Yes              | SGDRegressor     |
| Mini-batch GD   | Fast    | Yes                 | Fast    | ≥2          | Yes              | n/a              |
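The Scikit-Learn column above points to SGDRegressor for Stochastic GD. As a sketch of what that looks like in practice (the data and the max_iter, tol, and eta0 values here are illustrative choices, not tuned recommendations):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))                           # single feature in [0, 2)
y = 4 + 3 * X[:, 0] + 0.1 * rng.standard_normal(m)   # y ≈ 4 + 3x plus noise

# Hyperparameter values are illustrative; SGDRegressor applies its own
# learning schedule internally.
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, eta0=0.1, random_state=42)
sgd_reg.fit(X, y)

print(sgd_reg.intercept_, sgd_reg.coef_)             # should land near 4 and 3
```

Note the "Scaling required: Yes" entry in the table: in real use you would typically standardize the features (e.g. with StandardScaler) before fitting, since Gradient Descent converges poorly when features have very different scales.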

There is almost no difference after training: all these algorithms end up with very similar models and make predictions in exactly the same way.