The last Gradient Descent algorithm we will look at is called *Mini-batch Gradient Descent*. It is quite simple to understand once you know Batch and Stochastic Gradient Descent: at each step, instead of computing the gradients based on the full training set (as in Batch GD) or based on just one instance (as in Stochastic GD), Mini-batch GD computes the gradients on small random sets of instances called *mini-batches*. The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance boost from hardware optimization of matrix operations, especially when using GPUs.

8. While the Normal Equation can only perform Linear Regression, the Gradient Descent algorithms can be used to train many other models, as we will see.
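The idea above can be sketched in a few lines of NumPy. This is a minimal illustration for Linear Regression, with assumed hyperparameter values (50 epochs, mini-batches of 20 instances, a constant learning rate of 0.1) and toy data generated around the line y = 4 + 3x:

```python
import numpy as np

# Toy linear data: y = 4 + 3x + Gaussian noise (illustrative values)
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]  # add the bias (intercept) column

n_epochs = 50
batch_size = 20
eta = 0.1                            # constant learning rate (assumed)
theta = rng.standard_normal((2, 1))  # random initialization

for epoch in range(n_epochs):
    shuffled = rng.permutation(m)  # reshuffle so each epoch sees fresh mini-batches
    for start in range(0, m, batch_size):
        idx = shuffled[start:start + batch_size]
        xi, yi = X_b[idx], y[idx]
        # MSE gradient computed over this mini-batch only,
        # not the full training set (Batch GD) or one instance (Stochastic GD)
        gradients = 2 / len(idx) * xi.T @ (xi @ theta - yi)
        theta -= eta * gradients

print(theta.ravel())  # should land close to the true parameters [4, 3]
```

Because each gradient is a matrix product over a whole mini-batch, this inner step is exactly the kind of operation that vectorized libraries and GPUs accelerate well.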

The algorithm’s progress in parameter space is less erratic than with SGD, especially with fairly large mini-batches. As a result, Mini-batch GD will end up walking around a bit closer to the minimum than SGD. On the other hand, it may be harder for it to escape from local minima (in the case of problems that suffer from local minima, unlike Linear Regression, as we saw earlier). Figure 4-11 shows the paths taken by the three Gradient Descent algorithms in parameter space during training. They all end up near the minimum, but Batch GD’s path actually stops at the minimum, while both Stochastic GD and Mini-batch GD continue to walk around. However, don’t forget that Batch GD takes a lot of time to take each step, and Stochastic GD and Mini-batch GD would also reach the minimum if you used a good learning schedule.
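One common form of learning schedule is to shrink the learning rate as training progresses, so the final walk around the minimum gets smaller and smaller. A minimal sketch, with `t0` and `t1` as illustrative hyperparameters (not values prescribed by the text):

```python
# Simple decaying learning schedule: eta starts at t0 / t1 and shrinks
# as the iteration counter t grows, damping the walk around the minimum.
def learning_schedule(t, t0=5.0, t1=50.0):
    return t0 / (t + t1)

print(learning_schedule(0))    # 0.1 at the first step
print(learning_schedule(950))  # 0.005 after many steps
```

Plugging a schedule like this into the SGD or Mini-batch GD update (instead of a constant `eta`) is what lets those algorithms settle at the minimum rather than bouncing around it indefinitely.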

*Figure 4-11. Gradient Descent paths in parameter space*

Let’s compare the algorithms we’ve discussed so far for Linear Regression^8 (recall that *m* is the number of training instances and *n* is the number of features); see Table 4-1.

*Table 4-1. Comparison of algorithms for Linear Regression*

| Algorithm | Large *m* | Out-of-core support | Large *n* | Hyperparams | Scaling required | Scikit-Learn |
|---|---|---|---|---|---|---|
| Normal Equation | Fast | No | Slow | 0 | No | `LinearRegression` |
| Batch GD | Slow | No | Fast | 2 | Yes | n/a |
| Stochastic GD | Fast | Yes | Fast | ≥2 | Yes | `SGDRegressor` |
| Mini-batch GD | Fast | Yes | Fast | ≥2 | Yes | n/a |

There is almost no difference after training: all these algorithms end up with very similar models and make predictions in exactly the same way.
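You can check this claim directly. The sketch below, using NumPy only and illustrative data and hyperparameters, fits the same toy linear dataset with the Normal Equation (via the pseudoinverse) and with Stochastic GD under a decaying learning schedule, then compares their predictions:

```python
import numpy as np

# Toy linear data: y = 4 + 3x + noise (illustrative values)
rng = np.random.default_rng(0)
m = 200
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]  # add the bias column

# Normal Equation: closed-form least-squares solution
theta_ne = np.linalg.pinv(X_b) @ y

# Stochastic GD: one random instance per step, decaying learning rate
theta_sgd = rng.standard_normal((2, 1))
t = 0
for epoch in range(50):
    for _ in range(m):
        i = rng.integers(m)
        xi, yi = X_b[i:i + 1], y[i:i + 1]
        grad = 2 * xi.T @ (xi @ theta_sgd - yi)
        eta = 5.0 / (t + 50.0)  # learning schedule (assumed hyperparameters)
        theta_sgd -= eta * grad
        t += 1

# Predict at x = 0 and x = 2 with both models; the gap should be tiny
X_new = np.array([[1.0, 0.0], [1.0, 2.0]])
gap = np.abs(X_new @ theta_ne - X_new @ theta_sgd).max()
print(gap)
```

Both models are estimating the same least-squares parameters, so once SGD has settled, their predictions differ only by the residual noise of the stochastic walk.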
