Understand Stochastic Gradient Descent: A Beginner Tutorial

7 Out-of-core algorithms are discussed in Chapter 1.

The main problem with Batch Gradient Descent is the fact that it uses the whole

training set to compute the gradients at every step, which makes it very slow when

the training set is large. At the opposite extreme, Stochastic Gradient Descent just

picks a random instance in the training set at every step and computes the gradients

based only on that single instance. Obviously this makes the algorithm much faster

since it has very little data to manipulate at every iteration. It also makes it possible to

train on huge training sets, since only one instance needs to be in memory at each

iteration (SGD can be implemented as an out-of-core algorithm.7)

On the other hand, due to its stochastic (i.e., random) nature, this algorithm is much

less regular than Batch Gradient Descent: instead of gently decreasing until it reaches

the minimum, the cost function will bounce up and down, decreasing only on aver‐

age. Over time it will end up very close to the minimum, but once it gets there it will

continue to bounce around, never settling down (see Figure 4-9). So once the algo‐

rithm stops, the final parameter values are good, but not optimal.

Figure 4-9. Stochastic Gradient Descent

This randomness can actually help the algorithm jump out of local minima, so Stochastic Gradient Descent has a better chance of finding the global minimum than Batch Gradient Descent does.

Therefore randomness is good to escape from local optima, but bad because it means

that the algorithm can never settle at the minimum. One solution to this dilemma is

to gradually reduce the learning rate. The steps start out large (which helps make

quick progress and escape local minima), then get smaller and smaller, allowing the

algorithm to settle at the global minimum. This process is called simulated annealing,

because it resembles the process of annealing in metallurgy where molten metal is

slowly cooled down. The function that determines the learning rate at each iteration

is called the learning schedule. If the learning rate is reduced too quickly, you may get

stuck in a local minimum, or even end up frozen halfway to the minimum. If the

learning rate is reduced too slowly, you may jump around the minimum for a long

time and end up with a suboptimal solution if you halt training too early.

This code implements Stochastic Gradient Descent using a simple learning schedule:

n_epochs = 50
t0, t1 = 5, 50  # learning schedule hyperparameters

def learning_schedule(t):
    return t0 / (t + t1)

theta = np.random.randn(2, 1)  # random initialization

for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)
        xi = X_b[random_index:random_index + 1]
        yi = y[random_index:random_index + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients

By convention we iterate by rounds of m iterations; each round is called an epoch.

While the Batch Gradient Descent code iterated 1,000 times through the whole training set, this code goes through the training set only 50 times and reaches a fairly good solution:


>>> theta

array([[ 4.21076011],
       [ 2.74856079]])
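Because the snippet above relies on X_b, y, and m defined earlier in the chapter, here is a self-contained sketch that can be run end to end; the synthetic data generation (y = 4 + 3x plus Gaussian noise) is an assumption mirroring the chapter's earlier Linear Regression example:

```python
import numpy as np

np.random.seed(42)
m = 100                                 # number of training instances
X = 2 * np.random.rand(m, 1)            # feature values in [0, 2)
y = 4 + 3 * X + np.random.randn(m, 1)   # linear target with Gaussian noise
X_b = np.c_[np.ones((m, 1)), X]         # add x0 = 1 (bias term) to every instance

n_epochs = 50
t0, t1 = 5, 50  # learning schedule hyperparameters

def learning_schedule(t):
    return t0 / (t + t1)

theta = np.random.randn(2, 1)  # random initialization

for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)          # pick one instance at random
        xi = X_b[random_index:random_index + 1]
        yi = y[random_index:random_index + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)  # gradient on that single instance
        eta = learning_schedule(epoch * m + i)        # decaying learning rate
        theta = theta - eta * gradients

print(theta)  # should land close to the true parameters [[4.], [3.]]
```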

Figure 4-10 shows the first 10 steps of training (notice how irregular the steps are).

Figure 4-10. Stochastic Gradient Descent first 10 steps

Note that since instances are picked randomly, some instances may be picked several

times per epoch while others may not be picked at all. If you want to be sure that the

algorithm goes through every instance at each epoch, another approach is to shuffle

the training set, then go through it instance by instance, then shuffle it again, and so

on. However, this generally converges more slowly.
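The shuffle-then-sweep variant described above can be sketched as follows; this is a minimal illustration, and the synthetic data setup is an assumption mirroring the earlier example:

```python
import numpy as np

np.random.seed(0)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]

t0, t1 = 5, 50

def learning_schedule(t):
    return t0 / (t + t1)

theta = np.random.randn(2, 1)
n_epochs = 50

for epoch in range(n_epochs):
    # Shuffle once per epoch so every instance is visited exactly once:
    shuffled_indices = np.random.permutation(m)
    for i, idx in enumerate(shuffled_indices):
        xi = X_b[idx:idx + 1]
        yi = y[idx:idx + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients
```

Compared with purely random picks, this guarantees full coverage of the training set in each epoch, at the cost of the generally slower convergence noted above.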

To perform Linear Regression using SGD with Scikit-Learn, you can use the SGDRegressor class, which defaults to optimizing the squared error cost function. The following code runs 50 epochs, starting with a learning rate of 0.1 (eta0=0.1), using the default learning schedule (different from the preceding one), and it does not use any regularization (penalty=None; more details on this shortly):

from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(max_iter=50, penalty=None, eta0=0.1)

sgd_reg.fit(X, y.ravel())

Once again, you find a solution very close to the one returned by the Normal Equation:

>>> sgd_reg.intercept_, sgd_reg.coef_

(array([ 4.18380366]), array([ 2.74205299]))
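As a sanity check, the Normal Equation solution can be computed directly with NumPy; this is a sketch, and the synthetic data generation is an assumption mirroring the chapter's earlier example:

```python
import numpy as np

np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]  # add x0 = 1 (bias term)

# Closed-form solution: theta_best = (X_b^T X_b)^{-1} X_b^T y
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
print(theta_best)  # close to the true parameters [[4.], [3.]]
```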


Implementing Handwritten Digit Recognition Using Keras

In this example, we are going to see how to recognize handwritten digits using Keras. The mnist_keras_training.py script creates the model using a four-layer neural network, as demonstrated in the following code snippet:

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

def create_model():
    """Create the model using the Sequential API"""
    # Create a sequential model (a simple NN), adding a softmax activation at the end with 10 units:
    model = Sequential()
    model.add(Dense(units=128, activation="relu", input_shape=(784,)))
    model.add(Dense(units=128, activation="relu"))
    model.add(Dense(units=128, activation="relu"))
    model.add(Dense(units=10, activation="softmax"))
    # Compile the model using the "categorical_crossentropy" loss function
    # and the Stochastic Gradient Descent optimizer:
    model.compile(optimizer=SGD(0.001), loss="categorical_crossentropy", metrics=["accuracy"])
    # Return the created model:
    return model

In this case, we have compiled the model using the categorical_crossentropy loss function, which is well suited for comparing two probability distributions, and the stochastic gradient descent (SGD) optimizer.
To load the MNIST data, we have to use the following code:

import keras
from keras.datasets import mnist

(train_x, train_y), (test_x, test_y) = mnist.load_data()

Additionally, we have to reshape the loaded data so that it has the proper shape:

train_x = train_x.reshape(60000, 784)
test_x = test_x.reshape(10000, 784)
train_y = keras.utils.to_categorical(train_y, 10)
test_y = keras.utils.to_categorical(test_y, 10)

At this point, we can create the model, train it, save it, and also get the accuracy obtained when evaluating the testing data:

# Create the model:
model = create_model()
# Use the created model for training:
model.fit(train_x, train_y, batch_size=32, epochs=10, verbose=1)
# Save the created model:
model.save("mnist-model.h5")
# Get the accuracy when testing:
accuracy = model.evaluate(x=test_x, y=test_y, batch_size=32)
# Show the accuracy:
print("Accuracy: ", accuracy[1])

At this point, we are ready to use the pre-trained model for predicting new handwritten digits in images. This is performed in the mnist_keras_predicting.py script as follows:

import cv2

# Note: images should have a black background:
def load_digit(image_name):
    """Loads a digit and pre-processes it in order to have the proper format"""
    gray = cv2.imread(image_name, cv2.IMREAD_GRAYSCALE)
    gray = cv2.resize(gray, (28, 28))
    gray = gray.reshape((1, 784))
    return gray

# Create the model:
model = create_model()
# Load parameters of the model from the saved model file:
model.load_weights("mnist-model.h5")
# Load some test images:
test_digit_0 = load_digit("digit_0.png")
[…]
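The prediction step itself typically decodes the network's softmax output by taking the index with the highest probability. Here is a minimal sketch using a dummy probability vector standing in for the model's output (the array values are illustrative, not real model output):

```python
import numpy as np

# Dummy softmax output standing in for model.predict(test_digit_0); shape (1, 10):
softmax_output = np.array([[0.01, 0.02, 0.05, 0.02, 0.03,
                            0.70, 0.05, 0.04, 0.05, 0.03]])
# The predicted digit is the class with the highest probability:
predicted_class = int(np.argmax(softmax_output, axis=1)[0])
print("Predicted digit:", predicted_class)  # → 5
```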