
**Feedforward and feedback artificial **

**neural networks**

Artificial neural networks are described by three components. The first is the model's

**architecture**, or topology, which describes the layers of neurons and structure of the

connections between them. The second component is the activation function used by

the artificial neurons. The third component is the learning algorithm that finds the

optimal values of the weights.

There are two main types of artificial neural networks. **Feedforward neural **

**networks** are the most common type of neural net, and are defined by their directed

acyclic graphs. Signals only travel in one direction—towards the output layer—in

feedforward neural networks. Conversely, **feedback neural networks**, or recurrent

neural networks, do contain cycles. The feedback cycles can represent an internal

state for the network that can cause the network's behavior to change over time

based on its input. Feedforward neural networks are commonly used to learn a

function to map an input to an output. The temporal behavior of feedback neural

networks makes them suitable for processing sequences of inputs. Because feedback

neural networks are not implemented in scikit-learn, we will limit our discussion to

only feedforward neural networks.

**Multilayer perceptrons**

The **multilayer perceptron** (**MLP**) is one of the most commonly used artificial

neural networks. The name is a slight misnomer; a multilayer perceptron is not a

single perceptron with multiple layers, but rather multiple layers of artificial neurons

that can be perceptrons. The layers of the MLP form a directed, acyclic graph.

Generally, each layer is fully connected to the subsequent layer; the output of each

artificial neuron in a layer is an input to every artificial neuron in the next layer

towards the output. MLPs have three or more layers of artificial neurons.


The **input layer** consists of simple input neurons. The input neurons are connected

to at least one **hidden layer** of artificial neurons. The hidden layer represents latent

variables; the input and output of this layer cannot be observed in the training data.

Finally, the last hidden layer is connected to an **output layer**. The following diagram

depicts the architecture of a multilayer perceptron with three layers. The neurons

labeled **+1** are bias neurons and are not depicted in most architecture diagrams.


The artificial neurons, or **units**, in the hidden layer commonly use nonlinear

activation functions such as the hyperbolic tangent function and the logistic

function, which are given by the following equations:

$$f(x) = \tanh(x)$$

$$f(x) = \frac{1}{1 + e^{-x}}$$
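These two activation functions, along with the derivatives that backpropagation will need later, can be written directly in Python. This is a minimal NumPy sketch; the function names are my own:

```python
import numpy as np

def tanh_activation(x):
    # Hyperbolic tangent activation: f(x) = tanh(x), output in (-1, 1)
    return np.tanh(x)

def logistic_activation(x):
    # Logistic activation: f(x) = 1 / (1 + e^(-x)), output in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

# Both functions are smooth, and their derivatives can be expressed in
# terms of their own outputs, which backpropagation exploits:
def tanh_derivative(y):
    # dy/dx = 1 - tanh(x)^2, where y = tanh(x)
    return 1.0 - y ** 2

def logistic_derivative(y):
    # dy/dx = f(x) * (1 - f(x)), where y = f(x)
    return y * (1.0 - y)
```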

As with other supervised models, our goal is to find the values of the weights that

minimize the value of a cost function. The mean squared error cost function is

commonly used with multilayer perceptrons. It is given by the following equation,

where *m* is the number of training instances:

$$MSE = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - f\left(x_i\right)\right)^2$$
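This cost function translates directly into code. A minimal sketch, assuming NumPy:

```python
import numpy as np

def mse(y_true, y_pred):
    # MSE = (1/m) * sum over the m training instances of (y_i - f(x_i))^2
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)
```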

**Minimizing the cost function**

The **backpropagation** algorithm is commonly used in conjunction with an

optimization algorithm such as gradient descent to minimize the value of the cost

function. The algorithm's name is a portmanteau of *backward propagation*,

and refers to the direction in which errors flow through the layers of the network.

Backpropagation can theoretically be used to train a feedforward network with any

number of hidden units arranged in any number of layers, though computational

power constrains this capability.

Backpropagation is similar to gradient descent in that it uses the gradient of the

cost function to update the values of the model parameters. Unlike the linear

models we have previously seen, neural nets contain hidden units that represent

latent variables; we can't tell what the hidden units should do from the training

data. If we do not know what the hidden units should do, we cannot calculate their

errors, and we cannot calculate the gradient of the cost function with respect to their

weights. A naive solution to overcome this is to randomly perturb the weights for

the hidden units. If a random change to one of the weights decreases the value of

the cost function, we save the change and randomly change the value of another

weight. An obvious problem with this solution is its prohibitive computational cost.

Backpropagation provides a more efficient solution.
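The naive random-perturbation scheme can be sketched as follows; the function name and arguments are my own:

```python
import random

def perturb_one_weight(weights, cost, step=0.01):
    # Randomly perturb a single weight; keep the change only if the value of
    # the cost function decreases, otherwise revert it.
    i = random.randrange(len(weights))
    old_value = weights[i]
    old_cost = cost(weights)
    weights[i] += random.uniform(-step, step)
    if cost(weights) >= old_cost:
        weights[i] = old_value  # the perturbation did not help; undo it
    return weights
```

Every trial requires a full evaluation of the cost function over the training set, and a network with thousands of weights requires an enormous number of trials, which is what makes this approach impractical.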


We will step through training a feedforward neural network using backpropagation.

This network has two input units, two hidden layers that both have three hidden

units, and two output units. The input units are both fully connected to the first

hidden layer's units, called Hidden1, Hidden2, and Hidden3. The edges connecting

the units are initialized to small random weights.

**Forward propagation**

During the forward propagation stage, the features are input to the network and fed

through the subsequent layers to produce the output activations. First, we compute

the activation for the unit Hidden1. We find the weighted sum of input to Hidden1,

and then process the sum with the activation function. Note that Hidden1 receives a

constant input from a bias unit that is not depicted in the diagram in addition to the

inputs from the input units. In the following diagram, *g*(*x*) is the activation function:
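The computation for a single unit can be sketched as follows. The input values, weights, and bias weight below are hypothetical stand-ins for the values in the book's diagram, and the logistic function stands in for *g*(*x*):

```python
import numpy as np

def logistic(x):
    # Stands in for the activation function g(x) in the diagram
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical inputs and weights; the book initializes weights to small
# random values.
input1, input2 = 0.8, 0.3
weight1, weight2 = 0.4, 0.1   # weights on the edges from Input1 and Input2
bias_weight = 0.5             # weight on the edge from the bias unit

# The preactivation is the weighted sum of Hidden1's inputs, including the
# constant input of 1 from the bias unit.
preactivation = weight1 * input1 + weight2 * input2 + bias_weight * 1.0

# The activation is the preactivation processed by the activation function.
activation = logistic(preactivation)
```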


Next, we compute the activation for the second hidden unit. Like the first hidden

unit, it receives weighted inputs from both of the input units and a constant input

from a bias unit. We then process the weighted sum of the inputs, or **preactivation**,

with the activation function as shown in the following figure:


We then compute the activation for Hidden3 in the same manner:


Having computed the activations of all of the hidden units in the first layer, we

proceed to the second hidden layer. In this network, the first hidden layer is fully

connected to the second hidden layer. Similar to the units in the first hidden layer,

the units in the second hidden layer receive a constant input from bias units that are

not depicted in the diagram. We proceed to compute the activation of Hidden4:


We next compute the activations of Hidden5 and Hidden6. Having computed the

activations of all of the hidden units in the second hidden layer, we proceed to the

output layer in the following figure. The activation of Output1 is the weighted sum

of the second hidden layer's activations processed through an activation function.

Similar to the hidden units, the output units both receive a constant input from a

bias unit:


We calculate the activation of Output2 in the same manner:

We computed the activations of all of the units in the network, and we have now

completed forward propagation. The network is not likely to approximate the true

function well using the initial random values of the weights. We must now update

the values of the weights so that the network can better approximate our function.
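The full forward pass through this 2-3-3-2 network can be sketched with matrix operations. This is a NumPy sketch with a hypothetical input and randomly initialized weights, not the book's worked values:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# 2 input units, two hidden layers of 3 units each, and 2 output units.
# Each layer also has a bias vector, standing in for the bias units.
W1 = rng.normal(scale=0.1, size=(2, 3)); b1 = np.zeros(3)  # input -> hidden 1
W2 = rng.normal(scale=0.1, size=(3, 3)); b2 = np.zeros(3)  # hidden 1 -> hidden 2
W3 = rng.normal(scale=0.1, size=(3, 2)); b3 = np.zeros(2)  # hidden 2 -> output

x = np.array([0.5, -0.2])  # hypothetical input values

# Each layer's activations are the logistic function applied to the weighted
# sums of the previous layer's activations.
h1 = logistic(x @ W1 + b1)       # Hidden1, Hidden2, Hidden3
h2 = logistic(h1 @ W2 + b2)      # Hidden4, Hidden5, Hidden6
output = logistic(h2 @ W3 + b3)  # Output1, Output2
```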


**Backpropagation**

We can calculate the error of the network only at the output units. The hidden units

represent latent variables; we cannot observe their true values in the training data

and thus, we have nothing to compute their error against. In order to update their

weights, we must propagate the network's errors backwards through its layers. We

will begin with Output1. Its error is equal to the difference between the true and

predicted outputs, multiplied by the partial derivative of the unit's activation:
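Assuming a logistic output unit, whose partial derivative with respect to its preactivation is *a*(1 − *a*) where *a* is the activation, this error can be sketched as follows (the function name is my own):

```python
def logistic_output_error(y_true, activation):
    # (true output - predicted output) multiplied by the partial derivative
    # of the logistic activation, which is activation * (1 - activation)
    return (y_true - activation) * activation * (1.0 - activation)
```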


We then calculate the error of the second output unit:


We computed the errors of the output layer. We can now propagate these errors

backwards to the second hidden layer. First, we will compute the error of hidden unit

Hidden4. We multiply the error of Output1 by the value of the weight connecting

Hidden4 and Output1. We similarly weigh the error of Output2. We then add these

errors and calculate the product of their sum and the partial derivative of Hidden4:
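Assuming logistic hidden units, the calculation for Hidden4 generalizes to the following helper (the names are my own):

```python
def logistic_hidden_error(activation, downstream_errors, downstream_weights):
    # Weight each downstream unit's error by the edge connecting this unit
    # to it, sum the weighted errors, and multiply the sum by the partial
    # derivative of this unit's logistic activation.
    weighted_sum = sum(e * w for e, w in zip(downstream_errors,
                                             downstream_weights))
    return weighted_sum * activation * (1.0 - activation)
```

For Hidden4, `downstream_errors` would hold the errors of Output1 and Output2, and `downstream_weights` the weights on the edges connecting Hidden4 to them.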


We similarly compute the errors of Hidden5:


We then compute the Hidden6 error in the following figure:


We calculated the error of the second hidden layer with respect to the output layer.

Next, we will continue to propagate the errors backwards towards the input layer.

The error of the hidden unit Hidden1 is the product of its partial derivative and the

weighted sums of the errors in the second hidden layer:


We similarly compute the error for hidden unit Hidden2:


We similarly compute the error for Hidden3:

We computed the errors of the first hidden layer. We can now use these errors to

update the values of the weights. We will first update the weights for the edges

connecting the input units to Hidden1 as well as the weight for the edge connecting

the bias unit to Hidden1. We will increment the value of the weight connecting

Input1 and Hidden1 by the product of the learning rate, error of Hidden1, and the

value of Input1.


We will similarly increment the value of Weight2 by the product of the learning rate,

error of Hidden1, and the value of Input2. Finally, we will increment the value of the

weight connecting the bias unit to Hidden1 by the product of the learning rate, error

of Hidden1, and one.
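This update rule can be sketched as a one-line helper (hypothetical names):

```python
def updated_weight(weight, learning_rate, unit_error, input_value):
    # Increment the weight by the product of the learning rate, the error of
    # the unit the edge feeds into, and the value carried along the edge.
    return weight + learning_rate * unit_error * input_value
```

The bias weight is a special case with an input value of one: `updated_weight(bias, learning_rate, hidden1_error, 1.0)`.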


We will then update the values of the weights connecting hidden unit Hidden2 to the

input units and the bias unit using the same method:


Next, we will update the values of the weights connecting the input layer to Hidden3:


Now that the values of the weights connecting the input layer to the first hidden
layer have been updated, we can continue to the weights connecting the first hidden layer to the

second hidden layer. We will increment the value of Weight7 by the product of the

learning rate, error of Hidden4, and the output of Hidden1. We continue to similarly

update the values of weights Weight8 to Weight15:


The weights for Hidden5 and Hidden6 are updated in the same way. We updated

the values of the weights connecting the two hidden layers. We can now update the

values of the weights connecting the second hidden layer and the output layer. We

increment the values of weights Weight16 through Weight21 using the same method that we

used for the weights in the previous layers:

After incrementing the value of Weight21 by the product of the learning rate,

error of Output2, and the activation of Hidden6, we have finished updating the

values of the weights for the network. We can now perform another forward

pass using the new values of the weights; the value of the cost function produced

using the updated weights should be smaller. We will repeat this process until

the model converges or another stopping criterion is satisfied. Unlike the linear

models we have discussed, backpropagation does not optimize a convex function.

It is possible that backpropagation will converge on parameter values that specify

a local, rather than global, minimum. In practice, local optima are frequently

adequate for many applications.
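The entire procedure (forward propagation, backpropagation of the errors, weight updates, and repetition until convergence) can be sketched end to end on a toy problem. This is a minimal NumPy illustration of the technique described in this section, not the book's code; the architecture, random seed, and learning rate are arbitrary choices:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets

# A 2-3-1 network of logistic units, initialized with small random weights.
W1 = rng.normal(scale=0.5, size=(2, 3)); b1 = np.zeros(3)
W2 = rng.normal(scale=0.5, size=(3, 1)); b2 = np.zeros(1)
learning_rate = 0.5

initial_cost = np.mean((y - logistic(logistic(X @ W1 + b1) @ W2 + b2)) ** 2)

for epoch in range(20000):
    # Forward propagation
    h = logistic(X @ W1 + b1)
    out = logistic(h @ W2 + b2)
    # Backpropagation: output errors first, then hidden errors
    out_error = (y - out) * out * (1 - out)
    h_error = (out_error @ W2.T) * h * (1 - h)
    # Weight updates: increment each weight by the product of the learning
    # rate, the downstream unit's error, and the value carried along the edge
    W2 += learning_rate * h.T @ out_error
    b2 += learning_rate * out_error.sum(axis=0)
    W1 += learning_rate * X.T @ h_error
    b1 += learning_rate * h_error.sum(axis=0)

predictions = (out > 0.5).astype(int)
```

With these settings the cost typically falls far below its initial value, and the network usually classifies all four XOR points correctly; a different seed may converge to a local minimum, as discussed above.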
