Backpropagation in Neural Networks



Introduction


We have already written Neural Networks in Python in the previous chapters of our tutorial. We could train these networks, but we didn't explain the mechanism used for training. We used backpropagation without saying so. Backpropagation is a commonly used method for training artificial neural networks, especially deep neural networks. It is needed to calculate the gradient, which we use to adapt the weights of the weight matrices. The weights of the neurons (nodes) of our network are adjusted by calculating the gradient of the loss function. For this purpose a gradient descent optimization algorithm is used. Backpropagation is also called the backward propagation of errors.

Quite often people are frightened away by the mathematics used in it. We try to explain it in simple terms.

Many articles and tutorials start explaining gradient descent with mountains. Imagine you are dropped on a mountain by a helicopter at night or in heavy fog, not necessarily at the top. Let's further imagine that this mountain is on an island and you want to reach sea level. You have to go down, but you can hardly see anything, maybe just a few metres. Your task is to find your way down, but you cannot see the path. You can use the method of gradient descent: you examine the steepness at your current position and proceed in the direction of the steepest descent. You take only a few steps and then you stop to reorientate yourself, i.e. you apply the same procedure again, looking for the steepest descent from your new position.

This procedure is depicted in the following diagram in a two-dimensional space.

[Figure: Gradient descent visualisation]

Going on like this, you will arrive at a position where there is no further descent.

Every direction goes upwards. You may have reached the deepest level, the global minimum, but you might just as well be stuck in a basin. If you start at the position on the right side of our image, everything works out fine, but from the left side you will be stuck in a local minimum. If you now imagine, not very realistically, that you are dropped many times at random places on this island, you will find a way down to sea level. This is what we actually do when we train a neural network.
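
The following little Python sketch illustrates this procedure numerically. The function `f` plays the role of the mountain profile and the learning rate is the size of our steps; both are made-up values for the illustration only:

```python
def f(x):
    """A made-up 'mountain profile' with a local and a global minimum."""
    return x**4 - 3 * x**2 + x

def f_prime(x):
    """The derivative of f, i.e. the steepness at position x."""
    return 4 * x**3 - 6 * x + 1

x = 2.0                  # starting position where the 'helicopter' dropped us
learning_rate = 0.01     # size of the steps we take

for step in range(200):
    x -= learning_rate * f_prime(x)   # a small step in the direction of steepest descent

print(x, f(x))           # where we ended up and how 'high' we still are
```

Depending on the starting position, the loop ends up in the global minimum near $x \approx -1.3$ or in the local minimum near $x \approx 1.1$, exactly like the hiker in the fog.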



Backpropagation in Detail

Now, we have to go into the details, i.e. the mathematics.

We will start with the simpler case. We look at a linear network. Linear neural networks are networks where the output signal is created by summing up all the weighted input signals. No activation function will be applied to this sum, which is the reason for the linearity.
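
As a small illustration, the output of such a linear network is nothing more than a matrix-vector product. The weight matrix and the input vector below are made-up example values:

```python
import numpy as np

# made-up example: 3 input nodes and 2 output nodes
weights = np.array([[0.6, 0.1, 0.3],
                    [0.2, 0.7, 0.5]])    # one row per output node
inputs = np.array([0.5, 0.8, 0.2])

# linear network: just the weighted sums, no activation function applied
outputs = weights @ inputs
print(outputs)       # [0.44 0.76]
```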

We will use the following simple network.

[Figure: Example network]

We have labels, i.e. target or desired values $t_i$, for each output value $o_i$. In principle, the error is the difference between the target and the actual output:

$$e_i = t_i - o_i$$

We will later use a squared error function, because it has better characteristics for the algorithm:

$$e_i = \frac{1}{2} ( t_i - o_i ) ^ 2 $$

[Figure: Backpropagation at one node of the network]

We will have a look at the output value $o_1$, which depends on the values $w_{11}$, $w_{21}$, $w_{31}$ and $w_{41}$. Let's assume the calculated value ($o_1$) is 0.92 and the desired value ($t_1$) is 1. In this case the error is

$$e_1 = t_1 - o_1 = 1 - 0.92 = 0.08$$

Depending on this error, we have to change the weights of the incoming values accordingly. We have four weights, so we could spread the error evenly. Yet, it makes more sense to do it proportionally, according to the weight values: the larger a weight is in relation to the other weights, the more of the error it is responsible for. This means that we can calculate the fraction of the error $e_1$ in $w_{11}$ as:

$$e_1 \cdot \frac{w_{11}}{\sum_{i=1}^{4} w_{i1}}$$

This means in our example:

$$0.08 \cdot \frac{0.6}{0.6 + 0.4 + 0.1 + 0.2} = 0.037$$
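
We can verify this little calculation in Python. The weights are the four values from the example, and splitting the error proportionally is an element-wise division by their sum:

```python
import numpy as np

w_col1 = np.array([0.6, 0.4, 0.1, 0.2])   # weights w_11, w_21, w_31, w_41
e_1 = 0.08                                # error at output node o_1

# fraction of e_1 attributed to each weight, proportional to its size
error_portions = e_1 * w_col1 / w_col1.sum()
print(error_portions)       # approx. [0.0369 0.0246 0.0062 0.0123]
```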

The total error in our weight matrix between the hidden and the output layer, which we called 'who' in our previous chapter, looks like this:

$$ e_{who} = \begin{bmatrix} \frac{w_{11}}{\sum_{i=1}^{4} w_{i1}} & \frac{w_{12}}{\sum_{i=1}^{4} w_{i2}} \\ \frac{w_{21}}{\sum_{i=1}^{4} w_{i1}} & \frac{w_{22}}{\sum_{i=1}^{4} w_{i2}} \\ \frac{w_{31}}{\sum_{i=1}^{4} w_{i1}} & \frac{w_{32}}{\sum_{i=1}^{4} w_{i2}} \\ \frac{w_{41}}{\sum_{i=1}^{4} w_{i1}} & \frac{w_{42}}{\sum_{i=1}^{4} w_{i2}} \\ \end{bmatrix} \cdot \begin{bmatrix} e_1 \\ e_2 \end{bmatrix} $$

You can see that the denominator within each column of the left matrix is always the same. It is only a scaling factor that normalizes the error portions of a column, so we can drop it and the calculation gets a lot simpler:

$$ e_{who} = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ w_{31} & w_{32} \\ w_{41} & w_{42} \\ \end{bmatrix} \cdot \begin{bmatrix} e_1 \\ e_2 \end{bmatrix} $$

If you compare this weight matrix with the 'who' matrix of our chapter Neuronal Network Using Python and Numpy, you will notice that it is the transpose of 'who'.

$$e_{who} = who.T \cdot e$$
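
In NumPy this backpropagation of the errors to the hidden layer is a single matrix multiplication with the transposed weight matrix. The weight values and the second output error below are made-up example numbers; as in the formula above, 'who' has one row per output node:

```python
import numpy as np

# made-up weight matrix 'who': 2 output nodes, 4 hidden nodes
who = np.array([[0.6, 0.4, 0.1, 0.2],
                [0.3, 0.5, 0.2, 0.7]])
e = np.array([0.08, 0.04])          # errors of the two output nodes

# propagate the output errors back to the four hidden nodes
e_who = who.T @ e
print(e_who)        # [0.06  0.052 0.016 0.044]
```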

So, this has been the easy part for linear neural networks.

Now we want to calculate the error in a network with an activation function, i.e. a non-linear network. The derivative of the error function describes the slope. As we mentioned at the beginning of this chapter, we want to descend. The derivative describes how the error $E$ changes as the weight $w_{ij}$ changes:

$$\frac{\partial E}{\partial w_{ij}}$$

The error function $E$ over all the output nodes $o_j$, $j = 1, \ldots, n$, where $n$ is the number of output nodes, is:

$$E = \sum_{j=1}^{n} \frac{1}{2} (t_j - o_j)^2$$
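
Expressed in Python, this total error $E$ is a one-liner; the target and output values below are made-up examples:

```python
import numpy as np

targets = np.array([1.0, 0.0])        # desired values t_j
outputs = np.array([0.92, 0.15])      # actual output values o_j

E = np.sum(0.5 * (targets - outputs)**2)
print(E)      # 0.5 * (0.08**2 + 0.15**2) = 0.01445
```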

Now we can insert this into our derivative:

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \frac{1}{2} \sum_{j=1}^{n} (t_j - o_j)^2$$

If you have a look at our example network, you will see that an output node $o_j$ only depends on the weights $w_{ij}$ with $i = 1, \ldots, m$, where $m$ is the number of hidden nodes, i.e. on the input signals coming from the hidden layer.

This means that we can calculate the error for every output node independently of the others, and the sum disappears. This is the error term for a node $j$:

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \frac{1}{2} (t_j - o_j)^2$$

The value $t_j$ is a constant, because it does not depend on any input signals or weights. We can apply the chain rule to the previous term to simplify the differentiation:

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_{j}} \cdot \frac{\partial o_j}{\partial w_{ij}}$$

In the previous chapter of our tutorial, we used the sigmoid function as the activation function:

$$\sigma(x) = \frac{1}{1+e^{-x}}$$

The output $o_j$ is calculated by applying the sigmoid function to the sum of the weighted input signals. The first factor of the chain rule is $\frac{\partial E}{\partial o_j} = -(t_j - o_j)$, where the minus sign comes from the inner derivative of $(t_j - o_j)$ with respect to $o_j$. For the second factor we replace $o_j$ by the sigmoid of the weighted sum:

$$\frac{\partial E}{\partial w_{ij}} = -(t_j - o_j) \cdot \frac{\partial }{\partial w_{ij}} \sigma(\sum_{i=1}^{m} w_{ij}h_i)$$

where $m$ is the number of hidden nodes.

The sigmoid function is easy to differentiate:

$$\frac{\partial \sigma(x)}{\partial x} = \sigma(x) \cdot (1 - \sigma(x))$$
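
A small Python helper makes this property visible: the derivative can be computed from the function value alone, which is exactly what we exploit in the network:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1 - s)

x = 0.5
o = sigmoid(x)                        # the output value o_j
print(sigmoid_prime(x), o * (1 - o))  # both expressions give the same result
```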

The complete differentiation looks like this now:

$$\frac{\partial E}{\partial w_{ij}} = -(t_j - o_j) \cdot \sigma(\sum_{i=1}^{m} w_{ij}h_i) \cdot (1 - \sigma(\sum_{i=1}^{m} w_{ij}h_i)) \cdot \frac{\partial }{\partial w_{ij}} \sum_{i=1}^{m} w_{ij}h_i $$

The last factor has to be differentiated with respect to $w_{ij}$. All summands except $w_{ij}h_i$ vanish, and the derivative of $w_{ij}h_i$ with respect to $w_{ij}$ is $h_i$, so only $h_i$ is left:

$$\frac{\partial E}{\partial w_{ij}} = -(t_j - o_j) \cdot \sigma(\sum_{i=1}^{m} w_{ij}h_i) \cdot (1 - \sigma(\sum_{i=1}^{m} w_{ij}h_i)) \cdot h_i $$

Apart from the sign, which cancels against the minus sign of the gradient descent update rule, this is what we used in the method 'train' of our NeuralNetwork class in the previous chapter.
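
To show how this gradient translates into code, here is a minimal sketch of one gradient descent step for the weights between the hidden and the output layer. It only illustrates the formula above with made-up example values; it is not a verbatim copy of the 'train' method:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

learning_rate = 0.1

# made-up example: 4 hidden nodes, 2 output nodes
who = np.array([[0.6, 0.4, 0.1, 0.2],
                [0.3, 0.5, 0.2, 0.7]])           # one row per output node
hidden = np.array([[0.4], [0.9], [0.3], [0.6]])  # hidden outputs h_i as a column vector
targets = np.array([[1.0], [0.0]])               # desired values t_j

outputs = sigmoid(who @ hidden)                  # o_j = sigmoid(sum_i w_ij * h_i)
output_errors = targets - outputs                # (t_j - o_j)

# gradient descent step: the minus sign of dE/dw_ij and the minus sign of the
# update rule cancel, so the correction term is added to the weights
who += learning_rate * (output_errors * outputs * (1.0 - outputs)) @ hidden.T
print(who)
```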