The chain rule

One of the fundamental principles used to compute backpropagation is the chain rule, which is a more general form of the delta rule that we saw for the perceptron.

The chain rule is the property of derivatives that allows us to compute the derivative of a composition of functions. By putting neurons in series, we are effectively creating a composition of functions; therefore, we can apply the chain rule formula:
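In its simplest form, for two functions f and g composed as $h(x) = f(g(x))$, the chain rule states the following:

$$h'(x) = f'(g(x)) \cdot g'(x)$$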

In this particular case, we want to find the weights that minimize our error function. To do that, we differentiate our error function with respect to the weights, and we follow the direction of the descending gradient. So, if we consider a neuron j, we will see that its input comes from the previous part of the network, which we can denote with network_j. The output of the neuron will be denoted with o_j; therefore, applying the chain rule, we will obtain the following formula:
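$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j} \cdot \frac{\partial o_j}{\partial \text{network}_j} \cdot \frac{\partial \text{network}_j}{\partial w_{ij}}$$

Here, E denotes the error function and $w_{ij}$ the weight connecting neuron i to neuron j, so that $\text{network}_j = \sum_k w_{kj}\, o_k$ is the weighted sum of the outputs $o_k$ of the neurons feeding into j (this is the standard notation for this derivation, which we assume throughout).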

Let's focus on each factor of this equation. The first factor is exactly what we had before with the perceptron; therefore, we get the following formula:
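$$\frac{\partial E}{\partial o_j} = o_j - t_j$$

Here, $t_j$ is the target output for neuron j, assuming the same squared error, $E = \frac{1}{2}(t_j - o_j)^2$, that we used for the perceptron.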

This is because, in this case, o_j is also the output of the network itself: neuron j belongs to the output layer, which we can denote with L. If we denote the number of neurons in that layer with l, we will have the following formula:
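$$E = \frac{1}{2}\sum_{k=1}^{l}(t_k - o_k)^2 \;\Rightarrow\; \frac{\partial E}{\partial o_j} = o_j - t_j$$

Only the $k = j$ term of the sum depends on $o_j$, so every other term vanishes when we differentiate.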

That's where the delta rule that we used previously comes from.

When the neuron we are differentiating with respect to is not an output neuron, the formula is more complex, as we need to consider every single neuron that receives its output, since each of them might be connected to a different part of the network. In that case, we have the following formula:
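$$\frac{\partial E}{\partial o_j} = \sum_{l \in L}\left(\frac{\partial E}{\partial \text{network}_l} \cdot \frac{\partial \text{network}_l}{\partial o_j}\right) = \sum_{l \in L} \delta_l\, w_{jl}$$

Here, L is the set of neurons in the next layer that receive the output of neuron j, and $\delta_l$ (defined shortly) is the error term already computed for each of them. This is the step that propagates the error backward through the network.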

Then, we need to differentiate the output of the neuron with respect to its input, network_j. In this case, the activation function is a sigmoid; therefore, the derivative is pretty easy to calculate:
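$$o_j = \varphi(\text{network}_j) = \frac{1}{1 + e^{-\text{network}_j}} \;\Rightarrow\; \frac{\partial o_j}{\partial \text{network}_j} = o_j\,(1 - o_j)$$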

The derivative of the input of neuron j (network_j) with respect to the weight w_ij that connects neuron i with our neuron j is simply the derivative of the weighted sum. In that sum, only one term depends on w_ij; therefore, everything else becomes 0:
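$$\frac{\partial \text{network}_j}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}}\sum_{k} w_{kj}\, o_k = o_i$$

That is, the derivative is simply $o_i$, the output of the neuron on the other end of the weight.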

Now, we can see the general case of the delta rule:
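$$\frac{\partial E}{\partial w_{ij}} = \delta_j\, o_i$$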

Here, δ_j denotes the following quantity:
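$$\delta_j = \frac{\partial E}{\partial o_j} \cdot \frac{\partial o_j}{\partial \text{network}_j} = \begin{cases} (o_j - t_j)\, o_j\, (1 - o_j) & \text{if } j \text{ is an output neuron} \\ \left(\sum_{l \in L} w_{jl}\, \delta_l\right) o_j\, (1 - o_j) & \text{if } j \text{ is an inner neuron} \end{cases}$$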

Now, the gradient descent technique moves our weights one step in the direction opposite to the gradient, that is, along the descending gradient. The size of this step is something that is up to us to define, depending on how fast we want the algorithm to converge and how close we want to get to the local minimum. If we take too large a step, it's unlikely that we will find the minimum, and if we take too small a step, it will take too much time to find it:
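$$\Delta w_{ij} = -\eta\,\frac{\partial E}{\partial w_{ij}} = -\eta\,\delta_j\, o_i$$

Here, η is the learning rate, the parameter that controls the size of the step.

To make the whole procedure concrete, here is a minimal sketch in Python of a single backpropagation step for a tiny network with one sigmoid hidden layer and one sigmoid output neuron. It follows the delta rule we just derived; all the concrete choices (the network sizes, the learning rate value, and variable names such as W1, W2, and eta) are illustrative assumptions, not fixed by the derivation:

```python
import numpy as np

def sigmoid(z):
    # Logistic activation: 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random(3)        # input vector (3 features, arbitrary)
t = np.array([1.0])      # target output
eta = 0.5                # learning rate: the size of the gradient step

W1 = rng.random((4, 3))  # weights from the input to 4 hidden neurons
W2 = rng.random((1, 4))  # weights from the hidden layer to 1 output neuron

# Forward pass: network_j is the weighted sum, o_j the neuron's output
net_hidden = W1 @ x
o_hidden = sigmoid(net_hidden)
net_out = W2 @ o_hidden
o_out = sigmoid(net_out)

# Output layer: delta_j = (o_j - t_j) * o_j * (1 - o_j)
delta_out = (o_out - t) * o_out * (1 - o_out)

# Hidden layer: delta_j = (sum over l of w_jl * delta_l) * o_j * (1 - o_j)
delta_hidden = (W2.T @ delta_out) * o_hidden * (1 - o_hidden)

# Gradient descent step: w_ij <- w_ij - eta * delta_j * o_i
W2 -= eta * np.outer(delta_out, o_hidden)
W1 -= eta * np.outer(delta_hidden, x)
```

Repeating this step over many examples, and feeding the hidden-layer deltas further back in a deeper network, is all that backpropagation amounts to.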

We mentioned that with gradient descent, we are not guaranteed to find the global minimum, and this is because of the non-convexity of the error functions of neural networks. How well we explore the error space will depend on parameters such as the step size (that is, the learning rate), but also on how well we created the dataset.

Unfortunately, at the moment, there is no formula that guarantees a good way to explore the error surface. It's a process that still requires a bit of craftsmanship, and because of that, some theoretical purists look at deep learning as an inferior technique, preferring more complete statistical formulations. But if we look at the matter from the other side, this can be seen as a great opportunity for researchers to advance the field. The growth of deep learning in practical applications is what has driven the success of the field, demonstrating that the current limitations are not major drawbacks.