Backpropagation
Process:
We start at some random set of weights
Do forward propagation to calculate the prediction and the error before we do backpropagation.
We do backpropagation to estimate the slope of the loss function w.r.t. each weight in the network.
Multiply that slope by the learning rate and subtract the result from the current weights.
Keep going with that cycle until we get to a flat part.
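A minimal sketch of that cycle for a single layer of weights, assuming a sigmoid output and a squared-error loss; the data, the starting weights, and names like learnrate are purely illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Illustrative data: one record x, target y, random starting weights
x = np.array([0.5, -0.2, 0.1])
y = 0.4
weights = np.random.normal(scale=0.1, size=x.shape)
learnrate = 0.5

for step in range(1000):
    # Forward propagation: prediction and error
    output = sigmoid(np.dot(x, weights))
    error = y - output

    # Backpropagation: slope of the squared-error loss w.r.t. each weight
    grad = -error * output * (1 - output) * x

    # Multiply the slope by the learning rate and subtract from the weights
    weights -= learnrate * grad
```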
The slope for a node's value is the sum of the slopes for all the weights that come out of it.
The gradient for a weight is the product of:
Node value feeding into that weight
Slope of loss function w.r.t node it feeds into
Slope of activation function at the node it feeds into
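As a small, hedged illustration of that three-factor product for a single hidden-to-output weight (all numbers and names here are made up):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Illustrative values for one hidden-to-output weight
hidden_value = 0.6            # node value feeding into the weight
weight = 0.3
target, learnrate = 1.0, 0.5

# Forward through that weight
output = sigmoid(hidden_value * weight)

# The three factors listed above
loss_slope = -(target - output)             # slope of the loss w.r.t. the node it feeds into
activation_slope = output * (1 - output)    # slope of the sigmoid at that node
gradient = hidden_value * loss_slope * activation_slope

weight -= learnrate * gradient
```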
Now we've come to the problem of how to make a multilayer neural network learn. Before, we saw how to update weights with gradient descent. The backpropagation algorithm is just an extension of that, using the chain rule to find the error with respect to the weights connecting the input layer to the hidden layer (for a two-layer network).
To update the weights to hidden layers using gradient descent, you need to know how much error each of the hidden units contributed to the final output. Since the output of a layer is determined by the weights between layers, the error resulting from units is scaled by the weights going forward through the network. Since we know the error at the output, we can use the weights to work backwards to hidden layers.
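Here is a hedged sketch of that backward pass for a small two-layer network with sigmoid activations; the shapes and values (3 inputs, 2 hidden units, 1 output) are arbitrary, and the variable names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Illustrative network: 3 inputs, 2 hidden units, 1 output
x = np.array([0.5, 0.1, -0.2])
target = 0.6
weights_input_hidden = np.array([[0.5, -0.6],
                                 [0.1, -0.2],
                                 [0.1,  0.7]])
weights_hidden_output = np.array([0.1, -0.3])

# Forward pass
hidden_output = sigmoid(np.dot(x, weights_input_hidden))
output = sigmoid(np.dot(hidden_output, weights_hidden_output))

# Backward pass: the output error term is scaled back through the weights
output_error_term = (target - output) * output * (1 - output)
hidden_error = weights_hidden_output * output_error_term
hidden_error_term = hidden_error * hidden_output * (1 - hidden_output)
```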
From this example, you can see one of the effects of using the sigmoid function for the activations. The maximum derivative of the sigmoid function is 0.25, so the errors in the output layer get reduced by at least 75%, and errors in the hidden layer are scaled down by at least 93.75%! You can see that if you have a lot of layers, using a sigmoid activation function will quickly reduce the weight steps to tiny values in layers near the input. This is known as the vanishing gradient problem. Later in the course you'll learn about other activation functions that perform better in this regard and are more commonly used in modern network architectures.
The derivative is almost 0 when the input is not near the center.
During backpropagation, we have to multiply several of these derivatives together, and multiplying small numbers by small numbers gives tiny numbers. So the gradient descent step will be extremely tiny.
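A quick numerical check of that shrinking effect (the 0.25 is the maximum of the sigmoid derivative):

```python
import numpy as np

def sigmoid_prime(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

print(sigmoid_prime(0.0))   # 0.25    -- the maximum, at the center
print(sigmoid_prime(5.0))   # ~0.0066 -- almost flat away from the center

# One such factor per sigmoid layer multiplies into the gradient
print(0.25 ** 4)            # 0.00390625 -- after only four layers
```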
To stop this, we can use other activation functions such as ReLU.
You can mix the activation functions: here the last activation function is a sigmoid, since the output still needs to be a probability between 0 and 1. If we make the output a ReLU instead, we can use the network for regression.
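A hedged sketch of such a mix, with a ReLU hidden layer and a sigmoid output so the prediction stays between 0 and 1 (the weights and layer sizes here are made up):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Illustrative forward pass: 3 inputs -> 4 ReLU hidden units -> 1 sigmoid output
x = np.array([0.5, -0.2, 0.1])
W1 = np.random.normal(scale=0.1, size=(3, 4))
W2 = np.random.normal(scale=0.1, size=(4, 1))

hidden = relu(np.dot(x, W1))            # ReLU in the hidden layer
prob = sigmoid(np.dot(hidden, W2))      # sigmoid keeps the output in (0, 1)
```

Swapping the final sigmoid for a ReLU lets the output take unbounded positive values, which is what the regression case above refers to.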
For the most part you have everything you need to implement backpropagation with NumPy.
Firstly, there will likely be a different number of input and hidden units, so trying to multiply the errors and the inputs as row vectors will throw an error.
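For example, with made-up sizes of 2 hidden units and 4 inputs, the element-wise product fails to broadcast:

```python
import numpy as np

hidden_error = np.array([0.4, -0.3])          # 2 hidden units
inputs = np.array([0.5, -0.1, 0.2, 0.8])      # 4 input units

# Raises: ValueError: operands could not be broadcast together with shapes (2,) (4,)
hidden_error * inputs
```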
Also, w_ij is a matrix now, so the right side of the assignment must have the same shape as the left side. Luckily, NumPy takes care of this for us. If you multiply a row vector array with a column vector array, it will multiply the first element in the column by each element in the row vector and set that as the first row in a new 2D array. This continues for each element in the column vector, so you get a 2D array that has shape (len(column_vector), len(row_vector)).
It turns out this is exactly how we want to calculate the weight update step. As before, if you have your inputs as a 2D array with one row, you can also do hidden_error*inputs.T, but that won't work if inputs is a 1D array.
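A short sketch of that shape bookkeeping with made-up sizes (3 inputs, 2 hidden units): turning the inputs into a column vector and the hidden error into a row vector produces the (len(column_vector), len(row_vector)) array of weight updates directly.

```python
import numpy as np

inputs = np.array([0.5, -0.1, 0.2])           # shape (3,)
hidden_error_term = np.array([0.4, -0.3])     # shape (2,)
learnrate = 0.5

# Column (3, 1) times row (1, 2) broadcasts to a (3, 2) array, matching w_ij
delta_w_input_hidden = learnrate * inputs[:, None] * hidden_error_term[None, :]
print(delta_w_input_hidden.shape)   # (3, 2)
```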
Now you're going to implement the backprop algorithm for a network trained on the graduate school admission data. You should have everything you need from the previous exercises to complete this one.
Your goals here:
Implement the forward pass.
Implement the backpropagation algorithm.
Update the weights.
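A hedged sketch of those three steps for a single pass over the data, assuming one sigmoid hidden layer; the features and targets arrays below are random stand-ins for the admissions data prepared in the exercise, and the variable names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Random stand-ins for the prepared admissions data
np.random.seed(42)
features = np.random.rand(10, 6)        # 10 records, 6 features
targets = np.random.randint(0, 2, 10)   # binary admit / not admit

n_records, n_features = features.shape
n_hidden = 2
learnrate = 0.5

weights_input_hidden = np.random.normal(scale=1 / n_features ** 0.5,
                                        size=(n_features, n_hidden))
weights_hidden_output = np.random.normal(scale=1 / n_features ** 0.5,
                                         size=n_hidden)

for x, y in zip(features, targets):
    # Forward pass
    hidden_output = sigmoid(np.dot(x, weights_input_hidden))
    output = sigmoid(np.dot(hidden_output, weights_hidden_output))

    # Backpropagation: output error term, then hidden error term
    output_error_term = (y - output) * output * (1 - output)
    hidden_error_term = (weights_hidden_output * output_error_term
                         * hidden_output * (1 - hidden_output))

    # Update the weights (error is y - output here, so the step is added)
    weights_hidden_output += learnrate * output_error_term * hidden_output
    weights_input_hidden += learnrate * hidden_error_term * x[:, None]
```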