Lesson 5 - Backprop, Accelerated SGD
When you do SGD with momentum, instead of updating the weights by just the gradient * lr, each update is 10% of grad * lr plus 90% of your previous update (i.e. the direction you stepped last time).
This makes the training go faster.
Before, we saw that with a small lr it would take a very long time to converge. But with momentum you're also adding in the step you took last time, so your steps get bigger and bigger, until you overshoot the optimal solution, at which point your gradient points the opposite way from your momentum, so you head back the other way.
S(t) = alpha*grad + (1-alpha)S(t-1)
Very common - exponentially weighted moving average.
The (1-alpha) factor keeps multiplying in the previous values, so S(t-2) is in the equation with a (1-alpha)^2 factor, S(t-3) with (1-alpha)^3, and so on.
So it's basically the thing I want (alpha * grad) plus a weighted average of the last few time periods, where the most recent ones are exponentially more heavily weighted.
That's what momentum is: the current gradient plus an exponentially weighted average of my last few steps.
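Here's a minimal sketch of that update in PyTorch (not the lesson's code; the function name, learning rate and toy loss are just illustrative):

```python
import torch

def sgd_momentum_step(w, grad, step_avg, lr=0.01, alpha=0.1):
    # S(t) = alpha * grad + (1 - alpha) * S(t-1)  -- the EWMA of the steps
    step_avg = alpha * grad + (1 - alpha) * step_avg
    w = w - lr * step_avg          # update the weights using the smoothed step
    return w, step_avg

# Toy usage: minimize the quadratic loss (w**2).sum()
w = torch.tensor([1.0, -2.0])
step_avg = torch.zeros_like(w)
for _ in range(100):
    grad = 2 * w                   # gradient of (w**2).sum() w.r.t. w
    w, step_avg = sgd_momentum_step(w, grad, step_avg, lr=0.1, alpha=0.1)
```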
Dropout: a kind of regularization.
We throw out some activations (and so all the weights associated with those activations are effectively removed too). Each activation is thrown away with probability p. A common value of p is 0.5.
In fastai: ps = the dropout for the layers. You can pass in a list, and then each p will be the dropout for the corresponding layer (as in the sketch below).
For CNNs it's a little different: a single value is used for the last layer, and half that value for the other layers.
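In code that looks roughly like this (a hedged sketch assuming the fastai v1 API used in the course; `tab_data` / `img_data` stand for DataBunches you've already built, and the layer sizes and dropout values are just illustrative):

```python
from fastai.tabular import *   # fastai v1-style imports, as used in the course
from fastai.vision import *

# Tabular model: ps as a list -> one dropout probability per linear layer
learn = tabular_learner(tab_data, layers=[1000, 500], ps=[0.001, 0.01],
                        emb_drop=0.04, metrics=accuracy)

# CNN head: a single value -> used for the last layer, half of it elsewhere
learn = cnn_learner(img_data, models.resnet34, ps=0.5)
```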
There's an interesting wrinkle with dropout:
Training time = when we're doing the weight updates; test time = when we're just making predictions.
At test time we remove dropout, but if we just remove it there are twice as many activations (when p = 0.5). So in the dropout paper they suggest multiplying all of your weights by p at test time. PyTorch instead does the scaling at training time (the surviving activations are divided by 1-p), so you don't need to change anything at test time.
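You can see this in PyTorch directly (a small check, not from the lesson notebook):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()       # training mode: each activation is zeroed with probability p
print(drop(x))     # the survivors are scaled up by 1/(1-p) = 2.0

drop.eval()        # eval/test mode: dropout is a no-op
print(drop(x))     # identical to the input, no rescaling needed
```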
122 = the number of continuous variables.
nn.BatchNorm1d - kind of a bit of regularization, kind of a training helper.
Comes from this paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
Red line: what happens when you train without batch norm (very bumpy). Blue: with batch norm.
This means we can increase our learning rate with BN. The spikes in the red line show when we're at risk of jumping off into a part of weight space that we can't get out of.
The algo:
It takes a mini-batch. Batch norm is a layer, so the thing coming into it is activations (the activations are called x1, x2, etc.).
Find the mean of the activations
Find the variance
Normalize
Scale and shift (most important part):
We take those normalized values and add a vector of biases (beta), so we have a bias layer.
Then we multiply x_i by something that looks like a bias (gamma). It's like having a multiplicative bias layer.
They are learnable numbers. They are PARAMETERS.
This is what the layer does.
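A rough sketch of just that forward pass (training-time batch statistics only, no running averages; the shapes and eps value are my choices):

```python
import torch

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: one mini-batch of activations, shape (batch_size, n_features)
    mean = x.mean(dim=0)                      # 1. mean of the activations
    var = x.var(dim=0, unbiased=False)        # 2. variance of the activations
    x_hat = (x - mean) / (var + eps).sqrt()   # 3. normalize
    return gamma * x_hat + beta               # 4. scale (gamma) and shift (beta)

x = torch.randn(64, 10)
gamma = torch.ones(10, requires_grad=True)    # learnable multiplicative parameter
beta = torch.zeros(10, requires_grad=True)    # learnable additive (bias) parameter
out = batchnorm_forward(x, gamma, beta)
```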
Batch norm helps to do this really important thing which is shifting the outputs up and down, in and out.
Explanation: say we're approximating y with ŷ, and we do this with a NN represented by ŷ = f(w_1, w_2, ..., w_10000, X), where X is the input. We also have a loss function, say MSE.
Let's say we're trying to predict movie review outcomes and they are between 1 and 5.
We've tried to train our model, and the activations at the very end are between -1 and 1. So they are way off where they need to be: the mean and range aren't what we want, we wanted 1 to 5.
So with batch norm we multiply the output of the neural net function by g and add b. We've added 2 more parameter vectors.
Now we can play with the scale with g and affect the mean with b, which is exactly the shifting up and down, in and out, mentioned above.
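A tiny numeric illustration (the values of g and b here are made up; in practice they are learned):

```python
import torch

preds = torch.tensor([-1.0, -0.5, 0.0, 0.5, 1.0])  # final activations, stuck in [-1, 1]
g, b = 2.0, 3.0                                     # scale and shift
print(preds * g + b)                                # tensor([1., 2., 3., 4., 5.]) -- in the 1-5 range we wanted
```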
You definitely want to use it.
Implementation:
Apply a BatchNorm layer.
momentum = 0.1. This isn't momentum like in optimization; this is momentum as in an exponentially weighted moving average. The mean and variance in the algorithm above are tracked as an EWMA across mini-batches, not as the plain statistics of every individual mini-batch.
The higher the momentum in batch norm, the more the tracked mean and variance vary from mini-batch to mini-batch, and so the more of a regularization effect we get.
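A sketch of how such a layer might sit in a model (not the exact fastai model; 122 and momentum=0.1 as above, the rest of the layer sizes are just illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.BatchNorm1d(122, momentum=0.1),  # 122 continuous variables, EWMA momentum
    nn.Linear(122, 100),
    nn.ReLU(),
    nn.BatchNorm1d(100, momentum=0.1),
    nn.Linear(100, 1),
)

x = torch.randn(64, 122)   # a mini-batch of 64 rows of continuous features
out = model(x)             # shape (64, 1)
```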