Training Neural Networks
Early stopping determines the number of epochs we should do: we run gradient descent until the testing error stops decreasing and starts to increase, and at that moment we stop.
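A rough sketch of this idea in Python (the `train_one_epoch` and `validation_error` functions below are hypothetical placeholders for whatever training step and held-out error measure you use):

```python
def train_with_early_stopping(model, train_one_epoch, validation_error,
                              patience=3, max_epochs=100):
    """Run gradient descent epoch by epoch and stop once the held-out
    error has stopped improving for `patience` consecutive epochs."""
    best_error = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)            # one pass of gradient descent
        error = validation_error(model)   # error on data we did not train on
        if error < best_error:
            best_error = error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                     # the error has started increasing: stop
    return model
```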
Exercise
The model on the right (the one with large weights) will generate large errors, and it's going to be difficult for the model to tune and correct them. The function is very steep, so gradient descent struggles: the derivatives are close to 0 almost everywhere, except in the narrow middle region where they are very large.
So how do we prevent this type of overfitting from happening? We tweak the error function a little bit: we want to penalize large coefficients, so we take the old error function and add a regularization term. There are 2 options: L1 and L2.

The lambda parameter tells us how much we want to penalize large coefficients.

With L1, small weights tend to go to 0. If we want to reduce the number of weights and end up with a small, sparse set, we use L1. It's also good for feature selection, since L1 helps us determine which weights are important.

L2 tries to keep all the weights reasonably small. It's normally better for training models.

Why? The squared penalty of the vector (0.5, 0.5) is 0.5^2 + 0.5^2 = 0.5, while the penalty of (1, 0) is 1^2 + 0^2 = 1. Thus L2 prefers the vector (0.5, 0.5) over (1, 0) because it produces a smaller number.
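As a small illustration (a sketch, not code from the course), here is what adding an L1 or L2 penalty term to an error value could look like, including the (0.5, 0.5) vs (1, 0) comparison above:

```python
import numpy as np

def regularized_error(error, weights, lam, kind="L2"):
    """Add a regularization term to the original error.

    L1 adds lam * sum(|w|): pushes many weights to exactly 0 (sparse).
    L2 adds lam * sum(w**2): keeps all weights uniformly small.
    """
    w = np.asarray(weights, dtype=float)
    if kind == "L1":
        penalty = lam * np.sum(np.abs(w))
    else:
        penalty = lam * np.sum(w ** 2)
    return error + penalty

# L2 penalty for (1, 0) is 1.0, for (0.5, 0.5) it is only 0.5,
# so L2 prefers the second, more evenly spread vector.
print(regularized_error(0.0, [1.0, 0.0], lam=1.0))
print(regularized_error(0.0, [0.5, 0.5], lam=1.0))
```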
Sometimes part of the network barely trains because its weights are smaller and matter less, while the part with large weights dominates. We can turn off part of the network to allow the neglected part to train. We do this by randomly turning off nodes as we pass through the epochs.

We give the algorithm a parameter: the probability that each node will get turned off during an epoch.
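A minimal sketch of dropout applied to one layer's activations, assuming the common "inverted dropout" scaling so the layer's expected output stays the same:

```python
import numpy as np

def dropout(activations, p_drop, rng=None):
    """Turn each node off with probability p_drop and scale the
    survivors up so the expected output of the layer is unchanged."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(activations.shape) >= p_drop   # keep with probability 1 - p_drop
    return activations * mask / (1.0 - p_drop)

# Roughly p_drop of the nodes in this layer are silenced on this pass,
# forcing the rest of the network to pick up the slack.
layer_output = np.array([0.2, 1.5, 0.7, 0.9])
print(dropout(layer_output, p_drop=0.5))
```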
What can we do if we hit a local minimum? Gradient descent alone will not help us. One way to help is random restart: we start from several random points, do gradient descent on all of them, and keep the best result, which increases the chance of reaching a lower minimum (possibly the global one).
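A toy sketch of random restart on a made-up 1-D function with one shallow and one deep minimum (the function and settings here are just for illustration):

```python
import numpy as np

def gradient_descent(grad, start, lr=0.01, steps=500):
    """Plain gradient descent from a single starting point."""
    x = np.asarray(start, dtype=float)
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

def random_restart(f, grad, n_restarts=10, rng=None):
    """Run gradient descent from several random starting points and
    keep the best result, to avoid settling for a poor local minimum."""
    rng = np.random.default_rng() if rng is None else rng
    candidates = [gradient_descent(grad, rng.uniform(-2.0, 2.0))
                  for _ in range(n_restarts)]
    return min(candidates, key=f)

# x**4 - 3*x**2 + x has a shallow minimum near x = 1.1
# and a deeper (global) one near x = -1.3.
f = lambda x: x**4 - 3 * x**2 + x
grad_f = lambda x: 4 * x**3 - 6 * x + 1
print(random_restart(f, grad_f))   # should land near the deeper minimum
```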
Momentum:

Momentum uses a constant beta, between 0 and 1, that attaches to the previous steps of our gradient descent: the current step is combined with the previous step weighted by beta, the step before that weighted by beta squared, and so on, so older steps matter less and less. This builds up speed in consistent directions and can carry us over humps and out of shallow local minima.
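A minimal sketch of gradient descent with momentum, where `grad` stands for whatever gradient function is being minimized (a hypothetical placeholder):

```python
import numpy as np

def gradient_descent_with_momentum(grad, start, lr=0.1, beta=0.9, steps=200):
    """Each update is the current gradient step plus beta times the previous
    update, so a step taken n iterations ago still contributes with weight
    beta**n. Consistent directions accumulate; noise and old detours fade."""
    x = np.asarray(start, dtype=float)
    step = np.zeros_like(x)
    for _ in range(steps):
        step = beta * step - lr * grad(x)   # decayed sum of past steps plus the new one
        x = x + step
    return x
```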
With stochastic (mini-batch) gradient descent, we take small subsets of the data, run them through the network, calculate the gradient based on just those points, and take a step in that direction. We still want to use all of our data, so we do this batch by batch until every point has been used.
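A sketch of such a mini-batch loop, assuming `X` and `y` are NumPy arrays and `grad` is a hypothetical function that computes the gradient of the error on one batch:

```python
import numpy as np

def minibatch_sgd(X, y, grad, weights, lr=0.01, batch_size=32, epochs=10, rng=None):
    """Shuffle the data each epoch, split it into small batches, and take
    one gradient step per batch. All of the data is still used every epoch,
    just one batch at a time."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)                   # new shuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]    # indices of this batch
            weights = weights - lr * grad(weights, X[idx], y[idx])
    return weights
```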