Gradient Descent

In order to do gradient descent, our error function cannot be discrete; it has to be continuous.

It also has to be differentiable.

Simple explanation:

To calculate the slope for a weight, we need to multiply:

  • the slope of the loss function with respect to the value at the node the weight feeds into,

  • the value of the node that feeds into our weight, and

  • the slope of the activation function with respect to the value we feed into it.

For example, take this simple model: an input of 3 connected by a weight of 2 to the output, so the prediction is 6, while the actual value is 10.

The first item: the slope of the mean squared error loss with respect to the prediction.

This is the derivative of the MSE: 2 * (predicted value - actual value) = 2 * error. Here the error is 6 - 10 = -4.

The second item: the value of the node feeding into the weight, which is the input 3.

The third item: we don't have an activation function in this model, so this term is just 1.

So we have 2 * (-4) * 3 = -24. If the learning rate is 0.01, the new weight would be

2 - 0.01 * (-24) = 2.24.
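Here is a minimal sketch of that single-weight update in Python (the variable names are illustrative, not from the original):

```python
# Worked example: input 3, weight 2, target 10, learning rate 0.01
input_value = 3.0
weight = 2.0
target = 10.0
learning_rate = 0.01

prediction = weight * input_value        # 6.0
error = prediction - target              # -4.0

# Slope of the MSE w.r.t. the weight: 2 * error * input
# (no activation function here, so that term is 1)
slope = 2 * error * input_value          # -24.0

weight -= learning_rate * slope          # 2.24
print(weight)
```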

Backpropagation is this same calculation applied, layer by layer, to a more complex network structure.

Gradient Descent with Squared Errors

We want to find the weights for our neural networks. Let's start by thinking about the goal. The network needs to make predictions as close as possible to the real values. To measure this, we use a metric of how wrong the predictions are, the error. A common metric is the sum of the squared errors (SSE):
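In its usual form (with y the target, ŷ the prediction, j running over output units, and μ running over data records), the SSE reads:

$$
E = \frac{1}{2}\sum_{\mu}\sum_{j}\left[ y^{\mu}_{j} - \hat{y}^{\mu}_{j} \right]^2
$$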

The SSE depends on the weights and the inputs because the predictions are computed from them: change a weight and the predictions, and therefore the error, change too.

Stochastic gradient descent

It is common to calculate slopes on only a subset of the data, called a batch:

  • use a different batch of data to calculate each update,

  • start over from the beginning once all the data has been used,

  • each full pass through the training data is called an epoch.

When slopes are calculated on one batch at a time, this is called stochastic gradient descent, sketched below.
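As a rough sketch of that loop (assuming a gradient(weights, X, y) helper that returns the slopes for one batch; the function names, batch size, and hyperparameters are illustrative):

```python
import numpy as np

def sgd(weights, X, y, gradient, learning_rate=0.01, batch_size=32, epochs=10):
    # Stochastic gradient descent: one update per batch of data.
    n_records = X.shape[0]
    for _ in range(epochs):                          # one epoch = one full pass over the data
        order = np.random.permutation(n_records)     # shuffle so batches differ each epoch
        for start in range(0, n_records, batch_size):
            batch = order[start:start + batch_size]  # a different batch for each update
            weights -= learning_rate * gradient(weights, X[batch], y[batch])
    return weights
```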

Gradient Descent Code

Exercise

Exercise 2:

Mean Square Error

We're going to make a small change to how we calculate the error here. Instead of the SSE, we're going to use the mean of the squared errors (MSE). Now that we're using a lot of data, summing up all the weight steps can lead to really large updates that make the gradient descent diverge. To compensate for this, you'd need to use a quite small learning rate. Instead, we can just divide by the number of records in our data, m, to take the average. This way, no matter how much data we use, our learning rates will typically be in the range of 0.01 to 0.001. Then, we can use the MSE (shown below) to calculate the gradient, and the result is the same as before, just averaged instead of summed.
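In its usual form (with m records indexed by μ), the MSE is the SSE divided by the number of records:

$$
E = \frac{1}{2m}\sum_{\mu}\left( y^{\mu} - \hat{y}^{\mu} \right)^2
$$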

Implementing with NumPy

For the most part, this is pretty straightforward with NumPy.

First, you'll need to initialize the weights. We want these to be small such that the input to the sigmoid is in the linear region near 0 and not squashed at the high and low ends. It's also important to initialize them randomly so that they all have different starting values and diverge, breaking symmetry. So, we'll initialize the weights from a normal distribution centered at 0. A good value for the scale is 1 / sqrt(n) where n is the number of input units. This keeps the input to the sigmoid low for increasing numbers of input units.
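As a sketch, assuming n_features input units:

```python
import numpy as np

n_features = 10  # illustrative number of input units

# Normal distribution centered at 0 with scale 1/sqrt(n):
# keeps the input to the sigmoid small as n grows
weights = np.random.normal(scale=1 / n_features ** 0.5, size=n_features)
```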

NumPy provides a function np.dot() that calculates the dot product of two arrays, which conveniently calculates h, the input to the output unit, for us. The dot product multiplies the two arrays element-wise (the first element in array 1 is multiplied by the first element in array 2, and so on) and then sums the products.
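For example, for a single output unit with inputs x (the values here are made up for illustration):

```python
import numpy as np

x = np.array([0.1, 0.3, -0.2])        # illustrative inputs
weights = np.array([0.5, -0.4, 0.1])  # illustrative weights

# h = sum_i w_i * x_i, the input to the activation function
h = np.dot(weights, x)                # same as (weights * x).sum()
```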

And finally, we can update Δw_i and w_i by incrementing them with weights += ..., which is shorthand for weights = weights + ...
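Putting the pieces together, here is a minimal sketch of the full loop for a single sigmoid output unit trained with the MSE (the data and hyperparameters are made up for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Illustrative data: 4 records with 3 features each
features = np.random.rand(4, 3)
targets = np.array([0.0, 1.0, 1.0, 0.0])

n_records, n_features = features.shape
learning_rate = 0.5
epochs = 1000

# Small random weights: normal distribution, centered at 0, scale 1/sqrt(n)
weights = np.random.normal(scale=1 / n_features ** 0.5, size=n_features)

for _ in range(epochs):
    del_w = np.zeros(weights.shape)
    for x, y in zip(features, targets):
        h = np.dot(weights, x)                 # input to the output unit
        output = sigmoid(h)                    # the prediction
        error = y - output
        # error term = error * sigmoid'(h), where sigmoid'(h) = output * (1 - output)
        error_term = error * output * (1 - output)
        del_w += error_term * x                # accumulate the weight steps
    weights += learning_rate * del_w / n_records  # average over records (the MSE trick)
```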
