In order to do gradient descent, our error function cannot be discrete; it has to be continuous.
It also has to be differentiable.
Simple explanation:
To calculate the slope for a weight, we need to multiply three things:
The slope of the loss function w.r.t. the value at the node the weight feeds into.
The value of the node that feeds into our weight.
The slope of the activation function w.r.t. the value fed into it.
Example, for a simple model with input 3, weight 2, prediction 6, and target 10:
The first item: the slope of the mean squared error loss w.r.t. the prediction.
That derivative is 2 * (predicted value - actual value) = 2 * error; here the error is 6 - 10 = -4.
Second item: the value of the node feeding into the weight, which is 3.
Third item: we don't have an activation function here, so its slope is effectively 1.
So we have 2 * (-4) * 3 = -24: the loss slope (-8) times the input (3) times the activation slope (1). If the learning rate is 0.01, the new weight would be
2 - 0.01 * (-24) = 2.24.
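A minimal sketch of this arithmetic in Python (the variable names are mine, just for illustration):

```python
# Worked example: input 3, weight 2, no activation function, target 10
x = 3.0          # value of the node feeding into the weight
w = 2.0          # current weight
target = 10.0
learnrate = 0.01

prediction = w * x                    # 6
error = prediction - target          # 6 - 10 = -4
loss_slope = 2 * error               # d(MSE)/d(prediction) = -8
weight_slope = loss_slope * x        # -8 * 3 = -24 (activation slope is 1)

new_w = w - learnrate * weight_slope  # 2 - 0.01 * (-24) = 2.24
print(new_w)
```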
Backpropagation is applying this same idea to a more complex NN structure; a rough sketch follows below.
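As an illustration, here is a minimal sketch of one forward and backward pass for a network with a single hidden layer and sigmoid activations; the input values, weights, and variable names are made up for the example:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Made-up tiny network: 3 inputs -> 2 hidden units -> 1 output
x = np.array([0.5, 0.1, -0.2])
target = 0.6
learnrate = 0.5

weights_input_hidden = np.array([[0.5, -0.6],
                                 [0.1, -0.2],
                                 [0.1,  0.7]])
weights_hidden_output = np.array([0.1, -0.3])

# Forward pass
hidden_output = sigmoid(np.dot(x, weights_input_hidden))
output = sigmoid(np.dot(hidden_output, weights_hidden_output))

# Backward pass: error term at the output...
error = target - output
output_error_term = error * output * (1 - output)

# ...propagated back to the hidden layer through the weights
hidden_error_term = (weights_hidden_output * output_error_term
                     * hidden_output * (1 - hidden_output))

# Weight updates, one step of gradient descent
delta_w_hidden_output = learnrate * output_error_term * hidden_output
delta_w_input_hidden = learnrate * hidden_error_term * x[:, None]
```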
Gradient Descent with Squared Errors
We want to find the weights for our neural networks. Let's start by thinking about the goal. The network needs to make predictions as close as possible to the real values. To measure this, we use a metric of how wrong the predictions are, the error. A common metric is the sum of the squared errors (SSE):
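The formula looks like this (reconstructed here; the 1/2 is a common convention because it cancels when taking the derivative):

$$E = \frac{1}{2}\sum_{\mu}\sum_{j}\left(y_j^{\mu} - \hat{y}_j^{\mu}\right)^2$$

where $y_j^{\mu}$ is the target and $\hat{y}_j^{\mu}$ is the prediction for output unit $j$ on data record $\mu$.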
The SSE depends on the weights and the inputs because the prediction $\hat{y}$ is computed from them, so they are in the formula.
Stochastic gradient descent
It is common to calculate slopes on only a subset of the data (a 'batch').
Use a different batch of data to calculate the next update.
Start over from the beginning once all the data has been used.
Each pass through the full training data is called an epoch.
When the slopes are calculated on one batch at a time, this is called stochastic gradient descent; see the sketch below.
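A minimal sketch of that batching loop, assuming a single sigmoid output unit like the one used later in this section (the data and sizes are made up):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Made-up data, just to illustrate the loop structure
features = np.random.rand(1000, 4)
targets = np.random.randint(0, 2, size=1000)

weights = np.random.normal(scale=1 / 4**0.5, size=4)
learnrate = 0.5
batch_size = 32
epochs = 5

n_records = features.shape[0]

for epoch in range(epochs):                      # one epoch = one full pass through the data
    order = np.random.permutation(n_records)     # shuffle so each batch is a different subset

    for start in range(0, n_records, batch_size):
        batch = order[start:start + batch_size]
        x, y = features[batch], targets[batch]

        # Slopes are calculated on this batch only - that's the "stochastic" part
        output = sigmoid(x @ weights)
        error_term = (y - output) * output * (1 - output)
        weights += learnrate * (x.T @ error_term) / len(batch)
```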
Gradient descent Code
```python
import numpy as np

# Defining the sigmoid function for activations
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Derivative of the sigmoid function
def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Input data
x = np.array([0.1, 0.3])
# Target
y = 0.2
# Input to output weights
weights = np.array([-0.8, 0.5])

# The learning rate, eta in the weight step equation
learnrate = 0.5

# The linear combination performed by the node (h in f(h) and f'(h))
h = x[0] * weights[0] + x[1] * weights[1]
# or h = np.dot(x, weights)

# The neural network output (y-hat)
nn_output = sigmoid(h)

# Output error (y - y-hat)
error = y - nn_output

# Output gradient (f'(h))
output_grad = sigmoid_prime(h)

# Error term (lowercase delta)
error_term = error * output_grad

# Gradient descent step
del_w = [learnrate * error_term * x[0],
         learnrate * error_term * x[1]]
# or del_w = learnrate * error_term * x
```
Exercise
```python
import numpy as np

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    """
    Derivative of the sigmoid function
    """
    return sigmoid(x) * (1 - sigmoid(x))

learnrate = 0.5
x = np.array([1, 2, 3, 4])
y = np.array(0.5)

# Initial weights
w = np.array([0.5, -0.5, 0.3, 0.1])

### Calculate one gradient descent step for each weight
### Note: Some steps have been consolidated, so there are
###       fewer variable names than in the above sample code

# TODO: Calculate the node's linear combination of inputs and weights
h = np.dot(x, w)

# TODO: Calculate output of neural network
nn_output = sigmoid(h)

# TODO: Calculate error of neural network
error = y - nn_output

# TODO: Calculate the error term
#       Remember, this requires the output gradient, which we haven't
#       specifically added a variable for.
error_term = error * sigmoid_prime(h)
# Note: The sigmoid_prime function calculates sigmoid(h) twice,
#       but you've already calculated it once. You can make this
#       code more efficient by calculating the derivative directly
#       rather than calling sigmoid_prime, like this:
# error_term = error * nn_output * (1 - nn_output)

# TODO: Calculate change in weights
del_w = learnrate * error_term * x

print('Neural Network output:')
print(nn_output)
print('Amount of Error:')
print(error)
print('Change in Weights:')
print(del_w)
```
Exercise 2:
Mean Square Error
We're going to make a small change to how we calculate the error here. Instead of the SSE, we're going to use the mean of the squared errors (MSE). Now that we're using a lot of data, summing up all the weight steps can lead to really large updates that make the gradient descent diverge. To compensate for this, you'd need to use quite a small learning rate. Instead, we can just divide by the number of records in our data, m, to take the average. This way, no matter how much data we use, our learning rates will typically be in the range of 0.01 to 0.001. Then, we can use the MSE (shown below) to calculate the gradient, and the result is the same as before, just averaged instead of summed.
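Reconstructed here, keeping the same 1/2 convention as the SSE, with $m$ the number of records:

$$E = \frac{1}{2m}\sum_{\mu}\left(y^{\mu} - \hat{y}^{\mu}\right)^2$$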
Implementing with NumPy
For the most part, this is pretty straightforward with NumPy.
First, you'll need to initialize the weights. We want these to be small such that the input to the sigmoid is in the linear region near 0 and not squashed at the high and low ends. It's also important to initialize them randomly so that they all have different starting values and diverge, breaking symmetry. So, we'll initialize the weights from a normal distribution centered at 0. A good value for the scale is 1 / sqrt(n) where n is the number of input units. This keeps the input to the sigmoid low for increasing numbers of input units.
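A minimal sketch of that initialization (n_features is just a stand-in for the number of input units):

```python
import numpy as np

n_features = 4  # number of input units, just for illustration

# Weights drawn from a normal distribution centered at 0 with
# standard deviation 1 / sqrt(n), keeping the sigmoid input near its linear region
weights = np.random.normal(scale=1 / n_features**0.5, size=n_features)
```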
NumPy provides a function np.dot() that calculates the dot product of two arrays, which conveniently calculates h for us. The dot product multiplies two arrays element-wise (the first element in array 1 is multiplied by the first element in array 2, and so on), and then the products are summed.
```python
# input to the output layer
output_in = np.dot(weights, inputs)
```
And finally, we can update $\Delta w_i$ and $w_i$ by incrementing them with weights += ..., which is shorthand for weights = weights + ...
```python
import numpy as np
from data_prep import features, targets, features_test, targets_test

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))

# TODO: We haven't provided the sigmoid_prime function like we did in
#       the previous lesson to encourage you to come up with a more
#       efficient solution. If you need a hint, check out the comments
#       in solution.py from the previous lecture.

# Use the same seed to make debugging easier
np.random.seed(42)

n_records, n_features = features.shape
last_loss = None

# Initialize weights
weights = np.random.normal(scale=1 / n_features**.5, size=n_features)

# Neural Network hyperparameters
epochs = 1000
learnrate = 0.5

for e in range(epochs):
    del_w = np.zeros(weights.shape)
    for x, y in zip(features.values, targets):
        # Loop through all records, x is the input, y is the target

        # Note: We haven't included the h variable from the previous
        #       lesson. You can add it if you want, or you can calculate
        #       the h together with the output

        # TODO: Calculate the output
        output = sigmoid(np.dot(x, weights))

        # TODO: Calculate the error
        error = y - output

        # TODO: Calculate the error term
        error_term = error * output * (1 - output)

        # TODO: Calculate the change in weights for this sample
        #       and add it to the total weight change
        del_w += error_term * x

    # TODO: Update weights using the learning rate and the average change in weights
    weights += (del_w * learnrate) / n_records

    # Printing out the mean square error on the training set
    if e % (epochs / 10) == 0:
        out = sigmoid(np.dot(features, weights))
        loss = np.mean((out - targets) ** 2)
        if last_loss and last_loss < loss:
            print("Train loss: ", loss, "  WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        last_loss = loss

# Calculate accuracy on test data
test_out = sigmoid(np.dot(features_test, weights))
predictions = test_out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))
```