Error functions
Last updated
To find the best line we take small steps, controlled by the learning rate Alpha. At each step we adjust w1 and w2 by a small amount (proportional to Alpha) to get our line closer and closer to the best fit.
For the mean absolute error, we take the average of the errors over all m points (that's the 1/m factor). We want each error to be positive, so we take the absolute value of the difference |y - y_hat|.
For the mean squared error, we take the average of all the squared differences, plus an extra factor of 1/2 for convenience: when we take the derivative of this error, the 2 from the square cancels it out. We can multiply the error by any constant and the process of minimizing stays exactly the same.
The squared difference is always non-negative, so we don't need an absolute value here.
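As a quick sketch of the two error functions described above, here is how they would be computed on some made-up data (the values are illustrative, not from the notes):

```python
import numpy as np

# Hypothetical data: m points with true values y and predictions y_hat.
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 1.5, 3.5, 3.5])
m = len(y)

# Mean absolute error: average of |y - y_hat|, always non-negative.
mae = np.abs(y - y_hat).mean()

# Mean squared error with the extra 1/2 factor for convenience.
mse = (1.0 / (2 * m)) * np.sum((y - y_hat) ** 2)

print(mae)  # 0.5
print(mse)  # 0.125
```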
Graph: we plot the error of each candidate line in the graph on the right (the error as a function of the weights).
As we descend along this function using gradient descent, we minimize the error and get a better and better line, until we find the best possible fit with the smallest possible Mean Squared Error.
Minimizing the error means minimizing the average of the areas of the squares (each squared error is the area of a square whose side is the distance from the point to the line).
Minimizing Error Functions
When we minimize the absolute (or squared) error one point at a time, each update is a gradient descent step.
To minimize the error with the line, we use gradient descent.
The way to descend is to take the gradient of the error function with respect to the weights. This gradient points in the direction where the function increases the most. Therefore the negative of this gradient points down, in the direction where the function decreases the most. So we take a step down in that direction.
We multiply this gradient by the learning rate because we want to take small steps. The error function decreases and we move towards a minimum. If we repeat this many times, we reach a minimum, which is the solution.
The gradient descent step uses two derivatives: the derivative of the error with respect to the slope w1, and the derivative with respect to the y-intercept w2. For the squared error of a single point they are dE/dw1 = -(y - y_hat) * x and dE/dw2 = -(y - y_hat).
This means that we update the slope w1 by adding (y - y_hat) * x * Alpha (the learning rate), and update the y-intercept w2 by adding (y - y_hat) * Alpha; in other words, we subtract the learning rate times each derivative. This is exactly what the gradient descent step does.
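The update described above can be sketched as a small function (the function name and sample values are illustrative):

```python
def squared_trick_step(w1, w2, x, y, alpha):
    """One gradient descent step on the squared error E = (y - y_hat)^2 / 2
    for a single point (x, y)."""
    y_hat = w1 * x + w2
    # Derivatives: dE/dw1 = -(y - y_hat) * x, dE/dw2 = -(y - y_hat).
    # We subtract alpha times each derivative, i.e. add alpha * (y - y_hat) * x etc.
    w1 = w1 + alpha * (y - y_hat) * x
    w2 = w2 + alpha * (y - y_hat)
    return w1, w2

# Example: line y = 0 (w1 = 0, w2 = 0), point (2, 4), learning rate 0.1.
w1, w2 = squared_trick_step(0.0, 0.0, x=2.0, y=4.0, alpha=0.1)
print(w1, w2)  # 0.8 0.4
```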
The same thing can be done with the absolute error:
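For the absolute error |y - y_hat|, the derivative is a constant (up to sign), so each step moves the line by a fixed amount Alpha toward the point. A minimal sketch, with illustrative names and values:

```python
def absolute_trick_step(w1, w2, x, y, alpha):
    """One gradient descent step on the absolute error E = |y - y_hat|."""
    y_hat = w1 * x + w2
    # d|y - y_hat|/dw1 is -x if the point is above the line, +x if below,
    # so the step has a fixed size and only its direction depends on the point.
    if y > y_hat:        # point above the line: move the line up
        w1 += alpha * x
        w2 += alpha
    elif y < y_hat:      # point below the line: move the line down
        w1 -= alpha * x
        w2 -= alpha
    return w1, w2

# Example: line y = 0, point (2, 4) above the line, learning rate 0.1.
w1, w2 = absolute_trick_step(0.0, 0.0, x=2.0, y=4.0, alpha=0.1)
print(w1, w2)  # 0.2 0.1
```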
A potential point of confusion is the following: how do we know whether to use the mean or the total squared (or absolute) error?
The total squared error is the sum of the errors at each point: T = (1/2) * Σ (y - y_hat)²,
whereas the mean squared error is the average of these errors: M = (1/(2m)) * Σ (y - y_hat)², where m is the number of points.
The good news is, it doesn't really matter. The total squared error is just a multiple of the mean squared error, since T = m * M.
Therefore, since differentiation is linear, the gradient of T is also m times the gradient of M.
However, the gradient descent step consists of subtracting the gradient of the error times the learning rate α. Therefore, choosing between the mean squared error and the total squared error really just amounts to picking a different learning rate.
In real life, we'll have algorithms that will help us determine a good learning rate to work with. Therefore, if we use the mean error or the total error, the algorithm will just end up picking a different learning rate.
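The relationship between the two gradients can be checked numerically; the data values here are illustrative:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
m = len(x)
w1, w2 = 0.5, 0.0
y_hat = w1 * x + w2

# Gradients of the mean squared error M = (1/(2m)) * sum((y - y_hat)^2):
grad_w1_M = -(1.0 / m) * np.sum((y - y_hat) * x)
grad_w2_M = -(1.0 / m) * np.sum(y - y_hat)

# Gradients of the total squared error T = (1/2) * sum((y - y_hat)^2):
grad_w1_T = -np.sum((y - y_hat) * x)
grad_w2_T = -np.sum(y - y_hat)

# T's gradient is exactly m times M's gradient, so a step on T with rate
# alpha/m equals a step on M with rate alpha.
print(np.isclose(grad_w1_T, m * grad_w1_M))  # True
print(np.isclose(grad_w2_T, m * grad_w2_M))  # True
```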
At this point, it seems that we've seen two ways of doing linear regression:
1. Applying the squared (or absolute) trick at every point in our data one by one, and repeating this process many times.
2. Applying the squared (or absolute) trick at every point in our data all at the same time, and repeating this process many times.
More specifically, the squared (or absolute) trick, when applied to a point, gives us some values to add to the weights of the model. We can add these values, update our weights, and then apply the squared (or absolute) trick on the next point. Or we can calculate these values for all the points, add them, and then update the weights with the sum of these values.
The former is called stochastic gradient descent; the latter is called batch gradient descent.
The question is, which one is used in practice?
Actually, in most cases, neither. Think about it: if your data is huge, both are computationally slow. The best way to do linear regression is to split your data into many small batches, each with roughly the same number of points, and then use each batch in turn to update the weights. This is called mini-batch gradient descent.
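The mini-batch procedure described above could be sketched as follows (function name, hyperparameters, and data are illustrative, not from the notes):

```python
import numpy as np

def minibatch_gd(x, y, alpha=0.05, batch_size=2, epochs=2000, seed=0):
    """Mini-batch gradient descent on the mean squared error (a sketch)."""
    rng = np.random.default_rng(seed)
    w1, w2 = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        order = rng.permutation(n)          # shuffle, then walk through batches
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            xb, yb = x[idx], y[idx]
            y_hat = w1 * xb + w2
            # Average the per-point updates over the batch, apply them once.
            w1 += alpha * np.mean((yb - y_hat) * xb)
            w2 += alpha * np.mean(yb - y_hat)
    return w1, w2

# Data drawn from the line y = 2x + 1 should recover roughly w1 ≈ 2, w2 ≈ 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1
w1, w2 = minibatch_gd(x, y)
print(w1, w2)
```

Batch size 1 recovers stochastic gradient descent, and batch size n recovers batch gradient descent, so this sketch covers all three variants.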
Another way to calculate the solution is by solving a system of equations (the normal equation). But with n unknowns this involves an n*n matrix, and solving it takes a lot of computing power when n is large. This is why we use gradient descent.
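For a small problem, the closed-form solution via the normal equation looks like this (illustrative data; the system is (XᵀX)w = Xᵀy in the unknown weights):

```python
import numpy as np

# Points drawn exactly from the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1

# Design matrix with a column of ones for the intercept.
X = np.column_stack([x, np.ones_like(x)])

# Normal equation: (X^T X) w = X^T y, a square system in the unknowns.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # [2. 1.]
```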
The normal equation also doesn't work for classification algorithms, so it's important to know gradient descent.