Least Squares

What does it mean for a vector b to be perpendicular to the column space? Which vectors are perpendicular to the column space? By definition, the vectors in the nullspace of Aᵀ.

This is true because if we apply the projection matrix P = A(AᵀA)⁻¹Aᵀ to such a b, we get:

Pb = A(AᵀA)⁻¹Aᵀb = 0

We get 0 because Aᵀb = 0: b is perpendicular to the columns of A.

For the other extreme, if b is in the column space, then b = Ax for some x. Substitute that in:

Pb = A(AᵀA)⁻¹Aᵀ(Ax) = Ax = b

AᵀA gets canceled by its inverse, which leaves just Ax, and that is b.

So graphically we have the column space and the nullspace of Aᵀ. A vector b splits as b = p + e: p = Pb is its projection onto the column space, and e = (I - P)b is its projection onto the nullspace of Aᵀ.

If P is a projection, then I - P is a projection; if P is symmetric, then I - P is symmetric. It's just algebra: (I - P)² = I - 2P + P² = I - P, and (I - P)ᵀ = I - Pᵀ = I - P.
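
A quick numerical sanity check of that algebra (just a sketch in NumPy; the particular A is an arbitrary example with independent columns):

```python
import numpy as np

# Any A with independent columns will do for this check.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])

# Projection onto the column space of A: P = A (A^T A)^{-1} A^T
P = A @ np.linalg.inv(A.T @ A) @ A.T
I = np.eye(3)

# Both P and I - P square to themselves and are symmetric.
print(np.allclose(P @ P, P), np.allclose(P.T, P))                            # True True
print(np.allclose((I - P) @ (I - P), I - P), np.allclose((I - P).T, I - P))  # True True
```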

That was a recap of the formula for p.

Now let's see the application.

Say we have three points, (1, 1), (2, 2), (3, 2), and we want to fit a line b = C + Dt through them.

If we could solve the system exactly, it would mean we could put a line through all three points. But we can't. What are our matrix, our unknowns, and b? The three equations are C + D·1 = 1, C + D·2 = 2, C + D·3 = 2, so A has columns (1, 1, 1) and (1, 2, 3), the unknowns are x = (C, D), and b = (1, 2, 2).

3 equations, 2 unknowns, no solution. But what's the best solution? It means there will be an error on the right-hand side at the 1, the 2, and the 2. We're going to square those errors and add them up, and we want that total to be as small as possible. The errors are the components of Ax - b.

So since we can't solve the system above, we'll solve this instead:

AᵀA x̂ = Aᵀb

We want to know the minimum total error; that will give us the winning line. We want to minimize the sum of the squared errors, which means minimizing the length of the error vector e = Ax - b, with components e1, e2, e3. Saying we want Ax - b to be small means we want its length to be small, so we minimize ‖Ax - b‖. It's convenient to square, so we minimize ‖Ax - b‖² = ‖e‖², which is never negative since it's a squared length.
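
Here is a minimal sketch of that minimization in NumPy, assuming the three data points are (1, 1), (2, 2), (3, 2) (which matches the right-hand side 1, 2, 2 used above):

```python
import numpy as np

# Fit b = C + D*t through t = 1, 2, 3 with measured values b = 1, 2, 2.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])   # columns: (1, 1, 1) and (1, 2, 3)
b = np.array([1.0, 2.0, 2.0])

# np.linalg.lstsq minimizes ||Ax - b||^2, the sum of squared errors.
x_hat, residual, rank, _ = np.linalg.lstsq(A, b, rcond=None)
C, D = x_hat
print(C, D)         # 0.666... and 0.5 -> the best line is 2/3 + t/2
print(residual)     # [0.1666...] = e1^2 + e2^2 + e3^2
```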

This is linear regression. But statisticians don't ALWAYS use least squares, because if there is an outlier, its squared error dominates and messes up the model. In that case we may be better off with some other measure, such as the absolute error.
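
To see the outlier problem concretely, here is a small sketch; the extra point at t = 4 with value 20 is made up purely for illustration:

```python
import numpy as np

def fit_line(t, b):
    """Least-squares fit of b = C + D*t; returns (C, D)."""
    A = np.column_stack([np.ones_like(t), t])
    return np.linalg.lstsq(A, b, rcond=None)[0]

t = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 2.0, 2.0])
print(fit_line(t, b))            # [0.667, 0.5] -> the line 2/3 + t/2

# One bad measurement drags the squared-error fit far from the other points.
print(fit_line(np.append(t, 4.0), np.append(b, 20.0)))   # slope jumps to ~5.7
```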

Now let's go back to the graph. What are the points on the line we're creating, and what is the error vector in this picture? There are really two pictures going on.

Picture 1 is the three data points, the line, and the errors. In that picture the errors are the differences between the points and the line: e1, e2, e3. The overall error is e1² + e2² + e3².

Picture 2: what are the points on the line? The three heights are p1, p2, p3. Suppose those were the right-hand side values instead of b1, b2, b3 = 1, 2, 2. If we replace the b's with the p's, we can solve the system, because p is in the column space; it is the closest combination of the columns to b.

So p1, p2, p3 are the points on the line, b1, b2, b3 are the data points, and e1, e2, e3 are the distances between them. Instead of using b, which gives a system with no solution, we can use p, which is in the column space, and that system does have a solution.

Now let's compute the answer. Our system is (the hat means an estimate):

AᵀA x̂ = Aᵀb

(For any estimation or error problem, this is one of the most important equations.)

These are called the normal equations.
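
Where do they come from? A short sketch of the standard argument in this notation: the error e = b - Ax̂ has to be perpendicular to the column space of A, i.e. to every column of A, so Aᵀ kills it:

$$
A^T\left(b - A\hat{x}\right) = 0 \quad\Longrightarrow\quad A^TA\,\hat{x} = A^Tb
$$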

Before we solve this, let's go back to our minimization. The squared error we're trying to minimize is ‖Ax - b‖² = e1² + e2² + e3².

We could solve this with calculus, using partial derivatives: take the partial derivative with respect to C and set it to 0, then do the same with respect to D. The equations come out linear, because taking the derivative of a square gives something linear.
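
Written out for this example (a sketch of that calculus route; E is just a name for the total squared error):

$$
E(C, D) = (C + D - 1)^2 + (C + 2D - 2)^2 + (C + 3D - 2)^2
$$

$$
\frac{\partial E}{\partial C} = 0 \;\Rightarrow\; 3C + 6D = 5,
\qquad
\frac{\partial E}{\partial D} = 0 \;\Rightarrow\; 6C + 14D = 11
$$

Both equations are linear in C and D, and together they are exactly the normal equations AᵀA x̂ = Aᵀb for this A and b.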

But let's go back to our system and solve it with elimination, then back substitution once we've found one variable. Here AᵀA = [[3, 6], [6, 14]] and Aᵀb = (5, 11), so the normal equations are 3C + 6D = 5 and 6C + 14D = 11. Subtract 2 times the first equation from the second: 2D = 1, so D = 1/2, and back substitution gives C = 2/3.

The best line is 2/3 + (1/2)t. We can now find all the e's and p's: the p's come from plugging t = 1, 2, 3 into the line, and then e = b - p.

Now let's go back to our b = p + e equation.

What can we say about the two vectors p and e? They are perpendicular: their dot product is 0. What else do we know? e is perpendicular not only to p but to everything in the column space, for example to the column (1, 1, 1) and to the column (1, 2, 3).
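
A quick numerical check of those claims, reusing the A, b, and x̂ from this example:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

x_hat = np.linalg.solve(A.T @ A, A.T @ b)   # [2/3, 1/2]
p = A @ x_hat                               # projections: [7/6, 5/3, 13/6]
e = b - p                                   # errors:      [-1/6, 1/3, -1/6]

print(np.isclose(p @ e, 0))      # True: p is perpendicular to e
print(np.allclose(A.T @ e, 0))   # True: e is perpendicular to both columns,
                                 # (1, 1, 1) and (1, 2, 3)
```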

The second picture is the vector picture: b, its projection p, and the error e.

And then there is the picture of the best line through the three data points. They are really the same picture; in one of them we can actually see the line.

That was Least Squares.

Let's talk about the matrix AᵀA, which is super important.

We want to make sure it's invertible. Let's come back to that question now.

Let's repeat an important fact:

We know that a square matrix is invertible exactly when its nullspace contains only the zero vector.

Suppose A has independent columns, and suppose AᵀAx = 0.

Why must x be 0? Idea: multiply both sides by xᵀ on the left:

xᵀAᵀAx = 0, which is (Ax)ᵀ(Ax) = 0

What does (Ax)ᵀ(Ax) = 0 tell us? It is the length of the vector Ax, squared, so Ax itself has to be 0. Now use the hypothesis that A has independent columns: Ax = 0 means x is in the nullspace of A, and with independent columns that nullspace contains only the zero vector, so x = 0. Therefore the nullspace of AᵀA is just the zero vector, and AᵀA is invertible.

This means the AᵀA in our normal equations had to be invertible, because the columns of A, (1, 1, 1) and (1, 2, 3), are independent.
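
A small numerical illustration of that fact (the matrix with dependent columns is just an example made up for contrast):

```python
import numpy as np

# Independent columns (our example): A^T A is invertible.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
print(np.linalg.matrix_rank(A.T @ A))   # 2 -> invertible

# Dependent columns (second column = 2 * first): A^T A is singular.
B = np.array([[1.0, 2.0],
              [1.0, 2.0],
              [1.0, 2.0]])
print(np.linalg.matrix_rank(B.T @ B))   # 1 -> not invertible
```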

Intro to the next subject:

Matrices whose columns are orthonormal: the columns are perpendicular to each other and each one is a unit vector. They don't have to be any particular vectors; what matters is that every column has length 1 and the angle between any two columns is 90 degrees.
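
A tiny sketch of such a matrix (this particular Q is just an example I picked): its columns have length 1 and are perpendicular to each other, so QᵀQ = I.

```python
import numpy as np

# Two orthonormal columns in R^3: each has unit length, and they are 90 degrees apart.
Q = np.array([[1 / np.sqrt(2),  1 / np.sqrt(2)],
              [1 / np.sqrt(2), -1 / np.sqrt(2)],
              [0.0,             0.0           ]])

print(np.allclose(Q.T @ Q, np.eye(2)))   # True: Q^T Q = I
```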
