Least Squares

What does it mean for a vector b to be perpendicular to the column space? Which vectors are perpendicular to the column space? By definition, the vectors in the nullspace of A^T.

This is true because when we apply the projection matrix P = A (A^T A)^{-1} A^T to such a b, we get:

P b = A (A^T A)^{-1} A^T b = 0

We get 0 because A^T b = 0, since b is perpendicular to the columns of A.

For the other extreme, if b is in the column space, then b = Ax for some x. Substituting that into the projection:

P b = A (A^T A)^{-1} A^T A x

A^T A gets canceled by its inverse, which leaves us with just Ax, and that is b.

So graphically we have the column space and the nullspace of A^T. We have the vector b = p + e: its projection onto the column space is p = Pb, and its projection onto the nullspace of A^T is e = (I-P)b.

If P is a projection, then I-P is a projection. If P is symmetric, then I-P is symmetric. It's just algebra.
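As a quick sanity check of that recap, here is a minimal NumPy sketch (the matrix A below is just an illustrative example, not the one from the lecture): it builds P = A (A^T A)^{-1} A^T and verifies the two extreme cases, plus the fact that I - P is also a symmetric projection.

```python
import numpy as np

# Illustrative example: A has 2 independent columns in R^3.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])

# Projection matrix onto the column space of A.
P = A @ np.linalg.inv(A.T @ A) @ A.T

# b_perp is in the nullspace of A^T (perpendicular to the column space),
# so its projection is the zero vector.
b_perp = np.array([1.0, -2.0, 1.0])
print(A.T @ b_perp)                        # [0. 0.]
print(np.allclose(P @ b_perp, 0))          # True

# b_in is already in the column space, so P leaves it unchanged.
b_in = A @ np.array([3.0, -1.0])
print(np.allclose(P @ b_in, b_in))         # True

# I - P is also a projection ((I-P)^2 = I-P) and symmetric, like P.
I = np.eye(3)
print(np.allclose((I - P) @ (I - P), I - P))   # True
print(np.allclose((I - P).T, I - P))           # True
```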

That was a recap of the formula for p.

Now let's see the application.

Say we have 3 data points, (t, b) = (1, 1), (2, 2), (3, 2), and we want to fit a line b = C + Dt through them.

If we could solve the system exactly, it would mean we could put a line through all 3 points. But we can't. What are our matrix, unknowns, and b? Each point gives one equation:

C + D = 1
C + 2D = 2
C + 3D = 2

so Ax = b with A = [[1, 1], [1, 2], [1, 3]], x = (C, D), and b = (1, 2, 2).

3 equations, 2 unknowns, no solution. But what's the best solution? It means we're going to have an error on the right-hand side at the 1, the 2, and the 2. We're going to square and add up those errors, and we want to minimize that total. The individual errors are the components of Ax - b.

So we can't solve Ax = b exactly; instead we'll solve A^T A x̂ = A^T b.

We want to know the minimum total error; that will give us our winning line. We want to minimize the length of the error vector e = Ax - b, with components e1, e2, e3. Saying we want Ax - b to be small means we want its length to be small, so we minimize

||Ax - b||^2 = ||e||^2 = e1^2 + e2^2 + e3^2

It's convenient to square, and the result is never negative, since it is a squared length.

This is doing linear regression. But statisticians don't ALWAYS use least squares, because if there is an outlier, its squared error can dominate and distort the model. So sometimes we're better off using another criterion, like minimizing the absolute error.
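To illustrate that sensitivity, here is a small sketch (the data below is made up for illustration, not from the notes): the same least-squares fit with and without a single wild point.

```python
import numpy as np

# Made-up data lying on the line b = 1 + 2t, plus a copy with one outlier.
t = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b_clean = 1 + 2 * t
b_outlier = b_clean.copy()
b_outlier[-1] = 40.0                        # one wild point

A = np.column_stack([np.ones_like(t), t])   # columns: [1, t]

# Least squares penalizes the outlier's error *squared*, so it drags the fit.
coef_clean, *_ = np.linalg.lstsq(A, b_clean, rcond=None)
coef_out, *_ = np.linalg.lstsq(A, b_outlier, rcond=None)
print(coef_clean)   # ~[1.  2. ]     (intercept, slope)
print(coef_out)     # ~[-10.6  7.8]  the single outlier wrecks the slope
```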

Now let's go back to the graph. What are the points on the line we're creating, and what are the errors? There are two pictures going on.

The first picture is the 3 data points, the line, and the errors. In that picture each error is the vertical difference between a point and the line: e1, e2, and e3. The overall error is e1^2 + e2^2 + e3^2.

Second picture - what are the points on the line? The 3 heights are p1, p2, p3. Suppose those were the 3 values instead of b1, b2, b3. The p's are the points we use in place of (1, 2, 2), which are the b's. If we put the p's in instead, the system could be solved, because p is in the column space - it's the closest combination of the columns to b.

p1, p2, p3 are the points on the line, b1, b2, b3 are the data points, and e1, e2, e3 are the distances between them. So instead of using the b's, for which there is no solution, we could use the p's, because they are in the column space, and the system would have a solution.

Now let's compute the answer. Our system is

A^T A x̂ = A^T b

where the hat means x̂ is an estimate.

(For any estimation or error problem, this is one of the most important equations.)

These are called the normal equations.

Before we solve this, let's go back to our minimization. The squared error we're trying to minimize is

(C + D - 1)^2 + (C + 2D - 2)^2 + (C + 3D - 2)^2

We could minimize this with calculus, using partial derivatives: take the partial derivative with respect to C and set it to 0, then do the same for D. The resulting equations are linear, because taking the derivative of something squared gives something linear.
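A small sketch of that calculus route, assuming the three points (1, 1), (2, 2), (3, 2) reconstructed above; sympy is only used to take the partial derivatives and solve the resulting linear equations.

```python
import sympy as sp

C, D = sp.symbols('C D')
points = [(1, 1), (2, 2), (3, 2)]           # (t, b) pairs

# Squared error of the line C + D*t over the three points.
E = sum((C + D * t - b) ** 2 for t, b in points)

# Set both partial derivatives to zero; they are linear in C and D
# (these are exactly the normal equations), so solving is easy.
sol = sp.solve([sp.diff(E, C), sp.diff(E, D)], [C, D])
print(sol)                                   # {C: 2/3, D: 1/2}
```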

But let's go back to our system and use elimination, then back substitution once we find one variable. A^T A x̂ = A^T b works out to

3C + 6D = 5
6C + 14D = 11

Subtracting 2 times the first equation from the second gives 2D = 1, so D = 1/2; back substitution then gives C = 2/3.

The best line is 2/3 + (1/2)t. We can now find all the p's and e's: the p's come from evaluating the line at the data points, and then e = b - p.
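The same computation in NumPy, again assuming the data points (1, 1), (2, 2), (3, 2): solve the normal equations, then form p and e and check the perpendicularity claims that follow.

```python
import numpy as np

t = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 2.0, 2.0])
A = np.column_stack([np.ones_like(t), t])   # columns: [1, t]

# Normal equations: A^T A x_hat = A^T b.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)
print(x_hat)                                # [0.6667 0.5]  ->  best line 2/3 + (1/2)t

# p = A x_hat is the projection of b onto the column space; e = b - p is the error.
p = A @ x_hat
e = b - p
print(p)                                    # [7/6, 5/3, 13/6]
print(e)                                    # [-1/6, 1/3, -1/6]

# p and e are perpendicular, and e is perpendicular to both columns of A.
print(np.isclose(p @ e, 0.0))               # True
print(np.allclose(A.T @ e, 0.0))            # True
```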

Now let's go back to our b = p + e equation.

What can we say about those two vectors p and e? They are perpendicular - their dot product is 0! What else do we know? e is perpendicular to more than just this particular projection p: it is perpendicular to everything in the column space, for example the column (1, 1, 1) of the matrix.

The second picture is the vector picture: b, its projection p in the column space, and the error e.

And then there is the picture of the best line through the data points. They are the same picture; in one of them we can see the line.

That was Least Squares.

Let's talk about the matrix A^T A, which is super important.

We want to make sure it's invertible. Let's come back to that.

Let's repeat an important fact:

We know that a matrix is invertible when its nullspace contains only the zero vector.

If

A^T A x = 0

how come x must be 0? Idea: multiply both sides by x^T on the left:

x^T A^T A x = 0, which is (Ax)^T (Ax) = 0

What does (Ax)^T (Ax) = 0 tell us? It is the length of the vector Ax, squared - Ax dotted with Ax - so Ax has to be 0. Now use our hypothesis that A has independent columns: Ax = 0 means x is in the nullspace of A, and with independent columns the only thing in the nullspace is 0, so x = 0.

This means that A^T A had to be invertible, because the columns of A are independent.
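A quick numerical check of that fact (the matrices here are just examples): with independent columns A^T A is invertible, and with dependent columns it is singular.

```python
import numpy as np

# Independent columns -> A^T A is invertible (nonzero determinant).
A_indep = np.array([[1.0, 1.0],
                    [1.0, 2.0],
                    [1.0, 3.0]])
print(np.linalg.det(A_indep.T @ A_indep))   # 6.0 (nonzero)

# Dependent columns (second column = 2 * first) -> A^T A is singular.
A_dep = np.array([[1.0, 2.0],
                  [1.0, 2.0],
                  [1.0, 2.0]])
print(np.linalg.det(A_dep.T @ A_dep))       # 0.0
```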

Intro to the next subject:

Matrices whose columns are orthonormal: the columns are perpendicular to each other (their angles are 90 degrees) and they are unit vectors. They don't have to be any particular vectors, as long as they are unit vectors that are mutually perpendicular.
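A minimal illustration of that (the matrices below are just examples): for a matrix Q with orthonormal columns, Q^T Q = I.

```python
import numpy as np

# A rotation matrix has orthonormal columns: unit length, perpendicular to each other.
theta = 0.3
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(np.allclose(Q.T @ Q, np.eye(2)))      # True

# The columns don't have to be the standard axes, and Q can be tall:
# normalize two perpendicular columns and Q^T Q is still the identity.
Q_tall = np.array([[1.0,  1.0],
                   [1.0, -1.0],
                   [1.0,  0.0]])
Q_tall = Q_tall / np.linalg.norm(Q_tall, axis=0)
print(np.allclose(Q_tall.T @ Q_tall, np.eye(2)))  # True
```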