Multiple Linear Regression
Interpretation
The intercept coefficient means that if our home is a Victorian home, we predict its price to be 1.046e+06 (about 1 million).
A lodge is predicted to be 7.411e+05 less than a Victorian.
Each of the lodge and ranch coefficients is a comparison to the baseline category, and the intercept is our prediction for the baseline.
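A minimal sketch of how a model like this might be fit with dummy variables (the data file and column names here are assumptions, not the exact ones from the lesson):

```python
import pandas as pd
import statsmodels.api as sm

# hypothetical data with a 'price' column and a 'style' column
# containing 'victorian', 'lodge', and 'ranch'
df = pd.read_csv('house_prices.csv')

# create dummy columns; leaving 'victorian' out makes it the baseline category
dummies = pd.get_dummies(df['style'])
df[['lodge', 'ranch']] = dummies[['lodge', 'ranch']].astype(int)

# the intercept is the predicted price for the baseline (victorian);
# the lodge and ranch coefficients are differences from that baseline
X = sm.add_constant(df[['lodge', 'ranch']])
model = sm.OLS(df['price'], X).fit()
print(model.summary())
```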
See more details here:
Collinearity is the state where two variables are highly correlated and contain similar information about the variance within a given dataset. To detect collinearity among variables, simply create a correlation matrix and find variables with large absolute values. In R this can be done with the cor function, and in Python it can be accomplished with numpy's corrcoef function.
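For example, with a pandas DataFrame (the file and column names here are placeholders), the correlation matrix could be built like this:

```python
import numpy as np
import pandas as pd

# hypothetical data file with numeric predictor columns
df = pd.read_csv('house_prices.csv')
X = df[['area', 'bedrooms', 'bathrooms']]

# correlation matrix via numpy; rowvar=False treats each column as a variable
corr_matrix = np.corrcoef(X.values, rowvar=False)
print(corr_matrix)

# pandas offers the same result directly
print(X.corr())
```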
Multicollinearity is when we have three or more predictor variables that are correlated with one another. One of the main concerns with multicollinearity is that it can lead to coefficients being flipped from the direction we expect from simple linear regression. Multicollinearity can emerge even when isolated pairs of variables are not collinear.
We would like x-variables to be related to the response, but not to be related to one another.
There are two consequences of multicollinearity:
The expected relationships between your x-variables and the response may not hold when multicollinearity is present. That is, you may expect a positive relationship between the explanatory variables and the response (based on the bivariate relationships), but in the multiple linear regression case, it turns out the relationship is negative.
Our hypothesis testing results may not be reliable. It turns out that having correlated explanatory variables means that our coefficient estimates are less stable. That is, standard deviations (often called standard errors) associated with your regression coefficients are quite large. Therefore, a particular variable might be useful for predicting the response, but because of the relationship it has with other x-variables, you will no longer see this association.
Two different ways of identifying multicollinearity:
1. We can look at the correlation of each explanatory variable with each other explanatory variable (with a plot or the correlation coefficient).
Let's use a pairplot to see the relationships between some variables. The three variables have strong relationships with one another.
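A pairplot can be drawn with seaborn; a minimal sketch (the data file and the three columns shown are placeholders):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('house_prices.csv')  # hypothetical data file

# scatterplots for each pair of variables, with histograms on the diagonal
sns.pairplot(df[['price', 'area', 'bedrooms']])
plt.show()
```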
We can also see specifically that price and bedrooms have a positive relationship with one another:
However, the bedroom coefficient is negative in our multiple linear regression:
When x-variables are related to one another, we can have flipped relationships in our multiple linear regression models from what we would expect when looking at the bivariate linear regression relationships.
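A sketch of how this comparison might be made with the statsmodels formula API (the data file, column names, and formula are placeholders):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv('house_prices.csv')  # hypothetical data file

# bivariate model: the bedrooms coefficient is typically positive here
simple = smf.ols('price ~ bedrooms', data=df).fit()
print(simple.params['bedrooms'])

# multiple regression: with correlated predictors included,
# the bedrooms coefficient can flip sign
multiple = smf.ols('price ~ area + bedrooms + bathrooms', data=df).fit()
print(multiple.params['bedrooms'])
```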
2. We can look at Variance Inflation Factors (VIFs) for each variable.
The Variance Inflation Factor (VIF) is a measure of collinearity among predictor variables within a multiple regression. It is calculated as the ratio of the variance of a coefficient in the full model divided by the variance of that coefficient if it were fit in a model on its own.
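Equivalently, the VIF for predictor x_i is often written as VIF_i = 1 / (1 - R_i^2), where R_i^2 is the R-squared from regressing x_i on all of the other predictors.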
1. Run a multiple regression.
2. Calculate the VIF factors.
3. Inspect the factors for each predictor variable. If the VIF is between 5 and 10, multicollinearity is likely present and you should consider dropping the variable.
dmatrices is imported from patsy, and we input the dependent variable followed by our x-variables. Then we use the variance_inflation_factor function from statsmodels.
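A sketch of that workflow (the data file, formula, and column names are placeholders):

```python
import pandas as pd
from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv('house_prices.csv')  # hypothetical data file

# dmatrices takes an R-style formula: the response on the left of ~,
# the predictors on the right
y, X = dmatrices('price ~ area + bedrooms + bathrooms', df, return_type='dataframe')

# one VIF per column of the design matrix (including the intercept)
vif = pd.DataFrame()
vif['feature'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)
```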
We would want to remove at least one of the last two variables from our model because both of their VIFs are larger than 10. It is common to remove the one that is of least interest.
For more on VIFs and multicollinearity, here is the referenced post from the video on VIFs.
Higher order terms in linear models are created by multiplying two or more x-variables by one another. Common higher order terms include quadratics (x_1^2) and cubics (x_1^3), where an x-variable is multiplied by itself, as well as interactions (x_1 x_2), where two or more x-variables are multiplied by one another.
In a model with no higher order terms, you might have an equation like:
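ŷ = b0 + b1 x_1 + b2 x_2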
Then we might decide the linear model can be improved with higher order terms. The equation might change to:
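ŷ = b0 + b1 x_1 + b2 x_1^2 + b3 x_2 + b4 x_1 x_2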
Here, we have introduced a quadratic (b2 x_1^2) and an interaction (b4 x_1 x_2) term into the model.
In general, these terms can help you fit more complex relationships in your data. However, they also take away from the ease of interpreting coefficients, as we have seen so far. You might be wondering: "How do I identify if I need one of these higher order terms?"
When creating models with quadratic, cubic, or even higher orders of a variable, we are essentially looking at how many curves there are in the relationship between the explanatory and response variables.
If there is one curve, like in the plot below, then you will want to add a quadratic. Clearly, we can see a line isn't the best fit for this relationship.
Then, if we want to add a cubic relationship, it is because we see two curves in the relationship between the explanatory and response variable. An example of this is shown in the plot below.
In python:
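One way this might look (the data file and column names are placeholders):

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('house_prices.csv')  # hypothetical data file

# create higher order terms by multiplying a column by itself
df['area_squared'] = df['area'] * df['area']
df['area_cubed'] = df['area'] * df['area'] * df['area']

# fit a model that includes the quadratic term alongside the linear term
X = sm.add_constant(df[['area', 'area_squared']])
model = sm.OLS(df['price'], X).fit()
print(model.summary())
```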
How do you know if you should add an interaction term?
Interaction definition: The way that variable X1 is related to your response is dependent on the value of X2.
Mathematically, an interaction is created by multiplying two variables by one another and adding this term to our linear regression model.
Say you have two neighborhoods and want to use the area (x_1) and the neighborhood (x_2) of a home (either A or B) to predict the home price (y). Without an interaction, the model looks like:
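ŷ = b0 + b1 x_1 + b2 x_2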
where b1 is the way we estimate the relationship between area and price, which in this model we believe to be the same regardless of the neighborhood.
Then b2 is the difference in price depending on which neighborhood you are in, which is the vertical distance between the two lines here:
Notice here that:
The way that area is related to price is the same regardless of neighborhood.
AND
The difference in price for the different neighborhoods is the same regardless of the area.
When these statements are true, we do not need an interaction term in our model. However, we need an interaction when the way that area is related to price is different depending on the neighborhood.
Mathematically, when the way area relates to price depends on the neighborhood, this suggests we should add an interaction. By adding the interaction, we allow the slopes of the line for each neighborhood to be different, as shown in the plot below. Here we have added the interaction, and you can see this allows for a difference in these two slopes.
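With the interaction term added, the model takes a form like the following, where b3 is the coefficient on the new interaction term:

ŷ = b0 + b1 x_1 + b2 x_2 + b3 x_1 x_2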
The slopes are different. In order to account for this, we would want to add an interaction term between neighborhood and square footage.
Here we can see that the way square footage is related to the home price is dependent on the neighborhood we are in. This is exactly the interaction definition: the way that variable X1 is related to your response is dependent on the value of X2.
Conclusion: if the slopes are close to equal, then we do NOT add an interaction. Else, we do.
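One way to fit this might look like the sketch below (the data file and column names are placeholders); in an R-style formula, area * C(neighborhood) expands to both main effects plus their interaction.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv('house_prices.csv')  # hypothetical data file

# without the interaction: one shared slope for area across neighborhoods
no_interaction = smf.ols('price ~ area + C(neighborhood)', data=df).fit()

# with the interaction: the slope for area is allowed to differ by neighborhood
with_interaction = smf.ols('price ~ area * C(neighborhood)', data=df).fit()

print(no_interaction.params)
print(with_interaction.params)
```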
Interpretation:
With the higher order term, the coefficients associated with area and area squared are not easily interpretable. However, coefficients that are not associated with the higher order terms are still interpretable in the way you did earlier.