Regularization
Regularization is a technique that helps avoid overfitting and can also improve model interpretability.
Overfitting: if we have too many features, the learned hypothesis may fit the training set very well, meaning the squared error on the training set will be close to 0.
But the resulting function or curve tries too hard to pass through every training point, so it fails to generalize to new examples (for instance, predicting prices for examples it has not seen).
So what can we do to address this? One thing we can do is regularization. Here we look at L1 (Lasso) and L2 (Ridge) regularization.
Regularization: keep all the features, but reduce the magnitude/values of the parameters Theta_j. This works well when we have a lot of features, each of which contributes a bit to predicting y.
For L1, or Lasso, we add a regularization term to our squared error function (the cost function, i.e. the sum of all the errors) that affects every single parameter (we start at j = 1 because we're not penalizing the intercept).
We take a parameter lambda and the absolute values of the coefficients, sum those absolute values, and add lambda times that sum to the overall cost.
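As a sketch of what that looks like, assuming the usual notation of m training examples, n features, hypothesis h_theta, and regularization strength lambda (this notation is not spelled out above), the L1-regularized cost function can be written as:

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\lvert\theta_j\rvert$$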
This function has 2 objectives: on the left, the cost term, where we want to fit the data well; on the right, the penalty term, where we want to keep the parameters small so the function stays simpler. Lambda controls the trade-off between overfitting and underfitting.
The larger this penalty term, the larger the overall cost, so the model appears to perform worse. That is exactly what we want when the parameters grow large, since it discourages overfitting. Lambda tells us how strong this effect should be.
L2, or ridge, regularization: here we take the square of the parameters instead of their absolute values. This means we penalize large parameters more heavily.
If a model wants to learn large parameters it will be penalized.
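Under the same assumed notation, the L2-regularized (ridge) cost simply replaces the absolute values with squares:

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2$$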
L1 can be used to completely remove features (their coefficients become exactly zero), but it is computationally more expensive.
L2 cannot be used for feature selection, but it is computationally more efficient. It penalizes large values without driving any of them to exactly zero.
High Bias vs. High Variance
This sheds light on the obvious disadvantage of ridge regression, which is model interpretability. It will shrink the coefficients of the least important predictors very close to zero, but it will never make them exactly zero. In other words, the final model will include all predictors. In the case of the lasso, however, the L1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large. Therefore, the lasso method also performs variable selection and is said to yield sparse models.
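To see that difference concretely, here is a minimal sketch that compares the coefficients learned by sklearn's Lasso and Ridge. The synthetic data, the alpha value, and the choice of two informative features are illustrative assumptions, not taken from the course material:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first two of six features actually influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty; sklearn calls the lambda above "alpha"
ridge = Ridge(alpha=0.1).fit(X, y)  # L2 penalty

print(lasso.coef_)  # the four irrelevant features typically come out exactly 0
print(ridge.coef_)  # the same features shrink toward 0 but stay nonzero
```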
Perhaps it's not too surprising at this point, but there are classes in sklearn that will help you perform regularization with your linear regression. You'll get practice with implementing that in this exercise. In this assignment's data.csv, you'll find data for a bunch of points including six predictor variables and one outcome variable. Use sklearn's Lasso class to fit a linear regression model to the data, while also using L1 regularization to control for model complexity.
Perform the following steps (a sketch that puts them together appears after the list):
1. Load in the data
The data is in the file called 'data.csv'. Note that there's no header row on this file.
Split the data so that the six predictor features (first six columns) are stored in X, and the outcome feature (last column) is stored in y.
2. Fit data using linear regression with Lasso regularization
Create an instance of sklearn's Lasso class and assign it to the variable lasso_reg. You don't need to set any parameter values: use the default values for the quiz.
Use the Lasso object's .fit() method to fit the regression model onto the data.
3. Inspect the coefficients of the regression model
Obtain the coefficients of the fit regression model using the .coef_ attribute of the Lasso object. Store this in the reg_coef variable: the coefficients will be printed out, and you will use your observations to answer the question at the bottom of the page.
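A minimal sketch putting the three steps together, assuming data.csv sits in the working directory with no header row as described above (pandas is used here as an assumption; any loader that produces the same arrays would work):

```python
import pandas as pd
from sklearn.linear_model import Lasso

# 1. Load in the data: no header row, six predictor columns followed by the outcome.
train_data = pd.read_csv('data.csv', header=None)
X = train_data.iloc[:, :6].values
y = train_data.iloc[:, 6].values

# 2. Fit the data using linear regression with Lasso (L1) regularization,
#    keeping the default parameter values as the quiz asks.
lasso_reg = Lasso()
lasso_reg.fit(X, y)

# 3. Inspect the coefficients of the regression model.
reg_coef = lasso_reg.coef_
print(reg_coef)  # coefficients forced exactly to zero mark features Lasso drops
```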