Regression and Prediction

Simple Linear Regression

Simple linear regression models the relationship between the magnitude of one variable and that of a second: for example, as X increases, Y also increases, or as X increases, Y decreases. Correlation is another way to measure how two variables are related; see the section “Correlation”. The difference is that while correlation measures the strength of an association between two variables, regression quantifies the nature of the relationship.

KEY TERMS FOR SIMPLE LINEAR REGRESSION

Response The variable we are trying to predict. Synonyms: dependent variable, Y-variable, target, outcome

Independent variable The variable used to predict the response. Synonyms: predictor, X-variable, feature, attribute

Record The vector of predictor and outcome values for a specific individual or case. Synonyms: row, case, instance, example

Intercept The intercept of the regression line — that is, the predicted value when X=0. Synonyms: b0 and Beta0

Regression coefficient The slope of the regression line. Synonyms: slope, b1, beta1, parameter estimates, weights

Fitted values The estimates $\hat{Y}_i$ obtained from the regression line. Synonyms: predicted values

Residuals The difference between the observed values and the fitted values. Synonyms: errors

Least squares The method of fitting a regression by minimizing the sum of squared residuals. Synonyms: ordinary least squares

The machine learning community tends to use other terms, calling Y the target and X a feature vector.

Fitted Values and Residuals

Important concepts in regression analysis are the fitted values and residuals. In general, the data doesn’t fall exactly on a line, so the regression equation should include an explicit error term $e$:

$Y = b_0 + b_1 X + e$

The fitted values, also referred to as the predicted values, are typically denoted by $\hat{Y}_i$. These are given by:

$\hat{Y}_i = \hat{b}_0 + \hat{b}_1 X_i$

The notation $\hat{b}_0$ and $\hat{b}_1$ indicates that the coefficients are estimated, as opposed to known.

We compute the residuals $\hat{e}_i$ by subtracting the predicted values from the original data:

$\hat{e}_i = Y_i - \hat{Y}_i$

This figure illustrates the residuals from the regression line fit to the lung data. The residuals are the length of the vertical dashed lines from the data to the line.
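In R, these quantities come directly from a fitted lm object. The following is a minimal sketch, assuming a data frame lung with a predictor Exposure and a response PEFR (the column names are assumptions for illustration):

```r
# Minimal sketch: fit a simple linear regression and extract fitted values
# and residuals. `lung`, `Exposure`, and `PEFR` are assumed names.
model <- lm(PEFR ~ Exposure, data = lung)

fitted_values <- fitted(model)        # the predicted values Y-hat_i
residual_values <- residuals(model)   # e-hat_i = Y_i - Y-hat_i

head(data.frame(fitted = fitted_values, residual = residual_values))
```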

Least squares

How is the model fit to the data? When there is a clear relationship, you could imagine fitting the line by hand. In practice, the regression line is the estimate that minimizes the sum of squared residual values, also called the residual sum of squares or RSS:

$RSS = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - \hat{b}_0 - \hat{b}_1 X_i)^2$

The estimates $\hat{b}_0$ and $\hat{b}_1$ are the values that minimize RSS. The method of minimizing the sum of the squared residuals is termed least squares regression, or ordinary least squares (OLS) regression. It is often attributed to Carl Friedrich Gauss, the German mathematician, but was first published by the French mathematician Adrien-Marie Legendre in 1805. Least squares regression leads to a simple formula to compute the coefficients:

$\hat{b}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$

$\hat{b}_0 = \bar{Y} - \hat{b}_1 \bar{X}$
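As a quick illustration of these formulas, the following sketch computes the slope and intercept directly and compares them with the estimates returned by lm; the x and y vectors are arbitrary made-up values:

```r
# Sketch: the least-squares formulas applied directly, compared with lm().
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

c(intercept = b0, slope = b1)
coef(lm(y ~ x))   # the same estimates, up to floating-point error
```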

Least squares is sensitive to outliers.

With the advent of big data, regression is widely used to form a model to predict individual outcomes for new data, rather than explain data in hand (i.e., a predictive model). In this instance, the main items of interest are the fitted values $\hat{Y}$. In marketing, regression can be used to predict the change in revenue in response to the size of an ad campaign. Universities use regression to predict students’ GPA based on their SAT scores.

Multiple Linear Regression

Instead of a line, we now have a linear model: the relationship between each coefficient and its variable (feature) is linear.

KEY TERMS FOR MULTIPLE LINEAR REGRESSION

Root mean squared error The square root of the average squared error of the regression (this is the most widely used metric to compare regression models). Synonyms: RMSE

Residual standard error The same as the root mean squared error, but adjusted for degrees of freedom. Synonyms: RSE

R-squared The proportion of variance explained by the model, from 0 to 1. Synonyms: coefficient of determination, $R^2$

t-statistic The coefficient for a predictor, divided by the standard error of the coefficient, giving a metric to compare the importance of variables in the model.

Weighted regression Regression with the records having different weights.

All of the other concepts in simple linear regression, such as fitting by least squares and the definition of fitted values and residuals, extend to the multiple linear regression setting. For example, the fitted values are given by:

$\hat{Y}_i = \hat{b}_0 + \hat{b}_1 X_{1,i} + \hat{b}_2 X_{2,i} + \dots + \hat{b}_p X_{p,i}$

Assessing the model

The most important performance metric from a data science perspective is root mean squared error, or RMSE. RMSE is the square root of the average squared error in the predicted $\hat{y}_i$ values:

$RMSE = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}}$

This measures the overall accuracy of the model, and is a basis for comparing it to other models (including models fit using machine learning techniques). Similar to RMSE is the residual standard error, or RSE. In this case we have p predictors, and the RSE is given by:

$RSE = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - p - 1}}$

The only difference is that the denominator is the degrees of freedom, as opposed to the number of records (see “Degrees of Freedom”). In practice, for linear regression, the difference between RMSE and RSE is very small, particularly for big data applications.
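A small sketch of both metrics, computed from any fitted lm object (for example, the house_lm model referred to later in this section):

```r
# Sketch: RMSE and RSE for a fitted linear model.
rmse_rse <- function(model) {
  res <- residuals(model)
  n <- length(res)
  p <- length(coef(model)) - 1              # estimated coefficients, excluding the intercept
  c(RMSE = sqrt(mean(res^2)),               # denominator is n
    RSE  = sqrt(sum(res^2) / (n - p - 1)))  # denominator is the degrees of freedom
}

# rmse_rse(house_lm)   # RSE is printed as "residual standard error" by summary()
```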

R-squared ranges from 0 to 1 and measures the proportion of variation in the data that is accounted for in the model. It is useful mainly in explanatory uses of regression where you want to assess how well the model fits the data.

Regression output also includes the standard error of the coefficients (SE) and a t-statistic:

$t_b = \frac{\hat{b}}{SE(\hat{b})}$

The t-statistic — and its mirror image, the p-value — measures the extent to which a coefficient is “statistically significant” — that is, outside the range of what a random chance arrangement of predictor and target variable might produce. The higher the t-statistic (and the lower the p-value), the more significant the predictor.

Cross validation

Classic statistical regression metrics ($R^2$, F-statistics, and p-values) are all “in-sample” metrics: they are applied to the same data that was used to fit the model. Intuitively, you can see that it would make a lot of sense to set aside some of the original data, not use it to fit the model, and then apply the model to the set-aside (holdout) data to see how well it does. Normally, you would use a majority of the data to fit the model and a smaller portion to test the model.

Cross-validation extends the idea of a holdout sample to multiple sequential holdout samples. The algorithm for basic k-fold cross-validation is as follows:

  1. Set aside 1/k of the data as a holdout sample.

  2. Train the model on the remaining data.

  3. Apply (score) the model to the 1/k holdout, and record needed model assessment metrics.

  4. Restore the first 1/k of the data, and set aside the next 1/k (excluding any records that got picked the first time).

  5. Repeat steps 2 and 3.

  6. Repeat until each record has been used in the holdout portion.

  7. Average or otherwise combine the model assessment metrics.

The division of the data into the training sample and the holdout sample is also called a fold.
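The following is a minimal sketch of this procedure for a linear model, assuming a data frame df whose response column is named y and using RMSE on each holdout fold as the assessment metric (all names are assumptions):

```r
# Sketch: basic k-fold cross-validation for a linear model.
k <- 5
set.seed(1)
fold_id <- sample(rep(1:k, length.out = nrow(df)))   # randomly assign each record to a fold

cv_rmse <- sapply(1:k, function(i) {
  train   <- df[fold_id != i, ]                      # steps 1-2: train on the other folds
  holdout <- df[fold_id == i, ]                      # the 1/k holdout sample
  model <- lm(y ~ ., data = train)
  pred  <- predict(model, newdata = holdout)         # step 3: score the holdout
  sqrt(mean((holdout$y - pred)^2))                   # holdout RMSE
})

mean(cv_rmse)                                        # step 7: combine the fold metrics
```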

Model Selection and Stepwise Regression

In some problems, many variables could be used as predictors in a regression. Adding more variables, however, does not necessarily mean we have a better model. Statisticians use the principle of Occam’s razor to guide the choice of a model: all things being equal, a simpler model should be used in preference to a more complicated model.

Including additional variables always reduces RMSE and increases $R^2$ on the training data. Hence, these are not appropriate to help guide the model choice. In the 1970s, Hirotugu Akaike, the eminent Japanese statistician, developed a metric called AIC (Akaike’s Information Criterion) that penalizes adding terms to a model. In the case of regression, AIC has the form:

$AIC = 2P + n \log(RSS/n)$

where P is the number of variables and n is the number of records. The goal is to find the model that minimizes AIC; models with k extra variables are penalized by 2k.

How do we find the model that minimizes AIC? One approach is to search through all possible models, called all subset regression. This is computationally expensive and is not feasible for problems with large data and many variables. An attractive alternative is to use stepwise regression, which successively adds and drops predictors to find a model that lowers AIC. For example, the MASS package by Venables and Ripley offers a stepwise regression function called stepAIC.
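A sketch of how stepAIC might be called, assuming house_full is an lm fit containing all candidate predictors (the name is an assumption):

```r
# Sketch: stepwise selection guided by AIC.
library(MASS)

step_lm <- stepAIC(house_full, direction = "both")   # successively adds and drops terms
summary(step_lm)
```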

Simpler yet are forward selection and backward selection. In forward selection, you start with no predictors and add them one-by-one, at each step adding the predictor that has the largest contribution to R^2, stopping when the contribution is no longer statistically significant. In backward selection, or backward elimination, you start with the full model and take away predictors that are not statistically significant until you are left with a model in which all predictors are statistically significant.

Stepwise regression and all subset regression are in-sample methods to assess and tune models. This means the model selection is possibly subject to overfitting and may not perform as well when applied to new data. One common approach to avoid this is to use cross-validation to validate the models.

Weighted regression

Weighted regression is used by statisticians for a variety of purposes; in particular, it is important for analysis of complex surveys. Data scientists may find weighted regression useful in two cases:

  • Inverse-variance weighting when different observations have been measured with different precision.

  • Analysis of data in an aggregated form such that the weight variable encodes how many original observations each row in the aggregated data represents.

For example, with the housing data, older sales are less reliable than more recent sales. Using the DocumentDate to determine the year of the sale, we can compute a Weight as the number of years since 2005 (the beginning of the data).
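A sketch of this weighting scheme, assuming a data frame house with the DocumentDate column and an AdjSalePrice response (the data frame and response names are assumptions), using lubridate to extract the year:

```r
# Sketch: weighted least squares, with weights equal to years since 2005.
# `house` and `AdjSalePrice` are assumed names; DocumentDate is from the text.
library(lubridate)

house$Year   <- year(house$DocumentDate)
house$Weight <- house$Year - 2005        # older sales receive smaller weights

house_wt <- lm(AdjSalePrice ~ SqFtTotLiving + Bedrooms,
               data = house, weights = Weight)
```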

Prediction Using Regression

The primary purpose of regression in data science is prediction. This is useful to keep in mind, since regression, being an old and established statistical method, comes with baggage that is more relevant to its traditional explanatory modeling role than to prediction.

KEY TERMS FOR PREDICTION USING REGRESSION

Prediction interval An uncertainty interval around an individual predicted value.

Extrapolation Extension of a model beyond the range of the data used to fit it.

The Dangers of Extrapolation

Regression models should not be used to extrapolate beyond the range of the data. The model is valid only for predictor values for which the data has sufficient coverage. For example, predicting the value of a 5,000-square-foot empty lot with a regression that was trained on condo sales.

Confidence and Prediction Intervals

PREDICTION INTERVAL OR CONFIDENCE INTERVAL?

A prediction interval pertains to uncertainty around a single value, while a confidence interval pertains to a mean or other statistic calculated from multiple values. Thus, a prediction interval will typically be much wider than a confidence interval for the same value. We model this individual value error in the bootstrap model by selecting an individual residual to tack on to the predicted value. Which should you use? That depends on the context and the purpose of the analysis, but, in general, data scientists are interested in specific individual predictions, so a prediction interval would be more appropriate. Using a confidence interval when you should be using a prediction interval will greatly underestimate the uncertainty in a given predicted value.

Useful metrics are confidence intervals, which are uncertainty intervals placed around regression coefficients and predictions. An easy way to understand this is via the bootstrap.

The most common regression confidence intervals encountered in software output are those for regression parameters (coefficients). Here is a bootstrap algorithm for generating confidence intervals for regression parameters (coefficients) for a data set with P predictors and n records (rows); a small R sketch follows the steps:

  1. Consider each row (including outcome variable) as a single “ticket” and place all the n tickets in a box.

  2. Draw a ticket at random, record the values, and replace it in the box.

  3. Repeat step 2 n times; you now have one bootstrap resample.

  4. Fit a regression to the bootstrap sample, and record the estimated coefficients.

  5. Repeat steps 2 through 4, say, 1,000 times.

  6. You now have 1,000 bootstrap values for each coefficient; find the appropriate percentiles for each one (e.g., 5th and 95th for a 90% confidence interval).
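A sketch of these steps with a plain resampling loop, assuming a data frame df whose response column is named y (the formula and names are assumptions):

```r
# Sketch: bootstrap confidence intervals for regression coefficients.
set.seed(1)
B <- 1000
boot_coefs <- replicate(B, {
  idx <- sample(nrow(df), replace = TRUE)   # steps 1-3: resample rows with replacement
  coef(lm(y ~ ., data = df[idx, ]))         # step 4: refit and record the coefficients
})

# Step 6: percentile intervals for each coefficient (5th and 95th for a 90% interval)
apply(boot_coefs, 1, quantile, probs = c(0.05, 0.95))
```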

Of greater interest to data scientists are intervals around predicted y values ($\hat{Y}_i$). The uncertainty around $\hat{Y}_i$ comes from two sources:

  • Uncertainty about what the relevant predictor variables and their coefficients are

  • Additional error inherent in individual data points

The individual data point error can be thought of as follows: even if we knew for certain what the regression equation was (e.g., if we had a huge number of records to fit it), the actual outcome values for a given set of predictor values will vary.

We can model this individual error with the residuals from the fitted values. The bootstrap algorithm for modeling both the regression model error and the individual data point error would look as follows (a small R sketch follows the steps):

  1. Take a bootstrap sample from the data (spelled out in greater detail earlier).

  2. Fit the regression, and predict the new value.

  3. Take a single residual at random from the original regression fit, add it to the predicted value, and record the result.

  4. Repeat steps 1 through 3, say, 1,000 times.

  5. Find the 2.5th and the 97.5th percentiles of the results.
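A sketch of these steps, assuming a data frame df with response y and a one-row data frame new_x holding the predictor values for the new case (all names are assumptions):

```r
# Sketch: a bootstrap prediction interval for a single new record.
set.seed(1)
B <- 1000

original_model <- lm(y ~ ., data = df)      # the original fit, source of the residuals
orig_resid <- residuals(original_model)

boot_preds <- replicate(B, {
  idx <- sample(nrow(df), replace = TRUE)   # step 1: bootstrap sample
  model <- lm(y ~ ., data = df[idx, ])      # step 2: refit and predict the new value
  predict(model, newdata = new_x) + sample(orig_resid, 1)  # step 3: add a random residual
})

quantile(boot_preds, probs = c(0.025, 0.975))  # step 5: a 95% prediction interval
```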

Factor Variables in Regression

Factor variables, also termed categorical variables, take on a limited number of discrete values. For example, a loan purpose can be “debt consolidation,” “wedding,” “car,” and so on. The binary (yes/no) variable, also called an indicator variable, is a special case of a factor variable. Regression requires numerical inputs, so factor variables need to be recoded to use in the model. The most common approach is to convert a variable into a set of binary dummy variables.

KEY TERMS FOR FACTOR VARIABLES

Dummy variables Binary 0–1 variables derived by recoding factor data for use in regression and other models.

Reference coding The most common type of coding used by statisticians, in which one level of a factor is used as a reference and the other levels are compared to that level. Synonyms: treatment coding

One hot encoder A common type of coding used in the machine learning community in which all factor levels are retained. While useful for certain machine learning algorithms, this approach is not appropriate for multiple linear regression.

Deviation coding A type of coding that compares each level against the overall mean as opposed to the reference level. Synonyms: sum contrasts

In the King County housing data, the factor variable PropertyType has three possible values: Multiplex, Single Family, and Townhouse. To use this factor variable, we need to convert it to a set of binary variables. We do this by creating a binary variable for each possible value of the factor variable. The function model.matrix converts a data frame into a matrix suitable for a linear model. The factor variable PropertyType, which has three distinct levels, is represented as a matrix with three columns. In the machine learning community, this representation is referred to as one hot encoding. In certain machine learning algorithms, such as nearest neighbors and tree models, one hot encoding is the standard way to represent factor variables.
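A sketch of the two representations, assuming a data frame house containing the PropertyType factor (the data frame name is an assumption):

```r
# One hot encoding: one 0/1 column per level (the -1 drops the intercept)
head(model.matrix(~ PropertyType - 1, data = house))

# Reference (treatment) coding: P - 1 dummy columns plus an intercept,
# which is what lm() uses by default for a factor predictor
head(model.matrix(~ PropertyType, data = house))
```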

In the regression setting, a factor variable with P distinct levels is usually represented by a matrix with only P – 1 columns. This is because a regression model typically includes an intercept term. With an intercept, once you have defined the values for P – 1 binaries, the value for the Pth is known and could be considered redundant. Adding the Pth column will cause a multicollinearity error.

DIFFERENT FACTOR CODINGS

There are several different ways to encode factor variables, known as contrast coding systems. For example, deviation coding, also known as sum contrasts, compares each level against the overall mean. Another contrast is polynomial coding, which is appropriate for ordered factors; see the section “Ordered Factor Variables”. With the exception of ordered factors, data scientists will generally not encounter any type of coding besides reference coding or one hot encoder.

Factor Variables with Many Levels

Some factor variables can produce a huge number of binary dummies — zip codes are a factor variable and there are 43,000 zip codes in the US. In such cases, it is useful to explore the data, and the relationships between predictor variables and the outcome, to determine whether useful information is contained in the categories. If so, you must further decide whether it is useful to retain all factors, or whether the levels should be consolidated.

An alternative approach is to group the zip codes according to another variable, such as sale price. For example, include a variable ZipGroup that categorizes the zip code into one of five groups, from least expensive (1) to most expensive (5). Even better is to form zip code groups using the residuals from an initial model. The following dplyr code consolidates the 82 zip codes into five groups based on the median of the residual from the house_lm regression:
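The original code is not reproduced in these notes; the following is a hedged reconstruction of the idea, assuming a data frame house with a ZipCode column and that house_lm was fit on house (names other than house_lm, ZipGroup, and ZipCode are assumptions):

```r
# Hedged sketch: bin zip codes into five groups by the median residual
# from the initial house_lm model, weighting the bins by record counts.
library(dplyr)

zip_groups <- house %>%
  mutate(resid = residuals(house_lm)) %>%       # assumes house_lm was fit on `house`
  group_by(ZipCode) %>%
  summarize(med_resid = median(resid), cnt = n()) %>%
  arrange(med_resid) %>%
  mutate(cum_cnt = cumsum(cnt),
         ZipGroup = factor(ntile(cum_cnt, 5)))  # five groups, ordered by median residual

house <- house %>%
  left_join(select(zip_groups, ZipCode, ZipGroup), by = "ZipCode")
```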

Ordered Factor Variables

Some factors have levels that are ordered and can be represented as a single numeric variable. Treating ordered factors as a numeric variable preserves the information contained in the ordering that would be lost if it were treated as an unordered factor.

Interpreting the Regression Equation

KEY TERMS FOR INTERPRETING THE REGRESSION EQUATION

Correlated variables When the predictor variables are highly correlated, it is difficult to interpret the individual coefficients.

Multicollinearity When the predictor variables have perfect, or near-perfect, correlation, the regression can be unstable or impossible to compute. Synonyms: collinearity

Confounding variables An important predictor that, when omitted, leads to spurious relationships in a regression equation.

Main effects The relationship between a predictor and the outcome variable, independent from other variables.

Interactions An interdependent relationship between two or more predictors and the response.

Correlated Predictors

In multiple regression, the predictor variables are often correlated with each other. In the King County house regression, for example, the coefficient for Bedrooms is negative! This implies that adding a bedroom to a house will reduce its value. How can this be? This is because the predictor variables are correlated: larger houses tend to have more bedrooms, and it is the size that drives house value, not the number of bedrooms. Consider two homes of the exact same size: it is reasonable to expect that a home with more, but smaller, bedrooms would be considered less desirable.

Having correlated predictors can make it difficult to interpret the sign and value of regression coefficients.

The update function can be used to add or remove variables from a model. In our example, after removing the size-related variables, the coefficient for Bedrooms is positive, in line with what we would expect (though it is really acting as a proxy for house size, now that those variables have been removed).
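A sketch of the update call, assuming the size-related predictors in house_lm are named SqFtTotLiving, SqFtFinBasement, and Bathrooms (the latter two names are assumptions):

```r
# Sketch: refit the model without the size-related predictors.
lm_no_size <- update(house_lm, . ~ . - SqFtTotLiving - SqFtFinBasement - Bathrooms)
```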

Correlated variables are only one issue with interpreting regression coefficients.

Multicollinearity

An extreme case of correlated variables produces multicollinearity: a condition in which there is redundancy among the predictor variables. Perfect multicollinearity occurs when one predictor variable can be expressed as a linear combination of others. Multicollinearity occurs when:

  • A variable is included multiple times by error.

  • P dummies, instead of P – 1 dummies, are created from a factor variable (see “Factor Variables in Regression”).

  • Two variables are nearly perfectly correlated with one another.

Multicollinearity in regression must be addressed — variables should be removed until the multicollinearity is gone. A regression does not have a well-defined solution in the presence of perfect multicollinearity.

NOTE

Multicollinearity is not such a problem for nonregression methods like trees, clustering, and nearest-neighbors, and in such methods it may be advisable to retain P dummies (instead of P – 1). That said, even in those methods, nonredundancy in predictor variables is still a virtue.

Confounding Variables

With correlated variables, the problem is one of commission: including different variables that have a similar predictive relationship with the response. With confounding variables, the problem is one of omission: an important variable is not included in the regression equation. Naive interpretation of the equation coefficients can lead to invalid conclusions.

The original regression model does not contain a variable to represent location — a very important predictor of house price. To model location, include a variable ZipGroup that categorizes the zip code into one of five groups, from least expensive (1) to most expensive (5).

Interactions and Main Effects

Statisticians like to distinguish between main effects, or independent variables, and the interactions between the main effects. Main effects are what are often referred to as the predictor variables in the regression equation.

Location in real estate is everything, and it is natural to presume that the relationship between, say, house size and the sale price depends on location. A big house built in a low-rent district is not going to retain the same value as a big house built in an expensive area.

Using the housing data, the following fits an interaction between SqFtTotLiving and ZipGroup:
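A sketch of such a fit, using the * operator so that both main effects and their interaction terms are included (the data frame name and AdjSalePrice response are assumptions):

```r
# Sketch: interaction between SqFtTotLiving and ZipGroup.
lm_interact <- lm(AdjSalePrice ~ SqFtTotLiving * ZipGroup + Bedrooms,
                  data = house)
```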

The resulting model has four new terms: SqFtTotLiving:ZipGroup2, SqFtTotLiving:ZipGroup3, and so on.

Conclusion: An interaction term between two variables is needed if the relationship between the variables and the response is interdependent.

MODEL SELECTION WITH INTERACTION TERMS

In problems involving many variables, it can be challenging to decide which interaction terms should be included in the model. Several different approaches are commonly taken:

  • In some problems, prior knowledge and intuition can guide the choice of which interaction terms to include in the model.

  • Stepwise selection (see “Model Selection and Stepwise Regression”) can be used to sift through the various models.

  • Penalized regression can automatically fit to a large set of possible interaction terms.

  • Perhaps the most common approach is to use tree models, as well as their descendants, random forest and gradient boosted trees. This class of models automatically searches for optimal interaction terms; see “Tree Models”.

Testing the Assumptions: Regression Diagnostics

In explanatory modeling (i.e., in a research context), various steps, in addition to the metrics mentioned previously (see “Assessing the Model”), are taken to assess how well the model fits the data. Most are based on analysis of the residuals, which can test the assumptions underlying the model. These steps do not directly address predictive accuracy, but they can provide useful insight in a predictive setting.

KEY TERMS FOR REGRESSION DIAGNOSTICS

Standardized residuals Residuals divided by the standard error of the residuals.

Outliers Records (or outcome values) that are distant from the rest of the data (or the predicted outcome).

Influential value A value or record whose presence or absence makes a big difference in the regression equation.

Leverage The degree of influence that a single record has on a regression equation. Synonyms: hat-value

Non-normal residuals Non-normally distributed residuals can invalidate some technical requirements of regression, but are usually not a concern in data science.

Heteroskedasticity When some ranges of the outcome experience residuals with higher variance (may indicate a predictor missing from the equation).

Partial residual plots A diagnostic plot to illuminate the relationship between the outcome variable and a single predictor. Synonyms: added variables plot

Outliers

Generally speaking, an extreme value, also called an outlier, is one that is distant from most of the other observations. Just as outliers need to be handled for estimates of location and variability (see “Estimates of Location” and “Estimates of Variability”), outliers can cause problems with regression models. In regression, an outlier is a record whose actual y value is distant from the predicted value. You can detect outliers by examining the standardized residual, which is the residual divided by the standard error of the residuals.

How to find them

For example, with the boxplot, outliers are those data points that are too far above or below the box boundaries (see “Percentiles and Boxplots”), where “too far” = “more than 1.5 times the inter-quartile range.” In regression, the standardized residual is the metric that is typically used to determine whether a record is classified as an outlier. Standardized residuals can be interpreted as “the number of standard errors away from the regression line.”

Outliers could also be the result of other problems, such as a “fat-finger” data entry or a mismatch of units (e.g., reporting a sale in thousands of dollars versus simply dollars), and so should not be included in the regression.

Use case

Outliers are central to anomaly detection, where finding outliers is the whole point. The outlier could also correspond to a case of fraud or an accidental action. In any case, detecting outliers can be a critical business need.

Influential Values

A value whose absence would significantly change the regression equation is termed an influential observation. In regression, such a value need not be associated with a large residual.

Clearly, that data value has a huge influence on the regression even though it is not associated with a large outlier (from the full regression). This data value is considered to have high leverage on the regression.

In addition to standardized residuals (see “Outliers”), statisticians have developed several metrics to determine the influence of a single record on a regression. A common measure of leverage is the hat-value; values above 2(P + 1) /n indicate a high-leverage data value.

Another metric is Cook’s distance, which defines influence as a combination of leverage and residual size. A rule of thumb is that an observation has high influence if Cook’s distance exceeds 4/(n-P-1).
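These metrics are available directly in base R; the following sketch applies them to the lm_98105 model referred to elsewhere in this section:

```r
# Sketch: influence diagnostics for the fitted model lm_98105.
std_resid  <- rstandard(lm_98105)       # standardized residuals
hat_values <- hatvalues(lm_98105)       # leverage (hat-values)
cooks_d    <- cooks.distance(lm_98105)  # Cook's distance

n <- length(std_resid)
p <- length(coef(lm_98105)) - 1

which(hat_values > 2 * (p + 1) / n)     # high-leverage records
which(cooks_d > 4 / (n - p - 1))        # high-influence records
```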

An influence plot or bubble plot combines standardized residuals, the hat-value, and Cook’s distance in a single plot. Figure 4-6 shows the influence plot for the King County house data.

Use case

For purposes of fitting a regression that reliably predicts future data, identifying influential observations is only useful in smaller data sets. For regressions involving many records, it is unlikely that any one observation will carry sufficient weight to cause extreme influence on the fitted equation (although the regression may still have big outliers). For purposes of anomaly detection, though, identifying influential observations can be very useful.

Heteroskedasticity, Non-Normality and Correlated Errors

For formal inference (hypothesis tests and p-values) to be fully valid, the residuals are assumed to be normally distributed, have the same variance, and be independent. One area where this may be of concern to data scientists is the standard calculation of confidence intervals for predicted values, which are based upon the assumptions about the residuals (see “Confidence and Prediction Intervals”).

Heteroskedasticity is the lack of constant residual variance across the range of the predicted values. In other words, errors are greater for some portions of the range than for others. The ggplot2 package has some convenient tools to analyze residuals.

The following code plots the absolute residuals versus the predicted values for the lm_98105 regression fit in “Outliers”.
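The original code is not shown in these notes; a sketch of that plot with ggplot2 might look as follows:

```r
# Sketch: absolute residuals versus predicted values for lm_98105,
# with a smooth line to show how the spread changes across the range.
library(ggplot2)

df <- data.frame(pred = predict(lm_98105),
                 abs_resid = abs(residuals(lm_98105)))

ggplot(df, aes(x = pred, y = abs_resid)) +
  geom_point() +
  geom_smooth()
```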

Evidently, the variance of the residuals tends to increase for higher-valued homes, but is also large for lower-valued homes. This plot indicates that lm_98105 has heteroskedastic errors.

WHY WOULD A DATA SCIENTIST CARE ABOUT HETEROSKEDASTICITY?

Heteroskedasticity indicates that prediction errors differ for different ranges of the predicted value, and may suggest an incomplete model. For example, the heteroskedasticity in lm_98105 may indicate that the regression has left something unaccounted for in high- and low-range homes.

Partial Residual Plots and Nonlinearity

Partial residual plots are a way to visualize how well the estimated fit explains the relationship between a predictor and the outcome. Along with detection of outliers, this is probably the most important diagnostic for data scientists. The basic idea of a partial residual plot is to isolate the relationship between a predictor variable and the response, taking into account all of the other predictor variables. A partial residual might be thought of as a “synthetic outcome” value, combining the prediction based on a single predictor with the actual residual from the full regression equation. A partial residual for predictor $X_i$ is the ordinary residual plus the regression term associated with $X_i$:

$\text{Partial residual} = \text{Residual} + \hat{b}_i X_i$

where $\hat{b}_i$ is the estimated regression coefficient.
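A sketch of computing and plotting the partial residuals for SqFtTotLiving from lm_98105; predict with type = "terms" returns each term's contribution, and the formula is assumed to contain SqFtTotLiving as a plain term:

```r
# Sketch: partial residual plot for SqFtTotLiving.
library(ggplot2)

terms_mat <- predict(lm_98105, type = "terms")   # each column is (centered) b-hat_i * X_i
partial_resid <- residuals(lm_98105) + terms_mat[, "SqFtTotLiving"]

df <- data.frame(SqFtTotLiving = model.frame(lm_98105)$SqFtTotLiving,
                 partial_resid = partial_resid)

ggplot(df, aes(x = SqFtTotLiving, y = partial_resid)) +
  geom_point() +
  geom_smooth()
```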

The partial residual is an estimate of the contribution that, for example, SqFtTotLiving adds to the sales price. The relationship between SqFtTotLiving and the sales price is evidently nonlinear. The regression line underestimates the sales price for homes less than 1,000 square feet and overestimates the price for homes between 2,000 and 3,000 square feet. There are too few data points above 4,000 square feet to draw conclusions for those homes.

This nonlinearity makes sense in this case: adding 500 feet in a small home makes a much bigger difference than adding 500 feet in a large home. This suggests that, instead of a simple linear term for SqFtTotLiving, a nonlinear term should be considered (see “Polynomial and Spline Regression”).

The partial residuals plot can be used to qualitatively assess the fit for each regression term, possibly leading to an alternative model specification.

Polynomial and Spline Regression

The relationship between the response and a predictor variable is not necessarily linear. The response to the dose of a drug is often nonlinear: doubling the dosage generally doesn’t lead to a doubled response. The demand for a product is not a linear function of marketing dollars spent since, at some point, demand is likely to be saturated. There are several ways that regression can be extended to capture these nonlinear effects.

KEY TERMS FOR NONLINEAR REGRESSION

Polynomial regression Adds polynomial terms (squares, cubes, etc.) to a regression.

Spline regression Fitting a smooth curve with a series of polynomial segments.

Knots Values that separate spline segments.

Generalized additive models Spline models with automated selection of knots. Synonyms: GAM

Polynomial

Polynomial regression involves including polynomial terms in a regression equation. The use of polynomial regression dates back almost to the development of regression itself, with a paper by Gergonne in 1815. For example, a quadratic regression between the response Y and the predictor X would take the form:

$Y = b_0 + b_1 X + b_2 X^2 + e$

There are now two coefficients associated with SqFtTotLiving: one for the linear term and one for the quadratic term. The partial residual plot (see “Partial Residual Plots and Nonlinearity”) indicates some curvature in the regression equation associated with SqFtTotLiving. The fitted line more closely matches the smooth (see “Splines”) of the partial residuals as compared to a linear fit (see Figure 4-10).
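A sketch of a quadratic fit using poly(); house_98105 and AdjSalePrice are assumed names for the data subset and response behind lm_98105:

```r
# Sketch: add a quadratic term for SqFtTotLiving.
lm_poly <- lm(AdjSalePrice ~ poly(SqFtTotLiving, 2) + Bedrooms,
              data = house_98105)
```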


Splines

Polynomial regression only captures a certain amount of curvature in a nonlinear relationship. Adding higher-order terms, such as a cubic or quartic polynomial, often leads to undesirable “wiggliness” in the regression equation. An alternative, and often superior, approach to modeling nonlinear relationships is to use splines. Splines provide a way to smoothly interpolate between fixed points.

The polynomial pieces are smoothly connected at a series of fixed points in a predictor variable, referred to as knots. Formulation of splines is much more complicated than polynomial regression; statistical software usually handles the details of fitting a spline. The R package splines includes the function bs to create a b-spline term in a regression model. For example, the following adds a b-spline term to the house regression model:
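A sketch of such a term, placing knots at the quartiles of SqFtTotLiving (house_98105 and AdjSalePrice are assumed names):

```r
# Sketch: a cubic b-spline term for SqFtTotLiving with knots at the quartiles.
library(splines)

knots <- quantile(house_98105$SqFtTotLiving, probs = c(0.25, 0.5, 0.75))
lm_spline <- lm(AdjSalePrice ~ bs(SqFtTotLiving, knots = knots, degree = 3) + Bedrooms,
                data = house_98105)
```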

In contrast to a linear term, for which the coefficient has a direct meaning, the coefficients for a spline term are not interpretable. Instead, it is more useful to use the visual display to reveal the nature of the spline fit. Figure 4-12 displays the partial residual plot from the regression. In contrast to the polynomial model, the spline model more closely matches the smooth, demonstrating the greater flexibility of splines. In this case, the line more closely fits the data. Does this mean the spline regression is a better model? Not necessarily: it doesn’t make economic sense that very small homes (less than 1,000 square feet) would have higher value than slightly larger homes. This is possibly an artifact of a confounding variable; see “Confounding Variables”.

Generalized Additive Models

Suppose you suspect a nonlinear relationship between the response and a predictor variable, either from a priori knowledge or from examining the regression diagnostics. Polynomial terms may not be flexible enough to capture the relationship, and spline terms require specifying the knots. Generalized additive models, or GAM, are a technique to automatically fit a spline regression.
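A sketch using the mgcv package, where the s() term asks gam to choose the spline smoothness automatically (the data frame and response names are assumptions):

```r
# Sketch: a GAM with an automatically selected spline term for SqFtTotLiving.
library(mgcv)

lm_gam <- gam(AdjSalePrice ~ s(SqFtTotLiving) + Bedrooms,
              data = house_98105)
```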

Summary

Perhaps no other statistical method has seen greater use over the years than regression: the process of establishing a relationship between multiple predictor variables and an outcome variable. The fundamental form is linear: each predictor variable has a coefficient that describes a linear relationship between the predictor and the outcome. More advanced forms of regression, such as polynomial and spline regression, permit the relationship to be nonlinear. In classical statistics, the emphasis is on finding a good fit to the observed data in order to explain or describe some phenomenon, and traditional (“in-sample”) metrics are used to assess the strength of this fit. In data science, by contrast, the goal is typically to predict values for new data, so metrics based on predictive accuracy for out-of-sample data are used instead. Variable selection methods are used to reduce dimensionality and create more compact models.
