Before we can go off and learn about the two variable selection methods, we first need to understand the consequences of a regression equation containing the “wrong” or “inappropriate” variables. Let’s do that now!

There are four possible outcomes when formulating a regression model for a set of data:

  1. The regression model is correctly specified.
  2. The regression model is underspecified.
  3. The regression model contains one or more extraneous variables.
  4. The regression model is overspecified.

Let’s consider the consequence of each of these outcomes on the regression model. Before we do, we need to take a brief aside to learn what it means for an estimate to have the good characteristic of being unbiased.

Recall that the mean squared error, MSE, is our estimate of the constant error variance, \(\sigma^2\), in the linear regression model.
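
To make that concrete, here is a minimal sketch (the data are simulated with a single predictor, so the specific numbers are purely for illustration) showing how MSE is computed as the sum of squared residuals divided by its degrees of freedom, \(n-p-1\):

```python
import numpy as np

# Simulate a simple one-predictor data set (true error standard deviation = 2)
rng = np.random.default_rng(0)
n, p = 50, 1
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + rng.normal(0, 2, n)

# Fit by ordinary least squares and compute MSE = SSE / (n - p - 1)
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
sse = np.sum((y - X @ beta_hat) ** 2)
mse = sse / (n - p - 1)
print(f"MSE = {mse:.2f}  (estimating the true sigma^2 = 4)")
```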

Unbiased Estimates

An estimate is unbiased if the average of the values of the statistics determined from all possible random samples equals the parameter you’re trying to estimate. That is, if you take a random sample from a population and calculate the mean of the sample, then take another random sample and calculate its mean, and take another random sample and calculate its mean, and so on — the average of the means from all of the samples that you have taken should equal the true population mean. If that happens, the sample mean is considered an unbiased estimate of the population mean \(\mu\).

An estimated regression coefficient \(\hat{\beta}_i\) is an unbiased estimate of the population slope \(\beta_i\) if the mean of all of the possible estimates \(\hat{\beta}_i\) (that is, if you took repeated samples from this population, estimated the regression equation for each, and averaged the \(\hat{\beta}_i\) from each equation) is equal to \(\beta_i\).
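
Here is a small simulation sketch of that idea (the population model \(y = 5 + 2x + \varepsilon\) is invented purely for illustration): we draw many samples, estimate the slope in each, and check that the estimates average out to the true slope.

```python
import numpy as np

# Repeatedly sample from a known population, fit the regression, and average the
# slope estimates; an unbiased estimator's average should land on the true slope.
rng = np.random.default_rng(42)
true_intercept, true_slope = 5.0, 2.0
n, reps = 30, 10_000

slope_estimates = []
for _ in range(reps):
    x = rng.uniform(0, 10, n)
    y = true_intercept + true_slope * x + rng.normal(0, 3, n)
    X = np.column_stack([np.ones(n), x])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    slope_estimates.append(beta_hat[1])

print(f"average of {reps} slope estimates: {np.mean(slope_estimates):.3f}")
# close to the true slope of 2, as expected for an unbiased estimator
```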

So far, this has probably sounded pretty technical. Here’s an easy way to think about it. If you hop on a scale every morning, you can’t expect that the scale will be perfectly accurate every day — some days it might run a little high, and some days a little low. That you can probably live with. You certainly don’t want the scale, however, to consistently report that you weigh five pounds more than you actually do — your scale would be biased upward. Nor do you want it to consistently report that you weigh five pounds less than you actually do — errr…, scratch that, maybe you do — in this case, your scale would be biased downward. What you do want is for the scale to be correct on average — in this case, your scale would be unbiased. And, that’s what we want!

The Four Possible Outcomes

  1. The regression model is correctly specified

A regression model is correctly specified if the regression equation contains all of the relevant predictors, including any necessary transformations and interaction terms.
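
For instance (a sketch with invented data, using the statsmodels formula interface), if the true relationship involves \(\log x_2\) and an \(x_1 x_2\) interaction, a correctly specified model states those terms explicitly:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented data whose true mean involves log(x2) and an x1-by-x2 interaction
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({"x1": rng.uniform(1, 5, n), "x2": rng.uniform(1, 5, n)})
df["y"] = 1 + 2 * df.x1 + 3 * np.log(df.x2) + 0.5 * df.x1 * df.x2 + rng.normal(0, 1, n)

# The formula includes every relevant term: the transformation and the interaction
model = smf.ols("y ~ x1 + np.log(x2) + x1:x2", data=df).fit()
print(model.params)
```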

  2. The regression model is underspecified

A regression model is underspecified if the regression equation is missing one or more important predictor variables. This is perhaps the worst-case scenario: an underspecified model yields biased estimates of the regression coefficients and biased predictions of the response, and its MSE tends to overestimate \(\sigma^2\).
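
A quick simulation sketch (with a made-up population in which \(x_1\) and \(x_2\) are correlated and both matter) shows the damage: leaving \(x_2\) out of the model makes the estimated coefficient of \(x_1\) systematically too large.

```python
import numpy as np

# True model: y = 1 + 2*x1 + 3*x2 + error, with x1 and x2 correlated.
# Fitting y on x1 alone gives slope estimates that average well above the true 2.
rng = np.random.default_rng(7)
true_b1, true_b2 = 2.0, 3.0
n, reps = 100, 5_000

slopes_full, slopes_under = [], []
for _ in range(reps):
    x1 = rng.normal(0, 1, n)
    x2 = 0.8 * x1 + rng.normal(0, 0.6, n)        # x2 correlated with x1
    y = 1 + true_b1 * x1 + true_b2 * x2 + rng.normal(0, 1, n)

    X_full = np.column_stack([np.ones(n), x1, x2])
    X_under = np.column_stack([np.ones(n), x1])
    slopes_full.append(np.linalg.lstsq(X_full, y, rcond=None)[0][1])
    slopes_under.append(np.linalg.lstsq(X_under, y, rcond=None)[0][1])

print(f"correct model:        mean slope for x1 = {np.mean(slopes_full):.2f}")   # about 2.0
print(f"underspecified model: mean slope for x1 = {np.mean(slopes_under):.2f}")  # about 4.4 (biased)
```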

  3. The regression model contains one or more extraneous variables

An extraneous variable is one that is related neither to the response nor to any of the other predictors. This situation yields unbiased estimates of the regression coefficients, unbiased predictions of the response, and an unbiased MSE.

The issue here is that because we have too many predictors, the degrees of freedom associated with the MSE are too small (recall that the degrees of freedom for MSE are \(n-p-1\), where \(n\) is the sample size and \(p\) is the number of predictors). With fewer error degrees of freedom, MSE is a less precise estimate of \(\sigma^2\), and confidence intervals and prediction intervals tend to be wider.

These extra variables also make the model more complicated and harder to interpret.
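
To see the degrees-of-freedom cost in numbers (assuming, say, a small data set with \(n = 15\)), note how the error degrees of freedom shrink as predictors are added and how the \(t\)-multiplier used for 95% confidence intervals grows accordingly:

```python
from scipy import stats

# Each extraneous predictor removes one error degree of freedom (n - p - 1);
# a smaller df means a larger t critical value and hence wider intervals.
n = 15
for p in (1, 4, 8, 12):
    df = n - p - 1
    t_crit = stats.t.ppf(0.975, df)   # multiplier for a 95% confidence interval
    print(f"p = {p:2d}  ->  error df = {df:2d},  95% t-multiplier = {t_crit:.2f}")
```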

  4. The regression model is overspecified

The regression model is overspecified if it contains one or more redundant predictors, that is, two or more predictors that carry the same information about the response or about each other (i.e., they are correlated, a situation known as multicollinearity).

Overspecified models do yield unbiased estimates of the regression coefficients, unbiased predictions of the response, and an unbiased MSE estimate of \(\sigma^2\). We can use such a model for prediction, with caution, but we should not use it to attribute an effect to a particular predictor: when predictors are redundant, the actual effect may be split between them.
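
Here is a small sketch (invented data) of that effect splitting: \(x_2\) is nearly a copy of \(x_1\), and although the response really depends only on \(x_1\), any single fit can divide the total effect of about 3 arbitrarily between the two redundant predictors, while their sum, and hence the predictions, stay on target.

```python
import numpy as np

# x2 is nearly identical to x1 (redundant), but y depends only on x1
rng = np.random.default_rng(3)
n = 200
x1 = rng.uniform(0, 10, n)
x2 = x1 + rng.normal(0, 0.05, n)
y = 1 + 3 * x1 + rng.normal(0, 1, n)

X_over = np.column_stack([np.ones(n), x1, x2])
coef = np.linalg.lstsq(X_over, y, rcond=None)[0]
print(f"intercept = {coef[0]:.2f}, b1 = {coef[1]:.2f}, b2 = {coef[2]:.2f}")
# In any single sample, b1 and b2 individually may land far from 3 and 0, but
# b1 + b2 stays close to 3, so predictions remain trustworthy even though the
# individual coefficient estimates are not.
```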

How do we handle incorrect models?

Unfortunately, we never know the true model, so we can never really say that we have the “correct” model. All of the cases except the underspecified model can be used, at least with caution. The best we can do is apply statistical techniques appropriately, using good model-building strategies.

Here are some recommendations for building good and useful models:

  1. Know your goal and research question. Knowing how you plan to use your regression model can assist greatly in the model building stage. Do you have a few particular predictors of interest? If so, you should make sure your final model includes them. Are you just interested in predicting the response? If so, then multicollinearity should worry you less. Are you interested in the effects that specific predictors have on the response? If so, multicollinearity should be a serious concern. Are you just interested in summary description? What is it that you are trying to accomplish?
  2. Identify all of the possible candidate predictors. This may sound easier to accomplish than it actually is. Don’t worry about interactions or the appropriate functional form — such as \(x^2\) and \(\log x\) — just yet. Just make sure you identify all the possible important predictors. If you don’t consider them, there is no chance for them to appear in your final model.
  3. Use variable selection procedures to find the middle ground between an underspecified model and a model with extraneous or redundant variables. Two possible variable selection procedures are stepwise regression and best subsets regression. We’ll learn about both methods here in this lesson.
  4. Fine-tune the model to get a correctly specified model. If necessary, change the functional form of the predictors and/or add interactions. Check the behavior of the residuals. If the residuals suggest problems with the model, try a different functional form of the predictors or remove some of the interaction terms. Iterate back and forth between formulating different regression models and checking the behavior of the residuals until you are satisfied with the model. (A minimal residual-check sketch follows this list.)
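
As a concrete example of step 4 (a sketch only; the data, the candidate formula, and the missing \(x_1^2\) term are all invented for illustration), a residuals-versus-fits plot is one quick way to spot a functional form that needs fixing:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Invented data: the true mean includes an x1^2 term the candidate model omits
rng = np.random.default_rng(10)
n = 150
df = pd.DataFrame({"x1": rng.uniform(0, 10, n), "x2": rng.uniform(0, 10, n)})
df["y"] = 2 + 1.5 * df.x1 + 0.4 * df.x1**2 + 0.8 * df.x2 + rng.normal(0, 2, n)

fit = smf.ols("y ~ x1 + x2", data=df).fit()      # candidate model (no x1^2 yet)

# Curvature or a funnel shape in this plot suggests revisiting the functional form
plt.scatter(fit.fittedvalues, fit.resid)
plt.axhline(0, color="gray")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.title("Residuals vs fits for the candidate model")
plt.show()
```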