Before we can go off and learn about the two variable selection methods, we first need to understand the consequences of a regression equation containing the “wrong” or “inappropriate” variables. Let’s do that now!
There are four possible outcomes when formulating a regression model for a set of data:
Let’s consider the consequence of each of these outcomes on the regression model. Before we do, we need to take a brief aside to learn what it means for an estimate to have the good characteristic of being unbiased.
Recall that the mean squared error, MSE, is our estimate of the constant error variance, \(\sigma^2\), in the linear regression model.
An estimate is unbiased if the average of the values of the statistics determined from all possible random samples equals the parameter you’re trying to estimate. That is, if you take a random sample from a population and calculate the mean of the sample, then take another random sample and calculate its mean, and take another random sample and calculate its mean, and so on — the average of the means from all of the samples that you have taken should equal the true population mean. If that happens, the sample mean is considered an unbiased estimate of the population mean \(\mu\).
An estimated regression coefficient \(\hat{\beta}_j\) is an unbiased estimate of the population slope \(\beta_j\) if the mean of all the possible estimates \(\hat{\beta}_j\) is equal to \(\beta_j\). That is, if you took repeated samples from this population, calculated the regression equation for each sample, and averaged the \(\hat{\beta}_j\) from each equation, the average would equal \(\beta_j\).
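In symbols, the two definitions of unbiasedness given so far can be written compactly. The notation here is an assumption added for illustration: \(E(\cdot)\) denotes the average over all possible random samples, and \(\bar{X}\) denotes the sample mean.

\[
E(\bar{X}) = \mu \qquad \text{and} \qquad E(\hat{\beta}_j) = \beta_j
\]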
So far, this has probably sounded pretty technical. Here’s an easy way to think about it. If you hop on a scale every morning, you can’t expect that the scale will be perfectly accurate every day: some days it might run a little high, and some days a little low. That you can probably live with. You certainly don’t want the scale, however, to consistently report that you weigh five pounds more than you actually do, because then your scale would be biased upward. Nor do you want it to consistently report that you weigh five pounds less than you actually do (errr…, scratch that, maybe you do), because then your scale would be biased downward. What you do want is for the scale to be correct on average, in which case your scale would be unbiased. And that’s what we want!
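To see the "correct on average" idea in action, here is a minimal simulation sketch using only the Python standard library. It draws many random samples from a made-up population in which the true slope is 2, fits a simple regression to each, and averages the estimated slopes. The function names, sample sizes, and parameter values are all assumptions chosen for illustration.

```python
import random
import statistics


def fit_slope(x, y):
    """Least-squares slope for a simple linear regression of y on x."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def simulate_slopes(n_samples=2000, n=25, beta0=1.0, beta1=2.0, sigma=1.0, seed=0):
    """Estimated slopes from repeated random samples (all values are hypothetical)."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_samples):
        x = [rng.uniform(0, 10) for _ in range(n)]
        y = [beta0 + beta1 * xi + rng.gauss(0, sigma) for xi in x]
        slopes.append(fit_slope(x, y))
    return slopes


slopes = simulate_slopes()
print(f"smallest estimate: {min(slopes):.3f}")
print(f"largest estimate:  {max(slopes):.3f}")
print(f"average estimate:  {statistics.fmean(slopes):.3f}  (true slope is 2.0)")
```

Individual estimates land a little above or a little below 2, just like the scale on any given morning, but their average should sit very close to the true slope.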
A regression model is correctly specified if the regression equation contains all of the relevant predictors, including any necessary transformations and interaction terms.
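A regression model is underspecified if the regression equation is missing one or more important predictor variables. This is perhaps the worst-case scenario, because omitting an important predictor generally biases the estimated regression coefficients and the predictions.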
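The following sketch illustrates what can go wrong with an underspecified model. It assumes a made-up "true" model, \(y = 1 + 2x_1 + 3x_2 + \epsilon\), in which \(x_2\) is correlated with \(x_1\), and then fits a regression that leaves \(x_2\) out. The coefficients, sample sizes, and function names are hypothetical and chosen only for illustration.

```python
import random
import statistics


def fit_slope(x, y):
    """Least-squares slope for a simple linear regression of y on x."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def average_slope_without_x2(n_samples=2000, n=50, seed=0):
    """Average x1 slope when y is regressed on x1 alone (x2 is omitted)."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_samples):
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        # Assumption: x2 is correlated with x1 (on average, x2 = 0.5 * x1).
        x2 = [0.5 * a + rng.gauss(0, 1) for a in x1]
        # Hypothetical true model: the slope on x1 is 2 and the slope on x2 is 3.
        y = [1.0 + 2.0 * a + 3.0 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
        slopes.append(fit_slope(x1, y))
    return statistics.fmean(slopes)


avg = average_slope_without_x2()
print(f"average estimated slope for x1: {avg:.3f}  (true slope is 2.0)")
```

Under these assumptions the average estimate settles near 3.5 rather than 2, because the \(x_1\) slope absorbs part of the omitted \(x_2\) effect (\(2 + 3 \times 0.5\)). Taking more samples does not make the discrepancy go away, which is what it means for the estimate to be biased.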
An extraneous variable is one that is related neither to the response nor to any of the other predictors. This situation yields
The issue here is that, because we have too many predictors, the degrees of freedom of the MSE are smaller than they need to be (recall that the degrees of freedom for MSE are \(n-p-1\), where \(n\) is the sample size and \(p\) is the number of predictors).
These extra variables also make the model more complicated and harder to interpret.
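As a rough illustration of the degrees-of-freedom cost, the sketch below simulates data in which only one predictor matters, then compares the MSE from a model with just that predictor against a model that also includes eight extraneous predictors. The small least-squares helper, the sample size of 15, and all parameter values are assumptions made for this example.

```python
import random
import statistics


def ols_mse(X, y):
    """MSE = SSE / (n - p - 1) from a least-squares fit of y on X plus an intercept."""
    n, p = len(y), len(X[0])
    rows = [[1.0] + list(r) for r in X]
    k = p + 1
    # Augmented normal equations [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(A[idx][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    sse = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(rows, y))
    return sse / (n - p - 1)


def simulate_mse(n_extra, n_samples=2000, n=15, seed=0):
    """Mean and standard deviation of the MSE across repeated samples."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_samples):
        X = [[rng.gauss(0, 1) for _ in range(1 + n_extra)] for _ in range(n)]
        # Only the first column affects y; the rest are extraneous.
        # Hypothetical true model with error variance sigma^2 = 1.
        y = [1.0 + 2.0 * row[0] + rng.gauss(0, 1) for row in X]
        values.append(ols_mse(X, y))
    return statistics.fmean(values), statistics.stdev(values)


for n_extra in (0, 8):
    mean_mse, sd_mse = simulate_mse(n_extra)
    df = 15 - (1 + n_extra) - 1
    print(f"extraneous predictors: {n_extra}, df = {df}, "
          f"mean MSE = {mean_mse:.3f}, sd of MSE = {sd_mse:.3f}")
```

In this setup both models should give an MSE that averages close to the true error variance of 1, but the model with extraneous predictors has only 5 degrees of freedom instead of 13, so its MSE bounces around noticeably more from sample to sample.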
The regression model is overspecified if the regression model contains one or more redundant predictors, that is, two or more predictors that carry the same information about the response or about each other (i.e., they are correlated, a condition known as multicollinearity). This leads to
Overspecified models do yield unbiased estimates of the regression coefficients, unbiased predictions, and an unbiased MSE for \(\sigma^2\). We can use this model for predictions with caution, but we should not use an overspecified model to attribute effects to a certain predictor (if we have redundant predictors then the actual effect may be split between the redundant predictors).
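To illustrate how an effect can be split between redundant predictors, here is a small simulation sketch. It assumes a hypothetical true model in which the response depends only on \(x_1\) (with slope 3), while \(x_2\) is nearly a copy of \(x_1\). Both predictors are then included in the fitted model. The closed-form two-predictor fit and all numeric settings are assumptions made for this example.

```python
import random
import statistics


def fit_two_predictors(x1, x2, y):
    """Least-squares slopes (b1, b2) for y on x1 and x2 with an intercept."""
    m1, m2, my = statistics.fmean(x1), statistics.fmean(x2), statistics.fmean(y)
    d1 = [a - m1 for a in x1]
    d2 = [b - m2 for b in x2]
    dy = [c - my for c in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * c for a, c in zip(d1, dy))
    s2y = sum(b * c for b, c in zip(d2, dy))
    # Solve the 2x2 normal equations on the centered data by Cramer's rule.
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


def simulate(n_samples=2000, n=30, seed=0):
    """Slopes for x1 and x2, and their sum, across repeated samples."""
    rng = random.Random(seed)
    b1s, b2s, sums = [], [], []
    for _ in range(n_samples):
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        x2 = [a + rng.gauss(0, 0.1) for a in x1]  # redundant: nearly a copy of x1
        # Hypothetical true model: only x1 matters, with slope 3.
        y = [1.0 + 3.0 * a + rng.gauss(0, 1) for a in x1]
        b1, b2 = fit_two_predictors(x1, x2, y)
        b1s.append(b1)
        b2s.append(b2)
        sums.append(b1 + b2)
    return b1s, b2s, sums


b1s, b2s, sums = simulate()
for name, vals in (("b1", b1s), ("b2", b2s), ("b1 + b2", sums)):
    print(f"{name:8s} mean = {statistics.fmean(vals):6.3f}, "
          f"sd = {statistics.stdev(vals):.3f}")
```

Under these assumptions the averages of the two slopes should land near their true values (3 and 0), consistent with unbiasedness, but each individual slope varies widely from sample to sample. Their sum is far more stable, which is why predictions can still be reasonable even though the effect cannot be reliably attributed to either predictor.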
Unfortunately, we never know the true model, so we can never really say that we have the “correct” model. All cases except the underspecified model can be used (at least with caution). The best we can do is use statistical techniques appropriately, together with good model-building strategies.
Here are some recommendations to build good and useful models: