Best subset selection requires the ability to fit all \(2^p\) possible models, but even for relatively small \(p\) such as 40, this can become prohibitively expensive. If you have 40 variables, then you need to fit \(2^{40}=1{,}099{,}511{,}627{,}776\) models, over a trillion. This is why stepwise selection procedures were developed. We will discuss two stepwise procedures.
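A quick sketch (in Python, rather than the R used for the output later in this section) of how fast the model count grows; the function name is ours, for illustration:

```python
# Best subset selection must consider every subset of the p predictors,
# so the number of candidate models is 2^p.
def n_subset_models(p):
    return 2 ** p

for p in (3, 10, 40):
    print(p, n_subset_models(p))
```

Even going from 10 predictors (1,024 models) to 40 (over a trillion) makes exhaustive search impractical.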
In each of these methods, we add or remove one variable from the model at each step until we can no longer justify adding or removing any more variables.
In forward stepwise selection, we start with a model that has no predictors. We pick what is called a significance level to enter (alpha-to-enter), denoted \(\alpha_E\). At each step, we add the single variable with the smallest t-test p-value, provided that p-value is below \(\alpha_E\). We stop once no remaining variable satisfies this condition.
Suppose we have \(p\) predictors, \(x_1,x_2,\ldots,x_p\).
Step 1: Choose \(\alpha_E\). By default, this is usually 0.1 or 0.15 in most software packages.

Step 2: Fit each of the \(p\) simple linear regression models. The first variable added to the model is the predictor with the smallest t-test p-value smaller than \(\alpha_E\). If none exists, then stop.

Step 3: Say that \(x_1\) was chosen in step 2. Fit each of the two-predictor regression models that include \(x_1\) (i.e., regress \(y\) onto \(x_1\) and \(x_2\), regress \(y\) onto \(x_1\) and \(x_3\), and so forth). The next variable added to the model from \(x_2,\ldots,x_p\) is the one with the smallest t-test p-value less than \(\alpha_E\). If none exists, then stop.

… Continue the above steps, adding a new variable at each step, until you have used all the variables or you reach a step where no new variable attains a t-test p-value smaller than \(\alpha_E\).
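The steps above can be sketched as a short Python loop. Here `pvalues_for` is a hypothetical hook standing in for whatever regression routine supplies the t-test p-values; in the usage example we simply feed it the p-values from the brain-size example worked out later in this section:

```python
def forward_stepwise(pvalues_for, predictors, alpha_enter=0.15):
    """Greedy forward selection on t-test p-values.

    pvalues_for(selected, candidate) returns the t-test p-value of
    `candidate` in the model containing `selected` plus `candidate`
    (a hypothetical interface; in practice these numbers come from
    your regression software).
    """
    selected = []
    remaining = list(predictors)
    while remaining:
        # p-value of each candidate when added to the current model
        scored = [(pvalues_for(selected, c), c) for c in remaining]
        best_p, best = min(scored)
        if best_p >= alpha_enter:   # no candidate qualifies: stop
            break
        selected.append(best)
        remaining.remove(best)
    return selected


# p-values copied from the worked brain-size example in this section
table = {
    (): {"MRI_Count": 0.019, "Height": 0.578, "Weight": 0.988},
    ("MRI_Count",): {"Height": 0.009, "Weight": 0.151},
    ("MRI_Count", "Height"): {"Weight": 0.997},
}
lookup = lambda sel, c: table[tuple(sel)][c]
print(forward_stepwise(lookup, ["MRI_Count", "Height", "Weight"]))
# -> ['MRI_Count', 'Height']
```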
The p-value approach is the default in SAS. Another way to perform the stepwise procedure is to use one of the model selection criteria discussed in the previous section, such as adjusted-\(R^2\), \(C_p\), or BIC. The procedure is similar. For example, with BIC smaller is better, so at each step we would add the variable that decreases the BIC the most. We would stop when no variable causes the BIC to decrease.
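A minimal sketch of this criterion-based variant in Python, assuming the common BIC form \(n\ln(\mathrm{RSS}/n)+k\ln(n)\) and fitting least squares by hand via the normal equations; the function names and the toy data below are ours, for illustration only:

```python
import math

def ols_rss(X_cols, y):
    """Fit OLS with intercept by solving the normal equations
    (Gaussian elimination) and return the residual sum of squares."""
    n = len(y)
    X = [[1.0] + [col[i] for col in X_cols] for i in range(n)]
    k = len(X[0])
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         for r in range(k)]                       # X'X
    b = [sum(X[i][r] * y[i] for i in range(n)) for r in range(k)]  # X'y
    for col in range(k):                          # elimination w/ pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):                # back substitution
        beta[r] = (b[r] - sum(A[r][c] * beta[c]
                              for c in range(r + 1, k))) / A[r][r]
    fitted = [sum(X[i][c] * beta[c] for c in range(k)) for i in range(n)]
    return sum((y[i] - fitted[i]) ** 2 for i in range(n))

def bic(rss, n, k):
    # k = number of estimated coefficients, including the intercept
    return n * math.log(rss / n) + k * math.log(n)

def forward_stepwise_bic(predictors, y):
    """At each step add the variable that lowers BIC the most;
    stop when no addition improves BIC."""
    n = len(y)
    selected = []
    ybar = sum(y) / n
    best_bic = bic(sum((v - ybar) ** 2 for v in y), n, 1)  # intercept-only
    while True:
        candidates = [name for name in predictors if name not in selected]
        if not candidates:
            break
        scored = []
        for name in candidates:
            cols = [predictors[s] for s in selected] + [predictors[name]]
            scored.append((bic(ols_rss(cols, y), n, len(selected) + 2), name))
        step_bic, step_name = min(scored)
        if step_bic >= best_bic:      # no candidate lowers BIC: stop
            break
        best_bic, selected = step_bic, selected + [step_name]
    return selected

# Made-up toy data: y is essentially 2*x1, and x2 is nearly redundant
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]
y  = [2.1, 3.9, 6.2, 8.0, 9.9, 12.1, 14.0, 16.1]
print(forward_stepwise_bic({"x1": x1, "x2": x2}, y))
# -> ['x1']   (adding x2 does not lower BIC, so we stop)
```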
We will consider an example worked out by “hand” here first, just to see the steps of the p-value approach.
Are a person’s brain size and body size predictive of his or her intelligence? Interested in this question, some researchers collected data on 38 college students:
We will perform the p-value forward stepwise procedure on this dataset first. Let’s use \(\alpha_E=0.15\).
Step 1: Fit all three simple linear regression models.
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6603589412 4.371288e+01 0.1066129 0.91568795
## MRI_Count 0.0001176523 4.805849e-05 2.4481074 0.01936606
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 147.4066895 64.3498192 2.2907087 0.02794175
## Height -0.5270978 0.9389412 -0.5613746 0.57802032
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.109769e+02 24.5144141 4.52700483 0.0000631177
## Weight 2.417927e-03 0.1604148 0.01507297 0.9880571896
The only predictor with a t-test p-value below 0.15 is MRI\(\_\)Count with a p-value of 0.019, so our first predictor added to the model is MRI\(\_\)Count (brain size).
Step 2: Fit the regression models with two predictors that include brain size
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.112785e+02 5.586881e+01 1.991782 0.0542430272
## MRI_Count 2.060561e-04 5.466688e-05 3.769304 0.0006050915
## Height -2.729841e+00 9.932231e-01 -2.748467 0.0094034791
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7679585966 4.302711e+01 0.1108129 0.912397736
## MRI_Count 0.0001592123 5.512299e-05 2.8883097 0.006602115
## Weight -0.2501925819 1.703609e-01 -1.4686031 0.150871496
The p-value for Weight is 0.151 and the p-value for Height is 0.009. Since Height has the smaller p-value and it is less than 0.15, we add Height to the model next.
Step 3: Fit the three variable model to see if weight should be entered.
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.113782e+02 6.297148e+01 1.768708305 0.0859141163
## MRI_Count 2.060200e-04 5.634551e-05 3.656369700 0.0008564671
## Height -2.732402e+00 1.229522e+00 -2.222328611 0.0330178964
## Weight 7.164159e-04 1.970643e-01 0.003635442 0.9971205909
Since Weight has a p-value of 0.997, which is greater than 0.15, we do not add Weight and we stop. This means that our final model selected by this procedure is
\[y_i=\beta_0+\beta_1*MRI\_ Count+\beta_2*Height+\varepsilon_i.\] Next, let’s look at an alternative approach, using the other criteria.
## MRI_Count Height Weight R2 Adjusted R2 Cp BIC
## 1 ( 1 ) "*" " " " " "0.1427" "0.1189" "7.3383" "1.4236"
## 1 ( 2 ) " " "*" " " "0.0087" "-0.0189" "13.8017" "6.944"
## 1 ( 3 ) " " " " "*" "0" "-0.0278" "14.2199" "7.2749"
## 2 ( 1 ) "*" "*" " " "0.2949" "0.2546" "2" "-2.3651"
## 2 ( 2 ) "*" " " "*" "0.1925" "0.1463" "6.9387" "2.7888"
## 3 ( 1 ) "*" "*" "*" "0.2949" "0.2327" "4" "1.2725"
Using any of these criteria, we would reach the same conclusion as in the p-value approach. Among the single-variable models, all criteria choose MRI\(\_\)Count; then Height is added, and Weight is not added to the model. For example, consider BIC: we want BIC to be as small as possible, and the model with the smallest BIC in these steps is the model with MRI\(\_\)Count and Height.
In backward stepwise selection, we start with a model that has all predictors. We pick what is called a significance level to stay (alpha-to-stay), denoted \(\alpha_S\). At each step, we remove the single variable with the largest t-test p-value, provided that p-value is larger than \(\alpha_S\). We stop once no variable satisfies this condition.
Suppose we have \(p\) predictors, \(x_1,x_2,\ldots,x_p\).
Step 1: Fit the full regression model with all \(p\) predictors. Choose the predictor with the largest t-test p-value. If this p-value is larger than \(\alpha_S\), remove it from the model; otherwise, stop.

Step 2: Choose the predictor with the largest p-value from the model with \(p-1\) predictors. If this p-value is larger than \(\alpha_S\), remove it from the model; otherwise, stop.

… Continue until no predictor has a t-test p-value larger than \(\alpha_S\) or all predictors have been removed.
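These steps can be sketched as a short Python loop. As before, `pvalues_for` is a hypothetical hook standing in for the regression routine that supplies the t-test p-values; the usage example feeds it the p-values from the brain-size example worked out later in this section:

```python
def backward_stepwise(pvalues_for, predictors, alpha_stay=0.15):
    """Greedy backward elimination on t-test p-values.

    pvalues_for(selected) returns {name: t-test p-value} for the model
    containing exactly the predictors in `selected` (a hypothetical
    interface; in practice these numbers come from your software).
    """
    selected = list(predictors)
    while selected:
        pvals = pvalues_for(selected)
        worst = max(selected, key=lambda v: pvals[v])
        if pvals[worst] <= alpha_stay:   # every predictor qualifies to stay
            break
        selected.remove(worst)
    return selected


# p-values copied from the worked brain-size example in this section
table = {
    ("MRI_Count", "Height", "Weight"):
        {"MRI_Count": 0.0009, "Height": 0.033, "Weight": 0.997},
    ("MRI_Count", "Height"):
        {"MRI_Count": 0.0006, "Height": 0.009},
}
lookup = lambda sel: table[tuple(sel)]
print(backward_stepwise(lookup, ["MRI_Count", "Height", "Weight"]))
# -> ['MRI_Count', 'Height']
```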
Again, this procedure can also be implemented using the other criteria, such as BIC, adjusted-\(R^2\), and Mallows’s \(C_p\). In this case, you remove the variable that improves the chosen criterion the most. Stop when removing a variable no longer improves the chosen criterion.
Again, we will do this manually at first to see the procedure.
We return to the same question: are a person’s brain size and body size predictive of his or her intelligence? Using the same data on 38 college students, we will now perform the p-value backward stepwise procedure. Let’s use \(\alpha_S=0.15\).
Step 1: Fit the full regression model with all three predictors
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.113782e+02 6.297148e+01 1.768708305 0.0859141163
## MRI_Count 2.060200e-04 5.634551e-05 3.656369700 0.0008564671
## Height -2.732402e+00 1.229522e+00 -2.222328611 0.0330178964
## Weight 7.164159e-04 1.970643e-01 0.003635442 0.9971205909
Weight has the highest t-test p-value, 0.997, and since this is larger than 0.15, we will remove Weight from the model.
Step 2: Fit the model with the remaining two predictors
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.112785e+02 5.586881e+01 1.991782 0.0542430272
## MRI_Count 2.060561e-04 5.466688e-05 3.769304 0.0006050915
## Height -2.729841e+00 9.932231e-01 -2.748467 0.0094034791
Now, the largest t-test p-value is for Height with a p-value of 0.009. Since this p-value is smaller than 0.15, we do not remove Height and we stop. We have arrived at the final model \[y_i=\beta_0+\beta_1*MRI\_ Count+\beta_2*Height+\varepsilon_i.\]
Note that both forward and backward selection resulted in the same model. This need not always happen.
Again, let’s look at using the other criteria as well.
## MRI_Count Height Weight R2 Adjusted R2 Cp BIC
## 1 ( 1 ) "*" " " " " "0.1427" "0.1189" "7.3383" "1.4236"
## 2 ( 1 ) "*" "*" " " "0.2949" "0.2546" "2" "-2.3651"
## 3 ( 1 ) "*" "*" "*" "0.2949" "0.2327" "4" "1.2725"
Again, we would make the same decision and choose the two variable model with brain size and height.
This dataset was collected on children ages 3 to 19 to study lung function and contains the following variables:
Use the output below to answer the following questions.
## age smoke sex R2 Adjusted R2 Cp BIC
## 1 ( 1 ) "*" " " " " "0.5722" "0.5716" "61.7387" "-542.391"
## 1 ( 2 ) " " "*" " " "0.0602" "0.0588" "913.6175" "-27.6626"
## 1 ( 3 ) " " " " "*" "0.0434" "0.042" "941.564" "-16.0769"
## 2 ( 1 ) "*" " " "*" "0.607" "0.6058" "5.8991" "-591.3395"
## 2 ( 2 ) "*" "*" " " "0.5766" "0.5753" "56.4888" "-542.6038"
## 2 ( 3 ) " " "*" "*" "0.112" "0.1093" "829.4101" "-58.2688"
## 3 ( 1 ) "*" "*" "*" "0.6093" "0.6075" "4" "-588.7677"
Recall that a higher adjusted-\(R^2\) and a lower BIC indicate better models.
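To make the comparison concrete, here is a small Python sketch that scans the table above for the best model under each criterion (the adjusted-\(R^2\) and BIC values are copied from the output; the variable names in the sketch are ours):

```python
# (predictors, adjusted R2, BIC) copied from the output above
models = [
    (("age",),                0.5716, -542.3910),
    (("smoke",),              0.0588,  -27.6626),
    (("sex",),                0.0420,  -16.0769),
    (("age", "sex"),          0.6058, -591.3395),
    (("age", "smoke"),        0.5753, -542.6038),
    (("smoke", "sex"),        0.1093,  -58.2688),
    (("age", "smoke", "sex"), 0.6075, -588.7677),
]

# Higher adjusted R2 is better; lower BIC is better
best_adjr2 = max(models, key=lambda m: m[1])
best_bic   = min(models, key=lambda m: m[2])
print("best by adjusted R2:", best_adjr2[0])   # ('age', 'smoke', 'sex')
print("best by BIC:        ", best_bic[0])     # ('age', 'sex')
```

Note that the two criteria disagree here: adjusted-\(R^2\) favors the full three-variable model, while BIC, with its stronger penalty on model size, favors the two-variable model with age and sex.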