Best subset selection requires the ability to fit all \(2^p\) possible models, but even for relatively small \(p\) such as 40, this can become prohibitively expensive. If you have 40 variables, then you need to fit \(2^{40}=1{,}099{,}511{,}627{,}776\) models, over a trillion. This is why stepwise selection procedures were developed. We will discuss two stepwise procedures.
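A quick sketch (in Python, rather than the R used for the output later in this section) of how fast the model count grows; the function name is ours, for illustration:

```python
# Best subset selection must consider every subset of the p predictors,
# so the number of candidate models is 2^p.
def n_subset_models(p):
    return 2 ** p

for p in (3, 10, 40):
    print(p, n_subset_models(p))
```

Even going from 10 predictors (1,024 models) to 40 (over a trillion) makes exhaustive search impractical.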
In each of these methods, we add or remove one variable from the model at each step until we can no longer justify adding or removing any more variables.
In forward stepwise selection, we start with a model that has no predictors. We pick what is called a significance level to enter (alpha-to-enter), denoted \(\alpha_E\). At each step, we add the single variable with the smallest t-test p-value, provided that p-value is below \(\alpha_E\). We stop once no remaining variable satisfies this condition.
Suppose we have \(p\) predictors, \(x_1,x_2,\ldots,x_p\).
Step 1: Choose \(\alpha_E\). By default, this is usually 0.1 or 0.15 in most software packages.

Step 2: Fit each of the \(p\) simple linear regression models. The first variable added to the model is the predictor with the smallest t-test p-value smaller than \(\alpha_E\). If none exists, then stop.

Step 3: Say that \(x_1\) was chosen in step 2. Fit each of the two-predictor regression models that include \(x_1\) (i.e., regress \(y\) onto \(x_1\) and \(x_2\), regress \(y\) onto \(x_1\) and \(x_3\), and so forth). The next variable added to the model from \(x_2,\ldots,x_p\) is the one with the smallest t-test p-value less than \(\alpha_E\). If none exists, then stop.

… Continue the above steps, adding a new variable at each step, until you have used all the variables or you reach a step where no new variable attains a t-test p-value smaller than \(\alpha_E\).
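The steps above can be sketched as a short Python loop. Here `pvalues_for` is a hypothetical hook standing in for whatever regression routine supplies the t-test p-values; in the usage example we simply feed it the p-values from the brain-size example worked out later in this section:

```python
def forward_stepwise(pvalues_for, predictors, alpha_enter=0.15):
    """Greedy forward selection on t-test p-values.

    pvalues_for(selected, candidate) returns the t-test p-value of
    `candidate` in the model containing `selected` plus `candidate`
    (a hypothetical interface; in practice these numbers come from
    your regression software).
    """
    selected = []
    remaining = list(predictors)
    while remaining:
        # p-value of each candidate when added to the current model
        scored = [(pvalues_for(selected, c), c) for c in remaining]
        best_p, best = min(scored)
        if best_p >= alpha_enter:   # no candidate qualifies: stop
            break
        selected.append(best)
        remaining.remove(best)
    return selected


# p-values copied from the worked brain-size example in this section
table = {
    (): {"MRI_Count": 0.019, "Height": 0.578, "Weight": 0.988},
    ("MRI_Count",): {"Height": 0.009, "Weight": 0.151},
    ("MRI_Count", "Height"): {"Weight": 0.997},
}
lookup = lambda sel, c: table[tuple(sel)][c]
print(forward_stepwise(lookup, ["MRI_Count", "Height", "Weight"]))
# -> ['MRI_Count', 'Height']
```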
The p-value approach is the default in SAS. Another way to perform the stepwise procedure is to use one of the model selection criteria discussed in the previous section, such as adjusted-\(R^2\), \(C_p\), or BIC. The procedure is similar. For example, with BIC smaller is better, so at each step we would add the variable that decreases the BIC the most. We would stop when no variable causes the BIC to decrease.
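A minimal sketch of this criterion-based variant in Python, assuming the common BIC form \(n\ln(\mathrm{RSS}/n)+k\ln(n)\) and fitting least squares by hand via the normal equations; the function names and the toy data below are ours, for illustration only:

```python
import math

def ols_rss(X_cols, y):
    """Fit OLS with intercept by solving the normal equations
    (Gaussian elimination) and return the residual sum of squares."""
    n = len(y)
    X = [[1.0] + [col[i] for col in X_cols] for i in range(n)]
    k = len(X[0])
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         for r in range(k)]                       # X'X
    b = [sum(X[i][r] * y[i] for i in range(n)) for r in range(k)]  # X'y
    for col in range(k):                          # elimination w/ pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):                # back substitution
        beta[r] = (b[r] - sum(A[r][c] * beta[c]
                              for c in range(r + 1, k))) / A[r][r]
    fitted = [sum(X[i][c] * beta[c] for c in range(k)) for i in range(n)]
    return sum((y[i] - fitted[i]) ** 2 for i in range(n))

def bic(rss, n, k):
    # k = number of estimated coefficients, including the intercept
    return n * math.log(rss / n) + k * math.log(n)

def forward_stepwise_bic(predictors, y):
    """At each step add the variable that lowers BIC the most;
    stop when no addition improves BIC."""
    n = len(y)
    selected = []
    ybar = sum(y) / n
    best_bic = bic(sum((v - ybar) ** 2 for v in y), n, 1)  # intercept-only
    while True:
        candidates = [name for name in predictors if name not in selected]
        if not candidates:
            break
        scored = []
        for name in candidates:
            cols = [predictors[s] for s in selected] + [predictors[name]]
            scored.append((bic(ols_rss(cols, y), n, len(selected) + 2), name))
        step_bic, step_name = min(scored)
        if step_bic >= best_bic:      # no candidate lowers BIC: stop
            break
        best_bic, selected = step_bic, selected + [step_name]
    return selected

# Made-up toy data: y is essentially 2*x1, and x2 is nearly redundant
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]
y  = [2.1, 3.9, 6.2, 8.0, 9.9, 12.1, 14.0, 16.1]
print(forward_stepwise_bic({"x1": x1, "x2": x2}, y))
# -> ['x1']   (adding x2 does not lower BIC, so we stop)
```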
We will consider an example worked out by “hand” here first, just to see the steps of the p-value approach.
Are a person’s brain size and body size predictive of his or her intelligence? Interested in this question, some researchers collected data on 38 college students:
We will perform the p-value forward stepwise procedure on this dataset first. Let’s use \(\alpha_E=0.15\).
Step 1: Fit all three simple linear regression models.
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6603589412 4.371288e+01 0.1066129 0.91568795
## MRI_Count 0.0001176523 4.805849e-05 2.4481074 0.01936606
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 147.4066895 64.3498192 2.2907087 0.02794175
## Height -0.5270978 0.9389412 -0.5613746 0.57802032
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.109769e+02 24.5144141 4.52700483 0.0000631177
## Weight 2.417927e-03 0.1604148 0.01507297 0.9880571896
The only predictor with a t-test p-value below 0.15 is MRI\(\_\)Count with a p-value of 0.019, so our first predictor added to the model is MRI\(\_\)Count (brain size).
Step 2: Fit the regression models with two predictors that include brain size
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.112785e+02 5.586881e+01 1.991782 0.0542430272
## MRI_Count 2.060561e-04 5.466688e-05 3.769304 0.0006050915
## Height -2.729841e+00 9.932231e-01 -2.748467 0.0094034791
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7679585966 4.302711e+01 0.1108129 0.912397736
## MRI_Count 0.0001592123 5.512299e-05 2.8883097 0.006602115
## Weight -0.2501925819 1.703609e-01 -1.4686031 0.150871496
The p-value for Weight is 0.151 and the p-value for Height is 0.009. Since Height has the smaller p-value and it is less than 0.15, we add Height to the model next.
Step 3: Fit the three variable model to see if weight should be entered.
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.113782e+02 6.297148e+01 1.768708305 0.0859141163
## MRI_Count 2.060200e-04 5.634551e-05 3.656369700 0.0008564671
## Height -2.732402e+00 1.229522e+00 -2.222328611 0.0330178964
## Weight 7.164159e-04 1.970643e-01 0.003635442 0.9971205909
Since Weight has a p-value of 0.997, which is greater than 0.15, we do not add Weight and we stop. This means that our final model selected by this procedure is
\[y_i=\beta_0+\beta_1*MRI\_ Count+\beta_2*Height+\varepsilon_i.\] Next, let’s look at an alternative approach, using the other criteria.
## MRI_Count Height Weight R2 Adjusted R2 Cp BIC
## 1 ( 1 ) "*" " " " " "0.1427" "0.1189" "7.3383" "1.4236"
## 1 ( 2 ) " " "*" " " "0.0087" "-0.0189" "13.8017" "6.944"
## 1 ( 3 ) " " " " "*" "0" "-0.0278" "14.2199" "7.2749"
## 2 ( 1 ) "*" "*" " " "0.2949" "0.2546" "2" "-2.3651"
## 2 ( 2 ) "*" " " "*" "0.1925" "0.1463" "6.9387" "2.7888"
## 3 ( 1 ) "*" "*" "*" "0.2949" "0.2327" "4" "1.2725"
Using any of these criteria, we would reach the same conclusion as in the p-value approach. Among the single-variable models, all criteria choose MRI\(\_\)Count; then Height is added, and Weight is not added to the model. For example, consider BIC: we want BIC to be as small as possible, and the model with the smallest BIC in these steps is the model with MRI\(\_\)Count and Height.
In backward stepwise selection, we start with a model that has all predictors. We pick what is called a significance level to stay (alpha-to-stay), denoted \(\alpha_S\). At each step, we remove the single variable with the largest t-test p-value, provided that p-value is larger than \(\alpha_S\). We stop once no variable satisfies this condition.
Suppose we have \(p\) predictors, \(x_1,x_2,\ldots,x_p\).
Step 1: Fit the full regression model with all \(p\) predictors. Choose the predictor with the largest t-test p-value. If this p-value is larger than \(\alpha_S\), remove it from the model; otherwise, stop.

Step 2: Choose the predictor with the largest p-value from the model with \(p-1\) predictors. If this p-value is larger than \(\alpha_S\), remove it from the model; otherwise, stop.

… Continue until no predictor has a t-test p-value larger than \(\alpha_S\) or all predictors have been removed.
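These steps can be sketched as a short Python loop. As before, `pvalues_for` is a hypothetical hook standing in for the regression routine that supplies the t-test p-values; the usage example feeds it the p-values from the brain-size example worked out later in this section:

```python
def backward_stepwise(pvalues_for, predictors, alpha_stay=0.15):
    """Greedy backward elimination on t-test p-values.

    pvalues_for(selected) returns {name: t-test p-value} for the model
    containing exactly the predictors in `selected` (a hypothetical
    interface; in practice these numbers come from your software).
    """
    selected = list(predictors)
    while selected:
        pvals = pvalues_for(selected)
        worst = max(selected, key=lambda v: pvals[v])
        if pvals[worst] <= alpha_stay:   # every predictor qualifies to stay
            break
        selected.remove(worst)
    return selected


# p-values copied from the worked brain-size example in this section
table = {
    ("MRI_Count", "Height", "Weight"):
        {"MRI_Count": 0.0009, "Height": 0.033, "Weight": 0.997},
    ("MRI_Count", "Height"):
        {"MRI_Count": 0.0006, "Height": 0.009},
}
lookup = lambda sel: table[tuple(sel)]
print(backward_stepwise(lookup, ["MRI_Count", "Height", "Weight"]))
# -> ['MRI_Count', 'Height']
```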
Again, this procedure can also be implemented using the other criteria, such as BIC, adjusted-\(R^2\), and Mallows’s \(C_p\). In this case, you remove the variable that improves the chosen criterion the most. Stop when removing a variable no longer improves the chosen criterion.
Again, we will do this manually at first to see the procedure.
We return to the same question: are a person’s brain size and body size predictive of his or her intelligence? Using the same data on 38 college students, we will now perform the p-value backward stepwise procedure. Let’s use \(\alpha_S=0.15\).
Step 1: Fit the full regression model with all three predictors
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.113782e+02 6.297148e+01 1.768708305 0.0859141163
## MRI_Count 2.060200e-04 5.634551e-05 3.656369700 0.0008564671
## Height -2.732402e+00 1.229522e+00 -2.222328611 0.0330178964
## Weight 7.164159e-04 1.970643e-01 0.003635442 0.9971205909
Weight has the highest t-test p-value, 0.997, and since this is larger than 0.15, we will remove Weight from the model.
Step 2: Fit the model with the remaining two predictors
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.112785e+02 5.586881e+01 1.991782 0.0542430272
## MRI_Count 2.060561e-04 5.466688e-05 3.769304 0.0006050915
## Height -2.729841e+00 9.932231e-01 -2.748467 0.0094034791
Now, the largest t-test p-value is for Height with a p-value of 0.009. Since this p-value is smaller than 0.15, we do not remove Height and we stop. We have arrived at the final model \[y_i=\beta_0+\beta_1*MRI\_ Count+\beta_2*Height+\varepsilon_i.\]
Note that both forward and backward selection resulted in the same model. This need not always happen.
Again, let’s look at using the other criteria as well.
## MRI_Count Height Weight R2 Adjusted R2 Cp BIC
## 1 ( 1 ) "*" " " " " "0.1427" "0.1189" "7.3383" "1.4236"
## 2 ( 1 ) "*" "*" " " "0.2949" "0.2546" "2" "-2.3651"
## 3 ( 1 ) "*" "*" "*" "0.2949" "0.2327" "4" "1.2725"
Again, we would make the same decision and choose the two variable model with brain size and height.
This dataset was collected on children ages 3 to 19 to study lung function and contains the following variables:
Use the output below to answer the following questions.
## age smoke sex R2 Adjusted R2 Cp BIC
## 1 ( 1 ) "*" " " " " "0.5722" "0.5716" "61.7387" "-542.391"
## 1 ( 2 ) " " "*" " " "0.0602" "0.0588" "913.6175" "-27.6626"
## 1 ( 3 ) " " " " "*" "0.0434" "0.042" "941.564" "-16.0769"
## 2 ( 1 ) "*" " " "*" "0.607" "0.6058" "5.8991" "-591.3395"
## 2 ( 2 ) "*" "*" " " "0.5766" "0.5753" "56.4888" "-542.6038"
## 2 ( 3 ) " " "*" "*" "0.112" "0.1093" "829.4101" "-58.2688"
## 3 ( 1 ) "*" "*" "*" "0.6093" "0.6075" "4" "-588.7677"
Recall that a higher adjusted-\(R^2\) and a lower BIC indicate better models.
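To make the comparison concrete, here is a small Python sketch that scans the table above for the best model under each criterion (the adjusted-\(R^2\) and BIC values are copied from the output; the variable names in the sketch are ours):

```python
# (predictors, adjusted R2, BIC) copied from the output above
models = [
    (("age",),                0.5716, -542.3910),
    (("smoke",),              0.0588,  -27.6626),
    (("sex",),                0.0420,  -16.0769),
    (("age", "sex"),          0.6058, -591.3395),
    (("age", "smoke"),        0.5753, -542.6038),
    (("smoke", "sex"),        0.1093,  -58.2688),
    (("age", "smoke", "sex"), 0.6075, -588.7677),
]

# Higher adjusted R2 is better; lower BIC is better
best_adjr2 = max(models, key=lambda m: m[1])
best_bic   = min(models, key=lambda m: m[2])
print("best by adjusted R2:", best_adjr2[0])   # ('age', 'smoke', 'sex')
print("best by BIC:        ", best_bic[0])     # ('age', 'sex')
```

Note that the two criteria disagree here: adjusted-\(R^2\) favors the full three-variable model, while BIC, with its stronger penalty on model size, favors the two-variable model with age and sex.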