The first variable selection method we will discuss is called best subsets regression, or the all possible subsets regression procedure. The idea is to fit all possible regression models with the chosen set of predictors and pick the subset of predictors that yields the best model based on some criterion. Before we cover the actual procedure, let’s discuss some criteria for choosing a best model.

Model Selection Criteria

There are many criteria that can be used to define a best model, and they each quantify different aspects of the regression model. Different criteria therefore often lead to different best models. This should not be viewed as an issue as long as you don’t claim that the result of best subsets regression is the single best model overall. This procedure should be thought of as a screening procedure, that is, as a way to reduce the large number of possible regression models to just a handful of models that we can then evaluate further before arriving at a final model.

The criteria that we will discuss here are \(R^2\), adjusted \(R^2\) and MSE, Mallow’s \(C_p\), AIC and BIC, and PRESS with the predicted \(R^2\).

\(R^2\)

Recall that \(R^2\) provides a measure of how well the model fits the data by measuring the predictive power of the model. Higher \(R^2\) means that the model is better. The drawback is that \(R^2\) always increases when adding new variables, so we can’t just say that a higher \(R^2\) means a better model. We will also need to examine the magnitude by which \(R^2\) increases to decide if adding more variables is worth it.

  • Higher \(R^2\) is better, but stop adding variables once the increase becomes small
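
For reference, \(R^2\) compares the model’s error sum of squares to the total sum of squares,

\[R^2=1-\dfrac{SSE}{SSTot},\]

and since adding a predictor can never increase \(SSE\), \(R^2\) can only increase (or stay the same) as variables are added.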

Adjusted \(R^2\) and MSE

The adjusted \(R^2\) is similar to \(R^2\) but with an added penalty for adding more predictors. This means that the adjusted-\(R^2\) will only increase as long as the benefit of adding the variable to the model outweighs the penalty.

  • Use the model with the highest adjusted \(R^2\)

The MSE measures how far our data are from the predicted values of the model, so better models yield smaller MSE. It also turns out that the adjusted \(R^2\) increases if and only if the MSE decreases, so using either MSE or adjusted \(R^2\) will always yield the same model.

  • Use the model with the smallest MSE
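
To see why these two criteria always agree, note that for a model with \(p\) \(\beta\)’s fit to \(n\) observations,

\[R^2_{adj}=1-\dfrac{SSE/(n-p)}{SSTot/(n-1)}=1-\dfrac{MSE}{SSTot/(n-1)}.\]

Since \(SSTot/(n-1)\) is the same for every candidate model, the adjusted \(R^2\) goes up exactly when the MSE goes down.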

Mallow’s \(C_p\)

Mallow’s \(C_p\) statistic is a measure of the bias and variation in the predicted responses from the regression model. Ideally, we want this minimized. The strategy for using Mallow’s \(C_p\) is as follows.

  • Identify subsets of predictors for which the \(C_p\) value is near p (if possible).
  • The full model always yields \(C_p= p\), so don’t select the full model based on \(C_p\).
  • If all models, except the full model, yield a large \(C_p\) not near p, it suggests some important predictor(s) are missing from the analysis. In this case, we are well-advised to identify the predictors that are missing!
  • If a number of models have \(C_p\) near p, choose the model with the smallest \(C_p\) value, thereby ensuring that the combination of the bias and the variance is at a minimum.
  • When more than one model has a small \(C_p\) value near p, in general, choose the simpler model or the model that meets your research needs.

Note here that \(p\) stands for the number of \(\beta\)’s in the model, that is, a model with \(p-1\) predictors. So for a model with 3 predictors (that is, 4 \(\beta\)’s: 3 slopes and one y-intercept) we want \(C_p\) to be near 4 or smaller.
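
For completeness, the statistic for a subset model with \(p\) \(\beta\)’s is computed as

\[C_p=\dfrac{SSE_p}{MSE_{full}}-(n-2p),\]

where \(SSE_p\) is the error sum of squares of the subset model and \(MSE_{full}\) is the mean squared error of the full model containing all candidate predictors. Substituting the full model’s \(SSE\) into this formula shows why the full model always has \(C_p=p\).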

AIC and BIC

AIC and BIC are information criteria that stand for Akaike’s and Bayesian Information Criterion, respectively. Both of these criteria place penalties on the number of predictors, where BIC places a slightly heavier penalty. This usually means that BIC will lead to models with fewer predictors than AIC. The modeling strategy for using AIC and BIC is

  • choose the model with the smallest AIC or BIC.
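
For a linear regression model with \(p\) \(\beta\)’s fit to \(n\) observations, the two criteria can be written (up to an additive constant that is the same for every candidate model) as

\[AIC=n\ln\left(\dfrac{SSE}{n}\right)+2p \qquad \text{and} \qquad BIC=n\ln\left(\dfrac{SSE}{n}\right)+p\ln(n).\]

Since \(\ln(n)>2\) once \(n>7\), BIC’s penalty per parameter is heavier, which is why it tends to favor smaller models.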

PRESS and Predicted \(R^2\)

The PRESS, or prediction sum of squares, is a model validation method used to assess a model’s predictive ability. For a data set of size n, PRESS is calculated by omitting each observation individually, using the remaining \(n-1\) observations to fit the regression equation, and then using that equation to predict the omitted response value (the prediction is denoted by \(\hat{y}_{i(i)}\)). This gives the \(i\)th PRESS residual \(y_i-\hat{y}_{i(i)}\). The formula for PRESS is then given by

\[PRESS=\sum_{i=1}^n(y_i-\hat{y}_{i(i)})^2.\] In general, the smaller the PRESS value, the better.

The predicted \(R^2\) is defined as \[R^2_{pred}=1-\dfrac{PRESS}{SSTot}.\] This provides another measure of the predictive ability of the model and is similar to \(R^2\) in that we want higher values for \(R^2_{pred}\).
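
In practice, PRESS does not require refitting \(n\) separate models: for a linear regression the PRESS residuals can be recovered from the ordinary residuals and the leverages. A minimal R sketch, assuming fit is a fitted lm object:

res <- residuals(fit)              # ordinary residuals e_i
h   <- hatvalues(fit)              # leverages h_ii
PRESS <- sum((res / (1 - h))^2)    # PRESS residuals are e_i / (1 - h_ii)

y <- model.response(model.frame(fit))
SSTot <- sum((y - mean(y))^2)
R2_pred <- 1 - PRESS / SSTot       # predicted R-squared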

Best Subsets Regression Procedure

Best subsets regression uses the following steps.

  1. First, identify all of the possible regression models derived from all possible combinations of the candidate predictors. This can be quite large.

For example, suppose we only have three predictors, \(x_1,x_2,\) and \(x_3\). Then there are 8 possible regression models (these are written out in the example below).

In general if we have \(p\) predictors, then there are \(2^p\) possible models. So for 10 predictors, there are \(2^{10}=1024\) possible models. This quickly becomes prohibitively expensive in computational time. This was the impetus for the stepwise selection methods introduced in the next section.

Even so, we will not need to actually write out all these models ourselves, since the statistical software will do that for us.

  2. From all possible models (one predictor models, two predictor models,\(\cdots\),\(p\) predictor models) choose the models that do the best based on one of the criteria discussed above.

  3. Further evaluate the models selected from step 2. Are the model assumptions met? Do we need to transform variables? Add interaction terms? Explore the models until you are satisfied that you have found a model that meets the model conditions, does a good job of summarizing the trend in the data, and most importantly allows you to answer your research question.

Note: The examples in this section will intentionally only contain a small number of variables to explain the methods. In practice, these methods are used on datasets with many more potential predictors.

Example: Intelligence and Physical Characteristics

Are a person’s brain size and body size predictive of his or her intelligence? Interested in this question, some researchers collected data on 38 college students, recording a measure of intelligence (the response) along with three potential predictors: MRI\(\_\)Count (a measure of brain size), Height, and Weight.

Step 1: Identify all possible regression models

Normally, the software will do this for us, but for the sake of example let’s write it out manually this time. In this case there are \(2^3=8\) possible regression models:

  • A single model with no variables
  • Three models with a single variable
  • Three models with two variables
  • A single model with all three variables
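
Written out, with \(x_1\), \(x_2\), and \(x_3\) denoting MRI\(\_\)Count, Height, and Weight and \(y\) the intelligence response, the eight candidate mean functions are

\[\begin{aligned}
&E(y)=\beta_0,\\
&E(y)=\beta_0+\beta_1x_1, \qquad E(y)=\beta_0+\beta_1x_2, \qquad E(y)=\beta_0+\beta_1x_3,\\
&E(y)=\beta_0+\beta_1x_1+\beta_2x_2, \qquad E(y)=\beta_0+\beta_1x_1+\beta_2x_3, \qquad E(y)=\beta_0+\beta_1x_2+\beta_2x_3,\\
&E(y)=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3.
\end{aligned}\]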

Step 2: Identify the “best” models

We will examine a few criteria here, including \(C_p\), adjusted-\(R^2\), and BIC.
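
The output below was generated with the leaps package (its regsubsets() function). A sketch of code that produces this kind of table is shown here; the data frame name (iq) and the response name (PIQ) are assumptions for illustration.

library(leaps)

# Request (up to) the 3 best models of each size
best <- regsubsets(PIQ ~ MRI_Count + Height + Weight, data = iq, nbest = 3)
summ <- summary(best)

# Combine the variable-selection indicators with the criteria for each model
cbind(summ$outmat, R2 = round(summ$rsq, 4), `Adjusted R2` = round(summ$adjr2, 4),
      Cp = round(summ$cp, 4), BIC = round(summ$bic, 4))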

##          MRI_Count Height Weight R2       Adjusted R2 Cp        BIC      
## 1  ( 1 ) "*"       " "    " "    "0.1427" "0.1189"    "7.3383"  "1.4236" 
## 1  ( 2 ) " "       "*"    " "    "0.0087" "-0.0189"   "13.8017" "6.944"  
## 1  ( 3 ) " "       " "    "*"    "0"      "-0.0278"   "14.2199" "7.2749" 
## 2  ( 1 ) "*"       "*"    " "    "0.2949" "0.2546"    "2"       "-2.3651"
## 2  ( 2 ) "*"       " "    "*"    "0.1925" "0.1463"    "6.9387"  "2.7888" 
## 2  ( 3 ) " "       "*"    "*"    "0.0177" "-0.0385"   "15.369"  "10.236" 
## 3  ( 1 ) "*"       "*"    "*"    "0.2949" "0.2327"    "4"       "1.2725"

In the output above, a star in the column under the variable name means that that variable was included in the model in that row. The other columns provide the \(R^2\), adjusted-\(R^2\), \(C_p\), and \(BIC\) for each model.

Using the \(R^2\) criterion, we want models with high values. The best one variable model uses MRI\(\_\)Count with an \(R^2=14.27\%\). The best two variable model includes MRI\(\_\)Count and Height with \(R^2=29.49\%\). The single three variable model also has \(R^2=29.49\%\), so adding Weight to the model added no benefit over the best two variable model. The jump from the best one variable model to the best two variable model (from \(R^2=14.27\%\) to \(R^2=29.49\%\)) is large enough to justify the two variable model over the one variable model.

Using adjusted-\(R^2\), again, we want higher values. Adjusted-\(R^2\) picks the same best one and two variable models, with values 0.1189 and 0.2546, respectively. Note that the adjusted-\(R^2\) does clearly distinguish between the best two variable model and the three variable model, since the adjusted-\(R^2\) drops to 0.2327 for the three variable model. Again, the overall “best” model by this criterion is the two variable model with MRI\(\_\)Count and Height.

For Mallow’s \(C_p\), lower values are in general better. In particular, we want the value to be around (or, even better, lower than) the number of predictors plus one. Based on this criterion, the best single variable model is the model with MRI\(\_\)Count and \(C_p=7.3383\), but this \(C_p\) value is well above 2, indicating serious bias in this model. The best two variable model has MRI\(\_\)Count and Height with \(C_p=2\). This value is below 3, indicating a good model. Recall that we can’t use \(C_p\) for the full model since it is always equal to the number of predictors plus one (in this case 4).

Finally, with BIC, we want small values. BIC chooses the same best one and two variable models, with BIC values of 1.4236 and -2.3651, respectively. The smallest BIC overall, including the three variable model, is for the two variable model containing MRI\(\_\)Count and Height.

In this example, all four criteria choose the same best models, but this does not happen in all cases.

Step 3: Explore the best models

We will not do this here as it was covered in the previous lesson.
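
Even so, refitting the selected model is the natural starting point for that exploration. A sketch, again assuming the data frame iq and the response PIQ:

final_fit <- lm(PIQ ~ MRI_Count + Height, data = iq)
summary(final_fit)   # examine coefficients, then check residual plots, etc.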

Exercise: Lung Function

This dataset was collected on children ages 3 to 19 to study lung function, and it contains a measure of lung function (the response) along with the potential predictors age, smoke, and sex.

Use the output below to answer the following questions.

##          age smoke sex R2       Adjusted R2 Cp         BIC        
## 1  ( 1 ) "*" " "   " " "0.5722" "0.5716"    "61.7387"  "-542.391" 
## 1  ( 2 ) " " "*"   " " "0.0602" "0.0588"    "913.6175" "-27.6626" 
## 1  ( 3 ) " " " "   "*" "0.0434" "0.042"     "941.564"  "-16.0769" 
## 2  ( 1 ) "*" " "   "*" "0.607"  "0.6058"    "5.8991"   "-591.3395"
## 2  ( 2 ) "*" "*"   " " "0.5766" "0.5753"    "56.4888"  "-542.6038"
## 2  ( 3 ) " " "*"   "*" "0.112"  "0.1093"    "829.4101" "-58.2688" 
## 3  ( 1 ) "*" "*"   "*" "0.6093" "0.6075"    "4"        "-588.7677"

  1. What are the best one and two variable models chosen by using
       1. \(R^2\)
       2. Adjusted-\(R^2\)
       3. Mallow’s \(C_p\)
       4. BIC
  2. What is the best overall model from the best one, two and three variable models as chosen by
       1. \(R^2\)
       2. Adjusted-\(R^2\)
       3. Mallow’s \(C_p\)
       4. BIC

Solution