The previous methods fit many individual models containing subsets of the \(p\) predictors and chose among them based on some criterion. As an alternative, we can fit a single model containing all \(p\) predictors using a technique that constrains or regularizes the coefficients, or equivalently, that shrinks the coefficient estimates towards zero. The idea here is that we end up with estimates that are slightly biased but much less variable, so that they are closer on average to the true values than the estimates from standard linear regression, which are unbiased but can have much more variability (large standard errors).

There are many methods that fall into this shrinkage category. We will briefly discuss the LASSO. Recall that in linear regression we choose the parameters \(\beta_0,\beta_1,\ldots,\beta_p\) that minimize the residual sum of squares: \[RSS=\sum_{i=1}^n(y_i-\beta_0-\beta_1x_{i1}-\cdots-\beta_px_{ip})^2.\]

In the LASSO, we choose the parameters \(\beta_0,\beta_1,\ldots,\beta_p\) that minimize the residual sum of squares plus a penalty term: \[\sum_{i=1}^n(y_i-\beta_0-\beta_1x_{i1}-\cdots-\beta_px_{ip})^2+\lambda\sum_{j=1}^p|\beta_j|.\] The term \(\lambda\) (\(\geq 0\)) is called a tuning parameter. It determines the strength of the penalty term, i.e. the constraint on the slopes.
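To make the criterion concrete, here is a small R sketch that evaluates this penalized objective for a candidate set of coefficients (the toy data and coefficient values are made up purely for illustration):

```r
# Toy data: n = 5 observations, p = 2 predictors (values made up)
y <- c(10, 12, 9, 15, 13)
x <- cbind(x1 = c(1, 2, 1, 3, 2), x2 = c(5, 3, 4, 6, 5))

# RSS plus lambda times the sum of the absolute slopes
lasso_objective <- function(beta0, beta, lambda) {
  rss <- sum((y - beta0 - x %*% beta)^2)
  rss + lambda * sum(abs(beta))
}

# Evaluate the criterion for one candidate fit; the LASSO estimates are the
# beta0 and beta that minimize this quantity for the chosen lambda
lasso_objective(beta0 = 8, beta = c(1, 0.5), lambda = 2)
```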

The LASSO can be used regardless of the number of predictors, even when the number of predictors is larger than the number of observations. It tends to force many slope estimates to be exactly 0, and it can give stable (but slightly biased) estimates even when linear regression cannot.

Example: Blood Pressure in Peru

The following dataset contains measurements possibly related to blood pressure from 39 Peruvians who moved from rural high-altitude areas to urban lower-altitude areas. The predictors used below are Age, Years, Weight, Height, Chin, Forearm, Calf, Pulse, and UrbanFrac.
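One way a table of LASSO coefficient estimates like the one below could be produced is with the glmnet package. This is a rough sketch, assuming the data are in a data frame called peru with the blood pressure response in a column called bp (both names are assumptions):

```r
library(glmnet)

# Predictor matrix and response for the Peru data (data frame name and the
# response column bp are assumptions; predictor names are from the output)
x <- model.matrix(bp ~ Age + Years + Weight + Height + Chin + Forearm +
                    Calf + Pulse + UrbanFrac, data = peru)[, -1]
y <- peru$bp

# LASSO fits at the lambda values shown in the table below (decreasing order,
# as glmnet expects); lambda = 0 corresponds to ordinary least squares
fit <- glmnet(x, y, alpha = 1, lambda = c(9, 4, 2, 1, 0.5, 0.2, 0))
coefs <- coef(fit)
colnames(coefs) <- c(9, 4, 2, 1, 0.5, 0.2, 0)
coefs
```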

## 10 x 7 sparse Matrix of class "dgCMatrix"
##                         0          0.2          0.5            1           2           4        9
## (Intercept)  146.00735983 119.11706327  81.98313907  70.60625229   81.578634 102.2613593 127.4103
## Age           -1.10606838  -0.66000714  -0.18892200  -0.05685838    .          .           .
## Years          2.41089801   1.17836974   .            .             .          .           .
## Weight         1.41784339   1.47757617   1.40339139   1.06871586    0.813106   0.4093552   .
## Height        -0.03460929  -0.02982719  -0.01470139   .             .          .           .
## Chin          -0.94440892  -0.92197225  -0.73764292  -0.14439515    .          .           .
## Forearm       -1.15747599  -0.70539020   .            .             .          .           .
## Calf          -0.15346662  -0.00977771   .            .             .          .           .
## Pulse          0.11270247   0.05537508   .            .             .          .           .
## UrbanFrac   -113.63725803 -67.39818300 -22.72670687 -20.11977682  -14.296756  -1.8262950   .

Note that for \(\lambda=0\) we get the least squares estimates, but as \(\lambda\) gets larger, more and more coefficients become exactly zero (shown as . in the output). The value of \(\lambda\) to use is typically chosen by cross-validation.
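A sketch of how the cross-validation step might look with cv.glmnet, continuing with the x and y from the sketch above (the seed, the default 10 folds, and the use of lambda.min are assumptions about how the fit below was obtained):

```r
set.seed(1)  # cross-validation assigns observations to folds at random

cv <- cv.glmnet(x, y, alpha = 1)  # 10-fold cross-validation by default
cv$lambda.min                     # lambda with the smallest estimated CV error

# LASSO coefficients at the selected lambda
coef(cv, s = "lambda.min")
```

In this case, the \(\lambda\) chosen by cross-validation and the resulting LASSO fit are: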

## [1] 0.08513799
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) 134.95616093
## Age          -0.92298227
## Years         1.90537615
## Weight        1.44200824
## Height       -0.03262516
## Chin         -0.93538503
## Forearm      -0.97124314
## Calf         -0.09448374
## Pulse         0.08916647
## UrbanFrac   -94.67071129