Linear Regression in SAS

In this section, we will learn how to obtain the output discussed in the notes for linear regression. At the end of this lab, you should be able to do the following in SAS:

  1. Fit a regression model with PROC REG and PROC GLM and obtain the standard regression output.
  2. Compare nested models with the TEST statement.
  3. Calculate confidence intervals for the regression slopes.
  4. Obtain the residual plots needed for assessing the model assumptions.
  5. Make new predictions and calculate prediction intervals.

For this lab, we will use the hospital infection risk dataset. Recall that this dataset records the following variables for 113 hospitals in order to assess infection risk:

  • Response (y): The infection risk at the hospital (Percentage of patients who contract an infection while hospitalized).
  • Predictor 1 ($x_1$): Average length of patient stay (in days)
  • Predictor 2 ($x_2$): Average patient age (in years)
  • Predictor 3 ($x_3$): Measure of how many x-rays are given at the hospital.

Download the dataset hospital_infct.csv by right-clicking the file and choosing Save Link As to save it to your computer.

First, we need to read in the data.

In [1]:
LIBNAME Survey "\\file.phhp.ufl.edu\home\rlp176\BiostatCourses\PHC6937SurveryBiostat\Lectures\MLR\Data";

PROC IMPORT datafile="\\file.phhp.ufl.edu\home\rlp176\BiostatCourses\PHC6937SurveryBiostat\Lectures\MLR\Data\hospital_infct.csv"
out=Survey.hospital dbms=csv replace;
getnames=Yes;
RUN;

PROC PRINT DATA=Survey.hospital(OBS=5);
VAR InfctRsk Stay Age Xray;
RUN;

Out[1]:
SAS Output

The SAS System

The PRINT Procedure

Data Set SURVEY.HOSPITAL

Obs InfctRsk Stay Age Xray
1 4.1 7.13 55.7 39.6
2 1.6 8.82 58.2 51.7
3 2.7 8.34 56.9 74
4 5.6 8.95 53.7 122.8
5 5.7 11.2 56.5 88.9
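
Before fitting a model, it can help to confirm that the import produced numeric columns with plausible ranges. The following is a minimal sketch, assuming the Survey.hospital dataset created above; the particular summary statistics requested are only an illustrative choice.

```sas
/* Check variable names, types, and lengths after the import */
PROC CONTENTS DATA=Survey.hospital;
RUN;

/* Summary statistics for the response and the three predictors */
PROC MEANS DATA=Survey.hospital N MEAN STD MIN MAX;
VAR InfctRsk Stay Age Xray;
RUN;
```

If a predictor shows up as character in the PROC CONTENTS listing, the regression procedures will not accept it as a continuous predictor, so this is worth checking first.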

Recall the regression model for this data:

$$y_i=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\beta_3x_{i3}+\varepsilon_i$$

where

  • $y_i$ is the infection risk for hospital i
  • $x_{i1}$ is the average length of stay at hospital i
  • $x_{i2}$ is the average age of patients at hospital i
  • $x_{i3}$ is the measure of the number of x-rays given at hospital i

and the $\varepsilon_i$ are independent and have a normal distribution with mean 0 and equal variance $\sigma^2$. To fit this model in SAS, there are two options:

  • PROC REG
  • PROC GLM

PROC REG provides much more automatic output for regression, but it does not handle categorical variables and interaction terms as easily as PROC GLM. For PROC REG, you have to manually create the coded variables and the product terms in order to include categorical predictors and interaction terms. Both procedures can provide the same output in the end, so which one you use for multiple linear regression is a personal choice. Let's fit this model in both.

In [4]:
PROC REG DATA=Survey.hospital;
MODEL InfctRsk = Stay Age Xray;
RUN;
Out[4]:
The REG Procedure
Model: MODEL1
Dependent Variable: InfctRsk

Number of Observations Read 113
Number of Observations Used 113

Analysis of Variance

Source DF Sum of Squares Mean Square F Value Pr > F
Model 3 73.09897 24.36632 20.70 <.0001
Error 109 128.28085 1.17689
Corrected Total 112 201.37982

Fit Statistics

Root MSE 1.08484 R-Square 0.3630
Dependent Mean 4.35487 Adj R-Sq 0.3455
Coeff Var 24.91109

Parameter Estimates

Variable DF Parameter Estimate Standard Error t Value Pr > |t|
Intercept 1 1.00116 1.31472 0.76 0.4480
Stay 1 0.30818 0.05940 5.19 <.0001
Age 1 -0.02301 0.02352 -0.98 0.3301
Xray 1 0.01966 0.00576 3.41 0.0009

Diagnostic Plots

[Figure: Panel of fit diagnostics for InfctRsk.]
[Figure: Panel of scatterplots of residuals by regressors for InfctRsk.]

In [6]:
PROC GLM DATA=Survey.hospital;
MODEL InfctRsk = Stay Age Xray / solution;
RUN;
Out[6]:
The GLM Procedure

Number of Observations Read 113
Number of Observations Used 113

Dependent Variable: InfctRsk

Overall ANOVA

Source DF Sum of Squares Mean Square F Value Pr > F
Model 3 73.0989689 24.3663230 20.70 <.0001
Error 109 128.2808541 1.1768886    
Corrected Total 112 201.3798230      

Fit Statistics

R-Square Coeff Var Root MSE InfctRsk Mean
0.362991 24.91109 1.084845 4.354867

Type I Model ANOVA

Source DF Type I SS Mean Square F Value Pr > F
Stay 1 57.30510979 57.30510979 48.69 <.0001
Age 1 2.07505963 2.07505963 1.76 0.1870
Xray 1 13.71879952 13.71879952 11.66 0.0009

Type III Model ANOVA

Source DF Type III SS Mean Square F Value Pr > F
Stay 1 31.68381849 31.68381849 26.92 <.0001
Age 1 1.12633960 1.12633960 0.96 0.3301
Xray 1 13.71879952 13.71879952 11.66 0.0009

Solution

Parameter Estimate Standard Error t Value Pr > |t|
Intercept 1.001161692 1.31472381 0.76 0.4480
Stay 0.308180904 0.05939565 5.19 <.0001
Age -0.023005220 0.02351578 -0.98 0.3301
Xray 0.019660929 0.00575856 3.41 0.0009

Note that PROC REG automatically provides diagnostic plots for regression analysis. PROC GLM does not display them in its default output, but you can extract the residuals and fitted values using an OUTPUT statement and create the plots yourself with PROC SGPLOT. The solution option in PROC GLM adds the table of regression estimates for the slopes and intercept along with their t-tests. Both outputs automatically include the ANOVA table and the overall F-test.
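
The OUTPUT-statement approach described above can be sketched as follows. The dataset name glm_diag and the variable names fitted and resid are hypothetical names chosen for this illustration; the PROC UNIVARIATE step is one possible way to get the histogram and QQ plot of the residuals, not the only one.

```sas
/* Save the fitted values and residuals from PROC GLM */
PROC GLM DATA=Survey.hospital;
MODEL InfctRsk = Stay Age Xray / solution;
OUTPUT OUT=glm_diag PREDICTED=fitted RESIDUAL=resid;
RUN;
QUIT;

/* Residuals vs fitted values, with a reference line at zero */
PROC SGPLOT DATA=glm_diag;
SCATTER X=fitted Y=resid;
REFLINE 0 / AXIS=Y;
RUN;

/* Histogram and normal QQ plot of the residuals */
PROC UNIVARIATE DATA=glm_diag;
VAR resid;
HISTOGRAM resid / NORMAL;
QQPLOT resid / NORMAL(MU=EST SIGMA=EST);
RUN;
```

A residual plot with no visible pattern around the zero line, together with a roughly straight QQ plot, is what the model assumptions would lead us to expect.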

The MODEL statement specifies the formula for the mean response of the regression model in terms of the predictors. It is important to note that this model formula needs to be written as Y = X1 X2 ... XP; (the response on the left and the predictors on the right). Note that there are spaces between the predictors.

Adding an interaction term works differently in PROC REG and PROC GLM. In PROC GLM it is straightforward.

In [7]:
PROC GLM DATA=Survey.hospital;
MODEL InfctRsk = Stay Age Xray Stay*Age;
RUN;

/* 
Equivalently you can write

PROC GLM DATA=Survey.hospital;
MODEL InfctRsk = Xray Stay|Age;
RUN;
*/
Out[7]:
The GLM Procedure

Number of Observations Read 113
Number of Observations Used 113

Dependent Variable: InfctRsk

Overall ANOVA

Source DF Sum of Squares Mean Square F Value Pr > F
Model 4 73.5200594 18.3800149 15.53 <.0001
Error 108 127.8597636 1.1838867    
Corrected Total 112 201.3798230      

Fit Statistics

R-Square Coeff Var Root MSE InfctRsk Mean
0.365082 24.98505 1.088066 4.354867

Type I Model ANOVA

Source DF Type I SS Mean Square F Value Pr > F
Stay 1 57.30510979 57.30510979 48.40 <.0001
Age 1 2.07505963 2.07505963 1.75 0.1883
Xray 1 13.71879952 13.71879952 11.59 0.0009
Stay*Age 1 0.42109047 0.42109047 0.36 0.5522

Type III Model ANOVA

Source DF Type III SS Mean Square F Value Pr > F
Stay 1 1.39608076 1.39608076 1.18 0.2799
Age 1 0.16916918 0.16916918 0.14 0.7062
Xray 1 14.09431888 14.09431888 11.91 0.0008
Stay*Age 1 0.42109047 0.42109047 0.36 0.5522

Solution

Parameter Estimate Standard Error t Value Pr > |t|
Intercept -2.650215630 6.26282439 -0.42 0.6730
Stay 0.679876723 0.62608020 1.09 0.2799
Age 0.042409677 0.11219136 0.38 0.7062
Xray 0.020064525 0.00581516 3.45 0.0008
Stay*Age -0.006696426 0.01122821 -0.60 0.5522

For PROC REG, you have to manually calculate this product term in a DATA step before adding it to the model.

In [2]:
DATA hospital_temp;
SET Survey.hospital;
StayAge = Stay*Age;
RUN;

PROC REG DATA=hospital_temp;
MODEL InfctRsk = Stay Age Xray StayAge;
RUN;
Out[2]:
The REG Procedure
Model: MODEL1
Dependent Variable: InfctRsk

Number of Observations Read 113
Number of Observations Used 113

Analysis of Variance

Source DF Sum of Squares Mean Square F Value Pr > F
Model 4 73.52006 18.38001 15.53 <.0001
Error 108 127.85976 1.18389
Corrected Total 112 201.37982

Fit Statistics

Root MSE 1.08807 R-Square 0.3651
Dependent Mean 4.35487 Adj R-Sq 0.3416
Coeff Var 24.98505

Parameter Estimates

Variable DF Parameter Estimate Standard Error t Value Pr > |t|
Intercept 1 -2.65022 6.26282 -0.42 0.6730
Stay 1 0.67988 0.62608 1.09 0.2799
Age 1 0.04241 0.11219 0.38 0.7062
Xray 1 0.02006 0.00582 3.45 0.0008
StayAge 1 -0.00670 0.01123 -0.60 0.5522

Diagnostic Plots

[Figure: Panel of fit diagnostics for InfctRsk.]
[Figure: Panel of scatterplots of residuals by regressors for InfctRsk.]

To obtain confidence intervals for the regression coefficients, add the CLB option to the MODEL statement in PROC REG.

In [9]:
PROC REG DATA=Survey.hospital;
MODEL InfctRsk = Stay Age Xray / CLB;
RUN;
Out[9]:
The REG Procedure
Model: MODEL1
Dependent Variable: InfctRsk

Number of Observations Read 113
Number of Observations Used 113

Analysis of Variance

Source DF Sum of Squares Mean Square F Value Pr > F
Model 3 73.09897 24.36632 20.70 <.0001
Error 109 128.28085 1.17689
Corrected Total 112 201.37982

Fit Statistics

Root MSE 1.08484 R-Square 0.3630
Dependent Mean 4.35487 Adj R-Sq 0.3455
Coeff Var 24.91109

Parameter Estimates

Variable DF Parameter Estimate Standard Error t Value Pr > |t| 95% Confidence Limits
Intercept 1 1.00116 1.31472 0.76 0.4480 -1.60458 3.60690
Stay 1 0.30818 0.05940 5.19 <.0001 0.19046 0.42590
Age 1 -0.02301 0.02352 -0.98 0.3301 -0.06961 0.02360
Xray 1 0.01966 0.00576 3.41 0.0009 0.00825 0.03107

Diagnostic Plots

[Figure: Panel of fit diagnostics for InfctRsk.]
[Figure: Panel of scatterplots of residuals by regressors for InfctRsk.]
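
As a check on the CLB output, each interval has the usual form of estimate plus or minus a t critical value times the standard error, using the 109 error degrees of freedom from the ANOVA table. A sketch for the Stay slope, taking $t_{0.975,\,109}\approx 1.982$ (an approximate table value assumed here, not part of the SAS output):

$$\hat{\beta}_1 \pm t_{0.975,\,109}\,SE(\hat{\beta}_1) = 0.30818 \pm 1.982\,(0.05940) \approx 0.30818 \pm 0.11773 = (0.19045,\ 0.42591)$$

This agrees with the reported limits (0.19046, 0.42590) up to rounding of the inputs.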

If we want to perform the general linear test, we can use the TEST statement in PROC REG. PROC GLM also has a TEST statement, but it is slightly more complicated to use. For example, to test whether $\beta_{age}=\beta_{stay}=0$, we could do the following.

In [11]:
PROC REG DATA=Survey.hospital;
MODEL InfctRsk = Stay Age Xray;
TEST AGE=0, STAY=0;
RUN;
Out[11]:
The REG Procedure
Model: MODEL1
Dependent Variable: InfctRsk

Number of Observations Read 113
Number of Observations Used 113

Analysis of Variance

Source DF Sum of Squares Mean Square F Value Pr > F
Model 3 73.09897 24.36632 20.70 <.0001
Error 109 128.28085 1.17689
Corrected Total 112 201.37982

Fit Statistics

Root MSE 1.08484 R-Square 0.3630
Dependent Mean 4.35487 Adj R-Sq 0.3455
Coeff Var 24.91109

Parameter Estimates

Variable DF Parameter Estimate Standard Error t Value Pr > |t|
Intercept 1 1.00116 1.31472 0.76 0.4480
Stay 1 0.30818 0.05940 5.19 <.0001
Age 1 -0.02301 0.02352 -0.98 0.3301
Xray 1 0.01966 0.00576 3.41 0.0009

Diagnostic Plots

[Figure: Panel of fit diagnostics for InfctRsk.]
[Figure: Panel of scatterplots of residuals by regressors for InfctRsk.]

Test 1 Results for Dependent Variable InfctRsk

Source DF Mean Square F Value Pr > F
Numerator 2 15.85127 13.47 <.0001
Denominator 109 1.17689

The table with the F-test appears at the bottom, with a p-value of $<0.0001$.
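
The F statistic in this table is the general linear test statistic comparing the reduced model (Xray only) with the full model. A sketch of the arithmetic, where $SSE(R)$ and $SSE(F)$ denote the error sums of squares of the reduced and full models (shorthand assumed here, not SAS output):

$$F=\frac{[SSE(R)-SSE(F)]/(df_R-df_F)}{SSE(F)/df_F}=\frac{15.85127}{1.17689}\approx 13.47$$

The numerator is the mean square on 2 degrees of freedom (one for each slope set to zero), and the denominator is the full-model mean squared error on 109 degrees of freedom, the same value that appears in the ANOVA table.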

To make new predictions and obtain prediction intervals or confidence intervals for the mean, we will need to use the OUTPUT statement in PROC REG. First, we need to add a row to our dataset that has the values of the predictors we want to use for the prediction, while leaving the response value blank in this row.

  • The P output provides the predicted value.
  • The LCL and UCL outputs provide the prediction intervals (lower and upper bounds, respectively).
  • The LCLM and UCLM outputs provide the confidence intervals for the mean (lower and upper bounds, respectively).

Let's obtain the predicted value, confidence interval for the mean, and prediction interval for a hospital with Age = 55, Stay = 10, Xray = 89.

In [3]:
DATA temp;
INPUT InfctRsk Age Stay Xray;
DATALINES;
. 55 10 89
;
RUN;

DATA hospital_pred;
SET Survey.hospital temp;
RUN;

PROC PRINT DATA=hospital_pred(FIRSTOBS=113 OBS=114);
VAR InfctRsk Age Stay Xray;
RUN;
Out[3]:
The PRINT Procedure

Data Set WORK.HOSPITAL_PRED

Obs InfctRsk Age Stay Xray
113 3.1 59.5 9.41 91.7
114 . 55 10 89
In [4]:
ODS SELECT NONE; /* To suppress the PROC REG output */
PROC REG DATA=hospital_pred;
MODEL InfctRsk = Stay Age Xray;
/* lcl/ucl are the prediction-interval bounds; lclm/uclm are the bounds for the mean */
OUTPUT OUT=pred(where=(InfctRsk=.)) p=predicted lcl=LCL_Pred ucl=UCL_Pred 
LCLM=LCLM_Pred UCLM=UCLM_Pred;
RUN;
ODS SELECT ALL;

PROC PRINT DATA=pred;
VAR Age Stay Xray predicted LCLM_Pred UCLM_Pred LCL_Pred UCL_Pred;
RUN;
Out[4]:
The PRINT Procedure

Data Set WORK.PRED

Obs Age Stay Xray predicted LCLM_Pred UCLM_Pred LCL_Pred UCL_Pred
1 55 10 89 4.56751 4.33577 4.79924 2.40493 6.73009
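
As a hand check, the predicted value can be reproduced by plugging the new predictor values into the fitted equation, using the rounded estimates from the Parameter Estimates table shown earlier:

$$\hat{y}=1.00116+0.30818(10)-0.02301(55)+0.01966(89)\approx 4.567$$

This matches the value 4.56751 in the predicted column up to rounding of the coefficients. Note also that the interval for the mean response, (4.33577, 4.79924), is much narrower than the prediction interval, (2.40493, 6.73009), because the prediction interval also accounts for the variability of an individual hospital around the mean.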

Exercises

Now it's your turn. Use the skin cancer dataset (csv) to obtain the following output.

  1. Fit the linear regression model
$$MORT_i=\beta_0+\beta_1 Lat_i+\beta_2 Long_i+\beta_{12} Lat_i Long_i+\varepsilon_i$$
  2. Obtain the confidence intervals for the regression parameters (the $\beta$'s).
  3. Create a plot of the residuals vs the fitted values.
  4. Create a histogram and QQ plot of the residuals.
  5. Calculate the confidence interval for the mean mortality rate and the prediction interval when Lat = 33 and Long = 86.
  6. Fit the reduced regression model without the interaction term.

Solutions

In [ ]: