Data Set SURVEY.HOSPITAL
Obs | InfctRsk | Stay | Age | Xray |
---|---|---|---|---|
1 | 4.1 | 7.13 | 55.7 | 39.6 |
2 | 1.6 | 8.82 | 58.2 | 51.7 |
3 | 2.7 | 8.34 | 56.9 | 74 |
4 | 5.6 | 8.95 | 53.7 | 122.8 |
5 | 5.7 | 11.2 | 56.5 | 88.9 |
In this section, we will learn how to obtain the output discussed in the notes for linear regression. At the end of this lab, you should be able to do the following in SAS:
For this lab, we will use the hospital infection risk dataset. Recall that this dataset contains data on hospitals to assess the infection risk with the following data on 113 hosptials:
Download the dataset hospital_infct.csv by right clicking the fil and choosing Save Ling As to save the file to your computer.
First, we need to read in the data.
LIBNAME Survey "\\file.phhp.ufl.edu\home\rlp176\BiostatCourses\PHC6937SurveryBiostat\Lectures\MLR\Data";
PROC IMPORT datafile="\\file.phhp.ufl.edu\home\rlp176\BiostatCourses\PHC6937SurveryBiostat\Lectures\MLR\Data\hospital_infct.csv"
out=Survey.hospital dbms=csv replace;
getnames=Yes;
RUN;
PROC PRINT DATA=Survey.hospital(OBS=5);
VAR InfctRsk Stay Age Xray;
RUN;
Recall the regression model for this data:
$$y_i=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\beta_3x_{i3}+\varepsilon_i$$where
and the $\varepsilon_i$ are independent and have a normal distribution with mean 0 and equal variance $\sigma^2$. To fit this model in SAS, there are two options:
PROC REG provides a lot more automatic output for regression, but it does not handle categorical variables and interaction terms as easilty as PROC REG. For PROC REG, you have to manually create the coded variables and the product terms for the predictors to include categorical variables and interaction terms. Both will be able to provide the same output in the end, so which one you choose to use is a personal choice for multiple linear regression. Let's fit this model in both.
PROC REG DATA=Survey.hospital;
MODEL InfctRsk = Stay Age Xray;
RUN;
PROC GLM DATA=Survey.hospital;
MODEL InfctRsk = Stay Age Xray / solution;
RUN;
Note that PROC REG automatically provides diagnostic plots for regression analysis. These cannot be directly obtained from PROC GLM, but you can extract the residuals and fitted values using an OUTPUT statement and create the plot yourself with PROC SGPLOT. The solution option in PROC GLM adds the table of regression estimates for the slopes and intercept along with t-tests. Both outputs automatically come with the ANOVA table and F-test.
The MODEL statement specifies the formula for the mean response of the regression model in terms of the predictors. It is important to note that this model formula needs to be written as Y = X1 X2 ... XP; (the response on the left and the predictors on the right). Note that there are spaces between the predictors.
Adding an interaction term is different between PROC REG and PROC GLM. In PROC GLM it is straigtforward.
PROC GLM DATA=Survey.hospital;
MODEL InfctRsk = Stay Age Xray Stay*Age;
RUN;
/*
Equivalently you can write
PROC GLM DATA=Survey.hospital;
MODEL InfctRsk = Xray Stay|Age;
RUN;
*/
For PROC REG, have to manually calculate this product term with a DATA step to add it to the model.
DATA hospital_temp;
SET Survey.hospital;
StayAge = Stay*Age;
RUN;
PROC REG DATA=hospital_temp;
MODEL InfctRsk = Stay Age Xray StayAge;
RUN;
To obtain confidence intervals for the regression coefficients, add the CLB obtion to the MODEL statement in PROC REG.
PROC REG DATA=Survey.hospital;
MODEL InfctRsk = Stay Age Xray / CLB;
RUN;
If we want to perform the general linear test, then we can use the TEST statment in PROC REG. PROC GLM also has a TEST statement, but it is stlightly more complicated to use. For example, if we want to test if the $\beta_{age}=\beta_{stay}=0$, then we could do the following.
PROC REG DATA=Survey.hospital;
MODEL InfctRsk = Stay Age Xray;
TEST AGE=0, STAY=0;
RUN;
The table with the F-test appears at the bottom, with a p-value of $<0.0001$.
To make new predictions and obtain prediction intervals or confidence intervals for the mean, we will need to use the OUTPUT statement in PROC REG. First, we need to add a row to out dataset, that has the values of the predictors we want to use for the predictions while leaving the response value blank in this row.
Let's obtain the predicted value, confidence interval for the mean, and prediction interval for a hospital with Age = 55, Stay = 10, Xray = 89.
DATA temp;
INPUT InfctRsk Age Stay Xray;
DATALINES;
. 55 10 89
;
RUN;
DATA hospital_pred;
SET Survey.hospital temp;
RUN;
PROC PRINT DATA=hospital_pred(FIRSTOBS=113 OBS=114);
VAR InfctRsk Age Stay Xray;
RUN;
ODS SELECT NONE; /* To suppress the PROC REG output */
PROC REG DATA=hospital_pred;
MODEL InfctRsk = Stay Age Xray;
OUTPUT OUT=pred(where=(InfctRsk=.)) p=predicted lcl=UCL_Pred ucl=LCL_Pred
LCLM=LCLM_Pred UCLM=UCLM_Pred;
RUN;
ODS SELECT ALL;
PROC PRINT DATA=pred;
VAR Age Stay Xray predicted LCLM_Pred UCLM_Pred UCL_Pred LCL_Pred;
RUN;
Now it's your turn. Use the the skin cancer dataset (csv) to obtain the following output.