Number of Observations
Number of Observations Read | 32 |
---|---|
Number of Observations Used | 32 |
The linear model has the following general form
\[Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + \varepsilon_i\]
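The linear model makes the following assumptions: the mean of $Y$ is a linear function of the predictors, the errors $\varepsilon_i$ are independent, the errors are normally distributed with mean zero, and the errors have constant variance $\sigma^2$.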
Let's examine the relationship between systolic blood pressure and the Quetelet index (BMI) by fitting the simple linear regression model \[y_i = \beta_0 + \beta_1 Quet_i + \varepsilon_i.\]
LIBNAME mreg "H:\BiostatCourses\PublicHealthComputing\Lectures\Week9MultipleReg\SAS";
PROC REG DATA=mreg.sbp_quet;
MODEL SBP = quet / CLM CLI CLB;
OUTPUT OUT = diag r = residuals;
RUN;
QUIT;
/* Test for normality of the residuals */
PROC UNIVARIATE DATA=diag normal;
var residuals;
RUN;
Let's examine the regression assumptions for this example.
Both the ANOVA F-test and the t-test for the slope (the two are equivalent in simple linear regression) show that the relationship is significant, with an estimated slope of $\hat{\beta}_{quet}=2.15$. This means that for each 1-unit increase in the Quetelet index (BMI), the mean systolic blood pressure is estimated to increase by 2.15 mmHg. The CLB option in the MODEL statement provides 95% confidence intervals for the intercept and the regression coefficients. The 95% confidence interval for $\beta_{quet}$ is (1.43, 2.87). The CLM option provides confidence intervals for the mean response $E(Y|X)$, and the CLI option provides prediction intervals for an individual response.
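For reference, the CLB interval for the slope has the usual t-based form. The sketch below assumes the fit used the $n = 32$ observations reported above, so the error degrees of freedom are $n - 2 = 30$ and $t_{0.975,\,30} \approx 2.042$. The standard error shown is back-calculated from the rounded interval, so treat it as approximate rather than as a value copied from the SAS output.

\[
\hat{\beta}_{quet} \pm t_{0.975,\,n-2}\,\widehat{SE}(\hat{\beta}_{quet}),
\qquad
\widehat{SE}(\hat{\beta}_{quet}) \approx \frac{2.87 - 1.43}{2 \times 2.042} \approx 0.35
\]

As a check, $2.15 \pm 2.042 \times 0.35$ gives approximately $(1.43, 2.87)$, which matches the reported interval.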
Note that in this case, $R^2=0.5506$, so 55.06% of the variation in systolic blood pressure is explained by the linear regression on the Quetelet index (BMI). Maybe if we control for some other covariates, we can develop a better model. Let's add in age and see if the model fits better.
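When comparing this fit with a larger model, keep in mind that $R^2$ can never decrease when a predictor is added, so the adjusted $R^2$ (printed by PROC REG as "Adj R-Sq") is the fairer yardstick. As a sketch, assuming the $n = 32$ observations reported above and $p = 1$ predictor:

\[
R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
= 1 - (1 - 0.5506)\,\frac{31}{30}
\approx 0.536
\]

A second predictor is only worth keeping on this criterion if it raises the adjusted value above this baseline.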
PROC SGSCATTER DATA=mreg.sbp_quet;
matrix sbp age quet/ diagonal = (histogram kernel);
RUN;
PROC REG DATA=mreg.sbp_quet;
MODEL SBP = quet age;
RUN;
QUIT;
The new regression equation is \[\widehat{SBP} = 62.15 + 0.98\,Quet + 1.05\,Age\]
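As a quick hand check of how the fitted equation is used, here is the predicted mean SBP at, for example, Quet = 20 and Age = 25. This uses the rounded coefficients printed above, so software that works with the unrounded estimates may give a slightly different value.

\[
\widehat{SBP} = 62.15 + 0.98(20) + 1.05(25) = 62.15 + 19.60 + 26.25 = 108.00
\]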
If we want to evaluate the regression equation at a particular point, we can use the ESTIMATE statement in PROC GLM.
PROC GLM DATA = mreg.sbp_quet;
MODEL SBP = quet age / solution;
ESTIMATE 'Quet = 20 Age = 25' intercept 1 quet 20 age 25;
RUN;
QUIT;
For the next model, let's consider a categorical predictor, smoking (0 = no, 1 = yes).
PROC REG DATA=mreg.sbp_quet;
MODEL SBP = quet smk;
RUN;
QUIT;
PROC GLM DATA=mreg.sbp_quet;
CLASS smk (ref = "0");
MODEL SBP = quet smk / solution;
RUN;
QUIT;
With this categorical predictor, we get two regression equations: one for smokers and one for non-smokers.
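To make the two equations explicit in symbols (the fitted values come from the SOLUTION output and are not reproduced here), write the model as $SBP = \beta_0 + \beta_1 Quet + \beta_2 smk + \varepsilon$, where the $\beta$ labels are notation assumed for this sketch. Setting $smk$ to 0 and then to 1 gives:

\[
\begin{aligned}
\text{Non-smokers } (smk = 0):&\quad \widehat{SBP} = \hat{\beta}_0 + \hat{\beta}_1\,Quet \\
\text{Smokers } (smk = 1):&\quad \widehat{SBP} = (\hat{\beta}_0 + \hat{\beta}_2) + \hat{\beta}_1\,Quet
\end{aligned}
\]

Here $\hat{\beta}_2$ is the estimated difference in mean SBP between smokers and non-smokers at any fixed value of the Quetelet index.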
Note that in this model, the slopes for the two equations are forced to be the same. If we want to allow the effect of the Quetelet index on SBP to differ between smokers and non-smokers, then we need to include an interaction term.
PROC GLM DATA=mreg.sbp_quet;
CLASS smk (ref = "0");
MODEL SBP = quet|smk / solution;
RUN;
QUIT;
Now we have the following two regression equations for smokers and non-smokers
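In symbols, and again using coefficient labels assumed for this sketch rather than fitted values, the interaction model is $SBP = \beta_0 + \beta_1 Quet + \beta_2 smk + \beta_3 (Quet \times smk) + \varepsilon$, which gives:

\[
\begin{aligned}
\text{Non-smokers } (smk = 0):&\quad \widehat{SBP} = \hat{\beta}_0 + \hat{\beta}_1\,Quet \\
\text{Smokers } (smk = 1):&\quad \widehat{SBP} = (\hat{\beta}_0 + \hat{\beta}_2) + (\hat{\beta}_1 + \hat{\beta}_3)\,Quet
\end{aligned}
\]

So $\hat{\beta}_3$ is the estimated difference in slopes, and a test of $\beta_3 = 0$ asks whether the effect of the Quetelet index on SBP differs by smoking status.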
/* Forward Selection */
PROC REG DATA=mreg.sbp_quet;
model sbp = age smk quet / selection = f sle = 0.05;
RUN;
QUIT;
/* Backward Selection */
PROC REG DATA=mreg.sbp_quet;
model sbp = age smk quet / selection = b sls = 0.05;
RUN;
QUIT;
/* Stepwise Selection */
PROC REG DATA=mreg.sbp_quet;
model sbp = age smk quet / selection = stepwise sls =0.05 sle = 0.05;
RUN;
QUIT;
/* All-subsets selection */
/* Ranks models by Mallows' C_p: prefer a small C_p that is close to the number of parameters */
PROC REG DATA=mreg.sbp_quet;
model sbp = age smk quet / selection = cp BEST = 8;
RUN;
QUIT;