Multiple Regression¶

The linear model has the following general form:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + \varepsilon_i$$

$Y_i$: the response for the ith subject
$X_{ij}$: the jth covariate (predictor) for the ith subject
$\beta_0$: the intercept
$\beta_j, j=1,\ldots,p$: the jth regression coefficient
$\varepsilon_i$: the error term for the ith subject

The linear model has the following assumptions:

Independence: Observations are not related and do not influence each other.
Linearity: There is a true underlying linear relationship between the mean of the response and the predictors.
Normality: At any given value of the predictors, the response variable is normally distributed. (Equivalently, the error term is normally distributed.)
Homoscedasticity: The variance of the response is constant for all values of the predictors.
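
The independence, normality, and homoscedasticity assumptions can be stated compactly in terms of the error term. The symbol $\sigma^2$ for the common error variance is notation introduced here for illustration; it does not appear in the original notes.

```latex
\[
\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)
\qquad\Longleftrightarrow\qquad
Y_i \mid X \;\sim\; N\!\big(E(Y_i \mid X),\ \sigma^2\big), \text{ independently for each subject } i
\]
```

Linearity is then the statement that $E(Y_i \mid X)$ is a linear function of the predictors.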

Let's examine the relationship between systolic blood pressure (SBP) and the Quetelet index (BMI) by fitting the simple linear regression model

$$Y_i = \beta_0 + \beta_1 Quet_i + \varepsilon_i,$$

where $Y_i$ is the systolic blood pressure of the ith subject.

LIBNAME mreg "H:\BiostatCourses\PublicHealthComputing\Lectures\Week9MultipleReg\SAS";

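PROC REG DATA=mreg.sbp_quet;
/* CLM: CI for the mean response; CLI: prediction intervals; CLB: CI for the coefficients */
MODEL SBP = quet / CLM CLI CLB;
/* Save the residuals in the data set diag for the normality check */
OUTPUT OUT = diag r = residuals;
RUN;
QUIT;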

/* Test for normality of the residuals */
PROC UNIVARIATE DATA=diag normal;
var residuals;
RUN;

Let's examine the regression assumptions for this example.

Independence: In this case, this assumption cannot be verified since we do not know how the subjects were sampled.
Linearity: Looking at the scatterplot given in the fit plot, we can see that the relationship between systolic blood pressure and BMI does look linear.
Normality: The QQ-plot and histogram of residuals show no major deviations from normality. Furthermore, the Shapiro-Wilk and Kolmogorov-Smirnov tests do not show any evidence of deviation from normality, with p-values of 0.5011 and >0.15, respectively.
Homoscedasticity: The plots of residuals vs. predicted values and the original scatter plot show no evidence of non-constant variance (usually seen as "fanning out").

The ANOVA F-test and the t-test (which are equivalent in the simple linear regression case) show that the relationship is significant, with an estimated slope of $\hat{\beta}_{quet}=2.15$. This means that for each 1 unit increase in BMI (Quetelet index), the mean systolic blood pressure is estimated to increase by 2.15 mmHg. The CLB option in the MODEL statement provides 95% confidence intervals for the intercept and the regression coefficients. The 95% confidence interval for $\beta_{quet}$ is (1.43, 2.87). The CLM option provides confidence intervals for the mean response $E(Y|X)$, and the CLI option provides prediction intervals for an individual response.
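
As a hand check against the PROC REG parameter estimates (slope 2.14917, standard error 0.35451, t = 6.06, F = 36.75, 30 error degrees of freedom), the sketch below reproduces the confidence interval and the equivalence of the F-test and t-test. The critical value $t_{0.975,30} \approx 2.042$ is taken from a standard t table; it is an assumption in the sense that it is not printed in the SAS output.

```latex
\begin{align*}
\hat{\beta}_{quet} \pm t_{0.975,\,30}\,\mathrm{SE}(\hat{\beta}_{quet})
  &\approx 2.14917 \pm 2.042 \times 0.35451 \\
  &\approx 2.149 \pm 0.724 \;=\; (1.43,\ 2.87) \\[4pt]
t^2 &= 6.06^2 \approx 36.7 \;\approx\; F = 36.75
\end{align*}
```

The small gap between 36.7 and 36.75 comes from rounding the t statistic to two decimals.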

Note that in this case, $R^2=0.5506$, so 55.06% of the variation in systolic blood pressure is explained by the linear regression on the Quetelet index (BMI). Maybe if we control for some other covariates, we can develop a better model. Let's add in age and see if the model fits better.
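
The $R^2$ value can be recovered from the sums of squares in the Analysis of Variance table for this model (model SS 3537.94574, error SS 2888.02301, corrected total SS 6425.96875). A short worked check:

```latex
\[
R^2 \;=\; \frac{SS_{model}}{SS_{total}}
    \;=\; \frac{3537.94574}{6425.96875}
    \;=\; 1 - \frac{2888.02301}{6425.96875}
    \;\approx\; 0.5506
\]
```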

PROC SGSCATTER DATA=mreg.sbp_quet;
matrix sbp age quet/ diagonal = (histogram kernel);
RUN;

PROC REG DATA=mreg.sbp_quet;
MODEL SBP = quet age;
RUN;
QUIT;

The new regression equation is

$$\widehat{SBP} = 62.15 + 0.98\,Quet + 1.05\,Age$$

Note that the Quet regression coefficient decreased (from 2.15 to 0.98) with the inclusion of age.
As always, $R^2$ increased when adding a new variable (from 0.5506 to 0.6412), but the adjusted $R^2$ also increased (from 0.5356 to 0.6165), indicating that age has at least slightly improved the model. (The adjusted $R^2$ has a penalty term for the number of predictors.)
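
To make the penalty concrete, the usual adjusted $R^2$ formula is sketched below with $n = 32$ observations and $p$ predictors (not counting the intercept). The $R^2$ values plugged in are the ones reported by PROC REG for the two models.

```latex
\begin{align*}
R^2_{adj} &= 1 - (1 - R^2)\,\frac{n-1}{n-p-1} \\[4pt]
\text{Quet only } (p = 1):\qquad R^2_{adj} &= 1 - (1 - 0.5506)\,\frac{31}{30} \approx 0.5356 \\
\text{Quet and Age } (p = 2):\qquad R^2_{adj} &= 1 - (1 - 0.6412)\,\frac{31}{29} \approx 0.6165
\end{align*}
```

Both values agree with the Adj R-Sq entries in the SAS output.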

If we want to evaluate the regression equation at a particular point, we could use the ESTIMATE statement in PROC GLM. After that, for the next model, we will consider a categorical predictor, smoking (0 = no, 1 = yes).

PROC GLM DATA = mreg.sbp_quet;
MODEL SBP = quet age / solution;
ESTIMATE 'Quet = 20 Age = 25' intercept 1 quet 20 age 25;
RUN;
QUIT;
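
The ESTIMATE statement above requests the linear combination $\beta_0 + 20\,\beta_{quet} + 25\,\beta_{age}$. The value SAS prints for it is not reproduced in these notes, so the figure below is a hand calculation from the reported parameter estimates (62.14895, 0.97507, 1.04516), not SAS output.

```latex
\begin{align*}
\widehat{SBP}(Quet = 20,\ Age = 25)
  &= 62.14895 + 0.97507 \times 20 + 1.04516 \times 25 \\
  &= 62.14895 + 19.50146 + 26.12893 \\
  &\approx 107.78 \text{ mmHg}
\end{align*}
```

Unlike this hand calculation, the ESTIMATE statement also reports a standard error and a t-test for the estimate.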

PROC REG DATA=mreg.sbp_quet;
MODEL SBP = quet smk;
RUN;
QUIT;

PROC GLM DATA=mreg.sbp_quet;
CLASS smk (ref = "0");
MODEL SBP = quet smk / solution;
RUN;
QUIT;

With this categorical predictor, we get two regression equations: one for smokers and one for non-smokers.

$\widehat{SBP} = 79.36 + 2.21*Quet$ (non-smokers)
$\widehat{SBP} = 87.93 + 2.21*Quet$ (smokers)
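
These two equations come from a single fitted model with an indicator for smoking. Using the PROC GLM solution (intercept 79.36, QUET 2.21, SMK 1 = 8.57 with non-smokers as the reference level), the step from one model to two lines can be written out as:

```latex
\begin{align*}
\widehat{SBP} &= 79.36 + 2.21\,Quet + 8.57\,SMK \\[4pt]
SMK = 0:\quad \widehat{SBP} &= 79.36 + 2.21\,Quet \\
SMK = 1:\quad \widehat{SBP} &= (79.36 + 8.57) + 2.21\,Quet = 87.93 + 2.21\,Quet
\end{align*}
```

In other words, at any fixed Quetelet index the estimated mean SBP for smokers is 8.57 mmHg higher than for non-smokers (p = 0.0113 in the output).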

Note that in this model, the slopes for the two equations are forced to be the same. If we want to allow the effect of Quetelet score on SBP to differ between smokers and non-smokers, then we will need to include an interaction term.

PROC GLM DATA=mreg.sbp_quet;
CLASS smk (ref = "0");
MODEL SBP = quet|smk / solution;
RUN;
QUIT;

Now we have the following two regression equations for smokers and non-smokers:

$\widehat{SBP} = 67.72 + 2.63*Quet$ (non-smokers)
$\widehat{SBP} = 93.34 + 2.01*Quet$ (smokers)
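
With the interaction term, both the intercept and the slope shift for smokers. A worked sketch using the PROC GLM solution for the interaction model (intercept 67.7237, QUET 2.6303, SMK 1 = 25.6142, QUET*SMK 1 = -0.6185):

```latex
\begin{align*}
\widehat{SBP} &= 67.7237 + 2.6303\,Quet + 25.6142\,SMK - 0.6185\,(Quet \times SMK) \\[4pt]
SMK = 0:\quad \widehat{SBP} &= 67.7237 + 2.6303\,Quet \\
SMK = 1:\quad \widehat{SBP} &= (67.7237 + 25.6142) + (2.6303 - 0.6185)\,Quet \\
                            &\approx 93.338 + 2.012\,Quet
\end{align*}
```

The interaction coefficient has p = 0.3799 in the output, so these data give no evidence that the slopes actually differ between smokers and non-smokers.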

Model Selection¶

Ideally, there is a physiologic model for how the process works, so you can just specify those terms in the model and be done.
Other times, usually in the exploratory phase, we have many possible models.
- We may want a parsimonious model (few covariates) with good predictions.
- We may want to determine which predictors explain independent proportions of the variability in the outcome.
- It depends on your goal: Are you interested in prediction? Are you interested in identifying a group of important predictors?
We will look at a few common variable selection techniques available in SAS:
- Forward selection
- Backward selection
- Stepwise regression
- All subsets

/* Forward Selection: sle is the significance level a variable needs to enter the model */
PROC REG DATA=mreg.sbp_quet;
model sbp = age smk quet / selection = f sle = 0.05;
RUN;
QUIT;

/* Backward Selection: sls is the significance level a variable needs to stay in the model */
PROC REG DATA=mreg.sbp_quet;
model sbp = age smk quet / selection = b sls = 0.05;
RUN;
QUIT;
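
As an illustration of how a backward step is decided, the sketch below reconstructs the partial F-test for QUET in the full model (AGE, SMK, QUET) from the reported output: Type II SS for QUET of 200.14147, full-model $R^2 = 0.7609$, and corrected total SS of 6425.96875. The full-model mean squared error is not shown in these notes, so it is derived here from $R^2$ and is approximate.

```latex
\begin{align*}
MSE_{full} &= \frac{(1 - 0.7609) \times 6425.96875}{32 - 4} \approx \frac{1536.45}{28} \approx 54.87 \\[4pt]
F_{QUET} &= \frac{SS_{QUET \mid AGE,\,SMK}}{MSE_{full}} = \frac{200.14147}{54.87} \approx 3.65
\end{align*}
```

The corresponding p-value in the output is 0.0664, which exceeds sls = 0.05, so QUET is removed and the procedure stops with AGE and SMK in the model.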

/* Stepwise Selection: uses both sle (entry) and sls (stay) */
PROC REG DATA=mreg.sbp_quet;
model sbp = age smk quet / selection = stepwise sls = 0.05 sle = 0.05;
RUN;
QUIT;

/* All subsets selection */
/* This uses Mallows' C_p: prefer models with a small C_p that is close to the number of parameters */
PROC REG DATA=mreg.sbp_quet;
model sbp = age smk quet / selection = cp BEST = 8;
RUN;
QUIT;
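
For reference, Mallows' $C_p$ for a candidate model with $p$ parameters (including the intercept) compares its error sum of squares with the mean squared error of the full model. The sketch below rewrites it in terms of $R^2$ so that it can be checked against the reported output ($n = 32$, full model with 4 parameters and $R^2 = 0.7609$). The rewritten form is algebraically equivalent but is a convenience chosen here, not something SAS prints.

```latex
\begin{align*}
C_p &= \frac{SSE_p}{MSE_{full}} - (n - 2p)
     \;=\; (n - p_{full})\,\frac{1 - R^2_p}{1 - R^2_{full}} - (n - 2p) \\[4pt]
\text{AGE, SMK } (p = 3):\quad C_p &\approx 28 \times \frac{0.2702}{0.2391} - 26 \approx 5.64 \\
\text{QUET only } (p = 2):\quad C_p &\approx 28 \times \frac{0.4494}{0.2391} - 28 \approx 24.63 \\
\text{Full model } (p = 4):\quad C_p &= 28 \times 1 - 24 = 4
\end{align*}
```

The reported values are 5.6481, 24.6414, and 4.0000; the small differences come from using $R^2$ rounded to four decimals. The full model always has $C_p$ equal to its number of parameters.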

Number of Observations Read	32
Number of Observations Used	32

Analysis of Variance
Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	1	3537.94574	3537.94574	36.75	<.0001
Error	30	2888.02301	96.26743
Corrected Total	31	6425.96875

Root MSE	9.81160	R-Square	0.5506
Dependent Mean	144.53125	Adj R-Sq	0.5356
Coeff Var	6.78856

Parameter Estimates
Variable	Label	DF	Parameter Estimate	Standard Error	t Value	Pr > \|t\|	95% Confidence Limits
Intercept	Intercept	1	85.62057	9.87116	8.67	<.0001	65.46098	105.78016
QUET	QUET	1	2.14917	0.35451	6.06	<.0001	1.42515	2.87318

Output Statistics
Obs	Dependent Variable	Predicted Value	Std Error Mean Predict	95% CL Mean		95% CL Predict		Residual
1	135	132.3864	2.6499	126.9747	137.7982	111.6306	153.1423	2.6136
2	122	140.4458	1.8608	136.6456	144.2460	120.0507	160.8409	-18.4458
3	130	137.2006	2.1144	132.8824	141.5187	116.7026	157.6985	-7.2006
4	148	151.5570	2.0860	147.2968	155.8172	131.0712	172.0428	-3.5570
5	146	134.6001	2.3858	129.7276	139.4725	113.9782	155.2219	11.3999
6	129	130.5382	2.8873	124.6416	136.4347	109.6506	151.4257	-1.5382
7	162	149.4078	1.9119	145.5032	153.3125	128.9930	169.8227	12.5922
8	160	148.2043	1.8372	144.4522	151.9565	127.8181	168.5905	11.7957
9	144	121.4687	4.1810	112.9299	130.0074	99.6873	143.2501	22.5313
10	180	170.2333	4.5807	160.8782	179.5884	148.1191	192.3475	9.7667
11	166	153.8996	2.3230	149.1553	158.6439	133.3077	174.4915	12.1004
12	138	157.2308	2.7197	151.6764	162.7852	136.4373	178.0243	-19.2308
13	152	159.0361	2.9552	153.0008	165.0714	138.1090	179.9632	-7.0361
14	138	149.5153	1.9194	145.5953	153.4353	129.0975	169.9331	-11.5153
15	140	147.1297	1.7866	143.4809	150.7785	126.7623	167.4972	-7.1297
16	134	135.0084	2.3401	130.2294	139.7875	114.4085	155.6084	-1.0084
17	145	142.7884	1.7581	139.1978	146.3790	122.4313	163.1455	2.2116
18	142	135.5672	2.2792	130.9124	140.2220	114.9957	156.1387	6.4328
19	135	138.7265	1.9812	134.6803	142.7727	118.2841	159.1689	-3.7265
20	142	143.6696	1.7403	140.1155	147.2237	123.3189	164.0203	-1.6696
21	150	148.5482	1.8567	144.7562	152.3401	128.1546	168.9418	1.4518
22	144	151.1917	2.0531	146.9986	155.3847	130.7197	171.6636	-7.1917
23	137	141.4129	1.8091	137.7182	145.1077	121.0372	161.7887	-4.4129
24	132	139.5647	1.9182	135.6471	143.4822	119.1473	159.9820	-7.5647
25	149	141.5204	1.8042	137.8358	145.2050	121.1465	161.8943	7.4796
26	132	135.4168	2.2954	130.7290	140.1046	114.8378	155.9958	-3.4168
27	120	130.5167	2.8901	124.6143	136.4190	109.6275	151.4058	-10.5167
28	126	134.1058	2.4425	129.1175	139.0940	113.4563	154.7553	-8.1058
29	161	152.2447	2.1511	147.8516	156.6379	131.7309	172.7586	8.7553
30	170	159.3800	3.0013	153.2505	165.5094	138.4255	180.3344	10.6200
31	152	155.7264	2.5335	150.5523	160.9005	135.0312	176.4216	-3.7264
32	164	156.7580	2.6601	151.3254	162.1906	135.9967	177.5193	7.2420

Sum of Residuals	0
Sum of Squared Residuals	2888.02301
Predicted Residual SS (PRESS)	3476.56899

Moments
N	32	Sum Weights	32
Mean	0	Sum Observations	0
Std Deviation	9.6520481	Variance	93.1620326
Skewness	0.13173666	Kurtosis	-0.2581232
Uncorrected SS	2888.02301	Corrected SS	2888.02301
Coeff Variation	.	Std Error Mean	1.70625717

Basic Statistical Measures
Location		Variability
Mean	0.00000	Std Deviation	9.65205
Median	-1.60386	Variance	93.16203
Mode	.	Range	41.76214
		Interquartile Range	15.27812

Tests for Location: Mu0=0
Test	Statistic		p Value
Student's t	t	0	Pr > \|t\|	1.0000
Sign	M	-2	Pr >= \|M\|	0.5966
Signed Rank	S	4	Pr >= \|S\|	0.9418

Quantiles (Definition 5)
Level	Quantile
100% Max	22.53133
99%	22.53133
95%	12.59216
90%	11.79569
75% Q3	8.11743
50% Median	-1.60386
25% Q1	-7.16069
10%	-10.51667
5%	-18.44582
1%	-19.23081
0% Min	-19.23081

Extreme Observations
Lowest		Highest
Value	Obs	Value	Obs
-19.23081	12	11.3999	5
-18.44582	2	11.7957	8
-11.51530	14	12.1004	11
-10.51667	27	12.5922	7
-8.10578	28	22.5313	9

Root MSE	8.91604	R-Square	0.6412
Dependent Mean	144.53125	Adj R-Sq	0.6165
Coeff Var	6.16893

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	2	4120.592245	2060.296123	25.92	<.0001
Error	29	2305.376505	79.495742
Corrected Total	31	6425.968750

Source	DF	Type I SS	Mean Square	F Value	Pr > F
QUET	1	3537.945739	3537.945739	44.50	<.0001
AGE	1	582.646506	582.646506	7.33	0.0113

Source	DF	Type III SS	Mean Square	F Value	Pr > F
QUET	1	258.9618700	258.9618700	3.26	0.0815
AGE	1	582.6465058	582.6465058	7.33	0.0113

Tests for Normality
Test	Statistic		p Value
Shapiro-Wilk	W	0.97006	Pr < W	0.5011
Kolmogorov-Smirnov	D	0.107078	Pr > D	>0.1500
Cramer-von Mises	W-Sq	0.072792	Pr > W-Sq	>0.2500
Anderson-Darling	A-Sq	0.432666	Pr > A-Sq	>0.2500

Class Level Information
Class	Levels	Values
SMK	2	1 0

Parameter	Estimate		Standard Error	t Value	Pr > \|t\|
Intercept	79.35695590	B	9.26429554	8.57	<.0001
QUET	2.21156035		0.32299564	6.85	<.0001
SMK 1	8.57101456	B	3.16670062	2.71	0.0113
SMK 0	0.00000000	B	.	.	.

Class Level Information
Class	Levels	Values
SMK	2	1 0

Parameter	Estimate	Standard Error	t Value	Pr > \|t\|
Intercept	62.14894871	12.47519150	4.98	<.0001
QUET	0.97507319	0.54024560	1.80	0.0815
AGE	1.04515739	0.38605667	2.71	0.0113

Root MSE	8.91647	R-Square	0.6412
Dependent Mean	144.53125	Adj R-Sq	0.6165
Coeff Var	6.16924

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	2	4120.366493	2060.183247	25.91	<.0001
Error	29	2305.602257	79.503526
Corrected Total	31	6425.968750

Source	DF	Type III SS	Mean Square	F Value	Pr > F
QUET	1	3727.268332	3727.268332	46.88	<.0001
SMK	1	582.420754	582.420754	7.33	0.0113

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	3	4184.107589	1394.702530	17.42	<.0001
Error	28	2241.861161	80.066470
Corrected Total	31	6425.968750

Source	DF	Type III SS	Mean Square	F Value	Pr > F
QUET	1	3590.846203	3590.846203	44.85	<.0001
SMK	1	140.094758	140.094758	1.75	0.1966
QUET*SMK	1	63.741095	63.741095	0.80	0.3799

Parameter	Estimate		Standard Error	t Value	Pr > \|t\|
Intercept	67.72373692	B	16.01336515	4.23	0.0002
QUET	2.63028254	B	0.57034924	4.61	<.0001
SMK 1	25.61422160	B	19.36402171	1.32	0.1966
SMK 0	0.00000000	B	.	.	.
QUET*SMK 1	-0.61847849	B	0.69317067	-0.89	0.3799
QUET*SMK 0	0.00000000	B	.	.	.

Variable	Parameter Estimate	Standard Error	Type II SS	F Value	Pr > F
Intercept	59.09163	12.81626	1817.11840	21.26	<.0001
AGE	1.60450	0.23872	3861.63038	45.18	<.0001

Variable	Parameter Estimate	Standard Error	Type II SS	F Value	Pr > F
Intercept	48.04960	11.12956	1115.95464	18.64	0.0002
AGE	1.70916	0.20176	4296.58607	71.76	<.0001
SMK	10.29439	2.76811	828.05385	13.83	0.0009

Summary of Forward Selection
Step	Variable Entered	Label	Number Vars In	Partial R-Square	Model R-Square	C(p)	F Value	Pr > F
1	AGE	AGE	1	0.6009	0.6009	18.7414	45.18	<.0001
2	SMK	SMK	2	0.1289	0.7298	5.6481	13.83	0.0009

Variable	Parameter Estimate	Standard Error	Type II SS	F Value	Pr > F
Intercept	51.11791	10.77421	1234.94960	22.51	<.0001
AGE	1.21271	0.32382	769.45920	14.03	0.0008
SMK	9.94557	2.65606	769.23345	14.02	0.0008
QUET	0.85924	0.44987	200.14147	3.65	0.0664

Summary of Backward Elimination
Step	Variable Removed	Label	Number Vars In	Partial R-Square	Model R-Square	C(p)	F Value	Pr > F
1	QUET	QUET	2	0.0311	0.7298	5.6481	3.65	0.0664

Number in Model	C(p)	R-Square	Variables in Model
3	4.0000	0.7609	AGE SMK QUET
2	5.6481	0.7298	AGE SMK
2	16.0212	0.6412	AGE QUET
2	16.0253	0.6412	SMK QUET
1	18.7414	0.6009	AGE
1	24.6414	0.5506	QUET
1	81.9640	0.0612	SMK