In this lecture, we will discuss ANOVA and simple linear regression. ANOVA (Analysis of Variance) uses sums of squares to test for differences in location between multiple groups (case CQ). This generalizes the two-sample t-test to multiple groups. Then, we will examine the basics of simple linear regression, including scatterplots, correlation, and fitting a regression line. Next week, we will continue with regression by considering models with more than one predictor.
The two-sample t-test can be used to compare means between two groups. If we have multiple groups, we could use the two-sample t-test to make all pairwise comparisons to see if any of the means differ, but this would greatly inflate the type I error rate.
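To see why, note that with $p$ groups there are $p(p-1)/2$ pairwise comparisons. If each is tested at level $\alpha=0.05$ and the tests were independent, the chance of at least one false rejection would be roughly
$$1-(1-0.05)^{p(p-1)/2}$$
which for $p=3$ groups (3 comparisons) is already about $1-0.95^3\approx 0.14$, well above the nominal 0.05.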
The ANOVA F test for equality of means controls this type I error rate. ANOVA has the following assumptions: the observations are independent, the responses within each group are normally distributed, and the group variances are equal.
ANOVA is robust to minor deviations from these assumptions.
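One quick way to check the equal-variance assumption in SAS is Levene's test, available through the HOVTEST option on the MEANS statement of PROC GLM. A minimal sketch, using the dep data set created below:
proc glm data=dep;
class type;
model dep=type;
means type / hovtest=levene; *Levene's test for equal variances;
run;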
The treatment effects model represents the jth observation in the ith group by $$y_{ij}=\mu+\alpha_i+\varepsilon_{ij}$$ where $\mu$ is the overall mean, $\alpha_i$ is the effect of group $i$, and the errors $\varepsilon_{ij}$ are independent $N(0,\sigma^2)$.
The ANOVA F test is for the hypotheses
$$H_0:\alpha_1=\alpha_2=\cdots=\alpha_p=0\quad\text{vs}\quad H_1:\text{not all }\alpha_i=0$$
Note that rejecting $H_0$ only tells you that at least one group mean is different, but not which groups differ. The test compares the sum of squares for treatment (labeled Model in SAS), which measures variability between groups, with the sum of squares for error, which measures variability within groups.
$$SST=\sum_{i=1}^p\sum_{j=1}^{n_i}(\bar{y}_{i\cdot}-\bar{y}_{\cdot\cdot})^2=\sum_{i=1}^p n_i(\bar{y}_{i\cdot}-\bar{y}_{\cdot\cdot})^2,\quad df=p-1$$
$$SSE=\sum_{i=1}^p\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_{i\cdot})^2,\quad df=N-p$$
$$F=\dfrac{SST/(p-1)}{SSE/(N-p)}$$
Example: Is there an association between depression and work type?
data dep;
*Create ID variable, used here to input ordered data;
do id=1 to 39;
if id<=8 then type="Admin";
else if id<=21 then type="Labor";
else type="Tech";
input dep @;
output;
end;
datalines;
75 73 68 109 92 82 33 135
69 161 91 80 198 194 94
126 184 141 108 175 126
35 86 202 213 82 156 170 188 37
294 92 232 238 112 87 73 168 218
;
run;
proc sgplot data=dep;
vbox dep /category=type ;
run;
To fit the model and obtain the output for the F test, we can use either PROC ANOVA (designed for balanced data, though it also handles unbalanced one-way designs) or PROC GLM.
proc ANOVA data=dep;
class type;
model dep=type;
run;
proc GLM data=dep;
class type;
model dep=type;
lsmeans type;
run;
From the output, we can see that SST = 24158.4113, SSE = 128505.8964, and the F statistic is 3.38 with a p-value of 0.045.
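As a check, the F statistic can be reproduced by hand from the sums of squares (here $p=3$ and $N=39$):
$$F=\dfrac{SST/(p-1)}{SSE/(N-p)}=\dfrac{24158.4113/2}{128505.8964/36}=\dfrac{12079.21}{3569.61}\approx 3.38$$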
Note that ANOVA is just a special case of regression with categorical predictors. This model for depression by work type is the same as the regression model with two group indicator variables.
data dep2; set dep;
VL=(type="Labor");
VT=(type="Tech");
run;
proc GLM data=dep2;
model dep=VL VT;
run;
Notice that we get the same value for the F test. In this case, the regression model is
$$y=\beta_0 + \beta_LV_L + \beta_TV_T+\varepsilon$$
and the F test is of $H_0: \beta_L=\beta_T=0$. When running the model this way, we also get estimates of the group means in the parameter estimates table.
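Specifically, the coefficients map onto the group means as
$$\mu_{\text{Admin}}=\beta_0,\qquad \mu_{\text{Labor}}=\beta_0+\beta_L,\qquad \mu_{\text{Tech}}=\beta_0+\beta_T$$
so the intercept estimates the Admin mean, and $\beta_L$ and $\beta_T$ are the differences of the Labor and Tech means from the Admin mean.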
Now that we have performed the F test and seen that the result is significant, how do we determine which groups are different? We can visually inspect the boxplot as a start. To perform tests comparing the groups, we can use contrasts. In the output above, the Type III sum of squares table gives us tests comparing Labor to Admin ($H_0:\beta_L=0$) and Tech to Admin ($H_0:\beta_T=0$).
To use contrasts to reproduce these results and to compare Labor to Tech, we can do the following.
proc GLM data=dep2;
model DEP=VL VT;
contrast 'ANOVA equivalent test' VL 1, VT 1;
contrast 'L vs A equivalent' VL 1 ;
contrast 'T vs A equivalent' VT 1;
contrast 'Labor vs Tech' VL 1 VT -1;
run;
Other common post-hoc pairwise comparison procedures are Tukey's honestly significant difference, Fisher's least significant difference, Scheffé's method, Bonferroni, and Šidák; all but Fisher's LSD control the familywise type I error rate.
proc GLM data=dep2;
class type;
model dep=type;
means type/scheffe bon sidak tukey lsd;
run;
quit;
Now, let's add in a second factor, gender.
data depgend (drop=i);
do i=1 to 39;
if i<=8 then type="Admin";
else if i<=21 then type="Labor";
else type="Tech";
input dep gend @;
output;
end;
datalines;
75 1 73 0 68 0 109 1 92 1 82 0 33 0 135 0
69 0 161 0 91 1 80 1 198 0 194 1 94 0
126 1 184 0 141 1 108 0 175 0 126 0
35 0 86 1 202 1 213 0 82 0 156 1 170 0 188 1 37 0
294 1 92 0 232 0 238 0 112 1 87 1 73 0 168 0 218 1
;
run;
proc freq data=depgend;
table gend*type /norow nocol nopct;
run;
*Box plot;
proc sgplot data=depgend;
vbox dep /category=type group=gend ;
label gend="1=Male, 0=Female";
run;
By choosing Admin and Female as our reference categories, we have the following model
$$y=\beta_{AF}+\beta_{L}V_L+\beta_TV_T+\beta_MV_M+\beta_{MT}V_{MT}+\beta_{ML}V_{ML}+\varepsilon$$
PROC GLM reports three different types of sums of squares that you can use to test the coefficients: Type I (sequential) SS depend on the order in which effects enter the model, Type II SS adjust each effect for all other effects that do not contain it, and Type III SS adjust each effect for all other effects in the model.
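Under this coding, each cell mean is a sum of coefficients; for example,
$$\mu_{\text{Admin},F}=\beta_{AF},\qquad \mu_{\text{Labor},M}=\beta_{AF}+\beta_L+\beta_M+\beta_{ML},\qquad \mu_{\text{Tech},M}=\beta_{AF}+\beta_T+\beta_M+\beta_{MT}$$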
data depgend2; set depgend;
VL=(type="Labor");
VT=(type="Tech");
VA=(type="Admin");
run;
title 'Unbalanced Two-Way Analysis of Variance';
proc glm data=depgend2;
model dep=VT VL gend gend*VT gend*VL;
run;
proc glm data=depgend2;
class type (ref="Admin") gend (ref="0");
model dep=type gend type*gend /ss1 ss2 ss3;
run;
proc glm data=depgend2; *Saturated model via the bar operator;
class type (ref="Admin") gend (ref="0");
model dep=type|gend /ss1 ss2 ss3;
run;
In this case, the interaction effect between gender and type is not significant. Since the interaction effect is not significant, we can test for main effects (note that it is not appropriate to test for main effects if the interaction is significant). Gender also appears not to be significant (see the Type I or Type II SS).
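With the interaction dropped, a natural follow-up is to refit the additive (main-effects-only) model. A minimal sketch, reusing the class coding from above:
proc glm data=depgend2;
class type (ref="Admin") gend (ref="0");
model dep=type gend; *Main effects only, no interaction;
run;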
If the interaction had been significant, then we would want to test simple effects instead of main effects. In this example, we would have the following simple effects:
*Simple Effects Contrast example;
proc glm data=depgend2; *Manual method;
model dep= VL VT GEND GEND*VL GEND*VT;
contrast "Gender comp. for Admin" gend 1;
contrast "Gender comp. for labor" gend 1 GEND*VL 1;
contrast "Gender comp. for tech" gend 1 GEND*VT 1;
contrast "1-way ANOVA for Females" VT 1, VL 1;
contrast "1-way ANOVA for Males" VT 1 Gend*VT 1, VL 1 Gend*VL 1;
run;
For a significant interaction, the sample write-up depends on the findings, but it could look like the following (with made-up p-values):
A two-way ANOVA was conducted that examined the effect of gender and work type on depression. There was a significant interaction between the effects of gender and work type on depression (F = 4.5, p = .02).
Simple main effects analysis showed that Tech workers had an estimated 55.8 points higher depression levels than Admin workers among males (p = .002), but no difference was found among females (p = .793).
The nonparametric alternative to the two-sample t-test is the Wilcoxon-Mann-Whitney rank-sum test (also called the Mann-Whitney U test). For more than two groups (the one-way ANOVA case), the analogue is the Kruskal-Wallis test.
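For reference, with $R_i$ denoting the sum of ranks for group $i$ (ranks computed on the pooled sample) and ignoring tie corrections, the Kruskal-Wallis statistic is
$$H=\dfrac{12}{N(N+1)}\sum_{i=1}^p\dfrac{R_i^2}{n_i}-3(N+1)$$
which is compared to a $\chi^2$ distribution with $p-1$ degrees of freedom.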
Title 'Nonparametric Analysis';
Title2 'Two group comparison: Rank-Sum Test';
proc npar1way data=depgend2 wilcoxon;
class gend;
var dep ;
run;
Title2 'Three group comparison: Kruskal Wallis';
proc npar1way data=depgend2 wilcoxon;
class type;
var dep ;
run;
For describing the association between two quantitative variables, scatterplots and correlation are the most common methods used to explore the relationship. Recall that the correlation coefficient describes the strength and direction of the LINEAR relationship between two quantitative variables.
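For observations $(x_i,y_i)$, $i=1,\dots,n$, the sample correlation is
$$r=\dfrac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2\sum_{i=1}^n(y_i-\bar{y})^2}}$$
which always lies between $-1$ and $1$.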
Data vote;
input VOTE TVEXP;
datalines;
35.4 38.5
58.2 48.3
46.1 47.2
45.5 34.8
64.8 50.1
52.0 44.0
37.9 27.2
48.2 37.8
41.8 27.2
54.0 39.1
40.8 31.3
61.9 45.1
36.5 31.3
32.7 34.8
53.8 42.2
24.6 29.0
31.2 36.1
42.6 36.5
49.6 33.2
56.6 46.1
. 40
;
run;
TITLE;
TITLE2;
PROC SGPLOT DATA=vote;
SCATTER X=TVEXP Y=VOTE;
RUN;
PROC CORR DATA=vote;
VAR vote tvexp;
RUN;
From the scatterplot, we can see that the relationship looks linear, and with $r=0.75819$ the correlation coefficient indicates a strong positive linear relationship between money spent on TV ads and voter turnout. To quantify the relationship between these two variables, we can fit a simple linear regression model $y=\beta_0+\beta_1x+\varepsilon$. Recall that linear regression has the following assumptions: the relationship between $x$ and the mean of $y$ is linear, the errors are independent, the errors have constant variance, and the errors are normally distributed.
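The least squares estimates minimize the sum of squared residuals and have the closed forms
$$\hat{\beta}_1=\dfrac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n(x_i-\bar{x})^2},\qquad \hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}$$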
proc reg data=vote;
model vote=tvexp ;
run;
proc glm data=vote;
model vote=tvexp ;
run;
For this model, we get the following fitted equation:
$$\hat{y}=1.92 + 1.15x$$
so each one-unit increase in TV expenditure is associated with an estimated 1.15-point increase in vote turnout. To check the assumptions, we can look at the scatterplot and the residual diagnostic plots output by PROC REG.
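With ODS Graphics enabled, PROC REG produces these diagnostic plots automatically; a minimal sketch requesting the diagnostics panel explicitly (PLOTS=DIAGNOSTICS is a standard PROC REG plot request):
ods graphics on;
proc reg data=vote plots=diagnostics; *Residual and influence diagnostics panel;
model vote=tvexp;
run;
quit;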