In this lecture, we will discuss ANOVA and simple linear regression. ANOVA (Analysis of Variance) uses sums of squares to test for differences in location between multiple groups (case CQ). This generalizes the two-sample t-test to multiple groups. Then, we will examine the basics of simple linear regression, including scatterplots, correlation, and fitting a regression line. Next week, we will continue with regression by considering models with more than one predictor.
The two-sample t-test can be used to compare means between two groups. If we have multiple groups, we could use the two-sample t-test to make all pairwise comparisons to see if any of the means differ, but this would greatly inflate the type I error rate.
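To see why, note that with $p$ groups there are $p(p-1)/2$ pairwise comparisons. If each is tested at level $\alpha=0.05$ and the tests were independent, the chance of at least one false rejection would be roughly
$$1-(1-0.05)^{p(p-1)/2}$$
which for $p=3$ groups (3 comparisons) is already about $1-0.95^3\approx 0.14$, well above the nominal 0.05.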
The ANOVA F test for equality of means controls this type I error rate. ANOVA has the following assumptions: the observations are independent, the responses within each group are normally distributed, and the group variances are equal.
ANOVA is robust to minor deviations from these assumptions.
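One quick way to check the equal-variance assumption in SAS is Levene's test, available through the HOVTEST option on the MEANS statement of PROC GLM. A minimal sketch, using the dep data set created below:
proc glm data=dep;
class type;
model dep=type;
means type / hovtest=levene; *Levene's test for equal variances;
run;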
The treatment effects model represents the jth observation in the ith group by $$y_{ij}=\mu+\alpha_i+\varepsilon_{ij}$$ where $\mu$ is the overall mean, $\alpha_i$ is the effect of group $i$, and the errors $\varepsilon_{ij}$ are independent $N(0,\sigma^2)$.
The ANOVA F test is for the hypotheses
$$H_0:\alpha_1=\alpha_2=\cdots=\alpha_p=0\quad\text{vs}\quad H_1:\text{not all }\alpha_i=0$$
Note that rejecting $H_0$ only tells you that at least one group mean is different, but not which groups differ. The test compares the sum of squares for treatment (labeled Model in SAS), which measures variability between groups, with the sum of squares for error, which measures variability within groups.
$$SST=\sum_{i=1}^p\sum_{j=1}^{n_i}(\bar{y}_{i\cdot}-\bar{y}_{\cdot\cdot})^2=\sum_{i=1}^p n_i(\bar{y}_{i\cdot}-\bar{y}_{\cdot\cdot})^2,\quad df=p-1$$
$$SSE=\sum_{i=1}^p\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_{i\cdot})^2,\quad df=N-p$$
$$F=\dfrac{SST/(p-1)}{SSE/(N-p)}$$
Example: Is there an association between depression and work type?
data dep;
*Create ID variable, used here to input ordered data;
do id=1 to 39;
if id<=8 then type="Admin";
else if id<=21 then type="Labor";
else type="Tech";
input dep @;
output;
end;
datalines;
75 73 68 109 92 82 33 135
69 161 91 80 198 194 94
126 184 141 108 175 126
35 86 202 213 82 156 170 188 37
294 92 232 238 112 87 73 168 218
;
run;
proc sgplot data=dep;
vbox dep /category=type ;
run;
To fit the model and obtain the output for the F test, we can use either PROC ANOVA (designed for balanced data, though it also handles unbalanced one-way designs) or PROC GLM.
proc ANOVA data=dep;
class type;
model dep=type;
run;
proc GLM data=dep;
class type;
model dep=type;
lsmeans type;
run;
From the output, we can see that SST = 24158.4113, SSE = 128505.8964, and the F statistic is 3.38 with a p-value of 0.045.
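As a check, the F statistic can be reproduced by hand from the sums of squares (here $p=3$ and $N=39$):
$$F=\dfrac{SST/(p-1)}{SSE/(N-p)}=\dfrac{24158.4113/2}{128505.8964/36}=\dfrac{12079.21}{3569.61}\approx 3.38$$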
Note that ANOVA is just a special case of regression with categorical predictors. This model for depression by work type is the same as the regression model with two group indicator variables.
data dep2; set dep;
VL=(type="Labor");
VT=(type="Tech");
run;
proc GLM data=dep2;
model dep=VL VT;
run;
Notice that we get the same value for the F test. In this case, the regression model is
$$y=\beta_0 + \beta_LV_L + \beta_TV_T+\varepsilon$$
and the F test is of $H_0: \beta_L=\beta_T=0$. When running the model this way, we also get estimates of the group means in the parameter estimates table.
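Specifically, the coefficients map onto the group means as
$$\mu_{\text{Admin}}=\beta_0,\qquad \mu_{\text{Labor}}=\beta_0+\beta_L,\qquad \mu_{\text{Tech}}=\beta_0+\beta_T$$
so the intercept estimates the Admin mean, and $\beta_L$ and $\beta_T$ are the differences of the Labor and Tech means from the Admin mean.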
Now that we have performed the F test and seen that the result is significant, how do we determine which groups are different? We can visually inspect the boxplot as a start. To perform tests comparing the groups, we can use contrasts. In the output above, the Type III sum of squares table gives us tests comparing Labor to Admin ($H_0:\beta_L=0$) and Tech to Admin ($H_0:\beta_T=0$).
To use contrasts to reproduce these results and to compare Labor to Tech, we can do the following.
proc GLM data=dep2;
model DEP=VL VT;
contrast 'ANOVA equivalent test' VL 1, VT 1;
contrast 'L vs A equivalent' VL 1 ;
contrast 'T vs A equivalent' VT 1;
contrast 'Labor vs Tech' VL 1 VT -1;
run;
Other common post-hoc pairwise comparison procedures are Tukey's honestly significant difference, Fisher's least significant difference, Scheffé's method, Bonferroni, and Šidák; all but Fisher's LSD control the familywise type I error rate.
proc GLM data=dep2;
class type;
model dep=type;
means type/scheffe bon sidak tukey lsd;
run;
quit;
Now, let's add in a second factor, gender.
data depgend (drop=i);
do i=1 to 39;
if i<=8 then type="Admin";
else if i<=21 then type="Labor";
else type="Tech";
input dep gend @;
output;
end;
datalines;
75 1 73 0 68 0 109 1 92 1 82 0 33 0 135 0
69 0 161 0 91 1 80 1 198 0 194 1 94 0
126 1 184 0 141 1 108 0 175 0 126 0
35 0 86 1 202 1 213 0 82 0 156 1 170 0 188 1 37 0
294 1 92 0 232 0 238 0 112 1 87 1 73 0 168 0 218 1
;
run;
proc freq data=depgend;
table gend*type /norow nocol nopct;
run;
*Box plot;
proc sgplot data=depgend;
vbox dep /category=type group=gend ;
label gend="1=Male, 0=Female";
run;
By choosing Admin and Female as our reference categories, we have the following model
$$y=\beta_{AF}+\beta_{L}V_L+\beta_TV_T+\beta_MV_M+\beta_{MT}V_{MT}+\beta_{ML}V_{ML}+\varepsilon$$
PROC GLM reports three different types of sums of squares that you can use to test the coefficients: Type I (sequential) SS depend on the order in which effects enter the model, Type II SS adjust each effect for all other effects that do not contain it, and Type III SS adjust each effect for all other effects in the model.
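Under this coding, each cell mean is a sum of coefficients; for example,
$$\mu_{\text{Admin},F}=\beta_{AF},\qquad \mu_{\text{Labor},M}=\beta_{AF}+\beta_L+\beta_M+\beta_{ML},\qquad \mu_{\text{Tech},M}=\beta_{AF}+\beta_T+\beta_M+\beta_{MT}$$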
data depgend2; set depgend;
VL=(type="Labor");
VT=(type="Tech");
VA=(type="Admin");
run;
title 'Unbalanced Two-Way Analysis of Variance';
proc glm data=depgend2;
model dep=VT VL gend gend*VT gend*VL;
run;
proc glm data=depgend2;
class type (ref="Admin") gend (ref="0");
model dep=type gend type*gend /ss1 ss2 ss3;
run;
proc glm data=depgend2; *Saturated model via the bar operator;
class type (ref="Admin") gend (ref="0");
model dep=type|gend /ss1 ss2 ss3;
run;
In this case, the interaction effect between gender and type is not significant. Since the interaction effect is not significant, we can test for main effects (note that it is not appropriate to test for main effects if the interaction is significant). Gender also appears not to be significant (see the Type I or Type II SS).
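With the interaction dropped, a natural follow-up is to refit the additive (main-effects-only) model. A minimal sketch, reusing the class coding from above:
proc glm data=depgend2;
class type (ref="Admin") gend (ref="0");
model dep=type gend; *Main effects only, no interaction;
run;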
If the interaction had been significant, then we would want to test simple effects instead of main effects. In this example, we would have the following simple effects:
*Simple Effects Contrast example;
proc glm data=depgend2; *Manual method;
model dep= VL VT GEND GEND*VL GEND*VT;
contrast "Gender comp. for Admin" gend 1;
contrast "Gender comp. for labor" gend 1 GEND*VL 1;
contrast "Gender comp. for tech" gend 1 GEND*VT 1;
contrast "1-way ANOVA for Females" VT 1, VL 1;
contrast "1-way ANOVA for Males" VT 1 Gend*VT 1, VL 1 Gend*VL 1;
run;
For a significant interaction, the sample write-up depends on the findings, but it could look like the following (with made-up p-values):
A two-way ANOVA was conducted that examined the effect of gender and work type on depression. There was a significant interaction between the effects of gender and work type on depression (F = 4.5, p = .02).
Simple main effects analysis showed that Tech workers had an estimated 55.8 points higher depression levels than Admin workers among males (p = .002), but no difference was found among females (p = .793).
The nonparametric alternative to the two-sample t-test is the Wilcoxon-Mann-Whitney rank-sum test (also called the Mann-Whitney U test). For more than two groups (the one-way ANOVA case), the analogue is the Kruskal-Wallis test.
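For reference, with $R_i$ denoting the sum of ranks for group $i$ (ranks computed on the pooled sample) and ignoring tie corrections, the Kruskal-Wallis statistic is
$$H=\dfrac{12}{N(N+1)}\sum_{i=1}^p\dfrac{R_i^2}{n_i}-3(N+1)$$
which is compared to a $\chi^2$ distribution with $p-1$ degrees of freedom.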
Title 'Nonparametric Analysis';
Title2 'Two group comparison: Rank-Sum Test';
proc npar1way data=depgend2 wilcoxon;
class gend;
var dep ;
run;
Title2 'Three group comparison: Kruskal Wallis';
proc npar1way data=depgend2 wilcoxon;
class type;
var dep ;
run;
For describing the association between two quantitative variables, scatterplots and correlation are the most common methods used to explore the relationship. Recall that the correlation coefficient describes the strength and direction of the LINEAR relationship between two quantitative variables.
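For observations $(x_i,y_i)$, $i=1,\dots,n$, the sample correlation is
$$r=\dfrac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2\sum_{i=1}^n(y_i-\bar{y})^2}}$$
which always lies between $-1$ and $1$.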
Data vote;
input VOTE TVEXP;
datalines;
35.4 38.5
58.2 48.3
46.1 47.2
45.5 34.8
64.8 50.1
52.0 44.0
37.9 27.2
48.2 37.8
41.8 27.2
54.0 39.1
40.8 31.3
61.9 45.1
36.5 31.3
32.7 34.8
53.8 42.2
24.6 29.0
31.2 36.1
42.6 36.5
49.6 33.2
56.6 46.1
. 40
;
run;
TITLE;
TITLE2;
PROC SGPLOT DATA=vote;
SCATTER X=TVEXP Y=VOTE;
RUN;
PROC CORR DATA=vote;
VAR vote tvexp;
RUN;
From the scatterplot, we can see that the relationship looks linear, and with $r=0.75819$ the correlation coefficient indicates a strong positive linear relationship between money spent on TV ads and voter turnout. To quantify the relationship between these two variables, we can fit a simple linear regression model $y=\beta_0+\beta_1x+\varepsilon$. Recall that linear regression has the following assumptions: the relationship between $x$ and the mean of $y$ is linear, the errors are independent, the errors have constant variance, and the errors are normally distributed.
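The least squares estimates minimize the sum of squared residuals and have the closed forms
$$\hat{\beta}_1=\dfrac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n(x_i-\bar{x})^2},\qquad \hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}$$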
proc reg data=vote;
model vote=tvexp ;
run;
proc glm data=vote;
model vote=tvexp ;
run;
For this model, we get the following fitted equation:
$$\hat{y}=1.92 + 1.15x$$
so each one-unit increase in TV expenditure is associated with an estimated 1.15-point increase in vote turnout. To check the assumptions, we can look at the scatterplot and the residual diagnostic plots output by PROC REG.
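With ODS Graphics enabled, PROC REG produces these diagnostic plots automatically; a minimal sketch requesting the diagnostics panel explicitly (PLOTS=DIAGNOSTICS is a standard PROC REG plot request):
ods graphics on;
proc reg data=vote plots=diagnostics; *Residual and influence diagnostics panel;
model vote=tvexp;
run;
quit;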