ANOVA and Simple Linear Regression in SAS

In this lecture, we will discuss ANOVA and simple linear regression. ANOVA (Analysis of Variance) uses sums of squares to test for differences in location between multiple groups (case CQ). This generalizes the two-sample t-test to more than two groups. Then, we will examine the basics of simple linear regression, including scatterplots, correlation, and fitting a regression line. Next week, we will continue regression by considering models with more than one predictor.

ANOVA

The two-sample t-test can be used to compare means between two groups. If we have multiple groups, then we could use the two-sample t-test to make all pairwise comparisons to see if any of the means differ, but this would greatly inflate the type I error rate.

  • The probability of making at least one false positive conclusion across $p$ independent tests, each at $\alpha=0.05$ (the familywise Type I error rate), is $1-(0.95)^p$
  • For $p=3$ tests (for example, all pairwise comparisons among three groups), this would be 0.14
  • For $p=10$ tests, this would be 0.40.
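
The arithmetic behind these figures can be checked with a short DATA step. This is a minimal sketch; the data set name `fwer` and its variable names are illustrative and not part of the lecture's data.

```sas
* Familywise error rate for p independent tests at level alpha;
data fwer;
  alpha = 0.05;
  do p = 1, 3, 10;
    fwer = 1 - (1 - alpha)**p;   * P(at least one false positive);
    output;
  end;
run;

proc print data=fwer noobs;
  format fwer 6.4;
run;
```

The printed values for p = 3 and p = 10 should round to the 0.14 and 0.40 quoted above.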

The ANOVA F test for equality of means controls for this type I error rate. ANOVA has the following assumptions:

  • Independent observations
  • Normal populations
  • Constant variance

ANOVA is robust to minor deviations from these assumptions.

One-way ANOVA

The treatment effects model describes the jth observation in the ith group by $$y_{ij}=\mu+\alpha_i+\varepsilon_{ij}$$

  • $\mu$ is the overall mean
  • $\alpha_i$ is the group i treatment effect
  • Mean response for group i is $\mu+\alpha_i$
  • $\varepsilon_{ij}$ is a random error term (generally assumed to be iid N(0,$\sigma^2$))

The ANOVA F test is for the hypotheses

$$H_0:\alpha_1=\alpha_2=\cdots=\alpha_p=0\;\text{ vs }\;H_1:\text{ at least one }\alpha_i\neq 0$$

Note that rejecting $H_0$ only tells you that at least one group mean is different, not which groups differ. The test compares the treatment sum of squares (labeled Model in the SAS output), which measures variability between groups, with the error sum of squares, which measures variability within groups.

$$ SST=\sum_{i=1}^p\sum_{j=1}^{n_i}(\bar{y}_{i\cdot}-\bar{y}_{\cdot\cdot})^2=\sum_{i=1}^p n_i(\bar{y}_{i\cdot}-\bar{y}_{\cdot\cdot})^2, \;\;df=p-1$$
$$ SSE=\sum_{i=1}^p\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_{i\cdot})^2,\;\;df=N-p $$
$$ F=\dfrac{SST/(p-1)}{SSE/(N-p)} $$

Example: Is there an association between depression and worktype?

  • Worktype: Administration, Labor, or Technology
  • Depression: measured as a depression score
  • Another way of looking at this question is: Do the observed depression scores come from the same population distribution? Do depression levels differ by work type?
In [1]:
data dep;
*Create ID variable, used here to input ordered data;
  do id=1 to 39;   
  if id<=8 then type="Admin";
   else if id<=21 then type="Labor";
   else type="Tech";
  input dep @;
  output;
  end;
datalines;
75 73 68 109 92 82 33 135

69 161 91 80 198 194 94 
126 184 141 108 175 126

35 86 202 213 82 156 170 188 37 
294 92 232 238 112 87 73 168 218
;
run; 

proc sgplot data=dep;    
  vbox dep /category=type ;
run;

Out[1]:
The SGPLOT Procedure

In order to fit the model and get the output for the F test, we can use either PROC ANOVA (designed for balanced data, although a one-way layout with unequal group sizes is also handled correctly) or PROC GLM.

In [2]:
proc ANOVA data=dep;
  class type;
  model dep=type;
run;
Out[2]:
The SAS System

The ANOVA Procedure

The ANOVA Procedure

Data

Class Levels

Class Level Information
Class Levels Values
type 3 Admin Labor Tech

Number of Observations

Number of Observations Read 39
Number of Observations Used 39

The SAS System

The ANOVA Procedure

Dependent Variable: dep

Analysis of Variance

dep

Overall ANOVA

Source DF Sum of Squares Mean Square F Value Pr > F
Model 2 24158.4113 12079.2057 3.38 0.0450
Error 36 128505.8964 3569.6082    
Corrected Total 38 152664.3077      

Fit Statistics

R-Square Coeff Var Root MSE dep Mean
0.158245 45.71516 59.74620 130.6923

Anova Model ANOVA

Source DF Anova SS Mean Square F Value Pr > F
type 2 24158.41132 12079.20566 3.38 0.0450

Box Plot

Distribution of dep by type
In [16]:
proc GLM data=dep;
 class type;
 model dep=type;
 lsmeans type;
run;
Out[16]:
Nonparametric Analysis

Three group comparison: Kruskal Wallis

The GLM Procedure

The GLM Procedure

Data

Class Levels

Class Level Information
Class Levels Values
type 3 Admin Labor Tech

Number of Observations

Number of Observations Read 39
Number of Observations Used 39

Nonparametric Analysis

Three group comparison: Kruskal Wallis

The GLM Procedure

Dependent Variable: dep

Analysis of Variance

dep

Overall ANOVA

Source DF Sum of Squares Mean Square F Value Pr > F
Model 2 24158.4113 12079.2057 3.38 0.0450
Error 36 128505.8964 3569.6082    
Corrected Total 38 152664.3077      

Fit Statistics

R-Square Coeff Var Root MSE dep Mean
0.158245 45.71516 59.74620 130.6923

Type I Model ANOVA

Source DF Type I SS Mean Square F Value Pr > F
type 2 24158.41132 12079.20566 3.38 0.0450

Type III Model ANOVA

Source DF Type III SS Mean Square F Value Pr > F
type 2 24158.41132 12079.20566 3.38 0.0450

Box Plot

Fit Plot for dep by type

Nonparametric Analysis

Three group comparison: Kruskal Wallis

The GLM Procedure

Least Squares Means

Least Squares Means

type

dep

LSMeans

type dep LSMEAN
Admin 83.375000
Labor 134.384615
Tech 149.055556

type Mean Plot

Plot of dep least-squares means for type.

From the output, we can see that SST = 24158.4113, SSE = 128505.8964 and the F statistic is 3.38 with a p-value of 0.045.
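
These quantities fit together as F = MS(Model) / MS(Error). As a quick check, assuming only the sums of squares and degrees of freedom reported above, a DATA step can recompute the F statistic and its p-value. The data set name `fcheck` is arbitrary; PROBF is the SAS cumulative distribution function for the F distribution.

```sas
data fcheck;
  sst = 24158.4113;   df_t = 2;    * Model (treatment) SS and df;
  sse = 128505.8964;  df_e = 36;   * Error SS and df;
  f = (sst / df_t) / (sse / df_e);
  p_value = 1 - probf(f, df_t, df_e);   * upper-tail probability;
run;

proc print data=fcheck noobs;
  var f p_value;
  format f 8.2 p_value 8.4;
run;
```

The result should agree with the F Value and Pr > F columns of the ANOVA table.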

Note that ANOVA is just a special case of regression with categorical predictors. This model for depression by work type is the same as the regression model with two group indicators.

In [4]:
data dep2; set dep;
  VL=(type="Labor");
  VT=(type="Tech");
run;

proc GLM data=dep2;
 model dep=VL VT;
run;
Out[4]:
The SAS System

The GLM Procedure

The GLM Procedure

Data

Number of Observations

Number of Observations Read 39
Number of Observations Used 39

The SAS System

The GLM Procedure

Dependent Variable: dep

Analysis of Variance

dep

Overall ANOVA

Source DF Sum of Squares Mean Square F Value Pr > F
Model 2 24158.4113 12079.2057 3.38 0.0450
Error 36 128505.8964 3569.6082    
Corrected Total 38 152664.3077      

Fit Statistics

R-Square Coeff Var Root MSE dep Mean
0.158245 45.71516 59.74620 130.6923

Type I Model ANOVA

Source DF Type I SS Mean Square F Value Pr > F
VL 1 265.84615 265.84615 0.07 0.7865
VT 1 23892.56517 23892.56517 6.69 0.0139

Type III Model ANOVA

Source DF Type III SS Mean Square F Value Pr > F
VL 1 12886.00046 12886.00046 3.61 0.0655
VT 1 23892.56517 23892.56517 6.69 0.0139

Solution

Parameter Estimate Standard Error t Value Pr > |t|
Intercept 83.37500000 21.12347105 3.95 0.0004
VL 51.00961538 26.84746315 1.90 0.0655
VT 65.68055556 25.38725266 2.59 0.0139

Contour Fit Plot

Contour Fit Plot for dep

Notice that we get the same value for the F test. In this case, the regression model is

$$y=\beta_0 + \beta_LV_L + \beta_TV_T$$

and the F test is of $H_0: \beta_L=\beta_T=0$. When running the model this way, we can also recover estimates of the group means from the parameter estimates table: the intercept is the Admin mean, and each coefficient is a difference from Admin.

  • Admin: 83.375
  • Labor: 83.375+51.01 = 134.385
  • Tech: 83.375+65.68 = 149.055
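
One way to confirm these values is to compute the sample mean of each group directly. A minimal sketch using the `dep` data set created earlier:

```sas
* Sample size and mean depression score within each work type;
proc means data=dep n mean;
  class type;
  var dep;
run;
```

In a one-way layout the group means should match the LSMEAN values reported by PROC GLM (83.375, 134.385, and 149.056).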

Now that we have performed the F test and seen that the results are significant, how do we determine which groups are different? We can visually inspect the boxplot as a start. To perform tests to compare the groups, we can use contrasts. In the output above, the type III sum of squares table gives us tests comparing Labor to Admin ($H_0:\beta_L=0$) and Tech to Admin ($H_0:\beta_T=0$).

To use contrasts to reproduce these results and to compare Labor to Tech, we can do the following.

In [5]:
proc GLM data=dep2;
 model DEP=VL VT;
 contrast 'ANOVA equivalent test' VL 1, VT 1;
 contrast 'L vs A equivalent' VL 1 ;
 contrast 'T vs A equivalent' VT 1;
 contrast 'Labor vs Tech' VL 1 VT -1; 
run;
Out[5]:
The SAS System

The GLM Procedure

The GLM Procedure

Data

Number of Observations

Number of Observations Read 39
Number of Observations Used 39

The SAS System

The GLM Procedure

Dependent Variable: dep

Analysis of Variance

dep

Overall ANOVA

Source DF Sum of Squares Mean Square F Value Pr > F
Model 2 24158.4113 12079.2057 3.38 0.0450
Error 36 128505.8964 3569.6082    
Corrected Total 38 152664.3077      

Fit Statistics

R-Square Coeff Var Root MSE dep Mean
0.158245 45.71516 59.74620 130.6923

Type I Model ANOVA

Source DF Type I SS Mean Square F Value Pr > F
VL 1 265.84615 265.84615 0.07 0.7865
VT 1 23892.56517 23892.56517 6.69 0.0139

Type III Model ANOVA

Source DF Type III SS Mean Square F Value Pr > F
VL 1 12886.00046 12886.00046 3.61 0.0655
VT 1 23892.56517 23892.56517 6.69 0.0139

Contrasts

Contrast DF Contrast SS Mean Square F Value Pr > F
ANOVA equivalent test 2 24158.41132 12079.20566 3.38 0.0450
L vs A equivalent 1 12886.00046 12886.00046 3.61 0.0655
T vs A equivalent 1 23892.56517 23892.56517 6.69 0.0139
Labor vs Tech 1 1624.68831 1624.68831 0.46 0.5042

Solution

Parameter Estimate Standard Error t Value Pr > |t|
Intercept 83.37500000 21.12347105 3.95 0.0004
VL 51.00961538 26.84746315 1.90 0.0655
VT 65.68055556 25.38725266 2.59 0.0139

Contour Fit Plot

Contour Fit Plot for dep
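
CONTRAST reports only an F test. If the size of the difference is also of interest, the ESTIMATE statement in PROC GLM accepts the same coefficients and reports the estimate, its standard error, and a t test. A sketch reusing the `dep2` data set; the label text is arbitrary.

```sas
proc glm data=dep2;
  model dep = VL VT;
  * Labor minus Tech, that is beta_L - beta_T;
  estimate 'Labor vs Tech' VL 1 VT -1;
run;
quit;
```

For a single-degree-of-freedom comparison, the squared t value from ESTIMATE should equal the F value from the matching CONTRAST.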

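Other post-hoc pairwise comparison methods include Tukey's honestly significant difference, Fisher's least significant difference (LSD), Scheffe's method, Bonferroni, and Sidak. All of these except Fisher's LSD control the experimentwise type I error rate; LSD controls only the comparisonwise rate, as the note in its output states.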

In [6]:
proc GLM data=dep2;
 class type;
 model dep=type;
 means type/scheffe bon sidak tukey lsd;
run;

quit;
Out[6]:
The SAS System

The GLM Procedure

The GLM Procedure

Data

Class Levels

Class Level Information
Class Levels Values
type 3 Admin Labor Tech

Number of Observations

Number of Observations Read 39
Number of Observations Used 39

The SAS System

The GLM Procedure

Dependent Variable: dep

Analysis of Variance

dep

Overall ANOVA

Source DF Sum of Squares Mean Square F Value Pr > F
Model 2 24158.4113 12079.2057 3.38 0.0450
Error 36 128505.8964 3569.6082    
Corrected Total 38 152664.3077      

Fit Statistics

R-Square Coeff Var Root MSE dep Mean
0.158245 45.71516 59.74620 130.6923

Type I Model ANOVA

Source DF Type I SS Mean Square F Value Pr > F
type 2 24158.41132 12079.20566 3.38 0.0450

Type III Model ANOVA

Source DF Type III SS Mean Square F Value Pr > F
type 2 24158.41132 12079.20566 3.38 0.0450

Box Plot

Fit Plot for dep by type

The SAS System

The GLM Procedure

Means

type

dep

Distribution of dep by type

Distribution of dep by type

Pairwise Multiple Comparisons

t


The SAS System

The GLM Procedure

t Tests (LSD) for dep

Note:This test controls the Type I comparisonwise error rate, not the experimentwise error rate.

Information

Alpha 0.05
Error Degrees of Freedom 36
Error Mean Square 3569.608
Critical Value of t 2.02809

Pairs

Comparisons significant at the 0.05 level are indicated by ***.
type Comparison Difference Between Means 95% Confidence Limits
Tech - Labor 14.67 -29.43 58.77  
Tech - Admin 65.68 14.19 117.17 ***
Labor - Tech -14.67 -58.77 29.43  
Labor - Admin 51.01 -3.44 105.46  
Admin - Tech -65.68 -117.17 -14.19 ***
Admin - Labor -51.01 -105.46 3.44  

Tukey


The SAS System

The GLM Procedure

Tukey's Studentized Range (HSD) Test for dep

Note:This test controls the Type I experimentwise error rate.

Information

Alpha 0.05
Error Degrees of Freedom 36
Error Mean Square 3569.608
Critical Value of Studentized Range 3.45675

Pairs

Comparisons significant at the 0.05 level are indicated by ***.
type Comparison Difference Between Means Simultaneous 95% Confidence Limits
Tech - Labor 14.67 -38.48 67.82  
Tech - Admin 65.68 3.63 127.73 ***
Labor - Tech -14.67 -67.82 38.48  
Labor - Admin 51.01 -14.61 116.63  
Admin - Tech -65.68 -127.73 -3.63 ***
Admin - Labor -51.01 -116.63 14.61  

Sidak


The SAS System

The GLM Procedure

Sidak t Tests for dep

Note:This test controls the Type I experimentwise error rate, but it generally has a higher Type II error rate than Tukey's for all pairwise comparisons.

Information

Alpha 0.05
Error Degrees of Freedom 36
Error Mean Square 3569.608
Critical Value of t 2.50395

Pairs

Comparisons significant at the 0.05 level are indicated by ***.
type Comparison Difference Between Means Simultaneous 95% Confidence Limits
Tech - Labor 14.67 -39.78 69.12  
Tech - Admin 65.68 2.11 129.25 ***
Labor - Tech -14.67 -69.12 39.78  
Labor - Admin 51.01 -16.22 118.23  
Admin - Tech -65.68 -129.25 -2.11 ***
Admin - Labor -51.01 -118.23 16.22  

Bonferroni


The SAS System

The GLM Procedure

Bonferroni (Dunn) t Tests for dep

Note:This test controls the Type I experimentwise error rate, but it generally has a higher Type II error rate than Tukey's for all pairwise comparisons.

Information

Alpha 0.05
Error Degrees of Freedom 36
Error Mean Square 3569.608
Critical Value of t 2.51104

Pairs

Comparisons significant at the 0.05 level are indicated by ***.
type Comparison Difference Between Means Simultaneous 95% Confidence Limits
Tech - Labor 14.67 -39.93 69.28  
Tech - Admin 65.68 1.93 129.43 ***
Labor - Tech -14.67 -69.28 39.93  
Labor - Admin 51.01 -16.41 118.42  
Admin - Tech -65.68 -129.43 -1.93 ***
Admin - Labor -51.01 -118.42 16.41  

Scheffe


The SAS System

The GLM Procedure

Scheffe's Test for dep

Note:This test controls the Type I experimentwise error rate, but it generally has a higher Type II error rate than Tukey's for all pairwise comparisons.

Information

Alpha 0.05
Error Degrees of Freedom 36
Error Mean Square 3569.608
Critical Value of F 3.25945

Pairs

Comparisons significant at the 0.05 level are indicated by ***.
type Comparison Difference Between Means Simultaneous 95% Confidence Limits
Tech - Labor 14.67 -40.85 70.19  
Tech - Admin 65.68 0.86 130.50 ***
Labor - Tech -14.67 -70.19 40.85  
Labor - Admin 51.01 -17.54 119.56  
Admin - Tech -65.68 -130.50 -0.86 ***
Admin - Labor -51.01 -119.56 17.54  
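
The critical values in the Information tables above come from standard t and F quantiles. The sketch below recomputes some of them with the SAS quantile functions TINV and FINV, assuming 3 pairwise comparisons, 36 error degrees of freedom, and alpha = 0.05. The data set and variable names are illustrative.

```sas
data critvals;
  alpha = 0.05;  k = 3;  df = 36;   * k is the number of pairwise comparisons;
  t_lsd   = tinv(1 - alpha/2, df);                      * unadjusted t for LSD;
  t_bon   = tinv(1 - alpha/(2*k), df);                  * Bonferroni uses alpha/k per test;
  t_sidak = tinv(1 - (1 - (1 - alpha)**(1/k))/2, df);   * Sidak per-test level;
  f_sch   = finv(1 - alpha, 2, df);                     * Scheffe uses F with 2 numerator df;
run;

proc print data=critvals noobs;
run;
```

These should reproduce the critical values reported for the LSD, Bonferroni, Sidak, and Scheffe methods. The larger the critical value, the wider the simultaneous confidence limits, which is why the Scheffe intervals are the widest here.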

Two-way ANOVA

Now, let's add in a second factor, gender.

In [7]:
data depgend (drop=i);
  do i=1 to 39;
  if i<=8 then type="Admin";
   else if i<=21 then type="Labor";
   else type="Tech";
  input dep gend @;
  output;
  end;
datalines;
75 1 73 0 68 0 109 1 92 1 82 0 33 0 135 0

69 0 161 0 91 1 80 1 198 0 194 1 94 0 
126 1 184 0 141 1 108 0 175 0 126 0

35 0 86 1 202 1 213 0 82 0 156 1 170 0 188 1 37 0 
294 1 92 0 232 0 238 0 112 1 87 1 73 0 168 0 218 1
;
run; 

proc freq data=depgend;
 table gend*type /norow nocol nopct;
 run;

 *Box plot;
proc sgplot data=depgend;  
  vbox dep /category=type group=gend ;
  label gend="1=Male, 0=Female";
run;
Out[7]:
The SAS System

The FREQ Procedure

The FREQ Procedure

Table gend * type

Cross-Tabular Freq Table

Table of gend by type (Frequency)
gend Admin Labor Tech Total
0 5 8 10 23
1 3 5 8 16
Total 8 13 18 39

The SGPLOT Procedure

By choosing Admin and Female as our reference categories, we have the following model

$$y=\beta_{AF}+\beta_{L}V_L+\beta_TV_T+\beta_MV_M+\beta_{MT}V_{MT}+\beta_{ML}V_{ML}$$

PROC GLM reports several types of sums of squares that can be used to test model terms.

  • Type I SS: sequential; each term is adjusted only for the terms listed before it in the MODEL statement
  • Type II SS: each term is adjusted for all other terms that do not contain it (terms of the same or lower order)
  • Type III SS: each term is adjusted for all other terms in the model
In [10]:
data depgend2; set depgend;
  VL=(type="Labor");
  VT=(type="Tech");
  VA=(type="Admin");
run;

title 'Unbalanced Two-Way Analysis of Variance';
proc glm data=depgend2;
  model dep=VT VL gend gend*VT gend*VL;
run;
Out[10]:
Unbalanced Two-Way Analysis of Variance

The GLM Procedure

The GLM Procedure

Data

Number of Observations

Number of Observations Read 39
Number of Observations Used 39

Unbalanced Two-Way Analysis of Variance

The GLM Procedure

Dependent Variable: dep

Analysis of Variance

dep

Overall ANOVA

Source DF Sum of Squares Mean Square F Value Pr > F
Model 5 30133.5577 6026.7115 1.62 0.1813
Error 33 122530.7500 3713.0530    
Corrected Total 38 152664.3077      

Fit Statistics

R-Square Coeff Var Root MSE dep Mean
0.197384 46.62465 60.93483 130.6923

Type I Model ANOVA

Source DF Type I SS Mean Square F Value Pr > F
VT 1 11272.41087 11272.41087 3.04 0.0908
VL 1 12886.00046 12886.00046 3.47 0.0714
gend 1 1983.13781 1983.13781 0.53 0.4700
VT*gend 1 3156.78453 3156.78453 0.85 0.3632
VL*gend 1 835.22403 835.22403 0.22 0.6384

Type III Model ANOVA

Source DF Type III SS Mean Square F Value Pr > F
VT 1 10378.80000 10378.80000 2.80 0.1040
VL 1 11515.01731 11515.01731 3.10 0.0875
gend 1 357.07500 357.07500 0.10 0.7584
VT*gend 1 531.43599 531.43599 0.14 0.7076
VL*gend 1 835.22403 835.22403 0.22 0.6384

Solution

Parameter Estimate Standard Error t Value Pr > |t|
Intercept 78.20000000 27.25088267 2.87 0.0071
VT 55.80000000 33.37537879 1.67 0.1040
VL 61.17500000 34.73819562 1.76 0.0875
gend 13.80000000 44.50050505 0.31 0.7584
VT*gend 20.07500000 53.06347031 0.38 0.7076
VL*gend -26.77500000 56.45385004 -0.47 0.6384
In [12]:
proc glm data=depgend2;
  class type (ref="Admin") gend (ref="0");
  model dep=type gend type*gend /ss1 ss2 ss3;
run;
Out[12]:
Unbalanced Two-Way Analysis of Variance

The GLM Procedure

The GLM Procedure

Data

Class Levels

Class Level Information
Class Levels Values
type 3 Labor Tech Admin
gend 2 1 0

Number of Observations

Number of Observations Read 39
Number of Observations Used 39

Unbalanced Two-Way Analysis of Variance

The GLM Procedure

Dependent Variable: dep

Analysis of Variance

dep

Overall ANOVA

Source DF Sum of Squares Mean Square F Value Pr > F
Model 5 30133.5577 6026.7115 1.62 0.1813
Error 33 122530.7500 3713.0530    
Corrected Total 38 152664.3077      

Fit Statistics

R-Square Coeff Var Root MSE dep Mean
0.197384 46.62465 60.93483 130.6923

Type I Model ANOVA

Source DF Type I SS Mean Square F Value Pr > F
type 2 24158.41132 12079.20566 3.25 0.0513
gend 1 1983.13781 1983.13781 0.53 0.4700
type*gend 2 3992.00856 1996.00428 0.54 0.5892

Type II Model ANOVA

Source DF Type II SS Mean Square F Value Pr > F
type 2 23431.11373 11715.55686 3.16 0.0557
gend 1 1983.13781 1983.13781 0.53 0.4700
type*gend 2 3992.00856 1996.00428 0.54 0.5892

Type III Model ANOVA

Source DF Type III SS Mean Square F Value Pr > F
type 2 22881.93374 11440.96687 3.08 0.0593
gend 1 1111.46769 1111.46769 0.30 0.5880
type*gend 2 3992.00856 1996.00428 0.54 0.5892

Interaction Plot

Interaction Plot for dep by type
In [11]:
proc glm data=depgend2;  *Bar operator: type|gend expands to type gend type*gend;
 class type (ref="Admin") gend (ref="0");
 model dep=type|gend /ss1 ss2 ss3;
run;
Out[11]:
Unbalanced Two-Way Analysis of Variance

The GLM Procedure

The GLM Procedure

Data

Class Levels

Class Level Information
Class Levels Values
type 3 Labor Tech Admin
gend 2 1 0

Number of Observations

Number of Observations Read 39
Number of Observations Used 39

Unbalanced Two-Way Analysis of Variance

The GLM Procedure

Dependent Variable: dep

Analysis of Variance

dep

Overall ANOVA

Source DF Sum of Squares Mean Square F Value Pr > F
Model 5 30133.5577 6026.7115 1.62 0.1813
Error 33 122530.7500 3713.0530    
Corrected Total 38 152664.3077      

Fit Statistics

R-Square Coeff Var Root MSE dep Mean
0.197384 46.62465 60.93483 130.6923

Type I Model ANOVA

Source DF Type I SS Mean Square F Value Pr > F
type 2 24158.41132 12079.20566 3.25 0.0513
gend 1 1983.13781 1983.13781 0.53 0.4700
type*gend 2 3992.00856 1996.00428 0.54 0.5892

Type II Model ANOVA

Source DF Type II SS Mean Square F Value Pr > F
type 2 23431.11373 11715.55686 3.16 0.0557
gend 1 1983.13781 1983.13781 0.53 0.4700
type*gend 2 3992.00856 1996.00428 0.54 0.5892

Type III Model ANOVA

Source DF Type III SS Mean Square F Value Pr > F
type 2 22881.93374 11440.96687 3.08 0.0593
gend 1 1111.46769 1111.46769 0.30 0.5880
type*gend 2 3992.00856 1996.00428 0.54 0.5892

Interaction Plot

Interaction Plot for dep by type

In this case, the interaction effect between gender and type is not significant. Since the interaction effect is not significant, we can test for main effects (note that it is not appropriate to test for main effects if the interaction is significant). Gender also does not appear to be significant (see the Type I or Type II SS).

If the interaction had been significant, then we would want to test simple effects instead of main effects. In this example we would have the following simple effects:

  • Gender comparison for Admin: $H_0: \beta_M=0$
  • Gender comparison for Labor: $H_0: \beta_M +\beta_{ML}=0$
  • Gender comparison for Tech: $H_0: \beta_M +\beta_{MT}=0$
  • Work type comparison for males: $H_0: \beta_L +\beta_{ML} = \beta_T +\beta_{MT}=0$
  • Work type comparison for females: $H_0: \beta_L=\beta_T=0$
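
These hypotheses follow from writing each cell mean in terms of the model coefficients. As a worked example (the $\mu$ notation for cell means is introduced here for illustration, with Admin/Female as the reference cell), the means for male and female Labor workers are

$$\mu_{M,L}=\beta_{AF}+\beta_L+\beta_M+\beta_{ML},\qquad \mu_{F,L}=\beta_{AF}+\beta_L$$

so the gender difference within Labor is

$$\mu_{M,L}-\mu_{F,L}=\beta_M+\beta_{ML}$$

Similarly, among males the Labor and Tech means differ from the Admin mean by

$$\mu_{M,L}-\mu_{M,A}=\beta_L+\beta_{ML},\qquad \mu_{M,T}-\mu_{M,A}=\beta_T+\beta_{MT}$$

and the three male means are equal exactly when both differences are zero.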
In [13]:
*Simple Effects Contrast example;
proc glm data=depgend2;  *Manual method;
 model dep= VL VT GEND GEND*VL GEND*VT;
 contrast "Gender comp. for Admin" gend 1;
 contrast "Gender comp. for labor" gend 1 GEND*VL 1;
 contrast "Gender comp. for tech" gend 1 GEND*VT 1;
 contrast "1-way ANOVA for Females" VT 1, VL 1;
 contrast "1-way ANOVA for Males" VT 1 Gend*VT 1, VL 1 Gend*VL 1;
run;
Out[13]:
Unbalanced Two-Way Analysis of Variance

The GLM Procedure

The GLM Procedure

Data

Number of Observations

Number of Observations Read 39
Number of Observations Used 39

Unbalanced Two-Way Analysis of Variance

The GLM Procedure

Dependent Variable: dep

Analysis of Variance

dep

Overall ANOVA

Source DF Sum of Squares Mean Square F Value Pr > F
Model 5 30133.5577 6026.7115 1.62 0.1813
Error 33 122530.7500 3713.0530    
Corrected Total 38 152664.3077      

Fit Statistics

R-Square Coeff Var Root MSE dep Mean
0.197384 46.62465 60.93483 130.6923

Type I Model ANOVA

Source DF Type I SS Mean Square F Value Pr > F
VL 1 265.84615 265.84615 0.07 0.7907
VT 1 23892.56517 23892.56517 6.43 0.0161
gend 1 1983.13781 1983.13781 0.53 0.4700
VL*gend 1 3460.57257 3460.57257 0.93 0.3414
VT*gend 1 531.43599 531.43599 0.14 0.7076

Type III Model ANOVA

Source DF Type III SS Mean Square F Value Pr > F
VL 1 11515.01731 11515.01731 3.10 0.0875
VT 1 10378.80000 10378.80000 2.80 0.1040
gend 1 357.07500 357.07500 0.10 0.7584
VL*gend 1 835.22403 835.22403 0.22 0.6384
VT*gend 1 531.43599 531.43599 0.14 0.7076

Contrasts

Contrast DF Contrast SS Mean Square F Value Pr > F
Gender comp. for Admin 1 357.07500 357.07500 0.10 0.7584
Gender comp. for labor 1 518.00192 518.00192 0.14 0.7112
Gender comp. for tech 1 5100.06944 5100.06944 1.37 0.2496
1-way ANOVA for Females 2 13377.75978 6688.87989 1.80 0.1809
1-way ANOVA for Males 2 14045.36250 7022.68125 1.89 0.1669

Solution

Parameter Estimate Standard Error t Value Pr > |t|
Intercept 78.20000000 27.25088267 2.87 0.0071
VL 61.17500000 34.73819562 1.76 0.0875
VT 55.80000000 33.37537879 1.67 0.1040
gend 13.80000000 44.50050505 0.31 0.7584
VL*gend -26.77500000 56.45385004 -0.47 0.6384
VT*gend 20.07500000 53.06347031 0.38 0.7076
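
The same simple-effects tests can usually be obtained without hand-coded indicators. A sketch, assuming the CLASS-based parameterization used earlier in this lecture: the SLICE= option of the LSMEANS statement in PROC GLM tests one factor within each level of the other.

```sas
proc glm data=depgend2;
  class type (ref="Admin") gend (ref="0");
  model dep = type|gend;
  * Test gender within each work type, and work type within each gender;
  lsmeans type*gend / slice=type slice=gend;
run;
quit;
```

The slices by type should correspond to the three gender contrasts above, and the slices by gend to the two 2-df one-way ANOVA contrasts.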

For a significant interaction, a sample write-up depends on the findings, but could look like the following (with made-up p-values):

A two-way ANOVA was conducted that examined the effect of gender and work type on depression. There was a significant interaction between the effects of gender and work type on depression (F = 4.5, p = .02).

Simple main effects analysis showed that, among males, Tech workers had an estimated 75.9 points higher depression score than Admin workers (p = .002), but no differences were found among females (p = .793).

Nonparametric ANOVA

The nonparametric analogue of the two-sample t-test is the Wilcoxon-Mann-Whitney rank-sum test (or U test). For more than two groups (the one-way ANOVA case), it is the Kruskal-Wallis test.

In [14]:
Title 'Nonparametric Analysis';
Title2 'Two group comparison: Rank-Sum Test';
proc npar1way data=depgend2 wilcoxon; 
 class gend;
 var dep ;
run;
Out[14]:
Nonparametric Analysis

Two group comparison: Rank-Sum Test

The NPAR1WAY Procedure

The NPAR1WAY Procedure

Variable dep

Wilcoxon Analysis

Scores

Wilcoxon Scores (Rank Sums) for Variable dep
Classified by Variable gend
gend N Sum of Scores Expected Under H0 Std Dev Under H0 Mean Score
1 16 353.0 320.0 35.016711 22.062500
0 23 427.0 460.0 35.016711 18.565217
Average scores were used for ties.

Two-Sample Test

Wilcoxon Two-Sample Test
Z includes a continuity correction of 0.5.
Statistic 353.0000
   
Normal Approximation  
Z 0.9281
One-Sided Pr > Z 0.1767
Two-Sided Pr > |Z| 0.3533
   
t Approximation  
One-Sided Pr > Z 0.1796
Two-Sided Pr > |Z| 0.3592

Kruskal-Wallis Test

Kruskal-Wallis Test
Chi-Square 0.8881
DF 1
Pr > Chi-Square 0.3460

Box Plot

Box Plot of Wilcoxon Scores for dep Classified by gend
In [15]:
Title2 'Three group comparison: Kruskal Wallis';
proc npar1way data=depgend2 wilcoxon; 
 class type;
 var dep ;
run;
Out[15]:
Nonparametric Analysis

Three group comparison: Kruskal Wallis

The NPAR1WAY Procedure

The NPAR1WAY Procedure

Variable dep

Wilcoxon Analysis

Scores

Wilcoxon Scores (Rank Sums) for Variable dep
Classified by Variable type
type N Sum of Scores Expected Under H0 Std Dev Under H0 Mean Score
Admin 8 87.50 160.0 28.745991 10.937500
Labor 13 280.00 260.0 33.559060 21.538462
Tech 18 412.50 360.0 35.489292 22.916667
Average scores were used for ties.

Kruskal-Wallis Test

Kruskal-Wallis Test
Chi-Square 6.4713
DF 2
Pr > Chi-Square 0.0393

Box Plot

Box Plot of Wilcoxon Scores for dep Classified by type

Simple Linear Regression

For describing the association between two quantitative variables, scatterplots and correlation are the most common tools. Recall that the correlation coefficient describes the strength and direction of the LINEAR relationship between two quantitative variables.

In [19]:
Data vote;
input VOTE TVEXP;
datalines;
    35.4            38.5
    58.2            48.3
    46.1            47.2
    45.5            34.8
    64.8            50.1
    52.0            44.0
    37.9            27.2
    48.2            37.8
    41.8            27.2
    54.0            39.1
    40.8            31.3
    61.9            45.1
    36.5            31.3
    32.7            34.8
    53.8            42.2
    24.6            29.0
    31.2            36.1
    42.6            36.5
    49.6            33.2
    56.6            46.1
    .               40
;
run;

TITLE;
TITLE2;
PROC SGPLOT DATA=vote;
SCATTER X=TVEXP Y=VOTE;
RUN;

PROC CORR DATA=vote;
VAR vote tvexp;
RUN;
Out[19]:
The SGPLOT Procedure

The CORR Procedure

Variables Information

2 Variables: VOTE TVEXP

Simple Statistics

Simple Statistics
Variable N Mean Std Dev Sum Minimum Maximum
VOTE 20 45.71000 10.81670 914.20000 24.60000 64.80000
TVEXP 21 38.08571 6.94833 799.80000 27.20000 50.10000

Pearson Correlations

Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations
  VOTE TVEXP
VOTE 1.00000 (N = 20) 0.75819 (p = 0.0001, N = 20)
TVEXP 0.75819 (p = 0.0001, N = 20) 1.00000 (N = 21)

From the scatterplot, we can see that the relationship looks linear, and with $r=0.75819$ the correlation coefficient indicates a strong positive linear relationship between money spent on TV ads and voter turnout. To quantify the relationship between these two variables, we can fit a simple linear regression model. Recall that linear regression has the following assumptions:

  • Homoscedasticity: constant variance of Y for each X
  • Independent observations
  • Linear (mean) relationship between X and Y
  • It is also usually assumed that the error distribution is Normal
In [20]:
proc reg data=vote; 
  model vote=tvexp ;
run;

proc glm data=vote; 
  model vote=tvexp ;
run;
Out[20]:
The REG Procedure

Model: MODEL1

Dependent Variable: VOTE

The REG Procedure

MODEL1

Fit

VOTE

Number of Observations

Number of Observations Read 21
Number of Observations Used 20
Number of Observations with Missing Values 1

Analysis of Variance

Analysis of Variance
Source DF Sum of Squares Mean Square F Value Pr > F
Model 1 1277.89314 1277.89314 24.34 0.0001
Error 18 945.12486 52.50694    
Corrected Total 19 2223.01800      

Fit Statistics

Root MSE 7.24617 R-Square 0.5748
Dependent Mean 45.71000 Adj R-Sq 0.5512
Coeff Var 15.85248    

Parameter Estimates

Parameter Estimates
Variable DF Parameter Estimate Standard Error t Value Pr > |t|
Intercept 1 1.91867 9.02332 0.21 0.8340
TVEXP 1 1.15271 0.23366 4.93 0.0001

The REG Procedure

Model: MODEL1

Dependent Variable: VOTE

Observation-wise Statistics

VOTE

Diagnostic Plots

Fit Diagnostics

Panel of fit diagnostics for VOTE.

Residual Plots

TVEXP

Scatter plot of residuals by TVEXP for VOTE.

Fit Plot

Scatterplot of VOTE by TVEXP overlaid with the fit line, a 95% confidence band and lower and upper 95% prediction limits.

The GLM Procedure

The GLM Procedure

Data

Number of Observations

Number of Observations Read 21
Number of Observations Used 20

The GLM Procedure

Dependent Variable: VOTE

Analysis of Variance

VOTE

Overall ANOVA

Source DF Sum of Squares Mean Square F Value Pr > F
Model 1 1277.893142 1277.893142 24.34 0.0001
Error 18 945.124858 52.506937    
Corrected Total 19 2223.018000      

Fit Statistics

R-Square Coeff Var Root MSE VOTE Mean
0.574846 15.85248 7.246167 45.71000

Type I Model ANOVA

Source DF Type I SS Mean Square F Value Pr > F
TVEXP 1 1277.893142 1277.893142 24.34 0.0001

Type III Model ANOVA

Source DF Type III SS Mean Square F Value Pr > F
TVEXP 1 1277.893142 1277.893142 24.34 0.0001

Solution

Parameter Estimate Standard Error t Value Pr > |t|
Intercept 1.918665998 9.02332069 0.21 0.8340
TVEXP 1.152706870 0.23365762 4.93 0.0001

Fit Plot

Fit Plot for VOTE by TVEXP

For this model, we get the following fitted equation

$$\hat{y}=1.92 + 1.15x$$
  • In this case, we should not interpret the y-intercept, because $x=0$ lies far outside the observed range of TVEXP (27.2 to 50.1). If we did, we would say: the expected voter turnout is 1.92% when no money is spent on TV ads.
  • Slope: the mean voter turnout increases by 1.15 percentage points for each additional \$1000 spent on TV ads.
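
As a worked example of using the fitted line (a sketch based on the unrounded estimates in the Parameter Estimates table), the predicted turnout at $x=40$, the TVEXP value of the one observation with a missing response, is

$$\hat{y}=1.91867+1.15271(40)\approx 48.03$$

In simple linear regression the coefficient of determination is the square of the correlation coefficient, which ties the PROC CORR and PROC REG output together:

$$R^2=r^2=(0.75819)^2\approx 0.5748$$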

In order to check the assumptions, we can look at the scatterplot and residual diagnostics plots output by PROC REG.
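
Beyond the default diagnostics panel, the residuals can be saved and examined directly. A minimal sketch, assuming the `vote` data set above; the output data set name `diag` and the variable names `pred` and `resid` are illustrative.

```sas
proc reg data=vote;
  model vote = tvexp;
  * Save fitted values and residuals for each observation;
  output out=diag predicted=pred residual=resid;
run;
quit;

* Residuals against fitted values: look for curvature or a funnel shape;
proc sgplot data=diag;
  scatter x=pred y=resid;
  refline 0 / axis=y;
run;

* Normality check of the residuals with tests and a Q-Q plot;
proc univariate data=diag normal;
  var resid;
  qqplot resid / normal(mu=est sigma=est);
run;
```

Because the last row of `vote` has a missing VOTE value, `diag` should contain a predicted value but no residual for that row.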
