13. Statistical Analysis in SAS¶

Now we are going to cover how to perform a variety of basic statistical tests in SAS.

Proportion tests
Chi-squared
Fisher’s Exact Test
Correlation
T-tests/Rank-sum tests
One-way ANOVA/Kruskal-Wallis
Linear Regression
Logistic Regression
Poisson Regression

Note: We will be glossing over the statistical theory and “formulas” for these tests. There are plenty of resources online for learning more about these tests if you have not had a course covering this material. You will only be required to write code to fit or perform these test but will not be expected to interpret the results for this course.

13.1. Proportion Tests¶

To conduct a test for one proportion, we can use PROC FREQ. To get this test, we use the BINOMIAL option in the TABLES statement. As options to BINOMIAL, we can specify

p= - the null value for the hypothesis test
level= - which group to use as a “success”
CORRECT - uses a continuity correction for calculating the p-value (can be useful for small sample sizes)
CL= - can select different types of CI such as WALD, EXACT, and LOGIT.

Example

In the following example, we use a summarized dataset, where we have the counts of the "successes" and "failures". In this case, we are interested in the proportion of smokers, so we have a count of smokers and a count of non-smokers.

DATA smoke;
   INPUT smkstatus $ count;
   DATALINES;
Y 15
N 17
;
RUN;

PROC FREQ data = smoke;
   TABLES smkstatus / binomial(p = 0.5 level = "Y" CORRECT) alpha = 0.05;
   WEIGHT count;
RUN;

SAS Output

The SAS System

The FREQ Procedure


smkstatus	Frequency	Percent	Cumulative Frequency	Cumulative Percent
N	17	53.13	17	53.13
Y	15	46.88	32	100.00


Binomial Proportion
smkstatus = Y
Proportion	0.4688
ASE	0.0882
95% Lower Conf Limit	0.2802
95% Upper Conf Limit	0.6573

Exact Conf Limits
95% Lower Conf Limit	0.2909
95% Upper Conf Limit	0.6526


Test of H0: Proportion = 0.5
The asymptotic confidence limits and test include a continuity correction.
ASE under H0	0.0884
Z	-0.1768
One-sided Pr < Z	0.4298
Two-sided Pr > \|Z\|	0.8597

Sample Size = 32

Note the use of the WEIGHT statement to specify the counts for Y and N. Without this statement SAS would read our data as having 1 Y and 1 N.

The estimated proportion is 0.4688. The (asymptotic) 95% CI is (0.2802, 0.6573) and the two sided (continuity corrected) p-value for testing $H_0: p=0.5$ vs $H_a: p\neq 0.5$ is 0.8597.

Alternatively, we could have had the data listed out for each individual as follows.

DATA smoke2;
   DO i = 1 to 15;
      smkstatus = "Y";
      OUTPUT;
   END;
   DO i = 1 to 17;
      smkstatus = "N";
      OUTPUT;
   END;
   DROP i;
RUN;

PROC FREQ data = smoke2;
   TABLES smkstatus / binomial(p = 0.5 level = "Y" CORRECT) alpha = 0.05;
RUN;

SAS Output

The SAS System

The FREQ Procedure


smkstatus	Frequency	Percent	Cumulative Frequency	Cumulative Percent
N	17	53.13	17	53.13
Y	15	46.88	32	100.00


Binomial Proportion
smkstatus = Y
Proportion	0.4688
ASE	0.0882
95% Lower Conf Limit	0.2802
95% Upper Conf Limit	0.6573

Exact Conf Limits
95% Lower Conf Limit	0.2909
95% Upper Conf Limit	0.6526


Test of H0: Proportion = 0.5
The asymptotic confidence limits and test include a continuity correction.
ASE under H0	0.0884
Z	-0.1768
One-sided Pr < Z	0.4298
Two-sided Pr > \|Z\|	0.8597

Sample Size = 32

13.2. Chi-squared Test¶

To test for an association between two categorical variables, we could perform a chi-square test of independence. Again, we will use PROC FREQ with a tables statement. For 2x2 tables, a chi-square test is automatically performed, but for larger tables, we can request is by providing the CHISQ option to the tables statement. Another useful option to also specify is the EXPECTED option which provided the expected cell counts under the null hypothesis of independence. These expected cell counts are needed to assess whether or not the chi-square test is appropriate.

Example

The following example uses the Kaggle car auction dataset to test for an association between online sales and a car being a bad buy.

FILENAME cardata '/folders/myfolders/SAS_Notes/data/kaggleCarAuction.csv';

PROC IMPORT datafile = cardata out = cars dbms = CSV replace;
   getnames = yes;
   guessingrows = 1000;
RUN;

PROC FREQ data = cars;
   TABLES isbadbuy*isonlinesale / chisq expected;
RUN;

SAS Output

The SAS System

The FREQ Procedure

Frequency Expected Percent Row Pct Col Pct


Table of IsBadBuy by IsOnlineSale
IsBadBuy	IsOnlineSale
IsBadBuy	0	1	Total
0	62375 62389 85.47 97.45 87.68	1632 1618.1 2.24 2.55 88.46	64007 87.70
1	8763 8749.1 12.01 97.63 12.32	213 226.91 0.29 2.37 11.54	8976 12.30
Total	71138 97.47	1845 2.53	72983 100.00

Statistics for Table of IsBadBuy by IsOnlineSale


Statistic	DF	Value	Prob
Chi-Square	1	0.9978	0.3178
Likelihood Ratio Chi-Square	1	1.0154	0.3136
Continuity Adj. Chi-Square	1	0.9274	0.3356
Mantel-Haenszel Chi-Square	1	0.9978	0.3179
Phi Coefficient		-0.0037
Contingency Coefficient		0.0037
Cramer's V		-0.0037


Fisher's Exact Test
Cell (1,1) Frequency (F)	62375
Left-sided Pr <= F	0.1679
Right-sided Pr >= F	0.8498

Table Probability (P)	0.0177
Two-sided Pr <= P	0.3324

Sample Size = 72983

The chi-square test results in a p-value of 0.3178, or if we use the chi-square test with continuity correction, then we get a p-value of 0.3356.

In the 2x2 case, as in this example, we may also want measures of effert such as the risk difference, relative risk and odds ratio. We can obtain these using the RISKDIFF, RELRISK, and OR options which will request all three measures with confidence intervals.

PROC FREQ data = cars;
   TABLES isbadbuy*isonlinesale / RISKDIFF RELRISK OR;
RUN;

SAS Output

The SAS System

The FREQ Procedure

Frequency Percent Row Pct Col Pct


Table of IsBadBuy by IsOnlineSale
IsBadBuy	IsOnlineSale
IsBadBuy	0	1	Total
0	62375 85.47 97.45 87.68	1632 2.24 2.55 88.46	64007 87.70
1	8763 12.01 97.63 12.32	213 0.29 2.37 11.54	8976 12.30
Total	71138 97.47	1845 2.53	72983 100.00

Statistics for Table of IsBadBuy by IsOnlineSale


Column 1 Risk Estimates
	Risk	ASE	95% Confidence Limits		Exact 95% Confidence Limits
Difference is (Row 1 - Row 2)
Row 1	0.9745	0.0006	0.9733	0.9757	0.9733	0.9757
Row 2	0.9763	0.0016	0.9731	0.9794	0.9729	0.9793
Total	0.9747	0.0006	0.9736	0.9759	0.9736	0.9758
Difference	-0.0018	0.0017	-0.0051	0.0016


Column 2 Risk Estimates
	Risk	ASE	95% Confidence Limits		Exact 95% Confidence Limits
Difference is (Row 1 - Row 2)
Row 1	0.0255	0.0006	0.0243	0.0267	0.0243	0.0267
Row 2	0.0237	0.0016	0.0206	0.0269	0.0207	0.0271
Total	0.0253	0.0006	0.0241	0.0264	0.0242	0.0264
Difference	0.0018	0.0017	-0.0016	0.0051


Odds Ratio and Relative Risks
Statistic	Value	95% Confidence Limits
Odds Ratio	0.9290	0.8040	1.0735
Relative Risk (Column 1)	0.9982	0.9947	1.0016
Relative Risk (Column 2)	1.0745	0.9331	1.2373

Sample Size = 72983

For the risk difference, SAS provides two tables that compare the conditional row proportions in the first column and the conditional row proportions in the second column. Similarly, for the relative risk, we get a relative risk for the first and the second column. This allows us to pick the one that matters to us depending on which column corresponds to the outcome of interest.

13.3. Fisher’s Exact Test¶

An alternative way to test for an association between two categorical variables is Fisher’s exact test. This test is a nonparametric test that makes no assumption other than that we have a random sample. Note, however, that this comes with a price. The more levels our variables have and the more observations we have will increase the computing time needed to perform this test. For 2x2 tables, this test is usally very quick, but for 5x5 tables, depending on how much data and what computer you are using, this test may take hours to complete.

For 2x2 tables, this test is automatically output. For larger tables, if you want this test, then you will need to specify the FISHER option in the TABLES statement.

Example

The following SAS program uses Fisher's exact test to test for an association between a car being a bad buy and buying the car online.

PROC FREQ data = cars;
   TABLES isbadbuy*isonlinesale / FISHER;
RUN;

SAS Output

The SAS System

The FREQ Procedure

Frequency Percent Row Pct Col Pct


Table of IsBadBuy by IsOnlineSale
IsBadBuy	IsOnlineSale
IsBadBuy	0	1	Total
0	62375 85.47 97.45 87.68	1632 2.24 2.55 88.46	64007 87.70
1	8763 12.01 97.63 12.32	213 0.29 2.37 11.54	8976 12.30
Total	71138 97.47	1845 2.53	72983 100.00

Statistics for Table of IsBadBuy by IsOnlineSale


Statistic	DF	Value	Prob
Chi-Square	1	0.9978	0.3178
Likelihood Ratio Chi-Square	1	1.0154	0.3136
Continuity Adj. Chi-Square	1	0.9274	0.3356
Mantel-Haenszel Chi-Square	1	0.9978	0.3179
Phi Coefficient		-0.0037
Contingency Coefficient		0.0037
Cramer's V		-0.0037


Fisher's Exact Test
Cell (1,1) Frequency (F)	62375
Left-sided Pr <= F	0.1679
Right-sided Pr >= F	0.8498

Table Probability (P)	0.0177
Two-sided Pr <= P	0.3324

Sample Size = 72983

The p-value for Fisher's exact test is 0.3324.

13.4. Correlation¶

SAS’s CORR procedure can perform correlation analysis by providing both the parametric Pearson’s correlation and the nonparametric Spearman’s rank correlation coefficients and hypothesis tests. The default correlation output is Pearson’s. To request the Spearman’s rank correlation, add the SPREAMAN option to the PROC CORR statement.

Example

Let's look at some examples using PROC CORR using the Charm City Circulator bus ridership dataset. The following SAS program will find the Pearson correlation and hypothesis test results for the correlation between the average daily ridership between the orange and purple bus lines.

FILENAME busdata '/folders/myfolders/SAS_Notes/data/Charm_City_Circulator_Ridership.csv';

PROC IMPORT datafile = busdata out = circ dbms = CSV replace;
   getnames = yes;
   guessingrows = 1000;
RUN;

PROC CORR data = circ;
  VAR orangeAverage purpleAverage;
RUN;

SAS Output

The SAS System

The CORR Procedure


2 Variables:	orangeAverage purpleAverage


Simple Statistics
Variable	N	Mean	Std Dev	Sum	Minimum	Maximum
orangeAverage	1136	3033	1228	3445671	0	6927
purpleAverage	993	4017	1407	3988816	0	8090


Pearson Correlation Coefficients Prob > \|r\| under H0: Rho=0 Number of Observations
	orangeAverage	purpleAverage
orangeAverage	1.00000 1136	0.91954 <.0001 993
purpleAverage	0.91954 <.0001 993	1.00000 993

Example

We can also get a correlation matrix for multiple variables at the same time. The following example also uses the NOMISS option to only use complete observations instead of pairwise complete observations when calculating the correlations. Here we get the correlation matrix between average ridership counts between all four of the orange, purple, banner, and green bus lines.

PROC CORR data = circ NOMISS;
   VAR orangeAverage purpleAverage greenAverage bannerAverage;
RUN;

SAS Output

The SAS System

The CORR Procedure


4 Variables:	orangeAverage purpleAverage greenAverage bannerAverage


Simple Statistics
Variable	N	Mean	Std Dev	Sum	Minimum	Maximum
orangeAverage	270	3859	1095	1041890	0	6927
purpleAverage	270	4552	1297	1228935	0	8090
greenAverage	270	2090	556.00353	564213	0	3879
bannerAverage	270	827.26852	436.04872	223363	0	4617


Pearson Correlation Coefficients, N = 270 Prob > \|r\| under H0: Rho=0
	orangeAverage	purpleAverage	greenAverage	bannerAverage
orangeAverage	1.00000	0.90788 <.0001	0.83958 <.0001	0.54470 <.0001
purpleAverage	0.90788 <.0001	1.00000	0.86656 <.0001	0.52135 <.0001
greenAverage	0.83958 <.0001	0.86656 <.0001	1.00000	0.45334 <.0001
bannerAverage	0.54470 <.0001	0.52135 <.0001	0.45334 <.0001	1.00000

If we don't want all pairwise correlations, but instead only specific pairs, then we can use the WITH statement as in the following example.

PROC CORR data = circ NOMISS;
   VAR orangeAverage purpleAverage;
   WITH greenAverage bannerAverage;
RUN;

SAS Output

The SAS System

The CORR Procedure


2 With Variables:	greenAverage bannerAverage
2 Variables:	orangeAverage purpleAverage


Simple Statistics
Variable	N	Mean	Std Dev	Sum	Minimum	Maximum
greenAverage	270	2090	556.00353	564213	0	3879
bannerAverage	270	827.26852	436.04872	223363	0	4617
orangeAverage	270	3859	1095	1041890	0	6927
purpleAverage	270	4552	1297	1228935	0	8090


Pearson Correlation Coefficients, N = 270 Prob > \|r\| under H0: Rho=0
	orangeAverage	purpleAverage
greenAverage	0.83958 <.0001	0.86656 <.0001
bannerAverage	0.54470 <.0001	0.52135 <.0001

To get Spearman’s rank correlation instead of Pearson’s correlation, add the SPEARMAN option to the PROC CORR statement.

Example

The following SAS program produces Spearman's rank correlation coefficient and associated p-value for the hypothesis test of the correlation is 0 between the average daily ridership counts betwen the orange and purple bus lines.

PROC CORR data = circ SPEARMAN;
  VAR orangeAverage purpleAverage;
RUN;

SAS Output

The SAS System

The CORR Procedure


2 Variables:	orangeAverage purpleAverage


Simple Statistics
Variable	N	Mean	Std Dev	Median	Minimum	Maximum
orangeAverage	1136	3033	1228	2968	0	6927
purpleAverage	993	4017	1407	4223	0	8090


Spearman Correlation Coefficients Prob > \|r\| under H0: Rho=0 Number of Observations
	orangeAverage	purpleAverage
orangeAverage	1.00000 1136	0.91455 <.0001 993
purpleAverage	0.91455 <.0001 993	1.00000 993

13.5. T-Tests¶

T-tests can be performed in SAS with the TTEST procedure including

one sample t-test
paired t-test
Two sample t-test

Example

In this example, we will test if the average daily ridership on the orange bus line is greater than 3000 using a one sample t-test.

PROC TTEST data = circ H0 = 3000 SIDE = U;
   VAR orangeAverage;
RUN;

SAS Output

The SAS System

The TTEST Procedure

Variable: orangeAverage


N	Mean	Std Dev	Std Err	Minimum	Maximum
1136	3033.2	1227.6	36.4217	0	6926.5


Mean	95% CL Mean		Std Dev	95% CL Std Dev
3033.2	2973.2	Infty	1227.6	1179.1	1280.3


DF	t Value	Pr > t
1135	0.91	0.1814

The H0= option specifies the null value in the t-test and the SIDE= option specifies whether you want a less than (L), greater than (U), or not equal to (2) test. The default values are 0 for the null hypothesis value and two sided (2) for the alternative hypothesis. The output provides some summary statistics, the p-value for the test, confidence interval and a histogram and QQ plot to assess the normality assumption.

From the output, we find the p-value to be 0.1814. Since we requested a one-side test, we get a one-sided confidence interval. To get our usual (two-sided) confidence interval, we need to request a two-sided test.

For a two sample t-test, we need to have the data formatted in two columns:

A data column that contains the quantitative data for both groups
A grouping variable column that indicates the group for the data value in that row.

In PROC TTEST, we put the data variable in the VAR statement and the grouping variable in the CLASS statement to get a two sample t-test.

Example

In the following SAS program, we perform a two-sample t-test between the orange and purple bus lines' average ridership counts. We will first have to transform the data to meet the required data format for PROC TTEST.

DATA circ_sub;
  SET circ;
  count = orangeAverage;
  group = "orange";
  OUTPUT;
  count = purpleAverage;
  group = "purple";
  OUTPUT;
  KEEP count group;
RUN;

PROC TTEST data = circ_sub;
  VAR count;
  CLASS group;
RUN;

SAS Output

The SAS System

The TTEST Procedure

Variable: count


group	Method	N	Mean	Std Dev	Std Err	Minimum	Maximum
orange		1136	3033.2	1227.6	36.4217	0	6926.5
purple		993	4016.9	1406.7	44.6388	0	8089.5
Diff (1-2)	Pooled		-983.8	1314.1	57.0906
Diff (1-2)	Satterthwaite		-983.8		57.6122


group	Method	Mean	95% CL Mean		Std Dev	95% CL Std Dev
orange		3033.2	2961.7	3104.6	1227.6	1179.1	1280.3
purple		4016.9	3929.3	4104.5	1406.7	1347.4	1471.4
Diff (1-2)	Pooled	-983.8	-1095.7	-871.8	1314.1	1275.8	1354.9
Diff (1-2)	Satterthwaite	-983.8	-1096.8	-870.8


Method	Variances	DF	t Value	Pr > \|t\|
Pooled	Equal	2127	-17.23	<.0001
Satterthwaite	Unequal	1984	-17.08	<.0001


Equality of Variances
Method	Num DF	Den DF	F Value	Pr > F
Folded F	992	1135	1.31	<.0001

The SAS output contains summary statistics for each group, confidence intervals for each group mean, confidence intervals for the difference of the two means, hypothesis tests for the difference of the two means, and the F test for equality of variances. The Pooled row corresponds to the two sample t-test which assumes the population variances are equal between the two groups while the Satterthwaite assumes that the population variances are unequal.

Note that the data here are really matched pairs data, since we have average ridership counts matched by date between the two bus lines. We will explore the paired t-test next.

To perform a paired t-test, we need to use the PAIRED statement. In this case, SAS assumes the data from each group are in two separate columns where observations in the same row correspond to the matched pairs.

Example

The following SAS program performs a paired t-test betwen the average ridership counts between the orange and purple bus lines.

PROC TTEST data = circ;
   PAIRED orangeAverage*purpleAverage;
RUN;

SAS Output

The SAS System

The TTEST Procedure

Difference: orangeAverage - purpleAverage


N	Mean	Std Dev	Std Err	Minimum	Maximum
993	-764.1	572.3	18.1613	-2998.0	2504.5


Mean	95% CL Mean		Std Dev	95% CL Std Dev
-764.1	-799.8	-728.5	572.3	548.2	598.6


DF	t Value	Pr > \|t\|
992	-42.08	<.0001

Summary Panel for Difference of orangeAverage and purpleAverage

Profiles Plot for orangeAverage and purpleAverage

Agreement Plot for orangeAverage and purpleAverage

Q-Q Plot for Difference of orangeAverage and purpleAverage

13.6. Nonparametric Alternatives to the T-Tests¶

In the case that we have a small sample size and the data cannot be assumed to be from populations that are Normally distributed, we need to use a nonparametric test. For the t-tests we have the following possible alternative tests:

The sign test or the Wilcoxon signed rank test as alternative to the one sample t-test or the paired t-test.
The Wilcoxon rank sum test as an alternative to the two sample t-test.

To perform a Wilcoxon rank sum test, we use PROC NPAR1WAY.

Example

In the following example, we use PROC NPAR1WAY to perform Wilcoxon rank sum test to compare median daily ridership counts between the orange and purple bus lines.

PROC NPAR1WAY data = circ_sub WILCOXON;
  VAR count;
  CLASS group;
RUN;

SAS Output

The SAS System

The NPAR1WAY Procedure


Wilcoxon Scores (Rank Sums) for Variable count Classified by Variable group
group	N	Sum of Scores	Expected Under H0	Std Dev Under H0	Mean Score
Average scores were used for ties.
orange	1136	982529.50	1209840.0	14150.2115	864.90273
purple	993	1284855.50	1057545.0	14150.2115	1293.91289


Wilcoxon Two-Sample Test
Statistic	Z	Pr > Z	Pr > \|Z\|	t Approximation
Statistic	Z	Pr > Z	Pr > \|Z\|	Pr > Z	Pr > \|Z\|
Z includes a continuity correction of 0.5.
1284856	16.0641	<.0001	<.0001	<.0001	<.0001


Kruskal-Wallis Test
Chi-Square	DF	Pr > ChiSq
258.0555	1	<.0001

Box Plot of Wilcoxon Scores for count Classified by group

In order to perform a sign test or Wilcoxon signed rank test, we must first calculate the paired differences between the matched pairs, and then pass the differences to PROC UNIVARIATE.

Example

The following SAS program performs uses PROC UNIVARIATE to obtain the sign test and Wilcoxon signed rank test p-values for testing for a median difference in ridership counts between the orange and purple bus lines.

DATA circ_diff;
  SET circ;
  diff = orangeAverage - purpleAverage;
RUN;

PROC UNIVARIATE data = circ_diff;
  VAR diff;
RUN;

SAS Output

The SAS System

The UNIVARIATE Procedure

Variable: diff


Moments
N	993	Sum Weights	993
Mean	-764.14401	Sum Observations	-758795
Std Deviation	572.297901	Variance	327524.888
Skewness	0.73397459	Kurtosis	2.88082288
Uncorrected SS	904733341	Corrected SS	324904688
Coeff Variation	-74.893985	Std Error Mean	18.1613249


Basic Statistical Measures
Location		Variability
Mean	-764.144	Std Deviation	572.29790
Median	-788.000	Variance	327525
Mode	0.000	Range	5503
		Interquartile Range	732.50000


Tests for Location: Mu0=0
Test	Statistic		p Value
Student's t	t	-42.0753	Pr > \|t\|	<.0001
Sign	M	-424.5	Pr >= \|M\|	<.0001
Signed Rank	S	-227467	Pr >= \|S\|	<.0001


Quantiles (Definition 5)
Level	Quantile
100% Max	2504.5
99%	912.5
95%	131.5
90%	-72.0
75% Q3	-426.5
50% Median	-788.0
25% Q1	-1159.0
10%	-1433.0
5%	-1596.0
1%	-1980.0
0% Min	-2998.0


Extreme Observations
Lowest		Highest
Value	Obs	Value	Obs
-2998.0	552	1360.5	171
-2649.5	999	1418.5	672
-2365.0	635	1933.5	743
-2300.5	553	2484.0	674
-2049.5	188	2504.5	709


Missing Values
Missing Value	Count	Percent Of
Missing Value	Count	All Obs	Missing Obs
.	153	13.35	100.00

PROC UNIVARIATE provides lots of default output. The p-values for the sign test and signed rank test can be found in the test for location table.

13.7. One-way ANOVA and the Kruskal-Wallis Test¶

When we wish to compare means between more than two independent groups, we can perform a one-way ANOVA or in the small sample case a Kruskal-Wallis test. A on-way ANOVA can be performed in SAS by using PROC GLM.

Example

The following SAS program performs a one-way ANOVA to test for equality of mean ridership counts between the orange, purple, and green bus lines.

DATA circ_aov;
   SET circ;
   count = orangeAverage;
   group = "orange";
   OUTPUT;
   count = purpleAverage;
   group = "purple";
   OUTPUT;
   count = greenAverage;
   group = "green";
   OUTPUT;
   KEEP count group;
RUN;

PROC GLM data = circ_aov;
  CLASS group;
  MODEL count = group;
RUN;

SAS Output

The SAS System

The GLM Procedure


Class Level Information
Class	Levels	Values
group	3	green orange purple


Number of Observations Read	3438
Number of Observations Used	2614

The SAS System

The GLM Procedure

Dependent Variable: count


Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	2	1442596863	721298432	490.02	<.0001
Error	2611	3843371488	1471992
Corrected Total	2613	5285968351


R-Square	Coeff Var	Root MSE	count Mean
0.272911	37.82740	1213.257	3207.349


Source	DF	Type I SS	Mean Square	F Value	Pr > F
group	2	1442596863	721298432	490.02	<.0001


Source	DF	Type III SS	Mean Square	F Value	Pr > F
group	2	1442596863	721298432	490.02	<.0001

If instead we wanted to perform a Kruskal-Wallis test, we would use PROC NPAR1WAY.

PROC NPAR1WAY data = circ_aov WILCOXON;
   CLASS group;
   VAR count;
RUN;

SAS Output

The SAS System

The NPAR1WAY Procedure


Wilcoxon Scores (Rank Sums) for Variable count Classified by Variable group
group	N	Sum of Scores	Expected Under H0	Std Dev Under H0	Mean Score
Average scores were used for ties.
orange	1136	1395578.00	1485320.00	19128.0882	1228.50176
purple	993	1720911.50	1298347.50	18728.8588	1733.04280
green	485	301315.50	634137.50	15000.4361	621.26907


Kruskal-Wallis Test
Chi-Square	DF	Pr > ChiSq
729.0668	2	<.0001

13.8. Linear Regression¶

In SAS, there are two procedures that can be used to fit a linear regression model:

PROC REG
PROc GLM

PROC REG will give you most of the standard output, but the model statement requires all variables to be calculated in a prior DATA step, such as interaction terms. PROC GLM, however, allows you to calculate interaction terms on the fly in the MODEL statement. Generally, I prefer PROC GLM for this reason, but either PROC will work and can be used to get all the standard regression output.

Another reason I prefer PROC GLM over PROC REG is that PROC REG does not have a CLASS statement, so you must do all the dummy coding for categorical variables manually in a DATA step when using PROC REG.

Let’s look at a few examples using both PROCs.

Example

The first example fits a simple linear regression model with a single binary predictor. Note that in this case, the t-test for the slope is equivalent to a two sample t-test.

DATA circ_sub;
   SET circ_sub;
   IF group = "orange" THEN grp_bin = 1;
   ELSE grp_bin = 0;
RUN;

PROC REG data = circ_sub;
   MODEL count = grp_bin;
RUN;

SAS Output

The SAS System

The REG Procedure

Model: MODEL1

Dependent Variable: count


Number of Observations Read	2292
Number of Observations Used	2129
Number of Observations with Missing Values	163


Analysis of Variance
Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	1	512793031	512793031	296.93	<.0001
Error	2127	3673232559	1726955
Corrected Total	2128	4186025589


Root MSE	1314.13647	R-Square	0.1225
Dependent Mean	3492.00892	Adj R-Sq	0.1221
Coeff Var	37.63268


Parameter Estimates
Variable	DF	Parameter Estimate	Standard Error	t Value	Pr > \|t\|
Intercept	1	4016.93454	41.70286	96.32	<.0001
grp_bin	1	-983.77345	57.09059	-17.23	<.0001

The SAS System

The REG Procedure

Model: MODEL1

Dependent Variable: count

Scatter plot of residuals by grp_bin for count.

Scatterplot of count by grp_bin overlaid with the fit line, a 95% confidence band and lower and upper 95% prediction limits.

PROC GLM data = circ_sub PLOTS=DIAGNOSTICS;
  CLASS group(ref = 'purple');
  MODEL count = group / solution;
RUN;

SAS Output

The SAS System

The GLM Procedure


Class Level Information
Class	Levels	Values
group	2	orange purple


Number of Observations Read	2292
Number of Observations Used	2129

The SAS System

The GLM Procedure

Dependent Variable: count


Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	1	512793031	512793031	296.93	<.0001
Error	2127	3673232559	1726955
Corrected Total	2128	4186025589


R-Square	Coeff Var	Root MSE	count Mean
0.122501	37.63268	1314.136	3492.009


Source	DF	Type I SS	Mean Square	F Value	Pr > F
group	1	512793030.6	512793030.6	296.93	<.0001


Source	DF	Type III SS	Mean Square	F Value	Pr > F
group	1	512793030.6	512793030.6	296.93	<.0001


Parameter	Estimate		Standard Error	t Value	Pr > \|t\|
Intercept	4016.934542	B	41.70286032	96.32	<.0001
group orange	-983.773450	B	57.09058700	-17.23	<.0001
group purple	0.000000	B	.	.	.

Note:The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely estimable.

Note that we get the same output from both procedures, but with PROC REG we had to manually code the group variable as a 0/1 dummy variable, whereas in PROC GLM we could use the CLASS statement with a ref= statement to select the reference category. We also need to request the residual diagnostic plots in PROC GLM as this is not default output.

Now that we have seen how to get the same output from both PROC GLM and PROC REG, we will use PROC GLM for all the remaining examples to avoid needing to use a DATA step to calculate dummy variables and interaction terms.

Example

In the following SAS program, we will fit linear regression models with more than one predictor using the Kaggle car auction dataset. First, let's fit a simple linear regression model and build on to it by adding more variables.

PROC GLM data = cars;
   MODEL VehOdo = VehicleAge;
RUN;

SAS Output

The SAS System

The GLM Procedure


Number of Observations Read	72983
Number of Observations Used	72983

The SAS System

The GLM Procedure

Dependent Variable: VehOdo


Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	1	1.5863773E12	1.5863773E12	8313.88	<.0001
Error	72981	1.3925561E13	190810767.21
Corrected Total	72982	1.5511938E13


R-Square	Coeff Var	Root MSE	VehOdo Mean
0.102268	19.31948	13813.43	71500.00


Source	DF	Type I SS	Mean Square	F Value	Pr > F
VehicleAge	1	1.5863773E12	1.5863773E12	8313.88	<.0001


Source	DF	Type III SS	Mean Square	F Value	Pr > F
VehicleAge	1	1.5863773E12	1.5863773E12	8313.88	<.0001


Parameter	Estimate	Standard Error	t Value	Pr > \|t\|
Intercept	60127.24037	134.8017988	446.04	<.0001
VehicleAge	2722.94117	29.8632078	91.18	<.0001

Now let's add another varialbe, in this case, the binary variable IsBadBuy. This variable is alread a 0/1 dummy variable, so we don't need to put it in a class statement (but we could if we wanted to and still get the same output by choosing the matching reference category).

PROC GLM data = cars;
   MODEL VehOdo = VehicleAge IsBadBuy;
RUN;

SAS Output

The SAS System

The GLM Procedure


Number of Observations Read	72983
Number of Observations Used	72983

The SAS System

The GLM Procedure

Dependent Variable: VehOdo


Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	2	1.5998928E12	799946379347	4196.37	<.0001
Error	72980	1.3912045E13	190628187.46
Corrected Total	72982	1.5511938E13


R-Square	Coeff Var	Root MSE	VehOdo Mean
0.103139	19.31023	13806.82	71500.00


Source	DF	Type I SS	Mean Square	F Value	Pr > F
VehicleAge	1	1.5863773E12	1.5863773E12	8321.84	<.0001
IsBadBuy	1	13515481118	13515481118	70.90	<.0001


Source	DF	Type III SS	Mean Square	F Value	Pr > F
VehicleAge	1	1.4941599E12	1.4941599E12	7838.08	<.0001
IsBadBuy	1	13515481118	13515481118	70.90	<.0001


Parameter	Estimate	Standard Error	t Value	Pr > \|t\|
Intercept	60141.77139	134.7483412	446.33	<.0001
VehicleAge	2680.32758	30.2749125	88.53	<.0001
IsBadBuy	1329.00242	157.8350951	8.42	<.0001

Note that when adding multiple predictors in the MODEL statement, they are separated by a space instead of a + symbol. To add an interaction, we can create the individual interaction term using * for multiplication while still including the main effect terms or we can use the shorthand | to create all three at the same time.

PROC GLM data = cars;
   MODEL VehOdo = VehicleAge IsBadBuy VehicleAge*IsBadBuy;
   *MODEL VehOdo = VehicleAge|IsBadBuy;
RUN;

SAS Output

The SAS System

The GLM Procedure


Number of Observations Read	72983
Number of Observations Used	72983

The SAS System

The GLM Procedure

Dependent Variable: VehOdo


Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	3	1.5998931E12	533297702109	2797.54	<.0001
Error	72979	1.3912045E13	190630794.79
Corrected Total	72982	1.5511938E13


R-Square	Coeff Var	Root MSE	VehOdo Mean
0.103139	19.31037	13806.91	71500.00


Source	DF	Type I SS	Mean Square	F Value	Pr > F
VehicleAge	1	1.5863773E12	1.5863773E12	8321.73	<.0001
IsBadBuy	1	13515481118	13515481118	70.90	<.0001
VehicleAge*IsBadBuy	1	347632.54102	347632.54102	0.00	0.9659


Source	DF	Type III SS	Mean Square	F Value	Pr > F
VehicleAge	1	1.2937202E12	1.2937202E12	6786.52	<.0001
IsBadBuy	1	1662398367	1662398367	8.72	0.0031
VehicleAge*IsBadBuy	1	347632.54068	347632.54068	0.00	0.9659


Parameter	Estimate	Standard Error	t Value	Pr > \|t\|
Intercept	60139.69756	143.2332682	419.87	<.0001
VehicleAge	2680.83719	32.5421914	82.38	<.0001
IsBadBuy	1347.28218	456.2338897	2.95	0.0031
VehicleAge*IsBadBuy	-3.78953	88.7403782	-0.04	0.9659

To get the residuals and predicted values, use the OUTPUT statement. We can even get predicted values for new data by adding rows to the dataset and setting the response variable to missing.

Example

In the following example, we will extract the residuals and the predicted values using the OUTPUT statement. We will also add an additional new point to the dataset to get a new predicted value.

DATA new;
   INPUT VehOdo VehicleAge IsBadBuy;
   DATALINES;
. 6 1
. 5 0
;
RUN;

DATA cars_new;
   SET cars(keep = VehOdo VehicleAge IsBadBuy) 
       new;
RUN;

PROC GLM data = cars_new noprint;
   MODEL VehOdo = VehicleAge IsBadBuy VehicleAge*IsBadBuy;
   OUTPUT out=res_pred residuals = resid predicted = fitted;
RUN;

PROC SGPLOT data = res_pred;
   HISTOGRAM resid;
RUN;

PROC PRINT data = res_pred (FIRSTOBS=72984);
RUN;

SAS Output

The SAS System


Obs	IsBadBuy	VehicleAge	VehOdo	resid	fitted
72984	1	6	.	.	77549.27
72985	0	5	.	.	73543.88

The missing values for the response, VehOdo, in the new observations keep these rows from being used to fit the model, but since we have values for all the predictors in the model a predicted value is still calculated in the OUTPUT dataset. Recall, the predicted values are found by plugging in the predictor values into the fitted regression equation. For example, for the first new data value:

$$\widehat{y}=60139.7 + 1347.28 + 2680.84*6 -3.79*6=77549.28$$

13.9. Logistic Regression¶

Generalized Linear Models (GLMs) allow for fitting regressions for non-continuous/normal outcomes. The glm has similar syntax to the lm command. Logistic regression is one example.

In a (simple) logistic regression model, we have a binary response Y and a predictor x. It is assumed that given the predictor, $Y\sim\text{Bernoulli(p(x))}$ where $p(x)=P(Y=1|x)$ and

\[\log\left(\dfrac{P(Y=1|x)}{1-P(Y=1|x)}\right)=\beta_0+\beta_1x\]

That is the log-odds of success changes linearly with x. It then follows that $e^{\beta_1}$ is the odds ratio of success for a one unit increase in x.

In SAS, there are two procedures that can be used to fit a logistic regression model

PROC LOGISTIC
PROC GENMOD

Generally, I use PROC LOGISTIC as it is made specifically for logistic regression and provides many extras that PROC GENMOD does not.

Example

The following example uses PROC LOGISTIC to fit a logistic regression model with IsBadBuy as the binary response and VehOdo and VehicleAge as predictors.

PROC LOGISTIC data = cars;
   MODEL isbadbuy(event='1') = vehodo vehicleage / CLPARM=WALD CLODDS=WALD;
RUN;

SAS Output

The SAS System

The LOGISTIC Procedure


Model Information
Data Set	WORK.CARS
Response Variable	IsBadBuy
Number of Response Levels	2
Model	binary logit
Optimization Technique	Fisher's scoring


Number of Observations Read	72983
Number of Observations Used	72983


Response Profile
Ordered Value	IsBadBuy	Total Frequency
1	0	64007
2	1	8976

Probability modeled is IsBadBuy='1'.


Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.


Model Fit Statistics
Criterion	Intercept Only	Intercept and Covariates
AIC	54423.307	52352.263
SC	54432.505	52379.857
-2 Log L	54421.307	52346.263


Testing Global Null Hypothesis: BETA=0
Test	Chi-Square	DF	Pr > ChiSq
Likelihood Ratio	2075.0443	2	<.0001
Score	2108.2791	2	<.0001
Wald	2025.3779	2	<.0001


Analysis of Maximum Likelihood Estimates
Parameter	DF	Estimate	Standard Error	Wald Chi-Square	Pr > ChiSq
Intercept	1	-3.7782	0.0638	3505.9533	<.0001
VehOdo	1	8.341E-6	8.526E-7	95.6991	<.0001
VehicleAge	1	0.2681	0.00677	1567.3289	<.0001


Association of Predicted Probabilities and Observed Responses
Percent Concordant	64.4	Somers' D	0.288
Percent Discordant	35.6	Gamma	0.288
Percent Tied	0.0	Tau-a	0.062
Pairs	574526832	c	0.644


Parameter Estimates and Wald Confidence Intervals
Parameter	Estimate	95% Confidence Limits
Intercept	-3.7782	-3.9033	-3.6531
VehOdo	8.341E-6	6.67E-6	0.000010
VehicleAge	0.2681	0.2548	0.2814


Odds Ratio Estimates and Wald Confidence Intervals
Effect	Unit	Estimate	95% Confidence Limits
VehOdo	1.0000	1.000	1.000	1.000
VehicleAge	1.0000	1.307	1.290	1.325

Plot of Odds Ratios with 95% Wald Confidence Limits

The CLPARM= and CLODDS= options request confidence intervals for the parameter estimates and corresponding odds ratios.

13.10. Poisson Regression¶

Poisson regression is used for count responses. This model assumes that (in the case of a single predictor) that $Y|x\sim\text{Poisson}(\lambda(x))$, where $\lambda(x)=E[Y|x]$, and for the case of a single predictor

\[\log(E[Y|x])=\beta_0+\beta_1x.\]

Then $e^{\beta_1}$ represents the rate ratio for a one unit increase in x. To fit such a model, we will use PROC GENMOD.

Example

The following SAS program fits a Poisson regression model to the count response of the daily ridership count on the orange bus line with day of the week as the predictor.

PROC GENMOD data = circ;
   CLASS day(ref='Friday');
   MODEL orangeBoardings = day / dist = Poisson link = log;
RUN;

SAS Output

The SAS System

The GENMOD Procedure


Model Information
Data Set	WORK.CIRC
Distribution	Poisson
Link Function	Log
Dependent Variable	orangeBoardings


Number of Observations Read	1146
Number of Observations Used	1079
Missing Values	67


Class Level Information
Class	Levels	Values
day	7	Monday Saturday Sunday Thursday Tuesday Wednesday Friday


Criteria For Assessing Goodness Of Fit
Criterion	DF	Value	Value/DF
Deviance	1072	465776.5310	434.4930
Scaled Deviance	1072	465776.5310	434.4930
Pearson Chi-Square	1072	425148.2927	396.5936
Scaled Pearson X2	1072	425148.2927	396.5936
Log Likelihood		23002386.843
Full Log Likelihood		-238139.4730
AIC (smaller is better)		476292.9459
AICC (smaller is better)		476293.0505
BIC (smaller is better)		476327.8324


Algorithm converged.


Analysis Of Maximum Likelihood Parameter Estimates
Parameter		DF	Estimate	Standard Error	Wald 95% Confidence Limits		Wald Chi-Square	Pr > ChiSq
Intercept		1	8.2279	0.0013	8.2254	8.2305	3.954E7	<.0001
day	Monday	1	-0.1964	0.0019	-0.2002	-0.1926	10163.2	<.0001
day	Saturday	1	-0.2691	0.0020	-0.2730	-0.2652	18119.2	<.0001
day	Sunday	1	-0.6897	0.0023	-0.6942	-0.6852	90824.2	<.0001
day	Thursday	1	-0.1523	0.0019	-0.1561	-0.1485	6213.44	<.0001
day	Tuesday	1	-0.1719	0.0019	-0.1757	-0.1681	7860.04	<.0001
day	Wednesday	1	-0.1396	0.0019	-0.1434	-0.1359	5260.55	<.0001
day	Friday	0	0.0000	0.0000	0.0000	0.0000	.	.
Scale		0	1.0000	0.0000	1.0000	1.0000

Note:The scale parameter was held fixed.

In the model statement, we need to specify the dist=Poisson option to specify that the response is assumed to be Poisson and that we are using the log link via the link=log option.

In the case that an offset is desired in a Poisson regression model, we can use the OFFSET= option in the model statement. Note that when using this option, we must take the log() of the offset value ourselves.

13.11. Exercises¶

These exercises will use the child mortality dataset, indicatordeadkids35.csv, and the Kaggle car auction dataset, kaggleCarAuction.csv. Modify the following code to read in this dataset.

FILENAME cardata '/folders/myfolders/SAS_Notes/data/kaggleCarAuction.csv';

PROC IMPORT datafile = cardata out = cars dbms = CSV replace;
   getnames = yes;
   guessingrows = 1000;
RUN;

FILENAME mortdat '/folders/myfolders/SAS_Notes/data/indicatordeadkids35.csv';

PROC IMPORT datafile = mortdat out = mort dbms = CSV replace;
   getnames = yes;
   guessingrows = 500;
RUN;

Compute the correlation between the 1980, 1990, 2000, and 2010 mortality data. Just display the result to the screen. Then compute using the NOMMISS option. (Note: The column names are numbers, which are invalid standard SAS names, so to refer to the variable 1980 in your code use ‘1980’n.)
a. Compute the correlation between the Myanmar, China, and United States mortality data. Store this correlation matrix in an object called country_cor using ODS OUTPUT. b. Extract the Myanmar-US correlation from the correlation matrix.
Is there a difference between mortality information from 1990 and 2000? Run a paired t-test and a Wilcoxon signed rank test to assess this. Hint: to extract the column of information for 1990, use ‘1990’n.
Using the cars dataset, fit a linear regression model with vehicle cost (VehBCost) as the outcome and vehicle age (VehicleAge) and whether it’s an online sale (IsOnlineSale) as predictors as well as their interaction.
Create a variable called expensive in the cars data that indicates if the vehicle cost is over $10,000. Use a chi-squared test to assess if there is a relationship between a car being expensive and it being labeled as a “bad buy” (IsBadBuy).
Fit a logistic regression model where the outcome is “bad buy” status and predictors are the expensive status and vehicle age (VehicleAge). Request confidence intervals for the odds ratios.

12. The Output Delivery System and Graphics 14. SAS MACRO Programming

Intro to SAS Notes

13. Statistical Analysis in SAS¶

13.1. Proportion Tests¶

Example

13.2. Chi-squared Test¶

Example

13.3. Fisher’s Exact Test¶

Example

13.4. Correlation¶

Example

Example

Example

13.5. T-Tests¶

Example

Example

Example

13.6. Nonparametric Alternatives to the T-Tests¶

Example

Example

13.7. One-way ANOVA and the Kruskal-Wallis Test¶

Example

13.8. Linear Regression¶

Example

Example

Example

13.9. Logistic Regression¶

Example

13.10. Poisson Regression¶

Example

13.11. Exercises¶