SAS: Categorical Data Analysis¶

In this lecture, we will focus on analyzing contingency tables for nominal categorical variables (we will discuss regression models for categorical responses such as logistic regression in a later lecture). The methods covered here are

Chi-square test of independence
Fisher's Exact Test
Risk Difference
McNemar's Test
Relative Risk
Odds Ratio
Cochran-Mantel-Haenszel Test
Breslow-Day Test

Tests of Association¶

Depending on the question, we may want to test for an association and further quantify the direction and strength of the association. We will first look at two general tests of association.

Chi-square Test of Independence¶

When testing $$H_0: \text{There is no association between treatment and outcome}\;vs\;H_1:\text{ There is an association}$$ we can use the chi-square test of independence. These hypotheses can be restated as $$H_0: \text{The two variables are independent}\;vs\;H_1:\text{ The two variables are dependent}$$ In the 2x2 case, this is equivalent to testing for a difference between two proportions. To request the chi-square test in SAS, use the CHISQ option in the TABLES statement of PROC FREQ.

In the following example, we will use a dental dataset. Does exposure to chlorinated swimming pool water affect dental erosion? The response is dental erosion (Yes/No) and the predictor is whether the subject is a frequent (>=6 hours a week) or occasional (<6 hours a week) swimmer. The dataset is summarized in the following table:

	Erosion (Yes)	Erosion (No)	Total
>=6hrs	32	118	150
<6hrs	17	127	144
Total	49	245	294

data dental;  
 input Erosion $ More6 $ cnt;
 datalines;
Y Y 32
N N 127
Y N 17
N Y 118
;
run;
proc freq data=dental order=data; 
 *Want Yes first so use order=data;
  table more6*erosion /chisq;
 weight cnt;  *How many in each cell;
Run;

With a p-value of 0.0284, we have evidence at the 0.05 significance level of an association between dental erosion and frequency of swimming.
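
As a check on the SAS output, the Pearson statistic can be reproduced by hand from the summary table. Under independence, the expected count in each cell is (row total x column total) / grand total, and the statistic sums (observed - expected)^2 / expected over the four cells. The symbols $E_{ij}$ and $X^2$ are notation introduced here for illustration only.

$$E_{11} = \frac{150 \times 49}{294} = 25,\quad E_{12} = \frac{150 \times 245}{294} = 125,\quad E_{21} = \frac{144 \times 49}{294} = 24,\quad E_{22} = \frac{144 \times 245}{294} = 120$$

$$X^2 = \frac{(32-25)^2}{25} + \frac{(118-125)^2}{125} + \frac{(17-24)^2}{24} + \frac{(127-120)^2}{120} = 1.960 + 0.392 + 2.042 + 0.408 = 4.802$$

This agrees with the Chi-Square value of 4.8020 reported by PROC FREQ, which is compared to a chi-square distribution with 1 degree of freedom. All four expected counts are well above 5.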

The chi-square test of independence uses a large-sample approximation to calculate the p-value. As a rule of thumb, the approximation is only considered valid if the expected cell counts are all at least 5. When this condition fails, the chi-square approximation to the sampling distribution of the test statistic may not hold and we need to use another test. Fisher's exact test can be used in this situation.

Fisher's Exact Test¶

Fisher's exact test is a small sample test for $$H_0: \text{The two variables are independent}\;vs\;H_1:\text{ The two variables are dependent}$$ Let's look at an example where the chi-square test would not be appropriate. The following dataset was collected to test to see if there is an association between juvenile delinquent status and awareness of vision health.

	Wears Glasses	No Glasses	Total
Juvenile Delinquent	1	8	9
Non-delinquent	5	2	7
Total	6	10	16

data glasses;  
 input del $ glasses $ cnt;  *Delinquent status first so the counts match the table;
 datalines;
N N 2
Y N 8
N Y 5
Y Y 1
;
run;

proc freq data=glasses;
 table del*glasses /chisq expected; *Fisher's;
 weight cnt;
run;

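With a two-sided p-value of 0.035 from Fisher's exact test, we have evidence at the 0.05 significance level of an association between juvenile delinquency and awareness of eyesight problems. SAS also warns that 75% of the cells have expected counts less than 5, which is why the chi-square p-value is not used here. From the row percentages, we can see that juvenile delinquents are less likely to wear glasses (1 of 9) than non-delinquents (5 of 7).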
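
A sketch of where the exact p-value comes from: conditional on the row and column totals, the number of delinquents who wear glasses follows a hypergeometric distribution under independence. Writing $x$ for that count (notation introduced here for illustration), the probability of the observed table is

$$P(x=1) = \frac{\binom{9}{1}\binom{7}{5}}{\binom{16}{6}} = \frac{9 \times 21}{8008} = \frac{189}{8008} \approx 0.0236$$

The two-sided p-value adds the probabilities of all tables that are no more likely than the observed one. Here those are $x=0$, $x=1$ and $x=6$:

$$\frac{7 + 189 + 84}{8008} = \frac{280}{8008} \approx 0.0350$$

Both values match the Table Probability (P) and the two-sided p-value in the PROC FREQ output.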

Fisher's exact test can be used for a general rxc table, but for large tables it can be computationally intensive. For 2x2 tables, Fisher's exact test is reported by default when requesting chi-square test output, but for larger tables you need to specify the EXACT option to get Fisher's exact test.
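
For a table larger than 2x2, a minimal sketch of the request might look like the following. It reuses the 2x3 drug-by-improvement counts that appear later in this lecture purely as an illustration. Whether an exact test is actually needed for those counts is not claimed here.

```sas
/* Illustration only: Fisher exact test on a 2x3 table */
data drug;
 input Drug $ Improve $ cnt;
 datalines;
Test None 13
Test Some 7
Test Marked 21
Placebo None 29
Placebo Some 7
Placebo Marked 7
;
run;

proc freq data=drug order=data;
 /* EXACT (FISHER is an equivalent keyword) in the TABLES statement
    requests the exact test for tables larger than 2x2 */
 table Drug*Improve / chisq expected exact;
 weight cnt;
run;
```

The EXPECTED option is included so that the expected counts can be inspected alongside the exact p-value.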

McNemar's Test for Paired Samples¶

For 2x2 tables that contain matched pairs, where the outcome is measured twice (repeated measures) or evaluated pre- and post-intervention on the same (or matched) subjects, we can use McNemar's test to test for an association. Examples include:

Left and right eye measurements
Husband and wife voting preference
Cases matched to controls on the basis of demographic characteristics in case-control studies

Consider a matched pairs case-control study in which we want to see if marijuana use affects sleeping difficulty. In this study, 32 subjects are chosen based on sleep difficulty (25 with no sleep difficulty and 7 with sleep difficulty) and then marijuana use is recorded. We then choose 32 subjects matched with the original 32 subjects and record marijuana use, giving us a total of 64 subjects (32 matched pairs). For this sample we get the following 2x2 table:

	Marijuana - Yes	Marijuana - No	Total
Control - Yes	a=4	b=9	13
Control - No	c=3	d=16	19
Total	7	25	32

McNemar's test is a test for the difference of proportions. In this case, we are testing to see if there is a difference between the proportion of marijuana users with sleep difficulty and the proportion of non-marijuana users with sleep difficulty. The test is based on the discordant pairs. If there is a strong relationship between marijuana use and sleep difficulty, then there should be an imbalance between the discordant pairs b and c.

data mcnemar;  
 input User $ sleep $ cnt;
 datalines;
0_YES 0_YES 4
0_YES 1_NO 3
1_NO 0_YES 9
1_NO 1_NO 16
;
run;

proc freq data=mcnemar;
 table user*sleep /agree; *Mcnemar's test;
 weight cnt;
run;

With a p-value of 0.0833, there is no significant difference at the 0.05 level in sleeping difficulty between marijuana and non-marijuana users.
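
The McNemar statistic depends only on the two discordant cells. With $b=9$ and $c=3$ from the 2x2 table, the usual large-sample form (without a continuity correction) gives

$$S = \frac{(b-c)^2}{b+c} = \frac{(9-3)^2}{9+3} = \frac{36}{12} = 3.0$$

which matches the Statistic (S) of 3.0000 in the PROC FREQ output and is compared to a chi-square distribution with 1 degree of freedom. The concordant cells $a$ and $d$ do not enter the calculation.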

Quantifying Associations¶

In order to quantify the association between the two categorical variables in the 2x2 case, we can look at estimates and confidence intervals for

risk difference
relative risk (prospective study)
odds ratio (retrospective study)

Risk Difference¶

Consider a predictor with two groups and response measured as a success or failure. The risk difference is given by $p_1-p_2$ where

$p_1 = $ the proportion of successes in group 1
$p_2 = $ the proportion of successes in group 2

To get an estimate and confidence interval for the risk difference, specify the option RISKDIFF in the tables statement. Let's go back to the dental example and get the confidence interval for the risk difference.

proc freq data=dental order=data;
  table more6*erosion /chisq riskdiff;
 weight cnt;
Run;

We are 95% confident that the risk of dental erosion for frequent swimmers is between 0.0112 and 0.1794 more than the risk of dental erosion for non-frequent swimmers.
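
The interval can be reproduced by hand as a standard Wald interval, which is consistent with the asymptotic limits in the PROC FREQ output. Group 1 is the frequent swimmers, group 2 the occasional swimmers, and "success" is dental erosion:

$$\hat p_1 = \frac{32}{150} = 0.2133,\qquad \hat p_2 = \frac{17}{144} = 0.1181,\qquad \hat p_1 - \hat p_2 = 0.0953$$

$$ASE = \sqrt{\frac{\hat p_1(1-\hat p_1)}{150} + \frac{\hat p_2(1-\hat p_2)}{144}} = \sqrt{0.001119 + 0.000723} = 0.0429$$

$$0.0953 \pm 1.96 \times 0.0429 = (0.0112,\ 0.1794)$$

Because the interval excludes 0, it agrees with the chi-square test in rejecting equal risks at the 0.05 level.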

Relative Risk¶

For prospective medical studies, it is more common to use the relative risk rather than the risk difference. The relative risk of group 1 to group 2 is given by $\dfrac{p_1}{p_2}$. The relative risk has the following interpretation

RR > 1: Risk in group 1 is greater than the risk in group 2
RR = 1: No association between the variables, that is, the risk is the same in each group
RR < 1: Risk in group 1 is less than the risk in group 2

To get the confidence interval for the relative risk, use the measures option in the table statement.

proc freq data=dental order=data;
 table more6*erosion /chisq measures;
 weight cnt;
run;

The estimated relative risk is 1.8071. We are 95% confident that the risk of dental erosion for frequent swimmers is between 1.05 and 3.11 times the risk of dental erosion for non-frequent swimmers.
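
The point estimate follows directly from the two sample proportions:

$$\widehat{RR} = \frac{32/150}{17/144} = \frac{32 \times 144}{17 \times 150} = \frac{4608}{2550} = 1.8071$$

The reported limits can be reproduced with a Wald interval on the log scale, a standard construction for the relative risk. That PROC FREQ uses this construction here is inferred from the agreement of the numbers:

$$\exp\left(\ln 1.8071 \pm 1.96\sqrt{\frac{1}{32} - \frac{1}{150} + \frac{1}{17} - \frac{1}{144}}\right) = \exp(0.5917 \pm 1.96 \times 0.2765) \approx (1.051,\ 3.107)$$

The interval is not symmetric around 1.8071 because it is symmetric on the log scale instead.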

Odds Ratio¶

In retrospective studies, such as case-control studies, we cannot estimate the risk of the response in each group from the data because we sampled based on the response, but we can still estimate the odds ratio. To estimate the odds ratio of success in group 1 vs group 2, we calculate (odds of outcome + for group 1) / (odds of outcome + for group 2) = ad/bc, where a, b, c, d represent the following cells in the contingency table

	case	control	Total
exposure +	a	b	a+b
exposure -	c	d	c+d
Total	a+c	b+d	a+b+c+d

The odds ratio has the following general interpretation:

OR = 1: no association between exposure and outcome
OR > 1: cases are more likely than controls to be exposure +
OR < 1: controls are more likely than cases to be exposure +

Note that

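RR = [a/(a+b)]/[c/(c+d)] = a(c+d)/[c(a+b)]
OR = ad/(bc)

so that for rare diseases (i.e. a and c are small relative to b and d, so that a+b ≈ b and c+d ≈ d) the OR is approximately the same as the RR. To get the confidence interval for the odds ratio, use the measures option in the table statement of PROC FREQ. From the output of the measures request above, we see that in the dental example the estimated odds ratio is 2.0259 with 95% confidence interval (1.0689, 3.8398): the odds of dental erosion for those swimming 6 or more hours a week are about 2.026 times the odds for those swimming less than 6 hours a week. Equivalently, the odds of no dental erosion for those swimming less than 6 hours a week are 2.026 times the odds for those swimming 6 or more hours a week.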
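
As a worked check using the dental counts (a = 32, b = 118, c = 17, d = 127):

$$\widehat{OR} = \frac{ad}{bc} = \frac{32 \times 127}{118 \times 17} = \frac{4064}{2006} = 2.0259$$

The confidence limits in the output are consistent with the usual Wald interval on the log scale:

$$\exp\left(\ln 2.0259 \pm 1.96\sqrt{\frac{1}{32} + \frac{1}{118} + \frac{1}{17} + \frac{1}{127}}\right) = \exp(0.7060 \pm 1.96 \times 0.3262) \approx (1.069,\ 3.84)$$

Dental erosion is not rare in this sample (about 17% overall), so the odds ratio of 2.03 is noticeably larger than the relative risk of 1.81.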

Stratified 2x2 Tables¶

In this case, we want to test for an overall association between two binary variables after adjusting for a stratification variable. This could be the result of study design, e.g. stratifying by hospital sites in a multicenter study, or to control for a confounding variable. For each level of the stratification variable, we will have a 2x2 table.

Let's use the following example. How does stress (Low/High) affect one's opinion (favorable/unfavorable) on a new policy? In this study, subjects were interviewed in both rural and urban environments, so we will need to account for location in our analysis. To test for an association between stress and opinion on health policy when accounting for location, we will use the Cochran-Mantel-Haenszel (CMH) test. For this test to be appropriate, we need a sample size of at least 30 for each row category in each table; otherwise the chi-square approximation may not work.

DATA healthpolicy;
INPUT loc $ stress $ opinion $ cnt @@;
DATALINES;
Urban Low FA 48 Urban Low UFA 12
Urban High FA 96 Urban High UFA 94
Rural Low FA 55 Rural Low UFA 135
Rural High FA 7 Rural High UFA 53
;
RUN;

PROC FREQ DATA=healthpolicy;
weight cnt;
tables loc*stress*opinion / cmh;
RUN;

The test statistic is 23.0502 with a p-value of <0.0001, so there is very strong evidence of an overall association between stress and opinion on a new health policy after adjusting for location. People with low stress are significantly more likely to support a new health policy than people with high stress.
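
The CMH statistic can be reproduced from the two 2x2 tables. For stratum $k$, let $a_k$ be the count in the first cell (High stress, favorable, assuming the alphabetical ordering PROC FREQ applies by default), with row totals $r_{1k}, r_{2k}$, column totals $c_{1k}, c_{2k}$ and stratum size $n_k$. This notation is introduced here for illustration. Under the null hypothesis,

$$E(a_k) = \frac{r_{1k}\,c_{1k}}{n_k},\qquad V(a_k) = \frac{r_{1k}\,r_{2k}\,c_{1k}\,c_{2k}}{n_k^2\,(n_k-1)}$$

For the rural table, $a = 7$, $E = \frac{60 \times 62}{250} = 14.88$ and $V = \frac{60 \times 190 \times 62 \times 188}{250^2 \times 249} = 8.538$. For the urban table, $a = 96$, $E = \frac{190 \times 144}{250} = 109.44$ and $V = \frac{190 \times 60 \times 144 \times 106}{250^2 \times 249} = 11.181$. Then

$$Q_{CMH} = \frac{\left[(7 - 14.88) + (96 - 109.44)\right]^2}{8.538 + 11.181} = \frac{(-21.32)^2}{19.720} \approx 23.05$$

In both strata the observed count falls below its expected value, so the two contributions add rather than cancel.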

The Mantel-Haenszel test should be used with caution. If the direction of the association is not the same across the tables, then the power to detect an association is greatly reduced. We can see this in the following example in testing the association between country (US/UK) and switching to a new soft drink (Y/N) when controlling for gender.

DATA soda;
INPUT gender $ country $ soda $ cnt @@;
DATALINES;
Male USA Yes 29 Male USA No 6
Male UK Yes 19 Male UK No 15
Female USA Yes 7 Female USA No 23
Female UK Yes 24 Female UK No 29
;
RUN;

PROC FREQ DATA=soda order=data;
weight cnt;
tables gender*country*soda /cmh;
RUN;

In this case, the p-value for the Cochran-Mantel-Haenszel test is 0.876, so there is no evidence of an overall association. However, if you examine each table separately, there does appear to be an association within each stratum, but in opposite directions.

PROC FREQ DATA=soda order=data;
where gender="Male";
weight cnt;
table country*soda / chisq;
run;

PROC FREQ DATA=soda order=data;
where gender="Female";
weight cnt;
table country*soda / chisq;
run;

For the female table, the p-value is 0.047, and for the male table the p-value is 0.015. So within each stratum there is an association, but the opposite directions of these associations cancel out, giving a large p-value for the overall association. The CMH test does not work well in these situations, so it is a good idea to inspect tables for inconsistent patterns of associations across the strata.
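
The opposite directions are easy to see from the stratum-specific odds ratios of switching (USA vs UK), computed directly from the counts in the data step:

$$\widehat{OR}_{male} = \frac{29 \times 15}{6 \times 19} = \frac{435}{114} \approx 3.82,\qquad \widehat{OR}_{female} = \frac{7 \times 29}{23 \times 24} = \frac{203}{552} \approx 0.37$$

One odds ratio is well above 1 and the other well below 1, which is the pattern that a single pooled test statistic averages away.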

The Breslow-Day Test¶

The Breslow-Day test is a test for homogeneous odds ratios across the strata for stratified 2x2 tables. If we fail to reject the null of homogeneous odds ratios, then we can use the Mantel-Haenszel estimate as the estimate of this common odds ratio. Let's go back to the health policy opinion and stress level example.

PROC FREQ DATA=healthpolicy;
weight cnt;
tables loc*stress*opinion / cmh Expected;
RUN;

The p-value for the Breslow-Day test is 0.6688, so we have no evidence that the odds ratios differ across the strata. The estimated common odds ratio is 0.2823 with 95% confidence interval (0.1648, 0.4839). Note that the Breslow-Day test is only valid if the expected cell counts in all tables are at least 5. This is needed to ensure that the chi-squared approximation is good.
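
The Mantel-Haenszel estimate is a weighted combination of the stratum tables. With cells $a_k, b_k, c_k, d_k$ and stratum size $n_k$ (notation introduced here for illustration), and rows ordered High then Low and columns FA then UFA as in the default PROC FREQ ordering, the rural table is (7, 53, 55, 135) and the urban table is (96, 94, 48, 12), each with $n_k = 250$:

$$\widehat{OR}_{MH} = \frac{\sum_k a_k d_k / n_k}{\sum_k b_k c_k / n_k} = \frac{\frac{7 \times 135}{250} + \frac{96 \times 12}{250}}{\frac{53 \times 55}{250} + \frac{94 \times 48}{250}} = \frac{3.780 + 4.608}{11.660 + 18.048} = \frac{8.388}{29.708} \approx 0.2823$$

The stratum-specific odds ratios are $945/2915 \approx 0.32$ (rural) and $1152/4512 \approx 0.26$ (urban). They are similar in size and direction, which is consistent with the large Breslow-Day p-value.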

General rxc Tables¶

Which tests are appropriate for more general tables depends on whether we have ordinal or nominal variables. In the case that both are nominal, the chi-square test of independence is a general test of association that can be used. The next example examines whether or not there is an association between eye strain (Y/N) and the type of office work (G1 = Data entry in visual display unit, G2 = Conversational use of VDUs, G3 = Full-time typing, and G4 = Traditional office work).

data strain;  
 input Job $ StrainGrp $ cnt @@;
 datalines;
G1 Y 11 G1 N 42
G2 Y 30 G2 N 79
G3 Y 14 G3 N 63
G4 Y 3  G4 N 52
;
run;

proc freq data=strain order=data;
 table StrainGrp*Job / cmh;
 weight cnt;
run;

In this case, we use the General Association in the CMH statistics table. The p-value is 0.0099 for this dataset.

If one variable is nominal and the other is ordinal, then we can use the mean score statistic for trend. We could still use the general chi-square test, but taking the ordering into account increases the power of the test. Consider the following example of treatment (test drug/ placebo) vs improvement in condition (none/some/marked).

data drug;  
 input Drug $ Improve $ cnt;
 datalines;
Test None 13
Test Some 7
Test Marked 21
Placebo None 29
Placebo Some 7
Placebo Marked 7
;
run;

proc freq data=drug order=data;
 table Drug*Improve /cmh;
 weight cnt;
run;

The row mean scores test has a p-value of 0.0003 for this dataset.

If both are ordinal, then we want to use the nonzero correlation row in the CMH table.

Chi-square output for the dental example (More6 * Erosion):

Statistic	DF	Value	Prob
Chi-Square	1	4.8020	0.0284
Likelihood Ratio Chi-Square	1	4.8746	0.0273
Continuity Adj. Chi-Square	1	4.1405	0.0419
Mantel-Haenszel Chi-Square	1	4.7857	0.0287
Phi Coefficient		0.1278
Contingency Coefficient		0.1268
Cramer's V		0.1278

Fisher's Exact Test
Cell (1,1) Frequency (F)	32
Left-sided Pr <= F	0.9910
Right-sided Pr >= F	0.0204

Table Probability (P)	0.0114
Two-sided Pr <= P	0.0296

Chi-square output for the glasses example (del * glasses):

Statistic	DF	Value	Prob
WARNING: 75% of the cells have expected counts less than 5. Chi-Square may not be a valid test.
Chi-Square	1	6.1122	0.0134
Likelihood Ratio Chi-Square	1	6.5153	0.0107
Continuity Adj. Chi-Square	1	3.8095	0.0510
Mantel-Haenszel Chi-Square	1	5.7302	0.0167
Phi Coefficient		-0.6181
Contingency Coefficient		0.5258
Cramer's V		-0.6181

Fisher's Exact Test
Cell (1,1) Frequency (F)	2
Left-sided Pr <= F	0.0245
Right-sided Pr >= F	0.9991

Table Probability (P)	0.0236
Two-sided Pr <= P	0.0350

AGREE output for the marijuana and sleep example (User * sleep):

McNemar's Test
Statistic (S)	3.0000
DF	1
Pr > S	0.0833

Simple Kappa Coefficient
Kappa	0.1616
ASE	0.1641
95% Lower Conf Limit	-0.1600
95% Upper Conf Limit	0.4832

RISKDIFF output for the dental example (More6 * Erosion):

Column 1 Risk Estimates
	Risk	ASE	(Asymptotic) 95% Confidence Limits		(Exact) 95% Confidence Limits
Difference is (Row 1 - Row 2)
Row 1	0.2133	0.0334	0.1478	0.2789	0.1507	0.2876
Row 2	0.1181	0.0269	0.0654	0.1708	0.0703	0.1823
Total	0.1667	0.0217	0.1241	0.2093	0.1259	0.2143
Difference	0.0953	0.0429	0.0112	0.1794

MEASURES output for the dental example (More6 * Erosion):

Statistic	Value	ASE
Gamma	0.3390	0.1444
Kendall's Tau-b	0.1278	0.0564
Stuart's Tau-c	0.0952	0.0429
Somers' D C|R	0.0953	0.0429
Somers' D R|C	0.1714	0.0751
Pearson Correlation	0.1278	0.0564
Spearman Correlation	0.1278	0.0564
Lambda Asymmetric C|R	0.0000	0.0000
Lambda Asymmetric R|C	0.0625	0.1052
Lambda Symmetric	0.0466	0.0792
Uncertainty Coefficient C|R	0.0184	0.0164
Uncertainty Coefficient R|C	0.0120	0.0107
Uncertainty Coefficient Symmetric	0.0145	0.0129

Odds Ratio and Relative Risks
Statistic	Value	95% Confidence Limits
Odds Ratio	2.0259	1.0689	3.8398
Relative Risk (Column 1)	1.8071	1.0510	3.1070
Relative Risk (Column 2)	0.8920	0.8050	0.9883

CMH output for the health policy example (stress * opinion, controlling for loc):

Cochran-Mantel-Haenszel Statistics (Based on Table Scores)
Statistic	Alternative Hypothesis	DF	Value	Prob
1	Nonzero Correlation	1	23.0502	<.0001
2	Row Mean Scores Differ	1	23.0502	<.0001
3	General Association	1	23.0502	<.0001

Common Odds Ratio and Relative Risks
Statistic	Method	Value	95% Confidence Limits
Odds Ratio	Mantel-Haenszel	0.2823	0.1648	0.4839
	Logit	0.2810	0.1642	0.4806
Relative Risk (Column 1)	Mantel-Haenszel	0.5709	0.4585	0.7107
	Logit	0.6140	0.5112	0.7374
Relative Risk (Column 2)	Mantel-Haenszel	1.5135	1.2722	1.8006
	Logit	1.2928	1.1404	1.4657

Chi-square output for the soda example, males (country * soda):

Statistic	DF	Value	Prob
Chi-Square	1	5.9272	0.0149
Likelihood Ratio Chi-Square	1	6.0690	0.0138
Continuity Adj. Chi-Square	1	4.7216	0.0298
Mantel-Haenszel Chi-Square	1	5.8413	0.0157
Phi Coefficient		0.2931
Contingency Coefficient		0.2813
Cramer's V		0.2931

Fisher's Exact Test
Cell (1,1) Frequency (F)	29
Left-sided Pr <= F	0.9968
Right-sided Pr >= F	0.0143

Table Probability (P)	0.0112
Two-sided Pr <= P	0.0194

Chi-square output for the soda example, females (country * soda):

Statistic	DF	Value	Prob
Chi-Square	1	3.9443	0.0470
Likelihood Ratio Chi-Square	1	4.0934	0.0431
Continuity Adj. Chi-Square	1	3.0620	0.0801
Mantel-Haenszel Chi-Square	1	3.8968	0.0484
Phi Coefficient		-0.2180
Contingency Coefficient		0.2130
Cramer's V		-0.2180

Fisher's Exact Test
Cell (1,1) Frequency (F)	7
Left-sided Pr <= F	0.0385
Right-sided Pr >= F	0.9881

Table Probability (P)	0.0267
Two-sided Pr <= P	0.0602
