Cross-Tabular Freq Table
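In this lecture, we will focus on analyzing contingency tables for nominal categorical variables (we will discuss regression models for categorical responses such as logistic regression in a later lecture). The methods covered here are tests of association (the chi-square test of independence, Fisher's exact test, and McNemar's test), measures of association (the risk difference, relative risk, and odds ratio), methods for stratified 2x2 tables (the Cochran-Mantel-Haenszel and Breslow-Day tests), and tests for more general tables.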
Depending on the question, we may want to test for an association and further quantify the direction and strength of the association. We will first look at two general tests of association.
When testing $$H_0: \text{There is no association between treatment and outcome}\;vs\;H_1:\text{ There is an association}$$ we can use the chi-square test of independence. These hypotheses can be restated as $$H_0: \text{The two variables are independent}\;vs\;H_1:\text{ The two variables are dependent}$$ In the 2x2 case, this is equivalent to testing the difference of proportions. To request the chi-square test in SAS, use the CHISQ option in the TABLES statement of PROC FREQ.
In the following example, we will use a dental dataset. Does exposure to chlorinated swimming pool water affect dental erosion? The response is dental erosion (Yes/No) and the predictor is whether the subject is a frequent (>=6 hours per week) or occasional (<6 hours per week) swimmer. The dataset is summarized in the following table:
| Erosion (Yes) | Erosion (No) | Total |
>=6hrs | 32 | 118 | 150 |
<6hrs | 17 | 127 | 144 |
Total | 49 | 245 | 294 |
data dental;
input Erosion $ More6 $ cnt;
datalines;
Y Y 32
N N 127
Y N 17
N Y 118
;
run;
proc freq data=dental order=data;
*Want Yes first so use order=data;
table more6*erosion /chisq;
weight cnt; *How many in each cell;
Run;
With a p-value of 0.0284, we have evidence at the 0.05 significance level of an association between dental erosion and frequency of swimming.
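As a check on the SAS output, the test statistic can be reproduced by hand from the table. This is a sketch of the standard Pearson computation; the symbols $O$, $E_{ij}$ and $X^2$ are notation introduced here rather than labels from the output.
$$E_{ij}=\frac{(\text{row } i \text{ total})(\text{column } j \text{ total})}{n}:\qquad E_{11}=\frac{150\times 49}{294}=25,\quad E_{12}=125,\quad E_{21}=24,\quad E_{22}=120$$
$$X^2=\sum\frac{(O-E)^2}{E}=\frac{7^2}{25}+\frac{7^2}{125}+\frac{7^2}{24}+\frac{7^2}{120}\approx 1.960+0.392+2.042+0.408=4.802$$
Referred to a chi-square distribution with 1 degree of freedom, 4.802 gives the p-value of 0.0284 reported above. Every observed count differs from its expected count by exactly 7, which is always the case in a 2x2 table because the margins are fixed.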
The chi-square test of independence uses a large-sample approximation to calculate the p-value. The approximation is generally considered reliable only if the expected cell counts are at least 5. When this condition fails, the chi-square approximation of the sampling distribution of the test statistic may not hold and we need to use another test. Fisher's exact test can be used in this situation.
Fisher's exact test is a small-sample test for $$H_0: \text{The two variables are independent}\;vs\;H_1:\text{ The two variables are dependent}$$ Let's look at an example where the chi-square test would not be appropriate. The following dataset was collected to test whether there is an association between juvenile delinquent status and awareness of vision health.
| Wears Glasses | No Glasses | Total |
Juvenile Delinquent | 1 | 8 | 9 |
Non-delinquent | 5 | 2 | 7 |
Total | 6 | 10 | 16 |
data glasses;
input del $ glasses $ cnt; *Delinquent status first, then glasses, so the counts match the table;
datalines;
N N 2
Y N 8
N Y 5
Y Y 1
;
run;
proc freq data=glasses;
table del*glasses /chisq expected; *Fisher's;
weight cnt;
run;
With a p-value of 0.035, we have evidence at the 0.05 significance level of an association between juvenile delinquency and awareness of eyesight problems. From the row percentages, we can see that juvenile delinquents are less likely to wear glasses than non-delinquents, which suggests that they are less aware of eyesight problems.
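The exact p-value can be reproduced by hand from the table, which also shows why the chi-square test is unsuitable here. The following is a sketch under the usual conditioning on the table margins; $X$ (the number of delinquents who wear glasses) is notation introduced here. With the margins fixed at 9, 7, 6 and 10, $X$ follows a hypergeometric distribution:
$$P(X=x)=\frac{\binom{9}{x}\binom{7}{6-x}}{\binom{16}{6}},\qquad \binom{16}{6}=8008$$
$$P(X=0)=\frac{7}{8008},\qquad P(X=1)=\frac{189}{8008},\qquad P(X=6)=\frac{84}{8008}$$
$$p=\frac{7+189+84}{8008}=\frac{280}{8008}\approx 0.035$$
The two-sided p-value adds up the probabilities of every table that is no more likely than the observed one ($X=1$); the remaining tables ($X=2,\dots,5$) are each more likely and are excluded. The expected counts under independence are 3.375, 5.625, 2.625 and 4.375, so three of the four cells fall below 5.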
Fisher's exact test can be used for a general mxn table, but for large tables it can be computationally intensive. For 2x2 tables, Fisher's exact test is reported by default when requesting chi-square test output, but for larger tables you need to specify the EXACT option to get Fisher's exact test.
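For a table larger than 2x2, the request might look like the following sketch. The dataset `small3x2` and its counts are hypothetical, made up purely to illustrate the syntax. FISHER is the spelling of the exact-test option used here in the TABLES statement; EXACT is understood to be accepted as an alias.

```sas
* Hypothetical 3x2 table with small counts (illustration only);
data small3x2;
input Group $ Response $ cnt;
datalines;
A Y 3
A N 1
B Y 2
B N 4
C Y 1
C N 5
;
run;
proc freq data=small3x2;
table Group*Response / fisher expected; *Exact test plus expected counts;
weight cnt;
run;
```

Adding EXPECTED to the request prints the expected counts next to the observed ones, which makes it easy to see whether the small-count condition is what motivates the exact test.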
For 2x2 tables that contain matched pairs, where the outcome is measured twice on the same subject (repeated measures, such as pre- and post-intervention evaluations) or once on each member of a matched pair, we can use McNemar's test to test for an association.
Consider a matched-pairs case-control study in which we want to see if marijuana use affects sleeping difficulty. In this study, 32 subjects are chosen based on sleep difficulty (25 with no sleep difficulty and 7 with sleep difficulty) and then marijuana use is recorded. We then choose 32 subjects matched with the original 32 subjects and record marijuana use, giving us a total of 64 subjects in 32 matched pairs. For this sample we get the following 2x2 table:
| Marijuana - Yes | Marijuana - No | Total |
Control - Yes | a=4 | b=9 | 13 |
Control - No | c=3 | d=16 | 19 |
Total | 7 | 25 | 32 |
McNemar's test is a test for the difference of proportions. In this case, we are testing to see if there is a difference between the proportion of marijuana users with sleep difficulty and the proportion of non-users with sleep difficulty. The test is based on the discordant pairs. If there is a strong relationship between marijuana use and sleep difficulty, then there should be an imbalance between the discordant pairs b and c.
data mcnemar;
input User $ sleep $ cnt;
datalines;
0_YES 0_YES 4
0_YES 1_NO 3
1_NO 0_YES 9
1_NO 1_NO 16
;
run;
proc freq data=mcnemar;
table user*sleep /agree; *Mcnemar's test;
weight cnt;
run;
With a p-value of 0.0833, there is no significant difference at the 0.05 level in sleeping difficulty between marijuana users and non-users.
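The p-value can be checked by hand, since McNemar's statistic depends only on the two discordant counts. This is a sketch of the usual large-sample form without a continuity correction; $X^2$ is notation introduced here.
$$X^2=\frac{(b-c)^2}{b+c}=\frac{(9-3)^2}{9+3}=\frac{36}{12}=3.0$$
Referred to a chi-square distribution with 1 degree of freedom, 3.0 gives a p-value of about 0.0833, matching the output. The concordant counts a=4 and d=16 do not enter the statistic, so the test rests on only 12 pairs.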
In order to quantify the association between the two binary variables in the 2x2 case, we can look at estimates and confidence intervals for the risk difference, the relative risk, and the odds ratio.
Consider a predictor with two groups and a response measured as a success or failure. The risk difference is given by $p_1-p_2$, where $p_1$ and $p_2$ are the probabilities of success in group 1 and group 2, respectively.
To get an estimate and confidence interval for the risk difference, specify the option RISKDIFF in the tables statement. Let's go back to the dental example and get the confidence interval for the risk difference.
proc freq data=dental order=data;
table more6*erosion /chisq riskdiff;
weight cnt;
Run;
We are 95% confident that the risk of dental erosion for frequent swimmers is between 0.0112 and 0.1794 higher than the risk of dental erosion for non-frequent swimmers.
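The interval can be reproduced by hand. The sketch below assumes the standard large-sample (Wald) interval, which is consistent with the reported limits; $\hat p_1$, $\hat p_2$ and $SE$ are notation introduced here.
$$\hat p_1=\frac{32}{150}\approx 0.2133,\qquad \hat p_2=\frac{17}{144}\approx 0.1181,\qquad \hat p_1-\hat p_2\approx 0.0953$$
$$SE=\sqrt{\frac{\hat p_1(1-\hat p_1)}{150}+\frac{\hat p_2(1-\hat p_2)}{144}}\approx 0.0429$$
$$0.0953\pm 1.96\times 0.0429\approx(0.0112,\;0.1794)$$
Because the interval excludes 0, it agrees with the chi-square test in rejecting equal risks at the 0.05 level.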
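For prospective medical studies, it is more common to use the relative risk rather than the risk difference. The relative risk of group 1 to group 2 is given by $\dfrac{p_1}{p_2}$. The relative risk has the following interpretation: a relative risk of 1 means the risk is the same in both groups, a value greater than 1 means the risk is higher in group 1, and a value less than 1 means the risk is lower in group 1.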
To get the confidence interval for the relative risk, use the measures option in the table statement.
proc freq data=dental order=data;
table more6*erosion /chisq measures;
weight cnt;
run;
We are 95% confident that the risk of dental erosion for frequent swimmers is between 1.05 and 3.11 times the risk of dental erosion for non-frequent swimmers. The estimated relative risk is 1.8071.
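These numbers can be reproduced by hand. The sketch below assumes the usual large-sample interval built on the log scale, which is consistent with the reported limits; $\widehat{RR}$ and $SE$ are notation introduced here.
$$\widehat{RR}=\frac{32/150}{17/144}\approx 1.8071$$
$$SE(\ln\widehat{RR})=\sqrt{\frac{1}{32}-\frac{1}{150}+\frac{1}{17}-\frac{1}{144}}\approx 0.2765$$
$$\exp\left(\ln 1.8071\pm 1.96\times 0.2765\right)\approx(1.051,\;3.107)$$
The interval is not symmetric around 1.8071 because it is symmetric on the log scale. It excludes 1, the value corresponding to equal risks.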
In retrospective studies, such as case-control studies, we cannot estimate the risk of the response in each group from the data because we sampled based on the response, but we can still estimate the odds ratio. To estimate the odds ratio of success in group 1 vs group 2, we calculate the odds of outcome + for group 1 / odds of outcome + for group 2 = ad/bc, where a, b, c, d represent the following cells in the contingency table:
| case | control | Total |
exposure + | a | b | a+b |
exposure - | c | d | c+d |
Total | a+c | b+d | a+b+c+d |
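The odds ratio has the following general interpretation: an odds ratio of 1 means the odds of the outcome are the same in both groups, a value greater than 1 means the odds are higher in group 1, and a value less than 1 means the odds are lower in group 1.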
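Note that $$OR=\frac{ad}{bc}=\frac{a/(a+b)}{c/(c+d)}\times\frac{d/(c+d)}{b/(a+b)}=RR\times\frac{d/(c+d)}{b/(a+b)}$$ so that for rare diseases (i.e. a and c are small relative to b and d) the second factor is close to 1 and the OR is approximately the same as the RR. To get the confidence interval for the odds ratio, use the MEASURES option in the TABLES statement of PROC FREQ. From the output above, we see that in the dental example the estimated odds ratio is 2.0259: the odds of no dental erosion for those swimming less than 6 hours a week were 2.026 times the odds for those swimming 6 or more hours a week. Equivalently, the odds of dental erosion for frequent swimmers were 2.026 times the odds for occasional swimmers.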
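For the dental table, the estimate and its relation to the relative risk can be worked out by hand. In this sketch a=32, b=118, c=17 and d=127 are read from the dental table (rows are swimming frequency, columns are erosion yes/no), and $\hat p_1$, $\hat p_2$ denote the estimated erosion risks for frequent and occasional swimmers.
$$\widehat{OR}=\frac{ad}{bc}=\frac{32\times 127}{118\times 17}=\frac{4064}{2006}\approx 2.0259$$
$$\frac{\widehat{OR}}{\widehat{RR}}=\frac{1-\hat p_2}{1-\hat p_1}=\frac{127/144}{118/150}\approx\frac{0.8819}{0.7867}\approx 1.121$$
Dental erosion is not rare in this sample (about 21% and 12% in the two groups), so the odds ratio of 2.0259 is roughly 12% larger than the relative risk of 1.8071 and the two should not be used interchangeably here.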
In this case, we want to test for an overall association between two binary variables after adjusting for a stratification variable. This could be the result of study design, e.g. stratifying by hospital sites in a multicenter study, or to control for a confounding variable. For each level of the stratification variable, we will have a 2x2 table.
Let's use the following example. How does stress (Low/High) affect one's opinion (favorable/unfavorable) on a new policy? In this study, subjects were interviewed in both rural and urban environments, so we will need to account for location in our analysis. To test for an association between stress and opinion on health policy when accounting for location, we will use the Cochran-Mantel-Haenszel test. For this test to be appropriate, we need a sample size of at least 30 for each row category in each table; otherwise the chi-square approximation may not work.
DATA healthpolicy;
INPUT loc $ stress $ opinion $ cnt @@;
DATALINES;
Urban Low FA 48 Urban Low UFA 12
Urban High FA 96 Urban High UFA 94
Rural Low FA 55 Rural Low UFA 135
Rural High FA 7 Rural High UFA 53
;
RUN;
PROC FREQ DATA=healthpolicy;
weight cnt;
tables loc*stress*opinion / cmh;
RUN;
The test statistic is 23.0502 with a p-value of <0.0001, so there is very strong evidence of an overall association between stress and opinion on a new health policy. People with low stress are significantly more likely to support a new health policy than people with high stress.
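The statistic can be reproduced by hand from the two 2x2 tables. In this sketch, $a_k$ denotes the (Low stress, favorable) count in stratum $k$, and $r_{1k}, r_{2k}, c_{1k}, c_{2k}, n_k$ are the row totals, column totals and sample size of that stratum; all of these symbols are notation introduced here. No continuity correction is applied, which is consistent with the reported value.
$$E(a_k)=\frac{r_{1k}c_{1k}}{n_k},\qquad \mathrm{Var}(a_k)=\frac{r_{1k}r_{2k}c_{1k}c_{2k}}{n_k^2(n_k-1)}$$
$$\text{Urban: } a=48,\; E=\frac{60\times 144}{250}=34.56,\; \mathrm{Var}\approx 11.181\qquad \text{Rural: } a=55,\; E=\frac{190\times 62}{250}=47.12,\; \mathrm{Var}\approx 8.538$$
$$CMH=\frac{\left[(48-34.56)+(55-47.12)\right]^2}{11.181+8.538}=\frac{21.32^2}{19.720}\approx 23.05$$
Both strata contribute a positive difference (13.44 and 7.88), so the evidence accumulates across locations instead of canceling.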
The Mantel-Haenszel test should be used with caution. If the direction of association is not the same across the tables, then the power to detect an association is greatly reduced. We can see this in the following example, which tests the association between country (US/UK) and switching to a new soft drink (Y/N) when controlling for gender.
DATA soda;
INPUT gender $ country $ soda $ cnt @@;
DATALINES;
Male USA Yes 29 Male USA No 6
Male UK Yes 19 Male UK No 15
Female USA Yes 7 Female USA No 23
Female UK Yes 24 Female UK No 29
;
RUN;
PROC FREQ DATA=soda order=data;
weight cnt;
tables gender*country*soda /cmh;
RUN;
In this case, the p-value for the Cochran-Mantel-Haenszel test is 0.876, so there is no evidence of an overall association. However, if you examine each table separately, there does appear to be an association within each stratum, but in opposite directions.
PROC FREQ DATA=soda order=data;
where gender="Male";
weight cnt;
table country*soda / chisq;
run;
PROC FREQ DATA=soda order=data;
where gender="Female";
weight cnt;
table country*soda / chisq;
run;
For the female table, the p-value is 0.047 and for the male table the p-value is 0.015. So within each stratum there is an association, but the opposite directions of these associations cancel out, giving a large p-value for the overall association. The CMH test does not work well in these situations, so it is a good idea to inspect the tables for inconsistent patterns of association across the strata.
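The opposite directions are easy to see from the stratum-specific odds ratios, computed here by hand as a sketch. Each is the odds of switching for the USA relative to the UK; the notation $\widehat{OR}$ is introduced here.
$$\widehat{OR}_{\text{Male}}=\frac{29\times 15}{6\times 19}=\frac{435}{114}\approx 3.82,\qquad \widehat{OR}_{\text{Female}}=\frac{7\times 29}{23\times 24}=\frac{203}{552}\approx 0.37$$
One odds ratio is well above 1 and the other well below 1, so the male and female contributions to the CMH statistic have opposite signs and largely offset each other.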
The Breslow-Day test is a test for homogeneous odds ratios across the strata for stratified 2x2 tables. If we fail to reject the null hypothesis of homogeneous odds ratios, then we can use the Mantel-Haenszel estimate as the estimate of this common odds ratio. Let's go back to the health policy opinion and stress level example.
PROC FREQ DATA=healthpolicy;
weight cnt;
tables loc*stress*opinion / cmh Expected;
RUN;
The p-value for the Breslow-Day test is 0.6688, so we have no evidence that the odds ratios differ across the strata. The estimated common odds ratio is 0.2823 with 95% confidence interval (0.1648, 0.4839). Note that the Breslow-Day test is only valid if the expected cell counts in all tables are at least 5. This is needed to ensure that the chi-squared approximation is good.
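The common odds ratio can be reproduced by hand. This sketch assumes the rows are ordered alphabetically (High before Low, since no ORDER= option is given), so that in stratum $k$ the cells are $a_k$ = (High, FA), $b_k$ = (High, UFA), $c_k$ = (Low, FA) and $d_k$ = (Low, UFA); these symbols are notation introduced here.
$$\widehat{OR}_{MH}=\frac{\sum_k a_kd_k/n_k}{\sum_k b_kc_k/n_k}=\frac{(96\times 12+7\times 135)/250}{(94\times 48+53\times 55)/250}=\frac{2097}{7427}\approx 0.2823$$
$$\widehat{OR}_{\text{Urban}}=\frac{1152}{4512}\approx 0.255,\qquad \widehat{OR}_{\text{Rural}}=\frac{945}{2915}\approx 0.324$$
The two stratum-specific odds ratios are close to each other and on the same side of 1, which is in line with the large Breslow-Day p-value. The estimate says the odds of a favorable opinion under high stress are about 0.28 times the odds under low stress.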
Which tests are appropriate for more general tables depends on whether we have ordinal or nominal variables. In the case that both are nominal, the chi-square test of independence is a general test of association that can be used. The next example examines whether or not there is an association between eye strain (Y/N) and the type of office work (G1 = Data entry in visual display unit, G2 = Conversational use of VDUs, G3 = Full-time typing, and G4 = Traditional office work).
data strain;
input Job $ StrainGrp $ cnt @@;
datalines;
G1 Y 11 G1 N 42
G2 Y 30 G2 N 79
G3 Y 14 G3 N 63
G4 Y 3 G4 N 52
;
run;
ods rtf style=htmlblue;
proc freq data=strain order=data;
table StrainGrp*Job / cmh;
weight cnt;
run;
In this case, we use the General Association in the CMH statistics table. The p-value is 0.0099 for this dataset.
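With a single stratum, the general association statistic is a rescaled Pearson chi-square, so it can be checked by hand. In this sketch, $X^2$ and $Q_{GA}$ are notation introduced here. The expected numbers with eye strain under independence are about 10.46, 21.50, 15.19 and 10.85 for G1 to G4, against observed counts of 11, 30, 14 and 3.
$$X^2=\sum\frac{(O-E)^2}{E}\approx 11.41,\qquad Q_{GA}=\frac{n-1}{n}X^2=\frac{293}{294}\times 11.41\approx 11.37$$
Referred to a chi-square distribution with $(2-1)(4-1)=3$ degrees of freedom, 11.37 gives a p-value just under 0.01, in agreement with the reported 0.0099. Most of the statistic comes from G2 (more strain than expected) and G4 (less strain than expected).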
If one variable is nominal and the other is ordinal, then we can use the mean score statistic for trend. We could still use the general chi-square test, but taking the ordering into account increases the power of the test. Consider the following example of treatment (test drug/ placebo) vs improvement in condition (none/some/marked).
data drug;
input Drug $ Improve $ cnt;
datalines;
Test None 13
Test Some 7
Test Marked 21
Placebo None 29
Placebo Some 7
Placebo Marked 7
;
run;
proc freq data=drug order=data;
table Drug*Improve /cmh;
weight cnt;
run;
The row mean scores test has a p-value of 0.0003 in this test.
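The quantity being compared can be seen by computing the mean scores by hand. This sketch assumes the default table scores 1, 2, 3 for None, Some, Marked (the column order set by order=data); $\bar y$, $Q_S$ and the sums of squares are notation introduced here.
$$\bar y_{\text{Test}}=\frac{13(1)+7(2)+21(3)}{41}=\frac{90}{41}\approx 2.20,\qquad \bar y_{\text{Placebo}}=\frac{29(1)+7(2)+7(3)}{43}=\frac{64}{43}\approx 1.49$$
$$Q_S=(n-1)\frac{SS_{\text{between}}}{SS_{\text{total}}}=83\times\frac{10.483}{67.667}\approx 12.86$$
Referred to a chi-square distribution with 1 degree of freedom, 12.86 corresponds to the reported p-value of 0.0003. The test drug group has the higher mean improvement score.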
If both are ordinal, then we want to use the nonzero correlation row in the CMH table.