Correlated Binary and Count Data¶

When dealing with correlated data, it is important to account for the dependence structure. Most of the standard methods we have discussed assume a random sample (independent units). If this assumption is violated, as it is for correlated data, these methods will result in incorrect standard errors and type I error rates. We looked at linear mixed models last week as a way to handle correlated data for quantitative responses. This week we will discuss how to handle correlated binary and count data. For independent binary and count data, we have discussed logistic and Poisson regression models (generalized linear models). For correlated binary and count data, we have two options:

GEE - generalized estimating equations (marginal model)
GLMM - generalized linear mixed models (subject-specific model)

Here we will very briefly discuss these models and learn how to fit them with SAS and R. Let's begin with an example of correlated binary data. This data is from a clinical trial comparing two treatments for a respiratory illness.

N=111: 57 placebo (trt=1), 54 active treatment (trt=2)
Respiratory status was measured at 4 time points (0=poor, 1=good)
Also measured: center, gender, age, baseline status

For these data, the subjects are independent of each other, but the repeated measures on each subject over the 4 time points are correlated.

Summary Measures Analysis¶

Before we go into GEE, note that we can still do a simple analysis using summary measures. Our goal here is to see if the treatment has an effect on respiratory status. Since respiratory status is binary, we can use the proportion of visits with good status (status = 1) for each subject as the summary measure, and then compare the two treatment groups with a two-sample t-test on these per-subject proportions.

data resptrial;
 input id center treat sex age bl v1-v4; 
cards;
1   1   1   1   46  0   0   0   0   0
2   1   1   1   28  0   0   0   0   0
3   1   2   1   23  1   1   1   1   1
4   1   1   1   44  1   1   1   1   0
5   1   1   2   13  1   1   1   1   1
6   1   2   1   34  0   0   0   0   0
7   1   1   1   43  0   1   0   1   1
8   1   2   1   28  0   0   0   0   0
9   1   2   1   31  1   1   1   1   1
10  1   1   1   37  1   0   1   1   0
11  1   2   1   30  1   1   1   1   1
12  1   2   1   14  0   1   1   1   0
13  1   1   1   23  1   1   0   0   0
14  1   1   1   30  0   0   0   0   0
15  1   1   1   20  1   1   1   1   1
16  1   2   1   22  0   0   0   0   1
17  1   1   1   25  0   0   0   0   0
18  1   2   2   47  0   0   1   1   1
19  1   1   2   31  0   0   0   0   0
20  1   2   1   20  1   1   0   1   0
21  1   2   1   26  0   1   0   1   0
22  1   2   1   46  1   1   1   1   1
23  1   2   1   32  1   1   1   1   1
24  1   2   1   48  0   1   0   0   0
25  1   1   2   35  0   0   0   0   0
26  1   2   1   26  0   0   0   0   0
27  1   1   1   23  1   1   0   1   1
28  1   1   2   36  0   1   1   0   0
29  1   1   1   19  0   1   1   0   0
30  1   2   1   28  0   0   0   0   0
31  1   1   1   37  0   0   0   0   0
32  1   2   1   23  0   1   1   1   1
33  1   2   1   30  1   1   1   1   0
34  1   1   1   15  0   0   1   1   0
35  1   2   1   26  0   0   0   1   0
36  1   1   2   45  0   0   0   0   0
37  1   2   1   31  0   0   1   0   0
38  1   2   1   50  0   0   0   0   0
39  1   1   1   28  0   0   0   0   0
40  1   1   1   26  0   0   0   0   0
41  1   1   1   14  0   0   0   0   1
42  1   2   1   31  0   0   1   0   0
43  1   1   1   13  1   1   1   1   1
44  1   1   1   27  0   0   0   0   0
45  1   1   1   26  0   1   0   1   1
46  1   1   1   49  0   0   0   0   0
47  1   1   1   63  0   0   0   0   0
48  1   2   1   57  1   1   1   1   1
49  1   1   1   27  1   1   1   1   1
50  1   2   1   22  0   0   1   1   1
51  1   2   1   15  0   0   1   1   1
52  1   1   1   43  0   0   0   1   0
53  1   2   2   32  0   0   0   1   0
54  1   2   1   11  1   1   1   1   0
55  1   1   1   24  1   1   1   1   1
56  1   2   1   25  0   1   1   0   1
57  2   1   2   39  0   0   0   0   0
58  2   2   1   25  0   0   1   1   1
59  2   2   1   58  1   1   1   1   1
60  2   1   2   51  1   1   0   1   1
61  2   1   2   32  1   0   0   1   1
62  2   1   1   45  1   1   0   0   0
63  2   1   2   44  1   1   1   1   1
64  2   1   2   48  0   0   0   0   0
65  2   2   1   26  0   1   1   1   1
66  2   2   1   14  0   1   1   1   1
67  2   1   2   48  0   0   0   0   0
68  2   2   1   13  1   1   1   1   1
69  2   1   1   20  0   1   1   1   1
70  2   2   1   37  1   1   0   0   1
71  2   2   1   25  1   1   1   1   1
72  2   2   1   20  0   0   0   0   0
73  2   1   2   58  0   1   0   0   0
74  2   1   1   38  1   1   0   0   0
75  2   2   1   55  1   1   1   1   1
76  2   2   1   24  1   1   1   1   1
77  2   1   2   36  1   1   0   0   1
78  2   1   1   36  0   1   1   1   1
79  2   2   2   60  1   1   1   1   1
80  2   1   1   15  1   0   0   1   1
81  2   2   1   25  1   1   1   1   0
82  2   2   1   35  1   1   1   1   1
83  2   2   1   19  1   1   0   1   1
84  2   1   2   31  1   1   1   1   1
85  2   2   1   21  1   1   1   1   1
86  2   2   2   37  0   1   1   1   1
87  2   1   1   52  0   1   1   1   1
88  2   2   1   55  0   0   1   1   0
89  2   1   1   19  1   0   0   1   1
90  2   1   1   20  1   0   1   1   1
91  2   1   1   42  1   0   0   0   0
92  2   2   1   41  1   1   1   1   1
93  2   2   1   52  0   0   0   0   0
94  2   1   2   47  0   1   1   0   1
95  2   1   1   11  1   1   1   1   1
96  2   1   1   14  0   0   0   1   0
97  2   1   1   15  1   1   1   1   1
98  2   1   1   66  1   1   1   1   1
99  2   2   1   34  0   1   1   0   1
100 2   1   1   43  0   0   0   0   0
101 2   1   1   33  1   1   1   0   1
102 2   1   1   48  1   1   0   0   0
103 2   2   1   20  0   1   1   1   1
104 2   1   2   39  1   0   1   0   0
105 2   2   1   28  0   1   0   0   0
106 2   1   2   38  0   0   0   0   0
107 2   2   1   43  1   1   1   1   1
108 2   2   2   39  0   1   1   1   1
109 2   2   1   68  0   1   1   1   1
110 2   2   2   63  1   1   1   1   1
111 2   2   1   31  1   1   1   1   1
;
run;

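*Per-subject summary measures;
data resp2; set resptrial;
 ngood=sum(of v1-v4);        *number of visits with good status;
 visits=4;                   *number of visits per subject;
 mnstatus=mean(of v1-v4);    *proportion of visits with good status;
 arcsin=arsin(mnstatus);     *arcsine of the proportion;
run;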

proc ttest data=resp2;
  class treat;
  var mnstatus arcsin;
run;


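The arcsine transformation is used when comparing proportions to improve the normal approximation. (The usual variance-stabilizing form is $\arcsin(\sqrt{p})$; the code above applies arsin to the proportion itself.) In either case, with or without the arcsine transform, the t-test shows a significant difference between the groups (p ≈ 0.001), so we conclude that the treatment did improve respiratory status.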
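As a rough cross-check of the summary measures analysis, the pooled two-sample t statistic can be recomputed from the group sizes, means, and standard deviations that PROC TTEST reports for mnstatus. The sketch below is in Python rather than SAS, and the function names (`subject_proportion`, `pooled_t`) are illustrative choices, not part of the course code.

```python
import math

def subject_proportion(visits):
    """Summary measure: proportion of visits with good status (1)."""
    return sum(visits) / len(visits)

def pooled_t(n1, mean1, sd1, n2, mean2, sd2):
    """Equal-variance two-sample t statistic from group summaries."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se_diff, df, se_diff

# Subject 7 in the data listing has v1-v4 = 1, 0, 1, 1.
p7 = subject_proportion([1, 0, 1, 1])

# Group summaries for mnstatus as reported by PROC TTEST
# (treat=1: placebo, treat=2: active treatment).
t_stat, df, se_diff = pooled_t(57, 0.4430, 0.3981, 54, 0.6852, 0.3704)
print(p7, round(t_stat, 2), df, round(se_diff, 4))
```

Up to rounding of the inputs, the statistic and standard error agree with the pooled row of the PROC TTEST output (t = -3.31 on 109 degrees of freedom).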

GEE¶

GEE is an example of a marginal (or population-averaged) model. In a marginal model, the mean response depends only on the covariates (rather than on within-subject correlations or random effects), so the model can be viewed as a cross-sectional GLM analysis repeated at each measurement occasion.

The within-subject correlation is still accounted for and affects statistical inference, but it is treated as a nuisance parameter since it is not of primary interest.
Parameter interpretations are for population effects.

To specify the marginal model, we need three pieces:

The mean response: just as in the usual GLM, we need to specify a link function and the linear predictor (which covariates enter the model)

$$g(\mu)=\beta_0+\beta_1x_1+\cdots +\beta_px_p$$

The form of the variance of the response: for example, in logistic regression the variance of a binary response is $\text{Var}(Y)=\phi\mu(1-\mu)$. An additional parameter $\phi$ is usually included to allow for overdispersion.
A correlation structure for the within-subject correlations: we discussed some last time, such as independence, compound symmetry (exchangeable), and AR(1).
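For the respiratory data, the three pieces might be written as follows. This is a sketch: the subscripts ($i$ for subject, $j$ and $k$ for visits) and the symbol $\alpha$ for the working correlation parameter are notation introduced here for illustration, not taken from the SAS output.

$$\text{logit}(\mu_{ij})=\beta_0+\beta_1\,\text{center}_i+\beta_2\,\text{treat}_i+\beta_3\,\text{sex}_i+\beta_4\,\text{age}_i+\beta_5\,\text{time}_j+\beta_6\,\text{bl}_i,\qquad \mu_{ij}=P(Y_{ij}=1)$$

$$\text{Var}(Y_{ij})=\phi\,\mu_{ij}(1-\mu_{ij})$$

$$\text{Corr}(Y_{ij},Y_{ik})=\alpha \ \text{(exchangeable)}\qquad\text{or}\qquad \text{Corr}(Y_{ij},Y_{ik})=\alpha^{|j-k|}\ \text{(AR(1))},\qquad j\neq k$$

Under independence the working correlation is 0 for $j\neq k$, and under an unstructured working correlation each pair of visits gets its own parameter.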

This model formulation does not correspond to a full likelihood, so an iterative method known as GEE is used. GEE is implemented in PROC GENMOD. Before we look at code, here are a few points about GEE:

The GEE estimator $\hat{\beta}$ is consistent (as long as the mean specification is correct) regardless of whether or not the correlation structure is correct.
Model-based standard errors are correct only if the correlation structure is correctly specified. Since specifying the correlation may be difficult, we can instead use robust "sandwich" estimators for the covariance. The sandwich estimator is (asymptotically) correct whether or not the correlation structure is correctly specified. However, sandwich estimates require a "large" sample size (number of subjects), and estimation is more efficient if we specify the correlation correctly.
Interpretation of a parameter for binary data: the change in the population-averaged log odds per unit change in the covariate (for example, between treatment and control), holding the other covariates constant.
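For reference, the standard form of the estimating equations and of the sandwich covariance estimator can be sketched as below. The notation ($D_i$, $V_i$, $A_i$, $R(\alpha)$, $B$, $M$) is introduced here for illustration and is not used elsewhere in these notes.

$$\sum_{i=1}^{N} D_i^{T}V_i^{-1}\bigl(y_i-\mu_i(\beta)\bigr)=0,\qquad D_i=\frac{\partial\mu_i}{\partial\beta^{T}},\qquad V_i=\phi\,A_i^{1/2}R(\alpha)A_i^{1/2}$$

$$\widehat{\text{Cov}}(\hat{\beta})=B^{-1}MB^{-1},\qquad B=\sum_{i=1}^{N}D_i^{T}V_i^{-1}D_i,\qquad M=\sum_{i=1}^{N}D_i^{T}V_i^{-1}(y_i-\hat{\mu}_i)(y_i-\hat{\mu}_i)^{T}V_i^{-1}D_i$$

Here $A_i$ is the diagonal matrix of variance functions $\mu_{ij}(1-\mu_{ij})$ and $R(\alpha)$ is the working correlation matrix. $B^{-1}$ alone is the model-based covariance, which is valid only if $V_i$ is correctly specified; the middle term $M$ uses the observed residuals, which is why the sandwich form remains valid when the working correlation is wrong.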

Let's fit a model using GEE with different correlation structures.

*Make data into long form;
data respl; set resptrial;
 array vis[4] v1-v4;
 do time=1 to 4;
   status=vis{time};
   output;
 end;
 drop v1 v2 v3 v4;
run;
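The wide-to-long step above can also be pictured outside SAS. The following is a minimal Python sketch of the same reshaping, assuming each wide record is a tuple in the order of the input statement; the function name `wide_to_long` and the dictionary layout are illustrative, not part of the course code.

```python
def wide_to_long(row):
    """Turn one wide record (v1-v4 in the last four fields) into four long records."""
    id_, center, treat, sex, age, bl, *visits = row
    return [
        {"id": id_, "center": center, "treat": treat, "sex": sex,
         "age": age, "bl": bl, "time": time, "status": status}
        for time, status in enumerate(visits, start=1)
    ]

# First three subjects from the data listing above.
wide = [
    (1, 1, 1, 1, 46, 0, 0, 0, 0, 0),
    (2, 1, 1, 1, 28, 0, 0, 0, 0, 0),
    (3, 1, 2, 1, 23, 1, 1, 1, 1, 1),
]
long_rows = [rec for row in wide for rec in wide_to_long(row)]
for rec in long_rows[-4:]:
    print(rec)
```

Each subject contributes one record per visit, so the full data set of 111 subjects becomes 444 long records, which matches the number of observations read by PROC GENMOD.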

*Independence working correlation;
proc genmod data=respl desc;
 class id;
 model status=center treat sex age time bl / d=b;  *d=b is dist=binomial;
 repeated subject=id / type=ind;  *adding the modelse option would also print model-based (non-sandwich) SEs;
 estimate 'treatment' treat 1 / exp;
run;

*Unstructured working correlation;
proc genmod data=respl desc;
 class id;
 model status=center treat sex age time bl / d=b;
 repeated subject=id / type=un;
 estimate 'treatment' treat 1 / exp;
run;

*Compound symmetry (exchangeable) working correlation;
proc genmod data=respl desc;
 class id;
 model status=center treat sex age time bl / d=b;
 repeated subject=id / type=cs;
 estimate 'treatment' treat 1 / exp;
run;

*AR(1) working correlation;
proc genmod data=respl descending;
 class id;
 model status=center treat sex age time bl / d=b;
 repeated subject=id / type=ar(1);
 estimate 'treatment' treat 1 / exp;
run;
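The `estimate ... / exp` statement reports the treatment effect on the log odds scale and its exponentiated value, the odds ratio. As a hedged illustration, the Python sketch below redoes that arithmetic from the treat estimate and empirical standard error printed for the independence fit; the function name `wald_summary` and the use of 1.96 as the normal critical value are choices made here.

```python
import math

def wald_summary(estimate, se, z_crit=1.96):
    """Wald z, 95% limits on the log odds scale, and the exponentiated results."""
    lower, upper = estimate - z_crit * se, estimate + z_crit * se
    return {
        "z": estimate / se,
        "log_odds_ci": (lower, upper),
        "odds_ratio": math.exp(estimate),
        "odds_ratio_ci": (math.exp(lower), math.exp(upper)),
    }

# treat coefficient and empirical (sandwich) SE from the type=ind fit.
res = wald_summary(1.3006, 0.3510)
for name, value in res.items():
    print(name, value)
```

Since treat is coded 1 for placebo and 2 for active treatment, the odds ratio (about 3.67, with 95% limits of roughly 1.85 to 7.30) compares the population-averaged odds of good status under active treatment with those under placebo. Small differences from the SAS output in the last digit come from using rounded inputs.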

Generalized Linear Mixed Models¶

With the GLMM approach, we introduce random effects, which are allowed to vary from one subject to another. As with linear mixed models, adding random effects to the mean response model induces correlations, but not in as simple a way, since the model is non-linear through the link function. These models are more complicated and sometimes cannot even be fit. We will look at just a simple random intercept model. This model assumes that, conditional on the random intercepts, the data follow a usual logistic regression model. The random intercept is assumed to be normally distributed, just as in linear mixed models.
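In symbols, the random intercept logistic model can be sketched as follows, where $b_i$ is the random intercept for subject $i$ and $\sigma_b^2$ its variance (notation introduced here for illustration):

$$\text{logit}\,P(Y_{ij}=1\mid b_i)=\beta_0+\beta_1x_{ij1}+\cdots+\beta_px_{ijp}+b_i,\qquad b_i\sim N(0,\sigma_b^2)$$

Given $b_i$, the repeated responses on subject $i$ are assumed independent; the shared $b_i$ is what makes them correlated marginally.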

Interpretation of parameters:

GEE: population-averaged estimates (similar to what we are used to)
GLMM: subject-specific estimates
- What effect can we predict will happen within an individual across time?
- For non-time covariates (like treatment group): what treatment effect would we expect for an individual or (more awkwardly) between 2 people with the same covariates and the same (unknown) baseline propensity (random effect)?
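To make the difference between the two interpretations concrete, the Python sketch below averages a subject-specific logistic curve over a normal random intercept and reports the population-averaged log odds ratio it implies. It is an illustration under stated assumptions: the variance 2.0433 is the random intercept estimate from the PROC GLIMMIX output, while the subject-specific effect `DELTA = 1.0`, the baseline linear predictor of 0, and the trapezoid-rule integration are hypothetical choices made here.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1 - p))

def marginal_prob(eta, sigma2, n_grid=4001, width=8.0):
    """Average expit(eta + b) over b ~ N(0, sigma2) by the trapezoid rule."""
    sigma = math.sqrt(sigma2)
    lo, hi = -width * sigma, width * sigma
    step = (hi - lo) / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        b = lo + k * step
        density = math.exp(-b * b / (2 * sigma2)) / (sigma * math.sqrt(2 * math.pi))
        weight = 0.5 if k in (0, n_grid - 1) else 1.0
        total += weight * expit(eta + b) * density * step
    return total

SIGMA2 = 2.0433  # random intercept variance estimate from PROC GLIMMIX
DELTA = 1.0      # hypothetical subject-specific log odds ratio (illustration only)

# Population-averaged log odds ratio implied by a subject-specific effect
# DELTA, for subjects whose linear predictor is 0 without the effect.
marginal_delta = logit(marginal_prob(DELTA, SIGMA2)) - logit(marginal_prob(0.0, SIGMA2))
print(round(marginal_delta, 3))
```

The printed value is smaller in magnitude than `DELTA`: averaging over subjects pulls probabilities toward 0.5, so population-averaged (GEE) log odds ratios are attenuated relative to subject-specific (GLMM) ones when the random intercept variance is positive.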

proc glimmix data=respl noclprint;
 class id;
 model status(desc)=center treat sex age time bl  
                   / d=binary solution ddfm=kr;
 random int / subject=id;
 estimate 'treatment' treat 1 /exp;
run;

The TTEST Procedure: mnstatus

treat	N	Mean	Std Dev	Std Err	Minimum	Maximum
1	57	0.4430	0.3981	0.0527	0	1.0000
2	54	0.6852	0.3704	0.0504	0	1.0000
Diff (1-2)		-0.2422	0.3849	0.0731

treat	Method	Mean	Lower 95% CL Mean	Upper 95% CL Mean	Std Dev	Lower 95% CL Std Dev	Upper 95% CL Std Dev
1		0.4430	0.3373	0.5486	0.3981	0.3361	0.4884
2		0.6852	0.5841	0.7863	0.3704	0.3114	0.4573
Diff (1-2)	Pooled	-0.2422	-0.3871	-0.0973	0.3849	0.3399	0.4438
Diff (1-2)	Satterthwaite	-0.2422	-0.3868	-0.0976

Method	Variances	DF	t Value	Pr > |t|
Pooled	Equal	109	-3.31	0.0013
Satterthwaite	Unequal	108.97	-3.32	0.0012

Equality of Variances
Method	Num DF	Den DF	F Value	Pr > F
Folded F	56	53	1.16	0.5984

The TTEST Procedure: arcsin

treat	N	Mean	Std Dev	Std Err	Minimum	Maximum
1	57	0.5907	0.6082	0.0806	0	1.5708
2	54	0.9715	0.6169	0.0840	0	1.5708
Diff (1-2)		-0.3809	0.6124	0.1163

treat	Method	Mean	Lower 95% CL Mean	Upper 95% CL Mean	Std Dev	Lower 95% CL Std Dev	Upper 95% CL Std Dev
1		0.5907	0.4293	0.7520	0.6082	0.5134	0.7460
2		0.9715	0.8031	1.1399	0.6169	0.5186	0.7616
Diff (1-2)	Pooled	-0.3809	-0.6114	-0.1503	0.6124	0.5408	0.7061
Diff (1-2)	Satterthwaite	-0.3809	-0.6115	-0.1502

Method	Variances	DF	t Value	Pr > |t|
Pooled	Equal	109	-3.27	0.0014
Satterthwaite	Unequal	108.48	-3.27	0.0014

The GENMOD Procedure (type=ind)

Model Information
Data Set	WORK.RESPL
Distribution	Binomial
Link Function	Logit
Dependent Variable	status

Number of Observations Read	444
Number of Observations Used	444
Number of Events	249
Number of Trials	444

Class Level Information
Class	Levels	Values
id	111	1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 ...

Parameter Information
Parameter	Effect
Prm1	Intercept
Prm2	center
Prm3	treat
Prm4	sex
Prm5	age
Prm6	time
Prm7	bl

GEE Model Information
Correlation Structure	Independent
Subject Effect	id (111 levels)
Number of Clusters	111
Correlation Matrix Dimension	4
Maximum Cluster Size	4
Minimum Cluster Size	4

Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates
Parameter	Estimate	Standard Error	Lower 95% CL	Upper 95% CL	Z	Pr > |Z|
Intercept	-2.8327	0.8981	-4.5930	-1.0725	-3.15	0.0016
center	0.6723	0.3572	-0.0278	1.3725	1.88	0.0598
treat	1.3006	0.3510	0.6126	1.9885	3.71	0.0002
sex	0.1194	0.4437	-0.7503	0.9890	0.27	0.7879
age	-0.0182	0.0130	-0.0437	0.0073	-1.40	0.1626
time	-0.0643	0.0816	-0.2242	0.0957	-0.79	0.4310
bl	1.8841	0.3502	1.1977	2.5704	5.38	<.0001

Contrast Estimate Results
Label	Mean Estimate	Mean Lower CL	Mean Upper CL	L'Beta Estimate	Standard Error	Alpha	L'Beta Lower CL	L'Beta Upper CL	Chi-Square	Pr > ChiSq
treatment	0.7859	0.6485	0.8796	1.3006	0.3510	0.05	0.6126	1.9885	13.73	0.0002
Exp(treatment)				3.6713	1.2886	0.05	1.8453	7.3043

GEE Model Information
Correlation Structure	Unstructured
Subject Effect	id (111 levels)
Number of Clusters	111
Correlation Matrix Dimension	4
Maximum Cluster Size	4
Minimum Cluster Size	4

Response Profile
Ordered Value	status	Total Frequency
1	1	249
2	0	195

GEE Fit Criteria
QIC	509.4273
QICu	496.8007

Response Profile
Ordered Value	status	Total Frequency
1	1	249
2	0	195

GEE Fit Criteria
QIC	509.1823
QICu	496.9266

GEE Model Information
Correlation Structure	Exchangeable
Subject Effect	id (111 levels)
Number of Clusters	111
Correlation Matrix Dimension	4
Maximum Cluster Size	4
Minimum Cluster Size	4

GEE Model Information
Correlation Structure	AR(1)
Subject Effect	id (111 levels)
Number of Clusters	111
Correlation Matrix Dimension	4
Maximum Cluster Size	4
Minimum Cluster Size	4

The GLIMMIX Procedure

Model Information
Data Set	WORK.RESPL
Response Variable	status
Response Distribution	Binary
Link Function	Logit
Variance Function	Default
Variance Matrix Blocked By	id
Estimation Technique	Residual PL
Degrees of Freedom Method	Kenward-Roger
Fixed Effects SE Adjustment	Kenward-Roger

Dimensions
G-side Cov. Parameters	1
Columns in X	7
Columns in Z per Subject	1
Subjects (Blocks in V)	111
Max Obs per Subject	4

Optimization Information
Optimization Technique	Newton-Raphson with Ridging
Parameters in Optimization	1
Lower Boundaries	1
Upper Boundaries	0
Fixed Effects	Profiled
Starting From	Data

Iteration History
Iteration	Restarts	Subiterations	Objective Function	Change	Max Gradient
0	0	4	2006.0074893	0.37823326	1.591E-7
1	0	3	2071.958408	0.16661455	6.979E-6
2	0	3	2098.9940163	0.05552455	1.839E-9
3	0	2	2106.2429348	0.01408368	7.202E-7
4	0	2	2107.9080296	0.00317548	1.91E-9
5	0	1	2108.2717673	0.00069019	7.562E-6
6	0	1	2108.3502005	0.00014931	3.537E-7
7	0	1	2108.3671377	0.00003215	1.64E-8
8	0	1	2108.3707837	0.00000692	7.59E-10
9	0	1	2108.371568	0.00000149	3.51E-11
10	0	0	2108.3717367	0.00000000	4.647E-6

Fit Statistics
-2 Res Log Pseudo-Likelihood	2108.37
Generalized Chi-Square	281.75
Gener. Chi-Square / DF	0.64

Covariance Parameter Estimates
Cov Parm	Subject	Estimate	Standard Error
Intercept	id	2.0433	0.5305
