Correlated Data¶

Subjects in medical/health studies are usually followed ober time (prospective studies). In clinical trials there is often a single outcome measure after a period of time that is then analyzed using techniques we've discussed so far such as linear regression, anova, logistic regresseion etc. We have also examine time to event data (survival analysis data) where we follow patients across time until an event happens or they are censored. Methods for this type of data include the Kaplan-Meier estimatro and the proportional hazards regression model.

All of the methods discussed so far assume that all of the responses are independent. In many applications this is not the case. For example, in longitudinal data an outcome is measured multple times on each patient through time. These responses measured through time on the same subject can no longer be assumed to be independent and different methods are needed for analysis. Longitudinal data is not the only type of data where we need to consider correlations. Consider a multi-site clinical trial. It can be assumed that patients from different sites are indpendent but patients within a site may be correlated due to common measurement procedures, level of care, and other factors. This type of data is uaually referred to as clustered, hierarchical, or multilevel data (The site forms one level and the patients another level nested within each site).

We will focus mainly on longitudinal data (which is interested in change over time) here but these methods can be applied to clustered data as well. We will look at

Graphical methods
Simple analysis methods through derived variables (summary measures analysis)
Linear Mixed Models
GEE and Genearlized Mixed Effects Models (GLMM)

Let's consider an example data set from a clinical trila with 40 subjects randomly assigned to two treatment groups and measured on the Brief Psychiatric Rating Scale (BPRS) at baseline and weekly for 8 weeks.

BPRS: 18 item measure used to screen for schizophrenia (range is 18-26)
First 20 subjects in dataset are Trt 1 and the next 20 are Trt2

data bprs;  
  input id x0-x8;
  grp=1;
  if _n_>20 then grp=2;
  id=100*grp+id;
cards; *datalines;
    1   42  36  36  43  41  40  38  47  51
    2   58  68  61  55  43  34  28  28  28
    3   54  55  41  38  43  28  29  25  24
    4   55  77  49  54  56  50  47  42  46
    5   72  75  72  65  50  39  32  38  32
    6   48  43  41  38  36  29  33  27  25
    7   71  61  47  30  27  40  30  31  31
    8   30  36  38  38  31  26  26  25  24
    9   41  43  39  35  28  22  20  23  21
    10  57  51  51  55  53  43  43  39  32
    11  30  34  34  41  36  36  38  36  36
    12  55  52  49  54  48  43  37  36  31
    13  36  32  36  31  25  25  21  19  22
    14  38  35  36  34  25  27  25  26  26
    15  66  68  65  49  36  32  27  30  37
    16  41  35  45  42  31  31  29  26  30
    17  45  38  46  38  40  33  27  31  27
    18  39  35  27  25  29  28  21  25  20
    19  24  28  31  28  29  21  22  23  22
    20  38  34  27  25  25  27  21  19  21
    1   52  73  42  41  39  38  43  62  50
    2   30  23  32  24  20  20  19  18  20
    3   65  31  33  28  22  25  24  31  32
    4   37  31  27  31  31  26  24  26  23
    5   59  67  58  61  49  38  37  36  35
    6   30  33  37  33  28  26  27  23  21
    7   69  52  41  33  34  37  37  38  35
    8   62  54  49  39  55  51  55  59  66
    9   38  40  38  27  31  24  22  21  21
    10  65  44  31  34  39  34  41  42  39
    11  78  95  75  76  66  64  64  60  75
    12  38  41  36  27  29  27  21  22  23
    13  63  65  60  53  52  32  37  52  28
    14  40  37  31  38  35  30  33  30  27
    15  40  36  55  55  42  30  26  30  37
    16  54  45  35  27  25  22  22  22  22
    17  33  41  30  32  46  43  43  43  43
    18  28  30  29  33  30  26  36  33  30
    19  52  43  26  27  24  32  21  21  21
    20  47  36  32  29  25  23  23  23  23
;
run;

PROC PRINT DATA=BPRS(OBS=5);
RUN;

SAS Connection established. Subprocess id is 4788

Note that this dataset is in wide form, where we have one row per subject (mutltiple observations on each subject per row). This form is good for calculating summary measures. Long form is more useful for plots and analyses such as mixed models.

*Baseline as covatiate;
data bprslb;
  set bprs;
  array xs x1-x8;
  bprs0=x0;
  do week=1 to 8;
    bprs=xs{week};
    output;
  end;
  keep id grp week bprs0 bprs;
run;

PROC PRINT DATA=bprslb(obs=5);
run;

Graphical Displays for Longitudinal Data¶

Graphical displays of longitudinal data are always useful for examining patterns and variability features of the data on both an individual level and at a population level (by group). This information may be informative to help determine the details of subsequent formal modeling.

Some simple guidlines given by Diggle et al, 2002 for graphical displays of longitudinal data

Show as much of the raw data as possible
Highlight aggregate patterns of potential interest
Identify both crossectional and longitudinal patterns
Try to make identification of unusual individuals or observations simple

One useful plot is called a "spaghetti plot"

proc sgpanel data=bprsl ;
  panelby grp / spacing=10;
  series y=bprs x=week /group=id ;
  Title 'All subjects by group';
run;

BPRS is decreasing on an individual level for almost everyone over the course of the study
Those who start higher (or lower) tend to stay higher (or lower).
There are individual differences in change (they dont all share the same slope for example).

To detect unusual observations, it may be helpful to standardize the reponses within each week.

proc sort data=bprsl;
 by week;
run;
proc stdize data=bprsl out=bprslz method=std;
  var bprs;
  by week;
run;
proc sgpanel data=bprslz ;
  panelby grp / spacing=10;
  series y=bprs x=week /group=id ;
  title ‘standardized scores by time and trt’;
run;

When looking at a dataset with a large number of observations, much of the information that can be gleaned by looking at the individual responses will be obscured. In this case, it is often useful to only plot some of the data (such as a randomly chosen subset of the data) or to look at the mean plots by trt group.

proc sgplot data=bprsl;
  vline week / response=bprs stat=mean group=grp limitstat=stderr groupdisplay=cluster;
  Title 'Means by week and TRT';
run;

This plot suggests little difference between groups with respect to mean BPRS score. Another similar plot, would be to look at side by side boxplots by week.

proc sgplot data=bprsl;
  vbox bprs / category=week group=grp;* nocaps;
  Title 'Boxplots by week and TRT';
run;

Summary Measures Analysis¶

This simple methods of repeated measures analysis uses a summary of the repeated measures for each individual to capture the information about the responses in a single value. Then we have a single measurement for each individual which are independent, so we can then apply standard analyses for independent groups that we have already covered. Which summary statistic to use depends on the question of interest in the study.

Let's try this approach on the BPRS data. Is the overall mean BPRS score different between the two treatments post baseline? We could calculate the mean of the bprs score for each individual post baseline and perform a two sample t-test

data bprsm; *Calculate means using wide dataset;
 set bprs;
 mnbprs=mean(of x1-x8);
run;
proc sgplot data=bprsm;;
  vbox mnbprs / category=grp ;
 Title "boxplots of mean outomes";
run;
proc ttest data=bprsm;  *t-test for mean outcome;
  class grp;
  var mnbprs;
run;

It may be better to include the baseline measurement as a coviarate in a regression model becuase these baseline values are often corrlated with the outcome. Including these baseline measurements can increase precision. (This is ANCOVA)

proc means data=bprsm;  *Compare baseline means;
 class grp;
 var x0;
run;

proc glm data=bprsm;  *ANCOVA;
  class grp;
  model mnbprs=x0 grp  /solution;
  title "ANCOVA for mean outcome";
run;

Another question of interest may be in how does the rate of change in the BPRS scores differ between the treatment groups? In this case, we could use the slope from a simple linear regression of the outcome bprs on time for each subject as the summary measure.

proc sort data=bprsl;
by id;
run;

proc reg data=bprsl outest=regco noprint;
 by grp id x0 ;
 model bprs=week;
run;

data reggrp; set regco;
   bprsrate=week;
   keep id grp bprsrate x0;
run;

proc sgplot data=reggrp;
  vbox bprsrate / category=grp ;
 Title ‘boxplots of mean rates’;
run;

proc ttest data=reggrp;  *t-test for reg coef;
  class grp;
  var bprsrate;
run;

Again, we could also control for the baseline measurment by performing ANCOVA.

*Ancova on slopes;
proc glm data=reggrp;
  class grp;
  model bprsrate=x0 grp /solution;
  title 'ANCOVA for mean outcome';
run;

Mixed Effect Models¶

With longitudinal/clustered data, we have correlated observations. If we ignore this correlation and fit standard linear regression models/ANOVA that assume independence, then our estimates of the parameter will still be valid (assuming the model is correct), but our inferences will be wrong (the standard errors/p-values will be incorect). Usually the estimation and inference for the regression parameters is of primary interest but there are cases where the correlation itself is important. In either case, we need to choose a model for the covariance structure.

We will fit linear mixed models using PROC MIXED. PROC MIXED has several built in correlation structures to choose from. We will describe the most common ones now:

Independence - no correlation is assumed between observations made on the same subject and so assumes all correlations are 0. This is the same as using a standard linear regression model.

$$\left(\begin{array}{ccc} \sigma^2 & 0 & 0 \\ 0 & \sigma^2 & 0 \\ 0 & 0 & \sigma^2\end{array}\right)$$

Compound symmetry - Every obervation is equally correlated with every other observation from the same subject. Only one additional parameter to estimate.

$$\left(\begin{array}{ccc} \sigma^2 & \rho & \rho \\ \rho & \sigma^2 & \rho \\ \rho & \rho & \sigma^2\end{array}\right)$$

Autoregressive (AR(1)) - assumes observations are related to their own past observations such that obesrvations closer to each other in time are more highly correlated than observations further apart in time. The correlation is assumed to drop off exponentially with time.

$$\left(\begin{array}{ccc} \sigma^2 & \rho & \rho^2 \\ \rho & \sigma^2 & \rho \\ \rho^2 & \rho & \sigma^2\end{array}\right)$$

Unstructured - no restrictions on the structure of the correlation, so each correlation gets its own parameter. This adds a lot of parameters and requires a lot of data to reliably estimate all parameters, but can match any correlation structure.

$$\left(\begin{array}{ccc} \sigma^2 & \rho_{12} & \rho_{13} \\ \rho_{21} & \sigma^2 & \rho_{23} \\ \rho_{31} & \rho_{32} & \sigma^2\end{array}\right)$$

PROC MIXED can be a rather complicated procedure and to fully discuss how the options work would require looking at the matrix form of the general linear mixed effects model. We will not persue that discussion here. We will instead look at a few examples of using proc mixed and try to heuristically describe the options for a given model.

First, let's discuss a simple linear regression model with possible random coefficients.

$$y_{i,j}=\beta_0 + \beta_1t_j + u_{i0} + u_{i1}*t_j\varepsilon_{i,j}$$

The fixed effects are $\beta_0$ and $\beta_1$ and the random effects are $u_{i0}$ and $u_{i1}$. You can think of this model as describing the population mean regression line through the betas and inddividuals follow this general shape but vary around this line with the random error for the idividual described by the random effects $u_{i0}$ and $u_{i1}$.

Unfortunately, we do not have enough time to discuss how to derive a model from a given dataset. We will only discuss how to fit a given model.

For out first example, lets consider the following dental dataset. This study followed 16 boys and 11 girls for dental visits at the ages of 8,10,12, and 14 where the distance between the center of the pituiatry gland to the pteryomaxillary fissure was measured in mm. The goals of the study were to describe the distance in boys and girls as simple functions of age, and then to compare the growth functions for boys and girls.

TITLE ;
Data dental; *Inputted as “Wide Form”;
input ID gender $ D8 D10 D12 D14;
datalines;
 1  F 21.0    20.0    21.5    23.0
 2  F 21.0    21.5    24.0    25.5
 3  F 20.5    24.0    24.5    26.0
 4  F 23.5    24.5    25.0    26.5
 5  F 21.5    23.0    22.5    23.5
 6  F 20.0    21.0    21.0    22.5
 7  F 21.5    22.5    23.0    25.0
 8  F 23.0    23.0    23.5    24.0
 9  F 20.0    21.0    22.0    21.5
10  F 16.5    19.0    19.0    19.5
11  F 24.5    25.0    28.0    28.0
12  M 26.0    25.0    29.0    31.0
13  M 21.5    22.5    23.0    26.5
14  M 23.0    22.5    24.0    27.5
15  M 25.5    27.5    26.5    27.0
16  M 20.0    23.5    22.5    26.0
17  M 24.5    25.5    27.0    28.5
18  M 22.0    22.0    24.5    26.5
19  M 24.0    21.5    24.5    25.5
20  M 23.0    20.5    31.0    26.0
21  M 27.5    28.0    31.0    31.5
22  M 23.0    23.0    23.5    25.0
23  M 21.5    23.5    24.0    28.0
24  M 17.0    24.5    26.0    29.5
25  M 22.5    25.5    25.5    26.0
26  M 23.0    24.5    26.0    30.0
27  M 22.0    21.5    23.5    25.0
;
RUN;

data DentL;
  set dental;
  array DR[4] d8 d10 d12 d14;
  do age=8 to 14 by 2;
    Dist=DR{(Age-6)/2};
    output;
  end;
  keep id gender age dist;
run;

PROC PRINT DATA=DENTL (obs=20) NOOBS;
RUN;

proc sgpanel data=DentL noautolegend;
  panelby gender / spacing=10;
  series y=Dist x=age /group=ID ;
  Title 'All subjects by group';
run;

*Means plot;
proc sgplot data=dentl;
  vline age /response=dist stat=mean 
         group=gender limitstat=stderr;  
run;

proc mixed data=dentl method=REML covtest;
Class gender;
model dist=age gender age*gender/solution ddfm=kr outp=pred outpm=outm residual   ;
random Intercept /subject=id type=un solution G;
ods output solutionr=t;
run;

proc sgpanel data=pred noautolegend;
  panelby gender / spacing=10;
  series y=pred x=age /group=ID ;
  Title 'Individual Predictions by Gender';
run;

proc sgplot data=outm noautolegend;
  series y=pred x=age /group=ID ;
  Title 'Means by Gender';
run;

/* Equivalent model using repeated instead of random*/
proc mixed data=dentl method=REML;
Class gender;
model dist=age gender age*gender/solution ddfm=kr;
repeated  /subject=ID type=CS r rcorr;
run;

Next, we will add in a random intercept to the model.

proc mixed data=dentl method=REML covtest;
Class gender;
model dist=age gender age*gender /solution ddfm=kr outp=predIS residual;
random Intercept age /subject=id type=un ;
run;

proc sgpanel data=predIS noautolegend;
  panelby gender / spacing=10;
  series y=pred x=age /group=ID ;
  Title 'Individual Predictions by Gender';
run;

*using AR(1) form;
proc mixed data=dentl method=REML;
Class gender;
model dist=age gender age*gender/solution ddfm=kr;
Random intercept /subject=id type=un;
repeated  /subject=ID type=AR(1) r rcorr;
run;

grp	N	Mean	Std Dev	Std Err	Minimum	Maximum
1	20	36.1688	8.3691	1.8714	24.8750	52.6250
2	20	36.5625	12.2090	2.7300	22.0000	71.8750
Diff (1-2)		-0.3937	10.4667	3.3098

grp	Method	Mean	95% CL Mean		Std Dev	95% CL Std Dev
1		36.1688	32.2519	40.0856	8.3691	6.3646	12.2236
2		36.5625	30.8485	42.2765	12.2090	9.2848	17.8321
Diff (1-2)	Pooled	-0.3937	-7.0942	6.3067	10.4667	8.5538	13.4892
Diff (1-2)	Satterthwaite	-0.3937	-7.1229	6.3354

Method	Variances	DF	t Value	Pr > \|t\|
Pooled	Equal	38	-0.12	0.9059
Satterthwaite	Unequal	33.626	-0.12	0.9060

Equality of Variances
Method	Num DF	Den DF	F Value	Pr > F
Folded F	19	19	2.13	0.1083

Analysis Variable : x0
grp	N Obs	N	Mean	Std Dev	Minimum	Maximum
1	20	20	47.0000000	13.6034051	24.0000000	72.0000000
2	20	20	49.0000000	14.9243707	28.0000000	78.0000000

Obs	id	x0	x1	x2	x3	x4	x5	x6	x7	x8	grp
1	101	42	36	36	43	41	40	38	47	51	1
2	102	58	68	61	55	43	34	28	28	28	1
3	103	54	55	41	38	43	28	29	25	24	1
4	104	55	77	49	54	56	50	47	42	46	1
5	105	72	75	72	65	50	39	32	38	32	1

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	2	1871.515626	935.757813	15.10	<.0001
Error	37	2292.965234	61.972033
Corrected Total	39	4164.480859

Source	DF	Type I SS	Mean Square	F Value	Pr > F
x0	1	1868.066649	1868.066649	30.14	<.0001
grp	1	3.448977	3.448977	0.06	0.8148

Source	DF	Type III SS	Mean Square	F Value	Pr > F
x0	1	1869.965235	1869.965235	30.17	<.0001
grp	1	3.448977	3.448977	0.06	0.8148

Parameter	Estimate		Standard Error	t Value	Pr > \|t\|
Intercept	12.49017488	B	4.72259392	2.64	0.0119
x0	0.49127194		0.08943408	5.49	<.0001
grp 1	0.58879388	B	2.49583596	0.24	0.8148
grp 2	0.00000000	B	.	.	.

grp	N	Mean	Std Dev	Std Err	Minimum	Maximum
1	20	-2.6283	1.8740	0.4190	-6.2833	1.1667
2	20	-1.9125	1.5539	0.3475	-4.2333	1.3833
Diff (1-2)		-0.7158	1.7214	0.5444

grp	Method	Mean	95% CL Mean		Std Dev	95% CL Std Dev
1		-2.6283	-3.5054	-1.7513	1.8740	1.4252	2.7372
2		-1.9125	-2.6398	-1.1852	1.5539	1.1817	2.2696
Diff (1-2)	Pooled	-0.7158	-1.8178	0.3862	1.7214	1.4068	2.2186
Diff (1-2)	Satterthwaite	-0.7158	-1.8191	0.3874

Method	Variances	DF	t Value	Pr > \|t\|
Pooled	Equal	38	-1.31	0.1964
Satterthwaite	Unequal	36.74	-1.31	0.1967

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	2	48.2429139	24.1214570	12.84	<.0001
Error	37	69.4884680	1.8780667
Corrected Total	39	117.7313819

Source	DF	Type I SS	Mean Square	F Value	Pr > F
x0	1	40.79852340	40.79852340	21.72	<.0001
grp	1	7.44439053	7.44439053	3.96	0.0539

Class Level Information
Class	Levels	Values
grp	2	1 2

Number of Observations Read	40
Number of Observations Used	40

Class Level Information
Class	Levels	Values
grp	2	1 2

Number of Observations Read	40
Number of Observations Used	40

Obs	id	x0	x1	x2	x3	x4	x5	x6	x7	x8	grp
1	101	42	36	36	43	41	40	38	47	51	1
2	102	58	68	61	55	43	34	28	28	28	1
3	103	54	55	41	38	43	28	29	25	24	1
4	104	55	77	49	54	56	50	47	42	46	1
5	105	72	75	72	65	50	39	32	38	32	1

Source	DF	Type III SS	Mean Square	F Value	Pr > F
x0	1	43.11874032	43.11874032	22.96	<.0001
grp	1	7.44439053	7.44439053	3.96	0.0539

Parameter	Estimate		Standard Error	t Value	Pr > \|t\|
Intercept	1.742894941	B	0.82212572	2.12	0.0408
x0	-0.074599897		0.01556900	-4.79	<.0001
grp 1	-0.865033127	B	0.43448388	-1.99	0.0539
grp 2	0.000000000	B	.	.	.

ID	gender	age	Dist
1	F	8	21.0
1	F	10	20.0
1	F	12	21.5
1	F	14	23.0
2	F	8	21.0
2	F	10	21.5
2	F	12	24.0
2	F	14	25.5
3	F	8	20.5
3	F	10	24.0
3	F	12	24.5
3	F	14	26.0
4	F	8	23.5
4	F	10	24.5
4	F	12	25.0
4	F	14	26.5
5	F	8	21.5
5	F	10	23.0
5	F	12	22.5
5	F	14	23.5

Model Information
Data Set	WORK.DENTL
Dependent Variable	Dist
Covariance Structure	Unstructured
Subject Effect	ID
Estimation Method	REML
Residual Variance Method	Profile
Fixed Effects SE Method	Kenward-Roger
Degrees of Freedom Method	Kenward-Roger

Dimensions
Covariance Parameters	2
Columns in X	6
Columns in Z per Subject	1
Subjects	27
Max Obs per Subject	4

Number of Observations
Number of Observations Read	108
Number of Observations Used	108
Number of Observations Not Used	0

Iteration History
Iteration	Evaluations	-2 Res Log Like	Criterion
0	1	483.55911746
1	1	433.75724920	0.00000000

Covariance Parameter Estimates
Cov Parm	Subject	Estimate	Standard Error	Z Value	Pr > Z
UN(1,1)	ID	3.2986	1.0716	3.08	0.0010
Residual		1.9221	0.3058	6.28	<.0001

Fit Statistics
-2 Res Log Likelihood	433.8
AIC (Smaller is Better)	437.8
AICC (Smaller is Better)	437.9
BIC (Smaller is Better)	440.3

Correlated Data¶

SAS Output

The PRINT Procedure

Data Set WORK.BPRS

SAS Output

The PRINT Procedure

Data Set WORK.BPRSLB

Graphical Displays for Longitudinal Data¶

SAS Output

The SGPANEL Procedure

The SGPanel Procedure

SAS Output

The SGPANEL Procedure

The SGPanel Procedure

SAS Output

The SGPLOT Procedure

The SGPlot Procedure

SAS Output

The SGPLOT Procedure

The SGPlot Procedure

Summary Measures Analysis¶

SAS Output

The SGPLOT Procedure

The SGPlot Procedure

The TTEST Procedure

mnbprs

Statistics

Confidence Limits

T-Tests

Equality of Variances

Summary Panel

Q-Q Plots

SAS Output

The MEANS Procedure

Summary statistics

The GLM Procedure

Data

Class Levels

Number of Observations

Analysis of Variance

mnbprs

Overall ANOVA

Fit Statistics

Type I Model ANOVA

Type III Model ANOVA

Solution

ANCOVA Plot

SAS Output

The SGPLOT Procedure

The SGPlot Procedure

The TTEST Procedure

bprsrate

Statistics

Confidence Limits

T-Tests

Equality of Variances

Summary Panel

Q-Q Plots

SAS Output

The GLM Procedure

Data

Class Levels

Number of Observations

Analysis of Variance

bprsrate

Overall ANOVA

Fit Statistics

Type I Model ANOVA

Type III Model ANOVA

Solution

ANCOVA Plot

Mixed Effect Models¶

SAS Output

The PRINT Procedure

Data Set WORK.DENTL

SAS Output

The SGPANEL Procedure

The SGPanel Procedure

The SGPLOT Procedure

The SGPlot Procedure

Obs	id	x0	x1	x2	x3	x4	x5	x6	x7	x8	grp
1	101	42	36	36	43	41	40	38	47	51	1
2	102	58	68	61	55	43	34	28	28	28	1
3	103	54	55	41	38	43	28	29	25	24	1
4	104	55	77	49	54	56	50	47	42	46	1
5	105	72	75	72	65	50	39	32	38	32	1

Solution for Fixed Effects
Effect	gender	Estimate	Standard Error	DF	t Value	Pr > \|t\|
Intercept		16.3406	0.9813	104	16.65	<.0001
age		0.7844	0.07750	79	10.12	<.0001
gender	F	1.0321	1.5374	104	0.67	0.5035
gender	M	0	.	.	.	.
age*gender	F	-0.3048	0.1214	79	-2.51	0.0141
age*gender	M	0	.	.	.	.

Dimensions
Covariance Parameters	2
Columns in X	6
Columns in Z	0
Subjects	27
Max Obs per Subject	4

Estimated R Matrix for Subject 1
Row	Col1	Col2	Col3	Col4
1	5.2207	3.2986	3.2986	3.2986
2	3.2986	5.2207	3.2986	3.2986
3	3.2986	3.2986	5.2207	3.2986
4	3.2986	3.2986	3.2986	5.2207

Fit Statistics
-2 Res Log Likelihood	432.6
AIC (Smaller is Better)	440.6
AICC (Smaller is Better)	441.0
BIC (Smaller is Better)	445.8

Fit Statistics
-2 Res Log Likelihood	433.7
AIC (Smaller is Better)	439.7
AICC (Smaller is Better)	439.9
BIC (Smaller is Better)	443.6