Data Set WORK.BPRS
| Obs | id | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | grp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 101 | 42 | 36 | 36 | 43 | 41 | 40 | 38 | 47 | 51 | 1 |
| 2 | 102 | 58 | 68 | 61 | 55 | 43 | 34 | 28 | 28 | 28 | 1 |
| 3 | 103 | 54 | 55 | 41 | 38 | 43 | 28 | 29 | 25 | 24 | 1 |
| 4 | 104 | 55 | 77 | 49 | 54 | 56 | 50 | 47 | 42 | 46 | 1 |
| 5 | 105 | 72 | 75 | 72 | 65 | 50 | 39 | 32 | 38 | 32 | 1 |
Subjects in medical/health studies are usually followed ober time (prospective studies). In clinical trials there is often a single outcome measure after a period of time that is then analyzed using techniques we've discussed so far such as linear regression, anova, logistic regresseion etc. We have also examine time to event data (survival analysis data) where we follow patients across time until an event happens or they are censored. Methods for this type of data include the Kaplan-Meier estimatro and the proportional hazards regression model.
All of the methods discussed so far assume that all of the responses are independent. In many applications this is not the case. For example, in longitudinal data an outcome is measured multple times on each patient through time. These responses measured through time on the same subject can no longer be assumed to be independent and different methods are needed for analysis. Longitudinal data is not the only type of data where we need to consider correlations. Consider a multi-site clinical trial. It can be assumed that patients from different sites are indpendent but patients within a site may be correlated due to common measurement procedures, level of care, and other factors. This type of data is uaually referred to as clustered, hierarchical, or multilevel data (The site forms one level and the patients another level nested within each site).
We will focus mainly on longitudinal data (which is interested in change over time) here but these methods can be applied to clustered data as well. We will look at
Let's consider an example data set from a clinical trila with 40 subjects randomly assigned to two treatment groups and measured on the Brief Psychiatric Rating Scale (BPRS) at baseline and weekly for 8 weeks.
data bprs;
input id x0-x8;
grp=1;
if _n_>20 then grp=2;
id=100*grp+id;
cards; *datalines;
1 42 36 36 43 41 40 38 47 51
2 58 68 61 55 43 34 28 28 28
3 54 55 41 38 43 28 29 25 24
4 55 77 49 54 56 50 47 42 46
5 72 75 72 65 50 39 32 38 32
6 48 43 41 38 36 29 33 27 25
7 71 61 47 30 27 40 30 31 31
8 30 36 38 38 31 26 26 25 24
9 41 43 39 35 28 22 20 23 21
10 57 51 51 55 53 43 43 39 32
11 30 34 34 41 36 36 38 36 36
12 55 52 49 54 48 43 37 36 31
13 36 32 36 31 25 25 21 19 22
14 38 35 36 34 25 27 25 26 26
15 66 68 65 49 36 32 27 30 37
16 41 35 45 42 31 31 29 26 30
17 45 38 46 38 40 33 27 31 27
18 39 35 27 25 29 28 21 25 20
19 24 28 31 28 29 21 22 23 22
20 38 34 27 25 25 27 21 19 21
1 52 73 42 41 39 38 43 62 50
2 30 23 32 24 20 20 19 18 20
3 65 31 33 28 22 25 24 31 32
4 37 31 27 31 31 26 24 26 23
5 59 67 58 61 49 38 37 36 35
6 30 33 37 33 28 26 27 23 21
7 69 52 41 33 34 37 37 38 35
8 62 54 49 39 55 51 55 59 66
9 38 40 38 27 31 24 22 21 21
10 65 44 31 34 39 34 41 42 39
11 78 95 75 76 66 64 64 60 75
12 38 41 36 27 29 27 21 22 23
13 63 65 60 53 52 32 37 52 28
14 40 37 31 38 35 30 33 30 27
15 40 36 55 55 42 30 26 30 37
16 54 45 35 27 25 22 22 22 22
17 33 41 30 32 46 43 43 43 43
18 28 30 29 33 30 26 36 33 30
19 52 43 26 27 24 32 21 21 21
20 47 36 32 29 25 23 23 23 23
;
run;
PROC PRINT DATA=BPRS(OBS=5);
RUN;
Note that this dataset is in wide form, where we have one row per subject (mutltiple observations on each subject per row). This form is good for calculating summary measures. Long form is more useful for plots and analyses such as mixed models.
*Baseline as covatiate;
data bprslb;
set bprs;
array xs x1-x8;
bprs0=x0;
do week=1 to 8;
bprs=xs{week};
output;
end;
keep id grp week bprs0 bprs;
run;
PROC PRINT DATA=bprslb(obs=5);
run;
Graphical displays of longitudinal data are always useful for examining patterns and variability features of the data on both an individual level and at a population level (by group). This information may be informative to help determine the details of subsequent formal modeling.
Some simple guidlines given by Diggle et al, 2002 for graphical displays of longitudinal data
One useful plot is called a "spaghetti plot"
proc sgpanel data=bprsl ;
panelby grp / spacing=10;
series y=bprs x=week /group=id ;
Title 'All subjects by group';
run;
To detect unusual observations, it may be helpful to standardize the reponses within each week.
proc sort data=bprsl;
by week;
run;
proc stdize data=bprsl out=bprslz method=std;
var bprs;
by week;
run;
proc sgpanel data=bprslz ;
panelby grp / spacing=10;
series y=bprs x=week /group=id ;
title ‘standardized scores by time and trt’;
run;
When looking at a dataset with a large number of observations, much of the information that can be gleaned by looking at the individual responses will be obscured. In this case, it is often useful to only plot some of the data (such as a randomly chosen subset of the data) or to look at the mean plots by trt group.
proc sgplot data=bprsl;
vline week / response=bprs stat=mean group=grp limitstat=stderr groupdisplay=cluster;
Title 'Means by week and TRT';
run;
This plot suggests little difference between groups with respect to mean BPRS score. Another similar plot, would be to look at side by side boxplots by week.
proc sgplot data=bprsl;
vbox bprs / category=week group=grp;* nocaps;
Title 'Boxplots by week and TRT';
run;
This simple methods of repeated measures analysis uses a summary of the repeated measures for each individual to capture the information about the responses in a single value. Then we have a single measurement for each individual which are independent, so we can then apply standard analyses for independent groups that we have already covered. Which summary statistic to use depends on the question of interest in the study.
Let's try this approach on the BPRS data. Is the overall mean BPRS score different between the two treatments post baseline? We could calculate the mean of the bprs score for each individual post baseline and perform a two sample t-test
data bprsm; *Calculate means using wide dataset;
set bprs;
mnbprs=mean(of x1-x8);
run;
proc sgplot data=bprsm;;
vbox mnbprs / category=grp ;
Title "boxplots of mean outomes";
run;
proc ttest data=bprsm; *t-test for mean outcome;
class grp;
var mnbprs;
run;
It may be better to include the baseline measurement as a coviarate in a regression model becuase these baseline values are often corrlated with the outcome. Including these baseline measurements can increase precision. (This is ANCOVA)
proc means data=bprsm; *Compare baseline means;
class grp;
var x0;
run;
proc glm data=bprsm; *ANCOVA;
class grp;
model mnbprs=x0 grp /solution;
title "ANCOVA for mean outcome";
run;
Another question of interest may be in how does the rate of change in the BPRS scores differ between the treatment groups? In this case, we could use the slope from a simple linear regression of the outcome bprs on time for each subject as the summary measure.
proc sort data=bprsl;
by id;
run;
proc reg data=bprsl outest=regco noprint;
by grp id x0 ;
model bprs=week;
run;
data reggrp; set regco;
bprsrate=week;
keep id grp bprsrate x0;
run;
proc sgplot data=reggrp;
vbox bprsrate / category=grp ;
Title ‘boxplots of mean rates’;
run;
proc ttest data=reggrp; *t-test for reg coef;
class grp;
var bprsrate;
run;
Again, we could also control for the baseline measurment by performing ANCOVA.
*Ancova on slopes;
proc glm data=reggrp;
class grp;
model bprsrate=x0 grp /solution;
title 'ANCOVA for mean outcome';
run;
With longitudinal/clustered data, we have correlated observations. If we ignore this correlation and fit standard linear regression models/ANOVA that assume independence, then our estimates of the parameter will still be valid (assuming the model is correct), but our inferences will be wrong (the standard errors/p-values will be incorect). Usually the estimation and inference for the regression parameters is of primary interest but there are cases where the correlation itself is important. In either case, we need to choose a model for the covariance structure.
We will fit linear mixed models using PROC MIXED. PROC MIXED has several built in correlation structures to choose from. We will describe the most common ones now:
PROC MIXED can be a rather complicated procedure and to fully discuss how the options work would require looking at the matrix form of the general linear mixed effects model. We will not persue that discussion here. We will instead look at a few examples of using proc mixed and try to heuristically describe the options for a given model.
First, let's discuss a simple linear regression model with possible random coefficients.
$$y_{i,j}=\beta_0 + \beta_1t_j + u_{i0} + u_{i1}*t_j\varepsilon_{i,j}$$The fixed effects are $\beta_0$ and $\beta_1$ and the random effects are $u_{i0}$ and $u_{i1}$. You can think of this model as describing the population mean regression line through the betas and inddividuals follow this general shape but vary around this line with the random error for the idividual described by the random effects $u_{i0}$ and $u_{i1}$.
Unfortunately, we do not have enough time to discuss how to derive a model from a given dataset. We will only discuss how to fit a given model.
For out first example, lets consider the following dental dataset. This study followed 16 boys and 11 girls for dental visits at the ages of 8,10,12, and 14 where the distance between the center of the pituiatry gland to the pteryomaxillary fissure was measured in mm. The goals of the study were to describe the distance in boys and girls as simple functions of age, and then to compare the growth functions for boys and girls.
TITLE ;
Data dental; *Inputted as “Wide Form”;
input ID gender $ D8 D10 D12 D14;
datalines;
1 F 21.0 20.0 21.5 23.0
2 F 21.0 21.5 24.0 25.5
3 F 20.5 24.0 24.5 26.0
4 F 23.5 24.5 25.0 26.5
5 F 21.5 23.0 22.5 23.5
6 F 20.0 21.0 21.0 22.5
7 F 21.5 22.5 23.0 25.0
8 F 23.0 23.0 23.5 24.0
9 F 20.0 21.0 22.0 21.5
10 F 16.5 19.0 19.0 19.5
11 F 24.5 25.0 28.0 28.0
12 M 26.0 25.0 29.0 31.0
13 M 21.5 22.5 23.0 26.5
14 M 23.0 22.5 24.0 27.5
15 M 25.5 27.5 26.5 27.0
16 M 20.0 23.5 22.5 26.0
17 M 24.5 25.5 27.0 28.5
18 M 22.0 22.0 24.5 26.5
19 M 24.0 21.5 24.5 25.5
20 M 23.0 20.5 31.0 26.0
21 M 27.5 28.0 31.0 31.5
22 M 23.0 23.0 23.5 25.0
23 M 21.5 23.5 24.0 28.0
24 M 17.0 24.5 26.0 29.5
25 M 22.5 25.5 25.5 26.0
26 M 23.0 24.5 26.0 30.0
27 M 22.0 21.5 23.5 25.0
;
RUN;
data DentL;
set dental;
array DR[4] d8 d10 d12 d14;
do age=8 to 14 by 2;
Dist=DR{(Age-6)/2};
output;
end;
keep id gender age dist;
run;
PROC PRINT DATA=DENTL (obs=20) NOOBS;
RUN;
proc sgpanel data=DentL noautolegend;
panelby gender / spacing=10;
series y=Dist x=age /group=ID ;
Title 'All subjects by group';
run;
*Means plot;
proc sgplot data=dentl;
vline age /response=dist stat=mean
group=gender limitstat=stderr;
run;
proc mixed data=dentl method=REML covtest;
Class gender;
model dist=age gender age*gender/solution ddfm=kr outp=pred outpm=outm residual ;
random Intercept /subject=id type=un solution G;
ods output solutionr=t;
run;
proc sgpanel data=pred noautolegend;
panelby gender / spacing=10;
series y=pred x=age /group=ID ;
Title 'Individual Predictions by Gender';
run;
proc sgplot data=outm noautolegend;
series y=pred x=age /group=ID ;
Title 'Means by Gender';
run;
/* Equivalent model using repeated instead of random*/
proc mixed data=dentl method=REML;
Class gender;
model dist=age gender age*gender/solution ddfm=kr;
repeated /subject=ID type=CS r rcorr;
run;
Next, we will add in a random intercept to the model.
proc mixed data=dentl method=REML covtest;
Class gender;
model dist=age gender age*gender /solution ddfm=kr outp=predIS residual;
random Intercept age /subject=id type=un ;
run;
proc sgpanel data=predIS noautolegend;
panelby gender / spacing=10;
series y=pred x=age /group=ID ;
Title 'Individual Predictions by Gender';
run;
*using AR(1) form;
proc mixed data=dentl method=REML;
Class gender;
model dist=age gender age*gender/solution ddfm=kr;
Random intercept /subject=id type=un;
repeated /subject=ID type=AR(1) r rcorr;
run;