Introduction

In this section, we will begin to learn how to analyze clustered (correlated) data. A key assumption of the multiple linear regression model is that all of the observations are independent. Consider the following data sets, in which that assumption is unlikely to hold:

Knee radiographs are taken yearly in order to understand the onset of osteoarthritis.
Troponin (which is an indicator of heart damage) is measured from blood samples 1, 3, and 6 days following a brain hemorrhage.
Groups of patients in a urinary incontinence trial are assembled from different treatment centers.
Susceptibility to tuberculosis is measured in family members.

All of these are examples of what is called repeated measures, hierarchical, or clustered data. Such data structures are quite common in medical research and in many other fields.

Two features of this type of data are noteworthy and significantly affect the choice of statistical analysis. First, the outcomes are correlated across observations. Yearly radiographs on a person are more similar to one another than to radiographs on other people. Troponin measurements on the same person are more similar to one another than to those on other people. And groups of patients from a single center may yield similar responses because of variations in treatment protocol from center to center, the persons or machines providing the measurements, or the similarity of individuals who choose to participate in a study at that center.
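To make this first feature concrete, here is a minimal simulation sketch in Python. It is not part of the examples above. It assumes that each person has their own baseline level drawn from a normal distribution and that each measurement adds independent noise to that baseline. The function names and the values of `sd_person` and `sd_noise` are illustrative assumptions, not estimates from any real study.

```python
import random
import statistics


def simulate_clustered(n_people=200, n_visits=3, sd_person=2.0, sd_noise=1.0, seed=0):
    """Simulate repeated measures: one row per person, one column per visit.

    Assumed model (illustrative only): measurement = person baseline + noise.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_people):
        baseline = rng.gauss(0.0, sd_person)  # shared by all of this person's visits
        data.append([baseline + rng.gauss(0.0, sd_noise) for _ in range(n_visits)])
    return data


def pearson(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


data = simulate_clustered()
visit1 = [row[0] for row in data]
visit2 = [row[1] for row in data]

# Visits 1 and 2 on the same person
same_person = pearson(visit1, visit2)

# Visit 1 of each person paired with visit 2 of a different person
shifted_visit2 = visit2[1:] + visit2[:1]
different_people = pearson(visit1, shifted_visit2)

print(f"same person:      {same_person:.2f}")
print(f"different people: {different_people:.2f}")
```

Under these assumed standard deviations, the correlation between two measurements on the same person is 2^2 / (2^2 + 1^2) = 0.8 in theory, while measurements on different people are uncorrelated. The simulated values should land close to those figures, which is exactly the departure from independence described above.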

A second important feature of this type of data is that predictor variables can be associated with different levels of a hierarchy. Consider a study of the choice of surgery to treat a brain aneurysm: either clipping the base of the aneurysm or implanting a small coil. The study is conducted by recording the type of surgery each patient receives from a number of surgeons at a number of different institutions. This is thus a hierarchical dataset, with multiple patients clustered within a surgeon and multiple surgeons clustered within a hospital. Predictor variables can be specific to any level of this hierarchy. We might be interested in the volume of operations at the hospital, or whether it is a for-profit or not-for-profit hospital. We might be interested in the years of experience of the surgeon or where she was trained. Or we might be interested in how the choice of surgery type depends on the age and gender of the patient.
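One way to picture predictors at different levels is to sketch how such a dataset might be stored. The Python sketch below uses three hypothetical tables, one per level of the hierarchy, and flattens them into one row per patient. All identifiers, field names, and values are invented for illustration and do not come from any real study.

```python
# Hypothetical records: every name and value here is invented for illustration.

# Level 3: hospitals, with hospital-level predictors
hospitals = {
    "H1": {"annual_volume": 120, "for_profit": True},
    "H2": {"annual_volume": 45, "for_profit": False},
}

# Level 2: surgeons, each nested within exactly one hospital
surgeons = {
    "S1": {"hospital": "H1", "years_experience": 12},
    "S2": {"hospital": "H1", "years_experience": 4},
    "S3": {"hospital": "H2", "years_experience": 20},
}

# Level 1: patients, each nested within exactly one surgeon
patients = [
    {"surgeon": "S1", "age": 54, "gender": "F", "surgery": "clip"},
    {"surgeon": "S1", "age": 61, "gender": "M", "surgery": "coil"},
    {"surgeon": "S2", "age": 47, "gender": "F", "surgery": "coil"},
    {"surgeon": "S3", "age": 70, "gender": "M", "surgery": "clip"},
]


def flatten(patients, surgeons, hospitals):
    """Build one row per patient, copying down surgeon- and hospital-level predictors."""
    rows = []
    for p in patients:
        s = surgeons[p["surgeon"]]
        h = hospitals[s["hospital"]]
        rows.append({
            **p,                                      # patient-level variables
            "hospital": s["hospital"],
            "years_experience": s["years_experience"],  # surgeon-level
            "annual_volume": h["annual_volume"],        # hospital-level
            "for_profit": h["for_profit"],              # hospital-level
        })
    return rows


rows = flatten(patients, surgeons, hospitals)
for row in rows:
    print(row)
```

In the flattened rows, patient-level variables such as age change from row to row, surgeon-level variables repeat for every patient of the same surgeon, and hospital-level variables repeat for every patient in the same hospital. That repetition is what it means for a predictor to belong to a higher level of the hierarchy.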

How to accommodate these different structures in our model is the topic of this section. Here, we will only consider continuous quantitative responses. We will come back later and see how to handle binary and count responses as well.

First, we will examine, with examples, several different structures that result in correlated data. There are two main structures:

Hierarchical (or Nested) structures
Non-hierarchical structures
- Cross-classified
- Multiple membership

Our main focus will be on hierarchical structures, but we will see some examples of non-hierarchical data as well. After working through these examples, we will discuss linear mixed effects models for analyzing this type of data.