In many mdeical studies, the main outcome variable is the time to the occurrence of a particular event. In a randomized controlled trial of treatment for cancer, for example, surgery, radiation, and chemotherapy might be compared with respect to time from randomization and the start of therapy until death. In this case the event of interest is death of a patient, but in other situations, it might be remission from a disease, relief from symptoms, or the recurrence of a particular condition. Such observations are generally referred to by the generic term survival data even when the endpoint or event considered is not death but something else. Such data generally requiree special techniques for their analysis for two main reasons:
In this lecture, we will examing the WHAS500 data set which contains data on the Worcester Heart Attach Study. This study examined several factors such as age, gender and BMI, that may influence survival time after a heart attack. Follow up time for all patients begins at the time of hospital admission after a heart attach and ends with death or loss to follow up (censoring). We will use the following variables
Traditionally, important functions in statistics are the density function and the cumulative distribution function. Let T represent our survival time for uncensored patients, f represent the density, and $F(t)=Pr(T\leq t)$ be the cumualitive distribution function.
LIBNAME surv "H:\BiostatCourses\PublicHealthComputing\Lectures\Week11Survival\SAS";
ODS SELECT Histogram;
proc univariate data = surv.whas500(where=(fstat=1));
var lenfol;
histogram lenfol / kernel;
run;
The histogram with kernal density estimate of the (uncensored) survival time show that the highest risk of death is shortly after the heart attack and decreases as time passes. (Notice that it is right skewed which is very typical for survival data).
ODS SELECT cdfplot;
proc univariate data = surv.whas500(where=(fstat=1));
var lenfol;
cdfplot lenfol;
run;
The estimate cdf for these uncensored survival times show that after around 200 days a patients has accumulated quite a bit of the risk of death (around 50%), that is the probability of living dying before 200 days is 0.5. After this point the cdf increase more slowly. A faster increase in the cdf occurs in the time when there were more deaths (where the probability of death is more likely).
With censored data, we cannot estimate either of these functions. With (right) censeored data, we can estimate the survival and hazard rate functions. The survival function is
$$S(t)=1-F(t)=Pr(T>t)$$which described the probability of living longer than time t. In the presence of censoring, the survival function is estimated using the nonparametric estimator known as the Kaplan-Meier estimator. To calculate the Kaplan-Meier estimator, let
Then the Kaplan-Meier estimator is given by
$$\hat{S}(t)=\Pi_{t_j\leq t}\left(1-\dfrac{d_j}{r_j}\right)$$The Greenwood estimator of the variance of the Kaplan-Meier estimatory is
$$V\left(\hat{S}(t)\right)=\hat{S}^2(t)\sum_{t_j\leq t}\dfrac{d_j}{r_j(r_j-d_j)}$$To get the Kaplan-Meier estimator in SAS, you will use PROC LIFETEST.
ODS SELECT SURVIVALPLOT;
proc lifetest data=surv.whas500 plots=survival(atrisk cb);
ODS OUTPUT ProductLimitEstimates = ple;
time lenfol*fstat(0);
run;
PROC PRINT DATA=ple(obs=25);
RUN;
The TIME statement is required with PROC LIFETEST and is read as
The hazard function is another important function in survival anlysis. The hazard function is defined as
$$h(t)=\lim_{\Delta t\downarrow 0}\dfrac{Pr(t\leq T\leq t+\Delta t|T\geq t)}{s}=\dfrac{f(t)}{S(t)}.$$This function (approximately) describes the probability of dying at time t given that the patient has lived up to time t.
ODS SELECT HAZARDPLOT;
proc lifetest data=surv.whas500 plots=hazard(bw=200);
time lenfol*fstat(0);
run;
This type of "bathtub" shape for the hazard fucntion is common. The risk if death is typically highests immediately following hospitilization and then drops until eventually other factors such as age lead to the risk of death increasing again.
The survival and hazard rate functions provide useful summaries of survival times for a single group of patients, but usually the interest is in comparing groups by comparing their survival curves (for example compare surivival times of males vs females or new vs old cancer treatmenat). In the following, we compare the survival times after a heart attack for men vs women;
PROC FORMAT;
VALUE gdr 0 = "male" 1 = "female";
ODS SELECT SurvivalPlot HomTests;
proc lifetest data=surv.whas500 atrisk plots=survival(atrisk cb) outs=outwhas500;
strata gender / test=(all);
time lenfol*fstat(0);
FORMAT gender gdr.;
run;
These test are of
$$H_0: S_1=S_2\;\text{ vs }\;H_1:S_1\neq S_2$$The most common test is the log rank test, but SAS has other tests of these hypotheses as well. The log rank test is popular becuase under certain conditions it is the most powerful test. In this case, all of the tests conclud that the survival functions are significantly different between males and females. Females generally have a worse surival time.
In Cox's proportional hazards regression model, we model the hazard function with a baseline hazard function, $h_0(t)$, and a function of the covariates in the following way.
$$h(t|X)=h_0(t)\exp(\beta_1X_1+\cdots+\beta_pX_p)$$If we look at the ratio of the hazard functions at two different values of the covariates, then we have a quantity that is independent of time. This is the assumption that the hazards are proportional
$$\dfrac{h(t|x_2)}{h(t|x_1)}=\dfrac{h_0(t)\exp(\beta_1x_2)}{h_0(t)\exp(\beta_1x_1}=\exp(\beta_1(x_2-x_1)$$This means that this model assums that if a subject has a risk of death twice as high as another subject at some time point then the risk of death is always twice as high at every time point.
proc phreg data = surv.whas500 plots=survival;
class gender;
model lenfol*fstat(0) = gender age;
FORMAT gender gdr.;
run;
The global test of the null hypothesis: BETA=0 is a test of all the BETAS in the model being equal to zero. In this case, the model is significant, so at least on predictor is useful at predicting hazard rates.
The estimated hazard rate ratio between females and males is $\exp{-0.06556}=0.937$ which is roughly a 6% decrease in risk. This coefficient is not significant though.
The estimated hazard rate increase by about 7% (hazard rate ratio = $\exp(0.06683)=1.069$) for each additional year.
Cox's model is a semiparametric model meaning that we have made some parametric (functional form of the covariates and proportional hazards) assumptions and some nonparametric assumptions (no assumptions on the form of the baseline hazard). Just in multiple regression, we can use diagnostic plots to assess the these assumptions:
To assess the whether or not the functional form of the covariates is correct we can examine residual plots. For Cox's model, there are different types of residuals such as Cox-Snell, diviance, martingale and Schoenfeld residuals. How all these residuals are calculated and the theory behind them is beyond the scope of this course, but we will discuss how to use them in diagnostic plots. To assess the funcitonal form, we can plot the martingal residuals vs each variables. Similar to multiple regression, we should the relationship modeling between these two should be a flat line through 0. We could for example use a loess curve to estimate this relationship.
/*full model with linear and quadratric term for bmi */
ODS SELECT NONE;
proc phreg data = surv.whas500;
class gender;
model lenfol*fstat(0) = gender|age bmi hr;
outout out=residfull resmart=martingale;
run;
ODS SELECT ALL;
proc loess data = residfull plots=ResidualsBySmooth(smooth);
model martingale = bmi / smooth=0.2 0.4 0.6 0.8;
run;
In this case, the loess curves do vary some about the line y=0 but not too much. We may want to condiser a different functional form for a better fit though. Another way to assess the functional form is with the assess statement in proc phreg.
proc phreg data = surv.whas500;
class gender;
model lenfol*fstat(0) = gender|age bmi hr;
assess var=(age bmi hr) / resample;
run;
What we want to see in these plots is the solid line doesn't extend beyond the region set by the many dashed lines. The solid line should stay well within this region if the form is correct. These plots are accompanied by simulated p-value. A small p-value indicates that the model should be modified. Most of these plots look fine, but the bmi plot does have a slighly high residual in the lower bmi range around 20-27.
proc phreg data = surv.whas500;
class gender;
model lenfol*fstat(0) = gender|age bmi|bmi hr;
assess var=(age bmi bmi*bmi hr) / resample;
run;
The residuals for bmi are now smaller on the lower end of bmi. All of these plots look reasonable.
To assess the proportional hazards assumption, we can use several methods. If the predictor is categorical we can visually plot the two estimated Kaplan-Meier curves to see if this assumption is violated as we did for males and female above. We can also use the Schoenfel residuals. Plots of the Schoenfeld resisudal vs each predictor in the model should not show any estimated mean relationship (i.e a the line y=0). We can estimate this using a loess curve.
ODS SELECT NONE;
proc phreg data=surv.whas500;
class gender;
model lenfol*fstat(0) = gender|age bmi|bmi hr;
output out=schoen ressch=schgender schage schgenderage
schbmi schbmibmi schhr;
run;
ODS SELECT ALL;
proc loess data = schoen;
model schage=lenfol / smooth=(0.2 0.4 0.6 0.8);
run;
The smoothed loess curve appears roughly flat at 0 showing suggesting that the coefficient for age does not change over time and that the proportional hazards assumption holds for this covariate.
A third way we can check this assumption is by using the assess statement.
proc phreg data=surv.whas500;
class gender;
model lenfol*fstat(0) = gender|age bmi|bmi hr;
assess var=(age bmi bmi*bmi hr) ph / resample ;
run;
Again, we are looking for solid lines that stay within the region defined by the dashed simulated lines. None of these lines looks particularly worrisome and all of the p-values are large suggesting that the proportional hazards assumption holds for each of these covariates.