Last week we discussed multiple linear regression, where we modeled the mean response of a continuous variable as a linear function of some covariates. This model assumes that the response follows a normal distrubtion at each value of the covariates. This week we will consider modeling the mean response of a binary variable as a function of certain covariates. We could use a multiple regression model, but this approach has several issues:
Instead of using the standard normal linear model, the response is modeled as a Bernoulli variable where a function of the mean response is linear in the covariates. This function is known as the link function and the most common link function is the logit fcuntion:
$$\text{logit}(p)=\log\left(\dfrac{p}{1-p}\right).$$Let $\pi=\pi(x)=Pr(Y=1|X)$. In the logistic regression model, we model the logit of the mean response as a linear function of the covariates
$$\text{logit}(\pi)=\beta_0+\beta_1X_1+\cdots+\beta_pX_p$$If we solve for $\pi$, then we get
$$\pi(x)=\dfrac{\exp(\beta_0+\beta_1X_1+\cdots+\beta_pX_p)}{1+\exp(\beta_0+\beta_1X_1+\cdots+\beta_pX_p)}$$Note that this model only allow $pi$ to be between 0 and 1. For a given predictor X, the associated $\beta$ is the effect on the log odds of a success for a one unit change in X.
DATA logitfct;
DO X=-15 to 15 BY 0.01;
predpos = EXP(1+ 0.5*X)/(1+EXP(1+0.5*X));
predneg = EXP(1 - 0.5*X)/(1+EXP(1-0.5*X));
OUTPUT;
END;
RUN;
PROC SGPLOT DATA=logitfct;
TITLE;
SERIES X = X Y = predpos;
SERIES X = X Y = predneg;
RUN;
Logistic regression aassumes a monotone (increasing or decreasing but not both) relationship between. If the $\beta_1>0$, then $\pi(x)$ increases with $x$ and if $\beta_1<0$, then $\pi(x)$ deccreases with $x$. Also, note that a fixed change in X has less impact on $\pi$ when $pi$ is near 0 or 1 than when $\pi$ is near the middle of its range.
For our first example, let's model the probability of coronary artery disease between males and females.
Data ecg;
input ecg CA $ gend $ cnt;
datalines;
0 absence female_0 11
0 presence female_0 4
1 presence female_0 8
1 absence female_0 10
0 presence male_1 9
0 absence male_1 9
1 presence male_1 21
1 absence male_1 6
;
run;
PROC FREQ DATA=ecg order=FREQ;
TABLES gend*CA;
weight cnt;
RUN;
proc logistic data=ecg descending;
class gend (ref='female_0') / param = ref;
model ca=gend;
weight cnt;
output out=probs p=prob;
run;
The option descending in PROC LOGISTIC here is important to make sure that the correct category of the response is being modeled as the success. PROC LOGISTIC in the binary response case gives the category 1 to the lowest value of the response and 0 to the highest. Here absence would be the lowest and be modeled as the success without the descending option. With the descending option, presence is modeled as 1 and absence is modeled as 0.
From the output, we get that
$$\text{logit}(\pi)=-0.5596+1.2527X$$Since X is also a binary qualitate variable in this case where X=0 (female) and X=1 (male) we have
PROC PRINT DATA=probs;
RUN;
The output also provides three (asymptotically) equivalent tests of overal covariates effect (similar to the ANOVA F-test in the normal linear model)
All tests in this case indicate a significant gender effect.
proc logistic data=ecg descending;
class gend (ref='female_0') /param=ref;
model ca=gend ecg;
weight cnt;
output out=probs2 p=prob;
Title "logistic with 2 factors, no interaction";
run;
From the output, we have
PROC PRINT DATA=probs2;
RUN;
The next data set looks at different covariates to predict low birthweight infants.
data lowbwt;
infile 'H:\BiostatCourses\PHC6937SAS\Week 8\lowbwt.dat';
input id low age lwt race ftv;
run;
PROC PRINT DATA=lowbwt (OBS=5);
RUN;
proc logistic data=lowbwt desc;
model loW=lwt;
estimate '10 pound OR' lwt 10 /exp cl; *OR for 10 pound increase;
output out=pout p=pred;
run;
proc logistic data=lowbwt desc;
class race / param=ref ref=first;
model low=age lwt race ftv;
run;
proc logistic data=lowbwt desc;
class race / param=ref ref=first;
model low=age lwt race ftv /selection=b SLSTAY=0.10;
run;
proc logistic data=lowbwt descending;
model loW=lwt /lackfit;
output out=pout p=pred resdev=devr reschi=chir;
run;
proc sgscatter data=pout;
plot devr*(id pred lwt);
run;
In this case, there is no apparent lack of fit. In the residuals, we should see the residual between -2 and 2 and fairly smooth lines.