Logistic Regression in SAS¶

Last week we discussed multiple linear regression, where we modeled the mean response of a continuous variable as a linear function of some covariates. This model assumes that the response follows a normal distribution at each value of the covariates. This week we will consider modeling the mean response of a binary variable as a function of certain covariates. We could use a multiple regression model, but this approach has several issues:

A binary variable takes only two values, so it cannot be normally distributed.
The mean of a binary variable is equal to the probability of "success". That is, $E(Y)=Pr(Y=1)$. If we model the mean as a linear function of the covariates, say $E(Y|X)=Pr(Y=1|X)=\beta_0+\beta_1X$, then our model will allow this probability to take values less than 0 and more than 1.

Instead of using the standard normal linear model, we model the response as a Bernoulli variable and assume that a function of the mean response is linear in the covariates. This function is known as the link function, and the most common link function is the logit function:

$$\text{logit}(p)=\log\left(\dfrac{p}{1-p}\right).$$

Let $\pi=\pi(x)=Pr(Y=1|X)$. In the logistic regression model, we model the logit of the mean response as a linear function of the covariates

$$\text{logit}(\pi)=\beta_0+\beta_1X_1+\cdots+\beta_pX_p$$

If we solve for $\pi$, then we get

$$\pi(x)=\dfrac{\exp(\beta_0+\beta_1X_1+\cdots+\beta_pX_p)}{1+\exp(\beta_0+\beta_1X_1+\cdots+\beta_pX_p)}$$

Note that this model only allows $\pi$ to be between 0 and 1. For a given predictor X, the associated $\beta$ is the change in the log odds of a success for a one-unit increase in X, holding the other predictors fixed.
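As a sketch of the algebra behind these statements, write $\eta$ for the linear predictor (the symbol $\eta$ is introduced here only for brevity and is not part of the notation above):

$$\eta=\beta_0+\beta_1X_1+\cdots+\beta_pX_p,\qquad \log\left(\dfrac{\pi}{1-\pi}\right)=\eta \;\Rightarrow\; \dfrac{\pi}{1-\pi}=\exp(\eta) \;\Rightarrow\; \pi=\dfrac{\exp(\eta)}{1+\exp(\eta)}.$$

Because $\exp(\eta)>0$ for every $\eta$, the ratio on the right is always strictly between 0 and 1. Increasing $X_j$ by one unit while holding the other covariates fixed adds $\beta_j$ to $\eta$, so the odds $\pi/(1-\pi)$ are multiplied by $\exp(\beta_j)$.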

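DATA logitfct;
DO X=-15 to 15 BY 0.01;
    * logistic curve with intercept 1 and slope 0.5 (increasing in X);
    predpos = EXP(1+ 0.5*X)/(1+EXP(1+0.5*X));
    * logistic curve with intercept 1 and slope -0.5 (decreasing in X);
    predneg = EXP(1 - 0.5*X)/(1+EXP(1-0.5*X));
    OUTPUT;
END;
RUN;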

PROC SGPLOT DATA=logitfct;
TITLE;
SERIES X = X Y = predpos;
SERIES X = X Y = predneg;
RUN;

Simple Logistic Regression¶

Logistic regression assumes a monotone (increasing or decreasing, but not both) relationship between the predictor and the probability of success. If $\beta_1>0$, then $\pi(x)$ increases with $x$, and if $\beta_1<0$, then $\pi(x)$ decreases with $x$. Also, note that a fixed change in X has less impact on $\pi$ when $\pi$ is near 0 or 1 than when $\pi$ is near the middle of its range.
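One way to see the last point numerically is the short DATA step below. It is only a sketch: it reuses the illustrative coefficients from the plotting example ($\beta_0=1$, $\beta_1=0.5$), and the three starting values of X are arbitrary choices made here so that $\pi$ starts near 0, at 0.5, and near 1.

```sas
DATA _NULL_;
  /* beta0 = 1 and beta1 = 0.5 are illustrative values, not estimates */
  DO X = -12, -2, 8;
    p_before = EXP(1 + 0.5*X) / (1 + EXP(1 + 0.5*X));             /* pi at X     */
    p_after  = EXP(1 + 0.5*(X + 1)) / (1 + EXP(1 + 0.5*(X + 1))); /* pi at X + 1 */
    change   = p_after - p_before;   /* effect of a one-unit increase in X */
    PUT X= p_before= 8.4 p_after= 8.4 change= 8.4;
  END;
RUN;
```

The values written to the log should show the largest change at X = -2, where $\pi$ moves from 0.5 to about 0.62, and changes of less than 0.01 at the two extreme starting values.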

For our first example, let's model the probability of coronary artery disease for males and females. The variables are:

ecg - electrocardiogram result: 0 if ST segment depression is < 0.1 mV and 1 otherwise
CA - coronary artery disease status, stored as "presence" or "absence"
gend - gender, stored as "female_0" or "male_1"
cnt - number of subjects with that combination of values

Data ecg;
input ecg CA $ gend $ cnt;
datalines;
0 absence female_0 11
0 presence female_0 4
1 presence female_0 8
1 absence female_0 10
0 presence male_1 9
0 absence male_1 9
1 presence male_1 21
1 absence male_1 6
;
run;

PROC FREQ DATA=ecg order=FREQ;
TABLES gend*CA;
weight cnt;
RUN;

proc logistic data=ecg descending;
class gend (ref='female_0') / param = ref;
model ca=gend;
weight cnt;
output out=probs p=prob;
run;

The DESCENDING option in PROC LOGISTIC is important here because it ensures that the correct category of the response is modeled as the success. For a binary response, PROC LOGISTIC by default models the probability of the lowest ordered value of the response. Here the response is a character variable, so absence sorts before presence and would be modeled as the success without the DESCENDING option. With the DESCENDING option the order is reversed: presence is modeled as the success (1) and absence as the failure (0).
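An alternative to DESCENDING is to name the event level directly with the EVENT= response option in the MODEL statement. The sketch below refits the same model that way; it should model presence as the success regardless of how the response levels sort.

```sas
proc logistic data=ecg;
  class gend (ref='female_0') / param=ref;
  /* event='presence' states explicitly which response level is the success */
  model ca(event='presence') = gend;
  weight cnt;
run;
```

With either approach, it is worth checking the Response Profile table and the line in the output that states which probability is being modeled.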

From the output, we get that

$$\text{logit}(\pi)=-0.5596+1.2527X$$

Since X is also a binary qualitative variable in this case, with X=0 (female) and X=1 (male), we have

For females, $Pr(\text{coronary artery disease})=\dfrac{\exp(-0.5596)}{1+\exp(-0.5596)}=0.36364$
For males, $Pr(\text{coronary artery disease})=\dfrac{\exp(-0.5596+1.2527)}{1+\exp(-0.5596+1.2527)}=0.6667$
The estimated odds of coronary artery disease for males are $\exp(1.2527)=3.5$ times the odds of coronary artery disease for females.
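These values can be checked by hand, because with a single binary predictor the fitted probabilities equal the observed proportions. The DATA step below is a sketch of that check; the cell counts come from summing cnt over ecg in the datalines above, and the variable names are chosen here for illustration.

```sas
data _null_;
  /* counts summed over ecg from the ecg data set */
  f_pres = 4 + 8;    f_abs = 11 + 10;   /* females: presence, absence */
  m_pres = 9 + 21;   m_abs = 9 + 6;     /* males: presence, absence   */
  odds_f = f_pres / f_abs;
  odds_m = m_pres / m_abs;
  b0 = log(odds_f);                     /* intercept: log odds for females       */
  b1 = log(odds_m / odds_f);            /* slope: log odds ratio, males vs females */
  p_f = f_pres / (f_pres + f_abs);      /* observed proportion, females */
  p_m = m_pres / (m_pres + m_abs);      /* observed proportion, males   */
  or_mf = odds_m / odds_f;
  put b0= 8.4 b1= 8.4 p_f= 8.4 p_m= 8.4 or_mf= 8.4;
run;
```

The female odds are 12/21 and the male odds are 30/15, so the odds ratio is exactly 3.5, in agreement with the estimates reported by PROC LOGISTIC.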

PROC PRINT DATA=probs;
RUN;

The output also provides three (asymptotically) equivalent tests of the overall covariate effect (similar to the ANOVA F-test in the normal linear model):

Likelihood-ratio test
Wald test
Score test

All tests in this case indicate a significant gender effect.
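Two of these statistics can be reproduced from quantities printed elsewhere in the output. The sketch below assumes the values reported for this model: -2 Log L of 107.669 (intercept only) and 100.548 (intercept and gend), and the gend estimate 1.2527 with standard error 0.4806. The score test needs the score vector and information matrix under the null model, so it is not recomputed here.

```sas
data _null_;
  lr   = 107.669 - 100.548;       /* likelihood-ratio chi-square: drop in -2 Log L */
  wald = (1.2527 / 0.4806)**2;    /* Wald chi-square: (estimate / std error) squared */
  p_lr   = 1 - probchi(lr, 1);    /* upper-tail p-values on 1 degree of freedom */
  p_wald = 1 - probchi(wald, 1);
  put lr= 8.4 p_lr= 8.4 wald= 8.4 p_wald= 8.4;
run;
```

Up to rounding, the results should agree with the Likelihood Ratio and Wald rows of the "Testing Global Null Hypothesis" table.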

Multiple Logistic Regression¶

proc logistic data=ecg descending;
class gend (ref='female_0') /param=ref;
model ca=gend ecg;
weight cnt;
output out=probs2 p=prob;
Title "logistic with 2 factors, no interaction";
run;

From the output, we have

$\text{logit}(\pi)=-1.1747+1.277\times\text{Gender}+1.0545\times\text{ECG}$, where Gender = 1 for males and ECG = 1 for ECG $\geq$ 0.1 (the odds below use the intercept rounded to $-1.17$)
The estimated odds of CA for females with ECG < 0.1 is $\exp(-1.17)=0.3104$
The estimated odds of CA for males with ECG < 0.1 is $\exp(-1.17 + 1.277)=1.1129$
The estimated odds of CA for females with ECG $\geq$ 0.1 is $\exp(-1.17 + 1.0545)=0.8909$
The estimated odds of CA for males with ECG $\geq$ 0.1 is $\exp(-1.17 + 1.277 + 1.0545)=3.1947$
The estimated odds ratio between males and females is $\exp(1.277)=3.5859$
The estimated odds ratio between ECG $\geq$ 0.1 and ECG < 0.1 is $\exp(1.0545)=2.8705$
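The same odds can be tabulated for all four gender and ECG combinations in one step. The sketch below plugs the reported estimates into the fitted equation; the data set name odds2 and the 0/1 indicator male are names chosen here for illustration. It uses the unrounded intercept -1.1747, so the results may differ slightly in the third decimal from the values above.

```sas
data odds2;
  b0 = -1.1747;  b_gend = 1.277;  b_ecg = 1.0545;   /* estimates from the output */
  do male = 0 to 1;          /* 0 = female, 1 = male */
    do ecg = 0 to 1;         /* 0 = ECG < 0.1, 1 = ECG >= 0.1 */
      odds = exp(b0 + b_gend*male + b_ecg*ecg);
      prob = odds / (1 + odds);   /* convert odds back to a probability */
      output;
    end;
  end;
run;

proc print data=odds2 noobs;
  var male ecg odds prob;
  format odds prob 8.4;
run;
```

Because the model has no interaction term, the ratio of male to female odds is the same at both ECG levels.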

PROC PRINT DATA=probs2;
RUN;

The next data set looks at different covariates to predict low birthweight infants.

data lowbwt;
  infile 'H:\BiostatCourses\PHC6937SAS\Week 8\lowbwt.dat';
  input id low age lwt race ftv;
run;

PROC PRINT DATA=lowbwt (OBS=5);
RUN;

proc logistic data=lowbwt desc;
 model loW=lwt;
 estimate '10 pound OR' lwt 10 /exp cl;  *OR for 10 pound increase;
 output out=pout p=pred;
run;
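The ESTIMATE statement above multiplies the lwt coefficient by 10 and exponentiates the result. A possible alternative, sketched below, is the UNITS statement, which requests the odds ratio for a change of the stated size in a continuous predictor.

```sas
proc logistic data=lowbwt desc;
  model low = lwt;
  units lwt = 10;   /* odds ratio for a 10-pound increase in lwt */
run;
```

With the reported estimate of -0.1406 for a 10-pound increase, the odds ratio is $\exp(-0.1406)=0.8689$, so a 10-pound increase in lwt multiplies the estimated odds of low birthweight by about 0.87.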

proc logistic data=lowbwt desc;  
  class race / param=ref ref=first;
  model low=age lwt race ftv;
run;

proc logistic data=lowbwt desc;  
  class race / param=ref ref=first;
  model low=age lwt race ftv /selection=b SLSTAY=0.10;
run;

Diagnostics¶

Goodness-of-fit test: the null hypothesis is that the model fits (Hosmer and Lemeshow test)
Deviance residual analysis

proc logistic data=lowbwt descending;
model loW=lwt /lackfit;
output out=pout p=pred resdev=devr reschi=chir;
run;

proc sgscatter data=pout;
plot devr*(id pred lwt);
run;

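In this case, there is no apparent lack of fit. In the residual plots, the deviance residuals should mostly lie between -2 and 2. Because the response is binary, the residuals plotted against the predicted probability or lwt should fall along two fairly smooth curves, one for each value of the response.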
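The Hosmer and Lemeshow statistic can be recomputed from the partition table printed by the LACKFIT option. The sketch below copies the observed and expected counts from that table and sums $(O-E)^2/E$ over both response columns. The reference distribution is assumed to be chi-square with the number of groups minus 2 degrees of freedom, which is the usual convention for this test. Data set and variable names are chosen here for illustration, and because the expected counts are rounded to two decimals the result may differ slightly from the printed statistic.

```sas
data hl;
  input group total obs1 exp1 obs0 exp0;
  /* contribution of this group to the chi-square statistic */
  part = (obs1 - exp1)**2 / exp1 + (obs0 - exp0)**2 / exp0;
  datalines;
1 21 4 3.21 17 17.79
2 20 5 4.58 15 15.42
3 19 4 5.28 15 13.72
4 19 8 5.73 11 13.27
5 18 4 5.80 14 12.20
6 20 5 6.70 15 13.30
7 19 4 6.68 15 12.32
8 23 10 8.57 13 14.43
9 20 9 8.08 11 11.92
10 10 6 4.39 4 5.61
;
run;

data _null_;
  set hl end=last;
  hl_chisq + part;                        /* running sum across groups */
  if last then do;
    df = _n_ - 2;                         /* number of groups minus 2 */
    p_value = 1 - probchi(hl_chisq, df);  /* upper-tail p-value */
    put hl_chisq= 8.4 df= p_value= 8.4;
  end;
run;
```

A large p-value here is consistent with the conclusion that there is no apparent lack of fit.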

Model Information
Data Set	WORK.ECG
Response Variable	CA
Number of Response Levels	2
Weight Variable	cnt
Model	binary logit
Optimization Technique	Fisher's scoring

Number of Observations Read	8
Number of Observations Used	8
Sum of Weights Read	78
Sum of Weights Used	78

Response Profile
Ordered Value	CA	Total Frequency	Total Weight
1	presence	4	42.000000
2	absence	4	36.000000

Class Level Information
Class	Value	Design Variables
gend	female_0	0
	male_1	1

Model Fit Statistics
Criterion	Intercept Only	Intercept and Covariates
AIC	109.669	104.548
SC	109.748	104.707
-2 Log L	107.669	100.548

Testing Global Null Hypothesis: BETA=0
Test	Chi-Square	DF	Pr > ChiSq
Likelihood Ratio	7.1209	1	0.0076
Score	7.0346	1	0.0080
Wald	6.7944	1	0.0091

Analysis of Maximum Likelihood Estimates
Parameter		DF	Estimate	Standard Error	Wald Chi-Square	Pr > ChiSq
Intercept		1	-0.5596	0.3619	2.3914	0.1220
gend	male_1	1	1.2527	0.4806	6.7944	0.0091

Odds Ratio Estimates
Effect	Point Estimate	95% Wald Confidence Limits
gend male_1 vs female_0	3.500	1.364	8.976

Association of Predicted Probabilities and Observed Responses
Percent Concordant	25.0	Somers' D	0.000
Percent Discordant	25.0	Gamma	0.000
Percent Tied	50.0	Tau-a	0.000
Pairs	16	c	0.500

Association of Predicted Probabilities and Observed Responses
Percent Concordant	37.5	Somers' D	0.000
Percent Discordant	37.5	Gamma	0.000
Percent Tied	25.0	Tau-a	0.000
Pairs	16	c	0.500

Model Information
Data Set	WORK.LOWBWT
Response Variable	low
Number of Response Levels	2
Model	binary logit
Optimization Technique	Fisher's scoring

Association of Predicted Probabilities and Observed Responses
Percent Concordant	60.1	Somers' D	0.226
Percent Discordant	37.5	Gamma	0.232
Percent Tied	2.5	Tau-a	0.098
Pairs	7670	c	0.613

Estimate
Label	Estimate	Standard Error	z Value	Pr > |z|	Alpha	Lower	Upper	Exponentiated	Exponentiated Lower	Exponentiated Upper
10 pound OR	-0.1406	0.06170	-2.28	0.0227	0.05	-0.2615	-0.01966	0.8689	0.7699	0.9805

Association of Predicted Probabilities and Observed Responses
Percent Concordant	65.3	Somers' D	0.307
Percent Discordant	34.6	Gamma	0.307
Percent Tied	0.0	Tau-a	0.133
Pairs	7670	c	0.653

Summary of Backward Elimination
Step	Effect Removed	DF	Number In	Wald Chi-Square	Pr > ChiSq
1	ftv	1	3	0.0869	0.7681
2	age	1	2	0.5892	0.4427

Obs	id	age	lwt	race	ftv
1	85	19	182	2	0
2	86	33	155	3	3
3	87	20	105	1	1
4	88	21	108	1	2
5	89	18	107	1	0

Association of Predicted Probabilities and Observed Responses
Percent Concordant	64.3	Somers' D	0.295
Percent Discordant	34.8	Gamma	0.297
Percent Tied	0.9	Tau-a	0.127
Pairs	7670	c	0.647

Partition for the Hosmer and Lemeshow Test
Group	Total	low = 1 Observed	low = 1 Expected	low = 0 Observed	low = 0 Expected
1	21	4	3.21	17	17.79
2	20	5	4.58	15	15.42
3	19	4	5.28	15	13.72
4	19	8	5.73	11	13.27
5	18	4	5.80	14	12.20
6	20	5	6.70	15	13.30
7	19	4	6.68	15	12.32
8	23	10	8.57	13	14.43
9	20	9	8.08	11	11.92
10	10	6	4.39	4	5.61
