Logistic Regression in SAS

Last week we discussed multiple linear regression, where we modeled the mean response of a continuous variable as a linear function of some covariates. This model assumes that the response follows a normal distribution at each value of the covariates. This week we will consider modeling the mean response of a binary variable as a function of certain covariates. We could use a multiple regression model, but this approach has several issues:

  • A binary variable takes only two values, so it cannot be normally distributed.
  • The mean of a binary variable is equal to the probability of "success". That is, $E(Y)=Pr(Y=1)$. If we model the mean as a linear function of the covariates, say $E(Y|X)=Pr(Y=1|X)=\beta_0+\beta_1X$, then our model will allow this probability to take values less than 0 and more than 1.

Instead of using the standard normal linear model, the response is modeled as a Bernoulli variable where a function of the mean response is linear in the covariates. This function is known as the link function, and the most common link function is the logit function:

$$\text{logit}(p)=\log\left(\dfrac{p}{1-p}\right).$$

Let $\pi=\pi(x)=Pr(Y=1|X)$. In the logistic regression model, we model the logit of the mean response as a linear function of the covariates

$$\text{logit}(\pi)=\beta_0+\beta_1X_1+\cdots+\beta_pX_p$$

If we solve for $\pi$, then we get

$$\pi(x)=\dfrac{\exp(\beta_0+\beta_1X_1+\cdots+\beta_pX_p)}{1+\exp(\beta_0+\beta_1X_1+\cdots+\beta_pX_p)}$$

Note that this model only allows $\pi$ to be between 0 and 1. For a given predictor $X_j$, the associated $\beta_j$ is the change in the log odds of a success for a one-unit increase in $X_j$, holding the other covariates fixed.
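The algebra behind these two statements is short enough to write out. Writing $\eta=\beta_0+\beta_1X_1+\cdots+\beta_pX_p$ for the linear predictor (a shorthand introduced only for this aside), inverting the logit gives

$$\log\left(\dfrac{\pi}{1-\pi}\right)=\eta \;\Longrightarrow\; \dfrac{\pi}{1-\pi}=\exp(\eta) \;\Longrightarrow\; \pi=\dfrac{\exp(\eta)}{1+\exp(\eta)}$$

Increasing one predictor $X_j$ by one unit while holding the others fixed adds $\beta_j$ to $\eta$, so the odds are multiplied by $\exp(\beta_j)$:

$$\dfrac{\text{odds}(X_j+1)}{\text{odds}(X_j)}=\dfrac{\exp(\eta+\beta_j)}{\exp(\eta)}=\exp(\beta_j)$$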

In [18]:
DATA logitfct;
DO X=-15 to 15 BY 0.01;
    predpos = EXP(1+ 0.5*X)/(1+EXP(1+0.5*X));
    predneg = EXP(1 - 0.5*X)/(1+EXP(1-0.5*X));
    OUTPUT;
END;
RUN;

PROC SGPLOT DATA=logitfct;
TITLE;
SERIES X = X Y = predpos;
SERIES X = X Y = predneg;
RUN;
Out[18]:
[Figure: SGPLOT output showing the two logistic curves, predpos (increasing in X) and predneg (decreasing in X), plotted against X]

Simple Logistic Regression

Logistic regression assumes a monotone (increasing or decreasing, but not both) relationship between $\pi(x)$ and $x$. If $\beta_1>0$, then $\pi(x)$ increases with $x$, and if $\beta_1<0$, then $\pi(x)$ decreases with $x$. Also, note that a fixed change in X has less impact on $\pi$ when $\pi$ is near 0 or 1 than when $\pi$ is near the middle of its range.
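The last point follows from the slope of the logistic curve. Differentiating $\pi(x)=\exp(\beta_0+\beta_1x)/(1+\exp(\beta_0+\beta_1x))$ with respect to $x$ gives

$$\dfrac{d\pi}{dx}=\beta_1\,\pi(x)\,\bigl(1-\pi(x)\bigr)$$

The factor $\pi(1-\pi)$ is largest (0.25) at $\pi=0.5$ and shrinks toward 0 as $\pi$ approaches 0 or 1. As an illustration with $\beta_1=0.5$, the slope used for the increasing curve plotted earlier, the slope is $0.5\times 0.25=0.125$ at $\pi=0.5$ but only $0.5\times 0.05\times 0.95\approx 0.024$ at $\pi=0.05$.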

For our first example, let's compare the probability of coronary artery disease in males and females. The data set contains the following variables:

  • ecg - electrocardiogram result: 0 if ST segment depression is < 0.1 mV and 1 otherwise
  • CA - presence or absence of coronary artery disease, stored as "presence" or "absence"
  • gend - gender, stored as "male_1" or "female_0"
  • cnt - number of subjects with that combination of values
In [2]:
Data ecg;
input ecg CA $ gend $ cnt;
datalines;
0 absence female_0 11
0 presence female_0 4
1 presence female_0 8
1 absence female_0 10
0 presence male_1 9
0 absence male_1 9
1 presence male_1 21
1 absence male_1 6
;
run;

PROC FREQ DATA=ecg order=FREQ;
TABLES gend*CA;
weight cnt;
RUN;
Out[2]:
The SAS System

The FREQ Procedure

Table of gend by CA (each cell: Frequency / Percent / Row Pct / Col Pct)
gend      presence                      absence                       Total
male_1    30 / 38.46 / 66.67 / 71.43    15 / 19.23 / 33.33 / 41.67    45 (57.69)
female_0  12 / 15.38 / 36.36 / 28.57    21 / 26.92 / 63.64 / 58.33    33 (42.31)
Total     42 (53.85)                    36 (46.15)                    78 (100.00)
In [4]:
proc logistic data=ecg descending;
class gend (ref='female_0') / param = ref;
model ca=gend;
weight cnt;
output out=probs p=prob;
run;
Out[4]:
The LOGISTIC Procedure

Model Information
Data Set WORK.ECG
Response Variable CA
Number of Response Levels 2
Weight Variable cnt
Model binary logit
Optimization Technique Fisher's scoring

Observations Summary

Number of Observations Read 8
Number of Observations Used 8
Sum of Weights Read 78
Sum of Weights Used 78

Response Profile
Ordered Value  CA        Total Frequency  Total Weight
1              presence  4                42.000000
2              absence   4                36.000000

Probability modeled is CA='presence'.

Class Level Information
Class  Value     Design Variables
gend   female_0  0
       male_1    1

Convergence Status

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Fit Statistics

Model Fit Statistics
Criterion Intercept Only Intercept and Covariates
AIC 109.669 104.548
SC 109.748 104.707
-2 Log L 107.669 100.548

Global Tests

Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 7.1209 1 0.0076
Score 7.0346 1 0.0080
Wald 6.7944 1 0.0091

Type 3 Tests

Type 3 Analysis of Effects
Effect  DF  Wald Chi-Square  Pr > ChiSq
gend    1   6.7944           0.0091

Parameter Estimates

Analysis of Maximum Likelihood Estimates
Parameter         DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq
Intercept         1   -0.5596   0.3619          2.3914           0.1220
gend      male_1  1   1.2527    0.4806          6.7944           0.0091

Odds Ratios

Odds Ratio Estimates
Effect                   Point Estimate  95% Wald Confidence Limits
gend male_1 vs female_0  3.500           1.364  8.976

Association Statistics

Association of Predicted Probabilities and Observed Responses
Percent Concordant 25.0 Somers' D 0.000
Percent Discordant 25.0 Gamma 0.000
Percent Tied 50.0 Tau-a 0.000
Pairs 16 c 0.500

The DESCENDING option in PROC LOGISTIC is important here: it ensures that the intended category of the response is modeled as the success. For a binary response, PROC LOGISTIC by default models the probability of the response level that sorts first. Here "absence" sorts before "presence" and would be modeled as the success without the DESCENDING option. With the DESCENDING option, the order is reversed, so presence is modeled as the event (1) and absence as the non-event (0).
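An alternative that does not depend on sort order is to name the event level directly with the EVENT= response option of the MODEL statement. The sketch below is a variation on the call above, not the code that produced the output shown. It also swaps WEIGHT for FREQ, which is an assumption about intent: because cnt counts subjects, FREQ treats each row as cnt observations. The parameter estimates and standard errors should be unchanged, but observation counts and the rank-based association statistics would then be based on the 78 subjects rather than the 8 data rows.

```sas
/* Sketch: name the event level explicitly instead of relying on DESCENDING. */
proc logistic data=ecg;
  class gend (ref='female_0') / param=ref;
  model ca(event='presence') = gend;
  freq cnt;   /* assumption: cnt is a subject count, so FREQ replaces WEIGHT */
run;
```

The "Probability modeled is CA='presence'" line in the Response Profile output is the place to confirm which level was treated as the event.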

From the output, we get that

$$\text{logit}(\pi)=-0.5596+1.2527X$$

Since X is a binary qualitative variable in this case, with X=0 (female) and X=1 (male), we have

  • For females, $Pr(\text{coronary artery disease})=\dfrac{\exp(-0.5596)}{1+\exp(-0.5596)}=0.36364$
  • For males, $Pr(\text{coronary artery disease})=\dfrac{\exp(-0.5596+1.2527)}{1+\exp(-0.5596+1.2527)}=0.6667$
  • The estimated odds of coronary artery disease for males are $\exp(1.2527)=3.5$ times the odds of coronary artery disease for females.
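These fitted values can be checked directly against the PROC FREQ table, because a model with a single binary covariate reproduces the observed proportions. Using the cell counts (30 of 45 males and 12 of 33 females with disease):

$$\text{odds}_{\text{female}}=\dfrac{12}{21}=0.5714=\exp(-0.5596),\qquad \text{odds}_{\text{male}}=\dfrac{30}{15}=2$$

$$\text{OR}=\dfrac{30/15}{12/21}=3.5=\exp(1.2527)$$

The Wald confidence limits come from exponentiating the interval for the coefficient, $\exp(1.2527\pm 1.96\times 0.4806)$, which gives approximately $(1.36,\ 8.98)$.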
In [5]:
PROC PRINT DATA=probs;
RUN;
Out[5]:
The PRINT Procedure

Data Set WORK.PROBS

Obs ecg CA gend cnt _LEVEL_ prob
1 0 absence female_0 11 presence 0.36364
2 0 presence female_0 4 presence 0.36364
3 1 presence female_0 8 presence 0.36364
4 1 absence female_0 10 presence 0.36364
5 0 presence male_1 9 presence 0.66665
6 0 absence male_1 9 presence 0.66665
7 1 presence male_1 21 presence 0.66665
8 1 absence male_1 6 presence 0.66665

The output also provides three (asymptotically) equivalent tests of the overall covariate effect (similar to the ANOVA F-test in the normal linear model):

  • Likelihood-ratio test
  • Wald Test
  • Score Test

All tests in this case indicate a significant gender effect.
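All three statistics can be reproduced by hand for this example, which helps show what each test measures. The likelihood-ratio statistic is the drop in $-2\log L$ from the intercept-only model to the fitted model, and with a single parameter the Wald statistic is the squared ratio of the estimate to its standard error:

$$G=107.669-100.548=7.121,\qquad W=\left(\dfrac{1.2527}{0.4806}\right)^2\approx 6.79$$

For a single binary covariate, the score statistic coincides with the Pearson chi-square of the 2×2 table of gend by CA:

$$S=\dfrac{78\,(30\cdot 21-15\cdot 12)^2}{45\cdot 33\cdot 42\cdot 36}=7.0346$$

Each is compared to a chi-square distribution with 1 degree of freedom.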

Multiple Logistic Regression

In [7]:
proc logistic data=ecg descending;
class gend (ref='female_0') /param=ref;
model ca=gend ecg;
weight cnt;
output out=probs2 p=prob;
Title "logistic with 2 factors, no interaction";
run;
Out[7]:
logistic with 2 factors, no interaction

The LOGISTIC Procedure

Model Information
Data Set WORK.ECG
Response Variable CA
Number of Response Levels 2
Weight Variable cnt
Model binary logit
Optimization Technique Fisher's scoring

Observations Summary

Number of Observations Read 8
Number of Observations Used 8
Sum of Weights Read 78
Sum of Weights Used 78

Response Profile
Ordered Value  CA        Total Frequency  Total Weight
1              presence  4                42.000000
2              absence   4                36.000000

Probability modeled is CA='presence'.

Class Level Information
Class  Value     Design Variables
gend   female_0  0
       male_1    1

Convergence Status

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Fit Statistics

Model Fit Statistics
Criterion Intercept Only Intercept and Covariates
AIC 109.669 101.900
SC 109.748 102.138
-2 Log L 107.669 95.900

Global Tests

Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 11.7694 2 0.0028
Score 11.2410 2 0.0036
Wald 10.0644 2 0.0065

Type 3 Tests

Type 3 Analysis of Effects
Effect  DF  Wald Chi-Square  Pr > ChiSq
gend    1   6.5750           0.0103
ecg     1   4.4844           0.0342

Parameter Estimates

Analysis of Maximum Likelihood Estimates
Parameter         DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq
Intercept         1   -1.1747   0.4854          5.8571           0.0155
gend      male_1  1   1.2770    0.4980          6.5750           0.0103
ecg               1   1.0545    0.4980          4.4844           0.0342

Odds Ratios

Odds Ratio Estimates
Effect                   Point Estimate  95% Wald Confidence Limits
gend male_1 vs female_0  3.586           1.351  9.516
ecg                      2.871           1.082  7.618

Association Statistics

Association of Predicted Probabilities and Observed Responses
Percent Concordant 37.5 Somers' D 0.000
Percent Discordant 37.5 Gamma 0.000
Percent Tied 25.0 Tau-a 0.000
Pairs 16 c 0.500

From the output, we have

  • $\text{logit}(\pi)=-1.1747+1.277\cdot\text{Gender}+1.0545\cdot\text{ECG}$
  • The estimated odds of CA for females with ECG < 0.1 are $\exp(-1.17)=0.3104$
  • The estimated odds of CA for males with ECG < 0.1 are $\exp(-1.17 + 1.277)=1.1129$
  • The estimated odds of CA for females with ECG $\geq$ 0.1 are $\exp(-1.17 + 1.0545)=0.8909$
  • The estimated odds of CA for males with ECG $\geq$ 0.1 are $\exp(-1.17 + 1.277 + 1.0545)=3.1947$
  • The estimated odds ratio between males and females is $\exp(1.277)=3.5859$
  • The estimated odds ratio between ECG $\geq$ 0.1 and ECG < 0.1 is $\exp(1.0545)=2.8705$
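The same calculations can be scripted, which also shows how odds are converted to probabilities via $\pi=\text{odds}/(1+\text{odds})$. The DATA step below is a minimal sketch: the data set name ca_pred and the variable names are made up for illustration, and the coefficients are typed in from the parameter estimates table rather than read from the fitted model. Because it uses the intercept as printed (-1.1747) instead of rounding to -1.17, its odds can differ in the third decimal from hand calculations that round first.

```sas
/* Sketch: odds and probabilities for the four gender-by-ECG combinations. */
data ca_pred;
  b0     = -1.1747;   /* intercept, copied from the output            */
  b_gend =  1.2770;   /* male_1 vs female_0                           */
  b_ecg  =  1.0545;   /* ECG = 1 (>= 0.1) vs ECG = 0 (< 0.1)          */
  do gender = 0 to 1;            /* 0 = female, 1 = male */
    do ecg = 0 to 1;
      eta  = b0 + b_gend*gender + b_ecg*ecg;   /* linear predictor (log odds) */
      odds = exp(eta);
      prob = odds / (1 + odds);                /* inverse logit */
      output;
    end;
  end;
  keep gender ecg eta odds prob;
run;

proc print data=ca_pred noobs;
run;
```

The prob column should agree, up to rounding of the coefficients, with the predicted probabilities that PROC LOGISTIC writes to the OUTPUT data set.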
In [8]:
PROC PRINT DATA=probs2;
RUN;
Out[8]:
The PRINT Procedure

Data Set WORK.PROBS2

Obs ecg CA gend cnt _LEVEL_ prob
1 0 absence female_0 11 presence 0.23601
2 0 presence female_0 4 presence 0.23601
3 1 presence female_0 8 presence 0.46999
4 1 absence female_0 10 presence 0.46999
5 0 presence male_1 9 presence 0.52555
6 0 absence male_1 9 presence 0.52555
7 1 presence male_1 21 presence 0.76075
8 1 absence male_1 6 presence 0.76075

The next data set looks at different covariates to predict low birthweight infants.

In [19]:
data lowbwt;
  infile 'H:\BiostatCourses\PHC6937SAS\Week 8\lowbwt.dat';
  input id low age lwt race ftv;
run;

PROC PRINT DATA=lowbwt (OBS=5);
RUN;

proc logistic data=lowbwt desc;
 model loW=lwt;
 estimate '10 pound OR' lwt 10 /exp cl;  *OR for 10 pound increase;
 output out=pout p=pred;
run;
Out[19]:
The PRINT Procedure

Data Set WORK.LOWBWT

Obs id low age lwt race ftv
1 85 0 19 182 2 0
2 86 0 33 155 3 3
3 87 0 20 105 1 1
4 88 0 21 108 1 2
5 89 0 18 107 1 0

The LOGISTIC Procedure

Model Information
Data Set WORK.LOWBWT
Response Variable low
Number of Response Levels 2
Model binary logit
Optimization Technique Fisher's scoring

Observations Summary

Number of Observations Read 189
Number of Observations Used 189

Response Profile
Ordered Value  low  Total Frequency
1              1    59
2              0    130

Probability modeled is low=1.

Convergence Status

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Fit Statistics

Model Fit Statistics
Criterion Intercept Only Intercept and Covariates
AIC 236.672 232.691
SC 239.914 239.174
-2 Log L 234.672 228.691

Global Tests

Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 5.9813 1 0.0145
Score 5.4382 1 0.0197
Wald 5.1921 1 0.0227

Parameter Estimates

Analysis of Maximum Likelihood Estimates
Parameter  DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq
Intercept  1   0.9983    0.7853          1.6161           0.2036
lwt        1   -0.0141   0.00617         5.1921           0.0227

Odds Ratios

Odds Ratio Estimates
Effect  Point Estimate  95% Wald Confidence Limits
lwt     0.986           0.974  0.998

Association Statistics

Association of Predicted Probabilities and Observed Responses
Percent Concordant 60.1 Somers' D 0.226
Percent Discordant 37.5 Gamma 0.232
Percent Tied 2.5 Tau-a 0.098
Pairs 7670 c 0.613

Estimates

Estimate
Label        Estimate  Standard Error  z Value  Pr > |z|  Alpha  Lower    Upper     Exponentiated  Exponentiated Lower  Exponentiated Upper
10 pound OR  -0.1406   0.06170         -2.28    0.0227    0.05   -0.2615  -0.01966  0.8689         0.7699               0.9805
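The ESTIMATE row is the lwt coefficient scaled by 10: the estimate -0.1406 is ten times the (unrounded) slope, and $\exp(-0.1406)=0.8689$ is the odds ratio for a 10 pound increase in lwt. Another way to request an odds ratio for a 10-unit change is the UNITS statement; the sketch below is an alternative to the ESTIMATE statement, not the code that produced the output above, and CLODDS=WALD is added on the assumption that Wald limits are wanted for the rescaled odds ratio.

```sas
/* Sketch: odds ratio for a 10 pound increase in lwt via the UNITS statement. */
proc logistic data=lowbwt desc;
  model low = lwt / clodds=wald;   /* request Wald confidence limits for odds ratios */
  units lwt = 10;                  /* report the odds ratio per 10-unit change in lwt */
run;
```

The point estimate is expected to match the exponentiated ESTIMATE value of about 0.87.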
In [20]:
proc logistic data=lowbwt desc;  
  class race / param=ref ref=first;
  model low=age lwt race ftv;
run;
Out[20]:
The LOGISTIC Procedure

Model Information
Data Set WORK.LOWBWT
Response Variable low
Number of Response Levels 2
Model binary logit
Optimization Technique Fisher's scoring

Observations Summary

Number of Observations Read 189
Number of Observations Used 189

Response Profile
Ordered Value  low  Total Frequency
1              1    59
2              0    130

Probability modeled is low=1.

Class Level Information
Class  Value  Design Variables
race   1      0  0
       2      1  0
       3      0  1

Convergence Status

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Fit Statistics

Model Fit Statistics
Criterion Intercept Only Intercept and Covariates
AIC 236.672 234.573
SC 239.914 254.023
-2 Log L 234.672 222.573

Global Tests

Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 12.0991 5 0.0335
Score 11.3876 5 0.0442
Wald 10.6964 5 0.0577

Type 3 Tests

Type 3 Analysis of Effects
Effect  DF  Wald Chi-Square  Pr > ChiSq
age     1   0.4988           0.4800
lwt     1   4.7428           0.0294
race    2   4.4108           0.1102
ftv     1   0.0869           0.7681

Parameter Estimates

Analysis of Maximum Likelihood Estimates
Parameter     DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq
Intercept     1   1.2953    1.0714          1.4616           0.2267
age           1   -0.0238   0.0337          0.4988           0.4800
lwt           1   -0.0142   0.00654         4.7428           0.0294
race      2   1   1.0039    0.4979          4.0660           0.0438
race      3   1   0.4331    0.3622          1.4296           0.2318
ftv           1   -0.0493   0.1672          0.0869           0.7681

Odds Ratios

Odds Ratio Estimates
Effect       Point Estimate  95% Wald Confidence Limits
age          0.976           0.914  1.043
lwt          0.986           0.973  0.999
race 2 vs 1  2.729           1.029  7.240
race 3 vs 1  1.542           0.758  3.136
ftv          0.952           0.686  1.321

Association Statistics

Association of Predicted Probabilities and Observed Responses
Percent Concordant 65.3 Somers' D 0.307
Percent Discordant 34.6 Gamma 0.307
Percent Tied 0.0 Tau-a 0.133
Pairs 7670 c 0.653
In [21]:
proc logistic data=lowbwt desc;  
  class race / param=ref ref=first;
  model low=age lwt race ftv /selection=b SLSTAY=0.10;
run;
Out[21]:
The LOGISTIC Procedure

Model Information
Data Set WORK.LOWBWT
Response Variable low
Number of Response Levels 2
Model binary logit
Optimization Technique Fisher's scoring

Observations Summary

Number of Observations Read 189
Number of Observations Used 189

Response Profile
Ordered Value  low  Total Frequency
1              1    59
2              0    130

Probability modeled is low=1.

Backward Elimination Procedure

Class Level Information
Class  Value  Design Variables
race   1      0  0
       2      1  0
       3      0  1

Step 0

Step 0. The following effects were entered:

Intercept age lwt race ftv

Convergence Status

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Fit Statistics

Model Fit Statistics
Criterion Intercept Only Intercept and Covariates
AIC 236.672 234.573
SC 239.914 254.023
-2 Log L 234.672 222.573

Global Tests

Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 12.0991 5 0.0335
Score 11.3876 5 0.0442
Wald 10.6964 5 0.0577

Step 1

Step 1. Effect ftv is removed:

Convergence Status

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Fit Statistics

Model Fit Statistics
Criterion Intercept Only Intercept and Covariates
AIC 236.672 232.661
SC 239.914 248.869
-2 Log L 234.672 222.661

Global Tests

Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 12.0114 4 0.0173
Score 11.3202 4 0.0232
Wald 10.6284 4 0.0311

Residual Chi-Square

Residual Chi-Square Test
Chi-Square DF Pr > ChiSq
0.0870 1 0.7680

Step 2

Step 2. Effect age is removed:

Convergence Status

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Fit Statistics

Model Fit Statistics
Criterion Intercept Only Intercept and Covariates
AIC 236.672 231.259
SC 239.914 244.226
-2 Log L 234.672 223.259

Global Tests

Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 11.4129 3 0.0097
Score 10.7572 3 0.0131
Wald 10.1316 3 0.0175

Residual Chi-Square

Residual Chi-Square Test
Chi-Square DF Pr > ChiSq
0.6779 2 0.7125

Note: No (additional) effects met the 0.1 significance level for removal from the model.

Model Building Summary

Summary of Backward Elimination
Step  Effect Removed  DF  Number In  Wald Chi-Square  Pr > ChiSq
1     ftv             1   3          0.0869           0.7681
2     age             1   2          0.5892           0.4427

Type 3 Tests

Type 3 Analysis of Effects
Effect  DF  Wald Chi-Square  Pr > ChiSq
lwt     1   5.5886           0.0181
race    2   5.4024           0.0671

Parameter Estimates

Analysis of Maximum Likelihood Estimates
Parameter     DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq
Intercept     1   0.8057    0.8452          0.9088           0.3404
lwt           1   -0.0152   0.00644         5.5886           0.0181
race      2   1   1.0811    0.4881          4.9065           0.0268
race      3   1   0.4806    0.3567          1.8156           0.1778

Odds Ratios

Odds Ratio Estimates
Effect       Point Estimate  95% Wald Confidence Limits
lwt          0.985           0.973  0.997
race 2 vs 1  2.948           1.133  7.672
race 3 vs 1  1.617           0.804  3.253

Association Statistics

Association of Predicted Probabilities and Observed Responses
Percent Concordant 64.3 Somers' D 0.295
Percent Discordant 34.8 Gamma 0.297
Percent Tied 0.9 Tau-a 0.127
Pairs 7670 c 0.647
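One way to double-check the elimination is a likelihood-ratio comparison of the full model from Step 0 with the final model, using the $-2\log L$ values printed in the output. This is a side calculation and not something PROC LOGISTIC reports:

$$G=223.259-222.573=0.686\quad\text{on } 5-3=2 \text{ degrees of freedom}$$

This is well below the 5% chi-square critical value of 5.99 for 2 degrees of freedom and close to the residual score chi-square of 0.6779 reported at Step 2, so dropping ftv and age does not noticeably worsen the fit.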

Diagnostics

  • Goodness-of-fit test: the null hypothesis is that the model fits (Hosmer and Lemeshow test)
  • Deviance residual analysis
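For reference, the two residual types requested with RESDEV= and RESCHI= in the code that follows have closed forms for a binary response. With $y_i\in\{0,1\}$ and fitted probability $\hat\pi_i$, the standard definitions of the deviance and Pearson residuals are

$$d_i=\operatorname{sign}(y_i-\hat\pi_i)\sqrt{-2\left[y_i\log\hat\pi_i+(1-y_i)\log(1-\hat\pi_i)\right]},\qquad r_i=\dfrac{y_i-\hat\pi_i}{\sqrt{\hat\pi_i(1-\hat\pi_i)}}$$

Both are positive for observed events and negative for non-events, and both grow in magnitude as the fitted probability moves away from the observed outcome.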
In [22]:
proc logistic data=lowbwt descending;
model loW=lwt /lackfit;
output out=pout p=pred resdev=devr reschi=chir;
run;
proc sgscatter data=pout;

plot devr*(id pred lwt);
run;
Out[22]:
The LOGISTIC Procedure

Model Information
Data Set WORK.LOWBWT
Response Variable low
Number of Response Levels 2
Model binary logit
Optimization Technique Fisher's scoring

Observations Summary

Number of Observations Read 189
Number of Observations Used 189

Response Profile
Ordered Value  low  Total Frequency
1              1    59
2              0    130

Probability modeled is low=1.

Convergence Status

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Fit Statistics

Model Fit Statistics
Criterion Intercept Only Intercept and Covariates
AIC 236.672 232.691
SC 239.914 239.174
-2 Log L 234.672 228.691

Global Tests

Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 5.9813 1 0.0145
Score 5.4382 1 0.0197
Wald 5.1921 1 0.0227

Parameter Estimates

Analysis of Maximum Likelihood Estimates
Parameter  DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq
Intercept  1   0.9983    0.7853          1.6161           0.2036
lwt        1   -0.0141   0.00617         5.1921           0.0227

Odds Ratios

Odds Ratio Estimates
Effect  Point Estimate  95% Wald Confidence Limits
lwt     0.986           0.974  0.998

Association Statistics

Association of Predicted Probabilities and Observed Responses
Percent Concordant 60.1 Somers' D 0.226
Percent Discordant 37.5 Gamma 0.232
Percent Tied 2.5 Tau-a 0.098
Pairs 7670 c 0.613

Hosmer-Lemeshow Test

Partition for the Hosmer and Lemeshow Test
Group  Total  low = 1 Observed  low = 1 Expected  low = 0 Observed  low = 0 Expected
1      21     4                 3.21              17                17.79
2      20     5                 4.58              15                15.42
3      19     4                 5.28              15                13.72
4      19     8                 5.73              11                13.27
5      18     4                 5.80              14                12.20
6      20     5                 6.70              15                13.30
7      19     4                 6.68              15                12.32
8      23     10                8.57              13                14.43
9      20     9                 8.08              11                11.92
10     10     6                 4.39              4                 5.61

Chi-Square Test

Hosmer and Lemeshow Goodness-of-Fit Test
Chi-Square DF Pr > ChiSq
6.7270 8 0.5664

[Figure: SGSCATTER output, deviance residuals (devr) plotted against id, pred, and lwt]

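In this case, there is no apparent lack of fit: the Hosmer and Lemeshow test gives a p-value of 0.5664, so we do not reject the null hypothesis that the model fits. In the residual plots, we should see deviance residuals mostly between -2 and 2, with the points falling along fairly smooth curves (one for each outcome).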
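The Hosmer and Lemeshow statistic is a Pearson-type sum over the partition table: for each of the 10 groups it adds $(O-E)^2/E$ for the events and for the non-events, and compares the total to a chi-square distribution with (number of groups - 2) degrees of freedom. The DATA step below is a sketch that recomputes it from the printed partition; the data set and variable names (hl_parts, hl_test, obs1, exp1, obs0, exp0) are made up for illustration. Because the expected counts are typed in to two decimals, the result should be close to, but not exactly, the reported 6.7270.

```sas
/* Sketch: recompute the Hosmer-Lemeshow chi-square from the partition table. */
data hl_parts;
  input group total obs1 exp1 obs0 exp0;
  /* contribution of this group: (O - E)^2 / E for events and non-events */
  term = (obs1 - exp1)**2 / exp1 + (obs0 - exp0)**2 / exp0;
  chisq + term;                 /* sum statement keeps a running total */
  datalines;
1 21 4 3.21 17 17.79
2 20 5 4.58 15 15.42
3 19 4 5.28 15 13.72
4 19 8 5.73 11 13.27
5 18 4 5.80 14 12.20
6 20 5 6.70 15 13.30
7 19 4 6.68 15 12.32
8 23 10 8.57 13 14.43
9 20 9 8.08 11 11.92
10 10 6 4.39 4 5.61
;
run;

data hl_test;
  set hl_parts end=last;
  if last;                            /* keep only the final running total */
  df = 10 - 2;                        /* number of groups minus 2 */
  p_value = 1 - probchi(chisq, df);   /* upper-tail chi-square probability */
  keep chisq df p_value;
run;

proc print data=hl_test noobs;
run;
```

Printing hl_parts as well shows which groups contribute most to the statistic through the term column.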
