Categorical Predictors

So far we have only considered quantitative predictors, but we can also have categorical predictors in linear regression models. Let’s consider the following research question:

Categorical Variable with \(>2\) Categories

Suppose we want to study the effect of exercise on glucose levels in the bloodstream. High glucose levels lead to diabetes, which is a known risk factor for heart disease, so understanding how exercise and other factors can affect glucose levels is important. Suppose that exercise was measured on a scale of 1 to 5, where

  • 1 = much less active
  • 2 = less active
  • 3 = somewhat active
  • 4 = more active
  • 5 = much more active.

We can include this categorical variable in a regression model by coding. In the smoking example, we coded the binary variable as a 0/1 variable. What we did was choose No to be represented by 0 and Yes to be represented by 1. No in this case is what is called the reference category. To code the exercise variable, we will need to choose one of the categories to be the reference category, and then we can create four binary variables, known as indicator variables, to represent the other four categories. Let’s choose 1 as the reference category. We can define the four indicator variables as follows:

  • \(x_{i1}\) is 1 if person i is less active, and 0 otherwise
  • \(x_{i2}\) is 1 if person i is somewhat active, and 0 otherwise
  • \(x_{i3}\) is 1 if person i is more active, and 0 otherwise
  • \(x_{i4}\) is 1 if person i is much more active, and 0 otherwise.

This leads to the following regression model:

\[E(glucose|x)=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\beta_3x_{i3}+\beta_4x_{i4}\]

This means that

\[E(glucose|x)=\begin{cases}\beta_0,&\text{ if much less active}\\ \beta_0+\beta_1,&\text{ if less active}\\ \beta_0+\beta_2,&\text{ if somewhat active}\\ \beta_0+\beta_3,&\text{ if more active}\\ \beta_0+\beta_4,&\text{ if much more active}\end{cases}\]

In general, if the categorical variable has c levels, then you will need \(c-1\) indicator variables. So

  • If your qualitative variable defines 2 groups, then you need 1 indicator variable.
  • If your qualitative variable defines 3 groups, then you need 2 indicator variables.
  • If your qualitative variable defines 4 groups, then you need 3 indicator variables.

And so on. The short R sketch below illustrates this coding in practice.
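
In practice you rarely have to construct these indicator variables by hand: declaring the variable as a factor lets R build the \(c-1\) indicators automatically, with the first level as the reference category. A minimal sketch with simulated, purely hypothetical data (the variable names and values are not from a real study):

set.seed(1)
# hypothetical data for illustration only: exercise on the 1-5 scale and a glucose measurement
exercise <- factor(sample(1:5, 100, replace = TRUE), levels = 1:5,
                   labels = c("much less", "less", "somewhat", "more", "much more"))
glucose <- rnorm(100, mean = 90, sd = 10)
# model.matrix() shows the c - 1 = 4 indicator columns R constructs,
# with the first level ("much less") as the reference category
head(model.matrix(~ exercise))
# lm() uses the same coding: an intercept plus four indicator coefficients
lm(glucose ~ exercise)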

Interactions

Interaction terms allow the effect of a treatment to vary under different conditions. Let’s consider the following example on treatments for severe depression. For the sake of simplicity, we denote the three treatments by A, B, and C. The researchers collected data on 36 severely depressed individuals, recording the effectiveness of the treatment (y), the patient’s age in years, and which of the three treatments the patient received.

The data are plotted below.

# read in the depression data and plot effectiveness against age,
# with a different colour/symbol for each treatment (blue circle = A, red square = B, green triangle = C)
depression.dat <- read.table("Data/depression.txt", header = TRUE)
with(depression.dat, plot(age, y, col = ifelse(x2 == 1, "blue", ifelse(x3 == 1, "red", "green")),
                          pch = ifelse(x2 == 1, 16, ifelse(x3 == 1, 15, 17)),
                          xlab = "Age (years)", ylab = "Effectiveness of Treatment"))
legend("bottomright", legend = c("A", "B", "C"), title = "TRT", pch = c(16, 15, 17), col = c("blue", "red", "green"))

The blue circles represent the data for individuals receiving treatment A, the red squares represent the data for individuals receiving treatment B, and the green triangles represent the data for individuals receiving treatment C.

In the previous example, the two estimated regression functions had the same slope; that is, they were parallel. If you tried to draw three best-fitting lines through the data of this example, do you think the slopes of your lines would be the same? Probably not! In this case, we need to include what are called interaction terms in our formulated regression model.

A model that would allow each group to have different slopes is:

\[y_i=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\beta_3x_{i3}+\beta_{12}x_{i1}x_{i2}+\beta_{13}x_{i1}x_{i3}+\varepsilon_i\]

where \(x_{i1}\) is the age of patient \(i\), \(x_{i2}\) and \(x_{i3}\) are indicator variables for treatments A and B respectively (treatment C is the reference category), and the \(\varepsilon_i\) are independent and have normal distributions with mean 0 and equal variance \(\sigma^2\). Since we have three treatments, we can break this down into three separate regression models:

\[y_i=\begin{cases}(\beta_0+\beta_2)+(\beta_1+\beta_{12})x_{i1},&\text{ if patient receives A ($x_{i2}=1$ and $x_{i3}=0$)}\\ (\beta_0+\beta_3)+(\beta_1+\beta_{13})x_{i1},&\text{ if patient receives B ($x_{i2}=0$ and $x_{i3}=1$)}\\ \beta_0+\beta_1x_{i1},&\text{ if patient receives C ($x_{i2}=0$ and $x_{i3}=0$)}\end{cases}\]

So, in what way does including the interaction terms \(x_{i1}x_{i2}\) and \(x_{i1}x_{i3}\) in the model imply that the predictors have an interaction effect on the mean response? Note that the slopes of the three regression functions differ: the slope of the first line is \(\beta_1 + \beta_{12}\), the slope of the second line is \(\beta_1 + \beta_{13}\), and the slope of the third line is \(\beta_1\). In practical terms, this means that the effect of age on the effectiveness of the treatment depends on which treatment the patient receives; that is, age and treatment interact.

Let’s examine these three lines visually to see how the effect of age differs across the treatments. First we need to find the estimated regression lines. The regression output is given below.
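
The model was fit in R using lm(), where the * operator expands to the main effects plus the age-by-treatment interaction terms; the call below matches the one shown in the output (the object name depression.fit is simply ours, so we can reuse the fit later).

depression.fit <- lm(y ~ age * x2 + age * x3, data = depression.dat)
summary(depression.fit)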

## 
## Call:
## lm(formula = y ~ age * x2 + age * x3, data = depression.dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4366 -2.7637  0.1887  2.9075  6.5634 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.21138    3.34964   1.854 0.073545 .  
## age          1.03339    0.07233  14.288 6.34e-15 ***
## x2          41.30421    5.08453   8.124 4.56e-09 ***
## x3          22.70682    5.09097   4.460 0.000106 ***
## age:x2      -0.70288    0.10896  -6.451 3.98e-07 ***
## age:x3      -0.50971    0.11039  -4.617 6.85e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.925 on 30 degrees of freedom
## Multiple R-squared:  0.9143, Adjusted R-squared:  0.9001 
## F-statistic: 64.04 on 5 and 30 DF,  p-value: 4.264e-15

From the output we have \[\hat{y}_i=\begin{cases}(6.2+41.3)+(1.03-0.70)x_{i1}=47.5+0.33x_{i1},&\text{ if patient receives A ($x_{i2}=1$ and $x_{i3}=0$)}\\ (6.2+22.7)+(1.03-0.51)x_{i1}=28.9+0.52x_{i1},&\text{ if patient receives B ($x_{i2}=0$ and $x_{i3}=1$)}\\ 6.2+1.03x_{i1},&\text{ if patient receives C ($x_{i2}=0$ and $x_{i3}=0$)}\end{cases}\]
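
One way to examine the three estimated lines visually is to redraw the scatter plot and overlay them using the fitted coefficients. A sketch, assuming the fitted model was stored as depression.fit above:

b <- coef(depression.fit)  # coefficients from the fit above
# redraw the scatter plot and overlay the three estimated regression lines
with(depression.dat, plot(age, y, col = ifelse(x2 == 1, "blue", ifelse(x3 == 1, "red", "green")),
                          pch = ifelse(x2 == 1, 16, ifelse(x3 == 1, 15, 17)),
                          xlab = "Age (years)", ylab = "Effectiveness of Treatment"))
abline(a = b["(Intercept)"] + b["x2"], b = b["age"] + b["age:x2"], col = "blue")   # treatment A
abline(a = b["(Intercept)"] + b["x3"], b = b["age"] + b["age:x3"], col = "red")    # treatment B
abline(a = b["(Intercept)"], b = b["age"], col = "green")                          # treatment C
legend("bottomright", legend = c("A", "B", "C"), title = "TRT", pch = c(16, 15, 17),
       col = c("blue", "red", "green"))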

Advantages of Using Categorical Predictors

In each case we have seen how these models with categorical predictors can be decomposed into simpler models for each level of the categorical variable. So why not just fit the individual regression models? Why bother with these more complex combined models?

There are two main advantages to fitting a single model as opposed to fitting the individual regression models.

  1. Fitting the combined model allows us to pool the data together and obtain more accurate estimates of the regression parameters.
  2. Fitting a combined model allows us to easily and efficiently answer questions concerning the categorical variable while controlling for other variables, as sketched below.
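
For example, in the depression data we can test whether treatment has any effect at all (on either the intercept or the slope) while controlling for age, using a single partial F-test that compares the full model to a reduced model with age only. A sketch, again assuming the fit depression.fit from above:

reduced.fit <- lm(y ~ age, data = depression.dat)  # reduced model: age only
anova(reduced.fit, depression.fit)                 # partial F-test for all treatment terms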