Categorical Predictors
So far we have only considered quantitative predictors, but we can also have categorical predictors in linear regression models. Let’s consider the following research question:
Is a baby’s birth weight related to the mother’s smoking during pregnancy?
Researchers interested in answering the above research question collected data on a random sample of 32 births:
- Response (y): birth weight (Weight) in grams of the infant
- Predictor 1 (\(x_1\)): smoking status of the mother (yes or no)
- Predictor 2 (\(x_2\)): length of gestation in weeks
The variable Smoking only takes two values yes or no, making it a categorical variable. Since it only takes two values, Smoking is referred to as a binary variable.
##
## Welch Two Sample t-test
##
## data: Wgt by Smoke
## t = 0.74829, df = 29.973, p-value = 0.4601
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -159.9642 344.9642
## sample estimates:
## mean in group no mean in group yes
## 3066.125 2973.625
If we only study the relationship between birth weight and smoking, we could perform a two-sample t-test to see whether the mean birth weight differs between infants whose mothers smoked and those whose mothers did not. In a boxplot of birth weight by smoking status, the overall weights do not look very different, and the t-test confirms this with a p-value of 0.4601. There is no evidence that the mean birth weight differs between infants whose mothers smoked and infants whose mothers did not smoke.
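A minimal sketch of how this t-test could be produced in R, assuming the data are stored in `birthsmokers.dat` with columns `Wgt` and `Smoke` (the same names used in the regression call later in this section):

```r
# Welch two-sample t-test of birth weight by smoking status;
# t.test() does not assume equal variances by default
t.test(Wgt ~ Smoke, data = birthsmokers.dat)
```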
However, this ignores some important information, such as how long the gestation period was. The important question now is, after taking into account the length of gestation, is there a significant difference in the average birth weights of babies born to smoking and non-smoking mothers? We can use the following regression model to answer this question:
\[y_i=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\varepsilon_i\] where
- \(y_i\) is the birth weight of baby i
- \(x_{i1}\) is the gestation length of baby i
- \(x_{i2}\) is a binary variable coded as 1 if the baby’s mother smoked during pregnancy and 0 if she did not
and the \(\varepsilon_i\) are independent and have normal distributions with mean 0 and equal variance \(\sigma^2\).
Notice that in order to include a qualitative variable in a regression model, we have to code the variable, that is, assign a unique number to each of the possible categories. This leads to two regression models:
- A regression model for infants with mothers who smoked during pregnancy: This means \(x_{i2}=1\)
\[y_i=\beta_0+\beta_1x_{i1}+\beta_2(1)+\varepsilon_i=(\beta_0+\beta_2)+\beta_1x_{i1}+\varepsilon_i\]
- A regression model for infants with mothers who did not smoke during pregnancy: This means that \(x_{i2}=0\)
\[y_i=\beta_0+\beta_1x_{i1}+\beta_2(0)+\varepsilon_i=\beta_0+\beta_1x_{i1}+\varepsilon_i\] Below is a scatterplot of the data along with the two estimated regression lines and the regression output.
##
## Call:
## lm(formula = Wgt ~ Gest + Smoke, data = birthsmokers.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -223.693 -92.063 -9.365 79.663 197.507
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2389.573 349.206 -6.843 1.63e-07 ***
## Gest 143.100 9.128 15.677 1.07e-15 ***
## Smokeyes -244.544 41.982 -5.825 2.58e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 115.5 on 29 degrees of freedom
## Multiple R-squared: 0.8964, Adjusted R-squared: 0.8892
## F-statistic: 125.4 on 2 and 29 DF, p-value: 5.289e-15
The resulting estimated regression equations are
\[\widehat{Wgt}_i=\begin{cases}-2390+143.1\times Gest,&\text{ if the mother did not smoke during pregnancy}\\
-2634+143.1\times Gest,&\text{ if the mother smoked during pregnancy}\end{cases}\] The blue circles represent the data on non-smoking mothers (\(x_2=0\)), while the red circles represent the data on smoking mothers (\(x_2=1\)). The blue line represents the estimated linear relationship between length of gestation and birth weight for non-smoking mothers, while the red line represents the estimated linear relationship for smoking mothers. Note that \(\beta_2\) represents the difference in mean birth weight, after controlling for gestation length, between infants whose mothers smoked during pregnancy and infants whose mothers did not smoke during pregnancy.
At least in this sample of data, it appears as if the birth weights for non-smoking mothers are higher than those for smoking mothers, regardless of the length of gestation. A hypothesis test or confidence interval would allow us to see if this result extends to the larger population.
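A sketch of how this plot could be drawn in R, assuming the model is fit with the same variable names as in the output above (the color and legend choices are illustrative):

```r
# Fit the parallel-lines model and plot the data with both fitted lines
fit <- lm(Wgt ~ Gest + Smoke, data = birthsmokers.dat)
with(birthsmokers.dat,
     plot(Gest, Wgt, col = ifelse(Smoke == "yes", "red", "blue"),
          xlab = "Gestation (weeks)", ylab = "Birth weight (g)"))
b <- coef(fit)
# Non-smokers: intercept is b0; smokers: intercept shifts by the Smokeyes coefficient
abline(a = b["(Intercept)"], b = b["Gest"], col = "blue")
abline(a = b["(Intercept)"] + b["Smokeyes"], b = b["Gest"], col = "red")
legend("topleft", legend = c("no", "yes"), title = "Smoke",
       col = c("blue", "red"), pch = 1)
```

Because `Smoke` enters the model without an interaction, both lines share the slope for `Gest`, which is why a single shifted intercept is all that distinguishes them.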
How would we answer the following set of research questions?
- Is baby’s birth weight related to smoking during pregnancy, after taking into account length of gestation? (Conduct a hypothesis test for testing whether the slope parameter for smoking is 0.)
- How is birth weight related to gestation, after taking into account a mother’s smoking status? (Calculate and interpret a confidence interval for the slope parameter for gestation.)
We will learn how to answer these questions in the next section.
Categorical Variable with \(>2\) Categories
Suppose we want to study the effect of exercise on glucose levels in the blood stream. High glucose levels lead to diabetes, which is a known risk factor for heart disease, so understanding how exercise and other factors affect glucose levels is important. Suppose that exercise was measured on a scale of 1 to 5, where
- 1 = much less active
- 2 = less active
- 3 = somewhat active
- 4 = more active
- 5 = much more active.
We can include this categorical variable in a regression model by coding. In the smoking example, we coded the binary variable as a 0/1 variable. What we did was choose No to be represented by 0 and Yes to be represented by 1. No in this case is what is called the reference category. To code the exercise variable, we will need to choose one of the categories to be the reference category, and then we can create four binary variables, known as indicator variables, to represent the other four categories. Let’s choose 1 as the reference category. We can define the four indicator variables as follows:
- \(x_{i1}\) is 1 if person i is less active, and 0 otherwise
- \(x_{i2}\) is 1 if person i is somewhat active, and 0 otherwise
- \(x_{i3}\) is 1 if person i is more active, and 0 otherwise
- \(x_{i4}\) is 1 if person i is much more active, and 0 otherwise.
This leads to the following regression model:
\[E(glucose|x)=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\beta_3x_{i3}+\beta_4x_{i4}\]
This means that
\[E(glucose|x)=\begin{cases}\beta_0,&\text{ if much less active}\\
\beta_0+\beta_1,&\text{ if less active}\\
\beta_0+\beta_2,&\text{ if somewhat active}\\
\beta_0+\beta_3,&\text{ if more active}\\
\beta_0+\beta_4,&\text{ if much more active}\end{cases}\]
In general, if the categorical variable has c levels, then you will need \(c-1\) indicator variables. So
- If your qualitative variable defines 2 groups, then you need 1 indicator variable.
- If your qualitative variable defines 3 groups, then you need 2 indicator variables.
- If your qualitative variable defines 4 groups, then you need 3 indicator variables.
And, so on.
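In R this coding happens automatically when the variable is stored as a factor. The snippet below uses a made-up `exercise` variable purely for illustration; `model.matrix()` displays the design matrix that `lm()` would build, with the first level as the reference category:

```r
# Hypothetical 5-level exercise variable; level 1 is the reference
exercise <- factor(c(1, 2, 3, 4, 5, 3), levels = 1:5)
# The design matrix has an intercept plus c - 1 = 4 indicator columns,
# one for each non-reference level
model.matrix(~ exercise)
```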
Interactions
Interaction terms allow the effect of a treatment to vary under different conditions. Let’s consider the following example on treatments for severe depression. For the sake of simplicity, we denote the three treatments by A, B, and C. The researchers collected the following data on 36 severely depressed individuals:
- \(Y_i\) = measure of the effectiveness of the treatment on person i’s depression
- \(x_{i1}\) = age (in years) of individual i
- \(x_{i2}=1\) if individual i received treatment A and 0 otherwise
- \(x_{i3}=1\) if individual i received treatment B and 0 otherwise
The data are shown below.
depression.dat <- read.table("Data/depression.txt",header=T)
with(depression.dat,plot(age,y,col=ifelse(x2==1,"blue",ifelse(x3==1,"red","green")),pch=ifelse(x2==1,16,ifelse(x3==1,15,17)),
xlab="Age (years)",ylab="Effectiveness of Treatment"))
legend("bottomright",legend=c("A","B","C"),title="TRT",pch=c(16,15,17),col=c("blue","red","green"))
The blue circles represent the data for individuals receiving treatment A, the red squares represent the data for individuals receiving treatment B, and the green diamonds represent the data for individuals receiving treatment C.
In the previous example, the two estimated regression functions had the same slope; that is, they were parallel. If you tried to draw three best-fitting lines through the data of this example, do you think the slopes of your lines would be the same? Probably not! In this case, we need to include what are called interaction terms in our formulated regression model.
A model that would allow each group to have different slopes is:
\[y_i=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\beta_3x_{i3}+\beta_{12}x_{i1}x_{i2}+\beta_{13}x_{i1}x_{i3}+\varepsilon_i\]
where the \(\varepsilon_i\) are independent and have normal distributions with mean 0 and equal variance \(\sigma^2\). Since we have three treatments, we can break this down to three separate regression models:
\[y_i=\begin{cases}(\beta_0+\beta_2)+(\beta_1+\beta_{12})x_{i1},&\text{ if patient receives A ($x_{i2}=1$ and $x_{i3}=0$)}\\
(\beta_0+\beta_3)+(\beta_1+\beta_{13})x_{i1},&\text{ if patient receives B ($x_{i2}=0$ and $x_{i3}=1$)}\\
\beta_0+\beta_1x_{i1},&\text{ if patient receives C ($x_{i2}=0$ and $x_{i3}=0$)}\end{cases}\] So, in what way does including the interaction terms, \(x_{i1}x_{i2}\) and \(x_{i1}x_{i3}\), in the model imply that the predictors have an interaction effect on the mean response? Note that the slopes of the three regression functions differ — the slope of the first line is \(\beta_1 + \beta_{12}\), the slope of the second line is \(\beta_1 + \beta_{13}\), and the slope of the third line is \(\beta_1\). What does this mean in a practical sense? It means that…
- the effect of the individual’s age (\(x_1\)) on the treatment’s mean effectiveness depends on the treatment (\(x_2\) and \(x_3\)), and …
- the effect of treatment (\(x_2\) and \(x_3\)) on the treatment’s mean effectiveness depends on the individual’s age (\(x_1\)).
Let’s examine these three lines visually to see how the treatment and age depend on each other. First we need to find the estimated regression lines. The regression output is given below.
##
## Call:
## lm(formula = y ~ age * x2 + age * x3, data = depression.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4366 -2.7637 0.1887 2.9075 6.5634
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.21138 3.34964 1.854 0.073545 .
## age 1.03339 0.07233 14.288 6.34e-15 ***
## x2 41.30421 5.08453 8.124 4.56e-09 ***
## x3 22.70682 5.09097 4.460 0.000106 ***
## age:x2 -0.70288 0.10896 -6.451 3.98e-07 ***
## age:x3 -0.50971 0.11039 -4.617 6.85e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.925 on 30 degrees of freedom
## Multiple R-squared: 0.9143, Adjusted R-squared: 0.9001
## F-statistic: 64.04 on 5 and 30 DF, p-value: 4.264e-15
From the output we have \[\hat{y}_i=\begin{cases}(6.21+41.30)+(1.03-0.70)x_{i1}=47.51+0.33x_{i1},&\text{ if patient receives A ($x_{i2}=1$ and $x_{i3}=0$)}\\
(6.21+22.71)+(1.03-0.51)x_{i1}=28.92+0.52x_{i1},&\text{ if patient receives B ($x_{i2}=0$ and $x_{i3}=1$)}\\
6.21+1.03x_{i1},&\text{ if patient receives C ($x_{i2}=0$ and $x_{i3}=0$)}\end{cases}\]
- For patients in this study receiving treatment A, the mean effectiveness of the treatment is estimated to increase 0.33 units for every additional year in age.
- For patients in this study receiving treatment B, the mean effectiveness of the treatment is estimated to increase 0.52 units for every additional year in age.
- For patients in this study receiving treatment C, the mean effectiveness of the treatment is estimated to increase 1.03 units for every additional year in age.
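These group-specific intercepts and slopes can be recovered directly from the fitted coefficients. A sketch, assuming the model is fit as in the output above:

```r
fit <- lm(y ~ age * x2 + age * x3, data = depression.dat)
b <- coef(fit)
# Treatment C is the reference: its slope is the baseline age coefficient.
# Treatments A and B add their respective interaction terms.
slopes <- c(A = unname(b["age"] + b["age:x2"]),
            B = unname(b["age"] + b["age:x3"]),
            C = unname(b["age"]))
intercepts <- c(A = unname(b["(Intercept)"] + b["x2"]),
                B = unname(b["(Intercept)"] + b["x3"]),
                C = unname(b["(Intercept)"]))
round(rbind(intercepts, slopes), 2)
```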