The four conditions (LINE) that comprise the multiple linear regression model are linearity, independence, normality, and equal variance.
In multiple linear regression, we can assess these assumptions by examining the residuals, \(e_i = y_i - \hat{y}_i\). An equivalent way to state these four assumptions in terms of the errors is that the true errors, \(\varepsilon_i\), are independent and normally distributed with mean 0 and constant variance.
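As a concrete illustration, here is a minimal sketch in Python of fitting a multiple linear regression and extracting the residuals \(e_i\) and predicted values \(\hat{y}_i\). The data frame `df`, the response `y`, and the predictors `x1` and `x2` are hypothetical stand-ins; in practice you would substitute your own data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: replace with your own data frame containing
# a response y and predictors x1, x2.
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 2 + 1.5 * df["x1"] - 0.8 * df["x2"] + rng.normal(scale=1.0, size=n)

# Fit the multiple linear regression y = b0 + b1*x1 + b2*x2 + error
fit = smf.ols("y ~ x1 + x2", data=df).fit()

# Residuals e_i = y_i - yhat_i and predicted (fitted) values yhat_i
resid = fit.resid
fitted = fit.fittedvalues
```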
The independence assumption has to be verified based on the subjects and how the data were collected. Is it reasonable to believe that the response (or error) of one subject is independent of another's? When this is not true, we will need to include these correlations in our model. This is the subject of mixed models, which we will discuss later.
If the relationship between the predictors and the mean response is (approximately) linear, then the errors should have mean 0. We can check this assumption by examining a plot of the residuals vs the predicted values (\(e_i\) vs \(\hat{y}_i\)). If the linearity assumption is met, the resulting scatterplot should not show any trend. It should look like a cloud of points centered around the horizontal line \(y=0\); that is, the vertical average of the points remains close to zero as we move from left to right.
Below are some example scatterplots of residuals vs predicted values. The scatterplot on the left shows no clear pattern; we just see random scatter above and below the line \(y=0\). In the second plot, however, there is a clear curved (quadratic) pattern. This second plot indicates that the linearity assumption is not met, and we need to rethink our model or use another technique.
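A minimal sketch of this residuals-versus-fitted plot, assuming the `fit` object from the earlier snippet, follows. A curved pattern like the one in the second example plot would show up here as a systematic bend rather than random scatter.

```python
import matplotlib.pyplot as plt

# Residuals vs fitted values: under the linearity assumption the points
# should scatter randomly around the horizontal line at 0 with no trend.
plt.scatter(fit.fittedvalues, fit.resid, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs fitted values")
plt.show()
```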
The normality assumption is usually assessed with histograms and QQ plots. In a histogram, we want to see a symmetric and bell-shaped distribution. We can also overlay an estimated normal density curve on the histogram to help assess whether it looks normal or not. Do note that with small samples, it is unlikely that we will see a bell-shaped histogram even if the data did truly come from a normal distribution. In this case, we are looking to see that there are no major deviations from normality such as heavy skewness or outliers. In the first histogram, the distribution looks symmetric and bell shaped, showing no major deviations from the normal distribution, but in the second histogram, the errors are heavily right skewed.
We can also use QQ plots to check the normality assumption. Again, we are more concerned with major deviations from normality than with the data perfectly following the line in the QQ plot. We can see in the QQ plot on the left that the residuals roughly follow a normal distribution, but the errors in the plot on the right deviate greatly from the line near the tails, indicating heavy skewness.
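A minimal sketch of these two normality checks, again assuming the residuals `resid` from the earlier model fit. The overlaid normal density uses the sample mean and standard deviation of the residuals.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram of residuals with an estimated normal density curve overlaid
ax1.hist(resid, bins=20, density=True, alpha=0.6)
grid = np.linspace(resid.min(), resid.max(), 200)
ax1.plot(grid, stats.norm.pdf(grid, loc=resid.mean(), scale=resid.std()), color="red")
ax1.set_title("Histogram of residuals")

# Normal QQ plot: points should fall roughly along the reference line
stats.probplot(resid, dist="norm", plot=ax2)
ax2.set_title("Normal QQ plot of residuals")

plt.tight_layout()
plt.show()
```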
For the equal variance assumption, we should examine a plot of the residuals vs the predicted values. If the variance is constant, then the vertical spread of the residuals should remain roughly constant; we should not see any “fanning” out of the residuals. In the left scatterplot, we see an example where the equal variance assumption is met: the random scatter above and below the line \(y=0\) remains roughly constant in spread. However, in the scatterplot on the right, we see an example of unequal variances; as you move from left to right in the plot, the vertical spread increases.
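To illustrate what this “fanning” looks like, here is a small simulated sketch (hypothetical data, not one of the examples above) in which the error standard deviation grows with the predictor, so the residuals-versus-fitted plot spreads out from left to right.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 10, size=n)
# Error standard deviation increases with x -> unequal variances
y = 1 + 2 * x + rng.normal(scale=0.5 + 0.4 * x, size=n)

fit_het = sm.OLS(y, sm.add_constant(x)).fit()

# The vertical spread of the residuals widens as the fitted values increase
plt.scatter(fit_het.fittedvalues, fit_het.resid, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Fanning pattern: unequal variances")
plt.show()
```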
Possible methods to address unequal variances:
The model assumptions are important for inference. If the model assumptions are not met, then our results will not be valid since
There are some other potential issues that affect the reliability of our regression estimates.