In this lecture, we will discuss exploratory data anlysis and base plotting in R. Topics include

Summary Statistics

We have alread seen some functions for summary statistics such as min, max, median, and mean. Here are several other functions in R for summary statistics.

Function Name	Task Performed
sum(x)	Sums the elements in x
max(x)	Maximum element in x
min(x)	Minimum element in x
range(x)	Range (min to max) of elements in x
length(x)	Number of elements in x
mean(x)	Mean (average value) of elements in x
sum(x)	Sums the elements in x
median(x)	Median (middle value) of the elements in x
var(x)	Variance of the elements in x
sd(x)	Standard deivation of the elements in x
cor(x,y)	Correlation between x and y
quantile(x,p)	The p^th quantile of x
fivenum(x)	The five number summary of x

Let’s look at a few of these using the HERS dataset. his data is from the Heart and Estrogen/Progestin Study (HERS), a clinical trial of hormone therapy for prevention of recurrent heart attacks and deaths among 2,763 post-menopausal women with existing coronary heart disease (Hulley et al., 1998)

#read in the data
setwd("H:/BiostatCourses/PublicHealthComputing/Lectures/Week3ODS_EDA/R")
hersdata <- read.table("hersdata.csv", header=T,sep=",")
#View(hersdata) #look at the dataset as a spreadsheet. try it
str(hersdata)

## 'data.frame':    2763 obs. of  37 variables:
##  $ HT      : Factor w/ 2 levels "hormone therapy",..: 2 2 1 2 2 1 2 1 1 1 ...
##  $ age     : int  70 62 69 64 65 68 70 69 61 62 ...
##  $ raceth  : Factor w/ 3 levels "African American",..: 1 1 3 3 3 1 3 3 3 3 ...
##  $ nonwhite: Factor w/ 2 levels "no","yes": 2 2 1 1 1 2 1 1 1 1 ...
##  $ smoking : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 2 ...
##  $ drinkany: Factor w/ 3 levels "","no","yes": 2 2 2 3 2 3 2 2 3 3 ...
##  $ exercise: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 2 1 ...
##  $ physact : Factor w/ 5 levels "about as active",..: 3 2 1 2 4 1 1 3 1 4 ...
##  $ globrat : Factor w/ 6 levels "","excellent",..: 4 4 4 4 4 4 3 6 6 4 ...
##  $ poorfair: Factor w/ 3 levels "","no","yes": 2 2 2 2 2 2 3 2 2 2 ...
##  $ medcond : int  0 1 0 1 0 0 0 0 1 0 ...
##  $ htnmeds : Factor w/ 2 levels "no","yes": 2 2 2 2 1 1 1 2 2 2 ...
##  $ statins : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 2 1 1 ...
##  $ diabetes: Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
##  $ dmpills : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ insulin : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 2 1 1 1 ...
##  $ weight  : num  73.8 70.9 102 64.4 57.9 ...
##  $ BMI     : num  23.7 28.6 42.5 24.4 21.9 ...
##  $ waist   : num  96 93 110 87 77 ...
##  $ WHR     : num  0.932 0.964 0.782 0.877 0.794 ...
##  $ glucose : int  84 111 114 94 101 116 120 95 105 98 ...
##  $ weight1 : num  73.6 73.4 96.1 58.6 58.9 ...
##  $ BMI1    : num  23.6 28.9 40.7 22.5 22.3 ...
##  $ waist1  : num  93 95 103 77 76.5 ...
##  $ WHR1    : num  0.912 0.964 0.774 0.802 0.757 ...
##  $ glucose1: int  94 78 98 93 92 115 NA 95 113 98 ...
##  $ tchol   : int  189 307 254 204 214 212 314 195 213 303 ...
##  $ LDL     : num  122 242 166 116 151 ...
##  $ HDL     : int  52 44 57 56 42 52 33 46 50 40 ...
##  $ TG      : int  73 107 154 159 107 111 261 139 150 215 ...
##  $ tchol1  : int  201 216 254 207 235 202 NA 190 211 240 ...
##  $ LDL1    : num  138 151 156 123 172 ...
##  $ HDL1    : int  48 48 66 57 35 53 NA 54 57 39 ...
##  $ TG1     : int  77 87 160 137 139 112 NA 113 239 139 ...
##  $ SBP     : int  138 118 134 152 175 174 119 178 162 111 ...
##  $ DBP     : int  78 70 78 72 95 98 61 82 70 71 ...
##  $ age10   : num  7 6.2 6.9 6.4 6.5 ...

attach(hersdata)
mean(BMI1, na.rm=T)

## [1] 28.36211

sd(BMI1, na.rm=T)

## [1] 5.558147

median(BMI1, na.rm=T)

## [1] 27.535

quantile(BMI1, probs = c(0.25, 0.5, 0.75), na.rm=T)

##    25%    50%    75% 
## 24.340 27.535 31.540

min(BMI1, na.rm=T)

## [1] 14.73

max(BMI1, na.rm=T)

## [1] 54.04

fivenum(BMI1, na.rm=T)

## [1] 14.730 24.340 27.535 31.540 54.040

The summary function is a generic function in R, meaning that its function depends on the type of R object that you pass to it. For a quantitative variable, you get the five number summary statistics and the mean.

summary(BMI1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   14.73   24.34   27.54   28.36   31.54   54.04     153

For factors (categorical variables), you get counts.

summary(smoking)

##   no  yes 
## 2403  360

We can see the function calls that summary can make based on the R object by looking at its methods.

methods(summary)

##  [1] summary.aov                    summary.aovlist*              
##  [3] summary.aspell*                summary.check_packages_in_dir*
##  [5] summary.connection             summary.data.frame            
##  [7] summary.Date                   summary.default               
##  [9] summary.ecdf*                  summary.factor                
## [11] summary.glm                    summary.infl*                 
## [13] summary.lm                     summary.loess*                
## [15] summary.manova                 summary.matrix                
## [17] summary.mlm*                   summary.nls*                  
## [19] summary.packageStatus*         summary.PDF_Dictionary*       
## [21] summary.PDF_Stream*            summary.POSIXct               
## [23] summary.POSIXlt                summary.ppr*                  
## [25] summary.prcomp*                summary.princomp*             
## [27] summary.proc_time              summary.srcfile               
## [29] summary.srcref                 summary.stepfun               
## [31] summary.stl*                   summary.table                 
## [33] summary.tukeysmooth*          
## see '?methods' for accessing help and source code

We can apply summary to multiple columns of a data frame to summarize each column at once.

summary(hersdata[1:3])

##                HT            age                     raceth    
##  hormone therapy:1380   Min.   :44.00   African American: 218  
##  placebo        :1383   1st Qu.:62.00   Other           :  94  
##                         Median :67.00   White           :2451  
##                         Mean   :66.65                          
##                         3rd Qu.:72.00                          
##                         Max.   :79.00

Basic R Plotting

Plots for Categorical Data

For categorical data, the basic plots we are interested in are tables, barplots and pie charts. To make a table, we use the table function in R.

#?table #see the help documentation
table(raceth)

## raceth
## African American            Other            White 
##              218               94             2451

We can also make contingency tables of counts with the table function.

table(raceth,smoking)

##                   smoking
## raceth               no  yes
##   African American  186   32
##   Other              85    9
##   White            2132  319

We will discuss more about contingency tables and row and column percentages in the lecture later on categorical data analysis.

To make a bar plot use the barplot() function applied to a table (this passes in the counts needed to make the bars).

barplot(table(raceth))

And to make a piechart use the pie function again applied to a table.

pie(table(raceth))

Plots for Quantitative variables

For one quantitative variable, we can make a histogram, stemplot, or a boxplot (there is also a dotplot but these don’t look very good in R).

#?hist
hist(BMI)

#Breaks = 20 gives more bins
#freq = FALSE results in a relative frequency histogram
hist(BMI1, breaks = 20, freq = FALSE)

#?boxplot
#Points indicate outliers
boxplot(BMI)

#Horizontal makes the boxplot horizontal
#range = 0 makes the whiskers extend to the max and min values
boxplot(BMI1, horizontal = TRUE, range = 0)

#?stem
stem(BMI1)

## 
##   The decimal point is at the |
## 
##   14 | 7
##   16 | 2445567903345699
##   18 | 001134445566778900011223333455555556666677777778888888888889999999
##   20 | 00001111111111112222333344444444455555555555555566666777777778888888+90
##   22 | 00000000000000011111111111222222222222233333333333333344444444444445+236
##   24 | 00000000000000000000000011111111111111111111112222222222222222222223+293
##   26 | 00000000000000000000000001111111111111111111111111112222222222222222+369
##   28 | 00000000000000000000000000111111111111111111111111122222222222222333+264
##   30 | 00000000000001111111111222222222222222333333333333344444444444444455+193
##   32 | 00000000000001111111111111222222222223333333333333333444444555555555+120
##   34 | 00000000001111111111122222333344444555555555666677777777777777788888+63
##   36 | 00000000000111111222222233334445555566777778888888888889999000111122+22
##   38 | 00000001122222333455555555567777778888000111222333455555677777777889
##   40 | 1112245556666777788800001144457777779
##   42 | 012444566669900149
##   44 | 12446800255799
##   46 | 6781579
##   48 | 03556789
##   50 | 28
##   52 | 8
##   54 | 0

The Plot Function

Plot is another generic function in R. This means its output depends on what you pass in as its arguments.

plot(raceth) #factors are plotted as bar plots

plot(BMI) #one are plotted as index vs variable for a time series plot

plot(SBP,DBP) #two quantitative variables yeild a scatterplot.

If we want a plot with one quantitative and one categorical, then we would use side by side boxplots. There are several ways to make these.

boxplot(split(BMI1, raceth))

boxplot(BMI1 ~ raceth) #using R formulas

plot(BMI ~ raceth)

Low Level Plotting Functions

As you may have notice, the base R plots don’t look that great by defualt. We will need to use options of the plotting function or other functions such as legend to make the plot look better. Let’s look at the scatter plot of systolic blood pressure vs diastolic blood pressure again.

plot(SBP, DBP)

There are a few problems

The x and y axis labels are not descriptive
Three is no title
The open circle may not be the best plotting character
Maybe we want to include more information in the plot such as different colors based on smoking status.

plot(SBP, DBP, xlab = "Systolic Blood Pressure",
               ylab = "Diastolic Blood Pressure",
               main = "Systolic vs. Diastolic Blood Pressure by Smoking Status",
               sub = "HERS Data",
               col = ifelse(smoking == "no", "blue", "red"),
               pch = ifelse(smoking == "no", 0, 16))
legend("bottomright", legend = c("Smoking Status","yes", "no"), 
        col = c("white","red", "blue"), pch = c(0,16, 0))

We can also combine plots or add extra points, descriptive text, lines or other symbols to the plot. Lets add a kernel density estimate and an estimated normal density curve to the histogram of BMI1.

Without any modifications we have a histogram that looks like this

hist(BMI1)

Now let’s make it look better and add the density curves.

hist(BMI1, col = "gray", freq = F, ylim = c(0,0.1), 
     xlab = "Body Mass Index", xaxt = "n", main = "")
axis(1, at = seq(10,55,by = 5)) #creat your own x axis
title(main = "Body Mass Index", sub = "HERS data") #can also use title function
lines(density(BMI1[!is.na(BMI1)]), col = "blue") #plots the density estimate on the latest plot
x <- seq(10,55,length.out=100)
lines(x,dnorm(x, mean = mean(BMI1, na.rm = T), sd = sd(BMI1, na.rm=T)),
      lty = 2, col = "red", add = T) #plot the normal density curve

## Warning in plot.xy(xy.coords(x, y), type = type, ...): "add" is not a
## graphical parameter

legend("topright", legend = c("Kernel", "Normal"), col = c("blue", "red"), lty = c(1,2))

We can use the layout function to control where plots are placed in panels.

#?layout

opar <- par(no.readonly = T) #save original parameters
layout(matrix(c(1,1,2,3), nrow = 2, ncol = 2, byrow = T))
hist(BMI1)
boxplot(age)
plot(SBP, DBP)

par(opar)

There are many other graphical parameters we can set. See ?par for a list of graphical paramters in the help documentation.The margins are another parameter you will probably want to set in order to remove or add whitespace between plots. Here is a plot that shows in different margins.

#The following creates a plot that shows where the margins are and how to add text
# Generate data  
x = 0:10;  
y = 0:10;  

# Plot the data  

# - Specify the layout parameters before any plotting  
#   If you don't specify them before any plotting, the  
#   results will be inconsistent and misaligned.  
#  
# - oma stands for 'Outer Margin Area', or the total margin space that is outside  
#   of the standard plotting region (see graph)  
#  
# - The vector is ordered, the first value corresponding to the bottom. The entire  
#   array is c(bottom, left, top, right)  
#  
# - All of the alternatives are:  
#   - oma: Specify width of margins in number of lines  
#   - omi: Specify width of margins in inches  
#   - omd: Specify width of margins in 'device coordinates'  
#       - Device coordinates place (0,0) in the upper left and (1,1) in the  
#         lower right corner  

par(oma=c(3,3,3,3))  # all sides have 3 lines of space  
#par(omi=c(1,1,1,1)) # alternative, uncomment this and comment the previous line to try  

# - The mar command represents the figure margins. The vector is in the same ordering of  
#   the oma commands.  
#   
# - The default size is c(5,4,4,2) + 0.1, (equivalent to c(5.1,4.1,4.1,2.1)).   
#  
# - The axes tick marks will go in the first line of the left and bottom with the axis  
#   label going in the second line.  
#  
# - The title will fit in the third line on the top of the graph.   
#  
# - All of the alternatives are:  
#   - mar: Specify the margins of the figure in number of lines  
#   - mai: Specify the margins of the figure in number of inches  

par(mar=c(5,4,4,2) + 0.1)   
#par(mai=c(2,1.5,1.5,.5)) # alternative, uncomment this and comment the previous line  

# Plot  
plot(x, y, type="n", xlab="X", ylab="Y")    # type="n" hides the points  

# Place text in the plot and color everything plot-related red  
text(5,5, "Plot", col="red", cex=2)  
text(5,4, "text(5,5, \"Plot\", col=\"red\", cex=2)", col="red", cex=1)  
box("plot", col="red")  

# Place text in the margins and label the margins, all in green  
mtext("Margins", side=3, line=2, cex=2, col="green")  
mtext("par(mar=c(5,4,4,2) + 0.1)", side=3, line=1, cex=1, col="green")  
mtext("Line 0", side=3, line=0, adj=1.0, cex=1, col="green")  
mtext("Line 1", side=3, line=1, adj=1.0, cex=1, col="green")  
mtext("Line 2", side=3, line=2, adj=1.0, cex=1, col="green")  
mtext("Line 3", side=3, line=3, adj=1.0, cex=1, col="green")  
mtext("Line 0", side=2, line=0, adj=1.0, cex=1, col="green")  
mtext("Line 1", side=2, line=1, adj=1.0, cex=1, col="green")  
mtext("Line 2", side=2, line=2, adj=1.0, cex=1, col="green")  
mtext("Line 3", side=2, line=3, adj=1.0, cex=1, col="green")  
box("figure", col="green")  

# Label the outer margin area and color it blue  
# Note the 'outer=TRUE' command moves us from the figure margins to the outer  
# margins.  
mtext("Outer Margin Area", side=1, line=1, cex=2, col="blue", outer=TRUE)  
mtext("par(oma=c(3,3,3,3))", side=1, line=2, cex=1, col="blue", outer=TRUE)  
mtext("Line 0", side=1, line=0, adj=0.0, cex=1, col="blue", outer=TRUE)  
mtext("Line 1", side=1, line=1, adj=0.0, cex=1, col="blue", outer=TRUE)  
mtext("Line 2", side=1, line=2, adj=0.0, cex=1, col="blue", outer=TRUE)  
box("outer", col="blue")

par(opar)

We can even add other plots in the margins!

# Add boxplots to a scatterplot
par(fig=c(0,0.8,0,0.8), mar = c(4,4,4,3))
plot(SBP, DBP, xlab="Systolic Blood Pressure",
     ylab="Diastolic Blood Pressure")
par(fig=c(0,0.8,0.55,1), new=TRUE)
boxplot(SBP, horizontal=TRUE, axes=FALSE, boxwex = 2)
par(fig=c(0.65,1,0,0.8),new=TRUE)
boxplot(DBP, axes=FALSE, boxwex = 1.1)
mtext("Enhanced Scatterplot", side=3, outer=TRUE, line=-3)

par(opar)

Multivariate Plots

When comparing multiple quantitative variables a scatterplot matrix can be helpful.

pairs(data.frame(DBP,SBP,BMI1,glucose1))

Sometimes its useful to add a univariate plot on the diagonal. To add a histogram we need to write our own function to create the histogram plot. Note that we can’t just use the hist function. This is advanced!

panel.hist <- function(x, ...)
{
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 1.5) )
  h <- hist(x, plot = FALSE)
  breaks <- h$breaks; nB <- length(breaks)
  y <- h$counts; y <- y/max(y)
  rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}
pairs(data.frame(DBP,SBP,BMI1,glucose1), diag.panel = panel.hist)

Some other useful plots are coplots and xyplots (simialr to SGPANEL in SAS) in the lattice package.

#coplots
#?coplot
coplot(DBP ~ SBP | smoking + HT)

## 
##  Missing rows: 2450

#xyplot - this plot usually can be made to look nicer
library(lattice)
#?xyplot
xyplot(DBP ~ SBP | smoking + HT)

Saving Plots

In RStudio, you can export a plot using the export option in the plots window. There are also several built in functions to save plots in different formats such as

pdf
png
jpeg

See the help at ?pdf, ?png, and ?jpeg for more details. We simply need to call the function before we make our plot and follow the plot code with dev.off() to save the image.

pdf("histogram.pdf", width = 6, height = 5)

hist(BMI1, col = "gray", freq = F, ylim = c(0,0.1), 
     xlab = "Body Mass Index", xaxt = "n", main = "")
axis(1, at = seq(10,55,by = 5)) #creat your own x axis
title(main = "Body Mass Index", sub = "HERS data") #can also use title function
lines(density(BMI1[!is.na(BMI1)]), col = "blue") #plots the density estimate on the latest plot
curve(dnorm(x, mean = mean(BMI1, na.rm = T), sd = sd(BMI1, na.rm=T)),
      lty = 2, col = "red", add = T) #plot the normal density curve
legend("topright", legend = c("Kernel", "Normal"), col = c("blue", "red"), lty = c(1,2))

dev.off() #close the graphics device to save image

## png 
##   2

EDA and Plotting in R

Robert Parker