Much of these notes are from https://monashbioinformaticsplatform.github.io/r-intro/.

R is both a programming language and an interactive environment for statistics. Today we will be concentrating on R as an interactive environment.

Working with R is primarily text-based. The basic mode of use for R is that the user types in a command in the R language and presses enter, and then R computes and displays the result.

We will be working in RStudio. This surrounds the console, where one enters commands and views the results, with various conveniences. In addition to the console, RStudio provides panels containing:

Open RStudio, click on the “Console” pane, type 1+1 and press enter. R displays the result of the calculation. In this document, we will be showing such an interaction with R as below.

1+1
## [1] 2

+ is called an operator. R has the operators you would expect for for basic mathematics: + - * / ^. It also has operators that do more obscure things.

* has higher precedence than +. We can use brackets if necessary ( ). Try 1+2*3 and (1+2)*3.

Spaces can be used to make code easier to read.

We can compare with == < > <= >=. This produces a “logical” value, TRUE or FALSE. Note the double equals, ==, for equality comparison.

2 * 2 == 4
## [1] TRUE

There are also character strings such as "string".

Variables

A variable is a name for a value, such as x, current_temperature, or subject.id. We can create a new variable by assigning a value to it using <-.

weight_kg <- 55

RStudio helpfully shows us the variable in the “Environment” pane. We can also print it by typing the name of the variable and hitting enter. In general, R will print to the console any object returned by a function or operation unless we assign it to a variable.

weight_kg
## [1] 55

Examples of valid variables names: hello, hello_there, hello.there, value1. Spaces aren’t ok inside variable names. Dots (.) are ok, unlike in many other languages.

We can do arithmetic with the variable:

# weight in pounds:
2.2 * weight_kg
## [1] 121

Tip

We can add comments to our code using the # character. It is useful to document our code in this way so that others (and us the next time we read it) have an easier time following what the code is doing.

We can also change a variable’s value by assigning it a new value:

weight_kg <- 57.5
# weight in kilograms is now
weight_kg
## [1] 57.5

Assigning a new value to one variable does not change the values of other variables. For example, let’s store the subject’s weight in pounds in a variable:

weight_lb <- 2.2 * weight_kg
# weight in kg...
weight_kg
## [1] 57.5
# ...and in pounds
weight_lb
## [1] 126.5

and then change weight_kg:

weight_kg <- 100.0
# weight in kg now...
weight_kg
## [1] 100
# ...and weight in pounds still
weight_lb
## [1] 126.5

Since weight_lb doesn’t “remember” where its value came from, it isn’t automatically updated when weight_kg changes.

This is different from how spreadsheets work.

R Objects

R has four base objects * Vectors * Arrays (matrix is a special case) * Lists * Data frames

Vectors

The most basic element in R is a vector.

A vector of numbers is a collection of numbers. We call the individual numbers elements of the vector.

We can make vectors with c( ), for example c(1,2,3). c means “combine”. R is obsesssed with vectors. In R, numbers are just vectors of length one. Many things that can be done with a single number can also be done with a vector. For example arithmetic can be done on vectors just like on single numbers.

myvec <- c(10,20,30,40,50)
myvec + 1
## [1] 11 21 31 41 51
myvec + myvec
## [1]  20  40  60  80 100
length(myvec)
## [1] 5
c(60, myvec)
## [1] 60 10 20 30 40 50
c(myvec, myvec)
##  [1] 10 20 30 40 50 10 20 30 40 50

When we talk about the length of a vector, we are talking about the number of numbers in the vector.

All elements of a vector must be of the same type.

Atomic Data Types

R has several atomic data types. The four that we will encounter are:

  • Numeric
    • Examples - 42 1e3 4.67
  • Character string
    • Examples - "Hello" "a b c"
  • Logical
    • Examples - TRUE FALSE T F

What happens when we try to inlcude more than one data type in a vector?

c(1,"a",5,6)
## [1] "1" "a" "5" "6"

Notice that now all the output is in quotes. Since all of the elements must be of a single data type, one of the data types will win and the others will be coerced to that data type. In this case the character type wins.

Indexing Vectors

Access elements of a vector with [ ], for example myvec[1] to get the first element. You can also assign to a specific element of a vector.

myvec[1]
## [1] 10
myvec[2]
## [1] 20
myvec[2] <- 5
myvec
## [1] 10  5 30 40 50

Can we use a vector to index another vector? Yes!

myind <- c(4,3,2)
myvec[myind]
## [1] 40 30  5

We could equivalently have written

myvec[c(4,3,2)]
## [1] 40 30  5

Sometimes we want a contiguous slice from a vector.

myvec[3:5]
## [1] 30 40 50

: here actually creates a vector, which is then used to index myvec. : is pretty useful on its own too.

3:5
## [1] 3 4 5
1:50
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
## [47] 47 48 49 50
numbers <- 1:10
numbers*numbers
##  [1]   1   4   9  16  25  36  49  64  81 100

Now we can see why R always puts a [1] in its output: it is indicating that the first element of the vector can be accessed with [1]. Subsequent lines show the appropriate index to access the first number in the line.

Some other ways to generate vectors besides : are with seq() and rep().

seq(3)
## [1] 1 2 3
seq(1,10)
##  [1]  1  2  3  4  5  6  7  8  9 10
seq(1,10, by = 2)
## [1] 1 3 5 7 9
seq(1,10, length.out = 4)
## [1]  1  4  7 10
rep(3:5, times = 4)
##  [1] 3 4 5 3 4 5 3 4 5 3 4 5
rep(3:5, each = 4)
##  [1] 3 3 3 3 4 4 4 4 5 5 5 5

Challenge - Indexing and slicing data

We can take slices of character vectors as well:

phrase <- c("I", "don't", "know", "I", "know")
# first three words
phrase[1:3]
## [1] "I"     "don't" "know"
# last three words
phrase[3:5]
## [1] "know" "I"    "know"
  1. If the first four words are selected using the slice phrase[1:4], how can we obtain the first four words in reverse order?

  2. What is phrase[-2]? What is phrase[-5]? Given those answers, explain what phrase[-1:-3] does.

  3. Use indexing of phrase to create a new character vector that forms the phrase “I know I don’t”, i.e. c("I", "know", "I", "don't").

We can also access elements of a vector by passing a vector of logical values.

myvec >= 25 #TRUE if that element is greater than or equal to 25
## [1] FALSE FALSE  TRUE  TRUE  TRUE
myvec[myvec >= 25]
## [1] 30 40 50

Lists

Vectors contain all the same kind of thing. Lists can contain different kinds of thing. Lists can even contain vectors or other lists as elements.

We generally give the things in a list names. Try list(num=42, greeting="hello"). To access named elements we use $.

mylist <- list(num=42, greeting="Hello, world")
mylist$greeting
## [1] "Hello, world"

Functions that need to return multiple outputs often do so as a list.

Accessing List Elements

We can access elements of a list in several ways. We can index them like vectors, but instead we use [[]].

mylist[[1]]
## [1] 42

We can use the names of the elements to access them as well. We can either pass the name inside [[]] or use $.

mylist[["greeting"]]
## [1] "Hello, world"
mylist$num
## [1] 42

Arrays

Arrays are multidimensional vectors. A special case is a two dimensional array known as a matrix. We will start by looking at matrices in R. The following matrix contains quality measurements (particle size) on PVC plastic production, using eight different resin batches, and three different machine operators.

mat <- as.matrix(read.csv("H:/BiostatCourses/PublicHealthComputing/Lectures/Week1Intro/R/data/pvc.csv", row.names=1))
mat
##        Alice   Bob  Carl
## Resin1 36.25 35.40 35.30
## Resin2 35.15 35.35 33.35
## Resin3 30.70 29.65 29.20
## Resin4 29.70 30.05 28.65
## Resin5 31.85 31.40 29.30
## Resin6 30.20 30.65 29.75
## Resin7 32.90 32.50 32.80
## Resin8 36.80 36.45 33.15

We have read in the data from a csv file using the read.csv function. We will discuss reading data into R from files more next week.

To get more information about the matrix can use the class and str (structre) functions.

class(mat)
## [1] "matrix"
str(mat)
##  num [1:8, 1:3] 36.2 35.1 30.7 29.7 31.9 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:8] "Resin1" "Resin2" "Resin3" "Resin4" ...
##   ..$ : chr [1:3] "Alice" "Bob" "Carl"

Indexing Matrices

We can check the size of the matrix with the functions nrow and ncol:

nrow(mat)
## [1] 8
ncol(mat)
## [1] 3

This tells us that our matrix, mat, has 8 rows and 3 columns.

If we want to get a single value from the matrix, we can provide a row and column index in square brackets:

# first value in mat
mat[1, 1]
## [1] 36.25
# a middle value in mat
mat[4, 2]
## [1] 30.05

If our matrix has row names and column names, we can also refer to rows and columns by name.

mat["Resin4","Bob"]
## [1] 30.05

An index like [4, 2] selects a single element of a matrix, but we can select whole sections as well. For example, we can select the first two operators (columns) of values for the first four resins (rows) like this:

mat[1:4, 1:2]
##        Alice   Bob
## Resin1 36.25 35.40
## Resin2 35.15 35.35
## Resin3 30.70 29.65
## Resin4 29.70 30.05

The slice 1:4 means the numbers from 1 to 4. It’s the same as c(1,2,3,4).

The slice does not need to start at 1, e.g. the line below selects rows 5 through 8:

mat[5:8, 1:2]
##        Alice   Bob
## Resin5 31.85 31.40
## Resin6 30.20 30.65
## Resin7 32.90 32.50
## Resin8 36.80 36.45

We can use vectors created with c to select non-contiguous values:

mat[c(1,3,5), c(1,3)]
##        Alice Carl
## Resin1 36.25 35.3
## Resin3 30.70 29.2
## Resin5 31.85 29.3

We also don’t have to provide an index for either the rows or the columns. If we don’t include an index for the rows, R returns all the rows; if we don’t include an index for the columns, R returns all the columns. If we don’t provide an index for either rows or columns, e.g. mat[, ], R returns the full matrix.

# All columns from row 5
mat[5, ]
## Alice   Bob  Carl 
## 31.85 31.40 29.30
# All rows from column 2
mat[, 2]
## Resin1 Resin2 Resin3 Resin4 Resin5 Resin6 Resin7 Resin8 
##  35.40  35.35  29.65  30.05  31.40  30.65  32.50  36.45

Summary Functions on Matrices

Now let’s perform some common mathematical operations to learn about our data. When analyzing data we often want to look at partial statistics, such as the maximum value per resin or the average value per operator. One way to do this is to select the data we want as a new temporary variable, and then perform the calculation on this subset:

# first row, all of the columns
resin_1 <- mat[1, ]
# max particle size for resin 1
max(resin_1)
## [1] 36.25

We don’t actually need to store the row in a variable of its own. Instead, we can combine the selection and the function call:

# max particle size for resin 2
max(mat[2, ])
## [1] 35.35

R has functions for other common calculations, e.g. finding the minimum, mean, median, and standard deviation of the data:

# minimum particle size for operator 3
min(mat[, 3])
## [1] 28.65
# mean for operator 3
mean(mat[, 3])
## [1] 31.4375
# median for operator 3
median(mat[, 3])
## [1] 31.275
# standard deviation for operator 3
sd(mat[, 3])
## [1] 2.49453

Challenge - Subsetting data in a matrix

Suppose you want to determine the maximum particle size for resin 5 across operators 2 and 3. To do this you would extract the relevant slice from the matrix and calculate the maximum value. Which of the following lines of R code gives the correct answer?

  1. max(mat[5, ])
  2. max(mat[2:3, 5])
  3. max(mat[5, 2:3])
  4. max(mat[5, 2, 3])

What if we need the maximum particle size for all resins, or the average for each operator? As the diagram below shows, we want to perform the operation across a margin of the matrix:

Operations Across Axes

Operations Across Axes

To support this, we can use the apply function.

apply allows us to repeat a function on all of the rows (MARGIN = 1) or columns (MARGIN = 2) of a matrix. We can think of apply as collapsing the matrix down to just the dimension specified by MARGIN, with rows being dimension 1 and columns dimension 2 (recall that when indexing the matrix we give the row first and the column second).

Thus, to obtain the average particle size of each resin we will need to calculate the mean of all of the rows (MARGIN = 1) of the matrix.

(avg_resin <- apply(mat, 1, mean))
##   Resin1   Resin2   Resin3   Resin4   Resin5   Resin6   Resin7   Resin8 
## 35.65000 34.61667 29.85000 29.46667 30.85000 30.20000 32.73333 35.46667

And to obtain the average particle size for each operator we will need to calculate the mean of all of the columns (MARGIN = 2) of the matrix.

(avg_operator <- apply(mat, 2, mean))
##    Alice      Bob     Carl 
## 32.94375 32.68125 31.43750

Since the second argument to apply is MARGIN, the above command is equivalent to apply(dat, MARGIN = 2, mean). See ?apply or help(apply) for the help documentation.

Tip

Some common operations have more concise alternatives. For example, you can calculate the row-wise or column-wise means with rowMeans and colMeans, respectively.

Challenge - summarizing the matrix

How would you calculate the standard deviation for each resin?

Advanced: How would you calculate the values two standard deviations above and below the mean for each resin?

Arrays are similar to matrices, but they can have more than 2 dimensions.

myarray <- array(1:6, dim=c(3,2,2))
myarray
## , , 1
## 
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

we can access elements in an array in the same way as we did for matrices.

myarray[1,2,2]
## [1] 4
myarray[c(1,3),,]
## , , 1
## 
##      [,1] [,2]
## [1,]    1    4
## [2,]    3    6
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    1    4
## [2,]    3    6

Data Frames

Datasets in R are almost always stored as data frames. Unlike matrices, the columns of data frames do not have to be of the same atomic type (but they do within a column). Let’s use a data frame to store a data set from a clinical trial studying diabetes.

diabetes <- read.csv("H:/BiostatCourses/PublicHealthComputing/Lectures/Week1Intro/R/data/diabetes.csv")

class(diabetes)
## [1] "data.frame"
head(diabetes)
##   subject glyhb   location age gender height weight  frame
## 1   S1002  4.64 Buckingham  58 female     61    256  large
## 2   S1003  4.63 Buckingham  67   male     67    119  large
## 3   S1005  7.72 Buckingham  64   male     68    183 medium
## 4   S1008  4.81 Buckingham  34   male     71    190  large
## 5   S1011  4.84 Buckingham  30   male     69    191 medium
## 6   S1015  3.94 Buckingham  37   male     59    170 medium
colnames(diabetes)
## [1] "subject"  "glyhb"    "location" "age"      "gender"   "height"  
## [7] "weight"   "frame"
ncol(diabetes)
## [1] 8
nrow(diabetes)
## [1] 354

Data frames behave simlarly to matrices. we can access elements using [], but we can also access the columns using the $ seen with lists.

diabetes[1,] #first subject's info
##   subject glyhb   location age gender height weight frame
## 1   S1002  4.64 Buckingham  58 female     61    256 large
diabetes$age[1:10] #get the first 10 subjects' age
##  [1] 58 67 64 34 30 37 45 55 60 38

A method of indexing that we haven’t discussed yet is logical indexing. Instead of specifying the row number or numbers that we want, we can give a logical vector which is TRUE for the rows we want and FALSE otherwise. This can also be used with vectors and matrices.

Suppose we want to look at all the subjects 80 years of age or over. We first make a logical vector:

is_over_80 <- diabetes$age >= 80

head(is_over_80)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
sum(is_over_80)
## [1] 9

>= is a comparison operator meaning greater than or equal to. We can then grab just these rows of the data frame where is_over_80 is TRUE.

diabetes[is_over_80,]
##     subject glyhb   location age gender height weight  frame
## 45    S2770  4.98 Buckingham  92 female     62    217  large
## 56    S2794  8.40 Buckingham  91 female     61    127   <NA>
## 90    S4803  5.71     Louisa  83 female     59    125 medium
## 130  S13500  5.60     Louisa  82   male     66    163   <NA>
## 139  S15013  4.57     Louisa  81 female     64    158 medium
## 193  S15815  4.92 Buckingham  82 female     63    170 medium
## 321  S40784 10.07     Louisa  84 female     60    192  small
## 323  S40786  6.48     Louisa  80   male     71    212 medium
## 324  S40789 11.18     Louisa  80 female     62    162  small

We might also want to know which rows our logical vector is TRUE for. This is achieved with the which function. The result of this can also be used to index the data frame.

which_over_80 <- which(is_over_80)
which_over_80
## [1]  45  56  90 130 139 193 321 323 324
diabetes[which_over_80,]
##     subject glyhb   location age gender height weight  frame
## 45    S2770  4.98 Buckingham  92 female     62    217  large
## 56    S2794  8.40 Buckingham  91 female     61    127   <NA>
## 90    S4803  5.71     Louisa  83 female     59    125 medium
## 130  S13500  5.60     Louisa  82   male     66    163   <NA>
## 139  S15013  4.57     Louisa  81 female     64    158 medium
## 193  S15815  4.92 Buckingham  82 female     63    170 medium
## 321  S40784 10.07     Louisa  84 female     60    192  small
## 323  S40786  6.48     Louisa  80   male     71    212 medium
## 324  S40789 11.18     Louisa  80 female     62    162  small

Comparison operators available are:

  • x == y – “equal to”
  • x != y – “not equal to”
  • x < y – “less than”
  • x > y – “greater than”
  • x <= y – “less than or equal to”
  • x >= y – “greater than or equal to”

More complicated conditions can be constructed using logical operators:

  • a & b – “and”, true only if both a and b are true.
  • a | b – “or”, true if either a or b or both are true.
  • ! a – “not” , true if a is false, and false if a is true.
is_over_80_and_female <- is_over_80 & diabetes$gender == "female"

is_not_from_buckingham <- !(diabetes$location == "Buckingham")
# or
is_not_from_buckingham <- diabetes$location != "Buckingham"

The data we are working with is derived from a dataset called diabetes in the faraway package. The rows are people interviewed as part of a study of diabetes prevalence. The column glyhb is a measurement of percent glycated haemoglobin, which gives information about long term glucose levels in blood. Values of 7% or greater are usually taken as a positive diagnosis of diabetes. Let’s add this as a column.

diabetes$diabetic <- diabetes$glyhb >= 7.0

head(diabetes)
##   subject glyhb   location age gender height weight  frame diabetic
## 1   S1002  4.64 Buckingham  58 female     61    256  large    FALSE
## 2   S1003  4.63 Buckingham  67   male     67    119  large    FALSE
## 3   S1005  7.72 Buckingham  64   male     68    183 medium     TRUE
## 4   S1008  4.81 Buckingham  34   male     71    190  large    FALSE
## 5   S1011  4.84 Buckingham  30   male     69    191 medium    FALSE
## 6   S1015  3.94 Buckingham  37   male     59    170 medium    FALSE

Above where we retrieved people 80 or over we could just as well have written:

weedle<- diabetes $age>=   80
diabetes[    weedle ,]

R does not understand or care about the names we give to variables, and it doesn’t care about spaces between things.

We could also have written it as a single line:

diabetes[diabetes$age >= 80,]

We can almost always unpack complex expressions into a series of simpler variable assignments. The naming of variables and how far to unpack complex expressions is a matter of good taste. Will you understand it when you come back to it in a year? Will someone else understand your code?

Challenge

Which female subjects from Buckingham are under the age of 25?

What is their average glyhb?

Are any of them diabetic?

Test your understanding by writing your solutions several different ways.

Missing data

summary gives an overview of a data frame.

summary(diabetes)
##     subject        glyhb              location        age       
##  S10000 :  1   Min.   : 2.680   Buckingham:178   Min.   :19.00  
##  S10001 :  1   1st Qu.: 4.385   Louisa    :176   1st Qu.:35.00  
##  S10016 :  1   Median : 4.840                    Median :45.00  
##  S1002  :  1   Mean   : 5.580                    Mean   :46.91  
##  S10020 :  1   3rd Qu.: 5.565                    3rd Qu.:60.00  
##  S1003  :  1   Max.   :16.110                    Max.   :92.00  
##  (Other):348   NA's   :11                                       
##     gender        height          weight         frame      diabetic      
##  female:206   Min.   :52.00   Min.   : 99.0   large : 91   Mode :logical  
##  male  :148   1st Qu.:63.00   1st Qu.:150.0   medium:155   FALSE:291      
##               Median :66.00   Median :171.0   small : 96   TRUE :52       
##               Mean   :65.93   Mean   :176.2   NA's  : 12   NA's :11       
##               3rd Qu.:69.00   3rd Qu.:198.0                               
##               Max.   :76.00   Max.   :325.0                               
##               NA's   :5       NA's   :1

We see that some columns contain NAs. NA is R’s way of indicating missing data. Missing data is important in statistics, so R is careful with its treatment of this. If we try to calculate with an NA the result will be NA.

1 + NA
## [1] NA
mean(diabetes$glyhb)
## [1] NA

Many summary functions, such as mean, have a flag to say ignore NA values.

mean(diabetes$glyhb, na.rm=TRUE)
## [1] 5.580292

There is also an is.na function, allowing us to find which values are NA, and na.omit which removes NAs.

not_missing <- !is.na(diabetes$glyhb)
mean( diabetes$glyhb[not_missing] )
## [1] 5.580292
mean( na.omit(diabetes$glyhb) )
## [1] 5.580292

na.omit can also be used on a whole data frame, and removes rows with NA in any column.

Factors

When R loads a CSV file, it tries to give appropriate types to the columns. Let’s examine what types R has given our data.

str(diabetes)
## 'data.frame':    354 obs. of  9 variables:
##  $ subject : Factor w/ 354 levels "S10000","S10001",..: 4 6 7 8 9 10 11 12 13 14 ...
##  $ glyhb   : num  4.64 4.63 7.72 4.81 4.84 ...
##  $ location: Factor w/ 2 levels "Buckingham","Louisa": 1 1 1 1 1 1 1 1 2 2 ...
##  $ age     : int  58 67 64 34 30 37 45 55 60 38 ...
##  $ gender  : Factor w/ 2 levels "female","male": 1 2 2 2 2 2 2 1 1 1 ...
##  $ height  : int  61 67 68 71 69 59 69 63 65 58 ...
##  $ weight  : int  256 119 183 190 191 170 166 202 156 195 ...
##  $ frame   : Factor w/ 3 levels "large","medium",..: 1 1 2 1 2 2 1 3 2 2 ...
##  $ diabetic: logi  FALSE FALSE TRUE FALSE FALSE FALSE ...

We might have expected the text columns to be the “character” data type, but they are instead “factor”s.

head( diabetes$frame )
## [1] large  large  medium large  medium medium
## Levels: large medium small

R uses the factor data type to store a vector of categorical data. The different possible categories are called “levels”.

Factors can be created from character vectors with factor. We sometimes care what order the levels are in, since this can affect how data is plotted or tabulated by various functions. If there is some sort of baseline level, such as “wildtype strain” or “no treatment”, it is usually given first. factor has an argument levels= to specify the desired order of levels.

Factors can be converted back to a character vector with as.character.

When R loaded our data, it chose levels in alphabetical order. Let’s adjust that for the column diabetes$frame.

diabetes$frame <- factor(diabetes$frame, levels=c("small","medium","large"))

head( diabetes$frame )
## [1] large  large  medium large  medium medium
## Levels: small medium large