Much of these notes are from https://monashbioinformaticsplatform.github.io/r-intro/.
R is both a programming language and an interactive environment for statistics. Today we will be concentrating on R as an interactive environment.
Working with R is primarily text-based. The basic mode of use for R is that the user types in a command in the R language and presses enter, and then R computes and displays the result.
We will be working in RStudio. This surrounds the console, where one enters commands and views the results, with various conveniences. In addition to the console, RStudio provides panels containing:
Open RStudio, click on the “Console” pane, type 1+1
and press enter. R displays the result of the calculation. In this document, we will be showing such an interaction with R as below.
1+1
## [1] 2
+
is called an operator. R has the operators you would expect for for basic mathematics: +
-
*
/
^
. It also has operators that do more obscure things.
*
has higher precedence than +
. We can use brackets if necessary ( )
. Try 1+2*3
and (1+2)*3
.
Spaces can be used to make code easier to read.
We can compare with == < > <= >=
. This produces a “logical” value, TRUE
or FALSE
. Note the double equals, ==
, for equality comparison.
2 * 2 == 4
## [1] TRUE
There are also character strings such as "string"
.
A variable is a name for a value, such as x
, current_temperature
, or subject.id
. We can create a new variable by assigning a value to it using <-
.
weight_kg <- 55
RStudio helpfully shows us the variable in the “Environment” pane. We can also print it by typing the name of the variable and hitting enter. In general, R will print to the console any object returned by a function or operation unless we assign it to a variable.
weight_kg
## [1] 55
Examples of valid variables names: hello
, hello_there
, hello.there
, value1
. Spaces aren’t ok inside variable names. Dots (.
) are ok, unlike in many other languages.
We can do arithmetic with the variable:
# weight in pounds:
2.2 * weight_kg
## [1] 121
Tip
We can add comments to our code using the
#
character. It is useful to document our code in this way so that others (and us the next time we read it) have an easier time following what the code is doing.
We can also change a variable’s value by assigning it a new value:
weight_kg <- 57.5
# weight in kilograms is now
weight_kg
## [1] 57.5
Assigning a new value to one variable does not change the values of other variables. For example, let’s store the subject’s weight in pounds in a variable:
weight_lb <- 2.2 * weight_kg
# weight in kg...
weight_kg
## [1] 57.5
# ...and in pounds
weight_lb
## [1] 126.5
and then change weight_kg
:
weight_kg <- 100.0
# weight in kg now...
weight_kg
## [1] 100
# ...and weight in pounds still
weight_lb
## [1] 126.5
Since weight_lb
doesn’t “remember” where its value came from, it isn’t automatically updated when weight_kg
changes.
This is different from how spreadsheets work.
R has four base objects * Vectors * Arrays (matrix is a special case) * Lists * Data frames
The most basic element in R is a vector.
A vector of numbers is a collection of numbers. We call the individual numbers elements of the vector.
We can make vectors with c( )
, for example c(1,2,3)
. c
means “combine”. R is obsesssed with vectors. In R, numbers are just vectors of length one. Many things that can be done with a single number can also be done with a vector. For example arithmetic can be done on vectors just like on single numbers.
myvec <- c(10,20,30,40,50)
myvec + 1
## [1] 11 21 31 41 51
myvec + myvec
## [1] 20 40 60 80 100
length(myvec)
## [1] 5
c(60, myvec)
## [1] 60 10 20 30 40 50
c(myvec, myvec)
## [1] 10 20 30 40 50 10 20 30 40 50
When we talk about the length of a vector, we are talking about the number of numbers in the vector.
All elements of a vector must be of the same type.
R has several atomic data types. The four that we will encounter are:
42
1e3
4.67
"Hello"
"a b c"
TRUE
FALSE
T
F
What happens when we try to inlcude more than one data type in a vector?
c(1,"a",5,6)
## [1] "1" "a" "5" "6"
Notice that now all the output is in quotes. Since all of the elements must be of a single data type, one of the data types will win and the others will be coerced to that data type. In this case the character type wins.
Access elements of a vector with [ ]
, for example myvec[1]
to get the first element. You can also assign to a specific element of a vector.
myvec[1]
## [1] 10
myvec[2]
## [1] 20
myvec[2] <- 5
myvec
## [1] 10 5 30 40 50
Can we use a vector to index another vector? Yes!
myind <- c(4,3,2)
myvec[myind]
## [1] 40 30 5
We could equivalently have written
myvec[c(4,3,2)]
## [1] 40 30 5
Sometimes we want a contiguous slice from a vector.
myvec[3:5]
## [1] 30 40 50
:
here actually creates a vector, which is then used to index myvec
. :
is pretty useful on its own too.
3:5
## [1] 3 4 5
1:50
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
## [47] 47 48 49 50
numbers <- 1:10
numbers*numbers
## [1] 1 4 9 16 25 36 49 64 81 100
Now we can see why R always puts a [1]
in its output: it is indicating that the first element of the vector can be accessed with [1]
. Subsequent lines show the appropriate index to access the first number in the line.
Some other ways to generate vectors besides :
are with seq()
and rep()
.
seq(3)
## [1] 1 2 3
seq(1,10)
## [1] 1 2 3 4 5 6 7 8 9 10
seq(1,10, by = 2)
## [1] 1 3 5 7 9
seq(1,10, length.out = 4)
## [1] 1 4 7 10
rep(3:5, times = 4)
## [1] 3 4 5 3 4 5 3 4 5 3 4 5
rep(3:5, each = 4)
## [1] 3 3 3 3 4 4 4 4 5 5 5 5
Challenge - Indexing and slicing data
We can take slices of character vectors as well:
phrase <- c("I", "don't", "know", "I", "know") # first three words phrase[1:3]
## [1] "I" "don't" "know"
# last three words phrase[3:5]
## [1] "know" "I" "know"
If the first four words are selected using the slice
phrase[1:4]
, how can we obtain the first four words in reverse order?What is
phrase[-2]
? What isphrase[-5]
? Given those answers, explain whatphrase[-1:-3]
does.Use indexing of
phrase
to create a new character vector that forms the phrase “I know I don’t”, i.e.c("I", "know", "I", "don't")
.
We can also access elements of a vector by passing a vector of logical values.
myvec >= 25 #TRUE if that element is greater than or equal to 25
## [1] FALSE FALSE TRUE TRUE TRUE
myvec[myvec >= 25]
## [1] 30 40 50
Vectors contain all the same kind of thing. Lists can contain different kinds of thing. Lists can even contain vectors or other lists as elements.
We generally give the things in a list names. Try list(num=42, greeting="hello")
. To access named elements we use $
.
mylist <- list(num=42, greeting="Hello, world")
mylist$greeting
## [1] "Hello, world"
Functions that need to return multiple outputs often do so as a list.
We can access elements of a list in several ways. We can index them like vectors, but instead we use [[]]
.
mylist[[1]]
## [1] 42
We can use the names of the elements to access them as well. We can either pass the name inside [[]]
or use $
.
mylist[["greeting"]]
## [1] "Hello, world"
mylist$num
## [1] 42
Arrays are multidimensional vectors. A special case is a two dimensional array known as a matrix. We will start by looking at matrices in R. The following matrix contains quality measurements (particle size) on PVC plastic production, using eight different resin batches, and three different machine operators.
mat <- as.matrix(read.csv("H:/BiostatCourses/PublicHealthComputing/Lectures/Week1Intro/R/data/pvc.csv", row.names=1))
mat
## Alice Bob Carl
## Resin1 36.25 35.40 35.30
## Resin2 35.15 35.35 33.35
## Resin3 30.70 29.65 29.20
## Resin4 29.70 30.05 28.65
## Resin5 31.85 31.40 29.30
## Resin6 30.20 30.65 29.75
## Resin7 32.90 32.50 32.80
## Resin8 36.80 36.45 33.15
We have read in the data from a csv file using the read.csv function. We will discuss reading data into R from files more next week.
To get more information about the matrix can use the class
and str
(structre) functions.
class(mat)
## [1] "matrix"
str(mat)
## num [1:8, 1:3] 36.2 35.1 30.7 29.7 31.9 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:8] "Resin1" "Resin2" "Resin3" "Resin4" ...
## ..$ : chr [1:3] "Alice" "Bob" "Carl"
We can check the size of the matrix with the functions nrow
and ncol
:
nrow(mat)
## [1] 8
ncol(mat)
## [1] 3
This tells us that our matrix, mat
, has 8 rows and 3 columns.
If we want to get a single value from the matrix, we can provide a row and column index in square brackets:
# first value in mat
mat[1, 1]
## [1] 36.25
# a middle value in mat
mat[4, 2]
## [1] 30.05
If our matrix has row names and column names, we can also refer to rows and columns by name.
mat["Resin4","Bob"]
## [1] 30.05
An index like [4, 2]
selects a single element of a matrix, but we can select whole sections as well. For example, we can select the first two operators (columns) of values for the first four resins (rows) like this:
mat[1:4, 1:2]
## Alice Bob
## Resin1 36.25 35.40
## Resin2 35.15 35.35
## Resin3 30.70 29.65
## Resin4 29.70 30.05
The slice 1:4
means the numbers from 1 to 4. It’s the same as c(1,2,3,4)
.
The slice does not need to start at 1, e.g. the line below selects rows 5 through 8:
mat[5:8, 1:2]
## Alice Bob
## Resin5 31.85 31.40
## Resin6 30.20 30.65
## Resin7 32.90 32.50
## Resin8 36.80 36.45
We can use vectors created with c
to select non-contiguous values:
mat[c(1,3,5), c(1,3)]
## Alice Carl
## Resin1 36.25 35.3
## Resin3 30.70 29.2
## Resin5 31.85 29.3
We also don’t have to provide an index for either the rows or the columns. If we don’t include an index for the rows, R returns all the rows; if we don’t include an index for the columns, R returns all the columns. If we don’t provide an index for either rows or columns, e.g. mat[, ]
, R returns the full matrix.
# All columns from row 5
mat[5, ]
## Alice Bob Carl
## 31.85 31.40 29.30
# All rows from column 2
mat[, 2]
## Resin1 Resin2 Resin3 Resin4 Resin5 Resin6 Resin7 Resin8
## 35.40 35.35 29.65 30.05 31.40 30.65 32.50 36.45
Now let’s perform some common mathematical operations to learn about our data. When analyzing data we often want to look at partial statistics, such as the maximum value per resin or the average value per operator. One way to do this is to select the data we want as a new temporary variable, and then perform the calculation on this subset:
# first row, all of the columns
resin_1 <- mat[1, ]
# max particle size for resin 1
max(resin_1)
## [1] 36.25
We don’t actually need to store the row in a variable of its own. Instead, we can combine the selection and the function call:
# max particle size for resin 2
max(mat[2, ])
## [1] 35.35
R has functions for other common calculations, e.g. finding the minimum, mean, median, and standard deviation of the data:
# minimum particle size for operator 3
min(mat[, 3])
## [1] 28.65
# mean for operator 3
mean(mat[, 3])
## [1] 31.4375
# median for operator 3
median(mat[, 3])
## [1] 31.275
# standard deviation for operator 3
sd(mat[, 3])
## [1] 2.49453
Challenge - Subsetting data in a matrix
Suppose you want to determine the maximum particle size for resin 5 across operators 2 and 3. To do this you would extract the relevant slice from the matrix and calculate the maximum value. Which of the following lines of R code gives the correct answer?
max(mat[5, ])
max(mat[2:3, 5])
max(mat[5, 2:3])
max(mat[5, 2, 3])
What if we need the maximum particle size for all resins, or the average for each operator? As the diagram below shows, we want to perform the operation across a margin of the matrix:
Operations Across Axes
To support this, we can use the apply
function.
apply
allows us to repeat a function on all of the rows (MARGIN = 1
) or columns (MARGIN = 2
) of a matrix. We can think of apply
as collapsing the matrix down to just the dimension specified by MARGIN
, with rows being dimension 1 and columns dimension 2 (recall that when indexing the matrix we give the row first and the column second).
Thus, to obtain the average particle size of each resin we will need to calculate the mean of all of the rows (MARGIN = 1
) of the matrix.
(avg_resin <- apply(mat, 1, mean))
## Resin1 Resin2 Resin3 Resin4 Resin5 Resin6 Resin7 Resin8
## 35.65000 34.61667 29.85000 29.46667 30.85000 30.20000 32.73333 35.46667
And to obtain the average particle size for each operator we will need to calculate the mean of all of the columns (MARGIN = 2
) of the matrix.
(avg_operator <- apply(mat, 2, mean))
## Alice Bob Carl
## 32.94375 32.68125 31.43750
Since the second argument to apply
is MARGIN
, the above command is equivalent to apply(dat, MARGIN = 2, mean)
. See ?apply
or help(apply)
for the help documentation.
Tip
Some common operations have more concise alternatives. For example, you can calculate the row-wise or column-wise means with
rowMeans
andcolMeans
, respectively.
Challenge - summarizing the matrix
How would you calculate the standard deviation for each resin?
Advanced: How would you calculate the values two standard deviations above and below the mean for each resin?
Arrays are similar to matrices, but they can have more than 2 dimensions.
myarray <- array(1:6, dim=c(3,2,2))
myarray
## , , 1
##
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
##
## , , 2
##
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
we can access elements in an array in the same way as we did for matrices.
myarray[1,2,2]
## [1] 4
myarray[c(1,3),,]
## , , 1
##
## [,1] [,2]
## [1,] 1 4
## [2,] 3 6
##
## , , 2
##
## [,1] [,2]
## [1,] 1 4
## [2,] 3 6
Datasets in R are almost always stored as data frames. Unlike matrices, the columns of data frames do not have to be of the same atomic type (but they do within a column). Let’s use a data frame to store a data set from a clinical trial studying diabetes.
diabetes <- read.csv("H:/BiostatCourses/PublicHealthComputing/Lectures/Week1Intro/R/data/diabetes.csv")
class(diabetes)
## [1] "data.frame"
head(diabetes)
## subject glyhb location age gender height weight frame
## 1 S1002 4.64 Buckingham 58 female 61 256 large
## 2 S1003 4.63 Buckingham 67 male 67 119 large
## 3 S1005 7.72 Buckingham 64 male 68 183 medium
## 4 S1008 4.81 Buckingham 34 male 71 190 large
## 5 S1011 4.84 Buckingham 30 male 69 191 medium
## 6 S1015 3.94 Buckingham 37 male 59 170 medium
colnames(diabetes)
## [1] "subject" "glyhb" "location" "age" "gender" "height"
## [7] "weight" "frame"
ncol(diabetes)
## [1] 8
nrow(diabetes)
## [1] 354
Data frames behave simlarly to matrices. we can access elements using []
, but we can also access the columns using the $
seen with lists.
diabetes[1,] #first subject's info
## subject glyhb location age gender height weight frame
## 1 S1002 4.64 Buckingham 58 female 61 256 large
diabetes$age[1:10] #get the first 10 subjects' age
## [1] 58 67 64 34 30 37 45 55 60 38
A method of indexing that we haven’t discussed yet is logical indexing. Instead of specifying the row number or numbers that we want, we can give a logical vector which is TRUE
for the rows we want and FALSE
otherwise. This can also be used with vectors and matrices.
Suppose we want to look at all the subjects 80 years of age or over. We first make a logical vector:
is_over_80 <- diabetes$age >= 80
head(is_over_80)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
sum(is_over_80)
## [1] 9
>=
is a comparison operator meaning greater than or equal to. We can then grab just these rows of the data frame where is_over_80
is TRUE
.
diabetes[is_over_80,]
## subject glyhb location age gender height weight frame
## 45 S2770 4.98 Buckingham 92 female 62 217 large
## 56 S2794 8.40 Buckingham 91 female 61 127 <NA>
## 90 S4803 5.71 Louisa 83 female 59 125 medium
## 130 S13500 5.60 Louisa 82 male 66 163 <NA>
## 139 S15013 4.57 Louisa 81 female 64 158 medium
## 193 S15815 4.92 Buckingham 82 female 63 170 medium
## 321 S40784 10.07 Louisa 84 female 60 192 small
## 323 S40786 6.48 Louisa 80 male 71 212 medium
## 324 S40789 11.18 Louisa 80 female 62 162 small
We might also want to know which rows our logical vector is TRUE
for. This is achieved with the which
function. The result of this can also be used to index the data frame.
which_over_80 <- which(is_over_80)
which_over_80
## [1] 45 56 90 130 139 193 321 323 324
diabetes[which_over_80,]
## subject glyhb location age gender height weight frame
## 45 S2770 4.98 Buckingham 92 female 62 217 large
## 56 S2794 8.40 Buckingham 91 female 61 127 <NA>
## 90 S4803 5.71 Louisa 83 female 59 125 medium
## 130 S13500 5.60 Louisa 82 male 66 163 <NA>
## 139 S15013 4.57 Louisa 81 female 64 158 medium
## 193 S15815 4.92 Buckingham 82 female 63 170 medium
## 321 S40784 10.07 Louisa 84 female 60 192 small
## 323 S40786 6.48 Louisa 80 male 71 212 medium
## 324 S40789 11.18 Louisa 80 female 62 162 small
Comparison operators available are:
x == y
– “equal to”x != y
– “not equal to”x < y
– “less than”x > y
– “greater than”x <= y
– “less than or equal to”x >= y
– “greater than or equal to”More complicated conditions can be constructed using logical operators:
a & b
– “and”, true only if both a
and b
are true.a | b
– “or”, true if either a
or b
or both are true.! a
– “not” , true if a
is false, and false if a
is true.is_over_80_and_female <- is_over_80 & diabetes$gender == "female"
is_not_from_buckingham <- !(diabetes$location == "Buckingham")
# or
is_not_from_buckingham <- diabetes$location != "Buckingham"
The data we are working with is derived from a dataset called diabetes
in the faraway
package. The rows are people interviewed as part of a study of diabetes prevalence. The column glyhb
is a measurement of percent glycated haemoglobin, which gives information about long term glucose levels in blood. Values of 7% or greater are usually taken as a positive diagnosis of diabetes. Let’s add this as a column.
diabetes$diabetic <- diabetes$glyhb >= 7.0
head(diabetes)
## subject glyhb location age gender height weight frame diabetic
## 1 S1002 4.64 Buckingham 58 female 61 256 large FALSE
## 2 S1003 4.63 Buckingham 67 male 67 119 large FALSE
## 3 S1005 7.72 Buckingham 64 male 68 183 medium TRUE
## 4 S1008 4.81 Buckingham 34 male 71 190 large FALSE
## 5 S1011 4.84 Buckingham 30 male 69 191 medium FALSE
## 6 S1015 3.94 Buckingham 37 male 59 170 medium FALSE
Above where we retrieved people 80 or over we could just as well have written:
weedle<- diabetes $age>= 80
diabetes[ weedle ,]
R does not understand or care about the names we give to variables, and it doesn’t care about spaces between things.
We could also have written it as a single line:
diabetes[diabetes$age >= 80,]
We can almost always unpack complex expressions into a series of simpler variable assignments. The naming of variables and how far to unpack complex expressions is a matter of good taste. Will you understand it when you come back to it in a year? Will someone else understand your code?
Challenge
Which female subjects from Buckingham are under the age of 25?
What is their average glyhb?
Are any of them diabetic?
Test your understanding by writing your solutions several different ways.
summary
gives an overview of a data frame.
summary(diabetes)
## subject glyhb location age
## S10000 : 1 Min. : 2.680 Buckingham:178 Min. :19.00
## S10001 : 1 1st Qu.: 4.385 Louisa :176 1st Qu.:35.00
## S10016 : 1 Median : 4.840 Median :45.00
## S1002 : 1 Mean : 5.580 Mean :46.91
## S10020 : 1 3rd Qu.: 5.565 3rd Qu.:60.00
## S1003 : 1 Max. :16.110 Max. :92.00
## (Other):348 NA's :11
## gender height weight frame diabetic
## female:206 Min. :52.00 Min. : 99.0 large : 91 Mode :logical
## male :148 1st Qu.:63.00 1st Qu.:150.0 medium:155 FALSE:291
## Median :66.00 Median :171.0 small : 96 TRUE :52
## Mean :65.93 Mean :176.2 NA's : 12 NA's :11
## 3rd Qu.:69.00 3rd Qu.:198.0
## Max. :76.00 Max. :325.0
## NA's :5 NA's :1
We see that some columns contain NA
s. NA
is R’s way of indicating missing data. Missing data is important in statistics, so R is careful with its treatment of this. If we try to calculate with an NA
the result will be NA
.
1 + NA
## [1] NA
mean(diabetes$glyhb)
## [1] NA
Many summary functions, such as mean
, have a flag to say ignore NA
values.
mean(diabetes$glyhb, na.rm=TRUE)
## [1] 5.580292
There is also an is.na
function, allowing us to find which values are NA, and na.omit
which removes NAs.
not_missing <- !is.na(diabetes$glyhb)
mean( diabetes$glyhb[not_missing] )
## [1] 5.580292
mean( na.omit(diabetes$glyhb) )
## [1] 5.580292
na.omit
can also be used on a whole data frame, and removes rows with NA in any column.
When R loads a CSV file, it tries to give appropriate types to the columns. Let’s examine what types R has given our data.
str(diabetes)
## 'data.frame': 354 obs. of 9 variables:
## $ subject : Factor w/ 354 levels "S10000","S10001",..: 4 6 7 8 9 10 11 12 13 14 ...
## $ glyhb : num 4.64 4.63 7.72 4.81 4.84 ...
## $ location: Factor w/ 2 levels "Buckingham","Louisa": 1 1 1 1 1 1 1 1 2 2 ...
## $ age : int 58 67 64 34 30 37 45 55 60 38 ...
## $ gender : Factor w/ 2 levels "female","male": 1 2 2 2 2 2 2 1 1 1 ...
## $ height : int 61 67 68 71 69 59 69 63 65 58 ...
## $ weight : int 256 119 183 190 191 170 166 202 156 195 ...
## $ frame : Factor w/ 3 levels "large","medium",..: 1 1 2 1 2 2 1 3 2 2 ...
## $ diabetic: logi FALSE FALSE TRUE FALSE FALSE FALSE ...
We might have expected the text columns to be the “character” data type, but they are instead “factor”s.
head( diabetes$frame )
## [1] large large medium large medium medium
## Levels: large medium small
R uses the factor data type to store a vector of categorical data. The different possible categories are called “levels”.
Factors can be created from character vectors with factor
. We sometimes care what order the levels are in, since this can affect how data is plotted or tabulated by various functions. If there is some sort of baseline level, such as “wildtype strain” or “no treatment”, it is usually given first. factor
has an argument levels=
to specify the desired order of levels.
Factors can be converted back to a character vector with as.character
.
When R loaded our data, it chose levels in alphabetical order. Let’s adjust that for the column diabetes$frame
.
diabetes$frame <- factor(diabetes$frame, levels=c("small","medium","large"))
head( diabetes$frame )
## [1] large large medium large medium medium
## Levels: small medium large