Chapter 5 Subsetting Data in R
In this module, we will show you how to
- Select specific elements of an object by an index or logical expression.
- Rename columns of a
data.frame
. - Subset rows of a
data.frame
. - Subset columns of a
data.frame
. - Add/remove new columns to a
data/frame
. - Order the columns of a
data.frame
. - Order the rows of a
data.frame
.
We will show you how to do this the base R way, and then show you how to perform each operation using the dplyr
package (if applicable). The dplyr
package provides functions to make these operations more intuitive and to make your code more readable. There are many resources available on dplyr (and the larger tidyverse). See for example the dplyer vingettes for examples and comments.
The dplyr
packages interfaces well with tibbles. Let’s load the tidyverse
packages.
Note that when loading dplyr
, it says objects can be “masked” or there are conflicts. That means that if you use a function defined in two different packages, it uses the one that is loaded last.
For example, if we print filter
, then we see at the bottom namespace:dplyr
, which means that when you type filter
, it will use the one from the dplyr
package.
function (.data, ..., .by = NULL, .preserve = FALSE)
{
check_by_typo(...)
by <- enquo(.by)
if (!quo_is_null(by) && !is_false(.preserve)) {
abort("Can't supply both `.by` and `.preserve`.")
}
UseMethod("filter")
}
<bytecode: 0x0000018e37927c48>
<environment: namespace:dplyr>
The filter
function from dplyr
masks or conflicts with the filter
function from the base R stats
package. However, if you want to make sure you use a function from a specific package, you can use PackageName::Function
with the colon (“::”) operator.
1 function (x, filter, method = c("convolution", "recursive"),
2 sides = 2L, circular = FALSE, init = NULL)
This is important when loading many packages, and you may have some conflicts/masking. Make sure you are using the correct function in your code!
5.1 Selecting Specific Elements
Let’s begin by looking at how to select specific elements of a vector using base R.
5.1.1 Select Specific Elements Using an Index
Often you only want to look at subsets of a dataset at any given time. Elements of an R object are selected using the brackets ([ and ]).
For example, x
is a vector of numbers, and we can select the second element of x
using the brackets and an index of 2:
[1] 4
5.1.2 Select Specific Elements Using an Index
We can select the fifth or second AND fifth elements as shown below:
[1] 10
[1] 2 10
5.1.3 Subsetting by Deletion of Entries
You can put a minus (-) before integers inside brackets to remove these indices from the data.
[1] 1 4 8 10
Note that you have to be careful with this syntax when dropping more than 1 element
[1] 8 10
[1] 8 10
5.1.4 Specifying Specific Elements Using Logical Operators
What about selecting rows based on the values of two variables? We use logical statements.To perform logical comparisons, we can use
==
- check to see if two items are equal (note that==
is a comparison while=
is an assignment operator)!=
- check to see if two items are not equal>
,>=
- greater than (or equal to)<
,<=
- less than (or equal to)
Here we select only the elements of x
greater than 2:
[1] 1 2 4 8 10
[1] FALSE FALSE TRUE TRUE TRUE
[1] 4 8 10
You can have multiple logical conditions using the following:
- & : AND
- | : OR
[1] 1 2 4 8 10
[1] 4
[1] 2 8 10
5.2 Renaming Columns
Let’s load an example dataset to work with. We will use the mtcars
dataset, which is built into base R.
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
If we would like to create a tibble
(“fancy” data.frame) version of mtcars, we can do it using as.tbl
or as_tibble
.
# A tibble: 6 × 12
car mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 W… 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4 Dr… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 Hornet Spor… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
In the “tidy” data format, all information of interest is a variable (not a name), so as of tibble 2.0, rownames are removed. Above we have added the rownames as new column variable called “car”, since the rownames are deleted when converting to a tibble.
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
# A tibble: 2 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
5.2.1 Renaming Columns: Base R
We can use the colnames
function to directly reassign column names of a df:
[1] "car" "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am"
[11] "gear" "carb"
# A tibble: 6 × 12
car MPG CYL DISP hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 W… 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4 Dr… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 Hornet Spor… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
We can assign the column names, change the ones we want, and then re-assign the column names
# A tibble: 6 × 12
car mpg cyl disp hp DRAT wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 W… 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4 Dr… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 Hornet Spor… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
5.2.2 Renaming Columns: dplyr
To rename columns with dplyr
, you use the rename
command
# A tibble: 6 × 12
car MPG cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 W… 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4 Dr… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 Hornet Spor… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
To rename all columns you use the rename_all
command (with a function). For example, to change all column names to uppercase we use the function toupper
.
# A tibble: 6 × 12
CAR MPG CYL DISP HP DRAT WT QSEC VS AM GEAR CARB
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 W… 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4 Dr… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 Hornet Spor… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
5.3 Subsetting Columns
5.3.1 Subsetting Columns: Base R
We can grab the carb
column using the $
operator.
[1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
We can also subset a data.frame
using the brackets [, ] subsetting.
For data.frame
s and matrices (2-dimensional objects), the brackets are [rows
, columns
] subsetting. We can grab the x
column by using the index of the column or the column name “carb”.
# A tibble: 32 × 1
gear
<dbl>
1 4
2 4
3 4
4 3
5 3
6 3
7 3
8 4
9 4
10 4
# ℹ 22 more rows
# A tibble: 32 × 1
carb
<dbl>
1 4
2 4
3 1
4 1
5 2
6 1
7 4
8 2
9 2
10 4
# ℹ 22 more rows
Mostly, tbl
(tibbles) are the same as data.frame
s, except they don’t print all lines. When subsetting only one column using brackets, a data.frame
will return a vector, but a tbl
will return a tbl
.
[1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
[4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
[7] "Duster 360" "Merc 240D" "Merc 230"
[10] "Merc 280" "Merc 280C" "Merc 450SE"
[13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
[16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
[19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
[22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
[25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
[28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
[31] "Maserati Bora" "Volvo 142E"
# A tibble: 32 × 1
car
<chr>
1 Mazda RX4
2 Mazda RX4 Wag
3 Datsun 710
4 Hornet 4 Drive
5 Hornet Sportabout
6 Valiant
7 Duster 360
8 Merc 240D
9 Merc 230
10 Merc 280
# ℹ 22 more rows
We can select multiple columns using multiple column names:
# A tibble: 32 × 2
mpg cyl
<dbl> <dbl>
1 21 6
2 21 6
3 22.8 4
4 21.4 6
5 18.7 8
6 18.1 6
7 14.3 8
8 24.4 4
9 22.8 4
10 19.2 6
# ℹ 22 more rows
5.3.2 Subsetting Columns: dplyr
The select
command from dplyr
allows you to subset columns
# A tibble: 32 × 1
mpg
<dbl>
1 21
2 21
3 22.8
4 21.4
5 18.7
6 18.1
7 14.3
8 24.4
9 22.8
10 19.2
# ℹ 22 more rows
select
will always return a tibble
. If we would like to get single column returned as a vector instead of a tbl, we can use pull
from dplyr
.
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
The select
command from dplyr
allows you to subset multiple columns
# A tibble: 32 × 2
mpg cyl
<dbl> <dbl>
1 21 6
2 21 6
3 22.8 4
4 21.4 6
5 18.7 8
6 18.1 6
7 14.3 8
8 24.4 4
9 22.8 4
10 19.2 6
# ℹ 22 more rows
There are several “helpers” for the select
function. For example, if I wanted to select all columns whose name starts with “c”, we could use the starts_with
function from the tidyselect
package.
# A tibble: 32 × 3
car cyl carb
<chr> <dbl> <dbl>
1 Mazda RX4 6 4
2 Mazda RX4 Wag 6 4
3 Datsun 710 4 1
4 Hornet 4 Drive 6 1
5 Hornet Sportabout 8 2
6 Valiant 6 1
7 Duster 360 8 4
8 Merc 240D 4 2
9 Merc 230 4 2
10 Merc 280 6 4
# ℹ 22 more rows
Run the command ??tidyselect::select_helpers
to see the R documentation on these helper functions. A few of them are
5.4 Subsetting Rows
5.4.1 Subsetting Rows: Base R
We can subset rows of a data.frame in the same way we subset columns by using the brackets [,] where we subset in the first argument instead of the second. Let’s select rows 1 and 3 from tbl
using brackets
# A tibble: 2 × 12
car mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
5.4.2 Subsetting Rows: dplyr
The command in dplyr
for subsetting rows is filter
. Try ?filter
. Let’s use filter
to select the rows where mpg
is between less than 14 or greater than 20.
# A tibble: 17 × 12
car mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
6 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
7 Cadillac F… 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
8 Lincoln Co… 10.4 8 460 215 3 5.42 17.8 0 0 3 4
9 Fiat 128 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
10 Honda Civic 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
11 Toyota Cor… 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
12 Toyota Cor… 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
13 Camaro Z28 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
14 Fiat X1-9 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
15 Porsche 91… 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
16 Lotus Euro… 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
17 Volvo 142E 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
Note, no $
or subsetting using indexes are necessary. R “knows” that mpg
refers to a column of the data.frame tbl
.
By default, you can separate conditions by commas, and filter
assumes these statements are joined by &
. For example, to find all cars with mpg > 20 AND 4 cylinders:
# A tibble: 11 × 12
car mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
2 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
3 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
4 Fiat 128 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
5 Honda Civic 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
6 Toyota Cor… 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
7 Toyota Cor… 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
8 Fiat X1-9 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
9 Porsche 91… 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
10 Lotus Euro… 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
11 Volvo 142E 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
# A tibble: 11 × 12
car mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
2 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
3 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
4 Fiat 128 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
5 Honda Civic 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
6 Toyota Cor… 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
7 Toyota Cor… 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
8 Fiat X1-9 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
9 Porsche 91… 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
10 Lotus Euro… 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
11 Volvo 142E 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
If you want OR statements, then you need to do the pipe |
explicitly.
# A tibble: 14 × 12
car mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
6 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
7 Fiat 128 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
8 Honda Civic 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
9 Toyota Cor… 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
10 Toyota Cor… 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
11 Fiat X1-9 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
12 Porsche 91… 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
13 Lotus Euro… 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
14 Volvo 142E 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
5.4.3 Subsetting Both Columns and Rows
In base R, we can select both rows and columns by using the brackets [,]. For example, to select the number of cylinders (cyl) and horsepower (hp) columns for cars with mpg > 20 and 4 cylinders
# A tibble: 11 × 2
cyl hp
<dbl> <dbl>
1 4 93
2 4 62
3 4 95
4 4 66
5 4 52
6 4 65
7 4 97
8 4 66
9 4 91
10 4 113
11 4 109
Using dplyr
, we can combine filter
and select
to subset both rows and columns of a data.frame.
# A tibble: 11 × 2
cyl hp
<dbl> <dbl>
1 4 93
2 4 62
3 4 95
4 4 66
5 4 52
6 4 65
7 4 97
8 4 66
9 4 91
10 4 113
11 4 109
Nesting functions like this is very messy and when multiple conditions are being used, this can quickly become difficult to read. One solution is to assign temporary objects and reassign them.
A better solution is to use the pipe %>%
as this makes things such as this much more readable. It reads from left to right. RStudio uses the shortcut CMD/Ctrl + Shift + M for the pipe. A pipe insert whatever is on the left of the pipe as the first argument to whatever is on the right of the pipe. So we can pipe tbl
into filter
which returns a tibble with the desired rows and pipe this into select
which will then select the desired columns.
# A tibble: 11 × 2
cyl hp
<dbl> <dbl>
1 4 93
2 4 62
3 4 95
4 4 66
5 4 52
6 4 65
7 4 97
8 4 66
9 4 91
10 4 113
11 4 109
5.5 Adding/Removing Columns
5.5.1 Adding a Column: Base R
You can add a new column, called newcol
to df
, using the $
operator
# A tibble: 3 × 13
car mpg cyl disp hp drat wt qsec vs am gear carb newcol
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazd… 21 6 160 110 3.9 2.62 16.5 0 1 4 4 1.19
2 Mazd… 21 6 160 110 3.9 2.88 17.0 0 1 4 4 1.31
3 Dats… 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 1.05
5.5.2 Adding a Column: dplyr
This $
method is very common.
The mutate
function in dplyr
allows you to add or replace columns of a data.frame
# A tibble: 3 × 13
car mpg cyl disp hp drat wt qsec vs am gear carb newcol
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazd… 21 6 160 110 3.9 2.62 16.5 0 1 4 4 1.19
2 Mazd… 21 6 160 110 3.9 2.88 17.0 0 1 4 4 1.31
3 Dats… 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 1.05
One frequently-used tool is creating variables with conditions. A general function for creating new variables based on existing variables is the ifelse()
function, which “returns a value with the same shape as test which is filled with elements selected from either yes or no depending on whether the element of test is TRUE or FALSE.”
ifelse(test, yes, no) # test: an object which can be coerced to logical mode. # yes: return values for true elements of test. # no: return values for false elements of test.
In the following example, we combine ifelse
with mutate
to create a new variable disp_cat
that is “Low” if disp <= 200
, “Medium” if 200 < disp <= 400
and “High” otherwise.
tbl = mutate(tbl,
disp_cat = ifelse(
disp <= 200, "Low",
ifelse(disp <= 400,
"Medium",
"High"))
)
head(tbl$disp_cat)
[1] "Low" "Low" "Low" "Medium" "Medium" "Medium"
Alternatively, case_when
provides a clean syntax as well:
tbl = mutate(tbl,
disp_cat2 = case_when(
disp <= 200 ~ "Low",
disp > 200 & disp <= 400 ~ "Medium",
disp > 400 ~ "High",
))
head(tbl$disp_cat2)
[1] "Low" "Low" "Low" "Medium" "Medium" "Medium"
5.5.3 Removing a Column: Base R
You can remove a column by assigning it to be NULL
# A tibble: 3 × 14
car mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 W… 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
# ℹ 2 more variables: disp_cat <chr>, disp_cat2 <chr>
5.5.4 Removing a Column: dplyr
The NULL
method is still very common.
The select
function can remove a column with a minus (-).
# A tibble: 32 × 14
car mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
# ℹ 2 more variables: disp_cat <chr>, disp_cat2 <chr>
To remove both newcol
and drat
tbl = mutate(tbl, newcol = wt / 2.2) # add newcol back
select(tbl, -newcol, -drat) # delete newcol and drat
# A tibble: 32 × 13
car mpg cyl disp hp wt qsec vs am gear carb disp_cat
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Mazda R… 21 6 160 110 2.62 16.5 0 1 4 4 Low
2 Mazda R… 21 6 160 110 2.88 17.0 0 1 4 4 Low
3 Datsun … 22.8 4 108 93 2.32 18.6 1 1 4 1 Low
4 Hornet … 21.4 6 258 110 3.22 19.4 1 0 3 1 Medium
5 Hornet … 18.7 8 360 175 3.44 17.0 0 0 3 2 Medium
6 Valiant 18.1 6 225 105 3.46 20.2 1 0 3 1 Medium
7 Duster … 14.3 8 360 245 3.57 15.8 0 0 3 4 Medium
8 Merc 24… 24.4 4 147. 62 3.19 20 1 0 4 2 Low
9 Merc 230 22.8 4 141. 95 3.15 22.9 1 0 4 2 Low
10 Merc 280 19.2 6 168. 123 3.44 18.3 1 0 4 4 Low
# ℹ 22 more rows
# ℹ 1 more variable: disp_cat2 <chr>
Alternatively, we could have used
5.5.5 Transmutation
The transmute
function in dplyr
combines both the mutate
and select
functions. One can create new columns and keep only the columns wanted.
# A tibble: 32 × 3
newcol2 mpg hp
<dbl> <dbl> <dbl>
1 1.19 21 110
2 1.31 21 110
3 1.05 22.8 93
4 1.46 21.4 110
5 1.56 18.7 175
6 1.57 18.1 105
7 1.62 14.3 245
8 1.45 24.4 62
9 1.43 22.8 95
10 1.56 19.2 123
# ℹ 22 more rows
5.6 Ordering Columns
The select
function can reorder columns. Put newcol
first, then select the rest of the columns using everything
.
# A tibble: 32 × 15
newcol car mpg cyl disp hp drat wt qsec vs am gear
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1.19 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4
2 1.31 Mazda RX4… 21 6 160 110 3.9 2.88 17.0 0 1 4
3 1.05 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4
4 1.46 Hornet 4 … 21.4 6 258 110 3.08 3.22 19.4 1 0 3
5 1.56 Hornet Sp… 18.7 8 360 175 3.15 3.44 17.0 0 0 3
6 1.57 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3
7 1.62 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3
8 1.45 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4
9 1.43 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4
10 1.56 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4
# ℹ 22 more rows
# ℹ 3 more variables: carb <dbl>, disp_cat <chr>, disp_cat2 <chr>
To put newcol
back at the end, we can remove it, select everything, and then add it back in.
# A tibble: 32 × 15
car mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
# ℹ 3 more variables: disp_cat <chr>, disp_cat2 <chr>, newcol <dbl>
5.7 Ordering Rows
The arrange
function can reorder rows. by default, arrange
orders in ascending order.
# A tibble: 32 × 15
car mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Cadillac F… 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
2 Lincoln Co… 10.4 8 460 215 3 5.42 17.8 0 0 3 4
3 Camaro Z28 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
4 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
5 Chrysler I… 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
6 Maserati B… 15 8 301 335 3.54 3.57 14.6 0 1 5 8
7 Merc 450SLC 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
8 AMC Javelin 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2
9 Dodge Chal… 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
10 Ford Pante… 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
# ℹ 22 more rows
# ℹ 3 more variables: disp_cat <chr>, disp_cat2 <chr>, newcol <dbl>
Use the desc
function to arrange the rows in descending order.
# A tibble: 32 × 15
car mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Toyota Cor… 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
2 Fiat 128 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
3 Honda Civic 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
4 Lotus Euro… 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
5 Fiat X1-9 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
6 Porsche 91… 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
7 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
8 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 Toyota Cor… 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
# ℹ 22 more rows
# ℹ 3 more variables: disp_cat <chr>, disp_cat2 <chr>, newcol <dbl>
It is a bit more straightforward to mix increasing and decreasing orderings. Here, we sort by increasing mpg first, then for any ties in mpg we sort in decreasing order by horsepower (hp).
# A tibble: 32 × 15
car mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Lincoln Co… 10.4 8 460 215 3 5.42 17.8 0 0 3 4
2 Cadillac F… 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
3 Camaro Z28 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
4 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
5 Chrysler I… 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
6 Maserati B… 15 8 301 335 3.54 3.57 14.6 0 1 5 8
7 Merc 450SLC 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
8 AMC Javelin 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2
9 Dodge Chal… 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
10 Ford Pante… 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
# ℹ 22 more rows
# ℹ 3 more variables: disp_cat <chr>, disp_cat2 <chr>, newcol <dbl>
5.8 Exercises
- Check to see if you have the
mtcars
dataset by entering the commandmtcars
. - What class is
mtcars
? - How many observations (rows) and variables (columns) are in the
mtcars
dataset? - Copy
mtcars
into an object called cars and renamempg
in cars toMPG
. Userename()
. - Convert the column names of
cars
to all upper case. Userename_all
, and thetoupper
command (orcolnames
). - Convert the rownames of
cars
to a column calledcar
usingrownames_to_column
. Subset the columns fromcars
that end in"p"
and call it pvars usingends_with()
. - Create a subset
cars
that only contains the columns:wt
,qsec
, andhp
and assign this object tocarsSub
. What are the dimensions ofcarsSub
? (Useselect()
anddim()
.) - Convert the column names of
carsSub
to all upper case. Userename_all()
, andtoupper()
(orcolnames()
). - Subset the rows of
cars
that get more than 20 miles per gallon (mpg
) of fuel efficiency. How many are there? (Usefilter()
.) - Subset the rows that get less than 16 miles per gallon (
mpg
) of fuel efficiency and have more than 100 horsepower (hp
). How many are there? (Usefilter()
.) - Create a subset of the
cars
data that only contains the columns:wt
,qsec
, andhp
for cars with 8 cylinders (cyl
) and reassign this object tocarsSub
. What are the dimensions of this dataset? - Re-order the rows of
carsSub
by weight (wt
) in increasing order. (Usearrange()
.) - Create a new variable in
carsSub
calledwt2
, which is equal towt^2
, usingmutate()
and piping%>%
.