Chapter 5 Subsetting Data in R

In this module, we will show you how to

Select specific elements of an object by an index or logical expression.
Rename columns of a data.frame.
Subset rows of a data.frame.
Subset columns of a data.frame.
Add/remove new columns to a data/frame.
Order the columns of a data.frame.
Order the rows of a data.frame.

We will show you how to do this the base R way, and then show you how to perform each operation using the dplyr package (if applicable). The dplyr package provides functions to make these operations more intuitive and to make your code more readable. There are many resources available on dplyr (and the larger tidyverse). See for example the dplyer vingettes for examples and comments.

The dplyr packages interfaces well with tibbles. Let’s load the tidyverse packages.

# install.packages("tidyverse") #if tidyverse not already installed
library(tidyverse)

Note that when loading dplyr, it says objects can be “masked” or there are conflicts. That means that if you use a function defined in two different packages, it uses the one that is loaded last.

For example, if we print filter, then we see at the bottom namespace:dplyr, which means that when you type filter, it will use the one from the dplyr package.

filter

function (.data, ..., .by = NULL, .preserve = FALSE) 
{
    check_by_typo(...)
    by <- enquo(.by)
    if (!quo_is_null(by) && !is_false(.preserve)) {
        abort("Can't supply both `.by` and `.preserve`.")
    }
    UseMethod("filter")
}
<bytecode: 0x0000018e37927c48>
<environment: namespace:dplyr>

The filter function from dplyr masks or conflicts with the filter function from the base R stats package. However, if you want to make sure you use a function from a specific package, you can use PackageName::Function with the colon (“::”) operator.

head(stats::filter, 2)

                                                               
1 function (x, filter, method = c("convolution", "recursive"), 
2     sides = 2L, circular = FALSE, init = NULL)

This is important when loading many packages, and you may have some conflicts/masking. Make sure you are using the correct function in your code!

5.1 Selecting Specific Elements

Let’s begin by looking at how to select specific elements of a vector using base R.

5.1.1 Select Specific Elements Using an Index

Often you only want to look at subsets of a dataset at any given time. Elements of an R object are selected using the brackets ([ and ]).

For example, x is a vector of numbers, and we can select the second element of x using the brackets and an index of 2:

x = c(1, 4, 2, 8, 10)
x[2]

[1] 4

5.1.2 Select Specific Elements Using an Index

We can select the fifth or second AND fifth elements as shown below:

x = c(1, 2, 4, 8, 10)
x[5]

[1] 10

x[c(2,5)]

[1]  2 10

5.1.3 Subsetting by Deletion of Entries

You can put a minus (-) before integers inside brackets to remove these indices from the data.

x[-2] # all but the second element

[1]  1  4  8 10

Note that you have to be careful with this syntax when dropping more than 1 element

x[-c(1,2,3)] # drop elements 1, 2, and 3

[1]  8 10

# x[-1:3] # 1:3 is shorthand for c(1,2,3). 
          #-1:3 = c(-1,0,1,2,3)
x[-(1:3)] # needs parentheses

[1]  8 10

5.1.4 Specifying Specific Elements Using Logical Operators

What about selecting rows based on the values of two variables? We use logical statements.To perform logical comparisons, we can use

== - check to see if two items are equal (note that == is a comparison while = is an assignment operator)
!= - check to see if two items are not equal
>, >= - greater than (or equal to)
<, <= - less than (or equal to)

Here we select only the elements of x greater than 2:

[1]  1  2  4  8 10

# vector of booleans
# TRUE if corresponding element > 2
x > 2

[1] FALSE FALSE  TRUE  TRUE  TRUE

x[ x > 2 ]

[1]  4  8 10

You can have multiple logical conditions using the following:

& : AND
| : OR

[1]  1  2  4  8 10

x[ x > 2 & x < 5] #Select elements that are between 2 and 5

[1] 4

x[ x > 5 | x == 2] #Select elements that are equal to 2 or greater than 5

[1]  2  8 10

5.1.5 Which Function

The which function takes in logical vectors and returns the index for the elements where the logical value is TRUE.

which(x > 5 | x == 2) # returns index

[1] 2 4 5

x[ which(x > 5 | x == 2) ]

[1]  2  8 10

x[ x > 5 |  x == 2 ]

[1]  2  8 10

5.2 Renaming Columns

Let’s load an example dataset to work with. We will use the mtcars dataset, which is built into base R.

data(mtcars)
head(mtcars)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

If we would like to create a tibble (“fancy” data.frame) version of mtcars, we can do it using as.tbl or as_tibble.

df = rownames_to_column(mtcars, var = "car")
tbl = as_tibble(df) 
head(tbl)

# A tibble: 6 × 12
  car            mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
4 Hornet 4 Dr…  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
5 Hornet Spor…  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
6 Valiant       18.1     6   225   105  2.76  3.46  20.2     1     0     3     1

In the “tidy” data format, all information of interest is a variable (not a name), so as of tibble 2.0, rownames are removed. Above we have added the rownames as new column variable called “car”, since the rownames are deleted when converting to a tibble.

head(mtcars, 2) #data.frame version includes rownames

              mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4

head(as_tibble(mtcars), 2) #tibble removes rownames

# A tibble: 2 × 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1    21     6   160   110   3.9  2.62  16.5     0     1     4     4
2    21     6   160   110   3.9  2.88  17.0     0     1     4     4

5.2.1 Renaming Columns: Base R

We can use the colnames function to directly reassign column names of a df:

colnames(tbl)

 [1] "car"  "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"  
[11] "gear" "carb"

colnames(tbl)[2:4] = c("MPG", "CYL", "DISP")
head(tbl)

# A tibble: 6 × 12
  car            MPG   CYL  DISP    hp  drat    wt  qsec    vs    am  gear  carb
  <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
4 Hornet 4 Dr…  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
5 Hornet Spor…  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
6 Valiant       18.1     6   225   105  2.76  3.46  20.2     1     0     3     1

colnames(tbl)[2:4] = c("mpg", "cyl", "disp") # reset

We can assign the column names, change the ones we want, and then re-assign the column names

cn = colnames(tbl)
cn[ cn == "drat" ] = "DRAT"
colnames(tbl) = cn
head(tbl)

# A tibble: 6 × 12
  car            mpg   cyl  disp    hp  DRAT    wt  qsec    vs    am  gear  carb
  <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
4 Hornet 4 Dr…  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
5 Hornet Spor…  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
6 Valiant       18.1     6   225   105  2.76  3.46  20.2     1     0     3     1

colnames(tbl)[ colnames(tbl) == "DRAT" ] = "drat" #reset

5.2.2 Renaming Columns: `dplyr`

To rename columns with dplyr, you use the rename command

tbl = dplyr::rename(tbl, MPG = mpg) # rename mpg column as MPG
head(tbl)

# A tibble: 6 × 12
  car            MPG   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
4 Hornet 4 Dr…  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
5 Hornet Spor…  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
6 Valiant       18.1     6   225   105  2.76  3.46  20.2     1     0     3     1

tbl = rename(tbl, mpg = MPG) # reset

To rename all columns you use the rename_all command (with a function). For example, to change all column names to uppercase we use the function toupper.

tbl_upper = rename_all(tbl, toupper)
head(tbl_upper)

# A tibble: 6 × 12
  CAR            MPG   CYL  DISP    HP  DRAT    WT  QSEC    VS    AM  GEAR  CARB
  <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
4 Hornet 4 Dr…  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
5 Hornet Spor…  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
6 Valiant       18.1     6   225   105  2.76  3.46  20.2     1     0     3     1

5.3 Subsetting Columns

5.3.1 Subsetting Columns: Base R

We can grab the carb column using the $ operator.

tbl$carb

 [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2

We can also subset a data.frame using the brackets [, ] subsetting.

For data.frames and matrices (2-dimensional objects), the brackets are [rows, columns] subsetting. We can grab the x column by using the index of the column or the column name “carb”.

tbl[, 11]

# A tibble: 32 × 1
    gear
   <dbl>
 1     4
 2     4
 3     4
 4     3
 5     3
 6     3
 7     3
 8     4
 9     4
10     4
# ℹ 22 more rows

tbl[, "carb"]

# A tibble: 32 × 1
    carb
   <dbl>
 1     4
 2     4
 3     1
 4     1
 5     2
 6     1
 7     4
 8     2
 9     2
10     4
# ℹ 22 more rows

Mostly, tbl (tibbles) are the same as data.frames, except they don’t print all lines. When subsetting only one column using brackets, a data.frame will return a vector, but a tbl will return a tbl.

df[, 1] #this is a character vector of car names

 [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
 [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
 [7] "Duster 360"          "Merc 240D"           "Merc 230"           
[10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
[13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
[16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
[19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
[22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
[25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
[28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
[31] "Maserati Bora"       "Volvo 142E"

tbl = as_tibble(df)
tbl[, 1] # this is a tbl with one character column of car names

# A tibble: 32 × 1
   car              
   <chr>            
 1 Mazda RX4        
 2 Mazda RX4 Wag    
 3 Datsun 710       
 4 Hornet 4 Drive   
 5 Hornet Sportabout
 6 Valiant          
 7 Duster 360       
 8 Merc 240D        
 9 Merc 230         
10 Merc 280         
# ℹ 22 more rows

We can select multiple columns using multiple column names:

tbl[, c("mpg", "cyl")]

# A tibble: 32 × 2
     mpg   cyl
   <dbl> <dbl>
 1  21       6
 2  21       6
 3  22.8     4
 4  21.4     6
 5  18.7     8
 6  18.1     6
 7  14.3     8
 8  24.4     4
 9  22.8     4
10  19.2     6
# ℹ 22 more rows

5.3.2 Subsetting Columns: `dplyr`

The select command from dplyr allows you to subset columns

select(tbl, mpg) # select mpg column from tbl

# A tibble: 32 × 1
     mpg
   <dbl>
 1  21  
 2  21  
 3  22.8
 4  21.4
 5  18.7
 6  18.1
 7  14.3
 8  24.4
 9  22.8
10  19.2
# ℹ 22 more rows

select will always return a tibble. If we would like to get single column returned as a vector instead of a tbl, we can use pull from dplyr.

pull(tbl, mpg)

 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4

pull(tbl, "mpg")

 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4

pull(tbl, 2)

 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4

The select command from dplyr allows you to subset multiple columns

select(tbl, mpg, cyl) # select mpg and cyl columns from df

# A tibble: 32 × 2
     mpg   cyl
   <dbl> <dbl>
 1  21       6
 2  21       6
 3  22.8     4
 4  21.4     6
 5  18.7     8
 6  18.1     6
 7  14.3     8
 8  24.4     4
 9  22.8     4
10  19.2     6
# ℹ 22 more rows

There are several “helpers” for the select function. For example, if I wanted to select all columns whose name starts with “c”, we could use the starts_with function from the tidyselect package.

select(tbl, starts_with("c"))

# A tibble: 32 × 3
   car                 cyl  carb
   <chr>             <dbl> <dbl>
 1 Mazda RX4             6     4
 2 Mazda RX4 Wag         6     4
 3 Datsun 710            4     1
 4 Hornet 4 Drive        6     1
 5 Hornet Sportabout     8     2
 6 Valiant               6     1
 7 Duster 360            8     4
 8 Merc 240D             4     2
 9 Merc 230              4     2
10 Merc 280              6     4
# ℹ 22 more rows

Run the command ??tidyselect::select_helpers to see the R documentation on these helper functions. A few of them are

one_of()
last_col()
ends_with()
contains() 
matches()

5.4 Subsetting Rows

5.4.1 Subsetting Rows: Base R

We can subset rows of a data.frame in the same way we subset columns by using the brackets [,] where we subset in the first argument instead of the second. Let’s select rows 1 and 3 from tbl using brackets

tbl[ c(1, 3), ]

# A tibble: 2 × 12
  car          mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4   21       6   160   110  3.9   2.62  16.5     0     1     4     4
2 Datsun 710  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1

5.4.2 Subsetting Rows: `dplyr`

The command in dplyr for subsetting rows is filter. Try ?filter. Let’s use filter to select the rows where mpg is between less than 14 or greater than 20.

filter(tbl, mpg < 14 | mpg > 20)

# A tibble: 17 × 12
   car           mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Mazda RX4    21       6 160     110  3.9   2.62  16.5     0     1     4     4
 2 Mazda RX4 …  21       6 160     110  3.9   2.88  17.0     0     1     4     4
 3 Datsun 710   22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
 4 Hornet 4 D…  21.4     6 258     110  3.08  3.22  19.4     1     0     3     1
 5 Merc 240D    24.4     4 147.     62  3.69  3.19  20       1     0     4     2
 6 Merc 230     22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
 7 Cadillac F…  10.4     8 472     205  2.93  5.25  18.0     0     0     3     4
 8 Lincoln Co…  10.4     8 460     215  3     5.42  17.8     0     0     3     4
 9 Fiat 128     32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
10 Honda Civic  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
11 Toyota Cor…  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
12 Toyota Cor…  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1
13 Camaro Z28   13.3     8 350     245  3.73  3.84  15.4     0     0     3     4
14 Fiat X1-9    27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
15 Porsche 91…  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
16 Lotus Euro…  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
17 Volvo 142E   21.4     4 121     109  4.11  2.78  18.6     1     1     4     2

Note, no $ or subsetting using indexes are necessary. R “knows” that mpg refers to a column of the data.frame tbl.

By default, you can separate conditions by commas, and filter assumes these statements are joined by &. For example, to find all cars with mpg > 20 AND 4 cylinders:

filter(tbl, mpg > 20 & cyl == 4)

# A tibble: 11 × 12
   car           mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Datsun 710   22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
 2 Merc 240D    24.4     4 147.     62  3.69  3.19  20       1     0     4     2
 3 Merc 230     22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
 4 Fiat 128     32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
 5 Honda Civic  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
 6 Toyota Cor…  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
 7 Toyota Cor…  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1
 8 Fiat X1-9    27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
 9 Porsche 91…  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
10 Lotus Euro…  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
11 Volvo 142E   21.4     4 121     109  4.11  2.78  18.6     1     1     4     2

filter(tbl, mpg > 20, cyl == 4)

# A tibble: 11 × 12
   car           mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Datsun 710   22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
 2 Merc 240D    24.4     4 147.     62  3.69  3.19  20       1     0     4     2
 3 Merc 230     22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
 4 Fiat 128     32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
 5 Honda Civic  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
 6 Toyota Cor…  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
 7 Toyota Cor…  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1
 8 Fiat X1-9    27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
 9 Porsche 91…  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
10 Lotus Euro…  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
11 Volvo 142E   21.4     4 121     109  4.11  2.78  18.6     1     1     4     2

If you want OR statements, then you need to do the pipe | explicitly.

filter(tbl, mpg > 20 | cyl == 4)

# A tibble: 14 × 12
   car           mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Mazda RX4    21       6 160     110  3.9   2.62  16.5     0     1     4     4
 2 Mazda RX4 …  21       6 160     110  3.9   2.88  17.0     0     1     4     4
 3 Datsun 710   22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
 4 Hornet 4 D…  21.4     6 258     110  3.08  3.22  19.4     1     0     3     1
 5 Merc 240D    24.4     4 147.     62  3.69  3.19  20       1     0     4     2
 6 Merc 230     22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
 7 Fiat 128     32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
 8 Honda Civic  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
 9 Toyota Cor…  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
10 Toyota Cor…  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1
11 Fiat X1-9    27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
12 Porsche 91…  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
13 Lotus Euro…  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
14 Volvo 142E   21.4     4 121     109  4.11  2.78  18.6     1     1     4     2

5.4.3 Subsetting Both Columns and Rows

In base R, we can select both rows and columns by using the brackets [,]. For example, to select the number of cylinders (cyl) and horsepower (hp) columns for cars with mpg > 20 and 4 cylinders

tbl[tbl$mpg > 20 & tbl$cyl == 4, c("cyl", "hp")]

# A tibble: 11 × 2
     cyl    hp
   <dbl> <dbl>
 1     4    93
 2     4    62
 3     4    95
 4     4    66
 5     4    52
 6     4    65
 7     4    97
 8     4    66
 9     4    91
10     4   113
11     4   109

Using dplyr, we can combine filter and select to subset both rows and columns of a data.frame.

select(filter(tbl, mpg > 20 & cyl == 4), cyl, hp)

# A tibble: 11 × 2
     cyl    hp
   <dbl> <dbl>
 1     4    93
 2     4    62
 3     4    95
 4     4    66
 5     4    52
 6     4    65
 7     4    97
 8     4    66
 9     4    91
10     4   113
11     4   109

Nesting functions like this is very messy and when multiple conditions are being used, this can quickly become difficult to read. One solution is to assign temporary objects and reassign them.

df2 = filter(tbl, mpg > 20 & cyl == 4)
df2 = select(df2, cyl, hp)

A better solution is to use the pipe %>% as this makes things such as this much more readable. It reads from left to right. RStudio uses the shortcut CMD/Ctrl + Shift + M for the pipe. A pipe insert whatever is on the left of the pipe as the first argument to whatever is on the right of the pipe. So we can pipe tbl into filter which returns a tibble with the desired rows and pipe this into select which will then select the desired columns.

tbl %>%  
  filter(mpg > 20 & cyl == 4) %>% 
  select(cyl, hp)

# A tibble: 11 × 2
     cyl    hp
   <dbl> <dbl>
 1     4    93
 2     4    62
 3     4    95
 4     4    66
 5     4    52
 6     4    65
 7     4    97
 8     4    66
 9     4    91
10     4   113
11     4   109

5.5 Adding/Removing Columns

5.5.1 Adding a Column: Base R

You can add a new column, called newcol to df, using the $ operator

tbl$newcol = tbl$wt / 2.2
head(tbl, 3)

# A tibble: 3 × 13
  car     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb newcol
  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
1 Mazd…  21       6   160   110  3.9   2.62  16.5     0     1     4     4   1.19
2 Mazd…  21       6   160   110  3.9   2.88  17.0     0     1     4     4   1.31
3 Dats…  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1   1.05

5.5.2 Adding a Column: `dplyr`

This $ method is very common.

The mutate function in dplyr allows you to add or replace columns of a data.frame

tbl = mutate(tbl, newcol = wt / 2.2)
head(tbl, 3)

# A tibble: 3 × 13
  car     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb newcol
  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
1 Mazd…  21       6   160   110  3.9   2.62  16.5     0     1     4     4   1.19
2 Mazd…  21       6   160   110  3.9   2.88  17.0     0     1     4     4   1.31
3 Dats…  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1   1.05

One frequently-used tool is creating variables with conditions. A general function for creating new variables based on existing variables is the ifelse() function, which “returns a value with the same shape as test which is filled with elements selected from either yes or no depending on whether the element of test is TRUE or FALSE.”

ifelse(test, yes, no)

# test: an object which can be coerced to logical mode.
# yes: return values for true elements of test.
# no: return values for false elements of test.

In the following example, we combine ifelse with mutate to create a new variable disp_cat that is “Low” if disp <= 200, “Medium” if 200 < disp <= 400 and “High” otherwise.

tbl = mutate(tbl,            
            disp_cat = ifelse(              
              disp <= 200, "Low",              
              ifelse(disp <= 400, 
                     "Medium", 
                     "High"))
            )
head(tbl$disp_cat)

[1] "Low"    "Low"    "Low"    "Medium" "Medium" "Medium"

Alternatively, case_when provides a clean syntax as well:

tbl = mutate(tbl,            
            disp_cat2 = case_when(              
              disp <= 200 ~ "Low",              
              disp > 200 & disp <= 400 ~ "Medium",              
              disp > 400 ~ "High",
            ))
head(tbl$disp_cat2)

[1] "Low"    "Low"    "Low"    "Medium" "Medium" "Medium"

5.5.3 Removing a Column: Base R

You can remove a column by assigning it to be NULL

tbl$newcol = NULL
head(tbl, 3)

# A tibble: 3 × 14
  car            mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
# ℹ 2 more variables: disp_cat <chr>, disp_cat2 <chr>

5.5.4 Removing a Column: `dplyr`

The NULL method is still very common.

The select function can remove a column with a minus (-).

tbl = mutate(tbl, newcol = wt / 2.2) # add newcol back
select(tbl, -newcol) # delete newcol

# A tibble: 32 × 14
   car           mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2 Mazda RX4 …  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3 Datsun 710   22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4 Hornet 4 D…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6 Valiant      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8 Merc 240D    24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9 Merc 230     22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10 Merc 280     19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ℹ 22 more rows
# ℹ 2 more variables: disp_cat <chr>, disp_cat2 <chr>

To remove both newcol and drat

tbl = mutate(tbl, newcol = wt / 2.2) # add newcol back
select(tbl, -newcol, -drat) # delete newcol and drat

# A tibble: 32 × 13
   car        mpg   cyl  disp    hp    wt  qsec    vs    am  gear  carb disp_cat
   <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>   
 1 Mazda R…  21       6  160    110  2.62  16.5     0     1     4     4 Low     
 2 Mazda R…  21       6  160    110  2.88  17.0     0     1     4     4 Low     
 3 Datsun …  22.8     4  108     93  2.32  18.6     1     1     4     1 Low     
 4 Hornet …  21.4     6  258    110  3.22  19.4     1     0     3     1 Medium  
 5 Hornet …  18.7     8  360    175  3.44  17.0     0     0     3     2 Medium  
 6 Valiant   18.1     6  225    105  3.46  20.2     1     0     3     1 Medium  
 7 Duster …  14.3     8  360    245  3.57  15.8     0     0     3     4 Medium  
 8 Merc 24…  24.4     4  147.    62  3.19  20       1     0     4     2 Low     
 9 Merc 230  22.8     4  141.    95  3.15  22.9     1     0     4     2 Low     
10 Merc 280  19.2     6  168.   123  3.44  18.3     1     0     4     4 Low     
# ℹ 22 more rows
# ℹ 1 more variable: disp_cat2 <chr>

Alternatively, we could have used

select(tbl, -one_of("newcol","drat"))

5.5.5 Transmutation

The transmute function in dplyr combines both the mutate and select functions. One can create new columns and keep only the columns wanted.

transmute(tbl, newcol2 = wt / 2.2, mpg, hp)

# A tibble: 32 × 3
   newcol2   mpg    hp
     <dbl> <dbl> <dbl>
 1    1.19  21     110
 2    1.31  21     110
 3    1.05  22.8    93
 4    1.46  21.4   110
 5    1.56  18.7   175
 6    1.57  18.1   105
 7    1.62  14.3   245
 8    1.45  24.4    62
 9    1.43  22.8    95
10    1.56  19.2   123
# ℹ 22 more rows

5.6 Ordering Columns

The select function can reorder columns. Put newcol first, then select the rest of the columns using everything.

tbl = mutate(tbl, newcol = wt / 2.2)
select(tbl, newcol, everything())

# A tibble: 32 × 15
   newcol car          mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear
    <dbl> <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1   1.19 Mazda RX4   21       6  160    110  3.9   2.62  16.5     0     1     4
 2   1.31 Mazda RX4…  21       6  160    110  3.9   2.88  17.0     0     1     4
 3   1.05 Datsun 710  22.8     4  108     93  3.85  2.32  18.6     1     1     4
 4   1.46 Hornet 4 …  21.4     6  258    110  3.08  3.22  19.4     1     0     3
 5   1.56 Hornet Sp…  18.7     8  360    175  3.15  3.44  17.0     0     0     3
 6   1.57 Valiant     18.1     6  225    105  2.76  3.46  20.2     1     0     3
 7   1.62 Duster 360  14.3     8  360    245  3.21  3.57  15.8     0     0     3
 8   1.45 Merc 240D   24.4     4  147.    62  3.69  3.19  20       1     0     4
 9   1.43 Merc 230    22.8     4  141.    95  3.92  3.15  22.9     1     0     4
10   1.56 Merc 280    19.2     6  168.   123  3.92  3.44  18.3     1     0     4
# ℹ 22 more rows
# ℹ 3 more variables: carb <dbl>, disp_cat <chr>, disp_cat2 <chr>

To put newcol back at the end, we can remove it, select everything, and then add it back in.

select(tbl, -newcol, everything(), newcol)

# A tibble: 32 × 15
   car           mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2 Mazda RX4 …  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3 Datsun 710   22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4 Hornet 4 D…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6 Valiant      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8 Merc 240D    24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9 Merc 230     22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10 Merc 280     19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ℹ 22 more rows
# ℹ 3 more variables: disp_cat <chr>, disp_cat2 <chr>, newcol <dbl>

5.7 Ordering Rows

The arrange function can reorder rows. by default, arrange orders in ascending order.

arrange(tbl, mpg) #sort cars by increasing mpg rating

# A tibble: 32 × 15
   car           mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Cadillac F…  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4
 2 Lincoln Co…  10.4     8  460    215  3     5.42  17.8     0     0     3     4
 3 Camaro Z28   13.3     8  350    245  3.73  3.84  15.4     0     0     3     4
 4 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 5 Chrysler I…  14.7     8  440    230  3.23  5.34  17.4     0     0     3     4
 6 Maserati B…  15       8  301    335  3.54  3.57  14.6     0     1     5     8
 7 Merc 450SLC  15.2     8  276.   180  3.07  3.78  18       0     0     3     3
 8 AMC Javelin  15.2     8  304    150  3.15  3.44  17.3     0     0     3     2
 9 Dodge Chal…  15.5     8  318    150  2.76  3.52  16.9     0     0     3     2
10 Ford Pante…  15.8     8  351    264  4.22  3.17  14.5     0     1     5     4
# ℹ 22 more rows
# ℹ 3 more variables: disp_cat <chr>, disp_cat2 <chr>, newcol <dbl>

Use the desc function to arrange the rows in descending order.

arrange(tbl, desc(mpg)) #sort cars by decreasing mpg rating

# A tibble: 32 × 15
   car           mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Toyota Cor…  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
 2 Fiat 128     32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
 3 Honda Civic  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
 4 Lotus Euro…  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
 5 Fiat X1-9    27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
 6 Porsche 91…  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
 7 Merc 240D    24.4     4 147.     62  3.69  3.19  20       1     0     4     2
 8 Datsun 710   22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
 9 Merc 230     22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
10 Toyota Cor…  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1
# ℹ 22 more rows
# ℹ 3 more variables: disp_cat <chr>, disp_cat2 <chr>, newcol <dbl>

It is a bit more straightforward to mix increasing and decreasing orderings. Here, we sort by increasing mpg first, then for any ties in mpg we sort in decreasing order by horsepower (hp).

arrange(tbl, mpg, desc(hp))

# A tibble: 32 × 15
   car           mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Lincoln Co…  10.4     8  460    215  3     5.42  17.8     0     0     3     4
 2 Cadillac F…  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4
 3 Camaro Z28   13.3     8  350    245  3.73  3.84  15.4     0     0     3     4
 4 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 5 Chrysler I…  14.7     8  440    230  3.23  5.34  17.4     0     0     3     4
 6 Maserati B…  15       8  301    335  3.54  3.57  14.6     0     1     5     8
 7 Merc 450SLC  15.2     8  276.   180  3.07  3.78  18       0     0     3     3
 8 AMC Javelin  15.2     8  304    150  3.15  3.44  17.3     0     0     3     2
 9 Dodge Chal…  15.5     8  318    150  2.76  3.52  16.9     0     0     3     2
10 Ford Pante…  15.8     8  351    264  4.22  3.17  14.5     0     1     5     4
# ℹ 22 more rows
# ℹ 3 more variables: disp_cat <chr>, disp_cat2 <chr>, newcol <dbl>

5.8 Exercises

Check to see if you have the mtcars dataset by entering the command mtcars.
What class is mtcars?
How many observations (rows) and variables (columns) are in the mtcars dataset?
Copy mtcars into an object called cars and rename mpg in cars to MPG. Use rename().
Convert the column names of cars to all upper case. Use rename_all, and the toupper command (or colnames).
Convert the rownames of cars to a column called car using rownames_to_column. Subset the columns from cars that end in "p" and call it pvars using ends_with().
Create a subset cars that only contains the columns: wt, qsec, and hp and assign this object to carsSub. What are the dimensions of carsSub? (Use select() and dim().)
Convert the column names of carsSub to all upper case. Use rename_all(), and toupper() (or colnames()).
Subset the rows of cars that get more than 20 miles per gallon (mpg) of fuel efficiency. How many are there? (Use filter().)
Subset the rows that get less than 16 miles per gallon (mpg) of fuel efficiency and have more than 100 horsepower (hp). How many are there? (Use filter().)
Create a subset of the cars data that only contains the columns: wt, qsec, and hp for cars with 8 cylinders (cyl) and reassign this object to carsSub. What are the dimensions of this dataset?
Re-order the rows of carsSub by weight (wt) in increasing order. (Use arrange().)
Create a new variable in carsSub called wt2, which is equal to wt^2, using mutate() and piping %>%.