Chapter 9 Manipulating Data in R

In this section, we will show you how to:

  1. Reshaping data from wide (fat) to long (tall)
  2. Reshaping data from long (tall) to wide (fat)
  3. Merging Data

We will show you how to do each operation in base R then show you how to use the dplyr or tidyr package to do the same operation (if applicable).

See the “Data Wrangling Cheat Sheet using dplyr and tidyr”:

In this section, we will use the dataset Charm_City_Circulator_Ridership.csv.

library(readr)
library(tidyverse)
circ = read_csv("data/Charm_City_Circulator_Ridership.csv")
head(circ, 2)
# A tibble: 2 × 15
  day     date    orangeBoardings orangeAlightings orangeAverage purpleBoardings
  <chr>   <chr>             <dbl>            <dbl>         <dbl>           <dbl>
1 Monday  01/11/…             877             1027           952              NA
2 Tuesday 01/12/…             777              815           796              NA
# ℹ 9 more variables: purpleAlightings <dbl>, purpleAverage <dbl>,
#   greenBoardings <dbl>, greenAlightings <dbl>, greenAverage <dbl>,
#   bannerBoardings <dbl>, bannerAlightings <dbl>, bannerAverage <dbl>,
#   daily <dbl>
class(circ$date)
[1] "character"
library(lubridate) # great for dates!
sum(is.na(circ$date))
[1] 0
sum( circ$date == "")
[1] 0
circ = mutate(circ, date = mdy(date))
sum( is.na(circ$date) ) # all converted correctly
[1] 0
head(circ$date, 3)
[1] "2010-01-11" "2010-01-12" "2010-01-13"
class(circ$date)
[1] "Date"

9.1 Reshaping Data

What is wide/long data?

See http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/

Wide = multiple columns per observation

  • e.g. visit1, visit2, visit3
  id visit1 visit2 visit3
1  1     10      4      3
2  2      5      6     NA

Long = multiple rows per observation

  id visit value
1  1     1    10
2  1     2     4
3  1     3     3
4  2     1     5
5  2     2     6

More accurately, data is wide or long with respect to certain variables.

In base R, the reshape command exists. It is a confusing function. Don’t use it.

tidyr allows you to “tidy” your data. We will be talking about:

  • pivot_longer - make multiple columns into variables, (wide to long)
  • pivot_wider - make a variable into multiple columns, (long to wide)
  • separate - string into multiple columns
  • unite - multiple columns into one string

9.1.1 Reshaping Data from Wide (Fat) to Long (Tall)

To change from wide to long use tidyr::pivot_longer, which puts column data into rows.

We want the column names into “var” variable in the output dataset and the value in “number” variable. We then describe which columns we want to “gather:”

In the following example, we take the columns corresponding to the different bus ridership counts, e.g. orangeBoardings, orangeAlightings, etc., which are currently in one row for each day, and reshape the data so that each count gets its own row, i.e. there will be a row for the Monday 2010-01-11 orangeBoardings ridership count, a row for the Monday 2010-01-11 orangeAlightings ridership count, etc.

head(circ)
# A tibble: 6 × 15
  day       date       orangeBoardings orangeAlightings orangeAverage
  <chr>     <date>               <dbl>            <dbl>         <dbl>
1 Monday    2010-01-11             877             1027          952 
2 Tuesday   2010-01-12             777              815          796 
3 Wednesday 2010-01-13            1203             1220         1212.
4 Thursday  2010-01-14            1194             1233         1214.
5 Friday    2010-01-15            1645             1643         1644 
6 Saturday  2010-01-16            1457             1524         1490.
# ℹ 10 more variables: purpleBoardings <dbl>, purpleAlightings <dbl>,
#   purpleAverage <dbl>, greenBoardings <dbl>, greenAlightings <dbl>,
#   greenAverage <dbl>, bannerBoardings <dbl>, bannerAlightings <dbl>,
#   bannerAverage <dbl>, daily <dbl>
long = pivot_longer(circ, names_to = "var", values_to = "number", 
              c(-day, -date, -daily))
head(arrange(long, date), 4)
# A tibble: 4 × 5
  day    date       daily var              number
  <chr>  <date>     <dbl> <chr>             <dbl>
1 Monday 2010-01-11   952 orangeBoardings     877
2 Monday 2010-01-11   952 orangeAlightings   1027
3 Monday 2010-01-11   952 orangeAverage       952
4 Monday 2010-01-11   952 purpleBoardings      NA

Note that the number of columns that we wanted to gather was much more than the columns we didn’t, so we used the - to tell pivot_longer the columns that we didn’t want to gather instead. Alternatively, we could have been explicit in what we wanted to gather.

head(circ)
# A tibble: 6 × 15
  day       date       orangeBoardings orangeAlightings orangeAverage
  <chr>     <date>               <dbl>            <dbl>         <dbl>
1 Monday    2010-01-11             877             1027          952 
2 Tuesday   2010-01-12             777              815          796 
3 Wednesday 2010-01-13            1203             1220         1212.
4 Thursday  2010-01-14            1194             1233         1214.
5 Friday    2010-01-15            1645             1643         1644 
6 Saturday  2010-01-16            1457             1524         1490.
# ℹ 10 more variables: purpleBoardings <dbl>, purpleAlightings <dbl>,
#   purpleAverage <dbl>, greenBoardings <dbl>, greenAlightings <dbl>,
#   greenAverage <dbl>, bannerBoardings <dbl>, bannerAlightings <dbl>,
#   bannerAverage <dbl>, daily <dbl>
long = pivot_longer(circ, names_to = "var", values_to = "number", 
              c(starts_with("orange"), starts_with("purple"),
              starts_with("green"), starts_with("banner")))
head(arrange(long, date), 4)
# A tibble: 4 × 5
  day    date       daily var              number
  <chr>  <date>     <dbl> <chr>             <dbl>
1 Monday 2010-01-11   952 orangeBoardings     877
2 Monday 2010-01-11   952 orangeAlightings   1027
3 Monday 2010-01-11   952 orangeAverage       952
4 Monday 2010-01-11   952 purpleBoardings      NA
dim(circ)
[1] 1146   15
long %>% count(var)
# A tibble: 12 × 2
   var                  n
   <chr>            <int>
 1 bannerAlightings  1146
 2 bannerAverage     1146
 3 bannerBoardings   1146
 4 greenAlightings   1146
 5 greenAverage      1146
 6 greenBoardings    1146
 7 orangeAlightings  1146
 8 orangeAverage     1146
 9 orangeBoardings   1146
10 purpleAlightings  1146
11 purpleAverage     1146
12 purpleBoardings   1146

After converting from wide to long, we may want to adjust the new data made from the columns names. For example, with the bus ridership data the former column names described the bus line, orange, purple, green, or banner, and the type of count being recording, Boardings = getting on the bus, Alightings = getting off the bus, and Average = average of boardings and alightings counts. Lets separate the new var column into two columns line and type. We will use str_replace from stringr to put _ in the names.

long = long %>% mutate(
  var = var %>% 
    str_replace("Board", "_Board") %>% 
    str_replace("Alight", "_Alight") %>% 
    str_replace("Average", "_Average") 
)
long %>% count(var)
# A tibble: 12 × 2
   var                   n
   <chr>             <int>
 1 banner_Alightings  1146
 2 banner_Average     1146
 3 banner_Boardings   1146
 4 green_Alightings   1146
 5 green_Average      1146
 6 green_Boardings    1146
 7 orange_Alightings  1146
 8 orange_Average     1146
 9 orange_Boardings   1146
10 purple_Alightings  1146
11 purple_Average     1146
12 purple_Boardings   1146
long = separate(long, var, into = c("line", "type"), sep = "_")
head(long, 2)
# A tibble: 2 × 6
  day    date       daily line   type       number
  <chr>  <date>     <dbl> <chr>  <chr>       <dbl>
1 Monday 2010-01-11   952 orange Boardings     877
2 Monday 2010-01-11   952 orange Alightings   1027
unique(long$line)
[1] "orange" "purple" "green"  "banner"
unique(long$type)
[1] "Boardings"  "Alightings" "Average"   

We can also reunite columns using the unite function if we wanted to put them back together.

reunited = long %>% 
  unite(col = var, line, type, sep = "_")  
reunited %>% select(day, var) %>% head(3) %>% print
# A tibble: 3 × 2
  day    var              
  <chr>  <chr>            
1 Monday orange_Boardings 
2 Monday orange_Alightings
3 Monday orange_Average   

We could also use paste/paste0.

9.1.2 Reshaping Data From Long (Tall) to Wide (Fat)

In tidyr, the pivot_wider function spreads rows into columns. Now we have a long data set, but we want to separate the Average, Alightings and Boardings into different columns:

wide = pivot_wider(long, names_from = type, values_from = number)
head(wide)
# A tibble: 6 × 7
  day     date       daily line   Boardings Alightings Average
  <chr>   <date>     <dbl> <chr>      <dbl>      <dbl>   <dbl>
1 Monday  2010-01-11   952 orange       877       1027     952
2 Monday  2010-01-11   952 purple        NA         NA      NA
3 Monday  2010-01-11   952 green         NA         NA      NA
4 Monday  2010-01-11   952 banner        NA         NA      NA
5 Tuesday 2010-01-12   796 orange       777        815     796
6 Tuesday 2010-01-12   796 purple        NA         NA      NA

Note that there are several rows that are all NA because that bus line did not run on that particular date. We can use rowSums to see if any values in the row is NA and keep if the row, which is a combination of date and line type has any non-missing data.

head(wide, 3)
# A tibble: 3 × 7
  day    date       daily line   Boardings Alightings Average
  <chr>  <date>     <dbl> <chr>      <dbl>      <dbl>   <dbl>
1 Monday 2010-01-11   952 orange       877       1027     952
2 Monday 2010-01-11   952 purple        NA         NA      NA
3 Monday 2010-01-11   952 green         NA         NA      NA
not_namat = !is.na(select(wide, Alightings, Average, Boardings))
head(not_namat, 2)
     Alightings Average Boardings
[1,]       TRUE    TRUE      TRUE
[2,]      FALSE   FALSE     FALSE
wide$good = rowSums(not_namat) > 0

Now we can filter only the good rows and delete the good column.

wide = filter(wide, good) %>% select(-good)
head(wide)
# A tibble: 6 × 7
  day       date       daily line   Boardings Alightings Average
  <chr>     <date>     <dbl> <chr>      <dbl>      <dbl>   <dbl>
1 Monday    2010-01-11  952  orange       877       1027    952 
2 Tuesday   2010-01-12  796  orange       777        815    796 
3 Wednesday 2010-01-13 1212. orange      1203       1220   1212.
4 Thursday  2010-01-14 1214. orange      1194       1233   1214.
5 Friday    2010-01-15 1644  orange      1645       1643   1644 
6 Saturday  2010-01-16 1490. orange      1457       1524   1490.

Slice allows you to select records (compared to first/last on a vector). The following code gets the first and last boarding records for each bus line.

long = long %>% filter(!is.na(number) & number > 0)
first_and_last = long %>% arrange(date) %>% # arrange by date
  filter(type == "Boardings") %>% # keep boardings only
  group_by(line) %>% # group by line
  slice( c(1, n())) # select ("slice") first and last (n() command) lines
first_and_last %>%  head(4)
# A tibble: 4 × 6
# Groups:   line [2]
  day     date        daily line   type      number
  <chr>   <date>      <dbl> <chr>  <chr>      <dbl>
1 Monday  2012-06-04 13342. banner Boardings    520
2 Friday  2013-03-01    NA  banner Boardings    817
3 Tuesday 2011-11-01  8873  green  Boardings    887
4 Friday  2013-03-01    NA  green  Boardings   2592

9.2 Merging Datasets

base has the baseline data for ids 1 to 10 and age.

base <- tibble(id = 1:10, Age= seq(55,60, length=10))
head(base, 2)
# A tibble: 2 × 2
     id   Age
  <int> <dbl>
1     1  55  
2     2  55.6

visits has ids 1 to 8 and 11 (a new id), with 3 visits and the visit outcome measurement.

visits <- tibble(id = c(rep(1:8, 3), 11), visit= c(rep(1:3, 8), 3),
                    Outcome = seq(10,50, length=25))
head(visits, 2)
# A tibble: 2 × 3
     id visit Outcome
  <dbl> <dbl>   <dbl>
1     1     1    10  
2     2     2    11.7

Joining in dplyr:

  • Merging/joining data sets together - usually on key variables, usually “id”
  • ?join - see different types of joining for dplyr
  • Let’s look at https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
  • inner_join(x, y) - only rows that match for x and y are kept
  • full_join(x, y) - all rows of x and y are kept
  • left_join(x, y) - all rows of x are kept even if not merged with y
  • right_join(x, y) - all rows of y are kept even if not merged with x

Inner Join

ij = inner_join(base, visits)
Joining with `by = join_by(id)`
dim(ij)
[1] 24  4
tail(ij)
# A tibble: 6 × 4
     id   Age visit Outcome
  <dbl> <dbl> <dbl>   <dbl>
1     7  58.3     1    20  
2     7  58.3     3    33.3
3     7  58.3     2    46.7
4     8  58.9     2    21.7
5     8  58.9     1    35  
6     8  58.9     3    48.3

Left Join

lj = left_join(base, visits)
Joining with `by = join_by(id)`
dim(lj)
[1] 26  4
tail(lj)
# A tibble: 6 × 4
     id   Age visit Outcome
  <dbl> <dbl> <dbl>   <dbl>
1     7  58.3     2    46.7
2     8  58.9     2    21.7
3     8  58.9     1    35  
4     8  58.9     3    48.3
5     9  59.4    NA    NA  
6    10  60      NA    NA  

The tidylog package can show you log outputs from dplyr (newly added). You will need to install to use.

# install.packages("tidylog")
library(tidylog)
Warning: package 'tidylog' was built under R version 4.3.2

Attaching package: 'tidylog'
The following objects are masked from 'package:dplyr':

    add_count, add_tally, anti_join, count, distinct, distinct_all,
    distinct_at, distinct_if, filter, filter_all, filter_at, filter_if,
    full_join, group_by, group_by_all, group_by_at, group_by_if,
    inner_join, left_join, mutate, mutate_all, mutate_at, mutate_if,
    relocate, rename, rename_all, rename_at, rename_if, rename_with,
    right_join, sample_frac, sample_n, select, select_all, select_at,
    select_if, semi_join, slice, slice_head, slice_max, slice_min,
    slice_sample, slice_tail, summarise, summarise_all, summarise_at,
    summarise_if, summarize, summarize_all, summarize_at, summarize_if,
    tally, top_frac, top_n, transmute, transmute_all, transmute_at,
    transmute_if, ungroup
The following objects are masked from 'package:tidyr':

    drop_na, fill, gather, pivot_longer, pivot_wider, replace_na,
    spread, uncount
The following object is masked from 'package:stats':

    filter
left_join(base, visits)
Joining with `by = join_by(id)`
left_join: added 2 columns (visit, Outcome)
           > rows only in x    2
           > rows only in y  ( 1)
           > matched rows     24    (includes duplicates)
           >                 ====
           > rows total       26
# A tibble: 26 × 4
      id   Age visit Outcome
   <dbl> <dbl> <dbl>   <dbl>
 1     1  55       1    10  
 2     1  55       3    23.3
 3     1  55       2    36.7
 4     2  55.6     2    11.7
 5     2  55.6     1    25  
 6     2  55.6     3    38.3
 7     3  56.1     3    13.3
 8     3  56.1     2    26.7
 9     3  56.1     1    40  
10     4  56.7     1    15  
# ℹ 16 more rows

Right Join

rj = right_join(base, visits)
Joining with `by = join_by(id)`
right_join: added 2 columns (visit, Outcome)
> rows only in x ( 2)
> rows only in y 1
> matched rows 24
> ====
> rows total 25
dim(rj)
[1] 25  4
tail(rj)
# A tibble: 6 × 4
     id   Age visit Outcome
  <dbl> <dbl> <dbl>   <dbl>
1     7  58.3     3    33.3
2     7  58.3     2    46.7
3     8  58.9     2    21.7
4     8  58.9     1    35  
5     8  58.9     3    48.3
6    11  NA       3    50  

Right Join: Switching Arguments

rj2 = right_join(visits, base)
Joining with `by = join_by(id)`
right_join: added one column (Age)
> rows only in x ( 1)
> rows only in y 2
> matched rows 24 (includes duplicates)
> ====
> rows total 26
dim(rj2)
[1] 26  4
tail(rj2)
# A tibble: 6 × 4
     id visit Outcome   Age
  <dbl> <dbl>   <dbl> <dbl>
1     5     3    43.3  57.2
2     6     1    45    57.8
3     7     2    46.7  58.3
4     8     3    48.3  58.9
5     9    NA    NA    59.4
6    10    NA    NA    60  

Full Join

fj = full_join(base, visits)
Joining with `by = join_by(id)`
full_join: added 2 columns (visit, Outcome)
> rows only in x 2
> rows only in y 1
> matched rows 24 (includes duplicates)
> ====
> rows total 27
dim(fj)
[1] 27  4
tail(fj)
# A tibble: 6 × 4
     id   Age visit Outcome
  <dbl> <dbl> <dbl>   <dbl>
1     8  58.9     2    21.7
2     8  58.9     1    35  
3     8  58.9     3    48.3
4     9  59.4    NA    NA  
5    10  60      NA    NA  
6    11  NA       3    50  

By default, these methods will join based on all columns that have the same name between the two datasets. If we want to merge based on specific columns or if the columns have different names between the datasets, we will need to use the by argument. For example, let’s change the id column to id_key in the visits dataset.

visits2 = rename(visits, id_key = id)
rename: renamed one variable (id_key)
full_join(base, visits2)
Error in `.fun()`:
! `by` must be supplied when `x` and `y` have no common
  variables.
ℹ Use `cross_join()` to perform a cross-join.
fj2 = full_join(base, visits2 , by = c("id" = "id_key"))
full_join: added 2 columns (visit, Outcome)
           > rows only in x    2
           > rows only in y    1
           > matched rows     24    (includes duplicates)
           >                 ====
           > rows total       27
tail(fj2)
# A tibble: 6 × 4
     id   Age visit Outcome
  <dbl> <dbl> <dbl>   <dbl>
1     8  58.9     2    21.7
2     8  58.9     1    35  
3     8  58.9     3    48.3
4     9  59.4    NA    NA  
5    10  60      NA    NA  
6    11  NA       3    50  
identical(fj, fj2)
[1] TRUE

The duplicated command can give you indications if there are duplicates in a vector:

duplicated(1:5)
[1] FALSE FALSE FALSE FALSE FALSE
duplicated(c(1:5, 1))
[1] FALSE FALSE FALSE FALSE FALSE  TRUE
fj %>% mutate(dup_id = duplicated(id))
mutate: new variable 'dup_id' (logical) with 2 unique values and 0% NA
# A tibble: 27 × 5
      id   Age visit Outcome dup_id
   <dbl> <dbl> <dbl>   <dbl> <lgl> 
 1     1  55       1    10   FALSE 
 2     1  55       3    23.3 TRUE  
 3     1  55       2    36.7 TRUE  
 4     2  55.6     2    11.7 FALSE 
 5     2  55.6     1    25   TRUE  
 6     2  55.6     3    38.3 TRUE  
 7     3  56.1     3    13.3 FALSE 
 8     3  56.1     2    26.7 TRUE  
 9     3  56.1     1    40   TRUE  
10     4  56.7     1    15   FALSE 
# ℹ 17 more rows

9.2.1 Merging in Base R (Not Covered)

  • merge() is the most common way to do this with data sets in base R
  • rbind()/cbind() - row/column bind, respectively
    • rbind is the equivalent of appending in STATA or “setting” in SAS
    • cbind allows you to add columns in addition to the previous ways.
  • t() can transpose data but doesn’t make it a data.frame
merged.data <- merge(base, visits, by = "id") #inner join
head(merged.data, 5)
  id      Age visit  Outcome
1  1 55.00000     1 10.00000
2  1 55.00000     3 23.33333
3  1 55.00000     2 36.66667
4  2 55.55556     2 11.66667
5  2 55.55556     1 25.00000
dim(merged.data)
[1] 24  4
all.data <- merge(base, visits, by = "id", all = TRUE) # jull join
tail(all.data)
   id      Age visit  Outcome
22  8 58.88889     2 21.66667
23  8 58.88889     1 35.00000
24  8 58.88889     3 48.33333
25  9 59.44444    NA       NA
26 10 60.00000    NA       NA
27 11       NA     3 50.00000
dim(all.data)
[1] 27  4

9.3 Exercises

For these exercises, we will use the following datasets

  • the wide bike lane datasets, Bike_Lanes_Wide.csv which will be called wide bike lanes dataset, Bike_Lanes.csv, from the Baltimore Department of Transportation’s bike program (source). The R dataset containing Bike_Lanes.csv will be referred to as bike in the questions below.
  • the roads.csv and crashes.csv which will be referred to as crash and road.
  1. Read in the Bike_Lanes_Wide.csv dataset and call is wide.
  2. Reshape wide using pivot_longer. Call this data long. Make the key lanetype, and the value the_length. Make sure we gather all columns but name, using -name. Note the NAs here.
  3. Read in the roads and crashes .csv files and call them road and crash.
  4. Replace (using str_replace) any hyphens (-) with a space in crash$Road. Call this data crash2. Table the Road variable.
  5. How many observations are in each dataset?
  6. Separate the Road column (using separate) into (type and number) in crash2. Reassign this to crash2. Table crash2$type. Then create a new variable calling it road_hyphen using the unite function. Unite the type and number columns using a hyphen (-) and then table road_hyphen.
  7. Which and how many years were data collected in the crash dataset?
  8. Read in the dataset Bike_Lanes.csv and call it bike.
  9. Keep rows where the record is not missing type and not missing name and re-assign the output to bike.
  10. Summarize and group the data by grouping name and type (i.e for each type within each name) and take the sum of the length (reassign the sum of the lengths to the length variable). Call this data set sub.
  11. Reshape sub using pivot_wider. Spread the data where the key is type and we want the value in the new columns to be length - the bike lane length. Call this wide2. Look at the column names of wide2 - what are they? (they also have spaces).
  12. Join data in the crash and road datasets to retain only complete data, (using an inner join) e.g. those observations with road lengths and districts. Merge without using by argument, then merge using by = "Road". call the output merged. How many observations are there?
  13. Join data using a full_join. Call the output full. How many observations are there?
  14. Do a left join of the road and crash. ORDER matters here! How many observations are there?
  15. Repeat above with a right_join with the same order of the arguments. How many observations are there?