Chapter 9 Manipulating Data in R
In this section, we will show you how to:
- Reshaping data from wide (fat) to long (tall)
- Reshaping data from long (tall) to wide (fat)
- Merging Data
We will show you how to do each operation in base R then show you how to use the dplyr
or tidyr
package to do the same operation (if applicable).
See the “Data Wrangling Cheat Sheet using dplyr
and tidyr
”:
In this section, we will use the dataset Charm_City_Circulator_Ridership.csv.
library(readr)
library(tidyverse)
circ = read_csv("data/Charm_City_Circulator_Ridership.csv")
head(circ, 2)
# A tibble: 2 × 15
day date orangeBoardings orangeAlightings orangeAverage purpleBoardings
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Monday 01/11/… 877 1027 952 NA
2 Tuesday 01/12/… 777 815 796 NA
# ℹ 9 more variables: purpleAlightings <dbl>, purpleAverage <dbl>,
# greenBoardings <dbl>, greenAlightings <dbl>, greenAverage <dbl>,
# bannerBoardings <dbl>, bannerAlightings <dbl>, bannerAverage <dbl>,
# daily <dbl>
[1] "character"
[1] 0
[1] 0
[1] 0
[1] "2010-01-11" "2010-01-12" "2010-01-13"
[1] "Date"
9.1 Reshaping Data
What is wide/long data?
See http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/
Wide = multiple columns per observation
- e.g. visit1, visit2, visit3
id visit1 visit2 visit3
1 1 10 4 3
2 2 5 6 NA
Long = multiple rows per observation
id visit value
1 1 1 10
2 1 2 4
3 1 3 3
4 2 1 5
5 2 2 6
More accurately, data is wide or long with respect to certain variables.
In base R, the reshape
command exists. It is a confusing function. Don’t use it.
tidyr
allows you to “tidy” your data. We will be talking about:
pivot_longer
- make multiple columns into variables, (wide to long)pivot_wider
- make a variable into multiple columns, (long to wide)separate
- string into multiple columnsunite
- multiple columns into one string
9.1.1 Reshaping Data from Wide (Fat) to Long (Tall)
To change from wide to long use tidyr::pivot_longer
, which puts column data into rows.
We want the column names into “var
” variable in the output dataset and the value in “number
” variable. We then describe which columns we want to “gather:”
In the following example, we take the columns corresponding to the different bus ridership counts, e.g. orangeBoardings, orangeAlightings, etc., which are currently in one row for each day, and reshape the data so that each count gets its own row, i.e. there will be a row for the Monday 2010-01-11 orangeBoardings ridership count, a row for the Monday 2010-01-11 orangeAlightings ridership count, etc.
# A tibble: 6 × 15
day date orangeBoardings orangeAlightings orangeAverage
<chr> <date> <dbl> <dbl> <dbl>
1 Monday 2010-01-11 877 1027 952
2 Tuesday 2010-01-12 777 815 796
3 Wednesday 2010-01-13 1203 1220 1212.
4 Thursday 2010-01-14 1194 1233 1214.
5 Friday 2010-01-15 1645 1643 1644
6 Saturday 2010-01-16 1457 1524 1490.
# ℹ 10 more variables: purpleBoardings <dbl>, purpleAlightings <dbl>,
# purpleAverage <dbl>, greenBoardings <dbl>, greenAlightings <dbl>,
# greenAverage <dbl>, bannerBoardings <dbl>, bannerAlightings <dbl>,
# bannerAverage <dbl>, daily <dbl>
long = pivot_longer(circ, names_to = "var", values_to = "number",
c(-day, -date, -daily))
head(arrange(long, date), 4)
# A tibble: 4 × 5
day date daily var number
<chr> <date> <dbl> <chr> <dbl>
1 Monday 2010-01-11 952 orangeBoardings 877
2 Monday 2010-01-11 952 orangeAlightings 1027
3 Monday 2010-01-11 952 orangeAverage 952
4 Monday 2010-01-11 952 purpleBoardings NA
Note that the number of columns that we wanted to gather was much more than the columns we didn’t, so we used the -
to tell pivot_longer
the columns that we didn’t want to gather instead. Alternatively, we could have been explicit in what we wanted to gather.
# A tibble: 6 × 15
day date orangeBoardings orangeAlightings orangeAverage
<chr> <date> <dbl> <dbl> <dbl>
1 Monday 2010-01-11 877 1027 952
2 Tuesday 2010-01-12 777 815 796
3 Wednesday 2010-01-13 1203 1220 1212.
4 Thursday 2010-01-14 1194 1233 1214.
5 Friday 2010-01-15 1645 1643 1644
6 Saturday 2010-01-16 1457 1524 1490.
# ℹ 10 more variables: purpleBoardings <dbl>, purpleAlightings <dbl>,
# purpleAverage <dbl>, greenBoardings <dbl>, greenAlightings <dbl>,
# greenAverage <dbl>, bannerBoardings <dbl>, bannerAlightings <dbl>,
# bannerAverage <dbl>, daily <dbl>
long = pivot_longer(circ, names_to = "var", values_to = "number",
c(starts_with("orange"), starts_with("purple"),
starts_with("green"), starts_with("banner")))
head(arrange(long, date), 4)
# A tibble: 4 × 5
day date daily var number
<chr> <date> <dbl> <chr> <dbl>
1 Monday 2010-01-11 952 orangeBoardings 877
2 Monday 2010-01-11 952 orangeAlightings 1027
3 Monday 2010-01-11 952 orangeAverage 952
4 Monday 2010-01-11 952 purpleBoardings NA
[1] 1146 15
# A tibble: 12 × 2
var n
<chr> <int>
1 bannerAlightings 1146
2 bannerAverage 1146
3 bannerBoardings 1146
4 greenAlightings 1146
5 greenAverage 1146
6 greenBoardings 1146
7 orangeAlightings 1146
8 orangeAverage 1146
9 orangeBoardings 1146
10 purpleAlightings 1146
11 purpleAverage 1146
12 purpleBoardings 1146
After converting from wide to long, we may want to adjust the new data made from the columns names. For example, with the bus ridership data the former column names described the bus line, orange, purple, green, or banner, and the type of count being recording, Boardings = getting on the bus, Alightings = getting off the bus, and Average = average of boardings and alightings counts. Lets separate the new var
column into two columns line
and type
. We will use str_replace
from stringr
to put _
in the names.
long = long %>% mutate(
var = var %>%
str_replace("Board", "_Board") %>%
str_replace("Alight", "_Alight") %>%
str_replace("Average", "_Average")
)
long %>% count(var)
# A tibble: 12 × 2
var n
<chr> <int>
1 banner_Alightings 1146
2 banner_Average 1146
3 banner_Boardings 1146
4 green_Alightings 1146
5 green_Average 1146
6 green_Boardings 1146
7 orange_Alightings 1146
8 orange_Average 1146
9 orange_Boardings 1146
10 purple_Alightings 1146
11 purple_Average 1146
12 purple_Boardings 1146
# A tibble: 2 × 6
day date daily line type number
<chr> <date> <dbl> <chr> <chr> <dbl>
1 Monday 2010-01-11 952 orange Boardings 877
2 Monday 2010-01-11 952 orange Alightings 1027
[1] "orange" "purple" "green" "banner"
[1] "Boardings" "Alightings" "Average"
We can also reunite columns using the unite
function if we wanted to put them back together.
reunited = long %>%
unite(col = var, line, type, sep = "_")
reunited %>% select(day, var) %>% head(3) %>% print
# A tibble: 3 × 2
day var
<chr> <chr>
1 Monday orange_Boardings
2 Monday orange_Alightings
3 Monday orange_Average
We could also use paste
/paste0
.
9.1.2 Reshaping Data From Long (Tall) to Wide (Fat)
In tidyr
, the pivot_wider
function spreads rows into columns. Now we have a long data set, but we want to separate the Average, Alightings and Boardings into different columns:
# A tibble: 6 × 7
day date daily line Boardings Alightings Average
<chr> <date> <dbl> <chr> <dbl> <dbl> <dbl>
1 Monday 2010-01-11 952 orange 877 1027 952
2 Monday 2010-01-11 952 purple NA NA NA
3 Monday 2010-01-11 952 green NA NA NA
4 Monday 2010-01-11 952 banner NA NA NA
5 Tuesday 2010-01-12 796 orange 777 815 796
6 Tuesday 2010-01-12 796 purple NA NA NA
Note that there are several rows that are all NA
because that bus line did not run on that particular date. We can use rowSums
to see if any values in the row is NA
and keep if the row, which is a combination of date and line type has any non-missing data.
# A tibble: 3 × 7
day date daily line Boardings Alightings Average
<chr> <date> <dbl> <chr> <dbl> <dbl> <dbl>
1 Monday 2010-01-11 952 orange 877 1027 952
2 Monday 2010-01-11 952 purple NA NA NA
3 Monday 2010-01-11 952 green NA NA NA
Alightings Average Boardings
[1,] TRUE TRUE TRUE
[2,] FALSE FALSE FALSE
Now we can filter only the good rows and delete the good
column.
# A tibble: 6 × 7
day date daily line Boardings Alightings Average
<chr> <date> <dbl> <chr> <dbl> <dbl> <dbl>
1 Monday 2010-01-11 952 orange 877 1027 952
2 Tuesday 2010-01-12 796 orange 777 815 796
3 Wednesday 2010-01-13 1212. orange 1203 1220 1212.
4 Thursday 2010-01-14 1214. orange 1194 1233 1214.
5 Friday 2010-01-15 1644 orange 1645 1643 1644
6 Saturday 2010-01-16 1490. orange 1457 1524 1490.
Slice allows you to select records (compared to first/last on a vector). The following code gets the first and last boarding records for each bus line.
long = long %>% filter(!is.na(number) & number > 0)
first_and_last = long %>% arrange(date) %>% # arrange by date
filter(type == "Boardings") %>% # keep boardings only
group_by(line) %>% # group by line
slice( c(1, n())) # select ("slice") first and last (n() command) lines
first_and_last %>% head(4)
# A tibble: 4 × 6
# Groups: line [2]
day date daily line type number
<chr> <date> <dbl> <chr> <chr> <dbl>
1 Monday 2012-06-04 13342. banner Boardings 520
2 Friday 2013-03-01 NA banner Boardings 817
3 Tuesday 2011-11-01 8873 green Boardings 887
4 Friday 2013-03-01 NA green Boardings 2592
9.2 Merging Datasets
base
has the baseline data for ids 1
to 10
and age.
# A tibble: 2 × 2
id Age
<int> <dbl>
1 1 55
2 2 55.6
visits
has ids 1
to 8
and 11
(a new id), with 3 visits and the visit outcome measurement.
visits <- tibble(id = c(rep(1:8, 3), 11), visit= c(rep(1:3, 8), 3),
Outcome = seq(10,50, length=25))
head(visits, 2)
# A tibble: 2 × 3
id visit Outcome
<dbl> <dbl> <dbl>
1 1 1 10
2 2 2 11.7
Joining in dplyr
:
- Merging/joining data sets together - usually on key variables, usually “id”
?join
- see different types of joining fordplyr
- Let’s look at https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
inner_join(x, y)
- only rows that match forx
andy
are keptfull_join(x, y)
- all rows ofx
andy
are keptleft_join(x, y)
- all rows ofx
are kept even if not merged withy
right_join(x, y)
- all rows ofy
are kept even if not merged withx
Inner Join
Joining with `by = join_by(id)`
[1] 24 4
# A tibble: 6 × 4
id Age visit Outcome
<dbl> <dbl> <dbl> <dbl>
1 7 58.3 1 20
2 7 58.3 3 33.3
3 7 58.3 2 46.7
4 8 58.9 2 21.7
5 8 58.9 1 35
6 8 58.9 3 48.3
Left Join
Joining with `by = join_by(id)`
[1] 26 4
# A tibble: 6 × 4
id Age visit Outcome
<dbl> <dbl> <dbl> <dbl>
1 7 58.3 2 46.7
2 8 58.9 2 21.7
3 8 58.9 1 35
4 8 58.9 3 48.3
5 9 59.4 NA NA
6 10 60 NA NA
The tidylog
package can show you log outputs from dplyr
(newly added). You will need to install to use.
Warning: package 'tidylog' was built under R version 4.3.2
Attaching package: 'tidylog'
The following objects are masked from 'package:dplyr':
add_count, add_tally, anti_join, count, distinct, distinct_all,
distinct_at, distinct_if, filter, filter_all, filter_at, filter_if,
full_join, group_by, group_by_all, group_by_at, group_by_if,
inner_join, left_join, mutate, mutate_all, mutate_at, mutate_if,
relocate, rename, rename_all, rename_at, rename_if, rename_with,
right_join, sample_frac, sample_n, select, select_all, select_at,
select_if, semi_join, slice, slice_head, slice_max, slice_min,
slice_sample, slice_tail, summarise, summarise_all, summarise_at,
summarise_if, summarize, summarize_all, summarize_at, summarize_if,
tally, top_frac, top_n, transmute, transmute_all, transmute_at,
transmute_if, ungroup
The following objects are masked from 'package:tidyr':
drop_na, fill, gather, pivot_longer, pivot_wider, replace_na,
spread, uncount
The following object is masked from 'package:stats':
filter
Joining with `by = join_by(id)`
left_join: added 2 columns (visit, Outcome)
> rows only in x 2
> rows only in y ( 1)
> matched rows 24 (includes duplicates)
> ====
> rows total 26
# A tibble: 26 × 4
id Age visit Outcome
<dbl> <dbl> <dbl> <dbl>
1 1 55 1 10
2 1 55 3 23.3
3 1 55 2 36.7
4 2 55.6 2 11.7
5 2 55.6 1 25
6 2 55.6 3 38.3
7 3 56.1 3 13.3
8 3 56.1 2 26.7
9 3 56.1 1 40
10 4 56.7 1 15
# ℹ 16 more rows
Right Join
Joining with `by = join_by(id)`
right_join: added 2 columns (visit, Outcome)
> rows only in x ( 2)
> rows only in y 1
> matched rows 24
> ====
> rows total 25
[1] 25 4
# A tibble: 6 × 4
id Age visit Outcome
<dbl> <dbl> <dbl> <dbl>
1 7 58.3 3 33.3
2 7 58.3 2 46.7
3 8 58.9 2 21.7
4 8 58.9 1 35
5 8 58.9 3 48.3
6 11 NA 3 50
Right Join: Switching Arguments
Joining with `by = join_by(id)`
right_join: added one column (Age)
> rows only in x ( 1)
> rows only in y 2
> matched rows 24 (includes duplicates)
> ====
> rows total 26
[1] 26 4
# A tibble: 6 × 4
id visit Outcome Age
<dbl> <dbl> <dbl> <dbl>
1 5 3 43.3 57.2
2 6 1 45 57.8
3 7 2 46.7 58.3
4 8 3 48.3 58.9
5 9 NA NA 59.4
6 10 NA NA 60
Full Join
Joining with `by = join_by(id)`
full_join: added 2 columns (visit, Outcome)
> rows only in x 2
> rows only in y 1
> matched rows 24 (includes duplicates)
> ====
> rows total 27
[1] 27 4
# A tibble: 6 × 4
id Age visit Outcome
<dbl> <dbl> <dbl> <dbl>
1 8 58.9 2 21.7
2 8 58.9 1 35
3 8 58.9 3 48.3
4 9 59.4 NA NA
5 10 60 NA NA
6 11 NA 3 50
By default, these methods will join based on all columns that have the same name between the two datasets. If we want to merge based on specific columns or if the columns have different names between the datasets, we will need to use the by
argument. For example, let’s change the id
column to id_key
in the visits
dataset.
rename: renamed one variable (id_key)
Error in `.fun()`:
! `by` must be supplied when `x` and `y` have no common
variables.
ℹ Use `cross_join()` to perform a cross-join.
full_join: added 2 columns (visit, Outcome)
> rows only in x 2
> rows only in y 1
> matched rows 24 (includes duplicates)
> ====
> rows total 27
# A tibble: 6 × 4
id Age visit Outcome
<dbl> <dbl> <dbl> <dbl>
1 8 58.9 2 21.7
2 8 58.9 1 35
3 8 58.9 3 48.3
4 9 59.4 NA NA
5 10 60 NA NA
6 11 NA 3 50
[1] TRUE
The duplicated
command can give you indications if there are duplicates in a vector:
[1] FALSE FALSE FALSE FALSE FALSE
[1] FALSE FALSE FALSE FALSE FALSE TRUE
mutate: new variable 'dup_id' (logical) with 2 unique values and 0% NA
# A tibble: 27 × 5
id Age visit Outcome dup_id
<dbl> <dbl> <dbl> <dbl> <lgl>
1 1 55 1 10 FALSE
2 1 55 3 23.3 TRUE
3 1 55 2 36.7 TRUE
4 2 55.6 2 11.7 FALSE
5 2 55.6 1 25 TRUE
6 2 55.6 3 38.3 TRUE
7 3 56.1 3 13.3 FALSE
8 3 56.1 2 26.7 TRUE
9 3 56.1 1 40 TRUE
10 4 56.7 1 15 FALSE
# ℹ 17 more rows
9.2.1 Merging in Base R (Not Covered)
merge()
is the most common way to do this with data sets in base Rrbind()
/cbind()
- row/column bind, respectivelyrbind
is the equivalent of appending in STATA or “setting” in SAScbind
allows you to add columns in addition to the previous ways.
t()
can transpose data but doesn’t make it adata.frame
id Age visit Outcome
1 1 55.00000 1 10.00000
2 1 55.00000 3 23.33333
3 1 55.00000 2 36.66667
4 2 55.55556 2 11.66667
5 2 55.55556 1 25.00000
[1] 24 4
id Age visit Outcome
22 8 58.88889 2 21.66667
23 8 58.88889 1 35.00000
24 8 58.88889 3 48.33333
25 9 59.44444 NA NA
26 10 60.00000 NA NA
27 11 NA 3 50.00000
[1] 27 4
9.3 Exercises
For these exercises, we will use the following datasets
- the wide bike lane datasets, Bike_Lanes_Wide.csv which will be called
wide
bike lanes dataset, Bike_Lanes.csv, from the Baltimore Department of Transportation’s bike program (source). The R dataset containing Bike_Lanes.csv will be referred to asbike
in the questions below. - the roads.csv and crashes.csv which will be referred to as
crash
androad
.
- Read in the Bike_Lanes_Wide.csv dataset and call is
wide
. - Reshape
wide
usingpivot_longer
. Call this datalong
. Make the keylanetype
, and the valuethe_length
. Make sure we gather all columns butname
, using-name
. Note theNAs
here. - Read in the roads and crashes .csv files and call them
road
andcrash
. - Replace (using
str_replace
) any hyphens (-
) with a space incrash$Road
. Call this datacrash2
. Table theRoad
variable. - How many observations are in each dataset?
- Separate the
Road
column (usingseparate
) into (type
andnumber
) incrash2
. Reassign this tocrash2
. Tablecrash2$type
. Then create a new variable calling itroad_hyphen
using theunite
function. Unite thetype
andnumber
columns using a hyphen (-
) and then tableroad_hyphen
. - Which and how many years were data collected in the
crash
dataset? - Read in the dataset Bike_Lanes.csv and call it
bike
. - Keep rows where the record is not missing
type
and not missingname
and re-assign the output tobike
. - Summarize and group the data by grouping
name
andtype
(i.e for each type within each name) and take thesum
of thelength
(reassign the sum of the lengths to thelength
variable). Call this data setsub
. - Reshape
sub
usingpivot_wider
. Spread the data where the key istype
and we want the value in the new columns to belength
- the bike lane length. Call thiswide2
. Look at the column names ofwide2
- what are they? (they also have spaces). - Join data in the crash and road datasets to retain only complete data, (using an inner join) e.g. those observations with road lengths and districts. Merge without using
by
argument, then merge usingby = "Road"
. call the output merged. How many observations are there? - Join data using a
full_join
. Call the outputfull
. How many observations are there? - Do a left join of the
road
andcrash
. ORDER matters here! How many observations are there? - Repeat above with a
right_join
with the same order of the arguments. How many observations are there?