Chapter 8 Data Cleaning

In general, data cleaning is a process of investigating your data for inaccuracies, or recoding it in a way that makes it more manageable. In this lesson, we will focus on checking for missing data and manipulated strings.

THE MOST IMPORTANT RULE - LOOK AT YOUR DATA!

We will use the Baltimore city employee salary dataset (source), Baltimore_City_Employee_Salaries_FY2015.csv.

library(readr)
library(tidyverse)
Sal = read_csv("./data/Baltimore_City_Employee_Salaries_FY2015.csv")
Sal = rename(Sal, Name = name)

8.1 Dealing with Missing Data

One of the most important aspects of data cleaning is missing values.

Some useful functions for checking your data include

  • is.na - is TRUE if the data is NA FALSE otherwise
  • ! - negation (NOT)
    • if is.na(x) is TRUE, then !is.na(x) is FALSE
  • all takes in a logical and will be TRUE if ALL are TRUE
    • all(!is.na(x)) - are all values of x NOT NA
  • any will be TRUE if ANY are true
    • any(is.na(x)) - do we have any NA’s in x?
  • complete.cases - returns TRUE if EVERY value of a row is NOT NA
    • very stringent condition
    • FALSE missing one value (even if not important)

Types of “missing” data in R include:

  • NA - general missing data
  • NaN - stands for “Not a Number”, happens when you do 0/0.
  • Inf and -Inf - Infinity, happens when you divide a positive number (or negative number) by 0.

We need to be able to check for these R missing types in our data. Each missing data type has a function that returns TRUE if the data is missing:

  • NA - is.na
  • NaN - is.nan
  • Inf and -Inf - is.infinite
  • is.finite returns FALSE for all missing data and TRUE for non-missing

One important aspect (esp with subsetting) is that logical operations return NA for NA values. Think about it, the data could be > 2 or not. We don’t know, so R says there is no TRUE or FALSE, so that is missing:

x = c(0, NA, 2, 3, 4, -0.5, 0.2)
x > 2
[1] FALSE    NA FALSE  TRUE  TRUE FALSE FALSE

So what do we do to handle the NA? What if we want if x > 2 and x isn’t NA?
Don’t do x != NA, do x > 2 and x is NOT NA:

x != NA # WRONG!
[1] NA NA NA NA NA NA NA
x > 2 & !is.na(x)
[1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE

What about seeing if a value is equal to multiple values? You can do (x == 1 | x == 2) & !is.na(x), but that is not efficient.

(x == 0 | x == 2) # has NA
[1]  TRUE    NA  TRUE FALSE FALSE FALSE FALSE
(x == 0 | x == 2) & !is.na(x) # No NA
[1]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE

So what do we do?

Filter removes missing values. If you want to keep the NAs, you have to keep them specifically selecting them using is.na:

df = data_frame(x = x)
df %>% filter(x > 2)
# A tibble: 2 × 1
      x
  <dbl>
1     3
2     4
filter(df, between(x, -1, 3) | is.na(x))
# A tibble: 6 × 1
      x
  <dbl>
1   0  
2  NA  
3   2  
4   3  
5  -0.5
6   0.2

Be careful with missing data when subsetting.

x
[1]  0.0   NA  2.0  3.0  4.0 -0.5  0.2
x %in% c(0, 2, NA) # NEVER returns NA. Returns logical. Don't do this
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
x %in% c(0, 2) # NEVER returns NA. Returns logical. Don't do this
[1]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE
x %in% c(0, 2) | is.na(x) # Do this
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE

Similarly with logicals, operations/arithmetic with NA will result in NAs:

x + 2
[1] 2.0  NA 4.0 5.0 6.0 1.5 2.2
x * 2
[1]  0.0   NA  4.0  6.0  8.0 -1.0  0.4

Some other useful checking functions that we have already seen are

  • unique - gives you the unique values of a variable
  • table(x) - will give a one-way table of x
    • table(x, useNA = "ifany") - will have row NA
  • table(x, y) - will give a cross-tab of x and y
  • df %>% count(x, y)
    • df %>% group_by(x, y) %>% tally()

We have already used the table function to create frequency tables of a categorical variable or a contingency table for two categorical variables. This function has a useful argument useNA with the possible values

  • "no - never show NA category values even if they exist
  • "ifany" - only add an NA category if some exist
  • "always" - include an NA category whether any exist or not
unique(x)
[1]  0.0   NA  2.0  3.0  4.0 -0.5  0.2
table(x)
x
-0.5    0  0.2    2    3    4 
   1    1    1    1    1    1 
table(x, useNA = "ifany") # Includes an NA category
x
-0.5    0  0.2    2    3    4 <NA> 
   1    1    1    1    1    1    1 
df = tibble(x = x)
df %>% count(x) #Note it counts NA's
# A tibble: 7 × 2
      x     n
  <dbl> <int>
1  -0.5     1
2   0       1
3   0.2     1
4   2       1
5   3       1
6   4       1
7  NA       1

Here is an example with no NAs.

df2 = tibble(x = c(0, 1, 2, 3, 2, 3, 3, 2, 2, 3))
table(df2$x, useNA = "always") # Includes NA category even though there are none

   0    1    2    3 <NA> 
   1    1    4    4    0 

If you use a factor, all levels will be given even if it does not appear in the data! (May be wanted or not):

fac = factor(c(0, 1, 2, 3, 2, 3, 3, 2,2, 3),
             levels = 1:4)
tab = table(fac) 
tab # Shows level 4 even though it does not appear in the observed data
fac
1 2 3 4 
1 4 4 0 
tab[ tab > 0 ] # Only shows categories with observed data
fac
1 2 3 
1 4 4 

It is often useful to check logical conditions using

  • any() - checks if there are any TRUEs
  • all() - checks if ALL are true
head(Sal,2)
# A tibble: 2 × 7
  Name             JobTitle       AgencyID Agency HireDate AnnualSalary GrossPay
  <chr>            <chr>          <chr>    <chr>  <chr>    <chr>        <chr>   
1 Aaron,Patricia G Facilities/Of… A03031   OED-E… 10/24/1… $55314.00    $53626.…
2 Aaron,Petra L    ASSISTANT STA… A29045   State… 09/25/2… $74000.00    $73000.…
any(is.na(Sal$Name)) # are there any NAs?
[1] FALSE

We get FALSE, so there are no observations with a missing name.

8.2 Strings and Recoding Variables

Let’s say gender is a variable in our dataset, but we have combined data from multiple datasets or there were multiple individuals who entered data. It may be the case that the same encoding for male and female were not used across these dataset/individuals. For example, gender may have been coded as Male, M, m, Female, F, f. Using excel to find all of these would be a matter of filtering and changing all by hand or using if statements.

In dplyr, you can use the recode function:

data = data %>% 
  mutate(gender = recode(gender, 
                         M = "Male", m = "Male", M = "Male"))

or use ifelse:

data %>% 
  mutate(gender = ifelse(gender %in% c("Male", "M", "m"),
                         "Male", gender))

This, however, assumes that we know all the patterns that we are looking for and that there are not very many of them. Sometimes though, it’s not so simple. That’s where functions that find patterns come in very useful. For example, what if our gender variable’s data looked like this?

set.seed(4) # random sample below - make sure same every time
gender <- sample(c("Male", "mAle", "MaLe", "M", "MALE", "Ma", "FeMAle", "F", "Woman", "Man", "Fm", "FEMALE"), 1000, replace = TRUE)
table(gender)
gender
     F FeMAle FEMALE     Fm      M     Ma   mAle   Male   MaLe   MALE    Man 
    80     88     76     87     99     76     84     83     79     93     84 
 Woman 
    71 

We need a way of splitting, finding, replacing strings and searching strings by using regular expressions. R can do much more than find exact matches for a whole string. Like Perl and other languages, it can use regular expressions. What are regular expressions?

  • Ways to search for specific strings
  • Can be very complicated or simple
  • Highly Useful - think “Find” on steroids
  • http://www.regular-expressions.info/reference.html
  • They can use to match a large number of strings in one statement
  • . matches any single character
  • * means repeat as many (even if 0) more times the last character
  • ? makes the last thing optional
  • ^ matches start of vector ^a - starts with “a”
  • $ matches end of vector b$ - ends with “b”

We will use the stringr package, which is part of the tidyverse.

  • Makes string manipulation more intuitive
  • Has a standard format for most functions: the first argument is a string, like first argument is a data.frame in dplyr functions
  • We will not cover grep or gsub - base R functions are used on forums for answers
  • Almost all functions in stringr start with str_

Let’s look at some modifiers for stringr function arguments. See ?modifiers with the stringr package loaded.

  • fixed - match everything exactly
  • regexp - default - uses regular expressions
  • ignore_case is an option to not have to use tolower

8.2.1 Substring and String Splitting

To get a substring or split a string with stringr we use

  • str_sub(x, start, end) - substrings from position start to position end
  • str_split(string, pattern) - splits strings up - returns list!
library(stringr)
x <- c("I really", "like writing", "R code programs")
y <- str_split(x, " ") # returns a list
y
[[1]]
[1] "I"      "really"

[[2]]
[1] "like"    "writing"

[[3]]
[1] "R"        "code"     "programs"

When we pass string expressions to search find, in this case to split by, we have to be aware of special characters in regular expressions, such as "." which is the special character for match anything in regular expressions.

str_split("I.like.strings", ".") #splits on every character
[[1]]
 [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
str_split("I.like.strings", fixed(".")) #splits on .
[[1]]
[1] "I"       "like"    "strings"

Let’s extract from y.

y
[[1]]
[1] "I"      "really"

[[2]]
[1] "like"    "writing"

[[3]]
[1] "R"        "code"     "programs"
y[[2]]
[1] "like"    "writing"
sapply(y, dplyr::first) # on the fly grab first string
[1] "I"    "like" "R"   
sapply(y, nth, 2) # on the fly grab 2nd string
[1] "really"  "writing" "code"   
sapply(y, last) # on the fly grab last string
[1] "really"   "writing"  "programs"

From tidyr, you can split a character column in a data set into multiple columns. By default it splits based on any sequence of non-alphanumeric characters.

df = tibble(x = c("I really", "like writing", "R code programs"))
df %>% separate(x, into = c("first", "second", "third"))
Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2 rows [1, 2].
# A tibble: 3 × 3
  first second  third   
  <chr> <chr>   <chr>   
1 I     really  <NA>    
2 like  writing <NA>    
3 R     code    programs
df %>% separate(x, into = c("first", "second"))
Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [3].
# A tibble: 3 × 2
  first second 
  <chr> <chr>  
1 I     really 
2 like  writing
3 R     code   
df = tibble(x = c("I really", "like. _writing R. But not", "R code programs"))
df %>% separate(x, into = c("first", "second", "third"), extra = "merge")
Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].
# A tibble: 3 × 3
  first second  third     
  <chr> <chr>   <chr>     
1 I     really  <NA>      
2 like  writing R. But not
3 R     code    programs  
df %>% separate(x, into = c("first", "second", "third"), extra = "merge", sep =  " ")
Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].
# A tibble: 3 × 3
  first second   third     
  <chr> <chr>    <chr>     
1 I     really   <NA>      
2 like. _writing R. But not
3 R     code     programs  

8.2.2 Find Functions in stringr

str_detect, str_subset, str_replace, and str_replace_all search for matches to argument pattern within each element of a character vector: they differ in the format of and amount of detail in the results.

  • str_detect - returns TRUE if pattern is found
  • str_subset - returns only the strings which pattern were detected
    • convenient wrapper around x[str_detect(x, pattern)]
  • str_extract - returns only strings which pattern were detected, but ONLY the pattern
  • str_replace - replaces pattern with replacement the first time
  • str_replace_all - replaces pattern with replacement as many times matched

str_detect returns TRUE in the indices where the pattern match occurs:

head(str_detect(Sal$Name, "Rawlings"))
[1] FALSE FALSE FALSE FALSE FALSE FALSE
which(str_detect(Sal$Name, "Rawlings"))
[1] 10256 10257 10258

str_extract returns a character string with either NA if the string was not found or the string that was searched for.

ss = str_extract(Sal$Name, "Rawling")
head(ss)
[1] NA NA NA NA NA NA
ss[ !is.na(ss)]
[1] "Rawling" "Rawling" "Rawling"

str_subset only returns the full string that contained the searched for substring.

str_subset(Sal$Name, "Rawlings")
[1] "Rawlings,Kellye A"          "Rawlings,Paula M"          
[3] "Rawlings-Blake,Stephanie C"
Sal %>% filter(str_detect(Name, "Rawlings"))
# A tibble: 3 × 7
  Name                   JobTitle AgencyID Agency HireDate AnnualSalary GrossPay
  <chr>                  <chr>    <chr>    <chr>  <chr>    <chr>        <chr>   
1 Rawlings,Kellye A      EMERGEN… A40302   M-R I… 01/06/2… $48940.00    $73356.…
2 Rawlings,Paula M       COMMUNI… A04015   R&P-R… 12/10/2… $19802.00    $10443.…
3 Rawlings-Blake,Stepha… MAYOR    A01001   Mayor… 12/07/1… $167449.00   $165249…

str_extract_all extracts all the matched strings * \\d searches for DIGITS/numbers

head(Sal$AgencyID)
[1] "A03031" "A29045" "A65026" "A99005" "A40001" "A90005"
head(str_extract(Sal$AgencyID, "\\d"))
[1] "0" "2" "6" "9" "4" "9"
head(str_extract_all(Sal$AgencyID, "\\d"), 2)
[[1]]
[1] "0" "3" "0" "3" "1"

[[2]]
[1] "2" "9" "0" "4" "5"

str_replace_all finds all cases of the matched strings and replaces them with new string provided. `str_replace only replaces the first occurrence of the matched string.

head(Sal$Name)
[1] "Aaron,Patricia G"     "Aaron,Petra L"        "Abaineh,Yohannes T"  
[4] "Abbene,Anthony M"     "Abbey,Emmanuel"       "Abbott-Cole,Michelle"
head(str_replace(Sal$Name, "a", "j"))
[1] "Ajron,Patricia G"     "Ajron,Petra L"        "Abjineh,Yohannes T"  
[4] "Abbene,Anthony M"     "Abbey,Emmjnuel"       "Abbott-Cole,Michelle"
head(str_replace_all(Sal$Name, "a", "j"), 2)
[1] "Ajron,Pjtricij G" "Ajron,Petrj L"   

Let’s say we wanted to sort the data set by Annual Salary:

class(Sal$AnnualSalary)
[1] "character"
head(Sal$AnnualSalary, 4)
[1] "$55314.00" "$74000.00" "$64500.00" "$46309.00"
head(as.numeric(Sal$AnnualSalary), 4)
Warning in head(as.numeric(Sal$AnnualSalary), 4): NAs introduced by coercion
[1] NA NA NA NA

R didn’t like the $ so it thought turned them all to NA, so we need to remove the $ first but remember it is a special character in regular expressions. So we can replace the $ with nothing (used fixed("$") because $ means ending):

Sal = Sal %>% mutate(
  AnnualSalary = str_replace(AnnualSalary, fixed("$"), ""),
  AnnualSalary = as.numeric(AnnualSalary)
  ) %>% 
  arrange(desc(AnnualSalary))
head(Sal$AnnualSalary)
[1] 238772 211785 200000 192500 187200 178900

8.2.3 Using Regular Expressions

Let’s look at a few simple examples of using regular expressions. Recall

  • http://www.regular-expressions.info/reference.html
  • They can use to match a large number of strings in one statement
  • . matches any single character
  • * means repeat as many (even if 0) more times the last character
  • ? makes the last thing optional
  • ^ matches start of vector ^a - starts with “a”
  • $ matches end of vector b$ - ends with “b”

Let’s look for any name that starts with

  • Payne at the beginning,
  • Leonard and then an S
  • Spence then capital C
# "^Payne.*" = starts with Payne followed by any number of any other character
head(str_subset( Sal$Name, "^Payne.*"), 3)
[1] "Payne,James R"  "Payne,Karen V"  "Payne,Jasman T"
# "Leonard.?S" = contains Leonard optionally followed by any single character and then an S
head(str_subset( Sal$Name, "Leonard.?S"))
[1] "Szumlanski,Leonard S" "Payne,Leonard S"     
# "Spence.*C.*" = contains Spence followed by any number of any character then C then any number of any character
head(str_subset( Sal$Name, "Spence.*C.*"))
[1] "Spencer,Michael C"  "Spencer,Clarence W" "Spencer,Charles A" 

8.2.4 Combining Strings

Paste can be very useful for joining vectors together:

paste("Visit", 1:5, sep = "_")
[1] "Visit_1" "Visit_2" "Visit_3" "Visit_4" "Visit_5"
paste("Visit", 1:5, sep = "_", collapse = " ")
[1] "Visit_1 Visit_2 Visit_3 Visit_4 Visit_5"
paste("To", "is going be the ", "we go to the store!", sep = "day ")
[1] "Today is going be the day we go to the store!"
# and paste0 can be even simpler see ?paste0 
paste0("Visit",1:5)
[1] "Visit1" "Visit2" "Visit3" "Visit4" "Visit5"
paste(1:5)
[1] "1" "2" "3" "4" "5"
paste(1:5, collapse = " ")
[1] "1 2 3 4 5"

From tidyr, you can unite two or more columns:

df = data_frame(id = rep(1:5, 3), visit = rep(1:3, each = 5))
df %>% unite(col = "unique_id", id, visit, sep = "_")
# A tibble: 15 × 1
   unique_id
   <chr>    
 1 1_1      
 2 2_1      
 3 3_1      
 4 4_1      
 5 5_1      
 6 1_2      
 7 2_2      
 8 3_2      
 9 4_2      
10 5_2      
11 1_3      
12 2_3      
13 3_3      
14 4_3      
15 5_3      
df %>% unite(col = "unique_id", id, visit, sep = "_", remove = FALSE)
# A tibble: 15 × 3
   unique_id    id visit
   <chr>     <int> <int>
 1 1_1           1     1
 2 2_1           2     1
 3 3_1           3     1
 4 4_1           4     1
 5 5_1           5     1
 6 1_2           1     2
 7 2_2           2     2
 8 3_2           3     2
 9 4_2           4     2
10 5_2           5     2
11 1_3           1     3
12 2_3           2     3
13 3_3           3     3
14 4_3           4     3
15 5_3           5     3

8.2.5 Useful String Functions

Some other useful string functions

  • toupper(), tolower() - uppercase or lowercase your data:
  • str_trim() (in the stringr package) or trimws in base: will trim leading and trailing whitespace
  • nchar - get the number of characters in a string
my_string = "   Robert      Parker       "
toupper(my_string)
[1] "   ROBERT      PARKER       "
tolower(my_string)
[1] "   robert      parker       "
str_trim(my_string)
[1] "Robert      Parker"
nchar(my_string)
[1] 28

Note that sorting on strings may not work as expected.

  • sort - reorders the data - characters work, but not correctly
  • rank - gives the rank of the data - ties are split
  • order - gives the indices, if subset, would give the data sorted
    • x[order(x)] is the same as sorting
sort(c("1", "2", "10")) #  not sorted correctly
[1] "1"  "10" "2" 
order(c("1", "2", "10"))
[1] 1 3 2
x = rnorm(10)
x[1] = x[2] # create a tie
rank(x)
 [1]  2.5  2.5 10.0  7.0  4.0  1.0  8.0  5.0  9.0  6.0

8.2.6 Comparison of stringr to base R (not covered)

The material in this section will show you the base R functions to manipulate strings that match the stringr functions we learned above. Though we won’t use these functions, it is useful to recognize these as they may be used in examples on help forums or just in general by other R users.

Substringing is very similar:

Base R

  • substr(x, start, stop) - substrings from position start to position stop
  • strsplit(x, split) - splits strings up - returns list!

stringr

  • str_sub(x, start, end) - substrings from position start to position end
  • str_split(string, pattern) - splits strings up - returns list!

In base R, strsplit splits a vector on a string into a list

x <- c("I really", "like writing", "R code programs")
y <- strsplit(x, split = " ") # returns a list
y
[[1]]
[1] "I"      "really"

[[2]]
[1] "like"    "writing"

[[3]]
[1] "R"        "code"     "programs"

‘Find’ functions: base R

grep: grep, grepl, regexpr and gregexpr search for matches to argument pattern within each element of a character vector: they differ in the format of and amount of detail in the results.

grep(pattern, x, fixed=FALSE), where:

  • pattern = character string containing a regular expression to be matched in the given character vector.
  • x = a character vector where matches are sought, or an object which can be coerced by as.character to a character vector.
  • If fixed=TRUE, it will do exact matching for the phrase anywhere in the vector (regular find)

‘Find’ functions: stringr compared to base R

Base R does not use these functions. Here is a “translator” of the stringr function to base R functions

  • str_detect - similar to grepl (return logical)
  • grep(value = FALSE) is similar to which(str_detect())
  • str_subset - similar to grep(value = TRUE) - return value of matched
  • str_replace - similar to sub - replace one time
  • str_replace_all - similar to gsub - replace many times

Important comparisons:

Base R:

  • Argument order is (pattern, x)
  • Uses option (fixed = TRUE)

stringr

  • Argument order is (string, pattern) aka (x, pattern)
  • Uses function fixed(pattern)

These are the indices where the pattern match occurs:

grep("Rawlings", Sal$Name)
[1]     9  6854 13284
which(grepl("Rawlings", Sal$Name))
[1]     9  6854 13284
which(str_detect(Sal$Name, "Rawlings"))
[1]     9  6854 13284

These are the indices where the pattern match occurs:

head(grepl("Rawlings", Sal$Name))
[1] FALSE FALSE FALSE FALSE FALSE FALSE
head(str_detect(Sal$Name, "Rawlings"))
[1] FALSE FALSE FALSE FALSE FALSE FALSE

These are the strings containing the matches and a

grep("Rawlings", Sal$Name, value=TRUE)
[1] "Rawlings-Blake,Stephanie C" "Rawlings,Kellye A"         
[3] "Rawlings,Paula M"          
Sal[grep("Rawlings",Sal$Name),]
# A tibble: 3 × 7
  Name                   JobTitle AgencyID Agency HireDate AnnualSalary GrossPay
  <chr>                  <chr>    <chr>    <chr>  <chr>           <dbl> <chr>   
1 Rawlings-Blake,Stepha… MAYOR    A01001   Mayor… 12/07/1…       167449 $165249…
2 Rawlings,Kellye A      EMERGEN… A40302   M-R I… 01/06/2…        48940 $73356.…
3 Rawlings,Paula M       COMMUNI… A04015   R&P-R… 12/10/2…        19802 $10443.…

Using Regular Expressions

Look for any name that starts with:

  • Payne at the beginning,
  • Leonard and then an S
  • Spence then capital C

Base R

head(grep("^Payne.*", x = Sal$Name, value = TRUE), 3)
[1] "Payne,James R"  "Payne,Karen V"  "Payne,Jasman T"
head(grep("Leonard.?S", x = Sal$Name, value = TRUE))
[1] "Szumlanski,Leonard S" "Payne,Leonard S"     
head(grep("Spence.*C.*", x = Sal$Name, value = TRUE))
[1] "Spencer,Michael C"  "Spencer,Clarence W" "Spencer,Charles A" 

Here we do the same thing using stringr

head(str_subset( Sal$Name, "^Payne.*"), 3)
[1] "Payne,James R"  "Payne,Karen V"  "Payne,Jasman T"
head(str_subset( Sal$Name, "Leonard.?S"))
[1] "Szumlanski,Leonard S" "Payne,Leonard S"     
head(str_subset( Sal$Name, "Spence.*C.*"))
[1] "Spencer,Michael C"  "Spencer,Clarence W" "Spencer,Charles A" 

Let’s say we wanted to sort the data set by Annual Salary:

class(Sal$AnnualSalary)
[1] "numeric"

So we must change the annual pay into a numeric:

head(Sal$AnnualSalary, 4)
[1] 238772 211785 200000 192500
head(as.numeric(Sal$AnnualSalary), 4)
[1] 238772 211785 200000 192500

R didn’t like the $ so it thought turned them all to NA.

sub() and gsub() can do the replacing part in base R.

Now we can replace the $ with nothing (used fixed=TRUE because $ means ending):

Sal$AnnualSalary <- as.numeric(gsub(pattern = "$", replacement="", 
                              Sal$AnnualSalary, fixed=TRUE))
Sal <- Sal[order(Sal$AnnualSalary, decreasing=TRUE), ] 
Sal[1:5, c("Name", "AnnualSalary", "JobTitle")]
# A tibble: 5 × 3
  Name            AnnualSalary JobTitle              
  <chr>                  <dbl> <chr>                 
1 Mosby,Marilyn J       238772 STATE'S ATTORNEY      
2 Batts,Anthony W       211785 Police Commissioner   
3 Wen,Leana             200000 Executive Director III
4 Raymond,Henry J       192500 Executive Director III
5 Swift,Michael         187200 CONTRACT SERV SPEC II 

We can do the same thing (with 2 piping operations!) in dplyr

dplyr_sal = Sal
dplyr_sal = dplyr_sal %>% mutate( 
  AnnualSalary = AnnualSalary %>%
    str_replace(
      fixed("$"), 
      "") %>%
    as.numeric) %>%
  arrange(desc(AnnualSalary))
check_Sal = Sal
rownames(check_Sal) = NULL
all.equal(check_Sal, dplyr_sal)
[1] TRUE

8.3 Exercises

For these exercises, we will use the following datasets

  • bike lanes dataset, Bike_Lanes.csv, from the Baltimore Department of Transportation’s bike program (source). The R dataset containing Bike_Lanes.csv will be referred to as bike in the questions below.
  • the “Real Property Taxes” data in the file Real_Property_Taxes.csv.gz (note this is a compressed file, but you do not need to unzip it to read it into R). We will refer to the R object containing this data as tax.
  • Baltimore city employee salaries dataset, Baltimore_City_Employee_Salaries_FY2015.csv, and will be referred to as sal.
  1. Count the number of rows of the bike data and count the number of complete cases of the bike data. Use sum and complete.cases.
  2. Create a data set called namat which is equal to is.na(bike). What is the class of namat? Run rowSums and colSums on namat. These represent the number of missing values in the rows and columns of bike. Don’t print rowSums, but do a table of the rowSums.
  3. Filter rows of bike that are NOT missing the route variable, assign this to the object have_route. Do a table of the subType variable using table, including the missing subTypes. Get the same frequency distribution using group_by(subType) and tally() or count().
  4. Filter rows of bike that have the type SIDEPATH or BIKE LANE using %in%. Call it side_bike. Confirm this gives you the same number of results using the | and ==.
  5. Do a cross tabulation of the bike type and the number of lanes (numLanes). Call it tab. Do a prop.table on the rows and columns margins. Try as.data.frame(tab) or broom::tidy(tab).
  6. Read the Property Tax data into R and call it the variable tax.
  7. How many addresses pay property taxes? (Assume each row is a different address.)
  8. What is the total (a) city (CityTax) and (b) state (SateTax) tax paid? You need to remove the $ from the CityTax variable, then you need to make it numeric. Try str_replace, but remember $ is “special” and you need fixed() around it.
  9. Using table() or group_by and summarize(n()) or tally().
    1. How many observations/properties are in each ward (Ward)?
    2. What is the mean state tax per ward? Use group_by and summarize.
    3. What is the maximum amount still due (AmountDue) in each ward? Use group_by and summarize with ’max`.
    4. What is the 75th percentile of city and state tax paid by Ward? (quantile)
  10. Make boxplots showing CityTax (y-variable) by whether the property is a principal residence (x = ResCode) or not. You will need to trim some leading/trailing white space from ResCode.
  11. Subset the data to only retain those houses that are principal residences. Which command subsets rows? Filter or select?
    1. How many such houses are there?
    2. Describe the distribution of property taxes on these residences. Use hist/qplot with certain breaks or plot(density(variable)).
  12. Make an object called health.sal using the salaries data set, with only agencies (JobTitle) of those with “fire” (anywhere in the job title), if any, in the name remember fixed("string_match", ignore_case = TRUE) will ignore cases.
  13. Make a data set called trans which contains only agencies that contain “TRANS”.
  14. What is/are the profession(s) of people who have “abra” in their name for Baltimore’s Salaries? Case should be ignored.
  15. What does the distribution of annual salaries look like? (use hist, 20 breaks) What is the IQR? Hint: first convert to numeric. Try str_replace, but remember $ is “special” and you need fixed() around it.
  16. Convert HireDate to the Date class - plot Annual Salary vs Hire Date. Use AnnualSalary ~ HireDate with a data = sal argument in plot or use x, y notation in scatter.smooth. Use the lubridate package. Is it mdy(date) or dmy(date) for this data - look at HireDate.
  17. Create a smaller dataset that only includes the Police Department, Fire Department and Sheriff’s Office. Use the Agency variable with string matching. Call this emer. How many employees are in this new dataset?
  18. Create a variable called dept in the emer data set, dept = str_extract(Agency, ".*(ment|ice)"). E.g. we want to extract all characters up until ment or ice (we can group in regex using parentheses) and then discard the rest. Replot annual salary versus hire date and color by dept (not yet - using ggplot). Use the argument col = factor(dept) in plot.
  19. (Bonus). Convert the ‘LotSize’ variable to a numeric square feet variable in the tax data set. Some tips: a) 1 acre = 43560 square feet b) The hyphens represent a decimals. (This will take a lot of searching to find all the string changes needed before you can convert to numeric.)