Chapter 8 Data Cleaning
In general, data cleaning is a process of investigating your data for inaccuracies, or recoding it in a way that makes it more manageable. In this lesson, we will focus on checking for missing data and manipulated strings.
THE MOST IMPORTANT RULE - LOOK AT YOUR DATA!
We will use the Baltimore city employee salary dataset (source), Baltimore_City_Employee_Salaries_FY2015.csv.
library(readr)
library(tidyverse)
Sal = read_csv("./data/Baltimore_City_Employee_Salaries_FY2015.csv")
Sal = rename(Sal, Name = name)
8.1 Dealing with Missing Data
One of the most important aspects of data cleaning is missing values.
Some useful functions for checking your data include
is.na
- isTRUE
if the data isNA
FALSE
otherwise!
- negation (NOT)- if
is.na(x)
isTRUE
, then!is.na(x)
isFALSE
- if
all
takes in alogical
and will beTRUE
if ALL areTRUE
all(!is.na(x))
- are all values ofx
NOT NA
any
will beTRUE
if ANY are trueany(is.na(x))
- do we have anyNA
’s inx
?
complete.cases
- returnsTRUE
if EVERY value of a row is NOTNA
- very stringent condition
FALSE
missing one value (even if not important)
Types of “missing” data in R include:
NA
- general missing dataNaN
- stands for “Not a Number”, happens when you do 0/0.Inf
and-Inf
- Infinity, happens when you divide a positive number (or negative number) by 0.
We need to be able to check for these R missing types in our data. Each missing data type has a function that returns TRUE
if the data is missing:
NA
-is.na
NaN
-is.nan
Inf
and-Inf
-is.infinite
is.finite
returnsFALSE
for all missing data andTRUE
for non-missing
One important aspect (esp with subsetting) is that logical operations return NA
for NA
values. Think about it, the data could be > 2
or not. We don’t know, so R
says there is no TRUE
or FALSE
, so that is missing:
[1] FALSE NA FALSE TRUE TRUE FALSE FALSE
So what do we do to handle the NA
? What if we want if x > 2
and x
isn’t NA
?
Don’t do x != NA
, do x > 2
and x
is NOT NA
:
[1] NA NA NA NA NA NA NA
[1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE
What about seeing if a value is equal to multiple values? You can do (x == 1 | x == 2) & !is.na(x)
, but that is not efficient.
[1] TRUE NA TRUE FALSE FALSE FALSE FALSE
[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE
So what do we do?
Filter removes missing values. If you want to keep the NA
s, you have to keep them specifically selecting them using is.na
:
# A tibble: 2 × 1
x
<dbl>
1 3
2 4
# A tibble: 6 × 1
x
<dbl>
1 0
2 NA
3 2
4 3
5 -0.5
6 0.2
Be careful with missing data when subsetting.
[1] 0.0 NA 2.0 3.0 4.0 -0.5 0.2
[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE
[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE
[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE
Similarly with logicals, operations/arithmetic with NA
will result in NA
s:
[1] 2.0 NA 4.0 5.0 6.0 1.5 2.2
[1] 0.0 NA 4.0 6.0 8.0 -1.0 0.4
Some other useful checking functions that we have already seen are
unique
- gives you the unique values of a variabletable(x)
- will give a one-way table ofx
table(x, useNA = "ifany")
- will have row NA
table(x, y)
- will give a cross-tab ofx
andy
df %>% count(x, y)
df %>% group_by(x, y) %>% tally()
We have already used the table
function to create frequency tables of a categorical variable or a contingency table for two categorical variables. This function has a useful argument useNA
with the possible values
"no
- never showNA
category values even if they exist"ifany"
- only add anNA
category if some exist"always"
- include anNA
category whether any exist or not
[1] 0.0 NA 2.0 3.0 4.0 -0.5 0.2
x
-0.5 0 0.2 2 3 4
1 1 1 1 1 1
x
-0.5 0 0.2 2 3 4 <NA>
1 1 1 1 1 1 1
# A tibble: 7 × 2
x n
<dbl> <int>
1 -0.5 1
2 0 1
3 0.2 1
4 2 1
5 3 1
6 4 1
7 NA 1
Here is an example with no NA
s.
df2 = tibble(x = c(0, 1, 2, 3, 2, 3, 3, 2, 2, 3))
table(df2$x, useNA = "always") # Includes NA category even though there are none
0 1 2 3 <NA>
1 1 4 4 0
If you use a factor
, all levels will be given even if it does not appear in the data! (May be wanted or not):
fac = factor(c(0, 1, 2, 3, 2, 3, 3, 2,2, 3),
levels = 1:4)
tab = table(fac)
tab # Shows level 4 even though it does not appear in the observed data
fac
1 2 3 4
1 4 4 0
fac
1 2 3
1 4 4
It is often useful to check logical conditions using
any()
- checks if there are anyTRUE
sall()
- checks if ALL are true
# A tibble: 2 × 7
Name JobTitle AgencyID Agency HireDate AnnualSalary GrossPay
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Aaron,Patricia G Facilities/Of… A03031 OED-E… 10/24/1… $55314.00 $53626.…
2 Aaron,Petra L ASSISTANT STA… A29045 State… 09/25/2… $74000.00 $73000.…
[1] FALSE
We get FALSE
, so there are no observations with a missing name.
8.2 Strings and Recoding Variables
Let’s say gender is a variable in our dataset, but we have combined data from multiple datasets or there were multiple individuals who entered data. It may be the case that the same encoding for male and female were not used across these dataset/individuals. For example, gender may have been coded as Male, M, m, Female, F, f. Using excel to find all of these would be a matter of filtering and changing all by hand or using if statements.
In dplyr
, you can use the recode
function:
or use ifelse
:
This, however, assumes that we know all the patterns that we are looking for and that there are not very many of them. Sometimes though, it’s not so simple. That’s where functions that find patterns come in very useful. For example, what if our gender variable’s data looked like this?
set.seed(4) # random sample below - make sure same every time
gender <- sample(c("Male", "mAle", "MaLe", "M", "MALE", "Ma", "FeMAle", "F", "Woman", "Man", "Fm", "FEMALE"), 1000, replace = TRUE)
table(gender)
gender
F FeMAle FEMALE Fm M Ma mAle Male MaLe MALE Man
80 88 76 87 99 76 84 83 79 93 84
Woman
71
We need a way of splitting, finding, replacing strings and searching strings by using regular expressions. R can do much more than find exact matches for a whole string. Like Perl and other languages, it can use regular expressions. What are regular expressions?
- Ways to search for specific strings
- Can be very complicated or simple
- Highly Useful - think “Find” on steroids
- http://www.regular-expressions.info/reference.html
- They can use to match a large number of strings in one statement
.
matches any single character*
means repeat as many (even if 0) more times the last character?
makes the last thing optional^
matches start of vector^a
- starts with “a”$
matches end of vectorb$
- ends with “b”
We will use the stringr
package, which is part of the tidyverse
.
- Makes string manipulation more intuitive
- Has a standard format for most functions: the first argument is a string, like first argument is a
data.frame
indplyr
functions - We will not cover
grep
orgsub
- base R functions are used on forums for answers - Almost all functions in
stringr
start withstr_
Let’s look at some modifiers for stringr
function arguments. See ?modifiers
with the stringr
package loaded.
fixed
- match everything exactlyregexp
- default - uses regular expressionsignore_case
is an option to not have to usetolower
8.2.1 Substring and String Splitting
To get a substring or split a string with stringr
we use
str_sub(x, start, end)
- substrings from position start to position endstr_split(string, pattern)
- splits strings up - returns list!
library(stringr)
x <- c("I really", "like writing", "R code programs")
y <- str_split(x, " ") # returns a list
y
[[1]]
[1] "I" "really"
[[2]]
[1] "like" "writing"
[[3]]
[1] "R" "code" "programs"
When we pass string expressions to search find, in this case to split by, we have to be aware of special characters in regular expressions, such as "."
which is the special character for match anything in regular expressions.
[[1]]
[1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[[1]]
[1] "I" "like" "strings"
Let’s extract from y.
[[1]]
[1] "I" "really"
[[2]]
[1] "like" "writing"
[[3]]
[1] "R" "code" "programs"
[1] "like" "writing"
[1] "I" "like" "R"
[1] "really" "writing" "code"
[1] "really" "writing" "programs"
From tidyr
, you can split a character column in a data set into multiple columns. By default it splits based on any sequence of non-alphanumeric characters.
df = tibble(x = c("I really", "like writing", "R code programs"))
df %>% separate(x, into = c("first", "second", "third"))
Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2 rows [1, 2].
# A tibble: 3 × 3
first second third
<chr> <chr> <chr>
1 I really <NA>
2 like writing <NA>
3 R code programs
Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [3].
# A tibble: 3 × 2
first second
<chr> <chr>
1 I really
2 like writing
3 R code
df = tibble(x = c("I really", "like. _writing R. But not", "R code programs"))
df %>% separate(x, into = c("first", "second", "third"), extra = "merge")
Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].
# A tibble: 3 × 3
first second third
<chr> <chr> <chr>
1 I really <NA>
2 like writing R. But not
3 R code programs
Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].
# A tibble: 3 × 3
first second third
<chr> <chr> <chr>
1 I really <NA>
2 like. _writing R. But not
3 R code programs
8.2.2 Find Functions in stringr
str_detect
, str_subset
, str_replace
, and str_replace_all
search for matches to argument pattern within each element of a character vector: they differ in the format of and amount of detail in the results.
str_detect
- returnsTRUE
ifpattern
is foundstr_subset
- returns only the strings which pattern were detected- convenient wrapper around
x[str_detect(x, pattern)]
- convenient wrapper around
str_extract
- returns only strings which pattern were detected, but ONLY the patternstr_replace
- replacespattern
withreplacement
the first timestr_replace_all
- replacespattern
withreplacement
as many times matched
str_detect
returns TRUE
in the indices where the pattern match occurs:
[1] FALSE FALSE FALSE FALSE FALSE FALSE
[1] 10256 10257 10258
str_extract
returns a character string with either NA
if the string was not found or the string that was searched for.
[1] NA NA NA NA NA NA
[1] "Rawling" "Rawling" "Rawling"
str_subset
only returns the full string that contained the searched for substring.
[1] "Rawlings,Kellye A" "Rawlings,Paula M"
[3] "Rawlings-Blake,Stephanie C"
# A tibble: 3 × 7
Name JobTitle AgencyID Agency HireDate AnnualSalary GrossPay
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Rawlings,Kellye A EMERGEN… A40302 M-R I… 01/06/2… $48940.00 $73356.…
2 Rawlings,Paula M COMMUNI… A04015 R&P-R… 12/10/2… $19802.00 $10443.…
3 Rawlings-Blake,Stepha… MAYOR A01001 Mayor… 12/07/1… $167449.00 $165249…
str_extract_all
extracts all the matched strings
* \\d
searches for DIGITS/numbers
[1] "A03031" "A29045" "A65026" "A99005" "A40001" "A90005"
[1] "0" "2" "6" "9" "4" "9"
[[1]]
[1] "0" "3" "0" "3" "1"
[[2]]
[1] "2" "9" "0" "4" "5"
str_replace_all
finds all cases of the matched strings and replaces them with new string provided. `str_replace
only replaces the first occurrence of the matched string.
[1] "Aaron,Patricia G" "Aaron,Petra L" "Abaineh,Yohannes T"
[4] "Abbene,Anthony M" "Abbey,Emmanuel" "Abbott-Cole,Michelle"
[1] "Ajron,Patricia G" "Ajron,Petra L" "Abjineh,Yohannes T"
[4] "Abbene,Anthony M" "Abbey,Emmjnuel" "Abbott-Cole,Michelle"
[1] "Ajron,Pjtricij G" "Ajron,Petrj L"
Let’s say we wanted to sort the data set by Annual Salary:
[1] "character"
[1] "$55314.00" "$74000.00" "$64500.00" "$46309.00"
Warning in head(as.numeric(Sal$AnnualSalary), 4): NAs introduced by coercion
[1] NA NA NA NA
R didn’t like the $
so it thought turned them all to NA
, so we need to remove the $
first but remember it is a special character in regular expressions. So we can replace the $
with nothing (used fixed("$")
because $
means ending):
Sal = Sal %>% mutate(
AnnualSalary = str_replace(AnnualSalary, fixed("$"), ""),
AnnualSalary = as.numeric(AnnualSalary)
) %>%
arrange(desc(AnnualSalary))
head(Sal$AnnualSalary)
[1] 238772 211785 200000 192500 187200 178900
8.2.3 Using Regular Expressions
Let’s look at a few simple examples of using regular expressions. Recall
- http://www.regular-expressions.info/reference.html
- They can use to match a large number of strings in one statement
.
matches any single character*
means repeat as many (even if 0) more times the last character?
makes the last thing optional^
matches start of vector^a
- starts with “a”$
matches end of vectorb$
- ends with “b”
Let’s look for any name that starts with
- Payne at the beginning,
- Leonard and then an S
- Spence then capital C
# "^Payne.*" = starts with Payne followed by any number of any other character
head(str_subset( Sal$Name, "^Payne.*"), 3)
[1] "Payne,James R" "Payne,Karen V" "Payne,Jasman T"
# "Leonard.?S" = contains Leonard optionally followed by any single character and then an S
head(str_subset( Sal$Name, "Leonard.?S"))
[1] "Szumlanski,Leonard S" "Payne,Leonard S"
# "Spence.*C.*" = contains Spence followed by any number of any character then C then any number of any character
head(str_subset( Sal$Name, "Spence.*C.*"))
[1] "Spencer,Michael C" "Spencer,Clarence W" "Spencer,Charles A"
8.2.4 Combining Strings
Paste can be very useful for joining vectors together:
[1] "Visit_1" "Visit_2" "Visit_3" "Visit_4" "Visit_5"
[1] "Visit_1 Visit_2 Visit_3 Visit_4 Visit_5"
[1] "Today is going be the day we go to the store!"
[1] "Visit1" "Visit2" "Visit3" "Visit4" "Visit5"
[1] "1" "2" "3" "4" "5"
[1] "1 2 3 4 5"
From tidyr
, you can unite two or more columns:
df = data_frame(id = rep(1:5, 3), visit = rep(1:3, each = 5))
df %>% unite(col = "unique_id", id, visit, sep = "_")
# A tibble: 15 × 1
unique_id
<chr>
1 1_1
2 2_1
3 3_1
4 4_1
5 5_1
6 1_2
7 2_2
8 3_2
9 4_2
10 5_2
11 1_3
12 2_3
13 3_3
14 4_3
15 5_3
# A tibble: 15 × 3
unique_id id visit
<chr> <int> <int>
1 1_1 1 1
2 2_1 2 1
3 3_1 3 1
4 4_1 4 1
5 5_1 5 1
6 1_2 1 2
7 2_2 2 2
8 3_2 3 2
9 4_2 4 2
10 5_2 5 2
11 1_3 1 3
12 2_3 2 3
13 3_3 3 3
14 4_3 4 3
15 5_3 5 3
8.2.5 Useful String Functions
Some other useful string functions
toupper()
,tolower()
- uppercase or lowercase your data:str_trim()
(in thestringr
package) ortrimws
in base: will trim leading and trailing whitespacenchar
- get the number of characters in a string
[1] " ROBERT PARKER "
[1] " robert parker "
[1] "Robert Parker"
[1] 28
Note that sorting on strings may not work as expected.
sort
- reorders the data - characters work, but not correctlyrank
- gives the rank of the data - ties are splitorder
- gives the indices, if subset, would give the data sortedx[order(x)]
is the same as sorting
[1] "1" "10" "2"
[1] 1 3 2
[1] 2.5 2.5 10.0 7.0 4.0 1.0 8.0 5.0 9.0 6.0
8.2.6 Comparison of stringr
to base R (not covered)
The material in this section will show you the base R functions to manipulate strings that match the stringr
functions we learned above. Though we won’t use these functions, it is useful to recognize these as they may be used in examples on help forums or just in general by other R users.
Substringing is very similar:
Base R
substr(x, start, stop)
- substrings from position start to position stopstrsplit(x, split)
- splits strings up - returns list!
stringr
str_sub(x, start, end)
- substrings from position start to position endstr_split(string, pattern)
- splits strings up - returns list!
In base R, strsplit
splits a vector on a string into a list
x <- c("I really", "like writing", "R code programs")
y <- strsplit(x, split = " ") # returns a list
y
[[1]]
[1] "I" "really"
[[2]]
[1] "like" "writing"
[[3]]
[1] "R" "code" "programs"
‘Find’ functions: base R
grep
: grep
, grepl
, regexpr
and gregexpr
search for matches to argument pattern within each element of a character vector: they differ in the format of and amount of detail in the results.
grep(pattern, x, fixed=FALSE)
, where:
- pattern = character string containing a regular expression to be matched in the given character vector.
- x = a character vector where matches are sought, or an object which can be coerced by as.character to a character vector.
- If fixed=TRUE, it will do exact matching for the phrase anywhere in the vector (regular find)
‘Find’ functions: stringr compared to base R
Base R does not use these functions. Here is a “translator” of the stringr
function to base R functions
str_detect
- similar togrepl
(return logical)grep(value = FALSE)
is similar towhich(str_detect())
str_subset
- similar togrep(value = TRUE)
- return value of matchedstr_replace
- similar tosub
- replace one timestr_replace_all
- similar togsub
- replace many times
Important comparisons:
Base R:
- Argument order is
(pattern, x)
- Uses option
(fixed = TRUE)
stringr
- Argument order is
(string, pattern)
aka(x, pattern)
- Uses function
fixed(pattern)
These are the indices where the pattern match occurs:
[1] 9 6854 13284
[1] 9 6854 13284
[1] 9 6854 13284
These are the indices where the pattern match occurs:
[1] FALSE FALSE FALSE FALSE FALSE FALSE
[1] FALSE FALSE FALSE FALSE FALSE FALSE
These are the strings containing the matches and a
[1] "Rawlings-Blake,Stephanie C" "Rawlings,Kellye A"
[3] "Rawlings,Paula M"
# A tibble: 3 × 7
Name JobTitle AgencyID Agency HireDate AnnualSalary GrossPay
<chr> <chr> <chr> <chr> <chr> <dbl> <chr>
1 Rawlings-Blake,Stepha… MAYOR A01001 Mayor… 12/07/1… 167449 $165249…
2 Rawlings,Kellye A EMERGEN… A40302 M-R I… 01/06/2… 48940 $73356.…
3 Rawlings,Paula M COMMUNI… A04015 R&P-R… 12/10/2… 19802 $10443.…
Using Regular Expressions
Look for any name that starts with:
- Payne at the beginning,
- Leonard and then an S
- Spence then capital C
Base R
[1] "Payne,James R" "Payne,Karen V" "Payne,Jasman T"
[1] "Szumlanski,Leonard S" "Payne,Leonard S"
[1] "Spencer,Michael C" "Spencer,Clarence W" "Spencer,Charles A"
Here we do the same thing using stringr
[1] "Payne,James R" "Payne,Karen V" "Payne,Jasman T"
[1] "Szumlanski,Leonard S" "Payne,Leonard S"
[1] "Spencer,Michael C" "Spencer,Clarence W" "Spencer,Charles A"
Let’s say we wanted to sort the data set by Annual Salary:
[1] "numeric"
So we must change the annual pay into a numeric:
[1] 238772 211785 200000 192500
[1] 238772 211785 200000 192500
R didn’t like the $
so it thought turned them all to NA
.
sub()
and gsub()
can do the replacing part in base R.
Now we can replace the $
with nothing (used fixed=TRUE
because $
means ending):
Sal$AnnualSalary <- as.numeric(gsub(pattern = "$", replacement="",
Sal$AnnualSalary, fixed=TRUE))
Sal <- Sal[order(Sal$AnnualSalary, decreasing=TRUE), ]
Sal[1:5, c("Name", "AnnualSalary", "JobTitle")]
# A tibble: 5 × 3
Name AnnualSalary JobTitle
<chr> <dbl> <chr>
1 Mosby,Marilyn J 238772 STATE'S ATTORNEY
2 Batts,Anthony W 211785 Police Commissioner
3 Wen,Leana 200000 Executive Director III
4 Raymond,Henry J 192500 Executive Director III
5 Swift,Michael 187200 CONTRACT SERV SPEC II
We can do the same thing (with 2 piping operations!) in dplyr
dplyr_sal = Sal
dplyr_sal = dplyr_sal %>% mutate(
AnnualSalary = AnnualSalary %>%
str_replace(
fixed("$"),
"") %>%
as.numeric) %>%
arrange(desc(AnnualSalary))
check_Sal = Sal
rownames(check_Sal) = NULL
all.equal(check_Sal, dplyr_sal)
[1] TRUE
8.3 Exercises
For these exercises, we will use the following datasets
- bike lanes dataset, Bike_Lanes.csv, from the Baltimore Department of Transportation’s bike program (source). The R dataset containing Bike_Lanes.csv will be referred to as
bike
in the questions below. - the “Real Property Taxes” data in the file Real_Property_Taxes.csv.gz (note this is a compressed file, but you do not need to unzip it to read it into R). We will refer to the R object containing this data as
tax
. - Baltimore city employee salaries dataset, Baltimore_City_Employee_Salaries_FY2015.csv, and will be referred to as
sal
.
- Count the number of rows of the bike data and count the number of complete cases of the bike data. Use
sum
andcomplete.cases
. - Create a data set called namat which is equal to
is.na(bike)
. What is the class ofnamat
? RunrowSums
andcolSums
onnamat
. These represent the number of missing values in the rows and columns of bike. Don’t printrowSums
, but do a table of therowSums
. - Filter rows of bike that are NOT missing the
route
variable, assign this to the objecthave_route
. Do a table of thesubType
variable using table, including the missingsubType
s. Get the same frequency distribution usinggroup_by(subType)
andtally()
orcount()
. - Filter rows of
bike
that have the typeSIDEPATH
orBIKE LANE
using%in%
. Call itside_bike
. Confirm this gives you the same number of results using the|
and==
. - Do a cross tabulation of the bike
type
and the number of lanes (numLanes
). Call ittab
. Do aprop.table
on the rows and columns margins. Tryas.data.frame(tab)
orbroom::tidy(tab)
. - Read the Property Tax data into R and call it the variable
tax
. - How many addresses pay property
taxes
? (Assume each row is a different address.) - What is the total (a) city (
CityTax
) and (b) state (SateTax
) tax paid? You need to remove the$
from theCityTax
variable, then you need to make it numeric. Trystr_replace
, but remember$
is “special” and you needfixed()
around it. - Using
table()
orgroup_by
andsummarize(n())
ortally()
.- How many observations/properties are in each ward (
Ward
)? - What is the mean state tax per ward? Use
group_by
andsummarize
. - What is the maximum amount still due (
AmountDue
) in each ward? Usegroup_by
andsummarize
with ’max`. - What is the 75th percentile of city and state tax paid by Ward? (
quantile
)
- How many observations/properties are in each ward (
- Make boxplots showing
CityTax
(y
-variable) by whether the property is a principal residence (x = ResCode
) or not. You will need to trim some leading/trailing white space fromResCode
. - Subset the data to only retain those houses that are principal residences. Which command subsets rows? Filter or select?
- How many such houses are there?
- Describe the distribution of property taxes on these residences. Use
hist
/qplot
with certain breaks orplot(density(variable))
.
- Make an object called
health.sal
using the salaries data set, with only agencies (JobTitle
) of those with “fire” (anywhere in the job title), if any, in the name rememberfixed("string_match", ignore_case = TRUE)
will ignore cases. - Make a data set called
trans
which contains only agencies that contain “TRANS”. - What is/are the profession(s) of people who have “abra” in their name for Baltimore’s Salaries? Case should be ignored.
- What does the distribution of annual salaries look like? (use
hist
, 20 breaks) What is the IQR? Hint: first convert to numeric. Trystr_replace
, but remember$
is “special” and you needfixed()
around it. - Convert
HireDate
to theDate
class - plot Annual Salary vs Hire Date. UseAnnualSalary ~ HireDate
with adata = sal
argument in plot or usex
,y
notation inscatter.smooth
. Use thelubridate
package. Is itmdy(date)
ordmy(date)
for this data - look atHireDate
. - Create a smaller dataset that only includes the Police Department, Fire Department and Sheriff’s Office. Use the
Agency
variable with string matching. Call thisemer
. How many employees are in this new dataset? - Create a variable called
dept
in theemer
data set,dept = str_extract(Agency, ".*(ment|ice)")
. E.g. we want to extract all characters up until ment or ice (we can group in regex using parentheses) and then discard the rest. Replot annual salary versus hire date and color by dept (not yet - using ggplot). Use the argumentcol = factor(dept)
inplot
. - (Bonus). Convert the ‘LotSize’ variable to a numeric square feet variable in the
tax
data set. Some tips: a) 1 acre = 43560 square feet b) The hyphens represent a decimals. (This will take a lot of searching to find all the string changes needed before you can convert to numeric.)