Chapter 16 Solutions to Exercises
16.1 RStudio
- Go through and change the colors in `palette` to something other than what they originally were. See http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf for a large list of colors.
- Change the days you are keeping to show `"Sunday"` instead of `"Saturday"`.
- Change the word `geom_line()` to `geom_point()`.
# keep only some days
avg = avg %>%
filter(day %in% c("Monday", "Tuesday", "Friday", "Sunday")) #change Saturday to Sunday
filter: removed 1,237 rows (43%), 1,647 rows remaining
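The palette and geom changes apply to the plotting code from the chapter; a minimal sketch, assuming the avg data and the pal color vector used in the visualization section:
pal = c("dodgerblue", "forestgreen", "darkorange", "mediumpurple") # any other valid R colors work
ggplot(avg, aes(x = date, y = number, colour = route)) +
  geom_point() + # geom_line() changed to geom_point()
  scale_colour_manual(values = pal)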
16.2 Basic R
- create a new variable called `my.num` that contains 6 numbers
- multiply `my.num` by 4
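One solution consistent with the output below (the six numbers can be recovered by dividing the output by 4):
my.num = c(5, 4, 7, 8, 12, 14)
my.num * 4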
[1] 20 16 28 32 48 56
- create a second variable called `my.char` that contains 5 character strings
- combine the two variables `my.num` and `my.char` into a variable called `both`
- what is the length of `both`?
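A sketch (any five strings work; the combined vector has 6 + 5 = 11 elements):
my.char = c("a", "b", "c", "d", "e")
both = c(my.num, my.char)
length(both)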
[1] 11
- what class is `both`?
[1] "character"
- divide `both` by 3, what happens?
Error in both/3: non-numeric argument to binary operator
- create a vector with elements 1 2 3 4 5 6 and call it `x`
- create another vector with elements 10 20 30 40 50 and call it `y`
- what happens if you try to add `x` and `y` together? why?
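A sketch; the addition succeeds, but with a recycling warning because the lengths differ:
x = 1:6
y = c(10, 20, 30, 40, 50)
x + y # warning: longer object length is not a multiple of shorter object length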
[1] 11 22 33 44 55 16
Since `y` is shorter than `x`, R "recycles" the first element of `y` to make it long enough to match the length of `x`. This can be a useful trick, but you must be careful when using R's recycling.
11. append the value 60 onto the vector `y` (hint: you can use the `c()` function)
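A sketch:
y = c(y, 60)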
- add `x` and `y` together
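Now the lengths match:
x + y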
[1] 11 22 33 44 55 66
- multiply `x` and `y` together. Pay attention to how R performs operations on vectors of the same length.
[1] 10 40 90 160 250 360
Note that R does vector operations componentwise.
16.3 Data IO
- Read in the Youth Tobacco study, Youth_Tobacco_Survey_YTS_Data.csv, and name it `youth`.
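A sketch, assuming the file lives in the ./data folder used elsewhere in these solutions:
library(readr)
youth = read_csv("./data/Youth_Tobacco_Survey_YTS_Data.csv")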
Rows: 9794 Columns: 31
── Column specification ──────────────────────────────────────────────
Delimiter: ","
chr (24): LocationAbbr, LocationDesc, TopicType, TopicDesc, MeasureDesc, Dat...
dbl (7): YEAR, Data_Value, Data_Value_Std_Err, Low_Confidence_Limit, High_C...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
- Install and invoke the `readxl` package. RStudio > Tools > Install Packages. Type readxl into the Package search and click Install. Load the installed package with `library(readxl)`.
- Download an Excel version of the Monuments dataset, Monuments.xlsx, from CANVAS. Use the `read_excel()` function in the `readxl` package to read in the dataset and call the output `mon`.
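A sketch, assuming Monuments.xlsx was saved to the ./data folder:
library(readxl)
mon = read_excel("./data/Monuments.xlsx")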
- Write out the `mon` R object as a CSV file using `readr::write_csv` and call the file "monuments.csv".
- Write out the `mon` R object as an RDS file using `readr::write_rds` and call it "monuments.rds".
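A sketch covering both write-outs:
readr::write_csv(mon, "monuments.csv")
readr::write_rds(mon, "monuments.rds")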
16.4 Subsetting Data
- Check to see if you have the `mtcars` dataset by entering the command `mtcars`.
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
- What class is `mtcars`?
[1] "data.frame"
- How many observations (rows) and variables (columns) are in the `mtcars` dataset?
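Several equivalent ways, matching the outputs below:
dim(mtcars)
nrow(mtcars)
ncol(mtcars)
glimpse(mtcars) # from dplyr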
[1] 32 11
[1] 32
[1] 11
Rows: 32
Columns: 11
$ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
$ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
$ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
$ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
- Copy `mtcars` into an object called cars and rename `mpg` in cars to `MPG`. Use `rename()`.
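A sketch; converting to a tibble first matches the printed output:
cars = as_tibble(mtcars) %>% rename(MPG = mpg)
head(cars)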
# A tibble: 6 × 11
MPG cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
- Convert the column names of `cars` to all upper case. Use `rename_all` and the `toupper` command (or `colnames`).
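Method 1, a sketch using rename_all (Method 2 with colnames follows below):
cars = cars %>% rename_all(toupper)
head(cars)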
rename_all: renamed 10 variables (CYL, DISP, HP, DRAT, WT, …)
# A tibble: 6 × 11
MPG CYL DISP HP DRAT WT QSEC VS AM GEAR CARB
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
# Method 2
cars = as_tibble(mtcars)
cn = colnames(cars) # extract column names
cn = toupper(cn) # make them uppercase
colnames(cars) = cn # reassign
head(cars)
# A tibble: 6 × 11
MPG CYL DISP HP DRAT WT QSEC VS AM GEAR CARB
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
- Convert the rownames of `cars` to a column called `car` using `rownames_to_column`. Subset the columns from `cars` that end in `"p"` and call it pvars using `ends_with()`.
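A sketch; starting fresh from mtcars brings back the original lowercase names seen below:
cars = mtcars %>% rownames_to_column(var = "car") # rownames_to_column is in the tibble package
head(cars)
pvars = cars %>% select(car, ends_with("p"))
head(pvars)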
car mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
5 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
car disp hp
1 Mazda RX4 160 110
2 Mazda RX4 Wag 160 110
3 Datsun 710 108 93
4 Hornet 4 Drive 258 110
5 Hornet Sportabout 360 175
6 Valiant 225 105
- Create a subset of `cars` that only contains the columns `wt`, `qsec`, and `hp`, and assign this object to `carsSub`. What are the dimensions of `carsSub`? (Use `select()` and `dim()`.)
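A sketch; the car column created above comes along, giving the four columns shown:
carsSub = cars %>% select(car, wt, qsec, hp)
dim(carsSub)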
[1] 32 4
- Convert the column names of `carsSub` to all upper case. Use `rename_all()` and `toupper()` (or `colnames()`).
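As before:
carsSub = carsSub %>% rename_all(toupper)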
rename_all: renamed 4 variables (CAR, WT, QSEC, HP)
- Subset the rows of `cars` that get more than 20 miles per gallon (`mpg`) of fuel efficiency. How many are there? (Use `filter()`.)
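A sketch (the object name cars_mpg is referenced in the answer text below):
cars_mpg = cars %>% filter(mpg > 20)
dim(cars_mpg)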
filter: removed 18 rows (56%), 14 rows remaining
[1] 14 12
[1] 14
Rows: 14
Columns: 12
$ car <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Me…
$ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 24.4, 22.8, 32.4, 30.4, 33.9, 21.5, 27.3,…
$ cyl <dbl> 6, 6, 4, 6, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 146.7, 140.8, 78.7, 75.7, 71.1, 120.1…
$ hp <dbl> 110, 110, 93, 110, 62, 95, 66, 52, 65, 97, 66, 91, 113, 109
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.69, 3.92, 4.08, 4.93, 4.22, 3.70, 4.08,…
$ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.190, 3.150, 2.200, 1.615, 1.835, 2.…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 20.00, 22.90, 19.47, 18.52, 19.90, 20…
$ vs <dbl> 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1
$ am <dbl> 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1
$ gear <dbl> 4, 4, 4, 3, 4, 4, 4, 4, 4, 3, 4, 5, 5, 4
$ carb <dbl> 4, 4, 1, 1, 2, 2, 1, 2, 1, 1, 1, 2, 2, 2
There are 14 cars; `nrow(cars_mpg)` returns this count.
filter: removed 18 rows (56%), 14 rows remaining
[1] 14
filter: removed 18 rows (56%), 14 rows remaining
[1] 14
- Subset the rows that get less than 16 miles per gallon (`mpg`) of fuel efficiency and have more than 100 horsepower (`hp`). How many are there? (Use `filter()`.)
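A sketch:
cars %>% filter(mpg < 16, hp > 100)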
filter: removed 22 rows (69%), 10 rows remaining
car mpg cyl disp hp drat wt qsec vs am gear carb
1 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
2 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
3 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
4 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
5 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
6 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
7 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
8 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
9 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
10 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
filter: removed 22 rows (69%), 10 rows remaining
[1] 10
filter: removed 22 rows (69%), 10 rows remaining
[1] 10
filter: removed 22 rows (69%), 10 rows remaining
[1] 10
- Create a subset of the `cars` data that only contains the columns `wt`, `qsec`, and `hp` for cars with 8 cylinders (`cyl`) and reassign this object to `carsSub`. What are the dimensions of this dataset?
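A sketch; keeping car as well matches the four columns shown later:
carsSub = cars %>%
  filter(cyl == 8) %>%
  select(wt, qsec, hp, car)
dim(carsSub)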
filter: removed 18 rows (56%), 14 rows remaining
[1] 14 4
filter: removed 18 rows (56%), 14 rows remaining
[1] 14 4
- Re-order the rows of `carsSub` by weight (`wt`) in increasing order. (Use `arrange()`.)
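A sketch:
carsSub = carsSub %>% arrange(wt)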
- Create a new variable in `carsSub` called `wt2`, which is equal to `wt^2`, using `mutate()` and piping `%>%`.
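A sketch, producing the arranged data shown below:
carsSub = carsSub %>% mutate(wt2 = wt^2)
carsSub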
wt qsec hp car wt2
1 3.170 14.50 264 Ford Pantera L 10.04890
2 3.435 17.30 150 AMC Javelin 11.79922
3 3.440 17.02 175 Hornet Sportabout 11.83360
4 3.520 16.87 150 Dodge Challenger 12.39040
5 3.570 15.84 245 Duster 360 12.74490
6 3.570 14.60 335 Maserati Bora 12.74490
7 3.730 17.60 180 Merc 450SL 13.91290
8 3.780 18.00 180 Merc 450SLC 14.28840
9 3.840 15.41 245 Camaro Z28 14.74560
10 3.845 17.05 175 Pontiac Firebird 14.78403
11 4.070 17.40 180 Merc 450SE 16.56490
12 5.250 17.98 205 Cadillac Fleetwood 27.56250
13 5.345 17.42 230 Chrysler Imperial 28.56902
14 5.424 17.82 215 Lincoln Continental 29.41978
16.5 Data Summarization
- How many bike lanes are currently in Baltimore? You can assume that each observation/row is a different bike lane.
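A sketch, assuming the Bike_Lanes.csv path used in the later sections:
bike = read_csv("./data/Bike_Lanes.csv")
nrow(bike)
dim(bike)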
[1] 1631
[1] 1631 9
[1] 1631
- How many (a) feet and (b) miles of total bike lanes are currently in Baltimore? (The `length` variable provides the length in feet.)
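A sketch; one mile is 5280 feet:
sum(bike$length, na.rm = TRUE)        # (a) total feet
sum(bike$length, na.rm = TRUE) / 5280 # (b) total miles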
[1] 439447.6
[1] 83.22871
[1] 83.22871
- How many types (`type`) of bike lanes are there? Which type (a) occurs the most and (b) has the longest average bike lane length?
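A sketch for the counts; the table below shows BIKE LANE occurs most often:
table(bike$type, useNA = "ifany")
btype = unique(bike$type)
btype
sum(!is.na(btype)) # 7 types, not counting the missing category
length(btype)      # 8 including NA
is.na(btype)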
BIKE BOULEVARD BIKE LANE CONTRAFLOW SHARED BUS BIKE SHARROW
49 621 13 39 589
SIDEPATH SIGNED ROUTE <NA>
7 304 9
[1] "BIKE BOULEVARD" "SIDEPATH" "SIGNED ROUTE" "BIKE LANE"
[5] "SHARROW" NA "CONTRAFLOW" "SHARED BUS BIKE"
[1] 7
[1] 8
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
count: now 8 rows and 2 columns, ungrouped
# (b)
bike %>%
group_by(type) %>%
summarise(number_of_rows = n(),
mean = mean(length)) %>%
arrange(desc(mean))
summarise: now 8 rows and 3 columns, ungrouped
# A tibble: 8 × 3
type number_of_rows mean
<chr> <int> <dbl>
1 SIDEPATH 7 666.
2 BIKE LANE 621 300.
3 SHARED BUS BIKE 39 277.
4 SIGNED ROUTE 304 264.
5 <NA> 9 260.
6 SHARROW 589 244.
7 BIKE BOULEVARD 49 197.
8 CONTRAFLOW 13 136.
- How many different projects (`project`) do the bike lanes fall into? Which `project` category has the longest average bike lane length?
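A sketch for the count of project categories (including NA):
length(unique(bike$project))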
[1] 13
# Several different solutions for finding the longest average bike lane length for a project
bike %>%
group_by(project) %>%
summarise(n = n(),
mean = mean(length)) %>%
arrange(desc(mean))
summarise: now 13 rows and 3 columns, ungrouped
# A tibble: 13 × 3
project n mean
<chr> <int> <dbl>
1 MAINTENANCE 4 1942.
2 ENGINEERING CONSTRUCTION 12 512.
3 TRAFFIC 51 420.
4 COLLEGETOWN 339 321.
5 PARK HEIGHTS BIKE NETWORK 172 283.
6 CHARM CITY CIRCULATOR 39 277.
7 TRAFFIC CALMING 79 269.
8 OPERATION ORANGE CONE 458 250.
9 <NA> 74 214.
10 COLLEGETOWN NETWORK 13 214.
11 SOUTHEAST BIKE NETWORK 323 211.
12 PLANNING TRAFFIC 18 209.
13 GUILFORD AVE BIKE BLVD 49 197.
bike %>%
group_by(project, type) %>%
summarise(n = n(),
mean = mean(length)) %>%
arrange(desc(mean)) %>%
ungroup() %>%
slice(1) %>%
magrittr::extract("project")
summarise: now 32 rows and 4 columns, one group variable remaining (project)
ungroup: no grouping variables
slice: removed 31 rows (97%), one row remaining
# A tibble: 1 × 1
project
<chr>
1 MAINTENANCE
summarize: now 32 rows and 4 columns, one group variable remaining (project)
# A tibble: 32 × 4
# Groups: project [13]
project type n mean
<chr> <chr> <int> <dbl>
1 MAINTENANCE BIKE LANE 4 1942.
2 TRAFFIC SIDEPATH 1 1848.
3 ENGINEERING CONSTRUCTION BIKE LANE 8 537.
4 <NA> SIDEPATH 2 512.
5 ENGINEERING CONSTRUCTION SIDEPATH 3 481.
6 ENGINEERING CONSTRUCTION SHARROW 1 403.
7 PARK HEIGHTS BIKE NETWORK SIGNED ROUTE 27 398.
8 TRAFFIC BIKE LANE 50 391.
9 PLANNING TRAFFIC BIKE LANE 6 363.
10 <NA> BIKE LANE 12 357.
# ℹ 22 more rows
avg = bike %>%
group_by(type) %>%
summarize(mn = mean(length, na.rm = TRUE)) %>%
filter(mn == max(mn))
summarize: now 8 rows and 2 columns, ungrouped
filter: removed 7 rows (88%), one row remaining
- What was the average bike lane length per year that they were installed? (Be sure to first set `dateInstalled` to `NA` if it is equal to zero.)
# Several examples of changing 0's to NA for dateInstalled
bike = bike %>% mutate(
dateInstalled = ifelse(
dateInstalled == 0,
NA,
dateInstalled)
)
mean(bike$length[ !is.na(bike$dateInstalled)]) # overall average bike lane length
[1] 273.9943
b2 = bike %>%
mutate(dateInstalled = ifelse(dateInstalled == "0", NA,
dateInstalled))
b2 = bike %>%
mutate(dateInstalled = if_else(dateInstalled == "0",
NA_real_,
dateInstalled))
bike$dateInstalled[bike$dateInstalled == "0"] = NA
# Mean bike lane length by year. Note that bike lane length of 0 is also NA
bike %>%
mutate(length = ifelse(length == 0, NA, length)) %>%
group_by(dateInstalled) %>%
summarise(n = n(),
mean_of_the_bike = mean(length, na.rm = TRUE),
n_missing = sum(is.na(length)))
summarise: now 9 rows and 4 columns, ungrouped
# A tibble: 9 × 4
dateInstalled n mean_of_the_bike n_missing
<dbl> <int> <dbl> <int>
1 2006 2 1469. 0
2 2007 368 310. 0
3 2008 206 249. 0
4 2009 86 407. 0
5 2010 625 246. 0
6 2011 101 233. 0
7 2012 107 271. 0
8 2013 10 290. 0
9 NA 126 217. 1
- (a) Numerically and (b) graphically describe the distribution of bike lane lengths (`length`).
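A sketch; the two stat_bin messages below correspond to two histogram calls (the log-scale second plot is an assumption):
quantile(bike$length)
qplot(x = length, data = bike, geom = "histogram")
qplot(x = log10(length + 1), data = bike, geom = "histogram") # assumed second plot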
0% 25% 50% 75% 100%
0.0000 124.0407 200.3027 341.0224 3749.3226
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
7. Describe the distribution of bike lane lengths numerically and graphically after stratifying them by (a) type and then by (b) number of lanes (`numLanes`).
# (a)
bike %>%
group_by(type) %>%
summarise(mean_length = mean(length, na.rm = TRUE),
sd_length = sd(length, na.rm = TRUE))
summarise: now 8 rows and 3 columns, ungrouped
# A tibble: 8 × 3
type mean_length sd_length
<chr> <dbl> <dbl>
1 BIKE BOULEVARD 197. 113.
2 BIKE LANE 300. 315.
3 CONTRAFLOW 136. 48.5
4 SHARED BUS BIKE 277. 140.
5 SHARROW 244. 188.
6 SIDEPATH 666. 619.
7 SIGNED ROUTE 264. 348.
8 <NA> 260. 254.
# (b)
bike %>%
group_by(numLanes) %>%
summarise(mean_length = mean(length, na.rm = TRUE),
sd_length = sd(length, na.rm = TRUE))
summarise: now 3 rows and 3 columns, ungrouped
# A tibble: 3 × 3
numLanes mean_length sd_length
<dbl> <dbl> <dbl>
1 0 308. 202.
2 1 278. 318.
3 2 258. 219.
16.6 Data Classes
library(readr)
library(tidyverse)
library(dplyr)
library(lubridate)
bike = read_csv("./data/Bike_Lanes.csv")
- Get all the different types of bike lanes from the `type` column. Use `sort(unique())`. Assign this to an object `btypes`. Type `dput(btypes)`.
[1] BIKE BOULEVARD SIDEPATH SIGNED ROUTE SIDEPATH BIKE LANE
[6] SIGNED ROUTE
7 Levels: BIKE BOULEVARD BIKE LANE CONTRAFLOW SHARED BUS BIKE ... SIGNED ROUTE
btypes = sort(unique(bike$type)) # get the unique categories in type. easier than typing them out as shown below in x
x = c("SIDEPATH","BIKE BOULEVARD", "BIKE LANE", "CONTRAFLOW",
"SHARED BUS BIKE", "SHARROW", "SIGNED ROUTE") #same vector as btypes
dput(btypes) #outputs the c() vector. can be used for assign
c("BIKE BOULEVARD", "BIKE LANE", "CONTRAFLOW", "SHARED BUS BIKE",
"SHARROW", "SIDEPATH", "SIGNED ROUTE")
- By rearranging vector `btypes` and using `dput`, recode `type` as a factor that has `SIDEPATH` as the first level. Print `head(bike$type)`. Note what you see. Run `table(bike$type)` afterwards and note the order.
c("BIKE BOULEVARD", "BIKE LANE", "CONTRAFLOW", "SHARED BUS BIKE",
"SHARROW", "SIDEPATH", "SIGNED ROUTE")
[1] "SIDEPATH" "BIKE BOULEVARD" "BIKE LANE" "CONTRAFLOW"
[5] "SHARED BUS BIKE" "SHARROW" "SIGNED ROUTE"
c("SIDEPATH", "BIKE BOULEVARD", "BIKE LANE", "CONTRAFLOW", "SHARED BUS BIKE",
"SHARROW", "SIGNED ROUTE")
BIKE BOULEVARD BIKE LANE CONTRAFLOW SHARED BUS BIKE SHARROW
49 621 13 39 589
SIDEPATH SIGNED ROUTE
7 304
bike$type = relevel(bike$type, "SIDEPATH")
bike$type = factor(bike$type,
levels = dput(btypes[c(6,1:5,7)]))
c("SIDEPATH", "BIKE BOULEVARD", "BIKE LANE", "CONTRAFLOW", "SHARED BUS BIKE",
"SHARROW", "SIGNED ROUTE")
c("SIDEPATH", "BIKE BOULEVARD", "BIKE LANE", "CONTRAFLOW", "SHARED BUS BIKE",
"SHARROW", "SIGNED ROUTE")
SIDEPATH BIKE BOULEVARD BIKE LANE CONTRAFLOW SHARED BUS BIKE
7 49 621 13 39
SHARROW SIGNED ROUTE
589 304
- Make a column called `type2`, which is a factor of the `type` column, with the levels: c("SIDEPATH", "BIKE BOULEVARD", "BIKE LANE"). Run `table(bike$type2)` with the option `useNA = "always"`. Note, we do not have to make type a character again before doing this.
bike = bike %>%
mutate(type2 = factor(type,
levels = c( "SIDEPATH", "BIKE BOULEVARD",
"BIKE LANE") ) )
table(bike$type)
SIDEPATH BIKE BOULEVARD BIKE LANE CONTRAFLOW SHARED BUS BIKE
7 49 621 13 39
SHARROW SIGNED ROUTE
589 304
SIDEPATH BIKE BOULEVARD BIKE LANE
7 49 621
table(bike$type2, useNA = "always") #The other levels are now counted as NA since they are not included in the levels
SIDEPATH BIKE BOULEVARD BIKE LANE <NA>
7 49 621 954
- a. Reassign `dateInstalled` as a character using `as.character`. Run `head(bike$dateInstalled)`.
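A sketch for part a:
bike$dateInstalled = as.character(bike$dateInstalled)
head(bike$dateInstalled)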
[1] "0" "2010" "2010" "0" "2011" "2007"
b. Reassign `dateInstalled` as a factor, using the default levels. Run `head(bike$dateInstalled)`.
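A sketch for part b:
bike$dateInstalled = factor(bike$dateInstalled)
head(bike$dateInstalled)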
[1] 0 2010 2010 0 2011 2007
Levels: 0 2006 2007 2008 2009 2010 2011 2012 2013
2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
0 2 368 206 86 625 101 107 10 0 0 0 0
2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 <NA>
0 2 368 206 86 625 101 107 10 0 0 0 0 126
c. Do not reassign `dateInstalled`, but simply run `head(as.numeric(bike$dateInstalled))`. We are looking to see what happens when we try to go from factor to numeric.
[1] 1 6 6 1 7 3
d. Do not reassign `dateInstalled`, but simply run `head(as.numeric(as.character(bike$dateInstalled)))`. This is how you get a "numeric" value back if they were incorrectly converted to factors.
[1] 0 2010 2010 0 2011 2007
- Convert `type` back to a character vector. Make a column `type2` (replacing the old one), where if the type is one of these categories, c("CONTRAFLOW", "SHARED BUS BIKE", "SHARROW", "SIGNED ROUTE"), call it `"OTHER"`. Use `%in%` and `ifelse`. Make `type2` a factor with the levels c("SIDEPATH", "BIKE BOULEVARD", "BIKE LANE", "OTHER").
# Note that this is done in a single mutate but the order matters
bike = bike %>% mutate(
type = as.character(type),
type2 = ifelse(type %in% c("CONTRAFLOW", "SHARED BUS BIKE",
"SHARROW", "SIGNED ROUTE"), "OTHER", type),
type2 = factor(type2, levels = c( "SIDEPATH", "BIKE BOULEVARD",
"BIKE LANE", "OTHER") ))
table(bike$type2)
SIDEPATH BIKE BOULEVARD BIKE LANE OTHER
7 49 621 945
# An alternative way to recode type2
bike2 = bike %>%
mutate(
type = factor(type,
levels = c( "SIDEPATH", "BIKE BOULEVARD",
"BIKE LANE", "CONTRAFLOW",
"SHARED BUS BIKE",
"SHARROW", "SIGNED ROUTE")
),
type2 = recode_factor(type,
"CONTRAFLOW" = "OTHER",
"SHARED BUS BIKE" = "OTHER",
"SHARROW" = "OTHER",
"SIGNED ROUTE" = "OTHER")
)
table(bike2$type2)
OTHER SIDEPATH BIKE BOULEVARD BIKE LANE
945 7 49 621
- Parse the following dates using the correct `lubridate` functions:
a. "2014/02-14"
[1] "2014-02-14"
b. "04/22/14 03:20" assume `mdy`
[1] "2014-04-22 03:20:00 UTC"
c. “4/5/2016 03:2:22” assume `mdy`
[1] "2016-04-05 03:02:22 UTC"
16.7 Data Cleaning
library(readr)
library(tidyverse)
library(dplyr)
library(lubridate)
library(broom)
bike = read_csv("./data/Bike_Lanes.csv")
- Count the number of rows of the bike data and count the number of complete cases of the bike data. Use `sum` and `complete.cases`.
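A sketch:
nrow(bike)
sum(complete.cases(bike))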
[1] 1631
[1] 257
- Create a data set called namat which is equal to `is.na(bike)`. What is the class of `namat`? Run `rowSums` and `colSums` on `namat`. These represent the number of missing values in the rows and columns of bike. Don't print `rowSums`, but do a table of the `rowSums`.
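A sketch:
namat = is.na(bike)
class(namat)
colSums(namat)
table(rowSums(namat))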
[1] "matrix" "array"
subType name block type numLanes
4 12 215 9 0
project route length dateInstalled
74 1269 0 0
0 1 2 3 4
257 1181 182 6 5
- Filter rows of bike that are NOT missing the `route` variable, assign this to the object `have_route`. Do a table of the `subType` variable using table, including the missing `subType`s. Get the same frequency distribution using `group_by(subType)` and `tally()` or `count()`.
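A sketch for the first two commands; the repeated tally/count outputs below come from the equivalent alternatives:
have_route = bike %>% filter(!is.na(route))
dim(have_route)
table(have_route$subType, useNA = "ifany")
have_route %>% group_by(subType) %>% tally()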
filter: removed 1,269 rows (78%), 362 rows remaining
[1] 362 9
STRALY STRPRD <NA>
3 358 1
tally: now 3 rows and 2 columns, ungrouped
# A tibble: 3 × 2
subType n
<chr> <int>
1 STRALY 3
2 STRPRD 358
3 <NA> 1
count: now 3 rows and 2 columns, one group variable remaining (subType)
# A tibble: 3 × 2
# Groups: subType [3]
subType n
<chr> <int>
1 STRALY 3
2 STRPRD 358
3 <NA> 1
summarize: now 3 rows and 2 columns, ungrouped
# A tibble: 3 × 2
subType n_obs
<chr> <int>
1 STRALY 3
2 STRPRD 358
3 <NA> 1
tally: now 3 rows and 2 columns, ungrouped
# A tibble: 3 × 2
subType n
<chr> <int>
1 STRALY 3
2 STRPRD 358
3 <NA> 1
tally: now 3 rows and 2 columns, ungrouped
# A tibble: 3 × 2
subType n
<chr> <int>
1 STRALY 3
2 STRPRD 358
3 <NA> 1
- Filter rows of `bike` that have the type `SIDEPATH` or `BIKE LANE` using `%in%`. Call it `side_bike`. Confirm this gives you the same number of results using `|` and `==`.
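A sketch of both approaches:
side_bike = bike %>% filter(type %in% c("SIDEPATH", "BIKE LANE"))
side_bike2 = bike %>% filter(type == "SIDEPATH" | type == "BIKE LANE")
nrow(side_bike) == nrow(side_bike2)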
filter: removed 1,003 rows (61%), 628 rows remaining
filter: removed 1,003 rows (61%), 628 rows remaining
[1] TRUE
[1] 628
[1] 628
filter: removed 1,003 rows (61%), 628 rows remaining
filter: removed 1,003 rows (61%), 628 rows remaining
[1] TRUE
- Do a cross tabulation of the bike `type` and the number of lanes (`numLanes`). Call it `tab`. Do a `prop.table` on the row and column margins. Try `as.data.frame(tab)` or `broom::tidy(tab)`.
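A sketch; naming the arguments gives the dimension labels seen below:
tab = table(type = bike$type, numLanes = bike$numLanes)
prop.table(tab, margin = 1) # row proportions
prop.table(tab, margin = 2) # column proportions
as.data.frame(tab)
broom::tidy(tab)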
numLanes
type 0 1 2
BIKE BOULEVARD 0.00000000 0.02040816 0.97959184
BIKE LANE 0.03220612 0.66183575 0.30595813
CONTRAFLOW 0.00000000 0.53846154 0.46153846
SHARED BUS BIKE 0.00000000 1.00000000 0.00000000
SHARROW 0.00000000 0.36842105 0.63157895
SIDEPATH 0.00000000 0.85714286 0.14285714
SIGNED ROUTE 0.00000000 0.69407895 0.30592105
numLanes
type 0 1 2
BIKE BOULEVARD 0.000000000 0.001121076 0.067605634
BIKE LANE 1.000000000 0.460762332 0.267605634
CONTRAFLOW 0.000000000 0.007847534 0.008450704
SHARED BUS BIKE 0.000000000 0.043721973 0.000000000
SHARROW 0.000000000 0.243273543 0.523943662
SIDEPATH 0.000000000 0.006726457 0.001408451
SIGNED ROUTE 0.000000000 0.236547085 0.130985915
type numLanes Freq
1 BIKE BOULEVARD 0 0
2 BIKE LANE 0 20
3 CONTRAFLOW 0 0
4 SHARED BUS BIKE 0 0
5 SHARROW 0 0
6 SIDEPATH 0 0
7 SIGNED ROUTE 0 0
8 BIKE BOULEVARD 1 1
9 BIKE LANE 1 411
10 CONTRAFLOW 1 7
11 SHARED BUS BIKE 1 39
12 SHARROW 1 217
13 SIDEPATH 1 6
14 SIGNED ROUTE 1 211
15 BIKE BOULEVARD 2 48
16 BIKE LANE 2 190
17 CONTRAFLOW 2 6
18 SHARED BUS BIKE 2 0
19 SHARROW 2 372
20 SIDEPATH 2 1
21 SIGNED ROUTE 2 93
# A tibble: 21 × 3
type numLanes n
<chr> <chr> <int>
1 BIKE BOULEVARD 0 0
2 BIKE LANE 0 20
3 CONTRAFLOW 0 0
4 SHARED BUS BIKE 0 0
5 SHARROW 0 0
6 SIDEPATH 0 0
7 SIGNED ROUTE 0 0
8 BIKE BOULEVARD 1 1
9 BIKE LANE 1 411
10 CONTRAFLOW 1 7
# ℹ 11 more rows
- Read the Property Tax data into R and call it the variable `tax`.
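A sketch; the file name here is hypothetical, so substitute whatever the property tax CSV is called on your system:
tax = read_csv("./data/Real_Property_Taxes.csv") # hypothetical file name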
- How many addresses pay property taxes? (Assume each row is a different address.)
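A sketch; a plausible reading of the last two outputs below is a split of rows by missing versus recorded CityTax:
dim(tax)
nrow(tax)
sum(is.na(tax$CityTax))  # assumed: rows with no city tax recorded
sum(!is.na(tax$CityTax)) # assumed: rows that pay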
[1] 237963 16
[1] 237963
[1] 237963
[1] 18696
[1] 219267
- What is the total (a) city (`CityTax`) and (b) state (`StateTax`) tax paid? You need to remove the `$` from the `CityTax` variable, then you need to make it numeric. Try `str_replace`, but remember `$` is "special" and you need `fixed()` around it.
[1] NA NA "$1463.45" "$2861.70" "$884.21" "$1011.60"
tax = tax %>%
mutate(
CityTax = str_replace(CityTax,
fixed("$"), "") ,
CityTax = as.numeric(CityTax)
)
# no piping
tax$CityTax = str_replace(tax$CityTax, fixed("$"), "")
tax$CityTax = as.numeric(tax$CityTax)
sum(tax$CityTax) #there are NAs
[1] NA
[1] 831974818
[1] 831.9748
# (b)
options(digits=12) # so no rounding
tax = tax %>% mutate(StateTax = parse_number(StateTax))
sum(tax$StateTax, na.rm = TRUE)
[1] 42085409.8
[1] 42.0854098
- Using `table()` or `group_by` and `summarize(n())` or `tally()`:
a. How many observations/properties are in each ward (`Ward`)?
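A sketch for part a; the tally and summarize logs below come from the equivalent dplyr versions:
table(tax$Ward)
tax %>% group_by(Ward) %>% tally()
tax %>% group_by(Ward) %>% summarize(n = n())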
01 02 03 04 05 06 07 08 09 10 11 12 13
6613 3478 2490 1746 466 4360 4914 11445 11831 1704 3587 8874 9439
14 15 16 17 18 19 20 21 22 23 24 25 26
3131 18663 11285 1454 2314 3834 11927 3182 1720 2577 6113 14865 25510
27 28 50
50215 10219 7
tally: now 29 rows and 2 columns, ungrouped
summarize: now 29 rows and 2 columns, ungrouped
b. What is the mean state tax per ward? Use `group_by` and `summarize`.
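A sketch; the unnamed summary column in the output shows the expression used:
tax %>%
  group_by(Ward) %>%
  summarize(mean(StateTax, na.rm = TRUE))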
summarize: now 29 rows and 2 columns, ungrouped
# A tibble: 29 × 2
Ward `mean(StateTax, na.rm = TRUE)`
<chr> <dbl>
1 01 291.
2 02 344.
3 03 726.
4 04 2327.
5 05 351.
6 06 178.
7 07 90.3
8 08 55.3
9 09 89.4
10 10 82.2
# ℹ 19 more rows
c. What is the maximum amount still due (`AmountDue`) in each ward? Use `group_by` and `summarize` with `max`.
tax = tax %>%
mutate(
AmountDue = str_replace(AmountDue,
fixed("$"), "") ,
AmountDue = as.numeric(AmountDue)
)
tax %>%
group_by(Ward) %>%
summarize(max_due = max(AmountDue, na.rm = TRUE))
summarize: now 29 rows and 2 columns, ungrouped
# A tibble: 29 × 2
Ward max_due
<chr> <dbl>
1 01 610032.
2 02 1459942.
3 03 1023666.
4 04 3746356.
5 05 948503.
6 06 407589.
7 07 1014574.
8 08 249133.
9 09 150837.
10 10 156216.
# ℹ 19 more rows
d. What is the 75th percentile of city and state tax paid by Ward? (`quantile`)
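A sketch for the city tax (an assumption that the output below is the city version; swap in StateTax for the state one, as in the combined table that follows):
tax %>%
  group_by(Ward) %>%
  summarize(Percentile = quantile(CityTax, probs = 0.75, na.rm = TRUE))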
summarize: now 29 rows and 2 columns, ungrouped
# A tibble: 29 × 2
Ward Percentile
<chr> <dbl>
1 01 311.
2 02 321.
3 03 305.
4 04 569.
5 05 131.
6 06 195.
7 07 40.3
8 08 84.1
9 09 135.
10 10 101.
# ℹ 19 more rows
# putting it all together
ward_table = tax %>%
group_by(Ward) %>%
summarize(
number_of_obs = n(),
mean_state_tax = mean(StateTax, na.rm = TRUE),
max_amount_due = max(AmountDue, na.rm = TRUE),
q75_city = quantile(CityTax, probs = 0.75, na.rm = TRUE),
q75_state = quantile(StateTax, probs = 0.75, na.rm = TRUE)
)
summarize: now 29 rows and 6 columns, ungrouped
- Make boxplots showing `CityTax` (`y`-variable) by whether the property is a principal residence (`x = ResCode`) or not. You will need to trim some leading/trailing white space from `ResCode`.
tax = tax %>%
mutate(ResCode = str_trim(ResCode))
# Using ggplot2
qplot(y = CityTax, x = ResCode, data = tax, geom = "boxplot")
#CityTax is highly right skewed. The log transform makes it more symmetric
qplot(y = log10(CityTax + 1), x = ResCode, data = tax, geom = "boxplot")
11. Subset the data to only retain those houses that are principal residences. Which command subsets rows, `filter()` or `select()`? a. How many such houses are there?
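filter() subsets rows (select() subsets columns); a sketch, where the ResCode value is an assumption about how principal residences are coded:
prin_res = tax %>% filter(ResCode == "PRINCIPAL RESIDENCE") # assumed coding of ResCode
dim(prin_res)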
filter: removed 123,029 rows (52%), 114,934 rows remaining
filter: removed 123,029 rows (52%), 114,934 rows remaining
[1] 114934 16
b. Describe the distribution of property taxes on these residences. Use `hist`/`qplot` with certain breaks or `plot(density(variable))`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
12. Make an object called `health.sal` using the salaries data set, with only the rows whose job title (`JobTitle`) contains "fire" anywhere in it. Remember that `fixed("string_match", ignore_case = TRUE)` will ignore case.
sal = read_csv("./data/Baltimore_City_Employee_Salaries_FY2015.csv")
health.sal = sal %>%
filter(str_detect(JobTitle,
fixed("fire", ignore_case = TRUE)))
- Make a data set called `trans` which contains only agencies that contain "TRANS".
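A sketch:
trans = sal %>% filter(str_detect(Agency, "TRANS"))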
filter: removed 13,982 rows (>99%), 35 rows remaining
- What is/are the profession(s) of people who have “abra” in their name for Baltimore’s Salaries? Case should be ignored.
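A sketch:
sal %>% filter(str_detect(name, fixed("abra", ignore_case = TRUE)))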
filter: removed 14,005 rows (>99%), 12 rows remaining
# A tibble: 12 × 7
name JobTitle AgencyID Agency HireDate AnnualSalary GrossPay
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Abraham,Donta D LABORER (H… B70357 DPW-S… 10/16/2… $30721.00 $32793.…
2 Abraham,Santhosh ACCOUNTANT… A50100 DPW-W… 12/31/2… $49573.00 $49104.…
3 Abraham,Sharon M HOUSING IN… A06043 Housi… 12/15/1… $50365.00 $50804.…
4 Abrahams,Brandon A POLICE OFF… A99416 Polic… 01/20/2… $46199.00 $18387.…
5 Abrams,Maria OFFICE SER… A29008 State… 09/19/2… $35691.00 $34970.…
6 Abrams,Maxine COLLECTION… A14003 FIN-C… 08/26/2… $37359.00 $37768.…
7 Abrams,Terry RECREATION… P04001 R&P-R… 06/19/2… $20800.00 $1690.00
8 Bey,Abraham PROCUREMEN… A17001 FIN-P… 04/23/2… $69600.00 $69763.…
9 Elgamil,Abraham D AUDITOR SU… A24001 COMP-… 01/31/1… $94400.00 $95791.…
10 Gatto,Abraham M POLICE OFF… A99200 Polic… 12/12/2… $66086.00 $73651.…
11 Schwartz,Abraham M GENERAL CO… A54005 FPR A… 07/26/1… $110400.00 $111408…
12 Velez,Abraham L POLICE OFF… A99036 Polic… 10/24/2… $70282.00 $112957…
- What does the distribution of annual salaries look like? (Use `hist` with 20 breaks.) What is the IQR? Hint: first convert to numeric. Try `str_replace`, but remember `$` is "special" and you need `fixed()` around it.
sal = sal %>% mutate(AnnualSalary = str_replace(AnnualSalary, fixed("$"), ""))
sal = sal %>% mutate(AnnualSalary = as.numeric(AnnualSalary))
#ggplot
qplot(x = AnnualSalary, data = sal, geom = "histogram", bins = 20)
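The quantiles below give the IQR as the 75th minus the 25th percentile (68112 - 33354):
quantile(sal$AnnualSalary)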
0% 25% 50% 75% 100%
900 33354 48126 68112 238772
- Convert `HireDate` to the `Date` class and plot Annual Salary vs Hire Date. Use `AnnualSalary ~ HireDate` with a `data = sal` argument in plot, or use `x`, `y` notation in `scatter.smooth`. Use the `lubridate` package. Is it `mdy(date)` or `dmy(date)` for this data? Look at `HireDate`.
sal = sal %>% mutate(HireDate = lubridate::mdy(HireDate))
q = qplot(y = AnnualSalary, x = HireDate,
data = sal, geom = "point")
q + geom_smooth(colour = "red", se = FALSE)
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs =
"cs")'
# base R
# plot(AnnualSalary ~ HireDate, data = sal)
# scatter.smooth(sal$AnnualSalary, x = sal$HireDate, col = "red")
- Create a smaller dataset that only includes the Police Department, Fire Department and Sheriff's Office. Use the `Agency` variable with string matching. Call this `emer`. How many employees are in this new dataset?
filter: removed 9,095 rows (65%), 4,922 rows remaining
emer = sal %>% filter(
str_detect(Agency, "Sheriff's Office") |
str_detect(Agency, "Police Department") |
str_detect(Agency, "Fire Department")
)
filter: removed 9,095 rows (65%), 4,922 rows remaining
- Create a variable called `dept` in the `emer` data set, `dept = str_extract(Agency, ".*(ment|ice)")`. E.g. we want to extract all characters up until "ment" or "ice" (we can group in regex using parentheses) and then discard the rest. Replot annual salary versus hire date and color by dept (not yet using ggplot). Use the argument `col = factor(dept)` in `plot`.
emer = emer %>%
mutate(
dept = str_extract(Agency, ".*(ment|ice)")
)
# Replot annual salary versus hire date, color by dept (not yet - using ggplot)
ggplot(aes(x = HireDate, y = AnnualSalary,
colour = dept), data = emer) +
geom_point() + theme(legend.position = c(0.5, 0.8))
19. (Bonus) Convert the `LotSize` variable to a numeric square feet variable in the `tax` data set. Some tips: a) 1 acre = 43560 square feet; b) the hyphens represent decimals. (This will take a lot of searching to find all the string changes needed before you can convert to numeric.)
First let's take care of acres.
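A sketch of how the acre rows might be located; the exact pattern is an assumption:
aIndex = which(str_detect(tax$LotSize, "AC")) # rows recorded in acres (assumed pattern)
head(aIndex)
head(tax$LotSize[aIndex])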
[1] 1 2 9 10 11 15
[1] "0.020 ACRES" "0.020 ACRES" "0.020 ACRES" "0.020 ACRES" "0.020 ACRES"
[6] "0.020 ACRES"
acre = tax$LotSize[aIndex] # temporary variable
## find and replace character strings
acre = str_replace_all(acre, " AC.*","")
acre = str_replace_all(acre, " %","")
table(!is.na(as.numeric(acre)))
FALSE TRUE
237 18328
[1] "2-158" "O.084" "1-87" "1-85"
[5] "1-8" "1-778" "2-423" "1-791"
[9] "1-81" "O.018" "IMP ONLY 0.190" "26,140"
[13] "8.OOO" "IMP.ONLY 3.615" "2-688" "1.914ACRES"
[17] "2-364" "3-67" "3-46" "3-4"
[21] "2-617" "2-617" "2-617" "4-40"
[25] "3-94" "2-456" "2-486" "2-24"
[29] "1-36" "21-8X5.5" "3-14" "1-566"
[33] "O-70" "16-509" "2-36" "5-25"
[37] "1-005" "0.045ACRES" "0.039ACRES" "0.029ACRES"
[41] "0.029ACRES" "0.041ACRES" "0.048ACRES" "0.053ACRES"
[45] "O.147" "O-602" "O-3567" "28-90"
[49] "4-3165" "1-379"
## lets clean the rest
acre = str_replace_all(acre, "-",".") # hyphen instead of decimal
head(acre[is.na(as.numeric(acre))])
[1] "O.084" "O.018" "IMP ONLY 0.190" "26,140"
[5] "8.OOO" "IMP.ONLY 3.615"
FALSE TRUE
96 18469
[1] "O.084" "O.018" "IMP ONLY 0.190" "26,140"
[5] "8.OOO" "IMP.ONLY 3.615"
# take care of individual mistakes
acre = str_replace_all(acre, "O","0") # 0 vs O
acre = str_replace_all(acre, "Q","") # Q, oops
acre = str_replace_all(acre, ",.",".") # extra ,
acre = str_replace_all(acre, ",","") # extra ,
acre = str_replace_all(acre, "L","0") # leading L
acre = str_replace_all(acre, "-",".") # hyphen to period
acre[is.na(as.numeric(acre))]
[1] "IMP 0N0Y 0.190" "IMP.0N0Y 3.615" "21.8X5.5"
[1] 3
Now let’s convert all of the square feet variables
library(purrr)
fIndex = which(str_detect(tax$LotSize, "X"))
ft = tax$LotSize[fIndex]
ft = str_replace_all(ft, fixed("&"), "-")
ft = str_replace_all(ft, "IMP ONLY ", "")
ft = str_replace_all(ft, "`","1")
ft= map_chr(str_split(ft, " "), first)
## now get the widths and lengths
width = map_chr(str_split(ft,"X"), first)
length = map_chr(str_split(ft,"X"), nth, 2)
## width
widthFeet = as.numeric(map_chr(str_split(width, "-"), first))
widthInch = as.numeric(map_chr(str_split(width, "-"),nth,2))/12
widthInch[is.na(widthInch)] = 0 # when no inches present
totalWidth = widthFeet + widthInch # add together
# length
lengthFeet = as.numeric(map_chr(str_split(length, "-"),first))
lengthInch = as.numeric(map_chr(str_split(length, "-",2),nth,2))/12
lengthInch[is.na(lengthInch)] = 0 # when no inches present
totalLength = lengthFeet + lengthInch
# combine together for square feet
sqrtFt = totalWidth*totalLength
ft[is.na(sqrtFt)] # what is left?
[1] "161X" "11XX0" "M2X169-9"
[4] "Q8X120" "37-1X-60-10X57-11" "POINT"
[7] "22-11" "ASSESS" "ASSESSCTY56-2X1-1"
[10] "ASSESSCTY53-2X1-1" "O-7X125" "O-3X125"
[13] "X134" "{1-4X128-6" "12-3XX4-2"
[16] "13XX5" "F7-10X80" "N1-8X92-6"
[19] "POINTX100-5X10" "]7X100" "16-11XX5"
[22] "Q5X119"
And now we combine everything together:
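A sketch of one way the acre and width-by-length pieces might be combined into a single sqft column; the printed value is the proportion of rows converted so far:
tax$sqft = NA
tax$sqft[aIndex] = as.numeric(acre) * 43560 # acres to square feet
tax$sqft[fIndex] = sqrtFt                   # dimensions already multiplied out above
mean(!is.na(tax$sqft))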
[1] 0.935456352458
# already in square feet, easy!!
sIndex=which(str_detect(tax$LotSize, "FT") | str_detect(tax$LotSize, "S.*F."))
sf = tax$LotSize[sIndex] # subset temporary variable
sqft2 = map_chr(str_split(sf,"( |SQ|SF)"),first)
sqft2 = as.numeric(str_replace_all(sqft2, ",", "")) # remove , and convert
tax$sqft[sIndex] = sqft2
table(is.na(tax$sqft))
FALSE TRUE
237771 192
[1] "IMPROVEMENTS ONLY" "0.067" "0.858"
[4] "161X 137" "IMP ONLY" "0.036 ARES"
[7] "11XX0" "IMPROVEMENT ONLY" "IMPROVEMENT ONLY"
[10] "IMPROVEMENT ONLY" "IMPROVEMENT ONLY" "IMPROVEMENT ONLY"
[13] "0.033" "IMPROVEMENT ONLY" "IMPROVEMENT ONLY"
[16] "IMP ONLY" "IMP ONLY 0.190 AC" "IMP ONLY"
[19] "AIR RIGHTS" "AIR RIGHTS" "AIR RIGHTS"
[22] "IMP ONLY" "IMP ONLY" "AIR RIGHTS"
[25] "IMP ONLY" "0.013" "IMP ONLY"
[28] "IMP ONLY" "IMP ONLY" "IMP ONLY"
[31] "196552-8" "0.040" "IMP.ONLY 3.615 AC"
[34] "M2X169-9" "Q8X120" "IMP ONLY"
[37] "IMP ONLY" "IMPROVEMENTS ONLY" "IMPROVEMENTS ONLY"
[40] "IMPROVEMENTS ONLY" "0.016" NA
[43] "0.230" "0.026" "\\560 S.F."
[46] "1233.04" "IMP ONLY" "0.649"
[49] "ASSESSMENT ONLY" "24179D096" "IMPROVEMENT ONLY"
[52] "20080211" NA "IMP ONLY"
[55] "0.028" "IMP ONLY" "IMPROVEMENT ONLY"
[58] "2381437.2 CUBIC F" "IMP ONLY" "IMP ONLY"
[61] "IMPROVEMENTS ONLY" "AIR RIGHTS ONLY" "IMP ONLY"
[64] "IMP ONLY" "37-1X-60-10X57-11" "IMPROVEMENT ONLY"
[67] "IMPROVEMENT ONLY" "IMPROVEMENT ONLY" "IMPROVEMENT ONLY"
[70] "IMPROVEMENT ONLY" "IMP ONLY" "IMP ONLY"
[73] "IMP ONLY" "IMP ONLY" "POINT X 57-9"
[76] "IMP ONLY" "AIR RIGHTS" "0.5404%"
[79] "IMPROVEMENT ONLY" "IMP ONLY" "19520819"
[82] NA "0.024" "IMPROVEMENT ONLY"
[85] "IMPROVEMENT ONLY" "68I.0 SQ FT" "11979"
[88] "1159-0 S.F." "IMPROVEMENT ONLY" NA
[91] "IMPROVEMENT ONLY" "0.039" "IMPROVEMENT ONLY"
[94] "IMP ONLY" "IMP ONLY" "IMPROVEMENT ONLY"
[97] "IMP ONLY" "IMP ONLY" "0.681 ARCES"
[100] "7,138 SF" "13,846 SF" "9,745 SF"
[103] "13,772 SF" "17,294 SF" "15,745 SF"
[106] "18, 162 SF" "01185" "AIR RIGHTS"
[109] "AIR RIGHTS" "IMPROVEMENTS ONLY" "22-11 X57"
[112] "0.129%" "0.129%" "0.129%"
[115] "0.129%" "0.129%" "ASSESS CTY 84 S.F"
[118] "IMPROVEMENT ONLY" "ASSESS CTY 50 S.F" "ASSESS CTY 57X1-1"
[121] "IMP ONLY" "ASSESSCTY56-2X1-1" "ASSESSCTY53-2X1-1"
[124] "IMPROVEMENT ONLY" "IMPROVEMENT ONLY" "O-7X125"
[127] "O-3X125" "0.026" "X134"
[130] "{1-4X128-6" NA NA
[133] NA "IMPROVEMENT ONLY" "IMPROVEMENT ONLY"
[136] "12-3XX4-2" "13XX5" "ASSESS CTY"
[139] "IMP ONLY" "IMP ONLY" "IMP ONLY"
[142] "IMP ONLY" "IMPROVEMENTS ONLY" "IMPROVEMENT ONLY"
[145] "0.032" "20050712" "F7-10X80"
[148] "IMP ONLY" "IMP ONLY" "0.033"
[151] "03281969" "IMP.ONLY" "IMP ONLY"
[154] "REAR PART 697 S.F" "IMP ONLY" "IMPROVEMENT ONLY"
[157] "N1-8X92-6" "POINTX100-5X10" "IMPROVEMENTS ONLY"
[160] "]7X100" "07031966" "13O4 SQ FT"
[163] "05081978" "0.030" "IMPROVEMENTS ONLY"
[166] "AIR RIGHTS" "AIR RIGHTS" "IMPROVEMENT ONLY"
[169] "IMPROVEMENT ONLY" "IMPROVEMENT ONLY" "IMPROVEMENT ONLY"
[172] "0.191 ARES" "19815-4" "11`79"
[175] NA NA "IMPROVEMENT ONLY"
[178] "IMP ONLY" "194415" "IMP ONLY"
[181] "IMPROVEMENT ONLY" "IMP ONLY" "120-103"
[184] "IMP ONLY" "IMP. ONLY" "196700"
[187] "IMPROVEMENT ONLY" "16-11XX5" "Q5X119"
[190] "21210" "IMPROVEMENT ONLY" "5306120"
There are still many strings that would need to be cleaned before the rest could be converted.
16.8 Manipulating Data
- Read in the Bike_Lanes_Wide.csv dataset and call it `wide`.
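A sketch:
wide = read_csv("./data/Bike_Lanes_Wide.csv")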
Rows: 134 Columns: 9
── Column specification ──────────────────────────────────────────────
Delimiter: ","
chr (1): name
dbl (8): BIKE BOULEVARD, BIKE LANE, CONTRAFLOW, SHARED BUS BIKE, SHARROW, SI...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
- Reshape `wide` using `pivot_longer`. Call this data `long`. Make the key `lanetype`, and the value `the_length`. Make sure we gather all columns but `name`, using `-name`. Note the `NA`s here.
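A sketch:
long = wide %>%
  pivot_longer(-name, names_to = "lanetype", values_to = "the_length")
head(long)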
pivot_longer: reorganized (BIKE BOULEVARD, BIKE LANE, CONTRAFLOW, SHARED BUS BIKE, SHARROW, …) into (lanetype, the_length) [was 134x9, now 1072x3]
# A tibble: 6 × 3
name lanetype the_length
<chr> <chr> <dbl>
1 ALBEMARLE ST BIKE BOULEVARD NA
2 ALBEMARLE ST BIKE LANE NA
3 ALBEMARLE ST CONTRAFLOW NA
4 ALBEMARLE ST SHARED BUS BIKE NA
5 ALBEMARLE ST SHARROW 110.
6 ALBEMARLE ST SIDEPATH NA
- Read in the roads and crashes .csv files and call them `road` and `crash`.
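A sketch; the file names are assumptions based on the prompt, and the read logs below show crash (110 x 4) was read first:
crash = read_csv("./data/crashes.csv") # hypothetical file name
road = read_csv("./data/roads.csv")    # hypothetical file name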
Rows: 110 Columns: 4
── Column specification ──────────────────────────────────────────────
Delimiter: ","
chr (1): Road
dbl (3): Year, N_Crashes, Volume
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 5 Columns: 3
── Column specification ──────────────────────────────────────────────
Delimiter: ","
chr (2): Road, District
dbl (1): Length
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 6 × 4
Year Road N_Crashes Volume
<dbl> <chr> <dbl> <dbl>
1 1991 Interstate 65 25 40000
2 1992 Interstate 65 37 41000
3 1993 Interstate 65 45 45000
4 1994 Interstate 65 46 45600
5 1995 Interstate 65 46 49000
6 1996 Interstate 65 59 51000
# A tibble: 5 × 3
Road District Length
<chr> <chr> <dbl>
1 Interstate 65 Greenfield 262
2 Interstate 70 Vincennes 156
3 US-36 Crawfordsville 139
4 US-40 Greenfield 150
5 US-52 Crawfordsville 172
- Replace (using `str_replace`) any hyphens (`-`) with a space in `crash$Road`. Call this data `crash2`. Table the `Road` variable.
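A sketch:
crash2 = crash %>% mutate(Road = str_replace(Road, "-", " "))
table(crash2$Road)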
Interstate 275 Interstate 65 Interstate 70 US 36 US 40
22 22 22 22 22
- How many observations are in each dataset?
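A sketch:
dim(crash)
dim(road)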
[1] 110 4
[1] 5 3
- Separate the `Road` column (using `separate`) into `type` and `number` in `crash2`. Reassign this to `crash2`. Table `crash2$type`. Then create a new variable called `road_hyphen` using the `unite` function: unite the `type` and `number` columns using a hyphen (`-`) and then table `road_hyphen`.
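A sketch:
crash2 = crash2 %>% separate(Road, into = c("type", "number"), sep = " ")
table(crash2$type)
crash2 = crash2 %>% unite(road_hyphen, type, number, sep = "-", remove = FALSE)
table(crash2$road_hyphen)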
Interstate US
66 44
Interstate-275 Interstate-65 Interstate-70 US-36 US-40
22 22 22 22 22
- Which and how many years were data collected in the `crash` dataset?
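A sketch:
unique(crash$Year)
length(unique(crash$Year))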
[1] 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005
[16] 2006 2007 2008 2009 2010 2011 2012
[1] 22
- Read in the dataset Bike_Lanes.csv and call it `bike`.
- Keep rows where the record is not missing `type` and not missing `name`, and re-assign the output to `bike`.
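A sketch covering this item and the previous one:
bike = read_csv("./data/Bike_Lanes.csv")
bike = bike %>% filter(!is.na(type), !is.na(name))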
filter: removed 21 rows (1%), 1,610 rows remaining
- Summarize and group the data by grouping `name` and `type` (i.e., for each type within each name) and take the `sum` of the `length` (reassign the sum of the lengths to the `length` variable). Call this data set `sub`.
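A sketch:
sub = bike %>%
  group_by(name, type) %>%
  summarize(length = sum(length))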
summarize: now 162 rows and 3 columns, one group variable remaining (name)
- Reshape `sub` using `pivot_wider`. Spread the data where the key is `type` and we want the value in the new columns to be `length`, the bike lane length. Call this `wide2`. Look at the column names of `wide2`; what are they? (They also have spaces.)
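A sketch:
wide2 = sub %>% pivot_wider(names_from = type, values_from = length)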
pivot_wider: reorganized (type, length) into (SHARROW, SIGNED ROUTE, BIKE LANE, CONTRAFLOW, SHARED BUS BIKE, …) [was 162x3, now 132x8]
Look at the column names of `wide2`. What are they? (They also have spaces.)
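colnames(wide2)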
[1] "name" "SHARROW" "SIGNED ROUTE" "BIKE LANE"
[5] "CONTRAFLOW" "SHARED BUS BIKE" "SIDEPATH" "BIKE BOULEVARD"
- Join data in the crash and road datasets to retain only complete data (using an inner join), e.g. those observations with road lengths and districts. Merge without using the `by` argument, then merge using `by = "Road"`. Call the output `merged`. How many observations are there?
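A sketch; the two identical join logs below come from the two calls:
merged = inner_join(crash, road)
merged = inner_join(crash, road, by = "Road")
dim(merged)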
Joining with `by = join_by(Road)`
inner_join: added 2 columns (District, Length)
> rows only in x (22)
> rows only in y ( 1)
> matched rows 88
> ====
> rows total 88
inner_join: added 2 columns (District, Length)
> rows only in x (22)
> rows only in y ( 1)
> matched rows 88
> ====
> rows total 88
[1] 88 6
- Join data using a `full_join`. Call the output `full`. How many observations are there?
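A sketch:
full = full_join(crash, road)
nrow(full)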
Joining with `by = join_by(Road)`
full_join: added 2 columns (District, Length)
> rows only in x 22
> rows only in y 1
> matched rows 88
> =====
> rows total 111
full_join: added 2 columns (District, Length)
> rows only in x 22
> rows only in y 1
> matched rows 88
> =====
> rows total 111
[1] 111
- Do a left join of the `road` and `crash` data. ORDER matters here! How many observations are there?
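A sketch, with road on the left:
left = left_join(road, crash)
nrow(left)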
Joining with `by = join_by(Road)`
left_join: added 3 columns (Year, N_Crashes, Volume)
> rows only in x 1
> rows only in y (22)
> matched rows 88 (includes duplicates)
> ====
> rows total 89
left_join: added 3 columns (Year, N_Crashes, Volume)
> rows only in x 1
> rows only in y (22)
> matched rows 88 (includes duplicates)
> ====
> rows total 89
[1] 89
- Repeat above with a `right_join` with the same order of the arguments. How many observations are there?
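A sketch:
right = right_join(road, crash)
nrow(right)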
Joining with `by = join_by(Road)`
right_join: added 3 columns (Year, N_Crashes, Volume)
> rows only in x ( 1)
> rows only in y 22
> matched rows 88
> =====
> rows total 110
right_join: added 3 columns (Year, N_Crashes, Volume)
> rows only in x ( 1)
> rows only in y 22
> matched rows 88
> =====
> rows total 110
[1] 110
16.9 Data Visualization
library(readr)
library(ggplot2)
library(tidyr)
library(dplyr)
library(lubridate)
library(stringr)
circ = read_csv("./data/Charm_City_Circulator_Ridership.csv")
# convert dates
circ = mutate(circ, date = mdy(date))
# change colnames for reshaping
colnames(circ) = colnames(circ) %>%
str_replace("Board", ".Board") %>%
str_replace("Alight", ".Alight") %>%
str_replace("Average", ".Average")
# make long
long = gather(circ, "var", "number",
starts_with("orange"),
starts_with("purple"), starts_with("green"),
starts_with("banner"))
# separate
long = separate(long, var, into = c("route", "type"),
sep = "[.]")
## take just average ridership per day
avg = filter(long, type == "Average")
avg = filter(avg, !is.na(number))
# separate
type_wide = spread(long, type, value = number)
head(type_wide)
# A tibble: 6 × 7
day date daily route Alightings Average Boardings
<chr> <date> <dbl> <chr> <dbl> <dbl> <dbl>
1 Friday 2010-01-15 1644 banner NA NA NA
2 Friday 2010-01-15 1644 green NA NA NA
3 Friday 2010-01-15 1644 orange 1643 1644 1645
4 Friday 2010-01-15 1644 purple NA NA NA
5 Friday 2010-01-22 1394. banner NA NA NA
6 Friday 2010-01-22 1394. green NA NA NA
- Plot average ridership (`avg` data set) by `date` using a scatterplot.
a. Color the points by route (`orange`, `purple`, `green`, `banner`)
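A sketch for part a:
qplot(x = date, y = number, data = avg, colour = route)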
b. Add black smoothed curves for each route
qplot(x = date, y = number, data = avg, colour = route) +
geom_smooth(aes(group = route), colour= "black") #change the smoothed line to black so it is more visible
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs =
"cs")'
c. Color the points by day of the week
# qplot(x = date, y = number, data = avg, colour = day)
# This way the legend appears on order for days of the week
avg = avg %>% mutate(dayFactor = factor(day, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")))
g = ggplot(avg, aes(x = date, y = number, color = dayFactor))
g + geom_point()
2. Replot 1a where the colors of the points are the name of the route (with `banner` -> `blue`).
pal = c("blue", "darkgreen","orange","purple")
# qplot(x = date, y = number, data = avg, colour = route) +
# scale_colour_manual(values = pal)
g = ggplot(avg, aes(x = date, y = number, color = route))
g + geom_point() + scale_colour_manual(values = pal)
3. Plot average ridership by date with one panel per route
g = ggplot(avg, aes(x = date, y = number, color = route))
g + geom_point() +
facet_wrap( ~ route) +
scale_colour_manual(values=pal)
4. Plot average ridership by date with separate panels by day of the week, colored by `route`.
ggplot(aes(x = date, y = number, colour = route), data= avg) +
geom_point() +
facet_wrap( ~dayFactor) + scale_colour_manual(values=pal)
5. Plot average ridership (`avg`) by date, colored by `route` (same as 1a). (Do not take an average; use the average column for each route.) Make the x-label `"Year"`. Make the y-label `"Number of People"`. Use the black and white theme `theme_bw()`. Change the text size to 20 (`text = element_text(size = 20)`) in `theme`.
first_plot = ggplot(avg, aes(x = date, y = number, color = route)) +
geom_point() + scale_colour_manual(values=pal)
first_plot +
xlab("Year") + ylab("Number of People") + theme_bw() +
theme(text = element_text(size = 20))
6. Plot average ridership on the `orange` route versus date as a solid line, and add dashed "error" lines based on the boardings and alightings. The line colors should be orange. (Hint: `linetype` is an aesthetic for lines; see also `scale_linetype` and `scale_linetype_manual`. Use `Alightings = "dashed", Boardings = "dashed", Average = "solid"`.)
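First subset the long data to the orange route (a sketch; the filter log below matches):
orange = long %>% filter(route == "orange")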
filter: removed 10,314 rows (75%), 3,438 rows remaining
ggplot(orange, aes(x = date, y = number)) +
geom_line(aes(linetype = type), colour = "orange") +
scale_linetype_manual(
values = c(Alightings = "dashed",
Boardings = "dashed",
Average = "solid"))
The following zooms in on a section of the plot, so you can actually see the three lines
16.10 Statistical Analysis
library(readr)
library(dplyr)
mort = read_csv("./data/indicatordeadkids35.csv")
mort = mort %>%
dplyr::rename(country = "...1")
- Compute the correlation between the `1980`, `1990`, `2000`, and `2010` mortality data. No need to save this in an object; just display the result to the screen. Note any `NA`s. Then compute using `use = "complete.obs"`.
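A sketch; the later matrices repeat the default call via equivalent column selections:
mort %>%
  select("1980", "1990", "2000", "2010") %>%
  cor()
mort %>%
  select("1980", "1990", "2000", "2010") %>%
  cor(use = "complete.obs")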
1980 1990 2000 2010
1980 1.000000000000 0.960153953723 0.888841117492 NA
1990 0.960153953723 1.000000000000 0.961384203917 NA
2000 0.888841117492 0.961384203917 1.000000000000 NA
2010 NA NA NA 1
1980 1990 2000 2010
1980 1.000000000000 0.959684628292 0.887743334542 0.846828397953
1990 0.959684628292 1.000000000000 0.961026893827 0.924719217724
2000 0.887743334542 0.961026893827 1.000000000000 0.986234456933
2010 0.846828397953 0.924719217724 0.986234456933 1.000000000000
1980 1990 2000 2010
1980 1.000000000000 0.960153953723 0.888841117492 NA
1990 0.960153953723 1.000000000000 0.961384203917 NA
2000 0.888841117492 0.961384203917 1.000000000000 NA
2010 NA NA NA 1
1980 1990 2000 2010
1980 1.000000000000 0.960153953723 0.888841117492 NA
1990 0.960153953723 1.000000000000 0.961384203917 NA
2000 0.888841117492 0.961384203917 1.000000000000 NA
2010 NA NA NA 1
- Compute the correlation between the `Myanmar`, `China`, and `United States` mortality data. Store this correlation matrix in an object called `country_cor`.
mort_subs = mort %>%
filter(country %in% c("China", "Myanmar","United States")) %>%
arrange(country)
filter: removed 194 rows (98%), 3 rows remaining
long = mort_subs %>%
pivot_longer( -country, names_to = "year", values_to = "death") %>%
filter(!is.na(death))
pivot_longer: reorganized (1760, 1761, 1762, 1763, 1764, …) into (year, death) [was 3x255, now 762x3]
filter: removed 120 rows (16%), 642 rows remaining
pivot_wider: reorganized (country, death) into (China, Myanmar, United States) [was 642x3, now 214x4]
Myanmar China United States
Myanmar 1.000000000000 0.974364554315 0.629262462521
China 0.974364554315 1.000000000000 0.690926321344
United States 0.629262462521 0.690926321344 1.000000000000
#alternatively
mort_subs = mort %>%
filter(country %in% c("China", "Myanmar","United States")) %>%
dplyr::select(-country)
filter: removed 194 rows (98%), 3 rows remaining
b. Extract the Myanmar-US correlation from the correlation matrix.
Run the following to see that the order is China, Myanmar, US:
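A name-based extraction avoids depending on the row/column order (a sketch):
country_cor["Myanmar", "United States"]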
[1] 0.629262462521
Extract the Myanmar-US correlation.
[1] 0.629262462521
- Is there a difference between mortality information from `1990` and `2000`? Run a t-test and a Wilcoxon rank sum test to assess this. Hint: to extract the column of information for `1990`, use `mort$"1990"`.
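A sketch; the outputs show the paired versions of both tests were used:
t.test(mort$"1990", mort$"2000", paired = TRUE)
wilcox.test(mort$"1990", mort$"2000", paired = TRUE)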
Paired t-test
data: mort$"1990" and mort$"2000"
t = 10.02534013, df = 196, p-value < 2.220446e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
0.127776911751 0.190359276239
sample estimates:
mean difference
0.159068093995
Wilcoxon signed rank test with continuity correction
data: mort$`1990` and mort$`2000`
V = 17997, p-value < 2.220446e-16
alternative hypothesis: true location shift is not equal to 0
- Using the cars dataset, fit a linear regression model with vehicle cost (`VehBCost`) as the outcome and vehicle age (`VehicleAge`) and whether it's an online sale (`IsOnlineSale`) as predictors, as well as their interaction. Save the model fit in an object called `lmfit_cars` and display the summary table.
cars = read_csv("./data/kaggleCarAuction.csv",
col_types = cols(VehBCost = col_double()))
lmfit_cars = lm(VehBCost ~ VehicleAge + IsOnlineSale + VehicleAge:IsOnlineSale,
data = cars)
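The tidy coefficient table shown below can be produced with broom (a sketch):
broom::tidy(lmfit_cars)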
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 8063. 16.6 486. 0
2 VehicleAge -321. 3.67 -87.4 0
3 IsOnlineSale 514. 108. 4.77 0.00000180
4 VehicleAge:IsOnlineSale -55.4 25.6 -2.17 0.0303
- Create a variable called `expensive` in the `cars` data that indicates if the vehicle cost is over `$10,000`. Use a chi-squared test to assess if there is a relationship between a car being expensive and it being labeled as a "bad buy" (`IsBadBuy`).
cars = cars %>%
mutate(expensive = VehBCost > 10000)
tab = table(cars$expensive, cars$IsBadBuy)
chisq.test(tab)
Pearson's Chi-squared test with Yates' continuity correction
data: tab
X-squared = 1.815247312, df = 1, p-value = 0.177880045
- Fit a logistic regression model where the outcome is "bad buy" status and the predictors are `expensive` status and vehicle age (`VehicleAge`). Save the model fit in an object called `logfit_cars` and display the summary table. Use summary, or `tidy(logfit_cars, conf.int = TRUE, exponentiate = TRUE)` for odds ratios, or `tidy(logfit_cars, conf.int = TRUE, exponentiate = FALSE)` for log odds ratios.
logfit_cars = glm(IsBadBuy ~ expensive + VehicleAge, data = cars, family = binomial)
#logfit_cars = glm(IsBadBuy ~ expensive + VehicleAge, data = cars, family = binomial())
#logfit_cars = glm(IsBadBuy ~ expensive + VehicleAge, data = cars, family = "binomial")
summary(logfit_cars)
Call:
glm(formula = IsBadBuy ~ expensive + VehicleAge, family = binomial,
data = cars)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.24948976307 0.03324440927 -97.74545 < 2e-16 ***
expensiveTRUE -0.08038741797 0.06069562592 -1.32444 0.18536
VehicleAge 0.28660451302 0.00647588104 44.25722 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 54421.30728 on 72982 degrees of freedom
Residual deviance: 52441.78751 on 72980 degrees of freedom
AIC: 52447.78751
Number of Fisher Scoring iterations: 5
# A tibble: 3 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.0388 0.0332 -97.7 0 0.0363 0.0414
2 expensiveTRUE 0.923 0.0607 -1.32 0.185 0.818 1.04
3 VehicleAge 1.33 0.00648 44.3 0 1.32 1.35
# A tibble: 3 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -3.25 0.0332 -97.7 0 -3.31 -3.18
2 expensiveTRUE -0.0804 0.0607 -1.32 0.185 -0.201 0.0370
3 VehicleAge 0.287 0.00648 44.3 0 0.274 0.299
16.11 Functions
- Write a function, `sqdif`, that does the following:
  - takes two numbers `x` and `y` with default values of 2 and 3
  - takes the difference
  - squares this difference
  - then returns the final value
  - checks that `x` and `y` are numeric and `stop`s with an error message otherwise
sqdif = function(x = 2, y = 3) {
if (!is.numeric(x) | !is.numeric(y)){
stop("Both x and y must be numeric")
} else return((x - y)^2)
}
sqdif
function(x = 2, y = 3) {
if (!is.numeric(x) | !is.numeric(y)){
stop("Both x and y must be numeric")
} else return((x - y)^2)
}
[1] 25
Error in sqdif("5", 1): Both x and y must be numeric
- Try to write a function called `top()` that takes a `matrix` or `data.frame` and a number `n`, and returns the first `n` rows and columns, with the default value of `n = 5`.
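A sketch consistent with the outputs below (the first call uses a 10-row matrix, the second a 5-row matrix):
top = function(mat, n = 5) {
  mat[1:n, 1:n, drop = FALSE] # first n rows and columns
}
top(matrix(1:50, nrow = 10))
top(matrix(1:10, nrow = 5), n = 2)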
[,1] [,2] [,3] [,4] [,5]
[1,] 1 11 21 31 41
[2,] 2 12 22 32 42
[3,] 3 13 23 33 43
[4,] 4 14 24 34 44
[5,] 5 15 25 35 45
[,1] [,2]
[1,] 1 6
[2,] 2 7
- Write a function that will calculate a 95% one sample t interval. The results will be stored in a list to be returned, containing the sample mean and the confidence interval. The input to the function is the numeric vector containing our data. For review, the formula for a 95% one sample t-interval is \(\bar{x} \pm t^{*} \cdot s/\sqrt{n}\), where \(t^{*}\) is the 97.5th percentile of the \(t\) distribution with \(n-1\) degrees of freedom (approximately 1.96 for large \(n\)).
my_t_CI = function(x) {
x_bar = mean(x)
n = length(x)
z = qt(0.975, df = n-1) # approximately 1.96 for large n
CI.lower = x_bar - z*sd(x) / sqrt(n)
CI.upper = x_bar + z*sd(x) / sqrt(n)
return(list(x_bar = x_bar, CI.lower = CI.lower, CI.upper = CI.upper))
}
data_sim = rnorm(100)
my_t_CI(data_sim)
$x_bar
[1] 0.0472156866557
$CI.lower
[1] -0.154868109042
$CI.upper
[1] 0.249299482354
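For comparison, the built-in one sample t-test reports the same interval:
t.test(data_sim)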
One Sample t-test
data: data_sim
t = 0.4636005847, df = 99, p-value = 0.6439516
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.154868109042 0.249299482354
sample estimates:
mean of x
0.0472156866557
16.12 Simulations
- Simulate a random sample of size \(n=100\) from:
a. a normal distribution with mean 0 and variance 1. (see `rnorm`)
[1] 0.24657838830181 -0.28315247004084 0.27358560551538 0.73355368164771
[5] 0.00590490085201 -1.81955582293960
b. a normal distribution with mean 1 and variance 1. (see `rnorm`)
[1] 0.468497914898 0.720659579491 0.812009696163 3.227644726789 1.094358592940
[6] 1.587670874195
c. a uniform distribution over the interval [-2, 2]. (see `runif`)
[1] -1.5004615578800 -0.4790567662567 0.8695481279865 -0.0178337078542
[5] 1.8206319734454 -0.0594882853329
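A sketch; only the first six values of each sample are printed:
head(rnorm(100))                    # a. mean 0, variance 1
head(rnorm(100, mean = 1))          # b. mean 1, variance 1
head(runif(100, min = -2, max = 2)) # c. Uniform[-2, 2]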
- Run a simulation experiment to see how the type I error rate behaves for a two sided one sample t-test when the true population follows a Uniform distribution over [-10,10]. Modify the function `t.test.sim` that we wrote to run this simulation by:
  - changing our random samples of size `n` to come from a uniform distribution over [-10,10] (see `runif`).
  - performing a two sided t-test instead of a one sided t-test.
  - performing the test at the 0.01 significance level.
  - choosing an appropriate value for the null value in the t-test. Note that the true mean in this case is 0 for a Uniform(-10,10) population.
  Try this experiment for `n = 10, 30, 50, 100, 500`. What happens to the estimated type I error rate as `n` changes? Is the type I error rate maintained for any of these sample sizes?
t.test.sim = function(n = 30, N = 1000){
reject = logical(N) #Did this sample lead to a rejection of H0?
for (i in 1:N) {
x = runif(n, -10, 10) # generate a sample of size n from a Uniform(-10,10) distribution
t.test.p.value = t.test(x, mu = 0, alternative = "two.sided")$p.value
reject[i] = (t.test.p.value < 0.01) #Do we reject H0?
}
mean(reject) #estimated type 1 error rate
}
t.test.sim(n = 10)
[1] 0.01
[1] 0.011
[1] 0.014
[1] 0.014
[1] 0.012
Generally, what you will see is that the estimated type I error rate will get closer to 0.01 as the sample size gets larger.
3. From introductory statistics, we know that the sampling distribution of a sample mean will be approximately normal with mean \(\mu\) and standard error \(\sigma/\sqrt{n}\) if we have a random sample from a population with mean \(\mu\) and standard deviation \(\sigma\) and the sample size is “large” (usually at least 30). In this problem, we will build a simulation that will show when the sample size is large enough.
a. Generate N=500
samples of size n=50
from a Uniform[-5,5] distribution.
b. For each of the N=500
samples, calculate the sample mean, so that you now have a vector of 500 sample means.
c. Plot a histogram of these 500 sample means. Does it look normally distributed and centered at 0?
d. Turn this simulation into a function that takes arguments `N`, the number of simulated samples to make, and `n`, the sample size of each simulated sample. Run this function for `n = 10, 15, 30, 50`. What do you notice about the histogram of the sample means (the sampling distribution of the sample mean) as the sample size increases?
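A sketch for parts a through c; the function below then generalizes it for part d:
x_bar = numeric(500)
for (i in 1:500) {
  x_bar[i] = mean(runif(50, -5, 5)) # sample mean of a size-50 Uniform[-5,5] sample
}
hist(x_bar) # roughly bell shaped and centered at 0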
x_bar_sampledist_sim = function(n = 10, N = 500) {
x_bar = numeric(N)
for (i in 1:N) {
x_bar[i] = mean(runif(n, -5, 5))
}
hist(x_bar, col = "gray", xlab = "x", ylab = "", xlim = c(-5, 5), main = paste("n = ", n))
}
opar = par()
par(mfrow = c(2, 2))
x_bar_sampledist_sim()
x_bar_sampledist_sim(n = 15)
x_bar_sampledist_sim(n = 30)
x_bar_sampledist_sim(n = 50)
Generally you should observe that for the smaller sample sizes the histogram looks roughly triangle shaped, but as the sample size increases it gets "skinnier" and looks more bell shaped.