Chapter 4 Data I/O
Before we can work with our data in R, we first have to import it into R. R is capable of importing data from many different file types and sources using add-on packages, such as data from
- csv files
- Excel files
- tab delimited files
- fixed width files
- databases (e.g. mySQL databases)
- data files from other programs such as SAS .sas7bdat files, SPSS .sav files, or STATA .dta files
We will focus on importing the most commonly used file types, csv and Excel files, using real, publicly available data. RStudio features some nice “drop down” support, where you can run some tasks by selecting them from the toolbar.
For example, you can easily import datasets using File > Import Dataset and selecting
- From Text (readr) to import csv files
- From Excel to import Excel files
- From SPSS, STATA, or SAS to import the corresponding dataset file from these other software packages.
Note that these use add-on packages and will ask you to install the appropriate package if this is your first time using them. They will produce the R code needed to import your data, which you can then save to a script to use again if you need to re-import the dataset.
Before we look at importing a real dataset, let’s talk about working directories.
4.1 Working Directories
When reading files, you will need to specify the path to where the file is on your computer. R “looks” for files on your computer relative to the working directory. There are several ways to set R's working directory to the desired folder on your computer.
- Go to Session > Set Working Directory and choose the folder that contains your dataset
- Set the working directory manually using the function
setwd("/path_to_folder/")
- Set up an R project in the folder that contains your dataset files and R scripts. When you open this R project, R will default the working directory to the folder containing the Rproj file.
When setting the working directory, there are some finicky syntax issues that can occur between Windows and Mac users.
- Windows - The default directory structure involves a single backslash "\", but R interprets these as escape characters, so you must replace the backslashes with forward slashes "/" or two backslashes "\\".
- Mac/Linux - The default directory structure already uses forward slashes, so you shouldn't need to alter paths copied from these operating systems.
Some typical directory structure syntax elements include
- .. - go up one level in the directory
- ./ - the current directory
- ~ - your "home" directory
To see where your current working directory is in your R session, you can use the function getwd(), which prints the directory in the console. In RStudio, this path is also displayed at the top of the console. To see what files are in this directory, you can either use the dir() function to print the list of files to the console, or, in RStudio, view the files in the Files tab of the bottom right pane (where you also find plots, help, package listings, etc.).
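For instance, here are a few calls you might run (a minimal sketch; the data subfolder is an assumption about your own folder layout):
getwd()        # print the current working directory to the console
dir()          # list the files in the current working directory
dir("..")      # list the files one level up
dir("./data")  # list the files in a subfolder called "data", if one exists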
For example, in my current R project, the working directory is
[1] "H:/BiostatCourses/PHC6089/R_Notes"
I can view the files in this folder by using dir().
[1] "_book" "_bookdown.yml"
[3] "_bookdown_files" "_output.yml"
[5] "01-introduction.Rmd" "02-RStudio.Rmd"
[7] "03-BasicR.Rmd" "04-DataIO.Rmd"
[9] "05-SubsettingData.Rmd" "06-DataSummarization.Rmd"
[11] "07-DataClasses.Rmd" "08-DataCleaning.Rmd"
[13] "09-ManipulatingDatainR.Rmd" "10-DataVisualiation.Rmd"
[15] "11-StatisticalAnalysis.Rmd" "12-Functions.Rmd"
[17] "13-Simulations.Rmd" "14-RMarkdown.Rmd"
[19] "15-Shiny.Rmd" "16-Solutions_to_Exercises.Rmd"
[21] "Best_Model_Coefficients.csv" "Best_Model_Coefficients.xlsx"
[23] "book.bib" "data"
[25] "data.zip" "img"
[27] "index.Rmd" "my_tab.txt"
[29] "packages.bib" "preamble.tex"
[31] "R_notes.Rmd" "R_notes.Rproj"
[33] "R_notes_files" "README.md"
[35] "style.css"
I can view the files one folder above my current working directory using dir("..").
[1] "Incomplete_Olga_Munoz" "PublicHealthComputing_old"
[3] "R_Homework" "R_InClass"
[5] "R_Labs" "R_libpaths.R"
[7] "R_Notes" "R_Project"
[9] "SAS_homework" "SAS_Notes"
[11] "SASProject" "Syllabus"
[13] "testRMD"
Or I can view files in a subdirectory, for example the data folder, using dir("./data").
[1] "Baltimore_City_Employee_Salaries_FY2015.csv"
[2] "Best_Model_Coefficients.csv"
[3] "Best_Model_Coefficients.xlsx"
[4] "Bike_Lanes.csv"
[5] "Bike_Lanes_Long.csv"
[6] "Bike_Lanes_Wide.csv"
[7] "Carvana_Data_Dictionary.txt"
[8] "Carvana_Data_Dictionary_formatted.txt"
[9] "Charm_City_Circulator_Ridership.csv"
[10] "charmcitycirc_reduced.csv"
[11] "country_pop.txt"
[12] "country_pop_bad_char.txt"
[13] "crashes.csv"
[14] "hersdata.xls"
[15] "indicator_estimatedincidencealltbper100000.xlsx"
[16] "indicatordeadkids35.csv"
[17] "kaggleCarAuction.csv"
[18] "kaggleCarAuction.zip"
[19] "Monuments.csv"
[20] "monuments.rds"
[21] "Monuments.xlsx"
[22] "MonumentsTest.csv"
[23] "my_tab.txt"
[24] "Real_Property_Taxes.csv.gz"
[25] "roads.csv"
[26] "SIR_grid.dat"
[27] "tb_incidence.xlsx"
[28] "Thumbs.db"
[29] "wcgs.xls"
[30] "Youth_Tobacco_Survey_YTS_Data.csv"
[31] "YouthTobacco_newNames.csv"
[32] "yts_data.rda"
[33] "yts_dataset.rds"
[34] "yts_temp.rds"
Because I am currently in an R project where my working directory is set to the folder containing all of my needed files, I can use relative paths instead of absolute paths. An absolute path (or full path) points to the same location in the file system regardless of the current working directory; to do that, it must start from the root directory. This means that if I try to run your code and it uses absolute paths, it won't work unless we have the exact same folder structure where R is looking (this is bad). By contrast, a relative path starts from a given working directory, avoiding the need to provide the full absolute path.
For example, “/Users/robertparker/Documents/UF/PHC6089/Lectures/2020/R_notes” is the absolute path on my computer to the R notes. If I zip up and send you the folder R_notes, and my code uses absolute paths to reference files (images, datasets, etc.), then that code will not work on your computer unless you have the R_notes folder in the exact same path as I do. However, if you set your working directory to this folder (manually or using an Rproj file) and I use relative paths to reference items in this folder (e.g. “./data/Charm_City_Circulator_Ridership.csv”), then my code will work immediately on your computer as well. This is very important, because it makes it easy to share your code with collaborators.
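A small sketch of the difference, using base R's file.exists() (the Charm City Circulator file comes from the data folder listed above):
# Absolute path: points to the same place regardless of the working directory,
# so this only returns TRUE on a machine with my exact folder structure
file.exists("/Users/robertparker/Documents/UF/PHC6089/Lectures/2020/R_notes/data/Charm_City_Circulator_Ridership.csv")
# Relative path: resolved against whatever the current working directory is
file.exists("./data/Charm_City_Circulator_Ridership.csv")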
4.2 Importing Data
For our example dataset, we will use the Youth Tobacco Survey (YTS) dataset.
“The YTS was developed to provide states with comprehensive data on both middle school and high school students regarding tobacco use, exposure to environmental tobacco smoke, smoking cessation, school curriculum, minors' ability to purchase or otherwise obtain tobacco products, knowledge and attitudes about tobacco, and familiarity with pro-tobacco and anti-tobacco media messages.”
See the source of the data here.
In order to try the code on your own computer, be sure to download the dataset from the course website and set your RStudio session's working directory to the folder containing the dataset file by going to Session > Set Working Directory > Choose Directory.
I will have the data in a folder called data in my working directory. The YTS dataset is a csv file, so we will use the read_csv() function from the readr package to read it into an R object called mydat.
library(readr) # load the readr library so R has access to read_csv()
mydat = read_csv("./data/Youth_Tobacco_Survey_YTS_Data.csv")
Rows: 9794 Columns: 31
── Column specification ──────────────────────────────────────────────
Delimiter: ","
chr (24): LocationAbbr, LocationDesc, TopicType, TopicDesc, MeasureDesc, Dat...
dbl (7): YEAR, Data_Value, Data_Value_Std_Err, Low_Confidence_Limit, High_C...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 3 × 31
YEAR LocationAbbr LocationDesc TopicType TopicDesc MeasureDesc DataSource
<dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 2015 AZ Arizona Tobacco Use … Cessatio… Percent of… YTS
2 2015 AZ Arizona Tobacco Use … Cessatio… Percent of… YTS
3 2015 AZ Arizona Tobacco Use … Cessatio… Percent of… YTS
# ℹ 24 more variables: Response <chr>, Data_Value_Unit <chr>,
# Data_Value_Type <chr>, Data_Value <dbl>, Data_Value_Footnote_Symbol <chr>,
# Data_Value_Footnote <chr>, Data_Value_Std_Err <dbl>,
# Low_Confidence_Limit <dbl>, High_Confidence_Limit <dbl>, Sample_Size <dbl>,
# Gender <chr>, Race <chr>, Age <chr>, Education <chr>, GeoLocation <chr>,
# TopicTypeId <chr>, TopicId <chr>, MeasureId <chr>, StratificationID1 <chr>,
# StratificationID2 <chr>, StratificationID3 <chr>, …
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
Simple! The data is now read into your R workspace. All we needed to do was pass the file location of our csv dataset to the function read_csv(), but what is this function doing for us? read_csv() is a special case of read_delim(), which reads in a general delimited file. Our delimiter was a comma, but it could be a tab (“\t”), a semicolon (“;”), or another separating character. These functions read delimited files into a data.frame with some special printing and subsetting properties, called a tibble (class “tbl”). The general arguments you can specify to read_delim() are as follows:
read_delim(file, delim, quote = "\"", escape_backslash = FALSE,
  escape_double = TRUE, col_names = TRUE, col_types = NULL,
  locale = default_locale(), na = c("", "NA"), quoted_na = TRUE,
  comment = "", trim_ws = FALSE, skip = 0, n_max = Inf,
  guess_max = min(1000, n_max), progress = show_progress(),
  skip_empty_rows = TRUE)
- file - the path to your dataset file, in quotes (the filename can also be a URL to a file on the web; relative paths are resolved against your working directory)
- delim - the delimiter, in quotes (for read_csv(), delim = ","; a tab-delimited sketch follows this list)
- col_names - does the first row contain column names?
- col_types - pre-specify column types (e.g. col_double(), col_character())
- guess_max - the number of rows to scan to try to automatically identify the data type of a column
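As a small sketch, reading a tab-delimited file would look like this (my_tab.txt appears in the directory listing above, but its contents are not shown here, so the call is purely illustrative):
# read a tab-delimited file; readr is already loaded above
tab_dat = read_delim("my_tab.txt", delim = "\t", col_names = TRUE)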
Even if the dataset is read in, there can still be potential errors that were noted but were not fatal. We can see these errors using problems(). Applying the problems() function to a data.frame returns a data.frame with a row for each problem found. The first two columns give the row and column number in the original data where the problem occurred, the next column gives a brief message about what was expected (but wasn't found, causing the error), and another gives the actual data value that was read in; newer versions of readr also include a column with the file name.
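A minimal sketch of checking the import above (probs is just an illustrative name):
probs = problems(mydat)  # one row per parsing problem (empty if none were found)
probs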
For our dataset, there were no problems.
# A tibble: 0 × 5
# ℹ 5 variables: row <int>, col <int>, expected <chr>, actual <chr>, file <chr>
A common error is that the incorrect data type is chosen because the function guesses based on the first few rows. For example, if the first several rows of a column are blank (missing data), then read_csv() may interpret it as a character column when, in fact, much further down the column there are numeric values. We will see an example of this in the homework. We can repair this by manually specifying the column type in the col_types argument or by setting guess_max = Inf. We can see the column types chosen for our dataset by using the spec() function:
cols(
YEAR = col_double(),
LocationAbbr = col_character(),
LocationDesc = col_character(),
TopicType = col_character(),
TopicDesc = col_character(),
MeasureDesc = col_character(),
DataSource = col_character(),
Response = col_character(),
Data_Value_Unit = col_character(),
Data_Value_Type = col_character(),
Data_Value = col_double(),
Data_Value_Footnote_Symbol = col_character(),
Data_Value_Footnote = col_character(),
Data_Value_Std_Err = col_double(),
Low_Confidence_Limit = col_double(),
High_Confidence_Limit = col_double(),
Sample_Size = col_double(),
Gender = col_character(),
Race = col_character(),
Age = col_character(),
Education = col_character(),
GeoLocation = col_character(),
TopicTypeId = col_character(),
TopicId = col_character(),
MeasureId = col_character(),
StratificationID1 = col_character(),
StratificationID2 = col_character(),
StratificationID3 = col_character(),
StratificationID4 = col_character(),
SubMeasureID = col_character(),
DisplayOrder = col_double()
)
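If a guess is wrong, a minimal sketch of the repair looks like this (the column name Data_Value comes from the specification above; forcing it to a double and the object name mydat_fixed are only illustrations):
mydat_fixed = read_csv("./data/Youth_Tobacco_Survey_YTS_Data.csv",
  col_types = cols(Data_Value = col_double()),  # force this column to be numeric
  guess_max = Inf)                              # or scan every row when guessing types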
After reading in our data, there may be other features we want to know or check, such as
- nrow() - displays the number of rows of the data.frame (i.e. the number of observations)
- ncol() - displays the number of columns of the data.frame (i.e. the number of variables)
- dim() - displays the number of rows and columns as a vector of length 2 (# rows, # cols)
- colnames() - displays the column names (if any)
- rownames() - displays the row names (if any)
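The output below comes from calls like these on our mydat object (a minimal sketch):
dim(mydat)       # number of rows and columns
nrow(mydat)      # number of rows (observations)
ncol(mydat)      # number of columns (variables)
colnames(mydat)  # the column names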
[1] 9794 31
[1] 9794
[1] 31
[1] "YEAR" "LocationAbbr"
[3] "LocationDesc" "TopicType"
[5] "TopicDesc" "MeasureDesc"
[7] "DataSource" "Response"
[9] "Data_Value_Unit" "Data_Value_Type"
[11] "Data_Value" "Data_Value_Footnote_Symbol"
[13] "Data_Value_Footnote" "Data_Value_Std_Err"
[15] "Low_Confidence_Limit" "High_Confidence_Limit"
[17] "Sample_Size" "Gender"
[19] "Race" "Age"
[21] "Education" "GeoLocation"
[23] "TopicTypeId" "TopicId"
[25] "MeasureId" "StratificationID1"
[27] "StratificationID2" "StratificationID3"
[29] "StratificationID4" "SubMeasureID"
[31] "DisplayOrder"
We can change column names by using the rename() function from the R package dplyr. For example, we could rename the column “YEAR” to “year” in the YTS dataset.
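A minimal sketch of that rename (assuming the dplyr package is installed):
library(dplyr)                      # provides rename()
mydat = rename(mydat, year = YEAR)  # syntax is new_name = old_name
colnames(mydat)                     # check the updated names (shown below)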
[1] "year" "LocationAbbr"
[3] "LocationDesc" "TopicType"
[5] "TopicDesc" "MeasureDesc"
[7] "DataSource" "Response"
[9] "Data_Value_Unit" "Data_Value_Type"
[11] "Data_Value" "Data_Value_Footnote_Symbol"
[13] "Data_Value_Footnote" "Data_Value_Std_Err"
[15] "Low_Confidence_Limit" "High_Confidence_Limit"
[17] "Sample_Size" "Gender"
[19] "Race" "Age"
[21] "Education" "GeoLocation"
[23] "TopicTypeId" "TopicId"
[25] "MeasureId" "StratificationID1"
[27] "StratificationID2" "StratificationID3"
[29] "StratificationID4" "SubMeasureID"
[31] "DisplayOrder"
Many data analysts collaborate with researchers who use Excel to enter and curate their data. Often, this is the input data for an analysis. There are two options for getting this data into R:
- Save the Excel sheet as a .csv file and use read_csv()
- Use the add-on package readxl to read in the Excel file using read_excel()
For single-worksheet .xls or .xlsx files, I often just save the spreadsheet as a .csv file. For multi-worksheet .xlsx files, I use the readxl package. You will try reading in an Excel file as part of the exercises below. Try reading about the read_excel() function by looking at its help file.
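For example, a minimal sketch of reading an Excel file (assuming readxl is installed; Monuments.xlsx appears in the data folder listed earlier, and mon is just an illustrative name):
library(readxl)                                       # provides read_excel()
mon = read_excel("./data/Monuments.xlsx", sheet = 1)  # sheet can be a name or a number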
Other data file types and packages:
- The haven package contains functions to read in SAS, SPSS, and STATA dataset files (see the package documentation reference manual)
- The sas7bdat package reads .sas7bdat files
- The foreign package can read many of the same formats as haven. This package has been around longer (so it is well tested), but it is not as actively maintained or updated (so haven may be the better choice going forward).
Some of these are used as part of RStudio's Import Dataset dropdown described earlier.
4.3 Exporting Data
It is equally important to be able to output data somewhere so that you can share your updated file(s) with collaborators. There are data-exporting functions in the readr package (as well as others), which follow the pattern write_*, like write_csv() and write_delim(). For example, here is the function header for write_delim().
write_delim(x, file, delim = " ", na = "NA", append = FALSE,
  col_names = !append, quote_escape = "double")
- x - the R data.frame/tibble/matrix that you want to write to a file
- file - the directory and filename where you want the data.frame to be written; it can be an absolute path or a filename resolved relative to the working directory
- delim - the character that separates columns
  - "," = csv (this results in the write_csv() function)
  - "\t" = tab delimited (sketched below)
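As a sketch, writing a tab-delimited version of our data would look like this (the output file name is just an illustration):
write_delim(mydat, file = "./data/yts_tab.txt", delim = "\t")  # tab-delimited export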
We modified our YTS dataset to change the first column name from “YEAR” to “year”. Now let's save the updated dataset to a new csv file called “YouthTobacco_newNames.csv”. I will save it to the data folder within my project working directory.
#library(readr) #uncomment and load readr if not already loaded
write_csv(mydat, file = "./data/YouthTobacco_newNames.csv")
tail(dir("./data"), 4)
[1] "YouthTobacco_newNames.csv" "yts_data.rda"
[3] "yts_dataset.rds" "yts_temp.rds"
4.4 More Ways to Save
If you want to save one object, you can use write_rds() from the readr package to save a data.frame (or other R object) to an rds file.
#library(readr) #uncomment and load readr if not already loaded
write_rds(mydat, file = "./data/yts_dataset.rds")
tail(dir("./data"), 4)
[1] "YouthTobacco_newNames.csv" "yts_data.rda"
[3] "yts_dataset.rds" "yts_temp.rds"
To read this back into R, you use read_rds(), and you need to assign the result to an object.
#library(readr) #uncomment and load readr if not already loaded
mydat2 = read_rds(file = "./data/yts_dataset.rds")
all.equal(mydat, mydat2)
[1] TRUE
You can also save and load multiple R objects by using the save() and load() functions in base R. These functions save objects to files with the .rda or .RData extension.
x = 5
yts = read_csv("./data/Youth_Tobacco_Survey_YTS_Data.csv")
save(yts, x, file = "./data/yts_data.rda")
We can view the R objects in our workspace using ls(). Note that we have the objects x and yts.
[1] "avg" "avgs" "circ" "hers.dat" "long" "mydat"
[7] "mydat2" "name" "name2" "palette" "sub_yts" "tb"
[13] "x" "y" "yts"
Let’s remove them from our workspace and reload them from the .rda file we saved.
rm(list = c("x", "yts")) #delete x and yts
ls() # x and yts are no longer in our workspace. They were deleted
[1] "avg" "avgs" "circ" "hers.dat" "long" "mydat"
[7] "mydat2" "name" "name2" "palette" "sub_yts" "tb"
[13] "y"
[1] "avg" "avgs" "circ" "hers.dat" "long" "mydat"
[7] "mydat2" "name" "name2" "palette" "sub_yts" "tb"
[13] "x" "y" "yts"
4.5 Data I/O in Base R
Base R also has functions to import and export data:
- read.delim() is the base R equivalent of readr's read_delim()
- read.csv() is the base R equivalent of readr's read_csv()
These functions have slightly different syntax for reading data, like header (which indicates whether the first row contains column names) and as.is (which controls whether or not a column is converted to a factor). Many online resources will use these base R functions, so you should be familiar with them. However, RStudio has switched to using the readr data import tools, so we will use them in class. The readr import functions are also up to two times faster for reading in large datasets and display a progress bar, which is nice. You may use whichever functions you feel more comfortable with.
Here is how to read in the YTS dataset using base R functions which return a data.frame instead of a tibble.
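A sketch of the calls that produce the output below (yts_base is just an illustrative name):
yts_base = read.csv("./data/Youth_Tobacco_Survey_YTS_Data.csv")  # returns a plain data.frame
head(yts_base)                                                   # print the first six rows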
YEAR LocationAbbr LocationDesc TopicType TopicDesc
1 2015 AZ Arizona Tobacco Use - Survey Data Cessation (Youth)
2 2015 AZ Arizona Tobacco Use - Survey Data Cessation (Youth)
3 2015 AZ Arizona Tobacco Use - Survey Data Cessation (Youth)
4 2015 AZ Arizona Tobacco Use - Survey Data Cessation (Youth)
5 2015 AZ Arizona Tobacco Use - Survey Data Cessation (Youth)
6 2015 AZ Arizona Tobacco Use - Survey Data Cessation (Youth)
MeasureDesc DataSource Response
1 Percent of Current Smokers Who Want to Quit YTS
2 Percent of Current Smokers Who Want to Quit YTS
3 Percent of Current Smokers Who Want to Quit YTS
4 Quit Attempt in Past Year Among Current Cigarette Smokers YTS
5 Quit Attempt in Past Year Among Current Cigarette Smokers YTS
6 Quit Attempt in Past Year Among Current Cigarette Smokers YTS
Data_Value_Unit Data_Value_Type Data_Value Data_Value_Footnote_Symbol
1 % Percentage NA *
2 % Percentage NA *
3 % Percentage NA *
4 % Percentage NA *
5 % Percentage NA *
6 % Percentage NA *
Data_Value_Footnote
1 Data in these cells have been suppressed because of a small sample size
2 Data in these cells have been suppressed because of a small sample size
3 Data in these cells have been suppressed because of a small sample size
4 Data in these cells have been suppressed because of a small sample size
5 Data in these cells have been suppressed because of a small sample size
6 Data in these cells have been suppressed because of a small sample size
Data_Value_Std_Err Low_Confidence_Limit High_Confidence_Limit Sample_Size
1 NA NA NA NA
2 NA NA NA NA
3 NA NA NA NA
4 NA NA NA NA
5 NA NA NA NA
6 NA NA NA NA
Gender Race Age Education
1 Overall All Races All Ages Middle School
2 Male All Races All Ages Middle School
3 Female All Races All Ages Middle School
4 Overall All Races All Ages Middle School
5 Male All Races All Ages Middle School
6 Female All Races All Ages Middle School
GeoLocation TopicTypeId TopicId MeasureId
1 (34.865970280000454, -111.76381127699972) BEH 105BEH 170CES
2 (34.865970280000454, -111.76381127699972) BEH 105BEH 170CES
3 (34.865970280000454, -111.76381127699972) BEH 105BEH 170CES
4 (34.865970280000454, -111.76381127699972) BEH 105BEH 169QUA
5 (34.865970280000454, -111.76381127699972) BEH 105BEH 169QUA
6 (34.865970280000454, -111.76381127699972) BEH 105BEH 169QUA
StratificationID1 StratificationID2 StratificationID3 StratificationID4
1 1GEN 8AGE 6RAC 1EDU
2 2GEN 8AGE 6RAC 1EDU
3 3GEN 8AGE 6RAC 1EDU
4 1GEN 8AGE 6RAC 1EDU
5 2GEN 8AGE 6RAC 1EDU
6 3GEN 8AGE 6RAC 1EDU
SubMeasureID DisplayOrder
1 YTS01 1
2 YTS02 2
3 YTS03 3
4 YTS04 4
5 YTS05 5
6 YTS06 6
Notice that the printing of a data.frame is not as “pretty” as that of a tibble: every column is printed, so the output wraps across several blocks in the console.
Similarly, there are base R functions for exporting data, such as write.table() and write.csv().
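A minimal sketch (the output file name is just an illustration):
write.csv(mydat, file = "./data/yts_base_export.csv", row.names = FALSE)  # row.names = FALSE drops the row-number column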
4.6 Exercises
- Read in the Youth Tobacco study, Youth_Tobacco_Survey_YTS_Data.csv, and name it youth.
- Install and load the readxl package: go to RStudio > Tools > Install Packages, type readxl into the Packages search box, and click Install. Load the installed package with library(readxl).
- Download an Excel version of the Monuments dataset, Monuments.xlsx, from CANVAS. Use the read_excel() function in the readxl package to read in the dataset and call the output mon.
- Write out the mon R object as a CSV file using readr::write_csv() and call the file “monuments.csv”.
- Write out the mon R object as an RDS file using readr::write_rds() and call it “monuments.rds”.