Chapter 4 Data I/O

Before we can work with our data in R, we have to first import the data into R. R is capable of importing data from many different file types and sources using add on packages, such as data from

  • csv files
  • excel files
  • tab delimited files
  • fixed width files
  • databases (e.g. mySQL databases)
  • data files from other programs such as SAS .sas7bdat files, SPSS .sav files, or STATA .dta files

We will focus on importing the most commonly used file types of csv and excel files using real publicly available data. RStudio features some nice “drop down” support, where you can run some task by selecting them from the toolbar.

For example, you can easily import datasets using File > Import Dataset and selecting

  • From Text (readr) to import csv files
  • From Excel to import excel files
  • From SPSS, STATA, or SAS to import the corresponding dataset file from these other software packages.

Note that these use add on packages and will ask you to install the appropriate packages if this is your first time using them. These will produce the R code needed to import your data, which you can then save to a script to use again if you need to re-import the dataset.

Before we look at importing a real dataset, let’s talk about working directories.

4.1 Working Directories

When reading files, you will need to specify the path to where this file is on your computer. R “look”s for files on your computer relative to the working directory. There are several way to have R set to the desired directory folder on your computer.

  1. Go to Session > Set Working Directory and choose the folder that contains your dataset
  2. Set the working directory manually using the function setwd("/path_to_folder/")
  3. Set up an R project in the folder that contains your dataset files and R scripts. When you open this R project, R will default the working directory to the folder containing the Rproj file.

When setting the working directory, there are some finicky syntax issues that can occur between Windows and Mac users.

  • Windows - The default directory structure involves a single backslash "\", but R interprets these as escape characters, so you must replace the backslashes with forward slashes "/" or two backslashes "\\".
  • Mac/Linux - The default directory structure already uses forward slashes, so you shouldn’t need to alter copied paths from these operating systems.

Some typical directory structure syntax elements include

  • .. - go up one level in the directory
  • ./ - the current directory
  • ~ - your “home” directory

To see where your current working directory is in your R sessions, you can use the function getwd() which prints the directory in the console. In RStudio, this path is also displayed at the top of the console. To see what files are in this directory, you can either use the dir() function to print the list of files to the console, or in RStudio you can view the files in the Files tab of the bottom right pane (where you find plots, help, package listings, etc).

For example, in my current Rproject, the working directory is

getwd()
[1] "H:/BiostatCourses/PHC6089/R_Notes"

I can view the files in this folder by using dir

dir("./")
 [1] "_book"                         "_bookdown.yml"                
 [3] "_bookdown_files"               "_output.yml"                  
 [5] "01-introduction.Rmd"           "02-RStudio.Rmd"               
 [7] "03-BasicR.Rmd"                 "04-DataIO.Rmd"                
 [9] "05-SubsettingData.Rmd"         "06-DataSummarization.Rmd"     
[11] "07-DataClasses.Rmd"            "08-DataCleaning.Rmd"          
[13] "09-ManipulatingDatainR.Rmd"    "10-DataVisualiation.Rmd"      
[15] "11-StatisticalAnalysis.Rmd"    "12-Functions.Rmd"             
[17] "13-Simulations.Rmd"            "14-RMarkdown.Rmd"             
[19] "15-Shiny.Rmd"                  "16-Solutions_to_Exercises.Rmd"
[21] "Best_Model_Coefficients.csv"   "Best_Model_Coefficients.xlsx" 
[23] "book.bib"                      "data"                         
[25] "data.zip"                      "img"                          
[27] "index.Rmd"                     "my_tab.txt"                   
[29] "packages.bib"                  "preamble.tex"                 
[31] "R_notes.Rmd"                   "R_notes.Rproj"                
[33] "R_notes_files"                 "README.md"                    
[35] "style.css"                    

I can view the files one folder above my current working directory.

dir("..")
 [1] "Incomplete_Olga_Munoz"     "PublicHealthComputing_old"
 [3] "R_Homework"                "R_InClass"                
 [5] "R_Labs"                    "R_libpaths.R"             
 [7] "R_Notes"                   "R_Project"                
 [9] "SAS_homework"              "SAS_Notes"                
[11] "SASProject"                "Syllabus"                 
[13] "testRMD"                  

Or I can view files in a subdirectory, for example in the data folder.

dir("./data/")
 [1] "Baltimore_City_Employee_Salaries_FY2015.csv"    
 [2] "Best_Model_Coefficients.csv"                    
 [3] "Best_Model_Coefficients.xlsx"                   
 [4] "Bike_Lanes.csv"                                 
 [5] "Bike_Lanes_Long.csv"                            
 [6] "Bike_Lanes_Wide.csv"                            
 [7] "Carvana_Data_Dictionary.txt"                    
 [8] "Carvana_Data_Dictionary_formatted.txt"          
 [9] "Charm_City_Circulator_Ridership.csv"            
[10] "charmcitycirc_reduced.csv"                      
[11] "country_pop.txt"                                
[12] "country_pop_bad_char.txt"                       
[13] "crashes.csv"                                    
[14] "hersdata.xls"                                   
[15] "indicator_estimatedincidencealltbper100000.xlsx"
[16] "indicatordeadkids35.csv"                        
[17] "kaggleCarAuction.csv"                           
[18] "kaggleCarAuction.zip"                           
[19] "Monuments.csv"                                  
[20] "monuments.rds"                                  
[21] "Monuments.xlsx"                                 
[22] "MonumentsTest.csv"                              
[23] "my_tab.txt"                                     
[24] "Real_Property_Taxes.csv.gz"                     
[25] "roads.csv"                                      
[26] "SIR_grid.dat"                                   
[27] "tb_incidence.xlsx"                              
[28] "Thumbs.db"                                      
[29] "wcgs.xls"                                       
[30] "Youth_Tobacco_Survey_YTS_Data.csv"              
[31] "YouthTobacco_newNames.csv"                      
[32] "yts_data.rda"                                   
[33] "yts_dataset.rds"                                
[34] "yts_temp.rds"                                   

Because I am currently in an R project where my working directory is set to my desired folder containing all of my needed files, I can use relative paths instead of absolute paths. An absolute path or full path points to the the same location of the file system, regardless of the current working directory. To do that, it must include the root directory. This means that if I try your code using absolute paths, it won’t work unless we have the exact same folder structure where R is looking (this is bad). By contrast a relative path starts from some given working directory, avoiding the need to provide the full absolute path.

For example, “/Users/robertparker/Documents/UF/PHC6089/Lectures/2020/R_notes” is the absolute path on my computer to the R notes. If I zip up and send you the folder R_notes, using absolute paths to reference files (images, datasets, etc.), then this code will not work on your computer unless you have the R_notes folder in the exact same path as I do. However, if you set the working directory to this folder (manually or using an Rproj), and I use relative paths to reference items in this folder (e.g. “./data/Charm_City_Circulator_Ridership.csv”), then my code will work immediately on your computer as well. This is very important to do, so that it is easy to share your code with collaborators.

4.2 Importing Data

For our example dataset, we will use the Youth Tobacco Survey (YTS) dataset.

“The YTS was developed to provide states with comprehensive data on both middle school and high school students regarding tobacco use, exposure to environmental tobacco smoke, smoking cessation, school curriculum, minors’ability to purchase or otherwise obtain tobacco products, knowledge and attitudes about tobacco, and familiarity with pro-tobacco and anti-tobacco media messages.”

See the source of the data here.

In order to try the code on your own computer, be sure to download the dataset from the course website and set your RStudio session to the working directory containing your dataset file by going to Session > Set Working Directory > Choose file location.

I will have the data in a folder called data in my working directory. The YTS dataset is a csv file, so we will use the read_csv() function from the readr package to an R object called mydat.

library(readr) # load the readr library so R has access to read_csv()
mydat = read_csv("./data/Youth_Tobacco_Survey_YTS_Data.csv")
Rows: 9794 Columns: 31
── Column specification ──────────────────────────────────────────────
Delimiter: ","
chr (24): LocationAbbr, LocationDesc, TopicType, TopicDesc, MeasureDesc, Dat...
dbl  (7): YEAR, Data_Value, Data_Value_Std_Err, Low_Confidence_Limit, High_C...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(mydat, 3) # display the first 3 rows of the data.frame
# A tibble: 3 × 31
   YEAR LocationAbbr LocationDesc TopicType     TopicDesc MeasureDesc DataSource
  <dbl> <chr>        <chr>        <chr>         <chr>     <chr>       <chr>     
1  2015 AZ           Arizona      Tobacco Use … Cessatio… Percent of… YTS       
2  2015 AZ           Arizona      Tobacco Use … Cessatio… Percent of… YTS       
3  2015 AZ           Arizona      Tobacco Use … Cessatio… Percent of… YTS       
# ℹ 24 more variables: Response <chr>, Data_Value_Unit <chr>,
#   Data_Value_Type <chr>, Data_Value <dbl>, Data_Value_Footnote_Symbol <chr>,
#   Data_Value_Footnote <chr>, Data_Value_Std_Err <dbl>,
#   Low_Confidence_Limit <dbl>, High_Confidence_Limit <dbl>, Sample_Size <dbl>,
#   Gender <chr>, Race <chr>, Age <chr>, Education <chr>, GeoLocation <chr>,
#   TopicTypeId <chr>, TopicId <chr>, MeasureId <chr>, StratificationID1 <chr>,
#   StratificationID2 <chr>, StratificationID3 <chr>, …
class(mydat)
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 

Simple! The data is now read into your R workspace. All we needed to do was pass the file location of our csv dataset to the function read_csv(), but what is this function doing for us? This function is a special case of read_delim() which reads in a general delimited file. Our delimiter was a comma, but it could be a tab (“\t”), semicolon (“;”), or other separating character. These functions read delimited files into a data.frame with some special printing and subsetting properties called a tibble (class “tbl”). The general arguments you can specify to read_delim() are as follows

read_delim(file, delim, quote = "\"", escape_backslash = FALSE,
  escape_double = TRUE, col_names = TRUE, col_types = NULL,
  locale = default_locale(), na = c("", "NA"), quoted_na = TRUE,
  comment = "", trim_ws = FALSE, skip = 0, n_max = Inf,
  guess_max = min(1000, n_max), progress = show_progress(),
  skip_empty_rows = TRUE)
  • file - specify the path to your dataset file, in quotes (the filename can be a url to a file on the web and looks in your working directory if you use a relative file path).
  • delim - specify your delimiter in quotes (for read_csv() delim = “,”)
  • col_names - Does the first row contain column names?
  • col_types - pre-specify column types (e.g. col_double(), col_character())
  • guess_max - number of rows to scan to try to automatically identify the data type of a column

Even if the dataset is read in, there can still be potential errors that were noted but not fatal. We can see these errors using problems(). Applying the problems() function to a data.frame returns a data.frame with a row for each problem found and four columns. The first two columns give the row and column number in the original data.frame where the problem occurred, the third column give a brief message about what it expected (but didn’t get causing the error), an the fourth column gives the actual data value that was read in.

For our dataset, there were no problems.

problems(mydat)
# A tibble: 0 × 5
# ℹ 5 variables: row <int>, col <int>, expected <chr>, actual <chr>, file <chr>

A common error is that the incorrect data type was chosen by the function guessing based on the first few rows. For example, if the first several rows are blank (missing data), then read_csv() may interpret this as a character column when in fact much further down in the data column are numeric values. We will see an example of this in the homework. We can repair this by manually specifying the column type in the col_types argument or by setting guess_max = Inf. We can see the column types in our dataset by using the spec() function.

spec(mydat)
cols(
  YEAR = col_double(),
  LocationAbbr = col_character(),
  LocationDesc = col_character(),
  TopicType = col_character(),
  TopicDesc = col_character(),
  MeasureDesc = col_character(),
  DataSource = col_character(),
  Response = col_character(),
  Data_Value_Unit = col_character(),
  Data_Value_Type = col_character(),
  Data_Value = col_double(),
  Data_Value_Footnote_Symbol = col_character(),
  Data_Value_Footnote = col_character(),
  Data_Value_Std_Err = col_double(),
  Low_Confidence_Limit = col_double(),
  High_Confidence_Limit = col_double(),
  Sample_Size = col_double(),
  Gender = col_character(),
  Race = col_character(),
  Age = col_character(),
  Education = col_character(),
  GeoLocation = col_character(),
  TopicTypeId = col_character(),
  TopicId = col_character(),
  MeasureId = col_character(),
  StratificationID1 = col_character(),
  StratificationID2 = col_character(),
  StratificationID3 = col_character(),
  StratificationID4 = col_character(),
  SubMeasureID = col_character(),
  DisplayOrder = col_double()
)

After reading in our data, there may be other features we want to know/check such as

  • nrow() - displays the number of rows of the data.frame (i.e. number of observations)
  • ncol() - displays the number of columns of the data.frame (i.e. the number of variables)
  • dim() - displays the number of rows and columns in a vector of length 2 (# rows, # cols)
  • colnames() - displays the column names (if any)
  • rownames() - displays the row names (if any)
dim(mydat)
[1] 9794   31
nrow(mydat)
[1] 9794
ncol(mydat)
[1] 31
colnames(mydat)
 [1] "YEAR"                       "LocationAbbr"              
 [3] "LocationDesc"               "TopicType"                 
 [5] "TopicDesc"                  "MeasureDesc"               
 [7] "DataSource"                 "Response"                  
 [9] "Data_Value_Unit"            "Data_Value_Type"           
[11] "Data_Value"                 "Data_Value_Footnote_Symbol"
[13] "Data_Value_Footnote"        "Data_Value_Std_Err"        
[15] "Low_Confidence_Limit"       "High_Confidence_Limit"     
[17] "Sample_Size"                "Gender"                    
[19] "Race"                       "Age"                       
[21] "Education"                  "GeoLocation"               
[23] "TopicTypeId"                "TopicId"                   
[25] "MeasureId"                  "StratificationID1"         
[27] "StratificationID2"          "StratificationID3"         
[29] "StratificationID4"          "SubMeasureID"              
[31] "DisplayOrder"              

We can change column names by using the rename() function from the R package dplyr. For example, we could rename the column “YEAR” to “year” in the YTS dataset.

library(dplyr)
mydat = rename(mydat, year = YEAR)
colnames(mydat)
 [1] "year"                       "LocationAbbr"              
 [3] "LocationDesc"               "TopicType"                 
 [5] "TopicDesc"                  "MeasureDesc"               
 [7] "DataSource"                 "Response"                  
 [9] "Data_Value_Unit"            "Data_Value_Type"           
[11] "Data_Value"                 "Data_Value_Footnote_Symbol"
[13] "Data_Value_Footnote"        "Data_Value_Std_Err"        
[15] "Low_Confidence_Limit"       "High_Confidence_Limit"     
[17] "Sample_Size"                "Gender"                    
[19] "Race"                       "Age"                       
[21] "Education"                  "GeoLocation"               
[23] "TopicTypeId"                "TopicId"                   
[25] "MeasureId"                  "StratificationID1"         
[27] "StratificationID2"          "StratificationID3"         
[29] "StratificationID4"          "SubMeasureID"              
[31] "DisplayOrder"              

Many data analysts collaborate with researchers who use Excel to enter and curate their data. Often times, this is the input data for an analysis. There are two options for getting this data into R

  • Save the excel sheet as a .csv file and use read_csv()
  • Use the add on package readxl to read in the excel file using read_excel()

For single worksheet .xls or .xlsx files, I often just save the spreadsheet as a .csv file. For multi-worksheet .xlsx files, I use the readxl package. You will try reading in an excel file as part of the exercises below. Try reading about the function read_excel by looking at the help file.

library(readxl)
?read_excel #See the help file for read_excel()

Other data file types and packages:

  • haven package contains functions to read in SAS, SPSS, and STATA dataset files (see the package documentation reference manual)
  • sas7bdat package reads .sas7bdat files
  • foreign package can read all the same formats as haven. This package has been around longer (so it is tested better), but it is not as maintained/updated (so may be good in the future).

Some of these are used as part of RStudio’s Import Dataset dropdown function.

4.3 Exporting Data

It is equally important to be able to output data somewhere to share you updated file(s) with collaborators. There are data exporting functions in the readr package (as well as other), which have the pattern write_* like write_csv and write_delim. For example, here is the function header for write_csv.

write_delim(x, file, delim = " ", na = "NA", append = FALSE,
  col_names = !append, quote_escape = "double")
  • x is the R data.frame/tibble/matrix that you want to write to a file
  • path is the directory and filename where you want the data.frame to be written. It can be an absolute path or a filename in a relative path.
  • delim is the character that separates columns
    • “,” = csv (this results in the write_csv function)
    • “\t” = tab delimited

We modified our YTS dataset to change the first column name from “YEAR” to “YEAR”. Now let’s save the update dataset to a new csv file called “YouthTobacco_newNames.csv”. I will save it to my data folder within my project working directory.

#library(readr) #uncomment and load readr if not already loaded
write_csv(mydat, file = "./data/YouthTobacco_newNames.csv")
tail(dir("./data"), 4)
[1] "YouthTobacco_newNames.csv" "yts_data.rda"             
[3] "yts_dataset.rds"           "yts_temp.rds"             

4.4 More Ways to Save

If you want to save one object, you can use write_rds from the readr package to save a data.frame (or other R object) to an rds file.

#library(readr) #uncomment and load readr if not already loaded
write_rds(mydat, file = "./data/yts_dataset.rds")
tail(dir("./data"), 4)
[1] "YouthTobacco_newNames.csv" "yts_data.rda"             
[3] "yts_dataset.rds"           "yts_temp.rds"             

To read this back in to R, you need to use read_rds, but you need to assign it.

#library(readr) #uncomment and load readr if not already loaded
mydat2 = read_rds(file = "./data/yts_dataset.rds")
all.equal(mydat, mydat2)
[1] TRUE

You can also save and load multiple R objects by using the save and load functions in base R. These functions save objects with the .rda or .RData extensions.

x = 5
yts = read_csv("./data/Youth_Tobacco_Survey_YTS_Data.csv")
save(yts, x, file = "./data/yts_data.rda")

We can view the R objects in our workspace using ls(). Note that we have the objects x and yts.

ls() #list of R objects in the workspace
 [1] "avg"      "avgs"     "circ"     "hers.dat" "long"     "mydat"   
 [7] "mydat2"   "name"     "name2"    "palette"  "sub_yts"  "tb"      
[13] "x"        "y"        "yts"     

Let’s remove them from our workspace and reload them from the .rda file we saved.

rm(list = c("x", "yts")) #delete x and yts
ls() # x and yts are not longer in our workspace. They were deleted
 [1] "avg"      "avgs"     "circ"     "hers.dat" "long"     "mydat"   
 [7] "mydat2"   "name"     "name2"    "palette"  "sub_yts"  "tb"      
[13] "y"       
load("./data/yts_data.rda")
ls()
 [1] "avg"      "avgs"     "circ"     "hers.dat" "long"     "mydat"   
 [7] "mydat2"   "name"     "name2"    "palette"  "sub_yts"  "tb"      
[13] "x"        "y"        "yts"     

4.5 Data I/O in Base R

Base R also have functions to import and export data

  • read.delim is the base R equivalent of readr’s read_delim
  • read.csv is the base R equivalent of readr’s read_csv

These functions have slightly different syntax for reading data, like header (which indicates of the first row are column names) and as.is (which controls whether or not a specified column is converted to a factor). Many online resources will use these base R functions, so you should be familiar with them. However, RStudio has switched to using the readr data import tools, so we will use them in class. The readr import functions are also up to two times faster for reading in large datasets and have a progress bar which is nice. You may use whichever function you feel more comfortable with.

Here is how to read in the YTS dataset using base R functions which return a data.frame instead of a tibble.

mydat2 = read.csv("./data/Youth_Tobacco_Survey_YTS_Data.csv")
head(mydat2)
  YEAR LocationAbbr LocationDesc                 TopicType         TopicDesc
1 2015           AZ      Arizona Tobacco Use - Survey Data Cessation (Youth)
2 2015           AZ      Arizona Tobacco Use - Survey Data Cessation (Youth)
3 2015           AZ      Arizona Tobacco Use - Survey Data Cessation (Youth)
4 2015           AZ      Arizona Tobacco Use - Survey Data Cessation (Youth)
5 2015           AZ      Arizona Tobacco Use - Survey Data Cessation (Youth)
6 2015           AZ      Arizona Tobacco Use - Survey Data Cessation (Youth)
                                                MeasureDesc DataSource Response
1               Percent of Current Smokers Who Want to Quit        YTS         
2               Percent of Current Smokers Who Want to Quit        YTS         
3               Percent of Current Smokers Who Want to Quit        YTS         
4 Quit Attempt in Past Year Among Current Cigarette Smokers        YTS         
5 Quit Attempt in Past Year Among Current Cigarette Smokers        YTS         
6 Quit Attempt in Past Year Among Current Cigarette Smokers        YTS         
  Data_Value_Unit Data_Value_Type Data_Value Data_Value_Footnote_Symbol
1               %      Percentage         NA                          *
2               %      Percentage         NA                          *
3               %      Percentage         NA                          *
4               %      Percentage         NA                          *
5               %      Percentage         NA                          *
6               %      Percentage         NA                          *
                                                      Data_Value_Footnote
1 Data in these cells have been suppressed because of a small sample size
2 Data in these cells have been suppressed because of a small sample size
3 Data in these cells have been suppressed because of a small sample size
4 Data in these cells have been suppressed because of a small sample size
5 Data in these cells have been suppressed because of a small sample size
6 Data in these cells have been suppressed because of a small sample size
  Data_Value_Std_Err Low_Confidence_Limit High_Confidence_Limit Sample_Size
1                 NA                   NA                    NA          NA
2                 NA                   NA                    NA          NA
3                 NA                   NA                    NA          NA
4                 NA                   NA                    NA          NA
5                 NA                   NA                    NA          NA
6                 NA                   NA                    NA          NA
   Gender      Race      Age     Education
1 Overall All Races All Ages Middle School
2    Male All Races All Ages Middle School
3  Female All Races All Ages Middle School
4 Overall All Races All Ages Middle School
5    Male All Races All Ages Middle School
6  Female All Races All Ages Middle School
                                GeoLocation TopicTypeId TopicId MeasureId
1 (34.865970280000454, -111.76381127699972)         BEH  105BEH    170CES
2 (34.865970280000454, -111.76381127699972)         BEH  105BEH    170CES
3 (34.865970280000454, -111.76381127699972)         BEH  105BEH    170CES
4 (34.865970280000454, -111.76381127699972)         BEH  105BEH    169QUA
5 (34.865970280000454, -111.76381127699972)         BEH  105BEH    169QUA
6 (34.865970280000454, -111.76381127699972)         BEH  105BEH    169QUA
  StratificationID1 StratificationID2 StratificationID3 StratificationID4
1              1GEN              8AGE              6RAC              1EDU
2              2GEN              8AGE              6RAC              1EDU
3              3GEN              8AGE              6RAC              1EDU
4              1GEN              8AGE              6RAC              1EDU
5              2GEN              8AGE              6RAC              1EDU
6              3GEN              8AGE              6RAC              1EDU
  SubMeasureID DisplayOrder
1        YTS01            1
2        YTS02            2
3        YTS03            3
4        YTS04            4
5        YTS05            5
6        YTS06            6

Notice the printing of a data.frame is not as “pretty” as for a tibble, since it overflows off the frame.

Similarly, there are base R functions for exporting data like write.delim and write.csv.

4.6 Exercises

  1. Read in the Youth Tobacco study, Youth_Tobacco_Survey_YTS_Data.csv and name it youth.
  2. Install and invoke the readxl package. RStudio > Tools > Install Packages. Type readxl into the Package search and click install. Load the installed library with library(readxl).
  3. Download an Excel version of the Monuments dataset, Monuments.xlsx, from CANVAS. Use the read_excel() function in the readxl package to read in the dataset and call the output mon.
  4. Write out the mon R object as a CSV file using readr::write_csv and call the file “monuments.csv”.
  5. Write out the mon R object as an RDS file using readr::write_rds and call it “monuments.rds”.