In this lecture, we will discuss how to read and write files in R. By the end of this lecture you should be able to
Datasets in R are stored as data.frames.We can manually enter a dataset into R by simply creating a dataframe. This is similar to using DATALINES in SAS.
recall that missing values in R are stored as NA. Lets manually create a small dataset of patient information in R by creating a data.frame.
patient.info <- data.frame(name = c("Bob", "Sue", "Mary", "Joe", "John"),
height = c(70, 62, 65, NA, 68),
weight = c(195, 132, NA, 35, NA),
age = c(21, 18, 25, 35, NA))
patient.info
## name height weight age
## 1 Bob 70 195 21
## 2 Sue 62 132 18
## 3 Mary 65 NA 25
## 4 Joe NA 35 35
## 5 John 68 NA NA
Usually datasets will alread by stored in some file format and be too big to manually enter, so lets explore reading data in from a file.
Let’s look at the documentation first ?read.fwf
.
Read a table of fixed width formatted data into a data.frame
.
read.fwf(file, widths, header = FALSE, sep = "\t", skip = 0, row.names, col.names, n = -1, buffersize = 2000, fileEncoding = "", ...)
file
|
the name of the file which the data are to be read from.
Alternatively, |
widths
|
integer vector, giving the widths of the fixed-width fields (of one line), or list of integer vectors giving widths for multiline records. |
header
|
a logical value indicating whether the file contains the names of the variables as its first line. If present, the names must be delimited by |
sep
|
character; the separator used internally; should be a character that does not occur in the file (except in the header). |
skip
|
number of initial lines to skip; see |
row.names
|
see |
col.names
|
see |
n
|
the maximum number of records (lines) to be read, defaulting to no limit. |
buffersize
|
Maximum number of lines to read at one time |
fileEncoding
|
character string: if non-empty declares the encoding used on a file (not a connection) so the character data can be re-encoded. See the ‘Encoding’ section of the help for |
…
|
further arguments to be passed to |
We will now use read.fwf to read in the following small fixed width file found in fixedwidth.dat in the course page. Each observation includes
----1----2----3----4 Bob Smith 7019521 Sue Jones 6213218 Mary Lane 65 25 Joe Tesh 35 John Young68
file.path <- "H:\\BiostatCourses\\PublicHealthComputing\\Lectures\\Week2ReadinData\\R\\"
example2 <- read.fwf(paste(file.path,"fixedwidth.dat", sep=""),
widths = c(10,2,3,2),
col.names = c("name", "height", "weight", "age"),
skip = 1)
example2
## name height weight age
## 1 Bob Smith 70 195 21
## 2 Sue Jones 62 132 18
## 3 Mary Lane 65 NA 25
## 4 Joe Tesh NA NA 35
## 5 John Young 68 NA NA
Note that we specify the widths of the columns in R rather than the column locations as we did in SAS.
If you have a working directory that contains your files for a project, you can store that in your R session using setwd()
so that you don’t need to reference the full path each time.
setwd("H:\\BiostatCourses\\PublicHealthComputing\\Lectures\\Week2ReadinData\\R\\")
read.fwf("fixedwidth.dat", #now I only need to specify the filename and not the full path
widths = c(10,2,3,2),
col.names = c("name", "height", "weight", "age"),
skip = 1)
## name height weight age
## 1 Bob Smith 70 195 21
## 2 Sue Jones 62 132 18
## 3 Mary Lane 65 NA 25
## 4 Joe Tesh NA NA 35
## 5 John Young 68 NA NA
With fixed with formats we can specify negative widths to skip columns.
read.fwf("fixedwidth.dat",
widths = c(10,-2,-3,2),
col.names = c("name", "age"),
skip = 1)
## name age
## 1 Bob Smith 21
## 2 Sue Jones 18
## 3 Mary Lane 25
## 4 Joe Tesh 35
## 5 John Young NA
BAse R does not have as many built in formats as SAS. Instead, we will need to use string manipulation to parse some formats. R does have a date data type built in and the lubridate
package provides some other functions for dealing with dates as well.
The next file formatsfile.dat has some dates, currency records, percent, and decimals. We will first read in the file.
----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----10---+ 3/4/2010 4-Mar-10 4 Mar 10 76,604 $76,604 53.2% 0.991418 South Carolina South Carolina 3/11/2010 11-Mar-10 11 Mar 10 90,376 $90,376 0.8% 0.983612 North Carolina North Carolina 3/18/2010 18-Mar-10 18 Mar 10 96,263 $96,263 16.1% 0.888531 South Dakota South Dakota 3/25/2010 25-Mar-10 25 Mar 10 33,608 $33,608 3.3% 0.530136 North Dakota North Dakota 4/1/2010 1-Apr-10 1 Apr 10 26,577 $26,577 73.0% 0.672677 California California 4/8/2010 8-Apr-10 8 Apr 10 88,943 $88,943 8.5% 0.111815 Arizona Arizona 4/15/2010 15-Apr-10 15 Apr 10 49,988 $49,988 79.7% 0.958651 Georgia Georgia 4/22/2010 22-Apr-10 22 Apr 10 25,755 $25,755 48.7% 0.746895 Florida Florida 4/29/2010 29-Apr-10 29 Apr 10 53,169 $53,169 76.1% 0.802572 Mississippi Mississippi 5/6/2010 6-May-10 6 May 10 95,509 $95,509 91.5% 0.564680 Louisiana Louisiana 5/13/2010 13-May-10 13 May 10 86,442 $86,442 19.4% 0.896147 Alabama Alabama 5/20/2010 20-May-10 20 May 10 57,135 $57,135 67.8% 0.965187 Texas Texas 5/27/2010 27-May-10 27 May 10 33,458 $33,458 61.1% 0.588009 New York New York 6/3/2010 3-Jun-10 3 Jun 10 770 $770 93.4% 0.886857 New Jersey New Jersey 6/10/2010 10-Jun-10 10 Jun 10 65,739 $65,739 98.1% 0.858632 Deleware Deleware 6/17/2010 17-Jun-10 17 Jun 10 99,529 $99,529 0.0% 0.869562 Virginia Virginia 6/24/2010 24-Jun-10 24 Jun 10 79,056 $79,056 13.8% 0.673845 West Virginia West Virginia 7/1/2010 1-Jul-10 1 Jul 10 2,397 $2,397 59.4% 0.700637 Ohio Ohio 7/8/2010 8-Jul-10 8 Jul 10 66,614 $66,614 71.4% 0.027379 Idaho Idaho
example3 <- read.fwf("formatsfile.dat",
widths = c(11,12,11,9,9,8,10,16,14),
col.names = c("date1", "date2","date3","num1","currency","percent","decimal","state1","state2"),
skip = 1)
example3
## date1 date2 date3 num1 currency percent
## 1 3/4/2010 4-Mar-10 4 Mar 10 76,604 $76,604 53.2%
## 2 3/11/2010 11-Mar-10 11 Mar 10 90,376 $90,376 0.8%
## 3 3/18/2010 18-Mar-10 18 Mar 10 96,263 $96,263 16.1%
## 4 3/25/2010 25-Mar-10 25 Mar 10 33,608 $33,608 3.3%
## 5 4/1/2010 1-Apr-10 1 Apr 10 26,577 $26,577 73.0%
## 6 4/8/2010 8-Apr-10 8 Apr 10 88,943 $88,943 8.5%
## 7 4/15/2010 15-Apr-10 15 Apr 10 49,988 $49,988 79.7%
## 8 4/22/2010 22-Apr-10 22 Apr 10 25,755 $25,755 48.7%
## 9 4/29/2010 29-Apr-10 29 Apr 10 53,169 $53,169 76.1%
## 10 5/6/2010 6-May-10 6 May 10 95,509 $95,509 91.5%
## 11 5/13/2010 13-May-10 13 May 10 86,442 $86,442 19.4%
## 12 5/20/2010 20-May-10 20 May 10 57,135 $57,135 67.8%
## 13 5/27/2010 27-May-10 27 May 10 33,458 $33,458 61.1%
## 14 6/3/2010 3-Jun-10 3 Jun 10 770 $770 93.4%
## 15 6/10/2010 10-Jun-10 10 Jun 10 65,739 $65,739 98.1%
## 16 6/17/2010 17-Jun-10 17 Jun 10 99,529 $99,529 0.0%
## 17 6/24/2010 24-Jun-10 24 Jun 10 79,056 $79,056 13.8%
## 18 7/1/2010 1-Jul-10 1 Jul 10 2,397 $2,397 59.4%
## 19 7/8/2010 8-Jul-10 8 Jul 10 66,614 $66,614 71.4%
## decimal state1 state2
## 1 0.991418 South Carolina South Carolina
## 2 0.983612 North Carolina North Carolina
## 3 0.888531 South Dakota South Dakota
## 4 0.530136 North Dakota North Dakota
## 5 0.672677 California California
## 6 0.111815 Arizona Arizona
## 7 0.958651 Georgia Georgia
## 8 0.746895 Florida Florida
## 9 0.802572 Mississippi Mississippi
## 10 0.564680 Louisiana Louisiana
## 11 0.896147 Alabama Alabama
## 12 0.965187 Texas Texas
## 13 0.588009 New York New York
## 14 0.886857 New Jersey New Jersey
## 15 0.858632 Deleware Deleware
## 16 0.869562 Virginia Virginia
## 17 0.673845 West Virginia West Virginia
## 18 0.700637 Ohio Ohio
## 19 0.027379 Idaho Idaho
class(example3$date1) #read in as a character value
## [1] "factor"
class(example3$currency) #read in as a character value
## [1] "factor"
Note that any value that contains no numerical values was read in as a character and then coerced to a factor in the data.frame. First let’s examine how to format the dates.
R has another built in data type called Date. We can format dates using the as.Date() along with specifying the format. When formatted as dates, we can perform math operations to find the time between dates. The following table gives the symbols need to format the dates in R.
Symbol Meaning Example %d day as a number (0-31) 01-31 %a abbreviated weekday Mon %A unabbreviated weekday Monday %m month (00-12) 00-12 %b abbreviated month Jan %B unabbreviated month January %y 2-digit year 07 %Y 4-digit year 2007
example3$date1 <- as.Date(example3$date1, "%m/%d/%Y")
example3$date2 <- as.Date(example3$date2, "%d-%b-%y")
example3$date3 <- as.Date(example3$date3, "%d %b %y")
example3[1:5,1:3]
## date1 date2 date3
## 1 2010-03-04 2010-03-04 2010-03-04
## 2 2010-03-11 2010-03-11 2010-03-11
## 3 2010-03-18 2010-03-18 2010-03-18
## 4 2010-03-25 2010-03-25 2010-03-25
## 5 2010-04-01 2010-04-01 2010-04-01
example3$date1[1] - example3$date1[2]
## Time difference of -7 days
as.numeric(example3$date1[1] - example3$date1[2])
## [1] -7
To format the numeric data, we need to first remove the character values such as the dollar sign $ and commas. Then we can coerce the data to numbers by using as.numeric()
. To search and replace a character we can use the function gsub()
.
example3$num1 <- as.numeric(gsub(",","",example3$num1))
example3$currency <- as.numeric(gsub(",","",gsub("$","",example3$currency,fixed=T)))
example3$percent <- as.numeric(gsub("%","",example3$percent,fixed=TRUE))
example3[1:5,4:6]
## num1 currency percent
## 1 76604 76604 53.2
## 2 90376 90376 0.8
## 3 96263 96263 16.1
## 4 33608 33608 3.3
## 5 26577 26577 73.0
A useful function for formatting strings is the trimws
function which is used to trim excess whitspaces on either end of the string.
example3$state2[19]
## [1] Idaho
## 19 Levels: Ohio Idaho Texas ... South Carolina
example3$state1 <- trimws(example3$state1,which = "both")
example3$state2 <- trimws(example3$state2,which = "both")
example3$state2[19]
## [1] "Idaho"
example3
## date1 date2 date3 num1 currency percent decimal
## 1 2010-03-04 2010-03-04 2010-03-04 76604 76604 53.2 0.991418
## 2 2010-03-11 2010-03-11 2010-03-11 90376 90376 0.8 0.983612
## 3 2010-03-18 2010-03-18 2010-03-18 96263 96263 16.1 0.888531
## 4 2010-03-25 2010-03-25 2010-03-25 33608 33608 3.3 0.530136
## 5 2010-04-01 2010-04-01 2010-04-01 26577 26577 73.0 0.672677
## 6 2010-04-08 2010-04-08 2010-04-08 88943 88943 8.5 0.111815
## 7 2010-04-15 2010-04-15 2010-04-15 49988 49988 79.7 0.958651
## 8 2010-04-22 2010-04-22 2010-04-22 25755 25755 48.7 0.746895
## 9 2010-04-29 2010-04-29 2010-04-29 53169 53169 76.1 0.802572
## 10 2010-05-06 2010-05-06 2010-05-06 95509 95509 91.5 0.564680
## 11 2010-05-13 2010-05-13 2010-05-13 86442 86442 19.4 0.896147
## 12 2010-05-20 2010-05-20 2010-05-20 57135 57135 67.8 0.965187
## 13 2010-05-27 2010-05-27 2010-05-27 33458 33458 61.1 0.588009
## 14 2010-06-03 2010-06-03 2010-06-03 770 770 93.4 0.886857
## 15 2010-06-10 2010-06-10 2010-06-10 65739 65739 98.1 0.858632
## 16 2010-06-17 2010-06-17 2010-06-17 99529 99529 0.0 0.869562
## 17 2010-06-24 2010-06-24 2010-06-24 79056 79056 13.8 0.673845
## 18 2010-07-01 2010-07-01 2010-07-01 2397 2397 59.4 0.700637
## 19 2010-07-08 2010-07-08 2010-07-08 66614 66614 71.4 0.027379
## state1 state2
## 1 South Carolina South Carolina
## 2 North Carolina North Carolina
## 3 South Dakota South Dakota
## 4 North Dakota North Dakota
## 5 California California
## 6 Arizona Arizona
## 7 Georgia Georgia
## 8 Florida Florida
## 9 Mississippi Mississippi
## 10 Louisiana Louisiana
## 11 Alabama Alabama
## 12 Texas Texas
## 13 New York New York
## 14 New Jersey New Jersey
## 15 Deleware Deleware
## 16 Virginia Virginia
## 17 West Virginia West Virginia
## 18 Ohio Ohio
## 19 Idaho Idaho
The read.table function is a general function for reading in delimited files. There are also a few special cases of read.table for csv files and other delimited files. The following is from the help file ?read.table
.
Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.
read.table(file, header = FALSE, sep = "", quote = "\"'", dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"), row.names, col.names, as.is = !stringsAsFactors, na.strings = "NA", colClasses = NA, nrows = -1, skip = 0, check.names = TRUE, fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE, comment.char = "#", allowEscapes = FALSE, flush = FALSE, stringsAsFactors = default.stringsAsFactors(), fileEncoding = "", encoding = "unknown", text, skipNul = FALSE) read.csv(file, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...) read.csv2(file, header = TRUE, sep = ";", quote = "\"", dec = ",", fill = TRUE, comment.char = "", ...) read.delim(file, header = TRUE, sep = "\t", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...) read.delim2(file, header = TRUE, sep = "\t", quote = "\"", dec = ",", fill = TRUE, comment.char = "", ...)
Let’s read in the csv file containing the air quality dataset.
airq <- read.table("H:\\BiostatCourses\\PublicHealthComputing\\Data\\usair_a.csv",
sep = ",", header = TRUE)
head(airq)
## city so2 temp factories population windspeed rain rainydays
## 1 Phoenix 10 70.3 213 582 6.0 7.05 36
## 2 Little Rock 13 61.0 91 132 8.2 48.52 100
## 3 San Francisco 12 56.7 453 716 8.7 20.66 67
## 4 Denver 17 51.9 454 515 9.0 12.95 86
## 5 Hartford 56 49.1 412 158 9.0 43.37 127
## 6 Wilmington 36 54.0 80 80 9.0 40.25 114
We could have also used read.csv which has sep =“,” by defualt. The head=TRUE
option lets R know that the first line contains column names. When reading files, it may be easier to select the file from a file explorer. This can be done in R by using file.choose()
.
airq <- read.table(file.choose(),
sep = ",", header = TRUE)
head(airq)
File.choose causes an explorer window to pop up so that you can browse your files and choose the correct one.
There are many packages that you can use to read in other file formats such as excel or SAS and SPSS datasets. See
Base R and most packages have built in datasets that are used to illustrate their function and test their code. To see the data sets that come with base R use data()
. Try it!
One of the datasets in base R is the biochemical oxygen demain (BOD).
data(BOD) #load the dataset into your workspace as a data.frame
head(BOD) #look at the first few observations
## Time demand
## 1 1 8.3
## 2 2 10.3
## 3 3 19.0
## 4 4 16.0
## 5 5 15.6
## 6 7 19.8
Use ?BOD
to see a description of the dataset in the help file.
The BOD
data frame has 6 rows and 2 columns giving the biochemical oxygen demand versus time in an evaluation of water quality.
BOD
This data frame contains the following columns:
Time
A numeric vector giving the time of the measurement (days).
demand
A numeric vector giving the biochemical oxygen demand (mg/l).
Bates, D.M. and Watts, D.G. (1988), Nonlinear Regression Analysis and Its Applications, Wiley, Appendix A1.4.
Originally from Marske (1967), Biochemical Oxygen Demand Data Interpretation Using Sum of Squares Surface M.Sc. Thesis, University of Wisconsin – Madison.
We can view datasets in other packages by either loading the package first and using data()
or by using the package option data(package="MASS")
data(package="MASS")
data(OME) #package is not loaded
## Warning in data(OME): data set 'OME' not found
data(OME, package = "MASS")
head(OME)
## ID Age OME Loud Noise Correct Trials
## 1 1 30 low 35 coherent 1 4
## 2 1 30 low 35 incoherent 4 5
## 3 1 30 low 40 coherent 0 3
## 4 1 30 low 40 incoherent 1 1
## 5 1 30 low 45 coherent 2 4
## 6 1 30 low 45 incoherent 2 2
?OME #Help file is in the package MASS which is not loaded
## No documentation for 'OME' in specified packages and libraries:
## you could try '??OME'
library("MASS")
Now ?OME
gives
Experiments were performed on children on their ability to differentiate a signal in broad-band noise. The noise was played from a pair of speakers and a signal was added to just one channel; the subject had to turn his/her head to the channel with the added signal. The signal was either coherent (the amplitude of the noise was increased for a period) or incoherent (independent noise was added for the same period to form the same increase in power).
The threshold used in the original analysis was the stimulus loudness needs to get 75% correct responses. Some of the children had suffered from otitis media with effusion (OME).
OME
The OME
data frame has 1129 rows and 7 columns:
ID
Subject ID (1 to 99, with some IDs missing). A few subjects were measured at different ages.
OME
“low”
or “high”
or “N/A”
(at ages other than 30 and 60 months).
Age
Age of the subject (months).
Loud
Loudness of stimulus, in decibels.
Noise
Whether the signal in the stimulus was “coherent”
or “incoherent”
.
Correct
Number of correct responses from Trials
trials.
Trials
Number of trials performed.
The experiment was to study otitis media with effusion (OME), a very common childhood condition where the middle ear space, which is normally air-filled, becomes congested by a fluid. There is a concomitant fluctuating, conductive hearing loss which can result in various language, cognitive and social deficits. The term ‘binaural hearing’ is used to describe the listening conditions in which the brain is processing information from both ears at the same time. The brain computes differences in the intensity and/or timing of signals arriving at each ear which contributes to sound localisation and also to our ability to hear in background noise.
Some years ago, it was found that children of 7–8 years with a history of significant OME had significantly worse binaural hearing than children without such a history, despite having equivalent sensitivity. The question remained as to whether it was the timing, the duration, or the degree of severity of the otitis media episodes during critical periods, which affected later binaural hearing. In an attempt to begin to answer this question, 95 children were monitored for the presence of effusion every month since birth. On the basis of OME experience in their first two years, the test population was split into one group of high OME prevalence and one of low prevalence.
Sarah Hogan, Dept of Physiology, University of Oxford, via Dept of Statistics Consulting Service