In this lecture, we will discuss how to read and write files in R. By the end of this lecture you should be able to create data frames manually, read data into R from fixed width and delimited files, handle common formats such as dates, currency, and percentages, and work with built-in datasets.

Data Frames

Datasets in R are stored as data.frames. We can manually enter a dataset into R by creating a data.frame, which is similar to using DATALINES in SAS.

Recall that missing values in R are stored as NA. Let's manually create a small dataset of patient information by building a data.frame.

patient.info <- data.frame(name = c("Bob", "Sue", "Mary", "Joe", "John"),
                       height = c(70, 62, 65, NA, 68),
                       weight = c(195, 132, NA, 35, NA),
                       age = c(21, 18, 25, 35, NA))
patient.info
##   name height weight age
## 1  Bob     70    195  21
## 2  Sue     62    132  18
## 3 Mary     65     NA  25
## 4  Joe     NA     35  35
## 5 John     68     NA  NA
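
Missing values propagate through most calculations, so it is worth knowing a few base functions for working with NA. A quick sketch using the patient.info data frame above:

mean(patient.info$height)                # NA, because one height is missing
mean(patient.info$height, na.rm = TRUE)  # 66.25, ignoring the missing value
is.na(patient.info$weight)               # logical vector flagging the missing entries
complete.cases(patient.info)             # TRUE for rows with no missing values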

Usually datasets will already be stored in some file format and will be too big to enter manually, so let's explore reading data in from a file.

Fixed Width Files and read.fwf

Let’s first look at the documentation with ?read.fwf.

Read Fixed Width Format Files

Description

Read a table of fixed width formatted data into a data.frame.

Usage

read.fwf(file, widths, header = FALSE, sep = "\t",
         skip = 0, row.names, col.names, n = -1,
         buffersize = 2000, fileEncoding = "", ...)

Arguments

file

the name of the file which the data are to be read from.

Alternatively, file can be a connection, which will be opened if necessary, and if so closed at the end of the function call.

widths

integer vector, giving the widths of the fixed-width fields (of one line), or list of integer vectors giving widths for multiline records.

header

a logical value indicating whether the file contains the names of the variables as its first line. If present, the names must be delimited by sep.

sep

character; the separator used internally; should be a character that does not occur in the file (except in the header).

skip

number of initial lines to skip; see read.table.

row.names

see read.table.

col.names

see read.table.

n

the maximum number of records (lines) to be read, defaulting to no limit.

buffersize

Maximum number of lines to read at one time

fileEncoding

character string: if non-empty declares the encoding used on a file (not a connection) so the character data can be re-encoded. See the ‘Encoding’ section of the help for file, the ‘R Data Import/Export Manual’ and ‘Note’.

...

further arguments to be passed to read.table. Useful such arguments include as.is, na.strings, colClasses and strip.white.

We will now use read.fwf to read in the following small fixed width file, found in fixedwidth.dat on the course page. Each observation includes a name, height, weight, and age.

----1----2----3----4
Bob Smith 7019521
Sue Jones 6213218
Mary Lane 65   25
Joe Tesh       35
John Young68
file.path <- "H:\\BiostatCourses\\PublicHealthComputing\\Lectures\\Week2ReadinData\\R\\"
example2 <- read.fwf(paste(file.path,"fixedwidth.dat", sep=""),
                     widths = c(10,2,3,2),
                     col.names = c("name", "height", "weight", "age"),
                     skip = 1)
example2
##         name height weight age
## 1 Bob Smith      70    195  21
## 2 Sue Jones      62    132  18
## 3 Mary Lane      65     NA  25
## 4 Joe Tesh       NA     NA  35
## 5 John Young     68     NA  NA

Note that we specify the widths of the columns in R rather than the column locations as we did in SAS.

If you have a working directory that contains the files for a project, you can set it for your R session with setwd() so that you don’t need to type the full path each time.

setwd("H:\\BiostatCourses\\PublicHealthComputing\\Lectures\\Week2ReadinData\\R\\")
read.fwf("fixedwidth.dat", #now I only need to specify the filename and not the full path
         widths = c(10,2,3,2),
         col.names = c("name", "height", "weight", "age"),
         skip = 1)
##         name height weight age
## 1 Bob Smith      70    195  21
## 2 Sue Jones      62    132  18
## 3 Mary Lane      65     NA  25
## 4 Joe Tesh       NA     NA  35
## 5 John Young     68     NA  NA
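
After setting the working directory you can confirm where R is looking and see which files are available there. A quick sketch:

getwd()        # print the current working directory
list.files()   # list the files in the current working directory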

With fixed width formats we can specify negative widths to skip columns.

read.fwf("fixedwidth.dat", 
         widths = c(10,-2,-3,2),
         col.names = c("name", "age"),
         skip = 1)
##         name age
## 1 Bob Smith   21
## 2 Sue Jones   18
## 3 Mary Lane   25
## 4 Joe Tesh    35
## 5 John Young  NA
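
As the help file notes, further read.table arguments such as colClasses and strip.white can be passed through read.fwf. For example (a sketch using the same file), we could keep name as a character column and strip the blanks that pad the fixed-width field:

read.fwf("fixedwidth.dat",
         widths = c(10, 2, 3, 2),
         col.names = c("name", "height", "weight", "age"),
         skip = 1,
         colClasses = c("character", "numeric", "numeric", "numeric"),
         strip.white = TRUE)  # trim the padding blanks around character fields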

Formats with R

Base R does not have as many built-in formats as SAS. Instead, we often need to use string manipulation to parse formatted values. R does have a built-in date class, and the lubridate package provides additional functions for working with dates.

The next file, formatsfile.dat, contains dates, currency values, percentages, and decimals. We will first read in the file.

----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----10---+
3/4/2010   4-Mar-10    4 Mar 10   76,604   $76,604  53.2%   0.991418  South Carolina  South Carolina 
3/11/2010  11-Mar-10   11 Mar 10  90,376   $90,376  0.8%    0.983612  North Carolina  North Carolina 
3/18/2010  18-Mar-10   18 Mar 10  96,263   $96,263  16.1%   0.888531  South Dakota      South Dakota 
3/25/2010  25-Mar-10   25 Mar 10  33,608   $33,608  3.3%    0.530136  North Dakota      North Dakota 
4/1/2010   1-Apr-10    1 Apr 10   26,577   $26,577  73.0%   0.672677  California          California 
4/8/2010   8-Apr-10    8 Apr 10   88,943   $88,943  8.5%    0.111815  Arizona                Arizona 
4/15/2010  15-Apr-10   15 Apr 10  49,988   $49,988  79.7%   0.958651  Georgia                Georgia 
4/22/2010  22-Apr-10   22 Apr 10  25,755   $25,755  48.7%   0.746895  Florida                Florida 
4/29/2010  29-Apr-10   29 Apr 10  53,169   $53,169  76.1%   0.802572  Mississippi        Mississippi 
5/6/2010   6-May-10    6 May 10   95,509   $95,509  91.5%   0.564680  Louisiana            Louisiana 
5/13/2010  13-May-10   13 May 10  86,442   $86,442  19.4%   0.896147  Alabama                Alabama
5/20/2010  20-May-10   20 May 10  57,135   $57,135  67.8%   0.965187  Texas                    Texas 
5/27/2010  27-May-10   27 May 10  33,458   $33,458  61.1%   0.588009  New York              New York 
6/3/2010   3-Jun-10    3 Jun 10   770      $770     93.4%   0.886857  New Jersey          New Jersey 
6/10/2010  10-Jun-10   10 Jun 10  65,739   $65,739  98.1%   0.858632  Deleware              Deleware 
6/17/2010  17-Jun-10   17 Jun 10  99,529   $99,529  0.0%    0.869562  Virginia              Virginia 
6/24/2010  24-Jun-10   24 Jun 10  79,056   $79,056  13.8%   0.673845  West Virginia    West Virginia 
7/1/2010   1-Jul-10    1 Jul 10   2,397    $2,397   59.4%   0.700637  Ohio                      Ohio 
7/8/2010   8-Jul-10    8 Jul 10   66,614   $66,614  71.4%   0.027379  Idaho                    Idaho
example3 <- read.fwf("formatsfile.dat",
                     widths = c(11,12,11,9,9,8,10,16,14),
                     col.names = c("date1", "date2","date3","num1","currency","percent","decimal","state1","state2"),
                     skip = 1)
example3
##          date1        date2       date3      num1  currency  percent
## 1  3/4/2010    4-Mar-10     4 Mar 10    76,604    $76,604   53.2%   
## 2  3/11/2010   11-Mar-10    11 Mar 10   90,376    $90,376   0.8%    
## 3  3/18/2010   18-Mar-10    18 Mar 10   96,263    $96,263   16.1%   
## 4  3/25/2010   25-Mar-10    25 Mar 10   33,608    $33,608   3.3%    
## 5  4/1/2010    1-Apr-10     1 Apr 10    26,577    $26,577   73.0%   
## 6  4/8/2010    8-Apr-10     8 Apr 10    88,943    $88,943   8.5%    
## 7  4/15/2010   15-Apr-10    15 Apr 10   49,988    $49,988   79.7%   
## 8  4/22/2010   22-Apr-10    22 Apr 10   25,755    $25,755   48.7%   
## 9  4/29/2010   29-Apr-10    29 Apr 10   53,169    $53,169   76.1%   
## 10 5/6/2010    6-May-10     6 May 10    95,509    $95,509   91.5%   
## 11 5/13/2010   13-May-10    13 May 10   86,442    $86,442   19.4%   
## 12 5/20/2010   20-May-10    20 May 10   57,135    $57,135   67.8%   
## 13 5/27/2010   27-May-10    27 May 10   33,458    $33,458   61.1%   
## 14 6/3/2010    3-Jun-10     3 Jun 10    770       $770      93.4%   
## 15 6/10/2010   10-Jun-10    10 Jun 10   65,739    $65,739   98.1%   
## 16 6/17/2010   17-Jun-10    17 Jun 10   99,529    $99,529   0.0%    
## 17 6/24/2010   24-Jun-10    24 Jun 10   79,056    $79,056   13.8%   
## 18 7/1/2010    1-Jul-10     1 Jul 10    2,397     $2,397    59.4%   
## 19 7/8/2010    8-Jul-10     8 Jul 10    66,614    $66,614   71.4%   
##     decimal           state1         state2
## 1  0.991418 South Carolina   South Carolina
## 2  0.983612 North Carolina   North Carolina
## 3  0.888531 South Dakota       South Dakota
## 4  0.530136 North Dakota       North Dakota
## 5  0.672677 California           California
## 6  0.111815 Arizona                 Arizona
## 7  0.958651 Georgia                 Georgia
## 8  0.746895 Florida                 Florida
## 9  0.802572 Mississippi         Mississippi
## 10 0.564680 Louisiana             Louisiana
## 11 0.896147 Alabama                 Alabama
## 12 0.965187 Texas                     Texas
## 13 0.588009 New York               New York
## 14 0.886857 New Jersey           New Jersey
## 15 0.858632 Deleware               Deleware
## 16 0.869562 Virginia               Virginia
## 17 0.673845 West Virginia     West Virginia
## 18 0.700637 Ohio                       Ohio
## 19 0.027379 Idaho                     Idaho
class(example3$date1) #not numeric; read in as character and stored as a factor
## [1] "factor"
class(example3$currency) #not numeric; read in as character and stored as a factor
## [1] "factor"

Note that any column containing non-numeric characters was read in as character data and then coerced to a factor in the data.frame. First let’s examine how to format the dates.

R has a built-in class for dates called Date. We can convert character values to dates with as.Date() by specifying the appropriate format string. Once values are stored as dates, we can perform arithmetic on them, for example to find the time between two dates. The following table gives the symbols needed to describe date formats in R.

Symbol   Meaning                    Example
%d       day as a number (01-31)    01-31
%a       abbreviated weekday        Mon
%A       unabbreviated weekday      Monday
%m       month (01-12)              01-12
%b       abbreviated month          Jan
%B       unabbreviated month        January
%y       2-digit year               07
%Y       4-digit year               2007
example3$date1 <- as.Date(example3$date1, "%m/%d/%Y")
example3$date2 <- as.Date(example3$date2, "%d-%b-%y")
example3$date3 <- as.Date(example3$date3, "%d %b %y")

example3[1:5,1:3]
##        date1      date2      date3
## 1 2010-03-04 2010-03-04 2010-03-04
## 2 2010-03-11 2010-03-11 2010-03-11
## 3 2010-03-18 2010-03-18 2010-03-18
## 4 2010-03-25 2010-03-25 2010-03-25
## 5 2010-04-01 2010-04-01 2010-04-01
example3$date1[1] - example3$date1[2]
## Time difference of -7 days
as.numeric(example3$date1[1] - example3$date1[2]) 
## [1] -7
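
The same format symbols work in the other direction with format(), which turns a Date back into a character string. A small sketch using the dates above (month and weekday names depend on your locale):

format(example3$date1[1], "%B %d, %Y")  # "March 04, 2010"
weekdays(example3$date1[1])             # "Thursday"
Sys.Date() - example3$date1[1]          # days elapsed since the first date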

To format the numeric data, we first need to remove non-numeric characters such as the dollar sign ($) and commas. Then we can coerce the values to numbers with as.numeric(). To search for and replace characters we can use gsub(). Note that $ is a special character in regular expressions, so we pass fixed = TRUE to match it literally.

example3$num1 <- as.numeric(gsub(",","",example3$num1))
example3$currency <- as.numeric(gsub(",","",gsub("$","",example3$currency,fixed=T)))
example3$percent <- as.numeric(gsub("%","",example3$percent,fixed=TRUE))

example3[1:5,4:6]
##    num1 currency percent
## 1 76604    76604    53.2
## 2 90376    90376     0.8
## 3 96263    96263    16.1
## 4 33608    33608     3.3
## 5 26577    26577    73.0
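
Going the other direction, base R can display numbers with SAS-like formats using format() and formatC(). A small sketch:

format(example3$num1[1], big.mark = ",")                                  # "76,604"
paste0("$", formatC(example3$currency[1], format = "d", big.mark = ","))  # "$76,604"
paste0(format(example3$percent[1], nsmall = 1), "%")                      # "53.2%"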

A useful function for formatting strings is trimws(), which trims excess whitespace from either end of a string.

example3$state2[19]
## [1]          Idaho
## 19 Levels:           Ohio          Idaho          Texas ... South Carolina
example3$state1 <- trimws(example3$state1,which = "both")
example3$state2 <- trimws(example3$state2,which = "both")

example3$state2[19]
## [1] "Idaho"
example3
##         date1      date2      date3  num1 currency percent  decimal
## 1  2010-03-04 2010-03-04 2010-03-04 76604    76604    53.2 0.991418
## 2  2010-03-11 2010-03-11 2010-03-11 90376    90376     0.8 0.983612
## 3  2010-03-18 2010-03-18 2010-03-18 96263    96263    16.1 0.888531
## 4  2010-03-25 2010-03-25 2010-03-25 33608    33608     3.3 0.530136
## 5  2010-04-01 2010-04-01 2010-04-01 26577    26577    73.0 0.672677
## 6  2010-04-08 2010-04-08 2010-04-08 88943    88943     8.5 0.111815
## 7  2010-04-15 2010-04-15 2010-04-15 49988    49988    79.7 0.958651
## 8  2010-04-22 2010-04-22 2010-04-22 25755    25755    48.7 0.746895
## 9  2010-04-29 2010-04-29 2010-04-29 53169    53169    76.1 0.802572
## 10 2010-05-06 2010-05-06 2010-05-06 95509    95509    91.5 0.564680
## 11 2010-05-13 2010-05-13 2010-05-13 86442    86442    19.4 0.896147
## 12 2010-05-20 2010-05-20 2010-05-20 57135    57135    67.8 0.965187
## 13 2010-05-27 2010-05-27 2010-05-27 33458    33458    61.1 0.588009
## 14 2010-06-03 2010-06-03 2010-06-03   770      770    93.4 0.886857
## 15 2010-06-10 2010-06-10 2010-06-10 65739    65739    98.1 0.858632
## 16 2010-06-17 2010-06-17 2010-06-17 99529    99529     0.0 0.869562
## 17 2010-06-24 2010-06-24 2010-06-24 79056    79056    13.8 0.673845
## 18 2010-07-01 2010-07-01 2010-07-01  2397     2397    59.4 0.700637
## 19 2010-07-08 2010-07-08 2010-07-08 66614    66614    71.4 0.027379
##            state1         state2
## 1  South Carolina South Carolina
## 2  North Carolina North Carolina
## 3    South Dakota   South Dakota
## 4    North Dakota   North Dakota
## 5      California     California
## 6         Arizona        Arizona
## 7         Georgia        Georgia
## 8         Florida        Florida
## 9     Mississippi    Mississippi
## 10      Louisiana      Louisiana
## 11        Alabama        Alabama
## 12          Texas          Texas
## 13       New York       New York
## 14     New Jersey     New Jersey
## 15       Deleware       Deleware
## 16       Virginia       Virginia
## 17  West Virginia  West Virginia
## 18           Ohio           Ohio
## 19          Idaho          Idaho
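
A few other base string functions are often useful when parsing formatted values. A quick sketch using the cleaned state1 column:

nchar(example3$state1[1])         # 14, the number of characters in "South Carolina"
toupper(example3$state1[1])       # "SOUTH CAROLINA"
substr(example3$state1[1], 1, 5)  # "South", characters 1 through 5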

Reading Delimited Files

The read.table function is a general function for reading delimited files into data frames. There are also several convenience functions built on read.table for CSV and other delimited files. The following is from the help file ?read.table.

Description

Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.

Usage

read.table(file, header = FALSE, sep = "", quote = "\"'",
           dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
           row.names, col.names, as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#",
           allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = default.stringsAsFactors(),
           fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)

read.csv(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", ...)

read.csv2(file, header = TRUE, sep = ";", quote = "\"",
          dec = ",", fill = TRUE, comment.char = "", ...)

read.delim(file, header = TRUE, sep = "\t", quote = "\"",
           dec = ".", fill = TRUE, comment.char = "", ...)

read.delim2(file, header = TRUE, sep = "\t", quote = "\"",
            dec = ",", fill = TRUE, comment.char = "", ...)

Let’s read in the csv file containing the air quality dataset.

airq <- read.table("H:\\BiostatCourses\\PublicHealthComputing\\Data\\usair_a.csv",
                   sep = ",", header = TRUE)
head(airq)
##            city so2 temp factories population windspeed  rain rainydays
## 1       Phoenix  10 70.3       213        582       6.0  7.05        36
## 2   Little Rock  13 61.0        91        132       8.2 48.52       100
## 3 San Francisco  12 56.7       453        716       8.7 20.66        67
## 4        Denver  17 51.9       454        515       9.0 12.95        86
## 5      Hartford  56 49.1       412        158       9.0 43.37       127
## 6    Wilmington  36 54.0        80         80       9.0 40.25       114

We could also have used read.csv, which has sep = "," and header = TRUE by default. The header = TRUE option tells R that the first line contains the column names. When reading files, it is sometimes easier to select the file from a file explorer. This can be done in R with file.choose().
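
For example, the read.table call above could be written more compactly as:

airq <- read.csv("H:\\BiostatCourses\\PublicHealthComputing\\Data\\usair_a.csv")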

airq <- read.table(file.choose(),
                   sep = ",", header = TRUE)
head(airq)

file.choose() opens a file browser window so that you can navigate to the correct file and select it.

Reading in other file types

There are many packages that you can use to read in other file formats, such as Excel workbooks or SAS and SPSS datasets. See, for example, the readxl package for Excel files and the haven (or foreign) package for SAS and SPSS datasets.
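
A minimal sketch, assuming the readxl and haven packages are installed and using hypothetical file names:

# install.packages(c("readxl", "haven"))  # run once if the packages are not installed
library(readxl)
library(haven)

excel.data <- read_excel("mydata.xlsx", sheet = 1)  # Excel workbook (hypothetical file)
sas.data   <- read_sas("mydata.sas7bdat")           # SAS dataset (hypothetical file)
spss.data  <- read_sav("mydata.sav")                # SPSS dataset (hypothetical file)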

Built in Datasets

Base R and most packages come with built-in datasets that are used to illustrate their functions and to test code. To see the datasets that come with base R, use data(). Try it!

One of the datasets in base R is the biochemical oxygen demand (BOD) dataset.

data(BOD) #load the dataset into your workspace as a data.frame
head(BOD) #look at the first few observations
##   Time demand
## 1    1    8.3
## 2    2   10.3
## 3    3   19.0
## 4    4   16.0
## 5    5   15.6
## 6    7   19.8

Use ?BOD to see a description of the dataset in the help file.

Biochemical Oxygen Demand

Description

The BOD data frame has 6 rows and 2 columns giving the biochemical oxygen demand versus time in an evaluation of water quality.

Usage

BOD

Format

This data frame contains the following columns:

Time

A numeric vector giving the time of the measurement (days).

demand

A numeric vector giving the biochemical oxygen demand (mg/l).

Source

Bates, D.M. and Watts, D.G. (1988), Nonlinear Regression Analysis and Its Applications, Wiley, Appendix A1.4.

Originally from Marske (1967), Biochemical Oxygen Demand Data Interpretation Using Sum of Squares Surface M.Sc. Thesis, University of Wisconsin – Madison.

We can view datasets in other packages either by loading the package first and then calling data(), or by using the package argument, e.g., data(package = "MASS").

data(package="MASS")
data(OME) #package is not loaded
## Warning in data(OME): data set 'OME' not found
data(OME, package = "MASS")
head(OME)
##   ID Age OME Loud      Noise Correct Trials
## 1  1  30 low   35   coherent       1      4
## 2  1  30 low   35 incoherent       4      5
## 3  1  30 low   40   coherent       0      3
## 4  1  30 low   40 incoherent       1      1
## 5  1  30 low   45   coherent       2      4
## 6  1  30 low   45 incoherent       2      2
?OME #Help file is in the package MASS which is not loaded
## No documentation for 'OME' in specified packages and libraries:
## you could try '??OME'
library("MASS")

Now ?OME gives

Tests of Auditory Perception in Children with OME

Description

Experiments were performed on children on their ability to differentiate a signal in broad-band noise. The noise was played from a pair of speakers and a signal was added to just one channel; the subject had to turn his/her head to the channel with the added signal. The signal was either coherent (the amplitude of the noise was increased for a period) or incoherent (independent noise was added for the same period to form the same increase in power).

The threshold used in the original analysis was the stimulus loudness needed to get 75% correct responses. Some of the children had suffered from otitis media with effusion (OME).

Usage

OME

Format

The OME data frame has 1129 rows and 7 columns:

ID

Subject ID (1 to 99, with some IDs missing). A few subjects were measured at different ages.

OME

“low” or “high” or “N/A” (at ages other than 30 and 60 months).

Age

Age of the subject (months).

Loud

Loudness of stimulus, in decibels.

Noise

Whether the signal in the stimulus was “coherent” or “incoherent”.

Correct

Number of correct responses from Trials trials.

Trials

Number of trials performed.

Background

The experiment was to study otitis media with effusion (OME), a very common childhood condition where the middle ear space, which is normally air-filled, becomes congested by a fluid. There is a concomitant fluctuating, conductive hearing loss which can result in various language, cognitive and social deficits. The term ‘binaural hearing’ is used to describe the listening conditions in which the brain is processing information from both ears at the same time. The brain computes differences in the intensity and/or timing of signals arriving at each ear which contributes to sound localisation and also to our ability to hear in background noise.

Some years ago, it was found that children of 7–8 years with a history of significant OME had significantly worse binaural hearing than children without such a history, despite having equivalent sensitivity. The question remained as to whether it was the timing, the duration, or the degree of severity of the otitis media episodes during critical periods, which affected later binaural hearing. In an attempt to begin to answer this question, 95 children were monitored for the presence of effusion every month since birth. On the basis of OME experience in their first two years, the test population was split into one group of high OME prevalence and one of low prevalence.

Source

Sarah Hogan, Dept of Physiology, University of Oxford, via Dept of Statistics Consulting Service