Chapter 7 Data Classes

In this section, we will discuss the different (one dimensional/vector) data types/classes in R

  • numeric
  • character
  • integer
  • factor
  • logical
  • Date/POSIXct

as well as the other more complex R classes

  • lists
  • data.frame/tibble
  • matrix

7.1 One Dimensional Data Classes

One dimensional classes/types (vectors) include

  • numeric: any real number(s)
  • character: strings or individual characters, quoted
  • integer: any integer(s)/whole numbers
  • factor: categorical/qualitative variables
  • logical: variables composed of TRUE or FALSE
  • Date/POSIXct: represents calendar dates and times

7.1.1 Character and Numeric

We have already seen character and numeric types.

class(c("Robert", "Parker"))
[1] "character"
class(c(1, 4, 7))
[1] "numeric"

7.1.2 Integer

Integer is a special subset of numeric that contains only whole numbers. A sequence of numbers is an example of the integer type.

x = seq(from = 1, to = 5) #seq() is a function
x
[1] 1 2 3 4 5
class(x)
[1] "integer"

The colon : is a shortcut for making sequences of numbers. [num1]:[num2] makes a consecutive integer sequence from [num1] to [num2] by 1.

1:5
[1] 1 2 3 4 5
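As a small sketch of the distinction between integer and numeric (the values are arbitrary illustrations), a whole number typed directly is stored as numeric unless it carries the L suffix or comes from a sequence:

```r
a = 5        # a bare number is stored as numeric (double)
b = 5L       # the L suffix requests an integer
s = 1:5      # the colon operator produces integers
v = c(1, 2)  # c() of bare numbers is numeric

class(a)       # "numeric"
class(b)       # "integer"
is.integer(s)  # TRUE
is.integer(v)  # FALSE
```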

7.1.3 Logical

The logical type has only two possible values: TRUE and FALSE.

x = c(TRUE, FALSE, TRUE, TRUE, FALSE)
class(x)
[1] "logical"
is.numeric(c("Robert", "Parker"))
[1] FALSE
is.character(c("Robert", "Parker"))
[1] TRUE

Note that logical elements are NOT in quotes.

z = c("TRUE", "FALSE", "TRUE", "FALSE")
class(z)
[1] "character"
as.logical(z)
[1]  TRUE FALSE  TRUE FALSE

In certain cases, it can be useful to treat logical values as numeric. In this case, TRUE = 1 and FALSE = 0. For example, if we apply sum() and mean() to a logical vector, they would return the total number of TRUEs and the proportion of TRUEs, respectively.

sum(as.logical(z)) # 2 TRUEs and 2 FALSEs
[1] 2
mean(as.logical(z))
[1] 0.5
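The same idea is often used to count how many elements satisfy a condition. A minimal sketch, assuming a made-up vector of ages:

```r
ages = c(12, 25, 31, 17, 40)  # hypothetical values for illustration
adult = ages >= 18            # logical vector: FALSE TRUE TRUE FALSE TRUE

sum(adult)   # number of TRUEs: 3
mean(adult)  # proportion of TRUEs: 0.6
```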

There are two useful functions associated with practically all R classes: is.CLASS(), which logically checks the underlying class, and as.CLASS(), which coerces to a class. Every class has its own version of these two functions, e.g. is.numeric() and as.numeric().

We can check to see if an R object contains a certain class of data using is.CLASS().

is.numeric(c("Robert", "Parker")) # Does this vector contain numeric data?
[1] FALSE
is.character(c("Robert", "Parker")) # Does this vector contain character data?
[1] TRUE

We can force a vector to change to a new type using as.CLASS() if the conversion makes sense. For example, we can convert numeric data to character data.

as.character(c(1, 4, 7))
[1] "1" "4" "7"

But we can’t necessarily convert character to numeric data.

as.numeric(c("Robert", "Parker"))
Warning: NAs introduced by coercion
[1] NA NA
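When only some elements fail to convert, the NA values show which ones. The sketch below uses a made-up vector and wraps the call in suppressWarnings(), which is one possible way to silence the coercion warning once you know to expect it:

```r
mixed = c("1", "2.5", "abc")                     # hypothetical input
converted = suppressWarnings(as.numeric(mixed))  # "abc" cannot be converted

converted                # 1.0 2.5 NA
is.na(converted)         # FALSE FALSE TRUE
mixed[is.na(converted)]  # the element that failed: "abc"
```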

7.1.4 Factors

A factor is a special character vector where the elements have pre-defined groups or levels. You can think of these as qualitative or categorical variables. Consider the following categorical variable x which records the gender for five children as either boy or girl.

x = factor(c("boy", "girl", "girl", "boy", "girl"))
x
[1] boy  girl girl boy  girl
Levels: boy girl
class(x)
[1] "factor"

Note that by default the levels are chosen in alphanumeric order. This will be important for statistical analysis, since the first level will be chosen as the reference category in R.

Factors are used to represent categorical data and can also be used for ordinal data (i.e. categories have an intrinsic order) by setting ordered = TRUE.
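As an illustration of what ordered = TRUE adds, the sketch below builds a hypothetical ordinal variable called dose. With an ordered factor, comparisons such as < follow the order of the levels rather than alphabetical order:

```r
dose = factor(c("low", "high", "medium", "low"),
              levels = c("low", "medium", "high"),
              ordered = TRUE)   # hypothetical ordinal variable

below_high = dose < "high"      # compares by level order
below_high                      # TRUE FALSE TRUE TRUE
as.character(max(dose))         # "high"
```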

Note that some R functions, such as read.csv in R versions before 4.0.0, read in character variables as factors by default, but others, such as read_csv, read them in as character vectors.

The function factor() is used to encode a vector as a factor.

factor(x = character(), levels, labels = levels,       
       exclude = NA, ordered = is.ordered(x))

Since the order of the levels can matter, how can we alter or set the order of these levels in a way other than alphanumeric order?

Suppose we have a vector of case-control statuses.

cc = factor(c("case", "case", "case", "control", "control", "control"))
cc
[1] case    case    case    control control control
Levels: case control

With this factor, case will be chosen as the reference group, but usually we would want control to be the reference group. We can reset the levels using the levels() function, but this only renames the levels and can silently relabel the data. Instead, use the levels argument of the factor() function when defining the factor.

levels(cc) = c("control", "case")
cc
[1] control control control case    case    case   
Levels: control case

Note that with the levels function, we did change the order of the levels, but we also mistakenly switched observations that were case to control and vice versa. To correctly set the levels, we do this in the factor() call.

casecontrol = c("case", "case", "case", "control", "control", "control")
factor(casecontrol, levels = c("control", "case"))
[1] case    case    case    control control control
Levels: control case
factor(casecontrol, levels = c("control", "case"), ordered = TRUE) #Example of an ordered factor
[1] case    case    case    control control control
Levels: control < case
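One way to confirm that reordering the levels has not altered the data is to compare the values before and after. A minimal sketch, using a short made-up status vector:

```r
status = c("case", "case", "control")  # hypothetical raw data
f = factor(status)                     # levels: case, control

g = f
levels(g) = c("control", "case")       # renames the levels, relabelling the data
identical(as.character(g), status)     # FALSE: the observations changed

h = factor(status, levels = c("control", "case"))
identical(as.character(h), status)     # TRUE: only the level order changed
```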

Another way to change the reference category once the factor is already defined is with the relevel() function.

cc = factor(c("case", "case", "case", "control", "control", "control"))
cc
[1] case    case    case    control control control
Levels: case control
cc2 = relevel(cc, "control")
cc2
[1] case    case    case    control control control
Levels: control case

One of the tidyverse packages, forcats, offers useful functionality for interacting with factors. For example, there is a function for releveling factors, fct_relevel.

library(forcats)
fct_relevel(cc, "control")
[1] case    case    case    control control control
Levels: control case

There are other useful functions for setting the levels of factors, such as ordering them as they appear in the vector, ordering them by frequency, or collapsing them into groups.

  • fct_inorder(): creates a factor with levels in the order in which they first appear.
  • fct_infreq(): creates a factor with levels ordered by the number of observations in each level (largest first).
  • fct_lump(): creates a factor with levels grouped together if they have too few observations.

levels(factor(chickwts$feed)) # alphanumeric level order
[1] "casein"    "horsebean" "linseed"   "meatmeal"  "soybean"   "sunflower"
levels(fct_inorder(chickwts$feed)) # levels in order that they appear first
[1] "horsebean" "linseed"   "soybean"   "sunflower" "meatmeal"  "casein"   
table(chickwts$feed)

   casein horsebean   linseed  meatmeal   soybean sunflower 
       12        10        12        11        14        12 
levels(fct_infreq(chickwts$feed)) # levels ordered by frequency
[1] "soybean"   "casein"    "linseed"   "sunflower" "meatmeal"  "horsebean"
levels(fct_lump(chickwts$feed, n = 1)) # lumps all but the most frequently occurring category together
[1] "soybean" "Other"  

Factors can be converted to numeric or character very easily.

x = factor(casecontrol, levels = c("control", "case"))
as.character(x)
[1] "case"    "case"    "case"    "control" "control" "control"
as.numeric(x)
[1] 2 2 2 1 1 1

Note that R codes the reference category as 1, the next category as 2, and so forth when converting to numeric.
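This matters when a factor's labels are themselves numbers, because as.numeric() returns the level codes rather than the labels. A short sketch with made-up values:

```r
f = factor(c("10", "20", "10", "30"))  # hypothetical numbers stored as a factor

codes = as.numeric(f)                  # level codes: 1 2 1 3
values = as.numeric(as.character(f))   # original values: 10 20 10 30
```

Converting to character first recovers the values that the labels represent.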

A useful function for generating new variables is rep() (repeat). The rep() function repeats the elements of a vector a given number of times.

To create a character vector with “boy” as the first 50 entries and “girl” as the next 50 entries, we can use rep with each = 50.

bg = rep(c("boy", "girl"), each = 50) 
head(bg)
[1] "boy" "boy" "boy" "boy" "boy" "boy"
length(bg)
[1] 100

If we want to alternate “boy” “girl” 50 times each, we can use times = 50 to repeat the whole vector 50 times.

bg = rep(c("boy", "girl"), times = 50) 
head(bg)
[1] "boy"  "girl" "boy"  "girl" "boy"  "girl"
length(bg)
[1] 100
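The two arguments can also be varied or combined. As a small sketch, times accepts one count per element, and each is applied before times when both are given:

```r
a = rep(c("boy", "girl"), times = c(3, 2))      # a separate count for each element
a  # "boy" "boy" "boy" "girl" "girl"

b = rep(c("boy", "girl"), each = 2, times = 2)  # each first, then the whole result twice
b  # "boy" "boy" "girl" "girl" "boy" "boy" "girl" "girl"
```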

7.1.5 Dates

You can convert date-like strings into the Date class (see this tutorial on Dates for more information) using the lubridate package.

Let’s work with the dates in the Charm City Circulator ridership dataset.

library(readr)
circ = read_csv("./data/Charm_City_Circulator_Ridership.csv")
Rows: 1146 Columns: 15
── Column specification ──────────────────────────────────────────────
Delimiter: ","
chr  (2): day, date
dbl (13): orangeBoardings, orangeAlightings, orangeAverage, purpleBoardings,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(sort(circ$date))
[1] "01/01/2011" "01/01/2012" "01/01/2013" "01/02/2011" "01/02/2012"
[6] "01/02/2013"

Note that as character strings, the dates are not being sorted properly, so let’s change them to a date.

library(lubridate)
library(dplyr)
circ = mutate(circ, newDate = mdy(date))
head(circ$newDate)
[1] "2010-01-11" "2010-01-12" "2010-01-13" "2010-01-14" "2010-01-15"
[6] "2010-01-16"
range(circ$newDate)
[1] "2010-01-11" "2013-03-01"
class(circ$newDate)
[1] "Date"
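The effect on sorting can be seen in a small self-contained sketch. It uses made-up date strings and base R's as.Date() with an explicit format string in place of mdy(), purely so that it runs without extra packages:

```r
d = c("03/01/2013", "01/11/2010", "12/25/2011")  # hypothetical month/day/year strings

sorted_strings = sort(d)               # sorted as text, month first
sorted_strings  # "01/11/2010" "03/01/2013" "12/25/2011"

dd = as.Date(d, format = "%m/%d/%Y")   # base R alternative to mdy()
sorted_dates = sort(dd)                # sorted chronologically
sorted_dates    # "2010-01-11" "2011-12-25" "2013-03-01"
```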

Now that the dates are transformed to the Date class, we can treat them as properly ordered numbers. For example, we used range to see that the dates in the data set range from January 11, 2010 to March 1, 2013.

We did this by using a date transformation function from lubridate that matches the date pattern in the strings. In this case, we used mdy because the dates were in month day year format. Other formats include

  • ymd: year month day
  • ydm: year day month
  • ymd_hms: year month day hours minutes seconds

and so forth. For example, the following dates are in the format year month day hours minutes seconds.

x = c("2014-02-4 05:02:00","2016/09/24 14:02:00")
ymd_hms(x)
[1] "2014-02-04 05:02:00 UTC" "2016-09-24 14:02:00 UTC"

If we use the wrong format, then we will get a warning and the dates will be returned as NA.

ymd_hm(x)
Warning: All formats failed to parse. No formats found.
[1] NA NA
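Because failed parses come back as NA, they can be located with is.na(). The sketch below again substitutes base R's as.Date() for the lubridate functions so that it is self-contained; note that as.Date() returns NA without a warning. The strings are made up:

```r
d = c("01/11/2010", "2010-01-12")         # the second string is not month/day/year
parsed = as.Date(d, format = "%m/%d/%Y")  # base R parsing with an explicit format

parsed               # "2010-01-11" NA
d[is.na(parsed)]     # the string that failed to parse: "2010-01-12"
```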

The POSIXct class is like a more general date format (with hours, minutes, and seconds).

x = c("2014-02-4 05:02:00","2016/09/24 14:02:00")
class(ymd_hms(x))
[1] "POSIXct" "POSIXt" 

The as.period command is helpful for adding time to a date.

theTime = Sys.time() # get the current time
theTime
[1] "2024-02-07 11:47:44 EST"
class(theTime)
[1] "POSIXct" "POSIXt" 
theTime + as.period(20, unit = "minutes") #20 minutes past the current time
[1] "2024-02-07 12:07:44 EST"

You can subtract times as well with the difftime function. You can set the units for the time differences. Note that difftime(time1, time2) = time1 - time2.

the_future = ymd_hms("2020-12-31 11:59:59")
the_future - theTime
Time difference of -1133.2 days
difftime(the_future, theTime, units = "weeks")
Time difference of -161.8857 weeks
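The result of difftime can be turned into a plain number with as.numeric(), optionally in different units. A minimal sketch using two fixed dates (the endpoints of the ridership date range shown earlier), so that the result does not depend on the current time:

```r
start = as.Date("2010-01-11")
end = as.Date("2013-03-01")

span = difftime(end, start, units = "days")
span                                      # Time difference of 1145 days
n_days = as.numeric(span)                 # 1145
n_weeks = as.numeric(span, units = "weeks")  # 1145 / 7
```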

7.2 Data Frames and Matrices

Recall that we have already seen the data.frame class. This is an R dataset, like an Excel spreadsheet, where the number of rows corresponds to the total number of observations and each column corresponds to a variable.

R also has another 2 dimensional class called a matrix. A matrix is a two dimensional array, composed of rows and columns (just like the data.frame), but unlike the data frame the entire matrix is composed of one R class, e.g. all numeric, all characters, all logical, etc.

Matrices are a special case of the more general array class, but we will not discuss arrays here.

We can build a matrix with the matrix() function.

n = 1:9
n
[1] 1 2 3 4 5 6 7 8 9
mat = matrix(n, nrow = 3)
mat
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

Note that we specified 3 rows, so when we filled it with 9 elements, R automatically set the number of columns to 3. We could have used ncol to set the number of columns manually. The matrix was also filled down columns. If we wanted to fill it across rows, we could set byrow = TRUE.
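As a brief sketch of these two options, the same values can be filled across rows with byrow = TRUE, and ncol can be given instead of nrow:

```r
by_row = matrix(1:9, nrow = 3, byrow = TRUE)  # fill across rows
by_row
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
# [3,]    7    8    9

two_col = matrix(1:6, ncol = 2)  # R infers 3 rows from 6 elements
dim(two_col)                     # 3 2
```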

Matrices have two “slots” you can use to select data, which represent rows and columns, that are separated by a comma, so the syntax is matrix[row, column] (just like a data frame). Note you cannot use dplyr functions on matrices.

mat[1, 1] # select the element in row 1 and column 1
[1] 1
mat[1, ] # select the entire first row
[1] 1 4 7
mat[, 1] # select the entire first column
[1] 1 2 3

Note that the class of the returned object is no longer a matrix.

class(mat[1, ])
[1] "integer"
class(mat[, 1])
[1] "integer"
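If the result should remain a matrix, base R's drop = FALSE argument to [ keeps the dimensions. A minimal sketch:

```r
mat = matrix(1:9, nrow = 3)

row1 = mat[1, , drop = FALSE]  # keep the result as a 1 x 3 matrix
is.matrix(row1)                # TRUE
dim(row1)                      # 1 3

is.matrix(mat[1, ])            # FALSE: dimensions dropped by default
```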

To review, the data.frame/tbl_df are the other two dimensional variable classes. Again, data frames are like matrices, but each column is a vector that can have its own class. So some columns might be character and others might be numeric, while others may be factors.
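The contrast with a matrix can be sketched with a small made-up data frame. Each column keeps its own class, whereas converting the whole thing to a matrix forces a single class:

```r
df = data.frame(name = c("Robert", "Parker"),
                age = c(31, 45),
                enrolled = c(TRUE, FALSE),
                stringsAsFactors = FALSE)  # hypothetical data; keep strings as character

col_classes = sapply(df, class)  # one class per column
col_classes
#        name         age    enrolled
# "character"   "numeric"   "logical"

m = as.matrix(df)   # a matrix holds one class, so everything becomes character
is.character(m)     # TRUE
```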

7.3 Lists

The most generic data class is the list, which can be created using the list() function. A list can hold vectors, strings, matrices, models, lists of other lists, or any other object you can create in R. You reference elements of a list by using $, [], or [[]].

mylist <- list(letters = c("A", "b", "c"),
               numbers = 1:3,
               matrix(1:25, ncol = 5))
head(mylist)
$letters
[1] "A" "b" "c"

$numbers
[1] 1 2 3

[[3]]
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25

Depending on how you reference elements of a list you may get a list returned or the actual object class that you selected.

mylist[1] # returns a list containing the vector of letters
$letters
[1] "A" "b" "c"
mylist["letters"] # returns a list containing the vector of letters
$letters
[1] "A" "b" "c"
mylist[[1]] # returns the character vector letters
[1] "A" "b" "c"
mylist$letters # returns the character vector letters
[1] "A" "b" "c"
mylist[["letters"]] # returns the character vector letters
[1] "A" "b" "c"
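The difference between single and double brackets can be checked directly with class() and length(). A small sketch on a shortened version of the list above:

```r
mylist <- list(letters = c("A", "b", "c"), numbers = 1:3)

class(mylist[1])     # "list": single brackets return a sub-list
class(mylist[[1]])   # "character": double brackets return the element itself
length(mylist[1])    # 1 (a list holding one element)
length(mylist[[1]])  # 3 (the three letters)
```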

You can also select multiple elements of the list with single brackets to be returned in a list.

mylist[1:2] # returns a list
$letters
[1] "A" "b" "c"

$numbers
[1] 1 2 3

You can also select down several levels of a list at once.

mylist$letters[1]
[1] "A"
mylist[[2]][1]
[1] 1
mylist[[3]][1:2, 1:2]
     [,1] [,2]
[1,]    1    6
[2,]    2    7

7.4 Exercises

For these exercises, we will use the bike lanes dataset, Bike_Lanes.csv. The data frame containing this dataset will be referred to as bike below.

Part of the lab will make use of %in% which checks to see if something is contained in a vector.

x = c(0, 2, 2, 3, 4)
(x == 0 | x == 2)
[1]  TRUE  TRUE  TRUE FALSE FALSE
x %in% c(0, 2) # Note this will never return NA
[1]  TRUE  TRUE  TRUE FALSE FALSE
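The difference between the two approaches shows up when the vector contains missing values. A short sketch with a made-up vector y:

```r
y = c(0, NA, 2, 3)               # hypothetical vector with a missing value

with_or = (y == 0 | y == 2)      # comparisons with NA give NA
with_or                          # TRUE NA TRUE FALSE

with_in = y %in% c(0, 2)         # %in% returns FALSE for the NA element
with_in                          # TRUE FALSE TRUE FALSE
```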
  1. Get all the different types of bike lanes from the type column. Use sort(unique()). Assign this to an object btypes. Type dput(btypes).
  2. By rearranging vector btypes and using dput, recode type as a factor that has SIDEPATH as the first level. Print head(bike$type). Note what you see. Run table(bike$type) afterwards and note the order.
  3. Make a column called type2, which is a factor of the type column, with the levels: c("SIDEPATH", "BIKE BOULEVARD", "BIKE LANE"). Run table(bike$type2) with the option useNA = "always". Note, we do not have to make type a character again before doing this.
    1. Reassign dateInstalled into a character using as.character. Run head(bike$dateInstalled).
    2. Reassign dateInstalled as a factor, using the default levels. Run head(bike$dateInstalled).
    3. Do not reassign dateInstalled, but simply run head(as.numeric(bike$dateInstalled)). We are looking to see what happens when we try to go from factor to numeric.
    4. Do not reassign dateInstalled, but simply run head(as.numeric(as.character(bike$dateInstalled))). This is how you get a “numeric” value back if they were incorrectly converted to factors.
  4. Convert type back to a character vector. Make a column type2 (replacing the old one), where if the type is one of these categories c("CONTRAFLOW", "SHARED BUS BIKE", "SHARROW", "SIGNED ROUTE") call it "OTHER". Use %in% and ifelse. Make type2 a factor with the levels c("SIDEPATH", "BIKE BOULEVARD", "BIKE LANE", "OTHER").
  5. Parse the following dates using the correct lubridate functions:
    1. “2014/02-14”
    2. “04/22/14 03:20” assume mdy
    3. “4/5/2016 03:2:22” assume mdy