Chapter 7 Data Classes
In this section, we will discuss the different (one dimensional/vector) data types/classes in R
- numeric
- character
- integer
- factor
- logical
- Date/POSIXct
as well as the other more complex R classes
- lists
- data.frame/tibble
- matrix
7.1 One Dimensional Data Classes
One dimensional classes/types (vectors) include
- numeric: any real number(s)
- character: strings or individual characters, quoted
- integer: any integer(s)/whole numbers
- factor: categorical/qualitative variables
- logical: variables composed of TRUE or FALSE
- Date/POSIXct: represents calendar dates and times
7.1.1 Character and Numeric
We have already seen character and numeric types.
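As a quick reminder, class() reports the type of a vector. A minimal sketch; the vector names and values here are assumptions chosen for illustration.

```r
# Example vectors (names and values are illustrative assumptions)
fruit <- c("apple", "pear")
x <- c(1, 4, 7)

class(fruit)  # type of a quoted vector
class(x)      # type of a vector of real numbers
```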
[1] "character"
[1] "numeric"
7.1.2 Integer
Integer is a special subset of numeric that contains only whole numbers. A sequence of numbers is an example of the integer type.
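One way to build such a sequence is with seq(). This is a sketch; the name x is an assumption.

```r
x <- seq(from = 1, to = 5)  # consecutive whole numbers from 1 to 5
x
class(x)
```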
[1] 1 2 3 4 5
[1] "integer"
The colon : is a shortcut for making sequences of numbers. [num1]:[num2] makes a consecutive integer sequence from [num1] to [num2] by 1.
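For instance, with 1 and 5 as the two endpoints (the name s is an assumption):

```r
s <- 1:5  # shortcut for seq(from = 1, to = 5)
s
```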
[1] 1 2 3 4 5
7.1.3 Logical
The logical type has only two possible values: TRUE and FALSE.
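A small sketch of logical values, using a made-up vector x and two simple comparisons (comparisons also return logical values):

```r
x <- c(TRUE, FALSE, TRUE, TRUE, FALSE)  # illustrative logical vector
class(x)

2 > 3  # a comparison evaluates to a logical value
2 < 3
```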
[1] "logical"
[1] FALSE
[1] TRUE
Note that logical elements are NOT in quotes.
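To see the difference, consider a quoted version of the same values. The vector z is an assumption for illustration.

```r
z <- c("TRUE", "FALSE", "TRUE", "FALSE")  # quoted, so these are strings
class(z)
as.logical(z)  # converts the strings to logical values
```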
[1] "character"
[1] TRUE FALSE TRUE FALSE
In certain cases, it can be useful to treat logical values as numeric. In this case, TRUE = 1 and FALSE = 0. For example, if we apply sum() and mean() to a logical vector, they return the total number of TRUEs and the proportion of TRUEs, respectively.
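A minimal sketch with an illustrative logical vector z:

```r
z <- c(TRUE, FALSE, TRUE, FALSE)
sum(z)   # number of TRUE values
mean(z)  # proportion of TRUE values
```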
[1] 2
[1] 0.5
There are two useful functions associated with practically all R classes: is.CLASS(), which checks whether an object is of a given class, and as.CLASS(), which coerces an object to that class. Every class has its own version of these two functions, e.g. is.numeric() and as.numeric().
We can check to see if an R object contains a certain class of data using is.CLASS().
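For example, checking a character vector (fruit is an illustrative name and value):

```r
fruit <- c("apple", "pear")
is.numeric(fruit)    # is this a numeric vector?
is.character(fruit)  # is this a character vector?
```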
[1] FALSE
[1] TRUE
We can force a vector to change to a new type using as.CLASS() if the conversion makes sense. For example, we can convert numeric data to character data.
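A sketch of this conversion, assuming a small numeric vector x:

```r
x <- c(1, 4, 7)
as.character(x)  # each number becomes a quoted string
```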
[1] "1" "4" "7"
But we can’t necessarily convert character to numeric data.
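For instance, words have no sensible numeric value. The vector fruit is an assumption for illustration.

```r
fruit <- c("apple", "pear")
as.numeric(fruit)  # R warns and returns NA for each element
```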
Warning: NAs introduced by coercion
[1] NA NA
7.1.4 Factors
A factor is a special character vector where the elements have pre-defined groups or levels. You can think of these as qualitative or categorical variables. Consider the following categorical variable x which records the gender for five children as either boy or girl.
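One way to create such a variable is with factor(). This is a sketch; the values are chosen to match the description above.

```r
x <- factor(c("boy", "girl", "girl", "boy", "girl"))
x
class(x)
```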
[1] boy girl girl boy girl
Levels: boy girl
[1] "factor"
Note that by default the levels are chosen in alphanumeric order. This will be important for statistical analysis, since R uses the first level as the reference category.
Factors are used to represent categorical data and can also be used for ordinal data (i.e. categories have an intrinsic order) by setting ordered = TRUE.
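Note that in versions of R before 4.0.0, functions such as read.csv read in character variables as factor by default, while others such as read_csv read them in as character vectors. Since R 4.0.0, read.csv also returns character vectors unless stringsAsFactors = TRUE is set.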
The function factor() is used to encode a vector as a factor.
factor(x = character(), levels, labels = levels, exclude = NA, ordered = is.ordered(x))
Since the order of the levels can matter, how can we alter or set the order of these levels in a way other than alphanumeric order?
Suppose we have a vector of case-control statuses.
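A minimal sketch of such a vector (the name cc is an assumption):

```r
cc <- factor(c("case", "case", "case", "control", "control", "control"))
cc  # levels default to alphanumeric order: case, control
```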
[1] case case case control control control
Levels: case control
With this factor, case will be chosen as the reference group, but usually we would want control to be the reference group. We can reset the levels using the levels function, but this relabels the observations instead of reordering the levels and can cause problems. You should use the levels argument in the factor() function when defining the factor.
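To see the problem, here is a sketch of the discouraged approach on the same case-control data (cc is an illustrative name):

```r
cc <- factor(c("case", "case", "case", "control", "control", "control"))

# Assigning to levels() overwrites the labels in place:
# the first level ("case") is renamed "control" and vice versa.
levels(cc) <- c("control", "case")
cc
```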
[1] control control control case case case
Levels: control case
Note that with the levels function, we did change the order of the levels, but we also mistakenly switched observations that were case to control and vice versa. To correctly set the levels, we do this in the factor() call.
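casecontrol = c("case", "case", "case", "control", "control", "control")
factor(casecontrol, levels = c("control", "case")) # control is now the reference level
factor(casecontrol, levels = c("control", "case"), ordered = TRUE) # ordered version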
[1] case case case control control control
Levels: control case
[1] case case case control control control
Levels: control < case
Another way to change the reference category once the factor is already defined is with the relevel() function.
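A sketch using the same case-control data (cc is an illustrative name):

```r
cc <- factor(c("case", "case", "case", "control", "control", "control"))
cc
relevel(cc, ref = "control")  # control becomes the first level; the data are unchanged
```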
[1] case case case control control control
Levels: case control
[1] case case case control control control
Levels: control case
One of the tidyverse packages, forcats, offers useful functionality for interacting with factors. For example, there is a function for releveling factors, fct_relevel.
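A minimal sketch, assuming the forcats package is installed and reusing the illustrative factor cc:

```r
library(forcats)

cc <- factor(c("case", "case", "case", "control", "control", "control"))
fct_relevel(cc, "control")  # move "control" to the front of the levels
```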
[1] case case case control control control
Levels: control case
There are other useful functions for dictating the levels of factors, such as ordering them as they appear in the vector, ordering them by frequency, or collapsing them into groups.
- fct_inorder(): creates a factor with levels in the order in which they first appear.
- fct_infreq(): creates a factor with levels ordered by the number of observations in each level (largest first).
- fct_lump(): creates a factor with levels grouped together if they have too few observations.
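These functions can be tried on the feed column of the built-in chickwts dataset. This is a sketch, assuming forcats is installed.

```r
library(forcats)

levels(chickwts$feed)               # default: alphanumeric order
levels(fct_inorder(chickwts$feed))  # order of first appearance in the data
table(chickwts$feed)                # number of observations per level
levels(fct_infreq(chickwts$feed))   # most frequent level first
```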
[1] "casein" "horsebean" "linseed" "meatmeal" "soybean" "sunflower"
[1] "horsebean" "linseed" "soybean" "sunflower" "meatmeal" "casein"
casein horsebean linseed meatmeal soybean sunflower
12 10 12 11 14 12
[1] "soybean" "casein" "linseed" "sunflower" "meatmeal" "horsebean"
levels(fct_lump(chickwts$feed, n = 1)) # lumps all but the most frequently occurring category together
[1] "soybean" "Other"
Factors can be converted to numeric or character very easily.
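A sketch with the case-control factor, defined so that control is the first level (cc is an illustrative name):

```r
cc <- factor(c("case", "case", "case", "control", "control", "control"),
             levels = c("control", "case"))
as.character(cc)  # the level labels as strings
as.numeric(cc)    # the position of each value's level
```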
[1] "case" "case" "case" "control" "control" "control"
[1] 2 2 2 1 1 1
Note that R codes the reference category as 1, the next category as 2, and so forth when converting to numeric.
A useful function for generating new variables is rep() (repeat). The rep() function repeats the elements of a vector a given number of times.
To create a character vector with “boy” as the first 50 entries and “girl” as the next 50 entries, we can use rep with each = 50.
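A minimal sketch (the name bg is an assumption):

```r
bg <- rep(c("boy", "girl"), each = 50)  # 50 "boy" followed by 50 "girl"
head(bg)
length(bg)
```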
[1] "boy" "boy" "boy" "boy" "boy" "boy"
[1] 100
If we want to alternate “boy” “girl” 50 times each, we can use times = 50 to repeat the whole vector 50 times.
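A minimal sketch (the name bg2 is an assumption):

```r
bg2 <- rep(c("boy", "girl"), times = 50)  # "boy", "girl", "boy", "girl", ...
head(bg2)
length(bg2)
```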
[1] "boy" "girl" "boy" "girl" "boy" "girl"
[1] 100
7.1.5 Dates
You can convert date-like strings into the Date class (see this tutorial on Dates for more information) using the lubridate package.
Let’s work with the dates in the Charm City Circulator ridership dataset.
Rows: 1146 Columns: 15
── Column specification ──────────────────────────────────────────────
Delimiter: ","
chr (2): day, date
dbl (13): orangeBoardings, orangeAlightings, orangeAverage, purpleBoardings,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
[1] "01/01/2011" "01/01/2012" "01/01/2013" "01/02/2011" "01/02/2012"
[6] "01/02/2013"
Note that as character strings, the dates are not being sorted properly, so let’s change them to a date.
[1] "2010-01-11" "2010-01-12" "2010-01-13" "2010-01-14" "2010-01-15"
[6] "2010-01-16"
[1] "2010-01-11" "2013-03-01"
[1] "Date"
Now that the dates are transformed to the Date class, we can treat them as properly ordered numbers. For example, we used range to see that the dates in the data set range from January 11, 2010 to March 1, 2013.
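The same steps can be tried on a small made-up vector. The three dates below are assumptions for illustration, not rows of the circulator data, and the sketch assumes lubridate is installed.

```r
library(lubridate)

dates_chr <- c("03/01/2013", "01/11/2010", "12/25/2011")
sort(dates_chr)          # sorted as text: month comes first, so years are out of order

dates <- mdy(dates_chr)  # parse month/day/year strings into Date objects
sort(dates)              # sorted chronologically
range(dates)
class(dates)
```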
We did this by using a date transformation function from lubridate that matches the date pattern in the strings. In this case, we used mdy because the dates were in month day year format. Other formats include
- ymd: year month day
- ydm: year day month
- ymd_hms: year month day hours minutes seconds
and so forth. For example, the following date is in the format year month day hours minutes seconds.
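A sketch of parsing such strings, assuming lubridate is installed; the vector x holds illustrative date-times:

```r
library(lubridate)

x <- c("2014-02-04 05:02:00", "2016-09-24 14:02:00")
ymd_hms(x)  # parsed as date-times; UTC is the default time zone
```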
[1] "2014-02-04 05:02:00 UTC" "2016-09-24 14:02:00 UTC"
If we use the wrong format, we will get a warning and missing values (NA) instead of parsed dates.
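As a sketch, the strings below include seconds, so a parser that expects only hours and minutes should not match them. This assumes lubridate is installed; x is an illustrative vector.

```r
library(lubridate)

x <- c("2014-02-04 05:02:00", "2016-09-24 14:02:00")
ymd_hm(x)  # the pattern has no seconds, so the strings do not match
```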
Warning: All formats failed to parse. No formats found.
[1] NA NA
The POSIXct class is like a more general date format (with hours, minutes, and seconds).
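For example, the result of ymd_hms() has this class. This is a sketch with an illustrative date-time, assuming lubridate is installed.

```r
library(lubridate)

x <- ymd_hms("2014-02-04 05:02:00")
class(x)
```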
[1] "POSIXct" "POSIXt"
The as.period command is helpful for adding time to a date.
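A minimal sketch that adds 20 minutes to the current time. The name the_time is an assumption, and the printed values depend on when and where the code is run.

```r
library(lubridate)

the_time <- Sys.time()  # current date-time
the_time
class(the_time)
the_time + as.period(20, unit = "minutes")  # 20 minutes later
```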
[1] "2024-02-07 11:47:44 EST"
[1] "POSIXct" "POSIXt"
[1] "2024-02-07 12:07:44 EST"
You can subtract times as well with the difftime function. You can set the units for the time differences. Note that difftime(time1, time2) = time1 - time2.
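A sketch with two fixed date-times; time1 and time2 are illustrative values, both given in UTC so the result does not depend on the local time zone.

```r
time1 <- as.POSIXct("2020-12-31 11:59:59", tz = "UTC")
time2 <- as.POSIXct("2024-02-07 16:47:44", tz = "UTC")

difftime(time1, time2)                   # units chosen automatically
difftime(time1, time2, units = "weeks")  # same difference, in weeks
```

The result is negative because time1 is earlier than time2.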
Time difference of -1133.2 days
Time difference of -161.8857 weeks
7.2 Data Frames and Matrices
Recall that we have already seen the data.frame class. This is an R dataset, like an Excel spreadsheet, where the number of rows corresponds to the total number of observations and each column corresponds to a variable.
R also has another two dimensional class called a matrix. A matrix is a two dimensional array, composed of rows and columns (just like the data.frame), but unlike the data frame the entire matrix is composed of one R class, e.g. all numeric, all character, all logical, etc.
Matrices are a special case of the more general array class, but we will not discuss arrays here.
We can build a matrix with the matrix function.
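A minimal sketch (the names n and mat are assumptions):

```r
n <- 1:9                    # the values to place in the matrix
n
mat <- matrix(n, nrow = 3)  # 3 rows; the number of columns is inferred
mat
```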
[1] 1 2 3 4 5 6 7 8 9
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
Note that we specified 3 rows, so when we filled it with 9 elements, R automatically set the number of columns to 3. We could have used ncol to set the number of columns manually. The matrix was also filled down columns. If we wanted to fill it across rows, we could set byrow = TRUE.
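Both options can be sketched as follows (by_row and by_col are illustrative names):

```r
by_row <- matrix(1:9, nrow = 3, byrow = TRUE)  # fill across rows
by_row

by_col <- matrix(1:9, ncol = 3)                # set columns; rows are inferred
by_col
```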
Matrices have two “slots” you can use to select data, which represent rows and columns, that are separated by a comma, so the syntax is matrix[row, column] (just like a data frame). Note you cannot use dplyr functions on matrices.
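A sketch of the three common selections, using an illustrative 3 by 3 matrix mat:

```r
mat <- matrix(1:9, nrow = 3)

mat[1, 1]  # first row, first column
mat[1, ]   # entire first row
mat[, 1]   # entire first column
```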
[1] 1
[1] 1 4 7
[1] 1 2 3
Note that the class of the returned object is no longer a matrix.
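This can be checked with class(), again using an illustrative matrix mat:

```r
mat <- matrix(1:9, nrow = 3)

class(mat[1, ])  # a single row is returned as a plain vector
class(mat[, 1])  # so is a single column
```

Passing drop = FALSE, as in mat[1, , drop = FALSE], keeps the result a matrix.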
[1] "integer"
[1] "integer"
To review, the data.frame/tbl_df are the other two dimensional variable classes. Again, data frames are like matrices, but each column is a vector that can have its own class. So some columns might be character and others might be numeric, while others may be a factor.
7.3 Lists
The most generic data class is the list, which can be created using the list() function. A list can hold vectors, strings, matrices, models, lists of other lists, or any other object you can create in R. You reference elements of a list by using $, [], or [[]].
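A sketch of a list with two named elements and one unnamed element (the name mylist is an assumption):

```r
mylist <- list(letters = c("A", "b", "c"),  # named character vector
               numbers = 1:3,               # named integer vector
               matrix(1:25, ncol = 5))      # unnamed 5 x 5 matrix
mylist
```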
$letters
[1] "A" "b" "c"
$numbers
[1] 1 2 3
[[3]]
[,1] [,2] [,3] [,4] [,5]
[1,] 1 6 11 16 21
[2,] 2 7 12 17 22
[3,] 3 8 13 18 23
[4,] 4 9 14 19 24
[5,] 5 10 15 20 25
Depending on how you reference elements of a list you may get a list returned or the actual object class that you selected.
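A sketch of the different ways to reference the first element, with an illustrative list mylist:

```r
mylist <- list(letters = c("A", "b", "c"),
               numbers = 1:3,
               matrix(1:25, ncol = 5))

mylist[1]            # single brackets return a list
mylist["letters"]    # also a list
mylist[[1]]          # double brackets return the element itself
mylist$letters       # same element, selected by name
mylist[["letters"]]  # same element again
```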
$letters
[1] "A" "b" "c"
$letters
[1] "A" "b" "c"
[1] "A" "b" "c"
[1] "A" "b" "c"
[1] "A" "b" "c"
You can also select multiple elements of the list with single brackets to be returned in a list.
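For example, with an illustrative list mylist:

```r
mylist <- list(letters = c("A", "b", "c"),
               numbers = 1:3,
               matrix(1:25, ncol = 5))

mylist[1:2]  # a list holding the first two elements
```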
$letters
[1] "A" "b" "c"
$numbers
[1] 1 2 3
You can also select down several levels of a list at once.
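A sketch that chains selections, again with an illustrative list mylist:

```r
mylist <- list(letters = c("A", "b", "c"),
               numbers = 1:3,
               matrix(1:25, ncol = 5))

mylist$letters[1]      # first entry of the letters element
mylist[[2]][1]         # first entry of the second element
mylist[[3]][1:2, 1:2]  # top-left 2 x 2 block of the matrix
```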
[1] "A"
[1] 1
[,1] [,2]
[1,] 1 6
[2,] 2 7
7.4 Exercises
For these exercises, we will use the bike lanes dataset, Bike_Lanes.csv. The data frame containing this dataset will be referred to as bike below.
Part of the lab will make use of %in% which checks to see if something is contained in a vector.
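A sketch comparing %in% with the equivalent chain of comparisons (the vector x is an assumption for illustration):

```r
x <- c(0, 2, 2, 3, 4)

(x == 0 | x == 2)  # is each element equal to 0 or 2?
x %in% c(0, 2)     # the same check, written with %in%
```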
[1] TRUE TRUE TRUE FALSE FALSE
[1] TRUE TRUE TRUE FALSE FALSE
- Get all the different types of bike lanes from the type column. Use sort(unique()). Assign this to an object btypes. Type dput(btypes).
- By rearranging vector btypes and using dput, recode type as a factor that has SIDEPATH as the first level. Print head(bike$type). Note what you see. Run table(bike$type) afterwards and note the order.
- Make a column called type2, which is a factor of the type column, with the levels: c("SIDEPATH", "BIKE BOULEVARD", "BIKE LANE"). Run table(bike$type2), with the option useNA = "always". Note, we do not have to make type a character again before doing this.
- Reassign dateInstalled into a character using as.character. Run head(bike$dateInstalled).
- Reassign dateInstalled as a factor, using the default levels. Run head(bike$dateInstalled).
- Do not reassign dateInstalled, but simply run head(as.numeric(bike$dateInstalled)). We are looking to see what happens when we try to go from factor to numeric.
- Do not reassign dateInstalled, but simply run head(as.numeric(as.character(bike$dateInstalled))). This is how you get a “numeric” value back if they were incorrectly converted to factors.
- Convert type back to a character vector. Make a column type2 (replacing the old one), where if the type is one of these categories c("CONTRAFLOW", "SHARED BUS BIKE", "SHARROW", "SIGNED ROUTE") call it "OTHER". Use %in% and ifelse. Make type2 a factor with the levels c("SIDEPATH", "BIKE BOULEVARD", "BIKE LANE", "OTHER").
- Parse the following dates using the correct lubridate functions:
  - “2014/02-14”
  - “04/22/14 03:20” assume mdy
  - “4/5/2016 03:2:22” assume mdy