Chapter 7 Data Classes
In this section, we will discuss the different one-dimensional (vector) data types/classes in R:
- numeric
- character
- integer
- factor
- logical
- Date/POSIXct
as well as the other more complex R classes
- lists
- data.frame/tibble
- matrix
7.1 One Dimensional Data Classes
One dimensional classes/types (vectors) include
- numeric: any real number(s)
- character: strings or individual characters, quoted
- integer: any integer(s)/whole numbers
- factor: categorical/qualitative variables
- logical: variables composed of TRUE or FALSE
- Date/POSIXct: represents calendar dates and times
7.1.1 Character and Numeric
We have already seen character and numeric types.
[1] "character"
[1] "numeric"
7.1.2 Integer
Integer is a special subset of numeric that contains only whole numbers. A sequence of numbers is an example of the integer type.
[1] 1 2 3 4 5
[1] "integer"
The colon : is a shortcut for making sequences of numbers. [num1]:[num2] makes a consecutive integer sequence from [num1] to [num2] in steps of 1.
[1] 1 2 3 4 5
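As a minimal sketch of the behaviour described above (the variable name x is an illustrative assumption, not necessarily the one used to produce the output shown):

```r
# The colon operator builds a consecutive integer sequence.
x <- 1:5
print(x)         # 1 2 3 4 5
print(class(x))  # "integer"

# A whole number typed on its own is stored as numeric (double)
# unless it carries the L suffix.
print(class(5))   # "numeric"
print(class(5L))  # "integer"
```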
7.1.3 Logical
The logical type is a type that only has two possible values: TRUE and FALSE
[1] "logical"
[1] FALSE
[1] TRUE
Note that logical elements are NOT in quotes.
[1] "character"
[1] TRUE FALSE TRUE FALSE
In certain cases, it can be useful to treat logical values as numeric. In this case, TRUE = 1 and FALSE = 0. For example, if we apply sum() and mean() to a logical vector, they would return the total number of TRUEs and the proportion of TRUEs, respectively.
[1] 2
[1] 0.5
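The following sketch shows how a logical vector behaves when treated as numeric. The names z, n_true, and prop_true are assumptions chosen for illustration.

```r
# A logical vector; TRUE and FALSE are written without quotes.
z <- c(TRUE, FALSE, TRUE, FALSE)
print(class(z))  # "logical"

# In arithmetic, TRUE counts as 1 and FALSE as 0.
n_true <- sum(z)      # number of TRUE values
prop_true <- mean(z)  # proportion of TRUE values
print(n_true)     # 2
print(prop_true)  # 0.5

# Quoting the values yields a character vector instead.
print(class(c("TRUE", "FALSE")))  # "character"
```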
There are two useful functions associated with practically all R classes: is.CLASS(), which checks whether an object is of a given class, and as.CLASS(), which coerces an object to that class. Each class has its own version of these two functions, e.g. is.numeric() and as.numeric().
We can check to see if an R object contains a certain class of data using is.CLASS().
[1] FALSE
[1] TRUE
We can force a vector to change to a new type using as.CLASS() if the conversion makes sense. For example, we can convert numeric data to character data.
[1] "1" "4" "7"
But we can’t necessarily convert character to numeric data.
Warning: NAs introduced by coercion
[1] NA NA
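A short sketch of checking and coercing classes, using made-up example vectors (the values and variable names are assumptions, not the ones behind the output above):

```r
# Check whether an object is of a given class.
print(is.numeric(c("a", "b")))    # FALSE
print(is.character(c("a", "b")))  # TRUE

# Numeric to character always works.
chr <- as.character(c(1, 4, 7))
print(chr)  # "1" "4" "7"

# Character to numeric works only when the strings look like numbers.
ok <- as.numeric(c("1", "4.5"))
print(ok)  # 1.0 4.5

# Otherwise the result is NA, normally with a coercion warning
# (suppressed here so the script runs quietly).
bad <- suppressWarnings(as.numeric(c("hello", "goodbye")))
print(bad)  # NA NA
```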
7.1.4 Factors
A factor is a special character vector where the elements have pre-defined groups or levels. You can think of these as qualitative or categorical variables. Consider the following categorical variable x which records the gender for five children as either boy or girl.
[1] boy girl girl boy girl
Levels: boy girl
[1] "factor"
Note that by default the levels are chosen in alphanumeric order. This will be important for statistical analysis, since the first level will be chosen as the reference category in R.
Factors are used to represent categorical data and can also be used for ordinal data (i.e. categories have an intrinsic order) by setting ordered = TRUE.
Note that some R functions, such as read.csv in versions of R before 4.0.0, read in character variables as factors by default, whereas others, such as read_csv, read them in as character vectors.
The function factor() is used to encode a vector as a factor.
factor(x = character(), levels, labels = levels,
exclude = NA, ordered = is.ordered(x))
Since the order of the levels can matter, how can we alter or set the order of these levels in a way other than alphanumeric order?
Suppose we have a vector of case-control statuses.
[1] case case case control control control
Levels: case control
With this factor, case will be chosen as the reference group, but usually we would want control to be the reference group. We can reset the levels using the levels() function, but this relabels the existing observations rather than reordering the levels, which can cause problems. Instead, use the levels argument of the factor() function when defining the factor.
[1] control control control case case case
Levels: control case
Note that with the levels function, we did change the order of the levels, but we also mistakenly switched observations that were case to control and vice versa. To correctly set the levels, we do this in the factor() call.
casecontrol = c("case", "case", "case", "control", "control", "control")
factor(casecontrol, levels = c("control", "case"))
[1] case case case control control control
Levels: control case
[1] case case case control control control
Levels: control < case
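To make the contrast concrete, here is a sketch comparing assignment through levels() with the levels argument of factor(). The object names relabelled, reordered, and ord are assumptions introduced for illustration.

```r
casecontrol <- c("case", "case", "case", "control", "control", "control")

# Pitfall: assigning to levels() relabels the existing levels in place,
# so every "case" observation becomes "control" and vice versa.
relabelled <- factor(casecontrol)
levels(relabelled) <- c("control", "case")
print(relabelled)

# Setting levels inside factor() changes only the order of the levels;
# the observations keep their original values.
reordered <- factor(casecontrol, levels = c("control", "case"))
print(reordered)

# ordered = TRUE additionally records that control < case.
ord <- factor(casecontrol, levels = c("control", "case"), ordered = TRUE)
print(ord)
```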
Another way to change the reference category once the factor is already defined is with the relevel() function.
[1] case case case control control control
Levels: case control
[1] case case case control control control
Levels: control case
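A minimal sketch of relevel(), assuming the factor is stored in an object called cc (a hypothetical name):

```r
casecontrol <- c("case", "case", "case", "control", "control", "control")

cc <- factor(casecontrol)
print(levels(cc))   # "case" "control"

# relevel() moves the chosen level to the front without touching the data.
cc2 <- relevel(cc, ref = "control")
print(levels(cc2))  # "control" "case"
```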
One of the tidyverse packages, forcats, offers useful functionality for working with factors. For example, it has a function for releveling factors, fct_relevel().
[1] case case case control control control
Levels: control case
There are other useful functions for setting the levels of a factor: in the order they appear in the vector, by frequency, or collapsed into groups.
- fct_inorder(): creates a factor with levels in the order in which they first appear.
- fct_infreq(): creates a factor with levels ordered by the number of observations with each level (largest first).
- fct_lump(): creates a factor with levels grouped together if they have too few observations.
[1] "casein" "horsebean" "linseed" "meatmeal" "soybean" "sunflower"
[1] "horsebean" "linseed" "soybean" "sunflower" "meatmeal" "casein"
casein horsebean linseed meatmeal soybean sunflower
12 10 12 11 14 12
[1] "soybean" "casein" "linseed" "sunflower" "meatmeal" "horsebean"
levels(fct_lump(chickwts$feed, n = 1)) # lumps all but the most frequently occurring category together
[1] "soybean" "Other"
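The forcats helpers can be tried on a small toy factor. This sketch assumes the forcats package is installed; the vector f is made up for illustration and is not part of the chickwts data.

```r
library(forcats)

# Toy data: "b" appears 3 times, "c" twice, "a" once.
f <- factor(c("c", "b", "a", "b", "c", "b"))
print(levels(f))  # default alphanumeric order: "a" "b" "c"

lev_relevel <- levels(fct_relevel(f, "c"))   # move "c" to the front
lev_inorder <- levels(fct_inorder(f))        # order of first appearance
lev_infreq  <- levels(fct_infreq(f))         # most frequent first
lev_lump    <- levels(fct_lump(f, n = 1))    # keep the top level, lump the rest

print(lev_relevel)  # "c" "a" "b"
print(lev_inorder)  # "c" "b" "a"
print(lev_infreq)   # "b" "c" "a"
print(lev_lump)     # "b" "Other"
```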
Factors can be converted to numeric or character very easily.
[1] "case" "case" "case" "control" "control" "control"
[1] 2 2 2 1 1 1
Note that R codes the reference category as 1, the next category as 2, and so forth when converting to numeric.
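The conversion can be sketched as follows. The second half uses a hypothetical factor f with number-like labels to show why the integer codes should not be confused with the labels.

```r
casecontrol <- c("case", "case", "case", "control", "control", "control")
cc <- factor(casecontrol, levels = c("control", "case"))

labels_chr <- as.character(cc)  # the original labels
codes_num  <- as.numeric(cc)    # position of each label in levels(cc)
print(labels_chr)
print(codes_num)  # 2 2 2 1 1 1

# Caution: for a factor whose labels look like numbers, as.numeric()
# still returns the level codes, not the labels.
f <- factor(c("10", "20", "10"))
codes  <- as.numeric(f)                # 1 2 1
values <- as.numeric(as.character(f))  # 10 20 10
print(codes)
print(values)
```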
A useful function for generating new variables is rep() (repeat). The repeat function will repeat the elements of a vector a given number of times.
To create a character vector with “boy” as the first 50 entries and “girl” as the next 50 entries, we can use rep with each = 50.
[1] "boy" "boy" "boy" "boy" "boy" "boy"
[1] 100
If we want to alternate “boy” “girl” 50 times each, we can use times = 50 to repeat the whole vector 50 times.
[1] "boy" "girl" "boy" "girl" "boy" "girl"
[1] 100
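A small sketch contrasting each and times in rep(); the object names blocked and alternating are assumptions for illustration.

```r
# each = 50 repeats every element 50 times before moving on:
# 50 "boy" followed by 50 "girl".
blocked <- rep(c("boy", "girl"), each = 50)
print(head(blocked))
print(length(blocked))  # 100

# times = 50 repeats the whole vector 50 times:
# "boy" "girl" "boy" "girl" ...
alternating <- rep(c("boy", "girl"), times = 50)
print(head(alternating))
print(length(alternating))  # 100
```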
7.1.5 Dates
You can convert date-like strings into the Date class (see this tutorial on Dates for more information) using the lubridate package.
Let’s work with the dates in the Charm City Circulator ridership dataset.
Rows: 1146 Columns: 15
── Column specification ──────────────────────────────────────────────
Delimiter: ","
chr (2): day, date
dbl (13): orangeBoardings, orangeAlightings, orangeAverage, purpleBoardings,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
[1] "01/01/2011" "01/01/2012" "01/01/2013" "01/02/2011" "01/02/2012"
[6] "01/02/2013"
Note that as character strings, the dates are not being sorted properly, so let’s change them to a date.
[1] "2010-01-11" "2010-01-12" "2010-01-13" "2010-01-14" "2010-01-15"
[6] "2010-01-16"
[1] "2010-01-11" "2013-03-01"
[1] "Date"
Now that the dates are transformed to the Date class, we can treat them as properly ordered numbers. For example, we used range to see that the dates in the data set range from January 11, 2010 to March 1, 2013.
We did this by using a date transformation function from lubridate that matches the date pattern in the strings. In this case, we used mdy because the dates were in month day year format. Other formats include
- ymd: year month day
- ydm: year day month
- ymd_hms: year month day hours minutes seconds
and so forth. For example, the following date is in the format year month day hours minutes seconds.
[1] "2014-02-04 05:02:00 UTC" "2016-09-24 14:02:00 UTC"
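The parsing functions can be tried on a few hand-written strings. This sketch assumes the lubridate package is installed; the example strings are made up and are not taken from the ridership data.

```r
library(lubridate)

# As character strings, dates sort alphabetically, not chronologically.
dates_chr <- c("01/02/2011", "01/01/2012")
sorted_chr <- sort(dates_chr)
print(sorted_chr)  # "01/01/2012" "01/02/2011"

# mdy() parses month/day/year strings into the Date class,
# which then sorts in calendar order.
sorted_date <- sort(mdy(dates_chr))
print(sorted_date)  # "2011-01-02" "2012-01-01"
print(class(sorted_date))  # "Date"

# ymd_hms() parses date-times and returns POSIXct (UTC by default).
dt <- ymd_hms("2014/02/04 05:02:00")
print(dt)
print(class(dt))  # "POSIXct" "POSIXt"
```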
If we use the wrong format, the dates fail to parse: we get a warning and NA values rather than an error.
Warning: All formats failed to parse. No formats found.
[1] NA NA
The POSIXct class is like a more general date format (with hours, minutes, and seconds).
[1] "POSIXct" "POSIXt"
The as.period command is helpful for adding time to a date.
[1] "2024-02-07 11:47:44 EST"
[1] "POSIXct" "POSIXt"
[1] "2024-02-07 12:07:44 EST"
You can also subtract times with the difftime() function, which lets you set the units of the time difference. Note that difftime(time1, time2) = time1 - time2.
Time difference of -1133.2 days
Time difference of -161.8857 weeks
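A base R sketch of difftime() with two made-up dates (the dates and object names are assumptions, not those behind the output above):

```r
earlier <- as.Date("2024-01-01")
later   <- as.Date("2024-01-15")

# difftime(time1, time2) is time1 - time2, so an earlier time1
# gives a negative difference.
d_days  <- difftime(earlier, later, units = "days")
d_weeks <- difftime(earlier, later, units = "weeks")
print(d_days)   # Time difference of -14 days
print(d_weeks)  # Time difference of -2 weeks

# as.numeric() drops the units and returns the plain number.
print(as.numeric(d_days))
```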
7.2 Data Frames and Matrices
Recall that we have already seen the data.frame class. This is an R dataset, like an Excel spreadsheet, where each row corresponds to an observation and each column corresponds to a variable.
R also has another 2 dimensional class called a matrix. A matrix is a two dimensional array, composed of rows and columns (just like the data.frame), but unlike the data frame the entire matrix is composed of one R class, e.g. all numeric, all characters, all logical, etc.
Matrices are a special case of the more general array class, but we will not discuss arrays here.
We can build a matrix with the matrix() function.
[1] 1 2 3 4 5 6 7 8 9
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
Note that we specified 3 rows, so when we filled it with 9 elements, R automatically set the number of columns to 3. We could have used ncol to set the number of columns manually. The matrix was also filled down columns. If we wanted to fill it across rows, we could set byrow = TRUE.
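A minimal sketch of building the matrix both ways; the names n, mat, and mat_byrow are assumptions for illustration.

```r
n <- 1:9

# Filled down columns (the default); ncol is inferred as 3.
mat <- matrix(n, nrow = 3)
print(mat)

# Filled across rows instead.
mat_byrow <- matrix(n, nrow = 3, byrow = TRUE)
print(mat_byrow)
```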
Matrices have two “slots” you can use to select data, which represent rows and columns, that are separated by a comma, so the syntax is matrix[row, column] (just like a data frame). Note you cannot use dplyr functions on matrices.
[1] 1
[1] 1 4 7
[1] 1 2 3
Note that the class of the returned object is no longer a matrix.
[1] "integer"
[1] "integer"
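The selections above can be sketched as follows, assuming a 3 x 3 integer matrix called mat. The drop = FALSE argument is an extra not covered in the text; it is shown only as one way to keep the matrix class when selecting a single row.

```r
mat <- matrix(1:9, nrow = 3)

first_cell <- mat[1, 1]  # a single element
first_row  <- mat[1, ]   # the whole first row
first_col  <- mat[, 1]   # the whole first column
print(first_cell)  # 1
print(first_row)   # 1 4 7
print(first_col)   # 1 2 3

# A single row or column is returned as a plain vector.
print(class(first_row))  # "integer"

# drop = FALSE keeps the result as a 1 x 3 matrix.
row_as_matrix <- mat[1, , drop = FALSE]
print(dim(row_as_matrix))  # 1 3
```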
To review, the data.frame/tbl_df is the other two-dimensional class. Again, data frames are like matrices, but each column is a vector that can have its own class, so some columns might be character, others numeric, and others factors.
7.3 Lists
The most generic data class is the list, which can be created using the list() function. A list can hold vectors, strings, matrices, models, lists of other lists, or any other object you can create in R. You reference elements of a list by using $, [], or [[]].
$letters
[1] "A" "b" "c"
$numbers
[1] 1 2 3
[[3]]
[,1] [,2] [,3] [,4] [,5]
[1,] 1 6 11 16 21
[2,] 2 7 12 17 22
[3,] 3 8 13 18 23
[4,] 4 9 14 19 24
[5,] 5 10 15 20 25
Depending on how you reference elements of a list you may get a list returned or the actual object class that you selected.
$letters
[1] "A" "b" "c"
$letters
[1] "A" "b" "c"
[1] "A" "b" "c"
[1] "A" "b" "c"
[1] "A" "b" "c"
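A sketch of how the different reference styles behave, assuming a list called mylist built to resemble the one printed above (the name and construction are assumptions):

```r
mylist <- list(
  letters = c("A", "b", "c"),
  numbers = 1:3,
  matrix(1:25, ncol = 5)  # unnamed third element
)

# Single brackets return a list containing the selected element(s).
sub_by_pos  <- mylist[1]
sub_by_name <- mylist["letters"]
print(class(sub_by_pos))  # "list"

# Double brackets and $ return the element itself.
el_by_pos    <- mylist[[1]]
el_by_dollar <- mylist$letters
el_by_name   <- mylist[["letters"]]
print(class(el_by_pos))  # "character"
print(el_by_dollar)      # "A" "b" "c"
```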
You can also select multiple elements of the list with single brackets to be returned in a list.
$letters
[1] "A" "b" "c"
$numbers
[1] 1 2 3
You can also select down several levels of a list at once.
[1] "A"
[1] 1
[,1] [,2]
[1,] 1 6
[2,] 2 7
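Selecting several elements and drilling down several levels can be sketched like this, again assuming a hypothetical list called mylist with the same shape as the one printed earlier:

```r
mylist <- list(
  letters = c("A", "b", "c"),
  numbers = 1:3,
  matrix(1:25, ncol = 5)
)

# Single brackets with several positions return a shorter list.
first_two <- mylist[1:2]
print(names(first_two))  # "letters" "numbers"

# Extract an element, then index into it in the same expression.
first_letter <- mylist$letters[1]      # "A"
first_number <- mylist[[2]][1]         # 1
corner       <- mylist[[3]][1:2, 1:2]  # top-left 2 x 2 block
print(first_letter)
print(first_number)
print(corner)
```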
7.4 Exercises
For these exercises, we will use the bike lanes dataset, Bike_Lanes.csv. The data frame containing this dataset will be referred to as bike below.
Part of the lab will make use of %in% which checks to see if something is contained in a vector.
[1] TRUE TRUE TRUE FALSE FALSE
[1] TRUE TRUE TRUE FALSE FALSE
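As a quick sketch, %in% gives the same result as chaining == comparisons with |. The vector x here is made up for illustration.

```r
x <- c(0, 2, 2, 3, 4)

# Two equivalent ways to ask "is each element equal to 0 or 2?"
by_or <- (x == 0 | x == 2)
by_in <- x %in% c(0, 2)

print(by_or)  # TRUE TRUE TRUE FALSE FALSE
print(by_in)  # TRUE TRUE TRUE FALSE FALSE
```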
- Get all the different types of bike lanes from the type column. Use sort(unique()). Assign this to an object btypes. Type dput(btypes).
- By rearranging vector btypes and using dput, recode type as a factor that has SIDEPATH as the first level. Print head(bike$type). Note what you see. Run table(bike$type) afterwards and note the order.
- Make a column called type2, which is a factor of the type column, with the levels: c("SIDEPATH", "BIKE BOULEVARD", "BIKE LANE"). Run table(bike$type2), with the option useNA = "always". Note, we do not have to make type a character again before doing this.
- Work with the dateInstalled column:
  - Reassign dateInstalled into a character using as.character. Run head(bike$dateInstalled).
  - Reassign dateInstalled as a factor, using the default levels. Run head(bike$dateInstalled).
  - Do not reassign dateInstalled, but simply run head(as.numeric(bike$dateInstalled)). We are looking to see what happens when we try to go from factor to numeric.
  - Do not reassign dateInstalled, but simply run head(as.numeric(as.character(bike$dateInstalled))). This is how you get a “numeric” value back if they were incorrectly converted to factors.
- Convert type back to a character vector. Make a column type2 (replacing the old one), where if the type is one of these categories c("CONTRAFLOW", "SHARED BUS BIKE", "SHARROW", "SIGNED ROUTE") call it "OTHER". Use %in% and ifelse. Make type2 a factor with the levels c("SIDEPATH", "BIKE BOULEVARD", "BIKE LANE", "OTHER").
- Parse the following dates using the correct lubridate functions:
  - “2014/02-14”
  - “04/22/14 03:20” assume mdy
  - “4/5/2016 03:2:22” assume mdy