Chapter 12 Functions
In this chapter, we will cover the basics of writing your own functions. Two of the greatest strengths of R are adaptability and extensibility. R can be customized to do almost anything (of course this may require time). In many cases, the tool you want may already exist in a package, but if it is not, then you can define it yourself by writing your own function. Writing a function is useful for automatic tasks that you find yourself doing repeatedly with only changes to the input, such as making the same plots for reports with updated datasets. Writing your own function is also useful for creating new functionality in R and sharing it with others. When researchers develop a new statistical procedure that others would want to use, the researcher can write a function to then share with others who only need to know how to us the function but not all the details needed to code it.
We will also briefly cover loops in R to reduce repetitive code.
12.1 Writing Your Own Function
12.1.1 Defining a Function
The basic syntax for defining your own function is very simple. One simply uses the function()
function to define it.
functionName = function(inputs) { < function body > return(value) }
Then you would run the 4 lines of the code, which adds a function with the name functionName
to your workspace.
Here we will write a function that returns the second element of a vector:
[1] 4
Note that your function will automatically return the last line of code run:
[1] 4
And if your function is really one line or evaluation, like here, you do not need the curly brackets, and you can put everything on one line:
[1] 4
12.1.2 Function Inputs
Functions can take multiple inputs. The inputs or arguments to the function are defined the parentheses when calling function()
. We can simply give the inputs names and require the user to provide a value to them when calling the function. For example, maybe you want users to select which element to extract.
[1] 5
We can also specify default values for function inputs, so if the user does not pass a value it will use this default value. We set a default value when defining the function by setting varName = defaultValue
in the function()
call. For example, let’s have our function return the 1st element of the vector by default.
[1] 1
Also note that when calling a function using named inputs, we can put them in any order.
[1] 4
But if we do not use the input names, then we must use the same order as in the function definition.
[1] 2 NA NA NA
12.1.3 Scope
Let’s take a look at another function which takes in to numbers x and y and calculates \(x^2-y\).
fn.demo = function(x = 0, y = 0, verbose = FALSE) {
z <- x^2 - y
if (verbose == TRUE) {
cat(paste(x, " squared minus ", y, " = ", z, "\n", sep = " "))
} else {
return(z)
}
}
fn.demo(3, 5)
[1] 4
3 squared minus 5 = 4
[1] "TRUE" "FALSE" "TRUE" "FALSE"
Note that even though we created a variable called z
inside the function and stored the value of 4 in it, a variable called z does not exist in our workspace. This is due to the scope of this variable. It was defined inside the function, so it is created when the function is called and deleted once the function is finished. Generally, any variable that you want your function to use that is defined outside the function should be passed in as an input argument to the function. Do not reference variables outside the function.
q = 5
bad_function = function(p) {
p / q
}
good_function = function(p, q) {
p / q
}
bad_function(10)
[1] 2
[1] 5
[1] 5
Note that when we used the good_function
to divide 10 by 2 (i.e. changed q to 2), it did not use the q defined in our workspace outside the function. The global
q kept the value of 5
.
12.1.4 Returning Multiple Values
In more complex functions, we usually want to return multiple values/objects to the user. We can do this by returning the values in a list
, vector
, or any other object that can hold the multiple values that we want to return.
Let’s create a function that takes in a 2x2 table and calculates the odds ratio and a 95% confidence interval. It returns all three of these values.
For a 2x2 table of the form
Diseased
Exposed Yes No
Yes "a" "b"
No "c" "d"
The estimated odd ratio is calculated as \(\widehat{OR} = (a\cdot d) / (b\cdot c)\), and an (approximate) 95% confidence interval is given by \(\widehat{OR}e^{\pm 1.96*\sqrt{1/a +1/b +1/c +1/d}}\).
odds.ratio = function(table) {
or = (table[1,1] * table[2,2]) / (table[1,2] * table[2,1])
SE.logor = sqrt(1/table[1,1] + 1/table[1,2] + 1/table[2,1] + 1/table[2,2])
zcrit = 1.96
ci.lower = or*exp(-zcrit * SE.logor)
ci.upper = or*exp(zcrit * SE.logor)
return(list("Odds_Ratio" = or, "CI.lower" = ci.lower, "CI.Upper" = ci.upper))
}
#some test data on the association between hypertension and cardiovascular disease
CVD_data = matrix(c(992, 2260, 165, 1017), nrow = 2, ncol = 2,
byrow = TRUE, dimnames = list("Hypertension" = c("Yes", "No"),
"CVD" = c("Yes", "No")))
CVD_data
CVD
Hypertension Yes No
Yes 992 2260
No 165 1017
$Odds_Ratio
[1] 2.705455
$CI.lower
[1] 2.258331
$CI.Upper
[1] 3.241103
odds ratio with 95% C.I.
Hypertension estimate lower upper
No 1.000000 NA NA
Yes 2.705455 2.258339 3.241093
If you want to save this output to use later, be sure to assign it to an R object.
[1] 2.705455
12.1.5 Finding and Preventing Errors
When writing a function, we will need to error check ourselves (it is not uncommon even for experienced programmers to have typos in their code on the first pass) to get our code working, and to try to prevent future users of our functions from making (at least some types or) errors.
For example, let’s look at the function fn.demo
that expects two numbers x and y and calculates \(x + 10 - y\). What happens if we pass in a string instead of a number?
Error in x + 10: non-numeric argument to binary operator
This error was a user mistake. Of course this function wasn’t intended to have a string passed in to the argument x
, but maybe the error message is too ambiguous to let the user know what went wrong. We can add error checking to our functions to provide informative feedback using the stop()
function.
fn.demo = function(x, y = 5) {
if (is.numeric(x) == FALSE | is.numeric(y) == FALSE) {
stop("x and y must be numeric")
}
z = x + 10 - y
return(z)
}
fn.demo("black")
Error in fn.demo("black"): x and y must be numeric
The more difficult errors are those in which a function returns a value but returns the wrong value. In the following, we will see an example of this that is due to NA
values.
mean.se = function(x) {
m = mean(x)
se = sd(x) / sqrt(length(x))
cat(paste("mean = ", m, "; se = ", se, "\n"))
}
a <- c(2, 3, 4, 5, 6)
mean.se(a) #this is correct
mean = 4 ; se = 0.707106781186548
Now let’s add in an NA
.
mean = NA ; se = NA
This is a correct result, but what if we want the function to still do the calculation but ignoring the NA
s? One might think to add in the na.rm = TRUE
option to mean
and sd
functions to get around the NA
s.
mean.se = function(x) {
m = mean(x, na.rm = TRUE)
se = sd(x, na.rm = TRUE) / sqrt(length(x))
cat(paste("mean = ", m, "; se = ", se, "\n"))
}
mean.se(a) #this is correct
mean = 4 ; se = 0.707106781186548
mean = 4 ; se = 0.816496580927726
The standard error returned is wrong for the case that includes an NA
. This is due to the length
function counting the NA
. The length
does not have an option for na.rm = TRUE
, so we will need to find another way to count the non NA
values.
mean.se = function(x) {
m = mean(x, na.rm = TRUE)
se = sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))
cat(paste("mean = ", m, "; se = ", se, "\n"))
}
mean.se(a) #this is correct
mean = 4 ; se = 0.707106781186548
mean = 4 ; se = 0.912870929175277
The more complicated the function gets, the more careful you must be in checking your function to be sure that you are getting the intended output.
To write good functions:
- Break the prob the problem down into steps. Get a simple, fixed example of your code working first, then add replaceable variables that will eventually be the function arguments, and finally make it a function.
- Try to break any problem down into simple steps and write smaller functions for each of these simple parts. You can then package them together into one larger function (or keep them separate if each individual step is important on its own).
- Name arguments as clearly as possible (we used x and y here which is not very descriptive).
- Comment your code! Write a header describing what your function will do (what are the inputs and expected output). Comment the internal steps of the function as well.
- Think imaginatively about what could go wrong, check inputs, and provide informative error messages.
Developing custom functions could save you a lot of time and make your work simpler. This blog post provides examples.
12.1.6 apply
and Functions on the Fly
You can also designate functions “on the fly”. This can be useful when using the apply
family of functions.
$x
[,1] [,2]
[1,] 1 6
[2,] 2 7
$y
[,1] [,2]
[1,] 26 31
[2,] 27 32
sapply()
is a user-friendly version and wrapper of lapply
by default returning a vector, matrix, or array
x y
[1,] 5 5
[2,] 5 5
x y
[1,] "matrix" "matrix"
[2,] "array" "array"
myList = list(a=1:10, b=c(2,4,5), c = c("a","b","c"),
d = factor(c("boy","girl","girl")))
tmp = lapply(myList,function(x) x[1])
tmp
$a
[1] 1
$b
[1] 2
$c
[1] "a"
$d
[1] boy
Levels: boy girl
a b c d
"integer" "numeric" "character" "factor"
12.2 Loops
Loops are also helpful to reduce repetitive code. Let’s calculate the column means of a matrix.
x = matrix(1:25, ncol=5, nrow = 5)
x_means = c(0,0,0,0,0)
x_means[1] = mean(x[,1])
x_means[2] = mean(x[,2])
x_means[3] = mean(x[,3])
x_means[4] = mean(x[,4])
x_means[5] = mean(x[,5])
x_means
[1] 3 8 13 18 23
We can reduce the repeated code with a for loop
[1] 3 8 13 18 23
or a while loop
[1] 3 8 13 18 23
Loops are slow in R and it is better to use vectorized function when possible such as apply or functions like colMeans, since these have been optimized.
[1] 3 8 13 18 23
[1] 3 8 13 18 23
12.3 Exercises
- Write a function,
sqdif
, that does the following:- takes two numbers
x
andy
with default values of 2 and 3. - takes the difference
- squares this difference
- then returns the final value
- checks that
x
andy
are numeric andstop
s with an error message otherwise
- takes two numbers
- Try to write a function called
top()
that takes amatrix
ordata.frame
and a numbern
, and returns the firstn
rows and columns, with the default value ofn=5
. - Write a function that will calculate a 95% one sample t interval. The results will be stored in a list to be returned containing sample mean and the confidence interval. The input to the functions is the numeric vector containing our data. For review, the formula for a 95% one sample t interval is \(\bar{x} \pm 1.96*s/\sqrt{n}\).