Chapter 12 Functions

In this chapter, we will cover the basics of writing your own functions. Two of the greatest strengths of R are adaptability and extensibility. R can be customized to do almost anything (of course this may require time). In many cases, the tool you want may already exist in a package, but if it is not, then you can define it yourself by writing your own function. Writing a function is useful for automatic tasks that you find yourself doing repeatedly with only changes to the input, such as making the same plots for reports with updated datasets. Writing your own function is also useful for creating new functionality in R and sharing it with others. When researchers develop a new statistical procedure that others would want to use, the researcher can write a function to then share with others who only need to know how to us the function but not all the details needed to code it.

We will also briefly cover loops in R to reduce repetitive code.

12.1 Writing Your Own Function

12.1.1 Defining a Function

The basic syntax for defining your own function is very simple. One simply uses the function() function to define it.

functionName = function(inputs) {
< function body >
return(value)
}

Then you would run the 4 lines of the code, which adds a function with the name functionName to your workspace.

Here we will write a function that returns the second element of a vector:

return2 = function(x) {
  return(x[2])
}
return2(c(1,4,5,76))
[1] 4

Note that your function will automatically return the last line of code run:

return2a = function(x) {
  x[2]
}
return2a(c(1,4,5,76))
[1] 4

And if your function is really one line or evaluation, like here, you do not need the curly brackets, and you can put everything on one line:

return2b = function(x) x[2]
return2b(c(1,4,5,76))
[1] 4

12.1.2 Function Inputs

Functions can take multiple inputs. The inputs or arguments to the function are defined the parentheses when calling function(). We can simply give the inputs names and require the user to provide a value to them when calling the function. For example, maybe you want users to select which element to extract.

return2c = function(x, n) x[n]
return2c(c(1,4,5,76), 3)
[1] 5

We can also specify default values for function inputs, so if the user does not pass a value it will use this default value. We set a default value when defining the function by setting varName = defaultValue in the function() call. For example, let’s have our function return the 1st element of the vector by default.

return2d = function(x, n = 1) x[n]
return2d(c(1,4,5,76))
[1] 1

Also note that when calling a function using named inputs, we can put them in any order.

return2d(n = 2, x = c(1, 4, 5, 76))
[1] 4

But if we do not use the input names, then we must use the same order as in the function definition.

return2d(2, c(1, 4, 5, 76))
[1]  2 NA NA NA

12.1.3 Scope

Let’s take a look at another function which takes in to numbers x and y and calculates \(x^2-y\).

fn.demo = function(x = 0, y = 0, verbose = FALSE) {
  z <- x^2 - y
  if (verbose == TRUE) {
    cat(paste(x, " squared minus ", y, " = ", z, "\n", sep = " "))
  } else {
    return(z)
  }
}
fn.demo(3, 5)
[1] 4
fn.demo(3, 5, TRUE)
3  squared minus  5  =  4 
z
[1] "TRUE"  "FALSE" "TRUE"  "FALSE"

Note that even though we created a variable called z inside the function and stored the value of 4 in it, a variable called z does not exist in our workspace. This is due to the scope of this variable. It was defined inside the function, so it is created when the function is called and deleted once the function is finished. Generally, any variable that you want your function to use that is defined outside the function should be passed in as an input argument to the function. Do not reference variables outside the function.

q = 5
bad_function = function(p) {
  p / q
}
good_function = function(p, q) {
  p / q
}
bad_function(10)
[1] 2
good_function(10, 2)
[1] 5
q
[1] 5

Note that when we used the good_function to divide 10 by 2 (i.e. changed q to 2), it did not use the q defined in our workspace outside the function. The global q kept the value of 5.

12.1.4 Returning Multiple Values

In more complex functions, we usually want to return multiple values/objects to the user. We can do this by returning the values in a list, vector, or any other object that can hold the multiple values that we want to return.

Let’s create a function that takes in a 2x2 table and calculates the odds ratio and a 95% confidence interval. It returns all three of these values.

For a 2x2 table of the form

       Diseased
Exposed Yes No 
    Yes "a" "b"
    No  "c" "d"

The estimated odd ratio is calculated as \(\widehat{OR} = (a\cdot d) / (b\cdot c)\), and an (approximate) 95% confidence interval is given by \(\widehat{OR}e^{\pm 1.96*\sqrt{1/a +1/b +1/c +1/d}}\).

odds.ratio = function(table) {
  or = (table[1,1] * table[2,2]) / (table[1,2] * table[2,1])
  SE.logor = sqrt(1/table[1,1] + 1/table[1,2] + 1/table[2,1] + 1/table[2,2])
  zcrit = 1.96
  ci.lower = or*exp(-zcrit * SE.logor)
  ci.upper = or*exp(zcrit * SE.logor)
  return(list("Odds_Ratio" = or, "CI.lower" = ci.lower, "CI.Upper" = ci.upper))
}

#some test data on the association between hypertension and cardiovascular disease
CVD_data = matrix(c(992, 2260, 165, 1017), nrow = 2, ncol = 2,
                  byrow = TRUE, dimnames = list("Hypertension" = c("Yes", "No"),
                                                "CVD" = c("Yes", "No")))
CVD_data
            CVD
Hypertension Yes   No
         Yes 992 2260
         No  165 1017
odds.ratio(CVD_data)
$Odds_Ratio
[1] 2.705455

$CI.lower
[1] 2.258331

$CI.Upper
[1] 3.241103
library(epitools)
oddsratio(CVD_data, method = "wald", rev = "b")$measure
            odds ratio with 95% C.I.
Hypertension estimate    lower    upper
         No  1.000000       NA       NA
         Yes 2.705455 2.258339 3.241093

If you want to save this output to use later, be sure to assign it to an R object.

or_ci = odds.ratio(CVD_data)
or_ci$Odds_Ratio
[1] 2.705455

12.1.5 Finding and Preventing Errors

When writing a function, we will need to error check ourselves (it is not uncommon even for experienced programmers to have typos in their code on the first pass) to get our code working, and to try to prevent future users of our functions from making (at least some types or) errors.

For example, let’s look at the function fn.demo that expects two numbers x and y and calculates \(x + 10 - y\). What happens if we pass in a string instead of a number?

fn.demo = function(x, y = 5) {
  z = x + 10 - y
  return(z)
}
fn.demo("black")
Error in x + 10: non-numeric argument to binary operator

This error was a user mistake. Of course this function wasn’t intended to have a string passed in to the argument x, but maybe the error message is too ambiguous to let the user know what went wrong. We can add error checking to our functions to provide informative feedback using the stop() function.

fn.demo = function(x, y = 5) {
  if (is.numeric(x) == FALSE | is.numeric(y) == FALSE) {
    stop("x and y must be numeric")
  }
  z = x + 10 - y
  return(z)
}
fn.demo("black")
Error in fn.demo("black"): x and y must be numeric

The more difficult errors are those in which a function returns a value but returns the wrong value. In the following, we will see an example of this that is due to NA values.

mean.se = function(x) {
  m = mean(x)
  se = sd(x) / sqrt(length(x))
  cat(paste("mean = ", m, "; se = ", se, "\n"))
}
a <- c(2, 3, 4, 5, 6)
mean.se(a) #this is correct
mean =  4 ; se =  0.707106781186548 

Now let’s add in an NA.

b <- a
b[3] <- NA
mean.se(b)
mean =  NA ; se =  NA 

This is a correct result, but what if we want the function to still do the calculation but ignoring the NAs? One might think to add in the na.rm = TRUE option to mean and sd functions to get around the NAs.

mean.se = function(x) {
  m = mean(x, na.rm = TRUE)
  se = sd(x, na.rm = TRUE) / sqrt(length(x))
  cat(paste("mean = ", m, "; se = ", se, "\n"))
}
mean.se(a) #this is correct
mean =  4 ; se =  0.707106781186548 
mean.se(b) #this is not correct
mean =  4 ; se =  0.816496580927726 

The standard error returned is wrong for the case that includes an NA. This is due to the length function counting the NA. The length does not have an option for na.rm = TRUE, so we will need to find another way to count the non NA values.

mean.se = function(x) {
  m = mean(x, na.rm = TRUE)
  se = sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))
  cat(paste("mean = ", m, "; se = ", se, "\n"))
}
mean.se(a) #this is correct
mean =  4 ; se =  0.707106781186548 
mean.se(b) #this is now correct
mean =  4 ; se =  0.912870929175277 

The more complicated the function gets, the more careful you must be in checking your function to be sure that you are getting the intended output.

To write good functions:

  • Break the prob the problem down into steps. Get a simple, fixed example of your code working first, then add replaceable variables that will eventually be the function arguments, and finally make it a function.
  • Try to break any problem down into simple steps and write smaller functions for each of these simple parts. You can then package them together into one larger function (or keep them separate if each individual step is important on its own).
  • Name arguments as clearly as possible (we used x and y here which is not very descriptive).
  • Comment your code! Write a header describing what your function will do (what are the inputs and expected output). Comment the internal steps of the function as well.
  • Think imaginatively about what could go wrong, check inputs, and provide informative error messages.

Developing custom functions could save you a lot of time and make your work simpler. This blog post provides examples.

12.1.6 apply and Functions on the Fly

You can also designate functions “on the fly”. This can be useful when using the apply family of functions.

matList = list(x = matrix(1:25,nc=5),y=matrix(26:50,nc=5))
lapply(matList, function(x) x[1:2,1:2])
$x
     [,1] [,2]
[1,]    1    6
[2,]    2    7

$y
     [,1] [,2]
[1,]   26   31
[2,]   27   32

sapply() is a user-friendly version and wrapper of lapply by default returning a vector, matrix, or array

sapply(matList, dim)
     x y
[1,] 5 5
[2,] 5 5
sapply(matList, class)
     x        y       
[1,] "matrix" "matrix"
[2,] "array"  "array" 
myList = list(a=1:10, b=c(2,4,5), c = c("a","b","c"),
                d = factor(c("boy","girl","girl")))
tmp = lapply(myList,function(x) x[1])
tmp
$a
[1] 1

$b
[1] 2

$c
[1] "a"

$d
[1] boy
Levels: boy girl
sapply(tmp, class)
          a           b           c           d 
  "integer"   "numeric" "character"    "factor" 

12.2 Loops

Loops are also helpful to reduce repetitive code. Let’s calculate the column means of a matrix.

x = matrix(1:25, ncol=5, nrow = 5)
x_means = c(0,0,0,0,0)
x_means[1] = mean(x[,1])
x_means[2] = mean(x[,2])
x_means[3] = mean(x[,3])
x_means[4] = mean(x[,4])
x_means[5] = mean(x[,5])
x_means
[1]  3  8 13 18 23

We can reduce the repeated code with a for loop

x_means = numeric(5)
for (i in 1:5) {
  x_means[i] = mean(x[,i])
}
x_means
[1]  3  8 13 18 23

or a while loop

x_means = numeric(5)
i = 1
while (i <= 5) {
  x_means[i] = mean(x[,i])
  i = i + 1
}
x_means
[1]  3  8 13 18 23

Loops are slow in R and it is better to use vectorized function when possible such as apply or functions like colMeans, since these have been optimized.

colMeans(x)
[1]  3  8 13 18 23
apply(x, 2, mean)
[1]  3  8 13 18 23

12.3 Exercises

  1. Write a function, sqdif, that does the following:
    1. takes two numbers x and y with default values of 2 and 3.
    2. takes the difference
    3. squares this difference
    4. then returns the final value
    5. checks that x and y are numeric and stops with an error message otherwise
  2. Try to write a function called top() that takes a matrix or data.frame and a number n, and returns the first n rows and columns, with the default value of n=5.
  3. Write a function that will calculate a 95% one sample t interval. The results will be stored in a list to be returned containing sample mean and the confidence interval. The input to the functions is the numeric vector containing our data. For review, the formula for a 95% one sample t interval is \(\bar{x} \pm 1.96*s/\sqrt{n}\).