Practical Biological Data Analysis: Class 11: Programming in R (part 1)

The Rmarkdown for this class is on github

# Conditionally download files from github
source("https://raw.githubusercontent.com/rnabioco/bmsc-7810-pbda/main/_posts/2023-12-13-class-11-programming-pt-1/download-files.R")

What is a function?

As an analyst you will eventually find yourself in the position of wanting to reuse a block of code. There are two general ways to do this:

copy-and-paste
write a function

A function is essentially a block of code that you’ve given a name and saved for later. Functions have several advantages:

They make your code easier to read
They reduce the chance of mistakes from repeated copying and pasting
They make it easier to adapt your code for different requirements

How to write a function

There are three general steps for writing functions:

Pick a name
Identify the inputs
Add code to the body

# Lets write a function to rescale a numeric vector
rescale_vec <- function(x) {
  
  rng <- range(x)
  
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale_vec(b)
rescale_vec(c)

Write functions for the following bits of code

# function 1
x / sum(x)

# function 2
(x + y) / z

# function 3
sqrt(sum((x - mean(x))^2) / (length(x) - 1))

Show answer

calc_sd <- function(x) {
  sqrt(sum((x - mean(x))^2) / (length(x) - 1))
}

calc_sd <- function(x) {
  l <- length(x) - 1
  m <- mean(x)
  v <- sum((x - m)^2) / l
  sqrt(v)
}

The function execution environment

When running a function an execution environment is created, which is separate from the global environment
The execution environment contains objects created within the function
When R searches for an object referenced by a function, the execution environment takes precedence
If an object is not found in the function environment, R will search in the global environment

Can objects present in the global environment be referenced from within a function?

# Earlier we saved a numeric vector "a"
a

#>  [1] -0.375738193  2.206017052  1.119596807 -1.368699695  2.067158551
#>  [6] -0.143291626  1.243639680  0.129434705 -1.358442998 -0.007440558

sum_nums <- function(x) {
  x + a
}
# Yes!
sum_nums(10)

#>  [1]  9.624262 12.206017 11.119597  8.631300 12.067159  9.856708
#>  [7] 11.243640 10.129435  8.641557  9.992559

Can code executed within a function modify an object present in the global environment?

sum_nums <- function(x) {
  a <- x + a
}

# When we run sum_nums(), will this overwrite our original vector?
sum_nums(10)

# No! (not when using the '<-' assignment operator)
a

#>  [1] -0.375738193  2.206017052  1.119596807 -1.368699695  2.067158551
#>  [6] -0.143291626  1.243639680  0.129434705 -1.358442998 -0.007440558

A more relevant example

The brauer_gene_exp data contains a data set from a manuscript describing how gene expression changes in yeast under several nutrient limitation conditions. We’ll use this data to illustrate the utility and the power of functions.

Using the Brauer data lets create a scatter plot comparing growth rate vs expression for the gene YDL104C. Use facet_wrap() to create a separate plot for each nutrient.

brauer_gene_exp <- read_csv("data/brauer_gene_exp.csv.gz")

What if you want to create this plot for other genes? Write a function the takes a data.frame and systematic_name as inputs and creates scatter plots for each nutrient

# Fill in the function body
# You can include default values for your arguments
plot_expr <- function(input, sys_name = "YNL049C") {
  
  ????
  
}

Show answer

plot_expr <- function(input, sys_name = "YNL049C") {
  gg_data <- input |>
    filter(systematic_name == sys_name)
  
  gg_data |>
    ggplot(aes(rate, expression, color = nutrient)) +
    geom_point(size = 2) +
    facet_wrap(~ nutrient) +
    theme(legend.position = "none")
}

p <- plot_expr(
  input = brauer_gene_exp, 
  sys_name = "YDL104C"
)

# You can also use the |> pipe with your custom functions
p <- brauer_gene_exp |>
  plot_expr(sys_name = "YDL104C")

p

Modify our plotting function to add the gene name as the plot title and the molecular function (MF) as a subtitle

Show answer

plot_expr <- function(input, sys_name) {
  gg_data <- input |>
    filter(systematic_name == sys_name)
  
  plot_title <- gg_data$name[1]
  plot_sub <- gg_data$MF[1]
  
  gg_data |>
    ggplot(aes(rate, expression, color = nutrient)) +
    geom_point(size = 2) +
    labs(title = plot_title, subtitle = plot_sub) +
    ggtitle(plot_title) +
    facet_wrap(~ nutrient) +
    theme(legend.position = "none")
}

brauer_gene_exp |>
  plot_expr("YDL104C")

Copy-on-modify semantics

As you’ve seen objects that are passed to a function are not modified within the function by default. Intuitively you can think of each object being copied within the function environment to avoid modification of the original. However this would be memory inefficient and slow approach, as copying multiple large objects takes time and space.

Instead R adopts a “copy-on-modify” approach with objects. Objects are only copied when it is necessary. The same is true of objects outside of functions.

change_to_char <- function(large_object) {
  
  # large object is not a copy, but a reference
  large_object
  
  # now a new copy of large_object is made
  large_object <- as.character(large_object)
  large_object
}

mat <- matrix(1:100, nrow = 10)

# not copied
a <- mat

# mat not copied yet
mat[1:5, 1:5]

# now a copy is made
mat2 <- as.character(mat)
mat2 <- as.data.frame(mat)

Conditional statements

if statements allow you to execute code depending on defined conditions.

if (condition) {
  code executed when condition is TRUE
  
} else {
  code executed when condition is FALSE
}

R has a set of operators that can be used to write conditional statements

Operator	Description
<	less than
<=	less or equal
>	greater than
>=	greater or equal
==	equal
!=	not equal
!x	not x
x \| y	x or y (returns a vector of logicals)
x \|\| y	x or y (returns single TRUE or FALSE)
x & y	x and y (returns a vector of logicals)
x && y	x and y (returns single TRUE or FALSE)
x %in% y	x is present in y

Add an if statement to our plotting function to account for a missing gene name

plot_expr <- function(input, sys_name) {
  gg_data <- input |>
    filter(systematic_name == sys_name)
  
  plot_title <- gg_data$name[1]
  plot_sub <- gg_data$MF[1]
  
  ????
  
  gg_data |>
    ggplot(aes(rate, expression, color = nutrient)) +
    geom_point(size = 2) +
    labs(title = plot_title, subtitle = plot_sub) +
    facet_wrap(~ nutrient) +
    theme(legend.position = "none")
}

Show answer

plot_expr <- function(input, sys_name) {
  gg_data <- input |>
    filter(systematic_name == sys_name)
  
  plot_title <- gg_data$name[1]
  plot_sub <- gg_data$MF[1]
  
  if (is.na(plot_title)) {
    plot_title <- sys_name
  }
  
  gg_data |>
    ggplot(aes(rate, expression, color = nutrient)) +
    geom_point(size = 2) +
    labs(title = plot_title, subtitle = plot_sub) +
    facet_wrap(~ nutrient) +
    theme(legend.position = "none")
}

brauer_gene_exp |>
  plot_expr("YNL095C")

Conditional statements can be linked together

# Using 'else if'
if (condition_1) {
  executed when condition_1 is TRUE
  
} else if (condition_2) {
  executed when condition_1 is FALSE and condition_2 is TRUE
  
} else {
  executed when condition_1 and condition_2 are FALSE
}

# The 'and' operator
if (condition_1 && condition_2) {
  executed when condition_1 and condition_2 are TRUE
  
} else {
  executed when condition_1 or condition_2 are FALSE
}

# The 'or' operator
if (condition_1 || condition_2) {
  executed when condition_1 or condition_2 are TRUE
  
} else {
  executed when condition_1 and condition_2 are FALSE
}

Messages, warnings, and errors

stop(), warning(), message(), and stopifnot() are commonly used functions in R for reporting information and/or stopping execution based on a condition.

stop("information about error to user, stops execution")
warning("information about warning to user, does not stop execution")
message("information that is not an error or warning, does not stop execution")
stopifnot(2 + 2 != 4)  # shortcut for if (condition is FALSE) stop()

See also tryCatch() for “catching” errors and performing alternative actions.

Checking inputs

When writing functions it can be useful to check input values to make sure they are valid. Lets modify our plotting function to check that sys_name is a string.

is.character()
is.numeric()
is.logical()
is.factor()

plot_expr <- function(input, sys_name) {
  
  if (!is.character(sys_name)) {
    stop("sys_name must be a string!")
  }
  
  gg_data <- input |>
    filter(systematic_name == sys_name)
  
  plot_title <- gg_data$name[1]
  plot_sub <- gg_data$MF[1]
  
  if (is.na(plot_title)) {
    plot_title <- sys_name
  }
  
  gg_data |>
    ggplot(aes(rate, expression, color = nutrient)) +
    geom_point(size = 2) +
    labs(title = plot_title, subtitle = plot_sub) +
    facet_wrap(~ nutrient) +
    theme(legend.position = "none")
}

brauer_gene_exp |>
  plot_expr("YDL104C")

Modify our plotting function to check that sys_name is present in the input. Hint: try the %in% operator

plot_expr <- function(input, sys_name) {
  
  if (!is.character(sys_name)) {
    stop("sys_name must be a string!")
  }
  
  if ( ???? ) {
    stop( ???? )
  }
  
  gg_data <- input |>
    filter(systematic_name == sys_name)
  
  plot_title <- gg_data$name[1]
  plot_sub <- gg_data$MF[1]
  
  if (is.na(plot_title) ){
    plot_title <- sys_name
  }
  
  gg_data |>
    ggplot(aes(rate, expression, color = nutrient)) +
    geom_point(size = 2) +
    labs(title = plot_title, subtitle = plot_sub) +
    facet_wrap(~ nutrient) +
    theme(legend.position = "none")
}

Show answer

plot_expr <- function(input, sys_name) {
  
  if (!is.character(sys_name)) {
    stop("sys_name must be a string!")
  }
  
  if (!sys_name %in% input$systematic_name) {
    stop("sys_name not found in input data!")
  }
  
  gg_data <- input |>
    filter(systematic_name == sys_name)
  
  plot_title <- gg_data$name[1]
  plot_sub <- gg_data$MF[1]
  
  if (plot_title == "") {
    plot_title <- sys_name
  }
  
  gg_data |>
    ggplot(aes(rate, expression, color = nutrient)) +
    geom_point(size = 2) +
    labs(title = plot_title, subtitle = plot_sub) +
    facet_wrap(~ nutrient) +
    theme(legend.position = "none")
}

Passing arguments with the ellipsis (…)

The ellipsis allows a function to take an arbitrary number of arguments, which can then be passed to an inner function. This is nice when you have an inner function that has a lot of useful arguments. Lets first try this with our simple rescale_vec() function.

rescale_vec <- function(x, ...) {
  rng <- range(x, ...)
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale_vec(a)

#>  [1] 0.277773477 1.000000000 0.696082145 0.000000000 0.961155383
#>  [6] 0.342798648 0.730782202 0.419091779 0.002869233 0.380801958


a[1] <- NA

rescale_vec(a, na.rm = T)

#>  [1]          NA 1.000000000 0.696082145 0.000000000 0.961155383
#>  [6] 0.342798648 0.730782202 0.419091779 0.002869233 0.380801958

Modify our plotting function so the user can change the point size, shape, and alpha

# A cumbersome way
plot_expr <- function(input, sys_name, pt_size = 2, pt_shape = 1, pt_alpha = 1) {
  input |>
    filter(systematic_name == sys_name) |>
    ggplot(aes(rate, expression, color = nutrient)) +
    geom_point(size = pt_size, shape = pt_shape, alpha = pt_alpha) +
    facet_wrap(~ nutrient) +
    theme(legend.position = "none")
}

# With the ellipsis
plot_expr <- function(input, sys_name, ...) {
  input |>
    filter(systematic_name == sys_name) |>
    ggplot(aes(rate, expression, color = nutrient)) +
    geom_point(...) +
    facet_wrap(~ nutrient) +
    theme(legend.position = "none")
}

# Now we can easily change the point size and shape
plot_expr(
  input = brauer_gene_exp,
  sys_name = "YDL104C",
  size = 5,
  shape = 2,
  alpha = 0.75
)

Saving your functions for later

A good way to save commonly used functions is to keep them in a separate R script. You can load your functions using the source() command.

source("path/to/my_functions.R")

Show session info

sessionInfo()

#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/Denver
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods  
#> [7] base     
#> 
#> other attached packages:
#>  [1] lubridate_1.9.3 forcats_1.0.0   stringr_1.5.0   dplyr_1.1.3    
#>  [5] purrr_1.0.2     readr_2.1.4     tidyr_1.3.0     tibble_3.2.1   
#>  [9] ggplot2_3.4.3   tidyverse_2.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.7        utf8_1.2.3        generics_0.1.3   
#>  [4] stringi_1.7.12    distill_1.6       hms_1.1.3        
#>  [7] digest_0.6.33     magrittr_2.0.3    evaluate_0.22    
#> [10] grid_4.3.1        timechange_0.2.0  fastmap_1.1.1    
#> [13] jsonlite_1.8.7    fansi_1.0.4       scales_1.2.1     
#> [16] jquerylib_0.1.4   cli_3.6.1         rlang_1.1.1      
#> [19] crayon_1.5.2      bit64_4.0.5       munsell_0.5.0    
#> [22] withr_2.5.1       cachem_1.0.8      yaml_2.3.7       
#> [25] parallel_4.3.1    tools_4.3.1       tzdb_0.4.0       
#> [28] memoise_2.0.1     colorspace_2.1-0  vctrs_0.6.3      
#> [31] R6_2.5.1          lifecycle_1.0.3   bit_4.0.5        
#> [34] vroom_1.6.3       pkgconfig_2.0.3   pillar_1.9.0     
#> [37] bslib_0.5.1       gtable_0.3.4      glue_1.6.2       
#> [40] xfun_0.40         tidyselect_1.2.0  rstudioapi_0.15.0
#> [43] knitr_1.44        farver_2.1.1      htmltools_0.5.6  
#> [46] labeling_0.4.3    rmarkdown_2.25    compiler_4.3.1   
#> [49] downlit_0.4.3