The Rmarkdown for this class is on github
# Conditionally download files from github
source("https://raw.githubusercontent.com/rnabioco/bmsc-7810-pbda/main/_posts/2023-12-13-class-11-programming-pt-1/download-files.R")
As an analyst you will eventually find yourself in the position of wanting to reuse a block of code. There are two general ways to do this:
A function is essentially a block of code that you’ve given a name and saved for later. Functions have several advantages:
Further reading
# An example: you want to rescale a numeric vector so all values are between 0 and 1
a <- rnorm(n = 10)
a
#> [1] -0.375738193 2.206017052 1.119596807 -1.368699695 2.067158551
#> [6] -0.143291626 1.243639680 0.129434705 -1.358442998 -0.007440558
rng <- range(a)
(a - rng[1]) / (rng[2] - rng[1])
#> [1] 0.277773477 1.000000000 0.696082145 0.000000000 0.961155383
#> [6] 0.342798648 0.730782202 0.419091779 0.002869233 0.380801958
There are three general steps for writing functions:
# Lets write a function to rescale a numeric vector
rescale_vec <- function(x) {
rng <- range(x)
(x - rng[1]) / (rng[2] - rng[1])
}
rescale_vec(b)
rescale_vec(c)
Write functions for the following bits of code
Can objects present in the global environment be referenced from within a function?
# Earlier we saved a numeric vector "a"
a
#> [1] -0.375738193 2.206017052 1.119596807 -1.368699695 2.067158551
#> [6] -0.143291626 1.243639680 0.129434705 -1.358442998 -0.007440558
sum_nums <- function(x) {
x + a
}
# Yes!
sum_nums(10)
#> [1] 9.624262 12.206017 11.119597 8.631300 12.067159 9.856708
#> [7] 11.243640 10.129435 8.641557 9.992559
Can code executed within a function modify an object present in the global environment?
sum_nums <- function(x) {
a <- x + a
}
# When we run sum_nums(), will this overwrite our original vector?
sum_nums(10)
# No! (not when using the '<-' assignment operator)
a
#> [1] -0.375738193 2.206017052 1.119596807 -1.368699695 2.067158551
#> [6] -0.143291626 1.243639680 0.129434705 -1.358442998 -0.007440558
The brauer_gene_exp
data contains a data set from a manuscript describing how gene expression changes in yeast under several nutrient limitation conditions. We’ll use this data to illustrate the utility and the power of functions.
Using the Brauer data lets create a scatter plot comparing growth rate vs expression for the gene YDL104C. Use facet_wrap()
to create a separate plot for each nutrient.
brauer_gene_exp <- read_csv("data/brauer_gene_exp.csv.gz")
What if you want to create this plot for other genes? Write a function the takes a data.frame and systematic_name as inputs and creates scatter plots for each nutrient
# Fill in the function body
# You can include default values for your arguments
<- function(input, sys_name = "YNL049C") {
plot_expr
????
}
plot_expr <- function(input, sys_name = "YNL049C") {
gg_data <- input |>
filter(systematic_name == sys_name)
gg_data |>
ggplot(aes(rate, expression, color = nutrient)) +
geom_point(size = 2) +
facet_wrap(~ nutrient) +
theme(legend.position = "none")
}
p <- plot_expr(
input = brauer_gene_exp,
sys_name = "YDL104C"
)
# You can also use the |> pipe with your custom functions
p <- brauer_gene_exp |>
plot_expr(sys_name = "YDL104C")
p
Modify our plotting function to add the gene name as the plot title and the molecular function (MF) as a subtitle
plot_expr <- function(input, sys_name) {
gg_data <- input |>
filter(systematic_name == sys_name)
plot_title <- gg_data$name[1]
plot_sub <- gg_data$MF[1]
gg_data |>
ggplot(aes(rate, expression, color = nutrient)) +
geom_point(size = 2) +
labs(title = plot_title, subtitle = plot_sub) +
ggtitle(plot_title) +
facet_wrap(~ nutrient) +
theme(legend.position = "none")
}
brauer_gene_exp |>
plot_expr("YDL104C")
As you’ve seen objects that are passed to a function are not modified within the function by default. Intuitively you can think of each object being copied within the function environment to avoid modification of the original. However this would be memory inefficient and slow approach, as copying multiple large objects takes time and space.
Instead R adopts a “copy-on-modify” approach with objects. Objects are only copied when it is necessary. The same is true of objects outside of functions.
change_to_char <- function(large_object) {
# large object is not a copy, but a reference
large_object
# now a new copy of large_object is made
large_object <- as.character(large_object)
large_object
}
mat <- matrix(1:100, nrow = 10)
# not copied
a <- mat
# mat not copied yet
mat[1:5, 1:5]
# now a copy is made
mat2 <- as.character(mat)
mat2 <- as.data.frame(mat)
if
statements allow you to execute code depending on defined conditions.
if (condition) {
TRUE
code executed when condition is
else {
} FALSE
code executed when condition is }
R has a set of operators that can be used to write conditional statements
Operator | Description |
---|---|
< | less than |
<= | less or equal |
> | greater than |
>= | greater or equal |
== | equal |
!= | not equal |
!x | not x |
x | y | x or y (returns a vector of logicals) |
x || y | x or y (returns single TRUE or FALSE) |
x & y | x and y (returns a vector of logicals) |
x && y | x and y (returns single TRUE or FALSE) |
x %in% y | x is present in y |
Add an if
statement to our plotting function to account for a missing gene name
plot_expr <- function(input, sys_name) {
gg_data <- input |>
filter(systematic_name == sys_name)
plot_title <- gg_data$name[1]
plot_sub <- gg_data$MF[1]
????
gg_data |>
ggplot(aes(rate, expression, color = nutrient)) +
geom_point(size = 2) +
labs(title = plot_title, subtitle = plot_sub) +
facet_wrap(~ nutrient) +
theme(legend.position = "none")
}
plot_expr <- function(input, sys_name) {
gg_data <- input |>
filter(systematic_name == sys_name)
plot_title <- gg_data$name[1]
plot_sub <- gg_data$MF[1]
if (is.na(plot_title)) {
plot_title <- sys_name
}
gg_data |>
ggplot(aes(rate, expression, color = nutrient)) +
geom_point(size = 2) +
labs(title = plot_title, subtitle = plot_sub) +
facet_wrap(~ nutrient) +
theme(legend.position = "none")
}
brauer_gene_exp |>
plot_expr("YNL095C")
Conditional statements can be linked together
# Using 'else if'
if (condition_1) {
TRUE
executed when condition_1 is
else if (condition_2) {
} FALSE and condition_2 is TRUE
executed when condition_1 is
else {
} FALSE
executed when condition_1 and condition_2 are
}
# The 'and' operator
if (condition_1 && condition_2) {
TRUE
executed when condition_1 and condition_2 are
else {
} FALSE
executed when condition_1 or condition_2 are
}
# The 'or' operator
if (condition_1 || condition_2) {
TRUE
executed when condition_1 or condition_2 are
else {
} FALSE
executed when condition_1 and condition_2 are }
stop()
, warning()
, message()
, and stopifnot()
are commonly used functions in R for reporting information and/or stopping execution based on a condition.
See also tryCatch()
for “catching” errors and performing alternative actions.
When writing functions it can be useful to check input values to make sure they are valid. Lets modify our plotting function to check that sys_name
is a string.
is.character()
is.numeric()
is.logical()
is.factor()
plot_expr <- function(input, sys_name) {
if (!is.character(sys_name)) {
stop("sys_name must be a string!")
}
gg_data <- input |>
filter(systematic_name == sys_name)
plot_title <- gg_data$name[1]
plot_sub <- gg_data$MF[1]
if (is.na(plot_title)) {
plot_title <- sys_name
}
gg_data |>
ggplot(aes(rate, expression, color = nutrient)) +
geom_point(size = 2) +
labs(title = plot_title, subtitle = plot_sub) +
facet_wrap(~ nutrient) +
theme(legend.position = "none")
}
brauer_gene_exp |>
plot_expr("YDL104C")
Modify our plotting function to check that sys_name
is present in the input. Hint: try the %in%
operator
<- function(input, sys_name) {
plot_expr
if (!is.character(sys_name)) {
stop("sys_name must be a string!")
}
if ( ???? ) {
stop( ???? )
}
<- input |>
gg_data filter(systematic_name == sys_name)
<- gg_data$name[1]
plot_title <- gg_data$MF[1]
plot_sub
if (is.na(plot_title) ){
<- sys_name
plot_title
}
|>
gg_data ggplot(aes(rate, expression, color = nutrient)) +
geom_point(size = 2) +
labs(title = plot_title, subtitle = plot_sub) +
facet_wrap(~ nutrient) +
theme(legend.position = "none")
}
plot_expr <- function(input, sys_name) {
if (!is.character(sys_name)) {
stop("sys_name must be a string!")
}
if (!sys_name %in% input$systematic_name) {
stop("sys_name not found in input data!")
}
gg_data <- input |>
filter(systematic_name == sys_name)
plot_title <- gg_data$name[1]
plot_sub <- gg_data$MF[1]
if (plot_title == "") {
plot_title <- sys_name
}
gg_data |>
ggplot(aes(rate, expression, color = nutrient)) +
geom_point(size = 2) +
labs(title = plot_title, subtitle = plot_sub) +
facet_wrap(~ nutrient) +
theme(legend.position = "none")
}
The ellipsis allows a function to take an arbitrary number of arguments, which can then be passed to an inner function. This is nice when you have an inner function that has a lot of useful arguments. Lets first try this with our simple rescale_vec()
function.
rescale_vec <- function(x, ...) {
rng <- range(x, ...)
(x - rng[1]) / (rng[2] - rng[1])
}
rescale_vec(a)
#> [1] 0.277773477 1.000000000 0.696082145 0.000000000 0.961155383
#> [6] 0.342798648 0.730782202 0.419091779 0.002869233 0.380801958
a[1] <- NA
rescale_vec(a, na.rm = T)
#> [1] NA 1.000000000 0.696082145 0.000000000 0.961155383
#> [6] 0.342798648 0.730782202 0.419091779 0.002869233 0.380801958
Modify our plotting function so the user can change the point size, shape, and alpha
# A cumbersome way
plot_expr <- function(input, sys_name, pt_size = 2, pt_shape = 1, pt_alpha = 1) {
input |>
filter(systematic_name == sys_name) |>
ggplot(aes(rate, expression, color = nutrient)) +
geom_point(size = pt_size, shape = pt_shape, alpha = pt_alpha) +
facet_wrap(~ nutrient) +
theme(legend.position = "none")
}
# With the ellipsis
plot_expr <- function(input, sys_name, ...) {
input |>
filter(systematic_name == sys_name) |>
ggplot(aes(rate, expression, color = nutrient)) +
geom_point(...) +
facet_wrap(~ nutrient) +
theme(legend.position = "none")
}
# Now we can easily change the point size and shape
plot_expr(
input = brauer_gene_exp,
sys_name = "YDL104C",
size = 5,
shape = 2,
alpha = 0.75
)
A good way to save commonly used functions is to keep them in a separate R script. You can load your functions using the source()
command.
source("path/to/my_functions.R")
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/Denver
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods
#> [7] base
#>
#> other attached packages:
#> [1] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.0 dplyr_1.1.3
#> [5] purrr_1.0.2 readr_2.1.4 tidyr_1.3.0 tibble_3.2.1
#> [9] ggplot2_3.4.3 tidyverse_2.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.7 utf8_1.2.3 generics_0.1.3
#> [4] stringi_1.7.12 distill_1.6 hms_1.1.3
#> [7] digest_0.6.33 magrittr_2.0.3 evaluate_0.22
#> [10] grid_4.3.1 timechange_0.2.0 fastmap_1.1.1
#> [13] jsonlite_1.8.7 fansi_1.0.4 scales_1.2.1
#> [16] jquerylib_0.1.4 cli_3.6.1 rlang_1.1.1
#> [19] crayon_1.5.2 bit64_4.0.5 munsell_0.5.0
#> [22] withr_2.5.1 cachem_1.0.8 yaml_2.3.7
#> [25] parallel_4.3.1 tools_4.3.1 tzdb_0.4.0
#> [28] memoise_2.0.1 colorspace_2.1-0 vctrs_0.6.3
#> [31] R6_2.5.1 lifecycle_1.0.3 bit_4.0.5
#> [34] vroom_1.6.3 pkgconfig_2.0.3 pillar_1.9.0
#> [37] bslib_0.5.1 gtable_0.3.4 glue_1.6.2
#> [40] xfun_0.40 tidyselect_1.2.0 rstudioapi_0.15.0
#> [43] knitr_1.44 farver_2.1.1 htmltools_0.5.6
#> [46] labeling_0.4.3 rmarkdown_2.25 compiler_4.3.1
#> [49] downlit_0.4.3