Read the Guide to RMarkdown for an exhaustive description of the various formats and options for using RMarkdown documents. Note that HTML for this class were all made from Rmd, using the distill blog format
There is also a newer format, also built by Rstudio (now named Posit) called Quarto. Quarto documents are very similar to RMarkdown, have broader support for additional programming languages, and will likely eventually replace the Rmarkdown format.
The Rmarkdown for this post is on github
You can speed up knitting of your Rmds by using caching to store the results from each chunk, instead of rerunning them each time. Note that if you modify the code chunk, previous caching is ignored.
For each chunk, set {r, cache = TRUE}
Or the option can be set globally at top of the document. Like this:
::opts_chunk$set(cache = TRUE) knitr
Run once, save, and load instead of rerunning resource intensive parts.
If you have non-deterministic functions, such as kmeans
, remember to set.seed
, or save and load result objects.
if(!file.exists("long-step.rds")){
...
code or a scriptsource("path/to/script.R")
...saveRDS(obj, "long-step.rds")
else {
} <- readRDS("long-step.rds")
obj }
here
, folder structure managementhttps://github.com/jennybc/here_here
styler
, clean up code readabilityRefer to this style guide often, so you don’t have to go back to make the code readable/publishable later.
styler::style_text("
my_fun <- function(x,
y,
z) {
x+ z
}
")
#> my_fun <- function(
#> x,
#> y,
#> z) {
#> x + z
#> }
# styler::style_file # for an entire file
# styler::style_dir # directory
It’s very helpful to have a record of which packages were used in an analysis. One approach is to print the sessionInfo()
.
#> R version 4.3.1 (2023-06-16)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Monterey 12.2.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: America/Denver
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods
#> [7] base
#>
#> other attached packages:
#> [1] here_1.0.1 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
#> [5] dplyr_1.1.4 purrr_1.0.2 readr_2.1.4 tidyr_1.3.0
#> [9] tibble_3.2.1 ggplot2_3.4.4 tidyverse_2.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] styler_1.10.2 sass_0.4.7 utf8_1.2.4
#> [4] generics_0.1.3 stringi_1.8.2 distill_1.6
#> [7] hms_1.1.3 digest_0.6.33 magrittr_2.0.3
#> [10] evaluate_0.23 grid_4.3.1 timechange_0.2.0
#> [13] fastmap_1.1.1 R.oo_1.25.0 R.cache_0.16.0
#> [16] rprojroot_2.0.4 jsonlite_1.8.8 R.utils_2.12.3
#> [19] fansi_1.0.5 scales_1.3.0 jquerylib_0.1.4
#> [22] cli_3.6.1 rlang_1.1.2 R.methodsS3_1.8.2
#> [25] munsell_0.5.0 withr_2.5.2 cachem_1.0.8
#> [28] yaml_2.3.7 tools_4.3.1 tzdb_0.4.0
#> [31] memoise_2.0.1 colorspace_2.1-0 vctrs_0.6.5
#> [34] R6_2.5.1 lifecycle_1.0.4 pkgconfig_2.0.3
#> [37] pillar_1.9.0 bslib_0.6.1 gtable_0.3.4
#> [40] glue_1.6.2 xfun_0.41 tidyselect_1.2.0
#> [43] rstudioapi_0.15.0 knitr_1.45 htmltools_0.5.7
#> [46] rmarkdown_2.25 compiler_4.3.1 downlit_0.4.3
See also the sessioninfo
package, which provide more details:
sessioninfo::session_info()
#> ─ Session info ─────────────────────────────────────────────────────
#> setting value
#> version R version 4.3.1 (2023-06-16)
#> os macOS Monterey 12.2.1
#> system aarch64, darwin20
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz America/Denver
#> date 2023-12-14
#> pandoc 3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ─────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> bslib 0.6.1 2023-11-28 [1] CRAN (R 4.3.1)
#> cachem 1.0.8 2023-05-01 [1] CRAN (R 4.3.0)
#> cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)
#> colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)
#> digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)
#> distill 1.6 2023-10-06 [1] CRAN (R 4.3.1)
#> downlit 0.4.3 2023-06-29 [1] CRAN (R 4.3.0)
#> dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.3.1)
#> evaluate 0.23 2023-11-01 [1] CRAN (R 4.3.1)
#> fansi 1.0.5 2023-10-08 [1] CRAN (R 4.3.1)
#> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)
#> forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)
#> generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)
#> ggplot2 * 3.4.4 2023-10-12 [1] CRAN (R 4.3.1)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)
#> gtable 0.3.4 2023-08-21 [1] CRAN (R 4.3.0)
#> here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.0)
#> hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)
#> htmltools 0.5.7 2023-11-03 [1] CRAN (R 4.3.1)
#> jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.3.0)
#> jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.3.1)
#> knitr 1.45 2023-10-30 [1] CRAN (R 4.3.1)
#> lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.3.1)
#> lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.3.1)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)
#> memoise 2.0.1 2021-11-26 [1] CRAN (R 4.3.0)
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)
#> pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)
#> purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)
#> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.3.0)
#> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.3.0)
#> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.3.0)
#> R.utils 2.12.3 2023-11-18 [1] CRAN (R 4.3.1)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)
#> readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)
#> rlang 1.1.2 2023-11-04 [1] CRAN (R 4.3.1)
#> rmarkdown 2.25 2023-09-18 [1] CRAN (R 4.3.1)
#> rprojroot 2.0.4 2023-11-05 [1] CRAN (R 4.3.1)
#> rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)
#> sass 0.4.7 2023-07-15 [1] CRAN (R 4.3.0)
#> scales 1.3.0 2023-11-28 [1] CRAN (R 4.3.1)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)
#> stringi 1.8.2 2023-11-23 [1] CRAN (R 4.3.1)
#> stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.3.1)
#> styler 1.10.2 2023-08-29 [1] CRAN (R 4.3.0)
#> tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)
#> tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)
#> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)
#> tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)
#> timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)
#> tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)
#> utf8 1.2.4 2023-10-22 [1] CRAN (R 4.3.1)
#> vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.3.1)
#> withr 2.5.2 2023-10-30 [1] CRAN (R 4.3.1)
#> xfun 0.41 2023-11-01 [1] CRAN (R 4.3.1)
#> yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)
#>
#> [1] /Users/kriemo/Library/R/arm64/4.3/library
#> [2] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
#>
#> ────────────────────────────────────────────────────────────────────
The renv
package allows you to have a separate set of R packages for each project. It also can record and restore the set of R pacakges used in a project. This is very helpful when you need to return to a project months (or years) later and want to have the same set of packages. It also makes it easier to share your packages with collaborators.
See also:
conda for managing various command line programs (python, R, c, etc.)
docker for generating a fully reproducible operating system environment.
microbenchmark
and profvis
# example, compare base and readr csv reading functions
path_to_file <- here("data/class3/dmel_peptides_lifecycle.csv.gz")
res <- microbenchmark::microbenchmark(
base = read.csv(path_to_file),
readr = readr::read_csv(path_to_file),
data.table = data.table::fread(path_to_file),
times = 5,
unit = "ms"
)
print(res, signif = 2)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> base 750 760 800 800 810 880 5
#> readr 180 190 230 230 260 300 5
#> data.table 140 190 190 200 210 210 5
R has a debugger built in. You can debug a function:
e.g.:
Rstudio has great support for debugging functions in Rscripts or in packages:
https://support.posit.co/hc/en-us/articles/200713843-Debugging-R-code-with-the-RStudio-IDE
cool_function <- function(x) {
internal_function <- function(y) {
hard_to_find <- function(z) {
"doesn't work here" + 10
}
hard_to_find()
}
internal_function()
}
cool_function(1)
traceback()
It’s surprisingly easy, particularly with Rstudio, to write your own R package to store your code. Putting your code in a package makes it much easier to debug, document, add tests, and distribute your code.
shiny
, interactive web app for data explorationMaking an interactive interface to data and plotting is easy in R. Examples and corresponding code can be found at https://shiny.rstudio.com/gallery/.
There are generally packages built to read and write from most formats. Some examples:
Check out jsonlite
library(jsonlite)
json_file <- "http://api.worldbank.org/country?per_page=10®ion=OED&lendingtype=LNX&format=json"
worldbank_data <- fromJSON(json_file, flatten=TRUE)
worldbank_data
#> [[1]]
#> [[1]]$page
#> [1] 1
#>
#> [[1]]$pages
#> [1] 4
#>
#> [[1]]$per_page
#> [1] "10"
#>
#> [[1]]$total
#> [1] 32
#>
#>
#> [[2]]
#> id iso2Code name capitalCity longitude latitude region.id
#> 1 AUS AU Australia Canberra 149.129 -35.282 EAS
#> 2 AUT AT Austria Vienna 16.3798 48.2201 ECS
#> 3 BEL BE Belgium Brussels 4.36761 50.8371 ECS
#> 4 CAN CA Canada Ottawa -75.6919 45.4215 NAC
#> 5 CHE CH Switzerland Bern 7.44821 46.948 ECS
#> 6 CZE CZ Czechia Prague 14.4205 50.0878 ECS
#> 7 DEU DE Germany Berlin 13.4115 52.5235 ECS
#> 8 DNK DK Denmark Copenhagen 12.5681 55.6763 ECS
#> 9 ESP ES Spain Madrid -3.70327 40.4167 ECS
#> 10 EST EE Estonia Tallinn 24.7586 59.4392 ECS
#> region.iso2code region.value adminregion.id
#> 1 Z4 East Asia & Pacific
#> 2 Z7 Europe & Central Asia
#> 3 Z7 Europe & Central Asia
#> 4 XU North America
#> 5 Z7 Europe & Central Asia
#> 6 Z7 Europe & Central Asia
#> 7 Z7 Europe & Central Asia
#> 8 Z7 Europe & Central Asia
#> 9 Z7 Europe & Central Asia
#> 10 Z7 Europe & Central Asia
#> adminregion.iso2code adminregion.value incomeLevel.id
#> 1 HIC
#> 2 HIC
#> 3 HIC
#> 4 HIC
#> 5 HIC
#> 6 HIC
#> 7 HIC
#> 8 HIC
#> 9 HIC
#> 10 HIC
#> incomeLevel.iso2code incomeLevel.value lendingType.id
#> 1 XD High income LNX
#> 2 XD High income LNX
#> 3 XD High income LNX
#> 4 XD High income LNX
#> 5 XD High income LNX
#> 6 XD High income LNX
#> 7 XD High income LNX
#> 8 XD High income LNX
#> 9 XD High income LNX
#> 10 XD High income LNX
#> lendingType.iso2code lendingType.value
#> 1 XX Not classified
#> 2 XX Not classified
#> 3 XX Not classified
#> 4 XX Not classified
#> 5 XX Not classified
#> 6 XX Not classified
#> 7 XX Not classified
#> 8 XX Not classified
#> 9 XX Not classified
#> 10 XX Not classified
Check out rtracklayer
and Rsamtools
:
e.g. read a FASTA file into R:
library(Rsamtools)
# get path to test file included in a package
fasta_file <- system.file('extdata', 'ce2dict1.fa', package = 'Rsamtools')
scanFa(fasta_file)
#> DNAStringSet object of length 5:
#> width seq names
#> [1] 18 GCGAAACTAGGAGAGGCT pattern01
#> [2] 25 CTGTTAGCTAATTTTAAAAATAAAT pattern02
#> [3] 24 ACTACCACCCAAATTTAGATATTC pattern03
#> [4] 24 AAATTTTTTTTGTTGCAAATTTGA pattern04
#> [5] 25 TCTTCTTGGCTTTGGTGGTACTTTT pattern05
The command line can be accessed via the Terminal app on macOS, or using the windows subsystem for linux (WSL).
There command line is a place where you can run executable programs (a C, python, R, or whatever). It’s what using a computer looked like before the existence of a Graphical User Interface. It is impossible to conduct data analysis without gaining some experience with working on the command line.
R is an executable, and we can pull up an R console using:
R
In Rmarkdown you can also include other languages including bash (which is a common language of the command line).You need to change the r
to bash
in the code chunk (or to python or other languages).
You can run simple commands by using Rscript
or R
with the -e
option.
R -e "print('hello')"
#>
#> R version 4.3.1 (2023-06-16) -- "Beagle Scouts"
#> Copyright (C) 2023 The R Foundation for Statistical Computing
#> Platform: aarch64-apple-darwin20 (64-bit)
#>
#> R is free software and comes with ABSOLUTELY NO WARRANTY.
#> You are welcome to redistribute it under certain conditions.
#> Type 'license()' or 'licence()' for distribution details.
#>
#> Natural language support but running in an English locale
#>
#> R is a collaborative project with many contributors.
#> Type 'contributors()' for more information and
#> 'citation()' on how to cite R or R packages in publications.
#>
#> Type 'demo()' for some demos, 'help()' for on-line help, or
#> 'help.start()' for an HTML browser interface to help.
#> Type 'q()' to quit R.
#>
#> > print('hello')
#> [1] "hello"
#> >
#> >
Rscript -e "print('hello')"
#> [1] "hello"
Alternatively you can write a R script, which can be then called from Rscript. For example if we wrote an R script called cool_function.R
.
#!/usr/bin/env Rscript # allows calling with ./cool_function.R if executable
= commandArgs(trailingOnly=TRUE) # collect command line arguments
args print(args) # args is a list e.g. argument1 argument2...
We could call on the command line:
Rscript path/to/cool_function.R argument1 argument2 ...
#or
path/to/cool_function.R argument1 argument2 ...
Git is a command line tool for version control, which allows us to:
roll back code to a previous state if needed
branched development, tackling individual issues/tasks
collaboration
Git was first created by Linus Torvalds for coordinating development of Linux. Read this guide for Getting started , checkout this interactive guide and check out this Tutorial written from an R data analyst perspective.
# for bioinformatics, get comfortable with command line too
ls
git status # list changes to tracked files
git blame resources.Rmd # see who contributed
git commit -m "added something cool" # save state
git push # push git to a git repository (e.g. github)
git pull # pull changes from git repository
This can be handled by Rstudio as well (new tab next to Connections
and Build
)
As you write more code, especially as functions and script pipelines, hosting and documenting them on GitHub is great way to make them portable and searchable. Even the free tier of GitHub accounts now has private repositories (repos).
If you have any interest in a career in data science/informatics, GitHub is also a common showcase of what (and how well/often) you can code. After some accumulation of code, definitely put your GitHub link on your CV/resume.
Check out the quickstart from github: https://docs.github.com/en/get-started/quickstart/hello-world
Every package should include README, installation instructions, and a maintained issues
page where questions and bugs can be reported and addressed. Example: readr GitHub page Don’t be afraid to file new issues, but very often your problems are already answered in the closed
section.
In most cases, what you need is already made into well-documented packages, and you don’t have to reinvent the wheel (but sometimes you should?). Depending on where the package is curated, installation is different. Some examples below:
Gviz
- visualize gene modeleuler
- making custom euler/venn diagramsemo
- inserting emojis into Rmd# devtools::install_github("hadley/emo") # from github
emo::ji("smile")
#> 😄
2,000+ R packages dedicated to bioinformatics. Includes a coherent framework of data structures (e.g. SummarizedExperiment) built by dedicated Core members. Also includes many annotation and experimental datasets built into R packages and objects (See AnnotationHub and ExperimentHub)
BiocManager::install()
to install these packagesThe R studio community forums are a great resource for asking questions about tidyverse related packages.
StackOverflow provides user-contributed questions and answers on a variety of topics.
For help with bioconductor packages, visit the Bioc support page
Find out if others are having similar issues by searching the issue on the package GitHub page.
Rstudio links to common ones here: Help
-> Cheatsheets
. More are hosted online, such as for regular expressions.
The RBI fellows hold standing office hours on Thursdays over zoom. We are happy to help out with coding and RNA/DNA-related informatics questions. Send us an email to schedule a time (rbi.fellows@cuanschutz.edu
).
No one writes perfect code. Developers often expect that there will be bugs in their code. If you suspect bugs or mishandled edge cases, go to the package GitHub and search the issues section to see if the problem has been reported or fixed. If not, submit an issue that describes the problem.
The reprex package makes it easy to produce well-formatted reproducible examples that demonstrate the problem. Often developers will be thankful for your help with making their software better.