Practical Biological Data Analysis: Class wrap up: Data analysis, tips and resources

Rmarkdown and Quarto

Read the Guide to RMarkdown for an exhaustive description of the various formats and options for using RMarkdown documents. Note that HTML for this class were all made from Rmd, using the distill blog format

There is also a newer format, also built by Rstudio (now named Posit) called Quarto. Quarto documents are very similar to RMarkdown, have broader support for additional programming languages, and will likely eventually replace the Rmarkdown format.

The Rmarkdown for this post is on github

Caching

You can speed up knitting of your Rmds by using caching to store the results from each chunk, instead of rerunning them each time. Note that if you modify the code chunk, previous caching is ignored.

For each chunk, set {r, cache = TRUE}

Or the option can be set globally at top of the document. Like this:

knitr::opts_chunk$set(cache = TRUE)

Better, save and load .rds files

Run once, save, and load instead of rerunning resource intensive parts.

If you have non-deterministic functions, such as kmeans, remember to set.seed, or save and load result objects.

if(!file.exists("long-step.rds")){
  ...
  code or a script
  source("path/to/script.R")
  ...
  saveRDS(obj, "long-step.rds")
} else {
  obj <- readRDS("long-step.rds")
}

`here`, folder structure management

https://github.com/jennybc/here_here

here::here() # always points to top-level of current project

here::here("_posts", "2022-10-03-install-r", "install-r.Rmd") # never confused about folder structure

`styler`, clean up code readability

Refer to this style guide often, so you don’t have to go back to make the code readable/publishable later.

styler::style_text("
my_fun <- function(x, 
y, 
z) {
  x+ z
}
                   ")

#> my_fun <- function(
#>     x,
#>     y,
#>     z) {
#>   x + z
#> }


# styler::style_file   # for an entire file
# styler::style_dir    # directory

Reproducibility

Print environment information with sessionInfo()

It’s very helpful to have a record of which packages were used in an analysis. One approach is to print the sessionInfo().

Show session info

sessionInfo()

#> R version 4.3.1 (2023-06-16)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Monterey 12.2.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: America/Denver
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods  
#> [7] base     
#> 
#> other attached packages:
#>  [1] here_1.0.1      lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1  
#>  [5] dplyr_1.1.4     purrr_1.0.2     readr_2.1.4     tidyr_1.3.0    
#>  [9] tibble_3.2.1    ggplot2_3.4.4   tidyverse_2.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] styler_1.10.2     sass_0.4.7        utf8_1.2.4       
#>  [4] generics_0.1.3    stringi_1.8.2     distill_1.6      
#>  [7] hms_1.1.3         digest_0.6.33     magrittr_2.0.3   
#> [10] evaluate_0.23     grid_4.3.1        timechange_0.2.0 
#> [13] fastmap_1.1.1     R.oo_1.25.0       R.cache_0.16.0   
#> [16] rprojroot_2.0.4   jsonlite_1.8.8    R.utils_2.12.3   
#> [19] fansi_1.0.5       scales_1.3.0      jquerylib_0.1.4  
#> [22] cli_3.6.1         rlang_1.1.2       R.methodsS3_1.8.2
#> [25] munsell_0.5.0     withr_2.5.2       cachem_1.0.8     
#> [28] yaml_2.3.7        tools_4.3.1       tzdb_0.4.0       
#> [31] memoise_2.0.1     colorspace_2.1-0  vctrs_0.6.5      
#> [34] R6_2.5.1          lifecycle_1.0.4   pkgconfig_2.0.3  
#> [37] pillar_1.9.0      bslib_0.6.1       gtable_0.3.4     
#> [40] glue_1.6.2        xfun_0.41         tidyselect_1.2.0 
#> [43] rstudioapi_0.15.0 knitr_1.45        htmltools_0.5.7  
#> [46] rmarkdown_2.25    compiler_4.3.1    downlit_0.4.3

See also the sessioninfo package, which provide more details:

sessioninfo::session_info()

#> ─ Session info ─────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.1 (2023-06-16)
#>  os       macOS Monterey 12.2.1
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/Denver
#>  date     2023-12-14
#>  pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ─────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  bslib         0.6.1   2023-11-28 [1] CRAN (R 4.3.1)
#>  cachem        1.0.8   2023-05-01 [1] CRAN (R 4.3.0)
#>  cli           3.6.1   2023-03-23 [1] CRAN (R 4.3.0)
#>  colorspace    2.1-0   2023-01-23 [1] CRAN (R 4.3.0)
#>  digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.0)
#>  distill       1.6     2023-10-06 [1] CRAN (R 4.3.1)
#>  downlit       0.4.3   2023-06-29 [1] CRAN (R 4.3.0)
#>  dplyr       * 1.1.4   2023-11-17 [1] CRAN (R 4.3.1)
#>  evaluate      0.23    2023-11-01 [1] CRAN (R 4.3.1)
#>  fansi         1.0.5   2023-10-08 [1] CRAN (R 4.3.1)
#>  fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
#>  forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.3.0)
#>  generics      0.1.3   2022-07-05 [1] CRAN (R 4.3.0)
#>  ggplot2     * 3.4.4   2023-10-12 [1] CRAN (R 4.3.1)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.3.0)
#>  gtable        0.3.4   2023-08-21 [1] CRAN (R 4.3.0)
#>  here        * 1.0.1   2020-12-13 [1] CRAN (R 4.3.0)
#>  hms           1.1.3   2023-03-21 [1] CRAN (R 4.3.0)
#>  htmltools     0.5.7   2023-11-03 [1] CRAN (R 4.3.1)
#>  jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.3.0)
#>  jsonlite      1.8.8   2023-12-04 [1] CRAN (R 4.3.1)
#>  knitr         1.45    2023-10-30 [1] CRAN (R 4.3.1)
#>  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.3.1)
#>  lubridate   * 1.9.3   2023-09-27 [1] CRAN (R 4.3.1)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
#>  memoise       2.0.1   2021-11-26 [1] CRAN (R 4.3.0)
#>  munsell       0.5.0   2018-06-12 [1] CRAN (R 4.3.0)
#>  pillar        1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.3.0)
#>  purrr       * 1.0.2   2023-08-10 [1] CRAN (R 4.3.0)
#>  R.cache       0.16.0  2022-07-21 [1] CRAN (R 4.3.0)
#>  R.methodsS3   1.8.2   2022-06-13 [1] CRAN (R 4.3.0)
#>  R.oo          1.25.0  2022-06-12 [1] CRAN (R 4.3.0)
#>  R.utils       2.12.3  2023-11-18 [1] CRAN (R 4.3.1)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
#>  readr       * 2.1.4   2023-02-10 [1] CRAN (R 4.3.0)
#>  rlang         1.1.2   2023-11-04 [1] CRAN (R 4.3.1)
#>  rmarkdown     2.25    2023-09-18 [1] CRAN (R 4.3.1)
#>  rprojroot     2.0.4   2023-11-05 [1] CRAN (R 4.3.1)
#>  rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.3.0)
#>  sass          0.4.7   2023-07-15 [1] CRAN (R 4.3.0)
#>  scales        1.3.0   2023-11-28 [1] CRAN (R 4.3.1)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
#>  stringi       1.8.2   2023-11-23 [1] CRAN (R 4.3.1)
#>  stringr     * 1.5.1   2023-11-14 [1] CRAN (R 4.3.1)
#>  styler        1.10.2  2023-08-29 [1] CRAN (R 4.3.0)
#>  tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
#>  tidyr       * 1.3.0   2023-01-24 [1] CRAN (R 4.3.0)
#>  tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.3.0)
#>  tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.3.0)
#>  timechange    0.2.0   2023-01-11 [1] CRAN (R 4.3.0)
#>  tzdb          0.4.0   2023-05-12 [1] CRAN (R 4.3.0)
#>  utf8          1.2.4   2023-10-22 [1] CRAN (R 4.3.1)
#>  vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.3.1)
#>  withr         2.5.2   2023-10-30 [1] CRAN (R 4.3.1)
#>  xfun          0.41    2023-11-01 [1] CRAN (R 4.3.1)
#>  yaml          2.3.7   2023-01-23 [1] CRAN (R 4.3.0)
#> 
#>  [1] /Users/kriemo/Library/R/arm64/4.3/library
#>  [2] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
#> 
#> ────────────────────────────────────────────────────────────────────

Use renv to manage package dependencies

The renv package allows you to have a separate set of R packages for each project. It also can record and restore the set of R pacakges used in a project. This is very helpful when you need to return to a project months (or years) later and want to have the same set of packages. It also makes it easier to share your packages with collaborators.

Benchmarking, with `microbenchmark` and `profvis`

# example, compare base and readr csv reading functions
path_to_file <- here("data/class3/dmel_peptides_lifecycle.csv.gz")

res <- microbenchmark::microbenchmark(
  base = read.csv(path_to_file),
  readr = readr::read_csv(path_to_file),
  data.table = data.table::fread(path_to_file),
  times = 5,
  unit = "ms"
)
print(res, signif = 2)

#> Unit: milliseconds
#>        expr min  lq mean median  uq max neval
#>        base 750 760  800    800 810 880     5
#>       readr 180 190  230    230 260 300     5
#>  data.table 140 190  190    200 210 210     5

# example, looking at each step of a script
library(tidyverse)
p <- profvis::profvis({

  df <- readr::read_csv(path_to_file)
  df <- df %>%
    filter(e02_4 < 50) %>%
    mutate(sequence = str_to_lower(Sequence))
})
p

Debugging R code

R has a debugger built in. You can debug a function:

e.g.:

debug(read_csv) # set a function to debug
read_csv(path_to_file) # will enter debug mode
undebug(read_csv)

Rstudio has great support for debugging functions in Rscripts or in packages:

https://support.posit.co/hc/en-us/articles/200713843-Debugging-R-code-with-the-RStudio-IDE

Look at the call stack with traceback()

cool_function <- function(x) {
  
  internal_function <- function(y) {
    
    hard_to_find <- function(z) {
      "doesn't work here" + 10
    }
    
    hard_to_find()
  }
  internal_function()
}

cool_function(1)
traceback()

Building your own R package

It’s surprisingly easy, particularly with Rstudio, to write your own R package to store your code. Putting your code in a package makes it much easier to debug, document, add tests, and distribute your code.

`shiny`, interactive web app for data exploration

Making an interactive interface to data and plotting is easy in R. Examples and corresponding code can be found at https://shiny.rstudio.com/gallery/.

Parsing specific types of formats:

There are generally packages built to read and write from most formats. Some examples:

JSON

Check out jsonlite

library(jsonlite)
json_file <- "http://api.worldbank.org/country?per_page=10&region=OED&lendingtype=LNX&format=json"
worldbank_data <- fromJSON(json_file, flatten=TRUE)
worldbank_data

#> [[1]]
#> [[1]]$page
#> [1] 1
#> 
#> [[1]]$pages
#> [1] 4
#> 
#> [[1]]$per_page
#> [1] "10"
#> 
#> [[1]]$total
#> [1] 32
#> 
#> 
#> [[2]]
#>     id iso2Code        name capitalCity longitude latitude region.id
#> 1  AUS       AU   Australia    Canberra   149.129  -35.282       EAS
#> 2  AUT       AT     Austria      Vienna   16.3798  48.2201       ECS
#> 3  BEL       BE     Belgium    Brussels   4.36761  50.8371       ECS
#> 4  CAN       CA      Canada      Ottawa  -75.6919  45.4215       NAC
#> 5  CHE       CH Switzerland        Bern   7.44821   46.948       ECS
#> 6  CZE       CZ     Czechia      Prague   14.4205  50.0878       ECS
#> 7  DEU       DE     Germany      Berlin   13.4115  52.5235       ECS
#> 8  DNK       DK     Denmark  Copenhagen   12.5681  55.6763       ECS
#> 9  ESP       ES       Spain      Madrid  -3.70327  40.4167       ECS
#> 10 EST       EE     Estonia     Tallinn   24.7586  59.4392       ECS
#>    region.iso2code          region.value adminregion.id
#> 1               Z4   East Asia & Pacific               
#> 2               Z7 Europe & Central Asia               
#> 3               Z7 Europe & Central Asia               
#> 4               XU         North America               
#> 5               Z7 Europe & Central Asia               
#> 6               Z7 Europe & Central Asia               
#> 7               Z7 Europe & Central Asia               
#> 8               Z7 Europe & Central Asia               
#> 9               Z7 Europe & Central Asia               
#> 10              Z7 Europe & Central Asia               
#>    adminregion.iso2code adminregion.value incomeLevel.id
#> 1                                                    HIC
#> 2                                                    HIC
#> 3                                                    HIC
#> 4                                                    HIC
#> 5                                                    HIC
#> 6                                                    HIC
#> 7                                                    HIC
#> 8                                                    HIC
#> 9                                                    HIC
#> 10                                                   HIC
#>    incomeLevel.iso2code incomeLevel.value lendingType.id
#> 1                    XD       High income            LNX
#> 2                    XD       High income            LNX
#> 3                    XD       High income            LNX
#> 4                    XD       High income            LNX
#> 5                    XD       High income            LNX
#> 6                    XD       High income            LNX
#> 7                    XD       High income            LNX
#> 8                    XD       High income            LNX
#> 9                    XD       High income            LNX
#> 10                   XD       High income            LNX
#>    lendingType.iso2code lendingType.value
#> 1                    XX    Not classified
#> 2                    XX    Not classified
#> 3                    XX    Not classified
#> 4                    XX    Not classified
#> 5                    XX    Not classified
#> 6                    XX    Not classified
#> 7                    XX    Not classified
#> 8                    XX    Not classified
#> 9                    XX    Not classified
#> 10                   XX    Not classified

Genomics data (fasta, fastq, vcf, bam, bed, bigwig)

Check out rtracklayer and Rsamtools:

e.g. read a FASTA file into R:

library(Rsamtools)

# get path to test file included in a package
fasta_file <- system.file('extdata', 'ce2dict1.fa', package = 'Rsamtools')
scanFa(fasta_file)

#> DNAStringSet object of length 5:
#>     width seq                                     names               
#> [1]    18 GCGAAACTAGGAGAGGCT                      pattern01
#> [2]    25 CTGTTAGCTAATTTTAAAAATAAAT               pattern02
#> [3]    24 ACTACCACCCAAATTTAGATATTC                pattern03
#> [4]    24 AAATTTTTTTTGTTGCAAATTTGA                pattern04
#> [5]    25 TCTTCTTGGCTTTGGTGGTACTTTT               pattern05

Using R on the command-line

The command line can be accessed via the Terminal app on macOS, or using the windows subsystem for linux (WSL).

There command line is a place where you can run executable programs (a C, python, R, or whatever). It’s what using a computer looked like before the existence of a Graphical User Interface. It is impossible to conduct data analysis without gaining some experience with working on the command line.

R is an executable, and we can pull up an R console using:

In Rmarkdown you can also include other languages including bash (which is a common language of the command line).You need to change the r to bash in the code chunk (or to python or other languages).

You can run simple commands by using Rscript or R with the -e option.

R -e "print('hello')"
#> 
#> R version 4.3.1 (2023-06-16) -- "Beagle Scouts"
#> Copyright (C) 2023 The R Foundation for Statistical Computing
#> Platform: aarch64-apple-darwin20 (64-bit)
#> 
#> R is free software and comes with ABSOLUTELY NO WARRANTY.
#> You are welcome to redistribute it under certain conditions.
#> Type 'license()' or 'licence()' for distribution details.
#> 
#>   Natural language support but running in an English locale
#> 
#> R is a collaborative project with many contributors.
#> Type 'contributors()' for more information and
#> 'citation()' on how to cite R or R packages in publications.
#> 
#> Type 'demo()' for some demos, 'help()' for on-line help, or
#> 'help.start()' for an HTML browser interface to help.
#> Type 'q()' to quit R.
#> 
#> > print('hello')
#> [1] "hello"
#> > 
#> >

Rscript -e "print('hello')"
#> [1] "hello"

Alternatively you can write a R script, which can be then called from Rscript. For example if we wrote an R script called cool_function.R.

#!/usr/bin/env Rscript  # allows calling with ./cool_function.R if executable

args = commandArgs(trailingOnly=TRUE) # collect command line arguments 
print(args) # args is a list e.g. argument1 argument2...

We could call on the command line:

Rscript path/to/cool_function.R argument1 argument2 ...
#or
path/to/cool_function.R argument1 argument2 ...

Git and Github

From https://jmcglone.com/guides/github-pages/

Git is a command line tool for version control, which allows us to:

roll back code to a previous state if needed
branched development, tackling individual issues/tasks
collaboration

From https://blog.programster.org/git-workflows

Git was first created by Linus Torvalds for coordinating development of Linux. Read this guide for Getting started , checkout this interactive guide and check out this Tutorial written from an R data analyst perspective.

# for bioinformatics, get comfortable with command line too

ls
git status # list changes to tracked files
git blame resources.Rmd    # see who contributed
git commit -m "added something cool" # save state
git push # push git to a git repository (e.g. github)
git pull # pull changes from git repository

This can be handled by Rstudio as well (new tab next to Connections and Build)

Put your code on GitHub

As you write more code, especially as functions and script pipelines, hosting and documenting them on GitHub is great way to make them portable and searchable. Even the free tier of GitHub accounts now has private repositories (repos).

If you have any interest in a career in data science/informatics, GitHub is also a common showcase of what (and how well/often) you can code. After some accumulation of code, definitely put your GitHub link on your CV/resume.

Check out the quickstart from github: https://docs.github.com/en/get-started/quickstart/hello-world

Example repos

this class
valr
r-source # mirror of R source code
readr

Asking for help with other packages on GitHub

Every package should include README, installation instructions, and a maintained issues page where questions and bugs can be reported and addressed. Example: readr GitHub page Don’t be afraid to file new issues, but very often your problems are already answered in the closed section.

Finding useful packages

In most cases, what you need is already made into well-documented packages, and you don’t have to reinvent the wheel (but sometimes you should?). Depending on where the package is curated, installation is different. Some examples below:

Gviz - visualize gene model
euler - making custom euler/venn diagrams
emo - inserting emojis into Rmd

# BiocManager::install("Gviz") # from bioconductor
vignette("Gviz")

# install.packages("eulerr") # from CRAN
plot(eulerr::euler(list(set1 = c("geneA", "geneB", "geneC"), 
                       set2 = c("geneC", "geneD"))))

# devtools::install_github("hadley/emo") # from github
emo::ji("smile")

#> 😄

Bioconductor

2,000+ R packages dedicated to bioinformatics. Includes a coherent framework of data structures (e.g. SummarizedExperiment) built by dedicated Core members. Also includes many annotation and experimental datasets built into R packages and objects (See AnnotationHub and ExperimentHub)

https://bioconductor.org/
Use BiocManager::install() to install these packages
RNA-seq workflow
Orchestrating single cell analysis

Finding help online

The R studio community forums are a great resource for asking questions about tidyverse related packages.

StackOverflow provides user-contributed questions and answers on a variety of topics.

For help with bioconductor packages, visit the Bioc support page

Find out if others are having similar issues by searching the issue on the package GitHub page.

Cheat sheets

Rstudio links to common ones here: Help -> Cheatsheets. More are hosted online, such as for regular expressions.

Offline help

The RBI fellows hold standing office hours on Thursdays over zoom. We are happy to help out with coding and RNA/DNA-related informatics questions. Send us an email to schedule a time (rbi.fellows@cuanschutz.edu).

Sometimes code is just broken

No one writes perfect code. Developers often expect that there will be bugs in their code. If you suspect bugs or mishandled edge cases, go to the package GitHub and search the issues section to see if the problem has been reported or fixed. If not, submit an issue that describes the problem.

The reprex package makes it easy to produce well-formatted reproducible examples that demonstrate the problem. Often developers will be thankful for your help with making their software better.

Class wrap up: Data analysis, tips and resources

Rmarkdown and Quarto

Caching

Better, save and load .rds files

`here`, folder structure management

`styler`, clean up code readability

Reproducibility

Print environment information with sessionInfo()

Use renv to manage package dependencies

Benchmarking, with `microbenchmark` and `profvis`

Debugging R code

Look at the call stack with traceback()

Building your own R package

`shiny`, interactive web app for data exploration

Parsing specific types of formats:

JSON

Genomics data (fasta, fastq, vcf, bam, bed, bigwig)

Using R on the command-line

Git and Github

Put your code on GitHub

Example repos

Asking for help with other packages on GitHub

Finding useful packages

Bioconductor

Finding help online

Cheat sheets

Offline help

Sometimes code is just broken

Additional Resources

General/Data science

Genomics

A meta-list of R resources

Writing high-performance R functions with R + C++

Class wrap up: Data analysis, tips and resources

Rmarkdown and Quarto

Caching

Better, save and load .rds files

here, folder structure management

styler, clean up code readability

Reproducibility

Print environment information with sessionInfo()

Use renv to manage package dependencies

Benchmarking, with microbenchmark and profvis

Debugging R code

Look at the call stack with traceback()

Building your own R package

shiny, interactive web app for data exploration

Parsing specific types of formats:

JSON

Genomics data (fasta, fastq, vcf, bam, bed, bigwig)

Using R on the command-line

Git and Github

Put your code on GitHub

Example repos

Asking for help with other packages on GitHub

Finding useful packages

Bioconductor

Finding help online

Cheat sheets

Offline help

Sometimes code is just broken

Additional Resources

General/Data science

Genomics

A meta-list of R resources

Writing high-performance R functions with R + C++

`here`, folder structure management

`styler`, clean up code readability

Benchmarking, with `microbenchmark` and `profvis`

`shiny`, interactive web app for data exploration