Skip to contents

Please fill out end-of-class survey to improve future classes

Class material can be accessed at:

https://rnabioco.github.io/practical-data-analysis/

1. Speeding up repeated Rmd knitting

Read the Guide to RMarkdown for an exhaustive description of the various formats and options for using RMarkdown documents. Note that HTMLs and slides for this class were all made from Rmd.

Caching

You can speed up knitting of your Rmds by using caching to store the results from each chunk, instead of rerunning them each time. Note that if you modify the code chunk, previous caching is ignored.

For each chunk, set {r, cache = TRUE}

Or the option can be set globally at top of the document. Like this:

knitr::opts_chunk$set(cache = TRUE)

Alternatively, save and load .Rds/.Rda files

Run once, save, and load instead of rerunning resource intensive parts.

If you have non-deterministic functions, such as kmeans, remember to set.seed, or save and load result objects.

2. Finding useful packages

In most cases, what you need is already made into well-documented packages, and you don’t have to reinvent the wheel (but sometimes you should?). Depending on where the package is curated, installation is different. Some examples below:

  1. Gviz - visualize gene model
  2. VennDiagram - making custom venn diagrams
  3. emo - inserting emojis into Rmd
# BiocManager::install("Gviz") # from bioconductor
vignette("Gviz")

# install.packages("eulerr") # from CRAN
plot(eulerr::euler(list(set1 = c("geneA", "geneB", "geneC"), 
                       set2 = c("geneC", "geneD"))))

# devtools::install_github("hadley/emo") # from github
emo::ji("smile")

3. Git - version control

Advantages of using version control include:

  1. rolling back code if needed

  2. branched development, tackling individual issues/tasks

  3. collaboration, etc

Git was first created by Linus Torvalds for coordinating development of Linux. Read this guide for Getting started and try out the Interactive practice.

# for bioinformatics, get comfortable with command line too
# and avoid using windows for compatibility issues

ls
git status # list changes
git blame assignments.Rmd # see who contributed

This can be handled by Rstudio as well (new tab next to Connections and Build)

Put your code on GitHub

As you write more code, especially as functions and script pipelines, hosting and documenting them on GitHub is great way to make them portable and searchable. Even the free tier of GitHub accounts now has private repositories (repo).

If you have any interest in a career in data science/informatics, GitHub is also a common showcase of what (and how well/often) you can code. After some accumulation of code, definitely put your GitHub link on your CV/resume.

Example repo (RBI) - valr

Refer to package examples and guide to building packages to build your own package and website documentation.

Asking for help with other packages on GitHub

Every package should include README, installation instructions, and a maintained issues page where questions and bugs can be reported and addressed. Example: readr GitHub page Don’t be afraid to file new issues, but very often your problems are already answered in the closed section.

4. Other tips/best practices

a. Code and data organization

Read this: A Quick Guide to Organizing Computational Biology Projects

NEWS.md             # markdown document for tracking progress
data                # data directory for storing raw and processed data
dbases              # database files downloaded from other places
docs                # publication documents and random project files
your-project.Rproj  # Make an Rproject in Rstudio for every project
results             # store all RMarkdown here
src                 # store useful scripts here

a2. here, folder structure management

https://github.com/jennybc/here_here

here::here() # always points to top-level of current project

here::here("vignettes", "class-12.Rmd") # never confused about folder structure

b. styler, clean up code readability

Refer to this style guide often, so you don’t have to go back to make the code readable/publishable later.

styler::style_text("
my_fun <- function(x, 
y, 
z) {
  x+
    z
}
                   ")
#> my_fun <- function(x,
#>                    y,
#>                    z) {
#>   x +
#>     z
#> }

# styler::style_file   # for an entire file
# styler::style_dir    # or folder

c. Benchmarking, with microbenchmark and profvis

# example, compare base and readr csv reading functions
path_to_file <- system.file("extdata", "gene_tibble.csv", package = "pbda")
res <- microbenchmark::microbenchmark(
  base = read.csv(path_to_file),
  readr = readr::read_csv(path_to_file),
  times = 5
)
print(res, signif = 2)
microbenchmark:::autoplot.microbenchmark(res)

# example, looking at each step of a script
library(dplyr)
p <- profvis::profvis({
  path_to_file <- system.file("extdata", "gene_tibble.csv", package = "pbda")
  c1 <- readr::read_csv(path_to_file)
  c1 <- c1 %>%
    filter(padj < 0.00001) %>%
    mutate(gene = stringr::str_to_lower(gene))
})
p

d. shiny, interactive web app for data exploration

Making an interactive interface to data and plotting is easy in R. Examples and corresponding code can be found at https://shiny.rstudio.com/gallery/.

e. Running R script from terminal

Save functions in .R files to easily source() them later. If you have code that does not need input from beginning to end, .R files can be executed from the command line/terminal as well.

# example, mat.R that makes matrix and does calculations
Rscript /Users/rf/teach/practical-data-analysis/R/mat.R

5. Finding help online

The R studio community forums are a great resource for asking questions about tidyverse related packages.

StackOverflow provides user-contributed questions and answers on a variety of topics.

For help with bioconductor packages, visit the Bioc support page

Find out if others are having similar issues by searching the issue on the package GitHub page.

General R coding resources

Hadley Wickham has written very good (free) ebooks on using R.

R for Data Science

Advanced R

Bioinformatic resources

For general bioinformatics advice, we found the following text very useful.

Bioinformatics Data Skills by Vince Buffalo

For statistics related to bioinformatics, this free course is excellent:

PH525x series - Biomedical Data Science

For more detailed descriptions of single cell RNA-Seq analysis:

https://scrnaseq-course.cog.sanger.ac.uk/website/index.html

https://satijalab.org/seurat/vignettes.html

Cheat sheets

Rstudio links to common ones here: Help -> Cheatsheets. More are hosted online, such as for regular expressions.

Useful to keep your own stash too.

6. Offline help

The RBI fellows hold standing office hours on Thursdays, walk-ins 1-2pm, by appointment 2-4pm, in RC1-South Room 9101. We are happy to help out with coding and RNA/DNA-related informatics questions. Send us an email to let us know you will be stopping by (rbi.fellows@cuanschutz.edu).

7. Sometimes code is just broken

No one writes perfect code. A certain highly-regarded and cited scRNA-seq package drew cell type progression arrows in the wrong direction, for months before the issue was fixed!

If you suspect bugs or mishandled edge cases, go to the package GitHub and search the issues section to see if the problem has been reported or fixed. If not, submit an issue that describes the problem. The reprex package makes it easy to produce well-formatted reproducible examples that demonstrate the problem. Often developers will be thankful for your help with making their software better.

8. Use R often. Don’t use Excel!

Excel:

  1. Gets unwieldy with large files

  2. Poor plotting options

  3. Hard to do dplyr-like manipulations

  4. Hard to integrate with other informatics tools

  5. Annoying to apply same analysis over multiple files

  6. And random things like this: