The Rmarkdown for this class is on github
R code can be executed using R scripts, which have the .R
extension. R scripts can only contain R code, not plain text or markdown. Scripts are executed line by line starting at the top of the document.
R scripts are useful if you have code that you want to run but don’t need the additional functionality of an Rmarkdown. You can also put custom R functions or R expression into an .R script and then use them in another document. The source()
function will execute the R code in a Rscript.
# can be a path to a .R file or a URL
source("https://raw.githubusercontent.com/rnabioco/bmsc-7810-pbda/main/_posts/2023-11-27-class-2/custom-functions.R")
# defined in script at URL
greeting("class")
important_list
As an aside, on the command line (e.g. terminal) you can run a R script (or expression):
R -e 'print("Hello World")'
Rscript your_awesome_code.R
Rmarkdown is a reproducible framework to create, collaborate, and communicate your work.
Rmarkdown supports a number of output formats including pdfs, word documents, slide shows, html, etc.
An Rmarkdown document is a plain text file with the extension .Rmd
and contains the following basic components:
Rmarkdown documents are executable documents. You can execute the code and render the markdown into html using the render()
function, or alternatively by clicking the knit button in Rstudio.
We have spent a large amount of time focused on vectors because these are the fundamental building blocks of more complex data structures.
As we have seen we can use relational operators (e.g. ==
, >
, <=
) to compare values in a vector.
Returning to our state data, say we wanted to identify states that are located in the south or in the west. How might we approach this?
There are a few approaches:
We can combine relational operators with logical operators, such as the or
operator |
, similarly we can use the and
operator &
.
# return TRUE if state is in the South or the West
state.region == "South" | state.region == "West"
[1] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
[12] TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE
[23] FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE
[34] FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE
[45] FALSE TRUE TRUE TRUE FALSE TRUE
# states can't be in two regions, so these are all FALSE
state.region == "South" & state.region == "West"
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[45] FALSE FALSE FALSE FALSE FALSE FALSE
What if we wanted to ask if the state is in the South, West, or Northeast?
We could add another or
statement with |
state.region == "South" | state.region == "West" | state.region == "Northeast"
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[12] TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
[23] FALSE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
[34] FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
[45] TRUE TRUE TRUE TRUE FALSE TRUE
A more efficient approach when testing for the presence of multiple values is to use the %in%
operator. This operator tests if an element in a vector on the left is present in the vector on the right.
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[12] TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
[23] FALSE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
[34] FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
[45] TRUE TRUE TRUE TRUE FALSE TRUE
This is a very common operation used to select particular subsets of a vector.
What we want to find states not in the west or the south?
Again there are multiple approaches. We could use the !=
operator to ask if
a vector does not equal a value. We then combine this with the &
operator to find values that do not satisfy either condition.
# TRUE if state is not in the south AND the state is not in the WEST
state.region != "South" & state.region != "West"
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[12] FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE TRUE
[23] TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE TRUE FALSE
[34] TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE
[45] TRUE FALSE FALSE FALSE TRUE FALSE
Alternatively we can use the !
operator, which inverts TRUE to FALSE and vice versa.
x <- c(TRUE, FALSE, TRUE)
!x
[1] FALSE TRUE FALSE
!(state.region == "South" | state.region == "West")
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[12] FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE TRUE
[23] TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE TRUE FALSE
[34] TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE
[45] TRUE FALSE FALSE FALSE TRUE FALSE
Also we can use the !
operator with %in%
:
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[12] FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE TRUE
[23] TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE TRUE FALSE
[34] TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE
[45] TRUE FALSE FALSE FALSE TRUE FALSE
What if we want to test if all values are TRUE?
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[14] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[27] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[40] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
all(is_in_regions)
[1] TRUE
What if we want to test if any values are TRUE?
When printing the state.region
object you may have noticed the Levels: Northeast South North Central West
. What is this?
state.region
is a special type of integer vector called a factor
. These are commonly used to represent categorical data, and allow one to define a custom order for a category. In various statistical models factors are treated differently from numeric data. In our class you will use them mostly when you are plotting.
Internally they are represented as integers, with levels that map a value to each integer value.
typeof(state.region)
[1] "integer"
class(state.region)
[1] "factor"
levels(state.region)
[1] "Northeast" "South" "North Central" "West"
You can convert a vector into a factor using factor()
.
[1] cat fish fish bear bear
Levels: bear cat fish
Note that the levels are sorted lexicographically by default
levels(animals)
[1] "bear" "cat" "fish"
We can add custom ordering by setting the levels
[1] cat fish fish bear bear
Levels: cat bear fish
# sorting will reorder based on the levels
sort(animals)
[1] cat bear bear fish fish
Levels: cat bear fish
Vectors in R can also have names
, which provide additional information about elements in an object and provide a convenient method to identify elements by name, rather than by position.
A use case: what if we wanted to determine a state name corresponding to a state abbreviation?
We can set the names()
of the state.name vector to be the abbreviations.
names(state.name) <- state.abb
state.name[1:5]
AL AK AZ AR CA
"Alabama" "Alaska" "Arizona" "Arkansas" "California"
Now the names are displayed above each element of the vector.
With names, now we query the vector by the abbreviations, which will then return the state names.
state.name[c("UT", "CO")]
UT CO
"Utah" "Colorado"
Names will become more important next when we start to discuss data.frames and matrices, which can have names corresponding to rows and columns.
A matrix is a 2 dimensional rectangular data structure, where all values have the same type. It is at is core just a vector, but with a special attribute called dim
which specifies the number of rows and columns.
A matrix is used to store a collection of vectors of the same type and same length.
[1] "integer"
m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 6 11 16 21
[2,] 2 7 12 17 22
[3,] 3 8 13 18 23
[4,] 4 9 14 19 24
[5,] 5 10 15 20 25
We can subset or assign values to specific rows or columns using bracket notation, with values denoting rows and/or columns to keep.
matrix[rows to keep, columns to keep]
.
# keep first two rows
m[1:2, ]
[,1] [,2] [,3] [,4] [,5]
[1,] 1 6 11 16 21
[2,] 2 7 12 17 22
# keep first two columns
m[, 1:2]
[,1] [,2]
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 10
# keep first two rows and first 3 columns
m[1:2, 1:3]
[,1] [,2] [,3]
[1,] 1 6 11
[2,] 2 7 12
# replace values
m[1, 1] <- 1000
Matrices can have column names and row names that identify the columns. These names can also be used to subset the matrix by row name or column name.
A B C D E
a 1000 6 11 16 21
b 2 7 12 17 22
c 3 8 13 18 23
d 4 9 14 19 24
e 5 10 15 20 25
Many functions that operate on vectors also operate on matrices:
Matrices are a very commonly used data structure, used in many statistics and genomic packages. We will use matrices later in the course as part of a discussion of clustering and heatmaps.
A list
is similar to a vector, in that it is a container for multiple elements, however it can contain elements from different classes or types. Each element can have a different length or type and can even be a list to generate a nested list of lists.
$vals
[1] 1 2 3 4
$ids
[1] "bear" "dog"
$is_valid
[1] TRUE
$aux
A B C D E
a 1000 6 11 16 21
b 2 7 12 17 22
c 3 8 13 18 23
d 4 9 14 19 24
e 5 10 15 20 25
We can subset a list using []
and select elements with [[
.
1] # list of length 1
lst[
1]] # first element of list
lst[[
1]][1] # first value in first element of list lst[[
If the list has names we can also use the $
operator or [[
to extract an element by name or subset the list to contain only certain elements based on position.
A single [
operator when used on a list, returns a list, whereas [[
operators returns the entry in the list. The [[
operator only returns 1 element, whereas [
can return multiple elements.
# extract ids element, these are all equivalent
lst$ids # by name
[1] "bear" "dog"
lst[[2]] # by position
[1] "bear" "dog"
lst[["ids"]] # by name, with double bracket notation
[1] "bear" "dog"
# subset to first two list elements, returns a list of length 2
# these are equivalent
lst[1:2]
$vals
[1] 1 2 3 4
$ids
[1] "bear" "dog"
lst[c("vals", "ids")] # using names to subset list
$vals
[1] 1 2 3 4
$ids
[1] "bear" "dog"
lst[c(TRUE, TRUE, FALSE, FALSE)] # using a logical vector
$vals
[1] 1 2 3 4
$ids
[1] "bear" "dog"
Similar to vectors, we can also add or replace elements in lists. In this case using the $
operator adds an entry to the list with a name (e.g. new_entry
). Using the [
approach (with two [[
)
Lists are a very useful data structure that is commonly used as a foundation for storing many different data types in a single object.
For example many statistical tests return lists that store various information about the test results.
[1] "list"
names(res)
[1] "statistic" "parameter" "p.value" "conf.int"
[5] "estimate" "null.value" "stderr" "alternative"
[9] "method" "data.name"
res$p.value
[1] 3.574345e-61
A data.frame
is similar to a matrix, but each column can have a different type. This property makes the data.frame a very useful data structure to store multiple types of related information about an observation.
A data.frame can be generated using data.frame()
or by coercing a matrix or other data structure (as.data.frame()
).
df <- data.frame(vals = 1:4,
animal = c("cat", "fish", "bear", "dog"),
is_mammal = c(TRUE, FALSE, TRUE, TRUE))
df
vals animal is_mammal
1 1 cat TRUE
2 2 fish FALSE
3 3 bear TRUE
4 4 dog TRUE
Individual columns (vectors) can be accessed using the $
symbol and treated like regular vectors.
A data.frame is actually a specialized form of a list, whereby each list entry is a vector, and all the vectors have the same length. This is why the syntax is somewhat similar to a list.
# convert df to a list, then back to a data.frame
df_lst <- as.list(df)
df_lst
as.data.frame(df_lst)
# you can also use the double brackets to extract a column, similar to extracting an element from a list
df$is_mammal
df[["is_mammal"]]
df[[3]]
Just like with vectors and matrices we can also subset data.frames using logical vectors, positions, and names if they have column and row names.
For the next exercises we will use the mtcars
dataset built into R. It is data.frame with information about various vehicles from the 1970s. see ?mtcars
for a description.
Here I am using the head()
function to print only the first 6 rows (there is also a tail()
function).
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
We can subset or select data in the data.frame using the [
notation, similar to matrices.
df[rows to keep, columns to keep]
# mimic the head() function, keep first 6 rows
mtcars[1:6, ]
# first row, columns 2 and 3
mtcars[1, 2:3]
# all data from rows 2 and 4
mtcars[c(2, 4), ]
# all data from columns 1 and 3
mtcars[, c(1, 3)]
# extract first 2 columns with logical vector (rep() repeats elements)
lgl_vec <- c(TRUE, TRUE, rep(FALSE, 9))
mtcars[, lgl_vec]
This data.frame has row names, which are names that denote individual rows and column names that indicate columns. The rownames are in a column on the far left with no column name. We can subset columns and rows using these names.
rownames(mtcars)[1:5]
[1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
[4] "Hornet 4 Drive" "Hornet Sportabout"
colnames(mtcars)[1:5]
[1] "mpg" "cyl" "disp" "hp" "drat"
cyl hp
Duster 360 8 245
Datsun 710 4 93
For cars with miles per gallon (mpg
) of at least 30, how many cylinders (cyl
) do they have?
Which car has the highest horsepower (hp
)?
The data.frame
and related variants (e.g. tibble or data.table) are a workhorse data structure that we will return to again and again in the next classes.
We have already used many functions e.g. seq
, typeof
, matrix
, as.data.frame
. Functions have rules for how arguments are specified.
round(x, digits = 0)
round
: function name
x
: required argument
digits
: optional argument (Defaults to 0)
The positional order of the arguments specifies that nums
will be assigned to x
. Alternatively you can explicitly provide the argument x = nums
.
round(x = nums, digits = 1)
[1] 1.5 1.4 -1.6 0.0
round(nums, 1)
[1] 1.5 1.4 -1.6 0.0
round(digits = 1, x = nums)
[1] 1.5 1.4 -1.6 0.0
You can write your own functions as well. Functions reduce copying and pasting code, which reduces errors and simplifies code by reducing objects in the global environment.
We’ll learn more about functions later in the course.
add_stuff <- function(x, y, z = 10) {
x + y + z
}
add_stuff(2, 2)
[1] 14
As we’ve seen it is common to combine multiple functions into a single expression, which can be hard to read.
Instead we can use the pipe operator (|>
) to pipe data from 1 function to another. The operator takes output from the left hand side and pipes it into the right hand side expression.
[1] 30823
[1] 30823
[1] 30823
Implicitly, the data coming from the pipe is passed as the first argument to the right hand side expression.
f(x, y) == x |> f(y)
The pipe allows complex operations to be conducted without having many intermediate variables or many unreadable nested parathenses.
If we need to pass the data to another argument or refer to the data we can use the _
placeholder. When used in a function the _
placeholder must be supplied with the argument name.
We still need to assign the result to a variable in order to store it.
Lastly, it is common to break up each function call into a separate line for readability
The magrittr
package first introduced the pipe operator, but it is different %>%
. The two are similar, however the magrittr pipe uses .
as a placeholder. You may see the %>%
pipe in help and documentation.
R expression can fail due to invalid syntax or other problems. If an expression fails, it generally will not return the expected value and an “error” will be issued.
Errors stop execution, and will cause your scripts to stop. If we include the below chunk in a R script or Rmarkdown it will fail.
w <- "0" / 1
w # w does not exist
In contrast, a R command may return a message or warning, both of which will not terminate the execution, but are providing some information about the command being run. Warnings generally should not be ignored as they often are pointing to issues you need to address.
Messages usually indicate something about the command being run, but are not indicative of an issue. For example, reporting to the user the number of lines processed by a function.
message("we have processed X number of lines")
Often in your analysis code it is useful to throw an error if something strange or unexpected happens. stopifnot()
is a useful command to do this.
Objects that we assign to variables get stored in an environment known as the Global Environment. You can see the objects in the global environment using the ls()
function, or by clicking on the environment tab in Rstudio.
ls()
[1] "add_stuff" "animals" "df" "is_in_regions"
[5] "lst" "m" "n_cyl" "nums"
[9] "res" "state.name" "top_hp_car" "total_area"
[13] "ww" "x"
Objects can be removed from the environment, which can be helpful if you have a large memory object that is no longer needed.
When you close Rstudio, by default your global R environment is saved to a hidden file called .Rdata in the project directory. When you relaunch rstudio, R objects from your previous environment will be reloaded. This behavior can lead to many problems and we recommend disabling this option
To disable this option, go to Rstudio preferences and uncheck the “Restore .RData into workspace at startup” option and select the “Never” option for the “Save workspace to .RData on exit”.
We will discuss in later classes how you can save and reload specific R objects and discuss methods to import/export specific data types.
A little bit of time spent upfront organizing your projects will make analyses easier to manage and reproduce.
Use Rstudio projects. For the course I recommend making a new project for each class.
Use multiple directories to separate raw data files from the analysis of the data. Organize the analyses with directories names with chronological dates
Here’s an example organization strategy.
.
├── data
│ ├── 2022-09-flow
│ ├── 2022-09-rnaseq-1
│ └── 2022-09-rnaseq-2
├── docs
│ └── project-goals.txt
├── results
│ ├── 2022-09-01-rnaseq-expt1
│ │ └── gene-expression-analysis.Rmd
│ ├── 2022-09-28-rnaseq-expt2
│ │ └── splicing-analysis.Rmd
│ └── 2022-10-01-flow-expt1
│ └── flow-plots.R
└── src
└── rnaseq_pipeline.sh
Some very good ideas and examples are discussed here:
Noble WS. A quick guide to organizing computational biology projects. PLoS Comput Biol. 2009 Jul;5(7):e1000424. doi: 10.1371/journal.pcbi.1000424.
Provide meaningful names for your files. Consider including ordinal values (e.g. 01, 02, 03) if analyses depend on previous results to indicate ordering of execution.
# bad
models.R
analysis.R
explore.R-redo-final-v2.R analysis
# good
-data.R
clean-model.R
fit-data.R plot
# better
-data.R
01_clean-model.R
02_fit-data.R 03_plot
“Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.”
— Hadley Wickham
Code is used to communicate with your computer, but it also is used to communicate with your future self and your colleagues.
Don’t just write code for yourself right now, instead write your code with the expectation that your future self will need to reread, understand, and modify it in 6 months.
#
character tells R to ignore a line of text.# convert x to zscores
<- (x - mean(x)) / sd(x) zs
# Load data ---------------------------
<- read_csv("awesome-data.csv)
dat colnames(dat) <- c("sample", "color", "score", "prediction")
...
...
# modify data -------------------------
dat <- mutate(dat, result = score + prediction)
...
...
# Plot data ---------------------------
ggplot(dat, aes(sample, score)) +
geom_point()
plot_df
) or camelCase (plotDf
) approach.# good
<- width * height
a <- 2 * width + 2 * height
p <- data.frame(area = a, perimeter = p) measurement_df
# bad
<- x1 * x2
y <- 2*x1 + 2*x2
yy <- data.frame(a = y, b = yy) tmp
# Good
<- mean(feet / 12 + inches, na.rm = TRUE)
average
# Bad
<-mean(feet/12+inches,na.rm=TRUE) average
%>%
). Indenting the code can also help with readability.# good
<- complicated_function(x,
data minimizer = 1.4,
sigma = 100,
scale_values = FALSE,
verbose = TRUE,
additional_args = list(x = 100,
fun = rnorm))
# bad
<- complicated_function(x, minimizer = 1.4, sigma = 100, scale_values = FALSE, verbose = TRUE, additional_args = list(x = 100, fun = rnorm)) data
#good
<- read_csv("awesome_data.csv") %>%
plot_df select(sample, scores, condition) %>%
mutate(norm_scores = scores / sum(scores))
#bad
<- read_csv("awesome_data.csv") %>% select(sample, scores, condition) %>% mutate(norm_scores = scores / sum(scores)) plot_df
Rstudio has a shortcuts to help format code
Code -> Reformat code
Code -> Reindent lines
The content of this lecture was inspired by and borrows concepts from the following excellent tutorials:
https://github.com/sjaganna/molb7910-2019
https://github.com/matloff/fasteR
https://r4ds.had.co.nz/index.html
https://bookdown.org/rdpeng/rprogdatascience/
http://adv-r.had.co.nz/Style.html