Practical Biological Data Analysis: Class 2: R fundamentals part 2

The Rmarkdown for this class is on github

Using R scripts

R code can be executed using R scripts, which have the .R extension. R scripts can only contain R code, not plain text or markdown. Scripts are executed line by line starting at the top of the document.

R scripts are useful if you have code that you want to run but don’t need the additional functionality of an Rmarkdown. You can also put custom R functions or R expression into an .R script and then use them in another document. The source() function will execute the R code in a Rscript.

# can be a path to a .R file or a URL
source("https://raw.githubusercontent.com/rnabioco/bmsc-7810-pbda/main/_posts/2023-11-27-class-2/custom-functions.R")

# defined in script at URL
greeting("class")

important_list

As an aside, on the command line (e.g. terminal) you can run a R script (or expression):

R -e 'print("Hello World")'

Rscript your_awesome_code.R

Using Rmarkdown to conduct data analysis

Rmarkdown is a reproducible framework to create, collaborate, and communicate your work.
Rmarkdown supports a number of output formats including pdfs, word documents, slide shows, html, etc.
An Rmarkdown document is a plain text file with the extension .Rmd and contains the following basic components:
- An (optional) YAML header surrounded by —s.
- Chunks of R code surrounded by ```.
- Text mixed with simple text formatting like # heading and italics.
Rmarkdown documents are executable documents. You can execute the code and render the markdown into html using the render() function, or alternatively by clicking the knit button in Rstudio.

library(rmarkdown)
render("your-rmarkdown.Rmd")

More on vectors

We have spent a large amount of time focused on vectors because these are the fundamental building blocks of more complex data structures.

Logical operations

As we have seen we can use relational operators (e.g. ==, >, <=) to compare values in a vector.

Returning to our state data, say we wanted to identify states that are located in the south or in the west. How might we approach this?

There are a few approaches:

We can combine relational operators with logical operators, such as the or operator |, similarly we can use the and operator &.

# return TRUE if state is in the South or the West
state.region == "South" | state.region == "West"

 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
[12]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE
[23] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE
[34] FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
[45] FALSE  TRUE  TRUE  TRUE FALSE  TRUE

# states can't be in two regions, so these are all FALSE
state.region == "South" & state.region == "West"

 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[45] FALSE FALSE FALSE FALSE FALSE FALSE

What if we wanted to ask if the state is in the South, West, or Northeast?

We could add another or statement with |

state.region == "South" | state.region == "West" | state.region == "Northeast"

 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[12]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
[23] FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[34] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
[45]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE

A more efficient approach when testing for the presence of multiple values is to use the %in% operator. This operator tests if an element in a vector on the left is present in the vector on the right.

state.region %in% c("South", "West", "Northeast")

 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[12]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
[23] FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[34] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
[45]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE

This is a very common operation used to select particular subsets of a vector.

Negation

What we want to find states not in the west or the south?

Again there are multiple approaches. We could use the != operator to ask if a vector does not equal a value. We then combine this with the & operator to find values that do not satisfy either condition.

# TRUE if state is not in the south AND the state is not in the WEST
state.region != "South" & state.region != "West"

 [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
[12] FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE
[23]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE
[34]  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE
[45]  TRUE FALSE FALSE FALSE  TRUE FALSE

Alternatively we can use the ! operator, which inverts TRUE to FALSE and vice versa.

e.g.:

x <- c(TRUE, FALSE, TRUE)
!x

[1] FALSE  TRUE FALSE

!(state.region == "South" | state.region == "West")

 [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
[12] FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE
[23]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE
[34]  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE
[45]  TRUE FALSE FALSE FALSE  TRUE FALSE

Also we can use the ! operator with %in%:

!(state.region %in% c("South", "West"))

 [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
[12] FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE
[23]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE
[34]  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE
[45]  TRUE FALSE FALSE FALSE  TRUE FALSE

any and all

What if we want to test if all values are TRUE?

is_in_regions <- state.region %in% c("South", "West", "Northeast", "North Central")
is_in_regions

 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[14] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[27] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[40] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

all(is_in_regions)

[1] TRUE

What if we want to test if any values are TRUE?

any(state.region %in% c("Mountain", "Alpine"))

[1] FALSE

any(state.region  == "West")

[1] TRUE

# useful to quickly check for problematic data points
any(is.na(state.region))

[1] FALSE

Factors

When printing the state.region object you may have noticed the Levels: Northeast South North Central West. What is this?

state.region is a special type of integer vector called a factor. These are commonly used to represent categorical data, and allow one to define a custom order for a category. In various statistical models factors are treated differently from numeric data. In our class you will use them mostly when you are plotting.

Internally they are represented as integers, with levels that map a value to each integer value.

typeof(state.region)

[1] "integer"

class(state.region)

[1] "factor"

levels(state.region)

[1] "Northeast"     "South"         "North Central" "West"

You can convert a vector into a factor using factor().

animals <- c("cat", "fish", "fish", "bear", "bear")
animals <- factor(animals)
animals

[1] cat  fish fish bear bear
Levels: bear cat fish

Note that the levels are sorted lexicographically by default

levels(animals)

[1] "bear" "cat"  "fish"

We can add custom ordering by setting the levels

animals <- factor(animals, levels =  c("cat", "bear", "fish"))
animals

[1] cat  fish fish bear bear
Levels: cat bear fish

# sorting will reorder based on the levels
sort(animals)

[1] cat  bear bear fish fish
Levels: cat bear fish

Names

Vectors in R can also have names, which provide additional information about elements in an object and provide a convenient method to identify elements by name, rather than by position.

A use case: what if we wanted to determine a state name corresponding to a state abbreviation?

We can set the names() of the state.name vector to be the abbreviations.

names(state.name) <- state.abb
state.name[1:5]

          AL           AK           AZ           AR           CA 
   "Alabama"     "Alaska"    "Arizona"   "Arkansas" "California"

Now the names are displayed above each element of the vector.

With names, now we query the vector by the abbreviations, which will then return the state names.

state.name[c("UT", "CO")]

        UT         CO 
    "Utah" "Colorado"

Names will become more important next when we start to discuss data.frames and matrices, which can have names corresponding to rows and columns.

Additional data structures in R

Ceballos, Maite and Nicolás Cardiel. 2013. Data structure. First Steps in R. https://web.archive.org/web/20200621022950/http://venus.ifca.unican.es/Rintro/dataStruct.html

Figure 1: Ceballos, Maite and Nicolás Cardiel. 2013. Data structure. First Steps in R. https://web.archive.org/web/20200621022950/http://venus.ifca.unican.es/Rintro/dataStruct.html

matrix

A matrix is a 2 dimensional rectangular data structure, where all values have the same type. It is at is core just a vector, but with a special attribute called dim which specifies the number of rows and columns.

A matrix is used to store a collection of vectors of the same type and same length.

m <- matrix(1:25, nrow = 5, ncol = 5)
typeof(m)

[1] "integer"

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25

We can subset or assign values to specific rows or columns using bracket notation, with values denoting rows and/or columns to keep.

matrix[rows to keep, columns to keep].

# keep first two rows
m[1:2, ]

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22

# keep first two columns
m[, 1:2]

     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10

# keep first two rows and first 3 columns
m[1:2, 1:3]

     [,1] [,2] [,3]
[1,]    1    6   11
[2,]    2    7   12

# replace values
m[1, 1] <- 1000

Matrices can have column names and row names that identify the columns. These names can also be used to subset the matrix by row name or column name.

colnames(m) <- LETTERS[1:5]
rownames(m) <- letters[1:5]
m

     A  B  C  D  E
a 1000  6 11 16 21
b    2  7 12 17 22
c    3  8 13 18 23
d    4  9 14 19 24
e    5 10 15 20 25

m[c("a", "b", "c"), c("C", "D")]

Many functions that operate on vectors also operate on matrices:

# total values in m
sum(m)
mean(m)
max(m)

# add 100 to every value
m + 100
# element-wise addition or division
m + m
m / m

# replace specific values
m[m > 10] <- 123455
m

Matrices are a very commonly used data structure, used in many statistics and genomic packages. We will use matrices later in the course as part of a discussion of clustering and heatmaps.

list

A list is similar to a vector, in that it is a container for multiple elements, however it can contain elements from different classes or types. Each element can have a different length or type and can even be a list to generate a nested list of lists.

lst <- list(vals = 1:4, 
            ids = c("bear", "dog"),
            is_valid = TRUE,
            aux = m)
lst

$vals
[1] 1 2 3 4

$ids
[1] "bear" "dog" 

$is_valid
[1] TRUE

$aux
     A  B  C  D  E
a 1000  6 11 16 21
b    2  7 12 17 22
c    3  8 13 18 23
d    4  9 14 19 24
e    5 10 15 20 25

We can subset a list using [] and select elements with [[.

lst[1] # list of length 1

lst[[1]] # first element of list

lst[[1]][1] # first value in first element of list

If the list has names we can also use the $ operator or [[ to extract an element by name or subset the list to contain only certain elements based on position.

A single [ operator when used on a list, returns a list, whereas [[ operators returns the entry in the list. The [[ operator only returns 1 element, whereas [ can return multiple elements.

# extract ids element, these are all equivalent
lst$ids         # by name

[1] "bear" "dog"

lst[[2]]        # by position

[1] "bear" "dog"

lst[["ids"]]    # by name, with double bracket notation

[1] "bear" "dog"

# subset to first two list elements, returns a list of length 2
# these are equivalent
lst[1:2]

$vals
[1] 1 2 3 4

$ids
[1] "bear" "dog"

lst[c("vals", "ids")]                   # using names to subset list

$vals
[1] 1 2 3 4

$ids
[1] "bear" "dog"

lst[c(TRUE, TRUE, FALSE, FALSE)] # using a logical vector

$vals
[1] 1 2 3 4

$ids
[1] "bear" "dog"

Similar to vectors, we can also add or replace elements in lists. In this case using the $ operator adds an entry to the list with a name (e.g. new_entry). Using the [ approach (with two [[)

lst$new_entry <- c("hello", "world!")
lst[[6]] <- c("hello", "again!")

Lists are a very useful data structure that is commonly used as a foundation for storing many different data types in a single object.

For example many statistical tests return lists that store various information about the test results.

res <- t.test(x = 1:100, y = 100:200)
typeof(res)

[1] "list"

names(res)

 [1] "statistic"   "parameter"   "p.value"     "conf.int"   
 [5] "estimate"    "null.value"  "stderr"      "alternative"
 [9] "method"      "data.name"

res$p.value

[1] 3.574345e-61

data.frame

A data.frame is similar to a matrix, but each column can have a different type. This property makes the data.frame a very useful data structure to store multiple types of related information about an observation.

A data.frame can be generated using data.frame() or by coercing a matrix or other data structure (as.data.frame()).

df <- data.frame(vals = 1:4, 
                 animal = c("cat", "fish", "bear", "dog"),
                 is_mammal = c(TRUE, FALSE, TRUE, TRUE))
df

  vals animal is_mammal
1    1    cat      TRUE
2    2   fish     FALSE
3    3   bear      TRUE
4    4    dog      TRUE

Individual columns (vectors) can be accessed using the $ symbol and treated like regular vectors.

df$animal

[1] "cat"  "fish" "bear" "dog"

sum(df$is_mammal)

[1] 3

A data.frame is actually a specialized form of a list, whereby each list entry is a vector, and all the vectors have the same length. This is why the syntax is somewhat similar to a list.

# convert df to a list, then back to a data.frame
df_lst <- as.list(df)
df_lst
as.data.frame(df_lst)

# you can also use the double brackets to extract a column, similar to extracting an element from a list
df$is_mammal
df[["is_mammal"]] 
df[[3]]

Subsetting and working with data.frames

Just like with vectors and matrices we can also subset data.frames using logical vectors, positions, and names if they have column and row names.

For the next exercises we will use the mtcars dataset built into R. It is data.frame with information about various vehicles from the 1970s. see ?mtcars for a description.

Here I am using the head() function to print only the first 6 rows (there is also a tail() function).

head(mtcars)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

We can subset or select data in the data.frame using the [ notation, similar to matrices.

df[rows to keep, columns to keep]

# mimic the head() function, keep first 6 rows
mtcars[1:6, ]

# first row, columns 2 and 3
mtcars[1, 2:3]

# all data from rows 2 and 4
mtcars[c(2, 4), ]

# all data from columns 1 and 3
mtcars[, c(1, 3)]

# extract first 2 columns with logical vector (rep() repeats elements)
lgl_vec <- c(TRUE, TRUE, rep(FALSE, 9))
mtcars[, lgl_vec]

This data.frame has row names, which are names that denote individual rows and column names that indicate columns. The rownames are in a column on the far left with no column name. We can subset columns and rows using these names.

rownames(mtcars)[1:5]

[1] "Mazda RX4"         "Mazda RX4 Wag"     "Datsun 710"       
[4] "Hornet 4 Drive"    "Hornet Sportabout"

colnames(mtcars)[1:5]

[1] "mpg"  "cyl"  "disp" "hp"   "drat"

mtcars[c("Duster 360", "Datsun 710"), c("cyl", "hp")]

           cyl  hp
Duster 360   8 245
Datsun 710   4  93

Exercises:

For cars with miles per gallon (mpg) of at least 30, how many cylinders (cyl) do they have?

n_cyl <- mtcars[mtcars$mpg > 30, "cyl"]
n_cyl

[1] 4 4 4 4

unique(n_cyl)

[1] 4

Which car has the highest horsepower (hp)?

top_hp_car <- mtcars[mtcars$hp == max(mtcars$hp), ]
rownames(top_hp_car)

[1] "Maserati Bora"

The data.frame and related variants (e.g. tibble or data.table) are a workhorse data structure that we will return to again and again in the next classes.

Functions in R

We have already used many functions e.g. seq, typeof, matrix, as.data.frame. Functions have rules for how arguments are specified.

round(x, digits = 0)

round: function name
x: required argument
digits: optional argument (Defaults to 0)

nums <- c(1.5, 1.4, -1.6, 0.0099)
round(nums)

[1]  2  1 -2  0

round(nums, digits = 1)

[1]  1.5  1.4 -1.6  0.0

The positional order of the arguments specifies that nums will be assigned to x. Alternatively you can explicitly provide the argument x = nums.

round(x = nums, digits = 1)

[1]  1.5  1.4 -1.6  0.0

round(nums, 1)

[1]  1.5  1.4 -1.6  0.0

round(digits = 1, x = nums)

[1]  1.5  1.4 -1.6  0.0

You can write your own functions as well. Functions reduce copying and pasting code, which reduces errors and simplifies code by reducing objects in the global environment.

We’ll learn more about functions later in the course.

add_stuff <- function(x, y, z = 10) {
  x + y + z
}
add_stuff(2, 2)

[1] 14

Chaining operations with the pipe operator |>

As we’ve seen it is common to combine multiple functions into a single expression, which can be hard to read.

# calculate total area of 6 smallest states
sum(head(sort(state.area)))

[1] 30823

Instead we can use the pipe operator (|>) to pipe data from 1 function to another. The operator takes output from the left hand side and pipes it into the right hand side expression.

state.area |> sort() |> head() |> sum()

[1] 30823

# equivalently
sort(state.area) |> head() |> sum()

[1] 30823

# equivalently
sum(head(sort(state.area)))

[1] 30823

Implicitly, the data coming from the pipe is passed as the first argument to the right hand side expression.

f(x, y) == x |> f(y)

The pipe allows complex operations to be conducted without having many intermediate variables or many unreadable nested parathenses.

If we need to pass the data to another argument or refer to the data we can use the _ placeholder. When used in a function the _ placeholder must be supplied with the argument name.

state.area |> sort(x = _) |> head(x = _) |> sum(x = _)

# emulate head with selecting the fix 6 obs. 
state.area |> sort() |> _[1:6] |> sum()

We still need to assign the result to a variable in order to store it.

total_area <- state.area |> sort() |> head() |> sum()

# this also works, but is discouraged...
state.area |> sort() |> head() |> sum() -> total_area

Lastly, it is common to break up each function call into a separate line for readability

total_area <- state.area |> 
  sort() |> 
  head() |> 
  sum()

The magrittr package first introduced the pipe operator, but it is different %>%. The two are similar, however the magrittr pipe uses . as a placeholder. You may see the %>% pipe in help and documentation.

Errors, warnings, and messages

R expression can fail due to invalid syntax or other problems. If an expression fails, it generally will not return the expected value and an “error” will be issued.

Errors stop execution, and will cause your scripts to stop. If we include the below chunk in a R script or Rmarkdown it will fail.

w <- "0" / 1
w # w does not exist

In contrast, a R command may return a message or warning, both of which will not terminate the execution, but are providing some information about the command being run. Warnings generally should not be ignored as they often are pointing to issues you need to address.

ww <- c(1, 2, 3) + c(1, 2)
ww

[1] 2 4 4

Messages usually indicate something about the command being run, but are not indicative of an issue. For example, reporting to the user the number of lines processed by a function.

message("we have processed X number of lines")

Often in your analysis code it is useful to throw an error if something strange or unexpected happens. stopifnot() is a useful command to do this.

stopifnot(1 + 1 == 2)
stopifnot(2 + 2 == 5)

Workspaces

Objects that we assign to variables get stored in an environment known as the Global Environment. You can see the objects in the global environment using the ls() function, or by clicking on the environment tab in Rstudio.

ls()

 [1] "add_stuff"     "animals"       "df"            "is_in_regions"
 [5] "lst"           "m"             "n_cyl"         "nums"         
 [9] "res"           "state.name"    "top_hp_car"    "total_area"   
[13] "ww"            "x"

Objects can be removed from the environment, which can be helpful if you have a large memory object that is no longer needed.

big_matrix <- matrix(1:1e6, nrow = 1e5, ncol = 100)
# show # of rows and columns
dim(big_matrix)
#' [1] 100000    100

# remove matrix from environment
rm(big_matrix)
big_matrix
# 'Error: object 'big_matrix' not found

When you close Rstudio, by default your global R environment is saved to a hidden file called .Rdata in the project directory. When you relaunch rstudio, R objects from your previous environment will be reloaded. This behavior can lead to many problems and we recommend disabling this option

To disable this option, go to Rstudio preferences and uncheck the “Restore .RData into workspace at startup” option and select the “Never” option for the “Save workspace to .RData on exit”.

We will discuss in later classes how you can save and reload specific R objects and discuss methods to import/export specific data types.

Organizing analyses

A little bit of time spent upfront organizing your projects will make analyses easier to manage and reproduce.

Use Rstudio projects. For the course I recommend making a new project for each class.
Use multiple directories to separate raw data files from the analysis of the data. Organize the analyses with directories names with chronological dates

Here’s an example organization strategy.

.
├── data
│   ├── 2022-09-flow
│   ├── 2022-09-rnaseq-1
│   └── 2022-09-rnaseq-2
├── docs
│   └── project-goals.txt
├── results
│   ├── 2022-09-01-rnaseq-expt1
│   │   └── gene-expression-analysis.Rmd
│   ├── 2022-09-28-rnaseq-expt2
│   │   └── splicing-analysis.Rmd
│   └── 2022-10-01-flow-expt1
│       └── flow-plots.R
└── src
    └── rnaseq_pipeline.sh

Some very good ideas and examples are discussed here:

Noble WS. A quick guide to organizing computational biology projects. PLoS Comput Biol. 2009 Jul;5(7):e1000424. doi: 10.1371/journal.pcbi.1000424.

Provide meaningful names for your files. Consider including ordinal values (e.g. 01, 02, 03) if analyses depend on previous results to indicate ordering of execution.

# bad
models.R
analysis.R
explore.R
analysis-redo-final-v2.R

# good
clean-data.R
fit-model.R
plot-data.R

# better
01_clean-data.R
02_fit-model.R
03_plot-data.R

Organizing your code

“Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.”

— Hadley Wickham

Code is used to communicate with your computer, but it also is used to communicate with your future self and your colleagues.

Don’t just write code for yourself right now, instead write your code with the expectation that your future self will need to reread, understand, and modify it in 6 months.

Use comments to remind yourself what the code does. The # character tells R to ignore a line of text.

# convert x to zscores
zs <- (x - mean(x)) / sd(x)

Use comments to break up long scripts into logical blocks

# Load data ---------------------------
dat <- read_csv("awesome-data.csv)
colnames(dat) <- c("sample", "color", "score", "prediction")
...
...
# modify data -------------------------
dat <- mutate(dat, result = score + prediction)
...
...
# Plot data ---------------------------
ggplot(dat, aes(sample, score)) + 
  geom_point()

Use sensible names for variables. Keep them short, but meaningful. Separate words with snake_case (e.g plot_df) or camelCase (plotDf) approach.

# good
a <- width * height
p <- 2 * width + 2 * height
measurement_df <- data.frame(area = a, perimeter = p)

# bad
y <- x1 * x2
yy <- 2*x1 + 2*x2
tmp <- data.frame(a = y, b = yy)

Space is free in code, use it liberally. Add spaces around operators.

# Good
average <- mean(feet / 12 + inches, na.rm = TRUE)

# Bad
average<-mean(feet/12+inches,na.rm=TRUE)

Split up complicated operations or long function calls into multiple lines. In general you can add a newline after a comma or a pipe operation (%>%). Indenting the code can also help with readability.

# good
data <- complicated_function(x,
                             minimizer = 1.4, 
                             sigma = 100,
                             scale_values = FALSE, 
                             verbose = TRUE, 
                             additional_args = list(x = 100,
                                                    fun = rnorm))
# bad
data <- complicated_function(x, minimizer = 1.4, sigma = 100, scale_values = FALSE, verbose = TRUE, additional_args = list(x = 100, fun = rnorm))

#good
plot_df <- read_csv("awesome_data.csv") %>% 
  select(sample, scores, condition) %>%
  mutate(norm_scores = scores / sum(scores))
  
#bad
plot_df <- read_csv("awesome_data.csv") %>% select(sample, scores, condition) %>% mutate(norm_scores = scores / sum(scores))

Rstudio has a shortcuts to help format code

Code -> Reformat code

Code -> Reindent lines

Acknowledgements and additional references

The content of this lecture was inspired by and borrows concepts from the following excellent tutorials:

https://github.com/sjaganna/molb7910-2019
https://github.com/matloff/fasteR
https://r4ds.had.co.nz/index.html
https://bookdown.org/rdpeng/rprogdatascience/
http://adv-r.had.co.nz/Style.html