Single-cell RNA-seq Problem Set II

Author

Your name here

Published

October 21, 2024

Grade (out of 20):

For this problem set we will be reanalyzing some public single cell RNA-seq data (publication). The dataset contains PBMCs from a patient with Acute Myeloid Leukemia (AML). The dataset is a little different than the one we looked at in Mondays problem set, as we will also include a day 4 sample. The data we will be analyzing consists of three samples, PBMCs taken 4 days (day4), 2 days (day2), and prior to (day 0) treatment with a chemotherapeutic (Venetoclax and Azacitidine).

The three single cell RNA-seq datasets have already been preprocessed and QC’d. There is an .rds file (data/aml/d0_d2_d4_filtered.rds) provided that contains the combined samples in a single seurat object. We will use this seurat object for this homework.

Q1 4 points) Read the .rds file containing the Seurat object into R using the readRDS() function. Use tab-completion to ensure that you are specifying the correct path to the object.

Q2 4 points) Process the dataset to generate a UMAP projection. Plot your UMAP with each cell colored by the day of sample (e.g day 0, day 2 or day 4). Examine the meta.data to find the column that contains the day of sample information. Consult the simplified work-flow shown at the beginning of class on Monday for a default approach to this question (you don’t need to worry about picking parameters here).

#head(so@meta.data)

Q3 4 points) Make a UMAP plot showing the clusters that you have generated. To make this plot more informative use the split.by argument set to the column with the day information. This will split the UMAP into three plots ( day0, day2 and day4). Remember that to plot categorical data you need to use UMAPPlot() and for numeric data use FeaturePlot().

# code here

Q4 2 points) Use the clustifyr package to annotate cell types using a reference dataset from clustifyrdatahub. Use the ref_hema_microarray() reference shown in class. Plot a heatmap (using pheatmap) of the correlation coefficients between your clusters and the cell types in the reference data. Note that you need to set obj_out = FALSE to return the correlation coefficients as a data.frame.

Loading required package: ExperimentHub
Loading required package: BiocGenerics

Attaching package: 'BiocGenerics'
The following object is masked from 'package:SeuratObject':

    intersect
The following objects are masked from 'package:lubridate':

    intersect, setdiff, union
The following objects are masked from 'package:dplyr':

    combine, intersect, setdiff, union
The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
    tapply, union, unique, unsplit, which.max, which.min
Loading required package: AnnotationHub
Loading required package: BiocFileCache
Loading required package: dbplyr

Attaching package: 'dbplyr'
The following objects are masked from 'package:dplyr':

    ident, sql
library(pheatmap)
# code here

Q5 2 points) Run clustifyr but this time assign the output of clustify to return a Seurat object. The cell classifications will be listed in the type column. Make a UMAP plot colored by the assigned cell types.

# code here

Q6 4 points) For each day (e.g. day0 day2 and day4) calculate the % of cells present in each cluster. Using ggplot, plot with the cell type on the x axis and the % on the y-axis. Then use a fill aesthetic to color by the day of each sample. If your x-axis labels are all squished together consider rotating the label using the following pseudocode:

#ggplot... +
#  theme(axis.text.x = element_text(angle = 90))
# Hint - calculate % of cells using tidyverse, remember so@meta.data gives a data frame
# % of cells will be # of cells in the cluster / total cells for each day. Summarize can
# help you get the total number of cells
# ex
#so@meta.data %>%
#  dplyr::group_by(orig.ident, type) # Finish this to count the number of cells
# code here

How does the relative abundance change of each cell type change?

Short answer here...