RNA Block - Problem Set 23

Published

October 21, 2024

Problem Set

Total points: 20. First problem is worth 10 points, second and third problems are worth 5 points.

Load libraries

Start by loading libraries you need analysis in the code chunk below.

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
here() starts at /home/runner/work/molb-7950/molb-7950

Attaching package: 'cowplot'

The following object is masked from 'package:gt':

    as_gtable

The following object is masked from 'package:lubridate':

    stamp

We have an experiment where we can take neuronal cells and mechanically separate them into soma and neurite fractions. By sequencing RNA from both of these fractions and comparing the relative abundances of RNAs, we can get a sense of how neurite-localized every RNA is. We can also combine this approach with knockouts of specific RBPs. If we have an RBP that we think is involved in this process, we can do this subcellular fractionation in sequencing in both WT and RBP-knockout (KO) cells. Transcripts that depend upon the RBP for transcript to the neurite should be less neurite-enriched in the KO samples than the WT samples.

We recently completed this process in mouse cells that lack the RBP TDP-43. We have RNA sequence data for 4 conditions: WT soma, WT neurite, KO soma, and KO neurite with 3 replicates of each condition. These samples have been quantified with salmon.

Read in this data, collapse salmon’s transcript-level quantification to gene-level quantification with tximport. Then assess the quality of this data by performing hierarchical clustering of pairwise spearman correlation values and PCA analysis of TPM expression values.

The salmon data lives in data/block-rna/salmon_tdp43. In that directory, you will find one salmon output directory for each sample.

Q1: read in salmon data (10 pts)

# There are some hints to help you get started

# Use biomaRt to get a table of transcript/gene relationships
mart <- biomaRt::useMart(
  "ENSEMBL_MART_ENSEMBL",
  dataset = "mmusculus_gene_ensembl",
  host = "https://www.ensembl.org"
)

t2g <- biomaRt::getBM(attributes = c("ensembl_transcript_id", "ensembl_gene_id", "external_gene_name"), mart = mart) |> dplyr::select(ensembl_transcript_id, ensembl_gene_id)



# Read in salmon quantification files



metadata <- data.frame(
  sample_id = list.files(here("data/block-rna/salmon_tdp43")),
  salmon_dirs = list.files(here("data/block-rna/salmon_tdp43"), recursive = T, pattern = ".gz$", full.names = T)
) |>
  separate(col = sample_id, into = c("cell", "loc", "geno", "rep"), sep = "_", remove = F)

metadata$rep <- gsub(pattern = "Rep", replacement = "", metadata$rep)

rownames(metadata) <- metadata$sample_id


# Get gene-level TPM values with tximport

salmdir <- metadata$salmon_dirs
names(salmdir) <- metadata$sample_id


txi <- tximport(
  files = salmdir,
  type = "salmon",
  tx2gene = t2g,
  dropInfReps = TRUE,
  countsFromAbundance = "lengthScaledTPM"
)
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 
transcripts missing from tx2gene: 4816
summarizing abundance
summarizing counts
summarizing length
# Filter genes to remove those that are not expressed at at least 1 TPM in EVERY sample
tpms <- txi$abundance |>
  as.data.frame() |>
  rownames_to_column(var = "ensembl_gene_id")


tpms.cutoff <-
  mutate(tpms, nSamples = rowSums(tpms[, 2:13] > 1)) |>
  # Now filter for rows where nSamples is at least 12
  # Meaning that at least 12 samples passed the threshold
  filter(nSamples >= 12) |>
  # Get rid of the nSamples column
  dplyr::select(-nSamples)

Q2: make correlation heatmap (5 pts)

# Use cor() to get a matrix of pairwise correlations between samples
tpms.cor <- cor(tpms.cutoff[, -1], method = "spearman")

# Use pheatmap() to plot correlation matrix
pheatmap(
  tpms.cor,
  annotation_col = metadata[, 3:5],
  fontsize = 7,
  show_colnames = FALSE
)

Provide 1-2 sentences of interpretation of the similarity of the samples based on the heatmap.

The biggest differences are betweent soma and neurite. Then the WT vs KO. The replicates cluster together nicely.

Q3: make PCA plot (5 pts)

# Start with the filtered TPM table from above
tpms.cutoff.matrix <-
  dplyr::select(tpms.cutoff, -ensembl_gene_id) |>
  as.matrix()


# Use prcomp() to derive principle component coordinants of *LOGGED*  and *Scaled* TPM values
tpms.cutoff.matrix <- log2(tpms.cutoff.matrix + 1e-3)

tpms.cutoff.matrix <- t(scale(t(tpms.cutoff.matrix)))

tpms.pca <- prcomp(t(tpms.cutoff.matrix))
# Add annotations of the cell compartment (soma / neurite) and TDP-43 status (WT / KO) of the samples
tpms.pca.pc <- tpms.pca$x %>%
  as.data.frame() %>%
  rownames_to_column(var = "sample_id") %>%
  left_join(., metadata[, c(1, 3:5)], by = "sample_id")

##

tpms.pca.summary <- summary(tpms.pca)$importance
pc1var <- round(tpms.pca.summary[2, 1] * 100, 1)
pc2var <- round(tpms.pca.summary[2, 2] * 100, 1)

# Plot PCA data

ggplot(
  data = tpms.pca.pc,
  aes(
    x = PC1, y = PC2,
    color = paste(loc, geno), label = sample_id
  )
) +
  geom_point(size = 5) +
  scale_color_brewer(palette = "Set1") +
  theme_cowplot(16) +
  labs(
    x = paste("PC1,", pc1var, "% explained var."),
    y = paste("PC2,", pc2var, "% explained var.")
  ) +
  geom_text_repel()

Provide 1-2 sentences of interpretation of the similarity of the samples based on the heatmap.

PC1, which explains ~86% of the variance is correlating with the difference between soma and neurite. PC2, which explains ~6% of the variance is correlating with the difference between WT and KO.The replicates group together nicely.