RNA Block - Problem Set 24

Author

Your Name Here

Published

October 21, 2024

Problem Set

Total points: 20. Q1 - 10 pts, Q2,3 - 5 points each.

Perform differential expression analysis comparing DIV28 vs DIV0 (10 pts)
Are axonogenesis and cell cycle genes significantly differentially expressed? If so, in what direction (up/down-regulated)? (5 pts)
Perform GSEA analysis using the Hallmark gene set. Make enrichment plot of HALLMARK_G2M_CHECKPOINT geneset. (5 pts)

Load libraries and generate gene information files (0 pts)

Make sure to run the code chunk below to load required libraries and generate t2g (file for tximport) and gene id to symbol mapping file.

library(biomaRt)
library(tximport)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
✖ dplyr::select() masks biomaRt::select()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(DESeq2)

Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:lubridate':

    intersect, setdiff, union

The following objects are masked from 'package:dplyr':

    combine, intersect, setdiff, union

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
    tapply, union, unique, unsplit, which.max, which.min


Attaching package: 'S4Vectors'

The following objects are masked from 'package:lubridate':

    second, second<-

The following objects are masked from 'package:dplyr':

    first, rename

The following object is masked from 'package:tidyr':

    expand

The following object is masked from 'package:utils':

    findMatches

The following objects are masked from 'package:base':

    expand.grid, I, unname

Loading required package: IRanges

Attaching package: 'IRanges'

The following object is masked from 'package:lubridate':

    %within%

The following objects are masked from 'package:dplyr':

    collapse, desc, slice

The following object is masked from 'package:purrr':

    reduce

Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats

Attaching package: 'matrixStats'

The following object is masked from 'package:dplyr':

    count


Attaching package: 'MatrixGenerics'

The following objects are masked from 'package:matrixStats':

    colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
    colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
    colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
    colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
    colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
    colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
    colWeightedMeans, colWeightedMedians, colWeightedSds,
    colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
    rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
    rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
    rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
    rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
    rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
    rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
    rowWeightedSds, rowWeightedVars

Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.


Attaching package: 'Biobase'

The following object is masked from 'package:MatrixGenerics':

    rowMedians

The following objects are masked from 'package:matrixStats':

    anyMissing, rowMedians

Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'

library(ggrepel)
library(here)

here() starts at /home/runner/work/molb-7950/molb-7950

library(cowplot)


Attaching package: 'cowplot'

The following object is masked from 'package:lubridate':

    stamp

library(rstatix)


Attaching package: 'rstatix'

The following object is masked from 'package:IRanges':

    desc

The following object is masked from 'package:biomaRt':

    select

The following object is masked from 'package:stats':

    filter

library(clusterProfiler)


clusterProfiler v4.12.0  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/

If you use clusterProfiler in published research, please cite:
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141

Attaching package: 'clusterProfiler'

The following object is masked from 'package:IRanges':

    slice

The following object is masked from 'package:S4Vectors':

    rename

The following object is masked from 'package:purrr':

    simplify

The following object is masked from 'package:biomaRt':

    select

The following object is masked from 'package:stats':

    filter

library(enrichplot)
library(msigdbr)
# run the code below to generate t2g file and gene id-symbol mapping file
mart <- biomaRt::useMart(
  "ENSEMBL_MART_ENSEMBL",
  dataset = "mmusculus_gene_ensembl"
)

t2g <- biomaRt::getBM(
  attributes = c(
    "ensembl_transcript_id",
    "ensembl_gene_id",
    "external_gene_name"
  ),
  mart = mart
) |>
  as_tibble()

# maps systematic to common gene names
gene_name_map <- t2g |>
  dplyr::select(-ensembl_transcript_id) |>
  unique()

Q1 Perform differential expression analysis comparing DIV28 vs DIV0

Prepare `metadata` and use `tximport`

metadata <- data.frame(
  sample_id = list.files(here("??"),
                         pattern = "^DIV"),
  salmon_dirs = list.files(here("??"),
                           recursive = ??,
                           pattern = "quant.sf",
                           full.names = ??)
  ) |>
  separate(col = sample_id, into = c("timepoint","rep"), sep = "\\.", remove = F)

metadata$rep <- gsub(pattern = "Rep", replacement = "", metadata$rep) 

rownames(metadata) <- metadata$sample_id

metadata <- metadata |> 
  filter(timepoint %in% c("??","??")) 

salmdir <- metadata$salmon_dirs
names(salmdir) <- metadata$sample_id

txi <- tximport(
  files = salmdir,
  type = "salmon",
  tx2gene = t2g,
  dropInfReps = TRUE,
  countsFromAbundance = "lengthScaledTPM"
)

Filter genes and perform `DESeq2`

# examine distribution of TPMs
hist(log2(1 + rowSums(txi$??)), breaks = 40)

# decide a cutoff
keepG <- txi$abundance[log2(1 + rowSums(txi$abundance)) > ??,] |>
  rownames()

ddsTxi <- DESeqDataSetFromTximport(
  ??,
  colData = metadata,
  design = ~??
)

# keep genes with sufficient expession
ddsTxi <- ??

# run DESeq2
dds <- ??

# create a dataframe containing results and join w/gene symbols
diff <-
  results(
    dds,
    contrast = c("timepoint", "??", "??")
  ) |>
  # Change this into a dataframe
  as.data.frame() |>
  # Move ensembl gene IDs into their own column
  rownames_to_column(var = "ensembl_gene_id") |>
  # drop unused columns
  dplyr::select(-c(baseMean, lfcSE, stat, pvalue)) |>
  # Merge this with a table relating ensembl_gene_id with gene short names
  inner_join(gene_name_map) |>
  # Rename external_gene_name column
  dplyr::rename(gene = external_gene_name) |>
  as_tibble()

Q2. Are axonogenesis and cell cycle genes significantly differentially expressed? If so, in what direction (up/down-regulated)?

cellcyclegenes <- ??(
  attributes = c("ensembl_gene_id"),
  filters = c("go_parent_term"),
  values = c("GO:0045787"),
  mart = mart
)


axongenes <- ??(
  attributes = c("ensembl_gene_id"),
  filters = c("go_parent_term"),
  values = c("GO:0007409"),
  mart = mart
)

diff_paths <-
  diff |>
  mutate(
    annot = case_when(
      ensembl_gene_id %in% axongenes$ensembl_gene_id ~ "??",
      ensembl_gene_id %in% cellcyclegenes$ensembl_gene_id ~ "??",
      .default = "none"
    )
  ) |>
  drop_na() # drop na

# Reorder these for plotting purposes
diff_paths$annot <-
  factor(
    diff_paths$annot,
    levels = c("none",
               "axonogenesis",
               "cellcycle")
  )

# calculate and report p-value using wilcox test
pvals <- wilcox_test(data = diff_paths,
                     ?? ~ ??,
                     ref.group = "??")

# make a plot of the LFC (y-axis) ~ pathways (x-axis)
ggplot(
  diff_paths,
  aes(
    x = ??,
    y = ??,
    fill = ??
  )
) +
  labs(
    x = "Gene class",
    y = "DIV28/DIV0, log2"
  ) +
  geom_hline(
    yintercept = 0,
    color = "gray",
    linetype = "dashed"
  ) +
  geom_boxplot(
    notch = TRUE,
    outlier.shape = NA
  ) +
  ylim(-5,5) +
  theme_cowplot()

Q3. Perform GSEA analysis using the Hallmark gene set.

# retrieve mouse hallmark gene set from msigdb
mouse_hallmark <- msigdbr(species = "??") %>%
  filter(gs_cat == "??") %>% 
  dplyr::select(gs_name, gene_symbol)


# create a list of gene LFCs
rankedgenes <- diff %>% pull(??)

# add symbols as names of the list
names(rankedgenes) <- diff$??

# sort by LFC
rankedgenes <- sort(??, decreasing = TRUE)

# deduplicate
rankedgenes <- rankedgenes[!duplicated(names(rankedgenes))]


# run gsea
?? <- GSEA(geneList = ??,
                      eps = 0,
                     pAdjustMethod = "fdr",
                     pvalueCutoff = .05,
                     minGSSize = 20,
                     maxGSSize = 1000,
                     TERM2GENE = ??)


# plot "HALLMARK_G2M_CHECKPOINT"
gseaplot(x = ??, geneSetID = "??")