Problem Set Stats Bootcamp - class 15

Dealing with big data

Author

Neelanjan Mukherjee

Published

October 21, 2024

-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr     1.1.3     v readr     2.1.4
v forcats   1.0.0     v stringr   1.5.0
v ggplot2   3.4.3     v tibble    3.2.1
v lubridate 1.9.2     v tidyr     1.3.0
v purrr     1.0.2     
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: 'rstatix'


The following object is masked from 'package:stats':

    filter



Attaching package: 'janitor'


The following object is masked from 'package:rstatix':

    make_clean_names


The following objects are masked from 'package:stats':

    chisq.test, fisher.test


here() starts at /Users/jayhesselberth/devel/rnabioco/molb-7950
ang <- read_csv(here("data/bootcamp/edger.csv.gz")) |>
  clean_names() |>
  filter(fdr < 0.05) |>
  select(log_fc_time0_25:log_fc_time8) |>
  as.matrix()

colnames(ang) <- gsub(pattern = "log_fc_", "", colnames(ang))

Problem # 1

Make sure to run the chunk above. The data represent the avg fold change in gene expression for an angiotensin II time course (.25, .5, .75, 1, 1.5, 2, 3, 4, 6, 8, 24 hrs) compared to unstimulated.

correlation

Create hierarchical clustering heatmap of pairwise pearson correlation coefficients. And provide 1-2 observations.

# scale

# pairwise pearson correlation


# make heatmap

Timepoints close to each other tend to correlate strongly with each other. The 4,6, and 8 hr time points are the most different from all others.

PCA

Perform PCA and visualize PC1 vs PC2.Provide 1-2 observations.

# pca


# gather info from summary




# we make a dataframe out of the rotations and will use this to plot


# plot

Calculate the empirical p-value of the cluster most enriched for DUX4 targets by sampling

In order to do this, you will need to:

  1. Identify which cluster is the most enriched for DUX4 targets.
    • Determine how many genes are in the cluster. You will need to know this to figure out how many genes to sample from the whole data set.
    • Determine how many of the genes in the cluster are DUX4 targets. This is the metric that you are interested in comparing between the null distribution and your observation.
  2. Generate 1000 random sample of the proper size from all genes and find out how many of them are DUX4 targets.
  3. Visualize the distribution of DUX4 targets in these 1000 random (your null distribution) and overlay the number of DUX4 targets you observed in the cluster that was most enriched for DUX4 targets.
# read in data
cd <- read_tsv(here("data", "dux4_clustering_results.csv.gz"))
Rows: 10566 Columns: 15
-- Column specification --------------------------------------------------------
Delimiter: "\t"
chr  (2): gene_symbol, target
dbl (13): hour00_rep1, hour00_rep2, hour00_rep3, hour04_rep1, hour04_rep2, h...

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# how many genes are in cluster of interest?

# how many dux targets are in cluster interest?



# initialize empty vector
sampled_targets <- vector()

# randomly sample # genes above from data 1000x and tally number of dux4 targets each random sampling



# plot

What is the p-value?

What is your interpretation?