-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr 1.1.3 v readr 2.1.4
v forcats 1.0.0 v stringr 1.5.0
v ggplot2 3.4.3 v tibble 3.2.1
v lubridate 1.9.2 v tidyr 1.3.0
v purrr 1.0.2
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Attaching package: 'rstatix'
The following object is masked from 'package:stats':
filter
Attaching package: 'janitor'
The following object is masked from 'package:rstatix':
make_clean_names
The following objects are masked from 'package:stats':
chisq.test, fisher.test
here() starts at /Users/jayhesselberth/devel/rnabioco/molb-7950
Problem Set Stats Bootcamp - class 15
Dealing with big data
Problem # 1
Make sure to run the chunk above. The data represent the avg fold change in gene expression for an angiotensin II time course (.25, .5, .75, 1, 1.5, 2, 3, 4, 6, 8, 24 hrs) compared to unstimulated.
correlation
Create hierarchical clustering heatmap of pairwise pearson correlation coefficients. And provide 1-2 observations.
# scale
# pairwise pearson correlation
# make heatmap
Timepoints close to each other tend to correlate strongly with each other. The 4,6, and 8 hr time points are the most different from all others.
PCA
Perform PCA and visualize PC1 vs PC2.Provide 1-2 observations.
# pca
# gather info from summary
# we make a dataframe out of the rotations and will use this to plot
# plot
Calculate the empirical p-value of the cluster most enriched for DUX4 targets by sampling
In order to do this, you will need to:
- Identify which cluster is the most enriched for DUX4 targets.
- Determine how many genes are in the cluster. You will need to know this to figure out how many genes to sample from the whole data set.
- Determine how many of the genes in the cluster are DUX4 targets. This is the metric that you are interested in comparing between the null distribution and your observation.
- Generate 1000 random sample of the proper size from all genes and find out how many of them are DUX4 targets.
- Visualize the distribution of DUX4 targets in these 1000 random (your null distribution) and overlay the number of DUX4 targets you observed in the cluster that was most enriched for DUX4 targets.
# read in data
cd <- read_tsv(here("data", "dux4_clustering_results.csv.gz"))
Rows: 10566 Columns: 15
-- Column specification --------------------------------------------------------
Delimiter: "\t"
chr (2): gene_symbol, target
dbl (13): hour00_rep1, hour00_rep2, hour00_rep3, hour04_rep1, hour04_rep2, h...
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# how many genes are in cluster of interest?
# how many dux targets are in cluster interest?
# initialize empty vector
sampled_targets <- vector()
# randomly sample # genes above from data 1000x and tally number of dux4 targets each random sampling
# plot