You have two tasks for this problem set.
Each problem below is worth 20 points.
Load the libraries you’ll need for the analysis below. You’ll need the tidyverse and valr packages.
# Load required libraries
We’ll work with a few different files for the next questions.
hg19.refGene.chr22.bed.gz is a BED12 file containing RefSeq gene (mRNA) information for chr22.
hg19.rmsk.chr22.bed.gz is a BED6 file containing repetitive elements in the human genome.
hg19.dnase1.bw is a bigWig file containing DNase I signal.
You can find the path to each with valr_example(). Load each one individually using the read_* functions.
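As a rough sketch of the loading step described above: the code below assumes a recent valr release in which read_bed() detects the number of BED fields automatically and read_bigwig() is available. The object names (genes, repeats, dnase) are placeholders chosen for illustration, not names the problem set requires.

```r
library(valr)

# valr_example() returns the full path to a file bundled with the package
genes_path   <- valr_example("hg19.refGene.chr22.bed.gz")
repeats_path <- valr_example("hg19.rmsk.chr22.bed.gz")
dnase_path   <- valr_example("hg19.dnase1.bw")

# Assumption: read_bed() infers BED6 vs. BED12 from the file contents
genes   <- read_bed(genes_path)
repeats <- read_bed(repeats_path)

# bigWig files need their own reader
dnase <- read_bigwig(dnase_path)
```

Each call should return a tibble with at least chrom, start, and end columns, which is what the other valr functions expect.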
# Load the three data files
Some valr functions require a “genome” file, which is just a tibble of chromosome names and sizes.
The hg19 genome file is available at valr_example("hg19.chrom.sizes.gz"). Use read_genome() to load it.
Inspect the genome tibble. How many columns does it have? What is the largest chromosome?
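One possible way to inspect the genome tibble is sketched below. It assumes the tibble returned by read_genome() has a size column, as the description above suggests; the names n_columns and largest are illustrative only.

```r
library(valr)
library(dplyr)

genome <- read_genome(valr_example("hg19.chrom.sizes.gz"))

# Number of columns in the genome tibble
n_columns <- ncol(genome)

# Row for the chromosome with the largest size
largest <- genome |>
  slice_max(size, n = 1)
```

Printing n_columns and largest gives the two values needed for the answer line.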
# Load genome file and find the largest chromosome
Answers:
There are **___** columns in the genome tibble and the largest chromosome is **___**.
Which repeat class covers the largest amount of chromosome 22? You need to calculate the sum of the sizes of all intervals in each repeat class.
A common pattern for this is, e.g., arrange(desc(size)), then pull(name), then head(1).
You could also use slice_max(size, n = 1) followed by pull(name).
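The summarize-then-pick-the-top pattern can be illustrated on a small made-up table. The intervals and class names below are invented for the example and are not taken from the rmsk file; the size of a BED interval is taken as end - start.

```r
library(dplyr)
library(tibble)

# Toy repeat annotations (hypothetical values); `name` holds the repeat class
toy_repeats <- tribble(
  ~chrom,  ~start, ~end, ~name,
  "chr22",    100,  400, "LINE",
  "chr22",    500,  600, "SINE",
  "chr22",    700, 1200, "LINE",
  "chr22",   1300, 1350, "LTR"
)

top_class <- toy_repeats |>
  mutate(size = end - start) |>   # size of each interval
  group_by(name) |>
  summarize(size = sum(size)) |>  # total bases covered per class
  slice_max(size, n = 1) |>       # keep the class with the largest total
  pull(name)
```

The same pipeline should carry over to the real repeat table once it is loaded.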
# Calculate total coverage by repeat class and find the highest
Answer:
The repeat class with the highest coverage is _____.
Which promoter has the highest DNase I accessibility?
Use the create_tss() function to generate transcription start sites from the refGene annotations. How big are these intervals?
Generate promoter regions with bed_slop(), adding 500 bp up- and downstream of the TSS. bed_slop() requires the genome file above.
Use bed_map() to calculate the total (i.e., summed) DNase I signal in the promoters (the score column in the DNase file).
Which gene has the highest DNase I signal in the promoter regions you defined above?
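The three steps above can be sketched on toy data. Everything in this example (gene names, coordinates, signal values, and the single-chromosome genome) is made up for illustration. It also assumes that summary expressions passed to bed_map() refer to columns of the second table, which is how the score column of the DNase data is used in the step above.

```r
library(valr)
library(dplyr)
library(tibble)

# Hypothetical genome and BED6-style gene annotations
toy_genome <- tibble(chrom = "chr1", size = 10000)

toy_genes <- tribble(
  ~chrom, ~start, ~end, ~name,   ~score, ~strand,
  "chr1",   1000, 2000, "geneA",      0, "+",
  "chr1",   5000, 6000, "geneB",      0, "-"
)

# Hypothetical signal intervals with a `score` column
toy_signal <- tribble(
  ~chrom, ~start, ~end, ~score,
  "chr1",    900, 1100,      5,
  "chr1",   1400, 1600,      3,
  "chr1",   3000, 3100,     50,
  "chr1",   5900, 6100,     10
)

# 1. Strand-aware transcription start sites
tss <- create_tss(toy_genes)

# 2. Extend each TSS by 500 bp on both sides
promoters <- bed_slop(tss, toy_genome, both = 500)

# 3. Sum the signal overlapping each promoter
promoter_signal <- bed_map(promoters, toy_signal, total = sum(score))

top_gene <- promoter_signal |>
  slice_max(total, n = 1) |>
  pull(name)
```

Checking tss$end - tss$start is a quick way to see how wide the intervals from create_tss() are.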
# Create TSS, generate promoter regions, and calculate DNase I signal
Answer:
The gene with the highest DNase I signal in its promoter is _____.
Is DNase I accessibility in promoters significantly higher than expected by chance?
Calculate the mean DNase I signal across all promoters from Question 3.
Use bed_shuffle() to generate 1000 random intervals of the same size as your promoters. You’ll need to provide the genome file and set.seed(42) for reproducibility.
Use bed_map() to calculate DNase I signal in these random regions.
Calculate what fraction of random regions have mean DNase I signal greater than or equal to your observed promoter mean. This is your empirical p-value.
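A minimal sketch of the shuffle-and-compare idea on toy data follows. The genome, promoters, and signal are invented, and only three intervals are shuffled rather than 1000. In addition to set.seed(42), the sketch passes seed = 42 to bed_shuffle(); using that argument is an assumption made here for reproducibility, not something the problem set asks for.

```r
library(valr)
library(dplyr)
library(tibble)

# Hypothetical single-chromosome genome
toy_genome <- tibble(chrom = "chr1", size = 100000)

# Hypothetical promoter intervals
toy_promoters <- tribble(
  ~chrom, ~start, ~end,
  "chr1",   1000,  2000,
  "chr1",   5000,  6000,
  "chr1",   9000, 10000
)

# Hypothetical signal tiling the whole chromosome in 1 kb bins
toy_signal <- tibble(
  chrom = "chr1",
  start = seq(0, 99000, by = 1000),
  end   = start + 1000,
  score = rep(c(1, 2, 3, 4), 25)
)

# Observed mean of the summed signal per promoter
observed <- bed_map(toy_promoters, toy_signal, total = sum(score))
observed_mean <- mean(observed$total, na.rm = TRUE)

# Randomly reposition the intervals; interval sizes are preserved
set.seed(42)
shuffled <- bed_shuffle(toy_promoters, toy_genome, seed = 42)
random_signal <- bed_map(shuffled, toy_signal, total = sum(score))

# Fraction of random regions with signal at least as high as the observed mean
p_value <- mean(random_signal$total >= observed_mean, na.rm = TRUE)
```

With real data, the fraction is more informative when it is computed over many random regions, as the 1000 intervals requested above.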
# Calculate observed mean and perform permutation test
Answer:
The observed mean DNase I signal in promoters is _____, the mean of random signals is _____, and the empirical p-value is _____, so DNase I accessibility in promoters is _____.
Be sure to click the “Render” button to render the HTML output.
Then paste the URL of this Posit Cloud project into the problem set on Canvas.