DNA Block - Problem Set 16
Problem Set
You have two tasks for this problem set.
- Read the two papers in the preparation document before class on Wed.
- Look over the vignettes for the software in the preparation document.
Use valr to complete the tasks below. These problems are due Wed at 5pm.
Each problem below is worth 5 points.
Setup
Load libraries you’ll need for analysis below.
Question 1 – 5 points
We’ll work with a few different files for the next questions.
- hg19.refGene.chr22.bed.gz is a BED12 file containing gene (mRNA) information for chr22.
- hg19.rmsk.chr22.bed.gz is a BED6 file containing repetitive elements in the human genome.
- hg19.dnase1.bw is a bigWig file containing DNase-seq signal.
You can find the path to each with valr_example(). Load each one individually using the valr read_* functions.
Some valr functions require a “genome” file, which is just a tibble of chromosome names and sizes.
The hg19 genome file is available at valr_example("hg19.chrom.sizes.gz"). Use read_genome() to load it.
Inspect the tibble. How many columns does it have? What is the largest chromosome?
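As a minimal sketch of this kind of inspection, the example below builds a made-up genome tibble and uses dplyr to count its columns and find its largest entry. The chromosome names and sizes are invented for illustration and are not taken from hg19; the column names chrom and size are assumed to match what read_genome() returns.

```r
library(dplyr)

# Hypothetical genome tibble: names and sizes are invented, not hg19 values
toy_genome <- tibble::tribble(
  ~chrom, ~size,
  "chrA",  5000,
  "chrB", 12000,
  "chrC",   800
)

# Number of columns in the tibble
n_columns <- ncol(toy_genome)

# Sort by size, largest first, and keep the top row
largest <- toy_genome %>%
  arrange(desc(size)) %>%
  slice(1)
```

The same calls should apply unchanged to the tibble loaded from the real genome file.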
Question 2 – 5 points
Which repeat class covers the largest amount of chromosome 22? Use dplyr tools to analyze the repeats in hg19.rmsk.chr22.bed.gz.
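One way to approach a coverage question like this is to compute each interval's width and sum the widths per group. The sketch below uses a made-up tibble rather than the assignment file. The intervals are invented, and it assumes the repeat class is stored in the name column, which you should verify against the real data.

```r
library(dplyr)

# Hypothetical repeat intervals: coordinates and classes are invented
toy_repeats <- tibble::tribble(
  ~chrom, ~start, ~end, ~name,
  "chr1",    100,  400, "SINE",
  "chr1",    500, 1500, "LINE",
  "chr1",   2000, 2100, "SINE",
  "chr1",   3000, 3050, "LTR"
)

# BED coordinates are 0-based and half-open, so width is end - start
class_coverage <- toy_repeats %>%
  mutate(width = end - start) %>%
  group_by(name) %>%
  summarize(total_bp = sum(width)) %>%
  arrange(desc(total_bp))
```

Note that summing widths counts overlapping intervals more than once, so this is only an exact measure of coverage when intervals within a class do not overlap.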
Question 3 – 5 points
Which promoter has the highest DNase I accessibility?
- Use the valr create_tss() function to generate transcription start sites from the BED12 refGene annotations. How big are these intervals?
- Generate promoter regions from the TSS with bed_slop(), adding 500 bp up- and downstream (i.e., both sides). bed_slop() requires the genome file above. How big are the regions now?
- Use bed_map() to calculate the total (i.e., summed) DNase I signal in the promoters (using the score column in the DNase file).
Which gene has the highest DNase I signal in the promoter regions you defined above?
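The three steps above can be sketched on toy data as follows. Everything in the input tibbles (the single chromosome, the two genes, and the signal intervals) is invented for illustration, and total is an arbitrary name chosen here for the summary column. Treat this as a sketch of how the calls fit together, not as the expected solution.

```r
library(dplyr)
library(valr)

# Hypothetical genome, gene, and signal tibbles (all values invented)
toy_genome <- tibble::tribble(
  ~chrom, ~size,
  "chr1", 10000
)

toy_genes <- tibble::tribble(
  ~chrom, ~start, ~end, ~name,   ~strand,
  "chr1",   1000, 2000, "geneA", "+",
  "chr1",   5000, 7000, "geneB", "-"
)

toy_signal <- tibble::tribble(
  ~chrom, ~start, ~end, ~score,
  "chr1",    900, 1100,  5,
  "chr1",   1200, 1300,  2,
  "chr1",   6800, 7200, 10
)

# Step 1: strand-aware, single-base TSS intervals
tss <- create_tss(toy_genes)

# Step 2: extend 500 bp on both sides (1 + 2 * 500 = 1001 bp per region)
promoters <- bed_slop(tss, toy_genome, both = 500)

# Step 3: sum the signal scores overlapping each promoter, highest first
promoter_signal <- bed_map(promoters, toy_signal, total = sum(score)) %>%
  arrange(desc(total))
```

Checking the interval widths after each step (end - start) is a quick way to confirm that the TSS and promoter definitions behave as intended before mapping the signal.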
Submit
Be sure to click the “Render” button to render the HTML output.
Then paste the URL of this Posit Cloud project into the problem set on Canvas.