-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr 1.1.3 v readr 2.1.4
v forcats 1.0.0 v stringr 1.5.0
v ggplot2 3.4.3 v tibble 3.2.1
v lubridate 1.9.2 v tidyr 1.3.0
v purrr 1.0.2
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
DNA Block - Problem Set 16
Problem Set
You have two tasks for this problem set.
Read the two papers in the preparation document before class on Wed.
Look over the vignettes for the software in the preparation document. Use valr to complete the tasks below. These problems are due Wed at 5pm.
Each problem below is worth 15 points.
Setup
Load libraries you’ll need for analysis below.
Question 1 – 5 points
We’ll work with a few different files for the next questions.
-
hg19.refGene.chr22.bed.gz
is a BED12 file containing gene (mRNA) information for chr22. -
hg19.rmsk.chr22.bed.gz
is a BED6 containing repetitive elements in the human genome. -
hg19.dnase1.bw
is a bigWig file containing DNaseI signal.
You can find the path to each with valr_example()
. Load each one invdividually using the read_*
functions.
gene_tbl <- read_bed(valr_example("hg19.refGene.chr22.bed.gz"))
rmsk_tbl <- read_bed(valr_example("hg19.rmsk.chr22.bed.gz"))
dnase_tbl <- read_bigwig(valr_example("hg19.dnase1.bw"))
Some valr functions require a “genome” file, which is just a tibble of chromosome names and sizes.
The hg19 genome file is available at valr_example("hg19.chrom.sizes.gz")
. Use read_genome()
to load it.
Inspect the tibble. How many columns does it have? What is the largest chromosome?
genome <- read_genome(valr_example("hg19.chrom.sizes.gz"))
ncol(genome)
[1] 2
# A tibble: 1 x 2
chrom size
<chr> <dbl>
1 chr1 249250621
Question 2 – 5 points
Which repeat class covers the largest amount of chromosome 22?
# N.B.: the following doesn't collapse intervals within repeat classes, but it's a
# decent approximation.
rmsk_tbl |>
mutate(int_size = end - start) |>
group_by(name) |>
summarize(total_size = sum(int_size)) |>
arrange(desc(total_size))
# A tibble: 1,053 x 2
name total_size
<chr> <int>
1 ALR/Alpha 110767
2 AluSx1 103112
3 AluY 94814
4 AluSx 88923
5 MIRb 83177
6 L2a 73070
7 AluSz 72584
8 AluJb 67103
9 AluSq2 61591
10 AluJr 49571
# i 1,043 more rows
Question 3
Which promoter has the highest DNase I accessibility?
- Use the
create_tss()
function to generate transcription start sites from the refGene annotations. How be are these intervals? - Generate some promoter regions with
bed_slop()
, adding 500 bp up- and downstream of the TSS.bed_slop()
requires the genome file above. - Use
bed_map()
to calculate the total (i.e., summed) DNase I signal in the promoters (thescore
column in the DNase file).
Which gene has the highest DNase I in the region you defined above?
tss_tbl <- create_tss(gene_tbl)
promoter_tbl <- bed_slop(tss_tbl, genome, both = 500)
bed_map(
promoter_tbl,
dnase_tbl,
score_sum = sum(score)
) |>
arrange(desc(score_sum)) |>
head(1)
# A tibble: 1 x 7
chrom start end name score strand score_sum
<chr> <dbl> <dbl> <chr> <chr> <chr> <dbl>
1 chr22 17082300 17083301 NR_001591 0 + 863
Submit
Be sure to click the “Render” button to render the HTML output.
Then paste the URL of this Posit Cloud project into the problem set on Canvas.