DNA Block - Problem Set 16

Published

October 21, 2024

Problem Set

You have two tasks for this problem set.

Read the two papers in the preparation document before class on Wed.
Look over the vignettes for the software in the preparation document. Use valr to complete the tasks below. These problems are due Wed at 5pm.

Each problem below is worth 15 points.

Setup

Load libraries you’ll need for analysis below.

library(tidyverse)

-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr     1.1.3     v readr     2.1.4
v forcats   1.0.0     v stringr   1.5.0
v ggplot2   3.4.3     v tibble    3.2.1
v lubridate 1.9.2     v tidyr     1.3.0
v purrr     1.0.2     
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(valr)

Question 1 – 5 points

We’ll work with a few different files for the next questions.

hg19.refGene.chr22.bed.gz is a BED12 file containing gene (mRNA) information for chr22.
hg19.rmsk.chr22.bed.gz is a BED6 containing repetitive elements in the human genome.
hg19.dnase1.bw is a bigWig file containing DNaseI signal.

You can find the path to each with valr_example(). Load each one invdividually using the read_* functions.

gene_tbl <- read_bed(valr_example("hg19.refGene.chr22.bed.gz"))
rmsk_tbl <- read_bed(valr_example("hg19.rmsk.chr22.bed.gz"))
dnase_tbl <- read_bigwig(valr_example("hg19.dnase1.bw"))

Some valr functions require a “genome” file, which is just a tibble of chromosome names and sizes.

The hg19 genome file is available at valr_example("hg19.chrom.sizes.gz"). Use read_genome() to load it.

Inspect the tibble. How many columns does it have? What is the largest chromosome?

genome <- read_genome(valr_example("hg19.chrom.sizes.gz"))

ncol(genome)

[1] 2

arrange(genome, desc(size)) |> head(1)

# A tibble: 1 x 2
  chrom      size
  <chr>     <dbl>
1 chr1  249250621

Question 2 – 5 points

Which repeat class covers the largest amount of chromosome 22?

# N.B.: the following doesn't collapse intervals within repeat classes, but it's a
# decent approximation.
rmsk_tbl |>
  mutate(int_size = end - start) |>
  group_by(name) |>
  summarize(total_size = sum(int_size)) |>
  arrange(desc(total_size))

# A tibble: 1,053 x 2
   name      total_size
   <chr>          <int>
 1 ALR/Alpha     110767
 2 AluSx1        103112
 3 AluY           94814
 4 AluSx          88923
 5 MIRb           83177
 6 L2a            73070
 7 AluSz          72584
 8 AluJb          67103
 9 AluSq2         61591
10 AluJr          49571
# i 1,043 more rows

Question 3

Which promoter has the highest DNase I accessibility?

Use the create_tss() function to generate transcription start sites from the refGene annotations. How be are these intervals?
Generate some promoter regions with bed_slop(), adding 500 bp up- and downstream of the TSS. bed_slop() requires the genome file above.
Use bed_map() to calculate the total (i.e., summed) DNase I signal in the promoters (the score column in the DNase file).

Which gene has the highest DNase I in the region you defined above?

tss_tbl <- create_tss(gene_tbl)
promoter_tbl <- bed_slop(tss_tbl, genome, both = 500)

bed_map(
  promoter_tbl,
  dnase_tbl,
  score_sum = sum(score)
) |>
  arrange(desc(score_sum)) |>
  head(1)

# A tibble: 1 x 7
  chrom    start      end name      score strand score_sum
  <chr>    <dbl>    <dbl> <chr>     <chr> <chr>      <dbl>
1 chr22 17082300 17083301 NR_001591 0     +            863

Submit

Be sure to click the “Render” button to render the HTML output.

Then paste the URL of this Posit Cloud project into the problem set on Canvas.