DNA Block - Problem Set 16

Author

Your Name Here

Published

October 21, 2024

Problem Set

You have two tasks for this problem set.

  1. Read the two papers in the preparation document before class on Wednesday.

  2. Look over the vignettes for the software in the preparation document. Use valr to complete the tasks below. These problems are due Wednesday at 5 pm.

Each problem below is worth 5 points.

Setup

Load the libraries you’ll need for the analysis below.

-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr     1.1.3     v readr     2.1.4
v forcats   1.0.0     v stringr   1.5.0
v ggplot2   3.4.3     v tibble    3.2.1
v lubridate 1.9.2     v tidyr     1.3.0
v purrr     1.0.2     
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Question 1 – 5 points

We’ll work with a few different files for the next questions.

  • hg19.refGene.chr22.bed.gz is a BED12 file containing gene (mRNA) information for chr22.
  • hg19.rmsk.chr22.bed.gz is a BED6 file containing repetitive elements on chr22.
  • hg19.dnase1.bw is a bigWig file containing DNase-seq signal.

You can find the path to each with valr_example(). Load each one individually using the valr read_* functions.

Some valr functions require a “genome” file, which is just a tibble of chromosome names and sizes.

The hg19 genome file is available at valr_example("hg19.chrom.sizes.gz"). Use read_genome() to load it.

Inspect the tibble. How many columns does it have? Which chromosome is the largest?
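
The loading steps described above might look roughly like the sketch below. It assumes a recent valr release in which read_bed() detects the number of fields on its own (older releases may need the field count supplied explicitly), and the object names genome, genes, and repeats are placeholders chosen for illustration.

```r
library(valr)
library(dplyr)

# valr_example() returns the full path to a file shipped with the valr package
genome <- read_genome(valr_example("hg19.chrom.sizes.gz"))

# BED12 gene annotations and BED6 repeat annotations for chr22
genes <- read_bed12(valr_example("hg19.refGene.chr22.bed.gz"))
repeats <- read_bed(valr_example("hg19.rmsk.chr22.bed.gz"))

# Inspect the genome tibble: column names, then chromosomes ordered by size
names(genome)
arrange(genome, desc(size))
```

The bigWig file can be loaded in the same way with read_bigwig().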

Question 2 – 5 points

Which repeat class covers the largest amount of chromosome 22? Use dplyr tools to analyze the repeats in hg19.rmsk.chr22.bed.gz.
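
One way to approach this kind of question is to compute each interval's width, then sum the widths per group. The sketch below uses a small made-up tibble rather than the real repeat file; the toy intervals and the assumption that the name column holds the class label are for illustration only, so check which column carries the class in the actual data.

```r
library(dplyr)
library(tibble)

# Toy BED6-style intervals (made up for illustration, not real annotations)
toy_repeats <- tribble(
  ~chrom,  ~start, ~end, ~name,  ~score, ~strand,
  "chr22",    100,  400, "SINE",      0, "+",
  "chr22",    500, 1500, "LINE",      0, "-",
  "chr22",   2000, 2200, "SINE",      0, "+",
  "chr22",   3000, 3100, "LTR",       0, "-"
)

# BED intervals are zero-based and half-open, so width is end - start
coverage <- toy_repeats |>
  mutate(width = end - start) |>
  group_by(name) |>
  summarize(total_bp = sum(width)) |>
  arrange(desc(total_bp))

coverage
```

Note that summing widths counts overlapping intervals within a group more than once, so it measures total annotated length rather than unique bases covered.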

Question 3 – 5 points

Which promoter has the highest DNase I accessibility?

  1. Use the valr create_tss() function to generate transcription start sites from the BED12 refGene annotations. How big are these intervals?
  2. Generate promoter regions from the TSS with bed_slop(), adding 500 bp up- and downstream (i.e., both sides). bed_slop() requires the genome file above. How big are the regions now?
  3. Use bed_map() to calculate the total (i.e., summed) DNase I signal in the promoters (using the score column in the DNase file).

Which gene has the highest DNase I signal in the promoter regions you defined above?
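
The three steps above can be sketched on toy data as follows. Everything in the tibbles is made up for illustration. Because the toy genes are not BED12 records, the TSS step is written by hand with mutate() as a stand-in for create_tss(); with the real refGene annotations you would call create_tss() instead.

```r
library(valr)
library(dplyr)
library(tibble)

# Toy genome, genes, and signal (hypothetical values, not real data)
toy_genome <- tibble(chrom = "chr1", size = 10000)

toy_genes <- tribble(
  ~chrom, ~start, ~end, ~name,   ~strand,
  "chr1",   1000, 3000, "geneA", "+",
  "chr1",   5000, 8000, "geneB", "-"
)

toy_signal <- tribble(
  ~chrom, ~start, ~end, ~score,
  "chr1",    600,  700,      5,
  "chr1",   1200, 1300,      7,
  "chr1",   4000, 4100,    100,
  "chr1",   8000, 8100,      4
)

# Step 1: 1 bp TSS intervals; the TSS is the start for "+" genes
# and the last base for "-" genes (hand-rolled stand-in for create_tss())
tss <- toy_genes |>
  mutate(
    tss_start = ifelse(strand == "+", start, end - 1),
    start = tss_start,
    end = tss_start + 1
  ) |>
  select(-tss_start)

# Step 2: extend each TSS by 500 bp on both sides
promoters <- bed_slop(tss, toy_genome, both = 500)

# Step 3: sum the signal score over each promoter
signal <- bed_map(promoters, toy_signal, total = sum(score))

# Promoter with the highest summed signal
slice_max(signal, total, n = 1)
```

In this toy example each TSS is 1 bp wide and each promoter is 1001 bp wide (500 + 1 + 500). The high-scoring interval at 4000-4100 does not overlap either promoter, so it does not contribute to any total.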

Submit

Be sure to click the “Render” button to render the HTML output.

Then paste the URL of this Posit Cloud project into the problem set on Canvas.