Mapping chromatin structure and transactions

Jay Hesselberth

RNA Bioscience Initiative | CU Anschutz

2024-10-21

Final Projects

Throughout the DNA and RNA blocks will refer to several data sets that might be good starting points for your final project (worth 20% of your grade).

We will ask for a sketch of the rough plan for a final project by the end of week 7 (Fri Oct 11). In addition, if you plan to work in a group, we’d like to know who you will be working with.

Final projects will be due Friday Nov 3. We will schedule short (5 minute) talks by each group on Oct 28 and Oct 29.

Gene regulation by chromatin

Genome organization

Cell signaling

Organism development

DNA accessibility drives all biochemical transactions on the genome

  • Transcription Initiation
  • Transcription Elongation
  • DNA Repair
  • Initiation of DNA Replication
  • Recombination
  • Viral Integration

Mapping & measurement are first steps toward understanding.

Before genome-wide DNA accessibility measurements, we knew about chromatin transactions at only a handful of loci.

This was a classic “keys under the lamppost” situation, leading to general models of chromatin-based gene regulation.

Using micrococcal nuclease (MNase) to map chromatin

  • Micrococcal nuclease is an endo/exonuclease from Staphylococcus aureus
  • Efficient, multiple turnover enzyme that digests accessible DNA (& RNA)
  • Dependent on calcium (Ca2+)

Using ATAC to map chromatin

  • Tn5 transposase catalyzes “cut-and-paste” insertion of DNA into a target
  • In ATAC (“Assay for Transposase-Accessible Chromatin”), the transposase enzymes are loaded with DNA sequencing adaptors (blue and red in the image), so the products of transposition are ready to PCR.
  • Single turnover enzyme that acts on accessible DNA.
  • Requires about ~60 bp of accessible for transposition.

Chromatin mapping experimental summary

DNase-seq ATAC-seq MNase-seq
Genome representation Most active regions Most active regions Whole genome
Ease of experiment Very difficult Easy peasy One day’s work
What is profiled? Accessible DNA, “footprints” at low cut frequency Accessible DNA. Not really “footprints”, single turnover enzyme, so fragments are not informative Protections of TFs and nucleosomes

Experimental details

Analysis details

Experimental workflow

Data structures and tools used for analysis of chromatin mapping experiments

File formats

BED format

  • Contains information about genomic intervals.
  • Used to represent genomic features (exons, introns, transcription start sites)
  • First 3 columns are: chrom, start, end
  • Next 3 are: name, score, strand. Strand can be +, -, or . (no strand)
chr7 127473530 127474697 Pos3 0  +
chr7 127474697 127475864 Pos4 0  +
chr7 127475864 127477031 Neg1 0  -
chr7 127477031 127478198 Neg2 0  -

WIG / bedGraph

  • WIG and bedGraph store interval signals.

    • Genome sequencing coverage
    • Coverage of reads from MNase / ATAC-seq
  • Many studies will provide genome-scale data in these formats

chr19 49302000 49302300 -1.0
chr19 49302300 49302600 -0.75
chr19 49302600 49302900 -0.50
chr19 49302900 49303200 -0.25

Interval analysis

The primary tool in the genome interval analysis is BEDtools – it’s the Swiss-army knife of internal analysis.

We wrote an R package called valr that provides the same tools, but you don’t need to leave RStudio. valr provides the same tools for reading and manipulating genome intervals.

bed_intersect() is a fundamental operation. It identifies intervals from two tibbles that intersect and reports their overlaps.

Let’s take a look at that it does.

bed_intersect() example

library(valr)
library(dplyr)

x <- tribble(
  ~chrom, ~start, ~end,
  "chr1", 25, 50,
  "chr1", 100, 125
)

y <- tribble(
  ~chrom, ~start, ~end,
  "chr1", 30,     75
)

bed_intersect(x, y)
# A tibble: 1 × 6
  chrom start.x end.x start.y end.y .overlap
  <chr>   <dbl> <dbl>   <dbl> <dbl>    <int>
1 chr1       25    50      30    75       20

valr example

You can use read_bed() and related functions to load genome annotations and signals.

library(valr)

snps <- read_bed(
  valr_example("hg19.snps147.chr22.bed.gz"),
  n_fields = 6
)

genes <- read_bed(
  valr_example("genes.hg19.chr22.bed.gz"),
  n_fields = 6
)

What is in snps and genes?

head(snps)
# A tibble: 6 × 6
  chrom    start      end name        score strand
  <chr>    <int>    <int> <chr>       <chr> <chr> 
1 chr22 16053247 16053248 rs587721086 0     +     
2 chr22 16053443 16053444 rs80167676  0     +     
3 chr22 16055964 16055965 rs587706951 0     +     
4 chr22 16069373 16069374 rs2154787   0     +     
5 chr22 16069782 16069783 rs1963212   0     +     
6 chr22 16100513 16100514 rs8140563   0     +     
head(genes)
# A tibble: 6 × 6
  chrom    start      end name      score strand
  <chr>    <int>    <int> <chr>     <chr> <chr> 
1 chr22 16150259 16193004 AK022914  8     -     
2 chr22 16162065 16172265 LINC00516 3     +     
3 chr22 16179617 16181004 BC017398  1     -     
4 chr22 16239287 16239327 DQ590589  1     +     
5 chr22 16240245 16240277 DQ573684  1     -     
6 chr22 16240300 16240340 DQ595048  1     -     

Interval manipulation

Let’s find and characterize intergenic SNPs. We’ll use the tools bed_substract() and bed_closest(). Take a look and their examples in the documentation to see what they do.

# find snps in intergenic regions
intergenic <- bed_subtract(snps, genes)

# find distance from intergenic snps to nearest gene
nearby <- bed_closest(intergenic, genes)

Take a look at the intergenic and nearby objects in the console.

Interval manipulation

Now that you have overlaps and distances between SNPs and genes, you can go back to dplyr tools to generate reports.

library(dplyr)

nearby |>
  select(
    starts_with("name"),
    .overlap,
    .dist
  ) |>
  filter(abs(.dist) < 5000)
# A tibble: 1,047 × 4
   name.x      name.y   .overlap .dist
   <chr>       <chr>       <int> <int>
 1 rs530458610 P704P           0  2579
 2 rs2261631   P704P           0  -268
 3 rs570770556 POTEH           0  -913
 4 rs538163832 POTEH           0  -953
 5 rs190224195 POTEH           0 -1399
 6 rs2379966   DQ571479        0  4750
 7 rs142687051 DQ571479        0  3558
 8 rs528403095 DQ571479        0  3309
 9 rs555126291 DQ571479        0  2745
10 rs5747567   DQ571479        0 -1778
# ℹ 1,037 more rows

bed_map() example

bed_map() does two things in order:

  1. It identifies intersecting intervals between x and y
  2. Calculates summary statistics based on the intersection

A typical use is to count up signals (e.g., coverage from an MNase-seq experiment) over specific regions (e.g., promoter regions).

bed_map() example

Copy / paste these into your console.

x <- tribble(
  ~chrom, ~start, ~end,
  "chr1", 100,    250,
  "chr2", 250,    500
)

y <- tribble(
  ~chrom, ~start, ~end, ~value,
  "chr1", 100,    250,  10,
  "chr1", 150,    250,  20,
  "chr2", 250,    500,  500
)

bed_map() example continued

First examine the intersecting intervals.

bed_intersect(x, y)
# A tibble: 3 × 7
  chrom start.x end.x start.y end.y value.y .overlap
  <chr>   <dbl> <dbl>   <dbl> <dbl>   <dbl>    <int>
1 chr1      100   250     100   250      10      150
2 chr1      100   250     150   250      20      100
3 chr2      250   500     250   500     500      250
bed_map(
  x, y,
  .sum = sum(value),
  .count = length(value)
)
# A tibble: 2 × 5
  chrom start   end  .sum .count
  <chr> <dbl> <dbl> <dbl>  <int>
1 chr1    100   250    30      2
2 chr2    250   500   500      1