Chromatin accessibility I

Chromatin-centric measurement of genomic features

Jay Hesselberth

RNA Bioscience Initiative | CU Anschutz

2024-10-21

Chromatin accessbility patterns and genome function

This class we’ll examine chromatin accessibility patterns and begin to get a sense of what they mean, both at the fine-scale (single base-pair) and across the genome.

Load the libraries

These are libraries we’ve used before. patchwork is a library for composing groups of plots.

These are new libraries specifically for genome analysis. You learned about valr and Gviz for your homework.

rtracklayer provides a few utility functions we’ll use today, and TxDb.Scerevisiae.UCSC.sacCer3.sgdGene provides gene annotations for the S. cerevisiae genome.

Load the data

In this and the next class we will analyze ATAC-seq and MNase-seq data sets from budding yeast. Here are the references for the two data set:

ATAC-seq

Schep AN, Buenrostro JD, Denny SK, Schwartz K, Sherlock G, Greenleaf WJ. Structured nucleosome fingerprints enable high-resolution mapping of chromatin architecture within regulatory regions. Genome Res. 2015 PMID: 26314830; PMCID: PMC4617971. [Link] [Data]

MNase-seq

Zentner GE, Henikoff S. Mot1 redistributes TBP from TATA-containing to TATA-less promoters. Mol Cell Biol. 2013 PMID: 24144978; PMCID: PMC3889552. [Link] [Data]

Experimental consideration

In a standard MNase-seq experiment, DNA around ~150 bp is extracted to look closely at nucleosome occupancy & positioning. However, the above study did not perform size selection. This is important as now we can look at both transcription factor binding sites and nucleosome positions.

Fragment size distributions are informative

First, we will determine the fragment size distributions obtained from the two experiments. These sizes are the fingerprints of particles that were protecting nuclear DNA from digestion.

I have performed the alignment of paired-end reads and converted all reads into a bed file where each line of the bed file denotes a single fragment from start to end.

First, load the ATAC-seq reads:

atac_tbl <- read_bed(
  here("data/block-dna/yeast_atac_chrII.bed.gz"),
)
atac_tbl

# A tibble: 409,870 × 3
   chrom start   end
   <chr> <int> <int>
 1 chrII     0    42
 2 chrII     0    52
 3 chrII     0    71
 4 chrII     0    75
 5 chrII     0    82
 6 chrII     0    83
 7 chrII     0    83
 8 chrII     0    83
 9 chrII     0   122
10 chrII     0   135
# ℹ 409,860 more rows

Next, load the MNase reads.

mnase_tbl <- read_bed(
  here("data/block-dna/yeast_mnase_chrII.bed.gz")
)
mnase_tbl

# A tibble: 1,236,334 × 3
   chrom start   end
   <chr> <int> <int>
 1 chrII     1   228
 2 chrII     2    80
 3 chrII     3   226
 4 chrII     4    60
 5 chrII     4    66
 6 chrII     4    77
 7 chrII     4    81
 8 chrII     5    56
 9 chrII     5    60
10 chrII     5    60
# ℹ 1,236,324 more rows

Working with a small genome is a huge advantage – we can study the whole chromosome in this class.

Expectations for chromatin fragment lengths

Let’s remind ourselves of the expectations for chromatin fragment lengths from MNase-seq and ATAC-seq experiments.

MNase-seq

ATAC-seq

Length distributions of chromatin-derived DNA fragments

For the MNase-seq BED file, you see that there are only three columns: chrom, start, and end. Calculating fragment length is simple:

mutate(mnase_tbl, frag_len = end - start)

# A tibble: 1,236,334 × 4
   chrom start   end frag_len
   <chr> <int> <int>    <int>
 1 chrII     1   228      227
 2 chrII     2    80       78
 3 chrII     3   226      223
 4 chrII     4    60       56
 5 chrII     4    66       62
 6 chrII     4    77       73
 7 chrII     4    81       77
 8 chrII     5    56       51
 9 chrII     5    60       55
10 chrII     5    60       55
# ℹ 1,236,324 more rows

Let’s use this approach to examine the fragment length distribution.

acc_tbl <-
  bind_rows(
    mutate(mnase_tbl, type = "mnase"),
    mutate(atac_tbl, type = "atac")
  )

ggplot(
  acc_tbl,
  aes(x = end - start)
) +
  geom_histogram(
    # single base-pair resolution
    binwidth = 1
  ) +
  facet_grid(
    rows = vars(type),
    scales = "free_y"
  ) +
  xlim(30, 500) +
  labs(
    x = "fragment length (bp)",
    title = "Histogram of fragment lengths from\naccessibility measurements"
  ) +
  theme_cowplot()

Interpretations

How would you describe the two fragment length distributions? Are they similar?
Can you make any biological conclusions based on the length distributions?

Periodicity in the fragment lengths

The ATAC data seems to be periodic. How can we test that hypothesis? We can calculate the autocorrelation of the length distribution. Can someone explain what autocorrelation means?

We’ll use the base hist function to calculate the densities of the above histogram. Let’s write a function we can use to analyze fragment lengths.

fragment_len_hist <- function(tbl) {
  frag_lens <-
    mutate(
      tbl,
      frag_len = end - start
    ) |>
    filter(
      frag_len >= 30 &
        frag_len <= 500
    ) |>
    pull(frag_len)

  hist(
    frag_lens,
    breaks = seq(30, 500, 1),
    plot = FALSE
  )
}

Autocorrelation

The density slot contains a vector of densities at base-pair resolution. We will use acf() to calculate the autocorrelation of these values, and will store the tidied result.

atac_acf_tbl <-
  acf(
    atac_frag_lens$density,
    lag.max = 40,
    plot = FALSE
  ) |>
  broom::tidy()

atac_acf_tbl

# A tibble: 41 × 2
     lag   acf
   <dbl> <dbl>
 1     0 1    
 2     1 0.975
 3     2 0.937
 4     3 0.916
 5     4 0.909
 6     5 0.902
 7     6 0.891
 8     7 0.882
 9     8 0.877
10     9 0.887
# ℹ 31 more rows

Autocorrelation

Now let’s plot the autocorrelation.

Autocorrelation

So, it looks like there significant bumps in autocorrelation at 10 and 21 bp positions, indicating that ATAC length distribution is periodic.

How do we confirm these bumps are interesting? Let’s calculate the acf of a negative control – the length distribution of the MNase data.

mnase_frag_lens <-
  fragment_len_hist(mnase_tbl)

mnase_acf_tbl <-
  acf(
    mnase_frag_lens$density,
    lag.max = 40,
    plot = FALSE
  ) |>
  broom::tidy()

plot_mnase_acf <- plot_acf(mnase_acf_tbl, title = "MNase ACF")

# patchwork plot
plot_atac_acf / plot_mnase_acf

Interpretation

We can see a monotonic decrease in the MNase-seq data, which confirms that the bumps we see are distinctive features of ATAC-seq data.

What are these features? Consider that the specificity of binding of DNase, MNase, and Tn5 is not completely generic. These enzymes have specificity for the minor groove of DNA, and there is an optimal substrate geometry for cleavage. You can see this in previous studies, where DNase-seq revealed high-resolution views of DNA:protein structures.

So what then, exactly is the ~10-11 bp periodicity? And why is this not present in MNase data?

Molecular Picture of DNA accessibility

Visualize read density in genomic region

We will use Gviz to visualize read densities relative to a reference.

Load tracks

First, we load the gene annotations from the Saccharomyces Genome Databases (SGD).

sgd_genes <-
  GeneRegionTrack(
    TxDb.Scerevisiae.UCSC.sacCer3.sgdGene,
    chromosome = "chrII",
    start = 530811,
    end = 540885,
  )

sgd_genes

GeneRegionTrack 'GeneRegionTrack'
| genome: sacCer3
| active chromosome: chrII
| annotation features: 7

Import bigWig

Next, import the bigwig file containing yeast nucleosome-sized fragments (via MNase-seq) using rtracklayer::import.bw().

Inspect the object.

(What is “GRanges”?)

mnase_nuc_gr <- import.bw(
  here("data/block-dna/yeast_mnase_134_160.bw"),
  as = "GRanges"
)
mnase_nuc_gr

GRanges object with 81160 ranges and 1 metadata column:
          seqnames        ranges strand |     score
             <Rle>     <IRanges>  <Rle> | <numeric>
      [1]    chrII         10-19      * |   3.81086
      [2]    chrII         20-29      * |   3.81086
      [3]    chrII         30-39      * |   5.71629
      [4]    chrII         40-49      * |   5.71629
      [5]    chrII         50-59      * |   5.71629
      ...      ...           ...    ... .       ...
  [81156]    chrII 813130-813139      * |  15.24344
  [81157]    chrII 813140-813149      * |   7.62172
  [81158]    chrII 813150-813159      * |   7.62172
  [81159]    chrII 813160-813169      * |   7.62172
  [81160]    chrII 813170-813179      * |   5.71629
  -------
  seqinfo: 1 sequence from an unspecified genome

Load track

Next, load the GRanges object as a track for Gviz to plot:

DataTrack 'MNase_nuc'
| genome: NA
| active chromosome: chrII
| positions: 81160
| samples:1
| strand: *

Vizualize a genomic region

Now, we can make a plot for this particular region of chrII:

# special track for the x-axis
x_axis <- GenomeAxisTrack()

plotTracks(
  c(
    sgd_genes,
    mnase_nuc_trk,
    x_axis
  ),
  from = 530811,
  to = 540885,
  chromosome = "chrII",
  transcriptAnnotation = "gene",
  shape = "arrow",
  type = "histogram"
)

Load the remaining data

That looks great! Let’s load all the other data sets.

Load each bigWig as a GRanges object with rtracklayer::import.bw()
Convert each to a Gviz::DataTrack() for plotting

Load the remaining data

We can do this one of two ways. We can do it one-by-one:

# complete the following steps for each of the four tracks
file_name <- "yeast_mnase_lt50.bw"
track_name <- "MNase_Short"
big_wig <- import.bw(here("data/block-dna", file_name))
data_track <- DataTrack(big_wig, track_name)

Load the remaining data

Or we can create a tibble with file and track names, and use purrr to load and convert each one:

track_info <-
  tibble(
    file_name = c(
      "yeast_mnase_lt50.bw",
      "yeast_mnase_134_160.bw",
      "yeast_atac_lt120.bw",
      "yeast_atac_gt120.bw"
    ),
    file_path = here("data/block-dna", file_name),
    track_name = c(
      "MNase_Short", "MNase_Long",
      "ATAC_Short", "ATAC_Long"
    )
  )

Load the remaining data

Now, we just have to make a list of tracks to plot and Gviz takes care of the rest.

plotTracks(
  c(
    sgd_genes,
    track_info$data_track
  ),
  from = 530811,
  to = 540885,
  chromosome = "chrII",
  transcriptAnnotation = "gene",
  shape = "arrow",
  type = "histogram"
)

Interpretations

Recall this plot:

Some questions to think about as you look at the tracks:

What is each data set reporting on?
What are the major differences between MNase-seq and ATAC-seq based on these tracks?
What can you infer about gene regulation based on these tracks?