Factor-centric chromatin analysis

Author

Jay Hesselberth

Published

October 21, 2024

Where do transcription factors bind in the genome?

Today we’ll look at where two yeast transcription factors bind in the genome using CUT&RUN.

Techniques like CUT&RUN require an affinity reagent (e.g., an antibody) that uniquely recognizes a transcription factor in the cell.

This antibody is added to permeabilized cells, and the antibody associates with the epitope. A separate reagent, a fusion of Protein A (which binds IgG) and micrococcal nuclease (MNase) then associates with the antibody. Addition of calcium activates MNase, and nearby DNA is digested. These DNA fragments are then isolated and sequenced to identify sites of TF association in the genome.

Fig 1a, Skene et al.

Data download and pre-processing

CUT&RUN data were downloaded from the NCBI GEO page for Skene et al.

I selected the 16 second time point for S. cerevisiae Abf1 and Reb1 (note the paper combined data from the 1-32 second time points).

BED files containing mapped DNA fragments were separated by size and converted to bigWig with:

# separate fragments by size
awk '($3 - $2 <= 120)' Abf1.bed > CutRun_Abf1_lt120.bed
awk '($3 - $2 => 150)' Abf1.bed > CutRun_Abf1_gt150.bed

# for each file with the different sizes
bedtools genomecov -i Abf1.bed -g sacCer3.chrom.sizes -bg > Abf1.bg
bedGraphToBigWig Abf1.bg sacCer3.chrom.sizes Abf1.bw

The bigWig files are available here in the data/ directory.

Questions

  1. How do you ensure your antibody recognizes what you think it recognizes? What are important controls for ensuring it recognizes a specific epitope?

  2. What are important controls for CUT&RUN experiments?

CUT&RUN analysis

Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
# genome viz
library(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene)
library(Gviz)
library(rtracklayer)

# motif discovery and viz
library(BSgenome.Scerevisiae.UCSC.sacCer3)
library(rGADEM)
library(seqLogo)

Examine genome coverage

track_start <- 90000
track_end <- 150000

# genes track
sgd_genes_trk <-
  GeneRegionTrack(
    TxDb.Scerevisiae.UCSC.sacCer3.sgdGene,
    chromosome = "chrII",
    start = track_start,
    end = track_end,
    fill = "#009E73",
    background.title = "white",
    col.title = "black",
    fontsize = 14
  )

# signal tracks
track_info <-
  tibble(
    file_name = c(
      "CutRun_Reb1_lt120.bw",
      "CutRun_Abf1_lt120.bw",
      "CutRun_Reb1_gt150.bw",
      "CutRun_Abf1_gt150.bw"
    ),
    sample_type = c(
      "Reb1_Short", "Abf1_Short",
      "Reb1_Long", "Abf1_Long"
    )
  ) |>
  mutate(
    file_path = here("data/block-dna", file_name),
    big_wig = purrr::map(
      file_path, ~ import.bw(.x, as = "GRanges")
    ),
    data_track = purrr::map2(
      big_wig, sample_type,
      ~ DataTrack(
          .x,
          name = .y,
          background.title = "white",
          col.title = "black",
          col.axis = "black",
          fontsize = 12
      )
    )
  ) |>
  dplyr::select(sample_type, big_wig, data_track)

# x-axis track
x_axis_trk <- GenomeAxisTrack(
  col = "black",
  col.axis = "black",
  fontsize = 16
)

Now that we have tracks loaded, we can make a plot.

plotTracks(
  c(
    sgd_genes_trk,
    track_info$data_track,
    x_axis_trk
  ),
  from = track_start,
  to = track_end,
  chromosome = "chrII",
  transcriptAnnotation = "gene",
  shape = "arrow",
  type = "histogram"
)

Questions

  1. What features stand out in the above tracks? What is different between Reb1 and Abf1? Between the short and long fragments?

  2. Where are the major signals with respect to genes?

Peak calling

A conceptually simple approach to identification of regions containing “peaks” where a transcription factor was bound is available in the MACS software (paper, github). There’s also a nice blog post covering the main ideas.

Theory

The Poisson distribution is a discrete probability distribution of the form:

\[ P_\lambda (X=k) = \frac{ \lambda^k }{ k! * e^{-\lambda} } \]

where \(\lambda\) captures both the mean and variance of the distribution.

The R functions dpois(), ppois(), and rpois() provide access to the density, distribution, and random generation for the Poisson distribution. See ?dpois for details.

Practice

Here, we model read coverage using the Poisson distribution. Given a genome size \(G\) and and a number of reads collected \(N\), we can approximate \(\lambda\) from \(N/G\), scaled by the size of the sequence read (about 100 bp).

The intuition here is: where are \(N\) reads placed randomly on a genome of size \(G\)?

MACS uses this value (the “genomic” lambda) and also calculates several “local” lambda values to account for variation among genomic regions. We’ll just using the genomic lambda, which provides a conservative threshold for peak calling.

Using genomic lambda, we can use the Poisson distribution to address the question: How surprised should I be to see \(k\) reads at position X?

abf1_tbl <- read_bigwig(here("data/block-dna/CutRun_Abf1_lt120.bw"))

# number of reads in the original Abf1 BED file
total_reads <- 16e6

genome <- read_genome(here("data/block-dna/sacCer3.chrom.sizes"))
# how can we calculate genome size?
genome_size <- ___ 

genome_lambda <- total_reads / genome_size

peak_calls <-
  abf1_tbl |>
  # define single-base sites
  mutate(
    midpoint = start + round((end - start) / 2),
    start = midpoint,
    end = start + 1,
    # use the poisson to calculate a p-value with the genome-wide lambda
    pval = ___(score, genome_lambda),
    # convert p-values to FDR
    fdr = p.adjust(pval, method = "fdr")
  )

Let’s take a look at a plot of the p-value across a chromosome. What do you notice about this plot, when compared to the coverage of the CUT&RUN coverage above?

ggplot(
  filter(peak_calls, chrom == "chrII"),
  # convert p-values to positive values for plotting
  aes(start, ___)
) + 
  geom_line() +
  xlim(track_start, track_end) +
  theme_cowplot() 

How many peaks are called in this region?

# most stringent cut-off
peak_calls_sig <- 
  filter(
    peak_calls,
    fdr == 0
    ) |>
    # collapse neighboring, significant sites
    bed_merge(max_dist = 20)

filter(
  peak_calls_sig,
  chrom == "chrII" &
    start >= track_start &
    end <= track_end
)

Let’s visualize these peaks in the context of genomic CUT&RUN signal. We need to define an AnnotationTrack with the peak intervals, which we can plot against the CUT&RUN coverage we defined above.

# need a GRanges object to convert to an AnnotationTrack
peak_calls_gr <-
  GRanges(
    seqnames = peak_calls_sig$chrom,
    ranges = IRanges(peak_calls_sig$start, peak_calls_sig$end)
  )

peak_calls_trk <-
  AnnotationTrack(
    peak_calls_gr,
    name = "Peak calls",
    fill = "red",
    background.title = "white",
    col.title = "red",
    fontsize = 16,
    rotation.title = 0
  )

abf1_short_trk <-
  filter(
    track_info,
    sample_type == "Abf1_Short"
  ) |>
  pull(data_track)

plotTracks(
  c(
    sgd_genes_trk,
    abf1_short_trk,
    peak_calls_trk,
    x_axis_trk
  ),
  from = track_start,
  to = track_end,
  chromosome = "chrII",
  transcriptAnnotation = "gene",
  shape = "arrow",
  type = "histogram"
)

Questions

  1. How many peaks were called throughout the genome? How wide are the called peaks, on average?

  2. How else might we define a significance threshold for identifying peaks?

  3. What might the relative heights of the peaks indicate? What types of technical or biological variables might influence peak heights?