Factor-centric chromatin analysis

Jay Hesselberth

RNA Bioscience Initiative | CU Anschutz

2024-10-21

Where do transcription factors bind in the genome?

Today we’ll look at where two yeast transcription factors bind in the genome using CUT&RUN.

Techniques like CUT&RUN require an affinity reagent (e.g., an antibody) that uniquely recognizes a transcription factor in the cell.

This antibody is added to permeabilized cells, where it associates with its epitope. A separate reagent, a fusion of Protein A (which binds IgG) and micrococcal nuclease (MNase), then associates with the antibody. Adding calcium activates MNase, which digests nearby DNA. The released DNA fragments are then isolated and sequenced to identify sites of TF association in the genome.

Fig 1a, Skene et al.

Data download and pre-processing

CUT&RUN data were downloaded from the NCBI GEO page for Skene et al.

I selected the 16-second time point for S. cerevisiae Abf1 and Reb1 (note that the paper combined data from the 1 to 32 second time points).

BED files containing mapped DNA fragments were separated by size and converted to bigWig with:

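# separate fragments by size (fragment length = end - start, columns 3 and 2)
awk '($3 - $2 <= 120)' Abf1.bed > CutRun_Abf1_lt120.bed
awk '($3 - $2 >= 150)' Abf1.bed > CutRun_Abf1_gt150.bed

# repeat for each size-selected file (shown here for the <= 120 bp fragments);
# the input BED must be sorted by chromosome
bedtools genomecov -i CutRun_Abf1_lt120.bed -g sacCer3.chrom.sizes -bg > CutRun_Abf1_lt120.bg
bedGraphToBigWig CutRun_Abf1_lt120.bg sacCer3.chrom.sizes CutRun_Abf1_lt120.bw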

The bigWig files are available in the data/ directory.
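
For those working in R rather than at the shell, the size selection can be sketched in base R. This is only an illustration: the fragment coordinates are made up, and the object names (`frags`, `small_frags`, `large_frags`) are hypothetical. It applies the same two cutoffs as the awk commands and shows that fragments between 120 and 150 bp land in neither class.

```r
# toy BED-like table of fragments (hypothetical coordinates)
frags <- data.frame(
  chrom = c("chrII", "chrII", "chrII", "chrII"),
  start = c(100L, 500L, 900L, 1500L),
  end   = c(180L, 640L, 1050L, 1700L)
)

# BED intervals are half-open, so fragment length is end - start
frags$len <- frags$end - frags$start

# same cutoffs as the awk commands: <= 120 bp and >= 150 bp
small_frags <- frags[frags$len <= 120, ]
large_frags <- frags[frags$len >= 150, ]

# fragments between the two cutoffs fall into neither class
n_dropped <- nrow(frags) - nrow(small_frags) - nrow(large_frags)
n_dropped
```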

CUT&RUN analysis

# A tibble: 5 × 3
  chrom  start    end
  <chr>  <int>  <int>
1 chrII 100248 100289
2 chrII 101292 101393
3 chrII 124916 124949
4 chrII 136181 136264
5 chrII 141070 141121

How do proteins recognize specific locations in the genome to bind?

Motif discovery

Theory

There are two major approaches to defining sequence motifs enriched in a sample: enumerative and probabilistic approaches.

Here we’ll apply a probabilistic approach (GADEM) to discover motifs in a collection of DNA sequences. During the RNA block, you’ll learn about k-mer analysis, which is an enumerative approach.

In each case, the goal is to define a set of sequence motifs that are enriched in a set of provided sequences (i.e., peaks from CUT&RUN data) relative to a genomic background.
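
To make "enriched relative to a background" concrete, here is a minimal base R sketch of the enumerative idea: count every k-mer in a foreground and a background set, then compare frequencies. The sequences are invented toy data, `list_kmers()` is a hypothetical helper, and the pseudocount of 1 is an arbitrary choice; this is not how GADEM works internally.

```r
# list all overlapping k-mers in a character vector of DNA sequences
list_kmers <- function(seqs, k) {
  unlist(lapply(seqs, function(s) {
    n <- nchar(s) - k + 1
    if (n < 1) return(character(0))
    starts <- seq_len(n)
    substring(s, starts, starts + k - 1)
  }))
}

# hypothetical toy sequences; real input would be peak and background DNA
peaks      <- c("TTCGTAAATGACG", "AACGTCCATGATT", "GGCGTTTTTGACC")
background <- c("ATATATGCGCATA", "TTGGCCAATTGGC", "CATCATCATCATC")

k <- 3
fg_kmers <- list_kmers(peaks, k)
bg_kmers <- list_kmers(background, k)

# count every k-mer seen in either set, using a shared set of levels
all_kmers <- sort(union(fg_kmers, bg_kmers))
fg_n <- as.vector(table(factor(fg_kmers, levels = all_kmers)))
bg_n <- as.vector(table(factor(bg_kmers, levels = all_kmers)))

# log2 ratio of peak to background frequency, with a pseudocount of 1
# so that k-mers absent from one set do not cause division by zero
enrich <- log2(((fg_n + 1) / sum(fg_n + 1)) / ((bg_n + 1) / sum(bg_n + 1)))
names(enrich) <- all_kmers

# most enriched k-mers in the toy peaks
head(sort(enrich, decreasing = TRUE), 3)
```

With real data the background matters a great deal: k-mers that are simply common in the genome (for example, A/T-rich runs) will look frequent in peaks unless they are compared against a suitable background set.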

Motifs are expressed as a position weight matrix (PWM), which captures the propensity for each position in a sequence motif to be a particular nucleotide.

PWMs can be drawn as sequence logos, which visually represent the amount of information at each position of the motif, typically as “information content” expressed in bits.
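
As a small worked example, the sketch below builds the probability form of a PWM from a handful of aligned sites and computes the information content at each position. The sites are invented, and the calculation assumes a uniform background (maximum of 2 bits per position for DNA) without any small-sample correction, so treat it as an illustration of the idea rather than what any particular logo package computes.

```r
# hypothetical aligned binding sites, all the same length
sites <- c("TGACGT", "TGACGA", "TGACGT", "TGTCGT")

# matrix of single bases: one row per site, one column per position
bases <- do.call(rbind, strsplit(sites, ""))

# count each nucleotide at each position (rows A, C, G, T)
nts <- c("A", "C", "G", "T")
counts <- sapply(seq_len(ncol(bases)), function(j) {
  table(factor(bases[, j], levels = nts))
})
rownames(counts) <- nts

# convert counts to per-position probabilities; each column sums to 1
pwm <- sweep(counts, 2, colSums(counts), "/")

# information content per position in bits: 2 minus the Shannon entropy
# (positions with p = 0 for a base contribute 0 to the entropy)
entropy <- apply(pwm, 2, function(p) -sum(ifelse(p > 0, p * log2(p), 0)))
info_bits <- 2 - entropy

round(info_bits, 2)
```

Invariant positions carry the full 2 bits, while positions that tolerate more than one base carry less; this per-position value is what sets the height of each letter stack in a logo.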

LexA sequence motif

Practice

We’ll use the rGADEM package from Bioconductor to derive sequence motifs from the peaks we called above. This is a straightforward process:

Collect the DNA sequences within the peak windows using the BSgenome for S. cerevisiae
Provide those sequences and the genomic background to rGADEM::GADEM(), which uses an Expectation-Maximization (EM) approach to identify and refine motifs.
Examine the discovered motifs, and plot as a logo using seqLogo::seqLogo().

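library(rGADEM)
library(BSgenome.Scerevisiae.UCSC.sacCer3)

# extract the DNA sequence under each peak called above
peak_seqs <- BSgenome::getSeq(
  # provided by BSgenome.Scerevisiae.UCSC.sacCer3
  Scerevisiae,
  peak_calls_gr
)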

# takes ~2 minutes to run
gadem <- rGADEM::GADEM(
  peak_seqs,
  genome = Scerevisiae,
  verbose = 1
)

*** Start C Programm ***
==============================================================================================
input sequence file:  
number of sequences and average length:             452 114.8
Use pgf method to approximate llr null distribution
parameters estimated from sequences in:  

number of GA generations & population size:         5 100

PWM score p-value cutoff for binding site declaration:      2.000000e-04
ln(E-value) cutoff for motif declaration:           0.000000

number of EM steps:                     40
minimal no. sites considered for a motif:           22

[a,c,g,t] frequencies in input data:                0.281618 0.218382 0.218382 0.281618
==============================================================================================
*** Running an unseeded analysis ***
GADEM cycle  1: enumerate and count k-mers... top 3  4, 5-mers: 10 32 58
Done.
Initializing GA... Done.
GADEM cycle[  1] generation[  1] number of unique motif: 4
   spacedDyad: cgtgnnnnnnnnnncgacg  motifConsensus: nwnrTCAynnnnnACGrnn   0.60 fitness: -425.76
   spacedDyad: tttcnnnnnnnnttcg     motifConsensus: wyTTTTTTyTTTTTyk      0.90 fitness: -134.01
   spacedDyad: aaatnnnnncttcc       motifConsensus: AymArGTTACTTCC        0.10 fitness:  -41.29
   spacedDyad: ttttgnnnaaaat        motifConsensus: sTTGkACkAAAAT         0.10 fitness:  -30.86

GADEM cycle[  1] generation[  2] number of unique motif: 4
   spacedDyad: catcnnnnnnnncgacg    motifConsensus: nrTCAyywnnnACGrnn     0.90 fitness: -459.04
   spacedDyad: ttttnnnnnnnnncggc    motifConsensus: yTTTTTTyTTTTTykky     0.90 fitness: -142.43
   spacedDyad: aaatnnnnncttcc       motifConsensus: AymArGTTACTTCC        0.10 fitness:  -41.29
   spacedDyad: ttttgnnnaaaat        motifConsensus: sTTGkACkAAAAT         0.10 fitness:  -30.86

GADEM cycle[  1] generation[  3] number of unique motif: 5
   spacedDyad: catcnnnnnnnncgacg    motifConsensus: nrTCAyywnnnACGrnn     0.90 fitness: -459.04
   spacedDyad: ttttnnnnnnnnncggc    motifConsensus: yTTTTTTyTTTTTykky     0.90 fitness: -142.43
   spacedDyad: aaatnnnnncttcc       motifConsensus: AymArGTTACTTCC        0.10 fitness:  -41.29
   spacedDyad: ttttgnnnaaaat        motifConsensus: sTTGkACkAAAAT         0.10 fitness:  -30.86
   spacedDyad: cgtcnnnnnnnnncacg    motifConsensus: mGACmymdTCAyCCACG     0.10 fitness:  -11.04

GADEM cycle[  1] generation[  4] number of unique motif: 5
   spacedDyad: ccgtnnnnnnnntca      motifConsensus: yCGTnnnwrrTGAyn       0.90 fitness: -472.08
   spacedDyad: ttttnnnnnnnnncggc    motifConsensus: yTTTTTTyTTTTTykky     0.90 fitness: -142.43
   spacedDyad: aaatnnnnncttcc       motifConsensus: AymArGTTACTTCC        0.10 fitness:  -41.29
   spacedDyad: ttttgnnnaaaat        motifConsensus: sTTGkACkAAAAT         0.10 fitness:  -30.86
   spacedDyad: cgtcnnnnnnnnncacg    motifConsensus: mGACmymdTCAyCCACG     0.10 fitness:  -11.04

GADEM cycle[  1] generation[  5] number of unique motif: 5
   spacedDyad: ccgtnnnnnnnntca      motifConsensus: yCGTnnnwrrTGAyw       1.00 fitness: -476.24
   spacedDyad: ttttnnnnnnnnncggc    motifConsensus: yTTTTTTyTTTTTykky     0.90 fitness: -142.43
   spacedDyad: aaatnnnnncttcc       motifConsensus: AymArGTTACTTCC        0.10 fitness:  -41.29
   spacedDyad: ttttgnnnaaaat        motifConsensus: sTTGkACkAAAAT         0.10 fitness:  -30.86
   spacedDyad: cgtcnnnnnnnnncacg    motifConsensus: mGACmymdTCAyCCACG     0.10 fitness:  -11.04

*** Running an unseeded analysis ***
GADEM cycle  2: enumerate and count k-mers... top 3  4, 5-mers: 10 20 30
Done.
Initializing GA... Done.
GADEM cycle[  2] generation[  1] number of unique motif: 4
   spacedDyad: ggaaannnnnnnttc      motifConsensus: GGAmATATwGCmTmT       0.10 fitness:  -49.91
   spacedDyad: ggtgcnnntttg         motifConsensus: rGkGGrAswTTG          0.10 fitness:   -9.35
   spacedDyad: tttcannnnnnnaaaag    motifConsensus: nsTCAmACwGrmAAwmw     1.00 fitness:   -6.73
   spacedDyad: caaannnnnnnnnngcggc  motifConsensus: sAwAAATGTwTkrysCsGm   0.10 fitness:   -6.29

GADEM cycle[  2] generation[  2] number of unique motif: 4
   spacedDyad: gaannnnnnnnnnttttc   motifConsensus: CmrAyAkrmAAwAwTTCC    0.30 fitness:  -86.10
   spacedDyad: ggtgcnnntttg         motifConsensus: rGkGGrAswTTG          0.10 fitness:   -9.35
   spacedDyad: caaannnnnnnnnngcggc  motifConsensus: sAwAAATGTwTkrysCsGm   0.10 fitness:   -6.29
   spacedDyad: tttgnnnnnntgg        motifConsensus: TTTkTywkwsTGr         0.10 fitness:   -6.22

GADEM cycle[  2] generation[  3] number of unique motif: 4
   spacedDyad: gaannnnnnnnnnttttc   motifConsensus: CmrAyAkrmAAwAwTTCC    0.30 fitness:  -86.10
   spacedDyad: ggtgcnnntttg         motifConsensus: rGkGGrAswTTG          0.10 fitness:   -9.35
   spacedDyad: caaannnnnnnnnngcggc  motifConsensus: sAwAAATGTwTkrysCsGm   0.10 fitness:   -6.29
   spacedDyad: tttgnnnnnntgg        motifConsensus: TTTkTywkwsTGr         0.10 fitness:   -6.22

GADEM cycle[  2] generation[  4] number of unique motif: 4
   spacedDyad: gaannnnnnnnnnttttc   motifConsensus: CmrAyAkrmAAwAwTTCC    0.30 fitness:  -86.10
   spacedDyad: gaaaannnnnnnnttc     motifConsensus: AAAAAkGyAwTAyyyC      0.20 fitness:  -21.81
   spacedDyad: ccttgnnnnnnnnnnaaatg motifConsensus: yyTTTTymTAsTGkGAmATT  0.10 fitness:  -12.64
   spacedDyad: ggtgcnnntttg         motifConsensus: rGkGGrAswTTG          0.10 fitness:   -9.35

GADEM cycle[  2] generation[  5] number of unique motif: 7
   spacedDyad: gaannnnnnnnnnttttc   motifConsensus: CmrAyAkrmAAwAwTTCC    0.30 fitness:  -86.10
   spacedDyad: tttttnnnnnnnnngcgg   motifConsensus: TwwTwCnwTykTrsGmGA    0.20 fitness:  -44.11
   spacedDyad: gaaaannnnnnnnttc     motifConsensus: AAAAAkGyAwTAyyyC      0.20 fitness:  -21.81
   spacedDyad: ccttgnnnnnnnnnnaaatg motifConsensus: yyTTTTymTAsTGkGAmATT  0.10 fitness:  -12.64
   spacedDyad: ttcnnnnnnnnaaa       motifConsensus: sTCAsACTrwmAAA        0.20 fitness:  -10.44
   spacedDyad: caaannnnnnnnnnaaatg  motifConsensus: mAAwsTyCymmhmArrAAA   0.80 fitness:  -10.18
   spacedDyad: tttgnnnnnngcggc      motifConsensus: wyTGksTGryGCGsC       0.10 fitness:   -2.96

*** Running an unseeded analysis ***
GADEM cycle  3: enumerate and count k-mers... top 3  4, 5-mers: 8 16 22
Done.
Initializing GA... Done.
GADEM cycle[  3] generation[  1] number of unique motif: 1
   spacedDyad: cggcnnctttt          motifConsensus: CGGCCTykTsT           0.10 fitness:   18.49

GADEM cycle[  3] generation[  2] number of unique motif: 1
   spacedDyad: cggcnnctttt          motifConsensus: CGGCCTykTsT           0.10 fitness:   18.49

GADEM cycle[  3] generation[  3] number of unique motif: 1
   spacedDyad: tggtgnnnnnnncgcc     motifConsensus: TGGwrAyTTykGyksk      0.10 fitness:   13.43

GADEM cycle[  3] generation[  4] number of unique motif: 1
   spacedDyad: cggcctttt            motifConsensus: CGGCCTTGT             0.10 fitness:    9.00

GADEM cycle[  3] generation[  5] number of unique motif: 1
   spacedDyad: cggcctttt            motifConsensus: CGGCCTTGT             0.10 fitness:    9.00

*** Running an unseeded analysis ***
GADEM cycle  4: enumerate and count k-mers... top 3  4, 5-mers: 4 16 22
Done.
Initializing GA... Done.
GADEM cycle[  4] generation[  1] number of unique motif: 1
   spacedDyad: ccatcgcacc           motifConsensus: ACATTGCACy            0.10 fitness:   24.22

GADEM cycle[  4] generation[  2] number of unique motif: 1
   spacedDyad: ccatcgcacc           motifConsensus: ACATTGCACy            0.10 fitness:   24.22

GADEM cycle[  4] generation[  3] number of unique motif: 1
   spacedDyad: ccatcgcacc           motifConsensus: ACATTGCACy            0.10 fitness:   24.22

GADEM cycle[  4] generation[  4] number of unique motif: 1
   spacedDyad: gaannnnnnnnngccg     motifConsensus: kAATkkkwTkrTGCsG      0.10 fitness:   19.67

GADEM cycle[  4] generation[  5] number of unique motif: 1
   spacedDyad: gaannnnnnnnngccg     motifConsensus: kAATkkkwTkrTGCsG      0.10 fitness:   19.67

# look at the consensus motifs
consensus(gadem)

[1] "yCGTnnnwrrTGAyn"                  "yTkTTTTyTyTTTyk"                 
[3] "AmsyTAkATCkTTGkACTAAAATCTGykwCwm" "nAAnwwmGACmCATTCACCCACGsnnynynm" 
[5] "yCmwmyAkrmAAwAwTTCCy"             "nAAAAATGTAwTAyCyC"               
[7] "ACGGCCTTGTs"

# how many occurrences (sites) of each motif were found?
nOccurrences(gadem)

[1] 436 152  33  31  94  41  47
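
The consensus strings above are written with IUPAC ambiguity codes (for example, y = C or T, w = A or T, n = any base). One way to get a feel for what a consensus allows is to turn it into a regular expression and test a few sequences. The helper `consensus_to_regex()` and the test sequences below are hypothetical, letter case is ignored, and only the forward strand is checked.

```r
# IUPAC nucleotide ambiguity codes, written as regex character classes
iupac <- c(
  A = "A", C = "C", G = "G", T = "T",
  R = "[AG]", Y = "[CT]", S = "[CG]", W = "[AT]",
  K = "[GT]", M = "[AC]", B = "[CGT]", D = "[AGT]",
  H = "[ACT]", V = "[ACG]", N = "[ACGT]"
)

# convert a consensus string (any case) into a regular expression
consensus_to_regex <- function(consensus) {
  codes <- strsplit(toupper(consensus), "")[[1]]
  paste(iupac[codes], collapse = "")
}

# first consensus reported by consensus(gadem)
rx <- consensus_to_regex("yCGTnnnwrrTGAyn")
rx

# hypothetical test sequences: the first fits the consensus, the second
# has C at the two purine (r) positions and does not
hit  <- grepl(rx, "ATCGTACGTAATGACA")
miss <- grepl(rx, "ATCGTACGTCCTGACA")
c(hit = hit, miss = miss)
```

A consensus is a lossy summary of the PWM: it records which bases are allowed at each position but not how strongly each is preferred, so a regex match is a much cruder test than a PWM score.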

Now let’s look at the sequence logo for the top hit.

pwm <- gadem@motifList[[1]]@pwm

seqLogo::seqLogo(pwm)

Questions

Does this motif make sense, based on what you know about the requirements and specificity of DNA binding by transcription factors?
How might you confirm that a specific sequence (that conforms to a motif) is bound directly by a transcription factor?

References

GADEM: a genetic algorithm guided formation of spaced dyads coupled with an EM algorithm for motif discovery. J Comput Biol 2009.