Commit 05d1bdd0 authored by Yan's avatar Yan
Browse files

First complete version

parent 2f154177
Loading
Loading
Loading
Loading
+46 −1
Original line number Diff line number Diff line
# PRINT
 No newline at end of file
# multiScaleFootprinting

![image](https://user-images.githubusercontent.com/44768711/193921131-c7a9f8ab-d123-4689-b62b-d71fdf2abd43.png)

### 1. Overview

In this Github repo we present the multi-scale footprinting framework in our Hu et al. paper. In general, the algorithm takes ATAC-Seq (bulk or single cells) data as input, and try to detect DNA-protein interaction across spatial scales. More specifically, the model first internally corrects the sequence preference of Tn5, and then use a statistical model to calculate footprint score for each position within enhancers and promoters. The process is performed with a range of footprint kernel sizes, capturing DNA-binding proteins of different sizes and shapes. Conceptually, this procedure is similar to wavelet analysis where we decompose the input signal at each location across scales. 

<img src="https://user-images.githubusercontent.com/44768711/193936026-b49715d8-7ec9-4e23-8aa9-330c1f93f2e7.png" width="350" align="left">

The multi-scale footprint pattern at any genomic location delineates local chromatin structure and can be used to infer TF and nucleosome binding. We have shown in our paper that multi-scale footprints can be used as input to neural network models to predict TF binding, even for TFs that do not leave visible footprints on their own.

Additionally, we have implemented the infrastructure for generating pseudo-bulks using single cell data, as well as running multi-scale footprinting using the pseudo-bulked data. This provides us with the unique opportunity to track chromatin structure dynamics across pseudo-time.

<br>

### 2. Key Components

* Correction of Tn5 insertion bias and obtain single-base pair resolution chromatin accessibility

* Calling footprint signal across spatial scales to resolve local chromatin structure

* Tracking nucleosome binding/eviction/sliding across pseudo-time

* Infer TF binding within cis-regulatory elements (CREs)

* Segmentation of CREs into substructures

### 3. Vignettes and Tutorials

Tutorials for running multi-scale footprinting on example data can be found [here][tutorial]

[tutorial]:https://github.com/buenrostrolab/multiScaleFootprinting/blob/main/analyses/BMMCTutorial/BMMCVignette.pdf

### 4. References

Hu et al., Multi-scale chromatin footprinting reveals wide-spread encoding of CRE substructures

### 5. Installation

Currently the framework can be installed by cloning the github repo. 

### 6. Support

If you have any questions, please feel free to open an issue. You are also welcome to email me at yanhu@g.harvard.edu. We appreciate everyone's contribution!
+146 −0
Original line number Diff line number Diff line
# If running in Rstudio, set the working directory to current path
if (Sys.getenv("RSTUDIO") == "1"){
  setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
}

source("../../code/utils.R")
source("../../code/getCounts.R")
source("../../code/getBias.R")
source("../../code/getFootprints.R")
source("../../code/visualization.R")
source("../../code/getGroupData.R")
source("../../code/getTFBS.R")

###################
# Load input data #
###################

# Initialize a footprintingProject object
projectName <- "BAC"
project <- footprintingProject(projectName = projectName, 
                               refGenome = "hg38")
projectMainDir <- "../../"
projectDataDir <- paste0(projectMainDir, "data/", projectName, "/")
dataDir(project) <- projectDataDir
mainDir(project) <- projectMainDir

# Load region ranges
regions <- readRDS(paste0(projectDataDir, "tileRanges.rds"))

# Set the regionRanges slot
regionRanges(project) <- regions

# Use our neural network to predict Tn5 bias for all positions in all regions
# Remember to make a copy of the Tn5_NN_model.h5 file in projectDataDir!!
if(file.exists(paste0(projectDataDir, "predBias.rds"))){
  regionBias(project) <- readRDS(paste0(projectDataDir, "predBias.rds"))
}else{
  project <- getRegionBias(project, nCores = 24)
  saveRDS(regionBias(project), paste0(projectDataDir, "predBias.rds"))
}

# Load barcodes for each pseudo-bulk
barcodeGroups <- data.frame(barcode = paste("rep", 1:5, sep = ""),
                            group = "BAC")
barcodeGrouping(project) <- barcodeGroups
groups(project) <- mixedsort(unique(barcodeGroups$group))

# Getting the pseudobulk-by-region-by-position counts tensor from a fragment file
tileCounts <- readRDS("../../data/BAC/tileCounts.rds")
tileCounts <- tileCounts$all
tileCounts <- lapply(
  tileCounts,
  function(x){
    x <- as.data.frame(x)
    if(dim(x)[1] > 0){
      # Now we merge the 5 replicates into 1 group
      x$group <- "BAC"
    }
    x  
  }
)
countTensor(project) <- tileCounts
groupCellType(project) <- "BAC"

for(kernelSize in 2:100){
  dispModel(project, as.character(kernelSize)) <- readRDS(paste0("../../data/shared/dispModel/dispersionModel", kernelSize, "bp.rds"))
}

# Load PWM data
motifs <- readRDS(paste0(projectMainDir, "/data/shared/cisBP_human_pwms_2021.rds"))

# Load TFBS prediction model
h5Path <- "../../data/TFBSPrediction/TFBS_model.h5"
TFBSModel <- keras::load_model_hdf5(h5Path)
TFBSModel <- keras::get_weights(TFBSModel)
h5file <- H5File$new(h5Path, mode="r")
footprintMean <- h5file[["footprint_mean"]]
footprintMean <- footprintMean[1:footprintMean$dims]
footprintSd <- h5file[["footprint_sd"]]
footprintSd <- footprintSd[1:footprintSd$dims]
h5file$close_all()
TFBSModel$footprintMean <- footprintMean
TFBSModel$footprintSd <- footprintSd
TFBSModel$scales <- c(10,20,30,50,80,100)
TFBindingModel(project) <- TFBSModel

# Load chain file required for lift-over
pathToChain <- "../../data/shared/hg19ToHg38.over.tab.chain"
ch <- rtracklayer::import.chain(pathToChain)

# Find motif matches across all peaks
cisBPMotifs <- readRDS("../../data/shared/cisBP_human_pwms_2021.rds")
motifMatches <- getMotifPositions(project, cisBPMotifs, combineTFs = F)

#######################
# Run TFBS prediction #
#######################

getTFBS(project,
        nCores = 16)
TFBindingSE <- getTFBindingSE(project)
TFBSScores <- assay(TFBindingSE)
nFootprints <- sum(TFBSScores > 0.5)
print(paste("False positive rate", nFootprints / length(TFBSScores)))

##################################################################
# Load results from other methods and quantify called footprints #
##################################################################

# Retrieve results from HINT-ATAC
HINTFootprints <- read.table("../../data/BAC/HINT/BAC.bed")

# Retrieve results from TOBIAS
TobiasDir <- "../../data/BAC/Tobias/prediction"
TFs <- unname(sapply(list.files(TobiasDir), function(s){stringr::str_split(s, "_")[[1]][[1]]}))
TFs <- intersect(TFs, names(cisBPMotifs))
TOBIASFootprints <- data.table::rbindlist(pbmcapply::pbmclapply(
  TFs,
  function(TF){
    # Retrieve Tobias TFBS prediction
    TobiasBound <- read.table(paste0("../../data/BAC/Tobias/prediction/", TF, "_motif/beds/", TF, "_motif_BAC_bound.bed"))
    TobiasBound
  },
  mc.cores = 16
))
TOBIASFootprintRanges <- GRanges(seqnames = TOBIASFootprints$V1,
                                 ranges = IRanges(start = TOBIASFootprints$V2, end = TOBIASFootprints$V3))
TOBIASFootprintRanges$score <- TOBIASFootprints$V11
TOBIASFootprints <- mergeRegions(TOBIASFootprintRanges)

# Calculate total length of BAC regions
BACLength <- sum(width(regions)) / 1000

methods <- c("TOBIAS", "HINT-ATAC", "CNN-based")
plotData <- data.frame(
  FPR = c(length(TOBIASFootprints) / BACLength, 
          dim(HINTFootprints)[1] / BACLength, 
          nFootprints / BACLength),
  Method = factor(methods, levels = methods)
)
pdf("../../data/BAC/plots/TFBSBenchmark.pdf", width = 4, height = 4)
ggplot(plotData) +
  geom_bar(aes(x = Method, y = FPR), stat = "identity", width = 0.5, fill = "#CA9B80") +
  xlab("") + ylab("Number of false positive TFBS per 1kb\ndetected on BAC naked DNA") + 
  theme_classic()
dev.off()
+312 −0
Original line number Diff line number Diff line
# If running in Rstudio, set the working directory to current path
if (Sys.getenv("RSTUDIO") == "1"){
  setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
}

source("../../code/utils.R")
source("../../code/getCounts.R")
source("../../code/getBias.R")
source("../../code/getGroupData.R")
source("../../code/getFootprints.R")
source("../../code/visualization.R")
library(GenomicRanges)
library(rtracklayer)
library(ggpubr)

#################
# Load BAC data #
#################

# Load genomic ranges of BACs
BACRanges <- readRDS("../../data/BAC/BACRanges.rds")

# Convert BAC genomic ranges into 1kb tiles
tileRanges <- Reduce("c", GenomicRanges::slidingWindows(BACRanges, width = 1000, step = 1000))
tileRanges <- tileRanges[width(tileRanges) == 1000]
tileBACs <- names(tileRanges)

# Load reference genome
hg38 <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38

# Load BAC tile count tensor
countData <- readRDS("../../data/BAC/tileCounts.rds")$all
names(countData) <- 1:length(countData)

###############################################################
# Get CNN-predicted Tn5 bias and observed Tn5 insertion track #
###############################################################

# Initialize a footprintingProject object
projectName <- "BAC"
project <- footprintingProject(projectName = projectName, 
                               refGenome = "hg38")
projectMainDir <- "../../"
projectDataDir <- paste0(projectMainDir, "data/", projectName, "/")
dataDir(project) <- projectDataDir
mainDir(project) <- projectMainDir

# Set the regionRanges slot
regionRanges(project) <- tileRanges

# Use our neural network to predict Tn5 bias for all positions in all regions
# Remember to make a copy of the Tn5_NN_model.h5 file in projectDataDir!!
if(file.exists(paste0(projectDataDir, "predBias.rds"))){
  regionBias(project) <- readRDS(paste0(projectDataDir, "predBias.rds"))
}else{
  project <- getRegionBias(project, nCores = 16)
  saveRDS(regionBias(project), paste0(projectDataDir, "predBias.rds"))
}

###############
# Get Tn5 PWM #
###############

# One-hot encoding of a single DNA sequence
onehotEncode <- function(seq){
  onehotMapping <- list("A" = 1, "C" = 2, "G" = 3, "T" = 4, "N" = 0)
  len <- nchar(seq)
  onehotInd <- sapply(
    substring(seq, 1:len , 1:len),
    function(x){onehotMapping[[x]]})
  onehot <- matrix(0, nrow = 4, ncol = len)
  undeterminedBases <- onehotInd == 0
  onehot[cbind(unname(onehotInd), 1:len)[!undeterminedBases,]] <- 1
  onehot
}

# Use PWM to score a sequence
PWMScoring <- function(seq, PWM){
  onehotSeq <- onehotEncode(seq)
  sum(PWM * onehotSeq)
}

# Get position frequency matrix of Tn5 insertion
PWMRadius <- 10
tilePFM <- pbmcapply::pbmclapply(
  1:length(tileRanges),
  function(tileInd){
    tileRange <- tileRanges[tileInd]
    tilePositions <- IRanges::tile(tileRange, width = 1)[[1]]
    contextRanges <- IRanges::resize(tilePositions, fix = "center", width = PWMRadius * 2 + 1)
    contextSeq <- Biostrings::getSeq(hg38, contextRanges, as.character = T)
    ATACTrack <- rowSums(getRegionATAC(countData, 
                                       regionInd = as.character(tileInd), 
                                       groupIDs = 1:5, 
                                       width = 1000))
    
    # One-hot encoding of each sequence context that appeared in the BAC
    # This is used to quantify the background sequence frequencies
    backgroundOnehot <- lapply(
      1:length(contextSeq), 
      function(i){onehotEncode(contextSeq[i])})
    backgroundPFM <- Reduce("+", backgroundOnehot)
    
    # Since each context is cut with different frequencies, we calculate
    # the foreground frequencies by weighing the background with cutting density 
    foregroundOnehot <- lapply(
      1:length(contextSeq), 
      function(i){backgroundOnehot[[i]] * ATACTrack[i]})
    foregroundPFM <- Reduce("+", foregroundOnehot)
    
    rownames(foregroundPFM) <- c("A", "C", "G", "T")
    rownames(backgroundPFM) <- c("A", "C", "G", "T")
    
    list(foregroundPFM, backgroundPFM)
  }
)
fgPFM <- Reduce("+", lapply(tilePFM, function(x){x[[1]]}))
bgPFM <- Reduce("+", lapply(tilePFM, function(x){x[[2]]}))

# Get background base frequencies
background <- rowSums(bgPFM)
background <- background / sum(background)

# Get PPM 
PPM <- t(t(fgPFM) / colSums(fgPFM))
adjustedPPM <- PPM / background
PWM <- log2(adjustedPPM)

library(Logolas)
system("mkdir ../../data/BAC/plots")
pdf("../../data/BAC/plots/Tn5PWM.pdf", width = 8, height = 6)
logomaker(adjustedPPM, type = "Logo", return_heights = TRUE)
dev.off()

# Save the PWM to file
saveRDS(PWM, "../../data/BAC/Tn5PWM.rds")

############################
# Get k-mer Tn5 bias model #
############################

kmerBiasList <- list()
for(k in c(3,5,7)){
  
  # Generate all possible kmer sequences
  bases <- c("A", "C", "G", "T")
  kmers <- as.matrix(eval(parse(text = paste("expand.grid(", paste(rep("bases", k), collapse = ", "), ")"))))
  kmers <- sapply(1:dim(kmers)[1], function(x){paste(kmers[x,], collapse = "")})
  
  # Go through each genomic tile and record insertion numbers for kmers
  tileKmer <- pbmcapply::pbmclapply(
    1:length(tileRanges),
    function(tileInd){
      tileRange <- tileRanges[tileInd]
      tilePositions <- IRanges::tile(tileRange, width = 1)[[1]]
      contextRanges <- IRanges::resize(tilePositions, fix = "center", width = k)
      contextSeq <- Biostrings::getSeq(hg38, contextRanges, as.character = T)
      ATACTrack <- rowSums(getRegionATAC(countData, 
                                         regionInd = as.character(tileInd), 
                                         groupIDs = 1:5, 
                                         width = 1000))
      kmerFG <- rep(0, length(kmers))
      names(kmerFG) <- kmers
      kmerBG <- rep(0, length(kmers))
      names(kmerBG) <- kmers
      kmerFG[contextSeq] <- ATACTrack
      kmerBG[contextSeq] <- 1
      list(kmerFG, kmerBG)
    }
  )
  
  # Summarize results for all tiles and calculate k-mer bias
  kmerFG <- rowSums(sapply(tileKmer,function(x){x[[1]]}))
  kmerBG <- rowSums(sapply(tileKmer,function(x){x[[2]]}))
  kmerBias <- kmerFG / kmerBG
  kmerBiasList[[as.character(k)]] <- kmerBias / mean(kmerBias)
}

saveRDS(kmerBiasList, "../../data/BAC/kmerBiasList.rds")

#############################
# Benchmark on all BAC data #
#############################

kmerBiasList <- readRDS("../../data/BAC/kmerBiasList.rds")
PWM <- readRDS("../../data/BAC/Tn5PWM.rds")
TobiasBigWig <- import.bw('../../data/BAC/Tobias/BAC_bias.bw')
PWMRadius <- 10

benchmark <- t(pbmcapply::pbmcmapply(
  function(tileInd){
    
    tileRange <- tileRanges[tileInd]
    tileInsertion <- rowSums(getRegionATAC(countData, regionInd = as.character(tileInd), 
                                           groupIDs = 1:5, width = 1000))
    tilePositions <- IRanges::tile(tileRange, width = 1)[[1]]
    coverage <- sum(tileInsertion)
    
    biasBenchmark <- list()
    
    # Retrieve Tn5 bias calculated by kmer
    for(k in names(kmerBiasList)){
      contextRanges <- IRanges::resize(tilePositions, fix = "center", width = as.integer(k))
      contextSeq <- Biostrings::getSeq(hg38, contextRanges, as.character = T)
      kmerBias <-  unname(kmerBiasList[[k]][contextSeq])
      biasBenchmark[[paste0(k, "-mer")]] <- cor(kmerBias, tileInsertion)
    }
    
    # Retrieve Tn5 bias calculated by PWM
    contextRanges <- IRanges::resize(tilePositions, fix = "center", width = PWMRadius * 2 + 1)
    contextSeq <- Biostrings::getSeq(hg38, contextRanges, as.character = T)
    PWMBias  <-  2 ^ sapply(contextSeq, function(seq){PWMScoring(seq, PWM)})
    biasBenchmark[["PWM"]] <- cor(PWMBias, tileInsertion)
    
    # Retrieve Tn5 bias estimated by Tobias
    TobiasTn5Bias <- subsetByOverlaps(TobiasBigWig, tileRanges[tileInd])$score
    biasBenchmark[["dinucleotidePWM"]] <- cor(2 ^ TobiasTn5Bias, tileInsertion)
    
    # Calculate accuracy by convolution neural net
    biasBenchmark[["ConvNet"]] <- cor(regionBias(project)[tileInd,], tileInsertion / conv(tileInsertion, 50))
    
    biasBenchmark <- Reduce(c, biasBenchmark)
    
    c(coverage, biasBenchmark)
    
  },
  1:length(tileRanges),
  mc.cores = 16
))

# Filter out entries with NA values
benchmark <- benchmark[rowSums(is.na(benchmark)) == 0, ]

# Filter out low-coverage tiles
benchmark <- benchmark[benchmark[, 1] > 20000, ]

methods <- c("3-mer", "5-mer", "7-mer", "PWM", "dinucleotide PWM", "CNN")
plotData <- data.frame(
  Correlation = colMeans(benchmark)[2:7],
  Method = methods
)
pdf("../../data/BAC/plots/benchmark.pdf", width = 8, height = 8)
ggplot(plotData) +
  geom_bar(aes(x = Method, y = Correlation), stat = "identity", width = 0.5, fill = "#CA9B80") +
  xlab("Bias correction method") + ylim(0, 1) + 
  ylab("Correlation between predicted \nand observed Tn5 bias") + 
  scale_x_discrete(limits = methods) +
  theme_classic()
dev.off()

########################
# Model interpretation #
########################

# Get Tn5 bias predicted by PWM and CNN
predBias <- t(pbmcapply::pbmclapply(
  1:length(tileRanges),
  function(tileInd){
    
    tileRange <- tileRanges[tileInd]
    tileInsertion <- rowSums(getRegionATAC(countData, regionInd = as.character(tileInd), 
                                           groupIDs = 1:5, width = 1000))
    
    if(mean(tileInsertion) > 20){
      tilePositions <- IRanges::tile(tileRange, width = 1)[[1]]
      localCoverage <- conv(tileInsertion, 50) / 100
      obsBias <- tileInsertion / localCoverage
      
      # Retrieve Tn5 bias calculated by PWM
      contextRanges <- IRanges::resize(tilePositions, fix = "center", width = PWMRadius * 2 + 1)
      contextSeq <- Biostrings::getSeq(hg38, contextRanges, as.character = T)
      PWMBias  <-  2 ^ sapply(contextSeq, function(seq){PWMScoring(seq, PWM)})
      
      # Retrieve Tn5 bias estimated by CNN
      CNNBias <- regionBias(project)[tileInd,]
      
      data.frame(PWMBias = PWMBias, CNNBias = CNNBias, obsBias = obsBias, context = contextSeq)
    }else{
      NULL
    }
    
  },
  mc.cores = 16
))
predBias <- data.table::rbindlist(predBias)
predBias <- predBias[!is.na(predBias$obsBias), ]

# Visual inspection
sampleInd <- sample(1:dim(predBias)[1], 10000)
qplot(predBias$PWMBias[sampleInd], predBias$obsBias[sampleInd]) + xlim(0,50) + ylim(0,50)

# Calculate errors by the two methods on individual positions
PWMError <- log2(pmax(predBias$PWMBias, 1e-3)) - log2(pmax(predBias$obsBias, 1e-3))
CNNError <- log2(pmax(predBias$CNNBias, 1e-3)) - log2(pmax(predBias$obsBias, 1e-3))
PWMAbsError <- abs(PWMError)
CNNAbsError <- abs(CNNError)
improvement <- PWMAbsError - CNNAbsError

qplot(PWMError[sampleInd], improvement[sampleInd], )
hist(PWMError[order(improvement, decreasing = T)][1:2000])

diffContext <- predBias$context[order(improvement, decreasing = T)][1:2000]
simContext <- predBias$context[order(abs(improvement))][1:2000]
diffCGContent <- sapply(diffContext, function(x){stringr::str_count(x, "[CG]")}) / (2 * PWMRadius + 1)
simCGContent <- sapply(simContext, function(x){stringr::str_count(x, "[CG]")}) / (2 * PWMRadius + 1)

plotData <- diffCGContent
data.frame(plotData) %>%
  ggplot(aes(x = plotData)) +
  geom_histogram(bins=20, fill = '#9C5D41') +
  xlab("GC content") + ylab("Number of examples") +
  theme_classic()
+2969 −0

File added.

Preview size limit exceeded, changes collapsed.

+2640 −0

File added.

Preview size limit exceeded, changes collapsed.

Loading