First complete version (05d1bdd0) · Commits · github_fork / PRINT

README.md

+46 −1

Original line number	Diff line number	Diff line
		# PRINT
		No newline at end of file
		# multiScaleFootprinting

		![image](https://user-images.githubusercontent.com/44768711/193921131-c7a9f8ab-d123-4689-b62b-d71fdf2abd43.png)

		### 1. Overview

		In this Github repo we present the multi-scale footprinting framework in our Hu et al. paper. In general, the algorithm takes ATAC-Seq (bulk or single cells) data as input, and try to detect DNA-protein interaction across spatial scales. More specifically, the model first internally corrects the sequence preference of Tn5, and then use a statistical model to calculate footprint score for each position within enhancers and promoters. The process is performed with a range of footprint kernel sizes, capturing DNA-binding proteins of different sizes and shapes. Conceptually, this procedure is similar to wavelet analysis where we decompose the input signal at each location across scales.

		<img src="https://user-images.githubusercontent.com/44768711/193936026-b49715d8-7ec9-4e23-8aa9-330c1f93f2e7.png" width="350" align="left">

		The multi-scale footprint pattern at any genomic location delineates local chromatin structure and can be used to infer TF and nucleosome binding. We have shown in our paper that multi-scale footprints can be used as input to neural network models to predict TF binding, even for TFs that do not leave visible footprints on their own.

		Additionally, we have implemented the infrastructure for generating pseudo-bulks using single cell data, as well as running multi-scale footprinting using the pseudo-bulked data. This provides us with the unique opportunity to track chromatin structure dynamics across pseudo-time.

		<br>

		### 2. Key Components

		* Correction of Tn5 insertion bias and obtain single-base pair resolution chromatin accessibility

		* Calling footprint signal across spatial scales to resolve local chromatin structure

		* Tracking nucleosome binding/eviction/sliding across pseudo-time

		* Infer TF binding within cis-regulatory elements (CREs)

		* Segmentation of CREs into substructures

		### 3. Vignettes and Tutorials

		Tutorials for running multi-scale footprinting on example data can be found [here][tutorial]

		[tutorial]:https://github.com/buenrostrolab/multiScaleFootprinting/blob/main/analyses/BMMCTutorial/BMMCVignette.pdf

		### 4. References

		Hu et al., Multi-scale chromatin footprinting reveals wide-spread encoding of CRE substructures

		### 5. Installation

		Currently the framework can be installed by cloning the github repo.

		### 6. Support

		If you have any questions, please feel free to open an issue. You are also welcome to email me at yanhu@g.harvard.edu. We appreciate everyone's contribution!

analyses/BAC/.ipynb_checkpoints/TFBSBenchmark-checkpoint.R

0 → 100644

+146 −0

Original line number	Diff line number	Diff line
		# If running in Rstudio, set the working directory to current path
		if (Sys.getenv("RSTUDIO") == "1"){
		setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
		}

		source("../../code/utils.R")
		source("../../code/getCounts.R")
		source("../../code/getBias.R")
		source("../../code/getFootprints.R")
		source("../../code/visualization.R")
		source("../../code/getGroupData.R")
		source("../../code/getTFBS.R")

		###################
		# Load input data #
		###################

		# Initialize a footprintingProject object
		projectName <- "BAC"
		project <- footprintingProject(projectName = projectName,
		refGenome = "hg38")
		projectMainDir <- "../../"
		projectDataDir <- paste0(projectMainDir, "data/", projectName, "/")
		dataDir(project) <- projectDataDir
		mainDir(project) <- projectMainDir

		# Load region ranges
		regions <- readRDS(paste0(projectDataDir, "tileRanges.rds"))

		# Set the regionRanges slot
		regionRanges(project) <- regions

		# Use our neural network to predict Tn5 bias for all positions in all regions
		# Remember to make a copy of the Tn5_NN_model.h5 file in projectDataDir!!
		if(file.exists(paste0(projectDataDir, "predBias.rds"))){
		regionBias(project) <- readRDS(paste0(projectDataDir, "predBias.rds"))
		}else{
		project <- getRegionBias(project, nCores = 24)
		saveRDS(regionBias(project), paste0(projectDataDir, "predBias.rds"))
		}

		# Load barcodes for each pseudo-bulk
		barcodeGroups <- data.frame(barcode = paste("rep", 1:5, sep = ""),
		group = "BAC")
		barcodeGrouping(project) <- barcodeGroups
		groups(project) <- mixedsort(unique(barcodeGroups$group))

		# Getting the pseudobulk-by-region-by-position counts tensor from a fragment file
		tileCounts <- readRDS("../../data/BAC/tileCounts.rds")
		tileCounts <- tileCounts$all
		tileCounts <- lapply(
		tileCounts,
		function(x){
		x <- as.data.frame(x)
		if(dim(x)[1] > 0){
		# Now we merge the 5 replicates into 1 group
		x$group <- "BAC"
		}
		x
		}
		)
		countTensor(project) <- tileCounts
		groupCellType(project) <- "BAC"

		for(kernelSize in 2:100){
		dispModel(project, as.character(kernelSize)) <- readRDS(paste0("../../data/shared/dispModel/dispersionModel", kernelSize, "bp.rds"))
		}

		# Load PWM data
		motifs <- readRDS(paste0(projectMainDir, "/data/shared/cisBP_human_pwms_2021.rds"))

		# Load TFBS prediction model
		h5Path <- "../../data/TFBSPrediction/TFBS_model.h5"
		TFBSModel <- keras::load_model_hdf5(h5Path)
		TFBSModel <- keras::get_weights(TFBSModel)
		h5file <- H5File$new(h5Path, mode="r")
		footprintMean <- h5file[["footprint_mean"]]
		footprintMean <- footprintMean[1:footprintMean$dims]
		footprintSd <- h5file[["footprint_sd"]]
		footprintSd <- footprintSd[1:footprintSd$dims]
		h5file$close_all()
		TFBSModel$footprintMean <- footprintMean
		TFBSModel$footprintSd <- footprintSd
		TFBSModel$scales <- c(10,20,30,50,80,100)
		TFBindingModel(project) <- TFBSModel

		# Load chain file required for lift-over
		pathToChain <- "../../data/shared/hg19ToHg38.over.tab.chain"
		ch <- rtracklayer::import.chain(pathToChain)

		# Find motif matches across all peaks
		cisBPMotifs <- readRDS("../../data/shared/cisBP_human_pwms_2021.rds")
		motifMatches <- getMotifPositions(project, cisBPMotifs, combineTFs = F)

		#######################
		# Run TFBS prediction #
		#######################

		getTFBS(project,
		nCores = 16)
		TFBindingSE <- getTFBindingSE(project)
		TFBSScores <- assay(TFBindingSE)
		nFootprints <- sum(TFBSScores > 0.5)
		print(paste("False positive rate", nFootprints / length(TFBSScores)))

		##################################################################
		# Load results from other methods and quantify called footprints #
		##################################################################

		# Retrieve results from HINT-ATAC
		HINTFootprints <- read.table("../../data/BAC/HINT/BAC.bed")

		# Retrieve results from TOBIAS
		TobiasDir <- "../../data/BAC/Tobias/prediction"
		TFs <- unname(sapply(list.files(TobiasDir), function(s){stringr::str_split(s, "_")[[1]][[1]]}))
		TFs <- intersect(TFs, names(cisBPMotifs))
		TOBIASFootprints <- data.table::rbindlist(pbmcapply::pbmclapply(
		TFs,
		function(TF){
		# Retrieve Tobias TFBS prediction
		TobiasBound <- read.table(paste0("../../data/BAC/Tobias/prediction/", TF, "_motif/beds/", TF, "_motif_BAC_bound.bed"))
		TobiasBound
		},
		mc.cores = 16
		))
		TOBIASFootprintRanges <- GRanges(seqnames = TOBIASFootprints$V1,
		ranges = IRanges(start = TOBIASFootprints$V2, end = TOBIASFootprints$V3))
		TOBIASFootprintRanges$score <- TOBIASFootprints$V11
		TOBIASFootprints <- mergeRegions(TOBIASFootprintRanges)

		# Calculate total length of BAC regions
		BACLength <- sum(width(regions)) / 1000

		methods <- c("TOBIAS", "HINT-ATAC", "CNN-based")
		plotData <- data.frame(
		FPR = c(length(TOBIASFootprints) / BACLength,
		dim(HINTFootprints)[1] / BACLength,
		nFootprints / BACLength),
		Method = factor(methods, levels = methods)
		)
		pdf("../../data/BAC/plots/TFBSBenchmark.pdf", width = 4, height = 4)
		ggplot(plotData) +
		geom_bar(aes(x = Method, y = FPR), stat = "identity", width = 0.5, fill = "#CA9B80") +
		xlab("") + ylab("Number of false positive TFBS per 1kb\ndetected on BAC naked DNA") +
		theme_classic()
		dev.off()

analyses/BAC/.ipynb_checkpoints/biasCorrectionBenchmark-checkpoint.R

0 → 100644

+312 −0

Original line number	Diff line number	Diff line
		# If running in Rstudio, set the working directory to current path
		if (Sys.getenv("RSTUDIO") == "1"){
		setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
		}

		source("../../code/utils.R")
		source("../../code/getCounts.R")
		source("../../code/getBias.R")
		source("../../code/getGroupData.R")
		source("../../code/getFootprints.R")
		source("../../code/visualization.R")
		library(GenomicRanges)
		library(rtracklayer)
		library(ggpubr)

		#################
		# Load BAC data #
		#################

		# Load genomic ranges of BACs
		BACRanges <- readRDS("../../data/BAC/BACRanges.rds")

		# Convert BAC genomic ranges into 1kb tiles
		tileRanges <- Reduce("c", GenomicRanges::slidingWindows(BACRanges, width = 1000, step = 1000))
		tileRanges <- tileRanges[width(tileRanges) == 1000]
		tileBACs <- names(tileRanges)

		# Load reference genome
		hg38 <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38

		# Load BAC tile count tensor
		countData <- readRDS("../../data/BAC/tileCounts.rds")$all
		names(countData) <- 1:length(countData)

		###############################################################
		# Get CNN-predicted Tn5 bias and observed Tn5 insertion track #
		###############################################################

		# Initialize a footprintingProject object
		projectName <- "BAC"
		project <- footprintingProject(projectName = projectName,
		refGenome = "hg38")
		projectMainDir <- "../../"
		projectDataDir <- paste0(projectMainDir, "data/", projectName, "/")
		dataDir(project) <- projectDataDir
		mainDir(project) <- projectMainDir

		# Set the regionRanges slot
		regionRanges(project) <- tileRanges

		# Use our neural network to predict Tn5 bias for all positions in all regions
		# Remember to make a copy of the Tn5_NN_model.h5 file in projectDataDir!!
		if(file.exists(paste0(projectDataDir, "predBias.rds"))){
		regionBias(project) <- readRDS(paste0(projectDataDir, "predBias.rds"))
		}else{
		project <- getRegionBias(project, nCores = 16)
		saveRDS(regionBias(project), paste0(projectDataDir, "predBias.rds"))
		}

		###############
		# Get Tn5 PWM #
		###############

		# One-hot encoding of a single DNA sequence
		onehotEncode <- function(seq){
		onehotMapping <- list("A" = 1, "C" = 2, "G" = 3, "T" = 4, "N" = 0)
		len <- nchar(seq)
		onehotInd <- sapply(
		substring(seq, 1:len , 1:len),
		function(x){onehotMapping[[x]]})
		onehot <- matrix(0, nrow = 4, ncol = len)
		undeterminedBases <- onehotInd == 0
		onehot[cbind(unname(onehotInd), 1:len)[!undeterminedBases,]] <- 1
		onehot
		}

		# Use PWM to score a sequence
		PWMScoring <- function(seq, PWM){
		onehotSeq <- onehotEncode(seq)
		sum(PWM * onehotSeq)
		}

		# Get position frequency matrix of Tn5 insertion
		PWMRadius <- 10
		tilePFM <- pbmcapply::pbmclapply(
		1:length(tileRanges),
		function(tileInd){
		tileRange <- tileRanges[tileInd]
		tilePositions <- IRanges::tile(tileRange, width = 1)[[1]]
		contextRanges <- IRanges::resize(tilePositions, fix = "center", width = PWMRadius * 2 + 1)
		contextSeq <- Biostrings::getSeq(hg38, contextRanges, as.character = T)
		ATACTrack <- rowSums(getRegionATAC(countData,
		regionInd = as.character(tileInd),
		groupIDs = 1:5,
		width = 1000))

		# One-hot encoding of each sequence context that appeared in the BAC
		# This is used to quantify the background sequence frequencies
		backgroundOnehot <- lapply(
		1:length(contextSeq),
		function(i){onehotEncode(contextSeq[i])})
		backgroundPFM <- Reduce("+", backgroundOnehot)

		# Since each context is cut with different frequencies, we calculate
		# the foreground frequencies by weighing the background with cutting density
		foregroundOnehot <- lapply(
		1:length(contextSeq),
		function(i){backgroundOnehot[[i]] * ATACTrack[i]})
		foregroundPFM <- Reduce("+", foregroundOnehot)

		rownames(foregroundPFM) <- c("A", "C", "G", "T")
		rownames(backgroundPFM) <- c("A", "C", "G", "T")

		list(foregroundPFM, backgroundPFM)
		}
		)
		fgPFM <- Reduce("+", lapply(tilePFM, function(x){x[[1]]}))
		bgPFM <- Reduce("+", lapply(tilePFM, function(x){x[[2]]}))

		# Get background base frequencies
		background <- rowSums(bgPFM)
		background <- background / sum(background)

		# Get PPM
		PPM <- t(t(fgPFM) / colSums(fgPFM))
		adjustedPPM <- PPM / background
		PWM <- log2(adjustedPPM)

		library(Logolas)
		system("mkdir ../../data/BAC/plots")
		pdf("../../data/BAC/plots/Tn5PWM.pdf", width = 8, height = 6)
		logomaker(adjustedPPM, type = "Logo", return_heights = TRUE)
		dev.off()

		# Save the PWM to file
		saveRDS(PWM, "../../data/BAC/Tn5PWM.rds")

		############################
		# Get k-mer Tn5 bias model #
		############################

		kmerBiasList <- list()
		for(k in c(3,5,7)){

		# Generate all possible kmer sequences
		bases <- c("A", "C", "G", "T")
		kmers <- as.matrix(eval(parse(text = paste("expand.grid(", paste(rep("bases", k), collapse = ", "), ")"))))
		kmers <- sapply(1:dim(kmers)[1], function(x){paste(kmers[x,], collapse = "")})

		# Go through each genomic tile and record insertion numbers for kmers
		tileKmer <- pbmcapply::pbmclapply(
		1:length(tileRanges),
		function(tileInd){
		tileRange <- tileRanges[tileInd]
		tilePositions <- IRanges::tile(tileRange, width = 1)[[1]]
		contextRanges <- IRanges::resize(tilePositions, fix = "center", width = k)
		contextSeq <- Biostrings::getSeq(hg38, contextRanges, as.character = T)
		ATACTrack <- rowSums(getRegionATAC(countData,
		regionInd = as.character(tileInd),
		groupIDs = 1:5,
		width = 1000))
		kmerFG <- rep(0, length(kmers))
		names(kmerFG) <- kmers
		kmerBG <- rep(0, length(kmers))
		names(kmerBG) <- kmers
		kmerFG[contextSeq] <- ATACTrack
		kmerBG[contextSeq] <- 1
		list(kmerFG, kmerBG)
		}
		)

		# Summarize results for all tiles and calculate k-mer bias
		kmerFG <- rowSums(sapply(tileKmer,function(x){x[[1]]}))
		kmerBG <- rowSums(sapply(tileKmer,function(x){x[[2]]}))
		kmerBias <- kmerFG / kmerBG
		kmerBiasList[[as.character(k)]] <- kmerBias / mean(kmerBias)
		}

		saveRDS(kmerBiasList, "../../data/BAC/kmerBiasList.rds")

		#############################
		# Benchmark on all BAC data #
		#############################

		kmerBiasList <- readRDS("../../data/BAC/kmerBiasList.rds")
		PWM <- readRDS("../../data/BAC/Tn5PWM.rds")
		TobiasBigWig <- import.bw('../../data/BAC/Tobias/BAC_bias.bw')
		PWMRadius <- 10

		benchmark <- t(pbmcapply::pbmcmapply(
		function(tileInd){

		tileRange <- tileRanges[tileInd]
		tileInsertion <- rowSums(getRegionATAC(countData, regionInd = as.character(tileInd),
		groupIDs = 1:5, width = 1000))
		tilePositions <- IRanges::tile(tileRange, width = 1)[[1]]
		coverage <- sum(tileInsertion)

		biasBenchmark <- list()

		# Retrieve Tn5 bias calculated by kmer
		for(k in names(kmerBiasList)){
		contextRanges <- IRanges::resize(tilePositions, fix = "center", width = as.integer(k))
		contextSeq <- Biostrings::getSeq(hg38, contextRanges, as.character = T)
		kmerBias <- unname(kmerBiasList[[k]][contextSeq])
		biasBenchmark[[paste0(k, "-mer")]] <- cor(kmerBias, tileInsertion)
		}

		# Retrieve Tn5 bias calculated by PWM
		contextRanges <- IRanges::resize(tilePositions, fix = "center", width = PWMRadius * 2 + 1)
		contextSeq <- Biostrings::getSeq(hg38, contextRanges, as.character = T)
		PWMBias <- 2 ^ sapply(contextSeq, function(seq){PWMScoring(seq, PWM)})
		biasBenchmark[["PWM"]] <- cor(PWMBias, tileInsertion)

		# Retrieve Tn5 bias estimated by Tobias
		TobiasTn5Bias <- subsetByOverlaps(TobiasBigWig, tileRanges[tileInd])$score
		biasBenchmark[["dinucleotidePWM"]] <- cor(2 ^ TobiasTn5Bias, tileInsertion)

		# Calculate accuracy by convolution neural net
		biasBenchmark[["ConvNet"]] <- cor(regionBias(project)[tileInd,], tileInsertion / conv(tileInsertion, 50))

		biasBenchmark <- Reduce(c, biasBenchmark)

		c(coverage, biasBenchmark)

		},
		1:length(tileRanges),
		mc.cores = 16
		))

		# Filter out entries with NA values
		benchmark <- benchmark[rowSums(is.na(benchmark)) == 0, ]

		# Filter out low-coverage tiles
		benchmark <- benchmark[benchmark[, 1] > 20000, ]

		methods <- c("3-mer", "5-mer", "7-mer", "PWM", "dinucleotide PWM", "CNN")
		plotData <- data.frame(
		Correlation = colMeans(benchmark)[2:7],
		Method = methods
		)
		pdf("../../data/BAC/plots/benchmark.pdf", width = 8, height = 8)
		ggplot(plotData) +
		geom_bar(aes(x = Method, y = Correlation), stat = "identity", width = 0.5, fill = "#CA9B80") +
		xlab("Bias correction method") + ylim(0, 1) +
		ylab("Correlation between predicted \nand observed Tn5 bias") +
		scale_x_discrete(limits = methods) +
		theme_classic()
		dev.off()

		########################
		# Model interpretation #
		########################

		# Get Tn5 bias predicted by PWM and CNN
		predBias <- t(pbmcapply::pbmclapply(
		1:length(tileRanges),
		function(tileInd){

		tileRange <- tileRanges[tileInd]
		tileInsertion <- rowSums(getRegionATAC(countData, regionInd = as.character(tileInd),
		groupIDs = 1:5, width = 1000))

		if(mean(tileInsertion) > 20){
		tilePositions <- IRanges::tile(tileRange, width = 1)[[1]]
		localCoverage <- conv(tileInsertion, 50) / 100
		obsBias <- tileInsertion / localCoverage

		# Retrieve Tn5 bias calculated by PWM
		contextRanges <- IRanges::resize(tilePositions, fix = "center", width = PWMRadius * 2 + 1)
		contextSeq <- Biostrings::getSeq(hg38, contextRanges, as.character = T)
		PWMBias <- 2 ^ sapply(contextSeq, function(seq){PWMScoring(seq, PWM)})

		# Retrieve Tn5 bias estimated by CNN
		CNNBias <- regionBias(project)[tileInd,]

		data.frame(PWMBias = PWMBias, CNNBias = CNNBias, obsBias = obsBias, context = contextSeq)
		}else{
		NULL
		}

		},
		mc.cores = 16
		))
		predBias <- data.table::rbindlist(predBias)
		predBias <- predBias[!is.na(predBias$obsBias), ]

		# Visual inspection
		sampleInd <- sample(1:dim(predBias)[1], 10000)
		qplot(predBias$PWMBias[sampleInd], predBias$obsBias[sampleInd]) + xlim(0,50) + ylim(0,50)

		# Calculate errors by the two methods on individual positions
		PWMError <- log2(pmax(predBias$PWMBias, 1e-3)) - log2(pmax(predBias$obsBias, 1e-3))
		CNNError <- log2(pmax(predBias$CNNBias, 1e-3)) - log2(pmax(predBias$obsBias, 1e-3))
		PWMAbsError <- abs(PWMError)
		CNNAbsError <- abs(CNNError)
		improvement <- PWMAbsError - CNNAbsError

		qplot(PWMError[sampleInd], improvement[sampleInd], )
		hist(PWMError[order(improvement, decreasing = T)][1:2000])

		diffContext <- predBias$context[order(improvement, decreasing = T)][1:2000]
		simContext <- predBias$context[order(abs(improvement))][1:2000]
		diffCGContent <- sapply(diffContext, function(x){stringr::str_count(x, "[CG]")}) / (2 * PWMRadius + 1)
		simCGContent <- sapply(simContext, function(x){stringr::str_count(x, "[CG]")}) / (2 * PWMRadius + 1)

		plotData <- diffCGContent
		data.frame(plotData) %>%
		ggplot(aes(x = plotData)) +
		geom_histogram(bins=20, fill = '#9C5D41') +
		xlab("GC content") + ylab("Number of examples") +
		theme_classic()

analyses/BAC/.ipynb_checkpoints/dispersionModel-Copy1-checkpoint.ipynb

0 → 100644

+2969 −0

File added.

Preview size limit exceeded, changes collapsed.

analyses/BAC/.ipynb_checkpoints/dispersionModel-checkpoint.ipynb

0 → 100644

+2640 −0

File added.

Preview size limit exceeded, changes collapsed.

Admin message