DEET relies on the following packages. Since they are all CRAN in
origin, they should download and install automatically with
devtools::install_github
or
utils::install.packages
. The required dependencies are
listed below.
::install_github("wilsonlabgroup/DEET") devtools
# IN DEVELOPMENT
All processed DEGs, metadata, and enriched pathways in formats compatible with this package as well as other methods such as gene set enrichment analysis are stored here: https://www.wilsonlab.org/public/DEET_data/
No functions within DEET automatically load data for the user, so the data either needs to be downloaded directly from the ftp, or using the downloader function.
The DEET_data_download
function, with possible inputs
“ALL”, “metadata”, “enrich”, and “feature_extract” automatically
downloads the data required to run DEET_enrich
and/or
DEET_feature_extract
.
We reccomended using:
<- DEET_data_download("ALL")
downloaded <- downloaded$metadata
metadata <- downloaded$DEET_feature_extract
DEET_feature_extract_input <- downloaded$DEET_enrich DEET_enrich_input
Here: DEET_enrich_input
replaces
DEET_example_data
for DEET_enrich()
.
DEET_feature_extract_input
replaces
DEET_feature_extract_example_matrix
for
DEET_feature_extract()
Lastly, metadata
is not
directly used in any of the function, but summarizes all of the pairwise
comparisons using the following columns.
Once download, save these data and DEET can be used offline.
A comparison - by - explanatory piece of data dataframe providing important details to contextualize each study. For every pairwise comparison, the study name, source (SRA, TCGA, GTEx and SRA-manual), description from the DRA compendium, the number of samples (total, up-condition, and down-condition), samples (total ,up-condition, down-condition), tissue (including tumour from TCGA), number of DEs (total, up-condition, down-condition), age (mean +- sd), sex, top 15 DEGs - up, top 15 DEGs - down, top 5 enriched pathways, and top 5 enriched TFs. PMID are also available for studies selected from SRA. Lastly, each pairwise comparison was given an overall category based on those decided in Crow et al., 2019.
This is the meat and potatoes of the DEET dataset. Here, you can find
all of the significant DE genes computed within DEET (padj < 0.05),
DEGs, pathways, and TFs sorted into *gmt files compatible with
traditional pathway enrichment tools (e.g., GSEA, gprofiler etc.),
respective metadata, and the pathway enrichment and TF enrichment files
used to generate the internal pathway enrichments of
DEET_enrich
. A more specific breakdown of these objects are
below:
https://download.baderlab.org/EM_Genesets/
.https://download.baderlab.org/EM_Genesets/
.metadata
dataframe (see
above).A gene by comparison matrix populated by the log2Fold-change of genes
that are significantly DE in the comparison (padj < 0.05). The file
is the input to the mat
variable in
DEET_feature_extract.
The primary function of the DEET R package is to allow users to query
their own list of DEGs against the consistently computed DEGs within
DEET by using the function DEET_enrich()
. The optimal input
into DEET_enrich()
is a data frame of genes (human gene
symbols) with an associated p-value and coefficient (e.g., Fold-change)
in conjunction with a list of genes designating the statistical
background. DEET_enrich()
first identifies enriched
biological pathways and TF targets using the *.gmt files used for all of
the DEET comparisons (i.e.,
“Human_GO_AllPathways_with_GO_iea_June_01_2021_symbo.gmt” for pathways
and “Human_TranscriptionFactors_MSigdb_June_01_2021_symbol.gmt” for
TFs), allowing us to not only compare overlapping genes between the
user-inputted genes and the DEGs in DEET but also overlapping pathways
and TFs. All gene-set enrichment within the DEET_enrich()
functions use ActivePathways with all detected genes as the background,
Brown’s p-value fusion method, a false-discovery rate for p-value
correction, and a cutoff of 0.05. Then, DEET_enrich()
enriches the users’ inputted genes, pathways, and TF targets against the
DEGs, pathways, and TF targets stored within DEET. Enrichment of the
user’s inputted gene lists against the DE comparisons within DEET are
also completed with ActivePathways, with a minimum geneset filter of 15
and a maximum of 10000. Then, DEET_enrich()
computes the
Spearman’s and Pearson’s correlation between the coefficients within the
user’s imputed list of DEGs that overlap with the log2(Fold-change) of
DEGs within enriched pairwise comparisons.P-values of these correlations
are corrected with an FDR-adjustment. Together, DEET_enrich() returns
significantly enriched studies based on overlapping DEGs, pathways, and
TFs. Similarly, DEET_enrich()
returns the traditional
pathway and TF motif enrichment of the inputted gene list. All
enrichment outputs are in the format of the output of ActivePathways
(study, FDR-adjusted p-value, input length, DE comparison length,
overlapping genes). DEET_enrich()
also returns a dataframe
of the Spearman’s and Pearson’s correlation (with associated
FDR-adjusted p-values) between the inputted DE list with the DEGs found
in DEET as well as the intersecting DEGs within those studies.
Optionally, DEET_enrich()
may be used with a generic gene
list (i.e. without P-values or coefficients). If the inputted gene list
is ordered, then the p-value is artificially generated as equation 1 and
the coefficient is artificially generated as equation 2. We assume an
inputted list in decreasing order of significance, so the FDR and coef
in equations 1 and 2 are reversed. DEET_enrich()
then runs
normally but Pearson’s correlation between the inputted gene list and
the DEGs within DEET are excluded. If the inputted gene list is
unordered, then all of the p-values are set to 0.049 and both Spearman’s
and Pearon’s correlations between the users inputted genes and the DEGs
within DEET are excluded. If users do not provide a background set of
genes, then we assume the background set is all genes detected within
DEET.
For a sorted list of genes without a p-value or coefficient: Note, this happens internally, you do not have to do it.
<- c("a", "b", "c", "d") # list of genes user inputs
DEG_list
<- data.frame(gene_symbol = DEG_list)
DEG_processed # DEG list is the list of genes that the user inputs
<- 0.049
padj for(i in 2:nrow(DEG_processed)) {
<- padj[i-1] * 0.95
padj[i]
}<- rev(padj)
padj <- rev(seq(1, 1 + 0.1*(nrow(DEG_processed) - 1), 0.1))
log2fc
$padj <- padj
DEG_processed$coef <- log2fc
DEG_processedcolnames(DEG_processed) <- c("gene_symbol", "padj", "coef")
The DEET R package also contains plotting functions to summarize the
most significant studies based on each enrichment test and correlation
within DEET_enrich()
. The
proccess_and_plot_DEET_enrich()
function plots barplots of
the most enriched studies based on gene set enrichment (ActivePathways)
of studies enriched based on overlapping DEGs, pathways, and TF targets.
The DEET_plot_correlation()
function generates scatterplots
of the most enriched studies based on Spearman’s correlation analysis.
All plots are generated using ggplot2, and the functions return the
ggplot2 objects, allowing researchers to further modify and/or save the
plots.
Lastly, the DEET R package contains a function called
DEET_feature_extract()
, which allows researchers to
identify genes that are associated with metadata. If the response
variable are continuous (e.g., number of DEGs in study, Fold-change of
TP53 etc.) then features are extracted by calculating the coefficients
from a Gaussian family elastic net regression using the glmnet R
package, as well as Spearman’s correlation between every gene and the
response variable. If the response variable is categorical (e.g.,
Source, Category etc.), then features are extracted by calculating the
coefficients from a multinomial family elastic net regression, as well
as an ANOVA between each category within the response variable. Lastly,
in the response variable is ordinal (e.g., enriches for TNFa pathway,
Cancer study yes/no etc.), then features are extracted using a binomial
family elastic net regression, as well as a Wilcoxon’s test between the
two categories within the response variable.
DEET_enrich()
expects a list of human gene symbols
as either a character vector or as a data frame with the columns
c("gene_symbol", "padj", "coef"). If you are inputted DEGs, the log2Fold-change is the most appropriate coefficient, but I left this general for other inputted data types (e.g., effect size from GWAS). The
example_DEET_enrich_inputhas the structure required for a gene list with p-values and coefficients. An example of the unordered list is
example_DEET_enrich_input$gene_symbol`.
DEET_dataset should either be the example file in
DEET_example_data
or DEET_enrich_input
downloaded by the downloader function or manually from the FTP. other
options will likely cause the tool to crash.
ordered: This parameter only applies when the
DEG_list
variable is a character vector (not a dataframe).
It determins if the inputted list is ordered or not.
background: a character vector of genes within the universe. For example, if your input is a list of DEGs in RNA-seq, then “background” should be all of your detected genes. Like in traditional pathway enrichment, not seeting a background may bias the enriched studies by tissue and/or cell-type as more highly expressed genes have more power to be detected as DE.
data("example_DEET_enrich_input")
data("DEET_example_data")
<- DEET_enrich(example_DEET_enrich_input, DEET_dataset = DEET_example_data) DEET_out
data("example_DEET_enrich_input")
data("DEET_example_data")
<- example_DEET_enrich_input$gene_symbol
geneList <- DEET_enrich(geneList, DEET_dataset = DEET_example_data, ordered = TRUE) DEET_out
data("example_DEET_enrich_input")
data("DEET_example_data")
<- example_DEET_enrich_input$gene_symbol
geneList <- DEET_enrich(geneList, DEET_dataset = DEET_example_data, ordered =FALSE) DEET_out
The output of these three comparisons will be comparable, however the correlation variable is of note when the input gene set is just a list of genes.
When the gene set is ordered, a Spearman’s correlation is interpretable, as it is simply the rank-order of genes, however a Pearson’s correlation is not interpretable as we do not know the relative difference in coefficient size of your inputted genes
If the gene list is unordered, correlation analysis is entirely
uninterpretable and is not run. You are given this message:
Input gene list is considered UNORDERED: Correlation analysis will not be run and pathway enrichment will be unordered.
Since it not run the output is
No variance in coefs. Cannot proceed with correlation.
Named list where each element contains 6 objects. Each object will contain the results (enrichment or correlation) and corresponding metadata.
AP_INPUT_BP_output
but instead of enriched
gene ontologies, there is the enriched TFs.results
, which is
the same data.table as in AP_INPUT_BP_output
but instead of
pathways, we are enriching against DE comparisons found within DEET. The
second is metadata
which is a subset of the larger metadata
dataframe that corresponds with the gene sets in
results
.AP_DEET_DE_output
where pathway gene
ontology term names are in the place of DE gene names.AP_DEET_DE_output
where pathway gene transcription factor term names are in the place of
DE gene names.results
is a data frame where the rows are
different pairwise comparisons and the columns are the ouput of the
correlational analysis between overlapping genes. Columns are the DEET
ID and DEET name, the pearson/spearman’s correlation coefficients, as
well as their raw and FDR adjusted p-values. The metadata
object are the subsetted rows from the metadata data frame that align
with the studies containing significant correlations. Lastly, the
distributions
object is a list where each element is a
diffierent significantly associated pairwise comparisons. Each element
is populated by a dataframe where rows are genes significant in at least
one study, columns are the coefficients from the input study and DEET,
and colour designates whether the gene is DE in one study (grey), both
studies in the same direction (purple) and both studies in the opposite
direction (orange).DEET_feature_extract expects three variables: a gene-by-comparison matrix populated with a statistic related to differential expresion (e.g., p-value, fold-change), a response variable (i.e., an output dependent variable), and a category.
mat - a gene-bycomparison matrix populated with a statistic
related to differential expression. The matrix in
DEET_data_download()
is populated with the log2FC of genes
if they were deemed as significant (padj < 0.05). Other matrices,
namely those populated by p-value, fold-change, and t-statitic
(including the matrices for all DEGs) can be found in
https://www.wilsonlab.org/public/DEET_data/feature_matrices
and can be downloaded separately.
respeonse: a character vector identifying each study, this can be categorical, binomial (e.g. 0/1, T/F), or continuous
datatype: a character indicating if the response is “binomial”, “categorical”, or “continuous”. This is case sensitive.
data(DEET_feature_extract_example_matrix)
data(DEET_feature_extract_example_response)
<- DEET_feature_extract(DEET_feature_extract_example_matrix,
single1 "categorical") DEET_feature_extract_example_response,
DEET feature extract outputs a list of three objects.
datatype
is “continuous”, then this will just be a
one-element list.glmnet
are default aside from the
family
, which is determined by datatype
and
alpha which is set to 0.5 (selecting an elastic net instead of L1 or L2
regularized).datatype
input was categorical, binomial or continuous
respsectively. The other object is a dataframe showing the significance
of each gene, where rows are genes, and columns are stastistics (e.g.,
F-statistic for ANOVA), P-value, and FDR-adjusted p-value.The proccess_and_plot_DEET_enrich()
function is a
wrapper that generates barplots and a dot plot of enrichment for all of
the individual outputs of DEET_enrich()
(not the
correlations). The outputs are in ggplot2 objects, allowing users to
further modify the plots or print however they like.
DEET_enrich()
a
named list with the ActivePathways enrichment from each data type. If
there wasn’t enrichment of a certain datatype (e.g. AP_DEET_BP_output)
the function runs properly but it is not plotted.The remaining varables are for graphical parameters that are passed
into the DEET_enrichment_plot()
function.
data("example_DEET_enrich_input")
data("DEET_example_data")
<- DEET_enrich(example_DEET_enrich_input, DEET_dataset = DEET_example_data)
DEET_out <- proccess_and_plot_DEET_enrich(DEET_out, text_angle = 45,
plotting_example horizontal = TRUE, topn=4)
Another example Where AP_DEET_BP_output
is not
significant, to show the plotting function still works.
data("example_DEET_enrich_input")
data("DEET_example_data")
<- DEET_enrich(example_DEET_enrich_input, DEET_dataset = DEET_example_data)
DEET_out $AP_DEET_DE_output <- "No enrichment to be plotted"
DEET_out<- proccess_and_plot_DEET_enrich(DEET_out, text_angle = 45,
plotting_example horizontal = TRUE, topn=4)
There are up to four outputs assuming everything is significant, each output is a list or a ggplot object.
DEET_enrich()
).DEET_enrichment_plot
. An example of
this is shown below.<- DEET_out$AP_DEET_DE_output$results
DE_example
# Changes for DEET_example_plot
$term.name <- DEET_out$AP_DEET_DE_output$metadata$DEET.Name
DE_example$domain <- "DE"
DE_example$overlap.size <- lengths(DE_example$overlap)
DE_example$p.value <- DE_example$adjusted.p.val
DE_example
<- DEET_enrichment_plot(list(DE_example = DE_example), "DE_example") DE_example_plot
As shown above, from here you can also just use
DEET_enrichment_plot
directly to have some more control
over these plots.
This function also takes the direct output from
DEET_enrich
and generates scatterplots of the correlations
of studies whose log2FCs are significantly correlated with the input DE
list.
correlation_input - The DE_correlations
object that is
the output of the DEET_enrich
function. It only works if
there was at least one study that was significantly correlated.
data("example_DEET_enrich_input")
data("DEET_example_data")
<- DEET_enrich(example_DEET_enrich_input, DEET_dataset = DEET_example_data)
DEET_out <- DEET_out$DE_correlations
correlation_input <- DEET_plot_correlation(correlation_input) correlation_plots
As mentioned previously, the genesets within DEET are easily transferrable to other gene set enrichment datasets.
One option is to download the *gmt files diretly from our ftp.
https://www.wilsonlab.org/public/DEET_data/DEET_DE.gmt
Is
directly compatible with these tools.
The other option would be to save the downloaded DEET gmt as a gmt file. This is completed using the ActivePathways R package. Instead of using the example data as shown below, please use the full dataset.
Instead of saving to a temporary directory like in this vignette, save the file wherever you want the directory to be saved.
<- DEET_example_data$DEET_gmt_DE
DEET_gmt message(paste0("DEET_gmt is an object of class gmt?: ",ActivePathways::is.GMT(DEET_gmt) ))
::write.GMT(DEET_gmt, file = paste0(tempdir(),"/DEET_DEs.gmt")) ActivePathways
If you have two dependent gene lists to input into DEET, you can use ActivePathways directly to find combind DEET-comparison enrichment of the two gene sets.
set.seed(1234) # as I sample p-values to make the toy example
# For example two, I had the same genes but I shuffled the p-value
$padj2 <- sample(example_DEET_enrich_input$padj, length(example_DEET_enrich_input$padj), replace = FALSE)
example_DEET_enrich_input
# Make a gene-by-input-list matrix of the adjusted p-values from your multiple gene sets
<- as.matrix(example_DEET_enrich_input[,c("padj", "padj2")])
AP_matrix
# Run activepathways on the combined matrix.
# Get gmt file, again from the whole list:
<- DEET_example_data$DEET_gmt_DE
DEET_gmt
head(AP_matrix)
<- ActivePathways::ActivePathways(scores=AP_matrix, gmt=DEET_gmt, geneset.filter = c(5,10000),correction.method = "fdr") AP_example_out
The outputs of using ActivePathways
are the same as
DEET_enrich()
but with a couple extra columns. * evidence:
Whether the DEET comparison is enriched because of one gene list, both
gene lists, or an integrated version of these gene lists *
Genes_colname
: The genes that contributed to enrichment
from each inputted gene list.