TCGAretriever
is an R library aimed at downloading
clinical and molecular data from cBioPortal (https://www.cbioportal.org/), including The Cancer
Genome Atlas (TCGA) data (free-tier data). TCGA is a program aimed
at improving our understanding of Cancer Biology. Several TCGA Datasets
are available online. TCGAretriever
helps accessing and
downloading TCGA data using R via the cBioPortal API. Features
of TCGAretriever
are:
it is a simple-to-use R-based interface to the cBioPortal API;
it supports downloading molecular data of large numbers of genes.
This tutorial describes few use cases to help getting started with
TCGAretriever
.
In this section, we provide an overview of the basic operations and commonly used functions.
Here we retrieve the list of available studies and select all
tcga_pub entries. Each of the values included in the
studyId
column is a study identifier and can be passed as
csid
argument to other functions such as
get_genetic_profiles()
or
get_case_lists()
.
library(TCGAretriever)
library(reshape2)
library(ggplot2)
# Obtain a list of cancer studies from cBio
all_studies <- get_cancer_studies()
# Find published TCGA datasets
keep <- grepl("tcga_pub$", all_studies[,'studyId'])
tcga_studies <- all_studies[keep, ]
# Show results
utils::head(tcga_studies[, c(11, 1, 4)])
## studyId name
## 27 blca_tcga_pub Bladder Urothelial Carcinoma (TCGA, Nature 2014)
## 39 brca_tcga_pub Breast Invasive Carcinoma (TCGA, Nature 2012)
## 88 gbm_tcga_pub Glioblastoma (TCGA, Nature 2008)
## 112 kirc_tcga_pub Kidney Renal Clear Cell Carcinoma (TCGA, Nature 2013)
## 113 kich_tcga_pub Kidney Chromophobe (TCGA, Cancer Cell 2014)
## 116 hnsc_tcga_pub Head and Neck Squamous Cell Carcinoma (TCGA, Nature 2015)
## pmid
## 27 24476821
## 39 23000897
## 88 18772890
## 112 23792563
## 113 25155756
## 116 25631445
It is possible to retrieve genetic profiles and case lists associated to each study of interest. Genetic profiles indicate what kind of data are available for a certain study (e.g., RNA-seq, micro-array, DNA mutation…). Typically, a data type of interest may only be available for a fraction of patients in the cohort. Therefore, it is also important to obtain the identifier of the corresponding case list of interest.
# Define the cancer study id: brca_tcga_pub
my_csid <- "brca_tcga_pub"
# Obtain genetic profiles
brca_pro <- get_genetic_profiles(csid = my_csid)
utils::head(brca_pro[, c(7, 1)])
## molecularProfileId molecularAlterationType
## 1 brca_tcga_pub_rppa PROTEIN_LEVEL
## 2 brca_tcga_pub_rppa_Zscores PROTEIN_LEVEL
## 3 brca_tcga_pub_gistic COPY_NUMBER_ALTERATION
## 4 brca_tcga_pub_linear_CNA COPY_NUMBER_ALTERATION
## 5 brca_tcga_pub_mutations MUTATION_EXTENDED
## 6 brca_tcga_pub_methylation_hm27 METHYLATION
# Obtain cases
brca_cas <- get_case_lists(csid = my_csid)
utils::head(brca_cas[, c(4, 1)])
## sampleListId category
## 1 brca_tcga_pub_all all_cases_in_study
## 2 brca_tcga_pub_basal other
## 3 brca_tcga_pub_claudin other
## 4 brca_tcga_pub_cna all_cases_with_cna_data
## 5 brca_tcga_pub_cnaseq all_cases_with_mutation_and_cna_data
## 6 brca_tcga_pub_complete other
Once we identified a genetic profile of interest and a case list of
interest, we can obtain clinical data via the
TCGAretriever::get_clinical_data()
. Additionally, molecular
data can be obtained via the
TCGAretriever::get_molecular_data()
function (non-mutation
data) or the get_mutation_data()
function (mutation
data).
# Define a set of genes of interest
q_csid <- 'brca_tcga_pub'
q_genes <- c("TP53", "HMGA1", "E2F1", "EZH2")
q_cases <- "brca_tcga_pub_complete"
rna_prf <- "brca_tcga_pub_mrna"
mut_prf <- "brca_tcga_pub_mutations"
# Download Clinical Data
brca_cli <- get_clinical_data(csid = q_csid, case_list_id = q_cases)
# Download RNA
brca_RNA <- get_molecular_data(case_list_id = q_cases,
gprofile_id = rna_prf,
glist = q_genes)
NOTE 1: the resulting data.frame includes ENTREZ_GENE_IDs and OFFICIAL_SMBOLs as first and second column. The third column indicates the gene/assay type.
NOTE 2: up to n=500 genes can be queried via
the get_molecular_data()
function.
# Show results
brca_RNA[, 1:5]
## entrezGeneId hugoGeneSymbol type TCGA-A1-A0SD-01 TCGA-A1-A0SE-01
## 1 1869 E2F1 protein-coding -1.453500 -1.352500
## 2 2146 EZH2 protein-coding -1.792800 -1.565200
## 3 3159 HMGA1 protein-coding -2.519500 -1.966333
## 4 7157 TP53 protein-coding -1.010667 -0.338500
# Set SYMBOLs as rownames
# Note that you may prefer to use the tibble package for this
rownames(brca_RNA) <- brca_RNA$hugoGeneSymbol
brca_RNA <- brca_RNA[, -c(1, 2, 3)]
# Round numeric vals to 3 decimals
for (i in 1:ncol(brca_RNA)) {
brca_RNA[, i] <- round(brca_RNA[, i], digits = 3)
}
# Download mutations
brca_MUT <- get_mutation_data(case_list_id = q_cases,
gprofile_id = mut_prf,
glist = q_genes)
# Identify Samples carrying a TP53 missense mutation
tp53_mis_keep <- brca_MUT$hugoGeneSymbol == 'TP53' &
brca_MUT$mutationType == 'Missense_Mutation' &
!is.na(brca_MUT$sampleId)
tp53_mut_samples <- unique(brca_MUT$sampleId[tp53_mis_keep])
# Show results
keep_cols <- c('sampleId', 'hugoGeneSymbol', 'mutationType', 'proteinChange')
utils:::head(brca_MUT[, keep_cols])
## sampleId hugoGeneSymbol mutationType proteinChange
## 1 TCGA-A2-A04R-01 E2F1 Missense_Mutation Y168C
## 2 TCGA-E2-A154-01 E2F1 Missense_Mutation R165W
## 3 TCGA-AN-A0XW-01 EZH2 Nonsense_Mutation S644*
## 4 TCGA-AO-A0JL-01 TP53 Nonsense_Mutation R306*
## 5 TCGA-A8-A079-01 TP53 Missense_Mutation S215I
## 6 TCGA-AO-A124-01 TP53 Nonsense_Mutation R213*
NOTE: Mutation data are returned in a different format (molten data format) compared to data corresponding to other genetic profiles. Results are formatted as molten data.frames where each row is a DNA variant. Each sample / case may have 0 or more variants for each of the query genes.
It is possible to download data corresponding to all available genes
using the fetch_all_tcgadata()
function. This function
takes the same arguments as the get_profile_data()
function
but the glist
argument. Gene lists are obtained
automatically via the get_gene_identifiers()
function. Each
job is automatically split into multiple batches. Results are
automatically aggregated before being returned. If mutation data are
being queried, results are returned as a molten data.frame. The
mutations
argument is used to guide the query type as well
as the output format.
# Download all brca_pub mutation data (complete samples)
all_brca_MUT <- fetch_all_tcgadata(case_list_id = q_cases,
gprofile_id = mut_prf,
mutations = TRUE)
# Download all brca_pub RNA expression data (complete samples)
all_brca_RNA <- fetch_all_tcgadata(case_list_id = q_cases,
gprofile_id = rna_prf,
mutations = FALSE)
In this section, we discuss simple use case where
TCGAretriever
is used to study the relationship between
gene expression and mutation status of some genes of interest in breast
cancer tumors.
Here we use the data retrieved before to analyze the relationship between E2F1 and EZH2 expression in TCGA breast cancer samples.
# Visualize the correlation between EZH2 and E2F1
df <- data.frame(sample_id = colnames(brca_RNA),
EZH2 = as.numeric(brca_RNA['EZH2', ]),
E2F1 = as.numeric(brca_RNA['E2F1', ]),
stringsAsFactors = FALSE)
ggplot(df, aes(x = EZH2, y = E2F1)) +
geom_point(color = 'gray60', size = 0.75) +
theme_bw() +
geom_smooth(method = 'lm', color = 'red2',
size=0.3, fill = 'gray85') +
ggtitle('E2F1-EZH2 correlation in BRCA') +
theme(plot.title = element_text(hjust = 0.5))
Scatterplot - showing E2F1 RNA expression with respect to EZH2 RNA expression across TCGA breast cancer samples.
This is an alternative visualization that illustrates the same
relationship between E2F1 and EZH2 expression. Here, we make use of the
make_groups()
function to discretize the numeric vector
corresponding to EZH2 expression counts. This function takes a numeric
vector as its num_vector
argument (e.g., a gene
expression vector), performs binning, and returns the results as a
data.frame
. Here we show the distribution of E2F1
expression counts with respect to binned EZH2 expression.
# Bin samples according to EZH2 expression
EZH2_bins <- make_groups(num_vector = df$EZH2, groups = 5)
utils::head(EZH2_bins, 12)
## value rank
## 1 -1.793 2
## 2 -1.565 2
## 3 -2.495 1
## 4 -2.137 1
## 5 0.529 5
## 6 -1.439 3
## 7 -0.426 5
## 8 -1.164 4
## 9 -0.894 4
## 10 -0.998 4
## 11 -1.132 4
## 12 -1.483 3
# attach bin to df
df$EZH2_bin <- EZH2_bins$rank
# build Boxplot
ggplot(df, aes(x = as.factor(EZH2_bin), y = E2F1)) +
geom_boxplot(outlier.shape = NA, fill = '#fed976') +
geom_jitter(width = 0.1, size = 1) +
theme_bw() +
xlab('EZH2 Bins') +
ggtitle('E2F1 Expression vs. Binned EZH2 Expression') +
theme(plot.title = element_text(face = 'bold', hjust = 0.5))
Boxplot - showing E2F1 RNA expression with respect to binned EZH2 RNA expression across TCGA breast cancer samples.
We use the data retrieved before to analyze the relationship between HMGA1 and TP53 expression with respect to the TP53 WT status in TCGA breast cancer samples.
# Coerce to data.frame with numeric features
mol_df <- data.frame(sample_id = colnames(brca_RNA),
HMGA1 = as.numeric(brca_RNA['HMGA1', ]),
TP53 = as.numeric(brca_RNA['TP53', ]),
stringsAsFactors = FALSE)
mol_df$TP53.status = factor(ifelse(mol_df$sample_id %in% tp53_mut_samples,
'01.wild_type', '02.mutated'))
# Visualize the correlation between EZH2 and E2F1
ggplot(mol_df, aes(x = TP53, y = HMGA1)) +
geom_point(color = 'gray60', size = 0.75) +
facet_grid(cols = vars(TP53.status)) +
theme_bw() +
geom_smooth(mapping = aes(color = TP53.status),
method = 'lm', size=0.3, fill = 'gray85') +
ggtitle('HMGA1-TP53 correlation in BRCA') +
theme(plot.title = element_text(hjust = 0.5))
Scatterplot - showing HGMA1 RNA expression with respect to TP53 RNA expression across TCGA breast cancer samples. Samples with wild type (left) and mutant (right) TP53 status are shown in the two panels.
sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.4.4 reshape2_1.4.4 TCGAretriever_1.9.1
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.6-3 gtable_0.3.4 jsonlite_1.8.8 highr_0.10
## [5] dplyr_1.1.4 compiler_4.3.2 tidyselect_1.2.0 Rcpp_1.0.11
## [9] stringr_1.5.1 jquerylib_0.1.4 splines_4.3.2 scales_1.3.0
## [13] yaml_2.3.8 fastmap_1.1.1 lattice_0.22-5 R6_2.5.1
## [17] plyr_1.8.9 labeling_0.4.3 generics_0.1.3 curl_5.2.0
## [21] knitr_1.45 tibble_3.2.1 munsell_0.5.0 bslib_0.6.1
## [25] pillar_1.9.0 rlang_1.1.2 utf8_1.2.4 cachem_1.0.8
## [29] stringi_1.8.3 xfun_0.41 sass_0.4.8 cli_3.6.2
## [33] mgcv_1.9-0 withr_2.5.2 magrittr_2.0.3 digest_0.6.33
## [37] grid_4.3.2 rstudioapi_0.15.0 nlme_3.1-163 lifecycle_1.0.4
## [41] vctrs_0.6.5 evaluate_0.23 glue_1.6.2 farver_2.1.1
## [45] fansi_1.0.6 colorspace_2.1-0 rmarkdown_2.25 httr_1.4.7
## [49] tools_4.3.2 pkgconfig_2.0.3 htmltools_0.5.7
Success! TCGAretriever
vignette. Date: 2024-Jan-22, by
D Fantini.