Bioinformatics tools for integrating and understanding molecular changes associated with with Immune Response, Stemness and Oncogenic processes

Instructors

Workshop Description

Recently, The Cancer Genome Atlas (TCGA’s) Pan-Cancer Atlas initiative presented a comprehensive collection of 27 studies covering 11,000 patient tumors from 33 cancer types. These studies investigated cancer complexity from different angles by integrating multi -omics and clinical data. In particular, computational analyses have led to the identification of 299 cancer-driver genes and over 3,400 driver mutations. However, it still remains critical to clarify the consequences of each alteration and the underlying biological effects. We will discuss several computational tools that are useful in clarifying gene functions when performing integrative analysis of multi-omics datasets. In order to deal with the challenges of data retrieval and integration, TCGAbiolinks (Colaprico et al. 2018, PMID: 26704973) and DeepBlueR (Albrecht et al., 2017, PMID: 28334349) were developed to retrieve data from TCGA, CPTAC, GTEx, GEO and IHEC, Blueprint, ENCODE, Roadmap, respectively. Tumor-specific cancer-driver-gene events and downstream impact can be elucidated by integrative analyses of these datasets using MoonlightR (Colaprico et al. 2018). TCGAbiolinks and MoolightR have been used successfully in multiple studies for oncogenic processes identification (Ding et al., 2018, PMID: 29625049), oncogenic clinically actionable driver genes discovery (Bailey et al., 2018, PMID: 29625053), and comprehensive immune landscape characterization (Thorsson et al., 2018, PMID: 29628290)

Workshop: overview

In this workshop we will demonstrate the capability of TCGAbiolinks, deepblueR and MoonlightR for (1) retrieving -omics data from different consortia (TCGA, GTEx, GEO and IHEC); (2) discovering cancer driver genes by integrating multi-omics datasets; (3) Identifying the six immune subtypes from TCGA PanCancer dataset and (4) Estimating stemness scores using gene expression data. Our goal is for the end users to understand how biological features (Immune Subtypes, Oncogenic Processes, Driver Genes and Stemness) can be used to expand understating of their own datasets.

The packages we need for this session are

library(SummarizedExperiment)
library(dplyr)
library(Bioc2019PanCancerStudy)

Data retrieval from (GDC-TCGA, GEO and IHEC)

Using TCGAbiolinks, you can easily query, download, and prepare multi-omics data from NCI’s Genomic Data Commons (GDC). The GDC provides the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine. The GDC supports several cancer genome programs at the NCI Center for Cancer Genomics (CCG), including The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research to Generate Effective Treatments (TARGET). [https://gdc.cancer.gov/]

library(TCGAbiolinks)

We will illustrate retrieval of different types of –omics datasets including: 1. Gene expression 2. Copy number 3. Protein expression (RRPA) 4. Methylation 5. Clinical data 6. microRNA expression 7. Mutation 8. Chromatin Accessibility

TCGA Gene Expression (IlluminaHiSeq_RNASeqV2) and multi -omics data

The end user can search TCGA’s samples for a specific cancer of origin, download and prepare a matrix of gene expression. This code is an example; we provide the output of this code in this workshop package (see below).

cancerType <- "BRCA"

# Query platform Illumina HiSeq with a list of barcode 
query <- GDCquery(
  project = paste0("TCGA-",cancerType),
  data.category = "Gene expression",
  data.type = "Gene expression quantification",
  experimental.strategy = "RNA-Seq",
  platform = "Illumina HiSeq",
  file.type = "results",
  legacy = TRUE
)
# 38 seconds on GO's machine

# We select 10 tumor and 10 normal samples
Sample_sel <- query$results[[1]]$cases

# TP is Primary Tumor
# NT is Solid Tissue Normal

Sample_sel_TP <- TCGAquery_SampleTypes(barcode = Sample_sel,typesample = "TP")
Sample_sel_NT <- TCGAquery_SampleTypes(barcode = Sample_sel,typesample = "NT")                               

Sample_sel_short <- c(Sample_sel_TP[1:10], Sample_sel_NT[1:10])

# we need to create a new query with the selected barcodes
system.time(
query_down <- GDCquery(
  project = paste0("TCGA-",cancerType),
  data.category = "Gene expression",
  data.type = "Gene expression quantification",
  experimental.strategy = "RNA-Seq",
  platform = "Illumina HiSeq",
  file.type = "results",
  barcode = Sample_sel_short,
  legacy = TRUE
)
# 44.30 seconds on GO's machine
)

Now that we have defined the query, we can download the respective data. Makse sure you have a directory to save this data defined beforehand. The data will be downloaded to the folder TCGA-BRCA (the first value of query_down$project) as a subdirectory of the data directory specified. In addition to downloading the data, this function will write a file called “MANIFEST.txt” to your current working directory, not to the directory you specify.

# DataDir is a directory to save your datasets
DataDir <- "TCGAanalysis"

# Download a list of barcodes with platform IlluminaHiSeq_RNASeqV2
GDCdownload(query_down, directory = DataDir)

# Prepare expression matrix with geneID in the rows and samples (barcode) in the columns
# rsem.genes.results as values
BRCARnaseqSE <- GDCprepare(query_down, directory = DataDir)

Next, the TCGAanalyze_Preprocessing() function performs Array-Array Intensity correlation (AAIC). It defines a square symmetric matrix of Spearman correlations among samples. According this matrix and boxplot of correlation samples by samples it is possible to find samples with low correlation that can be identified as possible outliers. The cor.cut controls the correlation threshold to filter out outliers with correlation lower than the threshold. (Rather than require that you execute the above code, we have included the output as the BRCARnaseqSE object.)

data("BRCARnaseqSE")

dataPrep <- TCGAanalyze_Preprocessing(
  BRCARnaseqSE, cor.cut = 0.6
)

dataNorm <- TCGAanalyze_Normalization(
  tabDF = dataPrep,
  geneInfo = geneInfo,
  method = "gcContent"
) 
#> I Need about  5.3 seconds for this Complete Normalization Upper Quantile  [Processing 80k elements /s]
#> Step 1 of 4: newSeqExpressionSet ...
#> Step 2 of 4: withinLaneNormalization ...
#> Step 3 of 4: betweenLaneNormalization ...
#> Step 4 of 4: .quantileNormalization ...

Now that we have preprocessed and normalized our data, we will inspect the sample boxplots.

colnames(dataPrep) <- substr(colnames(dataPrep), 1, 15)
colnames(dataNorm) <- substr(colnames(dataNorm), 1, 15)

par(mfrow = c(2, 1))
boxplot(dataPrep, outline = FALSE, las = 2)
boxplot(dataNorm, outline = FALSE, las = 2)

par(mfrow = c(1, 1))

TCGAanalyze_Filtering allows the user to filter genes selecting a threshold. For istance returns all mRNA with mean across all samples, higher than the threshold defined as quantile mean across all samples.

dataFilt <- TCGAanalyze_Filtering(
  tabDF = dataNorm,
  method = "quantile", 
  qnt.cut =  0.25
)  

The result is shown below:

Table 1: Example of a matrix of gene expression (10 genes in rows and 2 samples in columns)
TCGA-A7-A13G-11A-51R-A13Q-07 TCGA-C8-A1HJ-01A-11R-A13Q-07
SPINK9|643394 0 0
PPM1H|57460 924 335
CEACAM5|1048 4 32
ATP5L2|267020 3 1
SSX3|10214 0 0
ADAMTS14|140766 18 115
ZC3H3|23144 701 826
CNNM2|54805 544 520
MRPL28|10573 798 1519
FOLR2|2350 3168 103

The result from TCGAanalyze_Preprocessing and TCGAanalyze_Normalization is shown below:

To retrieve other datatypes using TCGAbiolinks it is possible to follow the example for gene expression but replacing the parameters as shown in the following table:

data.type data.category platform file.type
1 Gene expression quantification Gene expression Illumina HiSeq results
2 Protein expression quantification Protein expression MDA_RPPA_Core expression
3 Copy number segmentation Copy number variation Affymetrix SNP Array 6.0 nocnv_hg19.seg
4 Methylation Beta Value DNA methylation Illumina Human Methylation 450 HumanMethylation450
5 Simple somatic mutation Simple nucleotide variation NA NA
6 miRNA gene quantification Gene expression Illumina HiSeq hg19.mirbase20.mirna.quantification
7 NA Clinical NA xml

GTEx gene expression data

The Genotype-Tissue Expression (GTEx) project is an ongoing effort to build a comprehensive public resource to study tissue-specific gene expression and regulation. Samples were collected from 53 non-diseased tissue sites across nearly 1000 individuals, primarily for molecular assays including WGS, WES, and RNA-Seq. [https://gtexportal.org/home/] You can easily search GTEx samples, download and prepare a matrix of gene expression [RNA-Seq].

# Genotype-Tissue Expression (GTEx) project, which is another large-scale
#   repository cataloging gene expression from healthy individuals.

# In the following example we will retrieve data from brain's tissue.

data_gtex_brain <- TCGAquery_recount2("gtex", tissue = "brain")

Gtex_brain_annotation <- colData(data_gtex_brain$gtex_brain)
sort(table(Gtex_brain_annotation$smtsd))

data_gtex_brain_GeneExpr <- assay(data_gtex_brain$gtex_brain,1)

# data_gtex_brain_GeneExpr[1:5,1:5]
#SRR2166176 SRR2167642 SRR607445 SRR663753 SRR1485892
#ENSG00000000003.14      98669     122656     74728     56933      69009
#ENSG00000000005.5         764       1018       760      2105        151
#ENSG00000000419.12    1076085    1034842     56668    207245     124647
#ENSG00000000457.13     746248     752634     48502    121776      66762
#ENSG00000000460.16     603111     588176     25143     55928      34473

IHEC gene expression data

The International Human Epigenome Consortium (IHEC) is a global consortium with the primary goal of providing free access to high-resolution reference human epigenome maps for normal and disease cell types to the research community. [http://ihec-epigenomes.org/] IHEC provides data currently available from the following consortia: * AMED-CREST (Japan) * Blueprint (Europe) * CEEHRC (Canada) * DEEP (Germany) * ENCODE (USA) * GIS (Singapore) * KNIH (South Korea) * Roadmap (USA)

Within IHEC, the BLUEPRINT consortium has been formed with the aim to further the understanding of how genes are activated or repressed in both healthy and diseased human cells. Specifically, BLUEPRINT includes data on distinct types of haematopoietic cells from healthy individuals and their malignant leukaemic counterparts.

library(DeepBlueR)
#> Loading required package: XML
#> Loading required package: RCurl
#> Loading required package: bitops
#> Welcome to the DeepBlueR package
#> DeepBlue is online

You can easily search IHEC BLUEPRINT samples, download and prepare a matrix of gene expression.

# List all IHEC samples 
IHEC_samples <- deepblue_list_samples()

#table(IHEC_samples$source)

#          BLUEPRINT Epigenome Blueprint HSC differentiation         BLUEPRINT Progenitors 
#                         1451                            76                            76 
#                       CEEHRC                    ChIP-Atlas                         CREST 
#                          529                           497                            38 
#                         DEEP                   DEEP (IHEC)                        ENCODE 
#                          194                            39                          3317 
#                   ENCODE FTP           Roadmap Epigenomics   TEPIC reprocessed IHEC data 
#                            9                           144                            28 


# List all BLUEPRINT samples
blueprint_samples <- deepblue_list_samples(
    extra_metadata = list("source" = "BLUEPRINT Epigenome"))

# Extract their ids
blueprint_samples_ids <- deepblue_extract_ids(blueprint_samples)

# Select gene expression data. We assign gene names using Gencode 22
gene_exprs_query <- deepblue_select_expressions(sample_ids = blueprint_samples_ids,
                                                expression_type = "gene",
                                                gene_model = "gencode v22")

# We request the data and define the output format
request = deepblue_get_regions(query_id = gene_exprs_query, 
                               "@GENE_ID(gencode v22),FPKM,@BIOSOURCE,@SAMPLE_ID")

# We download the data
gene_regions <- deepblue_download_request_data(request)

# We retain a table mapping sample ids to bisources
sample_names <- dplyr::select(gene_regions, `@BIOSOURCE`, `@SAMPLE_ID`) %>%
    dplyr::distinct()

# We filter out duplicated gene entries
genes_one_sample <- dplyr::filter(gene_regions, `@SAMPLE_ID` == "s10678")
duplicated_genes <- genes_one_sample[
    which(duplicated(genes_one_sample$`@GENE_ID(gencode v22)`)),
    "@GENE_ID(gencode v22)"]

# We convert the gene expression from a list to a data frame and subsequently...
genes_matrix = dplyr::filter(gene_regions,
                             !(`@GENE_ID(gencode v22)` %in% duplicated_genes)) %>%
    dplyr::select(-`@BIOSOURCE`) %>%
    tidyr::spread(key = `@SAMPLE_ID`, value = FPKM)

# ...to a numeric matrix
genes <- genes_matrix[,1]
genes_matrix <- data.matrix(genes_matrix[,-1])
rownames(genes_matrix) <- genes

### OUTPUT
### genes_matrix : The gene expression matrix for all 276 BLUEPRINT samples
### sample_names : A mapping table from sample id to cell type / biosource
IHEC_genes_matrix <- genes_matrix
IHEC_sample_names <- sample_names

save(IHEC_genes_matrix,  IHEC_sample_names, file = "IHEC_BLUEPRINT_geneExpr.Rdata")

#> IHEC_genes_matrix[1:5,1:5]
#                   s10271 s10272 s10275 s10276 s10279
#ENSG00000000003.13   0.01   0.03   2.39   0.70   0.05
#ENSG00000000005.5    0.00   0.00   0.00   0.00   0.00
#ENSG00000000419.11  30.04  17.49  16.36   5.80  11.99
#ENSG00000000457.12   3.29   2.63   2.70   0.78   1.91
#ENSG00000000460.15   3.16   2.06   1.36   0.49   1.65

#> head(IHEC_sample_names)
#                          @BIOSOURCE @SAMPLE_ID
#1:   CD8-positive, alpha-beta T cell     s10506
#2:            germinal center B cell     s10678
#3:         common myeloid progenitor     s10518
#4: adult endothelial progenitor cell     s11176
#5:            neutrophilic myelocyte     s10331
#6:        common lymphoid progenitor     s10486

GEO gene expression dataset

GEO Gene Expression Omnibus (GEO) is the public repository for high throughput data at NCBI. [https://www.ncbi.nlm.nih.gov/geo/] Here we show how to retrieve GEOs’ samples, download and prepare a matrix of gene expression. In particular to retrieve the gene expression matrix for different cell types from Nazor et al. (2012, PMID: 22560082).

library(MoonlightR)
library(GEOquery)
library(devtools)

# get gene expression data
dataNazor <- getDataGEO(GEOobject = "GSE30652",
                        platform =  "GPL6947")

require(SummarizedExperiment)
GSE30652 <- as.data.frame(exprs(dataNazor))

# summarize gene expression data from multiple probes using means
GSE30652_non_norm <- cbind(ILMN = rownames(GSE30652),
                           IDmean = rowMeans(GSE30652),
                           GSE30652)


# get phenotype data
dataNazor_samples <- pData(dataNazor)
dataNazor_samples <- as.data.frame(dataNazor_samples)
dataNazor_samples <- subset(dataNazor_samples,
                              select = c("geo_accession","characteristics_ch1.2"))
colnames(dataNazor_samples)[2] <- "CellType"

dataNazor_samples$CellType <- gsub("cell type: ","",dataNazor_samples$CellType)
dataNazor_samples$CellType <- gsub(", undifferentiated","",dataNazor_samples$CellType)

GPL6947_13512 <- fData(dataNazor)
GPL6947_13512_annot <- as.data.frame(GPL6947_13512)
GPL6947_13512_annot <- subset(GPL6947_13512_annot,
                              select = c("ID","Gene.symbol"))



GSE30652_merge <- merge(x = GPL6947_13512_annot,
                        y = GSE30652_non_norm,
                        by.x = "ID",
                        by.y = "ILMN")


GSE30652_merge <- GSE30652_merge[order(GSE30652_merge$IDmean,decreasing = TRUE),]
GSE30652_merge <- GSE30652_merge[!duplicated(GSE30652_merge$Gene.symbol),]
NazorMatrix <- GSE30652_merge
rownames(NazorMatrix) <- NazorMatrix$Gene.symbol

NazorMatrix <- NazorMatrix[,dataNazor_samples$geo_accession]

save(NazorMatrix,dataNazor_samples, file = "Nazor_GSE30652.Rdata")

Oncogenic processes and driver genes using MoonlightR

MoonlightR overview

In this section, we illustrate the moonlightR analysis pipeline using the BRCA data we downloaded from previous section. In order to make light of cancer development, it is crucial to identify the genes involved in the disease mechanisms and moreover understand what functional roles it plays. Common biological processes such as proliferation and apoptosis have been linked to cancer progression. MoonlightR is a new approach to define tumor suppressor genes (TSGs) and oncogenes (OCGs) based on functional enrichment analysis, inference of gene regulatory networks and upstream regulator analysis to score the importance of well-known biological processes with respect to the studied cancer.

In the MoonlightR pipeline, first, based on expression data we perform functional enrichment analysis, infer gene regulatory networks and upstream regulator analysis to score the importance of well-known biological processes with respect to the studied cancer. We then use these scores to predict two specific roles for genes: genes that act as TSGs and genes that act as OCGs. This methodology not only allows us to identify genes with dual roles (TSG in one cancer type and OCG in another) but also to elucidate the underlying biological processes they are involved with.

library(MoonlightR)

MoonlightR pipeline

Here we describe two Moonlight approaches: Moonlight-EB (expert based) and Moonlight-ML (machine learning). The EB and ML approaches share the following three initial steps (Fig 1b, Methods): (i) DPA: Moonlight identifies a set of Differentially Expressed Genes (DEGs) between two conditions;
(ii) GRN: The gene expression data is used to infer a Gene Regulatory Network (GRN) with the DEGs as vertices and (iii) FEA: using Functional Enrichment Analysis (FEA), Moonlight considers a DEG in a biological system and quantifies the DEG-BP (biological process) association with a Moonlight Z-score. (iv) URA/PRA: Finally, we input DEGs and their GRN to Upstream Regulatory Analysis (URA), yielding upstream regulators of BPs mediated by the DEG and its targets. The second part of the pipeline’s tool provides Pattern Recognition Analysis (PRA) that incorporates two approaches.

In the first approach, PRA takes in two objects: (i) URA’s output and (ii) selection of a subset of the BP provided by the end-user. In contrast, if the BPs are not provided, their selection is automated by a ML method (e.g. random forest model) trained on gold standard oncogenes (OCG) and tumor-suppressor genes (TSG) in the second approach. In addition, dynamic recognition analysis (DRA) detects multiple patterns of BPs when different conditions are selected.

Moonlight’s pipeline is shown below::

DPA: Differential Phenotype Analysis

Differential phenotype analysis (DPA) identifies genes or probes that are significantly different between two phenotypes such as normal vs. tumor, or normal vs. stage I, normal vs. a particular molecular subtype. For gene expression data, DPA performes a differential expression analysis (DEA) to identify differentially expressed genes (DEGs) using the TCGAanalyze_DEA function from . For methylation data, DPA is performes a differentially methylated probes analysis (DMP) to identify differentially methylated CpG sites using the TCGAanalyze_DMR function from .

For the following example we will use a dataset of breast cancer samples TCGA BRCA (10 TP tumor samples and 10 NT normal samples) previously downloaded and included in package.

# dataFilt is a gene expression dataset with lowly expressed genes filtered out
# dataFilt can be generated following the previous section
# TCGA Gene Expression` (IlluminaHiSeq_RNASeqV2) and multi -omics data

dataBRCA <- data("dataFilt")

dataDEGs <- DPA(dataFilt = dataBRCA,
                dataType = "Gene expression",
                fdr.cut = 0.05,
                logFC.cut = 0)

We can visualize those differentially expressed genes (DEGs) with a volcano plot using the TCGAVisualize_volcano function.

TCGAVisualize_volcano(
  dataDEGs$logFC, dataDEGs$FDR,
  filename = "DEGs_volcano.png",
  x.cut = 1, y.cut = 0.01,
  names = rownames(dataDEGs),
  color = c("black","red","dodgerblue3"),
  names.size = 2,
  show.names = "highlighted",
  highlight = c("SPINK4","INSL4"),
  xlab = " Gene expression fold change (Log2)",
  legend = "State",
  title = "Volcano plot (Normal NT vs Tumor TP)",
  width = 10
)

The figure generated by the code above is shown below:

FEA: Functional Enrichment Analysis

The function FEA performs a functional enrichment analysis using Fisher’s exact test, which identifies gene sets that are significantly enriched with differentially expressed genes. The steps of FEA involve (i) evaluating if DEGs are involved in a biologlical process(BP) through an assessment of the overlap between the list of DEGs and genes relevant to this BP determined by literature mining and (ii) detecting the BPs over-represented by DEGs using Fisher Exact Test. In the following, we considered BPs over-represented by genes with |Moonlight-score| >=1 and FDR <= 0.01. (See Methodology in https://www.biorxiv.org/content/10.1101/265322v1.article-info)

In each BP we defined a Moonlight Z-score as the ratio between the sum of all predicted effects (according literature mining) for all the gene involved in the specific BP and the square-root of the number of all genes.

data(DEGsmatrix, package = "MoonlightR")
dataFEA <- FEA(DEGsmatrix = DEGsmatrix)
dataFEA <- cbind(dataFEA, FDR = p.adjust(dataFEA$p.Value,method = "fdr") )
dataFEA_filt <- dataFEA[dataFEA$FDR < 0.01,]

The output can be visualized with a FEA plot (Functional Enrichment Analysis Plot)

The user can plot the result of a functional enrichment analysis using the function plotFEA. A negative z-score indicates that the process’ activity is decreased. A positive z-score indicates that the process’ activity is increased. Enrichment Analysis showing the BPs enriched significantly with |Moonlight Z-score| >=1 and FDR <= 0.01; increased levels are reported in yellow, decreased in purple, and green shows the -logFDR / 10. A negative Z-score indicates that the process’ activity is decreased while a positive Z-score indicates that the process’ activity is increased. Values in parentheses indicate the number of genes in common between the genes annotated in the biological process and the genes used as input for the functional enrichment.

plotFEA(
  dataFEA = dataFEA_filt,
  topBP = nrow(dataFEA_filt),
  additionalFilename = "_BRCA_DEGs", 
  height = 10, 
  width = 10
)

The figure generated by the above code is shown below:

GRN: Gene Regulatory Network

Next, the GRN function perform a gene regulatory network analysis to identify genes regulated by each DEG. In this function, considering only tumor samples, we calculate the pairwise mutual information between the DEGs and genes in dataFilt matrix (see previous section) . Afterwards, DEGs’ regulon, representing the genes regulated by a DEG, are defined by filtering out non-significant (p-value > 0.05) interactions using a permutation test (nboot = 100, nGenesPerm = 1000) and thus obtaining a set of regulated genes for each DEG.

# this will take a long time with nboot = 100 and nGenesPerm = 1000, so output
#   is saved in dataGRN. To make this toy example faster, we illustrate with a
#   smaller dataset and 10 permutations and 10 bootstrap samples 

system.time(
dataGRN <- GRN(
  TFs = rownames(DEGsmatrix)[1:100], 
  normCounts = dataFilt,
  nGenesPerm = 10,
  kNearest = 3,
  nBoot = 10
)
)
# 14.14 seconds for the first 100 with 10 perms and nBoot = 10
#  9.21 minutes for all 3390 with 10 perms and nBoot = 10

Rather than estimate this gene regulatory network yourself, a subset of this object is included in the Moonlight package (the first 9 genes, rather than all 3390).

data("dataGRN", package = "MoonlightR")

URA: Upstream Regulator Analysis

The user can also perform upstream regulator analysis using the function URA. This function is applied to each DEG in the enriched gene set and its neighbors in the GRN. This function returns a table yielding upstream regulators of BPs mediated by the DEG and its targets, providing a score between DEG-BP.


#BPselected <- dataFEA$Diseases.or.Functions.Annotation[1:5]
BPselected <- c("apoptosis","proliferation of cells")

dataURA <- URA(dataGRN = dataGRN,
               DEGsmatrix = DEGsmatrix,
               BPname = BPselected, 
               nCores = 1)
#> 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |======                                                           |  10%
  |                                                                       
  |=============                                                    |  20%
#> 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |======                                                           |  10%
  |                                                                       
  |=============                                                    |  20%
#> 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |======                                                           |  10%
  |                                                                       
  |=============                                                    |  20%
#> 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |======                                                           |  10%
  |                                                                       
  |=============                                                    |  20%
#> 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |======                                                           |  10%
  |                                                                       
  |=============                                                    |  20%
#> 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |======                                                           |  10%
  |                                                                       
  |=============                                                    |  20%
#> 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |======                                                           |  10%
  |                                                                       
  |=============                                                    |  20%
#> 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |======                                                           |  10%
  |                                                                       
  |=============                                                    |  20%
#> 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |======                                                           |  10%
  |                                                                       
  |=============                                                    |  20%
# 25.35 seconds

head(dataURA)
#>         apoptosis proliferation of cells
#> A2M    -0.1462545              3.8426388
#> AADAC   1.1338934              0.4866643
#> AATK   -0.5057217              4.6390383
#> ABAT    0.6928203              1.4925558
#> ABCA10 -0.8944272              1.5118579
#> ABCA12  1.5011107              0.3849002

PRA: Pattern Recognition Analysis

PRA allows for the identification of a list of TSGs and OCGs when a list of BPs such as apoptosis and proliferation are provided. Alternatively, we can also classify genes as OCGs or TSGs using a Random Forest based classifier trained on validated cancer driver genes (TSGs and OCGs). More specifically, we define a pattern for which the group of genes classified as OCGs share similar BP as apoptosis (DOWN) and proliferation (UP) while genes classified as TSGs share apoptosis (UP) and proliferation (DOWN). Once we have a table of gene scores for each BP from URA, we can incorporate the prior biological information to filter out genes with discordant biological effects. For example, it is known that apoptosis and proliferation of cells are opposite biological effects in tumor growth, so the A2M gene (with positive scores in both BPs) should be excluded. Pattern Recognition Analysis (PRA) combines information from the BPs with the output from URA, as shown in the next example:

The input of the PRA’s function is the output from URA’s function and the list of biological processes(BPs) selected. The output is shown in the following example. The Moonlight package includes this URA output for the first 100 genes (alphabetically).

data(dataURA)

dataDual <- PRA(
  dataURA = dataURA,
  BPname = c("apoptosis","proliferation of cells"),
  thres.role = 0
)

# Tumour-suppressors
dataDual$TSG 
#>     ACN9    ADCK5  AFAP1L1 
#> 1.123390 1.126988 1.112131

# Oncogenes
head(dataDual$OCG)
#>    ABCA3    ABCG1    ABCG2    ABHD6    ABTB1    ACADL 
#> 3.382539 1.108825 3.306286 1.652131 2.769927 2.895305

These scores are computed as average of absolute values from scores of each BP.

Also, the user can plot the result of a Upstream Regulatory Analysis using the function plotURA.

TSGs_genes <- names(dataDual$TSG)
OCGs_genes <- names(dataDual$OCG)

plotURA(dataURA = dataURA[c(TSGs_genes, OCGs_genes), ])
#> png 
#>   2

The figure generated by the above code is shown below. This heatmap shows Moonlight Z-score for upstream regulators in BRCA. Row colors: red indicates increasing and blue decreasing Moonlight Z-scores. OCGs, such as ABCA3, ABCG1, and ABCG2, share similar BP as apoptosis (DOWN) and proliferation (UP); Genes classified as TSGs (ACN9, ADCK5, AFAP1L1) share apoptosis (UP) and proliferation (DOWN).

The table to show the result from PRA and URA is shown below:

Table 2: Example of a dataURA output with TSGs and OCGs patterns
apoptosis proliferation of cells
ACN9 1.70 -0.55
ADCK5 1.82 -0.43
AFAP1L1 1.35 -0.87
ABCA3 -1.02 5.75
ABCG1 -0.61 1.61
ABCG2 -1.21 5.40

Moonlight PanCancer Study (18 tumor types)

The goal of this study was to characterize intertumoral (between tumors) genomic epigenomic and transcriptomic heterogeneity in cancer tissue by identifying dual role genes. More specifically, we used the expression levels of genes in all the samples obtained from TCGA - GDC with IlluminaHiSeq RNASeqV2 in 18 normal tissues and 18 cancer tissues according to the TCGA criteria.We applied the complete Moonlight pipeline (FEA-URA-PRA) to extract those genes that were significantly increasing or decreasing biological processes such as proliferation of cells and apoptosis. The figure below shows the results of the moonlight pipeline that can be visualized with a circos plot, using the function plotCircos from . For further examples see the ’s vignette http://bioconductor.org/packages/release/bioc/html/MoonlightR.html The figure mentioned above is shown below:

Immune subtypes from Thorsson et al. (2018, PMID: 29628290)

Thorsson et al. performed an extensive immunogenomic analysis of more than 10,000 tumors comprising 33 diverse cancer types by utilizing data compiled by TCGA. In this section we illustrate identifying Immune subtypes for TCGA and GEO tumor samples. Figure1C is from Thorsson et al. (2018, PMID: 29628290), which summarizes features of six different immune subtypes.

Across cancer types, we previously identified six immune subtypes—wound healing, IFN-dominant, inflammatory, lymphocyte depleted, immunologically quiet, and TGF-dominant, which are characterized by differences in macrophage or lymphocyte signatures, Th1:Th2 cell ratio, extent of intratumoral heterogeneity, aneuploidy, extent of neoantigen load, overall cell proliferation, expression of immunomodulatory genes, and prognosis.

Survival analysis of TCGA KIRC samples with Immune Subtypes

In the following example we first retrieve the TCGA Immune Subtypes for KIRC (Kidney Renal Clear Cell Carcinoma) from Table S1 in Thorsson et al. (2018, PMID: 29628290), next we will run a Kaplan-Meier Overall Survival Analysis for these different Immune Subtypes.


download.file(
  url = "https://ars.els-cdn.com/content/image/1-s2.0-S1074761318301213-mmc2.xlsx",
  destfile = "X1_s2_0_S1074761318301213_mmc2"
)
X1_s2_0_S1074761318301213_mmc2 <- read_excel("X1_s2_0_S1074761318301213_mmc2")
X1_s2_0_S1074761318301213_mmc2 <- as.data.frame(X1_s2_0_S1074761318301213_mmc2)

CancerType <- c("KIRC")

# a table of subtype info for each TCGA's sample
ImmuneSubtypes <- as.data.frame(X1_s2_0_S1074761318301213_mmc2)

ImmuneSubtypes <- ImmuneSubtypes[ImmuneSubtypes$`TCGA Study` %in% CancerType,]
ImmuneSubtypes <- ImmuneSubtypes[ImmuneSubtypes$`Immune Subtype`!="NA",]

dataClin_KIRC <- GDCquery_clinic(project = "TCGA-KIRC",type = "clinical")

dataClin_merged <- merge(
  x = dataClin_KIRC,
  y = ImmuneSubtypes,
  by.x = "submitter_id",
  by.y = "TCGA Participant Barcode"
)


p <- TCGAanalyze_survival(
  dataClin_merged,
  clusterCol = "Immune Subtype",
  conf.int = FALSE,
  main = "TCGA KIRC Immune Subtypes",
  height = 10, width = 10,
  risk.table = TRUE,
  filename = NULL
)

p <- p$plot +
  theme(legend.position = "right")

ggsave(p,
  filename = "figs/TCGA_KIRC_ImmuneSubtypes_survival.png",
  width = 10, height = 6
)

The result from the Immune subtypes -survival association for TCGA KIRC samples is shown below:

Immune Subtypes for GEO samples (KIRC example) using data from (Jones et al., 2005 PMID: 16115910)

We illustrate how to predict different immune subtypes for different samples.


dataGEO_KIRC<- getDataGEO(
  GEOobject = "GSE15641",
  platform =  "GPL96"
)


GSE15641 <- as.data.frame(exprs(dataGEO_KIRC))
GSE15641_non_norm <- cbind(ILMN = rownames(GSE15641),
                           IDmean = rowMeans(GSE15641),
                           GSE15641)

GSE15641_annot <- fData(dataGEO_KIRC)
GSE15641_annot <- as.data.frame(GSE15641_annot)
GSE15641_annot <- subset(GSE15641_annot,
                              select = c("ID","Gene.symbol"))


dataGEO_samples <- pData(dataGEO_KIRC)
dataGEO_samples <- as.data.frame(dataGEO_samples)
dataGEO_samples <- subset(dataGEO_samples,
                            select = c("geo_accession","source_name_ch1"))
colnames(dataGEO_samples)[2] <- "CellType"

dataGEO_samples <- dataGEO_samples[dataGEO_samples$CellType %in% "cancerous human kidney tissue, clear cell RCC",]

GSE15641_merge <- merge(x = GSE15641_annot,
                        y = GSE15641_non_norm,
                        by.x = "ID",
                        by.y = "ILMN")


GSE15641_merge <- GSE15641_merge[order(GSE15641_merge$IDmean,decreasing = TRUE),]
GSE15641_merge <- GSE15641_merge[!duplicated(GSE15641_merge$Gene.symbol),]
GSE15641_Matrix <- GSE15641_merge
rownames(GSE15641_Matrix) <- GSE15641_Matrix$Gene.symbol

GSE15641_Matrix <- GSE15641_Matrix[,dataGEO_samples$geo_accession]
save(GSE15641_Matrix, file = "Jones_GSE15641.Rdata")

# https://github.com/torongs82/Data/blob/master/GEO/Jones_GSE15641.Rdata
load("Jones_GSE15641.Rdata")

load("ImmuneMW.rda") # from data folder
dataImmuneSubtype <- TCGAanalyze_ImmuneSubtypes(ImmuneMW = ImmuneMW,
                                                dataGE = GSE15641_Matrix)

Stemness Analysis

Stemness overview

Stemness, defined as the potential for self-renewal and differentiation from the cell of origin, was originally attributed to normal stem cells that possess the ability to give rise to all cell types in the adult organism. Cancer progression involves gradual loss of a differentiated phenotype and acquisition of progenitor-like, stem-cell-like features. Undifferentiated primary tumors are more likely to result in cancer cell spread to distant organs, causing disease progression and poor prognosis, particularly because metastases are usually resistant to available therapies The Overall methodology for Development and Validation of the Stemness Indices from Malta et al. (2018, PMID: 29625051) is shown below:

First, we defined a stemness signatures to quantify stemness using publicly available molecular profiles from normal cell types that exhibit various degrees of stemness. More specifically, we derived stemness signatures using an innovative one-class logistic regression (OCLR) machine-learning algorithm (Sokolov et al., 2016, PMID: 26776204) trained on stem cell (ESC, embryonic stemcell; iPSC, induced pluripotent stem cell) classes from NHLBI Progenitor Cell Biology Consortium (PCBC) dataset (Salomonis et al. 2018, PMID: 27293150). In this way we obtained the stemness signature contained in the vector PCBC_stemSig in the , where for each gene we have an estimted weight, based on output from the OCLR’s algorithm.

Next, once the signature is obtained, it can be applied to score new samples. For RNA expression data, we computed Spearman correlations between the model’s weight vector PCBC_stemSig and the new sample’s expression profile. These are the stemness indices for gene expression(mRNAsi) , with scores ranging from 0 to 1, where 1 corresponds to more stemness, less differentiated and 0 corresponds to less stemness but more differentiated.

Stemness Highlights:

  1. Epigenetic and expression-based stemness indices measure oncogenic dedifferentiation
  2. Immune microenvironment content and PD-L1 levels associate with stemness indices
  3. Stemness index is increased in metastatic tumors and reveals intratumor heterogeneity
  4. Applying stemness indices reveals potential drug targets for anti-cancer therapies

In the following subsection we will illustrate the methodology used in Malta et al. (2018, PMID: 29625051) for (i) validating the stemness signature trained on PCBC’s dataset in two external datasets (Nazor, IHEC) (ii) deriving the stemness score for TCGA BRCA and LGG-GBM samples, with association to molecular subtypes and survival outcomes.

Validation of Stemness signature in external dataset [GEO] Nazor et al. (2012, PMID: 22560082)

# Firstly we have previously prepared the Gene Expression matrix (genes in rows,
#   samples in columns). See previous section or download from the following
#   GitHub link:
# https://github.com/torongs82/Data/blob/master/GEO/Nazor_GSE30652.Rdata
              
load("./Nazor_GSE30652.Rdata")
TCGA_mRNA_StemScoreTable <- TCGAanalyze_Stemness(stemSig = PCBC_stemSig,
                                          dataGE = NazorMatrix)


tab_mRNASi <- TCGA_mRNA_StemScoreTable

tab_mRNASi_merged <- merge(x = tab_mRNASi,
                    y = dataNazor_samples,
                    by.x = "Sample",
                    by.y = "geo_accession")
require(ggplot2)
require(ggpubr)

colnames(tab_mRNASi_merged)[3] <- "mRNASi"

tab_mRNASi_merged[tab_mRNASi_merged$CellType %in% "embryonic stem cell","CellType"] <- "ES"
tab_mRNASi_merged[tab_mRNASi_merged$CellType %in% "induced pluripotent stem cell","CellType"] <- "IPS"
tab_mRNASi_merged[tab_mRNASi_merged$CellType %in% "parthenogentic embryonic stem cell","CellType"] <- "Parthenogentic"
tab_mRNASi_merged[tab_mRNASi_merged$CellType %in% "Somatic.Primary","CellType"] <- "Primary"
tab_mRNASi_merged[tab_mRNASi_merged$CellType %in% "Somatic.Tissue","CellType"] <- "Tissue"

tab_mRNASi_merged <- tab_mRNASi_merged[order(tab_mRNASi_merged$mRNASi, decreasing = FALSE),]

library(forcats)
p <- ggplot(data=tab_mRNASi_merged, aes(x=Sample, y=mRNASi, fill = CellType)) +
    geom_bar(stat="identity")+
    scale_colour_gradient2()+
    #coord_flip()+
   # ylim(0, 15)+
    #scale_x_discrete(limits = df1$Tissue)+
    theme_classic() +
    theme(axis.title.x=element_blank(),
                          axis.text.x=element_blank(),
                          axis.ticks.x=element_blank())+
    scale_fill_manual("legend", values = c("Primary" = "orange",
                                           "Parthenogentic" = "blue",
                                           "Tissue" = "darkgreen",
                                           "ES" = "red",
                                           "IPS" = "pink"))
p <- p + theme(legend.position="bottom")

p <- p + aes(x = fct_inorder(Sample))

ggsave(p,
  file = "figs/Validation_mRNASi_Nazor.png",
  width = 6, height = 6
)

The result from TCGAanalyze_Stemness validation in Nazor’s Dataset is shown below. Embryonic stem cell (ES) in red and Induced pluripotent stem cell (IPS) in pink showed higher stemness score compared to other more mature classes such as Primary in orange and Tissue in darkgreen.

Validation of Stemness signature in external dataset [IHEC] Stunnenberg et al. (2016, PMID: 27863232)

# Firstly we have previously prepared the IHEC Gene Expression matrix (genes in rows , samples in columns) # see previous section or download from the following github link
# https://github.com/torongs82/Data/blob/master/IHEC/IHEC_BLUEPRINT_geneExpr.Rdata

load("./IHEC_BLUEPRINT_geneExpr.Rdata")

library(stringr)
library(biomaRt)

IHECMatrix <- IHEC_genes_matrix

IHEC_annot <- IHEC_sample_names
colnames(IHEC_annot) <- c("Diffname_short","X1")
IHEC_annot <- as.data.frame(IHEC_annot)


ensemblsIDS <- str_split_fixed(rownames(IHECMatrix),"\\.",2)[,1]
rownames(IHECMatrix)<- ensemblsIDS

IHEC_annot_common <- IHEC_annot[IHEC_annot$X1 %in% colnames(IHECMatrix),]


# mapping Ensembl IDs to gene Symbol

library(clusterProfiler)
library(org.Hs.eg.db)
gene.df <- bitr(ensemblsIDS, fromType = "ENSEMBL",
                toType = c( "ENTREZID", "SYMBOL"),
                OrgDb = org.Hs.eg.db)


IHECMatrix_sel <- IHECMatrix[gene.df$ENSEMBL,]
rownames(IHECMatrix_sel) <-gene.df$SYMBOL
IHECMatrix_sel <- as.data.frame(IHECMatrix_sel)

IHECMatrix_sel_filt <- TCGAanalyze_Filtering(tabDF = IHECMatrix_sel,
                                             method = "quantile",
                                             qnt.cut = 0.25)

TCGA_mRNA_StemIHEC <- TCGAanalyze_Stemness(stemSig = PCBC_stemSig,
                                                 dataGE = IHECMatrix_sel_filt)

colnames(TCGA_mRNA_StemIHEC)[3] <- "mRNASi"
colnames(IHEC_annot)[1] <- "CellType"
colnames(IHEC_annot)[2] <- "Sample"
TCGA_mRNA_StemIHEC_merge<- merge(x = TCGA_mRNA_StemIHEC,
                                 y = IHEC_annot,
                                 by.x = "Sample",
                                 by.y = "Sample")
tab_mRNASi_merged <- TCGA_mRNA_StemIHEC_merge

CellType <- table(tab_mRNASi_merged$CellType)



CellType_filt <- CellType[CellType > 5]

tab_mRNASi_merged_filt <- tab_mRNASi_merged[tab_mRNASi_merged$CellType %in% names(CellType_filt), ]
tab_mRNASi_merged <- tab_mRNASi_merged_filt


tab_mRNASi_merged <- tab_mRNASi_merged[order(tab_mRNASi_merged$mRNASi, decreasing = FALSE),]

p <- ggplot(data=tab_mRNASi_merged, aes(x=Sample, y=mRNASi, fill = CellType)) +
    geom_bar(stat="identity")+
    scale_colour_gradient2()+
    #coord_flip()+
    # ylim(0, 15)+
    #scale_x_discrete(limits = df1$Tissue)+
    theme_classic() +
    theme(axis.title.x=element_blank(),
          axis.text.x=element_blank(),
          axis.ticks.x=element_blank())
p <- p + theme(legend.position="right")

p <- p + aes(x = fct_inorder(Sample))
p <- p + theme(text = element_text(size=14))

ggsave(p,
  file = "figs/Validation_mRNASi_IHEC.png",
  width = 12, height = 6
)

The result from TCGAanalyze_Stemness validation in IHEC’s Dataset is shown below. In particular we observed that classes more pluripotent such as hematopoietic stem cell, erythroblast showed higher Stemness score whereas classes more differentiate and matures such as monocyte, mature neutrophil and macrophage showed lower Stemness score.

Generating Stemness score for TCGA BRCA gene expression data (mRNASi). Malta et al. (2018, PMID: 29625051)

# Firstly we load Gene Expression matrix (genes in rows , samples in columns) 
# from a previous generated pancancer gene expression matrix

load("./dataFilt_panCancer33new.Rdata")

curCancer <- "BRCA"
# We have previously generated a table with 33 cancer types barcodes and molecular subtypes.

dataSubt_PanCancer <- PanCancerAtlas_subtypes()

# Selecting TCGA breast cancer

dataSubt_curCancer <- dataSubt_PanCancer[dataSubt_PanCancer$cancer.type %in% "BRCA",]

commonSamples <- intersect(dataSubt_curCancer$pan.samplesID, colnames(dataFilt))

dataFilt_curCancer <- dataFilt[,commonSamples]
load("./PCBC_stemSig.rda")

TCGA_mRNA_StemScoreTable <- TCGAanalyze_Stemness(stemSig = PCBC_stemSig,
                                          dataGE = dataFilt_curCancer)

colnames(TCGA_mRNA_StemScoreTable)[1] <- "barcode"

sampleNT <- TCGAquery_SampleTypes(barcode = TCGA_mRNA_StemScoreTable$barcode,typesample = "NT")
sampleTP <- TCGAquery_SampleTypes(barcode = TCGA_mRNA_StemScoreTable$barcode,typesample = "TP")
#sampleTM <- TCGAquery_SampleTypes(barcode = TCGA_mRNA_StemScoreTable$barcode,typesample = "TM")
#sampleTAM <- TCGAquery_SampleTypes(barcode = TCGA_mRNA_StemScoreTable$barcode,typesample = "TAM")
#sampleTB <- TCGAquery_SampleTypes(barcode = TCGA_mRNA_StemScoreTable$barcode,typesample = "TB")

rownames(TCGA_mRNA_StemScoreTable) <- TCGA_mRNA_StemScoreTable$barcode

TCGA_mRNA_StemScoreTable[sampleNT,"SampleType"] <- "NT"
TCGA_mRNA_StemScoreTable[sampleTP,"SampleType"] <- "TP"
#TCGA_mRNA_StemScoreTable[sampleTM,"SampleType"] <- "TM"
#TCGA_mRNA_StemScoreTable[sampleTAM,"SampleType"] <- "TM"
#TCGA_mRNA_StemScoreTable[sampleTB,"SampleType"] <- "TB"

tab_mRNASi <- TCGA_mRNA_StemScoreTable

tab_mRNASi_merged <- merge(x = tab_mRNASi,
                    y = dataSubt_PanCancer,
                    by.x = "barcode",
                    by.y = "pan.samplesID")

colnames(tab_mRNASi_merged)[3] <- "mRNASi"



p<-ggplot(tab_mRNASi_merged, aes(x=SampleType, y=mRNASi, fill=SampleType))
p <- p + theme_classic()

p <- p +  theme(legend.position="none")
p <- p + rotate_x_text(45)
p <- p + geom_jitter(shape=16, position=position_jitter(0.2), color = "black")
p <- p + geom_boxplot(position=position_dodge(1),outlier.colour = NA)
p <- p + theme(text = element_text(size=20))

ggsave(p,
  filename = "figs/TCGA_BRCA_mRNASi_TP_NT.png",
  width = 5, height = 6
)


tab_mRNASi_merged_TP <- tab_mRNASi_merged[tab_mRNASi_merged$SampleType %in% "TP",]

p <- ggplot(
  tab_mRNASi_merged_TP,
  aes(x = Subtype_Selected, y = mRNASi, fill = Subtype_Selected)
)

p <- p + 
  theme_classic() + 
  theme(legend.position="none") +
  theme(text = element_text(size = 20)) +
  rotate_x_text(45) +
  geom_jitter(shape = 16, position = position_jitter(0.2), color = "black") +
  geom_boxplot(position = position_dodge(1), outlier.colour = NA)

ggsave(p,
  filename = "figs/TCGA_BRCA_mRNASi_subtypes.png",
  width = 5, height = 6
)

The result from TCGAanalyze_Stemness is shown below. Each point is a sample where Solid Tissue Normal (NT) are in red and Primary Tumor (TP) are in blue.

The result from TCGAanalyze_Stemness with molecular subtypes is shown below:

Generating Stemness score for TCGA LGG and GBM gene expression data (mRNASi). Malta et al. (2018, PMID: 29625051)


# working with LGG and GBM
curCancer <- c("LGG","GBM")
# We have previously generated a table with 33 cancer types barcodes and molecular subtypes.
dataSubt_PanCancer <- PanCancerAtlas_subtypes()

# Selecting TCGA breast cancer

dataSubt_curCancer <- dataSubt_PanCancer[dataSubt_PanCancer$cancer.type %in% curCancer,]

sampleCurCancer <- colnames(dataFilt)
sampleCurCancer <- sampleCurCancer[substr(sampleCurCancer,1,12) %in% dataSubt_curCancer$pan.samplesID]

dataFilt_curCancer <- dataFilt[,sampleCurCancer]
load("./data/PCBC_stemSig.rda")

TCGA_mRNA_StemScoreTable <- TCGAanalyze_Stemness(stemSig = PCBC_stemSig,
                                                 dataGE = dataFilt_curCancer)

colnames(TCGA_mRNA_StemScoreTable)[1] <- "barcode"


tab_mRNASi <- TCGA_mRNA_StemScoreTable
tab_mRNASi <- cbind(barcode12 = substr(tab_mRNASi$barcode,1,12),
                    tab_mRNASi)
tab_mRNASi$barcode12 <- as.character(tab_mRNASi$barcode12)


tab_mRNASi_merged <- merge(x = tab_mRNASi,
                           y = dataSubt_PanCancer,
                           by.x = "barcode12",
                           by.y = "pan.samplesID")

colnames(tab_mRNASi_merged)[4] <- "mRNASi"

p <- ggplot(tab_mRNASi_merged) +
  aes(x = Subtype_Selected, y = mRNASi, fill = Subtype_Selected) +
  theme_classic() +
  theme(legend.position = "none") +
  theme(text = element_text(size = 20)) +
  rotate_x_text(45) +
  geom_jitter(shape = 16, position = position_jitter(0.2), color = "black") +
  geom_boxplot(position = position_dodge(1), outlier.colour = NA)

ggsave(p,
  filename = "figs/TCGA_LGG_GBM_mRNASi_subtypes.png",
  width = 5, height = 6
)

The result from TCGAanalyze_Stemness for LGG and GBM is shown below:

Stemness index is associated with overall survival (Glioma example)


#working with survival glioma

tab_mRNASi_merged <- cbind(mRNASi_level = rep("mRNASi mod",nrow(tab_mRNASi_merged)),
                           tab_mRNASi_merged)

tabSi<- tab_mRNASi_merged
tabSi$mRNASi_level <- as.character(tabSi$mRNASi_level)

tabSi[tabSi$mRNASi < quantile(tabSi$mRNASi,1/3),"mRNASi_level"] <- "mRNASi Low"
tabSi[tabSi$mRNASi > quantile(tabSi$mRNASi,2/3),"mRNASi_level"] <- "mRNASi High"

dataClin_LGG <- GDCquery_clinic(project = "TCGA-LGG",type = "clinical")
dataClin_GBM <- GDCquery_clinic(project = "TCGA-GBM",type = "clinical")

dataClin_LGG_GBM <- rbind(dataClin_LGG,dataClin_GBM)

dataClin_LGG_GBM <- dataClin_LGG_GBM[dataClin_LGG_GBM$submitter_id %in% tabSi$barcode12,]

dataClin_merged <- merge(x = dataClin_LGG_GBM,
                         y = tabSi,
                         by.x = "submitter_id",
                         by.y = "barcode12")




p <- TCGAanalyze_survival(dataClin_merged,
                          clusterCol = "mRNASi_level",
                          conf.int = FALSE,
                          main = "TCGA LGG GBM mRNASi", 
                          height = 10,
                          width=10,
                          risk.table = TRUE,
                          filename = NULL)

p <- p$plot +
  theme(legend.position = "bottom")

ggsave(p,
  filename = "figs/TCGA_LGG_GBM_mRNASi_survival.png",
  width = 5, height = 6
)

The result from the Stemness-survival association for LGG and GBM is shown below:

Session Information


sessionInfo()
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 9 (stretch)
#> 
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib/libopenblasp-r0.2.19.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] DeepBlueR_1.11.0              RCurl_1.95-4.12              
#>  [3] bitops_1.0-6                  XML_3.98-1.20                
#>  [5] Bioc2019PanCancerStudy_0.99.0 dplyr_0.8.1                  
#>  [7] SummarizedExperiment_1.15.3   DelayedArray_0.11.2          
#>  [9] BiocParallel_1.19.0           matrixStats_0.54.0           
#> [11] Biobase_2.45.0                GenomicRanges_1.37.12        
#> [13] GenomeInfoDb_1.21.1           IRanges_2.19.10              
#> [15] S4Vectors_0.23.13             BiocGenerics_0.31.4          
#> [17] MoonlightR_1.11.0             doParallel_1.0.14            
#> [19] iterators_1.0.10              foreach_1.4.4                
#> [21] TCGAbiolinks_2.13.2           knitr_1.23                   
#> [23] BiocManager_1.30.4           
#> 
#> loaded via a namespace (and not attached):
#>   [1] R.utils_2.9.0               tidyselect_0.2.5           
#>   [3] RSQLite_2.1.1               AnnotationDbi_1.47.0       
#>   [5] htmlwidgets_1.3             grid_3.6.0                 
#>   [7] DESeq_1.37.0                munsell_0.5.0              
#>   [9] codetools_0.2-16            miniUI_0.1.1.1             
#>  [11] withr_2.1.2                 colorspace_1.4-1           
#>  [13] GOSemSim_2.11.0             highr_0.8                  
#>  [15] DOSE_3.11.0                 urltools_1.7.3             
#>  [17] GenomeInfoDbData_1.2.1      hwriter_1.3.2              
#>  [19] KMsurv_0.1-5                polyclip_1.10-0            
#>  [21] diffr_0.1                   bit64_0.9-7                
#>  [23] farver_1.1.0                downloader_0.4             
#>  [25] generics_0.0.2              xfun_0.7                   
#>  [27] ggthemes_4.2.0              randomForest_4.6-14        
#>  [29] EDASeq_2.19.0               R6_2.4.0                   
#>  [31] clue_0.3-57                 locfit_1.5-9.1             
#>  [33] manipulateWidget_0.10.0     gridGraphics_0.4-1         
#>  [35] fgsea_1.11.0                assertthat_0.2.1           
#>  [37] promises_1.0.1              scales_1.0.0               
#>  [39] ggraph_1.0.2                enrichplot_1.5.0           
#>  [41] gtable_0.3.0                HiveR_0.3.42               
#>  [43] sva_3.33.1                  rlang_0.3.4                
#>  [45] genefilter_1.67.1           cmprsk_2.2-8               
#>  [47] GlobalOptions_0.1.0         splines_3.6.0              
#>  [49] rtracklayer_1.45.1          lazyeval_0.2.2             
#>  [51] GEOquery_2.53.0             selectr_0.4-1              
#>  [53] europepmc_0.3               broom_0.5.2                
#>  [55] rgl_0.100.19                yaml_2.2.0                 
#>  [57] reshape2_1.4.3              GenomicFeatures_1.37.1     
#>  [59] crosstalk_1.0.0             backports_1.1.4            
#>  [61] httpuv_1.5.1                qvalue_2.17.0              
#>  [63] clusterProfiler_3.13.0      tools_3.6.0                
#>  [65] tcltk_3.6.0                 bookdown_0.11.1            
#>  [67] ggplotify_0.0.3             ggplot2_3.2.0              
#>  [69] gplots_3.0.1.1              RColorBrewer_1.1-2         
#>  [71] ggridges_0.5.1              Rcpp_1.0.1                 
#>  [73] plyr_1.8.4                  progress_1.2.2             
#>  [75] zlibbioc_1.31.0             purrr_0.3.2                
#>  [77] prettyunits_1.0.2           ggpubr_0.2                 
#>  [79] GetoptLong_0.1.7            viridis_0.5.1              
#>  [81] cowplot_0.9.4               zoo_1.8-6                  
#>  [83] ggrepel_0.8.1               cluster_2.0.8              
#>  [85] magrittr_1.5                data.table_1.12.2          
#>  [87] blogdown_0.13.1             DO.db_2.9                  
#>  [89] circlize_0.4.6              triebeard_0.3.0            
#>  [91] survminer_0.4.4             aroma.light_3.15.0         
#>  [93] hms_0.4.2                   mime_0.7                   
#>  [95] evaluate_0.14               xtable_1.8-4               
#>  [97] jpeg_0.1-8                  gridExtra_2.3              
#>  [99] shape_1.4.4                 compiler_3.6.0             
#> [101] biomaRt_2.41.3              tibble_2.1.3               
#> [103] KernSmooth_2.23-15          crayon_1.3.4               
#> [105] R.oo_1.22.0                 htmltools_0.3.6            
#> [107] mgcv_1.8-28                 later_0.8.0                
#> [109] tidyr_0.8.3                 geneplotter_1.63.0         
#> [111] filehash_2.4-2              DBI_1.0.0                  
#> [113] tweenr_1.0.1                matlab_1.0.2               
#> [115] ComplexHeatmap_2.1.0        MASS_7.3-51.4              
#> [117] ShortRead_1.43.0            Matrix_1.2-17              
#> [119] readr_1.3.1                 parmigene_1.0.2            
#> [121] gdata_2.18.0                R.methodsS3_1.7.1          
#> [123] igraph_1.2.4.1              pkgconfig_2.0.2            
#> [125] km.ci_0.5-2                 rvcheck_0.1.3              
#> [127] GenomicAlignments_1.21.2    RISmed_2.1.7               
#> [129] xml2_1.2.0                  annotate_1.63.0            
#> [131] webshot_0.5.1               XVector_0.25.0             
#> [133] rvest_0.3.4                 settings_0.2.4             
#> [135] stringr_1.4.0               tkrgl_0.8                  
#> [137] digest_0.6.19               ConsensusClusterPlus_1.49.0
#> [139] Biostrings_2.53.0           rmarkdown_1.13             
#> [141] fastmatch_1.1-0             survMisc_0.5.5             
#> [143] edgeR_3.27.5                gtools_3.8.1               
#> [145] shiny_1.3.2                 Rsamtools_2.1.2            
#> [147] rjson_0.2.20                nlme_3.1-139               
#> [149] jsonlite_1.6                viridisLite_0.3.0          
#> [151] limma_3.41.5                pillar_1.4.1               
#> [153] lattice_0.20-38             httr_1.4.0                 
#> [155] survival_2.44-1.1           GO.db_3.8.2                
#> [157] glue_1.3.1                  UpSetR_1.4.0               
#> [159] png_0.1-7                   bit_1.1-14                 
#> [161] ggforce_0.2.2               stringi_1.4.3              
#> [163] blob_1.1.1                  caTools_1.17.1.2           
#> [165] latticeExtra_0.6-28         memoise_1.1.0

  1. Department of Public Health Sciences, Miller School of Medicine, The University of Miami, Miami, FL

  2. Department of Public Health Sciences, Miller School of Medicine, The University of Miami, Miami, FL

  3. Department of Public Health Sciences, Miller School of Medicine; Dr. McDonald Foundation Department of Human Genetics; Sylvester Comprehensive Cancer Center; The University of Miami, Miami, FL

  4. Department of Public Health Sciences, Miller School of Medicine; Sylvester Comprehensive Cancer Center; The University of Miami, Miami, FL