CONTENTS

1. Introduction

Recent advances in next-generation sequencing, microarrays and mass spectrometry for omics data production have enabled the generation and collection of different modalities of high-dimensional molecular data1. Clustering multi-omic data has the potential to reveal further systems-level insights, but raises computational and biological challenges2. This vignette aims to show how to use the package named MOVICS to perform multi-omics integrative clustering and visualization for cancer subtyping researches. This R package provides a unified interface for 10 state-of-the-art multi-omics clustering algorithms, and standardizes the output for each algorithm so as to form a pipeline for downstream analyses. Ten algorithms are CIMLR, iClusterBayes, MoCluster, COCA, ConsensusClustering, IntNMF, LRAcluster, NEMO, PINSPlus, and SNF where the former three methods can also perform the process of feature selection. For cancer subtyping studies, MOVICS also forms a pipeline for most commonly used downstream analyses for further subtype characterization and creates editable publication-quality illustrations (see more details below). Please note that MOVICS currently supports up to 6 omics data for jointly clustering and users must provide at least 2 omics datasets as input. Okay then, let us make our hands dirty to figure out how MOVICS works.

2. Installation

It is essential that you have R 4.0.1 or above already installed on your computer or server. MOVICS is a pipeline that utilizes many other R packages that are currently available from CRAN, Bioconductor and GitHub. For all of the steps of the pipeline to work, make sure that you have upgraded Bioconductor to newest version (BiocManager v3.11). After you have R and Bioconductor installed properly, you can start to install MOVICS. The easiest way to install it is by typing the following code into your R session:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
if (!require("devtools")) 
    install.packages("devtools")
devtools::install_github("xlucpu/MOVICS")

When you are installing MOVICS, you may encounter some errors saying that some packages are not installed. These errors are caused by recursively depending on R packages, so if one package was not installed properly on your computer, MOVICS would fail. To solve these errors, you simply need to check those error messages, find out which packages required are missing, then install it with command BiocManager::install("YourErrorPackage") or install.packages("YourErrorPackage") or devtools::install_github("username/YourErrorPackage") directly. After that, retry installing MOVICS, it may take several times, but eventually it should work. Or, we highly suggest that you referred to the Imports in the DESCRIPTION file, try to install all the R dependencies, and then install MOVICS. We also summarized several common problems that you may meet when installing MOVICS, please refer to Troubleshooting if applicable. After installation, you should be able to load the MOVICS package in your R session:

library("MOVICS")

3. Real Data Scenario

This package contains two pre-processed real datasets of breast cancer. One dataset is brca.tcga.RData which is a list that includes 643 samples with four complete omics data types of breast cancer retrieved from TCGA-BRCA cohort3 (i.e., mRNA expression, lncRNA expression, DNA methylation profiling and somatic mutation matrix), and corresponding clinicopathological information (i.e., age, pathological stage, PAM50 subtype, vital status and overall suvival time); such data list also contains corresponding RNA-Seq raw count table and Fragments Per Kilobase Million (FPKM) data in order to test the functions for downstream analyses (e.g., differential expression analysis, drug sensitivity analysis, etc). The other one, brca.yau.RData, is an external validation dataset which contains gene expression profiles and clinicopathological information that downloaded from BRCA-YAU cohort4 with 682 samples (one sample without annotation of PAM50 subtype was removed), which can be used to test the predictive functions available in MOVICS. These two datasets can be loaded like below:

# load example data of breast cancer
load(system.file("extdata", "brca.tcga.RData", package = "MOVICS", mustWork = TRUE))
load(system.file("extdata", "brca.yau.RData",  package = "MOVICS", mustWork = TRUE))

Since in most cases, multi-omics clustering procedure is rather time-consuming, the TCGA-BRCA dataset was therefore first pre-processed to extract top 500 mRNAs, 500 lncRNA, 1,000 promoter CGI probes/genes with high variation using statistics of median absolute deviation (MAD), and 30 genes that mutated in at least 3% of the entire cohort. All these features are used in the following clustering analyses.

4. MOVICS Pipeline

4.1 Pipeline Introduction

cloud

MOVICS Pipeline diagram above outlines the concept for this package and such pipeline comprises separate functions that can be run individually (see details about argument list by typing ?function_name() in R session). In above graphical abstract, all functions of MOVICS are included, which are categorized into three modules using three colors:

  • GET Module: get subtypes through multi-omics integrative clustering
    • getElites(): get elites which are those features that pass the filtering procedure and are used for analyses
    • getClustNum(): get optimal cluster number by calculating clustering prediction index (CPI) and Gap-statistics
    • get%algorithm_name%(): get results from one specific multi-omics integrative clustering algorithm with detailed parameters
    • getMOIC(): get a list of results from multiple multi-omics integrative clustering algorithm with parameters by default
    • getConsensusMOIC(): get a consensus matrix that indicates the clustering robustness across different clustering algorithms and generate a consensus heatmap
    • getSilhouette(): get quantification of sample similarity using silhoutte score approach
    • getStdiz(): get a standardized data for generating comprehensive multi-omics heatmap
    • getMoHeatmap(): get a comprehensive multi-omics heatmap based on clustering results

  • COMP Module: compare subtypes from multiple perspectives
    • compSurv(): compare survival outcome and generate a Kalan-Meier curve with pairwise comparison if possible
    • compClinvar(): compare and summarize clinical features among different identified subtypes
    • compMut(): compare mutational frequency and generate an OncoPrint with significant mutations
    • compTMB(): compare total mutation burden among subtypes and generate distribution of Transitions and Transversions
    • compFGA(): compare fraction genome altered among subtypes and generate a barplot for distribution comparison
    • compDrugsen(): compare estimated half maximal inhibitory concentration (\(IC_{50}\)) for drug sensitivity and generate a boxviolin for distribution comparison
    • compAgree(): compare agreement of current subtypes with other pre-existed classifications and generate an alluvial diagram and an agreement barplot

  • RUN Module: run marker identification and verify subtypes
    • runDEA(): run differential expression analysis with three popular methods for choosing, including edgeR, DESeq2, and limma
    • runMarker(): run biomarker identification to determine uniquely and significantly differential expressed genes for each subtype
    • runGSEA(): run gene set enrichment analysis (GSEA), calculate activity of functional pathways and generate a pathway-specific heatmap
    • runGSVA(): run gene set variation analysis to calculate enrichment score of each sample based on given gene set list of interest
    • runNTP(): run nearest template prediction based on identified biomarkers to evaluate subtypes in external cohorts
    • runPAM(): run partition around medoids classifier based on discovery cohort to predict subtypes in external cohorts
    • runKappa(): run consistency evaluation using Kappa statistics between two appraisements that identify or predict current subtypes

4.2 Steps of Pipeline

Basically, the above three connected modules explain the workflow of this R package. MOVICS first identifies the cancer subtype (CS) by using one or multiple clustering algorithms; if multiple clustering algorithms are specified, it is highly recommended to perform a consensus clustering based on different subtyping results in order to derive stable and robust subtypes. Second, after having subtypes it is natural to exploit the heterogeneity of subtypes from as many angles as possible. Third, each subtype should have a list of subtype-specific biomarkers for reproducing such subtype in external cohorts. Therefore, let us follow this analytic pipeline and step into each module and each function in MOVICS.

4.2.1 GET Module

1) get data from example files

Data used for this vignette should be first extracted from the list of brca.tcga, including 4 types of multi-omics data. Except for omics data, this list also contains RNA-Seq raw count, FPKM matrix, MAF data, segmented copy number, clinical and survival information for downstream analyses. Notably, all these omics data used for clustering share exactly the same samples with exactly the same order, and please make sure of this when using MOVICS on your own data.

# print name of example data
names(brca.tcga)
#> [1] "mRNA.expr"   "lncRNA.expr" "meth.beta"   "mut.status"  "count"      
#> [6] "fpkm"        "maf"         "segment"     "clin.info"
names(brca.yau)
#> [1] "mRNA.expr" "clin.info"

# extract multi-omics data
mo.data   <- brca.tcga[1:4]

# extract raw count data for downstream analyses
count     <- brca.tcga$count

# extract fpkm data for downstream analyses
fpkm      <- brca.tcga$fpkm

# extract maf for downstream analysis
maf       <- brca.tcga$maf

# extract segmented copy number for downstream analyses
segment   <- brca.tcga$segment

# extract survival information
surv.info <- brca.tcga$clin.info

2) get elites by reducing data dimension

Although all these omics data have been already processed (filtered from the original dataset), I am still pleased to show you how to use getElites() function to filter out features that meet some stringent requirements, and those features that are preserved in this procedure are considered elites by MOVICS. Five filtering methods are provided here, namely mad for median absolute deviation, sd for standard deviation, pca for principal components analysis, cox for univariate Cox proportional hazards regression, and freq for binary omics data. This function also handles missing values coded in \(NA\) by removing them directly or imputing them by \(k\) nearest neighbors using a Euclidean metric through argument of na.action. Let me show you how to use getElites() below:

# scenario 1: considering we are dealing with an expression data that have 2 rows with NA values
tmp       <- brca.tcga$mRNA.expr # get expression data
dim(tmp) # check data dimension
#> [1] 500 643
tmp[1,1]  <- tmp[2,2] <- NA # set 2 rows with NA values
tmp[1:3,1:3] # check data
#>         BRCA-A03L-01A BRCA-A04R-01A BRCA-A075-01A
#> SCGB2A2            NA          1.42          7.24
#> SCGB1D2         10.11            NA          5.88
#> PIP              4.54          2.59          4.35
elite.tmp <- getElites(dat       = tmp,
                       method    = "mad",
                       na.action = "rm", # NA values will be removed
                       elite.pct = 1) # elite.pct equals to 1 means all (100%) features after NA removal will be selected even using mad method
#> --2 features with NA values are removed.
#> missing elite.num then use elite.pct
dim(elite.tmp$elite.dat) # check dimension again and see that we have removed 2 rows with NA data
#> [1] 498 643

elite.tmp <- getElites(dat       = tmp,
                       method    = "mad",
                       na.action = "impute", # NA values will be imputed
                       elite.pct = 1) 
#> missing elite.num then use elite.pct
dim(elite.tmp$elite.dat) # all data kept
#> [1] 500 643
elite.tmp$elite.dat[1:3,1:3] # NA values have been imputed 
#>         BRCA-A03L-01A BRCA-A04R-01A BRCA-A075-01A
#> SCGB2A2         6.867         1.420          7.24
#> SCGB1D2        10.110         4.739          5.88
#> PIP             4.540         2.590          4.35

# scenario 2: considering we are dealing with continuous data and use mad or sd to select elites
tmp       <- brca.tcga$mRNA.expr # get expression data with 500 features
elite.tmp <- getElites(dat       = tmp,
                       method    = "mad",
                       elite.pct = 0.1) # this time only top 10% features with high mad values are kept
#> missing elite.num then use elite.pct
dim(elite.tmp$elite.dat) # get 50 elite left
#> [1]  50 643

elite.tmp <- getElites(dat       = tmp,
                       method    = "sd",
                       elite.num = 100, # this time only top 100 features with high sd values are kept
                       elite.pct = 0.1) # this time elite.pct argument will be disabled because elite.num has been already indicated.
#> elite.num has been provided then discards elite.pct.
dim(elite.tmp$elite.dat) # get 100 elites left
#> [1] 100 643

# scenario 3: 
# considering we are dealing with continuous data and use pca to select elites
tmp       <- brca.tcga$mRNA.expr # get expression data with 500 features
elite.tmp <- getElites(dat       = tmp,
                       method    = "pca",
                       pca.ratio = 0.95) # ratio of PCs is selected
#> --the ratio used to select principal component is set as 0.95
dim(elite.tmp$elite.dat) # get 204 elite (PCs) left
#> [1] 204 643

# scenario 4: considering we are dealing with data and use cox to select elite
tmp       <- brca.tcga$mRNA.expr # get expression data 
elite.tmp <- getElites(dat       = tmp,
                       method    = "cox",
                       surv.info = surv.info, # must provide survival information with 'futime' and 'fustat'
                       p.cutoff  = 0.05,
                       elite.num = 100) # this time elite.num argument will be disabled because cox method refers to p.cutoff to select elites
#> --all sample matched between omics matrix and survival data.
#> 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95% 100%
dim(elite.tmp$elite.dat) # get 125 elites
#> [1] 125 643
table(elite.tmp$unicox$pvalue < 0.05) # 125 genes have nominal pvalue < 0.05 in univariate Cox regression
#> 
#> FALSE  TRUE 
#>   375   125

tmp       <- brca.tcga$mut.status # get mutation data 
elite.tmp <- getElites(dat       = tmp,
                       method    = "cox",
                       surv.info = surv.info, # must provide survival information with 'futime' and 'fustat'
                       p.cutoff  = 0.05,
                       elite.num = 100) # this time elite.num argument will be disabled because cox method refers to p.cutoff to select elites
#> --all sample matched between omics matrix and survival data.
#> 7% 13% 20% 27% 33% 40% 47% 53% 60% 67% 73% 80% 87% 93% 100%
dim(elite.tmp$elite.dat) # get 3 elites
#> [1]   3 643
table(elite.tmp$unicox$pvalue < 0.05) # 3 mutations have nominal pvalue < 0.05
#> 
#> FALSE  TRUE 
#>    27     3

# scenario 5: considering we are dealing with mutation data using freq to select elites
tmp       <- brca.tcga$mut.status # get mutation data 
rowSums(tmp) # check mutation frequency
#> PIK3CA   TP53    TTN   CDH1  GATA3   MLL3  MUC16 MAP3K1  SYNE1  MUC12    DMD 
#>    208    186    111     83     58     49     48     38     33     32     31 
#>  NCOR1    FLG   PTEN   RYR2  USH2A  SPTA1 MAP2K4  MUC5B    NEB   SPEN  MACF1 
#>     31     30     29     27     27     25     25     24     24     23     23 
#>   RYR3    DST  HUWE1  HMCN1  CSMD1  OBSCN   APOB  SYNE2 
#>     23     22     22     22     21     21     21     21
elite.tmp <- getElites(dat       = tmp,
                       method    = "freq", # must set as 'freq'
                       elite.num = 80, # note: in this scenario elite.num refers to frequency of mutation
                       elite.pct = 0.1) # discard because elite.num has been already indicated
#> --method of 'freq' only supports binary omics data (e.g., somatic mutation matrix), and in this manner, elite.pct and elite.num are used to cut frequency.
#> elite.num has been provided then discards elite.pct.
rowSums(elite.tmp$elite.dat) # only genes that are mutated in over than 80 samples are kept as elites
#> PIK3CA   TP53    TTN   CDH1 
#>    208    186    111     83

elite.tmp <- getElites(dat       = tmp,
                       method    = "freq", # must set as 'freq'
                       elite.pct = 0.2) # note: in this scenario elite.pct refers to frequency of mutation / sample size
#> --method of 'freq' only supports binary omics data (e.g., somatic mutation matrix), and in this manner, elite.pct and elite.num are used to cut frequency.
#> missing elite.num then use elite.pct
rowSums(elite.tmp$elite.dat) # only genes that are mutated in over than 0.2*643=128.6 samples are kept as elites
#> PIK3CA   TP53 
#>    208    186

# get mo.data list just like below (not run)
# mo.data <- list(omics1 = elite.tmp$elite.dat,
#                 omics2 = ...)

Now I think I have made this clear for you concerning how to reduce data dimension for clustering analysis. Since the mo.data has been already prepared, I am going to stick on this data list, and take you further to MOVICS.

3) get optimal number for clustering

The most important parameter to estimate in any clustering study is the optimum number of clusters \(k\) for the data, where \(k\) needs to be small enough to reduce noise but large enough to retain important information. Herein MOVICS refers to CPI5 and Gaps-statistics6 to estimate the number of clusters by using getClustNum() function.

# identify optimal clustering number (may take a while)
optk.brca <- getClustNum(data        = mo.data,
                         is.binary   = c(F,F,F,T), # note: the 4th data is somatic mutation which is a binary matrix
                         try.N.clust = 2:8, # try cluster number from 2 to 8
                         fig.name    = "CLUSTER NUMBER OF TCGA-BRCA")
#> calculating Cluster Prediction Index...
#> 5% complete
#> 5% complete
#> 10% complete
#> 10% complete
#> 15% complete
#> 15% complete
#> 20% complete
#> 25% complete
#> 25% complete
#> 30% complete
#> 30% complete
#> 35% complete
#> 35% complete
#> 40% complete
#> 45% complete
#> 45% complete
#> 50% complete
#> 50% complete
#> 55% complete
#> 55% complete
#> 60% complete
#> 65% complete
#> 65% complete
#> 70% complete
#> 70% complete
#> 75% complete
#> 75% complete
#> 80% complete
#> 85% complete
#> 85% complete
#> 90% complete
#> 90% complete
#> 95% complete
#> 95% complete
#> 100% complete
#> calculating Gap-statistics...
#> visualization done...
#> --the imputed optimal cluster number is 3 arbitrarily, but it would be better referring to other priori knowledge.
Figure 1. Identification of optimal cluster number by calculating CPI (blue line) and Gaps-statistics (red line) in TCGA-BRCA cohort.

Figure 1. Identification of optimal cluster number by calculating CPI (blue line) and Gaps-statistics (red line) in TCGA-BRCA cohort.

The above estimation of clustering number gives an arbitrary \(k\) of 3. However, the popular PAM50 classifier for breast cancer has 5 classifications and if taking a close look at the descriptive figure, it can be noticed that both CPI and Gaps-statistics does not decline too much at \(k\) of 5. Therefore, considering these knowledge, \(k\) of 5 is chosen as the optimal clustering number for further analyses.

4) get results from single algorithm

In this part I will first show how to use MOVICS to perform multi-omics integrative clustering by specifying one algorithm with detailed parameters. For example, let us try iClusterBayes method like below:

# perform iClusterBayes (may take a while)
iClusterBayes.res <- getiClusterBayes(data        = mo.data,
                                      N.clust     = 5,
                                      type        = c("gaussian","gaussian","gaussian","binomial"),
                                      n.burnin    = 1800,
                                      n.draw      = 1200,
                                      prior.gamma = c(0.5, 0.5, 0.5, 0.5),
                                      sdev        = 0.05,
                                      thin        = 3)
#> clustering done...
#> feature selection done...

Otherwise, a unified function can be used for all algorithms individually with detailed parameters, that is, getMOIC().

iClusterBayes.res <- getMOIC(data        = mo.data,
                             N.clust     = 5,
                             methodslist = "iClusterBayes", # specify only ONE algorithm here
                             type        = c("gaussian","gaussian","gaussian","binomial"), # data type corresponding to the list
                             n.burnin    = 1800,
                             n.draw      = 1200,
                             prior.gamma = c(0.5, 0.5, 0.5, 0.5),
                             sdev        = 0.05,
                             thin        = 3)

By specifying only one algorithm (i.e., iClusterBayes) in the argument of methodslist, the above getMOIC() will return exactly the same results from getiClusterBayes() if the same parameters are provided. The returned result contains a clust.res object that has two columns: clust to indicate the subtype which the sample belongs to, and samID records the corresponding sample name. For algorithms that provide feature selection procedure (i.e., iClusterBayes, CIMLR, and MoCluster), the result also contains a feat.res object that stores the information of such procedure. To those algorithms involving hierarchical clustering (e.g., COCA, ConsensusClustering), the corresponding dendrogram for sample clustering will be also returned as clust.dend, which is useful if the users want to put them at the heatmap.

5) get results from multiple algorithms at once

If you simultaneously specify a list of algorithms to methodslist argument in getMOIC(), it will automatically perform each algorithm with default parameters one by one, and a list of results derived from specified algorithms will be finally returned. Now that iClusterBayes has been finished, let us try other 9 algorithms at once with parameters by default. This will take some time, so have a coffee break.

# perform multi-omics integrative clustering with the rest of 9 algorithms
moic.res.list <- getMOIC(data        = mo.data,
                         methodslist = list("SNF", "PINSPlus", "NEMO", "COCA", "LRAcluster", "ConsensusClustering", "IntNMF", "CIMLR", "MoCluster"),
                         N.clust     = 5,
                         type        = c("gaussian", "gaussian", "gaussian", "binomial"))
#> --you choose more than 1 algorithm and all of them shall be run with parameters by default.
#> SNF done...
#> Clustering method: kmeans
#> Perturbation method: noise
#> PINSPlus done...
#> NEMO done...
#> COCA done...
#> LRAcluster done...
#> end fraction
#> clustered
#> clustered
#> clustered
#> clustered
#> ConsensusClustering done...
#> IntNMF done...
#> clustering done...
#> feature selection done...
#> CIMLR done...
#> clustering done...
#> feature selection done...
#> MoCluster done...

# attach iClusterBayes.res as a list using append() to moic.res.list with 9 results already
moic.res.list <- append(moic.res.list, 
                        list("iClusterBayes" = iClusterBayes.res))

# save moic.res.list to local path
save(moic.res.list, file = "moic.res.list.rda")

6) get consensus from different algorithms

Since now all the clustering results from 10 algorithms are in hand, MOVICS borrows the idea of consensus ensembles7 for later integration of the clustering results derived from different algorithms, so as to improve the clustering robustness. To be specific, if \(t_{max}\) algorithms are specified where \(2\le t_{max} \le10\), getConsensusMOIC() calculates a matrix \(M_{n\times n}^{(t)}\) per algorithm where \(n\) is the number of samples, and \(M_{ij}^{(t)}=1\) only when the sample \(i\) and \(j\) are clustered in the same subtype, otherwise \(M_{ij}^{(t)}=0\). After get all results from specified algorithms, MOVICS calculates a consensus matrix \(CM=\sum_{t=1}^{t_{max}}M^{(t)}\), and \(cm_{ij}\in[0,10]\). Such matrix represents a robust pairwise similarities for samples because it considers different multi-omics integrative clustering algorithms. MOVICS then searches for a stable clustering result by applying hierarchical clustering to \(CM\). The simplest way to perform getConsensusMOIC() is to pass a list of object returned by multiple get%algorithm_name%() or by getMOIC() with specific argument of methodslist. This can be done like below:

cmoic.brca <- getConsensusMOIC(moic.res.list = moic.res.list,
                               fig.name      = "CONSENSUS HEATMAP",
                               distance      = "euclidean",
                               linkage       = "average")
Figure 2. Consensus heatmap based on results from 10 multi-omics integrative clustering algorithms with cluster number of 5.

Figure 2. Consensus heatmap based on results from 10 multi-omics integrative clustering algorithms with cluster number of 5.

Remember, you must choose at least two methods to perform the above consensus clustering by getConsensusMOIC(), and this function will return a consensus matrix (a probability matrix represents how many times samples belonging to the same subtype can be clustered together by different multi-omics clustering methods) and a corresponding consensus heatmap. Ideally, the consensus heatmap will show a perfect diagonal rectangle, and the input values are 0 and 1 only because all algorithms derived the same clustering results.

7) get quantification of similarity using silhoutte

If consensus ensembles are used with multiple algorithms, except for the consensus heatmap, MOVICS provides a function getSilhoutte() to quantify and visualize the sample similarity more specifically given the identified clusters. Such function depends on a sil object that stores silhoutte score which was returned by getConsensusMOIC().

getSilhouette(sil      = cmoic.brca$sil, # a sil object returned by getConsensusMOIC()
              fig.path = getwd(),
              fig.name = "SILHOUETTE",
              height   = 5.5,
              width    = 5)
Figure 3. Quantification of sample similarity using silhoutte score based on consensus ensembles result.

Figure 3. Quantification of sample similarity using silhoutte score based on consensus ensembles result.

#> png 
#>   2

7) get multi-omics heatmap based on clustering result

Genome-wide heatmaps are widely used to graphically display potential underlying patterns within the large genomic dataset. They have been used to reveal information about how the samples/genes cluster together and provide insights into potential sample biases or other artifacts. Herein MOVICS provides getMoHeatmap to visually deal with pre-clustered multi-omics data and generate an exquisite heatmap for publication requirement. Before using getMoHeatmap(), omics data should be properly processed by using the function of getStdiz() which returns a list storing normalized omics data. Omics data, especially for expression (e.g., RNA and protein), should be centered (centerFlag = TRUE) or scaled (scaleFlag = TRUE) or z-scored (both centered and scaled). Generally, DNA methylation data (\(\beta\) matrix ranging from 0 to 1) and somatic mutation (0 and 1 binary matrix) should not be normalized. However, it is a good choice to convert the methylation \(\beta\) value to M value following the formula of \(M=log_2\frac{\beta}{1-\beta}\) for its stronger signal in visualization and M value is suitable for normalization. This function also provides an argument of halfwidth for continuous omics data; such argument is used to truncate the ‘extremum’ after normalization; specifically, normalized values that exceed the halfwidth boundaries will be replaced by the halfwidth, which is beneficial to map colors in heatmap.

# convert beta value to M value for stronger signal
indata <- mo.data
indata$meth.beta <- log2(indata$meth.beta / (1 - indata$meth.beta))

# data normalization for heatmap
plotdata <- getStdiz(data       = indata,
                     halfwidth  = c(2,2,2,NA), # no truncation for mutation
                     centerFlag = c(T,T,T,F), # no center for mutation
                     scaleFlag  = c(T,T,T,F)) # no scale for mutation

As I mentioned earlier, several algorithms also provide feature selection; those selected features show a complex cross-talk with other omics data and might have special biological significance that drives the heterogeneity of cancers. Therefore I show below how to generate a comprehensive heatmap based on a single algorithm (e.g., iClusterBayes) with selected features by using getMoHeatmap(). However in the first place, features must be extracted:

feat   <- iClusterBayes.res$feat.res
feat1  <- feat[which(feat$dataset == "mRNA.expr"),][1:10,"feature"] 
feat2  <- feat[which(feat$dataset == "lncRNA.expr"),][1:10,"feature"]
feat3  <- feat[which(feat$dataset == "meth.beta"),][1:10,"feature"]
feat4  <- feat[which(feat$dataset == "mut.status"),][1:10,"feature"]
annRow <- list(feat1, feat2, feat3, feat4)

The feat.res contained in iClusterBayes.res is sorted by posterior probability of features for each omics data. In this manner, the top 10 features are selected for each omics data and a feature list is generated and named as annRow for heatmap raw annotation (Do not worry, I will talk about how to attach column annotation for samples in a minute).

# set color for each omics data
# if no color list specified all subheatmaps will be unified to green and red color pattern
mRNA.col   <- c("#00FF00", "#008000", "#000000", "#800000", "#FF0000")
lncRNA.col <- c("#6699CC", "white"  , "#FF3C38")
meth.col   <- c("#0074FE", "#96EBF9", "#FEE900", "#F00003")
mut.col    <- c("grey90" , "black")
col.list   <- list(mRNA.col, lncRNA.col, meth.col, mut.col)

# comprehensive heatmap (may take a while)
getMoHeatmap(data          = plotdata,
             row.title     = c("mRNA","lncRNA","Methylation","Mutation"),
             is.binary     = c(F,F,F,T), # the 4th data is mutation which is binary
             legend.name   = c("mRNA.FPKM","lncRNA.FPKM","M value","Mutated"),
             clust.res     = iClusterBayes.res$clust.res, # cluster results
             clust.dend    = NULL, # no dendrogram
             show.rownames = c(F,F,F,F), # specify for each omics data
             show.colnames = FALSE, # show no sample names
             annRow        = annRow, # mark selected features
             color         = col.list,
             annCol        = NULL, # no annotation for samples
             annColors     = NULL, # no annotation color
             width         = 10, # width of each subheatmap
             height        = 5, # height of each subheatmap
             fig.name      = "COMPREHENSIVE HEATMAP OF ICLUSTERBAYES")
Figure 4. Comprehensive heatmap of multi-omics integrative clustering by iClusterBayes with annotation of potential drivers.

Figure 4. Comprehensive heatmap of multi-omics integrative clustering by iClusterBayes with annotation of potential drivers.

Of course, since there are a total of 10 results stored in the moic.res.list, you can also choose any one of them to create the heatmap. Here I select COCA which also returns sample dendrogram like below:

# comprehensive heatmap (may take a while)
getMoHeatmap(data          = plotdata,
             row.title     = c("mRNA","lncRNA","Methylation","Mutation"),
             is.binary     = c(F,F,F,T), # the 4th data is mutation which is binary
             legend.name   = c("mRNA.FPKM","lncRNA.FPKM","M value","Mutated"),
             clust.res     = moic.res.list$COCA$clust.res, # cluster results
             clust.dend    = moic.res.list$COCA$clust.dend, # show dendrogram for samples
             color         = col.list,
             width         = 10, # width of each subheatmap
             height        = 5, # height of each subheatmap
             fig.name      = "COMPREHENSIVE HEATMAP OF COCA")
Figure 5. Comprehensive heatmap of multi-omics integrative clustering by COCA with dendrogram for samples.

Figure 5. Comprehensive heatmap of multi-omics integrative clustering by COCA with dendrogram for samples.

Now go back to the consensus result of cmoic.brca that integrates 10 algorithms and this time samples’ annotation is also provided to generate the heatmap. Since the core function of getMoHeatmap() is based on ComplexHeatmap R package, when creating annotations, you should always use circlize::colorRamp2() function to generate the color mapping function for continuous variables (e.g., age in this example).

# extract PAM50, pathologic stage and age for sample annotation
annCol    <- surv.info[,c("PAM50", "pstage", "age"), drop = FALSE]

# generate corresponding colors for sample annotation
annColors <- list(age    = circlize::colorRamp2(breaks = c(min(annCol$age),
                                                           median(annCol$age),
                                                           max(annCol$age)), 
                                                colors = c("#0000AA", "#555555", "#AAAA00")),
                  PAM50  = c("Basal" = "blue",
                            "Her2"   = "red",
                            "LumA"   = "yellow",
                            "LumB"   = "green",
                            "Normal" = "black"),
                  pstage = c("T1"    = "green",
                             "T2"    = "blue",
                             "T3"    = "red",
                             "T4"    = "yellow", 
                             "TX"    = "black"))

# comprehensive heatmap (may take a while)
getMoHeatmap(data          = plotdata,
             row.title     = c("mRNA","lncRNA","Methylation","Mutation"),
             is.binary     = c(F,F,F,T), # the 4th data is mutation which is binary
             legend.name   = c("mRNA.FPKM","lncRNA.FPKM","M value","Mutated"),
             clust.res     = cmoic.brca$clust.res, # consensusMOIC results
             clust.dend    = NULL, # show no dendrogram for samples
             show.rownames = c(F,F,F,F), # specify for each omics data
             show.colnames = FALSE, # show no sample names
             show.row.dend = c(F,F,F,F), # show no dendrogram for features
             annRow        = NULL, # no selected features
             color         = col.list,
             annCol        = annCol, # annotation for samples
             annColors     = annColors, # annotation color
             width         = 10, # width of each subheatmap
             height        = 5, # height of each subheatmap
             fig.name      = "COMPREHENSIVE HEATMAP OF CONSENSUSMOIC")
Figure 6. Comprehensive heatmap based on consensus across 10 algorithms with clinicopathological annotation.

Figure 6. Comprehensive heatmap based on consensus across 10 algorithms with clinicopathological annotation.

4.2.2 COMP Module

After identification of cancer subtypes, it is essential to further characterize each subtype by discovering their difference from multiple aspects. To this end, MOVICS provides commonly used downstream analyses in cancer subtyping researches for easily cohesion with results derived from GET Module. Now let us check them out.

1) compare survival outcome

I start with comparing the prognosis among different subtypes. MOVICS provides function of compSurv() which not only calculates the overall nominal P value by log-rank test, but also performs pairwise comparison and derives adjusted P values if more than two subtypes are identified. These information will be all printed in the Kaplan-Meier Curve which is convenient for researchers to refer. Except for clustering results (e.g., cmoic.brca in this example), you must additionally provide surv.info argument which should be a data.frame (must has row names of samples) that stores a futime column for survival time (unit of day) and another fustat column for survival outcome (0 = censor; 1 = event). x-year survival probability can be also estimated if specifying argument of xyrs.est.

# survival comparison
surv.brca <- compSurv(moic.res         = cmoic.brca,
                      surv.info        = surv.info,
                      convt.time       = "m", # convert day unit to month
                      surv.median.line = "h", # draw horizontal line at median survival
                      xyrs.est         = c(5,10), # estimate 5 and 10-year survival
                      fig.name         = "KAPLAN-MEIER CURVE OF CONSENSUSMOIC")
#> --a total of 643 samples are identified.
#> --removed missing values.
#> --leaving 642 observations.
#> --cut survival curve up to 10 years.
Figure 7. Kaplan-Meier survival curve of 5 identified subtypes of breast cancer in TCGA-BRCA cohort.

Figure 7. Kaplan-Meier survival curve of 5 identified subtypes of breast cancer in TCGA-BRCA cohort.


print(surv.brca)
#> $fitd
#> Call:
#> survdiff(formula = Surv(futime, fustat) ~ Subtype, data = mosurv.res, 
#>     na.action = na.exclude)
#> 
#>               N Observed Expected (O-E)^2/E (O-E)^2/V
#> Subtype=CS1 146       28     14.6    12.269    15.455
#> Subtype=CS2 132       17     15.1     0.251     0.317
#> Subtype=CS3 107        7     13.4     3.028     3.756
#> Subtype=CS4 144        7     17.4     6.218     8.175
#> Subtype=CS5 113       16     14.6     0.140     0.176
#> 
#>  Chisq= 22.3  on 4 degrees of freedom, p= 2e-04 
#> 
#> $fit
#> Call: survfit(formula = Surv(futime, fustat) ~ Subtype, data = mosurv.res, 
#>     na.action = na.exclude, error = "greenwood", type = "kaplan-meier", 
#>     conf.type = "plain")
#> 
#>       n events median 0.95LCL 0.95UCL
#> CS1 146     28    130    67.3      NA
#> CS2 132     17    114    83.6     144
#> CS3 107      7     NA   102.5      NA
#> CS4 144      7    216   113.5      NA
#> CS5 113     16     NA    97.2      NA
#> 
#> $xyrs.est
#> Call: survfit(formula = Surv(futime, fustat) ~ Subtype, data = mosurv.res)
#> 
#>                 Subtype=CS1 
#>  time n.risk n.event survival std.err lower 95% CI upper 95% CI
#>  1825     23      23    0.658  0.0647        0.543        0.798
#>  3650      4       4    0.509  0.0830        0.370        0.701
#> 
#>                 Subtype=CS2 
#>  time n.risk n.event survival std.err lower 95% CI upper 95% CI
#>  1825     27       7    0.855  0.0541        0.755        0.968
#>  3650      3       8    0.421  0.1286        0.231        0.766
#> 
#>                 Subtype=CS3 
#>  time n.risk n.event survival std.err lower 95% CI upper 95% CI
#>  1825     19       4    0.851  0.0701        0.724        1.000
#>  3650      5       3    0.644  0.1183        0.449        0.923
#> 
#>                 Subtype=CS4 
#>  time n.risk n.event survival std.err lower 95% CI upper 95% CI
#>  1825     25       4    0.917  0.0443        0.834            1
#>  3650      3       2    0.550  0.2027        0.267            1
#> 
#>                 Subtype=CS5 
#>  time n.risk n.event survival std.err lower 95% CI upper 95% CI
#>  1825     25      12    0.821  0.0502        0.728        0.926
#>  3650      3       4    0.503  0.1599        0.269        0.938
#> 
#> 
#> $overall.p
#> [1] 0.000172211
#> 
#> $pairwise.p
#> 
#>  Pairwise comparisons using Log-Rank test 
#> 
#> data:  mosurv.res and Subtype 
#> 
#>     CS1     CS2     CS3     CS4    
#> CS2 0.11909 -       -       -      
#> CS3 0.01345 0.11909 -       -      
#> CS4 0.00055 0.03497 0.72167 -      
#> CS5 0.16073 0.87433 0.16073 0.03022
#> 
#> P value adjustment method: BH

2) compare clinical features

Then compare the clinical features among different subtypes. MOVICS provides function of compClinvar() which enables summarizing both continuous and categorical variables and performing proper statistical tests. This function can give a table in .docx format that is easy to use in medical research papers.

clin.brca <- compClinvar(moic.res      = cmoic.brca,
                         var2comp      = surv.info, # data.frame needs to summarize (must has row names of samples)
                         strata        = "Subtype", # stratifying variable (e.g., Subtype in this example)
                         factorVars    = c("PAM50","pstage","fustat"), # features that are considered categorical variables
                         nonnormalVars = "futime", # feature(s) that are considered using nonparametric test
                         exactVars     = "pstage", # feature(s) that are considered using exact test
                         doWord        = TRUE, # generate .docx file in local path
                         tab.name      = "SUMMARIZATION OF CLINICAL FEATURES")
#> --all samples matched.
print(clin.brca$compTab)
Table 1. Comparison of clinical features among 5 identified subtype of breast cancer in TCGA-BRCA cohort.
level CS1 CS2 CS3 CS4 CS5 p test
n 146 133 107 144 113
fustat (%) 0 118 (80.8) 115 (86.5) 100 (93.5) 137 (95.1) 97 (85.8) 0.001
1 28 (19.2) 18 (13.5) 7 ( 6.5) 7 ( 4.9) 16 (14.2)
futime (median [IQR]) 719.00 [431.50, 1355.00] 746.50 [432.25, 1420.25] 703.00 [436.00, 1270.50] 848.50 [470.50, 1547.75] 742.00 [470.00, 1605.00] 0.691 nonnorm
PAM50 (%) Basal 2 ( 1.4) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 109 (96.5) <0.001
Her2 38 (26.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
LumA 47 (32.2) 78 (58.6) 91 (85.0) 128 (88.9) 0 ( 0.0)
LumB 53 (36.3) 55 (41.4) 16 (15.0) 0 ( 0.0) 0 ( 0.0)
Normal 6 ( 4.1) 0 ( 0.0) 0 ( 0.0) 16 (11.1) 4 ( 3.5)
pstage (%) T1 40 (27.4) 35 (26.3) 32 (29.9) 40 (27.8) 25 (22.1) 0.036 exact
T2 85 (58.2) 77 (57.9) 63 (58.9) 72 (50.0) 70 (61.9)
T3 12 ( 8.2) 17 (12.8) 10 ( 9.3) 31 (21.5) 14 (12.4)
T4 9 ( 6.2) 4 ( 3.0) 2 ( 1.9) 1 ( 0.7) 3 ( 2.7)
TX 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 1 ( 0.9)
age 58.47 (13.75) 59.32 (14.03) 59.62 (12.40) 57.58 (12.23) 55.67 (12.11) 0.116

3) compare mutational frequency

Subype-specific mutation might be promising as therapeutic target. Hence, I then compare the mutational frequency among different clusters. MOVICS provides function of compMut() which deals with binary mutational data (0 = wild; 1 = mutated). This function applies independent testing (i.e., Fisher' s exact test or \(\chi^2\) test) for each mutation, and generates a table with statistical results (.docx file also if specified), and creates an OncoPrint using those differentially mutated genes.

# mutational frequency comparison
mut.brca <- compMut(moic.res     = cmoic.brca,
                    mut.matrix   = brca.tcga$mut.status, # binary somatic mutation matrix
                    doWord       = TRUE, # generate table in .docx format
                    doPlot       = TRUE, # draw OncoPrint
                    freq.cutoff  = 0.05, # keep those genes that mutated in at least 5% of samples
                    p.adj.cutoff = 0.05, # keep those genes with adjusted p value < 0.05 to draw OncoPrint
                    innerclust   = TRUE, # perform clustering within each subtype
                    annCol       = annCol, # same annotation for heatmap
                    annColors    = annColors, # same annotation color for heatmap
                    width        = 6, 
                    height       = 2,
                    fig.name     = "ONCOPRINT FOR SIGNIFICANT MUTATIONS",
                    tab.name     = "INDEPENDENT TEST BETWEEN SUBTYPE AND MUTATION")
#> --all samples matched.
Figure 8. Mutational OncoPrint of 5 identified subtypes of breast cancer in TCGA-BRCA cohort.

Figure 8. Mutational OncoPrint of 5 identified subtypes of breast cancer in TCGA-BRCA cohort.

print(mut.brca)
Table 2. Comparison of mutational frequency among 5 identified subtype of breast cancer in TCGA-BRCA cohort.
Gene (Mutated) TMB CS1 CS2 CS3 CS4 CS5 pvalue padj
PIK3CA 208 (32%) 55 (37.7%) 10 ( 7.5%) 77 (72.0%) 64 (44.4%) 2 ( 1.8%) 6.90e-42 3.10e-41
TP53 186 (29%) 83 (56.8%) 12 ( 9.0%) 0 ( 0.0%) 8 ( 5.6%) 83 (73.5%) 2.06e-63 1.85e-62
TTN 111 (17%) 41 (28.1%) 14 (10.5%) 14 (13.1%) 22 (15.3%) 20 (17.7%) 2.03e-03 3.04e-03
CDH1 83 (13%) 11 ( 7.5%) 3 ( 2.3%) 19 (17.8%) 49 (34.0%) 1 ( 0.9%) 5.38e-19 1.61e-18
GATA3 58 ( 9%) 5 ( 3.4%) 32 (24.1%) 10 ( 9.3%) 11 ( 7.6%) 0 ( 0.0%) 6.21e-11 1.40e-10
MLL3 49 ( 8%) 11 (7.5%) 12 (9.0%) 10 (9.3%) 11 (7.6%) 5 (4.4%) 6.31e-01 6.31e-01
MUC16 48 ( 8%) 18 (12.3%) 8 ( 6.0%) 6 ( 5.6%) 8 ( 5.6%) 8 ( 7.1%) 2.05e-01 2.31e-01
MAP3K1 38 ( 6%) 2 ( 1.4%) 6 ( 4.5%) 16 (15.0%) 12 ( 8.3%) 2 ( 1.8%) 3.65e-05 6.57e-05
SYNE1 33 ( 5%) 9 (6.2%) 1 (0.8%) 5 (4.7%) 8 (5.6%) 10 (8.8%) 3.27e-02 4.20e-02

4) compare total mutation burden

Needless to say, immunotherapy is becoming a pillar of modern cancer treatment. Recent analyses have linked the tumoral genomic landscape with antitumor immunity. In particular, an emerging picture suggests that tumor-specific genomic lesions are associated with immune checkpoint activation and the extent and duration of responses for patients to immunotherapy. These lesions contains high mutation burdens 8 and aneuploidy9. To quantify these genomic alterations that may affect immunotherapy, MOVICS provides two functions to calculate total mutation burden (TMB) and fraction genome altered (FGA). To be specific, TMB refers to the number of mutations that are found in the tumor genome, while FGA is the percentage of genome that has been affected by copy number gains or losses. Both attributes are useful for genetic researchers as they provide them with more in-depth information on the genomic make-up of the tumors. Let me show you how to use these two functions by starting with compTMB(). First of all, the input maf data for this function must have the following 10 columns at least:

head(maf)
#>   Tumor_Sample_Barcode Hugo_Symbol Chromosome Start_Position End_Position
#> 1        BRCA-A1XY-01A       USP24       chr1       55159655     55159655
#> 2        BRCA-A1XY-01A      ERICH3       chr1       74571494     74571494
#> 3        BRCA-A1XY-01A      KIF26B       chr1      245419680    245419680
#> 4        BRCA-A1XY-01A       USP34       chr2       61189055     61189055
#> 5        BRCA-A1XY-01A      ANTXR1       chr2       69245305     69245305
#> 6        BRCA-A1XY-01A       SCN9A       chr2      166199365    166199365
#>   Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1
#> 1      Missense_Mutation          SNP                T                 T
#> 2      Missense_Mutation          SNP                C                 C
#> 3                 Silent          SNP                G                 G
#> 4                 Silent          SNP                G                 G
#> 5                 Silent          SNP                G                 G
#> 6                 Silent          SNP                G                 G
#>   Tumor_Seq_Allele2
#> 1                 C
#> 2                 T
#> 3                 T
#> 4                 C
#> 5                 A
#> 6                 A

Run compTMB() then. By default, this function considers nonsynonymous variants only when counting somatic mutation frequency, including Frame_Shift_Del, Frame_Shift_Ins, Splice_Site, Translation_Start_Site, Nonsense_Mutation, Nonstop_Mutation, In_Frame_Del, In_Frame_Ins, and Missense_Mutation. In addition to calculating TMB, this function also classifies Single Nucleotide Variants into Transitions and Transversions (TiTv), and depicts the distribution of both TMB and TiTv.

# compare TMB
tmb.brca <- compTMB(moic.res     = cmoic.brca,
                    maf          = maf,
                    rmDup        = TRUE, # remove duplicated variants per sample
                    rmFLAGS      = FALSE, # keep FLAGS mutations
                    exome.size   = 38, # estimated exome size
                    test.method  = "nonparametric", # statistical testing method
                    fig.name     = "DISTRIBUTION OF TMB AND TITV")
#> --67 samples mismatched from current subtypes.
#> -Validating
#> -Silent variants: 24329 
#> -Summarizing
#> --Possible FLAGS among top ten genes:
#>   TTN
#>   MUC16
#> -Processing clinical data
#> --Missing clinical data
#> -Finished in 3.500s elapsed (3.420s cpu) 
#> Kruskal-Wallis rank sum test p value = 1.97e-23
#> post-hoc pairwise wilcoxon rank sum test with Benjamini-Hochberg adjustment presents below:
#>     CS1        CS2        CS3        CS4       
#> CS2 "4.71e-11" " NA"      " NA"      " NA"     
#> CS3 "5.11e-10" "7.79e-01" " NA"      " NA"     
#> CS4 "1.68e-10" "5.93e-01" "7.79e-01" " NA"     
#> CS5 "3.18e-01" "1.23e-12" "1.27e-11" "1.27e-11"
Figure 9. Comparison of TMB and TiTv among 5 identified subtypes of breast cancer in TCGA-BRCA cohort.

Figure 9. Comparison of TMB and TiTv among 5 identified subtypes of breast cancer in TCGA-BRCA cohort.

head(tmb.brca$TMB.dat)
Table 3. Demo of comparison of TMB among 5 identified subtype of breast cancer in TCGA-BRCA cohort.
samID variants TMB log10TMB Subtype
570 BRCA-A1EW-01A 9 0.2368421 0.0923143 CS1
558 BRCA-A1IO-01A 11 0.2894737 0.1104125 CS1
560 BRCA-A0C3-01A 11 0.2894737 0.1104125 CS1
541 BRCA-A1NG-01A 14 0.3684211 0.1362197 CS1
531 BRCA-A1Y2-01A 15 0.3947368 0.1444923 CS1
525 BRCA-A5RY-01A 16 0.4210526 0.1526102 CS1

5) compare fraction genome altered

Next, compFGA() calculates not only FGA but also computes specific gain (FGG) or loss (FGL) per sample within each subtype. To get this function worked, an eligible input of segmented copy number should be prepared with exactly the same column name like below:

# change column names of segment data
colnames(segment) <- c("sample","chrom","start","end","value")

Let’s see how the input looks like:

head(segment)
#>          sample chrom     start       end   value
#> 1 BRCA-A090-01A     1   3301765  54730235 -0.1271
#> 2 BRCA-A090-01A     1  54730247  57443819 -0.0899
#> 3 BRCA-A090-01A     1  57448465  57448876 -1.1956
#> 4 BRCA-A090-01A     1  57448951  64426102 -0.1009
#> 5 BRCA-A090-01A     1  64426648 106657734 -0.1252
#> 6 BRCA-A090-01A     1 106657854 106835667  0.1371

Then this function can be run without any difficulties. Notably, if your CNA calling procedure did not provide a segmented copy number as value column but the original copy number, argument of iscopynumber must be switched to TRUE instead.

# compare FGA, FGG, and FGL
fga.brca <- compFGA(moic.res     = cmoic.brca,
                    segment      = segment,
                    iscopynumber = FALSE, # this is a segmented copy number file
                    cnathreshold = 0.2, # threshold to determine CNA gain or loss
                    test.method  = "nonparametric", # statistical testing method
                    fig.name     = "BARPLOT OF FGA")
#> --2 samples mismatched from current subtypes.
#> 5% 10% 15% 21% 26% 31% 36% 41% 46% 51% 57% 62% 67% 72% 77% 82% 88% 93% 98%
Figure 10. Barplot of fraction genome altered among 5 identified subtypes of breast cancer in TCGA-BRCA cohort.

Figure 10. Barplot of fraction genome altered among 5 identified subtypes of breast cancer in TCGA-BRCA cohort.

head(fga.brca$summary)
Table 4. Demo of comparison of fraction genome altered among 5 identified subtype of breast cancer in TCGA-BRCA cohort.
samID FGA FGG FGL Subtype
BRCA-A03L-01A 0.6217991 0.3086727 0.3131264 CS1
BRCA-A04R-01A 0.2531019 0.0913201 0.1617818 CS2
BRCA-A075-01A 0.7007067 0.4144424 0.2862643 CS1
BRCA-A08O-01A 0.6501287 0.4564814 0.1936473 CS3
BRCA-A0A6-01A 0.1468893 0.0635649 0.0833244 CS4
BRCA-A0AD-01A 0.1722214 0.0386452 0.1335762 CS2

6) compare drug sensitivity

Predicting response to medication is particularly important for drugs with a narrow therapeutic index, for example chemotherapeutic agents, because response is highly variable and side effects are potentially lethal. Therefore, Paul Geeleher et al. (2014)10 used baseline gene expression and \(in\;vitro\) drug sensitivity derived from cell lines, coupled with \(in\;vivo\) baseline tumor gene expression, to predict patients’ response to drugs. Paul developped an R package pRRophetic for prediction of clinical chemotherapeutic response from tumor gene expression levels11, and now this function has been involved in MOVICS to examine difference of drug sensitivity among different subtypes. Here I estimate the \(IC_{50}\) of Cisplatin and Paclitaxel for 5 identified subtypes of breast cancer in TCGA cohort.

# drug sensitivity comparison
drug.brca <- compDrugsen(moic.res    = cmoic.brca,
                         norm.expr   = fpkm[,cmoic.brca$clust.res$samID], # double guarantee sample order
                         drugs       = c("Cisplatin", "Paclitaxel"), # a vector of names of drug in GDSC
                         tissueType  = "breast", # choose specific tissue type to construct ridge regression model
                         test.method = "nonparametric", # statistical testing method
                         prefix      = "BOXVIOLIN OF ESTIMATED IC50") 
#> --all samples matched.
#> --log2 transformation done for expression data.
#> Cisplatin done...
#> Cisplatin: Kruskal-Wallis rank sum test p value = 1.38e-58
#> post-hoc pairwise wilcoxon rank sum test with Benjamini-Hochberg adjustment presents below:
#>     CS1        CS2        CS3        CS4       
#> CS2 "5.50e-18" " NA"      " NA"      " NA"     
#> CS3 "3.90e-13" "9.78e-02" " NA"      " NA"     
#> CS4 "5.21e-01" "4.54e-24" "3.17e-18" " NA"     
#> CS5 "2.29e-13" "3.89e-34" "2.57e-30" "3.15e-14"
#> Paclitaxel done...
#> Paclitaxel: Kruskal-Wallis rank sum test p value = 1.91e-13
#> post-hoc pairwise wilcoxon rank sum test with Benjamini-Hochberg adjustment presents below:
#>     CS1        CS2        CS3        CS4       
#> CS2 "8.11e-10" " NA"      " NA"      " NA"     
#> CS3 "3.87e-09" "8.56e-01" " NA"      " NA"     
#> CS4 "9.12e-11" "6.59e-01" "8.55e-01" " NA"     
#> CS5 "2.82e-04" "5.20e-02" "4.92e-02" "1.95e-02"
Figure 11. Boxviolins for estimated IC50 of Cisplatin and Paclitaxel among 5 identified subtypes of breast cancer in TCGA-BRCA cohort.Figure 11. Boxviolins for estimated IC50 of Cisplatin and Paclitaxel among 5 identified subtypes of breast cancer in TCGA-BRCA cohort.

Figure 11. Boxviolins for estimated IC50 of Cisplatin and Paclitaxel among 5 identified subtypes of breast cancer in TCGA-BRCA cohort.

head(drug.brca$Cisplatin)
Table 5. Demo of estimated IC50 for Cisplatin among 5 identified subtype of breast cancer in TCGA-BRCA cohort.
Est.IC50 Subtype
BRCA-A03L-01A 3.802060 CS1
BRCA-A075-01A 3.834646 CS1
BRCA-A0AW-01A 3.126744 CS1
BRCA-A0AZ-01A 3.303054 CS1
BRCA-A0BC-01A 3.409402 CS1
BRCA-A0BF-01A 3.358053 CS1

7) compare agreement with other subtypes

At present, many cancers have traditional classifications, and evaluating the consistency of new subtypes with previous classifications is critical to reflect the robustness of clustering analysis and to determine potential but novel subtypes. To measure the agreement (similarity) between the current subtypes and other pre-existed classifications, MOVICS provides function of compAgree() to calculate four statistics: Rand Index (RI)12, Adjusted Mutual Information (AMI)13, Jaccard Index (JI)14, and Fowlkes-Mallows (FM)15; all these measurements range from 0 to 1 and the larger the value is, the more similar the two appraises are. This function can also generate an alluvial diagram to visualize the agreement of two appraises with the current subtypes as reference.

# customize the factor level for pstage
surv.info$pstage <- factor(surv.info$pstage, levels = c("TX","T1","T2","T3","T4"))

# agreement comparison (support up to 6 classifications include current subtype)
agree.brca <- compAgree(moic.res  = cmoic.brca,
                        subt2comp = surv.info[,c("PAM50","pstage")],
                        doPlot    = TRUE,
                        box.width = 0.2,
                        fig.name  = "AGREEMENT OF CONSENSUSMOIC WITH PAM50 AND PSTAGE")
#> --all samples matched.
Figure 12. Agreement of 5 identified subtypes of breast cancer with PAM50 classification and pathological stage in TCGA-BRCA cohort.

Figure 12. Agreement of 5 identified subtypes of breast cancer with PAM50 classification and pathological stage in TCGA-BRCA cohort.

print(agree.brca)
Table 6. Agreement of 5 identified subtypes with PAM50 classification and pathological stage in TCGA-BRCA cohort.
current.subtype other.subtype RI AMI JI FM
Subtype PAM50 0.6929744 0.4043456 0.2910887 0.4694389
Subtype pstage 0.5506751 0.0124902 0.1565996 0.2884961

4.2.3 RUN Module

Take a deep breath, we successfully enter the last RUN Module, and victory is in sight. In this module, MOVICS aims to characterize different subtypes by identifying their potential predictive biomarkers and functional pathways. Identifying and applying molecular biomarkers to predict subtype with efficiency is particularly important for disease management and treatment, thus improving clinical outcome. In this context, MOVICS searches for subtype-specific biomarkers by starting with differential expression analysis (DEA).

1) run differential expression analysis

MOVICS provides runDEA() function which embeds three state-of-the-art DEA approches to identify differentially expressed genes (DEGs), including edgeR and DESeq2 for RNA-Seq count data and limma for microarray profile or normalized expression data. Notably, since runDEA() checks the data scale automatically when choosing limma algorithm, it is recommended to provide a microarray expression profile or normalized expression data (e.g., RSEM, FPKM, TPM) without z-score or log2 transformation. This step is also quite time-consuming, especially in the case of using raw count data, so release your hand from the keyboard, close your eyes, and relax for a moment.

# run DEA with edgeR
runDEA(dea.method = "edger",
       expr       = count, # raw count data
       moic.res   = cmoic.brca,
       prefix     = "TCGA-BRCA") # prefix of figure name
#> --all samples matched.
#> --you choose edger and please make sure an RNA-Seq count data was provided.
#> edger of CS1_vs_Others done...
#> edger of CS2_vs_Others done...
#> edger of CS3_vs_Others done...
#> edger of CS4_vs_Others done...
#> edger of CS5_vs_Others done...

# run DEA with DESeq2
runDEA(dea.method = "deseq2",
       expr       = count,
       moic.res   = cmoic.brca,
       prefix     = "TCGA-BRCA")
#> --all samples matched.
#> --you choose deseq2 and please make sure an RNA-Seq count data was provided.
#> deseq2 of CS1_vs_Others done...
#> deseq2 of CS2_vs_Others done...
#> deseq2 of CS3_vs_Others done...
#> deseq2 of CS4_vs_Others done...
#> deseq2 of CS5_vs_Others done...

# run DEA with limma
runDEA(dea.method = "limma",
       expr       = fpkm, # normalized expression data
       moic.res   = cmoic.brca,
       prefix     = "TCGA-BRCA")
#> --all samples matched.
#> --you choose limma and please make sure a microarray profile or a normalized expression data [FPKM or TPM without log2 transformation is recommended] was provided.
#> --log2 transformation done for expression data.
#> limma of CS1_vs_Others done...
#> limma of CS2_vs_Others done...
#> limma of CS3_vs_Others done...
#> limma of CS4_vs_Others done...
#> limma of CS5_vs_Others done...

Each identified cancer subtype will be compared with the rest (Others) and the corresponding .txt file will be stored according to the argument of res.path. By default, these files will be saved under the current working directory.

2) run biomarker identification procedure

In this procedure, the most differentially expressed genes sorted by log2FoldChange are chosen as the biomarkers for each subtype (200 biomarkers for each subtype by default). These biomarkers should pass the significance threshold (e.g., nominal \(P\) value < 0.05 and adjusted \(P\) value < 0.05) and must not overlap with any biomarkers identified for other subtypes.

# choose edgeR result to identify subtype-specific up-regulated biomarkers
marker.up <- runMarker(moic.res      = cmoic.brca,
                       dea.method    = "edger", # name of DEA method
                       prefix        = "TCGA-BRCA", # MUST be the same of argument in runDEA()
                       dat.path      = getwd(), # path of DEA files
                       res.path      = getwd(), # path to save marker files
                       p.cutoff      = 0.05, # p cutoff to identify significant DEGs
                       p.adj.cutoff  = 0.05, # padj cutoff to identify significant DEGs
                       dirct         = "up", # direction of dysregulation in expression
                       n.marker      = 100, # number of biomarkers for each subtype
                       doplot        = TRUE, # generate diagonal heatmap
                       norm.expr     = fpkm, # use normalized expression as heatmap input
                       annCol        = annCol, # sample annotation in heatmap
                       annColors     = annColors, # colors for sample annotation
                       show_rownames = FALSE, # show no rownames (biomarker name)
                       fig.name      = "UPREGULATED BIOMARKER HEATMAP")
#> --all samples matched.
#> --log2 transformation done for expression data.
Figure 13. Heatmap of subtype-specific upregulated biomarkers using edgeR for 5 identified subtypes in TCGA-BRCA cohort.

Figure 13. Heatmap of subtype-specific upregulated biomarkers using edgeR for 5 identified subtypes in TCGA-BRCA cohort.

# check the upregulated biomarkers
head(marker.up$templates)
Table 7. Demo of subtype-specific upregulated biomarkers for 5 identified subtypes of breast cancer in TCGA-BRCA cohort.
probe class dirct
PNMT CS1 up
AKR1B15 CS1 up
DLK1 CS1 up
ACE2 CS1 up
CRISP3 CS1 up
ACSM1 CS1 up

Then try results derived from limma to identify subtype-specific downregulated biomarkers. Please feel free to try DESeq2.

# choose limma result to identify subtype-specific down-regulated biomarkers
marker.dn <- runMarker(moic.res      = cmoic.brca,
                       dea.method    = "limma",
                       prefix        = "TCGA-BRCA",
                       dirct         = "down",
                       n.marker      = 50, # switch to 50
                       doplot        = TRUE,
                       norm.expr     = fpkm,
                       annCol        = annCol,
                       annColors     = annColors,
                       fig.name      = "DOWNREGULATED BIOMARKER HEATMAP")
#> --all samples matched.
#> --log2 transformation done for expression data.
Figure 14. Heatmap of subtype-specific downregulated biomarkers using limma for 5 identified subtypes in TCGA-BRCA cohort.

Figure 14. Heatmap of subtype-specific downregulated biomarkers using limma for 5 identified subtypes in TCGA-BRCA cohort.

3) run gene set enrichment analysis

Similarly, GSEA is run for each subtype based on its corresponding DEA result to identify subtype-specific functional pathways. To this end, I have prepared a geneset background which includes all gene sets derived from GO biological processes (c5.bp.v7.1.symbols.gmt) from The Molecular Signatures Database (MSigDB, https://www.gsea-msigdb.org/gsea/msigdb/index.jsp). You can download other interested background for your own study.

# MUST locate ABSOLUTE path of msigdb file
MSIGDB.FILE <- system.file("extdata", "c5.bp.v7.1.symbols.xls", package = "MOVICS", mustWork = TRUE)

Likewise, these identified specific pathways should pass the significance threshold (e.g., nominal \(P\) value < 0.05 and adjusted \(P\) value < 0.25) and must not overlap with any pathways identified for other subtypes. After having the subtype-specific pathways, genes that are inside the pathways are retrieved to calculate a single sample enrichment score by using GSVA R package. Subsequently, subtype-specific enrichment score will be represented by the mean or median (mean by default) value within the subtype, and will be further visualized by diagonal heatmap.

# run GSEA to identify up-regulated GO pathways using results from edgeR
gsea.up <- runGSEA(moic.res     = cmoic.brca,
                   dea.method   = "edger", # name of DEA method
                   prefix       = "TCGA-BRCA", # MUST be the same of argument in runDEA()
                   dat.path     = getwd(), # path of DEA files
                   res.path     = getwd(), # path to save GSEA files
                   msigdb.path  = MSIGDB.FILE, # MUST be the ABSOLUTE path of msigdb file
                   norm.expr    = fpkm, # use normalized expression to calculate enrichment score
                   dirct        = "up", # direction of dysregulation in pathway
                   p.cutoff     = 0.05, # p cutoff to identify significant pathways
                   p.adj.cutoff = 0.25, # padj cutoff to identify significant pathways
                   gsva.method  = "gsva", # method to calculate single sample enrichment score
                   norm.method  = "mean", # normalization method to calculate subtype-specific enrichment score
                   fig.name     = "UPREGULATED PATHWAY HEATMAP")
#> --all samples matched.
#> GSEA done...
#> --log2 transformation done for expression data.
#> Estimating GSVA scores for 50 gene sets.
#> Estimating ECDFs with Gaussian kernels
#> 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=                                                                     |   2%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |====                                                                  |   6%
  |                                                                            
  |======                                                                |   8%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |========                                                              |  12%
  |                                                                            
  |==========                                                            |  14%
  |                                                                            
  |===========                                                           |  16%
  |                                                                            
  |=============                                                         |  18%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |===============                                                       |  22%
  |                                                                            
  |=================                                                     |  24%
  |                                                                            
  |==================                                                    |  26%
  |                                                                            
  |====================                                                  |  28%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |======================                                                |  32%
  |                                                                            
  |========================                                              |  34%
  |                                                                            
  |=========================                                             |  36%
  |                                                                            
  |===========================                                           |  38%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |=============================                                         |  42%
  |                                                                            
  |===============================                                       |  44%
  |                                                                            
  |================================                                      |  46%
  |                                                                            
  |==================================                                    |  48%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |====================================                                  |  52%
  |                                                                            
  |======================================                                |  54%
  |                                                                            
  |=======================================                               |  56%
  |                                                                            
  |=========================================                             |  58%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |===========================================                           |  62%
  |                                                                            
  |=============================================                         |  64%
  |                                                                            
  |==============================================                        |  66%
  |                                                                            
  |================================================                      |  68%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |==================================================                    |  72%
  |                                                                            
  |====================================================                  |  74%
  |                                                                            
  |=====================================================                 |  76%
  |                                                                            
  |=======================================================               |  78%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |=========================================================             |  82%
  |                                                                            
  |===========================================================           |  84%
  |                                                                            
  |============================================================          |  86%
  |                                                                            
  |==============================================================        |  88%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |================================================================      |  92%
  |                                                                            
  |==================================================================    |  94%
  |                                                                            
  |===================================================================   |  96%
  |                                                                            
  |===================================================================== |  98%
  |                                                                            
  |======================================================================| 100%
#> gsva done...
#> heatmap done...
Figure 15. Heatmap of subtype-specific upregulated pathways using edgeR algorithm for 5 identified subtypes in TCGA-BRCA cohort.

Figure 15. Heatmap of subtype-specific upregulated pathways using edgeR algorithm for 5 identified subtypes in TCGA-BRCA cohort.

Check some columns of GSEA results for the first subtype (CS1).

print(gsea.up$gsea.list$CS1[1:6,3:6])
Table 8. Demo of GSEA results for the first cancer subtype (CS1) of breast cancer in TCGA-BRCA cohort.
setSize enrichmentScore NES pvalue
GO_REGULATION_OF_ION_TRANSPORT 420 -0.4832370 -1.555051 0.0010040
GO_EMBRYONIC_MORPHOGENESIS 459 -0.4504464 -1.449967 0.0010050
GO_EPITHELIAL_CELL_DIFFERENTIATION 470 -0.4956895 -1.597430 0.0010050
GO_SYNAPTIC_SIGNALING 441 -0.5216787 -1.677297 0.0010050
GO_BEHAVIOR 392 -0.5060951 -1.625033 0.0010091
GO_REGULATION_OF_TRANS_SYNAPTIC_SIGNALING 293 -0.5122532 -1.631490 0.0010101

Also check results of subtype-specific enrichment scores.

head(round(gsea.up$grouped.es,3))
Table 9. Demo of subtype-specific enrichment scores among 5 identified subtypes of breast cancer in TCGA-BRCA cohort.
CS1 CS2 CS3 CS4 CS5
GO_BENZENE_CONTAINING_COMPOUND_METABOLIC_PROCESS 0.153 -0.092 0.142 0.154 -0.503
GO_REGULATION_OF_MONOCYTE_CHEMOTAXIS 0.392 -0.474 -0.329 0.260 0.040
GO_INDOLE_CONTAINING_COMPOUND_METABOLIC_PROCESS 0.191 -0.457 -0.425 0.144 0.430
GO_POSITIVE_REGULATION_OF_MONONUCLEAR_CELL_MIGRATION 0.358 -0.430 -0.236 0.223 -0.012
GO_AMINE_CATABOLIC_PROCESS 0.174 -0.379 -0.204 0.192 0.132
GO_ESTROGEN_METABOLIC_PROCESS 0.130 -0.210 -0.017 0.349 -0.305

Then try results derived from DESeq2 to identify subtype-specific downregulated pathways. Please feel free to try limma.

# run GSEA to identify down-regulated GO pathways using results from DESeq2
gsea.dn <- runGSEA(moic.res     = cmoic.brca,
                   dea.method   = "deseq2",
                   prefix       = "TCGA-BRCA",
                   msigdb.path  = MSIGDB.FILE,
                   norm.expr    = fpkm,
                   dirct        = "down",
                   p.cutoff     = 0.05,
                   p.adj.cutoff = 0.25,
                   gsva.method  = "ssgsea", # switch to ssgsea
                   norm.method  = "median", # switch to median
                   fig.name     = "DOWNREGULATED PATHWAY HEATMAP") 
#> --all samples matched.
#> GSEA done...
#> --log2 transformation done for expression data.
#> Estimating ssGSEA scores for 50 gene sets.
#> 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |                                                                      |   1%
  |                                                                            
  |=                                                                     |   1%
  |                                                                            
  |=                                                                     |   2%
  |                                                                            
  |==                                                                    |   2%
  |                                                                            
  |==                                                                    |   3%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |===                                                                   |   5%
  |                                                                            
  |====                                                                  |   5%
  |                                                                            
  |====                                                                  |   6%
  |                                                                            
  |=====                                                                 |   7%
  |                                                                            
  |=====                                                                 |   8%
  |                                                                            
  |======                                                                |   8%
  |                                                                            
  |======                                                                |   9%
  |                                                                            
  |=======                                                               |   9%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |=======                                                               |  11%
  |                                                                            
  |========                                                              |  11%
  |                                                                            
  |========                                                              |  12%
  |                                                                            
  |=========                                                             |  12%
  |                                                                            
  |=========                                                             |  13%
  |                                                                            
  |=========                                                             |  14%
  |                                                                            
  |==========                                                            |  14%
  |                                                                            
  |==========                                                            |  15%
  |                                                                            
  |===========                                                           |  15%
  |                                                                            
  |===========                                                           |  16%
  |                                                                            
  |============                                                          |  16%
  |                                                                            
  |============                                                          |  17%
  |                                                                            
  |============                                                          |  18%
  |                                                                            
  |=============                                                         |  18%
  |                                                                            
  |=============                                                         |  19%
  |                                                                            
  |==============                                                        |  19%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |==============                                                        |  21%
  |                                                                            
  |===============                                                       |  21%
  |                                                                            
  |===============                                                       |  22%
  |                                                                            
  |================                                                      |  22%
  |                                                                            
  |================                                                      |  23%
  |                                                                            
  |=================                                                     |  24%
  |                                                                            
  |=================                                                     |  25%
  |                                                                            
  |==================                                                    |  25%
  |                                                                            
  |==================                                                    |  26%
  |                                                                            
  |===================                                                   |  26%
  |                                                                            
  |===================                                                   |  27%
  |                                                                            
  |===================                                                   |  28%
  |                                                                            
  |====================                                                  |  28%
  |                                                                            
  |====================                                                  |  29%
  |                                                                            
  |=====================                                                 |  29%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |=====================                                                 |  31%
  |                                                                            
  |======================                                                |  31%
  |                                                                            
  |======================                                                |  32%
  |                                                                            
  |=======================                                               |  32%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |========================                                              |  34%
  |                                                                            
  |========================                                              |  35%
  |                                                                            
  |=========================                                             |  35%
  |                                                                            
  |=========================                                             |  36%
  |                                                                            
  |==========================                                            |  37%
  |                                                                            
  |==========================                                            |  38%
  |                                                                            
  |===========================                                           |  38%
  |                                                                            
  |===========================                                           |  39%
  |                                                                            
  |============================                                          |  39%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |============================                                          |  41%
  |                                                                            
  |=============================                                         |  41%
  |                                                                            
  |=============================                                         |  42%
  |                                                                            
  |==============================                                        |  42%
  |                                                                            
  |==============================                                        |  43%
  |                                                                            
  |==============================                                        |  44%
  |                                                                            
  |===============================                                       |  44%
  |                                                                            
  |===============================                                       |  45%
  |                                                                            
  |================================                                      |  45%
  |                                                                            
  |================================                                      |  46%
  |                                                                            
  |=================================                                     |  47%
  |                                                                            
  |=================================                                     |  48%
  |                                                                            
  |==================================                                    |  48%
  |                                                                            
  |==================================                                    |  49%
  |                                                                            
  |===================================                                   |  49%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |===================================                                   |  51%
  |                                                                            
  |====================================                                  |  51%
  |                                                                            
  |====================================                                  |  52%
  |                                                                            
  |=====================================                                 |  52%
  |                                                                            
  |=====================================                                 |  53%
  |                                                                            
  |======================================                                |  54%
  |                                                                            
  |======================================                                |  55%
  |                                                                            
  |=======================================                               |  55%
  |                                                                            
  |=======================================                               |  56%
  |                                                                            
  |========================================                              |  56%
  |                                                                            
  |========================================                              |  57%
  |                                                                            
  |========================================                              |  58%
  |                                                                            
  |=========================================                             |  58%
  |                                                                            
  |=========================================                             |  59%
  |                                                                            
  |==========================================                            |  59%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |==========================================                            |  61%
  |                                                                            
  |===========================================                           |  61%
  |                                                                            
  |===========================================                           |  62%
  |                                                                            
  |============================================                          |  62%
  |                                                                            
  |============================================                          |  63%
  |                                                                            
  |=============================================                         |  64%
  |                                                                            
  |=============================================                         |  65%
  |                                                                            
  |==============================================                        |  65%
  |                                                                            
  |==============================================                        |  66%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |===============================================                       |  68%
  |                                                                            
  |================================================                      |  68%
  |                                                                            
  |================================================                      |  69%
  |                                                                            
  |=================================================                     |  69%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |=================================================                     |  71%
  |                                                                            
  |==================================================                    |  71%
  |                                                                            
  |==================================================                    |  72%
  |                                                                            
  |===================================================                   |  72%
  |                                                                            
  |===================================================                   |  73%
  |                                                                            
  |===================================================                   |  74%
  |                                                                            
  |====================================================                  |  74%
  |                                                                            
  |====================================================                  |  75%
  |                                                                            
  |=====================================================                 |  75%
  |                                                                            
  |=====================================================                 |  76%
  |                                                                            
  |======================================================                |  77%
  |                                                                            
  |======================================================                |  78%
  |                                                                            
  |=======================================================               |  78%
  |                                                                            
  |=======================================================               |  79%
  |                                                                            
  |========================================================              |  79%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |========================================================              |  81%
  |                                                                            
  |=========================================================             |  81%
  |                                                                            
  |=========================================================             |  82%
  |                                                                            
  |==========================================================            |  82%
  |                                                                            
  |==========================================================            |  83%
  |                                                                            
  |==========================================================            |  84%
  |                                                                            
  |===========================================================           |  84%
  |                                                                            
  |===========================================================           |  85%
  |                                                                            
  |============================================================          |  85%
  |                                                                            
  |============================================================          |  86%
  |                                                                            
  |=============================================================         |  86%
  |                                                                            
  |=============================================================         |  87%
  |                                                                            
  |=============================================================         |  88%
  |                                                                            
  |==============================================================        |  88%
  |                                                                            
  |==============================================================        |  89%
  |                                                                            
  |===============================================================       |  89%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |===============================================================       |  91%
  |                                                                            
  |================================================================      |  91%
  |                                                                            
  |================================================================      |  92%
  |                                                                            
  |=================================================================     |  92%
  |                                                                            
  |=================================================================     |  93%
  |                                                                            
  |==================================================================    |  94%
  |                                                                            
  |==================================================================    |  95%
  |                                                                            
  |===================================================================   |  95%
  |                                                                            
  |===================================================================   |  96%
  |                                                                            
  |====================================================================  |  97%
  |                                                                            
  |====================================================================  |  98%
  |                                                                            
  |===================================================================== |  98%
  |                                                                            
  |===================================================================== |  99%
  |                                                                            
  |======================================================================|  99%
  |                                                                            
  |======================================================================| 100%
#> ssgsea done...
#> heatmap done...
Figure 16. Heatmap of subtype-specific downregulated pathways using limma algorithm for 5 identified subtypes in TCGA-BRCA cohort.

Figure 16. Heatmap of subtype-specific downregulated pathways using limma algorithm for 5 identified subtypes in TCGA-BRCA cohort.

4) run gene set variation analysis

For all the new defined molecular subtypes, it is critical to depict its characteristics validated by different signatures of gene sets. MOVICS provides a simple function which uses gene set variation analysis to calculate enrichment score of each sample in each subtype based on given gene set list of interest. First, we must prepare a gene list of interest in hand saved as GMT format.

# MUST locate ABSOLUTE path of gene set file
GSET.FILE <- 
  system.file("extdata", "gene sets of interest.gmt", package = "MOVICS", mustWork = TRUE)

Then we can calculate single sample enrichment score based on specified method and given gene set of interest.

# run GSVA to estimate single sample enrichment score based on given gene set of interest
gsva.res <- 
  runGSVA(moic.res      = cmoic.brca,
          norm.expr     = fpkm,
          gset.gmt.path = GSET.FILE, # ABSOLUTE path of gene set file
          gsva.method   = "gsva", # method to calculate single sample enrichment score
          annCol        = annCol,
          annColors     = annColors,
          fig.path      = getwd(),
          fig.name      = "GENE SETS OF INTEREST HEATMAP",
          height        = 5,
          width         = 8)
#> --all samples matched.
#> --log2 transformation done for expression data.
#> Estimating GSVA scores for 21 gene sets.
#> Estimating ECDFs with Gaussian kernels
#> 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |===                                                                   |   5%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |==========                                                            |  14%
  |                                                                            
  |=============                                                         |  19%
  |                                                                            
  |=================                                                     |  24%
  |                                                                            
  |====================                                                  |  29%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |===========================                                           |  38%
  |                                                                            
  |==============================                                        |  43%
  |                                                                            
  |=================================                                     |  48%
  |                                                                            
  |=====================================                                 |  52%
  |                                                                            
  |========================================                              |  57%
  |                                                                            
  |===========================================                           |  62%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |==================================================                    |  71%
  |                                                                            
  |=====================================================                 |  76%
  |                                                                            
  |=========================================================             |  81%
  |                                                                            
  |============================================================          |  86%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |===================================================================   |  95%
  |                                                                            
  |======================================================================| 100%
#> gsva done...
Figure 17. Heatmap of enrichment score of gene set of interest for 5 identified subtypes in TCGA-BRCA cohort.

Figure 17. Heatmap of enrichment score of gene set of interest for 5 identified subtypes in TCGA-BRCA cohort.


# check raw enrichment score
print(gsva.res$raw.es[1:3,1:3])
#>                    BRCA-A03L-01A BRCA-A04R-01A BRCA-A075-01A
#> Adhesion              0.09351280    -0.2125523    0.10362446
#> Antigen_Processing    0.05451682    -0.3909617    0.05508321
#> B-Cell_Functions     -0.01931423    -0.6120864    0.18301567

# check z-scored and truncated enrichment score
print(gsva.res$scaled.es[1:3,1:3])
#>                    BRCA-A03L-01A BRCA-A04R-01A BRCA-A075-01A
#> Adhesion              0.28466782    -0.6600776     0.3158800
#> Antigen_Processing    0.12510682    -1.0000000     0.1267097
#> B-Cell_Functions     -0.04056456    -1.0000000     0.4561039

4) run nearest template prediction in external cohort

Oh wait, did we forget something? Yeah there is a dataset that has not been used yet, so let’s see if we can use those subtype-specific biomarkers to verify the current breast cancer subtypes in the external Yau cohort. In this part, our core purpose is to predict the possible subtypes of each sample in the external dataset. In most cases, this is a multi-classification problem, and the identified biomarkers may be difficult to be overall matched in the external cohort, so it might not be reliable to use model-based predictive algorithms. Therefore, MOVICS provides two model-free approaches for subtype prediction in validation cohort. First, MOVICS switches to nearest template prediction (NTP) which can be flexibly applied to cross-platform, cross-species, and multiclass predictions without any optimization of analysis parameters16. Only one thing have to do is generating a template file, which has been fortunately prepared already.

# run NTP in Yau cohort by using up-regulated biomarkers
yau.ntp.pred <- runNTP(expr       = brca.yau$mRNA.expr,
                       templates  = marker.up$templates, # the template has been already prepared in runMarker()
                       scaleFlag  = TRUE, # scale input data (by default)
                       centerFlag = TRUE, # center input data (by default)
                       doPlot     = TRUE, # to generate heatmap
                       fig.name   = "NTP HEATMAP FOR YAU") 
#> --original template has 500 biomarkers and 272 are matched in external expression profile.
#> cosine correlation distance
#> 682 samples; 5 classes; 39-66 features/class
#> serial processing; 1000 permutation(s)...
#> predicted samples/class (FDR<0.05)
Figure 18. Heatmap of NTP in Yau cohort using subtype-specific upregulated biomarkers identified from TCGA-BRCA cohort

Figure 18. Heatmap of NTP in Yau cohort using subtype-specific upregulated biomarkers identified from TCGA-BRCA cohort

#> 
#>  CS1  CS2  CS3  CS4  CS5 <NA> 
#>  104   85   90  120  140  143
head(yau.ntp.pred$ntp.res)
Table 10. Demo of predicted subtypes in Yau cohort by NTP using subtype-specific upregulated biomarkers identified from TCGA-BRCA cohort.
prediction d.CS1 d.CS2 d.CS3 d.CS4 d.CS5 p.value FDR
107 CS2 0.7344 0.5165 0.7377 0.7471 0.7512 0.001 0.0020
109 CS1 0.5816 0.7581 0.7172 0.7308 0.7295 0.001 0.0020
11 CS1 0.6527 0.7307 0.7430 0.7557 0.7731 0.001 0.0020
110 CS2 0.7619 0.5418 0.7583 0.7655 0.7567 0.001 0.0020
111 CS1 0.6806 0.7157 0.7198 0.7889 0.7911 0.007 0.0106
113 CS4 0.6785 0.7505 0.7394 0.6389 0.7308 0.007 0.0106

Above an object yau.ntp.pred is prepared which is similar in structure of object returned by getMOIC() but only stores clust.res which can be passed to functions within COMP Module if additional data is available. For example, I herein first compare the survival outcome of the predicted 5 cancer subtypes in Yau cohort.

# compare survival outcome in Yau cohort
surv.yau <- compSurv(moic.res         = yau.ntp.pred,
                     surv.info        = brca.yau$clin.info,
                     convt.time       = "m", # switch to month
                     surv.median.line = "hv", # switch to both
                     fig.name         = "KAPLAN-MEIER CURVE OF NTP FOR YAU") 
#> --a total of 682 samples are identified.
#> --cut survival curve up to 10 years.
Figure 19. Kaplan-Meier survival curve of predicted 5 subtypes of breast cancer in Yau cohort.

Figure 19. Kaplan-Meier survival curve of predicted 5 subtypes of breast cancer in Yau cohort.


print(surv.yau)
#> $fitd
#> Call:
#> survdiff(formula = Surv(futime, fustat) ~ Subtype, data = mosurv.res, 
#>     na.action = na.exclude)
#> 
#>               N Observed Expected (O-E)^2/E (O-E)^2/V
#> Subtype=CS1 136       51     43.8     1.169     1.452
#> Subtype=CS2 117       51     38.5     4.037     4.876
#> Subtype=CS3 119       33     42.3     2.045     2.517
#> Subtype=CS4 159       43     57.3     3.590     4.820
#> Subtype=CS5 151       50     46.0     0.351     0.441
#> 
#>  Chisq= 11.2  on 4 degrees of freedom, p= 0.02 
#> 
#> $fit
#> Call: survfit(formula = Surv(futime, fustat) ~ Subtype, data = mosurv.res, 
#>     na.action = na.exclude, error = "greenwood", type = "kaplan-meier", 
#>     conf.type = "plain")
#> 
#>       n events median 0.95LCL 0.95UCL
#> CS1 136     51     NA      NA      NA
#> CS2 117     51    205     109      NA
#> CS3 119     33    236      NA      NA
#> CS4 159     43    222     191      NA
#> CS5 151     50     NA      NA      NA
#> 
#> $xyrs.est
#> [1] "[Not Available]: argument of xyrs.est was not specified."
#> 
#> $overall.p
#> [1] 0.02410534
#> 
#> $pairwise.p
#> 
#>  Pairwise comparisons using Log-Rank test 
#> 
#> data:  mosurv.res and Subtype 
#> 
#>     CS1   CS2   CS3   CS4  
#> CS2 0.699 -     -     -    
#> CS3 0.158 0.059 -     -    
#> CS4 0.100 0.039 0.901 -    
#> CS5 0.804 0.504 0.244 0.158
#> 
#> P value adjustment method: BH

I further check the agreement between the predicted subtype and PAM50 classification.

# compare agreement in Yau cohort
agree.yau <- compAgree(moic.res  = yau.ntp.pred,
                       subt2comp = brca.yau$clin.info[, "PAM50", drop = FALSE],
                       doPlot    = TRUE,
                       fig.name  = "YAU PREDICTEDMOIC WITH PAM50")
#> --all samples matched.
Figure 20. Agreement of predicted 5 subtypes of breast cancer with PAM50 classification in Yau cohort.

Figure 20. Agreement of predicted 5 subtypes of breast cancer with PAM50 classification in Yau cohort.

print(agree.yau)
Table 11. Agreement of 5 predicted subtypes of breast cancer with PAM50 classification in Yau cohort.
current.subtype other.subtype RI AMI JI FM
Subtype PAM50 0.7830213 0.3864876 0.3318526 0.4994425

It is clearly that the predicted subtypes of breast cancer in Yau cohort can distinguish prognosis well and also show similar pattern of agreement with PAM50 classification as compared to TCGA-BRCA to some extent.

5) run partition around medoids classifier

In addition to NTP, MOVICS provides another model-free approach to predict subtypes. To be specific, runPAM() first trains a partition around medoids (PAM) classifier in the discovery (training) cohort (i.e., TCGA-BRCA) to predict the subtype for patients in the external validation (testing) cohort (i.e., BRCA-Yau), and each sample in the validation cohort was assigned to a subtype label whose centroid had the highest Pearson correlation with the sample17. Finally, the in-group proportion (IGP) statistic will be performed to evaluate the similarity and reproducibility of the acquired subtypes between discovery and validation cohorts18.

yau.pam.pred <- runPAM(train.expr  = fpkm,
                       moic.res    = cmoic.brca,
                       test.expr   = brca.yau$mRNA.expr)
#> --all samples matched.
#> --a total of 7303 genes shared and used.
#> --log2 transformation done for training expression data.
#> --testing expression profile seems to have been standardised (z-score or log transformation), no more action will be performed.

The yau.pam.pred object also stores clust.res which can be passed to other functions, and the users can check IGP information like below:

print(yau.pam.pred$IGP)
#>       CS1       CS2       CS3       CS4       CS5 
#> 0.4545455 0.6416667 0.5616438 0.7814570 0.9225806

6) run consistency evaluation using Kappa statistics

Want to know how accurate the NTP or PAM is when using discovery cohort? Want to know how consistent the different prediction results are? Use runKappa() then:

# predict subtype in discovery cohort using NTP
tcga.ntp.pred <- runNTP(expr      = fpkm,
                        templates = marker.up$templates,
                        doPlot    = FALSE) 
#> --original template has 500 biomarkers and 500 are matched in external expression profile.
#> cosine correlation distance
#> 643 samples; 5 classes; 100-100 features/class
#> serial processing; 1000 permutation(s)...
#> predicted samples/class (FDR<0.05)
#> 
#>  CS1  CS2  CS3  CS4  CS5 <NA> 
#>   99  105  138  155  107   39

# predict subtype in discovery cohort using PAM
tcga.pam.pred <- runPAM(train.expr  = fpkm,
                        moic.res    = cmoic.brca,
                        test.expr   = fpkm)
#> --all samples matched.
#> --a total of 13771 genes shared and used.
#> --log2 transformation done for training expression data.
#> --log2 transformation done for testing expression data.

# check consistency between current and NTP-predicted subtype in discovery TCGA-BRCA
runKappa(subt1     = cmoic.brca$clust.res$clust,
         subt2     = tcga.ntp.pred$clust.res$clust,
         subt1.lab = "CMOIC",
         subt2.lab = "NTP",
         fig.name  = "CONSISTENCY HEATMAP FOR TCGA between CMOIC and NTP")

# check consistency between current and PAM-predicted subtype in discovery TCGA-BRCA
runKappa(subt1     = cmoic.brca$clust.res$clust,
         subt2     = tcga.pam.pred$clust.res$clust,
         subt1.lab = "CMOIC",
         subt2.lab = "PAM",
         fig.name  = "CONSISTENCY HEATMAP FOR TCGA between CMOIC and PAM")

# check consistency between NTP and PAM-predicted subtype in validation Yau-BRCA
runKappa(subt1     = yau.ntp.pred$clust.res$clust,
         subt2     = yau.pam.pred$clust.res$clust,
         subt1.lab = "NTP",
         subt2.lab = "PAM",
         fig.name  = "CONSISTENCY HEATMAP FOR YAU")
Figure 21. Consistency heatmap using Kappa statistics.Figure 21. Consistency heatmap using Kappa statistics.Figure 21. Consistency heatmap using Kappa statistics.

Figure 21. Consistency heatmap using Kappa statistics.

5. Little Trick

Indeed, the moic.res parameter is necessary for almost all the functions, especially for downstream analyses. You may deem that if the moic.res object is not obtained properly through the module of GET, downstream analyses cannot run smoothly. But in fact, you can fool downstream modules (COMP and RUN) with little tricks. The core information stored in moic.res is a data.frame named clust.res with a samID column (character) for samples name (should be exactly the same with row names) and a clust column (integer) for subtype indicator. Another information required for some functions in downstream analyses is mo.method, a string value that is usually considered as prefix for output files. For example, if I want to check the prognostic value of PAM50 classification, the only thing I have to prepare is a pseudo moic.res object with a customized mo.method just like below:

# include original clinical information as `clust.res` and a string value for `mo.method` to a list
pseudo.moic.res                 <- list("clust.res" = surv.info,
                                        "mo.method" = "PAM50")

# make pseudo samID
pseudo.moic.res$clust.res$samID <- rownames(pseudo.moic.res$clust.res)

# make pseudo clust using a mapping relationship
pseudo.moic.res$clust.res$clust <- sapply(pseudo.moic.res$clust.res$PAM50,
                                          switch,
                                          "Basal"   = 1, # relabel Basal as 1
                                          "Her2"    = 2, # relabel Her2 as 2
                                          "LumA"    = 3, # relabel LumA as 3
                                          "LumB"    = 4, # relabel LumnB as 4
                                          "Normal"  = 5) # relabel Normal as 5

Let’s check how the pseudo moic.res object looks like.

head(pseudo.moic.res$clust.res)
#>               fustat futime PAM50 pstage age         samID clust
#> BRCA-A03L-01A      0   2442  LumA     T3  34 BRCA-A03L-01A     3
#> BRCA-A04R-01A      0   2499  LumB     T1  36 BRCA-A04R-01A     4
#> BRCA-A075-01A      0    518  LumB     T2  42 BRCA-A075-01A     4
#> BRCA-A08O-01A      0    943  LumA     T2  45 BRCA-A08O-01A     3
#> BRCA-A0A6-01A      0    640  LumA     T2  64 BRCA-A0A6-01A     3
#> BRCA-A0AD-01A      0   1157  LumA     T1  83 BRCA-A0AD-01A     3

Ok everything is ready, just keep in mind how your interested subtypes are mapped and feel free to use such pseudo object to perform downstream analyses such as compSurv() as follows:

# survival comparison
pam50.brca <- compSurv(moic.res         = pseudo.moic.res,
                       surv.info        = surv.info,
                       convt.time       = "y", # convert day unit to year
                       surv.median.line = "h", # draw horizontal line at median survival
                       fig.name         = "KAPLAN-MEIER CURVE OF PAM50 BY PSEUDO")
#> --a total of 643 samples are identified.
#> --removed missing values.
#> --leaving 642 observations.
#> --cut survival curve up to 10 years.
Figure 22. Kaplan-Meier survival curve of PAM50 subtypes of breast cancer with pseudo input in TCGA-BRCA cohort.

Figure 22. Kaplan-Meier survival curve of PAM50 subtypes of breast cancer with pseudo input in TCGA-BRCA cohort.

6. Summary

I have introduced the pipeline and described the functions included in MOVICS in almost detail. This package is rather young and I hope to improve it further in the future, along with more functionality. If you have any questions, bug reports, or suggestions for improving MOVICS, please email them to . Feedback is crucial for improving a package that tries to seamlessly incorporate functionality and flexibility from many other useful tools.

7. Session Information

sessionInfo()
#> R version 4.0.2 (2020-06-22)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18362)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=Chinese (Simplified)_China.936 
#> [2] LC_CTYPE=Chinese (Simplified)_China.936   
#> [3] LC_MONETARY=Chinese (Simplified)_China.936
#> [4] LC_NUMERIC=C                              
#> [5] LC_TIME=Chinese (Simplified)_China.936    
#> 
#> attached base packages:
#> [1] parallel  stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#> [1] MOVICS_0.99.5       Biobase_2.48.0      BiocGenerics_0.34.0
#> [4] kableExtra_1.2.1    knitr_1.29          dplyr_1.0.2        
#> 
#> loaded via a namespace (and not attached):
#>   [1] corpcor_1.6.9               aricode_1.0.0              
#>   [3] class_7.3-17                ps_1.3.4                   
#>   [5] foreach_1.5.0               rprojroot_1.3-2            
#>   [7] crayon_1.3.4                MASS_7.3-52                
#>   [9] nlme_3.1-149                backports_1.1.9            
#>  [11] sva_3.36.0                  impute_1.62.0              
#>  [13] GOSemSim_2.14.2             rlang_0.4.7                
#>  [15] svd_0.5                     XVector_0.28.0             
#>  [17] readxl_1.3.1                heatmap.plus_1.3           
#>  [19] irlba_2.3.3                 nloptr_1.2.2.2             
#>  [21] callr_3.4.4                 limma_3.44.3               
#>  [23] BiocParallel_1.22.0         rjson_0.2.20               
#>  [25] bit64_4.0.5                 glue_1.4.2                 
#>  [27] rngtools_1.5                SNFtool_2.3.0              
#>  [29] processx_3.4.4              AnnotationDbi_1.50.3       
#>  [31] oompaData_3.1.1             DOSE_3.14.0                
#>  [33] haven_2.3.1                 tidyselect_1.1.0           
#>  [35] SummarizedExperiment_1.18.2 km.ci_0.5-2                
#>  [37] usethis_1.6.1               rio_0.5.16                 
#>  [39] XML_3.99-0.5                tidyr_1.1.2                
#>  [41] zoo_1.8-8                   ggpubr_0.4.0               
#>  [43] maftools_2.4.12             ridge_2.7                  
#>  [45] xtable_1.8-4                magrittr_1.5               
#>  [47] prettyGraphs_2.1.6          evaluate_0.14              
#>  [49] bibtex_0.4.2.2              ggplot2_3.3.2              
#>  [51] cli_2.0.2                   zlibbioc_1.34.0            
#>  [53] rstudioapi_0.11             fastmatch_1.1-0            
#>  [55] ClassDiscovery_3.3.12       InterSIM_2.2.0             
#>  [57] shiny_1.5.0                 GSVA_1.36.2                
#>  [59] prettydoc_0.4.0             xfun_0.16                  
#>  [61] coca_1.1.0                  clue_0.3-57                
#>  [63] pkgbuild_1.1.0              cluster_2.1.0              
#>  [65] caTools_1.18.0              tidygraph_1.2.0            
#>  [67] tibble_3.0.3                flexclust_1.4-0            
#>  [69] ggrepel_0.8.2               permute_0.9-5              
#>  [71] png_0.1-7                   withr_2.2.0                
#>  [73] bitops_1.0-6                ggforce_0.3.2              
#>  [75] plyr_1.8.6                  cellranger_1.1.0           
#>  [77] GSEABase_1.50.1             e1071_1.7-3                
#>  [79] survey_4.0                  pillar_1.4.6               
#>  [81] RcppParallel_5.0.2          gplots_3.0.4               
#>  [83] GlobalOptions_0.1.2         geepack_1.3-1              
#>  [85] fs_1.5.0                    iClusterPlus_1.24.0        
#>  [87] GetoptLong_1.0.2            graphite_1.34.0            
#>  [89] europepmc_0.4               clusterProfiler_3.16.1     
#>  [91] clusterRepro_0.9            vctrs_0.3.4                
#>  [93] ellipsis_0.3.1              generics_0.0.2             
#>  [95] urltools_1.7.3              devtools_2.3.1             
#>  [97] NMF_0.23.0                  tools_4.0.2                
#>  [99] foreign_0.8-80              entropy_1.2.1              
#> [101] munsell_0.5.0               tweenr_1.0.1               
#> [103] fgsea_1.14.0                DelayedArray_0.14.1        
#> [105] fastmap_1.0.1               compiler_4.0.2             
#> [107] pkgload_1.1.0               abind_1.4-5                
#> [109] httpuv_1.5.4                sessioninfo_1.1.1          
#> [111] pkgmaker_0.31.1             GenomeInfoDbData_1.2.3     
#> [113] gridExtra_2.3               edgeR_3.30.3               
#> [115] lattice_0.20-41             jstable_0.9.6              
#> [117] later_1.1.0.1               alluvial_0.1-2             
#> [119] jsonlite_1.7.0              scales_1.1.1               
#> [121] graph_1.66.0                carData_3.0-4              
#> [123] genefilter_1.70.0           promises_1.1.1             
#> [125] car_3.0-9                   doParallel_1.0.15          
#> [127] checkmate_2.0.0             rmarkdown_2.3              
#> [129] openxlsx_4.1.5              cowplot_1.1.0              
#> [131] statmod_1.4.34              webshot_0.5.2              
#> [133] forcats_0.5.0               downloader_0.4             
#> [135] igraph_1.2.5                survival_3.2-3             
#> [137] IntNMF_1.2.0                yaml_2.2.1                 
#> [139] htmltools_0.5.0             memoise_1.1.0              
#> [141] modeltools_0.2-23           locfit_1.5-9.4             
#> [143] graphlayouts_0.7.0          IRanges_2.22.2             
#> [145] viridisLite_0.3.0           digest_0.6.25              
#> [147] assertthat_0.2.1            mime_0.9                   
#> [149] rappdirs_0.3.1              registry_0.5-1             
#> [151] KMsurv_0.1-5                RSQLite_2.2.0              
#> [153] remotes_2.2.0               data.table_1.13.0          
#> [155] vegan_2.5-6                 blob_1.2.1                 
#> [157] mogsa_1.22.1                S4Vectors_0.26.1           
#> [159] CMScaller_0.99.2            preprocessCore_1.50.0      
#> [161] survMisc_0.5.5              labeling_0.3               
#> [163] shinythemes_1.1.2           splines_4.0.2              
#> [165] pamr_1.56.1                 RCurl_1.98-1.2             
#> [167] broom_0.7.0                 hms_0.5.3                  
#> [169] colorspace_1.4-1            ConsensusClusterPlus_1.52.0
#> [171] BiocManager_1.30.10         GenomicRanges_1.40.0       
#> [173] shape_1.4.4                 Rcpp_1.0.5                 
#> [175] mclust_5.4.6                circlize_0.4.10            
#> [177] enrichplot_1.8.1            fansi_0.4.1                
#> [179] R6_2.4.1                    survminer_0.4.8            
#> [181] grid_4.0.2                  ggridges_0.5.2             
#> [183] lifecycle_0.2.0             labelled_2.6.0             
#> [185] oompaBase_3.2.9             zip_2.1.1                  
#> [187] curl_4.3                    ggsignif_0.6.0             
#> [189] minqa_1.2.4                 gdata_2.18.0               
#> [191] PINSPlus_2.0.5              testthat_2.3.2             
#> [193] DO.db_2.9                   Matrix_1.2-18              
#> [195] qvalue_2.20.0               desc_1.2.0                 
#> [197] RColorBrewer_1.1-2          iterators_1.0.12           
#> [199] stringr_1.4.0               officer_0.3.15.003         
#> [201] polyclip_1.10-0             triebeard_0.3.0            
#> [203] purrr_0.3.4                 CIMLR_1.0.0                
#> [205] gridGraphics_0.5-0          rvest_0.3.6                
#> [207] ComplexHeatmap_2.5.5        mgcv_1.8-33                
#> [209] patchwork_1.0.1             bdsmatrix_1.3-4            
#> [211] codetools_0.2-16            matrixStats_0.57.0         
#> [213] GO.db_3.11.4                FNN_1.1.3                  
#> [215] gtools_3.8.2                prettyunits_1.1.1          
#> [217] gridBase_0.4-7              GenomeInfoDb_1.24.2        
#> [219] gtable_0.3.0                DBI_1.1.0                  
#> [221] ggpmisc_0.3.6               ggalluvial_0.12.2          
#> [223] stats4_4.0.2                highr_0.8                  
#> [225] httr_1.4.2                  KernSmooth_2.23-17         
#> [227] stringi_1.4.6               progress_1.2.2             
#> [229] uuid_0.1-4                  reshape2_1.4.4             
#> [231] farver_2.0.3                annotate_1.66.0            
#> [233] viridis_0.5.1               magick_2.4.0               
#> [235] coxme_2.2-16                xml2_1.3.2                 
#> [237] rvcheck_0.1.8               boot_1.3-25                
#> [239] lme4_1.1-23                 geneplotter_1.66.0         
#> [241] ggplotify_0.0.5             DESeq2_1.28.1              
#> [243] bit_4.0.4                   scatterpie_0.1.4           
#> [245] ExPosition_2.8.23           ggraph_2.0.3               
#> [247] pkgconfig_2.0.3             rstatix_0.6.0              
#> [249] mitools_2.4                 tableone_0.12.0

8. Citing MOVICS

MOVICS is a wrapper incorporating algorithms and functions from many other sources. The majority of functions used in MOVICS are incorporated directly from other packages or are drawn from specific published algorithms. Hence, below I provide guidance on which papers should be cited alongside MOVICS when using these functions:

cloud

For now, if you use MOVICS R package in published research, please cite:

  • Lu, X., Meng, J., Zhou, Y., Jiang, L., and Yan, F. (2020). MOVICS: an R package for multi-omics integration and visualization in cancer subtyping. bioRxiv, 2020.2009.2015.297820. [doi.org/10.1101/2020.09.15.297820]

Please cite corresponding literature below for multi-omic clustering algorithm if used:

  • CIMLR: Ramazzotti D, Lal A, Wang B, Batzoglou S, Sidow A (2018). Multi-omic tumor data reveal diversity of molecular mechanisms that correlate with survival. Nat Commun, 9(1):4453.
  • iClusterBayes: Mo Q, Shen R, Guo C, Vannucci M, Chan KS, Hilsenbeck SG (2018). A fully Bayesian latent variable model for integrative clustering analysis of multi-type omics data. Biostatistics, 19(1):71-86.
  • Mocluster: Meng C, Helm D, Frejno M, Kuster B (2016). moCluster: Identifying Joint Patterns Across Multiple Omics Data Sets. J Proteome Res, 15(3):755-765.
  • COCA: Hoadley KA, Yau C, Wolf DM, et al (2014). Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell, 158(4):929-944.
  • ConsensusClustering: Monti S, Tamayo P, Mesirov J, et al (2003). Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data. Mach Learn, 52:91-118.
  • IntNMF: Chalise P, Fridley BL (2017). Integrative clustering of multi-level omic data based on non-negative matrix factorization algorithm. PLoS One, 12(5):e0176278.
  • LRAcluster: Wu D, Wang D, Zhang MQ, Gu J (2015). Fast dimension reduction and integrative clustering of multi-omics data using low-rank approximation: application to cancer molecular classification. BMC Genomics, 16(1):1022.
  • NEMO: Rappoport N, Shamir R (2019). NEMO: cancer subtyping by integration of partial multi-omic data. Bioinformatics, 35(18):3348-3356.
  • PINSPlus: Nguyen H, Shrestha S, Draghici S, Nguyen T (2019). PINSPlus: a tool for tumor subtype discovery in integrated genomic data. Bioinformatics, 35(16):2843-2846.
  • SNF: Wang B, Mezlini AM, Demir F, et al (2014). Similarity network fusion for aggregating data types on a genomic scale. Nat Methods, 11(3):333-337.

Heatmap generated by MOVICS uses ComplexHeatmap R package, so please cite if any heatmap is created:

  • Gu Z, Eils R, Schlesner M (2016). Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics, 32(18):2847–2849.

If FLAGS is removed when calculating TMB, please cite the following two at the same time, otherwise only the first one:

  • Mayakonda A, Lin D C, Assenov Y, et al. (2018). Maftools: efficient and comprehensive analysis of somatic variants in cancer. Genome Res, 28(11):1747-1756.
  • Shyr C, Tarailo-Graovac M, Gottlieb M, Lee JJ, van Karnebeek C, Wasserman WW. (2014). FLAGS, frequently mutated genes in public exomes. BMC Med Genomics, 7(1): 1-14.

If calculating FGA, please cite:

  • Cerami E, Gao J, Dogrusoz U, et al. (2012). The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data. Cancer Discov, 2(5):401-404.
  • Gao J, Aksoy B A, Dogrusoz U, et al. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci Signal, 6(269):pl1-pl1.

Drug sensitivity analysis is based on pRRophetic R package, so please cite:

  • Geeleher P, Cox N, Huang R S (2014). pRRophetic: an R package for prediction of clinical chemotherapeutic response from tumor gene expression levels. PLoS One, 9(9):e107468.
  • Geeleher P, Cox N J, Huang R S (2014). Clinical drug response can be predicted using baseline gene expression levels and in vitro drug sensitivity in cell lines. Genome Biol, 15(3):1-12.

  Please cite corresponding literature below if using differential expression analysis:

  • edgeR:
    • Robinson MD, McCarthy DJ, Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1):139-140.
    • McCarthy DJ, Chen Y, Smyth GK (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res. 40(10):4288-4297.
  • DESeq2: Love MI, Huber W, Anders S (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol, 15(12):550-558.
  • limma: Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res, 43(7):e47.

  Gene set enrichment analysis uses clusterProfiler R package with the following paper:

  • Yu G, Wang L, Han Y, He Q (2012). clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS, 16(5):284-287.
  • Subramanian A, Tamayo P, Mootha V K, et al. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci, 102(43):15545-15550.

  For single sample enrichment analysis, please cite literature below appropriately:

  • ssgsea: Barbie, D.A. et al (2009). Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature, 462(5):108-112.
  • gsva: Hänzelmann, S., Castelo, R. and Guinney, J. (2013). GSVA: Gene set variation analysis for microarray and RNA-Seq data. BMC Bioinformatics, 14(1):7.
  • zscore: Lee, E. et al (2008). Inferring pathway activity toward precise disease classification. PLoS Comp Biol, 4(11):e1000217.
  • plage: Tomfohr, J. et al (2005). Pathway level analysis of gene expression using singular value decomposition. BMC Bioinformatics, 6(1):1-11.

External validation uses nearest template prediction as implanted in CMScaller R package. Please cite:

  • Eide P W, Bruun J, Lothe R A, et al. (2017). CMScaller: an R package for consensus molecular subtyping of colorectal cancer pre-clinical models. Sci Rep, 7(1):1-8.
  • Hoshida, Y. (2010). Nearest Template Prediction: A Single-Sample-Based Flexible Class Prediction with Confidence Assessment. PLoS One, 5(11):e15543.

If using partition around medoids classifier, then cite:

  • Tibshirani R, Hastie T, Narasimhan B and Chu G (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci, 99,6567–6572.
  • Kapp A V, Tibshirani R. (2007). Are clusters found in one dataset present in another dataset? Biostatistics, 8(1):9-31.

REFERENCES

  1. Pierre-Jean M, Deleuze J F, Le Floch E, et al (2019). Clustering and variable selection evaluation of 13 unsupervised methods for multi-omics data integration. Brief Bioinformatics.
  2. Rappoport N, Shamir R (2018). Multi-omic and multi-view clustering algorithms: review and cancer benchmark. Nucleic Acids Res, 46(20):10546-10562.
  3. Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature, 2012, 490(7418):61-78.
  4. Yau C, Esserman L, Moore D H, et al. (2010). A multigene predictor of metastatic outcome in early stage hormone receptor-negative and triple-negative breast cancer. Breast Cancer Res, 5(12):1-15.
  5. Chalise P, Fridley BL (2017). Integrative clustering of multi-level omic data based on non-negative matrix factorization algorithm. PLoS One, 12(5):e0176278.
  6. Tibshirani, R., Walther, G., Hastie, T. (2001). Estimating the number of data clusters via the Gap statistic. J R Stat Soc Series B Stat Methodol, 63(2):411-423.
  7. Strehl, Alexander; Ghosh, Joydeep. (2002). Cluster ensembles – a knowledge reuse framework for combining multiple partitions. J Mach Learn Res, 3(Dec):583–617.
  8. Le DT, Uram JN, Wang H, et al. (2015). PD-1 Blockade in tumors with mismatch-repair deficiency. N Engl J Med, 372(26):2509–2520.
  9. Davoli T, Uno H, Wooten E C, et al. (2017). Tumor aneuploidy correlates with markers of immune evasion and with reduced response to immunotherapy. Science, 355(6322):261-U75.
  10. Geeleher P, Cox N J, Huang R S (2014). Clinical drug response can be predicted using baseline gene expression levels and in vitro drug sensitivity in cell lines. Genome Biol, 15(3):1-12.
  11. Geeleher P, Cox N, Huang R S (2014). pRRophetic: an R package for prediction of clinical chemotherapeutic response from tumor gene expression levels. PLoS One, 9(9):e107468.
  12. Rand W M. (1971). Objective criteria for the evaluation of clustering methods. J Am Stat Assoc, 66(336):846-850.
  13. Vinh, N. X., Epps, J., Bailey, J. (2009). Information theoretic measures for clusterings comparison. Proceedings of the 26th Annual International Conference on Machine Learning - ICML ’09. p. 1.
  14. Jaccard Distance (Jaccard Index, Jaccard Similarity Coefficient). Dictionary of Bioinformatics and Computational Biology.
  15. Fowlkes E B, Mallows C L. (1983). A method for comparing two hierarchical clusterings. J Am Stat Assoc, 78(383):553-569.
  16. Hoshida, Y. (2010). Nearest Template Prediction: A Single-Sample-Based Flexible Class Prediction with Confidence Assessment. PLoS One, 5(11):e15543.
  17. Tibshirani R, Hastie T, Narasimhan B and Chu G (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci, 99(10):6567–6572.
  18. Kapp A V, Tibshirani R. (2007). Are clusters found in one dataset present in another dataset?, Biostatistics, 8(1):9-31.