Genome region pooling
By: Nathan Sheffield, University of Virginia
Contact: databio.org; nathan@code.databio.org
Workshop description and packages used
This workshop will introduce Bioconductor packages for analyzing epigenomic data types using collections of genomic regions, or region sets. A region set is a collection of genomic regions that share some biological annotation. Region sets are represented in Bioconductor by GRanges objects, which form the basis of many downstream analysis packages. The workshop focuses on three packages produced by the Sheffield lab that use region sets to perform different kinds of epigenome analysis:
LOLA – Locus Overlap Analysis. Provides functions for testing overlap of sets of genomic regions with public and custom region set (genomic ranges) databases. This makes it possible to do automated enrichment analysis for genomic region sets, thus facilitating interpretation of functional genomics and epigenomics data.
MIRA – Methylation-Based Inference of Regulatory Activity. MIRA aggregates genome-scale DNA methylation data into a DNA methylation profile for a given region set. Using this profile, MIRA infers and scores the collective regulatory activity for the region set. MIRA facilitates regulatory analysis in situations where classical regulatory assays would be difficult and allows public sources of region sets to be leveraged for novel insight into the regulatory state of DNA methylation datasets.
COCOA – Coordinate Covariation Analysis. COCOA is a method for understanding variation among samples. It can be used with any data that includes genomic coordinates, such as DNA methylation. COCOA uses principal component analysis (PCA) combined with a database of region sets to identify sources of variation among samples. COCOA works in both supervised (known groups of samples) and unsupervised (no groups) situations and can be used as a complement to differential methylation methods that find discrete differences between groups.
Prerequisites
- Basic knowledge of R syntax
- Familiarity with the GenomicRanges class
- Familiarity with BED files and biological concepts in epigenomics
Relevant background reading
- LOLA: enrichment analysis for genomic region sets and regulatory elements in R and Bioconductor. Bioinformatics (2016)
- MIRA: an R package for DNA methylation-based inference of regulatory activity. Bioinformatics (2018)
Workshop outline
Activity | Time |
---|---|
Intro to region concepts | 20m |
Introduction to LOLA | 10m |
LOLA vignette | 20m |
Introduction to MIRA | 10m |
MIRA vignette | 20m |
Introduction to COCOA | 10m |
COCOA vignette | 20m |
Workshop goals and objectives
Learning goals
- understand how region pooling methods bring power to genome analysis
- learn to use specific R packages to employ region pooling methods
Learning objectives
- identify the most enriched public region sets for a given query user set
- identify different aggregate DNA methylation patterns at different region sets
- annotate principal component axes using region sets for a provided dataset
LOLA bioconductor package
Biological data are often compared to reference databases in search of interesting patterns of enrichment and depletion. For example, gene set analysis has been pivotal for making connections between diverse types of genomic data. However, this method suffers from one major limitation: it requires gene-centric data. This is becoming increasingly limiting as our understanding of gene regulation advances. It has become evident that gene expression and chromatin organization are controlled by hundreds of thousands of enhancers and other functional elements, which are often difficult to map to gene symbols. The increasing emphasis on genomic region sets has been propelled by next-generation sequencing technology that produces data most naturally analyzed in the context of genomic regions – as peaks and segmentations. The research community has now established large catalogs of regulatory elements and other genomic features across many cell types. LOLA makes use of these catalogs to perform enrichment analysis of genomic ranges.
Preparing analysis
In this vignette, you’ll use small example datasets that come with the LOLA package to get a first look at the most common functions in a LOLA workflow.
You need 3 things to run a LOLA analysis:
- A region database.
- userSets: Regions of interest (at least 1 set of regions as a GRanges object, or multiple sets of regions of interest as a GRangesList object)
- userUniverse: The set of regions tested for inclusion in your sets of regions of interest (a single GRanges object)
Let’s load an example regionDB with loadRegionDB(). Here’s a small example that comes with LOLA. The database location should point to a folder that contains collection subfolders:
library("LOLA")
dbPath = system.file("extdata", "hg19", package="LOLA")
regionDB = loadRegionDB(dbPath)
The regionDB is an R (list) object that has a few elements:
names(regionDB)
## [1] "dbLocation" "regionAnno" "collectionAnno" "regionGRL"
- dbLocation: A string recording the location of the database folder you passed to loadRegionDB().
- collectionAnno: A data.table annotating the collections, with rows corresponding to the collection annotation files in the database.
- regionAnno: A data.table annotating each region set, with rows corresponding to BED files in the database (there is also a collection column recording which collection each region set belongs to).
- regionGRL: A GRangesList object holding the actual regions, with one list element per region set, ordered as in regionAnno.
Now with the database loaded, let’s load up some sample data (the regions of interest, and the tested universe):
data("sample_input", package="LOLA") # load userSets
data("sample_universe", package="LOLA") # load userUniverse
Now we have a GRangesList object called userSets and a GRanges object called userUniverse. This is all we need to run the enrichment calculation.
Run the analysis
runLOLA() will test the overlap between your userSets and each region set in the regionDB.
locResults = runLOLA(userSets, userUniverse, regionDB, cores=1)
runLOLA tests for pairwise overlap between each user set and each region set in regionDB. It then uses a Fisher’s exact test to assess significance of the overlap. The results are a data.table with several columns:
colnames(locResults)
head(locResults)
## [1] "userSet" "dbSet" "collection" "pValueLog" "oddsRatio"
## [6] "support" "rnkPV" "rnkOR" "rnkSup" "maxRnk"
## [11] "meanRnk" "b" "c" "d" "description"
## [16] "cellType" "tissue" "antibody" "treatment" "dataSource"
## [21] "filename" "qValue" "size"
## userSet dbSet collection pValueLog oddsRatio support rnkPV rnkOR
## 1: setB 2 ucsc_example 264.4863407 7.7578283 850 1 1
## 2: setA 2 ucsc_example 254.6188080 8.6487312 632 1 1
## 3: setB 1 ucsc_example 34.6073689 3.3494078 5747 2 2
## 4: setA 4 ucsc_example 1.7169689 1.2377725 124 2 2
## 5: setA 5 ucsc_example 1.7169689 1.2377725 124 2 2
## 6: setA 3 ucsc_example 0.1877354 0.9135696 8 4 4
## rnkSup maxRnk meanRnk b c d description
## 1: 2 2 1.33 452 4981 20546 ucsc_example
## 2: 2 2 1.33 670 2510 23017 ucsc_example
## 3: 1 2 1.67 20018 84 980 CpG islands from UCSC annotation
## 4: 3 3 2.33 761 3018 22926 ucsc_example
## 5: 3 3 2.33 761 3018 22926 ucsc_example
## 6: 5 5 4.33 66 3134 23621 ucsc_example
## cellType tissue antibody treatment dataSource
## 1: <NA> <NA> <NA> <NA> <NA>
## 2: <NA> <NA> <NA> <NA> <NA>
## 3: <NA> <NA> <NA> <NA> <NA>
## 4: <NA> <NA> <NA> <NA> <NA>
## 5: <NA> <NA> <NA> <NA> <NA>
## 6: <NA> <NA> <NA> <NA> <NA>
## filename qValue size
## 1: laminB1Lads.bed 3.263317e-264 1302
## 2: laminB1Lads.bed 1.202713e-254 1302
## 3: cpgIslandExt.bed 8.232086e-35 28691
## 4: vistaEnhancers.bed 3.837612e-02 1339
## 5: vistaEnhancers_colNames.bed 3.837612e-02 1340
## 6: numtSAssembled.bed 1.000000e+00 78
If you’re not familiar with how data.table works in R, it’s worth reading some of the documentation of this powerful package.

Columns userSet and dbSet are indexes into the respective GRangesList objects, identifying each pairwise comparison. There are a series of columns describing the results of the statistical test, such as pValueLog, oddsRatio, and the actual values from the contingency table (support is the overlap, and b, c, and d complete the 2x2 table). Rank columns simply rank the tests by pValueLog, oddsRatio, or support; following these are a series of columns annotating the database regions, depending on how you populated the index table in the regionDB folder.
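To see where these numbers come from, you can reconstruct the test for the top hit yourself. This is a sketch based on my reading of the output above: support plus b equals the database set size (850 + 452 = 1302, the laminB1Lads size), and support plus c equals the number of universe regions hitting the user set, so the one-sided Fisher p-value is a hypergeometric tail probability.

```r
# Rebuild the top hit (setB vs laminB1Lads.bed) from its 2x2 table:
# support = overlap; b, c, d complete the contingency table.
a <- 850; b <- 452; c <- 4981; d <- 20546

# Sample odds ratio, matching the oddsRatio column
oddsRatio <- (a * d) / (b * c)

# One-sided Fisher's exact test = hypergeometric upper tail.
# log.p avoids underflow for such extreme p-values.
pValueLog <- -phyper(a - 1, m = a + b, n = c + d, k = a + c,
                     lower.tail = FALSE, log.p = TRUE) / log(10)

round(oddsRatio, 2)  # close to the reported 7.76
round(pValueLog, 1)  # close to the reported 264.5
```

This also shows why pValueLog is reported on a -log10 scale: the raw p-value here is far too small to display comfortably.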
You can explore these results in R by, for example, ranking with different orders:
locResults[order(support, decreasing=TRUE),]
## userSet dbSet collection pValueLog oddsRatio support rnkPV rnkOR
## 1: setB 1 ucsc_example 3.460737e+01 3.3494078 5747 2 2
## 2: setA 1 ucsc_example 2.818334e-02 0.8704355 3002 5 5
## 3: setB 2 ucsc_example 2.644863e+02 7.7578283 850 1 1
## 4: setA 2 ucsc_example 2.546188e+02 8.6487312 632 1 1
## 5: setA 4 ucsc_example 1.716969e+00 1.2377725 124 2 2
## 6: setA 5 ucsc_example 1.716969e+00 1.2377725 124 2 2
## 7: setB 4 ucsc_example 0.000000e+00 0.3489379 80 4 3
## 8: setB 5 ucsc_example 0.000000e+00 0.3489379 80 4 3
## 9: setA 3 ucsc_example 1.877354e-01 0.9135696 8 4 4
## 10: setB 3 ucsc_example 9.184826e-06 0.2052377 4 3 5
## rnkSup maxRnk meanRnk b c d
## 1: 1 2 1.67 20018 84 980
## 2: 1 5 3.67 22763 140 924
## 3: 2 2 1.33 452 4981 20546
## 4: 2 2 1.33 670 2510 23017
## 5: 3 3 2.33 761 3018 22926
## 6: 3 3 2.33 761 3018 22926
## 7: 3 4 3.33 805 5751 20193
## 8: 3 4 3.33 805 5751 20193
## 9: 5 5 4.33 66 3134 23621
## 10: 5 5 4.33 70 5827 20928
## description cellType tissue antibody treatment
## 1: CpG islands from UCSC annotation <NA> <NA> <NA> <NA>
## 2: CpG islands from UCSC annotation <NA> <NA> <NA> <NA>
## 3: ucsc_example <NA> <NA> <NA> <NA>
## 4: ucsc_example <NA> <NA> <NA> <NA>
## 5: ucsc_example <NA> <NA> <NA> <NA>
## 6: ucsc_example <NA> <NA> <NA> <NA>
## 7: ucsc_example <NA> <NA> <NA> <NA>
## 8: ucsc_example <NA> <NA> <NA> <NA>
## 9: ucsc_example <NA> <NA> <NA> <NA>
## 10: ucsc_example <NA> <NA> <NA> <NA>
## dataSource filename qValue size
## 1: <NA> cpgIslandExt.bed 8.232086e-35 28691
## 2: <NA> cpgIslandExt.bed 1.000000e+00 28691
## 3: <NA> laminB1Lads.bed 3.263317e-264 1302
## 4: <NA> laminB1Lads.bed 1.202713e-254 1302
## 5: <NA> vistaEnhancers.bed 3.837612e-02 1339
## 6: <NA> vistaEnhancers_colNames.bed 3.837612e-02 1340
## 7: <NA> vistaEnhancers.bed 1.000000e+00 1339
## 8: <NA> vistaEnhancers_colNames.bed 1.000000e+00 1340
## 9: <NA> numtSAssembled.bed 1.000000e+00 78
## 10: <NA> numtSAssembled.bed 1.000000e+00 78
You can order by one of the rank columns:
locResults[order(maxRnk, decreasing=TRUE),]
## userSet dbSet collection pValueLog oddsRatio support rnkPV rnkOR
## 1: setA 3 ucsc_example 1.877354e-01 0.9135696 8 4 4
## 2: setA 1 ucsc_example 2.818334e-02 0.8704355 3002 5 5
## 3: setB 3 ucsc_example 9.184826e-06 0.2052377 4 3 5
## 4: setB 4 ucsc_example 0.000000e+00 0.3489379 80 4 3
## 5: setB 5 ucsc_example 0.000000e+00 0.3489379 80 4 3
## 6: setA 4 ucsc_example 1.716969e+00 1.2377725 124 2 2
## 7: setA 5 ucsc_example 1.716969e+00 1.2377725 124 2 2
## 8: setB 2 ucsc_example 2.644863e+02 7.7578283 850 1 1
## 9: setA 2 ucsc_example 2.546188e+02 8.6487312 632 1 1
## 10: setB 1 ucsc_example 3.460737e+01 3.3494078 5747 2 2
## rnkSup maxRnk meanRnk b c d
## 1: 5 5 4.33 66 3134 23621
## 2: 1 5 3.67 22763 140 924
## 3: 5 5 4.33 70 5827 20928
## 4: 3 4 3.33 805 5751 20193
## 5: 3 4 3.33 805 5751 20193
## 6: 3 3 2.33 761 3018 22926
## 7: 3 3 2.33 761 3018 22926
## 8: 2 2 1.33 452 4981 20546
## 9: 2 2 1.33 670 2510 23017
## 10: 1 2 1.67 20018 84 980
## description cellType tissue antibody treatment
## 1: ucsc_example <NA> <NA> <NA> <NA>
## 2: CpG islands from UCSC annotation <NA> <NA> <NA> <NA>
## 3: ucsc_example <NA> <NA> <NA> <NA>
## 4: ucsc_example <NA> <NA> <NA> <NA>
## 5: ucsc_example <NA> <NA> <NA> <NA>
## 6: ucsc_example <NA> <NA> <NA> <NA>
## 7: ucsc_example <NA> <NA> <NA> <NA>
## 8: ucsc_example <NA> <NA> <NA> <NA>
## 9: ucsc_example <NA> <NA> <NA> <NA>
## 10: CpG islands from UCSC annotation <NA> <NA> <NA> <NA>
## dataSource filename qValue size
## 1: <NA> numtSAssembled.bed 1.000000e+00 78
## 2: <NA> cpgIslandExt.bed 1.000000e+00 28691
## 3: <NA> numtSAssembled.bed 1.000000e+00 78
## 4: <NA> vistaEnhancers.bed 1.000000e+00 1339
## 5: <NA> vistaEnhancers_colNames.bed 1.000000e+00 1340
## 6: <NA> vistaEnhancers.bed 3.837612e-02 1339
## 7: <NA> vistaEnhancers_colNames.bed 3.837612e-02 1340
## 8: <NA> laminB1Lads.bed 3.263317e-264 1302
## 9: <NA> laminB1Lads.bed 1.202713e-254 1302
## 10: <NA> cpgIslandExt.bed 8.232086e-35 28691
And finally, record the results to file like this:
writeCombinedEnrichment(locResults, outFolder="lolaResults")
By default, this function will write the entire table to a tsv file. I recommend using the includeSplits parameter, which tells the function to also print out additional tables that are subsetted by userSet, so that each region set you test has its own result table. It just makes it a little easier to explore the results.
writeCombinedEnrichment(locResults, outFolder= "lolaResults", includeSplits=TRUE)
Exploring LOLA Results
Say you’d like to know which regions are responsible for the enrichment we see; or, in other words, you’d like to extract the regions that are actually overlapping a particular database region set. For this, you can use the function extractEnrichmentOverlaps():
oneResult = locResults[2,]
extractEnrichmentOverlaps(oneResult, userSets, regionDB)
## GRanges object with 632 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 18229570-19207602 *
## [2] chr1 35350878-35351854 *
## [3] chr1 38065507-38258622 *
## [4] chr1 38499473-39306315 *
## [5] chr1 42611485-42611691 *
## ... ... ... ...
## [628] chrX 125299245-125300436 *
## [629] chrX 136032577-138821238 *
## [630] chrX 139018365-148549454 *
## [631] chrX 154066672-154251301 *
## [632] chrY 2880166-7112793 *
## -------
## seqinfo: 69 sequences from an unspecified genome; no seqlengths
Extracting certain region sets from a database
If you have a large database, you may be interested in using the LOLA database format for other projects, or for additional follow-up analysis. In this case, you may be interested in just a single region set within a database, or perhaps just a few of them. LOLA provides a function to extract certain region sets from either a loaded or an unloaded database.
Say you just want an object with regions from the “vistaEnhancers” region set. You can grab it from a loaded database like this:
getRegionSet(regionDB, collections="ucsc_example", filenames="vistaEnhancers.bed")
## GRangesList object of length 1:
## [[1]]
## GRanges object with 1339 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## 1 chr1 3190582-3191428 *
## 2 chr1 8130440-8131887 *
## 3 chr1 10593124-10594209 *
## 4 chr1 10732071-10733118 *
## 5 chr1 10757665-10758631 *
## ... ... ... ...
## 1335 chrX 139380917-139382199 *
## 1336 chrX 139593503-139594774 *
## 1337 chrX 139674500-139675403 *
## 1338 chrX 147829017-147830159 *
## 1339 chrX 150407693-150409052 *
## -------
## seqinfo: 69 sequences from an unspecified genome; no seqlengths
Or, if you haven’t already loaded the database, you can just give the path to the database and LOLA will only load the specific region set(s) you are interested in. This can take more than one filename or collection:
getRegionSet(dbPath, collections="ucsc_example", filenames="vistaEnhancers.bed")
## GRangesList object of length 1:
## [[1]]
## GRanges object with 1339 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## 1 chr1 3190582-3191428 *
## 2 chr1 8130440-8131887 *
## 3 chr1 10593124-10594209 *
## 4 chr1 10732071-10733118 *
## 5 chr1 10757665-10758631 *
## ... ... ... ...
## 1335 chrX 139380917-139382199 *
## 1336 chrX 139593503-139594774 *
## 1337 chrX 139674500-139675403 *
## 1338 chrX 147829017-147830159 *
## 1339 chrX 150407693-150409052 *
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
Now that you have a basic idea of what the functions are, you can follow some other vignettes, such as Using LOLA Core, to see how this works on a realistic dataset.
The MIRA Bioconductor package
This vignette gives a broad overview of MIRA. You can find a realistic example workflow in “Applying MIRA to a Biological Question.”
MIRA (Methylation-based Inference of Regulatory Activity) is an R package that infers regulatory activity from DNA methylation data. It does this by aggregating DNA methylation data from a set of regions across the genome, producing a single summary profile of DNA methylation for those regions. MIRA then uses this profile to produce a score (the MIRA score) that infers the level of regulatory activity on the basis of the shape of the DNA methylation profile. The region set is provided by the user and should contain regions that correspond to a shared feature, such as binding of a particular transcription factor or DNase hypersensitivity sites.

The concept of MIRA relies on the observation that DNA methylation tends to be lower in regions where transcription factors are bound. Since DNA methylation will generally be lower in active regions, the shape of the MIRA profile and the associated score can be used as a metric to compare regulatory activity in different samples and conditions. MIRA thus allows one to predict transcription factor activity given only DNA methylation data as input. MIRA works with genome-scale, single-nucleotide-resolution methylation data, such as reduced representation bisulfite sequencing (RRBS) or whole genome bisulfite sequencing (WGBS) data, and it overcomes sparsity in DNA methylation data by aggregating across many regions.

Below are examples of methylation profiles and associated MIRA scores for two (contrived) samples. This vignette will demonstrate how to obtain a methylation profile and MIRA score from the starting inputs of DNA methylation data and a region set.
## featureID sampleName score sampleType
## 1: RegionSet1 Sample1 0.9808293 Condition1
## 2: RegionSet1 Sample2 0.2876821 Condition2
Required inputs
You need 2 things to run MIRA:
- Nucleotide-resolution DNA methylation data
- A set of genomic regions
Let’s describe each one in more detail:
DNA methylation data
MIRA requires DNA methylation data after methylation calling. For a given genomic coordinate (the location of the C in a CpG), MIRA needs the methylation level (0 to 1) or data that can be used to calculate this: counts of methylated reads and total reads. The total number of reads can be used for screening purposes. This data should be represented as a data.table for each sample, which we call a BSDT (bisulfite data.table). The BSDT will have these columns: chr, start (the coordinate of the C of the CpG), and methylProp (the methylation level at each cytosine). Alternatively, the methylProp column may be generated from methylCount (number of methylated reads) and coverage (total number of reads covering this site) columns, where methylProp = methylCount/coverage; these columns may be included in the data.table in addition to the methylProp column. When running MIRA on multiple samples, it is recommended to make each sample an item of a named list, with sample names as the names for the list.

Since some existing R packages for DNA methylation use different formats, we include a format conversion function (SummarizedExperimentToDataTable) that converts SummarizedExperiment-based objects, like those you would obtain from the bsseq, methylPipe, and BiSeq packages, to the necessary format for MIRA. A BSseq object is an acceptable input to MIRA and will be converted to the right format internally, but for the sake of parallelization it is recommended that BSseq objects be converted to the right format outside the MIRA function so that MIRA can be run on each sample in parallel. Here is an example of a data.table in the right format for input to MIRA:
data("exampleBSDT", package="MIRA")
head(exampleBSDT)
## chr start methylCount coverage
## 1: chr1 1062326 0 3
## 2: chr1 1062330 0 3
## 3: chr1 1062339 0 4
## 4: chr1 1062448 2 27
## 5: chr1 1062478 0 23
## 6: chr1 1062480 0 19
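The methylProp column this example BSDT lacks is simply methylCount divided by coverage. MIRA's addMethPropCol handles this for you, but as a sketch of the calculation (using a plain data.frame and the first rows shown above, retyped here for illustration):

```r
# First rows of the example BSDT, retyped as a plain data.frame
bsdt <- data.frame(
  chr         = rep("chr1", 3),
  start       = c(1062326, 1062330, 1062448),
  methylCount = c(0, 0, 2),
  coverage    = c(3, 3, 27)
)

# methylProp = methylated reads / total reads at each CpG
bsdt$methylProp <- bsdt$methylCount / bsdt$coverage
round(bsdt$methylProp, 3)  # 0.000 0.000 0.074
```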
Region sets
A region set is a GRanges object containing genomic regions that share a biological annotation. For example, it could be ChIP peaks for a transcription factor. Many types of region sets may be used with MIRA, including ChIP regions for transcription factors or histone modifications, promoters for a set of related genes, sites of motif matches, or DNase hypersensitivity sites. Many such region sets may be found in online repositories, and we have pulled together some major sources at http://databio.org/regiondb. At the end of this vignette, we show how to load a database of regions with a few lines of code using the LOLA package. You may also want to check out the AnnotationHub Bioconductor package, which gives access to many region sets in an R-friendly format. For use in MIRA, each region set should be a GRanges object, and multiple region sets may be passed to MIRA as a GRangesList with each list element being a region set. Here is an example of a region set, which we will use in this vignette:
data("exampleRegionSet", package="MIRA")
head(exampleRegionSet)
## GRanges object with 6 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## 2763 chr1 1064067-1064582 *
## 2433 chr1 1307973-1308488 *
## 2948 chr1 1374984-1375499 *
## 3845 chr1 1778779-1779294 *
## 430 chr1 1780451-1780653 *
## 2249 chr1 2578251-2578766 *
## -------
## seqinfo: 36 sequences from an unspecified genome; no seqlengths
Analysis workflow
The general workflow is as follows:
1. Data inputs: start with single-nucleotide resolution methylation data and one or more sets of genomic regions, as described above.
2. Expand the region sizes so that MIRA will be able to get a broad methylation profile surrounding your feature of interest. All regions should be expanded to the same final size.
3. Aggregate methylation data across regions to get a MIRA profile.
4. Calculate MIRA score based on shape of MIRA profile.
The input
First let’s load the necessary libraries and example data. exampleBSDT is a data.table containing genomic location, number of reads methylated (methylCount), and total read number (coverage) from bisulfite sequencing of a lymphoblastoid cell line. exampleRegionSet is a GRanges object with regions that share a biological annotation, in this case Nrf1 binding in a different lymphoblastoid cell line.
library(MIRA)
library(GenomicRanges) # for the `resize`, `width` functions
data("exampleRegionSet", package="MIRA")
data("exampleBSDT", package="MIRA")
While we only have one sample (Gm06990_1) in this example, we would normally want to have our samples in a named list, so we will use that format here as well:
BSDTList <- list(exampleBSDT)
names(BSDTList) <- "Gm06990_1"
Since our input methylation data did not have a column for the proportion of methylation at each site (methylProp), we need to add this based on the methylation count and total coverage data.
BSDTList <- addMethPropCol(BSDTList)
Expand your regions
A short but important step. MIRA scores are based on the difference between the methylation surrounding the regions and the methylation at the regions themselves, so we need to expand our regions to include surrounding bases. Let’s use the resize function from the GenomicRanges package to increase the size of our regions and make them all the same size. It is recommended to make all regions the same size so the final methylation profile will be easier to interpret. The average size of our regions is currently about 500 bp:
mean(width(exampleRegionSet))
## [1] 498.788
For this vignette, 4000 bp regions should be wide enough to capture the dip:
exampleRegionSet <- resize(exampleRegionSet, 4000, fix="center")
mean(width(exampleRegionSet))
## [1] 4000
Normally, we want to have each region set (only one in our case) be an item in a list, with corresponding names given to the list:
exampleRegionSetGRL <- GRangesList(exampleRegionSet)
names(exampleRegionSetGRL) <- "lymphoblastoid_NRF1"
Aggregating methylation across regions
Next we aggregate methylation across regions to get a summary methylation profile with the aggregateMethyl
function. MIRA divides each region into bins, aggregates methylation levels for cytosines within a given bin for each region individually, and then aggregates matching bins over all the regions (all 1st bins together, 2nd bins together, etc.). A couple important parameters: The binNum argument determines how many approximately equally sized bins each region is split into. This could affect the resolution or noisiness of the MIRA profile because using more bins will result in smaller bin sizes (potentially increasing resolution) but also less reads per bin (potentially increasing noise). The minBaseCovPerBin argument in aggregateMethyl is used to screen out region sets that have any bins with less than a minimum coverage. Here we use the default (500). Let’s aggregate the methylation and then view the MIRA profile.
bigBin <- lapply(X=BSDTList, FUN=aggregateMethyl, GRList=exampleRegionSetGRL,
binNum=11)
bigBinDT <- bigBin[[1]]
We add the sample name to the data.table then plot the MIRA profile:
sampleName = rep(names(bigBin), nrow(bigBinDT))
bigBinDT = cbind(bigBinDT, sampleName)
plotMIRAProfiles(binnedRegDT=bigBinDT)
Calculating the MIRA score
To calculate MIRA scores based on the MIRA profiles, we will apply the scoring function, calcMIRAScore, to the data.table that contains the methylation aggregated in bins. calcMIRAScore calculates a score for each group of bins corresponding to a sample and region set, based on the degree of the dip in the profile. With MIRA’s default scoring method (logratio), calcMIRAScore will take the log of the ratio of the outside edges of the dip (identified by calcMIRAScore) to the center of the dip. Higher MIRA scores are associated with deeper dips. A flat MIRA profile would have a score of 0.
sampleScores <- calcMIRAScore(bigBinDT,
regionSetIDColName="featureID",
sampleIDColName="sampleName")
head(sampleScores)
## featureID sampleName score
## 1: lymphoblastoid_NRF1 Gm06990_1 1.393596
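The logratio idea can be illustrated on a toy profile. This simplified sketch skips calcMIRAScore's edge detection and just compares the outermost bins to the center bin; it only shows why a deep dip yields a high score and a flat profile yields 0:

```r
# A toy 11-bin methylation profile with a dip in the center
profile <- c(0.8, 0.8, 0.75, 0.5, 0.3, 0.2, 0.3, 0.5, 0.75, 0.8, 0.8)

# Simplified logratio: edges (mean of outermost bins) vs. center bin
edges  <- mean(profile[c(1, length(profile))])
center <- profile[ceiling(length(profile) / 2)]
log(edges / center)  # log(0.8 / 0.2) = log(4), about 1.39

# A flat profile gives a ratio of 1 and therefore a score of 0
flat <- rep(0.5, 11)
log(mean(flat[c(1, 11)]) / flat[6])
```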
A note on annotation. For real uses of MIRA, samples and region sets should be annotated. In order to save memory, it is not recommended that sample name or sample type be included as columns in the BSDT before the aggregation step. An annotation file that matches sample name with sample type (a sampleType column) can be used to add sample type information to the data.table after aggregating methylation. A sampleType column is not required for the main functions but is used for the plotting functions. See the “Applying MIRA to a Biological Question” vignette for a workflow including annotation.
Interpreting the results
Regulatory information can be inferred from the MIRA scores and profiles but interpretation depends on the type of region set and samples used. The basic assumption is that deeper dips and higher MIRA scores would be associated with generally higher activity at the regions that were tested, which might also inform you about the activity of the regulatory feature that defines the region set (eg the activity of a transcription factor if using ChIP-seq regions). However, MIRA does not directly tell you the activity of the regulatory feature and, in some cases, higher MIRA scores will not be associated with higher activity of the regulatory feature. For example, samples with higher MIRA scores for a transcription factor binding region set could have more activity of that transcription factor but there could be other causes like increased activity of a different transcription factor that binds to many of the same regions. Additionally, it is possible that the samples with higher scores generally had more similar chromatin states to the samples from which the region set was derived than samples with lower scores did (eg when using MIRA on samples from multiple tissue/cell types and the region set was derived from the same tissue/cell type as some but not all of the samples). The general interpretation rule is that a higher MIRA score means more activity but we encourage the user to think carefully about what biological question they are actually addressing based on the region sets and samples used. For more on interpreting the results, see the “Applying MIRA to a Biological Question” vignette.
Bonus: Loading region sets with LOLA
Here is one simple way to get region sets to use with MIRA. First, download the LOLA Core Database (actual files, not the cached version) from here (this might take a while).
Next, let’s load the LOLA package and use it to load some of the region sets into R!
The following code will not be evaluated because of the large files involved and long loading time, but it shows the general process (loading the almost 3,000 region sets could take 10-30 minutes). Alternatively, to avoid the long loading process, you can just download the database and load region sets into R one by one with data.table::fread or another such function.
library(LOLA)
pathToDB <- "path/to/LOLACore/hg38"
regionDB <- loadRegionDB(pathToDB)
The regionDB is a list that includes some annotation and a GRangesList of region sets that can be accessed with regionDB$regionGRL. Check out the LOLA docs here if you want some more information. Now you have plenty of region sets to use with MIRA!
The COCOA Bioconductor package
Coordinate Covariation Analysis (COCOA) is a method to understand variation among samples and can be used with data that includes genomic coordinates such as DNA methylation. To describe the method on a high level, COCOA uses a database of “region sets” and principal component analysis (PCA) of the data to identify sources of variation among samples. A region set is a set of genomic regions that share a biological annotation, for instance transcription factor (TF) binding regions, histone modification regions, or open chromatin regions. COCOA can identify region sets where there is intersample variation, giving you biologically meaningful insight into variation in your data. For a more in-depth description of the method, see the “Method details” section of this vignette. In contrast to some other common techniques, COCOA is unsupervised, meaning that samples do not have to be divided into groups such as case/control or healthy/disease, although COCOA works in those situations as well. Also, COCOA focuses on continuous variation between samples instead of having cutoffs. Because of this, COCOA can be used as a complementary method alongside “differential” methods that find discrete differences between groups of samples and it can also be used in situations where there are no groups.
Our test case
In this vignette, we will see how COCOA can find meaningful sources of intersample variation in breast cancer patients. We will use data from The Cancer Genome Atlas: 450k DNA methylation microarrays for 657 breast cancer patients (TCGA-BRCA, https://portal.gdc.cancer.gov/). This vignette begins after PCA has been performed on the DNA methylation data. As you can see below, PC1 and PC4 cluster patients based on estrogen receptor status, which is an important prognostic marker for breast cancer:
Based on this, we might expect that the variation captured by these PCs would be associated with estrogen receptor (consistent with interpatient variation in estrogen receptor status). We are using an easy example for this vignette but COCOA will still work even if there are not distinct clusters in your PCA plot and if you do not have known groups of samples. Outside of this vignette, COCOA was performed with a region set database of over 2000 region sets. To keep things simple for this vignette, we will only use two of the highest scoring region sets and two of the lowest scoring region sets from that analysis to demonstrate how COCOA works.
Inputs
There are three main inputs for COCOA:
- PCA loadings (prcomp()$rotation).
- Genomic coordinates associated with the raw data used for the PCA and with the PCA loadings.
- Region sets (normally a database of region sets).
We did PCA with DNA methylation data for all autosomal chromosomes, but only the DNA methylation data and loadings for cytosines on chromosome 1 are included in this vignette, to minimize memory requirements. brcaLoadings1 has the chr1 loadings. brcaMCoord1 has the coordinates for the chr1 DNA methylation data, which are also the coordinates for the chr1 loadings. For more on the relationship between the loadings and the original data, see the “Method details” section. In this vignette, we will use only a few region sets to demonstrate how to use the COCOA functions. For actual use, COCOA should be run with many region sets (i.e., hundreds or more than 1000). Publicly available collections of region sets can be found online (e.g., http://databio.org/regiondb).
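To sketch how these three inputs relate to each other (using a toy methylation matrix and made-up coordinates rather than the package data), the loadings come directly from prcomp() and the coordinates are a GRanges object with one range per feature, in the same order as the matrix columns:

```r
library(GenomicRanges)

set.seed(1)
## toy data: 6 samples (rows) x 10 CpGs (columns); values stand in for methylation levels
methylMat <- matrix(runif(60), nrow = 6,
                    dimnames = list(paste0("sample", 1:6), paste0("cpg", 1:10)))

pca <- prcomp(methylMat, center = TRUE)
loadingMat <- pca$rotation   # one row per CpG, one column per PC

## coordinates for the 10 CpGs, in the same order as the matrix columns
signalCoord <- GRanges("chr1", IRanges(start = seq(100, 1000, by = 100), width = 1))

## each loading row must correspond to one genomic position
stopifnot(nrow(loadingMat) == length(signalCoord))
```

The key requirement is the last line: the rows of the loading matrix and the ranges in the coordinate object must line up one-to-one.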
Running COCOA
First, we load the example data and required packages:
library(COCOA)
library(data.table)
library(ggplot2)
data("esr1_chr1")
data("gata3_chr1")
data("nrf1_chr1")
data("atf3_chr1")
data("brcaLoadings1")
data("brcaMCoord1")
data("brcaPCScores")
data("brcaMethylData1")
We have region sets for four transcription factors from ChIP-seq experiments (with the same reference genome as our breast cancer data): Esr1 in MCF-7 cells, Gata3 in MCF-7 cells, Nrf1 in HepG2 cells, and Atf3 in K562 cells.
## prepare data
GRList <- GRangesList(esr1_chr1, gata3_chr1, nrf1_chr1, atf3_chr1)
regionSetName <- c("esr1_chr1", "gata3_chr1", "nrf1_chr1", "atf3_chr1")
Now let’s give each region set a score with runCOCOA() to quantify how much it is associated with each principal component:
PCsToAnnotate <- paste0("PC", 1:4)
regionSetScores <- runCOCOA(loadingMat=brcaLoadings1,
signalCoord=brcaMCoord1,
GRList=GRList,
PCsToAnnotate=PCsToAnnotate,
scoringMetric="regionMean")
regionSetScores$regionSetName <- regionSetName
As an easy way to visualize the results, we can see how the region sets are ranked for each PC:
rsScoreHeatmap(regionSetScores,
PCsToAnnotate=paste0("PC", 1:4),
rsNameCol = "regionSetName",
orderByPC = "PC1",
column_title = "Region sets ordered by score for PC1")
We can see that Gata3 had the highest score for PCs 1, 3, and 4 but that Esr1 had a higher score for PC2. While you might expect Esr1 to be ranked first for all PCs, Gata3 is known to be associated with breast cancer and with Esr1 activity, so these results still make sense. As mentioned before, this vignette uses two of the top-scoring and two of the bottom-scoring region sets from a previous COCOA analysis with over 2200 region sets. So although Esr1 being ranked second out of four for some PCs in this vignette may not seem very impressive, it had one of the best scores in that larger analysis. On a practical note, if you want to arrange the heatmap by region set scores for another PC, you can just change the orderByPC parameter like so:
rsScoreHeatmap(regionSetScores,
PCsToAnnotate=paste0("PC", 1:4),
rsNameCol = "regionSetName",
orderByPC = "PC2",
column_title = "Region sets ordered by score for PC2")
Understanding the results
We can further understand the variability in these region sets in several ways:
- Look at whether variability is specific to the regions of interest compared to the genome around these regions.
- Look at the genomic signal in these regions, in our case DNA methylation, and whether it follows the same trends as the PC scores.
- Look at the loadings in each region of a given region set to see whether all regions have high loadings or just a subset of the regions. Also, we can see whether the same regions have high loadings for multiple PCs.
Specificity of variation to the regions of interest
We can see whether variability along the PC is specific to the region of interest by comparing the region of interest to the surrounding genome. To do this, we will calculate the average loadings of a wide area surrounding the regions of interest.
wideGRList <- lapply(GRList, resize, width=14000, fix="center")
loadProfile <- lapply(wideGRList, function(x) getLoadingProfile(loadingMat=brcaLoadings1,
signalCoord=brcaMCoord1,
regionSet=x,
PCsToAnnotate=PCsToAnnotate,
binNum=21))
We will normalize the result for each PC so we can better compare them. Here we normalize by subtracting the mean absolute loading of each PC from the region set profiles for the corresponding PC. Then we get the plot scale so we can easily compare the different profiles. These normalization steps are helpful for comparing the loading profiles but are not strictly required, so it’s not essential that you understand the code below.
## average loading value from each PC to normalize so PCs can be compared with each other
avLoad <- apply(X=brcaLoadings1[, PCsToAnnotate],
MARGIN=2,
FUN=function(x) mean(abs(x)))
## normalize
loadProfile <- lapply(loadProfile,
FUN=function(x) as.data.frame(mapply(FUN = function(y, z) x[, y] - z,
y=PCsToAnnotate, z=avLoad)))
binID <- 1:nrow(loadProfile[[1]])
loadProfile <- lapply(loadProfile, FUN=function(x) cbind(binID, x))
## for the plot scale
maxVal <- max(sapply(loadProfile, FUN=function(x) max(x[, PCsToAnnotate])))
minVal <- min(sapply(loadProfile, FUN=function(x) min(x[, PCsToAnnotate])))
## convert to long format for plots
loadProfile <- lapply(X=loadProfile, FUN=function(x) tidyr::gather(data=x, key="PC", value="loading_value", PCsToAnnotate))
loadProfile <- lapply(loadProfile,
function(x){x$PC <- factor(x$PC, levels=PCsToAnnotate); return(x)})
Let’s look at the plots!
wrapper <- function(x, ...) paste(strwrap(x, ...), collapse="\n")
profilePList <- list()
for (i in seq_along(loadProfile)) {
thisRS <- loadProfile[[i]]
profilePList[[i]] <- ggplot(data=thisRS,
mapping=aes(x=binID , y=loading_value)) +
geom_line() + ylim(c(minVal, maxVal)) + facet_wrap(facets="PC") +
ggtitle(label=wrapper(regionSetName[i], width=30)) +
xlab("Genome around region set, 14 kb") +
ylab("Normalized loading value") +
theme(panel.grid.major.x=element_blank(),
panel.grid.minor.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
}
profilePList[[1]]
profilePList[[2]]
profilePList[[3]]
profilePList[[4]]
These plots show the average magnitude of the loadings in the genome around and including the regions of interest. The loading for an input variable (for us, an input variable is DNA methylation at a single cytosine) indicates how much that input variable varies in the same direction as the PC. A peak in the middle of the profile indicates that there is increased covariation in the regions of interest compared to the surrounding genome: those regions are changing in a coordinated way whereas the surrounding genome is not changing in a coordinated way to the same extent. A peak suggests that the variation in a PC may be specifically related to the region set, although it is not clear whether the region set is causally linked to the variation or just affected by other factors that are causing the variation captured by the PC. Some region sets may have an increased loading but no peak, for example, some histone modification region sets like H3K27me3 or H3K9me3. This doesn’t necessarily mean these regions are not relevant; it could just mean that there is variability in larger blocks of the genome around these histone modifications. For details on how the loading profile was created, check out the “Method details” section of this vignette or see the docs for getLoadingProfile with ?COCOA::getLoadingProfile.
These plots show that Esr1 and Gata3 binding regions demonstrate higher covariation along the PC axes than the surrounding genome (because they have a peak in the middle), while Nrf1 and Atf3 do not. So, for example, if you line up samples by their PC1 score, then as you go from low to high PC1 score, the DNA methylation of Esr1 binding regions will generally change in a coordinated way across samples, but the DNA methylation in the surrounding genome will not change as much or will not change in a coordinated way. The results from our four region sets suggest that Esr1 and Gata3 regions specifically contribute to the variation along PCs 1 and 3 (as well as perhaps 4), helping us understand the biological meaning of the variation captured by those PCs (at least in part related to the estrogen receptor).
The raw data
If a region set has a high score for a PC, we would expect that DNA methylation in at least some of those regions would correlate with PC score. In other words, as you go from high PC score to low PC score along the PC axis, DNA methylation will either go up or down. Let’s look at DNA methylation at CpGs in Esr1 regions. In the following plot, each column is a CpG in an Esr1 region and each row is a patient. Patients are ordered by their PC score for PC1.
signalAlongPC(genomicSignal=brcaMethylData1,
signalCoord=brcaMCoord1,
regionSet=esr1_chr1,
pcScores=brcaPCScores,
orderByPC="PC1", cluster_columns=TRUE,
column_title = "Individual cytosine/CpG",
name = "DNA methylation level")
It appears that some but not all CpGs vary greatly along the PC axis. The CpGs that show high variation along the PC axis are the ones that contribute to the Esr1 region set being ranked highly in our analysis. Looking at the raw data confirms that DNA methylation is in fact varying along the PC axis, which gives us more confidence in our results.
Now let’s look at one of the region sets, Nrf1, that had a low score for PC1:
signalAlongPC(genomicSignal=brcaMethylData1,
signalCoord=brcaMCoord1,
regionSet=nrf1_chr1,
pcScores=brcaPCScores,
orderByPC="PC1",
cluster_columns=TRUE,
column_title = "Individual cytosine/CpG",
name = "DNA methylation level")
When patients are ordered according to PC1 score, we can see that there is very little covariation in DNA methylation in these regions. Therefore, it is not surprising that the Nrf1 region set had a low score for this PC.
Since COCOA ranks region sets based on their relative scores in comparison to other region sets tested, there will always be a region set with a best score. Therefore, it is always a good idea to check the raw genomic signal in your top region sets to make sure that there really is variation along the PC.
Loadings of individual regions
This plot can help you learn more about the contribution of individual regions to the region set score for each PC. For example, if the estrogen receptor region set was associated with PCs 1, 3, and 4, we might wonder whether the same regions are causing the association with these PCs or whether different regions are associated with each PC. To do this, we will first calculate the average absolute loading value for each region in a region set (obtained by averaging a given PC’s loadings for CpGs within that region). Then we can use the distribution of loadings for each PC to convert each region’s loading to a percentile to see how extreme/high that region is for each PC. Let’s look at the plot for estrogen receptor:
regionQuantileByPC(loadingMat = brcaLoadings1,
signalCoord = brcaMCoord1,
regionSet = esr1_chr1,
rsName = "Estrogen receptor (chr1)",
PCsToAnnotate=paste0("PC", 1:4),
maxRegionsToPlot = 8000,
cluster_rows = TRUE,
cluster_columns = FALSE,
column_title = "Estrogen receptor (chr1)",
name = "Percentile of loading scores in PC")
We can see that some of the same regions have high loadings for multiple PCs (i.e., these regions are important for these PCs). Also, there are some regions that do not have high loadings for any of the top 4 PCs, suggesting that these regions are not associated with the largest sources of covariation in the data. Overall, PC2 does not have loadings as high as PCs 1, 3, and 4, consistent with our loading profiles (no peak for PC2). While Esr1 was ranked highest for PC2 in the initial COCOA results, this was only the highest score out of four region sets. This illustrates two important points for using COCOA. First, you will almost always want to use many region sets so that you can compare the COCOA score for a region set to many others to see whether it really is relatively high. Second, you should further confirm top-ranked region sets, such as by looking at the raw DNA methylation values and at the loading profiles, to make sure your top region sets are capturing real variation.
For contrast, we can look at the regions of Nrf1:
regionQuantileByPC(loadingMat = brcaLoadings1,
signalCoord = brcaMCoord1,
regionSet = nrf1_chr1,
rsName = "Nrf1 (chr1)",
PCsToAnnotate=paste0("PC", 1:4),
maxRegionsToPlot = 8000,
cluster_rows = TRUE,
cluster_columns = FALSE,
column_title = "Nrf1 (chr1)",
name = "Percentile of loading scores in PC")
We can see that only a very small proportion of regions in the Nrf1 region set have high loadings for PCs 1-4. This is consistent with this region set being ranked low by COCOA for association with these PCs.
These visualization functions can help you explore your data and understand its variation. In conclusion, COCOA can help you identify biologically meaningful sources of variation that contribute to each PC, giving you insight into variability in your data.
Method details
There are two main conceptual steps:
- Quantify how each individual genomic feature/signal contributes to intersample variation using PCA.
- Aggregate information from individual genomic features to identify biologically meaningful sources of intersample variation using a region set database and COCOA functions.
PCA
PCA is a common dimensionality reduction method that produces new (orthogonal) dimensions in the directions of highest intersample variation. Each new dimension (principal component) is a linear combination of the original dimensions/features. For the linear combination for a given principal component (PC), each original feature has a coefficient called a “loading” which quantifies how much that feature contributes to that PC. To run PCA and get the loadings, first you need a matrix with features that all samples share (samples as rows, features as columns). Then run PCA on the data with the prcomp() function. The loadings can be found in the prcomp()$rotation output. Here is a small toy example with made-up data, where the columns F1, F2, and F3 represent three original features and the five rows represent five samples:
toyData <- data.frame(F1 = 1:5, F2 = c(2, 0, 6, 4, 3), F3 = c(1, 3, 2, 10, 5))
Let’s look at the loading matrix:
prcomp(toyData)$rotation
## PC1 PC2 PC3
## F1 0.3193463 -0.1784098 -0.9306922
## F2 0.2005095 -0.9471601 0.2503670
## F3 0.9261824 0.2665664 0.2666992
The values in this loading matrix are the coefficients of the linear combination for each PC, as we see below (each column in the loading matrix corresponds to an equation below; note that prcomp() centers the data by default, so these combinations apply to the mean-centered features):
PC1 = 0.319 * F1 + 0.201 * F2 + 0.926 * F3
PC2 = -0.178 * F1 - 0.947 * F2 + 0.267 * F3
PC3 = -0.931 * F1 + 0.250 * F2 + 0.267 * F3
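We can check this relationship numerically: the PC scores returned by prcomp() are exactly the centered features multiplied by the loading matrix.

```r
## the toy data from above, redefined so this chunk stands alone
toyData <- data.frame(F1 = 1:5, F2 = c(2, 0, 6, 4, 3), F3 = c(1, 3, 2, 10, 5))
pca <- prcomp(toyData)

## prcomp() mean-centers each feature, so the PC scores (pca$x) equal the
## centered features multiplied by the loading matrix (pca$rotation)
centered <- scale(as.matrix(toyData), center = TRUE, scale = FALSE)
stopifnot(max(abs(centered %*% pca$rotation - pca$x)) < 1e-8)
```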
COCOA will use these loadings as feature level information which will be aggregated to gain insight into the broader sources of variation in the data.
Region set database
COCOA uses a database of region sets to gain biological insight into sources of variability in your data. A region set is a set of genomic regions that share a biological annotation. This includes transcription factor (TF) binding regions (e.g., from ChIP-seq), regions with a certain histone modification (e.g., from ChIP-seq), or chromatin accessibility regions (e.g., from DNase-seq/ATAC-seq). Most of these region sets come from experimental data, but they don’t have to; for instance, you could use predicted TF binding regions based on the TF motif. The big-picture goal of using a region set database is to connect variation between samples to an interpretable biological meaning: the known annotation of a region set. For each PC, COCOA gives a score to each region set that quantifies how much that region set is associated with that PC (and therefore with the intersample variation captured by that PC).
COCOA should be done with many region sets (i.e., hundreds or more than 1000). A region set can be a simple “.bed” file with three columns containing the genomic locations of the regions: chr (for chromosome), start, and end. In R, this data can be represented as a data.frame or as a GRanges object. Publicly available collections of region sets can be found online (e.g., http://databio.org/regiondb) and region sets can be accessed through Bioconductor packages (e.g., LOLA and AnnotationHub). The region sets must be from the same reference genome as your sample data (although you can use the liftOver tool to convert from one genome version to another). The region sets can come from anywhere, so if you experimentally or computationally generate your own region sets, you can simply include those with the others when running the COCOA analysis. For an example of working with a region set database, see the ‘COCOA_Workflow’ vignette.
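For example, a three-column BED-style table can be converted to a GRanges object like this (a toy data.frame stands in for a file you would normally load with read.table(); note that BED coordinates are 0-based and half-open, while GRanges is 1-based):

```r
library(GenomicRanges)

## a toy stand-in for a three-column BED file (chr, start, end);
## in practice you would read.table() your own file here
bedDF <- data.frame(chr   = c("chr1", "chr1", "chr2"),
                    start = c(100, 500, 200),   # BED-style, 0-based
                    end   = c(150, 600, 300))

## BED starts are 0-based half-open; GRanges is 1-based, so add 1 to start
regionSet <- GRanges(seqnames = bedDF$chr,
                     ranges = IRanges(start = bedDF$start + 1, end = bedDF$end))
```

A list of such objects (a GRangesList, as in the “Running COCOA” section above) then serves as the region set database.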
Aggregating info from individual features
Since differences between samples in individual features (nucleotides/regions) may be hard to interpret, COCOA uses region sets to aggregate nucleotide/region-level information into a more condensed, interpretable form. As mentioned above, each feature has a “loading” for a given PC, and the magnitude of the loading represents how much that feature contributes to that PC. We use the absolute value of the loading because both a very positive and a very negative loading reflect an association between that feature and the PC. Also, each original feature is associated with a genomic coordinate (or a range, but ranges are not supported by COCOA in this release version). COCOA uses this information to give each region set in the region set database a score for each PC. For a given PC-region set combination (for example, PC1 and the region set esr1_chr1), we first identify all the original features that overlap the region set. Then the scoring of the region set depends on the scoring metric chosen. The “regionMean” method is described here; the other methods are described in the function docs. For the “regionMean” method, we take the PC loadings (e.g., PC1 loadings) for the features that overlap the region set and average the absolute loadings by region (averaging the loadings in each region to get one value per region of the region set). Then we average the region values to get a single average for that region set, which is its score. We repeat this calculation for all PC-region set combinations. Now, for a given PC, we can rank the region sets by their score (loading average) to see which region sets are most associated with that PC (a higher loading average means a greater association with the PC).
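The “regionMean” idea can be sketched with GenomicRanges primitives (a simplified toy illustration of the concept, not the package’s internal implementation; the loadings and coordinates below are made up):

```r
library(GenomicRanges)

## toy setup: 8 CpG positions with their PC1 loadings, and 2 regions in a region set
signalCoord <- GRanges("chr1", IRanges(start = c(10, 20, 30, 110, 120, 210, 220, 300),
                                       width = 1))
pc1Loadings <- c(0.5, -0.4, 0.3, 0.9, -0.8, 0.1, 0.1, 0.7)
regionSet   <- GRanges("chr1", IRanges(start = c(5, 100), end = c(35, 130)))

## 1. find which CpGs fall in which regions
ol <- findOverlaps(query = signalCoord, subject = regionSet)

## 2. average the absolute loadings within each region...
perRegion <- tapply(abs(pc1Loadings[queryHits(ol)]), subjectHits(ol), mean)

## 3. ...then average the per-region values to get the region set score
regionMeanScore <- mean(perRegion)
regionMeanScore  # (0.4 + 0.85) / 2 = 0.625
```

Region 1 averages the loadings of the three CpGs it contains, region 2 averages its two, and the region set score is the mean of those per-region averages.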
Since the “regionMean” scoring method only tells you which region sets are the most associated with a PC out of those tested, it is important to use a large database of region sets and then to validate by looking at the raw data in these regions, which can be done with the COCOA visualization functions. Also, you can screen out region sets that have low coverage in your input data (the region_coverage column of the runCOCOA() output), since high scores for these region sets are more likely to have happened by chance. The biological annotation of the top-ranked region sets for the top PCs can help you understand variation among your samples.
Making a “meta-region” loading profile
A “meta-region” loading profile is a summary of the loadings in the genome in and around the regions of a region set. This is created with the getLoadingProfile function. The calculations are similar to those of runCOCOA, with a few major differences. Instead of using the region set as is, we expand each region in the region set on both sides so we can also look at the surrounding genome. We then split each region into the same number of bins (of approximately equal size). Then we average the magnitude of all loadings that overlap a bin to get a single average loading for each bin. We combine information from all the regions by averaging the corresponding bins from the different regions (all bin 1’s averaged together, all bin 2’s averaged together, etc.). Finally, we average the profile symmetrically over the center (the first bin becomes the average of the first and last bins, the second bin becomes the average of the second and second-to-last bins, etc.). We do this because the orientation of the regions in the genome is arbitrary: we did not use any strand information or relate the regions to any directional information. The “meta-region” profile summarizes all the regions in a region set in a single profile and allows you to compare the regions to the surrounding genome.
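The final symmetrical averaging step can be sketched in one line of base R (a toy 7-bin profile, not COCOA’s internal code):

```r
## toy meta-region profile: average absolute loading per bin
binMeans <- c(0.10, 0.20, 0.50, 0.90, 0.40, 0.30, 0.15)

## average symmetrically over the center: each bin is averaged with its mirror bin
symProfile <- (binMeans + rev(binMeans)) / 2
symProfile  # 0.125 0.250 0.450 0.900 0.450 0.250 0.125
```

After this step the profile is mirror-symmetric, reflecting the arbitrary orientation of the regions.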
Q and A
What data types can COCOA be used with? So far, COCOA has been validated on single-base-pair-resolution DNA methylation data. Theoretically, COCOA could work with any type of genomic coordinate-based data: data where you have a genomic coordinate or range and an associated value. This could include ATAC-seq data, single nucleotide polymorphism/mutation data, copy number variation, etc., although COCOA would probably work better for data where smaller regions or single bases are measured. Future releases of COCOA will likely support more of these data types. If you are interested in using COCOA for these analyses before the next Bioconductor release, you can check out the GitHub page to see the latest version or download the “dev” version of COCOA from Bioconductor.
Can COCOA be used with other dimensionality reduction techniques such as t-SNE? Short answer for t-SNE: no. In general though, it depends. COCOA must have a score for each original dimension that quantifies how much it contributes to the new dimension. Since t-SNE maps the original dimensions to new dimensions in a nonlinear way, the mappings of the original dimensions to the new dimensions are not comparable to each other and cannot be aggregated into a single score for a region set in a uniform way.
Where did the name COCOA come from?
The method is called Coordinate Covariation Analysis because it uses PCA to capture covariation of individual signals/features at genomic coordinates. The top PCs capture both variation between samples and covariation between individual genomic features. COCOA annotates the covariation of individual genomic features with region sets in order to gain insight into variation between samples.