Public data resources and Bioconductor

Instructor names and contact information

Levi Waldron <levi.waldron at sph.cuny.edu> (City University of New York, New York, NY, USA)
Benjamin Haibe-Kains <benjamin.haibe.kains at utoronto.ca> (Princess Margaret Cancer Center, Toronto, Canada)
Sean Davis <sdavis2 at mail.nih.gov> (Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA)

Syllabus

Workshop Description

The goal of this workshop is to introduce Bioconductor packages for finding, accessing, and using large-scale public data resources including the Gene Expression Omnibus (GEO), the NCI Genomic Data Commons (GDC), and Bioconductor-hosted curated data resources for metagenomics, pharmacogenomics (PharmacoDB), and The Cancer Genome Atlas.

Pre-requisites

Basic knowledge of R syntax
Familiarity with the ExpressionSet and SummarizedExperiment classes
Basic familiarity with ’omics technologies such as microarrays and next-generation sequencing (NGS)

Interested students can prepare by reviewing vignettes of the packages listed in “R/Bioconductor packages used” to gain background on aspects of interest to them.

Some more general background on these resources is published in Kannan et al. (2016).

Workshop Participation

Each component will include runnable examples of typical usage that students are encouraged to run during demonstration of that component.

R/Bioconductor packages used

GEOquery: Access to the NCBI Gene Expression Omnibus (GEO), a public repository of gene expression (primarily microarray) data.
GenomicDataCommons: Access to the NIH / NCI Genomic Data Commons RESTful service.
curatedTCGAData: Curated data from The Cancer Genome Atlas (TCGA) as MultiAssayExperiment Objects
curatedMetagenomicData: Curated metagenomic data of the human microbiome
HMP16SData: 16S rRNA sequencing data from the Human Microbiome Project
PharmacoGx: Curated large-scale preclinical pharmacogenomic data and basic analysis tools

Time outline

This is a 1h45m workshop.

Activity	Time
Overview	10m
GEOquery	15m
GenomicDataCommons	20m
curatedTCGAData	10m
curatedMetagenomicData and HMP16SData	15m
PharmacoGx	20m
Q & A	20m

Workshop goals and objectives

Bioconductor provides access to significant amounts of publicly available experimental data. This workshop introduces students to Bioconductor interfaces to the NCBI’s Gene Expression Omnibus, the NCI’s Genomic Data Commons, and PharmacoDB. It additionally introduces curated resources providing The Cancer Genome Atlas, the Human Microbiome Project and other microbiome studies, and major pharmacogenomic studies as native Bioconductor objects ready for analysis and comparison to in-house datasets.

Learning goals

search NCBI resources for publicly available ’omics data
quickly use data from the TCGA and the Human Microbiome Project

Learning objectives

find and download processed microarray and RNA-seq datasets from the Gene Expression Omnibus
find and download ’omics data from the Genomic Data Commons
download and manipulate data from The Cancer Genome Atlas and Human Microbiome Project
download and explore pharmacogenomics data

Overview

Before proceeding, ensure that the following packages are installed.

required_pkgs = c(
  "TCGAbiolinks",
  "GEOquery",
  "GenomicDataCommons",
  "limma",
  "curatedTCGAData",
  "recount",
  "curatedMetagenomicData",
  "phyloseq",
  "HMP16SData",
  "caTools",
  "piano",
  "isa",
  "VennDiagram",
  "downloader",
  "gdata",
  "AnnotationDbi",
  "hgu133a.db",
  "PharmacoGx")
BiocManager::install(required_pkgs)
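
To avoid reinstalling packages that are already present, one option is to first work out which ones are missing. The following is a minimal sketch in base R; find_missing() is a hypothetical helper written for this illustration and is not part of BiocManager, and the package list is shortened.

```r
# Hypothetical helper: return the subset of `pkgs` that is not yet installed.
find_missing = function(pkgs) {
  installed = rownames(installed.packages())
  setdiff(pkgs, installed)
}

# Shortened list for illustration; substitute the full required_pkgs vector.
pkgs = c("GEOquery", "GenomicDataCommons", "PharmacoGx")
to_install = find_missing(pkgs)

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # BiocManager::install(to_install)  # uncomment to install only these
}
```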

GEOquery

(Davis and Meltzer 2007)

The NCBI Gene Expression Omnibus (GEO) serves as a public repository for a wide range of high-throughput experimental data. These data include single and dual channel microarray-based experiments measuring mRNA, genomic DNA, and protein abundance, as well as non-array techniques such as serial analysis of gene expression (SAGE), mass spectrometry proteomic data, and high-throughput sequencing data. The GEOquery package (Davis and Meltzer 2007) forms a bridge between this public repository and the analysis capabilities in Bioconductor.

Overview of GEO

GEO is organized around four basic entity types. The first three (Sample, Platform, and Series) are supplied by users; the fourth, the DataSet, is compiled and curated by GEO staff from the user-submitted data. See the GEO home page for more information.

Platforms

A Platform record describes the list of elements on the array (e.g., cDNAs, oligonucleotide probesets, ORFs, antibodies) or the list of elements that may be detected and quantified in that experiment (e.g., SAGE tags, peptides). Each Platform record is assigned a unique and stable GEO accession number (GPLxxx). A Platform may reference many Samples that have been submitted by multiple submitters.

Samples

A Sample record describes the conditions under which an individual Sample was handled, the manipulations it underwent, and the abundance measurement of each element derived from it. Each Sample record is assigned a unique and stable GEO accession number (GSMxxx). A Sample entity must reference only one Platform and may be included in multiple Series.

Series

A Series record defines a set of related Samples considered to be part of a group, how the Samples are related, and if and how they are ordered. A Series provides a focal point and description of the experiment as a whole. Series records may also contain tables describing extracted data, summary conclusions, or analyses. Each Series record is assigned a unique and stable GEO accession number (GSExxx). Series records are available in a couple of formats that are handled by GEOquery independently. The smaller and newer GSEMatrix files are quite fast to parse; a simple flag is used by GEOquery to choose to use GSEMatrix files (see below).

Datasets

GEO DataSets (GDSxxx) are curated sets of GEO Sample data. There are hundreds of GEO datasets available, but GEO discontinued creating GDS records several years ago. We mention them here for completeness only.
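
The accession prefixes described above (GPL, GSM, GSE, GDS) identify the entity type. As a small illustration of that naming scheme, the sketch below maps an accession string to its entity type. geo_entity_type() is a hypothetical helper written for this example, not a GEOquery function, and GPL570 is used only as a familiar platform accession.

```r
# Hypothetical helper: map a GEO accession to its entity type by prefix.
geo_entity_type = function(accession) {
  prefix = toupper(substr(accession, 1, 3))
  types = c(GPL = "Platform", GSM = "Sample", GSE = "Series", GDS = "DataSet")
  # Unknown prefixes yield NA.
  unname(types[prefix])
}

geo_entity_type(c("GSE103512", "GSM2772660", "GPL570"))
# [1] "Series"   "Sample"   "Platform"
```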

Getting Started using GEOquery

Getting data from GEO is really quite easy. There is only one command that is needed, getGEO. This one function interprets its input to determine how to get the data from GEO and then parse the data into useful R data structures.

library(GEOquery)

With the library loaded, we are free to access any GEO accession.

Use case: MDS plot of cancer data

The data we are going to access are from this paper.

Background: The tumor microenvironment is an important factor in cancer immunotherapy response. To further understand how a tumor affects the local immune system, we analyzed immune gene expression differences between matching normal and tumor tissue. Methods: We analyzed public and new gene expression data from solid cancers and isolated immune cell populations. We also determined the correlation between CD8, FoxP3 IHC, and our gene signatures. Results: We observed that regulatory T cells (Tregs) were one of the main drivers of immune gene expression differences between normal and tumor tissue. A tumor-specific CD8 signature was slightly lower in tumor tissue compared with normal of most (12 of 16) cancers, whereas a Treg signature was higher in tumor tissue of all cancers except liver. Clustering by Treg signature found two groups in colorectal cancer datasets. The high Treg cluster had more samples that were consensus molecular subtype 1/4, right-sided, and microsatellite-instable, compared with the low Treg cluster. Finally, we found that the correlation between signature and IHC was low in our small dataset, but samples in the high Treg cluster had significantly more CD8+ and FoxP3+ cells compared with the low Treg cluster. Conclusions: Treg gene expression is highly indicative of the overall tumor immune environment. Impact: In comparison with the consensus molecular subtype and microsatellite status, the Treg signature identifies more colorectal tumors with high immune activation that may benefit from cancer immunotherapy.

In this little exercise, we will:

Access public omics data using the GEOquery package
Convert the public omics data to a SummarizedExperiment object.
Perform a simple unsupervised analysis to visualize these public data.

Use the GEOquery package to fetch data about GSE103512.

gse = getGEO("GSE103512")[[1]]

## Warning: 102 parsing failures.
##   row     col           expected    actual         file
## 54614 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
## 54615 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
## 54616 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
## 54617 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
## 54618 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
## ..... ....... .................. ......... ............
## See problems(...) for more details.

Note that getGEO, when used to retrieve GSE records, returns a list. The members of the list each represent one GEO Platform, since each GSE record can contain multiple related datasets (e.g., gene expression and DNA methylation). In this case, the list is of length one, but it is still necessary to grab the first element.

The first step, a detail, is to convert from the older Bioconductor data structure, the ExpressionSet (GEOquery was written in 2007), to the newer SummarizedExperiment. One line suffices.

library(SummarizedExperiment)
se = as(gse, "SummarizedExperiment")

Examine two variables of interest, cancer type and tumor/normal status.

with(colData(se),table(`cancer.type.ch1`,`normal.ch1`))

##                normal.ch1
## cancer.type.ch1 no yes
##           BC    65  10
##           CRC   57  12
##           NSCLC 60   9
##           PCA   60   7

Filter gene expression by variability (here, the standard deviation of each row) to find the most informative genes.

sds = apply(assay(se, 'exprs'),1,sd)
dat = assay(se, 'exprs')[order(sds,decreasing = TRUE)[1:500],]
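
The same filtering idea can be tried offline on a toy matrix. This is a minimal sketch with simulated values, not the GSE103512 data; it uses head() on the ordering so that asking for more rows than exist does not introduce NA indices, which is a small design choice of this example rather than something the workshop code does.

```r
set.seed(1)
# Toy expression matrix: 100 features (rows) by 6 samples (columns).
toy = matrix(rnorm(600), nrow = 100,
             dimnames = list(paste0("gene", 1:100), paste0("s", 1:6)))

# Per-feature standard deviation, then keep the n_top most variable rows.
n_top = 5
sds_toy = apply(toy, 1, sd)
keep = head(order(sds_toy, decreasing = TRUE), n_top)
top = toy[keep, ]
dim(top)
```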

Perform multidimensional scaling and prepare for plotting. We will be using ggplot2, so we need to make a data.frame before plotting.

mdsvals = cmdscale(dist(t(dat)))
mdsvals = as.data.frame(mdsvals)
mdsvals$Type=factor(colData(se)[,'cancer.type.ch1'])
mdsvals$Normal = factor(colData(se)[,'normal.ch1'])
head(mdsvals)

##                   V1       V2 Type Normal
## GSM2772660  8.531331 18.57115   BC     no
## GSM2772661  8.991591 13.63764   BC     no
## GSM2772662 10.788973 13.48403   BC     no
## GSM2772663  3.127105 19.13529   BC     no
## GSM2772664 13.056599 13.88711   BC     no
## GSM2772665  7.903717 13.24731   BC     no
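
To see how much of the sample-to-sample distance structure two dimensions retain, cmdscale() can also return eigenvalues and a goodness-of-fit summary. The sketch below uses simulated data in which rows are samples (so no transpose is needed, unlike the expression matrix above); it is an illustration only.

```r
set.seed(1)
# Toy data: 10 samples (rows) described by 20 features (columns).
x = matrix(rnorm(200), nrow = 10)

# eig = TRUE returns a list instead of a bare coordinate matrix.
fit = cmdscale(dist(x), k = 2, eig = TRUE)

# Coordinates of each sample in the two retained dimensions.
head(fit$points)

# Goodness of fit: two variants of the share of eigenvalue mass
# captured by the first k dimensions.
fit$GOF
```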

And do the plot.

library(ggplot2)
ggplot(mdsvals, aes(x=V1,y=V2,shape=Normal,color=Type)) + 
    geom_point( alpha=0.6) + theme(text=element_text(size = 18))

Accessing Raw Data from GEO

NCBI GEO accepts (but has not always required) raw data such as .CEL files, .CDF files, images, etc. It is also not uncommon for some RNA-seq or other sequencing datasets to supply only raw data (with accompanying sample information, of course). Sometimes, it is useful to get quick access to such data. A single function, getGEOSuppFiles, can take as an argument a GEO accession and will download all the raw data associated with that accession. By default, the function will create a directory in the current working directory to store the raw data for the chosen GEO accession.
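
A call might look like the following sketch. fetch_supp() is a hypothetical wrapper introduced here so that the download directory is explicit; it relies on the makeDirectory and baseDir arguments of getGEOSuppFiles and needs network access, so the actual call is left commented out.

```r
# Hypothetical wrapper around GEOquery::getGEOSuppFiles().
# Files are written to a per-accession directory created below `dest`.
fetch_supp = function(accession, dest = tempdir()) {
  GEOquery::getGEOSuppFiles(accession, makeDirectory = TRUE, baseDir = dest)
}

# Not run here because it downloads data:
# supp = fetch_supp("GSE103512")
# rownames(supp)  # local paths of the downloaded files
```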

GenomicDataCommons

From the Genomic Data Commons (GDC) website:

The National Cancer Institute’s (NCI’s) Genomic Data Commons (GDC) is a data sharing platform that promotes precision medicine in oncology. It is not just a database or a tool; it is an expandable knowledge network supporting the import and standardization of genomic and clinical data from cancer research programs. The GDC contains NCI-generated data from some of the largest and most comprehensive cancer genomic datasets, including The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research to Generate Effective Therapies (TARGET). For the first time, these datasets have been harmonized using a common set of bioinformatics pipelines, so that the data can be directly compared. As a growing knowledge system for cancer, the GDC also enables researchers to submit data, and harmonizes these data for import into the GDC. As more researchers add clinical and genomic data to the GDC, it will become an even more powerful tool for making discoveries about the molecular basis of cancer that may lead to better care for patients.

The data model for the GDC is complex, but it is worth a quick overview. A graphical representation is included here.

The data model is encoded as a so-called property graph. Nodes represent entities such as Projects, Cases, Diagnoses, Files (various kinds), and Annotations. The relationships between these entities are maintained as edges. Both nodes and edges may have Properties that supply instance details.

The GDC API exposes these nodes and edges in a somewhat simplified set of RESTful endpoints.

Quickstart

This quickstart section is just meant to show basic functionality. More details of functionality are included further on in this vignette and in function-specific help.

To report bugs or problems, either submit a new issue or submit a bug.report(package='GenomicDataCommons') from within R (which will redirect you to the new issue on GitHub).

Installation

Installation of the GenomicDataCommons package is identical to installation of other Bioconductor packages.

install.packages('BiocManager')
BiocManager::install('GenomicDataCommons')

After installation, load the library in order to use it.

library(GenomicDataCommons)

Check connectivity and status

The GenomicDataCommons package relies on having network connectivity. In addition, the NCI GDC API must be operational and not under maintenance. The status() function can be used to check this connectivity and functionality.

GenomicDataCommons::status()

## $commit
## [1] "3e22a4257d5079ae9f7193950b51ed9dfc561ed1"
## 
## $data_release
## [1] "Data Release 17.0 - June 05, 2019"
## 
## $status
## [1] "OK"
## 
## $tag
## [1] "1.21.0"
## 
## $version
## [1] 1
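
In a script, it can be convenient to turn this check into a single logical value. The sketch below assumes the list structure printed above; gdc_is_up() is a hypothetical helper, and its default argument calls the GDC API only when no list is supplied.

```r
# Hypothetical helper: TRUE when a status list reports "OK".
# The default argument queries the GDC API, so it needs network access.
gdc_is_up = function(st = GenomicDataCommons::status()) {
  identical(st$status, "OK")
}

# Offline illustration using a list shaped like the output above.
gdc_is_up(list(status = "OK", version = 1))
# [1] TRUE
```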

Find data

The following code builds a manifest that can be used to guide the download of raw data. Here, filtering finds gene expression files quantified as raw counts using HTSeq from ovarian cancer patients.

ge_manifest = files() %>%
    filter( ~ cases.project.project_id == 'TCGA-OV' &
                type == 'gene_expression' &
                analysis.workflow_type == 'HTSeq - Counts') %>%
    manifest()

Download data

The query above specifies 379 gene expression files; the code below downloads the first 20 of them. Using multiple processes to do the download significantly speeds up the transfer in many cases. On a standard 1Gb connection, the following completes in about 30 seconds. The first time the data are downloaded, R will ask to create a cache directory (see ?gdc_cache for details of setting and interacting with the cache). Downloaded files will be stored in the cache directory. Future access to the same files will be directly from the cache, avoiding repeated downloads.

fnames = lapply(ge_manifest$id[1:20],gdcdata)
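
One way to use multiple processes is mclapply() from the base parallel package. The following is a sketch under assumptions: download_parallel() is a hypothetical helper, forking works only on Unix-alikes, and whether concurrent downloads into the same cache behave well has not been verified here. The offline call substitutes toupper for the download function to show the mechanics.

```r
library(parallel)

# Hypothetical helper: apply `fetch` to each id using `cores` processes.
# mclapply() forks, so cores > 1 is only supported on Unix-alikes.
download_parallel = function(ids, fetch = GenomicDataCommons::gdcdata,
                             cores = 2L) {
  mclapply(ids, fetch, mc.cores = cores)
}

# Offline illustration with a stand-in for the download function.
out = download_parallel(c("id-a", "id-b"), fetch = toupper, cores = 1L)
unlist(out)

# With the manifest from above (needs network access):
# fnames = download_parallel(ge_manifest$id[1:20])
```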

If the download had included controlled-access data, the download above would have needed to include a token. Details are available in the authentication section below.

Metadata queries

The GenomicDataCommons package can access the significant clinical, demographic, biospecimen, and annotation information contained in the NCI GDC.

# Fields that could be passed to expand(); not used by the query below
expands = c("diagnoses","annotations",
             "demographic","exposures")
projResults = projects() %>%
    results(size=10)
str(projResults,list.len=5)

## List of 9
##  $ dbgap_accession_number: chr [1:10] "phs001287" "phs001374" "phs001628" "phs000466" ...
##  $ disease_type          :List of 10
##   ..$ CPTAC-3              : chr "Adenomas and Adenocarcinomas"
##   ..$ VAREPOP-APOLLO       : chr [1:2] "Epithelial Neoplasms, NOS" "Squamous Cell Neoplasms"
##   ..$ BEATAML1.0-CRENOLANIB: chr "Myeloid Leukemias"
##   ..$ TARGET-CCSK          : chr "Clear Cell Sarcoma of the Kidney"
##   ..$ TARGET-NBL           : chr "Neuroblastoma"
##   .. [list output truncated]
##  $ releasable            : logi [1:10] FALSE FALSE FALSE TRUE TRUE FALSE ...
##  $ released              : logi [1:10] TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ state                 : chr [1:10] "open" "open" "open" "open" ...
##   [list output truncated]
##  - attr(*, "row.names")= int [1:10] 1 2 3 4 5 6 7 8 9 10
##  - attr(*, "class")= chr [1:3] "GDCprojectsResults" "GDCResults" "list"

names(projResults)

## [1] "dbgap_accession_number" "disease_type"          
## [3] "releasable"             "released"              
## [5] "state"                  "primary_site"          
## [7] "project_id"             "id"                    
## [9] "name"

# or listviewer::jsonedit(projResults)

Basic design

This package design is meant to have some similarities to the “hadleyverse” approach of dplyr. Roughly, the functionality for finding and accessing files and metadata can be divided into:

Simple query constructors based on GDC API endpoints.
A set of verbs that, when applied, adjust filtering, field selection, and faceting (fields for aggregation) and result in a new query object (an endomorphism).
A set of verbs that take a query and return results from the GDC.

In addition, there are auxiliary functions for asking the GDC API for information about available and default fields, slicing BAM files, and downloading actual data files. Here is an overview of functionality¹.

Creating a query
- projects()
- cases()
- files()
- annotations()
Manipulating a query
- filter()
- facet()
- select()
Introspection on the GDC API fields
- mapping()
- available_fields()
- default_fields()
- grep_fields()
- field_picker()
- available_values()
- available_expand()
Executing an API call to retrieve query results
- results()
- count()
- response()
Raw data file downloads
- gdcdata()
- transfer()
- gdc_client()
Summarizing and aggregating field values (faceting)
- aggregations()
Authentication
- gdc_token()
BAM file slicing
- slicing()

Usage

There are two main classes of operations when working with the NCI GDC.

Querying metadata and finding data files (e.g., finding all gene expression quantifications data files for all colon cancer patients).
Transferring raw or processed data from the GDC to another computer (e.g., downloading raw or processed data)

Both classes of operation are reviewed in detail in the following sections.

Querying metadata

Vast amounts of metadata about cases (patients, basically), files, projects, and so-called annotations are available via the NCI GDC API. Typically, one will want to query metadata to either focus in on a set of files for download or transfer or to perform so-called aggregations (pivot-tables, facets, similar to the R table() functionality).

Querying metadata starts with creating a “blank” query. One will often then want to filter the query to limit results prior to retrieving results. The GenomicDataCommons package has helper functions for listing fields that are available for filtering.

In addition to fetching results, the GDC API allows faceting, or aggregating, which is useful for compiling reports, generating dashboards, or building user interfaces to GDC data (see GDC web query interface for a non-R-based example).

Creating a query

The GenomicDataCommons package accesses the same API as the GDC website. Therefore, a useful approach, particularly for beginning users, is to examine the filters available on the GDC repository pages to find appropriate filtering criteria. From there, converting those checkboxes to a GenomicDataCommons query() is relatively straightforward. Note that only a small subset of the available_fields() are available by default on the website.

A screenshot of an example query of the GDC repository portal.

A query of the GDC starts its life in R. Queries follow the four metadata endpoints available at the GDC. In particular, there are four convenience functions that each create GDCQuery objects (actually, specific subclasses of GDCQuery):

projects()
cases()
files()
annotations()

pquery = projects()

The pquery object is now an object of (S3) class, GDCQuery (and gdc_projects and list). The object contains the following elements:

fields: This is a character vector of the fields that will be returned when we retrieve data. If no fields are specified to, for example, the projects() function, the default fields from the GDC are used (see default_fields())
filters: This will contain results after calling the filter() method and will be used to filter results on retrieval.
facets: A character vector of field names that will be used for aggregating data in a call to aggregations().
archive: One of either “default” or “legacy”. In the str() output below, this appears as the logical element legacy.
token: A character(1) token from the GDC. See the authentication section for details, but note that, in general, the token is not necessary for metadata query and retrieval, only for actual data download.

Looking at the actual object (get used to using str()!), note that the query contains no results.

str(pquery)

## List of 5
##  $ fields : chr [1:10] "dbgap_accession_number" "disease_type" "intended_release_date" "name" ...
##  $ filters: NULL
##  $ facets : NULL
##  $ legacy : logi FALSE
##  $ expand : NULL
##  - attr(*, "class")= chr [1:3] "gdc_projects" "GDCQuery" "list"

Retrieving results

[ GDC pagination documentation ]

[ GDC sorting documentation ]

With a query object available, the next step is to retrieve results from the GDC. The most basic type of results we can get is a simple count() of records available that satisfy the filter criteria. Note that we have not set any filters, so a count() here will represent all the project records publicly available at the GDC in the “default” archive.

pcount = count(pquery)
# or
pcount = pquery %>% count()
pcount

## [1] 47

The results() method will fetch actual results.

presults = pquery %>% results()

These results are returned from the GDC in JSON format and converted into a (potentially nested) list in R. The str() method is useful for taking a quick glimpse of the data.

str(presults)

## List of 9
##  $ dbgap_accession_number: chr [1:10] "phs001287" "phs001374" "phs001628" "phs000466" ...
##  $ disease_type          :List of 10
##   ..$ CPTAC-3              : chr "Adenomas and Adenocarcinomas"
##   ..$ VAREPOP-APOLLO       : chr [1:2] "Epithelial Neoplasms, NOS" "Squamous Cell Neoplasms"
##   ..$ BEATAML1.0-CRENOLANIB: chr "Myeloid Leukemias"
##   ..$ TARGET-CCSK          : chr "Clear Cell Sarcoma of the Kidney"
##   ..$ TARGET-NBL           : chr "Neuroblastoma"
##   ..$ FM-AD                : chr [1:23] "Germ Cell Neoplasms" "Acinar Cell Neoplasms" "Miscellaneous Tumors" "Thymic Epithelial Neoplasms" ...
##   ..$ TCGA-HNSC            : chr "Squamous Cell Neoplasms"
##   ..$ TCGA-LGG             : chr "Gliomas"
##   ..$ TCGA-SARC            : chr [1:6] "Myomatous Neoplasms" "Soft Tissue Tumors and Sarcomas, NOS" "Fibromatous Neoplasms" "Lipomatous Neoplasms" ...
##   ..$ TCGA-ESCA            : chr [1:3] "Cystic, Mucinous and Serous Neoplasms" "Adenomas and Adenocarcinomas" "Squamous Cell Neoplasms"
##  $ releasable            : logi [1:10] FALSE FALSE FALSE TRUE TRUE FALSE ...
##  $ released              : logi [1:10] TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ state                 : chr [1:10] "open" "open" "open" "open" ...
##  $ primary_site          :List of 10
##   ..$ CPTAC-3              : chr [1:3] "Kidney" "Bronchus and lung" "Uterus, NOS"
##   ..$ VAREPOP-APOLLO       : chr "Bronchus and lung"
##   ..$ BEATAML1.0-CRENOLANIB: chr "Hematopoietic and reticuloendothelial systems"
##   ..$ TARGET-CCSK          : chr "Kidney"
##   ..$ TARGET-NBL           : chr "Nervous System"
##   ..$ FM-AD                : chr [1:42] "Testis" "Gallbladder" "Unknown" "Other and unspecified parts of biliary tract" ...
##   ..$ TCGA-HNSC            : chr [1:13] "Other and ill-defined sites in lip, oral cavity and pharynx" "Palate" "Other and unspecified parts of tongue" "Hypopharynx" ...
##   ..$ TCGA-LGG             : chr "Brain"
##   ..$ TCGA-SARC            : chr [1:13] "Corpus uteri" "Bones, joints and articular cartilage of limbs" "Other and unspecified parts of tongue" "Meninges" ...
##   ..$ TCGA-ESCA            : chr [1:2] "Esophagus" "Stomach"
##  $ project_id            : chr [1:10] "CPTAC-3" "VAREPOP-APOLLO" "BEATAML1.0-CRENOLANIB" "TARGET-CCSK" ...
##  $ id                    : chr [1:10] "CPTAC-3" "VAREPOP-APOLLO" "BEATAML1.0-CRENOLANIB" "TARGET-CCSK" ...
##  $ name                  : chr [1:10] "" "VA Research Precision Oncology Program" "Clinical Resistance to Crenolanib in Acute Myeloid Leukemia Due to Diverse Molecular Mechanisms" "Clear Cell Sarcoma of the Kidney" ...
##  - attr(*, "row.names")= int [1:10] 1 2 3 4 5 6 7 8 9 10
##  - attr(*, "class")= chr [1:3] "GDCprojectsResults" "GDCResults" "list"

By default, only 10 records are returned. We can use the size and from arguments to results() to either page through results or to change the number of results. Finally, there is a convenience method, results_all(), that will simply fetch all the available results given a query. Note that results_all() may take a long time and return HUGE result sets if not used carefully. Using a combination of count() and results() to get a sense of the expected data size is probably warranted before calling results_all().

length(ids(presults))

## [1] 10

presults = pquery %>% results_all()
length(ids(presults))

## [1] 47

# includes all records
length(ids(presults)) == count(pquery)

## [1] TRUE

Extracting subsets of results or manipulating the results into a more conventional R data structure is not easily generalizable. However, the purrr, rlist, and data.tree packages are all potentially of interest for manipulating complex, nested list structures. For viewing the results in an interactive viewer, consider the listviewer package.
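
For simple cases, base R is often enough. The sketch below flattens a toy list, shaped like part of the projects results shown earlier (values copied from that output), into a data.frame by collapsing each multi-valued entry into one string. It is one possible approach among many, not a general solution.

```r
# Toy list shaped like part of the results above.
toy_results = list(
  project_id = c("TARGET-NBL", "TCGA-LGG", "TCGA-ESCA"),
  primary_site = list(
    "TARGET-NBL" = "Nervous System",
    "TCGA-LGG"   = "Brain",
    "TCGA-ESCA"  = c("Esophagus", "Stomach")
  )
)

# Collapse each list element to a single string so it fits in one column.
flat = data.frame(
  project_id   = toy_results$project_id,
  primary_site = vapply(toy_results$primary_site, paste, character(1),
                        collapse = "; "),
  row.names = NULL,
  stringsAsFactors = FALSE
)
flat
```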

Fields and Values

[ GDC fields documentation ]

Central to querying and retrieving data from the GDC is the ability to specify which fields to return, filtering by fields and values, and faceting or aggregating. The GenomicDataCommons package includes two simple functions, available_fields() and default_fields(). Each can operate on a character(1) endpoint name (“cases”, “files”, “annotations”, or “projects”) or a GDCQuery object.

default_fields('files')

##  [1] "access"                         "acl"                           
##  [3] "average_base_quality"           "average_insert_size"           
##  [5] "average_read_length"            "channel"                       
##  [7] "chip_id"                        "chip_position"                 
##  [9] "contamination"                  "contamination_error"           
## [11] "created_datetime"               "data_category"                 
## [13] "data_format"                    "data_type"                     
## [15] "error_type"                     "experimental_strategy"         
## [17] "file_autocomplete"              "file_id"                       
## [19] "file_name"                      "file_size"                     
## [21] "imaging_date"                   "magnification"                 
## [23] "md5sum"                         "mean_coverage"                 
## [25] "origin"                         "pairs_on_diff_chr"             
## [27] "plate_name"                     "plate_well"                    
## [29] "platform"                       "proportion_base_mismatch"      
## [31] "proportion_coverage_10X"        "proportion_coverage_30X"       
## [33] "proportion_reads_duplicated"    "proportion_reads_mapped"       
## [35] "proportion_targets_no_coverage" "read_pair_number"              
## [37] "revision"                       "state"                         
## [39] "state_comment"                  "submitter_id"                  
## [41] "tags"                           "total_reads"                   
## [43] "type"                           "updated_datetime"

# The number of fields available for files endpoint
length(available_fields('files'))

## [1] 827

# The first few fields available for files endpoint
head(available_fields('files'))

## [1] "access"                      "acl"                        
## [3] "analysis.analysis_id"        "analysis.analysis_type"     
## [5] "analysis.created_datetime"   "analysis.input_files.access"

The fields to be returned by a query can be specified following a similar paradigm to that of the dplyr package. The select() function is a verb that resets the fields slot of a GDCQuery; note that this is not quite analogous to the dplyr select() verb that limits from already-present fields. We completely replace the fields when using select() on a GDCQuery.

# Default fields here
qcases = cases()
qcases$fields

##  [1] "aliquot_ids"              "analyte_ids"             
##  [3] "case_autocomplete"        "case_id"                 
##  [5] "created_datetime"         "days_to_index"           
##  [7] "days_to_lost_to_followup" "diagnosis_ids"           
##  [9] "disease_type"             "index_date"              
## [11] "lost_to_followup"         "portion_ids"             
## [13] "primary_site"             "sample_ids"              
## [15] "slide_ids"                "state"                   
## [17] "submitter_aliquot_ids"    "submitter_analyte_ids"   
## [19] "submitter_diagnosis_ids"  "submitter_id"            
## [21] "submitter_portion_ids"    "submitter_sample_ids"    
## [23] "submitter_slide_ids"      "updated_datetime"

# set up query to use ALL available fields
# Note that checking of fields is done by select()
qcases = cases() %>% GenomicDataCommons::select(available_fields('cases'))
head(qcases$fields)

## [1] "case_id"                       "aliquot_ids"                  
## [3] "analyte_ids"                   "annotations.annotation_id"    
## [5] "annotations.case_id"           "annotations.case_submitter_id"

Finding fields of interest is such a common operation that the GenomicDataCommons package includes the grep_fields() function and the field_picker() widget. See the appropriate help pages for details.

Facets and aggregation

[ GDC facet documentation ]

The GDC API offers a feature known as aggregation or faceting. By specifying one or more fields (of appropriate type), the GDC can return to us a count of the number of records matching each potential value. This is similar to the R table method. Multiple fields can be returned at once, but the GDC API does not have a cross-tabulation feature; all aggregations are only on one field at a time. Results of aggregations() calls come back as a list of data.frames (actually, tibbles).

# total number of files of a specific type
res = files() %>% facet(c('type','data_type')) %>% aggregations()
res$type

##                            key doc_count
## 1      simple_somatic_mutation     65977
## 2   annotated_somatic_mutation     65439
## 3                aligned_reads     53580
## 4          copy_number_segment     45258
## 5              gene_expression     40653
## 6                  slide_image     30072
## 7       biospecimen_supplement     25166
## 8             mirna_expression     23334
## 9          clinical_supplement     12510
## 10      methylation_beta_value     12359
## 11 aggregated_somatic_mutation       186
## 12     masked_somatic_mutation       132
## 13        copy_number_estimate        33

Using aggregations() is also an easy way to learn the contents of individual fields and forms the basis for faceted search pages.
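
The difference between per-field aggregation and cross-tabulation can be seen locally with table(). The data below are made up for illustration and are not GDC output; the first result mirrors the one-table-per-field shape that faceting returns, while the second is a cross-tabulation that would have to be computed on downloaded metadata.

```r
# Toy file metadata; the values are illustrative, not GDC output.
toy_files = data.frame(
  type      = c("gene_expression", "gene_expression",
                "aligned_reads", "slide_image"),
  data_type = c("Gene Expression Quantification",
                "Gene Expression Quantification",
                "Aligned Reads", "Slide Image"),
  stringsAsFactors = FALSE
)

# One table per field, analogous to one aggregation per facet.
per_field = lapply(toy_files, table)
per_field$type

# A cross-tabulation of two fields, computed locally.
table(toy_files$type, toy_files$data_type)
```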

Filtering

[ GDC filtering documentation ]

The GenomicDataCommons package uses a form of non-standard evaluation to specify R-like queries that are then translated into an R list. That R list is, upon calling a method that fetches results from the GDC API, translated into the appropriate JSON string. The R expression uses the formula interface as suggested by Hadley Wickham in his vignette on non-standard evaluation:

It’s best to use a formula because a formula captures both the expression to evaluate and the environment where the evaluation occurs. This is important if the expression is a mixture of variables in a data frame and objects in the local environment [for example].

For the user, these details will not be too important except to note that a filter expression must begin with a “~”.

qfiles = files()
qfiles %>% count() # all files

## [1] 374699

To limit the file type, we can refer back to the section on faceting to see the possible values for the file field “type”. For example, to filter file results to only “gene_expression” files, we simply specify a filter.

qfiles = files() %>% filter(~ type == 'gene_expression')
# here is what the filter looks like after translation
str(get_filter(qfiles))

## List of 2
##  $ op     : 'scalar' chr "="
##  $ content:List of 2
##   ..$ field: chr "type"
##   ..$ value: chr "gene_expression"
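For readers curious about how a formula can become a nested list like the one above, here is a rough base R sketch. The helpers expr_to_filter() and formula_to_filter() are hypothetical and handle only ==, & and |; they are not the package's implementation, and they skip the 'scalar' marking seen in the str() output. The sketch does show why a formula is convenient: it carries both the expression and the environment in which values such as wanted are looked up.

```r
# Hypothetical helpers, not part of GenomicDataCommons.
# Recursively turn an unevaluated expression into a nested list.
expr_to_filter <- function(e, env) {
  op <- as.character(e[[1]])
  if (op == "(") return(expr_to_filter(e[[2]], env))
  if (op %in% c("&", "|")) {
    return(list(op = if (op == "&") "and" else "or",
                content = list(expr_to_filter(e[[2]], env),
                               expr_to_filter(e[[3]], env))))
  }
  if (op == "==") {
    # Left side: field name, kept as text. Right side: evaluated value.
    return(list(op = "=",
                content = list(field = deparse(e[[2]]),
                               value = eval(e[[3]], env))))
  }
  stop("operator not handled in this sketch: ", op)
}

# A one-sided formula stores its expression in element 2 and remembers
# the environment where it was created.
formula_to_filter <- function(f) expr_to_filter(f[[2]], environment(f))

wanted <- "gene_expression"
flt <- formula_to_filter(~ type == wanted)
str(flt)

both <- formula_to_filter(
  ~ cases.project.project_id == "TCGA-OV" & type == wanted)
both$op
```

Because the value side is evaluated in the formula's environment, a local variable can be used in place of a literal string.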

What if we want to create a filter based on the project (‘TCGA-OVCA’, for example)? Well, we have a couple of possible ways to discover available fields. The first is based on base R functionality and some intuition.

grep('pro',available_fields('files'),value=TRUE)

##  [1] "analysis.input_files.proportion_base_mismatch"                  
##  [2] "analysis.input_files.proportion_coverage_10X"                   
##  [3] "analysis.input_files.proportion_coverage_30X"                   
##  [4] "analysis.input_files.proportion_reads_duplicated"               
##  [5] "analysis.input_files.proportion_reads_mapped"                   
##  [6] "analysis.input_files.proportion_targets_no_coverage"            
##  [7] "cases.diagnoses.progression_or_recurrence"                      
##  [8] "cases.follow_ups.days_to_progression"                           
##  [9] "cases.follow_ups.days_to_progression_free"                      
## [10] "cases.follow_ups.progression_or_recurrence"                     
## [11] "cases.follow_ups.progression_or_recurrence_anatomic_site"       
## [12] "cases.follow_ups.progression_or_recurrence_type"                
## [13] "cases.project.dbgap_accession_number"                           
## [14] "cases.project.disease_type"                                     
## [15] "cases.project.intended_release_date"                            
## [16] "cases.project.name"                                             
## [17] "cases.project.primary_site"                                     
## [18] "cases.project.program.dbgap_accession_number"                   
## [19] "cases.project.program.name"                                     
## [20] "cases.project.program.program_id"                               
## [21] "cases.project.project_id"                                       
## [22] "cases.project.releasable"                                       
## [23] "cases.project.released"                                         
## [24] "cases.project.state"                                            
## [25] "cases.samples.days_to_sample_procurement"                       
## [26] "cases.samples.method_of_sample_procurement"                     
## [27] "cases.samples.portions.slides.number_proliferating_cells"       
## [28] "cases.tissue_source_site.project"                               
## [29] "downstream_analyses.output_files.proportion_base_mismatch"      
## [30] "downstream_analyses.output_files.proportion_coverage_10X"       
## [31] "downstream_analyses.output_files.proportion_coverage_30X"       
## [32] "downstream_analyses.output_files.proportion_reads_duplicated"   
## [33] "downstream_analyses.output_files.proportion_reads_mapped"       
## [34] "downstream_analyses.output_files.proportion_targets_no_coverage"
## [35] "index_files.proportion_base_mismatch"                           
## [36] "index_files.proportion_coverage_10X"                            
## [37] "index_files.proportion_coverage_30X"                            
## [38] "index_files.proportion_reads_duplicated"                        
## [39] "index_files.proportion_reads_mapped"                            
## [40] "index_files.proportion_targets_no_coverage"                     
## [41] "proportion_base_mismatch"                                       
## [42] "proportion_coverage_10X"                                        
## [43] "proportion_coverage_30X"                                        
## [44] "proportion_reads_duplicated"                                    
## [45] "proportion_reads_mapped"                                        
## [46] "proportion_targets_no_coverage"

Interestingly, the project information is “nested” inside the case. We don’t need to know that detail other than to know that we now have a few potential guesses for where our information might be in the files records. We need to know where because we need to construct the appropriate filter.
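A slightly more specific pattern cuts down the guesswork. The sketch below runs on a handful of field names copied from the output above, so that it works without a connection to the GDC; with the package loaded, the same patterns could be applied to available_fields('files').

```r
# A few field names copied from the grep() output above.
fields <- c(
  "analysis.input_files.proportion_reads_mapped",
  "cases.diagnoses.progression_or_recurrence",
  "cases.project.name",
  "cases.project.program.name",
  "cases.project.project_id",
  "cases.tissue_source_site.project",
  "proportion_reads_mapped"
)

# Matching "project" drops the proportion_* and progression_* hits.
project_fields <- grep("project", fields, value = TRUE)
project_fields

# Anchoring at the end of the string keeps only identifier fields.
id_fields <- grep("project_id$", fields, value = TRUE)
id_fields
```

Faceting on whichever candidates remain then shows the values each one actually holds.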

files() %>% facet('cases.project.project_id') %>% aggregations()

## $cases.project.project_id
##                      key doc_count
## 1                  FM-AD     36134
## 2              TCGA-BRCA     31598
## 3              TCGA-LUAD     17074
## 4              TCGA-UCEC     16187
## 5              TCGA-HNSC     15298
## 6                TCGA-OV     15166
## 7              TCGA-THCA     14435
## 8              TCGA-LUSC     15340
## 9               TCGA-LGG     14741
## 10             TCGA-KIRC     15107
## 11             TCGA-PRAD     14314
## 12             TCGA-COAD     14320
## 13              TCGA-GBM     12005
## 14             TCGA-SKCM     12738
## 15             TCGA-STAD     12867
## 16             TCGA-BLCA     11721
## 17             TCGA-LIHC     10832
## 18             TCGA-CESC      8599
## 19             TCGA-KIRP      8541
## 20             TCGA-SARC      7494
## 21             TCGA-PAAD      5307
## 22             TCGA-ESCA      5289
## 23             TCGA-PCPG      5044
## 24             TCGA-READ      4925
## 25               CPTAC-3      8404
## 26             TCGA-TGCT      4301
## 27             TCGA-LAML      4434
## 28             TCGA-THYM      3445
## 29            TARGET-AML      2462
## 30            TARGET-NBL      2807
## 31              TCGA-ACC      2556
## 32             TCGA-KICH      2325
## 33          NCICCR-DLBCL      4805
## 34             TCGA-MESO      2346
## 35              TCGA-UVM      2180
## 36             TARGET-WT      1410
## 37              TCGA-UCS      1659
## 38         TARGET-ALL-P3      2503
## 39             TCGA-CHOL      1354
## 40             TCGA-DLBC      1229
## 41             TARGET-OS        66
## 42             TARGET-RT       381
## 43           CTSP-DLBCL1       417
## 44             HCMI-CMDC       280
## 45 BEATAML1.0-CRENOLANIB       223
## 46           TARGET-CCSK        15
## 47        VAREPOP-APOLLO        21

We note that cases.project.project_id looks like it is a good fit. We also note that TCGA-OV is the correct project_id, not TCGA-OVCA. Note that unlike with dplyr and friends, the filter() method here replaces the filter and does not build on any previous filters.

qfiles = files() %>%
    filter( ~ cases.project.project_id == 'TCGA-OV' & type == 'gene_expression')
str(get_filter(qfiles))

## List of 2
##  $ op     : 'scalar' chr "and"
##  $ content:List of 2
##   ..$ :List of 2
##   .. ..$ op     : 'scalar' chr "="
##   .. ..$ content:List of 2
##   .. .. ..$ field: chr "cases.project.project_id"
##   .. .. ..$ value: chr "TCGA-OV"
##   ..$ :List of 2
##   .. ..$ op     : 'scalar' chr "="
##   .. ..$ content:List of 2
##   .. .. ..$ field: chr "type"
##   .. .. ..$ value: chr "gene_expression"

qfiles %>% count()

## [1] 1137

Asking for a count() of results given these new filter criteria gives 1137 results. Generating a manifest for bulk downloads is as simple as asking for the manifest from the current query.

manifest_df = qfiles %>% manifest()
head(manifest_df)

## # A tibble: 6 x 5
##   id                filename                  md5               size state 
##   <chr>             <chr>                     <chr>            <dbl> <chr> 
## 1 e09d9ae2-b5b5-47… 56427511-2167-465f-bdde-… 1be3c2be2cff4c… 250109 relea…
## 2 b02459dd-1f1d-4a… 5327762f-5add-4008-ac5f-… b22976ca2e31bd… 256176 relea…
## 3 329ae6b7-3f35-4f… 9ac17699-409b-4750-9317-… 8ce298486ba9ab… 261299 relea…
## 4 4eba4f78-eb17-4b… e2cf2389-07bc-49d0-9426-… 538cc6b0bf8832… 527507 relea…
## 5 96afc1e3-07fb-48… c5e8e5e5-91e0-4e3c-97ad-… 41e957782aa95c… 549958 relea…
## 6 ff7edff0-bfb5-45… 25d6dd26-51c7-40dd-962a-… fd0b10ceb8566f… 253578 relea…

Note that we might still not be quite there. The filenames suggest that the results mix several kinds of files, with names that include “FPKM”, “FPKM-UQ”, or “counts”. Another round of grep() on available_fields(), this time looking for “type”, shows that the field “analysis.workflow_type” holds the values we need to filter on.

qfiles = files() %>% filter( ~ cases.project.project_id == 'TCGA-OV' &
                            type == 'gene_expression' &
                            analysis.workflow_type == 'HTSeq - Counts')
manifest_df = qfiles %>% manifest()
nrow(manifest_df)

## [1] 379

The GDC Data Transfer Tool can be used (from R with transfer(), or from the command line) to orchestrate high-performance, restartable transfers of all the files in the manifest. See the bulk downloads section for details.

Authentication

[ GDC authentication documentation ]

The GDC offers both “controlled-access” and “open” data. As of this writing, only data stored as files is “controlled-access”; that is, metadata accessible via the GDC is all “open” data and some files are “open” and some are “controlled-access”. Controlled-access data are only available after going through the process of obtaining access.

After controlled-access to one or more datasets has been granted, logging into the GDC web portal will allow you to access a GDC authentication token, which can be downloaded and then used to access available controlled-access data via the GenomicDataCommons package.

The GenomicDataCommons uses authentication tokens only for downloading data (see transfer and gdcdata documentation). The package includes a helper function, gdc_token, that looks for the token to be stored in one of three ways (resolved in this order):

As a string stored in the environment variable, GDC_TOKEN
As a file, stored in the file named by the environment variable, GDC_TOKEN_FILE
In a file in the user home directory, called .gdc_token

As a concrete example:

token = gdc_token()
transfer(...,token=token)
# or
transfer(...,token=gdc_token())
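The lookup order described above can be sketched in base R. The find_token() function below is a hypothetical stand-in written for this workshop, not the package's gdc_token() code; in particular, reading only the first line of the file is an assumption.

```r
# Hypothetical helper that mirrors the lookup order described above.
# It is not the package's gdc_token() implementation.
find_token <- function(home_file = path.expand("~/.gdc_token")) {
  # 1. A token stored directly in the GDC_TOKEN environment variable.
  token <- Sys.getenv("GDC_TOKEN", unset = "")
  if (nzchar(token)) return(token)
  # 2. A file whose path is stored in GDC_TOKEN_FILE.
  token_file <- Sys.getenv("GDC_TOKEN_FILE", unset = "")
  if (nzchar(token_file) && file.exists(token_file)) {
    return(trimws(readLines(token_file, n = 1, warn = FALSE)))
  }
  # 3. A .gdc_token file in the home directory.
  if (file.exists(home_file)) {
    return(trimws(readLines(home_file, n = 1, warn = FALSE)))
  }
  stop("no GDC token found")
}

# Demonstration with throwaway values. Note that this changes GDC_TOKEN
# and GDC_TOKEN_FILE for the current R session.
tmp <- tempfile()
writeLines("token-from-file", tmp)
Sys.unsetenv("GDC_TOKEN")
Sys.setenv(GDC_TOKEN_FILE = tmp)
from_file <- find_token()
Sys.setenv(GDC_TOKEN = "token-from-env")
from_env <- find_token()
Sys.unsetenv(c("GDC_TOKEN", "GDC_TOKEN_FILE"))
c(from_file = from_file, from_env = from_env)
```

When both variables are set, the environment variable holding the token itself takes precedence over the token file.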

Datafile access and download

The gdcdata function takes a character vector of one or more file ids. A simple way of producing such a vector is to produce a manifest data frame and then pass in the first column, which will contain file ids.

fnames = gdcdata(manifest_df$id[1:2],progress=FALSE)

Note that for controlled-access data, a GDC authentication token is required. Using the BiocParallel package may be useful for downloading in parallel, particularly for large numbers of smallish files.

The bulk download functionality is only efficient (as of v1.2.0 of the GDC Data Transfer Tool) for relatively large files, so use this approach only when transferring BAM files or larger VCF files, for example. Otherwise, consider using the approach shown above, perhaps in parallel.

fnames = gdcdata(manifest_df$id[3:10], access_method = 'client')
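Since the manifest carries an md5 column, a downloaded file can be checked against it. The sketch below uses a toy file and a toy manifest so that it runs offline; with real downloads, the file paths and the manifest rows would have to be matched up first, and how to do that depends on where the files were saved.

```r
# Stand-in for a downloaded file: the three bytes "abc", no trailing newline.
path <- file.path(tempdir(), "toy_download.txt")
writeBin(charToRaw("abc"), path)

# Stand-in for a manifest; only the columns used here are included.
toy_manifest <- data.frame(
  filename = "toy_download.txt",
  md5      = "900150983cd24fb0d6963f7d28e17f72",  # MD5 of the string "abc"
  stringsAsFactors = FALSE
)

# tools::md5sum() ships with R and returns one checksum per file path.
observed <- unname(tools::md5sum(file.path(tempdir(), toy_manifest$filename)))
toy_manifest$ok <- observed == toy_manifest$md5
toy_manifest
```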

Accessing The Cancer Genome Atlas (TCGA)

We summarize two approaches to accessing TCGA data:

TCGAbiolinks:
1. data access through GenomicDataCommons
2. provides data both from the legacy Firehose pipeline used by the TCGA publications (alignments based on hg18 and hg19 builds²), and the GDC harmonized GRCh38 pipeline³.
3. downloads files from the Genomic Data Commons, and provides conversion to (Ranged)SummarizedExperiment where possible
curatedTCGAData:
1. data access through ExperimentHub
2. provides data from the legacy Firehose pipeline⁴
3. provides individual assays as (Ranged)SummarizedExperiment and RaggedExperiment, integrates multiple assays within and across cancer types using MultiAssayExperiment

TCGAbiolinks

We demonstrate here generating a RangedSummarizedExperiment for RNA-seq data from adrenocortical carcinoma (ACC). For additional information and options, see the TCGAbiolinks vignettes⁵.

Load packages:

library(TCGAbiolinks)
library(SummarizedExperiment)

Search for matching data:

query <- GDCquery(project = "TCGA-ACC",
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  platform = "Illumina HiSeq",
                  file.type = "normalized_results",
                  experimental.strategy = "RNA-Seq",
                  legacy = TRUE)

Download data and convert it to RangedSummarizedExperiment:

gdcdir <- file.path("Waldron_PublicData", "GDCdata")
GDCdownload(query, method = "api", files.per.chunk = 10,
            directory = gdcdir)
ACCse <- GDCprepare(query, directory = gdcdir)
ACCse

curatedTCGAData: Curated Data From The Cancer Genome Atlas as MultiAssayExperiment Objects

curatedTCGAData does not interface with the Genomic Data Commons, but downloads data from Bioconductor’s ExperimentHub.

library(curatedTCGAData)
library(MultiAssayExperiment)

By default, the curatedTCGAData() function will only show available datasets, and not download anything. The arguments are shown here only for demonstration; the same result is obtained with no arguments:

curatedTCGAData(diseaseCode = "*", assays = "*")

## Please see the list below for available cohorts and assays

## Available Cancer codes:
##  ACC BLCA BRCA CESC CHOL COAD DLBC ESCA GBM HNSC KICH
##  KIRC KIRP LAML LGG LIHC LUAD LUSC MESO OV PAAD PCPG
##  PRAD READ SARC SKCM STAD TGCT THCA THYM UCEC UCS UVM 
## Available Data Types:
##  CNACGH CNACGH_CGH_hg_244a
##  CNACGH_CGH_hg_415k_g4124a CNASeq CNASNP
##  CNVSNP GISTIC_AllByGene GISTIC_Peaks
##  GISTIC_ThresholdedByGene Methylation
##  Methylation_methyl27 Methylation_methyl450
##  miRNAArray miRNASeqGene mRNAArray
##  mRNAArray_huex mRNAArray_TX_g4502a
##  mRNAArray_TX_g4502a_1
##  mRNAArray_TX_ht_hg_u133a Mutation
##  RNASeq2GeneNorm RNASeqGene RPPAArray

Check potential files to be downloaded for adrenocortical carcinoma (ACC):

curatedTCGAData(diseaseCode = "ACC")

##                                    Title DispatchClass
## 1                    ACC_CNASNP-20160128           Rda
## 2                    ACC_CNVSNP-20160128           Rda
## 4          ACC_GISTIC_AllByGene-20160128           Rda
## 5              ACC_GISTIC_Peaks-20160128           Rda
## 6  ACC_GISTIC_ThresholdedByGene-20160128           Rda
## 8        ACC_Methylation-20160128_assays        H5File
## 9            ACC_Methylation-20160128_se           Rds
## 10             ACC_miRNASeqGene-20160128           Rda
## 11                 ACC_Mutation-20160128           Rda
## 12          ACC_RNASeq2GeneNorm-20160128           Rda
## 13                ACC_RPPAArray-20160128           Rda

Actually download the reverse phase protein array (RPPA) and RNA-seq data for ACC:

ACCmae <- curatedTCGAData("ACC", c("RPPAArray", "RNASeq2GeneNorm"), 
                          dry.run=FALSE)
ACCmae

## A MultiAssayExperiment object of 2 listed
##  experiments with user-defined names and respective classes. 
##  Containing an ExperimentList class object of length 2: 
##  [1] ACC_RNASeq2GeneNorm-20160128: SummarizedExperiment with 20501 rows and 79 columns 
##  [2] ACC_RPPAArray-20160128: SummarizedExperiment with 192 rows and 46 columns 
## Features: 
##  experiments() - obtain the ExperimentList instance 
##  colData() - the primary/phenotype DataFrame 
##  sampleMap() - the sample availability DataFrame 
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment 
##  *Format() - convert into a long or wide DataFrame 
##  assays() - convert ExperimentList to a SimpleList of matrices

Note. Data will be downloaded the first time the above command is run; subsequent times it will be loaded from local cache.

This object contains 822 columns of clinical, pathological, specimen, and subtypes data in its colData, merged from all available data levels (1-4) of the Firehose pipeline:

dim(colData(ACCmae))

## [1]  79 822

head(colnames(colData(ACCmae)))

## [1] "patientID"             "years_to_birth"        "vital_status"         
## [4] "days_to_death"         "days_to_last_followup" "tumor_tissue_site"

See the MultiAssayExperiment vignette (Ramos et al. 2017) and the Workflow for Multi-omics Analysis with MultiAssayExperiment workshop for details on using this object.

Subtype information

Some cancer datasets contain associated subtype information within the clinical datasets provided. This subtype information is included in the metadata of colData of the MultiAssayExperiment object. To obtain these variable names, run the metadata function on the colData of the object such as:

head(metadata(colData(ACCmae))[["subtypes"]])

##         ACC_annotations   ACC_subtype
## 1            Patient_ID        SAMPLE
## 2 histological_subtypes     Histology
## 3         mrna_subtypes       C1A/C1B
## 4         mrna_subtypes       mRNA_K4
## 5                  cimp    MethyLevel
## 6     microrna_subtypes miRNA cluster

recount: Reproducible RNA-seq Analysis Using recount2

The recount (Collado-Torres et al. 2017) package provides, for every available study, uniformly processed RangedSummarizedExperiment objects at the gene, exon, or exon-exon junction level, along with the raw counts, the phenotype metadata used, and the URLs of the per-sample and mean coverage bigWig files. The RangedSummarizedExperiment objects can be used for differential expression analysis. These are also accessible through a web interface.⁶

recount provides a search function:

library(recount)
project_info <- abstract_search('GSE32465')

It is not an ExperimentHub package, so downloading and serializing is slightly more involved and takes two steps. First, download the gene-level RangedSummarizedExperiment data:

download_study(project_info$project)

## 2019-06-22 01:51:16 downloading file rse_gene.Rdata to SRP009615

Then load the data:

load(file.path(project_info$project, 'rse_gene.Rdata'))
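Because load() restores objects under whatever names they were saved with, it can help to load into a separate environment and check what arrived. The sketch below uses a placeholder list saved to a temporary file; the object name rse_gene is assumed to match what the downloaded rse_gene.Rdata file contains.

```r
# Placeholder saved under the assumed object name; it is a plain list,
# not a real RangedSummarizedExperiment.
rse_gene <- list(note = "placeholder")
rdata <- file.path(tempdir(), "rse_gene.Rdata")
save(rse_gene, file = rdata)
rm(rse_gene)

# load() returns the names of the restored objects; a separate
# environment keeps them from overwriting anything in the workspace.
e <- new.env()
loaded <- load(rdata, envir = e)
loaded
obj <- e[[loaded[1]]]
class(obj)
```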

curated*Data packages for standardized cancer transcriptomes

There are focused databases of cancer microarray data for several cancer types, which can be useful for researchers of those cancer types or for methodological development:

curatedOvarianData (Ganzfried et al. 2013): Clinically Annotated Data for the Ovarian Cancer Transcriptome (data available with additional options through the MetaGxOvarian package).
curatedBladderData: Clinically Annotated Data for the Bladder Cancer Transcriptome
curatedCRCData: Clinically Annotated Data for the Colorectal Cancer Transcriptome

These provide data from the Gene Expression Omnibus and other sources, but use a formal vocabulary for clinicopathological data and a common pipeline for preprocessing of microarray data (for Affymetrix; for other platforms, the data are provided as processed by the original study authors), merging probesets, and mapping to gene symbols. The pipeline is described by Ganzfried et al. (2013).
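As a rough illustration of what merging probesets can mean, the base R sketch below keeps, for each gene symbol, the probeset with the highest mean expression. This is one common strategy and is shown only as an assumption; see Ganzfried et al. (2013) for what the pipeline actually does. The matrix, probeset names, and gene symbols are all made up.

```r
# Toy expression matrix: 4 probesets (rows) by 3 samples (columns).
expr <- matrix(c(5, 6, 7,
                 8, 9, 10,
                 2, 2, 2,
                 4, 3, 5),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("ps1", "ps2", "ps3", "ps4"),
                               c("s1", "s2", "s3")))

# Hypothetical probeset-to-symbol map; GENE_A is measured by two probesets.
symbol <- c(ps1 = "GENE_A", ps2 = "GENE_A", ps3 = "GENE_B", ps4 = "GENE_C")

# For each gene symbol, keep the probeset with the highest mean expression.
means <- rowMeans(expr)
keep <- vapply(split(names(means), symbol[names(means)]),
               function(ps) ps[which.max(means[ps])],
               character(1))

gene_expr <- expr[keep, , drop = FALSE]
rownames(gene_expr) <- names(keep)
gene_expr
```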

Microbiome data

Bioconductor provides curated resources of microbiome data. Most microbiome data are generated either by targeted amplicon sequencing (usually of variable regions of the 16S ribosomal RNA gene) or by metagenomic shotgun sequencing (MGX). These two approaches are analyzed by different sequence analysis tools, but downstream statistical and ecological analysis can involve any of the following types of data:

taxonomic abundance at different levels of the taxonomic hierarchy
phylogenetic distances and the phylogenetic tree of life
metabolic potential of the microbiome
abundance of microbial genes and gene families

A review of types and properties of microbiome data is provided by Morgan and Huttenhower (2012).

curatedMetagenomicData: Curated and processed metagenomic data through ExperimentHub

curatedMetagenomicData (Pasolli et al. 2017) provides 6 types of processed data for >30 publicly available whole-metagenome shotgun sequencing datasets (obtained from the Sequence Read Archive):

1. Species-level taxonomic profiles, expressed as relative abundance from kingdom to strain level
2. Presence of unique, clade-specific markers
3. Abundance of unique, clade-specific markers
4. Abundance of gene families
5. Metabolic pathway coverage
6. Metabolic pathway abundance

Types 1-3 are generated by MetaPhlAn2; 4-6 are generated by HUMAnN2.

Currently, curatedMetagenomicData provides:

8184 samples from 46 datasets, primarily of the human gut but including body sites profiled in the Human Microbiome Project
Processed data from whole-metagenome shotgun metagenomics, with manually-curated metadata, as integrated and documented Bioconductor ExpressionSet objects
~80 fields of specimen metadata from original papers, supplementary files, and websites, with manual curation to standardize annotations
Processing of data through the MetaPhlAn2 pipeline for taxonomic abundance, and HUMAnN2 pipeline for metabolic analysis
These represent ~100TB of raw sequencing data, but the processed data provided are much smaller.

These datasets are documented in the reference manual.

This is an ExperimentHub package, and its main workhorse function is curatedMetagenomicData().

The manually curated metadata for all available samples are provided in a single table combined_metadata:

library(curatedMetagenomicData)
?combined_metadata
View(data.frame(combined_metadata))

The main function provides a list of ExpressionSet objects:

oral <- c("BritoIL_2016.metaphlan_bugs_list.oralcavity",
          "Castro-NallarE_2015.metaphlan_bugs_list.oralcavity")
esl <- curatedMetagenomicData(oral, dryrun = FALSE)

## Working on BritoIL_2016.metaphlan_bugs_list.oralcavity

## snapshotDate(): 2019-06-20

## see ?curatedMetagenomicData and browseVignettes('curatedMetagenomicData') for documentation

## downloading 0 resources

## loading from cache 
##     'EH1179 : 1179'

## Working on Castro-NallarE_2015.metaphlan_bugs_list.oralcavity

## snapshotDate(): 2019-06-20

## see ?curatedMetagenomicData and browseVignettes('curatedMetagenomicData') for documentation

## downloading 0 resources

## loading from cache 
##     'EH1714 : 1714'

esl

## List of length 2
## names(2): BritoIL_2016.metaphlan_bugs_list.oralcavity ...
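The dataset names used above appear to follow a three-part pattern separated by dots. The base R sketch below splits the two names from the example into their parts; the column labels study, data_type, and body_site are our own and are not taken from the package.

```r
oral <- c("BritoIL_2016.metaphlan_bugs_list.oralcavity",
          "Castro-NallarE_2015.metaphlan_bugs_list.oralcavity")

# Split on literal dots and stack the pieces into a character matrix.
parts <- do.call(rbind, strsplit(oral, ".", fixed = TRUE))
colnames(parts) <- c("study", "data_type", "body_site")  # labels are ours
as.data.frame(parts, stringsAsFactors = FALSE)
```

A table like this can be handy for choosing datasets by body site or data type before requesting them.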

These ExpressionSet objects can also be converted to phyloseq objects for ecological analysis and for differential abundance analysis with the DESeq2 package, using the ExpressionSet2phyloseq() function:

ExpressionSet2phyloseq( esl[[1]], phylogenetictree = TRUE)

## Loading required namespace: phyloseq

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 535 taxa and 140 samples ]
## sample_data() Sample Data:       [ 140 samples by 17 sample variables ]
## tax_table()   Taxonomy Table:    [ 535 taxa by 8 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 535 tips and 534 internal nodes ]

See the documentation of phyloseq for more on ecological and differential abundance analysis of the microbiome.

HMP16SData: 16S rRNA Sequencing Data from the Human Microbiome Project

suppressPackageStartupMessages(library(HMP16SData))

## snapshotDate(): 2019-06-20

HMP16SData (Schiffer et al. 2018) is a Bioconductor ExperimentData package of the Human Microbiome Project (HMP) 16S rRNA sequencing data. Taxonomic count data files are provided as downloaded from the HMP Data Analysis and Coordination Center from its QIIME pipeline. Processed data is provided as SummarizedExperiment class objects via ExperimentHub. Like other ExperimentHub-based packages, a convenience function does downloading, automatic local caching, and serializing of a Bioconductor data class. This returns taxonomic counts from the V1-3 variable region of the 16S rRNA gene, along with the unrestricted participant data and phylogenetic tree.

V13()

## snapshotDate(): 2019-06-20

## see ?HMP16SData and browseVignettes('HMP16SData') for documentation

## downloading 0 resources

## loading from cache 
##     'EH1117 : 1117'

## class: SummarizedExperiment 
## dim: 43140 2898 
## metadata(2): experimentData phylogeneticTree
## assays(1): 16SrRNA
## rownames(43140): OTU_97.1 OTU_97.10 ... OTU_97.9997 OTU_97.9999
## rowData names(7): CONSENSUS_LINEAGE SUPERKINGDOM ... FAMILY GENUS
## colnames(2898): 700013549 700014386 ... 700114963 700114965
## colData names(7): RSID VISITNO ... HMP_BODY_SUBSITE SRS_SAMPLE_ID

This can also be converted to phyloseq for ecological and differential abundance analysis; see the HMP16SData vignette for details.

Pharmacogenomics

Pharmacogenomics holds great promise for the development of biomarkers of drug response and the design of new therapeutic options, which are key challenges in precision medicine. However, such data are scattered and lack standards for efficient access and analysis, consequently preventing the realization of the full potential of pharmacogenomics. To address these issues, we implemented PharmacoGx, an easy-to-use, open source package for integrative analysis of multiple pharmacogenomic datasets. PharmacoGx provides a unified framework for downloading and analyzing large pharmacogenomic datasets which are extensively curated to ensure maximum overlap and consistency.

Examples of PharmacoGx usage in biomedical research can be found in the following publications:

Getting started

Let us first load the PharmacoGx library.

library(PharmacoGx)

We can now access large-scale preclinical pharmacogenomic datasets that have been fully curated for ease of use.

Overview of PharmacoGx datasets (PharmacoSets)

To efficiently store and analyze large pharmacogenomic datasets, we developed the PharmacoSet class (also referred to as PSet), which acts as a data container storing pharmacological and molecular data along with experimental metadata (detailed structure provided in Supplementary materials). This class enables efficient implementation of curated annotations for cell lines, drug compounds and molecular features, which facilitates comparisons between different datasets stored as PharmacoSet objects.

We have made the PharmacoSet objects of the curated datasets available for download using functions provided in the package. A table of available PharmacoSet objects can be obtained by using the availablePSets function. Any of the PharmacoSets in the table can then be downloaded by calling downloadPSet, which saves the dataset into a directory of the user's choice and returns the data into the R session.

Structure of the PharmacoSet class

To get a list of all the available PharmacoSets in PharmacoGx, we can use the availablePSets function, which returns a table providing key information for each dataset.

availablePSets(saveDir=file.path(".", "Waldron_PublicData"))

##                       PSet.Name Dataset.Type Available.Molecular.Profiles
## CCLE_2013             CCLE_2013  sensitivity                 rna/mutation
## CCLE                       CCLE  sensitivity      rna/rnaseq/mutation/cnv
## GDSC_2013             GDSC_2013  sensitivity                 rna/mutation
## GDSC                       GDSC  sensitivity rna/rna2/mutation/fusion/cnv
## GDSC1000               GDSC1000  sensitivity                          rna
## gCSI                       gCSI  sensitivity                   rnaseq/cnv
## FIMM                       FIMM  sensitivity                             
## CTRPv2                   CTRPv2  sensitivity                             
## CMAP                       CMAP perturbation                          rna
## L1000_compounds L1000_compounds perturbation                          rna
## L1000_genetic     L1000_genetic perturbation                          rna
##                             Date.Updated
## CCLE_2013       Tue Sep 15 18:50:07 2015
## CCLE            Thu Dec 10 18:17:14 2015
## GDSC_2013       Mon Oct  5 16:07:54 2015
## GDSC            Wed Dec 30 10:44:21 2015
## GDSC1000        Thu Aug 25 11:13:00 2016
## gCSI            Mon Jun 13 18:50:12 2016
## FIMM             Mon Oct 3 17:14:00 2016
## CTRPv2          Thu Aug 25 11:15:00 2016
## CMAP            Mon Sep 21 02:38:45 2015
## L1000_compounds Mon Jan 25 12:51:00 2016
## L1000_genetic   Mon Jan 25 12:51:00 2016
##                                                                                                  URL
## CCLE_2013       https://www.pmgenomics.ca/bhklab/sites/default/files/downloads/CCLE_Nature2013.RData
## CCLE                       https://www.pmgenomics.ca/bhklab/sites/default/files/downloads/CCLE.RData
## GDSC_2013        https://www.pmgenomics.ca/bhklab/sites/default/files/downloads/CGP_Nature2013.RData
## GDSC                       https://www.pmgenomics.ca/bhklab/sites/default/files/downloads/GDSC.RData
## GDSC1000               https://www.pmgenomics.ca/bhklab/sites/default/files/downloads/GDSC1000.RData
## gCSI                       https://www.pmgenomics.ca/bhklab/sites/default/files/downloads/gCSI.RData
## FIMM                       https://www.pmgenomics.ca/bhklab/sites/default/files/downloads/FIMM.RData
## CTRPv2                   https://www.pmgenomics.ca/bhklab/sites/default/files/downloads/CTRPv2.RData
## CMAP                       https://www.pmgenomics.ca/bhklab/sites/default/files/downloads/CMAP.RData
## L1000_compounds https://www.pmgenomics.ca/bhklab/sites/default/files/downloads/L1000_compounds.RData
## L1000_genetic     https://www.pmgenomics.ca/bhklab/sites/default/files/downloads/L1000_genetic.RData

There are currently 11 datasets available in PharmacoGx, including 8 sensitivity datasets and 3 perturbation datasets (see below).
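These counts can be checked directly from the table. The sketch below rebuilds three of its columns by hand from the output above so that it runs offline; in a live session the same two steps could be applied to the data frame returned by availablePSets.

```r
# Three columns copied by hand from the availablePSets output above.
psets_toy <- data.frame(
  PSet.Name = c("CCLE_2013", "CCLE", "GDSC_2013", "GDSC", "GDSC1000",
                "gCSI", "FIMM", "CTRPv2", "CMAP", "L1000_compounds",
                "L1000_genetic"),
  Dataset.Type = rep(c("sensitivity", "perturbation"), times = c(8, 3)),
  Available.Molecular.Profiles = c(
    "rna/mutation", "rna/rnaseq/mutation/cnv", "rna/mutation",
    "rna/rna2/mutation/fusion/cnv", "rna", "rnaseq/cnv",
    "", "", "rna", "rna", "rna"),
  stringsAsFactors = FALSE
)

# Number of datasets of each type.
type_counts <- table(psets_toy$Dataset.Type)
type_counts

# Which datasets list RNA-seq among their molecular profiles?
profiles <- strsplit(psets_toy$Available.Molecular.Profiles, "/", fixed = TRUE)
has_rnaseq <- vapply(profiles, function(p) "rnaseq" %in% p, logical(1))
rnaseq_sets <- psets_toy$PSet.Name[has_rnaseq]
rnaseq_sets
```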

Drug Sensitivity Datasets

Drug sensitivity datasets refer to pharmacogenomic data where cancer cells are molecularly profiled at baseline (before drug treatment), and the effect of drug treatment on cell viability is measured using a pharmacological assay (e.g., Cell Titer-Glo). These datasets can be used for biomarker discovery by correlating the molecular features of cancer cells to their response to drugs of interest.

Schematic view of the drug sensitivity datasets.

psets <- availablePSets(saveDir=file.path(".", "Waldron_PublicData"))
psets[psets[ , "Dataset.Type"] == "sensitivity", ]

##           PSet.Name Dataset.Type Available.Molecular.Profiles
## CCLE_2013 CCLE_2013  sensitivity                 rna/mutation
## CCLE           CCLE  sensitivity      rna/rnaseq/mutation/cnv
## GDSC_2013 GDSC_2013  sensitivity                 rna/mutation
## GDSC           GDSC  sensitivity rna/rna2/mutation/fusion/cnv
## GDSC1000   GDSC1000  sensitivity                          rna
## gCSI           gCSI  sensitivity                   rnaseq/cnv
## FIMM           FIMM  sensitivity                             
## CTRPv2       CTRPv2  sensitivity                             
##                       Date.Updated
## CCLE_2013 Tue Sep 15 18:50:07 2015
## CCLE      Thu Dec 10 18:17:14 2015
## GDSC_2013 Mon Oct  5 16:07:54 2015
## GDSC      Wed Dec 30 10:44:21 2015
## GDSC1000  Thu Aug 25 11:13:00 2016
## gCSI      Mon Jun 13 18:50:12 2016
## FIMM       Mon Oct 3 17:14:00 2016
## CTRPv2    Thu Aug 25 11:15:00 2016
##                                                                                            URL
## CCLE_2013 https://www.pmgenomics.ca/bhklab/sites/default/files/downloads/CCLE_Nature2013.RData
## CCLE                 https://www.pmgenomics.ca/bhklab/sites/default/files/downloads/CCLE.RData
## GDSC_2013  https://www.pmgenomics.ca/bhklab/sites/default/files/downloads/CGP_Nature2013.RData
## GDSC                 https://www.pmgenomics.ca/bhklab/sites/default/files/downloads/GDSC.RData
## GDSC1000         https://www.pmgenomics.ca/bhklab/sites/default/files/downloads/GDSC1000.RData
## gCSI                 https://www.pmgenomics.ca/bhklab/sites/default/files/downloads/gCSI.RData
## FIMM                 https://www.pmgenomics.ca/bhklab/sites/default/files/downloads/FIMM.RData
## CTRPv2             https://www.pmgenomics.ca/bhklab/sites/default/files/downloads/CTRPv2.RData

Notably, the Genomics of Drug Sensitivity in Cancer (GDSC) and the Cancer Cell Line Encyclopedia (CCLE) are large drug sensitivity datasets published in seminal studies in Nature: Garnett et al., Systematic identification of genomic markers of drug sensitivity in cancer cells, Nature (2012), https://www.nature.com/articles/nature11005, and Barretina et al., The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity, Nature (2012), respectively.

Drug Perturbation Datasets

Drug perturbation datasets are pharmacogenomic data in which gene expression profiles are measured before and after short-term (e.g., 6h) drug treatment to identify genes that are up- or down-regulated by the treatment. These datasets can be used to classify drugs (drug taxonomy), infer their mechanisms of action, or find drugs with similar effects (drug repurposing).

Schematic view of drug perturbation datasets
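As a rough illustration of how a perturbation signature can be derived, the sketch below simulates paired before/after expression profiles and scores each gene with a paired t-statistic. All values are simulated, the gene and sample names are hypothetical, and the paired t-test is only one simple choice of statistic, not necessarily the method used by PharmacoGx.

```r
# Simulated data only: 8 hypothetical genes measured in 6 paired samples.
set.seed(2)
n_pairs <- 6
genes <- paste0("gene", 1:8)
before <- matrix(rnorm(n_pairs * 8, mean = 8), nrow = 8,
                 dimnames = list(genes, paste0("sample", seq_len(n_pairs))))

# Simulated treatment effect on the log2 scale:
# gene1 goes up by 2, gene2 goes down by 2, the other genes are unchanged.
effect <- c(2, -2, rep(0, 6))
after <- before + effect + matrix(rnorm(n_pairs * 8, sd = 0.3), nrow = 8)

# Perturbation signature: one paired t-statistic (after vs. before) per gene.
signature <- sapply(genes, function(g) {
  unname(t.test(after[g, ], before[g, ], paired = TRUE)$statistic)
})

# Large positive values suggest up-regulation, large negative values down-regulation.
print(sort(signature, decreasing = TRUE))
```

Signatures computed this way for several drugs could then be compared, for example by rank correlation, to look for drugs with similar transcriptional effects.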

# List the available PharmacoSets, then keep only the drug perturbation datasets
psets <- availablePSets(saveDir=file.path(".", "Waldron_PublicData"))
psets[psets[ , "Dataset.Type"] == "perturbation", ]

##                       PSet.Name Dataset.Type Available.Molecular.Profiles
## CMAP                       CMAP perturbation                          rna
## L1000_compounds L1000_compounds perturbation                          rna
## L1000_genetic     L1000_genetic perturbation                          rna
##                             Date.Updated
## CMAP            Mon Sep 21 02:38:45 2015
## L1000_compounds Mon Jan 25 12:51:00 2016
## L1000_genetic   Mon Jan 25 12:51:00 2016
##                                                                                                  URL
## CMAP                       https://www.pmgenomics.ca/bhklab/sites/default/files/downloads/CMAP.RData
## L1000_compounds https://www.pmgenomics.ca/bhklab/sites/default/files/downloads/L1000_compounds.RData
## L1000_genetic     https://www.pmgenomics.ca/bhklab/sites/default/files/downloads/L1000_genetic.RData

Large drug perturbation datasets have been generated within the Connectivity Map Project (CMAP), with CMAPv2 and CMAPv3 available from PharmacoGx. They were published in Lamb et al., The Connectivity Map: Using Gene-Expression Signatures to Connect Small Molecules, Genes, and Disease, Science (2006) and Subramanian et al., A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles, Cell (2017), respectively.

Exploring drug sensitivity datasets

The workshop "Biomarker discovery from large pharmacogenomics datasets" demonstrates analyses of PharmacoGx data.

Bibliography

Collado-Torres, Leonardo, Abhinav Nellore, Kai Kammers, Shannon E Ellis, Margaret A Taub, Kasper D Hansen, Andrew E Jaffe, Ben Langmead, and Jeffrey T Leek. 2017. “Reproducible RNA-seq Analysis Using Recount2.” Nature Biotechnology 35 (4): 319–21. https://doi.org/10.1038/nbt.3838.

Davis, Sean R., and Paul S Meltzer. 2007. “GEOquery: A Bridge Between the Gene Expression Omnibus (GEO) and BioConductor.” Bioinformatics 23 (14): 1846–7. https://doi.org/10.1093/bioinformatics/btm254.

Ganzfried, Benjamin Frederick, Markus Riester, Benjamin Haibe-Kains, Thomas Risch, Svitlana Tyekucheva, Ina Jazic, Xin Victoria Wang, et al. 2013. “curatedOvarianData: Clinically Annotated Data for the Ovarian Cancer Transcriptome.” Database: The Journal of Biological Databases and Curation 2013 (April): bat013. https://doi.org/10.1093/database/bat013.

Kannan, Lavanya, Marcel Ramos, Angela Re, Nehme El-Hachem, Zhaleh Safikhani, Deena M A Gendoo, Sean Davis, et al. 2016. “Public Data and Open Source Tools for Multi-Assay Genomic Investigation of Disease.” Briefings in Bioinformatics 17 (4): 603–15. https://doi.org/10.1093/bib/bbv080.

Morgan, Xochitl C, and Curtis Huttenhower. 2012. “Chapter 12: Human Microbiome Analysis.” PLoS Computational Biology 8 (12): e1002808. https://doi.org/10.1371/journal.pcbi.1002808.

Pasolli, Edoardo, Lucas Schiffer, Paolo Manghi, Audrey Renson, Valerie Obenchain, Duy Tin Truong, Francesco Beghini, et al. 2017. “Accessible, Curated Metagenomic Data Through ExperimentHub.” Nature Methods 14 (11). Nature Research: 1023–4. https://doi.org/10.1038/nmeth.4468.

Ramos, Marcel, Lucas Schiffer, Angela Re, Rimsha Azhar, Azfar Basunia, Carmen Rodriguez, Tiffany Chan, et al. 2017. “Software for the Integration of Multiomics Experiments in Bioconductor.” Cancer Research 77 (21). American Association for Cancer Research: e39–e42. https://doi.org/10.1158/0008-5472.CAN-17-0344.

Schiffer, Lucas, Rimsha Azhar, Lori Shepherd, Marcel Ramos, Ludwig Geistlinger, Curtis Huttenhower, Jennifer B Dowd, Nicola Segata, and Levi Waldron. 2018. “HMP16SData: Efficient Access to the Human Microbiome Project Through Bioconductor.” bioRxiv. https://doi.org/10.1101/299115.

See individual function and methods documentation for specific details.
https://confluence.broadinstitute.org/display/GDAC/FAQ#FAQ-Q%C2%A0Whatreferencegenomebuildareyouusing
https://gdc.cancer.gov/about-data/data-harmonization-and-generation/genomic-data-harmonization-0
https://confluence.broadinstitute.org/display/GDAC/FAQ#FAQ-Q%C2%A0Whatreferencegenomebuildareyouusing
https://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/download_prepare.html
https://jhubiostatistics.shinyapps.io/recount/