Abstract

The National Cancer Institute (NCI) has established the Genomic Data Commons (GDC). The GDC provides the cancer research community with an open and unified repository for sharing and accessing data across numerous cancer studies and projects via a high-performance data transfer and query infrastructure. The GenomicDataCommons Bioconductor package provides basic infrastructure for querying, accessing, and mining genomic datasets available from the GDC. We expect that the Bioconductor developer and the larger bioinformatics communities will build on the GenomicDataCommons package to add higher-level functionality and expose cancer genomics data to the plethora of state-of-the-art bioinformatics methods available in Bioconductor.

What is the GDC?

From the Genomic Data Commons (GDC) website:

The National Cancer Institute’s (NCI’s) Genomic Data Commons (GDC) is a data sharing platform that promotes precision medicine in oncology. It is not just a database or a tool; it is an expandable knowledge network supporting the import and standardization of genomic and clinical data from cancer research programs. The GDC contains NCI-generated data from some of the largest and most comprehensive cancer genomic datasets, including The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research to Generate Effective Therapies (TARGET). For the first time, these datasets have been harmonized using a common set of bioinformatics pipelines, so that the data can be directly compared. As a growing knowledge system for cancer, the GDC also enables researchers to submit data, and harmonizes these data for import into the GDC. As more researchers add clinical and genomic data to the GDC, it will become an even more powerful tool for making discoveries about the molecular basis of cancer that may lead to better care for patients.

The data model for the GDC is complex, but it is worth a quick overview, and a graphical representation is included here.

The data model is encoded as a so-called property graph. Nodes represent entities such as Projects, Cases, Diagnoses, Files (various kinds), and Annotations. The relationships between these entities are maintained as edges. Both nodes and edges may have Properties that supply instance details.

The GDC API exposes these nodes and edges in a somewhat simplified set of RESTful endpoints.
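As a rough illustration of the property-graph idea described above, the following sketch stores nodes and edges in two small data frames and keeps properties in a named list. Every id, edge label, and property value here is made up for illustration and is not the actual GDC schema.

```r
# Toy property graph: all ids, labels and property values are hypothetical.
nodes <- data.frame(
    id    = c("proj1", "case1", "file1"),
    label = c("Project", "Case", "File"),
    stringsAsFactors = FALSE
)
edges <- data.frame(
    from  = c("case1", "file1"),
    to    = c("proj1", "case1"),
    label = c("member_of", "derived_from"),
    stringsAsFactors = FALSE
)
# Properties attached to nodes (edges could carry properties the same way).
node_props <- list(
    proj1 = list(project_id = "TCGA-OV"),
    case1 = list(primary_site = "Ovary"),
    file1 = list(type = "gene_expression")
)

# Walk the edges: file -> case -> project.
case_of_file    <- edges$to[edges$from == "file1"]
project_of_case <- edges$to[edges$from == case_of_file]
node_props[[project_of_case]]$project_id
```

Walking from a file to its case and then to the project mirrors the dotted field names, such as cases.project.project_id, that appear in queries against the files endpoint.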

Quickstart

This quickstart section is just meant to show basic functionality. More details of functionality are included further on in this vignette and in function-specific help.

This software is available at Bioconductor.org and can be installed via BiocManager::install.

To report bugs or problems, either submit a new issue or submit a bug.report(package='GenomicDataCommons') from within R (which will redirect you to the new issue on GitHub).

Installation

Installation can be achieved via Bioconductor’s BiocManager package.

if (!require("BiocManager"))
    install.packages("BiocManager")
BiocManager::install('GenomicDataCommons')

Check connectivity and status

The GenomicDataCommons package requires network connectivity, and the NCI GDC API must be operational and not under maintenance. The status() function checks both.

GenomicDataCommons::status()
## $commit
## [1] "4dd3680528a19ed33cfc83c7d049426c97bb903b"
## 
## $data_release
## [1] "Data Release 35.0 - September 28, 2022"
## 
## $status
## [1] "OK"
## 
## $tag
## [1] "3.0.0"
## 
## $version
## [1] 1

And to check the status in code:

stopifnot(GenomicDataCommons::status()$status=="OK")
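In scripts that should degrade gracefully rather than stop, the same check can be wrapped in tryCatch(). The sketch below is one possible approach; gdc_is_up() is a hypothetical helper name, not a function provided by the package.

```r
# Hypothetical helper: TRUE only when the GDC API reports status "OK".
# Any error (package not installed, no network, API down) yields FALSE
# instead of stopping the script.
gdc_is_up <- function() {
    tryCatch(
        identical(GenomicDataCommons::status()$status, "OK"),
        error = function(e) FALSE
    )
}

if (!gdc_is_up()) {
    message("GDC API not reachable; skipping GDC queries")
}
```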

Find data

The following code builds a manifest that can be used to guide the download of raw data. Here, filtering finds gene expression files quantified as raw counts using STAR from ovarian cancer patients.

# Attach the package (the examples below assume it is loaded)
library(GenomicDataCommons)

ge_manifest <- files() %>%
    filter( cases.project.project_id == 'TCGA-OV') %>% 
    filter( type == 'gene_expression' ) %>%
    filter( analysis.workflow_type == 'STAR - Counts')  %>%
    manifest()
head(ge_manifest)
##                                     id data_format     access
## 1 6431664a-4a19-4a02-8ea5-8478a295f391         TSV controlled
## 2 db02a2f2-2d83-4486-b8a3-3b01d6f0a6f2         TSV       open
## 3 d1737bd6-98b5-4c83-977d-007123b554b2         TSV controlled
## 4 be11a1db-9919-47b8-bc06-90ce5f2c9b02         TSV       open
## 5 49618db3-ba6f-442d-b001-6af64869461e         TSV controlled
## 6 ffec2c19-e5f2-46c3-8fb7-02c620f06d8a         TSV controlled
##                                                                     file_name
## 1   2fb7ad5a-0ef2-4524-9fa8-4d61d1bb04e1.rna_seq.star_splice_junctions.tsv.gz
## 2 2fb7ad5a-0ef2-4524-9fa8-4d61d1bb04e1.rna_seq.augmented_star_gene_counts.tsv
## 3   8a7a4704-c698-4291-ba48-207895898a40.rna_seq.star_splice_junctions.tsv.gz
## 4 8a7a4704-c698-4291-ba48-207895898a40.rna_seq.augmented_star_gene_counts.tsv
## 5   872f7c22-17e4-4940-809b-a0970fde369b.rna_seq.star_splice_junctions.tsv.gz
## 6   6ab54953-3d77-4239-bf4a-5a893cf9a43c.rna_seq.star_splice_junctions.tsv.gz
##                           submitter_id           data_category       acl
## 1 1fe77c9d-4a01-4de1-8217-d0a329800e60 Transcriptome Profiling phs000178
## 2 16328313-f0d8-462d-a02b-5a6f2f4feabf Transcriptome Profiling      open
## 3 047ec964-cda2-4932-a979-29ef0e5b109d Transcriptome Profiling phs000178
## 4 8b20ea0f-57d5-420a-839d-03e9235a9536 Transcriptome Profiling      open
## 5 836530f2-4019-42df-8b39-25f53a875350 Transcriptome Profiling phs000178
## 6 8ff29681-9fc8-44a7-a7d1-fa11da07d81f Transcriptome Profiling phs000178
##              type file_size                 created_datetime
## 1 gene_expression   2800561 2021-12-13T20:58:10.452132-06:00
## 2 gene_expression   4230648 2021-12-13T20:46:42.020696-06:00
## 3 gene_expression   2538631 2021-12-13T21:02:18.823573-06:00
## 4 gene_expression   4221504 2021-12-13T20:50:25.802750-06:00
## 5 gene_expression   3828886 2021-12-13T21:02:08.938480-06:00
## 6 gene_expression   2881690 2021-12-13T21:00:47.129504-06:00
##                             md5sum                 updated_datetime
## 1 bbf3056454389fe35837aae8f1d6ef61 2022-01-19T14:01:15.621847-06:00
## 2 7a3053cd76efcf0a9681b7f0020f3bd1 2022-01-19T14:47:53.778421-06:00
## 3 df64d2d93d1712dd882ffec54b90a0f8 2022-01-19T14:01:15.621847-06:00
## 4 fc135a6076cc55d33370dcb2a0db0945 2022-01-19T14:47:45.329025-06:00
## 5 0b8b2f1da115c2c1899b2924ab72ff27 2022-01-19T14:01:15.621847-06:00
## 6 79ad7f463d3dafbea72d697f7b69d1d5 2022-01-19T14:01:15.621847-06:00
##                                file_id                      data_type    state
## 1 6431664a-4a19-4a02-8ea5-8478a295f391 Splice Junction Quantification released
## 2 db02a2f2-2d83-4486-b8a3-3b01d6f0a6f2 Gene Expression Quantification released
## 3 d1737bd6-98b5-4c83-977d-007123b554b2 Splice Junction Quantification released
## 4 be11a1db-9919-47b8-bc06-90ce5f2c9b02 Gene Expression Quantification released
## 5 49618db3-ba6f-442d-b001-6af64869461e Splice Junction Quantification released
## 6 ffec2c19-e5f2-46c3-8fb7-02c620f06d8a Splice Junction Quantification released
##   experimental_strategy
## 1               RNA-Seq
## 2               RNA-Seq
## 3               RNA-Seq
## 4               RNA-Seq
## 5               RNA-Seq
## 6               RNA-Seq

Download data

The query above specifies 762 gene expression files; the call below downloads the first 20 of them. Using multiple processes for the download can speed up the transfer significantly in many cases. On a standard 1Gb connection, the following completes in about 30 seconds. The first time data are downloaded, R will ask to create a cache directory (see ?gdc_cache for details of setting and interacting with the cache). Downloaded files are stored in the cache directory, and later requests for the same files are served directly from the cache, avoiding repeated downloads.

fnames <- lapply(ge_manifest$id[1:20], gdcdata)

If the download had included controlled-access data, the download above would have needed to include a token. Details are available in the authentication section below.
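The lapply() call above fetches files one at a time. A multi-process variant might look like the following sketch, which uses parallel::mclapply() from base R. The helper name fetch_in_parallel() is hypothetical, and whether concurrent writes to the cache directory are safe has not been verified here, so treat this as a starting point rather than a recommendation.

```r
library(parallel)

# Hypothetical helper: apply a download function to each id using several
# worker processes. `fetch` defaults to gdcdata(), as used in the text above.
fetch_in_parallel <- function(ids, fetch = GenomicDataCommons::gdcdata,
                              workers = 2L) {
    # mclapply() relies on forking, which is unavailable on Windows,
    # so fall back to a single core there.
    if (.Platform$OS.type == "windows") workers <- 1L
    mclapply(ids, fetch, mc.cores = workers)
}

# Offline demonstration with a stand-in for the download function.
demo <- fetch_in_parallel(c("id-1", "id-2", "id-3"),
                          fetch = function(id) paste0(id, ".tsv"))
unlist(demo)

# With network access, the real call would be:
# fnames <- fetch_in_parallel(ge_manifest$id[1:20])
```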

Metadata queries

Clinical data

Accessing clinical data is a very common task. Given a set of case_ids, the gdc_clinical() function will return a list of four tibbles:

  • demographic
  • diagnoses
  • exposures
  • main
case_ids = cases() %>% results(size=10) %>% ids()
clindat = gdc_clinical(case_ids)
names(clindat)
## [1] "demographic" "diagnoses"   "exposures"   "main"
head(clindat[["main"]])
## # A tibble: 6 × 7
##   id                               disea…¹ submi…² prima…³ updat…⁴ case_id state
##   <chr>                            <chr>   <chr>   <chr>   <chr>   <chr>   <chr>
## 1 b9a32a1c-9c93-5a92-8b30-e09a91d… Comple… TARGET… Kidney  2019-0… b9a32a… rele…
## 2 c2829ab9-d5b2-5a82-a134-de9c591… Comple… TARGET… Kidney  2019-0… c2829a… rele…
## 3 f5548317-3be4-5227-a655-dfb97e6… Comple… TARGET… Kidney  2019-0… f55483… rele…
## 4 cf3bd8c5-4cd6-57c6-b07a-f20d414… Comple… TARGET… Kidney  2019-0… cf3bd8… rele…
## 5 eaffceb7-3b14-5b19-a7f0-a43bd8c… Comple… TARGET… Kidney  2019-0… eaffce… rele…
## 6 c07901d8-2829-5e98-9e8a-e12faaf… Comple… TARGET… Kidney  2019-0… c07901… rele…
## # … with abbreviated variable names ¹​disease_type, ²​submitter_id,
## #   ³​primary_site, ⁴​updated_datetime
head(clindat[["diagnoses"]])
## # A tibble: 6 × 19
##   case_id    days_…¹ morph…² submi…³ created_datetime    last_…⁴ tissu…⁵ days_…⁶
##   <chr>      <lgl>   <chr>   <chr>   <dttm>              <chr>   <chr>     <dbl>
## 1 b9a32a1c-… NA      8960/3  TARGET… 2017-02-25 02:55:58 not re… Kidney…    3828
## 2 c2829ab9-… NA      8960/3  TARGET… 2017-02-25 02:57:38 not re… Kidney…    3706
## 3 f5548317-… NA      8960/3  TARGET… 2017-02-25 02:56:34 not re… Kidney…    2717
## 4 cf3bd8c5-… NA      8960/3  TARGET… 2017-02-25 02:55:01 not re… Kidney…    4695
## 5 eaffceb7-… NA      8960/3  TARGET… 2017-02-25 02:57:07 not re… Kidney…    3954
## 6 c07901d8-… NA      8960/3  TARGET… 2017-02-25 03:01:15 not re… Kidney…    1918
## # … with 11 more variables: age_at_diagnosis <int>, primary_diagnosis <chr>,
## #   updated_datetime <dttm>, diagnosis_id <chr>, year_of_diagnosis <dbl>,
## #   cog_renal_stage <chr>, site_of_resection_or_biopsy <chr>, state <chr>,
## #   tumor_grade <chr>, days_to_last_known_disease_status <lgl>,
## #   progression_or_recurrence <chr>, and abbreviated variable names
## #   ¹​days_to_recurrence, ²​morphology, ³​submitter_id,
## #   ⁴​last_known_disease_status, ⁵​tissue_or_organ_of_origin, …

General metadata queries

The GenomicDataCommons package can access the significant clinical, demographic, biospecimen, and annotation information contained in the NCI GDC. The gdc_clinical() function will often be all that is needed, but the API and the GenomicDataCommons package offer considerable flexibility if fine-tuning is required.

expands = c("diagnoses","annotations",
             "demographic","exposures")
clinResults = cases() %>%
    GenomicDataCommons::select(NULL) %>%
    GenomicDataCommons::expand(expands) %>%
    results(size=50)
str(clinResults[[1]],list.len=6)
##  chr [1:50] "b9a32a1c-9c93-5a92-8b30-e09a91dc3cfc" ...
# or listviewer::jsonedit(clinResults)

Basic design

This package design is meant to have some similarities to the “hadleyverse” approach of dplyr. Roughly, the functionality for finding and accessing files and metadata can be divided into:

  1. Simple query constructors based on GDC API endpoints.
  2. A set of verbs that, when applied, adjust filtering, field selection, and faceting (fields for aggregation) and result in a new query object (an endomorphism).
  3. A set of verbs that take a query and return results from the GDC.

In addition, there are auxiliary functions for asking the GDC API for information about available and default fields, slicing BAM files, and downloading actual data files. Here is an overview of functionality.

Usage

There are two main classes of operations when working with the NCI GDC.

  1. Querying metadata and finding data files (e.g., finding all gene expression quantification data files for all colon cancer patients).
  2. Transferring raw or processed data from the GDC to another computer (e.g., downloading raw or processed data).

Both classes of operation are reviewed in detail in the following sections.

Querying metadata

Vast amounts of metadata about cases (patients, basically), files, projects, and so-called annotations are available via the NCI GDC API. Typically, one will want to query metadata to either focus in on a set of files for download or transfer or to perform so-called aggregations (pivot-tables, facets, similar to the R table() functionality).

Querying metadata starts with creating a “blank” query. One will often then want to filter the query to limit results prior to retrieving results. The GenomicDataCommons package has helper functions for listing fields that are available for filtering.

In addition to fetching results, the GDC API allows faceting, or aggregating, which is useful for compiling reports, generating dashboards, or building user interfaces to GDC data (see GDC web query interface for a non-R-based example).

Creating a query

A query of the GDC starts its life in R. Queries follow the four metadata endpoints available at the GDC. In particular, there are four convenience functions, projects(), cases(), files(), and annotations(), that each create GDCQuery objects (actually, specific subclasses of GDCQuery):

pquery = projects()

The pquery object is now an object of (S3) class, GDCQuery (and gdc_projects and list). The object contains the following elements:

  • fields: This is a character vector of the fields that will be returned when we retrieve data. If no fields are specified to, for example, the projects() function, the default fields from the GDC are used (see default_fields())
  • filters: This will contain results after calling the filter() method and will be used to filter results on retrieval.
  • facets: A character vector of field names that will be used for aggregating data in a call to aggregations().
  • archive: One of either “default” or “legacy”. In the str() output below, this setting appears as the logical element legacy.
  • token: A character(1) token from the GDC. See the authentication section for details, but note that, in general, the token is not necessary for metadata query and retrieval, only for actual data download.

Looking at the actual object (get used to using str()!), note that the query contains no results.

str(pquery)
## List of 5
##  $ fields : chr [1:10] "dbgap_accession_number" "disease_type" "intended_release_date" "name" ...
##  $ filters: NULL
##  $ facets : NULL
##  $ legacy : logi FALSE
##  $ expand : NULL
##  - attr(*, "class")= chr [1:3] "gdc_projects" "GDCQuery" "list"

Retrieving results

[ GDC pagination documentation ]

[ GDC sorting documentation ]

With a query object available, the next step is to retrieve results from the GDC. The most basic type of result we can get is a simple count() of records available that satisfy the filter criteria. Note that we have not set any filters, so a count() here will represent all the project records publicly available at the GDC in the “default” archive.

pcount = count(pquery)
# or
pcount = pquery %>% count()
pcount
## [1] 72

The results() method will fetch actual results.

presults = pquery %>% results()

These results are returned from the GDC in JSON format and converted into a (potentially nested) list in R. The str() method is useful for taking a quick glimpse of the data.

str(presults)
## List of 9
##  $ id                    : chr [1:10] "TARGET-NBL" "GENIE-GRCC" "GENIE-DFCI" "GENIE-NKI" ...
##  $ primary_site          :List of 10
##   ..$ TARGET-NBL: chr [1:20] "Retroperitoneum and peritoneum" "Lymph nodes" "Stomach" "Connective, subcutaneous and other soft tissues" ...
##   ..$ GENIE-GRCC: chr [1:45] "Other and ill-defined sites in lip, oral cavity and pharynx" "Uterus, NOS" "Rectum" "Ovary" ...
##   ..$ GENIE-DFCI: chr [1:49] "Eye and adnexa" "Other and ill-defined sites in lip, oral cavity and pharynx" "Uterus, NOS" "Rectum" ...
##   ..$ GENIE-NKI : chr [1:42] "Eye and adnexa" "Other and ill-defined sites in lip, oral cavity and pharynx" "Uterus, NOS" "Rectum" ...
##   ..$ GENIE-VICC: chr [1:46] "Bronchus and lung" "Adrenal gland" "Gallbladder" "Esophagus" ...
##   ..$ GENIE-UHN : chr [1:42] "Eye and adnexa" "Other and ill-defined sites in lip, oral cavity and pharynx" "Uterus, NOS" "Rectum" ...
##   ..$ GENIE-MDA : chr [1:42] "Eye and adnexa" "Uterus, NOS" "Ovary" "Other and unspecified urinary organs" ...
##   ..$ GENIE-MSK : chr [1:49] "Eye and adnexa" "Other and ill-defined sites in lip, oral cavity and pharynx" "Uterus, NOS" "Rectum" ...
##   ..$ GENIE-JHU : chr [1:33] "Eye and adnexa" "Uterus, NOS" "Rectum" "Ovary" ...
##   ..$ FM-AD     : chr [1:42] "Bronchus and lung" "Esophagus" "Cervix uteri" "Other and unspecified female genital organs" ...
##  $ dbgap_accession_number: chr [1:10] "phs000467" NA NA NA ...
##  $ project_id            : chr [1:10] "TARGET-NBL" "GENIE-GRCC" "GENIE-DFCI" "GENIE-NKI" ...
##  $ disease_type          :List of 10
##   ..$ TARGET-NBL: chr [1:2] "Neuroepitheliomatous Neoplasms" "Not Applicable"
##   ..$ GENIE-GRCC: chr [1:32] "Osseous and Chondromatous Neoplasms" "Synovial-like Neoplasms" "Fibromatous Neoplasms" "Myomatous Neoplasms" ...
##   ..$ GENIE-DFCI: chr [1:52] "Osseous and Chondromatous Neoplasms" "Other Leukemias" "Synovial-like Neoplasms" "Lymphoid Leukemias" ...
##   ..$ GENIE-NKI : chr [1:23] "Synovial-like Neoplasms" "Fibromatous Neoplasms" "Myomatous Neoplasms" "Transitional Cell Papillomas and Carcinomas" ...
##   ..$ GENIE-VICC: chr [1:43] "Neoplasms, NOS" "Adnexal and Skin Appendage Neoplasms" "Squamous Cell Neoplasms" "Gliomas" ...
##   ..$ GENIE-UHN : chr [1:39] "Other Leukemias" "Osseous and Chondromatous Neoplasms" "Synovial-like Neoplasms" "Lymphoid Leukemias" ...
##   ..$ GENIE-MDA : chr [1:34] "Osseous and Chondromatous Neoplasms" "Synovial-like Neoplasms" "Fibromatous Neoplasms" "Myomatous Neoplasms" ...
##   ..$ GENIE-MSK : chr [1:49] "Osseous and Chondromatous Neoplasms" "Synovial-like Neoplasms" "Lymphoid Leukemias" "Fibromatous Neoplasms" ...
##   ..$ GENIE-JHU : chr [1:33] "Osseous and Chondromatous Neoplasms" "Other Leukemias" "Synovial-like Neoplasms" "Lymphoid Leukemias" ...
##   ..$ FM-AD     : chr [1:23] "Gliomas" "Acinar Cell Neoplasms" "Specialized Gonadal Neoplasms" "Miscellaneous Tumors" ...
##  $ name                  : chr [1:10] "Neuroblastoma" "AACR Project GENIE - Contributed by Institut Gustave Roussy" "AACR Project GENIE - Contributed by Dana-Farber Cancer Institute" "AACR Project GENIE - Contributed by Netherlands Cancer Institute" ...
##  $ releasable            : logi [1:10] TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ state                 : chr [1:10] "open" "open" "open" "open" ...
##  $ released              : logi [1:10] TRUE TRUE TRUE TRUE TRUE TRUE ...
##  - attr(*, "row.names")= int [1:10] 1 2 3 4 5 6 7 8 9 10
##  - attr(*, "class")= chr [1:3] "GDCprojectsResults" "GDCResults" "list"

By default, only 10 records are returned. We can use the size and from arguments to results() to either page through results or to change the number of results. Finally, there is a convenience method, results_all(), that will simply fetch all the available results given a query. Note that results_all() may take a long time and return HUGE result sets if not used carefully. Using a combination of count() and results() to get a sense of the expected data size is probably warranted before calling results_all().

length(ids(presults))
## [1] 10
presults = pquery %>% results_all()
length(ids(presults))
## [1] 72
# includes all records
length(ids(presults)) == count(pquery)
## [1] TRUE
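When results_all() would be too large, paging with size and from is the alternative. The sketch below only computes the starting offsets for each page; page_starts() is a hypothetical helper, and the index of the first record (0 here) is an assumption that should be checked against ?results for the installed package version.

```r
# Hypothetical helper: starting offsets for fetching `total` records in
# pages of `size`. `first` is the index of the first record; the default
# of 0 is an ASSUMPTION, not taken from the package documentation.
page_starts <- function(total, size, first = 0L) {
    n_pages <- ceiling(total / size)
    first + size * (seq_len(n_pages) - 1L)
}

# 72 project records in pages of 25 need three requests.
page_starts(72, 25)
# offsets 0, 25, 50

# With network access, each offset would be passed on to results(), e.g.
# pages <- lapply(page_starts(count(pquery), 25),
#                 function(from) results(pquery, size = 25, from = from))
```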

Extracting subsets of results or manipulating the results into a more conventional R data structure is not easily generalizable. However, the purrr, rlist, and data.tree packages are all potentially of interest for manipulating complex, nested list structures. For viewing the results in an interactive viewer, consider the listviewer package.

Fields and Values

[ GDC fields documentation ]

Central to querying and retrieving data from the GDC is the ability to specify which fields to return, filtering by fields and values, and faceting or aggregating. The GenomicDataCommons package includes two simple functions, available_fields() and default_fields(). Each can operate on a character(1) endpoint name (“cases”, “files”, “annotations”, or “projects”) or a GDCQuery object.

default_fields('files')
##  [1] "access"                         "acl"                           
##  [3] "average_base_quality"           "average_insert_size"           
##  [5] "average_read_length"            "channel"                       
##  [7] "chip_id"                        "chip_position"                 
##  [9] "contamination"                  "contamination_error"           
## [11] "created_datetime"               "data_category"                 
## [13] "data_format"                    "data_type"                     
## [15] "error_type"                     "experimental_strategy"         
## [17] "file_autocomplete"              "file_id"                       
## [19] "file_name"                      "file_size"                     
## [21] "imaging_date"                   "magnification"                 
## [23] "md5sum"                         "mean_coverage"                 
## [25] "msi_score"                      "msi_status"                    
## [27] "pairs_on_diff_chr"              "plate_name"                    
## [29] "plate_well"                     "platform"                      
## [31] "proc_internal"                  "proportion_base_mismatch"      
## [33] "proportion_coverage_10x"        "proportion_coverage_10X"       
## [35] "proportion_coverage_30x"        "proportion_coverage_30X"       
## [37] "proportion_reads_duplicated"    "proportion_reads_mapped"       
## [39] "proportion_targets_no_coverage" "read_pair_number"              
## [41] "revision"                       "stain_type"                    
## [43] "state"                          "state_comment"                 
## [45] "submitter_id"                   "tags"                          
## [47] "total_reads"                    "tumor_ploidy"                  
## [49] "tumor_purity"                   "type"                          
## [51] "updated_datetime"
# The number of fields available for files endpoint
length(available_fields('files'))
## [1] 1017
# The first few fields available for files endpoint
head(available_fields('files'))
## [1] "access"                      "acl"                        
## [3] "analysis.analysis_id"        "analysis.analysis_type"     
## [5] "analysis.created_datetime"   "analysis.input_files.access"

The fields to be returned by a query can be specified following a similar paradigm to that of the dplyr package. The select() function is a verb that resets the fields slot of a GDCQuery; note that this is not quite analogous to the dplyr select() verb that limits from already-present fields. We completely replace the fields when using select() on a GDCQuery.

# Default fields here
qcases = cases()
qcases$fields
##  [1] "aliquot_ids"              "analyte_ids"             
##  [3] "case_autocomplete"        "case_id"                 
##  [5] "consent_type"             "created_datetime"        
##  [7] "days_to_consent"          "days_to_lost_to_followup"
##  [9] "diagnosis_ids"            "disease_type"            
## [11] "index_date"               "lost_to_followup"        
## [13] "portion_ids"              "primary_site"            
## [15] "sample_ids"               "slide_ids"               
## [17] "state"                    "submitter_aliquot_ids"   
## [19] "submitter_analyte_ids"    "submitter_diagnosis_ids" 
## [21] "submitter_id"             "submitter_portion_ids"   
## [23] "submitter_sample_ids"     "submitter_slide_ids"     
## [25] "updated_datetime"
# set up query to use ALL available fields
# Note that checking of fields is done by select()
qcases = cases() %>% GenomicDataCommons::select(available_fields('cases'))
head(qcases$fields)
## [1] "case_id"                       "aliquot_ids"                  
## [3] "analyte_ids"                   "annotations.annotation_id"    
## [5] "annotations.case_id"           "annotations.case_submitter_id"

Finding fields of interest is such a common operation that the GenomicDataCommons package includes the grep_fields() function. See the appropriate help pages for details.

Facets and aggregation

[ GDC facet documentation ]

The GDC API offers a feature known as aggregation or faceting. By specifying one or more fields (of appropriate type), the GDC can return to us a count of the number of records matching each potential value. This is similar to the R table() function. Multiple fields can be returned at once, but the GDC API does not have a cross-tabulation feature; all aggregations are only on one field at a time. Results of aggregations() calls come back as a list of data.frames (actually, tibbles).

# total number of files of a specific type
res = files() %>% facet(c('type','data_type')) %>% aggregations()
res$type
##    doc_count                           key
## 1     190236    annotated_somatic_mutation
## 2     123377                 aligned_reads
## 3      86801          structural_variation
## 4      86592       simple_somatic_mutation
## 5      58883           copy_number_segment
## 6      46153          copy_number_estimate
## 7      43618               gene_expression
## 8      33639   aggregated_somatic_mutation
## 9      33234              mirna_expression
## 10     31716      masked_methylation_array
## 11     30075                   slide_image
## 12     26002        biospecimen_supplement
## 13     15858        methylation_beta_value
## 14     15635       masked_somatic_mutation
## 15     13135           clinical_supplement
## 16      7906            protein_expression
## 17        84 secondary_expression_analysis
## 18        67              pathology_report

Using aggregations() is also an easy way to learn the contents of individual fields, and it forms the basis for faceted search pages.
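To make the comparison with table() concrete, the following sketch builds the same key and doc_count layout locally from a small made-up vector. The data are invented for illustration; only the column names follow the aggregation output shown above.

```r
# Toy stand-in for the "type" field of six file records (made-up data).
file_types <- c("aligned_reads", "gene_expression", "aligned_reads",
                "slide_image", "gene_expression", "aligned_reads")

# Count records per value and sort by decreasing count, as a facet would.
counts <- sort(table(file_types), decreasing = TRUE)
local_facet <- data.frame(
    doc_count = as.integer(counts),
    key       = names(counts),
    stringsAsFactors = FALSE
)
local_facet
```

Unlike the GDC API, table() can also cross-tabulate two vectors locally, which is one reason to download the relevant fields when a two-way summary is needed.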

Filtering

[ GDC filtering documentation ]

The GenomicDataCommons package uses a form of non-standard evaluation to specify R-like queries that are then translated into an R list. That R list is, upon calling a method that fetches results from the GDC API, translated into the appropriate JSON string. The R expression uses the formula interface as suggested by Hadley Wickham in his vignette on non-standard evaluation:

It’s best to use a formula because a formula captures both the expression to evaluate and the environment where the evaluation occurs. This is important if the expression is a mixture of variables in a data frame and objects in the local environment [for example].

For the user, these details will not be too important. Note that the examples in this document pass the filter expression to filter() directly, without a leading “~”.

qfiles = files()
qfiles %>% count() # all files
## [1] 843011

To limit the file type, we can refer back to the section on faceting to see the possible values for the file field “type”. For example, to filter file results to only “gene_expression” files, we simply specify a filter.

qfiles = files() %>% filter( type == 'gene_expression')
# here is what the filter looks like after translation
str(get_filter(qfiles))
## List of 2
##  $ op     : 'scalar' chr "="
##  $ content:List of 2
##   ..$ field: chr "type"
##   ..$ value: chr "gene_expression"
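The final step, turning this list into a JSON string, can be sketched by hand. The example below rebuilds the list printed above and serializes it with jsonlite::toJSON(); using jsonlite here is an assumption made for illustration, and the exact string the package sends to the API may differ in detail.

```r
library(jsonlite)

# Hand-built copy of the list structure printed above.
flt <- list(
    op      = "=",
    content = list(field = "type", value = "gene_expression")
)

# auto_unbox = TRUE writes length-1 vectors as JSON scalars, not arrays.
json <- toJSON(flt, auto_unbox = TRUE)
json
# {"op":"=","content":{"field":"type","value":"gene_expression"}}
```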

What if we want to create a filter based on the project (‘TCGA-OVCA’, for example)? Well, we have a couple of possible ways to discover available fields. The first is based on base R functionality and some intuition.

grep('pro',available_fields('files'),value=TRUE) %>% 
    head()
## [1] "analysis.input_files.proc_internal"           
## [2] "analysis.input_files.proportion_base_mismatch"
## [3] "analysis.input_files.proportion_coverage_10x" 
## [4] "analysis.input_files.proportion_coverage_10X" 
## [5] "analysis.input_files.proportion_coverage_30x" 
## [6] "analysis.input_files.proportion_coverage_30X"

Interestingly, the project information is “nested” inside the case. We don’t need to know that detail other than to know that we now have a few potential guesses for where our information might be in the files records. We need to know where because we need to construct the appropriate filter.
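Searching for a more specific pattern narrows the candidates quickly. The sketch below runs offline on a handful of field names copied from outputs in this document; with network access, the same grep() call would be applied to available_fields('files') instead.

```r
# A few field names copied from this document, used as an offline stand-in
# for available_fields('files').
fields <- c("access",
            "analysis.input_files.proc_internal",
            "cases.project.project_id",
            "file_name",
            "type")

# fixed = TRUE treats the pattern as a literal string, not a regex.
project_fields <- grep("project", fields, value = TRUE, fixed = TRUE)
project_fields
# [1] "cases.project.project_id"
```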

files() %>% 
    facet('cases.project.project_id') %>% 
    aggregations() %>% 
    head()
## $cases.project.project_id
##    doc_count                       key
## 1      54096                     FM-AD
## 2      49455                 TCGA-BRCA
## 3      56636                   CPTAC-3
## 4      38305                TARGET-AML
## 5      26469                 TCGA-LUAD
## 6      36470                 GENIE-MSK
## 7      22749                 TCGA-UCEC
## 8      23663                 TCGA-HNSC
## 9      22838                 TCGA-THCA
## 10     23734                 TCGA-KIRC
## 11     21971                   TCGA-OV
## 12     23893                 TCGA-LUSC
## 13     23134                  TCGA-LGG
## 14     22531                 TCGA-PRAD
## 15     20776                 TCGA-COAD
## 16     18013                  TCGA-GBM
## 17     28464                GENIE-DFCI
## 18     20167                 TCGA-SKCM
## 19     19694                 TCGA-STAD
## 20     27014             MMRF-COMMPASS
## 21     18331                 TCGA-BLCA
## 22     16965                 TCGA-LIHC
## 23     16959             TARGET-ALL-P2
## 24     13339                 TCGA-CESC
## 25     13387                 TCGA-KIRP
## 26     11593                 TCGA-SARC
## 27     14968         BEATAML1.0-COHORT
## 28     11553                 REBC-THYR
## 29      8187                 TCGA-PAAD
## 30      8164                 TCGA-ESCA
## 31      7868                 TCGA-PCPG
## 32      7178                 TCGA-READ
## 33      6731                 TCGA-TGCT
## 34      9244                   CPTAC-2
## 35      5358                TARGET-NBL
## 36      7221                 TCGA-LAML
## 37      5376                 TCGA-THYM
## 38      6984                 HCMI-CMDC
## 39      5304             CGCI-HTMCP-CC
## 40      5454                   CMI-MBC
## 41      3876                  TCGA-ACC
## 42      3493                 TCGA-KICH
## 43      5286              NCICCR-DLBCL
## 44      3666                 TCGA-MESO
## 45      3383                  TCGA-UVM
## 46      2532                 TARGET-WT
## 47      2801                 TARGET-OS
## 48      3625             TARGET-ALL-P3
## 49      3857                 GENIE-MDA
## 50      3833                GENIE-VICC
## 51      2549                  TCGA-UCS
## 52      3320                 GENIE-JHU
## 53      2043                 TCGA-DLBC
## 54      2059                 TCGA-CHOL
## 55      2632                 GENIE-UHN
## 56      2139                CGCI-BLGSP
## 57      1826 EXCEPTIONAL_RESPONDERS-ER
## 58      1571                 MP2PRT-WT
## 59      1036                 TARGET-RT
## 60      1093                WCDT-MCRPC
## 61      1038                GENIE-GRCC
## 62       878                  OHSU-CNL
## 63       806                   CMI-ASC
## 64       801                 GENIE-NKI
## 65       758       ORGANOID-PANCREATIC
## 66       553               CTSP-DLBCL1
## 67       480                   CMI-MPC
## 68       339                  TRIO-CRU
## 69       222     BEATAML1.0-CRENOLANIB
## 70       163               TARGET-CCSK
## 71        96             TARGET-ALL-P1
## 72        21            VAREPOP-APOLLO

We note that cases.project.project_id looks like it is a good fit. We also note that TCGA-OV is the correct project_id, not TCGA-OVCA. Note that conditions can be combined in a single filter() call with &; as shown further below, successive filter() calls have the same effect rather than replacing one another.

qfiles = files() %>%
    filter( cases.project.project_id == 'TCGA-OV' & type == 'gene_expression')
str(get_filter(qfiles))
## List of 2
##  $ op     : 'scalar' chr "and"
##  $ content:List of 2
##   ..$ :List of 2
##   .. ..$ op     : 'scalar' chr "="
##   .. ..$ content:List of 2
##   .. .. ..$ field: chr "cases.project.project_id"
##   .. .. ..$ value: chr "TCGA-OV"
##   ..$ :List of 2
##   .. ..$ op     : 'scalar' chr "="
##   .. ..$ content:List of 2
##   .. .. ..$ field: chr "type"
##   .. .. ..$ value: chr "gene_expression"
qfiles %>% count()
## [1] 762

Asking for a count() of results given these new filter criteria gives 762 results. Filters can also be chained (or nested) to accomplish the same effect as multiple & conditionals. The count() below is equivalent to the & filtering done above.

qfiles2 = files() %>%
    filter( cases.project.project_id == 'TCGA-OV') %>% 
    filter( type == 'gene_expression') 
qfiles2 %>% count()
## [1] 762
(qfiles %>% count()) == (qfiles2 %>% count()) #TRUE
## [1] TRUE
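
The str() output above shows that a filter is a nested list of op/content pairs. As a rough sketch of why the chained form and the & form end up equivalent, the base R helpers below build that layout by hand. The names eq_filter and and_filters are hypothetical and are not part of GenomicDataCommons; the package's internal representation may differ in detail (for example, it marks some values as scalars).

```r
# Hypothetical helpers that mimic the list layout printed by
# str(get_filter(qfiles)); they are not GenomicDataCommons functions.
eq_filter <- function(field, value) {
  list(op = "=", content = list(field = field, value = value))
}

and_filters <- function(...) {
  list(op = "and", content = list(...))
}

# One combined expression, as in the `&` example
combined <- and_filters(
  eq_filter("cases.project.project_id", "TCGA-OV"),
  eq_filter("type", "gene_expression")
)

# Two steps, as in the chained example: start with one filter,
# then "and" a second one onto it
step1 <- eq_filter("cases.project.project_id", "TCGA-OV")
chained <- and_filters(step1, eq_filter("type", "gene_expression"))

# Both routes produce the same nested list
identical(combined, chained)
str(combined)
```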

Generating a manifest for bulk downloads is as simple as asking for the manifest from the current query.

manifest_df = qfiles %>% manifest()
head(manifest_df)
##                                     id data_format     access
## 1 6431664a-4a19-4a02-8ea5-8478a295f391         TSV controlled
## 2 db02a2f2-2d83-4486-b8a3-3b01d6f0a6f2         TSV       open
## 3 d1737bd6-98b5-4c83-977d-007123b554b2         TSV controlled
## 4 be11a1db-9919-47b8-bc06-90ce5f2c9b02         TSV       open
## 5 49618db3-ba6f-442d-b001-6af64869461e         TSV controlled
## 6 ffec2c19-e5f2-46c3-8fb7-02c620f06d8a         TSV controlled
##                                                                     file_name
## 1   2fb7ad5a-0ef2-4524-9fa8-4d61d1bb04e1.rna_seq.star_splice_junctions.tsv.gz
## 2 2fb7ad5a-0ef2-4524-9fa8-4d61d1bb04e1.rna_seq.augmented_star_gene_counts.tsv
## 3   8a7a4704-c698-4291-ba48-207895898a40.rna_seq.star_splice_junctions.tsv.gz
## 4 8a7a4704-c698-4291-ba48-207895898a40.rna_seq.augmented_star_gene_counts.tsv
## 5   872f7c22-17e4-4940-809b-a0970fde369b.rna_seq.star_splice_junctions.tsv.gz
## 6   6ab54953-3d77-4239-bf4a-5a893cf9a43c.rna_seq.star_splice_junctions.tsv.gz
##                           submitter_id           data_category       acl
## 1 1fe77c9d-4a01-4de1-8217-d0a329800e60 Transcriptome Profiling phs000178
## 2 16328313-f0d8-462d-a02b-5a6f2f4feabf Transcriptome Profiling      open
## 3 047ec964-cda2-4932-a979-29ef0e5b109d Transcriptome Profiling phs000178
## 4 8b20ea0f-57d5-420a-839d-03e9235a9536 Transcriptome Profiling      open
## 5 836530f2-4019-42df-8b39-25f53a875350 Transcriptome Profiling phs000178
## 6 8ff29681-9fc8-44a7-a7d1-fa11da07d81f Transcriptome Profiling phs000178
##              type file_size                 created_datetime
## 1 gene_expression   2800561 2021-12-13T20:58:10.452132-06:00
## 2 gene_expression   4230648 2021-12-13T20:46:42.020696-06:00
## 3 gene_expression   2538631 2021-12-13T21:02:18.823573-06:00
## 4 gene_expression   4221504 2021-12-13T20:50:25.802750-06:00
## 5 gene_expression   3828886 2021-12-13T21:02:08.938480-06:00
## 6 gene_expression   2881690 2021-12-13T21:00:47.129504-06:00
##                             md5sum                 updated_datetime
## 1 bbf3056454389fe35837aae8f1d6ef61 2022-01-19T14:01:15.621847-06:00
## 2 7a3053cd76efcf0a9681b7f0020f3bd1 2022-01-19T14:47:53.778421-06:00
## 3 df64d2d93d1712dd882ffec54b90a0f8 2022-01-19T14:01:15.621847-06:00
## 4 fc135a6076cc55d33370dcb2a0db0945 2022-01-19T14:47:45.329025-06:00
## 5 0b8b2f1da115c2c1899b2924ab72ff27 2022-01-19T14:01:15.621847-06:00
## 6 79ad7f463d3dafbea72d697f7b69d1d5 2022-01-19T14:01:15.621847-06:00
##                                file_id                      data_type    state
## 1 6431664a-4a19-4a02-8ea5-8478a295f391 Splice Junction Quantification released
## 2 db02a2f2-2d83-4486-b8a3-3b01d6f0a6f2 Gene Expression Quantification released
## 3 d1737bd6-98b5-4c83-977d-007123b554b2 Splice Junction Quantification released
## 4 be11a1db-9919-47b8-bc06-90ce5f2c9b02 Gene Expression Quantification released
## 5 49618db3-ba6f-442d-b001-6af64869461e Splice Junction Quantification released
## 6 ffec2c19-e5f2-46c3-8fb7-02c620f06d8a Splice Junction Quantification released
##   experimental_strategy
## 1               RNA-Seq
## 2               RNA-Seq
## 3               RNA-Seq
## 4               RNA-Seq
## 5               RNA-Seq
## 6               RNA-Seq
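
Before downloading, it can help to summarize a manifest by access level and data type. The sketch below uses a toy data frame (manifest_toy, a hypothetical name) built from three columns of the six rows printed above; with a live query the same calls would be applied to manifest_df itself.

```r
# Toy manifest: a subset of columns from the six rows printed above
manifest_toy <- data.frame(
  access = c("controlled", "open", "controlled",
             "open", "controlled", "controlled"),
  data_type = c("Splice Junction Quantification",
                "Gene Expression Quantification",
                "Splice Junction Quantification",
                "Gene Expression Quantification",
                "Splice Junction Quantification",
                "Splice Junction Quantification"),
  file_size = c(2800561, 4230648, 2538631, 4221504, 3828886, 2881690),
  stringsAsFactors = FALSE
)

# Number of files per access level and data type
type_by_access <- table(manifest_toy$access, manifest_toy$data_type)
type_by_access

# Total bytes per access level
size_by_access <- tapply(manifest_toy$file_size, manifest_toy$access, sum)
size_by_access
```

In these six rows, the controlled-access files are all splice junction files and the open files are all gene-level counts, which suggests that further filtering is needed.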

Note that we might still not be quite there. Looking at the file names, the manifest mixes different kinds of files, such as “star_splice_junctions” and “augmented_star_gene_counts”, and it includes controlled-access files. Another round of grep on available_fields, looking for “type”, turned up the field “analysis.workflow_type”, which provides the appropriate filter criterion.

qfiles = files() %>% filter( ~ cases.project.project_id == 'TCGA-OV' &
                            type == 'gene_expression' &
                            access == "open" &
                            analysis.workflow_type == 'STAR - Counts')
manifest_df = qfiles %>% manifest()
nrow(manifest_df)
## [1] 381
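
The field search mentioned above can be sketched with grep. To keep the example self-contained, the fields vector below is a small illustrative subset of field names that appear in this document; in a live session it would presumably be replaced by available_fields('files').

```r
# Illustrative subset of file field names; in a live session this
# vector would come from available_fields('files')
fields <- c("access", "analysis.workflow_type", "cases.project.project_id",
            "data_format", "data_type", "experimental_strategy", "type")

# Keep only the field names that contain "type"
type_fields <- grep("type", fields, value = TRUE)
type_fields
```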

The GDC Data Transfer Tool can be used (from R via transfer(), or from the command line) to orchestrate high-performance, restartable transfers of all the files in the manifest. See the bulk downloads section for details.

Authentication

[ GDC authentication documentation ]

The GDC offers both “controlled-access” and “open” data. As of this writing, only data stored as files is “controlled-access”; that is, metadata accessible via the GDC is all “open” data and some files are “open” and some are “controlled-access”. Controlled-access data are only available after going through the process of obtaining access.

After controlled-access to one or more datasets has been granted, logging into the GDC web portal will allow you to access a GDC authentication token, which can be downloaded and then used to access available controlled-access data via the GenomicDataCommons package.

The GenomicDataCommons package uses authentication tokens only for downloading data (see the transfer and gdcdata documentation). The package includes a helper function, gdc_token, that looks for the token in one of three places (resolved in this order):

  1. As a string stored in the environment variable, GDC_TOKEN
  2. As a file, stored in the file named by the environment variable, GDC_TOKEN_FILE
  3. In a file in the user home directory, called .gdc_token

As a concrete example:

token = gdc_token()
transfer(...,token=token)
# or
transfer(...,token=gdc_token())
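
The lookup order listed above can be sketched in base R. This is only an illustration of the three-step resolution, not the package's implementation of gdc_token; the function name resolve_gdc_token and its arguments are hypothetical.

```r
# Hypothetical helper illustrating the documented lookup order;
# this is not the code behind gdc_token().
resolve_gdc_token <- function(token = Sys.getenv("GDC_TOKEN"),
                              token_file = Sys.getenv("GDC_TOKEN_FILE"),
                              home = path.expand("~")) {
  # 1. Token given directly in the GDC_TOKEN environment variable
  if (nzchar(token)) return(token)
  # 2. Token stored in the file named by GDC_TOKEN_FILE
  if (nzchar(token_file) && file.exists(token_file)) {
    return(trimws(readLines(token_file, n = 1, warn = FALSE)))
  }
  # 3. Token stored in a file called .gdc_token in the home directory
  default_file <- file.path(home, ".gdc_token")
  if (file.exists(default_file)) {
    return(trimws(readLines(default_file, n = 1, warn = FALSE)))
  }
  stop("No GDC token found in GDC_TOKEN, GDC_TOKEN_FILE, or .gdc_token")
}

# Demonstration with explicit arguments, so no real token is touched
demo_file <- tempfile()
writeLines("example-token-from-file", demo_file)

# Only the token file is available: step 2 applies
from_file <- resolve_gdc_token(token = "", token_file = demo_file)

# Both are available: the string from step 1 takes precedence
from_env <- resolve_gdc_token(token = "example-token-from-env",
                              token_file = demo_file)
```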

Datafile access and download

Data downloads via the GDC API

The gdcdata function takes a character vector of one or more file ids. A simple way of producing such a vector is to produce a manifest data frame and then pass in the first column, which will contain file ids.

fnames = gdcdata(manifest_df$id[1:2],progress=FALSE)

Note that for controlled-access data, a GDC authentication token is required. Using the BiocParallel package may be useful for downloading in parallel, particularly for large numbers of smallish files.
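
One way to organize a parallel download is to split the file ids into chunks and process each chunk independently. The sketch below shows only the chunking, using file ids copied from the manifest printed earlier; fetch_chunk is a hypothetical stand-in for the real download step so that the example runs without network access.

```r
# File ids copied from the manifest printed earlier
file_ids <- c("6431664a-4a19-4a02-8ea5-8478a295f391",
              "db02a2f2-2d83-4486-b8a3-3b01d6f0a6f2",
              "d1737bd6-98b5-4c83-977d-007123b554b2",
              "be11a1db-9919-47b8-bc06-90ce5f2c9b02",
              "49618db3-ba6f-442d-b001-6af64869461e",
              "ffec2c19-e5f2-46c3-8fb7-02c620f06d8a")

# Split the ids into chunks of at most `chunk_size` ids each
chunk_size <- 2
chunks <- split(file_ids, ceiling(seq_along(file_ids) / chunk_size))

# Stand-in for the real download step. In a live session this could be
# something like function(ids) gdcdata(ids, progress = FALSE), and
# lapply() could be swapped for BiocParallel::bplapply().
fetch_chunk <- function(ids) paste0("downloaded:", ids)

results <- unlist(lapply(chunks, fetch_chunk), use.names = FALSE)
length(chunks)
```

The chunk size is a tuning choice: smaller chunks spread work more evenly across workers, while larger chunks reduce the number of separate requests.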

Bulk downloads

The bulk download functionality is only efficient (as of v1.2.0 of the GDC Data Transfer Tool) for relatively large files, so use this approach only when transferring BAM files or larger VCF files, for example. Otherwise, consider using the approach shown above, perhaps in parallel.

# Requires the gdc-client command-line utility to be installed
# separately.
fnames = gdcdata(manifest_df$id[3:10], access_method = 'client')

BAM slicing

Use Cases

Cases

How many cases are there per project_id?

res = cases() %>% facet("project.project_id") %>% aggregations()
head(res)
## $project.project_id
##    doc_count                       key
## 1      18004                     FM-AD
## 2      16824                 GENIE-MSK
## 3      14232                GENIE-DFCI
## 4       3857                 GENIE-MDA
## 5       3320                 GENIE-JHU
## 6       2632                 GENIE-UHN
## 7       2492                TARGET-AML
## 8       2052                GENIE-VICC
## 9       1587             TARGET-ALL-P2
## 10      1132                TARGET-NBL
## 11      1098                 TCGA-BRCA
## 12      1046                   CPTAC-3
## 13      1038                GENIE-GRCC
## 14       995             MMRF-COMMPASS
## 15       826         BEATAML1.0-COHORT
## 16       801                 GENIE-NKI
## 17       652                 TARGET-WT
## 18       617                  TCGA-GBM
## 19       608                   TCGA-OV
## 20       585                 TCGA-LUAD
## 21       560                 TCGA-UCEC
## 22       537                 TCGA-KIRC
## 23       528                 TCGA-HNSC
## 24       516                  TCGA-LGG
## 25       507                 TCGA-THCA
## 26       504                 TCGA-LUSC
## 27       500                 TCGA-PRAD
## 28       489              NCICCR-DLBCL
## 29       470                 TCGA-SKCM
## 30       461                 TCGA-COAD
## 31       443                 TCGA-STAD
## 32       440                 REBC-THYR
## 33       412                 TCGA-BLCA
## 34       383                 TARGET-OS
## 35       377                 TCGA-LIHC
## 36       342                   CPTAC-2
## 37       339                  TRIO-CRU
## 38       307                 TCGA-CESC
## 39       291                 TCGA-KIRP
## 40       261                 TCGA-SARC
## 41       212             CGCI-HTMCP-CC
## 42       200                   CMI-MBC
## 43       200                 TCGA-LAML
## 44       191             TARGET-ALL-P3
## 45       185                 TCGA-ESCA
## 46       185                 TCGA-PAAD
## 47       179                 TCGA-PCPG
## 48       176                  OHSU-CNL
## 49       172                 TCGA-READ
## 50       150                 TCGA-TGCT
## 51       124                 TCGA-THYM
## 52       120                CGCI-BLGSP
## 53       113                 TCGA-KICH
## 54       110                 HCMI-CMDC
## 55       101                WCDT-MCRPC
## 56        92                  TCGA-ACC
## 57        87                 TCGA-MESO
## 58        84 EXCEPTIONAL_RESPONDERS-ER
## 59        80                  TCGA-UVM
## 60        70       ORGANOID-PANCREATIC
## 61        69                 TARGET-RT
## 62        58                 TCGA-DLBC
## 63        57                  TCGA-UCS
## 64        56     BEATAML1.0-CRENOLANIB
## 65        52                 MP2PRT-WT
## 66        51                 TCGA-CHOL
## 67        45               CTSP-DLBCL1
## 68        36                   CMI-ASC
## 69        30                   CMI-MPC
## 70        24             TARGET-ALL-P1
## 71        13               TARGET-CCSK
## 72         7            VAREPOP-APOLLO
library(ggplot2)
ggplot(res$project.project_id,aes(x = key, y = doc_count)) +
    geom_bar(stat='identity') +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
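
Because key is a character column, ggplot2 orders the bars alphabetically rather than by count. A possible refinement, sketched here on a toy data frame (cases_toy, a hypothetical name) with a few rows copied from the aggregation above, is to turn key into a factor whose levels follow decreasing doc_count and, optionally, to keep only the largest projects.

```r
# A few rows copied from the aggregation above, deliberately shuffled
cases_toy <- data.frame(
  key = c("TCGA-BRCA", "CPTAC-3", "FM-AD", "TARGET-AML", "GENIE-MSK"),
  doc_count = c(1098, 1046, 18004, 2492, 16824),
  stringsAsFactors = FALSE
)

# Row order that sorts by decreasing number of cases
ord <- order(cases_toy$doc_count, decreasing = TRUE)

# Factor levels control the order of bars on the x axis
cases_toy$key <- factor(cases_toy$key, levels = cases_toy$key[ord])
levels(cases_toy$key)

# Keep only the three largest projects
top3 <- head(cases_toy[ord, ], 3)
top3

# The plotting call itself is unchanged, for example:
# ggplot(top3, aes(x = key, y = doc_count)) + geom_bar(stat = 'identity')
```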

How many cases are included in all TARGET projects?

cases() %>% filter(~ project.program.name=='TARGET') %>% count()
## [1] 6543

How many cases are included in all TCGA projects?

cases() %>% filter(~ project.program.name=='TCGA') %>% count()
## [1] 11315

What is the breakdown of sample types in TCGA-BRCA?

# The need to do the "&" here is a requirement of the
# current version of the GDC API. I have filed a feature
# request to remove this requirement.
resp = cases() %>% filter(~ project.project_id=='TCGA-BRCA' &
                              project.project_id=='TCGA-BRCA' ) %>%
    facet('samples.sample_type') %>% aggregations()
resp$samples.sample_type
##   doc_count                  key
## 1      1098        primary tumor
## 2      1011 blood derived normal
## 3       162  solid tissue normal
## 4         7           metastatic

Fetch all cases in TCGA-BRCA that have a “Solid Tissue Normal” sample.

# Combine the project filter with a sample-type filter, and add
# samples.sample_type to the default fields that are returned.
resp = cases() %>% filter(~ project.project_id=='TCGA-BRCA' &
                              samples.sample_type=='Solid Tissue Normal') %>%
    GenomicDataCommons::select(c(default_fields(cases()),'samples.sample_type')) %>%
    response_all()
count(resp)
## [1] 162
res = resp %>% results()
str(res[1],list.len=6)
## List of 1
##  $ id: chr [1:162] "3d676bba-154b-4d22-ab59-d4d4da051b94" "1133b8a9-6b11-4511-b70a-f200e3b8b5db" "17c1d42c-cb84-4655-a4cd-b54bae17ecaf" "9da462b0-93c2-4305-89f6-7199a30399a7" ...
head(ids(resp))
## [1] "3d676bba-154b-4d22-ab59-d4d4da051b94"
## [2] "1133b8a9-6b11-4511-b70a-f200e3b8b5db"
## [3] "17c1d42c-cb84-4655-a4cd-b54bae17ecaf"
## [4] "9da462b0-93c2-4305-89f6-7199a30399a7"
## [5] "14267783-5624-4fe5-ba81-9d67f1017474"
## [6] "26573441-eedb-4364-966c-e7f803deef19"

Get all TCGA case ids that are female

cases() %>%
  GenomicDataCommons::filter(~ project.program.name == 'TCGA' &
    "cases.demographic.gender" %in% "female") %>%
      GenomicDataCommons::results(size = 4) %>%
        ids()
## [1] "cbfef004-b437-4d51-9d88-a2db50aa6481"
## [2] "a9644274-13bb-4228-9b4f-14260ccc26eb"
## [3] "096bd95f-9900-4db2-b1c4-103902c3b31f"
## [4] "0a45f302-5748-48f3-9dc9-66c01843a68e"

Get all TCGA-COAD case ids that are NOT female

cases() %>%
  GenomicDataCommons::filter(~ project.project_id == 'TCGA-COAD' &
    "cases.demographic.gender" %exclude% "female") %>%
      GenomicDataCommons::results(size = 4) %>%
        ids()
## [1] "58facedb-fcb8-4ecf-8338-2bfa4947acef"
## [2] "0a94eecf-4db2-4846-8383-c83ff02e4a9f"
## [3] "8368d745-c74d-4236-ba75-16ca7aaeb3ca"
## [4] "eb4e4e09-98b3-4e85-8dd2-75676ff2af14"

Get all TCGA cases that are missing gender

cases() %>%
  GenomicDataCommons::filter(~ project.program.name == 'TCGA' &
    missing("cases.demographic.gender")) %>%
      GenomicDataCommons::results(size = 4) %>%
        ids()
## [1] "aebd0313-23be-46a8-abc6-b16c531c3a8e"
## [2] "07119baf-64a7-454c-b1b0-c769b506a63d"
## [3] "494d50b6-578e-441b-8195-4b4d26c0d810"
## [4] "9b714a42-62e8-4b33-947e-6c4850725afd"

Get all TCGA cases that are NOT missing gender

cases() %>%
  GenomicDataCommons::filter(~ project.program.name == 'TCGA' &
    !missing("cases.demographic.gender")) %>%
      GenomicDataCommons::results(size = 4) %>%
        ids()
## [1] "35243518-b086-4d76-a336-8c61a14f9ded"
## [2] "df3362cb-0f6b-412e-8af4-c5606526be17"
## [3] "6433f001-db5c-476b-83c7-23f4c5397ae9"
## [4] "eb80244a-5f20-49a8-8f73-e92c14395895"

Files

How many of each type of file are available?

res = files() %>% facet('type') %>% aggregations()
res$type
##    doc_count                           key
## 1     190236    annotated_somatic_mutation
## 2     123377                 aligned_reads
## 3      86801          structural_variation
## 4      86592       simple_somatic_mutation
## 5      58883           copy_number_segment
## 6      46153          copy_number_estimate
## 7      43618               gene_expression
## 8      33639   aggregated_somatic_mutation
## 9      33234              mirna_expression
## 10     31716      masked_methylation_array
## 11     30075                   slide_image
## 12     26002        biospecimen_supplement
## 13     15858        methylation_beta_value
## 14     15635       masked_somatic_mutation
## 15     13135           clinical_supplement
## 16      7906            protein_expression
## 17        84 secondary_expression_analysis
## 18        67              pathology_report
ggplot(res$type,aes(x = key,y = doc_count)) + geom_bar(stat='identity') +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

Find gene-level RNA-seq quantification files for GBM

q = files() %>%
    GenomicDataCommons::select(available_fields('files')) %>%
    filter(~ cases.project.project_id=='TCGA-GBM' &
               data_type=='Gene Expression Quantification')
q %>% facet('analysis.workflow_type') %>% aggregations()
## list()
# so need to add another filter
file_ids = q %>% filter(~ cases.project.project_id=='TCGA-GBM' &
                            data_type=='Gene Expression Quantification' &
                            analysis.workflow_type == 'STAR - Counts') %>%
    GenomicDataCommons::select('file_id') %>%
    response_all() %>%
    ids()

Slicing

Get all BAM file ids from TCGA-GBM

I need to figure out how to do slicing reproducibly in a testing environment and for vignette building.

q = files() %>%
    GenomicDataCommons::select(available_fields('files')) %>%
    filter(~ cases.project.project_id == 'TCGA-GBM' &
               data_type == 'Aligned Reads' &
               experimental_strategy == 'RNA-Seq' &
               data_format == 'BAM')
file_ids = q %>% response_all() %>% ids()
bamfile = slicing(file_ids[1],regions="chr12:6534405-6538375",token=gdc_token())
library(GenomicAlignments)
aligns = readGAlignments(bamfile)
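
Since a slicing request needs a token and network access, it may be worth checking the region string locally first. The helper below is a minimal sketch that assumes regions are written as "chr:start-end", as in the call above; parse_region is a hypothetical name and is not part of GenomicDataCommons.

```r
# Hypothetical helper: split a "chr:start-end" string into its parts
parse_region <- function(region) {
  pattern <- "^([^:]+):([0-9]+)-([0-9]+)$"
  m <- regmatches(region, regexec(pattern, region))[[1]]
  if (length(m) != 4) {
    stop("Region must look like 'chr:start-end': ", region)
  }
  start <- as.integer(m[3])
  end <- as.integer(m[4])
  if (start > end) {
    stop("Region start is greater than region end: ", region)
  }
  # width counts both end points
  list(chrom = m[2], start = start, end = end, width = end - start + 1L)
}

region <- parse_region("chr12:6534405-6538375")
region$chrom
region$width
```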

Troubleshooting

SSL connection errors

  • Symptom: Trying to connect to the API results in:
Error in curl::curl_fetch_memory(url, handle = handle) :
SSL connect error

sessionInfo()

## R version 4.2.2 (2022-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.4.0             GenomicDataCommons_1.23.0
## [3] magrittr_2.0.3            knitr_1.40               
## [5] BiocStyle_2.24.0         
## 
## loaded via a namespace (and not attached):
##  [1] assertthat_0.2.1       rprojroot_2.0.3        digest_0.6.30         
##  [4] utf8_1.2.2             R6_2.5.1               GenomeInfoDb_1.32.4   
##  [7] stats4_4.2.2           evaluate_0.18          highr_0.9             
## [10] httr_1.4.4             pillar_1.8.1           zlibbioc_1.42.0       
## [13] rlang_1.0.6            curl_4.3.3             jquerylib_0.1.4       
## [16] S4Vectors_0.34.0       rmarkdown_2.17         pkgdown_2.0.6         
## [19] labeling_0.4.2         textshaping_0.3.6      desc_1.4.2            
## [22] readr_2.1.3            stringr_1.4.1          RCurl_1.98-1.9        
## [25] munsell_0.5.0          compiler_4.2.2         xfun_0.34             
## [28] pkgconfig_2.0.3        systemfonts_1.0.4      BiocGenerics_0.42.0   
## [31] htmltools_0.5.3        tidyselect_1.2.0       tibble_3.1.8          
## [34] GenomeInfoDbData_1.2.8 bookdown_0.29          IRanges_2.30.1        
## [37] fansi_1.0.3            withr_2.5.0            crayon_1.5.2          
## [40] dplyr_1.0.10           tzdb_0.3.0             bitops_1.0-7          
## [43] rappdirs_0.3.3         grid_4.2.2             jsonlite_1.8.3        
## [46] gtable_0.3.1           lifecycle_1.0.3        DBI_1.1.3             
## [49] scales_1.2.1           cli_3.4.1              stringi_1.7.8         
## [52] cachem_1.0.6           farver_2.1.1           XVector_0.36.0        
## [55] fs_1.5.2               xml2_1.3.3             bslib_0.4.1           
## [58] ellipsis_0.3.2         ragg_1.2.4             generics_0.1.3        
## [61] vctrs_0.5.0            tools_4.2.2            glue_1.6.2            
## [64] purrr_0.3.5            hms_1.1.2              fastmap_1.1.0         
## [67] yaml_2.3.6             colorspace_2.0-3       BiocManager_1.30.19   
## [70] GenomicRanges_1.48.0   memoise_2.0.1          sass_0.4.2

Developer notes

  • The S3 object-oriented programming paradigm is used.
  • We have adopted a functional programming style with functions and methods that often take an “object” as the first argument. This style lends itself to pipeline-style programming.
  • The GenomicDataCommons package uses the alternative request format (POST) to allow very large request bodies.
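
The first two notes can be illustrated with a toy example. The classes and functions below (toy_query, add_filter, describe) are hypothetical and unrelated to the package internals; they only show how S3 generics that take the object as their first argument and return a modified object compose naturally in a pipeline.

```r
# Toy constructor: the class vector carries an entity-specific class
# plus a shared parent class
toy_query <- function(entity) {
  structure(list(entity = entity, filters = character(0)),
            class = c(paste0("toy_", entity), "toy_query"))
}

# Generic with the object as first argument, so calls can be piped
add_filter <- function(x, expr) UseMethod("add_filter")

# Method for the parent class: append the expression, return the object
add_filter.toy_query <- function(x, expr) {
  x$filters <- c(x$filters, expr)
  x
}

describe <- function(x) UseMethod("describe")

describe.toy_query <- function(x) {
  paste0(x$entity, " query with ", length(x$filters), " filter(s)")
}

# Each call returns an object of the same class, so steps can be nested
# here or written as a pipeline with %>%
qry <- add_filter(add_filter(toy_query("files"),
                             "type == 'gene_expression'"),
                  "access == 'open'")
describe(qry)
```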