Cloud-scale genomic data science with Bioconductor

Abstract

Bioconductor’s approach to the analysis of genome-scale assays is rooted in commitments to the use of self-describing data objects representing genomic assays and annotation. Analysis tools and workflows based on these objects have proven effective in a large number of scientific projects and publications.

The dominant model for utilization of Bioconductor to date involves a locally controlled deployment of R and Bioconductor/CRAN packages in an essentially closed storage and execution environment.

New approaches to federated elastic computing with lab-resident or commercial cloud environments provide opportunities for inference on questions of vast scope. This workshop is devoted to understanding how to leverage Bioconductor’s strengths in seizing these new opportunities. Special attention is devoted to how programming and reporting patterns familiar from two decades of Bioconductor development and use can be retained, or must change, in cloud-scale genomic data science.

Our approach will be a mix of lecture and hands-on programming with Rstudio Cloud. We will learn how the restfulSE and BiocOncoTK packages work with HDF Scalable Data Service and Google BigQuery to provide immediate interactive access to a compendium of 181000 human transcriptomics experiments, and to the PanCancer Atlas. We will also learn how to couple Docker containers with formal workflows in CWL and WDL to achieve sharable reproducible analyses with nearly zero configuration.

Pre-requisites

  • Basic knowledge of R syntax
  • Familiarity with the SummarizedExperiment class
  • Familiarity with one or more of TCGA, GTEx, BigQuery
  • Familiarity with docker containers is not required but a running docker installation will be useful

Workshop Participation

Students should have a laptop and be prepared to execute specific commands to load packages and evaluate functions. It will be helpful to have a Google identity that may be necessary to work with BigQuery.

R / Bioconductor packages used

DelayedArray, restfulSE, rhdf5client, BiocOncoTK, htxcomp (github/vjcitn), TxRegInfra

Time outline

Approximate timings for sections of workshop

Activity Time
Review of Bioconductor software and data structures 10m
DelayedArray concepts 5m
Exercises with htxcomp and the HDF Scalable Data Service 10m
Exercises with PanCancer Atlas and Google BigQuery 10m
Docker and CWL/WDL with Dockstore.org 10m

Workshop goals and objectives

Goals:

  • Develop an appreciation of strengths and limitations of Bioconductor’s approach to structure and annotation of genome-scale data as scope of data grows to cloud scale

  • Learn about alternatives to “all-in-memory” models of computing in R, and how Bioconductor has used such alternatives in the local computing model (e.g., external SQLite databases, local HDF5 serialization, API to remote services)

  • Obtain experience using Bioconductor methods and tools with data and annotation that are cloud-scale

  • Develop an appreciation of threats to reliability and predictable costs that arise when working with commercial cloud computing

Objectives:

  • Use rhdf5client to interact with matrix data in HDF Scalable Data Service

  • Use BiocOncoTK to interrogate multiomic PanCancer atlas data in Google BigQuery

  • Understand the role of Docker containers and formal workflow expression in establishing reproducible and shareable large scale analyses

Course activities

Review of Bioconductor software and data structures (10min)

Bioconductor uses an approach to object-oriented programming to help control complexity of programming in the domain of genome-scale data science.

We will illustrate the basic ideas by using a catalog of features of the human genome.

The S4 classes EnsDb and GRanges

library(EnsDb.Hsapiens.v79)
EnsDb.Hsapiens.v79
## EnsDb for Ensembl:
## |Backend: SQLite
## |Db type: EnsDb
## |Type of Gene ID: Ensembl Gene ID
## |Supporting package: ensembldb
## |Db created by: ensembldb package from Bioconductor
## |script_version: 0.3.0
## |Creation time: Thu May 18 12:42:51 2017
## |ensembl_version: 79
## |ensembl_host: localhost
## |Organism: homo_sapiens
## |taxonomy_id: 9606
## |genome_build: GRCh38
## |DBSCHEMAVERSION: 2.0
## | No. of genes: 65774.
## | No. of transcripts: 214285.
## |Protein data available.

The object named EnsDb.Hsapiens.v79 mediates access to a SQLite database that contains information on the Ensembl definitions of genes for reference build hg38. This object is an instance of a class:

class(EnsDb.Hsapiens.v79)
## [1] "EnsDb"
## attr(,"package")
## [1] "ensembldb"

Formal methods are defined for this class:

methods(class=class(EnsDb.Hsapiens.v79))
##  [1] activeFilter              addFilter                
##  [3] cdsBy                     columns                  
##  [5] convertFilter             dbconn                   
##  [7] disjointExons             dropFilter               
##  [9] ensemblVersion            exons                    
## [11] exonsBy                   exonsByOverlaps          
## [13] fiveUTRsByTranscript      genes                    
## [15] getGeneRegionTrackForGviz getGenomeFaFile          
## [17] getGenomeTwoBitFile       hasProteinData           
## [19] initialize                intronsByTranscript      
## [21] keys                      keytypes                 
## [23] lengthOf                  listColumns              
## [25] listGenebiotypes          listTables               
## [27] listTxbiotypes            listUniprotDbs           
## [29] listUniprotMappingTypes   mapIds                   
## [31] metadata                  organism                 
## [33] promoters                 proteins                 
## [35] returnFilterColumns       returnFilterColumns<-    
## [37] select                    seqinfo                  
## [39] seqlevels                 seqlevelsStyle           
## [41] seqlevelsStyle<-          show                     
## [43] supportedFilters          supportedSeqlevelsStyles 
## [45] threeUTRsByTranscript     transcripts              
## [47] transcriptsBy             transcriptsByOverlaps    
## [49] updateEnsDb               useMySQL                 
## see '?methods' for accessing help and source code

Let’s try the genes method.

genes79 = genes(EnsDb.Hsapiens.v79)
genome(genes79)[1]
##        1 
## "GRCh38"
class(genes79)
## [1] "GRanges"
## attr(,"package")
## [1] "GenomicRanges"
head(genes79[,1:3])
## GRanges object with 6 ranges and 3 metadata columns:
##                   seqnames      ranges strand |         gene_id
##                      <Rle>   <IRanges>  <Rle> |     <character>
##   ENSG00000223972        1 11869-14409      + | ENSG00000223972
##   ENSG00000227232        1 14404-29570      - | ENSG00000227232
##   ENSG00000278267        1 17369-17436      - | ENSG00000278267
##   ENSG00000243485        1 29554-31109      + | ENSG00000243485
##   ENSG00000274890        1 30366-30503      + | ENSG00000274890
##   ENSG00000237613        1 34554-36081      - | ENSG00000237613
##                      gene_name                       gene_biotype
##                    <character>                        <character>
##   ENSG00000223972      DDX11L1 transcribed_unprocessed_pseudogene
##   ENSG00000227232       WASH7P             unprocessed_pseudogene
##   ENSG00000278267    MIR6859-3                              miRNA
##   ENSG00000243485 RP11-34P13.3                            lincRNA
##   ENSG00000274890    MIR1302-9                              miRNA
##   ENSG00000237613      FAM138A                            lincRNA
##   -------
##   seqinfo: 319 sequences from GRCh38 genome

Basic R language elements are given new meaning when applied to structures like GRanges. Here we use $ to obtain one of the fields of metadata about genes. The operation returns a vector that we summarize using basic R functions.

sort(table(genes79$gene_biotype),decreasing=TRUE)[1:6]
## 
##         protein_coding   processed_pseudogene                lincRNA 
##                  22002                  10702                   7839 
##              antisense                  miRNA unprocessed_pseudogene 
##                   5648                   4548                   3137

dplyr idioms can be used with some help (and this may become more straightforward over time):

library(dplyr)
library(magrittr)
as.data.frame(mcols(genes79)) %>% 
  select(gene_name, gene_biotype) %>%
  filter(gene_biotype == "ribozyme")
##         gene_name gene_biotype
## 1  CPEB3_ribozyme     ribozyme
## 2 Hammerhead_HH10     ribozyme
## 3   CoTC_ribozyme     ribozyme
## 4      RNaseP_nuc     ribozyme
## 5        RPPH1-2P     ribozyme
## 6        RPPH1-3P     ribozyme
## 7          RMRPP5     ribozyme
## 8       RNase_MRP     ribozyme

tibble representations have pleasant summaries:

library(tibble)
as.tibble(genes79)
## Warning: `as.tibble()` is deprecated, use `as_tibble()` (but mind the new semantics).
## This warning is displayed once per session.
## # A tibble: 65,774 x 11
##    seqnames start    end width strand gene_id gene_name gene_biotype
##    <fct>    <int>  <int> <int> <fct>  <chr>   <chr>     <chr>       
##  1 1        11869  14409  2541 +      ENSG00… DDX11L1   transcribed…
##  2 1        14404  29570 15167 -      ENSG00… WASH7P    unprocessed…
##  3 1        17369  17436    68 -      ENSG00… MIR6859-3 miRNA       
##  4 1        29554  31109  1556 +      ENSG00… RP11-34P… lincRNA     
##  5 1        30366  30503   138 +      ENSG00… MIR1302-9 miRNA       
##  6 1        34554  36081  1528 -      ENSG00… FAM138A   lincRNA     
##  7 1        52473  53312   840 +      ENSG00… OR4G4P    unprocessed…
##  8 1        62948  63887   940 +      ENSG00… OR4G11P   unprocessed…
##  9 1        69091  70008   918 +      ENSG00… OR4F5     protein_cod…
## 10 1        89295 133723 44429 -      ENSG00… RP11-34P… lincRNA     
## # … with 65,764 more rows, and 3 more variables: seq_coord_system <chr>,
## #   symbol <chr>, entrezid <I<list>>

Finding and visualizing genomic elements in a specified chromosomal region

GRanges are easy to construct and can be used to query genomes. Here we deal with three problems related to genome region specification. We start with a region specified using coordinates from reference build hg19 (GRCh37). We use UCSC’s liftOver utility to convert to GRCh38 (hg38). We finish by converting the chromosome annotation to that used by Ensembl.

myRange = GRanges("chr10", IRanges(37.45e6, 37.8e6))
myRange
## GRanges object with 1 range and 0 metadata columns:
##       seqnames            ranges strand
##          <Rle>         <IRanges>  <Rle>
##   [1]    chr10 37450000-37800000      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
library(rtracklayer)
ch = import.chain("hg19ToHg38.over.chain")
myr38 = liftOver(myRange, ch)[[1]]
library(GenomeInfoDb)
genome(myr38) = "GRCh38"
seqlevelsStyle(myr38) = "Ensembl"
myr38
## GRanges object with 1 range and 0 metadata columns:
##       seqnames            ranges strand
##          <Rle>         <IRanges>  <Rle>
##   [1]       10 37161072-37511072      *
##   -------
##   seqinfo: 1 sequence from GRCh38 genome; no seqlengths

It is easy to find ‘gene-level’ elements using subsetByOverlaps:

els = subsetByOverlaps(genes79, myr38)
as.tibble(els)
## # A tibble: 8 x 11
##   seqnames  start    end  width strand gene_id gene_name gene_biotype
##   <fct>     <int>  <int>  <int> <fct>  <chr>   <chr>     <chr>       
## 1 10       3.71e7 3.74e7 258324 +      ENSG00… ANKRD30A  protein_cod…
## 2 10       3.72e7 3.72e7   1163 +      ENSG00… RP11-20F… lincRNA     
## 3 10       3.72e7 3.73e7  67684 +      ENSG00… ATP8A2P1  transcribed…
## 4 10       3.73e7 3.73e7    292 +      ENSG00… RN7SL314P misc_RNA    
## 5 10       3.73e7 3.73e7    110 -      ENSG00… Y_RNA     misc_RNA    
## 6 10       3.73e7 3.73e7  37847 +      ENSG00… LINC00993 lincRNA     
## 7 10       3.73e7 3.73e7   1457 +      ENSG00… TMEM161B… processed_p…
## 8 10       3.73e7 3.73e7    957 +      ENSG00… VN1R53P   processed_p…
## # … with 3 more variables: seq_coord_system <chr>, symbol <chr>,
## #   entrezid <I<list>>

To get a quick view of the layout of these elements on their chromosome, we can use Gviz:

library(survival)
library(Gviz)
options(ucscChromosomeNames=FALSE)
plotTracks(list(GenomeAxisTrack(), 
  GeneRegionTrack(els, gene=els$symbol)), showId=TRUE)

There are many options available to enhance this display. An important task for this region would be to distinguish exons and introns. See the Gviz vignette for details.

DelayedArray concepts (5min)

Our interest in cloud-scale computing methods arises in part from desire to analyze very large data sets with computers that have relatively small endowments of random access memory. The most common methods of working with R assume that all data are resident in memory and can be addressed directly.

Acquisition of breast cancer RNA-seq data from TCGA

In this section we will construct two representations of RNA-seq data for breast cancer tumors collected in TCGA. First we use curatedTCGAData to obtain a standard in-memory representation in a SummarizedExperiment.

library(curatedTCGAData)
brMAE = curatedTCGAData("BRCA", "RNASeq2GeneNorm", dry.run=FALSE)
brexp = experiments(brMAE)[[1]] # should use name
colnames(brexp) = substr(colnames(brexp),1,12) # need to shorten
cd = colData(brMAE)[colnames(brexp),]
colData(brexp) = cd
brexp
## class: SummarizedExperiment 
## dim: 20501 1212 
## metadata(0):
## assays(1): ''
## rownames(20501): A1BG A1CF ... psiTPTE22 tAKR
## rowData names(0):
## colnames(1212): TCGA-3C-AAAU TCGA-3C-AALI ... TCGA-Z7-A8R5
##   TCGA-Z7-A8R6
## colData names(2684): patientID years_to_birth ...
##   Integrated.Clusters..unsup.exp.
##   X60.Gene.classifier.Class.Assignment

We estimate its size in RAM.

library(SummarizedExperiment) # assay
object.size(assay(brexp))
## 200190840 bytes

This is not particularly large, but in certain applications it is advantageous to have tight control over the memory consumption required for an analysis. In this form we have no choice – either the complete dataset is in memory or it is not and cannot be accessed without loading it in its entirety.

Exporting the expression data to HDF5

Now we develop a representation on disk. We use the HDF5 data format as it is well-established as a tool for managing numerical data in scientific computing.

library(HDF5Array)
saveHDF5SummarizedExperiment(brexp, "brexpHDF5", replace=TRUE)

We named the HDF5 repository for the data brexpHDF5; this is in fact a folder created in the current working directory. Its contents are a SummarizedExperiment ‘shell’ in RDS format, and the HDF5 matrix representation of the RNA-seq quantifications.

dir("brexpHDF5")
## [1] "assays.h5" "se.rds"

Using the on-disk HDF5 representation

We use a loading function to retrieve the new SummarizedExperiment instance. Its memory consumption is independent of the dimensions of the assay matrix.

brextern = loadHDF5SummarizedExperiment("brexpHDF5")
assay(brextern)
## <20501 x 1212> DelayedMatrix object of type "double":
##           TCGA-3C-AAAU TCGA-3C-AALI ... TCGA-Z7-A8R5 TCGA-Z7-A8R6
##      A1BG     197.0897     237.3844   .     439.5425     248.3271
##      A1CF       0.0000       0.0000   .       0.0000       0.0000
##     A2BP1       0.0000       0.0000   .       0.5973       0.0000
##     A2LD1     102.9634      70.8646   .      81.3010      25.1866
##     A2ML1       1.3786       4.3502   .       2.3893       4.1757
##       ...            .            .   .            .            .
##       ZYX    3507.2482    5504.6221   .    6675.6264    3402.5228
##     ZZEF1    1894.9342    1318.6514   .     754.4127     564.4193
##      ZZZ3    1180.4565     406.7428   .     750.8288     462.1140
## psiTPTE22       1.7233     926.5905   .     238.9272      20.8786
##      tAKR       0.0000       0.0000   .       0.0000       0.0000
object.size(assay(brextern))
## 1415792 bytes

Targeted queries to the HDF5 store are rapidly resolved.

par(las=2, mar=c(18,4,3,3))
boxplot(split(log(as.numeric(assay(brextern["BRCA2",])+1)), 
   brexp$histological_type), ylab="log BRCA2")

But there is a price to pay.

library(microbenchmark)
microbenchmark(
 split(log(as.numeric(assay(brextern["BRCA2", ]) + 1)), 
   brexp$histological_type),
 times=5)
## Unit: milliseconds
##                                                                             expr
##  split(log(as.numeric(assay(brextern["BRCA2", ]) + 1)), brexp$histological_type)
##       min       lq     mean   median       uq      max neval
##  331.8319 526.5889 533.4987 574.0303 603.4796 631.5628     5

Compare to the in memory benchmark:

microbenchmark(
 split(log(as.numeric(assay(brexp["BRCA2", ]) + 1)), 
   brexp$histological_type),
 times=5)
## Unit: milliseconds
##                                                                          expr
##  split(log(as.numeric(assay(brexp["BRCA2", ]) + 1)), brexp$histological_type)
##       min      lq     mean   median       uq      max neval
##  14.05118 14.3116 14.31203 14.34153 14.39318 14.46265     5

Exercises with the HumanTranscriptomeCompendium and the HDF Scalable Data Service (10min)

The HumanTranscriptomeCompendium package was formed to take advantage of two new technologies.

The first is the OmicIDX metadata transformation and access facility, devised by Dr Sean Davis of NCI. This currently provides comprehensive access to metadata about sequencing studies collected at NCBI SRA, but will be extended to other institutional metadata archives.

The second is HDF Scalable Data Service, devised by John Readey of The HDF Group. Substantial support has been provided to Bioconductor by HDF Group, with open hosting of significant genomic data archives in a publicly accessible server.

Fruits of both of these developments are used in the following exercises.

Command-line work with HumanTranscriptomeCompendium

library(HumanTranscriptomeCompendium)
htxSE = htx_load()
## Using temporary cache /tmp/RtmpTY9jSI/BiocFileCache
## adding RDS to local cache, future invocations will use local image
## adding rname 'https://s3.amazonaws.com/bcfound-bigrna/rangedHtxGeneSE.rds'
htxSE
## class: RangedSummarizedExperiment 
## dim: 58288 181134 
## metadata(1): rangeSource
## assays(1): counts
## rownames(58288): ENSG00000000003.14 ENSG00000000005.5 ...
##   ENSG00000284747.1 ENSG00000284748.1
## rowData names(0):
## colnames(181134): DRX001125 DRX001126 ... SRX999990 SRX999991
## colData names(4): experiment_accession experiment_platform
##   study_accession study_title

There’s a little extra annotation to add.

htxSE = addRD(htxSE)
library(SummarizedExperiment)
head(rowData(htxSE))
## DataFrame with 6 rows and 4 columns
##                         gene_type            gene_id   gene_name
##                       <character>        <character> <character>
## ENSG00000000003.14 protein_coding ENSG00000000003.14      TSPAN6
## ENSG00000000005.5  protein_coding  ENSG00000000005.5        TNMD
## ENSG00000000419.12 protein_coding ENSG00000000419.12        DPM1
## ENSG00000000457.13 protein_coding ENSG00000000457.13       SCYL3
## ENSG00000000460.16 protein_coding ENSG00000000460.16    C1orf112
## ENSG00000000938.12 protein_coding ENSG00000000938.12         FGR
##                             havana_gene
##                             <character>
## ENSG00000000003.14 OTTHUMG00000022002.1
## ENSG00000000005.5  OTTHUMG00000022001.1
## ENSG00000000419.12 OTTHUMG00000032742.2
## ENSG00000000457.13 OTTHUMG00000035941.5
## ENSG00000000460.16 OTTHUMG00000035821.8
## ENSG00000000938.12 OTTHUMG00000003516.2

We will make a searchable table of study titles and sizes.

ad = as.data.frame(colData(htxSE))
library(dplyr)
library(magrittr)
library(DT)
studtab = (ad %>% select(study_title, study_accession) %>% group_by(study_title, study_accession) %>% summarise(n=n()) %>% as.data.frame()) 
datatable(studtab)

Search for zika in this table.

zikp = htxSE[, which(htxSE$study_accession == "SRP075248")]
assay(zikp)
## <58288 x 1598> DelayedMatrix object of type "double":
##                    SRX1767663 SRX1767664 ... SRX1769277 SRX1769278
## ENSG00000000003.14  94.662016   0.000000   . 623.854770 627.757554
##  ENSG00000000005.5   0.000000   0.000000   .   0.000000   0.000000
## ENSG00000000419.12   2.999995   0.000000   . 101.000070  50.000247
## ENSG00000000457.13  24.025096   6.628778   .  45.372716   0.000000
## ENSG00000000460.16   0.000000  12.376506   .   4.000001  39.000084
##                ...          .          .   .          .          .
##  ENSG00000284744.1   1.004693   2.999999   .          0          0
##  ENSG00000284745.1   0.000000   0.000000   .          0          0
##  ENSG00000284746.1   0.000000   0.000000   .          0          0
##  ENSG00000284747.1   2.844749   0.000000   .          0          0
##  ENSG00000284748.1   0.000000   0.000000   .          0          0

To acquire additional metadata on this study, you could use SRAdbV2. We will skip that for now.

The ca43k app

We have a shiny app that mediates access to cancer-related studies in the compendium. Using the following command starts the app. Clear the initial tabs and cart elements. Search for the string ‘lncRNAs’ and then search the resulting table for ‘archived’. Then stop the app.

cadat = ca43k()

Now metadata(cadat)[[2]] is a data.frame that gives sample-level metadata on the samples that were retrieved. How many different tissues were assayed?

Have a look at the paper referenced in the associated PMID. Find the lncRNA that the authors assert is specific to breast and display its distribution of abundance across tissues.

Exercises with PanCancer Atlas and Google BigQuery (10min)

The PanCancer Atlas builds on TCGA by including a number of matched normal tissue samples subjected to many of the same assay processes as the tumor tissues.

In this section we’ll create SummarizedExperiment instances for 450k methylation assays in bladder cancer.

Some background on PanCancer Atlas data in the ISB Cancer Genomics Cloud project

The names of resources available for PanCancer atlas are somewhat unwieldy. We’ve created abbreviations.

library(BiocOncoTK)
anndf = data.frame(abbr=names(
  BiocOncoTK::annotTabs), tabname=as.character(BiocOncoTK::annotTabs))
datatable(anndf)

In this section you must have environment variable CGC_BILLING set to a valid google cloud platform billing account.

Creating a ‘RESTful’ SummarizedExperiment with BigQuery references for assay data

library(BiocOncoTK)
bq = pancan_BQ()
bq@quiet = TRUE
se1 = buildPancanSE(bq)
se1

## class: RangedSummarizedExperiment 
## dim: 396065 409 
## metadata(3): acronym assay sampType
## assays(1): assay
## rownames(396065): cg00000029 cg00000165 ... rs966367 rs9839873
## rowData names(3): gene_id gene_name gene_biotype
## colnames(409): TCGA-FD-A3SN TCGA-FD-A5BV ... TCGA-UY-A78L
##   TCGA-ZF-AA4T
## colData names(20): bcr_patient_uuid bcr_patient_barcode ...
##   radiation_therapy race
assay(se1)

## <396065 x 409> DelayedMatrix object of type "double":
##            TCGA-FD-A3SN TCGA-FD-A5BV ... TCGA-UY-A78L TCGA-ZF-AA4T
## cg00000029    0.0954445    0.0687728   .     0.162314     0.172125
## cg00000165    0.8368510    0.1360180   .     0.623998     0.387653
## cg00000236    0.9393230    0.9227600   .     0.873759     0.904457
## cg00000289    0.7025300    0.7373440   .     0.641360     0.609875
## cg00000292    0.7530310    0.8062190   .     0.729177     0.505125
##        ...            .            .   .            .            .
##  rs9363764    0.9462190    0.9424980   .    0.4760480    0.0337384
##   rs939290    0.0228799    0.9769100   .    0.9669960    0.9753450
##   rs951295    0.3357700    0.0386658   .    0.5140070    0.7081640
##   rs966367    0.9464320    0.9492910   .    0.8572390    0.4777650
##  rs9839873    0.9548290    0.2012180   .    0.9278950    0.9552880

Comparing methylation levels between tumor and matched normal tissues at a specific CpG probe

In PMID 29540343 it is noted that increased methylation at CpG probe cg22748573 (within exon 1 of CITED4) is associated with decreased risk of bladder cancer. We’ll use tumor-normal pairs to check for differential methylation in tumor tissue at this site.

nor = buildPancanSE(bq, sampType="NT")
nor
kp = intersect(colnames(nor), colnames(se1)) 
norsel = nor["cg22748573",kp]
se1sel = se1["cg22748573",kp]
nornum = as.matrix(assay(norsel))
se1num = as.matrix(assay(se1sel))
t.test(se1num-nornum)
plot(as.numeric(nornum), as.numeric(se1num), ylim=c(0,.5), xlim=c(0,.5))
abline(0,1)

Docker and CWL/WDL with Dockstore.org (10min)

dockstore.org implements a GA4GH concept of reproducible research by connecting a CWL/WDL workflow program to a docker container that is endowed with everything needed to execute the workflow.

An example workflow is available, that [annotates variants observed in a given Coriell cell line sample] (https://dockstore.org/workflows/github.com/vjcitn/vardemo/AnnotatingWGSVariantsWithBioc:master?tab=info).

A WDL program

The main components are workflow and task. Tasks are called within the workflow element. In this example the task consists of a command, and declarations of output and runtime.

workflow task1 {
  call doVariantWorkflow { }
}

task doVariantWorkflow {
  command {
    R -e "BiocManager::install('variants', version = '3.9', update=TRUE, ask=FALSE); \
        library('variants'); \
        file <- system.file('vcf', 'NA06985_17.vcf.gz', package = 'cgdv17'); \
        genesym <- 'ORMDL3'; \
        geneid <- select(org.Hs.eg.db, keys=genesym, keytype='SYMBOL', \
                 columns='ENTREZID'); \
        txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene; \
        seqlevelsStyle(txdb) = 'NCBI'; \
        txdb = keepStandardChromosomes(txdb); \
        txdb <- keepSeqlevels(txdb, '17'); \
        txbygene = transcriptsBy(txdb, 'gene'); \
        gnrng <- unlist(range(txbygene[geneid[['ENTREZID']]]), use.names=FALSE); \
        names(gnrng) <- geneid[['SYMBOL']]; \
        param <- ScanVcfParam(which = gnrng+20000, info = 'DP', geno = c('GT', 'cPd')); \
        vcf <- readVcf(file, 'hg19', param); \
        seqlevels(vcf)[25] = 'MT'; \
        ans = locateVariants(vcf, txdb, AllVariants()); \
        table(mcols(ans)[['LOCATION']]); \
        names(ans) = make.names(names(ans), unique=TRUE); \
        ans = as.data.frame(ans); \
        rownames(ans) = make.names(rownames(ans), unique=TRUE); \
                write.csv(ans, 'trpvar.csv');"
  }
  output {
      File out1 = "trpvar.csv"
  }
  runtime {
    disks: "local-disk 40 HDD"
    bootDiskSizeGb: 50
    docker: "waldronlab/bioconductor_devel"
    }
}

Note that the dockstore includes capacity to easily submit the workflow to a hosted cloud computing service.