Cloud-scale genomic data science with Bioconductor
Abstract
Bioconductor’s approach to the analysis of genome-scale assays is rooted in commitments to the use of self-describing data objects representing genomic assays and annotation. Analysis tools and workflows based on these objects have proven effective in a large number of scientific projects and publications.
The dominant model for utilization of Bioconductor to date involves a locally controlled deployment of R and Bioconductor/CRAN packages in an essentially closed storage and execution environment.
New approaches to federated elastic computing with lab-resident or commercial cloud environments provide opportunities for inference on questions of vast scope. This workshop is devoted to understanding how to leverage Bioconductor’s strengths in seizing these new opportunities. Special attention is devoted to how programming and reporting patterns familiar from two decades of Bioconductor development and use can be retained, or must change, in cloud-scale genomic data science.
Our approach will be a mix of lecture and hands-on programming with Rstudio Cloud. We will learn how the restfulSE and BiocOncoTK packages work with HDF Scalable Data Service and Google BigQuery to provide immediate interactive access to a compendium of 181000 human transcriptomics experiments, and to the PanCancer Atlas. We will also learn how to couple Docker containers with formal workflows in CWL and WDL to achieve sharable reproducible analyses with nearly zero configuration.
Pre-requisites
- Basic knowledge of R syntax
- Familiarity with the SummarizedExperiment class
- Familiarity with one or more of TCGA, GTEx, BigQuery
- Familiarity with docker containers is not required but a running docker installation will be useful
Workshop Participation
Students should have a laptop and be prepared to execute specific commands to load packages and evaluate functions. It will be helpful to have a Google identity that may be necessary to work with BigQuery.
R / Bioconductor packages used
DelayedArray, restfulSE, rhdf5client, BiocOncoTK, htxcomp (github/vjcitn), TxRegInfra
Time outline
Approximate timings for sections of workshop
Activity | Time |
---|---|
Review of Bioconductor software and data structures | 10m |
DelayedArray concepts | 5m |
Exercises with htxcomp and the HDF Scalable Data Service | 10m |
Exercises with PanCancer Atlas and Google BigQuery | 10m |
Docker and CWL/WDL with Dockstore.org | 10m |
Workshop goals and objectives
Goals:
Develop an appreciation of strengths and limitations of Bioconductor’s approach to structure and annotation of genome-scale data as scope of data grows to cloud scale
Learn about alternatives to “all-in-memory” models of computing in R, and how Bioconductor has used such alternatives in the local computing model (e.g., external SQLite databases, local HDF5 serialization, API to remote services)
Obtain experience using Bioconductor methods and tools with data and annotation that are cloud-scale
Develop an appreciation of threats to reliability and predictable costs that arise when working with commercial cloud computing
Objectives:
Use rhdf5client to interact with matrix data in HDF Scalable Data Service
Use BiocOncoTK to interrogate multiomic PanCancer atlas data in Google BigQuery
Understand the role of Docker containers and formal workflow expression in establishing reproducible and shareable large scale analyses
Course activities
Review of Bioconductor software and data structures (10min)
Bioconductor uses an approach to object-oriented programming to help control complexity of programming in the domain of genome-scale data science.
We will illustrate the basic ideas by using a catalog of features of the human genome.
The S4 classes EnsDb
and GRanges
library(EnsDb.Hsapiens.v79)
EnsDb.Hsapiens.v79
## EnsDb for Ensembl:
## |Backend: SQLite
## |Db type: EnsDb
## |Type of Gene ID: Ensembl Gene ID
## |Supporting package: ensembldb
## |Db created by: ensembldb package from Bioconductor
## |script_version: 0.3.0
## |Creation time: Thu May 18 12:42:51 2017
## |ensembl_version: 79
## |ensembl_host: localhost
## |Organism: homo_sapiens
## |taxonomy_id: 9606
## |genome_build: GRCh38
## |DBSCHEMAVERSION: 2.0
## | No. of genes: 65774.
## | No. of transcripts: 214285.
## |Protein data available.
The object named EnsDb.Hsapiens.v79
mediates access to
a SQLite database that contains information on the
Ensembl definitions of genes for reference build
hg38. This object is an instance of a class:
class(EnsDb.Hsapiens.v79)
## [1] "EnsDb"
## attr(,"package")
## [1] "ensembldb"
Formal methods are defined for this class:
methods(class=class(EnsDb.Hsapiens.v79))
## [1] activeFilter addFilter
## [3] cdsBy columns
## [5] convertFilter dbconn
## [7] disjointExons dropFilter
## [9] ensemblVersion exons
## [11] exonsBy exonsByOverlaps
## [13] fiveUTRsByTranscript genes
## [15] getGeneRegionTrackForGviz getGenomeFaFile
## [17] getGenomeTwoBitFile hasProteinData
## [19] initialize intronsByTranscript
## [21] keys keytypes
## [23] lengthOf listColumns
## [25] listGenebiotypes listTables
## [27] listTxbiotypes listUniprotDbs
## [29] listUniprotMappingTypes mapIds
## [31] metadata organism
## [33] promoters proteins
## [35] returnFilterColumns returnFilterColumns<-
## [37] select seqinfo
## [39] seqlevels seqlevelsStyle
## [41] seqlevelsStyle<- show
## [43] supportedFilters supportedSeqlevelsStyles
## [45] threeUTRsByTranscript transcripts
## [47] transcriptsBy transcriptsByOverlaps
## [49] updateEnsDb useMySQL
## see '?methods' for accessing help and source code
Let’s try the genes
method.
genes79 = genes(EnsDb.Hsapiens.v79)
genome(genes79)[1]
## 1
## "GRCh38"
class(genes79)
## [1] "GRanges"
## attr(,"package")
## [1] "GenomicRanges"
head(genes79[,1:3])
## GRanges object with 6 ranges and 3 metadata columns:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## ENSG00000223972 1 11869-14409 + | ENSG00000223972
## ENSG00000227232 1 14404-29570 - | ENSG00000227232
## ENSG00000278267 1 17369-17436 - | ENSG00000278267
## ENSG00000243485 1 29554-31109 + | ENSG00000243485
## ENSG00000274890 1 30366-30503 + | ENSG00000274890
## ENSG00000237613 1 34554-36081 - | ENSG00000237613
## gene_name gene_biotype
## <character> <character>
## ENSG00000223972 DDX11L1 transcribed_unprocessed_pseudogene
## ENSG00000227232 WASH7P unprocessed_pseudogene
## ENSG00000278267 MIR6859-3 miRNA
## ENSG00000243485 RP11-34P13.3 lincRNA
## ENSG00000274890 MIR1302-9 miRNA
## ENSG00000237613 FAM138A lincRNA
## -------
## seqinfo: 319 sequences from GRCh38 genome
Basic R language elements are given new
meaning when applied to structures like GRanges
.
Here we use $
to obtain one of the fields
of metadata about genes. The operation returns
a vector that we summarize using basic R
functions.
sort(table(genes79$gene_biotype),decreasing=TRUE)[1:6]
##
## protein_coding processed_pseudogene lincRNA
## 22002 10702 7839
## antisense miRNA unprocessed_pseudogene
## 5648 4548 3137
dplyr
idioms can be used with some help (and this may
become more straightforward over time):
library(dplyr)
library(magrittr)
as.data.frame(mcols(genes79)) %>%
select(gene_name, gene_biotype) %>%
filter(gene_biotype == "ribozyme")
## gene_name gene_biotype
## 1 CPEB3_ribozyme ribozyme
## 2 Hammerhead_HH10 ribozyme
## 3 CoTC_ribozyme ribozyme
## 4 RNaseP_nuc ribozyme
## 5 RPPH1-2P ribozyme
## 6 RPPH1-3P ribozyme
## 7 RMRPP5 ribozyme
## 8 RNase_MRP ribozyme
tibble
representations have pleasant summaries:
library(tibble)
as.tibble(genes79)
## Warning: `as.tibble()` is deprecated, use `as_tibble()` (but mind the new semantics).
## This warning is displayed once per session.
## # A tibble: 65,774 x 11
## seqnames start end width strand gene_id gene_name gene_biotype
## <fct> <int> <int> <int> <fct> <chr> <chr> <chr>
## 1 1 11869 14409 2541 + ENSG00… DDX11L1 transcribed…
## 2 1 14404 29570 15167 - ENSG00… WASH7P unprocessed…
## 3 1 17369 17436 68 - ENSG00… MIR6859-3 miRNA
## 4 1 29554 31109 1556 + ENSG00… RP11-34P… lincRNA
## 5 1 30366 30503 138 + ENSG00… MIR1302-9 miRNA
## 6 1 34554 36081 1528 - ENSG00… FAM138A lincRNA
## 7 1 52473 53312 840 + ENSG00… OR4G4P unprocessed…
## 8 1 62948 63887 940 + ENSG00… OR4G11P unprocessed…
## 9 1 69091 70008 918 + ENSG00… OR4F5 protein_cod…
## 10 1 89295 133723 44429 - ENSG00… RP11-34P… lincRNA
## # … with 65,764 more rows, and 3 more variables: seq_coord_system <chr>,
## # symbol <chr>, entrezid <I<list>>
Finding and visualizing genomic elements in a specified chromosomal region
GRanges
are easy to construct and can be used to
query genomes. Here we deal with three problems related
to genome region specification. We start with a
region specified using coordinates from
reference build hg19 (GRCh37). We use UCSC’s
liftOver utility to convert to GRCh38 (hg38). We
finish by converting the chromosome annotation to
that used by Ensembl.
myRange = GRanges("chr10", IRanges(37.45e6, 37.8e6))
myRange
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr10 37450000-37800000 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
library(rtracklayer)
ch = import.chain("hg19ToHg38.over.chain")
myr38 = liftOver(myRange, ch)[[1]]
library(GenomeInfoDb)
genome(myr38) = "GRCh38"
seqlevelsStyle(myr38) = "Ensembl"
myr38
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] 10 37161072-37511072 *
## -------
## seqinfo: 1 sequence from GRCh38 genome; no seqlengths
It is easy to find ‘gene-level’ elements using subsetByOverlaps
:
els = subsetByOverlaps(genes79, myr38)
as.tibble(els)
## # A tibble: 8 x 11
## seqnames start end width strand gene_id gene_name gene_biotype
## <fct> <int> <int> <int> <fct> <chr> <chr> <chr>
## 1 10 3.71e7 3.74e7 258324 + ENSG00… ANKRD30A protein_cod…
## 2 10 3.72e7 3.72e7 1163 + ENSG00… RP11-20F… lincRNA
## 3 10 3.72e7 3.73e7 67684 + ENSG00… ATP8A2P1 transcribed…
## 4 10 3.73e7 3.73e7 292 + ENSG00… RN7SL314P misc_RNA
## 5 10 3.73e7 3.73e7 110 - ENSG00… Y_RNA misc_RNA
## 6 10 3.73e7 3.73e7 37847 + ENSG00… LINC00993 lincRNA
## 7 10 3.73e7 3.73e7 1457 + ENSG00… TMEM161B… processed_p…
## 8 10 3.73e7 3.73e7 957 + ENSG00… VN1R53P processed_p…
## # … with 3 more variables: seq_coord_system <chr>, symbol <chr>,
## # entrezid <I<list>>
To get a quick view of the layout of these elements on their chromosome, we can use Gviz:
library(survival)
library(Gviz)
options(ucscChromosomeNames=FALSE)
plotTracks(list(GenomeAxisTrack(),
GeneRegionTrack(els, gene=els$symbol)), showId=TRUE)
There are many options available to enhance this display. An important task for this region would be to distinguish exons and introns. See the Gviz vignette for details.
DelayedArray concepts (5min)
Our interest in cloud-scale computing methods arises in part from desire to analyze very large data sets with computers that have relatively small endowments of random access memory. The most common methods of working with R assume that all data are resident in memory and can be addressed directly.
Acquisition of breast cancer RNA-seq data from TCGA
In this section we will construct two representations of
RNA-seq data for breast cancer tumors collected in TCGA.
First we use curatedTCGAData
to obtain a standard in-memory
representation in a SummarizedExperiment.
library(curatedTCGAData)
brMAE = curatedTCGAData("BRCA", "RNASeq2GeneNorm", dry.run=FALSE)
brexp = experiments(brMAE)[[1]] # should use name
colnames(brexp) = substr(colnames(brexp),1,12) # need to shorten
cd = colData(brMAE)[colnames(brexp),]
colData(brexp) = cd
brexp
## class: SummarizedExperiment
## dim: 20501 1212
## metadata(0):
## assays(1): ''
## rownames(20501): A1BG A1CF ... psiTPTE22 tAKR
## rowData names(0):
## colnames(1212): TCGA-3C-AAAU TCGA-3C-AALI ... TCGA-Z7-A8R5
## TCGA-Z7-A8R6
## colData names(2684): patientID years_to_birth ...
## Integrated.Clusters..unsup.exp.
## X60.Gene.classifier.Class.Assignment
We estimate its size in RAM.
library(SummarizedExperiment) # assay
object.size(assay(brexp))
## 200190840 bytes
This is not particularly large, but in certain applications it is advantageous to have tight control over the memory consumption required for an analysis. In this form we have no choice – either the complete dataset is in memory or it is not and cannot be accessed without loading it in its entirety.
Exporting the expression data to HDF5
Now we develop a representation on disk. We use the HDF5 data format as it is well-established as a tool for managing numerical data in scientific computing.
library(HDF5Array)
saveHDF5SummarizedExperiment(brexp, "brexpHDF5", replace=TRUE)
We named the HDF5 repository for the data brexpHDF5
; this
is in fact a folder created in the current working directory.
Its contents are a SummarizedExperiment ‘shell’ in RDS format,
and the HDF5 matrix representation of the RNA-seq quantifications.
dir("brexpHDF5")
## [1] "assays.h5" "se.rds"
Using the on-disk HDF5 representation
We use a loading function to retrieve the new SummarizedExperiment instance. Its memory consumption is independent of the dimensions of the assay matrix.
brextern = loadHDF5SummarizedExperiment("brexpHDF5")
assay(brextern)
## <20501 x 1212> DelayedMatrix object of type "double":
## TCGA-3C-AAAU TCGA-3C-AALI ... TCGA-Z7-A8R5 TCGA-Z7-A8R6
## A1BG 197.0897 237.3844 . 439.5425 248.3271
## A1CF 0.0000 0.0000 . 0.0000 0.0000
## A2BP1 0.0000 0.0000 . 0.5973 0.0000
## A2LD1 102.9634 70.8646 . 81.3010 25.1866
## A2ML1 1.3786 4.3502 . 2.3893 4.1757
## ... . . . . .
## ZYX 3507.2482 5504.6221 . 6675.6264 3402.5228
## ZZEF1 1894.9342 1318.6514 . 754.4127 564.4193
## ZZZ3 1180.4565 406.7428 . 750.8288 462.1140
## psiTPTE22 1.7233 926.5905 . 238.9272 20.8786
## tAKR 0.0000 0.0000 . 0.0000 0.0000
object.size(assay(brextern))
## 1415792 bytes
Targeted queries to the HDF5 store are rapidly resolved.
par(las=2, mar=c(18,4,3,3))
boxplot(split(log(as.numeric(assay(brextern["BRCA2",])+1)),
brexp$histological_type), ylab="log BRCA2")
But there is a price to pay.
library(microbenchmark)
microbenchmark(
split(log(as.numeric(assay(brextern["BRCA2", ]) + 1)),
brexp$histological_type),
times=5)
## Unit: milliseconds
## expr
## split(log(as.numeric(assay(brextern["BRCA2", ]) + 1)), brexp$histological_type)
## min lq mean median uq max neval
## 331.8319 526.5889 533.4987 574.0303 603.4796 631.5628 5
Compare to the in memory
benchmark:
microbenchmark(
split(log(as.numeric(assay(brexp["BRCA2", ]) + 1)),
brexp$histological_type),
times=5)
## Unit: milliseconds
## expr
## split(log(as.numeric(assay(brexp["BRCA2", ]) + 1)), brexp$histological_type)
## min lq mean median uq max neval
## 14.05118 14.3116 14.31203 14.34153 14.39318 14.46265 5
Exercises with the HumanTranscriptomeCompendium and the HDF Scalable Data Service (10min)
The HumanTranscriptomeCompendium package was formed to take advantage of two new technologies.
The first is the OmicIDX metadata transformation and access facility, devised by Dr Sean Davis of NCI. This currently provides comprehensive access to metadata about sequencing studies collected at NCBI SRA, but will be extended to other institutional metadata archives.
The second is HDF Scalable Data Service, devised by John Readey of The HDF Group. Substantial support has been provided to Bioconductor by HDF Group, with open hosting of significant genomic data archives in a publicly accessible server.
Fruits of both of these developments are used in the following exercises.
Command-line work with HumanTranscriptomeCompendium
library(HumanTranscriptomeCompendium)
htxSE = htx_load()
## Using temporary cache /tmp/RtmpTY9jSI/BiocFileCache
## adding RDS to local cache, future invocations will use local image
## adding rname 'https://s3.amazonaws.com/bcfound-bigrna/rangedHtxGeneSE.rds'
htxSE
## class: RangedSummarizedExperiment
## dim: 58288 181134
## metadata(1): rangeSource
## assays(1): counts
## rownames(58288): ENSG00000000003.14 ENSG00000000005.5 ...
## ENSG00000284747.1 ENSG00000284748.1
## rowData names(0):
## colnames(181134): DRX001125 DRX001126 ... SRX999990 SRX999991
## colData names(4): experiment_accession experiment_platform
## study_accession study_title
There’s a little extra annotation to add.
htxSE = addRD(htxSE)
library(SummarizedExperiment)
head(rowData(htxSE))
## DataFrame with 6 rows and 4 columns
## gene_type gene_id gene_name
## <character> <character> <character>
## ENSG00000000003.14 protein_coding ENSG00000000003.14 TSPAN6
## ENSG00000000005.5 protein_coding ENSG00000000005.5 TNMD
## ENSG00000000419.12 protein_coding ENSG00000000419.12 DPM1
## ENSG00000000457.13 protein_coding ENSG00000000457.13 SCYL3
## ENSG00000000460.16 protein_coding ENSG00000000460.16 C1orf112
## ENSG00000000938.12 protein_coding ENSG00000000938.12 FGR
## havana_gene
## <character>
## ENSG00000000003.14 OTTHUMG00000022002.1
## ENSG00000000005.5 OTTHUMG00000022001.1
## ENSG00000000419.12 OTTHUMG00000032742.2
## ENSG00000000457.13 OTTHUMG00000035941.5
## ENSG00000000460.16 OTTHUMG00000035821.8
## ENSG00000000938.12 OTTHUMG00000003516.2
We will make a searchable table of study titles and sizes.
ad = as.data.frame(colData(htxSE))
library(dplyr)
library(magrittr)
library(DT)
studtab = (ad %>% select(study_title, study_accession) %>% group_by(study_title, study_accession) %>% summarise(n=n()) %>% as.data.frame())
datatable(studtab)
Search for zika in this table.
zikp = htxSE[, which(htxSE$study_accession == "SRP075248")]
assay(zikp)
## <58288 x 1598> DelayedMatrix object of type "double":
## SRX1767663 SRX1767664 ... SRX1769277 SRX1769278
## ENSG00000000003.14 94.662016 0.000000 . 623.854770 627.757554
## ENSG00000000005.5 0.000000 0.000000 . 0.000000 0.000000
## ENSG00000000419.12 2.999995 0.000000 . 101.000070 50.000247
## ENSG00000000457.13 24.025096 6.628778 . 45.372716 0.000000
## ENSG00000000460.16 0.000000 12.376506 . 4.000001 39.000084
## ... . . . . .
## ENSG00000284744.1 1.004693 2.999999 . 0 0
## ENSG00000284745.1 0.000000 0.000000 . 0 0
## ENSG00000284746.1 0.000000 0.000000 . 0 0
## ENSG00000284747.1 2.844749 0.000000 . 0 0
## ENSG00000284748.1 0.000000 0.000000 . 0 0
To acquire additional metadata on this study, you could use SRAdbV2. We will skip that for now.
The ca43k app
We have a shiny app that mediates access to cancer-related studies in the compendium. Using the following command starts the app. Clear the initial tabs and cart elements. Search for the string ‘lncRNAs’ and then search the resulting table for ‘archived’. Then stop the app.
cadat = ca43k()
Now metadata(cadat)[[2]]
is a data.frame that gives sample-level
metadata on the samples that were retrieved. How many different
tissues were assayed?
Have a look at the paper referenced in the associated PMID. Find the lncRNA that the authors assert is specific to breast and display its distribution of abundance across tissues.
Exercises with PanCancer Atlas and Google BigQuery (10min)
The PanCancer Atlas builds on TCGA by including a number of matched normal tissue samples subjected to many of the same assay processes as the tumor tissues.
In this section we’ll create SummarizedExperiment instances for 450k methylation assays in bladder cancer.
Some background on PanCancer Atlas data in the ISB Cancer Genomics Cloud project
The names of resources available for PanCancer atlas are somewhat unwieldy. We’ve created abbreviations.
library(BiocOncoTK)
anndf = data.frame(abbr=names(
BiocOncoTK::annotTabs), tabname=as.character(BiocOncoTK::annotTabs))
datatable(anndf)
In this section you must have environment variable
CGC_BILLING
set to a valid google cloud platform
billing account.
Creating a ‘RESTful’ SummarizedExperiment with BigQuery references for assay data
library(BiocOncoTK)
bq = pancan_BQ()
bq@quiet = TRUE
se1 = buildPancanSE(bq)
se1
## class: RangedSummarizedExperiment
## dim: 396065 409
## metadata(3): acronym assay sampType
## assays(1): assay
## rownames(396065): cg00000029 cg00000165 ... rs966367 rs9839873
## rowData names(3): gene_id gene_name gene_biotype
## colnames(409): TCGA-FD-A3SN TCGA-FD-A5BV ... TCGA-UY-A78L
## TCGA-ZF-AA4T
## colData names(20): bcr_patient_uuid bcr_patient_barcode ...
## radiation_therapy race
assay(se1)
## <396065 x 409> DelayedMatrix object of type "double":
## TCGA-FD-A3SN TCGA-FD-A5BV ... TCGA-UY-A78L TCGA-ZF-AA4T
## cg00000029 0.0954445 0.0687728 . 0.162314 0.172125
## cg00000165 0.8368510 0.1360180 . 0.623998 0.387653
## cg00000236 0.9393230 0.9227600 . 0.873759 0.904457
## cg00000289 0.7025300 0.7373440 . 0.641360 0.609875
## cg00000292 0.7530310 0.8062190 . 0.729177 0.505125
## ... . . . . .
## rs9363764 0.9462190 0.9424980 . 0.4760480 0.0337384
## rs939290 0.0228799 0.9769100 . 0.9669960 0.9753450
## rs951295 0.3357700 0.0386658 . 0.5140070 0.7081640
## rs966367 0.9464320 0.9492910 . 0.8572390 0.4777650
## rs9839873 0.9548290 0.2012180 . 0.9278950 0.9552880
Comparing methylation levels between tumor and matched normal tissues at a specific CpG probe
In PMID 29540343 it is noted that increased methylation at CpG probe cg22748573 (within exon 1 of CITED4) is associated with decreased risk of bladder cancer. We’ll use tumor-normal pairs to check for differential methylation in tumor tissue at this site.
nor = buildPancanSE(bq, sampType="NT")
nor
kp = intersect(colnames(nor), colnames(se1))
norsel = nor["cg22748573",kp]
se1sel = se1["cg22748573",kp]
nornum = as.matrix(assay(norsel))
se1num = as.matrix(assay(se1sel))
t.test(se1num-nornum)
plot(as.numeric(nornum), as.numeric(se1num), ylim=c(0,.5), xlim=c(0,.5))
abline(0,1)
Docker and CWL/WDL with Dockstore.org (10min)
dockstore.org implements a GA4GH concept of reproducible research by connecting a CWL/WDL workflow program to a docker container that is endowed with everything needed to execute the workflow.
An example workflow is available, that [annotates variants observed in a given Coriell cell line sample] (https://dockstore.org/workflows/github.com/vjcitn/vardemo/AnnotatingWGSVariantsWithBioc:master?tab=info).
A WDL program
The main components are workflow
and task
. Tasks are
called within the workflow element. In this example
the task consists of a command, and declarations of
output and runtime.
workflow task1 {
call doVariantWorkflow { }
}
task doVariantWorkflow {
command {
R -e "BiocManager::install('variants', version = '3.9', update=TRUE, ask=FALSE); \
library('variants'); \
file <- system.file('vcf', 'NA06985_17.vcf.gz', package = 'cgdv17'); \
genesym <- 'ORMDL3'; \
geneid <- select(org.Hs.eg.db, keys=genesym, keytype='SYMBOL', \
columns='ENTREZID'); \
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene; \
seqlevelsStyle(txdb) = 'NCBI'; \
txdb = keepStandardChromosomes(txdb); \
txdb <- keepSeqlevels(txdb, '17'); \
txbygene = transcriptsBy(txdb, 'gene'); \
gnrng <- unlist(range(txbygene[geneid[['ENTREZID']]]), use.names=FALSE); \
names(gnrng) <- geneid[['SYMBOL']]; \
param <- ScanVcfParam(which = gnrng+20000, info = 'DP', geno = c('GT', 'cPd')); \
vcf <- readVcf(file, 'hg19', param); \
seqlevels(vcf)[25] = 'MT'; \
ans = locateVariants(vcf, txdb, AllVariants()); \
table(mcols(ans)[['LOCATION']]); \
names(ans) = make.names(names(ans), unique=TRUE); \
ans = as.data.frame(ans); \
rownames(ans) = make.names(rownames(ans), unique=TRUE); \
write.csv(ans, 'trpvar.csv');"
}
output {
File out1 = "trpvar.csv"
}
runtime {
disks: "local-disk 40 HDD"
bootDiskSizeGb: 50
docker: "waldronlab/bioconductor_devel"
}
}
Note that the dockstore includes capacity to easily submit the workflow to a hosted cloud computing service.