Bioconductor on Containers
Instructor(s) name(s) and contact information
Nitesh Turaga* nitesh.turaga@roswellpark.org (Github: nturaga)
Schedule
https://bioc2019.bioconductor.org/schedule-developer-day
11:00 - 12:00 – Workshops I Track 1: Bioconductor on Containers - Beginners
12:00 - 1:00 - Lunch
1:00 - 2:00 - Workshop II Track 1: Bioconductor on Containers - Advanced
Pre-requisites
Basic knowledge of R and how to install packages the Bioconductor way (using BiocManager::install()).
IMPORTANT: Docker needs to be installed on your local machine. Please install Docker ahead of time. If you are planning to attend this workshop and are having issues installing Docker, please email me ahead of time and I will help you through it. Please do not attend the workshop without Docker installed on your machine.
To test your installation of Docker, please run on your command line,
$ docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
ca4f61b1923c: Pull complete
Digest: sha256:ca0eeb6fb05351dfc8759c20733c91def84cb8007aa89a5bf606bc8b315b9fc7
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.
IMPORTANT: After installing Docker, be sure to pull (download) the image we will use in class. This is a pretty big download, so make sure to run it before the workshop starts; otherwise there will be some lag. Run the command,
$ docker pull bioconductor/release_base2:latest
Some familiarity with HPC infrastructure is helpful but only relevant to one topic, which covers Singularity on HPC infrastructure.
R / Bioconductor packages used
Beginners section - Topics
Activity | Time |
---|---|
Why Containers & Why Docker | 10m |
Bioconductor on Docker | 10m |
Flavors of Bioconductor | 10m |
Which flavor to use | 10m |
How to mount volumes | 10m |
Questions | Please interrupt me!! |
Who is this for?
This workshop is for beginners. You don’t need any prior experience with Docker or container technology. In this course participants are expected to launch Docker images on their local machines, mount volumes, and set ports as needed to use RStudio.
Since it’s a beginner-level course, please note that we will start at a basic level. The next part of the course, the advanced section, will give you deeper insight into how container technology can be used on the cloud, with some demonstrations.
Why Containers and Why Docker
A container is a completely isolated environment. Containers can have their own processes, services, network interfaces, and volume mounts, just like a virtual machine. They differ from VMs in a single aspect: they share the host OS kernel. Containers are running instances of images.
Docker is simply the most popular container technology in use today. An image is a static (fixed) template; a container is a running instance created from that image.
The main reason for using containers: as a user, you want to delegate the problems of software installation to the developer (in this case, Bioconductor). Bioconductor gives you static docker images which solve the issue of software installation on your infrastructure.
The total disk space used by all running containers is a combination of each container’s writable-layer size and the shared image’s virtual size. If multiple containers are started from the same image, the total size on disk is the sum of the containers’ writable-layer sizes plus one copy of the image’s virtual size. Be careful and monitor the size of your containers, as it can grow very quickly. Read more about container size on disk.
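You can check these numbers yourself with standard Docker commands, for example:
## Per-container disk usage: SIZE is the writable layer,
## with the shared image's virtual size in parentheses
docker ps -s
## Summary of space used by images, containers, and volumes
docker system df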
Docker warmup
Get a list of docker images
docker images
Let’s pull (get) a specific image, bioconductor/release_base2:latest
docker pull bioconductor/release_base2:latest
Run the image to start a container,
docker run -p 8787:8787 -e PASSWORD=bioc bioconductor/release_base2:latest
where -p maps the host port to the container port (8787 is the port RStudio listens on), and -e sets an environment variable (the RStudio login password). You can then open localhost:8787 in a browser and log in with username rstudio and password bioc.
Open a new terminal and check that your container is running,
docker ps
Kill the container
docker kill <container_id>
Remove the image (you don’t want to do this one right now)
docker rmi bioconductor/release_base2:latest
Interactively running docker images,
## For bash
docker run -it bioconductor/release_base2 bash
## For R directly
docker run -it bioconductor/release_base2 R
Interactively run containers,
## Get the container ID
docker ps
You can do this only if the container with that ID is currently running,
docker exec -it <container ID> bash
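Putting the warmup commands together, a minimal sketch of the workflow:
## Start a container in the background (detached)
docker run -d -e PASSWORD=bioc -p 8787:8787 bioconductor/release_base2:latest
## Note the container ID
docker ps
## Attach an interactive bash shell to the running container
docker exec -it <container ID> bash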
Bioconductor on Docker
In this section we will discuss what flavors are available to users, and how to build on these images.
Bioconductor provides Docker images in a few ways:
bioc_docker images
These bioconductor images are hosted on Docker Hub. The bioconductor docker page has information and instructions on how to use them. The Dockerfile(s) for each of these containers are located on GitHub. These images are provided directly by the Bioconductor team for each supported version of Bioconductor and are categorized into,
base2: Contains R, RStudio, and a single Bioconductor package (BiocManager, providing the install() function for installing additional packages). Also contains many system dependencies for Bioconductor packages.

core2: Contains everything in base2, plus a selection of core Bioconductor packages: BiocManager, OrganismDbi, ExperimentHub, Biobase, BiocParallel, biomaRt, Biostrings, BSgenome, ShortRead, IRanges, GenomicRanges, GenomicAlignments, GenomicFeatures, SummarizedExperiment, VariantAnnotation, DelayedArray, GSEABase, Gviz, graph, RBGL, Rgraphviz, rmarkdown, httr, knitr, and BiocStyle.

protmetcore2: Contains everything in core2, plus a selection of core Bioconductor packages recommended for proteomics and metabolomics analysis. The Reference section contains the list of packages in protmetcore2.

metabolomics2: Contains everything in protmetcore2, plus select packages from the Metabolomics biocView. The Reference section contains the list of packages in metabolomics2.

cytometry2: Built on base2, so it contains everything in base2, plus CytoML and needed dependencies.
It is possible to build on top of these default containers to create a customized Bioconductor container. This is a very good resource for developers and users who want an image without having to worry about system requirements.
Bioconda
Bioconda provides BioContainers: there is a container for each individual package in Bioconductor, under the “BioContainers” umbrella on Quay.io.
They also provide multi-package containers, which combine multiple bioconda packages into one container. You may choose a combination of Bioconductor packages of your choice and build a container with them. This is a helpful tool.
Please note that these containers only hold the “release” versions of Bioconductor packages; Bioconda does not build or distribute the “devel” version of Bioconductor.
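For illustration, pulling a single-package BioContainer looks like this (the package name is an example and the tag is a placeholder; look up the exact tag on Quay.io):
## Pull the BioContainer for an individual Bioconductor package
docker pull quay.io/biocontainers/bioconductor-limma:<tag>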
bioconductor_full
Bioconductor now provides an image with the complete set of system dependencies. The goal of this image is that the user should be able to install every Bioconductor package (1620+) without having to deal with system dependency installation.
Please note that the Bioconductor packages themselves don’t come pre-installed on this image. It is up to the user to install Bioconductor packages through BiocManager::install().
These bioconductor_full images can be obtained on Docker hub.
docker pull bioconductor/bioconductor_full:devel
docker pull bioconductor/bioconductor_full:RELEASE_3_9
docker pull bioconductor/bioconductor_full:RELEASE_3_8
There are a few exceptions in this list, which are unavoidable because of the way these images are built and provisioned. The list is given in the docs/issues.md file on the GitHub page for these images. (We try to keep that document as up to date as possible, but if you find that a package listed there actually installs, send us a pull request.)
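A typical workflow with this image is to start R and install what you need (a minimal sketch; the package chosen here is arbitrary):
## Start R inside the bioconductor_full image
docker run -it bioconductor/bioconductor_full:devel R
## Then, at the R prompt:
## > BiocManager::install('DESeq2')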
Flavors of Bioconductor
The flavor you “need” for your work should be based on the packages you want for your analysis. There is a list of packages for each flavor that you should evaluate before using the container.
For most beginners a good place to start is the bioconductor/release_core2 docker container, which includes all the “core” Bioconductor packages.
How to extend a docker image
There are two ways:
Reproducible way:
Create a new folder my_image. Inside that folder, in a new file called Dockerfile, use the Bioconductor image most suited to your needs as your base:
## Inherit from the bioconductor/release_base2 image
FROM bioconductor/release_base2:latest
## Add packages
RUN R -e "BiocManager::install('<package_name>')"
Then build the new docker image,
docker build -t my_image:v1 my_image/
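You can then run your customized image exactly like the stock Bioconductor images, for example:
## Run the customized image with RStudio
docker run -e PASSWORD=bioc -p 8787:8787 my_image:v1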
The interactive way:
Run the image interactively, with RStudio
docker run -e PASSWORD=bioc -p 8787:8787 bioconductor/release_base2:latest
or with bash
docker run -it bioconductor/release_base2:latest bash
Then, in RStudio at localhost:8787, install a package:
BiocManager::install('BiocGenerics')
In another terminal window, note the container ID
docker ps
Save the container state (it’s a good idea to “tag” your image),
docker commit <container ID> my_image:1.0
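The committed image behaves like any other; for example:
## Confirm the new image exists
docker images my_image
## Run it; packages installed before the commit are preserved
docker run -e PASSWORD=bioc -p 8787:8787 my_image:1.0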
Mount volumes to your Docker container
It is sometimes useful to mount a volume containing either data or a library of installed packages.
You can bind as many volumes as needed with the -v / --volume option. The format is
-v /host/path:/container/path
To share data with your docker container, run the container with the command,
docker run \
    -v /data:/home/rstudio/data \
    -e PASSWORD=bioc \
    -p 8787:8787 \
    bioconductor/release_base2
To share an R package library with your Docker container, bioconductor provides a mechanism so that the R running within your docker image can see packages installed on your host machine:
-v /shared-BioC-3.9:/usr/local/lib/R/host-site-library
The location /usr/local/lib/R/host-site-library has special meaning: the configuration within the bioconductor container allows the user to share built packages. The path on the docker image that should be mapped to a local R library directory is /usr/local/lib/R/host-site-library.
Create a directory on your host machine (whichever path suits you best),
mkdir /shared-BioC-3.9
Run the image, sharing that directory for the R package library along with the data:
docker run \
    -v /shared-BioC-3.9:/usr/local/lib/R/host-site-library \
    -v /data:/home/rstudio/data \
    -it bioconductor/release_base2 \
    R
Within R,
BiocManager::install('BiocGenerics')
All the packages installed in that directory will be available immediately the next time R is started from the container. Another big advantage is that this prevents your docker image from growing in size due to package installation and data. It allows users to distribute smaller docker images containing only what is needed.
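As a quick sanity check (a sketch; the exact output varies by image version), you can confirm the shared library is on R's search path:
## host-site-library should appear in the output of .libPaths()
docker run -v /shared-BioC-3.9:/usr/local/lib/R/host-site-library \
    -it bioconductor/release_base2 R -e '.libPaths()'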
Conclusion of Beginners section
Recap of what you have learned in this section:
How to start Bioconductor based docker images.
Where these images are available.
What flavors of Bioconductor images are present.
How to bind/mount volumes on these Bioconductor docker images.
These skills get you started on using Bioconductor on containers.
Advanced section - Topics
Activity | Time |
---|---|
Containers on the cloud | 10m |
Mount volumes on the cloud | 10m |
Localize select packages | 10m |
HPC-Singularity containers | 5m |
Launch and run parallel jobs | 10m |
Questions | 10m+ |
Future work | 5m |
Who is this section for?
The advanced section is for anyone who has some familiarity with how Docker works, and wants to run things on the “Cloud”. This will be a demonstration section, since we are not providing “credits” to use a certain cloud (Google or AWS).
This section aims to give some direction to users who are looking to use Bioconductor on the cloud. We also discuss some future developments, which look promising.
Containers on the cloud
Running your R session on the cloud with RStudio makes you ‘machine independent’. It also lets you access data hosted on the cloud more cheaply and quickly, for example the 1000 Genomes data available on the google cloud: gs://genomics-public-data/1000-genomes.
One of the biggest advantages to cloud machines is the on-demand scalability to run programs.
In this section, there will be a demonstration using the Google cloud infrastructure, but take note that everything you can do on the Google Cloud you can also do with AWS.
Requirements
Google cloud billing account
Pro Tip: Create a test google account (or try the google cloud for the first time from your google account) to get $300 in billing credits for a year. You have to link your credit card to this, so make sure you don’t leave any google cloud resources running. You will go through your $300 USD in credit very quickly if you run large compute instances. Take a look at Google cloud - free.
NOTE: The user is responsible for keeping track of their billing account and any credit cards they use during the workshop.
You should also have taken the beginners section, or at the very least have a basic understanding of Docker. If you missed the earlier section, please refer back to it to fill any gaps.
The Google Cloud SDK needs to be installed and callable from the command line; authenticate with gcloud auth login.
Launch a container on the Google cloud
There are a couple of ways this can be done: one is using a service that the google cloud provides called “Container Engine”; the other is natively using the “Compute Engine” service, installing docker, and launching a container with a specific image stored either on the “Google Container Registry” (Google’s Dockerhub) or Dockerhub itself.
Using the Google Container Engine
Launch a Bioconductor docker image on the Google Cloud using the command line application gcloud.
In this demo, we use bioconductor/release_base2:latest, which is hosted on dockerhub. There are a few additional settings to be aware of in this case, since we’ll be using RStudio as the interface for R.
## Launch the bioconductor/release_base2:latest image
gcloud compute instances create-with-container bioc-rstudio \
    --container-image bioconductor/release_base2:latest \
    --container-restart-policy never \
    --container-env PASSWORD=bioc \
    --tags http-server \
    --machine-type n1-standard-4
Note: see the gcloud compute reference. Also keep in mind that you can launch any image derived from the bioc_docker images in this way, e.g. bioconductor/bioconductor_full:<tag>.
A container running on a cloud instance (read: a computer accessible remotely) needs to be accessible via the HTTP protocol at a specific port (8787 for RStudio). For this we need to create a firewall rule which allows HTTP access at port 8787.
## Create a firewall rule
gcloud compute firewall-rules create allow-http \
    --allow tcp:8787 \
    --target-tags http-server
You can now list your running instances to check whether your container started.
## List running compute instances
gcloud compute instances list
Once you are done using the instance, you might want to stop it. This avoids running up the bill on your Google cloud project; it is easy enough to restart when you are ready.
## Stop an instance
gcloud compute instances stop bioc-rstudio
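When you are ready to pick up where you left off:
## Restart a stopped instance
gcloud compute instances start bioc-rstudio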
NOTE: Terminating/deleting the instance is your responsibility! Go to your google cloud console and make sure to terminate the instance, or run gcloud compute instances delete bioc-rstudio.
You can also log into your VM on the Google cloud and provision any data you need.
## Log in to the VM
gcloud compute ssh bioc-rstudio
For more configuration options of your containers take a look at the Google Cloud documentation to configure containers.
Mount volumes on the cloud (Optional)
There are many publicly available genomics data sets hosted by cloud services; Google cloud genomics datasets are located in the bucket gs://genomics-public-data.
You can mount storage buckets on your VM, and share them with your container, using GCSFUSE. For this you need to choose an image with access to the google cloud and with gcsfuse installed. Then:
Create a directory where you want the google bucket mounted,
mkdir /path/to/mount
Mount a bucket on your google cloud
gcsfuse example-bucket /path/to/mount
Start working with the mounted bucket
ls /path/to/mount
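When you are finished, unmount the bucket (on Linux, gcsfuse mounts are removed with fusermount):
## Unmount the bucket
fusermount -u /path/to/mount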
Localize select packages
For faster installation of Bioconductor packages on cloud-based instances, we provide, as a demonstration ONLY, Google storage buckets with pre-compiled versions of all the software packages (1600+).
Bioconductor provides the entire pre-compiled binary libraries for the following versions of R and Bioconductor. These are publicly accessible buckets.
Bioconductor Version | R version | Google Bucket Name | Link |
---|---|---|---|
Bioc 3.10 | R devel | gs://bioconductor-full-devel | https://console.cloud.google.com/storage/browser/bioconductor-full-devel |
Bioc 3.9 | R 3.6.0 | gs://bioconductor-full-release-3-9 | https://console.cloud.google.com/storage/browser/bioconductor-full-release-3-9 |
Bioc 3.8 | R 3.5.0 | gs://bioconductor-full-release-3-8 | https://console.cloud.google.com/storage/browser/bioconductor-full-release-3-8 |
NOTE: these are “Requester Pays” buckets, and the requester is you. When you access these buckets to download packages onto your google cloud instance, your billing project gets charged.
To copy packages over to the instance using the command line:
## Get the name of your billing project
gcloud config list project
## Copy packages over to your local directory
gsutil \
-u <name-of-billing-project> \
-m \
cp -r gs://bioconductor-full-devel/<package_name> <local_directory>
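Since the buckets hold the pre-compiled libraries, a copied package can be dropped directly into a library location that R searches; for example (an illustration, reusing the shared host-site-library location from the beginners section):
## Copy a pre-built package straight into the shared library
gsutil -u <name-of-billing-project> -m \
    cp -r gs://bioconductor-full-devel/GenomicRanges \
    /usr/local/lib/R/host-site-library/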
We also provide a very helpful package (which is part of another
project) called AnVIL
. With the help of the AnVIL::install()
function, users are able to localize packages from buckets and install
them on containers at high speed without having to wait for
compilation.
## Launch a bioconductor_full:RELEASE_3_9 image
gcloud compute instances create-with-container bioc-release \
--container-image bioconductor/bioconductor_full:RELEASE_3_9 \
--container-restart-policy never \
--container-env PASSWORD=bioc \
--tags http-server
## Log in to your RStudio session
## username: rstudio, password: bioc
## Install the AnVIL package
BiocManager::install('Bioconductor/AnVIL_rapiclient')
BiocManager::install('Bioconductor/AnVIL')
## Use AnVIL install to install packages on your bioconductor-full image
AnVIL::install('GenomicRanges')
HPC-Singularity containers
Singularity is a container engine, just like Docker. Read more on the Singularity webpage.
Advantages of using Singularity are:
Singularity is compatible with HPC infrastructures and is easily installable.
It has a BSD license (open source!!)
Works very well with Docker images which are already available.
Singularity comes as a module which can be installed on a cluster facility. Once the module is available, it is easy for the user to run software in a containerized fashion, without depending on the system admin for new software installations.
module load singularity
Bioconductor provides Singularity containers on the Singularity Hub. There are three containers provided right now which represent versions of Bioconductor,
devel (master)
## Pull the image
singularity pull --name bioc-devel.simg shub://Bioconductor/bioconductor_full

## Use the image's R
singularity shell bioc-devel.simg /usr/local/bin/R
RELEASE_3_9
singularity pull --name bioc-release-3-9.simg shub://Bioconductor/bioconductor_full:release_3_9
RELEASE_3_8
singularity pull --name bioc-release-3-8.simg shub://Bioconductor/bioconductor_full:release_3_8
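You can also run a one-off R command without opening a shell, using singularity exec (a sketch, assuming R lives at /usr/local/bin/R as in the examples above):
## Run a single R command inside the image
singularity exec bioc-release-3-9.simg /usr/local/bin/R -e 'BiocManager::version()'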
You can set an alias for your Singularity image in your .bashrc file, as:
alias singulaR="singularity shell \
-B /mnt/lustre/users/nturaga/my-lib:/usr/local/lib/R/host-site-library \
/mnt/lustre/users/nturaga/bioc-devel.simg \
/usr/local/bin/R"
where,
-B is how you mount a volume on the singularity image, in the form <local path>:<container path>. This command shows my R library being mounted from my head node onto the singularity container.
/mnt/lustre/users/nturaga/bioc-devel.simg refers to the location of my singularity image.
/usr/local/bin/R is the location of R on the container.
From here on out, the user can build packages or use packages on their local cluster using,
singulaR CMD build <package name>
or run scripts using
singulaR -f <my_R_script.R>
Launch and run parallel jobs
Use BiocParallel to launch and run jobs on Singularity containers. The following example has been run on SLURM. (This example is provided thanks to Martin Morgan, Levi Waldron, and Ludwig Geistlinger.)
To use the containerized singularity image in conjunction with BiocParallel, a recent R (3.6.0) installation on the cluster is needed that, as a minimum requirement, also has the packages BiocParallel and batchtools installed (the minimum requirement would be similar to the image biocparallel-example).
NOTE: Submission of jobs directly from within the singularity container on the head node is not possible. The container doesn’t recognize the cluster environment and its specific configuration. Mounting volumes might also cause issues.
For a SLURM cluster, we first need to define a template file for the jobs; slurm-singularity.tmpl is given below as an example. Loading the module ‘singularity’ and using the full path to the singularity image <image.simg> is important.
#!/bin/bash
<%
# relative paths are not handled well by Slurm
log.file = fs::path_expand(log.file)
-%>
#SBATCH -p <queue>
#SBATCH --job-name=<%= job.name %>
#SBATCH --output=<%= log.file %>
#SBATCH --error=<%= log.file %>
module load singularity
singularity shell /mnt/fs/nturaga/bioconductor-image-w-BiocParallel-latest.simg R -e 'batchtools::doJobCollection("<%= uri %>")'
Using BiocParallel and batchtools, it is then possible to run the following code on your head node in R using BatchtoolsParam. This batchtools functionality was introduced last year as an improvement on previously available batch processing methods in BiocParallel.
## Load the library
library(BiocParallel)
library(batchtools)
## define your BPPARAM,
param <- BatchtoolsParam(
## define number of workers
workers = 5,
## define the cluster you are using, in this case SLURM
cluster = "slurm",
## point to the custom template
template = "slurm-singularity.tmpl"
)
## Supply the param to bplapply for parallel execution
## of a simple function that returns the hostname of the
## compute nodes, for testing purposes. Run bplapply.
test_cluster <- function(n) system2('hostname', stdout = TRUE)
bplapply(1:5, test_cluster, BPPARAM = param)
Future work
There is quite a bit of work in progress to develop solutions for Bioconductor to meet scalable data analysis requirements. A few of them are highlighted below,
BiocParallel::RedisParam
RedisParam is based on the Redis data structure store (Redis). Link: RedisParam. It enables an alternative, modern mechanism for distributed computation.
This has yet to be integrated into BiocParallel. The RedisParam back end allows computation to be distributed across instances or containers running on the cloud. To picture how RedisParam functions, think of five worker instances waiting to hear from a single manager instance which sends them one or more jobs.
Kubernetes
As the next iteration of our efforts on the cloud, we needed RedisParam to work on a container orchestration technology like Kubernetes.
We use Kubernetes to deploy the “cluster” on the cloud, to scale it, and to automate management of the deployment, freeing the user from mundane tasks such as worrying whether a worker node has failed.
Helm charts
helm.sh is a package manager for Kubernetes. Once we created the Kubernetes deployment for RedisParam, we wanted to make sure it is easily distributed and used in the community.
Helm basically works by templating the K8s deployment and producing an easy-to-distribute tarball (or package).
References
- List of protmetcore2 packages
## [1] "bioassayR" "BioNetStat" "biosigner" "ChemmineOB"
## [5] "ChemmineR" "EGSEA" "eiR" "fmcsR"
## [9] "KnowSeq" "limma" "mixOmics" "mzR"
## [13] "NormalyzerDE" "OmicsLonDA" "pathview" "pathwayPCA"
## [17] "proFIA" "RankProd" "ropls" "SDAMS"
## [21] "sparsenetgls" "statTarget" "tofsims" "SBGNview"
- List of metabolomics2 packages
## [1] "adductomicsR" "ASICS" "BiGGR"
## [4] "bioassayR" "BioNetStat" "biosigner"
## [7] "BridgeDbR" "CAMERA" "ChemmineOB"
## [10] "ChemmineR" "CluMSID" "cosmiq"
## [13] "EGSEA" "eiR" "FELLA"
## [16] "fmcsR" "graphite" "hmdbQuery"
## [19] "INDEED" "IPO" "IsoCorrectoR"
## [22] "IsoCorrectoRGUI" "KnowSeq" "limma"
## [25] "LOBSTAHS" "MAIT" "Metab"
## [28] "metabomxtr" "metaMS" "MetCirc"
## [31] "MetNet" "mixOmics" "msPurity"
## [34] "MTseeker" "MWASTools" "mzR"
## [37] "netboost" "NormalyzerDE" "OmicsLonDA"
## [40] "OmicsMarkeR" "PAPi" "pathview"
## [43] "pathwayPCA" "PepsNMR" "proFIA"
## [46] "RankProd" "Rdisop" "RMassBank"
## [49] "ropls" "rWikiPathways" "SDAMS"
## [52] "SIMAT" "sparsenetgls" "statTarget"
## [55] "tofsims" "xcms" "yamss"
## [58] "SBGNview"
sessionInfo
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib/libopenblasp-r0.2.19.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tibble_2.1.3 dplyr_0.8.1 BiocManager_1.30.4
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.1 knitr_1.23 magrittr_1.5 tidyselect_0.2.5
## [5] R6_2.4.0 rlang_0.3.4 stringr_1.4.0 tools_3.6.0
## [9] xfun_0.7 png_0.1-7 htmltools_0.3.6 yaml_2.2.0
## [13] digest_0.6.19 assertthat_0.2.1 crayon_1.3.4 bookdown_0.11.1
## [17] purrr_0.3.2 glue_1.3.1 evaluate_0.14 rmarkdown_1.13
## [21] blogdown_0.13.1 stringi_1.4.3 compiler_3.6.0 pillar_1.4.1
## [25] pkgconfig_2.0.2