Bioconductor on Containers

Instructor(s) name(s) and contact information

Nitesh Turaga* (Github: nturaga)

Schedule

https://bioc2019.bioconductor.org/schedule-developer-day

11:00 - 12:00 – Workshops I Track 1: Bioconductor on Containers - Beginners

12:00 - 1:00 - Lunch

1:00 - 2:00 - Workshop II Track 1: Bioconductor on Containers - Advanced

Pre-requisites

  • Basic knowledge of R and how to install packages the Bioconductor way (using BiocManager::install()).

  • IMPORTANT: Docker needs to be installed on your local machine. Please install Docker ahead of time. If you are planning to attend this workshop and are having issues installing Docker, please email me ahead of time and I will help you through it. Please do not attend the workshop without installing Docker on your machine.

    To test your Docker installation, run on your command line,

      $ docker run hello-world
    
      Unable to find image 'hello-world:latest' locally
      latest: Pulling from library/hello-world
      ca4f61b1923c: Pull complete
      Digest: sha256:ca0eeb6fb05351dfc8759c20733c91def84cb8007aa89a5bf606bc8b315b9fc7
      Status: Downloaded newer image for hello-world:latest
    
      Hello from Docker!
      This message shows that your installation appears to be working correctly.
  • IMPORTANT: After installing Docker, be sure to pull (download) the image we will use in class. This is a fairly large download, so run it before the workshop starts to avoid a delay. Run the command,

      $ docker pull bioconductor/release_base2:latest
  • Some familiarity with HPC infrastructure (relevant only if you plan to follow the single topic that covers Singularity on HPC infrastructure).

R / Bioconductor packages used

Beginners section - Topics

Activity                        Time
Why Containers & Why Docker     10m
Bioconductor on Docker          10m
Flavors of Bioconductor         10m
Which flavor to use             10m
How to mount volumes            10m
Questions                       Please interrupt me!!

Who is this for?

This workshop is for beginners. You don’t need any prior experience with Docker or container technology. In this course, participants are expected to launch Docker images on their local machines, mount volumes, and set ports as needed to use RStudio.

Since it’s a beginners’ course, please note that we will be starting at a basic level. The next part of the course, the advanced section, will give you deeper insight into how container technology can be used on the cloud, with some demonstrations.

Why Containers and Why Docker

A container is a completely isolated environment. Containers can have their own processes, services, network interfaces, and volume mounts, just like a virtual machine. They differ from VMs in a single aspect: they “share” the host OS kernel. Containers are running instances of images.

Docker is simply the most popular container technology in use today. An image is a static (fixed) template, and a container is a running version of that image; an image can be used to create any number of containers.

The main reason for using containers: as a user, you want to delegate the problems of software installation to the developer (in this case, Bioconductor). Bioconductor gives you static Docker images which solve the issue of software installation on your infrastructure.

The total disk space used by all running containers is a combination of each container’s writable-layer size and the image’s virtual size. If multiple containers are started from the exact same image, the total size on disk is the sum of the containers’ sizes plus a single copy of the image’s virtual size, since the image layers are shared. Be careful and monitor the size of your containers, as it may grow very quickly. Read more about Container size on disk.
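You can check these sizes yourself; a quick sanity check, assuming at least one container is running:

## Show each running container's writable-layer size
## and its total (virtual) size
docker ps -s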

Docker warmup

Get a list of docker images

docker images

Let’s pull (get) an image, specifically bioconductor/release_base2:latest (in the form repository/image:tag). This will take a minute or two.

docker pull bioconductor/release_base2:latest

Run the image to start a container,

docker run -p 8787:8787 -e PASSWORD=bioc bioconductor/release_base2:latest

where -p maps the container port to a host port, and -e sets an environment variable (here, the RStudio login password)

Open a new terminal and check that your container is running,

docker ps

Kill the container

docker kill <container_id>

Remove the image (you don’t want to do this one right now)

docker rmi bioconductor/release_base2:latest

Run Docker images interactively,

## For bash
docker run -it bioconductor/release_base2 bash

## For R directly
docker run -it bioconductor/release_base2 R

Interact with a running container,

## Get the container ID
docker ps

You can do this only if the container with that ID is currently running,

docker exec -it <container ID> bash

Bioconductor on Docker

In this section we will discuss what flavors are available to users, and how to build on these images.

Bioconductor provides Docker images in a few ways:

bioc_docker images

These bioconductor images are hosted on Docker Hub. The bioconductor docker page has information and instructions on how to use them. The Dockerfile for each of these containers is located on GitHub. These images are provided directly by the Bioconductor team for each supported version of Bioconductor and are categorized into,

  1. base2: Contains R, RStudio, and a single Bioconductor package (BiocManager, providing the install() function for installing additional packages). Also contains many system dependencies for Bioconductor packages.

  2. core2: It contains everything in base2, plus a selection of core Bioconductor packages.

    Core packages
    BiocManager OrganismDbi ExperimentHub Biobase BiocParallel
    biomaRt Biostrings BSgenome ShortRead IRanges
    GenomicRanges GenomicAlignments GenomicFeatures SummarizedExperiment VariantAnnotation
    DelayedArray GSEABase Gviz graph RBGL
    Rgraphviz rmarkdown httr knitr BiocStyle
  3. protmetcore2: It contains everything in core2, plus a selection of core Bioconductor packages recommended for proteomics and metabolomics analysis. The Reference section contains the list of packages in protmetcore2.

  4. metabolomics2: It contains everything in protmetcore2, plus select packages from the Metabolomics biocView. The Reference section contains the list of packages in metabolomics2.

  5. cytometry2: Built on base2, so it contains everything in base2, plus CytoML and needed dependencies.

It is possible to build on top of these default containers to create a customized Bioconductor container. This is a very good resource for developers/users who want an image without having to worry about system requirements.

Bioconda

Bioconda provides BioContainers: a container for each individual package in Bioconductor, under the “BioContainers” umbrella on Quay.io (see the example after this list).

  • They also provide multi-package containers, where you can combine multiple Bioconda packages into one container. You may choose whichever combination of Bioconductor packages you desire and build a container with them. This is a helpful tool.

  • But please note that they only contain the “release” versions of Bioconductor packages; Bioconda does not build or distribute the ‘devel’ version of Bioconductor.
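As an illustration, pulling a single-package BioContainer might look like the sketch below; the exact image name and tag here are assumptions, so check Quay.io for the tag matching the package version you need.

## Hypothetical example: pull the BioContainer for a single
## Bioconductor package from Quay.io (tag is a placeholder)
docker pull quay.io/biocontainers/bioconductor-genomicranges:<tag>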

bioconductor_full

Bioconductor is now providing an image with the complete set of system dependencies. The goal of this image is that the user should be able to install every Bioconductor package (1620+) without having to deal with system dependency installation.

But please note that the actual Bioconductor packages themselves do not come with this image. It is up to the user to install Bioconductor packages through BiocManager::install().

These bioconductor_full images can be obtained on Docker hub.

docker pull bioconductor/bioconductor_full:devel

docker pull bioconductor/bioconductor_full:RELEASE_3_9

docker pull bioconductor/bioconductor_full:RELEASE_3_8

There are a few exceptions in this list, which are unavoidable because of the way these images are built and provisioned as Docker images. The list is given in the docs/issues.md file on the GitHub page for these images. (We try to keep that document as up to date as possible, but if you find that a package listed there actually installs, send us a Pull Request.)
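As a sketch of typical usage (the package names here are only examples), start RStudio from the devel image and install what your analysis needs from within R:

## Start RStudio at localhost:8787 from the devel image
docker run -e PASSWORD=bioc -p 8787:8787 bioconductor/bioconductor_full:devel

Then, in the R console:

## Example packages; substitute your own
BiocManager::install(c('DESeq2', 'GenomicFeatures'))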

Flavors of Bioconductor

The flavor you “need” for your work should be based on the packages you want for your analysis. There is a list of packages for each flavor that you should evaluate before using the container.

For most beginners, a good place to start is the bioconductor/release_core2 docker container, which includes all the “core” Bioconductor packages.
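For instance, the following (using the same flags as in the warmup) should give you an RStudio session backed by the core packages:

## RStudio on the core image, available at localhost:8787
docker run -e PASSWORD=bioc -p 8787:8787 bioconductor/release_core2:latest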

How to extend a docker image

There are two ways,

  1. Reproducible way:

    Create a new folder my_image. Inside that folder, in a new file called Dockerfile, use the Bioconductor image most suited to your needs as your base:

     ## Inherit from the bioconductor/release_base2 image
     FROM bioconductor/release_base2:latest
    
     ## Add packages
     RUN  R -e "BiocManager::install('<package_name>')"

    Then, from the parent directory, build the new docker image,

     docker build -t my_image:v1 my_image/
  2. The interactive way:

    Run the image interactively, with RStudio

     docker run -e PASSWORD=bioc -p 8787:8787 bioconductor/release_base2:latest
    
     or
    
     docker run -it bioconductor/release_base2:latest bash

    In RStudio at localhost:8787 (or at the R prompt),

     BiocManager::install('BiocGenerics')

    In another terminal window, note the container ID

     docker ps

    Save the container state (it’s a good idea to “tag” your image with a version),

     docker commit <container ID> my_image:1.0
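
    The committed image can then be launched like any other; a quick check that the commit worked,

     ## Start R from the newly committed image
     docker run -it my_image:1.0 R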

Mount volumes to your Docker container

It is sometimes useful to mount a volume containing either data or a library of installed packages.

You can bind as many volumes as needed with the -v/--volume option. The format is

-v /host/path:/container/path

If you want to share data with your docker container, run the container with the command,

docker run \
    -v /data:/home/rstudio/data \
    -e PASSWORD=bioc \
    -p 8787:8787 \
    bioconductor/release_base2

To share libraries with your Docker container, bioconductor provides a mechanism so that the R running within your Docker container is able to see the packages installed in a directory on your host machine,

-v /shared-Bio-3.9:/usr/local/lib/R/host-site-library

This location has special meaning: the Bioconductor containers are configured with /usr/local/lib/R/host-site-library on R’s library path, so mapping a local R library directory to that path shares your built packages with the container.

  1. Create a directory on your host machine (whichever path suits you best),

     mkdir /shared-BioC-3.9
  2. Run the image, sharing that directory for the R package libraries along with the data directory.

     docker run \
         -v /shared-BioC-3.9:/usr/local/lib/R/host-site-library \
         -v /data:/home/rstudio/data \
         -it bioconductor/release_base2 \
         R

    Within R,

     BiocManager::install('BiocGenerics')

All the packages in that directory will be available immediately the next time R is started from the container. Another big advantage is that this prevents your docker image from growing in size due to package installation and data, allowing users to distribute smaller docker images containing only what is needed.
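To confirm the shared library is actually on R’s search path, a quick sanity check from inside the container (the exact output depends on your setup):

## Inside the container's R session
.libPaths()
## Expect '/usr/local/lib/R/host-site-library' among the entries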

Conclusion of Beginners section

Recap of the items you have learnt in this section,

  1. How to start Bioconductor based docker images.

  2. Where these images are available.

  3. What flavors of Bioconductor images are present.

  4. How to bind/mount volumes on these Bioconductor docker images.

These skills get you started on using Bioconductor on containers.

Advanced section - Topics

Activity                        Time
Containers on the cloud         10m
Mount volumes on the cloud      10m
Localize select packages        10m
HPC-Singularity containers      5m
Launch and run parallel jobs    10m
Questions                       10m+
Future work                     5m

Who is this section for?

The advanced section is for anyone who has some familiarity with how Docker works and wants to run things on the “Cloud”. This will be a demonstration section, since we are not providing “credits” to use a particular cloud (Google or AWS).

This section aims to give some direction to users who are looking to use Bioconductor on the cloud. We also discuss some future developments, which look promising.

Containers on the cloud

Running your R session on the cloud with RStudio makes you ‘machine independent’. It also lets you access data hosted on the cloud more cheaply and quickly, for example the 1000 Genomes data available on the google cloud: gs://genomics-public-data/1000-genomes.

One of the biggest advantages to cloud machines is the on-demand scalability to run programs.

In this section, there will be a demonstration showing usability with the Google Cloud infrastructure, but note that everything you can do on the Google Cloud you can also do with AWS.

Requirements

  1. Google cloud billing account

    Pro Tip: Create a test google account (or try the google cloud for the first time from your existing google account) to get $300 in billing credits for a year. You have to link your credit card to this, so make sure you don’t leave any google cloud resources running; you will go through your $300 USD in credit very quickly if you run large compute instances. Take a look at Google cloud - free.

    NOTE: The user is responsible for keeping track of their billing account and any credit cards they use during the workshop.

  2. You should also have taken the beginners course, or at the very least have a basic understanding of Docker. If you missed the earlier section, please refer back to it to fill any gaps.

  3. Google Cloud SDK needs to be installed, and command line callable using gcloud auth login.
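
A minimal setup check might look like this, where <my-project-id> is a placeholder for your own project:

## Authenticate the SDK
gcloud auth login

## Point gcloud at the project you will bill against
gcloud config set project <my-project-id>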

Launch a container on the Google cloud

There are a couple of ways this can be done: one is to use a service that the google cloud provides called “Container Engine”; the other is to use the “Compute Engine” service natively, install docker, and launch a container with a specific image stored either on the “Google Container Registry” (Google’s Dockerhub) or on Dockerhub itself.

Using the Google Container Engine
  1. Launch a Bioconductor docker image on the Google Cloud using the command line application gcloud.

    In this demo, we use bioconductor/release_base2:latest, which is hosted on dockerhub. There are a few additional settings to be aware of in this case, since we’ll be using RStudio as the interface to R.

    Containers running on a cloud instance (read: a computer accessible remotely) need to be accessible via the HTTP protocol at a specific port (8787 for RStudio). To allow this we will create a ‘firewall rule’ permitting HTTP access at port 8787; the instance is tagged http-server below so the rule can target it.

     ## Launch bioconductor/release_base2:latest image
     gcloud compute instances create-with-container bioc-rstudio \
         --container-image bioconductor/release_base2:latest \
         --container-restart-policy never \
         --container-env PASSWORD=bioc \
         --tags http-server \
         --machine-type n1-standard-4

    Note: See the reference for gcloud compute. Also keep in mind that you can launch any image derived from the bioc_docker images in this way, i.e. bioconductor/bioconductor_full:<tag>.

    Next, create the firewall rule that allows HTTP access at port 8787 for instances tagged http-server.

     ## Create a firewall
     gcloud compute firewall-rules create allow-http \
         --allow tcp:8787 \
         --target-tags http-server

    You can now list the running instances you have to check if your container started.

     ## List running compute instances
     gcloud compute instances list

    Once you are done using the instance, you may want to stop it to avoid running up the bill on your Google cloud project; restarting later is just as easy.

     ## Stop an instance,
     gcloud compute instances stop bioc-rstudio
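
    When you are ready to pick up where you left off,

     ## Restart a stopped instance
     gcloud compute instances start bioc-rstudio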

    NOTE: Terminating/deleting the instance is your problem! Go to your google cloud console and make sure to terminate the instance, or run gcloud compute instances delete bioc-rstudio.

    You can also log into your VM on the Google cloud and provision any data you need.

     ## Log in to the VM
     gcloud compute ssh bioc-rstudio

For more configuration options of your containers take a look at the Google Cloud documentation to configure containers.

Mount volumes on the cloud (Optional)

There are many publicly available genomics data sets hosted by cloud services; the Google cloud genomics datasets are located in the bucket gs://genomics-public-data.

You can mount storage buckets on your VM and into your container using GCSFUSE. For this you need an image with access to the google cloud and with gcsfuse installed. Then:

Create a directory, where you want the google bucket

mkdir /path/to/mount

Mount a bucket on your google cloud

gcsfuse example-bucket /path/to/mount

Start working with the mounted bucket

ls /path/to/mount
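
As a concrete sketch (assuming the --only-dir option is available in your gcsfuse version), mounting just the 1000 Genomes directory of the public bucket might look like,

## Mount only the 1000-genomes directory of the public bucket
gcsfuse --only-dir 1000-genomes genomics-public-data /path/to/mount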

Localize select packages

For faster installation of Bioconductor packages on cloud-based instances, we provide, as a demonstration ONLY, Google storage buckets with pre-compiled versions of all the software packages (1600+).

Bioconductor provides the entire pre-compiled binary library for the following versions of R and Bioconductor. These are publicly accessible buckets.

Bioconductor Version R version Google Bucket Name Link
Bioc 3.10 R devel gs://bioconductor-full-devel https://console.cloud.google.com/storage/browser/bioconductor-full-devel
Bioc 3.9 R 3.6.0 gs://bioconductor-full-release-3-9 https://console.cloud.google.com/storage/browser/bioconductor-full-release-3-9
Bioc 3.8 R 3.5.0 gs://bioconductor-full-release-3-8 https://console.cloud.google.com/storage/browser/bioconductor-full-release-3-8

NOTE: “Requester Pays” applies to these buckets (the requester being you!). When you access them and download packages onto your google cloud instance, your billing project gets charged.

To copy packages over to the instance, using the command line:

## Get the name of your billing project
gcloud config list project

## Copy packages over to your local directory
gsutil \
    -u <name-of-billing-project> \
    -m \
    cp -r gs://bioconductor-full-devel/<package_name> <local_directory>
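
Once copied, R needs to see that directory; a minimal sketch, assuming you copied the packages into <local_directory>,

## Put the copied packages on the library path, then load as usual
.libPaths(c('<local_directory>', .libPaths()))
library('<package_name>')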

We also provide a very helpful package (which is part of another project) called AnVIL. With the help of the AnVIL::install() function, users are able to localize packages from these buckets and install them on containers at high speed, without having to wait for compilation.

## Launch a bioconductor_full:RELEASE_3_9 image
gcloud compute instances create-with-container bioc-release \
        --container-image bioconductor/bioconductor_full:RELEASE_3_9 \
        --container-restart-policy never \
        --container-env PASSWORD=bioc \
        --tags http-server

## Log in to your RStudio session
## username: rstudio, password: bioc

## Install the AnVIL package
BiocManager::install('Bioconductor/AnVIL_rapiclient')
BiocManager::install('Bioconductor/AnVIL')

## Use AnVIL install to install packages on your bioconductor-full image
AnVIL::install('GenomicRanges')

HPC-Singularity containers

Singularity is a container engine, just like Docker. Read more on the Singularity webpage.

Advantages of using Singularity are:

  • Singularity is compatible with HPC infrastructures and is easily installable.

  • It has a BSD license (open source!!)

  • Works very well with Docker images which are already available.

Singularity comes as a module which can be installed on a cluster facility. Once the module is available, it is easy for the user to run software in a containerized fashion, without depending on the system admin for new software installations.

module load singularity

Bioconductor provides Singularity containers on the Singularity Hub. There are three containers provided right now, representing versions of Bioconductor,

  1. devel (master)

    ## Pull the image
    singularity pull --name bioc-devel.simg shub://Bioconductor/bioconductor_full

    ## Use the image’s R
    singularity shell bioc-devel.simg /usr/local/bin/R

  2. RELEASE_3_9

    singularity pull --name bioc-release-3-9.simg shub://Bioconductor/bioconductor_full:release_3_9

  3. RELEASE_3_8

    singularity pull --name bioc-release-3-8.simg shub://Bioconductor/bioconductor_full:release_3_8

You can set an alias for your Singularity image in your .bashrc file as,

alias singulaR="singularity shell \
  -B /mnt/lustre/users/nturaga/my-lib:/usr/local/lib/R/host-site-library \
  /mnt/lustre/users/nturaga/bioc-devel.simg \
  /usr/local/bin/R"

where,

-B is how you mount a volume on the singularity image (<local path>:<container path>). This command shows my R library on the head node being mounted into the singularity container.

/mnt/lustre/users/nturaga/bioc-devel.simg refers to the location of my singularity image.

/usr/local/bin/R is the location of R on the container.

From here on out, the user can build packages or use packages on their local cluster using,

singulaR CMD build <package name>

or run scripts using

singulaR -f <my_R_script.R>

Launch and run parallel jobs

Use BiocParallel to launch and run jobs on Singularity containers. The following example has been run on SLURM (This example has been provided thanks to Martin Morgan, Levi Waldron and Ludwig Geistlinger).

To use the containerized singularity image in conjunction with BiocParallel, a recent R (3.6.0) installation on the cluster is needed which, at a minimum, also has the packages BiocParallel and batchtools installed (the minimum requirement would be similar to the biocparallel-example image).

NOTE: Submission of jobs directly from within the singularity container on the head node is not possible: the container doesn’t recognize the cluster environment and its specific configuration. It might also cause issues with mounting volumes.

For a SLURM cluster, we first need to define a template file for the jobs; slurm-singularity.tmpl, given below, is an example. Loading the ‘singularity’ module and using the full path to the singularity image <image.simg> is important.

#!/bin/bash

<%
# relative paths are not handled well by Slurm
log.file = fs::path_expand(log.file)
-%>

#SBATCH -p <queue>
#SBATCH --job-name=<%= job.name %>
#SBATCH --output=<%= log.file %>
#SBATCH --error=<%= log.file %>

module load singularity

singularity shell /mnt/fs/nturaga/bioconductor-image-w-BiocParallel-latest.simg R -e 'batchtools::doJobCollection("<%= uri %>")'

Using BiocParallel and batchtools, it is then possible to run the following code in R on your head node using BatchtoolsParam. This batchtools functionality was introduced last year as an improvement on previously available batch processing methods in BiocParallel.

## Load the library
library(BiocParallel)
library(batchtools)

## define your BPPARAM,
param <- BatchtoolsParam(
    ## define number of workers
    workers = 5,
    ## define the cluster you are using, in this case SLURM
    cluster = "slurm",
    ## point to the custom template
    template="slurm-singularity.tmpl"
)

## Supply the param to bplapply for parallel execution
## of a simple function that returns the hostname of the
## compute nodes for testing purposes. Run bplapply.
test_cluster <- function(n) system2('hostname', stdout = TRUE)

bplapply(1:5, test_cluster, BPPARAM = param)

Future work

There is quite a bit of work in progress to develop solutions for Bioconductor to meet scalable data analysis requirements. A few of them are highlighted below,

  1. BiocParallel::RedisParam

    RedisParam is based on the Redis data structure store (Redis), enabling an alternative, modern mechanism for distributed computation.

    Link: RedisParam

    This is yet to be integrated into BiocParallel. The RedisParam back end allows computation to be distributed across instances or containers running on the cloud. To picture how RedisParam functions, think of 5 worker instances waiting to hear from a single manager instance which sends them one or more jobs.
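
    A purely hypothetical sketch of what such a back end could look like in use, assuming a RedisParam() constructor mirroring the other BiocParallel back ends (the actual API may differ),

     ## Hypothetical usage; the constructor and its arguments
     ## are assumptions, not a released API
     library(RedisParam)
     p <- RedisParam(workers = 5)
     bplapply(1:5, sqrt, BPPARAM = p)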

  2. Kubernetes

    As the next iteration of our efforts on the cloud, we needed RedisParam to work on top of a container orchestration technology like Kubernetes.

    We use Kubernetes to deploy the “cluster” on the cloud, to scale it, and to automate management of the deployment, freeing the user from mundane tasks such as worrying whether a worker node has failed.

    Link: Bioconductor-k8sredis-k8s

  3. Helm charts

    helm.sh is a package manager for Kubernetes. Once we created the Kubernetes deployment for RedisParam, we wanted to make sure it is easily distributed and used in the community.

    Helm basically works by templating the K8s deployment and making an easy-to-distribute tarball (or package).

    Link: Bioconductor-k8sredis-helm-chart

References

  1. List of protmetcore2 packages
##  [1] "bioassayR"    "BioNetStat"   "biosigner"    "ChemmineOB"  
##  [5] "ChemmineR"    "EGSEA"        "eiR"          "fmcsR"       
##  [9] "KnowSeq"      "limma"        "mixOmics"     "mzR"         
## [13] "NormalyzerDE" "OmicsLonDA"   "pathview"     "pathwayPCA"  
## [17] "proFIA"       "RankProd"     "ropls"        "SDAMS"       
## [21] "sparsenetgls" "statTarget"   "tofsims"      "SBGNview"
  2. List of metabolomics2 packages
##  [1] "adductomicsR"    "ASICS"           "BiGGR"          
##  [4] "bioassayR"       "BioNetStat"      "biosigner"      
##  [7] "BridgeDbR"       "CAMERA"          "ChemmineOB"     
## [10] "ChemmineR"       "CluMSID"         "cosmiq"         
## [13] "EGSEA"           "eiR"             "FELLA"          
## [16] "fmcsR"           "graphite"        "hmdbQuery"      
## [19] "INDEED"          "IPO"             "IsoCorrectoR"   
## [22] "IsoCorrectoRGUI" "KnowSeq"         "limma"          
## [25] "LOBSTAHS"        "MAIT"            "Metab"          
## [28] "metabomxtr"      "metaMS"          "MetCirc"        
## [31] "MetNet"          "mixOmics"        "msPurity"       
## [34] "MTseeker"        "MWASTools"       "mzR"            
## [37] "netboost"        "NormalyzerDE"    "OmicsLonDA"     
## [40] "OmicsMarkeR"     "PAPi"            "pathview"       
## [43] "pathwayPCA"      "PepsNMR"         "proFIA"         
## [46] "RankProd"        "Rdisop"          "RMassBank"      
## [49] "ropls"           "rWikiPathways"   "SDAMS"          
## [52] "SIMAT"           "sparsenetgls"    "statTarget"     
## [55] "tofsims"         "xcms"            "yamss"          
## [58] "SBGNview"

sessionInfo

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib/libopenblasp-r0.2.19.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tibble_2.1.3       dplyr_0.8.1        BiocManager_1.30.4
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1       knitr_1.23       magrittr_1.5     tidyselect_0.2.5
##  [5] R6_2.4.0         rlang_0.3.4      stringr_1.4.0    tools_3.6.0     
##  [9] xfun_0.7         png_0.1-7        htmltools_0.3.6  yaml_2.2.0      
## [13] digest_0.6.19    assertthat_0.2.1 crayon_1.3.4     bookdown_0.11.1 
## [17] purrr_0.3.2      glue_1.3.1       evaluate_0.14    rmarkdown_1.13  
## [21] blogdown_0.13.1  stringi_1.4.3    compiler_3.6.0   pillar_1.4.1    
## [25] pkgconfig_2.0.2