Original Authors: Martin Morgan, Sonali Arora, Lori Shepherd
Presenting Author: Maria Doyle
Date: 23-28 June, 2024
Back: Monday labs
Objective: An overview of software available in Bioconductor.
Lessons learned:
Analysis and comprehension of high-throughput genomic data
AnnotationHub
landing page references
HOW-TO vignette illustrating some fun use cases.
‘Domain-specific’ analysis
Exercise
Bioconductor provides ‘infrastructure’ for working with genomic data. We’ll explore some of these in more detail in a later part of this lab. For now…
Exercise
import() and export() functions be useful?
Annotation packages are data-, rather than software-, centric, providing information about the relationship between different identifiers, gene models, reference genomes, etc.
Exercise
On the page listing All Packages, click on the AnnotationData top-level term.
Search, using the box on the right-hand side, for annotation packages that start with the following letters to get a sense of the packages and organisms available.
org.*: symbol mapping
TxDb.* and EnsDb.*: gene models
BSgenome.*: reference genomes
We’ll see in a subsequent lab that a wealth of additional annotation resources, including updated EnsDb and reference genomes, are available through AnnotationHub.
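The same search can be sketched from within R. This is a minimal example, assuming that BiocManager is installed and that the session has internet access; BiocManager::available() lists the packages in the configured repositories whose names match a regular expression. The variable names are illustrative.

```r
# List available packages whose names start with each annotation prefix.
# Assumes BiocManager is installed and the repositories are reachable.
org_pkgs      <- BiocManager::available("^org\\.")
txdb_pkgs     <- BiocManager::available("^TxDb\\.")
ensdb_pkgs    <- BiocManager::available("^EnsDb\\.")
bsgenome_pkgs <- BiocManager::available("^BSgenome\\.")

# How many packages of each kind are on offer?
lengths(list(org = org_pkgs, TxDb = txdb_pkgs,
             EnsDb = ensdb_pkgs, BSgenome = bsgenome_pkgs))

# Peek at the first few symbol-mapping packages
head(org_pkgs)
```

The package names usually encode the organism and, for gene models and genomes, the genome build, so scanning these vectors gives a quick sense of what is covered.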
Workflow packages are meant to provide a comprehensive introduction to workflows that require several different packages. These can be quite extensive documents, providing a very rich source of information.
Exercise
Likely the packages needed for this course are already installed. Nonetheless it is useful to know how to install other packages.
Bioconductor has a particular approach to making packages available. We have a ‘devel’ branch where new packages and features are introduced, and a ‘release’ branch where users have access to stable packages. Every six months, in spring and fall, the current ‘devel’ version of packages is branched to become the next ‘release’. Packages within a release are tested with one another, so it is important to install packages from the same release. The BiocManager package tries to make it easy to do this.
The first step to package installation is to make sure that the BiocManager package has been installed using standard R procedures.
# Install BiocManager from CRAN only if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager", repos = "https://cran.r-project.org")
Then, install the package(s) you would like to use
BiocManager::install(c("Biostrings", "GenomicRanges"))
BiocManager knows how to install CRAN and GitHub packages, too.
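As a rough sketch of how this looks in practice, the commands below report the Bioconductor release the session is using and apply the same install() call to a CRAN package and to a GitHub repository. The package names are illustrative and not part of the original lab: "ggplot2" stands in for any CRAN package, and "owner/repository" is a hypothetical placeholder. Installing from GitHub additionally requires the remotes package. The install calls sit behind a flag so that nothing is installed unless you opt in.

```r
# Which Bioconductor release is this R session configured to use?
bioc_version <- BiocManager::version()
bioc_version

run_install <- FALSE  # set to TRUE to actually install the packages below

if (run_install) {
    # A CRAN package, installed through the same interface
    BiocManager::install("ggplot2")

    # A GitHub package, given as "owner/repository" (hypothetical placeholder;
    # requires the 'remotes' package to be installed)
    BiocManager::install("owner/repository")
}
```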
There are several common problems encountered with package installation. Often, packages have been installed using methods different from the one recommended here, and the packages are from different Bioconductor releases. This leads to problems when packages from different releases are incompatible with one another.
Exercise Verify that your packages are current and installed from the same Bioconductor release with
BiocManager::valid()
Two common problems are that some packages are too old (a newer version of the package exists) or too new (some packages have been installed using a method other than BiocManager). If there are packages that are too old or too new, it is almost always a good idea to follow the instructions from BiocManager::valid() to correct the situation.
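The result of valid() can also be inspected programmatically. The sketch below assumes BiocManager is installed and the repositories are reachable. As far as we understand the interface, valid() returns TRUE when everything is consistent, and otherwise a list whose out_of_date and too_new elements describe the offending packages.

```r
# Check the installation; a warning with suggested fixes is issued if
# problems are found.
v <- BiocManager::valid()

if (isTRUE(v)) {
    message("All packages are current and from the same release.")
} else {
    # Names of packages with a newer version available
    print(rownames(v$out_of_date))
    # Names of packages newer than expected for this release
    print(rownames(v$too_new))
}
```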
Packages need to be installed only once for each version of R you use, but need to be loaded into each new R session that you start. Packages are loaded using
library(Biostrings)
When a package is loaded, it can sometimes generate messages that are informational only. If you are confident this is the case for the packages you’re loading, use suppressPackageStartupMessages() for a quieter experience:
suppressPackageStartupMessages({
    library(GenomicRanges)
    library(GenomicAlignments)
})
Exercise It is usually very helpful to explore package vignettes.
Visit the vignette of the DESeq2 package, and walk through a few steps to understand what the vignette provides in terms of instructions for starting with the package, functionality the package provides, mathematical and statistical details of the implementation, and how the analysis provided by the package might be extended by other packages in the Bioconductor ecosystem. One can visit vignettes through RStudio, or by running commands such as
vignette(package = "DESeq2")
browseVignettes("DESeq2")
Most vignettes are written in such a way that the R code of the vignette must be correct for the vignette to be produced. The code itself is available in the package. Find the code for the DESeq2 vignette
dir(system.file(package="DESeq2", "doc"))
## character(0)
vign <- system.file(package="DESeq2", "doc", "DESeq2.R")
Open it in RStudio (e.g., using the File -> Open File… menu), step through the first few lines of R code, and compare your output to the output in the vignette. Alternatively, run the entire analysis in the vignette with the command
source(vign, echo = TRUE, max.lines = Inf)
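The character(0) output above suggests that the vignette files were not found where this document was built, for instance because DESeq2 was not installed there. A small guard, sketched below, avoids calling source() on a missing file. It relies on system.file() returning an empty string when nothing matches. The message text is illustrative.

```r
# Locate the R code extracted from the DESeq2 vignette
vign <- system.file(package = "DESeq2", "doc", "DESeq2.R")

if (nzchar(vign)) {
    # Preview the first lines before running the whole analysis
    writeLines(head(readLines(vign), 10))
} else {
    # system.file() returned "", so the file (or the package) is missing
    message("Vignette code not found; try BiocManager::install(\"DESeq2\")")
}
```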
Exercise Help pages provide more focused instructions for use of particular functions. It is often con
Load the Biostrings package
library(Biostrings)
Look for help on the function letterFrequency() using the command
?letterFrequency
Note that there is tab completion after the ? and first few letters of the command.
The help page is quite complicated, documenting several different functions. In the ‘Description’ section, find a description of what letterFrequency() does. In the ‘Usage’ section, find the arguments that can be used with letterFrequency(), and try to understand, from the Arguments section, what each argument might be or how it influences the computation. The Value section attempts to describe the return value of the letterFrequency() function.
Sometimes an example is worth a thousand words. Can you run the first two sections of the example at the end of the help page (for alphabetFrequency() and letterFrequency()) to arrive at a better understanding of how the letterFrequency() function works?
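If the help page examples feel dense, a smaller sketch on a made-up sequence may help. It assumes Biostrings is installed; the ten-letter sequence and the variable names are illustrative only.

```r
library(Biostrings)

# A short, made-up DNA sequence
dna <- DNAString("ACGTACGGCC")

# Count each of the four bases (plus a catch-all 'other' column)
base_counts <- alphabetFrequency(dna, baseOnly = TRUE)
base_counts

# A multi-letter string such as "GC" is treated as "G or C",
# so this returns the combined count: 7 of the 10 letters
gc_count <- letterFrequency(dna, letters = "GC")
gc_count

# The same quantity expressed as a proportion of the sequence length
gc_prop <- letterFrequency(dna, letters = "GC", as.prob = TRUE)
gc_prop
```

Comparing base_counts with gc_count shows the difference between the two functions: alphabetFrequency() tabulates the whole alphabet, while letterFrequency() counts only the letters, or groups of letters, that you ask for.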
Where to get help?
What can you get help on?
How to ask a good question
Simplify to just a few lines of R code.
Must be able to be run by someone else
Include the output of sessionInfo(), which often shows problems with out-of-date packages.
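As an illustration of these guidelines, a question might include something like the block below: a few lines that create their own data, show the call in question, and end with sessionInfo(). The data and variable names are made up for illustration.

```r
# Small, self-contained data: no files or private objects required
set.seed(123)
counts <- matrix(
    rpois(12, lambda = 10), nrow = 4,
    dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

# The operation being asked about, reduced to its essentials
totals <- colSums(counts)
totals

# Package versions, so that others can spot out-of-date installations
sessionInfo()
```

Because the example generates its data with a fixed seed, anyone can paste it into a fresh R session and see the same behaviour.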
Exercise Visit the support site and review the five most recent questions. Which do you think are ‘good’, from the guidelines offered above? Which have received helpful answers? Can you figure out who the person answering the question is, i.e., why do they think they have an answer?
This very open-ended topic points to some of the most prominent Bioconductor packages for sequence analysis. Use the opportunity in this lab to explore the package vignettes and help pages highlighted below; much of the material will be covered in greater detail in subsequent labs and lectures.
Basics
library(GenomicRanges)
help(package="GenomicRanges")
vignette(package="GenomicRanges")
vignette(package="GenomicRanges", "GenomicRangesHOWTOs")
?GRanges
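Before reading the help pages, it can help to build a tiny GRanges object by hand. The following is a minimal sketch on made-up coordinates, assuming GenomicRanges is installed; the chromosome names and positions are illustrative only.

```r
library(GenomicRanges)

# Three made-up ranges on two chromosomes, each 10 bases wide
gr <- GRanges(
    seqnames = c("chr1", "chr1", "chr2"),
    ranges = IRanges(start = c(1, 20, 5), width = 10),
    strand = c("+", "-", "+")
)
gr
width(gr)

# A query region on chr1 spanning positions 5 to 25 (strand unspecified)
query <- GRanges("chr1", IRanges(start = 5, end = 25))

# Both chr1 ranges overlap the query; the chr2 range does not
n_hits <- countOverlaps(query, gr)
n_hits

# Retrieve the overlapping ranges themselves
subsetByOverlaps(gr, query)
```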
Domain-specific analysis – explore the landing pages, vignettes, and reference manuals of two or three of the following packages.
Working with sequences, alignments, common web file formats, and raw data; these packages rely very heavily on the IRanges / GenomicRanges infrastructure that we will encounter later in the course.
?consensusMatrix, for instance. Also check out the BSgenome package for working with whole genome sequences, e.g., ?"getSeq,BSgenome-method"
?readGAlignments help page and vignette(package="GenomicAlignments", "summarizeOverlaps")
import and export functions can read in many common file types, e.g., BED, WIG, GTF, …, in addition to querying and navigating the UCSC genome browser. Check out the ?import page for basic usage.
Annotation: Bioconductor provides extensive access to ‘annotation’ resources (see the AnnotationData biocViews hierarchy); these are covered in greater detail in Thursday’s lab, but some interesting examples to explore during this lab include:
?select
?exonsBy page to retrieve all exons grouped by gene or transcript.
"gene_biotype" or "tx_biotype" defining the biotype of the features (e.g. lincRNA, protein_coding, miRNA etc). EnsDb databases are designed for Ensembl annotations and contain annotations for all genes (protein coding and non-coding) for a specific Ensembl release.
A number of Bioconductor packages help with visualization and reporting, in addition to functions provided by individual packages.
sessionInfo()
## R version 4.4.0 (2024-04-24)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sonoma 14.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/Dublin
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] GenomicAlignments_1.40.0 Rsamtools_2.20.0
## [3] SummarizedExperiment_1.34.0 Biobase_2.64.0
## [5] MatrixGenerics_1.16.0 matrixStats_1.3.0
## [7] GenomicRanges_1.56.0 Biostrings_2.72.1
## [9] GenomeInfoDb_1.40.1 XVector_0.44.0
## [11] IRanges_2.38.0 S4Vectors_0.42.0
## [13] BiocGenerics_0.50.0 BiocStyle_2.32.0
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.7-0 jsonlite_1.8.8 compiler_4.4.0
## [4] BiocManager_1.30.23 crayon_1.5.2 bitops_1.0-7
## [7] parallel_4.4.0 jquerylib_0.1.4 BiocParallel_1.38.0
## [10] yaml_2.3.8 fastmap_1.2.0 lattice_0.22-6
## [13] R6_2.5.1 S4Arrays_1.4.1 knitr_1.47
## [16] DelayedArray_0.30.1 bookdown_0.39 GenomeInfoDbData_1.2.12
## [19] bslib_0.7.0 rlang_1.1.4 cachem_1.1.0
## [22] xfun_0.44 sass_0.4.9 SparseArray_1.4.8
## [25] cli_3.6.2 zlibbioc_1.50.0 digest_0.6.35
## [28] grid_4.4.0 rstudioapi_0.16.0 lifecycle_1.0.4
## [31] evaluate_0.23 codetools_0.2-20 abind_1.4-5
## [34] rmarkdown_2.27 httr_1.4.7 tools_4.4.0
## [37] htmltools_0.5.8.1 UCSC.utils_1.0.0
Research reported in this tutorial was supported by the National Human Genome Research Institute and the National Cancer Institute of the National Institutes of Health under award numbers U24HG004059 (Bioconductor), U24HG010263 (AnVIL) and U24CA180996 (ITCR).