1  Why Use S4 Instead of S3 or R6 in R?

If you have just learned about object-oriented programming (OOP) in R, you might be wondering why you would choose S4 instead of the more common S3 system, which is widely used in popular R projects such as the tidyverse. Yet, despite being less popular overall than S3, the S4 system has several key advantages:

2 Leveraging the S4 Infrastructure of Bioconductor

Bioconductor is a large and diverse project that provides functionality for a wide range of biological data types and statistical methods.

A key foundation of Bioconductor is its reliance on S4 classes rather than the more commonly used S3 classes. S4 is more structured, rigorous, and verbose compared to S3, giving it an initially steeper learning curve. However, this rigor makes it much easier to share and reuse code across hundreds of R/Bioconductor packages.
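As a rough illustration of that extra rigor, the sketch below defines a small S4 class with typed slots, a validity check, and a generic with one method, using only the base `methods` package. The class `GeneScores` and the generic `topGene` are hypothetical names invented for this example; they do not come from any Bioconductor package.

```r
library(methods)

# Hypothetical class for illustration: gene names with one score each.
setClass(
  "GeneScores",
  slots = c(genes = "character", scores = "numeric"),
  validity = function(object) {
    if (length(object@genes) != length(object@scores)) {
      return("'genes' and 'scores' must have the same length")
    }
    TRUE
  }
)

# A generic, plus a method registered formally for the class.
setGeneric("topGene", function(x, ...) standardGeneric("topGene"))
setMethod("topGene", "GeneScores", function(x, ...) {
  x@genes[which.max(x@scores)]
})

gs <- new("GeneScores", genes = c("TP53", "BRCA1"), scores = c(0.2, 0.9))
topGene(gs)

# S4 rejects a malformed object at construction time.
bad <- tryCatch(
  new("GeneScores", genes = "TP53", scores = c(0.2, 0.9)),
  error = function(e) conditionMessage(e)
)

# The S3 equivalent accepts any list without complaint.
s3_obj <- structure(
  list(genes = "TP53", scores = c(0.2, 0.9)),
  class = "GeneScores3"
)
```

The S4 version is more verbose, but any package that receives a `GeneScores` object can rely on its slots having the declared types and passing the validity check.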

Advantages of Using S4 in Your Bioconductor Packages:

  • Reuse Optimized Code: You can easily reuse highly optimized and stable code from hundreds of other Bioconductor packages.
  • Central Data Representations: S4 classes serve as central data representations, allowing users to seamlessly integrate analysis workflows across multiple Bioconductor packages.
  • Familiar Interfaces: Leveraging familiar interfaces makes it easier for new users to start using your package effectively.

3 Finding Existing S4 Classes

The easiest way to find out if there is already an existing S4 class for your data type is to search the Bioconductor package index. If you are unsure, you can always ask on the main Bioconductor communication channels, such as the bioc-devel mailing list or the Bioconductor Slack.

Below are some pointers to the most central S4 classes in the Bioconductor project.

3.1 Bioconductor core packages and S4-classes

Bioconductor core packages are maintained centrally by the Bioconductor team itself. As they are some of the most optimized and stable parts of Bioconductor (some packages are more than a decade old!), they are the best starting point for reusing classes.

The S4Vectors and IRanges packages contain low-level S4-classes for simple types of data:

  • DFrame: Improved version of the base R data.frame, where columns can be any type and can have metadata attached.
  • List and friends: Improved version of the base R list, where each element has to be the same type (CharacterList, IntegerList, NumericList, etc.)
  • Factor: Improved version of the base R factor, where levels can be any type.
  • Rle: Efficient representation of long vectors with many repeated values (e.g. coverage calculated across a whole genome)
  • Hits: Storing “hits” or “overlaps” between two sets, e.g. overlap between two sets of genomic intervals
  • Views: Accessing smaller parts of a large object, like a genome, without copying the large object itself. Many specialized classes for different use cases (RleViews, XStringViews, etc.)
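A minimal sketch of two of these low-level classes, assuming the IRanges package (which loads S4Vectors) is installed from Bioconductor. The data values are made up for illustration.

```r
# Assumes IRanges is installed, e.g. via BiocManager::install("IRanges").
library(IRanges)

# Rle stores runs of repeated values instead of every element.
cov <- Rle(c(0, 0, 0, 5, 5, 1))
runValue(cov)   # the distinct run values
runLength(cov)  # how many times each run value repeats

# A DataFrame column can hold a list-like object such as an IntegerList.
df <- DataFrame(
  sample = c("s1", "s2"),
  reads = IntegerList(c(1L, 2L), 3L)
)

# Arbitrary metadata can be attached to the object as a whole.
metadata(df) <- list(source = "toy example")
```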

The GenomicRanges, GenomeInfoDb and rtracklayer packages contain S4-classes for genomic intervals (as seen in BED, GTF or BigWig files):

  • GRanges: Genomic ranges with start and end coordinates. Also keeps information
  • GRangesList: Sets of GRanges.
  • Seqinfo: Chromosome names and lengths for a genome/assembly.
  • GPos: Single base pair genomic intervals.
  • Import with rtracklayer::import()
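The following sketch shows how `GRanges` objects and a `Hits` result fit together, assuming GenomicRanges is installed. The coordinates and the `gene_id` column are invented for the example.

```r
# Assumes GenomicRanges is installed, e.g. via BiocManager::install("GenomicRanges").
library(GenomicRanges)

# Toy gene annotation; extra arguments become metadata columns.
genes <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges = IRanges(start = c(100, 500, 200), end = c(200, 600, 300)),
  strand = c("+", "-", "+"),
  gene_id = c("geneA", "geneB", "geneC")
)

# Toy unstranded peaks.
peaks <- GRanges(
  seqnames = c("chr1", "chr2"),
  ranges = IRanges(start = c(150, 1000), end = c(160, 1100))
)

# findOverlaps() returns a Hits object linking query and subject indices.
hits <- findOverlaps(peaks, genes)
overlapping <- genes[subjectHits(hits)]
overlapping$gene_id
```

Because the result is a standard `Hits` object, it can be passed on to any other package that understands that class.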

SummarizedExperiment contains S4-classes for count/expression matrices and associated metadata:

  • SummarizedExperiment: Stores one or more expression matrices with metadata for both columns and rows.
  • RangedSummarizedExperiment: SummarizedExperiment with an attached GRanges.
  • Many packages reuse SummarizedExperiment for more specialized cases, see for example RaggedExperiment.
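As a small sketch of how the matrix and its metadata stay in sync, assuming the SummarizedExperiment package is installed (the counts and annotations are toy values):

```r
# Assumes SummarizedExperiment is installed from Bioconductor.
library(SummarizedExperiment)

counts <- matrix(
  c(10L, 0L, 5L, 3L, 8L, 1L),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("sample1", "sample2"))
)

se <- SummarizedExperiment(
  assays = list(counts = counts),
  rowData = DataFrame(
    biotype = c("coding", "coding", "lncRNA"),
    row.names = rownames(counts)
  ),
  colData = DataFrame(
    condition = c("control", "treated"),
    row.names = colnames(counts)
  )
)

# Subsetting keeps the assay, rowData and colData aligned.
sub <- se[rowData(se)$biotype == "coding", se$condition == "treated"]
assay(sub, "counts")
```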

The Biostrings package contains S4-classes for biological strings (e.g. from FASTQ files):

  • DNAString: DNA sequences
  • AAString: Amino acid sequences
  • DNAStringSet/AAStringSet and DNAStringSetList/AAStringSetList: Sets of sequences
  • Import with readDNAStringSet() and readAAStringSet()
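A brief sketch of these string classes, assuming Biostrings is installed. The sequences are made up for the example.

```r
# Assumes Biostrings is installed from Bioconductor.
library(Biostrings)

dna <- DNAString("ATGGCC")
rc <- reverseComplement(dna)  # reverse complement of the sequence
aa <- translate(dna)          # translation to an AAString

# A named set of sequences, as returned by readDNAStringSet() for a FASTA file.
seqs <- DNAStringSet(c(seq1 = "ATGGCC", seq2 = "TTGACA"))
width(seqs)
```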

The GenomicAlignments and Rsamtools packages contain S4-classes for aligned reads (e.g. BAM-files):

  • GAlignments: Alignments of short reads to a reference genome.
  • Large BAM-files can be imported with scanBam() or readGAlignments()

The VariantAnnotation package contains S4-classes for genetic variants:

  • VCF: Genotypes across individuals and associated metadata.
  • VRanges: Location of genetic variants
  • Import with readVcf()

BiocSet and GSEABase contain S4-classes for gene sets, e.g. Gene Ontology (GO)-terms and similar:

  • GeneSet: Gene set identifiers and metadata.
  • GeneSetCollection: Sets of GeneSet

DelayedArray contains S4-classes for analyzing matrices that are too large to fit into memory:

  • DelayedArray: Wrapper around data stored either in a highly efficient format (e.g. sparse) or on disk.
  • Several specialized subclasses, including RleMatrix, ConstantArray, SparseArray, HDF5Array and ScaledMatrix

3.2 Widely used Bioconductor S4-classes

Some Bioconductor packages have implemented S4-classes that have been widely adopted:

SingleCellExperiment for single cell datasets (e.g. scRNA-Seq), including single cell multi-omics (e.g. CITE-Seq).

SpatialExperiment for spatial-omics.

MultiAssayExperiment for complex multi-omics datasets with arbitrary patterns of mixing data.

Spectra for mass spectrometry data

TFBSTools for analyzing transcription factor binding sites with Position Frequency Matrices (PFMs) and similar.

limma, edgeR and DESeq2 for differential expression (DE) analysis

4 Extending Bioconductor S4-classes

We generally recommend that developers simply reuse existing classes: this saves time on the developer's part and makes it easier for end users to switch between packages.

Some advanced developers might find the need to formally extend existing S4-classes with new subclasses. This requires more knowledge of how S4 inheritance works and how the different Bioconductor packages build on each other.

We are currently developing new documentation on this topic. For now, we refer to some general background on S4 from the Advanced R book (https://adv-r.hadley.nz/s4.html) and the vignettes from the S4Vectors, SummarizedExperiment, SingleCellExperiment and DelayedArray packages, which contain concrete examples of extending existing S4-classes.
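Until that documentation is available, the sketch below gives a rough idea of what a formal extension can look like, assuming SummarizedExperiment is installed. The subclass `QCExperiment`, its `qcScore` slot and the constructor are hypothetical names made up for this example, and the code is a starting point rather than a complete implementation.

```r
# Assumes SummarizedExperiment is installed from Bioconductor.
library(SummarizedExperiment)

# Hypothetical subclass adding one slot with a QC score per sample.
setClass(
  "QCExperiment",
  contains = "SummarizedExperiment",
  slots = c(qcScore = "numeric")
)

# Validity check for the new slot only; the parent class checks its own slots.
setValidity("QCExperiment", function(object) {
  if (length(object@qcScore) != ncol(object)) {
    return("'qcScore' must have one value per column")
  }
  TRUE
})

# Hypothetical constructor: build the parent object, then add the new slot.
QCExperiment <- function(..., qcScore = numeric()) {
  se <- SummarizedExperiment(...)
  new("QCExperiment", se, qcScore = qcScore)
}

qe <- QCExperiment(
  assays = list(counts = matrix(1:4, nrow = 2)),
  qcScore = c(0.9, 0.8)
)
is(qe, "SummarizedExperiment")
```

Existing methods written for SummarizedExperiment accept `qe` unchanged. A real subclass would also need its own methods for operations such as `[` so that `qcScore` stays in sync when columns are subset; the package vignettes mentioned above cover these details.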
