Bioconductor 5xx:
Analysis of multi-sample
multi-group scRNA-seq data

Introduction

What is DS analysis?

A fundamental task in the analysis of single-cell RNA-sequencing (scRNA-seq) data is the identification of systematic transcriptional changes(Stegle, Teichmann, and Marioni, n.d.). Such analyses are a critical step in the understanding of molecular responses, and have applications in development, in perturbation studies or in disease.
Most of the current scRNA-seq differential expression (DE) analysis methods are designed to test one set of cells against another (or more generally, multiple sets together), and can be used to compare cell clusters (e.g., for identifying marker genes) or across conditions (cells from one condition versus another) (Soneson and Robinson 2018). In such statistical models, the cells are the experimental units and thus represent the population that inferences will extrapolate to.

Using established terminology, we refer to cell identity as the combination of cell type, a stable molecular signature, and cell state, a transient snapshot of a cell’s molecular events (Wagner, Regev, and Yosef 2016; Trapnell, n.d.). This classification is inherently arbitrary, but still provides a basis for biological interpretation and a framework for discovering interesting expression patterns from scRNA-seq datasets. For example, T cells could be defined as a single (albeit diverse) cell type or could be divided into discrete subtypes, if relevant information to categorize each cell at this level were available. In either case, the framework presented here would be able to focus on the cell type of interest and look for changes (in expression) across samples.
Given the emergence of multi-sample multi-group scRNA-seq datasets, the goal becomes making sample-level inferences (i.e., experimental units are samples). Thus, differential state (DS) analysis is defined as following a given cell type across a set of samples (e.g., individuals) and experimental conditions (e.g., treatments), in order to identify cell-type-specific responses, i.e., changes in cell state. DS analysis: i) should be able to detect diluted changes that only affect a single cell type, a subset of cell types or even a subset of a single subpopulation; and, ii) is intended to be orthogonal to clustering or cell type assignment.

Starting point

The starting point for a DS analysis is a (sparse) matrix of gene expression, either as counts or some kind of normalized data, where rows = genes and columns = cells. Each cell additionally has a cluster (subpopulation) label as well as a sample label; metadata should accompany the list of samples, such that they can be organized into comparable groups with sample-level replicates (e.g., via a design matrix).

The approach presented here is modular and thus the type label could originate from an earlier step in the analysis, such as clustering (Duó, Robinson, and Soneson 2018; Freytag et al. 2018), perhaps after integration (Butler et al. 2018; Stuart et al. 2018) or after labeling of clusters (Diaz-Mejia et al. 2019) or after cell-level type assignment (Zhang et al. 2019). Although we have pipelines for these various entry points, the specific details and suitability of all these pre-processing steps needs careful exploration and are beyond the scope of this workflow.

Getting started

Data description

To illustrate a standard DS analysis workflow, we will use Kang et al. (2018)’s droplet-based scRNA-seq data of PBMCs cells from 8 lupus patients measured before and after 6h-treatment with INF-β (16 samples in total). The original dataset is deposited under Gene Expression Ombnibus (GEO) accession GSE96583, and has been made available via Bioconductor’s ExperimentHub web resource as a SingleCellExperiment (SCE) object that contains unfiltered raw counts, and any gene and cell metadata available from the original data source.

Loading the data

We first initialize a Hub instance to search for and load available data with the ExperimentHub function, and store the complete list of records in the variable eh. Using query, we then retrieve any records that match our keyword(s) of interest, as well as their accession IDs (EH1234).

library(ExperimentHub)
eh <- ExperimentHub()
query(eh, "Kang")
## ExperimentHub with 3 records
## # snapshotDate(): 2019-06-20 
## # $dataprovider: NCI_GDC, GEO
## # $species: Homo sapiens
## # $rdataclass: BSseq, SingleCellExperiment, character
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## #   tags, rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["EH1661"]]' 
## 
##            title                                               
##   EH1661 | Whole Genome Bisulfit Sequencing Data for 47 samples
##   EH1662 | Whole Genome Bisulfit Sequencing Data for 47 samples
##   EH2259 | Kang18_8vs8

Finally, we load the data of interest into R via [[ and the corresponding accession ID:

(sce <- eh[["EH2259"]])
## class: SingleCellExperiment 
## dim: 35635 29065 
## metadata(0):
## assays(1): counts
## rownames(35635): MIR1302-10 FAM138A ... MT-ND6 MT-CYB
## rowData names(2): ENSEMBL SYMBOL
## colnames(29065): AAACATACAATGCC-1 AAACATACATTTCC-1 ...
##   TTTGCATGGTTTGG-1 TTTGCATGTCTTAC-1
## colData names(5): ind stim cluster cell multiplets
## reducedDimNames(1): TSNE
## spikeNames(0):

For data handling and visualization throughout this workflow, we will require the following packages:

# data handling
library(dplyr)
library(magrittr)
library(Matrix)
library(purrr)
library(reshape2)
library(S4Vectors)
library(tibble)

# visualzation
library(ComplexHeatmap)
library(ggplot2)
library(pheatmap)
library(scales)

Data preparation

Before proceeding with basic preprocessing and filtering steps, we drop non-singlet cells as well as cells that have not been assigned a cluster ID:

sce <- sce[, sce$multiplets == "singlet" & !is.na(sce$cell)]
dim(sce)
## [1] 35635 24673

For simplicity, we will further retain only cell-metadata columns that are relevant to our analysis (and use intuitive names for these):

colData(sce) %>% 
    as.data.frame %>% 
    transmute(
        group_id = stim, 
        patient_id = ind,
        sample_id = paste0(stim, ind),
        cluster_id = cell) %>%
    mutate_all(as.factor) %>% 
    set_rownames(colnames(sce)) %>% 
    DataFrame -> colData(sce)
head(colData(sce))
## DataFrame with 6 rows and 4 columns
##                  group_id patient_id sample_id      cluster_id
##                  <factor>   <factor>  <factor>        <factor>
## AAACATACATTTCC-1     ctrl       1016  ctrl1016 CD14+ Monocytes
## AAACATACCAGAAA-1     ctrl       1256  ctrl1256 CD14+ Monocytes
## AAACATACCATGCA-1     ctrl       1488  ctrl1488     CD4 T cells
## AAACATACCTCGCT-1     ctrl       1256  ctrl1256 CD14+ Monocytes
## AAACATACCTGGTA-1     ctrl       1039  ctrl1039 Dendritic cells
## AAACATACGATGAA-1     ctrl       1488  ctrl1488     CD4 T cells

For consistency and easy accession throughout this workflow, we will store cluster and sample IDs, as well as the number of clusters and samples:

nk <- length(kids <- set_names(levels(sce$cluster_id)))
ns <- length(sids <- set_names(levels(sce$sample_id)))

Finally, we compile a table that summarizes the experimental design:

m <- match(sids, sce$sample_id)
n_cells <- as.numeric(table(sce$sample_id))
(ei <- data.frame(colData(sce)[m, ], 
    n_cells, row.names = NULL) %>% 
    select(-"cluster_id"))
##    group_id patient_id sample_id n_cells
## 1      ctrl        101   ctrl101     858
## 2      ctrl       1015  ctrl1015    2738
## 3      ctrl       1016  ctrl1016    1723
## 4      ctrl       1039  ctrl1039     461
## 5      ctrl        107   ctrl107     517
## 6      ctrl       1244  ctrl1244    1879
## 7      ctrl       1256  ctrl1256    2108
## 8      ctrl       1488  ctrl1488    2031
## 9      stim        101   stim101    1166
## 10     stim       1015  stim1015    2352
## 11     stim       1016  stim1016    1635
## 12     stim       1039  stim1039     641
## 13     stim        107   stim107     525
## 14     stim       1244  stim1244    1464
## 15     stim       1256  stim1256    2026
## 16     stim       1488  stim1488    2549

Preprocessing

The scater package (McCarthy et al. 2017) provides a variey of tools for preprocessing and quality control of single-cell transcriptomic data. For completeness, we will apply some minimal filtering steps to

  • remove undetected genes
  • remove cells with very few or many detected genes
  • remove very lowly expressed genes
  • compute normalized expression values for visualization

For more thorough preprocessing, we refer to the Quality control with scater vignette.

# remove undetected genes
sce[rowSums(counts(sce) > 0) > 0, ]
## class: SingleCellExperiment 
## dim: 18464 24673 
## metadata(0):
## assays(1): counts
## rownames(18464): RP11-34P13.8 AL627309.1 ... S100B PRMT2
## rowData names(2): ENSEMBL SYMBOL
## colnames(24673): AAACATACATTTCC-1 AAACATACCAGAAA-1 ...
##   TTTGCATGGGACGA-1 TTTGCATGTCTTAC-1
## colData names(4): group_id patient_id sample_id cluster_id
## reducedDimNames(1): TSNE
## spikeNames(0):
dim(sce)
## [1] 35635 24673

We use calculateQCMetrics to compute various quality control metrics for each cell and gene, stored in the colData and rowData, respectively, and proceed with filtering cells and genes as noted above:

library(scater)

# calculate quality control (QC) metrics
sce <- calculateQCMetrics(sce)

# get cells w/ few/many detected genes
sce$is_outlier <- isOutlier(
    metric = sce$total_features_by_counts,
    nmads = 2, type = "both", log = TRUE)

# remove outlier cells
sce <- sce[, !sce$is_outlier]
dim(sce)
## [1] 35635 22221
# remove lowly expressed genes & normalize
sce <- sce[rowSums(counts(sce) > 1) >= 10, ]
dim(sce)
## [1]  6413 22221

Finally, we use normalize to calculate log2-transformed normalized expression values by dividing each count by its size factor, adding a pseudo-count of 1, and log-transforming1.

sizeFactors(sce) <- librarySizeFactors(sce)
sce <- normalize(sce)
assayNames(sce)
## [1] "counts"    "logcounts"
range(logcounts(sce))
## [1] 0.000000 9.728775

Alternatively, expression values could be obtained via vst (variance stabilizing transformation) from the sctransform package (Hafemeister and Satija 2019), which returns Pearson residuals from a regularized negative binomial regression model that can be interpreted as normalized expression values.

DS analysis

Aggregation of single-cell to pseudo-bulk data

In order to leverage existing robust bulk RNA-seq DE frameworks, such as DESeq2, edgeR and limma, we first aggregate measurements for each cluster at the sample level to obtain pseudobulk data. While, in principle, various combinations of input data (raw/(log-)normalized counts, CPM ect.) and summary statistics (sum, mean, median) could be applied, we here default to the sum of raw counts.

Aggregation of single-cell to pseudo-bulk data. Cell-level measurements for each subpopulation (cluster) and sample are aggregated to obtain pseudo-bulk data.

Figure 1: Aggregation of single-cell to pseudo-bulk data. Cell-level measurements for each subpopulation (cluster) and sample are aggregated to obtain pseudo-bulk data.

For aggregation, we use Matrix.utils’s aggregate.Matrix function with the colData columns "cluster_id" and "sample_id" as groupings. Note that aggregate.Matrix initially yields a matrix of dimensions #(cluster-sample-instances) × #(genes), which we re-split and transform to obtain, for each cluster, and matrix of dimensions #(genes) × #(samples):

library(Matrix.utils)

system.time({
    # aggregate by cluster-sample
    groups <- colData(sce)[, c("cluster_id", "sample_id")]
    pb <- aggregate.Matrix(t(counts(sce)), 
        groupings = groups, fun = "sum") 

    # split by cluster, transform & rename columns
    pb <- split.data.frame(pb, rep(kids, ns)) %>% 
        lapply(function(u) set_colnames(t(u), unname(sids)))
})
##    user  system elapsed 
##   0.823   0.096   0.918

From the code snippet above, we obtain a list of length nk, where each element contains a dgCMatrix of pseudobulk counts with rows = genes and columns = samples. For a more elegant show-method and easier data accession, we construct a SCE where each assay sheet corresponds to one cluster:

# construct SCE of pseudo-bulk counts
# (assays = clusters, rows = genes, columns = samples)
(pb <- SingleCellExperiment(assays = pb))
## class: SingleCellExperiment 
## dim: 6413 16 
## metadata(0):
## assays(8): B cells CD14+ Monocytes ... Megakaryocytes NK cells
## rownames(6413): NOC2L HES4 ... S100B PRMT2
## rowData names(0):
## colnames(16): ctrl101 ctrl1015 ... stim1256 stim1488
## colData names(0):
## reducedDimNames(0):
## spikeNames(0):

The approach used here for aggregation presents but one of many ways, and was selected due to its simplicity and efficiency. Alternatively, one could, for example, split the character vector of cells (colnames(sce)) by cluster-sample, and use rowSums(counts(sce[, ...])) for aggregation. However, this approach is comparatively more verbose and less efficient2:

library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:reshape2':
## 
##     dcast, melt
## The following object is masked from 'package:purrr':
## 
##     transpose
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:SummarizedExperiment':
## 
##     shift
## The following object is masked from 'package:GenomicRanges':
## 
##     shift
## The following object is masked from 'package:IRanges':
## 
##     shift
## The following objects are masked from 'package:S4Vectors':
## 
##     first, second
system.time({
    # split cell by cluster-sample
    cs_by_ks <- as.data.frame(colData(sce)) %>% 
        rownames_to_column("cells") %>% setDT %>% 
        split(flatten = FALSE, sorted = TRUE,
            by = c("cluster_id", "sample_id")) %>% 
        map_depth(2, "cells")
    
    # for ea. cluster-sample..
    pb2 <- map_depth(cs_by_ks, 2, function(cs)
        rowSums(counts(sce[, cs]))) %>% # ..compute pseudobulks
        map(data.frame) # column-bind samples
})
##    user  system elapsed 
##   9.006   0.062   8.838

Pseudobulk-level MDS plot

Prior to conducting any formal testing, we can compute a multi-dimensional scaling (MDS) plot of aggregated signal to explore overall sample similarities. Ideally, such a represenation of the data should separate both clusters and groups from one another. Vice versa, samples from the same cluster/group should fall close to each other.
In our MDS plot on pseudobulk counts (Fig. 2), we can appreciate that the horizontal dimension (MDS dim. 1) clearly separates cell-populations (clusters), while control and stimulated samples (groups) are separated vertically (MDS dim. 2).

library(edgeR)

# compute MDS coordinates
mds <- as.list(assays(pb)) %>% 
    lapply(as.data.frame.matrix) %>% 
    bind_cols %>% 
    DGEList(remove.zeros = TRUE) %>% 
    calcNormFactors %>% 
    plotMDS.DGEList(plot = FALSE)
    
# prep. data.frame for plotting
gg_df <- data.frame(mds[c("x", "y")],
    cluster_id = rep(kids, each = ns),
    sample_id = rep(sids, nk),
    group_id = ei$group_id[match(sids, ei$sample_id)])

ggplot(gg_df, aes(x, y, col = cluster_id, shape = group_id)) + 
    geom_point(size = 3, alpha = 0.8) +
    labs(x = "MDS dim. 1", y = "MDS dim. 2") + 
    theme(panel.grid.minor = element_blank()) +
    coord_fixed() + theme_bw()
Pseudobulk-level MDS plot. Each point represents one cluster-sample instance; points are colored by cluster ID and shaped by group ID.

Figure 2: Pseudobulk-level MDS plot. Each point represents one cluster-sample instance; points are colored by cluster ID and shaped by group ID.

Cluster-sample cell-counts

While DE analysis is typically used for comparison between cell-types, and may struggle with rare subpopulations, DS analysis compares cluster-sample instances that are likely to be much smaller. Thus, DS analysis may only be applicable to more prominent populations. It is therefore recommended to check cluster-sample cell-counts, and to possibly exclude small instances from downstream analyses. In our example, we might consider, for instance, omitting DS analysis on the “Megakaryocytes” and “Dendritic cells” clusters, as these contain less than 30 cells across almost all samples.

options(width = 100)
table(sce$cluster_id, sce$sample_id)
##                    
##                     ctrl101 ctrl1015 ctrl1016 ctrl1039 ctrl107 ctrl1244 ctrl1256 ctrl1488 stim101
##   B cells               101      424      119       25      43      109      205      203     130
##   CD14+ Monocytes       136      644      315       86     176      314      292      251     166
##   CD4 T cells           288      819      413      177     155     1041      973     1151     375
##   CD8 T cells            72      174      532       26      20       56      126       55     104
##   Dendritic cells         5        6        6        3       1       17        5        9      13
##   FCGR3A+ Monocytes      50      177       85       14      25       28       36       75      91
##   Megakaryocytes          9       16        7        4       4       12       14       18       7
##   NK cells               69      174      127       17      40      104      257      116     102
##                    
##                     stim1015 stim1016 stim1039 stim107 stim1244 stim1256 stim1488
##   B cells                319      114       35      50       86      196      259
##   CD14+ Monocytes        550      282      111     146      244      307      309
##   CD4 T cells            714      352      265     176      835      934     1449
##   CD8 T cells            128      485       29      15       30      109       39
##   Dendritic cells          9        6        3       6       15       13       21
##   FCGR3A+ Monocytes      159       93       22      26       22       51      113
##   Megakaryocytes          17       10       10       3        9       15       26
##   NK cells               191      194       23      46      113      231      167

Testing for DS

To calculate differential tests, we require a matrix describing the experimental design, and a contrast matrix specifying the comparison of interest, i.e., the combination of model parameters assumed to equal zero under the null hypothesis.

Design and contrast matrices of appropriate format can be created using the model.matrix (stats package) and makeContrast function (limma package), respectively. For clarity, we also set the dimension names of the design matrix (though this is not required by edgeR).

library(limma)

# construct design & contrast matrix
(design <- model.matrix(~ 0 + ei$group_id) %>% 
    set_rownames(ei$sample_id) %>% 
    set_colnames(levels(ei$group_id)))
##          ctrl stim
## ctrl101     1    0
## ctrl1015    1    0
## ctrl1016    1    0
## ctrl1039    1    0
## ctrl107     1    0
## ctrl1244    1    0
## ctrl1256    1    0
## ctrl1488    1    0
## stim101     0    1
## stim1015    0    1
## stim1016    0    1
## stim1039    0    1
## stim107     0    1
## stim1244    0    1
## stim1256    0    1
## stim1488    0    1
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$`ei$group_id`
## [1] "contr.treatment"
(contrast <- makeContrasts("stim-ctrl", levels = design))
##       Contrasts
## Levels stim-ctrl
##   ctrl        -1
##   stim         1
# for ea. cluster, run edgeR w/ default parameters
res <- lapply(kids, function(k) {
    y <- assays(pb)[[k]]
    y <- DGEList(y, remove.zeros = TRUE)
    y <- calcNormFactors(y)
    y <- estimateDisp(y, design)
    fit <- glmQLFit(y, design)
    fit <- glmQLFTest(fit, contrast = contrast)
    topTags(fit, n = Inf, sort.by = "none")$table %>% 
        dplyr::mutate(gene = rownames(.), cluster_id = k) %>% 
        dplyr::rename(p_val = PValue, p_adj = FDR)
})

Results filtering & overview

To get a general overview of the differential testing results, we first filter them to retain hits with FDR < 5% and  | logFC |  > 1, and count the number of differential findings by cluster.

# filter FDR < 0.05, |logFC| > 1 & sort by FDR
res_fil <- lapply(res, 
    function(u)  u %>% 
        dplyr::filter(p_adj < 0.05, abs(logFC) > 1) %>% 
        dplyr::arrange(p_adj))

# nb. & % of DE genes per cluster
n_de <- vapply(res_fil, nrow, numeric(1))
cbind(n_de, p_gs = n_de / nrow(sce) * 100)
##                   n_de      p_gs
## B cells            591  9.215656
## CD14+ Monocytes   1730 26.976454
## CD4 T cells        440  6.861063
## CD8 T cells        241  3.757992
## Dendritic cells    196  3.056292
## FCGR3A+ Monocytes  854 13.316700
## Megakaryocytes     117  1.824419
## NK cells           373  5.816311

Between-cluster concordance

DS analysis aims at identifying cell-type-specific changes (in expression) across conditions. In this setting, key questions of interest arise, e.g., which genes are DE in only a single (or very few) clusters? How many DE genes are shared between clusters? In summary, what is the general concordance in differential findings between clusters?

To gain an impression of the between-cluster (dis-)agreement on DE genes, we generate an upset-plot that visualizas the number of DE genes that are shared across or unique to certain clusters:

library(UpSetR)
upset(fromList(map(res_fil, "gene")))

From the upset-plot above we can deduce that, for instance,  > 150 of genes are DE across all clusters, about half of CD14+ Monocytes DE genes are unique to that cluster, and that the two Monocytes clusters (FCGR3A+ and CD4+) share the largest set of DE genes ( ≈ 350), ect..

Visualization

Dimension reduction

One of the most popular plots for representing single-cell data are t-SNE plots, where each cell is represented in a lower, usually two-dimensional, space computed using t-stochastic neighbor embedding (t-SNE).

Dimensionality reductions available within our SCE can be accessed via reducedDims from the scater package, and visualized using plotReducedDim. For our dataset, the t-SNE colored by cluster IDs (Fig. 3; left) shows that cell-populations are well-separated from one another. INF-β stimulation manifests as a severe shift in the t-SNE projection of cells (Fig. 3; right), indicating widespread, genome-scale transcriptiontal changes.

# t-SNE colored by cluster ID
plotReducedDim(sce, use_dimred = "TSNE", 
    colour_by = "cluster_id", point_size = 0.8, point_alpha = 0.4) + 
    guides(fill = guide_legend(override.aes = list(alpha = 1, size = 5)))

# t-SNE colored by group ID
plotReducedDim(sce, use_dimred = "TSNE", 
    colour_by = "group_id", point_size = 0.8, point_alpha = 0.4) + 
    guides(fill = guide_legend(override.aes = list(alpha = 1, size = 5)))
t-SNE colored by cluster ID (left) and group ID (right), respectively.t-SNE colored by cluster ID (left) and group ID (right), respectively.

Figure 3: t-SNE colored by cluster ID (left) and group ID (right), respectively.

Another nonlinear dimensionality reduction technique, uniform manifold approximation and projection (UMAP), has recently been directly compared to t-SNE, and shown to outperform t-SNE in runtime, reproducibility, and its ability to organize cells into meaningful clusters.
Besides those already available, we can compute additional dimension reductions (Dr) using scater’s runDR functions, where DR may be any of PCA, MDS, TSNE, UMAP or DiffusionMap, and visualize these as above:

sce <- runUMAP(sce)
plotReducedDim(sce, use_dimred = "UMAP")

Cell-level visualization

For changes of high interest, we can view the cell-level expressions of specific genes in a specific cluster and across samples. The top-hits for each cluster can be pulled as follows:

# pull top-n genes for ea. cluster
top_gs <- lapply(res_fil, function(u) u$gene[seq_len(9)])

We use scater::plotExpression to generate violin plots for the top differential genes identified for each cluster, specifying x = "sample_id" to obtain one violin per sample, and colour_by = "group_id" to highlight the experimental condition each sample belongs to.

Note that, as we are testing for DE on the cluster-level, for plotting, we need to subset the cells that have been assigned to a given cluster k. To this end, we first split the character vector of cells colnames(sce) by the colData column "cluster_id":

# split cells by cluster
cs_by_k <- split(colnames(sce), sce$cluster_id)

lapply(kids, function(k) {
    gs <- top_gs[[k]]  # get top gene-hits for cluster k
    cs <- cs_by_k[[k]] # subset cells assigned to cluster k
    plotExpression(sce[, cs], features = gs, 
        x = "sample_id", colour_by = "group_id", ncol = 3) +
        guides(fill = guide_legend(override.aes = list(size = 5, alpha = 1))) +
        theme_classic() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
})

B cells

Violin plots. Shown are the top-9 (lowest adj. p-value) differential genes in each cluster.<br>x-axis = samples, y-axis = expression (log-normalized counts). Cells are colored by group ID.

Figure 4: Violin plots. Shown are the top-9 (lowest adj. p-value) differential genes in each cluster.
x-axis = samples, y-axis = expression (log-normalized counts). Cells are colored by group ID.

CD14+ Monocytes

Violin plots. Shown are the top-9 (lowest adj. p-value) differential genes in each cluster.<br>x-axis = samples, y-axis = expression (log-normalized counts). Cells are colored by group ID.

Figure 5: Violin plots. Shown are the top-9 (lowest adj. p-value) differential genes in each cluster.
x-axis = samples, y-axis = expression (log-normalized counts). Cells are colored by group ID.

CD4 T cells

Violin plots. Shown are the top-9 (lowest adj. p-value) differential genes in each cluster.<br>x-axis = samples, y-axis = expression (log-normalized counts). Cells are colored by group ID.

Figure 6: Violin plots. Shown are the top-9 (lowest adj. p-value) differential genes in each cluster.
x-axis = samples, y-axis = expression (log-normalized counts). Cells are colored by group ID.

CD8 T cells

Violin plots. Shown are the top-9 (lowest adj. p-value) differential genes in each cluster.<br>x-axis = samples, y-axis = expression (log-normalized counts). Cells are colored by group ID.

Figure 7: Violin plots. Shown are the top-9 (lowest adj. p-value) differential genes in each cluster.
x-axis = samples, y-axis = expression (log-normalized counts). Cells are colored by group ID.

Dendritic cells

Violin plots. Shown are the top-9 (lowest adj. p-value) differential genes in each cluster.<br>x-axis = samples, y-axis = expression (log-normalized counts). Cells are colored by group ID.

Figure 8: Violin plots. Shown are the top-9 (lowest adj. p-value) differential genes in each cluster.
x-axis = samples, y-axis = expression (log-normalized counts). Cells are colored by group ID.

FCGR3A+ Monocytes

Violin plots. Shown are the top-9 (lowest adj. p-value) differential genes in each cluster.<br>x-axis = samples, y-axis = expression (log-normalized counts). Cells are colored by group ID.

Figure 9: Violin plots. Shown are the top-9 (lowest adj. p-value) differential genes in each cluster.
x-axis = samples, y-axis = expression (log-normalized counts). Cells are colored by group ID.

Megakaryocytes

Violin plots. Shown are the top-9 (lowest adj. p-value) differential genes in each cluster.<br>x-axis = samples, y-axis = expression (log-normalized counts). Cells are colored by group ID.

Figure 10: Violin plots. Shown are the top-9 (lowest adj. p-value) differential genes in each cluster.
x-axis = samples, y-axis = expression (log-normalized counts). Cells are colored by group ID.

NK cells

Violin plots. Shown are the top-9 (lowest adj. p-value) differential genes in each cluster.<br>x-axis = samples, y-axis = expression (log-normalized counts). Cells are colored by group ID.

Figure 11: Violin plots. Shown are the top-9 (lowest adj. p-value) differential genes in each cluster.
x-axis = samples, y-axis = expression (log-normalized counts). Cells are colored by group ID.

Sample-level visualization

Especially when wanting to gain an overview of numerous DE testing results for many clusters, both dimension reduction and cell-level visualisations require a lot of space and can become cumbersome to interpret. In this setting, it is thus recommendable to visualise aggregated measurements, e.g., mean expression values by cluster and sample.

We can assemble cluster-sample expression-means by specifying fun = "mean" and replacing counts with logcounts in the code snipped used to compute pseudobulk-counts in section ??:

# calculate expression-means by cluster-sample
ms <- aggregate.Matrix(t(logcounts(sce)), 
    groupings = groups, fun = "mean") %>% 
    split.data.frame(rep(kids, ns)) %>% 
    lapply(function(u) set_colnames(t(u), unname(sids))) %>% 
    SingleCellExperiment(assays = .)

To better highlight relative differences across conditions, we construct the helper-function .z_norm that applies a z-normalization to normalize expression values to mean 0 and standard deviation 1:

.z_norm <- function(x, th = 2.5) {
    x <- as.matrix(x)
    sds <- rowSds(x, na.rm = TRUE)
    sds[sds == 0] <- 1
    x <- t(t(x - rowMeans(x, na.rm = TRUE)) / sds)
    #x <- (x - rowMeans(x, na.rm = TRUE)) / sds
    x[x >  th] <-  th
    x[x < -th] <- -th
    return(x)
}

We further construct the wrapper .plot_diff_hm that takes as input a SCE of pseudobulk mean-logcounts, a cluster ID k, a character string of genes gs, and an experimental design table ei:

.plot_diff_hm <- function(ms, k, gs, ei) {
    mat <- assays(ms)[[k]][gs, ]
    m <- match(colnames(mat), ei$sample_id)
    cols <- hue_pal()(nlevels(ei$group_id))
    names(cols) <- levels(ei$group_id)
    col_anno <- columnAnnotation(
        df = data.frame(group_id = ei$group_id[m]),
        col = list(group_id = cols),
        gp = gpar(col = "white"),
        show_annotation_name = FALSE)
    Heatmap(.z_norm(mat), 
        column_title = k,
        name = "z-normalized\nexpression",
        col = c("royalblue", "cornflowerblue", "white", "gold", "orange"),
        cluster_rows = FALSE,
        cluster_columns = FALSE,
        row_names_side = "left",
        rect_gp = gpar(col = "white"),
        top_annotation = col_anno)
}

.plot_diff_hm uses the ComplexHeatmap package to render a heatmap of z-normalized mean expressions (for a given cluster) across samples such that rows = genes, columns = samples, and columns are annotated with group IDs. We can visualize the top-20 differential findings in each cluster as follows:

lapply(kids, function(k) {
    top20 <- res_fil[[k]]$gene[seq_len(20)] # subset top-20 hits for cluster k
    .plot_diff_hm(ms, k, top20, ei)         # render differential heatmap
})

B cells

Differential heatmap. Shown are the top-20 (lowest adj. p-value) genes identified as differential in each cluster.<br>Coloring represents z-normalized cluster-sample expression means.

Figure 12: Differential heatmap. Shown are the top-20 (lowest adj. p-value) genes identified as differential in each cluster.
Coloring represents z-normalized cluster-sample expression means.

CD14+ Monocytes

Differential heatmap. Shown are the top-20 (lowest adj. p-value) genes identified as differential in each cluster.<br>Coloring represents z-normalized cluster-sample expression means.

Figure 13: Differential heatmap. Shown are the top-20 (lowest adj. p-value) genes identified as differential in each cluster.
Coloring represents z-normalized cluster-sample expression means.

CD4 T cells

Differential heatmap. Shown are the top-20 (lowest adj. p-value) genes identified as differential in each cluster.<br>Coloring represents z-normalized cluster-sample expression means.

Figure 14: Differential heatmap. Shown are the top-20 (lowest adj. p-value) genes identified as differential in each cluster.
Coloring represents z-normalized cluster-sample expression means.

CD8 T cells

Differential heatmap. Shown are the top-20 (lowest adj. p-value) genes identified as differential in each cluster.<br>Coloring represents z-normalized cluster-sample expression means.

Figure 15: Differential heatmap. Shown are the top-20 (lowest adj. p-value) genes identified as differential in each cluster.
Coloring represents z-normalized cluster-sample expression means.

Dendritic cells

Differential heatmap. Shown are the top-20 (lowest adj. p-value) genes identified as differential in each cluster.<br>Coloring represents z-normalized cluster-sample expression means.

Figure 16: Differential heatmap. Shown are the top-20 (lowest adj. p-value) genes identified as differential in each cluster.
Coloring represents z-normalized cluster-sample expression means.

FCGR3A+ Monocytes

Differential heatmap. Shown are the top-20 (lowest adj. p-value) genes identified as differential in each cluster.<br>Coloring represents z-normalized cluster-sample expression means.

Figure 17: Differential heatmap. Shown are the top-20 (lowest adj. p-value) genes identified as differential in each cluster.
Coloring represents z-normalized cluster-sample expression means.

Megakaryocytes

Differential heatmap. Shown are the top-20 (lowest adj. p-value) genes identified as differential in each cluster.<br>Coloring represents z-normalized cluster-sample expression means.

Figure 18: Differential heatmap. Shown are the top-20 (lowest adj. p-value) genes identified as differential in each cluster.
Coloring represents z-normalized cluster-sample expression means.

NK cells

Differential heatmap. Shown are the top-20 (lowest adj. p-value) genes identified as differential in each cluster.<br>Coloring represents z-normalized cluster-sample expression means.

Figure 19: Differential heatmap. Shown are the top-20 (lowest adj. p-value) genes identified as differential in each cluster.
Coloring represents z-normalized cluster-sample expression means.

DA analysis

Differential abundance (DA) analysis compares the proportions of cell types across experimental conditions and aims to highlight populations that are present at different ratios. Even though DA analysis lies beyond the scope of this workflow, we present it here for completeness.

In preparation for DA analysis, we first calculate two tables that contain cell counts for each sample and population (n_cells) and the corresponding proportions of cell types by sample (freqs)3:

# calculate cluster-sample cell counts
n_cells <- table(sce$cluster_id, sce$sample_id)

# calculate cluster proportions across samples
freqs <- prop.table(n_cells, margin = 1)

Visualization

For visual inspection, we generate a barplot of cell type compositions for each sample; see Fig. 20. As alternative represenation, we further render a boxplot where each panel compares the distribution of a given subpopulation’s frequencies between groups; see Fig. 21.

# prep. data.frame for plotting
df <- data.frame(
    frequency = as.numeric(freqs), 
    cluster_id = rep(kids, ns),
    sample_id = rep(sids, each = nk))
m <- match(df$sample_id, ei$sample_id)
df$group_id <- ei$group_id[m]
# barplot of relative cluster-abundances
ggplot(df, aes(x = sample_id, y = frequency, fill = cluster_id)) +
    geom_bar(stat = "identity", position = "fill") + 
    facet_wrap(~ group_id, scales = "free_x") +
    theme_classic()  + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Barplot of subpopulation frequencies; stratified by group ID.

Figure 20: Barplot of subpopulation frequencies; stratified by group ID.

# boxplot of relative cluster-abundances
ggplot(df, aes(x = group_id, y = frequency, color = group_id)) +
    geom_boxplot(outlier.colour = NA) +  geom_jitter() +
    facet_wrap(~ cluster_id, scales = "free_y", ncol = 4) +
    theme_classic()
Boxplot of subpopulation frequencies; stratified by cluster ID.

Figure 21: Boxplot of subpopulation frequencies; stratified by cluster ID.

Testing for DA

For differential testing, we reuse the design and contrast matrix from section ??. In agreement with the results reported by Kang et al. (2018), we find that INF-β treatment does not significantly affect cell type proportions between control and stimulated cells, with the most prominent increase being in the frequency of “Dendritic cells” (logFC  ≈ 0.71):

y <- DGEList(counts = n_cells)
y <- estimateDisp(y, design, trend.method = "none")
fit <- glmFit(y, design)
fit <- glmLRT(fit, contrast = contrast)
topTags(fit, n = Inf)$table %>% round(2)
##                   logFC logCPM   LR PValue  FDR
## Dendritic cells    0.71  12.96 3.09   0.08 0.63
## NK cells           0.22  16.42 0.72   0.40 0.98
## FCGR3A+ Monocytes  0.21  15.61 0.34   0.56 0.98
## Megakaryocytes     0.16  13.28 0.22   0.64 0.98
## CD14+ Monocytes   -0.10  17.69 0.15   0.70 0.98
## CD8 T cells       -0.16  16.46 0.08   0.78 0.98
## B cells           -0.03  16.67 0.02   0.90 0.98
## CD4 T cells        0.01  18.77 0.00   0.98 0.98

Geneset enrichment analysis

Functionally related genes or pathways can be grouped together in so-called genesets. Geneset enrichment analysis (GSEA) aims at identifying genesets that are up- or downregulated in one experimental condition or group versus another, in order to unravel their association with, e.g., disease.

Organize genesets & names

The Molecular Signature Database provides a detailed list of geneset collections for GSEA. Using the msigdbr function, we can retrieve genesets that are of interest for our dataset, namely:

  • hallmark genesets "H"
  • curated genesets "C2" (among others, contains sets derived from the KEGG pathway database that includes pathways associated with immune disease, specifically, Systemic lupus erythematosus)
  • immunologic signatures "C7" (since Lupus is an autoimmune disease)
# retrieve gene sets
# - H:  hallmark genesets
# - C2: curated genesets
# - C7: immunological signatures
library(msigdbr)
(msigdbr(
    species = "Homo sapiens",
    category = c("H", "C2", "C7")) %>% 
    # MSigDB gene symbols are non-capitalize;
    # we use `toupper` to match symbols 
    # with those contained in our dataset
    mutate_at("gene_symbol", toupper) -> m_df) %>% head
## # A tibble: 6 x 9
##   gs_name gs_id gs_cat gs_subcat human_gene_symb… species_name entrez_gene
##   <chr>   <chr> <chr>  <chr>     <chr>            <chr>              <int>
## 1 ABBUD_… M1423 C2     CGP       CAPN9            Homo sapiens       10753
## 2 ABBUD_… M1423 C2     CGP       DCAF11           Homo sapiens       80344
## 3 ABBUD_… M1423 C2     CGP       DDC              Homo sapiens        1644
## 4 ABBUD_… M1423 C2     CGP       HK2              Homo sapiens        3099
## 5 ABBUD_… M1423 C2     CGP       ITGA6            Homo sapiens        3655
## 6 ABBUD_… M1423 C2     CGP       OSTF1            Homo sapiens       26578
## # … with 2 more variables: gene_symbol <chr>, sources <chr>

Ensembl IDs and gene symbols are stored in our SCE’s rowData. We add these to the DS analysis result tables, and do a spot-check to summarize how many genes of the above genesets are contained in our dataset. Additionally, we filter for genesets that comprise at least 20 and at most 1000 genes:

# add gene info to DS result tables
res <- lapply(res, function(u) {
  u$ensembl_id <- rowData(sce)[u$gene, "ENSEMBL"]
  u$symbol <- rowData(sce)[u$gene, "SYMBOL"]
  return(u)
})

# spot-check
sapply(res, function(u) table(u$symbol %in% m_df$gene_symbol))
##       B cells CD14+ Monocytes CD4 T cells CD8 T cells Dendritic cells
## FALSE     476             477         477         473             425
## TRUE     5834            5918        5915        5801            5513
##       FCGR3A+ Monocytes Megakaryocytes NK cells
## FALSE               474            457      471
## TRUE               5879           5671     5835
# filter for set sizes of 20-1000
sets <- split(m_df$gene_symbol, m_df$gs_name)
n <- vapply(sets, length, numeric(1))
sets <- sets[n >= 20 & n < 1000]
length(sets)
## [1] 6662

Run camera on ea. cluster

gs_dat <- lapply(kids, function(k) {
    inds <- ids2indices(sets, res[[k]]$symbol, remove.empty = TRUE)
    dat <- assays(pb[res[[k]]$gene, ])[[k]]
    v <- voom(dat, design)
    f <- lmFit(v, design)
    f <- eBayes(f)
    cf <- contrasts.fit(f, contrast)
    cf <- eBayes(cf)
    list(indices = inds, voom = v, cluster_id = k, contrasts.fit = cf)  
})

# construct data.frame
gs_df <- lapply(gs_dat, function(u)
    camera(u$voom, u$indices, design, contrast) %>% 
        rownames_to_column("geneset")) %>% 
    bind_rows(.id = "cluster_id")

Heatmap summary

cats <- gs_df %>% 
    dplyr::filter(FDR < 1e-20) %>%
    pull(geneset) %>% unique
length(cats)
## [1] 86
gs_df %>% 
    dplyr::filter(geneset %in% cats) %>%
    dplyr::mutate(neg_log10_fdr = -log10(FDR)) %>% 
    acast(cluster_id ~ geneset, value.var = "neg_log10_fdr") %>% 
    set_colnames(strtrim(colnames(.), 30)) %>% 
    pheatmap

Barcode plots

A common visualization of GSEA results are so-called barcode plots. These plots the position of a geneset in a ranked list of test statistics, thereby showing the enrichment of a gene set amongst high or low ranked genes. Statistics are ranked left to right from smallest to largest, with the positions of the specified geneset marked by vertical bars, forming a pattern like a barcode. Here, an accumulation of lines on the left-hand side corresponds to a set of down-regulated genes, and vice versa.

cats_by_cluster <- gs_df %>% 
    group_by(cluster_id) %>% 
    top_n(5, -FDR) %>% 
    group_split %>% 
    set_names(kids) %>% 
    lapply(pull, geneset)

lapply(kids, function(k) {
    cat("#### ", k, "{-}\n")
    lapply(cats_by_cluster[[k]], function(c)
        barcodeplot(
            statistics = gs_dat[[k]]$contrasts.fit$t[, 1],
            index = gs_dat[[k]]$indices[[c]],
            quantiles = c(-1, 1) * qt(0.95, df = 14),
            main = c, cex.main = 0.8))
    cat("\n\n")
})

B cells

CD14+ Monocytes

CD4 T cells

CD8 T cells

Dendritic cells

FCGR3A+ Monocytes

Megakaryocytes

NK cells

$B cells NULL

$CD14+ Monocytes NULL

$CD4 T cells NULL

$CD8 T cells NULL

$Dendritic cells NULL

$FCGR3A+ Monocytes NULL

$Megakaryocytes NULL

$NK cells NULL

Concluding remarks

We have worked through a complete analysis of an examplary multi-sample multi-group scRNA-seq dataset. While the pipeline presented here is, in general, transferable to any dataset of similar experimental design, many analysis steps require careful consideration of the data and may need to be modified. Especially the preprocessing steps applied here were held at a minimum, and should be complemented with, e.g., additional visualizations, quality control and filtering steps.

The analyses presented here are ongoing work. We are in the process of finalizing the muscat R package (multi-sample multi-group scRNA-seq analysis tools), that implements a flexible aggregation wrapper, visualization tools, as well as sample-level (aggregation-based) and cell-level (mixed-models-based) approaches for DS analysis. A current under-development version of muscat is available at http://github.com/HelenaLC/muscat; we hope to make the package available with Bioconductor’s next release.

References

Butler, Andrew, Paul Hoffman, Peter Smibert, Efthymia Papalexi, and Rahul Satija. 2018. “Integrating Single-Cell Transcriptomic Data Across Different Conditions, Technologies, and Species.” Nature Biotechnology 36 (April). Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved. SN -: 411–20. https://doi.org/10.1038/nbt.4096.

Diaz-Mejia, J. Javier, Elaine C. Meng, Alexander R. Pico, Sonya A. MacParland, Troy Ketela, Trevor J. Pugh, Gary D. Bader, and John H. Morris. 2019. “Evaluation of Methods to Assign Cell Type Labels to Cell Clusters from Single-Cell Rna-Sequencing Data.” bioRxiv, January, 562082. https://doi.org/10.1101/562082.

Duó, Angelo, Mark D. Robinson, and Charlotte Soneson. 2018. “A Systematic Performance Evaluation of Clustering Methods for Single-Cell RNA-Seq Data.” F1000 Research 7 (1141). https://doi.org/10.12688/f1000research.15666.1.

Freytag, Saskia, Luyi Tian, Ingrid Lönnstedt, Milicia Ng, and Melanie Bahlo. 2018. “Comparison of Clustering Tools in R for Medium-Sized 10x Genomics Single-Cell RNA-Sequencing Data.” F1000 Research 7 (1297). https://doi.org/10.12688/f1000research.15809.1.

Hafemeister, Christoph, and Rahul Satija. 2019. “Normalization and Variance Stabilization of Single-Cell RNA-Seq Data Using Regularized Negative Binomial Regression.” bioRxiv, January, 576827.

Kang, Hyun Min, Meena Subramaniam, Sasha Targ, Michelle Nguyen, Lenka Maliskova, Elizabeth McCarthy, Eunice Wan, et al. 2018. “Multiplexed Droplet Single-Cell RNA-Sequencing Using Natural Genetic Variation.” Nat Biotechnol 36 (1): 89–94. https://doi.org/10.1038/nbt.4042.

McCarthy, Davis J, Kieran R Campbell, Quin F Wills, and Aaron T L Lun. 2017. “Scater: Pre-Processing, Quality Control, Normalization and Visualization of Single-Cell RNA-Seq Data in R.” Bioinformatics 33 (8): 1179–86. https://doi.org/10.1093/bioinformatics/btw777.

Soneson, Charlotte, and Mark D. Robinson. 2018. “Bias, Robustness and Scalability in Single-Cell Differential Expression Analysis.” Nature Methods 15 (February): 255–61.

Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. n.d. “Computational and Analytical Challenges in Single-Cell Transcriptomics.” Nature Reviews Genetics 16. Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved. SN -: 133 EP. https://doi.org/10.1038/nrg3833.

Stuart, Tim, Andrew Butler, Paul Hoffman, Christoph Hafemeister, Efthymia Papalexi, William M. Mauck, Marlon Stoeckius, Peter Smibert, and Rahul Satija. 2018. “Comprehensive Integration of Single Cell Data.” bioRxiv, January, 460147. https://doi.org/10.1101/460147.

Trapnell, Cole. n.d. “Defining Cell Types and States with Single-Cell Genomics.” Genome Res 25 (10). Department of Genome Sciences, University of Washington, Seattle, Washington 98105, USA.: 1491–8. https://doi.org/10.1101/gr.190595.115.

Wagner, Allon, Aviv Regev, and Nir Yosef. 2016. “Revealing the Vectors of Cellular Identity with Single-Cell Genomics.” Nat Biotechnol 34 (11): 1145–60. https://doi.org/10.1038/nbt.3711.

Zhang, Allen W, Ciara O’Flanagan, Elizabeth Chavez, Jamie LP Lim, Andrew McPherson, Matt Wiens, Pascale Walters, et al. 2019. “Probabilistic Cell Type Assignment of Single-Cell Transcriptomic Data Reveals Spatiotemporal Microenvironment Dynamics in Human Cancers.” bioRxiv, January, 521914. https://doi.org/10.1101/521914.


  1. Note that, in this workflow, expression values are used for visualization only, and that differential analysis is performed on cluster-sample-level aggregates; see section ??.

  2. sorted = TRUE in split() assures that the order of clusters and samples is consistent for latter column-binding.

  3. Proportions are used for plotting only; statistical modeling takes cluster-sample cell counts as input; see section ??.