Bioconductor 5xx:
Analysis of multi-sample
multi-group scRNA-seq data
Introduction
What is DS analysis?
A fundamental task in the analysis of single-cell RNA-sequencing (scRNA-seq) data is the identification of systematic transcriptional changes(Stegle, Teichmann, and Marioni, n.d.). Such analyses are a critical step in the understanding of molecular responses, and have applications in development, in perturbation studies or in disease.
Most of the current scRNA-seq differential expression (DE) analysis methods are designed to test one set of cells against another (or more generally, multiple sets together), and can be used to compare cell clusters (e.g., for identifying marker genes) or across conditions (cells from one condition versus another) (Soneson and Robinson 2018). In such statistical models, the cells are the experimental units and thus represent the population that inferences will extrapolate to.
Using established terminology, we refer to cell identity as the combination of cell type, a stable molecular signature, and cell state, a transient snapshot of a cell’s molecular events (Wagner, Regev, and Yosef 2016; Trapnell, n.d.). This classification is inherently arbitrary, but still provides a basis for biological interpretation and a framework for discovering interesting expression patterns from scRNA-seq datasets. For example, T cells could be defined as a single (albeit diverse) cell type or could be divided into discrete subtypes, if relevant information to categorize each cell at this level were available. In either case, the framework presented here would be able to focus on the cell type of interest and look for changes (in expression) across samples.
Given the emergence of multi-sample multi-group scRNA-seq datasets, the goal becomes making sample-level inferences (i.e., experimental units are samples). Thus, differential state (DS) analysis is defined as following a given cell type across a set of samples (e.g., individuals) and experimental conditions (e.g., treatments), in order to identify cell-type-specific responses, i.e., changes in cell state. DS analysis: i) should be able to detect diluted changes that only affect a single cell type, a subset of cell types or even a subset of a single subpopulation; and, ii) is intended to be orthogonal to clustering or cell type assignment.
Starting point
The starting point for a DS analysis is a (sparse) matrix of gene expression, either as counts or some kind of normalized data, where rows = genes and columns = cells. Each cell additionally has a cluster (subpopulation) label as well as a sample label; metadata should accompany the list of samples, such that they can be organized into comparable groups with sample-level replicates (e.g., via a design matrix).
The approach presented here is modular and thus the type label could originate from an earlier step in the analysis, such as clustering (Duó, Robinson, and Soneson 2018; Freytag et al. 2018), perhaps after integration (Butler et al. 2018; Stuart et al. 2018) or after labeling of clusters (Diaz-Mejia et al. 2019) or after cell-level type assignment (Zhang et al. 2019). Although we have pipelines for these various entry points, the specific details and suitability of all these pre-processing steps needs careful exploration and are beyond the scope of this workflow.
Getting started
Data description
To illustrate a standard DS analysis workflow, we will use Kang et al. (2018)’s droplet-based scRNA-seq data of PBMCs cells from 8 lupus patients measured before and after 6h-treatment with INF-β (16 samples in total). The original dataset is deposited under Gene Expression Ombnibus (GEO) accession GSE96583, and has been made available via Bioconductor’s ExperimentHub
web resource as a SingleCellExperiment (SCE) object that contains unfiltered raw counts, and any gene and cell metadata available from the original data source.
Loading the data
We first initialize a Hub instance to search for and load available data with the ExperimentHub
function, and store the complete list of records in the variable eh
. Using query
, we then retrieve any records that match our keyword(s) of interest, as well as their accession IDs (EH1234).
library(ExperimentHub)
eh <- ExperimentHub()
query(eh, "Kang")
## ExperimentHub with 3 records
## # snapshotDate(): 2019-06-20
## # $dataprovider: NCI_GDC, GEO
## # $species: Homo sapiens
## # $rdataclass: BSseq, SingleCellExperiment, character
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## # tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["EH1661"]]'
##
## title
## EH1661 | Whole Genome Bisulfit Sequencing Data for 47 samples
## EH1662 | Whole Genome Bisulfit Sequencing Data for 47 samples
## EH2259 | Kang18_8vs8
Finally, we load the data of interest into R via [[
and the corresponding accession ID:
(sce <- eh[["EH2259"]])
## class: SingleCellExperiment
## dim: 35635 29065
## metadata(0):
## assays(1): counts
## rownames(35635): MIR1302-10 FAM138A ... MT-ND6 MT-CYB
## rowData names(2): ENSEMBL SYMBOL
## colnames(29065): AAACATACAATGCC-1 AAACATACATTTCC-1 ...
## TTTGCATGGTTTGG-1 TTTGCATGTCTTAC-1
## colData names(5): ind stim cluster cell multiplets
## reducedDimNames(1): TSNE
## spikeNames(0):
For data handling and visualization throughout this workflow, we will require the following packages:
# data handling
library(dplyr)
library(magrittr)
library(Matrix)
library(purrr)
library(reshape2)
library(S4Vectors)
library(tibble)
# visualzation
library(ComplexHeatmap)
library(ggplot2)
library(pheatmap)
library(scales)
Data preparation
Before proceeding with basic preprocessing and filtering steps, we drop non-singlet cells as well as cells that have not been assigned a cluster ID:
sce <- sce[, sce$multiplets == "singlet" & !is.na(sce$cell)]
dim(sce)
## [1] 35635 24673
For simplicity, we will further retain only cell-metadata columns that are relevant to our analysis (and use intuitive names for these):
colData(sce) %>%
as.data.frame %>%
transmute(
group_id = stim,
patient_id = ind,
sample_id = paste0(stim, ind),
cluster_id = cell) %>%
mutate_all(as.factor) %>%
set_rownames(colnames(sce)) %>%
DataFrame -> colData(sce)
head(colData(sce))
## DataFrame with 6 rows and 4 columns
## group_id patient_id sample_id cluster_id
## <factor> <factor> <factor> <factor>
## AAACATACATTTCC-1 ctrl 1016 ctrl1016 CD14+ Monocytes
## AAACATACCAGAAA-1 ctrl 1256 ctrl1256 CD14+ Monocytes
## AAACATACCATGCA-1 ctrl 1488 ctrl1488 CD4 T cells
## AAACATACCTCGCT-1 ctrl 1256 ctrl1256 CD14+ Monocytes
## AAACATACCTGGTA-1 ctrl 1039 ctrl1039 Dendritic cells
## AAACATACGATGAA-1 ctrl 1488 ctrl1488 CD4 T cells
For consistency and easy accession throughout this workflow, we will store cluster and sample IDs, as well as the number of clusters and samples:
nk <- length(kids <- set_names(levels(sce$cluster_id)))
ns <- length(sids <- set_names(levels(sce$sample_id)))
Finally, we compile a table that summarizes the experimental design:
m <- match(sids, sce$sample_id)
n_cells <- as.numeric(table(sce$sample_id))
(ei <- data.frame(colData(sce)[m, ],
n_cells, row.names = NULL) %>%
select(-"cluster_id"))
## group_id patient_id sample_id n_cells
## 1 ctrl 101 ctrl101 858
## 2 ctrl 1015 ctrl1015 2738
## 3 ctrl 1016 ctrl1016 1723
## 4 ctrl 1039 ctrl1039 461
## 5 ctrl 107 ctrl107 517
## 6 ctrl 1244 ctrl1244 1879
## 7 ctrl 1256 ctrl1256 2108
## 8 ctrl 1488 ctrl1488 2031
## 9 stim 101 stim101 1166
## 10 stim 1015 stim1015 2352
## 11 stim 1016 stim1016 1635
## 12 stim 1039 stim1039 641
## 13 stim 107 stim107 525
## 14 stim 1244 stim1244 1464
## 15 stim 1256 stim1256 2026
## 16 stim 1488 stim1488 2549
Preprocessing
The scater
package (McCarthy et al. 2017) provides a variey of tools for preprocessing and quality control of single-cell transcriptomic data. For completeness, we will apply some minimal filtering steps to
- remove undetected genes
- remove cells with very few or many detected genes
- remove very lowly expressed genes
- compute normalized expression values for visualization
For more thorough preprocessing, we refer to the Quality control with scater vignette.
# remove undetected genes
sce[rowSums(counts(sce) > 0) > 0, ]
## class: SingleCellExperiment
## dim: 18464 24673
## metadata(0):
## assays(1): counts
## rownames(18464): RP11-34P13.8 AL627309.1 ... S100B PRMT2
## rowData names(2): ENSEMBL SYMBOL
## colnames(24673): AAACATACATTTCC-1 AAACATACCAGAAA-1 ...
## TTTGCATGGGACGA-1 TTTGCATGTCTTAC-1
## colData names(4): group_id patient_id sample_id cluster_id
## reducedDimNames(1): TSNE
## spikeNames(0):
dim(sce)
## [1] 35635 24673
We use calculateQCMetrics
to compute various quality control metrics for each cell and gene, stored in the colData
and rowData
, respectively, and proceed with filtering cells and genes as noted above:
library(scater)
# calculate quality control (QC) metrics
sce <- calculateQCMetrics(sce)
# get cells w/ few/many detected genes
sce$is_outlier <- isOutlier(
metric = sce$total_features_by_counts,
nmads = 2, type = "both", log = TRUE)
# remove outlier cells
sce <- sce[, !sce$is_outlier]
dim(sce)
## [1] 35635 22221
# remove lowly expressed genes & normalize
sce <- sce[rowSums(counts(sce) > 1) >= 10, ]
dim(sce)
## [1] 6413 22221
Finally, we use normalize
to calculate log2-transformed normalized expression values by dividing each count by its size factor, adding a pseudo-count of 1, and log-transforming1.
sizeFactors(sce) <- librarySizeFactors(sce)
sce <- normalize(sce)
assayNames(sce)
## [1] "counts" "logcounts"
range(logcounts(sce))
## [1] 0.000000 9.728775
Alternatively, expression values could be obtained via vst
(variance stabilizing transformation) from the sctransform
package (Hafemeister and Satija 2019), which returns Pearson residuals from a regularized negative binomial regression model that can be interpreted as normalized expression values.
DS analysis
Aggregation of single-cell to pseudo-bulk data
In order to leverage existing robust bulk RNA-seq DE frameworks, such as DESeq2
, edgeR
and limma
, we first aggregate measurements for each cluster at the sample level to obtain pseudobulk data. While, in principle, various combinations of input data (raw/(log-)normalized counts, CPM ect.) and summary statistics (sum, mean, median) could be applied, we here default to the sum of raw counts.
For aggregation, we use Matrix.utils
’s aggregate.Matrix
function with the colData
columns "cluster_id"
and "sample_id"
as groupings
. Note that aggregate.Matrix
initially yields a matrix of dimensions #(cluster-sample-instances) × #(genes), which we re-split and transform to obtain, for each cluster, and matrix of dimensions #(genes) × #(samples):
library(Matrix.utils)
system.time({
# aggregate by cluster-sample
groups <- colData(sce)[, c("cluster_id", "sample_id")]
pb <- aggregate.Matrix(t(counts(sce)),
groupings = groups, fun = "sum")
# split by cluster, transform & rename columns
pb <- split.data.frame(pb, rep(kids, ns)) %>%
lapply(function(u) set_colnames(t(u), unname(sids)))
})
## user system elapsed
## 0.823 0.096 0.918
From the code snippet above, we obtain a list of length nk
, where each element contains a dgCMatrix
of pseudobulk counts with rows = genes and columns = samples. For a more elegant show-method and easier data accession, we construct a SCE where each assay sheet corresponds to one cluster:
# construct SCE of pseudo-bulk counts
# (assays = clusters, rows = genes, columns = samples)
(pb <- SingleCellExperiment(assays = pb))
## class: SingleCellExperiment
## dim: 6413 16
## metadata(0):
## assays(8): B cells CD14+ Monocytes ... Megakaryocytes NK cells
## rownames(6413): NOC2L HES4 ... S100B PRMT2
## rowData names(0):
## colnames(16): ctrl101 ctrl1015 ... stim1256 stim1488
## colData names(0):
## reducedDimNames(0):
## spikeNames(0):
The approach used here for aggregation presents but one of many ways, and was selected due to its simplicity and efficiency. Alternatively, one could, for example, split the character vector of cells (colnames(sce)
) by cluster-sample, and use rowSums(counts(sce[, ...]))
for aggregation. However, this approach is comparatively more verbose and less efficient2:
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:reshape2':
##
## dcast, melt
## The following object is masked from 'package:purrr':
##
## transpose
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:SummarizedExperiment':
##
## shift
## The following object is masked from 'package:GenomicRanges':
##
## shift
## The following object is masked from 'package:IRanges':
##
## shift
## The following objects are masked from 'package:S4Vectors':
##
## first, second
system.time({
# split cell by cluster-sample
cs_by_ks <- as.data.frame(colData(sce)) %>%
rownames_to_column("cells") %>% setDT %>%
split(flatten = FALSE, sorted = TRUE,
by = c("cluster_id", "sample_id")) %>%
map_depth(2, "cells")
# for ea. cluster-sample..
pb2 <- map_depth(cs_by_ks, 2, function(cs)
rowSums(counts(sce[, cs]))) %>% # ..compute pseudobulks
map(data.frame) # column-bind samples
})
## user system elapsed
## 9.006 0.062 8.838
Pseudobulk-level MDS plot
Prior to conducting any formal testing, we can compute a multi-dimensional scaling (MDS) plot of aggregated signal to explore overall sample similarities. Ideally, such a represenation of the data should separate both clusters and groups from one another. Vice versa, samples from the same cluster/group should fall close to each other.
In our MDS plot on pseudobulk counts (Fig. 2), we can appreciate that the horizontal dimension (MDS dim. 1) clearly separates cell-populations (clusters), while control and stimulated samples (groups) are separated vertically (MDS dim. 2).
library(edgeR)
# compute MDS coordinates
mds <- as.list(assays(pb)) %>%
lapply(as.data.frame.matrix) %>%
bind_cols %>%
DGEList(remove.zeros = TRUE) %>%
calcNormFactors %>%
plotMDS.DGEList(plot = FALSE)
# prep. data.frame for plotting
gg_df <- data.frame(mds[c("x", "y")],
cluster_id = rep(kids, each = ns),
sample_id = rep(sids, nk),
group_id = ei$group_id[match(sids, ei$sample_id)])
ggplot(gg_df, aes(x, y, col = cluster_id, shape = group_id)) +
geom_point(size = 3, alpha = 0.8) +
labs(x = "MDS dim. 1", y = "MDS dim. 2") +
theme(panel.grid.minor = element_blank()) +
coord_fixed() + theme_bw()
Cluster-sample cell-counts
While DE analysis is typically used for comparison between cell-types, and may struggle with rare subpopulations, DS analysis compares cluster-sample instances that are likely to be much smaller. Thus, DS analysis may only be applicable to more prominent populations. It is therefore recommended to check cluster-sample cell-counts, and to possibly exclude small instances from downstream analyses. In our example, we might consider, for instance, omitting DS analysis on the “Megakaryocytes” and “Dendritic cells” clusters, as these contain less than 30 cells across almost all samples.
options(width = 100)
table(sce$cluster_id, sce$sample_id)
##
## ctrl101 ctrl1015 ctrl1016 ctrl1039 ctrl107 ctrl1244 ctrl1256 ctrl1488 stim101
## B cells 101 424 119 25 43 109 205 203 130
## CD14+ Monocytes 136 644 315 86 176 314 292 251 166
## CD4 T cells 288 819 413 177 155 1041 973 1151 375
## CD8 T cells 72 174 532 26 20 56 126 55 104
## Dendritic cells 5 6 6 3 1 17 5 9 13
## FCGR3A+ Monocytes 50 177 85 14 25 28 36 75 91
## Megakaryocytes 9 16 7 4 4 12 14 18 7
## NK cells 69 174 127 17 40 104 257 116 102
##
## stim1015 stim1016 stim1039 stim107 stim1244 stim1256 stim1488
## B cells 319 114 35 50 86 196 259
## CD14+ Monocytes 550 282 111 146 244 307 309
## CD4 T cells 714 352 265 176 835 934 1449
## CD8 T cells 128 485 29 15 30 109 39
## Dendritic cells 9 6 3 6 15 13 21
## FCGR3A+ Monocytes 159 93 22 26 22 51 113
## Megakaryocytes 17 10 10 3 9 15 26
## NK cells 191 194 23 46 113 231 167
Testing for DS
To calculate differential tests, we require a matrix describing the experimental design, and a contrast matrix specifying the comparison of interest, i.e., the combination of model parameters assumed to equal zero under the null hypothesis.
Design and contrast matrices of appropriate format can be created using the model.matrix
(stats
package) and makeContrast
function (limma
package), respectively. For clarity, we also set the dimension names of the design matrix (though this is not required by edgeR
).
library(limma)
# construct design & contrast matrix
(design <- model.matrix(~ 0 + ei$group_id) %>%
set_rownames(ei$sample_id) %>%
set_colnames(levels(ei$group_id)))
## ctrl stim
## ctrl101 1 0
## ctrl1015 1 0
## ctrl1016 1 0
## ctrl1039 1 0
## ctrl107 1 0
## ctrl1244 1 0
## ctrl1256 1 0
## ctrl1488 1 0
## stim101 0 1
## stim1015 0 1
## stim1016 0 1
## stim1039 0 1
## stim107 0 1
## stim1244 0 1
## stim1256 0 1
## stim1488 0 1
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$`ei$group_id`
## [1] "contr.treatment"
(contrast <- makeContrasts("stim-ctrl", levels = design))
## Contrasts
## Levels stim-ctrl
## ctrl -1
## stim 1
# for ea. cluster, run edgeR w/ default parameters
res <- lapply(kids, function(k) {
y <- assays(pb)[[k]]
y <- DGEList(y, remove.zeros = TRUE)
y <- calcNormFactors(y)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
fit <- glmQLFTest(fit, contrast = contrast)
topTags(fit, n = Inf, sort.by = "none")$table %>%
dplyr::mutate(gene = rownames(.), cluster_id = k) %>%
dplyr::rename(p_val = PValue, p_adj = FDR)
})
Results filtering & overview
To get a general overview of the differential testing results, we first filter them to retain hits with FDR < 5% and | logFC | > 1, and count the number of differential findings by cluster.
# filter FDR < 0.05, |logFC| > 1 & sort by FDR
res_fil <- lapply(res,
function(u) u %>%
dplyr::filter(p_adj < 0.05, abs(logFC) > 1) %>%
dplyr::arrange(p_adj))
# nb. & % of DE genes per cluster
n_de <- vapply(res_fil, nrow, numeric(1))
cbind(n_de, p_gs = n_de / nrow(sce) * 100)
## n_de p_gs
## B cells 591 9.215656
## CD14+ Monocytes 1730 26.976454
## CD4 T cells 440 6.861063
## CD8 T cells 241 3.757992
## Dendritic cells 196 3.056292
## FCGR3A+ Monocytes 854 13.316700
## Megakaryocytes 117 1.824419
## NK cells 373 5.816311
Between-cluster concordance
DS analysis aims at identifying cell-type-specific changes (in expression) across conditions. In this setting, key questions of interest arise, e.g., which genes are DE in only a single (or very few) clusters? How many DE genes are shared between clusters? In summary, what is the general concordance in differential findings between clusters?
To gain an impression of the between-cluster (dis-)agreement on DE genes, we generate an upset-plot that visualizas the number of DE genes that are shared across or unique to certain clusters:
library(UpSetR)
upset(fromList(map(res_fil, "gene")))
From the upset-plot above we can deduce that, for instance, > 150 of genes are DE across all clusters, about half of CD14+ Monocytes DE genes are unique to that cluster, and that the two Monocytes clusters (FCGR3A+ and CD4+) share the largest set of DE genes ( ≈ 350), ect..
Visualization
Dimension reduction
One of the most popular plots for representing single-cell data are t-SNE plots, where each cell is represented in a lower, usually two-dimensional, space computed using t-stochastic neighbor embedding (t-SNE).
Dimensionality reductions available within our SCE can be accessed via reducedDims
from the scater
package, and visualized using plotReducedDim
. For our dataset, the t-SNE colored by cluster IDs (Fig. 3; left) shows that cell-populations are well-separated from one another. INF-β stimulation manifests as a severe shift in the t-SNE projection of cells (Fig. 3; right), indicating widespread, genome-scale transcriptiontal changes.
# t-SNE colored by cluster ID
plotReducedDim(sce, use_dimred = "TSNE",
colour_by = "cluster_id", point_size = 0.8, point_alpha = 0.4) +
guides(fill = guide_legend(override.aes = list(alpha = 1, size = 5)))
# t-SNE colored by group ID
plotReducedDim(sce, use_dimred = "TSNE",
colour_by = "group_id", point_size = 0.8, point_alpha = 0.4) +
guides(fill = guide_legend(override.aes = list(alpha = 1, size = 5)))
Another nonlinear dimensionality reduction technique, uniform manifold approximation and projection (UMAP), has recently been directly compared to t-SNE, and shown to outperform t-SNE in runtime, reproducibility, and its ability to organize cells into meaningful clusters.
Besides those already available, we can compute additional dimension reductions (Dr) using scater
’s runDR
functions, where DR
may be any of PCA
, MDS
, TSNE
, UMAP
or DiffusionMap
, and visualize these as above:
sce <- runUMAP(sce)
plotReducedDim(sce, use_dimred = "UMAP")
Cell-level visualization
For changes of high interest, we can view the cell-level expressions of specific genes in a specific cluster and across samples. The top-hits for each cluster can be pulled as follows:
# pull top-n genes for ea. cluster
top_gs <- lapply(res_fil, function(u) u$gene[seq_len(9)])
We use scater::plotExpression
to generate violin plots for the top differential genes identified for each cluster, specifying x = "sample_id"
to obtain one violin per sample, and colour_by = "group_id"
to highlight the experimental condition each sample belongs to.
Note that, as we are testing for DE on the cluster-level, for plotting, we need to subset the cells that have been assigned to a given cluster k
. To this end, we first split the character vector of cells colnames(sce)
by the colData
column "cluster_id"
:
# split cells by cluster
cs_by_k <- split(colnames(sce), sce$cluster_id)
lapply(kids, function(k) {
gs <- top_gs[[k]] # get top gene-hits for cluster k
cs <- cs_by_k[[k]] # subset cells assigned to cluster k
plotExpression(sce[, cs], features = gs,
x = "sample_id", colour_by = "group_id", ncol = 3) +
guides(fill = guide_legend(override.aes = list(size = 5, alpha = 1))) +
theme_classic() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
})
B cells
CD14+ Monocytes
CD4 T cells
CD8 T cells
Dendritic cells
FCGR3A+ Monocytes
Megakaryocytes
NK cells
Sample-level visualization
Especially when wanting to gain an overview of numerous DE testing results for many clusters, both dimension reduction and cell-level visualisations require a lot of space and can become cumbersome to interpret. In this setting, it is thus recommendable to visualise aggregated measurements, e.g., mean expression values by cluster and sample.
We can assemble cluster-sample expression-means by specifying fun = "mean"
and replacing counts
with logcounts
in the code snipped used to compute pseudobulk-counts in section ??:
# calculate expression-means by cluster-sample
ms <- aggregate.Matrix(t(logcounts(sce)),
groupings = groups, fun = "mean") %>%
split.data.frame(rep(kids, ns)) %>%
lapply(function(u) set_colnames(t(u), unname(sids))) %>%
SingleCellExperiment(assays = .)
To better highlight relative differences across conditions, we construct the helper-function .z_norm
that applies a z-normalization to normalize expression values to mean 0 and standard deviation 1:
.z_norm <- function(x, th = 2.5) {
x <- as.matrix(x)
sds <- rowSds(x, na.rm = TRUE)
sds[sds == 0] <- 1
x <- t(t(x - rowMeans(x, na.rm = TRUE)) / sds)
#x <- (x - rowMeans(x, na.rm = TRUE)) / sds
x[x > th] <- th
x[x < -th] <- -th
return(x)
}
We further construct the wrapper .plot_diff_hm
that takes as input a SCE of pseudobulk mean-logcounts, a cluster ID k
, a character string of genes gs
, and an experimental design table ei
:
.plot_diff_hm <- function(ms, k, gs, ei) {
mat <- assays(ms)[[k]][gs, ]
m <- match(colnames(mat), ei$sample_id)
cols <- hue_pal()(nlevels(ei$group_id))
names(cols) <- levels(ei$group_id)
col_anno <- columnAnnotation(
df = data.frame(group_id = ei$group_id[m]),
col = list(group_id = cols),
gp = gpar(col = "white"),
show_annotation_name = FALSE)
Heatmap(.z_norm(mat),
column_title = k,
name = "z-normalized\nexpression",
col = c("royalblue", "cornflowerblue", "white", "gold", "orange"),
cluster_rows = FALSE,
cluster_columns = FALSE,
row_names_side = "left",
rect_gp = gpar(col = "white"),
top_annotation = col_anno)
}
.plot_diff_hm
uses the ComplexHeatmap
package to render a heatmap of z-normalized mean expressions (for a given cluster) across samples such that rows = genes, columns = samples, and columns are annotated with group IDs. We can visualize the top-20 differential findings in each cluster as follows:
lapply(kids, function(k) {
top20 <- res_fil[[k]]$gene[seq_len(20)] # subset top-20 hits for cluster k
.plot_diff_hm(ms, k, top20, ei) # render differential heatmap
})
B cells
CD14+ Monocytes
CD4 T cells
CD8 T cells
Dendritic cells
FCGR3A+ Monocytes
Megakaryocytes
NK cells
DA analysis
Differential abundance (DA) analysis compares the proportions of cell types across experimental conditions and aims to highlight populations that are present at different ratios. Even though DA analysis lies beyond the scope of this workflow, we present it here for completeness.
In preparation for DA analysis, we first calculate two tables that contain cell counts for each sample and population (n_cells
) and the corresponding proportions of cell types by sample (freqs
)3:
# calculate cluster-sample cell counts
n_cells <- table(sce$cluster_id, sce$sample_id)
# calculate cluster proportions across samples
freqs <- prop.table(n_cells, margin = 1)
Visualization
For visual inspection, we generate a barplot of cell type compositions for each sample; see Fig. 20. As alternative represenation, we further render a boxplot where each panel compares the distribution of a given subpopulation’s frequencies between groups; see Fig. 21.
# prep. data.frame for plotting
df <- data.frame(
frequency = as.numeric(freqs),
cluster_id = rep(kids, ns),
sample_id = rep(sids, each = nk))
m <- match(df$sample_id, ei$sample_id)
df$group_id <- ei$group_id[m]
# barplot of relative cluster-abundances
ggplot(df, aes(x = sample_id, y = frequency, fill = cluster_id)) +
geom_bar(stat = "identity", position = "fill") +
facet_wrap(~ group_id, scales = "free_x") +
theme_classic() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
# boxplot of relative cluster-abundances
ggplot(df, aes(x = group_id, y = frequency, color = group_id)) +
geom_boxplot(outlier.colour = NA) + geom_jitter() +
facet_wrap(~ cluster_id, scales = "free_y", ncol = 4) +
theme_classic()
Testing for DA
For differential testing, we reuse the design and contrast matrix from section ??. In agreement with the results reported by Kang et al. (2018), we find that INF-β treatment does not significantly affect cell type proportions between control and stimulated cells, with the most prominent increase being in the frequency of “Dendritic cells” (logFC ≈ 0.71):
y <- DGEList(counts = n_cells)
y <- estimateDisp(y, design, trend.method = "none")
fit <- glmFit(y, design)
fit <- glmLRT(fit, contrast = contrast)
topTags(fit, n = Inf)$table %>% round(2)
## logFC logCPM LR PValue FDR
## Dendritic cells 0.71 12.96 3.09 0.08 0.63
## NK cells 0.22 16.42 0.72 0.40 0.98
## FCGR3A+ Monocytes 0.21 15.61 0.34 0.56 0.98
## Megakaryocytes 0.16 13.28 0.22 0.64 0.98
## CD14+ Monocytes -0.10 17.69 0.15 0.70 0.98
## CD8 T cells -0.16 16.46 0.08 0.78 0.98
## B cells -0.03 16.67 0.02 0.90 0.98
## CD4 T cells 0.01 18.77 0.00 0.98 0.98
Geneset enrichment analysis
Functionally related genes or pathways can be grouped together in so-called genesets. Geneset enrichment analysis (GSEA) aims at identifying genesets that are up- or downregulated in one experimental condition or group versus another, in order to unravel their association with, e.g., disease.
Organize genesets & names
The Molecular Signature Database provides a detailed list of geneset collections for GSEA. Using the msigdbr
function, we can retrieve genesets that are of interest for our dataset, namely:
- hallmark genesets
"H"
- curated genesets
"C2"
(among others, contains sets derived from the KEGG pathway database that includes pathways associated with immune disease, specifically, Systemic lupus erythematosus) - immunologic signatures
"C7"
(since Lupus is an autoimmune disease)
# retrieve gene sets
# - H: hallmark genesets
# - C2: curated genesets
# - C7: immunological signatures
library(msigdbr)
(msigdbr(
species = "Homo sapiens",
category = c("H", "C2", "C7")) %>%
# MSigDB gene symbols are non-capitalize;
# we use `toupper` to match symbols
# with those contained in our dataset
mutate_at("gene_symbol", toupper) -> m_df) %>% head
## # A tibble: 6 x 9
## gs_name gs_id gs_cat gs_subcat human_gene_symb… species_name entrez_gene
## <chr> <chr> <chr> <chr> <chr> <chr> <int>
## 1 ABBUD_… M1423 C2 CGP CAPN9 Homo sapiens 10753
## 2 ABBUD_… M1423 C2 CGP DCAF11 Homo sapiens 80344
## 3 ABBUD_… M1423 C2 CGP DDC Homo sapiens 1644
## 4 ABBUD_… M1423 C2 CGP HK2 Homo sapiens 3099
## 5 ABBUD_… M1423 C2 CGP ITGA6 Homo sapiens 3655
## 6 ABBUD_… M1423 C2 CGP OSTF1 Homo sapiens 26578
## # … with 2 more variables: gene_symbol <chr>, sources <chr>
Ensembl IDs and gene symbols are stored in our SCE’s rowData
. We add these to the DS analysis result tables, and do a spot-check to summarize how many genes of the above genesets are contained in our dataset. Additionally, we filter for genesets that comprise at least 20 and at most 1000 genes:
# add gene info to DS result tables
res <- lapply(res, function(u) {
u$ensembl_id <- rowData(sce)[u$gene, "ENSEMBL"]
u$symbol <- rowData(sce)[u$gene, "SYMBOL"]
return(u)
})
# spot-check
sapply(res, function(u) table(u$symbol %in% m_df$gene_symbol))
## B cells CD14+ Monocytes CD4 T cells CD8 T cells Dendritic cells
## FALSE 476 477 477 473 425
## TRUE 5834 5918 5915 5801 5513
## FCGR3A+ Monocytes Megakaryocytes NK cells
## FALSE 474 457 471
## TRUE 5879 5671 5835
# filter for set sizes of 20-1000
sets <- split(m_df$gene_symbol, m_df$gs_name)
n <- vapply(sets, length, numeric(1))
sets <- sets[n >= 20 & n < 1000]
length(sets)
## [1] 6662
Run camera
on ea. cluster
gs_dat <- lapply(kids, function(k) {
inds <- ids2indices(sets, res[[k]]$symbol, remove.empty = TRUE)
dat <- assays(pb[res[[k]]$gene, ])[[k]]
v <- voom(dat, design)
f <- lmFit(v, design)
f <- eBayes(f)
cf <- contrasts.fit(f, contrast)
cf <- eBayes(cf)
list(indices = inds, voom = v, cluster_id = k, contrasts.fit = cf)
})
# construct data.frame
gs_df <- lapply(gs_dat, function(u)
camera(u$voom, u$indices, design, contrast) %>%
rownames_to_column("geneset")) %>%
bind_rows(.id = "cluster_id")
Heatmap summary
cats <- gs_df %>%
dplyr::filter(FDR < 1e-20) %>%
pull(geneset) %>% unique
length(cats)
## [1] 86
gs_df %>%
dplyr::filter(geneset %in% cats) %>%
dplyr::mutate(neg_log10_fdr = -log10(FDR)) %>%
acast(cluster_id ~ geneset, value.var = "neg_log10_fdr") %>%
set_colnames(strtrim(colnames(.), 30)) %>%
pheatmap
Barcode plots
A common visualization of GSEA results are so-called barcode plots. These plots the position of a geneset in a ranked list of test statistics, thereby showing the enrichment of a gene set amongst high or low ranked genes. Statistics are ranked left to right from smallest to largest, with the positions of the specified geneset marked by vertical bars, forming a pattern like a barcode. Here, an accumulation of lines on the left-hand side corresponds to a set of down-regulated genes, and vice versa.
cats_by_cluster <- gs_df %>%
group_by(cluster_id) %>%
top_n(5, -FDR) %>%
group_split %>%
set_names(kids) %>%
lapply(pull, geneset)
lapply(kids, function(k) {
cat("#### ", k, "{-}\n")
lapply(cats_by_cluster[[k]], function(c)
barcodeplot(
statistics = gs_dat[[k]]$contrasts.fit$t[, 1],
index = gs_dat[[k]]$indices[[c]],
quantiles = c(-1, 1) * qt(0.95, df = 14),
main = c, cex.main = 0.8))
cat("\n\n")
})
B cells
CD14+ Monocytes
CD4 T cells
CD8 T cells
Dendritic cells
FCGR3A+ Monocytes
Megakaryocytes
NK cells
$B cells
NULL
$CD14+ Monocytes
NULL
$CD4 T cells
NULL
$CD8 T cells
NULL
$Dendritic cells
NULL
$FCGR3A+ Monocytes
NULL
$Megakaryocytes NULL
$NK cells
NULL
Concluding remarks
We have worked through a complete analysis of an examplary multi-sample multi-group scRNA-seq dataset. While the pipeline presented here is, in general, transferable to any dataset of similar experimental design, many analysis steps require careful consideration of the data and may need to be modified. Especially the preprocessing steps applied here were held at a minimum, and should be complemented with, e.g., additional visualizations, quality control and filtering steps.
The analyses presented here are ongoing work. We are in the process of finalizing the muscat
R package (multi-sample multi-group scRNA-seq analysis tools), that implements a flexible aggregation wrapper, visualization tools, as well as sample-level (aggregation-based) and cell-level (mixed-models-based) approaches for DS analysis. A current under-development version of muscat
is available at http://github.com/HelenaLC/muscat; we hope to make the package available with Bioconductor’s next release.
References
Butler, Andrew, Paul Hoffman, Peter Smibert, Efthymia Papalexi, and Rahul Satija. 2018. “Integrating Single-Cell Transcriptomic Data Across Different Conditions, Technologies, and Species.” Nature Biotechnology 36 (April). Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved. SN -: 411–20. https://doi.org/10.1038/nbt.4096.
Diaz-Mejia, J. Javier, Elaine C. Meng, Alexander R. Pico, Sonya A. MacParland, Troy Ketela, Trevor J. Pugh, Gary D. Bader, and John H. Morris. 2019. “Evaluation of Methods to Assign Cell Type Labels to Cell Clusters from Single-Cell Rna-Sequencing Data.” bioRxiv, January, 562082. https://doi.org/10.1101/562082.
Duó, Angelo, Mark D. Robinson, and Charlotte Soneson. 2018. “A Systematic Performance Evaluation of Clustering Methods for Single-Cell RNA-Seq Data.” F1000 Research 7 (1141). https://doi.org/10.12688/f1000research.15666.1.
Freytag, Saskia, Luyi Tian, Ingrid Lönnstedt, Milicia Ng, and Melanie Bahlo. 2018. “Comparison of Clustering Tools in R for Medium-Sized 10x Genomics Single-Cell RNA-Sequencing Data.” F1000 Research 7 (1297). https://doi.org/10.12688/f1000research.15809.1.
Hafemeister, Christoph, and Rahul Satija. 2019. “Normalization and Variance Stabilization of Single-Cell RNA-Seq Data Using Regularized Negative Binomial Regression.” bioRxiv, January, 576827.
Kang, Hyun Min, Meena Subramaniam, Sasha Targ, Michelle Nguyen, Lenka Maliskova, Elizabeth McCarthy, Eunice Wan, et al. 2018. “Multiplexed Droplet Single-Cell RNA-Sequencing Using Natural Genetic Variation.” Nat Biotechnol 36 (1): 89–94. https://doi.org/10.1038/nbt.4042.
McCarthy, Davis J, Kieran R Campbell, Quin F Wills, and Aaron T L Lun. 2017. “Scater: Pre-Processing, Quality Control, Normalization and Visualization of Single-Cell RNA-Seq Data in R.” Bioinformatics 33 (8): 1179–86. https://doi.org/10.1093/bioinformatics/btw777.
Soneson, Charlotte, and Mark D. Robinson. 2018. “Bias, Robustness and Scalability in Single-Cell Differential Expression Analysis.” Nature Methods 15 (February): 255–61.
Stegle, Oliver, Sarah A. Teichmann, and John C. Marioni. n.d. “Computational and Analytical Challenges in Single-Cell Transcriptomics.” Nature Reviews Genetics 16. Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved. SN -: 133 EP. https://doi.org/10.1038/nrg3833.
Stuart, Tim, Andrew Butler, Paul Hoffman, Christoph Hafemeister, Efthymia Papalexi, William M. Mauck, Marlon Stoeckius, Peter Smibert, and Rahul Satija. 2018. “Comprehensive Integration of Single Cell Data.” bioRxiv, January, 460147. https://doi.org/10.1101/460147.
Trapnell, Cole. n.d. “Defining Cell Types and States with Single-Cell Genomics.” Genome Res 25 (10). Department of Genome Sciences, University of Washington, Seattle, Washington 98105, USA.: 1491–8. https://doi.org/10.1101/gr.190595.115.
Wagner, Allon, Aviv Regev, and Nir Yosef. 2016. “Revealing the Vectors of Cellular Identity with Single-Cell Genomics.” Nat Biotechnol 34 (11): 1145–60. https://doi.org/10.1038/nbt.3711.
Zhang, Allen W, Ciara O’Flanagan, Elizabeth Chavez, Jamie LP Lim, Andrew McPherson, Matt Wiens, Pascale Walters, et al. 2019. “Probabilistic Cell Type Assignment of Single-Cell Transcriptomic Data Reveals Spatiotemporal Microenvironment Dynamics in Human Cancers.” bioRxiv, January, 521914. https://doi.org/10.1101/521914.
Note that, in this workflow, expression values are used for visualization only, and that differential analysis is performed on cluster-sample-level aggregates; see section ??.↩
sorted = TRUE
insplit()
assures that the order of clusters and samples is consistent for latter column-binding.↩Proportions are used for plotting only; statistical modeling takes cluster-sample cell counts as input; see section ??.↩