library("tidyverse")
library("SingleCellExperiment")
set.seed(1)
Multi-condition single-cell RNA-seq analysis with LEMUR
Setup
To start, we load the tidyverse
and SingleCellExperiment
packages:
The call to set.seed
, to set the initial conditions (“seed”) of R’s random number generator, is optional. We make it here to ensure that the results are exactly the same for everyone. Some of the methods here use stochastic sampling, but the expectation is that even with different random seeds, the results will be essentially, for all practical intents and purposes, the same. You can verify this by omiting the call to set.seed
or calling with a different argument.
Example data
We consider a popular dataset by Kang et al. (2018). The authors measured the effect of interferon-\(\beta\) stimulation on blood cells from eight patients. The muscData
package provides the data as a SingleCellExperiment
. If downloading the data with the muscData
package fails, you can also directly download the file from http://oc.embl.de/index.php/s/tpbYcH5P9NfXeM5 and load it into R using sce = readRDS("PATH_TO_THE_FILE")
.
= muscData::Kang18_8vs8() sce
see ?muscData and browseVignettes('muscData') for documentation
loading from cache
sce
class: SingleCellExperiment
dim: 35635 29065
metadata(0):
assays(1): counts
rownames(35635): MIR1302-10 FAM138A ... MT-ND6 MT-CYB
rowData names(2): ENSEMBL SYMBOL
colnames(29065): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
TTTGCATGTCTTAC-1
colData names(5): ind stim cluster cell multiplets
reducedDimNames(1): TSNE
mainExpName: NULL
altExpNames(0):
You can find the number of genes and cells when printing the summary of the sce
. Alternatively, you can use nrow(sce)
, ncol(sce)
, or dim(sce)
to get these values.
In a SingleCellExperiment
object, the metadata (“data about data”) on the cells are accessed with colData(sce)
, and those on the genes with rowData(sce)
.
Follow-up question: why does the documentation for colData
(run ?colData
in the R console) talk about SummarizedExperiment
objects instead of SingleCellExperiment
?
We logarithm transform the data to account for the heteroskedasticity of the counts (C. Ahlmann-Eltze and Huber (2023)), perform PCA to reduce dimensionality, and run UMAP for visualization. We use the scater
package, which adds a new slot called logcounts
and the two slots reducedDims(sce)
named PCA
and UMAP
to the SummarizedExperiment
object. Equivalent functionality exists in Seurat
and scanpy
.
= scater::logNormCounts(sce)
sce = order(MatrixGenerics::rowVars(logcounts(sce)), decreasing = TRUE)
hvg = sce[hvg[1:500], ]
sce # _Note_: in case you get an error about `"function 'as_cholmod_sparse' not provided by package 'Matrix'"`,
# please reinstall the `irlba` package from source (`install.packages("irlba", type = "source")`)!
= scater::runPCA(sce, ncomponents = 50)
sce = scater::runUMAP(sce, dimred = "PCA") sce
The transformGamPoi package from Bioconductor provides a residual_transform
function.
assay(sce, "pearson_residuals") = transformGamPoi::residual_transform(sce, residual = "pearson", on_disk = FALSE)
Too few dimensions for PCA mean that it cannot capture enough of the relevant variation in the data. This leads to a loss of subtle differences between cell states.
Too many dimensions for PCA can also pose a problem. PCA smoothes out the sampling noise in the data. If too many PCA components are included, the smoothing is too weak, and the additional dimensions capture noise that can obscure biologically relevant differences. For more details see Fig. 2d in C. Ahlmann-Eltze and Huber (2023).
The scater package also provides a runTSNE
function
= scater::runTSNE(sce, dimred = "PCA") sce
On the one hand, tSNE and UMAP are de facto standards for visualizing the cell states in single cell data analysis. On the other hand, they are often criticized for failing to represent global structure and distorting differences in the data. A number of alternatives have been suggested:
- Force directed layout of the \(k\) nearest neighbor (kNN) graph (igraph),
- PHATE
- IVIS
- Unification of tSNE and UMAP using contrastive learning (Böhm, 2023)
In this tutorial, we use ggplot2
to visualize the data. This is often more verbose than calling a higher-level, specialized wrapper function (e.g., scater::plotReducedDim(sce, "UMAP", color_by = "stim")
), on the other hand, it has the advantage that we can more easily customize our plots.
In the below two UMAP plots, the cells separate by treatment status (stim
), as well as by annotated cell type (cell
). One goal of this tutorial is to understand how different approaches use these covariates to integrate the data into a shared embedding space or representation.
as_tibble(colData(sce)) |>
mutate(umap = reducedDim(sce, "UMAP")) |>
ggplot(aes(x = umap[,1], y = umap[,2])) + geom_point(aes(color = stim), size = 0.3) + coord_fixed()
colData(sce)$ind
column lists the patient identifier for each cell. Can you use this to create a separate plot for each patient?
One solution could be to use a for
loop and filter the data for each patient, but there is a much simpler solution! Adding facet_wrap() to the ggplot call will automatically create a subpanel for each patient:
as_tibble(colData(sce)) |>
mutate(umap = reducedDim(sce, "UMAP")) |>
ggplot(aes(x = umap[,1], y = umap[,2])) + geom_point(aes(color = stim), size = 0.3) + coord_fixed() +
facet_wrap(vars(ind))
This dataset already comes with cell type annotations. The cell type labels help interpret the results; however, for the purpose of this tutorial, we will not need them and will ignore them from now on.
= as_tibble(colData(sce)) |>
label_pos_df mutate(umap = reducedDim(sce, "UMAP")) |>
summarize(umap = matrix(colMedians(umap), nrow = 1), .by = c(stim, cell))
# Use `geom_text_repel` to avoid overlapping annotations
as_tibble(colData(sce)) |>
mutate(umap = reducedDim(sce, "UMAP")) |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = cell), size = 0.3) +
::geom_text_repel(data = label_pos_df, aes(label = cell)) +
ggrepelcoord_fixed()
The following code is based on Seurat’s Guided Clustering Tutorial.
# For more information about the conversion see `?as.Seurat.CellDataSet`
= Seurat::as.Seurat(muscData::Kang18_8vs8(), data = NULL)
seur_obj = Seurat::NormalizeData(seur_obj, normalization.method = "LogNormalize", scale.factor = 10000)
seur_obj = Seurat::FindVariableFeatures(seur_obj, selection.method = "vst", nfeatures = 500)
seur_obj # Subset to highly variable genes for memory efficiency
= seur_obj[Seurat::VariableFeatures(object = seur_obj),]
seur_obj = Seurat::ScaleData(seur_obj)
seur_obj = Seurat::RunPCA(seur_obj, verbose = FALSE)
seur_obj = Seurat::RunUMAP(seur_obj, dims = 1:10) seur_obj
::DimPlot(seur_obj, reduction = "umap", group.by = "stim") Seurat
Data integration
Figure 1 shows that the cell embedding locations separate by treatment status. For many analyses, we would like to overlay the cells from the stimulated condition with the cells from the control condition. For example, for cell type assignment, we might want to annotate both conditions together, irrespective of treatment. This aspiration is called integration.
The goal is a mathematical representation (e.g., a low-dimensional embedding) of the cells where the treatment status does not visibly affect the positions overall, and all residual variance comes from different cell states. Figure 4 shows an example.
There are many approaches for single-cell data integration (the question is somewhat ill-defined and can be formalized into a mathematical algorithm in many different ways). Luecken et al. (2022) benchmarked several approaches. Here, we show how to do the integration manually, or with harmony
. Later we will compare both to LEMUR
.
A 2D embedding like UMAP or tSNE gives us a first impression if cells from different conditions are mixed, but the results are not quantitative, which means we cannot objectively assess or compare the quality of the result.
One simple metric to measure integration success is to see how mixed the conditions are. For example we can count for each cell how many of the nearest neighbors come from the same condition and how many come from the other conditions. For more information see Luecken et al. (2022).
Follow-up challenge: Write a function to calculate the mixing metric.
Follow-up questions: Can you write an integration function, that scores perfectly on the integration metric? Hint: it can be biologically completely useless. What else would you need to measure to protect against such a pathological solutions?
Manual Projection
In transcriptomic data analysis, each cell is characterized by its gene expression profile. In our case, we decided to consider the 500 most variable genes. Accordingly, each cell is a point in a 500-dimensional gene expression space. Principal component analysis (PCA) finds a \(P\)-dimensional space (\(P\le500\)) such that overall, distances between each cell in the original space and the reduced space are minimized.
To make these concepts more tangible, consider the cartoon in Figure 5. The orange blob represents all cells from the control condition in a 3-dimensional gene expression space. The grey planar rectangle \(R\) is a lower-dimensional subspace covering the shape of the orange blob. The shape of the blue blob (i.e., the treated cells) resembles the orange blob but is slightly offset, in a different planar rectangle. To integrate both conditions, we can project the two planes onto each other, so that each point from the blue blob is mapped into the orange subspace.
We can implement this procedure in a few lines of R code:
- We create a matrix for the cells from the control and treated conditions,
- we fit PCA on the control cells,
- we center the data, and finally
- we project the cells from both conditions onto the subspace of the control condition
# Each column from the matrix is the coordinate of a cell in the 500-dimensional space
= as.matrix(logcounts(sce)[,sce$stim == "ctrl"])
ctrl_mat = as.matrix(logcounts(sce)[,sce$stim == "stim"])
stim_mat
= rowMeans(ctrl_mat)
ctrl_centers = rowMeans(stim_mat)
stim_centers
# `prcomp` is R's name for PCA and IRLBA is an algorithm to calculate it.
= irlba::prcomp_irlba(t(ctrl_mat - ctrl_centers), n = 20)
ctrl_pca
# With a little bit of linear algebra, we project the points onto the subspace of the control cells
= matrix(NA, nrow = 20, ncol = ncol(sce))
integrated_mat $stim == "ctrl"] = t(ctrl_pca$rotation) %*% (ctrl_mat - ctrl_centers)
integrated_mat[,sce$stim == "stim"] = t(ctrl_pca$rotation) %*% (stim_mat - stim_centers) integrated_mat[,sce
We check that our implementation works, by looking at the UMAP of the integrated data.
= uwot::umap(t(integrated_mat))
int_mat_umap
as_tibble(colData(sce)) |>
mutate(umap = int_mat_umap) |>
sample_frac() |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = stim), size = 0.3) +
coord_fixed()
The overlap is not perfect, but better than in Figure 1.
"stim"
subspace instead?
The projection is orthogonal onto the subspace, which means it matters which condition is chosen as reference.
= irlba::prcomp_irlba(t(stim_mat - stim_centers), n = 20, center = FALSE)
stim_pca
= matrix(NA, nrow = 20, ncol = ncol(sce))
integrated_mat2 $stim == "ctrl"] = t(stim_pca$rotation) %*% (ctrl_mat - ctrl_centers)
integrated_mat2[,sce$stim == "stim"] = t(stim_pca$rotation) %*% (stim_mat - stim_centers)
integrated_mat2[,sce
= uwot::umap(t(integrated_mat2))
int_mat_umap2
as_tibble(colData(sce)) |>
mutate(umap = int_mat_umap2) |>
sample_frac() |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = stim), size = 0.3) +
coord_fixed()
For this example, using the "stim"
condition as the reference leads to a worse integration.
The projection approach consists of three steps:
- Centering the data (e.g.,
ctrl_mat - ctrl_centers
). - Choosing a reference condition and calculating the subspace that approximates the data from the reference condition (
irlba::prcomp_irlba(t(stim_mat - stim_centers))$rotation
). - Projecting the data from the other conditions onto that subspace (
t(ctrl_pca$rotation) %*% (stim_mat - stim_centers)
).
For arbitrary experimental designs, we can perform the centering with a linear model fit. The second step remains the same. And after calculating the centered matrix, the third step is also straight forward. Below are the outlines how such a general procedure would work.
# A complex experimental design
= lm(t(logcounts(sce)) ~ condition + batch, data = colData(sce))
lm_fit # The `residuals` function returns the coordinates minus the mean at the condition.
= t(residuals(lm_fit))
centered_mat # Assuming that `is_reference_condition` contains a selection of the cells
= irlba::prcomp_irlba(centered_mat[,is_reference_condition], ...)
ref_pca = t(ref_pca$rotation) %*% centered_mat int_mat
Harmony
Harmony is one popular tool for data integration (Korsunsky et al. 2019). Harmony is relatively fast and can handle more complex experimental designs than just a treatment/control setup. It is built around maximum diversity clustering (Figure 7). Unlike classical clustering algorithms, maximum diversity clustering not just minimizes the distance of each data point to a cluster center but also maximizes the mixing of conditions assigned to each cluster.
packageVersion("harmony")
[1] '1.2.0'
# Attention: You need `packageVersion("harmony") >= "1.0.0"` for this to work.
= harmony::RunHarmony(reducedDim(sce, "PCA"), colData(sce),
harm_mat vars_use = c("stim"))
= uwot::umap(harm_mat)
harm_umap
as_tibble(colData(sce)) |>
mutate(umap = harm_umap) |>
sample_frac() |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = stim), size = 0.3) +
coord_fixed()
Seurat’s integration method is based on mutual nearest neighbors (MNN) by Haghverdi et al. (2018). However, Seurat calls the mutual nearest neighbors integration anchors. The main difference is that Seurat uses CCA instead of PCA for dimensionality reduction (Butler et al. 2018).
# This code only works with Seurat v5!
= Seurat::as.Seurat(muscData::Kang18_8vs8(), data = NULL)
seur_obj2 "originalexp"]] = split(seur_obj2[["originalexp"]], f = seur_obj2$stim)
seur_obj2[[= Seurat::NormalizeData(seur_obj2, normalization.method = "LogNormalize", scale.factor = 10000)
seur_obj2 = Seurat::FindVariableFeatures(seur_obj2, selection.method = "vst", nfeatures = 500)
seur_obj2 = Seurat::ScaleData(seur_obj2)
seur_obj2 = Seurat::RunPCA(seur_obj2, verbose = FALSE)
seur_obj2
= Seurat::IntegrateLayers(object = seur_obj2, method = Seurat::CCAIntegration, orig.reduction = "pca", new.reduction = "integrated.cca", verbose = FALSE)
seur_obj2 = Seurat::RunUMAP(seur_obj2, dims = 1:30, reduction = "integrated.cca")
seur_obj2 ::DimPlot(seur_obj2, reduction = "umap", group.by = "stim") Seurat
Analysis with LEMUR
LEMUR is a tool to analyze multi-condition single-cell data. A typical analysis workflow with LEMUR goes through four steps (Figure 8):
- Covariate-adjusted dimensionality reduction.
- Prediction of the differential expression for each cell per gene.
- Identification of neighborhoods of cells with similar differential expression patterns.
- Pseudobulked differential expression testing per neighborhood.
These are implemented by the following code.
library("lemur")
# Step 1
= lemur(sce, design = ~ stim, n_embedding = 30)
fit = align_by_grouping(fit, fit$colData$cell)
fit
# Step 2
= test_de(fit, contrast = cond(stim = "stim") - cond(stim = "ctrl"))
fit
# Steps 3 and 4
= find_de_neighborhoods(fit, group_by = vars(ind, stim)) nei
In the next sections we discuss these steps in more detail.
LEMUR Integration (Step 1)
Tools like MNN and Harmony take a PCA embedding and remove the effects of the specified covariates. However, there is no way to go back from the integrated embedding to the original gene space. This means that we cannot ask the counterfactual what the expression of a cell from the control condition would have been, had it been treated.
LEMUR achieves such counterfactuals by matching the subspaces of each condition (Constantin Ahlmann-Eltze and Huber 2024). Figure 9 illustrates the principle. In some sense this is a more sophisticated version of the manual matching that we saw in the previous section. The matching is a single affine transformation in contrast to Harmony’s inferred shifts per cluster.
LEMUR takes as input a SingleCellExperiment
object, the specification of the experimental design, and the number of latent dimensions. To refine the embedding, we will use the provided cell type annotations.
library("lemur")
= lemur(sce, design = ~ stim, n_embedding = 30, verbose = FALSE)
fit = align_by_grouping(fit, fit$colData$cell, verbose = FALSE) fit
Making a UMAP plot of LEMUR’s embedding shows that it successfully integrated the conditions (Figure 10).
= uwot::umap(t(fit$embedding))
lemur_umap
as_tibble(colData(sce)) |>
mutate(umap = lemur_umap) |>
sample_frac() |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = stim), size = 0.3) +
coord_fixed()
We can cross-check our embedding with the provided cell type annotations by coloring each cell by cell type (using the information stored in the colData(sce)$cell
column).
as_tibble(colData(sce)) |>
mutate(umap = lemur_umap) |>
sample_frac() |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = cell), size = 0.3) +
coord_fixed()
If there are no cell type labels present, LEMUR uses a modified version of Harmony’s maximum diversity clustering to automatically infer the cell relationships across conditions.
= lemur(sce, design = ~ stim, n_embedding = 30, verbose = FALSE)
fit2 = align_harmony(fit2, verbose = FALSE) fit2
Differential expression analysis (Step 2)
The advantage of LEMUR’s integration is that we can predict what a cell’s expression from the control condition would have been, had it been stimulated and vice versa. Contrasting those predictions tells us how much the gene expression changes for that cell in any gene.
We call LEMUR’s test_de
function to compare the expression values in the "stim"
and "ctrl"
conditions.
= test_de(fit, contrast = cond(stim = "stim") - cond(stim = "ctrl")) fit
We can now pick an individual gene (PLSCR1) and plot the predicted log fold change for each cell to show how it varies as a function of the underlying gene expression values (Figure 12).
as_tibble(colData(sce)) |>
mutate(umap = lemur_umap) |>
mutate(expr = logcounts(fit)["PLSCR1",]) |>
sample_frac() |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = expr), size = 0.3) +
scale_color_viridis_c() +
facet_wrap(vars(stim)) +
coord_fixed()
as_tibble(colData(sce)) |>
mutate(umap = lemur_umap) |>
mutate(de = assay(fit, "DE")["PLSCR1",]) |>
sample_frac() |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = de), size = 0.3) +
scale_color_gradient2(low = scales::muted("blue"), high = scales::muted("red")) +
coord_fixed()
This approach has the advantage over the traditional cluster/cell type-based analysis that it can detect smooth differential expression gradients!
glmGamPoi
is a differential expression tool inspired by edgeR
and DESeq2
that is specifically designed for single cell RNA-seq data.
= glmGamPoi::pseudobulk(fit, group_by = vars(cell, ind, stim)) psce
Aggregating assay 'counts' using 'rowSums2'.
Aggregating assay 'logcounts' using 'rowMeans2'.
Aggregating assay 'pearson_residuals' using 'rowMeans2'.
Aggregating assay 'DE' using 'rowMeans2'.
Aggregating reducedDim 'TSNE' using 'rowMeans2'.
Aggregating reducedDim 'PCA' using 'rowMeans2'.
Aggregating reducedDim 'UMAP' using 'rowMeans2'.
Aggregating reducedDim 'linearFit' using 'rowMeans2'.
Aggregating reducedDim 'embedding' using 'rowMeans2'.
# Filter out NAs
= psce[,! is.na(psce$cell)]
psce = glmGamPoi::glm_gp(psce, design = ~ stim * cell)
glm_fit = glmGamPoi::test_de(glm_fit, contrast = cond(stim = "stim", cell = "B cells") - cond(stim = "ctrl", cell = "B cells"))
glm_de |>
glm_de arrange(pval) |>
head()
name pval adj_pval f_statistic df1 df2 lfc
1 HERC5 9.370681e-36 4.685341e-33 329.5872 1 116.4645 5.152089
2 NT5C3A 7.953371e-35 1.988343e-32 313.5509 1 116.4645 3.760837
3 PLSCR1 1.370641e-32 2.284401e-30 277.2727 1 116.4645 4.032858
4 DYNLT1 1.018124e-31 1.272655e-29 263.9920 1 116.4645 2.913860
5 IRF7 2.424490e-30 2.424490e-28 243.9083 1 116.4645 3.543921
6 SPI1 7.996678e-30 6.663898e-28 236.6273 1 116.4645 -2.440995
We use the pseudobulk
function from glmGamPoi
to calculate the average expression and latent embedding position per condition and cell type. The mean embedding and colData
is then fed to LEMUR’s predict
function.
# You could choose any gene
= "HERC5"
gene_oi
= glmGamPoi::pseudobulk(fit, group_by = vars(stim, cell)) reduced_fit
Aggregating assay 'counts' using 'rowSums2'.
Aggregating assay 'logcounts' using 'rowMeans2'.
Aggregating assay 'pearson_residuals' using 'rowMeans2'.
Aggregating assay 'DE' using 'rowMeans2'.
Aggregating reducedDim 'TSNE' using 'rowMeans2'.
Aggregating reducedDim 'PCA' using 'rowMeans2'.
Aggregating reducedDim 'UMAP' using 'rowMeans2'.
Aggregating reducedDim 'linearFit' using 'rowMeans2'.
Aggregating reducedDim 'embedding' using 'rowMeans2'.
= predict(fit, newdata = colData(reduced_fit), embedding = t(reducedDim(reduced_fit, "embedding")))
lemur_pred_per_cell
as_tibble(colData(reduced_fit)) |>
mutate(obs = logcounts(reduced_fit)[gene_oi,]) |>
mutate(lemur_pred = lemur_pred_per_cell[gene_oi,]) |>
ggplot(aes(x = obs, y = lemur_pred)) +
geom_abline() +
geom_point(aes(color = stim)) +
labs(title = "LEMUR's predictions averaged per cell type and condition")
Neighborhood identification (Step 3) and differential expression tests (Step 4)
To help identify the most interesting changes, LEMUR can aggregate the results of each gene and identify the neighborhood of cells with the most prominent gene expression. To conduct a statistically valid differential expression test, LEMUR pseudobulks the counts for each replicate (here this is the patient).
The output is a data.frame
with one row for each gene. The neighborhood
column is a list column, where each element is a vector with the IDs of the cells that are inside the neighborhood. The other columns describe the significance of the expression change defined in the test_de
call.
= find_de_neighborhoods(fit, group_by = vars(ind, stim))
nei head(nei)
name neighborhood n_cells sel_statistic pval adj_pval
1 FTL AAACATAC.... 28830 -262.0174 2.215696e-02 2.847939e-02
2 ISG15 AAACATAC.... 14798 1106.6952 1.084384e-19 5.421921e-17
3 FTH1 AAACATAC.... 22767 -489.7753 3.792553e-06 1.102486e-05
4 TIMP1 AAACATAC.... 8709 -171.2586 2.590808e-05 6.139355e-05
5 CXCL10 AAACATAC.... 8201 294.7182 1.367230e-07 5.425515e-07
6 CCL2 AAACATAC.... 6446 277.9301 1.037136e-04 2.057809e-04
f_statistic df1 df2 lfc did_pval did_adj_pval did_lfc
1 6.286358 1 17.69881 -0.3982586 0.04472085 0.10117840 -1.12234829
2 2018.940379 1 17.69881 5.1472651 0.03154131 0.07730712 1.01188568
3 43.350214 1 17.69881 -0.9456260 0.99395469 0.99999445 0.00216271
4 31.678724 1 17.69881 -0.9744997 0.05430602 0.11857210 0.62171081
5 70.650228 1 17.69881 6.0220042 0.95981690 0.99999445 0.08575997
6 24.711774 1 17.69881 1.9174593 0.21986884 0.35577483 0.93624372
We can sort the table either by the pval
or the did_pval
. The pval
measures the evidence of an expression change between control and stimulated condition for all cells inside the neighborhood. In contrast, the did_pval
measures how much larger the expression difference is for the cells inside the neighborhood versus outside.
slice_min(nei, pval, n = 3)
name neighborhood n_cells sel_statistic pval adj_pval
1 ISG15 AAACATAC.... 14798 1106.6952 1.084384e-19 5.421921e-17
2 EIF2AK2 AAACATAC.... 27754 552.1799 4.541945e-18 1.135486e-15
3 MX2 AAACATAC.... 8009 616.8117 5.846666e-17 9.744443e-15
f_statistic df1 df2 lfc did_pval did_adj_pval did_lfc
1 2018.940 1 17.69881 5.147265 0.03154131 0.07730712 1.0118857
2 1318.035 1 17.69881 3.437423 0.60364318 0.74893695 0.4965469
3 983.279 1 17.69881 4.222732 0.09041983 0.17918008 -0.5175768
slice_min(nei, did_pval, n = 3)
name neighborhood n_cells sel_statistic pval adj_pval
1 SSB AAACATAC.... 6572 380.2204 5.742228e-16 5.742228e-14
2 RPL23 AAACATAC.... 7648 -193.6324 3.928749e-12 1.033881e-10
3 ENO1 AAACATAC.... 7269 -238.4628 6.214251e-11 8.877501e-10
f_statistic df1 df2 lfc did_pval did_adj_pval did_lfc
1 755.7363 1 17.69881 2.917861 4.140134e-13 2.070067e-10 -2.153032
2 268.0192 1 17.69881 -1.929019 7.878820e-11 1.969705e-08 1.666773
3 191.6963 1 17.69881 -1.705069 7.539221e-10 1.256537e-07 2.021944
= slice_min(nei, did_pval)
top_did_gene
as_tibble(colData(sce), rownames = "cell_name") |>
mutate(umap = lemur_umap) |>
mutate(de = assay(fit, "DE")[top_did_gene$name,]) |>
sample_frac() |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = de), size = 0.3) +
scale_color_gradient2(low = scales::muted("blue"), high = scales::muted("red")) +
coord_fixed() +
labs(title = "Top DE gene by did_pval")
To see which cells LEMUR included in the neighborhood for each gene, we will first make a smaller helper function:
= function(data, fit = NULL){
neighborhoods_to_long_data = data[["neighborhood"]]
nei = data[["name"]]
gene_names = if(! is.null(fit)) colnames(fit) else unique(unlist(nei))
cell_names = seq_along(cell_names)
levels2idx names(levels2idx) = cell_names
= list(
res rep(gene_names, each = length(cell_names)),
rep(cell_names, times = length(gene_names)),
rep(FALSE, length(cell_names) * length(gene_names))
)= 0
offset for(n in nei){
3]][offset + levels2idx[n]] = TRUE
res[[= offset + length(cell_names)
offset
}
names(res) = c("name", "cell", "inside")
::as_tibble(res)
tibble }
With this function, we can annotate for any gene if a cell is included in LEMUR’s neighborhood.
= neighborhoods_to_long_data(top_did_gene, fit)
cells_inside_df
as_tibble(colData(sce), rownames = "cell_name") |>
mutate(umap = lemur_umap) |>
mutate(de = assay(fit, "DE")[top_did_gene$name,]) |>
left_join(cells_inside_df, by = c("cell_name" = "cell")) %>%
sample_frac() |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = de), size = 0.3) +
scale_color_gradient2(low = scales::muted("blue"), high = scales::muted("red")) +
coord_fixed() +
facet_wrap(vars(inside), labeller = label_both) +
labs(title = "Top DE gene by did_pval")
LEMUR’s neighborhoods are adaptive to the underlying expression patterns for each gene. This means that they provide optimal power to detect differential expression as well the set of cells for which the gene expression changes most.
The plot below illustrates this for the top three DE genes by pval
which all have very different neighborhoods. Note that the neighborhoods are convex in the low-dimensional embedding space, even though in the UMAP embedding they might not look like it. This is due to the non-linear nature of the UMAP dimensionality reduction.
= neighborhoods_to_long_data(slice_min(nei, pval, n = 3), fit)
cells_inside_df_top3
as_tibble(colData(sce), rownames = "cell_name") |>
mutate(umap = lemur_umap) |>
mutate(de = as_tibble(t(assay(fit, "DE")[slice_min(nei, pval, n = 3)$name,]))) |>
unpack(de, names_sep = "_") %>%
pivot_longer(starts_with("de_"), names_sep = "_", names_to = c(".value", "gene_name")) %>%
left_join(cells_inside_df_top3, by = c("gene_name" = "name", "cell_name" = "cell")) %>%
sample_frac() |>
ggplot(aes(x = umap[,1], y = umap[,2])) +
geom_point(aes(color = de), size = 0.3) +
scale_color_gradient2(low = scales::muted("blue"), high = scales::muted("red")) +
coord_fixed() +
facet_grid(vars(inside), vars(gene_name), labeller = label_both) +
labs(title = "Top 3 DE gene by pval")
Outlook
LEMUR can handle arbitrary design matrices as input, which means that it can also handle more complicated experiments than the simple treatment / control comparison shown here. Just as with limma, edgeR or DESeq2, you can model
- multiple covariates, e.g., the treatment of interest as well as known confounders (experiment batch, sex, age) or blocking variables,
- interactions, e.g., drug treatment in wild type and mutant,
- continuous covariates, e.g., time courses, whose effects can also be modeled using smooth non-linear functions (spline regression).
For more details, see Constantin Ahlmann-Eltze and Huber (2024). For any questions, whether technical/practical or conceptual, please post to the Bioconductor forum using the lemur
tag and follow the posting guide.
Session Info
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Rome
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] lemur_1.2.0 muscData_1.18.0
[3] ExperimentHub_2.12.0 AnnotationHub_3.12.0
[5] BiocFileCache_2.12.0 dbplyr_2.5.0
[7] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
[9] Biobase_2.64.0 GenomicRanges_1.56.1
[11] GenomeInfoDb_1.40.1 IRanges_2.38.0
[13] S4Vectors_0.42.0 BiocGenerics_0.50.0
[15] MatrixGenerics_1.16.0 matrixStats_1.3.0
[17] lubridate_1.9.3 forcats_1.0.0
[19] stringr_1.5.1 dplyr_1.1.4
[21] purrr_1.0.2 readr_2.1.5
[23] tidyr_1.3.1 tibble_3.2.1
[25] ggplot2_3.5.1 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] DBI_1.2.3 transformGamPoi_1.10.0
[3] gridExtra_2.3 rlang_1.1.4
[5] magrittr_2.0.3 scater_1.32.0
[7] RcppAnnoy_0.0.22 compiler_4.4.1
[9] RSQLite_2.3.7 DelayedMatrixStats_1.26.0
[11] png_0.1-8 vctrs_0.6.5
[13] pkgconfig_2.0.3 crayon_1.5.2
[15] fastmap_1.2.0 XVector_0.44.0
[17] labeling_0.4.3 scuttle_1.14.0
[19] utf8_1.2.4 rmarkdown_2.27
[21] tzdb_0.4.0 UCSC.utils_1.0.0
[23] ggbeeswarm_0.7.2 bit_4.0.5
[25] xfun_0.45 zlibbioc_1.50.0
[27] cachem_1.1.0 beachmat_2.20.0
[29] jsonlite_1.8.8 blob_1.2.4
[31] DelayedArray_0.30.1 BiocParallel_1.38.0
[33] irlba_2.3.5.1 parallel_4.4.1
[35] R6_2.5.1 stringi_1.8.4
[37] Rcpp_1.0.12 knitr_1.47
[39] splines_4.4.1 Matrix_1.7-0
[41] timechange_0.3.0 tidyselect_1.2.1
[43] viridis_0.6.5 rstudioapi_0.16.0
[45] abind_1.4-5 yaml_2.3.8
[47] codetools_0.2-20 curl_5.2.1
[49] lattice_0.22-6 withr_3.0.0
[51] KEGGREST_1.44.0 evaluate_0.24.0
[53] Biostrings_2.72.1 pillar_1.9.0
[55] BiocManager_1.30.23 filelock_1.0.3
[57] generics_0.1.3 BiocVersion_3.19.1
[59] hms_1.1.3 sparseMatrixStats_1.16.0
[61] munsell_0.5.1 scales_1.3.0
[63] RhpcBLASctl_0.23-42 glue_1.7.0
[65] tools_4.4.1 BiocNeighbors_1.22.0
[67] ScaledMatrix_1.12.0 cowplot_1.1.3
[69] grid_4.4.1 AnnotationDbi_1.66.0
[71] colorspace_2.1-0 GenomeInfoDbData_1.2.12
[73] beeswarm_0.4.0 BiocSingular_1.20.0
[75] vipor_0.4.7 cli_3.6.2
[77] rsvd_1.0.5 rappdirs_0.3.3
[79] fansi_1.0.6 viridisLite_0.4.2
[81] S4Arrays_1.4.1 uwot_0.2.2
[83] glmGamPoi_1.16.0 gtable_0.3.5
[85] digest_0.6.35 ggrepel_0.9.5
[87] SparseArray_1.4.8 farver_2.1.2
[89] htmlwidgets_1.6.4 memoise_2.0.1
[91] htmltools_0.5.8.1 lifecycle_1.0.4
[93] httr_1.4.7 mime_0.12
[95] harmony_1.2.0 bit64_4.0.5