Multi-condition single-cell RNA-seq analysis with LEMUR

Author

Constantin Ahlmann-Eltze & Wolfgang Huber

Published

June 25, 2024

Figure 1 from *Analysis of multi-condition single-cell data with latent embedding multivariate regression* by Constantin Ahlmann-Eltze and Huber (2024). The goal is to identify cells that show gene expression changes associated with experimental perturbations or study conditions.

Setup

To start, we load the tidyverse and SingleCellExperiment packages:

library("tidyverse")
library("SingleCellExperiment")
set.seed(1)

The call to set.seed, to set the initial conditions (“seed”) of R’s random number generator, is optional. We make it here to ensure that the results are exactly the same for everyone. Some of the methods here use stochastic sampling, but the expectation is that even with different random seeds, the results will be essentially, for all practical intents and purposes, the same. You can verify this by omiting the call to set.seed or calling with a different argument.

Example data

We consider a popular dataset by Kang et al. (2018). The authors measured the effect of interferon-$\beta$ stimulation on blood cells from eight patients. The muscData package provides the data as a SingleCellExperiment. If downloading the data with the muscData package fails, you can also directly download the file from http://oc.embl.de/index.php/s/tpbYcH5P9NfXeM5 and load it into R using sce = readRDS("PATH_TO_THE_FILE").

sce = muscData::Kang18_8vs8()

see ?muscData and browseVignettes('muscData') for documentation

loading from cache

sce

class: SingleCellExperiment 
dim: 35635 29065 
metadata(0):
assays(1): counts
rownames(35635): MIR1302-10 FAM138A ... MT-ND6 MT-CYB
rowData names(2): ENSEMBL SYMBOL
colnames(29065): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
  TTTGCATGTCTTAC-1
colData names(5): ind stim cluster cell multiplets
reducedDimNames(1): TSNE
mainExpName: NULL
altExpNames(0):

Challenge: How many genes and cells are in the data? How do you find the metadata about each cell / gene?

You can find the number of genes and cells when printing the summary of the sce. Alternatively, you can use nrow(sce), ncol(sce), or dim(sce) to get these values.

In a SingleCellExperiment object, the metadata (“data about data”) on the cells are accessed with colData(sce), and those on the genes with rowData(sce).

Follow-up question: why does the documentation for colData (run ?colData in the R console) talk about SummarizedExperiment objects instead of SingleCellExperiment?

We logarithm transform the data to account for the heteroskedasticity of the counts (C. Ahlmann-Eltze and Huber (2023)), perform PCA to reduce dimensionality, and run UMAP for visualization. We use the scater package, which adds a new slot called logcounts and the two slots reducedDims(sce) named PCA and UMAP to the SummarizedExperiment object. Equivalent functionality exists in Seurat and scanpy.

sce = scater::logNormCounts(sce)
hvg = order(MatrixGenerics::rowVars(logcounts(sce)), decreasing = TRUE)
sce = sce[hvg[1:500], ]
# _Note_: in case you get an error about `"function 'as_cholmod_sparse' not provided by package 'Matrix'"`, 
# please reinstall the `irlba` package from source (`install.packages("irlba", type = "source")`)!
sce = scater::runPCA(sce, ncomponents = 50)
sce = scater::runUMAP(sce, dimred = "PCA")

Challenge: How would you do a sctransform-like transformation (i.e., Pearson residuals) without using Seurat?

The transformGamPoi package from Bioconductor provides a residual_transform function.

assay(sce, "pearson_residuals") = transformGamPoi::residual_transform(sce, residual = "pearson", on_disk = FALSE)

Question: What is the problem with using too few components for PCA? Is there also a problem with using too many?

Too few dimensions for PCA mean that it cannot capture enough of the relevant variation in the data. This leads to a loss of subtle differences between cell states.

Too many dimensions for PCA can also pose a problem. PCA smoothes out the sampling noise in the data. If too many PCA components are included, the smoothing is too weak, and the additional dimensions capture noise that can obscure biologically relevant differences. For more details see Fig. 2d in C. Ahlmann-Eltze and Huber (2023).

Challenge: How can you use tSNE instead of UMAP? What are alternative visualization methods?

The scater package also provides a runTSNE function

sce = scater::runTSNE(sce, dimred = "PCA")

On the one hand, tSNE and UMAP are de facto standards for visualizing the cell states in single cell data analysis. On the other hand, they are often criticized for failing to represent global structure and distorting differences in the data. A number of alternatives have been suggested:

Force directed layout of the $k$ nearest neighbor (kNN) graph (igraph),
PHATE
IVIS
Unification of tSNE and UMAP using contrastive learning (Böhm, 2023)

In this tutorial, we use ggplot2 to visualize the data. This is often more verbose than calling a higher-level, specialized wrapper function (e.g., scater::plotReducedDim(sce, "UMAP", color_by = "stim")), on the other hand, it has the advantage that we can more easily customize our plots.

In the below two UMAP plots, the cells separate by treatment status (stim), as well as by annotated cell type (cell). One goal of this tutorial is to understand how different approaches use these covariates to integrate the data into a shared embedding space or representation.

as_tibble(colData(sce)) |>
  mutate(umap = reducedDim(sce, "UMAP")) |>
  ggplot(aes(x = umap[,1], y = umap[,2])) + geom_point(aes(color = stim), size = 0.3) + coord_fixed()

Figure 1: UMAP of log transformed counts

Challenge: The colData(sce)$ind column lists the patient identifier for each cell. Can you use this to create a separate plot for each patient?

One solution could be to use a for loop and filter the data for each patient, but there is a much simpler solution! Adding facet_wrap() to the ggplot call will automatically create a subpanel for each patient:

as_tibble(colData(sce)) |>
  mutate(umap = reducedDim(sce, "UMAP")) |>
  ggplot(aes(x = umap[,1], y = umap[,2])) + geom_point(aes(color = stim), size = 0.3) + coord_fixed() +
    facet_wrap(vars(ind))

This dataset already comes with cell type annotations. The cell type labels help interpret the results; however, for the purpose of this tutorial, we will not need them and will ignore them from now on.

label_pos_df = as_tibble(colData(sce)) |>
  mutate(umap = reducedDim(sce, "UMAP")) |>
  summarize(umap = matrix(colMedians(umap), nrow = 1), .by = c(stim, cell))

# Use `geom_text_repel` to avoid overlapping annotations
as_tibble(colData(sce)) |>
  mutate(umap = reducedDim(sce, "UMAP")) |>
  ggplot(aes(x = umap[,1], y = umap[,2])) +
    geom_point(aes(color = cell), size = 0.3) +
    ggrepel::geom_text_repel(data = label_pos_df, aes(label = cell)) +
    coord_fixed()

Alternative path: Do the preprocessing with Seurat (click to see results)

The following code is based on Seurat’s Guided Clustering Tutorial.

# For more information about the conversion see `?as.Seurat.CellDataSet`
seur_obj = Seurat::as.Seurat(muscData::Kang18_8vs8(), data = NULL)
seur_obj = Seurat::NormalizeData(seur_obj, normalization.method = "LogNormalize", scale.factor = 10000)
seur_obj = Seurat::FindVariableFeatures(seur_obj, selection.method = "vst", nfeatures = 500)
# Subset to highly variable genes for memory efficiency
seur_obj = seur_obj[Seurat::VariableFeatures(object = seur_obj),]
seur_obj = Seurat::ScaleData(seur_obj)
seur_obj = Seurat::RunPCA(seur_obj, verbose = FALSE)
seur_obj = Seurat::RunUMAP(seur_obj, dims = 1:10)

Seurat::DimPlot(seur_obj, reduction = "umap", group.by = "stim")

Data integration

Figure 1 shows that the cell embedding locations separate by treatment status. For many analyses, we would like to overlay the cells from the stimulated condition with the cells from the control condition. For example, for cell type assignment, we might want to annotate both conditions together, irrespective of treatment. This aspiration is called integration.

The goal is a mathematical representation (e.g., a low-dimensional embedding) of the cells where the treatment status does not visibly affect the positions overall, and all residual variance comes from different cell states. Figure 4 shows an example.

Figure 4: UMAP of a successfully integrated dataset.

There are many approaches for single-cell data integration (the question is somewhat ill-defined and can be formalized into a mathematical algorithm in many different ways). Luecken et al. (2022) benchmarked several approaches. Here, we show how to do the integration manually, or with harmony. Later we will compare both to LEMUR.

Question: In this tutorial we will just look at plots to assess integration success. Why is that suboptimal? How can we do better?

A 2D embedding like UMAP or tSNE gives us a first impression if cells from different conditions are mixed, but the results are not quantitative, which means we cannot objectively assess or compare the quality of the result.

One simple metric to measure integration success is to see how mixed the conditions are. For example we can count for each cell how many of the nearest neighbors come from the same condition and how many come from the other conditions. For more information see Luecken et al. (2022).

Follow-up challenge: Write a function to calculate the mixing metric.

Follow-up questions: Can you write an integration function, that scores perfectly on the integration metric? Hint: it can be biologically completely useless. What else would you need to measure to protect against such a pathological solutions?

Manual Projection

Figure 5: Schematic picture of data from two conditions. The data from the treated condition is projected onto the subspace spanned by the control condition.

In transcriptomic data analysis, each cell is characterized by its gene expression profile. In our case, we decided to consider the 500 most variable genes. Accordingly, each cell is a point in a 500-dimensional gene expression space. Principal component analysis (PCA) finds a $P$-dimensional space ($P\le500$) such that overall, distances between each cell in the original space and the reduced space are minimized.

To make these concepts more tangible, consider the cartoon in Figure 5. The orange blob represents all cells from the control condition in a 3-dimensional gene expression space. The grey planar rectangle $R$ is a lower-dimensional subspace covering the shape of the orange blob. The shape of the blue blob (i.e., the treated cells) resembles the orange blob but is slightly offset, in a different planar rectangle. To integrate both conditions, we can project the two planes onto each other, so that each point from the blue blob is mapped into the orange subspace.

We can implement this procedure in a few lines of R code:

We create a matrix for the cells from the control and treated conditions,
we fit PCA on the control cells,
we center the data, and finally
we project the cells from both conditions onto the subspace of the control condition

# Each column from the matrix is the coordinate of a cell in the 500-dimensional space
ctrl_mat = as.matrix(logcounts(sce)[,sce$stim == "ctrl"])
stim_mat = as.matrix(logcounts(sce)[,sce$stim == "stim"])

ctrl_centers = rowMeans(ctrl_mat)
stim_centers = rowMeans(stim_mat)

# `prcomp` is R's name for PCA and IRLBA is an algorithm to calculate it.
ctrl_pca = irlba::prcomp_irlba(t(ctrl_mat - ctrl_centers), n = 20)

# With a little bit of linear algebra, we project the points onto the subspace of the control cells
integrated_mat = matrix(NA, nrow = 20, ncol = ncol(sce))
integrated_mat[,sce$stim == "ctrl"] = t(ctrl_pca$rotation) %*% (ctrl_mat - ctrl_centers)
integrated_mat[,sce$stim == "stim"] = t(ctrl_pca$rotation) %*% (stim_mat - stim_centers)

We check that our implementation works, by looking at the UMAP of the integrated data.

int_mat_umap = uwot::umap(t(integrated_mat))

as_tibble(colData(sce)) |>
  mutate(umap = int_mat_umap) |>
  sample_frac() |>
  ggplot(aes(x = umap[,1], y = umap[,2])) +
    geom_point(aes(color = stim), size = 0.3) +
    coord_fixed()

Figure 6: UMAP of log transformed counts

The overlap is not perfect, but better than in Figure 1.

Challenge: What happens if you project on the "stim" subspace instead?

Schematic picture of data from two conditions using the stimulated condition as reference.

The projection is orthogonal onto the subspace, which means it matters which condition is chosen as reference.

stim_pca = irlba::prcomp_irlba(t(stim_mat - stim_centers), n = 20, center = FALSE)

integrated_mat2 = matrix(NA, nrow = 20, ncol = ncol(sce))
integrated_mat2[,sce$stim == "ctrl"] = t(stim_pca$rotation) %*% (ctrl_mat - ctrl_centers)
integrated_mat2[,sce$stim == "stim"] = t(stim_pca$rotation) %*% (stim_mat - stim_centers)

int_mat_umap2 = uwot::umap(t(integrated_mat2))

as_tibble(colData(sce)) |>
  mutate(umap = int_mat_umap2) |>
  sample_frac() |>
  ggplot(aes(x = umap[,1], y = umap[,2])) +
    geom_point(aes(color = stim), size = 0.3) +
    coord_fixed()

For this example, using the "stim" condition as the reference leads to a worse integration.

Brain teaser: How can you make the manual projection approach work for any complex experimental designs?

The projection approach consists of three steps:

Centering the data (e.g., ctrl_mat - ctrl_centers).
Choosing a reference condition and calculating the subspace that approximates the data from the reference condition (irlba::prcomp_irlba(t(stim_mat - stim_centers))$rotation).
Projecting the data from the other conditions onto that subspace (t(ctrl_pca$rotation) %*% (stim_mat - stim_centers)).

For arbitrary experimental designs, we can perform the centering with a linear model fit. The second step remains the same. And after calculating the centered matrix, the third step is also straight forward. Below are the outlines how such a general procedure would work.

# A complex experimental design
lm_fit = lm(t(logcounts(sce)) ~ condition + batch, data = colData(sce))
# The `residuals` function returns the coordinates minus the mean at the condition.
centered_mat = t(residuals(lm_fit))
# Assuming that `is_reference_condition` contains a selection of the cells
ref_pca = irlba::prcomp_irlba(centered_mat[,is_reference_condition], ...)
int_mat = t(ref_pca$rotation) %*% centered_mat

Harmony

Harmony is one popular tool for data integration (Korsunsky et al. 2019). Harmony is relatively fast and can handle more complex experimental designs than just a treatment/control setup. It is built around maximum diversity clustering (Figure 7). Unlike classical clustering algorithms, maximum diversity clustering not just minimizes the distance of each data point to a cluster center but also maximizes the mixing of conditions assigned to each cluster.

packageVersion("harmony")

[1] '1.2.0'

# Attention: You need `packageVersion("harmony") >= "1.0.0"` for this to work.
harm_mat = harmony::RunHarmony(reducedDim(sce, "PCA"), colData(sce), 
                                 vars_use = c("stim"))
harm_umap = uwot::umap(harm_mat)

as_tibble(colData(sce)) |>
  mutate(umap = harm_umap) |>
  sample_frac() |>
  ggplot(aes(x = umap[,1], y = umap[,2])) +
    geom_point(aes(color = stim), size = 0.3) +
    coord_fixed()

Challenge: How much do the results change if you change the default parameters in Harmony?

Challenge: How do you do integration in Seurat?

Seurat’s integration method is based on mutual nearest neighbors (MNN) by Haghverdi et al. (2018). However, Seurat calls the mutual nearest neighbors integration anchors. The main difference is that Seurat uses CCA instead of PCA for dimensionality reduction (Butler et al. 2018).

# This code only works with Seurat v5!
seur_obj2 = Seurat::as.Seurat(muscData::Kang18_8vs8(), data = NULL)
seur_obj2[["originalexp"]] = split(seur_obj2[["originalexp"]], f = seur_obj2$stim)
seur_obj2 = Seurat::NormalizeData(seur_obj2, normalization.method = "LogNormalize", scale.factor = 10000)
seur_obj2 = Seurat::FindVariableFeatures(seur_obj2, selection.method = "vst", nfeatures = 500)
seur_obj2 = Seurat::ScaleData(seur_obj2)
seur_obj2 = Seurat::RunPCA(seur_obj2, verbose = FALSE)

seur_obj2 = Seurat::IntegrateLayers(object = seur_obj2, method = Seurat::CCAIntegration, orig.reduction = "pca", new.reduction = "integrated.cca", verbose = FALSE)
seur_obj2 = Seurat::RunUMAP(seur_obj2, dims = 1:30, reduction = "integrated.cca")
Seurat::DimPlot(seur_obj2, reduction = "umap", group.by = "stim")

Analysis with LEMUR

LEMUR is a tool to analyze multi-condition single-cell data. A typical analysis workflow with LEMUR goes through four steps (Figure 8):

Covariate-adjusted dimensionality reduction.
Prediction of the differential expression for each cell per gene.
Identification of neighborhoods of cells with similar differential expression patterns.
Pseudobulked differential expression testing per neighborhood.

These are implemented by the following code.

library("lemur")
# Step 1
fit = lemur(sce, design = ~ stim, n_embedding = 30)
fit = align_by_grouping(fit, fit$colData$cell)

# Step 2
fit = test_de(fit, contrast = cond(stim = "stim") - cond(stim = "ctrl"))

# Steps 3 and 4
nei = find_de_neighborhoods(fit, group_by = vars(ind, stim))

In the next sections we discuss these steps in more detail.

LEMUR Integration (Step 1)

Tools like MNN and Harmony take a PCA embedding and remove the effects of the specified covariates. However, there is no way to go back from the integrated embedding to the original gene space. This means that we cannot ask the counterfactual what the expression of a cell from the control condition would have been, had it been treated.

LEMUR achieves such counterfactuals by matching the subspaces of each condition (Constantin Ahlmann-Eltze and Huber 2024). Figure 9 illustrates the principle. In some sense this is a more sophisticated version of the manual matching that we saw in the previous section. The matching is a single affine transformation in contrast to Harmony’s inferred shifts per cluster.

Figure 9: Schematic picture of data from two conditions with the respective linear subspace.

LEMUR takes as input a SingleCellExperiment object, the specification of the experimental design, and the number of latent dimensions. To refine the embedding, we will use the provided cell type annotations.

library("lemur")
fit = lemur(sce, design = ~ stim, n_embedding = 30, verbose = FALSE)
fit = align_by_grouping(fit, fit$colData$cell, verbose = FALSE)

Making a UMAP plot of LEMUR’s embedding shows that it successfully integrated the conditions (Figure 10).

lemur_umap = uwot::umap(t(fit$embedding))

as_tibble(colData(sce)) |>
  mutate(umap = lemur_umap) |>
  sample_frac() |>
  ggplot(aes(x = umap[,1], y = umap[,2])) +
    geom_point(aes(color = stim), size = 0.3) +
    coord_fixed()

Figure 10: UMAP plot of LEMUR’s embedding.

We can cross-check our embedding with the provided cell type annotations by coloring each cell by cell type (using the information stored in the colData(sce)$cell column).

as_tibble(colData(sce)) |>
  mutate(umap = lemur_umap) |>
  sample_frac() |>
  ggplot(aes(x = umap[,1], y = umap[,2])) +
    geom_point(aes(color = cell), size = 0.3) +
    coord_fixed()

Figure 11: Cell from the same cell types are close together after the integration

Challenge: How much do the results change if you change the default parameters in LEMUR?

Challenge: How to can the embedding of LEMUR be refined with an automated tool?

If there are no cell type labels present, LEMUR uses a modified version of Harmony’s maximum diversity clustering to automatically infer the cell relationships across conditions.

fit2 = lemur(sce, design = ~ stim, n_embedding = 30, verbose = FALSE)
fit2 = align_harmony(fit2, verbose = FALSE)

Differential expression analysis (Step 2)

The advantage of LEMUR’s integration is that we can predict what a cell’s expression from the control condition would have been, had it been stimulated and vice versa. Contrasting those predictions tells us how much the gene expression changes for that cell in any gene.

Differential expression with an invertible integration can be inferred as the difference between the prediction for a position from one condition and the prediction at the same position in the other condition.

We call LEMUR’s test_de function to compare the expression values in the "stim" and "ctrl" conditions.

fit = test_de(fit, contrast = cond(stim = "stim") - cond(stim = "ctrl"))

We can now pick an individual gene (PLSCR1) and plot the predicted log fold change for each cell to show how it varies as a function of the underlying gene expression values (Figure 12).

as_tibble(colData(sce)) |>
  mutate(umap = lemur_umap) |>
  mutate(expr = logcounts(fit)["PLSCR1",]) |>
  sample_frac() |>
  ggplot(aes(x = umap[,1], y = umap[,2])) +
    geom_point(aes(color = expr), size = 0.3) +
    scale_color_viridis_c() +
    facet_wrap(vars(stim)) +
    coord_fixed()

as_tibble(colData(sce)) |>
  mutate(umap = lemur_umap) |>
  mutate(de = assay(fit, "DE")["PLSCR1",]) |>
  sample_frac() |>
  ggplot(aes(x = umap[,1], y = umap[,2])) +
    geom_point(aes(color = de), size = 0.3) +
    scale_color_gradient2(low = scales::muted("blue"), high = scales::muted("red")) +
    coord_fixed()

Figure 12: Expression of *PLSCR1* in control and stim condition

Figure 13: Counterfactual: LEMUR predicts *PLSCR1* differential expression between control and stimulus for each cell

This approach has the advantage over the traditional cluster/cell type-based analysis that it can detect smooth differential expression gradients!

Challenge: Calculate the differential expression for each cell type using glmGamPoi.

glmGamPoi is a differential expression tool inspired by edgeR and DESeq2 that is specifically designed for single cell RNA-seq data.

psce = glmGamPoi::pseudobulk(fit, group_by = vars(cell, ind, stim))

Aggregating assay 'counts' using 'rowSums2'.

Aggregating assay 'logcounts' using 'rowMeans2'.

Aggregating assay 'pearson_residuals' using 'rowMeans2'.

Aggregating assay 'DE' using 'rowMeans2'.

Aggregating reducedDim 'TSNE' using 'rowMeans2'.

Aggregating reducedDim 'PCA' using 'rowMeans2'.

Aggregating reducedDim 'UMAP' using 'rowMeans2'.

Aggregating reducedDim 'linearFit' using 'rowMeans2'.

Aggregating reducedDim 'embedding' using 'rowMeans2'.

# Filter out NAs
psce = psce[,! is.na(psce$cell)]
glm_fit = glmGamPoi::glm_gp(psce, design = ~ stim * cell)
glm_de = glmGamPoi::test_de(glm_fit, contrast = cond(stim = "stim", cell = "B cells") - cond(stim = "ctrl", cell = "B cells"))
glm_de |>
  arrange(pval) |>
  head()

    name         pval     adj_pval f_statistic df1      df2       lfc
1  HERC5 9.370681e-36 4.685341e-33    329.5872   1 116.4645  5.152089
2 NT5C3A 7.953371e-35 1.988343e-32    313.5509   1 116.4645  3.760837
3 PLSCR1 1.370641e-32 2.284401e-30    277.2727   1 116.4645  4.032858
4 DYNLT1 1.018124e-31 1.272655e-29    263.9920   1 116.4645  2.913860
5   IRF7 2.424490e-30 2.424490e-28    243.9083   1 116.4645  3.543921
6   SPI1 7.996678e-30 6.663898e-28    236.6273   1 116.4645 -2.440995

Challenge: For each cell type compare LEMUR’s predictions against the observed expression of a gene:

We use the pseudobulk function from glmGamPoi to calculate the average expression and latent embedding position per condition and cell type. The mean embedding and colData is then fed to LEMUR’s predict function.

# You could choose any gene
gene_oi = "HERC5"

reduced_fit = glmGamPoi::pseudobulk(fit, group_by = vars(stim, cell))

Aggregating assay 'counts' using 'rowSums2'.

Aggregating assay 'logcounts' using 'rowMeans2'.

Aggregating assay 'pearson_residuals' using 'rowMeans2'.

Aggregating assay 'DE' using 'rowMeans2'.

Aggregating reducedDim 'TSNE' using 'rowMeans2'.

Aggregating reducedDim 'PCA' using 'rowMeans2'.

Aggregating reducedDim 'UMAP' using 'rowMeans2'.

Aggregating reducedDim 'linearFit' using 'rowMeans2'.

Aggregating reducedDim 'embedding' using 'rowMeans2'.

lemur_pred_per_cell = predict(fit, newdata = colData(reduced_fit), embedding = t(reducedDim(reduced_fit, "embedding")))

as_tibble(colData(reduced_fit)) |>
  mutate(obs = logcounts(reduced_fit)[gene_oi,]) |>
  mutate(lemur_pred =  lemur_pred_per_cell[gene_oi,]) |>
  ggplot(aes(x = obs, y = lemur_pred)) +
    geom_abline() +
    geom_point(aes(color = stim)) +
    labs(title = "LEMUR's predictions averaged per cell type and condition")

Neighborhood identification (Step 3) and differential expression tests (Step 4)

To help identify the most interesting changes, LEMUR can aggregate the results of each gene and identify the neighborhood of cells with the most prominent gene expression. To conduct a statistically valid differential expression test, LEMUR pseudobulks the counts for each replicate (here this is the patient).

The output is a data.frame with one row for each gene. The neighborhood column is a list column, where each element is a vector with the IDs of the cells that are inside the neighborhood. The other columns describe the significance of the expression change defined in the test_de call.

Schematic of the neighborhood inference for a gene

nei = find_de_neighborhoods(fit, group_by = vars(ind, stim))
head(nei)

    name neighborhood n_cells sel_statistic         pval     adj_pval
1    FTL AAACATAC....   28830     -262.0174 2.215696e-02 2.847939e-02
2  ISG15 AAACATAC....   14798     1106.6952 1.084384e-19 5.421921e-17
3   FTH1 AAACATAC....   22767     -489.7753 3.792553e-06 1.102486e-05
4  TIMP1 AAACATAC....    8709     -171.2586 2.590808e-05 6.139355e-05
5 CXCL10 AAACATAC....    8201      294.7182 1.367230e-07 5.425515e-07
6   CCL2 AAACATAC....    6446      277.9301 1.037136e-04 2.057809e-04
  f_statistic df1      df2        lfc   did_pval did_adj_pval     did_lfc
1    6.286358   1 17.69881 -0.3982586 0.04472085   0.10117840 -1.12234829
2 2018.940379   1 17.69881  5.1472651 0.03154131   0.07730712  1.01188568
3   43.350214   1 17.69881 -0.9456260 0.99395469   0.99999445  0.00216271
4   31.678724   1 17.69881 -0.9744997 0.05430602   0.11857210  0.62171081
5   70.650228   1 17.69881  6.0220042 0.95981690   0.99999445  0.08575997
6   24.711774   1 17.69881  1.9174593 0.21986884   0.35577483  0.93624372

We can sort the table either by the pval or the did_pval. The pval measures the evidence of an expression change between control and stimulated condition for all cells inside the neighborhood. In contrast, the did_pval measures how much larger the expression difference is for the cells inside the neighborhood versus outside.

slice_min(nei, pval, n = 3)

     name neighborhood n_cells sel_statistic         pval     adj_pval
1   ISG15 AAACATAC....   14798     1106.6952 1.084384e-19 5.421921e-17
2 EIF2AK2 AAACATAC....   27754      552.1799 4.541945e-18 1.135486e-15
3     MX2 AAACATAC....    8009      616.8117 5.846666e-17 9.744443e-15
  f_statistic df1      df2      lfc   did_pval did_adj_pval    did_lfc
1    2018.940   1 17.69881 5.147265 0.03154131   0.07730712  1.0118857
2    1318.035   1 17.69881 3.437423 0.60364318   0.74893695  0.4965469
3     983.279   1 17.69881 4.222732 0.09041983   0.17918008 -0.5175768

slice_min(nei, did_pval, n = 3)

   name neighborhood n_cells sel_statistic         pval     adj_pval
1   SSB AAACATAC....    6572      380.2204 5.742228e-16 5.742228e-14
2 RPL23 AAACATAC....    7648     -193.6324 3.928749e-12 1.033881e-10
3  ENO1 AAACATAC....    7269     -238.4628 6.214251e-11 8.877501e-10
  f_statistic df1      df2       lfc     did_pval did_adj_pval   did_lfc
1    755.7363   1 17.69881  2.917861 4.140134e-13 2.070067e-10 -2.153032
2    268.0192   1 17.69881 -1.929019 7.878820e-11 1.969705e-08  1.666773
3    191.6963   1 17.69881 -1.705069 7.539221e-10 1.256537e-07  2.021944

top_did_gene = slice_min(nei, did_pval)

as_tibble(colData(sce), rownames = "cell_name") |>
  mutate(umap = lemur_umap) |>
  mutate(de = assay(fit, "DE")[top_did_gene$name,]) |>
  sample_frac() |>
  ggplot(aes(x = umap[,1], y = umap[,2])) +
    geom_point(aes(color = de), size = 0.3) +
    scale_color_gradient2(low = scales::muted("blue"), high = scales::muted("red")) +
    coord_fixed() +
    labs(title = "Top DE gene by did_pval")

To see which cells LEMUR included in the neighborhood for each gene, we will first make a smaller helper function:

neighborhoods_to_long_data = function(data, fit = NULL){
  nei = data[["neighborhood"]]
  gene_names = data[["name"]]
  cell_names = if(! is.null(fit)) colnames(fit) else unique(unlist(nei))
  levels2idx = seq_along(cell_names)
  names(levels2idx) = cell_names
  res = list(
    rep(gene_names, each = length(cell_names)),
    rep(cell_names, times = length(gene_names)),
    rep(FALSE, length(cell_names) * length(gene_names))
  )
  offset = 0
  for(n in nei){
    res[[3]][offset + levels2idx[n]] = TRUE
    offset = offset + length(cell_names)
  }

  names(res) = c("name", "cell", "inside")
  tibble::as_tibble(res)
}

With this function, we can annotate for any gene if a cell is included in LEMUR’s neighborhood.

cells_inside_df = neighborhoods_to_long_data(top_did_gene, fit)

as_tibble(colData(sce), rownames = "cell_name") |>
  mutate(umap = lemur_umap) |>
  mutate(de = assay(fit, "DE")[top_did_gene$name,]) |>
  left_join(cells_inside_df, by = c("cell_name" = "cell")) %>%
  sample_frac() |>
  ggplot(aes(x = umap[,1], y = umap[,2])) +
    geom_point(aes(color = de), size = 0.3) +
    scale_color_gradient2(low = scales::muted("blue"), high = scales::muted("red")) +
    coord_fixed() +
    facet_wrap(vars(inside), labeller = label_both) +
    labs(title = "Top DE gene by did_pval")

Brain teaser: How are LEMUR’s neighborhoods different from the cluster-based differential expression tests?

LEMUR’s neighborhoods are adaptive to the underlying expression patterns for each gene. This means that they provide optimal power to detect differential expression as well the set of cells for which the gene expression changes most.

The plot below illustrates this for the top three DE genes by pval which all have very different neighborhoods. Note that the neighborhoods are convex in the low-dimensional embedding space, even though in the UMAP embedding they might not look like it. This is due to the non-linear nature of the UMAP dimensionality reduction.

cells_inside_df_top3 = neighborhoods_to_long_data(slice_min(nei, pval, n = 3), fit)

as_tibble(colData(sce), rownames = "cell_name") |>
  mutate(umap = lemur_umap) |>
  mutate(de = as_tibble(t(assay(fit, "DE")[slice_min(nei, pval, n = 3)$name,]))) |>
  unpack(de, names_sep = "_") %>%
  pivot_longer(starts_with("de_"), names_sep = "_", names_to = c(".value", "gene_name")) %>%
  left_join(cells_inside_df_top3, by = c("gene_name" = "name", "cell_name" = "cell")) %>%
  sample_frac() |>
  ggplot(aes(x = umap[,1], y = umap[,2])) +
    geom_point(aes(color = de), size = 0.3) +
    scale_color_gradient2(low = scales::muted("blue"), high = scales::muted("red")) +
    coord_fixed() +
    facet_grid(vars(inside), vars(gene_name), labeller = label_both) +
    labs(title = "Top 3 DE gene by pval")

Outlook

LEMUR can handle arbitrary design matrices as input, which means that it can also handle more complicated experiments than the simple treatment / control comparison shown here. Just as with limma, edgeR or DESeq2, you can model

multiple covariates, e.g., the treatment of interest as well as known confounders (experiment batch, sex, age) or blocking variables,
interactions, e.g., drug treatment in wild type and mutant,
continuous covariates, e.g., time courses, whose effects can also be modeled using smooth non-linear functions (spline regression).

For more details, see Constantin Ahlmann-Eltze and Huber (2024). For any questions, whether technical/practical or conceptual, please post to the Bioconductor forum using the lemur tag and follow the posting guide.

Session Info

sessionInfo()

R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Rome
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] lemur_1.2.0                 muscData_1.18.0            
 [3] ExperimentHub_2.12.0        AnnotationHub_3.12.0       
 [5] BiocFileCache_2.12.0        dbplyr_2.5.0               
 [7] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
 [9] Biobase_2.64.0              GenomicRanges_1.56.1       
[11] GenomeInfoDb_1.40.1         IRanges_2.38.0             
[13] S4Vectors_0.42.0            BiocGenerics_0.50.0        
[15] MatrixGenerics_1.16.0       matrixStats_1.3.0          
[17] lubridate_1.9.3             forcats_1.0.0              
[19] stringr_1.5.1               dplyr_1.1.4                
[21] purrr_1.0.2                 readr_2.1.5                
[23] tidyr_1.3.1                 tibble_3.2.1               
[25] ggplot2_3.5.1               tidyverse_2.0.0            

loaded via a namespace (and not attached):
 [1] DBI_1.2.3                 transformGamPoi_1.10.0   
 [3] gridExtra_2.3             rlang_1.1.4              
 [5] magrittr_2.0.3            scater_1.32.0            
 [7] RcppAnnoy_0.0.22          compiler_4.4.1           
 [9] RSQLite_2.3.7             DelayedMatrixStats_1.26.0
[11] png_0.1-8                 vctrs_0.6.5              
[13] pkgconfig_2.0.3           crayon_1.5.2             
[15] fastmap_1.2.0             XVector_0.44.0           
[17] labeling_0.4.3            scuttle_1.14.0           
[19] utf8_1.2.4                rmarkdown_2.27           
[21] tzdb_0.4.0                UCSC.utils_1.0.0         
[23] ggbeeswarm_0.7.2          bit_4.0.5                
[25] xfun_0.45                 zlibbioc_1.50.0          
[27] cachem_1.1.0              beachmat_2.20.0          
[29] jsonlite_1.8.8            blob_1.2.4               
[31] DelayedArray_0.30.1       BiocParallel_1.38.0      
[33] irlba_2.3.5.1             parallel_4.4.1           
[35] R6_2.5.1                  stringi_1.8.4            
[37] Rcpp_1.0.12               knitr_1.47               
[39] splines_4.4.1             Matrix_1.7-0             
[41] timechange_0.3.0          tidyselect_1.2.1         
[43] viridis_0.6.5             rstudioapi_0.16.0        
[45] abind_1.4-5               yaml_2.3.8               
[47] codetools_0.2-20          curl_5.2.1               
[49] lattice_0.22-6            withr_3.0.0              
[51] KEGGREST_1.44.0           evaluate_0.24.0          
[53] Biostrings_2.72.1         pillar_1.9.0             
[55] BiocManager_1.30.23       filelock_1.0.3           
[57] generics_0.1.3            BiocVersion_3.19.1       
[59] hms_1.1.3                 sparseMatrixStats_1.16.0 
[61] munsell_0.5.1             scales_1.3.0             
[63] RhpcBLASctl_0.23-42       glue_1.7.0               
[65] tools_4.4.1               BiocNeighbors_1.22.0     
[67] ScaledMatrix_1.12.0       cowplot_1.1.3            
[69] grid_4.4.1                AnnotationDbi_1.66.0     
[71] colorspace_2.1-0          GenomeInfoDbData_1.2.12  
[73] beeswarm_0.4.0            BiocSingular_1.20.0      
[75] vipor_0.4.7               cli_3.6.2                
[77] rsvd_1.0.5                rappdirs_0.3.3           
[79] fansi_1.0.6               viridisLite_0.4.2        
[81] S4Arrays_1.4.1            uwot_0.2.2               
[83] glmGamPoi_1.16.0          gtable_0.3.5             
[85] digest_0.6.35             ggrepel_0.9.5            
[87] SparseArray_1.4.8         farver_2.1.2             
[89] htmlwidgets_1.6.4         memoise_2.0.1            
[91] htmltools_0.5.8.1         lifecycle_1.0.4          
[93] httr_1.4.7                mime_0.12                
[95] harmony_1.2.0             bit64_4.0.5

References

Ahlmann-Eltze, C., and Wolfgang Huber. 2023. “Comparison of Transformations for Single-Cell RNA-Seq Data.” Nature Methods, February, 665–72. https://doi.org/10.1038/s41592-023-01814-1.

Ahlmann-Eltze, Constantin, and Wolfgang Huber. 2024. “Analysis of Multi-Condition Single-Cell Data with Latent Embedding Multivariate Regression.” bioRxiv, February. http://dx.doi.org/10.1101/2023.03.06.531268.

Butler, Andrew, Paul Hoffman, Peter Smibert, Efthymia Papalexi, and Rahul Satija. 2018. “Integrating Single-Cell Transcriptomic Data Across Different Conditions, Technologies, and Species.” Nature Biotechnology 36 (5): 411–20. https://doi.org/10.1038/nbt.4096.

Haghverdi, Laleh, Aaron T L Lun, Michael D Morgan, and John C Marioni. 2018. “Batch Effects in Single-Cell RNA-Sequencing Data Are Corrected by Matching Mutual Nearest Neighbors.” Nature Biotechnology 36 (5): 421–27. https://doi.org/10.1038/nbt.4091.

Kang, Hyun Min, Meena Subramaniam, Sasha Targ, Michelle Nguyen, Lenka Maliskova, Elizabeth McCarthy, Eunice Wan, et al. 2018. “Multiplexed Droplet Single-Cell RNA-Sequencing Using Natural Genetic Variation.” Nature Biotechnology 36 (1): 89–94. https://doi.org/10.1038/nbt.4042.

Korsunsky, Ilya, Nghia Millard, Jean Fan, Kamil Slowikowski, Fan Zhang, Kevin Wei, Yuriy Baglaenko, Michael Brenner, Po-ru Loh, and Soumya Raychaudhuri. 2019. “Fast, Sensitive and Accurate Integration of Single-Cell Data with Harmony.” Nature Methods 16 (12): 1289–96. https://doi.org/10.1038/s41592-019-0619-0.

Luecken, Malte D., M. Büttner, K. Chaichoompu, A. Danese, M. Interlandi, M. F. Mueller, D. C. Strobl, et al. 2022. “Benchmarking Atlas-Level Data Integration in Single-Cell Genomics.” Nature Methods 19 (1): 41–50. https://doi.org/10.1038/s41592-021-01336-8.