Contents

1 Introduction

In this tutorial we walk through a typical single-cell RNA-seq analysis using Bioconductor packages. While most steps are valid for data from different protocols, some of the EDA/QC steps are specific of the 10X Genomics Chromium protocol.

We start from the output of the Cell Ranger preprocessing software. This is an open source software suite that allows to pre-process the FASTQ files generated by the sequencing platform and perform alignment and quantification.

We will perform exploratory data analysis (EDA) and quality control (QC), focusing on single-cell, and sometimes droplet-specific, issues, such as the detection of empty droplet and of doublets. We will then cover dimensionality reduction and cell type identification.

Most of the steps covered here (and much more!) are described in great details in the Orchestrating Single Cell Analysis (OSCA) book (Amezquita et al. 2020). I encourage everyone to use that as a reference for the most typical single-cell analyses with Bioconductor.

Bioconductor (Huber et al. 2015) has many packages supporting the analysis of single-cell RNA-seq data and their package vignettes constitute an excellent resource.

While not covered in this tutorial, there are packages and software tools for the analysis of single-cell data outside of Bioconductor too. Popular tools include the Seurat R package and the scanpy python package. The Bioconductor single-cell ecosystem tries whenever possible to provide data structures and coercion functions that make it easy to interoperate between Bioconductor and external software.

1.1 Experimental data

We will focus on one experimental dataset throughout this tutorial. In particular, we will use the Peripheral Blood Mononuclear Cell (PBMC) data from healthy donors provided by 10X Genomics as example datasets, available through the TENxPBMCData Bioconductor package.

If you are an advanced user and are already familiar with the steps provided here, you can make this tutorial more challenging by using other datasets. In particular, the scRNAseq Bioconductor package contains a wide variety of experimental single-cell datasets that can be used for illustration, exploration, and methods development.

1.2 The SingleCellExperiment class

One common feature of the dataset that we will use is that it is stored as an object of the SingleCellExperiment class.

SingleCellExperiment is a S4 class that extends SummarizedExperiment and can be used for efficiently storing and working with single-cell data in R/Bioconductor.

This class serves as the common infrastructure for across 70+ single-cell-related Bioconductor packages and allows users to mix and match packages by different developers. This class implements a data structure that stores all aspects of our single-cell data - gene-by-cell expression data, per-cell metadata and per-gene annotation - and manipulate them in a synchronized manner.

A more thorough overview of SingleCellExperiment can be found in Chapter 4 of OSCA.

2 Exploratory Data Analysis / Quality Control

In this section we focus on EDA/QC steps that are typical in a single-cell analysis.

We start by considering a dataset on Peripheral Blood Mononuclear Cell (PBMC) data from a healthy donor provided by 10X Genomics. We use the DropletTestFiles Bioconductor package to download the raw (i.e., unfiltered) count matrix that contains the UMI counts of all genes in all droplets.

library(DropletTestFiles)
raw.path <- getTestFile("tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz")
out.path <- file.path(tempdir(), "pbmc4k")
untar(raw.path, exdir=out.path)

library(DropletUtils)
fname <- file.path(out.path, "raw_gene_bc_matrices/GRCh38")
sce.pbmc <- read10xCounts(fname, col.names=TRUE)
sce.pbmc
## class: SingleCellExperiment 
## dim: 33694 737280 
## metadata(1): Samples
## assays(1): counts
## rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
##   ENSG00000268674
## rowData names(2): ID Symbol
## colnames(737280): AAACCTGAGAAACCAT-1 AAACCTGAGAAACCGC-1 ...
##   TTTGTCATCTTTAGTC-1 TTTGTCATCTTTCCTC-1
## colData names(2): Sample Barcode
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

The read10xCounts function starts from the output of the Cell Ranger software and imports the data into R as an object of class SingleCellExperiment.

We can notice that the dimension of the matrix is very big; in fact this matrix includes the UMI that have been detected in all the droplets that have been sequenced, including the empty droplets that may contain only ambient RNA.

This is a very sparse matrix, with a large fraction of zeros; read10xCounts is aware of this and stores the counts as a sparse matrix, which has a very small memory footprint.

class(counts(sce.pbmc))
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"

Exercise: Explore the SingleCellExperiment object and its slots. In particular, the assay, counts, colData, rowData assessors.

Before starting the analysis, it may be a good idea to store the names of the genes in a more human-friendly ID system. We can also include information on the chromosome location of the genes; this will be useful for e.g. identifying mitochondrial genes.

library(scuttle)
rownames(sce.pbmc) <- uniquifyFeatureNames(
    rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol)

library(EnsDb.Hsapiens.v86)
rowData(sce.pbmc)$location <- mapIds(EnsDb.Hsapiens.v86,
                                     keys=rowData(sce.pbmc)$ID, 
                                     column="SEQNAME", keytype="GENEID")
rowData(sce.pbmc)
## DataFrame with 33694 rows and 3 columns
##                           ID       Symbol    location
##                  <character>  <character> <character>
## RP11-34P13.3 ENSG00000243485 RP11-34P13.3           1
## FAM138A      ENSG00000237613      FAM138A           1
## OR4F5        ENSG00000186092        OR4F5           1
## RP11-34P13.7 ENSG00000238009 RP11-34P13.7           1
## RP11-34P13.8 ENSG00000239945 RP11-34P13.8           1
## ...                      ...          ...         ...
## AC233755.2   ENSG00000277856   AC233755.2  KI270726.1
## AC233755.1   ENSG00000275063   AC233755.1  KI270726.1
## AC240274.1   ENSG00000271254   AC240274.1  KI270711.1
## AC213203.1   ENSG00000277475   AC213203.1  KI270713.1
## FAM231B      ENSG00000268674      FAM231B  KI270713.1

Exercise: compute the proportion of zero for each droplet (column) and draw the distribution across the dataset.

2.1 Empty droplets

The first step is the identification of droplets that do not contain any live cell.

The reason why these droplets contain some RNA is that there may be some ambient RNA due to some cell leaking or they may contain dead or dying cells.

The barcodeRanks function can be used to rank the barcodes by number of UMIs and to estimate the knee and inflection point of the distribution.

bcrank <- barcodeRanks(counts(sce.pbmc))

# Only showing unique points for plotting speed.
uniq <- !duplicated(bcrank$rank)
plot(bcrank$rank[uniq], bcrank$total[uniq], log="xy",
    xlab="Rank", ylab="Total UMI count", cex.lab=1.2)

abline(h=metadata(bcrank)$inflection, col="darkgreen", lty=2)
abline(h=metadata(bcrank)$knee, col="dodgerblue", lty=2)

legend("bottomleft", legend=c("Inflection", "Knee"), 
        col=c("darkgreen", "dodgerblue"), lty=2, cex=1.2)

There is a sharp distinction between droplets with very high counts, very likely to contain a live cell, and droplets with very low counts, very likely to be empty.

However, it is not straightforward to classify the droplets in the middle of the distribution.

We can apply a statistical test of hypothesis to decide, for each droplet, if its RNA profile is significantly different from the profile of ambient RNA, estimated from the very low counts (Aaron TL Lun et al. 2019).

We use a very low threshold on the False Discovery Rate to have very few false positive cells.

set.seed(100)
e.out <- emptyDrops(counts(sce.pbmc))
summary(e.out$FDR <= 0.001)
##    Mode   FALSE    TRUE    NA's 
## logical     989    4300  731991

The large majority of droplets are not tested, since by default all droplets with fewer than 100 UMIs are considered empty.

table(colSums(counts(sce.pbmc))>100, e.out$FDR<=0.001, useNA = "ifany")
##        
##          FALSE   TRUE   <NA>
##   FALSE      0      0 731991
##   TRUE     989   4300      0

We can now proceed by removing the empty droplets and keep only the ones identified to be cells.

sce.pbmc <- sce.pbmc[,which(e.out$FDR <= 0.001)]
sce.pbmc
## class: SingleCellExperiment 
## dim: 33694 4300 
## metadata(1): Samples
## assays(1): counts
## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
## rowData names(3): ID Symbol location
## colnames(4300): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
##   TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
## colData names(2): Sample Barcode
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

2.2 Data exploration

We will see in more details in the next section how to perform accurate dimensionality reduction. Here, we quickly perform a principal component analysis (PCA) and tSNE embedding to explore the data and highlight possible quality issues for some cells.

These embeddings can be quickly computed using the runPCA and runTSNE functions in scater. It is useful to log transform the data before PCA to more accurately reflect the signal.

The logNormCounts function divides by the total number of UMIs in addition to log transforming the counts to account for the difference in sequencing depths. This is not the best normalization approach, but for our current purpose of quickly visualizing the data, it is good enough.

We will need to recompute these embeddings after filtering and proper normalization to obtain a more accurate representation of our data.

library(scater)
sce.pbmc <- logNormCounts(sce.pbmc)
sce.pbmc <- runPCA(sce.pbmc)
sce.pbmc <- runTSNE(sce.pbmc, dimred="PCA")
sce.pbmc
## class: SingleCellExperiment 
## dim: 33694 4300 
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
## rowData names(3): ID Symbol location
## colnames(4300): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
##   TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
## colData names(3): Sample Barcode sizeFactor
## reducedDimNames(2): PCA TSNE
## mainExpName: NULL
## altExpNames(0):

Note that it is generally a good strategy to start from PCA to compute the tSNE embedding; this is achieved using the dimred argument.

Exercise: Explore the logcounts accessor and the assayNames, assays and assay methods to extract the original counts and the log-normalized counts from the object; explore the reducedDimNames and reducedDim methods to explore the PCA and t-SNE embedding.

We can visualize the first two PCs and the tSNE embedding using the following scater functions.

plotPCA(sce.pbmc)

plotTSNE(sce.pbmc)

2.3 Quality control

The fact that the droplets are not empty does not automatically mean that they should be kept in the analysis. They may contain damaged or dying cells.

Our aim here is to identify and potentially remove those cells that have lower quality. There are several methods to identify such cells; here we use the simple but effective strategy of removing cells with high percentages of mitochondrial reads.

This is justified by the fact that mitochondiral genes are involved in biological processes in response to stress and apoptosis. Hence, high mitochondrial gene expression may be an indication of damaged or stressed cells.

The perCellQCMetrics function can be used to compute a set of metrics useful to evaluate the quality of the samples. The isOutlier function uses a data driven threshold to define cells of lower quality compared to the rest of the dataset.

stats <- perCellQCMetrics(sce.pbmc,
            subsets=list(Mito=which(rowData(sce.pbmc)$location=="MT")))
high.mito <- isOutlier(stats$subsets_Mito_percent, type="higher")
table(high.mito)
## high.mito
## FALSE  TRUE 
##  3985   315

Finally, it is convenient to include this information in the colData of the object, e.g. to use it in plotting functions.

colData(sce.pbmc) <- cbind(colData(sce.pbmc), stats)
sce.pbmc$discard <- high.mito

plotColData(sce.pbmc, y="subsets_Mito_percent",
        colour_by="discard")

plotColData(sce.pbmc, y="detected",
        colour_by="discard")

plotColData(sce.pbmc, x="sum", y="subsets_Mito_percent",
    colour_by="discard") + scale_x_log10()

We can also color-code the cells by the QC metrics and/or by the indicator that we defined in the PCA and tSNE plots.

plotTSNE(sce.pbmc, colour_by="subsets_Mito_percent")

plotTSNE(sce.pbmc, colour_by="discard")

We can see that there are a few cells with high mitochondrial reads scattered around two clusters. In addition, there seems to be a cluster made of only high-mitochondrial cells. It is unclear whether this is a technical artifact due to sample processing or a genuine PBMC cell population that expresses more mitochondrial genes.

In this tutorial, we go ahead and remove these cells, but in a real analysis, careful thought must be given to whether these metrics are capturing technical noise or biological signal.

sce.pbmc <- sce.pbmc[,!high.mito]
sce.pbmc
## class: SingleCellExperiment 
## dim: 33694 3985 
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
## rowData names(3): ID Symbol location
## colnames(3985): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
##   TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
## colData names(10): Sample Barcode ... total discard
## reducedDimNames(2): PCA TSNE
## mainExpName: NULL
## altExpNames(0):

2.4 Normalization

Here, we use the scran method to normalize the data for differences in sequencing depth. This approach is based on the deconvolution of size factors estimated from pools of cells (Aaron T. Lun, Bach, and Marioni 2016).

Since we have a heterogeneous cell population, we perform a quick clustering to make sure that we pool together cells that are not too different from each other.

library(scran)
cl <- quickCluster(sce.pbmc)
table(cl)
## cl
##    1    2    3    4    5    6    7    8 
##  216  523  542  345  499  547  258 1055
sce.pbmc <- computeSumFactors(sce.pbmc, clusters = cl)
summary(sizeFactors(sce.pbmc))
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00749  0.71207  0.87490  1.00000  1.09900 12.25412

The function returns a SingleCellExperiment object with the estimated size factors stored in the colData. To actually normalize the data, we can use the logNormCounts function that will use the size factors stored in the object to normalize the data and save the log-normalized counts as a second assay in the object.

sce.pbmc <- logNormCounts(sce.pbmc)
sce.pbmc
## class: SingleCellExperiment 
## dim: 33694 3985 
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
## rowData names(3): ID Symbol location
## colnames(3985): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
##   TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
## colData names(10): Sample Barcode ... total discard
## reducedDimNames(2): PCA TSNE
## mainExpName: NULL
## altExpNames(0):

Exercise: Extract the size factors with the colData and sizeFactors accessors.

We can check that the estimated library sizes are not too far from the library size factors, estimated from the total number of counts.

plot(librarySizeFactors(sce.pbmc), sizeFactors(sce.pbmc), xlab="Library size factor",
    ylab="Deconvolution size factor", log='xy', pch=16,
    col=as.integer(factor(cl)))
abline(a=0, b=1, col="red")

2.5 Highly variable genes (HVGs)

We now have a SingleCellExperiment object with only the high-quality cells. We may also want to filter the genes to focus on those that are more likely to carry information rather than noise.

In fact, many of the genes that we stored in the object have no UMI in any cell. We can easily verify that by looking at the number of zero counts in each cell.

summary(colMeans(counts(sce.pbmc)==0))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.8442  0.9559  0.9627  0.9600  0.9676  0.9974

On average, 96% of the genes have 0 UMIs in a given cell.

Some genes will be expressed in a large fraction of cells, while others will almost always be 0.

summary(rowMeans(counts(sce.pbmc)==0))
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0007528 0.9736512 0.9994981 0.9600243 1.0000000 1.0000000

Hence, we can start off by removing those genes that are 0 across the board.

allzero <- rowMeans(counts(sce.pbmc)==0)==1
table(allzero)
## allzero
## FALSE  TRUE 
## 19669 14025
sce.pbmc <- sce.pbmc[which(!allzero),]
sce.pbmc
## class: SingleCellExperiment 
## dim: 19669 3985 
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(19669): RP11-34P13.7 RP11-34P13.8 ... AC004556.1 AC240274.1
## rowData names(3): ID Symbol location
## colnames(3985): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
##   TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
## colData names(10): Sample Barcode ... total discard
## reducedDimNames(2): PCA TSNE
## mainExpName: NULL
## altExpNames(0):

Note that we have removed more than 40% of the genes.

This loose filter is often not sufficient, and we usually prefer to focus on a set of highly variable genes for the subsequent analysis steps.

The rationale is that if the genes are carrying biologically relevant information, e.g., in distinguishing among cell types, they will not have a constant expression, but rather will vary between cells.

The simplest approach to quantifying per-gene variation is to compute the variance of the log-normalized expression values for each gene across all cells. We need to model the mean-variance relationship since the log is not a variance stabilizing transformation.

set.seed(1001)
dec.pbmc <- modelGeneVar(sce.pbmc)
head(dec.pbmc)
## DataFrame with 6 rows and 6 columns
##                      mean       total        tech          bio   p.value
##                 <numeric>   <numeric>   <numeric>    <numeric> <numeric>
## RP11-34P13.7  0.002166050 0.002227438 0.002227459 -2.15870e-08 0.5000269
## RP11-34P13.8  0.000522431 0.000549601 0.000537242  1.23586e-05 0.4364964
## FO538757.3    0.000256287 0.000261748 0.000263553 -1.80586e-06 0.5189887
## FO538757.2    0.142907236 0.147339117 0.145040150  2.29897e-03 0.4561457
## AP006222.2    0.061806884 0.077278825 0.063524904  1.37539e-02 0.0662152
## RP4-669L17.10 0.002577010 0.002864669 0.002650070  2.14599e-04 0.2868068
##                     FDR
##               <numeric>
## RP11-34P13.7   0.756443
## RP11-34P13.8   0.756443
## FO538757.3     0.756443
## FO538757.2     0.756443
## AP006222.2     0.677861
## RP4-669L17.10  0.756443

At any given abundance, we assume that the variation in expression for most genes is driven by uninteresting processes like sampling noise. Under this assumption, the fitted value of the trend at any given gene’s abundance represents an estimate of its uninteresting variation, which we call the technical component. We then define the biological component for each gene as the difference between its total variance and the technical component. This biological component represents the “interesting” variation.

plot(dec.pbmc$mean, dec.pbmc$total, pch=16, cex=0.5,
    xlab="Mean of log-expression", ylab="Variance of log-expression")
curfit <- metadata(dec.pbmc)
curve(curfit$trend(x), col='dodgerblue', add=TRUE, lwd=2)

The modelGeneVar function tests the null hypothesis that the biological component of the variance is 0 for each gene: small p-values indicate a gene with evidence of biological variation. We can select the genes using a threshold on the adjusted p-values or simply ranking them by biological variation and selecting the top one or two thousand.

top.pbmc <- getTopHVGs(dec.pbmc, n=1000)
head(top.pbmc)
## [1] "LYZ"     "S100A9"  "S100A8"  "HLA-DRA" "CD74"    "CST3"

2.6 Doublet identification

As we have already mentioned, a single droplet may erroneously capture more than one cell; these are called doublets. There are some statistical methods that can be used

To try and detect the doublets, we can use statistical algorithms, many of which are implemented in the scDblFinder package. We will focus on one technique that uses simulations to identify droplets that are likely to contain more than one cell.

Note that while we use a droplet terminology here, doublets can be observed in data from non-droplet protocols as well, e.g., in Fluidigm C1 Smart-seq2 datasets.

library(scDblFinder)
dbl.dens <- computeDoubletDensity(sce.pbmc, subset.row=top.pbmc, 
    d=ncol(reducedDim(sce.pbmc)))
sce.pbmc$DoubletScore <- dbl.dens
summary(dbl.dens)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.01594  0.43038  0.82091  1.15838  1.41069 18.83311

The function computeDoubletDensity identifies potential doublet cells based on the local density of simulated doublet expression profiles.

Briefly, the strategy can be described in this way:

  1. We simulate many artificial doublets by summing two random cells.
  2. We compute the density of the simulated cells in the neighborhood of each real cell.
  3. We compute the density of real cells in the neighborhood of each cell.
  4. We compute the ratio between the two densities to obtain a cell-specific “doublet score”.

These scores can be visualized in a tSNE plot.

plotTSNE(sce.pbmc, colour_by="DoubletScore")

We can define a threshold to identify putative doublets.

dbl.calls <- doubletThresholding(data.frame(score=dbl.dens),
    method="griffiths", returnType="call")
summary(dbl.calls)
## singlet doublet 
##    3442     543

And finally eliminate them from the dataset before continuing the analysis.

Note that, similarly to what we have said for the removal of low-quality cells, one should be careful in removing putative doublets. Indeed, it may be difficult to distinguish between rare transient cell populations and doublet cells.

sce.pbmc <- sce.pbmc[, dbl.calls == "singlet"]
sce.pbmc
## class: SingleCellExperiment 
## dim: 19669 3442 
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(19669): RP11-34P13.7 RP11-34P13.8 ... AC004556.1 AC240274.1
## rowData names(3): ID Symbol location
## colnames(3442): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
##   TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
## colData names(11): Sample Barcode ... discard DoubletScore
## reducedDimNames(2): PCA TSNE
## mainExpName: NULL
## altExpNames(0):

3 Dimensionality reduction

Now that we have removed low-quality cells and doublets, and we have identified highly variable genes, we can proceed with the next step of the analysis, dimentionality reduction.

In particular, we will focus on dimensionality reduction techniques useful as a preliminary step for statistical inference (e.g., clustering), such as PCA and count-based methods.

We can always apply visualization techniques, such as tSNE or UMAP, on top of these more formal dimensionality reduction techniques, but it is important that we do not base our inferential steps on them and we use them only for visualization.

3.1 PCA

We have already seen the simplest dimensionality reduction technique, PCA, in the previous section. Here, we can recalculate the principal components, starting from the data normalized by scran and using only the selected HVGs. This will lead to a better representation of our biological signal.

sce.pbmc <- runPCA(sce.pbmc, subset_row=top.pbmc)

Note that the subset_row argument ensures that we use the previously selected HVGs to compute the PCs. Also note that by default runPCA will store the PCs in the PCA slot, hence overwriting the previous PCs. In this case, this is what we want, as the previous results were based on a quick normalization and prior to removing noisy genes and cells. In other cases, it may be reasonable to keep more than one set of PCs: this can be achieved by specifying a different slot name via the name argument.

Finally, note that by default scater will compute the top 50 PCs. This is a reasonable choice, but it may be a good idea to explore the variance explained by each component to decide the number of components to retain. An alternative is to use the denoisePCA function from scran that aims at selecting the number of PCs that exlain biological variability.

plotPCA(sce.pbmc)

Again, we can compute the tSNE embedding using the PCs.

sce.pbmc <- runTSNE(sce.pbmc, dimred="PCA")
plotTSNE(sce.pbmc)

3.2 Count-based models

One problem of PCA is that it assumes continuous, roughly Gaussian data, while single-cell RNA-seq data are counts.

Computing the PCs on the log-normalized counts helps, but it can be shown that it is not always effective and in some cases technical features, such as sequencing depth, can affect the first PCs (Hicks et al. 2018).

Hence, it may be better to use count-based factor analysis models for dimentionality reduction. One advantage of these methods is that they can be applied directly to count data, without need for normalization and data transformation, which can be challenging.

We will see two alternative approaches:

In both cases, we will start from the HVGs selected previously.

3.3 GLM-PCA

GLM-PCA is implemented in the scry Bioconductor package; it uses a Poisson model by default, but this can be changed to use negative binomial, multinomial or binomial.

Unlike runPCA, the GLM-PCA function does not allow for a selection of genes. Hence, we will first create a filtered SingleCellExperiment object that contains only the HVGs.

Similarly to PCA, GLM-PCA needs the number of components (factors) that need to be calculated. Here, we compute 10 latent factors.

We use the minibatch argument to use only a random subset of observations to compute the gradient in the optimization algorithm. This speeds up computations and avoids memory problems in big datasets.

library(scry)
set.seed(100000)
filtered <- sce.pbmc[top.pbmc,]
filtered <- GLMPCA(filtered, L=10, minibatch="stochastic")
filtered
## class: SingleCellExperiment 
## dim: 1000 3442 
## metadata(2): Samples glmpca
## assays(2): counts logcounts
## rownames(1000): LYZ S100A9 ... CD1C PRKCH
## rowData names(3): ID Symbol location
## colnames(3442): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
##   TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
## colData names(11): Sample Barcode ... discard DoubletScore
## reducedDimNames(3): PCA TSNE GLMPCA
## mainExpName: NULL
## altExpNames(0):

To visualize the first two components of GLM-PCA, we can use the plotReducedDim function from scater.

plotReducedDim(filtered, "GLMPCA")

3.4 NewWave

An alternative method is implemented in the NewWave package, which implements a negative binomial factor analysis model and uses a penalized likelihood to estimate the latent factors.

Also in this case we can use minibatches, and if multiple CPUs are available parallel computing, to speed up computations. The n_gene_par, n_cell_par, and n_gene_disp arguments allow the user to choose how many observations to use to estimate the gene- and cell-specific parameters. The children argument allows one to use multiple cores; here we use four, but you should change it depending on how many cores your computer has (see parallel::detectCores()).

library(NewWave)
set.seed(100000)
filtered <- newWave(filtered, K=10, children=4,
                    n_gene_disp = 100, n_gene_par = 100, 
                    n_cell_par = 100)
filtered
## class: SingleCellExperiment 
## dim: 1000 3442 
## metadata(2): Samples glmpca
## assays(2): counts logcounts
## rownames(1000): LYZ S100A9 ... CD1C PRKCH
## rowData names(3): ID Symbol location
## colnames(3442): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
##   TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
## colData names(11): Sample Barcode ... discard DoubletScore
## reducedDimNames(4): PCA TSNE GLMPCA newWave
## mainExpName: NULL
## altExpNames(0):
plotReducedDim(filtered, "newWave")

For visualization, it is also possible to compute the tSNE embedding of the 10-dimensional GLM-PCA or newWave representation.

set.seed(2242)
filtered <- runTSNE(filtered, dimred="GLMPCA")
plotTSNE(filtered)

4 Cell type assignment

There are essentially two alternative approaches for the identification of cell types:

  1. Clustering
  2. Annotation with external references

Typically, clustering is done on a low dimensional projection of the data, e.g., through PCA or count-based methods.

On the other hand, reference annotation starts from normalized data.

4.1 Clustering

One of the most popular clustering methods for single-cell RNA-seq is the Louvain algorithm. This method starts from a cell network (graph), created by connecting cells that have shared nearest neighbors.

The first step is the creation of the shared nearest neighbor graph. Briefly, we start from the low dimensional representation of the data (e.g., by GLM-PCA) and construct a graph by linking cells that have neighbors in common. This is implemented in the buildSNNgraph function of the scran package; by default it uses k=10 neighbors.

g <- buildSNNGraph(filtered, k=10, use.dimred = 'GLMPCA')
g
## IGRAPH a176a9d U-W- 3442 121566 -- 
## + attr: weight (e/n)
## + edges from a176a9d:
##  [1]  9--12  6--13  6--15 13--15  2--16  6--17 13--17 15--17 16--19  9--23
## [11] 12--23  8--24 14--27  7--30 16--32 17--33 32--35  2--35 19--37 17--38
## [21] 10--38 15--40  9--42 23--42 19--45 37--45 33--46 13--46 38--47 10--47
## [31]  7--47 17--47 44--48 20--48 39--50  2--51 16--51 35--51 40--53  3--54
## [41]  7--54  2--55 37--55 45--55 51--55 49--57 51--58 31--59  9--60 23--60
## [51] 12--60 17--62 13--62  6--62 15--62 15--63  6--63  9--66 23--66 60--66
## [61] 12--66 42--66 56--68 39--70 50--70 18--71 34--72 30--72  7--72 11--73
## [71] 41--73 36--74 28--74  5--74 40--75 53--75 22--78 67--78  3--79 54--79
## + ... omitted several edges

We can use e.g., the GGally package to visualize the graph, but this is often not very useful due to the large number of nodes and edges.

library(GGally)
ggnet2(g, size=1)

We can then use the igraph package to identify communities (clusters) in the graph, e.g., by using the Louvain algorithm.

library(igraph)
clust <- igraph::cluster_louvain(g)
ggnet2(g, color=membership(clust), size=1)

Finally, we store the results into the colData of our object and visualize the clusters in the t-SNE plot.

filtered$Louvain <- factor(membership(clust))
plotTSNE(filtered, colour_by="Louvain")

Obviously, there are many other algorithms available for the clustering of single-cell data.

Exercise: repeat the clustering with a different graph partition algorithm (e.g. cluster_walktrap) starting from the same graph and compare the results.

4.2 Identification of marker genes

Once we have cluster labels, we need to understand what each cluster represents. Typically, we look for marker genes to label the clusters.

It is not important to perform a formal differential expression test for this step. In fact, the p-values are not valid from an inferential point of view because we are using the data twice: to cluster the data and to test the difference in expression.

For this reason, we use simple methods, like a t-test that compares the average expression of one group versus the average of the others, and use the p-values only as a way to rank potentially interesting genes.

markers <- findMarkers(filtered, filtered$Louvain)
head(markers[[1]])
## DataFrame with 6 rows and 16 columns
##               Top      p.value          FDR summary.logFC   logFC.2   logFC.3
##         <integer>    <numeric>    <numeric>     <numeric> <numeric> <numeric>
## LYZ             1  0.00000e+00  0.00000e+00       5.03263   5.12901  0.537619
## S100A8          1  0.00000e+00  0.00000e+00       4.89938   4.81250  3.550882
## CD74            1  0.00000e+00  0.00000e+00      -2.51323   2.00625 -2.554604
## TYROBP          1  0.00000e+00  0.00000e+00       3.72711   3.73669  1.460459
## S100A12         1 7.66074e-241 1.70239e-239       2.62391   2.60579  2.219861
## FCN1            1  0.00000e+00  0.00000e+00       2.62224   2.59268  1.423785
##           logFC.4   logFC.5   logFC.6   logFC.7   logFC.8   logFC.9  logFC.10
##         <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## LYZ       5.03263   5.08653   5.08061   5.11122   5.26818   4.16174   5.25775
## S100A8    4.81350   4.87054   4.88885   4.89938   5.06813   3.34185   4.90440
## CD74     -2.51323   1.97407   1.70326   1.32984   1.73466   2.37730  -1.63660
## TYROBP    3.72711   3.72669   3.76784   3.65471   1.45562   3.34294   1.82583
## S100A12   2.60100   2.55872   2.58298   2.58466   2.62391   2.32774   2.60287
## FCN1      2.62224   2.57777   2.56931   2.60600   2.55907   2.71329   2.57742
##          logFC.11   logFC.12  logFC.13
##         <numeric>  <numeric> <numeric>
## LYZ      2.833193 -0.2033989   5.12512
## S100A8   4.044796  0.6879618   4.76580
## CD74    -0.803273 -0.6758913   1.91552
## TYROBP   0.181691 -0.0002082   3.75697
## S100A12  2.532983  0.2789026   2.56707
## FCN1     1.031278 -0.4494942   2.58127

We can for instance extract the top 10 genes from each comparison to use as cluster-specific markers.

mm <- unique(unlist(lapply(markers, function(x) rownames(x)[1:10])))
plotGroupedHeatmap(filtered, features=mm, group="Louvain", center=TRUE)

By cross-checking the expression of these genes with databases of known immune cell type markers, we may be able to annotate the clusters in cell types.

4.3 Annotation with external reference

Obviously, manually annotating clusters using gene markers is laborious and requires specific expertise. For this reason, algorithms that automatically annotate cells using external, already annotated reference datasets are becoming popular.

Most methods do not rely on a clustering algorighm, but annotate individual cells, by looking at some form of correlation between each cell in the query dataset and all the annotated samples in the reference.

This is the approach implemented in the SingleR Bioconductor package, which uses a simple yet effective algorithm, based on Spearman correlation.

The celldex Bioconductor package includes several annotated reference datasets (for mouse and human) that can be used with SingleR. Specifically, it includes a few immune-specific reference datasets. We choose one of them, namely the Monaco datataset, which consists of 114 bulk RNA-seq samples of sorted immune cell populations.

library(SingleR)
library(celldex)
ref <- MonacoImmuneData()
ref
## class: SummarizedExperiment 
## dim: 46077 114 
## metadata(0):
## assays(1): logcounts
## rownames(46077): A1BG A1BG-AS1 ... ZYX ZZEF1
## rowData names(0):
## colnames(114): DZQV_CD8_naive DZQV_CD8_CM ... G4YW_Neutrophils
##   G4YW_Basophils
## colData names(3): label.main label.fine label.ont

In the colData of this object, we have the labels of the sorted populations, at two levels of resolution.

colData(ref)
## DataFrame with 114 rows and 3 columns
##                       label.main             label.fine   label.ont
##                      <character>            <character> <character>
## DZQV_CD8_naive      CD8+ T cells      Naive CD8 T cells  CL:0000900
## DZQV_CD8_CM         CD8+ T cells Central memory CD8 T..  CL:0000907
## DZQV_CD8_EM         CD8+ T cells Effector memory CD8 ..  CL:0000913
## DZQV_CD8_TE         CD8+ T cells Terminal effector CD..  CL:0001062
## DZQV_MAIT                T cells             MAIT cells  CL:0000940
## ...                          ...                    ...         ...
## G4YW_NK                 NK cells   Natural killer cells  CL:0000623
## G4YW_pDC         Dendritic cells Plasmacytoid dendrit..  CL:0000784
## G4YW_mDC         Dendritic cells Myeloid dendritic ce..  CL:0000782
## G4YW_Neutrophils     Neutrophils Low-density neutroph..  CL:0000096
## G4YW_Basophils         Basophils  Low-density basophils  CL:0000043

It is straightforward to annotate our cells, by simply selecting our dataset, the reference dataset and the column that corresponds to the annotation that we want to use.

pred <- SingleR(test = sce.pbmc, ref = ref, 
    labels = ref$label.main, assay.type.test="logcounts")
head(pred)
## DataFrame with 6 rows and 4 columns
##                                            scores          labels delta.next
##                                          <matrix>     <character>  <numeric>
## AAACCTGAGAAGGCCT-1 0.247337:0.175560:0.100281:...       Monocytes 0.00164208
## AAACCTGAGACAGACC-1 0.294752:0.214352:0.146588:...       Monocytes 0.06409723
## AAACCTGAGGCATGGT-1 0.195891:0.131444:0.313910:...    CD4+ T cells 0.00176487
## AAACCTGCAGGCGATA-1 0.362410:0.269859:0.225483:... Dendritic cells 0.06348818
## AAACCTGGTACATCCA-1 0.356039:0.152885:0.212780:...         B cells 0.07288104
## AAACCTGGTGCGGTAA-1 0.248973:0.171584:0.354920:...    CD8+ T cells 0.03594280
##                      pruned.labels
##                        <character>
## AAACCTGAGAAGGCCT-1       Monocytes
## AAACCTGAGACAGACC-1       Monocytes
## AAACCTGAGGCATGGT-1    CD4+ T cells
## AAACCTGCAGGCGATA-1 Dendritic cells
## AAACCTGGTACATCCA-1         B cells
## AAACCTGGTGCGGTAA-1    CD8+ T cells

The method returns a continuous score, which represents how similar each cell is to the annotated cell populations in the reference. Moreover, it returns the final prediction in the form of a cell-type label.

We can explore the scores, to visualize the uncertainty in the annotation.

plotScoreHeatmap(pred)

We can include the predictions in the SingleCellExperiment object and visualize the results.

filtered$singler <- factor(pred$pruned.labels)
plotTSNE(filtered, colour_by = "singler")

Exercise: repeat the classification with the finer cell type classification (label.fine).

We can compare the Louvain clusters with the predicted cell types.

library(pheatmap)
pheatmap(log10(table(filtered$singler, filtered$Louvain)+1), scale = "none")

5 Additional analyses

There are many additional analyses that are useful for single-cell RNA-seq data. Two that are particularly important are:

Please, consider taking the following self-paced course to explore more advanced topics: https://ccb-hms.github.io/osca-workbench/

6 Session information

It is good practice to always include a list of the software versions that were used to perform a given analysis, for reproducibility and trouble-shooting purposes. One way of achieving this is via the sessionInfo() function.

sessionInfo()
## R version 4.4.0 (2024-04-24)
## Platform: x86_64-apple-darwin20
## Running under: macOS Ventura 13.6.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/Rome
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] pheatmap_1.0.12             celldex_1.14.0             
##  [3] SingleR_2.6.0               igraph_2.0.3               
##  [5] GGally_2.2.1                NewWave_1.14.0             
##  [7] scry_1.16.0                 scDblFinder_1.18.0         
##  [9] scran_1.32.0                scater_1.32.0              
## [11] ggplot2_3.5.1               EnsDb.Hsapiens.v86_2.99.0  
## [13] ensembldb_2.28.0            AnnotationFilter_1.28.0    
## [15] GenomicFeatures_1.56.0      AnnotationDbi_1.66.0       
## [17] scuttle_1.14.0              DropletUtils_1.24.0        
## [19] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
## [21] Biobase_2.64.0              GenomicRanges_1.56.0       
## [23] GenomeInfoDb_1.40.1         IRanges_2.38.0             
## [25] S4Vectors_0.42.0            BiocGenerics_0.50.0        
## [27] MatrixGenerics_1.16.0       matrixStats_1.3.0          
## [29] DropletTestFiles_1.14.0     BiocStyle_2.32.0           
## 
## loaded via a namespace (and not attached):
##   [1] BiocIO_1.14.0             bitops_1.0-7             
##   [3] filelock_1.0.3            tibble_3.2.1             
##   [5] R.oo_1.26.0               XML_3.99-0.16.1          
##   [7] httr2_1.0.1               lifecycle_1.0.4          
##   [9] edgeR_4.2.0               lattice_0.22-6           
##  [11] MASS_7.3-60.2             alabaster.base_1.4.1     
##  [13] magrittr_2.0.3            limma_3.60.2             
##  [15] sass_0.4.9                rmarkdown_2.27           
##  [17] jquerylib_0.1.4           yaml_2.3.8               
##  [19] metapod_1.12.0            cowplot_1.1.3            
##  [21] DBI_1.2.3                 RColorBrewer_1.1-3       
##  [23] abind_1.4-5               zlibbioc_1.50.0          
##  [25] Rtsne_0.17                purrr_1.0.2              
##  [27] R.utils_2.12.3            RCurl_1.98-1.14          
##  [29] rappdirs_0.3.3            GenomeInfoDbData_1.2.12  
##  [31] ggrepel_0.9.5             irlba_2.3.5.1            
##  [33] dqrng_0.4.1               DelayedMatrixStats_1.26.0
##  [35] codetools_0.2-20          DelayedArray_0.30.1      
##  [37] tidyselect_1.2.1          UCSC.utils_1.0.0         
##  [39] farver_2.1.2              ScaledMatrix_1.12.0      
##  [41] viridis_0.6.5             BiocFileCache_2.12.0     
##  [43] GenomicAlignments_1.40.0  jsonlite_1.8.8           
##  [45] BiocNeighbors_1.22.0      tools_4.4.0              
##  [47] sna_2.7-2                 Rcpp_1.0.12              
##  [49] glue_1.7.0                gridExtra_2.3            
##  [51] SparseArray_1.4.8         xfun_0.44                
##  [53] gypsum_1.0.1              dplyr_1.1.4              
##  [55] HDF5Array_1.32.0          withr_3.0.0              
##  [57] BiocManager_1.30.23       fastmap_1.2.0            
##  [59] rhdf5filters_1.16.0       bluster_1.14.0           
##  [61] fansi_1.0.6               digest_0.6.35            
##  [63] rsvd_1.0.5                R6_2.5.1                 
##  [65] mime_0.12                 colorspace_2.1-0         
##  [67] RSQLite_2.3.7             R.methodsS3_1.8.2        
##  [69] utf8_1.2.4                tidyr_1.3.1              
##  [71] generics_0.1.3            data.table_1.15.4        
##  [73] rtracklayer_1.64.0        httr_1.4.7               
##  [75] S4Arrays_1.4.1            ggstats_0.6.0            
##  [77] pkgconfig_2.0.3           gtable_0.3.5             
##  [79] blob_1.2.4                XVector_0.44.0           
##  [81] htmltools_0.5.8.1         bookdown_0.39            
##  [83] ProtGenerics_1.36.0       alabaster.matrix_1.4.0   
##  [85] scales_1.3.0              png_0.1-8                
##  [87] knitr_1.47                rstudioapi_0.16.0        
##  [89] rjson_0.2.21              coda_0.19-4.1            
##  [91] statnet.common_4.9.0      curl_5.2.1               
##  [93] cachem_1.1.0              rhdf5_2.48.0             
##  [95] BiocVersion_3.19.1        parallel_4.4.0           
##  [97] vipor_0.4.7               restfulr_0.0.15          
##  [99] alabaster.schemas_1.4.0   pillar_1.9.0             
## [101] grid_4.4.0                vctrs_0.6.5              
## [103] glmpca_0.2.0              BiocSingular_1.20.0      
## [105] dbplyr_2.5.0              beachmat_2.20.0          
## [107] cluster_2.1.6             beeswarm_0.4.0           
## [109] evaluate_0.24.0           tinytex_0.51             
## [111] magick_2.8.3              cli_3.6.2                
## [113] locfit_1.5-9.9            compiler_4.4.0           
## [115] Rsamtools_2.20.0          rlang_1.1.4              
## [117] crayon_1.5.2              intergraph_2.0-4         
## [119] labeling_0.4.3            plyr_1.8.9               
## [121] ggbeeswarm_0.7.2          alabaster.se_1.4.1       
## [123] network_1.18.2            viridisLite_0.4.2        
## [125] BiocParallel_1.38.0       munsell_0.5.1            
## [127] Biostrings_2.72.1         lazyeval_0.2.2           
## [129] SharedObject_1.18.0       Matrix_1.7-0             
## [131] ExperimentHub_2.12.0      sparseMatrixStats_1.16.0 
## [133] bit64_4.0.5               Rhdf5lib_1.26.0          
## [135] KEGGREST_1.44.0           statmod_1.5.0            
## [137] alabaster.ranges_1.4.1    highr_0.11               
## [139] AnnotationHub_3.12.0      memoise_2.0.1            
## [141] bslib_0.7.0               bit_4.0.5                
## [143] xgboost_1.7.7.1

References

Agostinis, Federico, Chiara Romualdi, Gabriele Sales, and Davide Risso. 2022. “NewWave: A Scalable r/Bioconductor Package for the Dimensionality Reduction and Batch Effect Removal of Single-Cell RNA-Seq Data.” Bioinformatics 38 (9): 2648–50.
Amezquita, Robert A, Aaron TL Lun, Etienne Becht, Vince J Carey, Lindsay N Carpp, Ludwig Geistlinger, Federico Marini, et al. 2020. “Orchestrating Single-Cell Analysis with Bioconductor.” Nature Methods 17 (2): 137–45.
Hicks, Stephanie C, F William Townes, Mingxiang Teng, and Rafael A Irizarry. 2018. “Missing Data and Technical Variability in Single-Cell RNA-Sequencing Experiments.” Biostatistics 19 (4): 562–78.
Huber, Wolfgang, Vincent J. Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S. Carvalho, Hector Corrada C. Bravo, et al. 2015. Orchestrating high-throughput genomic analysis with Bioconductor. Nature Methods 12 (2): 115–21. https://doi.org/10.1038/nmeth.3252.
Lun, Aaron T, Karsten Bach, and John C Marioni. 2016. “Pooling Across Cells to Normalize Single-Cell RNA Sequencing Data with Many Zero Counts.” Genome Biology 17 (1): 1–14.
Lun, Aaron TL, Samantha Riesenfeld, Tallulah Andrews, Tomas Gomes, John C Marioni, et al. 2019. “EmptyDrops: Distinguishing Cells from Empty Droplets in Droplet-Based Single-Cell RNA Sequencing Data.” Genome Biology 20 (1): 1–9.
Townes, F William, Stephanie C Hicks, Martin J Aryee, and Rafael A Irizarry. 2019. “Feature Selection and Dimension Reduction for Single-Cell RNA-Seq Based on a Multinomial Model.” Genome Biology 20 (1): 1–16.