Contents

1 Introduction

In this tutorial we walk through a typical single-cell RNA-seq analysis using Bioconductor packages. While most steps are valid for data from different protocols, some of the EDA/QC steps are specific of the 10X Genomics Chromium protocol.

We start from the output of the Cell Ranger preprocessing software. This is an open source software suite that allows to pre-process the FASTQ files generated by the sequencing platform and perform alignment and quantification.

We will perform exploratory data analysis (EDA) and quality control (QC), focusing on single-cell, and sometimes droplet-specific, issues, such as the detection of empty droplet and of doublets. We will then cover dimensionality reduction and cell type identification.

Most of the steps covered here (and much more!) are described in great details in the Orchestrating Single Cell Analysis (OSCA) book (Amezquita et al. 2020). I encourage everyone to use that as a reference for the most typical single-cell analyses with Bioconductor.

Bioconductor (Huber et al. 2015) has many packages supporting the analysis of single-cell RNA-seq data and their package vignettes constitute an excellent resource.

While not covered in this tutorial, there are packages and software tools for the analysis of single-cell data outside of Bioconductor too. Popular tools include the Seurat R package and the scanpy python package. The Bioconductor single-cell ecosystem tries whenever possible to provide data structures and coercion functions that make it easy to interoperate between Bioconductor and external software.

1.1 Experimental data

We will focus on one experimental dataset throughout this tutorial. In particular, we will use the Peripheral Blood Mononuclear Cell (PBMC) data from healthy donors provided by 10X Genomics as example datasets, available through the TENxPBMCData Bioconductor package.

If you are an advanced user and are already familiar with the steps provided here, you can make this tutorial more challenging by using other datasets. In particular, the scRNAseq Bioconductor package contains a wide variety of experimental single-cell datasets that can be used for illustration, exploration, and methods development.

1.2 The SingleCellExperiment class

One common feature of the dataset that we will use is that it is stored as an object of the SingleCellExperiment class.

SingleCellExperiment is a S4 class that extends SummarizedExperiment and can be used for efficiently storing and working with single-cell data in R/Bioconductor.

This class serves as the common infrastructure for across 70+ single-cell-related Bioconductor packages and allows users to mix and match packages by different developers. This class implements a data structure that stores all aspects of our single-cell data - gene-by-cell expression data, per-cell metadata and per-gene annotation - and manipulate them in a synchronized manner.

A more thorough overview of SingleCellExperiment can be found in Chapter 4 of OSCA.

2 Exploratory Data Analysis / Quality Control

In this section we focus on EDA/QC steps that are typical in a single-cell analysis.

We start by considering a dataset on Peripheral Blood Mononuclear Cell (PBMC) data from a healthy donor provided by 10X Genomics. We use the DropletTestFiles Bioconductor package to download the raw (i.e., unfiltered) count matrix that contains the UMI counts of all genes in all droplets.

library(DropletTestFiles)
raw.path <- getTestFile("tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz")
out.path <- file.path(tempdir(), "pbmc4k")
untar(raw.path, exdir=out.path)

library(DropletUtils)
fname <- file.path(out.path, "raw_gene_bc_matrices/GRCh38")
sce.pbmc <- read10xCounts(fname, col.names=TRUE)
sce.pbmc
## class: SingleCellExperiment 
## dim: 33694 737280 
## metadata(1): Samples
## assays(1): counts
## rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
##   ENSG00000268674
## rowData names(2): ID Symbol
## colnames(737280): AAACCTGAGAAACCAT-1 AAACCTGAGAAACCGC-1 ...
##   TTTGTCATCTTTAGTC-1 TTTGTCATCTTTCCTC-1
## colData names(2): Sample Barcode
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

The read10xCounts function starts from the output of the Cell Ranger software and imports the data into R as an object of class SingleCellExperiment.

We can notice that the dimension of the matrix is very big; in fact this matrix includes the UMI that have been detected in all the droplets that have been sequenced, including the empty droplets that may contain only ambient RNA.

This is a very sparse matrix, with a large fraction of zeros; read10xCounts is aware of this and stores the counts as a sparse matrix, which has a very small memory footprint.

class(counts(sce.pbmc))
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"

Exercise: Explore the SingleCellExperiment object and its slots. In particular, the assay, counts, colData, rowData assessors.

Before starting the analysis, it may be a good idea to store the names of the genes in a more human-friendly ID system. We can also include information on the chromosome location of the genes; this will be useful for e.g. identifying mitochondrial genes.

library(scuttle)
rownames(sce.pbmc) <- uniquifyFeatureNames(
    rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol)

library(EnsDb.Hsapiens.v86)
rowData(sce.pbmc)$location <- mapIds(EnsDb.Hsapiens.v86,
                                     keys=rowData(sce.pbmc)$ID, 
                                     column="SEQNAME", keytype="GENEID")
rowData(sce.pbmc)
## DataFrame with 33694 rows and 3 columns
##                           ID       Symbol    location
##                  <character>  <character> <character>
## RP11-34P13.3 ENSG00000243485 RP11-34P13.3           1
## FAM138A      ENSG00000237613      FAM138A           1
## OR4F5        ENSG00000186092        OR4F5           1
## RP11-34P13.7 ENSG00000238009 RP11-34P13.7           1
## RP11-34P13.8 ENSG00000239945 RP11-34P13.8           1
## ...                      ...          ...         ...
## AC233755.2   ENSG00000277856   AC233755.2  KI270726.1
## AC233755.1   ENSG00000275063   AC233755.1  KI270726.1
## AC240274.1   ENSG00000271254   AC240274.1  KI270711.1
## AC213203.1   ENSG00000277475   AC213203.1  KI270713.1
## FAM231B      ENSG00000268674      FAM231B  KI270713.1

Exercise: compute the proportion of zero for each droplet (column) and draw the distribution across the dataset.

2.1 Empty droplets

The first step is the identification of droplets that do not contain any live cell.

The reason why these droplets contain some RNA is that there may be some ambient RNA due to some cell leaking or they may contain dead or dying cells.

The barcodeRanks function can be used to rank the barcodes by number of UMIs and to estimate the knee and inflection point of the distribution.

bcrank <- barcodeRanks(counts(sce.pbmc))

# Only showing unique points for plotting speed.
uniq <- !duplicated(bcrank$rank)
plot(bcrank$rank[uniq], bcrank$total[uniq], log="xy",
    xlab="Rank", ylab="Total UMI count", cex.lab=1.2)

abline(h=metadata(bcrank)$inflection, col="darkgreen", lty=2)
abline(h=metadata(bcrank)$knee, col="dodgerblue", lty=2)

legend("bottomleft", legend=c("Inflection", "Knee"), 
        col=c("darkgreen", "dodgerblue"), lty=2, cex=1.2)

There is a sharp distinction between droplets with very high counts, very likely to contain a live cell, and droplets with very low counts, very likely to be empty.

However, it is not straightforward to classify the droplets in the middle of the distribution.

We can apply a statistical test of hypothesis to decide, for each droplet, if its RNA profile is significantly different from the profile of ambient RNA, estimated from the very low counts (Aaron TL Lun et al. 2019).

We use a very low threshold on the False Discovery Rate to have very few false positive cells.

set.seed(100)
e.out <- emptyDrops(counts(sce.pbmc))
summary(e.out$FDR <= 0.001)
##    Mode   FALSE    TRUE    NA's 
## logical     989    4300  731991

The large majority of droplets are not tested, since by default all droplets with fewer than 100 UMIs are considered empty.

table(colSums(counts(sce.pbmc))>100, e.out$FDR<=0.001, useNA = "ifany")
##        
##          FALSE   TRUE   <NA>
##   FALSE      0      0 731991
##   TRUE     989   4300      0

We can now proceed by removing the empty droplets and keep only the ones identified to be cells.

sce.pbmc <- sce.pbmc[,which(e.out$FDR <= 0.001)]
sce.pbmc
## class: SingleCellExperiment 
## dim: 33694 4300 
## metadata(1): Samples
## assays(1): counts
## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
## rowData names(3): ID Symbol location
## colnames(4300): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
##   TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
## colData names(2): Sample Barcode
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

2.2 Data exploration

We will see in more details in the next section how to perform accurate dimensionality reduction. Here, we quickly perform a principal component analysis (PCA) and tSNE embedding to explore the data and highlight possible quality issues for some cells.

These embeddings can be quickly computed using the runPCA and runTSNE functions in scater. It is useful to log transform the data before PCA to more accurately reflect the signal.

The logNormCounts function divides by the total number of UMIs in addition to log transforming the counts to account for the difference in sequencing depths. This is not the best normalization approach, but for our current purpose of quickly visualizing the data, it is good enough.

We will need to recompute these embeddings after filtering and proper normalization to obtain a more accurate representation of our data.

library(scater)
sce.pbmc <- logNormCounts(sce.pbmc)
sce.pbmc <- runPCA(sce.pbmc)
sce.pbmc <- runTSNE(sce.pbmc, dimred="PCA")
sce.pbmc
## class: SingleCellExperiment 
## dim: 33694 4300 
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
## rowData names(3): ID Symbol location
## colnames(4300): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
##   TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
## colData names(3): Sample Barcode sizeFactor
## reducedDimNames(2): PCA TSNE
## mainExpName: NULL
## altExpNames(0):

Note that it is generally a good strategy to start from PCA to compute the tSNE embedding; this is achieved using the dimred argument.

Exercise: Explore the logcounts accessor and the assayNames, assays and assay methods to extract the original counts and the log-normalized counts from the object; explore the reducedDimNames and reducedDim methods to explore the PCA and t-SNE embedding.

We can visualize the first two PCs and the tSNE embedding using the following scater functions.

plotPCA(sce.pbmc)

plotTSNE(sce.pbmc)