In this tutorial we walk through a typical spatial transcriptomics analysis using Bioconductor packages.
Spatial transcriptomics is a fast evolving set of technologies and we cannot cover all protocols here. However, we will show an example of spot-based protocols, namely 10X Genomics Visium, and an example of imaging-based methods, namely Nanostring CosMX. Note that this is an area of current development and some methods will likely be improved in the next months.
Most of the steps covered here, especially those for spot-based data, are described in the Best Practices ST book.
The Voyager Bioconductor package (Moses et al. 2023) also has an extensive set of tutorials that cover most of the currently available spatial transcriptomics technologies.
While not covered in this tutorial, there are packages and software tools for the analysis of spatial transcriptomics data outside of Bioconductor too. Popular tools include the Seurat R package, the Giotto R package and the SpatialData python package.
We will initially focus on 10X Genomics Visium. We start from the output of the Space Ranger preprocessing software. This is the 10X Genomics software suite that allows to pre-process the FASTQ files generated by the sequencing platform and perform alignment and quantification. We will perform exploratory data analysis (EDA) and quality control (QC). We will then cover normalization, the identification of spatially variable genes dimensionality reduction and cell type identification and .
We will use one sample of human brain from the dorsolateral prefrontal cortex (DLPFC) region, measured using the 10x Genomics Visium platform.
In the full dataset, there are 12 samples in total, from 3 individuals, with 2 pairs of spatially adjacent replicates (serial sections) per individual (4 samples per individual). The individuals and spatially adjacent replicates can be used as blocking factors. Each sample spans the six layers of the cortex plus white matter in a perpendicular tissue section. For the examples in this workflow we use a single sample from this dataset (sample 151673).
For more details on the dataset, see Maynard et al. (2021). The full dataset is publicly available through the spatialLIBD Bioconductor package.
SpatialExperiment
is a S4 class that extends SingleCellExperiment
and can be used for efficiently storing and working with spatial data in R/Bioconductor.
This class is itself extended by MoleculeExperiment
and SpatialFeatureExperiment
, which allow to more easily work with imaging-based data. For the moment, we will use SpatialExperiment
.
A more thorough overview of SingleCellExperiment
can be found in Righelli et al. (2022).
We start by loading the data.
library(SpatialExperiment)
library(STexampleData)
spe <- Visium_humanDLPFC()
spe
## class: SpatialExperiment
## dim: 33538 4992
## metadata(0):
## assays(1): counts
## rownames(33538): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
## ENSG00000268674
## rowData names(3): gene_id gene_name feature_type
## colnames(4992): AAACAACGAATAGTTC-1 AAACAAGTATCTCCCA-1 ...
## TTGTTTGTATTACACG-1 TTGTTTGTGTAAATTC-1
## colData names(8): barcode_id sample_id ... reference cell_count
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
## imgData names(4): sample_id image_id data scaleFactor
The SpatialExperiment
class should be fairly familiar, since it is heavily based on the SingleCellExperiment
class. In addition to the slots that you already know, a SpatialExperiment
object includes the spatialCoords
and the imgData
slots.
head(spatialCoords(spe))
## pxl_col_in_fullres pxl_row_in_fullres
## AAACAACGAATAGTTC-1 3913 2435
## AAACAAGTATCTCCCA-1 9791 8468
## AAACAATCTACTAGCA-1 5769 2807
## AAACACCAATAACTGC-1 4068 9505
## AAACAGAGCGACTCCT-1 9271 4151
## AAACAGCTTTCAGAAG-1 3393 7583
imgData(spe)
## DataFrame with 2 rows and 4 columns
## sample_id image_id data scaleFactor
## <character> <character> <list> <numeric>
## 1 sample_151673 lowres #### 0.0450045
## 2 sample_151673 hires #### 0.1500150
Note that the colData
DataFrame
is still where most of the useful information about the spots are available.
colData(spe)
## DataFrame with 4992 rows and 8 columns
## barcode_id sample_id in_tissue array_row
## <character> <character> <integer> <integer>
## AAACAACGAATAGTTC-1 AAACAACGAATAGTTC-1 sample_151673 0 0
## AAACAAGTATCTCCCA-1 AAACAAGTATCTCCCA-1 sample_151673 1 50
## AAACAATCTACTAGCA-1 AAACAATCTACTAGCA-1 sample_151673 1 3
## AAACACCAATAACTGC-1 AAACACCAATAACTGC-1 sample_151673 1 59
## AAACAGAGCGACTCCT-1 AAACAGAGCGACTCCT-1 sample_151673 1 14
## ... ... ... ... ...
## TTGTTTCACATCCAGG-1 TTGTTTCACATCCAGG-1 sample_151673 1 58
## TTGTTTCATTAGTCTA-1 TTGTTTCATTAGTCTA-1 sample_151673 1 60
## TTGTTTCCATACAACT-1 TTGTTTCCATACAACT-1 sample_151673 1 45
## TTGTTTGTATTACACG-1 TTGTTTGTATTACACG-1 sample_151673 1 73
## TTGTTTGTGTAAATTC-1 TTGTTTGTGTAAATTC-1 sample_151673 1 7
## array_col ground_truth reference cell_count
## <integer> <character> <character> <integer>
## AAACAACGAATAGTTC-1 16 NA NA NA
## AAACAAGTATCTCCCA-1 102 Layer3 Layer3 6
## AAACAATCTACTAGCA-1 43 Layer1 Layer1 16
## AAACACCAATAACTGC-1 19 WM WM 5
## AAACAGAGCGACTCCT-1 94 Layer3 Layer3 2
## ... ... ... ... ...
## TTGTTTCACATCCAGG-1 42 WM WM 3
## TTGTTTCATTAGTCTA-1 30 WM WM 4
## TTGTTTCCATACAACT-1 27 Layer6 Layer6 3
## TTGTTTGTATTACACG-1 41 WM WM 16
## TTGTTTGTGTAAATTC-1 51 Layer2 Layer2 5
In this case, we have a ground truth annotation of the spots, as well as an indication of the number of cells covered by each spots.
One nice feature of spatial transcriptomics is that we can “see” the tissue under study and visualize the spots on top of the histology image. We will use the `ggspavis
package to do so.
library(ggspavis)
plotVisium(spe)
We can also visualize the ground truth annotation of the spots.
plotSpots(spe, annotate = "ground_truth",
pal = "libd_layer_colors")
We can see that, while the Visium platform consists of a grid of spots, only some of them will be covered by the tissue: these are called “in_tissue” in SpatialExperiment.
spe$in_tissue <- as.factor(spe$in_tissue)
plotSpots(spe, in_tissue = NULL, annotate = "in_tissue")
For the rest of the workflow, we will only keep the in-tissue spots.
spe <- spe[, colData(spe)$in_tissue == 1]
Spot-based spatial transcriptomics is based on short-read sequencing of RNA molecules and as such many methods originally developed for single-cell RNA-seq can be used to analyze Visium data.
In particular, scater’s perCellQCMetrics
function can be used to compute a set of metrics useful to evaluate the quality of the samples. The isOutlier
function uses a data driven threshold to define cells of lower quality compared to the rest of the dataset.
Similarly to what we have done in the single-cell workflow, it is useful to identify spots with high percentages of mitochondrial reads, which may be a symptom of poor sample quality.
library(EnsDb.Hsapiens.v86)
rowData(spe)$location <- mapIds(EnsDb.Hsapiens.v86,
keys=rowData(spe)$gene_id,
column="SEQNAME", keytype="GENEID")
table(rowData(spe)$location=="MT")
##
## FALSE TRUE
## 33213 13
We can now compute the QC metrics with scater.
library(scater)
spe <- addPerCellQC(spe, subsets = list(mito = which(rowData(spe)$location=="MT")))
colData(spe)
## DataFrame with 3639 rows and 14 columns
## barcode_id sample_id in_tissue array_row
## <character> <character> <factor> <integer>
## AAACAAGTATCTCCCA-1 AAACAAGTATCTCCCA-1 sample_151673 1 50
## AAACAATCTACTAGCA-1 AAACAATCTACTAGCA-1 sample_151673 1 3
## AAACACCAATAACTGC-1 AAACACCAATAACTGC-1 sample_151673 1 59
## AAACAGAGCGACTCCT-1 AAACAGAGCGACTCCT-1 sample_151673 1 14
## AAACAGCTTTCAGAAG-1 AAACAGCTTTCAGAAG-1 sample_151673 1 43
## ... ... ... ... ...
## TTGTTTCACATCCAGG-1 TTGTTTCACATCCAGG-1 sample_151673 1 58
## TTGTTTCATTAGTCTA-1 TTGTTTCATTAGTCTA-1 sample_151673 1 60
## TTGTTTCCATACAACT-1 TTGTTTCCATACAACT-1 sample_151673 1 45
## TTGTTTGTATTACACG-1 TTGTTTGTATTACACG-1 sample_151673 1 73
## TTGTTTGTGTAAATTC-1 TTGTTTGTGTAAATTC-1 sample_151673 1 7
## array_col ground_truth reference cell_count sum
## <integer> <character> <character> <integer> <numeric>
## AAACAAGTATCTCCCA-1 102 Layer3 Layer3 6 8458
## AAACAATCTACTAGCA-1 43 Layer1 Layer1 16 1667
## AAACACCAATAACTGC-1 19 WM WM 5 3769
## AAACAGAGCGACTCCT-1 94 Layer3 Layer3 2 5433
## AAACAGCTTTCAGAAG-1 9 Layer5 Layer5 4 4278
## ... ... ... ... ... ...
## TTGTTTCACATCCAGG-1 42 WM WM 3 4324
## TTGTTTCATTAGTCTA-1 30 WM WM 4 2761
## TTGTTTCCATACAACT-1 27 Layer6 Layer6 3 2322
## TTGTTTGTATTACACG-1 41 WM WM 16 2331
## TTGTTTGTGTAAATTC-1 51 Layer2 Layer2 5 6281
## detected subsets_mito_sum subsets_mito_detected
## <numeric> <numeric> <numeric>
## AAACAAGTATCTCCCA-1 3586 1407 13
## AAACAATCTACTAGCA-1 1150 204 11
## AAACACCAATAACTGC-1 1960 430 13
## AAACAGAGCGACTCCT-1 2424 1316 13
## AAACAGCTTTCAGAAG-1 2264 651 12
## ... ... ... ...
## TTGTTTCACATCCAGG-1 2170 370 12
## TTGTTTCATTAGTCTA-1 1560 314 12
## TTGTTTCCATACAACT-1 1343 476 13
## TTGTTTGTATTACACG-1 1420 308 12
## TTGTTTGTGTAAATTC-1 2927 991 13
## subsets_mito_percent total
## <numeric> <numeric>
## AAACAAGTATCTCCCA-1 16.6351 8458
## AAACAATCTACTAGCA-1 12.2376 1667
## AAACACCAATAACTGC-1 11.4089 3769
## AAACAGAGCGACTCCT-1 24.2223 5433
## AAACAGCTTTCAGAAG-1 15.2174 4278
## ... ... ...
## TTGTTTCACATCCAGG-1 8.55689 4324
## TTGTTTCATTAGTCTA-1 11.37269 2761
## TTGTTTCCATACAACT-1 20.49957 2322
## TTGTTTGTATTACACG-1 13.21321 2331
## TTGTTTGTGTAAATTC-1 15.77774 6281
We can visualize the metrics and compute data-driven thresholds to flag low-quality cells.
hist(spe$subsets_mito_percent)
high.mito <- isOutlier(spe$subsets_mito_percent, type="higher")
table(high.mito)
## high.mito
## FALSE TRUE
## 3627 12
hist(spe$sum)
low.umi <- isOutlier(spe$sum, type="lower")
table(low.umi)
## low.umi
## FALSE
## 3639
In this case, only a few spots are discarded for high mitochondrial content. We can check whether there are spatial patterns of low quality.
colData(spe)$high.mito <- high.mito
plotSpotQC(spe, plot_type = "spot",
annotate = "high.mito")
We can plot the library sizes against the number of cells per spot (which is available for this dataset). We expect that spots with more cells will tend to have a higher number of UMIs.
plotSpotQC(spe, plot_type = "scatter",
x_metric = "cell_count", y_metric = "sum")
We see that the spots with very high cell counts also have low numbers of expressed genes. This indicates that the experiments have failed for these spots, and they should be removed. We select a threshold of 10 cells per spot. The number of spots above this threshold is relatively small, and there is a clear downward trend in the number of expressed genes above this threshold.
high.cellcount <- colData(spe)$cell_count > 10
table(high.cellcount)
## high.cellcount
## FALSE TRUE
## 3549 90
spe$high.cellcount <- high.cellcount
plotSpotQC(spe, plot_type = "spot",
annotate = "high.cellcount")
The discarded spots are all on the edges of the tissue. It seems plausible that something has gone wrong with the cell segmentation on the edges of the images, so it makes sense to remove these spots.
Since only a small number of spots are flagged as low quality, we can leave them in the object and simply flag them as low-quality.
table(high.cellcount | high.mito)
##
## FALSE TRUE
## 3537 102
low_quality <- which(high.cellcount | high.mito)
As for quality control, normalization can be borrowed from the single-cell literature. Published work use simple methods, such as the library size normalization. We use here this simple strategy, but more advanced methods can be used (see the single-cell workflow).
spe <- logNormCounts(spe)
spe
## class: SpatialExperiment
## dim: 33538 3639
## metadata(0):
## assays(2): counts logcounts
## rownames(33538): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
## ENSG00000268674
## rowData names(4): gene_id gene_name feature_type location
## colnames(3639): AAACAAGTATCTCCCA-1 AAACAATCTACTAGCA-1 ...
## TTGTTTGTATTACACG-1 TTGTTTGTGTAAATTC-1
## colData names(17): barcode_id sample_id ... high.cellcount sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
## imgData names(4): sample_id image_id data scaleFactor
Exercise: Use the scran
package to normalize the data with the pool deconvolution method and compare the size factors with the library sizes.
In a typical single-cell analysis, we focus on the highly variable genes in the hope that they will carry most of the biological information. Spatial transcriptomics allows us to identify spatially variable genes (SVGs), i.e., genes with spatially correlated patterns of expression across the tissue area.
Several methods to identify SVGs in ST data have recently been developed; here we focus on the nnSVG method (Weber et al. 2023).
In this example, we run nnSVG using a small subset of the dataset for faster runtime. We select a subset by subsampling on the set of spots and including stringent filtering for low-expressed genes. A full analysis using all spots for this dataset and default filtering parameters for Visium data from human brain tissue takes around 45 minutes for one Visium sample on a standard laptop.
library(nnSVG)
n <- 100
set.seed(123)
ix <- sample(seq_len(n), n)
spe_nnSVG <- spe[, ix]
spe_nnSVG <- filter_genes(
spe_nnSVG, filter_genes_ncounts = 10, filter_genes_pcspots = 3
)
spe_nnSVG <- logNormCounts(spe_nnSVG)
spe_nnSVG <- nnSVG(spe_nnSVG)
head(rowData(spe_nnSVG), 3)
## DataFrame with 3 rows and 18 columns
## gene_id gene_name feature_type location
## <character> <character> <character> <character>
## ENSG00000074800 ENSG00000074800 ENO1 Gene Expression 1
## ENSG00000171603 ENSG00000171603 CLSTN1 Gene Expression 1
## ENSG00000162545 ENSG00000162545 CAMK2N1 Gene Expression 1
## sigma.sq tau.sq phi loglik runtime mean
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000074800 0.00689397 0.537541 21.01207 -111.492 0.009 1.71503
## ENSG00000171603 0.57866902 0.125757 23.32804 -120.270 0.010 2.09564
## ENSG00000162545 0.18362513 0.562470 4.97238 -123.776 0.011 2.58247
## var spcov prop_sv loglik_lm LR_stat rank
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000074800 0.549317 0.048413 0.0126626 -111.437 -0.109614 161
## ENSG00000171603 0.721747 0.362993 0.8214763 -125.087 9.634798 59
## ENSG00000162545 0.759200 0.165932 0.2461151 -127.617 7.682376 75
## pval padj
## <numeric> <numeric>
## ENSG00000074800 1.0000000 1.0000000
## ENSG00000171603 0.0080878 0.0234409
## ENSG00000162545 0.0214681 0.0489472
We can for instance inspect the top SVGs by ordering by the rank
column.
rowData(spe_nnSVG)[order(rowData(spe_nnSVG)$rank),]
## DataFrame with 171 rows and 18 columns
## gene_id gene_name feature_type location
## <character> <character> <character> <character>
## ENSG00000197971 ENSG00000197971 MBP Gene Expression 18
## ENSG00000123560 ENSG00000123560 PLP1 Gene Expression X
## ENSG00000173786 ENSG00000173786 CNP Gene Expression 17
## ENSG00000109846 ENSG00000109846 CRYAB Gene Expression 11
## ENSG00000131095 ENSG00000131095 GFAP Gene Expression 17
## ... ... ... ... ...
## ENSG00000173812 ENSG00000173812 EIF1 Gene Expression 17
## ENSG00000128989 ENSG00000128989 ARPP19 Gene Expression 15
## ENSG00000150991 ENSG00000150991 UBC Gene Expression 12
## ENSG00000131143 ENSG00000131143 COX4I1 Gene Expression 16
## ENSG00000138326 ENSG00000138326 RPS24 Gene Expression 10
## sigma.sq tau.sq phi loglik runtime mean
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000197971 2.83483 0.173900 1.550925 -120.999 0.011 3.74783
## ENSG00000123560 3.80992 0.513980 0.942131 -143.802 0.010 2.78344
## ENSG00000173786 2.02307 0.390224 1.151300 -126.786 0.009 1.73816
## ENSG00000109846 2.00210 0.303676 1.793466 -127.424 0.008 1.80283
## ENSG00000131095 2.49706 0.513030 2.687935 -154.650 0.009 2.01780
## ... ... ... ... ... ... ...
## ENSG00000173812 0.01120531 0.552721 11.405284 -113.2418 0.009 2.20198
## ENSG00000128989 0.01913710 0.724128 11.921848 -127.0426 0.008 2.17656
## ENSG00000150991 0.00563829 0.332138 23.772113 -87.6236 0.007 2.69460
## ENSG00000131143 0.01643595 0.654543 15.903211 -121.9345 0.008 2.69960
## ENSG00000138326 0.00746273 0.580789 0.227044 -115.3014 0.009 2.44050
## var spcov prop_sv loglik_lm LR_stat rank
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000197971 2.74088 0.449245 0.942201 -191.805 141.6122 1
## ENSG00000123560 3.16595 0.701256 0.881130 -199.014 110.4231 2
## ENSG00000173786 1.92044 0.818305 0.838303 -174.019 94.4664 3
## ENSG00000109846 1.90257 0.784854 0.868298 -173.552 92.2545 4
## ENSG00000131095 2.77227 0.783134 0.829563 -192.375 75.4484 5
## ... ... ... ... ... ... ...
## ENSG00000173812 0.568611 0.0480727 0.0198702 -113.1634 -0.156703 167
## ENSG00000128989 0.749152 0.0635576 0.0257474 -126.9506 -0.183908 168
## ENSG00000150991 0.340548 0.0278663 0.0166924 -87.5314 -0.184430 169
## ENSG00000131143 0.676280 0.0474896 0.0244955 -121.8339 -0.201117 170
## ENSG00000138326 0.592135 0.0353973 0.0126863 -115.1903 -0.222210 171
## pval padj
## <numeric> <numeric>
## ENSG00000197971 0 0
## ENSG00000123560 0 0
## ENSG00000173786 0 0
## ENSG00000109846 0 0
## ENSG00000131095 0 0
## ... ... ...
## ENSG00000173812 1 1
## ENSG00000128989 1 1
## ENSG00000150991 1 1
## ENSG00000131143 1 1
## ENSG00000138326 1 1
These results already offer a way to explore potentially interesting spatial signal, e.g., by plotting the expression pattern of the top ranked genes.
plotSpots(spe, annotate = "ENSG00000197971")
We now apply principal component analysis (PCA) to reduce the dimensionality of the dataset, and retain the top 50 principal components (PCs) for further downstream analyses.
This is done for two reasons: (i) to reduce noise due to random variation in expression of biologically uninteresting genes, which are assumed to have expression patterns that are independent of each other, and (ii) to improve computational efficiency during downstream analyses.
Here we use scater’s implementation of PCA that by default uses the top 500 most variable genes, but one can specify to use SVGs instead, e.g., by running nnSVG
on the full dataset and select the top 500 or 1000 SVGs.
We will also visualize the data in a UMAP plot.
spe <- runPCA(spe)
spe <- runUMAP(spe)
plotPCA(spe, colour_by = "ground_truth")
plotUMAP(spe, colour_by = "ground_truth")
Again, one advantage of spatial transcriptomics is the ability of visualizing gene expression (in this case the top PCs) in the space of the original tissue. Here, we visualize the top two PCs.
spe$PC1 <- reducedDim(spe)[,1]
spe$PC2 <- reducedDim(spe)[,2]
plotSpots(spe, annotate = "PC1")
plotSpots(spe, annotate = "PC2")
It is natural to think that spots that are physically close in tissue’s space will be more likely to be of the same type (e.g., represent a morphological distinct spatial domain).
This is the main idea behind BayesSpace
(Zhao et al. 2021), which uses a Bayesian Pott’s model to cluster spots.
Here, we specify seven clusters becaue we know from the ground truth labels that we should expect seven areas in the tissue. The BayesSpace
package vignette describes a more general worklfow to use when the number of clusters is unknown.
For computational reasons we run this analysis with only 1000 iterations, but in real analyses at least 10,000 are needed.
library(BayesSpace)
spe_bay <- as(spe, "SingleCellExperiment")
spe_bay$row <- spe_bay$array_row
spe_bay$col <- spe_bay$array_col
set.seed(1650)
spe_bay <- spatialPreprocess(spe_bay, platform="Visium",
n.PCs=10, n.HVGs=1000)
spe_bay <- spatialCluster(spe_bay, q=7, platform="Visium", d=10,
init.method="mclust", model="t", gamma=2,
nrep=1000, burn.in=100,
save.chain=TRUE)
We can now visualize the spatial clusters.
clusterPlot(spe_bay)
We have seen that each spot can contain up to 10 cells in this dataset. Some methods have been developed to deconvolve the gene expression of each spot into different cell types that may make up the spot’s composition. Here, we look at one such algorithm, called SPOTlight (Elosua-Bayes et al. 2021), but many others exist.
library(SPOTlight)
library(scRNAseq)
sce <- fetchDataset("zhong-prefrontal-2018", version = "2023-12-22")
colData(sce)
## DataFrame with 2394 rows and 9 columns
## developmental_stage gender sample
## <character> <character> <character>
## GW08_PFC1_sc1 8 weeks after gestat.. female GW08_PFC1
## GW08_PFC1_sc2 8 weeks after gestat.. female GW08_PFC1
## GW08_PFC1_sc3 8 weeks after gestat.. female GW08_PFC1
## GW08_PFC1_sc4 8 weeks after gestat.. female GW08_PFC1
## GW08_PFC1_sc5 8 weeks after gestat.. female GW08_PFC1
## ... ... ... ...
## GW23_PFC2_SF2_F25_sc46 23 weeks after gesta.. female GW23_PFC2_2
## GW23_PFC2_SF2_F25_sc47 23 weeks after gesta.. female GW23_PFC2_2
## GW23_PFC2_SF2_F25_sc48 23 weeks after gesta.. female GW23_PFC2_2
## GW23_PFC2_SF2_F25_sc49 23 weeks after gesta.. female GW23_PFC2_2
## GW23_PFC2_SF2_F25_sc50 23 weeks after gesta.. female GW23_PFC2_2
## cell_types week Gene_num Pre_Map_Reads
## <character> <character> <numeric> <numeric>
## GW08_PFC1_sc1 Neurons GW08 4405 5941167
## GW08_PFC1_sc2 GABAergic neurons GW08 3587 1288826
## GW08_PFC1_sc3 Neurons GW08 2866 1273574
## GW08_PFC1_sc4 Neurons GW08 2140 1005001
## GW08_PFC1_sc5 Microglia GW08 4119 3034624
## ... ... ... ... ...
## GW23_PFC2_SF2_F25_sc46 GABAergic neurons GW23 1695 1197768
## GW23_PFC2_SF2_F25_sc47 OPC GW23 2865 2742178
## GW23_PFC2_SF2_F25_sc48 GABAergic neurons GW23 3506 6760612
## GW23_PFC2_SF2_F25_sc49 GABAergic neurons GW23 2253 2561934
## GW23_PFC2_SF2_F25_sc50 GABAergic neurons GW23 1654 1163304
## Aligned_Reads MappingRate
## <numeric> <numeric>
## GW08_PFC1_sc1 4622105 0.778
## GW08_PFC1_sc2 1025863 0.796
## GW08_PFC1_sc3 1019595 0.801
## GW08_PFC1_sc4 778936 0.775
## GW08_PFC1_sc5 2346561 0.773
## ... ... ...
## GW23_PFC2_SF2_F25_sc46 834420 0.697
## GW23_PFC2_SF2_F25_sc47 1953612 0.712
## GW23_PFC2_SF2_F25_sc48 3887377 0.575
## GW23_PFC2_SF2_F25_sc49 1369273 0.534
## GW23_PFC2_SF2_F25_sc50 744059 0.640
table(sce$cell_types)
##
## Astrocytes GABAergic neurons Microglia Neurons
## 76 701 68 1057
## OPC Stem cells
## 117 290
We want to select only informative genes from the reference, and subsample the cells to have a balanced representation of all the cell types.
library(scran)
# filter low-quality cells
sce <- addPerCellQC(sce)
sce <- sce[, sce$sum > 1000]
# normalization
sce <- logNormCounts(sce)
# hvg selection
dec <- modelGeneVar(sce)
hvg <- getTopHVGs(dec, n = 3000)
# marker genes
colLabels(sce) <- colData(sce)$cell_types
mgs <- scoreMarkers(sce)
# keep only genes informative for cell type identity
mgs_fil <- lapply(names(mgs), function(i) {
x <- mgs[[i]]
# Filter and keep relevant marker genes, those with AUC > 0.8
x <- x[x$mean.AUC > 0.8, ]
# Sort the genes from highest to lowest weight
x <- x[order(x$mean.AUC, decreasing = TRUE), ]
# Add gene and cluster id to the dataframe
x$gene <- rownames(x)
x$cluster <- i
data.frame(x)
})
mgs_df <- do.call(rbind, mgs_fil)
head(mgs_df)
## self.average other.average self.detected other.detected mean.logFC.cohen
## CLU 9.132017 3.0830456 1.0000000 0.7980547 3.620331
## CST3 11.248005 5.6029258 1.0000000 0.9304197 3.187752
## ATP1A2 7.359000 1.5855057 1.0000000 0.5272619 3.693600
## PTN 10.329705 5.2177866 1.0000000 0.9117331 2.921514
## TNC 6.192645 0.8923408 0.9736842 0.3598610 3.270166
## SLC1A3 7.627345 2.7879639 1.0000000 0.6541743 3.115507
## min.logFC.cohen median.logFC.cohen max.logFC.cohen rank.logFC.cohen
## CLU 2.255557 3.808963 4.804147 1
## CST3 1.679610 3.376374 3.974721 2
## ATP1A2 2.123685 3.833893 4.844772 3
## PTN 1.432999 3.437775 4.430631 4
## TNC 2.272726 3.394062 4.102214 3
## SLC1A3 1.762271 2.039550 5.049038 1
## mean.AUC min.AUC median.AUC max.AUC rank.AUC mean.logFC.detected
## CLU 0.9845035 0.9621008 0.9939185 0.9992260 1 0.3245889
## CST3 0.9805865 0.9129555 0.9978790 0.9999502 1 0.1053531
## ATP1A2 0.9727240 0.9162168 0.9844583 0.9963235 3 0.9378080
## PTN 0.9643422 0.9056455 0.9981605 1.0000000 1 0.1333831
## TNC 0.9571244 0.9174539 0.9687852 0.9816176 4 1.4345833
## SLC1A3 0.9550137 0.9027215 0.9491099 0.9943877 3 0.6743354
## min.logFC.detected median.logFC.detected max.logFC.detected
## CLU 0.19518667 0.3011695 0.4574587
## CST3 0.00000000 0.1407118 0.1974822
## ATP1A2 0.48767052 0.9437784 1.3996101
## PTN 0.04243527 0.1638909 0.2225432
## TNC 0.90781570 1.4962073 1.8225302
## SLC1A3 0.20163386 0.4055499 1.2835610
## rank.logFC.detected gene cluster
## CLU 6644 CLU Astrocytes
## CST3 8562 CST3 Astrocytes
## ATP1A2 1577 ATP1A2 Astrocytes
## PTN 8127 PTN Astrocytes
## TNC 743 TNC Astrocytes
## SLC1A3 1475 SLC1A3 Astrocytes
For computational reason, we select 5 cells per cell type, but in real analyses 100 cells per cell types may be used.
# split cell indices by identity
idx <- split(seq(ncol(sce)), sce$cell_types)
# downsample to at most 5 per identity & subset
set.seed(1035)
n_cells <- 5
cs_keep <- lapply(idx, function(i) {
n <- length(i)
if (n < n_cells)
n_cells <- n
sample(i, n_cells)
})
sce <- sce[, unlist(cs_keep)]
We are now ready to run the spot deconvolution.
# the reference uses gene symbol as row names
rownames(spe) <- rowData(spe)$gene_name
res <- SPOTlight(
x = sce,
y = spe,
groups = as.character(sce$cell_types),
mgs = mgs_df,
hvg = hvg,
weight_id = "mean.AUC",
group_id = "cluster",
gene_id = "gene")
# extract results
mod <- res$NMF
We can visualize the results in several ways, but the most popular is the so-called “scatter pie”, which represents the proportion of cell types in each spot as a pie chart.
# SPOTlight needs unique sample_id names
imgData(spe)$sample_id[1] <- "sample_151673_lr"
mat <- res$mat
ct <- colnames(mat)
mat[mat < 0.1] <- 0
# Define color palette
# (here we use 'paletteMartin' from the 'colorBlindness' package)
paletteMartin <- c(
"#000000", "#004949", "#009292", "#ff6db6", "#ffb6db",
"#490092", "#006ddb", "#b66dff", "#6db6ff", "#b6dbff",
"#920000", "#924900", "#db6d00", "#24ff24", "#ffff6d")
pal <- colorRampPalette(paletteMartin)(length(ct))
names(pal) <- ct
plotSpatialScatterpie(
x = spe,
y = mat,
cell_types = colnames(mat),
img = FALSE,
scatterpie_alpha = 1,
pie_scale = 0.4) +
scale_fill_manual(
values = pal,
breaks = names(pal))
We can download and visualize an example CosMX data, using the SFEData
Bioconductor package.
library(SFEData)
sfe <- HeNSCLCData()
sfe
## class: SpatialFeatureExperiment
## dim: 980 100290
## metadata(0):
## assays(1): counts
## rownames(980): AATK ABL1 ... NegPrb22 NegPrb23
## rowData names(3): means vars cv2
## colnames(100290): 1_1 1_2 ... 30_4759 30_4760
## colData names(17): Area AspectRatio ... nCounts nGenes
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## spatialCoords names(2) : CenterX_global_px CenterY_global_px
## imgData names(1): sample_id
##
## unit: full_res_image_pixels
## Geometries:
## colGeometries: centroids (POINT), cellSeg (POLYGON)
##
## Graphs:
## sample01:
Exercise: familiarize yourself with the SpatialFeatureExperiment class, verify that it extends SpatialExperiment.
library(Voyager)
plotCellBin2D(sfe, hex = TRUE)
fov1 <- sfe[,1:2408]
plotGeometry(fov1, MARGIN = 2L, type = "cellSeg")
Exercise: identify and visualize highly-variable genes (using methods seen above).
Exercise: identify and visualize cell clusters (using methods seen in the single-cell RNA-seq lab).
Exercise: go through the Voyager vignette to compute quality control metrics and identify spatial patterns of expression.
It is good practice to always include a list of the software versions that were used to perform a given analysis, for reproducibility and trouble-shooting purposes. One way of achieving this is via the sessionInfo()
function.
sessionInfo()
## R version 4.4.0 (2024-04-24)
## Platform: x86_64-apple-darwin20
## Running under: macOS Ventura 13.6.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/Rome
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] Voyager_1.6.0 SpatialFeatureExperiment_1.6.1
## [3] SFEData_1.6.0 scran_1.32.0
## [5] scRNAseq_2.18.0 SPOTlight_1.8.0
## [7] BayesSpace_1.14.0 nnSVG_1.8.0
## [9] scater_1.32.0 scuttle_1.14.0
## [11] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.28.0
## [13] AnnotationFilter_1.28.0 GenomicFeatures_1.56.0
## [15] AnnotationDbi_1.66.0 ggspavis_1.10.0
## [17] ggplot2_3.5.1 STexampleData_1.12.3
## [19] ExperimentHub_2.12.0 AnnotationHub_3.12.0
## [21] BiocFileCache_2.12.0 dbplyr_2.5.0
## [23] SpatialExperiment_1.14.0 SingleCellExperiment_1.26.0
## [25] SummarizedExperiment_1.34.0 Biobase_2.64.0
## [27] GenomicRanges_1.56.0 GenomeInfoDb_1.40.1
## [29] IRanges_2.38.0 S4Vectors_0.42.0
## [31] BiocGenerics_0.50.0 MatrixGenerics_1.16.0
## [33] matrixStats_1.3.0 BiocStyle_2.32.0
##
## loaded via a namespace (and not attached):
## [1] ProtGenerics_1.36.0 bitops_1.0-7
## [3] sf_1.0-16 EBImage_4.46.0
## [5] httr_1.4.7 RColorBrewer_1.1-3
## [7] doParallel_1.0.17 tools_4.4.0
## [9] alabaster.base_1.4.1 utf8_1.2.4
## [11] R6_2.5.1 DirichletReg_0.7-1
## [13] HDF5Array_1.32.0 lazyeval_0.2.2
## [15] uwot_0.2.2 mgcv_1.9-1
## [17] rhdf5filters_1.16.0 sp_2.1-4
## [19] withr_3.0.0 gridExtra_2.3
## [21] cli_3.6.2 scatterpie_0.2.3
## [23] sandwich_3.1-0 alabaster.se_1.4.1
## [25] labeling_0.4.3 sass_0.4.9
## [27] nnls_1.5 proxy_0.4-27
## [29] pbapply_1.7-2 Rsamtools_2.20.0
## [31] R.utils_2.12.3 scico_1.5.0
## [33] limma_3.60.2 rstudioapi_0.16.0
## [35] RSQLite_2.3.7 FNN_1.1.4
## [37] generics_0.1.3 BiocIO_1.14.0
## [39] spdep_1.3-5 dplyr_1.1.4
## [41] Matrix_1.7-0 ggbeeswarm_0.7.2
## [43] fansi_1.0.6 abind_1.4-5
## [45] terra_1.7-78 R.methodsS3_1.8.2
## [47] lifecycle_1.0.4 yaml_2.3.8
## [49] edgeR_4.2.0 rhdf5_2.48.0
## [51] SparseArray_1.4.8 grid_4.4.0
## [53] blob_1.2.4 dqrng_0.4.1
## [55] crayon_1.5.2 lattice_0.22-6
## [57] beachmat_2.20.0 cowplot_1.1.3
## [59] KEGGREST_1.44.0 magick_2.8.3
## [61] zeallot_0.1.0 pillar_1.9.0
## [63] knitr_1.47 metapod_1.12.0
## [65] boot_1.3-30 rjson_0.2.21
## [67] xgboost_1.7.7.1 codetools_0.2-20
## [69] wk_0.9.1 glue_1.7.0
## [71] ggfun_0.1.5 memuse_4.2-3
## [73] data.table_1.15.4 vctrs_0.6.5
## [75] png_0.1-8 gypsum_1.0.1
## [77] gtable_0.3.5 assertthat_0.2.1
## [79] cachem_1.1.0 xfun_0.44
## [81] DropletUtils_1.24.0 S4Arrays_1.4.1
## [83] mime_0.12 ggside_0.3.1
## [85] coda_0.19-4.1 sfheaders_0.4.4
## [87] iterators_1.0.14 tinytex_0.51
## [89] units_0.8-5 maxLik_1.5-2.1
## [91] statmod_1.5.0 bluster_1.14.0
## [93] nlme_3.1-164 bit64_4.0.5
## [95] alabaster.ranges_1.4.1 filelock_1.0.3
## [97] bslib_0.7.0 irlba_2.3.5.1
## [99] KernSmooth_2.23-22 vipor_0.4.7
## [101] spData_2.3.1 colorspace_2.1-0
## [103] DBI_1.2.3 tidyselect_1.2.1
## [105] BRISC_1.0.5 bit_4.0.5
## [107] compiler_4.4.0 curl_5.2.1
## [109] httr2_1.0.1 BiocNeighbors_1.22.0
## [111] DelayedArray_0.30.1 bookdown_0.39
## [113] rtracklayer_1.64.0 scales_1.3.0
## [115] hexbin_1.28.3 classInt_0.4-10
## [117] NMF_0.27 rappdirs_0.3.3
## [119] tiff_0.1-12 stringr_1.5.1
## [121] digest_0.6.35 fftwtools_0.9-11
## [123] alabaster.matrix_1.4.0 rmarkdown_2.27
## [125] XVector_0.44.0 jpeg_0.1-10
## [127] htmltools_0.5.8.1 pkgconfig_2.0.3
## [129] sparseMatrixStats_1.16.0 highr_0.11
## [131] fastmap_1.2.0 htmlwidgets_1.6.4
## [133] rlang_1.1.4 UCSC.utils_1.0.0
## [135] DelayedMatrixStats_1.26.0 farver_2.1.2
## [137] jquerylib_0.1.4 zoo_1.8-12
## [139] jsonlite_1.8.8 BiocParallel_1.38.0
## [141] mclust_6.1.1 R.oo_1.26.0
## [143] BiocSingular_1.20.0 RCurl_1.98-1.14
## [145] magrittr_2.0.3 Formula_1.2-5
## [147] GenomeInfoDbData_1.2.12 s2_1.1.6
## [149] patchwork_1.2.0 Rhdf5lib_1.26.0
## [151] munsell_0.5.1 Rcpp_1.0.12
## [153] ggnewscale_0.4.10 viridis_0.6.5
## [155] stringi_1.8.4 alabaster.schemas_1.4.0
## [157] zlibbioc_1.50.0 MASS_7.3-60.2
## [159] plyr_1.8.9 parallel_4.4.0
## [161] ggrepel_0.9.5 deldir_2.0-4
## [163] Biostrings_2.72.1 splines_4.4.0
## [165] locfit_1.5-9.9 rdist_0.0.5
## [167] igraph_2.0.3 rngtools_1.5.2
## [169] reshape2_1.4.4 ScaledMatrix_1.12.0
## [171] BiocVersion_3.19.1 XML_3.99-0.16.1
## [173] evaluate_0.24.0 BiocManager_1.30.23
## [175] foreach_1.5.2 tweenr_2.0.3
## [177] miscTools_0.6-28 tidyr_1.3.1
## [179] RANN_2.6.1 purrr_1.0.2
## [181] polyclip_1.10-6 alabaster.sce_1.4.0
## [183] gridBase_0.4-7 ggforce_0.4.2
## [185] rsvd_1.0.5 restfulr_0.0.15
## [187] RSpectra_0.16-1 e1071_1.7-14
## [189] class_7.3-22 viridisLite_0.4.2
## [191] tibble_3.2.1 memoise_2.0.1
## [193] beeswarm_0.4.0 registry_0.5-1
## [195] GenomicAlignments_1.40.0 cluster_2.1.6