Chapter 3 Using an existing workspace

In this section we learn about

  • Working with Jupyter notebooks
  • Installing and loading R / Bioconductor packages
  • Using the OSCA book to perform single-cell data analysis

Packages used

  • scRNAseq – Pre-defined single cell data sets
  • scater – Single-cell quality control, normalization, dimensionality reduction, and visualization
  • scran – Feature selection and graph construction for clustering

3.1 Orchestrating single cell analysis

We tackle the ‘quick start’ section of the book ‘Orchestrating single cell analysis in R / Bioconductor’ (OSCA).

Visit Quick start, section 5.5, of OSCA.

3.2 Install and load R packages

We will use several R / Bioconductor packages and their dependencies. Make sure that the packages are installed in R by running the following command. Package installation can take quite a long time; reducing installation time is an area of active work.
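
The installation command itself is not reproduced above. A minimal sketch, assuming the standard BiocManager route and only the three packages listed under ‘Packages used’ (the original may install others as well), could look like this:

```r
# Assumption: the three packages listed under 'Packages used'.
pkgs <- c("scRNAseq", "scater", "scran")

# Install BiocManager from CRAN only if it is missing.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Install only the packages that are not already available.
need <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(need))
    BiocManager::install(need, update = FALSE, ask = FALSE)
```

Checking with requireNamespace() first avoids re-installing packages that are already present, which matters when installation is slow.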

Detailed information on R / Bioconductor packages can be found on the package ‘landing pages’, e.g., for scater. Be sure to check out the vignettes available on the landing page.

Attach the packages we will use into the current R session. Attaching a package adds the functions and other objects defined in the package to the R ‘search’ path, so that the commands can be evaluated.
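
As a sketch, assuming the three packages listed at the start of the chapter are the ones attached, the step might be:

```r
library(scRNAseq)  # example data sets, including MacoskoRetinaData()
library(scater)    # QC metrics, normalization, PCA / UMAP, plotting
library(scran)     # gene variance modelling, graph construction

# Attached packages appear on the search path as "package:<name>".
head(search())
```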

3.3 Example data as a SingleCellExperiment

We’ll work with a pre-defined SingleCellExperiment data set; see the ‘Orchestrating single cell analysis in R / Bioconductor’ book for steps to create your own SingleCellExperiment. Check out the help page ?MacoskoRetinaData for more information on this data set.
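
The command that produced the output below is not shown. A plausible sketch, assuming the object is named sce as in the text that follows, is:

```r
library(scRNAseq)

# Retrieve the data set; the first call downloads it through ExperimentHub,
# later calls read from the local cache.
sce <- MacoskoRetinaData()

# Printing the object gives a compact summary rather than the data itself.
sce
```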

## snapshotDate(): 2020-04-27
## see ?scRNAseq and browseVignettes('scRNAseq') for documentation
## loading from cache
## see ?scRNAseq and browseVignettes('scRNAseq') for documentation
## loading from cache
## class: SingleCellExperiment 
## dim: 24658 49300 
## metadata(0):
## assays(1): counts
## rownames(24658): KITL TMTC3 ... 1110059M19RIK GM20861
## rowData names(0):
## colData names(2): cell.id cluster
## reducedDimNames(0):
## altExpNames(0):

Printing sce shows dim: 24658 49300, indicating that there are 24658 rows (genes) and 49300 columns (cells). The colData names(2) line indicates that each column (cell) carries two additional annotations, a cell id and a cluster label, perhaps from a previous analysis.

The count data (number of reads from each cell mapping to each gene) can be extracted from sce using assay(); this is a 24658 x 49300 matrix, so we subset it to show only the first 10 rows and 3 columns. Many of the counts in typical single cell experiments are zeros, so the matrix is stored in a sparse representation (class dgCMatrix) defined in the Matrix package, which is distributed with R; zeros are printed as ‘.’.
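
A sketch of the extraction, assuming sce is the SingleCellExperiment described in this section (the sparsity calculation is an illustrative addition, not part of the original):

```r
library(SingleCellExperiment)

# Subset before printing: the full matrix is far too large to display.
assay(sce, "counts")[1:10, 1:3]

# Illustrative check of sparsity: fraction of entries that are zero.
cnt <- assay(sce, "counts")
1 - Matrix::nnzero(cnt) / prod(dim(cnt))
```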

## 10 x 3 sparse Matrix of class "dgCMatrix"
## KITL                        .               .               1
## TMTC3                       3               .               .
## CEP290                      1               3               .
## 4930430F08RIK               2               1               2
## 1700017N19RIK               .               .               .
## MGAT4C                      .               .               4
## RASSF9                      .               .               .
## LRRIQ1                      .               .               .
## ADGB                        .               .               .
## SLC6A15                     4               1               3

3.4 The ‘Quick start’ workflow

Use the functions perCellQCMetrics() and quickPerCellQC() to identify cells failing to satisfy quality control metrics. Details of these functions can be found in the ‘Orchestrating single cell analysis in R / Bioconductor’ book, on the help pages ?perCellQCMetrics, and in package vignettes.
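
The quality control commands are not shown here. The sketch below follows the OSCA quick start and assumes that sce is the object from section 3.3, that mitochondrial genes are those whose names start with "MT-", and that argument names are those of the scater release contemporary with the 2020 snapshot (newer releases may name them differently):

```r
library(scater)

# Assumption: mitochondrial genes are identified by an "MT-" name prefix.
is.mito <- grepl("^MT-", rownames(sce))

# Per-cell metrics: library size, detected genes, mitochondrial percentage.
qcstats <- perCellQCMetrics(sce, subsets = list(Mito = is.mito))

# Flag outlier cells on those metrics, then drop them.
filtered <- quickPerCellQC(qcstats, percent_subsets = "subsets_Mito_percent")
sce <- sce[, !filtered$discard]

dim(sce)
```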

The number of cells has been reduced from 49300 to 45877.

## [1] 24658 45877

Normalization transforms counts to accommodate differences in ‘library size’ (total number of reads assayed in each cell) and the distribution of the counts. There are several approaches to normalization; we adopt the approach implemented by the logNormCounts() function. This updates sce to include a second assay, called "logcounts".
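
A minimal sketch of this step, assuming sce has already been filtered by the quality control step and that default arguments are used:

```r
library(scater)

# Adds a "logcounts" assay: log2 of size-factor-scaled counts plus a
# pseudocount, using the function's defaults.
sce <- logNormCounts(sce)

assayNames(sce)
assay(sce, "logcounts")[1:10, 1:3]
```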

## 10 x 3 sparse Matrix of class "dgCMatrix"
## KITL               .               .                0.0670073
## TMTC3              0.14677618      .                .        
## CEP290             0.05060284      0.17027364       .        
## 4930430F08RIK      0.09949075      0.05901921       0.1310400
## 1700017N19RIK      .               .                .        
## MGAT4C             .               .                0.2511625
## RASSF9             .               .                .        
## LRRIQ1             .               .                .        
## ADGB               .               .                .        
## SLC6A15            0.19256085      0.05901921       0.1923511

Feature selection identifies genes that are particularly informative about between-cell variation in gene expression. The functions modelGeneVar() and getTopHVGs() identify 914 particularly informative features.
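
The selection commands are not shown. A sketch in the style of the OSCA quick start, assuming sce carries the "logcounts" assay and that the top 10% of genes are kept (the prop value is an assumption, not confirmed by this chapter):

```r
library(scran)

# Decompose each gene's variance into technical and biological components.
dec <- modelGeneVar(sce)

# Keep the genes with the largest biological component.
# Assumption: prop = 0.1; the original setting is not shown.
hvg <- getTopHVGs(dec, prop = 0.1)

head(hvg)
length(hvg)
```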

## [1] "RHO"     "CALM1"   "MEG3"    "GNGT1"   "SAG"     "RPGRIP1"
## [1] 914

Each column of the assay matrix can be thought of as a vector defining the location of a cell in a high-dimensional space. Dimensionality reduction projects this high-dimensional space onto lower dimensions to enhance visualization. The following commands perform ‘PCA’ and ‘UMAP’ dimensionality reduction. The sce object is updated to record these reductions, so the calculation needs to be performed only once. It takes 2 or 3 minutes to evaluate the following cell.
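
The commands are not reproduced here. A sketch, assuming sce has a "logcounts" assay and hvg holds the names of the selected genes; the seed and the number of components are illustrative choices rather than the original settings:

```r
library(scater)

set.seed(1234)  # UMAP is stochastic; fix the seed for reproducibility

system.time({
    # PCA on the selected genes; 25 components is an illustrative choice.
    sce <- runPCA(sce, ncomponents = 25, subset_row = hvg)
    # UMAP computed from the PCA coordinates rather than the full matrix.
    sce <- runUMAP(sce, dimred = "PCA")
})

reducedDimNames(sce)
```

Running UMAP on the PCA coordinates instead of the full expression matrix keeps the computation tractable for tens of thousands of cells.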

##    user  system elapsed 
## 152.906   4.526 157.439
## [1] "PCA"  "UMAP"

PCA and UMAP can be used to visualize cells in two or three dimensions. The cells form visual clusters; cluster_louvain(), from the igraph package, operates on a nearest-neighbor graph of the cells to identify groups of cells that are close to one another in the reduced space.
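
A sketch of graph-based clustering, assuming sce contains the "PCA" reduced dimension. The graph is built with buildSNNGraph() from scran as in the OSCA quick start, and louvain is a hypothetical column name chosen for this example:

```r
library(scran)

# Shared nearest-neighbor graph: cells are nodes, and edges link cells
# with overlapping neighborhoods in PCA space.
g <- buildSNNGraph(sce, use.dimred = "PCA")

# Louvain community detection on the graph gives one label per cell.
# `louvain` is a hypothetical column name, kept distinct from the
# `cluster` column that ships with the data.
sce$louvain <- factor(igraph::cluster_louvain(g)$membership)

# Number of cells assigned to each cluster.
table(sce$louvain)
```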

As a final step in this quick start, cells can be visualized in reduced dimensional space, colored according to the identified clusters.
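
A sketch of the plot, assuming sce contains the "UMAP" reduced dimension and a colData column of cluster labels, here hypothetically named louvain:

```r
library(scater)

# One point per cell in UMAP space, coloured by cluster label.
p <- plotUMAP(sce, colour_by = "louvain")
p
```

The result is a ggplot object, so it can be adjusted further with the usual ggplot2 functions.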

3.5 Information about the packages used in this session

The R command sessionInfo() captures information about the versions of software used in the current session. This can be valuable for performing reproducible analysis.
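
A minimal sketch; writing the record to a file is an illustrative addition, and "session-info.txt" is a hypothetical file name:

```r
# Record the R version, platform, and attached / loaded package versions.
info <- sessionInfo()
info

# Illustrative: save the record alongside the analysis results.
writeLines(capture.output(info), "session-info.txt")
```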

