Chapter 4 Working with large data

In this section we learn about

Importing data from the Human Cell Atlas
Working with large on-disk data
Exploring tabular data using ‘tidyverse’ tools

Packages used

BiocFileCache – local cache of internet and other files.
LoomExperiment – ‘loom’ single-cell experiment representation
DelayedArray – large on-disk matrix-like data representation
dplyr – ‘tidyverse’ maniplation of data.frame-like data

4.1 Human cell atlas data discovery

Select on the project “A single-cell reference map of transcriptional states for human blood and tissue T cell activation” (second item on the default screen, as of this writing). Note that the data set is very well documented. We will work with the ‘Expression Matrices’ (left panel) in this section of the workshop, and ‘Analysis Protocol (optimus_v1.3.5)’ (right sidebar) in the next section of the workshop.

Select the Expression Matrices link.

Copy the link to the ‘loom’ format matrix.

4.2 R / Bioconductor libraries

Return to AnVIL and create a new notebook. Make sure that the runtime is available, and the notebook in ‘edit’ mode.

Make sure that relevant software is loaded. The first time this code is executed, it will install LoomExperiment – installation will take about 45 seconds.

packages <- c("LoomExperiment", "SingleCellExperiment", "dplyr", "BiocFileCache")
need <- packages[ !packages %in% rownames(installed.packages()) ]
if (length(need))
    BiocManager::install(need, update = FALSE)

Load the package we will use during the session

suppressPackageStartupMessages({
    library(BiocFileCache)
    library(LoomExperiment)
    library(SingleCellExperiment)
    library(dplyr)
})

4.3 Data download and import

Retrieve the HCA matrix. Do this using iocFileCache, a package which creates and maintains a local collection of files.

url <- "https://data.humancellatlas.org/project-assets/project-matrices/4a95101c-9ffc-4f30-a809-f04518a23803.homo_sapiens.loom"
loom_file <- BiocFileCache::bfcrpath(rnames = url)

Import the expression matrix into R using the LoomExperiment package.

loom <- import(loom_file)
loom

## class: LoomExperiment 
## dim: 58347 267360 
## metadata(0):
## assays(1): matrix
## rownames: NULL
## rowData names(9): Accession Gene ... genus_species isgene
## colnames: NULL
## colData names(44): CellID analysis_protocol.protocol_core.protocol_id
##   ... specimen_from_organism.provenance.document_id total_umis
## rowGraphs(0): NULL
## colGraphs(0): NULL

4.4 ‘DelayedArray’ representation of large on-disk data

The experiment has 58347 rows (genes) and 267360 columns (cells). Re-name the assay to be "counts"

assayNames(loom) <- "counts"
loom

## class: LoomExperiment 
## dim: 58347 267360 
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(9): Accession Gene ... genus_species isgene
## colnames: NULL
## colData names(44): CellID analysis_protocol.protocol_core.protocol_id
##   ... specimen_from_organism.provenance.document_id total_umis
## rowGraphs(0): NULL
## colGraphs(0): NULL

and take a look at the count data

assay(loom, "counts")

## <58347 x 267360> matrix of class DelayedMatrix and type "double":
##               [,1]      [,2]      [,3] ... [,267359] [,267360]
##     [1,]         0         0         0   .         0         0
##     [2,]         0         0         0   .         0         0
##     [3,]         0         0         0   .         0         1
##     [4,]         0         0         1   .         0         0
##     [5,]         0         0         0   .         0         0
##      ...         .         .         .   .         .         .
## [58343,]         0         0         0   .         0         0
## [58344,]         0         0         0   .         0         0
## [58345,]         0         0         0   .         0         0
## [58346,]         0         0         0   .         0         0
## [58347,]         0         0         0   .         0         0

This is an object of class DelayedArray, defined in the DelayedArray package. It behaves like a regular matrix, but the data is stored on-disk and only accessed when necessary. For instance, the following calculation provides the illusion that the full count matrix has been transformed, but actually only the cells visible in the output are calculated; the remaining data is on-disk.

log1p(assay(loom, "counts"))

## <58347 x 267360> matrix of class DelayedMatrix and type "double":
##               [,1]      [,2]      [,3] ... [,267359] [,267360]
##     [1,] 0.0000000 0.0000000 0.0000000   . 0.0000000 0.0000000
##     [2,] 0.0000000 0.0000000 0.0000000   . 0.0000000 0.0000000
##     [3,] 0.0000000 0.0000000 0.0000000   . 0.0000000 0.6931472
##     [4,] 0.0000000 0.0000000 0.6931472   . 0.0000000 0.0000000
##     [5,] 0.0000000 0.0000000 0.0000000   . 0.0000000 0.0000000
##      ...         .         .         .   .         .         .
## [58343,]         0         0         0   .         0         0
## [58344,]         0         0         0   .         0         0
## [58345,]         0         0         0   .         0         0
## [58346,]         0         0         0   .         0         0
## [58347,]         0         0         0   .         0         0

On the other hand, calculating library size (sum of reads in each column) requires all data to be accessed, and can be time-consuming. Here we calculate library sizes of the first 2000 cells; we can expect processing time to scale linearly with number of cells

system.time({
    counts <- assay(loom, "counts")[, 1:2000]
    lib_size <- colSums(counts)
})

##    user  system elapsed 
##   3.207   0.643   3.858

4.5 Subseting and in-memory location for data exploration

In many scenarios it is appropriate to perform exploratory analysis on a subset of data. Here we select the first 2000 cells (fast) and coerce the counts matrix of the subset to a sparse in-memory representation.

loom_subset <- loom[, 1:2000]
m <- assay(loom_subset, "counts")
assay(loom_subset, "counts") <- as(m, "dgCMatrix")

print(loom_subset)

## class: LoomExperiment 
## dim: 58347 2000 
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(9): Accession Gene ... genus_species isgene
## colnames: NULL
## colData names(44): CellID analysis_protocol.protocol_core.protocol_id
##   ... specimen_from_organism.provenance.document_id total_umis
## rowGraphs(0): NULL
## colGraphs(0): NULL

assay(loom_subset)[1:10, 1:5]

## dimnames(.) <- NULL:  translated to 
## dimnames(.) <- list(NULL,NULL)  <==>  unname(.)

## 10 x 5 sparse Matrix of class "dgCMatrix"
##                
##  [1,] . . . . .
##  [2,] . . . . .
##  [3,] . . . . .
##  [4,] . . 1 . .
##  [5,] . . . . .
##  [6,] . . . . .
##  [7,] . . . . .
##  [8,] . . . . .
##  [9,] . . . . .
## [10,] . . . . .

Exploratory analysis can then proceed quickly, e.g., calculating library size

system.time({
    counts <- assay(loom_subset, "counts")
    lib_size <- colSums(counts)
})

## dimnames(.) <- NULL:  translated to 
## dimnames(.) <- list(NULL,NULL)  <==>  unname(.)

##    user  system elapsed 
##   0.016   0.000   0.016

range(lib_size)

## [1]   100 44053

4.6 Exploration of column (cell) annotations

Return to the full data set

loom

## class: LoomExperiment 
## dim: 58347 267360 
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(9): Accession Gene ... genus_species isgene
## colnames: NULL
## colData names(44): CellID analysis_protocol.protocol_core.protocol_id
##   ... specimen_from_organism.provenance.document_id total_umis
## rowGraphs(0): NULL
## colGraphs(0): NULL

We will use the ‘tidyverse’ approach to data management, with data represented as a tibble (a better-behaving data.frame) and a ‘pipe’ (%>%) connecting the output of one command with the input of the next.

The colData() accessor provides annotations about each cell of the LoomExperiment. Extract that information as a tibble so that we can explore the data in some detail.

tbl <-
    colData(loom) %>%
    as_tibble()
print(tbl)

## # A tibble: 267,360 x 44
##    CellID analysis_protoc… analysis_protoc… analysis_workin… barcode bundle_uuid
##    <chr>  <chr>            <chr>            <chr>            <chr>   <chr>      
##  1 00016… optimus_v1.3.5   6e9a0638-4dcd-4… blessed          ATTGGA… c5df0207-1…
##  2 00027… optimus_v1.3.5   aaccd734-ef95-4… blessed          CGAATG… dde6b553-f…
##  3 0007f… optimus_v1.3.5   d9514087-a5b9-4… blessed          GATTCA… 2677fd9d-6…
##  4 000b8… optimus_v1.3.5   b115679e-1139-4… blessed          CGGCTA… a4606cc6-c…
##  5 000fa… optimus_v1.3.5   b115679e-1139-4… blessed          GTATTC… a4606cc6-c…
##  6 00155… optimus_v1.3.5   b115679e-1139-4… blessed          GAGCAG… a4606cc6-c…
##  7 0019d… optimus_v1.3.5   c8f7a0e8-3a4f-4… blessed          TAAACC… 9137888e-c…
##  8 001a0… optimus_v1.3.5   c8f7a0e8-3a4f-4… blessed          TGCTAC… 9137888e-c…
##  9 001b9… optimus_v1.3.5   9944ebca-03e8-4… blessed          ATTATC… 88f037c4-9…
## 10 001d3… optimus_v1.3.5   0989c692-05a5-4… blessed          GGTATT… 8c6e51e8-e…
## # … with 267,350 more rows, and 38 more variables: bundle_version <chr>,
## #   cell_suspension.genus_species.ontology <chr>,
## #   cell_suspension.genus_species.ontology_label <chr>,
## #   cell_suspension.provenance.document_id <chr>, derived_organ_label <chr>,
## #   derived_organ_ontology <chr>, derived_organ_parts_label <chr>,
## #   derived_organ_parts_ontology <chr>,
## #   donor_organism.development_stage.ontology <chr>,
## #   donor_organism.development_stage.ontology_label <chr>,
## #   donor_organism.diseases.ontology <chr>,
## #   donor_organism.diseases.ontology_label <chr>,
## #   donor_organism.human_specific.ethnicity.ontology <chr>,
## #   donor_organism.human_specific.ethnicity.ontology_label <chr>,
## #   donor_organism.is_living <chr>,
## #   donor_organism.provenance.document_id <chr>, donor_organism.sex <chr>,
## #   dss_bundle_fqid <chr>, emptydrops_is_cell <chr>, file_uuid <chr>,
## #   file_version <chr>, genes_detected <int>,
## #   library_preparation_protocol.end_bias <chr>,
## #   library_preparation_protocol.input_nucleic_acid_molecule.ontology <chr>,
## #   library_preparation_protocol.input_nucleic_acid_molecule.ontology_label <chr>,
## #   library_preparation_protocol.library_construction_method.ontology <chr>,
## #   library_preparation_protocol.library_construction_method.ontology_label <chr>,
## #   library_preparation_protocol.provenance.document_id <chr>,
## #   library_preparation_protocol.strand <chr>,
## #   project.project_core.project_short_name <chr>,
## #   project.project_core.project_title <chr>,
## #   project.provenance.document_id <chr>,
## #   specimen_from_organism.organ.ontology <chr>,
## #   specimen_from_organism.organ.ontology_label <chr>,
## #   specimen_from_organism.organ_parts.ontology <chr>,
## #   specimen_from_organism.organ_parts.ontology_label <chr>,
## #   specimen_from_organism.provenance.document_id <chr>, total_umis <dbl>

4.7 Invariant cell annotations

We have extensive data (44 columns) on each cell. Many of the columns are constant across all cells. We use the function select_if() from the dplyr package to select columns that satisfy the expression length(unique(.)) == 1, where . refers to each column in the tibble.

tbl %>%
    ## select columns that have exactly one value
    select_if(~ length(unique(.)) == 1) %>%
    distinct() %>%
    ## create a new tibble of column name / value pairs to summarize these
    tidyr::pivot_longer(everything()) %>%
    print()

## # A tibble: 17 x 2
##    name                                  value                                  
##    <chr>                                 <chr>                                  
##  1 analysis_protocol.protocol_core.prot… optimus_v1.3.5                         
##  2 analysis_working_group_approval_stat… blessed                                
##  3 cell_suspension.genus_species.ontolo… NCBITaxon:9606                         
##  4 cell_suspension.genus_species.ontolo… Homo sapiens                           
##  5 donor_organism.development_stage.ont… EFO:0001272                            
##  6 donor_organism.development_stage.ont… adult                                  
##  7 donor_organism.sex                    male                                   
##  8 library_preparation_protocol.end_bias 3 prime tag                            
##  9 library_preparation_protocol.input_n… OBI:0000869                            
## 10 library_preparation_protocol.input_n… polyA RNA extract                      
## 11 library_preparation_protocol.library… EFO:0009310                            
## 12 library_preparation_protocol.library… 10X v2 sequencing                      
## 13 library_preparation_protocol.provena… e953a093-f5ab-46df-9223-a492f4775d44   
## 14 library_preparation_protocol.strand   first                                  
## 15 project.project_core.project_short_n… HumanTissueTcellActivation             
## 16 project.project_core.project_title    A single-cell reference map of transcr…
## 17 project.provenance.document_id        4a95101c-9ffc-4f30-a809-f04518a23803

4.8 Donor provenance and organ labels

The variable columns are also easily selected

tbl_var <-
    tbl %>%
    select_if(~ length(unique(.)) != 1)

print(tbl_var)

## # A tibble: 267,360 x 27
##    CellID analysis_protoc… barcode bundle_uuid bundle_version cell_suspension…
##    <chr>  <chr>            <chr>   <chr>       <chr>          <chr>           
##  1 00016… 6e9a0638-4dcd-4… ATTGGA… c5df0207-1… 2019-09-14T09… 788f3a1c-9369-4…
##  2 00027… aaccd734-ef95-4… CGAATG… dde6b553-f… 2019-09-14T09… b72a54dc-69d7-4…
##  3 0007f… d9514087-a5b9-4… GATTCA… 2677fd9d-6… 2019-09-14T11… 10510764-ed13-4…
##  4 000b8… b115679e-1139-4… CGGCTA… a4606cc6-c… 2019-09-14T11… ac1e511f-fa9e-4…
##  5 000fa… b115679e-1139-4… GTATTC… a4606cc6-c… 2019-09-14T11… ac1e511f-fa9e-4…
##  6 00155… b115679e-1139-4… GAGCAG… a4606cc6-c… 2019-09-14T11… ac1e511f-fa9e-4…
##  7 0019d… c8f7a0e8-3a4f-4… TAAACC… 9137888e-c… 2019-09-14T09… 78b08aea-c194-4…
##  8 001a0… c8f7a0e8-3a4f-4… TGCTAC… 9137888e-c… 2019-09-14T09… 78b08aea-c194-4…
##  9 001b9… 9944ebca-03e8-4… ATTATC… 88f037c4-9… 2019-09-14T10… b29ca10d-b611-4…
## 10 001d3… 0989c692-05a5-4… GGTATT… 8c6e51e8-e… 2019-09-14T10… 7e5f6787-ab75-4…
## # … with 267,350 more rows, and 21 more variables: derived_organ_label <chr>,
## #   derived_organ_ontology <chr>, derived_organ_parts_label <chr>,
## #   derived_organ_parts_ontology <chr>, donor_organism.diseases.ontology <chr>,
## #   donor_organism.diseases.ontology_label <chr>,
## #   donor_organism.human_specific.ethnicity.ontology <chr>,
## #   donor_organism.human_specific.ethnicity.ontology_label <chr>,
## #   donor_organism.is_living <chr>,
## #   donor_organism.provenance.document_id <chr>, dss_bundle_fqid <chr>,
## #   emptydrops_is_cell <chr>, file_uuid <chr>, file_version <chr>,
## #   genes_detected <int>, specimen_from_organism.organ.ontology <chr>,
## #   specimen_from_organism.organ.ontology_label <chr>,
## #   specimen_from_organism.organ_parts.ontology <chr>,
## #   specimen_from_organism.organ_parts.ontology_label <chr>,
## #   specimen_from_organism.provenance.document_id <chr>, total_umis <dbl>

The donor_organism.provenance.documented_id column identifies indivuals contributing cells, so the cells per individual are

tbl %>%
    count(donor_organism.provenance.document_id) %>%
    print()

## # A tibble: 4 x 2
##   donor_organism.provenance.document_id      n
##   <chr>                                  <int>
## 1 4b7a3de7-94db-4507-b1f9-c3931f9f3aa5   10745
## 2 8b5774b2-2aba-4b90-8fa0-c7f69207802c   45676
## 3 95822e01-44a7-420b-92fc-b001460b1d13  199515
## 4 b09a1e10-3c38-4bb3-8553-d762810a6fb7   11424

The tissues represented in the experiment are summarized as

tbl %>%
    count(derived_organ_label) %>%
    print()

## # A tibble: 4 x 2
##   derived_organ_label        n
##   <chr>                  <int>
## 1 blood                  22169
## 2 hematopoietic system   82723
## 3 lung                   80774
## 4 mediastinal lymph node 81694

The occurrence of organism and organ shows that two samples provided blood, and tow samples provided hematopoietic / lung / lymph node.

tbl %>%
    count(donor_organism.provenance.document_id, derived_organ_label) %>%
    tidyr::pivot_wider(
               names_from = "derived_organ_label",
               values_from = "n"
           ) %>%
    print()

## # A tibble: 4 x 5
##   donor_organism.provenance… blood `hematopoietic sys…  lung `mediastinal lymph…
##   <chr>                      <int>               <int> <int>               <int>
## 1 4b7a3de7-94db-4507-b1f9-c… 10745                  NA    NA                  NA
## 2 8b5774b2-2aba-4b90-8fa0-c…    NA               12559 15257               17860
## 3 95822e01-44a7-420b-92fc-b…    NA               70164 65517               63834
## 4 b09a1e10-3c38-4bb3-8553-d… 11424                  NA    NA                  NA

4.9 Sub-setting data based on cell attributes

With this exploration, it is then easy to create a subset of the data for further exploration, e.g., focusing on blood

blood <-
    loom %>%
    subset(, derived_organ_label %in% "blood")

print(blood)

## class: LoomExperiment 
## dim: 58347 22169 
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(9): Accession Gene ... genus_species isgene
## colnames: NULL
## colData names(44): CellID analysis_protocol.protocol_core.protocol_id
##   ... specimen_from_organism.provenance.document_id total_umis
## rowGraphs(0): NULL
## colGraphs(0): NULL

4.10 Information about the packages used in this session

The R command sessionInfo() captures information about the versions of software used in the current session. This can be valuable for performing reproducible analysis.

sessionInfo()

## R version 4.0.2 Patched (2020-06-24 r78747)
## Platform: x86_64-apple-darwin17.7.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS:   /Users/ma38727/bin/R-4-0-branch/lib/libRblas.dylib
## LAPACK: /Users/ma38727/bin/R-4-0-branch/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] dplyr_1.0.0                 LoomExperiment_1.6.0       
##  [3] rtracklayer_1.48.0          rhdf5_2.32.2               
##  [5] BiocFileCache_1.12.0        dbplyr_1.4.4               
##  [7] scran_1.16.0                scater_1.16.2              
##  [9] ggplot2_3.3.2               scRNAseq_2.2.0             
## [11] SingleCellExperiment_1.10.1 SummarizedExperiment_1.18.2
## [13] DelayedArray_0.14.1         matrixStats_0.56.0         
## [15] Biobase_2.48.0              GenomicRanges_1.40.0       
## [17] GenomeInfoDb_1.24.2         IRanges_2.22.2             
## [19] S4Vectors_0.26.1            BiocGenerics_0.34.0        
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6                  bit64_0.9-7.1                
##  [3] httr_1.4.1                    tools_4.0.2                  
##  [5] utf8_1.1.4                    R6_2.4.1                     
##  [7] irlba_2.3.3                   HDF5Array_1.16.1             
##  [9] vipor_0.4.5                   DBI_1.1.0                    
## [11] colorspace_1.4-1              withr_2.2.0                  
## [13] tidyselect_1.1.0              gridExtra_2.3                
## [15] bit_1.1-15.2                  curl_4.3                     
## [17] compiler_4.0.2                cli_2.0.2                    
## [19] BiocNeighbors_1.6.0           labeling_0.3                 
## [21] bookdown_0.20                 scales_1.1.1                 
## [23] rappdirs_0.3.1                Rsamtools_2.4.0              
## [25] stringr_1.4.0                 digest_0.6.25                
## [27] rmarkdown_2.3                 XVector_0.28.0               
## [29] pkgconfig_2.0.3               htmltools_0.5.0              
## [31] fastmap_1.0.1                 limma_3.44.3                 
## [33] rlang_0.4.7                   rstudioapi_0.11              
## [35] RSQLite_2.2.0                 shiny_1.5.0                  
## [37] DelayedMatrixStats_1.10.1     farver_2.0.3                 
## [39] generics_0.0.2                BiocParallel_1.22.0          
## [41] RCurl_1.98-1.2                magrittr_1.5                 
## [43] BiocSingular_1.4.0            GenomeInfoDbData_1.2.3       
## [45] Matrix_1.2-18                 fansi_0.4.1                  
## [47] Rhdf5lib_1.10.1               Rcpp_1.0.5                   
## [49] ggbeeswarm_0.6.0              munsell_0.5.0                
## [51] viridis_0.5.1                 lifecycle_0.2.0              
## [53] stringi_1.4.6                 yaml_2.2.1                   
## [55] edgeR_3.30.3                  zlibbioc_1.34.0              
## [57] AnnotationHub_2.20.0          grid_4.0.2                   
## [59] blob_1.2.1                    promises_1.1.1               
## [61] dqrng_0.2.1                   ExperimentHub_1.14.0         
## [63] crayon_1.3.4                  lattice_0.20-41              
## [65] Biostrings_2.56.0             cowplot_1.0.0                
## [67] locfit_1.5-9.4                knitr_1.29                   
## [69] pillar_1.4.6                  igraph_1.2.5                 
## [71] XML_3.99-0.4                  glue_1.4.1                   
## [73] BiocVersion_3.11.1            evaluate_0.14                
## [75] BiocManager_1.30.10           vctrs_0.3.2                  
## [77] httpuv_1.5.4                  tidyr_1.1.0                  
## [79] gtable_0.3.0                  purrr_0.3.4                  
## [81] assertthat_0.2.1              xfun_0.15                    
## [83] rsvd_1.0.3                    mime_0.9                     
## [85] xtable_1.8-4                  later_1.1.0.1                
## [87] viridisLite_0.3.0             tibble_3.0.3                 
## [89] GenomicAlignments_1.24.0      AnnotationDbi_1.50.1         
## [91] beeswarm_0.2.3                memoise_1.1.0                
## [93] statmod_1.4.34                ellipsis_0.3.1               
## [95] interactiveDisplayBase_1.26.3