Chapter 4 Working with large data
In this section we learn about
- Importing data from the Human Cell Atlas
- Working with large on-disk data
- Exploring tabular data using ‘tidyverse’ tools
Packages used
- BiocFileCache – local cache of internet and other files.
- LoomExperiment – ‘loom’ single-cell experiment representation
- DelayedArray – large on-disk matrix-like data representation
- dplyr – ‘tidyverse’ maniplation of data.frame-like data
4.1 Human cell atlas data discovery
Visit the Human Cell Atlas Data Portal.
Select on the project “A single-cell reference map of transcriptional states for human blood and tissue T cell activation” (second item on the default screen, as of this writing). Note that the data set is very well documented. We will work with the ‘Expression Matrices’ (left panel) in this section of the workshop, and ‘Analysis Protocol (optimus_v1.3.5)’ (right sidebar) in the next section of the workshop.
Select the Expression Matrices
link.
Copy the link to the ‘loom’ format matrix.
4.2 R / Bioconductor libraries
Return to AnVIL and create a new notebook. Make sure that the runtime is available, and the notebook in ‘edit’ mode.
Make sure that relevant software is loaded. The first time this code is executed, it will install LoomExperiment – installation will take about 45 seconds.
packages <- c("LoomExperiment", "SingleCellExperiment", "dplyr", "BiocFileCache")
need <- packages[ !packages %in% rownames(installed.packages()) ]
if (length(need))
BiocManager::install(need, update = FALSE)
Load the package we will use during the session
4.3 Data download and import
Retrieve the HCA matrix. Do this using iocFileCache
, a package
which creates and maintains a local collection of files.
url <- "https://data.humancellatlas.org/project-assets/project-matrices/4a95101c-9ffc-4f30-a809-f04518a23803.homo_sapiens.loom"
loom_file <- BiocFileCache::bfcrpath(rnames = url)
Import the expression matrix into R using the LoomExperiment
package.
## class: LoomExperiment
## dim: 58347 267360
## metadata(0):
## assays(1): matrix
## rownames: NULL
## rowData names(9): Accession Gene ... genus_species isgene
## colnames: NULL
## colData names(44): CellID analysis_protocol.protocol_core.protocol_id
## ... specimen_from_organism.provenance.document_id total_umis
## rowGraphs(0): NULL
## colGraphs(0): NULL
4.4 ‘DelayedArray’ representation of large on-disk data
The experiment has 58347 rows (genes) and 267360 columns
(cells). Re-name the assay to be "counts"
## class: LoomExperiment
## dim: 58347 267360
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(9): Accession Gene ... genus_species isgene
## colnames: NULL
## colData names(44): CellID analysis_protocol.protocol_core.protocol_id
## ... specimen_from_organism.provenance.document_id total_umis
## rowGraphs(0): NULL
## colGraphs(0): NULL
and take a look at the count data
## <58347 x 267360> matrix of class DelayedMatrix and type "double":
## [,1] [,2] [,3] ... [,267359] [,267360]
## [1,] 0 0 0 . 0 0
## [2,] 0 0 0 . 0 0
## [3,] 0 0 0 . 0 1
## [4,] 0 0 1 . 0 0
## [5,] 0 0 0 . 0 0
## ... . . . . . .
## [58343,] 0 0 0 . 0 0
## [58344,] 0 0 0 . 0 0
## [58345,] 0 0 0 . 0 0
## [58346,] 0 0 0 . 0 0
## [58347,] 0 0 0 . 0 0
This is an object of class DelayedArray
, defined in the
DelayedArray
package. It behaves like a regular matrix, but the
data is stored on-disk and only accessed when necessary. For
instance, the following calculation provides the illusion that the
full count matrix has been transformed, but actually only the cells
visible in the output are calculated; the remaining data is on-disk.
## <58347 x 267360> matrix of class DelayedMatrix and type "double":
## [,1] [,2] [,3] ... [,267359] [,267360]
## [1,] 0.0000000 0.0000000 0.0000000 . 0.0000000 0.0000000
## [2,] 0.0000000 0.0000000 0.0000000 . 0.0000000 0.0000000
## [3,] 0.0000000 0.0000000 0.0000000 . 0.0000000 0.6931472
## [4,] 0.0000000 0.0000000 0.6931472 . 0.0000000 0.0000000
## [5,] 0.0000000 0.0000000 0.0000000 . 0.0000000 0.0000000
## ... . . . . . .
## [58343,] 0 0 0 . 0 0
## [58344,] 0 0 0 . 0 0
## [58345,] 0 0 0 . 0 0
## [58346,] 0 0 0 . 0 0
## [58347,] 0 0 0 . 0 0
On the other hand, calculating library size (sum of reads in each column) requires all data to be accessed, and can be time-consuming. Here we calculate library sizes of the first 2000 cells; we can expect processing time to scale linearly with number of cells
## user system elapsed
## 3.207 0.643 3.858
4.5 Subseting and in-memory location for data exploration
In many scenarios it is appropriate to perform exploratory analysis on a subset of data. Here we select the first 2000 cells (fast) and coerce the counts matrix of the subset to a sparse in-memory representation.
loom_subset <- loom[, 1:2000]
m <- assay(loom_subset, "counts")
assay(loom_subset, "counts") <- as(m, "dgCMatrix")
print(loom_subset)
## class: LoomExperiment
## dim: 58347 2000
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(9): Accession Gene ... genus_species isgene
## colnames: NULL
## colData names(44): CellID analysis_protocol.protocol_core.protocol_id
## ... specimen_from_organism.provenance.document_id total_umis
## rowGraphs(0): NULL
## colGraphs(0): NULL
## dimnames(.) <- NULL: translated to
## dimnames(.) <- list(NULL,NULL) <==> unname(.)
## 10 x 5 sparse Matrix of class "dgCMatrix"
##
## [1,] . . . . .
## [2,] . . . . .
## [3,] . . . . .
## [4,] . . 1 . .
## [5,] . . . . .
## [6,] . . . . .
## [7,] . . . . .
## [8,] . . . . .
## [9,] . . . . .
## [10,] . . . . .
Exploratory analysis can then proceed quickly, e.g., calculating library size
## dimnames(.) <- NULL: translated to
## dimnames(.) <- list(NULL,NULL) <==> unname(.)
## user system elapsed
## 0.016 0.000 0.016
## [1] 100 44053
4.6 Exploration of column (cell) annotations
Return to the full data set
## class: LoomExperiment
## dim: 58347 267360
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(9): Accession Gene ... genus_species isgene
## colnames: NULL
## colData names(44): CellID analysis_protocol.protocol_core.protocol_id
## ... specimen_from_organism.provenance.document_id total_umis
## rowGraphs(0): NULL
## colGraphs(0): NULL
We will use the ‘tidyverse’ approach to data management, with data
represented as a tibble
(a better-behaving data.frame
) and a
‘pipe’ (%>%
) connecting the output of one command with the input of
the next.
The colData()
accessor provides annotations about each cell of the
LoomExperiment
. Extract that information as a tibble
so that we
can explore the data in some detail.
## # A tibble: 267,360 x 44
## CellID analysis_protoc… analysis_protoc… analysis_workin… barcode bundle_uuid
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 00016… optimus_v1.3.5 6e9a0638-4dcd-4… blessed ATTGGA… c5df0207-1…
## 2 00027… optimus_v1.3.5 aaccd734-ef95-4… blessed CGAATG… dde6b553-f…
## 3 0007f… optimus_v1.3.5 d9514087-a5b9-4… blessed GATTCA… 2677fd9d-6…
## 4 000b8… optimus_v1.3.5 b115679e-1139-4… blessed CGGCTA… a4606cc6-c…
## 5 000fa… optimus_v1.3.5 b115679e-1139-4… blessed GTATTC… a4606cc6-c…
## 6 00155… optimus_v1.3.5 b115679e-1139-4… blessed GAGCAG… a4606cc6-c…
## 7 0019d… optimus_v1.3.5 c8f7a0e8-3a4f-4… blessed TAAACC… 9137888e-c…
## 8 001a0… optimus_v1.3.5 c8f7a0e8-3a4f-4… blessed TGCTAC… 9137888e-c…
## 9 001b9… optimus_v1.3.5 9944ebca-03e8-4… blessed ATTATC… 88f037c4-9…
## 10 001d3… optimus_v1.3.5 0989c692-05a5-4… blessed GGTATT… 8c6e51e8-e…
## # … with 267,350 more rows, and 38 more variables: bundle_version <chr>,
## # cell_suspension.genus_species.ontology <chr>,
## # cell_suspension.genus_species.ontology_label <chr>,
## # cell_suspension.provenance.document_id <chr>, derived_organ_label <chr>,
## # derived_organ_ontology <chr>, derived_organ_parts_label <chr>,
## # derived_organ_parts_ontology <chr>,
## # donor_organism.development_stage.ontology <chr>,
## # donor_organism.development_stage.ontology_label <chr>,
## # donor_organism.diseases.ontology <chr>,
## # donor_organism.diseases.ontology_label <chr>,
## # donor_organism.human_specific.ethnicity.ontology <chr>,
## # donor_organism.human_specific.ethnicity.ontology_label <chr>,
## # donor_organism.is_living <chr>,
## # donor_organism.provenance.document_id <chr>, donor_organism.sex <chr>,
## # dss_bundle_fqid <chr>, emptydrops_is_cell <chr>, file_uuid <chr>,
## # file_version <chr>, genes_detected <int>,
## # library_preparation_protocol.end_bias <chr>,
## # library_preparation_protocol.input_nucleic_acid_molecule.ontology <chr>,
## # library_preparation_protocol.input_nucleic_acid_molecule.ontology_label <chr>,
## # library_preparation_protocol.library_construction_method.ontology <chr>,
## # library_preparation_protocol.library_construction_method.ontology_label <chr>,
## # library_preparation_protocol.provenance.document_id <chr>,
## # library_preparation_protocol.strand <chr>,
## # project.project_core.project_short_name <chr>,
## # project.project_core.project_title <chr>,
## # project.provenance.document_id <chr>,
## # specimen_from_organism.organ.ontology <chr>,
## # specimen_from_organism.organ.ontology_label <chr>,
## # specimen_from_organism.organ_parts.ontology <chr>,
## # specimen_from_organism.organ_parts.ontology_label <chr>,
## # specimen_from_organism.provenance.document_id <chr>, total_umis <dbl>
4.7 Invariant cell annotations
We have extensive data (44 columns) on each cell. Many of the columns
are constant across all cells. We use the function select_if()
from
the dplyr
package to select columns that satisfy the expression
length(unique(.)) == 1
, where .
refers to each column in the
tibble.
tbl %>%
## select columns that have exactly one value
select_if(~ length(unique(.)) == 1) %>%
distinct() %>%
## create a new tibble of column name / value pairs to summarize these
tidyr::pivot_longer(everything()) %>%
print()
## # A tibble: 17 x 2
## name value
## <chr> <chr>
## 1 analysis_protocol.protocol_core.prot… optimus_v1.3.5
## 2 analysis_working_group_approval_stat… blessed
## 3 cell_suspension.genus_species.ontolo… NCBITaxon:9606
## 4 cell_suspension.genus_species.ontolo… Homo sapiens
## 5 donor_organism.development_stage.ont… EFO:0001272
## 6 donor_organism.development_stage.ont… adult
## 7 donor_organism.sex male
## 8 library_preparation_protocol.end_bias 3 prime tag
## 9 library_preparation_protocol.input_n… OBI:0000869
## 10 library_preparation_protocol.input_n… polyA RNA extract
## 11 library_preparation_protocol.library… EFO:0009310
## 12 library_preparation_protocol.library… 10X v2 sequencing
## 13 library_preparation_protocol.provena… e953a093-f5ab-46df-9223-a492f4775d44
## 14 library_preparation_protocol.strand first
## 15 project.project_core.project_short_n… HumanTissueTcellActivation
## 16 project.project_core.project_title A single-cell reference map of transcr…
## 17 project.provenance.document_id 4a95101c-9ffc-4f30-a809-f04518a23803
4.8 Donor provenance and organ labels
The variable columns are also easily selected
## # A tibble: 267,360 x 27
## CellID analysis_protoc… barcode bundle_uuid bundle_version cell_suspension…
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 00016… 6e9a0638-4dcd-4… ATTGGA… c5df0207-1… 2019-09-14T09… 788f3a1c-9369-4…
## 2 00027… aaccd734-ef95-4… CGAATG… dde6b553-f… 2019-09-14T09… b72a54dc-69d7-4…
## 3 0007f… d9514087-a5b9-4… GATTCA… 2677fd9d-6… 2019-09-14T11… 10510764-ed13-4…
## 4 000b8… b115679e-1139-4… CGGCTA… a4606cc6-c… 2019-09-14T11… ac1e511f-fa9e-4…
## 5 000fa… b115679e-1139-4… GTATTC… a4606cc6-c… 2019-09-14T11… ac1e511f-fa9e-4…
## 6 00155… b115679e-1139-4… GAGCAG… a4606cc6-c… 2019-09-14T11… ac1e511f-fa9e-4…
## 7 0019d… c8f7a0e8-3a4f-4… TAAACC… 9137888e-c… 2019-09-14T09… 78b08aea-c194-4…
## 8 001a0… c8f7a0e8-3a4f-4… TGCTAC… 9137888e-c… 2019-09-14T09… 78b08aea-c194-4…
## 9 001b9… 9944ebca-03e8-4… ATTATC… 88f037c4-9… 2019-09-14T10… b29ca10d-b611-4…
## 10 001d3… 0989c692-05a5-4… GGTATT… 8c6e51e8-e… 2019-09-14T10… 7e5f6787-ab75-4…
## # … with 267,350 more rows, and 21 more variables: derived_organ_label <chr>,
## # derived_organ_ontology <chr>, derived_organ_parts_label <chr>,
## # derived_organ_parts_ontology <chr>, donor_organism.diseases.ontology <chr>,
## # donor_organism.diseases.ontology_label <chr>,
## # donor_organism.human_specific.ethnicity.ontology <chr>,
## # donor_organism.human_specific.ethnicity.ontology_label <chr>,
## # donor_organism.is_living <chr>,
## # donor_organism.provenance.document_id <chr>, dss_bundle_fqid <chr>,
## # emptydrops_is_cell <chr>, file_uuid <chr>, file_version <chr>,
## # genes_detected <int>, specimen_from_organism.organ.ontology <chr>,
## # specimen_from_organism.organ.ontology_label <chr>,
## # specimen_from_organism.organ_parts.ontology <chr>,
## # specimen_from_organism.organ_parts.ontology_label <chr>,
## # specimen_from_organism.provenance.document_id <chr>, total_umis <dbl>
The donor_organism.provenance.documented_id
column identifies
indivuals contributing cells, so the cells per individual are
## # A tibble: 4 x 2
## donor_organism.provenance.document_id n
## <chr> <int>
## 1 4b7a3de7-94db-4507-b1f9-c3931f9f3aa5 10745
## 2 8b5774b2-2aba-4b90-8fa0-c7f69207802c 45676
## 3 95822e01-44a7-420b-92fc-b001460b1d13 199515
## 4 b09a1e10-3c38-4bb3-8553-d762810a6fb7 11424
The tissues represented in the experiment are summarized as
## # A tibble: 4 x 2
## derived_organ_label n
## <chr> <int>
## 1 blood 22169
## 2 hematopoietic system 82723
## 3 lung 80774
## 4 mediastinal lymph node 81694
The occurrence of organism and organ shows that two samples provided blood, and tow samples provided hematopoietic / lung / lymph node.
tbl %>%
count(donor_organism.provenance.document_id, derived_organ_label) %>%
tidyr::pivot_wider(
names_from = "derived_organ_label",
values_from = "n"
) %>%
print()
## # A tibble: 4 x 5
## donor_organism.provenance… blood `hematopoietic sys… lung `mediastinal lymph…
## <chr> <int> <int> <int> <int>
## 1 4b7a3de7-94db-4507-b1f9-c… 10745 NA NA NA
## 2 8b5774b2-2aba-4b90-8fa0-c… NA 12559 15257 17860
## 3 95822e01-44a7-420b-92fc-b… NA 70164 65517 63834
## 4 b09a1e10-3c38-4bb3-8553-d… 11424 NA NA NA
4.9 Sub-setting data based on cell attributes
With this exploration, it is then easy to create a subset of the data for further exploration, e.g., focusing on blood
## class: LoomExperiment
## dim: 58347 22169
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(9): Accession Gene ... genus_species isgene
## colnames: NULL
## colData names(44): CellID analysis_protocol.protocol_core.protocol_id
## ... specimen_from_organism.provenance.document_id total_umis
## rowGraphs(0): NULL
## colGraphs(0): NULL
4.10 Information about the packages used in this session
The R command sessionInfo()
captures information about the
versions of software used in the current session. This can be valuable
for performing reproducible analysis.
## R version 4.0.2 Patched (2020-06-24 r78747)
## Platform: x86_64-apple-darwin17.7.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS: /Users/ma38727/bin/R-4-0-branch/lib/libRblas.dylib
## LAPACK: /Users/ma38727/bin/R-4-0-branch/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] dplyr_1.0.0 LoomExperiment_1.6.0
## [3] rtracklayer_1.48.0 rhdf5_2.32.2
## [5] BiocFileCache_1.12.0 dbplyr_1.4.4
## [7] scran_1.16.0 scater_1.16.2
## [9] ggplot2_3.3.2 scRNAseq_2.2.0
## [11] SingleCellExperiment_1.10.1 SummarizedExperiment_1.18.2
## [13] DelayedArray_0.14.1 matrixStats_0.56.0
## [15] Biobase_2.48.0 GenomicRanges_1.40.0
## [17] GenomeInfoDb_1.24.2 IRanges_2.22.2
## [19] S4Vectors_0.26.1 BiocGenerics_0.34.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-6 bit64_0.9-7.1
## [3] httr_1.4.1 tools_4.0.2
## [5] utf8_1.1.4 R6_2.4.1
## [7] irlba_2.3.3 HDF5Array_1.16.1
## [9] vipor_0.4.5 DBI_1.1.0
## [11] colorspace_1.4-1 withr_2.2.0
## [13] tidyselect_1.1.0 gridExtra_2.3
## [15] bit_1.1-15.2 curl_4.3
## [17] compiler_4.0.2 cli_2.0.2
## [19] BiocNeighbors_1.6.0 labeling_0.3
## [21] bookdown_0.20 scales_1.1.1
## [23] rappdirs_0.3.1 Rsamtools_2.4.0
## [25] stringr_1.4.0 digest_0.6.25
## [27] rmarkdown_2.3 XVector_0.28.0
## [29] pkgconfig_2.0.3 htmltools_0.5.0
## [31] fastmap_1.0.1 limma_3.44.3
## [33] rlang_0.4.7 rstudioapi_0.11
## [35] RSQLite_2.2.0 shiny_1.5.0
## [37] DelayedMatrixStats_1.10.1 farver_2.0.3
## [39] generics_0.0.2 BiocParallel_1.22.0
## [41] RCurl_1.98-1.2 magrittr_1.5
## [43] BiocSingular_1.4.0 GenomeInfoDbData_1.2.3
## [45] Matrix_1.2-18 fansi_0.4.1
## [47] Rhdf5lib_1.10.1 Rcpp_1.0.5
## [49] ggbeeswarm_0.6.0 munsell_0.5.0
## [51] viridis_0.5.1 lifecycle_0.2.0
## [53] stringi_1.4.6 yaml_2.2.1
## [55] edgeR_3.30.3 zlibbioc_1.34.0
## [57] AnnotationHub_2.20.0 grid_4.0.2
## [59] blob_1.2.1 promises_1.1.1
## [61] dqrng_0.2.1 ExperimentHub_1.14.0
## [63] crayon_1.3.4 lattice_0.20-41
## [65] Biostrings_2.56.0 cowplot_1.0.0
## [67] locfit_1.5-9.4 knitr_1.29
## [69] pillar_1.4.6 igraph_1.2.5
## [71] XML_3.99-0.4 glue_1.4.1
## [73] BiocVersion_3.11.1 evaluate_0.14
## [75] BiocManager_1.30.10 vctrs_0.3.2
## [77] httpuv_1.5.4 tidyr_1.1.0
## [79] gtable_0.3.0 purrr_0.3.4
## [81] assertthat_0.2.1 xfun_0.15
## [83] rsvd_1.0.3 mime_0.9
## [85] xtable_1.8-4 later_1.1.0.1
## [87] viridisLite_0.3.0 tibble_3.0.3
## [89] GenomicAlignments_1.24.0 AnnotationDbi_1.50.1
## [91] beeswarm_0.2.3 memoise_1.1.0
## [93] statmod_1.4.34 ellipsis_0.3.1
## [95] interactiveDisplayBase_1.26.3