Save/load an HDF5-based SummarizedExperiment object

saveHDF5SummarizedExperiment and loadHDF5SummarizedExperiment can be used to save/load an HDF5-based SummarizedExperiment object to/from disk.

NOTE: These functions use functionalities from the SummarizedExperiment package internally and so require this package to be installed.

Usage

saveHDF5SummarizedExperiment(x, dir="my_h5_se", prefix="", replace=FALSE,
                                chunkdim=NULL, level=NULL, as.sparse=NA,
                                verbose=NA)

loadHDF5SummarizedExperiment(dir="my_h5_se", prefix="")

quickResaveHDF5SummarizedExperiment(x, verbose=FALSE)

Arguments

x

A SummarizedExperiment object or derivative.

For quickResaveHDF5SummarizedExperiment the object must have been previously saved with saveHDF5SummarizedExperiment (and has been possibly modified since then).

dir

The path (as a single string) to the directory where to save the HDF5-based SummarizedExperiment object or to load it from.

When saving, the directory will be created if it doesn't already exist. If the directory already exists and no prefix is specified and replace is set to TRUE, then it's replaced with an empty directory.

prefix

An optional prefix to add to the names of the files created inside dir. Allows saving more than one object in the same directory.

replace

When no prefix is specified, should a pre-existing directory be replaced with a new empty one? The content of the pre-existing directory will be lost!

chunkdim, level

The dimensions of the chunks and the compression level to use for writing the assay data to disk.

Passed to the internal calls to writeHDF5Array. See ?writeHDF5Array for more information.

as.sparse

Whether the assay data should be flagged as sparse or not. If set to NA (the default), then the specific as.sparse value to use for each assay is determined by calling is_sparse() on them.

Passed to the internal calls to writeHDF5Array. See ?writeHDF5Array for more information and an IMPORTANT NOTE.

verbose

Set to TRUE to make the function display progress.

In the case of saveHDF5SummarizedExperiment(), verbose is set to NA by default, in which case verbosity is controlled by DelayedArray:::get_verbose_block_processing(). Setting verbose to TRUE or FALSE overrides this.

Details

saveHDF5SummarizedExperiment():

Creates the directory specified thru the dir argument and populates it with the HDF5 datasets (one per assay in x) plus a serialized version of x that contains pointers to these datasets. This directory provides a self-contained HDF5-based representation of x that can then be loaded back in R with loadHDF5SummarizedExperiment.

Note that this directory is relocatable i.e. it can be moved (or copied) to a different place, on the same or a different computer, before calling loadHDF5SummarizedExperiment on it. For convenient sharing with collaborators, it is suggested to turn it into a tarball (with Unix command tar), or zip file, before the transfer.

Please keep in mind that saveHDF5SummarizedExperiment and loadHDF5SummarizedExperiment don't know how to produce/read tarballs or zip files at the moment, so the process of packaging/extracting the tarball or zip file is entirely the user responsibility. This is typically done from outside R.

Finally please note that, depending on the size of the data to write to disk and the performance of the disk, saveHDF5SummarizedExperiment can take a long time to complete. Use verbose=TRUE to see its progress.

loadHDF5SummarizedExperiment():

Typically very fast, even if the assay data is big, because all the assays in the returned object are HDF5Array objects pointing to the on-disk HDF5 datasets located in dir. HDF5Array objects are typically light-weight in memory.

quickResaveHDF5SummarizedExperiment():

Preserves the HDF5 file and datasets that the assays in x are already pointing to (and which were created by an earlier call to saveHDF5SummarizedExperiment). All it does is re-serialize x on top of the .rds file that is associated with this HDF5 file (and which was created by an earlier call to saveHDF5SummarizedExperiment or quickResaveHDF5SummarizedExperiment). Because the delayed operations possibly carried by the assays in x are not realized, this is very fast.

Value

saveHDF5SummarizedExperiment returns an invisible SummarizedExperiment object that is the same as what loadHDF5SummarizedExperiment will return when loading back the object. All the assays in the object are HDF5Array objects pointing to datasets in the HDF5 file saved in dir.

Difference between saveHDF5SummarizedExperiment() and saveRDS()

Roughly speaking, saveRDS() only serializes the part of an object that resides in memory (the reality is a little bit more nuanced, but discussing the full details is not important here, and would only distract us). For most objects in R, that's the whole object, so saveRDS() does the job.

However some objects are pointing to on-disk data. For example: a TxDb object (the TxDb class is implemented and documented in the GenomicFeatures package) points to an SQLite db; an HDF5Array object points to a dataset in an HDF5 file; a SummarizedExperiment derivative can have one or more of its assays that point to datasets (one per assay) in an HDF5 file. These objects have 2 parts: one part is in memory, and one part is on disk. The 1st part is sometimes called the object shell and is generally thin (i.e. it has a small memory footprint). The 2nd part is the data and is typically big. The object shell and data are linked together via some kind of pointer stored in the shell (e.g. an SQLite connection, or a path to a file, etc...). Note that this is a one way link in the sense that the object shell "knows" where to find the on-disk data but the on-disk data knows nothing about the object shell (and is completely agnostic about what kind of object shell could be pointing to it). Furthermore, at any given time on a given system, there could be more than one object shell pointing to the same on-disk data. These object shells could exist in the same R session or in sessions in other languages (e.g. Python). These various sessions could be run by the same or by different users.

Using saveRDS() on such object will only serialize the shell part so will produce a small .rds file that contains the serialized object shell but not the object data.

This is problematic because:

If you later unserialize the object (with readRDS()) on the same system where you originally serialized it, it is possible that you will get back an object that is fully functional and semantically equivalent to the original object. But here is the catch: this will be the case ONLY if the data is still at the original location and has not been modified (i.e. nobody wrote or altered the data in the SQLite db or HDF5 file in the mean time), and if the serialization/unserialization cycle didn't break the link between the object shell and the data (this serialization/unserialization cycle is known to break open SQLite connections).
After serialization the object shell and data are stored in separate files (in the new .rds file for the shell, still in the original SQLite or HDF5 file for the data), typically in very different places on the file system. But these 2 files are not relocatable, that is, moving or copying them to another system or sending them to collaborators will typically break the link between them. Concretely this means that the object obtained by using readRDS() on the destination system will be broken.

saveHDF5SummarizedExperiment() addresses these issues by saving the object shell and assay data in a folder that is relocatable.

Note that it only works on SummarizedExperiment derivatives. What it does exactly is (1) write all the assay data to an HDF5 file, and (2) serialize the object shell, which in this case is everything in the object that is not the assay data. The 2 files (HDF5 and .rds) are written to the directory specified by the user. The resulting directory contains a full representation of the object and is relocatable, that is, it can be moved or copied to another place on the system, or to another system (possibly after making a tarball of it), where loadHDF5SummarizedExperiment() can then be used to load the object back in R.

Note

The files created by saveHDF5SummarizedExperiment in the user-specified directory dir should not be renamed.

The user-specified directory created by saveHDF5SummarizedExperiment is relocatable i.e. it can be renamed and/or moved around, but not the individual files in it.

Author

Hervé Pagès

Examples

## ---------------------------------------------------------------------
## saveHDF5SummarizedExperiment() / loadHDF5SummarizedExperiment()
## ---------------------------------------------------------------------
library(SummarizedExperiment)

nrow <- 200
ncol <- 6
counts <- matrix(as.integer(runif(nrow * ncol, 1, 1e4)), nrow)
colData <- DataFrame(Treatment=rep(c("ChIP", "Input"), 3),
                     row.names=LETTERS[1:6])
se0 <- SummarizedExperiment(assays=list(counts=counts), colData=colData)
se0
#> class: SummarizedExperiment 
#> dim: 200 6 
#> metadata(0):
#> assays(1): counts
#> rownames: NULL
#> rowData names(0):
#> colnames(6): A B ... E F
#> colData names(1): Treatment

## Save 'se0' as an HDF5-based SummarizedExperiment object:
dir <- tempfile("h5_se0_")
h5_se0 <- saveHDF5SummarizedExperiment(se0, dir)
list.files(dir)
#> [1] "assays.h5" "se.rds"   

h5_se0
#> class: SummarizedExperiment 
#> dim: 200 6 
#> metadata(0):
#> assays(1): counts
#> rownames: NULL
#> rowData names(0):
#> colnames(6): A B ... E F
#> colData names(1): Treatment
assay(h5_se0, withDimnames=FALSE)   # HDF5Matrix object
#> <200 x 6> HDF5Matrix object of type "integer":
#>        [,1] [,2] [,3] [,4] [,5] [,6]
#>   [1,] 2172 6936 9302 8785  635 3224
#>   [2,] 2232 6721 2326 6429 6081 5678
#>   [3,] 4500 2562 9501 7793 4561 6154
#>   [4,]  358  848 7369 7552 5462 4971
#>   [5,] 8450 7478 4286 4352 9932 5283
#>    ...    .    .    .    .    .    .
#> [196,] 1459 6856 3595  158 7583  377
#> [197,]  912 4436 2806 7460 9351 4576
#> [198,] 8402  205 5777 7147 9254  865
#> [199,] 9611 8187 9516 6895 6215 9277
#> [200,] 2653 8395 8406 6221 8726 2486

h5_se0b <- loadHDF5SummarizedExperiment(dir)
h5_se0b
#> class: SummarizedExperiment 
#> dim: 200 6 
#> metadata(0):
#> assays(1): counts
#> rownames: NULL
#> rowData names(0):
#> colnames(6): A B ... E F
#> colData names(1): Treatment
assay(h5_se0b, withDimnames=FALSE)  # HDF5Matrix object
#> <200 x 6> HDF5Matrix object of type "integer":
#>        [,1] [,2] [,3] [,4] [,5] [,6]
#>   [1,] 2172 6936 9302 8785  635 3224
#>   [2,] 2232 6721 2326 6429 6081 5678
#>   [3,] 4500 2562 9501 7793 4561 6154
#>   [4,]  358  848 7369 7552 5462 4971
#>   [5,] 8450 7478 4286 4352 9932 5283
#>    ...    .    .    .    .    .    .
#> [196,] 1459 6856 3595  158 7583  377
#> [197,]  912 4436 2806 7460 9351 4576
#> [198,] 8402  205 5777 7147 9254  865
#> [199,] 9611 8187 9516 6895 6215 9277
#> [200,] 2653 8395 8406 6221 8726 2486

## Sanity checks:
stopifnot(is(assay(h5_se0, withDimnames=FALSE), "HDF5Matrix"))
stopifnot(identical(assay(se0), as.matrix(assay(h5_se0))))
stopifnot(is(assay(h5_se0b, withDimnames=FALSE), "HDF5Matrix"))
stopifnot(identical(assay(se0), as.matrix(assay(h5_se0b))))

## ---------------------------------------------------------------------
## More sanity checks
## ---------------------------------------------------------------------

## Make a copy of directory 'dir':
somedir <- tempfile("somedir")
dir.create(somedir)
file.copy(dir, somedir, recursive=TRUE)
#> [1] TRUE
dir2 <- list.files(somedir, full.names=TRUE)

## 'dir2' contains a copy of 'dir'. Call loadHDF5SummarizedExperiment()
## on it.
h5_se0c <- loadHDF5SummarizedExperiment(dir2)

stopifnot(is(assay(h5_se0c, withDimnames=FALSE), "HDF5Matrix"))
stopifnot(identical(assay(se0), as.matrix(assay(h5_se0c))))

## ---------------------------------------------------------------------
## Using a prefix
## ---------------------------------------------------------------------

se1 <- se0[51:100, ]
saveHDF5SummarizedExperiment(se1, dir, prefix="xx_")
list.files(dir)
#> [1] "assays.h5"    "se.rds"       "xx_assays.h5" "xx_se.rds"   
loadHDF5SummarizedExperiment(dir, prefix="xx_")
#> class: SummarizedExperiment 
#> dim: 50 6 
#> metadata(0):
#> assays(1): counts
#> rownames: NULL
#> rowData names(0):
#> colnames(6): A B ... E F
#> colData names(1): Treatment

## ---------------------------------------------------------------------
## quickResaveHDF5SummarizedExperiment()
## ---------------------------------------------------------------------

se2 <- loadHDF5SummarizedExperiment(dir, prefix="xx_")
se2 <- se2[1:14, ]
assay1 <- assay(se2, withDimnames=FALSE)
assays(se2, withDimnames=FALSE) <- c(assays(se2), list(score=assay1/100))
rowRanges(se2) <- GRanges("chr1", IRanges(1:14, width=5))
rownames(se2) <- letters[1:14]
se2
#> class: RangedSummarizedExperiment 
#> dim: 14 6 
#> metadata(0):
#> assays(2): counts score
#> rownames(14): a b ... m n
#> rowData names(0):
#> colnames(6): A B ... E F
#> colData names(1): Treatment

## This will replace saved 'se1'!
quickResaveHDF5SummarizedExperiment(se2, verbose=TRUE)
#> All assay data already in HDF5 file:
#>   /tmp/RtmpYQEKxJ/h5_se0_f485ee53368/xx_assays.h5
#> Serialize RangedSummarizedExperiment object to existing RDS file:
#>   /tmp/RtmpYQEKxJ/h5_se0_f485ee53368/xx_se.rds
list.files(dir)
#> [1] "assays.h5"    "se.rds"       "xx_assays.h5" "xx_se.rds"   
loadHDF5SummarizedExperiment(dir, prefix="xx_")
#> class: RangedSummarizedExperiment 
#> dim: 14 6 
#> metadata(0):
#> assays(2): counts score
#> rownames(14): a b ... m n
#> rowData names(0):
#> colnames(6): A B ... E F
#> colData names(1): Treatment