Save/load an HDF5-based SummarizedExperiment object
saveHDF5SummarizedExperiment.Rd
saveHDF5SummarizedExperiment
and
loadHDF5SummarizedExperiment
can be used to save/load an HDF5-based
SummarizedExperiment object to/from disk.
NOTE: These functions use functionalities from the SummarizedExperiment package internally and so require this package to be installed.
Usage
saveHDF5SummarizedExperiment(x, dir="my_h5_se", prefix="", replace=FALSE,
chunkdim=NULL, level=NULL, as.sparse=NA,
verbose=NA)
loadHDF5SummarizedExperiment(dir="my_h5_se", prefix="")
quickResaveHDF5SummarizedExperiment(x, verbose=FALSE)
Arguments
- x
A SummarizedExperiment object or derivative.
For
quickResaveHDF5SummarizedExperiment
the object must have been previously saved withsaveHDF5SummarizedExperiment
(and has been possibly modified since then).- dir
The path (as a single string) to the directory where to save the HDF5-based SummarizedExperiment object or to load it from.
When saving, the directory will be created if it doesn't already exist. If the directory already exists and no prefix is specified and
replace
is set toTRUE
, then it's replaced with an empty directory.- prefix
An optional prefix to add to the names of the files created inside
dir
. Allows saving more than one object in the same directory.- replace
When no prefix is specified, should a pre-existing directory be replaced with a new empty one? The content of the pre-existing directory will be lost!
- chunkdim, level
The dimensions of the chunks and the compression level to use for writing the assay data to disk.
Passed to the internal calls to
writeHDF5Array
. See?writeHDF5Array
for more information.- as.sparse
Whether the assay data should be flagged as sparse or not. If set to
NA
(the default), then the specificas.sparse
value to use for each assay is determined by callingis_sparse()
on them.Passed to the internal calls to
writeHDF5Array
. See?writeHDF5Array
for more information and an IMPORTANT NOTE.- verbose
Set to
TRUE
to make the function display progress.In the case of
saveHDF5SummarizedExperiment()
,verbose
is set toNA
by default, in which case verbosity is controlled byDelayedArray:::get_verbose_block_processing()
. Settingverbose
toTRUE
orFALSE
overrides this.
Details
saveHDF5SummarizedExperiment()
:Creates the directory specified thru the
dir
argument and populates it with the HDF5 datasets (one per assay inx
) plus a serialized version ofx
that contains pointers to these datasets. This directory provides a self-contained HDF5-based representation ofx
that can then be loaded back in R withloadHDF5SummarizedExperiment
.Note that this directory is relocatable i.e. it can be moved (or copied) to a different place, on the same or a different computer, before calling
loadHDF5SummarizedExperiment
on it. For convenient sharing with collaborators, it is suggested to turn it into a tarball (with Unix commandtar
), or zip file, before the transfer.Please keep in mind that
saveHDF5SummarizedExperiment
andloadHDF5SummarizedExperiment
don't know how to produce/read tarballs or zip files at the moment, so the process of packaging/extracting the tarball or zip file is entirely the user responsibility. This is typically done from outside R.Finally please note that, depending on the size of the data to write to disk and the performance of the disk,
saveHDF5SummarizedExperiment
can take a long time to complete. Useverbose=TRUE
to see its progress.loadHDF5SummarizedExperiment()
:Typically very fast, even if the assay data is big, because all the assays in the returned object are HDF5Array objects pointing to the on-disk HDF5 datasets located in
dir
. HDF5Array objects are typically light-weight in memory.quickResaveHDF5SummarizedExperiment()
:Preserves the HDF5 file and datasets that the assays in
x
are already pointing to (and which were created by an earlier call tosaveHDF5SummarizedExperiment
). All it does is re-serializex
on top of the.rds
file that is associated with this HDF5 file (and which was created by an earlier call tosaveHDF5SummarizedExperiment
orquickResaveHDF5SummarizedExperiment
). Because the delayed operations possibly carried by the assays inx
are not realized, this is very fast.
Value
saveHDF5SummarizedExperiment
returns an invisible
SummarizedExperiment object that is the
same as what loadHDF5SummarizedExperiment
will return when loading
back the object.
All the assays in the object are HDF5Array objects pointing to
datasets in the HDF5 file saved in dir
.
Difference between saveHDF5SummarizedExperiment() and saveRDS()
Roughly speaking, saveRDS()
only serializes the part of an object
that resides in memory (the reality is a little bit more nuanced, but
discussing the full details is not important here, and would only distract
us). For most objects in R, that's the whole object, so saveRDS()
does the job.
However some objects are pointing to on-disk data. For example: a TxDb object (the TxDb class is implemented and documented in the GenomicFeatures package) points to an SQLite db; an HDF5Array object points to a dataset in an HDF5 file; a SummarizedExperiment derivative can have one or more of its assays that point to datasets (one per assay) in an HDF5 file. These objects have 2 parts: one part is in memory, and one part is on disk. The 1st part is sometimes called the object shell and is generally thin (i.e. it has a small memory footprint). The 2nd part is the data and is typically big. The object shell and data are linked together via some kind of pointer stored in the shell (e.g. an SQLite connection, or a path to a file, etc...). Note that this is a one way link in the sense that the object shell "knows" where to find the on-disk data but the on-disk data knows nothing about the object shell (and is completely agnostic about what kind of object shell could be pointing to it). Furthermore, at any given time on a given system, there could be more than one object shell pointing to the same on-disk data. These object shells could exist in the same R session or in sessions in other languages (e.g. Python). These various sessions could be run by the same or by different users.
Using saveRDS()
on such object will only serialize the shell part
so will produce a small .rds
file that contains the serialized
object shell but not the object data.
This is problematic because:
If you later unserialize the object (with
readRDS()
) on the same system where you originally serialized it, it is possible that you will get back an object that is fully functional and semantically equivalent to the original object. But here is the catch: this will be the case ONLY if the data is still at the original location and has not been modified (i.e. nobody wrote or altered the data in the SQLite db or HDF5 file in the mean time), and if the serialization/unserialization cycle didn't break the link between the object shell and the data (this serialization/unserialization cycle is known to break open SQLite connections).After serialization the object shell and data are stored in separate files (in the new
.rds
file for the shell, still in the original SQLite or HDF5 file for the data), typically in very different places on the file system. But these 2 files are not relocatable, that is, moving or copying them to another system or sending them to collaborators will typically break the link between them. Concretely this means that the object obtained by usingreadRDS()
on the destination system will be broken.
saveHDF5SummarizedExperiment()
addresses these issues by saving
the object shell and assay data in a folder that is relocatable.
Note that it only works on SummarizedExperiment
derivatives. What it does exactly is (1) write all the assay data to an
HDF5 file, and (2) serialize the object shell, which in this case is
everything in the object that is not the assay data. The 2 files
(HDF5 and .rds
) are written to the directory specified by the user.
The resulting directory contains a full representation of the object and
is relocatable, that is, it can be moved or copied to another place on
the system, or to another system (possibly after making a tarball of it),
where loadHDF5SummarizedExperiment()
can then be used to load the
object back in R.
Note
The files created by saveHDF5SummarizedExperiment
in the
user-specified directory dir
should not be renamed.
The user-specified directory created by
saveHDF5SummarizedExperiment
is relocatable i.e. it can
be renamed and/or moved around, but not the individual files in it.
See also
SummarizedExperiment and RangedSummarizedExperiment objects in the SummarizedExperiment package.
The
writeHDF5Array
function whichsaveHDF5SummarizedExperiment
uses internally to write the assay data to disk.base::saveRDS
Examples
## ---------------------------------------------------------------------
## saveHDF5SummarizedExperiment() / loadHDF5SummarizedExperiment()
## ---------------------------------------------------------------------
library(SummarizedExperiment)
nrow <- 200
ncol <- 6
counts <- matrix(as.integer(runif(nrow * ncol, 1, 1e4)), nrow)
colData <- DataFrame(Treatment=rep(c("ChIP", "Input"), 3),
row.names=LETTERS[1:6])
se0 <- SummarizedExperiment(assays=list(counts=counts), colData=colData)
se0
#> class: SummarizedExperiment
#> dim: 200 6
#> metadata(0):
#> assays(1): counts
#> rownames: NULL
#> rowData names(0):
#> colnames(6): A B ... E F
#> colData names(1): Treatment
## Save 'se0' as an HDF5-based SummarizedExperiment object:
dir <- tempfile("h5_se0_")
h5_se0 <- saveHDF5SummarizedExperiment(se0, dir)
list.files(dir)
#> [1] "assays.h5" "se.rds"
h5_se0
#> class: SummarizedExperiment
#> dim: 200 6
#> metadata(0):
#> assays(1): counts
#> rownames: NULL
#> rowData names(0):
#> colnames(6): A B ... E F
#> colData names(1): Treatment
assay(h5_se0, withDimnames=FALSE) # HDF5Matrix object
#> <200 x 6> HDF5Matrix object of type "integer":
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 2172 6936 9302 8785 635 3224
#> [2,] 2232 6721 2326 6429 6081 5678
#> [3,] 4500 2562 9501 7793 4561 6154
#> [4,] 358 848 7369 7552 5462 4971
#> [5,] 8450 7478 4286 4352 9932 5283
#> ... . . . . . .
#> [196,] 1459 6856 3595 158 7583 377
#> [197,] 912 4436 2806 7460 9351 4576
#> [198,] 8402 205 5777 7147 9254 865
#> [199,] 9611 8187 9516 6895 6215 9277
#> [200,] 2653 8395 8406 6221 8726 2486
h5_se0b <- loadHDF5SummarizedExperiment(dir)
h5_se0b
#> class: SummarizedExperiment
#> dim: 200 6
#> metadata(0):
#> assays(1): counts
#> rownames: NULL
#> rowData names(0):
#> colnames(6): A B ... E F
#> colData names(1): Treatment
assay(h5_se0b, withDimnames=FALSE) # HDF5Matrix object
#> <200 x 6> HDF5Matrix object of type "integer":
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 2172 6936 9302 8785 635 3224
#> [2,] 2232 6721 2326 6429 6081 5678
#> [3,] 4500 2562 9501 7793 4561 6154
#> [4,] 358 848 7369 7552 5462 4971
#> [5,] 8450 7478 4286 4352 9932 5283
#> ... . . . . . .
#> [196,] 1459 6856 3595 158 7583 377
#> [197,] 912 4436 2806 7460 9351 4576
#> [198,] 8402 205 5777 7147 9254 865
#> [199,] 9611 8187 9516 6895 6215 9277
#> [200,] 2653 8395 8406 6221 8726 2486
## Sanity checks:
stopifnot(is(assay(h5_se0, withDimnames=FALSE), "HDF5Matrix"))
stopifnot(identical(assay(se0), as.matrix(assay(h5_se0))))
stopifnot(is(assay(h5_se0b, withDimnames=FALSE), "HDF5Matrix"))
stopifnot(identical(assay(se0), as.matrix(assay(h5_se0b))))
## ---------------------------------------------------------------------
## More sanity checks
## ---------------------------------------------------------------------
## Make a copy of directory 'dir':
somedir <- tempfile("somedir")
dir.create(somedir)
file.copy(dir, somedir, recursive=TRUE)
#> [1] TRUE
dir2 <- list.files(somedir, full.names=TRUE)
## 'dir2' contains a copy of 'dir'. Call loadHDF5SummarizedExperiment()
## on it.
h5_se0c <- loadHDF5SummarizedExperiment(dir2)
stopifnot(is(assay(h5_se0c, withDimnames=FALSE), "HDF5Matrix"))
stopifnot(identical(assay(se0), as.matrix(assay(h5_se0c))))
## ---------------------------------------------------------------------
## Using a prefix
## ---------------------------------------------------------------------
se1 <- se0[51:100, ]
saveHDF5SummarizedExperiment(se1, dir, prefix="xx_")
list.files(dir)
#> [1] "assays.h5" "se.rds" "xx_assays.h5" "xx_se.rds"
loadHDF5SummarizedExperiment(dir, prefix="xx_")
#> class: SummarizedExperiment
#> dim: 50 6
#> metadata(0):
#> assays(1): counts
#> rownames: NULL
#> rowData names(0):
#> colnames(6): A B ... E F
#> colData names(1): Treatment
## ---------------------------------------------------------------------
## quickResaveHDF5SummarizedExperiment()
## ---------------------------------------------------------------------
se2 <- loadHDF5SummarizedExperiment(dir, prefix="xx_")
se2 <- se2[1:14, ]
assay1 <- assay(se2, withDimnames=FALSE)
assays(se2, withDimnames=FALSE) <- c(assays(se2), list(score=assay1/100))
rowRanges(se2) <- GRanges("chr1", IRanges(1:14, width=5))
rownames(se2) <- letters[1:14]
se2
#> class: RangedSummarizedExperiment
#> dim: 14 6
#> metadata(0):
#> assays(2): counts score
#> rownames(14): a b ... m n
#> rowData names(0):
#> colnames(6): A B ... E F
#> colData names(1): Treatment
## This will replace saved 'se1'!
quickResaveHDF5SummarizedExperiment(se2, verbose=TRUE)
#> All assay data already in HDF5 file:
#> /tmp/RtmpYQEKxJ/h5_se0_f485ee53368/xx_assays.h5
#> Serialize RangedSummarizedExperiment object to existing RDS file:
#> /tmp/RtmpYQEKxJ/h5_se0_f485ee53368/xx_se.rds
list.files(dir)
#> [1] "assays.h5" "se.rds" "xx_assays.h5" "xx_se.rds"
loadHDF5SummarizedExperiment(dir, prefix="xx_")
#> class: RangedSummarizedExperiment
#> dim: 14 6
#> metadata(0):
#> assays(2): counts score
#> rownames(14): a b ... m n
#> rowData names(0):
#> colnames(6): A B ... E F
#> colData names(1): Treatment