The _h5mread_ package
Hervé Pagès
Fred Hutch Cancer Center, Seattle, WA
Compiled 28 May 2025; Modified 14 January 2025
Source: vignettes/h5mread.Rmd
Introduction
h5mread is an R/Bioconductor package that allows fast and memory-efficient loading of HDF5 data into R. The main function in the package is h5mread(), which reads arbitrary data from an HDF5 dataset into R, much like the h5read() function from the rhdf5 package does. In the case of h5mread(), the implementation has been optimized to make it as fast and memory-efficient as possible.
In addition to the h5mread() function, the package also provides the following low-level functionality:
- h5dim() and h5chunkdim();
- Utility functions to manipulate the dimnames of an HDF5 dataset;
- H5File objects to facilitate access to remote HDF5 files, such as files stored in an Amazon S3 bucket.
Note that the primary use case for the h5mread package is to support higher-level functionality implemented in the HDF5Array package.
Install and load the package
Like any other Bioconductor package, h5mread should always be installed with BiocManager::install():
if (!require("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("h5mread")
Load the package:
library(h5mread)
The h5mread() function
h5mread() is an efficient and flexible alternative to rhdf5::h5read().
Note that we'll use writeHDF5Array() from the HDF5Array package to conveniently create the HDF5 datasets used in the examples below.
Basic example
Create a 70,000 x 1,500 random dataset:
set.seed(2009)
m0 <- matrix(runif(105e6), ncol=1500) # 70,000 x 1,500 matrix
temp0_h5 <- tempfile(fileext=".h5")
HDF5Array::writeHDF5Array(m0, temp0_h5, "m0", chunkdim=c(100, 100))
Load 1,000 random rows from the HDF5 dataset:
h5ls(temp0_h5)
## group name otype dclass dim
## 0 / m0 H5I_DATASET FLOAT 70000 x 1500
nrow0 <- h5dim(temp0_h5, "m0")[[1L]]
starts <- list(sample(nrow0, 1000), NULL)
m <- h5mread(temp0_h5, "m0", starts=starts)
See ?h5mread for more information and additional examples.
Sanity check:
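A minimal sketch of such a check, run on a smaller, hypothetical dataset so that it finishes quickly (the names m_small, tmp_h5, and idx are illustrative and not part of the package). The rows returned by h5mread() should be identical to the same rows taken from the in-memory matrix. For the objects created above, the analogous comparison would be between m and m0[starts[[1L]], ].

```r
library(h5mread)

## Small stand-in for the 70,000 x 1,500 dataset used above.
set.seed(123)
m_small <- matrix(runif(2000), ncol=20)   # 100 x 20 matrix
tmp_h5 <- tempfile(fileext=".h5")
HDF5Array::writeHDF5Array(m_small, tmp_h5, "m_small", chunkdim=c(10, 10))

## Read 15 random rows (all columns) back from the file.
idx <- sample(nrow(m_small), 15)
m_sub <- h5mread(tmp_h5, "m_small", starts=list(idx, NULL))

## The values read from disk must match the in-memory subset.
stopifnot(identical(m_sub, m_small[idx, ]))
```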
An example involving sparse data
The HDF5 format doesn’t natively support sparse data representation. However, because it stores the data in compressed chunks, sparse data gets efficiently compressed in the HDF5 file.
The more sparse the data, the smaller the resulting HDF5 file:
a1 <- poissonSparseArray(c(6100, 960, 75), density=0.5)
a2 <- poissonSparseArray(c(6100, 960, 75), density=0.05)
temp1_h5 <- tempfile(fileext=".h5")
HDF5Array::writeHDF5Array(a1, temp1_h5, "a1", chunkdim=c(50, 40, 5), level=5)
temp2_h5 <- tempfile(fileext=".h5")
HDF5Array::writeHDF5Array(a2, temp2_h5, "a2", chunkdim=c(50, 40, 5), level=5)
file.size(temp1_h5)
## [1] 128949150
file.size(temp2_h5)
## [1] 38533754
However, the small size on disk won’t translate into a small size in memory if the data gets loaded back as an ordinary array:
a21 <- h5mread(temp2_h5, "a2") # not memory-efficient
object.size(a21)
## 1756800224 bytes
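The reported size can be accounted for with simple arithmetic, assuming the data is loaded as an ordinary integer array (4 bytes per element in R). The variable names below are illustrative only:

```r
## Dimensions of the dataset written above.
dims <- c(6100, 960, 75)

## A dense array stores every element, zero or not.
n_elements <- prod(dims)                 # 439,200,000 elements
bytes_per_integer <- 4
payload <- n_elements * bytes_per_integer

format(payload, big.mark=",")            # "1,756,800,000"
```

The remaining 224 bytes in the object.size() output above are the fixed overhead of the R array object itself. The density of the data plays no role in this calculation, which is why a dense in-memory representation is wasteful for sparse data.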
To keep memory usage as low as possible, h5mread() can load the data in a SparseArray derivative from the SparseArray package. This is achieved by setting its as.sparse argument to TRUE:
a22 <- h5mread(temp2_h5, "a2", as.sparse=TRUE) # memory-efficient
object.size(a22)
## 351486208 bytes
Note that the data is loaded as a COO_SparseArray object:
class(a22)
## [1] "COO_SparseArray"
## attr(,"package")
## [1] "SparseArray"
See ?h5mread for more information and additional examples.
Sanity checks:
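One way such checks could look, sketched on a small hypothetical array (a_small, tmp_h5, dense, and sparse are illustrative names). The idea is that the dense and the sparse read should both reproduce the original data exactly:

```r
library(SparseArray)
library(h5mread)

## Small sparse 3D array standing in for a2 above.
set.seed(42)
a_small <- poissonSparseArray(c(60, 40, 5), density=0.05)
tmp_h5 <- tempfile(fileext=".h5")
HDF5Array::writeHDF5Array(a_small, tmp_h5, "a_small", chunkdim=c(20, 20, 5))

## Load the same dataset as an ordinary array and as a sparse object.
dense  <- h5mread(tmp_h5, "a_small")
sparse <- h5mread(tmp_h5, "a_small", as.sparse=TRUE)

## Both must agree with the original data once expanded to a dense array.
stopifnot(identical(dense, as.array(a_small)))
stopifnot(identical(as.array(sparse), dense))
stopifnot(is(sparse, "SparseArray"))
```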
A note about COO_SparseArray objects
The COO_SparseArray class is one of the two SparseArray concrete subclasses defined in the SparseArray package, the other one being SVT_SparseArray. Note that the latter tends to be even more memory-efficient than the former and to achieve better performance overall.
Use coercion to switch back and forth between the two representations:
a22 <- as(a22, "SVT_SparseArray")
class(a22)
## [1] "SVT_SparseArray"
## attr(,"package")
## [1] "SparseArray"
The SVT_SparseArray representation has about half the memory footprint of the COO_SparseArray representation:
object.size(a22)
## 188106768 bytes
See ?SparseArray in the SparseArray package for more information.
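The round trip between the two representations can be sketched on a small hypothetical array (svt, coo, and svt2 are illustrative names). The sketch assumes that poissonSparseArray() returns an SVT_SparseArray object, which the first stopifnot() call verifies:

```r
library(SparseArray)

## Small random sparse array.
svt <- poissonSparseArray(c(200, 50, 4), density=0.05)

## Coerce to the COO representation and back.
coo  <- as(svt, "COO_SparseArray")
svt2 <- as(coo, "SVT_SparseArray")

stopifnot(is(svt, "SVT_SparseArray"),
          is(coo, "COO_SparseArray"),
          is(svt2, "SVT_SparseArray"))

## Coercion changes the representation only, not the content.
stopifnot(identical(as.array(coo), as.array(svt)))
stopifnot(identical(as.array(svt2), as.array(svt)))

## Compare the in-memory footprints of the two representations.
object.size(coo)
object.size(svt)
```

The content comparison goes through as.array() rather than comparing the sparse objects directly, because two sparse objects can hold the same data while differing in internal layout.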
Other functionality provided by the h5mread package
h5dim() and h5chunkdim()
Two convenience functions to obtain the dimensions of an HDF5 dataset as well as the dimensions of its chunks.
See ?h5dim for more information and some examples.
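A short sketch of both functions on a hypothetical dataset (m_small and tmp_h5 are illustrative names). The chunk dimensions reported should be the ones passed to writeHDF5Array() when the file was created:

```r
library(h5mread)

## Write a 100 x 20 integer matrix using 25 x 10 chunks.
m_small <- matrix(1:2000, ncol=20)
tmp_h5 <- tempfile(fileext=".h5")
HDF5Array::writeHDF5Array(m_small, tmp_h5, "m_small", chunkdim=c(25, 10))

d  <- h5dim(tmp_h5, "m_small")        # dimensions of the dataset
cd <- h5chunkdim(tmp_h5, "m_small")   # dimensions of its chunks

stopifnot(all(d == c(100, 20)))
stopifnot(all(cd == c(25, 10)))
```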
Utility functions to manipulate the dimnames of an HDF5 dataset
A small set of low-level utilities is provided to manipulate the dimnames of an HDF5 dataset.
See ?h5writeDimnames for more information and some examples.
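As a rough sketch, dimnames could be attached to an existing dataset and read back as follows. This assumes that h5readDimnames() is the reading counterpart documented alongside h5writeDimnames(), and that the dataset has no dimnames yet. The names m_small, tmp_h5, dn, and got are illustrative:

```r
library(h5mread)

## A matrix without dimnames, so none get written to the file.
m_small <- matrix(1:12, ncol=3)
tmp_h5 <- tempfile(fileext=".h5")
HDF5Array::writeHDF5Array(m_small, tmp_h5, "m_small")

## Attach dimnames to the on-disk dataset, one character vector
## per dimension.
dn <- list(paste0("row", 1:4), paste0("col", 1:3))
h5writeDimnames(dn, tmp_h5, "m_small")

## Read them back and compare with what was written.
got <- h5readDimnames(tmp_h5, "m_small")
stopifnot(identical(as.character(got[[1L]]), dn[[1L]]))
stopifnot(identical(as.character(got[[2L]]), dn[[2L]]))
```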
H5File objects
Use an H5File object to access an HDF5 file stored in an Amazon S3 bucket.
See ?H5File for more information and some examples.
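The sketch below uses a local file so that it runs without network access. It assumes that an H5File object can be passed to h5mread() in place of a file path and closed with close(). See ?H5File for the arguments that control S3 access. The names m_small, tmp_h5, h5, and block are illustrative:

```r
library(h5mread)

## Create a small local HDF5 file to stand in for a remote one.
m_small <- matrix(1:2000, ncol=20)
tmp_h5 <- tempfile(fileext=".h5")
HDF5Array::writeHDF5Array(m_small, tmp_h5, "m_small", chunkdim=c(25, 10))

## Open the file once, then read from the open handle.
h5 <- H5File(tmp_h5)
block <- h5mread(h5, "m_small", starts=list(1:5, NULL))
close(h5)

stopifnot(identical(block, m_small[1:5, ]))
```

Opening the file once and reusing the handle avoids reopening it for every read, which is likely to matter most when the file is remote.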
Session information
## R version 4.5.0 (2025-04-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] h5mread_1.0.1 SparseArray_1.8.0 S4Arrays_1.8.0
## [4] IRanges_2.42.0 abind_1.4-8 S4Vectors_0.46.0
## [7] MatrixGenerics_1.20.0 matrixStats_1.5.0 Matrix_1.7-3
## [10] BiocGenerics_0.54.0 generics_0.1.4 rhdf5_2.52.0
## [13] BiocStyle_2.36.0
##
## loaded via a namespace (and not attached):
## [1] jsonlite_2.0.0 compiler_4.5.0 BiocManager_1.30.25
## [4] crayon_1.5.3 rhdf5filters_1.20.0 jquerylib_0.1.4
## [7] systemfonts_1.2.3 textshaping_1.0.1 yaml_2.3.10
## [10] fastmap_1.2.0 lattice_0.22-7 XVector_0.48.0
## [13] R6_2.6.1 knitr_1.50 htmlwidgets_1.6.4
## [16] DelayedArray_0.34.1 bookdown_0.43 desc_1.4.3
## [19] bslib_0.9.0 rlang_1.1.6 HDF5Array_1.36.0
## [22] cachem_1.1.0 xfun_0.52 fs_1.6.6
## [25] sass_0.4.10 cli_3.6.5 pkgdown_2.1.3
## [28] Rhdf5lib_1.30.0 digest_0.6.37 grid_4.5.0
## [31] lifecycle_1.0.4 evaluate_1.0.3 ragg_1.4.0
## [34] rmarkdown_2.29 tools_4.5.0 htmltools_0.5.8.1