An alternative to rhdf5::h5read
h5mread.Rd
An efficient and flexible alternative to rhdf5::h5read()
.
Usage
h5mread(filepath, name, starts=NULL, counts=NULL, noreduce=FALSE,
as.vector=NA, as.integer=FALSE, as.sparse=FALSE,
method=0L, use.H5Dread_chunk=FALSE)
get_h5mread_returned_type(filepath, name, as.integer=FALSE)
Arguments
- filepath
The path (as a single string) to the HDF5 file where the dataset to read from is located, or an H5File object.
Note that you must create and use an H5File object if the HDF5 file to access is stored in an Amazon S3 bucket. See
?H5File
for how to do this.Also please note that H5File objects must NOT be used in the context of parallel evaluation at the moment.
- name
The name of the dataset in the HDF5 file.
- starts, counts
starts
andcounts
are used to specify the array selection. Each argument can be eitherNULL
or a list with one list element per dimension in the dataset.If
starts
andcounts
are bothNULL
, then the entire dataset is read.If
starts
is a list, each list element in it must be a vector of valid positive indices along the corresponding dimension in the dataset. An empty vector (integer(0)
) is accepted and indicates an empty selection along that dimension. ANULL
is accepted and indicates a full selection along the dimension so has the same meaning as a missing subscript when subsetting an array-like object with[
. (Note that for[
aNULL
subscript indicates an empty selection.)Each list element in
counts
must beNULL
or a vector of non-negative integers of the same length as the corresponding list element instarts
. Each value in the vector indicates how many positions to select starting from the associated start value. ANULL
indicates that a single position is selected for each value along the corresponding dimension.If
counts
isNULL
, then each index in eachstarts
list element indicates a single position selection along the corresponding dimension. Note that in this case thestarts
argument is equivalent to theindex
argument ofh5read
andextract_array
(with the caveat thath5read
doesn't accept empty selections).Finally note that when
counts
is notNULL
then the selection described bystarts
andcounts
must be strictly ascending along each dimension.- as.vector
Should the data be returned in a vector instead of an array? By default (i.e. when set to
NA
), the data is returned in an ordinary array when reading from a multidimensional dataset, and in an ordinary vector when reading from a 1D dataset. You can override this by settingas.vector
toTRUE
orFALSE
.- as.integer
If set to
TRUE
then the data is loaded in an ordinary array (or vector) oftype()
"integer"
. This will typically reduce the memory footprint of the returned array or vector by half if the values in the HDF5 dataset are floating point values.Note that, when
as.integer=TRUE
, the values loaded from the HDF5 dataset get coerced to integers at the C level as early as possible so this transformation is very efficient.- as.sparse
By default
h5mread()
returns the data in an ordinary array or vector. Useas.sparse=TRUE
to return it in a SparseArray derivative from the SparseArray package. This will significantly reduce the memory footprint of the returned object if the HDF5 dataset contains mostly zeros.- noreduce, method, use.H5Dread_chunk
For testing and advanced usage only. Do not use.
Value
h5mread()
returns an ordinary array or vector if as.sparse
is FALSE
(the default), and a COO_SparseArray
object if as.sparse
is TRUE
.
get_h5mread_returned_type()
returns the type of the array or
vector that will be returned by h5mread()
.
Equivalent to (but more efficient than):
where ndim
is the number of dimensions (a.k.a. rank in
HDF5 jargon) of the dataset.
See also
H5File objects.
h5read
in the rhdf5 package.extract_array
in the S4Arrays package.COO_SparseArray objects in the SparseArray package.
h5mread_from_reshaped
to read data from a virtually reshaped HDF5 dataset.
Examples
## ---------------------------------------------------------------------
## BASIC EXAMPLES
## ---------------------------------------------------------------------
test_h5 <- system.file("extdata", "test.h5", package="h5mread")
h5ls(test_h5)
#> Warning: An open HDF5 file handle exists. If the file has changed on disk meanwhile, the function may not work properly. Run 'h5closeAll()' to close all open HDF5 object handles.
#> group name otype dclass dim
#> 0 / .m2_dimnames H5I_GROUP
#> 1 /.m2_dimnames 1 H5I_DATASET STRING 4000
#> 2 /.m2_dimnames 2 H5I_DATASET STRING 90
#> 3 / a3 H5I_DATASET INTEGER 180 x 75 x 4
#> 4 / m1 H5I_DATASET INTEGER 12 x 5
#> 5 / m2 H5I_DATASET FLOAT 4000 x 90
#> 6 / m4 H5I_DATASET INTEGER 28 x 4000
#> 7 / rwords H5I_DATASET STRING 30000
m1 <- h5mread(test_h5, "m1") # 12 x 5 integer matrix
m <- h5mread(test_h5, "m1", starts=list(c(8, 12:7), NULL))
m
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 108 120 132 144 156
#> [2,] 112 124 136 148 160
#> [3,] 111 123 135 147 159
#> [4,] 110 122 134 146 158
#> [5,] 109 121 133 145 157
#> [6,] 108 120 132 144 156
#> [7,] 107 119 131 143 155
## Sanity check:
stopifnot(identical(m1[c(8, 12:7), ], m))
m <- h5mread(test_h5, "m1", starts=list(c(8, 12:7), integer(0)))
m
#>
#> [1,]
#> [2,]
#> [3,]
#> [4,]
#> [5,]
#> [6,]
#> [7,]
## Sanity check:
stopifnot(identical(m1[c(8, 12:7), NULL], m))
m2 <- h5mread(test_h5, "m2") # 4000 x 90 double matrix
m2a <- h5mread(test_h5, "m2", starts=list(31, 1), counts=list(10, 8))
m2a
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 4.6302423 3.6920781 2.5704878 3.8089040 -0.7006228 2.8039929
#> [2,] 4.0229905 -0.6359713 4.1256160 -4.1306267 -0.4303055 2.0140353
#> [3,] 1.9070528 -3.7952366 -2.0674090 -2.5155661 -4.6262732 -0.1125200
#> [4,] 2.9546742 1.1895336 0.5573193 3.3682705 3.7067432 -4.2372909
#> [5,] -4.7538632 -0.1949125 -3.0470539 -1.2498946 4.0524254 1.3917748
#> [6,] -0.2220403 4.2389309 2.7640452 0.7412419 -0.4210158 2.5244328
#> [7,] 2.5845954 -2.5932731 0.9087094 -2.4473713 2.6756166 -4.0774453
#> [8,] -2.8359206 2.7648158 0.4323687 -4.8216287 -2.7992909 3.9625529
#> [9,] -1.8181899 2.5198789 -3.2706827 -0.6458960 3.3948262 -2.5945095
#> [10,] -2.6837421 1.7648242 4.0621661 4.0650954 1.1591354 -0.1142555
#> [,7] [,8]
#> [1,] -0.2448190 -3.022135
#> [2,] -2.8104900 3.098096
#> [3,] -2.2713924 -3.524646
#> [4,] 2.9222810 -3.140645
#> [5,] 1.0613606 4.821568
#> [6,] 2.2018533 -3.013025
#> [7,] -3.3362339 1.122616
#> [8,] 2.2856397 3.749805
#> [9,] 0.7560304 2.855820
#> [10,] -4.7331954 1.640497
## Sanity check:
stopifnot(identical(m2[31:40, 1:8], m2a))
m2b <- h5mread(test_h5, "m2", starts=list(31, 1), counts=list(10, 8),
as.integer=TRUE)
m2b
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,] 4 3 2 3 0 2 0 -3
#> [2,] 4 0 4 -4 0 2 -2 3
#> [3,] 1 -3 -2 -2 -4 0 -2 -3
#> [4,] 2 1 0 3 3 -4 2 -3
#> [5,] -4 0 -3 -1 4 1 1 4
#> [6,] 0 4 2 0 0 2 2 -3
#> [7,] 2 -2 0 -2 2 -4 -3 1
#> [8,] -2 2 0 -4 -2 3 2 3
#> [9,] -1 2 -3 0 3 -2 0 2
#> [10,] -2 1 4 4 1 0 -4 1
## Sanity check:
storage.mode(m2a) <- "integer"
stopifnot(identical(m2a, m2b))
a3 <- h5mread(test_h5, "a3") # 180 x 75 x 4 integer array
starts <- list(c(21, 101), NULL, 3:4)
counts <- list(c( 5, 22), NULL, NULL)
a <- h5mread(test_h5, "a3", starts=starts, counts=counts)
dim(a)
#> [1] 27 75 2
a[1:10, 1:12, ]
#> , , 1
#>
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
#> [1,] 0 0 0 0 0 0 0 1 0 0 0 0
#> [2,] 0 1 1 0 1 0 0 0 0 0 1 0
#> [3,] 0 0 0 0 1 0 0 2 0 0 0 0
#> [4,] 0 0 0 0 0 0 0 0 0 0 2 0
#> [5,] 0 0 0 0 0 0 0 0 0 0 0 1
#> [6,] 0 0 0 0 0 0 0 0 0 1 0 0
#> [7,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [8,] 0 0 0 1 0 0 0 0 0 0 1 0
#> [9,] 1 0 0 0 0 0 0 0 0 0 0 1
#> [10,] 0 0 0 1 1 0 0 0 0 0 0 0
#>
#> , , 2
#>
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
#> [1,] 1 0 0 0 0 0 0 0 1 0 0 0
#> [2,] 0 0 0 0 0 0 1 0 0 0 0 0
#> [3,] 1 0 1 0 0 0 0 0 1 0 0 1
#> [4,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [5,] 0 0 0 0 1 0 0 0 0 0 1 0
#> [6,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [7,] 0 0 0 0 0 0 0 1 0 0 0 0
#> [8,] 0 0 0 0 0 1 0 0 0 0 0 1
#> [9,] 0 0 0 0 0 0 1 0 0 0 0 0
#> [10,] 1 0 0 0 0 0 0 1 0 0 0 0
#>
## Sanity check:
stopifnot(identical(a3[c(21:25, 101:122), , 3:4, drop=FALSE], a))
## ---------------------------------------------------------------------
## RETURNING THE DATA AS A SPARSE ARRAY
## ---------------------------------------------------------------------
starts <- list(c(21:25, 101:122), NULL, 3:4)
coo <- h5mread(test_h5, "a3", starts=starts, as.sparse=TRUE)
coo
#> <27 x 75 x 2 SparseArray> of type "integer" [nzcount=577 (14%)]:
#> ,,1
#> [,1] [,2] [,3] [,4] ... [,72] [,73] [,74] [,75]
#> [1,] 0 0 0 0 . 0 0 0 0
#> [2,] 0 1 1 0 . 0 0 0 0
#> ... . . . . . . . . .
#> [26,] 0 0 0 0 . 0 0 1 0
#> [27,] 1 0 0 0 . 0 0 0 0
#>
#> ,,2
#> [,1] [,2] [,3] [,4] ... [,72] [,73] [,74] [,75]
#> [1,] 1 0 0 0 . 0 0 0 0
#> [2,] 0 0 0 0 . 0 0 0 0
#> ... . . . . . . . . .
#> [26,] 0 0 0 1 . 0 1 1 0
#> [27,] 1 0 1 1 . 0 0 0 0
#>
class(coo) # COO_SparseArray object (see ?COO_SparseArray)
#> [1] "COO_SparseArray"
#> attr(,"package")
#> [1] "SparseArray"
dim(coo)
#> [1] 27 75 2
## Sanity check:
stopifnot(is(coo, "COO_SparseArray"), identical(a, as.array(coo)))
## ---------------------------------------------------------------------
## PERFORMANCE
## ---------------------------------------------------------------------
library(ExperimentHub)
#> Loading required package: AnnotationHub
#> Loading required package: BiocFileCache
#> Loading required package: dbplyr
hub <- ExperimentHub()
## With the "sparse" TENxBrainData dataset
## ---------------------------------------
fname0 <- hub[["EH1039"]]
#> see ?TENxBrainData and browseVignettes('TENxBrainData') for documentation
#> loading from cache
h5ls(fname0) # all datasets are 1D datasets
#> group name otype dclass dim
#> 0 / mm10 H5I_GROUP
#> 1 /mm10 barcodes H5I_DATASET STRING 1306127
#> 2 /mm10 data H5I_DATASET INTEGER 2624828308
#> 3 /mm10 gene_names H5I_DATASET STRING 27998
#> 4 /mm10 genes H5I_DATASET STRING 27998
#> 5 /mm10 indices H5I_DATASET INTEGER 2624828308
#> 6 /mm10 indptr H5I_DATASET INTEGER 1306128
#> 7 /mm10 shape H5I_DATASET INTEGER 2
index <- list(77 * sample(34088679, 5000, replace=TRUE))
## h5mread() is about 4x faster than h5read():
system.time(a <- h5mread::h5mread(fname0, "mm10/data", index))
#> user system elapsed
#> 0.726 0.076 1.908
system.time(b <- h5read(fname0, "mm10/data", index=index))
#> user system elapsed
#> 5.376 0.006 5.383
stopifnot(identical(a, as.vector(b)))
index <- list(sample(1306127, 7500, replace=TRUE))
## h5mread() is about 14x faster than h5read():
system.time(a <- h5mread::h5mread(fname0, "mm10/barcodes", index))
#> user system elapsed
#> 0.075 0.001 0.077
system.time(b <- h5read(fname0, "mm10/barcodes", index=index))
#> user system elapsed
#> 2.042 0.012 2.053
stopifnot(identical(a, as.vector(b)))
## With the "dense" TENxBrainData dataset
## --------------------------------------
fname1 <- hub[["EH1040"]]
#> see ?TENxBrainData and browseVignettes('TENxBrainData') for documentation
#> loading from cache
h5ls(fname1) # "counts" is a 2D dataset
#> group name otype dclass dim
#> 0 / counts H5I_DATASET INTEGER 27998 x 1306127
set.seed(33)
index <- list(sample(27998, 300), sample(1306127, 450))
## h5mread() is about 2x faster than h5read():
system.time(a <- h5mread::h5mread(fname1, "counts", index))
#> user system elapsed
#> 5.136 0.125 5.413
system.time(b <- h5read(fname1, "counts", index=index))
#> user system elapsed
#> 13.803 0.305 14.108
stopifnot(identical(a, b))
## Alternatively 'as.sparse=TRUE' can be used to reduce memory usage:
system.time(coo <- h5mread::h5mread(fname1, "counts", index, as.sparse=TRUE))
#> user system elapsed
#> 5.106 0.074 5.181
stopifnot(identical(a, as.array(coo)))
## The bigger the selection, the greater the speedup between
## h5read() and h5mread():
# \donttest{
index <- list(sample(27998, 1000), sample(1306127, 1000))
## h5mread() about 4x faster than h5read() (12s vs 48s):
system.time(a <- h5mread::h5mread(fname1, "counts", index))
#> user system elapsed
#> 16.129 0.395 16.899
system.time(b <- h5read(fname1, "counts", index=index))
#> user system elapsed
#> 84.705 0.732 85.447
stopifnot(identical(a, b))
## With 'as.sparse=TRUE' (about the same speed as with 'as.sparse=FALSE'):
system.time(coo <- h5mread::h5mread(fname1, "counts", index, as.sparse=TRUE))
#> user system elapsed
#> 15.553 0.281 15.836
stopifnot(identical(a, as.array(coo)))
# }