An alternative to rhdf5::h5read — h5mread • h5mread

An efficient and flexible alternative to rhdf5::h5read().

Usage

h5mread(filepath, name, starts=NULL, counts=NULL, noreduce=FALSE,
        as.vector=NA, as.integer=FALSE, as.sparse=FALSE,
        method=0L, use.H5Dread_chunk=FALSE)

get_h5mread_returned_type(filepath, name, as.integer=FALSE)

Arguments

filepath

The path (as a single string) to the HDF5 file where the dataset to read from is located, or an H5File object.

Note that you must create and use an H5File object if the HDF5 file to access is stored in an Amazon S3 bucket. See ?H5File for how to do this.

Also please note that H5File objects must NOT be used in the context of parallel evaluation at the moment.

name

The name of the dataset in the HDF5 file.

starts, counts

starts and counts are used to specify the array selection. Each argument can be either NULL or a list with one list element per dimension in the dataset.

If starts and counts are both NULL, then the entire dataset is read.

If starts is a list, each list element in it must be a vector of valid positive indices along the corresponding dimension in the dataset. An empty vector (integer(0)) is accepted and indicates an empty selection along that dimension. A NULL is accepted and indicates a full selection along the dimension so has the same meaning as a missing subscript when subsetting an array-like object with [. (Note that for [ a NULL subscript indicates an empty selection.)

Each list element in counts must be NULL or a vector of non-negative integers of the same length as the corresponding list element in starts. Each value in the vector indicates how many positions to select starting from the associated start value. A NULL indicates that a single position is selected for each value along the corresponding dimension.

If counts is NULL, then each index in each starts list element indicates a single position selection along the corresponding dimension. Note that in this case the starts argument is equivalent to the index argument of h5read and extract_array (with the caveat that h5read doesn't accept empty selections).

Finally note that when counts is not NULL then the selection described by starts and counts must be strictly ascending along each dimension.

as.vector

Should the data be returned in a vector instead of an array? By default (i.e. when set to NA), the data is returned in an ordinary array when reading from a multidimensional dataset, and in an ordinary vector when reading from a 1D dataset. You can override this by setting as.vector to TRUE or FALSE.

as.integer

If set to TRUE then the data is loaded in an ordinary array (or vector) of type() "integer". This will typically reduce the memory footprint of the returned array or vector by half if the values in the HDF5 dataset are floating point values.

Note that, when as.integer=TRUE, the values loaded from the HDF5 dataset get coerced to integers at the C level as early as possible so this transformation is very efficient.

as.sparse

By default h5mread() returns the data in an ordinary array or vector. Use as.sparse=TRUE to return it in a SparseArray derivative from the SparseArray package. This will significantly reduce the memory footprint of the returned object if the HDF5 dataset contains mostly zeros.

noreduce, method, use.H5Dread_chunk

For testing and advanced usage only. Do not use.

Value

h5mread() returns an ordinary array or vector if as.sparse is FALSE (the default), and a COO_SparseArray object if as.sparse is TRUE.

get_h5mread_returned_type() returns the type of the array or vector that will be returned by h5mread(). Equivalent to (but more efficient than):

  typeof(h5mread(filepath, name, rep(list(integer(0)), ndim)))

where ndim is the number of dimensions (a.k.a. rank in HDF5 jargon) of the dataset.

Examples

## ---------------------------------------------------------------------
## BASIC EXAMPLES
## ---------------------------------------------------------------------
test_h5 <- system.file("extdata", "test.h5", package="h5mread")
h5ls(test_h5)
#> Warning: An open HDF5 file handle exists. If the file has changed on disk meanwhile, the function may not work properly. Run 'h5closeAll()' to close all open HDF5 object handles.
#>           group         name       otype  dclass          dim
#> 0             / .m2_dimnames   H5I_GROUP                     
#> 1 /.m2_dimnames            1 H5I_DATASET  STRING         4000
#> 2 /.m2_dimnames            2 H5I_DATASET  STRING           90
#> 3             /           a3 H5I_DATASET INTEGER 180 x 75 x 4
#> 4             /           m1 H5I_DATASET INTEGER       12 x 5
#> 5             /           m2 H5I_DATASET   FLOAT    4000 x 90
#> 6             /           m4 H5I_DATASET INTEGER    28 x 4000
#> 7             /       rwords H5I_DATASET  STRING        30000

m1 <- h5mread(test_h5, "m1")  # 12 x 5 integer matrix

m <- h5mread(test_h5, "m1", starts=list(c(8, 12:7), NULL))
m
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]  108  120  132  144  156
#> [2,]  112  124  136  148  160
#> [3,]  111  123  135  147  159
#> [4,]  110  122  134  146  158
#> [5,]  109  121  133  145  157
#> [6,]  108  120  132  144  156
#> [7,]  107  119  131  143  155

## Sanity check:
stopifnot(identical(m1[c(8, 12:7), ], m))

m <- h5mread(test_h5, "m1", starts=list(c(8, 12:7), integer(0)))
m
#>     
#> [1,]
#> [2,]
#> [3,]
#> [4,]
#> [5,]
#> [6,]
#> [7,]

## Sanity check:
stopifnot(identical(m1[c(8, 12:7), NULL], m))

m2 <- h5mread(test_h5, "m2")  # 4000 x 90 double matrix

m2a <- h5mread(test_h5, "m2", starts=list(31, 1), counts=list(10, 8))
m2a
#>             [,1]       [,2]       [,3]       [,4]       [,5]       [,6]
#>  [1,]  4.6302423  3.6920781  2.5704878  3.8089040 -0.7006228  2.8039929
#>  [2,]  4.0229905 -0.6359713  4.1256160 -4.1306267 -0.4303055  2.0140353
#>  [3,]  1.9070528 -3.7952366 -2.0674090 -2.5155661 -4.6262732 -0.1125200
#>  [4,]  2.9546742  1.1895336  0.5573193  3.3682705  3.7067432 -4.2372909
#>  [5,] -4.7538632 -0.1949125 -3.0470539 -1.2498946  4.0524254  1.3917748
#>  [6,] -0.2220403  4.2389309  2.7640452  0.7412419 -0.4210158  2.5244328
#>  [7,]  2.5845954 -2.5932731  0.9087094 -2.4473713  2.6756166 -4.0774453
#>  [8,] -2.8359206  2.7648158  0.4323687 -4.8216287 -2.7992909  3.9625529
#>  [9,] -1.8181899  2.5198789 -3.2706827 -0.6458960  3.3948262 -2.5945095
#> [10,] -2.6837421  1.7648242  4.0621661  4.0650954  1.1591354 -0.1142555
#>             [,7]      [,8]
#>  [1,] -0.2448190 -3.022135
#>  [2,] -2.8104900  3.098096
#>  [3,] -2.2713924 -3.524646
#>  [4,]  2.9222810 -3.140645
#>  [5,]  1.0613606  4.821568
#>  [6,]  2.2018533 -3.013025
#>  [7,] -3.3362339  1.122616
#>  [8,]  2.2856397  3.749805
#>  [9,]  0.7560304  2.855820
#> [10,] -4.7331954  1.640497

## Sanity check:
stopifnot(identical(m2[31:40, 1:8], m2a))

m2b <- h5mread(test_h5, "m2", starts=list(31, 1), counts=list(10, 8),
               as.integer=TRUE)
m2b
#>       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#>  [1,]    4    3    2    3    0    2    0   -3
#>  [2,]    4    0    4   -4    0    2   -2    3
#>  [3,]    1   -3   -2   -2   -4    0   -2   -3
#>  [4,]    2    1    0    3    3   -4    2   -3
#>  [5,]   -4    0   -3   -1    4    1    1    4
#>  [6,]    0    4    2    0    0    2    2   -3
#>  [7,]    2   -2    0   -2    2   -4   -3    1
#>  [8,]   -2    2    0   -4   -2    3    2    3
#>  [9,]   -1    2   -3    0    3   -2    0    2
#> [10,]   -2    1    4    4    1    0   -4    1

## Sanity check:
storage.mode(m2a) <- "integer"
stopifnot(identical(m2a, m2b))

a3 <- h5mread(test_h5, "a3")  # 180 x 75 x 4 integer array

starts <- list(c(21, 101), NULL, 3:4)
counts <- list(c( 5,  22), NULL, NULL)
a <- h5mread(test_h5, "a3", starts=starts, counts=counts)
dim(a)
#> [1] 27 75  2
a[1:10, 1:12, ]
#> , , 1
#> 
#>       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
#>  [1,]    0    0    0    0    0    0    0    1    0     0     0     0
#>  [2,]    0    1    1    0    1    0    0    0    0     0     1     0
#>  [3,]    0    0    0    0    1    0    0    2    0     0     0     0
#>  [4,]    0    0    0    0    0    0    0    0    0     0     2     0
#>  [5,]    0    0    0    0    0    0    0    0    0     0     0     1
#>  [6,]    0    0    0    0    0    0    0    0    0     1     0     0
#>  [7,]    0    0    0    0    0    0    0    0    0     0     0     0
#>  [8,]    0    0    0    1    0    0    0    0    0     0     1     0
#>  [9,]    1    0    0    0    0    0    0    0    0     0     0     1
#> [10,]    0    0    0    1    1    0    0    0    0     0     0     0
#> 
#> , , 2
#> 
#>       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
#>  [1,]    1    0    0    0    0    0    0    0    1     0     0     0
#>  [2,]    0    0    0    0    0    0    1    0    0     0     0     0
#>  [3,]    1    0    1    0    0    0    0    0    1     0     0     1
#>  [4,]    0    0    0    0    0    0    0    0    0     0     0     0
#>  [5,]    0    0    0    0    1    0    0    0    0     0     1     0
#>  [6,]    0    0    0    0    0    0    0    0    0     0     0     0
#>  [7,]    0    0    0    0    0    0    0    1    0     0     0     0
#>  [8,]    0    0    0    0    0    1    0    0    0     0     0     1
#>  [9,]    0    0    0    0    0    0    1    0    0     0     0     0
#> [10,]    1    0    0    0    0    0    0    1    0     0     0     0
#> 

## Sanity check:
stopifnot(identical(a3[c(21:25, 101:122), , 3:4, drop=FALSE], a))

## ---------------------------------------------------------------------
## RETURNING THE DATA AS A SPARSE ARRAY
## ---------------------------------------------------------------------

starts <- list(c(21:25, 101:122), NULL, 3:4)
coo <- h5mread(test_h5, "a3", starts=starts, as.sparse=TRUE)
coo
#> <27 x 75 x 2 SparseArray> of type "integer" [nzcount=577 (14%)]:
#> ,,1
#>        [,1]  [,2]  [,3]  [,4] ... [,72] [,73] [,74] [,75]
#>  [1,]     0     0     0     0   .     0     0     0     0
#>  [2,]     0     1     1     0   .     0     0     0     0
#>   ...     .     .     .     .   .     .     .     .     .
#> [26,]     0     0     0     0   .     0     0     1     0
#> [27,]     1     0     0     0   .     0     0     0     0
#> 
#> ,,2
#>        [,1]  [,2]  [,3]  [,4] ... [,72] [,73] [,74] [,75]
#>  [1,]     1     0     0     0   .     0     0     0     0
#>  [2,]     0     0     0     0   .     0     0     0     0
#>   ...     .     .     .     .   .     .     .     .     .
#> [26,]     0     0     0     1   .     0     1     1     0
#> [27,]     1     0     1     1   .     0     0     0     0
#> 

class(coo)  # COO_SparseArray object (see ?COO_SparseArray)
#> [1] "COO_SparseArray"
#> attr(,"package")
#> [1] "SparseArray"
dim(coo)
#> [1] 27 75  2

## Sanity check:
stopifnot(is(coo, "COO_SparseArray"), identical(a, as.array(coo)))

## ---------------------------------------------------------------------
## PERFORMANCE
## ---------------------------------------------------------------------
library(ExperimentHub)
#> Loading required package: AnnotationHub
#> Loading required package: BiocFileCache
#> Loading required package: dbplyr
hub <- ExperimentHub()

## With the "sparse" TENxBrainData dataset
## ---------------------------------------
fname0 <- hub[["EH1039"]]
#> see ?TENxBrainData and browseVignettes('TENxBrainData') for documentation
#> loading from cache
h5ls(fname0)  # all datasets are 1D datasets
#>   group       name       otype  dclass        dim
#> 0     /       mm10   H5I_GROUP                   
#> 1 /mm10   barcodes H5I_DATASET  STRING    1306127
#> 2 /mm10       data H5I_DATASET INTEGER 2624828308
#> 3 /mm10 gene_names H5I_DATASET  STRING      27998
#> 4 /mm10      genes H5I_DATASET  STRING      27998
#> 5 /mm10    indices H5I_DATASET INTEGER 2624828308
#> 6 /mm10     indptr H5I_DATASET INTEGER    1306128
#> 7 /mm10      shape H5I_DATASET INTEGER          2

index <- list(77 * sample(34088679, 5000, replace=TRUE))
## h5mread() is about 4x faster than h5read():
system.time(a <- h5mread::h5mread(fname0, "mm10/data", index))
#>    user  system elapsed 
#>   0.726   0.076   1.908 
system.time(b <- h5read(fname0, "mm10/data", index=index))
#>    user  system elapsed 
#>   5.376   0.006   5.383 
stopifnot(identical(a, as.vector(b)))

index <- list(sample(1306127, 7500, replace=TRUE))
## h5mread() is about 14x faster than h5read():
system.time(a <- h5mread::h5mread(fname0, "mm10/barcodes", index))
#>    user  system elapsed 
#>   0.075   0.001   0.077 
system.time(b <- h5read(fname0, "mm10/barcodes", index=index))
#>    user  system elapsed 
#>   2.042   0.012   2.053 
stopifnot(identical(a, as.vector(b)))

## With the "dense" TENxBrainData dataset
## --------------------------------------
fname1 <- hub[["EH1040"]]
#> see ?TENxBrainData and browseVignettes('TENxBrainData') for documentation
#> loading from cache
h5ls(fname1)  # "counts" is a 2D dataset
#>   group   name       otype  dclass             dim
#> 0     / counts H5I_DATASET INTEGER 27998 x 1306127

set.seed(33)
index <- list(sample(27998, 300), sample(1306127, 450))
## h5mread() is about 2x faster than h5read():
system.time(a <- h5mread::h5mread(fname1, "counts", index))
#>    user  system elapsed 
#>   5.136   0.125   5.413 
system.time(b <- h5read(fname1, "counts", index=index))
#>    user  system elapsed 
#>  13.803   0.305  14.108 
stopifnot(identical(a, b))

## Alternatively 'as.sparse=TRUE' can be used to reduce memory usage:
system.time(coo <- h5mread::h5mread(fname1, "counts", index, as.sparse=TRUE))
#>    user  system elapsed 
#>   5.106   0.074   5.181 
stopifnot(identical(a, as.array(coo)))

## The bigger the selection, the greater the speedup between
## h5read() and h5mread():
# \donttest{
  index <- list(sample(27998, 1000), sample(1306127, 1000))
  ## h5mread() about 4x faster than h5read() (12s vs 48s):
  system.time(a <- h5mread::h5mread(fname1, "counts", index))
#>    user  system elapsed 
#>  16.129   0.395  16.899 
  system.time(b <- h5read(fname1, "counts", index=index))
#>    user  system elapsed 
#>  84.705   0.732  85.447 
  stopifnot(identical(a, b))

  ## With 'as.sparse=TRUE' (about the same speed as with 'as.sparse=FALSE'):
  system.time(coo <- h5mread::h5mread(fname1, "counts", index, as.sparse=TRUE))
#>    user  system elapsed 
#>  15.553   0.281  15.836 
  stopifnot(identical(a, as.array(coo)))
# }

An alternative to `rhdf5::h5read`

Usage

Arguments

Value

See also

Examples