H5File objects
H5File-class.Rd
The H5File class provides a formal representation of an HDF5 file (local or remote).
Arguments
- filepath
A single string specifying the path or URL to an HDF5 file.
- s3
TRUE or FALSE. Should the filepath argument be treated as the URL to a file stored in an Amazon S3 bucket, rather than the path to a local file?
- s3credentials
A list of length 3, providing the credentials for accessing files stored in a private Amazon S3 bucket. See ?H5Pset_fapl_ros3 in the rhdf5 package for more information.
- .no_rhdf5_h5id
For internal use only. Don't use.
Details
IMPORTANT NOTE ABOUT H5File OBJECTS AND PARALLEL EVALUATION
The short story is that H5File objects cannot be used in the context of parallel evaluation at the moment.
Here is why:
H5File objects contain an identifier to an open connection to the HDF5 file. This identifier becomes invalid in the 2 following situations:
- After serialization/deserialization, that is, after loading a serialized H5File object with readRDS() or load().
- In the context of parallel evaluation, when using the SnowParam parallelization backend. This is because, unlike the MulticoreParam backend, which uses a system fork, the SnowParam backend uses serialization/deserialization to transmit the object to the workers.
In both cases, the connection to the file is lost and any attempt to read data from the H5File object will fail. Note that the above also happens to any H5File object that got serialized indirectly, i.e., as part of a bigger object. For example, if an HDF5Array object was constructed from an H5File object, then it contains the H5File object, and therefore blockApply(..., BPPARAM=SnowParam(4)) cannot be used on it.
Furthermore, even if sometimes an H5File object seems to work fine with the MulticoreParam parallelization backend, this is highly unreliable and must be avoided.
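The serialization pitfall described above can be illustrated with a small sketch (not part of this man page; the failing call is left commented out because it is expected to error):

```r
library(h5mread)

test_h5 <- system.file("extdata", "test.h5", package="h5mread")
h5file <- H5File(test_h5)                # valid connection to the file
h5mread(h5file, "m1", list(1:2, NULL))   # works

## Round-trip through serialization/deserialization:
tmp <- tempfile(fileext=".rds")
saveRDS(h5file, tmp)
h5file2 <- readRDS(tmp)

## h5mread(h5file2, "m1")  # expected to FAIL: the HDF5 identifier
##                         # inside 'h5file2' is stale
```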
See also
- H5Pset_fapl_ros3 in the rhdf5 package for detailed information about how to pass your S3 credentials to the s3credentials argument.
- The HDF5Array class defined in the HDF5Array package for representing and operating on a conventional (a.k.a. dense) HDF5 dataset.
- The H5SparseMatrix class defined in the HDF5Array package for representing and operating on an HDF5 sparse matrix.
- The H5ADMatrix class defined in the HDF5Array package for representing and operating on the central matrix of an h5ad file, or any matrix in its /layers group.
- The TENxMatrix class defined in the HDF5Array package for representing and operating on a 10x Genomics dataset.
- The h5mread function in this package (h5mread) that is used internally by HDF5Array, TENxMatrix, and H5ADMatrix objects, for (almost) all their data reading needs.
- h5ls to list the content of an HDF5 file.
- bplapply, MulticoreParam, and SnowParam in the BiocParallel package.
Examples
## ---------------------------------------------------------------------
## A. BASIC USAGE
## ---------------------------------------------------------------------
## With a local file:
test_h5 <- system.file("extdata", "test.h5", package="h5mread")
h5file1 <- H5File(test_h5)
h5ls(h5file1)
#> group name otype dclass dim
#> 0 / .m2_dimnames H5I_GROUP
#> 1 /.m2_dimnames 1 H5I_DATASET STRING 4000
#> 2 /.m2_dimnames 2 H5I_DATASET STRING 90
#> 3 / a3 H5I_DATASET INTEGER 180 x 75 x 4
#> 4 / m1 H5I_DATASET INTEGER 12 x 5
#> 5 / m2 H5I_DATASET FLOAT 4000 x 90
#> 6 / m4 H5I_DATASET INTEGER 28 x 4000
#> 7 / rwords H5I_DATASET STRING 30000
path(h5file1)
#> [1] "/tmp/RtmpFXHjrt/temp_libpathc9d33e3a6bd/h5mread/extdata/test.h5"
h5mread(h5file1, "m2", list(1:10, 1:6))
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] -2.1242248 NA NaN -2.5779525 1.9532461 4.9112339
#> [2,] 2.8830514 Inf -Inf -0.1996812 -1.2167090 -1.9776935
#> [3,] -0.9102308 0.9364751 2.4219044 3.4048459 4.6469619 -0.6624099
#> [4,] 3.8301740 3.0719664 -2.7439101 4.9207697 0.2735301 -3.3947909
#> [5,] 4.4046728 -2.0594922 2.4889107 -2.5576531 -4.5567993 3.2302671
#> [6,] -4.5444350 -3.5891478 3.2486851 1.2303872 1.7265274 -2.9190945
#> [7,] 0.2810549 3.8862110 -4.0045662 -0.4121020 1.1311564 -2.1503403
#> [8,] 3.9241904 -4.9171500 -2.2412863 3.4837234 -0.7017639 0.4620889
#> [9,] 0.5143501 0.6912065 -4.7862322 2.2703366 -1.0607908 0.8904540
#> [10,] -0.4338526 4.6755019 0.9624297 -3.0869540 2.6021692 -0.5774682
get_h5mread_returned_type(h5file1, "m2")
#> [1] "double"
## With a file stored in an Amazon S3 bucket:
if (Sys.info()[["sysname"]] != "Darwin") {
public_S3_url <-
"https://rhdf5-public.s3.eu-central-1.amazonaws.com/rhdf5ex_t_float_3d.h5"
h5file2 <- H5File(public_S3_url, s3=TRUE)
h5ls(h5file2)
h5mread(h5file2, "a1")
get_h5mread_returned_type(h5file2, "a1")
}
#> [1] "double"
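For a file stored in a private S3 bucket, the s3credentials argument would be supplied as a 3-element list. The sketch below uses placeholder URL and credential values (they are not real); the exact content and order of the list are documented in ?H5Pset_fapl_ros3 in the rhdf5 package:

```r
## Not run -- requires real credentials (placeholders shown):
## private_S3_url <- "https://my-bucket.s3.eu-central-1.amazonaws.com/my_file.h5"
## creds <- list("eu-central-1", "MY_ACCESS_KEY_ID", "MY_SECRET_ACCESS_KEY")
## h5file3 <- H5File(private_S3_url, s3=TRUE, s3credentials=creds)
```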
## ---------------------------------------------------------------------
## B. H5File OBJECTS AND PARALLEL EVALUATION
## ---------------------------------------------------------------------
## H5File objects cannot be used in the context of parallel evaluation
## at the moment!
library(BiocParallel)
FUN1 <- function(i, h5file, name)
sum(h5mread::h5mread(h5file, name, list(i, NULL)))
FUN2 <- function(i, h5file, name)
sum(h5mread::h5mread(h5file, name, list(i, NULL, NULL)))
## With the SnowParam parallelization backend, the H5File object
## does NOT work on the workers:
if (FALSE) { # \dontrun{
## ERROR!
res1 <- bplapply(1:150, FUN1, h5file1, "m2", BPPARAM=SnowParam(3))
## ERROR!
res2 <- bplapply(1:5, FUN2, h5file2, "a1", BPPARAM=SnowParam(3))
} # }
## With the MulticoreParam parallelization backend, the H5File object
## might seem to work on the workers. However this is highly unreliable
## and must be avoided:
if (FALSE) { # \dontrun{
if (.Platform$OS.type != "windows") {
## UNRELIABLE!
res1 <- bplapply(1:150, FUN1, h5file1, "m2", BPPARAM=MulticoreParam(3))
## UNRELIABLE!
res2 <- bplapply(1:5, FUN2, h5file2, "a1", BPPARAM=MulticoreParam(3))
}
} # }
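A possible workaround (a sketch, not covered by this man page) is to pass the file path, rather than the H5File object, to the workers, and let each task construct its own H5File object so that no HDF5 identifier ever crosses the serialization boundary. FUN3 and res3 below are illustrative names, and note that each task re-opens the file, which may be inefficient:

```r
## ---------------------------------------------------------------------
## C. A POSSIBLE WORKAROUND (sketch)
## ---------------------------------------------------------------------
library(BiocParallel)
FUN3 <- function(i, filepath, name) {
    h5file <- h5mread::H5File(filepath)  # fresh connection on the worker
    sum(h5mread::h5mread(h5file, name, list(i, NULL)))
}
res3 <- bplapply(1:150, FUN3, path(h5file1), "m2", BPPARAM=SnowParam(3))
```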