Parallel computations by ranges
reduceByRange-methods.Rd
Computations are distributed in parallel by range. Data subsets are extracted and manipulated (MAP) and optionally combined (REDUCE) across all files.
Usage
# S4 method for class 'GRanges,ANY'
reduceByRange(ranges, files, MAP,
REDUCE, ..., summarize=FALSE, iterate=TRUE, init)
# S4 method for class 'GRangesList,ANY'
reduceByRange(ranges, files, MAP,
REDUCE, ..., summarize=FALSE, iterate=TRUE, init)
# S4 method for class 'GenomicFiles,missing'
reduceByRange(ranges, files, MAP,
REDUCE, ..., summarize=FALSE, iterate=TRUE, init)
reduceRanges(ranges, files, MAP, REDUCE, ..., init)
Arguments
- ranges
A
GRanges
,GrangesList
orGenomicFiles
object.A
GRangesList
implies a grouping of the ranges;MAP
is applied to each element of theGRangesList
vs each range whenranges
is aGRanges
.When
ranges
is aGenomicFiles
thefiles
argument is missing; both ranges and files are extracted from the object.- files
A
character
vector orList
of filenames. AList
implies a grouping of the files;MAP
is applied to each element of theList
vs each file individually.- MAP
A function executed on each worker. The signature must contain a minimum of two arguments representing the ranges and files. There is no restriction on argument names and additional arguments can be provided.
MAP = function(range, file, ...)
- REDUCE
An optional function that combines output from the
MAP
step applied across all files. The signature must contain at least one argument representing the list output fromMAP
. There is no restriction on argument names and additional arguments can be provided.REDUCE = function(mapped, ...)
Reduction combines data from a single worker and is always performed as part of the distributed step. When
iterate=TRUE
REDUCE
is applied after eachMAP
step; depending on the nature ofREDUCE
, iterative reduction can substantially decrease the data stored in memory. Wheniterate=FALSE
reduction is applied to the list ofMAP
outputs for a single range, applied to all files.When
REDUCE
is missing, output is a list fromMAP
.- iterate
A logical that, when
TRUE
, indicates that theREDUCE
function should be applied iteratively to the output ofMAP
. WhenREDUCE
is missingiterate
is set to FALSE. This argument applies toreduceByRange
only.Collapsing results iteratively is useful when the number of records to be processed is large (maybe complete files) but the end result is a much reduced representation of all records. Iteratively applying
REDUCE
reduces the amount of data on each worker at any one time and can substantially reduce the memory footprint.- summarize
A logical indicating if results should be returned as a
SummarizedExperiment
object instead of a list; data are returned in theassays
slot named `data`. This argument applies toreduceByRange
only.When
REDUCE
is providedsummarize
is ignored (i.e., set to FALSE). ASummarizedExperiment
requires the number of rows incolData
and the columns inassays
to match. BecauseREDUCE
collapses the data across files, the dimension of the result no longer matches that of the original ranges.- init
An optional initial value for
REDUCE
wheniterate=TRUE
.init
must be an object of the same type as the elements returned fromMAP
.REDUCE
logically addsinit
to the start (when proceeding left to right) or end of results obtained withMAP
.- ...
Arguments passed to other methods. Currently not used.
Details
reduceByRange
extracts, manipulates and combines ranges across
different files. Each element of ranges
is sent to a worker;
this is a single range when ranges
is a GRanges and may be
multiple ranges when ranges
is a GRangesList. The worker then
iterates across all files, applying MAP(range, file, ...)
to
each. When iterate=FALSE
, REDUCE
is applied to the list
of results from MAP
applied to all files. When iterate =
TRUE
, the argument to REDUCE
is always a list of length
2. REDUCE
is first invoked after the second file has been
processed. The first element of the list to REDUCE
is the
result of calling MAP
on the first file; the second element is
the result of calling MAP
on the second file. For the
n
th file, the first element is the result of the call to
REDUCE
for the n-1
th file, and the second element is the
result of calling MAP
on the n
th file.
reduceRanges
is essentially equivalent to reduceByRange
,
but with iterate = FALSE
.
Both MAP
and REDUCE
are applied in the distributed step
(“on the worker“). REDUCE
provides a way to summarize results
for a single range across all files; REDUCE
does not
provide a mechanism to summarize results across ranges.
Value
reduceByRange: When
summarize=FALSE
the return value is alist
or the value from the final invocation ofREDUCE
. Whensummarize=TRUE
output is aSummarizedExperiment
. Whenranges
is aGenomicFiles
object data fromrowRanges
,colData
andmetadata
are transferred to theSummarizedExperiment
.reduceRanges: A
list
or the value returned by the final invocation ofREDUCE
.
Examples
if (all(requireNamespace("RNAseqData.HNRNPC.bam.chr14", quietly=TRUE) &&
require(GenomicAlignments))) {
## -----------------------------------------------------------------------
## Compute coverage across BAM files.
## -----------------------------------------------------------------------
fls <- ## 8 bam files
RNAseqData.HNRNPC.bam.chr14::RNAseqData.HNRNPC.bam.chr14_BAMFILES
## Regions of interest.
gr <- GRanges("chr14", IRanges(c(62262735, 63121531, 63980327),
width=214700))
## The MAP computes the coverage ...
MAP <- function(range, file, ...) {
requireNamespace("GenomicFiles", quietly=TRUE)
## for coverage(), Rsamtools::ScanBamParam()
param = Rsamtools::ScanBamParam(which=range)
GenomicFiles::coverage(file, param=param)[range]
}
## and REDUCE adds the last and current results.
REDUCE <- function(mapped, ...)
Reduce("+", mapped)
## -----------------------------------------------------------------------
## reduceByRange:
## With no REDUCE, coverage is computed for each range / file combination.
cov1 <- reduceByRange(gr, fls, MAP)
cov1[[1]]
## Each call to coverage() produces an RleList which accumulate on the
## workers. We can use a reducer to combine these lists either iteratively
## or non-iteratively. When iterate = TRUE the current result
## is collapsed with the last resulting in a maximum of 2 RleLists on
## a worker at any given time.
cov2 <- reduceByRange(gr, fls, MAP, REDUCE, iterate=TRUE)
cov2[[1]]
## If memory use is not a concern (or if MAP output is not large) the
## REDUCE function can be applied non-iteratively.
cov3 <- reduceByRange(gr, fls, MAP, REDUCE, iterate=FALSE)
## Results match those obtained with the iterative REDUCE.
cov3[[1]]
## When 'ranges' is a GRangesList, the list elements are sent to the
## workers instead of a single range as in the case of a GRanges.
grl <- GRangesList(gr[1], gr[2:3])
grl
cov4 <- reduceByRange(grl, fls, MAP)
length(cov4) ## length of GRangesList
elementNROWS(cov4) ## number of files
## -----------------------------------------------------------------------
## reduceRanges:
## This function passes the character vector of all file names to MAP.
## MAP must handle each file separately or invoke a method that operates
## on a list of files.
## TODO: example
}
#> Loading required package: GenomicAlignments
#> [1] 8 8