Parallel computations by files
reduceByFile-methods.Rd
Computations are distributed in parallel by file. Data subsets are extracted and manipulated (MAP) and optionally combined (REDUCE) within a single file.
Usage
# S4 method for class 'GRanges,ANY'
reduceByFile(ranges, files, MAP,
REDUCE, ..., summarize=FALSE, iterate=TRUE, init)
# S4 method for class 'GRangesList,ANY'
reduceByFile(ranges, files, MAP,
REDUCE, ..., summarize=FALSE, iterate=TRUE, init)
# S4 method for class 'GenomicFiles,missing'
reduceByFile(ranges, files, MAP,
REDUCE, ..., summarize=FALSE, iterate=TRUE, init)
reduceFiles(ranges, files, MAP, REDUCE, ..., init)
Arguments
- ranges
A GRanges, GRangesList or GenomicFiles object. A GRangesList implies a grouping of the ranges; MAP is applied to each element of the GRangesList, rather than to each individual range as when ranges is a GRanges.
When ranges is a GenomicFiles, the files argument is missing; both ranges and files are extracted from the object.
- files
A character vector or List of filenames. A List implies a grouping of the files; MAP is applied to each element of the List rather than to each file individually.
- MAP
A function executed on each worker. The signature must contain a minimum of two arguments representing the ranges and files. There is no restriction on argument names and additional arguments can be provided.
MAP = function(range, file, ...)
- REDUCE
An optional function that combines output from the MAP step. The signature must contain at least one argument representing the list output from MAP. There is no restriction on argument names and additional arguments can be provided.
REDUCE = function(mapped, ...)
Reduction combines data from a single worker and is always performed as part of the distributed step. When iterate=TRUE, REDUCE is applied after each MAP step; depending on the nature of REDUCE, iterative reduction can substantially decrease the data stored in memory. When iterate=FALSE, reduction is applied once to the list of output from MAP applied to all files / ranges.
When REDUCE is missing, output is a list from MAP.
- iterate
A logical indicating if the REDUCE function should be applied iteratively to the output of MAP. When REDUCE is missing, iterate is set to FALSE. This argument applies to reduceByFile only (reduceFiles calls MAP a single time on each worker).
Collapsing results iteratively is useful when the number of records to be processed is large (maybe complete files) but the end result is a much reduced representation of all records. Iteratively applying REDUCE reduces the amount of data on each worker at any one time and can substantially reduce the memory footprint.
- summarize
A logical indicating if results should be returned as a SummarizedExperiment object instead of a list; data are returned in the assays slot named `data`. This argument applies to reduceByFile only.
When REDUCE is provided, summarize is ignored (i.e., set to FALSE). A SummarizedExperiment requires the number of rows in rowRanges and assays to match. Because REDUCE collapses the data across ranges, the dimension of the result no longer matches that of the original ranges.
- init
An optional initial value for REDUCE when iterate=TRUE. init must be an object of the same type as the elements returned from MAP. REDUCE logically adds init to the start (when proceeding left to right) or end of results obtained with MAP.
- ...
Arguments passed to other methods.
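The difference between iterate=FALSE and iterate=TRUE, and the role of init, can be illustrated with a small base-R sketch. It is not the package's implementation. The names toy_ranges, toy_map, toy_reduce, reduce_once and reduce_iteratively are hypothetical, each "range" is just a numeric vector, and the file name is a placeholder that is never opened.

```r
## Hypothetical stand-ins: each "range" is a numeric vector and MAP returns
## its sum; REDUCE adds up the elements of the list it receives.
toy_ranges <- list(1:3, 4:6, 7:9)
toy_map <- function(range, file, ...) sum(range)
toy_reduce <- function(mapped, ...) Reduce(`+`, mapped)

## iterate=FALSE (assumed semantics): collect every MAP result for the file,
## then call REDUCE once on the full list.
reduce_once <- function(ranges, file, MAP, REDUCE) {
    mapped <- lapply(ranges, MAP, file = file)
    REDUCE(mapped)
}

## iterate=TRUE (assumed semantics): fold each MAP result into a running
## value, so only the running value and one MAP result are held at a time.
## When supplied, init seeds the running value.
reduce_iteratively <- function(ranges, file, MAP, REDUCE, init) {
    start <- 1L
    if (missing(init)) {
        result <- MAP(ranges[[1L]], file)
        start <- 2L
    } else {
        result <- init
    }
    idx <- seq_along(ranges)
    for (i in idx[idx >= start]) {
        result <- REDUCE(list(result, MAP(ranges[[i]], file)))
    }
    result
}

once <- reduce_once(toy_ranges, "toy.bam", toy_map, toy_reduce)
iter <- reduce_iteratively(toy_ranges, "toy.bam", toy_map, toy_reduce)
with_init <- reduce_iteratively(toy_ranges, "toy.bam", toy_map, toy_reduce,
                                init = 100)
```

For an associative reducer such as this sum, once and iter agree; a REDUCE that is not associative may give different answers under the two modes, which is worth checking before switching iterate.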
Details
reduceByFile extracts, manipulates and combines multiple ranges within a single file. Each file is sent to a worker where MAP is invoked on each file / range combination. This approach allows multiple ranges extracted from a single file to be kept separate or combined with REDUCE.
In contrast, reduceFiles treats the output of all MAP calls as a group and reduces them together. REDUCE usually plays a minor role by concatenating or unlisting results.
Both MAP and REDUCE are applied in the distributed step (“on the worker”). Results are not combined across workers in the distributed step.
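The contrast between the two functions can be sketched in base R, under the assumption (taken from the iterate argument description) that reduceFiles calls MAP a single time per file with all ranges as a group. Everything here is a hypothetical stand-in: a "file" is a vector of record positions, ranges are rows of a start/end matrix, and lapply() replaces the parallel dispatch to workers.

```r
## Hypothetical toy data standing in for BAM files and a GRanges.
toy_files <- list(a = c(1, 5, 12, 18), b = c(3, 11, 14))
toy_ranges <- rbind(c(start = 0, end = 10), c(start = 10, end = 20))

## MAP counts the records that fall inside any of the ranges it receives
## (half-open intervals: start <= position < end).
count_records <- function(range, file, ...) {
    hits <- vapply(file, function(pos) {
        any(pos >= range[, "start"] & pos < range[, "end"])
    }, logical(1))
    sum(hits)
}

## reduceByFile-like: one MAP call per file / range combination, giving a
## list of per-range results for each file.
by_file <- lapply(toy_files, function(file) {
    lapply(seq_len(nrow(toy_ranges)), function(i) {
        count_records(toy_ranges[i, , drop = FALSE], file)
    })
})

## reduceFiles-like: one MAP call per file, with all ranges as one group.
grouped <- lapply(toy_files, function(file) count_records(toy_ranges, file))
```

In this sketch by_file holds one count per range within each file, while grouped holds a single count per file; neither result is combined across files, mirroring the statement that results are not combined across workers.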
Value
reduceByFile: When summarize=FALSE the return value is a list or the value from the final invocation of REDUCE. When summarize=TRUE output is a SummarizedExperiment. When ranges is a GenomicFiles object, data from rowRanges, colData and metadata are transferred to the SummarizedExperiment.
reduceFiles: A list or the value returned by the final invocation of REDUCE.
Examples
if (requireNamespace("RNAseqData.HNRNPC.bam.chr14", quietly=TRUE)) {
## -----------------------------------------------------------------------
## Count junction reads in BAM files
## -----------------------------------------------------------------------
fls <- ## 8 bam files
RNAseqData.HNRNPC.bam.chr14::RNAseqData.HNRNPC.bam.chr14_BAMFILES
## Ranges of interest.
gr <- GRanges("chr14", IRanges(c(19100000, 106000000), width=1e7))
## MAP outputs a table of junction counts per range.
MAP <- function(range, file, ...) {
    ## for readGAlignments(), Rsamtools::ScanBamParam()
    requireNamespace("GenomicAlignments", quietly=TRUE)
    param <- Rsamtools::ScanBamParam(which=range)
    gal <- GenomicAlignments::readGAlignments(file, param=param)
    table(GenomicAlignments::njunc(gal))
}
## -----------------------------------------------------------------------
## reduceByFile:
## With no REDUCE, counts are computed for each range / file combination.
counts1 <- reduceByFile(gr, fls, MAP)
length(counts1) ## 8 files
elementNROWS(counts1) ## 2 ranges each
## Tables of counts for each range:
counts1[[1]]
## With a REDUCE, results are combined on the fly. This reducer sums the
## number of records in each range with exactly 1 junction.
REDUCE <- function(mapped, ...)
sum(sapply(mapped, "[", "1"))
reduceByFile(gr, fls, MAP, REDUCE)
## -----------------------------------------------------------------------
## reduceFiles:
## All ranges are treated as a single group:
counts2 <- reduceFiles(gr, fls, MAP)
## Counts are for all ranges grouped:
counts2[[1]]
## Contrast the above with that from reduceByFile() where counts
## are for each range separately:
counts1[[1]]
## -----------------------------------------------------------------------
## Methods for the GenomicFiles class:
## Both reduceByFile() and reduceFiles() can operate on a GenomicFiles
## object.
colData <- DataFrame(method=rep("RNASeq", length(fls)),
format=rep("bam", length(fls)))
gf <- GenomicFiles(files=fls, rowRanges=gr, colData=colData)
gf
## Subset on ranges or files for different experimental runs.
dim(gf)
gf_sub <- gf[2, 3:4]
dim(gf_sub)
## When summarize = TRUE and no REDUCE is given, the output is a
## SummarizedExperiment object.
se <- reduceByFile(gf, MAP=MAP, summarize=TRUE)
se
## Data from the rowRanges, colData and metadata slots in the
## GenomicFiles are transferred to the SummarizedExperiment.
colData(se)
## Results are in the assays slot named 'data'.
assays(se)
}
#> List of length 1
#> names(1): data