Iterate through a BAM (or other) file, reducing output to a single result.
reduceByYield.Rd
Rsamtools files can be created with a ‘yieldSize’ argument that
influences the number of records (chunk size) input at one time (see,
e.g,. BamFile
). reduceByYield
iterates
through the file, processing each chunk and reducing it with previously
input chunks. This is a memory efficient way to process large data files,
especially when the final result fits in memory.
Arguments
- X
A
BamFile
instance (or other class for whichisOpen
,open
,close
methods are defined, and which support extraction of sequential chunks).- YIELD
A function name or user-supplied function that operates on
X
to produce aVALUE
that is passed toDONE
andMAP
. GenerallyYIELD
will be a data extractor such asreadGAlignments
,scanBam
,yield
, etc. andVALUE
is a chunk of data.YIELD(X)
- MAP
A function of one or more arguments that operates on the chunk of data from
YIELD
.MAP(VALUE, ...)
- REDUCE
A function of one (
iterate=FALSE
or two (iterate=TRUE
) arguments, returning the reduction (e.g., sum, mean, concatenate) of the arguments.REDUCE(mapped, ...) ## iterate=FALSE
REDUCE(x, y, ...) ## iterate=TRUE
- DONE
A function of one argument, the
VALUE
output of the most recent call toYIELD(X, ...)
. If missing,DONE
isfunction(VALUE) length(VALUE) == 0
.- ...
Additional arguments, passed to
MAP
.- iterate
logical(1) determines whether the call to
REDUCE
is iterative (iterate=TRUE
) or cumulative (iterate=FALSE
).- parallel
logical(1) determines if the
MAP
step is run in parallel.bpiterate
is used under the hood and is currently supported for Unix/Mac only. For Windows machines,parallel
is ignored.- init
(Optional) Initial value used for
REDUCE
wheniterate=TRUE
.- sampleSize
Initial value used for
REDUCEsampler
.- verbose
logical(1) determines if total records sampled are reported at each iteration. Applicable to
REDUCEsampler
only.
Details
reduceByYield
:When
iterate=TRUE
,REDUCE
requires 2 arguments and is invoked withinit
and the output from the first call toMAP
. Ifinit
is missing, it operates on the first two outputs fromMAP
.When
iterate=FALSE
,REDUCE
requires 1 argument and is is invoked with a list containing a list containing all results fromMAP
.REDUCEsampler
:REDUCEsampler
creates a function that can be used as theREDUCE
argument toreduceByYield
.Invoking
REDUCEsampler
withsampleSize
returns a function (call itmyfun
) that takes two arguments,x
andy
. As with any iterativeREDUCE
function,x
represents records that have been yield'ed andy
is the new chunk of records.myfun
samples records from consecutive chunks returned by theYIELD
function. (Re)sampling takes into consideration the total number of records yield'ed, thesampleSize
, and the size of the new chunk.
Value
The value returned by the final invocation of REDUCE
, or init
if provided and no data were yield'ed, or list()
if init
is
missing and no data were yield'ed.
Examples
if (all(require(RNAseqData.HNRNPC.bam.chr14) &&
require(GenomicAlignments))) {
## -----------------------------------------------------------------------
## Nucleotide frequency of mapped reads
## -----------------------------------------------------------------------
## In this example nucleotide frequency of mapped reads is computed
## for a single file. The MAP step is run in parallel and REDUCE
## is iterative.
## Create a BamFile and set a 'yieldSize'.
fl <- system.file(package="Rsamtools", "extdata", "ex1.bam")
bf <- BamFile(fl, yieldSize=500)
## Define 'YIELD', 'MAP' and 'REDUCE' functions.
YIELD <- function(X, ...) {
flag = scanBamFlag(isUnmappedQuery=FALSE)
param = ScanBamParam(flag=flag, what="seq")
scanBam(X, param=param, ...)[[1]][['seq']]
}
MAP <- function(value, ...) {
requireNamespace("Biostrings", quietly=TRUE) ## for alphabetFrequency()
Biostrings::alphabetFrequency(value, collapse=TRUE)
}
REDUCE <- `+` # add successive alphabetFrequency matrices
## 'parallel=TRUE' runs the MAP step in parallel and is currently
## implemented for Unix/Mac only.
register(MulticoreParam(3))
reduceByYield(bf, YIELD, MAP, REDUCE, parallel=TRUE)
## -----------------------------------------------------------------------
## Coverage
## -----------------------------------------------------------------------
## If sufficient resources are available coverage can be computed
## across several large BAM files by combining reduceByYield() with
## bplapply().
## Create a BamFileList with a few sample files and a Snow cluster
## with the same number of workers as files.
bfl <- BamFileList(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1:3])
bpparam <- SnowParam(length(bfl))
## 'FUN' is run on each worker. Because these are Snow workers each
## variable used in 'FUN' must be explicitly passed. (This is not the case
## when using Multicore.)
FUN <- function(bf, YIELD, MAP, REDUCE, parallel, ...) {
requireNamespace("GenomicFiles", quietly=TRUE) ## for reduceByYield()
GenomicFiles::reduceByYield(bf, YIELD, MAP, REDUCE, parallel=parallel)
}
## Passing parallel=FALSE to reduceByYield() runs the MAP step in serial on
## each worker. In this example, parallel dispatch is at the file-level
## only (bplapply()).
YIELD <- `readGAlignments`
MAP <- function(value, ...) {
requireNamespace("GenomicAlignments", quietly=TRUE)
GenomicAlignments::coverage(value)[["chr14"]]
}
bplapply(bfl, FUN, YIELD=YIELD, MAP=MAP, REDUCE=`+`,
parallel=FALSE, BPPARAM = bpparam)
## -----------------------------------------------------------------------
## Sample records from a Bam file
## -----------------------------------------------------------------------
fl <- system.file(package="Rsamtools", "extdata", "ex1.bam")
bf <- BamFile(fl, yieldSize=1000)
yield <- function(x)
readGAlignments(x, param=ScanBamParam(what=c( "qwidth", "mapq" )))
map <- identity
## Samples records from successive chunks of aligned reads.
reduceByYield(bf, yield, map, REDUCEsampler(1000, TRUE))
}
#> Error: BiocParallel errors
#> 3 remote errors, element index: 1, 2, 3
#> 0 unevaluated and other errors
#> first remote error:
#> Error in loadNamespace(x): there is no package called ‘GenomicFiles’