Parallel computations by ranges

Computations are distributed in parallel by range. Data subsets are extracted and manipulated (MAP) and optionally combined (REDUCE) across all files.

Usage

# S4 method for class 'GRanges,ANY'
reduceByRange(ranges, files, MAP,
    REDUCE, ..., summarize=FALSE, iterate=TRUE, init)
# S4 method for class 'GRangesList,ANY'
reduceByRange(ranges, files, MAP,
    REDUCE, ..., summarize=FALSE, iterate=TRUE, init)
# S4 method for class 'GenomicFiles,missing'
reduceByRange(ranges, files, MAP,
    REDUCE, ..., summarize=FALSE, iterate=TRUE, init)

reduceRanges(ranges, files, MAP, REDUCE, ..., init)

Arguments

ranges

A GRanges, GrangesList or GenomicFiles object.

A GRangesList implies a grouping of the ranges; MAP is applied to each element of the GRangesList vs each range when ranges is a GRanges.

When ranges is a GenomicFiles the files argument is missing; both ranges and files are extracted from the object.

files

A character vector or List of filenames. A List implies a grouping of the files; MAP is applied to each element of the List vs each file individually.

MAP

A function executed on each worker. The signature must contain a minimum of two arguments representing the ranges and files. There is no restriction on argument names and additional arguments can be provided.

MAP = function(range, file, ...)

REDUCE

An optional function that combines output from the MAP step applied across all files. The signature must contain at least one argument representing the list output from MAP. There is no restriction on argument names and additional arguments can be provided.

REDUCE = function(mapped, ...)

Reduction combines data from a single worker and is always performed as part of the distributed step. When iterate=TRUE REDUCE is applied after each MAP step; depending on the nature of REDUCE, iterative reduction can substantially decrease the data stored in memory. When iterate=FALSE reduction is applied to the list of MAP outputs for a single range, applied to all files.

When REDUCE is missing, output is a list from MAP.

iterate

A logical that, when TRUE, indicates that the REDUCE function should be applied iteratively to the output of MAP. When REDUCE is missing iterate is set to FALSE. This argument applies to reduceByRange only.

Collapsing results iteratively is useful when the number of records to be processed is large (maybe complete files) but the end result is a much reduced representation of all records. Iteratively applying REDUCE reduces the amount of data on each worker at any one time and can substantially reduce the memory footprint.

summarize

A logical indicating if results should be returned as a SummarizedExperiment object instead of a list; data are returned in the assays slot named `data`. This argument applies to reduceByRange only.

When REDUCE is provided summarize is ignored (i.e., set to FALSE). A SummarizedExperiment requires the number of rows in colData and the columns in assays to match. Because REDUCE collapses the data across files, the dimension of the result no longer matches that of the original ranges.

init

An optional initial value for REDUCE when iterate=TRUE. init must be an object of the same type as the elements returned from MAP. REDUCE logically adds init to the start (when proceeding left to right) or end of results obtained with MAP.

...

Arguments passed to other methods. Currently not used.

Details

reduceByRange extracts, manipulates and combines ranges across different files. Each element of ranges is sent to a worker; this is a single range when ranges is a GRanges and may be multiple ranges when ranges is a GRangesList. The worker then iterates across all files, applying MAP(range, file, ...) to each. When iterate=FALSE, REDUCE is applied to the list of results from MAP applied to all files. When iterate = TRUE, the argument to REDUCE is always a list of length 2. REDUCE is first invoked after the second file has been processed. The first element of the list to REDUCE is the result of calling MAP on the first file; the second element is the result of calling MAP on the second file. For the nth file, the first element is the result of the call to REDUCE for the n-1th file, and the second element is the result of calling MAP on the nth file.

reduceRanges is essentially equivalent to reduceByRange, but with iterate = FALSE.

Both MAP and REDUCE are applied in the distributed step (“on the worker“). REDUCE provides a way to summarize results for a single range across all files; REDUCE does not provide a mechanism to summarize results across ranges.

Value

reduceByRange: When summarize=FALSE the return value is a list or the value from the final invocation of REDUCE. When summarize=TRUE output is a SummarizedExperiment. When ranges is a GenomicFiles object data from rowRanges, colData and metadata are transferred to the SummarizedExperiment.
reduceRanges: A list or the value returned by the final invocation of REDUCE.

Author

Martin Morgan and Valerie Obenchain

Examples


if (all(requireNamespace("RNAseqData.HNRNPC.bam.chr14", quietly=TRUE) &&
        require(GenomicAlignments))) {
  ## -----------------------------------------------------------------------
  ## Compute coverage across BAM files.
  ## -----------------------------------------------------------------------
  fls <-                                ## 8 bam files
      RNAseqData.HNRNPC.bam.chr14::RNAseqData.HNRNPC.bam.chr14_BAMFILES

  ## Regions of interest.
  gr <- GRanges("chr14", IRanges(c(62262735, 63121531, 63980327),
                width=214700))

  ## The MAP computes the coverage ...
  MAP <- function(range, file, ...) {
      requireNamespace("GenomicFiles", quietly=TRUE)
      ## for coverage(), Rsamtools::ScanBamParam()
      param = Rsamtools::ScanBamParam(which=range)
      GenomicFiles::coverage(file, param=param)[range]
  }
  ## and REDUCE adds the last and current results.
  REDUCE <- function(mapped, ...)
      Reduce("+", mapped)

  ## -----------------------------------------------------------------------
  ## reduceByRange:

  ## With no REDUCE, coverage is computed for each range / file combination.
  cov1 <- reduceByRange(gr, fls, MAP)
  cov1[[1]]

  ## Each call to coverage() produces an RleList which accumulate on the
  ## workers. We can use a reducer to combine these lists either iteratively
  ## or non-iteratively. When iterate = TRUE the current result
  ## is collapsed with the last resulting in a maximum of 2 RleLists on
  ## a worker at any given time.
  cov2 <- reduceByRange(gr, fls, MAP, REDUCE, iterate=TRUE)
  cov2[[1]]

  ## If memory use is not a concern (or if MAP output is not large) the
  ## REDUCE function can be applied non-iteratively.
  cov3 <- reduceByRange(gr, fls, MAP, REDUCE, iterate=FALSE)

  ## Results match those obtained with the iterative REDUCE.
  cov3[[1]]

  ## When 'ranges' is a GRangesList, the list elements are sent to the
  ## workers instead of a single range as in the case of a GRanges.
  grl <- GRangesList(gr[1], gr[2:3])
  grl

  cov4 <- reduceByRange(grl, fls, MAP)
  length(cov4)          ## length of GRangesList
  elementNROWS(cov4)  ## number of files

  ## -----------------------------------------------------------------------
  ## reduceRanges:

  ## This function passes the character vector of all file names to MAP.
  ## MAP must handle each file separately or invoke a method that operates
  ## on a list of files.

  ## TODO: example
}
#> Loading required package: GenomicAlignments
#> [1] 8 8