Parallel computations by files

Computations are distributed in parallel by file. Data subsets are extracted and manipulated (MAP) and optionally combined (REDUCE) within a single file.

Usage

# S4 method for class 'GRanges,ANY'
reduceByFile(ranges, files, MAP, 
    REDUCE, ..., summarize=FALSE, iterate=TRUE, init)
# S4 method for class 'GRangesList,ANY'
reduceByFile(ranges, files, MAP, 
    REDUCE, ..., summarize=FALSE, iterate=TRUE, init)
# S4 method for class 'GenomicFiles,missing'
reduceByFile(ranges, files, MAP, 
    REDUCE, ..., summarize=FALSE, iterate=TRUE, init)

reduceFiles(ranges, files, MAP, REDUCE, ..., init)

Arguments

ranges

A GRanges, GrangesList or GenomicFiles object.

A GRangesList implies a grouping of the ranges; MAP is applied to each element of the GRangesList vs each range when ranges is a GRanges.

When ranges is a GenomicFiles the files argument is missing; both ranges and files are extracted from the object.

files

A character vector or List of filenames. A List implies a grouping of the files; MAP is applied to each element of the List vs each file individually.

MAP

A function executed on each worker. The signature must contain a minimum of two arguments representing the ranges and files. There is no restriction on argument names and additional arguments can be provided.

MAP = function(range, file, ...)

REDUCE

An optional function that combines output from the MAP step. The signature must contain at least one argument representing the list output from MAP. There is no restriction on argument names and additional arguments can be provided.

REDUCE = function(mapped, ...)

Reduction combines data from a single worker and is always performed as part of the distributed step. When iterate=TRUE REDUCE is applied after each MAP step; depending on the nature of REDUCE, iterative reduction can substantially decrease the data stored in memory. When iterate=FALSE reduction is applied to the list of MAP output applied to all files / ranges.

When REDUCE is missing, output is a list from MAP.

iterate

A logical indicating if the REDUCE function should be applied iteratively to the output of MAP. When REDUCE is missing iterate is set to FALSE. This argument applies to reduceByFile only (reduceFiles calls MAP a single time on each worker).

Collapsing results iteratively is useful when the number of records to be processed is large (maybe complete files) but the end result is a much reduced representation of all records. Iteratively applying REDUCE reduces the amount of data on each worker at any one time and can substantially reduce the memory footprint.

summarize

A logical indicating if results should be returned as a SummarizedExperiment object instead of a list; data are returned in the assays slot named `data`. This argument applies to reduceByFile only.

When REDUCE is provided summarize is ignored (i.e., set to FALSE). A SummarizedExperiment requires the number of rows in rowRanges and assays to match. Because REDUCE collapses the data across ranges, the dimension of the result no longer matches that of the original ranges.

init

An optional initial value for REDUCE when iterate=TRUE. init must be an object of the same type as the elements returned from MAP. REDUCE logically adds init to the start (when proceeding left to right) or end of results obtained with MAP.

...

Arguments passed to other methods.

Details

reduceByFile extracts, manipulates and combines multiple ranges within a single file. Each file is sent to a worker where MAP is invoked on each file / range combination. This approach allows multiple ranges extracted from a single file to be kept separate or combined with REDUCE.

In contrast, reduceFiles treats the output of all MAP calls as a group and reduces them together. REDUCE usually plays a minor role by concatenating or unlisting results.

Both MAP and REDUCE are applied in the distributed step (“on the worker“). Results are not combined across workers in the distributed step.

Value

reduceByFile: When summarize=FALSE the return value is a list or the value from the final invocation of REDUCE. When summarize=TRUE output is a SummarizedExperiment. When ranges is a GenomicFiles object data from rowRanges, colData and metadata are transferred to the SummarizedExperiment.
reduceFiles: A list or the value returned by the final invocation of REDUCE.

Author

Martin Morgan and Valerie Obenchain

Examples


if (requireNamespace("RNAseqData.HNRNPC.bam.chr14", quietly=TRUE)) {
  ## -----------------------------------------------------------------------
  ## Count junction reads in BAM files
  ## -----------------------------------------------------------------------
  fls <-                                      ## 8 bam files
      RNAseqData.HNRNPC.bam.chr14::RNAseqData.HNRNPC.bam.chr14_BAMFILES
 
  ## Ranges of interest.
  gr <- GRanges("chr14", IRanges(c(19100000, 106000000), width=1e7))
 
  ## MAP outputs a table of junction counts per range.
  MAP <- function(range, file, ...) {
      ## for readGAlignments(), Rsamtools::ScanBamParam()
      requireNamespace("GenomicAlignments", quietly=TRUE)
      param = Rsamtools::ScanBamParam(which=range)
      gal = GenomicAlignments::readGAlignments(file, param=param)
      table(GenomicAlignments::njunc(gal))
  } 

  ## -----------------------------------------------------------------------
  ## reduceByFile:

  ## With no REDUCE, counts are computed for each range / file combination.
  counts1 <- reduceByFile(gr, fls, MAP)
  length(counts1)          ## 8 files
  elementNROWS(counts1)    ## 2 ranges each
 
  ## Tables of counts for each range:
  counts1[[1]]

  ## With a REDUCE, results are combined on the fly. This reducer sums the 
  ## number of records in each range with exactly 1 junction.
  REDUCE <- function(mapped, ...)
      sum(sapply(mapped, "[", "1"))
 
  reduceByFile(gr, fls, MAP, REDUCE)

  ## -----------------------------------------------------------------------
  ## reduceFiles:

  ## All ranges are treated as a single group:
  counts2 <- reduceFiles(gr, fls, MAP)

  ## Counts are for all ranges grouped:
  counts2[[1]]

  ## Contrast the above with that from reduceByFile() where counts 
  ## are for each range separately:
  counts1[[1]]

  ## -----------------------------------------------------------------------
  ## Methods for the GenomicFiles class:
 
  ## Both reduceByFiles() and reduceFiles() can operate on a GenomicFiles
  ## object.
  colData <- DataFrame(method=rep("RNASeq", length(fls)),
                       format=rep("bam", length(fls)))
  gf <- GenomicFiles(files=fls, rowRanges=gr, colData=colData)
  gf
  
  ## Subset on ranges or files for different experimental runs.
  dim(gf)
  gf_sub <- gf[2, 3:4]
  dim(gf_sub)
  
  ## When summarize = TRUE and no REDUCE is given, the output is a 
  ## SummarizedExperiment object.
  se <- reduceByFile(gf, MAP=MAP, summarize=TRUE)
  se
  
  ## Data from the rowRanges, colData and metadata slots in the
  ## GenomicFiles are transferred to the SummarizedExperiment.
  colData(se)
  
  ## Results are in the assays slot named 'data'.
  assays(se) 
}
#> List of length 1
#> names(1): data