How to read a BAM file in chunks
2025-06-19
Source: vignettes/how-to-read-big-bam-file-in-chunks.qmd
This HowTo has been adapted from the list of HowTos provided in the vignette for the GenomicRanges Bioconductor package.
Bioconductor packages used in this document
pasillaBamSubset, Rsamtools, GenomicAlignments
How to read a BAM file in chunks
To reduce memory usage, a large BAM file can be read in chunks by setting a yieldSize on the BamFile object; each read then returns at most that many records. For illustration, we use data from the pasillaBamSubset data package.
suppressPackageStartupMessages({
  library(pasillaBamSubset)
  library(Rsamtools)
})
# Path to a BAM file with single-end reads
(un1 <- untreated1_chr4())
#> [1] "/Users/runner/work/_temp/Library/pasillaBamSubset/extdata/untreated1_chr4.bam"
bf <- BamFile(un1, yieldSize = 100000)
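To make the effect of yieldSize concrete, the following minimal sketch compares a BamFile created without a yieldSize to one created with a small one. The object names (bf_all, bf_small) and the value 1000 are chosen here for illustration only and are not part of the original HowTo.

```r
suppressPackageStartupMessages({
  library(pasillaBamSubset)
  library(Rsamtools)
  library(GenomicAlignments)
})

# Illustrative objects: one without a yieldSize, one with a small yieldSize
bf_all <- BamFile(untreated1_chr4())
bf_small <- BamFile(untreated1_chr4(), yieldSize = 1000)

# The accessor reports the current setting (NA_integer_ when none was given)
yieldSize(bf_all)
yieldSize(bf_small)

# A single read is capped at yieldSize records when one is set
n_all <- length(readGAlignments(bf_all))
n_small <- length(readGAlignments(bf_small))
c(all = n_all, small = n_small)
```

A smaller yieldSize lowers peak memory use at the cost of more read calls, so a suitable value depends on the available memory and on the size of each record.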
Iterating through a BAM file requires opening the file, querying it repeatedly inside a loop, and then closing it. If the file is not opened first, repeated calls to GenomicAlignments::readGAlignments return the same first 100000 records each time (with a yieldSize of 100000). As an example, let’s calculate the coverage for the BAM file above.
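suppressPackageStartupMessages({
  library(GenomicAlignments)
})
# Open the file so that successive reads advance through it
open(bf)
cvg <- NULL
repeat {
  # Read the next chunk of at most yieldSize records
  chunk <- readGAlignments(bf)
  # An empty chunk means the end of the file has been reached
  if (length(chunk) == 0L) {
    break
  }
  # Accumulate the per-chunk coverage into the running total
  chunk_cvg <- coverage(chunk)
  if (is.null(cvg)) {
    cvg <- chunk_cvg
  } else {
    cvg <- cvg + chunk_cvg
  }
}
close(bf)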
cvg
#> RleList of length 8
#> $chr2L
#> integer-Rle of length 23011544 with 1 run
#> Lengths: 23011544
#> Values : 0
#>
#> $chr2R
#> integer-Rle of length 21146708 with 1 run
#> Lengths: 21146708
#> Values : 0
#>
#> $chr3L
#> integer-Rle of length 24543557 with 1 run
#> Lengths: 24543557
#> Values : 0
#>
#> $chr3R
#> integer-Rle of length 27905053 with 1 run
#> Lengths: 27905053
#> Values : 0
#>
#> $chr4
#> integer-Rle of length 1351857 with 122061 runs
#> Lengths: 891 27 5 12 13 45 5 ... 3 106 75 1600 75 1659
#> Values : 0 1 2 3 4 5 4 ... 6 0 1 0 1 0
#>
#> ...
#> <3 more elements>
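The open, loop, close pattern above can be checked and packaged for reuse. The sketch below first contrasts reading from an unopened and an opened BamFile, then wraps the loop in a helper that closes the file even if an error occurs. The helper name chunked_coverage, its yield_size argument, and the chunk size of 50000 are hypothetical and introduced only for this illustration; they are not part of Rsamtools or GenomicAlignments.

```r
suppressPackageStartupMessages({
  library(pasillaBamSubset)
  library(Rsamtools)
  library(GenomicAlignments)
})

path <- untreated1_chr4()
bf <- BamFile(path, yieldSize = 100000)

# Without open(): each call starts again from the first record
closed_1 <- readGAlignments(bf)
closed_2 <- readGAlignments(bf)
same_when_closed <- identical(start(closed_1), start(closed_2))

# With open(): each call continues where the previous one stopped
open(bf)
open_1 <- readGAlignments(bf)
open_2 <- readGAlignments(bf)
close(bf)
same_when_open <- identical(start(open_1), start(open_2))

# Hypothetical helper: chunked coverage with guaranteed cleanup
chunked_coverage <- function(path, yield_size = 100000) {
  bf <- BamFile(path, yieldSize = yield_size)
  open(bf)
  # Close the file when the function exits, including on error
  on.exit(close(bf), add = TRUE)
  cvg <- NULL
  repeat {
    chunk <- readGAlignments(bf)
    if (length(chunk) == 0L) break
    chunk_cvg <- coverage(chunk)
    cvg <- if (is.null(cvg)) chunk_cvg else cvg + chunk_cvg
  }
  cvg
}

# Compare the chunked result with reading the whole file at once
cvg_chunked <- chunked_coverage(path, yield_size = 50000)
cvg_full <- coverage(readGAlignments(path))
matches_full <- all(all(cvg_chunked == cvg_full))

c(same_when_closed = same_when_closed,
  same_when_open = same_when_open,
  matches_full = matches_full)
```

Reading the whole file at once is only practical here because the example file is small; for a genuinely large BAM file, the comparison step would be skipped and only the chunked computation used.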
Session info
sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-apple-darwin20
#> Running under: macOS Ventura 13.7.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: UTC
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] GenomicAlignments_1.45.0 SummarizedExperiment_1.39.0
#> [3] Biobase_2.69.0 MatrixGenerics_1.21.0
#> [5] matrixStats_1.5.0 Rsamtools_2.25.0
#> [7] Biostrings_2.77.1 XVector_0.49.0
#> [9] GenomicRanges_1.61.0 GenomeInfoDb_1.45.3
#> [11] IRanges_2.43.0 S4Vectors_0.47.0
#> [13] BiocGenerics_0.55.0 generics_0.1.4
#> [15] pasillaBamSubset_0.47.0 BiocStyle_2.37.0
#>
#> loaded via a namespace (and not attached):
#> [1] Matrix_1.7-3 jsonlite_2.0.0 compiler_4.5.1
#> [4] BiocManager_1.30.25 crayon_1.5.3 bitops_1.0-9
#> [7] parallel_4.5.1 BiocParallel_1.43.2 yaml_2.3.10
#> [10] fastmap_1.2.0 lattice_0.22-7 R6_2.6.1
#> [13] S4Arrays_1.9.0 knitr_1.50 DelayedArray_0.35.1
#> [16] rlang_1.1.6 xfun_0.52 SparseArray_1.9.0
#> [19] cli_3.6.5 digest_0.6.37 grid_4.5.1
#> [22] evaluate_1.0.3 codetools_0.2-20 abind_1.4-8
#> [25] rmarkdown_2.29 httr_1.4.7 tools_4.5.1
#> [28] htmltools_0.5.8.1 UCSC.utils_1.5.0