How to read single-end reads from a BAM file

This HowTo has been adapted from the list of HowTos provided in the vignette for the GenomicRanges Bioconductor package.

Bioconductor packages used in this document

For illustration, we use data from the pasillaBamSubset data package.

suppressPackageStartupMessages({
    library(pasillaBamSubset)
})
# Path to a bam file with single-end reads
(un1 <- untreated1_chr4())
#> [1] "/Users/runner/work/_temp/Library/pasillaBamSubset/extdata/untreated1_chr4.bam"

Several functions are available for reading BAM files into R:

scanBam is a low-level function that returns a list of lists and is not discussed further here. See ?scanBam in the Rsamtools package for more information. Single-end reads can be read with the readGAlignments function from the GenomicAlignments package.

suppressPackageStartupMessages({
    library(GenomicAlignments)
})
gal <- readGAlignments(un1)
class(gal)
#> [1] "GAlignments"
#> attr(,"package")
#> [1] "GenomicAlignments"
gal
#> GAlignments object with 204355 alignments and 0 metadata columns:
#>            seqnames strand       cigar    qwidth     start       end     width
#>               <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
#>        [1]     chr4      -         75M        75       892       966        75
#>        [2]     chr4      -         75M        75       919       993        75
#>        [3]     chr4      +         75M        75       924       998        75
#>        [4]     chr4      +         75M        75       936      1010        75
#>        [5]     chr4      +         75M        75       949      1023        75
#>        ...      ...    ...         ...       ...       ...       ...       ...
#>   [204351]     chr4      +         75M        75   1348268   1348342        75
#>   [204352]     chr4      +         75M        75   1348268   1348342        75
#>   [204353]     chr4      +         75M        75   1348268   1348342        75
#>   [204354]     chr4      -         75M        75   1348449   1348523        75
#>   [204355]     chr4      -         75M        75   1350124   1350198        75
#>                njunc
#>            <integer>
#>        [1]         0
#>        [2]         0
#>        [3]         0
#>        [4]         0
#>        [5]         0
#>        ...       ...
#>   [204351]         0
#>   [204352]         0
#>   [204353]         0
#>   [204354]         0
#>   [204355]         0
#>   -------
#>   seqinfo: 8 sequences from an unspecified genome

Data subsets can be specified by genomic position, field names, or flag criteria using Rsamtools::ScanBamParam. Here, as an example we import records that overlap position 1 to 5000 on the negative strand of chromosome 4, with flag and cigar as metadata columns.

suppressPackageStartupMessages({
    library(Rsamtools)
})
what <- c("flag", "cigar")
which <- GRanges("chr4", IRanges(1, 5000))
flag <- scanBamFlag(isMinusStrand = TRUE)
param <- ScanBamParam(which = which, what = what, flag = flag)
neg <- readGAlignments(un1, param = param)
neg
#> GAlignments object with 37 alignments and 2 metadata columns:
#>        seqnames strand       cigar    qwidth     start       end     width
#>           <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
#>    [1]     chr4      -         75M        75       892       966        75
#>    [2]     chr4      -         75M        75       919       993        75
#>    [3]     chr4      -         75M        75       967      1041        75
#>    [4]     chr4      -         75M        75      1035      1109        75
#>    [5]     chr4      -         75M        75      1236      1310        75
#>    ...      ...    ...         ...       ...       ...       ...       ...
#>   [33]     chr4      -         75M        75      4995      5069        75
#>   [34]     chr4      -         75M        75      4997      5071        75
#>   [35]     chr4      -         75M        75      4997      5071        75
#>   [36]     chr4      -         75M        75      4998      5072        75
#>   [37]     chr4      -         75M        75      4999      5073        75
#>            njunc |      flag       cigar
#>        <integer> | <integer> <character>
#>    [1]         0 |        16         75M
#>    [2]         0 |        16         75M
#>    [3]         0 |        16         75M
#>    [4]         0 |        16         75M
#>    [5]         0 |        16         75M
#>    ...       ... .       ...         ...
#>   [33]         0 |        16         75M
#>   [34]         0 |        16         75M
#>   [35]         0 |        16         75M
#>   [36]         0 |        16         75M
#>   [37]         0 |        16         75M
#>   -------
#>   seqinfo: 8 sequences from an unspecified genome

Another approach to subsetting the data is to use Rsamtools::filterBam. This function creates a new BAM file of records passing user-defined criteria. See ?filterBam in the Rsamtools package for more information.

Session info

Click to display session info

sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-apple-darwin20
#> Running under: macOS Ventura 13.7.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: UTC
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] GenomicAlignments_1.45.0    Rsamtools_2.25.0           
#>  [3] Biostrings_2.77.1           XVector_0.49.0             
#>  [5] SummarizedExperiment_1.39.0 Biobase_2.69.0             
#>  [7] MatrixGenerics_1.21.0       matrixStats_1.5.0          
#>  [9] GenomicRanges_1.61.0        GenomeInfoDb_1.45.3        
#> [11] IRanges_2.43.0              S4Vectors_0.47.0           
#> [13] BiocGenerics_0.55.0         generics_0.1.4             
#> [15] pasillaBamSubset_0.47.0     BiocStyle_2.37.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] Matrix_1.7-3        jsonlite_2.0.0      compiler_4.5.1     
#>  [4] BiocManager_1.30.25 crayon_1.5.3        bitops_1.0-9       
#>  [7] parallel_4.5.1      BiocParallel_1.43.2 yaml_2.3.10        
#> [10] fastmap_1.2.0       lattice_0.22-7      R6_2.6.1           
#> [13] S4Arrays_1.9.0      knitr_1.50          DelayedArray_0.35.1
#> [16] rlang_1.1.6         xfun_0.52           SparseArray_1.9.0  
#> [19] cli_3.6.5           digest_0.6.37       grid_4.5.1         
#> [22] evaluate_1.0.3      codetools_0.2-20    abind_1.4-8        
#> [25] rmarkdown_2.29      httr_1.4.7          tools_4.5.1        
#> [28] htmltools_0.5.8.1   UCSC.utils_1.5.0