suppressPackageStartupMessages({
library(pasillaBamSubset)
})
# Path to a bam file with single-end reads
(un1 <- untreated1_chr4())
#> [1] "/Users/runner/work/_temp/Library/pasillaBamSubset/extdata/untreated1_chr4.bam"
How to read single-end reads from a BAM file
2025-02-17
This HowTo has been adapted from the list of HowTos provided in the vignette for the GenomicRanges Bioconductor package.
Bioconductor packages used in this document
How to read single-end reads from a BAM file
For illustration, we use data from the pasillaBamSubset data package.
Several functions are available for reading BAM files into R:
GenomicAlignments::readGAlignments()
GenomicAlignments::readGAlignmentPairs()
GenomicAlignments::readGAlignmentsList()
Rsamtools::scanBam()
scanBam
is a low-level function that returns a list of lists and is not discussed further here. See ?scanBam
in the Rsamtools package for more information. Single-end reads can be read with the readGAlignments
function from the GenomicAlignments package.
suppressPackageStartupMessages({
library(GenomicAlignments)
})
gal <- readGAlignments(un1)
class(gal)
#> [1] "GAlignments"
#> attr(,"package")
#> [1] "GenomicAlignments"
gal
#> GAlignments object with 204355 alignments and 0 metadata columns:
#> seqnames strand cigar qwidth start end width
#> <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
#> [1] chr4 - 75M 75 892 966 75
#> [2] chr4 - 75M 75 919 993 75
#> [3] chr4 + 75M 75 924 998 75
#> [4] chr4 + 75M 75 936 1010 75
#> [5] chr4 + 75M 75 949 1023 75
#> ... ... ... ... ... ... ... ...
#> [204351] chr4 + 75M 75 1348268 1348342 75
#> [204352] chr4 + 75M 75 1348268 1348342 75
#> [204353] chr4 + 75M 75 1348268 1348342 75
#> [204354] chr4 - 75M 75 1348449 1348523 75
#> [204355] chr4 - 75M 75 1350124 1350198 75
#> njunc
#> <integer>
#> [1] 0
#> [2] 0
#> [3] 0
#> [4] 0
#> [5] 0
#> ... ...
#> [204351] 0
#> [204352] 0
#> [204353] 0
#> [204354] 0
#> [204355] 0
#> -------
#> seqinfo: 8 sequences from an unspecified genome
Data subsets can be specified by genomic position, field names, or flag criteria using Rsamtools::ScanBamParam
. Here, as an example we import records that overlap position 1 to 5000 on the negative strand of chromosome 4, with flag
and cigar
as metadata columns.
suppressPackageStartupMessages({
library(Rsamtools)
})
what <- c("flag", "cigar")
which <- GRanges("chr4", IRanges(1, 5000))
flag <- scanBamFlag(isMinusStrand = TRUE)
param <- ScanBamParam(which = which, what = what, flag = flag)
neg <- readGAlignments(un1, param = param)
neg
#> GAlignments object with 37 alignments and 2 metadata columns:
#> seqnames strand cigar qwidth start end width
#> <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
#> [1] chr4 - 75M 75 892 966 75
#> [2] chr4 - 75M 75 919 993 75
#> [3] chr4 - 75M 75 967 1041 75
#> [4] chr4 - 75M 75 1035 1109 75
#> [5] chr4 - 75M 75 1236 1310 75
#> ... ... ... ... ... ... ... ...
#> [33] chr4 - 75M 75 4995 5069 75
#> [34] chr4 - 75M 75 4997 5071 75
#> [35] chr4 - 75M 75 4997 5071 75
#> [36] chr4 - 75M 75 4998 5072 75
#> [37] chr4 - 75M 75 4999 5073 75
#> njunc | flag cigar
#> <integer> | <integer> <character>
#> [1] 0 | 16 75M
#> [2] 0 | 16 75M
#> [3] 0 | 16 75M
#> [4] 0 | 16 75M
#> [5] 0 | 16 75M
#> ... ... . ... ...
#> [33] 0 | 16 75M
#> [34] 0 | 16 75M
#> [35] 0 | 16 75M
#> [36] 0 | 16 75M
#> [37] 0 | 16 75M
#> -------
#> seqinfo: 8 sequences from an unspecified genome
Another approach to subsetting the data is to use Rsamtools::filterBam
. This function creates a new BAM file of records passing user-defined criteria. See ?filterBam
in the Rsamtools package for more information.
Session info
Click to display session info
sessionInfo()
#> R Under development (unstable) (2025-02-15 r87725)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sonoma 14.7.2
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: UTC
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] GenomicAlignments_1.43.0 Rsamtools_2.23.1
#> [3] Biostrings_2.75.3 XVector_0.47.2
#> [5] SummarizedExperiment_1.37.0 Biobase_2.67.0
#> [7] MatrixGenerics_1.19.1 matrixStats_1.5.0
#> [9] GenomicRanges_1.59.1 GenomeInfoDb_1.43.4
#> [11] IRanges_2.41.3 S4Vectors_0.45.4
#> [13] BiocGenerics_0.53.6 generics_0.1.3
#> [15] pasillaBamSubset_0.45.0 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] Matrix_1.7-2 jsonlite_1.8.9 compiler_4.5.0
#> [4] BiocManager_1.30.25 crayon_1.5.3 bitops_1.0-9
#> [7] parallel_4.5.0 BiocParallel_1.41.0 yaml_2.3.10
#> [10] fastmap_1.2.0 lattice_0.22-6 R6_2.6.1
#> [13] S4Arrays_1.7.3 knitr_1.49 DelayedArray_0.33.6
#> [16] GenomeInfoDbData_1.2.13 rlang_1.1.5 xfun_0.50
#> [19] SparseArray_1.7.5 cli_3.6.4 digest_0.6.37
#> [22] grid_4.5.0 evaluate_1.0.3 codetools_0.2-20
#> [25] abind_1.4-8 rmarkdown_2.29 httr_1.4.7
#> [28] tools_4.5.0 htmltools_0.5.8.1 UCSC.utils_1.3.1