How to compute read coverage

This HowTo has been adapted from the list of HowTos provided in the vignette for the GenomicRanges Bioconductor package.

Bioconductor packages used in this document

pasillaBamSubset
GenomicAlignments

By “read coverage”, we mean the number of reads that cover a given genomic position. Computing the read coverage generally consists in computing the coverage at each position in the genome. This can be done with the coverage function from GenomicAlignments. For illustration, we use data from the pasillaBamSubset data package.

suppressPackageStartupMessages({
    library(pasillaBamSubset)
})
# Path to a bam file with single-end reads
(un1 <- untreated1_chr4())
#> [1] "/Users/runner/work/_temp/Library/pasillaBamSubset/extdata/untreated1_chr4.bam"

Single-end reads can be read with the readGAlignments function from GenomicAlignments.

suppressPackageStartupMessages({
    library(GenomicAlignments)
})
(reads <- readGAlignments(un1))
#> GAlignments object with 204355 alignments and 0 metadata columns:
#>            seqnames strand       cigar    qwidth     start       end     width
#>               <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
#>        [1]     chr4      -         75M        75       892       966        75
#>        [2]     chr4      -         75M        75       919       993        75
#>        [3]     chr4      +         75M        75       924       998        75
#>        [4]     chr4      +         75M        75       936      1010        75
#>        [5]     chr4      +         75M        75       949      1023        75
#>        ...      ...    ...         ...       ...       ...       ...       ...
#>   [204351]     chr4      +         75M        75   1348268   1348342        75
#>   [204352]     chr4      +         75M        75   1348268   1348342        75
#>   [204353]     chr4      +         75M        75   1348268   1348342        75
#>   [204354]     chr4      -         75M        75   1348449   1348523        75
#>   [204355]     chr4      -         75M        75   1350124   1350198        75
#>                njunc
#>            <integer>
#>        [1]         0
#>        [2]         0
#>        [3]         0
#>        [4]         0
#>        [5]         0
#>        ...       ...
#>   [204351]         0
#>   [204352]         0
#>   [204353]         0
#>   [204354]         0
#>   [204355]         0
#>   -------
#>   seqinfo: 8 sequences from an unspecified genome

The coverage can then be calculated using the coverage function from GenomicAlignments:

(cvg <- coverage(reads))
#> RleList of length 8
#> $chr2L
#> integer-Rle of length 23011544 with 1 run
#>   Lengths: 23011544
#>   Values :        0
#> 
#> $chr2R
#> integer-Rle of length 21146708 with 1 run
#>   Lengths: 21146708
#>   Values :        0
#> 
#> $chr3L
#> integer-Rle of length 24543557 with 1 run
#>   Lengths: 24543557
#>   Values :        0
#> 
#> $chr3R
#> integer-Rle of length 27905053 with 1 run
#>   Lengths: 27905053
#>   Values :        0
#> 
#> $chr4
#> integer-Rle of length 1351857 with 122061 runs
#>   Lengths:  891   27    5   12   13   45    5 ...    3  106   75 1600   75 1659
#>   Values :    0    1    2    3    4    5    4 ...    6    0    1    0    1    0
#> 
#> ...
#> <3 more elements>

We can extract the coverage on a specific chromosome:

cvg$chr4
#> integer-Rle of length 1351857 with 122061 runs
#>   Lengths:  891   27    5   12   13   45    5 ...    3  106   75 1600   75 1659
#>   Values :    0    1    2    3    4    5    4 ...    6    0    1    0    1    0

We can also perform calculations on the returned object, e.g. to get the average and maximum coverage on chromosome 4:

mean(cvg$chr4)
#> [1] 11.33746
max(cvg$chr4)
#> [1] 5627

We can use the slice function to find the genomic regions where the coverage is greater or equal to a given threshold.

(chr4_highcov <- slice(cvg$chr4, lower = 500))
#> Views on a 1351857-length Rle subject
#> 
#> views:
#>        start     end width
#>  [1]   86849   87364   516 [ 525  538  554  580  583  585  589  706  942 ...]
#>  [2]   87466   87810   345 [4924 4928 4941 4943 4972 5026 5039 5037 5050 ...]
#>  [3]  340791  340798     8 [508 512 506 530 521 519 518 501]
#>  [4]  340800  340885    86 [500 505 560 560 565 558 564 559 555 556 565 ...]
#>  [5]  348477  348483     7 [503 507 501 524 515 513 512]
#>  [6]  348488  348571    84 [554 554 559 552 558 553 549 550 559 560 565 ...]
#>  [7]  692512  692530    19 [502 507 508 518 520 522 524 526 547 554 591 ...]
#>  [8]  692551  692657   107 [ 530  549  555  635  645  723  725  733  740 ...]
#>  [9]  692798  692800     3 [503 500 503]
#>  ...     ...     ...   ... ...
#> [34] 1054306 1054306     1 [502]
#> [35] 1054349 1054349     1 [501]
#> [36] 1054355 1054444    90 [510 521 525 532 532 539 549 555 557 560 563 ...]
#> [37] 1054448 1054476    29 [502 507 516 517 508 517 525 528 532 520 518 ...]
#> [38] 1054479 1054482     4 [504 503 506 507]
#> [39] 1054509 1054509     1 [500]
#> [40] 1054511 1054511     1 [502]
#> [41] 1054521 1054623   103 [529 521 529 530 524 525 547 540 536 533 529 ...]
#> [42] 1054653 1054717    65 [520 519 516 528 526 585 591 589 584 595 619 ...]

The weight of a given such region can be defined as the number of aligned nucleotides that belong to the region (i.e. the area under the coverage curve). It can be obtained with sum:

sum(chr4_highcov)
#>  [1] 1726347 1300700    4115   52301    3575   51233   10382   95103    1506
#> [10]     500    2051     500    5834   10382   92163     500   88678    1512
#> [19]     500   11518   14514    5915    3598    7821     511     508     503
#> [28]     500    1547    8961   43426   22842     503     502     501   51881
#> [37]   15116    2020     500     502   67010   40496

Note that coverage and slice are generic functions with methods for different types of objects. See ?coverage or ?slice for more information.

Session info

Click to display session info

sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-apple-darwin20
#> Running under: macOS Ventura 13.7.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: UTC
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] GenomicAlignments_1.45.0    Rsamtools_2.25.0           
#>  [3] Biostrings_2.77.1           XVector_0.49.0             
#>  [5] SummarizedExperiment_1.39.0 Biobase_2.69.0             
#>  [7] MatrixGenerics_1.21.0       matrixStats_1.5.0          
#>  [9] GenomicRanges_1.61.0        GenomeInfoDb_1.45.3        
#> [11] IRanges_2.43.0              S4Vectors_0.47.0           
#> [13] BiocGenerics_0.55.0         generics_0.1.4             
#> [15] pasillaBamSubset_0.47.0     BiocStyle_2.37.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] Matrix_1.7-3        jsonlite_2.0.0      compiler_4.5.1     
#>  [4] BiocManager_1.30.25 crayon_1.5.3        bitops_1.0-9       
#>  [7] parallel_4.5.1      BiocParallel_1.43.2 yaml_2.3.10        
#> [10] fastmap_1.2.0       lattice_0.22-7      R6_2.6.1           
#> [13] S4Arrays_1.9.0      knitr_1.50          DelayedArray_0.35.1
#> [16] rlang_1.1.6         xfun_0.52           SparseArray_1.9.0  
#> [19] cli_3.6.5           digest_0.6.37       grid_4.5.1         
#> [22] evaluate_1.0.3      codetools_0.2-20    abind_1.4-8        
#> [25] rmarkdown_2.29      httr_1.4.7          tools_4.5.1        
#> [28] htmltools_0.5.8.1   UCSC.utils_1.5.0

How to compute read coverage

Bioconductor packages used in this document

How to compute read coverage

Further reading

Session info