How to get the exon and intron sequences of a given gene

This HowTo has been adapted from the list of HowTos provided in the vignette for the GenomicRanges Bioconductor package.

Bioconductor packages used in this document

The exon and intron sequences of a gene are essentially the DNA sequences of the introns and exons of all known transcripts of the gene. The first task, therefore, is to identify all transcripts associated with the gene of interest. Our example gene is the human TRAK2, which is involved in regulation of endosome-to-lysosome trafficking of membrane cargo. The Entrez gene id is ‘66008’.

trak2 <- "66008"

The TxDb.Hsapiens.UCSC.hg38.knownGene data package contains the gene models corresponding to the UCSC ‘Known Genes’ track.

suppressPackageStartupMessages({
    library(TxDb.Hsapiens.UCSC.hg38.knownGene)
})
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene

The transcript ranges for all the genes in the gene model can be extracted with the transcriptsBy function from the GenomicFeatures package. They will be returned in a named GRangesList object containing all the transcripts grouped by gene. In order to keep only the transcripts of the TRAK2 gene we will subset the GRangesList object using the [[ operator.

suppressPackageStartupMessages({
    library(GenomicFeatures)
})
(trak2_txs <- transcriptsBy(txdb, by = "gene")[[trak2]])
#> GRanges object with 5 ranges and 2 metadata columns:
#>       seqnames              ranges strand |     tx_id           tx_name
#>          <Rle>           <IRanges>  <Rle> | <integer>       <character>
#>   [1]     chr2 201377207-201451458      - |     61820 ENST00000332624.8
#>   [2]     chr2 201395128-201451500      - |     61821 ENST00000430254.1
#>   [3]     chr2 201397606-201399510      - |     61822 ENST00000486291.1
#>   [4]     chr2 201419197-201451453      - |     61824 ENST00000451703.5
#>   [5]     chr2 201420417-201433554      - |     61825 ENST00000440597.1
#>   -------
#>   seqinfo: 711 sequences (1 circular) from hg38 genome

trak2_txs is a GRanges object with one range per transcript in the TRAK2 gene. The transcript names are stored in the tx_name metadata column. We will need them to subset the extracted intron and exon regions:

(trak2_tx_names <- mcols(trak2_txs)$tx_name)
#> [1] "ENST00000332624.8" "ENST00000430254.1" "ENST00000486291.1"
#> [4] "ENST00000451703.5" "ENST00000440597.1"

The exon and intron genomic ranges for all the transcripts in the gene model can be extracted with the exonsBy and intronsByTranscript functions, respectively. Both functions return a GRangesList object. Then we keep only the exon and intron for the transcripts of the TRAK2 gene by subsetting each GRangesList object by the TRAK2 transcript names.

Extract the exon regions:

trak2_exbytx <- exonsBy(txdb, "tx", use.names = TRUE)[trak2_tx_names]
elementNROWS(trak2_exbytx)
#> ENST00000332624.8 ENST00000430254.1 ENST00000486291.1 ENST00000451703.5 
#>                16                 8                 2                 3 
#> ENST00000440597.1 
#>                 2

… and the intron regions:

trak2_inbytx <- intronsByTranscript(txdb, use.names = TRUE)[trak2_tx_names]
elementNROWS(trak2_inbytx)
#> ENST00000332624.8 ENST00000430254.1 ENST00000486291.1 ENST00000451703.5 
#>                15                 7                 1                 2 
#> ENST00000440597.1 
#>                 1

Next we want the DNA sequences for these exons and introns. The getSeq function from the Biostrings package can be used to query a BSgenome object with a set of genomic ranges and retrieve the corresponding DNA sequences.

suppressPackageStartupMessages({
    library(BSgenome.Hsapiens.UCSC.hg38)
})

Extract the exon sequences:

(trak2_ex_seqs <- getSeq(Hsapiens, trak2_exbytx))
#> DNAStringSetList of length 5
#> [["ENST00000332624.8"]] GGCAGTGGGAGCAGCGGCAGCAGCTTCGGCTGCTGCTTTCAGGCTGCCGCTGC...
#> [["ENST00000430254.1"]] GTGAAGTCGCCCGCCTGTCCCTGCCACGCCCGGGCGGTTGCTGGCAGTGGGAG...
#> [["ENST00000486291.1"]] ATTTGGTATTTTAACAGAGGGATCGTGATCTGGAACTCGCTGCTCGAATTGGA...
#> [["ENST00000451703.5"]] TGGGAGCAGCGGCAGCAGCTTCGGCTGCTGCTTTCAGGCTGCCGCTGCATTAG...
#> [["ENST00000440597.1"]] GGAAATACTGACCCCATAAACCTAAATCTCACACACGGTTTTGGACGTTTAAA...

… and the intron sequences:

(trak2_in_seqs <- getSeq(Hsapiens, trak2_inbytx))
#> DNAStringSetList of length 5
#> [["ENST00000332624.8"]] GTAAGAGTGCCTGGGAAATCTGGGGCCTCACTTCTTTCCTCAGCTATATTTTC...
#> [["ENST00000430254.1"]] GTGAGTATTAACATATTCTCTTTTGTACCTTTTTGGACAATTCTTTGGTAGGG...
#> [["ENST00000486291.1"]] GTAAGCCTTTGATCAAATGTCTGCAGTATGAAATAATTAGGTTTTATGCAATT...
#> [["ENST00000451703.5"]] GTAAGTCCAGTTTAATAAATATTGAAGTGCTATCGCTTATAGGAGAAAATGCA...
#> [["ENST00000440597.1"]] GTAAGTAGAGCGACCTCCATGTATTTCATATTCTGACACCCTACTTATCTTTA...

Each list element is a DNAStringSet object, e.g.:

trak2_in_seqs[[1]]
#> DNAStringSet object of length 15:
#>      width seq
#>  [1]  2892 GTAAGAGTGCCTGGGAAATCTGGGGCCTCACTT...CACTGTCTCCCACTTTTTTTTTTTTTTTTTAAG
#>  [2]  2001 GTGAGAAGAGTGTCTGGTTGAATATGGTATAAT...ATCTTGTATTTGCTCCCTAAAAATCTATTTCAG
#>  [3]  1218 GTAATAAATCAGTAAGGGCCCTTACTAAGCCAT...TGCCTTTCCCCTTCCTTTGTTTTGCATATTCAG
#>  [4]  1298 GTAGGAATCCGATGATAACAGTGATTCTGCATT...AAGATCTAATCAGTGAACGCGATTTTCCTATAG
#>  [5]   297 GTAAGTACAGGTCACATGTGTTGTTACTCAATC...AGTGATAATAACGAACACTTCCTTTTTTGACAG
#>  ...   ... ...
#> [11]  1022 GTAAGCCTTTGATCAAATGTCTGCAGTATGAAA...AGAACATGAAAATCAAGCATTTTATATGGACAG
#> [12]  1524 GTAGGAATATCTTTTCTTTCTCCAGTACAAAAA...CTCAAAGAAAAGGTGTATTTGGTATTTTAACAG
#> [13]  6308 GTGAGTATTTTTTTTACTCTTTTAGTTTGTGCA...AAGCCTATAAATAGTTGTTTTTAACTATATTAG
#> [14] 12819 GTAAGTCCAGTTTAATAAATATTGAAGTGCTAT...ATTGGATTCATTTACATAGACTCTCCTCTTTAG
#> [15] 30643 GTGAGTAAGCTGTCCGCGCAGAACCCGAACTTG...CTGAGTTCTAGTCACTTGATGTTTTTGTTTTAG

Each sequence here corresponds to a single intron.

Session info

Click to display session info

sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-apple-darwin20
#> Running under: macOS Ventura 13.7.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: UTC
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] BSgenome.Hsapiens.UCSC.hg38_1.4.5       
#>  [2] BSgenome_1.77.0                         
#>  [3] rtracklayer_1.69.0                      
#>  [4] BiocIO_1.19.0                           
#>  [5] Biostrings_2.77.1                       
#>  [6] XVector_0.49.0                          
#>  [7] TxDb.Hsapiens.UCSC.hg38.knownGene_3.21.0
#>  [8] GenomicFeatures_1.61.3                  
#>  [9] AnnotationDbi_1.71.0                    
#> [10] Biobase_2.69.0                          
#> [11] GenomicRanges_1.61.0                    
#> [12] GenomeInfoDb_1.45.3                     
#> [13] IRanges_2.43.0                          
#> [14] S4Vectors_0.47.0                        
#> [15] BiocGenerics_0.55.0                     
#> [16] generics_0.1.4                          
#> [17] BiocStyle_2.37.0                        
#> 
#> loaded via a namespace (and not attached):
#>  [1] SparseArray_1.9.0           bitops_1.0-9               
#>  [3] lattice_0.22-7              RSQLite_2.3.11             
#>  [5] digest_0.6.37               grid_4.5.1                 
#>  [7] evaluate_1.0.3              fastmap_1.2.0              
#>  [9] blob_1.2.4                  Matrix_1.7-3               
#> [11] jsonlite_2.0.0              restfulr_0.0.15            
#> [13] DBI_1.2.3                   BiocManager_1.30.25        
#> [15] httr_1.4.7                  UCSC.utils_1.5.0           
#> [17] XML_3.99-0.18               codetools_0.2-20           
#> [19] abind_1.4-8                 cli_3.6.5                  
#> [21] rlang_1.1.6                 crayon_1.5.3               
#> [23] bit64_4.6.0-1               DelayedArray_0.35.1        
#> [25] cachem_1.1.0                yaml_2.3.10                
#> [27] S4Arrays_1.9.0              tools_4.5.1                
#> [29] parallel_4.5.1              BiocParallel_1.43.2        
#> [31] memoise_2.0.1               Rsamtools_2.25.0           
#> [33] SummarizedExperiment_1.39.0 curl_6.2.3                 
#> [35] vctrs_0.6.5                 R6_2.6.1                   
#> [37] png_0.1-8                   matrixStats_1.5.0          
#> [39] KEGGREST_1.49.0             bit_4.6.0                  
#> [41] pkgconfig_2.0.3             xfun_0.52                  
#> [43] GenomicAlignments_1.45.0    MatrixGenerics_1.21.0      
#> [45] knitr_1.50                  rjson_0.2.23               
#> [47] htmltools_0.5.8.1           rmarkdown_2.29             
#> [49] compiler_4.5.1              RCurl_1.98-1.17

How to get the exon and intron sequences of a given gene

Bioconductor packages used in this document

How to get the exon and intron sequences of a given gene

Further reading

Session info