trak2 <- "66008"
How to get the exon and intron sequences of a given gene
2025-06-19
Source:vignettes/how-to-get-exon-intron-sequence-for-gene.qmd
This HowTo has been adapted from the list of HowTos provided in the vignette for the GenomicRanges Bioconductor package.
Bioconductor packages used in this document
How to get the exon and intron sequences of a given gene
The exon and intron sequences of a gene are essentially the DNA sequences of the introns and exons of all known transcripts of the gene. The first task, therefore, is to identify all transcripts associated with the gene of interest. Our example gene is the human TRAK2
, which is involved in regulation of endosome-to-lysosome trafficking of membrane cargo. The Entrez gene id is ‘66008’.
The TxDb.Hsapiens.UCSC.hg38.knownGene data package contains the gene models corresponding to the UCSC ‘Known Genes’ track.
suppressPackageStartupMessages({
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
})
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
The transcript ranges for all the genes in the gene model can be extracted with the transcriptsBy
function from the GenomicFeatures package. They will be returned in a named GRangesList
object containing all the transcripts grouped by gene. In order to keep only the transcripts of the TRAK2
gene we will subset the GRangesList
object using the [[
operator.
suppressPackageStartupMessages({
library(GenomicFeatures)
})
(trak2_txs <- transcriptsBy(txdb, by = "gene")[[trak2]])
#> GRanges object with 5 ranges and 2 metadata columns:
#> seqnames ranges strand | tx_id tx_name
#> <Rle> <IRanges> <Rle> | <integer> <character>
#> [1] chr2 201377207-201451458 - | 61820 ENST00000332624.8
#> [2] chr2 201395128-201451500 - | 61821 ENST00000430254.1
#> [3] chr2 201397606-201399510 - | 61822 ENST00000486291.1
#> [4] chr2 201419197-201451453 - | 61824 ENST00000451703.5
#> [5] chr2 201420417-201433554 - | 61825 ENST00000440597.1
#> -------
#> seqinfo: 711 sequences (1 circular) from hg38 genome
trak2_txs
is a GRanges
object with one range per transcript in the TRAK2
gene. The transcript names are stored in the tx_name
metadata column. We will need them to subset the extracted intron and exon regions:
(trak2_tx_names <- mcols(trak2_txs)$tx_name)
#> [1] "ENST00000332624.8" "ENST00000430254.1" "ENST00000486291.1"
#> [4] "ENST00000451703.5" "ENST00000440597.1"
The exon and intron genomic ranges for all the transcripts in the gene model can be extracted with the exonsBy
and intronsByTranscript
functions, respectively. Both functions return a GRangesList
object. Then we keep only the exon and intron for the transcripts of the TRAK2
gene by subsetting each GRangesList
object by the TRAK2
transcript names.
Extract the exon regions:
trak2_exbytx <- exonsBy(txdb, "tx", use.names = TRUE)[trak2_tx_names]
elementNROWS(trak2_exbytx)
#> ENST00000332624.8 ENST00000430254.1 ENST00000486291.1 ENST00000451703.5
#> 16 8 2 3
#> ENST00000440597.1
#> 2
… and the intron regions:
trak2_inbytx <- intronsByTranscript(txdb, use.names = TRUE)[trak2_tx_names]
elementNROWS(trak2_inbytx)
#> ENST00000332624.8 ENST00000430254.1 ENST00000486291.1 ENST00000451703.5
#> 15 7 1 2
#> ENST00000440597.1
#> 1
Next we want the DNA sequences for these exons and introns. The getSeq
function from the Biostrings package can be used to query a BSgenome
object with a set of genomic ranges and retrieve the corresponding DNA sequences.
suppressPackageStartupMessages({
library(BSgenome.Hsapiens.UCSC.hg38)
})
Extract the exon sequences:
(trak2_ex_seqs <- getSeq(Hsapiens, trak2_exbytx))
#> DNAStringSetList of length 5
#> [["ENST00000332624.8"]] GGCAGTGGGAGCAGCGGCAGCAGCTTCGGCTGCTGCTTTCAGGCTGCCGCTGC...
#> [["ENST00000430254.1"]] GTGAAGTCGCCCGCCTGTCCCTGCCACGCCCGGGCGGTTGCTGGCAGTGGGAG...
#> [["ENST00000486291.1"]] ATTTGGTATTTTAACAGAGGGATCGTGATCTGGAACTCGCTGCTCGAATTGGA...
#> [["ENST00000451703.5"]] TGGGAGCAGCGGCAGCAGCTTCGGCTGCTGCTTTCAGGCTGCCGCTGCATTAG...
#> [["ENST00000440597.1"]] GGAAATACTGACCCCATAAACCTAAATCTCACACACGGTTTTGGACGTTTAAA...
… and the intron sequences:
(trak2_in_seqs <- getSeq(Hsapiens, trak2_inbytx))
#> DNAStringSetList of length 5
#> [["ENST00000332624.8"]] GTAAGAGTGCCTGGGAAATCTGGGGCCTCACTTCTTTCCTCAGCTATATTTTC...
#> [["ENST00000430254.1"]] GTGAGTATTAACATATTCTCTTTTGTACCTTTTTGGACAATTCTTTGGTAGGG...
#> [["ENST00000486291.1"]] GTAAGCCTTTGATCAAATGTCTGCAGTATGAAATAATTAGGTTTTATGCAATT...
#> [["ENST00000451703.5"]] GTAAGTCCAGTTTAATAAATATTGAAGTGCTATCGCTTATAGGAGAAAATGCA...
#> [["ENST00000440597.1"]] GTAAGTAGAGCGACCTCCATGTATTTCATATTCTGACACCCTACTTATCTTTA...
Each list element is a DNAStringSet
object, e.g.:
trak2_in_seqs[[1]]
#> DNAStringSet object of length 15:
#> width seq
#> [1] 2892 GTAAGAGTGCCTGGGAAATCTGGGGCCTCACTT...CACTGTCTCCCACTTTTTTTTTTTTTTTTTAAG
#> [2] 2001 GTGAGAAGAGTGTCTGGTTGAATATGGTATAAT...ATCTTGTATTTGCTCCCTAAAAATCTATTTCAG
#> [3] 1218 GTAATAAATCAGTAAGGGCCCTTACTAAGCCAT...TGCCTTTCCCCTTCCTTTGTTTTGCATATTCAG
#> [4] 1298 GTAGGAATCCGATGATAACAGTGATTCTGCATT...AAGATCTAATCAGTGAACGCGATTTTCCTATAG
#> [5] 297 GTAAGTACAGGTCACATGTGTTGTTACTCAATC...AGTGATAATAACGAACACTTCCTTTTTTGACAG
#> ... ... ...
#> [11] 1022 GTAAGCCTTTGATCAAATGTCTGCAGTATGAAA...AGAACATGAAAATCAAGCATTTTATATGGACAG
#> [12] 1524 GTAGGAATATCTTTTCTTTCTCCAGTACAAAAA...CTCAAAGAAAAGGTGTATTTGGTATTTTAACAG
#> [13] 6308 GTGAGTATTTTTTTTACTCTTTTAGTTTGTGCA...AAGCCTATAAATAGTTGTTTTTAACTATATTAG
#> [14] 12819 GTAAGTCCAGTTTAATAAATATTGAAGTGCTAT...ATTGGATTCATTTACATAGACTCTCCTCTTTAG
#> [15] 30643 GTGAGTAAGCTGTCCGCGCAGAACCCGAACTTG...CTGAGTTCTAGTCACTTGATGTTTTTGTTTTAG
Each sequence here corresponds to a single intron.
Further reading
- The GenomicFeatures vignette.
Session info
Click to display session info
sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-apple-darwin20
#> Running under: macOS Ventura 13.7.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: UTC
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] BSgenome.Hsapiens.UCSC.hg38_1.4.5
#> [2] BSgenome_1.77.0
#> [3] rtracklayer_1.69.0
#> [4] BiocIO_1.19.0
#> [5] Biostrings_2.77.1
#> [6] XVector_0.49.0
#> [7] TxDb.Hsapiens.UCSC.hg38.knownGene_3.21.0
#> [8] GenomicFeatures_1.61.3
#> [9] AnnotationDbi_1.71.0
#> [10] Biobase_2.69.0
#> [11] GenomicRanges_1.61.0
#> [12] GenomeInfoDb_1.45.3
#> [13] IRanges_2.43.0
#> [14] S4Vectors_0.47.0
#> [15] BiocGenerics_0.55.0
#> [16] generics_0.1.4
#> [17] BiocStyle_2.37.0
#>
#> loaded via a namespace (and not attached):
#> [1] SparseArray_1.9.0 bitops_1.0-9
#> [3] lattice_0.22-7 RSQLite_2.3.11
#> [5] digest_0.6.37 grid_4.5.1
#> [7] evaluate_1.0.3 fastmap_1.2.0
#> [9] blob_1.2.4 Matrix_1.7-3
#> [11] jsonlite_2.0.0 restfulr_0.0.15
#> [13] DBI_1.2.3 BiocManager_1.30.25
#> [15] httr_1.4.7 UCSC.utils_1.5.0
#> [17] XML_3.99-0.18 codetools_0.2-20
#> [19] abind_1.4-8 cli_3.6.5
#> [21] rlang_1.1.6 crayon_1.5.3
#> [23] bit64_4.6.0-1 DelayedArray_0.35.1
#> [25] cachem_1.1.0 yaml_2.3.10
#> [27] S4Arrays_1.9.0 tools_4.5.1
#> [29] parallel_4.5.1 BiocParallel_1.43.2
#> [31] memoise_2.0.1 Rsamtools_2.25.0
#> [33] SummarizedExperiment_1.39.0 curl_6.2.3
#> [35] vctrs_0.6.5 R6_2.6.1
#> [37] png_0.1-8 matrixStats_1.5.0
#> [39] KEGGREST_1.49.0 bit_4.6.0
#> [41] pkgconfig_2.0.3 xfun_0.52
#> [43] GenomicAlignments_1.45.0 MatrixGenerics_1.21.0
#> [45] knitr_1.50 rjson_0.2.23
#> [47] htmltools_0.5.8.1 rmarkdown_2.29
#> [49] compiler_4.5.1 RCurl_1.98-1.17