Skip to contents

This HowTo has been adapted from the list of HowTos provided in the vignette for the GenomicRanges Bioconductor package.

Bioconductor packages used in this document

How to retrieve a gene model from AnnotationHub

When a gene model is not available as a GRanges or GRangesList object or as a data package, it may be available on AnnotationHub. In this HOWTO, will look for a gene model for Drosophila melanogaster on AnnotationHub. Create a `hub’ and then filter on Drosophila melanogaster:

suppressPackageStartupMessages({
    library(AnnotationHub)
})
### Internet connection required!
hub <- AnnotationHub()
hub <- subset(hub, hub$species=='Drosophila melanogaster')

There are 471 files that match Drosophila melanogaster. If you look at the metadata in hub, you can see that the 5th record represents a GRanges object from UCSC RefSeq Gene track.

length(hub)
#> [1] 471
head(names(hub))
#> [1] "AH6789" "AH6790" "AH6791" "AH6792" "AH6793" "AH6794"
head(hub$title, n=10)
#>  [1] "Assembly"       "GDP Insertions" "BAC End Pairs"  "FlyBase Genes" 
#>  [5] "RefSeq Genes"   "Ensembl Genes"  "CONTRAST"       "Human Proteins"
#>  [9] "Spliced ESTs"   "Other mRNAs"
## then look at a specific slice of the hub object.
hub[5]
#> AnnotationHub with 1 record
#> # snapshotDate(): 2025-01-21
#> # names(): AH6793
#> # $dataprovider: UCSC
#> # $species: Drosophila melanogaster
#> # $rdataclass: GRanges
#> # $rdatadateadded: 2013-04-04
#> # $title: RefSeq Genes
#> # $description: GRanges object from on UCSC track ‘RefSeq Genes’
#> # $taxonomyid: 7227
#> # $genome: dm3
#> # $sourcetype: UCSC track
#> # $sourceurl: rtracklayer://hgdownload.cse.ucsc.edu/goldenpath/dm3/database/...
#> # $sourcesize: NA
#> # $tags: c("refGene", "UCSC", "track", "Gene", "Transcript",
#> #   "Annotation") 
#> # retrieve record with 'object[["AH6793"]]'

So you can retrieve that dm3 file as a GRanges like this:

gr <- hub[[names(hub)[5]]]
#> loading from cache
#> require("GenomicRanges")
summary(gr)
#> [1] "GRanges object with 30551 ranges and 5 metadata columns"

The metadata fields contain the details of file origin and content.

metadata(gr)
#> $AnnotationHubName
#> [1] "AH6793"
#> 
#> $`File Name`
#> [1] "refGene"
#> 
#> $`Data Source`
#> [1] "rtracklayer://hgdownload.cse.ucsc.edu/goldenpath/dm3/database/refGene"
#> 
#> $Provider
#> [1] "UCSC"
#> 
#> $Organism
#> [1] "Drosophila melanogaster"
#> 
#> $`Taxonomy ID`
#> [1] 7227

Split the GRanges object by gene name to get a GRangesList object of transcript ranges grouped by gene.

txbygn <- split(gr, gr$name)

You can now use txbygn with the summarizeOverlaps function to prepare a table of read counts for RNA-Seq differential gene expression.

Further reading

Note that before passing txbygn to summarizeOverlaps, you should confirm that the seqlevels (chromosome names) in it match those in the BAM file. See ?renameSeqlevels, ?keepSeqlevels and ?seqlevels for examples of renaming seqlevels.

Session info

Click to display session info
sessionInfo()
#> R Under development (unstable) (2025-02-15 r87725)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sonoma 14.7.2
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: UTC
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] GenomicRanges_1.59.1 GenomeInfoDb_1.43.4  IRanges_2.41.3      
#>  [4] S4Vectors_0.45.4     AnnotationHub_3.15.0 BiocFileCache_2.15.1
#>  [7] dbplyr_2.5.0         BiocGenerics_0.53.6  generics_0.1.3      
#> [10] BiocStyle_2.35.0    
#> 
#> loaded via a namespace (and not attached):
#>  [1] rappdirs_0.3.3          BiocVersion_3.21.1      RSQLite_2.3.9          
#>  [4] digest_0.6.37           magrittr_2.0.3          evaluate_1.0.3         
#>  [7] fastmap_1.2.0           blob_1.2.4              jsonlite_1.8.9         
#> [10] AnnotationDbi_1.69.0    DBI_1.2.3               BiocManager_1.30.25    
#> [13] httr_1.4.7              purrr_1.0.4             UCSC.utils_1.3.1       
#> [16] Biostrings_2.75.3       cli_3.6.4               rlang_1.1.5            
#> [19] crayon_1.5.3            XVector_0.47.2          Biobase_2.67.0         
#> [22] bit64_4.6.0-1           withr_3.0.2             cachem_1.1.0           
#> [25] yaml_2.3.10             tools_4.5.0             memoise_2.0.1          
#> [28] dplyr_1.1.4             GenomeInfoDbData_1.2.13 filelock_1.0.3         
#> [31] curl_6.2.0              mime_0.12               vctrs_0.6.5            
#> [34] R6_2.6.1                png_0.1-8               lifecycle_1.0.4        
#> [37] KEGGREST_1.47.0         bit_4.5.0.1             pkgconfig_2.0.3        
#> [40] pillar_1.10.1           glue_1.8.0              xfun_0.50              
#> [43] tibble_3.2.1            tidyselect_1.2.1        knitr_1.49             
#> [46] htmltools_0.5.8.1       rmarkdown_2.29          compiler_4.5.0