suppressPackageStartupMessages({
library(AnnotationHub)
})
### Internet connection required!
hub <- AnnotationHub()
hub <- subset(hub, hub$species=='Drosophila melanogaster')
How to retrieve a gene model from AnnotationHub
2025-02-17
This HowTo has been adapted from the list of HowTos provided in the vignette for the GenomicRanges Bioconductor package.
Bioconductor packages used in this document
How to retrieve a gene model from AnnotationHub
When a gene model is not available as a GRanges
or GRangesList
object or as a data package, it may be available on AnnotationHub. In this HOWTO, will look for a gene model for Drosophila melanogaster on AnnotationHub. Create a `hub’ and then filter on Drosophila melanogaster:
There are 471 files that match Drosophila melanogaster. If you look at the metadata in hub, you can see that the 5th record represents a GRanges object from UCSC RefSeq Gene track.
length(hub)
#> [1] 471
head(names(hub))
#> [1] "AH6789" "AH6790" "AH6791" "AH6792" "AH6793" "AH6794"
head(hub$title, n=10)
#> [1] "Assembly" "GDP Insertions" "BAC End Pairs" "FlyBase Genes"
#> [5] "RefSeq Genes" "Ensembl Genes" "CONTRAST" "Human Proteins"
#> [9] "Spliced ESTs" "Other mRNAs"
## then look at a specific slice of the hub object.
hub[5]
#> AnnotationHub with 1 record
#> # snapshotDate(): 2025-01-21
#> # names(): AH6793
#> # $dataprovider: UCSC
#> # $species: Drosophila melanogaster
#> # $rdataclass: GRanges
#> # $rdatadateadded: 2013-04-04
#> # $title: RefSeq Genes
#> # $description: GRanges object from on UCSC track ‘RefSeq Genes’
#> # $taxonomyid: 7227
#> # $genome: dm3
#> # $sourcetype: UCSC track
#> # $sourceurl: rtracklayer://hgdownload.cse.ucsc.edu/goldenpath/dm3/database/...
#> # $sourcesize: NA
#> # $tags: c("refGene", "UCSC", "track", "Gene", "Transcript",
#> # "Annotation")
#> # retrieve record with 'object[["AH6793"]]'
So you can retrieve that dm3 file as a GRanges like this:
The metadata fields contain the details of file origin and content.
metadata(gr)
#> $AnnotationHubName
#> [1] "AH6793"
#>
#> $`File Name`
#> [1] "refGene"
#>
#> $`Data Source`
#> [1] "rtracklayer://hgdownload.cse.ucsc.edu/goldenpath/dm3/database/refGene"
#>
#> $Provider
#> [1] "UCSC"
#>
#> $Organism
#> [1] "Drosophila melanogaster"
#>
#> $`Taxonomy ID`
#> [1] 7227
Split the GRanges
object by gene name to get a GRangesList
object of transcript ranges grouped by gene.
txbygn <- split(gr, gr$name)
You can now use txbygn
with the summarizeOverlaps
function to prepare a table of read counts for RNA-Seq differential gene expression.
Further reading
Note that before passing txbygn
to summarizeOverlaps
, you should confirm that the seqlevels (chromosome names) in it match those in the BAM file. See ?renameSeqlevels
, ?keepSeqlevels
and ?seqlevels
for examples of renaming seqlevels.
Session info
Click to display session info
sessionInfo()
#> R Under development (unstable) (2025-02-15 r87725)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sonoma 14.7.2
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: UTC
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] GenomicRanges_1.59.1 GenomeInfoDb_1.43.4 IRanges_2.41.3
#> [4] S4Vectors_0.45.4 AnnotationHub_3.15.0 BiocFileCache_2.15.1
#> [7] dbplyr_2.5.0 BiocGenerics_0.53.6 generics_0.1.3
#> [10] BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] rappdirs_0.3.3 BiocVersion_3.21.1 RSQLite_2.3.9
#> [4] digest_0.6.37 magrittr_2.0.3 evaluate_1.0.3
#> [7] fastmap_1.2.0 blob_1.2.4 jsonlite_1.8.9
#> [10] AnnotationDbi_1.69.0 DBI_1.2.3 BiocManager_1.30.25
#> [13] httr_1.4.7 purrr_1.0.4 UCSC.utils_1.3.1
#> [16] Biostrings_2.75.3 cli_3.6.4 rlang_1.1.5
#> [19] crayon_1.5.3 XVector_0.47.2 Biobase_2.67.0
#> [22] bit64_4.6.0-1 withr_3.0.2 cachem_1.1.0
#> [25] yaml_2.3.10 tools_4.5.0 memoise_2.0.1
#> [28] dplyr_1.1.4 GenomeInfoDbData_1.2.13 filelock_1.0.3
#> [31] curl_6.2.0 mime_0.12 vctrs_0.6.5
#> [34] R6_2.6.1 png_0.1-8 lifecycle_1.0.4
#> [37] KEGGREST_1.47.0 bit_4.5.0.1 pkgconfig_2.0.3
#> [40] pillar_1.10.1 glue_1.8.0 xfun_0.50
#> [43] tibble_3.2.1 tidyselect_1.2.1 knitr_1.49
#> [46] htmltools_0.5.8.1 rmarkdown_2.29 compiler_4.5.0