Contents

Original Authors: Martin Morgan, Sonali Arora, Lori Shepherd
Presenting Author: Davide Risso, Ilaria Billato
Date: 6-11 July, 2025
Back: Monday labs

Objective: Learn about Bioconductor resources for gene and genome annotation.

Lessons learned:

Note: It may be helpful to wait or look at lecture material for annotations planned for Wednesday.

1 Gene annotation

1.1 Data packages

Organism-level (‘org’) packages contain mappings between a central identifier (e.g., Entrez gene ids) and other identifiers (e.g. GenBank or Uniprot accession number, RefSeq id, etc.). The name of an org package is always of the form org.<Sp>.<id>.db (e.g. org.Sc.sgd.db) where <Sp> is a 2-letter abbreviation of the organism (e.g. Sc for Saccharomyces cerevisiae) and <id> is an abbreviation (in lower-case) describing the type of central identifier (e.g. sgd for gene identifiers assigned by the Saccharomyces Genome Database, or eg for Entrez gene ids). The “How to use the ‘.db’ annotation packages” vignette in the AnnotationDbi package (org packages are only one type of “.db” annotation packages) is a key reference. The ‘.db’ and most other Bioconductor annotation packages are updated every 6 months.

Annotation packages usually contain an object named after the package itself. These objects are collectively called AnnotationDb objects, with more specific classes named OrgDb, ChipDb or TranscriptDb objects. Methods that can be applied to these objects include cols(), keys(), keytypes() and select(). Common operations for retrieving annotations are summarized in the table.

Category Function Description
Discover columns() List the kinds of columns that can be returned
keytypes() List columns that can be used as keys
keys() List values that can be expected for a given keytype
select() Retrieve annotations matching keys, keytype and columns
Manipulate setdiff(), union(), intersect() Operations on sets
duplicated(), unique() Mark or remove duplicates
%in%, match() Find matches
any(), all() Are any TRUE? Are all?
merge() Combine two different based on shared keys
GRanges* transcripts(), exons(), cds() Features (transcripts, exons, coding sequence) as GRanges.
transcriptsBy() , exonsBy() Features group by gene, transcript, etc., as GRangesList.
cdsBy()

1.2 Internet resources

A short summary of select Bioconductor packages enabling web-based queries is in following Table.

Package Description
AnnotationHub Ensembl, Encode, dbSNP, UCSC data objects
biomaRt Ensembl and other annotations
PSICQUIC Protein interactions
uniprot.ws Protein annotations
KEGGREST KEGG pathways
SRAdb Sequencing experiments.
rtracklayer genome tracks.
GEOquery Array and other data
ArrayExpress Array and other data

1.3 Exercises

Exercise 1: This exercise illustrates basic use of the `select’ interface to annotation packages.

  1. Install and attach the org.Hs.eg.db annotation package; it contains ‘symbol mapping’ information for Homo sapiens, based on NCBI ‘Entrez’ identifiers.

    library(org.Hs.eg.db)
  2. Take a quick look at a summary of data in this package

    org.Hs.eg.db
    ## OrgDb object:
    ## | DBSCHEMAVERSION: 2.1
    ## | Db type: OrgDb
    ## | Supporting package: AnnotationDbi
    ## | DBSCHEMA: HUMAN_DB
    ## | ORGANISM: Homo sapiens
    ## | SPECIES: Human
    ## | EGSOURCEDATE: 2025-Feb22
    ## | EGSOURCENAME: Entrez Gene
    ## | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
    ## | CENTRALID: EG
    ## | TAXID: 9606
    ## | GOSOURCENAME: Gene Ontology
    ## | GOSOURCEURL: http://current.geneontology.org/ontology/go-basic.obo
    ## | GOSOURCEDATE: 2025-02-06
    ## | GOEGSOURCEDATE: 2025-Feb22
    ## | GOEGSOURCENAME: Entrez Gene
    ## | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
    ## | KEGGSOURCENAME: KEGG GENOME
    ## | KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
    ## | KEGGSOURCEDATE: 2011-Mar15
    ## | GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
    ## | GPSOURCEURL: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/database
    ## | GPSOURCEDATE: UTC-Mar1
    ## | ENSOURCEDATE: 2024-Oct17
    ## | ENSOURCENAME: Ensembl
    ## | ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
    ## | UPSOURCENAME: Uniprot
    ## | UPSOURCEURL: http://www.UniProt.org/
    ## | UPSOURCEDATE: Sat Mar  1 19:14:02 2025
    ## 
    ## Please see: help('select') for usage information
  3. The idea is that there are keytypes() that can be mapped to different columns(); keys() can be used to see available keys. Explore the package to see what sorts of information is available, e.g.,

    keytypes(org.Hs.eg.db)
    ##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
    ##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
    ## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
    ## [16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
    ## [21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
    ## [26] "UNIPROT"
    columns(org.Hs.eg.db)
    ##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
    ##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
    ## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
    ## [16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
    ## [21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
    ## [26] "UNIPROT"
    head(keys(org.Hs.eg.db, "SYMBOL"))
    ## [1] "A1BG"  "A2M"   "A2MP1" "NAT1"  "NAT2"  "NATP"
  4. There are two basic ways of extracting data from an org.* package – mapIds() to create a 1:1 mapping between key and a single column, and select() (it’s often necessary to specify this function directly, to avoid a conflict with dplyr, as AnnotationDbi::select()). Explore these functions, e.g.,

    set.seed(123)
    egid <- sample(keys(org.Hs.eg.db), 6)
    mapIds(org.Hs.eg.db, egid, "SYMBOL", "ENTREZID")
    ## 'select()' returned 1:1 mapping between keys and columns
    ##      130060841      130067052      127892030      127830407      129994078 
    ## "LOC130060841" "LOC130067052" "LOC127892030" "LOC127830407" "LOC129994078" 
    ##      127408916 
    ## "LOC127408916"
    AnnotationDbi::select(
        org.Hs.eg.db, egid, c("SYMBOL", "ENSEMBL", "GENENAME"), "ENTREZID"
    )
    ## 'select()' returned 1:1 mapping between keys and columns
    ##    ENTREZID       SYMBOL ENSEMBL
    ## 1 130060841 LOC130060841    <NA>
    ## 2 130067052 LOC130067052    <NA>
    ## 3 127892030 LOC127892030    <NA>
    ## 4 127830407 LOC127830407    <NA>
    ## 5 129994078 LOC129994078    <NA>
    ## 6 127408916 LOC127408916    <NA>
    ##                                                             GENENAME
    ## 1                  ATAC-STARR-seq lymphoblastoid active region 12162
    ## 2                  ATAC-STARR-seq lymphoblastoid active region 18723
    ## 3 NANOG-H3K27ac-H3K4me1 hESC enhancer GRCh37_chr19:50373269-50373916
    ## 4               H3K4me1 hESC enhancer GRCh37_chr15:79497097-79497660
    ## 5                  ATAC-STARR-seq lymphoblastoid active region 22684
    ## 6             OCT4-NANOG hESC enhancer GRCh37_chr7:38026488-38027240
  5. Some key - column mappings are 1:many, e.g., Entrez ID "3812" maps to 44 Ensembl Ids. What does mapIds() return when mapping Entrez ID "3812" to Ensembl ids? Use the additional argument multiVals = "CharacterList" to explore further. Compare results to those returned by select().

    egid <- "3812"
    mapIds(org.Hs.eg.db, egid, "ENSEMBL", "ENTREZID")
    ## 'select()' returned 1:many mapping between keys and columns
    ##              3812 
    ## "ENSG00000240403"
    mapIds(
        org.Hs.eg.db, egid, "ENSEMBL", "ENTREZID",
        multiVals = "CharacterList"
    )
    ## 'select()' returned 1:many mapping between keys and columns
    ## CharacterList of length 1
    ## [["3812"]] ENSG00000240403 ENSG00000288389 ... ENSG00000277181 ENSG00000275566
    AnnotationDbi::select(
        org.Hs.eg.db, egid, c("SYMBOL", "ENSEMBL"),
        multiVals = "CharacterList"
    )
    ## 'select()' returned 1:many mapping between keys and columns
    ##    ENTREZID  SYMBOL         ENSEMBL
    ## 1      3812 KIR3DL2 ENSG00000240403
    ## 2      3812 KIR3DL2 ENSG00000288389
    ## 3      3812 KIR3DL2 ENSG00000284213
    ## 4      3812 KIR3DL2 ENSG00000274722
    ## 5      3812 KIR3DL2 ENSG00000278474
    ## 6      3812 KIR3DL2 ENSG00000284295
    ## 7      3812 KIR3DL2 ENSG00000276004
    ## 8      3812 KIR3DL2 ENSG00000284192
    ## 9      3812 KIR3DL2 ENSG00000275629
    ## 10     3812 KIR3DL2 ENSG00000276882
    ## 11     3812 KIR3DL2 ENSG00000278809
    ## 12     3812 KIR3DL2 ENSG00000276424
    ## 13     3812 KIR3DL2 ENSG00000278656
    ## 14     3812 KIR3DL2 ENSG00000275416
    ## 15     3812 KIR3DL2 ENSG00000275626
    ## 16     3812 KIR3DL2 ENSG00000275083
    ## 17     3812 KIR3DL2 ENSG00000278726
    ## 18     3812 KIR3DL2 ENSG00000277982
    ## 19     3812 KIR3DL2 ENSG00000278361
    ## 20     3812 KIR3DL2 ENSG00000273735
    ## 21     3812 KIR3DL2 ENSG00000284466
    ## 22     3812 KIR3DL2 ENSG00000278710
    ## 23     3812 KIR3DL2 ENSG00000276357
    ## 24     3812 KIR3DL2 ENSG00000275262
    ## 25     3812 KIR3DL2 ENSG00000278442
    ## 26     3812 KIR3DL2 ENSG00000277709
    ## 27     3812 KIR3DL2 ENSG00000276739
    ## 28     3812 KIR3DL2 ENSG00000278403
    ## 29     3812 KIR3DL2 ENSG00000278758
    ## 30     3812 KIR3DL2 ENSG00000275838
    ## 31     3812 KIR3DL2 ENSG00000284528
    ## 32     3812 KIR3DL2 ENSG00000273911
    ## 33     3812 KIR3DL2 ENSG00000278850
    ## 34     3812 KIR3DL2 ENSG00000278707
    ## 35     3812 KIR3DL2 ENSG00000275511
    ## 36     3812 KIR3DL2 ENSG00000277181
    ## 37     3812 KIR3DL2 ENSG00000275566
  6. It seems like it might often be useful to use the tidyverse on return values from mapIds() and select(); explore this usage

    library(tidyverse)
    egid <- keys(org.Hs.eg.db)    # all ENTREZIDs
    mapIds(org.Hs.eg.db, egid, "SYMBOL", "ENTREZID") |> 
       as_tibble() |> 
       rownames_to_column("ENTREZID")
    ## # A tibble: 193,430 × 2
    ##    ENTREZID value   
    ##    <chr>    <chr>   
    ##  1 1        A1BG    
    ##  2 2        A2M     
    ##  3 3        A2MP1   
    ##  4 4        NAT1    
    ##  5 5        NAT2    
    ##  6 6        NATP    
    ##  7 7        SERPINA3
    ##  8 8        AADAC   
    ##  9 9        AAMP    
    ## 10 10       AANAT   
    ## # ℹ 193,420 more rows
    AnnotationDbi::select(
        org.Hs.eg.db, egid, c("SYMBOL", "GO", "GENENAME"), "ENTREZID"
    ) |>
        as_tibble()
    ## # A tibble: 532,170 × 6
    ##    ENTREZID SYMBOL GO         EVIDENCE ONTOLOGY GENENAME              
    ##    <chr>    <chr>  <chr>      <chr>    <chr>    <chr>                 
    ##  1 1        A1BG   GO:0002764 IBA      BP       alpha-1-B glycoprotein
    ##  2 1        A1BG   GO:0005576 HDA      CC       alpha-1-B glycoprotein
    ##  3 1        A1BG   GO:0005576 IDA      CC       alpha-1-B glycoprotein
    ##  4 1        A1BG   GO:0005576 TAS      CC       alpha-1-B glycoprotein
    ##  5 1        A1BG   GO:0005615 HDA      CC       alpha-1-B glycoprotein
    ##  6 1        A1BG   GO:0005886 IBA      CC       alpha-1-B glycoprotein
    ##  7 1        A1BG   GO:0031093 TAS      CC       alpha-1-B glycoprotein
    ##  8 1        A1BG   GO:0034774 TAS      CC       alpha-1-B glycoprotein
    ##  9 1        A1BG   GO:0062023 HDA      CC       alpha-1-B glycoprotein
    ## 10 1        A1BG   GO:0070062 HDA      CC       alpha-1-B glycoprotein
    ## # ℹ 532,160 more rows

Concept Check:

After completing this previous section you should be able to answer the following:

  1. What gene symbol corresponds to Entrez Gene ID 1000?
  2. What is the Ensembl Gene ID for PPARG?
  3. What is the UniProt ID for GAPDH?

Exercise 2: biomaRt.

Internet access required for this exercise

  1. Explore the Biomart web site https://www.ensembl.org/biomart for retrieving all kinds of genomic annotations.

    Start by choosing a database (e.g., ‘Ensembl Genes 92’), dataset (e.g., ‘Human genes (GRCh38.p12)’), filter (e.g., ‘GENE’ / ‘Input external reference’ / ‘Gene stable id’ and enter ‘ENSG00000000003’), attributes (default is ok), then press ‘Results’ to map from Ensembl identifier to transcript identifier.

  2. Install (if necessary) and load the biomaRt package. Use listMarts() to see availble databases, useMart() to select the mart you’re interested in.

    library(biomaRt)
    head(listMarts())
    ##                biomart                version
    ## 1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 114
    ## 2   ENSEMBL_MART_MOUSE      Mouse strains 114
    ## 3     ENSEMBL_MART_SNP  Ensembl Variation 114
    ## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 114
    mart <- useMart("ENSEMBL_MART_ENSEMBL")
  3. Use listDatasets() and useDataset() to select the Homo sapiens gene dataset.

    head(listDatasets(mart))
    ##                        dataset                           description
    ## 1 abrachyrhynchus_gene_ensembl Pink-footed goose genes (ASM259213v1)
    ## 2     acalliptera_gene_ensembl      Eastern happy genes (fAstCal1.3)
    ## 3   acarolinensis_gene_ensembl       Green anole genes (AnoCar2.0v2)
    ## 4    acchrysaetos_gene_ensembl       Golden eagle genes (bAquChr1.2)
    ## 5    acitrinellus_gene_ensembl        Midas cichlid genes (Midas_v5)
    ## 6    amelanoleuca_gene_ensembl       Giant panda genes (ASM200744v2)
    ##       version
    ## 1 ASM259213v1
    ## 2  fAstCal1.3
    ## 3 AnoCar2.0v2
    ## 4  bAquChr1.2
    ## 5    Midas_v5
    ## 6 ASM200744v2
    dataset <- useDataset("hsapiens_gene_ensembl", mart)
  4. Use listFilters() to see available filters. The filter is the type of data that you are querying with. Choose one.

    head(listFilters(dataset))
    ##              name              description
    ## 1 chromosome_name Chromosome/scaffold name
    ## 2           start                    Start
    ## 3             end                      End
    ## 4      band_start               Band Start
    ## 5        band_end                 Band End
    ## 6    marker_start             Marker Start
    filters <- "ensembl_gene_id"                    # see `listFilters()`
  5. Use listAttrbutes() to see available attributes. Attributes represent the information you’d like to retrieve. Choose some!

    head(listAttributes(dataset))
    ##                            name                  description         page
    ## 1               ensembl_gene_id               Gene stable ID feature_page
    ## 2       ensembl_gene_id_version       Gene stable ID version feature_page
    ## 3         ensembl_transcript_id         Transcript stable ID feature_page
    ## 4 ensembl_transcript_id_version Transcript stable ID version feature_page
    ## 5            ensembl_peptide_id            Protein stable ID feature_page
    ## 6    ensembl_peptide_id_version    Protein stable ID version feature_page
    attrs <- c("ensembl_gene_id", "hgnc_symbol")    # see `listAttributes()`
  6. Create a character vector of Ensembl gene ids, compose and execute the query, transforming the result to a tibble.

    ids <- c(
        "ENSG00000000003", "ENSG00000000005", "ENSG00000000419", 
        "ENSG00000000457", "ENSG00000000460", "ENSG00000000938"
    )
    tbl <-
        getBM(attrs, filters, ids, dataset) |>
        as_tibble()
    tbl
    ## # A tibble: 6 × 2
    ##   ensembl_gene_id hgnc_symbol
    ##   <chr>           <chr>      
    ## 1 ENSG00000000003 TSPAN6     
    ## 2 ENSG00000000005 TNMD       
    ## 3 ENSG00000000419 DPM1       
    ## 4 ENSG00000000457 SCYL3      
    ## 5 ENSG00000000460 FIRRM      
    ## 6 ENSG00000000938 FGR

Exercise 3: KEGGREST

Internet access required for this exercise

  1. Explore the KEGG web site https://www.genome.jp/kegg/ KEGG is a database of information on pathways.

  2. Load the KEGGREST package and discover available databases

    library(KEGGREST)
    KEGGREST::listDatabases()
    ##  [1] "pathway"  "brite"    "module"   "ko"       "genome"   "vg"      
    ##  [7] "ag"       "compound" "glycan"   "reaction" "rclass"   "enzyme"  
    ## [13] "disease"  "drug"     "dgroup"   "environ"  "genes"    "ligand"  
    ## [19] "kegg"
  3. Use keggList() to query the pathway database for human pathways; present the result as a tibble

    hsa_pathways <-
        keggList("pathway", "hsa") |> 
        enframe(name = "pathway", value = "description")
    hsa_pathways
    ## # A tibble: 367 × 2
    ##    pathway  description                                             
    ##    <chr>    <chr>                                                   
    ##  1 hsa01100 Metabolic pathways - Homo sapiens (human)               
    ##  2 hsa01200 Carbon metabolism - Homo sapiens (human)                
    ##  3 hsa01210 2-Oxocarboxylic acid metabolism - Homo sapiens (human)  
    ##  4 hsa01212 Fatty acid metabolism - Homo sapiens (human)            
    ##  5 hsa01230 Biosynthesis of amino acids - Homo sapiens (human)      
    ##  6 hsa01232 Nucleotide metabolism - Homo sapiens (human)            
    ##  7 hsa01250 Biosynthesis of nucleotide sugars - Homo sapiens (human)
    ##  8 hsa01240 Biosynthesis of cofactors - Homo sapiens (human)        
    ##  9 hsa01320 Sulfur cycle - Homo sapiens (human)                     
    ## 10 hsa00010 Glycolysis / Gluconeogenesis - Homo sapiens (human)     
    ## # ℹ 357 more rows
  4. Use keggLink() to recover the genes in each pathway.

    hsa_path_eg  <-
        keggLink("pathway", "hsa") |> 
        enframe(name = "egid", value = "pathway") |>
        mutate(egid = sub("hsa:", "", egid))
    hsa_path_eg
    ## # A tibble: 39,313 × 2
    ##    egid   pathway      
    ##    <chr>  <chr>        
    ##  1 10327  path:hsa00010
    ##  2 124    path:hsa00010
    ##  3 125    path:hsa00010
    ##  4 126    path:hsa00010
    ##  5 127    path:hsa00010
    ##  6 128    path:hsa00010
    ##  7 130    path:hsa00010
    ##  8 130589 path:hsa00010
    ##  9 131    path:hsa00010
    ## 10 160287 path:hsa00010
    ## # ℹ 39,303 more rows
    hsa_path_eg |>
        group_by(pathway) |>
        summarize(genes = list(egid))
    ## # A tibble: 367 × 2
    ##    pathway       genes     
    ##    <chr>         <list>    
    ##  1 path:hsa00010 <chr [67]>
    ##  2 path:hsa00020 <chr [30]>
    ##  3 path:hsa00030 <chr [31]>
    ##  4 path:hsa00040 <chr [36]>
    ##  5 path:hsa00051 <chr [34]>
    ##  6 path:hsa00052 <chr [32]>
    ##  7 path:hsa00053 <chr [30]>
    ##  8 path:hsa00061 <chr [18]>
    ##  9 path:hsa00062 <chr [28]>
    ## 10 path:hsa00071 <chr [43]>
    ## # ℹ 357 more rows
  5. Update the hsa_path_eg table to include information on gene symbol and Ensembl id from the org.Hs.eg.db package. Retrieve the relevant information using mapIds(). How would you deal with entrez gene ids that map to multiple Ensembl ids?

    hsa_kegg_anno <-
        hsa_path_eg |>
        mutate(
            symbol = mapIds(org.Hs.eg.db, egid, "SYMBOL", "ENTREZID"),
            ensembl = mapIds(org.Hs.eg.db, egid, "ENSEMBL", "ENTREZID")
        )
    ## 'select()' returned 1:1 mapping between keys and columns
    ## 'select()' returned 1:many mapping between keys and columns
  6. Use left_join() to append pathway descriptions to the hsa_kegg_anno table.

    left_join(hsa_kegg_anno, hsa_pathways, by = "pathway")
    ## # A tibble: 39,313 × 5
    ##    egid   pathway       symbol  ensembl         description
    ##    <chr>  <chr>         <chr>   <chr>           <chr>      
    ##  1 10327  path:hsa00010 AKR1A1  ENSG00000117448 <NA>       
    ##  2 124    path:hsa00010 ADH1A   ENSG00000187758 <NA>       
    ##  3 125    path:hsa00010 ADH1B   ENSG00000196616 <NA>       
    ##  4 126    path:hsa00010 ADH1C   ENSG00000248144 <NA>       
    ##  5 127    path:hsa00010 ADH4    ENSG00000198099 <NA>       
    ##  6 128    path:hsa00010 ADH5    ENSG00000197894 <NA>       
    ##  7 130    path:hsa00010 ADH6    ENSG00000172955 <NA>       
    ##  8 130589 path:hsa00010 GALM    ENSG00000143891 <NA>       
    ##  9 131    path:hsa00010 ADH7    ENSG00000196344 <NA>       
    ## 10 160287 path:hsa00010 LDHAL6A ENSG00000166800 <NA>       
    ## # ℹ 39,303 more rows

2 Genome annotation

There are a diversity of packages and classes available for representing large genomes. Several include:

2.1 Transcript annotation packages

Genome-centric packages are very useful for annotations involving genomic coordinates. It is straight-forward, for instance, to discover the coordinates of coding sequences in regions of interest, and from these retrieve corresponding DNA or protein coding sequences. Other examples of the types of operations that are easy to perform with genome-centric annotations include defining regions of interest for counting aligned reads in RNA-seq experiments and retrieving DNA sequences underlying regions of interest in ChIP-seq analysis, e.g., for motif characterization.

2.2 rtracklayer

The rtracklayer package allows us to query the UCSC genome browser, as well as providing import() and export() functions for common annotation file formats like GFF, GTF, and BED. The exercise below illustrates some of the functionality of rtracklayer.

2.3 Exercises

Exercise 4: TxDb.* packages

  1. Install and attach the TxDb.Hsapiens.UCSC.hg38.knownGene package. This contains the gene models for Homo sapiens based on the ‘hg38’ build of the human genome, using gene annotations in the UCSC ‘knownGene’ annotation track; TxDb’s for more recent builds and for different annotation tracks are available. Take a look at a summary of the package, and create an alias for easy typing

    library(TxDb.Hsapiens.UCSC.hg38.knownGene)
    TxDb.Hsapiens.UCSC.hg38.knownGene
    ## TxDb object:
    ## # Db type: TxDb
    ## # Supporting package: GenomicFeatures
    ## # Data source: UCSC
    ## # Genome: hg38
    ## # Organism: Homo sapiens
    ## # Taxonomy ID: 9606
    ## # UCSC Table: knownGene
    ## # UCSC Track: GENCODE V47
    ## # Resource URL: https://genome.ucsc.edu/
    ## # Type of Gene ID: Entrez Gene ID
    ## # Full dataset: yes
    ## # miRBase build ID: NA
    ## # Nb of transcripts: 412034
    ## # Db created by: txdbmaker package from Bioconductor
    ## # Creation time: 2025-03-02 02:45:03 +0000 (Sun, 02 Mar 2025)
    ## # txdbmaker version at creation time: 1.3.1
    ## # RSQLite version at creation time: 2.3.9
    ## # DBSCHEMAVERSION: 1.2
    txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
  2. The main purpose of this package is to provide genomic coordinates of genomic features such as exons(), coding sequences (cds()), transcripts() and genes(). Explore, for example,

    ex <- exons(txdb)
    ex
    ## GRanges object with 966585 ranges and 1 metadata column:
    ##                       seqnames        ranges strand |   exon_id
    ##                          <Rle>     <IRanges>  <Rle> | <integer>
    ##        [1]                chr1   11121-11211      + |         1
    ##        [2]                chr1   11125-11211      + |         2
    ##        [3]                chr1   11410-11671      + |         3
    ##        [4]                chr1   11411-11671      + |         4
    ##        [5]                chr1   11426-11671      + |         5
    ##        ...                 ...           ...    ... .       ...
    ##   [966581] chrX_MU273397v1_alt 314193-314248      - |    966581
    ##   [966582] chrX_MU273397v1_alt 314813-315236      - |    966582
    ##   [966583] chrX_MU273397v1_alt 315258-315407      - |    966583
    ##   [966584] chrX_MU273397v1_alt 316254-316302      - |    966584
    ##   [966585] chrX_MU273397v1_alt 324527-324923      - |    966585
    ##   -------
    ##   seqinfo: 711 sequences (1 circular) from hg38 genome
    library(ggplot2)
    qplot(log10(width(ex)))
    ## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
    ## This warning is displayed once every 8 hours.
    ## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
    ## generated.
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    ex[ which.max(width(ex)) ]
    ## GRanges object with 1 range and 1 metadata column:
    ##       seqnames              ranges strand |   exon_id
    ##          <Rle>           <IRanges>  <Rle> | <integer>
    ##   [1]     chrX 113616300-113963599      - |    887690
    ##   -------
    ##   seqinfo: 711 sequences (1 circular) from hg38 genome
  3. Extract all genes, and then keep only the ‘standard’ chromosomes 1:22, X, Y, and M. Use table() of seqnames() to determine how many genes are on each chromosome. Also do this in a dplyr way; note that the seqnames(gn) need to be coerced with as.factor().

    gn <- genes(txdb)
    ##   2169 genes were dropped because they have exons located on both strands of
    ##   the same reference sequence or on more than one reference sequence, so cannot
    ##   be represented by a single genomic range.
    ##   Use 'single.strand.genes.only=FALSE' to get all the genes in a GRangesList
    ##   object, or use suppressMessages() to suppress this message.
    length(gn)
    ## [1] 35332
    std <- paste0("chr", c(1:22, "X", "Y", "M"))
    seqlevels(gn, pruning.mode = "coarse") <- std
    length(gn)
    ## [1] 35138
    seqlevels(gn)
    ##  [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9" 
    ## [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
    ## [19] "chr19" "chr20" "chr21" "chr22" "chrX"  "chrY"  "chrM"
    table( seqnames(gn) )
    ## 
    ##  chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11 chr12 chr13 
    ##  3485  2489  1992  1567  1709  1723  1703  1294  1493  1507  1984  1772   898 
    ## chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22  chrX  chrY  chrM 
    ##  1106  1157  1365  1704   672  1866  1002   487   755  1271   137     0
    tibble(chr = as.factor(seqnames(gn))) |> 
        group_by(chr) |> 
        summarize(n = n())
    ## # A tibble: 24 × 2
    ##    chr       n
    ##    <fct> <int>
    ##  1 chr1   3485
    ##  2 chr2   2489
    ##  3 chr3   1992
    ##  4 chr4   1567
    ##  5 chr5   1709
    ##  6 chr6   1723
    ##  7 chr7   1703
    ##  8 chr8   1294
    ##  9 chr9   1493
    ## 10 chr10  1507
    ## # ℹ 14 more rows
  4. exonsBy() groups exons by gene or transcript; extract exons grouped by gene. (Challenging!) can you identify genes with exons on different chromosomes? Are there any of these genes on the standard chromosomes?

    exByGn <- exonsBy(txdb, "gene")
    ##
    trans <- lengths(unique(seqnames(exByGn)))
    table( trans )
    ## trans
    ##     1     2     3     4     5     6     7     8     9    10    12    13    16 
    ## 35332  1706   157    56    28    39    77    75     2    17     2     1     1 
    ##    31    33    38    39    41    42 
    ##     1     3     1     1     1     1
    seqnames( exByGn[ trans > 1 ] )
    ## RleList of length 2169
    ## $`10000`
    ## factor-Rle of length 94 with 2 runs
    ##   Lengths:                  63                  31
    ##   Values : chr1                chr1_KI270763v1_alt
    ## Levels(711): chr1 chr2 chr3 ... chrX_MU273396v1_alt chrX_MU273397v1_alt
    ## 
    ## $`10003`
    ## factor-Rle of length 76 with 2 runs
    ##   Lengths:                   38                   38
    ##   Values : chr11                chr11_KZ559110v1_alt
    ## Levels(711): chr1 chr2 chr3 ... chrX_MU273396v1_alt chrX_MU273397v1_alt
    ## 
    ## $`100033819`
    ## factor-Rle of length 3 with 2 runs
    ##   Lengths:                    2                    1
    ##   Values : chr11                chr11_ML143358v1_fix
    ## Levels(711): chr1 chr2 chr3 ... chrX_MU273396v1_alt chrX_MU273397v1_alt
    ## 
    ## $`100037417`
    ## factor-Rle of length 6 with 2 runs
    ##   Lengths:                    3                    3
    ##   Values : chr22                chr22_KI270879v1_alt
    ## Levels(711): chr1 chr2 chr3 ... chrX_MU273396v1_alt chrX_MU273397v1_alt
    ## 
    ## $`100049076`
    ## factor-Rle of length 107 with 2 runs
    ##   Lengths:                  93                  14
    ##   Values : chr5                chr5_KI270897v1_alt
    ## Levels(711): chr1 chr2 chr3 ... chrX_MU273396v1_alt chrX_MU273397v1_alt
    ## 
    ## ...
    ## <2164 more elements>
    ##
    std <- paste0("chr", c(1:22, "X", "Y", "M"))
    unames <- unique(seqnames(exByGn[ trans > 1 ]))
    transstd <- all(unames %in% std)
    unames[transstd]
    ## FactorList of length 29
    ## [["100128260"]] chrX chrY
    ## [["100359394"]] chrX chrY
    ## [["100500894"]] chrX chrY
    ## [["101928032"]] chrX chrY
    ## [["101928092"]] chrX chrY
    ## [["102464837"]] chrX chrY
    ## [["102724521"]] chrX chrY
    ## [["105373105"]] chrX chrY
    ## [["124900597"]] chrX chrY
    ## [["1438"]] chrX chrY
    ## ...
    ## <19 more elements>
  5. The previous exercise indicated that gene "22947" has exons on both chromosomes 4 and 10. Find out more about this gene using the org.Hs.eg.db package and by searching for the gene symbol on the NCBI web site.

    egid <- "22947"
    AnnotationDbi::select(
        org.Hs.eg.db, egid, c("SYMBOL", "GENENAME"), "ENTREZID"
    )
    ## 'select()' returned 1:1 mapping between keys and columns
    ##   ENTREZID SYMBOL                              GENENAME
    ## 1    22947 DUX4L1 double homeobox 4 like 1 (pseudogene)
    url <- paste0("https://www.ncbi.nlm.nih.gov/gene/", egid)
    browseURL(url)
  6. Note that the TxDb.* packages also support keytypes(), columns(), and select() for mapping between exon, cds, transcript, and gene identifiers.

Exercise 5: BSgenome.* packages

  1. Install (if necessary) and load the BSgenome.Hsapiens.UCSC.hg38 package, containing the entire sequence of the hg38 build of Homo sapiens. Check out it’s contents, and create a simple alias.

    library(BSgenome.Hsapiens.UCSC.hg38)
    BSgenome.Hsapiens.UCSC.hg38
    ## | BSgenome object for Human
    ## | - organism: Homo sapiens
    ## | - provider: UCSC
    ## | - genome: hg38
    ## | - release date: 2023/01/31
    ## | - 711 sequence(s):
    ## |     chr1                    chr2                    chr3                   
    ## |     chr4                    chr5                    chr6                   
    ## |     chr7                    chr8                    chr9                   
    ## |     chr10                   chr11                   chr12                  
    ## |     chr13                   chr14                   chr15                  
    ## |     ...                     ...                     ...                    
    ## |     chr19_KV575256v1_alt    chr19_KV575257v1_alt    chr19_KV575258v1_alt   
    ## |     chr19_KV575259v1_alt    chr19_KV575260v1_alt    chr19_MU273387v1_alt   
    ## |     chr22_KN196485v1_alt    chr22_KN196486v1_alt    chr22_KQ458387v1_alt   
    ## |     chr22_KQ458388v1_alt    chr22_KQ759761v1_alt    chrX_KV766199v1_alt    
    ## |     chrX_MU273395v1_alt     chrX_MU273396v1_alt     chrX_MU273397v1_alt    
    ## | 
    ## | Tips: call 'seqnames()' on the object to get all the sequence names, call
    ## | 'seqinfo()' to get the full sequence info, use the '$' or '[[' operator to
    ## | access a given sequence, see '?BSgenome' for more information.
    hg38 <- BSgenome.Hsapiens.UCSC.hg38
  2. Genomic sequence can be retrieved by chromosome, e.g., hg38[["chr1"]], or by genomic range, e.g., getSeq(hg38, GRanges("chr1:1000000-2000000")). Retrieve your favorite chunk(s) of DNA and calculate GC content.

    dna <- getSeq(hg38, GRanges("chr1:1000000-2000000"))
    letterFrequency(dna, "GC", as.prob=TRUE)
    ##            G|C
    ## [1,] 0.5728534
  3. Use the org.*, TxDb.*, and BSgenome.* packages to retrieve the BRCA1 exon DNA sequence.

    brca1_egid <- mapIds(org.Hs.eg.db, "BRCA1", "ENTREZID", "SYMBOL")
    ## 'select()' returned 1:1 mapping between keys and columns
    brca1_exons <- exonsBy(txdb, "gene")[[brca1_egid]]
    getSeq(hg38, brca1_exons)
    ## DNAStringSet object of length 84:
    ##      width seq
    ##  [1]  1508 CAATTGGGCAGATGTGTGAGGCACCTGTGGTGA...CACTGCAAATAAACTTGGTAGCAAACACTTCCA
    ##  [2]  1505 CAATTGGGCAGATGTGTGAGGCACCTGTGGTGA...TAACACTGCAAATAAACTTGGTAGCAAACACTT
    ##  [3]  1501 CAATTGGGCAGATGTGTGAGGCACCTGTGGTGA...CTGTTAACACTGCAAATAAACTTGGTAGCAAAC
    ##  [4]  1499 CAATTGGGCAGATGTGTGAGGCACCTGTGGTGA...TGCTGTTAACACTGCAAATAAACTTGGTAGCAA
    ##  [5]   998 CAATTGGGCAGATGTGTGAGGCACCTGTGGTGA...TGAGACTGTGGCTCAAAAAAAAAAAAAAAAAAA
    ##  ...   ... ...
    ## [80]   126 ACAGATAAATTAAAACTGCGACTGCGCGGCGTG...CTGCGCTCAGGAGGCCTTCACCCTCTGCTCTGG
    ## [81]   174 TTAGCGGTAGCCCCTTGGTTTCCGTGGCAACGG...CTGCGCTCAGGAGGCCTTCACCCTCTGCTCTGG
    ## [82]   175 CTTAGCGGTAGCCCCTTGGTTTCCGTGGCAACG...CTGCGCTCAGGAGGCCTTCACCCTCTGCTCTGG
    ## [83]   207 GTACCTTGATTTCGTATTCTGAGAGGCTGCTGC...CTGCGCTCAGGAGGCCTTCACCCTCTGCTCTGG
    ## [84]   120 AAAGCGTGGGAATTACAGATAAATTAAAACTGT...GGCCGCGTTGGGGTGAGACCCTCACTTCATCCG

Exercise 6

This exercise uses annotation resources to go from a gene symbol ‘BRCA1’ through to the genomic coordinates of each transcript associated with the gene, and finally to the DNA sequences of the transcripts. This can be achieved using an EnsDb package along with a BSgenome package, or with a combination of TxDb, Homo.sapiens and BSgenome packages. We will focus here on the former approach.

  1. Use AnnotationHub to discover and retrieve a current Ensembl annotation (‘EnsDb’) for Homo sapiens.

  2. Use the cdsBy() function to retrieve the genomic coordinates of all coding sequences for the gene ‘BRCA1’ from the EnsDb.Hsapiens.v86 package. To retrieve only data for the specified gene, submit either a GenenameFilter or a filter formula/expression to the function’s filter parameter. This avoids to extract the coding region for all genes, which takes a long time.

  3. Visualize the transcripts in genomic coordinates using the Gviz package to construct a GeneRegionTrack, and plotting it using plotTracks().

  4. Use the Bsgenome.Hsapiens.UCSC.hg38 package and extractTranscriptSeqs() function to extract the DNA sequence of each transcript.

Solution

Retrieve the coding sequences grouped by transcript for the gene of interest and verify that each coding sequence is a multiple of 3.

library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86

brca1cds <- cdsBy(edb, by = "tx", filter = ~ genename == "BRCA1")

class(brca1cds)
## [1] "CompressedGRangesList"
## attr(,"package")
## [1] "GenomicRanges"
length(brca1cds)
## [1] 29
brca1cds[[1]]                           # exons in cds
## GRanges object with 21 ranges and 3 metadata columns:
##        seqnames            ranges strand |   gene_name         exon_id
##           <Rle>         <IRanges>  <Rle> | <character>     <character>
##    [1]       17 43124017-43124096      - |       BRCA1 ENSE00003559512
##    [2]       17 43115726-43115779      - |       BRCA1 ENSE00003510592
##    [3]       17 43106456-43106533      - |       BRCA1 ENSE00003541068
##    [4]       17 43104868-43104956      - |       BRCA1 ENSE00003531836
##    [5]       17 43104122-43104261      - |       BRCA1 ENSE00003513709
##    ...      ...               ...    ... .         ...             ...
##   [17]       17 43057052-43057135      - |       BRCA1 ENSE00003458468
##   [18]       17 43051063-43051117      - |       BRCA1 ENSE00003477922
##   [19]       17 43049121-43049194      - |       BRCA1 ENSE00003628864
##   [20]       17 43047643-43047703      - |       BRCA1 ENSE00003687053
##   [21]       17 43045678-43045802      - |       BRCA1 ENSE00001814242
##        exon_rank
##        <integer>
##    [1]         2
##    [2]         3
##    [3]         4
##    [4]         5
##    [5]         6
##    ...       ...
##   [17]        18
##   [18]        19
##   [19]        20
##   [20]        21
##   [21]        22
##   -------
##   seqinfo: 1 sequence from GRCh38 genome
cdswidth <- width(brca1cds)             # width of each exon
all((sum(cdswidth) %% 3) == 0)          # sum within cds, modulus 3
## [1] FALSE

The CDS for some transcripts is not of the expected length, how come? Get the transcript ID of the first transcript that does have a CDS of the wrong size and look this transcript up in the Ensembl genome browser (http://www.ensembl.org).

tx_cds_fail <- names(brca1cds)[(sum(cdswidth) %% 3) != 0]

length(tx_cds_fail)
## [1] 10
tx_cds_fail[1]
## [1] "ENST00000412061"

In the description of the transcript it says CDS 5’ incomplete. Thus, in addition to known protein coding transcripts, Ensembl provides annotations for transcripts known to be targeted for nonsense mediated mRNA decay or that have incomplete CDS. Such transcripts would however not be listed in e.g. the TxDb.Hsapiens.UCSC.hg38.knownGene package.

Next we visualize the BRCA1 transcripts using Gviz (this package has an excellent vignette, vignette("Gviz"))

library(Gviz)

## Use the function from the ensembldb package to extract the data in the
## format suitable for Gviz
grt <- getGeneRegionTrackForGviz(edb, filter = ~genename == "BRCA1")
plotTracks(list(GenomeAxisTrack(), GeneRegionTrack(grt)))

Extract the coding sequences of each transcript. EnsDb databases provide annotations from Ensembl and use hence Ensembl style chromosome names (such as “Y”) while the BSgenome package is based on UCSC annotations that use a naming style that prepends a “chr” to each chromosome name (e.g. “chrY”). Change thus the seqlevelsStyle from the default UCSC chromosome naming to Ensembl naming style.

library(BSgenome.Hsapiens.UCSC.hg38)
genome <- BSgenome.Hsapiens.UCSC.hg38

## Change the seqlevelsStyle from UCSC to Ensembl
seqlevelsStyle(genome) <- "Ensembl"
tx_seq <- extractTranscriptSeqs(genome, brca1cds)
tx_seq
## DNAStringSet object of length 29:
##      width seq                                              names               
##  [1]  2166 ATGGATTTATCTGCTCTTCGCGT...GATCCCCCACAGCCACTACTGA ENST00000352993
##  [2]  4200 ATGGATTTATCTGCTCTTCGCGT...ACAAATTTCCAAGTATAGTTAA ENST00000354071
##  [3]  5592 ATGGATTTATCTGCTCTTCGCGT...GATCCCCCACAGCCACTACTGA ENST00000357654
##  [4]  1312 GTTTGGATTCTGCAAAAAAGGCT...GTGAAGAGATAAAGAAAAAAAA ENST00000412061
##  [5]   192 ATGGATTTATCTGCTCTTCGCGT...GCCTTCACAGTGTCCTTTATGA ENST00000461221
##  ...   ... ...
## [25]  1065 ATGCACAGTTGCTCTGGGAGTCT...GATCCCCCACAGCCACTACTGA ENST00000591534
## [26]   291 ATGAGTGACAGCAAGAAAACCTG...GATCCCCCACAGCCACTACTGA ENST00000591849
## [27]    71 ATGGATTTATCTGCTCTTCGCGT...CTATGCAGAAAATCTTAGAGTG ENST00000618469
## [28]  2395 ATGGATTTATCTGCTCTTCGCGT...TTGGGACATGAAGTTAACCACA ENST00000634433
## [29]  5592 ATGGATTTATCTGCTCTTCGCGT...GATCCCCCACAGCCACTACTGA LRG_292t1

We can also inspect the CDS sequence for the transcripts with incomplete CDS. Many of them do not start with a start codon hence indicating that the CDS is incomplete on their 5’ end.

tx_seq[tx_cds_fail]
## DNAStringSet object of length 10:
##      width seq                                              names               
##  [1]  1312 GTTTGGATTCTGCAAAAAAGGCT...GTGAAGAGATAAAGAAAAAAAA ENST00000412061
##  [2]   958 TTCAGCTTGACACAGGTTTGGAG...TTCACTCCAAATCAGTAGAGAG ENST00000473961
##  [3]   667 ATGGATTTATCTGCTCTTCGCGT...AGTTTGGATTCTGCAAAAAAGG ENST00000476777
##  [4]  1867 ATGGATTTATCTGCTCTTCGCGT...GATAGTTGTTCTAGCAGTGAAG ENST00000477152
##  [5]  1870 ATGGATTTATCTGCTCTTCGCGT...CAGTCTATTAAAGAAAGAAAAA ENST00000478531
##  [6]  1495 GAGCTATTGAAAATCATTTGTGC...CAGTCTATTAAAGAAAGAAAAA ENST00000484087
##  [7]   800 GAGCTATTGAAAATCATTTGTGC...ATAAAGAACCAGGAGTGGAAAG ENST00000487825
##  [8]   296 ATGGATTTATCTGCTCTTCGCGT...TAAAAGATGAAGTTTCTATCAT ENST00000489037
##  [9]    71 ATGGATTTATCTGCTCTTCGCGT...CTATGCAGAAAATCTTAGAGTG ENST00000618469
## [10]  2395 ATGGATTTATCTGCTCTTCGCGT...TTGGGACATGAAGTTAACCACA ENST00000634433

Intron coordinates can be identified by first calculating the range of the genome (from the start of the first exon to the end of the last exon) covered by each transcript, and then taking the (algebraic) set difference between this and the genomic coordinates covered by each exon

introns <- psetdiff(unlist(range(brca1cds)), brca1cds)

Retrieve the intronic sequences with getSeq() (these are not assembled, the way that extractTranscriptSeqs() assembles exon sequences into mature transcripts); note that introns start and end with the appropriate acceptor and donor site sequences. Unfortunately, UCSC and Ensembl do also use different names for the genome assembly. Change the genome name for the introns object to matche the one from the genome object.

unique(genome(genome))
## [1] "hg38"
genome(introns)
##       17 
## "GRCh38"
## Change the genome name on introns to match the one from the
## BSgenome package
genome(introns) <- c(`17` = unique(genome(genome)))

seq <- getSeq(genome, introns)
names(seq)
##  [1] "ENST00000352993" "ENST00000354071" "ENST00000357654" "ENST00000412061"
##  [5] "ENST00000461221" "ENST00000461574" "ENST00000461798" "ENST00000468300"
##  [9] "ENST00000470026" "ENST00000471181" "ENST00000473961" "ENST00000476777"
## [13] "ENST00000477152" "ENST00000478531" "ENST00000484087" "ENST00000487825"
## [17] "ENST00000489037" "ENST00000491747" "ENST00000492859" "ENST00000493795"
## [21] "ENST00000493919" "ENST00000494123" "ENST00000497488" "ENST00000586385"
## [25] "ENST00000591534" "ENST00000591849" "ENST00000618469" "ENST00000634433"
## [29] "LRG_292t1"
seq[["ENST00000352993"]]                     # 20 introns
## DNAStringSet object of length 20:
##      width seq
##  [1]  1840 GTAAGGTGCCTGCATGTACCTGTGCTATATGGG...ACACTAATCTCTGCTTGTGTTCTCTGTCTCCAG
##  [2]  1417 GTAAGTATTGGGTGCCCTGTCAGAGAGGGAGGA...ACTTTGAATGCTCTTTCCTTCCTGGGGATCCAG
##  [3]  1868 GTAAGAGCCTGGGAGAACCCCAGAGTTCCAGCA...TGCAGTGATTTTACATCTAAATGTCCATTTTAG
##  [4]  5934 GTAAAGCTCCCTCCCTCAAGTTGACAAAAATCT...CCCCTGTCCCTCTCTCTTCCTCTCTTCTTCCAG
##  [5]  6197 GTAAGTACTTGATGTTACAAACTAACCAGAGAT...TTATCCTGATGGGTTGTGTTTGGTTTCTTTCAG
##  ...   ... ...
## [16]  4241 GTAAAACCATTTGTTTTCTTCTTCTTCTTCTTC...AATTGCTTGACTGTTCTTTACCATACTGTTTAG
## [17]   606 GTAAGTGTTGAATATCCCAAGAATGACACTCAA...TGCAAACATAATGTTTTCCCTTGTATTTTACAG
## [18]  1499 GTATATAATTTGGTAATGATGCTAGGTTGGAAG...GCTGAGTGTGTTTCTCAAACAATTTAATTTCAG
## [19]  9192 GTAAGTTTGAATGTGTTATGTGGCTCCATTATT...TTAAATTGTTCTTTCTTTCTTTATAATTTATAG
## [20]  8237 GTAAGTCAGCACAAGAGTGTATTAATTTGGGAT...TTATTTTCTTTTTCTCCCCCCCTACCCTGCTAG

Exercise 7

Internet access required for this exercise

Here we use rtracklayer to retrieve estrogen receptor binding sites identified across cell lines in the ENCODE project. We focus on binding sites in the vicinity of a particularly interesting region.

  1. Define our region of interest by creating a GRanges instance with appropriate genomic coordinates. Our region corresponds to 10Mb up- and down-stream of a particular gene.
  2. Create a session for the UCSC genome browser
  3. Query the UCSC genome browser for ENCODE estrogen receptor ERalphaa transcription marks; identifying the appropriate track, table, and transcription factor requires biological knowledge and detective work.
  4. Visualize the location of the binding sites and their scores; annotate the mid-point of the region of interest.

Solution

Define the region of interest

library(GenomicRanges)
roi <- GRanges("chr10", IRanges(92106877, 112106876, names="ENSG00000099194"))

Create a session

library(rtracklayer) 
session <- browserSession()

Query the UCSC for a particular track, table, and transcription factor, in our region of interest

trackName <- "wgEncodeRegTfbsClusteredV2"
tableName <- "wgEncodeRegTfbsClusteredV2"
trFactor <- "ERalpha_a"
ucscTable <- getTable(ucscTableQuery(session, track=trackName,
    range=roi, table=tableName, name=trFactor))

Visualize the result

plot(score ~ chromStart, ucscTable, pch="+")
abline(v=start(roi) + (end(roi) - start(roi) + 1) / 2, col="blue")

3 AnnotationHub

AnnotationHub is a data base of large-scale whole-genome resources, e.g., regulatory elements from the Roadmap Epigenomics project, Ensembl GTF and FASTA files for model and other organisms, and the NHLBI grasp2db data base of GWAS results. There are many interesting ways in which these resources can be used. Examples include

AnnotationHub makes extensive use of internet resources and so we will not pursue it in this course extensively; see the vignettes that come with the pacakge, for instance AnnotationHub HOW-TOs. Also see Wednesday’s annotation lecture and additional materials for more information on AnnotationHub and ExperimentHub.

Create a session

library(AnnotationHub)
hub = AnnotationHub()
hub
## AnnotationHub with 70637 records
## # snapshotDate(): 2025-04-08
## # $dataprovider: Ensembl, BroadInstitute, UCSC, ftp://ftp.ncbi.nlm.nih.gov/g...
## # $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Rattus norv...
## # $rdataclass: GRanges, TwoBitFile, BigWigFile, EnsDb, Rle, OrgDb, ChainFile...
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH5012"]]' 
## 
##              title                                   
##   AH5012   | Chromosome Band                         
##   AH5013   | STS Markers                             
##   AH5014   | FISH Clones                             
##   AH5015   | Recomb Rate                             
##   AH5016   | ENCODE Pilot                            
##   ...        ...                                     
##   AH121712 | Data.table for PubMed Author Information
##   AH121713 | Data.table for PMC                      
##   AH121714 | Data.table for MeSH (Descriptor)        
##   AH121715 | Data.table for MeSH (Qualifier)         
##   AH121716 | Data.table for MeSH (SCR)

Finding what you need in the hubs requires a well formed query or subset call against the metadata columns of the searchable database. What types of metadata columns are available?

names(mcols(hub))
##  [1] "title"              "dataprovider"       "species"           
##  [4] "taxonomyid"         "genome"             "description"       
##  [7] "coordinate_1_based" "maintainer"         "rdatadateadded"    
## [10] "preparerclass"      "tags"               "rdataclass"        
## [13] "rdatapath"          "sourceurl"          "sourcetype"

You can see what values are in each metadata columns

head(unique(hub$dataprovider))
## [1] "UCSC"        "Ensembl"     "RefNet"      "Inparanoid8" "NHLBI"      
## [6] "ChEA"
head(unique(hub$rdataclass))
## [1] "GRanges"          "data.frame"       "Inparanoid8Db"    "TwoBitFile"      
## [5] "ChainFile"        "SQLiteConnection"

There are two ways to search for data subset and query. Query is recommended as subset involves an exact match search while query is less strict.

query(hub, c("granges","homo sapiens","ensembl"))
## AnnotationHub with 121 records
## # snapshotDate(): 2025-04-08
## # $dataprovider: Ensembl, UCSC, TargetScan,miRTarBase,USCS,ENSEMBL
## # $species: Homo sapiens, homo sapiens
## # $rdataclass: GRanges
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH5046"]]' 
## 
##              title                                           
##   AH5046   | Ensembl Genes                                   
##   AH5160   | Ensembl Genes                                   
##   AH5311   | Ensembl Genes                                   
##   AH5434   | Ensembl Genes                                   
##   AH5435   | Ensembl EST Genes                               
##   ...        ...                                             
##   AH110100 | Homo_sapiens.GRCh38.108.gtf                     
##   AH110866 | Homo_sapiens.GRCh38.109.abinitio.gtf            
##   AH110867 | Homo_sapiens.GRCh38.109.chr.gtf                 
##   AH110868 | Homo_sapiens.GRCh38.109.chr_patch_hapl_scaff.gtf
##   AH110869 | Homo_sapiens.GRCh38.109.gtf

You can look at the full metadata for a resource without downloading by using a single square bracket. A double bracket will download the resource and cache it locally.

hub["AH110100"]
## AnnotationHub with 1 record
## # snapshotDate(): 2025-04-08
## # names(): AH110100
## # $dataprovider: Ensembl
## # $species: homo sapiens
## # $rdataclass: GRanges
## # $rdatadateadded: 2022-10-30
## # $title: Homo_sapiens.GRCh38.108.gtf
## # $description: Gene Annotation for homo sapiens
## # $taxonomyid: 9606
## # $genome: GRCh38.p13
## # $sourcetype: GTF
## # $sourceurl: ftp://ftp.ensembl.org/pub/release-108/gtf/homo_sapiens/Homo_sa...
## # $sourcesize: 54107597
## # $tags: c("GTF", "ensembl", "Gene", "Transcript", "Annotation") 
## # retrieve record with 'object[["AH110100"]]'

4 Annotating variants

Bioconductor provides facilities for reading VCF files. These work very well with the annotation resources described above, so for instance it is straight-forward to identify variants in coding or other regions of interest.

To develop a sense of the capabilities available, work through the VariantAnnotation vignette ‘Introduction to Variant Annotation’, and the VariantFiltering vignette.

5 End matter

5.1 Session Info

sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Rome
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] grid      stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] KEGGREST_1.48.1                         
##  [2] lubridate_1.9.4                         
##  [3] forcats_1.0.0                           
##  [4] stringr_1.5.1                           
##  [5] dplyr_1.1.4                             
##  [6] purrr_1.0.4                             
##  [7] readr_2.1.5                             
##  [8] tidyr_1.3.1                             
##  [9] tibble_3.3.0                            
## [10] ggplot2_3.5.2                           
## [11] tidyverse_2.0.0                         
## [12] AnnotationHub_3.16.0                    
## [13] BiocFileCache_2.16.0                    
## [14] dbplyr_2.5.0                            
## [15] Gviz_1.52.0                             
## [16] biomaRt_2.64.0                          
## [17] BSgenome.Hsapiens.UCSC.hg38_1.4.5       
## [18] BSgenome_1.76.0                         
## [19] rtracklayer_1.68.0                      
## [20] BiocIO_1.18.0                           
## [21] Biostrings_2.76.0                       
## [22] XVector_0.48.0                          
## [23] EnsDb.Hsapiens.v86_2.99.0               
## [24] ensembldb_2.32.0                        
## [25] AnnotationFilter_1.32.0                 
## [26] TxDb.Hsapiens.UCSC.hg38.knownGene_3.21.0
## [27] GenomicFeatures_1.60.0                  
## [28] GenomicRanges_1.60.0                    
## [29] GenomeInfoDb_1.44.0                     
## [30] org.Hs.eg.db_3.21.0                     
## [31] AnnotationDbi_1.70.0                    
## [32] IRanges_2.42.0                          
## [33] S4Vectors_0.46.0                        
## [34] Biobase_2.68.0                          
## [35] BiocGenerics_0.54.0                     
## [36] generics_0.1.4                          
## [37] BiocStyle_2.36.0                        
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3          rstudioapi_0.17.1          
##   [3] jsonlite_2.0.0              magrittr_2.0.3             
##   [5] magick_2.8.7                farver_2.1.2               
##   [7] rmarkdown_2.29              vctrs_0.6.5                
##   [9] memoise_2.0.1               Rsamtools_2.24.0           
##  [11] RCurl_1.98-1.17             base64enc_0.1-3            
##  [13] tinytex_0.57                htmltools_0.5.8.1          
##  [15] S4Arrays_1.8.1              progress_1.2.3             
##  [17] curl_6.4.0                  SparseArray_1.8.0          
##  [19] Formula_1.2-5               sass_0.4.10                
##  [21] bslib_0.9.0                 htmlwidgets_1.6.4          
##  [23] httr2_1.1.2                 cachem_1.1.0               
##  [25] GenomicAlignments_1.44.0    mime_0.13                  
##  [27] lifecycle_1.0.4             pkgconfig_2.0.3            
##  [29] Matrix_1.7-3                R6_2.6.1                   
##  [31] fastmap_1.2.0               GenomeInfoDbData_1.2.14    
##  [33] MatrixGenerics_1.20.0       digest_0.6.37              
##  [35] colorspace_2.1-1            Hmisc_5.2-3                
##  [37] RSQLite_2.4.1               labeling_0.4.3             
##  [39] filelock_1.0.3              timechange_0.3.0           
##  [41] httr_1.4.7                  abind_1.4-8                
##  [43] compiler_4.5.1              withr_3.0.2                
##  [45] bit64_4.6.0-1               htmlTable_2.4.3            
##  [47] backports_1.5.0             BiocParallel_1.42.1        
##  [49] DBI_1.2.3                   rappdirs_0.3.3             
##  [51] DelayedArray_0.34.1         rjson_0.2.23               
##  [53] tools_4.5.1                 foreign_0.8-90             
##  [55] nnet_7.3-20                 glue_1.8.0                 
##  [57] restfulr_0.0.16             checkmate_2.3.2            
##  [59] cluster_2.1.8.1             gtable_0.3.6               
##  [61] tzdb_0.5.0                  data.table_1.17.6          
##  [63] hms_1.1.3                   utf8_1.2.6                 
##  [65] xml2_1.3.8                  BiocVersion_3.21.1         
##  [67] pillar_1.10.2               lattice_0.22-7             
##  [69] bit_4.6.0                   deldir_2.0-4               
##  [71] biovizBase_1.56.0           tidyselect_1.2.1           
##  [73] knitr_1.50                  gridExtra_2.3              
##  [75] bookdown_0.43               ProtGenerics_1.40.0        
##  [77] SummarizedExperiment_1.38.1 xfun_0.52                  
##  [79] matrixStats_1.5.0           stringi_1.8.7              
##  [81] UCSC.utils_1.4.0            lazyeval_0.2.2             
##  [83] yaml_2.3.10                 evaluate_1.0.4             
##  [85] codetools_0.2-20            interp_1.1-6               
##  [87] BiocManager_1.30.26         cli_3.6.5                  
##  [89] rpart_4.1.24                jquerylib_0.1.4            
##  [91] dichromat_2.0-0.1           Rcpp_1.1.0                 
##  [93] png_0.1-8                   XML_3.99-0.18              
##  [95] parallel_4.5.1              blob_1.2.4                 
##  [97] prettyunits_1.2.0           latticeExtra_0.6-30        
##  [99] jpeg_0.1-11                 bitops_1.0-9               
## [101] VariantAnnotation_1.54.1    scales_1.4.0               
## [103] crayon_1.5.3                rlang_1.1.6

5.2 Acknowledgements

Research reported in this tutorial was supported by the National Human Genome Research Institute and the National Cancer Institute of the National Institutes of Health under award numbers U24HG004059 (Bioconductor), U24HG010263 (AnVIL) and U24CA180996 (ITCR).