How to extract promoter sequences

Promoters are DNA sequences typically located upstream of a gene’s transcription start site (TSS), and they contain binding sites for transcription factors. Researchers working on gene regulation often want to extract promoter sequences for all genes or a set of genes. With these sequences, researchers can, for example, look for known transcription factor binding sites (TFBS), or predict TFBS de novo. This HowTo will demonstrate how to extract promoter sequences for any gene(s) in a genome.

Bioconductor packages used in this document

We will start by loading the required packages.

library(Biostrings)
library(GenomicRanges)
library(rtracklayer)
library(txdbmaker)
library(GenomicFeatures)
library(BSgenome)

Then, we will get our example data set. Here, we will use data for the model plant Arabidopsis thaliana. We will start by obtaining genome sequences and gene annotation from Ensembl Plants.

Working with model and non-model organisms

Here, we’re obtaining data from Ensembl Plants to explicitly show how this can be done for any organism. However, if you’re lucky enough to work with a model organism (e.g. human, mouse, etc), you can find curated genome sequences and gene annotation in AnnotationHub (see, for instance, this HowTo article).

# Read genomic data from Ensembl Plants
## Genome
genome <- Biostrings::readDNAStringSet(
    "https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-61/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz"
)
names(genome) <- gsub(" .*", "", names(genome)) # remove text after chr name

## Annotation
annotation <- rtracklayer::import(
    "https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-61/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.61.gff3.gz"
)

Genome sequences are stored in DNAStringSet objects, and gene annotation (i.e., start and end coordinates of each gene) are stored in GRanges objects. This is what these objects look like:

# Take a look at the genome and annotation
genome
#> DNAStringSet object of length 7:
#>        width seq                                            names               
#> [1] 30427671 CCCTAAACCCTAAACCCTAAAC...TTTAGGGTTTAGGGTTTAGGG 1
#> [2] 19698289 NNNNNNNNNNNNNNNNNNNNNN...TTTAGGGTTTAGGGTTTAGGG 2
#> [3] 23459830 NNNNNNNNNNNNNNNNNNNNNN...TAAACCCTAAACCCTAAACCC 3
#> [4] 18585056 NNNNNNNNNNNNNNNNNNNNNN...GTTTAGGGTTTAGGGTTTAGG 4
#> [5] 26975502 TATACCATGTACCCTCAACCTT...AGGATTTAGGGTTTTTAGATC 5
#> [6]   366924 GGATCCGTTCGAAACAGGTTAG...AGAATGGAAACAAACCGGATT Mt
#> [7]   154478 ATGGGCGAACGACGGGAATTGA...AATAACTTGGTCCCGGGCATC Pt
annotation
#> GRanges object with 791564 ranges and 21 metadata columns:
#>            seqnames        ranges strand |    source           type     score
#>               <Rle>     <IRanges>  <Rle> |  <factor>       <factor> <numeric>
#>        [1]        1    1-30427671      * | TAIR10    chromosome            NA
#>        [2]        1     3631-5899      + | araport11 gene                  NA
#>        [3]        1     3631-5899      + | araport11 mRNA                  NA
#>        [4]        1     3631-3759      + | araport11 five_prime_UTR        NA
#>        [5]        1     3631-3913      + | araport11 exon                  NA
#>        ...      ...           ...    ... .       ...            ...       ...
#>   [791560]       Pt 152806-154312      + | araport11           mRNA        NA
#>   [791561]       Pt 152806-153195      + | araport11           exon        NA
#>   [791562]       Pt 152806-153195      + | araport11           CDS         NA
#>   [791563]       Pt 153878-154312      + | araport11           exon        NA
#>   [791564]       Pt 153878-154312      + | araport11           CDS         NA
#>                phase                     ID                       Alias
#>            <integer>            <character>             <CharacterList>
#>        [1]      <NA>           chromosome:1 Chr1,CP002684.1,NC_003070.9
#>        [2]      <NA>         gene:AT1G01010                            
#>        [3]      <NA> transcript:AT1G01010.1                            
#>        [4]      <NA>                   <NA>                            
#>        [5]      <NA>                   <NA>                            
#>        ...       ...                    ...                         ...
#>   [791560]      <NA> transcript:ATCG01310.1                            
#>   [791561]      <NA>                   <NA>                            
#>   [791562]         0        CDS:ATCG01310.1                            
#>   [791563]      <NA>                   <NA>                            
#>   [791564]         0        CDS:ATCG01310.1                            
#>                         Name        biotype            description     gene_id
#>                  <character>    <character>            <character> <character>
#>        [1]              <NA>           <NA>                   <NA>        <NA>
#>        [2]            NAC001 protein_coding NAC domain containin..   AT1G01010
#>        [3]        NAC001-201 protein_coding                   <NA>        <NA>
#>        [4]              <NA>           <NA>                   <NA>        <NA>
#>        [5] AT1G01010.1.exon1           <NA>                   <NA>        <NA>
#>        ...               ...            ...                    ...         ...
#>   [791560]          rpl2-201 protein_coding                   <NA>        <NA>
#>   [791561] ATCG01310.1.exon1           <NA>                   <NA>        <NA>
#>   [791562]              <NA>           <NA>                   <NA>        <NA>
#>   [791563] ATCG01310.1.exon2           <NA>                   <NA>        <NA>
#>   [791564]              <NA>           <NA>                   <NA>        <NA>
#>             logic_name                 Parent               tag transcript_id
#>            <character>        <CharacterList>       <character>   <character>
#>        [1]        <NA>                                     <NA>          <NA>
#>        [2]   araport11                                     <NA>          <NA>
#>        [3]        <NA>         gene:AT1G01010 Ensembl_canonical   AT1G01010.1
#>        [4]        <NA> transcript:AT1G01010.1              <NA>          <NA>
#>        [5]        <NA> transcript:AT1G01010.1              <NA>          <NA>
#>        ...         ...                    ...               ...           ...
#>   [791560]        <NA>         gene:ATCG01310 Ensembl_canonical   ATCG01310.1
#>   [791561]        <NA> transcript:ATCG01310.1              <NA>          <NA>
#>   [791562]        <NA> transcript:ATCG01310.1              <NA>          <NA>
#>   [791563]        <NA> transcript:ATCG01310.1              <NA>          <NA>
#>   [791564]        <NA> transcript:ATCG01310.1              <NA>          <NA>
#>            constitutive ensembl_end_phase ensembl_phase           exon_id
#>             <character>       <character>   <character>       <character>
#>        [1]         <NA>              <NA>          <NA>              <NA>
#>        [2]         <NA>              <NA>          <NA>              <NA>
#>        [3]         <NA>              <NA>          <NA>              <NA>
#>        [4]         <NA>              <NA>          <NA>              <NA>
#>        [5]            1                 1            -1 AT1G01010.1.exon1
#>        ...          ...               ...           ...               ...
#>   [791560]         <NA>              <NA>          <NA>              <NA>
#>   [791561]            1                 0             0 ATCG01310.1.exon1
#>   [791562]         <NA>              <NA>          <NA>              <NA>
#>   [791563]            1                 0             0 ATCG01310.1.exon2
#>   [791564]         <NA>              <NA>          <NA>              <NA>
#>                   rank  protein_id Is_circular
#>            <character> <character> <character>
#>        [1]        <NA>        <NA>        <NA>
#>        [2]        <NA>        <NA>        <NA>
#>        [3]        <NA>        <NA>        <NA>
#>        [4]        <NA>        <NA>        <NA>
#>        [5]           1        <NA>        <NA>
#>        ...         ...         ...         ...
#>   [791560]        <NA>        <NA>        <NA>
#>   [791561]           1        <NA>        <NA>
#>   [791562]        <NA> ATCG01310.1        <NA>
#>   [791563]           2        <NA>        <NA>
#>   [791564]        <NA> ATCG01310.1        <NA>
#>   -------
#>   seqinfo: 7 sequences from an unspecified genome; no seqlengths

Note that our GRanges object contains ranges for multiple types of features, including genes, mRNAs, exons, and even entire chromosomes. Since we’re only interested in promoter regions, we will subset annotation to keep only ‘gene’ ranges.

# Keep only gene ranges
genes <- annotation[annotation$type == "gene"]
genes
#> GRanges object with 27655 ranges and 21 metadata columns:
#>           seqnames        ranges strand |    source     type     score
#>              <Rle>     <IRanges>  <Rle> |  <factor> <factor> <numeric>
#>       [1]        1     3631-5899      + | araport11     gene        NA
#>       [2]        1     6788-9130      - | araport11     gene        NA
#>       [3]        1   11649-13714      - | araport11     gene        NA
#>       [4]        1   23121-31227      + | araport11     gene        NA
#>       [5]        1   31170-33171      - | araport11     gene        NA
#>       ...      ...           ...    ... .       ...      ...       ...
#>   [27651]       Pt 141854-143708      + | araport11     gene        NA
#>   [27652]       Pt 144921-145154      - | araport11     gene        NA
#>   [27653]       Pt 145291-152175      - | araport11     gene        NA
#>   [27654]       Pt 152506-152787      + | araport11     gene        NA
#>   [27655]       Pt 152806-154312      + | araport11     gene        NA
#>               phase             ID           Alias        Name        biotype
#>           <integer>    <character> <CharacterList> <character>    <character>
#>       [1]      <NA> gene:AT1G01010                      NAC001 protein_coding
#>       [2]      <NA> gene:AT1G01020                        ARV1 protein_coding
#>       [3]      <NA> gene:AT1G01030                        NGA3 protein_coding
#>       [4]      <NA> gene:AT1G01040                        DCL1 protein_coding
#>       [5]      <NA> gene:AT1G01050                        PPa1 protein_coding
#>       ...       ...            ...             ...         ...            ...
#>   [27651]      <NA> gene:ATCG01250                        ndhB protein_coding
#>   [27652]      <NA> gene:ATCG01270                     ycf15-B protein_coding
#>   [27653]      <NA> gene:ATCG01280                        ycf2 protein_coding
#>   [27654]      <NA> gene:ATCG01300                       rpl23 protein_coding
#>   [27655]      <NA> gene:ATCG01310                        rpl2 protein_coding
#>                      description     gene_id  logic_name          Parent
#>                      <character> <character> <character> <CharacterList>
#>       [1] NAC domain containin..   AT1G01010   araport11                
#>       [2] ARV1 family protein ..   AT1G01020   araport11                
#>       [3] AP2/B3-like transcri..   AT1G01030   araport11                
#>       [4] dicer-like 1 [Source..   AT1G01040   araport11                
#>       [5] pyrophosphorylase 1 ..   AT1G01050   araport11                
#>       ...                    ...         ...         ...             ...
#>   [27651]                   <NA>   ATCG01250   araport11                
#>   [27652]                   <NA>   ATCG01270   araport11                
#>   [27653] Ycf2 [Source:NCBI ge..   ATCG01280   araport11                
#>   [27654] ribosomal protein L2..   ATCG01300   araport11                
#>   [27655] ribosomal protein L2..   ATCG01310   araport11                
#>                   tag transcript_id constitutive ensembl_end_phase
#>           <character>   <character>  <character>       <character>
#>       [1]        <NA>          <NA>         <NA>              <NA>
#>       [2]        <NA>          <NA>         <NA>              <NA>
#>       [3]        <NA>          <NA>         <NA>              <NA>
#>       [4]        <NA>          <NA>         <NA>              <NA>
#>       [5]        <NA>          <NA>         <NA>              <NA>
#>       ...         ...           ...          ...               ...
#>   [27651]        <NA>          <NA>         <NA>              <NA>
#>   [27652]        <NA>          <NA>         <NA>              <NA>
#>   [27653]        <NA>          <NA>         <NA>              <NA>
#>   [27654]        <NA>          <NA>         <NA>              <NA>
#>   [27655]        <NA>          <NA>         <NA>              <NA>
#>           ensembl_phase     exon_id        rank  protein_id Is_circular
#>             <character> <character> <character> <character> <character>
#>       [1]          <NA>        <NA>        <NA>        <NA>        <NA>
#>       [2]          <NA>        <NA>        <NA>        <NA>        <NA>
#>       [3]          <NA>        <NA>        <NA>        <NA>        <NA>
#>       [4]          <NA>        <NA>        <NA>        <NA>        <NA>
#>       [5]          <NA>        <NA>        <NA>        <NA>        <NA>
#>       ...           ...         ...         ...         ...         ...
#>   [27651]          <NA>        <NA>        <NA>        <NA>        <NA>
#>   [27652]          <NA>        <NA>        <NA>        <NA>        <NA>
#>   [27653]          <NA>        <NA>        <NA>        <NA>        <NA>
#>   [27654]          <NA>        <NA>        <NA>        <NA>        <NA>
#>   [27655]          <NA>        <NA>        <NA>        <NA>        <NA>
#>   -------
#>   seqinfo: 7 sequences from an unspecified genome; no seqlengths

Next, we can use the promoters() function from GenomicRanges to extract the genomic coordinates of promoters. By default, promoters start at 2000 bp upstream the transcription start site (TSS), and end at 200 bp downstream the TSS, but you can modify these positions in parameters upstream and downstream of promoters(). For example, some transcription factor families in plants (e.g. HSF and C2C2-GATA) bind exclusively to regions downstream the TSS, which is contrary to the common view of TFBS being located upstream TSSs. Importantly, if a gene has multiple TSSs (e.g., because of different transcripts starting at different positions), the most upstream TSS is used, not a cleverly chosen ‘representative’ or ‘canonical’ transcript.

# Get promoter regions from a `GRanges` object
prom_genes <- GenomicRanges::promoters(genes)

prom_genes
#> GRanges object with 27655 ranges and 21 metadata columns:
#>           seqnames        ranges strand |    source     type     score
#>              <Rle>     <IRanges>  <Rle> |  <factor> <factor> <numeric>
#>       [1]        1     1631-3830      + | araport11     gene        NA
#>       [2]        1    8931-11130      - | araport11     gene        NA
#>       [3]        1   13515-15714      - | araport11     gene        NA
#>       [4]        1   21121-23320      + | araport11     gene        NA
#>       [5]        1   32972-35171      - | araport11     gene        NA
#>       ...      ...           ...    ... .       ...      ...       ...
#>   [27651]       Pt 139854-142053      + | araport11     gene        NA
#>   [27652]       Pt 144955-147154      - | araport11     gene        NA
#>   [27653]       Pt 151976-154175      - | araport11     gene        NA
#>   [27654]       Pt 150506-152705      + | araport11     gene        NA
#>   [27655]       Pt 150806-153005      + | araport11     gene        NA
#>               phase             ID           Alias        Name        biotype
#>           <integer>    <character> <CharacterList> <character>    <character>
#>       [1]      <NA> gene:AT1G01010                      NAC001 protein_coding
#>       [2]      <NA> gene:AT1G01020                        ARV1 protein_coding
#>       [3]      <NA> gene:AT1G01030                        NGA3 protein_coding
#>       [4]      <NA> gene:AT1G01040                        DCL1 protein_coding
#>       [5]      <NA> gene:AT1G01050                        PPa1 protein_coding
#>       ...       ...            ...             ...         ...            ...
#>   [27651]      <NA> gene:ATCG01250                        ndhB protein_coding
#>   [27652]      <NA> gene:ATCG01270                     ycf15-B protein_coding
#>   [27653]      <NA> gene:ATCG01280                        ycf2 protein_coding
#>   [27654]      <NA> gene:ATCG01300                       rpl23 protein_coding
#>   [27655]      <NA> gene:ATCG01310                        rpl2 protein_coding
#>                      description     gene_id  logic_name          Parent
#>                      <character> <character> <character> <CharacterList>
#>       [1] NAC domain containin..   AT1G01010   araport11                
#>       [2] ARV1 family protein ..   AT1G01020   araport11                
#>       [3] AP2/B3-like transcri..   AT1G01030   araport11                
#>       [4] dicer-like 1 [Source..   AT1G01040   araport11                
#>       [5] pyrophosphorylase 1 ..   AT1G01050   araport11                
#>       ...                    ...         ...         ...             ...
#>   [27651]                   <NA>   ATCG01250   araport11                
#>   [27652]                   <NA>   ATCG01270   araport11                
#>   [27653] Ycf2 [Source:NCBI ge..   ATCG01280   araport11                
#>   [27654] ribosomal protein L2..   ATCG01300   araport11                
#>   [27655] ribosomal protein L2..   ATCG01310   araport11                
#>                   tag transcript_id constitutive ensembl_end_phase
#>           <character>   <character>  <character>       <character>
#>       [1]        <NA>          <NA>         <NA>              <NA>
#>       [2]        <NA>          <NA>         <NA>              <NA>
#>       [3]        <NA>          <NA>         <NA>              <NA>
#>       [4]        <NA>          <NA>         <NA>              <NA>
#>       [5]        <NA>          <NA>         <NA>              <NA>
#>       ...         ...           ...          ...               ...
#>   [27651]        <NA>          <NA>         <NA>              <NA>
#>   [27652]        <NA>          <NA>         <NA>              <NA>
#>   [27653]        <NA>          <NA>         <NA>              <NA>
#>   [27654]        <NA>          <NA>         <NA>              <NA>
#>   [27655]        <NA>          <NA>         <NA>              <NA>
#>           ensembl_phase     exon_id        rank  protein_id Is_circular
#>             <character> <character> <character> <character> <character>
#>       [1]          <NA>        <NA>        <NA>        <NA>        <NA>
#>       [2]          <NA>        <NA>        <NA>        <NA>        <NA>
#>       [3]          <NA>        <NA>        <NA>        <NA>        <NA>
#>       [4]          <NA>        <NA>        <NA>        <NA>        <NA>
#>       [5]          <NA>        <NA>        <NA>        <NA>        <NA>
#>       ...           ...         ...         ...         ...         ...
#>   [27651]          <NA>        <NA>        <NA>        <NA>        <NA>
#>   [27652]          <NA>        <NA>        <NA>        <NA>        <NA>
#>   [27653]          <NA>        <NA>        <NA>        <NA>        <NA>
#>   [27654]          <NA>        <NA>        <NA>        <NA>        <NA>
#>   [27655]          <NA>        <NA>        <NA>        <NA>        <NA>
#>   -------
#>   seqinfo: 7 sequences from an unspecified genome; no seqlengths

Alternatively, if you want to extract promoters for each transcript separately, you can subset the GRanges object to extract only ranges of type ‘transcript’, and then use promoters() as demonstrated above. Note that GFF3 files from some databases do not contain a ‘transcript’ type. In these cases, you will most likely find transcript-level ranges in rows of type ‘mRNA’. This is the case for Ensembl Plants, from where we obtained our example data. To extract promoters for each A. thaliana transcript, you’d execute:

# Extract transcript ranges from a GRanges object
tx_ranges <- annotation[annotation$type == "mRNA"] # or 'transcript' (if any)

# Extract promoters for each transcript separately
prom_tx <- promoters(tx_ranges)
prom_tx
#> GRanges object with 48359 ranges and 21 metadata columns:
#>           seqnames        ranges strand |    source     type     score
#>              <Rle>     <IRanges>  <Rle> |  <factor> <factor> <numeric>
#>       [1]        1     1631-3830      + | araport11     mRNA        NA
#>       [2]        1    8538-10737      - | araport11     mRNA        NA
#>       [3]        1    8538-10737      - | araport11     mRNA        NA
#>       [4]        1    8931-11130      - | araport11     mRNA        NA
#>       [5]        1    8931-11130      - | araport11     mRNA        NA
#>       ...      ...           ...    ... .       ...      ...       ...
#>   [48355]       Pt 139854-142053      + | araport11     mRNA        NA
#>   [48356]       Pt 144955-147154      - | araport11     mRNA        NA
#>   [48357]       Pt 151976-154175      - | araport11     mRNA        NA
#>   [48358]       Pt 150506-152705      + | araport11     mRNA        NA
#>   [48359]       Pt 150806-153005      + | araport11     mRNA        NA
#>               phase                     ID           Alias        Name
#>           <integer>            <character> <CharacterList> <character>
#>       [1]      <NA> transcript:AT1G01010.1                  NAC001-201
#>       [2]      <NA> transcript:AT1G01020.2                    ARV1-202
#>       [3]      <NA> transcript:AT1G01020.6                    ARV1-201
#>       [4]      <NA> transcript:AT1G01020.1                    ARV1-206
#>       [5]      <NA> transcript:AT1G01020.4                    ARV1-205
#>       ...       ...                    ...             ...         ...
#>   [48355]      <NA> transcript:ATCG01250.1                    ndhB-201
#>   [48356]      <NA> transcript:ATCG01270.1                 ycf15-B-201
#>   [48357]      <NA> transcript:ATCG01280.1                    ycf2-201
#>   [48358]      <NA> transcript:ATCG01300.1                   rpl23-201
#>   [48359]      <NA> transcript:ATCG01310.1                    rpl2-201
#>                  biotype description     gene_id  logic_name          Parent
#>              <character> <character> <character> <character> <CharacterList>
#>       [1] protein_coding        <NA>        <NA>        <NA>  gene:AT1G01010
#>       [2] protein_coding        <NA>        <NA>        <NA>  gene:AT1G01020
#>       [3] protein_coding        <NA>        <NA>        <NA>  gene:AT1G01020
#>       [4] protein_coding        <NA>        <NA>        <NA>  gene:AT1G01020
#>       [5] protein_coding        <NA>        <NA>        <NA>  gene:AT1G01020
#>       ...            ...         ...         ...         ...             ...
#>   [48355] protein_coding        <NA>        <NA>        <NA>  gene:ATCG01250
#>   [48356] protein_coding        <NA>        <NA>        <NA>  gene:ATCG01270
#>   [48357] protein_coding        <NA>        <NA>        <NA>  gene:ATCG01280
#>   [48358] protein_coding        <NA>        <NA>        <NA>  gene:ATCG01300
#>   [48359] protein_coding        <NA>        <NA>        <NA>  gene:ATCG01310
#>                         tag transcript_id constitutive ensembl_end_phase
#>                 <character>   <character>  <character>       <character>
#>       [1] Ensembl_canonical   AT1G01010.1         <NA>              <NA>
#>       [2]              <NA>   AT1G01020.2         <NA>              <NA>
#>       [3]              <NA>   AT1G01020.6         <NA>              <NA>
#>       [4] Ensembl_canonical   AT1G01020.1         <NA>              <NA>
#>       [5]              <NA>   AT1G01020.4         <NA>              <NA>
#>       ...               ...           ...          ...               ...
#>   [48355] Ensembl_canonical   ATCG01250.1         <NA>              <NA>
#>   [48356] Ensembl_canonical   ATCG01270.1         <NA>              <NA>
#>   [48357] Ensembl_canonical   ATCG01280.1         <NA>              <NA>
#>   [48358] Ensembl_canonical   ATCG01300.1         <NA>              <NA>
#>   [48359] Ensembl_canonical   ATCG01310.1         <NA>              <NA>
#>           ensembl_phase     exon_id        rank  protein_id Is_circular
#>             <character> <character> <character> <character> <character>
#>       [1]          <NA>        <NA>        <NA>        <NA>        <NA>
#>       [2]          <NA>        <NA>        <NA>        <NA>        <NA>
#>       [3]          <NA>        <NA>        <NA>        <NA>        <NA>
#>       [4]          <NA>        <NA>        <NA>        <NA>        <NA>
#>       [5]          <NA>        <NA>        <NA>        <NA>        <NA>
#>       ...           ...         ...         ...         ...         ...
#>   [48355]          <NA>        <NA>        <NA>        <NA>        <NA>
#>   [48356]          <NA>        <NA>        <NA>        <NA>        <NA>
#>   [48357]          <NA>        <NA>        <NA>        <NA>        <NA>
#>   [48358]          <NA>        <NA>        <NA>        <NA>        <NA>
#>   [48359]          <NA>        <NA>        <NA>        <NA>        <NA>
#>   -------
#>   seqinfo: 7 sequences from an unspecified genome; no seqlengths

Extracting promoter regions from TxDb objects

If you have a TxDb object with transcript coordinates and metadata, you can extract promoter regions for each transcript with the promoters() function from the GenomicFeatures package. Note that function names are the same, but GenomicFeatures::promoters() takes TxDb objects as input, while GenomicRanges::promoters() takes GRanges objects as input.

# Create a `TxDb` object from a `GRanges` object
txdb <- txdbmaker::makeTxDbFromGRanges(annotation)
#> Warning in .makeTxDb_normarg_chrominfo(chrominfo): genome version information
#> is not available for this TxDb object

# Get promoter regions for all transcripts
prom_tx2 <- GenomicFeatures::promoters(txdb)

prom_tx2
#> GRanges object with 54013 ranges and 2 metadata columns:
#>               seqnames        ranges strand |     tx_id     tx_name
#>                  <Rle>     <IRanges>  <Rle> | <integer> <character>
#>   AT1G01010.1        1     1631-3830      + |         1 AT1G01010.1
#>   AT1G03987.1        1    9101-11300      + |         2 AT1G03987.1
#>   AT1G01040.1        1   21121-23320      + |         3 AT1G01040.1
#>   AT1G01040.2        1   21416-23615      + |         4 AT1G01040.2
#>     at1g01046        1   26500-28699      + |         5   at1g01046
#>           ...      ...           ...    ... .       ...         ...
#>   ATCG01200.1       Pt 135649-137848      - |     54009 ATCG01200.1
#>   ATCG01210.1       Pt 137438-139637      - |     54010 ATCG01210.1
#>   ATCG01220.1       Pt 137741-139940      - |     54011 ATCG01220.1
#>   ATCG01270.1       Pt 144955-147154      - |     54012 ATCG01270.1
#>   ATCG01280.1       Pt 151976-154175      - |     54013 ATCG01280.1
#>   -------
#>   seqinfo: 7 sequences from an unspecified genome; no seqlengths

Finally, you can extract sequences given a genome and some coordinates with the getSeq() function from BSgenome.

# Extract promoter sequences (1k genes only for demo purposes)
prom_seqs <- getSeq(genome, prom_genes[1:1000])

prom_seqs
#> DNAStringSet object of length 1000:
#>        width seq
#>    [1]  2200 TTCCGTCATTTGGAGATACGAAATCAAATCTC...GAGGAGCTCGTTGGTCACTATCTCCGTAACAA
#>    [2]  2200 CAAAAGGTATGATATGTTTTCACATTGTTTGA...TATAATATGTTTCATTCATGAACCTACGCATT
#>    [3]  2200 TAATTAACAGTTTAACACACTTACGTGATTAA...ATTTGCTTAGCTCTTTTCTTTCCTTGATTTCA
#>    [4]  2200 GCTAAAGAGACGGCGCTCTTGTCATTTTAAGT...TCTTCTTCGTGACCCCTTTTTACCTGCAAACA
#>    [5]  2200 TTTCCAGCTTGTCATTCACAGGATGATTACCG...GTTCTAAGATCATCAAAGTTTGATGCTTTTCC
#>    ...   ... ...
#>  [996]  2200 GTTTGTGTCTGCTTATATGCTCCAACGATTTG...CATCTCGTTCCTCCACCAACGCATCCACAACA
#>  [997]  2200 CTCATGTAACCATATATCCACTAATAAAGAGA...TATCCACTTACCTTTTTCGATTTCGTTTGATC
#>  [998]  2200 TTCTCTGGCCGAACAAACCCATCAATGCTCCA...CCAACGAAAAATCTTAGCTTAACACCATCGCG
#>  [999]  2200 TAAATCTATGTTAGATACGTTCACTAGTCTTA...GAGCTATAATAGCATTGTTATAAGATTAGCAA
#> [1000]  2200 AGCTATTCTTGCTTACCAGTCCATTAATGGTC...ATTGTGGCGGCGAGGACCGTTCTGATAACGGG

Session info

Click to display session info

sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-apple-darwin20
#> Running under: macOS Ventura 13.7.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: UTC
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] BSgenome_1.77.0        BiocIO_1.19.0          txdbmaker_1.5.4       
#>  [4] GenomicFeatures_1.61.3 AnnotationDbi_1.71.0   Biobase_2.69.0        
#>  [7] rtracklayer_1.69.0     GenomicRanges_1.61.0   Biostrings_2.77.1     
#> [10] GenomeInfoDb_1.45.3    XVector_0.49.0         IRanges_2.43.0        
#> [13] S4Vectors_0.47.0       BiocGenerics_0.55.0    generics_0.1.4        
#> [16] BiocStyle_2.37.0      
#> 
#> loaded via a namespace (and not attached):
#>  [1] KEGGREST_1.49.0             SummarizedExperiment_1.39.0
#>  [3] httr2_1.1.2                 rjson_0.2.23               
#>  [5] xfun_0.52                   lattice_0.22-7             
#>  [7] vctrs_0.6.5                 tools_4.5.1                
#>  [9] bitops_1.0-9                curl_6.2.3                 
#> [11] parallel_4.5.1              tibble_3.2.1               
#> [13] RSQLite_2.3.11              blob_1.2.4                 
#> [15] pkgconfig_2.0.3             Matrix_1.7-3               
#> [17] dbplyr_2.5.0                lifecycle_1.0.4            
#> [19] stringr_1.5.1               compiler_4.5.1             
#> [21] Rsamtools_2.25.0            progress_1.2.3             
#> [23] codetools_0.2-20            htmltools_0.5.8.1          
#> [25] RCurl_1.98-1.17             yaml_2.3.10                
#> [27] pillar_1.10.2               crayon_1.5.3               
#> [29] BiocParallel_1.43.2         DelayedArray_0.35.1        
#> [31] cachem_1.1.0                abind_1.4-8                
#> [33] tidyselect_1.2.1            digest_0.6.37              
#> [35] stringi_1.8.7               dplyr_1.1.4                
#> [37] restfulr_0.0.15             biomaRt_2.65.0             
#> [39] fastmap_1.2.0               grid_4.5.1                 
#> [41] cli_3.6.5                   SparseArray_1.9.0          
#> [43] magrittr_2.0.3              S4Arrays_1.9.0             
#> [45] XML_3.99-0.18               filelock_1.0.3             
#> [47] rappdirs_0.3.3              prettyunits_1.2.0          
#> [49] UCSC.utils_1.5.0            bit64_4.6.0-1              
#> [51] rmarkdown_2.29              httr_1.4.7                 
#> [53] matrixStats_1.5.0           bit_4.6.0                  
#> [55] png_0.1-8                   hms_1.1.3                  
#> [57] memoise_2.0.1               evaluate_1.0.3             
#> [59] knitr_1.50                  BiocFileCache_2.99.5       
#> [61] rlang_1.1.6                 glue_1.8.0                 
#> [63] DBI_1.2.3                   xml2_1.3.8                 
#> [65] BiocManager_1.30.25         jsonlite_2.0.0             
#> [67] R6_2.6.1                    MatrixGenerics_1.21.0      
#> [69] GenomicAlignments_1.45.0

How to extract promoter sequences

Bioconductor packages used in this document

How to extract promoter sequences

Further reading

Session info