Skip to contents

A gene set is simple yet useful way to define pathways without regard to the specific molecular interactions. It can also represent the signature of a characteristic pattern of gene expression associated with a cell type or a disease for diagnostic or prognostic purposes. An important source of gene sets is the Molecular Signatures Database (MSigDB), which stores them as plain text files following the so-called gene matrix transposed (GMT) format. In the GMT format, each line stores a gene set with the following values separated by tabs:

  • A unique gene set identifier.
  • A gene set description.
  • One or more gene identifiers.

Because each different gene set may consist of a different number of genes, each line in a GMT file may contain a different number of tab-separated values. This means that the GMT format is not a tabular format, and therefore cannot be directly read with base R functions such as read.table() or read.csv().

Bioconductor packages used in this document

How to read gene sets from GMT files

The package GSEABase provides object classes and methods to represent and manipulate gene sets, including reading GMT files with the function getGmt(). Here is a first example in which we read a GMT file from the MSigDB database:

library(GSEABase)

URL <- paste("https://data.broadinstitute.org/gsea-msigdb/msigdb/release",
             "2023.2.Hs/c7.immunesigdb.v2023.2.Hs.symbols.gmt", sep="/")
genesets <- getGmt(URL, geneIdType=SymbolIdentifier())
genesets
#> GeneSetCollection
#>   names: GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_DN, GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_UP, ..., KAECH_NAIVE_VS_MEMORY_CD8_TCELL_UP (4872 total)
#>   unique identifiers: ABCA2, ABCC5, ..., LINC00841 (20457 total)
#>   types in collection:
#>     geneIdType: SymbolIdentifier (1 total)
#>     collectionType: NullCollection (1 total)
length(genesets)
#> [1] 4872

Next to the filename of the GMT file, we provided the argument geneIdType=SymbolIdentifier(). This argument adds the necessary metadata that later on allows other software to figure out what kind of gene identifiers are used in this collection of gene sets, to attempt mapping them to other type of identifiers, if necessary. While this argument is optional, we should always try to provide it.

The output of the getGmt() function is a GeneSetCollection object, which pretty much behaves like a list object and many Bioconductor packages analysing gene sets may expect that kind of object as input. However, we can coerce it to a regular list object with the function geneIds():

genesets.list <- geneIds(genesets)
head(lapply(genesets.list, head, 3), 3)
#> $GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_DN
#> [1] "ABCA2"   "ABCC5"   "ABHD14A"
#> 
#> $GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_UP
#> [1] "ADAM8" "AK3"   "ANLN" 
#> 
#> $GOLDRATH_NAIVE_VS_EFF_CD8_TCELL_DN
#> [1] "ACOT7" "ADAM8" "ANLN"

By definition, GMT files should have unique gene set identifiers and if we attempt to read with getGmt() a malformed GMT file with duplicated identifiers it will prompt an error:

URL <- paste("https://data.broadinstitute.org/gsea-msigdb/msigdb/release",
             "7.5/c2.all.v7.5.symbols.gmt", sep="/")
genesets <- getGmt(URL, geneIdType=SymbolIdentifier())
#> Error in validObject(.Object): invalid class "GeneSetCollection" object: each setName must be distinct

So if we still want to read the contents of that GMT file, we would have to download it and edit the lines with duplicated identifiers. An alternative is to use the function readGMT() of the GSVA package:

library(GSVA)

genesets <- readGMT(URL)
#> Warning in deduplicateGmtLines(lines, deduplUse): GMT contains duplicated gene
#> set names; deduplicated using method: first
genesets
#> GeneSetCollection
#>   names: CORONEL_RFX7_DIRECT_TARGETS_UP, FOROUTAN_TGFB_EMT_UP, ..., SA_TRKA_RECEPTOR (6363 total)
#>   unique identifiers: ABAT, DIP2A, ..., IGHG3 (21728 total)
#>   types in collection:
#>     geneIdType: SymbolIdentifier (1 total)
#>     collectionType: NullCollection (1 total)

As we may see, it has detected the presence duplicated gene set identifiers and applied a deduplication policy, which can be changed through the parameter dedupUse; see its help page with ?readGMT for further deduplication policies. As getGmt(), it also takes a parameter geneIdType, but by default it will automatically detect the type of gene identifier and fill in the corresponding metadata in the resulting GeneSetCollection object, so that in principle we need not to specify it.

Further reading

  • The Introduction to GSEABase vignette from the GSEABase package.
  • The help pages of the getGmt() and readGMT() functions in the GSEABase and GSVA packages, respectively.
  • The Bioconductor package msigdb provides an API for directly querying and downloading MSigDB GMT files as GeneSetCollection objects.
  • Similarly to msigdb, the msigdbr CRAN package provides an API to query and download MSigDB GMT files, but as tidy data.frame objects with one gene per row.

Session info

Click to display session info
sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-apple-darwin20
#> Running under: macOS Ventura 13.7.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: UTC
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] GSVA_2.3.1           GSEABase_1.71.0      graph_1.87.0        
#>  [4] annotate_1.87.0      XML_3.99-0.18        AnnotationDbi_1.71.0
#>  [7] IRanges_2.43.0       S4Vectors_0.47.0     Biobase_2.69.0      
#> [10] BiocGenerics_0.55.0  generics_0.1.4       BiocStyle_2.37.0    
#> 
#> loaded via a namespace (and not attached):
#>  [1] KEGGREST_1.49.0             SummarizedExperiment_1.39.0
#>  [3] rjson_0.2.23                xfun_0.52                  
#>  [5] rhdf5_2.53.0                lattice_0.22-7             
#>  [7] rhdf5filters_1.21.0         vctrs_0.6.5                
#>  [9] tools_4.5.1                 parallel_4.5.1             
#> [11] RSQLite_2.3.11              blob_1.2.4                 
#> [13] Matrix_1.7-3                sparseMatrixStats_1.21.0   
#> [15] compiler_4.5.1              Biostrings_2.77.1          
#> [17] codetools_0.2-20            GenomeInfoDb_1.45.3        
#> [19] htmltools_0.5.8.1           yaml_2.3.10                
#> [21] crayon_1.5.3                BiocParallel_1.43.2        
#> [23] SingleCellExperiment_1.31.0 DelayedArray_0.35.1        
#> [25] cachem_1.1.0                magick_2.8.6               
#> [27] abind_1.4-8                 rsvd_1.0.5                 
#> [29] digest_0.6.37               BiocSingular_1.25.0        
#> [31] fastmap_1.2.0               grid_4.5.1                 
#> [33] cli_3.6.5                   SparseArray_1.9.0          
#> [35] magrittr_2.0.3              S4Arrays_1.9.0             
#> [37] h5mread_1.1.1               UCSC.utils_1.5.0           
#> [39] bit64_4.6.0-1               rmarkdown_2.29             
#> [41] XVector_0.49.0              httr_1.4.7                 
#> [43] matrixStats_1.5.0           bit_4.6.0                  
#> [45] png_0.1-8                   ScaledMatrix_1.17.0        
#> [47] SpatialExperiment_1.19.1    beachmat_2.25.0            
#> [49] memoise_2.0.1               HDF5Array_1.37.0           
#> [51] evaluate_1.0.3              knitr_1.50                 
#> [53] GenomicRanges_1.61.0        irlba_2.3.5.1              
#> [55] rlang_1.1.6                 Rcpp_1.0.14                
#> [57] xtable_1.8-4                DBI_1.2.3                  
#> [59] BiocManager_1.30.25         jsonlite_2.0.0             
#> [61] R6_2.6.1                    Rhdf5lib_1.31.0            
#> [63] MatrixGenerics_1.21.0