library(GSEABase)
URL <- paste("https://data.broadinstitute.org/gsea-msigdb/msigdb/release",
"2023.2.Hs/c7.immunesigdb.v2023.2.Hs.symbols.gmt", sep="/")
genesets <- getGmt(URL, geneIdType=SymbolIdentifier())
genesets
#> GeneSetCollection
#> names: GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_DN, GOLDRATH_EFF_VS_MEMORY_CD8_TCELL_UP, ..., KAECH_NAIVE_VS_MEMORY_CD8_TCELL_UP (4872 total)
#> unique identifiers: ABCA2, ABCC5, ..., LINC00841 (20457 total)
#> types in collection:
#> geneIdType: SymbolIdentifier (1 total)
#> collectionType: NullCollection (1 total)
length(genesets)
#> [1] 4872
How to read gene sets from GMT files
2025-06-19
Source:vignettes/how-to-read-gene-sets-from-gmt-files.qmd
A gene set is simple yet useful way to define pathways without regard to the specific molecular interactions. It can also represent the signature of a characteristic pattern of gene expression associated with a cell type or a disease for diagnostic or prognostic purposes. An important source of gene sets is the Molecular Signatures Database (MSigDB), which stores them as plain text files following the so-called gene matrix transposed (GMT) format. In the GMT format, each line stores a gene set with the following values separated by tabs:
- A unique gene set identifier.
- A gene set description.
- One or more gene identifiers.
Because each different gene set may consist of a different number of genes, each line in a GMT file may contain a different number of tab-separated values. This means that the GMT format is not a tabular format, and therefore cannot be directly read with base R functions such as read.table()
or read.csv()
.
Bioconductor packages used in this document
How to read gene sets from GMT files
The package GSEABase provides object classes and methods to represent and manipulate gene sets, including reading GMT files with the function getGmt()
. Here is a first example in which we read a GMT file from the MSigDB database:
Next to the filename of the GMT file, we provided the argument geneIdType=SymbolIdentifier()
. This argument adds the necessary metadata that later on allows other software to figure out what kind of gene identifiers are used in this collection of gene sets, to attempt mapping them to other type of identifiers, if necessary. While this argument is optional, we should always try to provide it.
The output of the getGmt()
function is a GeneSetCollection
object, which pretty much behaves like a list object and many Bioconductor packages analysing gene sets may expect that kind of object as input. However, we can coerce it to a regular list
object with the function geneIds()
:
By definition, GMT files should have unique gene set identifiers and if we attempt to read with getGmt()
a malformed GMT file with duplicated identifiers it will prompt an error:
URL <- paste("https://data.broadinstitute.org/gsea-msigdb/msigdb/release",
"7.5/c2.all.v7.5.symbols.gmt", sep="/")
genesets <- getGmt(URL, geneIdType=SymbolIdentifier())
#> Error in validObject(.Object): invalid class "GeneSetCollection" object: each setName must be distinct
So if we still want to read the contents of that GMT file, we would have to download it and edit the lines with duplicated identifiers. An alternative is to use the function readGMT()
of the GSVA package:
library(GSVA)
genesets <- readGMT(URL)
#> Warning in deduplicateGmtLines(lines, deduplUse): GMT contains duplicated gene
#> set names; deduplicated using method: first
genesets
#> GeneSetCollection
#> names: CORONEL_RFX7_DIRECT_TARGETS_UP, FOROUTAN_TGFB_EMT_UP, ..., SA_TRKA_RECEPTOR (6363 total)
#> unique identifiers: ABAT, DIP2A, ..., IGHG3 (21728 total)
#> types in collection:
#> geneIdType: SymbolIdentifier (1 total)
#> collectionType: NullCollection (1 total)
As we may see, it has detected the presence duplicated gene set identifiers and applied a deduplication policy, which can be changed through the parameter dedupUse
; see its help page with ?readGMT
for further deduplication policies. As getGmt()
, it also takes a parameter geneIdType
, but by default it will automatically detect the type of gene identifier and fill in the corresponding metadata in the resulting GeneSetCollection
object, so that in principle we need not to specify it.
Further reading
- The Introduction to GSEABase vignette from the GSEABase package.
- The help pages of the
getGmt()
andreadGMT()
functions in the GSEABase and GSVA packages, respectively. - The Bioconductor package msigdb provides an API for directly querying and downloading MSigDB GMT files as
GeneSetCollection
objects. - Similarly to msigdb, the msigdbr CRAN package provides an API to query and download MSigDB GMT files, but as tidy
data.frame
objects with one gene per row.
Session info
Click to display session info
sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-apple-darwin20
#> Running under: macOS Ventura 13.7.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: UTC
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] GSVA_2.3.1 GSEABase_1.71.0 graph_1.87.0
#> [4] annotate_1.87.0 XML_3.99-0.18 AnnotationDbi_1.71.0
#> [7] IRanges_2.43.0 S4Vectors_0.47.0 Biobase_2.69.0
#> [10] BiocGenerics_0.55.0 generics_0.1.4 BiocStyle_2.37.0
#>
#> loaded via a namespace (and not attached):
#> [1] KEGGREST_1.49.0 SummarizedExperiment_1.39.0
#> [3] rjson_0.2.23 xfun_0.52
#> [5] rhdf5_2.53.0 lattice_0.22-7
#> [7] rhdf5filters_1.21.0 vctrs_0.6.5
#> [9] tools_4.5.1 parallel_4.5.1
#> [11] RSQLite_2.3.11 blob_1.2.4
#> [13] Matrix_1.7-3 sparseMatrixStats_1.21.0
#> [15] compiler_4.5.1 Biostrings_2.77.1
#> [17] codetools_0.2-20 GenomeInfoDb_1.45.3
#> [19] htmltools_0.5.8.1 yaml_2.3.10
#> [21] crayon_1.5.3 BiocParallel_1.43.2
#> [23] SingleCellExperiment_1.31.0 DelayedArray_0.35.1
#> [25] cachem_1.1.0 magick_2.8.6
#> [27] abind_1.4-8 rsvd_1.0.5
#> [29] digest_0.6.37 BiocSingular_1.25.0
#> [31] fastmap_1.2.0 grid_4.5.1
#> [33] cli_3.6.5 SparseArray_1.9.0
#> [35] magrittr_2.0.3 S4Arrays_1.9.0
#> [37] h5mread_1.1.1 UCSC.utils_1.5.0
#> [39] bit64_4.6.0-1 rmarkdown_2.29
#> [41] XVector_0.49.0 httr_1.4.7
#> [43] matrixStats_1.5.0 bit_4.6.0
#> [45] png_0.1-8 ScaledMatrix_1.17.0
#> [47] SpatialExperiment_1.19.1 beachmat_2.25.0
#> [49] memoise_2.0.1 HDF5Array_1.37.0
#> [51] evaluate_1.0.3 knitr_1.50
#> [53] GenomicRanges_1.61.0 irlba_2.3.5.1
#> [55] rlang_1.1.6 Rcpp_1.0.14
#> [57] xtable_1.8-4 DBI_1.2.3
#> [59] BiocManager_1.30.25 jsonlite_2.0.0
#> [61] R6_2.6.1 Rhdf5lib_1.31.0
#> [63] MatrixGenerics_1.21.0