MSigDF: Molecular Signature Database (MSigDB) in a Data Frame

Stephen D. Turner1*

1University of Virginia, Charlottesville VA, United States

3 December 2018

Abstract

This data package contains the Molecular Signature Database (MSigDB) for both human and predicted mouse orthologs in separate data frames (tibbles). Each data frame (msigdf.human and msigdf.mouse) contains three columns: the collection (Hallmark, or c1-c7), the gene set, and Entrez IDs for genes in that set. The msigdf.urls tibble contains links to each gene set’s description on the Broad Institute’s website. Source code available on GitHub.

Package

msigdf 5.2

1 Data sources
2 Example usage
3 Further exploration
Session info

1 Data sources

Original data from the Broad Institute’s Molecular Signature Database (MSigDB; http://www.broad.mit.edu/gsea/msigdb/index.jsp), redistributed as separate R data files containing named lists of gene sets, available from WEHI (http://bioinf.wehi.edu.au/software/MSigDB/). The following description applies to the R-formatted data:

The gene sets contained in the MSigDB are from a wide variety of sources, and relate to a variety of species, mostly human. Our work at the WEHI predominately uses mouse models of human disease. To facilitate use of the MSigDB in our work, we have created a pure mouse version of the MSigDB by mapping all sets to mouse orthologs. A pure human version is also provided.

Procedure:

1. The current MSigDB v5.2 XML file was downloaded.
2. Human Entrez Gene IDs were mapped to Mouse Entrez Gene IDs, using the HGNC Comparison of Orthology Predictions (HCOP) (downloaded 11 October 2016).
3. Each collection was converted to a list in R, and written to an RData file using save().

The script in data-raw/ shows how the data frames (tibbles) were created.
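As a rough illustration of the reshaping involved, the sketch below turns a named list of gene sets into a long data frame with one row per gene per gene set. It is not the script in data-raw/: the object names (toy_collection, toy_long), the set names, and the IDs are all made up, and it uses base R only so that it runs without the package installed.

```r
# Toy stand-in for one collection: a named list of Entrez ID vectors.
# The set names and IDs are invented for illustration.
toy_collection <- list(
  TOY_SET_A = c(101L, 102L, 103L),
  TOY_SET_B = c(102L, 204L)
)

# One row per (gene set, gene) pair: repeat each set name once per gene.
toy_long <- data.frame(
  collection = "toy",
  geneset    = rep(names(toy_collection), lengths(toy_collection)),
  entrez     = unlist(toy_collection, use.names = FALSE),
  stringsAsFactors = FALSE
)

toy_long
```

The resulting columns mirror the collection, geneset, and entrez layout described in this vignette.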

2 Example usage

There are three data frames (tibbles) in this package. The msigdf.human data frame has three columns: the MSigDB collection (c1-c7 or hallmark), the gene set name, and the Entrez gene ID, with one row per gene per gene set. The msigdf.mouse data frame has the same structure for mouse orthologs. The msigdf.urls data frame links the name of each gene set to its URL on the Broad’s website.

The data sets in this package have several million rows. The package imports the tibble package so they’re displayed nicely.

library(tidyverse)
library(msigdf)

Take a look:

msigdf.human %>% head()

## # A tibble: 6 x 3
##   collection geneset entrez
##   <chr>      <chr>    <int>
## 1 c1         chr5q23   5759
## 2 c1         chr5q23  94033
## 3 c1         chr5q23  51334
## 4 c1         chr5q23 153163
## 5 c1         chr5q23 133615
## 6 c1         chr5q23 402229

msigdf.mouse %>% head()

## # A tibble: 6 x 3
##   collection geneset    entrez
##   <chr>      <chr>       <int>
## 1 c1         chr5q23 100042150
## 2 c1         chr5q23    103236
## 3 c1         chr5q23    106869
## 4 c1         chr5q23    107022
## 5 c1         chr5q23    109700
## 6 c1         chr5q23     11548

msigdf.urls %>% as.data.frame() %>% head()

##   collection  geneset
## 1         c1  chr5q23
## 2         c1 chr16q24
## 3         c1  chr8q24
## 4         c1 chr13q11
## 5         c1  chr7p21
## 6         c1 chr10q23
##                                                             url
## 1  http://software.broadinstitute.org/gsea/msigdb/cards/chr5q23
## 2 http://software.broadinstitute.org/gsea/msigdb/cards/chr16q24
## 3  http://software.broadinstitute.org/gsea/msigdb/cards/chr8q24
## 4 http://software.broadinstitute.org/gsea/msigdb/cards/chr13q11
## 5  http://software.broadinstitute.org/gsea/msigdb/cards/chr7p21
## 6 http://software.broadinstitute.org/gsea/msigdb/cards/chr10q23

Just get the entries for the KEGG non-homologous end joining pathway:

msigdf.human %>% 
  filter(geneset=="KEGG_NON_HOMOLOGOUS_END_JOINING")

## # A tibble: 14 x 3
##    collection geneset                         entrez
##    <chr>      <chr>                            <int>
##  1 c2         KEGG_NON_HOMOLOGOUS_END_JOINING   7518
##  2 c2         KEGG_NON_HOMOLOGOUS_END_JOINING   4361
##  3 c2         KEGG_NON_HOMOLOGOUS_END_JOINING  27343
##  4 c2         KEGG_NON_HOMOLOGOUS_END_JOINING  27434
##  5 c2         KEGG_NON_HOMOLOGOUS_END_JOINING 731751
##  6 c2         KEGG_NON_HOMOLOGOUS_END_JOINING  79840
##  7 c2         KEGG_NON_HOMOLOGOUS_END_JOINING   3981
##  8 c2         KEGG_NON_HOMOLOGOUS_END_JOINING   2237
##  9 c2         KEGG_NON_HOMOLOGOUS_END_JOINING   1791
## 10 c2         KEGG_NON_HOMOLOGOUS_END_JOINING   7520
## 11 c2         KEGG_NON_HOMOLOGOUS_END_JOINING  10111
## 12 c2         KEGG_NON_HOMOLOGOUS_END_JOINING   2547
## 13 c2         KEGG_NON_HOMOLOGOUS_END_JOINING   5591
## 14 c2         KEGG_NON_HOMOLOGOUS_END_JOINING  64421

Some software (e.g., GAGE) might require gene sets to be a named list of Entrez IDs, where the name of each element in the list is the name of the pathway. This is how the data was originally structured, and we can return to it by collapsing each gene set’s Entrez IDs into a list column with summarize() and converting the result to a named list with deframe(). Here, let’s use only the hallmark sets and, after converting the data into this named list format, get just the first few pathways, and in each of those, display just the first few Entrez IDs.

msigdf.human %>% 
  filter(collection=="hallmark") %>% 
  select(geneset, entrez) %>% 
  group_by(geneset) %>% 
  summarize(entrez=list(entrez)) %>% 
  deframe() %>% 
  head() %>% 
  map(head)

## $HALLMARK_ADIPOGENESIS
## [1] 2167 9370 5468 3991 8694 4023
## 
## $HALLMARK_ALLOGRAFT_REJECTION
## [1] 5788 3593 7040 3592  916  915
## 
## $HALLMARK_ANDROGEN_RESPONSE
## [1]   354  3817  2181  8554 10645  4824
## 
## $HALLMARK_ANGIOGENESIS
## [1]  1462 10631 11167  4043  6781  4023
## 
## $HALLMARK_APICAL_JUNCTION
## [1]     87   1366     89 149461   1739   7082
## 
## $HALLMARK_APICAL_SURFACE
## [1]  2683 51458  4118 27076  5314 50617
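If the tidyverse is not available, base R’s split() should give the same kind of named list. The sketch below uses a small made-up data frame (toy_long, with hypothetical set names and IDs) rather than the package data, so it runs on its own.

```r
# Toy long-format data; set names and IDs are invented for illustration.
toy_long <- data.frame(
  geneset = c("TOY_SET_B", "TOY_SET_A", "TOY_SET_A", "TOY_SET_B", "TOY_SET_A"),
  entrez  = c(204L, 101L, 102L, 102L, 103L),
  stringsAsFactors = FALSE
)

# split() groups the entrez vector by gene set name and returns a named list.
# List elements are ordered by sorted set name; IDs keep their row order.
toy_list <- split(toy_long$entrez, toy_long$geneset)

toy_list
```

Applied to the package data, the analogous call would split the entrez column of the filtered hallmark rows by their geneset column.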

3 Further exploration

For demonstration purposes, create a single object containing both human and mouse data:

msigdf <- bind_rows(
  msigdf.human %>% mutate(org="human"),
  msigdf.mouse %>% mutate(org="mouse")
)

head(msigdf)

## # A tibble: 6 x 4
##   collection geneset entrez org  
##   <chr>      <chr>    <int> <chr>
## 1 c1         chr5q23   5759 human
## 2 c1         chr5q23  94033 human
## 3 c1         chr5q23  51334 human
## 4 c1         chr5q23 153163 human
## 5 c1         chr5q23 133615 human
## 6 c1         chr5q23 402229 human

tail(msigdf)

## # A tibble: 6 x 4
##   collection geneset                      entrez org  
##   <chr>      <chr>                         <int> <chr>
## 1 hallmark   HALLMARK_PANCREAS_BETA_CELLS  56458 mouse
## 2 hallmark   HALLMARK_PANCREAS_BETA_CELLS  56529 mouse
## 3 hallmark   HALLMARK_PANCREAS_BETA_CELLS  66286 mouse
## 4 hallmark   HALLMARK_PANCREAS_BETA_CELLS  69019 mouse
## 5 hallmark   HALLMARK_PANCREAS_BETA_CELLS  77766 mouse
## 6 hallmark   HALLMARK_PANCREAS_BETA_CELLS  80976 mouse

The number of gene sets in each collection is the same for each organism:

msigdf %>%
  group_by(org, collection) %>%
  summarize(ngenesets=n_distinct(geneset)) %>%
  spread(org, ngenesets)

## # A tibble: 8 x 3
##   collection human mouse
##   <chr>      <int> <int>
## 1 c1           326   326
## 2 c2          4729  4729
## 3 c3           836   836
## 4 c4           858   858
## 5 c5          6166  6166
## 6 c6           189   189
## 7 c7          4872  4872
## 8 hallmark      50    50

But the number of rows (gene-to-gene-set pairs) is much greater for mouse in every collection except c1, due to the one-to-many ortholog mapping.

msigdf %>%
  count(org, collection) %>%
  spread(org, n)

## # A tibble: 8 x 3
##   collection  human   mouse
##   <chr>       <int>   <int>
## 1 c1          30288   29951
## 2 c2         441018  726058
## 3 c3         198275  325849
## 4 c4          91309  160394
## 5 c5         703598 1013231
## 6 c6          31319   56280
## 7 c7         948254 1588236
## 8 hallmark     7324   12067
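To see why a one-to-many mapping changes the row counts, consider a toy inner join. This is only a sketch: the tables, column names (mouse_entrez), and IDs are hypothetical and do not reflect the HCOP file format or the actual mapping procedure.

```r
# Toy human gene set with three genes (IDs are invented).
toy_human <- data.frame(
  geneset = "TOY_SET_A",
  entrez  = c(10L, 20L, 30L),
  stringsAsFactors = FALSE
)

# Toy ortholog table: genes 10 and 20 each map to two mouse genes,
# gene 30 has no predicted ortholog.
toy_orthologs <- data.frame(
  entrez       = c(10L, 10L, 20L, 20L),
  mouse_entrez = c(1001L, 1002L, 2001L, 2002L)
)

# merge() performs an inner join by default: one output row per match,
# and human genes without a match are dropped.
toy_mouse <- merge(toy_human, toy_orthologs, by = "entrez")

nrow(toy_human)  # 3 human genes
nrow(toy_mouse)  # 4 mouse rows
```

Genes with several orthologs add rows and genes with none remove them, so the mouse count for a set can end up either above or below the human count.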

Look at the first few gene sets in the 50-gene-set hallmark collection. In each gene set shown, the number of mouse genes is greater than the number of human genes.

msigdf %>%
  count(org, collection, geneset) %>%
  filter(collection=="hallmark") %>%
  spread(org, n)

## # A tibble: 50 x 4
##    collection geneset                          human mouse
##    <chr>      <chr>                            <int> <int>
##  1 hallmark   HALLMARK_ADIPOGENESIS              200   303
##  2 hallmark   HALLMARK_ALLOGRAFT_REJECTION       200   333
##  3 hallmark   HALLMARK_ANDROGEN_RESPONSE         101   186
##  4 hallmark   HALLMARK_ANGIOGENESIS               36    65
##  5 hallmark   HALLMARK_APICAL_JUNCTION           200   363
##  6 hallmark   HALLMARK_APICAL_SURFACE             44    80
##  7 hallmark   HALLMARK_APOPTOSIS                 161   224
##  8 hallmark   HALLMARK_BILE_ACID_METABOLISM      112   176
##  9 hallmark   HALLMARK_CHOLESTEROL_HOMEOSTASIS    74   110
## 10 hallmark   HALLMARK_COAGULATION               138   225
## # ... with 40 more rows

Get the URL for the hallmark set with the fewest genes (Notch signaling). Optionally, pipe (%>%) the result to browseURL() to open it in your browser.

msigdf.human %>%
  filter(collection=="hallmark") %>%
  count(geneset) %>%
  arrange(n) %>%
  head(1) %>%
  inner_join(msigdf.urls, by="geneset") %>%
  pull(url)

## [1] "http://software.broadinstitute.org/gsea/msigdb/cards/HALLMARK_NOTCH_SIGNALING"
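A minimal sketch of the optional browser step, using the URL printed above as a literal string. The interactive() guard is an addition here, not something the package requires; it simply avoids trying to open a browser in non-interactive sessions such as scripted runs.

```r
# URL copied from the output above.
notch_url <- "http://software.broadinstitute.org/gsea/msigdb/cards/HALLMARK_NOTCH_SIGNALING"

# Only try to open a browser when someone is at the console.
if (interactive()) {
  utils::browseURL(notch_url)
}
```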

Just look at the number of genes in each KEGG pathway (sorted descending by the number of genes in that pathway):

msigdf.human %>%
  filter(collection=="c2" & grepl("^KEGG_", geneset)) %>%
  count(geneset) %>% 
  arrange(desc(n))

## # A tibble: 186 x 2
##    geneset                                          n
##    <chr>                                        <int>
##  1 KEGG_OLFACTORY_TRANSDUCTION                    389
##  2 KEGG_PATHWAYS_IN_CANCER                        328
##  3 KEGG_NEUROACTIVE_LIGAND_RECEPTOR_INTERACTION   272
##  4 KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION    267
##  5 KEGG_MAPK_SIGNALING_PATHWAY                    267
##  6 KEGG_REGULATION_OF_ACTIN_CYTOSKELETON          216
##  7 KEGG_FOCAL_ADHESION                            201
##  8 KEGG_CHEMOKINE_SIGNALING_PATHWAY               190
##  9 KEGG_HUNTINGTONS_DISEASE                       185
## 10 KEGG_ENDOCYTOSIS                               183
## # ... with 176 more rows
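The ^ in the pattern "^KEGG_" anchors the match to the start of the gene set name. The toy example below, with made-up set names, sketches the difference between the anchored and unanchored patterns; whether any real gene set name contains KEGG_ in the middle is not something this vignette checks.

```r
# Hypothetical gene set names, invented for illustration.
toy_names <- c("KEGG_TOY_PATHWAY", "TOY_KEGG_DERIVED_SET", "REACTOME_TOY_PATHWAY")

# Anchored: only names that begin with "KEGG_".
anchored <- toy_names[grepl("^KEGG_", toy_names)]

# Unanchored: any name containing "KEGG_" anywhere.
unanchored <- toy_names[grepl("KEGG_", toy_names)]

# startsWith() is a base R alternative to the anchored regular expression.
prefixed <- toy_names[startsWith(toy_names, "KEGG_")]

anchored
unanchored
```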

Session info

## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2.2  forcats_0.3.0   stringr_1.3.1   dplyr_0.7.7    
##  [5] purrr_0.2.5     readr_1.1.1     tidyr_0.8.2     tibble_1.4.2   
##  [9] ggplot2_3.1.0   tidyverse_1.2.1 knitr_1.20      BiocStyle_2.8.2
## [13] msigdf_5.2     
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5 xfun_0.3         haven_1.1.2      lattice_0.20-35 
##  [5] colorspace_1.3-2 htmltools_0.3.6  yaml_2.2.0       utf8_1.1.4      
##  [9] rlang_0.3.0.1    pillar_1.3.0     glue_1.3.0       withr_2.1.2     
## [13] readxl_1.1.0     modelr_0.1.2     bindr_0.1.1      plyr_1.8.4      
## [17] cellranger_1.1.0 munsell_0.5.0    commonmark_1.6   gtable_0.2.0    
## [21] rvest_0.3.2      devtools_1.13.6  memoise_1.1.0    evaluate_0.11   
## [25] fansi_0.4.0      broom_0.5.0      Rcpp_0.12.19     backports_1.1.2 
## [29] scales_1.0.0     desc_1.2.0       jsonlite_1.5     hms_0.4.2       
## [33] digest_0.6.18    stringi_1.2.4    bookdown_0.7     rprojroot_1.3-2 
## [37] grid_3.5.1       cli_1.0.1        tools_3.5.1      magrittr_1.5    
## [41] lazyeval_0.2.1   crayon_1.3.4     pkgconfig_2.0.2  xml2_1.2.0      
## [45] lubridate_1.7.4  assertthat_0.2.0 rmarkdown_1.10   roxygen2_6.1.0  
## [49] httr_1.3.1       rstudioapi_0.8   R6_2.3.0         nlme_3.1-137    
## [53] compiler_3.5.1