This data package contains the Molecular Signature Database (MSigDB) for both human and predicted mouse orthologs in separate data frames (tibbles). Each data frame (msigdf.human and msigdf.mouse) contains three columns: the collection (hallmark, or c1-c7), the gene set, and the Entrez IDs of the genes in that set. The msigdf.urls tibble contains links to the description of each gene set on the Broad Institute’s website. Source code is available on GitHub.
msigdf 5.2
Original data from the Broad Institute’s Molecular Signature Database (MSigDB, http://www.broad.mit.edu/gsea/msigdb/index.jsp), redistributed as separate R data files containing named lists of gene sets, available from WEHI (http://bioinf.wehi.edu.au/software/MSigDB/). The following description applies to the R-formatted data:
The gene sets contained in the MSigDB are from a wide variety of sources, and relate to a variety of species, mostly human. Our work at the WEHI predominately uses mouse models of human disease. To facilitate use of the MSigDB in our work, we have created a pure mouse version of the MSigDB by mapping all sets to mouse orthologs. A pure human version is also provided.
Procedure:
1. The current MSigDB v5.2 xml file was downloaded.
2. Human Entrez Gene IDs were mapped to Mouse Entrez Gene IDs, using the HGNC Comparison of Orthology Predictions (HCOP) (downloaded 11 October 2016).
3. Each collection was converted to a list in R, and written to an RData file using save().
See the script in data-raw/ to see how the data frames (tibbles) were created.
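Step 2 of the procedure amounts to a join between gene set membership and an ortholog table. The following is a minimal sketch on made-up data, not the script in data-raw/. The tibbles human_sets and orthologs, their column names, and all IDs are hypothetical. It illustrates how a one-to-many ortholog mapping can make a mouse gene set larger than its human counterpart, while human genes without a mapped ortholog drop out.

```r
library(dplyr)
library(tibble)

# Hypothetical human gene set membership (same columns as msigdf.human)
human_sets <- tibble(
  collection = "c1",
  geneset    = "toy_set",
  entrez     = c(1L, 2L, 3L)
)

# Hypothetical ortholog table: human genes 1 and 2 each map to two mouse
# genes, and human gene 3 has no mouse ortholog
orthologs <- tibble(
  human_entrez = c(1L, 1L, 2L, 2L),
  mouse_entrez = c(101L, 102L, 201L, 202L)
)

# Join on the human ID, then keep the mouse ID as the new entrez column
mouse_sets <- human_sets %>%
  inner_join(orthologs, by = c("entrez" = "human_entrez")) %>%
  mutate(entrez = mouse_entrez) %>%
  select(collection, geneset, entrez) %>%
  distinct()

nrow(human_sets)  # number of human genes in the toy set
nrow(mouse_sets)  # number of mouse genes after mapping
```

An inner join is used here so that unmapped human genes are silently dropped; whether the real script handles unmapped genes this way is an assumption.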
There are three data frames (tibbles) in this package. The msigdf.human data frame has columns for the MSigDB collection (c1-c7 and hallmark), the gene set, and the Entrez ID, where each row is a single Entrez gene ID. The msigdf.mouse data frame has the same structure for mouse orthologs. The msigdf.urls data frame links the name of the gene set to the URL on the Broad’s website.
The data sets in this package have several million rows. The package imports the tibble package so they’re displayed nicely.
library(tidyverse)
library(msigdf)
Take a look:
msigdf.human %>% head()
## # A tibble: 6 x 3
## collection geneset entrez
## <chr> <chr> <int>
## 1 c1 chr5q23 5759
## 2 c1 chr5q23 94033
## 3 c1 chr5q23 51334
## 4 c1 chr5q23 153163
## 5 c1 chr5q23 133615
## 6 c1 chr5q23 402229
msigdf.mouse %>% head()
## # A tibble: 6 x 3
## collection geneset entrez
## <chr> <chr> <int>
## 1 c1 chr5q23 100042150
## 2 c1 chr5q23 103236
## 3 c1 chr5q23 106869
## 4 c1 chr5q23 107022
## 5 c1 chr5q23 109700
## 6 c1 chr5q23 11548
msigdf.urls %>% as.data.frame() %>% head()
## collection geneset
## 1 c1 chr5q23
## 2 c1 chr16q24
## 3 c1 chr8q24
## 4 c1 chr13q11
## 5 c1 chr7p21
## 6 c1 chr10q23
## url
## 1 http://software.broadinstitute.org/gsea/msigdb/cards/chr5q23
## 2 http://software.broadinstitute.org/gsea/msigdb/cards/chr16q24
## 3 http://software.broadinstitute.org/gsea/msigdb/cards/chr8q24
## 4 http://software.broadinstitute.org/gsea/msigdb/cards/chr13q11
## 5 http://software.broadinstitute.org/gsea/msigdb/cards/chr7p21
## 6 http://software.broadinstitute.org/gsea/msigdb/cards/chr10q23
Just get the entries for the KEGG non-homologous end joining pathway:
msigdf.human %>%
  filter(geneset == "KEGG_NON_HOMOLOGOUS_END_JOINING")
## # A tibble: 14 x 3
## collection geneset entrez
## <chr> <chr> <int>
## 1 c2 KEGG_NON_HOMOLOGOUS_END_JOINING 7518
## 2 c2 KEGG_NON_HOMOLOGOUS_END_JOINING 4361
## 3 c2 KEGG_NON_HOMOLOGOUS_END_JOINING 27343
## 4 c2 KEGG_NON_HOMOLOGOUS_END_JOINING 27434
## 5 c2 KEGG_NON_HOMOLOGOUS_END_JOINING 731751
## 6 c2 KEGG_NON_HOMOLOGOUS_END_JOINING 79840
## 7 c2 KEGG_NON_HOMOLOGOUS_END_JOINING 3981
## 8 c2 KEGG_NON_HOMOLOGOUS_END_JOINING 2237
## 9 c2 KEGG_NON_HOMOLOGOUS_END_JOINING 1791
## 10 c2 KEGG_NON_HOMOLOGOUS_END_JOINING 7520
## 11 c2 KEGG_NON_HOMOLOGOUS_END_JOINING 10111
## 12 c2 KEGG_NON_HOMOLOGOUS_END_JOINING 2547
## 13 c2 KEGG_NON_HOMOLOGOUS_END_JOINING 5591
## 14 c2 KEGG_NON_HOMOLOGOUS_END_JOINING 64421
Some software, e.g., GAGE, might require gene sets to be a named list of Entrez IDs, where the name of each element in the list is the name of the pathway. This is how the data was originally structured, and we can return to it by collapsing each gene set into a list-column with summarize() and converting the result to a named list with deframe(). Here, let’s use only the hallmark sets, and after converting the data into this named list format, get just the first few pathways, and in each of those, just display the first few Entrez IDs.
msigdf.human %>%
  filter(collection == "hallmark") %>%
  select(geneset, entrez) %>%
  group_by(geneset) %>%
  summarize(entrez = list(entrez)) %>%
  deframe() %>%
  head() %>%
  map(head)
## $HALLMARK_ADIPOGENESIS
## [1] 2167 9370 5468 3991 8694 4023
##
## $HALLMARK_ALLOGRAFT_REJECTION
## [1] 5788 3593 7040 3592 916 915
##
## $HALLMARK_ANDROGEN_RESPONSE
## [1] 354 3817 2181 8554 10645 4824
##
## $HALLMARK_ANGIOGENESIS
## [1] 1462 10631 11167 4043 6781 4023
##
## $HALLMARK_APICAL_JUNCTION
## [1] 87 1366 89 149461 1739 7082
##
## $HALLMARK_APICAL_SURFACE
## [1] 2683 51458 4118 27076 5314 50617
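A base R alternative for building the same kind of named list is split(), which groups one vector by the values of another. The sketch below uses a small made-up tibble, toy, in place of msigdf.human so that it runs without the package; the gene set names and IDs are invented.

```r
library(tibble)

# Hypothetical stand-in with the same columns as msigdf.human
toy <- tibble(
  collection = "hallmark",
  geneset    = c("SET_A", "SET_A", "SET_B", "SET_B", "SET_B"),
  entrez     = c(11L, 12L, 21L, 22L, 23L)
)

# split() returns a named list: one integer vector of Entrez IDs per gene set
genesets <- split(toy$entrez, toy$geneset)

names(genesets)    # the gene set names
lengths(genesets)  # number of genes in each set
```

Note that split() orders the list by the sorted gene set names, which may differ from the row order of the input.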
For demonstration purposes, create a single object containing both human and mouse data:
msigdf <- bind_rows(
  msigdf.human %>% mutate(org = "human"),
  msigdf.mouse %>% mutate(org = "mouse")
)
head(msigdf)
## # A tibble: 6 x 4
## collection geneset entrez org
## <chr> <chr> <int> <chr>
## 1 c1 chr5q23 5759 human
## 2 c1 chr5q23 94033 human
## 3 c1 chr5q23 51334 human
## 4 c1 chr5q23 153163 human
## 5 c1 chr5q23 133615 human
## 6 c1 chr5q23 402229 human
tail(msigdf)
## # A tibble: 6 x 4
## collection geneset entrez org
## <chr> <chr> <int> <chr>
## 1 hallmark HALLMARK_PANCREAS_BETA_CELLS 56458 mouse
## 2 hallmark HALLMARK_PANCREAS_BETA_CELLS 56529 mouse
## 3 hallmark HALLMARK_PANCREAS_BETA_CELLS 66286 mouse
## 4 hallmark HALLMARK_PANCREAS_BETA_CELLS 69019 mouse
## 5 hallmark HALLMARK_PANCREAS_BETA_CELLS 77766 mouse
## 6 hallmark HALLMARK_PANCREAS_BETA_CELLS 80976 mouse
The number of gene sets in each collection is the same for each organism:
msigdf %>%
  group_by(org, collection) %>%
  summarize(ngenesets = n_distinct(geneset)) %>%
  spread(org, ngenesets)
## # A tibble: 8 x 3
## collection human mouse
## <chr> <int> <int>
## 1 c1 326 326
## 2 c2 4729 4729
## 3 c3 836 836
## 4 c4 858 858
## 5 c5 6166 6166
## 6 c6 189 189
## 7 c7 4872 4872
## 8 hallmark 50 50
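The session that produced this output used tidyr 0.8.2. In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). The following sketch shows the same reshaping with the newer function, assuming tidyr 1.0.0 or later and dplyr 1.0.0 or later (for the .groups argument). The toy tibble is a made-up stand-in for the combined human and mouse table.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for the combined human + mouse table
toy <- tibble(
  org        = c("human", "human", "human", "mouse", "mouse", "mouse", "mouse"),
  collection = "c1",
  geneset    = c("SET_A", "SET_A", "SET_B", "SET_A", "SET_A", "SET_A", "SET_B"),
  entrez     = c(1L, 2L, 3L, 101L, 102L, 103L, 104L)
)

# Count distinct gene sets per organism and collection, then put the
# organisms side by side, one column each
counts <- toy %>%
  group_by(org, collection) %>%
  summarize(ngenesets = n_distinct(geneset), .groups = "drop") %>%
  pivot_wider(names_from = org, values_from = ngenesets)

counts
```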
But the number of mouse genes is much greater in every collection except c1, due to the one-to-many ortholog mapping.
msigdf %>%
  count(org, collection) %>%
  spread(org, n)
## # A tibble: 8 x 3
## collection human mouse
## <chr> <int> <int>
## 1 c1 30288 29951
## 2 c2 441018 726058
## 3 c3 198275 325849
## 4 c4 91309 160394
## 5 c5 703598 1013231
## 6 c6 31319 56280
## 7 c7 948254 1588236
## 8 hallmark 7324 12067
Look at the first few gene sets just in the 50-geneset hallmark collection. In each of the gene sets shown, the number of mouse genes is greater than the number of human genes.
msigdf %>%
  count(org, collection, geneset) %>%
  filter(collection == "hallmark") %>%
  spread(org, n)
## # A tibble: 50 x 4
## collection geneset human mouse
## <chr> <chr> <int> <int>
## 1 hallmark HALLMARK_ADIPOGENESIS 200 303
## 2 hallmark HALLMARK_ALLOGRAFT_REJECTION 200 333
## 3 hallmark HALLMARK_ANDROGEN_RESPONSE 101 186
## 4 hallmark HALLMARK_ANGIOGENESIS 36 65
## 5 hallmark HALLMARK_APICAL_JUNCTION 200 363
## 6 hallmark HALLMARK_APICAL_SURFACE 44 80
## 7 hallmark HALLMARK_APOPTOSIS 161 224
## 8 hallmark HALLMARK_BILE_ACID_METABOLISM 112 176
## 9 hallmark HALLMARK_CHOLESTEROL_HOMEOSTASIS 74 110
## 10 hallmark HALLMARK_COAGULATION 138 225
## # ... with 40 more rows
Get the URL for the hallmark set with the fewest genes (Notch signaling). Optionally, pipe (%>%) this to browseURL() to open it in your browser.
msigdf.human %>%
  filter(collection == "hallmark") %>%
  count(geneset) %>%
  arrange(n) %>%
  head(1) %>%
  inner_join(msigdf.urls, by = "geneset") %>%
  pull(url)
## [1] "http://software.broadinstitute.org/gsea/msigdb/cards/HALLMARK_NOTCH_SIGNALING"
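As a sketch of the optional browser step, the example below stores the URL in a variable and passes it to browseURL() only in an interactive session, so that non-interactive runs (e.g. when knitting) do not try to open a browser. The tibbles toy_sets and toy_urls and the example.org addresses are made-up placeholders standing in for msigdf.human and msigdf.urls.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for msigdf.human and msigdf.urls
toy_sets <- tibble(
  geneset = c("SET_A", "SET_A", "SET_A", "SET_B", "SET_B"),
  entrez  = 1:5
)
toy_urls <- tibble(
  geneset = c("SET_A", "SET_B"),
  url     = c("http://example.org/SET_A", "http://example.org/SET_B")
)

# Find the gene set with the fewest genes and look up its URL
smallest_url <- toy_sets %>%
  count(geneset) %>%
  arrange(n) %>%
  head(1) %>%
  inner_join(toy_urls, by = "geneset") %>%
  pull(url)

# Only open a browser when someone is there to look at it
if (interactive()) browseURL(smallest_url)
```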
Just look at the number of genes in each KEGG pathway (sorted descending by the number of genes in that pathway):
msigdf.human %>%
  filter(collection == "c2" & grepl("^KEGG_", geneset)) %>%
  count(geneset) %>%
  arrange(desc(n))
## # A tibble: 186 x 2
## geneset n
## <chr> <int>
## 1 KEGG_OLFACTORY_TRANSDUCTION 389
## 2 KEGG_PATHWAYS_IN_CANCER 328
## 3 KEGG_NEUROACTIVE_LIGAND_RECEPTOR_INTERACTION 272
## 4 KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION 267
## 5 KEGG_MAPK_SIGNALING_PATHWAY 267
## 6 KEGG_REGULATION_OF_ACTIN_CYTOSKELETON 216
## 7 KEGG_FOCAL_ADHESION 201
## 8 KEGG_CHEMOKINE_SIGNALING_PATHWAY 190
## 9 KEGG_HUNTINGTONS_DISEASE 185
## 10 KEGG_ENDOCYTOSIS 183
## # ... with 176 more rows
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] bindrcpp_0.2.2 forcats_0.3.0 stringr_1.3.1 dplyr_0.7.7
## [5] purrr_0.2.5 readr_1.1.1 tidyr_0.8.2 tibble_1.4.2
## [9] ggplot2_3.1.0 tidyverse_1.2.1 knitr_1.20 BiocStyle_2.8.2
## [13] msigdf_5.2
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.5 xfun_0.3 haven_1.1.2 lattice_0.20-35
## [5] colorspace_1.3-2 htmltools_0.3.6 yaml_2.2.0 utf8_1.1.4
## [9] rlang_0.3.0.1 pillar_1.3.0 glue_1.3.0 withr_2.1.2
## [13] readxl_1.1.0 modelr_0.1.2 bindr_0.1.1 plyr_1.8.4
## [17] cellranger_1.1.0 munsell_0.5.0 commonmark_1.6 gtable_0.2.0
## [21] rvest_0.3.2 devtools_1.13.6 memoise_1.1.0 evaluate_0.11
## [25] fansi_0.4.0 broom_0.5.0 Rcpp_0.12.19 backports_1.1.2
## [29] scales_1.0.0 desc_1.2.0 jsonlite_1.5 hms_0.4.2
## [33] digest_0.6.18 stringi_1.2.4 bookdown_0.7 rprojroot_1.3-2
## [37] grid_3.5.1 cli_1.0.1 tools_3.5.1 magrittr_1.5
## [41] lazyeval_0.2.1 crayon_1.3.4 pkgconfig_2.0.2 xml2_1.2.0
## [45] lubridate_1.7.4 assertthat_0.2.0 rmarkdown_1.10 roxygen2_6.1.0
## [49] httr_1.3.1 rstudioapi_0.8 R6_2.3.0 nlme_3.1-137
## [53] compiler_3.5.1