Here I’ll show you how to take results from DESeq and quickly assess for enrichment of pathways using the fgsea package for fast preranked gene set enrichment analysis (GSEA).

FGSEA on DESeq results (human)

First, let’s start with a human dataset.

Load some results from DESeq2

Load some results from DESeq2 (generated using results(..., tidy=TRUE)). See the appendix section at the end for how this data was generated.

library(tidyverse)
res <- read_csv("data/deseq-results-tidy-human-airway.csv")
res
## # A tibble: 64,102 x 7
##    row         baseMean log2FoldChange   lfcSE    stat    pvalue      padj
##    <chr>          <dbl>          <dbl>   <dbl>   <dbl>     <dbl>     <dbl>
##  1 ENSG000000…  709.            0.381   0.101    3.79    1.52e-4   1.28e-3
##  2 ENSG000000…    0            NA      NA       NA      NA        NA      
##  3 ENSG000000…  520.           -0.207   0.112   -1.84    6.53e-2   1.97e-1
##  4 ENSG000000…  237.           -0.0379  0.143   -0.264   7.92e-1   9.11e-1
##  5 ENSG000000…   57.9           0.0882  0.287    0.307   7.59e-1   8.95e-1
##  6 ENSG000000…    0.318         1.38    3.50     0.394   6.94e-1  NA      
##  7 ENSG000000… 5817.           -0.426   0.0883  -4.83    1.38e-6   1.82e-5
##  8 ENSG000000… 1282.            0.241   0.0887   2.72    6.58e-3   3.28e-2
##  9 ENSG000000…  610.            0.0476  0.167    0.286   7.75e-1   9.03e-1
## 10 ENSG000000…  369.            0.500   0.121    4.14    3.48e-5   3.42e-4
## # ... with 64,092 more rows

Map Ensembl gene IDs to symbol. First create a mapping table.

library(org.Hs.eg.db)
ens2symbol <- AnnotationDbi::select(org.Hs.eg.db,
                                    key=res$row, 
                                    columns="SYMBOL",
                                    keytype="ENSEMBL")
ens2symbol <- as_tibble(ens2symbol)
ens2symbol
## # A tibble: 64,381 x 2
##    ENSEMBL         SYMBOL  
##    <chr>           <chr>   
##  1 ENSG00000000003 TSPAN6  
##  2 ENSG00000000005 TNMD    
##  3 ENSG00000000419 DPM1    
##  4 ENSG00000000457 SCYL3   
##  5 ENSG00000000460 C1orf112
##  6 ENSG00000000938 FGR     
##  7 ENSG00000000971 CFH     
##  8 ENSG00000001036 FUCA2   
##  9 ENSG00000001084 GCLC    
## 10 ENSG00000001167 NFYA    
## # ... with 64,371 more rows

Now join them.

res <- inner_join(res, ens2symbol, by=c("row"="ENSEMBL"))
res
## # A tibble: 64,381 x 8
##    row    baseMean log2FoldChange   lfcSE    stat   pvalue     padj SYMBOL
##    <chr>     <dbl>          <dbl>   <dbl>   <dbl>    <dbl>    <dbl> <chr> 
##  1 ENSG0…  709.            0.381   0.101    3.79   1.52e-4  1.28e-3 TSPAN6
##  2 ENSG0…    0            NA      NA       NA     NA       NA       TNMD  
##  3 ENSG0…  520.           -0.207   0.112   -1.84   6.53e-2  1.97e-1 DPM1  
##  4 ENSG0…  237.           -0.0379  0.143   -0.264  7.92e-1  9.11e-1 SCYL3 
##  5 ENSG0…   57.9           0.0882  0.287    0.307  7.59e-1  8.95e-1 C1orf…
##  6 ENSG0…    0.318         1.38    3.50     0.394  6.94e-1 NA       FGR   
##  7 ENSG0… 5817.           -0.426   0.0883  -4.83   1.38e-6  1.82e-5 CFH   
##  8 ENSG0… 1282.            0.241   0.0887   2.72   6.58e-3  3.28e-2 FUCA2 
##  9 ENSG0…  610.            0.0476  0.167    0.286  7.75e-1  9.03e-1 GCLC  
## 10 ENSG0…  369.            0.500   0.121    4.14   3.48e-5  3.42e-4 NFYA  
## # ... with 64,371 more rows

Further, all you’ll care about later on is the gene symbol and the test statistic. Get just those, and remove the NAs. Finally, if you have multiple test statistics for the same symbol, you’ll want to deal with that in some way. Here I’m just averaging them.

res2 <- res %>% 
  dplyr::select(SYMBOL, stat) %>% 
  na.omit() %>% 
  distinct() %>% 
  group_by(SYMBOL) %>% 
  summarize(stat=mean(stat))
res2
## # A tibble: 19,662 x 2
##    SYMBOL       stat
##    <chr>       <dbl>
##  1 A1BG      0.680  
##  2 A1BG-AS1 -1.79   
##  3 A1CF     -0.126  
##  4 A2M      -1.26   
##  5 A2M-AS1   0.875  
##  6 A2ML1     0.673  
##  7 A2MP1     0.973  
##  8 A3GALT2  -0.0491 
##  9 A4GALT   -4.14   
## 10 A4GNT     0.00777
## # ... with 19,652 more rows

Using the fgsea package

We’re going to use the fgsea package for fast preranked gene set enrichment analysis (GSEA). See the preprint for more details. From the abstract:

Gene set enrichment analysis is a widely used tool for analyzing gene expression data. However, current implementations are slow due to a large number of required samples for the analysis to have a good statistical power. In this paper we present a novel algorithm, that efficiently reuses one sample multiple times and thus speeds up the analysis. We show that it is possible to make hundreds of thousands permutations in a few minutes, which leads to very accurate p-values. This, in turn, allows applying standard FDR correction procedures, which are more accurate than the ones currently used.

library(fgsea)

The fgsea() function requires a list of gene sets to check, and a named vector of gene-level statistics, where the names should be the same as the gene names in the pathways list. First, let’s create our named vector of test statistics. See ?tibble::deframe for help here - deframe() converts two-column data frames to a named vector or list, using the first column as name and the second column as value.

ranks <- deframe(res2)
head(ranks, 20)
##         A1BG     A1BG-AS1         A1CF          A2M      A2M-AS1 
##  0.679946437 -1.793291412 -0.126192849 -1.259539478  0.875346116 
##        A2ML1        A2MP1      A3GALT2       A4GALT        A4GNT 
##  0.672710814  0.973319532 -0.049131427 -4.144839902  0.007772497 
##         AAAS         AACS        AADAC      AADACL2      AADACL4 
##  0.163986128  1.416071728 -0.306541616  0.489315094 -1.876311694 
##      AADACP1        AADAT        AAED1        AAGAB         AAK1 
##  0.161119304  3.079128034 -7.492301790  1.554279946  1.141522348

Let’s use the Hallmark gene set from MSigDB. Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying overlaps between gene sets in other MSigDB collections and retaining genes that display coordinate expression. The gmtPathways() function will take a GMT file you downloaded from MSigDB and turn it into a list. Each element in the list is a character vector of genes in the pathway.

# Load the pathways into a named list
pathways.hallmark <- gmtPathways("data/msigdb/h.all.v6.2.symbols.gmt")

# Look at them all if you want (uncomment)
# pathways.hallmark

# Show the first few pathways, and within those, show only the first few genes. 
pathways.hallmark %>% 
  head() %>% 
  lapply(head)
## $HALLMARK_TNFA_SIGNALING_VIA_NFKB
## [1] "JUNB"    "CXCL2"   "ATF3"    "NFKBIA"  "TNFAIP3" "PTGS2"  
## 
## $HALLMARK_HYPOXIA
## [1] "PGK1"  "PDK1"  "GBE1"  "PFKL"  "ALDOA" "ENO2" 
## 
## $HALLMARK_CHOLESTEROL_HOMEOSTASIS
## [1] "FDPS"    "CYP51A1" "IDI1"    "FDFT1"   "DHCR7"   "SQLE"   
## 
## $HALLMARK_MITOTIC_SPINDLE
## [1] "ARHGEF2" "CLASP1"  "KIF11"   "KIF23"   "ALS2"    "ARF6"   
## 
## $HALLMARK_WNT_BETA_CATENIN_SIGNALING
## [1] "MYC"    "CTNNB1" "JAG2"   "NOTCH1" "DLL1"   "AXIN2" 
## 
## $HALLMARK_TGF_BETA_SIGNALING
## [1] "TGFBR1" "SMAD7"  "TGFB1"  "SMURF2" "SMURF1" "BMPR2"

Now, run the fgsea algorithm with 1000 permutations:

fgseaRes <- fgsea(pathways=pathways.hallmark, stats=ranks, nperm=1000)

Tidy the results:

fgseaResTidy <- fgseaRes %>%
  as_tibble() %>%
  arrange(desc(NES))

# Show in a nice table:
fgseaResTidy %>% 
  dplyr::select(-leadingEdge, -ES, -nMoreExtreme) %>% 
  arrange(padj) %>% 
  DT::datatable()

Plot the normalized enrichment scores. Color the bar indicating whether or not the pathway was significant:

ggplot(fgseaResTidy, aes(reorder(pathway, NES), NES)) +
  geom_col(aes(fill=padj<0.05)) +
  coord_flip() +
  labs(x="Pathway", y="Normalized Enrichment Score",
       title="Hallmark pathways NES from GSEA") + 
  theme_minimal()

The Interferon Alpha Response gene set was significantly upregulated, consistent with the findings of the original study.

What genes are in each of these pathways? First, get a tibble with all the pathways and the genes in them. Continue to join that back to the original data to pull out genes in the pathways. Optionally, filter the list to include only those that are significant, etc.

pathways.hallmark %>% 
  enframe("pathway", "SYMBOL") %>% 
  unnest() %>% 
  inner_join(res, by="SYMBOL")
## # A tibble: 7,689 x 9
##    pathway   SYMBOL row   baseMean log2FoldChange  lfcSE    stat    pvalue
##    <chr>     <chr>  <chr>    <dbl>          <dbl>  <dbl>   <dbl>     <dbl>
##  1 HALLMARK… JUNB   ENSG…   682.           0.0640  0.144   0.445  6.56e- 1
##  2 HALLMARK… CXCL2  ENSG…     1.71         1.39    1.55    0.899  3.69e- 1
##  3 HALLMARK… ATF3   ENSG…   374.          -1.92    0.159 -12.1    1.56e-33
##  4 HALLMARK… NFKBIA ENSG…   562.          -0.799   0.177  -4.51   6.37e- 6
##  5 HALLMARK… TNFAI… ENSG…   356.          -0.562   0.149  -3.79   1.53e- 4
##  6 HALLMARK… PTGS2  ENSG…    71.6          1.37    0.365   3.76   1.70e- 4
##  7 HALLMARK… CXCL1  ENSG…    37.3         -0.841   0.438  -1.92   5.47e- 2
##  8 HALLMARK… IER3   ENSG…   136.           1.30    0.217   5.99   2.08e- 9
##  9 HALLMARK… IER3   ENSG…     0           NA      NA      NA     NA       
## 10 HALLMARK… IER3   ENSG…     0           NA      NA      NA     NA       
## # ... with 7,679 more rows, and 1 more variable: padj <dbl>

Let’s try a different set of pathways. Let’s look at KEGG pathways:

fgsea(pathways=gmtPathways("data/msigdb/c2.cp.kegg.v6.2.symbols.gmt"), ranks, nperm=1000) %>% 
  as_tibble() %>% 
  arrange(padj)
## # A tibble: 186 x 8
##    pathway         pval   padj     ES   NES nMoreExtreme  size leadingEdge
##    <chr>          <dbl>  <dbl>  <dbl> <dbl>        <dbl> <int> <list>     
##  1 KEGG_STARCH… 0.00176 0.0427 -0.654 -1.77            0    33 <chr [20]> 
##  2 KEGG_GLYCOS… 0.00183 0.0427 -0.751 -1.87            0    21 <chr [10]> 
##  3 KEGG_FOCAL_… 0.00141 0.0427 -0.458 -1.59            0   191 <chr [67]> 
##  4 KEGG_NEUROT… 0.00152 0.0427 -0.506 -1.66            0   120 <chr [45]> 
##  5 KEGG_REGULA… 0.00141 0.0427 -0.483 -1.68            0   192 <chr [67]> 
##  6 KEGG_INSULI… 0.00151 0.0427 -0.543 -1.80            0   126 <chr [47]> 
##  7 KEGG_ADIPOC… 0.00159 0.0427 -0.646 -1.94            0    62 <chr [26]> 
##  8 KEGG_ACUTE_… 0.00162 0.0427 -0.582 -1.70            0    52 <chr [24]> 
##  9 KEGG_GLYCER… 0.00316 0.0456 -0.551 -1.67            1    65 <chr [31]> 
## 10 KEGG_AMINOA… 0.00245 0.0456  0.617  1.86            0    41 <chr [11]> 
## # ... with 176 more rows

Or miR targets:

fgsea(pathways=gmtPathways("data/msigdb/c3.mir.v6.2.symbols.gmt"), ranks, nperm=1000) %>% 
  as_tibble() %>% 
  arrange(padj)
## # A tibble: 221 x 8
##    pathway         pval   padj     ES   NES nMoreExtreme  size leadingEdge
##    <chr>          <dbl>  <dbl>  <dbl> <dbl>        <dbl> <int> <list>     
##  1 TGAATGT_MIR… 0.00124 0.0346 -0.433 -1.63            0   446 <chr [120]>
##  2 AATGTGA_MIR… 0.00128 0.0346 -0.405 -1.51            0   370 <chr [93]> 
##  3 CAGTGTT_MIR… 0.00136 0.0346 -0.419 -1.52            0   286 <chr [95]> 
##  4 GTGCCTT_MIR… 0.00115 0.0346 -0.399 -1.54            0   664 <chr [213]>
##  5 CTATGCA_MIR… 0.00141 0.0346 -0.486 -1.71            0   196 <chr [54]> 
##  6 TTGCCAA_MIR… 0.00133 0.0346 -0.413 -1.51            0   291 <chr [89]> 
##  7 TTTGTAG_MIR… 0.00132 0.0346 -0.411 -1.50            0   306 <chr [82]> 
##  8 TGCCTTA_MIR… 0.00121 0.0346 -0.388 -1.47            0   519 <chr [171]>
##  9 GTGCCAA_MIR… 0.00135 0.0346 -0.445 -1.61            0   274 <chr [83]> 
## 10 TGTTTAC_MIR… 0.00242 0.0478 -0.361 -1.37            1   532 <chr [144]>
## # ... with 211 more rows

Or GO annotations:

fgsea(pathways=gmtPathways("data/msigdb/c5.all.v6.2.symbols.gmt"), ranks, nperm=1000) %>% 
  as_tibble() %>% 
  arrange(padj)
## # A tibble: 5,917 x 8
##    pathway         pval   padj     ES   NES nMoreExtreme  size leadingEdge
##    <chr>          <dbl>  <dbl>  <dbl> <dbl>        <dbl> <int> <list>     
##  1 GO_CELLULAR… 0.00108 0.0870 -0.376 -1.49            0  1200 <chr [330]>
##  2 GO_CHEMICAL… 0.00115 0.0870 -0.347 -1.35            0   742 <chr [140]>
##  3 GO_RESPONSE… 0.00128 0.0870 -0.418 -1.54            0   355 <chr [96]> 
##  4 GO_FORMATIO… 0.00154 0.0870 -0.515 -1.67            0    98 <chr [21]> 
##  5 GO_ACTIN_FI… 0.00124 0.0870 -0.434 -1.62            0   409 <chr [125]>
##  6 GO_SINGLE_O… 0.00109 0.0870 -0.332 -1.31            0  1169 <chr [253]>
##  7 GO_SMALL_MO… 0.00106 0.0870 -0.339 -1.35            0  1541 <chr [342]>
##  8 GO_ALPHA_AM… 0.00153 0.0870 -0.551 -1.75            0    83 <chr [24]> 
##  9 GO_ORGANOPH… 0.00114 0.0870 -0.360 -1.40            0   767 <chr [229]>
## 10 GO_ESTABLIS… 0.00106 0.0870 -0.328 -1.31            0  1515 <chr [514]>
## # ... with 5,907 more rows

Non-human organisms

If you have a non-human organism, first create a table to map the fly/mouse/worm/etc gene IDs to human symbol:

library(biomaRt)
mart <- useDataset("mmusculus_gene_ensembl", mart=useMart("ensembl"))
bm <- getBM(attributes=c("ensembl_gene_id", "hsapiens_homolog_associated_gene_name"), mart=mart) %>%
  distinct() %>%
  as_tibble() %>%
  na_if("") %>% 
  na.omit()
bm
## # A tibble: 26,256 x 2
##    ensembl_gene_id    hsapiens_homolog_associated_gene_name
##    <chr>              <chr>                                
##  1 ENSMUSG00000064370 MT-CYB                               
##  2 ENSMUSG00000064368 MT-ND6                               
##  3 ENSMUSG00000064367 MT-ND5                               
##  4 ENSMUSG00000064363 MT-ND4                               
##  5 ENSMUSG00000065947 MT-ND4L                              
##  6 ENSMUSG00000064360 MT-ND3                               
##  7 ENSMUSG00000064358 MT-CO3                               
##  8 ENSMUSG00000064357 MT-ATP6                              
##  9 ENSMUSG00000064356 MT-ATP8                              
## 10 ENSMUSG00000064354 MT-CO2                               
## # ... with 26,246 more rows

Then simply join your non-human results to the table above, and continue as you did earlier now using the human symbol and the associated test statistic.

Appendix

This code was used to generate the human airway results:

library(DESeq2)
library(airway)
ddsSE <- DESeqDataSet(airway, design = ~ cell + dex)
ddsSE <- DESeq(ddsSE)
res <- results(ddsSE, tidy = TRUE)
readr::write_csv(res, path="data/deseq-results-tidy-human-airway.csv")