Here I’ll show you how to take results from DESeq and quickly assess for enrichment of pathways using the fgsea package for fast preranked gene set enrichment analysis (GSEA).
First, let’s start with a human dataset.
Load some results from DESeq2 (generated using results(..., tidy=TRUE)
). See the appendix section at the end for how this data was generated.
library(tidyverse)
res <- read_csv("data/deseq-results-tidy-human-airway.csv")
res
## # A tibble: 64,102 x 7
## row baseMean log2FoldChange lfcSE stat pvalue padj
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG000000… 709. 0.381 0.101 3.79 1.52e-4 1.28e-3
## 2 ENSG000000… 0 NA NA NA NA NA
## 3 ENSG000000… 520. -0.207 0.112 -1.84 6.53e-2 1.97e-1
## 4 ENSG000000… 237. -0.0379 0.143 -0.264 7.92e-1 9.11e-1
## 5 ENSG000000… 57.9 0.0882 0.287 0.307 7.59e-1 8.95e-1
## 6 ENSG000000… 0.318 1.38 3.50 0.394 6.94e-1 NA
## 7 ENSG000000… 5817. -0.426 0.0883 -4.83 1.38e-6 1.82e-5
## 8 ENSG000000… 1282. 0.241 0.0887 2.72 6.58e-3 3.28e-2
## 9 ENSG000000… 610. 0.0476 0.167 0.286 7.75e-1 9.03e-1
## 10 ENSG000000… 369. 0.500 0.121 4.14 3.48e-5 3.42e-4
## # ... with 64,092 more rows
Map Ensembl gene IDs to symbol. First create a mapping table.
library(org.Hs.eg.db)
ens2symbol <- AnnotationDbi::select(org.Hs.eg.db,
key=res$row,
columns="SYMBOL",
keytype="ENSEMBL")
ens2symbol <- as_tibble(ens2symbol)
ens2symbol
## # A tibble: 64,381 x 2
## ENSEMBL SYMBOL
## <chr> <chr>
## 1 ENSG00000000003 TSPAN6
## 2 ENSG00000000005 TNMD
## 3 ENSG00000000419 DPM1
## 4 ENSG00000000457 SCYL3
## 5 ENSG00000000460 C1orf112
## 6 ENSG00000000938 FGR
## 7 ENSG00000000971 CFH
## 8 ENSG00000001036 FUCA2
## 9 ENSG00000001084 GCLC
## 10 ENSG00000001167 NFYA
## # ... with 64,371 more rows
Now join them.
res <- inner_join(res, ens2symbol, by=c("row"="ENSEMBL"))
res
## # A tibble: 64,381 x 8
## row baseMean log2FoldChange lfcSE stat pvalue padj SYMBOL
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 ENSG0… 709. 0.381 0.101 3.79 1.52e-4 1.28e-3 TSPAN6
## 2 ENSG0… 0 NA NA NA NA NA TNMD
## 3 ENSG0… 520. -0.207 0.112 -1.84 6.53e-2 1.97e-1 DPM1
## 4 ENSG0… 237. -0.0379 0.143 -0.264 7.92e-1 9.11e-1 SCYL3
## 5 ENSG0… 57.9 0.0882 0.287 0.307 7.59e-1 8.95e-1 C1orf…
## 6 ENSG0… 0.318 1.38 3.50 0.394 6.94e-1 NA FGR
## 7 ENSG0… 5817. -0.426 0.0883 -4.83 1.38e-6 1.82e-5 CFH
## 8 ENSG0… 1282. 0.241 0.0887 2.72 6.58e-3 3.28e-2 FUCA2
## 9 ENSG0… 610. 0.0476 0.167 0.286 7.75e-1 9.03e-1 GCLC
## 10 ENSG0… 369. 0.500 0.121 4.14 3.48e-5 3.42e-4 NFYA
## # ... with 64,371 more rows
Further, all you’ll care about later on is the gene symbol and the test statistic. Get just those, and remove the NA
s. Finally, if you have multiple test statistics for the same symbol, you’ll want to deal with that in some way. Here I’m just averaging them.
res2 <- res %>%
dplyr::select(SYMBOL, stat) %>%
na.omit() %>%
distinct() %>%
group_by(SYMBOL) %>%
summarize(stat=mean(stat))
res2
## # A tibble: 19,662 x 2
## SYMBOL stat
## <chr> <dbl>
## 1 A1BG 0.680
## 2 A1BG-AS1 -1.79
## 3 A1CF -0.126
## 4 A2M -1.26
## 5 A2M-AS1 0.875
## 6 A2ML1 0.673
## 7 A2MP1 0.973
## 8 A3GALT2 -0.0491
## 9 A4GALT -4.14
## 10 A4GNT 0.00777
## # ... with 19,652 more rows
We’re going to use the fgsea package for fast preranked gene set enrichment analysis (GSEA). See the preprint for more details. From the abstract:
Gene set enrichment analysis is a widely used tool for analyzing gene expression data. However, current implementations are slow due to a large number of required samples for the analysis to have a good statistical power. In this paper we present a novel algorithm, that efficiently reuses one sample multiple times and thus speeds up the analysis. We show that it is possible to make hundreds of thousands permutations in a few minutes, which leads to very accurate p-values. This, in turn, allows applying standard FDR correction procedures, which are more accurate than the ones currently used.
library(fgsea)
The fgsea()
function requires a list of gene sets to check, and a named vector of gene-level statistics, where the names should be the same as the gene names in the pathways list. First, let’s create our named vector of test statistics. See ?tibble::deframe
for help here - deframe()
converts two-column data frames to a named vector or list, using the first column as name and the second column as value.
ranks <- deframe(res2)
head(ranks, 20)
## A1BG A1BG-AS1 A1CF A2M A2M-AS1
## 0.679946437 -1.793291412 -0.126192849 -1.259539478 0.875346116
## A2ML1 A2MP1 A3GALT2 A4GALT A4GNT
## 0.672710814 0.973319532 -0.049131427 -4.144839902 0.007772497
## AAAS AACS AADAC AADACL2 AADACL4
## 0.163986128 1.416071728 -0.306541616 0.489315094 -1.876311694
## AADACP1 AADAT AAED1 AAGAB AAK1
## 0.161119304 3.079128034 -7.492301790 1.554279946 1.141522348
Let’s use the Hallmark gene set from MSigDB. Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying overlaps between gene sets in other MSigDB collections and retaining genes that display coordinate expression. The gmtPathways()
function will take a GMT file you downloaded from MSigDB and turn it into a list. Each element in the list is a character vector of genes in the pathway.
# Load the pathways into a named list
pathways.hallmark <- gmtPathways("data/msigdb/h.all.v6.2.symbols.gmt")
# Look at them all if you want (uncomment)
# pathways.hallmark
# Show the first few pathways, and within those, show only the first few genes.
pathways.hallmark %>%
head() %>%
lapply(head)
## $HALLMARK_TNFA_SIGNALING_VIA_NFKB
## [1] "JUNB" "CXCL2" "ATF3" "NFKBIA" "TNFAIP3" "PTGS2"
##
## $HALLMARK_HYPOXIA
## [1] "PGK1" "PDK1" "GBE1" "PFKL" "ALDOA" "ENO2"
##
## $HALLMARK_CHOLESTEROL_HOMEOSTASIS
## [1] "FDPS" "CYP51A1" "IDI1" "FDFT1" "DHCR7" "SQLE"
##
## $HALLMARK_MITOTIC_SPINDLE
## [1] "ARHGEF2" "CLASP1" "KIF11" "KIF23" "ALS2" "ARF6"
##
## $HALLMARK_WNT_BETA_CATENIN_SIGNALING
## [1] "MYC" "CTNNB1" "JAG2" "NOTCH1" "DLL1" "AXIN2"
##
## $HALLMARK_TGF_BETA_SIGNALING
## [1] "TGFBR1" "SMAD7" "TGFB1" "SMURF2" "SMURF1" "BMPR2"
Now, run the fgsea algorithm with 1000 permutations:
fgseaRes <- fgsea(pathways=pathways.hallmark, stats=ranks, nperm=1000)
Tidy the results:
fgseaResTidy <- fgseaRes %>%
as_tibble() %>%
arrange(desc(NES))
# Show in a nice table:
fgseaResTidy %>%
dplyr::select(-leadingEdge, -ES, -nMoreExtreme) %>%
arrange(padj) %>%
DT::datatable()
Plot the normalized enrichment scores. Color the bar indicating whether or not the pathway was significant:
ggplot(fgseaResTidy, aes(reorder(pathway, NES), NES)) +
geom_col(aes(fill=padj<0.05)) +
coord_flip() +
labs(x="Pathway", y="Normalized Enrichment Score",
title="Hallmark pathways NES from GSEA") +
theme_minimal()
The Interferon Alpha Response gene set was significantly upregulated, consistent with the findings of the original study.
What genes are in each of these pathways? First, get a tibble with all the pathways and the genes in them. Continue to join that back to the original data to pull out genes in the pathways. Optionally, filter the list to include only those that are significant, etc.
pathways.hallmark %>%
enframe("pathway", "SYMBOL") %>%
unnest() %>%
inner_join(res, by="SYMBOL")
## # A tibble: 7,689 x 9
## pathway SYMBOL row baseMean log2FoldChange lfcSE stat pvalue
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 HALLMARK… JUNB ENSG… 682. 0.0640 0.144 0.445 6.56e- 1
## 2 HALLMARK… CXCL2 ENSG… 1.71 1.39 1.55 0.899 3.69e- 1
## 3 HALLMARK… ATF3 ENSG… 374. -1.92 0.159 -12.1 1.56e-33
## 4 HALLMARK… NFKBIA ENSG… 562. -0.799 0.177 -4.51 6.37e- 6
## 5 HALLMARK… TNFAI… ENSG… 356. -0.562 0.149 -3.79 1.53e- 4
## 6 HALLMARK… PTGS2 ENSG… 71.6 1.37 0.365 3.76 1.70e- 4
## 7 HALLMARK… CXCL1 ENSG… 37.3 -0.841 0.438 -1.92 5.47e- 2
## 8 HALLMARK… IER3 ENSG… 136. 1.30 0.217 5.99 2.08e- 9
## 9 HALLMARK… IER3 ENSG… 0 NA NA NA NA
## 10 HALLMARK… IER3 ENSG… 0 NA NA NA NA
## # ... with 7,679 more rows, and 1 more variable: padj <dbl>
Let’s try a different set of pathways. Let’s look at KEGG pathways:
fgsea(pathways=gmtPathways("data/msigdb/c2.cp.kegg.v6.2.symbols.gmt"), ranks, nperm=1000) %>%
as_tibble() %>%
arrange(padj)
## # A tibble: 186 x 8
## pathway pval padj ES NES nMoreExtreme size leadingEdge
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <list>
## 1 KEGG_STARCH… 0.00176 0.0427 -0.654 -1.77 0 33 <chr [20]>
## 2 KEGG_GLYCOS… 0.00183 0.0427 -0.751 -1.87 0 21 <chr [10]>
## 3 KEGG_FOCAL_… 0.00141 0.0427 -0.458 -1.59 0 191 <chr [67]>
## 4 KEGG_NEUROT… 0.00152 0.0427 -0.506 -1.66 0 120 <chr [45]>
## 5 KEGG_REGULA… 0.00141 0.0427 -0.483 -1.68 0 192 <chr [67]>
## 6 KEGG_INSULI… 0.00151 0.0427 -0.543 -1.80 0 126 <chr [47]>
## 7 KEGG_ADIPOC… 0.00159 0.0427 -0.646 -1.94 0 62 <chr [26]>
## 8 KEGG_ACUTE_… 0.00162 0.0427 -0.582 -1.70 0 52 <chr [24]>
## 9 KEGG_GLYCER… 0.00316 0.0456 -0.551 -1.67 1 65 <chr [31]>
## 10 KEGG_AMINOA… 0.00245 0.0456 0.617 1.86 0 41 <chr [11]>
## # ... with 176 more rows
Or miR targets:
fgsea(pathways=gmtPathways("data/msigdb/c3.mir.v6.2.symbols.gmt"), ranks, nperm=1000) %>%
as_tibble() %>%
arrange(padj)
## # A tibble: 221 x 8
## pathway pval padj ES NES nMoreExtreme size leadingEdge
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <list>
## 1 TGAATGT_MIR… 0.00124 0.0346 -0.433 -1.63 0 446 <chr [120]>
## 2 AATGTGA_MIR… 0.00128 0.0346 -0.405 -1.51 0 370 <chr [93]>
## 3 CAGTGTT_MIR… 0.00136 0.0346 -0.419 -1.52 0 286 <chr [95]>
## 4 GTGCCTT_MIR… 0.00115 0.0346 -0.399 -1.54 0 664 <chr [213]>
## 5 CTATGCA_MIR… 0.00141 0.0346 -0.486 -1.71 0 196 <chr [54]>
## 6 TTGCCAA_MIR… 0.00133 0.0346 -0.413 -1.51 0 291 <chr [89]>
## 7 TTTGTAG_MIR… 0.00132 0.0346 -0.411 -1.50 0 306 <chr [82]>
## 8 TGCCTTA_MIR… 0.00121 0.0346 -0.388 -1.47 0 519 <chr [171]>
## 9 GTGCCAA_MIR… 0.00135 0.0346 -0.445 -1.61 0 274 <chr [83]>
## 10 TGTTTAC_MIR… 0.00242 0.0478 -0.361 -1.37 1 532 <chr [144]>
## # ... with 211 more rows
Or GO annotations:
fgsea(pathways=gmtPathways("data/msigdb/c5.all.v6.2.symbols.gmt"), ranks, nperm=1000) %>%
as_tibble() %>%
arrange(padj)
## # A tibble: 5,917 x 8
## pathway pval padj ES NES nMoreExtreme size leadingEdge
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <list>
## 1 GO_CELLULAR… 0.00108 0.0870 -0.376 -1.49 0 1200 <chr [330]>
## 2 GO_CHEMICAL… 0.00115 0.0870 -0.347 -1.35 0 742 <chr [140]>
## 3 GO_RESPONSE… 0.00128 0.0870 -0.418 -1.54 0 355 <chr [96]>
## 4 GO_FORMATIO… 0.00154 0.0870 -0.515 -1.67 0 98 <chr [21]>
## 5 GO_ACTIN_FI… 0.00124 0.0870 -0.434 -1.62 0 409 <chr [125]>
## 6 GO_SINGLE_O… 0.00109 0.0870 -0.332 -1.31 0 1169 <chr [253]>
## 7 GO_SMALL_MO… 0.00106 0.0870 -0.339 -1.35 0 1541 <chr [342]>
## 8 GO_ALPHA_AM… 0.00153 0.0870 -0.551 -1.75 0 83 <chr [24]>
## 9 GO_ORGANOPH… 0.00114 0.0870 -0.360 -1.40 0 767 <chr [229]>
## 10 GO_ESTABLIS… 0.00106 0.0870 -0.328 -1.31 0 1515 <chr [514]>
## # ... with 5,907 more rows
If you have a non-human organism, first create a table to map the fly/mouse/worm/etc gene IDs to human symbol:
library(biomaRt)
mart <- useDataset("mmusculus_gene_ensembl", mart=useMart("ensembl"))
bm <- getBM(attributes=c("ensembl_gene_id", "hsapiens_homolog_associated_gene_name"), mart=mart) %>%
distinct() %>%
as_tibble() %>%
na_if("") %>%
na.omit()
bm
## # A tibble: 26,256 x 2
## ensembl_gene_id hsapiens_homolog_associated_gene_name
## <chr> <chr>
## 1 ENSMUSG00000064370 MT-CYB
## 2 ENSMUSG00000064368 MT-ND6
## 3 ENSMUSG00000064367 MT-ND5
## 4 ENSMUSG00000064363 MT-ND4
## 5 ENSMUSG00000065947 MT-ND4L
## 6 ENSMUSG00000064360 MT-ND3
## 7 ENSMUSG00000064358 MT-CO3
## 8 ENSMUSG00000064357 MT-ATP6
## 9 ENSMUSG00000064356 MT-ATP8
## 10 ENSMUSG00000064354 MT-CO2
## # ... with 26,246 more rows
Then simply join your non-human results to the table above, and continue as you did earlier now using the human symbol and the associated test statistic.
This code was used to generate the human airway results:
library(DESeq2)
library(airway)
ddsSE <- DESeqDataSet(airway, design = ~ cell + dex)
ddsSE <- DESeq(ddsSE)
res <- results(ddsSE, tidy = TRUE)
readr::write_csv(res, path="data/deseq-results-tidy-human-airway.csv")