name: bio-pathway-go-enrichment description: Runs Gene Ontology over-representation analysis (ORA) on a gene LIST with clusterProfiler enrichGO, the one-sided hypergeometric/Fisher 2x2 test phyper(k-1, M, N-M, n, lower.tail=FALSE). Covers why the BACKGROUND universe (not the gene list) is the null and decides significance, why omitting universe= is a bug, why enrichGO defaults to ont='MF' not 'BP', why pvalueCutoff filters p.adjust not raw p, why ORA discards effect magnitude and inherits GO-DAG true-path redundancy (simplify, topGO), why RNA-seq gene-length bias inflates long-gene terms (GOseq Wallenius), plus GeneRatio/BgRatio, bitr ID mapping, minGSSize/maxGSSize, groupGO. Use when a pre-selected gene list (DE hits, co-expression module, screen, GWAS-mapped) needs GO annotation. For a ranked no-cutoff analysis see gsea; for null theory and the method-selection spine see enrichment-foundations; for other databases see kegg-pathways, reactome-pathways, wikipathways; DE source is differential-expression/de-results; plots in enrichment-visualization. tool_type: r primary_tool: clusterProfiler
Version Compatibility
Reference examples tested with: clusterProfiler 4.18.4+, org.Hs.eg.db 3.22+ (goseq 1.54+ for the length-bias snippet).
Before using code patterns, verify installed versions match. If versions differ:
- R:
packageVersion('<pkg>')then?function_nameto verify parameters
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
GO annotation lives in the local org.*.eg.db OrgDb and GO.db, both pinned to the Bioconductor release, so a GO ORA is reproducible given the package versions - record packageVersion('org.Hs.eg.db') and packageVersion('GO.db') with results. enrichGO moved no core arguments recently, but several plot helpers migrated to enrichplot in clusterProfiler 4.x (those live in enrichment-visualization).
GO Over-Representation Analysis
"Which biological processes are enriched in my gene list?" -> Test each GO term for over-representation of the query genes against a defined background with the one-sided hypergeometric test - because the BACKGROUND universe, not the gene list, is what decides which terms look significant.
- R:
enrichGO(gene, universe, OrgDb, keyType='ENTREZID', ont='BP')
Scope: hypergeometric ORA of a gene LIST against GO terms, with background-universe selection, ID conversion, GO-DAG redundancy reduction, RNA-seq length-bias correction, and the generic enricher test for custom gene sets. A ranked-list / no-cutoff analysis -> gsea. The ORA-vs-FCS-vs-topology fork and null theory -> enrichment-foundations. KEGG/Reactome/WikiPathways gene sets -> kegg-pathways, reactome-pathways, wikipathways. The DE list source -> differential-expression/de-results. Plots -> enrichment-visualization.
The Single Most Important Modern Insight -- ORA Is a Competitive 2x2 Hypergeometric Test Whose Null IS the Chosen Universe
ORA does not answer "which pathways are in my gene list". It is a competitive gene-sampling test (Goeman & Buhlmann 2007 Bioinformatics 23:980): of the genes flagged (the foreground), are more annotated to term T than expected when drawing the same number at random from the universe? The p-value is the upper tail of the hypergeometric, computed verbatim by DOSE/clusterProfiler as phyper(k-1, M, N-M, n, lower.tail=FALSE) = P(X>=k), the one-sided Fisher exact test on the 2x2 table. Here N = universe genes carrying any GO annotation, M = universe genes in T, n = foreground genes annotated, k = the overlap (the Count column). The report columns are GeneRatio = k/n and BgRatio = M/N - both denominators restricted to ANNOTATED genes - and fold enrichment = GeneRatio/BgRatio.
Three consequences drive every misuse:
- The universe is the null, not a setting. Change N or M and every p-value changes. Omitting
universe=defaults N to ALL annotated genes (~18k for human BP); if the assay only measured ~12k genes, terms for tissue-restricted and lowly-expressed genes go spuriously significant. Omittinguniverse=is a bug, not a default - set it to the genes that COULD have entered the foreground (the tested-gene set), map foreground and universe identically, and report N. The whole-genome background is defensible only when every gene truly could have been detected (Wijesooriya 2022; Timmons 2015). - ORA throws away magnitude and inherits the GO DAG. A gene is in or out at one threshold (no effect size), so a 2000-gene term at 1.2x fold can beat a 12-gene term at 4x on p-value alone - always read fold enrichment alongside p.adjust. The true-path rule propagates each annotation to all ancestors, so one real signal lights up a whole lineage ("cell cycle", "cell cycle process", "mitotic cell cycle" together); those tests are positively correlated, BH still valid but the term list over-reports. Resolve with
simplify()(semantic collapse, per ontology) or topGO elim/weight (decorrelation in the test). - The deliverable is never "the enriched pathways." It is a correctly-backgrounded, effect-sized, redundancy-resolved short list of HYPOTHESES - and it cannot be used to validate the DE list that produced it (circular: the terms are a deterministic function of the same genes; Timmons 2015).
ORA vs GSEA (the central fork)
ORA needs a pre-selected LIST plus a BACKGROUND and binarizes significant/not; GSEA needs a RANKED vector of ALL genes and no cutoff. Pick by whether a ranking exists and whether the cutoff would be arbitrary. The full three-generations taxonomy (ORA vs FCS vs topology) and competitive-vs-self-contained null theory live in enrichment-foundations - this skill owns the ORA/GO slice.
| Scenario | Method | Why |
|---|---|---|
| All genes carry a DE statistic, cutoff would be arbitrary | GSEA (gseGO) -> gsea | uses the full ranking; no threshold |
| Pre-selected list (co-expression module, GWAS-mapped, screen hits, markers) | ORA (enrichGO) | no ranking available; ORA is appropriate |
| Very small list (< ~15-20 genes) | low ORA power; report fold enrichment + counts, consider GSEA | hypergeometric power collapses on tiny lists |
| RNA-seq DE list with length/selection bias | GOseq (Wallenius) | length-corrected ORA; standard ORA inflates long-gene terms |
Tool Taxonomy
| Source / method | Citation | Mechanism / role | When |
|---|---|---|---|
| enrichGO (clusterProfiler) | Yu 2012 OMICS 16:284; Wu 2021 Innovation 2:100141 | one-sided hypergeometric per GO term; local OrgDb | the default ORA workhorse for a gene list |
| GO DAG (BP/MF/CC) | Ashburner 2000 Nat Genet 25:25 | three DAGs; true-path propagation to ancestors | the annotation structure being tested |
| simplify (GOSemSim) | Wang 2007 Bioinformatics 23:1274 | semantic-similarity de-redundancy, per ontology | collapse redundant ancestor lineages, keep calibrated p/FDR |
| topGO elim/weight/weight01 | Alexa 2006 Bioinformatics 22:1600 | decorrelates the GO graph inside the test | specificity-resolved short list (treat scores as ranking, not FDR) |
| GOseq | Young 2010 Genome Biol 11:R14 | Wallenius noncentral hypergeometric weighted by a length PWF | RNA-seq DE with gene-length/selection bias |
| enricher (clusterProfiler) | Yu 2012 OMICS 16:284 | same hypergeometric engine on a custom TERM2GENE | any gene set (MSigDB, in-house) not in a DB function |
| gseGO / GSEA | (route -> gsea) | rank-based running-sum, permutation null | a ranking exists; no arbitrary cutoff |
Run the GO ORA
Goal: Find GO terms over-represented in a gene list relative to the genes that could have been selected.
Approach: Build the foreground and the universe with the SAME ID mapping, set ont explicitly (the source default is 'MF'), pass universe= (omitting it is a bug), and read fold enrichment alongside p.adjust.
library(clusterProfiler)
library(org.Hs.eg.db)
ego <- enrichGO(gene = gene_list, # foreground ENTREZ IDs
universe = universe_ids, # tested-gene set, mapped identically -- NOT the genome
OrgDb = org.Hs.eg.db,
keyType = 'ENTREZID',
ont = 'BP', # SET explicitly: source default is 'MF', not 'BP'
pAdjustMethod = 'BH',
pvalueCutoff = 0.05, # filters p.adjust (despite the name), not raw pvalue
qvalueCutoff = 0.2,
minGSSize = 10,
maxGSSize = 500,
readable = TRUE) # map ENTREZ -> SYMBOL in the output
The returned enrichResult has columns ID, Description, GeneRatio, BgRatio, pvalue, p.adjust, qvalue, geneID, Count (plus ONTOLOGY when ont='ALL'). pvalueCutoff filters the ADJUSTED p, so an empty table usually means the cutoff or the universe, not biology - inspect everything with pvalueCutoff=1, qvalueCutoff=1.
Build the Foreground and Universe from DE Results
Goal: Turn a DE table into the foreground gene vector and the matched background universe.
Approach: Filter the DE table to the hits for the foreground; take the genes that were actually TESTED for the universe (DESeq2: rows with non-NA pvalue survive independent filtering); map both with the same bitr call. The DE mechanics and the $padj/$adj.P.Val column choice live in differential-expression/de-results.
de <- read.csv('de_results.csv')
sig_genes <- de$gene_id[de$padj < 0.05 & abs(de$log2FoldChange) > 1] # foreground = hits
all_tested <- de$gene_id[!is.na(de$pvalue)] # universe = tested genes, NOT all rows, NOT the genome
fg_map <- bitr(sig_genes, fromType = 'SYMBOL', toType = 'ENTREZID', OrgDb = org.Hs.eg.db)
bg_map <- bitr(all_tested, fromType = 'SYMBOL', toType = 'ENTREZID', OrgDb = org.Hs.eg.db)
gene_list <- unique(fg_map$ENTREZID) # deduplicate one-to-many maps before counting
universe_ids <- unique(bg_map$ENTREZID)
bitr one-to-many maps produce duplicate rows that inflate Count; deduplicate. If more than ~15% of genes fail to convert the result is unreliable - report the conversion rate. Mixed up- and down-regulated genes cancel in one list: run ORA separately per direction when direction matters.
Reduce GO-DAG Redundancy with simplify
Goal: Collapse the redundant ancestor lineage so one biological signal is one entry, not a dozen.
Approach: simplify() removes terms whose semantic similarity to a kept term exceeds the cutoff. It operates on ONE ontology (GOSemSim defines similarity within a single DAG), so run BP/MF/CC separately and simplify each - it does NOT de-redundify an ont='ALL' object.
ego_bp <- enrichGO(gene_list, universe = universe_ids, OrgDb = org.Hs.eg.db, keyType = 'ENTREZID', ont = 'BP', readable = TRUE)
ego_bp <- simplify(ego_bp, cutoff = 0.7, by = 'p.adjust', select_fun = min, measure = 'Wang')
measure='Wang' (the default) is graph-topology-based and stable across annotation releases; IC-based measures ('Resnik', 'Lin', 'Jiang', 'Rel') shift with the annotation corpus. topGO elim/weight01 is the alternative that decorrelates inside the test, returning a specificity-resolved list directly - but its conditioned p-values are best treated as a ranking, not calibrated FDR (Alexa 2006).
Correct RNA-seq Length Bias with GOseq
Goal: Stop long, highly-expressed genes from looking enriched for a purely technical reason.
Approach: DE-detection power scales with read count, which scales with transcript length and expression, so the foreground is enriched for long genes - and RPKM/TMM normalization does NOT fix it (it corrects abundance, not detection power). GOseq fits a probability weighting function (PWF) over the bias variable and tests with the Wallenius noncentral hypergeometric (Young 2010). The input is a NAMED 0/1 vector over ALL tested genes; goseq returns UNADJUSTED p-values, so apply BH afterward.
library(goseq)
all_genes <- de$gene_id[!is.na(de$pvalue)]
de_genes <- as.integer(all_genes %in% sig_genes) # named binary vector over the tested set
names(de_genes) <- all_genes
pwf <- nullp(de_genes, 'hg38', 'ensGene') # fits the length PWF; inspect the fit plot
go <- goseq(pwf, 'hg38', 'ensGene', method = 'Wallenius') # default; 'Hypergeometric' ignores bias (= standard ORA)
go$padj <- p.adjust(go$over_represented_pvalue, method = 'BH') # goseq does NOT BH-correct internally
GSEA on a length-neutral ranking statistic (the moderated t / Wald z) is largely immune to this bias - one more reason to consider gsea for RNA-seq.
All Three Ontologies and a Descriptive Breakdown
ont='ALL' runs BP/MF/CC separately and rbinds them with an ONTOLOGY column (pool=FALSE default; pool=TRUE treats the three as one set). groupGO is NOT a test - it classifies genes at a fixed DAG level for a GO-slim overview (counts, no p-values); never read its counts as significance.
ego_all <- enrichGO(gene_list, universe = universe_ids, OrgDb = org.Hs.eg.db, keyType = 'ENTREZID', ont = 'ALL', readable = TRUE)
ggo <- groupGO(gene_list, OrgDb = org.Hs.eg.db, keyType = 'ENTREZID', ont = 'BP', level = 3, readable = TRUE)
Custom Gene Sets with enricher
For gene sets not covered by a DB function (MSigDB collections, in-house sets), enricher runs the SAME hypergeometric engine against a two-column TERM2GENE table; pass the same explicit universe.
ego_custom <- enricher(gene_list, TERM2GENE = t2g, universe = universe_ids,
pvalueCutoff = 0.05, pAdjustMethod = 'BH', minGSSize = 10, maxGSSize = 500, qvalueCutoff = 0.2)
Other Organisms
Swap the OrgDb: org.Mm.eg.db (mouse), org.Dr.eg.db (zebrafish), org.Sc.sgd.db (yeast, keyType='ORF'). Check usable key types with keytypes(OrgDb).
Per-Method Failure Modes
Whole-genome or default universe
Trigger: omitting universe=, or passing the genome when the assay measured fewer genes. Mechanism: N defaults to all annotated genes, inflating the denominator with genes that never could have been selected. Symptom: a confident table where tissue-restricted / lowly-expressed-gene terms dominate. Fix: set universe= to the tested-gene set, map foreground and universe identically, report N.
p read without fold enrichment (term-size trap)
Trigger: ranking results by p.adjust alone. Mechanism: a 2000-gene term has enormous power at tiny fold enrichment; p scales with term size. Symptom: vague broad terms ("cellular process") top the list, specific terms buried. Fix: read fold enrichment = (k/n)/(M/N) alongside p.adjust; trim extremes with minGSSize=10, maxGSSize=500.
Redundant ancestor lineage counted as findings
Trigger: reporting "cell cycle", "cell cycle process", "mitotic cell cycle" as separate discoveries. Mechanism: true-path propagation lights up a whole lineage from one signal; the tests are positively correlated. Symptom: the top 20 is one biological theme repeated. Fix: simplify() per ontology, or topGO weight01; never count lineage members as independent hits.
RNA-seq length / selection bias
Trigger: standard ORA on an RNA-seq DE list without length correction. Mechanism: detection power scales with count ~ length/expression; TMM/RPKM fixes abundance, not power. Symptom: long-gene categories (ECM, adhesion) enriched, short-gene (ribosomal) depleted - and it survives FDR. Fix: GOseq with a length PWF + method='Wallenius', then BH; or GSEA on a bias-neutral statistic.
Wrong ID type or silent gene loss
Trigger: passing ENSEMBL/SYMBOL with a mismatched keyType, or not checking the bitr conversion rate. Mechanism: unmapped IDs are dropped, shrinking the foreground; one-to-many maps inflate Count. Symptom: "no gene can be mapped", or a suspiciously small/large Count. Fix: match keyType to one of keytypes(OrgDb), deduplicate after bitr, report conversion rate (flag >15% loss).
pvalueCutoff misread as raw-p filter
Trigger: concluding "no significant terms" when strong raw p exists. Mechanism: pvalueCutoff filters p.adjust, not pvalue. Symptom: an empty table despite plausible signal. Fix: inspect with pvalueCutoff=1, qvalueCutoff=1, then judge on p.adjust.
simplify on ont='ALL'
Trigger: calling simplify() on an ont='ALL' object. Mechanism: semantic similarity is defined within ONE ontology, not across BP/MF/CC. Symptom: redundancy not removed, or an error. Fix: run BP/MF/CC separately and simplify each.
Quantitative Thresholds
| Threshold | Source | Rationale |
|---|---|---|
pvalueCutoff = 0.05 |
clusterProfiler default | filters on p.adjust (NOT raw pvalue); standard FDR gate |
qvalueCutoff = 0.2 |
clusterProfiler default | secondary q-value gate; loosen to 1 to inspect all terms |
pAdjustMethod = 'BH' |
Benjamini-Hochberg | controls FDR; valid under the positive dependence of true-path-correlated terms (Bonferroni is needlessly strict here) |
minGSSize = 10 |
enrichGO default | drop tiny sets that overfit and are noisy |
maxGSSize = 500 |
enrichGO default | drop huge general sets that always "enrich" with trivial fold |
simplify(cutoff = 0.7) |
GOSemSim/Wang | semantic-similarity redundancy cutoff; lower keeps more terms, higher is more aggressive |
| fold enrichment > 2 | heuristic | (k/n)/(M/N); a rough "strong" flag, never a substitute for p.adjust |
| ID-conversion loss > 15% | heuristic | above this the foreground is too eroded to trust; report the rate |
Common Errors
| Error / symptom | Cause | Solution |
|---|---|---|
--> No gene can be mapped |
wrong keyType / OrgDb, or IDs not in the OrgDb | match keyType to keytypes(OrgDb); bitr to ENTREZID first |
| Empty result table | pvalueCutoff filters p.adjust; or universe too large; or IDs lost |
set cutoffs to 1 to inspect; fix the universe; check conversion rate |
| Vague broad terms dominate | ranking by p alone (term-size trap) | read fold enrichment; trim with minGSSize/maxGSSize |
| Many redundant ancestor terms | GO-DAG true-path propagation | simplify() per ontology, or topGO weight01 |
| simplify does nothing / errors on ALL | similarity is per-ontology | run BP/MF/CC separately |
| Description column shows IDs not names | not readable | readable=TRUE or setReadable(ego, OrgDb, 'ENTREZID') |
| Tested MF when expecting BP | enrichGO default ont='MF' |
set ont explicitly every call |
References
- Ashburner M, Ball CA, Blake JA, et al. 2000. Gene Ontology: tool for the unification of biology. Nat Genet 25:25-29.
- Yu G, Wang LG, Han Y, He QY. 2012. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS 16:284-287.
- Wu T, Hu E, Xu S, et al. 2021. clusterProfiler 4.0: a universal enrichment tool for interpreting omics data. The Innovation 2(3):100141.
- Goeman JJ, Buhlmann P. 2007. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 23:980-987.
- Alexa A, Rahnenfuhrer J, Lengauer T. 2006. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 22(13):1600-1607.
- Young MD, Wakefield MJ, Smyth GK, Oshlack A. 2010. Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biol 11(2):R14.
- Wang JZ, Du Z, Payattakool R, et al. 2007. A new method to measure the semantic similarity of GO terms. Bioinformatics 23(10):1274-1281.
- Timmons JA, Szkop KJ, Gallagher IJ. 2015. Multiple sources of bias confound functional enrichment analysis of global -omics data. Genome Biol 16:186.
- Wijesooriya K, Jadaan SA, Perera KL, et al. 2022. Urgent need for consistent standards in functional enrichment analysis. PLoS Comput Biol 18(3):e1009935.
Related Skills
- enrichment-foundations - The ORA-vs-FCS-vs-topology fork, null theory, and method-selection spine
- gsea - Ranked-list GSEA alternative when a full ranking exists and a cutoff is arbitrary
- kegg-pathways - KEGG pathway and module enrichment
- reactome-pathways - Reactome curated-pathway enrichment
- wikipathways - WikiPathways community-pathway enrichment
- enrichment-visualization - Dot/bar/cnet/emap/tree plots of enrichment results
- differential-expression/de-results - Source of the gene list and the tested-gene universe
- database-access/entrez-fetch - Fetch gene annotations / ID maps from NCBI
- workflows/expression-to-pathways - End-to-end DE-to-enrichment pipeline