go-kegg-enrichment - SKILL.md Agent Skill

name: go-kegg-enrichment

description: |
  Performs GO (Gene Ontology) and KEGG pathway enrichment analysis on gene lists.
  Trigger when:
  - User provides a list of genes (symbols or IDs) and asks for enrichment analysis
  - User mentions "GO enrichment", "KEGG enrichment", or "pathway analysis"
  - User wants to understand biological functions of gene sets
  - User provides differentially expressed genes (DEGs) and asks for interpretation
  Input: gene list (file or inline), organism (human/mouse/rat), background gene set (optional)
  Output: enriched terms, statistics, visualizations (barplot, dotplot, enrichment map)

version: 1.0.0

category: Bioinfo

tags: []

author: AIPOCH

license: MIT

status: Draft

risk_level: Medium

skill_type: Tool/Script

owner: AIPOCH

reviewer: ''

last_updated: '2026-02-06'

GO/KEGG Enrichment Analysis

Automated pipeline for Gene Ontology and KEGG pathway enrichment analysis with result interpretation and visualization.

Features

GO Enrichment: Biological Process (BP), Molecular Function (MF), Cellular Component (CC)
KEGG Pathway: Pathway enrichment with organism-specific mapping
Multiple ID Support: Gene symbols, Entrez IDs, Ensembl IDs, RefSeq
Statistical Methods: Hypergeometric test, Fisher's exact test, GSEA support
Visualizations: Bar plots, dot plots, enrichment maps, cnet plots
Result Interpretation: Automatic biological significance summary
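To make the statistics behind over-representation analysis concrete, here is a minimal sketch of the one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test on the 2x2 table). It uses only the Python standard library. The function name and the toy numbers are illustrative assumptions, not part of this skill's scripts.

```python
from math import comb


def hypergeom_pvalue(k: int, n: int, K: int, N: int) -> float:
    """Return P(X >= k) for a hypergeometric draw.

    N: genes in the background, K: background genes annotated to the term,
    n: genes in the query list, k: query genes annotated to the term.
    """
    upper = min(n, K)
    total = comb(N, n)
    # Sum the probability mass of every outcome at least as extreme as k.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Toy numbers (assumed for illustration): 20,000 background genes,
# 100 annotated to the term, 200 query genes, 5 of them annotated.
p = hypergeom_pvalue(k=5, n=200, K=100, N=20000)
print(f"p = {p:.4g}")
```

With these toy numbers only one annotated gene is expected by chance (200 × 100 / 20,000), so observing five yields a small p-value.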

Supported Organisms

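| Organism | Scientific Name | KEGG Code | OrgDb Package |
|-------------|-----------------|-----------|---------------|
| Human | Homo sapiens | hsa | org.Hs.eg.db |
| Mouse | Mus musculus | mmu | org.Mm.eg.db |
| Rat | Rattus norvegicus | rno | org.Rn.eg.db |
| Fly | Drosophila melanogaster | dme | org.Dm.eg.db |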
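The organism name passed on the command line has to be translated into a KEGG organism code and a Bioconductor annotation package. The sketch below shows one way to do that lookup. The dictionary layout and the `resolve_organism` helper are hypothetical and may not match how `scripts/main.py` is actually organized.

```python
# Hypothetical lookup table; KEGG codes and OrgDb package names are the standard ones.
ORGANISMS = {
    "human": {"kegg": "hsa", "orgdb": "org.Hs.eg.db"},
    "mouse": {"kegg": "mmu", "orgdb": "org.Mm.eg.db"},
    "rat": {"kegg": "rno", "orgdb": "org.Rn.eg.db"},
    "fly": {"kegg": "dme", "orgdb": "org.Dm.eg.db"},
}


def resolve_organism(name: str) -> dict:
    """Map a user-supplied organism name to its KEGG code and OrgDb package."""
    key = name.strip().lower()
    if key not in ORGANISMS:
        raise ValueError(
            f"Unsupported organism {name!r}; choose from {sorted(ORGANISMS)}"
        )
    return ORGANISMS[key]


print(resolve_organism("Mouse"))
```

Failing early on an unknown organism avoids a later, harder-to-read error from the annotation database or the KEGG API.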

Usage

Basic Usage


# Run enrichment analysis with gene list

python scripts/main.py --genes gene_list.txt --organism human --output results/

Parameters

| Parameter | Description | Default | Required |
|-----------|-------------|---------|----------|
| --genes | Path to gene list file (one gene per line) | - | Yes |
| --pvalue-cutoff | P-value cutoff for significance | 0.05 | No |
| --qvalue-cutoff | Adjusted p-value (q-value) cutoff | 0.2 | No |
| --analysis | Analysis type (go/kegg/all) | all | No |
| --format | Output format (csv/tsv/excel/all) | all | No |
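For readers who want to see how the documented flags and defaults might be wired up, the following is a rough `argparse` sketch. It covers only the parameters listed in the table. The `build_parser` and `probability` helpers are assumptions made for illustration, since the source of `scripts/main.py` is not shown here.

```python
import argparse


def probability(text: str) -> float:
    """Parse a cutoff and reject values outside (0, 1]."""
    value = float(text)
    if not 0.0 < value <= 1.0:
        raise argparse.ArgumentTypeError(f"{text} is not in (0, 1]")
    return value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="GO/KEGG enrichment (illustrative CLI sketch)"
    )
    parser.add_argument("--genes", required=True,
                        help="gene list file, one gene per line")
    parser.add_argument("--pvalue-cutoff", type=probability, default=0.05)
    parser.add_argument("--qvalue-cutoff", type=probability, default=0.2)
    parser.add_argument("--analysis", choices=["go", "kegg", "all"], default="all")
    parser.add_argument("--format", choices=["csv", "tsv", "excel", "all"],
                        default="all")
    return parser


args = build_parser().parse_args(["--genes", "gene_list.txt", "--analysis", "go"])
print(args.genes, args.analysis, args.pvalue_cutoff, args.qvalue_cutoff)
```

Validating the cutoffs at parse time means a typo such as `--pvalue-cutoff 5` is reported before any R code runs.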

Advanced Usage


# GO enrichment only, restricted to specific ontologies
python scripts/main.py \
    --genes deg_upregulated.txt \
    --organism mouse \
    --analysis go \
    --go-ontologies BP,MF \
    --pvalue-cutoff 0.01 \
    --output go_results/

# KEGG enrichment with custom background
python scripts/main.py \
    --genes treatment_genes.txt \
    --background all_expressed_genes.txt \
    --organism human \
    --analysis kegg \
    --qvalue-cutoff 0.05 \
    --output kegg_results/

Input Format

Gene List File


TP53
BRCA1
EGFR
MYC
KRAS
PTEN
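Gene lists exported from spreadsheets or DEG tools often contain blank lines, stray whitespace, or duplicates. The sketch below shows one possible way to clean such a list before analysis. The `clean_gene_list` helper is hypothetical. It deliberately leaves letter case untouched, because mouse and rat symbols are case-sensitive (for example `Trp53`).

```python
from typing import Iterable, List


def clean_gene_list(lines: Iterable[str]) -> List[str]:
    """Strip whitespace, drop blank lines, and remove duplicates, keeping order."""
    seen = set()
    genes = []
    for line in lines:
        gene = line.strip()
        if not gene:
            continue  # skip blank lines
        if gene not in seen:
            seen.add(gene)
            genes.append(gene)
    return genes


# In practice the lines would come from open("gene_list.txt").
sample = ["TP53\n", "BRCA1\n", "\n", " EGFR \n", "TP53\n"]
print(clean_gene_list(sample))  # ['TP53', 'BRCA1', 'EGFR']
```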

With Expression Values (for GSEA)


gene,log2FoldChange
TP53,2.5
BRCA1,-1.8
EGFR,3.2
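GSEA works on a ranked list rather than a plain gene set, and clusterProfiler's GSEA functions expect the scores sorted in decreasing order. The following sketch turns the CSV layout above into such a ranking. The `ranked_gene_list` helper, and the choice to keep the largest absolute fold change for duplicated genes, are assumptions made for this example.

```python
import csv
import io


def ranked_gene_list(handle):
    """Read gene,log2FoldChange rows and return (gene, score) pairs, highest first."""
    scores = {}
    for row in csv.DictReader(handle):
        gene = row["gene"].strip()
        value = float(row["log2FoldChange"])
        # Assumed policy: for duplicated genes keep the strongest change.
        if gene not in scores or abs(value) > abs(scores[gene]):
            scores[gene] = value
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


sample = io.StringIO("gene,log2FoldChange\nTP53,2.5\nBRCA1,-1.8\nEGFR,3.2\n")
print(ranked_gene_list(sample))
# [('EGFR', 3.2), ('TP53', 2.5), ('BRCA1', -1.8)]
```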

Output Files


output/
├── go_enrichment/
│   ├── GO_BP_results.csv       # Biological Process results
│   ├── GO_MF_results.csv       # Molecular Function results
│   ├── GO_CC_results.csv       # Cellular Component results
│   ├── GO_BP_barplot.pdf       # Visualization
│   ├── GO_MF_dotplot.pdf
│   └── GO_summary.txt          # Interpretation summary
├── kegg_enrichment/
│   ├── KEGG_results.csv        # Pathway results
│   ├── KEGG_barplot.pdf
│   ├── KEGG_dotplot.pdf
│   └── KEGG_pathview/          # Pathway diagrams
└── combined_report.html        # Interactive report
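The result CSVs can be post-processed with a few lines of Python. The sketch below assumes the files mirror clusterProfiler's result table, with columns such as `ID`, `Description`, `pvalue`, `p.adjust`, and `Count`. That layout is an assumption about this tool's output, and the sample rows use made-up statistics purely for illustration.

```python
import csv
import io


def significant_terms(handle, padj_cutoff=0.05):
    """Return rows with p.adjust <= cutoff, most significant first."""
    rows = [
        row for row in csv.DictReader(handle)
        if float(row["p.adjust"]) <= padj_cutoff
    ]
    return sorted(rows, key=lambda row: float(row["p.adjust"]))


# Toy stand-in for output/go_enrichment/GO_BP_results.csv (values are invented).
sample = io.StringIO(
    "ID,Description,pvalue,p.adjust,Count\n"
    "GO:0006915,apoptotic process,0.0001,0.002,12\n"
    "GO:0008283,cell population proliferation,0.01,0.08,7\n"
    "GO:0006281,DNA repair,0.0005,0.004,9\n"
)
for term in significant_terms(sample):
    print(term["ID"], term["Description"], term["p.adjust"])
```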

Result Interpretation

The tool automatically generates a biological interpretation that includes:

Top Enriched Terms: Significant GO terms/pathways ranked by enrichment ratio
Functional Themes: Clustered biological themes from enriched terms
Key Genes: Core genes driving enrichment in significant terms
Network Relationships: Gene-term relationship visualization
Clinical Relevance: Disease associations (for human genes)
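The exact definition of "enrichment ratio" used for ranking is not spelled out here. A common choice is fold enrichment, the term's share of the query list divided by its share of the background. The sketch below computes it from clusterProfiler-style `GeneRatio` and `BgRatio` strings such as `10/200`. The helper names and the toy rows are assumptions made for illustration.

```python
from fractions import Fraction


def enrichment_ratio(gene_ratio: str, bg_ratio: str) -> float:
    """Fold enrichment = GeneRatio / BgRatio, both given as 'a/b' strings."""
    return float(Fraction(gene_ratio) / Fraction(bg_ratio))


def rank_terms(rows):
    """Sort result rows by fold enrichment, highest first."""
    return sorted(
        rows,
        key=lambda row: enrichment_ratio(row["GeneRatio"], row["BgRatio"]),
        reverse=True,
    )


# Toy rows (invented numbers) in the shape of a clusterProfiler result table.
rows = [
    {"ID": "term_a", "GeneRatio": "10/200", "BgRatio": "100/20000"},
    {"ID": "term_b", "GeneRatio": "30/200", "BgRatio": "1500/20000"},
]
print(enrichment_ratio("10/200", "100/20000"))  # 10.0
print([row["ID"] for row in rank_terms(rows)])  # ['term_a', 'term_b']
```

Ranking by fold enrichment alone tends to favor very small terms, so it is usually read alongside the adjusted p-value and gene count.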

Technical Difficulty: HIGH

⚠️ AI self-acceptance status: manual review required

This skill requires:

R/Bioconductor environment with clusterProfiler
Multiple annotation databases (org.*.eg.db)
KEGG REST API access
Complex visualization dependencies

Dependencies

Required R Packages


install.packages(c("BiocManager", "ggplot2", "dplyr", "readr"))
BiocManager::install(c(
    "clusterProfiler",
    "org.Hs.eg.db", "org.Mm.eg.db", "org.Rn.eg.db",
    "enrichplot", "pathview", "DOSE"
))

Python Dependencies


pip install pandas numpy matplotlib seaborn rpy2

Example Workflow

Prepare Input: Create gene list from DEG analysis
Run Analysis: Execute main.py with appropriate parameters
Review Results: Check generated CSV files and visualizations
Interpret: Read auto-generated summary for biological insights

References

See references/ for:

clusterProfiler documentation
KEGG API guide
Statistical methods explanation
Visualization examples

Limitations

Requires internet connection for KEGG database queries
Large gene lists (>5000) may require increased memory
Some pathways may not be available for all organisms
KEGG API has rate limits (max 3 requests/second)
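One way to stay under the request limit mentioned above is to space calls at a fixed minimum interval. The sketch below is a small client-side throttle. The `Throttle` class is hypothetical and not part of this skill. The clock and sleep functions are injectable so the behavior can be checked without waiting or touching the network.

```python
import time


class Throttle:
    """Space calls at least 1 / max_per_second seconds apart."""

    def __init__(self, max_per_second=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous call was released

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay


throttle = Throttle(max_per_second=3)
for pathway_id in ["hsa04110", "hsa04115"]:
    throttle.wait()
    # A request to the KEGG REST API for pathway_id would go here.
    print("ready to fetch", pathway_id)
```

A fixed interval is simpler than a token bucket and never sends bursts, at the cost of slightly lower throughput.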

Risk Assessment

| Risk Indicator | Assessment | Level |
|----------------|------------|-------|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | Queries the KEGG REST API; requires an internet connection | Medium |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |

Security Checklist

No hardcoded credentials or API keys
No unauthorized file system access (../)
Output does not expose sensitive information
Prompt injection protections in place
Input file paths validated (no ../ traversal)
Output directory restricted to workspace
Script execution in sandboxed environment
Error messages sanitized (no stack traces exposed)
Dependencies audited
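The path-related items in the checklist can be enforced with a small guard that resolves every user-supplied path and rejects anything outside the workspace. The sketch below is one possible approach. The `resolve_inside` helper and the workspace path are illustrative assumptions, not the skill's actual validation code.

```python
from pathlib import Path


def resolve_inside(workspace: str, user_path: str) -> Path:
    """Resolve user_path relative to workspace and reject escapes such as '../'."""
    root = Path(workspace).resolve()
    # Joining with an absolute user_path discards root, so the check below
    # rejects absolute paths outside the workspace as well.
    candidate = (root / user_path).resolve()
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"{user_path!r} is outside the workspace")
    return candidate


print(resolve_inside("/tmp/workspace", "results/GO_BP_results.csv"))
try:
    resolve_inside("/tmp/workspace", "../secrets.txt")
except ValueError as err:
    print("rejected:", err)
```

Resolving before comparing matters because a purely textual check for `../` misses symlinks and absolute paths.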

Prerequisites


# Python dependencies

pip install -r requirements.txt

Evaluation Criteria

Success Metrics

Successfully executes main functionality
Output meets quality standards
Handles edge cases gracefully
Performance is acceptable

Test Cases

Basic Functionality: Standard input → Expected output
Edge Case: Invalid input → Graceful error handling
Performance: Large dataset → Acceptable processing time

Lifecycle Status

Current Stage: Draft
Next Review Date: 2026-03-06
Known Issues: None
Planned Improvements:
- Performance optimization
- Additional feature support