bio-metagenomics-functional-profiling - SKILL.md Agent Skill

name: bio-metagenomics-functional-profiling description: Profiles the functional potential of shotgun metagenomes with HUMAnN 3's tiered search (MetaPhlAn prescreen, Bowtie2 pangenome, translated DIAMOND vs UniRef), giving gene-family (RPK) and MetaCyc pathway abundances stratified by species. Covers why a metagenome measures potential not activity, why dropping UNMAPPED/UNINTEGRATED biases everything, why stratification is an estimate, coverage-vs-abundance and MinPath/gap-fill, UniRef90-vs-50 and biome database bias, and the assembly/eggNOG/dbCAN/antiSMASH alternatives. Use when obtaining pathway or gene-family abundances, regrouping to KO/EC/GO, normalizing functional tables, or choosing read-based vs assembly-based functional profiling. For AMR genes see amr-detection; for host-gene enrichment see pathway-analysis. tool_type: cli primary_tool: HUMAnN

Version Compatibility

Reference examples tested with: HUMAnN 3.6+, MetaPhlAn 4.1+, DIAMOND 2.1+, pandas 2.2+, scipy 1.12+.

Before using code patterns, verify installed versions match. If versions differ:

CLI: humann --version then humann --help to confirm flags and defaults
Python: pip show <package> then help(module.function) to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Output is driven by the reference databases: the ChocoPhlAn nucleotide pangenome, the UniRef90/50 protein database, and the MetaPhlAn database version HUMAnN calls. Record all three. Match the MetaPhlAn database version to the HUMAnN version when supplying --taxonomic-profile (MetaPhlAn 3 and 4 databases differ), and pin UniRef90 vs UniRef50, which changes both sensitivity and the UNMAPPED fraction.

Functional Profiling

"What can my community do?" -> Quantify gene families and pathways with a tiered search that only translates the reads the fast steps could not place - measuring functional POTENTIAL the community encodes, never what it is expressing.

CLI: humann --input reads.fastq.gz --output out/ --taxonomic-profile sample_metaphlan.tsv --threads 8

Scope: read-based community function (HUMAnN) and the assembly/specialized-database alternatives. Read classification -> kraken-classification, metaphlan-profiling. AMR gene quantification -> amr-detection. Host-gene over-representation (GO/KEGG/GSEA) -> the pathway-analysis category. Assembly/ORF mechanics -> genome-assembly/metagenome-assembly. Host depletion and trimming -> contamination-controls, read-qc.

The Single Most Important Modern Insight -- A Metagenome Measures Potential, Not Activity

A gene family or a "complete pathway" in HUMAnN output is a CAPABILITY the community encodes - never a rate, a flux, or proof of expression. The DNA says the cell could ferment pyruvate; only RNA (metatranscriptome), protein, or metabolite data says it is. RNA functional profiles decouple from gene carriage (Franzosa 2014 PNAS 111:E2329), so narrating a pathabundance table as "the disease microbiome upregulates X" is the cardinal sin - it carries the gene more abundantly, nothing more. If the question is about activity, pair with metatranscriptomics: run the same HUMAnN on RNA and divide RNA-CPM by matched DNA-CPM per feature to get expression per gene copy. Memorable form: HUMAnN reports what the community CAN do, never what it IS doing. And every cell of the table is a model-dependent artifact of a reference database plus a tiered search at chosen thresholds - a hypothesis conditioned on the reference, not a measurement.

The Tiered Search Is the Algorithm

HUMAnN does not brute-force-translate every read. Three tiers each filter the input to the next, so the slow translated step only sees what the fast steps could not place:

Taxonomic prescreen (MetaPhlAn). Detect species, then build a sample-specific ChocoPhlAn pangenome from only species above --prescreen-threshold (default 0.01%). Reuse via --taxonomic-profile to skip re-running MetaPhlAn.
Nucleotide pangenome search (Bowtie2). Reads mapping to that pangenome get a UniRef90 family WITH a high-confidence species label - this is where confident stratification comes from.
Translated protein search (DIAMOND). Reads that failed tier 2 are 6-frame translated and aligned to full UniRef90/50; they get a family but the species is INFERRED or |unclassified - lower-confidence stratification. Reads matching nothing become UNMAPPED.

This model is why --bypass-* flags and --prescreen-threshold change results, and why a stratified contribution from the translated tier is an estimate, not a measurement.

Tool Taxonomy

Tool	Citation	Role	When
HUMAnN 3	Beghini 2021 eLife 10:e65088	tiered read-based gene-family + MetaCyc pathway abundance	quantitative community function across samples
eggNOG-mapper v2	Cantalapiedra 2021 Mol Biol Evol 38:5825	orthology annotation (KO/GO/EC/CAZy/COG) of predicted ORFs	assembly route; flat functional catalogue
DIAMOND	Buchfink 2021 Nat Methods 18:366	fast sensitive protein search (blastx)	custom read-vs-protein-DB profiling; backend of many tools
dbCAN3	Zheng 2023 Nucleic Acids Res 51:W115	CAZyme family/subfamily + substrate	carbohydrate-active enzymes (UniRef under-resolves these)
antiSMASH 7	Blin 2023 Nucleic Acids Res 51:W46	biosynthetic gene cluster detection	secondary-metabolite BGCs; CONTIGS only
MinPath	Ye & Doak 2009 PLoS Comput Biol 5:e1000465	parsimony pathway calling inside HUMAnN	suppresses naive any-gene-implies-pathway over-calling

Decision Tree by Scenario

Scenario	Recommended	Why
Quantitative community function across samples	HUMAnN 3 (read-based)	counts every read; comprehensive; DB-bounded
Gut/host-associated, fine resolution	HUMAnN + UniRef90	well-covered biome; specific families
Soil/marine/novel, big UNMAPPED	HUMAnN + UniRef50, or assembly route	UniRef90 cannot map divergent homologs
Need gene-to-organism/operon context or novel function	assembly + Prodigal + eggNOG-mapper	genomic context; accept loss of the unassembled majority
Carbohydrate-active enzymes	dbCAN3	CAZy families/substrate beat generic UniRef
Biosynthetic gene clusters	antiSMASH (-> assembly first)	clusters span kb; require contigs
Activity, not capability	metatranscriptome (HUMAnN RNA mode) / RNA-DNA ratio	DNA cannot report expression
AMR gene quantification	-> amr-detection	dedicated ARG databases are the standard
Host-gene pathway enrichment	-> pathway-analysis	community pathway abundance is not GSEA

Run HUMAnN and Build a Functional Table

# Pre-QC first: adapter/quality trim AND host-deplete (e.g. KneadData). Host reads inflate UNMAPPED
# and waste DIAMOND time. Paired-end has no native pairing - concatenate R1+R2 into one file.
cat sample_R1.fq.gz sample_R2.fq.gz > sample.fq.gz
humann --input sample.fq.gz --output out/ \
    --taxonomic-profile sample_metaphlan.tsv \   # reuse the MetaPhlAn profile; do NOT --remove-temp-output and lose it
    --threads 8                                   # defaults: prescreen 0.01, translated-id 80 (uniref90), gap-fill on, minpath on

# Normalize PER SAMPLE before cross-sample stats (RPK is depth-dependent), then join and split.
humann_renorm_table -i out/sample_genefamilies.tsv -o out/sample_cpm.tsv -u cpm   # cpm preferred for models
humann_join_tables -i out -o merged_pathabundance.tsv --file_name pathabundance
humann_regroup_table -i merged_pathabundance.tsv -g uniref90_ko -o merged_ko.tsv  # adds an UNGROUPED row - keep it
humann_split_stratified_table -i merged_pathabundance.tsv -o .                     # run stats on the UNSTRATIFIED file

Differential Abundance Without Biasing the Denominator

Goal: Test pathways between conditions without inventing abundance by discarding the unmapped fraction.

Approach: Keep UNMAPPED/UNINTEGRATED through normalization, check they do not differ by group (they often track the phenotype), then test the unstratified community totals with a compositional method (MaAsLin2/ANCOM-BC), not a bare Mann-Whitney on proportions.

import pandas as pd

df = pd.read_csv('merged_pathabundance_unstratified.tsv', sep='\t', index_col=0)
meta = pd.read_csv('metadata.tsv', sep='\t', index_col=0)
unmapped = df.loc[['UNMAPPED', 'UNINTEGRATED']]            # the denominator - inspect, do not drop
g1 = meta.index[meta['condition'] == 'healthy']
g2 = meta.index[meta['condition'] == 'disease']
# If UNMAPPED differs by group, an assigned-feature comparison is confounded - report it.
unmapped_shift = unmapped[g2].mean(axis=1) - unmapped[g1].mean(axis=1)
print('UNMAPPED/UNINTEGRATED shift (disease - healthy):')
print(unmapped_shift.round(1))
# Then hand the unstratified table (UNMAPPED retained) to MaAsLin2/ANCOM-BC, which model
# compositionality and zero-inflation - not a raw t-test/Mann-Whitney on relative abundance.

Per-Method Failure Modes

Dropping UNMAPPED/UNINTEGRATED then renormalizing

Trigger: .drop(['UNMAPPED','UNINTEGRATED']) before relab. Mechanism: rescales assigned features to sum to 1, inventing abundance and erasing the database-coverage signal. Symptom: two samples with 20% vs 60% UNMAPPED look identical; a hit appears or vanishes. Fix: keep them through normalization; report them; if comparing assigned features only, confirm UNMAPPED does not differ by group.

Stratification read as ground truth

Trigger: "species X contributes Y% of pathway Z." Mechanism: tier-2 species labels are confident, tier-3 (translated) are inferred or |unclassified. Symptom: over-confident organism-of-origin claims; ignored unclassified mass. Fix: treat contributions as estimates; flag the unclassified fraction; for confident gene-to-organism linkage use the assembly+binning route.

Gut-centric QC thresholds applied to other biomes

Trigger: flagging a soil run as failed because UNMAPPED > 50%. Mechanism: UniRef is biased toward well-studied microbes; environmental biomes have large genuine dark function. Symptom: healthy environmental runs labeled failures. Fix: in novel biomes a big UNMAPPED is the environment, not a failure - drop to UniRef50 for sensitivity or switch to the assembly route.

Coverage vs abundance confusion; gap-fill/MinPath artifacts

Trigger: reading pathcoverage as abundance or trusting a "complete" pathway. Mechanism: gap-fill (default on) scores pathways with missing reactions; MinPath parsimony prunes redundant pathways so absence is not biological absence; coverage is de-emphasized in recent docs. Symptom: fabricated completeness or missing real-but-redundant pathways. Fix: use abundance for stats; treat coverage as a soft presence prior; know gap-fill and MinPath are on by default.

Quantitative Thresholds

Threshold	Source	Rationale
`--prescreen-threshold` 0.01%	HUMAnN docs	species below this are excluded from the tier-2 pangenome
`--translated-identity-threshold` 80 (UniRef90) / 50 (UniRef50)	HUMAnN docs	the sensitivity knob; divergent homologs fail the 80% cutoff
nucleotide/translated coverage 90/50	HUMAnN docs	query 90%, subject 50% coverage filters
`--evalue` 1.0	HUMAnN docs	permissive DIAMOND e-value; tier design controls specificity
gap-fill on, MinPath on	HUMAnN docs	sensitivity/parsimony defaults that shape pathway presence
Normalize RPK -> CPM per sample before stats	HUMAnN docs	RPK is depth-dependent; CPM preferred for linear/log models

Common Errors

Error / symptom	Cause	Solution
Huge UNMAPPED on a gut sample	host reads not removed	host-deplete + trim before HUMAnN
MetaPhlAn profile gone after run	`--remove-temp-output` deleted `_humann_temp/`	omit it; keep the profile for reuse and taxonomy
`--taxonomic-profile` rejected	MetaPhlAn DB version mismatch with HUMAnN	match the MetaPhlAn database to the HUMAnN version
Pathways look over-called	MinPath off / naive any-gene mapping	keep MinPath on (default)
Stratified DA is all zeros/noise	bare t-test on zero-inflated strata	run stats on the unstratified table with MaAsLin2/ANCOM-BC
CAZymes under-annotated by UniRef	generic database under-resolves CAZy	use dbCAN3 on predicted ORFs

References

Beghini F, McIver LJ, Blanco-Miguez A, et al. 2021. Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. eLife 10:e65088.
Franzosa EA, Morgan XC, Segata N, et al. 2014. Relating the metatranscriptome and metagenome of the human gut. PNAS 111:E2329-E2338.
Ye Y, Doak TG. 2009. A parsimony approach to biological pathway reconstruction/inference for genomes and metagenomes. PLoS Comput Biol 5:e1000465.
Buchfink B, Reuter K, Drost HG. 2021. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods 18:366-368.
Cantalapiedra CP, Hernandez-Plaza A, Letunic I, Bork P, Huerta-Cepas J. 2021. eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol Biol Evol 38:5825-5829.
Hyatt D, Chen GL, LoCascio PF, et al. 2010. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11:119.
Zheng J, Hu B, Zhang X, et al. 2023. dbCAN3: automated carbohydrate-active enzyme and substrate annotation. Nucleic Acids Res 51:W115-W121.
Blin K, Shaw S, Augustijn HE, et al. 2023. antiSMASH 7.0: new and improved predictions. Nucleic Acids Res 51:W46-W50.

Related Skills

metaphlan-profiling - The taxonomic prescreen HUMAnN reuses via --taxonomic-profile
kraken-classification - Alternative taxonomic input
abundance-estimation - Compositional normalization shared with functional tables
amr-detection - Dedicated ARG quantification (HUMAnN can surface AMR families but is not standard)
metagenome-visualization - Plot and test functional tables
contamination-controls - Host depletion before HUMAnN
genome-assembly/metagenome-assembly - Assembly route for contextualized/novel function
pathway-analysis/kegg-pathways - Organism-centric pathway interpretation (not community abundance)