bio-metagenomics-metaphlan


Profiles shotgun metagenomes to species/SGB relative abundance with MetaPhlAn 4's clade-specific marker genes (bowtie2 short reads, minimap2 long reads). Covers why a MetaPhlAn percentage is a cell fraction (genome-size-normalized taxonomic abundance) and must never be merged with Kraken/Bracken read fractions, kSGB vs uSGB units for quantifying database-absent taxa, the unknown-fraction rescaling and its version-default flip, --index pinning as a batch variable, and when mOTUs3 or sourmash gather beat marker profiling. Use when profiling who-is-there with high precision, needing HMP-comparable species abundances, quantifying novel taxa, or deciding marker-gene vs k-mer profiling. For k-mer classification see kraken-classification; for strains see strain-tracking; for 16S amplicon see the microbiome category.

By GPTomics. Updated 6/9/2026

name: bio-metagenomics-metaphlan
description: Profiles shotgun metagenomes to species/SGB relative abundance with MetaPhlAn 4's clade-specific marker genes (bowtie2 short reads, minimap2 long reads). Covers why a MetaPhlAn percentage is a cell fraction (genome-size-normalized taxonomic abundance) and must never be merged with Kraken/Bracken read fractions, kSGB vs uSGB units for quantifying database-absent taxa, the unknown-fraction rescaling and its version-default flip, --index pinning as a batch variable, and when mOTUs3 or sourmash gather beat marker profiling. Use when profiling who-is-there with high precision, needing HMP-comparable species abundances, quantifying novel taxa, or deciding marker-gene vs k-mer profiling. For k-mer classification see kraken-classification; for strains see strain-tracking; for 16S amplicon see the microbiome category.
tool_type: cli
primary_tool: MetaPhlAn

Version Compatibility

Reference examples tested with: MetaPhlAn 4.1+, Bowtie2 2.5.3+, minimap2 2.26+, pandas 2.2+.

Before using code patterns, verify installed versions match. If versions differ:

  • CLI: metaphlan --version then metaphlan --help to confirm flag names and defaults
  • Python: pip show <package> then help(module.function) to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

The marker DATABASE version is the experimental variable. Results track the index (e.g. mpa_vJun23_CHOCOPhlAnSGB_202403 vs the live vJan25 build); MetaPhlAn 3 and MetaPhlAn 4 databases are not interchangeable. Pin --index and report it like a reagent lot. Version 4.2 made two kinds of breaking change. Two flags were renamed: --bowtie2out -> --mapout and --bowtie2db -> --db_dir (the --input_type value bowtie2out likewise becomes mapout). Unknown-fraction estimation also flipped from opt-in (--unclassified_estimation) to on by default (pass --skip_unclassified_estimation to disable). Confirm against metaphlan --help.
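The renames described above can be captured in a small lookup so that wrapper scripts pick the right spelling for the installed version. This is a minimal sketch that encodes only what this section states; the function name and dictionary keys are hypothetical, and the result should still be checked against metaphlan --help.

```python
def metaphlan_flag_names(version):
    """Return flag spellings for a (major, minor) MetaPhlAn version tuple.

    Encodes only the 4.2 renames listed in this document; it is not
    derived from MetaPhlAn itself.
    """
    renamed = tuple(version) >= (4, 2)
    return {
        "mapping_cache_flag": "--mapout" if renamed else "--bowtie2out",
        "database_dir_flag": "--db_dir" if renamed else "--bowtie2db",
        "cache_input_type": "mapout" if renamed else "bowtie2out",
        # Flag needed to get the UNCLASSIFIED row; None means it is already on.
        "enable_unknown_flag": None if renamed else "--unclassified_estimation",
    }


old_flags = metaphlan_flag_names((4, 1))
new_flags = metaphlan_flag_names((4, 2))
print(old_flags["mapping_cache_flag"], "->", new_flags["mapping_cache_flag"])
```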

MetaPhlAn Profiling

"Who is in my metagenome, by cell fraction?" -> Detect which clades' private marker genes are present, average their per-marker coverage, and normalize to a genome-size-aware relative abundance - so the percentage is a fraction of cells, not of reads.

  • CLI: metaphlan reads_1.fq.gz,reads_2.fq.gz --input_type fastq --index mpa_vJun23_CHOCOPhlAnSGB_202403 -o profile.txt --mapout sample.bz2

Scope: marker-gene species/SGB profiling and its alternatives (mOTUs3, sourmash gather). K-mer read classification -> kraken-classification. Strain-resolved SNV haplotypes -> strain-tracking. Functional profiling -> functional-profiling. Compositional stats and plotting -> metagenome-visualization. 16S amplicon -> the microbiome category.

The Single Most Important Modern Insight -- A MetaPhlAn Percentage Is a Cell Fraction, Not a Read Fraction

A MetaPhlAn percentage estimates what fraction of the CELLS in the community belong to a clade - a genome-size-normalized taxonomic abundance. A Kraken/Bracken percentage estimates what fraction of the READS came from a clade - a sequence abundance. There is no sample-independent conversion between them, because sequence abundance under-estimates small-genome microbes and over-estimates large-genome ones by a factor that depends on the whole community's genome-size distribution (Sun 2021 Nat Methods 18:618). Therefore:

  • Never merge MetaPhlAn percentages with Kraken/Bracken percentages into one table, correlate them, or benchmark one against the other. Disagreement between them is expected even when both are correct.
  • Marker profiling is not "classify every read." It detects which clades' PRIVATE markers are present (default presence gate: reads cover roughly 20% of a clade's markers) and averages their per-marker coverage. Most reads are never assigned - by design, not failure.

Mnemonic: markers measure WHO is there (cells); k-mers measure HOW MUCH DNA is there (reads).
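A small worked example may help show why the two quantities diverge. The sketch below assumes an idealized community (uniform sequencing coverage, one genome copy per cell) with two made-up taxa present at equal cell counts but different genome sizes; the taxon names and sizes are illustrative, not measured values.

```python
def read_fractions(cell_fractions, genome_sizes_bp):
    """Expected read (sequence) fractions from cell fractions and genome sizes.

    Assumes reads are sampled in proportion to the total DNA each taxon
    contributes, i.e. cell fraction times genome size.
    """
    dna = {t: cell_fractions[t] * genome_sizes_bp[t] for t in cell_fractions}
    total = sum(dna.values())
    return {t: amount / total for t, amount in dna.items()}


# Hypothetical community: equal numbers of cells, 2 Mb vs 6 Mb genomes.
cells = {"small_genome_taxon": 0.5, "large_genome_taxon": 0.5}
sizes = {"small_genome_taxon": 2_000_000, "large_genome_taxon": 6_000_000}

reads = read_fractions(cells, sizes)
print(reads)  # {'small_genome_taxon': 0.25, 'large_genome_taxon': 0.75}
```

Both tools would be correct here: a marker profiler reporting 50/50 and a read classifier reporting 25/75 describe the same sample in different units.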

SGBs: the Unit Is Species-Level, and uSGBs Quantify the Unnamed

MetaPhlAn 4's atomic taxon is the SGB (species-level genome bin, a ~95% ANI cluster), not an NCBI species. A kSGB contains a cultured reference genome and gets a Latin name; a uSGB is defined only from MAGs (>=5 required) and is reported with a placeholder ID and no name. Quantifying uSGBs - taxa with no reference genome - is MetaPhlAn 4's headline advance over MetaPhlAn 3 and explains ~20% more gut reads, >40% more in under-characterized environments (Blanco-Miguez 2023 Nat Biotechnol 41:1633). Consequences: an unnamed t__SGB... row is a real quantified taxon - do not drop it; one named species can split into several SGBs; MetaPhlAn 3 species profiles and MetaPhlAn 4 SGB profiles are not row-compatible (use sgb_to_gtdb_profile.py for GTDB names). The t__ tier is the SGB, NOT a strain - strain resolution is StrainPhlAn (-> strain-tracking).
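To keep unnamed SGBs in downstream tables, one option is to select rows by the t__ tier rather than by whether the species has a Latin name. The sketch below assumes a tab-separated profile with the clade name in the first column and the relative abundance in the third; the example rows are made up and use shortened lineages.

```python
def sgb_tier_rows(profile_lines):
    """Return (sgb_id, abundance) for every t__ (SGB-tier) row.

    Rows are kept whether or not the SGB has a Latin species name.
    Assumes columns: clade_name, NCBI_tax_id, relative_abundance, ...
    """
    rows = []
    for line in profile_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        clade_name, abundance = fields[0], float(fields[2])
        if "|t__" in clade_name:
            rows.append((clade_name.rsplit("|", 1)[-1], abundance))
    return rows


# Made-up example rows (real lineages run k__ through t__).
example_profile = [
    "#clade_name\tNCBI_tax_id\trelative_abundance\tadditional_species",
    "k__Bacteria|s__Example_species\tNA\t70.0\t",
    "k__Bacteria|s__Example_species|t__SGB0001\tNA\t70.0\t",
    "k__Bacteria|s__GGB0002_SGB0003|t__SGB0003\tNA\t30.0\t",
]

sgbs = sgb_tier_rows(example_profile)
print(sgbs)  # [('t__SGB0001', 70.0), ('t__SGB0003', 30.0)]
```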

Tool Taxonomy

| Tool | Citation | Mechanism / role | When |
|---|---|---|---|
| MetaPhlAn 4 | Blanco-Miguez 2023 Nat Biotechnol 41:1633 | ~189 clade-specific markers/SGB; robust coverage average | high-precision species/SGB %, HMP-comparable, characterized communities |
| mOTUs3 | Ruscheweyh 2022 Microbiome 10:212 | 10 universal single-copy marker genes | higher recall of novel/divergent taxa; transparent marker-hit confidence |
| sourmash gather | Pierce 2019 F1000Res 8:1006 | FracMinHash containment, minimum metagenome cover | genome-resolved hits vs all of GTDB + an honest unknown fraction |
| Kraken2 + Bracken | Wood 2019 Genome Biol 20:257 | k-mer LCA + Bayesian reestimation | -> kraken-classification; max recall, willing to filter false positives |

Decision Tree by Scenario

| Scenario | Recommended | Why |
|---|---|---|
| Human gut species %, low false positives, HMP-comparable | MetaPhlAn 4 | curated SGB markers; high precision; huge corpus |
| Quantify novel / database-absent taxa | MetaPhlAn 4 uSGBs OR mOTUs3 ext-mOTUs | reference-independent units |
| Maximize recall in under-characterized environments | mOTUs3 or sourmash gather | universal markers / containment vs everything |
| Genome-resolved + explicit unknown fraction | sourmash gather | minimum metagenome cover reports what is unexplained |
| Max recall of every read, speed | -> kraken-classification | k-mer LCA; filter the false-positive tail |
| Need cell fraction, not read fraction | MetaPhlAn / mOTUs | k-mer tools report read fraction |
| Strain-level resolution | -> strain-tracking | per-SNV haplotypes, not species profiling |
| Composition stats next | -> metagenome-visualization (CLR/ANCOM-BC) | output is closed; naive stats on percentages are invalid |

Basic Profiling

# Paired-end reads are passed as ONE comma-separated argument (MetaPhlAn treats them as two
# single-end files - it does not use insert/pairing info). Pin the index for reproducibility.
# --index:  pin it; DB version is a batch variable
# --mapout: cache the read->marker mapping (pre-4.2: --bowtie2out)
metaphlan reads_R1.fastq.gz,reads_R2.fastq.gz \
    --input_type fastq \
    --index mpa_vJun23_CHOCOPhlAnSGB_202403 \
    --nproc 8 \
    --mapout sample.map.bz2 \
    --output_file profile.txt

Re-Profile from the Mapping Cache (the real operational lever)

Goal: Try different analysis types, levels, or estimator settings without realigning.

Approach: Save the mapping once with --mapout, then re-run from it with --input_type mapout (pre-4.2: bowtie2out). Realignment is the expensive step; everything downstream is free.

# --tax_lev: one of k,p,c,o,f,g,s,t (t = SGB tier)
# --stat_q:  quantile-truncated robust mean of per-marker coverages;
#            0.2 drops the top/bottom 20% and averages the middle 60%
metaphlan sample.map.bz2 --input_type mapout \
    --tax_lev s \
    --stat_q 0.2 \
    --output_file profile_species.txt

--stat_q down-weights markers in HGT/mobile and conserved cross-clade regions; the default 0.2 is a sensible robust mean. Changing it changes the reported abundances - report it if it is changed. Long reads (4.1+) route to minimap2 with --long_reads.
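The effect of the truncated mean can be illustrated with a few lines of Python. This is a simplified sketch of a quantile-truncated average, not MetaPhlAn's actual implementation (which also accounts for details such as marker length); the coverage values are made up, with one dropout marker and one inflated marker standing in for HGT or conserved-region outliers.

```python
def truncated_mean(coverages, q=0.2):
    """Mean of per-marker coverages after dropping the lowest and highest
    fraction q of markers (a simplified stand-in for --stat_q)."""
    if not coverages:
        raise ValueError("need at least one marker coverage")
    if not 0 <= q < 0.5:
        raise ValueError("q must be in [0, 0.5)")
    values = sorted(coverages)
    k = int(len(values) * q)           # markers trimmed from each tail
    kept = values[k:len(values) - k]   # the middle (1 - 2q) of markers
    return sum(kept) / len(kept)


# Hypothetical per-marker coverages for one clade: one marker missing (0)
# and one inflated by reads from another clade (90).
marker_coverage = [0, 4, 5, 5, 6, 5, 4, 6, 5, 90]

robust = truncated_mean(marker_coverage, q=0.2)
naive = sum(marker_coverage) / len(marker_coverage)
print(robust, naive)  # 5.0 13.0
```

The outlier marker more than doubles the naive mean but leaves the truncated mean at the typical coverage, which is why a changed --stat_q value should be reported alongside the abundances.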

The Unknown Fraction Rescales Everything

Relative abundance sums to 100% only over DETECTED clades. With unknown estimation OFF (pre-4.2 default), known taxa absorb 100% and the database-absent community is invisible - overstating every known taxon. With it ON (4.2 default), an UNCLASSIFIED row appears and every known abundance shrinks proportionally. In soil/marine/rumen the unknown fraction can be the largest "taxon" in the sample.

# 4.2 default includes the UNCLASSIFIED row. To force it on pre-4.2: --unclassified_estimation
# For SAM input, pass --nreads <total> or the unknown fraction is wrong.
metaphlan reads.fastq.gz --input_type fastq -o profile.txt   # 4.2: UNCLASSIFIED row present by default

Pre-4.2-default and 4.2-default outputs are not comparable abundances - mixing them is a hidden batch effect.
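The proportional shrinkage is simple arithmetic, sketched below with made-up numbers. The helper names are hypothetical; the first function shows how known abundances shrink when an unknown fraction is reported, and the second shows the renormalization needed before comparing such a profile with one produced with unknown estimation off.

```python
def rescale_with_unknown(known_pct, unknown_pct):
    """Shrink known abundances (summing to 100) to make room for an
    UNCLASSIFIED row of unknown_pct percent."""
    factor = (100.0 - unknown_pct) / 100.0
    rescaled = {taxon: pct * factor for taxon, pct in known_pct.items()}
    rescaled["UNCLASSIFIED"] = unknown_pct
    return rescaled


def renormalize_known(profile_pct):
    """Drop the UNCLASSIFIED row and rescale the rest to sum to 100."""
    known = {t: p for t, p in profile_pct.items() if t != "UNCLASSIFIED"}
    total = sum(known.values())
    return {t: 100.0 * p / total for t, p in known.items()}


# Hypothetical sample: two detected taxa, half the community unknown.
without_unknown = {"taxon_a": 60.0, "taxon_b": 40.0}
with_unknown = rescale_with_unknown(without_unknown, unknown_pct=50.0)
print(with_unknown)                     # taxon_a 30, taxon_b 20, UNCLASSIFIED 50
print(renormalize_known(with_unknown))  # back to taxon_a 60, taxon_b 40
```

Ratios between known taxa are unchanged (60:40 either way); only the absolute percentages move, which is what makes mixed settings a batch effect rather than an obvious error.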

Merge and Convert

# All inputs MUST come from the SAME database index or rows mismatch silently.
merge_metaphlan_tables.py profiles/*_profile.txt > merged_abundance.txt
sgb_to_gtdb_profile.py -i merged_abundance.txt -o merged_gtdb.txt   # recover GTDB names for SGBs

Per-Method Failure Modes

MetaPhlAn percentages merged with Kraken percentages

Trigger: putting MetaPhlAn and Bracken abundances in one matrix or correlating them. Mechanism: cell fraction vs read fraction - different quantities (Sun 2021). Symptom: "tools disagree," spurious scatter, broken ML/differential-abundance features. Fix: keep them separate; if harmonizing, convert via genome length (Bracken counts / genome length, renormalize) and accept it is approximate.
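The approximate harmonization mentioned in the fix (counts divided by genome length, then renormalized) might look like the following. This is a sketch under stated assumptions: the taxon names, read counts, and genome lengths are made up, and real use would need a genome length for every taxon, which is itself a source of error.

```python
def approx_cell_fractions(read_counts, genome_lengths_bp):
    """Approximate cell fractions from per-taxon read counts.

    Divides each count by the taxon's genome length (reads per bp is
    proportional to genome copies), then renormalizes to sum to 1.
    """
    per_bp = {t: read_counts[t] / genome_lengths_bp[t] for t in read_counts}
    total = sum(per_bp.values())
    return {t: value / total for t, value in per_bp.items()}


# Hypothetical Bracken-style counts: taxon_b has 3x the reads of taxon_a
# but also a 3x larger genome.
counts = {"taxon_a": 1000, "taxon_b": 3000}
lengths = {"taxon_a": 2_000_000, "taxon_b": 6_000_000}

cell_like = approx_cell_fractions(counts, lengths)
print(cell_like)  # both taxa come out at 0.5
```

Even after this correction the two outputs should not be pooled with native MetaPhlAn percentages, because detection thresholds, databases, and unknown-fraction handling still differ between the tools.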

Unknown-fraction default mismatch across samples

Trigger: profiles built with different MetaPhlAn versions or --unclassified_estimation settings. Mechanism: the UNCLASSIFIED row rescales all known abundances. Symptom: a batch effect aligned to processing date, not biology. Fix: pin one version and one unknown-estimation setting across the whole study; for environmental samples always include the unknown fraction.

Treating a low mapping rate as a QC failure

Trigger: alarm at <1% of reads mapping. Mechanism: only clade-specific markers are targeted; low mapping is expected. Symptom: unnecessary re-runs. Fix: low mapping is normal; a large unknown fraction means database-absent community (consider mOTUs3/sourmash), and a very low rate plus low microbial yield suggests host contamination -> contamination-controls.

Recall ceiling in under-characterized environments

Trigger: profiling soil/marine and reporting only named taxa. Mechanism: a marker tool is structurally blind to clades whose markers are not in the database (high precision, low recall; CAMI2 Meyer 2022). Symptom: most of the community missing; lowering thresholds does not recover it. Fix: use mOTUs3 (universal markers) or sourmash gather (containment vs all of GTDB), or accept Kraken false positives and filter - do not just lower MetaPhlAn thresholds and call it sensitivity.

Index mismatch on merge

Trigger: merging profiles built on different --index builds. Mechanism: SGB IDs and marker sets differ between releases. Symptom: rows silently fail to align; abundances look implausible. Fix: rebuild all samples on one pinned index before merging.
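A pre-merge guard can catch this before rows misalign. The sketch below assumes each per-sample profile begins with a comment line naming the database (a line starting with "#mpa_"); that header layout is an assumption to verify against your own output files, and both function names are hypothetical.

```python
def profile_index(path):
    """Return the database name from a profile's leading '#mpa_...' comment
    line, or None if no such line precedes the data rows."""
    with open(path) as handle:
        for line in handle:
            if line.startswith("#mpa_"):
                return line[1:].strip()
            if not line.startswith("#"):
                break  # reached data rows without finding the header
    return None


def check_single_index(paths):
    """Raise ValueError unless every profile reports the same index."""
    found = {str(p): profile_index(p) for p in paths}
    distinct = set(found.values())
    if len(distinct) != 1 or None in distinct:
        raise ValueError(f"profiles do not share one index: {found}")
    return distinct.pop()
```

Calling check_single_index on the list of profile paths before merge_metaphlan_tables.py turns a silent row mismatch into an explicit error that names the offending files.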

Quantitative Thresholds

| Threshold | Source | Rationale |
|---|---|---|
| Presence gate ~20% of an SGB's markers | Blanco-Miguez 2023 Nat Biotechnol 41:1633 | enough markers covered to call a clade present (precision mechanism) |
| --stat_q 0.2 default | MetaPhlAn docs | truncated mean drops top/bottom 20% of marker coverages; robust to HGT/conserved outliers |
| uSGB requires >=5 MAGs | Blanco-Miguez 2023 Nat Biotechnol 41:1633 | false-positive control for unnamed taxa |
| Pin --index | MetaPhlAn docs | DB version changes profiles for identical reads; report like a reagent lot |
| --min_cu_len 2000 | MetaPhlAn docs | minimum cumulative marker length to report a clade (low-evidence filter) |

Common Errors

| Error / symptom | Cause | Solution |
|---|---|---|
| "No database found" | DB not installed | metaphlan --install (optionally --index <ver> --db_dir DIR) |
| Output all zeros | wrong --input_type or empty/host-only input | match --input_type to the file; check microbial yield |
| --bowtie2out not recognized | running MetaPhlAn 4.2+ | use --mapout / --input_type mapout (4.2 rename) |
| Rows mismatch after merge | profiles from different indices | rebuild on one pinned --index |
| SAM input unknown fraction wrong | --nreads not supplied | pass total read count with --nreads |
| Viral calls look unreliable | --add_viruses calls are low-confidence | treat vSGB calls cautiously (CAMI2) |

References

  • Blanco-Miguez A, Beghini F, Cumbo F, et al. 2023. Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4. Nat Biotechnol 41:1633-1644.
  • Beghini F, McIver LJ, Blanco-Miguez A, et al. 2021. Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. eLife 10:e65088.
  • Sun Z, Huang S, Zhang M, et al. 2021. Challenges in benchmarking metagenomic profilers. Nat Methods 18:618-626.
  • Meyer F, Fritz A, Deng ZL, et al. 2022. Critical Assessment of Metagenome Interpretation: the second round of challenges. Nat Methods 19:429-440.
  • Ruscheweyh HJ, Milanese A, Paoli L, et al. 2022. Cultivation-independent genomes greatly expand taxonomic-profiling capabilities of mOTUs across various environments. Microbiome 10:212.
  • Sunagawa S, Mende DR, Zeller G, et al. 2013. Metagenomic species profiling using universal phylogenetic marker genes. Nat Methods 10:1196-1199.
  • Pierce NT, Irber L, Reiter T, Brooks P, Brown CT. 2019. Large-scale sequence comparisons with sourmash. F1000Res 8:1006.

Related Skills

  • kraken-classification - K-mer read classification; reports read fraction, not cell fraction
  • abundance-estimation - Compositional handling and cross-tool abundance comparison
  • strain-tracking - StrainPhlAn strain resolution below the SGB level
  • functional-profiling - HUMAnN reuses a MetaPhlAn profile for its taxonomic prescreen
  • metagenome-visualization - Compositional stats and plotting of profiles
  • genome-assembly/metagenome-assembly - Recover the MAGs that define uSGBs; this category is read-based
  • workflows/metagenomics-pipeline - End-to-end shotgun profiling
Install via CLI
npx skills add https://github.com/GPTomics/bioSkills --skill bio-metagenomics-metaphlan
Repository Details
Stars: 912
Forks: 160
Branch: main
Path: SKILL.md