bio-metabolomics-metabolite-annotation


Turns untargeted LC-MS/MS features (m/z, RT, MS/MS) into confidence-stratified metabolite annotations using spectral-library matching (matchms), in-silico tools (SIRIUS/CSI:FingerID, MetFrag) and molecular networking, and assigns a defensible MSI/Schymanski confidence level to each. Use when naming detected features, scoring MS/MS against a reference library, running SIRIUS, or deciding what confidence level an evidence set actually supports. For upstream feature extraction see metabolomics/xcms-preprocessing and metabolomics/msdial-preprocessing; for downstream enrichment that must respect these levels see metabolomics/pathway-mapping; for lipid-specific structural annotation see metabolomics/lipidomics.

By GPTomics. Updated 6/6/2026

name: bio-metabolomics-metabolite-annotation
tool_type: mixed
primary_tool: matchms

Version Compatibility

Reference examples tested with: matchms 0.33+, SIRIUS 6.x, MetFrag 2.5+

Before using code patterns, verify installed versions match. If versions differ:

  • Python: pip show <package> then help(module.function) to check signatures
  • CLI: <tool> --version then <tool> --help to confirm flags

Spectral matching needs precursor m/z on every MS/MS spectrum (add_precursor_mz filter) or ModifiedCosine silently returns zeros. Level 1 needs an authentic standard run in the same lab under the same method; no software output can substitute for it.
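
The precursor requirement above can be checked before any scoring run. The sketch below is a minimal pre-flight check, not part of matchms: the function name `spectra_missing_precursor` is hypothetical, and it assumes only that each spectrum exposes `.get("precursor_mz")`, which matchms `Spectrum` objects and plain dicts both do.

```python
def spectra_missing_precursor(spectra):
    """Return the indices of spectra that carry no usable precursor m/z."""
    missing = []
    for index, spectrum in enumerate(spectra):
        # Filters may return None for spectra they reject, so treat None as missing.
        precursor = None if spectrum is None else spectrum.get("precursor_mz")
        if precursor is None or float(precursor) <= 0:
            missing.append(index)
    return missing


# Plain dicts stand in for matchms Spectrum objects in this demo.
demo = [{"precursor_mz": 181.0707}, {"precursor_mz": None}, {}, None]
print(spectra_missing_precursor(demo))
```

Run it on both the reference and the query lists; a non-empty result means those spectra would score zero under ModifiedCosine.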

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Metabolite Annotation

"Annotate my metabolomics features with compound identities" -> Map each feature's m/z and MS/MS to candidate structures, then attach an explicit confidence level to every name.

  • Python: matchms.calculate_scores() for library matching (matchms)
  • CLI: sirius ... formulas fingerprints structures canopus for in-silico formula/structure/class (SIRIUS)

The Single Most Important Insight -- An Annotation Is a Hypothesis Carrying a Confidence Level, Not an Identification

A metabolite name without a stated MSI/Schymanski level is scientifically incomplete. The inference chain m/z -> formula -> structure -> isomer-resolved identity is three separate lossy steps, each needing its own orthogonal evidence axis. A database hit supplies a name, not evidence: with no MS/MS or RT to back it, it is Schymanski Level 4 (formula) at best, often Level 5 (a feature of interest). A high cosine score ranks candidates; it never proves one. Only an in-house authentic standard, same method, with MS, MS/MS, and RT all matching reaches Level 1 ("identification") -- everything else is an honest hypothesis. The field's recurring sin is laundering Level 2/3 hypotheses into Level-1 prose; the canonical worked example is phenylacetylglutamine being reported as phenylacetylglycine in nearly half of NMR studies (Theodoridis 2023). Assign the lowest level the evidence honestly supports and report which database/version was searched.
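
One way to make "no name without a level" hard to violate is to store annotations in a record that refuses to exist without one. This is a minimal sketch under assumptions: the `Annotation` class, its field names, the evidence tag `authentic_standard`, and the library name `ExampleLib` are all hypothetical and not part of any tool named here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

SCHYMANSKI_LEVELS = ("1", "2a", "2b", "3", "4", "5")


@dataclass(frozen=True)
class Annotation:
    feature_id: str
    level: str                       # Schymanski level; never optional
    database: str                    # library/database that was searched
    database_version: str            # release tag or download date
    name: Optional[str] = None       # may stay empty at Levels 4 and 5
    evidence: Tuple[str, ...] = ()   # free-text evidence tags

    def __post_init__(self):
        if self.level not in SCHYMANSKI_LEVELS:
            raise ValueError(f"unknown confidence level: {self.level!r}")
        # Level 1 is only reachable with an in-house authentic standard.
        if self.level == "1" and "authentic_standard" not in self.evidence:
            raise ValueError("Level 1 requires an in-house authentic standard")

    def label(self):
        shown = self.name or "unnamed feature"
        return f"{shown} (Level {self.level}; {self.database} {self.database_version})"


hit = Annotation("F0001", "2a", "ExampleLib", "2024-01",
                 name="caffeine", evidence=("msms_library_match",))
print(hit.label())
```

Reporting `label()` rather than the bare name keeps the level and the searched database version attached wherever the annotation travels.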

Confidence-Level Taxonomy (MSI and Schymanski)

| Schymanski | MSI | Name | Evidence required |
|---|---|---|---|
| Level 1 | 1 | Confirmed structure | In-house authentic standard, same method: MS + MS/MS + RT all match. The only "identification". |
| Level 2a | 2 | Probable structure (library) | MS/MS matches a reference library spectrum; no in-house standard. |
| Level 2b | 2 | Probable structure (diagnostic) | Diagnostic fragments / RT / ionization consistent with exactly one structure; no reference spectrum. |
| Level 3 | 3 | Tentative candidate(s) | Evidence narrows to a structure class or candidate set but isomers remain unresolved. |
| Level 4 | -- | Unequivocal formula | MS1 accurate mass + isotope pattern + adduct logic assign one formula; no structure. |
| Level 5 | 4 | Exact mass | A feature of interest; nothing assigned. |

Promote one level per orthogonal evidence axis that survives scrutiny; cap at Level 2 unless an in-house standard exists. CSI:FingerID and library matching recover constitution only, not stereochemistry, so enantiomer and diastereomer claims cannot come from MS/MS, and positional isomers often remain unresolved as well.

Tool Roles

| Tool | Core idea | Output | Best for |
|---|---|---|---|
| matchms (CosineGreedy / ModifiedCosine / spectral entropy) | Score query MS/MS against library spectra | Ranked library hits + matched-peak count | Level 2a when a library spectrum exists |
| SIRIUS + ZODIAC | Fragmentation trees + isotope pattern, dataset-wide formula re-ranking | Ranked molecular formula | Formula (Level 4); the reliable part of SIRIUS |
| CSI:FingerID + COSMIC | Predict fingerprint, search structure DB, calibrated confidence | Ranked structures + FDR-controllable score | Level 2b/3 structure when COSMIC FDR is set |
| CANOPUS | Predict compound class directly from MS2 | ClassyFire + NPClassifier class | Level 3 class for unknowns; often the most honest output |
| MetFrag | Bond-disconnection scoring of candidate list | Explainable fragment-supported ranks | Transparent, scriptable, custom DBs, RT term |
| FBMN (GNPS2) + MS2Query | Modified-cosine network / ML analogue search | Edges = "related to" | Analogue propagation (Level 3 scaffold hypothesis) |

Decision Tree: Evidence Available -> Tool -> Achievable Level

| Situation | Do | Achievable level |
|---|---|---|
| In-house authentic standard, same method, MS+MS/MS+RT match | Confirm against standard | Level 1 |
| MS/MS available, library spectrum likely exists | matchms library match (entropy or modified cosine) | Level 2a |
| MS/MS available, no library spectrum | SIRIUS formulas + CSI:FingerID + CANOPUS, or MetFrag | Level 2b/3 (formula Level 4) |
| Need class only / compound absent from all DBs | CANOPUS (class); MSNovelist (de novo SMILES) | Level 3 |
| Find analogues / propagate across a network | FBMN on GNPS2 + MS2Query | Level 3 (scaffold hypothesis) |
| Only MS1 m/z + isotopes + clean adduct | Formula assignment (SIRIUS / seven golden rules) | Level 4 |
| Bare m/z, no orthogonal evidence | Report as a feature | Level 5 |
| Biology hinges on a specific isomer / stereocenter | Demand a standard or orthogonal method (NMR, chiral assay) | MS alone insufficient |
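
The main branches of the decision tree can be written as a small routing function. This is an illustrative sketch only: the function `recommend` and its flag names are hypothetical, and the network-propagation and isomer rows are left out because they need judgment rather than a boolean.

```python
def recommend(standard_confirmed=False, has_msms=False,
              library_spectrum_likely=False, class_only=False,
              clean_ms1_formula=False):
    """Map the available evidence to (what to do, best achievable Schymanski level)."""
    if standard_confirmed:  # in-house standard, same method, MS + MS/MS + RT match
        return ("confirm against standard", "1")
    if has_msms and library_spectrum_likely:
        return ("matchms library match", "2a")
    if has_msms and class_only:
        return ("CANOPUS (class)", "3")
    if has_msms:
        return ("SIRIUS formulas + CSI:FingerID + CANOPUS, or MetFrag", "2b/3")
    if clean_ms1_formula:  # MS1 m/z + isotopes + clean adduct
        return ("formula assignment", "4")
    return ("report as a feature", "5")


print(recommend(has_msms=True, library_spectrum_likely=True))
print(recommend())
```

The returned level is a ceiling, not a result: the evidence still has to survive the thresholds and failure modes described in the rest of this document.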

Match MS/MS Against a Spectral Library

Goal: Rank library candidates for each query spectrum and attach the matched-peak count, not just the score.

Approach: Harmonize metadata, normalize intensities, add precursor m/z, score with ModifiedCosine (analogue-aware) or spectral entropy (identity), then keep only hits above both a score and a matched-peak floor.

from matchms import calculate_scores
from matchms.filtering import default_filters, normalize_intensities, add_precursor_mz
try:
    from matchms.similarity import ModifiedCosineGreedy as ModifiedCosine  # matchms 0.33+
except ImportError:
    from matchms.similarity import ModifiedCosine          # matchms <= 0.32

def prepare(spectrum):
    spectrum = default_filters(spectrum)
    spectrum = add_precursor_mz(spectrum)  # required for ModifiedCosine or scores are zero
    return normalize_intensities(spectrum)

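# queries_raw / references_raw are lists of matchms Spectrum objects,
# e.g. loaded with matchms.importing.load_from_mgf.
queries = [prepare(s) for s in queries_raw]
references = [prepare(s) for s in references_raw]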

scores = calculate_scores(references, queries, ModifiedCosine(tolerance=0.005))

# CosineGreedy/ModifiedCosine return a structured array; the field names are
# class-prefixed and version-dependent (e.g. 'ModifiedCosineGreedy_score' in 0.33),
# so derive them from the dtype rather than hard-coding.
for query in queries:
    pairs = scores.scores_by_query(query)
    if not pairs:  # nothing was scored against this query
        continue
    score_field, match_field = pairs[0][1].dtype.names
    ref, hit = max(pairs, key=lambda pair: pair[1][score_field])
    if hit[score_field] >= 0.7 and hit[match_field] >= 6:  # score floor + peak-count floor (GNPS defaults)
        print(ref.get('compound_name'), hit[score_field], hit[match_field])  # Level 2a candidate
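
The Approach above names spectral entropy as the identity-oriented alternative to cosine. For orientation, here is a minimal pure-Python sketch of the unweighted entropy similarity, 1 - (2*S_AB - S_A - S_B) / ln 4, where S is the Shannon entropy of a sum-normalized spectrum and AB is the 1:1 mix of the two spectra. Assumptions: peaks are `(mz, intensity)` tuples, peaks within one spectrum are further apart than `tolerance`, and the intensity reweighting that the published method applies to low-entropy spectra is omitted. It is not the matchms API or the reference implementation.

```python
import math


def _normalize(peaks):
    """Drop empty peaks, scale intensities to sum to 1, sort by m/z."""
    kept = [(mz, inten) for mz, inten in peaks if inten > 0]
    total = sum(inten for _, inten in kept)
    return sorted((mz, inten / total) for mz, inten in kept)


def _entropy(probabilities):
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def entropy_similarity(peaks_a, peaks_b, tolerance=0.005):
    a, b = _normalize(peaks_a), _normalize(peaks_b)
    if not a or not b:
        return 0.0
    mixed = []  # intensities of the 1:1 mixed spectrum
    i = j = 0
    while i < len(a) and j < len(b):
        if abs(a[i][0] - b[j][0]) <= tolerance:   # matched peak: average both
            mixed.append((a[i][1] + b[j][1]) / 2)
            i += 1
            j += 1
        elif a[i][0] < b[j][0]:                   # peak only in spectrum A
            mixed.append(a[i][1] / 2)
            i += 1
        else:                                     # peak only in spectrum B
            mixed.append(b[j][1] / 2)
            j += 1
    mixed.extend(p / 2 for _, p in a[i:])
    mixed.extend(p / 2 for _, p in b[j:])
    s_a = _entropy(p for _, p in a)
    s_b = _entropy(p for _, p in b)
    return 1 - (2 * _entropy(mixed) - s_a - s_b) / math.log(4)


spectrum = [(91.054, 40.0), (119.049, 100.0), (163.039, 25.0)]
print(entropy_similarity(spectrum, spectrum))            # identical spectra
print(entropy_similarity(spectrum, [(250.100, 1.0)]))    # no shared peaks
```

Identical spectra score 1 and spectra with no shared peaks score 0; like a cosine score, the value ranks candidates and still supports Level 2a at most.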

Run SIRIUS for Formula, Structure, and Class

Goal: Annotate features that have no library spectrum, reporting formula and class with more trust than top-1 structure.

Approach: Run the SIRIUS subcommand chain on one project space; trust ZODIAC-refined formula over CSI:FingerID structure, and only report a structure as confident when a COSMIC FDR threshold is set.

# SIRIUS 6 is a multi-command pipeline on one line. A free academic account/license
# is required (since v5); log in once, then the project space persists across runs.
# Credential flags vary by version; run `sirius login --help` to confirm (commonly `-u <email>`).
sirius login -u "$SIRIUS_USER"

sirius --input features.mgf --project ./sirius_project \
    formulas --profile orbitrap \
    fingerprints \
    structures --database bio \
    canopus \
    write-summaries --output ./sirius_summary
# Verify exact subcommand spelling with `sirius <command> --help`: formulas/fingerprints/
# structures/canopus changed plural/singular and options between v5 and v6.
# This chain does not run ZODIAC; if you rely on ZODIAC-refined formulas, add that step
# after formulas (confirm the subcommand name with `sirius --help`).
# --database (on structures) is a scientific choice: 'bio' raises plausibility but cannot
# return a novel metabolite; 'pubchem' maximizes recall but floods implausible isomers.

Assemble an Evidence-to-Level Call

Goal: Collapse a feature's evidence set into a single defensible confidence level.

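Approach: Conceptually every feature starts at Level 5 and is promoted once per surviving orthogonal axis; the function checks the strongest evidence first and falls through to Level 5. An authentic standard is the only path to Level 1.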

def assign_level(evidence):
    if evidence.get('authentic_standard_same_method'):
        return 1
    library = evidence.get('library_match')
    if library and library['score'] >= 0.7 and library['matches'] >= 6:
        return '2a'  # reference library spectrum, no in-house standard
    if evidence.get('diagnostic_fragments') and evidence.get('single_structure_consistent'):
        return '2b'
    if evidence.get('candidate_set') or evidence.get('canopus_class') or evidence.get('network_propagated'):
        return 3  # isomers unresolved, class only, or "related to" an annotated node
    if evidence.get('unambiguous_formula'):
        return 4  # MS1 + isotopes + adduct logic, no structure
    return 5

Per-Method Failure Modes

Cosine score is not identity

  • Trigger: Reporting a name because a single high cosine/modified-cosine score came back.
  • Mechanism: Cosine rewards shared fragment peaks, and fragments are substructures that many distinct molecules share; a high score built on only a few matched peaks can be reproduced by many unrelated compounds.
  • Symptom: Confident name that an isomer or scaffold-sharing compound would have produced identically.
  • Fix: Require a matched-peak floor (>=6) alongside the score (>=0.7); prefer spectral entropy for identity; report Level 2a, not Level 1.

The isomer wall

  • Trigger: Claiming a specific positional/stereo/regio isomer from MS/MS.
  • Mechanism: Constitutional isomers frequently fragment identically; enantiomers give indistinguishable CID spectra in an achiral measurement; CSI:FingerID is constitution-only.
  • Symptom: A specific structure reported where multiple isomers fit the data equally.
  • Fix: Report Level 3 unless RT or CCS breaks the tie (CCS needs ~0.5-0.6% separation); for biology hinging on the isomer, use NMR or a co-eluting standard.

In-source fragments and adduct cascades corrupt the input

  • Trigger: Annotating and counting features before collapsing ion families.
  • Mechanism: In-source fragmentation creates phantom MS1 features; assuming the wrong adduct shifts the neutral mass and corrupts every downstream candidate, producing a confident, internally consistent, wrong answer.
  • Symptom: Over-counted "compounds", the same molecule named several ways, invented biology.
  • Fix: Group ion families (CAMERA / Ion Identity Molecular Networking / khipu) before annotation; never quote feature counts as compound counts.
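
The adduct mechanism above is easy to see with arithmetic. The sketch below back-calculates the neutral monoisotopic mass of one observed m/z under three singly charged positive-mode adduct assumptions. The helper `neutral_mass` and the example m/z are illustrative; the shifts are the standard monoisotopic values (proton, ammonium, and sodium adducts, corrected for the electron mass).

```python
# Mass added to the neutral molecule M by each singly charged adduct (Da).
ADDUCT_SHIFT = {
    "[M+H]+": 1.007276,
    "[M+NH4]+": 18.033823,
    "[M+Na]+": 22.989218,
}


def neutral_mass(mz, adduct):
    """Neutral monoisotopic mass implied by an m/z under one adduct assumption (z = 1)."""
    return mz - ADDUCT_SHIFT[adduct]


observed_mz = 203.0526  # consistent with a hexose (C6H12O6, 180.0634 Da) as [M+Na]+
for adduct in ADDUCT_SHIFT:
    print(adduct, round(neutral_mass(observed_mz, adduct), 4))
```

Reading this ion as [M+H]+ instead of [M+Na]+ moves the neutral mass by about 21.98 Da, so every formula and structure candidate generated afterwards belongs to a different molecule.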

Database-mapping inflation poisons pathway analysis

  • Trigger: Feeding all candidate IDs of an ambiguous feature into enrichment.
  • Mechanism: One ambiguous m/z maps to many compound IDs across different pathways, so a single uncertain feature lights up several pathways (phantom enrichment).
  • Symptom: Inflated pathway significance traceable to Level-3 features voting as if they were several confirmed compounds.
  • Fix: Carry annotation uncertainty (candidate sets, levels) into enrichment; prefer mass-level or probabilistic methods that do not multiply ambiguous IDs (see metabolomics/pathway-mapping; mummichog deliberately avoids prior ID).
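
One simple way to stop an ambiguous feature from voting several times is to split its single vote across its candidates. The sketch below is an assumption-laden illustration of that idea, not what mummichog or any enrichment tool named here does; `pathway_votes` and the compound and pathway IDs are hypothetical.

```python
from collections import defaultdict


def pathway_votes(feature_candidates, compound_pathways):
    """Give each feature a total weight of 1, split evenly over its candidate compounds."""
    votes = defaultdict(float)
    for candidates in feature_candidates.values():
        if not candidates:
            continue
        weight = 1.0 / len(candidates)  # ambiguous features vote fractionally
        for compound in candidates:
            for pathway in compound_pathways.get(compound, ()):
                votes[pathway] += weight
    return dict(votes)


feature_candidates = {
    "F1": ["cpdA", "cpdB", "cpdC"],  # Level 3: three unresolved candidates
    "F2": ["cpdD"],                  # single confident candidate
}
compound_pathways = {"cpdA": ["P1"], "cpdB": ["P2"], "cpdC": ["P3"], "cpdD": ["P1"]}
print(pathway_votes(feature_candidates, compound_pathways))
```

With naive counting F1 would add a full hit to three pathways; here it adds one third to each, so no single feature can contribute more than one vote to any pathway.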

Quantitative Thresholds

| Threshold | Source | Rationale |
|---|---|---|
| Cosine/modified-cosine >= 0.7 AND >= 6 matched peaks | GNPS defaults (Wang 2016) | Suppresses promiscuous low-complexity spectra that hairball the network. |
| Spectral entropy >= 0.75 -> FDR < 10% | Li 2021 (natural-products benchmark) | Dataset-dependent, NOT a universal constant; entropy beats dot product for identity. |
| MS1 mass error <= 5 ppm (HRMS) | HRMS convention | Tighter than the 10 ppm older default; pairs with isotope-pattern filter. |
| Isotope-pattern ~2% abundance accuracy | Kind & Fiehn 2006 | Removes >95% of false formula candidates even at 3 ppm -- orthogonal info, not better mass accuracy, fixes formula. |
| COSMIC 0.94 / 0.64 / 0.34 ~ 5 / 10 / 20% FDR | Hoffmann 2022 | Calibrated confidence on CSI:FingerID structures; raw top-1 with no COSMIC is Level 3. |
| Predicted CCS within ~3-5% of measured | AllCCS / IMS benchmarks (Zhou 2020) | Use CCS as a falsifier (rejects candidates), not as positive proof of identity. |
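
Two of the thresholds above reduce to one-line calculations. The sketch below shows the ppm mass-error formula and a lookup that maps a COSMIC score to the approximate FDR tier listed in the table. The function names are hypothetical and the tier boundaries are copied from the table, not read from SIRIUS.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


# (minimum COSMIC score, approximate FDR in percent), strictest tier first.
COSMIC_TIERS = ((0.94, 5), (0.64, 10), (0.34, 20))


def cosmic_fdr_tier(score):
    """Return the approximate FDR percent for a COSMIC score, or None below all tiers."""
    for threshold, fdr_percent in COSMIC_TIERS:
        if score >= threshold:
            return fdr_percent
    return None


error = ppm_error(181.0712, 181.0707)
print(round(error, 2), abs(error) <= 5)  # within the 5 ppm HRMS window?
print(cosmic_fdr_tier(0.70))
```

A structure whose score returns `None` here has no calibrated confidence behind it and is treated like a raw top-1 hit.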

Common Errors

| Error / symptom | Cause | Solution |
|---|---|---|
| ModifiedCosine scores all zero | Missing precursor m/z on spectra | Apply add_precursor_mz filter to both references and queries first. |
| AttributeError: 'Scores' has no attribute 'scores' | Indexing scores.scores[...] (old tutorials) | Use scores.scores_by_query(query) or scores.to_array(name=...). |
| ValueError: no field of name <X>_score | Field names are class-prefixed and version-dependent | Read pair[1].dtype.names for the score/matches field names rather than hard-coding. |
| ImportError: cannot import name 'ModifiedCosine' | Renamed to ModifiedCosineGreedy in matchms 0.33 | Try the new name with an ImportError fallback to the old. |
| sirius formula not found | v5 used singular subcommands; v6 uses formulas | Run sirius --help; verify plural/singular per installed version. |
| SIRIUS exits at login | Account/license required since v5 | sirius login once with a free academic account before the chain. |
| Pathway enrichment lights up everywhere | Ambiguous features mapped to many DB IDs | Collapse ion families and carry levels into enrichment (metabolomics/pathway-mapping). |

References

  • Sumner LW, et al. 2007. Proposed minimum reporting standards for chemical analysis (CAWG MSI). Metabolomics 3:211-221.
  • Schymanski EL, Jeon J, Gulde R, Fenner K, Ruff M, Singer HP, Hollender J. 2014. Identifying small molecules via high resolution mass spectrometry: communicating confidence. Environ Sci Technol 48:2097-2098.
  • Li Y, Kind T, Folz J, Vaniya A, Mehta SS, Fiehn O. 2021. Spectral entropy outperforms MS/MS dot product similarity for small-molecule compound identification. Nat Methods 18:1524-1531.
  • Dührkop K, Fleischauer M, Ludwig M, Aksenov AA, Melnik AV, Meusel M, Dorrestein PC, Rousu J, Böcker S. 2019. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat Methods 16:299-302.
  • Dührkop K, Shen H, Meusel M, Rousu J, Böcker S. 2015. Searching molecular structure databases with tandem mass spectra using CSI:FingerID. PNAS 112:12580-12585.
  • Dührkop K, et al. 2021. Systematic classification of unknown metabolites using high-resolution fragmentation mass spectra (CANOPUS). Nat Biotechnol 39:462-471.
  • Hoffmann MA, et al. 2022. High-confidence structural annotation of metabolites absent from spectral libraries (COSMIC). Nat Biotechnol 40:411-421.
  • Ruttkies C, Schymanski EL, Wolf S, Hollender J, Neumann S. 2016. MetFrag relaunched: incorporating strategies beyond in silico fragmentation. J Cheminform 8:3.
  • Wang M, Carver JJ, Phelan VV, et al. 2016. Sharing and community curation of mass spectrometry data with GNPS. Nat Biotechnol 34:828-837.
  • Nothias LF, Petras D, Schmid R, et al. 2020. Feature-based molecular networking in the GNPS analysis environment. Nat Methods 17:905-908.
  • Kind T, Fiehn O. 2006. Metabolomic database annotations via query of elemental compositions: mass accuracy is insufficient even at less than 1 ppm. BMC Bioinformatics 7:234.
  • Zhou Z, et al. 2020. Ion mobility collision cross-section atlas for known and unknown metabolite annotation in untargeted metabolomics (AllCCS). Nat Commun 11:4334.
  • Theodoridis G, Gika H, Raftery D, Goodacre R, Plumb RS, Wilson ID. 2023. Ensuring fact-based metabolite identification in LC-MS-based metabolomics. Anal Chem 95:3909-3916.
  • Huber F, Verhoeven S, Meijer C, et al. 2020. matchms - processing and similarity evaluation of mass spectrometry data. J Open Source Softw 5:2411.

Related Skills

  • metabolomics/xcms-preprocessing - Upstream feature extraction (m/z, RT, intensity table)
  • metabolomics/msdial-preprocessing - Alternative feature extraction and deconvolution
  • metabolomics/pathway-mapping - Downstream enrichment that must respect these confidence levels
  • metabolomics/lipidomics - Lipid-specific annotation and structural resolution
  • proteomics/spectral-libraries - Related spectral-matching concepts (closed-world peptide search)
Install via CLI
npx skills add https://github.com/GPTomics/bioSkills --skill bio-metabolomics-metabolite-annotation