name: chem-spectrum-matcher
description: Match an experimental spectrum (1H NMR, 13C NMR, IR) against predicted or database reference spectra for candidate ranking and structure confirmation. Supports local catalog lookup, public database fallback, and pluggable similarity metrics.
category: [chemistry, drug-discovery]
Spectrum Matcher
Goal
To retrieve or generate reference spectra for a set of candidate molecules and rank them by similarity to an experimental query spectrum. The skill abstracts a common three-component pattern:
- Prediction — generate reference spectra from structure (SMILES) via empirical predictors or QM.
- Reference DB — cache computed spectra locally; fall back to public databases for known compounds.
- Similarity metric — score each candidate against the query and rank.
This pattern applies to any spectral modality: 1H NMR, 13C NMR, IR, mass spectrometry, UV-Vis, Raman. The concrete implementation here covers 1H NMR and IR, with the NMR path fully implemented and IR sketched for extension.
When to Use This Skill
- Confirm a proposed structure against an experimental spectrum.
- Screen a shortlist of candidates and rank by spectral similarity.
- Avoid re-running expensive predictions by retrieving cached spectra from the local catalog.
- Fetch experimental reference spectra from public databases (NMRShiftDB2, NIST WebBook) before committing to a prediction run.
When NOT to Use This Skill
- Unknown structure elucidation from scratch: this skill requires a candidate list. For open-ended structure identification, use general-query-literature-database first.
- Mixture deconvolution: use chem-nmr-analysis (Wasserstein deconvolution) for quantifying component ratios.
- 13C, 19F, 31P NMR prediction: chem-nmr-predict (SPINUS) covers 1H only. Extension needed.
- Mass spectrometry: not yet implemented. Scaffold is in place; add a predictor and similarity metric.
Architecture
SMILES ──► [Predictor] ──► predicted spectrum (.xy / .jdx)
                               │
                               ▼
                        [Local Catalog] ◄── register_spectrum.py
                               │
[Public DB fallback] ──────────┤ (NMRShiftDB2, NIST WebBook)
                               │
                               ▼
Experimental query ──► [match_spectrum.py] ──► ranked candidates
Modality–Predictor–Metric table
| Modality | Predictor skill | Public DB | Similarity metric |
|---|---|---|---|
| 1H NMR | chem-nmr-predict (SPINUS + nmrsim) | NMRShiftDB2 | L2 / Wasserstein |
| IR | chem-db-spectra (NIST) or ORCA DFT | NIST WebBook | Cosine |
| 13C NMR | (not yet implemented) | NMRShiftDB2 | L2 |
| Mass spec | (not yet implemented) | NIST WebBook | Dot product |
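The L2 and cosine metrics in the table can be sketched as follows. This is a minimal illustration, not the code in match_spectrum.py: it assumes both spectra are already resampled onto the same x-axis grid and have non-negative intensities, and the function names and the mapping of L2 distance onto a 0-to-1 similarity are assumptions made for this example. Wasserstein is omitted.

```python
import math


def _unit(v):
    """Scale an intensity vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("spectrum has zero total intensity")
    return [x / norm for x in v]


def _check(query, reference):
    if len(query) != len(reference):
        raise ValueError("spectra must share the same grid length")


def cosine_similarity(query, reference):
    """Cosine of the angle between the two intensity vectors."""
    _check(query, reference)
    q, r = _unit(query), _unit(reference)
    return sum(a * b for a, b in zip(q, r))


def l2_similarity(query, reference):
    """1 - d / sqrt(2), where d is the L2 distance between unit-norm vectors.

    For non-negative intensities d lies in [0, sqrt(2)], so the result is
    1.0 for identical shapes and 0.0 for spectra with no overlap.
    (Assumed normalization, chosen to match the 0-to-1 score convention.)
    """
    _check(query, reference)
    q, r = _unit(query), _unit(reference)
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, r)))
    return 1.0 - dist / math.sqrt(2.0)
```

Both functions are scale-invariant, so a reference spectrum with a different overall intensity still scores as a perfect match if its shape is identical.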
Workflow
Step 1 — Prepare Candidate Reference Spectra
Option A: Retrieve from catalog or public DB (fast, no prediction)
# Env: nmr-agent
python .agents/skills/chem-spectrum-matcher/scripts/match_spectrum.py \
--query experimental_spectrum.xy \
--smiles "CCO" \
--names "ethanol" \
--modality nmr_1h \
--catalog_dir research/spectrum_catalog/ \
--output_dir <research_dir>/spectrum_match/ \
--fallback_public_db
Option B: Predict first, then match
Run the appropriate predictor for the modality, then register outputs into the catalog, then match.
1H NMR:
# Env: nmr-agent
python .agents/skills/chem-nmr-predict/scripts/predict_nmr.py \
--smiles "CCO" \
--names "ethanol" \
--field_mhz 400 \
--output_dir <research_dir>/nmr_predictions/
# Env: nmr-agent
python .agents/skills/chem-spectrum-matcher/scripts/register_spectrum.py \
--source_dir <research_dir>/nmr_predictions/ \
--modality nmr_1h \
--catalog_dir research/spectrum_catalog/
IR (from NIST WebBook):
# Env: base-agent
python .agents/skills/chem-db-spectra/scripts/query_spectra.py \
C10H18O <research_dir>/ir_references/ --type IR
# Env: base-agent
python .agents/skills/chem-spectrum-matcher/scripts/register_spectrum.py \
--source_dir <research_dir>/ir_references/ \
--modality ir \
--catalog_dir research/spectrum_catalog/
IR (QM-backed, high accuracy):
# Env: orca-agent
# Run ORCA frequency calculation → extract IR spectrum → register
# See conda-envs/orca-agent/ for ORCA setup.
# After ORCA run, convert output with src/utils/dft/orca_utils.py
# then call register_spectrum.py --modality ir
Step 2 — Match and Rank
match_spectrum.py retrieves reference spectra (catalog → public DB fallback) and computes similarity scores between the query and each candidate.
# Env: nmr-agent
python .agents/skills/chem-spectrum-matcher/scripts/match_spectrum.py \
--query experimental_spectrum.xy \
--smiles "CCO" \
--names "ethanol" \
--modality nmr_1h \
--metric l2 \
--catalog_dir research/spectrum_catalog/ \
--output_dir <research_dir>/spectrum_match/ \
--plot
Arguments:
- --query: experimental spectrum file (two-column, ppm or wavenumber vs intensity; .xy, .csv, .jdx).
- --smiles: candidate SMILES strings.
- --names: human-readable labels matching SMILES order.
- --modality: nmr_1h, nmr_13c, ir. Controls which catalog partition and public DB to query.
- --metric: similarity metric, one of l2 (default), cosine, wasserstein. Choose based on modality (see table above).
- --catalog_dir: local spectrum catalog directory.
- --fallback_public_db: query NMRShiftDB2 or NIST WebBook for any candidate not in catalog.
- --field_mhz: spectrometer field (NMR only, default 400). Must match experimental spectrum.
- --plot: emit overlay plot (match_plot.png) with query and top-3 candidates.
- --output_dir: directory for outputs.
Outputs:
- match_results.json: ranked candidates with similarity scores, source (catalog/public_db/predicted), and spectrum paths.
- match_plot.png: overlay of query vs ranked references (if --plot).
- input_configs.yaml: all parameters for reproducibility.
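The retrieval order used in this step (local catalog first, then public DB fallback, otherwise a miss) can be sketched as below. The function name, the callable-based lookups and the returned tuple are hypothetical and only illustrate the control flow; the source labels reuse the values named in this document (catalog, public_db, missed).

```python
from typing import Callable, Optional, Tuple

# Hypothetical lookup signature: (smiles, modality) -> spectrum path, or None.
Lookup = Callable[[str, str], Optional[str]]


def resolve_reference(
    smiles: str,
    modality: str,
    catalog_lookup: Lookup,
    public_db_lookup: Optional[Lookup] = None,
) -> Tuple[str, Optional[str]]:
    """Return (source, path), trying the catalog before the public DB."""
    path = catalog_lookup(smiles, modality)
    if path is not None:
        return "catalog", path
    # The public DB is only consulted when fallback is enabled.
    if public_db_lookup is not None:
        path = public_db_lookup(smiles, modality)
        if path is not None:
            return "public_db", path
    return "missed", None
```

Passing public_db_lookup=None corresponds to running without --fallback_public_db: a catalog miss is reported directly instead of triggering a network query.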
Step 3 — Interpret Results
Read match_results.json. Candidates are ranked by descending similarity score (1.0 = perfect match, 0.0 = no overlap).
If/Then rules:
| Score (L2 / cosine) | Interpretation | Agent action |
|---|---|---|
| > 0.90 | Strong match | Report top candidate with confidence. |
| 0.70–0.90 | Plausible match | Report with caveat; check overlay plot for unmatched peaks. |
| < 0.70 | Poor match | Likely wrong candidate or missing structure. Expand candidate list or re-examine experimental spectrum. |
After reading scores, the agent must:
- Inspect match_plot.png: verify visual agreement, check for systematic shifts.
- Cross-check with signal table (NMR): compare predicted multiplicity/coupling to experimental assignments.
- If top candidate has score < 0.70: consider running QM-backed prediction (ORCA IR, or higher-level NMR) rather than empirical.
Failure Modes
| Failure | Symptom | Agent action |
|---|---|---|
| Catalog miss, no public DB hit | match_results.json candidate marked missed | Run appropriate predictor then register_spectrum.py. |
| Ppm/wavenumber axis mismatch | Similarity scores all near 0 | Query and reference use different x-axis. Check --field_mhz or unit convention. |
| SMILES canonicalization fails | RDKit error | SMILES invalid. Verify with RDKit before retry. |
| NMRShiftDB2 / NIST timeout | HTTP error during public DB query | Retry once; if persistent, disable --fallback_public_db and predict locally. |
| ORCA IR prediction unavailable | ORCA_BINARY_PATH not set | Set env var per conda-envs/orca-agent/README.md or fall back to NIST WebBook IR. |
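The "retry once" rule for public DB timeouts can be sketched as a wrapper around any fetch callable. The helper name and its parameters are hypothetical. Catching OSError is an assumption that suits the built-in TimeoutError and ConnectionError, and also covers requests exceptions, which derive from IOError.

```python
import time


def fetch_with_one_retry(fetch, *args, retriable=(OSError,), wait_s=1.0):
    """Call fetch(*args); on a retriable error wait, then try exactly once more.

    Returns None if the second attempt also fails, so the caller can switch
    to local prediction instead of querying the public DB again.
    """
    try:
        return fetch(*args)
    except retriable:
        time.sleep(wait_s)
    try:
        return fetch(*args)
    except retriable:
        return None
```

A None result corresponds to the "if persistent" branch of the table: drop --fallback_public_db and run the predictor locally.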
Relationship to Other Skills
drug-db-pubchem → resolve compound name to SMILES
chem-nmr-predict → 1H NMR prediction (SPINUS + nmrsim)
chem-db-spectra → experimental IR/MS from NIST WebBook
chem-nmr-analysis → mixture deconvolution (Wasserstein)
chem-spectrum-matcher → this skill: catalog + retrieval + similarity ranking
Environment
Primary (NMR matching):
mamba activate nmr-agent
Required packages: numpy, scipy, rdkit, requests, matplotlib.
IR prediction via QM (optional):
mamba activate orca-agent
Requires ORCA_BINARY_PATH environment variable. See conda-envs/orca-agent/README.md.
Constraints
- Catalog format: catalog.json keyed by (canonical_smiles, modality). Do not edit manually.
- Spectrum format: .xy files are two-column tab-separated (x-axis descending, intensity). .jdx files are parsed via the jcamp package.
- Modality isolation: NMR and IR spectra are stored in separate catalog partitions; cross-modality lookup is not supported.
- Field strength: NMR catalog entries store the field (MHz) used during prediction. Retrieval warns if query field differs.
- Stereochemistry: Diastereomers are stored separately by canonical SMILES. Enantiomers give identical NMR and IR spectra under achiral conditions but are still stored separately for traceability.
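The .xy convention above (two tab-separated columns, x-axis descending) can be checked before registering a file. The parser below is a sketch, not the loader used by the skill's scripts: the function name is hypothetical, and skipping blank lines and lines starting with # is an assumption.

```python
def parse_xy(text):
    """Parse two-column tab-separated text into (x_values, intensities).

    Raises ValueError if a row does not have exactly two columns or if the
    x-axis is not in descending order.
    """
    xs, ys = [], []
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):  # assumed: blanks/comments ignored
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected 2 tab-separated columns")
        xs.append(float(fields[0]))
        ys.append(float(fields[1]))
    # Descending x: no value may exceed the one before it.
    if any(later > earlier for earlier, later in zip(xs, xs[1:])):
        raise ValueError("x-axis must be in descending order")
    return xs, ys
```

For a file on disk, pass the result of reading it as text, for example parse_xy(open(path).read()). An ascending axis is rejected rather than silently reversed, so unit or ordering mistakes surface before matching.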
References
- Steinbeck, C. et al., "NMRShiftDB — constructing a free chemical information system with open-source components", J. Chem. Inf. Comput. Sci., 2003.
- Linstrom, P.J. and Mallard, W.G., Eds., NIST Chemistry WebBook, NIST Standard Reference Database 69.
- Banfi, D. and Patiny, L., "www.nmrdb.org: Resurrecting and processing NMR spectra on-line", Chimia, 2008.
- Landrum, G. et al., RDKit: Open-Source Cheminformatics.
Author: Magdalena Lederbauer
Contact: GitHub @mlederbauer