chemometrics-ms-metabolomics

name: chemometrics-ms-metabolomics description: >- Expert guidance for processing and analyzing mass spectrometry (MS) based metabolomics data. Covers the complete workflow from sample normalization through data processing, multivariate analysis, and metabolite identification. Use when processing LC-MS or GC-MS metabolomics raw data, performing peak detection, alignment, and normalization, applying multivariate analysis (PCA, PLS-DA, OPLS-DA), identifying metabolites from MS/MS spectra, or designing clinical metabolomics studies. license: MIT author: Alban Ott based-on: Boccard & Rudaz 2018 - Extracting Knowledge from MS Clinical Metabolomic Data

MS Metabolomics Data Processing

Complete workflow from raw MS data to metabolite identification. Load only the reference file relevant to your current task.

Workflow Decision Tree

START: What stage are you at?

├─ PRE-ACQUISITION: Sample normalization needed?
│  ├─ Cell cultures → Normalize by cell count or protein content
│  ├─ Urine → Dilute to constant osmolality or creatinine
│  ├─ Blood/Plasma → Use fixed volume (no normalization needed)
│  └─ Tissue → Normalize by wet weight or protein content

├─ DATA ACQUISITION: QC strategy?
│  ├─ Insert pooled QC samples every 5-10 injections
│  ├─ Randomize sample order within batches
│  └─ Include blanks and internal standards

├─ SIGNAL PROCESSING: Raw data to feature table?
│  ├─ Peak detection → XCMS, MZmine, MS-DIAL
│  ├─ Alignment → RT warping + m/z matching
│  └─ Gap filling → Re-integration at missing features

├─ NORMALIZATION: Which method?
│  ├─ Simple → TSN, MSTUS
│  ├─ Reference-based → PQN
│  └─ QC-based → LOESS, QC-RSC, QC-SVRC

├─ SCALING: How to transform data?
│  ├─ Equal importance → Unit variance (UV) scaling
│  ├─ Reduce impact of large peaks → Pareto scaling
│  └─ Reduce heteroscedasticity → Log transformation

├─ MULTIVARIATE ANALYSIS: Which method?
│  ├─ Exploratory → PCA
│  ├─ Discrimination → PLS-DA, OPLS-DA
│  └─ Biomarker selection → S-plot, VIP scores

└─ IDENTIFICATION: What level?
   ├─ Level 1 → Authentic standard (same RT, MS, MS/MS)
   ├─ Level 2 → Library match (MS/MS spectral match)
   ├─ Level 3 → Putative class (characteristic fragments)
   └─ Level 4 → Unknown (unidentified)

When to Use What

Signal Processing (pre-acquisition + raw data to feature table)

Pre-acquisition normalization: Cell count, creatinine, osmolality, fixed volume -- depends on sample type. Peak detection: pyopenms FeatureFindingMetabo pipeline. Mass accuracy 5-20 ppm, S/N > 3-10. RT alignment: LOESS warping or Obiwarp DTW. Tolerance 0.1-0.5 min. Gap filling: Targeted re-integration at expected m/z + RT for missing features. Details: references/signal-processing.md

Post-Acquisition Normalization

TSN: Total signal normalization. Simple but affected by dominant peaks. MSTUS: Median ratio to geometric mean reference. More robust than TSN. PQN: Probabilistic quotient normalization. Best for biological variation. QC-LOESS: Drift correction using pooled QC samples with LOESS smoothing. Missing values: Filter >30% missing, then min/2 or kNN imputation. Batch effects: QC-based correction, randomization, ComBat. Details: references/normalization.md

Data Scaling

Method	Formula	Effect	Best For
Centering	x - mean	Centers data	All methods
UV (Autoscaling)	(x - mean) / std	Equal importance	When all features matter equally
Pareto	(x - mean) / sqrt(std)	Reduce dominant peaks	Balanced importance
Log transform	log(x + 1)	Reduce heteroscedasticity	Skewed distributions
Range scaling	(x - min) / (max - min)	0-1 range	Neural networks

Multivariate Analysis

PCA: sklearn.decomposition.PCA. Hotelling's T2 for outlier detection. Unsupervised exploration. PLS-DA: PLSRegression + LabelEncoder. 7-fold CV + permutation (n>=100). Q2>0.5 required. VIP scores: Variable Importance in Projection. VIP > 1 as initial biomarker filter. S-plot: Covariance vs. correlation plot. High |p(cov)| + high |p(corr)| = reliable biomarkers. Validation: Permutation testing (p < 0.05), R2Y-Q2 gap < 0.3, FDR correction for biomarker selection. Details: references/multivariate-analysis.md

Cross-reference: ../chemometrics-shared/references/validation-strategies.md

Metabolite Identification

MSI levels: Level 1 (authentic standard) through Level 4 (unknown). Spectral matching: Cosine similarity on normalized MS/MS peaks, mz_tolerance=0.02, min 3 matched peaks. Databases: HMDB, METLIN, MassBank, LipidMaps, KEGG, MoNA. Details: references/metabolite-identification.md

Software Tools

Task	Open Source	Commercial
Peak detection	XCMS, MZmine, MS-DIAL	Compound Discoverer
Alignment	XCMS, OpenMS	Progenesis QI
Statistics	MetaboAnalyst, scikit-learn	SIMCA, Progenesis
Identification	MS-FINDER, SIRIUS	mzCloud, Lipid Search
Pathway analysis	MetaboAnalyst, KEGG	IPA, MetaCore

References

Boccard, J., & Rudaz, S. (2018). Extracting Knowledge from MS Clinical Metabolomic Data: Processing and Analysis Strategies. Chimia, 72(3), 160-167. doi:10.2533/chimia.2018.160
Dunn, W. B., et al. (2011). Procedures for large-scale metabolic profiling. Nature Protocols, 6(7), 1060-1083.
Sumner, L. W., et al. (2007). Proposed minimum reporting standards for chemical analysis. Metabolomics, 3(3), 211-221.
Trygg, J., & Wold, S. (2002). Orthogonal projections to latent structures (O-PLS). Journal of Chemometrics, 16(3), 119-128.