name: ms-format-conversion description: "Convert mass spectrometry vendor files to open formats using msConvert and pyopenms. Use when: user needs to convert RAW files to mzML, convert WIFF/Agilent .d/Waters .raw to open format, batch convert instrument files, or troubleshoot format issues. Triggers: convert RAW file, mzML conversion, mzXML, vendor format, msConvert, ProteoWizard, Thermo RAW, AB SCIEX WIFF, Agilent .d, Waters .raw, Bruker .d, centroid mode, profile mode, batch conversion, format conversion." tool_type: cli primary_tool: msconvert metadata: skill-author: Fei Wen
Metabolomics Data Format Conversion
Overview
Mass spectrometry instruments produce data in proprietary vendor formats that must be converted to open standards (mzML, mzXML) for downstream analysis. This skill covers msConvert (ProteoWizard) for vendor-to-open conversion, pyopenms for mzML/mzXML reading and interconversion, batch processing scripts, and common pitfalls around centroiding, encoding precision, and vendor-specific quirks.
When to Use This Skill
Use this skill when converting raw instrument files to mzML or mzXML, interconverting between open formats, performing batch conversion across many files, or troubleshooting format-related issues in metabolomics pipelines.
Vendor Format Reference
| Vendor | Format | Extension | Platform | Notes |
|---|---|---|---|---|
| Thermo Fisher | RAW | .raw |
Windows | Most common in proteomics/metabolomics |
| AB SCIEX | WIFF | .wiff, .wiff2 |
Windows | .wiff requires .wiff.scan sidecar; .wiff2 requires .wiff2.scan |
| Agilent | MassHunter | .d (directory) |
Windows | Directory containing AcqData/ |
| Waters | MassLynx | .raw (directory) |
Windows | Directory, not a single file (unlike Thermo .raw) |
| Bruker | BAF/TDF | .d (directory) |
Windows/Linux | Contains ser/fid files; timsTOF uses .tdf/.tdf_bin inside .d |
Installation
msConvert (ProteoWizard)
msConvert is the standard tool for vendor format conversion. Vendor format reading requires Windows or Windows libraries via Wine/Docker.
Windows (native):
Download from https://proteowizard.sourceforge.io/download.html and install. After installation:
# Verify installation (Windows cmd or PowerShell)
msconvert --help
Linux/macOS (Docker):
# Pull the ProteoWizard Docker image with Wine for vendor support
docker pull chambm/pwiz-skyline-i-agree-to-the-vendor-licenses:latest
# Verify
docker run --rm chambm/pwiz-skyline-i-agree-to-the-vendor-licenses:latest wine msconvert --help
pyopenms (for open format operations)
uv pip install pyopenms
Core Capabilities
1. Basic Vendor-to-mzML Conversion with msConvert
# Convert a single Thermo RAW file to mzML
msconvert sample.raw --mzML --outdir /data/converted/
# Convert with centroiding (recommended for most workflows)
msconvert sample.raw --mzML --filter "peakPicking vendor msLevel=1-2" --outdir /data/converted/
# Convert to mzXML instead
msconvert sample.raw --mzXML --filter "peakPicking vendor msLevel=1-2" --outdir /data/converted/
2. Centroid vs Profile Mode
Profile mode stores the full peak shape; centroid mode stores only peak centroids (m/z, intensity pairs). Most analysis tools expect centroid data.
# Vendor-native centroiding (highest quality, requires vendor libraries)
msconvert sample.raw --mzML \
--filter "peakPicking vendor msLevel=1-2"
# Algorithm-based centroiding (works without vendor libraries)
msconvert sample.mzML --mzML \
--filter "peakPicking cwt snr=1.0 peakSpace=0.1 msLevel=1-2"
# Check if a file is centroid or profile using pyopenms
import pyopenms as ms
exp = ms.MSExperiment()
ms.MzMLFile().load("sample.mzML", exp)
for i, spec in enumerate(exp):
if i >= 3:
break
data_type = "centroid" if spec.getType() == 1 else "profile"
print(f"Spectrum {i}: MS{spec.getMSLevel()}, {data_type}, {len(spec.get_peaks()[0])} peaks")
3. Encoding Precision (32-bit vs 64-bit)
64-bit encoding preserves full precision but doubles file size. 32-bit is sufficient for most metabolomics workflows.
# Force 32-bit encoding (smaller files)
msconvert sample.raw --mzML \
--filter "peakPicking vendor msLevel=1-2" \
--32
# Force 64-bit encoding (maximum precision)
msconvert sample.raw --mzML \
--filter "peakPicking vendor msLevel=1-2" \
--64
# Mixed: 64-bit m/z, 32-bit intensity (good balance)
msconvert sample.raw --mzML \
--filter "peakPicking vendor msLevel=1-2" \
--mz64 --inten32
4. Filtering During Conversion
msConvert filters reduce file size by selecting only needed data.
# Keep only MS1 spectra
msconvert sample.raw --mzML \
--filter "msLevel 1"
# Keep only a retention time window (seconds)
msconvert sample.raw --mzML \
--filter "scanTime [60,600]"
# Keep only a specific m/z range
msconvert sample.raw --mzML \
--filter "mzWindow [100,1000]"
# Combine multiple filters (applied in order)
msconvert sample.raw --mzML \
--filter "peakPicking vendor msLevel=1-2" \
--filter "msLevel 1-2" \
--filter "scanTime [60,1200]" \
--filter "mzWindow [50,1500]"
5. Batch Conversion Scripts
Bash (Linux/macOS with Docker):
#!/usr/bin/env bash
# batch_convert.sh - Convert all RAW files in a directory
INPUT_DIR="$1"
OUTPUT_DIR="$2"
mkdir -p "$OUTPUT_DIR"
for raw_file in "$INPUT_DIR"/*.raw; do
[ -f "$raw_file" ] || continue
echo "Converting: $(basename "$raw_file")"
docker run --rm \
-v "$INPUT_DIR":/input \
-v "$OUTPUT_DIR":/output \
chambm/pwiz-skyline-i-agree-to-the-vendor-licenses:latest \
wine msconvert "/input/$(basename "$raw_file")" \
--mzML \
--filter "peakPicking vendor msLevel=1-2" \
--outdir /output
done
echo "Conversion complete."
Python (cross-platform):
import subprocess
import os
from pathlib import Path
def batch_convert(
input_dir: str,
output_dir: str,
output_format: str = "mzML",
centroid: bool = True,
) -> list:
"""Batch convert vendor files using msconvert."""
input_path = Path(input_dir)
output_path = Path(output_dir)
output_path.mkdir(parents=True, exist_ok=True)
extensions = {".raw", ".wiff", ".d"}
input_files = [
f for f in input_path.iterdir()
if f.suffix.lower() in extensions or (f.is_dir() and f.suffix.lower() in extensions)
]
converted = []
for raw_file in input_files:
cmd = [
"msconvert", str(raw_file),
f"--{output_format}",
"--outdir", str(output_path),
]
if centroid:
cmd.extend(["--filter", "peakPicking vendor msLevel=1-2"])
try:
subprocess.run(cmd, check=True, capture_output=True, text=True)
converted.append(raw_file.name)
print(f"Converted: {raw_file.name}")
except subprocess.CalledProcessError as e:
print(f"Failed: {raw_file.name} - {e.stderr}")
return converted
# Convert all RAW files in a directory
results = batch_convert("/data/raw", "/data/mzml", output_format="mzML", centroid=True)
print(f"Successfully converted {len(results)} files")
6. mzML/mzXML Interconversion with pyopenms
import pyopenms as ms
def mzml_to_mzxml(input_path: str, output_path: str) -> None:
"""Convert mzML to mzXML using pyopenms."""
exp = ms.MSExperiment()
ms.MzMLFile().load(input_path, exp)
ms.MzXMLFile().store(output_path, exp)
print(f"Converted {input_path} -> {output_path} ({exp.getNrSpectra()} spectra)")
def mzxml_to_mzml(input_path: str, output_path: str) -> None:
"""Convert mzXML to mzML using pyopenms."""
exp = ms.MSExperiment()
ms.MzXMLFile().load(input_path, exp)
ms.MzMLFile().store(output_path, exp)
print(f"Converted {input_path} -> {output_path} ({exp.getNrSpectra()} spectra)")
# Convert mzML to mzXML
mzml_to_mzxml("sample.mzML", "sample.mzXML")
# Convert mzXML to mzML
mzxml_to_mzml("legacy_data.mzXML", "legacy_data.mzML")
7. Validate Converted Files
import pyopenms as ms
def validate_mzml(filepath: str) -> dict:
"""Validate an mzML file and report basic statistics."""
exp = ms.MSExperiment()
ms.MzMLFile().load(filepath, exp)
ms1_count = 0
ms2_count = 0
centroid_count = 0
profile_count = 0
for spec in exp:
level = spec.getMSLevel()
if level == 1:
ms1_count += 1
elif level == 2:
ms2_count += 1
if spec.getType() == 1:
centroid_count += 1
else:
profile_count += 1
rt_range = (
exp.getSpectrum(0).getRT(),
exp.getSpectrum(exp.getNrSpectra() - 1).getRT(),
)
return {
"file": filepath,
"total_spectra": exp.getNrSpectra(),
"ms1_spectra": ms1_count,
"ms2_spectra": ms2_count,
"centroid_spectra": centroid_count,
"profile_spectra": profile_count,
"chromatograms": exp.getNrChromatograms(),
"rt_range_seconds": rt_range,
}
stats = validate_mzml("converted_sample.mzML")
for key, value in stats.items():
print(f"{key}: {value}")
Common Workflows
Workflow 1: Convert a Batch of Thermo RAW Files for XCMS
Convert with centroiding:
msconvert /data/raw/*.raw --mzML \ --filter "peakPicking vendor msLevel=1-2" \ --outdir /data/mzml/Validate output:
from pathlib import Path for mzml in Path("/data/mzml").glob("*.mzML"): stats = validate_mzml(str(mzml)) print(f"{mzml.name}: {stats['total_spectra']} spectra, " f"{stats['centroid_spectra']} centroid")
Workflow 2: Convert Legacy mzXML Archive to mzML
Find all mzXML files:
from pathlib import Path mzxml_files = list(Path("/data/legacy").glob("**/*.mzXML")) print(f"Found {len(mzxml_files)} mzXML files")Batch convert:
output_dir = Path("/data/converted") output_dir.mkdir(parents=True, exist_ok=True) for mzxml in mzxml_files: out_path = output_dir / mzxml.with_suffix(".mzML").name mzxml_to_mzml(str(mzxml), str(out_path))
Best Practices
Always centroid during conversion: Apply vendor-native peak picking (
peakPicking vendor) during the initial conversion step. Re-centroiding downstream is lower quality.Use vendor libraries when possible: Vendor-native centroiding (via msConvert on Windows or Docker) produces superior results compared to algorithm-based alternatives.
Prefer mzML over mzXML: mzML is the current HUPO-PSI standard. mzXML is a legacy format that lacks support for newer features (ion mobility, chromatogram metadata).
Keep original vendor files: Always retain the original vendor files. Conversion is lossy for some metadata (tune parameters, calibration info).
Check centroid/profile status after conversion: Use the validation function to confirm spectra are in the expected mode before running analysis pipelines.
Match encoding to downstream tools: Use 64-bit m/z encoding if your pipeline performs high-precision mass calculations. Use 32-bit intensity to save space without meaningful precision loss.
WIFF files need sidecar: AB SCIEX
.wifffiles require the accompanying.wiff.scanfile in the same directory. Conversion will fail without it.Agilent and Waters use directories: These vendor formats are directories (
.d/,.raw/), not single files. Point msConvert at the directory path.
Resources
- ProteoWizard / msConvert: https://proteowizard.sourceforge.io/
- ProteoWizard Docker image: https://hub.docker.com/r/chambm/pwiz-skyline-i-agree-to-the-vendor-licenses
- mzML specification: https://www.psidev.info/mzML
- pyopenms documentation: https://pyopenms.readthedocs.io/
- HUPO-PSI standards: https://www.psidev.info/