gcms-processing - SKILL.md Agent Skill

name: gcms-processing description: "Process GC-MS data for metabolomics and volatile compound analysis including peak detection, deconvolution, and NIST library matching. Use when: user has GC-MS data, needs retention index calculation, wants to identify volatiles, match EI spectra against NIST, or process derivatized metabolites. Triggers: GC-MS, gas chromatography, volatile analysis, NIST library, retention index, Kovats index, EI spectrum, electron ionization, deconvolution, AMDIS, TMS derivatives, volatile profiling, headspace analysis, SPME." tool_type: python primary_tool: matchms

GC-MS Data Processing

Overview

GC-MS metabolomics requires specialized workflows distinct from LC-MS: electron ionization (EI) fragmentation produces reproducible spectra suitable for library matching, but introduces challenges like derivatization artifacts, column bleed interference, and co-eluting peaks requiring deconvolution. This skill covers the full pipeline from raw data through metabolite identification.

Version Compatibility

Reference examples tested with: matchms 0.18-0.26, pyopenms 3.x, xcms 4.x

The matchms API for Scores objects and CosineGreedy.pair() return types changed across versions. If code throws TypeError or KeyError, introspect the return value (e.g., print(type(result), result)) and adapt the unpacking accordingly.

Installation

Python Tools

uv pip install matchms pyopenms numpy pandas scipy

R Tools (XCMS for GC-MS)

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("xcms", "MSnbase"))

Verify Installation

import matchms
import pyopenms as ms
print(f"matchms: {matchms.__version__}")
print(f"pyopenms: {ms.__version__}")

Core Capabilities

1. Loading GC-MS Data

Read raw GC-MS files (mzML, mzXML, CDF) and inspect TIC vs EIC.

import pyopenms as ms
import numpy as np

exp = ms.MSExperiment()
ms.MzMLFile().load("gcms_sample.mzML", exp)

# Build total ion chromatogram (TIC)
retention_times = []
tic_intensities = []
for spectrum in exp:
    if spectrum.getMSLevel() == 1:
        retention_times.append(spectrum.getRT())
        mz, intensity = spectrum.get_peaks()
        tic_intensities.append(np.sum(intensity))

print(f"Spectra: {len(retention_times)}")
print(f"RT range: {min(retention_times):.1f} - {max(retention_times):.1f} s")

Extract an extracted ion chromatogram (EIC) for a target m/z:

def extract_eic(experiment, target_mz, tolerance=0.5):
    rts = []
    intensities = []
    for spectrum in experiment:
        if spectrum.getMSLevel() == 1:
            rts.append(spectrum.getRT())
            mz_array, int_array = spectrum.get_peaks()
            mask = np.abs(mz_array - target_mz) <= tolerance
            intensities.append(np.sum(int_array[mask]) if np.any(mask) else 0.0)
    return np.array(rts), np.array(intensities)

# EIC for m/z 73 (common TMS fragment)
rt_arr, eic_arr = extract_eic(exp, target_mz=73.0, tolerance=0.5)

2. Peak Detection with XCMS (R, GC-MS Optimized)

CentWave parameters tuned for GC-MS: narrower peak widths, wider ppm tolerance for unit-mass-resolution instruments.

library(xcms)
library(MSnbase)

raw_files <- list.files("gcms_data", pattern = "\\.mzML$", full.names = TRUE)
raw_data <- readMSData(raw_files, mode = "onDisk")

# CentWave parameters optimized for GC-MS
# - peakwidth: GC peaks are typically 2-10 seconds (narrower than LC)
# - ppm: unit-mass GC-MS needs wider tolerance (~30-40 ppm);
#         for HR-GC-MS (e.g., GC-Orbitrap), use 5-10 ppm instead
# - noise: adjust to instrument baseline
cwp <- CentWaveParam(
    peakwidth = c(2, 10),
    ppm = 40,
    snthresh = 5,
    prefilter = c(3, 500),
    mzdiff = 0.5,
    noise = 500,
    integrate = 2
)

xdata <- findChromPeaks(raw_data, param = cwp)
cat("Peaks detected:", nrow(chromPeaks(xdata)), "\n")

3. Retention Index Calculation

Retention indices normalize retention times against an alkane ladder, enabling cross-instrument comparison.

Kovats Retention Index (isothermal GC, uses logarithmic interpolation):

import numpy as np

def kovats_ri(rt_compound, alkane_rts, alkane_carbons):
    """Calculate Kovats retention index for isothermal GC.

    Uses the original Kovats logarithmic formula:
    RI = 100 * (n + (log(tR(x)) - log(tR(n))) / (log(tR(n+1)) - log(tR(n))))

    Args:
        rt_compound: retention time of the target compound (seconds)
        alkane_rts: list of alkane retention times (seconds), sorted ascending
        alkane_carbons: list of carbon numbers matching alkane_rts
    """
    alkane_rts = np.array(alkane_rts, dtype=np.float64)
    alkane_carbons = np.array(alkane_carbons)

    idx = np.searchsorted(alkane_rts, rt_compound) - 1
    idx = max(0, min(idx, len(alkane_rts) - 2))

    n_z = alkane_carbons[idx]
    rt_z = alkane_rts[idx]
    rt_z1 = alkane_rts[idx + 1]

    ri = 100 * (n_z + (np.log(rt_compound) - np.log(rt_z)) / (np.log(rt_z1) - np.log(rt_z)))
    return ri

Linear Retention Index (LTPRI) (temperature-programmed GC, most common in metabolomics):

def linear_ri(rt_compound, alkane_rts, alkane_carbons):
    """Calculate linear (temperature-programmed) retention index.

    For temperature-programmed GC, retention times increase linearly with
    carbon number, so the simpler linear interpolation formula is used:
    RI = 100 * (n + (tR(x) - tR(n)) / (tR(n+1) - tR(n)))

    Args:
        rt_compound: retention time of the target compound (seconds)
        alkane_rts: list of alkane retention times (seconds), sorted ascending
        alkane_carbons: list of carbon numbers matching alkane_rts
    """
    alkane_rts = np.array(alkane_rts, dtype=np.float64)
    alkane_carbons = np.array(alkane_carbons)

    idx = np.searchsorted(alkane_rts, rt_compound) - 1
    idx = max(0, min(idx, len(alkane_rts) - 2))

    n_z = alkane_carbons[idx]
    rt_z = alkane_rts[idx]
    rt_z1 = alkane_rts[idx + 1]

    ri = 100 * (n_z + (rt_compound - rt_z) / (rt_z1 - rt_z))
    return ri

# Alkane ladder C8-C30
alkane_rts = [180.2, 240.5, 310.1, 385.7, 462.3, 540.8, 621.4, 703.9,
              788.1, 874.2, 962.0, 1051.5, 1142.8, 1235.6, 1330.1,
              1426.0, 1523.4, 1622.1, 1722.3, 1823.8, 1926.5, 2030.7, 2136.2]
alkane_carbons = list(range(8, 31))

ri = linear_ri(rt_compound=450.0, alkane_rts=alkane_rts, alkane_carbons=alkane_carbons)
print(f"Linear RI: {ri:.0f}")

4. Spectral Deconvolution and Library Matching with matchms

matchms supports GC-MS spectral comparison against reference libraries such as NIST and MassBank.

from matchms.importing import load_from_msp
from matchms.similarity import CosineGreedy
from matchms import calculate_scores
from matchms.filtering import (
    default_filters,
    normalize_intensities,
    select_by_mz,
    select_by_relative_intensity,
)

def process_spectrum(spectrum):
    spectrum = default_filters(spectrum)
    spectrum = normalize_intensities(spectrum)
    spectrum = select_by_mz(spectrum, mz_from=35, mz_to=500)
    spectrum = select_by_relative_intensity(spectrum, intensity_from=0.01)
    return spectrum

# Load reference library (NIST-format MSP export)
references = list(load_from_msp("nist_ei_library.msp"))
references = [process_spectrum(s) for s in references if s is not None]

# Load query spectra (deconvolved peaks exported as MSP)
queries = list(load_from_msp("sample_deconvolved.msp"))
queries = [process_spectrum(s) for s in queries if s is not None]

# Calculate cosine similarity scores
similarity = CosineGreedy(tolerance=0.5)
scores = calculate_scores(references, queries, similarity)

# Extract best matches per query
for i, query in enumerate(queries):
    best = scores.scores_by_query(query, "CosineGreedy_score", sort=True)
    # Each entry is (reference, score_object); access style depends on matchms version
    top_matches = []
    for ref, score_obj in best[:3]:
        sc = float(score_obj["score"] if isinstance(score_obj, dict) else score_obj[0])
        nm = int(score_obj["matches"] if isinstance(score_obj, dict) else score_obj[1])
        if sc > 0.7 and nm >= 6:
            top_matches.append((ref, sc, nm))
    if top_matches:
        ref, score, n_matches = top_matches[0]
        name = ref.get("compound_name", "Unknown")
        print(f"Query RT={query.get('retention_time')}: {name} "
              f"(score={score:.3f}, matched_peaks={n_matches})")

5. NIST Library Matching Workflow

For production-grade NIST matching, combine RI filtering with spectral similarity:

def nist_match_with_ri(query_spectrum, query_ri, references, ri_tolerance=20):
    """Filter candidates by RI window, then rank by spectral similarity."""
    similarity = CosineGreedy(tolerance=0.5)

    candidates = []
    for ref in references:
        ref_ri = ref.get("retention_index")
        if ref_ri is not None and abs(float(ref_ri) - query_ri) <= ri_tolerance:
            score_pair = similarity.pair(ref, query_spectrum)
            if score_pair is not None:
                # matchms >= 0.18 returns a dict-like object; older versions return a tuple
                score_val = float(score_pair["score"] if isinstance(score_pair, dict) else score_pair[0])
                n_matches = int(score_pair["matches"] if isinstance(score_pair, dict) else score_pair[1])
                if score_val > 0.6 and n_matches >= 5:
                    candidates.append({
                        "name": ref.get("compound_name", "Unknown"),
                        "score": score_val,
                        "matched_peaks": n_matches,
                        "ri_diff": abs(float(ref_ri) - query_ri),
                    })

    return sorted(candidates, key=lambda x: x["score"], reverse=True)

6. Peak Detection and Integration with PyOpenMS

import pyopenms as ms

exp = ms.MSExperiment()
ms.MzMLFile().load("gcms_sample.mzML", exp)

# Smooth spectra
gaussian = ms.GaussFilter()
params = gaussian.getParameters()
params.setValue("gaussian_width", 0.15)
gaussian.setParameters(params)
gaussian.filterExperiment(exp)

# Peak picking (centroiding)
picker = ms.PeakPickerHiRes()
pick_params = picker.getParameters()
pick_params.setValue("signal_to_noise", 3.0)
picker.setParameters(pick_params)

centroided = ms.MSExperiment()
picker.pickExperiment(exp, centroided)
print(f"Centroided spectra: {centroided.getNrSpectra()}")

Common GC-MS Issues and Solutions

TIC vs EIC Selection

TIC (Total Ion Chromatogram): use for overall sample overview and retention time alignment. Prone to masking low-abundance compounds.
EIC (Extracted Ion Chromatogram): use for targeted quantification. Select characteristic fragment ions (e.g., m/z 73, 147 for TMS derivatives; m/z 117 for amino acid TMS).

Derivatization Artifacts

TMS (trimethylsilyl) derivatization is the most common approach for GC-MS metabolomics. Common reagents include MSTFA (N-Methyl-N-(trimethylsilyl)trifluoroacetamide) and BSTFA (N,O-Bis(trimethylsilyl)trifluoroacetamide), often with 1% TMCS as catalyst. TMS derivatization produces multiple derivatives per metabolite:

# Common TMS artifacts to flag and exclude
TMS_ARTIFACTS = {
    "hexamethylcyclotrisiloxane": 207,   # D3 siloxane, column bleed
    "octamethylcyclotetrasiloxane": 281,  # D4 siloxane, column bleed
    "trimethylsilanol": 75,               # TMS reagent byproduct
}

def flag_artifacts(peak_list, artifact_mz_values, tolerance=0.5):
    """Flag peaks matching known derivatization artifacts."""
    flagged = []
    for peak in peak_list:
        base_peak_mz = peak.get("base_peak_mz", 0)
        is_artifact = any(
            abs(base_peak_mz - mz) <= tolerance
            for mz in artifact_mz_values
        )
        flagged.append({**peak, "is_artifact": is_artifact})
    return flagged

Column Bleed Detection

Column bleed produces characteristic siloxane fragments that increase with temperature:

COLUMN_BLEED_MZ = [207.0, 281.0, 355.0, 429.0]

def detect_column_bleed(spectrum_mz, spectrum_intensity, threshold=0.05):
    """Check if column bleed fragments dominate the spectrum."""
    total_intensity = np.sum(spectrum_intensity)
    if total_intensity == 0:
        return False
    bleed_intensity = sum(
        spectrum_intensity[np.argmin(np.abs(spectrum_mz - mz))]
        for mz in COLUMN_BLEED_MZ
        if np.min(np.abs(spectrum_mz - mz)) < 0.5
    )
    return (bleed_intensity / total_intensity) > threshold

Workflow: Complete GC-MS Metabolomics Pipeline

# 1. Load raw data
exp = ms.MSExperiment()
ms.MzMLFile().load("gcms_sample.mzML", exp)

# 2. Smooth and centroid
gaussian = ms.GaussFilter()
g_params = gaussian.getParameters()
g_params.setValue("gaussian_width", 0.15)
gaussian.setParameters(g_params)
gaussian.filterExperiment(exp)

picker = ms.PeakPickerHiRes()
centroided = ms.MSExperiment()
picker.pickExperiment(exp, centroided)

# 3. Export spectra for library matching
ms.MzMLFile().store("gcms_centroided.mzML", centroided)

# 4. Calculate RIs for detected peaks using alkane ladder
# 5. Match against NIST/MassBank using matchms with RI filtering
# 6. Export annotated feature table

Best Practices

Always run alkane standards in the same batch for reliable RI calculation
Set mass range to 35-500 m/z for EI spectra (below 35 is noise)
Require minimum 6 matched peaks and cosine score > 0.7 for confident library matches
Check for multiple TMS derivatives of the same metabolite (mono-, di-, tri-TMS)
Use EIC for quantification, TIC for qualitative screening
Blank subtraction is critical: run solvent blanks and derivatization blanks
Monitor column bleed at high-temperature segments of the temperature program
RI tolerance of 10-20 units for intra-lab matching, 30-50 for cross-lab comparison

Resources

matchms documentation: https://matchms.readthedocs.io
OpenMS documentation: https://www.openms.org
NIST Chemistry WebBook: https://webbook.nist.gov
Golm Metabolome Database (GMD): http://gmd.mpimp-golm.mpg.de
MassBank spectral database: https://massbank.eu