name: mace-dataset-curation
description: Use this skill for turning VASP result trees into extxyz training datasets that follow the validated reference-script conventions for REF labels, optional head/config_type tags, and fixed split artifacts before MACE training.
mace-dataset-curation
Overview
Use this skill to convert collected VASP runs into a reusable extxyz dataset directory while staying close to the validated reference_scripts/mace_training_example export contract.
Quick Start
- Point build_dataset_from_runs at one result root.
- Choose frame_mode deliberately: final or all_ionic_steps.
- If the downstream training uses a multi-head foundation model, set head_label explicitly, typically omat_pbe.
- Keep the split fractions explicit and reproducible, and leave split_unit="source_run" unless you intentionally want frame-level leakage.
- Leave require_converged=false unless you are intentionally building a guessed-converged subset.
- Carry forward the emitted summary JSON and split file paths.
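The Quick Start choices can be written down as one explicit argument set and checked before the tool is called. This is a minimal sketch, not the tool's actual interface: the parameter names frame_mode, head_label, split_unit, and require_converged come from this document, while the result_root and split_fractions keys, the example fractions, and the check_args helper are assumptions made for illustration.

```python
# Hypothetical argument set. Only frame_mode, head_label, split_unit and
# require_converged are names taken from this document; everything else is assumed.
ALLOWED_FRAME_MODES = {"final", "all_ionic_steps"}


def check_args(args):
    """Return a list of problems found in a proposed argument set (hypothetical helper)."""
    problems = []
    if args.get("frame_mode") not in ALLOWED_FRAME_MODES:
        problems.append("frame_mode must be 'final' or 'all_ionic_steps'")
    fractions = args.get("split_fractions", {})  # assumed key name
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        problems.append("split fractions must sum to 1")
    if args.get("split_unit") != "source_run":
        problems.append("split_unit is not 'source_run': frames from one run may leak across splits")
    return problems


args = {
    "result_root": "results/",  # assumed key name, placeholder path
    "frame_mode": "all_ionic_steps",
    "head_label": "omat_pbe",
    "split_fractions": {"train": 0.8, "valid": 0.1, "test": 0.1},  # example values
    "split_unit": "source_run",
    "require_converged": False,
}
print(check_args(args))
```

Writing the arguments out in full, rather than relying on defaults, keeps the frame policy and split settings visible in whatever record accompanies the dataset.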
Allowed tools
build_dataset_from_runs
Workflow
1. Choose the frame policy first
- The validated reference workflow starts from ionic-step data rather than only final frames, so all_ionic_steps is the default starting point when you want to match that path.
- final is only for deliberately reduced datasets where relaxed endpoints alone are the training target.
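The difference between the two frame policies can be sketched on a plain list of ionic steps. The select_frames helper and the placeholder step values are hypothetical; the real tool presumably works on parsed VASP output rather than strings.

```python
def select_frames(ionic_steps, frame_mode):
    """Pick frames from one run's ordered ionic steps (hypothetical helper)."""
    if frame_mode == "all_ionic_steps":
        return list(ionic_steps)
    if frame_mode == "final":
        # Slicing keeps the result a list and yields [] for an empty run.
        return list(ionic_steps[-1:])
    raise ValueError(f"unknown frame_mode: {frame_mode!r}")


steps = ["step0", "step1", "step2"]  # placeholders for parsed ionic steps
all_frames = select_frames(steps, "all_ionic_steps")
final_frames = select_frames(steps, "final")
print(len(all_frames), len(final_frames))
```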
2. Keep split artifacts stable
- Treat dataset.extxyz, train.extxyz, valid.extxyz, and test.extxyz as the canonical handoff set.
- Use the summary JSON as the ledger for skipped runs, frame counts, stress encoding assumptions, and any head_label / config_type tags.
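The split files stay stable only if the assignment of runs to splits is deterministic. Below is a minimal sketch of a run-level split, assuming frames arrive as (source_run, frame) pairs; the split_by_source_run helper, the fractions, and the seed are illustrative and not taken from the reference scripts.

```python
import random


def split_by_source_run(frames, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole runs to train/valid/test so that no run spans two splits.

    frames is a list of (source_run, frame) pairs; fractions and seed are example values.
    """
    # Sort before shuffling so the result does not depend on input order.
    runs = sorted({run for run, _ in frames})
    random.Random(seed).shuffle(runs)
    n_train = int(round(fractions[0] * len(runs)))
    n_valid = int(round(fractions[1] * len(runs)))
    assignment = {}
    for i, run in enumerate(runs):
        if i < n_train:
            assignment[run] = "train"
        elif i < n_train + n_valid:
            assignment[run] = "valid"
        else:
            assignment[run] = "test"
    splits = {"train": [], "valid": [], "test": []}
    for run, frame in frames:
        splits[assignment[run]].append((run, frame))
    return splits


# Ten hypothetical runs with three ionic steps each.
frames = [(f"run{r}", step) for r in range(10) for step in range(3)]
splits = split_by_source_run(frames)
print({name: len(members) for name, members in splits.items()})
```

Because the unit of assignment is the run, consecutive ionic steps from one relaxation, which are strongly correlated, never end up on both sides of the train/test boundary.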
3. Keep the reference labeling contract explicit
- The validated reference export path writes REF_energy, REF_forces, REF_stress, config_type, and optionally head.
- If the downstream finetune uses mace-mh-1 with omat_pbe, set head_label="omat_pbe" instead of assuming the head will be inferred later.
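To make the labeling contract concrete, here is a hand-written sketch of how one extxyz frame carrying these keys could look. It is not the reference exporter, which presumably goes through ASE; the format_extxyz_frame helper, the fully periodic pbc flag, and all numeric values are placeholders, and unit or stress sign conventions are not addressed here.

```python
def format_extxyz_frame(symbols, positions, forces, cell, energy, stress,
                        config_type, head=None):
    """Format one frame as extxyz text with REF_* labels (illustrative helper)."""

    def flat(rows):
        # Flatten a 3x3 nested sequence into a space-separated string.
        return " ".join(f"{x:.8f}" for row in rows for x in row)

    info = [
        f'Lattice="{flat(cell)}"',
        "Properties=species:S:1:pos:R:3:REF_forces:R:3",
        f"REF_energy={energy:.8f}",
        f'REF_stress="{flat(stress)}"',
        f"config_type={config_type}",
    ]
    if head is not None:
        info.append(f"head={head}")  # written only when a head label is set
    info.append('pbc="T T T"')  # assumes a fully periodic cell
    lines = [str(len(symbols)), " ".join(info)]
    for sym, pos, frc in zip(symbols, positions, forces):
        lines.append(" ".join([sym] + [f"{x:.8f}" for x in list(pos) + list(frc)]))
    return "\n".join(lines) + "\n"


# Placeholder single-atom frame; none of these numbers are real results.
frame_text = format_extxyz_frame(
    symbols=["Si"],
    positions=[(0.0, 0.0, 0.0)],
    forces=[(0.0, 0.0, 0.0)],
    cell=[(5.0, 0.0, 0.0), (0.0, 5.0, 0.0), (0.0, 0.0, 5.0)],
    energy=-5.0,
    stress=[(0.0, 0.0, 0.0)] * 3,
    config_type="Default",
    head="omat_pbe",
)
print(frame_text)
```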
4. Stop before training
- This skill ends at a curated dataset directory.
- Hand the resulting dataset to mace-finetuning-and-benchmark or active-learning-relabel-loop instead of mixing curation and training into one opaque step.
- Use this skill once the working artifact is a dataset directory rather than a structure-screening batch.
Method-critical defaults
- The reference-aligned default is to keep all ionic steps and store step_electronic_converged_guess for later filtering; use require_converged=true only when that hard subset is the actual dataset target.
- Keep alignment_check=true unless you are deliberately debugging malformed XML; XML/ASE step-order mismatches should cause the run to be skipped, not silently truncated into the dataset.
- Report whether the dataset contains final frames only or full ionic trajectories.
- Surface the head_label when the dataset is intended for multi-head foundation-model finetuning.
- Keep split_unit="source_run" for trajectory-style data unless you intentionally want frame-level mixing across train/valid/test.
- Do not silently reshuffle split fractions between model comparisons.
- Keep the handoff artifact explicit: this skill should start from collected result trees and end with a reproducible dataset directory plus split files.
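The convergence default above amounts to "store the flag, filter later". A minimal sketch, assuming each frame is a dict that carries step_electronic_converged_guess; the filter_converged helper and the dict representation are hypothetical, and treating a missing flag as unconverged is a choice made here, not something the document specifies.

```python
def filter_converged(frames, require_converged=False):
    """Keep every frame by default; drop unflagged frames only when asked to."""
    if not require_converged:
        return list(frames)
    # A frame without the flag is treated as unconverged (assumption of this sketch).
    return [f for f in frames if f.get("step_electronic_converged_guess") is True]


frames = [
    {"step": 0, "step_electronic_converged_guess": True},
    {"step": 1, "step_electronic_converged_guess": False},
    {"step": 2},  # flag missing
]
kept_default = filter_converged(frames)
kept_strict = filter_converged(frames, require_converged=True)
print(len(kept_default), len(kept_strict))
```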
Output Contract
Return:
- dataset directory
- split file paths
- dataset summary JSON
- any skipped-run ledger
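Before handing the directory on, it can help to confirm that the canonical split files are actually present. The sketch below checks only the four extxyz names mentioned in this document; the missing_handoff_files helper is hypothetical, and the summary JSON is left out because its filename is not given here.

```python
from pathlib import Path
import tempfile

HANDOFF_FILES = ("dataset.extxyz", "train.extxyz", "valid.extxyz", "test.extxyz")


def missing_handoff_files(dataset_dir):
    """Return the canonical handoff files that are absent from dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in HANDOFF_FILES if not (root / name).is_file()]


# Demonstration on a temporary directory that deliberately lacks test.extxyz.
with tempfile.TemporaryDirectory() as tmp:
    for name in HANDOFF_FILES[:3]:
        (Path(tmp) / name).write_text("")
    missing = missing_handoff_files(tmp)
print(missing)
```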
References
- Reference flow: vasp_to_mace_finetune.md
- Export conventions: export_ase_db_to_mace_xyz.py