synthetic-data - SKILL.md Agent Skill

name: synthetic-data
description: Synthetic data & fine-tune dataset curation capability pack. Gives AI agents the judgment rules for pretraining/SFT data quality filtering, document-level deduplication, synthetic instruction generation, preference-pair curation, and benchmark contamination detection. Research-grounded rules from Self-Instruct, Evol-Instruct/WizardLM, LSHBloom, distilabel, Axolotl/Unsloth, DPO/RRHF/GRPO, and the ConTAM/CoDeC contamination literature. Use for any synthetic dataset build, fine-tune data prep, dedup pipeline, preference dataset, or contamination audit task.
keywords: ["合成数据", "synthetic data", "数据集", "dataset", "微调数据", "fine-tune data", "去重", "deduplication", "数据清洗", "data filtering", "Self-Instruct", "Evol-Instruct", "偏好数据", "preference data", "DPO", "污染检测", "contamination", "蒸馏", "distillation", "指令数据", "instruction tuning"]
type: reference-based

CONSUMES: User dataset-curation task + raw corpus / seed tasks / fine-tune data description + target model + optional existing pipeline configs
PRODUCES: Applied data-curation judgment rules + filtering thresholds + dedup architecture choice + synthetic-generation pipeline + preference-pair config + chat-template alignment + contamination audit results

Synthetic Data & Fine-Tune Dataset Curation Capability Pack

Version: 0.1.0
Compatibility: Claude Code (Phase 1); Codex / Cursor / Gemini in Phase 3
License: Apache 2.0

What This Pack Does

Without curation rules, AI agents tend to build fine-tune datasets by dumping a model's raw generations into a JSONL file and training on it. They skip deduplication, so the model memorizes repeated documents. They prompt a single model to generate instructions with no diversity controls and no ROUGE-L filtering, so the set collapses into near-duplicates. They evaluate the fine-tuned model on public benchmarks that leaked into the training data and celebrate inflated scores. They mis-map chat templates and end up training on pad tokens.

This pack embeds the judgment rules that data-curation engineers apply automatically — rules from the Self-Instruct and Evol-Instruct/WizardLM papers, LSHBloom-scale dedup architectures, distilabel pipeline practice, Axolotl/Unsloth fine-tuning docs, and the ConTAM/CoDeC contamination-detection literature.

Pack = data-curation judgment. Your workflow system = process constraints. No overlap.

Cross-Cutting Rule: Decontaminate Before You Trust the Score

Before reporting ANY benchmark number for a model trained on web-scraped or synthetic data, run contamination detection against the eval set. Public benchmarks leak into training corpora at high rates: one study flagged up to ~90% of examples in datasets like SQuADv2 and DROP as contaminated (a single-source, definition-specific figure, not a universal rate). Contamination is not binary; it is a spectrum that silently inflates leaderboard scores. The GSM1k study showed accuracy drops of up to 13% on uncontaminated math problems, and Claude Opus 4.5 dropped 35 percentage points (80.9% → 45.9%) from SWE-bench Verified to SWE-bench Pro. SWE-bench Pro is a harder suite as well as a contamination-resistant one, so not all of that gap should be attributed to contamination.

This rule applies to: synthetic generation (your generated set may regurgitate benchmark items), preference curation, and every "our fine-tune scores X%" claim. It is surfaced here because a clean number on a contaminated benchmark is the most expensive silent failure in the whole pipeline.
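
A minimal sketch of one way to score contamination as a spectrum, assuming a plain word-level n-gram overlap between eval items and training documents. The function names, the n-gram size of 8, and the 0.5 threshold are illustrative assumptions, not values taken from ConTAM or CoDeC; the contamination reference file holds the actual rules.

```python
import re


def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(train_docs, n=8):
    """Union of all n-grams seen anywhere in the training documents."""
    index = set()
    for doc in train_docs:
        index |= ngrams(doc, n)
    return index


def contamination_score(eval_item, train_index, n=8):
    """Fraction of the eval item's n-grams that also occur in training data."""
    grams = ngrams(eval_item, n)
    if not grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    return len(grams & train_index) / len(grams)


def flag_contaminated(eval_items, train_docs, n=8, threshold=0.5):
    """Return (item, score) pairs whose overlap score meets the threshold."""
    index = build_index(train_docs, n)
    flagged = []
    for item in eval_items:
        score = contamination_score(item, index, n)
        if score >= threshold:
            flagged.append((item, score))
    return flagged
```

The score is a fraction rather than a yes/no flag, which matches the point that contamination is not binary. Exact n-gram matching misses paraphrased leaks, so treat a low score as weak evidence of cleanliness, not proof.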

Step 0: Context Detection

When the user mentions dataset-curation work, detect the context and load the right reference:

User Signal	Reference to Load
"data quality", "filter corpus", "perplexity filter", "Ask-LLM", "quality score", "数据清洗", "质量过滤"	`references/quality-filtering-rules.md`
"deduplicate", "dedup", "MinHash", "LSH", "near-duplicate", "去重"	`references/deduplication-rules.md`
"synthetic data", "Self-Instruct", "Evol-Instruct", "WizardLM", "generate instructions", "distilabel", "合成数据", "指令生成"	`references/synthetic-generation-rules.md`
"preference data", "DPO", "RRHF", "GRPO", "chat template", "ShareGPT", "Axolotl", "Unsloth", "偏好数据", "微调格式"	`references/preference-alignment-rules.md`
"contamination", "benchmark leak", "data leakage", "ConTAM", "CoDeC", "污染检测", "基准泄漏"	`references/contamination-detection-rules.md`
"full dataset pipeline", "build a fine-tune dataset end to end", "curate everything"	Load all references sequentially

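The routing in the table above can be sketched as a keyword lookup. This is an illustration only: the signal lists are copied from the table, but `detect_references`, `REFERENCE_ROUTES`, and the matching strategy are hypothetical, not part of the pack.

```python
import re

# Signals copied from the context-detection table, lowercased for matching.
REFERENCE_ROUTES = {
    "references/quality-filtering-rules.md": [
        "data quality", "filter corpus", "perplexity filter", "ask-llm",
        "quality score", "数据清洗", "质量过滤",
    ],
    "references/deduplication-rules.md": [
        "deduplicate", "dedup", "minhash", "lsh", "near-duplicate", "去重",
    ],
    "references/synthetic-generation-rules.md": [
        "synthetic data", "self-instruct", "evol-instruct", "wizardlm",
        "generate instructions", "distilabel", "合成数据", "指令生成",
    ],
    "references/preference-alignment-rules.md": [
        "preference data", "dpo", "rrhf", "grpo", "chat template", "sharegpt",
        "axolotl", "unsloth", "偏好数据", "微调格式",
    ],
    "references/contamination-detection-rules.md": [
        "contamination", "benchmark leak", "data leakage", "contam", "codec",
        "污染检测", "基准泄漏",
    ],
}

FULL_PIPELINE_SIGNALS = [
    "full dataset pipeline",
    "build a fine-tune dataset end to end",
    "curate everything",
]


def _mentions(text, signal):
    """True if `signal` occurs in `text` and does not start mid-word.

    Only a leading boundary is enforced, so "dedup" still matches
    "deduplication" while "dpo" does not match "endpoint".
    """
    return re.search(r"(?<![a-z0-9])" + re.escape(signal), text) is not None


def detect_references(message):
    """Return the reference files to load, in table order."""
    text = message.lower()
    if any(_mentions(text, s) for s in FULL_PIPELINE_SIGNALS):
        return list(REFERENCE_ROUTES)  # load all references sequentially
    return [
        ref
        for ref, signals in REFERENCE_ROUTES.items()
        if any(_mentions(text, s) for s in signals)
    ]
```

Keyword matching of this kind is coarse: "codec" also matches a message about video codecs, so a real agent would weigh the surrounding context before loading a file.
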
Step 1: Apply Rules

After loading the relevant reference file(s):

1. Read the reference completely; do not skim.
2. Apply each rule as a judgment check against the user's corpus, pipeline config, or generation setup.
3. For each violated rule, state the violation clearly, then give the specific fix (with the threshold/command from the reference).
4. Enforce the Decontaminate-Before-You-Trust-the-Score cross-cutting rule before reporting any benchmark number.
5. Check stage annotations, which tell you where in the pipeline a rule applies:
   - pretraining: corpus-scale filtering/dedup (millions–billions of docs)
   - post-training: SFT / preference / instruction-tuning data (thousands–millions of examples)
   - eval: contamination audit before scoring

Output format per finding:

[P0] Rule DEDUP3 (deduplication): No near-duplicate pass — only exact SHA-256 dedup is configured.
→ Add a MinHashLSH (or LSHBloom) near-duplicate pass; exact-match misses copyedited/reformatted duplicates.

[P1] Rule GEN1 (synthetic-generation): Self-Instruct loop has no ROUGE-L overlap filter.
→ Reject any generated instruction whose ROUGE-L with an existing pool instruction exceeds 0.7.
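
The fix named in the GEN1 example can be sketched with a standard-library ROUGE-L (the F-measure over the longest common subsequence of tokens). This is a stand-in under stated assumptions: it tokenizes on whitespace, which differs from common ROUGE packages, and the function names are hypothetical. Only the 0.7 rejection threshold comes from the text above.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F-measure between two strings, on lowercase whitespace tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def filter_novel(candidates, pool, threshold=0.7):
    """Keep candidates whose ROUGE-L with every pool instruction is <= threshold.

    Accepted candidates join the pool, so later candidates are also
    compared against them.
    """
    pool = list(pool)
    kept = []
    for cand in candidates:
        if all(rouge_l(cand, existing) <= threshold for existing in pool):
            pool.append(cand)
            kept.append(cand)
    return kept
```

Each candidate is compared against the whole pool, so the cost grows with pool size; for large pools a cheaper pre-filter before the ROUGE-L pass may be worth considering.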

Step 2: Output

Produce a structured curation report:

## Dataset Curation Review: [area reviewed]

### P0 — Blocking (must fix before training / shipping the dataset)
- [finding + specific fix]

### P1 — Required (fix before trusting the dataset)
- [finding + specific fix]

### P2 — Advisory (improves dataset quality)
- [finding + specific fix]

### Pipeline Stage Audit
[table: filter → dedup → generate → align → decontaminate, with status per stage]

### Tool Recommendation
[distilabel / Axolotl / Unsloth / Milvus-BINARY_VECTOR / fastText based on user context]
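
One way to hold findings before writing them into the report sections is a small record type that renders the per-finding format from Step 1 and groups by priority. The `Finding` class and `group_by_priority` helper are hypothetical names used for illustration.

```python
from dataclasses import dataclass

PRIORITIES = ("P0", "P1", "P2")


@dataclass
class Finding:
    priority: str   # "P0", "P1", or "P2"
    rule_id: str    # e.g. "GEN1"
    area: str       # e.g. "synthetic-generation"
    violation: str  # what is wrong
    fix: str        # the specific fix, with threshold or command

    def render(self):
        """Render the two-line per-finding format."""
        return (
            f"[{self.priority}] Rule {self.rule_id} ({self.area}): {self.violation}\n"
            f"→ {self.fix}"
        )


def group_by_priority(findings):
    """Map each priority to its rendered findings, in P0, P1, P2 order."""
    grouped = {p: [] for p in PRIORITIES}
    for finding in findings:
        if finding.priority not in grouped:
            raise ValueError(f"unknown priority: {finding.priority!r}")
        grouped[finding.priority].append(finding.render())
    return grouped
```

Keeping every priority key present, even when its list is empty, makes it easy to emit all three report sections in a fixed order.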

Anti-Skip Table

Excuse	Counter
"Exact dedup is enough"	Exact SHA-256 only catches verbatim copies. Copyediting, reformatting, and versioning produce near-duplicates that inflate memorization, so you need MinHashLSH or LSHBloom.
"We'll just prompt the model to generate a lot of instructions"	Unless you reject candidates whose ROUGE-L overlap with the pool exceeds 0.7 and use the Self-Instruct sampling mix of 6 human-written and 2 machine-generated in-context tasks, your set collapses into near-duplicates. Flat Self-Instruct also underperforms Evol-Instruct evolution.
"Perplexity filtering picks the good data"	Perplexity has in-distribution bias: it discards niche/long-tail documents (high PPL) and keeps repetitive boilerplate (low PPL). Ask-LLM quality rankings have near-zero correlation with perplexity rankings, so perplexity is not a substitute for a quality score.
"Our fine-tune scores 80% on the benchmark"	Did you decontaminate first? One study flagged up to ~90% of SQuADv2/DROP examples as contaminated (single-source figure). GSM1k showed accuracy drops of up to 13%, and SWE-bench Verified → Pro is a 35pp gap on a harder, contamination-resistant suite.
"DPO is all we need"	DPO is structurally pairwise and lacks exploration. For more than two candidates, use RRHF ranking loss; for verifiable math/logic, use GRPO. Also map your chat template correctly, or you train on pad tokens.

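The pad-token point in the last row can be illustrated with a library-independent sketch of label masking. It assumes per-token role tags and an attention mask are already available, and it uses -100 because that is the default `ignore_index` of PyTorch's `CrossEntropyLoss`. The `build_labels` helper is hypothetical and is not the Axolotl or Unsloth API.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_labels(input_ids, roles, attention_mask, train_roles=("assistant",)):
    """Copy token ids into labels only where the loss should apply.

    A position contributes to the loss only if it is a real token
    (attention_mask == 1) and belongs to a role in `train_roles`.
    Everything else, including padding, gets IGNORE_INDEX.
    """
    if not (len(input_ids) == len(roles) == len(attention_mask)):
        raise ValueError("input_ids, roles, and attention_mask must align")
    labels = []
    for token_id, role, mask in zip(input_ids, roles, attention_mask):
        if mask == 1 and role in train_roles:
            labels.append(token_id)
        else:
            labels.append(IGNORE_INDEX)
    return labels
```

Masking by attention mask rather than by token id matters when the pad token shares an id with the EOS token: masking every occurrence of that id would also drop the real end-of-turn token from the loss.
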
Tool Quick Reference

Tool	Install / Invocation	Primary Use
distilabel	`pip install distilabel`	DAG-based synthetic generation pipelines (load → generate → ultrafeedback → to_argilla)
Axolotl	`pip install axolotl`	SFT/preference fine-tuning; `roles_to_train`, `train_on_eos`, `eot_tokens` config
Unsloth	`pip install unsloth`	Fast fine-tuning; `standardize_sharegpt`, `map_eos_token`
fastText	`pip install fasttext`	Language identification filtering in the heuristic gate
Milvus / Zilliz	`pip install pymilvus`	`MINHASH_LSH` index over a `BINARY_VECTOR` MinHash field (avoids float32 precision loss)
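
A standard-library sketch of what a MinHash signature is and why it is stored as raw bytes. It uses salted BLAKE2b hashes in place of random permutations, and it packs each 32-bit minimum into 4 bytes. A float32 has a 24-bit significand, so it cannot represent every 32-bit integer exactly. The function names, the 3-word shingle size, and 128 hash functions are illustrative assumptions; check the Milvus documentation for the exact layout its `MINHASH_LSH` index expects.

```python
import hashlib
import struct


def shingles(text, k=3):
    """Set of k-word shingles; short texts fall back to a single shingle."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(text, num_perm=128, k=3):
    """One 32-bit minimum per salted hash function."""
    items = [s.encode("utf-8") for s in shingles(text, k)]
    signature = []
    for seed in range(num_perm):
        salt = struct.pack("<I", seed)  # a distinct salt stands in for each permutation
        hashes = (
            int.from_bytes(
                hashlib.blake2b(item, digest_size=4, salt=salt).digest(), "little"
            )
            for item in items
        )
        signature.append(min(hashes, default=0xFFFFFFFF))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions, an estimate of shingle-set Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def pack_signature(signature):
    """Pack 32-bit values into raw bytes (4 bytes each), with no float conversion."""
    return struct.pack(f"<{len(signature)}I", *signature)
```

Two documents that differ by one word have different SHA-256 digests but mostly matching signatures, which is the gap between exact and near-duplicate dedup. Under this packing a 128-value signature occupies 512 bytes, or 4096 bits.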