synthetic-data-generation - SKILL.md Agent Skill

name: "synthetic-data-generation"
description: 'Generate synthetic data for fine-tuning, eval-set bootstrapping, RAG corpus augmentation, and rare-case coverage. Covers Self-Instruct, Evol-Instruct, persona-based generation, distillation from larger models, dataset curation (filtering, dedup, decontamination), and provenance / dataset cards.'
metadata:
  author: "AgentX"
  version: "1.0.0"
  created: "2026-04-30"
  updated: "2026-04-30"
compatibility:
  frameworks: ["distilabel", "argilla", "self-instruct", "evol-instruct", "openai", "anthropic"]
  languages: ["python"]

Synthetic Data Generation

Purpose: Build training and eval datasets quickly without violating user privacy, while controlling for quality, diversity, and contamination.

When to Use This Skill

Bootstrapping eval sets when you have no real labeled data
Augmenting fine-tuning data for under-represented intents / slices
Generating adversarial test cases (combine with ai-safety-and-red-teaming)
Creating QA pairs over a private RAG corpus
Creating privacy-safe substitutes for production data in dev/test

When NOT to Use

If real, well-labeled data exists at sufficient scale -- use it
For final benchmark numbers in regulated decisions -- use human-labeled holdouts

Common Techniques

Technique	Use For
Self-Instruct	Generate instruction/response pairs from a small seed
Evol-Instruct	Iteratively rewrite prompts to be deeper, broader, harder
Persona-based	Generate from N personas to enforce demographic / role diversity
Distillation	Train a smaller model on a stronger model's answers (check provider terms)
Back-translation	Paraphrase via round-trip translation; extend multilingual coverage
Counterfactual augmentation	Flip protected attributes to test fairness
Programmatic / templated	Slot-fill structured data with controlled noise
RAG-grounded QA	Generate question-answer pairs from each chunk; well suited to eval sets
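
The programmatic / templated row can be illustrated with a minimal sketch. The templates, slot values, and the `generate` and `add_noise` names are hypothetical, and "controlled noise" is assumed here to mean adjacent-character typos injected at a fixed rate; other noise models are equally valid.

```python
import random

# Hypothetical templates and slot values, for illustration only.
TEMPLATES = [
    "Cancel my {product} subscription starting {date}.",
    "How do I upgrade {product} before {date}?",
]
SLOTS = {
    "product": ["Basic", "Pro", "Enterprise"],
    "date": ["next Monday", "March 3", "the end of the month"],
}


def add_noise(text: str, rng: random.Random, rate: float) -> str:
    """With probability `rate`, swap two adjacent characters (a typo)."""
    if len(text) < 2 or rng.random() >= rate:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def generate(n: int, noise_rate: float = 0.2, seed: int = 0) -> list[dict]:
    """Slot-fill `n` rows; keep the clean text and slots for provenance."""
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    rows = []
    for _ in range(n):
        template = rng.choice(TEMPLATES)
        values = {name: rng.choice(options) for name, options in SLOTS.items()}
        clean = template.format(**values)
        rows.append({
            "text": add_noise(clean, rng, noise_rate),
            "clean_text": clean,
            "slots": values,
            "template": template,
        })
    return rows
```

Keeping `clean_text`, `slots`, and `template` next to the noised text makes each row traceable to how it was produced, which helps later when writing the dataset card.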

Pipeline

1. Define the target distribution and slices (intents, difficulty, demographics)
2. Choose a generator model (often a strong reasoning model)
3. Seed with examples or a schema
4. Generate at an over-sample rate (2-5x the target size) to leave room for filtering
5. Quality filter: schema validation, toxicity, dedupe
6. Diversity filter: embedding clustering, slice balancing
7. Decontamination: remove overlap with eval/test sets
8. Human review of a sample (5-10%)
9. Write the dataset card with version and provenance
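
The control flow of steps 4 through 8 can be sketched as follows. This is a skeleton under stated assumptions, not a reference implementation: `fake_generate` is a hypothetical stand-in for the generator-model call, the two stages shown are deliberately simple, and the 5% review fraction is the low end of the range above.

```python
import random
from typing import Callable

Row = dict
Stage = Callable[[list[Row]], list[Row]]


def fake_generate(n: int, rng: random.Random) -> list[Row]:
    """Hypothetical stand-in for an LLM call; returns canned rows."""
    intents = ["billing", "shipping", "returns"]
    return [
        {"intent": rng.choice(intents), "text": f"question {rng.randrange(max(1, n // 2))}"}
        for _ in range(n)
    ]


def drop_exact_duplicates(rows: list[Row]) -> list[Row]:
    """Quality stage: keep the first occurrence of each (intent, text)."""
    seen, kept = set(), []
    for row in rows:
        key = (row["intent"], row["text"])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def balance_slices(per_slice: int) -> Stage:
    """Diversity stage: cap every intent at `per_slice` rows."""
    def stage(rows: list[Row]) -> list[Row]:
        counts: dict[str, int] = {}
        kept = []
        for row in rows:
            seen_so_far = counts.get(row["intent"], 0)
            if seen_so_far < per_slice:
                counts[row["intent"]] = seen_so_far + 1
                kept.append(row)
        return kept
    return stage


def run_pipeline(target: int, oversample: int,
                 stages: list[tuple[str, Stage]], seed: int = 0):
    rng = random.Random(seed)
    rows = fake_generate(target * oversample, rng)  # step 4: over-sample
    log = [("generated", len(rows))]
    for name, stage in stages:                      # steps 5-7: filter
        rows = stage(rows)
        log.append((name, len(rows)))
    # step 8: draw roughly 5% of the surviving rows for human review
    review = rng.sample(rows, max(1, round(0.05 * len(rows)))) if rows else []
    return rows, review, log
```

The `log` of per-stage counts is what later feeds the "filters applied with retention rates" entry of the dataset card.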

Quality Filters

Filter	Tool
Schema valid	JSON Schema, Pydantic
Length sane	Token count bounds
Language correct	fastText language ID (lid.176 model)
Toxicity / unsafe	Llama Guard, Azure AI Content Safety
PII present	Presidio
Dedupe (near)	MinHash / SimHash, embedding cosine
Factuality (RAG)	Citation present and supports claim
Difficulty balanced	LLM judge classifies easy/medium/hard
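
As a rough illustration of the near-duplicate filter, the sketch below computes exact Jaccard similarity over word shingles, which is the quantity that MinHash approximates at scale. The function names, the shingle size of 3, and the 0.7 threshold are assumptions chosen for the example; a real pipeline would likely use a MinHash/LSH library instead of this quadratic comparison.

```python
def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Lowercased word k-shingles; short texts become a single shingle."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe_near(texts: list[str], threshold: float = 0.7) -> list[str]:
    """Keep a text only if it is below `threshold` against every kept text."""
    kept: list[str] = []
    kept_shingles: list[set] = []
    for text in texts:
        current = shingles(text)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(text)
            kept_shingles.append(current)
    return kept
```

The first occurrence wins, so the input order matters; sorting by a quality score before deduplicating is one way to keep the better copy.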

Apply filters in order from cheap to expensive.
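
A minimal sketch of a cheap-to-expensive filter chain that also records retention per filter. The row format (a JSON string with `instruction` and `response` keys), the length bounds, and `expensive_judge` are assumptions for illustration; `expensive_judge` is only a placeholder for an LLM-judge or classifier call.

```python
import json


def schema_valid(raw: str) -> bool:
    """Cheapest check: parses as JSON with two string fields."""
    try:
        row = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (isinstance(row, dict)
            and isinstance(row.get("instruction"), str)
            and isinstance(row.get("response"), str))


def length_sane(raw: str, lo: int = 3, hi: int = 200) -> bool:
    """Whitespace token count as a rough proxy for model tokens."""
    n_tokens = len(json.loads(raw)["response"].split())
    return lo <= n_tokens <= hi


def expensive_judge(raw: str) -> bool:
    """Hypothetical placeholder for an LLM-judge or classifier call."""
    return "lorem ipsum" not in raw.lower()


# Ordered from cheap to expensive so costly checks see fewer rows.
FILTERS = [
    ("schema", schema_valid),
    ("length", length_sane),
    ("judge", expensive_judge),
]


def apply_filters(rows: list[str], filters=FILTERS):
    """Run filters in order and report how many rows each one retains."""
    report = []
    for name, keep in filters:
        before = len(rows)
        rows = [row for row in rows if keep(row)]
        report.append({
            "filter": name,
            "in": before,
            "out": len(rows),
            "retention": len(rows) / before if before else 0.0,
        })
    return rows, report
```

Because `length_sane` assumes the row already parsed, the ordering is also a correctness dependency here, not only a cost optimization.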

Decontamination

Critical whenever synthetic training data could overlap with an eval set.

N-gram overlap against the held-out eval set (13-grams are a common window size)
Embedding-based near-duplicate detection across train and test
Source-traceable seeds: never seed from the eval set itself

Document decontamination in the dataset card. An eval that the model has already seen does not measure what you think it measures.
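
The n-gram overlap check might look like the sketch below. The tokenization (lowercased `\w+` matches) and the function names are assumptions; real pipelines often normalize text more aggressively before comparing.

```python
import re


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercased word n-grams; empty if the text has fewer than n tokens."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_eval_index(eval_texts: list[str], n: int = 13) -> set:
    """Union of all n-grams that appear anywhere in the eval set."""
    index: set = set()
    for text in eval_texts:
        index |= ngrams(text, n)
    return index


def decontaminate(train_texts: list[str], eval_texts: list[str], n: int = 13):
    """Split training texts into (clean, flagged) by shared n-grams."""
    index = build_eval_index(eval_texts, n)
    clean, flagged = [], []
    for text in train_texts:
        (flagged if ngrams(text, n) & index else clean).append(text)
    return clean, flagged
```

Texts shorter than `n` tokens produce no n-grams and are never flagged by this check, which is one reason to pair it with embedding-based near-duplicate detection. The counts of clean and flagged rows are useful decontamination evidence for the dataset card.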

Diversity and Bias

Stratify by intent, language, register (formal/informal), persona, slice
Measure diversity: distinct n-grams, embedding-cluster spread, persona coverage
Watch for mode collapse (the generator falls into a narrow set of patterns); mitigate with higher sampling temperature, persona conditioning, or rejection sampling
Audit for amplified bias (compare slice distributions to target)
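
Two of the measurements above can be sketched in a few lines: a distinct-n ratio as a simple diversity score, and the gap between observed and target slice shares as a bias audit. The function names and the whitespace tokenization are assumptions for illustration.

```python
from collections import Counter


def distinct_n(texts: list[str], n: int = 2) -> float:
    """Unique n-grams divided by total n-grams across all texts."""
    total = 0
    unique: set[tuple[str, ...]] = set()
    for text in texts:
        tokens = text.lower().split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0


def slice_gaps(labels: list[str], target: dict[str, float]) -> dict[str, float]:
    """Observed share minus target share per slice (positive = over-represented)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        name: (counts.get(name, 0) / total if total else 0.0) - share
        for name, share in target.items()
    }
```

A distinct-n value that drops as the dataset grows is one possible signal of mode collapse; large positive or negative gaps point at slices to down-sample or regenerate.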

Legal and Ethical

Check provider Terms of Service before training models on outputs (some providers prohibit using outputs to train competing models)
Honor source licenses (do not generate near-copies of copyrighted text)
Do not generate synthetic personally identifying data about real individuals
Document the generator model and prompt; downstream consumers must know provenance

Dataset Card (Required Output)

Purpose and intended use
Generation method (model, prompt template, seeds)
Filters applied with retention rates
Slice distribution (counts per category)
Decontamination evidence
Known limitations and biases
Version, license, contact
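
One way to keep these fields from being skipped is to model the card as a small data structure and render it from there. The sketch below is an assumption about how that could look; the `DatasetCard` class, its field names, and the plain-text layout are illustrative and do not follow any particular hub's card schema.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetCard:
    name: str
    version: str
    license: str
    contact: str
    purpose: str
    generator_model: str
    prompt_template: str
    seed_source: str
    filter_retention: dict[str, float] = field(default_factory=dict)
    slice_counts: dict[str, int] = field(default_factory=dict)
    decontamination: str = ""
    limitations: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Names of fields that are still empty, in declaration order."""
        return [name for name, value in vars(self).items() if value in ("", [], {})]

    def render(self) -> str:
        """Plain-text card covering every required section."""
        lines = [
            f"Dataset card: {self.name} (version {self.version})",
            f"License: {self.license}",
            f"Contact: {self.contact}",
            f"Purpose and intended use: {self.purpose}",
            f"Generation method: model={self.generator_model}; "
            f"prompt template={self.prompt_template}; seeds={self.seed_source}",
            "Filters applied (retention rate):",
        ]
        lines += [f"  {name}: {rate:.0%}" for name, rate in self.filter_retention.items()]
        lines.append("Slice distribution (count):")
        lines += [f"  {name}: {count}" for name, count in self.slice_counts.items()]
        lines.append(f"Decontamination evidence: {self.decontamination or 'MISSING'}")
        lines.append("Known limitations and biases:")
        lines += [f"  {item}" for item in self.limitations] or ["  MISSING"]
        return "\n".join(lines)
```

Calling `missing_fields()` before release and refusing to publish while it is non-empty is one simple way to enforce the "required output" rule.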

Skills to Load Alongside

Need	Skill
Eval set construction	`ai-evaluation`
Fine-tuning that consumes the data	`model-fine-tuning`
RAG-grounded QA generation	`rag-pipelines`
Bias / safety filtering	`ai-safety-and-red-teaming`
Drift over generations	`data-drift-strategy`

References

Self-Instruct (Wang et al.), Evol-Instruct (WizardLM)
Distilabel and Argilla pipelines
Hugging Face dataset card format