case-studies - SKILL.md Agent Skill

name: case-studies
description: End-to-end case studies showing how to implement the full training pipeline for different skill types. Covers three complete worked examples (tool-calling training, essay-style training, and agentic search training for a RAG agent), each demonstrating dataset design, synthetic generation, validation, fine-tuning, evaluation, and iteration. Use when onboarding to the project, understanding how all components fit together, explaining the pipeline to others, or planning a new training capability. This skill is about UNDERSTANDING the system holistically; reference the other skills for specific CLI commands.
allowed-tools: Read, Bash, Write, Grep, Glob

Case Studies: Implementing the Training Pipeline

Three end-to-end worked examples showing how to take a capability from concept to trained model.

Why Case Studies?

The other skills teach you how to use individual tools:

synthetic-data-generation — how to run SynthChat
fine-tuning — how to run trainers
evaluation — how to run evals
upload-deployment — how to ship models

This skill shows you how they all connect — the decisions, the iteration, and the order of operations that turn an idea into a trained capability.

The Three Case Studies

Case Study	What It Teaches	Reference
Tool Calling	Structured output training — teaching a model to call APIs with correct syntax, context objects, and parameters	`reference/tool-calling-pipeline.md`
Essay Style	Creative output training — teaching a model to transform messy brainstorms into structured outlines with voice and personality	`reference/essay-style-pipeline.md`
Agentic Search	RAG agent training — teaching a model to search a corpus, select relevant documents, and answer grounded in sources	`reference/agentic-search-pipeline.md`

The Universal Pipeline

All three case studies follow the same high-level pipeline, but diverge in dataset design and validation:

┌──────────────────────────────────────────────────────────┐
│  1. DEFINE THE CAPABILITY                                 │
│     What should the model do? What does good look like?   │
│     → Rubrics, schemas, specifications                    │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────────────────────┐
│  2. CREATE TRAINING DATA                                  │
│     How do we generate enough high-quality examples?      │
│     → SynthChat scenarios, handcrafted seeds, self-play   │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────────────────────┐
│  3. VALIDATE & IMPROVE                                    │
│     How do we ensure quality before training?             │
│     → Schema validation, rubric scoring, manual review    │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────────────────────┐
│  4. TRAIN                                                 │
│     SFT first (learn the format), then KTO (learn         │
│     preferences), optionally GRPO (optimize rewards)      │
│     → Trainers with YAML config                           │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────────────────────┐
│  5. EVALUATE                                              │
│     Does the model do what we trained it to do?           │
│     → Evaluator with YAML scenarios                       │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────────────────────┐
│  6. ITERATE                                               │
│     What failed? Generate more data targeting weaknesses. │
│     → Failure analysis → targeted generation → retrain    │
└──────────────────────────────────────────────────────────┘
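
The control flow in the diagram can be sketched as a loop: stage 1 runs once, stages 2 through 5 repeat, and stage 6 decides whether another round is needed. This is only an illustration. The stage labels, `RunState`, `run_pipeline`, and the pass-rate target are hypothetical names, not part of the project's tooling.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RunState:
    """Tracks which stages ran and the latest evaluation result."""
    round_no: int = 0
    pass_rate: float = 0.0
    log: list[str] = field(default_factory=list)


def run_pipeline(
    evaluate: Callable[[int], float],
    target: float,
    max_rounds: int = 3,
) -> RunState:
    """Run define once, then repeat create/validate/train/evaluate until the target is met."""
    state = RunState()
    state.log.append("define")  # stage 1: rubrics, schemas, specifications
    for round_no in range(1, max_rounds + 1):
        state.round_no = round_no
        # Stages 2-5 always run in this order within a round.
        state.log += ["create_data", "validate", "train", "evaluate"]
        state.pass_rate = evaluate(round_no)
        if state.pass_rate >= target:
            break  # good enough: no further iteration needed
        state.log.append("iterate")  # stage 6: analyze failures, plan targeted data
    return state


# Usage with a stand-in evaluator (assumed scores, for illustration only).
fake_scores = {1: 0.6, 2: 0.9}
result = run_pipeline(lambda r: fake_scores[r], target=0.85)
```

In this sketch the only thing that changes between rounds is the data produced in stage 2, which matches the "targeted generation → retrain" arrow in the diagram.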

Key Design Principles

1. Schema-First, Not Example-First

Define what "correct" looks like before writing any training data. For tools, this means JSON schemas. For essays, this means rubrics. The schema is the source of truth — everything validates against it.
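
As a minimal sketch of what "everything validates against it" could look like for tool calls, the snippet below checks a candidate call against a small hand-written schema. The schema layout, the field names (`name`, `context`, `arguments`), and the tool names are assumptions made for illustration; the real schemas live in the tool-calling reference.

```python
import json

# Hypothetical schema for a single tool call; field and tool names are illustrative only.
TOOL_CALL_SCHEMA = {
    "required": {"name": str, "context": dict, "arguments": dict},
    "allowed_names": {"search_notes", "create_note"},
}


def validate_tool_call(raw: str, schema: dict = TOOL_CALL_SCHEMA) -> list[str]:
    """Return a list of problems; an empty list means the call conforms to the schema."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["top-level value must be an object"]

    problems = []
    for key, expected_type in schema["required"].items():
        if key not in call:
            problems.append(f"missing field: {key}")
        elif not isinstance(call[key], expected_type):
            problems.append(f"wrong type for {key}: expected {expected_type.__name__}")
    # Only check the tool name when it is present and is a string.
    if isinstance(call.get("name"), str) and call["name"] not in schema["allowed_names"]:
        problems.append(f"unknown tool: {call['name']}")
    return problems


good = '{"name": "search_notes", "context": {"session": "s1"}, "arguments": {"query": "kto"}}'
bad = '{"name": "delete_everything", "arguments": {}}'
good_problems = validate_tool_call(good)
bad_problems = validate_tool_call(bad)
```

Because the validator is written before any training data exists, the same function can check handcrafted seeds, synthetic examples, and model outputs at evaluation time.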

2. SFT Teaches Format, KTO Teaches Judgment

SFT (Supervised Fine-Tuning) teaches the model WHAT to do — tool call syntax, output structure, response format. KTO (Kahneman-Tversky Optimization) teaches the model WHICH responses are better — preferring clarification over reckless action, preferring concise outlines over bloated ones. Never try to teach both at once.

3. Paired Contrastive Examples

For KTO, every good example needs a realistic bad counterpart using the SAME user request. The bad example should be a plausible mistake, not garbage — wrong tool selected, missing context fields, overly verbose outline. This is what teaches the model judgment.
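
One way to keep the pairing honest is to store each good/bad pair under a single prompt and expand it into labeled records afterwards. The sketch below assumes a record layout of `prompt`, `completion`, and a boolean `label`, which is a common shape for KTO-style data; the project's actual dataset format may differ, and `ContrastivePair` and `to_kto_records` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContrastivePair:
    prompt: str  # the SAME user request for both completions
    good: str    # desirable response
    bad: str     # plausible mistake, not garbage


def to_kto_records(pairs: list[ContrastivePair]) -> list[dict]:
    """Expand each pair into one desirable and one undesirable record sharing a prompt."""
    records = []
    for pair in pairs:
        if pair.good.strip() == pair.bad.strip():
            raise ValueError(f"good and bad completions are identical for: {pair.prompt!r}")
        records.append({"prompt": pair.prompt, "completion": pair.good, "label": True})
        records.append({"prompt": pair.prompt, "completion": pair.bad, "label": False})
    return records


pairs = [
    ContrastivePair(
        prompt="Delete my old notes",
        good="Which notes do you mean? I can list notes older than a date you choose.",
        bad='{"name": "delete_notes", "arguments": {"all": true}}',  # reckless action
    )
]
records = to_kto_records(pairs)
```

Storing pairs rather than loose records makes it hard to add a bad example whose prompt drifts away from its good counterpart.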

4. Validate Before You Train

Training on bad data is worse than not training at all. Every dataset goes through structural validation (schema checks) and quality validation (rubric scoring) before it touches a trainer.
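
The two-stage gate described above might look like the following sketch. The function name `gate_dataset`, the `0.7` threshold, and the toy checks in the usage section are assumptions for illustration; in practice the structural check and rubric scorer would come from the validation tooling documented in the other skills.

```python
from typing import Callable


def gate_dataset(
    examples: list[dict],
    structural_check: Callable[[dict], list[str]],
    rubric_score: Callable[[dict], float],
    min_score: float = 0.7,
) -> tuple[list[dict], list[tuple[dict, str]]]:
    """Split examples into (kept, rejected-with-reason) using structure first, then quality."""
    kept, rejected = [], []
    for example in examples:
        problems = structural_check(example)
        if problems:
            # Structural failures are cheap to detect, so they short-circuit rubric scoring.
            rejected.append((example, "structural: " + "; ".join(problems)))
            continue
        score = rubric_score(example)
        if score < min_score:
            rejected.append((example, f"rubric: {score:.2f} < {min_score:.2f}"))
            continue
        kept.append(example)
    return kept, rejected


# Usage with toy stand-ins for the real validators.
def toy_structural_check(example: dict) -> list[str]:
    return [] if "text" in example else ["missing text"]


def toy_rubric_score(example: dict) -> float:
    return min(len(example["text"]) / 10, 1.0)


examples = [{"text": "a long enough outline"}, {"text": "hi"}, {}]
kept, rejected = gate_dataset(examples, toy_structural_check, toy_rubric_score)
```

Keeping the rejection reasons alongside the rejected examples makes it easier to see whether the generator or the rubric needs adjusting.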

5. Iterate on Failures

After evaluation, the failure analysis tells you exactly what to generate next. If the model keeps producing empty memory fields, make more examples that demonstrate rich session memory. If outlines are too long, add negative examples of bloated outlines.
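
A simple way to turn failure analysis into a generation plan is to count failures by category and allocate new examples in proportion. This is a sketch under assumptions: the evaluation result shape (`passed`, `failure_tag`), the tag names, and `plan_targeted_generation` are hypothetical, not the evaluator's actual output format.

```python
from collections import Counter


def plan_targeted_generation(eval_results: list[dict], total_new_examples: int) -> dict[str, int]:
    """Allocate new training examples to failure categories in proportion to their frequency."""
    failures = Counter(r["failure_tag"] for r in eval_results if not r["passed"])
    total_failures = sum(failures.values())
    if total_failures == 0:
        return {}  # nothing failed, so there is nothing to target
    return {
        tag: max(1, round(total_new_examples * count / total_failures))
        for tag, count in failures.most_common()
    }


eval_results = [
    {"passed": False, "failure_tag": "empty_memory"},
    {"passed": False, "failure_tag": "empty_memory"},
    {"passed": False, "failure_tag": "empty_memory"},
    {"passed": False, "failure_tag": "outline_too_long"},
    {"passed": True},
    {"passed": True},
]
plan = plan_targeted_generation(eval_results, total_new_examples=100)
```

Each category in the plan then maps to a concrete data request, for example more examples with rich session memory for `empty_memory`, or more negative examples of bloated outlines for `outline_too_long`.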

Progressive Reference

Reference	When to Load
Tool Calling Pipeline	Understanding the full tools training journey — from schema to trained model
Essay Style Pipeline	Understanding the full essay training journey — from brainstorm to outline model
Agentic Search Pipeline	Understanding the full RAG agent training journey — from corpus to grounded-answer model
Pipeline Comparison	Side-by-side comparison of how the pipelines differ at each stage

Cross-References to Other Skills

At each stage of the pipeline, you'll use tools documented in the other skills:

Pipeline Stage	Skill to Reference
Generate data	`synthetic-data-generation`
Validate data	`synthetic-data-generation` (rubrics, validate command)
SFT / KTO / GRPO training	`fine-tuning`
Evaluate model	`evaluation`
Upload & deploy	`upload-deployment`

Tips

Read the tool-calling case study first: it's the simplest, most mechanical pipeline of the three
The essay case study shows how to adapt the pipeline for creative/subjective outputs
The agentic search case study shows how to train multi-step reasoning where tools are means to an end
All three pipelines use the same trainers, evaluator, and upload tools; only the dataset design and its validation differ
When planning a new capability, map it to whichever case study is closest, then adapt