dspy-advanced-workflow

name: dspy-advanced-workflow
description: Drive a complete DSPy 3.2.x project end-to-end (spec → program → metric → baseline → GEPA optimize → export → deploy). Orchestrates the other four DSPy skills (dspy-fundamentals, dspy-evaluation-harness, dspy-gepa-optimizer, dspy-rlm-module) in the correct order. Use this for any non-trivial DSPy build from scratch.
when_to_use: User wants to build, optimize, and ship a new DSPy pipeline; says "full workflow" / "end to end" / "from scratch"; or needs the standard loop applied to a greenfield task.

DSPy Advanced Workflow (2026)

This skill runs the seven-step loop that turns a natural-language task description into an optimized, saved, deployable DSPy program. Every step delegates to a specific skill — invoke them in order.

The seven steps

1. Spec

Restate the user's task in one sentence. Identify the inputs, the outputs, the quality axis that matters, and any constraints (latency, cost, tool access, context size). Then pick the predictor shape:

Task shape	Predictor
Single-step structured I/O	`dspy.Predict` / `dspy.ChainOfThought`
Tool use / multi-step	`dspy.ReAct`
Code execution	`dspy.ProgramOfThought`
Long context / codebase	`dspy.RLM` → `dspy-rlm-module`
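One way to make the spec concrete is to write it down as data before touching DSPy. The sketch below is an illustration only: TaskSpec, PREDICTOR_BY_SHAPE, and the shape keys are hypothetical names, not part of DSPy, and the example task is invented. The mapping simply restates the table above.

```python
from dataclasses import dataclass, field

# Mapping restated from the table above: task shape -> suggested predictor.
# The shape keys are hypothetical labels chosen for this sketch.
PREDICTOR_BY_SHAPE = {
    "single_step": "dspy.ChainOfThought",  # or dspy.Predict when no reasoning is needed
    "tool_use": "dspy.ReAct",
    "code_execution": "dspy.ProgramOfThought",
    "long_context": "dspy.RLM",
}


@dataclass
class TaskSpec:
    """Hypothetical container for the step-1 spec; not a DSPy class."""

    summary: str        # the one-sentence restatement of the task
    inputs: list[str]
    outputs: list[str]
    quality_axis: str   # the single quality dimension the metric will score
    shape: str          # one of the keys in PREDICTOR_BY_SHAPE
    constraints: dict[str, str] = field(default_factory=dict)

    def predictor(self) -> str:
        """Return the predictor name the table suggests for this task shape."""
        try:
            return PREDICTOR_BY_SHAPE[self.shape]
        except KeyError:
            raise ValueError(f"unknown task shape: {self.shape!r}") from None


spec = TaskSpec(
    summary="Answer a support question from a single FAQ passage.",
    inputs=["question", "passage"],
    outputs=["answer"],
    quality_axis="answer correctness",
    shape="single_step",
    constraints={"latency": "under 2 s"},
)
print(spec.predictor())
```

Writing the spec this way keeps the inputs, outputs, and constraints in one reviewable place, and the field names carry over directly to the Signature in step 2.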

2. Program

Write the typed dspy.Signature and a dspy.Module subclass, following dspy-fundamentals. Do not hard-code prompts. Assign each predictor to a named attribute (for example self.step) so GEPA can target it by name.

3. Data

Build a trainset and a separate valset as lists of dspy.Example(...).with_inputs(...). For GEPA, maximize the trainset size and keep the valset just large enough to represent downstream behavior. Keep a held-out testset and report on it only once, at the end. See dspy-evaluation-harness.
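The three-way split can be sketched in plain Python as below. The helper name split_examples, the row layout, and the sizes are assumptions for illustration; in a real pipeline each row would then be wrapped as dspy.Example(**row).with_inputs("question").

```python
import random


def split_examples(examples, val_size, test_size, seed=0):
    """Shuffle once with a fixed seed, carve out test and validation, keep the rest for training."""
    if val_size + test_size >= len(examples):
        raise ValueError("val_size + test_size must leave at least one training example")
    shuffled = list(examples)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # local RNG: reproducible, no global state
    testset = shuffled[:test_size]
    valset = shuffled[test_size:test_size + val_size]
    trainset = shuffled[test_size + val_size:]
    return trainset, valset, testset


# Hypothetical raw rows standing in for real labeled data.
rows = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(100)]
trainset, valset, testset = split_examples(rows, val_size=20, test_size=20)
print(len(trainset), len(valset), len(testset))
```

Fixing the seed keeps the split stable across runs, so baseline and optimized scores are measured on the same valset.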

4. Rich metric

Write rich_metric(gold, pred, trace=None, pred_name=None, pred_trace=None) returning dspy.Prediction(score=0..1, feedback="natural-language critique"). The feedback is load-bearing — it's what GEPA's reflection LM learns from. A dict with the same fields crashes dspy.Evaluate; only dspy.Prediction aggregates correctly. See dspy-evaluation-harness.
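To show what "rich" feedback can look like, here is a minimal sketch of the scoring logic only, written without DSPy so it runs standalone. The function name score_and_feedback and the token-overlap F1 scoring are assumptions chosen for illustration, not the skill's prescribed metric.

```python
def score_and_feedback(gold_answer: str, pred_answer: str) -> tuple[float, str]:
    """Token-overlap F1 in [0, 1] plus a critique naming what is missing or extra."""
    gold = set(gold_answer.lower().split())
    pred = set(pred_answer.lower().split())
    if not gold or not pred:
        return 0.0, "Gold or predicted answer is empty; produce a non-empty answer."
    overlap = gold & pred
    if not overlap:
        return 0.0, f"No overlap with the expected answer. Expected terms: {sorted(gold)}."
    precision = len(overlap) / len(pred)
    recall = len(overlap) / len(gold)
    score = 2 * precision * recall / (precision + recall)
    notes = []
    if gold - pred:
        notes.append(f"Missing expected terms: {sorted(gold - pred)}.")
    if pred - gold:
        notes.append(f"Unsupported extra terms: {sorted(pred - gold)}.")
    feedback = " ".join(notes) or "Answer matches the expected terms."
    return score, feedback


score, feedback = score_and_feedback("paris is the capital", "the capital is lyon")
print(round(score, 2), feedback)
```

Inside rich_metric, the pair would be returned as dspy.Prediction(score=score, feedback=feedback). The point of the sketch is that the feedback string names concrete defects instead of repeating the score.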

5. Baseline

evaluator = dspy.Evaluate(devset=valset, metric=rich_metric,
                          num_threads=8, display_progress=True,
                          provide_traceback=True,
                          save_as_json="runs/baseline.json")
baseline = evaluator(program)
print("Baseline:", baseline.score)

6. GEPA optimize

reflection_lm = dspy.LM("openai/gpt-5", temperature=1.0, max_tokens=32000)
optimizer = dspy.GEPA(
    metric=rich_metric,
    auto="medium",
    reflection_lm=reflection_lm,
    candidate_selection_strategy="pareto",
    track_stats=True,
    track_best_outputs=True,
    log_dir="./gepa_logs",
    num_threads=8,
    seed=0,
)
optimized = optimizer.compile(student=program, trainset=trainset, valset=valset)
print("Optimized:", evaluator(optimized).score)

Run auto="light" first as a sanity check; move to auto="medium"/"heavy" for the final run. See dspy-gepa-optimizer.

If you need a deliberate multi-stage compile loop, DSPy 3.2.x also exposes dspy.BetterTogether(metric=..., bootstrap=..., gepa=...) for chaining named optimizers after you have a clean baseline GEPA setup.

7. Export & deploy

optimized.save("artifacts/program.json", save_program=False)     # state, portable
# or for full deployment artifact:
optimized.save("artifacts/program_dir/", save_program=True)

Deploy:

Load with dspy.load("artifacts/program_dir/"), or reconstruct the module and call .load("artifacts/program.json") on it.
Wrap in FastAPI/CLI.
Enable track_usage=True in dspy.configure(...) for token-usage (and therefore cost) observability.
Log with MLflow (mlflow.dspy.autolog()) or W&B in CI.
Keep an offline regression test that runs the evaluator against the saved program and fails CI below a threshold.
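The regression gate in the last item can be sketched as a small check over a saved score. The report layout used here (a JSON object with a top-level "score" key) is a hypothetical format for this sketch, not the layout dspy.Evaluate writes, and check_regression is an illustrative name.

```python
import json
import tempfile
from pathlib import Path


def check_regression(report_path, threshold):
    """Return (passed, score) for a saved evaluation report.

    Assumes a hypothetical report layout: a JSON object with a top-level "score" key.
    """
    report = json.loads(Path(report_path).read_text())
    score = float(report["score"])
    return score >= threshold, score


# Demo with a throwaway report file standing in for a real evaluation run.
with tempfile.TemporaryDirectory() as tmp:
    report_file = Path(tmp) / "regression.json"
    report_file.write_text(json.dumps({"score": 71.5}))
    passed, score = check_regression(report_file, threshold=70.0)
    print("passed" if passed else "FAILED", score)
```

In CI, the caller would exit non-zero when passed is False. The threshold must be on the same scale as the stored score.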

Full orchestration template

"""DSPy end-to-end pipeline — spec → optimize → deploy."""

import dspy
from pathlib import Path

# ----- 1–2. Spec & program (dspy-fundamentals) -----
class MyTask(dspy.Signature):
    """<one-line instruction from the spec>."""
    input_field: str = dspy.InputField()
    output_field: str = dspy.OutputField()

class MyProgram(dspy.Module):
    def __init__(self):
        super().__init__()
        self.step = dspy.ChainOfThought(MyTask)
    def forward(self, **kw):
        return self.step(**kw)

# ----- 3. Data (dspy-evaluation-harness) -----
trainset = [...]   # list[dspy.Example(...).with_inputs(...)]
valset   = [...]

# ----- 4. Rich metric (dspy-evaluation-harness) -----
def rich_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    score = ...          # compute 0..1
    feedback = ...       # detailed critique
    return dspy.Prediction(score=score, feedback=feedback)  # NOT a dict

# ----- 5. Baseline -----
dspy.configure(lm=dspy.LM("openai/gpt-4o"), track_usage=True)
Path("runs").mkdir(exist_ok=True)   # make sure the save_as_json target directory exists
evaluator = dspy.Evaluate(devset=valset, metric=rich_metric, num_threads=8,
                          display_progress=True, provide_traceback=True,
                          save_as_json="runs/baseline.json")
program = MyProgram()
print("Baseline:", evaluator(program).score)

# ----- 6. GEPA optimize (dspy-gepa-optimizer) -----
optimizer = dspy.GEPA(
    metric=rich_metric,
    auto="medium",
    reflection_lm=dspy.LM("openai/gpt-5", temperature=1.0, max_tokens=32000),
    candidate_selection_strategy="pareto",
    track_stats=True, track_best_outputs=True,
    log_dir="./gepa_logs", num_threads=8, seed=0,
)
optimized = optimizer.compile(student=program, trainset=trainset, valset=valset)
print("Optimized:", evaluator(optimized).score)

# ----- 7. Export (dspy-fundamentals) -----
Path("artifacts").mkdir(exist_ok=True)
optimized.save("artifacts/program.json", save_program=False)

Guardrails

Never skip step 4 (rich metric). GEPA without feedback ≈ random search.
Always baseline before optimizing — no baseline, no claim.
Save both pre- and post-optimization metrics to JSON for auditability.
If the held-out test score drops after optimization, the valset is likely too narrow and the program has overfit to it. Expand the valset and re-run.
Freeze the optimized program with module._compiled = True before multi-stage re-compilation.
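For the auditability guardrail, one possible shape is a single JSON record holding both scores and their delta. The helper write_audit_record, the field names, and the example numbers are all hypothetical; this is a sketch of the bookkeeping, not a DSPy feature.

```python
import json
import tempfile
from pathlib import Path


def write_audit_record(path, baseline_score, optimized_score, test_score=None):
    """Write pre- and post-optimization scores, plus their delta, to a JSON file."""
    record = {
        "baseline_score": baseline_score,
        "optimized_score": optimized_score,
        "delta": optimized_score - baseline_score,
        "test_score": test_score,  # held-out score, filled in at the end only
    }
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create runs/ if it is missing
    path.write_text(json.dumps(record, indent=2))
    return record


# Demo in a temporary directory with made-up scores.
with tempfile.TemporaryDirectory() as tmp:
    record = write_audit_record(Path(tmp) / "runs" / "audit.json", 62.0, 74.5)
    print(record["delta"])
```

Keeping both numbers in one file makes the "no baseline, no claim" rule checkable after the fact.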