name: agentsop-prompt-compilation
version: 0.1.0
phase: D
tier: core
frequency: medium
status: opinionated
layer: enhance-overlay
decision_layer_only: true
defers_implementation_to: ["dspy", "dspy-sop", "metric-design"]
description: The compile-readiness gate for prompt auto-optimization. Decide whether you have earned the right to run an optimizer (DSPy MIPROv2 / GEPA / BootstrapFewShot) before spending compute. Exactly two preconditions (a real metric, and enough examples for the optimizer you picked). Garbage metric in, garbage prompt out. Pick the optimizer by data scale; GEPA inverts the scale assumption (~10 examples + textual feedback).
prompt-compilation — The Compile-Readiness Gate
"It's unproductive to launch optimization runs using a poorly designed program or a bad metric." — DSPy core team [dspy.ai/learn/optimization/overview/]
"Compile when you can measure. The optimizer maximizes your metric — garbage metric in, garbage prompt out." — this skill's operating principle (synthesized from the line above + DSPy Case C)
This is an enhancement-overlay decision skill. It answers exactly one question the broad [[dspy]] library
skill buries under API surface: have you earned the right to run an optimizer yet, and which one? It produces a
go / no-go gate plus an optimizer pick. It defers every implementation detail — Signature syntax, module choice,
compile() calls, save/deploy — to [[dspy]] and the full workflow in [[agentsop-dspy]]. It defers metric construction
to [[agentsop-metric-design]]; this skill only checks the metric exists and is validated, then uses it as the gate.
The trap it removes: people reach for MIPROv2(auto="heavy") because the API is right there, before they have a
metric worth maximizing or enough data to avoid memorization. Compilation is a hyperparameter search costing
hundreds-to-thousands of LM calls ($2–$40+, minutes-to-hours) [dspy.ai/faqs/]. Spending that on an un-validated
metric or 8 examples is pure waste.
1. When to Activate
Activate when all three of these are plausibly true (the gate then confirms them):
- A hand-tuned prompt has plateaued. The team has manually iterated few-shot examples and wording past the point of obvious returns. Symptom from [[agentsop-dspy]] §1: "the team manually tunes few-shot examples; a metric exists but isn't being used to drive prompt design."
- A metric exists (or can be built). There is a metric(example, pred) -> bool|float, or one can be written and human-validated. Without this, do not activate; the optimizer has nothing to maximize.
- Labeled examples exist. There is a dev set. The count determines which optimizer is even legal (§3, §4.2).
Concrete triggers in intent or codebase:
| Trigger | Signal |
|---|---|
| Spend intent | "auto-tune this prompt", "should I run MIPRO?", "is it worth compiling?", "GEPA vs MIPROv2", "optimize prompts for our metric" |
| API reach | MIPROv2(, BootstrapFewShot(, dspy.GEPA(, teleprompter, optimizer.compile( about to be called |
| Symptom | hand-tuned prompt stuck; few-shot examples curated by hand; metric written but only used for reporting, not optimization |
Do NOT activate when:
- The task is one-shot or the Signature/I-O contract is still churning daily — compile only after it stabilizes [dspy.ai/learn/optimization/overview/]; otherwise you pay compile cost for prompts you'll throw away.
- No metric is possible and none will be built: then this is verbose prompting, not compilation. Route to [[agentsop-metric-design]] first; if the user refuses any success criterion, the gate stays closed.
- Compliance requires verbatim human-authored prompts: optimized prompts are machine-generated artifacts.
- You only need parse safety (a typed Signature), not quality optimization: that is [[agentsop-dspy]] Stage 1 and the signature-design overlay. Promoting prose to a typed Signature and compiling it are two different gates.
2. Core Mental Model
"Compile when you can measure. The optimizer maximizes your metric — garbage metric in, garbage prompt out."
An optimizer (MIPROv2, GEPA, BootstrapFewShot) is a black-box search over prompt instructions + few-shot demos that
maximizes metric(example, pred). It has no taste. It will faithfully chase whatever the metric rewards, biases
and all. From [[agentsop-dspy]] Case C: "DSPy will optimize toward whatever the metric rewards. A bad metric becomes a
bad program at scale." Two corollaries make this a gate, not a step:
1. The metric is the precondition, not a tunable. Before any compute is spent, the metric must (a) exist and (b) agree with human judgment on ≥20 spot-checks. An un-validated metric means the expensive search optimizes the metric's blind spot. This is non-negotiable [dspy.ai/learn/evaluation/metrics/; Case C]. Construction is [[agentsop-metric-design]]'s job; this skill only checks the receipt.
2. Data scale is a hard floor, not a preference. Below the optimizer's example floor you are not training, you are memorizing. The DSPy 20/80 train/val split exists because "prompt-based optimizers often overfit to small training sets" [dspy.ai/learn/optimization/overview/]. The floor differs per optimizer (§4.2).
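The 20/80 split mentioned in the second corollary can be sketched in a few lines. This is a minimal illustration, not DSPy's own splitting utility: the function name `split_train_val`, the fixed seed, and the placeholder example strings are all assumptions made for the sketch.

```python
import random


def split_train_val(examples, train_fraction=0.2, seed=0):
    """Shuffle once with a fixed seed, then cut a small train slice and a large val slice."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Assumption: round to the nearest example and keep at least one for training.
    n_train = max(1, round(len(shuffled) * train_fraction))
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical dataset of 250 labeled examples.
examples = [f"example-{i}" for i in range(250)]
trainset, valset = split_train_val(examples)
print(len(trainset), len(valset))
```

The deliberately small train slice leaves most of the data for validation, which is where overfitting to a small training set would show up.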
The two-gate picture
┌──────────────────────────────────────────────┐
Gate 1 │ METRIC: exists? AND human-validated ≥20? │ ── No ─► STOP. Build/validate metric ([[agentsop-metric-design]]).
(measure) └──────────────────────────────────────────────┘
│ Yes
▼
┌──────────────────────────────────────────────┐
Gate 2 │ EXAMPLES: ≥ floor for the optimizer I want? │ ── No ─► Pick a lower-floor optimizer, collect data,
(data) └──────────────────────────────────────────────┘ or STOP (use LabeledFewShot as a floor).
│ Yes
▼
COMPILE (cheap probe first: auto="light")
The cost reality (why the gate is worth having)
Compilation runs hundreds-to-thousands of LM calls. Reference run: ~3.2k API calls, $2–$3, 6–20 min for
auto="light"; auto="heavy" on 1000+ examples can hit tens of dollars and hours [dspy.ai/faqs/]. Cost scales with
num_trials × |trainset| × |program LM calls|. The gate's entire job is to stop you spending that on a metric or
dataset that cannot pay it back.
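As a rough illustration of the scaling rule above, the sketch below turns num_trials × |trainset| × calls into a call count and a dollar figure. Treating the proportionality as an estimate is an assumption, and the function names, token count, and price are hypothetical values chosen for the example, not figures from the DSPy docs.

```python
def estimate_lm_calls(num_trials: int, trainset_size: int, calls_per_example: int) -> int:
    """Rough call count: num_trials x |trainset| x LM calls per program run."""
    return num_trials * trainset_size * calls_per_example


def estimate_cost_usd(lm_calls: int, avg_tokens_per_call: int, usd_per_million_tokens: float) -> float:
    """Convert a call count into dollars under an assumed flat token price."""
    return lm_calls * avg_tokens_per_call * usd_per_million_tokens / 1_000_000


# Hypothetical probe: 10 trials, 50 training examples, a 3-call pipeline.
calls = estimate_lm_calls(num_trials=10, trainset_size=50, calls_per_example=3)
# Hypothetical pricing: 1,000 tokens per call at $0.60 per million tokens.
cost = estimate_cost_usd(calls, avg_tokens_per_call=1_000, usd_per_million_tokens=0.60)
print(calls, round(cost, 2))
```

An estimate like this is only a sanity check before launch; actual spend should still be read from usage tracking after the run.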
3. SOP (Standard Operating Procedure)
0. Confirm activation (§1): plateaued hand-tuning + metric + examples.
1. GATE 1 — metric exists AND validated ≥20 human spot-checks? [hard]
2. GATE 2 — example count ≥ floor for the chosen optimizer? [hard]
3. PICK — choose optimizer by data scale + feedback signal. (§4.2)
4. BUDGET — estimate cost; cap it; choose a cheap optimizer LM.
5. PROBE — run auto="light" (or smallest config) as a signal.
6. DECIDE — gain ≥ threshold → escalate; else go back, don't escalate.
Gate 1 — Metric (the "can you measure" gate)
- Is there a metric(example, pred, trace=None) -> bool|float? If not → STOP, route to [[agentsop-metric-design]].
- Has it been validated against a human on ≥20 spot-checks (≥30 if open-ended), with ≥80% agreement? If not → STOP and validate first. "Never compile against a metric you haven't human-validated on ≥20 spot-checks" [[[agentsop-dspy]] Case C; dspy.ai/learn/evaluation/metrics/].
- Is the metric a single holistic LLM-judge? Then it inherits length / self-preference / position bias and the optimizer will chase those biases. Decompose it ([[agentsop-metric-design]]) before compiling [arxiv.org/pdf/2506.02592].
Exit: a validated metric callable + a calibration receipt. Otherwise the gate is closed; do not proceed.
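The validation thresholds in Gate 1 can be checked mechanically once the spot-checks exist. The sketch below assumes the metric and the human each produce one boolean verdict per spot-checked example; the function names and the sample verdicts are hypothetical.

```python
def agreement_rate(metric_verdicts, human_verdicts):
    """Fraction of spot-checks where the metric and the human gave the same verdict."""
    if len(metric_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must have the same length")
    if not human_verdicts:
        raise ValueError("no spot-checks supplied")
    matches = sum(m == h for m, h in zip(metric_verdicts, human_verdicts))
    return matches / len(human_verdicts)


def gate1_passes(metric_verdicts, human_verdicts, open_ended=False, min_agreement=0.80):
    """Gate 1: at least 20 spot-checks (30 if open-ended) at 80% agreement or better."""
    min_checks = 30 if open_ended else 20
    if len(human_verdicts) < min_checks:
        return False
    return agreement_rate(metric_verdicts, human_verdicts) >= min_agreement


# Hypothetical spot-check: 20 examples, the metric disagrees with the human on 3.
human = [True] * 20
metric = [True] * 17 + [False] * 3
print(gate1_passes(metric, human))
```

The same 20 verdicts fail the gate when the task is flagged open-ended, because the spot-check minimum rises to 30.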
Gate 2 — Examples (the "enough data" gate)
Count labeled examples. The DSPy-documented thresholds: ≥30 = minimum useful, ~300 = recommended, 200+ required for MIPROv2 to avoid overfitting [dspy.ai/learn/optimization/overview/]. The floor is per optimizer:
- < 10 → only LabeledFewShot(k=8) (no search, weakest). Usually means STOP, collect data.
- ~10+ with textual feedback → GEPA is legal and sample-efficient (this inverts the usual "more data" rule).
- ~30–50 → BootstrapFewShot / BootstrapFewShotWithRandomSearch.
- 200+ → MIPROv2.
Exit: the example count clears the floor of the optimizer you intend to run. If not, either drop to a lower-floor optimizer or stop.
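The per-optimizer floors can be expressed as a small lookup. This is a sketch under stated assumptions: the approximate floors in the text ("~10", "~30–50") are pinned to the hard numbers 10 and 30 for illustration, and `legal_optimizers` is a hypothetical helper, not part of DSPy.

```python
def legal_optimizers(n_examples: int, has_textual_feedback: bool) -> list:
    """Return the optimizers whose example floor is cleared, weakest first."""
    legal = ["LabeledFewShot"]  # no search, so it has no floor
    if has_textual_feedback and n_examples >= 10:  # assumption: "~10" read as 10
        legal.append("GEPA")
    if n_examples >= 30:  # assumption: lower end of the "~30-50" range
        legal.append("BootstrapFewShot")
    if n_examples >= 200:
        legal.append("MIPROv2")
    return legal


print(legal_optimizers(12, has_textual_feedback=True))
print(legal_optimizers(250, has_textual_feedback=False))
```

If the optimizer you intended to run is missing from the returned list, Gate 2 is closed for that optimizer.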
Step 3 — Pick optimizer by data scale + signal
Use the table in §4.2. The single most important branch: do you have textual error feedback (test diffs, schema
violations, judge rationales like "answer was verbose")? If yes, GEPA needs only ~10 examples and converges faster
because it reflects on the text of the feedback, not just a scalar [dspy.ai/tutorials/gepa_ai_program/;
arxiv.org/abs/2507.19457]. If no, fall back to the data-scale ladder.
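To make "textual error feedback" concrete, here is a minimal sketch of a metric that returns a score together with a description of what went wrong. A plain dict stands in for the score-plus-feedback object so the example runs without DSPy installed; the function name and the comparison logic are hypothetical, and the exact metric signature GEPA expects should be taken from the GEPA documentation rather than from this sketch.

```python
def metric_with_feedback(expected, actual):
    """Score an output and explain any failure in text, not just as a scalar."""
    if actual == expected:
        return {"score": 1.0, "feedback": "output matches the expected value"}
    return {
        "score": 0.0,
        "feedback": f"failing assert: expected {expected!r} got {actual!r}",
    }


result = metric_with_feedback(expected="4", actual="5")
print(result["score"], result["feedback"])
```

A scalar-only metric would return just 0.0 here; the feedback string is the extra signal a reflective optimizer can read.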
Step 4 — Budget
Estimate before launching (§4.4). Cap spend. Use a cheap optimizer LM (e.g. gpt-4o-mini) even when the task LM
is expensive — community-reported parity at a fraction of cost [github.com/stanfordnlp/dspy/issues/1596]. Set
dspy.configure(track_usage=True) to log actual spend.
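One way to make "cap spend" enforceable is a small guard that is charged after each run and raises once the ceiling is passed. This is a hypothetical helper, not a DSPy feature: the class names, the ceiling, and the per-run cost are assumptions, and the charged amounts would have to come from whatever usage logging you have enabled.

```python
class BudgetExceeded(RuntimeError):
    """Raised when recorded spend passes the ceiling."""


class BudgetGuard:
    def __init__(self, ceiling_usd: float) -> None:
        self.ceiling_usd = ceiling_usd
        self.spent_usd = 0.0

    def charge(self, usd: float) -> float:
        """Record spend and return the remaining budget; raise once over the ceiling."""
        self.spent_usd += usd
        if self.spent_usd > self.ceiling_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} of ${self.ceiling_usd:.2f}"
            )
        return self.ceiling_usd - self.spent_usd


# Hypothetical ceiling of $5 and a first probe run that cost $2.50.
guard = BudgetGuard(ceiling_usd=5.00)
remaining = guard.charge(2.50)
print(remaining)
```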
Step 5 — Probe cheap first
Run auto="light" (MIPROv2) or the smallest config first. Docs: "start with moderate values, observe behavior, and
scale up only if you see clear gains" [github.com/stanfordnlp/dspy/issues/1596]. Never start at auto="heavy".
Step 6 — Decide on the probe result
- light gives <2% lift → do not escalate. The bottleneck is the program/metric, not the optimizer. Loop back to [[agentsop-dspy]] Stage 1 (signature ambiguous? wrong decomposition?) [dspy.ai/learn/optimization/overview/].
- light gives 2–10% lift → escalate to medium; go to heavy only if data ≥300 and you have a held-out test set distinct from val.
- Always confirm the final gain on a held-out test set, not the val set used during optimization.
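The decision rule for the probe result can be written as a single function. Two assumptions are baked into this sketch: lift is measured in absolute percentage points on the metric, and any lift of 2 points or more (including above 10, which the text does not cover) is treated as grounds to escalate to medium. The function name and return labels are hypothetical.

```python
def decide_after_probe(baseline_pct, probe_pct, n_examples, has_held_out_test):
    """Map a light-probe result to the next action."""
    lift_points = probe_pct - baseline_pct
    if lift_points < 2.0:
        # Bottleneck is the program or metric, not the optimizer.
        return "return-to-program"
    if n_examples >= 300 and has_held_out_test:
        # medium first; heavy is permitted afterwards.
        return "escalate-medium-heavy-eligible"
    return "escalate-medium"


# Hypothetical probe results against a 72% hand-tuned baseline.
print(decide_after_probe(72.0, 73.0, n_examples=250, has_held_out_test=True))
print(decide_after_probe(72.0, 76.0, n_examples=250, has_held_out_test=True))
```

Whatever the function returns, the final gain still has to be confirmed on a held-out test set that was not used during optimization.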
4. Operations
4.1 — Operation registry (Trigger / Action / Output / Evidence)
OP-1 — ReadinessGate (entry, hard gate)
- Trigger: Anyone about to call an optimizer's compile().
- Action: Run Gate 1 (metric exists + validated ≥20) then Gate 2 (examples ≥ optimizer floor). Any failure → STOP.
- Output: Decision token compile-ready | not-ready + the failing gate.
- Evidence: [dspy.ai/learn/optimization/overview/]; [[agentsop-dspy]] Case C, §3 three-stage gate.
OP-2 — MetricExistsCheck
- Trigger: Gate 1.
- Action: Confirm a metric(example, pred, trace=None) -> bool|float exists. If absent → route to [[agentsop-metric-design]], gate stays closed.
- Output: Metric callable or a STOP.
- Evidence: Anti-pattern "compiling without a metric" [dspy.ai/learn/optimization/overview/].
OP-3 — MetricValidatedCheck
- Trigger: Metric exists (OP-2).
- Action: Require ≥20 (open-ended ≥30) human spot-checks at ≥80% agreement, plus a calibration receipt. If a single holistic LLM-judge, require decomposition first.
- Output: validated flag + receipt reference.
- Evidence: [[agentsop-dspy]] Case C step 4; [dspy.ai/learn/evaluation/metrics/]; [arxiv.org/pdf/2506.02592].
OP-4 — ExampleFloorCheck
- Trigger: Gate 2.
- Action: Count labeled examples; compare to the floor of the intended optimizer (LabeledFewShot any; GEPA ~10+feedback; Bootstrap ~30–50; MIPROv2 200+).
- Output: cleared | below-floor + the legal optimizer set.
- Evidence: [dspy.ai/learn/optimization/overview/] (30 min / 300 rec / 200+ MIPROv2).
OP-5 — OptimizerByDataScale
- Trigger: Both gates passed.
- Action: Pick from the §4.2 table by example count and feedback signal. Prefer GEPA when textual feedback exists.
- Output: Optimizer choice + config.
- Evidence: [dspy.ai/learn/optimization/optimizers/]; [dspy.ai/api/optimizers/GEPA/overview/].
OP-6 — CostBudgetGuard
- Trigger: Before launching compile.
- Action: Estimate num_trials × |trainset| × calls; cap spend; set a cheap optimizer LM; enable track_usage=True.
- Output: A budget ceiling + chosen optimizer LM.
- Evidence: [dspy.ai/faqs/]; [github.com/stanfordnlp/dspy/issues/1596].
OP-7 — CheapProbeFirst
- Trigger: Compile-ready, budget set.
- Action: Run auto="light" (or smallest config) first. Never start at heavy.
- Output: A cheap lift signal.
- Evidence: [github.com/stanfordnlp/dspy/issues/1596]; [[agentsop-dspy]] Case A.
OP-8 — EscalateOrReturn
- Trigger: Probe finished.
- Action: <2% lift → return to program/metric design (do NOT escalate). 2–10% → medium. heavy only if data ≥300 + held-out test set. Confirm gain on held-out test.
- Output: Escalate / iterate-back decision.
- Evidence: [[agentsop-dspy]] Case A; [dspy.ai/learn/optimization/overview/].
4.2 — Optimizer-by-data-scale table
| Examples | Feedback signal | Optimizer | Why / floor |
|---|---|---|---|
| <10 | any | LabeledFewShot(k=8) | No search; weakest. Usually means STOP and collect data. [dspy.ai/cheatsheet/] |
| ~10+ | textual (diffs, schema violations, judge rationales) | dspy.GEPA(metric=m_with_feedback) | Inverts the data-scale rule: reflection on text feedback is sample-efficient. [arxiv.org/abs/2507.19457] |
| ~30–50 | scalar | BootstrapFewShot / …WithRandomSearch | Self-bootstrapped demos; minimum useful regime. [dspy.ai/learn/optimization/optimizers/] |
| 200+ | scalar | MIPROv2(metric=m, auto="light") then escalate | Joint instruction + demo Bayesian search; 200 is the documented floor. [dspy.ai/api/optimizers/MIPROv2/] |
| any (post-MIPRO, ship smaller model) | n/a | chain BootstrapFinetune(student=small, teacher=optimized) | Distills prompts into weights. [dspy.ai/api/optimizers/BootstrapFinetune/] |
4.3 — Readiness checklist (copy/paste)
GATE 1 — MEASURE
[ ] metric(example, pred, trace=None) -> bool|float exists
[ ] validated vs human: >=20 spot-checks (>=30 open-ended), >=80% agreement
[ ] receipt saved {n_spot_checks, agreement, judge_model, task_model, date}
[ ] NOT a lone holistic LLM-judge (else decompose via [[agentsop-metric-design]] first)
GATE 2 — DATA
[ ] labeled examples counted: N = ____
[ ] N clears the floor of the optimizer chosen below
[ ] (prompt-based optimizers) 20/80 train/val split planned; GEPA uses standard split
PICK + BUDGET
[ ] optimizer chosen by §4.2 (GEPA if textual feedback)
[ ] cost ceiling set; cheap optimizer LM chosen; track_usage=True
[ ] plan: probe auto="light" first, escalate only on >=2% lift, confirm on held-out test
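The receipt line in the checklist names five fields; one way to keep them together is a small record that serializes to JSON. This is a sketch only: the field names come from the checklist, but the `CalibrationReceipt` class, the storage format, and every value shown are hypothetical placeholders.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CalibrationReceipt:
    n_spot_checks: int
    agreement: float   # fraction of spot-checks where metric and human agreed
    judge_model: str
    task_model: str
    date: str          # assumption: stored as an ISO 8601 date string


# Placeholder values for illustration only.
receipt = CalibrationReceipt(
    n_spot_checks=24,
    agreement=0.875,
    judge_model="example-judge-model",
    task_model="example-task-model",
    date="2025-01-01",
)

# Round-trip through JSON so the receipt can be saved next to the compiled program.
serialized = json.dumps(asdict(receipt), sort_keys=True)
restored = CalibrationReceipt(**json.loads(serialized))
print(restored == receipt)
```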
4.4 — Cost budget heuristics
| Config | Rough cost | Use when |
|---|---|---|
| LabeledFewShot / BootstrapFewShot | cents–~$1 | Floor; ≤50 examples |
| MIPROv2(auto="light") | ~$2–3, 6–20 min, ~3.2k calls | First probe, always |
| MIPROv2(auto="medium") | single digits to low tens of $ | Light showed ≥2% lift |
| MIPROv2(auto="heavy") | tens of $, hours | Only if data ≥300 + held-out test + budget |
| GEPA (~10+ examples) | low, sample-efficient | Textual feedback available |
Source: [dspy.ai/faqs/]; [github.com/stanfordnlp/dspy/issues/1596]; [arxiv.org/abs/2507.19457].
5. Dilemma Cases
Dilemma 1 — Optimizer cost vs gain: is compiling even worth it?
Dilemma: A 3-stage RAG pipeline already hits 72% on dev after manual prompt tuning. MIPROv2(auto="heavy") would
cost ~$40 and 4 hours. Worth it? (Adapted from [[agentsop-dspy]] Case A.)
Constraints:
- 250 labeled examples (above the MIPROv2 200-floor) [dspy.ai/learn/optimization/optimizers/].
- Prompts already hand-iterated — diminishing returns suspected.
- Pipeline LM = GPT-4o; per-call cost compounds at trial scale.
Decision steps:
- Gate 1 first: were the prompts ever scored against the metric, or just eyeballed? If only eyeballed, even auto="light" (~$2) often yields large lifts over hand-tuned baselines (paper reports 25%/65% over standard few-shot) [arxiv.org/abs/2310.03714], because the metric was never actually driving design.
- Probe light (~$2), never heavy (OP-7). It is a cheap signal for whether more compute helps.
- <2% lift → do NOT escalate to heavy. Return to program/metric: signature ambiguous? is 3-stage the right decomposition? (OP-8) [github.com/stanfordnlp/dspy/issues/1596].
- 2–10% lift → medium; heavy only if data ≥300 + a held-out test set distinct from val.
- Use gpt-4o-mini as the optimizer LM even though the task LM is gpt-4o (OP-6): community-reported parity at a fraction of the cost.
Outcome: The $40 heavy run is almost never the right first move. The $2 probe tells you whether to spend more or
to go fix the program. Often the answer is "fix the program/metric first."
Extractable operations: OP-1, OP-6 CostBudgetGuard, OP-7 CheapProbeFirst, OP-8 EscalateOrReturn. Lesson: never
open at heavy. Probe light with a cheap optimizer LM, and treat <2% as a signal to fix the program, not to add
compute.
Dilemma 2 — Only 12 examples: GEPA inverts the data-scale assumption
Dilemma: A code-fix agent has just 12 labeled examples, but each failing run produces rich textual feedback: the failing test diff, a schema-violation message, a linter error. The data-scale ladder says 12 < 30 → "STOP, collect data, you can't optimize." Is that right?
Constraints:
- 12 examples is below the BootstrapFewShot (~30) and MIPROv2 (200) floors.
- Collecting 200+ labeled examples is weeks of effort.
- But the failures emit machine-readable text feedback, not just a pass/fail scalar.
Decision steps:
- Gate 1: a metric exists, namely pass/fail on the test plus a textual feedback string. Validate it (the test is ground truth; spot-check that the feedback strings are accurate). Pass.
- Gate 2: do NOT apply the scalar-data ladder. The presence of textual feedback changes which optimizer is legal. GEPA needs only ~10 examples because it reflects on the content of the feedback, not a scalar gradient [dspy.ai/tutorials/gepa_ai_program/; arxiv.org/abs/2507.19457]. 12 ≥ ~10 → GEPA is legal.
- Pick GEPA (OP-5): return dspy.Prediction(score=..., feedback="failing assert: expected X got Y") from the metric; this is GEPA's superpower [dspy.ai/api/optimizers/GEPA/overview/].
- Probe cheap (OP-7), confirm lift on the held-out examples (OP-8).
Outcome: A dataset that is far too small for MIPROv2/Bootstrap is sufficient for GEPA. The "you need 200 examples" intuition is specific to scalar-feedback optimizers; textual feedback buys an order of magnitude in sample efficiency. GEPA reported beating MIPROv2 by 10–13% on benchmarks while being more sample-efficient [arxiv.org/abs/2507.19457].
Extractable operations: OP-4 ExampleFloorCheck, OP-5 OptimizerByDataScale. Lesson: the example floor is per-optimizer.
If you can express why an output failed as text, GEPA inverts the data-scale assumption — ~10 examples suffice.
Do not reflexively gate out small datasets that carry rich feedback.
6. Anti-Patterns & Boundaries
Anti-patterns
| # | Anti-pattern | Why it's wrong | Fix |
|---|---|---|---|
| AP-1 | Compiling without a metric | The optimizer has nothing to maximize; DSPy degenerates to verbose prompting | Refuse the compile; build a metric (OP-2, [[agentsop-metric-design]]) |
| AP-2 | Compiling against an un-validated single LLM-judge | The search faithfully chases the judge's length / self-preference / position bias [arxiv.org/pdf/2506.02592] | Validate ≥20 spot-checks; decompose the judge (OP-3) |
| AP-3 | Compiling on <10 examples (scalar feedback) | Below the floor you memorize, not train; "prompt-based optimizers overfit small sets" [dspy.ai/learn/optimization/overview/] | Collect data, or use LabeledFewShot as a floor — or GEPA if textual feedback (OP-4) |
| AP-4 | Starting at auto="heavy" | Tens of $ / hours before you know if compute even helps | Probe auto="light" first (OP-7) |
| AP-5 | Escalating after a <2% probe lift | The bottleneck is the program/metric; more compute won't fix it | Return to [[agentsop-dspy]] Stage 1 (OP-8) |
| AP-6 | Using the expensive task LM as the optimizer LM | Multiplies trial-scale cost for no documented gain | Cheap optimizer LM (gpt-4o-mini) (OP-6) |
| AP-7 | Applying the scalar data-floor to a feedback-rich task | Wrongly gates out GEPA-legal small datasets | Check for textual feedback before counting examples (OP-5) |
| AP-8 | Reporting the gain on the val set used in optimization | Overfit signal; not a real held-out improvement | Confirm on a held-out test set distinct from val (OP-8) |
| AP-9 | Compiling while the Signature/I-O contract is still churning | You pay compile cost for prompts you'll throw away | Stabilize the contract first (signature-design / [[agentsop-dspy]] §1) |
Boundaries (when this skill cannot help)
- No willingness to define any success criterion: no metric can exist; the gate stays closed. Problem definition first ([[agentsop-metric-design]], then scientific-critical-thinking).
- Parse-safety, not quality: if you only need a typed Signature so code can consume the output, that is the promote gate (signature-design / [[agentsop-dspy]] Stage 1), a different decision from compile.
- HOW to write/run the optimizer: class syntax, compile() args, save/deploy. That is [[dspy]] and [[agentsop-dspy]], not this overlay.
- Constructing the metric: decomposition, bias probes, calibration receipts. That is [[agentsop-metric-design]]. This skill only checks the receipt exists.
- Non-DSPy auto-optimization: e.g. OpenAI fine-tuning is a weights path, not prompt compilation; the readiness logic (metric + data floor) still applies but the operators differ (§7).
7. Cross-Framework Mapping
The readiness gate (metric + data floor) is framework-agnostic; the operators differ.
| Concept | DSPy optimizers | Manual few-shot tuning | OpenAI fine-tuning |
|---|---|---|---|
| What is optimized | prompt instructions + demos (and optionally weights via BootstrapFinetune) | the prompt string, by hand | model weights |
| Metric required? | Yes: `metric(ex, pred) -> bool\|float` drives the search | implicit / eyeballed (the failure mode) | a held-out eval set + loss; eval suite recommended |
| Example floor | ~10 (GEPA+feedback) / ~30 (Bootstrap) / 200+ (MIPROv2) | none, but no guarantees | OpenAI guidance: ~50–100+ examples minimum, more is better |
| Cost shape | `num_trials × \|trainset\| × calls` ($2–$40+) [dspy.ai/faqs/] | | |
| Readiness gate (this skill) | both gates apply directly | Gate 1 is exactly what's missing — you're "tuning" with no validated metric | both gates apply; "metric validated" = your eval set is trustworthy |
| When to prefer | metric + ≥10 examples + want a portable, recompilable artifact | one-shot, unstable contract, or audit-mandated verbatim prompts | task is stable, latency/cost matters at high volume, prompt optimization plateaued |
Decision summary: If you have a validated metric and examples clearing a floor, compile (DSPy). If you
have a metric but it's never been validated, you are doing manual tuning dressed up — close Gate 1 first. If
prompt optimization has plateaued and you have hundreds of stable examples and volume justifies it, consider
fine-tuning (or BootstrapFinetune to distill an already-compiled program into a smaller model).
Combination patterns:
- prompt-compilation + [[agentsop-metric-design]]: this skill's Gate 1 is satisfied by a [[agentsop-metric-design]] calibration receipt. No receipt → gate closed.
- prompt-compilation + [[agentsop-dspy]]: this overlay is the sharpened version of the [[agentsop-dspy]] §3 Stage 2→3 transition (the "are you allowed to compile yet?" boundary). After the gate opens, hand off to [[agentsop-dspy]] for the full compile/save/deploy workflow.
- prompt-compilation + [[dspy]]: [[dspy]] provides the optimizer APIs; this skill decides whether and which.
References
- references/R1-source-evidence.md: every claim traced to the local dspy-sop skill and the upstream DSPy docs it cites; overlap check vs [[dspy]] and [[agentsop-metric-design]].
- intermediate/operation_candidates.json: the 8 operations + tables in machine-readable form.
Citations: [dspy.ai/learn/optimization/overview/], [dspy.ai/learn/optimization/optimizers/],
[dspy.ai/api/optimizers/MIPROv2/], [dspy.ai/api/optimizers/GEPA/overview/], [dspy.ai/api/optimizers/BootstrapFinetune/],
[dspy.ai/learn/evaluation/metrics/], [dspy.ai/faqs/], [dspy.ai/cheatsheet/], [dspy.ai/tutorials/gepa_ai_program/],
[arxiv.org/abs/2310.03714], [arxiv.org/abs/2507.19457], [arxiv.org/pdf/2506.02592],
[github.com/stanfordnlp/dspy/issues/1596]. Primary source:
/Users/5imp1ex/Desktop/Skill-Workplace/output/dspy-sop-skill/SKILL.md.