name: gepa-fast-agent
description: Build, review, or refactor GEPA optimization loops that use fast-agent. Use when Codex needs to design GEPA evaluators, Actionable Side Information, Pareto/frontier metrics, FastAgentReflectionLM, FastAgentBatchEvaluator, BatchRunner, EvalRun/CandidateRun artifact layouts, Trackio monitoring, or refactor OpenClaw/Birch-style prompt, policy, skill, batch, or artifact-generation optimizers.
GEPA Fast-Agent
Use this skill to turn a candidate-evaluation problem into a small, auditable GEPA loop backed by fast-agent primitives.
Core Shape
Think in this order:
- Candidate: the mutable text GEPA can change, usually a prompt, policy, AgentCard variable, skill file, recipe, or code/config fragment.
- Evidence: rows, artifacts, logs, screenshots, checker reports, telemetry, usage, traces, or command output produced by evaluating the candidate.
- Score plus ASI: a numeric score and Actionable Side Information explaining why the candidate passed or failed.
GEPA consumes:
score, side_info = evaluator(candidate)
or, in dataset mode, per-example evaluation. Keep GEPA plumbing thin; put reusable execution and artifact capture in fast-agent primitives.
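As a rough illustration of the contract above, the following sketch shows a toy evaluator that returns a score plus a side_info dict. The keyword list, the `feedback` key, and the function name are hypothetical stand-ins, not part of GEPA or fast-agent; a real evaluator would run the candidate through a fast-agent primitive instead.

```python
from typing import Any


def evaluate_candidate(candidate: dict[str, str]) -> tuple[float, dict[str, Any]]:
    """Toy evaluator: rewards policies that mention every required keyword."""
    required = ["refund", "shipping"]  # hypothetical task requirement
    policy = candidate["policy"].lower()
    missing = [word for word in required if word not in policy]
    score = 1.0 - len(missing) / len(required)  # higher is better
    side_info = {
        "scores": {"keyword_coverage": score},
        # Free-text ASI: tell the reflection step what to fix.
        "feedback": f"Missing keywords: {missing}" if missing else "All keywords covered.",
    }
    return score, side_info


score, side_info = evaluate_candidate({"policy": "Handle refund requests first."})
```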
Non-Negotiable Frontier Rule
GEPA treats scores as maximize-only.
- The main evaluator score must be "higher is better".
- Every value in side_info["scores"] must also be "higher is better".
- Do not put raw loss, raw latency, raw cost, failure count, error count, timeout seconds, token count, or policy length into side_info["scores"] unless transformed so higher is better.
- Keep raw lower-is-better diagnostics outside side_info["scores"], for example under details, raw_metrics, summary, or failures.
- Name transformed metrics clearly: latency_score, cost_score, valid_json_rate, no_timeout, policy_length_compliance, failure_free_rate.
- Before running GEPA, audit all frontier keys and write down the transform for each one.
Examples:
side_info = {
    "scores": {
        "accuracy": accuracy,
        "latency_score": 1.0 / (1.0 + latency_seconds),
        "cost_score": 1.0 / (1.0 + dollars),
        "failure_free_rate": 1.0 - failure_count / max(total, 1),
    },
    "raw_metrics": {
        "latency_seconds": latency_seconds,
        "dollars": dollars,
        "failure_count": failure_count,
    },
}
return accuracy, side_info
If unsure, check the installed GEPA package source or upstream GEPA docs before
deciding. Search for higher is better, SideInfo, scores, Pareto, and
frontier candidate selection.
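One way to make the frontier audit mechanical is a small pre-flight check like the sketch below. The marker list, the `_score`/`_rate` suffix convention, and the function name are assumptions made for illustration; adjust them to the metric names your evaluator actually emits.

```python
import math

# Hypothetical substrings that usually indicate a raw lower-is-better quantity.
RAW_MARKERS = ("loss", "latency_seconds", "dollars", "count", "tokens", "timeout")


def audit_frontier_scores(scores: dict[str, float], transforms: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the audit passed."""
    problems = []
    for key, value in scores.items():
        # Every frontier key needs a written-down transform.
        if key not in transforms:
            problems.append(f"{key}: no documented transform")
        # Flag names that look raw unless they follow the transformed naming convention.
        if any(marker in key for marker in RAW_MARKERS) and not key.endswith(("_score", "_rate")):
            problems.append(f"{key}: name looks like a raw lower-is-better metric")
        if not isinstance(value, (int, float)) or not math.isfinite(value):
            problems.append(f"{key}: value must be a finite number")
    return problems
```

Calling it with `{"accuracy": 0.9, "latency_score": 0.5}` and a transform note for each key returns an empty list, while a key such as `failure_count` with no transform note is reported twice.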
Choose The Fast-Agent Primitive
Use BatchRunner for row-oriented tasks:
from fast_agent.batch import BatchRunner
runner = BatchRunner(env_dir=".fast-agent", backend="process")
result = await runner.run(
    agent_card=".fast-agent/agent-cards/classifier.md",
    agent="classifier",
    input="eval/input.jsonl",
    template_source="eval/task-template.md",
    json_schema="eval/output.schema.json",
    variables={"policy": candidate["policy"]},
    output_path=candidate_run.path / "results.jsonl",
    summary_path=candidate_run.path / "batch-summary.json",
    telemetry_path=candidate_run.path / "telemetry.jsonl",
    overwrite=True,
)
Use FastAgentBatchEvaluator when the GEPA evaluator is mostly row/batch execution plus a scorer:
from fast_agent.integrations.gepa import FastAgentBatchEvaluator
evaluator = FastAgentBatchEvaluator(
    env_dir=".fast-agent",
    agent_card=".fast-agent/agent-cards/classifier.md",
    agent="classifier",
    candidate_variables={"policy": "policy"},
    input="eval/input.jsonl",
    template_source="eval/task-template.md",
    schema="eval/output.schema.json",
    scorer=score_candidate,
    run_dir="runs/gepa",
    backend="process",
)
Use EvalRun / CandidateRun for artifact tasks:
from fast_agent.eval import EvalRun
run = EvalRun("runs/gepa")
candidate_run = run.candidate()
candidate_run.materialize_candidate(candidate)
command = candidate_run.run_command(
    ["uv", "run", "python", "scripts/run_eval.py", "--candidate", str(candidate_run.path)],
    timeout_seconds=900,
    log_prefix="eval",
)
candidate_run.write_score(score_value, side_info, metadata={"ok": command.ok})
return score_value, side_info
Use FastAgentReflectionLM when GEPA reflection should use fast-agent model aliases/config and leave an audit trail:
from fast_agent.integrations.gepa import FastAgentReflectionLM
reflection_lm = FastAgentReflectionLM(
    env_dir=".fast-agent",
    model="responses.gpt-5.5?reasoning=high",
    audit_dir=run_dir / "reflection",
)
The reflection adapter should write prompt, request, response, timing, stdout/stderr for process runs, errors, and usage when --results contains fast-agent usage channels.
Trackio Monitoring
For long GEPA runs, add Trackio logging in the evaluator script so candidate scores and diagnostics are visible while the optimizer is running:
import trackio
trackio.init(project="gepa-openclaw", config={"split": "validation"})
# In a FastAgentBatchEvaluator scorer, candidate_run is available.
score, side_info = score_candidate(result, candidate, candidate_run)
trackio.log(
    {
        "score": score,
        **side_info.get("scores", {}),
        "candidate_index": candidate_run.index or 0,
    }
)
Trackio is for monitoring and retrieval, not GEPA frontier semantics. It is
fine to log raw latency, token count, cost, failure count, or artifact size to
Trackio, but do not copy those raw lower-is-better values into
side_info["scores"]; transform them first or keep them under raw_metrics,
summary, or details.
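To keep the two audiences separate in practice, a helper along these lines can build the monitoring payload without touching side_info["scores"]. The `frontier/` and `raw/` key prefixes and the function name are hypothetical naming choices, not a Trackio or GEPA convention; the resulting dict would be passed to trackio.log.

```python
def monitoring_payload(score: float, side_info: dict) -> dict[str, float]:
    """Flatten a GEPA result into one dict for logging, keeping raw values labeled."""
    payload = {"score": score}
    # Frontier metrics are already higher-is-better.
    for key, value in side_info.get("scores", {}).items():
        payload[f"frontier/{key}"] = value
    # Raw diagnostics are logged for humans only and never enter the frontier.
    for key, value in side_info.get("raw_metrics", {}).items():
        payload[f"raw/{key}"] = value
    return payload


payload = monitoring_payload(
    0.8,
    {"scores": {"latency_score": 0.5}, "raw_metrics": {"latency_seconds": 1.0}},
)
```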
When running detached or remote jobs, use a Trackio Space via space_id and
poll with CLI JSON output such as trackio list alerts --project <name> --json
or trackio get metric --project <name> --run <run> --metric score --json.
Use trackio.alert() for stalled runs, repeated invalid outputs, failing
smoke checks, or score regressions that should interrupt an autonomous loop.
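A stalled-run check can be a pure function over the score history, as in the sketch below. The `patience` and `min_delta` parameters and the function name are illustrative assumptions; when it returns True, the evaluator script would raise the alert through trackio.alert().

```python
def is_stalled(scores: list[float], patience: int = 5, min_delta: float = 1e-6) -> bool:
    """True when the best score has not improved over the last `patience` evaluations."""
    if len(scores) <= patience:
        return False  # not enough history to judge
    best_before = max(scores[:-patience])
    best_recent = max(scores[-patience:])
    return best_recent < best_before + min_delta
```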
Demo Card Pack
When the GEPA demo pack is available in the configured card-pack registry, use it as a quick smoke test for the evaluator shape it ships:
fast-agent go --pack gepa-demo
uv run .fast-agent/scripts/gepa-run.py --evaluate-only
Keep pack examples aligned with this skill:
- batch demo: AgentCard variables, FastAgentBatchEvaluator, FastAgentReflectionLM;
- artifact demo, when present: EvalRun, CandidateRun, checker reports, and write_score;
- all side_info["scores"] values higher-is-better;
- Trackio logging optional and separate from GEPA frontier semantics.
AgentCard Variables
For mutable prompts/policies, prefer declared AgentCard variables over rendering a new card per candidate:
---
type: agent
name: openclaw_classifier
model: "$system.default"
variables:
  policy: ""
---
{{file:eval/openclaw/allowed-topics.md}}
{{policy}}
Populate variables per run:
variables={"policy": candidate["policy"]}
For CLI smoke tests:
fast-agent batch run \
  --agent-card .fast-agent/agent-cards/classifier.md \
  --agent classifier \
  --input eval/input.jsonl \
  --output runs/candidate-0001/results.jsonl \
  --template eval/task-template.md \
  --json-schema eval/output.schema.json \
  --var-file policy=seed/policy.md \
  --overwrite \
  --no-final-summary
Scorer Pattern
Keep scoring user-owned and explicit:
- Read normalized evidence: BatchRunResult.rows, CandidateRun reports, checker JSON, telemetry, logs, screenshots.
- Compute one main maximize score.
- Build ASI with concrete failure examples and repair guidance.
- Put only maximize metrics in side_info["scores"].
- Write raw diagnostics outside scores.
- Persist score.json through CandidateRun.write_score() or the evaluator wrapper.
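The steps above might look like the following sketch for a row-oriented task. The row fields (`expected`, `actual`, `error`) and the function name are assumptions for illustration and do not reflect the actual BatchRunResult row schema; persisting the result is left to CandidateRun.write_score() or the evaluator wrapper.

```python
def score_rows(rows: list[dict]) -> tuple[float, dict]:
    """Compute one maximize score plus frontier and raw metrics from evaluated rows."""
    total = max(len(rows), 1)  # avoid division by zero on an empty batch
    failures = [r for r in rows if r.get("error")]
    correct = [r for r in rows if not r.get("error") and r.get("actual") == r.get("expected")]
    accuracy = len(correct) / total
    side_info = {
        # Only higher-is-better values go here.
        "scores": {
            "accuracy": accuracy,
            "failure_free_rate": 1.0 - len(failures) / total,
        },
        # Raw lower-is-better diagnostics stay outside "scores".
        "raw_metrics": {"failure_count": len(failures), "row_count": len(rows)},
    }
    return accuracy, side_info
```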
Good ASI includes:
- exact row/artifact IDs;
- expected vs actual;
- failure categories;
- stderr/stdout tails;
- checker findings;
- screenshots or gepa.Image where useful;
- concise actionable feedback;
- raw metrics and transformed frontier metrics.
Avoid ASI that only says "bad" or only returns aggregate numbers; GEPA needs the why.
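A minimal sketch of ASI that carries the "why" is shown below: it keeps row IDs, expected vs actual values, failure categories, and one sentence of feedback. The row fields, the output key names, and the function name are hypothetical; GEPA does not require this particular layout.

```python
def build_asi(rows: list[dict], max_examples: int = 3) -> dict:
    """Summarize wrong rows into examples, categories, and short feedback."""
    wrong = [r for r in rows if r["actual"] != r["expected"]]
    categories: dict[str, int] = {}
    for row in wrong:
        key = f"expected {row['expected']}, got {row['actual']}"
        categories[key] = categories.get(key, 0) + 1
    # Keep a few concrete examples so reflection can see exact rows.
    examples = [
        {"id": r["id"], "expected": r["expected"], "actual": r["actual"]}
        for r in wrong[:max_examples]
    ]
    if categories:
        top = max(categories, key=categories.get)
        feedback = f"Most frequent error: {top} ({categories[top]} of {len(rows)} rows)."
    else:
        feedback = "No row-level errors found."
    return {"failure_examples": examples, "failure_categories": categories, "feedback": feedback}
```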
Backend Choice
- Use backend="harness" for lower overhead in library/API contexts.
- Use backend="process" for optimizer loops that need isolation, natural stdout/stderr audit, timeouts, and CLI parity.
- Both backends should return the same structured result/artifact contract.
OpenClaw Pattern
Use for row/batch classification:
- static AgentCard with declared policy;
- FastAgentBatchEvaluator with candidate_variables={"policy": "policy"};
- scorer reads results.jsonl, summary, telemetry;
- ASI reports confusion, false positives/negatives, invalid JSON, representative failures, and boundary guidance;
- transform policy length, latency, failure count, or cost into maximize scores before adding to side_info["scores"].
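For the policy-length case in the last bullet, one possible transform is sketched below: full credit up to a character budget, then a score that decays as the policy grows. The 2000-character budget and the decay shape are assumptions chosen for illustration, not values taken from OpenClaw.

```python
def policy_length_compliance(policy: str, max_chars: int = 2000) -> float:
    """Map policy length to (0, 1]; 1.0 means within budget, lower means over budget."""
    length = len(policy)
    if length <= max_chars:
        return 1.0
    # Over budget: score shrinks in proportion to the overshoot.
    return max_chars / length
```

The raw character count can still be reported under raw_metrics alongside this transformed value.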
Birch Pattern
Use for artifact-generation evals:
- candidate contains SKILL.md and recipe files;
- EvalRun allocates candidate dirs;
- CandidateRun materializes isolated skill/resource trees;
- CandidateRun.run_command() executes generation/checker scripts with timeout;
- reports include artifacts, screenshots, checker JSON/Markdown, stdout/stderr, optional VLM findings;
- scorer owns all weights and penalties, but all frontier keys remain higher-is-better.
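A scorer that owns its weights could combine frontier metrics as in the sketch below. The metric names and weights in the usage line are hypothetical; the point is that non-negative weights over higher-is-better inputs keep the combined score higher-is-better.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of higher-is-better metrics; weights must be non-negative."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("negative weights would flip a metric to lower-is-better")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")
    return sum(weights[name] * scores[name] for name in weights) / total_weight


main_score = weighted_score(
    {"checker_pass_rate": 1.0, "visual_score": 0.5},
    {"checker_pass_rate": 3.0, "visual_score": 1.0},
)
```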
Final Checklist
Before running a long GEPA job:
- Candidate mutation is limited to intended text fields.
- Candidate dirs contain candidate.json, variables or materialized files, evidence artifacts, logs, and score.json.
- Reflection calls are audited.
- Evaluator can run in --evaluate-only or smoke mode.
- All side_info["scores"] keys are higher-is-better.
- Raw lower-is-better values are stored outside scores.
- The main score is higher-is-better.
- The seed candidate and data split are frozen.
- Lint/typecheck/tests or smoke commands for changed repos have passed.
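Part of this checklist can be automated. The sketch below checks a candidate directory for the two files named above; the helper name is hypothetical, and the temporary directory stands in for a real candidate dir written by EvalRun.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_FILES = ("candidate.json", "score.json")


def missing_candidate_files(candidate_dir: Path) -> list[str]:
    """Return the required file names that are absent from a candidate directory."""
    return [name for name in REQUIRED_FILES if not (candidate_dir / name).is_file()]


# Usage: a directory holding only candidate.json is reported as missing score.json.
with tempfile.TemporaryDirectory() as tmp:
    candidate_dir = Path(tmp)
    (candidate_dir / "candidate.json").write_text(json.dumps({"policy": "seed"}))
    missing = missing_candidate_files(candidate_dir)
```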