terminal-agent-improvement-loop - SKILL.md Agent Skill

name: terminal-agent-improvement-loop description: Run a benchmark-driven improvement loop for Con's terminal agent. Use when iterating on pane awareness, SSH/tmux behavior, coding-cli flows, benchmark scoring, or progress tracking across many runs.

Use this skill when improving Con as a terminal-native agent, not just fixing a one-off bug.

Primary references:

Choose the smallest operator profile that matches the problem.
Run the benchmark on an idle tab:
- python3 benchmarks/terminal-agent/run.py --profile operator-local-codex-devloop --suite operator
- for repeated clean iterations, prefer python3 benchmarks/terminal-agent/iterate.py ...
Score the resulting run with the matching rubric:
- python3 benchmarks/terminal-agent/score.py --profile ... --record ... --score ...
- or ask the built-in agent to judge the raw record and transcript first: python3 benchmarks/terminal-agent/judge_llm.py --profile ... --record ... --socket /tmp/con.sock
- then turn that judge artifact into a normal scorecard: python3 benchmarks/terminal-agent/score.py --profile ... --record ... --judge-file ...
Record one short summary, a few lessons, and a few next-focus bullets in the score record.
Append the scorecard to the tracked improvement log:
- python3 benchmarks/terminal-agent/log_iteration.py --scorecard ... --change "..."
Make one focused product change.
Re-run the same operator profile.
Generate a report when you need to inspect trend:
- python3 benchmarks/terminal-agent/report.py

Use strict suites to protect the floor and operator suites to judge real workflows.
Prefer operator profiles that start a fresh conversation and have bounded step timeouts.
Prefer typed control-plane improvements over prompt-only fixes.
Do not overfit to a single benchmark phrase or one host layout.
Keep unknowns honest when the backend cannot prove more.
When benchmark infra changes, say whether the product improved or the measurement improved.
Keep iteration notes concise and comparable across runs.
Keep docs/impl/terminal-agent-improvement-log.md useful to a human reader; it should explain what changed, not just repeat the numeric score.
If you use the LLM judge, feed it the raw record and transcript, not only the generated report. The report is a summary, not primary evidence.