scenario-agent-run-optimizer - SKILL.md Agent Skill

name: scenario-agent-run-optimizer
description: Analyze and optimize a scenario-specific AI agent from runtime logs, traces, eval outputs, system prompts, tool schemas, or failure reports. Use when the user asks Codex to inspect another agent system's behavior, diagnose agent failures, evaluate runs, compare before/after agent traces, find root causes, improve system prompts or tool policies from evidence, or design an optimization loop for non-code or domain-specific agents.

Scenario Agent Run Optimizer

Diagnose another agent system from evidence, then propose concrete optimizations. This skill applies to any scenario-specific agent, not only to coding agents.

Core Promise

Turn these inputs:

- runtime logs, traces, JSONL events, OpenTelemetry spans, LangChain/LangGraph traces, OpenAI/Anthropic call logs, CrewAI logs, screenshots, eval outputs, failure reports
- current system prompt, tool definitions, routing rules, retrieval policy, memory policy, guardrails, escalation rules
- expected task success criteria and known failures

into:

- an evidence-backed run health report
- root-cause hypotheses tied to trace evidence
- specific optimization proposals for prompt, tools, retrieval, routing, state, evals, logging, and human handoff
- an A/B or replay plan to test whether the fix improves the agent

Do not invent logs, metrics, trace fields, or outcomes. If evidence is missing, state the smallest artifact needed next.

Quick Workflow

1. Inventory inputs.
   - Identify agent type, scenario, runtime, log format, trace files, prompt files, eval files, and success criteria.
   - Separate facts from assumptions.
   - If paths are provided, read only relevant files. Avoid secrets and private credentials.
2. Detect analysis path.
   - If agent-xray is available and trace format is compatible, use it for structural triage.
   - If not available, perform a manual evidence-based analysis using references/diagnostic-rubric.md.
   - For raw JSONL/JSON/CSV logs, use deterministic parsing where practical.
3. Normalize each run.
   - Task id, user goal, final outcome, steps, tools, retrieval calls, model calls, errors, retries, latency, cost/tokens if available, final answer, evaluator result, human corrections.
4. Score execution structure.
   - completion signal, loop resistance, tool choice, error handling, retrieval quality, output contract stability, latency/cost, escalation behavior, safety/privacy behavior.
   - Make scoring scale explicit. Distinguish structural quality from answer correctness.
5. Diagnose root causes.
   - Use evidence: prompt snippet, tool call, retrieval miss, state transition, evaluator result, or error line.
   - Classify likely causes: prompt ambiguity, missing success criteria, tool schema mismatch, retrieval weakness, router overlap, state leak, memory staleness, insufficient logging, evaluator drift, unsafe action policy, missing fallback.
6. Propose optimizations.
   - Rank fixes by expected impact and risk.
   - For prompt fixes, produce a patchable system-prompt section, not vague advice.
   - For tool fixes, propose tool schema/description/precondition/postcondition changes.
   - For retrieval fixes, propose query, chunking, reranking, citation, and evidence-gap policies.
   - For orchestration fixes, propose routing, retry budgets, loop caps, state validation, and human handoff gates.
7. Design verification.
   - Create an A/B, replay, task-bank, or golden-run plan.
   - Define metrics and pass/fail gates before recommending adoption.
   - If possible, produce a small eval table from the available runs.
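
The "Normalize each run" and "Design verification" steps can be sketched in a few lines of Python. This is a minimal illustration, not a required implementation. The event fields (`task_id`, `type`, `tool`, `error`, `latency_ms`, `outcome`), the `"success"` outcome label, and the `min_gain` threshold are all hypothetical; map them to whatever the target agent actually logs and to the gate agreed with the user.

```python
import json
from collections import defaultdict


def normalize_runs(jsonl_text):
    """Group JSONL events by task and derive simple per-run facts.

    All field names are assumptions for illustration only.
    Returns (runs, skipped) where skipped counts lines that could not be used.
    """
    runs = defaultdict(lambda: {"steps": 0, "tools": [], "errors": 0,
                                "latency_ms": 0, "outcome": None})
    skipped = 0
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1  # report unparseable lines; never guess their content
            continue
        if not isinstance(event, dict) or "task_id" not in event:
            skipped += 1
            continue
        run = runs[event["task_id"]]
        run["steps"] += 1
        if event.get("type") == "tool_call":
            run["tools"].append(event.get("tool"))
        if event.get("error"):
            run["errors"] += 1
        run["latency_ms"] += event.get("latency_ms", 0)
        if event.get("type") == "final":
            run["outcome"] = event.get("outcome")
    return dict(runs), skipped


def pass_rate(runs):
    """Share of runs whose final outcome is the (assumed) label 'success'."""
    if not runs:
        return 0.0
    return sum(1 for r in runs.values() if r["outcome"] == "success") / len(runs)


def passes_gate(before, after, min_gain=0.05):
    """Adopt the fix only if the pass rate rises by at least min_gain.

    The threshold is fixed before the comparison is run, not tuned afterwards.
    """
    return pass_rate(after) - pass_rate(before) >= min_gain
```

The `skipped` count belongs in the report's evidence gaps: a high value suggests the log format was misidentified, not that the agent behaved well.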

Agent-Xray Integration

Agent-Xray source was downloaded for reference at:

D:\工作流优化\codex-research-workflow-html\tmp\agent-xray-source-20260602\Agent-Xray

Read references/agent-xray-notes.md when using Agent-Xray concepts or commands.

Use Agent-Xray when:

- the user provides trace directories or JSONL logs
- the run format is OpenAI, Anthropic, LangChain, CrewAI, OpenTelemetry, or generic JSONL
- structural grading, root-cause classification, decision-surface reconstruction, or before/after comparison would help

Suggested commands when agent-xray is installed:

agent-xray format-detect <trace-file>
agent-xray triage <trace-dir>
agent-xray grade <trace-dir>
agent-xray root-cause <trace-dir>
agent-xray inspect <task-id> <trace-dir>
agent-xray compare <before-traces> <after-traces>
agent-xray flywheel <trace-dir>

If agent-xray is not installed, do not stop. Use the manual rubric and tell the user that Agent-Xray would add automated structural grading if installed.
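
The fallback rule above can be checked deterministically before choosing a path. The sketch below assumes only that the CLI is exposed on `PATH` under the name `agent-xray`, as in the commands listed earlier. The function name and return labels are hypothetical.

```python
import shutil


def choose_analysis_path(trace_format_supported, executable="agent-xray"):
    """Pick the analysis path without failing when the CLI is absent.

    Returns 'agent-xray' only when the executable is found on PATH and the
    trace format is one the tool supports; otherwise 'manual-rubric'.
    """
    if trace_format_supported and shutil.which(executable) is not None:
        return "agent-xray"
    return "manual-rubric"
```

When the result is `manual-rubric`, continue with references/diagnostic-rubric.md and mention to the user that Agent-Xray would add automated structural grading if installed.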

Output Shape

Use this structure unless the user requests a different format:

Summary
- Agent/system inspected:
- Evidence used:
- Main diagnosis:
- Highest-impact fix:

Run Health
| Dimension | Status | Evidence | Risk |

Root Causes
| Rank | Cause | Evidence | Confidence | Fix |

Optimization Plan
1. Prompt/system-instruction changes
2. Tool/schema changes
3. Retrieval/knowledge changes
4. Routing/state/orchestration changes
5. Eval/logging changes

Verification Plan
| Test | Input/run | Metric | Pass gate |

Open Evidence Gaps
- ...

For a prompt edit, include a clearly copyable section:

System Prompt Patch
<section name>
...

For logs with insufficient detail, output a minimal logging schema the target agent should emit next.
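
As an illustration of what such a minimal schema might look like, the sketch below defines one event record and serializes it as a JSONL line. The class and field names are hypothetical, loosely derived from the per-run facts listed in the workflow; they are not the contents of references/logging-schema.md, which remains the reference when available.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TraceEvent:
    """One logged step of an agent run (hypothetical field names)."""
    task_id: str
    step: int
    event_type: str                     # e.g. "model_call", "tool_call", "retrieval", "final"
    name: Optional[str] = None          # tool or model name, if any
    error: Optional[str] = None         # error message, if the step failed
    retry: int = 0                      # how many times this step was retried
    latency_ms: Optional[float] = None
    tokens: Optional[int] = None
    outcome: Optional[str] = None       # set only on the final event


def to_jsonl_line(event):
    """Serialize an event as one JSON object per line, omitting unset fields."""
    payload = {k: v for k, v in asdict(event).items() if v is not None}
    return json.dumps(payload, sort_keys=True)
```

For example, `to_jsonl_line(TraceEvent("t1", 1, "tool_call", name="search", latency_ms=120.0))` yields a single line that can be appended to a trace file and replayed later.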

Evidence Rules

- Cite local file paths and line numbers when available.
- Quote only short excerpts. Prefer paraphrase and line references.
- Do not expose secrets, API keys, tokens, passwords, cookies, or private user data.
- If a log may contain sensitive data, summarize patterns and redact values.
- Treat model reasoning traces as sensitive unless the user explicitly asks to inspect them and they are local artifacts.
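
Redaction before quoting can be sketched as follows. The patterns are illustrative assumptions that cover only common `key=value`, JSON-style, and bearer-token shapes. They are not a complete secret detector, so manual review is still needed before any excerpt is shown.

```python
import re

# Illustrative patterns only; extend them for the secrets the target system uses.
_SECRET_PATTERNS = [
    # key=value, key: value, or "key": "value" where the key name suggests a secret
    re.compile(
        r"""(?i)\b(api[_-]?key|token|password|passwd|secret|cookie)"""
        r"""(["']?\s*[:=]\s*)("[^"]*"|'[^']*'|\S+)"""
    ),
    # Bearer tokens, e.g. in an Authorization header
    re.compile(r"(?i)\b(bearer)(\s+)([A-Za-z0-9._~+/=-]+)"),
]


def redact(line):
    """Replace secret-looking values with a placeholder, keeping the key visible."""
    for pattern in _SECRET_PATTERNS:
        line = pattern.sub(lambda m: f"{m.group(1)}{m.group(2)}[REDACTED]", line)
    return line
```

Keeping the key name while dropping the value preserves the pattern ("a password was logged at this step") without exposing the data itself.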

References

Load only what is needed:

- references/diagnostic-rubric.md: manual scoring, root-cause taxonomy, optimization mapping.
- references/agent-xray-notes.md: Agent-Xray concepts, commands, and when to use them.
- references/logging-schema.md: recommended trace schema for future runs and eval loops.