ops-doctor - SKILL.md Agent Skill

name: ops-doctor
description: Collect session incidents (mistakes, broken workflows, user corrections) and send an honest report to Codex for fixes. Ground truth only, with no spin and no blame-shifting.
user_invocable: true

/ops:doctor — Honest Incident Reporter

When something goes wrong — you made a mistake, the user corrected you, a workflow broke, you got locked out, you blamed tooling when it was your fault — this skill collects the ground truth and sends it to Codex for diagnosis and fixes.

You are not the judge. Codex is. Your job is to report facts accurately. Codex decides what to fix.

When to Use

User says you made a mistake
User says "fix yourself", "tell codex", "what went wrong"
You got blocked/locked out by the permission gate
A workflow failed because you skipped steps
You blamed tooling when the real problem was your behavior
Any time the user points out something isn't working

Process

Step 0: Ingest Automatic Incident Logs First

Before writing anything manually, check for ground-truth incidents captured by hooks in:

~/.claude/state/ops-doctor/incidents-<cwdHash>.jsonl
~/.claude/state/ops-doctor/pending-<cwdHash>.json

For the current repo/workspace cwd:

If the JSONL file exists, read all incidents from it first.
Treat those incidents as primary evidence — do not omit or rewrite them.
If the pending flag exists, note that the threshold was reached automatically.
Then add any extra manual incidents for things the hooks could not observe (for example: wasted tokens, user corrections, wrong reasoning, or skipped workflows that never triggered a hook).
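
The checks above can be sketched in shell. This is a minimal sketch, not the skill's actual implementation: it assumes the cwd hash is already known (this document does not say how it is derived) and takes it as an argument. The function name `ingest_incidents` and the override variable `OPS_DOCTOR_STATE_DIR` are hypothetical, added only so the state directory can be redirected.

```shell
# Hypothetical helper: print the automatic incidents for one cwd hash.
# OPS_DOCTOR_STATE_DIR is an assumed override; the default is the
# directory named in Step 0.
ingest_incidents() {
  cwd_hash=$1
  state_dir=${OPS_DOCTOR_STATE_DIR:-"$HOME/.claude/state/ops-doctor"}
  incidents="$state_dir/incidents-$cwd_hash.jsonl"
  pending="$state_dir/pending-$cwd_hash.json"

  if [ -f "$incidents" ]; then
    # Primary evidence: emit every line verbatim, no filtering or rewriting.
    cat "$incidents"
  fi
  if [ -f "$pending" ]; then
    # The pending flag means the hooks reached their threshold on their own.
    echo "NOTE: incident threshold was reached automatically" >&2
  fi
}
```

The incidents go to stdout and the threshold note goes to stderr, so the evidence stream stays byte-for-byte identical to the log.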

After a successful mailbox send to Codex:

Archive the JSONL file to incidents-<cwdHash>-sent-<timestamp>.jsonl
Remove the matching pending-<cwdHash>.json flag if it exists

Step 1: Gather Additional Manual Incidents

Collect every important incident from the current session that is not already captured in the automatic log. For each one, document:

What happened — the observable fact (error message, blocked tool, wrong output)
What you did — your exact actions that led to it (be specific: which commands, which shortcuts, which workflow steps you skipped)
What you should have done — the correct workflow
What the user said — their exact correction or complaint
Root cause — why you did the wrong thing (skipped a step, took a shortcut, blamed tooling, didn't read the workflow, etc.)

Step 2: Classify Each Incident

Category	Meaning
`supervisor_mistake`	You (Claude) did something wrong — skipped a workflow step, took a shortcut, blamed tooling
`workflow_gap`	The workflow/skill is missing a step or has ambiguous instructions
`gate_issue`	The permission gate blocked something it shouldn't have, or didn't block something it should have
`tooling_bug`	An actual bug in OPS tools, GSD tools, hooks, or scripts
`config_issue`	A configuration problem (session state, Redis, file permissions)

Default to supervisor_mistake unless you have concrete evidence otherwise. If you're unsure, it's probably your fault.
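
The default rule can be expressed as a small guard. This is a sketch under assumptions: `classify_incident` is a hypothetical helper, and "concrete evidence" is modeled simply as a non-empty second argument, which is far cruder than the judgment the rule actually asks for.

```shell
# Hypothetical helper: accept a non-supervisor category only when
# concrete evidence is supplied; otherwise fall back to supervisor_mistake.
# $1: proposed category, $2: evidence text (may be empty or omitted)
classify_incident() {
  case "$1" in
    workflow_gap|gate_issue|tooling_bug|config_issue)
      if [ -n "${2:-}" ]; then
        printf '%s\n' "$1"
      else
        # No evidence: per Step 2, assume it is your fault.
        printf '%s\n' supervisor_mistake
      fi ;;
    *)
      # Unknown, empty, or already supervisor_mistake.
      printf '%s\n' supervisor_mistake ;;
  esac
}
```

For example, `classify_incident tooling_bug ""` prints `supervisor_mistake`, because a tooling claim with nothing behind it does not survive the default.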

Step 3: Build the Report

Write a JSON envelope with ALL incidents — automatic first, then manual additions. Do not cherry-pick. Do not minimize.

{
  "task_type": "investigate",
  "objective": "<1-line: what went wrong this session>",
  "why": "The Supervisor made mistakes and/or found issues. This is an honest incident report for diagnosis and fixes.",
  "deliverable": "recommendation",
  "evidence": "<automatic JSONL incidents + full manual timeline with ground truth>",
  "acceptance_criteria": [
    {"id": "AC-1", "requirement": "Analyze each incident and classify as supervisor_mistake vs actual tooling issue", "verification_method": "Written analysis per incident"},
    {"id": "AC-2", "requirement": "For tooling issues: implement fixes if warranted", "verification_method": "Code changes or explicit skip with rationale"},
    {"id": "AC-3", "requirement": "For supervisor mistakes: propose operating rules to prevent recurrence", "verification_method": "Concrete rules for the Supervisor role definition"}
  ],
  "scope_in": ["<files involved in the incidents>"],
  "scope_out": [],
  "authority_flags": {"can_create_files": false, "can_delete_files": false, "can_modify_deps": false, "can_edit_dirty_files": true}
}
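
One way to keep the "automatic first, then manual" ordering honest is to assemble the evidence text mechanically rather than by hand. The sketch below is illustrative only: `build_evidence`, the two file arguments, and the section header lines are all assumptions, not part of the envelope format.

```shell
# Hypothetical helper: assemble the evidence text, automatic incidents first.
# $1: automatic JSONL log, $2: file holding the manual incident timeline
build_evidence() {
  auto_file=$1
  manual_file=$2
  echo "== automatic incidents (from hooks) =="
  if [ -f "$auto_file" ]; then
    cat "$auto_file"      # verbatim, nothing omitted
  fi
  echo "== manual incidents =="
  if [ -f "$manual_file" ]; then
    cat "$manual_file"
  fi
}
```

The result is plain text, so it still has to be escaped as a JSON string before it goes into the `evidence` field. Piping it through `jq -Rs .` is one way to do that.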

Step 4: Send to Codex

Dispatch via mailbox:

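cat > /tmp/mailbox-task.json <<'JSON'
<the JSON envelope>
JSON
# pipefail makes a failed send fail the whole pipeline instead of being hidden by jq's exit status.
set -o pipefail
if TASK_ID=$(node ~/.claude/scripts/mailbox send codex "$(cat /tmp/mailbox-task.json)" | jq -r '.id'); then
  echo "$TASK_ID"
else
  echo "mailbox send failed; do not archive logs" >&2
fi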

Step 5: Archive Automatic Logs After Successful Send

If mailbox send succeeds:

Rename incidents-<cwdHash>.jsonl to incidents-<cwdHash>-sent-<timestamp>.jsonl
Remove pending-<cwdHash>.json if present
Leave archived incident files untouched for history
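
The archive step might look like the following. Treat it as a sketch: `archive_incidents` and `OPS_DOCTOR_STATE_DIR` are hypothetical names, the cwd hash is passed in because this document does not define how it is computed, and the UTC timestamp format is an assumption (the skill only says `<timestamp>`).

```shell
# Hypothetical helper: archive the incident log after a confirmed send.
archive_incidents() {
  cwd_hash=$1
  state_dir=${OPS_DOCTOR_STATE_DIR:-"$HOME/.claude/state/ops-doctor"}
  # Assumed timestamp format, e.g. 20250101T120000Z.
  stamp=$(date -u +%Y%m%dT%H%M%SZ)
  src="$state_dir/incidents-$cwd_hash.jsonl"
  dest="$state_dir/incidents-$cwd_hash-sent-$stamp.jsonl"

  if [ -f "$src" ]; then
    # Never overwrite an existing archive; history stays untouched.
    [ ! -e "$dest" ] || { echo "archive already exists: $dest" >&2; return 1; }
    mv "$src" "$dest"
  fi
  # rm -f succeeds quietly when the flag is absent.
  rm -f "$state_dir/pending-$cwd_hash.json"
}
```

Call it only after the send has returned a task ID, for example `[ -n "$TASK_ID" ] && archive_incidents "$CWD_HASH"`, where `CWD_HASH` is a placeholder for however the hash is obtained.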

Step 6: Report to User

Tell the user:

How many incidents were reported
Brief summary of each (1 line)
The mailbox task ID
That Codex will judge independently what needs fixing
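
The four items above fit a fixed template. A minimal sketch, assuming the one-line summaries have been written to a file, one per line; `report_to_user` and its output wording are hypothetical.

```shell
# Hypothetical helper: print the Step 6 summary for the user.
# $1: mailbox task ID, $2: file with one 1-line summary per incident
report_to_user() {
  task_id=$1
  summaries=$2
  # Count non-empty lines only; grep -c exits 1 on zero matches, hence || true.
  count=$(grep -c . "$summaries" || true)
  echo "Reported $count incident(s) to Codex:"
  grep . "$summaries" | sed 's/^/  - /'
  echo "Mailbox task ID: $task_id"
  echo "Codex will judge independently what needs fixing."
}
```

Because the count is derived from the same file that is printed, the number reported always matches the list the user sees.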

Rules

Ground truth only. Report exactly what happened. No euphemisms, no passive voice that hides who did what.
- Bad: "The session encountered an authentication issue"
- Good: "I didn't run /ops:run and tried to hack a stale session instead"
Don't minimize. If you wasted 200K tokens on a mistake, say so. If you blamed tooling when it was your fault, say so.
Don't prescribe fixes. Describe the problem. Let Codex decide the solution. You already proved your judgment is unreliable by making the mistake in the first place.
Include the user's words. When the user corrected you, quote them. Their perspective is more reliable than yours about what went wrong.
Default to your fault. Unless you have concrete evidence that tooling is broken (error in code, missing file, wrong logic), assume the problem is your behavior.