name: ccb-comm-reply-recover
description: Diagnose and recover CCB communication and reply delivery stalls. Use when a user reports a missing CCB_REPLY, stuck ask, agent stuck busy/delivering, queued work behind an active job, cancelled/incomplete reply, empty artifact, callback not continuing, duplicate retry after success, or a CCB mailbox/communication backend that appears stuck.
CCB Comm Reply Recover
Overview
Use this skill for user-visible "I did not receive the reply" incidents. It combines message lineage repair with mailbox and provider-pane evidence, then hands off to runtime recovery only when the chain evidence proves the provider process or pane must be replaced.
Mutations must go through CCB control-plane commands. Do not directly edit mailbox, lifecycle, lease, runtime, provider-session, artifact, or tmux authority files.
Core Workflow
- Identify the target from the user's evidence:
  - `job_id`, `message_id`, `attempt_id`, `reply_id`, inbound event id, or artifact path when provided.
  - agent name when the user only says a target is stuck.
  - `ccb queue --detail all` when neither id nor agent is clear.
- Trace lineage first:
  - `ccb trace <id>`
  - record message, attempt, reply, event, callback, and job states.
  - read the full artifact file before acting when a request or reply is artifact-backed.
- Inspect mailbox head-of-line state:
  - `ccb queue --detail <agent>`
  - `ccb pend --inbox --detail <agent>`
  - if the requested job is queued behind an active event, trace the active job before touching the queued job.
- Cross-check runtime evidence when a job is `running` or `delivering` longer than expected:
  - `ccb ps`
  - `ccb ping <agent>`
  - `ccb doctor logs <agent>` when provider/pane failure is plausible.
  - read-only tmux pane capture only from the socket and pane id reported by `ccb ps`, and only as evidence.
- Classify the incident, choose the least disruptive repair, then re-run trace and queue checks.
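
The read-only evidence pass above can be sketched as a small planner that returns the commands to run, in order. This is a minimal sketch and not part of CCB: the function name `evidence_commands` and its parameters are hypothetical, and it only assembles command strings that already appear in this workflow. It does not execute anything.

```python
from typing import List, Optional


def evidence_commands(target_id: Optional[str] = None,
                      agent: Optional[str] = None,
                      long_running: bool = False,
                      pane_failure_plausible: bool = False) -> List[str]:
    """Return the read-only CCB commands for the workflow above, in order.

    Hypothetical helper: it only builds strings and never mutates state.
    """
    if target_id is None and agent is None:
        # Neither id nor agent is clear: start from the global queue view.
        return ["ccb queue --detail all"]

    commands: List[str] = []
    if target_id is not None:
        # Trace lineage first.
        commands.append(f"ccb trace {target_id}")
    if agent is not None:
        # Inspect mailbox head-of-line state.
        commands.append(f"ccb queue --detail {agent}")
        commands.append(f"ccb pend --inbox --detail {agent}")
        if long_running:
            # Cross-check runtime evidence for a long running/delivering job.
            commands.append("ccb ps")
            commands.append(f"ccb ping {agent}")
            if pane_failure_plausible:
                commands.append(f"ccb doctor logs {agent}")
    return commands
```

Returning a plan rather than running it keeps the sketch strictly read-only, which matches the rule that mutations go through CCB control-plane commands only after classification.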
Incident Classes
- `normal_running`: the active job is visible in the provider pane and making progress. Report that it is running; do not restart or duplicate-submit.
- `head_of_line_blocked`: an active job is `running`/`delivering`, later jobs are queued behind it, and pane/log evidence shows stale, dead, mismatched, or non-progressing provider state.
- `queued_behind_active`: the user's job is not lost; it is queued behind an active event. Repair the active event only if evidence proves it is blocked.
- `provider_pane_stale`: CCB reports the runtime healthy, but pane text/logs show an old prompt, a dead/update prompt, or a request that does not match the active lineage.
- `empty_cancel_artifact_expected`: trace shows terminal `cancelled` with `cancel_info`, and the empty artifact came from an intentional cancel.
- `empty_bad_artifact`: a completed, failed, or required artifact-backed reply is absent, zero bytes, truncated, or unreadable without a valid cancel reason.
- `duplicate_retry_after_success`: a later retry or resubmission of the same work is still queued/running after another attempt already completed and the user received the needed reply.
- `callback_or_ack_stalled`: a reply exists and is acceptable, but callback or inbox progress did not advance.
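
To show how these classes separate, here is a rough classification sketch. It is an illustration under stated assumptions, not CCB behavior: the `Evidence` fields are hypothetical summaries of what trace, queue, and pane checks found, the state strings are assumed spellings, and the sketch simplifies `provider_pane_stale` and omits `duplicate_retry_after_success`, which needs lineage comparison across attempts.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical summary of trace, queue, and pane findings."""
    job_state: str = "running"        # state reported by ccb trace (assumed spelling)
    pane_progressing: bool = True     # pane/log evidence shows progress
    jobs_queued_behind: bool = False  # later jobs wait behind this one
    has_cancel_info: bool = False     # trace carries cancel_info
    artifact_required: bool = False   # the reply is artifact-backed
    artifact_readable: bool = True    # artifact is present, non-empty, readable
    reply_accepted: bool = False      # a reply exists and is acceptable
    progress_advanced: bool = True    # callback/inbox progress moved on


def classify(ev: Evidence) -> str:
    """Map evidence to one incident class name, or 'unclassified'."""
    active = ev.job_state in ("running", "delivering")
    if active and ev.pane_progressing:
        return "normal_running"
    if active and ev.jobs_queued_behind:
        return "head_of_line_blocked"
    if active:
        # Simplification: active, not progressing, nothing queued behind it.
        return "provider_pane_stale"
    if ev.job_state == "queued":
        return "queued_behind_active"
    if ev.job_state == "cancelled" and ev.has_cancel_info and not ev.artifact_readable:
        return "empty_cancel_artifact_expected"
    if ev.artifact_required and not ev.artifact_readable:
        return "empty_bad_artifact"
    if ev.reply_accepted and not ev.progress_advanced:
        return "callback_or_ack_stalled"
    return "unclassified"
```

The check order matters: the intentional-cancel case is tested before `empty_bad_artifact` so that an expected empty artifact is never reported as a missing reply.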
Repair Rules
- If trace shows the blocking job is still in flight, run `ccb cancel <job_id>` first when the user supplied maintenance intent. If cancel fails or reports a blocker, stop.
- Prefer `ccb repair retry <job_id|attempt_id>` when the same work should run again and the original lineage remains valid.
- Prefer `ccb repair resubmit <message_id>` when the old execution lineage is stale, context-corrupted, semantically wrong, or no longer suitable.
- Use `ccb repair ack <agent_name> [inbound_event_id]` only when the reply is already accepted and progress state is wrong.
- Cancel `duplicate_retry_after_success` jobs rather than letting an agent run the same review or repair twice.
- Hand off to `ccb-self-recover` for `ccb restart <agent>` only after chain repair clears or cancels active work and the target remains stale, dead, or unusable. Restart is not the first repair for a communication stall.
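
The rules above can be read as a lookup from incident class to the least disruptive command. The sketch below is one possible encoding, with assumptions labeled: `choose_repair` and its parameters are hypothetical, the mapping of `empty_bad_artifact` to retry or resubmit is an interpretation of the two "Prefer" rules, and the function only returns a command string for an operator to review. It never restarts anything, since restart is a handoff rather than a first repair.

```python
from typing import Optional


def choose_repair(incident: str, *,
                  job_id: str = "<job_id>",
                  message_id: str = "<message_id>",
                  agent: str = "<agent_name>",
                  lineage_valid: bool = True,
                  maintenance_intent: bool = False) -> Optional[str]:
    """Return a suggested CCB command, or None when the right move is to report only."""
    if incident in ("normal_running", "empty_cancel_artifact_expected",
                    "queued_behind_active"):
        # Nothing is broken on this path: report, do not mutate.
        return None
    if incident == "duplicate_retry_after_success":
        return f"ccb cancel {job_id}"
    if incident == "callback_or_ack_stalled":
        return f"ccb repair ack {agent}"
    if incident in ("head_of_line_blocked", "provider_pane_stale"):
        # Cancel the blocking job only with user-supplied maintenance intent.
        return f"ccb cancel {job_id}" if maintenance_intent else None
    if incident == "empty_bad_artifact":
        # Assumption: valid lineage favors retry, stale lineage favors resubmit.
        if lineage_valid:
            return f"ccb repair retry {job_id}"
        return f"ccb repair resubmit {message_id}"
    return None
```

For example, `choose_repair("head_of_line_blocked", job_id="j1")` yields `None` until the user has supplied maintenance intent, which mirrors the first rule.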
Verification
After every repair:
- `ccb trace <old_job_or_message>` proves the old path is terminal, completed, or intentionally cancelled.
- `ccb queue --detail <agent>` and `ccb pend --inbox --detail <agent>` show no unexpected active head-of-line blockage.
- the desired job is completed, or a fresh valid job is queued/running with no duplicate path.
- artifact-backed replies are read from the full artifact file when needed.
- report intentional cancelled empty artifacts as expected maintenance output, not as missing user replies.
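
As a hedged illustration, the post-repair checks can be collected into one function that lists what still fails. Everything here is hypothetical scaffolding: the inputs are values an operator would read off `ccb trace`, `ccb queue`, and `ccb pend` output by hand, and the state names in the two sets are assumed spellings rather than confirmed CCB states.

```python
from typing import List

# Assumed state names; adjust to whatever ccb trace actually reports.
TERMINAL_STATES = {"completed", "cancelled", "failed"}
ACCEPTABLE_DESIRED_STATES = {"completed", "queued", "running"}


def verification_failures(old_path_state: str,
                          head_of_line_blocked: bool,
                          desired_job_state: str,
                          duplicate_paths: int) -> List[str]:
    """Return the verification checks that still fail after a repair."""
    failures: List[str] = []
    if old_path_state not in TERMINAL_STATES:
        failures.append("old path is not terminal")
    if head_of_line_blocked:
        failures.append("unexpected head-of-line blockage remains")
    if desired_job_state not in ACCEPTABLE_DESIRED_STATES:
        failures.append("desired job is neither completed nor validly queued/running")
    if duplicate_paths > 0:
        failures.append("duplicate path for the same work still exists")
    return failures
```

An empty list means the repair can be reported as verified; any entry means re-running trace and queue checks before taking further action.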
Example Pattern
In an archi incident, the user reported that no architecture review reply
arrived. Trace showed one old active job stuck in running/delivering, while
later asks were queued behind it. Provider logs showed a Codex update and pane
death, but ccb ps still showed a bound pane. The correct repair was:
- cancel the stale active job;
- verify the next queued job entered the provider pane and was progressing;
- avoid restart while valid work was running;
- accept the completed new reply;
- cancel the duplicate old retry that remained after success;
- verify the target mailbox returned to idle.
Red Lines
- Do not restart, clear, reload, or kill before tracing the message lineage.
- Do not submit a second concurrent path for the same work while the original path is still active unless the user explicitly retargets or duplicates the task.
- Do not treat observer snapshots as terminal authority; use `ccb trace`.
- Do not infer from artifact preview text when the full artifact is required.
- Do not mutate tmux directly or write CCB runtime authority files.
- Do not read provider secrets, credentials, API keys, or unrelated private provider state.