name: evaluate-issue-pr
description: Independently evaluate a PR's implementation against its approved plan. Run from inside the feature worktree. Can make fixes. Auto-merges on green; pass --manual-merge to opt out. Usage: /pipeline:evaluate-issue-pr [--manual-merge]
disable-model-invocation: false
allowed-tools: Read, Write, Edit, Bash, Glob, Grep, Skill, mcp__playwright_*
Boot
Source pipeline.config so PIPELINE_* variables are available:
source "$(pwd)/pipeline.config" 2>/dev/null || source ./pipeline.config
# Self-resolve CLAUDE_PLUGIN_ROOT in case the env var is unset in the Bash subshell.
# Anchor via the plugin cache glob (var-independent — no chicken-and-egg dependence on
# CLAUDE_PLUGIN_ROOT to FIND the resolver). _cpr_dir is the dir prefix; literal source line.
_cpr_dir="${CLAUDE_PLUGIN_ROOT:+${CLAUDE_PLUGIN_ROOT}/}"
_cpr_dir="${_cpr_dir:-$(ls -d ${HOME}/.claude/plugins/cache/claude-pipeline-local/pipeline/*/ 2>/dev/null | sort -V | tail -1)}"
_cpr_dir="${_cpr_dir:-$(ls -d ${HOME}/.claude/plugins/cache/claude-pipeline/pipeline/*/ 2>/dev/null | sort -V | tail -1)}"
source "${_cpr_dir}scripts/_resolve-plugin-root.sh" 2>/dev/null || true
Invocation mode
Three dispatch shapes; every step below is identical, only CWD + visual-proof setup differs:
- Inline
Agent(...)dispatch (PATH A, docs-only; PATH B, standard). Worktree absolute path + issue number in the prompt. You are NOT in the worktree CWD —cd <worktree-absolute-path>before any step. (Per #748 PATH B PR-eval dispatches inline here, alongside PATH A — the inline B execute Agent and inline B PR-eval Agent are SEPARATE inline contexts for evaluator independence.) - PATH C (multi-task) PR-eval — inline
Agent(...)by default (#749/#891/#896);${CLAUDE_PLUGIN_ROOT}/scripts/spawn-claude.sh/claude -pdispatch only under--spawnand explicit requests. Already in the feature worktree; nocd. - Inline Agent dispatch (browser-eval; triggered by the
needs-browserlabel). Worktree absolute path + issue number + PR + port + target-dir-abs + auto-merge-gate token are pre-resolved by the orchestrator; youcd <worktree-abs>and start the loopback HTTP server before any other step. The dispatch path is in-processAgent()triggered by theneeds-browserlabel on the PR's source issue. Visual proof for these PRs reads fromhttp://127.0.0.1:$PORT/against the durable URL substringraw.githubusercontent.com/<owner>/<repo>/<merge-sha>/.eval-screenshots/once Step 11.3 rewrites the eval comment — branch-pinned URLs apply during the review window only. See Step 6c for the loopback server setup and Step 11 for the unchanged auto-merge gate.
For PATH A and PATH B (both inline) the orchestrator threads the manual-merge opt-out by including MANUAL_MERGE=1 in the prompt (mirroring spawn-claude.sh --manual-merge). Inline token and env var are equivalent — both suppress the Step 11 greenlight.
Lifecycle
PR → ci check → review → verdict → (if Approved + greenlight) merge
Issue Evaluator
You are a senior engineer reviewing a PR against its approved plan. You have NO context from the implementer — only the plan and the diff.
Rules:
- Verify every plan item was implemented — check them off one by one.
- Look for what the implementer DIDN'T do, not just what they did.
- Fix-vs-flag. Fix small issues yourself (typos, missing imports, off-by-one). Flag the rest. "Significant rework" = changes touching more than 3 files or requiring new design decisions — flag, don't fix.
- Never refactor, add features, or improve code beyond the plan.
Steps
Fetch the approved plan (trust-gated). The ONLY authoritative plan source is a trusted-authored
## Implementation Plancomment — one whoseauthorAssociationis a write-access tier (OWNER/MEMBER/COLLABORATOR). Any comment from an author outside that write-access set (a non-contributor — e.g.NONE/FIRST_TIMER/ unknown association) is hard-dropped before selection and can never be chosen as the plan. Because untrusted comments are removed beforelastis applied, trust dominates recency: a later fake## Implementation Planplanted by a non-contributor can never override the operator's plan.Trust is delegated to #545's helper (
scripts/filter-trusted-comments.sh) as the single source of trust truth — do NOT re-implement or widen the tier set inline. Iterate comments oldest→newest, keep only## Implementation Plancandidates, gate each through the helper'sis-trusted-authormode, and let the latest trusted candidate win:COMMENTS_JSON=$(gh issue view <N> --repo "$PIPELINE_REPO" --json comments) PLAN="" while IFS=$'\t' read -r ASSOC B64; do BODY=$(printf '%s' "$B64" | base64 -d) case "$BODY" in *"## Implementation Plan"*) ;; *) continue ;; esac if bash "${CLAUDE_PLUGIN_ROOT}/scripts/filter-trusted-comments.sh" is-trusted-author "$ASSOC"; then PLAN="$BODY" # latest TRUSTED plan wins; untrusted candidates never reach here else echo "ignored untrusted plan comment (author association: $ASSOC)" >&2 fi done < <(jq -r '.comments[] | [.authorAssociation, (.body | @base64)] | @tsv' <<<"$COMMENTS_JSON")If
PLANis empty, STOP: "No implementation plan found for issue #N." (Either no plan exists, or every## Implementation Plancandidate was authored by an untrusted account — the stderr audit lists the dropped authors.)Fetch the PR number and diff:
BRANCH=$(git rev-parse --abbrev-ref HEAD) PR_NUM=$(gh pr list --repo $PIPELINE_REPO --head "$BRANCH" --json number --jq '.[0].number') gh pr diff $PR_NUM --repo $PIPELINE_REPOIf no PR found, STOP: "No open PR found for branch $BRANCH."
Read project context:
CLAUDE.mdin worktree root, plus every file under.claude/scratch/issue-<N>/so your review sees the same evidence the planner and executor saw. For each file printed byls -1 .claude/scratch/issue-<N>/, invoke theReadtool exactly once before scoring. Skip if the directory is empty/absent.Two-phase review. The orchestrator's
--append-system-prompt(injected byspawn-claude.shbased on this issue's labels — seePIPELINE_PATH_<X>_SKILLS_EVALUATE_PR) requires the path-specific review skills first. If the Skill tool is unavailable, run inline:Phase 1 — Plan compliance. For each plan item: verify "Files to change" were modified and match descriptions; verify "DB schema / API / Frontend / Test changes" were made or correctly skipped; flag scope creep (implemented but not planned) and missing work (planned but not implemented). Every plan Task naming a non-test artifact path must have produced a tracked file at that path (
git ls-files-visible) — a planned-but-untracked artifact path is missing work → Flagged.Phase 2 — Code quality. Run checks and review the diff. Resolve CI status (Step 5) BEFORE deciding whether to run tests — the test-execution decision keys off the already-settled rollup, so Step 5b's
--watchis the single source of the settled verdict and Phase 2 never issues a second--watch/--wait.Typecheck always runs (cheap; not covered by the CI-suite-trust rationale):
$PIPELINE_TYPECHECK_CMD 2>&1 | head -50Test execution is CI-aware (green-CI short-circuit, issue #957). Read the settled
statusCheckRollupverdict once — reuse the same all-SUCCESS jq predicate asscripts/auto-merge-gate.sh(length > 0 and all(.conclusion == "SUCCESS")); do NOT re-run the watch/wait (Step 5b already settled the checks):ROLLUP_GREEN=$(gh pr view $PR_NUM --repo $PIPELINE_REPO --json statusCheckRollup \ --jq '.statusCheckRollup | length > 0 and all(.conclusion == "SUCCESS")') if [ "$ROLLUP_GREEN" = "true" ] && [ "${PIPELINE_CI_CHECK_ENABLED-true}" = "true" ]; then echo "CI: green rollup — trusting CI suite verdict; skipping full local PIPELINE_TEST_CMD re-run (issue #957)" # SKIP the full-suite invocation. The green CI rollup ran the same suite; re-running it # locally is pure duplication. Typecheck (above), diff-vs-plan review, acceptance checks, # and adversarial code-quality review all still run — only the blanket suite re-run is removed. else # Fallback — CI rollup not green, empty/no-CI, or PIPELINE_CI_CHECK_ENABLED disabled: # run tests locally, SCOPED TO TOUCHED TESTS rather than the whole suite. Derive the # touched test files from the PR diff and run only those; if none were touched, fall back # to the full $PIPELINE_TEST_CMD (never skip testing when CI is untrusted). TOUCHED_TESTS=$(gh pr diff $PR_NUM --repo $PIPELINE_REPO --name-only | grep -E '(^|/)tests?/' || true) if [ -n "$TOUCHED_TESTS" ]; then while IFS= read -r t; do [ -f "$t" ] && timeout 300 bash "$t" </dev/null; done <<< "$TOUCHED_TESTS" else timeout 600 bash -c "$PIPELINE_TEST_CMD" </dev/null 2>&1 | tail -30 fi fiDedup guard (hard constraint). The full-suite
$PIPELINE_TEST_CMDis invoked at most once per eval and never viarun_in_background— no overlapping/duplicate full-suite sweeps (the harness auto-backgrounding that caused the #955/#956 duplicate sweeps). It always runs synchronously in the foreground, mirroring the Step 5b--watchno-run_in_backgroundrule.Trust boundary (assumption). The short-circuit trusts that the green CI suite is the SAME suite as
$PIPELINE_TEST_CMD. This holds for the dogfood repo (CI runs the identicaltests/test*.shsweep). Consumers with divergent CI should understand this trust boundary.Look for: leftover debug code / console.logs / TODOs; missing error handling at system boundaries; security issues (injection, XSS, unsanitized input); type-safety issues
tscmissed; test coverage for every implemented feature.
Check CI workflow status. A red PR must never receive Approved.
5a. Detect CI presence. Skip if no checks configured:
CHECK_COUNT=$(gh pr view $PR_NUM --repo $PIPELINE_REPO --json statusCheckRollup --jq '.statusCheckRollup | length')If
CHECK_COUNTis 0, log"CI: none configured — skipping status check"and proceed to Step 6.5b. Wait for in-progress checks to settle. Single bounded foreground (blocking, in-turn) wait via
Bash— do NOT userun_in_background, and do NOT wrap in awhile ... sleep ... greploop. A subagent cannot durably block on a backgrounded monitor: a backgroundedBashreturns immediately, ending the subagent's turn before it reaches Step 5c → the verdict (Step 9) → the auto-merge gate (Step 11) (issue #684). Thetimeout 600 ... --watch --fail-fastbelow IS the hard iteration cap: it runs to completion in this turn, returns nonzero on the first failing check, and blocks until CI settles or the 600s budget elapses (10-min one-shot, 30-second poll):timeout 600 gh pr checks $PR_NUM --repo $PIPELINE_REPO --watch --fail-fast --interval 305c. Inspect final check state.
gh pr view $PR_NUM --repo $PIPELINE_REPO --json statusCheckRollup \ --jq '.statusCheckRollup[] | select(.conclusion == "FAILURE" or .conclusion == "CANCELLED") | {name: .name, conclusion: .conclusion, url: .detailsUrl}'For each failed check, fetch the first error line. Parse RUN_ID from
detailsUrl; if parsing yields empty/non-numeric, use thegh run listfallback:RUN_ID=$(echo "$DETAILS_URL" | sed 's|.*/runs/\([0-9]*\)/.*|\1|') if ! [[ "$RUN_ID" =~ ^[0-9]+$ ]]; then BRANCH=$(git rev-parse --abbrev-ref HEAD) RUN_ID=$(gh run list --repo $PIPELINE_REPO --branch "$BRANCH" --status failure --limit 1 --json databaseId --jq '.[0].databaseId') fi gh run view "$RUN_ID" --repo $PIPELINE_REPO --log-failed 2>&1 | head -205d. Decide verdict interaction:
- Any FAILURE/CANCELLED → you MUST NOT post Approved.
- Fixable within budget (≤3 files, no new design decisions): fix in-worktree, commit,
git push, re-run 5b with a fresh 10-minute timeout. - Exceeds budget, persists after one fix attempt, or wait timed out: post "Flagged" with failing job names + first error line in the
**CI status:**row (see Step 9).
Hook-enforced. The
enforce-ci-waitStop hook (hooks/enforce-ci-wait.py) reads.claude/logs/tool-use.logand blocks Stop unless thegh pr view→gh pr checks --watch→gh pr viewsequence is recorded; an Approved verdict on a red rollup is also blocked. Prose remains source of truth for HOW; the hook only verifies it happened.
Visual validation (if UI changes exist in the diff). Two tiers: a baseline screenshot/console pass, then a verdict layer for
needs-browserissues.6a. Baseline — Playwright MCP probe + screenshot plumbing. Check Playwright MCP via
cat .mcp.json 2>/dev/null. If available and on Linux: navigate to affected views, screenshot to<worktree>/.claude/scratch/*.png, check console for JS errors, verify UI matches the plan. Otherwise note: "Visual validation skipped — Playwright MCP not available."Attach screenshots to the eval comment. For each PNG, invoke the attach helper, then verify the file actually reached the remote before embedding its link in Step 9's
**Screenshots:**row. The helper commits the PNG to<worktree>/.eval-screenshots/, pushes to the PR branch, and returns a branch-pinnedraw.githubusercontent.com/<owner>/<repo>/<branch>/.eval-screenshots/<name>.pngURL. These URLs initially resolve via the branch-pinned form during the PR review window; after auto-merge fires, Step 11.3 rewrites them to the merge-SHA-pinned form (raw.githubusercontent.com/<owner>/<repo>/<merge-sha>/.eval-screenshots/...), which is durable for the life of the commit (issue #506, superseding the Option A ephemeral behaviour of tracker #383). Operators who prefer the legacy ephemeral behaviour may setPIPELINE_SCREENSHOT_REWRITE_ENABLED=false.Private repos (issue #551). On a private repo the attach helper emits a
github.com/<owner>/<repo>/blob/<branch>/.eval-screenshots/...URL instead, and Step 6 wraps it as a clickable[name](url)link (not![]()). Reason: GitHub's camo image proxy fetches![]()image URLs anonymously, and a private repo'sraw.githubusercontent.comcontent 404s anonymously — so the inline embed renders broken. The blob link routes through GitHub's authenticated file viewer, which renders the PNG for repo members. True inline rendering on a private repo is only possible via GitHub'suser-attachmentsCDN (browser drag-drop upload, which needs a browser session + CSRF token and is NOT reachable viagh/PAT) — blob links are the CLI-feasible answer; drag-drop is the manual inline alternative. Visibility is detected fail-soft (gh repo view "$PIPELINE_REPO" --json isPrivate); anyghabsence/error/non-truevalue falls back to the public raw +![]()behaviour.Failure-loud verification. A returned URL is not proof the blob landed on origin —
git pushcan fail silently inside the sandbox. Before writing anyrow, confirm the branch exists on the remote (git ls-remote --exit-code origin "refs/heads/$BRANCH") AND the specific file is present at that branch tip (gh api repos/$PIPELINE_REPO/contents/.eval-screenshots/$name?ref=$BRANCH). On failure, emit a⚠️ screenshot attach failedrow instead of a broken-link image so the human reviewer gets a self-debugging trail.SCREENSHOT_LINES=() BRANCH="$(gh pr view "$PR_NUM" --repo "$PIPELINE_REPO" --json headRefName --jq .headRefName)" PRIVATE="$(gh repo view "$PIPELINE_REPO" --json isPrivate --jq .isPrivate 2>/dev/null || true)" for png in .claude/scratch/*.png; do [ -f "$png" ] || continue name="$(basename "$png")" url="$(bash "${CLAUDE_PLUGIN_ROOT}/mock-web-eval/scripts/eval-screenshot-attach.sh" "$PR_NUM" "$(realpath "$png")" 2>/dev/null || true)" if [ -n "$url" ] \ && git ls-remote --exit-code origin "refs/heads/$BRANCH" >/dev/null 2>&1 \ && gh api "repos/$PIPELINE_REPO/contents/.eval-screenshots/$name?ref=$BRANCH" --jq .sha >/dev/null 2>&1; then if [ "$PRIVATE" = "true" ]; then # Private repo: camo proxy can't fetch raw.githubusercontent.com anonymously (404), # so use a clickable blob link — GitHub's authenticated viewer renders the PNG for members. SCREENSHOT_LINES+=("- [${name%.*}](${url})") else SCREENSHOT_LINES+=("- ") fi else SCREENSHOT_LINES+=("- ⚠️ screenshot attach failed — see .eval-screenshots/${name} in the worktree") fi done6b. Visual proof verdict (needs-browser issues only). If the issue carries the needs-browser label, invoke
Skill(skill: "pipeline:visual-proof-from-plan")in this evaluator session and parse its JSON output. For every entry inunsatisfied, the verdict MUST be Flagged for user review — record the claim and the failing artifact path/URL in the Remaining issues row. This is the load-bearing trust layer;satisfiedpredicates from the executor session do NOT carry over.6c. Inline-mode visual proof setup (issue #517, #527 — applies when invoked via the inline Agent dispatch for browser-eval, dispatch mode #3 above). Before any
browser_navigate/browser_evaluatecall, bootstrap the loopback server via the single-responsibility helperscripts/visual-proof-server-start.sh(composes the port broker + startspython3 -m http.server --directory <target> --bind 127.0.0.1+ readiness probe). The helper allocates the port itself, so this path no longer depends on the orchestrator pre-resolving$PORT:SERVER_LINE=$(bash "${CLAUDE_PLUGIN_ROOT}/scripts/visual-proof-server-start.sh" \ "${SLATE_INDEX:-0}" "$TARGET_DIR" 2>&1) \ || { echo "$SERVER_LINE"; exit 1; } # block-server-start: ... on stderr PORT=$(printf '%s\n' "$SERVER_LINE" | sed -n 's/^SERVER: .*port=\([0-9]*\) .*/\1/p') SERVER_PID=$(printf '%s\n' "$SERVER_LINE" | sed -n 's/^SERVER: pid=\([0-9]*\) .*/\1/p') trap 'kill "$SERVER_PID" 2>/dev/null' EXIT$TARGET_DIRis the absolute path under the worktree served by the server (fromPIPELINE_VISUAL_PROOF_TARGET_DIR). For the single-issue orchestrator-driven path (fullsend on a 1-issue inline slate) the orchestrator does NOT pre-resolve$PORT— it never reachesrun-queue.sh launch_agent()— so the helper allocating the port closes the #519 gap.$SLATE_INDEXdefaults to 0 for a single-issue dispatch.All subsequent
browser_navigate/browser_evaluatecalls targethttp://127.0.0.1:$PORT/<path>. The EXIT trap kills the server on normal exit and most signals; SIGKILL leaks are reaped byscripts/reap-stale-visual-proof-servers.sh(tracks by--directory, unchanged) invoked from/pipeline:statusStep 0 housekeeping.--bind 127.0.0.1is load-bearing — never bind to0.0.0.0(avoids external exposure during concurrent fullsend runs). The queue dispatch-inline path (run-queue.sh launch_agent()) emits only the EVENT line and does NOT start a server, so routing the single start through the helper introduces no double-bootstrap.Per-tool wall-clock budget. Wrap each
browser_evaluateandbrowser_navigatecall in a 60s wall-clock budget. On timeout, post Flagged with a timeout note and exit non-zero — this explicitly prevents theuntil-grep DONE_MARKERwedge pattern from issue #511 from migrating into the inline path. The 60s budget applies to inline-mode dispatch (mode #3) unconditionally.Selector pitfall (as of 2026-05-26). When clicking elements, prefer the
ref=identifier returned bybrowser_snapshotover CSS selectors with embedded quotes (e.g.#echo-form button[type="submit"]). The Playwright MCP server rejects the latter on the literal string (escaped-quote selectors fail to parse); theref=from the snapshot works instantly. Surfaced in the #525 / PR #526 dogfood. This is upstream Playwright MCP behavior, not a pipeline bug — flagged "as of 2026-05-26" so a future audit can re-verify it is still needed.If fixable issues found (≤3 files, no new design decisions): fix in-worktree, then
git commit -m "fix: evaluation fixes for #<N> — <summary>",git push, and re-run tsc + tests to confirm fixes don't break anything.Check for branch divergence before merge decision:
git fetch origin $PIPELINE_BASE_BRANCH git log --oneline HEAD..origin/$PIPELINE_BASE_BRANCH | head -5If
PIPELINE_BASE_BRANCHhas advanced,git rebase origin/$PIPELINE_BASE_BRANCH. If conflicts are complex (semantic, not whitespace), flag for user review.Post evaluation comment on the PR via
gh pr comment $PR_NUM --repo $PIPELINE_REPO --body "<evaluation>"using this format:TERSENESS: This is the highest-cost artifact in the pipeline — pr-eval output is ≈20% of total spend and output tokens are uncacheable. Emit decisions, not narration. Reference the plan and PR by
#N/#PR— do NOT paste the plan body or re-quote the diff. Each row below is a verdict, not an essay: cap**Code quality:**and**Remaining issues:**at ≈3 bullets each, one line per bullet; drop a row's prose entirely when its value is "No issues found" / "None". Plan-compliance checkboxes are the evidence — do not restate the plan item in prose after checking it.## Evaluation **Verdict:** Approved / Flagged for user review **Plan compliance:** - [x] <plan item> — implemented correctly - [ ] <plan item> — missing or incorrect: <detail> **Code quality:** <findings or "No issues found"> **CI status:** All checks passed / No CI configured / FAILED: <job names> — <first error line> / Timed out **Screenshots:** (one row per entry in `$SCREENSHOT_LINES` from Step 6 — already formatted as either an image row or a `⚠️` failure-loud row; `None` if empty) -  <!-- public repo: inline embed --> - [screenshot 1](https://github.com/owner/repo/blob/<branch>/.eval-screenshots/<filename>.png) <!-- private repo: clickable blob link (#551) --> **Visual proof:** satisfied=N/M; unsatisfied=[<claim>...] (or `N/A — needs-browser not applied`) **Fixes applied:** `<commit hash>` — <description> (or "None") **Remaining issues:** (if flagged) <what needs human attention and why>Report verdict: Approved → "PR #X approved — ready for merge." / Flagged → "PR #X flagged for review:
". The evaluator auto-merges on the Step 11 greenlight matrix unless --manual-mergewas passed or the issue carries themanual-mergelabel. Otherwise the orchestrator's step 8 handles merge.Auto-merge gate.
Authoritative owner of the greenlight check.
On
needs-browserissues, gate (1) requires zerounsatisfiedentries in the Visual proof row.Greenlight matrix — all 4 must hold (otherwise the PR is left for manual merge with a
block-*reason):- Latest
## Evaluationcomment contains**Verdict:** Approved. - Every entry in the PR's
statusCheckRolluphasconclusion == SUCCESS(or the rollup is empty for repos with no CI configured). mergeable == MERGEABLE.mergeStateStatus == CLEAN(not BLOCKED/BEHIND/DIRTY/UNSTABLE).
Dual-defense doctrine (issue #295). Base-branch enforcement is defense-in-depth across four layers: (i) the eval-time
baseRefName == $PIPELINE_BASE_BRANCHassertion insideauto-merge-gate.sh(Step 11.2 —block-base-mismatch); (ii) a TOCTOU re-read immediately beforegh pr mergein Step 11.3; (iii) the skill-level quoted--base "$PIPELINE_BASE_BRANCH"inexecute-issue-planStep 9b; (iv) theenforce-base-branch.pyPreToolUse hook overgh pr create/gh pr edit --base. The hook alone is insufficient — it has bypassed in production (#295) when consumer.claude/settings.jsonshadowed the plugin matcher or stalespawn-claude.shemitted an unnamespaced slash command (seedev/audits/295-root-cause.md). The eval-time gate is the load-bearing zero-data-loss layer.Flag parsing.
--manual-mergemay appear anywhere in argv — before or after the issue number; the parser is loop-based, not positional. Also honored via env:MANUAL_MERGE=1(exported byspawn-claude.shwhen the spawn carried--manual-merge) is equivalent. If either signal is set, skip Step 11 entirely and return Approved-but-not-merged.Source the helper and run the gate.
source "${CLAUDE_PLUGIN_ROOT}/scripts/auto-merge-gate.sh" REASON=$(auto_merge_should_fire "$ISSUE" "$PR_NUM")Checks in order:
MANUAL_MERGEenv,manual-mergeissue label, the 4 greenlight conditions above, andbaseRefName == $PIPELINE_BASE_BRANCH. Prints exactly one token:green,block-flag,block-label,block-verdict,block-base-mismatch,block-ci,block-mergeable, orblock-mergestate.
2b. Split-role gate (#881 —
PIPELINE_PATH_B_SPLIT_ROLE, defaulttrueper #1057, opt-OUT via=false). The split-role precondition applies ONLY to PRs that were actually dispatched as split-role. Before running the gate, resolve TWO guards — the issue's PATH letter and (for PATH B) the resolved dispatch shape — so the gate distinguishes "this PR was never a split-role dispatch (nothing to protect → pass/skip)" from "this split-role PR is missing its mandatory red anchor (real violation → block)" (#1076).PATH guard (skip for non-
B). Resolve the issue's PATH letter from its label:docs-only=A,multi-task=C,quick-fix=D, none of those=B. PATH != B → SKIP this gate entirely (an A/C/D PR — docs-only, multi-task, orquick-fix— is NEVER a split-role dispatch and never carries a[split-role-red]anchor; treat as pass / not-applicable and proceed to greenlight without requiringSPLIT_ROLE=pass). This is the deterministic interpretation that fixes the PATH-D permanentno-red-shablock (#1076): a PATH Dquick-fixPR is unconditionally skipped here, NOT blocked.Resolver-shape guard (PATH B only). For PATH B, the
PIPELINE_PATH_B_SPLIT_ROLEdefault-on (true/unset/empty per #1057) only MEANS split-role if the dispatch actually resolved that way. Resolve the dispatch shape and read itsSPLIT_ROLE=token:SPEC=$(PIPELINE_REPO="$PIPELINE_REPO" bash "${CLAUDE_PLUGIN_ROOT}/scripts/resolve-execute-dispatch.sh" "$ISSUE" B) SPLIT_SHAPE=$(printf '%s\n' "$SPEC" | grep '^SPLIT_ROLE=' | cut -d= -f2)SPLIT_ROLE=false(single-role dispatch — e.g. a W2 carve-out orPIPELINE_PATH_B_SPLIT_ROLE=false) → SKIP the gate (nothing to protect; no[split-role-red]anchor is expected, so a missing anchor is NOT a violation). Within PATH B, this resolver-shape guard subsumes the knob polarity (#1057): skip ONLY when the resolved shape is explicitlySPLIT_ROLE=false; withPIPELINE_PATH_B_SPLIT_ROLEunset/empty the resolver defaults the shape toSPLIT_ROLE=true(#1057) and the gate RUNS.SPLIT_ROLE=true(genuine split-role dispatch) → RUN the gate exactly as below; a missing red anchor or a tampered locked test MUST still hard-block.
When the gate RUNS (PATH B +
SPLIT_ROLE=true): runscripts/split-role-gate.shand requireSPLIT_ROLE=passas an ADDITIONAL greenlight precondition (mirrors theauto-merge-gate.shblock-token pattern: the script emits exactly one line and ALWAYS exits 0 — the verdict rides the token):export PIPELINE_BASE_BRANCH # ensure the gate subprocess inherits the var (#1066) # CI-trust signal (#1078, precedent #957): thread the green-rollup verdict # ALREADY resolved in Step 11.2 ($ROLLUP_GREEN — `length > 0 and all(.conclusion # == "SUCCESS")`) into the gate so its SECONDARY suite-green check is SKIPPED on # trust instead of re-running the ~9–11min $PIPELINE_TEST_CMD sweep a SECOND time. # NO `timeout` wrapper: the gate ALWAYS exits 0 and emits exactly one token, and # removing the redundant sweep removes the only slow step — a timeout-wrapper # expiry mid-sweep was the sole source of the #1065 empty-token race; do not # reintroduce one. The PRIMARY locked-test invariant still runs first in the gate. # # Shared-test exemption (#1089, Direction 3): parse the approved plan's optional # `**Shared tests (split-role):**` section into a newline-separated path list and # thread it into the gate as PIPELINE_SPLIT_ROLE_SHARED_TESTS. Two supported forms: # header-inline: `**Shared tests (split-role):** tests/test-foo.sh` (#1107) # following-bullet: `**Shared tests (split-role):**\n- tests/test-foo.sh` # Parser reads until the next `**…:**` header. These are the EXACT repo-relative # test file paths the plan sanctioned for green-role modification. Trust anchor: # $PLAN is already trust-gated (OWNER/MEMBER/COLLABORATOR) from Step 1. Absent # section → empty list → gate default-deny unchanged (fail-closed). Set on the gate # invocation ONLY — never read from pipeline.config (per-issue scoping). SHARED_TESTS_RAW=$(printf '%s\n' "$PLAN" \ | awk '/^\*\*Shared tests \(split-role\):\*\*/{found=1; rest=substr($0, index($0,":**")+3); gsub(/^[ `]+|`[ ]*$/,"",rest); sub(/[ \t]+[—-][ \t].*$/,"",rest); sub(/[ \t]+#.*$/,"",rest); gsub(/[ `]+$/,"",rest); if(rest!="") print rest; next} found && /^\*\*[^*].*:\*\*/{found=0} found && /^[- ]/{gsub(/^[- `]+|`[ ]*$/,"",$0); sub(/[ \t]+[—-][ \t].*$/,"",$0); sub(/[ \t]+#.*$/,"",$0); gsub(/[ `]+$/,"",$0); if($0!="") print $0}') SPLIT_ROLE_LINE=$(PIPELINE_REPO="$PIPELINE_REPO" \ PIPELINE_CI_ROLLUP_GREEN="$ROLLUP_GREEN" \ PIPELINE_SPLIT_ROLE_SHARED_TESTS="$SHARED_TESTS_RAW" \ bash "${CLAUDE_PLUGIN_ROOT}/scripts/split-role-gate.sh" "$ISSUE") # → SPLIT_ROLE=<pass|block> ISSUE=<N> REASON=<token>Parse the
SPLIT_ROLE=token.passis necessary for greenlight and the parse keys onpass(ANY reason) — so bothREASON=additive-okandREASON=additive-ok-ci-green(#1078: emitted whenROLLUP_GREEN=trueis threaded in, the gate trusts CI and SKIPS its secondary suite-green re-run) are accepted as a greenlight with NO parser change. Anyblocktoken (no-red-sha,locked-test-modified,locked-test-deleted,suite-red,unresolvable-base) leaves the PR for manual merge (same shape as theauto-merge-gate.shblock-*tokens). Theunresolvable-basetoken means the base ref could not be resolved in the gate subprocess — distinct fromno-red-sha(author never committed a red anchor) (#1066). The gate runs ONLY for a genuine split-role PR; for that case the block semantics are unchanged — a PR dispatchedSPLIT_ROLE=truethat carries no[split-role-red]anchor STILL hard-blocksno-red-sha(the real protection #1076 preserves). The ONLY #1076 change is the PATH + resolver guard deciding WHETHER the gate runs. This gate adds a precondition ONLY — pr-eval itself STAYS Opus in all configurations (W3), never gated by any new knob.On
green: (when split-role is enabled, Step 11.2b'sSPLIT_ROLE=passis an additional precondition for this branch)- TOCTOU re-check (issue #295). Immediately before the merge, re-read
baseRefName. A malicious or buggy actor could retarget the PR between Step 11.2's gate and the merge call.
IfBASE_RECHECK=$(gh pr view "$PR_NUM" --repo "$PIPELINE_REPO" --json baseRefName --jq .baseRefName 2>/dev/null) if [ -z "$BASE_RECHECK" ] || [ "$BASE_RECHECK" != "$PIPELINE_BASE_BRANCH" ]; then REASON="block-base-mismatch" # Fall through to Step 11.4: post the block comment and skip merge. fiREASONis nowblock-base-mismatch, jump to Step 11.4 — do not invokegh pr merge. - Merge synchronously (NOT
--auto), then capture the merge-commit SHA (empty$SHAfrom rare API lag → omit from close comment; the merge is authoritative):gh pr merge "$PR_NUM" --repo "$PIPELINE_REPO" --merge --delete-branch SHA=$(gh pr view "$PR_NUM" --repo "$PIPELINE_REPO" --json mergeCommit --jq .mergeCommit.oid) - Rewrite screenshot URLs to the merge SHA (issue #506, extended #551). The eval comment embeds branch-pinned screenshot URLs that 404 once
--delete-branchremoves the feature branch. On public repos these areraw.githubusercontent.com/<owner>/<repo>/<branch>/.eval-screenshots/...; on private repos they aregithub.com/<owner>/<repo>/blob/<branch>/.eval-screenshots/...(the blob-link form from Step 6). The rewriter branch-scope-pins BOTH host forms to the durable merge-SHA equivalent (.../<merge-sha>/.eval-screenshots/...) in one pass. Now that the authoritative merge SHA is captured, rewrite them. Must run AFTER the SHA capture (the SHA it pins to) and BEFORE the footer-append (so the rewriter targets the screenshot comment, not the footer). Fail-soft — never block the merge that already completed:if [ -n "$SHA" ] && [ "${PIPELINE_SCREENSHOT_REWRITE_ENABLED:-true}" = "true" ]; then bash "${CLAUDE_PLUGIN_ROOT}/scripts/rewrite-eval-screenshot-urls.sh" "$PR_NUM" "$SHA" \ || echo "WARN: post-merge URL rewrite failed for PR #${PR_NUM}" fi - Append the auto-merged footer (exact literal prefix — Step 8 of
run/SKILL.mdgreps it):TS=$(date -u +%Y-%m-%dT%H:%M:%SZ) FOOTER="Auto-merged: eval Approved + CI SUCCESS + MERGEABLE/CLEAN at ${TS}" gh pr comment "$PR_NUM" --repo "$PIPELINE_REPO" --body "$FOOTER" - Flip labels and close the issue (omit
(${SHA})if$SHAis empty). Swallow the benign "already closed" non-error (issue #813).gh pr mergeauto-closes the linked issue viaclosingIssuesReferencesa beat before this explicitgh issue closeruns, so the explicit close routinely fails with an "already closed" message. That is cosmetic — the final state (merged + closed) is already correct — so the guard treats analready closedstderr as success and only re-raises a genuine close failure (e.g. a transient API error). The label flip and close comment still run for the case where the PR body carried noCloses #Nlink: The label flip is delegated to the sharedfinalize-issue-labels.shhelper (issue #866): it addsmergedand strips the full pipeline lifecycle/path/priority set (not justpr-open), keeping all three merge-completion sites (this path,finish-manual-merge.sh,cleanup-worktree.sh) in lockstep. The close-comment / "already closed" guard (#813) is unchanged. Step 11.3 now passes--repo "$PIPELINE_REPO"explicitly (this subshell may not export it) and surfaces aWARNon finalize failure rather than silently no-op'ing on stale labels (issue #888).bash "${CLAUDE_PLUGIN_ROOT:-.}/scripts/finalize-issue-labels.sh" "$ISSUE" --repo "$PIPELINE_REPO" \ || echo "WARN: finalize-issue-labels exited non-zero for issue #${ISSUE}; labels may be stale (merge already completed)." CLOSE_SUFFIX=$([ -n "$SHA" ] && echo " (${SHA})" || echo "") CLOSE_ERR=$(gh issue close "$ISSUE" --repo "$PIPELINE_REPO" --comment "Merged via #${PR_NUM}${CLOSE_SUFFIX}. ${FOOTER}" 2>&1) if [ $? -ne 0 ]; then if printf '%s' "$CLOSE_ERR" | grep -qi 'already closed'; then echo "Issue #${ISSUE} already closed by gh pr merge (closingIssuesReferences) — benign (issue #813)." else echo "$CLOSE_ERR" >&2 exit 1 fi fi - Screenshots: no cleanup needed — the
.eval-screenshots/commit collapses into the merge-commit and the feature branch is deleted by--delete-branch. Step 11.3 (above) has already rewritten the eval comment's branch-pinned URLs — both the publicraw.githubusercontent.com/<owner>/<repo>/<branch>/.eval-screenshots/...form and the privategithub.com/<owner>/<repo>/blob/<branch>/.eval-screenshots/...blob link (issue #551) — to the merge-SHA-pinned form (.../<merge-sha>/.eval-screenshots/...), so the embedded screenshots stay durable for the life of the commit even after the feature branch is deleted (issue #506). This supersedes the Option A ephemeral-artifact tradeoff originally accepted in tracker #383, where branch-pinned URLs returned 404 post-merge and reviewers had to capture evidence during the review window. Operators who deliberately want the legacy ephemeral behaviour (e.g. external or legal-hold screenshot capture) setPIPELINE_SCREENSHOT_REWRITE_ENABLED=false, which skips the Step 11.3 rewrite and restores the tracker-#383 post-merge-404 semantics.
- TOCTOU re-check (issue #295). Immediately before the merge, re-read
On any
block-*reason: post a single comment explaining why auto-merge was skipped, then return Approved-but-not-merged. Do not flip labels or close the issue.gh pr comment "$PR_NUM" --repo "$PIPELINE_REPO" \ --body "Auto-merge skipped: ${REASON}. Run \`gh pr merge\` manually."block-base-mismatchextension. WhenREASON == block-base-mismatch(from Step 11.2's gate or Step 11.3's TOCTOU re-check), the comment body MUST also include a retarget suggestion:gh pr comment "$PR_NUM" --repo "$PIPELINE_REPO" \ --body "Auto-merge skipped: block-base-mismatch — PR baseRefName diverges from \$PIPELINE_BASE_BRANCH ($PIPELINE_BASE_BRANCH).
- Latest
Run `$CLAUDE_PLUGIN_ROOT/scripts/retarget-pr.sh $PR_NUM $PIPELINE_BASE_BRANCH` to retarget (or `gh pr edit $PR_NUM --base $PIPELINE_BASE_BRANCH` if retarget-pr.sh is unavailable)." ```
**Auto-apply the `manual-merge` label (issue #489).** After posting the `Auto-merge skipped:` comment — for ANY `block-*` reason — add the `manual-merge` label to the issue so the wedge becomes terminal-detectable by the run-queue runner on its next poll:
```bash
gh issue edit "$ISSUE" --repo "$PIPELINE_REPO" --add-label "manual-merge" 2>/dev/null || true
```
The label flip is what lets the runner (`scripts/run-queue.sh` `evaluator_finished_terminal()`) free the queue slot immediately instead of waiting for the per-agent 90-min timeout. Fails OPEN on `gh` error — the worst case is the pre-#489 behaviour (queue waits for the timeout). The label is permanent post-merge (`cleanup-worktree.sh` leaves it as a historical "this PR did not auto-merge" signal).
Release-please PRs are out of scope for this gate — they flow through `PIPELINE_RELEASE_PR_AUTO_MERGE` in Step 7b of `run/SKILL.md`.
Canonical Agent prompt template (assembled by orchestrator)
The inline Agent dispatch (mode #3 above) is launched by the orchestrator with the following prompt body. Fields are pre-resolved by the orchestrator — the subagent does NOT re-derive them:
You are dispatched to run /pipeline:evaluate-issue-pr <N> inline.
Context (pre-resolved by orchestrator — do not re-derive):
- Worktree: <abs-path>
- PR: <PR-num>
- Target dir: <PIPELINE_VISUAL_PROOF_TARGET_DIR resolved abs-path>
- Port: <P> (advisory; the helper re-allocates via the broker, --bind 127.0.0.1)
- Auto-merge: <gated|allowed> (per manual-merge label check)
Setup: cd <worktree>;
SERVER_LINE=$(bash "${CLAUDE_PLUGIN_ROOT}/scripts/visual-proof-server-start.sh" "${SLATE_INDEX:-0}" "$TARGET_DIR" 2>&1) || { echo "$SERVER_LINE"; exit 1; };
PORT=$(printf '%s\n' "$SERVER_LINE" | sed -n 's/^SERVER: .*port=\([0-9]*\) .*/\1/p');
SERVER_PID=$(printf '%s\n' "$SERVER_LINE" | sed -n 's/^SERVER: pid=\([0-9]*\) .*/\1/p');
trap 'kill "$SERVER_PID" 2>/dev/null' EXIT
Then: follow skills/evaluate-issue-pr/SKILL.md verbatim against http://127.0.0.1:$PORT.
Terminal state: post `## Evaluation` comment via gh; report verdict + auto-merge-gate token.
Field semantics:
- Worktree — absolute path to the feature worktree; subagent
cds here before any other step. - PR — PR number; threaded as
$PR_NUMfor the rest of the skill. - Target dir — absolute path under the worktree served by
python3 -m http.server; resolved fromPIPELINE_VISUAL_PROOF_TARGET_DIR. - Port — advisory only.
scripts/visual-proof-server-start.shre-allocates the port via the broker (scripts/visual-proof-port-broker.sh <slate_index>) at start time and emits the actualport=on itsSERVER:line; the Setup block parses$PORTfrom there. The single-issue orchestrator path (#527) does not pre-resolve this field at all. - Auto-merge — gate token threaded through to Step 11;
gatedmirrors--manual-merge/MANUAL_MERGE=1,allowedlets the Step 11 greenlight matrix decide.
Migration warning (issue #517 — owner: scripts/run-queue.sh launch_agent())
When a PR carries the needs-browser label but PIPELINE_VISUAL_PROOF_TARGET_DIR is unset on the operator's pipeline.config, the orchestrator (in scripts/run-queue.sh launch_agent(), NOT this skill) emits a one-time stderr warning plus a Notes column entry on the status table. Evaluation proceeds without visual proof — the warning is non-blocking and never blocks the verdict. This is the documented consumer-migration story for operators upgrading from 0.17.x to a release that defaults the inline browser-eval path; set PIPELINE_VISUAL_PROOF_TARGET_DIR to opt into inline visual proof; otherwise evaluation proceeds without it (non-blocking). The warning surface lives in run-queue.sh launch_agent() so it fires once per dispatch — this skill only documents the contract.
Constraints
- Do NOT read the executor's session logs or conversation history.
- Only inputs: plan comment, PR diff, codebase in worktree. The plan comment is trust-gated at the source (Step 1): only a trusted-authored (
OWNER/MEMBER/COLLABORATOR)## Implementation Plancomment is authoritative — a non-contributor's planted comment is hard-dropped before selection and is never a valid input. - Fixes must be minimal: typos, missing imports, small bugs. NOT refactoring.
- If a fix requires touching >3 files or new design decisions, flag instead of fixing.
- Never skip tsc or test validation.
- All PRs target
PIPELINE_BASE_BRANCH. All commits go to the feature branch. - Outside the Step 11 auto-merge gate, the evaluator does NOT merge, close issues, or change
pr-openlabels — only reviews, posts verdict, and (optionally) rebases against the base branch.