banker-preflip-validation - SKILL.md Agent Skill

name: banker-preflip-validation
description: >
  Read-only pre-flag-flip gate for the banker Q&A workflow shipped dormant in PR #178. Walks the operator through the documented pre-flip checklist before flipping BANKER_QA_OUTPUT=true in production. Tier 1 (offline, ~30s): gold-fixture parse-back via `node scripts/run-bankerqa-isolated.mjs --dry` (no API, no cost). Tier 2 (static, ~1m): `bash scripts/g4-readiness.sh` with Check B asserting all 5 banker-qa alert rules point at real, registry-wired metrics. Tier 3 (live, billable): ONE full non-Cardinal banker session under WRAPPED_SUBAGENTS=true + BANKER_QA_OUTPUT=true; assert that dispatch emits `mcp__subagents__run_banker_*`, the 9 banker artifacts land, the alert metrics get live series on /metrics, and frontend banker-mode renders. Emits a GO/NO-GO verdict. Never flips the flag itself; references docs/runbooks/g4-rollback-playbook.md for rollback. Triggers: "banker preflip", "validate banker mode", "pre-flag-flip gate", "preflip banker", "is banker mode ready to flip", "/banker-preflip-validation".

Banker Pre-Flip Validation

A thin operator-guidance gate for the BANKER_QA_OUTPUT=true production flip.

PR #178 shipped the banker Q&A workflow dormant (flag default off). The documented pre-flip gate is: one full non-Cardinal banker session passes AND the G4 alerts are real before flipping. The underlying scripts already exist — this skill wraps them into a single falsifiable checklist and a GO/NO-GO verdict.

This skill is read-only. It runs validation scripts and probes /metrics. It does NOT edit flags.env, redeploy, or flip the flag. The operator flips the flag manually (via /deploy + g4-operator-enable-disable.md § A) only after all tiers pass.

Configuration

Project Directory

/Users/ej/Super-Legal/super-legal-mcp-refactored — run all script commands from here (both scripts resolve REPO_ROOT relative to themselves, so cwd inside the repo is fine).

Base URL Resolution (Tier 3 only)

If --url <url> given, use that.
Else cat scripts/.staging-ip → http://<ip>:3001.
Else http://localhost:3001.
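
The resolution order above can be sketched as a small shell function. This is a minimal sketch, not the skill's actual implementation: the function name `resolve_base_url` and the `STAGING_IP_FILE` override are hypothetical, and it assumes `scripts/.staging-ip` holds a bare IP address on its first line.

```shell
# Hypothetical helper: resolve the Tier 3 base URL.
# Order: explicit --url, then scripts/.staging-ip, then localhost.
resolve_base_url() {
  # STAGING_IP_FILE is an illustrative override, not a documented variable.
  ip_file="${STAGING_IP_FILE:-scripts/.staging-ip}"
  staging_ip=""
  if [ -s "$ip_file" ]; then
    # Assumes the first line is a bare IP; strip surrounding whitespace.
    staging_ip=$(head -n 1 "$ip_file" | tr -d '[:space:]')
  fi
  if [ "${1:-}" = "--url" ] && [ -n "${2:-}" ]; then
    printf '%s\n' "$2"
  elif [ -n "$staging_ip" ]; then
    printf 'http://%s:3001\n' "$staging_ip"
  else
    printf 'http://localhost:3001\n'
  fi
}
```

Calling `resolve_base_url --url http://<ip>:3001` returns the given URL unchanged; with no arguments it falls through to the staging file and then to localhost.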

Invocation

/banker-preflip-validation — run Tier 1 + Tier 2 (offline + static), print Tier 3 plan.
/banker-preflip-validation --tier 1 — gold-fixture parse-back only.
/banker-preflip-validation --tier 2 — G4 readiness + alert-metric reality check only.
/banker-preflip-validation --tier 3 --url http://<ip>:3001 — drive + verify the live session.

Tier 3 is billable (1–2 Opus 4.8 agent calls + one full banker session) and requires a staging deployment with WRAPPED_SUBAGENTS=true already on. Run it deliberately, not on a loop.

Tier 1 — Gold-Fixture Parse-Back (offline, ~30s, no API)

Proves the validation path end-to-end against the Cardinal gold banker-question-answers.md without spending a token. This is the cheapest falsifiable signal that the parser/validator contract is intact.

node scripts/run-bankerqa-isolated.mjs --dry

Check	Pass criteria
Gold fixture parses clean	Script logs `PASS — gold fixture is parser-clean` and exits 0
Validator stats non-empty	`stats` block shows the expected Q-block count (from `specialist-coverage-state.json`)

FAIL (exit 1) here means the validator itself regressed (not a model issue) — investigate src/utils/knowledgeGraph/bankerQaValidator.js before anything else. Do not proceed.
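
Both pass criteria can be checked together from a captured run. The sketch below is illustrative only: `tier1_verdict` is a hypothetical name, and it assumes the PASS line appears somewhere in the combined stdout/stderr of the script. It matches on the stable tail of the log line rather than the full string.

```shell
# Hypothetical helper: decide Tier 1 from the script's exit code and its log.
#   $1 = exit code of the --dry run, $2 = file holding the captured output
tier1_verdict() {
  if [ "$1" -eq 0 ] && grep -F 'gold fixture is parser-clean' "$2" >/dev/null; then
    echo "PASS"
  else
    echo "FAIL"
    return 1
  fi
}

# Usage (from the repo root):
#   node scripts/run-bankerqa-isolated.mjs --dry > /tmp/tier1.log 2>&1
#   tier1_verdict "$?" /tmp/tier1.log
```

Requiring both signals guards against a run that exits 0 without ever reaching the parse-back step.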

Tier 2 — G4 Readiness + Alert-Metric Reality (static, ~1m)

Runs the full G4 operational-readiness gate and, critically, asserts that the banker-qa alerts reference real metrics — closing the false-green where an alert points at a metric that exists nowhere (Prometheus then shows "no data" forever and never fires).

bash scripts/g4-readiness.sh --static-only
# or, with a live staging target so Check B also probes /metrics:
BASE_URL=http://<ip>:3001 bash scripts/g4-readiness.sh

Check	Pass criteria
Overall G4 verdict	Script exits 0 (no FAIL lines). SKIPs that require a staging shell are acceptable at this tier.
Check B — 5 alerts defined	All 5 alert rules present in `prometheus/alerts-banker-qa.yml`: `BankerQAWriterFailure`, `BankerIntakeAnalystFailure`, `BankerQACoverageFail`, `BankerCertifierReject`, `BankerKGPhase1bLatency`
Check B — alert metrics wired in registry	Every metric the 5 alerts reference is declared in `src/utils/sdkMetrics.js` (script prints `metric wired in registry: <m>` per metric). The 4 base metrics are `claude_gate_check_results_total`, `claude_qa_dimension_score`, `claude_kg_phase_duration_ms`, `claude_kg_circuit_breaker_state`. Any `metric NOT in registry (dead alert)` line = NO-GO.
Check B — alert metrics live (if BASE_URL set)	Each metric either emits on `/metrics` or is reported as not-yet-emitted (a SKIP, acceptable pre-session — it will populate in Tier 3).
Rollback playbook present	`docs/runbooks/g4-rollback-playbook.md` § A/B/C all pass.

NO-GO if the script exits non-zero, if any alert is missing, or if any alert metric is absent from the registry. A "dead alert" can never fire, which defeats the entire monitoring gate.
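
For a quick manual cross-check of the two NO-GO conditions, something like the following could be used. It is a sketch under assumptions, not how `g4-readiness.sh` implements Check B: the helper names are hypothetical, and the alert match assumes each rule is written as an unquoted `alert: <Name>` line in the standard Prometheus rule-file layout.

```shell
# Hypothetical helper: print each expected alert with no `alert: <Name>` line.
#   $1 = path to the rules file (e.g. prometheus/alerts-banker-qa.yml)
missing_alerts() {
  for alert_name in BankerQAWriterFailure BankerIntakeAnalystFailure \
                    BankerQACoverageFail BankerCertifierReject \
                    BankerKGPhase1bLatency; do
    grep -E "^[[:space:]]*-?[[:space:]]*alert:[[:space:]]*${alert_name}[[:space:]]*$" "$1" \
      >/dev/null || echo "$alert_name"
  done
}

# Hypothetical helper: print dead-alert lines from captured readiness output.
#   $1 = file holding the output of g4-readiness.sh
dead_alert_lines() {
  grep -F 'metric NOT in registry (dead alert)' "$1" || true
}
```

Empty output from both helpers is consistent with a GO on Check B; any printed name or line is a NO-GO.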

Tier 3 — Live Non-Cardinal Banker Session (billable)

The keystone check. Drive ONE full banker session on a non-Cardinal deal (Cardinal is the gold fixture — using it would be circular; use a different deal so coverage/extraction is genuinely exercised) on staging with both flags on:

WRAPPED_SUBAGENTS=true
BANKER_QA_OUTPUT=true

Then assert all four of the following. Any miss = NO-GO.

#	Assertion	How to verify	Pass criteria
T3.1	Dispatch fires the banker agents	`hook_audit_log` rows / SSE for the session show `tool_name LIKE 'mcp__subagents__run_banker_%'`	≥1 invocation each of `mcp__subagents__run_banker_intake_analyst` and `mcp__subagents__run_banker_qa_writer` (and `..._banker_specialist_coverage_validator` if the deal triggers coverage validation)
T3.2	The 9 banker artifacts are produced	List the session dir under `reports/<session>/`	All 9 present: `banker-deal-context.json`, `banker-prohibited-assumptions.json`, `banker-intake-state.json`, `banker-questions-presented.md`, `banker-question-answers.md`, `banker-qa-state.json`, `banker-qa-metadata.json`, `banker-qa.md`, plus `specialist-coverage-state.json`. `banker-question-answers.md` must pass the parse-back validator (re-run Tier 1's validator against the new file, not the gold).
T3.3	Alert metrics now have live series	`curl -s "$URL/metrics" | grep -E '^(claude_gate_check_results_total|claude_qa_dimension_score|claude_kg_phase_duration_ms|claude_kg_circuit_breaker_state)'`	All 4 base metrics emit at least one series after the session (e.g. `claude_qa_dimension_score{dimension="dim_13..."}`, `claude_kg_phase_duration_ms{phase="KG-Phase1b"...}`). Confirms the alerts will actually evaluate against real data.
T3.4	Frontend banker-mode renders	Load the session in the dashboard (`test/react-frontend/`)	Banker Q&A view renders the questions/answers/confidence without console errors; the banker artifacts hydrate.
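
The artifact listing in T3.2 is easy to get wrong by eye, so a small presence check may help. This is a hedged sketch: `missing_artifacts` is a hypothetical name, it assumes all nine files sit directly in the session directory, and it treats an empty file as missing, which is stricter than the "present" wording in the table.

```shell
# Hypothetical helper: print each of the 9 expected artifacts that is
# absent (or empty) in the given session directory.
#   $1 = session directory, e.g. reports/<session>
missing_artifacts() {
  for artifact in banker-deal-context.json banker-prohibited-assumptions.json \
                  banker-intake-state.json banker-questions-presented.md \
                  banker-question-answers.md banker-qa-state.json \
                  banker-qa-metadata.json banker-qa.md \
                  specialist-coverage-state.json; do
    # -s: file exists and is non-empty (assumption: empty counts as missing).
    [ -s "$1/$artifact" ] || echo "$artifact"
  done
}
```

Empty output means all nine artifacts landed. The parse-back validation of `banker-question-answers.md` is a separate step and is not covered by this check.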

See references/preflip-checklist.md for the per-assertion query/probe commands and the exact 9-artifact + dispatch-tool reference.
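
As an illustration of the T3.3 assertion (and not a copy of anything in the checklist file), the following sketch reports which of the four base metrics have no sample line in a `/metrics` response. The name `metrics_without_series` is hypothetical. It assumes the Prometheus text exposition format, where sample lines start with the metric name and `# HELP` / `# TYPE` lines do not count as series.

```shell
# Hypothetical helper: read /metrics text on stdin and print each base
# metric that has no sample line.
metrics_without_series() {
  metrics_body=$(cat)
  for metric in claude_gate_check_results_total claude_qa_dimension_score \
                claude_kg_phase_duration_ms claude_kg_circuit_breaker_state; do
    # A sample line begins with the name followed by '{', a '_' suffix
    # (e.g. _bucket for histograms), or a space before the value.
    printf '%s\n' "$metrics_body" | grep -E "^${metric}[{_ ]" >/dev/null \
      || echo "$metric"
  done
}

# Usage:
#   curl -s "$URL/metrics" | metrics_without_series
```

Empty output means every base metric has at least one live series. Like the grep in the table, the match is prefix-based, so a differently named metric sharing the same prefix would also satisfy it.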

GO / NO-GO Gate

GO — flip BANKER_QA_OUTPUT=true in prod only when all three tiers pass:

✅ Tier 1: --dry exits 0 (validator path proven).
✅ Tier 2: g4-readiness.sh exits 0; all 5 alerts defined; all 4 alert metrics wired in the registry (no dead alerts).
✅ Tier 3: dispatch emits mcp__subagents__run_banker_*; 9 artifacts produced + parser-clean; 4 alert metrics have live /metrics series; frontend banker-mode renders.

NO-GO on any failure. Do not flip. Common blockers:

Symptom	Verdict	Action
Tier 1 FAIL	NO-GO	Validator regression — fix `bankerQaValidator.js`, do not touch the flag
Tier 2 "dead alert" line	NO-GO	Wire the metric in `src/utils/sdkMetrics.js` (alerts that can't fire defeat the gate)
Tier 3 artifact missing	NO-GO	Writer/intake contract gap — diagnose via `session-diagnostics <session>`
Tier 3 metric has no live series	NO-GO	Metric is registered but not emitted on the banker path. After a live session this is an emission bug, not the acceptable pre-session SKIP from Tier 2
Tier 3 frontend render error	NO-GO	Frontend banker-mode hydration bug
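
The gate logic itself is a strict AND over the three tiers. A minimal sketch, assuming each tier has already been reduced to the string `pass` or something else (the function name `preflip_verdict` and that string convention are hypothetical):

```shell
# Hypothetical helper: combine the three tier results into a verdict.
#   Arguments: one result per tier; only the literal "pass" counts as passing.
preflip_verdict() {
  # Require exactly three results so a skipped tier cannot read as GO.
  if [ "$#" -ne 3 ]; then
    echo "NO-GO"
    return 1
  fi
  for tier_result in "$@"; do
    if [ "$tier_result" != "pass" ]; then
      echo "NO-GO"
      return 1
    fi
  done
  echo "GO"
}
```

For example, `preflip_verdict pass pass fail` prints `NO-GO` and returns non-zero. The verdict is advisory only; the function does not touch the flag.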

When the flip is later made by the operator, the enable sequence is in docs/runbooks/g4-operator-enable-disable.md § A. If anything degrades after the flip, roll back per docs/runbooks/g4-rollback-playbook.md:

§ A — soft-disable (flip flag off + redeploy; orphan banker data is safe to leave).
§ B — hard-rollback (DB + GCS WORM constraints).
§ C — orphan-data behavior post-flag-off.

Read-Only Guarantee

This skill runs run-bankerqa-isolated.mjs --dry, g4-readiness.sh, curl /metrics, and read-only DB/SSE inspection. It does not edit flags.env, redeploy, run DML, or flip BANKER_QA_OUTPUT. Tier 3 drives one real billable banker session on staging — that is the only side effect (a scratch staging session + 1–2 Opus 4.8 calls), and it touches staging only.

Pre-flight

which node    # required for Tier 1
which bash    # required for Tier 2
which curl    # required for Tier 3 /metrics probe
which jq      # G4 baselines branch check + /metrics parsing
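
The four lookups can also be run as one check that reports only what is missing. A small sketch using the POSIX `command -v` builtin in place of `which`; the helper name `missing_tools` is hypothetical.

```shell
# Hypothetical helper: print each named tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Usage: empty output means every tool is available.
#   missing_tools node bash curl jq
```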