prompt-eval-gate

name: prompt-eval-gate
description: The gate every prompt change must pass before shipping. No change to E5_EXTRACTOR_PROMPT or the SHARED_* prompt constants ships without an offline replay (e5_offline_replay.py), a test_webhook.py regression, and an LLM-judge pass scoring extraction fidelity and reply quality SEPARATELY from intent. Use before editing any prompt, classifier, or extractor. Triggers - change the prompt, edit E5, tune classifier, SHARED_ constant, before shipping a prompt.

This is master-plan item I5 (prompt eval harness, planned); part of the harness already exists (scripts/e5_offline_replay.py). Source of truth: docs/plans/2026-05-09-recovery-hardening-plan.md. Pairs with hebrew-eval-corpus + first-try-scoreboard.

The gate — all three must pass before a prompt change ships

1. Offline replay

Run:

python scripts/e5_offline_replay.py --prompt-file scripts/prompts/<candidate>.txt --prompt-version <tag> --with-history

Replays the 4-bucket corpus (shadowed + clobbered + no-shadow + canonical). temperature=0 makes it reproducible: same window + same prompt-file = same predictions. Cost is ~$0.10 per full run, with a soft $5/day ceiling. Report per-intent false-negative (FN) and false-positive (FP) deltas vs the current baseline.
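
The per-intent delta report can be sketched as below. This is a minimal illustration, not the harness's actual code: it assumes each replay record reduces to an (expected intent, predicted intent) pair, and the function names per_intent_rates and deltas are hypothetical. The real output format of e5_offline_replay.py may differ.

```python
from collections import Counter


def per_intent_rates(records):
    """Return {intent: (fn_rate, fp_rate)} from (expected, predicted) pairs.

    FN rate = missed firings / messages where the intent was expected.
    FP rate = wrong firings / messages where the intent was NOT expected.
    """
    expected_n, fn, fp = Counter(), Counter(), Counter()
    total = len(records)
    for expected, predicted in records:
        expected_n[expected] += 1
        if predicted != expected:
            fn[expected] += 1   # the expected intent did not fire
            fp[predicted] += 1  # a different intent fired instead
    rates = {}
    for intent in set(expected_n) | set(fp):
        pos = expected_n[intent]
        neg = total - pos
        rates[intent] = (
            fn[intent] / pos if pos else 0.0,
            fp[intent] / neg if neg else 0.0,
        )
    return rates


def deltas(baseline, candidate):
    """Candidate minus baseline, per intent; a negative value is an improvement."""
    out = {}
    for intent in sorted(set(baseline) | set(candidate)):
        b_fn, b_fp = baseline.get(intent, (0.0, 0.0))
        c_fn, c_fp = candidate.get(intent, (0.0, 0.0))
        out[intent] = (c_fn - b_fn, c_fp - b_fp)
    return out
```

Reporting the deltas per intent, rather than one pooled number, keeps a regression on a low-volume intent from being hidden by gains on a high-volume one.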

2. Integration regression

Run:

python tests/test_webhook.py

47 end-to-end cases against the prod Edge Function. Accept the ~3 known LLM-non-determinism flakes; investigate anything new.

3. LLM-judge pass

Judge extraction fidelity (did it get the entities — especially ISO send_at for reminders) AND reply quality (voice/grounding) SEPARATELY from intent classification.

Intent precision ≠ extraction precision — a correct intent with a wrong time still FAILS.
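
A minimal sketch of keeping the three verdicts separate, so a correct intent cannot mask a wrong time. The record fields ("intent", "send_at"), the Verdict class, and score_case are hypothetical names for illustration; reply_ok stands in for whatever the LLM judge returns for voice/grounding.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Verdict:
    intent_ok: bool
    extraction_ok: bool
    reply_ok: bool

    @property
    def passed(self):
        # All three must hold; intent alone is never enough.
        return self.intent_ok and self.extraction_ok and self.reply_ok


def send_at_matches(expected_iso, predicted_iso):
    """True only if both values parse as ISO 8601 and denote the same instant."""
    try:
        return datetime.fromisoformat(expected_iso) == datetime.fromisoformat(predicted_iso)
    except (TypeError, ValueError):
        # Missing or malformed send_at counts as an extraction failure.
        return False


def score_case(expected, predicted, reply_ok):
    """Score one case; reply_ok is assumed to come from the LLM judge."""
    intent_ok = expected["intent"] == predicted.get("intent")
    extraction_ok = send_at_matches(expected["send_at"], predicted.get("send_at"))
    return Verdict(intent_ok, extraction_ok, reply_ok)
```

Comparing parsed datetimes rather than raw strings means two timestamps with different UTC offsets but the same instant still match, while an off-by-one-hour reminder fails.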

Activation bars (Gate G1, per-intent not global)

≤0.3% false-negative rate
≤1% false-positive rate
≥50 firings per intent
ZERO canonical wrong-intent cases (the תזכירי לי מה ...? shape pinned in tests/e5_corpus_pinning_test.ts)

The 0.3% bar holds the absolute rate flat as volume grows (E5 covers ~91% of Solo messages). These bars are per-intent, not global — passing on add_reminder does not grant a free pass on add_shopping.
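
The bars above can be checked mechanically. The following is a sketch under assumptions: the stats dictionary shape and the name gate_g1_failures are hypothetical, and only the thresholds come from this document.

```python
MAX_FN_RATE = 0.003  # <= 0.3% false-negative rate
MAX_FP_RATE = 0.01   # <= 1% false-positive rate
MIN_FIRINGS = 50     # >= 50 firings per intent


def gate_g1_failures(stats, canonical_wrong_intent=0):
    """Return a list of failure messages; an empty list means G1 passes.

    stats: {intent: {"fn_rate": float, "fp_rate": float, "firings": int}}
    Each intent is checked on its own; nothing is averaged across intents.
    """
    failures = []
    if canonical_wrong_intent > 0:
        failures.append(f"canonical: {canonical_wrong_intent} wrong-intent case(s)")
    for intent, s in sorted(stats.items()):
        if s["fn_rate"] > MAX_FN_RATE:
            failures.append(f"{intent}: FN rate {s['fn_rate']:.2%} > {MAX_FN_RATE:.2%}")
        if s["fp_rate"] > MAX_FP_RATE:
            failures.append(f"{intent}: FP rate {s['fp_rate']:.2%} > {MAX_FP_RATE:.2%}")
        if s["firings"] < MIN_FIRINGS:
            failures.append(f"{intent}: {s['firings']} firings < {MIN_FIRINGS}")
    return failures
```

Returning every failure, instead of stopping at the first, shows in one run which intents still block activation.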

Prompt-body landmines

These will 5xx or mis-eval in production; the bundler and module-load gate PASS some of them silently.

Backticks inside a prompt template literal break the bundler. Use plain text, straight quotes, or angle brackets. v4 replaced all backticks with double-quotes for this reason.

${...} inside a top-level template literal is eagerly evaluated at module load. Text meant as a literal ${count} is treated as an interpolation, so the module fails at load and 100% of requests crash. Escape as \${...} or write {count}.

The ship-version copy scripts/prompts/e5_v4_production.txt MUST stay in lock-step with the E5_EXTRACTOR_PROMPT literal in index.ts. If you change one, change both — they are the same prompt and will diverge silently otherwise.
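
A pre-commit style check for these landmines might look like the sketch below. It is illustrative only: lint_prompt_body and in_lock_step are hypothetical helpers, the lint assumes it is given the text exactly as it will sit between the backticks in index.ts, and extracting the E5_EXTRACTOR_PROMPT literal body from index.ts (including undoing any \${ escapes before comparison) is left to the caller.

```python
import re

# "${" not preceded by a backslash, i.e. an interpolation the engine would evaluate.
# Limitation: a doubled backslash before "${" is treated as escaped here.
UNESCAPED_INTERPOLATION = re.compile(r"(?<!\\)\$\{")


def lint_prompt_body(text):
    """Return (line_number, problem) pairs for the two template-literal landmines."""
    problems = []
    for n, line in enumerate(text.splitlines(), start=1):
        if "`" in line:
            problems.append((n, "backtick"))
        if UNESCAPED_INTERPOLATION.search(line):
            problems.append((n, "unescaped ${"))
    return problems


def normalize(text):
    """Ignore only line-ending and trailing-whitespace differences."""
    lines = text.replace("\r\n", "\n").strip().split("\n")
    return "\n".join(line.rstrip() for line in lines)


def in_lock_step(file_copy, literal_body):
    """True if the .txt ship copy and the literal body carry the same prompt."""
    return normalize(file_copy) == normalize(literal_body)
```

Normalizing only whitespace keeps the comparison strict: any wording drift between the two copies makes in_lock_step return False.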

Offline-first rule

v4 was iterated 6× offline — achieving −25pp FN on add_reminder and 0 canonical wrong-intent — BEFORE any prod-shadow deploy. Never tune a prompt directly in prod. The offline replay harness exists precisely so prompt iteration is cheap ($0.10/run) and reproducible (temperature=0). Use it.

Summary

A prompt change that skips any of the three (offline replay, integration regression, LLM-judge pass) is NOT shippable, regardless of how clean it reads in review.