name: prompt-eval-gate
description: The gate every prompt change must pass before shipping. No change to E5_EXTRACTOR_PROMPT or the SHARED_* prompt constants ships without an offline replay (e5_offline_replay.py), a test_webhook.py regression, and an LLM-judge pass scoring extraction fidelity and reply quality SEPARATELY from intent. Use before editing any prompt, classifier, or extractor. Triggers - change the prompt, edit E5, tune classifier, SHARED_ constant, before shipping a prompt.
prompt-eval-gate
This is master-plan item I5 (prompt eval harness, planned). Part of the harness already exists as scripts/e5_offline_replay.py. Source of truth: docs/plans/2026-05-09-recovery-hardening-plan.md. Pairs with hebrew-eval-corpus and first-try-scoreboard.
The gate — all three must pass before a prompt change ships
1. Offline replay
Run:
python scripts/e5_offline_replay.py --prompt-file scripts/prompts/<candidate>.txt --prompt-version <tag> --with-history
4-bucket corpus (shadowed + clobbered + no-shadow + canonical). temperature=0 makes it reproducible; same window + same prompt-file = same predictions. ~$0.10/full run, soft $5/day ceiling. Report per-intent FN/FP deltas vs the current baseline.
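The per-intent FN/FP delta report can be sketched as below. This is a minimal illustration, not the harness's actual code: the `Row` shape and the function names are hypothetical, and it assumes each replayed message carries one labelled intent and one predicted intent (a "no intent" label is treated as just another intent).

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Row(NamedTuple):
    """One replayed message (hypothetical shape): labelled vs predicted intent."""
    expected: str
    predicted: str


def fn_fp_counts(rows: Iterable[Row]) -> dict[str, dict[str, int]]:
    """Count false negatives and false positives per intent."""
    fn, fp = Counter(), Counter()
    intents = set()
    for row in rows:
        intents.update((row.expected, row.predicted))
        if row.expected != row.predicted:
            fn[row.expected] += 1   # the labelled intent was missed
            fp[row.predicted] += 1  # the predicted intent fired wrongly
    return {i: {"fn": fn[i], "fp": fp[i]} for i in sorted(intents)}


def deltas(baseline: dict, candidate: dict) -> dict[str, dict[str, int]]:
    """Candidate minus baseline per intent; negative means fewer errors."""
    zero = {"fn": 0, "fp": 0}
    out = {}
    for intent in sorted(set(baseline) | set(candidate)):
        b = baseline.get(intent, zero)
        c = candidate.get(intent, zero)
        out[intent] = {"fn": c["fn"] - b["fn"], "fp": c["fp"] - b["fp"]}
    return out
```

Reporting deltas per intent rather than as one aggregate keeps a regression on a low-volume intent from being hidden by an improvement on a high-volume one.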
2. Integration regression
Run:
python tests/test_webhook.py
47 end-to-end cases against the prod Edge Function. Accept the ~3 known LLM-non-determinism flakes; investigate anything new.
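One way to separate tolerated flakes from new failures is a plain set difference. The sketch below is illustrative only: the case names in `KNOWN_FLAKES` are placeholders, not the real flaky cases, and how failed case names are collected from tests/test_webhook.py is left to the caller.

```python
# Placeholder names; substitute the actual known-flaky case identifiers.
KNOWN_FLAKES = frozenset({"case_a", "case_b", "case_c"})


def triage(failed: set[str], known_flakes: frozenset[str] = KNOWN_FLAKES) -> tuple[list[str], list[str]]:
    """Split failures into (new failures to investigate, tolerated known flakes)."""
    new_failures = sorted(failed - known_flakes)
    tolerated = sorted(failed & known_flakes)
    return new_failures, tolerated
```

Any non-empty first element means the run needs investigation before the change ships.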
3. LLM-judge pass
Judge extraction fidelity (did it get the entities — especially ISO send_at for reminders) AND reply quality (voice/grounding) SEPARATELY from intent classification.
Intent precision ≠ extraction precision — a correct intent with a wrong time still FAILS.
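A minimal sketch of scoring the three dimensions separately and failing the case if any one fails. The `JudgeVerdict` type and `send_at_matches` helper are hypothetical names for illustration; the real judge's output format is not specified here. The time comparison assumes both values are ISO 8601 strings with explicit offsets.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class JudgeVerdict:
    """Hypothetical judge output: each dimension is scored on its own."""
    intent_ok: bool
    extraction_ok: bool
    reply_ok: bool

    @property
    def passed(self) -> bool:
        # A correct intent never compensates for a wrong entity or a bad reply.
        return self.intent_ok and self.extraction_ok and self.reply_ok


def send_at_matches(expected_iso: str, actual_iso: str) -> bool:
    """Compare two ISO 8601 timestamps as instants rather than as strings."""
    try:
        expected = datetime.fromisoformat(expected_iso)
        actual = datetime.fromisoformat(actual_iso)
    except ValueError:
        return False  # an unparseable send_at counts as an extraction failure
    # Offset-aware values compare by instant; naive vs aware compares unequal.
    return expected == actual
```

For example, a reminder classified correctly as add_reminder but extracted with an evening time instead of a morning one yields `intent_ok=True`, `extraction_ok=False`, and therefore `passed` is false.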
Activation bars (Gate G1, per-intent not global)
- ≤0.3% false-negative rate
- ≤1% false-positive rate
- ≥50 firings per intent
- ZERO canonical wrong-intent cases (the "תזכירי לי מה ...?" shape, roughly "remind me what ...?", pinned in tests/e5_corpus_pinning_test.ts)
The 0.3% bar holds the absolute rate flat as volume grows (E5 covers ~91% of Solo messages). These bars are per-intent, not global — passing on add_reminder does not grant a free pass on add_shopping.
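The per-intent check can be sketched as follows. The thresholds come from the bars above; everything else is an assumption made for illustration: the `IntentStats` shape, the function names, and in particular the denominators (false-negative rate over messages labelled with the intent, false-positive rate over firings of the intent), which this document does not define.

```python
from dataclasses import dataclass

MAX_FN_RATE = 0.003  # <= 0.3% false negatives
MAX_FP_RATE = 0.01   # <= 1% false positives
MIN_FIRINGS = 50     # >= 50 firings per intent


@dataclass(frozen=True)
class IntentStats:
    """Hypothetical per-intent tallies from a replay window."""
    firings: int                 # times the intent was predicted
    labelled: int                # messages labelled with the intent (assumed FN denominator)
    false_negatives: int
    false_positives: int
    canonical_wrong_intent: int  # pinned canonical cases classified wrongly


def bar_failures(s: IntentStats) -> list[str]:
    """Return the bars this intent misses; an empty list means it passes."""
    reasons = []
    if s.firings < MIN_FIRINGS:
        reasons.append(f"only {s.firings} firings, need at least {MIN_FIRINGS}")
    if s.labelled == 0 or s.false_negatives / s.labelled > MAX_FN_RATE:
        reasons.append("false-negative rate above 0.3% or undefined")
    if s.firings == 0 or s.false_positives / s.firings > MAX_FP_RATE:
        reasons.append("false-positive rate above 1% or undefined")
    if s.canonical_wrong_intent != 0:
        reasons.append(f"{s.canonical_wrong_intent} canonical wrong-intent cases")
    return reasons


def gate_g1(per_intent: dict[str, IntentStats]) -> dict[str, list[str]]:
    """Evaluate every intent on its own; return only the intents that fail."""
    results = {name: bar_failures(stats) for name, stats in per_intent.items()}
    return {name: reasons for name, reasons in results.items() if reasons}
```

Because each intent is evaluated independently, a clean add_reminder result cannot offset an add_shopping intent that has too few firings or too many errors.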
Prompt-body landmines
These will 5xx or mis-eval in production; the bundler and module-load gate PASS some of them silently.
- Backticks inside a prompt template literal break the bundler. Use plain text, straight quotes, or angle brackets. v4 replaced all backticks with double-quotes for this reason.
- ${...} inside a top-level template literal is eagerly evaluated at module load. Text meant as a literal ${count} is interpolated as code instead and crashes 100% of requests. Escape it as \${...} or write {count}.
- The ship-version copy scripts/prompts/e5_v4_production.txt MUST stay in lock-step with the E5_EXTRACTOR_PROMPT literal in index.ts. They are the same prompt and will diverge silently otherwise, so if you change one, change both.
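A small pre-flight lint can catch the first two landmines in a candidate prompt body, and a text comparison can catch drift between the two copies. This is a sketch under assumptions: the function names are hypothetical, the lint runs on the text as it will be pasted into the template literal, and extracting the E5_EXTRACTOR_PROMPT literal from index.ts is left to the caller because the declaration's exact form is not shown here.

```python
import re

# Matches "${" that is not preceded by a backslash, i.e. a live interpolation.
UNESCAPED_INTERPOLATION = re.compile(r"(?<!\\)\$\{")


def lint_prompt_body(text: str) -> list[str]:
    """Flag backticks and unescaped ${...} in a prompt body, line by line."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if "`" in line:
            problems.append(f"line {lineno}: backtick")
        if UNESCAPED_INTERPOLATION.search(line):
            problems.append(f"line {lineno}: unescaped ${{...}}")
    return problems


def in_lockstep(file_text: str, literal_text: str) -> bool:
    """True when the shipped .txt copy and the index.ts literal carry the same text.

    Leading and trailing whitespace is ignored; that tolerance is an assumption.
    """
    return file_text.strip() == literal_text.strip()
```

Running the lint before the offline replay is cheap, and it covers the cases the bundler and module-load gate are said above to pass silently.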
Offline-first rule
v4 was iterated 6× offline — achieving −25pp FN on add_reminder and 0 canonical wrong-intent — BEFORE any prod-shadow deploy. Never tune a prompt directly in prod. The offline replay harness exists precisely so prompt iteration is cheap ($0.10/run) and reproducible (temperature=0). Use it.
Summary
A prompt change with no offline replay + no integration regression + no LLM-judge pass is NOT shippable, regardless of how clean it reads in review.