ai-prompt-engineering

name: ai-prompt-engineering description: Production prompt lifecycle toolkit. Gives AI agents the ability to design, test, optimize, version, and deploy prompts like a senior prompt engineer — with automated testing (promptfoo), programmatic optimization (DSPy), quality metrics (DeepEval), and CI/CD gates. Use for writing system prompts, testing prompt suites, diagnosing hallucination/drift, setting up CI/CD pipelines, or auditing existing prompts. keywords: ["prompt", "prompt engineering", "LLM", "system prompt", "hallucination", "提示词", "幻觉", "漂移", "DSPy", "promptfoo", "CI/CD", "测试"] type: reference-based

CONSUMES: User prompt engineering task + optional existing system prompts or eval dataset PRODUCES: Tested prompt suite + optimization results + CI/CD gate configuration

AI Prompt Engineering Capability Pack

Version: 1.0.0 Compatibility: Claude Code (Phase 1); Codex / Cursor / Gemini in Phase 3 License: Apache 2.0 — see LICENSE-ATTRIBUTION.md for source credits

What This Pack Does

AI agents can write prompts. What they cannot do:

Test prompts against regression suites before deploying
Diagnose why a prompt hallucinates or drifts format
Set up CI/CD pipelines that block bad prompts from shipping
Handle model updates that silently break existing prompts

This pack encodes the production prompt lifecycle — 4 phases with specific CLI tools at each phase — that most prompt engineering guides skip.

This pack teaches "how to run prompts in production", not "how to write a prompt." The Anthropic tutorial covers writing. This covers testing, versioning, drift detection, CI/CD.

Pack = prompt production knowledge. Your workflow system = process constraints. No overlap.

Contents / Navigation Index

Section	Loads	Use when
Step 0: Context Detection Router	this file	Always — entry routing
Phase 1: Write	`references/phase1-write.md`	Designing a new prompt
Phase 2: Test	`tools/selection-matrix.md`, `tools/promptfoo-starter.yaml`	Building a regression/eval suite
Phase 3: Optimize	`references/failure-catalog.md`, `tools/selection-matrix.md`	Diagnosing/optimizing a prompt
Phase 4: Ship	`references/ci-cd-templates.md`	Versioning + CI/CD + deploy
Anti-Skip Table	this file	When tempted to skip a phase
Anti-Slop Rules	this file	Every phase
Claude target rules	`references/claude.md`	Target model is Claude (model IDs, adaptive thinking, structured outputs)
Validation	`tools/prompt-lint.sh`	Deterministic pre-ship check (assertions/model-pin/aggressive-language)

References: claude.md (Claude API rules + model-pinning), phase1-write.md (Phase 1 detail), few-shot-design.md, output-format.md, failure-catalog.md (FM-1..FM-6), ci-cd-templates.md. Tools: selection-matrix.md (promptfoo/DSPy/GEPA/DeepEval), promptfoo-starter.yaml, prompt-lint.sh.

Step 0: Context Detection Router

When the user mentions prompt work, detect context and load the right references:

User Signal	Entry Mode	Load Reference
"write prompt", "design prompt", "system prompt for"	`/write` → Phase 1	`references/phase1-write.md` (+ `references/claude.md` if Claude target)
"few-shot", "examples", "shots", "demonstrations"	`/write` → Phase 1	`references/phase1-write.md` + `references/few-shot-design.md`
"output format", "JSON schema", "structured output", "format lock"	`/write` → Phase 1	`references/phase1-write.md` + `references/output-format.md`
"hallucinating", "drifting", "broken prompt", "inconsistent"	`/audit` → Phase 3	`references/failure-catalog.md`
"optimize", "improve this prompt", "scoring low"	`/audit` → Phase 3	`references/failure-catalog.md`
"CI/CD", "pipeline", "deploy prompt", "version", "ship"	Phase 4 direct	`references/ci-cd-templates.md`
"test prompt", "eval", "promptfoo", "red team", "golden set"	Phase 2 direct	`tools/selection-matrix.md`
"model update", "prompt drift", "regression", "silent change"	`/audit` → Phase 3	`references/failure-catalog.md` + `references/claude.md`
"everything", "full audit", "review this prompt"	`/audit` → Phase 3	All references

Two entry modes:

/write — Design a new prompt from scratch → Phase 1 → 2 → 3 → 4
/audit — Diagnose or optimize an existing prompt → Phase 3 (with escalation gate) → 2 → 4

Interaction contract: Challenge vague requirements with specific questions before proceeding (e.g., "What's the output consumer — API code or human reader?", "What's the target model and context window size?"). Present tradeoffs; human decides. Not adversarial — but not passive. Do not refuse to proceed; clarify, then act.

Phase 1: Write — Design the System Prompt

Trigger: /write entry, or user wants a new prompt from scratch.

1.1 Scope Clarification (ask before writing)

Before writing, confirm:

Target model: Which LLM? (Claude 4.x? GPT-4o? Gemini?) → Load references/claude.md for Claude
Output consumer: API code parsing JSON? Human reading text? Pipeline feeding next agent?
Primary task: Classify/Extract/Generate/Transform/Reason?
Constraints: Token budget? Latency SLA? Content policy?

1.2–1.7 Prompt Construction → load `references/phase1-write.md`

The full Phase 1 construction detail lives in references/phase1-write.md (load it now for any /write task). It covers:

1.2 Role definition — domain+value+consumer formula (never "You are a helpful assistant")
1.3 Constraint design — ≤10 MUST/NEVER, front-load in first 30% (U-shaped attention)
1.4 Context architecture — cache_control placement, model-specific min cacheable prefix (4096 Opus / 2048 Sonnet), token audit via count_tokens (not tiktoken)
1.5 Anti-hallucination — grounding constraints + capability declaration (~23% reduction)
1.6 Injection defense — delimiter isolation + reasoning scaffold (84% → ~12% success rate)
1.7 Conditional loading — few-shot / output-format / Claude references

Key constraint surfaced here so it isn't skipped: "MUST USE"/"CRITICAL:"/"ALWAYS" over-trigger on Claude 4.6+ — use direct language. If target model is Claude, also load references/claude.md (adaptive thinking, structured outputs not prefill, model pinning, tool-triggering).

Phase 1 Output

Complete system prompt (ready to use)
Optional: few-shot examples block (if applicable)
Optional: output schema definition (if structured output required)
Token count estimate (via count_tokens, model-specific)

Phase 2: Test — Automated Prompt Evaluation

Trigger: After Phase 1, or user wants to test an existing prompt.

Exit criteria: All assertions pass. PASS rate ≥80% on golden dataset.

2.1 Create Golden Dataset

Minimum dataset (18 test cases per research findings):

10 core cases: representative of normal usage (happy path)
5 edge cases: boundary conditions, unusual inputs, empty inputs
3 adversarial cases: injection attempts, constraint violations, jailbreak probes

For each test case, define:

input: the prompt input
assert: at least 1 measurable assertion (not just "looks good")

Common assertion types:

contains: response contains specific text
not-contains: response does NOT contain specific text
javascript: custom JS assertion function
llm-rubric: natural language quality check via LLM-as-judge

2.2 Generate promptfooconfig.yaml

Write the config file:

# promptfooconfig.yaml
description: "[prompt name] evaluation suite"
prompts:
  - file://system-prompt.txt

providers:
  - anthropic:messages:claude-sonnet-4-6  # pin exact version

tests:
  # Core cases
  - description: "[case name]"
    vars:
      input: "[test input]"
    assert:
      - type: [assertion-type]
        value: "[expected value]"
  # ... 17 more cases per tools/promptfoo-starter.yaml template

→ See tools/promptfoo-starter.yaml for ready-to-use template with 18 test cases.

2.3 Execute Evaluation

# Install (first time)
npx promptfoo@latest init

# Run evaluation
npx promptfoo eval --no-cache

# View results
npx promptfoo view

2.4 Interpret Results

Key metrics from promptfoo output:

Pass rate: target ≥80% on initial run, 100% after optimization
Consistency: run same tests 3× and check variance (>5% variance = non-deterministic prompt)
Cost: total token cost per test run (useful for budget planning)
Latency: p50/p95 latency per provider

Failure analysis:

Group failures by assertion type
Check if failures cluster on edge cases vs core cases (different root causes)
Failures on adversarial cases → fix security constraints (Phase 1.6)
Failures on core cases → fundamental prompt issue (return to Phase 1 or escalate to Phase 3)

2.5 Red Teaming (optional, for security-critical prompts)

Map to the OWASP LLM Top 10 2025 via promptfoo's owasp:llm plugin preset (individual risks owasp:llm:01..owasp:llm:10):

# promptfooconfig.yaml — red-team section
redteam:
  plugins:
    - owasp:llm          # all 10, or pin individual risks below
    - owasp:llm:01       # LLM01 Prompt Injection
    - owasp:llm:07       # LLM07 System Prompt Leakage (new in 2025)
  strategies:
    - prompt-injection   # wraps payload in injection frames
    - jailbreak          # DAN-style bypass
    - crescendo          # multi-turn escalation, each message builds on the last

npx promptfoo redteam generate --purpose "customer service chatbot"
npx promptfoo redteam run

2025 risk IDs to probe for a prompt: LLM01 Prompt Injection, LLM02 Sensitive Information Disclosure, LLM05 Improper Output Handling, LLM06 Excessive Agency, LLM07 System Prompt Leakage (new in 2025), LLM09 Misinformation. Delivery strategies map to attack shape: prompt-injection (frames), jailbreak (bypass), crescendo (multi-turn escalation).

When to run: auth-adjacent prompts, multi-agent systems, user-facing production prompts.

Phase 2 Output

promptfooconfig.yaml (committed to repo)
Test results (pass/fail per case with assertion details)
Pass rate summary

Phase 3: Optimize — Diagnose and Fix

Trigger: /audit entry, or Phase 2 tests failing.

Entry criteria: Test failures exist that need diagnosis, OR user explicitly requests optimization.

3.1 Escalation Gate

Before optimizing, score the prompt on 6 dimensions (1–10, ≤4 = needs fix):

Dimension	Score	Assessment
Task clarity	__	Is the primary task unambiguous?
Context sufficiency	__	Does the prompt have enough context to succeed?
Format precision	__	Is the output format fully specified?
Scope boundary	__	Are the MUST/NEVER limits clear?
Reasoning support	__	Does the prompt support the model's reasoning process?
Agentic safety	__	Are injection and scope-creep defenses in place?

Escalation rule: If ≥2 dimensions score ≤2 → escalate to Phase 1 full redesign. Optimizing a broken foundation wastes effort. Diagnosis confirmed by the failure taxonomy:

46% of prompt failures are environment/infrastructure faults (not the prompt)
25% are configuration faults (wrong model, wrong temperature)
Only ~29% are actual prompt wording issues

→ Before blaming the prompt, check: Is the failure in infra/config? Run references/failure-catalog.md diagnosis first.

3.2 Failure Mode Diagnosis

→ Load references/failure-catalog.md and match symptoms to failure modes:

Symptom	Likely Failure Mode	Reference Section
Ignoring format constraints	FM-1: Format Drift	§FM-1
Hallucinated facts in RAG	FM-2: RAG Hallucination	§FM-2
Broke after model update	FM-3: Silent Regression	§FM-3
Losing context mid-conversation	FM-4: Context Overflow	§FM-4
Executing injected instructions	FM-5: Prompt Injection	§FM-5
Unstable despite prompt fixes	FM-6: Fix-Prompt Fallacy	§FM-6

3.3 Context and Format Diagnosis

Load context audit steps from Phase 1:

Run token audit (Phase 1.4): is the prompt exceeding budget?
Check U-shaped placement: are critical constraints in the first 30%?
Check cache architecture: is the stable prefix truly stable?

Load references/output-format.md:

Run compliance verification: is the format precisely defined?
Check schema completeness: are all edge cases covered (empty arrays, null values)?

3.4 Fix with Tradeoff Presentation

For each diagnosed issue, present ≥2 options before applying:

Issue: [description]

Option A: [fix name]
  → Effect: [what changes]
  → Risk: [what might break]
  → Cost: [token/complexity impact]

Option B: [fix name]
  → Effect: [what changes]
  → Risk: [what might break]
  → Cost: [token/complexity impact]

Recommendation: [option] because [reason]

3.5 Pre-Delivery Check (6-point)

Before delivering the optimized prompt:

Tool syntax: All CLI commands in the prompt are syntactically valid
Critical weight: Most important constraint is in the first 30%
Signal strength: Constraints use direct language (not hedged)
Fabrication audit: No facts stated without source or grounding
Token efficiency: No redundant instructions (duplicates cancel each other)
First-pass success: Would a capable model succeed on the first try?

3.6 Programmatic Optimization — DSPy MIPROv2 or GEPA

When to use programmatic optimization (consult tools/selection-matrix.md optimizer sub-matrix):

import dspy

class PromptTask(dspy.Signature):
    """[Your task description here]"""
    input_text = dspy.InputField()
    output = dspy.OutputField()

lm = dspy.LM("anthropic/claude-opus-4-8")
dspy.configure(lm=lm)

MIPROv2 — Bayesian search over instructions+demos, best for pure scalar-metric search:

from dspy.teleprompt import MIPROv2
optimizer = MIPROv2(metric=your_metric_fn, auto="medium")
optimized = optimizer.compile(PromptTask(), trainset=train_examples)
optimized.save("optimized_program.json")

GEPA (Genetic-Pareto reflective evolution, ICLR 2026 oral, arXiv:2507.19457) — reads execution traces, diagnoses failures in natural language, maintains a Pareto frontier of candidates. Prefer it over MIPROv2 when you have rich textual feedback (test traces / error messages) and a tiny trainset (works with as few as 3 examples):

optimizer = dspy.GEPA(
    metric=your_metric_fn,
    max_metric_calls=150,
    reflection_lm="openai/gpt-4.1",  # REQUIRED — the separate, usually stronger reflection model
)
optimized = optimizer.compile(PromptTask(), trainset=train_examples)

Grounded numbers: GEPA beats MIPROv2 by >10pp (+10pp on AIME-2025 with GPT-4.1-mini: 46.6%→56.6%), beats GRPO by ~20% with ~35× fewer rollouts (100–500 evals vs 5,000–25,000+ for GRPO), and discovered an agent architecture lifting ARC-AGI 32%→89%. reflection_lm is a required parameter — GEPA degrades to MIPROv2-like behavior without a capable reflection model.

Decision rule: tiny trainset + textual error feedback → GEPA; pure scalar metric, no trace narrative → MIPROv2; need only demonstrations → BootstrapFewShot (see tools/selection-matrix.md).

3.7 Regression Verification

After applying fixes, re-run Phase 2 test suite to confirm:

All previously-passing tests still pass
Fixed tests now pass
No new failures introduced

npx promptfoo eval --no-cache

Regression verification is distinct from Phase 2's initial test — its purpose is to confirm the optimization didn't break what was working.

Phase 3 Output

Optimized prompt with before/after comparison
Failure mode diagnosis report
Tradeoff decisions log
Regression test results

Phase 4: Ship — Version, CI/CD, Deploy

Trigger: Phase 2 tests passing, ready to deploy.

4.1 Prompt Versioning (git-native)

Directory structure:

prompts/
├── customer-support/
│   ├── v1.0.0/
│   │   ├── system-prompt.txt
│   │   ├── promptfooconfig.yaml
│   │   └── test-results-2026-01-15.json   # Required: link to test run
│   ├── v1.1.0/
│   │   └── ...
│   └── CHANGELOG.md

Semantic versioning rules:

MAJOR: behavior change (different output format, different task scope)
MINOR: quality improvement (same behavior, better performance)
PATCH: typo fix, whitespace, metadata only

CHANGELOG entry format (required per version):

## v1.1.0 (2026-01-15)
**Why**: Format drift detected in 3% of production responses (FM-1)
**Fix**: Added explicit JSON schema constraint + format lock
**Test results**: 18/18 assertions pass (promptfoo eval 2026-01-15-results.json)
**Breaking change**: No

4.2 Model Version Pinning (critical)

❌ Never: claude-sonnet (alias — points to different weights over time → FM-3 silent regression) ✅ Always: claude-opus-4-8 / claude-sonnet-4-6 / claude-haiku-4-5 / claude-fable-5 (exact version; never append a date suffix to an alias)

Run tools/prompt-lint.sh promptfooconfig.yaml system-prompt.txt to mechanically catch unpinned aliases, missing assertions, and over-triggering language before shipping (exit 2 = blocked).

⚠️ Model-upgrade API breakage (see references/claude.md): moving onto Opus 4.7/4.8/Fable 5 makes thinking.budget_tokens, last-assistant-turn prefill, and temperature/top_p/top_k return HTTP 400 — a prompt that worked on 4.6 can hard-fail on upgrade. Check the migration table.

Upgrade checklist:

Pin new version in non-production environment
Run full regression suite against golden dataset
Compare outputs on 10 representative production samples
Check new model's release notes for behavior changes
Deploy with canary (5% traffic) for 24–48h before full rollout

4.3 CI/CD Pipeline Setup

→ Load references/ci-cd-templates.md for complete GitHub Actions templates.

3-tier pipeline summary:

Tier	When	Duration	Gate
Tier 1	Per-commit	<2 min	JSON format, keywords, token budget
Tier 2	Per-PR	10–20 min	LLM-as-judge regression on golden set, blocks merge
Tier 3	Weekly/pre-release	60–90 min	Vulnerability scan, jailbreak, edge-case hallucination

4.4 Prompt Drift Monitoring

Signals to monitor in production:

Refusal rate: sudden spike = model update or context change
Format compliance rate: drop = format drift (FM-1)
Semantic similarity drift: outputs diverging from historical baseline
Token cost deviation: >20% change = prompt or model behavior change

Set up alerting thresholds based on 7-day rolling baseline.

Phase 4 Output

Versioned prompt directory committed to git
CI/CD config file (GitHub Actions YAML or equivalent)
Model pinning confirmed
Monitoring thresholds set

Anti-Skip Table (why each phase is NOT optional)

Agents rationalize skipping the lifecycle. Each excuse below has a one-line counter:

Excuse to skip	Counter
"The prompt looks fine, skip Phase 2 testing"	"Looks fine" is not a measurement. 46% of failures are env/config, invisible to eyeballing — only a golden-set run surfaces them.
"It's just a small wording change, skip regression"	Wording changes are exactly what drift FM-1 catches. Re-run Phase 2; `prompt-lint.sh` is <1s.
"No time for a golden set"	18 cases (10 core / 5 edge / 3 adversarial) is the floor, not a luxury — without it you ship blind and pay in production incidents.
"Just bump the model ID, it's a patch"	Model upgrades 400 on removed params (budget_tokens/prefill/temperature on 4.7+). Run the §4.2 upgrade checklist + regression.
"Skip red-team, it's internal only"	Internal prompts still process untrusted data (LLM01). Run `owasp:llm` if the prompt sees any user/external content.
"Optimize the wording, the prompt is the problem"	Don't blame the prompt first — 46% env / 25% config / 29% wording (Phase 3.1). Diagnose before editing.
"Ship without linking a test run"	A version with no test-results link is unauditable; CHANGELOG requires the run (Phase 4.1).
"Aggressive 'MUST USE' language is safer"	It over-triggers on Claude 4.6+ (claude.md Rule 4). `prompt-lint.sh` blocks it.

Anti-Slop Rules (applied at every phase)

Rules that prevent generic output:

No "You are a helpful assistant" — every prompt must have domain + value anchoring
No manual CoT for reasoning-native models — Claude 4.6+ uses adaptive thinking + the effort parameter (NOT thinking.budget_tokens, which 400s on Opus 4.7+); keep CoT only in few-shot examples
No "MUST USE" aggressive language on Claude 4.6+ — causes over-triggering; use direct language
No testing without assertions — every test case needs ≥1 measurable assertion (prompt-lint.sh enforces)
No version without test results — every prompt version must link to a test run
No prompt fix without root cause — check failure taxonomy first (46% env, 25% config, 29% wording)
No prefill for format-locking on Claude 4.6+ — use output_config.format (structured outputs); prefill 400s