ai-engineering

name: ai-engineering description: > Reviews and guides LLM/AI application engineering: prompt design, prompt caching, multimodal inputs, RAG, agent loops and tool design, resilience (rate limits, retries, fallbacks), memory, model migration, evals, testing, prompt-injection defence, and observability. Synthesises practices from Anthropic, OpenAI, Google, OWASP LLM Top 10, and practitioners (Hamel Husain, Eugene Yan, Chip Huyen). Triggers on "review my prompt", "design a system prompt", "optimise tokens", "set up RAG", "build an agent", "handle rate limits", "migrate to a new model", "write evals", "test my prompt", "audit AI code", "/ai-engineering". disable-model-invocation: true argument-hint: '[guide|review|design] [|]' license: MIT metadata: author: mthines version: '1.0.0' workflow_type: advisory tags: - ai-engineering - llm - prompt-engineering - multimodal - rag - agents - evals - testing - guardrails - observability

AI Engineering

Prescriptive guidance for building and reviewing LLM/AI applications. Thirteen orthogonal concerns — load only the rules the current task needs.

This SKILL.md is a thin index. Detailed rules live in rules/*.md and load on demand. Curated source URLs live in references/primary-sources.md. Date-flagged changes since 2024 live in references/recent-changes.md. Literal scaffolding lives in templates/.

Mode Detection

Parse $ARGUMENTS (first token) and detect the mode:

Mode	Default	Trigger
`guide`	yes	Default. Open question ("how should I …", "what's the best way to …").
`review`		`$0 == "review"`, or a file/path is supplied as `$ARGUMENTS`.
`design`		`$0 == "design"`, or "scaffold a prompt / system prompt / eval".

State the detected mode and the area(s) in scope before continuing:

Mode: review
Areas: prompt-writing, system-prompt-design
Targets: src/agents/triage.ts (system prompt at L24-78)

Area Routing

Map the user's request to one or more rule files. Load only the rules listed for the matched area(s).

Area	Rule file	Load when
Writing user prompts	`rules/prompt-writing.md`	"improve this prompt", few-shot questions, output format, CoT, structured outputs.
Designing system prompts	`rules/system-prompt-design.md`	Persona, tool docs, ordering, refusals, agent stop conditions.
Token cost / latency	`rules/token-optimization.md`	Prompt caching, model routing, batching, streaming, max_tokens.
Multimodal	`rules/multimodal.md`	Image/audio/PDF inputs, vision-vs-OCR, voice agents, image token costs.
Retrieval-augmented generation	`rules/rag.md`	Chunking, embeddings, hybrid search, reranking, query rewriting.
Agents & tool use	`rules/agents-and-tools.md`	Tool schemas, agent loops, parallel tool calls, error recovery, workflow vs agent.
Resilience	`rules/resilience.md`	Rate limits (429), retries with jitter, circuit breakers, fallback chains, timeouts, idempotency.
Memory & long-running state	`rules/memory-and-state.md`	Conversation summarisation, structured memory, vector memory, memory tools, compaction.
Model migration & versioning	`rules/model-migration.md`	Pin snapshots vs aliases, A/B a new model, deprecations, cross-provider migration, rollback.
Evaluation	`rules/evals.md`	Golden sets, LLM-as-judge, regression CI, error analysis.
Testing (engineering)	`rules/testing.md`	Unit/integration tests, mocks, VCR cassettes, snapshot tests, CI cost discipline.
Safety & guardrails	`rules/safety-and-guardrails.md`	Prompt injection, jailbreaks, output validation, PII, scope control.
Observability & versioning	`rules/observability-and-versioning.md`	Tracing, prompts-as-code, A/B releases, rollback.

evals.md covers product-quality measurement (golden sets, LLM-as-judge). testing.md covers engineering-correctness tests (mocks, VCR, snapshots). Load both when the user asks "how do I test this?" without specifying.

Observability composition. When the area is Observability & versioning and the task involves OTEL wiring or attribute naming, invoke the dash0 OTEL skills via Skill() before answering — they hold the source of truth for spans and gen_ai.* semconv:

Skill("otel-instrumentation", ...) — SDK setup, exporters, span shape.
Skill("otel-semantic-conventions", ...) — attribute naming and spec validation.

If neither skill is in the available-skills list, fall back to the inline guidance in rules/observability-and-versioning.md and the OTEL spec.

If the user does not name an area, ask one batched clarifying question listing the thirteen options before loading rules.

Workflow

`guide` (default)

Identify the area(s) from the user's request (use the table above).
Load the matched rule file(s).
Answer using the rules. Cite the source URLs from references/primary-sources.md when a claim is non-obvious or model-version-specific.
If the rule references a date-flagged change, also load references/recent-changes.md and call out the date in the answer.

`review`

Read the target file(s) supplied in $ARGUMENTS.
Detect which areas the file touches (prompt strings, system prompts, tool schemas, retrieval pipelines, eval scripts, …).
Load the matching rule file(s).
Produce findings as a numbered list, each with:
- What — the issue, with path:line.
- Rule — the rule file + section that's violated.
- Fix — concrete change, with a code snippet when ≤ 10 lines.
End with a "Top 3 fixes" prioritised by impact.

Do not edit the file in review mode unless the user asks for fixes.

`design`

Confirm the artefact: prompt, system prompt, or eval rubric?
Ask the user — in one message — for:
- The task the prompt drives.
- The model family (Claude / GPT / Gemini / open).
- Any hard constraints (latency budget, output schema, refusal scope).
- Whether tool use is in scope.
Load the relevant rule(s) and the matching template:
- System prompt → templates/system-prompt-skeleton.md.
- Tool definition → templates/tool-description.md.
- Eval rubric → templates/eval-rubric.md.
- Golden-set seed → templates/golden-set.md.
Fill in the template. Annotate why each section exists (one inline comment per section, not a full essay).

Required Reading by Area

Load on demand — do not preload.

Area	Files
Prompt writing	`rules/prompt-writing.md`
System prompts	`rules/system-prompt-design.md`
Token cost	`rules/token-optimization.md`
Multimodal	`rules/multimodal.md`
RAG	`rules/rag.md`
Agents	`rules/agents-and-tools.md`
Resilience	`rules/resilience.md`
Memory & state	`rules/memory-and-state.md`
Model migration	`rules/model-migration.md`
Evals	`rules/evals.md`
Testing	`rules/testing.md`
Safety	`rules/safety-and-guardrails.md`
Observability	`rules/observability-and-versioning.md`
Source URLs	`references/primary-sources.md`
Date-flagged changes	`references/recent-changes.md`
System prompt template	`templates/system-prompt-skeleton.md`
Tool description template	`templates/tool-description.md`
Eval rubric template	`templates/eval-rubric.md`
Golden-set template	`templates/golden-set.md`

Core Principles

Stable-prefix everything. The single highest-leverage optimisation across cost, latency, and determinism is keeping the prompt prefix byte-identical across calls so it caches. Order: tools → system → stable context → volatile context → user input.
Match the technique to the model class. Reasoning models (o3, o4, Claude with adaptive thinking) regress under added chain-of-thought; non-reasoning models often need it. See rules/prompt-writing.md.
Workflows beat agents until they don't. A predefined control flow is cheaper, faster, and more debuggable than a model-driven loop. Build an agent only when the trajectory genuinely cannot be enumerated.
Evals before optimisation. You cannot tell whether a prompt change is an improvement without a golden set + regression run. Without traces in production, you have no eval inputs.
Defence-in-depth for prompt injection. OWASP LLM01:2025 ranks prompt injection #1. Layer input classifier + least-privilege tools + human approval on destructive actions + output validator. No single guardrail is sufficient — guardrail LLMs are themselves injectable.
Right-size the model. Route 60% Haiku-class / 30% Sonnet-class / 10% Opus-class with confidence-gated escalation. Uniform-Opus is rarely the right answer.
Prompts are code. Versioned in-repo, diffed in PR, gated by CI evals, A/B released. Anonymous prompt edits in production dashboards are a regression waiting to happen.
Pin model snapshots in production. Aliases auto-upgrade silently — the model that passed evals yesterday is not the model serving traffic today. See rules/model-migration.md.
Plan for the failure modes, not the happy path. Rate limits, provider outages, and tail latency hit every prod system. Honour Retry-After, jitter retries, fall back across models, key destructive tool calls for idempotency. See rules/resilience.md.

Anti-patterns (one-liners — full coverage in each rule)

"Be helpful and detailed" instructions — non-actionable.
Volatile content (timestamps, user IDs) before stable content — kills caching.
Adding "let's think step by step" to a reasoning model — measurable regression.
Flipping tools, tool_choice, or thinking mid-conversation — silent cache flush.
Vector-only retrieval (no BM25) — misses exact-string matches (IDs, error codes).
Passing the top-20 chunks to the LLM without reranking — wastes tokens, dilutes attention.
Single-judge LLM evals using the same model as the actor — silent self-preference inflation.
Tool descriptions written for humans — agents pick the wrong tool.
Building a multi-agent system before single-agent + workflow exhausts the design space.
Toggling extended thinking mid-turn — invalidates the entire cached prefix.
Sending full-resolution screenshots — resize to ≤ 1568 px first.
Retrying on 429 without honouring Retry-After — wastes quota and triggers thundering herds.
Pure exponential backoff with no jitter — every client retries at the same instant.
Side-effectful tool calls without idempotency keys — retries double-charge.
Model aliases (claude-sonnet-4-7) in production — auto-upgrade silently.
Migrating models without a full golden-set run — regressions ship undetected.
Truncating long conversations instead of summarising — drops context the model needs.
Cross-user memory leakage — let the model fill in user_id and you have a privacy bug.

Definition of Done

Mode and areas stated up front.
Only the matched rule file(s) loaded.
Each finding cites path:line and the rule it violates.
Source URLs cited when claims are model-version-specific or recent.
In design mode, the produced artefact uses the matching template.
In review mode, findings are prioritised; no edits without consent.