rag-retrieval - SKILL.md Agent Skill

name: rag-retrieval description: RAG & retrieval engineering capability pack. Gives AI agents the judgment rules a senior retrieval engineer applies automatically — chunking strategy selection, embedding model choice, vector database routing, hybrid search with Reciprocal Rank Fusion, two-stage cross-encoder reranking, GraphRAG, and reference-based + LLM-as-judge RAG evaluation. Research-grounded rules with specific numbers from chunking benchmarks, embedding/reranker/vector-DB comparisons, and Ragas-style evaluation. Use for any RAG pipeline design, retrieval quality debugging, chunking/embedding/vector-DB selection, hybrid search fusion, reranker selection, or RAG eval task. keywords: ["RAG", "检索增强", "retrieval", "检索", "chunking", "分块", "embedding", "嵌入", "向量数据库", "vector database", "reranker", "重排序", "hybrid search", "混合检索", "RRF", "BM25", "pgvector", "GraphRAG", "faithfulness", "向量检索"] type: reference-based

CONSUMES: User RAG/retrieval task + corpus description (size, format, language, domain) + optional existing pipeline config (chunker, embedder, vector DB, reranker, eval suite) PRODUCES: Applied retrieval judgment rules + chunking strategy decision + embedding model selection + vector DB routing + hybrid-search/RRF config + reranker selection + RAG eval suite with target thresholds

RAG & Retrieval Engineering Capability Pack

Version: 0.1.0 Compatibility: Claude Code (Phase 1); Codex / Cursor / Gemini in Phase 3 License: Apache 2.0

What This Pack Does

AI agents build RAG pipelines by copying a tutorial: fixed-size chunks, text-embedding-3-small, Chroma, top-k similarity, and a prompt. They reach for semantic chunking because it sounds advanced (it benchmarked < 55% vs recursive-512's 69%). They fuse BM25 and vector scores by adding them directly (mathematically invalid — BM25 is unbounded). They rerank the top-200 (paying latency for ~10% of the accuracy gain). They never separate retrieval evaluation from generation evaluation, and they report a single blended "RAG score" with no domain-calibrated Faithfulness gate — so they can't tell whether the retriever or the generator is the problem, or whether the answer is even grounded.

This pack embeds the judgment rules retrieval engineers apply automatically — rules grounded in 2026 chunking benchmarks, embedding/reranker/vector-DB comparisons, and Ragas-style evaluation, with the specific numbers a no-pack LLM would not produce.

Pack = retrieval judgment. Your workflow system = process constraints. No overlap.

Cross-Cutting Rule: Evaluate Retrieval and Generation Separately, and Gate Faithfulness on a Domain Threshold

A RAG system has two failure surfaces that MUST be measured independently: retrieval (did we fetch the right chunks?) and generation (did the answer stay grounded in them?). Never report a single blended "RAG score." Retrieval is measured with reference-based IR metrics (Precision@k, Recall@k, MRR, nDCG@k) or LLM-judged Context Precision/Recall. Generation is measured with Faithfulness/Groundedness and Answer Relevance. Faithfulness is the fraction of answer claims supported by the retrieved context; the lower it is, the larger the share of unsupported (potentially fabricated) claims. 1.0 is the aspirational target, but Faithfulness is a semi-deterministic LLM-judge score (run variance alone keeps a perfectly-grounded answer below 1.0), so gate on a domain threshold averaged over the eval suite, not a strict ==1.0: block (P0) below ~0.85 general / ~0.90 regulated (finance/health/legal), treat 0.85–0.99 as a P1 risk band, and pair with Answer Relevance ≥ 0.90. Source: findings.md "Rigorous Validation" + "Actionable Recommendations" [35, 36]; Ragas faithfulness docs + 2026 RAG-eval threshold guidance (0.8 general / 0.85 customer-facing / 0.9+ regulated), retrieved 2026-06-13

This rule applies to every RAG eval, every "why is my RAG wrong?" debug (always split: bad retrieval vs bad generation), and every production gate. It is surfaced here because a blended score hides whether the retriever or the generator is the problem — and that determines the entire fix.

Step 0: Context Detection

When the user mentions RAG/retrieval work, detect the context and load the right reference:

User Signal	Reference to Load
"chunking", "chunk size", "splitting", "semantic chunk", "late chunking", "page-level", "分块", "切分"	`references/chunking-rules.md`
"embedding model", "voyage", "cohere embed", "bge", "text-embedding-3", "dimensions", "Matryoshka", "嵌入模型"	`references/embedding-rules.md`
"vector database", "pgvector", "Qdrant", "Pinecone", "Milvus", "HNSW", "IVFFlat", "metadata filter", "向量数据库"	`references/vector-database-rules.md`
"hybrid search", "BM25", "RRF", "reciprocal rank fusion", "reranker", "rerank", "cross-encoder", "混合检索", "重排序"	`references/hybrid-rerank-rules.md`
"GraphRAG", "knowledge graph", "multi-hop", "entity", "relationship", "Leiden", "图谱", "多跳"	`references/graphrag-rules.md`
"RAG eval", "faithfulness", "context precision", "recall@k", "nDCG", "Ragas", "RAGAS", "DeepEval", "TruLens", "groundedness", "hallucination", "评估"	`references/rag-evaluation-rules.md`
"contextual retrieval", "prepend context", "chunk context", "context generation"	`references/chunking-rules.md` (CH8)
"full RAG pipeline", "design my RAG", "end-to-end retrieval", "build a RAG"	Load all references sequentially
User hands a concrete pipeline config (chunker/embedder/vectorDB/fusion/reranker/eval thresholds) and wants a deterministic check	Run `scripts/rag-config-lint.sh <config>` FIRST, then apply judgment rules to its findings

Validation Script (deterministic checks — do not hand-recompute)

When the user provides a machine-readable pipeline config, run the linter instead of mentally re-deriving the thresholds:

bash scripts/rag-config-lint.sh <path-to-pipeline-config>

It reads a flat key=value (or JSON) config and emits P0/P1/P2 findings with grounded numbers, exiting 1 on any P0 (raw-score fusion; semantic chunking on academic or unspecified-doc-type corpora; Faithfulness gate below the domain floor — < 0.85 general / < 0.90 regulated), **2 on P1-only** (candidate pool > 50, eval suite < 100 queries, Faithfulness gate in the 0.85–0.99 review band), 0 when clean. P2s cover dedicated vector DB < 100M vectors and missing Contextual Retrieval on legal/medical/academic corpora. The deterministic checks live in code (QUALITY-BAR A10) — the prose rules in references/ remain the source of the why and the judgment calls the script cannot make.

Step 1: Apply Rules

After loading the relevant reference file(s):

Read the reference completely — do not skim
Apply each rule as a judgment check against the user's pipeline, config, or request
For each violated rule: state the violation clearly, then give the specific fix with the grounded number/command
Enforce the Cross-Cutting Rule on every eval setup and every retrieval-quality debug — split retrieval vs generation before recommending a fix
Always cite the specific threshold/number from the reference — a recommendation without the grounded number is generic advice the user could have gotten from any LLM

Output format per finding:

[P0] Rule CH4 (chunking): Pipeline uses semantic chunking on academic docs — benchmarked < 55% accuracy vs recursive-512's 69% under equal context budget.
→ Switch to Recursive Character Splitting at 512 tokens with 10–20% overlap.

[P1] Rule HR3 (hybrid): BM25 and cosine scores are summed directly. BM25 is unbounded; it dominates the fusion.
→ Fuse by rank with RRF (k=60), not by raw score.

Step 2: Output

Produce a structured retrieval review:

## RAG / Retrieval Review: [pipeline or stage reviewed]

### P0 — Blocking (will produce wrong or hallucinated results)
- [finding + specific fix + grounded number]

### P1 — Required (fix before trusting retrieval quality)
- [finding + specific fix + grounded number]

### P2 — Advisory (improves quality/latency/cost)
- [finding + specific fix + grounded number]

### Retrieval vs Generation Split (Cross-Cutting Rule)
- Retrieval metrics + targets: [Precision@k / Recall@k / MRR / nDCG@k / Context Precision/Recall]
- Generation metrics + targets: [Faithfulness ≥ domain threshold (~0.85 general / ~0.90 regulated; 1.0 aspirational), Answer Relevance≥0.90, Groundedness≥0.95]

### Pipeline Blueprint (if full design requested)
- Chunking → Embedding → Vector DB → Hybrid + RRF → Reranker → Eval gate

Anti-Skip Table

Excuse	Counter
"Semantic chunking is more advanced, let's use it"	Under equal context budget on academic docs it scored < 55% vs recursive-512's 69% [4]. Complexity ≠ accuracy. Use recursive-512 as the baseline.
"We'll just add BM25 and vector scores together"	BM25 is unbounded; cosine is in [-1,1]. The sum is mathematically invalid and BM25 dominates [23]. Use RRF (k=60) on ranks.
"Rerank the top 200 for best accuracy"	In the source benchmark, reranking top-50 captured ~90% of the accuracy gain of top-200 [30]. Treat top-50 as the starting point and plot quality vs latency for your own reranker/hardware (the cited ~120ms P95 is benchmark-specific, not a portable constant).
"We need a dedicated vector DB at our scale"	For < 100M vectors, pgvector + pgvectorscale hit 471 QPS @ 99% recall — 11.4× Qdrant's 41 QPS under identical conditions [11]. Don't add a second datastore prematurely.
"Our RAG eval passes, ship it"	What score, on what suite? Faithfulness below your domain threshold (~0.85 general / ~0.90 regulated) flags elevated hallucination risk and should block [35]; 0.85–0.99 is a review band, not an automatic pass. And a single blended score hides whether retrieval or generation failed.
"More dimensions = better embeddings"	text-embedding-3-large truncates 3072→512 with a Wilcoxon test showing no significant quality loss [7,8] — cutting DB storage. Dimensions are a storage knob, not a quality knob here.
"Chunks are fine as-is, just embed them"	If chunks aren't self-contained ("revenue grew 3% last quarter" — which company/quarter?), apply Contextual Retrieval (CH8): prepend 50–100 token LLM context BEFORE embedding AND BM25 → cuts top-20 failure rate 35% → 49% (+BM25) → 67% (+rerank) at ~$1.02/M tokens with prompt caching.
"Faithfulness is 0.95, the answer is correct"	Faithfulness measures grounding in the retrieved context, NOT correctness. With stale/wrong context a 0.95-faithful answer is still wrong — no framework distinguishes wrong-context from right-context [RE7]. Pair with retrieval metrics + source freshness.
"Cohere Rerank is the best API reranker"	Voyage rerank-2.5 beats Cohere Rerank v3.5 by +7.94% (lite: +7.16%) with 32K context (8× Cohere) at no price increase [HR5]. Re-check SOTA before pinning a managed reranker.

Tool / Model Quick Reference

Component	Grounded Pick (from research)	Why
Chunking baseline	Recursive Character Splitting, 512 tokens, 10–20% overlap	Highest accuracy (69%) under equal budget [4]
Chunking (non-self-contained chunks)	Contextual Retrieval — prepend 50–100 token LLM context before embed + BM25	Cuts top-20 failure 35–67%; ~$1.02/M tokens cached (CH8)
Embedding (general)	Voyage 3.5 (frontier: voyage-4 MoE, Gemini 001, Qwen3-8B, Jina v5-small)	Retrieval champion, 32k context [10,12]; select by use-case not MTEB rank (EM1)
Reranker (managed API, accuracy + long ctx)	Voyage rerank-2.5 (lite if latency-bound)	+7.94% vs Cohere v3.5, 32K context (8×), no price increase (HR5)
Embedding (storage-truncatable)	OpenAI text-embedding-3-large → 512 dims	Matryoshka, no significant loss [7,8]
Embedding (self-hosted hybrid)	BGE-M3	Native dense+sparse+multi-vector, 100+ langs, $0 [16,17]
Vector DB (< 100M, relational)	pgvector + pgvectorscale	471 QPS @ 99% recall, 11.4× Qdrant [11]
Vector DB (petabyte/GPU)	Milvus	Billions+, DiskANN, GPU [20,21]
Fusion	RRF, k=60	Rank-based, no score normalization [23,25]
Reranker (low-latency)	gte-reranker-modernbert-base (149M)	Cross-encoder precision, 8× smaller than 1B [31]
Eval	RAGAS (experiment) / DeepEval (CI/CD) / TruLens (prod) + IR metrics	Split retrieval vs generation; Faithfulness ≠ correctness (RE7)