corpus-grounding

model_tier: inherit name: corpus-grounding description: "Shared corpus-grounding engine — BM25 + structured filters + decision rules over CSV corpora via a domain manifest. Use when a skill needs grounded pre-action option-space constraints." domain: engineering workspaces: - engineering packs: - engineering-base trust: level: professional install: removable: true execution: type: manual

The reusable grounding layer from ADR-061: one pure-stdlib engine (BM25 retrieval + structured pre-filters + decision-rule evaluation) that any domain plugs into via a manifest — frontend design is its first consumer (design-intelligence). A corpus grounds (pre-action, constrains the option space); it never replaces mid-action reference (RAG), post-action validation (rules), or the method itself (a framework skill).

Engine provenance: ported from nextlevelbuilder/ui-ux-pro-max-skill @ b7e3af80f6e331f6fb456667b82b12cade7c9d35 (MIT, last checked 2026-06-07) — BM25 de-duplicated, slide-only paths stripped, every frontend-hardcoded axis moved into the manifest. Full license obligations: design-intelligence/ATTRIBUTION.md.

When to use

A skill needs pre-action selection grounded in curated knowledge ("which layout pattern / threat class / index strategy applies here").
You are authoring a new domain corpus — write a manifest + CSVs, validate with ground.py validate, never fork the engine.
You need stack-scoped Do/Don't guidance (--stack react …).

Do NOT

Do NOT use the corpus for mid-task fact lookup — that is reference (references/ docs / RAG).
Do NOT use it for output validation — write a rule/linter.
Do NOT build a corpus that fits in 5 lines of an always-on rule.
Do NOT fork the engine per domain — plug in via a manifest.
Do NOT merge a manifest without owner + refresh cadence (validator refuses it anyway).

Procedure: Consult a corpus

Locate the domain manifest (<domain-skill>/data/manifest.json).
Run search (one domain / stack axis) or ground (reasoning plan).
Read confidence + evidence_gap before trusting any row.
Propose grounded options; the human confirms (Tier-1 default).

Invocation (consumer runtime)

Scripts resolve all paths skill-relative (per docs/contracts/skill-bundled-assets.md) — they work from any cwd:

python3 <skills-root>/corpus-grounding/scripts/ground.py search \
  --manifest <skills-root>/<domain-skill>/data/manifest.json \
  "fintech dashboard" [--domain style] [--stack react] \
  [--filter "Severity=HIGH"] [--max-results 3] [--json]

python3 <skills-root>/corpus-grounding/scripts/ground.py ground \
  --manifest <skills-root>/<domain-skill>/data/manifest.json \
  "luxury e-commerce" [--context '{"data_heavy": true}'] [--persist DIR]

python3 <skills-root>/corpus-grounding/scripts/ground.py validate --manifest …

<skills-root> is wherever skills are deployed (~/.claude/skills/ for Claude Code installs; src/skills/ inside this repo).

Output format

Interface contract v1 — stability promise. Per ADR-061 §2 ("Opus condition"), this interface is versioned; domains may depend on it. Breaking any item below requires a major bump (manifest_version: 2), a migration note here, and updates to every consuming skill in the same PR. Additive fields are allowed anytime.

Stable v1 surface:

Modules + public names — bm25_search.{BM25, load_csv, apply_filters, search_rows, RETRIEVERS}, decision_engine.{detect_domain, search_domain, search_stack, evaluate_rules, ground, persist_grounding}, schema_validator.{load_manifest, validate_manifest, resolve_data_path, ManifestError, MANIFEST_VERSION, TIERS}.
CLI — ground.py {search|ground|validate} with the flags shown above.
Result shape — search: {domain|stack, query, file, count, results, scores, filtered_from, confidence{label,score}, evidence_gap[]}; ground: {domain, query, category, rule, rules_evaluation{matched, unmatched}, selections{<domain>: {best, alternatives, confidence}}, confidence{label,score}, evidence_gap[]}.
Contract invariants — every output carries confidence and evidence_gap; retrievers are name-selected (bm25 default, structured, hybrid); no network, no subprocess, read-only except the opt-in --persist DIR.

Manifest contract (schema-agnostic plug-in)

Each domain ships data/manifest.json beside its CSVs — declaring its own axes; the engine never assumes the frontend schema:

{
  "manifest_version": 1,
  "domain": "frontend-design",
  "tier": "conditional-grounding",
  "data_dir": ".",
  "retriever": "bm25",
  "default_domain": "style",
  "domains": {
    "style": {
      "file": "styles.csv",
      "search_cols": ["Style Category", "Keywords"],
      "output_cols": ["Style Category", "Best For"],
      "max_results": 3,
      "filters": {}
    }
  },
  "detect": { "style": ["minimalism", "glassmorphism"] },
  "stacks": { "react": "stacks/react.csv" },
  "stack_cols": { "search_cols": ["…"], "output_cols": ["…"] },
  "reasoning": {
    "file": "ui-reasoning.csv",
    "category_domain": "product",
    "category_column": "Product Type",
    "match_column": "UI_Category",
    "rules_column": "Decision_Rules",
    "priority_column": "Style_Priority",
    "priority_domain": "style",
    "name_columns": { "style": "Style Category" },
    "plan": { "style": 3, "color": 2 },
    "rules_module": "rules.py"
  },
  "owner": "package-maintainer",
  "refresh_cadence": "quarterly",
  "upstream": { "repo": "…", "sha": "…", "last_checked": "YYYY-MM-DD" }
}

tier ∈ lookup-only | conditional-grounding | constraint-emission; reasoning is only legal above lookup-only.
owner, refresh_cadence, upstream{repo,sha,last_checked} are required — provenance discipline per ADR-061 §6; an unowned corpus is not merged.
rules_module (optional Python escape hatch where JSON rules cap out) must live beside the manifest — absolute paths and .. are refused.
Decision rules are surfaced as matched and unmatched — the full rule space stays auditable, never a hidden gate.

Procedure: Author a new domain corpus

Pass the qualification rubric (ADR-061 §5): decision-rule utility beats row count; grounding must happen before action; "fits in 5 lines → it's a rule, not a corpus".
Write data/manifest.json + CSVs in your domain skill; run ground.py validate.
Name an owner + refresh cadence in the manifest, pin the upstream SHA.
Cite this skill from your domain skill; consult via the CLI (Tier-1 consultation default — propose grounded options, human confirms).

Gotchas

An empty result is a legitimate outcome — surface the evidence gap and proceed on priors; never silently widen filters to force a hit.
The structured filters pre-filter matches case-insensitive substrings; an over-specific value silently filters everything out — check filtered_from vs count when results look thin.
detect keyword routing falls back to default_domain; product-shaped queries route best ("fintech dashboard"), generic words land on the default.
BM25 tokenizer drops tokens ≤2 chars — "UI", "a11y" style queries need longer companions.

Runtime-safety review (Step 1.6 record)

Read-only by default — the engine opens corpus CSVs under the manifest's directory only (resolve_data_path refuses absolute paths and .. escapes).
Single write surface — --persist DIR (opt-in) writes markdown under the caller-chosen DIR; nothing else writes.
No network, no subprocess, no secrets — pure stdlib; embeddings / remote retrievers are intentionally not implemented (ADR-061 §2).
rules_module executes manifest-adjacent Python — same trust domain as the skill that ships the manifest; containment enforced by resolve_data_path.

corpus-grounding