model_tier: inherit name: corpus-grounding description: "Shared corpus-grounding engine — BM25 + structured filters + decision rules over CSV corpora via a domain manifest. Use when a skill needs grounded pre-action option-space constraints." domain: engineering workspaces: - engineering packs: - engineering-base trust: level: professional install: removable: true execution: type: manual
corpus-grounding
The reusable grounding layer from ADR-061: one pure-stdlib engine (BM25 retrieval + structured pre-filters + decision-rule evaluation) that any domain plugs into via a manifest — frontend design is its first consumer (
design-intelligence). A corpus grounds (pre-action, constrains the option space); it never replaces mid-action reference (RAG), post-action validation (rules), or the method itself (a framework skill).
Engine provenance: ported from nextlevelbuilder/ui-ux-pro-max-skill
@ b7e3af80f6e331f6fb456667b82b12cade7c9d35 (MIT, last checked
2026-06-07) — BM25 de-duplicated, slide-only paths stripped, every
frontend-hardcoded axis moved into the manifest. Full license obligations:
design-intelligence/ATTRIBUTION.md.
When to use
- A skill needs pre-action selection grounded in curated knowledge ("which layout pattern / threat class / index strategy applies here").
- You are authoring a new domain corpus — write a manifest + CSVs,
validate with
ground.py validate, never fork the engine. - You need stack-scoped Do/Don't guidance (
--stack react …).
Do NOT
- Do NOT use the corpus for mid-task fact lookup — that is reference
(
references/docs / RAG). - Do NOT use it for output validation — write a rule/linter.
- Do NOT build a corpus that fits in 5 lines of an always-on rule.
- Do NOT fork the engine per domain — plug in via a manifest.
- Do NOT merge a manifest without owner + refresh cadence (validator refuses it anyway).
Procedure: Consult a corpus
- Locate the domain manifest (
<domain-skill>/data/manifest.json). - Run
search(one domain / stack axis) orground(reasoning plan). - Read
confidence+evidence_gapbefore trusting any row. - Propose grounded options; the human confirms (Tier-1 default).
Invocation (consumer runtime)
Scripts resolve all paths skill-relative (per
docs/contracts/skill-bundled-assets.md)
— they work from any cwd:
python3 <skills-root>/corpus-grounding/scripts/ground.py search \
--manifest <skills-root>/<domain-skill>/data/manifest.json \
"fintech dashboard" [--domain style] [--stack react] \
[--filter "Severity=HIGH"] [--max-results 3] [--json]
python3 <skills-root>/corpus-grounding/scripts/ground.py ground \
--manifest <skills-root>/<domain-skill>/data/manifest.json \
"luxury e-commerce" [--context '{"data_heavy": true}'] [--persist DIR]
python3 <skills-root>/corpus-grounding/scripts/ground.py validate --manifest …
<skills-root> is wherever skills are deployed (~/.claude/skills/ for
Claude Code installs; src/skills/ inside this repo).
Output format
Interface contract v1 — stability promise. Per ADR-061 §2 ("Opus
condition"), this interface is versioned; domains may depend on it. Breaking any item below requires a major bump
(manifest_version: 2), a migration note here, and updates to every
consuming skill in the same PR. Additive fields are allowed anytime.
Stable v1 surface:
- Modules + public names —
bm25_search.{BM25, load_csv, apply_filters, search_rows, RETRIEVERS},decision_engine.{detect_domain, search_domain, search_stack, evaluate_rules, ground, persist_grounding},schema_validator.{load_manifest, validate_manifest, resolve_data_path, ManifestError, MANIFEST_VERSION, TIERS}. - CLI —
ground.py {search|ground|validate}with the flags shown above. - Result shape — search:
{domain|stack, query, file, count, results, scores, filtered_from, confidence{label,score}, evidence_gap[]}; ground:{domain, query, category, rule, rules_evaluation{matched, unmatched}, selections{<domain>: {best, alternatives, confidence}}, confidence{label,score}, evidence_gap[]}. - Contract invariants — every output carries
confidenceandevidence_gap; retrievers are name-selected (bm25default,structured,hybrid); no network, no subprocess, read-only except the opt-in--persist DIR.
Manifest contract (schema-agnostic plug-in)
Each domain ships data/manifest.json beside its CSVs — declaring its
own axes; the engine never assumes the frontend schema:
{
"manifest_version": 1,
"domain": "frontend-design",
"tier": "conditional-grounding",
"data_dir": ".",
"retriever": "bm25",
"default_domain": "style",
"domains": {
"style": {
"file": "styles.csv",
"search_cols": ["Style Category", "Keywords"],
"output_cols": ["Style Category", "Best For"],
"max_results": 3,
"filters": {}
}
},
"detect": { "style": ["minimalism", "glassmorphism"] },
"stacks": { "react": "stacks/react.csv" },
"stack_cols": { "search_cols": ["…"], "output_cols": ["…"] },
"reasoning": {
"file": "ui-reasoning.csv",
"category_domain": "product",
"category_column": "Product Type",
"match_column": "UI_Category",
"rules_column": "Decision_Rules",
"priority_column": "Style_Priority",
"priority_domain": "style",
"name_columns": { "style": "Style Category" },
"plan": { "style": 3, "color": 2 },
"rules_module": "rules.py"
},
"owner": "package-maintainer",
"refresh_cadence": "quarterly",
"upstream": { "repo": "…", "sha": "…", "last_checked": "YYYY-MM-DD" }
}
tier∈lookup-only | conditional-grounding | constraint-emission;reasoningis only legal above lookup-only.owner,refresh_cadence,upstream{repo,sha,last_checked}are required — provenance discipline per ADR-061 §6; an unowned corpus is not merged.rules_module(optional Python escape hatch where JSON rules cap out) must live beside the manifest — absolute paths and..are refused.- Decision rules are surfaced as
matchedandunmatched— the full rule space stays auditable, never a hidden gate.
Procedure: Author a new domain corpus
- Pass the qualification rubric (ADR-061 §5): decision-rule utility beats row count; grounding must happen before action; "fits in 5 lines → it's a rule, not a corpus".
- Write
data/manifest.json+ CSVs in your domain skill; runground.py validate. - Name an owner + refresh cadence in the manifest, pin the upstream SHA.
- Cite this skill from your domain skill; consult via the CLI (Tier-1 consultation default — propose grounded options, human confirms).
Gotchas
- An empty result is a legitimate outcome — surface the evidence gap and proceed on priors; never silently widen filters to force a hit.
- The structured
filterspre-filter matches case-insensitive substrings; an over-specific value silently filters everything out — checkfiltered_fromvscountwhen results look thin. detectkeyword routing falls back todefault_domain; product-shaped queries route best ("fintech dashboard"), generic words land on the default.- BM25 tokenizer drops tokens ≤2 chars — "UI", "a11y" style queries need longer companions.
Runtime-safety review (Step 1.6 record)
- Read-only by default — the engine opens corpus CSVs under the
manifest's directory only (
resolve_data_pathrefuses absolute paths and..escapes). - Single write surface —
--persist DIR(opt-in) writes markdown under the caller-chosenDIR; nothing else writes. - No network, no subprocess, no secrets — pure stdlib; embeddings / remote retrievers are intentionally not implemented (ADR-061 §2).
rules_moduleexecutes manifest-adjacent Python — same trust domain as the skill that ships the manifest; containment enforced byresolve_data_path.
See also
design-intelligence— first consumer (frontend corpus + manifest).- ADR-061 — architecture, qualification rubric, fork resolutions.
docs/contracts/skill-bundled-assets.md— how bundledscripts/+data/reach consumer runtime.- Tests:
tests/test_corpus_grounding_engine.py,tests/test_skill_bundled_assets.py.