ai-feature

Use when the user adds, changes, or refactors an LLM agent under src/agents/ — or anything that goes through Foundation Model API or an MLflow-traced LLM call. Mandatory under CNS §3.5 and .cursor/12-ai-feature-lifecycle.mdc. Walks the SPEC → dataset → eval-harness → impl → re-eval sequence.

By databrickslabs. Updated 6/3/2026

OntoBricks ai-feature

Triggered automatically for any change under src/agents/**, src/back/core/agents/**, or an MCP tool that wraps an agent. The companion gate is .cursor/12-ai-feature-lifecycle.mdc. This skill is the path of least resistance to passing that gate. Skip this skill, and the CI gate (G2 — .github/workflows/eval-gate.yml) will reject the PR.

Why this exists

OntoBricks ships 5 production agents (agent_owl_generator, agent_ontology_assistant, agent_auto_assignment, agent_auto_icon_assign, agent_dtwin_chat) with zero eval coverage today. A prompt regression or a tool-handler bug ships green because the existing tests only mock the LLM. This skill enforces a SPEC-first + eval-gated lifecycle so quality is observable.

Procedure

7 steps. The first 4 happen before code changes (the gate is on the artifact, not the runtime). Steps 5–7 are the implementation loop.

1. Brainstorm (≤ 10 min)

Invoke superpowers:brainstorming. Surface:

  • Purpose in one sentence.
  • Target users: who calls this agent? (LLM client via MCP? Internal UI? Another agent?)
  • In/out shape: input format, output format.
  • Success criteria: what does "right" look like in three examples?
  • Failure modes: what's the worst output? How would a user notice?

Capture decisions at the top of .planning/<slug>/PLAN.md.

2. Fill SPEC.md

Copy .claude/skills/ai-feature/SPEC.template.md to .planning/<slug>/SPEC.md. Fill every section. Do not skip the eval-dimensions table — that's what the CI gate parses.

Required fields:

  • agent_name, module_path (e.g., src/agents/agent_fact_checker/).
  • model_endpoint — the Databricks Foundation Model API endpoint name.
  • tools[] — tool name, JSON schema, intended LLM use.
  • eval_dimensions[] — name, metric, threshold, weight.
  • failure_modes[] — symptom, detection (which judge / trace tag), mitigation.
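
The required fields can be pictured as a small typed structure. What follows is a minimal sketch, not the actual template or the parser used by the CI gate: the class names and the `missing_fields` helper are hypothetical, and only the field names come from the list above.

```python
from dataclasses import dataclass, field


@dataclass
class EvalDimension:
    name: str
    metric: str
    threshold: float
    weight: float


@dataclass
class AgentSpec:
    agent_name: str
    module_path: str
    model_endpoint: str
    tools: list = field(default_factory=list)            # each: name, JSON schema, intended LLM use
    eval_dimensions: list = field(default_factory=list)  # list of EvalDimension
    failure_modes: list = field(default_factory=list)    # each: symptom, detection, mitigation


def missing_fields(spec: AgentSpec) -> list:
    """Return the names of required fields that are empty (hypothetical helper)."""
    required = ["agent_name", "module_path", "model_endpoint",
                "tools", "eval_dimensions", "failure_modes"]
    return [name for name in required if not getattr(spec, name)]
```

Running a check like this before opening the PR is one way to catch an empty eval-dimensions table early, assuming the SPEC has first been loaded into such a structure.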

3. Build the eval dataset

Output: .planning/<slug>/eval/dataset.jsonl AND a mirror at tests/eval/datasets/<agent_name>/baseline.jsonl.

Minimum sizes:

| Change type | Min examples |
| --- | --- |
| New agent | 20 (15 hand-curated + 5 synthetic) |
| Material change (prompt, tool surface) | 10 |
| Hotfix / regression test | 3 (the failing cases) → regression.jsonl |
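
The minimum sizes lend themselves to a quick local check. The sketch below is an assumption about how one might count examples; the dictionary keys and both function names are illustrative labels, not names used by the CI gate.

```python
import json
from pathlib import Path

# Minimum example counts per change type; the keys are illustrative labels.
MIN_EXAMPLES = {"new_agent": 20, "material_change": 10, "hotfix": 3}


def count_examples(path):
    """Count non-blank lines in a JSON-Lines file, checking that each one parses."""
    count = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                json.loads(line)  # raises ValueError on a malformed row
                count += 1
    return count


def meets_minimum(path, change_type):
    """True when the file holds at least the minimum number of examples."""
    return count_examples(path) >= MIN_EXAMPLES[change_type]
```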

Row shape (JSON-Lines, one example per line):

{"input": {...}, "expected": {"contains": [...], "schema": {...}, "constraints": [...]}, "tags": ["happy" | "ambiguous" | "adversarial"]}

Sources:

  • Hand-curated: from product team intuition, real user requests, existing screenshots.
  • Synthetic: use the databricks-synthetic-data-generation skill to bootstrap. Tag as tags: ["synthetic"] so reviewers know.
  • Regression: every production-trace failure that you fix gets added here. Never deleted.
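
A row-shape check can catch malformed examples before the harness runs. This is a sketch under stated assumptions: it treats `input` and `expected` as JSON objects, rejects keys in `expected` other than the three shown in the row shape, and accepts any non-empty list of string tags (so `synthetic` passes too). The `row_problems` name is hypothetical.

```python
import json

ALLOWED_EXPECTED_KEYS = {"contains", "schema", "constraints"}


def row_problems(line):
    """Return a list of problems found in one JSON-Lines row (empty if none)."""
    try:
        row = json.loads(line)
    except ValueError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(row, dict):
        return ["row is not a JSON object"]
    problems = []
    if not isinstance(row.get("input"), dict):
        problems.append("'input' must be an object")
    expected = row.get("expected")
    if not isinstance(expected, dict):
        problems.append("'expected' must be an object")
    else:
        unknown = set(expected) - ALLOWED_EXPECTED_KEYS
        if unknown:
            problems.append(f"unknown 'expected' keys: {sorted(unknown)}")
    tags = row.get("tags")
    if not (isinstance(tags, list) and tags and all(isinstance(t, str) for t in tags)):
        problems.append("'tags' must be a non-empty list of strings")
    return problems
```

Applying it to every line of dataset.jsonl and failing on the first non-empty result keeps the committed dataset and its mirror parseable.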

4. Wire the eval harness

Output: tests/eval/run_<agent_name>.py.

Pattern:

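import mlflow
from databricks_mlflow_evaluation import evaluate


def invoke_agent(example_input):
    """Placeholder: call your agent here and return its output."""
    raise NotImplementedError


# ...
with mlflow.start_run():
    result = evaluate(
        data="tests/eval/datasets/<agent_name>/baseline.jsonl",
        model=invoke_agent,  # your agent invocation callable
        judges=[...],  # see tests/eval/judges/
    )
    mlflow.log_metric("judge_score", result.aggregate_score)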

Run a baseline before any code change. Record the MLflow run URI in the PR body (the template at .github/PULL_REQUEST_TEMPLATE.md has a slot for it).
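
To make the `judges` argument less abstract, here is a minimal sketch of a rule-based judge for the `expected.contains` field of a dataset row. The judge interface the harness actually expects is not shown in this document, so the function names and the returned dictionary shape are assumptions for illustration only.

```python
def contains_judge(output, expected):
    """Pass when every string in expected["contains"] appears in the output."""
    required = expected.get("contains", [])
    missing = [s for s in required if s not in output]
    return {"passed": not missing, "missing": missing}


def aggregate_score(results):
    """Fraction of examples that passed, or 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["passed"]) / len(results)
```

A deterministic judge like this is cheap to run and easy to debug; LLM-based judges would presumably cover the dimensions that simple substring checks cannot.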

5. Plan and implement

Invoke superpowers:writing-plans referencing SPEC.md. The plan lives at .planning/<slug>/PLAN.md and lists:

  • Files to add / change under src/agents/<agent_name>/ (follow src/agents/engine_base.py loop pattern).
  • Tool definitions (TOOL_DEFINITIONS) and handlers (TOOL_HANDLERS).
  • Tracing wiring: @trace_agent, @trace_llm, @trace_tool on every code path (src/agents/tracing.py).
  • Unit tests with httpx.MockTransport (see tests/fixtures/http.py).
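
The pairing of TOOL_DEFINITIONS and TOOL_HANDLERS can be sketched as follows. Only those two names come from this document; the `lookup_class` tool, its schema, the stand-in data, and the `dispatch_tool_call` helper are hypothetical, and the real loop in src/agents/engine_base.py may differ.

```python
import json

# Hypothetical tool: the name, schema, and handler are illustrative only.
TOOL_DEFINITIONS = [
    {
        "name": "lookup_class",
        "description": "Return the label of an ontology class by its identifier.",
        "parameters": {
            "type": "object",
            "properties": {"class_id": {"type": "string"}},
            "required": ["class_id"],
        },
    }
]


def lookup_class(class_id):
    labels = {"C1": "Person"}  # stand-in data for the sketch
    return {"class_id": class_id, "label": labels.get(class_id)}


TOOL_HANDLERS = {"lookup_class": lookup_class}


def dispatch_tool_call(name, arguments_json):
    """Route one LLM tool call to its handler; return a JSON string for the model."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        return json.dumps(handler(**json.loads(arguments_json)))
    except (TypeError, ValueError) as exc:
        # Malformed JSON or wrong arguments: report back instead of crashing the loop.
        return json.dumps({"error": str(exc)})
```

Returning errors as JSON rather than raising lets the model see and correct a bad tool call, which is also the kind of tool-handler behavior a unit test can pin down without any LLM.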

Run superpowers:test-driven-development. Red → Green → Refactor.

6. Re-run the eval

After the implementation lands locally, re-run tests/eval/run_<agent_name>.py against the same dataset. Two outcomes:

  • Judge score ≥ baseline + delta: great, push the PR with the new MLflow run URI.
  • Judge score < threshold: iterate. Look at failed examples (the judge writes per-example pass/fail to MLflow). If the failure is the eval being wrong, edit the dataset and document why in .planning/<slug>/PLAN.md.

For borderline cases, invoke superpowers:requesting-code-review; the reviewer then records their judgment as a waiver comment in the PR.
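
The decision after a re-run can be written down as a small function. This is a sketch under assumptions the document does not spell out: `threshold` is taken to be the value from the SPEC's eval dimensions, `delta` a required improvement margin, and "borderline" any score that clears the threshold but not baseline + delta. The function name and return labels are hypothetical.

```python
def eval_outcome(score, baseline, threshold, delta=0.0):
    """Classify a re-eval score as 'push', 'iterate', or 'borderline'."""
    if score < threshold:
        return "iterate"      # below the threshold: look at failed examples
    if score >= baseline + delta:
        return "push"         # meets baseline + delta: open the PR with the new run URI
    return "borderline"       # assumed middle ground: ask for a reviewer's judgment
```

For example, with a baseline of 0.80, a threshold of 0.75, and a delta of 0.05, a score of 0.78 would land in the borderline bucket under these assumptions.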

7. Ship

  • superpowers:verification-before-completion — tests + eval green.
  • code-review (project skill).
  • changelog (project skill).
  • Conventional Commit: feat(agents): add agent_<name> or fix(agents): tune <agent> threshold.
  • PR; CI G1 + G2 both pass.

What this skill does not do

  • It doesn't deploy the agent. Use deploy once the PR merges.
  • It doesn't run the LLM in CI. CI runs only the unit tests + the eval comparison against the committed dataset; the LLM call happens locally or against a configured serving endpoint, with the result captured in MLflow.
  • It doesn't replace superpowers:brainstorming or superpowers:writing-plans. It sequences them.

Cross-references

  • .cursor/12-ai-feature-lifecycle.mdc — the rule that gates this work.
  • §3.5 of /Users/dermot.smyth/.claude/plans/ultrathink-perform-a-detailed-whimsical-token.md — methodology context.
  • src/agents/engine_base.py — the runtime pattern to follow.
  • tests/fixtures/http.py: agent_mock_transport, for unit tests.
  • tests/fixtures/mlflow.py: captured_traces, for span-shape assertions.
  • databricks-mlflow-evaluation (plugin skill) — the harness.
  • databricks-synthetic-data-generation (plugin skill) — dataset cold-start.
  • agent-evaluation (plugin skill) — umbrella for the eval flow.
Install via CLI
npx skills add https://github.com/databrickslabs/ontobricks --skill ai-feature