bat-story-eval - SKILL.md Agent Skill

name: bat-story-eval
description: Compare MCP tool behavior between target and baseline versions using pre-built and custom stories with diff-based triage.
disable-model-invocation: true
argument-hint: --baseline v6.6.1 [--agents gemini] [--stories s01,s02]
allowed-tools: Bash, Read, Write, Glob, Grep, Task

BAT Story Evaluation

You are the evaluator. Follow these steps IN ORDER. Do not skip steps.

Parse Arguments

From $ARGUMENTS, extract:

--baseline: REQUIRED. Git tag/branch of the released version (e.g., v6.6.1).
--agents: Agent list (default: gemini). Comma-separated.
--stories: Force specific pre-built story IDs (e.g., s01,s02). Overrides triage selection.
--all-stories: Skip triage, run ALL pre-built stories.
--keep-container: Keep HA containers alive after run for manual inspection.
--model: Model for Claude agent (e.g., haiku, sonnet).

If $ARGUMENTS is --help or missing --baseline, show usage and stop:

/bat-story-eval --baseline v6.6.1
/bat-story-eval --baseline v6.6.1 --agents gemini,claude
/bat-story-eval --baseline v6.6.1 --stories s01,s02
/bat-story-eval --baseline v6.6.1 --all-stories --agents claude --model haiku
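The flag handling above can be sketched with `argparse`. This is only an illustration of the rules listed in this section; the helper name `parse_eval_args` is hypothetical and not part of the skill or of run_story.py.

```python
import argparse


def parse_eval_args(argv):
    """Parse the /bat-story-eval flags (illustrative helper, not project code)."""
    parser = argparse.ArgumentParser(prog="/bat-story-eval")
    # --baseline is the only required flag; argparse prints usage and exits if it is missing.
    parser.add_argument("--baseline", required=True,
                        help="Git tag/branch of the released version, e.g. v6.6.1")
    parser.add_argument("--agents", default="gemini",
                        help="Comma-separated agent list")
    parser.add_argument("--stories", default=None,
                        help="Comma-separated pre-built story IDs; overrides triage")
    parser.add_argument("--all-stories", action="store_true",
                        help="Skip triage and run all pre-built stories")
    parser.add_argument("--keep-container", action="store_true",
                        help="Keep HA containers alive after the run")
    parser.add_argument("--model", default=None,
                        help="Model for the Claude agent, e.g. haiku")
    args = parser.parse_args(argv)
    # Split the comma-separated values into clean lists.
    args.agents = [a.strip() for a in args.agents.split(",") if a.strip()]
    args.stories = ([s.strip() for s in args.stories.split(",") if s.strip()]
                    if args.stories else [])
    return args
```

For example, `parse_eval_args(["--baseline", "v6.6.1", "--agents", "gemini,claude"])` yields `agents == ["gemini", "claude"]` and an empty `stories` list, which leaves story selection to triage.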

Step 0: Triage (Diff Analysis + Custom Story Design)

0a. Compute Diff

cd /home/julien/github/ha-mcp/worktree/uat-stories
git diff <baseline>..HEAD -- src/ha_mcp/ --stat
git diff <baseline>..HEAD -- src/ha_mcp/ --name-only

Classify changed files:

Tool modules (tools/tools_*.py): specific tool implementations changed
Core code (client/, server.py, errors.py, tools/util_helpers.py): affects all tools
Utilities (utils/, resources/): may affect all tools
No src/ changes: only tests/docs/config — select 2 smoke-test stories
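As a rough sketch, the classification above could be applied to each path printed by `git diff --name-only`. The function name and the labels `other-src` and `non-src` are assumptions added for illustration; an empty path list corresponds to the "No src/ changes" case.

```python
from fnmatch import fnmatch

PREFIX = "src/ha_mcp/"
# Single files treated as core code; client/ is handled as a directory below.
CORE_FILES = {"server.py", "errors.py", "tools/util_helpers.py"}


def classify_changed_file(path):
    """Return 'core', 'tool', 'utility', 'other-src', or 'non-src' for one path."""
    if not path.startswith(PREFIX):
        return "non-src"
    rel = path[len(PREFIX):]
    # Core is checked first so tools/util_helpers.py is not mistaken for a tool module.
    if rel.startswith("client/") or rel in CORE_FILES:
        return "core"
    if fnmatch(rel, "tools/tools_*.py"):
        return "tool"
    if rel.startswith(("utils/", "resources/")):
        return "utility"
    # Anything else under src/ha_mcp/ that the list above does not name.
    return "other-src"
```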

0b. Select Pre-built Stories

Skip if --stories or --all-stories was passed.

Read the diff from 0a
Read all story YAMLs in tests/uat/stories/catalog/s*.yaml (title, description, prompt, setup)
For each story, reason about whether the diff could affect its outcome:
- What tools/code paths would this story exercise?
- Do any of those overlap with what changed?
Rules:
- Story likely exercises changed code -> selected
- Core code changed (client/, server.py, errors.py, tools/util_helpers.py) -> all stories selected
- No src/ changes -> 2 representative stories as smoke test
Report which stories were selected and why (one sentence per story)
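The three rules can be expressed as a small decision function. This is a sketch under assumptions: `likely_affected` stands for the outcome of the per-story reasoning (which the evaluator still does by hand), and taking the first two catalog entries as the "representative" smoke test is an arbitrary placeholder.

```python
def select_stories(all_story_ids, likely_affected, core_changed, src_changed):
    """Apply the selection rules; returns story IDs in catalog order."""
    ordered = list(all_story_ids)
    if not src_changed:
        # No src/ changes: 2 stories as a smoke test (first two is an assumption).
        return ordered[:2]
    if core_changed:
        # Core code affects every tool, so every story is selected.
        return ordered
    affected = set(likely_affected)
    return [story_id for story_id in ordered if story_id in affected]
```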

0c. Design Custom Stories (at least 1)

Read the diff carefully. Your job is to catch regressions. For each changed code path NOT covered by selected pre-built stories, ask: "could this break something a user would notice?" If yes, design a custom story to test that hypothesis.

Guidelines: Always create at least 1 custom story. Each must test a distinct regression hypothesis — don't create stories that overlap. Stop when you've covered the risky gaps.

Write each as /tmp/custom_c<NN>.yaml using the standard story format:

id: c01
title: "Short description of what is being tested"
category: custom
weight: 5
description: >
  Rationale: [what changed in the diff and why this scenario tests it]

setup:
  - tool: ha_config_set_helper
    args:
      helper_type: "input_boolean"
      name: "Test Entity Name"

prompt: >
  [Natural language request a real user would make that exercises the changed code]

teardown: []

verify:
  questions:
    - "Did the agent achieve the expected outcome?"
    - "Did it use the expected tools?"

expected:
  tools_should_use:
    - ha_search_entities
  description: >
    [What a correct agent should do]

Design principles:

Focus on code paths that changed in the diff
Plausible user scenarios, not synthetic edge cases
Setup creates realistic HA state via FastMCP in-memory steps
Prompts are what a real user would type
At least 1. Each tests a distinct regression hypothesis. Stop when gaps are covered.
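Before running a custom story, a quick sanity check on the parsed YAML can catch a malformed file early. The sketch below works on an already-parsed dict (loading the YAML is left out), and treating every top-level key of the template as required is an assumption; both helper names are hypothetical.

```python
import re

# Top-level keys that appear in the story template above.
EXPECTED_KEYS = {"id", "title", "category", "weight", "description",
                 "setup", "prompt", "teardown", "verify", "expected"}


def check_custom_story(story):
    """Return a list of problems found in a parsed custom story dict."""
    problems = [f"missing key: {key}" for key in sorted(EXPECTED_KEYS - set(story))]
    if not re.fullmatch(r"c\d{2}", str(story.get("id", ""))):
        problems.append("id should look like c01, c02, ...")
    if story.get("category") != "custom":
        problems.append("category should be 'custom'")
    if not (story.get("verify") or {}).get("questions"):
        problems.append("verify.questions is empty")
    return problems


def custom_story_path(story_id):
    """Build the /tmp/custom_c<NN>.yaml path for a story ID such as 'c01'."""
    return f"/tmp/custom_{story_id}.yaml"
```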

Step 1: Run Baseline Version

For EACH agent, run all stories against the baseline version. One container per agent, reused across all stories.

1a. Start container with first story

cd /home/julien/github/ha-mcp/worktree/uat-stories
uv run python tests/uat/stories/run_story.py \
  catalog/<first_story>.yaml \
  --agents <agent> --keep-container \
  --branch <baseline> \
  --results-file local/uat-results.jsonl

CAPTURE from stderr: HA URL (e.g., http://localhost:32771), token, session file path.

1b. Verify, then run remaining pre-built stories

After each story, verify via ha_query.py using the story's verify.questions:

uv run python tests/uat/stories/scripts/ha_query.py \
  --ha-url http://localhost:PORT --ha-token TOKEN \
  --agent <agent> \
  "Does an automation with alias 'Sunset Porch Light' exist?"

Record each answer as confirmed / denied / unclear.

Run remaining pre-built stories on the same container:

uv run python tests/uat/stories/run_story.py \
  catalog/<next_story>.yaml \
  --agents <agent> --ha-url http://localhost:PORT --ha-token TOKEN \
  --branch <baseline> \
  --results-file local/uat-results.jsonl

Verify each immediately after running.

1c. Run custom stories on same container

uv run python tests/uat/stories/run_story.py \
  /tmp/custom_c01.yaml \
  --agents <agent> --ha-url http://localhost:PORT --ha-token TOKEN \
  --branch <baseline> \
  --results-file local/uat-results.jsonl

Verify each via ha_query.py using the custom story's verify.questions.

1d. Stop container

docker stop $(docker ps -q --filter "ancestor=ghcr.io/home-assistant/home-assistant:2026.1.3") 2>/dev/null

Step 2: Run Target Version

Repeat Step 1 for the target (local code). Same stories, same order, fresh container.

The only difference: omit --branch so run_story.py uses local code.

uv run python tests/uat/stories/run_story.py \
  catalog/<first_story>.yaml \
  --agents <agent> --keep-container \
  --results-file local/uat-results.jsonl

Reuse the same container for the remaining stories (--ha-url, --ha-token), and verify after each story as in Step 1.

Step 3: White-Box Analysis

For each story on each version, read the session file captured during the run.

Gemini sessions (JSON):

python3 -c "
import json, sys
data = json.load(open(sys.argv[1]))
for msg in data.get('messages', []):
    for tc in msg.get('toolCalls', []):
        print(f\"  {tc['name']} ({tc.get('status', '?')})\")
" /path/to/session.json

Claude sessions (JSONL):

python3 -c "
import json, sys
for line in open(sys.argv[1]):
    entry = json.loads(line)
    if entry.get('type') == 'assistant':
        for b in entry.get('message', {}).get('content', []):
            if b.get('type') == 'tool_use':
                print(f\"  {b['name']}\")
" /path/to/session.jsonl

Compare against expected.tools_should_use:

All expected tools used? (High weight)
Tool failures with recovery? (Medium weight)
Total tool call count (Low weight, note it)
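Once the tool names have been extracted from a session, the comparison against expected.tools_should_use reduces to a set difference. A minimal sketch, assuming `used_tools` is the list of names printed by the snippets above (one entry per call); the function name is hypothetical.

```python
def compare_tools(expected_tools, used_tools):
    """Summarize expected-vs-used tools for one story run."""
    used = set(used_tools)
    # Preserve the order of the expected list and drop duplicates.
    expected = list(dict.fromkeys(expected_tools))
    missing = [tool for tool in expected if tool not in used]
    return {
        "all_expected_used": not missing,  # high-weight signal
        "missing": missing,
        "total_calls": len(used_tools),    # low-weight signal, just noted
    }
```

Tool failures with recovery still need a look at the per-call status in the session file; this helper only covers tool selection and call count.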

Step 4: Score & Compare

Scoring Matrix

| Black-Box | White-Box | Score |
|-----------|-----------|-------|
| Entity correct + right structure | Right tools | pass |
| Entity correct + right structure | Wrong tools or recovered errors | pass (with notes) |
| Entity correct + wrong structure | Any | partial |
| Entity not created | Any | fail |
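The scoring matrix maps directly onto a short function. This is a sketch only; the three boolean inputs are an assumed encoding of the black-box and white-box findings, and "clean tool usage" means the right tools with no recovered errors.

```python
def score_story(entity_created, structure_correct, clean_tool_usage):
    """Return the score from the matrix for one story run."""
    if not entity_created:
        return "fail"      # white-box result is irrelevant here
    if not structure_correct:
        return "partial"   # white-box result is irrelevant here
    # Entity and structure are right: tool usage only adds notes.
    return "pass" if clean_tool_usage else "pass (with notes)"
```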

Metrics

Primary metrics (decide pass/fail on these):

Black-box score (entity correct, structure correct)
White-box tool selection (expected tools used)
Error recovery (failures handled gracefully)

Secondary metrics (report but don't decide on these alone):

Billable tokens — directional cost signal, flag >30% increase for investigation but don't auto-fail
Cached tokens / cache hit ratio — useful context for cost analysis, but varies based on provider-side KV-cache behavior
Tool call count / turns — varies between runs due to agent exploration
Duration — noisy (network, KV-cache misses, server load), only flag large (>2x) outliers
Tool description size delta (Step 8)

Extracting Billable Tokens

# Gemini: input includes cached, so subtract
billable = (input - cached) + output + thoughts

# Claude: input_tokens is already non-cached
billable = input + output
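Written out as runnable helpers, the two formulas look like this. The function and parameter names are illustrative; where the counts come from in each session file is not shown here.

```python
def gemini_billable(input_tokens, cached_tokens, output_tokens, thought_tokens=0):
    """Gemini: the input count includes cached tokens, so subtract them."""
    return (input_tokens - cached_tokens) + output_tokens + thought_tokens


def claude_billable(input_tokens, output_tokens):
    """Claude: input_tokens is already the non-cached count."""
    return input_tokens + output_tokens


# Worked example with made-up numbers:
# Gemini: (10,000 - 4,000) + 500 + 200 = 6,700 billable tokens
print(gemini_billable(10_000, 4_000, 500, 200))
# Claude: 6,000 + 500 = 6,500 billable tokens
print(claude_billable(6_000, 500))
```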

Trend (target vs baseline)

For each story+agent:

Both pass -> stable
Target pass, baseline fail -> improved
Target fail, baseline pass -> decreased (REGRESSION)
Custom story, first run -> new
Billable tokens >30% higher -> cost investigation (even if pass — check Step 7 for KV-cache misses before concluding regression)
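The trend rules can be sketched as follows. The list above only names the pass/fail combinations, so ranking "partial" between "fail" and "pass", and treating two equal scores as stable, are assumptions of this sketch; the function names are hypothetical.

```python
# Assumed ordering; "pass (with notes)" is treated the same as "pass".
RANK = {"fail": 0, "partial": 1, "pass": 2, "pass (with notes)": 2}


def classify_trend(baseline_score, target_score, first_run=False):
    """Return 'new', 'improved', 'decreased', or 'stable'."""
    if first_run:
        return "new"  # custom story with no earlier result to compare against
    baseline, target = RANK[baseline_score], RANK[target_score]
    if target > baseline:
        return "improved"
    if target < baseline:
        return "decreased"  # REGRESSION
    return "stable"


def needs_cost_investigation(baseline_tokens, target_tokens, threshold=0.30):
    """True when billable tokens grew by more than the threshold (30% by default)."""
    if baseline_tokens <= 0:
        return False
    return (target_tokens - baseline_tokens) / baseline_tokens > threshold
```

The cost flag is independent of the trend: a story can be stable and still warrant a look at Step 7.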

Step 5: Update JSONL

Append eval results as NEW lines (never modify existing):

record["eval_score"] = "pass"  # or "partial" or "fail"
record["eval_notes"] = "Entity created, triggers verified"
record["eval_trend"] = "stable"  # or "new", "improved", "decreased"
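A minimal sketch of the append-only update, assuming `record` is the dict for one story run read back from the results file. The helper name is hypothetical; opening the file in append mode is what guarantees existing lines are never rewritten.

```python
import json


def append_eval_record(results_path, record, score, notes, trend):
    """Append a copy of `record` with eval fields as a new JSONL line."""
    new_record = dict(record)  # copy so the caller's dict is left untouched
    new_record["eval_score"] = score
    new_record["eval_notes"] = notes
    new_record["eval_trend"] = trend
    # Mode "a" only ever adds to the end of the file.
    with open(results_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(new_record) + "\n")
    return new_record
```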

Step 6: Report

Triage Summary

Diff: <baseline>..HEAD — N files changed in src/ha_mcp/
Selected pre-built: s01, s03, s05 (3 stories — tools_automation.py, tools_search.py changed)
Custom stories: c01, c02 (2 stories — covering error handling, fuzzy search threshold)
Skipped: s02, s04, s06-s12 (tools unchanged)

Pre-built Story Results

Also read the model and quantization fields from each JSONL record (written by run_story.py) and show them as columns. Results vary by model, and the same base model at different quants behaves very differently, so a report naming only the agent is ambiguous after the fact. (Quant is - for cloud backends that don't expose it.)

| Story | Agent  | Model            | Quant | Baseline | Target | Trend  | Baseline Tokens | Target Tokens | Delta |
|-------|--------|------------------|-------|----------|--------|--------|-----------------|---------------|-------|
| s01   | claude | claude-sonnet-4-6| -     | pass     | pass   | stable | 36,262          | 34,100        | -6%   |
| s03   | claude | claude-sonnet-4-6| -     | pass     | pass   | stable | 42,000          | 41,500        | -1%   |

Custom Story Details

For EACH custom story, output a full section:

#### c01: [Title]

**Rationale**: [What changed in the diff and why this tests it]

**Setup**:
- Created input_boolean "Sophisticated Kitchen Sensor" via FastMCP

**Test prompt**: "[The exact prompt sent to the agent]"

**Verification**:
| Question | Baseline | Target |
|----------|----------|--------|
| Found the entity? | confirmed | confirmed |
| Used ha_search_entities? | confirmed | confirmed |

**Score**: baseline=pass, target=pass, trend=stable
**Tokens**: baseline=28,500, target=27,200 (-5%)

Regressions

If any trend = decreased:

Flag prominently
Suggest re-run to check flakiness
Show relevant section of git diff <baseline>..HEAD

Step 7: Investigate Outliers

When a story has >30% more billable tokens vs baseline, check for KV-cache misses:

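import json, sys

# Session JSON with a top-level "messages" list (the Gemini layout from Step 3)
data = json.load(open(sys.argv[1]))
for i, msg in enumerate(data.get("messages", [])):
    tok = msg.get("tokens", {})
    cached = tok.get("cached", 0)
    total = tok.get("input", 0)
    print(f"Turn {i+1}: input={total:,} cached={cached:,} non-cached={total-cached:,}")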

A turn with cached=0 that is not the cold-start turn indicates a KV-cache miss (provider-side, not a code regression).
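The same check can be automated. The sketch below assumes the message layout used in this step (a `tokens` dict with `input` and `cached`), skips messages that carry no token data, and treats the first message with token data as the cold start; the function name is hypothetical.

```python
def find_cache_misses(messages):
    """Return 1-based turn numbers that look like KV-cache misses."""
    misses = []
    seen_cold_start = False
    for turn, msg in enumerate(messages, start=1):
        tok = msg.get("tokens")
        if not tok:
            continue  # no token data on this message (assumed to be non-model turns)
        if seen_cold_start and tok.get("cached", 0) == 0 and tok.get("input", 0) > 0:
            misses.append(turn)
        seen_cold_start = True
    return misses
```

If the returned list is non-empty for the target run but empty for the baseline, the token increase is more likely provider-side noise than a regression, and a re-run is a reasonable next step.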

Step 8: Tool Description Size

Compare tool description sizes between versions:

uv run python tests/uat/stories/scripts/measure_tools.py \
  --output local/tool-sizes-target.json
uv run python tests/uat/stories/scripts/measure_tools.py \
  --output local/tool-sizes-baseline.json --branch <baseline>

Flag >5% total size increase (directly impacts token cost per turn).
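The 5% check is simple arithmetic once the two totals are known. The output schema of measure_tools.py is not shown here, so this sketch takes the totals as plain numbers rather than reading the JSON files; both function names are hypothetical.

```python
def size_increase_pct(baseline_total, target_total):
    """Percentage change in total tool description size, target vs baseline."""
    return (target_total - baseline_total) / baseline_total * 100.0


def flag_size_increase(baseline_total, target_total, threshold_pct=5.0):
    """True when the total size grew by more than the threshold (5% by default)."""
    return size_increase_pct(baseline_total, target_total) > threshold_pct
```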

Key Files

| File | Purpose |
|------|---------|
| `tests/uat/stories/run_story.py` | Story runner (container, setup, agent CLI) |
| `tests/uat/stories/scripts/ha_query.py` | Query live HA via agent+MCP for verification |
| `tests/uat/stories/catalog/s*.yaml` | Pre-built story definitions |
| `local/uat-results.jsonl` | Historical results (gitignored) |

Important Notes

--baseline is required: it's both the diff source and the control group
Run pre-built stories BEFORE custom stories (cleanest state)
ALWAYS verify each story via ha_query.py before running the next
Reuse containers: first story starts it (--keep-container), rest use --ha-url
Custom story YAMLs go to /tmp/ (ephemeral); full details reported in Step 6
See "Metrics" section in Step 4 for primary vs secondary metric classification
The working directory MUST be /home/julien/github/ha-mcp/worktree/uat-stories for uv run