bat-adhoc

star 3.4k

Run bot acceptance tests to validate MCP tools work correctly from a real AI agent's perspective. Use when testing PRs, detecting regressions, or verifying tool changes end-to-end with Claude/Gemini CLIs.

homeassistant-ai By homeassistant-ai schedule Updated 2/17/2026

name: bat-adhoc description: Run bot acceptance tests to validate MCP tools work correctly from a real AI agent's perspective. Use when testing PRs, detecting regressions, or verifying tool changes end-to-end with Claude/Gemini CLIs. disable-model-invocation: true argument-hint: [scenario-description or --help] allowed-tools: Bash, Read, Write

BAT - Bot Acceptance Testing

Bot acceptance testing validates that MCP tools work correctly from a real AI agent's perspective. You design test scenarios dynamically, run them via tests/uat/run_uat.py, and evaluate results.

When to Use BAT

  • PR validation: Test that tool changes work correctly from an agent's perspective
  • Regression detection: Compare behavior between branches
  • Integration verification: Ensure MCP tools work end-to-end with real agent CLIs

Workflow

  1. Analyze the change: Read the diff, identify which tools are affected
  2. Design scenario: Generate a scenario JSON with setup/test/teardown prompts
  3. Run the script: Pipe the scenario to python tests/uat/run_uat.py
  4. Evaluate summary: Check all_passed per agent. If true, you're done.
  5. Dig deeper on failure: Read results_file for full output, stderr, raw JSON
  6. Regression check: If test fails, re-run with --branch master to compare

Output Structure

The runner returns a concise summary to stdout (saves context when all passes):

{
  "results_file": "/tmp/bat_results_abc123.json",
  "agents": {
    "gemini": {
      "all_passed": true,
      "test": {
        "completed": true,
        "duration_ms": 8100,
        "exit_code": 0,
        "num_turns": 5,
        "tool_stats": { "totalCalls": 4, "totalSuccess": 4, "totalFail": 0 }
      },
      "aggregate": {
        "total_duration_ms": 15300,
        "total_turns": 12,
        "total_tool_calls": 9,
        "total_tool_success": 9,
        "total_tool_fail": 0
      }
    }
  }
}
  • Phase stats: num_turns, tool_stats (per phase) for fine-grained comparison
  • Aggregate stats: Total counts across all phases for overall efficiency comparison
  • On failure: also includes output and stderr for diagnosis
  • Full results: raw JSON, complete output always available at results_file

Scenario Design Guidelines

  • setup_prompt: Create any entities/state the test needs
  • test_prompt: Exercise the tools being tested, ask the agent to report results clearly
  • teardown_prompt: Clean up created entities
  • Keep prompts focused - each scenario tests ONE behavior
  • Ask the agent to report: what succeeded, what failed, any unexpected behavior

Example: Testing Error Signaling

cat <<'EOF' | python tests/uat/run_uat.py --agents gemini
{
  "setup_prompt": "Create a test automation called 'bat_error_test' with a time trigger at 23:59 and action to turn on light.bed_light.",
  "test_prompt": "Try to get automation 'automation.nonexistent_xyz'. Report if the tool signaled an error or returned a normal response. Then get automation 'automation.bat_error_test' and report its structure.",
  "teardown_prompt": "Delete automation 'bat_error_test' if it exists."
}
EOF

Regression Comparison Workflow

Full BAT comparison (recommended):

  1. Pull latest master: git fetch origin master && git checkout master && git pull
  2. Run on master: Save scenario to file, run and save results
  3. Switch to branch: git checkout feat/my-branch
  4. Run on branch: Run same scenario, compare stats

Compare these metrics:

Primary (decide pass/fail on these):

  • Task completion: Did both pass? Any new failures?
  • Accuracy: Check agent output quality - did it understand the task correctly?
  • Tool success rate: Compare aggregate.total_tool_calls vs total_tool_fail

Secondary (report but don't decide on these alone):

  • Tool call count: Compare aggregate.total_tool_calls, aggregate.total_turns — directional signal, not conclusive (agent exploration varies between runs)
  • Duration: Compare aggregate.total_duration_ms — noisy due to network, cache misses, server load. Only flag large (>2x) regressions.

Robustness tip: Ask the same task in different ways (variation testing) to check if results are consistent across phrasings.

Quick comparison (single command):

# Test the PR branch
echo '{"test_prompt":"..."}' | python tests/uat/run_uat.py --branch feat/tool-errors --agents gemini

# Compare against master
echo '{"test_prompt":"..."}' | python tests/uat/run_uat.py --branch master --agents gemini

Cost Awareness

Each scenario invocation costs API credits (one per agent per phase). Design scenarios efficiently:

  • Combine related checks in a single test_prompt when possible
  • Only use setup/teardown when the test needs specific state
  • Start with one agent, expand to both only when cross-agent comparison matters

Handling Arguments

When /bat-adhoc is invoked with arguments:

If arguments contain a scenario description, generate the JSON scenario and run it:

/bat-adhoc test automation create with sunrise trigger then modify to sunset

→ Generate appropriate scenario JSON and execute

If --help or no arguments, show this help text.

Otherwise, treat $ARGUMENTS as instructions for what to test and design+run the scenario accordingly.

Full Documentation

For complete CLI reference and output format, see tests/uat/README.md.

Install via CLI
npx skills add https://github.com/homeassistant-ai/ha-mcp --skill bat-adhoc
Repository Details
star Stars 3,439
call_split Forks 137
navigation Branch main
article Path SKILL.md
More from Creator
homeassistant-ai
homeassistant-ai Explore all skills →