name: bat-adhoc
description: Run bot acceptance tests to validate MCP tools work correctly from a real AI agent's perspective. Use when testing PRs, detecting regressions, or verifying tool changes end-to-end with Claude/Gemini CLIs.
disable-model-invocation: true
argument-hint: [scenario-description or --help]
allowed-tools: Bash, Read, Write
BAT - Bot Acceptance Testing
Bot acceptance testing validates that MCP tools work correctly from a real AI agent's perspective. You design test scenarios dynamically, run them via tests/uat/run_uat.py, and evaluate results.
When to Use BAT
- PR validation: Test that tool changes work correctly from an agent's perspective
- Regression detection: Compare behavior between branches
- Integration verification: Ensure MCP tools work end-to-end with real agent CLIs
Workflow
- Analyze the change: Read the diff, identify which tools are affected
- Design scenario: Generate a scenario JSON with setup/test/teardown prompts
- Run the script: Pipe the scenario to python tests/uat/run_uat.py
- Evaluate summary: Check all_passed per agent. If true, you're done.
- Dig deeper on failure: Read results_file for full output, stderr, raw JSON
- Regression check: If test fails, re-run with --branch master to compare
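The run and evaluate steps above can be sketched in Python. This is a minimal sketch, assuming the runner reads the scenario JSON from stdin and prints only the summary JSON to stdout, as this document describes. The helper names `build_command`, `run_scenario`, and `failed_agents` are hypothetical and are not part of run_uat.py; only the `--agents` and `--branch` flags come from this document.

```python
import json
import subprocess
import sys


def build_command(agents, branch=None):
    """Build the runner command line from the flags shown in this document."""
    cmd = [sys.executable, "tests/uat/run_uat.py", "--agents", agents]
    if branch is not None:
        cmd += ["--branch", branch]
    return cmd


def run_scenario(scenario, agents="gemini", branch=None):
    """Pipe the scenario to the runner and return the parsed summary.

    Assumption: stdout contains only the summary JSON.
    """
    proc = subprocess.run(
        build_command(agents, branch),
        input=json.dumps(scenario),
        capture_output=True,
        text=True,
    )
    return json.loads(proc.stdout)


def failed_agents(summary):
    """Return the names of agents whose all_passed flag is not true."""
    return [
        name
        for name, result in summary["agents"].items()
        if not result.get("all_passed")
    ]
```

If `failed_agents(summary)` is empty, the run is done; otherwise open `summary["results_file"]` and consider re-running with `branch="master"`.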
Output Structure
The runner prints a concise summary to stdout, which saves context when everything passes:
{
  "results_file": "/tmp/bat_results_abc123.json",
  "agents": {
    "gemini": {
      "all_passed": true,
      "test": {
        "completed": true,
        "duration_ms": 8100,
        "exit_code": 0,
        "num_turns": 5,
        "tool_stats": { "totalCalls": 4, "totalSuccess": 4, "totalFail": 0 }
      },
      "aggregate": {
        "total_duration_ms": 15300,
        "total_turns": 12,
        "total_tool_calls": 9,
        "total_tool_success": 9,
        "total_tool_fail": 0
      }
    }
  }
}
- Phase stats: num_turns, tool_stats (per phase) for fine-grained comparison
- Aggregate stats: Total counts across all phases for overall efficiency comparison
- On failure: also includes output and stderr for diagnosis
- Full results: raw JSON, complete output always available at results_file
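To make the per-phase fields concrete, here is a small sketch that pulls the phase stats out of one agent's entry. It assumes every phase is a dict carrying a `tool_stats` key, as the `test` phase does in the sample summary; how setup and teardown phases are keyed is not shown in this document, so the code does not hard-code their names. `phase_rows` is a hypothetical helper name.

```python
def phase_rows(agent_result):
    """Return (phase, num_turns, total_calls, total_fail) for each phase entry.

    Assumption: a phase is any dict value that contains "tool_stats";
    other keys such as "all_passed" and "aggregate" are skipped.
    """
    rows = []
    for key, value in agent_result.items():
        if isinstance(value, dict) and "tool_stats" in value:
            stats = value["tool_stats"]
            rows.append(
                (key, value.get("num_turns"),
                 stats.get("totalCalls"), stats.get("totalFail"))
            )
    return rows


# The "gemini" entry from the sample summary in this section.
sample_agent = {
    "all_passed": True,
    "test": {
        "completed": True,
        "duration_ms": 8100,
        "exit_code": 0,
        "num_turns": 5,
        "tool_stats": {"totalCalls": 4, "totalSuccess": 4, "totalFail": 0},
    },
    "aggregate": {
        "total_duration_ms": 15300,
        "total_turns": 12,
        "total_tool_calls": 9,
        "total_tool_success": 9,
        "total_tool_fail": 0,
    },
}
```

Calling `phase_rows(sample_agent)` yields one row for the `test` phase: 5 turns, 4 tool calls, 0 failures.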
Scenario Design Guidelines
- setup_prompt: Create any entities/state the test needs
- test_prompt: Exercise the tools being tested, ask the agent to report results clearly
- teardown_prompt: Clean up created entities
- Keep prompts focused - each scenario tests ONE behavior
- Ask the agent to report: what succeeded, what failed, any unexpected behavior
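These guidelines can be captured in a small scenario builder. This is a sketch under stated assumptions: `make_scenario` and `REPORT_SUFFIX` are hypothetical names, and treating `test_prompt` as required while setup and teardown are optional is inferred from the examples in this document rather than confirmed by the runner.

```python
# Appended to every test prompt so the agent reports results clearly.
REPORT_SUFFIX = " Report what succeeded, what failed, and any unexpected behavior."


def make_scenario(test_prompt, setup_prompt=None, teardown_prompt=None):
    """Build a scenario dict that includes only the phases actually needed.

    Assumption: test_prompt is required; setup/teardown are optional.
    """
    if not test_prompt or not test_prompt.strip():
        raise ValueError("test_prompt is required")
    scenario = {}
    if setup_prompt:
        scenario["setup_prompt"] = setup_prompt
    scenario["test_prompt"] = test_prompt.strip() + REPORT_SUFFIX
    if teardown_prompt:
        scenario["teardown_prompt"] = teardown_prompt
    return scenario
```

Serialize the result with `json.dumps` before piping it to the runner.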
Example: Testing Error Signaling
cat <<'EOF' | python tests/uat/run_uat.py --agents gemini
{
"setup_prompt": "Create a test automation called 'bat_error_test' with a time trigger at 23:59 and action to turn on light.bed_light.",
"test_prompt": "Try to get automation 'automation.nonexistent_xyz'. Report if the tool signaled an error or returned a normal response. Then get automation 'automation.bat_error_test' and report its structure.",
"teardown_prompt": "Delete automation 'bat_error_test' if it exists."
}
EOF
Regression Comparison Workflow
Full BAT comparison (recommended):
- Pull latest master: git fetch origin master && git checkout master && git pull
- Run on master: Save scenario to file, run and save results
- Switch to branch: git checkout feat/my-branch
- Run on branch: Run same scenario, compare stats
Compare these metrics:
Primary (decide pass/fail on these):
- Task completion: Did both pass? Any new failures?
- Accuracy: Check agent output quality - did it understand the task correctly?
- Tool success rate: Compare aggregate.total_tool_calls vs aggregate.total_tool_fail
Secondary (report but don't decide on these alone):
- Tool call count: Compare aggregate.total_tool_calls, aggregate.total_turns. This is a directional signal, not conclusive (agent exploration varies between runs).
- Duration: Compare aggregate.total_duration_ms. It is noisy due to network, cache misses, server load. Only flag large (>2x) regressions.
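The metric comparison above can be sketched as a function over the two `aggregate` objects. This is an illustrative sketch, not part of the runner: `tool_success_rate` and `compare_aggregates` are hypothetical names, and the 2x duration threshold is the one suggested in this section. Task completion and accuracy still need a human or agent read of the output.

```python
def tool_success_rate(aggregate):
    """Fraction of tool calls that succeeded, or None when no tools were called."""
    calls = aggregate["total_tool_calls"]
    if calls == 0:
        return None
    return aggregate["total_tool_success"] / calls


def compare_aggregates(master, branch, duration_factor=2.0):
    """Compare two aggregate dicts taken from master and branch summaries."""
    return {
        # Primary signal: success rates and any additional tool failures.
        "master_success_rate": tool_success_rate(master),
        "branch_success_rate": tool_success_rate(branch),
        "new_tool_failures": max(
            0, branch["total_tool_fail"] - master["total_tool_fail"]
        ),
        # Secondary signals: directional only, do not decide on these alone.
        "tool_call_delta": branch["total_tool_calls"] - master["total_tool_calls"],
        "turn_delta": branch["total_turns"] - master["total_turns"],
        # Flag only large duration regressions (more than duration_factor times).
        "duration_regression": branch["total_duration_ms"]
        > duration_factor * master["total_duration_ms"],
    }
```

Pass in `summary["agents"][name]["aggregate"]` from each run for the same agent.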
Robustness tip: Ask the same task in different ways (variation testing) to check if results are consistent across phrasings.
Quick comparison (single command):
# Test the PR branch
echo '{"test_prompt":"..."}' | python tests/uat/run_uat.py --branch feat/tool-errors --agents gemini
# Compare against master
echo '{"test_prompt":"..."}' | python tests/uat/run_uat.py --branch master --agents gemini
Cost Awareness
Each scenario run costs API credits: one agent invocation per agent per phase. Design scenarios efficiently:
- Combine related checks in a single test_prompt when possible
- Only use setup/teardown when the test needs specific state
- Start with one agent, expand to both only when cross-agent comparison matters
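As a rough planning aid, the cost rule can be turned into a count of agent invocations. This sketch assumes that a phase runs only when its prompt is present in the scenario, and that cost scales as one invocation per agent per phase; `count_invocations` is a hypothetical helper, and the agent names passed to it are only placeholders.

```python
PHASE_KEYS = ("setup_prompt", "test_prompt", "teardown_prompt")


def count_invocations(scenario, agents):
    """Estimate agent invocations: phases present times number of agents."""
    phases = sum(1 for key in PHASE_KEYS if scenario.get(key))
    return phases * len(agents)
```

A scenario with only a `test_prompt` run on one agent costs one invocation, while a full setup/test/teardown scenario on two agents costs six.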
Handling Arguments
When /bat-adhoc is invoked with arguments:
If arguments contain a scenario description, generate the JSON scenario and run it:
/bat-adhoc test automation create with sunrise trigger then modify to sunset
→ Generate appropriate scenario JSON and execute
If --help or no arguments, show this help text.
Otherwise, treat $ARGUMENTS as instructions for what to test, then design and run the scenario accordingly.
Full Documentation
For complete CLI reference and output format, see tests/uat/README.md.