waza


**WORKFLOW SKILL** - Evaluate AI agent skills using structured benchmarks with YAML specs, fixture isolation, and pluggable validators. USE FOR: run waza, waza help, run eval, run benchmark, evaluate skill, test agent, generate eval suite, init eval, compare results, score agent, agent evaluation, skill testing, cross-model comparison. DO NOT USE FOR: improving skill frontmatter (use waza dev), creating new skills from scratch (use skill-creator), token counting or budget checks (use waza tokens). INVOKES: Copilot SDK executor, mock engine, code/regex validators. FOR SINGLE OPERATIONS: use waza run directly for a single benchmark.

By microsoft. Updated 3/3/2026

name: waza
description: "WORKFLOW SKILL - Evaluate AI agent skills using structured benchmarks with YAML specs, fixture isolation, and pluggable validators. USE FOR: run waza, waza help, run eval, run benchmark, evaluate skill, test agent, generate eval suite, init eval, compare results, score agent, agent evaluation, skill testing, cross-model comparison. DO NOT USE FOR: improving skill frontmatter (use waza dev), creating new skills from scratch (use skill-creator), token counting or budget checks (use waza tokens). INVOKES: Copilot SDK executor, mock engine, code/regex validators. FOR SINGLE OPERATIONS: use waza run directly for a single benchmark."

Waza

"The way of technique — measure, refine, master."

A Go CLI tool for evaluating AI agent skills through structured benchmarks. Define test cases in YAML, run them against agent engines, and validate results with pluggable scoring validators.

Help

When the user says "waza help" or asks how to use waza, show:

╔══════════════════════════════════════════════════════════════════╗
║  WAZA - CLI Tool for Evaluating Agent Skills                     ║
╠══════════════════════════════════════════════════════════════════╣
║                                                                  ║
║  COMMANDS:                                                       ║
║    waza run <eval.yaml>        # Run an evaluation benchmark     ║
║    waza init [directory]       # Initialize a new eval suite     ║
║    waza generate <SKILL.md>    # Generate eval from SKILL.md     ║
║    waza compare <r1> <r2> ...  # Compare result files            ║
║    waza dev [skill-path]       # Improve SKILL.md compliance     ║
║                                                                  ║
║  RUN FLAGS:                                                      ║
║    --context-dir, -c   Fixtures directory (default: ./fixtures)  ║
║    --output, -o        Save results JSON to file                 ║
║    --verbose, -v       Verbose output                            ║
║    --task, -t          Filter tasks by name (repeatable)         ║
║    --parallel, -p      Run tasks in parallel                     ║
║    --workers, -w       Number of parallel workers                ║
║    --transcript-dir    Save per-task transcripts                 ║
║                                                                  ║
║  COMPARE FLAGS:                                                  ║
║    --format, -f        Output format: table or json              ║
║                                                                  ║
║  GENERATE FLAGS:                                                 ║
║    --output-dir, -d    Output directory for generated files      ║
║                                                                  ║
║  DEV FLAGS:                                                      ║
║    --target            Adherence level: low|medium|high          ║
║    --max-iterations    Max improvement iterations (default: 5)   ║
║    --auto              Auto-apply without prompting              ║
║                                                                  ║
║  WORKFLOW:                                                       ║
║    1. waza init my-eval        # Scaffold eval suite             ║
║    2. Edit eval.yaml + tasks   # Define test cases               ║
║    3. waza run eval.yaml -v    # Execute benchmark               ║
║    4. waza compare a.json b.json  # Cross-model comparison       ║
║                                                                  ║
║  FIXTURE ISOLATION:                                              ║
║    Each task gets a fresh temp workspace with fixtures copied    ║
║    in. Original fixtures are never modified.                     ║
║                                                                  ║
╚══════════════════════════════════════════════════════════════════╝
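
The fixture isolation described in the help text can be pictured with a minimal Python sketch. It is not waza's implementation (waza is a Go tool); the names `make_workspace` and `run_isolated` are hypothetical, and the only behavior taken from the text above is "copy fixtures into a fresh temp workspace, never touch the originals".

```python
import shutil
import tempfile
from pathlib import Path


def make_workspace(fixtures_dir: Path) -> Path:
    """Copy the fixtures into a fresh temporary directory and return its path."""
    workspace = Path(tempfile.mkdtemp(prefix="waza-task-"))
    # dirs_exist_ok lets copytree write into the directory mkdtemp just created.
    shutil.copytree(fixtures_dir, workspace, dirs_exist_ok=True)
    return workspace


def run_isolated(fixtures_dir: Path, task) -> object:
    """Run task(workspace) against a private copy, then discard the copy."""
    workspace = make_workspace(fixtures_dir)
    try:
        return task(workspace)
    finally:
        # Assumption: the workspace is thrown away after each task.
        shutil.rmtree(workspace, ignore_errors=True)
```

Because every task receives its own copy, a task that edits or deletes files cannot affect the next task or the source fixtures.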

Commands

waza run

Run an evaluation benchmark from a YAML spec file.

# Run with default mock engine
waza run path/to/eval.yaml --context-dir path/to/fixtures

# Verbose output with results saved
waza run eval.yaml -c ./fixtures -v -o results.json

# Filter to specific tasks
waza run eval.yaml -t "task-name-1" -t "task-name-2"

# Parallel execution
waza run eval.yaml --parallel --workers 4

# Save per-task transcripts
waza run eval.yaml --transcript-dir ./transcripts

waza init

Initialize a new evaluation suite with a compliant directory structure.

# Initialize in current directory
waza init

# Initialize in a named directory
waza init my-eval-suite

Creates eval.yaml, a tasks/ directory with an example task, and a fixtures/ directory with an example fixture.

waza generate

Generate an eval suite from an existing SKILL.md file.

# Generate eval from SKILL.md
waza generate path/to/SKILL.md

# Specify output directory
waza generate SKILL.md --output-dir ./my-eval

Parses YAML frontmatter (name, description) and creates eval.yaml, starter tasks, and fixtures.
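
As a rough illustration of the frontmatter step, the Python sketch below pulls `name` and `description` out of a SKILL.md string. It assumes the frontmatter is delimited by `---` lines and holds single-line `key: value` pairs; `parse_frontmatter` is a hypothetical helper, not part of waza, and it is not a full YAML parser.

```python
def parse_frontmatter(text: str) -> dict:
    """Return top-level `key: value` pairs from a leading '---' block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        # Skip indented (nested) lines; only top-level scalars are handled.
        if sep and key.strip() and not line.startswith((" ", "\t")):
            fields[key.strip()] = value.strip().strip('"')
    return {}  # no closing delimiter: treat as missing frontmatter
```

Splitting on the first colon only keeps values such as `USE FOR: ...` intact inside the description.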

waza compare

Compare results from multiple evaluation runs side by side.

# Compare two result files
waza compare run1.json run2.json

# Compare three or more
waza compare gpt4.json claude.json gemini.json

# JSON output
waza compare run1.json run2.json --format json

Shows per-task score deltas, pass rate differences, and aggregate statistics.
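
The comparison can be sketched in Python as follows. The layout of the results file is an assumption made for illustration (a top-level `tasks` list whose entries carry `id`, `score`, and `pass`); the real JSON written by `waza run -o` may differ, and `compare_runs` is a hypothetical name.

```python
def compare_runs(baseline: dict, candidate: dict) -> dict:
    """Compare two runs on the tasks they share (assumed result layout)."""
    base = {t["id"]: t for t in baseline["tasks"]}
    cand = {t["id"]: t for t in candidate["tasks"]}
    shared = sorted(base.keys() & cand.keys())

    # Per-task delta: positive means the candidate scored higher.
    deltas = {tid: round(cand[tid]["score"] - base[tid]["score"], 4) for tid in shared}

    def pass_rate(run: dict) -> float:
        return sum(1 for tid in shared if run[tid]["pass"]) / len(shared) if shared else 0.0

    return {
        "score_deltas": deltas,
        "pass_rate_delta": round(pass_rate(cand) - pass_rate(base), 4),
        "mean_score_delta": round(sum(deltas.values()) / len(deltas), 4) if deltas else 0.0,
    }
```

Restricting the comparison to shared task ids is a design choice of this sketch, so that a run filtered with `-t` does not skew the aggregates.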

waza dev

Iteratively improve SKILL.md frontmatter compliance with automated scoring.

# Score current skill and suggest improvements
waza dev skills/my-skill

# Target high compliance level
waza dev skills/my-skill --target high

# Auto-apply improvements without prompts
waza dev skills/my-skill --target medium --auto --max-iterations 3

Compliance Levels:

  • Low (< 150 chars or no triggers) — Minimal description
  • Medium (150+ chars, has triggers) — Basic trigger coverage
  • Medium-High (+ anti-triggers) — Routing clarity improved
  • High (+ routing markers like INVOKES/FOR SINGLE OPERATIONS) — Full compliance
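
The four levels above can be read as a simple decision ladder. The Python sketch below is one possible reading, not waza's scoring code: `compliance_level` is a hypothetical function, and the exact marker strings it searches for are assumptions based on the phrases named in this document.

```python
import re

# "USE FOR:" that is not part of "DO NOT USE FOR:".
TRIGGER = re.compile(r"(?<!DO NOT )USE FOR:")
ANTI_TRIGGER = re.compile(r"DO NOT USE FOR:")
ROUTING = re.compile(r"INVOKES:|FOR SINGLE OPERATIONS:")


def compliance_level(description: str) -> str:
    """Classify a skill description as Low, Medium, Medium-High, or High."""
    if len(description) < 150 or not TRIGGER.search(description):
        return "Low"
    if not ANTI_TRIGGER.search(description):
        return "Medium"
    if not ROUTING.search(description):
        return "Medium-High"
    return "High"
```

The negative lookbehind matters: without it, a description that has only anti-triggers would be counted as having triggers.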

Scoring Checks:

  • Description length (150+ chars required, 1024 max)
  • Trigger phrases (USE FOR: patterns)
  • Anti-trigger phrases (DO NOT USE FOR: patterns)
  • Routing clarity markers (WORKFLOW SKILL, INVOKES:, etc.)
  • Token budget (500-token soft limit, 5000-token hard limit)

Coming Soon: Trigger accuracy tests (#36), --skip-integration (#37), --fast (#38), improvement suggestions engine (#34).

Evaluation Spec Format

name: my-eval
skill: my-skill
version: "1.0"
executor: mock          # or copilot-sdk
tasks:
  - id: task-1
    name: "Describe the task"
    prompt: "Your prompt to the agent"
    expected: "Expected behavior"
    validators:
      - type: code
        config:
          language: go
      - type: regex
        config:
          pattern: "expected pattern"

Engines

| Engine | Use | Description |
|--------|-----|-------------|
| mock | Testing | Returns canned responses for validator development |
| copilot-sdk | Production | Executes via Copilot CLI SDK |

Validators

| Validator | What it checks |
|-----------|----------------|
| code | Code compiles / passes syntax check |
| regex | Output matches regex pattern |
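
To make the two checks concrete, here is a small Python sketch of what such validators might look like. It is illustrative only: `ValidatorResult`, `regex_validator`, and `python_syntax_validator` are hypothetical names, the regex check assumes a match anywhere in the output counts, and the code check uses Python's own parser as a stand-in for whatever compiler or syntax checker the configured language needs.

```python
import ast
import re
from dataclasses import dataclass


@dataclass
class ValidatorResult:
    validator: str
    passed: bool
    detail: str


def regex_validator(output: str, pattern: str) -> ValidatorResult:
    """Pass when the pattern matches anywhere in the agent output."""
    match = re.search(pattern, output)
    detail = f"matched {match.group(0)!r}" if match else f"no match for {pattern!r}"
    return ValidatorResult("regex", match is not None, detail)


def python_syntax_validator(source: str) -> ValidatorResult:
    """Pass when the source parses as Python (stand-in for a real compile step)."""
    try:
        ast.parse(source)
    except SyntaxError as err:
        return ValidatorResult("code", False, f"syntax error: {err.msg}")
    return ValidatorResult("code", True, "parsed")
```

Returning a small result record rather than a bare boolean leaves room for the per-validator details that the scoring output reports.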

Configuration

| Setting | Flag | Default |
|---------|------|---------|
| Fixtures dir | --context-dir | ./fixtures |
| Output file | --output | (none) |
| Verbose | --verbose | false |
| Parallel | --parallel | false |
| Workers | --workers | CPU count |
| Transcript dir | --transcript-dir | (none) |

Scoring Quick Reference

Each task produces an EvaluationOutcome with:

| Field | Description |
|-------|-------------|
| score | 0.0–1.0 normalized score |
| pass | Boolean pass/fail |
| validator_results | Per-validator details |
| duration | Execution time |
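
A Python sketch of a record with these four fields is shown below. How waza derives `score` and `pass` from the validator results is not stated here, so the aggregation is an assumption: the score is the fraction of validators that passed, and the task passes when the score reaches a hypothetical `threshold`. The `evaluate` helper is likewise invented for illustration.

```python
import time
from dataclasses import dataclass


@dataclass
class EvaluationOutcome:
    score: float              # normalized to 0.0-1.0
    passed: bool              # the "pass" field; `pass` is a Python keyword
    validator_results: list   # one entry per validator
    duration: float           # seconds


def evaluate(checks: list, threshold: float = 1.0) -> EvaluationOutcome:
    """Run (name, callable) checks and fold them into one outcome.

    Assumed aggregation: score = share of checks that returned True.
    """
    start = time.perf_counter()
    results = [{"validator": name, "passed": bool(check())} for name, check in checks]
    score = sum(r["passed"] for r in results) / len(results) if results else 0.0
    return EvaluationOutcome(score, score >= threshold, results, time.perf_counter() - start)
```

With the default threshold of 1.0, a task passes only when every validator passes; lowering it would allow partial credit.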

Install via CLI

npx skills add https://github.com/microsoft/waza --skill waza