wandb-evals - SKILL.md Agent Skill

name: wandb-evals
description: Configure and run W&B-backed evaluations for agents. Use when executing benchmark slices, logging question-level correctness, and scoring model outputs for run-to-run comparison.

Evaluate the whole agent, not isolated tool functions.

Define the eval source first:
- benchmark slice (dataset, split, offset, limit), or
- custom eval dataset with stable IDs and expected outputs.
If no dataset exists, create an initial test set before iteration runs.
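
As a minimal sketch of pinning down a benchmark slice, the snippet below models the four slice fields as a frozen dataclass and applies them to an in-memory list. The class name `BenchmarkSlice`, the helper `select_slice`, and the toy rows are illustrative assumptions, not part of any W&B API.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class BenchmarkSlice:
    """Hypothetical description of the eval source: which rows get evaluated."""
    dataset: str
    split: str
    offset: int = 0
    limit: int = 100

def select_slice(rows, spec):
    """Return rows[offset : offset + limit] so every run sees the same questions."""
    return rows[spec.offset : spec.offset + spec.limit]

# Toy dataset with stable IDs and expected outputs (made up for illustration).
all_rows = [
    {"id": f"q{i:04d}", "question": f"question {i}", "expected": str(i)}
    for i in range(10)
]

spec = BenchmarkSlice(dataset="toy-benchmark", split="test", offset=2, limit=3)
subset = select_slice(all_rows, spec)

# A plain dict form of the slice, suitable for storing alongside run settings.
slice_config = asdict(spec)
```

Keeping the slice as data rather than as ad hoc indexing makes it straightforward to record the exact evaluated range with each run.
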
Log question-level eval outcomes:
- question
- prediction
- gold or expected output
- is_correct (canonical boolean)
- error category when incorrect
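
The fields above could be captured in one record per question, roughly as sketched here. The exact-match scorer, the two error categories, and the table key `eval/question_rows` are assumptions chosen for illustration; `log_rows` expects the run object returned by `wandb.init()` and imports `wandb` only when called.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QuestionRow:
    question: str
    prediction: str
    gold: str
    is_correct: bool                      # canonical boolean
    error_category: Optional[str] = None  # set only when incorrect

def strict_match(prediction: str, gold: str) -> bool:
    """Assumed scorer: case-insensitive exact match after trimming whitespace."""
    return prediction.strip().lower() == gold.strip().lower()

def categorize_error(prediction: str) -> str:
    """Hypothetical two-bucket taxonomy; real categories depend on the task."""
    return "empty_prediction" if not prediction.strip() else "wrong_answer"

def build_row(question: str, prediction: str, gold: str) -> QuestionRow:
    ok = strict_match(prediction, gold)
    category = None if ok else categorize_error(prediction)
    return QuestionRow(question, prediction, gold, ok, category)

COLUMNS = ["question", "prediction", "gold", "is_correct", "error_category"]

def log_rows(run, rows):
    """Log all rows as one W&B table on an already-initialized run."""
    import wandb  # imported lazily so the rest of the sketch runs without it
    data = [[getattr(r, c) for c in COLUMNS] for r in rows]
    run.log({"eval/question_rows": wandb.Table(columns=COLUMNS, data=data)})

rows = [
    build_row("2 + 2?", " 4 ", "4"),
    build_row("Capital of France?", "", "Paris"),
    build_row("Capital of Spain?", "Lisbon", "Madrid"),
]
```
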
Track both step-level metrics and final aggregate metrics.
Keep scorer logic versioned; include scorer version and dataset version in run config.
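
One way to carry the versions into the run config is sketched below. The version strings, the project name `agent-evals`, and the extra `dataset_fingerprint` field (a short hash over the sorted question IDs) are assumptions added for illustration; only `wandb.init(project=..., config=...)` is an actual W&B call, and it runs only if `start_run` is invoked.

```python
import hashlib
import json

def dataset_fingerprint(rows) -> str:
    """Short hash of the sorted question IDs, independent of row order."""
    payload = json.dumps(sorted(r["id"] for r in rows)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

def build_run_config(scorer_version: str, dataset_version: str, rows) -> dict:
    """Collect everything needed to tell two runs apart by scorer or data."""
    return {
        "scorer_version": scorer_version,
        "dataset_version": dataset_version,
        "dataset_fingerprint": dataset_fingerprint(rows),
        "num_questions": len(rows),
    }

def start_run(config: dict, project: str = "agent-evals"):
    """Start a W&B run with the versions stored in its config."""
    import wandb  # lazy import: building the config does not require wandb
    return wandb.init(project=project, config=config)

# Hypothetical rows and version labels.
rows = [{"id": "q1"}, {"id": "q2"}, {"id": "q3"}]
config = build_run_config("strict-match-v1", "toy-benchmark-v3", rows)
```

With both versions in the config, a change in accuracy between two runs can be checked against a change in scorer or dataset before it is attributed to the agent.
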
Persist failure rows for root-cause analysis (RCA), either as failures.jsonl or as an equivalent table artifact.
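
A minimal sketch of writing the failure rows to a JSONL file follows. The row shape and the artifact name `eval-failures` are assumptions; `upload_failures` uses `wandb.Artifact`, `add_file`, and `log_artifact` on an existing run and is not executed by the demo at the bottom.

```python
import json
import tempfile
from pathlib import Path

def write_failures(rows, path) -> int:
    """Write one JSON object per incorrect row; return the number written."""
    failures = [r for r in rows if not r["is_correct"]]
    with Path(path).open("w", encoding="utf-8") as fh:
        for row in failures:
            fh.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(failures)

def upload_failures(run, path, name: str = "eval-failures"):
    """Attach the JSONL file to a W&B run as an artifact (name is an assumption)."""
    import wandb  # lazy import so the file-writing part runs without wandb
    artifact = wandb.Artifact(name, type="eval-failures")
    artifact.add_file(str(path))
    run.log_artifact(artifact)

# Demo with made-up rows, written to a temporary directory.
rows = [
    {"id": "q1", "prediction": "4", "gold": "4", "is_correct": True},
    {"id": "q2", "prediction": "Lisbon", "gold": "Madrid",
     "is_correct": False, "error_category": "wrong_answer"},
]
out_path = Path(tempfile.mkdtemp()) / "failures.jsonl"
n_failures = write_failures(rows, out_path)
```
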
Validate that the reported aggregate accuracy equals the accuracy recomputed from the question rows.
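
This consistency check can be sketched as follows, assuming each question row is a dict with an `is_correct` boolean and the summary has `correct`, `total`, and `accuracy` keys. The function names and the tolerance are illustrative choices.

```python
import math

def derived_accuracy(rows) -> dict:
    """Recompute the summary from question-level rows only."""
    total = len(rows)
    correct = sum(1 for r in rows if r["is_correct"])
    accuracy = correct / total if total else 0.0
    return {"correct": correct, "total": total, "accuracy": accuracy}

def check_aggregate(summary: dict, rows, tol: float = 1e-9) -> dict:
    """Raise ValueError if the reported summary disagrees with the rows."""
    derived = derived_accuracy(rows)
    counts_match = (summary["correct"] == derived["correct"]
                    and summary["total"] == derived["total"])
    accuracy_matches = math.isclose(summary["accuracy"], derived["accuracy"],
                                    abs_tol=tol)
    if not (counts_match and accuracy_matches):
        raise ValueError(f"summary {summary} != derived {derived}")
    return derived

# 53 correct rows out of 100, matching the example summary in this document.
rows = [{"is_correct": i < 53} for i in range(100)]
summary = {"correct": 53, "total": 100, "accuracy": 0.53}
derived = check_aggregate(summary, rows)
```

Running the check before the run finishes catches cases where rows were dropped, duplicated, or scored by a different code path than the aggregate.
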
Prefer dual scoring when needed:
- strict score (anchor metric),
- optional judge score (for non-scalar or narrative outputs).
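
Dual scoring might look like the sketch below, where the strict score stays the anchor and the judge score is recorded next to it without overriding it. The `toy_judge` function is a token-overlap stand-in for a real judge (for example an LLM grader) and is purely an assumption for illustration.

```python
from typing import Callable, Optional

def strict_score(prediction: str, gold: str) -> bool:
    """Anchor metric: case-insensitive exact match after trimming whitespace."""
    return prediction.strip().lower() == gold.strip().lower()

def toy_judge(prediction: str, gold: str) -> float:
    """Stand-in judge: fraction of gold tokens that appear in the prediction."""
    gold_tokens = set(gold.lower().split())
    pred_tokens = set(prediction.lower().split())
    if not gold_tokens:
        return 0.0
    return len(gold_tokens & pred_tokens) / len(gold_tokens)

def score_dual(prediction: str, gold: str,
               judge: Optional[Callable[[str, str], float]] = None) -> dict:
    """Always compute the strict score; add a judge score only when requested."""
    result = {"is_correct": strict_score(prediction, gold), "judge_score": None}
    if judge is not None:
        result["judge_score"] = float(judge(prediction, gold))
    return result

scalar_case = score_dual("Paris", "paris")
narrative_case = score_dual("the capital is paris", "paris", judge=toy_judge)
```

Keeping `is_correct` tied to the strict scorer means run-to-run comparisons stay on one fixed metric even if the judge changes.
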

Produce both:

{
  "correct": 53,
  "total": 100,
  "accuracy": 0.53
}