name: wandb-evals
description: Configure and run W&B-backed evaluations for agents. Use when executing benchmark slices, logging question-level correctness, and scoring model outputs for run-to-run comparison.
W&B Evals
Evaluate the whole agent, not isolated tool functions.
Execute
- Define eval source first:
  - benchmark slice (dataset, split, offset, limit), or
  - custom eval dataset with stable IDs and expected outputs.
- If no dataset exists, create an initial test set before iteration runs.
- Log question-level eval outcomes:
  - question
  - prediction
  - gold or expected output
  - is_correct (canonical boolean)
  - error category when incorrect
- Track both step-level metrics and final aggregate metrics.
- Keep scorer logic versioned; include scorer version and dataset version in run config.
- Persist failure rows for RCA (failures.jsonl or table-equivalent artifact).
- Validate that aggregate accuracy equals derived accuracy from question rows.
- Prefer dual scoring when needed:
  - strict score (anchor metric),
  - optional judge score (non-scalar/narrative outputs).
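
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's scorer. The names select_slice, score_rows, write_failures, and toy_agent, the version labels, the toy dataset, and the two-bucket error taxonomy are all hypothetical. The strict scorer here is a normalized exact match, which is one possible anchor metric.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical version labels; record whatever identifiers your project uses.
SCORER_VERSION = "exact-match-v1"
DATASET_VERSION = "toy-v1"


def select_slice(examples, offset, limit):
    """Return the benchmark slice [offset, offset + limit)."""
    return examples[offset:offset + limit]


def normalize(text):
    """Strict-scorer normalization (assumed): trim whitespace and lowercase."""
    return str(text).strip().lower()


def score_rows(examples, predict):
    """Build one question-level row per example, with a canonical boolean."""
    rows = []
    for ex in examples:
        prediction = predict(ex["question"])
        is_correct = normalize(prediction) == normalize(ex["gold"])
        if is_correct:
            error_category = None
        elif not normalize(prediction):
            error_category = "empty_prediction"  # hypothetical category
        else:
            error_category = "wrong_answer"  # hypothetical category
        rows.append({
            "id": ex["id"],
            "question": ex["question"],
            "prediction": prediction,
            "gold": ex["gold"],
            "is_correct": is_correct,
            "error_category": error_category,
        })
    return rows


def write_failures(rows, path):
    """Persist incorrect rows as JSON Lines for root-cause analysis."""
    failures = [r for r in rows if not r["is_correct"]]
    with open(path, "w", encoding="utf-8") as f:
        for row in failures:
            f.write(json.dumps(row) + "\n")
    return len(failures)


# Toy data and a stand-in agent, for illustration only.
EXAMPLES = [
    {"id": "q1", "question": "2 + 2", "gold": "4"},
    {"id": "q2", "question": "Capital of France", "gold": "Paris"},
    {"id": "q3", "question": "3 * 3", "gold": "9"},
]
ANSWERS = {"2 + 2": "4", "Capital of France": " paris ", "3 * 3": ""}


def toy_agent(question):
    return ANSWERS.get(question, "")


# Versions and slice parameters travel with the run config.
run_config = {
    "scorer_version": SCORER_VERSION,
    "dataset_version": DATASET_VERSION,
    "offset": 0,
    "limit": 3,
}

rows = score_rows(
    select_slice(EXAMPLES, run_config["offset"], run_config["limit"]),
    toy_agent,
)
failures_path = Path(tempfile.mkdtemp()) / "failures.jsonl"
n_failures = write_failures(rows, failures_path)
```

In a real run, run_config would typically be passed as the W&B run config and failures.jsonl logged as an artifact. The sketch stays on the standard library so the row schema and failure file can be checked without a W&B account.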
Fallback Order
- Use W&B/Weave eval tooling and project scorer scripts.
- If scorer behavior is ambiguous, check official W&B eval docs.
- If unresolved, inspect local eval runner/scorer code directly.
Output Contract
Produce both:
- Question-level table with is_correct.
- Run-level summary:
{
"correct": 53,
"total": 100,
"accuracy": 0.53
}
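
One way to produce and check this summary is to derive it from the question-level rows instead of counting separately. The sketch below assumes each row carries a boolean is_correct field. The function names summarize and validate_summary are hypothetical, and the synthetic rows only reproduce the sample numbers above.

```python
import math


def summarize(rows):
    """Derive the run-level summary directly from question-level rows."""
    total = len(rows)
    correct = sum(1 for r in rows if r["is_correct"] is True)
    accuracy = correct / total if total else 0.0
    return {"correct": correct, "total": total, "accuracy": accuracy}


def validate_summary(summary, rows):
    """Raise ValueError if a reported summary disagrees with the rows."""
    derived = summarize(rows)
    if summary["correct"] != derived["correct"]:
        raise ValueError("correct count does not match question rows")
    if summary["total"] != derived["total"]:
        raise ValueError("total does not match question rows")
    if not math.isclose(summary["accuracy"], derived["accuracy"], abs_tol=1e-9):
        raise ValueError("accuracy does not match question rows")
    return True


# Synthetic rows: 53 correct out of 100, for illustration only.
rows = [{"id": f"q{i}", "is_correct": i < 53} for i in range(100)]
summary = summarize(rows)
validate_summary(summary, rows)
```

Running the validation before logging the summary catches drift between the aggregate and the table, for example when rows are filtered after the counts were computed.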