name: evaluation
description: Install and run a verifiers environment — smoke testing during development, full benchmark evals, and pushing results to the Prime platform. Covers the v1 eval CLI (the *_v1 tasksets) and the legacy v0 vf-eval CLI. Independent sections; use whichever the request needs. Use while developing/iterating on an environment, when running/evaluating/benchmarking one, or when the user mentions eval, vf-eval, tasksets, eval runs, smoke tests, or pushing eval results.
Evaluation
Running an environment in this repo — both while developing/iterating on it (smoke-testing changes) and when benchmarking a finished env and publishing results. Run everything from the repo root, always via uv.
Which CLI?
Two environment styles coexist in environments/:
- v1 taskset: a *_v1 package that exports a Taskset (and no load_environment). Run it with the eval CLI, by taskset id: uv run eval <taskset-id>.
- legacy v0 env: a classic env that exposes load_environment. Run it with vf-eval: uv run vf-eval <env-id>.
The tell is the export, not the name: a v1 taskset's package has __all__ = ["...Taskset"] and no load_environment. The sections below lead with the v1 eval flow and note the v0 vf-eval equivalent where it differs. They're independent — do only what the request calls for.
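The export check above can be sketched in a few lines of Python. This is only an illustration of the rule, not part of either CLI: env_style is a hypothetical helper, and the stand-in modules below are fabricated for the demo.

```python
import types


def env_style(module) -> str:
    """Classify an imported environment package by what it exports.

    'v1'      -> __all__ names a ...Taskset and there is no load_environment
    'v0'      -> the package exposes load_environment
    'unknown' -> neither signal is present
    """
    exported = getattr(module, "__all__", ())
    has_taskset = any(name.endswith("Taskset") for name in exported)
    has_loader = hasattr(module, "load_environment")
    if has_loader:
        return "v0"
    if has_taskset:
        return "v1"
    return "unknown"


# Stand-in modules (hypothetical names) so the sketch runs without a real env.
fake_v1 = types.ModuleType("demo_v1")
fake_v1.__all__ = ["DemoTaskset"]

fake_v0 = types.ModuleType("demo")
fake_v0.load_environment = lambda: None

print(env_style(fake_v1), env_style(fake_v0))
```

For a real package, pass the result of importlib.import_module on the installed env's import name.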
Setup
Editable, local install from the repo root (not from inside the env dir):
uv pip install -e environments/<env_id>
Editable means code edits are picked up without reinstalling — so during development you can change the env and immediately re-run it. Only re-run this install after editing pyproject.toml (e.g. new deps). v1 tasksets pin a verifiers pre-release; if the install complains, add --prerelease=allow.
After dependency edits, sync the environment project before testing imports:
uv sync --project environments/<env_id> --all-extras
If a fresh environment import fails because resolved package versions are incompatible, fix the environment's pyproject.toml bounds and sync again. Do not patch generated .venv files.
Smoke test
Run a 3x1 (3 tasks, 1 rollout each) in plain-log mode to confirm the env loads and scores end-to-end. Spanning a few tasks surfaces problems (bad rows, edge-case prompts, inconsistent scoring) that a single task hides:
uv run eval <taskset-id> -n 3 -r 1 --rich false -v
--rich false turns off the live dashboard (on by default) in favor of plain logs; -v prints prompts/completions. This is the inner loop while developing: re-run it after each change to verify the dataset loads, the rollout runs, and the rubric scores as expected. Fix any errors here before scaling up.
Legacy v0: uv run vf-eval <env-id> -n 3 -r 1 -d -v (-d is the v0 equivalent of --rich false).
Full eval
Determine sample size from the environment's pyproject.toml:
- If [tool.verifiers.eval] is present, follow it; it sets the convention for this env:

      [tool.verifiers.eval]
      num_examples = 5          # -> -n 5
      rollouts_per_example = 3  # -> -r 3

- If absent, run all tasks (omit -n) and pick -r so the total sample is representative; usually >500 total rollouts (num_tasks × r) is a good target. For large datasets -r 1 may already clear that; for small ones, bump -r.
Recommended full-eval invocation:
uv run eval <taskset-id> -r <r> -c <N> --rich false
The run is always saved to disk (see Inspect output) — there's no save flag.
Legacy v0: uv run vf-eval <env-id> -n -1 -r <r> -i -s -c -1 -d (v0 needs -s to save and -n -1 for all examples).
Key flags (eval)
| Flag | Meaning | When |
|---|---|---|
| -m <slug> | model id (default deepseek/deepseek-v4-flash) | to override the default |
| -n <N> | number of tasks; omit for all | smoke (-n 3) vs full (omit) |
| -r <N> | rollouts per task (>=2 if the taskset has @group_rewards) | almost always set |
| -c <N> | max rollouts in flight (default 128) | raise for cheap envs, lower for sandboxed |
| -s | shuffle tasks before taking the first -n | sampling a subset of a big dataset |
| -v | debug logs (prompts/completions) | developing/debugging |
| --rich false | plain logs instead of the live dashboard | non-interactive/captured runs; required with --server |
| -o <dir> | output dir (default a fresh per-run dir) | pinning a known location |
| --resume <dir> | re-run only a prior run's missing/errored rollouts | recovering an interrupted run |
Coming from vf-eval? The flags overlap but a few differ:
- -s means shuffle, not save. v1 eval always writes the run to disk; there's no save flag.
- No -a '{...}' JSON. Pass env args as typed flags (--taskset.*, --harness.*) or @ eval.toml (see below).
- The dashboard is on by default; pass --rich false (the analogue of v0 -d) for plain logs.
- All tasks = omit -n (there's no -n -1 idiom).
- No -i; scoring is defined by the taskset's @reward/@group_rewards.
Configuring the env
Pass typed, dotted flags or a TOML file — there is no -a JSON blob:
uv run eval wikispeedia-v1 --taskset.min-path-length 5 --taskset.max-path-length 8
uv run eval <taskset-id> @ eval.toml # a run's own config.toml is itself re-runnable
Common knobs: --max-turns, --max-total-tokens, --sampling.max-tokens, --sampling.temperature, --harness.id, --harness.runtime.type. uv run eval <taskset-id> -h prints the full typed help, narrowed to the chosen taskset/harness.
Legacy v0 uses -a JSON: uv run vf-eval <env-id> -a '{"task": "...", "max_turns": 50}'.
Harness & runtime (sandboxed / agentic tasksets)
A taskset that bundles its own harness runs with it by default; otherwise pass --harness.id. Select where rollouts execute with --harness.runtime.type:
uv run eval <taskset-id> --harness.runtime.type subprocess # local process (default)
uv run eval <taskset-id> --harness.runtime.type docker # local container (needs local docker)
uv run eval <taskset-id> --harness.runtime.type prime # remote prime sandbox (needs auth)
uv run eval <taskset-id> --harness.runtime.type modal # remote modal sandbox (needs auth)
Concurrency
-c caps rollouts in flight (default 128). Raise it for cheap, non-sandboxed tasksets; keep it lower for sandboxed ones (containers/remote runtimes) and tune up from there.
Models
The taskset's default model is usually fine (deepseek/deepseek-v4-flash). Otherwise pick by tier (-m <slug>):
- Cheap: deepseek/deepseek-v4-flash or z-ai/glm-5.1.
- Good (more capable): openai/gpt-5.4 or openai/gpt-5.5 at medium reasoning.
- Very hard only: openai/gpt-5.5 is the strongest but expensive. Avoid it by default; use it only as a last resort to confirm a task is solvable when the cheaper models get zero reward.
Live dashboard in tmux
The Rich dashboard is on by default and shows a live reward <mean> · err <share> headline, but it doesn't render in a plain captured shell. To watch it live, run inside tmux and drive it with send-keys:
tmux new-session -d -s eval
tmux send-keys -t eval 'uv run eval <taskset-id> -r <r>' Enter
# watch it:
tmux attach -t eval # or: tmux capture-pane -t eval -p
For headless/automated runs, prefer --rich false and follow the logs directly.
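For scripted runs, one option is to assemble the command in Python and hand it to subprocess. A minimal sketch using only flags documented in this file: eval_command and run_eval are hypothetical helpers, and my-taskset-v1 is a placeholder id.

```python
import shlex
import subprocess


def eval_command(taskset_id, rollouts, num_tasks=None, concurrency=None, model=None):
    """Build the argv for a plain-log (headless) v1 eval run."""
    cmd = ["uv", "run", "eval", taskset_id, "-r", str(rollouts), "--rich", "false"]
    if num_tasks is not None:      # omit -n to run all tasks
        cmd += ["-n", str(num_tasks)]
    if concurrency is not None:    # cap on rollouts in flight
        cmd += ["-c", str(concurrency)]
    if model is not None:          # override the default model
        cmd += ["-m", model]
    return cmd


def run_eval(cmd):
    """Run from the repo root; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(cmd, check=True)


# Placeholder taskset id; prints the command instead of running it.
smoke = eval_command("my-taskset-v1", rollouts=1, num_tasks=3)
print(shlex.join(smoke))
```

Passing the argv as a list avoids shell quoting issues; call run_eval(smoke) to actually launch the run.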
Inspect output
Each run is saved to a fresh per-run dir (so runs never overwrite each other):
outputs/<taskset>--<model>--<harness>/<uuid>/
config.toml # the resolved run config — re-runnable via `@ config.toml`
results.jsonl # one full trace per line (the data the platform + prime-rl consume)
eval.log # the run's logs
# find the newest run
ls -dt outputs/*/*/ | head
# avg reward across rollouts (recomputed — aggregates aren't stored)
jq -s 'map(.rewards | add // 0) | add / length' outputs/<...>/results.jsonl
The dashboard shows the avg reward live; in --rich false mode there's no summary line, so recompute it from results.jsonl (each line's rewards is the per-@reward breakdown; the run reward is their sum). Skim a few traces for sanity before publishing.
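The same recomputation can be done in Python when jq isn't available. A minimal sketch, assuming each line's rewards field is a JSON object (or list) of numeric values, as the jq expression above implies; mean_run_reward is a hypothetical helper and the sample lines are fabricated for illustration.

```python
import json


def mean_run_reward(lines) -> float:
    """Average per-rollout reward, where a rollout's reward is the sum of its rewards."""
    totals = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rewards = json.loads(line).get("rewards") or {}
        # Accept either {"name": value, ...} or a bare list of values.
        values = rewards.values() if isinstance(rewards, dict) else rewards
        totals.append(sum(values))
    if not totals:
        raise ValueError("no rollouts found")
    return sum(totals) / len(totals)


# Fabricated sample lines; a missing rewards field counts as 0, like `// 0` in jq.
sample = [
    '{"rewards": {"correct": 1.0, "format": 0.5}}',
    '{"rewards": {"correct": 0.0}}',
    '{}',
]
print(mean_run_reward(sample))
```

For a real run, pass an open file handle on results.jsonl instead of the sample list.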
Legacy v0 runs land under outputs/evals/<env-id>--<model>/<run-id>/ with metadata.json + results.jsonl.
Push results
prime eval push <output_dir>
<output_dir> is a run directory (e.g. outputs/<taskset>--<model>--<harness>/<uuid>). With no path it auto-discovers the latest run under outputs/. Useful overrides:
- -e owner/<name>: published environment slug (push the env with prime env push first if needed)
- --name "<label>": display name for the evaluation