evaluation

name: evaluation description: Install and run a verifiers environment — smoke testing during development, full benchmark evals, and pushing results to the Prime platform. Covers the v1 `eval` CLI (the `*_v1` tasksets) and the legacy v0 `vf-eval` CLI. Independent sections; use whichever the request needs. Use while developing/iterating on an environment, when running/evaluating/benchmarking one, or when the user mentions eval, vf-eval, tasksets, eval runs, smoke tests, or pushing eval results.

Running an environment in this repo — both while developing/iterating on it (smoke-testing changes) and when benchmarking a finished env and publishing results. Run everything from the repo root, always via uv.

Which CLI?

Two environment styles coexist in environments/:

v1 taskset — a *_v1 package that exports a Taskset (and no load_environment). Run it with the eval CLI, by taskset id: uv run eval <taskset-id>.
legacy v0 env — a classic env that exposes load_environment. Run it with vf-eval: uv run vf-eval <env-id>.

The tell is the export, not the name: a v1 taskset's package has __all__ = ["...Taskset"] and no load_environment. The sections below lead with the v1 eval flow and note the v0 vf-eval equivalent where it differs. They're independent — do only what the request calls for.

Setup

Editable, local install from the repo root (not from inside the env dir):

uv pip install -e environments/<env_id>

Editable means code edits are picked up without reinstalling — so during development you can change the env and immediately re-run it. Only re-run this install after editing pyproject.toml (e.g. new deps). v1 tasksets pin a verifiers pre-release; if the install complains, add --prerelease=allow.

After dependency edits, sync the environment project before testing imports:

uv sync --project environments/<env_id> --all-extras

If a fresh environment import fails because resolved package versions are incompatible, fix the environment's pyproject.toml bounds and sync again. Do not patch generated .venv files.

Smoke test

Run a 3x1 (3 tasks, 1 rollout each) in plain-log mode to confirm the env loads and scores end-to-end. Spanning a few tasks surfaces weird things (bad rows, edge-case prompts, inconsistent scoring) that a single task hides:

uv run eval <taskset-id> -n 3 -r 1 --rich false -v

--rich false turns off the live dashboard (which is on by default) for plain logs, -v prints prompts/completions. This is the inner loop while developing — re-run it after each change to verify the dataset loads, the rollout runs, and the rubric scores as expected. Fix any errors here before scaling up.

Legacy v0: uv run vf-eval <env-id> -n 3 -r 1 -d -v (-d is the v0 equivalent of --rich false).

Full eval

Determine sample size from the environment's pyproject.toml:

If [tool.verifiers.eval] is present, follow it — it sets the convention for this env:

[tool.verifiers.eval]
num_examples = 5          # -> -n 5
rollouts_per_example = 3  # -> -r 3

If absent, run all tasks (omit -n) and pick -r so the total sample is representative — usually >500 total rollouts (num_tasks × r) is a good target. For large datasets -r 1 may already clear that; for small ones, bump -r.

Recommended full-eval invocation:

uv run eval <taskset-id> -r <r> -c <N> --rich false

The run is always saved to disk (see Inspect output) — there's no save flag.

Legacy v0: uv run vf-eval <env-id> -n -1 -r <r> -i -s -c -1 -d (v0 needs -s to save and -n -1 for all examples).

Key flags (`eval`)

Flag	Meaning	When
`-m <slug>`	model id (default `deepseek/deepseek-v4-flash`)	to override the default
`-n <N>`	number of tasks; omit for all	smoke (`-n 3`) vs full (omit)
`-r <N>`	rollouts per task (>=2 if the taskset has `@group_reward`s)	almost always set
`-c <N>`	max rollouts in flight (default `128`)	raise for cheap envs, lower for sandboxed
`-s`	shuffle tasks before taking the first `-n`	sampling a subset of a big dataset
`-v`	debug logs (prompts/completions)	developing/debugging
`--rich false`	plain logs instead of the live dashboard	non-interactive/captured runs; required with `--server`
`-o <dir>`	output dir (default a fresh per-run dir)	pinning a known location
`--resume <dir>`	re-run only a prior run's missing/errored rollouts	recovering an interrupted run

Coming from vf-eval? The flags overlap but a few differ:

-s means shuffle, not save. v1 eval always writes the run to disk; there's no save flag.
No -a '{...}' JSON. Pass env args as typed flags (--taskset.*, --harness.*) or @ eval.toml (see below).
The dashboard is on by default — pass --rich false (the analogue of v0 -d) for plain logs.
All tasks = omit -n (there's no -n -1 idiom).
No -i — scoring is defined by the taskset's @reward/@group_rewards.

Configuring the env

Pass typed, dotted flags or a TOML file — there is no -a JSON blob:

uv run eval wikispeedia-v1 --taskset.min-path-length 5 --taskset.max-path-length 8
uv run eval <taskset-id> @ eval.toml          # a run's own config.toml is itself re-runnable

Common knobs: --max-turns, --max-total-tokens, --sampling.max-tokens, --sampling.temperature, --harness.id, --harness.runtime.type. uv run eval <taskset-id> -h prints the full typed help, narrowed to the chosen taskset/harness.

Legacy v0 uses -a JSON: uv run vf-eval <env-id> -a '{"task": "...", "max_turns": 50}'.

Harness & runtime (sandboxed / agentic tasksets)

A taskset that bundles its own harness runs with it by default; otherwise pass --harness.id. Select where rollouts execute with --harness.runtime.type:

uv run eval <taskset-id> --harness.runtime.type subprocess  # local process (default)
uv run eval <taskset-id> --harness.runtime.type docker      # local container (needs local docker)
uv run eval <taskset-id> --harness.runtime.type prime       # remote prime sandbox (needs auth)
uv run eval <taskset-id> --harness.runtime.type modal       # remote modal sandbox (needs auth)

Concurrency

-c caps rollouts in flight (default 128). Raise it for cheap, non-sandboxed tasksets; keep it lower for sandboxed ones (containers/remote runtimes) and tune up from there.

Models

The taskset's default model is usually fine (deepseek/deepseek-v4-flash). Otherwise pick by tier (-m <slug>):

Cheap — deepseek/deepseek-v4-flash or z-ai/glm-5.1.
Good (more capable) — openai/gpt-5.4 or openai/gpt-5.5 at medium reasoning.
Very hard only — openai/gpt-5.5 is the strongest but expensive. Avoid it by default; use it only as a last resort to confirm a task is solvable when the cheaper models get zero reward.

Live dashboard in tmux

The Rich dashboard is on by default and shows a live reward <mean> · err <share> headline, but it doesn't render in a plain captured shell. To watch it live, run inside tmux and drive it with send-keys:

tmux new-session -d -s eval
tmux send-keys -t eval 'uv run eval <taskset-id> -r <r>' Enter
# watch it:
tmux attach -t eval        # or: tmux capture-pane -t eval -p

For headless/automated runs, prefer --rich false and follow the logs directly.

Inspect output

Each run is saved to a fresh per-run dir (so runs never overwrite each other):

outputs/<taskset>--<model>--<harness>/<uuid>/
  config.toml      # the resolved run config — re-runnable via `@ config.toml`
  results.jsonl    # one full trace per line (the data the platform + prime-rl consume)
  eval.log         # the run's logs

# find the newest run
ls -dt outputs/*/*/ | head

# avg reward across rollouts (recomputed — aggregates aren't stored)
jq -s 'map(.rewards | add // 0) | add / length' outputs/<...>/results.jsonl

The dashboard shows the avg reward live; in --rich false mode there's no summary line, so recompute it from results.jsonl (each line's rewards is the per-@reward breakdown; the run reward is their sum). Skim a few traces for sanity before publishing.

Legacy v0 runs land under outputs/evals/<env-id>--<model>/<run-id>/ with metadata.json + results.jsonl.

Push results

prime eval push <output_dir>

<output_dir> is a run directory (e.g. outputs/<taskset>--<model>--<harness>/<uuid>). With no path it auto-discovers the latest run under outputs/. Useful overrides:

-e owner/<name> — published environment slug (push the env with prime env push first if needed)
--name "<label>" — display name for the evaluation