name: evaluate-environments
description: Run and analyze evaluations for verifiers environments using prime eval. Use when asked to smoke-test environments, run benchmark sweeps, resume interrupted evaluations, compare models, inspect sample-level outputs, or produce evaluation summaries suitable for deciding next steps.
Evaluate Environments
Goal
Run reliable environment evaluations and produce actionable summaries, not raw logs.
Canonical Eval Path
- Use prime eval run as the default way to run evaluations.
- Do not add --skip-upload or other opt-out flags unless the user explicitly requests that deviation.
- Standard prime eval run invocations save results automatically, keeping them available in the user's private Evaluations tab and locally in prime eval view.
- For Prime Inference models with available pricing, eval output and saved metadata include estimated total-run USD cost automatically; no extra flags or API-key handling are needed.
Core Loop
- Run a smoke evaluation first (do not require pre-install):
prime eval run my-env -m openai/gpt-4.1-mini -n 5
- Use owner/env slug directly when evaluating Hub environments:
prime eval run owner/my-env -m openai/gpt-4.1-mini -n 5
- Scale only after smoke pass:
prime eval run owner/my-env -m openai/gpt-4.1-mini -n 200 -r 3 --shuffle -s
- Use --shuffle for representative dataset sampling once smoke tests pass. Set --shuffle-seed explicitly for reproducible reports; if omitted, the default seed is 0.
- Treat ownerless env ids as local-first. If not found locally, rely on Prime resolution for your remote env where applicable.
- When the user asks for a "real" or "base" eval, do not substitute a tiny smoke run. Use the requested model/env and make the run size explicit before interpreting results.
- If the user says the defaults are fine or asks for no flags, use the shortest canonical command and rely on global config:
prime eval run my-env
prime eval run my-env -m openai/gpt-4.1-mini
Endpoint Shortcuts And Model Family Choice
- Encourage users to define endpoint aliases in configs/endpoints.toml so model, base URL, and key wiring stay reusable.
- Use aliases via -m <endpoint_id> instead of repeating -b and -k.
- Ask users explicitly whether they want an instruct or reasoning model before non-trivial evaluations.
- Instruct go-tos for quick behavior checks: gpt-4.1 series and qwen3 instruct series.
- Reasoning go-tos for deeper test coverage: gpt-5 series, qwen3 thinking series, and glm series.
- Example endpoint registry:
[[endpoint]]
endpoint_id = "gpt-4.1-mini"
model = "gpt-4.1-mini"
url = "https://api.openai.com/v1"
key = "OPENAI_API_KEY"
[[endpoint]]
endpoint_id = "qwen3-32b-i"
model = "qwen/qwen3-32b-instruct"
url = "https://api.pinference.ai/api/v1"
key = "PRIME_API_KEY"
- Endpoint entries support optional headers (or extra_headers) for custom HTTP headers sent with inference requests:
[[endpoint]]
endpoint_id = "my-proxy"
model = "gpt-4.1-mini"
url = "https://api.example/v1"
key = "OPENAI_API_KEY"
headers = { "X-Custom-Header" = "value" }
- Endpoint entries support api_client_type when the provider is not OpenAI Chat Completions compatible. Use openai_responses for Responses-compatible endpoints and anthropic_messages for Anthropic Messages endpoints:
[[endpoint]]
endpoint_id = "gpt-responses"
model = "gpt-5.4-mini"
url = "https://api.openai.com/v1"
key = "OPENAI_API_KEY"
api_client_type = "openai_responses"
Publish Gate Before Large Runs
- After smoke tests pass and results look stable, proactively suggest pushing the environment to Hub before large eval sweeps or RL work.
- Ask the user explicitly: should visibility be PUBLIC or PRIVATE?
- Push with chosen visibility:
prime env push my-env --visibility PUBLIC
or
prime env push my-env --visibility PRIVATE
- For hosted environment workflows, prefer running large jobs against the Hub slug:
prime eval run owner/my-env -m openai/gpt-4.1-mini -n 200 -r 3 -s
Prefer Config-Driven Evals Beyond Smoke Tests
- For anything beyond quick checks, nudge the user to create an eval TOML config.
- Use config files to run multiple evals in one command and keep runs reproducible:
prime eval run configs/eval/my-benchmark.toml
- Make config files the default for benchmark sweeps, multi-model comparisons, and recurring reports.
- Use name on individual [[eval]] entries when the same environment appears multiple times. id selects the environment to load; name labels the run in displays, summaries, metadata, and saved result paths.
- Put shuffle = true and, for reproducibility, shuffle_seed = <int> in each [[eval]] entry that should sample from a shuffled dataset before selecting num_examples.
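The effect of a fixed shuffle seed can be illustrated with a minimal Python sketch. It is not how prime eval is implemented; select_examples is a hypothetical helper that shuffles dataset indices with a seeded generator and only then keeps the first num_examples, which is the ordering the shuffle settings above describe.

```python
import random


def select_examples(dataset_size: int, num_examples: int, shuffle: bool = False,
                    shuffle_seed: int = 0) -> list[int]:
    """Hypothetical helper: choose example indices for an eval subset."""
    indices = list(range(dataset_size))
    if shuffle:
        # A dedicated, seeded generator keeps the order reproducible across runs.
        random.Random(shuffle_seed).shuffle(indices)
    # Selection happens after shuffling, so the subset can span the whole dataset.
    return indices[:num_examples]


first = select_examples(1000, 5, shuffle=True, shuffle_seed=123)
second = select_examples(1000, 5, shuffle=True, shuffle_seed=123)
print(first == second)           # same seed, same subset
print(select_examples(1000, 5))  # without shuffle: the first five indices
```

Without shuffling, a small num_examples only ever sees the head of the dataset, which is why a shuffled subset tends to be more representative for reports.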
Common Evaluation Patterns
- For single-environment v1 smoke runs, override typed taskset and harness config with dotted flags:
prime eval run my-env --taskset.difficulty hard --harness.max-turns 20
- For reproducible or multi-eval v1 config, put the same settings in TOML child sections:
[[eval]]
id = "my-env"
[eval.taskset]
difficulty = "hard"
[eval.harness]
max_turns = 20
- Override legacy/v0 constructor kwargs only when the environment still exposes them; for v1, use taskset/harness config instead:
prime eval run my-env -x '{"max_turns":20}'
- Bound per-rollout wall-clock time (use the dedicated --timeout flag; wins over -x and TOML [eval.extra_env_kwargs]):
prime eval run my-env --timeout 600
- Save extra state columns:
prime eval run my-env -s -C "judge_response,parsed_answer"
- Resume interrupted runs:
prime eval run my-env -n 1000 -s --resume
Resume matching includes --shuffle and --shuffle-seed. Use the same shuffle settings as the interrupted run; only increase -n/--num-examples when extending a saved run.
- Shuffle examples before selecting the evaluation subset:
prime eval run my-env -n 200 -r 3 --shuffle --shuffle-seed 123 -s
- Configure shuffle in TOML:
[[eval]]
id = "my-env"
num_examples = 200
rollouts_per_example = 3
shuffle = true
shuffle_seed = 123
- Save results to a custom output directory:
prime eval run my-env -s -o /path/to/output
- Run multi-environment TOML suites:
prime eval run configs/eval/my-benchmark.toml
- Run the same environment more than once with different args by giving each entry a name:
[[eval]]
id = "reverse-text"
name = "reverse-text-short"
[eval.args]
max_length = 32
[[eval]]
id = "reverse-text"
name = "reverse-text-long"
[eval.args]
max_length = 256
- Put generation parameters in TOML sampling sections:
[sampling]
max_tokens = 1024
temperature = 0.7
reasoning_effort = "medium"
enable_thinking = true
[[eval]]
env_id = "my-env"
Use [eval.sampling] for per-eval overrides. [sampling] is shorthand for sampling_args; reasoning_effort and enable_thinking stay top-level and are mirrored into extra_body.chat_template_kwargs.
- Pass extra HTTP headers via CLI (repeatable):
prime eval run my-env -m my-proxy --header "X-Custom-Header: value"
- Set headers in [[eval]] TOML configs as a table or list (merge order: registry row < headers table < header list / --header):
[[eval]]
env_id = "my-env"
headers = { "X-Custom-Header" = "value" }
header = ["X-Another: val"]
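The merge order can be pictured as successive dictionary updates in which later sources override earlier ones. The following Python sketch only illustrates that precedence; parse_header_list and merge_headers are hypothetical names, not part of the prime CLI, and the registry value is made up for the example.

```python
def parse_header_list(entries: list[str]) -> dict[str, str]:
    """Turn "Name: value" strings into a dict (hypothetical helper)."""
    parsed = {}
    for entry in entries:
        # Split on the first colon only, so values may contain colons.
        name, sep, value = entry.partition(":")
        if not sep:
            raise ValueError(f"expected 'Name: value', got {entry!r}")
        parsed[name.strip()] = value.strip()
    return parsed


def merge_headers(registry_row: dict[str, str], headers_table: dict[str, str],
                  header_list: list[str]) -> dict[str, str]:
    """Apply the precedence: registry row < headers table < header list."""
    merged = dict(registry_row)
    merged.update(headers_table)
    merged.update(parse_header_list(header_list))
    return merged


merged = merge_headers(
    registry_row={"X-Custom-Header": "from-registry"},  # made-up registry value
    headers_table={"X-Custom-Header": "value"},
    header_list=["X-Another: val"],
)
print(merged)
```

Here the headers table replaces the registry value for X-Custom-Header, and the header list contributes X-Another on top.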
- Run ablation sweeps using [[ablation]] blocks in TOML configs:
[[ablation]]
env_id = "my-env"
[ablation.sweep]
temperature = [0.0, 0.5, 1.0]
[ablation.sweep.taskset]
difficulty = ["easy", "hard"]
This generates the cartesian product (6 configs in this example). Sweep v1 environment-owned settings under taskset or harness, not as root args. Use --abbreviated-summary (-A) for compact ablation results.
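To see where the count of 6 comes from, the sweep can be expanded by hand. This is a small Python sketch of a cartesian product over the two axes from the example; the flattened taskset.difficulty key is a naming choice made here for illustration, not necessarily how prime eval represents the expanded configs.

```python
from itertools import product

# Sweep axes copied from the TOML example; the dotted key is an illustrative flattening.
sweep = {
    "temperature": [0.0, 0.5, 1.0],
    "taskset.difficulty": ["easy", "hard"],
}

keys = list(sweep)
# One config per combination of values, in the order itertools.product yields them.
configs = [dict(zip(keys, values)) for values in product(*(sweep[k] for k in keys))]

for config in configs:
    print(config)
print(len(configs))  # 3 temperatures x 2 difficulties = 6
```

Every extra axis multiplies the number of runs, so it is worth computing the product size before launching a sweep.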
Inspect Saved Results
- Browse locally saved runs:
prime eval view
- Check metadata.json for aggregate token usage and, when available, total-run cost.input_usd, cost.output_usd, and cost.total_usd.
- Inspect platform-visible runs when needed:
prime eval list
prime eval get <eval-id>
prime eval samples <eval-id>
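When writing a report, the cost fields can be pulled out of a saved metadata.json with a few lines of Python. The sketch below assumes the dotted names map to a nested cost object and that cost may be absent; both the layout and the sample values are assumptions for illustration, so check a real metadata.json before relying on it.

```python
import json


def summarize_cost(metadata: dict) -> str:
    """Format cost fields if present; the nested "cost" layout is an assumption."""
    cost = metadata.get("cost")
    if not cost:
        # Pricing is only reported when available, so treat absence as normal.
        return "cost: not available"
    return (
        f"input=${cost['input_usd']:.4f} "
        f"output=${cost['output_usd']:.4f} "
        f"total=${cost['total_usd']:.4f}"
    )


# Made-up sample standing in for the contents of a saved metadata.json.
sample = json.loads(
    '{"cost": {"input_usd": 0.0125, "output_usd": 0.05, "total_usd": 0.0625}}'
)
print(summarize_cost(sample))
print(summarize_cost({}))
```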
Metrics Interpretation
- Treat binary and continuous rewards differently.
- Use pass@k-style interpretation only when rewards are effectively binary.
- For continuous rewards, focus on distribution shifts and per-task means.
- Always inspect samples before concluding regressions.
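As a concrete reference for the two cases, here is a minimal Python sketch. pass_at_k uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k) for n rollouts with c successes; whether prime eval reports this exact quantity is not stated here, so treat it as an analysis aid. per_task_means is a hypothetical helper for continuous rewards, and the reward values are made up.

```python
from math import comb
from statistics import mean


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n rollouts with c successes (binary rewards)."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)


def per_task_means(rewards_by_task: dict[str, list[float]]) -> dict[str, float]:
    """Mean reward per task, the natural summary for continuous rewards."""
    return {task: mean(rewards) for task, rewards in rewards_by_task.items()}


# With -r 3, each example has n = 3 rollouts.
print(pass_at_k(n=3, c=1, k=1))
print(pass_at_k(n=3, c=1, k=3))
print(per_task_means({"task-a": [0.25, 0.75], "task-b": [1.0, 0.5]}))
```

Applying pass_at_k to thresholded continuous rewards hides how far below the threshold failures fall, which is why the per-task means are the better view in that case.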
Reliability Rules
- Keep environment/model/config fixed while comparing variants.
- Record exact command lines and key flags in the report.
- Call out missing credentials, endpoint mismatches, and dependency errors directly.
- Do not overinterpret tiny sample runs.
- Distinguish a completed rollout with poor reward from an environment/runtime failure.
- For timeout debugging, check the environment's own timeout behavior and the outer sandbox/eval timeout before changing reward logic.
- For repo example changes, use tests/test_envs.py -k <env> when package installability is part of the risk, not just prime eval run from the current checkout.
Output Format
Return:
- Run configuration table.
- Aggregate metrics and key deltas.
- Sample-level failure themes.
- Clear recommendation: proceed, iterate environment, or retune model/sampling.