evaluate-environments - SKILL.md Agent Skill

name: evaluate-environments
description: Run and analyze evaluations for verifiers environments using prime eval. Use when asked to smoke-test environments, run benchmark sweeps, resume interrupted evaluations, compare models, inspect sample-level outputs, or produce evaluation summaries suitable for deciding next steps.

Evaluate Environments

Goal

Run reliable environment evaluations and produce actionable summaries, not raw logs.

Canonical Eval Path

Use prime eval run as the default way to run evaluations.
Do not add --skip-upload or other opt-out flags unless the user explicitly requests that deviation.
Standard prime eval run invocations save results automatically, keeping them available in the user's private Evaluations tab and locally through prime eval view.
For Prime Inference models with available pricing, eval output and saved metadata include estimated total-run USD cost automatically; no extra flags or API-key handling are needed.

Core Loop

Run a smoke evaluation first (do not require pre-install):

prime eval run my-env -m openai/gpt-4.1-mini -n 5

Use owner/env slug directly when evaluating Hub environments:

prime eval run owner/my-env -m openai/gpt-4.1-mini -n 5

Scale only after smoke pass:

prime eval run owner/my-env -m openai/gpt-4.1-mini -n 200 -r 3 --shuffle -s

Use --shuffle for representative dataset sampling once smoke tests pass. Set --shuffle-seed explicitly for reproducible reports; if omitted, the default seed is 0.
Treat env ids without an owner prefix as local-first. If the environment is not found locally, rely on Prime resolution for your remote env where applicable.
When the user asks for a "real" or "base" eval, do not substitute a tiny smoke run. Use the requested model/env and make the run size explicit before interpreting results.
If the user says the defaults are fine or asks for no flags, use the shortest canonical command and rely on global config:

prime eval run my-env
prime eval run my-env -m openai/gpt-4.1-mini

Endpoint Shortcuts And Model Family Choice

Encourage users to define endpoint aliases in configs/endpoints.toml so model, base URL, and key wiring stay reusable.
Use aliases via -m <endpoint_id> instead of repeating -b and -k.
Ask users explicitly whether they want an instruct or reasoning model before non-trivial evaluations.
Instruct go-tos for quick behavior checks: gpt-4.1 series and qwen3 instruct series.
Reasoning go-tos for deeper test coverage: gpt-5 series, qwen3 thinking series, and glm series.
Example endpoint registry:

[[endpoint]]
endpoint_id = "gpt-4.1-mini"
model = "gpt-4.1-mini"
url = "https://api.openai.com/v1"
key = "OPENAI_API_KEY"

[[endpoint]]
endpoint_id = "qwen3-32b-i"
model = "qwen/qwen3-32b-instruct"
url = "https://api.pinference.ai/api/v1"
key = "PRIME_API_KEY"

Endpoint entries support optional headers (or extra_headers) for custom HTTP headers sent with inference requests:

[[endpoint]]
endpoint_id = "my-proxy"
model = "gpt-4.1-mini"
url = "https://api.example/v1"
key = "OPENAI_API_KEY"
headers = { "X-Custom-Header" = "value" }

Endpoint entries support api_client_type when the provider is not OpenAI Chat Completions compatible. Use openai_responses for Responses-compatible endpoints and anthropic_messages for Anthropic Messages endpoints:

[[endpoint]]
endpoint_id = "gpt-responses"
model = "gpt-5.4-mini"
url = "https://api.openai.com/v1"
key = "OPENAI_API_KEY"
api_client_type = "openai_responses"

Publish Gate Before Large Runs

After smoke tests pass and results look stable, proactively suggest pushing the environment to Hub before large eval sweeps or RL work.
Ask the user explicitly: should visibility be PUBLIC or PRIVATE?
Push with chosen visibility:

prime env push my-env --visibility PUBLIC

prime env push my-env --visibility PRIVATE

For hosted environment workflows, prefer running large jobs against the Hub slug:

prime eval run owner/my-env -m openai/gpt-4.1-mini -n 200 -r 3 -s

Prefer Config-Driven Evals Beyond Smoke Tests

For anything beyond quick checks, nudge the user to create an eval TOML config.
Use config files to run multiple evals in one command and keep runs reproducible:

prime eval run configs/eval/my-benchmark.toml

Make config files the default for benchmark sweeps, multi-model comparisons, and recurring reports.
Use name on individual [[eval]] entries when the same environment appears multiple times. id selects the environment to load; name labels the run in displays, summaries, metadata, and saved result paths.
Put shuffle = true and, for reproducibility, shuffle_seed = <int> in each [[eval]] entry that should sample from a shuffled dataset before selecting num_examples.

Common Evaluation Patterns

For single-environment v1 smoke runs, override typed taskset and harness config with dotted flags:

prime eval run my-env --taskset.difficulty hard --harness.max-turns 20

For reproducible or multi-eval v1 config, put the same settings in TOML child sections:

[[eval]]
id = "my-env"

[eval.taskset]
difficulty = "hard"

[eval.harness]
max_turns = 20

Override legacy/v0 constructor kwargs only when the environment still exposes them; for v1, use taskset/harness config instead:

prime eval run my-env -x '{"max_turns":20}'

Bound per-rollout wall-clock time with the dedicated --timeout flag, which takes precedence over -x and TOML [eval.extra_env_kwargs]:

prime eval run my-env --timeout 600

Save extra state columns:

prime eval run my-env -s -C "judge_response,parsed_answer"

Resume interrupted runs:

prime eval run my-env -n 1000 -s --resume

Resume matching includes --shuffle and --shuffle-seed. Use the same shuffle settings as the interrupted run; only increase -n/--num-examples when extending a saved run.

Shuffle examples before selecting the evaluation subset:

prime eval run my-env -n 200 -r 3 --shuffle --shuffle-seed 123 -s

Configure shuffle in TOML:

[[eval]]
id = "my-env"
num_examples = 200
rollouts_per_example = 3
shuffle = true
shuffle_seed = 123

Save results to a custom output directory:

prime eval run my-env -s -o /path/to/output

Run multi-environment TOML suites:

prime eval run configs/eval/my-benchmark.toml

Run the same environment more than once with different args by giving each entry a name:

[[eval]]
id = "reverse-text"
name = "reverse-text-short"

[eval.args]
max_length = 32

[[eval]]
id = "reverse-text"
name = "reverse-text-long"

[eval.args]
max_length = 256

Put generation parameters in TOML sampling sections:

[sampling]
max_tokens = 1024
temperature = 0.7
reasoning_effort = "medium"
enable_thinking = true

[[eval]]
env_id = "my-env"

Use [eval.sampling] for per-eval overrides. [sampling] is shorthand for sampling_args; reasoning_effort and enable_thinking stay top-level and are mirrored into extra_body.chat_template_kwargs.

Pass extra HTTP headers via CLI (repeatable):

prime eval run my-env -m my-proxy --header "X-Custom-Header: value"

Set headers in [[eval]] TOML configs as a table or list (merge order: registry row < headers table < header list / --header):

[[eval]]
env_id = "my-env"
headers = { "X-Custom-Header" = "value" }
header = ["X-Another: val"]
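The merge order above (registry row < headers table < header list / --header) can be pictured as successive dictionary updates, where later sources overwrite earlier ones. The sketch below only illustrates that precedence; it is not the prime CLI's implementation, the function names are hypothetical, and header-name case handling is not modeled.

```python
from typing import Dict, Iterable, Optional, Tuple


def parse_header_line(line: str) -> Tuple[str, str]:
    """Split a 'Name: value' string into (name, value)."""
    name, sep, value = line.partition(":")
    if not sep or not name.strip():
        raise ValueError(f"expected 'Name: value', got {line!r}")
    return name.strip(), value.strip()


def merge_headers(
    registry: Optional[Dict[str, str]],
    table: Optional[Dict[str, str]],
    header_list: Optional[Iterable[str]],
) -> Dict[str, str]:
    """Apply sources from lowest to highest precedence (hypothetical helper)."""
    merged = dict(registry or {})   # lowest: endpoint registry row
    merged.update(table or {})      # middle: headers table in [[eval]]
    for line in header_list or []:  # highest: header list / --header
        name, value = parse_header_line(line)
        merged[name] = value
    return merged


result = merge_headers(
    {"X-Custom-Header": "from-registry", "X-Keep": "kept"},
    {"X-Custom-Header": "from-table"},
    ["X-Custom-Header: from-cli", "X-Another: val"],
)
print(result)
```

In this sketch the value set through the header list wins for X-Custom-Header, while X-Keep survives from the registry row because nothing overrides it.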

Run ablation sweeps using [[ablation]] blocks in TOML configs:

[[ablation]]
env_id = "my-env"

[ablation.sweep]
temperature = [0.0, 0.5, 1.0]

[ablation.sweep.taskset]
difficulty = ["easy", "hard"]

This generates the Cartesian product of the swept values (3 temperatures × 2 difficulties = 6 configs in this example). Sweep v1 environment-owned settings under taskset or harness, not as root args. Use --abbreviated-summary (-A) for compact ablation results.
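To sanity-check how many runs a sweep will launch before starting it, the expansion can be reproduced with itertools.product. This is a minimal sketch of the counting logic only: the dotted key "taskset.difficulty" is a notation chosen here for illustration, and the way prime eval names or orders the generated configs is not assumed.

```python
from itertools import product
from typing import Any, Dict, List

# Swept values copied from the example above; the dotted key is illustrative.
sweep: Dict[str, List[Any]] = {
    "temperature": [0.0, 0.5, 1.0],
    "taskset.difficulty": ["easy", "hard"],
}


def expand_sweep(sweep: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Return one dict per combination of swept values."""
    keys = list(sweep)
    return [dict(zip(keys, combo)) for combo in product(*(sweep[k] for k in keys))]


configs = expand_sweep(sweep)
print(len(configs))  # 6
for cfg in configs:
    print(cfg)
```

Because the count multiplies with every swept key, adding a third key with four values would turn 6 runs into 24, which is worth checking against budget before scaling -n or -r.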

Inspect Saved Results

Browse locally saved runs:

prime eval view

Check metadata.json for aggregate token usage and, when available, total-run cost.input_usd, cost.output_usd, and cost.total_usd.
Inspect platform-visible runs when needed:

prime eval list
prime eval get <eval-id>
prime eval samples <eval-id>
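For reports, the cost fields in metadata.json can be pulled out with a few lines of Python. The sketch below assumes the fields live in a nested "cost" object with the keys input_usd, output_usd, and total_usd, as the dotted names above suggest; the exact file layout is an assumption, and the sample values are placeholders rather than real results.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional

COST_KEYS = ("input_usd", "output_usd", "total_usd")


def load_metadata(path: str) -> Dict[str, Any]:
    """Read a saved metadata.json file (path supplied by the caller)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def read_cost(metadata: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the cost fields, or None when no cost was recorded.

    Assumes a nested "cost" object; cost is only present when pricing
    is available for the model.
    """
    cost = metadata.get("cost")
    if not isinstance(cost, dict):
        return None
    return {key: cost.get(key) for key in COST_KEYS}


# Placeholder metadata for illustration only; not output from a real run.
example = {"cost": {"input_usd": 0.10, "output_usd": 0.20, "total_usd": 0.30}}
print(read_cost(example))
print(read_cost({}))  # None: no cost recorded
```

Returning None rather than zero keeps "no pricing available" distinct from "the run cost nothing" when the summary is written.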

Metrics Interpretation

Treat binary and continuous rewards differently.
Use pass@k-style interpretation only when rewards are effectively binary.
For continuous rewards, focus on distribution shifts and per-task means.
Always inspect samples before concluding regressions.
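As a concrete reference for the binary versus continuous split, the sketch below computes the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for binary rewards and a plain per-task mean for continuous ones. It is not confirmed to match what prime eval reports; the function names and the idea of counting a rollout as correct when its reward reaches a threshold are assumptions made for illustration.

```python
from math import comb
from statistics import mean
from typing import Dict, List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n rollouts, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct rollout
    return 1.0 - comb(n - c, k) / comb(n, k)


def per_task_means(rewards: Dict[str, List[float]]) -> Dict[str, float]:
    """Mean reward per task, for continuous rewards."""
    return {task: mean(values) for task, values in rewards.items()}


# With -r 3, each example has n = 3 rollouts.
print(pass_at_k(n=3, c=1, k=1))
print(pass_at_k(n=3, c=1, k=3))
print(per_task_means({"task-a": [0.2, 0.4, 0.6], "task-b": [1.0, 1.0, 0.0]}))
```

Averaging pass_at_k over tasks gives the dataset-level figure; for continuous rewards, comparing the per-task means between two runs shows where a shift comes from instead of hiding it in one aggregate.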

Reliability Rules

Keep environment/model/config fixed while comparing variants.
Record exact command lines and key flags in the report.
Call out missing credentials, endpoint mismatches, and dependency errors directly.
Do not overinterpret tiny sample runs.
Distinguish a completed rollout with poor reward from an environment/runtime failure.
For timeout debugging, check the environment's own timeout behavior and the outer sandbox/eval timeout before changing reward logic.
For repo example changes, use tests/test_envs.py -k <env> when package installability is part of the risk, not just prime eval run from the current checkout.
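To make the warning about tiny sample runs concrete, a confidence interval on a binary pass rate shows how little a 5-example smoke run pins down. The sketch below uses the Wilson score interval with z = 1.96 (roughly 95%); it is a generic statistical aid, not something prime eval is assumed to compute, and it treats rollouts as independent, which is only approximately true when several rollouts share an example.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must satisfy 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Same observed pass rate (80%), very different certainty.
print(wilson_interval(4, 5))      # smoke run with -n 5
print(wilson_interval(160, 200))  # scaled run with -n 200
```

The 5-example interval spans most of the range between a weak and a strong model, so a smoke run is evidence that the pipeline works, not evidence about model quality.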

Output Format

Return:

Run configuration table.
Aggregate metrics and key deltas.
Sample-level failure themes.
Clear recommendation: proceed, iterate environment, or retune model/sampling.