name: review-environments
description: Review verifiers environments for correctness, robustness, and ecosystem compatibility. Use when asked for environment code review, quality audit, migration validation, or release readiness checks for local environments or environments pulled from the Hub.
Review Environments
Goal
Find correctness risks and regressions first, then assess maintainability and ecosystem compliance.
Review Input Modes
- Local environment module in ./environments/<env_name>.
- Pulled Hub environment via prime env pull owner/name.
- Installed package under active workspace.
Review Workflow
- Identify environment contract:
- load_environment(...)
- base class and rollout behavior (SingleTurnEnv, MultiTurnEnv, ToolEnv/MCPEnv/StatefulToolEnv, SandboxEnv/PythonEnv, V1 vf.Env with explicit vf.Taskset/vf.Harness objects for framework programs, CliAgentEnv for sandboxed agents)
- rubric and metrics
- Verify installability and runtime entrypoint with the canonical eval path. Do not add --skip-upload unless the user explicitly requests that deviation; standard runs save automatically for the private Evaluations tab and prime eval view:
prime env install <env>
prime eval run <env> -m openai/gpt-4.1-mini -n 5
- Trace reward pipeline and validate scoring semantics.
- Run targeted checks for tool/stateful behavior where applicable.
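The install-and-eval step in the workflow above can be scripted so every review runs the same commands. The following is a minimal sketch, assuming the prime CLI is on PATH; the helper names build_review_commands and run_review are hypothetical and not part of any library, and the default model and -n value are copied from the commands above.

```python
import shlex
import subprocess


def build_review_commands(env, model="openai/gpt-4.1-mini", n=5):
    """Return the install and eval argv lists for one environment (hypothetical helper)."""
    # --skip-upload is deliberately absent: it is only added on explicit user request.
    return [
        ["prime", "env", "install", env],
        ["prime", "eval", "run", env, "-m", model, "-n", str(n)],
    ]


def run_review(env, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in build_review_commands(env):
        print(shlex.join(argv))
        if not dry_run:
            # check=True stops at the first failing step instead of continuing.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_review("my-env")  # dry run: prints the two commands without executing them
```

Keeping the argv construction separate from execution lets a reviewer inspect or log the exact commands before anything runs.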
Endpoint And Model Selection Nudge
- Encourage endpoint alias setup in configs/endpoints.toml for reproducible review runs.
- Check api_client_type when reviewing non-default providers. openai_chat_completions is the default; openai_responses and anthropic_messages should be explicit in endpoint configs when those protocols are required.
- Ask whether review coverage should prioritize instruct or reasoning behavior.
- Instruct go-tos: gpt-4.1 series, qwen3 instruct series.
- Reasoning go-tos: gpt-5 series, qwen3 thinking series, glm series.
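The api_client_type check above can be done mechanically once the endpoint entries are loaded. This is a rough sketch under stated assumptions: the alias-to-settings mapping is a hypothetical in-memory layout, not the actual configs/endpoints.toml schema, and check_client_types is an illustrative helper. The set of client types contains only the three values named above; others may exist.

```python
# Only the three values named in this section; treat the set as possibly incomplete.
KNOWN_CLIENT_TYPES = {
    "openai_chat_completions",
    "openai_responses",
    "anthropic_messages",
}
DEFAULT_CLIENT_TYPE = "openai_chat_completions"


def check_client_types(endpoints):
    """Return review notes for a mapping of alias -> settings dict (assumed layout)."""
    notes = []
    for alias, settings in sorted(endpoints.items()):
        client_type = settings.get("api_client_type")
        if client_type is None:
            # Fine for default providers; worth a note for anything else.
            notes.append(f"{alias}: api_client_type unset, falls back to {DEFAULT_CLIENT_TYPE}")
        elif client_type not in KNOWN_CLIENT_TYPES:
            notes.append(f"{alias}: unrecognized api_client_type {client_type!r}")
    return notes
```

An unset value is reported rather than rejected, since the default is correct for many endpoints and the reviewer decides whether the provider needs an explicit protocol.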
Critical Review Criteria
- Reward correctness:
- Prefer deterministic, explicit checks or LLM judges.
- Flag best-effort keyword or style heuristics unless explicitly approved.
- Verify the scoring semantics from code before treating a low reward as an implementation failure. Some environments intentionally complete with 0.0 reward when the model fails the task.
- Environment self-containment:
- Flag any requirement for user-managed background services before load_environment().
- Require environment-managed lifecycle for sandboxes/sessions.
- v1 taskset/harness contracts:
- Expect new taskset/harness environments to use the v1 vf.Env/vf.Taskset/vf.Harness boundary, with load_taskset(config: MyTasksetConfig) and optional load_harness(config: MyHarnessConfig) defining child config types, plus the canonical load_environment(config: vf.EnvConfig) shim delegating through vf.load_taskset(config=config.taskset) and vf.load_harness(config=config.harness).
- Expect tasksets to own task data, task-owned tools, user behavior, metrics, rewards, and task-specific config. Flag one-off harness classes that only wrap task behavior.
- Review v1 implementations against the generated prime env init my-env --v1 shape: task settings in TasksetConfig, tasks in load_tasks, task-owned tools in load_toolsets, user behavior in User subclasses, lifecycle/metrics/rewards as @vf.* methods, and typed component entrypoints through load_taskset, optional load_harness, and load_environment.
- Require environment packages and READMEs to preserve the generated prime env init structure. Flag hand-scaffolded environments and freeform environment READMEs; authors should fill in the CLI-generated template sections instead of inventing a new shape.
- Expect shared dependencies to use bindings owned by the taskset, toolset, user, program, or harness that needs them. Flag pre-initialized resource objects passed through environment loaders; object entries should be serializable loader paths or no-arg loader callables.
- Verify Task data is serializable, state remains serializable at rollout boundaries, and model/client controls flow through runtime state rather than top-level dataset columns.
- For V1 harness programs, verify framework clients consume state.get_endpoint_config(api="chat") rather than hardcoding an upstream LLM endpoint. For CliAgentEnv agents, verify sandboxed agent code consumes the injected interception endpoint; the proxy is what makes rollouts visible to the rubric.
- Migration fidelity:
- For ports, verify one-to-one equivalence of prompts, tool traces, and scoring logic.
- Flag any assumptions made without user decision.
- Secrets handling:
- Ensure required keys are validated in load_environment() with vf.ensure_keys(...).
- Performance and scaling:
- Identify obvious bottlenecks in dataset loading, rubric calls, or tool execution.
- Packaging and repo hygiene:
- If an environment was renamed or moved, verify pyproject.toml, README/docs references, package include paths, tests, and generated AGENTS output were updated together.
- Flag bytecode, coverage files, local eval outputs, and temporary build artifacts unless they are intentional release assets.
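The repo-hygiene check for stray artifacts can be partly automated. The sketch below assumes a set of common Python artifact names; ARTIFACT_PATTERNS and find_artifacts are illustrative, not part of any tool named here. Local eval output locations are not covered because their path is not stated in this document, so add a pattern for them per repository.

```python
from pathlib import Path

# Assumed patterns for bytecode, coverage, and build residue; adjust per repository.
# "build" and "dist" can also match intentional directories, so review hits by hand.
ARTIFACT_PATTERNS = (
    "__pycache__",
    "*.pyc",
    ".coverage",
    "htmlcov",
    "build",
    "dist",
    "*.egg-info",
)


def find_artifacts(root):
    """Return sorted paths (relative to root) that match any artifact pattern."""
    root = Path(root)
    hits = set()
    for pattern in ARTIFACT_PATTERNS:
        # rglob searches root and every subdirectory for the pattern.
        for path in root.rglob(pattern):
            hits.add(path.relative_to(root).as_posix())
    return sorted(hits)
```

The result is a candidate list for the reviewer, not a verdict: anything that is an intentional release asset stays.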
Config And Docs Surface
- Check that eval, GEPA, RL, and Hosted Training examples use the same public TOML shape where applicable.
- For v1 configs, route settings through taskset and harness child config sections; do not subclass EnvConfig just to narrow child config types, and avoid root env config knobs.
- If docs changed public behavior, verify the relevant bundled skill was updated too.
Findings Format
Return findings first, sorted by severity:
- P0/P1 bugs and behavioral mismatches.
- P2 quality risks and maintainability issues.
- Test gaps and missing eval coverage.
Include file paths, exact lines, impact, and concrete fix direction.
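One way to keep the ordering and required fields consistent is to collect findings as records and sort them before writing the report. This is a minimal sketch: the Finding class, the "test-gap" label, and the output layout are assumptions made for illustration, and the example path is hypothetical.

```python
from dataclasses import dataclass

# Lower number sorts first; "test-gap" is an assumed label for the last group.
SEVERITY_ORDER = {"P0": 0, "P1": 1, "P2": 2, "test-gap": 3}


@dataclass
class Finding:
    severity: str  # one of the SEVERITY_ORDER keys
    path: str      # file path of the finding
    line: int      # exact line number
    impact: str    # what goes wrong
    fix: str       # concrete fix direction


def format_findings(findings):
    """Return one line per finding, most severe first (stable within a severity)."""
    ordered = sorted(findings, key=lambda f: SEVERITY_ORDER[f.severity])
    return "\n".join(
        f"[{f.severity}] {f.path}:{f.line} {f.impact} Fix: {f.fix}" for f in ordered
    )


if __name__ == "__main__":
    print(format_findings([
        Finding("P2", "environments/my_env/my_env.py", 40, "Dataset reloaded per rollout.", "Load once."),
        Finding("P0", "environments/my_env/my_env.py", 12, "Reward ignores the answer.", "Compare parsed answer."),
    ]))
```

Because the sort is stable, findings of equal severity keep the order in which the reviewer recorded them.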
If No Findings
State explicitly that no defects were found, then list residual risk and untested areas.