create-environments - SKILL.md Agent Skill

name: create-environments
description: Create or migrate verifiers environments for the Prime Lab ecosystem. Use when asked to build a new environment from scratch, port an eval or benchmark from papers or other libraries, start from an environment on the Hub, or convert existing tasks into a package that exposes load_environment and installs cleanly with prime env install.

Create Environments

Goal

Build production-quality verifiers environments that work immediately in the Prime ecosystem: install, load, evaluate, and train without hidden setup.

Start With Ecosystem Paths

Prefer ecosystem-native setup before custom scaffolding.
Use this default loop:

prime env init my-env --v1
prime env install my-env
prime eval run my-env -m openai/gpt-4.1-mini -n 5

Use prime env init my-env --v1 --with-harness when the environment owns an explicit reusable harness.
Treat prime eval run as the canonical eval path. It saves results automatically, so do not add --skip-upload unless the user explicitly requests that deviation.
Prefer an existing environment as a starting point when possible:

prime env list --search "keyword"
prime env info owner/name
prime env install owner/name

For repository examples, use repo install when available:

prime env install math-python --from-repo

Encourage users to keep endpoint aliases in configs/endpoints.toml so smoke tests can switch models quickly.
Ask users whether they want instruct or reasoning models for validation.
Instruct-first smoke choices: gpt-4.1 series, qwen3 instruct series.
Reasoning validation choices: gpt-5 series, qwen3 thinking series, glm series.

Build Modes

1. Build From Scratch

Define the task contract first: prompt shape, allowed tools, stop conditions, rubric outputs, and metrics.
Select the smallest correct base class:

SingleTurnEnv for one-response tasks.
MultiTurnEnv for custom interaction loops.
ToolEnv or MCPEnv for stateless tools.
StatefulToolEnv for per-rollout resources.
CliAgentEnv for running agent binaries in sandboxes with API interception. Override get_sandbox_resources(state) for per-instance resources and build_env_vars(state) for custom environment variables.
V1 vf.Env with explicit vf.Taskset/vf.Harness objects for the current taskset/harness environment pattern that separates the task collection from the rollout runner. Use this for new taskset/harness work that needs config-driven metrics, rewards, toolsets, user functions, endpoint interception, or sandboxed Python/command programs. Framework programs should build clients from state.get_endpoint_config(api="chat").

For v1, start from the generated template. Edit TasksetConfig for task settings, Taskset.load_tasks() for task records, Taskset.load_toolsets() for task-owned tools, User subclasses for user behavior, and @vf.* methods for lifecycle, metrics, rewards, and advantages. Add a harness class only for reusable execution behavior.
Keep load_environment(config: vf.EnvConfig) as the canonical Taskset/Harness shim:

def load_environment(config: vf.EnvConfig) -> vf.Env:
    """Loader pattern for all Taskset/Harness environments."""
    return vf.Env(
        taskset=vf.load_taskset(config=config.taskset),
        harness=vf.load_harness(config=config.harness),
    )

For v0 environments, keep the existing vf.Environment patterns and preserve v0 compatibility.
Add pyproject.toml defaults in [tool.verifiers.eval] only when stable.
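
The base-class guidance in this section can be read as a small decision procedure. The sketch below is illustrative only: TaskTraits and suggest_base_class are hypothetical names that are not part of verifiers, and the function returns class names as strings rather than importing the library. The priority order is an assumption about how to read "smallest correct base class".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskTraits:
    """Hypothetical summary of a task contract; not a verifiers type."""

    multi_turn: bool = False
    uses_tools: bool = False
    per_rollout_resources: bool = False
    runs_agent_binary: bool = False
    taskset_harness: bool = False


def suggest_base_class(traits: TaskTraits) -> str:
    """Return the name of the smallest base class that covers the traits."""
    if traits.taskset_harness:
        return "vf.Env"  # v1 Taskset/Harness pattern
    if traits.runs_agent_binary:
        return "CliAgentEnv"  # sandboxed agent binaries with API interception
    if traits.per_rollout_resources:
        return "StatefulToolEnv"
    if traits.uses_tools:
        return "ToolEnv"  # or MCPEnv when the stateless tools are served over MCP
    if traits.multi_turn:
        return "MultiTurnEnv"
    return "SingleTurnEnv"
```

For example, suggest_base_class(TaskTraits(uses_tools=True)) suggests ToolEnv, while a task with no special traits falls through to SingleTurnEnv.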

V1 Authoring Rules

Keep v1 environment entrypoints tiny: import verifiers as vf, define TasksetConfig / optional HarnessConfig subclasses for user-facing knobs, define Taskset / optional Harness classes, then expose typed child loaders and the canonical load_environment(config: vf.EnvConfig) shim that delegates through vf.load_taskset and vf.load_harness.
Keep shared dependencies behind the taskset or harness that owns them. Use bindings as the canonical injection path; prefer serializable loader paths for bound objects in config, and use no-arg loader callables only for Python-only construction. Do not pass already-instantiated resource objects through environment loaders. Do not introduce v1 Parser/Rubric wrappers; parsing is ordinary Python.
Use vf.get_messages(state.get("completion") or [], role="assistant") when reading state completions. The helper returns typed message objects and should not receive None.
Use program.channels for v1 program protocol/channel selection. Do not use stale program.tools terminology.
Use generated child loaders as typed component entrypoints. Add implementation behavior to the taskset or harness class through config fields, load_* methods, User subclasses, Toolset, and @vf.* lifecycle methods.
Put settings as leaf fields on the taskset or harness config that owns them.

V1 Taskset/Harness Shape

Put task data, task-owned tools, user behavior, metrics, rewards, and task-specific configuration on the Taskset.
Use the base vf.Harness unless the harness owns a reusable execution adapter such as a CLI, framework program, sandboxed program, or nested harness flow.
Avoid one-off harness classes whose only purpose is to hold task behavior. That behavior belongs behind the taskset.
Keep small example environments direct. Do not add private helper layers, duplicate loader paths, or optional knobs unless they clarify a real reusable boundary.
Use the current config shape consistently:

[[eval]]
env_id = "owner/my-env"

[eval.taskset]
num_examples = 100

[eval.harness]
max_turns = 8

For package-only composition, omit env_id and select loader packages through child config ids:

[[eval]]

[eval.taskset]
id = "tasksets.harbor"
tasks_dir = "tasks"

[eval.harness]
id = "harnesses.opencode"
max_turns = 8

In code, use the current class-based config shape:

import verifiers as vf


class MyTasksetConfig(vf.TasksetConfig):
    system_prompt: vf.SystemPrompt = "Answer exactly."


class MyTaskset(vf.Taskset[MyTasksetConfig]):
    def load_tasks(self, split: vf.TaskSplit = "train") -> vf.Tasks:
        """Return serializable task records as a list, generator, or Dataset."""
        if split == "eval":
            return []
        return [
            {
                "prompt": [{"role": "user", "content": "Reverse abc."}],
                "answer": "cba",
                "max_turns": 1,
            }
        ]

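    @vf.reward(weight=1.0)
    async def correct_answer(self, task: vf.Task, state: vf.State) -> float:
        # Pass [] rather than None when the rollout produced no completion.
        messages = vf.get_messages(state.get("completion") or [], role="assistant")
        if not messages:
            return 0.0
        # Score only the final assistant message: exact match after stripping whitespace.
        response = str(messages[-1].content or "").strip()
        return float(response == task["answer"])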


def load_taskset(config: MyTasksetConfig) -> MyTaskset:
    return MyTaskset(config=config)


def load_environment(config: vf.EnvConfig) -> vf.Env:
    """Loader pattern for all Taskset/Harness environments."""
    return vf.Env(
        taskset=vf.load_taskset(config=config.taskset),
        harness=vf.load_harness(config=config.harness),
    )

Use prime env init my-env --v1 as the reference shape when an implementation starts to drift.

2. Port From Another Library, Project, or Paper

Create a strict source-to-target mapping before coding:

dataset rows and splits
prompt rendering and role ordering
tool I/O schema and stop logic
scoring math and aggregation
pass/fail thresholds and special cases

Preserve one-to-one logical equivalence for what the model sees and what gets scored.
Never invent unresolved formatting decisions. Ask the user to decide explicitly.
Benchmark runtime and remove avoidable bottlenecks before handoff.
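
One way to check the source-to-target mapping above is to replay a fixed sample through both implementations and compare what the model saw and how it was scored. This is a minimal sketch that uses only the standard library. RolloutRecord and find_port_mismatches are hypothetical helpers, and it assumes both sides were scored on the same fixed completions so that scores are comparable.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutRecord:
    """Hypothetical record of one example: rendered prompt and resulting score."""

    example_id: str
    prompt: tuple  # ordered (role, content) pairs exactly as sent to the model
    score: float


def find_port_mismatches(source, target, tol=1e-9):
    """Return (example_id, reason) pairs where the port diverges from the source."""
    target_by_id = {rec.example_id: rec for rec in target}
    source_ids = {rec.example_id for rec in source}
    mismatches = []
    for src in source:
        tgt = target_by_id.get(src.example_id)
        if tgt is None:
            mismatches.append((src.example_id, "missing in target"))
        elif src.prompt != tgt.prompt:
            # Covers both content changes and role-ordering changes.
            mismatches.append((src.example_id, "prompt differs"))
        elif not math.isclose(src.score, tgt.score, abs_tol=tol):
            mismatches.append((src.example_id, "score differs"))
    for tgt in target:
        if tgt.example_id not in source_ids:
            mismatches.append((tgt.example_id, "extra in target"))
    return mismatches
```

An empty result on a representative sample supports the equivalence claim. Any mismatch should be resolved or raised with the user rather than patched silently.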

3. Start From Hub Environment

Install or pull the closest baseline:

prime env install owner/name
prime env pull owner/name -t ./tmp-env

Keep proven interfaces stable unless a migration is deliberate and explicit.
Re-run smoke evals after each major change.

Non-Negotiable Quality Rules

Use deterministic, well-defined reward checks or LLM judges.
Avoid best-effort deterministic heuristics such as keyword style checks except as an explicit last resort with user sign-off.
Make environments self-contained after install. Do not require users to run background servers before load_environment().
Manage external resources inside the environment lifecycle.
Validate required secrets in load_environment() via vf.ensure_keys(...).
Surface feature limits directly. Do not ship hacky workarounds without explicit user approval.
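
As an illustration of a deterministic, well-defined reward check, the sketch below scores a response by exact match after an explicit normalization step. The helper names are hypothetical and independent of verifiers, and the normalization (collapsing whitespace and ignoring case) is an assumption that should be replaced by whatever the task contract specifies.

```python
from typing import Optional


def normalize(text: str) -> str:
    """Collapse runs of whitespace and ignore case (assumed normalization)."""
    return " ".join(text.split()).casefold()


def exact_match_reward(response: Optional[str], answer: str) -> float:
    """Return 1.0 when the normalized response equals the normalized answer."""
    if response is None:
        return 0.0  # a missing response never scores
    return float(normalize(response) == normalize(answer))
```

The same inputs always produce the same score, and the rule can be stated in one sentence, which is the property that separates this kind of check from a keyword heuristic.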

Verification Gate

Run these before claiming completion:

prime env install my-env
prime eval run my-env -m openai/gpt-4.1-mini -n 5
prime eval run my-env -m openai/gpt-4.1-mini -n 50 -r 1 -s

If multi-turn or tool-heavy, also run with higher rollouts:

prime eval run my-env -m openai/gpt-4.1-mini -n 30 -r 3 -s

For repo example environments, also use the package-install path when packaging or dependencies have changed:

uv run pytest tests/test_envs.py -k my_env -vv
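
The gate commands above can be collected into one script that stops at the first failure. This is a sketch under assumptions: build_gate_commands and run_gate are hypothetical helpers, the flags are copied from the commands in this section, and the script expects the prime CLI to be on PATH.

```python
import subprocess


def build_gate_commands(env_id, model="openai/gpt-4.1-mini", tool_heavy=False):
    """Return the verification-gate commands as argv lists, in run order."""
    commands = [
        ["prime", "env", "install", env_id],
        ["prime", "eval", "run", env_id, "-m", model, "-n", "5"],
        ["prime", "eval", "run", env_id, "-m", model, "-n", "50", "-r", "1", "-s"],
    ]
    if tool_heavy:
        # Multi-turn or tool-heavy environments get an extra run with more rollouts.
        commands.append(
            ["prime", "eval", "run", env_id, "-m", model, "-n", "30", "-r", "3", "-s"]
        )
    return commands


def run_gate(commands):
    """Run each command in order; check=True raises on the first non-zero exit."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Calling run_gate(build_gate_commands("my-env", tool_heavy=True)) would run all four steps in order. Building the commands separately from running them makes the exact invocations easy to include in the final report.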

Publish Gate Before Large Evals Or Training

After smoke tests pass and behavior is stable, recommend pushing to Hub before large evals or RL training.
Ask the user explicitly whether visibility should be PUBLIC or PRIVATE.
Use:

prime env push my-env --visibility PUBLIC

prime env push my-env --visibility PRIVATE

For hosted or large-scale workflows, prefer running with the Hub slug after push:

prime eval run owner/my-env -m openai/gpt-4.1-mini -n 200 -r 3 -s

Synthetic Data

Ask users for preferences on which LLMs to use for synthetic data generation and curation before implementation.
Prefer generating synthetic data from raw source documents whenever possible instead of relying only on hand-authored prompts.
Use LLM orchestration (planner/generator/validator loops) to improve sample quality and diversity.
Use back-translation: start from complete materials and decompose them into incomplete tasks, criteria, or partial artifacts that the model must reconstruct.
Use fan-out subtopic sampling from LLMs to expand coverage and avoid overfitting to a narrow slice of the domain.
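
The back-translation idea above can be sketched without any LLM: start from a complete artifact and hide part of it, so that the hidden part becomes the answer. The function below is a hypothetical, deterministic illustration. The prompt wording, the "[MISSING STEP]" marker, and the record fields are assumptions, and a real pipeline would likely add LLM-based validation on top.

```python
import random


def make_reconstruction_tasks(steps, num_tasks, seed=0):
    """Turn one complete list of steps into tasks that each hide a single step."""
    rng = random.Random(seed)  # seeded so the generated dataset is reproducible
    indices = list(range(len(steps)))
    rng.shuffle(indices)
    tasks = []
    for hidden in indices[:num_tasks]:
        visible = [
            "[MISSING STEP]" if i == hidden else step
            for i, step in enumerate(steps)
        ]
        tasks.append(
            {
                "prompt": "Fill in the missing step:\n" + "\n".join(visible),
                "answer": steps[hidden],
            }
        )
    return tasks
```

Because every answer is taken verbatim from the source material, these tasks can be scored with a deterministic check.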

Deliverable Format

Report:

Environment ID and path.
Exact install and eval commands used.
Port-equivalence notes if migrated.
Any unresolved user decisions that block strict fidelity.