name: coral-new-task
description: End-to-end recipe for adding a new task under examples/ — the three pieces that have to line up (task.yaml, seed/, and grader/), what to put in each, the TaskGrader API surface, the coral validate → smoke-test loop, and the common mistakes (repo_path pointing at the wrong dir, score direction backwards, hidden answer keys leaking into seed/, grader writing to codebase_path which the daemon force-removes, private-vs-public confusion, missing run() signature). Use whenever the user wants to add a new CORAL task or port an existing benchmark into CORAL.
Creating a new CORAL task
A CORAL task is three things that must line up:
examples/<task>/
├── task.yaml # config: name, description, grader entrypoint, agent count
├── seed/ # starter code agents see when they begin (the repo_path)
│ └── solution.py
└── grader/ # standalone Python package
├── pyproject.toml
└── src/<task>_grader/
├── __init__.py
└── grader.py # class Grader(TaskGrader): ...
The packaged form is the only supported form. The package gives the grader its own venv and ships everything the eval needs — grader code, helper modules, and hidden data (see "Hidden data" below).
Reference implementations
Look at these before writing anything new — copy the closest one and edit:
| Reference | When to copy it |
|---|---|
| examples/erdos/ | Minimal packaged grader, single grader file, numpy-only deps |
| examples/dna_design/ | Packaged grader with bundled data files (importlib.resources) and [ml] optional-deps for heavy libs |
| examples/swebench-verified/ | Tiered eval (different instance counts per tier), private answer keys, harbor integration |
| examples/circle_packing/ | Smallest packaged task end-to-end — single solution file, single grader file |
| examples/mnist/ | Packaged grader with a hidden answer key shipped inside the package (taskdata/answers/) |
1. The seed
Whatever lives in seed/ is what the agent sees on first checkout — it's the working directory the grader will later score. The contract between seed/ and the grader is the program file: a Python file with a function the grader imports and calls.
The convention across examples is:
solution.py(orinitial_program.py) defining a top-levelrun()function.- The grader passes
program_file: "solution.py"viagrader.args. run()'s signature is whatever the grader expects — usually() -> resultor(input_path) -> result.
Put a real, runnable baseline here. Agents should be able to coral eval immediately and get a non-zero score, so they have a starting point to improve. A no-op skeleton that crashes is not a good baseline.
If the task needs data files at runtime (training data, fixtures), put them under seed/data/ and reference them by relative path from solution.py. The grader will see them at <codebase_path>/data/....
2. The grader
Packaged grader — the recommended path
grader/
├── pyproject.toml
└── src/<task>_grader/
├── __init__.py
└── grader.py
pyproject.toml is a thin Hatchling package. Crib from examples/erdos/grader/pyproject.toml:
[project]
name = "<task>-grader"
version = "0.1.0"
description = "CORAL grader for the <task> task."
requires-python = ">=3.11"
dependencies = ["coral", "numpy"] # Whatever the grader actually imports.
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[tool.hatch.build.targets.wheel]
packages = ["src/<task>_grader"]
Subclass TaskGrader and implement evaluate():
# grader/src/<task>_grader/grader.py
from coral.grader import TaskGrader
from coral.types import ScoreBundle
class Grader(TaskGrader):
def evaluate(self) -> float | ScoreBundle:
program_file = self.args.get("program_file", "solution.py")
# self.codebase_path — the agent's commit checked out detached
# self.private_dir — .coral/private/ (your hidden answer keys live here)
# self.args — dict from task.yaml grader.args
# self.timeout — grader.timeout in seconds (or None)
# self.eval_logs_dir — write subprocess logs / artifacts the agent should see post-grade
try:
result = run_program_and_score(...)
except TimeoutError:
return self.fail(f"Evaluation timed out after {self.timeout}s")
except Exception as e:
return self.fail(f"Evaluation failed: {e}")
return self.score(result, explanation=f"score={result:.4f}")
What you have available on self:
| Attribute / method | Use it for |
|---|---|
self.codebase_path |
Path to the commit being graded (detached worktree). Read-only — anything written here is discarded after the eval. |
self.private_dir |
.coral/private/. Your answer keys, hidden test data, anything from grader.private lives here. |
self.args |
dict from task.yaml::grader.args. Use self.args.get("program_file", "solution.py") etc. |
self.timeout |
Eval timeout in seconds (or None if grader.timeout: 0). |
self.eval_logs_dir |
Per-attempt directory for logs/artifacts that should outlive the grader. Symlinked into each agent worktree as <shared_dir>/eval_logs/<hash>/. |
self.score(value, explanation=...) |
Build a single-task ScoreBundle from a numeric score. |
self.fail(reason) |
Return a fail ScoreBundle with reason as feedback. |
self.get_python_command() |
List for the python binary inside the codebase's env (uses uv run if a pyproject.toml is present). Always use this instead of sys.executable so task-specific deps are visible. |
self.run_program(filename, *args) |
Convenience: runs <codebase_path>/<filename> as a subprocess via get_python_command(). |
Bundling data files with the grader
If the grader needs reference files (model weights, ground-truth answers, scoring fixtures), ship them inside the package and load via importlib.resources:
import importlib.resources
scorer_dir = str(importlib.resources.files("<task>_grader.scorers"))
examples/dna_design/grader/src/dna_design_grader/grader.py is the canonical pattern — note the scorers/ subpackage. Add the directory to [tool.hatch.build.targets.wheel] if it has non-Python files.
Heavy / optional dependencies
If the grader wants torch, grelu, etc., put them in optional-dependencies and have the grader fall back gracefully when missing — see examples/dna_design/grader/pyproject.toml. Then grader.setup becomes ["uv pip install -e ./grader[ml]"] for the full version.
Hidden data: ship it inside the package
Answer keys, test fixtures, and helper modules live inside the grader package — the convention is a taskdata/ subdir next to grader.py, resolved with _TASKDATA = Path(__file__).parent / "taskdata" (works for editable and wheel installs alike; hatchling includes non-Python files under packages automatically). See examples/mnist/grader/src/mnist_grader/taskdata/ for a data file and examples/ADRS/txn_scheduling/ for helper modules put on sys.path.
For files that genuinely can't live in the package (huge datasets, shared across tasks), list them under grader.private in task.yaml — they get copied to .coral/private/<name> and the grader reads them via self.private_dir.
3. The task.yaml
Fields that must be set; everything else has a sensible default.
task:
name: "My Task" # shows up in results/<slug>/
description: | # rendered into CORAL.md, agents read this
What the agent should do.
Reference the program file by name (e.g. solution.py and its run() signature).
tips: | # optional, also rendered into CORAL.md
- Eval timeout is N seconds.
- Constraints / scoring details / known baselines.
grader:
entrypoint: "<task>_grader.grader:Grader" # required
setup:
- "uv pip install -e ./grader" # runs once in .coral/private/grader_venv/
timeout: 600 # seconds; 0 disables, default 300
direction: maximize # or minimize — controls leaderboard ordering
args: # arbitrary dict, read inside grader as self.args
program_file: "solution.py"
private: [] # extra files copied into .coral/private/ (hidden from agents)
parallel:
max_workers: 1 # bump only when the grader is concurrency-safe
max_pending_per_agent: 1 # cap on in-flight submissions per agent
agents:
count: 1 # raise once the task is known to be stable
runtime: claude_code # claude_code | codex | cursor_agent | kiro | opencode
model: sonnet # default depends on runtime; see coral/agent/registry.py
workspace:
results_dir: "./results" # where each run lands
repo_path: "./examples/<task>/seed" # MUST point at the seed/ dir
setup: # runs in each agent worktree before agents start
- "uv pip install numpy" # task-runtime deps go here, NOT in grader/setup
run:
verbose: false
ui: false
session: tmux # local | tmux | docker
The examples/README.md documents the full schema with every default. When in doubt, look there before adding fields.
4. Validate before running agents
coral validate examples/<task>
This:
- Parses
task.yamland reports schema errors. - Bootstraps
.coral/private/grader_venv/and runsgrader.setup. - Copies
seed/into a tempdir and runs the grader against it once. - Prints the resulting score and explanation.
If coral validate succeeds, the grader can score the seed. That's the single most important checkpoint — most "agent stuck" issues trace back to a grader that crashes on the seed.
After validation, smoke-test with one agent:
coral start -c examples/<task>/task.yaml agents.count=1 run.session=local
# Wait for one eval, then:
coral stop
5. Wire it into the index
Add a one-line entry to the table in examples/README.md and a short ### <task> section under "Details" with the bullet points the others use (Agents / Timeout / Session). Skip this only for throwaway local tasks.
Common mistakes
| Mistake | Symptom | Fix |
|---|---|---|
repo_path points at examples/<task>/ instead of examples/<task>/seed/ |
Grader sees task.yaml and grader/ in codebase_path |
Always point repo_path at the seed dir. |
direction: maximize for a loss / minimize for a benchmark ratio |
Leaderboard ordered backwards | Score = "ratio against benchmark, >1 is better" → maximize. Score = "raw error" → minimize. |
Hidden answer key under seed/ |
Agents can read it and game the score | Bundle it into the grader package (taskdata/), or use grader.private for files that can't live in the package. |
Grader writes results under self.codebase_path and reads them later |
Files vanish — daemon force-removes the worktree after each eval | Write under self.eval_logs_dir. |
Grader uses sys.executable to run the agent's program |
Misses task-specific deps installed via workspace.setup |
Use self.get_python_command() (it switches to uv run --project when the codebase has a pyproject.toml). |
Heavy deps in main grader dependencies |
coral validate is slow / fails on machines without the GPU stack |
Move to optional-dependencies and fall back gracefully. See dna_design's [ml] extra. |
grader.setup tries to install task-runtime deps |
The grader venv has them but the agent's worktree doesn't | Task-runtime deps go in workspace.setup. Grader-only deps go in grader.setup. |
parallel.max_workers > 1 with a non-concurrency-safe grader |
Sporadic failures when two evals collide on Docker ports / GPU / scratch dirs | Leave at 1 unless the grader is provably safe. |
Forgetting coral validate |
Agents start, fail every eval with the same error | Always validate first. |