rewardkit

Write Harbor task verifiers using Reward Kit. Use when creating or editing a task's tests/ directory, adding grading criteria, setting up LLM/agent judges, or designing verifiers that produce a reward score.

By harbor-framework. Updated 5/30/2026.

Help the user write task verifiers with Reward Kit. Reward Kit is a lightweight Python package that turns a directory of criteria files into a reward score. Each criterion is a Python function call or a TOML judge file; folders become separate rewards.

Setup in a Harbor task

Put criteria alongside test.sh in the task's tests/ directory:

tests/
├── test.sh
├── checks.py         # programmatic criteria
└── judge.toml        # optional LLM/agent judge

tests/test.sh:

#!/bin/bash
uvx --from 'harbor-rewardkit==0.1.*' rewardkit /tests

This runs all criteria in /tests/ against the workspace at /app and writes /logs/verifier/reward.json. Defaults match Harbor's conventions — no extra config needed.

If judge criteria need API keys, pass them through task.toml:

[verifier.env]
ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"

Ask the user whether Reward Kit should run in the agent's shared environment or in a separate verifier environment. Prefer a separate verifier environment when judge prompts, grading dependencies, API keys, or clean-room checks should not be available to the agent:

[environment]
network_mode = "no-network"   # Agent env baseline — offline during agent.run()

[verifier]
environment_mode = "separate"

[verifier.environment]
network_mode = "public"     # Verifier env baseline — LLM judge API calls
docker_image = "python:3.12-slim"

In shared mode, the verifier runs in the agent container and inherits [environment].network_mode. Set [verifier].network_mode only when verify() needs different network access than the agent phase; it is a phase override, not a baseline. If the agent and the verifier need different baselines without runtime switching, use environment_mode = "separate" and set [verifier.environment].network_mode.

Judge criteria that call external APIs need a public baseline or allowlist on the verifier environment. Programmatic checks that only read local files can use no-network.

In separate mode, tests/ is the verifier image build context and must provide /tests/test.sh at runtime; Harbor does not upload tests/ into the running verifier container.

Programmatic criteria

Call built-ins from any .py file in tests/:

import rewardkit as rk

rk.file_exists("output.txt")
rk.file_contains("output.txt", "hello")
rk.command_succeeds("python main.py", weight=2.0)
rk.json_key_equals("result.json", "status", "ok")

All criteria accept weight (default 1.0) and isolated (default False). With isolated=True the criterion runs in an overlayfs, so its side effects don't leak into the workspace.

Available built-ins

  • Files: file_exists, file_not_exists, file_contains, file_contains_regex, file_matches, files_equal, diff_ratio
  • Commands: command_succeeds, command_output_contains, command_output_matches, command_output_matches_regex (30s default timeout, optional cwd)
  • Data: json_key_equals, json_path_equals, csv_cell_equals, xlsx_cell_equals (needs [office] extra), sqlite_query_equals
  • HTTP: http_status_equals, http_response_contains
  • Images: image_similarity, image_size_equals (needs [image] extra)
  • Trajectory: trajectory_tool_used, trajectory_tool_not_used, trajectory_turn_count

For extras, install with uv tool install 'harbor-rewardkit[all]' (quoted so the shell does not treat the brackets as a glob pattern).

Custom criteria

Use the @criterion decorator. First parameter is always workspace: Path. Returns bool or float:

from pathlib import Path
from rewardkit import criterion

@criterion
def has_valid_output(workspace: Path) -> bool:
    return (workspace / "output.txt").read_text().strip() != ""
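The example above returns a bool. A float return can express partial credit, and a minimal sketch of that follows. The results.txt file and its PASS/FAIL line format are hypothetical, and the sketch assumes the float is read as a score in [0, 1]. The function is left undecorated so it runs without the package installed; in a real tests/ file it would carry @criterion.

```python
from pathlib import Path
import tempfile


# In a real tests/ file this would carry the @criterion decorator from rewardkit;
# it is left off here so the sketch runs without the package installed.
def pass_fraction(workspace: Path) -> float:
    """Fraction of non-empty lines in results.txt that start with PASS (hypothetical format)."""
    path = workspace / "results.txt"
    if not path.exists():
        return 0.0  # a missing file earns no credit instead of raising
    lines = [ln for ln in path.read_text().splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return sum(ln.startswith("PASS") for ln in lines) / len(lines)


# Demo against a temporary workspace.
with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp)
    (ws / "results.txt").write_text("PASS a\nFAIL b\nPASS c\nPASS d\n")
    score = pass_fraction(ws)
    print(score)
```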

Criteria with no parameters beyond workspace auto-register. Criteria with extra arguments must be called via rk:

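import rewardkit as rk  # Path and criterion are imported as in the previous example

@criterion(description="output has at least {n} lines")
def has_n_lines(workspace: Path, n: int) -> bool:
    return len((workspace / "output.txt").read_text().splitlines()) >= n

# Each call registers one criterion; Reward Kit supplies workspace.
rk.has_n_lines(10, weight=2.0)
rk.has_n_lines(50, weight=1.0)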

For criteria shared across reward subdirectories, define them with shared=True in a root-level file and call them from the subdirectories.

Judge criteria (LLM or agent-as-a-judge)

For subjective checks (quality, readability, edge cases), create a TOML file:

[judge]
judge = "anthropic/claude-sonnet-4-6"   # LiteLLM model string
files = ["/app/main.py"]

[[criterion]]
description = "Is the code correct?"
type = "binary"

[[criterion]]
description = "How readable is the code?"
type = "likert"
points = 5
weight = 2.0

Criterion types:

  • binary — yes/no → 1.0 or 0.0
  • likert — 1..points, normalized to [0, 1]
  • numeric — min..max, normalized to [0, 1]
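The normalization is not spelled out here. One plausible reading is a linear rescaling, sketched below. The function names are hypothetical, and the assumptions that the lowest Likert rating maps to 0.0 and that out-of-range numeric values are clamped are illustrative; Reward Kit's actual formula may differ.

```python
def normalize_likert(score: int, points: int) -> float:
    """Map a 1..points rating linearly onto [0, 1] (assumed formula)."""
    if points < 2 or not 1 <= score <= points:
        raise ValueError("score must lie in 1..points and points must be >= 2")
    return (score - 1) / (points - 1)


def normalize_numeric(value: float, lo: float, hi: float) -> float:
    """Map a min..max score linearly onto [0, 1], clamping out-of-range values (assumed)."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    clamped = min(max(value, lo), hi)
    return (clamped - lo) / (hi - lo)


print(normalize_likert(4, 5))        # 0.75
print(normalize_numeric(7, 0, 10))   # 0.7
```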

Agent judges

Agent judges shell out to a CLI and can explore the filesystem:

[judge]
judge = "claude-code"
model = "anthropic/claude-sonnet-4-6"
isolated = true

[[criterion]]
description = "Does the solution handle edge cases?"
type = "binary"

Slower and more expensive than LLM judges, but they can run commands and inspect files.

Useful [judge] options

  • timeout (default 300)
  • reasoning_effort (low|medium|high)
  • reference (path to a reference solution)
  • atif-trajectory (evaluate the agent's trajectory)
  • weight
  • prompt_template (custom prompt with a {criteria} placeholder)

Scoring aggregation

[scoring]
aggregation = "all_pass"   # weighted_mean | all_pass | any_pass | threshold
threshold = 0.7             # only for threshold

The [scoring] table only affects aggregation within the TOML file that contains it.
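The four aggregation modes can be sketched as follows. The semantics are assumptions made for illustration, not Reward Kit's confirmed behavior: a criterion is taken to "pass" when its normalized score is 1.0, and threshold compares the weighted mean against the configured value. The aggregate function is hypothetical.

```python
def aggregate(results: list[tuple[float, float]], mode: str = "weighted_mean",
              threshold: float = 0.5) -> float:
    """Combine (score, weight) pairs into one reward under assumed semantics."""
    total_weight = sum(weight for _, weight in results)
    if total_weight <= 0:
        raise ValueError("at least one criterion with positive weight is required")
    mean = sum(score * weight for score, weight in results) / total_weight
    if mode == "weighted_mean":
        return mean
    if mode == "all_pass":
        # Assumption: "pass" means a full score of 1.0.
        return 1.0 if all(score >= 1.0 for score, _ in results) else 0.0
    if mode == "any_pass":
        return 1.0 if any(score >= 1.0 for score, _ in results) else 0.0
    if mode == "threshold":
        # Assumption: the weighted mean is compared against the threshold.
        return 1.0 if mean >= threshold else 0.0
    raise ValueError(f"unknown aggregation mode: {mode}")


# One binary criterion that passed (weight 1.0) and a likert score of 0.75 (weight 2.0).
results = [(1.0, 1.0), (0.75, 2.0)]
for mode in ("weighted_mean", "all_pass", "any_pass", "threshold"):
    print(mode, aggregate(results, mode, threshold=0.7))
```

Under these assumptions the same two results yield a weighted mean of about 0.83, fail all_pass, and pass both any_pass and a 0.7 threshold. The mode you choose therefore matters as much as the criteria themselves.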

Multi-reward tasks

Put criteria in subdirectories — each becomes a separate reward:

tests/
├── test.sh
├── correctness/
│   └── check.py
├── structure/
│   └── files_exist.py
└── quality/
    └── quality.toml

Produces:

{ "correctness": 0.75, "structure": 1.0, "quality": 0.6 }

Output files

  • /logs/verifier/reward.json — per-reward scores
  • /logs/verifier/reward-details.json — per-criterion results, judge reasoning, errors
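When debugging a verifier it helps to load reward.json and print the scores. The sketch below assumes the file is a flat mapping of reward name to number, as in the multi-reward example above; the load_rewards helper is hypothetical, and the schema of reward-details.json is not covered because it is not described here.

```python
import json
from pathlib import Path
import tempfile


def load_rewards(path: Path) -> dict[str, float]:
    """Read a reward.json shaped like a flat {name: score} mapping (assumed shape)."""
    data = json.loads(path.read_text())
    return {name: float(score) for name, score in data.items()}


# Demo on a temporary file so the sketch runs outside a Harbor container;
# inside one, the path would be /logs/verifier/reward.json.
with tempfile.TemporaryDirectory() as tmp:
    reward_file = Path(tmp) / "reward.json"
    reward_file.write_text('{"correctness": 0.75, "structure": 1.0, "quality": 0.6}')
    rewards = load_rewards(reward_file)

for name, score in sorted(rewards.items()):
    print(f"{name}: {score:.2f}")
```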

Multi-step tasks

In a multi-step task, each step has its own tests/ under steps/{name}/tests/, and the verifier runs once per step. Reward Kit behaves the same as in a single-step task: for each step it reads /tests, runs the criteria against /app, and writes /logs/verifier/reward.json for that step. Harbor then aggregates per-step results into a trial-level reward via multi_step_reward_strategy in task.toml — aggregation happens outside Reward Kit, so don't try to encode cross-step logic in your criteria.

A task-level tests/ directory (at the task root) is uploaded to /tests first, then the step's own tests/ is layered on top (same-name files win). Put shared helpers (common checks.py functions with shared=True, fixture files, a fallback test.sh) at the task level, and step-specific criteria under each step.
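The layering rule behaves like a dictionary update keyed by relative path. The sketch below models files as a mapping from path to a label saying where the file came from; layer_tests is a hypothetical helper for reasoning about the result, not part of Harbor.

```python
def layer_tests(task_level: dict[str, str], step_level: dict[str, str]) -> dict[str, str]:
    """Overlay a step's tests/ on the task-level tests/; same-name files from the step win."""
    merged = dict(task_level)   # task-level files are uploaded first
    merged.update(step_level)   # the step's files are layered on top
    return merged


task_files = {"test.sh": "task", "checks.py": "task"}
step_files = {"checks.py": "step", "judge.toml": "step"}

merged = layer_tests(task_files, step_files)
print(merged)
```

Here the fallback test.sh survives from the task level, while the step's checks.py replaces the shared one.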

Multi-reward subdirectories still work within a step: steps/foo/tests/ can contain correctness/, structure/, quality/ — each produces a separate reward key for that step, and multi_step_reward_strategy = "mean" averages each key across steps. Use "final" when the last step is an end-to-end check whose rewards already represent the full task.
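The two strategies named above can be sketched as follows. This illustrates the described behavior rather than Harbor's implementation. The aggregate_steps helper is hypothetical, and skipping steps that lack a key (instead of counting them as zero) is an assumption.

```python
def aggregate_steps(step_rewards: list[dict[str, float]], strategy: str) -> dict[str, float]:
    """Combine per-step reward dicts into trial-level rewards (assumed semantics)."""
    if not step_rewards:
        raise ValueError("at least one step is required")
    if strategy == "final":
        # Only the last step's rewards count.
        return dict(step_rewards[-1])
    if strategy == "mean":
        keys = {key for rewards in step_rewards for key in rewards}
        combined = {}
        for key in sorted(keys):
            # Assumption: steps that lack a key are skipped, not counted as zero.
            values = [rewards[key] for rewards in step_rewards if key in rewards]
            combined[key] = sum(values) / len(values)
        return combined
    raise ValueError(f"unknown strategy: {strategy}")


steps = [
    {"correctness": 0.5, "quality": 1.0},   # first step
    {"correctness": 1.0, "quality": 0.5},   # last step
]
print(aggregate_steps(steps, "mean"))
print(aggregate_steps(steps, "final"))
```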

When to reach for what

  • Use built-ins for file existence, string matches, command output, JSON/CSV checks, HTTP probes.
  • Use @criterion when logic is task-specific but still programmatic.
  • Use LLM judges for subjective quality dimensions (readability, correctness of prose).
  • Use agent judges when the rubric requires exploring the filesystem or running code (e.g. "does the test suite actually pass?").
  • Use subdirectories when you want separate scores (correctness vs structure vs quality) rather than one blended number.
  • Use isolated=True for any criterion that runs mutating commands, so it doesn't corrupt the workspace for other criteria.

Working example

See examples/tasks/reward-kit-example/ in the Harbor repo.
