name: golem-skill-harness
description: "Developing, testing, and running Golem skill tests with the skill test harness. Use when creating new skills, writing scenario YAML files, running skill tests locally, or debugging skill test failures."
Golem Skill Test Harness
The skill test harness lives in golem-skills/tests/harness/. It drives coding agents (Claude Code, OpenCode, Codex) through scenario YAML files, verifying that skills are activated and produce correct results. Skill definitions live in golem-skills/skills/.
Skill Directory Structure
Skills in golem-skills/skills/ are organized by language scope:
golem-skills/skills/
  common/                    # Language-independent skills (included for all languages)
    golem-new-project/
      SKILL.md
  rust/                      # Rust-specific skills (included only for Rust projects)
    golem-add-rust-crate/
      SKILL.md
  ts/                        # TypeScript-specific skills (included only for TS projects)
    golem-add-npm-package/
      SKILL.md
  scala/                     # Scala-specific skills (included only for Scala projects)
  moonbit/                   # MoonBit-specific skills (included only for MoonBit projects)
When golem new creates a project, it embeds the common/ skills plus the language-specific skills into the project's .agents/skills/ and .claude/skills/ directories.
Rebuilding After Skill Changes
Skills are embedded in the golem / golem-cli binaries. If you add or modify a skill under golem-skills/skills/, you must recompile the binaries before the changes take effect — including before running the skill test harness.
cargo make build-release-full
Without this step, golem new will still emit the old skill content, and the harness will test against stale skills.
Also regenerate the public How-To Guides
Each SKILL.md is also republished as a How-To Guide on learn.golem.cloud under docs/src/content/how-to-guides/. After adding or editing a skill, regenerate those MDX pages:
cargo make generate-docs-skills
CI's check-docs-skills task will fail any PR that changes golem-skills/skills/ without also updating the generated MDX.
Prerequisites
- Node.js 20+ and npm
- Golem binary pre-built: the harness requires a `golem` binary in `$GOLEM_PATH/target/release/` or `$GOLEM_PATH/target/debug/`. Build with `cargo build -p golem` (debug) or `cargo build -p golem --release` (release). The harness prefers the release build and falls back to debug.
- No pre-running Golem server: the harness starts its own server automatically using `golem server run --data-dir <workspaces/golem-server-data> --clean` and stops it when done. If a server is already running on port 9881, the harness fails with an error to avoid conflicts.
- Agent CLI installed: one of `claude` (Claude Code), `opencode`, or `codex`
- Filesystem watcher: `fswatch` on macOS, `inotify-tools` on Linux
- `GOLEM_PATH` env var set to the golem repo root. If not set, the harness auto-detects it by walking up from `cwd` looking for `sdks/rust/golem-rust` and `sdks/ts/packages` directories (same markers as `golem-cli`). If auto-detection also fails, the harness exits with an error. The resolved target directory (`target/release` or `target/debug`) is prepended to `PATH` so all spawned processes, including agent drivers, use the correct `golem` and `golem-cli` binaries.
- For Rust skills: `cargo-component` and `wasm32-wasip2` target
- For TS skills: `pnpm`, `wasm-rquickjs-cli`, TS SDK built (`cargo make build-sdk-ts`)
- For MoonBit skills: `moon` (MoonBit toolchain), `wasm-tools`
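The `GOLEM_PATH` auto-detection described above can be pictured as a walk up the directory tree. The following is a minimal TypeScript sketch, not the harness's actual code: the name `findGolemRoot` and the injected `exists` predicate are hypothetical, and requiring both marker directories at the same level is an assumption. Only the two marker paths come from the text above.

```typescript
import * as path from "node:path";

// Marker directories named in the prerequisites list.
const MARKERS = ["sdks/rust/golem-rust", "sdks/ts/packages"];

// Hypothetical helper: walks up from `start` until a directory contains
// every marker. `exists` is injected so the logic can be exercised without
// touching the real filesystem (pass fs.existsSync in real use).
function findGolemRoot(
  start: string,
  exists: (p: string) => boolean,
): string | undefined {
  let dir = path.resolve(start);
  for (;;) {
    if (MARKERS.every((marker) => exists(path.join(dir, marker)))) return dir;
    const parent = path.dirname(dir);
    if (parent === dir) return undefined; // reached the filesystem root
    dir = parent;
  }
}
```

A caller would treat an `undefined` result as the "exits with an error" case.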
Install and Build
cd golem-skills/tests/harness
npm install
npm run build
The build script runs ESLint then tsc, so lint errors will fail the build.
Linting and Formatting
The harness uses ESLint 9 with typescript-eslint for linting and Prettier for formatting. Configuration files:
- `eslint.config.js`: ESLint flat config with `typescript-eslint` recommended rules
- `.prettierrc`: Prettier config (2-space indent, double quotes, trailing commas, 100 char width)
cd golem-skills/tests/harness
npm run lint # Check for lint errors
npm run lint:fix # Auto-fix lint errors
npm run format:check # Check formatting without changing files
npm run format # Auto-format all source files
Always run npm run lint:fix and npm run format before committing harness changes. CI enforces both lint (via npm run build) and formatting (via npm run format:check).
Running Unit Tests (harness self-tests)
cd golem-skills/tests/harness
npm test
Running Skill Scenarios
From golem-skills/tests/harness/:
npx tsx src/run.ts [options]
CLI Options
| Option | Description | Default |
|---|---|---|
| `--agent <name>` | Agent driver: `claude-code`, `opencode`, `codex`, or `all` | `all` |
| `--language <lang>` | Language: `ts`, `rust`, or `all` | `all` |
| `--scenario <name>` | Run only the named scenario | all scenarios |
| `--scenarios <dir>` | Path to scenario YAML directory | `./scenarios` |
| `--output <dir>` | Results output directory | `./results` |
| `--timeout <seconds>` | Global timeout per step | `300` |
| `--dry-run` | Validate scenarios without executing | `false` |
| `--resume-from <id>` | Resume from a specific step ID | (none) |
| `--workspace <path>` | Override workspace directory | (none) |
| `--merge-reports <dir>` | Merge `summary.json` files into aggregated report | (none) |
Examples
# Run a single scenario with Claude Code for Rust
npx tsx src/run.ts --agent claude-code --language rust --scenario golem-new-project-rust
# Dry-run to validate YAML
npx tsx src/run.ts --dry-run --scenario golem-db-app-ts
# Resume a failed scenario from a specific step, reusing a previous workspace
npx tsx src/run.ts --agent claude-code --language ts --scenario golem-db-app-ts \
--resume-from build-and-deploy --workspace ./workspaces/<run-id>/golem-db-app-ts/ts
# Merge reports from multiple CI runs
npx tsx src/run.ts --merge-reports ./ci-results --output ./merged
Workspace Directory
Each harness run generates a unique run ID (UUID). Without --workspace, each scenario gets its own directory at <cwd>/workspaces/<run-id>/<scenario-name>/<language>/. With --workspace, the same structure is created under the specified root: <workspace>/<run-id>/<scenario-name>/<language>/. Workspace directories are never deleted, so you can inspect the results after the run.
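As an illustration of the layout above, here is a small TypeScript sketch that builds the per-scenario directory path. The function name `scenarioWorkspace` is hypothetical and the harness may compute the path differently; only the `<root>/<run-id>/<scenario-name>/<language>/` shape is taken from the description.

```typescript
import * as path from "node:path";
import { randomUUID } from "node:crypto";

// Hypothetical helper mirroring the documented layout:
// <root>/<run-id>/<scenario-name>/<language>/
function scenarioWorkspace(
  root: string,
  runId: string,
  scenario: string,
  language: string,
): string {
  return path.join(root, runId, scenario, language);
}

// Default root when --workspace is not given: <cwd>/workspaces
const defaultRoot = path.join(process.cwd(), "workspaces");
const runId = randomUUID(); // one ID per harness run
console.log(scenarioWorkspace(defaultRoot, runId, "golem-db-app-ts", "ts"));
```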
Golem Server Lifecycle
The harness manages the Golem server automatically:
- Startup: Before running scenarios, the harness checks port 9881. If a server is already running, it fails with an error. Otherwise it starts `golem server run --data-dir <workspaces/<run-id>/golem-server-data> --clean` and waits up to 60 seconds for the healthcheck to pass.
- Between scenarios: The server is restarted (stopped and started again with `--clean`) to ensure a fresh state for each scenario.
- Per-scenario check: Before each scenario, the harness verifies that a `local` Golem profile exists and the server is still reachable.
- Teardown: After all scenarios complete (or on Ctrl+C), the harness stops the server process.
Adding a New Skill
1. Create the skill definition
Create the skill under the appropriate subdirectory of golem-skills/skills/:
- `common/<skill-name>/SKILL.md`: for language-independent skills
- `rust/<skill-name>/SKILL.md`: for Rust-specific skills
- `ts/<skill-name>/SKILL.md`: for TypeScript-specific skills
- `scala/<skill-name>/SKILL.md`: for Scala-specific skills
- `moonbit/<skill-name>/SKILL.md`: for MoonBit-specific skills
Use YAML frontmatter:
---
name: my-new-skill
description: "What the skill does. Use when <trigger conditions>."
---
# Skill Title
Instructions for the agent...
2. (Optional) Link the skill from golem-cli --help
If the skill is relevant to one or more golem-cli subcommands, add a SkillBinding entry so that — when an automated coding agent invokes golem-cli ... --help inside a Golem application that has the skill installed — a Relevant skills: block linking to the skill's SKILL.md is appended to that command's long help.
Edit cli/golem-cli/src/agent_help_hints/builtin_skill_map.rs and add a row to SKILL_BINDINGS:
// Common (language-independent) skill:
SkillBinding {
    cli_path: &["agent", "delete"],
    basename: "golem-delete-agent",
    kind: SkillKind::Common,
    summary: "Delete an agent instance.",
},
// Per-language skill (one variant per listed language; folder is
// `<basename>-<lang>` where lang is rust|ts|scala|moonbit):
SkillBinding {
    cli_path: &["secret", "create"],
    basename: "golem-add-secret",
    kind: SkillKind::PerLanguage(ALL_LANGS),
    summary: "Add a typed secret available to your agents.",
},
Rules:
- `cli_path` is the chain of subcommand names exactly as they appear in clap's tree (kebab-case, e.g. `&["agent", "cancel-invocation"]`).
- `basename` is the skill folder name without any language suffix.
- `kind` is `SkillKind::Common` for language-independent skills, or `SkillKind::PerLanguage(...)` for per-language ones.
- `summary` is a one-line, language-agnostic description shown above the file links.
- The same `cli_path` can appear in multiple bindings; they are merged into a single block under that command in source order.
- A skill that is not installed under `<app_dir>/.agents/skills/` is silently skipped at runtime, so adding speculative bindings is safe.
Two compile-time tests guard the table:
- `every_binding_basename_exists_in_golem_skills_repo`: fails if the named skill folder is missing from `golem-skills/skills/`.
- `every_binding_path_resolves_in_clap_tree`: fails if `cli_path` doesn't match a real subcommand (catches CLI renames).
Run them with:
cargo test -p golem-cli --lib -- agent_help_hints
3. Rebuild the binaries
After creating or modifying a skill, recompile so the changes are embedded:
cargo make build-release-full
4. Write a scenario YAML
Create golem-skills/tests/harness/scenarios/<scenario-name>.yaml:
name: "my-scenario"
settings:
  timeout_per_subprompt: 300
  golem_server:
    custom_request_port: 9006
steps:
  - id: "step-one"
    prompt: "Do something using the skill"
    expectedSkills:
      - "my-new-skill"
    verify:
      build: true
5. Run the scenario
npx tsx src/run.ts --agent claude-code --language rust --scenario my-scenario
Scenario YAML Reference
Top-Level Fields
name: "scenario-name"            # Required. Unique scenario identifier.
settings:
  timeout_per_subprompt: 300     # Default timeout for prompt steps (seconds)
  golem_server:
    router_port: 9881            # Golem router port (for healthcheck)
    custom_request_port: 9006    # Sets GOLEM_CUSTOM_REQUEST_PORT env var
  cleanup: true                  # Whether to clean workspace before run
prerequisites:
  env:                           # Extra env vars set during execution
    DATABASE_URL: "postgres://..."
skip_if:                         # Skip entire scenario conditionally
  language: "ts"                 # Skip when language is "ts"
  agent: "codex"                 # Skip when agent is "codex"
  os: "windows"                  # Skip when OS matches (darwin→macos, win32→windows)
steps: [...]                     # Required. At least one step.
Step Types
Every step must have exactly one action field. Common fields available on all steps:
- id: "unique-step-id"     # Optional. Used for --resume-from.
  timeout: 600             # Override step timeout (seconds)
  expect: { ... }          # Assertions (see below)
  retry:                   # Retry on failure
    attempts: 3
    delay: 5               # Seconds between retries
  only_if:                 # Run only when conditions match
    language: "rust"
    agent: "claude-code"
    os: "macos"
  skip_if:                 # Skip when conditions match
    language: "ts"
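The `only_if` and `skip_if` conditions can be read as a small predicate over the current language, agent, and OS. The TypeScript sketch below shows one plausible reading. All names are hypothetical, and two points are assumptions rather than documented behavior: that a condition block matches only when every field it lists matches, and that platform values other than `darwin` and `win32` pass through unchanged.

```typescript
type Conditions = { language?: string; agent?: string; os?: string };
type RunContext = { language: string; agent: string; platform: string };

// Maps Node's process.platform values to the OS names used in scenarios.
// darwin→macos and win32→windows come from the scenario reference; passing
// other values (e.g. "linux") through unchanged is an assumption.
function normalizeOs(platform: string): string {
  if (platform === "darwin") return "macos";
  if (platform === "win32") return "windows";
  return platform;
}

// Assumption: a block matches only when every listed field matches.
function matches(cond: Conditions, ctx: RunContext): boolean {
  if (cond.language !== undefined && cond.language !== ctx.language) return false;
  if (cond.agent !== undefined && cond.agent !== ctx.agent) return false;
  if (cond.os !== undefined && cond.os !== normalizeOs(ctx.platform)) return false;
  return true;
}

// A step runs when only_if (if present) matches and skip_if (if present) does not.
function shouldRun(
  step: { only_if?: Conditions; skip_if?: Conditions },
  ctx: RunContext,
): boolean {
  if (step.only_if && !matches(step.only_if, ctx)) return false;
  if (step.skip_if && matches(step.skip_if, ctx)) return false;
  return true;
}
```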
prompt — Send a prompt to the coding agent
- id: "create-app"
  prompt: "Create a new Golem application called my-app with Rust."
  expectedSkills:              # Skills that MUST be activated
    - "golem-new-project"
  allowedExtraSkills:          # Extra skills that are OK to activate
    - "golem-db-app-rust"
  strictSkillMatch: false      # If true, ONLY expectedSkills may activate
  continueSession: true        # Continue previous agent session and keep cumulative
                               # skill tracking for that prompt session.
                               # Set to false to start a fresh agent session with
                               # fresh skill tracking.
  verify:
    build: true                # Run `golem build` after the prompt
    deploy: true               # Run `golem build` + `golem deploy --yes`
create_project — Create a Golem project directly (without an agent prompt)
Runs golem new <name> --template <language> --yes in the workspace, automatically using the current language as the template. Useful when a scenario needs a pre-existing project without involving the agent.
- id: "setup-project"
  create_project:
    name: "my-app"
  verify:
    build: true
    deploy: true
With language-conditional presets:
- id: "setup-project"
  create_project:
    name: "my-app"
    presets:
      rust: ["some-rust-preset"]
      ts: ["some-ts-preset"]
  verify:
    build: true
    deploy: true
shell — Run a shell command
- id: "check-files"
  shell:
    command: "ls"
    args: ["my-app/golem.yaml"]
    cwd: "subdirectory"        # Relative to workspace
  expect:
    exit_code: 0
    stdout_contains: "golem.yaml"
http — Make an HTTP request
- id: "call-api"
  http:
    url: "http://my-app.localhost:9006/path"
    method: "POST"             # GET, POST, PUT, DELETE, PATCH
    headers:
      Content-Type: "application/json"
    body: '{"key": "value"}'
  expect:
    status: 200
    body_contains: "expected text"
    body_matches: "regex.*pattern"
invoke — Invoke a Golem agent function via CLI
- id: "call-function"
  invoke:
    agent: 'CounterAgent("my-counter")'
    method: "increment"
    args: '"hello"'            # Optional function arguments
  expect:
    stdout_contains: "1"
Use the real method name as it appears in source code, not a kebab-cased external name. For
cross-language scenarios, method and args can be language-conditional:
- id: "call-function"
  invoke:
    agent: 'ItemRepositoryAgent("catalog")'
    method:
      rust: "create_item"
      ts: "createItem"
      scala: "createItem"
    args: '{id: "item-1", name: "Hammer"}'
Prompts must use language-appropriate method name casing (snake_case for Rust/MoonBit, camelCase for TypeScript/Scala) — not kebab-case. Invocation steps must also use the source-language method names that the generated code actually exposes.
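When writing language-conditional `method` maps by hand, it can help to derive the variants from a single base name. The TypeScript sketch below is a hypothetical authoring aid, not part of the harness: `snakeToCamel` and `methodNames` are made-up names, and the casing rules are the ones stated above (snake_case for Rust/MoonBit, camelCase for TypeScript/Scala).

```typescript
type Language = "rust" | "ts" | "scala" | "moonbit";

// Converts a snake_case identifier to camelCase, e.g. create_item → createItem.
function snakeToCamel(name: string): string {
  return name.replace(/_([a-z0-9])/g, (_match, ch: string) => ch.toUpperCase());
}

// Builds the per-language method names for one snake_case base name.
function methodNames(snake: string): Record<Language, string> {
  const camel = snakeToCamel(snake);
  return { rust: snake, ts: camel, scala: camel, moonbit: snake };
}

console.log(methodNames("create_item"));
```

The resulting object can be pasted into a step's `method:` mapping, so that prompts and invocation steps stay consistent with each other.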
invoke_json — Invoke with --json output
Same as invoke but requests JSON-formatted CLI output. Supports result_json assertions with
JSONPath.
result_json assertions are evaluated against the unwrapped invocation result value, not the full
CLI envelope. That means:
- if the method returns a record/object/case class, use paths like `$.id`
- if the method returns a scalar, assert against `$`
- if the method returns a list, assert against `$` or list element paths like `$[0].id`
- id: "call-json"
  invoke_json:
    agent: 'MyAgent("test")'
    method: "getData"
  expect:
    result_json:
      - path: "$.name"
        equals: "test"
      - path: "$.items[0]"
        contains: "expected"
Cross-language example:
- id: "create-item"
  invoke_json:
    agent: 'ItemRepositoryAgent("catalog")'
    method:
      rust: "create_item"
      ts: "createItem"
      scala: "createItem"
    args: '{id: "item-1", name: "Hammer"}'
  expect:
    result_json:
      - path: "$.id"
        equals: "item-1"
      - path: "$.name"
        equals: "Hammer"
create_agent — Create a Golem agent
- id: "make-agent"
  create_agent:
    name: 'MyAgent("instance-1")'
    env:
      KEY: "value"
    config:
      setting: "value"
delete_agent — Delete a Golem agent
- id: "remove-agent"
  delete_agent:
    name: 'MyAgent("instance-1")'
trigger — Fire-and-forget agent function call
- id: "trigger-bg"
  trigger:
    agent: 'MyAgent("test")'
    method: "backgroundTask"
Like invoke and invoke_json, trigger.method can be language-conditional when Rust,
TypeScript, and Scala use different method casing.
check_file — Assert on file contents
Reads a file relative to the golem project directory and runs assertions against its contents.
The file content is treated as stdout for assertion purposes.
- id: "check-output"
  check_file:
    path: "output.txt"
  expect:
    stdout_contains: "expected text"
    stdout_not_contains: "unwanted text"
    stdout_matches: "regex.*pattern"
mcp_call — Call an MCP server method
Initializes an MCP session via the Streamable HTTP transport, then sends a JSON-RPC method call. Session management (initialize + session ID forwarding) is handled automatically.
- id: "list-tools"
  mcp_call:
    url: "http://my-app.localhost:9007/mcp"
    method: "tools/list"
  expect:
    status: 200
    body_contains: "my-tool-name"
With parameters (e.g., calling a tool):
- id: "call-tool"
  mcp_call:
    url: "http://my-app.localhost:9007/mcp"
    method: "tools/call"
    params:
      name: "CounterAgent-increment"
      arguments:
        name: "my-counter"
  expect:
    status: 200
    body_contains: "1"
sleep — Wait for a duration
- id: "wait"
  sleep: 5                     # seconds
Assertions (expect)
Available assertion fields:
| Field | Applies To | Description |
|---|---|---|
| `exit_code` | shell, invoke | Assert process exit code |
| `stdout_contains` | shell, invoke, check_file, mcp_call | Stdout includes substring |
| `stdout_not_contains` | shell, invoke, check_file, mcp_call | Stdout must NOT include substring |
| `stdout_matches` | shell, invoke, check_file, mcp_call | Stdout matches regex |
| `status` | http, mcp_call | HTTP response status code |
| `body_contains` | http, mcp_call | Response body includes substring |
| `body_matches` | http, mcp_call | Response body matches regex |
| `result_json` | invoke_json | JSONPath assertions on parsed JSON result |
Regex-based assertions use JavaScript RegExp syntax because the harness evaluates them with
Node.js. --dry-run validates that stdout_matches and body_matches compile successfully.
Use JavaScript-compatible patterns such as \\d+, (?:...), and [\\s\\S]* for cross-line
matches. Do not use PCRE-only inline flags such as (?s).
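To check a pattern before putting it into a scenario, you can compile it with JavaScript's own `RegExp`. This is a minimal TypeScript sketch of that kind of check. The function name `regexError` is hypothetical, and the harness's `--dry-run` validation may be implemented differently.

```typescript
// Returns undefined when the pattern compiles as a JavaScript RegExp,
// otherwise the error message reported by the engine.
function regexError(pattern: string): string | undefined {
  try {
    new RegExp(pattern);
    return undefined;
  } catch (e) {
    return e instanceof Error ? e.message : String(e);
  }
}

// A JavaScript-compatible pattern compiles without error.
console.log(regexError("\\d+"));
// The PCRE-style inline flag (?s) is rejected by the JavaScript engine.
console.log(regexError("(?s)start.*end"));
// Cross-line matching without (?s): [\s\S]* also matches newlines.
console.log(new RegExp("start[\\s\\S]*end").test("start\nmiddle\nend"));
```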
result_json entries support:
- `path`: JSONPath expression (e.g., `$.name`, `$.items[0].id`)
- `equals`: Exact match (deep equality)
- `contains`: Substring match on stringified value
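To make the semantics of `path`, `equals`, and `contains` concrete, here is a TypeScript sketch of an evaluator. It is an illustration only, under stated assumptions: it supports just a small JSONPath subset (`$`, `.field`, `[index]`), it treats "stringified" as the raw string for string values and `JSON.stringify` output otherwise, and the names `resolvePath` and `checkAssertion` are hypothetical. The harness's real JSONPath handling may differ.

```typescript
import { isDeepStrictEqual } from "node:util";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };
type ResultAssertion = { path: string; equals?: Json; contains?: string };

// Resolves "$", ".field" and "[index]" segments only (assumed subset).
function resolvePath(root: Json, path: string): Json | undefined {
  if (!path.startsWith("$")) throw new Error(`path must start with $: ${path}`);
  const segment = /\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]/y; // sticky: match at lastIndex
  let current: Json | undefined = root;
  let pos = 1;
  while (pos < path.length) {
    segment.lastIndex = pos;
    const m = segment.exec(path);
    if (!m) throw new Error(`unsupported path segment at ${pos}: ${path}`);
    if (m[1] !== undefined) {
      // Field access only applies to plain objects.
      current =
        current !== null && typeof current === "object" && !Array.isArray(current)
          ? current[m[1]]
          : undefined;
    } else {
      // Index access only applies to arrays.
      current = Array.isArray(current) ? current[Number(m[2])] : undefined;
    }
    pos = segment.lastIndex;
  }
  return current;
}

// Evaluates one assertion against the unwrapped invocation result.
function checkAssertion(result: Json, a: ResultAssertion): boolean {
  const value = resolvePath(result, a.path);
  if (value === undefined) return false;
  if (a.equals !== undefined && !isDeepStrictEqual(value, a.equals)) return false;
  if (a.contains !== undefined) {
    const text = typeof value === "string" ? value : JSON.stringify(value);
    if (!text.includes(a.contains)) return false;
  }
  return true;
}
```

Note that `$` alone resolves to the whole result, which is why scalar and list results are asserted against `$`.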
Language-Conditional Fields
prompt, expectedSkills, allowedExtraSkills, verify, create_project, invoke.method,
invoke_json.method, trigger.method, invoke.args, invoke_json.args, and trigger.args
can be language-conditional:
- id: "create-project"
  prompt:
    ts: "Create a new Golem application with TypeScript."
    rust: "Create a new Golem application with Rust."
  expectedSkills:
    ts: ["golem-new-project", "golem-db-app-ts"]
    rust: ["golem-new-project", "golem-db-app-rust"]
Another common pattern is language-specific invocation naming:
- id: "list-items"
  invoke_json:
    agent: 'ItemRepositoryAgent("catalog")'
    method:
      rust: "list_items"
      ts: "listItems"
      scala: "listItems"
      moonbit: "list_items"
When method arguments contain records or other composite types, use per-language args because
golem agent invoke parses arguments using language-specific syntax. Rust uses { field: value }
with :, TypeScript uses { field: value } with :, Scala uses TypeName(field = value)
with =, and MoonBit uses { field: value } with : (same as Rust):
- id: "create-item"
  invoke_json:
    agent: 'ItemRepositoryAgent("catalog")'
    method:
      rust: "create_item"
      ts: "createItem"
      scala: "createItem"
      moonbit: "create_item"
    args:
      rust: '{ id: "item-1", name: "Hammer" }'
      ts: '{ id: "item-1", name: "Hammer" }'
      scala: 'Item(id = "item-1", name = "Hammer")'
      moonbit: '{ id: "item-1", name: "Hammer" }'
For simple scalar arguments (strings, numbers, booleans), the syntax is the same across all
languages, so a plain args string suffices:
args: '"item-1"'
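Conceptually, a language-conditional field is either a plain value or a mapping keyed by language. The following TypeScript sketch shows how such a field could be resolved for string-valued fields like `prompt`, `method`, and `args`. The type `Conditional` and the function `resolveField` are hypothetical names, and returning `undefined` for a language missing from the mapping is an assumption about behavior this document does not specify.

```typescript
type Language = "ts" | "rust" | "scala" | "moonbit";

// Either one value for all languages, or one value per language.
type Conditional<T> = T | Partial<Record<Language, T>>;

// A plain string applies to every language; a mapping yields the entry for
// the current language, or undefined when that language is not listed.
function resolveField(
  field: Conditional<string>,
  language: Language,
): string | undefined {
  return typeof field === "string" ? field : field[language];
}

const method: Conditional<string> = { rust: "create_item", ts: "createItem" };
console.log(resolveField(method, "rust"));
console.log(resolveField('"item-1"', "scala"));
```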
Scenario Authoring Tips
- Prefer `create_project` for setup when the scenario is not specifically testing project creation. This keeps skill activation expectations focused on the behavior under test.
- Prefer `invoke_json` over `invoke` for behavioral verification. It is more stable for assertions, especially for records, lists, and other structured return values.
- Use language-conditional `method` fields whenever Rust, TypeScript, Scala, and MoonBit differ in method casing or naming style. MoonBit uses `snake_case` (same as Rust).
- Prompts MUST use language-appropriate method name casing, not kebab-case. When a prompt mentions method names that the agent should create, use the casing convention for each target language: TypeScript and Scala use `camelCase` (e.g., `createItem`, `getTag`), Rust and MoonBit use `snake_case` (e.g., `create_item`, `get_tag`). If a prompt mentions method names, use per-language prompt syntax even if the rest of the text is identical. Agents (especially Codex) may interpret kebab-case method names literally and generate code with computed property syntax like `async ["create-item"]()`, producing kebab-case WIT exports that don't match the `invoke`/`invoke_json` step's expected method names.
- Avoid repetitive per-language prompts beyond method naming. Use language-conditional `prompt` when the wording genuinely differs between languages (e.g., different method names, file names, or syntax). If the prompt is essentially the same for all languages except for method name casing, still use per-language prompts for correctness. The agent already knows the project language from the AGENTS.md guide and will pick the right REPL language, file extension, etc.
- Helper agents with HTTP APIs for observable side effects: Some skills (atomic blocks, transactions, durability controls) need an external service to observe side effects, e.g., to verify that operations were retried, compensated, or executed in the correct order. The harness does not provide a built-in mock HTTP server, but you can achieve the same effect by prompting the coding agent to create a helper agent that exposes an HTTP API and records events. Configure `settings.golem_server.custom_request_port` so the app has a known HTTP endpoint, then ask the agent to add a second agent type with an HTTP mount that acts as the "other side." For example, a `SideEffectRecorder` agent with `POST /record` (appends an event string to an internal list) and `GET /events` (returns the full event history as JSON). The agent under test then makes HTTP requests to this recorder during its operation. After the invocation, the scenario can use an `http` step to `GET /events` and assert on the recorded sequence. This pattern mirrors how the worker executor tests use a `TestHttpServer` to capture side-effect ordering, but uses a real Golem agent instead, so no external infrastructure is needed. See `transactions-1-fallible-rollback-http-ledger.yaml` for a concrete example where `OrderLedger` serves this role, recording reserve/charge/refund/release history via HTTP endpoints and exposing a `GET /state` endpoint that the harness asserts against.
Template Variables
Steps support {{variable}} substitution. Built-in variables:
| Variable | Value |
|---|---|
| `{{workspace}}` | Absolute workspace path |
| `{{scenario}}` | Scenario name |
| `{{agent}}` | Current agent name |
| `{{language}}` | Current language |
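The substitution itself can be sketched in a few lines of TypeScript. This is an illustration, not the harness's implementation: the function name `substitute` is hypothetical, and leaving unknown `{{names}}` untouched is an assumption (the harness might instead report them as errors).

```typescript
// Replaces each {{name}} with its value from `vars`.
// Assumption: placeholders with no matching variable are left as-is.
function substitute(text: string, vars: Map<string, string>): string {
  return text.replace(
    /\{\{([A-Za-z_][A-Za-z0-9_]*)\}\}/g,
    (whole, name: string) => vars.get(name) ?? whole,
  );
}

const vars = new Map<string, string>([
  ["workspace", "/tmp/ws"],
  ["scenario", "golem-db-app-ts"],
  ["agent", "claude-code"],
  ["language", "rust"],
]);
console.log(substitute("{{workspace}}/my-app built for {{language}}", vars));
```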
Skill Activation Detection
The harness detects whether an agent actually read a skill using two mechanisms:
- Filesystem watcher: `fswatch` (macOS) or `inotifywait` (Linux) monitors SKILL.md file access events
- atime comparison: Snapshots file access times before each step and compares them afterwards
Both mechanisms feed into expectedSkills / allowedExtraSkills / strictSkillMatch
verification. Skill tracking is scoped to the current prompt session: followup prompts accumulate
activations, while the first prompt in a scenario and any prompt with continueSession: false
start a fresh tracking session.
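The atime comparison can be pictured as a diff between two snapshots. The TypeScript sketch below works on plain maps so it stays independent of the filesystem; in real use the values could come from something like `fs.statSync(file).atimeMs`. The names `AtimeSnapshot` and `accessedBetween` are hypothetical, and ignoring files that are absent from the first snapshot is an assumption.

```typescript
// SKILL.md path → last access time in milliseconds.
type AtimeSnapshot = Map<string, number>;

// Returns the files whose access time advanced between the two snapshots.
// Assumption: files missing from `before` are ignored rather than reported.
function accessedBetween(before: AtimeSnapshot, after: AtimeSnapshot): string[] {
  const accessed: string[] = [];
  for (const [file, atime] of after) {
    const previous = before.get(file);
    if (previous !== undefined && atime > previous) accessed.push(file);
  }
  return accessed;
}

const before: AtimeSnapshot = new Map([
  [".claude/skills/golem-new-project/SKILL.md", 1000],
  [".claude/skills/golem-add-rust-crate/SKILL.md", 1000],
]);
const after: AtimeSnapshot = new Map([
  [".claude/skills/golem-new-project/SKILL.md", 2500],
  [".claude/skills/golem-add-rust-crate/SKILL.md", 1000],
]);
console.log(accessedBetween(before, after));
```

Access times are not updated on filesystems mounted with options such as `noatime`, so a diff like this can miss reads on such mounts.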
Agent Drivers
| Agent | CLI Command | Skill Directories | Session Support |
|---|---|---|---|
| `claude-code` | `claude --print --permission-mode bypassPermissions` | `.claude/skills/` | Yes (`sessionId`) |
| `opencode` | `opencode run` | `.claude/skills/`, `.agents/skills/` | No |
| `codex` | `codex exec --dangerously-bypass-approvals-and-sandbox` | `.agents/skills/` | Yes (`session_id`) |
The driver copies/symlinks all skills from the --skills directory into the agent's expected skill directories within the workspace.
Failure Classification
Failed steps are automatically classified:
| Code | Category | Meaning |
|---|---|---|
| `SKILL_NOT_ACTIVATED` | agent | Expected skill was not read by the agent |
| `SKILL_MISMATCH` | agent | Unexpected extra skills were activated |
| `BUILD_FAILED` | build | `golem build` failed |
| `DEPLOY_FAILED` | deploy | `golem deploy` failed |
| `INVOKE_FAILED` | deploy | Agent function invocation failed |
| `INVOKE_JSON_FAILED` | deploy | JSON agent invocation failed |
| `SHELL_FAILED` | infra | Shell command returned non-zero exit |
| `HTTP_FAILED` | network | HTTP request failed or timed out |
| `MCP_CALL_FAILED` | network | MCP call failed (init, session, or method error) |
| `CREATE_PROJECT_FAILED` | infra | `golem new` project creation failed |
| `CREATE_AGENT_FAILED` | infra | `golem agent new` failed |
| `DELETE_AGENT_FAILED` | infra | `golem agent delete` failed |
| `FILE_CHECK_FAILED` | assertion | Could not read file for `check_file` step |
| `ASSERTION_FAILED` | assertion | Output didn't match `expect` assertions |
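When post-processing results, for example to group failures by category, the table above can be expressed as a lookup. The TypeScript sketch below copies the code/category pairs from the table. The constant and function names are hypothetical, and the harness's internal representation may differ.

```typescript
type Category = "agent" | "build" | "deploy" | "infra" | "network" | "assertion";

// Code → category pairs copied from the failure classification table.
const FAILURE_CATEGORY: Record<string, Category> = {
  SKILL_NOT_ACTIVATED: "agent",
  SKILL_MISMATCH: "agent",
  BUILD_FAILED: "build",
  DEPLOY_FAILED: "deploy",
  INVOKE_FAILED: "deploy",
  INVOKE_JSON_FAILED: "deploy",
  SHELL_FAILED: "infra",
  HTTP_FAILED: "network",
  MCP_CALL_FAILED: "network",
  CREATE_PROJECT_FAILED: "infra",
  CREATE_AGENT_FAILED: "infra",
  DELETE_AGENT_FAILED: "infra",
  FILE_CHECK_FAILED: "assertion",
  ASSERTION_FAILED: "assertion",
};

// Lists the failure codes that belong to one category, in table order.
function codesIn(category: Category): string[] {
  return Object.keys(FAILURE_CATEGORY).filter(
    (code) => FAILURE_CATEGORY[code] === category,
  );
}

console.log(codesIn("network"));
```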
Output and Reports
Results are written to --output (default ./results/):
- Per-scenario JSON: `<agent>-<language>-<scenario-name>.json` with step-by-step results
- summary.json: Aggregated pass/fail counts, durations, worst failures
- report.html: Visual HTML report
- GitHub Actions summary: Auto-generated if `GITHUB_STEP_SUMMARY` is set
Existing Skills and Scenarios
Skills in golem-skills/skills/ (see Skill Directory Structure for layout):
- `common/golem-new-project`: scaffolding with `golem new`
- `rust/golem-add-rust-crate`: adding Rust crate dependencies
- `ts/golem-add-npm-package`: adding npm package dependencies
- `scala/golem-add-scala-dependency`: adding Scala library dependencies
- `moonbit/golem-add-moonbit-package`: adding MoonBit mooncakes dependencies
Scenarios in golem-skills/tests/harness/scenarios/:
- `create-a-new-project.yaml`: project creation, build, deploy, and invoke
- `add-third-party-dependency.yaml`: add a third-party dependency, use it in code, and verify