bare-eval

Run isolated eval and grading calls using CC 2.1.81 --bare mode. Constructs claude -p --bare invocations for skill evaluation, trigger testing, and LLM grading without plugin/hook interference. Use when running eval pipelines, grading skill outputs, benchmarking prompt quality, or testing trigger accuracy in isolation.

By yonatangross. Updated 6/10/2026

name: bare-eval
compatibility: "Claude Code 2.1.170+"
description: "Run isolated eval and grading calls using CC 2.1.81 --bare mode. Constructs claude -p --bare invocations for skill evaluation, trigger testing, and LLM grading without plugin/hook interference. Use when running eval pipelines, grading skill outputs, benchmarking prompt quality, or testing trigger accuracy in isolation."
tags: [eval, bare, grading, pipeline, testing, ci]
version: 1.0.0
author: OrchestKit
user-invocable: false
complexity: medium
context: inherit
persuasion-type: discipline
effort: low

Bare Eval — Isolated Evaluation Calls

Run claude -p --bare for fast, clean eval/grading without plugin overhead.

CC 2.1.81 required. The --bare flag skips hooks, LSP, plugin sync, and skill directory walks.

When to Use

  • Grading skill outputs against assertions
  • Trigger classification (which skill matches a prompt)
  • Description optimization iterations
  • Any scripted -p call that doesn't need plugins

When NOT to Use

  • Testing skill routing (needs --plugin-dir)
  • Testing agent orchestration (needs full plugin context)
  • Interactive sessions

Prerequisites

# --bare requires ANTHROPIC_API_KEY (OAuth/keychain disabled)
export ANTHROPIC_API_KEY="sk-ant-..."

# Verify CC version
claude --version  # Must be >= 2.1.81
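The two checks above can be folded into a single preflight step. What follows is a minimal sketch, not part of OrchestKit: `version_ge` and `bare_preflight` are hypothetical helper names, the comparison assumes a `sort` that supports `-V`, and the output of `claude --version` is assumed to contain an x.y.z token somewhere in its text.

```shell
# version_ge A B: succeed when dotted version A >= B (assumes `sort -V` exists).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# bare_preflight VERSION_STRING: hypothetical helper that checks the API key
# and the minimum CC version before any --bare call is made.
bare_preflight() {
  local raw="$1" min="2.1.81" ver
  # Pull the first x.y.z token; the exact `claude --version` format is assumed.
  ver="$(printf '%s\n' "$raw" | grep -Eo '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)"
  if [ -z "${ANTHROPIC_API_KEY:-}" ]; then
    echo "ANTHROPIC_API_KEY is not set (--bare disables OAuth/keychain)" >&2
    return 1
  fi
  if [ -z "$ver" ] || ! version_ge "$ver" "$min"; then
    echo "need CC >= $min, found '${ver:-unknown}'" >&2
    return 1
  fi
}
```

A script would typically call it once at the top, for example `bare_preflight "$(claude --version)" || exit 1`.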

Quick Reference

| Call Type | Command Pattern |
|---|---|
| Grading | claude -p "$prompt" --bare --max-turns 1 --output-format text |
| Trigger | claude -p "$prompt" --bare --json-schema "$schema" --output-format json |
| Streaming grade | claude -p "$prompt" --bare --max-turns 1 --output-format stream-json |
| Optimize | echo "$prompt" \| claude -p --bare --max-turns 1 --output-format text |
| Force-skill | claude -p "$prompt" --bare --print --append-system-prompt "$content" |
| @-file in prompt | claude -p "grade @fixtures/case-1.md against rubric" --bare (CC 2.1.113 Remote Control autocomplete) |
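The patterns above differ only in the flags that follow `--bare`, so a runner can select them by call type. This is a minimal sketch under stated assumptions: `run_bare` and the `CLAUDE_BIN` override are hypothetical names introduced here for illustration (the override makes a dry run possible), and only three of the rows are covered.

```shell
# run_bare TYPE PROMPT [SCHEMA]: build one of the Quick Reference invocations.
# CLAUDE_BIN is a hypothetical override so the command can be stubbed in tests.
run_bare() {
  local call_type="$1" prompt="$2" schema="${3:-}"
  case "$call_type" in
    grading)   set -- --max-turns 1 --output-format text ;;
    streaming) set -- --max-turns 1 --output-format stream-json ;;
    trigger)   set -- --json-schema "$schema" --output-format json ;;
    *)
      echo "unknown call type: $call_type" >&2
      return 1 ;;
  esac
  # "$@" now holds only the type-specific flags chosen above.
  "${CLAUDE_BIN:-claude}" -p "$prompt" --bare "$@"
}
```

For example, `run_bare grading "$prompt"` expands to the Grading row, and `run_bare trigger "$prompt" "$schema"` to the Trigger row.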

--output-format stream-json

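Emits newline-delimited JSON events (one per token or tool call), so a runner can score partial output or abort early on a failing probe without waiting for the full response.

claude -p "$prompt" --bare --max-turns 1 --output-format stream-json \
  | while IFS= read -r line; do
      # Each line is a single JSON event. Print the text of content_block_delta
      # events; `// empty` skips deltas without text instead of printing "null".
      # Event shapes can differ between CC versions (some also require --verbose
      # with stream-json in -p mode), so confirm the type names on real output first.
      jq -r 'select(.type == "content_block_delta") | .delta.text // empty' <<< "$line"
    done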

Use stream-json over json when:

  • grading long outputs and you want incremental scoring,
  • piping into another CLI step-by-step (e.g. ork:eval-runner),
  • you need per-token timing data alongside the content.
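The early-abort use of stream-json can be sketched without parsing each event: read the stream line by line and stop at the first event that contains a failure marker. This is a sketch only. `abort_on_match` is a hypothetical helper, and the marker string is whatever your grader emits on failure; the `"verdict":"fail"` used in the usage note is an assumed example, not something the CLI produces by itself.

```shell
# abort_on_match PATTERN: read events from stdin, one per line. Return 1 at the
# first line containing PATTERN (a plain substring); otherwise print how many
# events were read and return 0.
abort_on_match() {
  local pattern="$1" line count=0
  while IFS= read -r line; do
    count=$((count + 1))
    case "$line" in
      *"$pattern"*)
        echo "aborting after $count events: matched '$pattern'" >&2
        return 1 ;;
    esac
  done
  echo "$count"
}
```

Usage: `claude -p "$prompt" --bare --max-turns 1 --output-format stream-json | abort_on_match '"verdict":"fail"'`. Once the reader returns, the upstream process typically receives SIGPIPE on its next write, which is what ends the call early.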

Invocation Patterns

Load detailed patterns and examples:

Read("${CLAUDE_SKILL_DIR}/references/invocation-patterns.md")

Grading Schemas

JSON schemas for structured eval output:

Read("${CLAUDE_SKILL_DIR}/references/grading-schemas.md")

Pipeline Integration

OrchestKit's eval scripts (npm run eval:skill) auto-detect bare mode:

# eval-common.sh detects ANTHROPIC_API_KEY → sets BARE_MODE=true
# Scripts add --bare to all non-plugin calls automatically

Bare calls: trigger classification, force-skill, baseline, and all grading.
Never bare: run_with_skill (needs plugin context for routing tests).
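The detection rule described above can be sketched as follows. This is not the actual content of eval-common.sh: `detect_bare_mode` and `bare_flag` are hypothetical names, and the only behavior taken from the text is that the key's presence turns bare mode on and that run_with_skill is never bare.

```shell
# detect_bare_mode: set BARE_MODE from the presence of ANTHROPIC_API_KEY.
detect_bare_mode() {
  if [ -n "${ANTHROPIC_API_KEY:-}" ]; then
    BARE_MODE=true
  else
    BARE_MODE=false
  fi
}

# bare_flag CALL_KIND: print --bare for calls that may run bare, and nothing
# for run_with_skill, which needs the full plugin context.
bare_flag() {
  if [ "${BARE_MODE:-false}" = true ] && [ "$1" != "run_with_skill" ]; then
    printf '%s\n' "--bare"
  fi
}
```

A caller would then splice the result into the command line, for example `claude -p "$prompt" $(bare_flag grading) --max-turns 1`.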

CC 2.1.119: --print honors agent tools: / disallowedTools: (M122)

Before CC 2.1.119, --print mode ran with the full default tool set regardless of the agent's frontmatter tools: and disallowedTools:. Bare-eval grading was effectively ungated — graders could call any tool they wanted, even if the agent definition restricted them.

As of 2.1.119, --print enforces the agent's declared tool surface. Implications for eval design:

| Consequence | Action |
|---|---|
| Eval graders that relied on unrestricted tool access may now fail | Audit grader prompts for tools they actually need; whitelist explicitly via the agent's tools: frontmatter |
| Eval results match interactive runs | Reproducibility improves: you grade what the model can actually do, not what it could do in an unsandboxed --print |
| --agent <name> also honors permissionMode in --print | Permission-gated tools (Bash, Edit) require either permissionMode: acceptEdits or explicit allowlists in the agent definition |

Migration test:

# Run an eval against an agent with a deliberately tight tools: list.
# Graders that previously called Read/Bash freely will now fail unless those
# tools are declared on the agent.
claude -p "$prompt" --bare --print --agent grader-test

If the grader fails with a "tool not permitted" error, add the required tool to the agent's tools: frontmatter and re-run.
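Before re-running, it can help to check which tools the agent file actually declares. The sketch below is an illustration under assumptions: `agent_declares_tool` is a hypothetical helper, the agent file path is passed in by the caller, and the frontmatter is assumed to list tools on a single comma-separated `tools:` line (a YAML list spread over several lines would need a different parser).

```shell
# agent_declares_tool FILE TOOL: succeed when TOOL appears in the `tools:` line
# of FILE. Assumes a single-line, comma-separated list, e.g. "tools: Read, Grep".
agent_declares_tool() {
  local file="$1" tool="$2" declared
  # Take the first line starting with "tools:" and drop the key itself.
  declared="$(sed -n 's/^tools:[[:space:]]*//p' "$file" | head -n 1)"
  # Strip whitespace and wrap in commas so whole names match, not substrings.
  case ",$(printf '%s' "$declared" | tr -d '[:space:]')," in
    *",$tool,"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For instance, `agent_declares_tool path/to/grader-test.md Bash || echo "Bash is not declared"` flags the missing tool before the eval is run again.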

CC 2.1.121: CLAUDE_CODE_FORK_SUBAGENT=1 for grader determinism (#1545)

Before CC 2.1.121, the env var only worked in interactive sessions. As of 2.1.121, non-interactive paths (claude -p, SDK) honor it too — each grader invocation gets a fresh forked subagent context.

The cross-eval state-leak problem this fixes:

Without forking, sequential claude -p --bare graders inherit harness state:

| Inherited | Symptom |
|---|---|
| memory MCP query cache | grader sees stale hit from previous run; same fixture grades differently |
| .claude/chain/*.json on disk | grader for "implement" thinks "explore" already ran (file is from previous test) |
| ToolSearch deferred-tool cache | first grader's MCP loads bleed into next grader's tool registry |
| model picker pref | grader N inherits --model=opus from grader N-1 |

This produced a ~5–10% retry rate and non-reproducible scores: the eval baseline drifted between runs, and engineers chased phantom regressions.

Fix: tests/evals/scripts/lib/eval-common.sh exports CLAUDE_CODE_FORK_SUBAGENT=1, so every script that sources it (run-trigger-eval, run-quality-eval, run-agent-eval, optimize-description, etc.) gets forked graders automatically. The CI workflow .github/workflows/orchestkit-eval.yml also sets it at the workflow level. Older CC silently ignores the env var (no-op).

Determinism contract: running the same grader on the same fixture twice in a row produces the same score. Verified by tests/evals/scripts/test-grader-determinism.sh.
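The contract can be checked with a very small harness: run the same command twice and compare the results. The sketch below is not the contents of test-grader-determinism.sh; `same_twice` is a hypothetical helper that compares raw output, which is stricter than comparing scores, so a real check may want to extract the score first.

```shell
# Set the fork flag for every grader started from this shell; per the text
# above, older CC versions ignore it.
export CLAUDE_CODE_FORK_SUBAGENT=1

# same_twice CMD [ARGS...]: run the command twice and succeed only when both
# runs exit 0 and print identical output. Returns 2 if either run fails.
same_twice() {
  local first second
  first="$("$@")" || return 2
  second="$("$@")" || return 2
  [ "$first" = "$second" ]
}
```

Usage: `same_twice claude -p "$prompt" --bare --max-turns 1 --output-format text || echo "grader drifted"`.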

Performance

| Scenario | Without --bare | With --bare | Savings |
|---|---|---|---|
| Single grading call | ~3-5s startup | ~0.5-1s | 2-4x |
| Trigger (per prompt) | ~3-5s | ~0.5-1s | 2-4x |
| Full eval (50 calls) | ~150-250s overhead | ~25-50s | 3-5x |
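The full-eval row is the per-call startup range multiplied by the number of calls. A small sketch of that arithmetic, using a hypothetical `startup_overhead` helper and `awk` for the fractional seconds:

```shell
# startup_overhead CALLS LOW_SECONDS HIGH_SECONDS: total startup overhead range.
startup_overhead() {
  awk -v n="$1" -v lo="$2" -v hi="$3" 'BEGIN { printf "%g-%gs\n", n * lo, n * hi }'
}

startup_overhead 50 3 5     # without --bare: prints 150-250s
startup_overhead 50 0.5 1   # with --bare:    prints 25-50s
```

The same helper gives a quick estimate for other pipeline sizes, for example `startup_overhead 200 0.5 1` for a 200-call run.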

Rules

Read("${CLAUDE_SKILL_DIR}/rules/_sections.md")

Troubleshooting

Read("${CLAUDE_SKILL_DIR}/references/troubleshooting.md")

Dynamic-workflow harness (template-in-skill)

workflows/skill-fitness.mjs is a runnable dynamic-workflow template — the workflow-backed complement to the static conformance grader (scripts/eval/conformance-check.mjs). It fans out one isolated-context agent per skill to score fitness (freshness / router-clarity / structure) and synthesizes a ranked scorecard, catching qualitative drift a static grep can't (description/body count mismatches, duplicate headings, install-specific absolute paths, version drift). Run it with the Workflow tool:

Workflow({ scriptPath: "${CLAUDE_SKILL_DIR}/workflows/skill-fitness.mjs",
           args: ["assess", "commit", "doctor"] })

Treat it as a template, not a verbatim script — adapt the SKILLS list and rubric per use. Cost is real (~50k tokens/skill; scoring all ~112 is ~6M tokens), so pass an explicit batch via args. Static-first: run conformance-check.mjs (zero tokens) to pre-filter, then this harness for the judgment grep can't make.
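Because the cost scales linearly with batch size, a runner can estimate it before launching the workflow. This is a sketch under the ~50k tokens/skill figure quoted above; `batch_tokens` and `within_budget` are hypothetical helpers, and the budget value is whatever limit you choose.

```shell
# batch_tokens COUNT [PER_SKILL]: estimated tokens for COUNT skills, using the
# ~50k tokens/skill figure from the text as the default.
batch_tokens() {
  echo $(( $1 * ${2:-50000} ))
}

# within_budget COUNT BUDGET: succeed when the estimate fits the token budget.
within_budget() {
  [ "$(batch_tokens "$1")" -le "$2" ]
}
```

For the three-skill batch in the example call, `batch_tokens 3` gives 150000, while `batch_tokens 112` gives 5600000, which is the roughly 6M figure for scoring everything.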

Related

  • eval:skill npm script — unified skill evaluation runner
  • eval:trigger — trigger accuracy testing
  • eval:quality — A/B quality comparison
  • optimize-description.sh — iterative description improvement
  • Version compatibility: doctor/references/version-compatibility.md
Install via CLI
npx skills add https://github.com/yonatangross/orchestkit --skill bare-eval
Repository Details
Stars: 189
Forks: 15
Branch: main
Path: SKILL.md