nexus-eval-harness - SKILL.md Agent Skill

name: nexus-eval-harness
description: Run or update the Nexus LLM eval harness for arbitrary provider/model matrices. Use when comparing models in the real vault-like tool environment, changing eval configs, adding eval scenarios, or debugging eval reports.

Nexus Eval Harness

Use this skill for full Nexus model-behavior evals, not simple provider availability checks.

Core Rule

Use the shared harness in tests/eval/eval.test.ts. Do not create one-off runners unless the harness itself is broken and you are actively fixing it.

The harness should run provider/model/scenario jobs in parallel. Avoid sequential loops for model comparisons.
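As a rough illustration of the parallel rule above, the sketch below expands targets and scenarios into a flat job list and starts every job at once. The names EvalTarget, EvalJob, EvalResult, buildMatrix, and runMatrix are hypothetical and are not taken from tests/eval/eval.test.ts; the real harness may structure this differently.

```typescript
// Hypothetical shapes for illustration; the real harness types may differ.
interface EvalTarget { provider: string; model: string; }
interface EvalJob extends EvalTarget { scenario: string; }
interface EvalResult extends EvalJob { passed: boolean; error?: string; }

// Expand targets x scenarios into one flat list of jobs.
function buildMatrix(targets: EvalTarget[], scenarios: string[]): EvalJob[] {
  const jobs: EvalJob[] = [];
  for (const target of targets) {
    for (const scenario of scenarios) {
      jobs.push({ provider: target.provider, model: target.model, scenario });
    }
  }
  return jobs;
}

// Start every job immediately and wait for all of them. Each job catches its
// own error, so a single failure does not reject the whole Promise.all.
function runMatrix(
  jobs: EvalJob[],
  runJob: (job: EvalJob) => Promise<boolean>
): Promise<EvalResult[]> {
  return Promise.all(
    jobs.map(async (job): Promise<EvalResult> => {
      try {
        const passed = await runJob(job);
        return { ...job, passed };
      } catch (err) {
        const error = err instanceof Error ? err.message : String(err);
        return { ...job, passed: false, error };
      }
    })
  );
}
```

The point of the sketch is the shape: one flat job list and a single await on the whole batch, rather than a for loop that awaits each model in turn.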

Production-Like Vault Runs

For the native vault environment, run live mode against the two-tool surface (getTools and useTools):

RUN_EVAL=1 EVAL_MODE=live EVAL_TOOL_SET=meta EVAL_TARGETS='openrouter=deepseek/deepseek-v4-pro,openrouter=deepseek/deepseek-v4-flash' npx jest tests/eval/eval.test.ts --runInBand --no-coverage --verbose

Notes:

--runInBand only keeps Jest in a single worker process; it does not serialize the evals, because the harness itself runs the eval matrix concurrently with Promise.all.
RUN_EVAL=1 is required so ordinary test runs do not hit live provider APIs.
EVAL_TOOL_SET=meta restricts scenarios to the production getTools/useTools contract.
EVAL_SCENARIO_NAMES narrows a run to specific scenario names.
EVAL_TRACE_STREAM=1 writes per-scenario JSONL traces under test-artifacts/traces/ as chunks, tool calls, tool events, and assertions arrive.
API keys are read from process env or repo .env; never print credentials.
Reports are written under test-artifacts/.

Target Selection

Preferred format for arbitrary targets:

EVAL_TARGETS='provider=model,provider=model'

Examples:

EVAL_TARGETS='openrouter=anthropic/claude-sonnet-4.6,openrouter=openai/gpt-5.4-mini'
EVAL_TARGETS='openai=gpt-5.4,openrouter=openai/gpt-5.4'

Single-provider shorthand:

EVAL_PROVIDER=openrouter EVAL_MODELS='deepseek/deepseek-v4-pro,deepseek/deepseek-v4-flash'
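The two selection formats above could be parsed roughly as follows. This is a minimal sketch, not the harness's actual parser: the function name parseTargets is hypothetical, and giving EVAL_TARGETS precedence over the EVAL_PROVIDER/EVAL_MODELS shorthand is an assumption.

```typescript
interface EvalTarget { provider: string; model: string; }

// Split a comma-separated value into trimmed, non-empty parts.
function splitList(raw: string): string[] {
  return raw.split(",").map((part) => part.trim()).filter((part) => part.length > 0);
}

// Assumption: EVAL_TARGETS wins when both formats are set.
function parseTargets(env: Record<string, string | undefined>): EvalTarget[] {
  const rawTargets = env.EVAL_TARGETS;
  if (rawTargets !== undefined && rawTargets.trim() !== "") {
    return splitList(rawTargets).map((entry) => {
      // Split on the first "=" only; model ids such as "openai/gpt-5.4" contain "/".
      const eq = entry.indexOf("=");
      if (eq <= 0 || eq === entry.length - 1) {
        throw new Error(`Invalid EVAL_TARGETS entry: "${entry}"`);
      }
      return { provider: entry.slice(0, eq).trim(), model: entry.slice(eq + 1).trim() };
    });
  }
  const provider = env.EVAL_PROVIDER;
  const models = env.EVAL_MODELS;
  if (provider !== undefined && models !== undefined) {
    const name = provider.trim();
    return splitList(models).map((model) => ({ provider: name, model }));
  }
  return [];
}
```

Calling parseTargets(process.env) at startup and failing loudly on a malformed entry helps avoid a run that silently evaluates zero targets.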

Useful overrides:

EVAL_CONFIG=tests/eval/configs/live.yaml
RUN_EVAL=1
EVAL_MODE=live
EVAL_SCENARIOS='tests/eval/scenarios/search-variations.eval.yaml'
EVAL_SCENARIO_NAMES='simple-read,replace-content'
EVAL_TOOL_SET=meta
EVAL_TRACE_STREAM=1
EVAL_MAX_RETRIES=3
EVAL_RETRY_DELAY_MS=1000
EVAL_RETRY_BACKOFF_MULTIPLIER=2
EVAL_RETRY_MAX_DELAY_MS=30000
EVAL_TIMEOUT_MS=120000

Retry notes:

The harness retries provider/server failures such as 408, 409, 425, 429, 5xx, timeouts, and transient transport errors with exponential backoff.
Behavioral scenario failures may still be retried up to maxRetries, but stream errors caused purely by auth or validation problems should fail fast.
Retry delays are per parallel job; one throttled model should not serialize the rest of the matrix.
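The retry behavior described above can be sketched as a status check, a capped exponential delay, and a per-job retry loop. The names RetryOptions, isRetryableStatus, backoffDelayMs, and withRetry are hypothetical. The option values in the usage note mirror the override examples listed earlier and are not confirmed defaults.

```typescript
interface RetryOptions {
  maxRetries: number;        // e.g. EVAL_MAX_RETRIES
  retryDelayMs: number;      // e.g. EVAL_RETRY_DELAY_MS
  backoffMultiplier: number; // e.g. EVAL_RETRY_BACKOFF_MULTIPLIER
  maxDelayMs: number;        // e.g. EVAL_RETRY_MAX_DELAY_MS
}

// Statuses named in the retry notes: 408, 409, 425, 429, and any 5xx.
function isRetryableStatus(status: number): boolean {
  return [408, 409, 425, 429].indexOf(status) !== -1 || (status >= 500 && status <= 599);
}

// Delay before retry number `attempt` (0-based), capped at maxDelayMs.
function backoffDelayMs(attempt: number, o: RetryOptions): number {
  return Math.min(o.retryDelayMs * Math.pow(o.backoffMultiplier, attempt), o.maxDelayMs);
}

// Retry one job. Each parallel job has its own loop, so a throttled model
// waits alone while the rest of the matrix keeps running.
async function withRetry<T>(
  run: () => Promise<T>,
  shouldRetry: (err: unknown) => boolean,
  o: RetryOptions,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<T> {
  let attempt = 0;
  while (true) {
    try {
      return await run();
    } catch (err) {
      // Give up when retries are exhausted or the error is not retryable
      // (for example an auth or validation failure).
      if (attempt >= o.maxRetries || !shouldRetry(err)) throw err;
      await sleep(backoffDelayMs(attempt, o));
      attempt++;
    }
  }
}
```

With retryDelayMs=1000, backoffMultiplier=2, and maxDelayMs=30000, the first three delays are 1000, 2000, and 4000 ms, and any delay that would exceed 30000 ms is capped there.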

Interpreting Results

Separate these failure classes:

Provider/API failures: stream errors, auth failures, rate limits, transport errors.
Tool contract failures: missing workspaceId, sessionId, memory, or goal; wrong CLI flags; skipped getTools.
Task failures: valid tools called but wrong tool choice, wrong order, or incomplete multi-step plan.
Harness failures: zero loaded scenarios, non-production tool surface for a vault eval, or leftover generated test vault state.
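One way to keep these classes separate when reading a report is to triage each job in a fixed order. The sketch below is an assumption-heavy illustration: JobSummary and its fields, FailureClass, and classifyFailure are hypothetical and do not describe the harness's actual report schema. Checking harness problems first is also an assumption, made because results from a misconfigured run are not meaningful.

```typescript
// Hypothetical summary of one finished job; field names are illustrative only.
interface JobSummary {
  loadedScenarios: number;        // scenarios the harness actually loaded
  toolSet: string;                // tool surface used for the run, e.g. "meta"
  streamError?: string;           // set when the provider stream failed
  missingContextFields: string[]; // e.g. ["workspaceId", "sessionId"]
  calledGetTools: boolean;        // false when the model skipped getTools
  assertionsPassed: boolean;      // scenario assertions on tool choice/order
}

type FailureClass = "harness" | "provider" | "tool-contract" | "task" | "none";

function classifyFailure(s: JobSummary, expectedToolSet: string = "meta"): FailureClass {
  // Harness problems first: nothing else is trustworthy if the run was misconfigured.
  if (s.loadedScenarios === 0 || s.toolSet !== expectedToolSet) return "harness";
  // Provider/API problems next: the model never had a fair attempt.
  if (s.streamError !== undefined) return "provider";
  // The call reached the tools but broke the tool contract.
  if (s.missingContextFields.length > 0 || !s.calledGetTools) return "tool-contract";
  // Valid tool usage, but the scenario assertions did not pass.
  if (!s.assertionsPassed) return "task";
  return "none";
}
```

Counting jobs per class before computing pass rates helps keep provider outages and harness misconfiguration from being read as model behavior.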

When the user asks how a model performs “in our environment,” report the numbers from live runs with EVAL_TOOL_SET=meta first.