promptfoo-evals

star 0

Write, refine, run, and QA non-redteam promptfoo eval suites after the target or provider already works: prompts, vars, test cases, assertions, model-graded rubrics, transforms, datasets, output exports, filters, and CI gates. Use for regression tests and eval-suite authoring. Do not use for connecting a new target/provider, mapping HTTP requests or auth, smoke-testing an endpoint, or redteam plugin/strategy setup; use `promptfoo-provider-setup` for connection work instead.

nludd25 By nludd25 schedule Updated 6/10/2026

name: promptfoo-evals description: > Write, refine, run, and QA non-redteam promptfoo eval suites after the target or provider already works: prompts, vars, test cases, assertions, model-graded rubrics, transforms, datasets, output exports, filters, and CI gates. Use for regression tests and eval-suite authoring. Do not use for connecting a new target/provider, mapping HTTP requests or auth, smoke-testing an endpoint, or redteam plugin/strategy setup; use promptfoo-provider-setup for connection work instead.

Promptfoo Evals

Build a small eval that answers one product question clearly, run it with fresh results, then inspect the exported artifact before expanding.

Read references/eval-patterns.md when you need concrete YAML patterns, assertion examples, or CI snippets. For deep promptfoo feature questions that are not covered here, consult https://www.promptfoo.dev/llms-full.txt.

Inputs

Infer these from the repo or user prompt:

  • Behavior being evaluated and what "good" means.
  • Target/provider already configured, or whether promptfoo-provider-setup is needed first.
  • Prompt shape and variables.
  • Test data source: inline cases, CSV/JSON, generated data, production examples, or hand-picked regressions.
  • Assertion style: deterministic checks, structured output validation, JavaScript assertions, model-graded rubrics, or a mix.
  • Output needs: JSON export, comparison, CI gate, or human triage.

If the provider does not work yet, switch to promptfoo-provider-setup. If the task is adversarial security scanning, switch to promptfoo-redteam-setup or promptfoo-redteam-run.

Workflow

1. State the eval question

Search for existing configs first: promptfooconfig.yaml, promptfooconfig.yml, or repo evals/promptfoo directories. Extend an existing suite when possible.

Write one sentence for the behavior under test, then choose 3-10 starter cases. Include both ordinary success cases and edge cases that have broken before.

For new suites, prefer this layout unless the repo already has a convention:

evals/<suite-name>/
  promptfooconfig.yaml
  prompts/
  tests/

2. Choose assertions

Prefer deterministic assertions first:

  • Exact or substring behavior: equals, contains, icontains, regex
  • Structured output: is-json, contains-json, javascript
  • Numeric or score-like outputs: javascript returning a boolean or score
  • Semantic quality: llm-rubric with an explicit grader provider when possible

Use model-graded assertions sparingly for qualities that deterministic checks cannot capture. Configure a local or explicit grader for reproducible QA.

3. Write the config

Include:

  • # yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
  • A short description
  • Field order: description, env, prompts, providers, defaultTest, scenarios, tests
  • prompts via file://prompts/*.txt or file://prompts/*.json when prompts are more than a one-line smoke test
  • tests: file://tests/*.yaml for suites that will grow beyond a few cases
  • defaultTest only for shared assertions/options
  • options.transform when parsing JSON once makes assertions cleaner
  • Stable metric names only when they help compare dashboards over time

Keep secrets as {{env.VAR}}; do not commit .env values. When checking faithfulness or hallucination with llm-rubric, inline the source material in the rubric via {{variable}} so the grader can actually compare.

4. Validate and run

From the promptfoo repo:

source ~/.nvm/nvm.sh && nvm use
npm run local -- validate config -c path/to/promptfooconfig.yaml
npm run local -- eval -c path/to/promptfooconfig.yaml -o /tmp/eval-results.json --no-cache --no-share

Outside the repo:

npx promptfoo@latest validate config -c path/to/promptfooconfig.yaml
npx promptfoo@latest eval -c path/to/promptfooconfig.yaml -o /tmp/eval-results.json --no-cache --no-share

Inspect the output JSON for results.stats, response.output, score, gradingResult.reason, and error.

5. Iterate deliberately

  • Add cases when a failure represents real expected behavior.
  • Tighten assertions when false positives pass.
  • Use --filter-pattern, --filter-metadata, or --filter-failing for focused reruns.
  • Keep --no-cache while developing so you are not validating stale outputs.
  • Use --no-share unless the user asks for a shareable URL.

Common Mistakes

# WRONG: vague rubric with no examples or grader control
- type: llm-rubric
  value: Is this good?

# BETTER: concrete success criteria
- type: llm-rubric
  value: >-
    The answer must cite the requested invoice id, state approved/denied, and
    avoid inventing fields not present in the tool result.
# WRONG: unquoted JS expression that starts with [ or { is parsed as YAML flow
- type: javascript
  value: ['billing', 'technical'].includes(output.category)

# BETTER: quote any assertion value that begins with [, {, *, &, or !
- type: javascript
  value: "['billing', 'technical'].includes(output.category)"
# WRONG: inline prompts that contain JSON-like braces are misread as file paths
prompts:
  - 'Classify: {{text}}. Return {"category": "..."} JSON.'

# BETTER: move non-trivial prompts (JSON examples, multi-line, quotes) to a file
prompts:
  - file://./prompts/classify.txt
# WRONG: reparsing JSON in every assertion
assert:
  - type: javascript
    value: JSON.parse(output).status === 'approved'

# BETTER: parse once for the test
options:
  transform: JSON.parse(output)
assert:
  - type: javascript
    value: output.status === 'approved'

Output Contract

When done, state:

  • Eval question and target/provider used
  • Files created or changed
  • Assertion strategy and why
  • Validation/eval commands run
  • Result stats and any failures/errors
  • Required environment variables
  • Follow-up cases or assertions to add next
Install via CLI
npx skills add https://github.com/nludd25/promptfoo --skill promptfoo-evals
Repository Details
star Stars 0
call_split Forks 0
navigation Branch main
article Path SKILL.md
More from Creator