promptfoo-init

name: promptfoo-init description: Generate an initial promptfoo YAML configuration for evaluating LLM prompts. Use when the user wants to set up promptfoo, create a promptfooconfig.yaml config, scaffold an eval, bootstrap prompt testing, start evaluating prompts, or create evals based on a user request. Triggers on mentions of "promptfoo", "prompt eval", "eval config", "test my prompt", "create an eval for", or requests to create evaluation configurations for LLM tasks.

Promptfoo Init

Generate a complete promptfooconfig.yaml configuration from a task description. The output is a ready-to-run config with system prompt, dataset, assertions, and provider setup. Also supports creating targeted evals from user requests (e.g., "create an eval that checks if the model refuses harmful requests").

Workflow

1. Gather requirements

Ask the user (concisely, in one message):

Task: What does the LLM need to do? (e.g., "classify support tickets", "extract entities from text", "translate English to Spanish")
Input: What does the user input look like? (e.g., "a customer email", "a JSON object with name and address")
Expected output: What should the LLM produce? (e.g., "a category label", "JSON with extracted fields", "translated text")
Provider: Which LLM provider? (e.g., OpenAI, Anthropic)
Any extra requirements: Output format constraints, tone, special instructions

If the user already provided some of this context, skip those questions.

2. Generate the config

Read references/promptfoo-patterns.md for syntax reference, then generate the following files:

File structure

promptfooconfig.yaml
prompts/
  {prompt-name}-v1.yaml
datasets/
  test-cases.csv

`promptfooconfig.yaml`

description: "<task summary>"

prompts:
  - file://prompts/{prompt-name}-v1.yaml

providers:
  # based on user's choice

defaultTest:
  assert:
    # global assertions that apply to every test

tests: file://datasets/test-cases.csv

`prompts/{prompt-name}-v1.yaml`

- role: system
  content: |
    <hardcoded system prompt tailored to the task>
- role: user
  content: |
    {{input}}

`datasets/test-cases.csv`

input,__expected
"example input 1","expected output 1"
"example input 2","expected output 2"

Prompt writing best practices

Be specific about the role: Instead of "You are a helpful assistant", define a concrete role like "You are a senior medical billing specialist who reviews insurance claims for coding accuracy."
Define output format explicitly: Specify the exact structure — e.g., "Respond with a JSON object containing category (string), confidence (number 0-1), and reasoning (string)."
Include constraints and boundaries: State what the model should NOT do — e.g., "Do not guess if the information is not present. Return null for missing fields."
Provide few-shot examples only when necessary: Only include input/output examples in the system prompt when the task is ambiguous or the output format is hard to describe. Never generate example data for the dataset — that's what datasets/test-cases.csv is for.
Use step-by-step instructions when necessary: Only break into numbered steps for complex multi-stage tasks. Simple tasks (e.g., classification, translation) don't need step-by-step.
Set the tone and style when necessary: Only specify tone/style when the task requires a specific register (e.g., customer-facing, formal legal). Skip for technical/structured output tasks.
Handle edge cases in the prompt: Tell the model what to do with ambiguous, empty, or malformed input — e.g., "If the input is empty or unintelligible, respond with {\"error\": \"invalid_input\"}."
Keep it focused: One prompt = one task. Don't combine classification, summarization, and translation into a single prompt.

Config rules

Prompts: Store in prompts/ folder, named {name}-v1.yaml. Always use chat format (role: system + role: user). The system prompt must be hardcoded — never use {{system_prompt}} variable. Use {{input}} for the user message variable. Add more variables only if the task requires them (e.g., {{language}} for translation tasks).
Dataset: Store as CSV in datasets/ folder. Generate 3–5 realistic dummy rows covering typical inputs, edge cases, and one tricky case. Add a comment in promptfooconfig.yaml # TODO: Replace dummy data in datasets/test-cases.csv with real examples.
Assertions: Pick assertion types that match the task:
- Classification → contains or equals for expected labels
- Free-text generation → llm-rubric with a clear rubric + similar with expected output
- Structured output (JSON) → is-json + contains for required fields
- Translation → llm-rubric for quality + not-contains for source language leakage
- Always include latency threshold as a default assertion
Providers: Configure based on user's choice. Use temperature: 0 for reproducibility.

3. Write the files

Create the prompts/ and datasets/ directories, then write all three files. Tell the user:

How to install promptfoo: npx promptfoo@latest init (if not installed) or npm install -g promptfoo
How to run: npx promptfoo@latest eval
How to view results: npx promptfoo@latest view
Remind them to replace dummy data in datasets/test-cases.csv with real examples

Creating Evals from User Requests

When a user describes what they want to evaluate (e.g., "create an eval that checks if the model refuses harmful requests", "add an eval for tone consistency"), follow this workflow instead of the full init workflow.

1. Understand the eval request

Parse the user's request to determine:

What behavior to evaluate: What specific LLM behavior should be tested? (e.g., safety refusals, factual accuracy, format compliance, tone, language quality)
Scope: Is this a new standalone eval or an addition to an existing promptfooconfig.yaml?
Success criteria: What does a passing test look like? What does a failing test look like?

If the request is ambiguous, ask one clarifying question. Do not over-ask — infer reasonable defaults.

2. Generate eval components

Based on the request, generate the appropriate combination of:

Test cases (`datasets/test-cases.csv` or inline YAML)

Generate 5–10 targeted test cases that exercise the behavior being evaluated
Include positive cases (should pass) and negative cases (should fail/be caught)
Cover edge cases and boundary conditions relevant to the eval
Use realistic, diverse inputs — not trivial examples

Assertions

Choose assertion types that precisely match what the user wants to evaluate:

Eval type	Recommended assertions
Safety / refusal	`contains` ("I can't", "I'm unable"), `not-contains` (harmful content), `llm-rubric` for nuanced refusal quality
Factual accuracy	`contains` for key facts, `similar` with reference answer, `llm-rubric` for correctness
Format compliance	`is-json`, `regex` for patterns, `javascript` for structural validation
Tone / style	`llm-rubric` with tone criteria, `not-contains` for unwanted tone markers
Consistency	`similar` across paraphrased inputs, `llm-rubric` for consistency
Language quality	`llm-rubric` for grammar/fluency, `not-contains` for source language leakage
Summarization	`llm-rubric` for completeness, `javascript` for length constraints, `similar` with reference

Default assertions

Add defaultTest assertions when the eval behavior should apply globally (e.g., latency thresholds, cost limits, universal safety checks).

3. Write or update files

New eval: Create all files following the standard file structure (promptfooconfig.yaml, prompts/, datasets/)
Adding to existing eval: Read the existing promptfooconfig.yaml, then update it by adding new test cases to the dataset and/or new assertions to defaultTest. Do not overwrite existing tests — append to them.

After writing, tell the user:

What eval cases were generated and why
How to run: npx promptfoo@latest eval
How to customize: which test cases or thresholds to adjust for their specific needs