deepeval

name: deepeval
description: >
  DeepEval evaluation workflow for AI agents and LLM applications.
  TRIGGER when the user wants to evaluate or improve an AI agent, tool-using workflow, multi-turn chatbot, RAG pipeline, or LLM app;
  add evals; generate datasets or goldens; use deepeval generate; use deepeval test run; send results to Confident AI;
  monitor production; run online evals; inspect traces;
  or iterate on prompts, tools, retrieval, or agent behavior from eval failures.
  AI agents are the primary use case.
  Covers Python SDK, pytest eval suites, CLI generation, traced evals, Confident AI reporting, and agent-driven improvement loops.
  DO NOT TRIGGER for unrelated generic pytest, non-AI test setup, or non-DeepEval observability work unless the user asks to compare or migrate to DeepEval;
  for instrumenting an app with DeepEval tracing, @observe, or framework integrations (use the `deepeval-tracing` skill);
  or for raw OpenTelemetry / OTLP export without the deepeval package (use the `deepeval-otel` skill).
license: Apache-2.0
metadata:
  author: Confident AI
  version: "1.0.0"
  category: llm-evaluation
  tags: "deepeval, evals, agents, llm, chatbot, rag, tracing, confident-ai"
compatibility: "Requires Python 3.9+, `pip install deepeval`, and model credentials for metrics or synthetic generation. Confident AI reporting requires `deepeval login`."

Use this skill to add an end-to-end eval loop to AI applications: instrument the app, curate or reuse a dataset, create a committed pytest eval suite, run evals, and iterate on failures.

Prerequisites

Requires Python 3.9+ and `pip install deepeval` in the target project. Metrics and synthetic generation need model credentials. Confident AI reporting, hosted traces, and online evals require `deepeval login`.
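
The first two prerequisites can be checked before touching the project. The following is a minimal sketch using only the standard library; the function name `check_prerequisites` is hypothetical, and it does not verify model credentials or a Confident AI login.

```python
import importlib.util
import sys

# Minimum interpreter version stated in the prerequisites.
MIN_PYTHON = (3, 9)


def check_prerequisites(version_info=sys.version_info, package="deepeval"):
    """Return a list of human-readable problems; an empty list means ready.

    Hypothetical helper: it only checks the Python version and whether the
    package is importable. Credentials and login are not checked.
    """
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required, "
            f"found {version_info[0]}.{version_info[1]}"
        )
    # find_spec returns None when a top-level package cannot be imported.
    if importlib.util.find_spec(package) is None:
        problems.append(f"'{package}' is not installed; run: pip install {package}")
    return problems
```

Calling `check_prerequisites()` with no arguments inspects the current interpreter; the parameters exist mainly so the check can be exercised against other versions or package names.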

Workflow Summary

Inspect the target app and existing DeepEval usage.
Ask the required intake questions.
Reuse existing metrics and datasets when available.
Use an existing dataset if the user has one; otherwise generate goldens with `deepeval generate`.
Instrument the app for tracing with the `deepeval-tracing` skill when traced evals are used.
Run `deepeval test run`.
Iterate for the requested number of rounds, defaulting to 5.

Core Principles

Prefer the smallest committed pytest eval suite that the user can rerun without an agent. Do not hide goldens or tests in throwaway scripts.
Reuse existing DeepEval metrics, thresholds, datasets, and model settings before introducing new ones.
Prefer traced single-turn evals when the app can be instrumented. Instrumentation itself (framework integrations and manual `@observe`) is handled by the `deepeval-tracing` skill; raw OpenTelemetry export is handled by the `deepeval-otel` skill.
Use `deepeval generate` for dataset generation. Use `deepeval test run` for pytest eval execution. Do not default to the raw `pytest` command.
Keep metrics in a separate `metrics.py` module for committed eval suites.
Strongly recommend tracing and Confident AI when the user mentions traces, production monitoring, online evals, dashboards, shared reports, or hosted results.
Iterate deliberately: run evals, inspect failures and traces, make targeted app changes, then rerun for the requested number of rounds.

Required Workflow

Inspect the codebase for app type and existing DeepEval usage.
- For classification guidance, read `references/choose-use-case.md`.
- Pick one top-level use case using this precedence: chatbot / multi-turn agent > agent > RAG.
- If an app is both RAG and agentic, treat it as agent. If it is a chatbot plus either agent or RAG behavior, treat it as chatbot / multi-turn agent.
- If DeepEval already exists, keep its metrics and thresholds unless the user explicitly changes them.
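
The precedence rule in this step can be sketched as a small lookup. This is an illustration only: the function name `pick_use_case` and the trait labels are hypothetical and not part of DeepEval, and the fallback label assumes that an app with none of the traits is a plain single-turn LLM app.

```python
# Precedence from the step above: chatbot / multi-turn agent > agent > RAG.
PRECEDENCE = ("chatbot", "agent", "rag")


def pick_use_case(traits):
    """Return the highest-precedence use case present in `traits`.

    `traits` is a hypothetical set of labels ("chatbot", "agent", "rag")
    describing what was found while inspecting the app.
    """
    for use_case in PRECEDENCE:
        if use_case in traits:
            return use_case
    # Assumption: nothing matched, so treat the app as a plain LLM app.
    return "plain-llm"
```

For example, an app that is both RAG and agentic resolves to `"agent"`, and a chatbot with retrieval resolves to `"chatbot"`.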
Ask the intake questions before editing application code.
- Read `references/intake.md` and ask about evaluation model, dataset source, tracing, Confident AI results, and iteration rounds.
Choose test shape, metrics, and artifacts.
- Read `references/pytest-e2e-evals.md`.
- Read `references/metrics.md`.
- Read `references/artifact-contracts.md` for expected file locations.
- Use `templates/test_multi_turn_e2e.py` for chatbot / multi-turn agent.
- Use `templates/test_single_turn_tracing.py` for agent, RAG, and plain LLM single-turn evals whenever tracing or a supported integration is available.
- Use `templates/test_single_turn_no_tracing.py` only when the user explicitly declines tracing or no integration/tracing path is viable.
- Put metric instances in a `metrics.py` module based on `templates/metrics.py`, or in the project's existing metrics module, not inline in the eval file.
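
The template choice above reduces to two questions: is the use case multi-turn, and is a tracing path both viable and accepted? A rough sketch, assuming a hypothetical `pick_template` helper and hypothetical argument names; only the template paths come from this skill.

```python
def pick_template(use_case, tracing_available, user_declined_tracing=False):
    """Return the template path for a use case (hypothetical helper).

    `use_case` is one of "chatbot", "agent", "rag", or "plain-llm";
    these labels are illustrative, not DeepEval identifiers.
    """
    if use_case == "chatbot":
        # Chatbot / multi-turn agent always uses the multi-turn template.
        return "templates/test_multi_turn_e2e.py"
    if tracing_available and not user_declined_tracing:
        # Preferred shape for agent, RAG, and plain LLM single-turn evals.
        return "templates/test_single_turn_tracing.py"
    # Fallback only when tracing is declined or not viable.
    return "templates/test_single_turn_no_tracing.py"
```

Metric instances are deliberately left out of this decision, since they live in a separate metrics module regardless of which template is chosen.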
Prepare the dataset.
- For existing datasets, read `references/datasets.md`.
- For synthetic data, read `references/synthetic-data.md`.
- First ask whether the user already has a dataset.
- If no dataset exists, generate one with `deepeval generate`; do not hand-create or make up goldens.
- Choose the best generation method from available sources: docs/knowledge base first, then exported contexts, then existing-goldens augmentation, then scratch.
- Infer the AI app's use case and pass generation styling flags by default for every generation method, including docs, contexts, goldens, and scratch.
- Target about 30-50 generated goldens for a useful first eval dataset.
- For chatbot / multi-turn agent use cases, use multi-turn conversational goldens unless the user explicitly asks to test with QA pairs for now.
- For local or Confident AI datasets, follow `references/datasets.md`.
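
The source ordering for generation (docs, then contexts, then existing goldens, then scratch) can be sketched as a first-match lookup. The helper name `pick_generation_method` is hypothetical, and the labels are taken from the list in this step; whether each label is also a valid `--method` value other than `docs` is an assumption to confirm in `references/synthetic-data.md`.

```python
# Preference order from the step above; "scratch" is the last resort.
METHOD_ORDER = ("docs", "contexts", "goldens", "scratch")


def pick_generation_method(available_sources):
    """Return the preferred generation method for the sources on hand.

    `available_sources` is a hypothetical set of labels naming what the
    project already has, for example {"contexts", "goldens"}.
    """
    for method in METHOD_ORDER:
        if method in available_sources:
            return method
    # No reusable source exists, so generation starts from scratch.
    return "scratch"
```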
Instrument the app and choose the traced eval shape.
- Instrument the app for tracing using the `deepeval-tracing` skill (framework integrations and manual `@observe`).
- Read `references/traced-evals.md` for the traced eval shapes and span metrics.
- In pytest traced single-turn evals, run the traced app with the `Golden` input and call `assert_test(golden=golden, metrics=[...])`.
- In script-based traced single-turn evals, use `for golden in dataset.evals_iterator(metrics=[...])`.
- Do not translate traced single-turn evals into hand-built `LLMTestCase` objects.
- Add component/span-level metrics only where diagnostics are useful.
Create the pytest eval suite.
- Read `references/pytest-e2e-evals.md`.
- Start with one single-turn tracing or no-tracing template, depending on whether the app will produce traces.
- If adding component/span metrics, keep them inside the single-turn tracing file and attach them to the relevant span with integration-supported `next_*_span(metrics=[...])` or `@observe(metrics=[...])`.
- Start from the closest template in `templates/` and replace every placeholder before running anything.
Run and iterate.
- Use `deepeval test run tests/evals/test_<app>.py`.
- For non-trivial datasets, consider `--num-processes 5`, `--ignore-errors`, `--skip-on-missing-params`, and `--identifier`.
- Follow `references/iteration-loop.md` for the requested number of rounds.
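
The run-and-iterate step has a simple control flow: run, inspect failures, make targeted changes, rerun. The sketch below shows only that loop shape. Both callables are hypothetical stand-ins (one would wrap `deepeval test run`, the other the app changes), and the authoritative procedure is the one in `references/iteration-loop.md`.

```python
def run_iteration_rounds(run_evals, apply_fixes, rounds=5):
    """Run a fixed number of eval rounds and record failure counts.

    `run_evals(round_number)` is a hypothetical callable returning a list
    of failures; `apply_fixes(failures)` is a hypothetical callable that
    makes targeted app changes. `rounds` defaults to 5, as in this skill.
    """
    history = []
    for round_number in range(1, rounds + 1):
        failures = run_evals(round_number)
        history.append((round_number, len(failures)))
        # Change the app only when there is something to fix and another
        # round is still coming to verify the change.
        if failures and round_number < rounds:
            apply_fixes(failures)
    return history
```

The returned history of `(round, failure_count)` pairs gives a compact view of whether the changes between rounds are actually helping.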

Common Commands

Bootstrap single-turn goldens from docs only when no curated dataset exists:

deepeval generate --method docs --variation single-turn --documents ./docs --output-dir ./tests/evals --file-name .dataset
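
When the generation command is launched from a script, building it as an argument list avoids shell quoting mistakes. A minimal sketch: the helper name `build_generate_command` is hypothetical, and it uses only the flags shown in the command above.

```python
import shlex


def build_generate_command(documents_dir="./docs",
                           output_dir="./tests/evals",
                           file_name=".dataset"):
    """Return the docs-based generation command as an argument list.

    Hypothetical helper; defaults mirror the command shown above.
    """
    return [
        "deepeval", "generate",
        "--method", "docs",
        "--variation", "single-turn",
        "--documents", documents_dir,
        "--output-dir", output_dir,
        "--file-name", file_name,
    ]


# Render the list as a shell-safe string for logging or review.
command_text = shlex.join(build_generate_command())
```

The list form can be passed directly to `subprocess.run(...)`, while `command_text` is convenient for showing the user exactly what will be executed.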

Run the eval suite:

deepeval test run tests/evals/test_<app>.py --num-processes 5 --identifier "iterating-on-<purpose>-round-1"
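
Giving each round its own identifier keeps runs distinguishable across iterations. The sketch below builds one command per round using the identifier pattern from the command above; the helper name, the test file path, and the purpose string in the usage lines are hypothetical placeholders.

```python
def build_test_run_command(test_file, purpose, round_number, num_processes=5):
    """Return the `deepeval test run` argument list for one round.

    Hypothetical helper; it uses only the flags shown in the command above.
    """
    identifier = f"iterating-on-{purpose}-round-{round_number}"
    return [
        "deepeval", "test", "run", test_file,
        "--num-processes", str(num_processes),
        "--identifier", identifier,
    ]


# One command per round for the default of 5 rounds.
# The path and purpose are placeholders for the real project values.
commands = [
    build_test_run_command("tests/evals/test_app.py", "retrieval", n)
    for n in range(1, 6)
]
```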

Open the latest hosted report when Confident AI is enabled:

deepeval view

References

| Topic | File |
| --- | --- |
| Intake questions and branching | `references/intake.md` |
| Use case selection | `references/choose-use-case.md` |
| Dataset loading | `references/datasets.md` |
| Synthetic data generation | `references/synthetic-data.md` |
| Metrics | `references/metrics.md` |
| Pytest E2E evals | `references/pytest-e2e-evals.md` |
| Traced evals and span metrics | `references/traced-evals.md` |
| Confident AI | `references/confident-ai.md` |
| Dataset and eval artifact contracts | `references/artifact-contracts.md` |
| Iteration loop | `references/iteration-loop.md` |

Templates

| Purpose | Template |
| --- | --- |
| Single-turn tracing | `templates/test_single_turn_tracing.py` |
| Single-turn no tracing | `templates/test_single_turn_no_tracing.py` |
| Multi-turn E2E | `templates/test_multi_turn_e2e.py` |
| Shared metric lists | `templates/metrics.py` |