name: annotate-spans
description: >
Write effective, consistent annotations on LLM/agent spans and traces, and coach the user on annotation practice. Load this whenever you are about to record structured feedback with the batch_span_annotate tool, or when the user asks how to annotate, label, score, or review spans/traces, build a failure taxonomy, or set up human/LLM review. Do NOT load for: pure analysis with no intent to save feedback (use debug-trace), latency or cost statistics, or prompt authoring (use playground).
summary: Create consistent span or trace annotations and help design useful feedback taxonomies.
Annotating Spans and Traces
An annotation is durable, structured feedback attached to a span or trace: a name (the dimension being judged), an optional label and/or score (the outcome), and an explanation (why). Annotations are not throwaway commentary — they accumulate into a dataset the user filters, aggregates, and iterates against.
A good annotation earns its place by being useful later:
- Filterable —
annotations['answer_relevance'].label == 'fail'returns the spans you meant. - Aggregatable — counting labels across spans yields a failure rate that tells the user where to focus.
- Auditable — months later, the explanation still justifies the judgment without rerunning anything.
- Curatable — failing spans can be pulled into a dataset to drive evals or fixes.
This skill governs the judgment behind annotations. The batch_span_annotate tool description governs the mechanics (one array, ID requirements, update keying); follow both, and never contradict the tool's naming and identifier rules.
What Makes an Annotation Useful
Grounded in observed behavior, not generic quality vibes. Annotate what actually went wrong or right in this span. "Cited a refund policy that does not exist in the retrieved context" beats a free-floating
hallucination_score: 0.3. Generic dimensions likehelpfulnessorcoherenceare rarely grounded in the application's real failure modes — prefer names that point at a concrete behavior.One dimension per annotation.
nameis the rubric dimension; the outcome lives inlabel/score. Usename: "tool_selection",label: "incorrect"— notname: "wrong_tool". If you find yourself judging two things at once (e.g., retrieval relevance and answer faithfulness), write two annotations.Target the most specific responsible span. Annotate the LLM span for model output, the tool span for tool behavior, the retriever span for retrieval quality. Reserve root agent/chain spans for genuinely end-to-end judgments (task success, trajectory). A faithfulness failure pinned to the right LLM span is actionable; the same label on the root span forces the user to hunt.
Judge the first failure, not every downstream symptom. Errors cascade — bad retrieval produces a bad answer. Annotate the root cause where it occurred. Add a separate annotation downstream only when it reveals an independent problem, not a consequence of the first.
Prefer crisp labels over fuzzy scores. A binary or small categorical label (
pass/fail,relevant/irrelevant,correct/partial/incorrect) is easy to apply consistently and easy to aggregate. Use a numericscoreonly when the scale is genuinely meaningful and defined; put the rubric, scale, or threshold inmetadataso the number is interpretable later.Explanations are specific observations, not restatements. Write what you saw, citing the evidence. Good: "Returned chunks about onboarding; user asked about cancellation — no relevant chunk retrieved." Weak: "The retrieval was bad." Always include an explanation for any score, any failure, any unclear label, or any judgment the user might want to revisit.
Be consistent across spans. The same dimension must use the same
nameand the same label vocabulary everywhere, or filtering and rate computation break. Decide the vocabulary once, then apply it uniformly. Keep names stable across runs (no_v2/_newsuffixes).Set
annotatorKindhonestly.LLMfor your own judgment,HUMANonly when recording feedback the user explicitly gave,CODEfor deterministic checks. Don't record your own opinion asHUMAN.
Mode A: Coaching the User
When the user asks how to annotate (rather than asking you to do it), teach the process rather than handing over a fixed rubric:
- Start from failures they have actually seen. Resist proposing a polished taxonomy up front — a pre-baked list causes confirmation bias. Ask what's going wrong, or use the
debug-traceskill to surface real failure modes first. - Open-code before naming. Encourage free-form notes on a handful of traces ("what's the first thing wrong here?") before committing to category names.
- Axial-code into a small vocabulary. Group similar notes into 5–10 named, mutually distinct, actionable categories. Each should be specific enough that two reviewers would label the same span the same way.
- Define the label set per category. Usually binary. Write a one-sentence definition so the boundary is unambiguous.
- Apply, then aggregate. Label a representative sample consistently, then compute failure rates to prioritize. Highest-frequency × highest-impact wins.
- Fix-first. Remind the user that many failures (missing prompt instruction, missing tool, retrieval bug) are better fixed than measured. Reserve standing annotations/evals for failures they will iterate on repeatedly or that need a guardrail.
Explain why a convention matters ("stable names let you trend this over time") rather than only stating it.
Mode B: Writing Annotations Yourself
When the user asks you to save annotations:
- Confirm scope and intent. Only annotate when the user wants feedback persisted, not during ordinary analysis. Know which spans and which dimension(s) are in scope. If no failure modes are established yet, diagnose first with the
debug-traceskill, then return here to persist the results — don't re-derive a taxonomy. - Inspect before judging. Read the actual span input/output (and relevant parent/child spans for context). Never annotate from span status codes alone — a success status can mask an error in attributes, and an exception can be expected behavior.
- Pick the dimension(s) and a fixed label vocabulary before writing, so every span in the batch is judged on the same scale. If you're judging more than one dimension, decide each one's vocabulary up front.
- Annotate the right span for each judgment (principle 3) and the first failure (principle 4). Use only real IDs from context or prior tool results — never guess span IDs.
- Write a grounded explanation per annotation citing the specific evidence in that span.
- Batch the related annotations into one
batch_span_annotatecall and pick theidentifierto match your intent (annotations are keyed by(name, span, identifier)):- Use a stable identifier that names you as the author — e.g.
pxi— when the judgment should be updatable: re-reviewing the same span and dimension overwrites the prior annotation instead of duplicating it. - Use a descriptive, run-scoped identifier — e.g.
pxi:tool-misuse-2026-05-29— for a discrete review batch you may want to query or revert as a unit later. This mirrors the Phoenix CLI'scoding-run:<topic>-<date>convention: a descriptive id carries meaning for whoever opens the data later, far better than an opaque constant. - Either way, the author prefix keeps your annotations distinguishable from human or other-evaluator annotations on the same span. Do not reuse one identifier for two unrelated runs you'd want to revert separately.
- Use a stable identifier that names you as the author — e.g.
- Report back what you annotated, the dimension(s) and vocabulary used, and the distribution of labels — and link to a representative annotated span so the user can verify. Build links from the OpenTelemetry hex IDs in the
spanId/traceIdGraphQL fields (never the Relayidfield):[short label](/redirects/spans/<spanId>)for a span, or[short label](/redirects/traces/<traceId>)when no single span captures the judgment. A span link lands the user on the exact annotated node.
Anti-Patterns
- Generic score columns (
hallucination_score,quality_score) not tied to a concrete observed behavior. - Outcome baked into the name (
name: "passed_relevance") instead ofname: "relevance",label: "pass". - Inconsistent vocabulary across spans (
failhere,badthere,0elsewhere) — kills filtering and aggregation. - Annotating every downstream symptom of a single root failure.
- Labeling without an explanation when the judgment is anything but trivially obvious.
- Annotating from status codes or metadata without reading the actual span content.
- Recording your own judgment as
HUMAN, or guessing span IDs. - Inventing a taxonomy before reading traces, when coaching the user.