playground - SKILL.md Agent Skill

name: playground
description: Author, edit, or iterate on prompts in the Phoenix prompt playground, including running experiments over a dataset. Load before any playground tool call, including single-shot prompt rewrites.
summary: Author, edit, run, compare, and improve prompts in the Phoenix playground.

Prompt Playground

The prompt playground is a tool for authoring and optimizing prompts. It supports two different ways of working: fast manual prompt iteration without a dataset, and dataset-backed prompt experimentation with evaluators and experiments. Choose the workflow that matches the user's current goal and the UI context they have mounted.

Workflow: Create And Iterate Without A Dataset

Use this workflow when the user wants to draft, rewrite, or manually improve a prompt and no dataset-backed evaluation loop is in scope.

Clarify the task the prompt must perform: input variables, expected output shape, audience, constraints, and examples of good or bad behavior when available.
If a playground prompt already exists, call read_prompt_instance before proposing changes so you have the current messages, message IDs, labels, and revision.
Draft or revise the prompt so it clearly states the task, required context, output contract, and success criteria. Keep the prompt directly tied to the user's stated goal.
Use edit_prompt_instance for changes to the mounted prompt so the user can review the diff before accepting it.
Use add_prompt_instance when the user wants a fresh comparison instance that starts from the default prompt messages. Use clone_prompt_instance when the comparison should start from existing prompt content. Discuss variants by their alphabetic labels, but pass numeric instance IDs to tools. After adding an instance, use the returned addedInstance snapshot for follow-up edits.
Use set_variable_values when the user provides manual values for prompt template variables.
Use set_playground_repetitions before running when the user is concerned about flakes, structured output consistency, tool-call reliability, or whether the prompt is ready to save. LLM outputs are nondeterministic; repetitions build confidence by checking the same task across multiple runs instead of trusting one successful response.
Call run_playground only when the user asks to run, try, test, or compare the current prompt. Treat the output as qualitative feedback rather than dataset-backed evidence.
After the run finishes, call read_playground_output to inspect raw output and get the traceId for trace analysis when needed. If the run used multiple repetitions, inspect every repetition before summarizing confidence or recommending that the user save.
Call save_prompt only when the user explicitly asks to save or confirms that the current prompt should be persisted. For a first-time save of an unsaved prompt, omit name unless the user provided one; the tool will derive a valid Phoenix prompt name from the prompt content. Always pass a save description; it should read like a clear, short git commit message. Treat tags like releases and do not promote tags unless the user asks.
Inspect the output with the user, identify the next concrete improvement, and repeat the edit or comparison loop until the prompt is useful for the task.
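
The repetition check in the steps above can be sketched as a small helper. This is a minimal sketch, assuming the raw output text of each repetition has already been collected into a list of strings. The function name summarize_repetitions and the "parses as a JSON object" success criterion are illustrative assumptions, not part of read_playground_output.

```python
import json
from typing import Dict, List


def summarize_repetitions(raw_outputs: List[str]) -> Dict[str, object]:
    """Count how many repetitions produced a parseable JSON object.

    Assumption: each item is the raw text of one repetition, and "success"
    means the text parses as a JSON object. Swap in your own check as needed.
    """
    failures = []
    for index, text in enumerate(raw_outputs):
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            failures.append(index)
            continue
        if not isinstance(parsed, dict):
            # Valid JSON, but not the object shape this check expects.
            failures.append(index)
    total = len(raw_outputs)
    return {
        "total": total,
        "passed": total - len(failures),
        "failed_indexes": failures,
        # An empty run is never treated as a pass.
        "all_passed": total > 0 and not failures,
    }


example = summarize_repetitions(['{"label": "positive"}', "not json", "[1, 2]"])
print(example)
```

A single failing repetition is worth reporting to the user before recommending a save, since one successful response says little about consistency.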

Workflow: Iterate Over A Dataset With Evaluators And Experiments

Use this workflow when the user wants evidence that a prompt is improving across a dataset, or when they are comparing prompt variants using evaluator results. Running a prompt over a dataset is implicitly an experiment, so consult the experiments skill before designing the run, not only after results arrive. That skill owns the iteration methodology end to end (what to stage at creation, how to read and compare results, when an evaluator is warranted), and the evaluators skill owns the design of the evaluators that score the runs. This workflow covers only the playground mechanics of setting up and starting a recorded run.

Load the dataset with load_dataset if it isn't already loaded. If the user named a dataset but no split and the dataset has splits, list the splits and ask whether to scope to one or load the whole dataset. Then load once.
Make sure the starting prompt is well formed before running it: it should define the task, relevant variables, output format, and any constraints needed for consistent evaluation.
Use set_playground_experiment_recording before running when the user wants the next dataset-backed playground run recorded, persisted, or saved as an experiment, or wants to name, describe, or attach metadata (such as a hypothesis or the variable being changed) to the next experiment. Set recordExperiments to false only when the user explicitly asks for a temporary, throwaway, unrecorded, or ephemeral run. Call this tool only when the requested recording mode or scaffold fields differ from the advertised recordExperiments and nextExperimentScaffold values; the staged scaffold applies to that one run and is consumed when it starts. This is separate from save_prompt, which saves prompt versions rather than run results.
Use set_playground_repetitions before running when the user needs confidence across repeated attempts, especially for flaky behavior, structured outputs, or tool-call correctness.
Run the playground over the dataset. When recording is enabled, each prompt instance run over a dataset is captured as an experiment, with outputs and evaluator annotations available for review.
To read the experiment results and decide whether a change helped, follow the experiments skill; to create the next candidate, use edit_prompt_instance, add_prompt_instance, or clone_prompt_instance (add_prompt_instance starts from the default prompt messages, clone_prompt_instance from existing prompt content), then rerun.
Use save_prompt to save a prompt as a new version only after the evidence shows an improvement or the user explicitly accepts the tradeoff. For unsaved prompts, the tool can create the Phoenix prompt directly without asking for a name unless the user cares about the exact name.
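
The rule for when to call set_playground_experiment_recording can be sketched as a comparison between the advertised state and the requested state. This is a minimal sketch: the function name needs_recording_update and the scaffold keys used in the example (name, metadata, hypothesis) are hypothetical, chosen only to illustrate the comparison.

```python
from typing import Any, Dict, Optional


def needs_recording_update(
    advertised_record: bool,
    advertised_scaffold: Optional[Dict[str, Any]],
    requested_record: bool,
    requested_scaffold: Optional[Dict[str, Any]],
) -> bool:
    """Return True when the requested recording state differs from the advertised one."""
    if advertised_record != requested_record:
        return True
    # Treat a missing scaffold and an empty scaffold as the same state.
    return (advertised_scaffold or {}) != (requested_scaffold or {})


# Recording is already on and no scaffold is requested: no call needed.
unchanged = needs_recording_update(True, None, True, {})

# The user wants to attach a hypothesis to the next run: a call is needed.
changed = needs_recording_update(
    True,
    None,
    True,
    {"name": "shorter-system-prompt", "metadata": {"hypothesis": "less verbose"}},
)
print(unchanged, changed)
```

Because the staged scaffold is consumed when the run starts, the comparison has to be repeated before each run instead of being cached.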

Reading experiment results

When an instance carries an experimentId, read its cost and evaluator scores with phoenix-gql:

phoenix-gql --vars '{"experimentId":"<id>"}' 'query($experimentId: ID!){ node(id:$experimentId){ ...on Experiment { runCount expectedRunCount job{status} costSummary{total{cost tokens}} annotationSummaries{annotationName meanScore count errorCount} } } }'

An experimentId only means the experiment is queryable, not that the run finished — trust the summaries as final only when job.status is COMPLETED or runCount == expectedRunCount. To compare reruns, re-query earlier experiment IDs from the conversation and diff their summaries. Experiments from unrecorded runs are ephemeral and the server sweeps them ~24h after their last update; a freshly surfaced experimentId is well within that window, but an id re-queried from much earlier in a long session may no longer resolve.
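
The finality check and the rerun comparison described above can be sketched as two helpers. This is a minimal sketch, assuming each argument is the node object from the query response, already parsed from JSON into a dict. The helper names is_final and diff_mean_scores are illustrative, and skipping annotations whose meanScore is missing is a choice made for this sketch.

```python
from typing import Any, Dict


def is_final(experiment: Dict[str, Any]) -> bool:
    """True when job.status is COMPLETED or runCount equals expectedRunCount."""
    job = experiment.get("job") or {}
    if job.get("status") == "COMPLETED":
        return True
    run_count = experiment.get("runCount")
    expected = experiment.get("expectedRunCount")
    return run_count is not None and run_count == expected


def diff_mean_scores(before: Dict[str, Any], after: Dict[str, Any]) -> Dict[str, float]:
    """Return after-minus-before mean scores for annotations present in both."""

    def scores(experiment: Dict[str, Any]) -> Dict[str, float]:
        return {
            summary["annotationName"]: summary["meanScore"]
            for summary in experiment.get("annotationSummaries") or []
            if summary.get("meanScore") is not None
        }

    old, new = scores(before), scores(after)
    return {name: new[name] - old[name] for name in sorted(old.keys() & new.keys())}


baseline = {
    "runCount": 10,
    "expectedRunCount": 10,
    "job": {"status": "COMPLETED"},
    "annotationSummaries": [{"annotationName": "correctness", "meanScore": 0.5}],
}
candidate = {
    "runCount": 3,
    "expectedRunCount": 10,
    "job": {"status": "RUNNING"},
    "annotationSummaries": [{"annotationName": "correctness", "meanScore": 0.75}],
}
print(is_final(baseline), is_final(candidate), diff_mean_scores(baseline, candidate))
```

In this example the candidate is still running, so the score difference should be treated as provisional until is_final returns True for both experiments.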

Workflow: Author, Refine, Or Remove A Function Tool

Use this workflow when the user wants the model to be able to call a function/tool from the prompt, when they want to refine the signature of an existing one, or when they want to remove a tool. Function tools are JSON-Schema function definitions stored on the playground prompt instance (alongside messages and model config). They are the things the model can "call" during a run.

Call read_prompt_tools before doing anything else. The result gives you the current tool list, each tool's id and kind, and a revision token. Use the existing ids and names to decide whether you should update an existing tool, create a new one, or delete one.
If the user described a function in words, propose a concrete JSON Schema for it. Default to lowercase snake_case parameter names and a {"type":"object","properties":{...},"required":[...]} shape unless the user specifies otherwise.
Call write_prompt_tools with the latest revision. Put every change in a single call: tools is an array of creates and updates (omit id to create, pass an existing id to patch, and only the fields you include change), and deleteToolIds is a list of ids to remove. Deletes may target raw vendor tools too, even though writes can't. The batch is all-or-nothing: if any change is invalid (an id that does not match an existing tool, a raw tool on the write path, or the same id both created/updated and deleted), nothing is applied and the error explains which change failed. Deleting the tool that is the forced tool choice is allowed. In that case the choice is reset to auto and reported back; mention that to the user.
After the write, briefly summarize what changed in plain English (which tools were created, updated, or deleted) so the user knows what to look for in the tool editor. If you created tools, tell them the new ids.
If the user wants the model to use the new tool in a run, call run_playground and then read_playground_output to see whether the model actually invoked it.
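
The batch rules in the steps above can be checked locally before calling write_prompt_tools. This is a minimal sketch of a client-side pre-check, not the server's validation. It assumes the read snapshot is available as a list of dicts with id and kind keys, and the kind value "function" for editable tools is an assumption (only "raw" is named in this skill). The helper name precheck_batch is hypothetical.

```python
from typing import Any, Dict, List


def precheck_batch(payload: Dict[str, Any], snapshot: List[Dict[str, Any]]) -> List[str]:
    """Return a list of problems; an empty list means the batch looks sendable."""
    kinds_by_id = {tool["id"]: tool.get("kind") for tool in snapshot}
    delete_ids = set(payload.get("deleteToolIds", []))
    problems: List[str] = []

    for entry in payload.get("tools", []):
        tool_id = entry.get("id")
        if tool_id is None:
            continue  # No id means this entry creates a new tool.
        if tool_id not in kinds_by_id:
            problems.append(f"tool id {tool_id} is not in the read snapshot")
        elif kinds_by_id[tool_id] == "raw":
            problems.append(f"tool id {tool_id} is a raw tool and cannot be written")
        if tool_id in delete_ids:
            problems.append(f"tool id {tool_id} is both updated and deleted")

    for tool_id in sorted(delete_ids):
        # Raw tools may be deleted, so only unknown ids are flagged here.
        if tool_id not in kinds_by_id:
            problems.append(f"delete id {tool_id} is not in the read snapshot")

    return problems


snapshot = [{"id": 3, "kind": "function"}, {"id": 4, "kind": "raw"}]
conflicting = {"tools": [{"id": 3, "name": "get_weather"}], "deleteToolIds": [3]}
delete_raw = {"tools": [{"name": "get_time"}], "deleteToolIds": [4]}
print(precheck_batch(conflicting, snapshot))
print(precheck_batch(delete_raw, snapshot))
```

Since the real batch is all-or-nothing, catching any one of these problems locally avoids a rejected call and a wasted revision check.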

Few-shot examples

These are concrete, runnable shapes — treat them as templates, not as fixed prompts. Always pass the latest revision returned by read_prompt_tools.

Create a brand-new tool. One entry with no id.

{
  "instanceId": 1,
  "expectedRevision": "prompt-tools-abc",
  "tools": [
    {
      "name": "get_weather",
      "description": "Look up the current weather for a city.",
      "parameters": {
        "type": "object",
        "properties": {
          "city": { "type": "string", "description": "City name, e.g. \"San Francisco\"." },
          "units": { "type": "string", "enum": ["c", "f"], "description": "Temperature units." }
        },
        "required": ["city"]
      }
    }
  ]
}
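
A create entry like the one above can also be assembled programmatically. This is a minimal sketch, assuming the default schema shape and snake_case parameter names described in the workflow. The helper name build_function_tool and the regular expression used to define snake_case are illustrative choices, not part of the playground tools.

```python
import re
from typing import Any, Dict, List, Optional

# Assumed definition of snake_case: lowercase words joined by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def build_function_tool(
    name: str,
    description: Optional[str],
    properties: Dict[str, Dict[str, Any]],
    required: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build one entry for the tools array, with no id so it is a create."""
    required_names = list(required or [])
    for param in properties:
        if not SNAKE_CASE.match(param):
            raise ValueError(f"parameter {param!r} is not lowercase snake_case")
    missing = [param for param in required_names if param not in properties]
    if missing:
        raise ValueError(f"required parameters not defined: {missing}")

    tool: Dict[str, Any] = {
        "name": name,
        "parameters": {
            "type": "object",
            "properties": dict(properties),
            "required": required_names,
        },
    }
    if description:
        tool["description"] = description
    return tool


weather_tool = build_function_tool(
    "get_weather",
    "Look up the current weather for a city.",
    {"city": {"type": "string"}, "units": {"type": "string", "enum": ["c", "f"]}},
    required=["city"],
)
print(weather_tool)
```

The returned dict goes into the tools array of a write_prompt_tools payload alongside instanceId and expectedRevision.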

Create several tools at once. Put every tool in the tools array — one call, one revision check. Prefer this over issuing one call per tool.

{
  "instanceId": 1,
  "expectedRevision": "prompt-tools-abc",
  "tools": [
    {
      "name": "get_weather",
      "parameters": {
        "type": "object",
        "properties": { "city": { "type": "string" } },
        "required": ["city"]
      }
    },
    {
      "name": "get_forecast",
      "parameters": {
        "type": "object",
        "properties": {
          "city": { "type": "string" },
          "days": { "type": "integer" }
        },
        "required": ["city", "days"]
      }
    }
  ]
}

Add a required parameter to an existing tool. Pass the existing id and the full new parameters schema. Patch semantics — name is required even if unchanged.

{
  "instanceId": 1,
  "expectedRevision": "prompt-tools-abc",
  "tools": [
    {
      "id": 3,
      "name": "get_weather",
      "parameters": {
        "type": "object",
        "properties": {
          "city": { "type": "string" },
          "units": { "type": "string", "enum": ["c", "f"] }
        },
        "required": ["city", "units"]
      }
    }
  ]
}

Create one tool and patch another in the same batch. Mix entries with and without id.

{
  "instanceId": 1,
  "expectedRevision": "prompt-tools-abc",
  "tools": [
    {
      "name": "get_time",
      "parameters": {
        "type": "object",
        "properties": { "timezone": { "type": "string" } },
        "required": ["timezone"]
      }
    },
    {
      "id": 3,
      "name": "get_weather",
      "description": "Look up the current weather for a city. Returns temperature, humidity, and conditions."
    }
  ]
}

Define a tool that captures structured output as a categorical choice. When the model calls the tool, it must pick one of the enum labels and may optionally add an explanation.

{
  "instanceId": 1,
  "expectedRevision": "prompt-tools-abc",
  "tools": [
    {
      "name": "classify_sentiment",
      "description": "Classify the sentiment of the input as positive, negative, or neutral.",
      "parameters": {
        "type": "object",
        "properties": {
          "label": {
            "type": "string",
            "enum": ["positive", "negative", "neutral"],
            "description": "The sentiment classification."
          },
          "explanation": {
            "type": "string",
            "description": "Short justification for the label."
          }
        },
        "required": ["label"]
      }
    }
  ]
}

Delete a tool — and optionally swap in a replacement in the same batch. deleteToolIds removes by id; combine it with tools to delete and add atomically. Deletes may target raw vendor tools.

{
  "instanceId": 1,
  "expectedRevision": "prompt-tools-abc",
  "deleteToolIds": [3],
  "tools": [
    {
      "name": "get_forecast",
      "parameters": {
        "type": "object",
        "properties": { "city": { "type": "string" } },
        "required": ["city"]
      }
    }
  ]
}

Things to avoid

Don't call write_prompt_tools without first calling read_prompt_tools in the same turn. Otherwise the expectedRevision will be stale and the write will be rejected.
Don't try to write a tool whose kind was raw in the read snapshot. Vendor passthrough tools (e.g. provider builtins like web_search) are not editable through PXI — tell the user to author those in the playground tool editor. A raw entry in tools rejects the whole batch. (You can delete a raw tool via deleteToolIds, though.)
Deleting the tool that is the prompt's forced tool choice (tool_choice = specific function) is allowed — the tool choice is automatically reset to auto (zero-or-more) and the result reports resetToolChoiceFrom. Tell the user, since it changes how the model picks tools at run time.
Don't invent tool ids. An entry's id (and every deleteToolIds id) comes from a read snapshot, or is omitted for create. You cannot reference an id created earlier in the same batch.
Don't issue multiple write_prompt_tools calls in a row without re-reading the revision between them. Each successful write or delete changes the revision. Batch the changes into one call.