name: playground
description: Author, edit, or iterate on prompts in the Phoenix prompt playground, including running experiments over a dataset. Load before any playground tool call, including single-shot prompt rewrites.
summary: Author, edit, run, compare, and improve prompts in the Phoenix playground.
Prompt Playground
The prompt playground is a tool for authoring and optimizing prompts. It supports two different ways of working: fast manual prompt iteration without a dataset, and dataset-backed prompt experimentation with evaluators and experiments. Choose the workflow that matches the user's current goal and the UI context they have mounted.
Workflow: Create And Iterate Without A Dataset
Use this workflow when the user wants to draft, rewrite, or manually improve a prompt and no dataset-backed evaluation loop is in scope.
- Clarify the task the prompt must perform: input variables, expected output shape, audience, constraints, and examples of good or bad behavior when available.
- If a playground prompt already exists, call read_prompt_instance before proposing changes so you have the current messages, message IDs, labels, and revision.
- Draft or revise the prompt so it clearly states the task, required context, output contract, and success criteria. Keep the prompt directly tied to the user's stated goal.
- Use edit_prompt_instance for changes to the mounted prompt so the user can review the diff before accepting it.
- Use add_prompt_instance when the user wants a fresh comparison instance that starts from the default prompt messages. Use clone_prompt_instance when comparing alternatives should preserve existing prompt content as the starting point. Discuss variants by their alphabetic labels, but pass numeric instance IDs to tools. After adding, use the returned addedInstance snapshot for follow-up edits.
- Use set_variable_values when the user provides manual values for prompt template variables.
- Use set_playground_repetitions before running when the user is concerned about flakes, structured output consistency, tool-call reliability, or whether the prompt is ready to save. LLM outputs are nondeterministic; repetitions build confidence by checking the same task across multiple runs instead of trusting one successful response.
- Call run_playground only when the user asks to run, try, test, or compare the current prompt. Treat the output as qualitative feedback rather than dataset-backed evidence.
- After the run finishes, call read_playground_output to inspect raw output and get the traceId for trace analysis when needed. If the run used multiple repetitions, inspect every repetition before summarizing confidence or recommending that the user save.
- Call save_prompt only when the user explicitly asks to save or confirms that the current prompt should be persisted. For a first-time save of an unsaved prompt, omit name unless the user provided one; the tool will derive a valid Phoenix prompt name from the prompt content. Always pass a save description; it should read like a clear, short git commit message. Treat tags like releases and do not promote tags unless the user asks.
- Inspect the output with the user, identify the next concrete improvement, and repeat the edit or comparison loop until the prompt is useful for the task.
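The repetition check described in the steps above can be sketched as a small helper. This is a minimal sketch under stated assumptions, not part of the playground tooling: it assumes the raw text of each repetition has already been collected into a list of strings (for example, copied from what read_playground_output returned), and the name summarize_repetitions and the keys of the returned dict are hypothetical.

```python
import json
from collections import Counter


def summarize_repetitions(outputs: list[str]) -> dict:
    """Summarize how consistent repeated runs of one prompt were.

    `outputs` is a hypothetical list holding the raw text of each repetition.
    """
    valid = 0
    shapes = Counter()
    for text in outputs:
        try:
            value = json.loads(text)
        except json.JSONDecodeError:
            continue  # this repetition did not produce valid JSON
        valid += 1
        if isinstance(value, dict):
            # Sorted top-level keys make differing object shapes stand out.
            shapes[("object", tuple(sorted(value)))] += 1
        else:
            shapes[("other", type(value).__name__)] += 1
    return {
        "total": len(outputs),
        "valid_json": valid,
        "distinct_shapes": len(shapes),
        # Consistent only if every repetition parsed and all share one shape.
        "consistent": bool(outputs) and valid == len(outputs) and len(shapes) == 1,
    }
```

A summary like this only flags structural drift between repetitions. It does not replace reading each output before recommending a save.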
Workflow: Iterate Over A Dataset With Evaluators And Experiments
Use this workflow when the user wants evidence that a prompt is improving across a dataset, or when
they are comparing prompt variants using evaluator results. Running a prompt over a dataset is
implicitly an experiment, so consult the experiments skill before designing the run, not only after
results arrive. That skill owns the iteration methodology end to end (what to stage at creation, how
to read and compare results, when an evaluator is warranted), and the evaluators skill owns designing
the evaluators that score the results. This workflow covers only the playground mechanics of setting
up and starting a recorded run.
- Load the dataset with load_dataset if it isn't already loaded. If the user named a dataset but no split and the dataset has splits, name them and ask whether to scope to one or load the whole dataset, then load once.
- Make sure the starting prompt is well formed before running it: it should define the task, relevant variables, output format, and any constraints needed for consistent evaluation.
- Use set_playground_experiment_recording before running when the user wants the next dataset-backed playground run recorded, persisted, or saved as an experiment, or wants to name, describe, or attach metadata (such as a hypothesis or the variable being changed) to the next experiment. Set recordExperiments to false only when the user explicitly asks for a temporary, throwaway, unrecorded, or ephemeral run. Call this tool only when the requested recording mode or scaffold fields differ from the advertised recordExperiments and nextExperimentScaffold values; the staged scaffold applies to that one run and is consumed when it starts. This is separate from save_prompt, which saves prompt versions rather than run results.
- Use set_playground_repetitions before running when the user needs confidence across repeated attempts, especially for flaky behavior, structured outputs, or tool-call correctness.
- Run the playground over the dataset. When recording is enabled, each prompt instance run over a dataset is captured as an experiment, with outputs and evaluator annotations available for review.
- To read the experiment results and decide whether a change helped, follow the experiments skill; to create the next candidate, use edit_prompt_instance, add_prompt_instance, or clone_prompt_instance (add_prompt_instance starts from the default prompt messages, clone_prompt_instance from existing prompt content), then rerun.
- Use save_prompt to save a prompt as a new version only after the evidence shows an improvement or the user explicitly accepts the tradeoff. For unsaved prompts, the tool can create the Phoenix prompt directly without asking for a name unless the user cares about the exact name.
Reading experiment results
When an instance carries an experimentId, read its cost and evaluator scores with phoenix-gql:
phoenix-gql --vars '{"experimentId":"<id>"}' 'query($experimentId: ID!){ node(id:$experimentId){ ...on Experiment { runCount expectedRunCount job{status} costSummary{total{cost tokens}} annotationSummaries{annotationName meanScore count errorCount} } } }'
An experimentId only means the experiment is queryable, not that the run finished — trust the
summaries as final only when job.status is COMPLETED or runCount == expectedRunCount. To
compare reruns, re-query earlier experiment IDs from the conversation and diff their summaries.
Experiments from unrecorded runs are ephemeral and the server sweeps them ~24h after their last
update; a freshly surfaced experimentId is well within that window, but an id re-queried from
much earlier in a long session may no longer resolve.
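The finality rule and the rerun comparison above can be sketched in a few lines. This is a minimal sketch, assuming the query response has been parsed from JSON and that each argument is the dict found under the node field, with the same field names the query requests. The function names is_final and diff_mean_scores are hypothetical, and how phoenix-gql prints its response is not specified here.

```python
def is_final(experiment: dict) -> bool:
    """Return True when one Experiment node's summaries can be trusted as final."""
    job = experiment.get("job") or {}
    if job.get("status") == "COMPLETED":
        return True
    run_count = experiment.get("runCount")
    expected = experiment.get("expectedRunCount")
    # Without a completed job, fall back to comparing the run counts.
    return run_count is not None and run_count == expected


def diff_mean_scores(before: dict, after: dict) -> dict:
    """Map each annotation name present in both experiments to its meanScore change."""

    def scores(experiment: dict) -> dict:
        return {
            summary["annotationName"]: summary["meanScore"]
            for summary in experiment.get("annotationSummaries") or []
            if summary.get("meanScore") is not None
        }

    old, new = scores(before), scores(after)
    # Only annotations scored in both experiments are comparable.
    return {name: new[name] - old[name] for name in sorted(old.keys() & new.keys())}
```

Checking is_final on both experiments before diffing avoids comparing a finished run against a partial one.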
Workflow: Author, Refine, Or Remove A Function Tool
Use this workflow when the user wants the model to be able to call a function/tool from the prompt, when they want to refine the signature of an existing one, or when they want to remove a tool. Function tools are JSON-Schema function definitions stored on the playground prompt instance (alongside messages and model config). They are the things the model can "call" during a run.
- Call read_prompt_tools before doing anything else. The result gives you the current tool list, each tool's id and kind, and a revision token. Use the existing ids and names to decide whether you should update an existing tool, create a new one, or delete one.
- If the user described a function in words, propose a concrete JSON Schema for it. Default to lowercase snake_case parameter names and a {"type":"object","properties":{...},"required":[...]} shape unless the user specifies otherwise.
- Call write_prompt_tools with the latest revision. Put every change in a single call: tools is an array of creates/updates (omit id to create, pass an existing id to patch; only the fields you include change), and deleteToolIds is a list of ids to remove. Deletes may target raw vendor tools too, even though writes can't. The batch is all-or-nothing: if any change is invalid (missing id, a raw tool on the write path, or the same id created/updated and deleted) nothing is applied and the error explains which. Deleting the tool that is the forced tool choice is allowed: the choice is reset to auto and reported back. Mention that to the user.
- After the write, briefly summarize what changed in plain English (which tools were created vs updated) so the user knows what to look for in the tool editor. If you created tools, tell them the new ids.
- If the user wants the model to use the new tool in a run, call run_playground and then read_playground_output to see whether the model actually invoked it.
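The schema defaults named above (an object shape with lowercase snake_case parameter names) can be checked before proposing a tool. This is a minimal sketch, not the validation the playground itself performs: the function name check_parameters_schema and its messages are hypothetical, and it inspects only the top level of the schema.

```python
import re

# Lowercase words separated by single underscores, e.g. "city" or "start_date".
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")


def check_parameters_schema(parameters: dict) -> list[str]:
    """Return a list of problems with a tool's parameters schema (empty if none)."""
    problems = []
    if parameters.get("type") != "object":
        problems.append('top-level "type" should be "object"')
    properties = parameters.get("properties", {})
    if not isinstance(properties, dict):
        problems.append('"properties" should be an object')
        properties = {}
    for name in properties:
        if not SNAKE_CASE.fullmatch(name):
            problems.append(f'parameter "{name}" is not lowercase snake_case')
    for name in parameters.get("required", []):
        if name not in properties:
            problems.append(f'required parameter "{name}" is not defined in "properties"')
    return problems
```

If the user asks for a different naming style or shape, skip the check rather than rewriting their schema to match it.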
Few-shot examples
These are concrete, runnable shapes — treat them as templates, not as fixed prompts. Always pass
the latest revision returned by read_prompt_tools.
Create a brand-new tool. One entry with no id.
{
"instanceId": 1,
"expectedRevision": "prompt-tools-abc",
"tools": [
{
"name": "get_weather",
"description": "Look up the current weather for a city.",
"parameters": {
"type": "object",
"properties": {
"city": { "type": "string", "description": "City name, e.g. \"San Francisco\"." },
"units": { "type": "string", "enum": ["c", "f"], "description": "Temperature units." }
},
"required": ["city"]
}
}
]
}
Create several tools at once. Put every tool in the tools array — one call, one revision
check. Prefer this over issuing one call per tool.
{
"instanceId": 1,
"expectedRevision": "prompt-tools-abc",
"tools": [
{
"name": "get_weather",
"parameters": {
"type": "object",
"properties": { "city": { "type": "string" } },
"required": ["city"]
}
},
{
"name": "get_forecast",
"parameters": {
"type": "object",
"properties": {
"city": { "type": "string" },
"days": { "type": "integer" }
},
"required": ["city", "days"]
}
}
]
}
Add a required parameter to an existing tool. Pass the existing id and the full new
parameters schema. Updates use patch semantics, but name is required even if it is unchanged.
{
"instanceId": 1,
"expectedRevision": "prompt-tools-abc",
"tools": [
{
"id": 3,
"name": "get_weather",
"parameters": {
"type": "object",
"properties": {
"city": { "type": "string" },
"units": { "type": "string", "enum": ["c", "f"] }
},
"required": ["city", "units"]
}
}
]
}
Create one tool and patch another in the same batch. Mix entries with and without id.
{
"instanceId": 1,
"expectedRevision": "prompt-tools-abc",
"tools": [
{
"name": "get_time",
"parameters": {
"type": "object",
"properties": { "timezone": { "type": "string" } },
"required": ["timezone"]
}
},
{
"id": 3,
"name": "get_weather",
"description": "Look up the current weather for a city. Returns temperature, humidity, and conditions."
}
]
}
Define a tool that returns structured output via a categorical choice. The model is forced to pick one of the enum labels and optionally explain.
{
"instanceId": 1,
"expectedRevision": "prompt-tools-abc",
"tools": [
{
"name": "classify_sentiment",
"description": "Classify the sentiment of the input as positive, negative, or neutral.",
"parameters": {
"type": "object",
"properties": {
"label": {
"type": "string",
"enum": ["positive", "negative", "neutral"],
"description": "The sentiment classification."
},
"explanation": {
"type": "string",
"description": "Short justification for the label."
}
},
"required": ["label"]
}
}
]
}
Delete a tool — and optionally swap in a replacement in the same batch. deleteToolIds removes
by id; combine it with tools to delete and add atomically. Deletes may target raw vendor tools.
{
"instanceId": 1,
"expectedRevision": "prompt-tools-abc",
"deleteToolIds": [3],
"tools": [
{
"name": "get_forecast",
"parameters": {
"type": "object",
"properties": { "city": { "type": "string" } },
"required": ["city"]
}
}
]
}
Things to avoid
- Don't call write_prompt_tools without calling read_prompt_tools first this turn; the expectedRevision will be stale and the write will be rejected.
- Don't try to write a tool whose kind was raw in the read snapshot. Vendor passthrough tools (e.g. provider builtins like web_search) are not editable through PXI; tell the user to author those in the playground tool editor. A raw entry in tools rejects the whole batch. (You can delete a raw tool via deleteToolIds, though.)
- Deleting the tool that is the prompt's forced tool choice (tool_choice = specific function) is allowed: the tool choice is automatically reset to auto (zero-or-more) and the result reports resetToolChoiceFrom. Tell the user, since it changes how the model picks tools at run time.
- Don't invent tool ids. An entry's id (and every deleteToolIds id) comes from a read snapshot, or is omitted for create. You cannot reference an id created earlier in the same batch.
- Don't issue multiple write_prompt_tools calls in a row without re-reading the revision between them. Each successful write or delete changes the revision. Batch the changes into one call.
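The id and raw-tool rules above can be pre-checked locally before sending a batch. This is a minimal sketch that mirrors the documented rules only; it is not the server's validation, and the function name validate_tool_batch is hypothetical. It assumes the read snapshot has been reduced to a list of dicts carrying each tool's id and kind.

```python
def validate_tool_batch(snapshot_tools: list[dict], tools: list[dict], delete_tool_ids: list) -> list[str]:
    """Return reasons a batch would break the documented rules (empty if none).

    `snapshot_tools` is an assumed shape: dicts with "id" and "kind" from the latest read.
    """
    kinds = {tool["id"]: tool["kind"] for tool in snapshot_tools}
    errors = []
    written_ids = set()
    for entry in tools:
        tool_id = entry.get("id")
        if tool_id is None:
            continue  # no id means this entry creates a new tool
        written_ids.add(tool_id)
        if tool_id not in kinds:
            errors.append(f"tool id {tool_id} is not in the read snapshot")
        elif kinds[tool_id] == "raw":
            errors.append(f"tool id {tool_id} is a raw vendor tool and cannot be written")
    for tool_id in delete_tool_ids:
        if tool_id not in kinds:
            errors.append(f"delete id {tool_id} is not in the read snapshot")
        if tool_id in written_ids:
            errors.append(f"tool id {tool_id} is both updated and deleted")
    return errors
```

Because the real batch is all-or-nothing, any non-empty result here means the whole call should be reworked before it is sent.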