t2vtree-user-centered-visual-analytics - SKILL.md Agent Skill

name: "t2vtree-user-centered-visual-analytics"
description: >
  Build tree-structured, agent-assisted thought-to-video authoring systems where each generation step is a node binding intent, prompts, parameters, and outputs. Four collaborating agents (Master, Knowledge, Workflow, Prompt) translate user intent into editable executable plans. Supports branching exploration, provenance tracking, and convergent stitching assembly. Trigger phrases: "build a video authoring pipeline", "tree-based video generation", "agent-assisted video creation", "thought-to-video workflow", "branching video exploration system", "multi-scene video authoring with agents"

T2VTree: Tree-Structured Agent-Assisted Thought-to-Video Authoring

This skill enables Claude to design and implement tree-structured, agent-assisted video authoring systems based on the T2VTree methodology. The core idea: represent the entire thought-to-video creation process as a tree where each node binds an editable specification (user intent, reference inputs, workflow choice, prompts, parameters) with its resulting multimodal outputs. Four collaborating agents decompose high-level creative intent into inspectable, user-editable execution plans before any generation runs. This makes branching exploration, localized comparison, provenance tracking, and convergent assembly into a final video all first-class operations rather than afterthoughts.

When to Use

When the user wants to build a multi-step video generation pipeline with branching and refinement capabilities
When designing a system where users explore alternative creative directions (image/video/audio variants) without losing prior work
When implementing multi-agent planning that translates natural-language intent into executable generation workflows (e.g., text-to-image, image-to-video, audio generation)
When building a visual analytics interface for managing complex generative AI workflows with provenance
When the user asks to create a multi-scene video authoring tool that stitches outputs from different generation branches
When implementing tree-based state management for any iterative creative AI process (not limited to video)

Key Technique

Tree-as-authoring-history. T2VTree models the entire creative process as a rooted tree on an infinite canvas. The root is an Init node. Each user action (generate an image, refine a prompt, produce a video clip) creates a child node extending its parent. Trying alternatives from the same starting point creates sibling nodes under the same parent. As a result, every exploration trace is preserved, comparable, and resumable. Parent-child edges represent progressive refinement; sibling relationships (nodes sharing a parent) represent parallel alternatives. Nodes are color-coded by output modality: blue for images, green for video, red for audio.

Four-agent collaborative planning. Rather than dumping the user into raw workflow configuration, T2VTree interposes four specialized agents between intent and execution. The Master Agent reads scene-level context and proposes an action category (e.g., "generate background image," "animate character"). The Knowledge Agent resolves ambiguous entities or concepts. The Workflow Agent selects a compatible generation module given the action category and available inputs. The Prompt Agent materializes a modality-aware specification: prompt text, negative prompts, numerical parameters (steps, CFG scale, resolution), and seed values. Critically, the resulting plan surfaces as an editable node artifact — the user inspects and revises before triggering execution. This keeps agents assistive, not autonomous.

Convergent stitching assembly. After divergent exploration across branches, users collect candidate assets (images, video clips, audio tracks) from any node in the tree into a multi-track timeline. Drag-and-drop ordering produces a final linear video while maintaining provenance links back to the originating tree nodes, so a user can jump from any timeline segment to its source node and revise it there without reconstructing intermediate steps.

Step-by-Step Workflow

1. Define the Node Data Model

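Each tree node stores five specification fields (intent, referenced inputs, workflow choice, prompts, parameters) plus its outputs, alongside bookkeeping fields (id, parentId, modality, createdAt, metadata):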

interface T2VNode {
  id: string;
  parentId: string | null;
  intent: string;                    // Natural-language goal for this step
  referencedInputs: AssetRef[];      // Prior outputs or uploaded assets used as input
  workflowChoice: {
    actionCategory: string;          // e.g., "text-to-image", "image-to-video", "audio-gen"
    moduleId: string;                // Specific pipeline/model identifier
  };
  prompts: {
    positive: string;
    negative: string;
  };
  parameters: Record<string, number | string>;  // CFG scale, steps, resolution, seed, etc.
  outputs: MultimodalOutput[];       // Generated images, videos, or audio
  modality: "image" | "video" | "audio";
  createdAt: string;
  metadata: Record<string, any>;     // Agent planning trace, timing, model info
}
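For a Python backend, the same node shape can be mirrored as a dataclass. This is a minimal sketch, not the paper's implementation: the snake_case field names, the `MODALITIES` set, and the `to_json` helper are assumptions made for illustration, and referenced inputs and outputs are kept as plain dictionaries because the source does not define `AssetRef` or `MultimodalOutput`.

```python
from __future__ import annotations

import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any

MODALITIES = {"image", "video", "audio"}  # assumed to match the TypeScript union


@dataclass
class T2VNode:
    # Five specification fields (intent .. parameters) plus outputs and bookkeeping.
    intent: str
    modality: str
    parent_id: str | None = None
    referenced_inputs: list[dict[str, Any]] = field(default_factory=list)
    workflow_choice: dict[str, str] = field(default_factory=dict)
    prompts: dict[str, str] = field(
        default_factory=lambda: {"positive": "", "negative": ""}
    )
    parameters: dict[str, Any] = field(default_factory=dict)
    outputs: list[dict[str, Any]] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self) -> None:
        # Reject modalities the tree view would not know how to color-code.
        if self.modality not in MODALITIES:
            raise ValueError(f"unknown modality: {self.modality!r}")

    def to_json(self) -> dict[str, Any]:
        # Plain dict suitable for a JSON response or a database row.
        return asdict(self)
```

Using `default_factory` for every mutable field keeps nodes from sharing lists or dictionaries, which matters when sibling nodes are created from the same template.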

2. Implement the Tree Structure with Branching Semantics

Build the tree with three operations:

Extend (create child): progressive refinement from a selected node
Branch (create sibling): explore an alternative from the same parent context
Prune: collapse or delete abandoned branches while preserving the rest

Store nodes in a flat collection keyed by id, with parentId forming the tree. Render using a hierarchical layout algorithm (e.g., Reingold-Tilford or layered DAG layout).
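The three operations can be sketched over a flat, id-keyed store. The class name `AuthoringTree`, the `pruned` flag (a soft delete), and the `path_to_root` helper are hypothetical choices for this example rather than part of T2VTree itself.

```python
from __future__ import annotations

import uuid


class AuthoringTree:
    """Flat id -> node store; parent_id links form the tree."""

    def __init__(self) -> None:
        self.nodes: dict[str, dict] = {}
        self.root_id = self._add(None, "init")

    def _add(self, parent_id: str | None, intent: str) -> str:
        node_id = uuid.uuid4().hex
        self.nodes[node_id] = {
            "id": node_id, "parent_id": parent_id, "intent": intent, "pruned": False,
        }
        return node_id

    def extend(self, node_id: str, intent: str) -> str:
        """Progressive refinement: add a child of node_id."""
        if node_id not in self.nodes:
            raise KeyError(node_id)
        return self._add(node_id, intent)

    def branch(self, node_id: str, intent: str) -> str:
        """Parallel alternative: add a sibling, i.e. a new child of node_id's parent."""
        parent_id = self.nodes[node_id]["parent_id"]
        if parent_id is None:
            raise ValueError("the root node has no siblings")
        return self._add(parent_id, intent)

    def prune(self, node_id: str) -> list[str]:
        """Soft-delete node_id and its whole subtree; other branches are untouched."""
        if node_id == self.root_id:
            raise ValueError("cannot prune the root")
        stack, pruned = [node_id], []
        while stack:
            current = stack.pop()
            self.nodes[current]["pruned"] = True
            pruned.append(current)
            stack.extend(
                n["id"] for n in self.nodes.values() if n["parent_id"] == current
            )
        return pruned

    def children(self, node_id: str) -> list[str]:
        """Visible (non-pruned) children in insertion order."""
        return [
            n["id"] for n in self.nodes.values()
            if n["parent_id"] == node_id and not n["pruned"]
        ]

    def path_to_root(self, node_id: str) -> list[str]:
        """Ids from the root down to node_id, useful as planning history."""
        path, current = [], node_id
        while current is not None:
            path.append(current)
            current = self.nodes[current]["parent_id"]
        return path[::-1]
```

Soft deletion is used here so that a pruned branch can still be restored or inspected later; a hard delete would also be a valid reading of "Prune".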

3. Implement the Four-Agent Planning Pipeline

Wire up four LLM-based agents that execute sequentially on each user action:

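async def plan_next_step(scene_context: str, current_path: list["T2VNode"], user_intent: str) -> dict:
    # All four agents are assumed to be async, matching the definitions in Example 2.
    # Summarize the root-to-current branch as text for the Master Agent.
    path_summary = " -> ".join(node.intent for node in current_path)

    # Agent 1: Master, classifies the action
    action_category = await master_agent.classify(scene_context, path_summary, user_intent)

    # Agent 2: Knowledge, resolves ambiguous references
    clarifications = await knowledge_agent.resolve(user_intent, scene_context)

    # Agent 3: Workflow, selects a generation module
    # (available_inputs is an assumed helper that collects reusable outputs on the path)
    workflow_spec = await workflow_agent.select(action_category, available_inputs(current_path))

    # Agent 4: Prompt, materializes the full specification
    node_spec = await prompt_agent.materialize(
        intent=user_intent,
        action=action_category,
        workflow=workflow_spec,
        clarifications=clarifications,
        style_ctx=scene_context,  # keyword matches materialize() in Example 2
    )
    return node_spec  # User reviews and edits BEFORE execution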

Each agent call uses ~500-1000 tokens. Keep planning calls lightweight (2k-4k tokens total per step) since generation is the expensive part.

4. Make Agent Plans Editable Before Execution

Never auto-execute agent output. Present the generated node_spec as an editable form:

Show the proposed prompt text (positive and negative) in editable text fields
Expose numerical parameters (CFG scale, steps, resolution, seed) as sliders/inputs
Display the selected workflow module with alternatives available via dropdown
Include the Knowledge Agent's clarifications as collapsible context
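On the backend, the edit form reduces to merging user changes into the agent's proposal while keeping the proposal intact for later comparison. The sketch below assumes plans are plain dictionaries; `EDITABLE_FIELDS` and `apply_user_edits` are hypothetical names, and the choice of which fields are editable is an assumption based on the form elements listed above.

```python
import copy

# Fields the edit form exposes (assumption: intent stays as the user wrote it).
EDITABLE_FIELDS = {"prompts", "parameters", "workflowChoice"}


def apply_user_edits(proposed: dict, edits: dict) -> dict:
    """Return the spec to execute; the agent's proposal is never mutated."""
    unknown = set(edits) - EDITABLE_FIELDS
    if unknown:
        raise ValueError(f"fields not editable: {sorted(unknown)}")
    final = copy.deepcopy(proposed)
    for field_name, changes in edits.items():
        # Merge key by key so untouched values (e.g. the seed) are preserved.
        merged = dict(final.get(field_name, {}))
        merged.update(changes)
        final[field_name] = merged
    return final
```

Storing both `proposed` and the returned dictionary on the node (for example under `metadata`) is one way to support the proposed-versus-executed comparison described in the provenance step.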

5. Execute Generation and Attach Outputs to the Node

After the user confirms or edits the plan, dispatch to the appropriate generation backend. Attach all outputs (thumbnails, full-resolution assets, metadata) to the node. Support in-place preview: image thumbnails, looping video playback, and audio waveform display directly within the tree node.
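One way to organize dispatch is a registry keyed by `workflowChoice.moduleId`. Everything named here (`BACKENDS`, `register`, `execute_node`, the `status` and `error` keys, and the `demo-text-to-image` module) is hypothetical; a real backend function would call ComfyUI or another generator instead of returning a placeholder path.

```python
from typing import Callable

# moduleId -> function taking a node dict and returning a list of output dicts.
BACKENDS: dict[str, Callable[[dict], list[dict]]] = {}


def register(module_id: str):
    """Decorator that records a generation backend under its module id."""
    def decorator(fn: Callable[[dict], list[dict]]):
        BACKENDS[module_id] = fn
        return fn
    return decorator


@register("demo-text-to-image")
def fake_text_to_image(node: dict) -> list[dict]:
    # Placeholder backend: a real one would run the model and save files.
    return [{"url": f"/assets/{node['id']}.png", "thumbnail": f"/thumbs/{node['id']}.png"}]


def execute_node(node: dict) -> dict:
    """Run the node's backend and attach outputs (or an error) to the node itself."""
    module_id = node["workflowChoice"]["moduleId"]
    backend = BACKENDS.get(module_id)
    if backend is None:
        node["status"], node["error"] = "error", f"no backend for {module_id!r}"
        return node
    try:
        node["outputs"] = backend(node)
        node["status"] = "done"
    except Exception as exc:  # keep the node in the tree so the user can retry
        node["status"], node["error"] = "error", repr(exc)
    return node
```

Because failures are written onto the node rather than raised, the node stays in the tree with its specification intact and can be edited and re-executed.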

6. Enable Provenance Inspection

Every node retains its full specification context. Implement a detail panel that shows:

What the user originally requested (intent)
What the agents proposed vs. what was actually executed (diff view)
Which upstream outputs were referenced as inputs
Generation timing and model configuration
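The proposed-versus-executed diff can be computed with a small recursive comparison over the two specifications. This is a sketch under the assumption that both specs are nested dictionaries with string keys; `diff_specs` and the `MISSING` sentinel are illustrative names.

```python
MISSING = object()  # marks a key present on only one side


def diff_specs(proposed: dict, executed: dict, prefix: str = "") -> list[tuple]:
    """Return (dotted_path, proposed_value, executed_value) for every difference."""
    changes = []
    for key in sorted(set(proposed) | set(executed)):
        path = f"{prefix}{key}"
        old = proposed.get(key, MISSING)
        new = executed.get(key, MISSING)
        if isinstance(old, dict) and isinstance(new, dict):
            # Recurse so edits are reported per field, not per block.
            changes.extend(diff_specs(old, new, prefix=path + "."))
        elif old != new:
            changes.append((path, old, new))
    return changes
```

A detail panel could render each tuple as one row, showing `MISSING` on the left for values the user added and on the right for values the user removed.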

7. Implement the Stitching/Assembly View

Build a separate timeline view for convergent assembly:

Users star/collect candidate assets from any tree node
Multi-track timeline with video, image (as stills), and audio lanes
Drag-and-drop ordering with transition controls
Each segment links back to its source node for upstream editing
Export produces the final concatenated video
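For the export step, one simple option is ffmpeg's concat demuxer, which reads a text file of `file '...'` lines. The sketch below only builds that list and the command; `Segment`, `build_concat_list`, and `ffmpeg_concat_command` are hypothetical names, and stream copy (`-c copy`) assumes all clips share codec and resolution, which generated clips from different modules may not.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    source_node_id: str  # provenance link back to the tree node
    path: str            # file produced by that node
    order: int           # position on the timeline


def build_concat_list(segments: list[Segment]) -> str:
    """Render the concat demuxer list in timeline order."""
    lines = []
    for seg in sorted(segments, key=lambda s: s.order):
        escaped = seg.path.replace("'", "'\\''")  # escape single quotes in paths
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def ffmpeg_concat_command(list_path: str, output_path: str) -> list[str]:
    """Arguments for subprocess.run; the command is built here, not executed."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_path, "-c", "copy", output_path]


def provenance(segments: list[Segment]) -> list[str]:
    """Source node ids in timeline order, for linking segments back to the tree."""
    return [seg.source_node_id for seg in sorted(segments, key=lambda s: s.order)]
```

If clips differ in codec or size, re-encoding (dropping `-c copy`) or normalizing each clip before concatenation would be needed; transitions and separate audio lanes are outside this sketch.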

8. Maintain Scene-Level Context

Store global scene context at the project root: base model preferences, style direction, mood, color palette, and reference materials. Pass this context into every agent planning call so that generations across different branches maintain visual coherence.
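As a rough illustration, the scene context and the current branch can be flattened into the two text inputs that the agents receive. The function name `build_planning_context`, the scene keys, and the output format are assumptions for this example; the returned keys mirror the `scene_ctx` and `path_summary` parameters used in Example 2.

```python
def build_planning_context(scene: dict, path: list[dict]) -> dict:
    """Turn project-level scene settings and branch history into prompt text."""
    # Skip empty settings so the agents are not shown blank fields.
    scene_lines = [f"{key}: {value}" for key, value in scene.items() if value]
    # One numbered line per node from the root down to the current node.
    history = [
        f"{index}. [{node['modality']}] {node['intent']}"
        for index, node in enumerate(path, start=1)
    ]
    return {
        "scene_ctx": "\n".join(scene_lines),
        "path_summary": "\n".join(history) or "(no prior steps)",
    }
```

Building this text in one place means every branch sees the same style, mood, and palette wording, which is what keeps planning consistent across branches.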

Concrete Examples

Example 1: Building a T2VTree-style authoring backend in Python/Flask

User: "I want to build a video authoring tool where users can explore different generation options as a tree and stitch the best results together."

Approach:

Set up a Flask backend with SQLite storing the node data model above
Create REST endpoints: POST /nodes (create child or sibling), GET /tree (fetch full tree), PUT /nodes/:id (edit spec before execution), POST /nodes/:id/execute (trigger generation), POST /stitch (assemble timeline)
Implement the four-agent planning pipeline using an LLM API, returning editable JSON specs
Wire generation dispatch to ComfyUI, Stable Diffusion, or other local backends based on workflowChoice.moduleId
Build the tree retrieval query that reconstructs parent-child relationships from flat storage

Output structure:

/backend
  app.py                 # Flask app with REST endpoints
  models.py              # T2VNode SQLAlchemy model
  agents/
    master_agent.py      # Action category classification
    knowledge_agent.py   # Entity/concept resolution
    workflow_agent.py    # Module selection logic
    prompt_agent.py      # Prompt materialization
  workflows/
    text_to_image.py     # Generation dispatch wrappers
    image_to_video.py
    audio_gen.py
  stitcher.py            # Video concatenation with ffmpeg
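The tree retrieval step in the approach above amounts to turning flat rows into a nested structure for `GET /tree`. The following is a minimal sketch assuming each row is a dictionary with `id` and `parentId` keys; `build_tree` is a hypothetical helper name.

```python
def build_tree(rows: list[dict]) -> dict:
    """Nest flat node rows under their parents and return the root node."""
    # Copy each row and give it an empty children list.
    by_id = {row["id"]: {**row, "children": []} for row in rows}
    roots = []
    for node in by_id.values():
        parent = by_id.get(node["parentId"])
        if parent is None:
            roots.append(node)  # the Init node, or an orphan whose parent is gone
        else:
            parent["children"].append(node)
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {len(roots)}")
    return roots[0]
```

This runs in a single pass over the rows, so the whole tree can be fetched with one query and assembled in memory rather than queried level by level.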

Example 2: Implementing the four-agent planning pipeline

User: "Help me implement the agent planning system that turns user intent into a generation plan."

Approach:

Define system prompts for each of the four agents with their specific roles
Chain them sequentially, passing accumulated context forward
Return a structured JSON plan the user can edit

# master_agent.py
# `llm` is assumed to be an async chat client whose chat() returns the reply text as a str.
MASTER_SYSTEM = """You classify video authoring actions into categories.
Given the scene context, current branch history, and user intent,
output exactly one of: text-to-image, image-to-image, image-to-video,
video-to-video, text-to-audio, audio-to-audio, text-to-video."""

async def classify(scene_ctx: str, path_summary: str, intent: str) -> str:
    response = await llm.chat([
        {"role": "system", "content": MASTER_SYSTEM},
        {"role": "user", "content": f"Scene: {scene_ctx}\nHistory: {path_summary}\nIntent: {intent}"}
    ])
    return response.strip()

# prompt_agent.py
import json  # required by json.loads() in materialize()
PROMPT_SYSTEM = """You materialize generation specifications.
Given an action category, workflow module, and user intent, produce:
- positive_prompt: detailed generation prompt
- negative_prompt: what to avoid
- parameters: dict of numerical controls with sensible defaults
Output valid JSON only."""

async def materialize(intent, action, workflow, clarifications, style_ctx) -> dict:
    response = await llm.chat([
        {"role": "system", "content": PROMPT_SYSTEM},
        {"role": "user", "content": f"Action: {action}\nWorkflow: {workflow}\n"
         f"Intent: {intent}\nClarifications: {clarifications}\nStyle: {style_ctx}"}
    ])
    return json.loads(response)

Example 3: Building the tree visualization frontend

User: "I need to build the tree UI where each node shows a preview of the generated content and users can branch off alternatives."

Approach:

Use Vue.js (or React) with an infinite canvas library (e.g., vue-flow, reactflow)
Render each node as a card showing: modality color border, thumbnail preview, intent text snippet
Implement three interaction modes: click to inspect/edit, drag-from-edge to create child, right-click to create sibling branch
Color-code nodes: blue border for images, green for video, red for audio

<template>
  <div class="t2v-node" :class="`modality-${node.modality}`">
    <div class="node-preview">
      <img v-if="node.modality === 'image'" :src="node.outputs[0]?.thumbnail" />
      <video v-else-if="node.modality === 'video'" :src="node.outputs[0]?.url" loop muted autoplay />
      <div v-else class="audio-waveform">{{ node.outputs[0]?.duration }}s</div>
    </div>
    <p class="node-intent">{{ truncate(node.intent, 60) }}</p>
    <div class="node-actions">
      <button @click="$emit('extend', node.id)">Refine</button>
      <button @click="$emit('branch', node.id)">Branch</button>
      <button @click="$emit('star', node.id)">Star</button>
    </div>
  </div>
</template>

Best Practices

Do:

Always surface agent-generated plans as editable artifacts before execution. The user must remain the decision-maker, not the agents.
Preserve every exploration state as a tree node. Never overwrite — create child or sibling nodes instead. This is the core design principle.
Keep agent planning calls lightweight (2k-4k tokens total across all four agents). Reserve compute budget for actual generation.
Color-code nodes by output modality (image/video/audio) so the tree structure communicates content type at a glance.
Maintain scene-level context (style, mood, palette) at the project root and propagate it into every planning call for cross-branch coherence.

Avoid:

Auto-executing agent plans without user review. This undermines the user-centered design that makes T2VTree effective.
Using explicit dependency edges between nodes for asset reuse. Instead, use referenced inputs within nodes — this avoids collapsing the tree into a complex DAG.
Storing outputs outside the node. Each node must be self-contained: specification + outputs together enable provenance inspection.
Building a linear pipeline. The tree structure is the point — if users can only refine sequentially, you lose branching comparison and the ability to resume abandoned directions.

Error Handling

| Failure Mode | Handling Strategy |
| --- | --- |
| Generation fails (OOM, timeout, model error) | Attach error state to the node with full error details. The node persists in the tree so the user can edit parameters and retry without losing context. |
| Agent planning produces invalid JSON | Validate each agent's output with a schema. On failure, retry once with a stricter prompt. If still invalid, surface raw output in the edit form for manual correction. |
| Referenced input asset is missing | Check asset availability before execution. If a referenced node's output was pruned, warn the user and offer to re-generate or select an alternative. |
| Stitching produces discontinuities | Provide per-segment preview in the timeline before final export. Allow users to insert transition frames or adjust segment boundaries. |
| Tree becomes too large to render | Implement branch collapsing (hide subtrees), pruning (soft-delete abandoned branches), and viewport-based lazy rendering on the infinite canvas. |
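The invalid-JSON strategy above (validate, retry once with a stricter prompt, then surface the raw output) might look like the following. The required keys come from the Prompt Agent's system prompt in Example 2; `validate_plan`, `plan_with_retry`, and the `strict` flag on the caller-supplied `call_agent` function are assumptions made for this sketch.

```python
import json

# Keys and types the Prompt Agent is asked to produce in Example 2.
REQUIRED = {"positive_prompt": str, "negative_prompt": str, "parameters": dict}


def validate_plan(raw: str) -> dict:
    """Parse agent output and check the expected keys; raise ValueError otherwise."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(plan, dict):
        raise ValueError("plan must be a JSON object")
    for key, expected_type in REQUIRED.items():
        if not isinstance(plan.get(key), expected_type):
            raise ValueError(f"missing or mistyped field: {key}")
    return plan


def plan_with_retry(call_agent, max_retries: int = 1) -> dict:
    """call_agent(strict: bool) -> str. Retries with strict=True after a failure."""
    last_raw, error = "", ""
    for attempt in range(max_retries + 1):
        last_raw = call_agent(strict=attempt > 0)
        try:
            return {"ok": True, "plan": validate_plan(last_raw)}
        except ValueError as exc:
            error = str(exc)
    # Still invalid: hand the raw text to the edit form for manual correction.
    return {"ok": False, "raw": last_raw, "error": error}
```

A fuller implementation could swap the hand-written checks for a schema library, but the control flow (one stricter retry, then fall back to manual editing) would stay the same.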

Limitations

Compute-intensive: Each tree node potentially triggers a full generation run. Large trees with many branches multiply GPU cost. This approach works best when generation backends are available locally or via affordable APIs.
Not real-time: The four-agent planning pipeline adds latency (several seconds) before generation even starts. Suitable for deliberate authoring, not rapid prototyping.
Tree complexity scales with exploration: Heavy branching produces trees that are hard to navigate without good collapse/filter UI. The visualization itself requires engineering effort.
Audio-visual coherence is not automatic: While scene-level context helps, the system does not enforce temporal consistency across video clips or audio-visual sync — these remain manual stitching decisions.
LLM planning quality depends on model capability: The four agents need a capable model (GPT-4 class or above) to reliably classify actions, resolve entities, and materialize good prompts. Smaller models will produce lower-quality plans that require more user editing.

Reference

Paper: T2VTree: User-Centered Visual Analytics for Agent-Assisted Thought-to-Video Authoring (Zheng et al., 2026)
Code: github.com/tezuka0210/T2VTree
Key insight: Model the video authoring process as a tree where nodes bind editable specs with outputs, interpose four collaborating agents for intent-to-plan translation (Master/Knowledge/Workflow/Prompt), and keep all plans user-editable before execution. Look for Section 4 (System Design) and Section 5 (Agent Planning Pipeline) for implementation-level details.