desktop-control - SKILL.md Agent Skill

name: desktop-control
description: Background knowledge for droid-control workflows -- not invoked directly. Desktop-control driver mechanics for native GUI app automation via trycua cua-driver.
user-invocable: false

Desktop-Control Driver

The orchestrator routed you here. Use these mechanics to execute your plan.

Drive native desktop GUI apps through the upstream trycua/cua cua-driver: enumerate apps and windows, snapshot accessibility trees, click/type/scroll by element_index or pixel coordinates, and verify by re-snapshot -- all without bringing the target to the foreground.

When to use

Automating a native desktop app (Finder, Notepad, System Settings, native editors)
Driving native dialogs and security/permission sheets that no DOM or PTY can reach
Visual QA of native UI: per-window screenshots, accessibility-tree assertions

If the target is a terminal TUI, use tuistory or true-input. If it is a web page or an Electron app, use agent-browser -- CDP beats accessibility trees for anything Chromium-based.

Platform support

Platform	Upstream tier	Read
macOS	Production	platforms/macos.md
Windows	Production	platforms/windows.md
Linux	Pre-release (real caveats)	platforms/linux.md

Read the platform file for your target OS. Each contains permissions, daemon launch, and platform-specific patterns and failure modes.

Prerequisites

# one-time install: per-user, no sudo/admin
curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/cua-driver/scripts/install.sh | bash
# Windows (PowerShell):
#   irm https://raw.githubusercontent.com/trycua/cua/main/libs/cua-driver/scripts/install.ps1 | iex

cua-driver doctor           # platform probes: permissions, daemon, accessibility plumbing
cua-driver skills install   # fetch the upstream skill pack to ~/.cua-driver/skills/cua-driver

The upstream pack (~/.cua-driver/skills/cua-driver/SKILL.md + your platform's doc) is the deep reference -- full tool surface, window-state behavior matrix, forbidden-command lists -- and it updates with the binary. Read it before any nontrivial workflow. This atom owns the droid-control integration: routing, run isolation, delegation, evidence handoff.

Daemon lifecycle

element_index workflows require the daemon. Without it, each CLI invocation is a fresh process, so the per-(pid, window_id) element cache dies between calls and an index resolved by one call is gone by the next.

cua-driver serve            # start the daemon (macOS needs the LaunchServices form -- see platforms/macos.md)
cua-driver status           # daemon + socket health
cua-driver stop
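
A workflow script can refuse to begin element_index work while the daemon is down. The following is a minimal sketch, not part of cua-driver: require_daemon is a hypothetical helper name, and it assumes that cua-driver status exits non-zero when the daemon or socket is unhealthy. That exit-code behavior is not stated in this document, so confirm it against the upstream pack before relying on it.

```shell
# Hypothetical helper: fail fast when the daemon is not reachable.
# Assumption: `cua-driver status` exits non-zero when the daemon is down.
require_daemon() {
  if ! command -v cua-driver >/dev/null 2>&1; then
    echo "cua-driver not found on PATH; run the one-time install first" >&2
    return 127
  fi
  if ! cua-driver status >/dev/null 2>&1; then
    echo "cua-driver daemon not healthy; start it with 'cua-driver serve'" >&2
    echo "(macOS needs the LaunchServices form, see platforms/macos.md)" >&2
    return 1
  fi
}

# Usage: require_daemon || exit 1
```

The helper only reports; it deliberately does not start the daemon itself, because the correct launch form differs per platform.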

Check and grant permissions through the driver, not by hand-editing system settings. The gate exists only on macOS; on Windows and Linux these commands are a no-op surface:

cua-driver permissions status   # read-only; answers via the running daemon
cua-driver permissions grant    # attributed prompt flow -- the correct way to grant

Core loop

Tool names are snake_case and invoked directly: cua-driver <tool> '<json>'. (cua-driver call <tool> is legacy; do not use it.) Run cua-driver list-tools for the inventory and cua-driver describe <tool> for any tool's schema.

Every workflow is Discover -> Observe -> Act -> Verify against an explicit (pid, window_id):

cua-driver launch_app '{"name":"TextEdit"}'
#  -> {pid: 844, windows: [{window_id: 10725, ...}]}   # list_windows only needed for long-lived pids
cua-driver get_window_state '{"pid":844,"window_id":10725}' --screenshot-out-file "${RUN_DIR}/before.png"
cua-driver click '{"pid":844,"window_id":10725,"element_index":14,"session":"'"${RUN_ID}"'-desktop"}'
cua-driver get_window_state '{"pid":844,"window_id":10725}' --screenshot-out-file "${RUN_DIR}/after.png"

Snapshot before AND after every action. The pre-action get_window_state resolves the element_index you are about to use -- indices are per-snapshot, per (pid, window_id), and stale ones fail with No cached AX state. The post-action snapshot is the evidence the action landed; without it a silent no-op looks like success.
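
The verify step can be made mechanical by comparing the two screenshots that bracket an action. This is a rough sketch under stated assumptions: report_change is a hypothetical helper, and it uses a byte-level cmp, which is a crude signal. Identical files strongly suggest a silent no-op, but differing files do not prove the intended change happened, so the accessibility tree remains the stronger evidence.

```shell
# Hypothetical helper: compare the before/after screenshots of one action.
# Prints "changed" or "unchanged"; returns 2 if either file is missing or empty.
report_change() {  # before_png after_png
  if [ ! -s "$1" ] || [ ! -s "$2" ]; then
    echo "missing evidence: $1 or $2" >&2
    return 2
  fi
  if cmp -s "$1" "$2"; then
    echo "unchanged"   # possible silent no-op: report it as a finding
  else
    echo "changed"     # pixels differ; still inspect the tree to confirm what changed
  fi
}

# Usage after the Act step:
#   report_change "${RUN_DIR}/before.png" "${RUN_DIR}/after.png"
```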

Addressing-mode preference:

element_index (default) -- semantic, works on hidden and backgrounded windows, no foreground change.
Pixel click '{"pid":N,"window_id":W,"x":X,"y":Y}' -- for surfaces the tree does not reach (canvases, custom-drawn controls). Coordinates are window-local screenshot pixels, top-left origin.
Keyboard (press_key, hotkey) and platform fallbacks -- last resort; see the platform files.
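
The two click addressing modes differ only in their JSON payload, which makes hand-quoting the easiest place to slip. The builders below are a sketch with hypothetical function names. They assume integer ids and coordinates and a session name without quotes or backslashes, so no JSON escaping is attempted. The pixel form shown above carries no session field; it is added here on the assumption that the "pass it on every action" rule applies to pixel clicks too.

```shell
# Hypothetical payload builders for the two click addressing modes.
# printf %d rejects non-integer arguments, which catches swapped fields early.
click_by_index_payload() {  # pid window_id element_index session
  printf '{"pid":%d,"window_id":%d,"element_index":%d,"session":"%s"}\n' "$1" "$2" "$3" "$4"
}

click_by_pixel_payload() {  # pid window_id x y session
  printf '{"pid":%d,"window_id":%d,"x":%d,"y":%d,"session":"%s"}\n' "$1" "$2" "$3" "$4" "$5"
}

# Usage:
#   cua-driver click "$(click_by_index_payload 844 10725 14 "${RUN_ID}-desktop")"
#   cua-driver click "$(click_by_pixel_payload 844 10725 120 48 "${RUN_ID}-desktop")"
```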

Run isolation (ground rule 5 -> cua sessions)

cua sessions are the desktop equivalent of tctl session prefixes: a session owns its agent cursor, config overrides, and recording scope. Declare one per run, derived from the workflow's RUN_ID, and pass it on every action:

cua-driver start_session '{"session":"'"${RUN_ID}"'-desktop"}'
# ... every action carries "session":"${RUN_ID}-desktop" ...
cua-driver end_session '{"session":"'"${RUN_ID}"'-desktop"}'

Parallel workers each declare their own session and pass creates_new_application_instance: true to launch_app so each gets its own window. The element cache is keyed on (pid, window_id) and the cursor on session, so isolated workers cannot collide.
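
A session that is started but never ended leaks its cursor and recording scope past the run. One way to pair the two calls is a small wrapper. This is a sketch with a hypothetical name, run_in_session, and it assumes the session name contains no quotes or backslashes.

```shell
# Hypothetical wrapper: run one command inside a cua session and always
# close the session afterwards, even when the command fails.
run_in_session() {  # session command [args...]
  session_name="$1"
  shift
  cua-driver start_session "{\"session\":\"${session_name}\"}" || return 1
  command_status=0
  "$@" || command_status=$?
  cua-driver end_session "{\"session\":\"${session_name}\"}"
  return "$command_status"
}

# Usage (capture_steps is whatever function or script holds the run's actions):
#   run_in_session "${RUN_ID}-desktop" capture_steps
```

The wrapper returns the wrapped command's exit status, so a failed action still fails the run after the session is closed.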

Delegation

cua-driver is on PATH -- workers need no ${DROID_PLUGIN_ROOT} resolution. As with the other drivers, give capture workers exact commands with the parent's run scope baked in:

Task prompt for a desktop capture worker:
  "Run these commands in order. Report screenshot paths and any errors.
   1. cua-driver start_session '{"session":"1712345678-42-notepad"}'
   2. cua-driver launch_app '{"name":"Notepad","creates_new_application_instance":true}'
      -> note the returned pid and window_id
   3. cua-driver get_window_state '{"pid":<pid>,"window_id":<wid>}' --screenshot-out-file /tmp/droid-run-1712345678-42-xxxx/before.png
   4. cua-driver type_text '{"pid":<pid>,"window_id":<wid>,"element_index":<text-area>,"text":"hello","session":"1712345678-42-notepad"}'
   5. cua-driver get_window_state '{"pid":<pid>,"window_id":<wid>}' --screenshot-out-file /tmp/droid-run-1712345678-42-xxxx/after.png
   6. cua-driver end_session '{"session":"1712345678-42-notepad"}'"
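
Rather than hand-editing the prompt for each run, the parent can render it from its own run scope. The generator below is a sketch: render_capture_prompt and its argument order are hypothetical, and the <pid>, <wid>, and <text-area> placeholders are left for the worker to fill from its own tool output.

```shell
# Hypothetical generator for the capture-worker prompt shown above.
# The parent's session name and run directory are baked into every command.
render_capture_prompt() {  # session app_name run_dir text
  cat <<EOF
Run these commands in order. Report screenshot paths and any errors.
1. cua-driver start_session '{"session":"$1"}'
2. cua-driver launch_app '{"name":"$2","creates_new_application_instance":true}'
   -> note the returned pid and window_id
3. cua-driver get_window_state '{"pid":<pid>,"window_id":<wid>}' --screenshot-out-file $3/before.png
4. cua-driver type_text '{"pid":<pid>,"window_id":<wid>,"element_index":<text-area>,"text":"$4","session":"$1"}'
5. cua-driver get_window_state '{"pid":<pid>,"window_id":<wid>}' --screenshot-out-file $3/after.png
6. cua-driver end_session '{"session":"$1"}'
EOF
}

# Usage:
#   render_capture_prompt "${RUN_ID}-notepad" Notepad "${RUN_DIR}" hello
```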

Evidence handoff

Proof type	How to capture
Window state	`get_window_state ... --screenshot-out-file ${RUN_DIR}/proof-N.png` (also keeps the PNG out of the tool response)
Full display	`cua-driver screenshot '{"out_file":"'"${RUN_DIR}"'/screen.png"}'`
Semantic assertions	`tree_markdown` from `get_window_state` (filter with `"query":"..."`)
Video	`cua-driver recording start` / `recording stop` -> session-scoped `recording.mp4`

Hand PNG/mp4 paths to compose / verify like any other driver output. Keep raw tool output alongside screenshots whenever GUI behavior is the thing under test.
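
Before handing paths to compose / verify, it is cheap to confirm that each artifact actually landed on disk. A minimal sketch, with check_evidence as a hypothetical helper name:

```shell
# Hypothetical pre-handoff check: every listed evidence file must exist and
# be non-empty. Prints each verified path; returns 1 on the first bad file.
check_evidence() {  # file...
  for evidence_path in "$@"; do
    if [ ! -s "$evidence_path" ]; then
      echo "evidence missing or empty: $evidence_path" >&2
      return 1
    fi
    printf '%s\n' "$evidence_path"
  done
}

# Usage:
#   check_evidence "${RUN_DIR}/before.png" "${RUN_DIR}/after.png" || exit 1
```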

Critical rules

Never change the user's frontmost app. If a command says activate, foreground, raise, or make key -- stop; the per-pid event paths exist precisely so you do not need it. Platform forbidden-lists live in the upstream pack.
Re-snapshot after every action and report what you observed, not what you intended. An unchanged tree after an action is a finding, not a formality.
Destructive actions need explicit user intent. Do not delete files, send messages, or submit forms unless the workflow asked for exactly that.