name: cua-driver description: Drive native macOS apps and browser UI through OpenClicky's local computer-use MCP background lane, backed by Cua. Snapshot state, prefer AX element_index actions, use explicit pixel/foreground paths only when the current OpenClicky tool surface exposes them, and verify via re-snapshot. Use when the user asks you to operate, drive, automate, or perform a GUI task in a real macOS application on the host.
cua-driver
Orchestrates macOS app automation through Cua. In OpenClicky's managed
runtime, this skill is the instruction surface for the Agent Mode
computer-use MCP background lane. OpenClicky also has product routes
for native Swift CUA, foreground app/URL opens, Background Computer Use
pixel clicks, and the external-control bridge. Treat the app/router code
as the source of truth for those lanes; this skill tells agents how to
operate when the callable tool server is computer-use. Call the local
MCP tools by name (launch_app, get_window_state, click,
set_value, etc.) instead of shelling out to cua-driver.
Whenever a user asks to drive a native macOS app or browser UI, follow the loop in this skill rather than calling tools ad-hoc. The snapshot-before-action invariant is not optional and silently breaks if you skip it.
OpenClicky managed runtime rules
OpenClicky ships Cua as a local computer-use MCP server backed by
ClickyComputerUseRuntime. That helper is spawned by Codex inside
Clicky.app, so macOS attributes Accessibility and Screen Recording use
to OpenClicky itself. In this MCP lane:
- Use the
computer-useMCP tools directly. Do not run thecua-driverCLI,open, AppleScript,cliclick, browser CLIs, or other GUI automation shims to operate apps. - Do not treat OpenClicky's own voice/router/native paths as violations
of this skill. The app may intentionally use
NSWorkspacefor foreground app or URL opens, native Swift CUA for screen-coordinate clicks/typing/keys, Background Computer Use/v1/clickfor window-screenshot pixel clicks, or the external-control bridge for visible guidance and explicit coordinate clicks. Agents should not reimplement those paths with shell shims. - For external account-backed apps, prefer a connected structured MCP or Composio integration when one is exposed and suitable. Do not silently use Computer Use as a substitute for a missing/expired connector on private account data, and do not offer to connect that integration yourself. Tell the user to open OpenClicky Settings -> Integrations and connect/reconnect the named app, offer voice/in-app guidance through setup, and do not use Computer Use to operate OpenClicky's own Settings/Integrations setup flow. Use visible UI only for the original third-party app task when the user asked for it, approved it, or the app has no shipped connector route.
- Screenshots and focused-window context do not choose the Computer Use route by themselves. If an integration-capable app such as LinkedIn, Gmail, Slack, Notion, Linear, GitHub, Calendar, Drive, Docs, or Sheets is visible, still prefer the structured MCP/Composio route first unless the user explicitly asks to click, type, or use that visible page/window.
- Keep work backgrounded by default. The user should be able to keep typing in their current app while you operate another app or browser window.
- For a new browser task, resolve the user's default HTTPS browser and
open the target URL in a new background window with
launch_app({bundle_id: <default_browser_bundle_id>, urls: [...]}). This preserves the user's current tab/window and their normal logged-in browser profile. - If the user explicitly asks you to work in "this tab", "this window", "the page I have open", or the screenshot clearly shows the target browser context they want operated, use that existing window/tab after a fresh snapshot instead of opening a new one.
- Do not pass
--user-data-diror other isolated-profile flags; they log the agent out of the accounts the user expects to use. - Prefer AX and app-scoped actions:
click({pid, window_id, element_index}),right_click({pid, window_id, element_index}),set_value,type_text,press_key,hotkey,scroll,page, andget_window_state. - OpenClicky's
computer-useMCP lane does not exposemove_cursor,drag,double_click, recording, replay, or zoom in this build. Prefer AX/page actions in this skill. Coordinate clicks are valid only through explicit OpenClicky pixel/foreground surfaces that document their coordinate space; do not invent rawCGEvent,cliclick, or shell-based pixel input. - Stop before purchases, sends, deletes, payments, account changes, submitting forms, or other irreversible actions unless the user explicitly approved that exact step.
The MCP background contract — read this first
For this computer-use MCP skill, default to preserving the user's
frontmost app. Users should be able to keep typing in their editor while
an agent drives another app in the background. This is a lane-specific
contract, not a statement that OpenClicky's app code must never
foreground anything: explicit voice app-open requests, native CUA,
Background Computer Use pixel actions, the visual-control bridge, and
product-managed foreground flows are separate OpenClicky routes.
Before running any shell command, ask: "does this raise,
activate, foreground, or make-key any app?" If yes, don't run it.
Every one of the commands below activates the target on macOS and is
therefore forbidden inside this MCP lane. If the user explicitly asks
for frontmost state and the current OpenClicky tool surface exposes a
foreground/native route, use that documented route. If it does not,
explain that this skill can keep working in the background but cannot
foreground the app through computer-use.
Every form of the
openCLI —open -a <App>,open -b <bundle-id>,open <file>,open <path-to-App.app>,open <url>— always activates. macOS routes all forms through LaunchServices, which unhides and foregrounds the target regardless of whether you passed an app name, a bundle id, a document, a URL, or the bundle path itself. The activation happens even when the only intent was "start the process." Never useopenfor any app launch. This includes launching a just-built .app from a local build dir (e.g.open build/Build/Products/Debug/MyApp.app) — resolve theCFBundleIdentifierfromInfo.plistand uselaunch_appwith that id. If foregrounding is genuinely needed, use a documented OpenClicky foreground/native route rather than shelling out.osascript -e 'tell application "X" to activate'— activates by design. Same for... to open <file>,... to launch, and anything withactivatein the tell block.osascript -e 'tell application "System Events" to ... frontmost'in a mutating form (settingfrontmostrather than reading it).AppleScript files that invoke
activate,launch, oropenagainst the target app.cliclick(moves the user's real cursor to the target coords before clicking — a focus-steal-equivalent even if the app's window state is unchanged).CGEventPostwithcghidEventTaptargeting a coordinate over a different app's window (warps the cursor, possibly activates on hit).AppleScriptTask,NSAppleScript,Processwrappingosascriptthat contains any of the above.NSRunningApplication.activate(options:)called from your own helper binary — same class.Dock clicks and any
openinvocation (see the first bullet — every form ofopengoes through LaunchServices which activates, full stop).Keyboard shortcuts that semantically mean "focus here" — most notably Chrome / Safari / Arc's
⌘L(focus omnibox) and Finder's⌘⇧G(Go to Folder). These aren't pure key events — the receiving app interprets "user wants to type here" as activation intent and raises its window to be key. Even when delivered to a backgrounded pid viahotkey, the downstream app pulls focus. For omnibox navigation specifically, the correct path is to resolve the user's default browser bundle id and calllaunch_app({bundle_id: <default_browser_bundle_id>, urls: ["https://..."]})— no omnibox dance, no⌘L, no focus-steal. Do NOT tryset_valueon the omnibox for navigation: browser commit logic often requires a "user-typed" signal that neither an AX value write norCGEvent.postToPidkeystrokes supply from a backgrounded pid. The URL may land in the field but Return can fire as a no-op. SeeWEB_APPS.md→ "Navigate to a URL" for the full default-browser pattern. The general principle: a shortcut that says "put my cursor inside this app" is a focus-steal; a shortcut that says "do this thing" (copy, save, quit) is fine.Tab-switching shortcuts in browsers (
⌘1..⌘9,⌘],⌘[,⌘⇧[,⌘⇧]) are visibly disruptive even when delivered to a backgrounded pid. The app's key handler processes the shortcut, the window re-renders the new tab's content, the user sees their tabs flipping. There is no AX-only workaround: page content (HTML, form state,AXWebArea) populates only for the focused tab; inspecting a background tab requires activating it, which is the visible flip. Observed with Dia; the same mechanic applies to every Chromium-family browser (Chrome, Arc, Brave, Edge).Prefer the windows-over-tabs pattern: for each URL you need to drive backgrounded, resolve the user's default browser and use
launch_app({bundle_id: <default_browser_bundle_id>, urls: [url]})so browsers open each URL in a new window where supported. Each window has its ownwindow_id, its own AX tree, and can be inspected / interacted with viaelement_indexwithout activating or switching anything. Tabs are a UX grouping for humans; cua-driver workflows should default to windows. SeeWEB_APPS.md→ "Tabs vs windows" for the full pattern.Tab-title enumeration (read-only) IS safe — walk a window's toolbar AX tree for
AXTab/AXRadioButtonchildren and read theirAXTitles. Tab switching (activating one) is not.
If exact frontmost-app identity matters and the local computer-use
tools cannot answer it, stop and explain the limitation. Do not shell
out to osascript as a side channel.
Corollary — the AXMenuBar rule. AXMenuBarItem + AXPick
dispatches at the AX layer regardless of which app is frontmost,
but macOS's on-screen menu bar always belongs to the frontmost
app. If you drive a backgrounded app's menu bar, the AX call
succeeds but the viewer sees the dispatch rendered over the
frontmost app's menu bar — confusing in any observed session and
routinely a silent no-op too, because action menu items go
DISABLED when their owning app isn't the key window. So: only
use menu-bar navigation when the target is already frontmost. For
backgrounded targets, read state via in-window AX (window title,
toolbar AXStaticText) and dispatch via in-window element_index
actions. Pixel clicks are not the default path in this MCP lane; use
them only through explicit OpenClicky pixel/foreground surfaces that
document their coordinate space. Full rationale in "Navigating native
menu bars" below.
Inside this MCP lane, "open <app>" means launch, not activate.
launch_app is the correct path for background process startup: it is
idempotent, returns the pid, and is designed for background operation.
If OpenClicky's voice/native router handled the same phrase as an
explicit foreground app-open, that is app behavior outside this skill.
Do not override it, and do not recreate it with open or AppleScript.
Defaults — always prefer the local MCP tools over shell shims
Every reference to click(...), get_window_state(...), etc. in this
skill means call the same-named tool on OpenClicky's local computer-use
MCP server. The standalone cua-driver CLI examples in upstream docs
are for local development only; they are not the product runtime path.
Intent → tool mapping. If you find yourself reaching for the right
column while operating through computer-use, something has gone wrong
— re-read "The MCP background contract" above:
| Intent | Use | Don't use |
|---|---|---|
| Open / launch an app | launch_app({bundle_id}) or launch_app({bundle_id, urls:[...]}) |
open -a, osascript 'tell app … to launch/activate/open' |
| Find a pid | list_apps or launch_app's return |
pgrep, ps, osascript frontmost |
| Enumerate an app's windows | list_windows({pid}) — or read the windows array launch_app already returns |
osascript 'every window of app …' |
| Click / type / scroll / keys | click, type_text, scroll, press_key, hotkey |
osascript, cliclick, raw CGEvent, open <url> |
| Drag / drag-and-drop / marquee select | Not exposed in the computer-use MCP lane; use only an explicit OpenClicky foreground/pixel route if the current tool surface documents one |
cliclick dd:, osascript drag |
| Screenshot | screenshot or the PNG in get_window_state |
screencapture |
| Quit an app | ask the user first, then hotkey({pid, keys:["cmd","q"]}) |
kill, killall, pkill |
| Hand a file/URL to an app | launch_app({bundle_id, urls:[<path>]}) |
open -a <App> <path>, open <url> |
The frontmost handoff
The computer-use MCP lane has no activation tool and does not allow
osascript activation. If the user explicitly asks for frontmost state
("bring Chrome to the front", "make it frontmost", "I want to see X"),
use an explicit OpenClicky foreground/native tool only if the current
tool surface exposes one. If only the computer-use MCP lane is
available, say that this build can keep working in the background but
cannot foreground the app through computer-use.
When a Computer Use call surprises you, diagnose the local MCP state first:
- Tiny screenshot / empty
tree_markdown? Checkget_config→capture_mode. Default"som"returns both the AX tree and screenshot."vision"omits the AX tree (PNG only),"ax"omits the PNG. If a snapshot lacks a tree,capture_modeis almost certainly"vision"— either reason purely from the PNG or flip to"som"/"ax"viaset_config. has_screenshot: false? The window capture failed (transient race against a close, or the window has no backing store yet). Re-snapshot; if persistent, pick a differentwindow_idvialist_windows.Invalid element_index/No cached AX state? You either skippedget_window_statethis turn or passed a differentwindow_idthan the one the snapshot cached against. The cache is keyed on(pid, window_id)— indices don't carry across windows of the same app. Re-snapshot with the same window_id you're about to click in.- Sparse Chromium AX tree? Retry
get_window_stateonce — the tree populates on second call.
Only after those are ruled out, and only if the user's action genuinely
needs frontmost state, hand off to an explicit OpenClicky foreground
route when one is available. Otherwise stop and explain the foreground
requirement. The computer-use MCP lane does not expose an activation
tool.
Self-check pattern
Before every Computer Use tool call that touches any macOS app (launching, opening, clicking, typing, scripting, screenshotting), run the self-check:
- Does this command foreground the target? If yes — stop and translate to the Computer Use equivalent from the mapping table, or use an explicit OpenClicky foreground/native route if that is the selected tool surface.
- Does this command move the user's real cursor? (
cliclick, anyCGEventPostatcghidEventTapover another app's window). If yes — stop; use an AX element action, keyboard shortcut, orpageaction, or use a documented OpenClicky pixel route when the tool surface explicitly exposes one. - Does this command bypass Computer Use entirely? (
osascriptmutating GUI state, AppleScript files, external helpers.) If yes — stop; find the local MCP tool that does the intent.
If all three are "no," the command is safe. If you can't answer,
default to stop and ask rather than proceed. A single open -a
run by accident kills the demo, the trust, and the user's in-flight
editor state.
Prerequisites — check before starting
- Confirm the
computer-useMCP server is available in the current Codex session. OpenClicky starts it automatically from the bundledClickyComputerUseRuntimehelper. - Call
check_permissions({"prompt": false}). If Accessibility or Screen Recording is missing, stop and ask the user to grant those permissions to Clicky.app in System Settings. - Do not start daemons, install CuaDriver.app, or invoke standalone shell commands. In OpenClicky, the MCP server is already the daemon.
Using the local MCP tools
Tool names are snake_case and are called directly through the
computer-use MCP server.
Canonical multi-step workflow:
launch_app({"bundle_id":"com.apple.calculator"})
# -> {pid: 844, windows: [{window_id: 10725, ...}]}
get_window_state({"pid":844,"window_id":10725})
click({"pid":844,"window_id":10725,"element_index":14})
get_window_state({"pid":844,"window_id":10725})
Agent cursor overlay
Visual cursor overlay for demos and visible progress. Default:
enabled. Toggle with set_agent_cursor_enabled({"enabled":true|false}).
A triangle pointer Bezier-glides to each
click target, ring-ripples on landing, idle-hides after ~1.5s.
Motion knobs: set_agent_cursor_motion takes any subset of
start_handle, end_handle, arc_size, arc_flow, spring —
tuneable at runtime, persisted to config.
OpenClicky's MCP helper bootstraps the AppKit runloop required by the overlay.
The core invariant — snapshot before AND after every action
Every action MUST be bracketed by get_window_state(pid, window_id):
- Before — the pre-action snapshot resolves the
element_indexyou're about to use. Indices from previous turns are stale; the server replaces the element index map on every snapshot, keyed on(pid, window_id). Indices from turn N don't resolve in turn N+1, and indices from window A don't resolve against window B of the same app. Skip this and element-indexed actions fail withNo cached AX state. - After — the post-action snapshot verifies the action actually landed. Without it you can't tell a silent no-op from a real effect. The AX tree change (new value, new window, disappeared menu, disabled button, etc.) is your evidence that the action fired. If nothing changed, the action probably failed silently — say so, don't assume success.
This applies to pixel clicks too — re-snapshot after to confirm the click landed on the intended target.
Why window selection is the caller's job now
get_app_state used to pick a window for you via a max-area heuristic
that returned the wrong surface on apps with large off-screen utility
panels. Concrete reproducer: IINA's OpenSubtitles helper (600×432
off-screen) out-area'd the visible 320×240 player window, so
get_app_state(pid) screenshot'd the invisible panel and clicks landed
there silently. The new get_window_state(pid, window_id) makes the
caller name the window explicitly — the driver validates that the
window belongs to the pid and is on the current Space, then snapshots
exactly what was asked for. Enumerate candidates via list_windows or
read the windows array launch_app already returns.
Behavior matrix
Two orthogonal axes shape what the agent can do.
capture_mode → addressing mode
capture_mode |
get_window_state returns |
Use for actions |
|---|---|---|
som (default) |
tree + screenshot | element_index preferred; pixels are diagnostic only in OpenClicky |
ax |
tree only (no PNG) | element_index only |
vision |
PNG only (no tree) | read-only visual inspection in OpenClicky — see SCREENSHOT.md |
vision was renamed from screenshot — the old name still decodes
as a deprecated alias, so an on-disk "capture_mode": "screenshot"
keeps working. Default is som so element_index clicks work the
first time a user calls get_window_state; the other modes are
opt-in when the caller specifically doesn't want one half of the
work. Note the tool named screenshot is separate (raw PNG, no AX
walk) and unrelated to the capture mode.
When a snapshot looks wrong (tiny screenshot / empty tree), call
get_config and inspect capture_mode before anything else.
Pure-vision mode has its own caveats — the model's vision pipeline downsamples dense text aggressively, so pixel grounding takes multiple correction cycles on text-heavy UIs. Read SCREENSHOT.md before driving anything in that mode; it documents the iterate/annotate/verify recipe plus the JPEG-over-PNG finding.
Window state → what works
| state | get_window_state |
click/set_value (AX) |
press_key commit (Return/Space/Tab) |
pixel click in explicit OpenClicky routes |
|---|---|---|---|---|
| frontmost | ✅ | ✅ | ✅ | native/bridge route when explicitly exposed |
| backgrounded / visible | ✅ | ✅ | ✅ | BCU window-screenshot route when explicitly exposed |
| minimized (Dock genie) | ✅ | ✅ (no deminiaturize — AX actions fire on the minimized window in place) | ❌ silent no-op / system beep — use set_value or click equivalent |
❌ no on-screen bounds |
hidden (hides=true / NSApp.hide) |
✅ | ✅ | depends | ❌ |
| on another Space | ⚠️ AX tree often stripped to menu-bar-only on SwiftUI apps (System Settings) — AppKit apps usually fine. Response carries off_space: true + window_space_ids so you can detect it |
✅ | ✅ | ❌ window not in current-Space list |
Critical cell — minimized + keyboard commit. The keystroke
reaches the app but AX focus doesn't propagate to renderer focus on
a minimized window. Workarounds in order of preference:
set_value to write the field's entire value directly, or AX-click
a commit-equivalent button (Go, Submit, checkbox). Tell the user
the window needs to un-minimize only as a last resort.
The canonical loop
launch_app(target)
→ pick window_id from the returned `windows` array
(or call list_windows(pid) separately)
→ get_window_state(pid, window_id)
→ [act] # every action also takes (pid, window_id)
→ get_window_state(pid, window_id) → verify
launch_app now returns a windows array alongside the pid, so the
common case collapses to two calls (launch_app → get_window_state)
without a separate list_windows hop.
1. Resolve target pid — always via launch_app
Always start with launch_app, whether or not the target is already
running. It's idempotent (relaunching returns the existing pid with no
side effects) and gives you the pid in one call — no list_apps hop.
launch_app({bundle_id: "com.apple.finder"})— preferred, unambiguous.launch_app({name: "Calculator"})— when bundle_id isn't known.
launch_app is the background-safe launch primitive for this MCP
lane: agents drive apps in the background while the user keeps typing in
their real foreground app. In Cua v0.1.6,
LaunchServices is called without activating the target, menu hotkey
delivery restores the previous frontmost app immediately after posting
the key, and the previous frontmost app is restored if the target
self-activates while opening URLs. The target window may become visible
behind the user's current app, but it must not become frontmost.
If the user explicitly wants the window foregrounded (usually for a
demo or recording), hand off to an explicit OpenClicky
foreground/native route if the current tool surface exposes one.
Otherwise, tell the user to bring it forward themselves — Dock click,
Cmd-Tab, or Spotlight. Do not reach for open / osascript activate
as a shortcut to make the window visible inside this MCP lane.
Never shell out to any form of open (including open <path-to-App.app> for a just-built binary — resolve the bundle id
from Info.plist and use launch_app with that), osascript 'tell app … to launch/open', or similar. Those paths activate the target,
bypass the driver's focus-restore guard, and require a Bash
permission prompt the agent loop shouldn't be burning on app launch.
See "Prefer the local MCP tools over shell shims" above for the full
intent → tool mapping.
list_apps is for app-level discovery (answering "what's installed /
running / frontmost?") — not part of the core action loop. Skip it in
the loop. For window-level questions — "does this app have a
visible window?", "which Space is this window on?", "which of this
pid's windows is the main one?" — call list_windows instead; the
app record doesn't carry window state on purpose. In the common
single-window case you can skip list_windows entirely and read the
windows array that launch_app already returned.
2. Snapshot and act by element_index
Call get_window_state({pid, window_id}) with the window_id from
launch_app's windows array (or a fresh list_windows({pid}) if
you're interacting with a long-lived process). The default som
capture_mode returns both the AX tree and screenshot, so the
canonical loop works immediately without any config change. The rest
of this section walks through som mode. If you're in vision mode
(PNG only, no AX tree), flip back with set_config({"key": "capture_mode", "value": "som"}).
In som mode (the default) the response carries:
tree_markdown— every actionable element tagged[N]. ThatNis theelement_index. The tree can be very large (Finder is ~1600 elements, ~190 KB); when it exceeds token limits, inspect the returned file/path or re-snapshot a narrower target window.screenshot_file_path— absolute path to the saved screenshot whenscreenshot_out_filewas passed. Absent otherwise.screenshot_width/_height/_scale_factor— dimensions of the captured image. Present whenever a screenshot was taken. Getting the screenshot as a file:
Pass screenshot_out_file when using get_window_state and you need a
file path instead of inline image content:
{"pid": N, "window_id": W, "screenshot_out_file": "/tmp/shot.jpg"}
The MCP image content block is omitted from the response when this param
is set, so the model receives the AX tree and screenshot_file_path.
Reason over both the tree AND the screenshot — they're
complementary, not redundant. In som mode every
turn's get_window_state gives you both halves and you should pull
signal from each:
- The AX tree tells you what's clickable — roles, labels,
element_indexhandles, advertised actions, parent-child structure. This is the ground truth for dispatching. - The screenshot tells you which one — the tree often has
many buttons with similar or empty labels ("Delete", "OK",
anonymous UUID-labeled buttons, five
AXStaticText = " "), and visual context disambiguates. Captions, colors, layout relationships visible in pixels often don't show up in the AX tree at all (especially in Chromium / Electron / web content).
Canonical pattern: look at the screenshot to decide "the blue
Subscribe button on the top-right of the video card", then walk the
tree to find the matching AXButton and dispatch by its
element_index. Don't try to do it from just the tree — you'll
pick the wrong element when labels repeat. Don't try to do it from
just the screenshot — you lose the reliable AX-action path and the
safe backgrounded-dispatch.
Standalone Cua can support coordinate-based input for canvas, video,
WebGL, or custom-drawn surfaces that are not reachable through AX.
OpenClicky's MCP background lane should still prefer AX/page actions,
but app-managed routes may expose coordinate clicks with explicit
coordinate spaces: native Swift CUA uses global AppKit screen points,
and Background Computer Use /v1/click uses window-screenshot pixel
coordinates. If AX, keyboard, and page cannot reach the target, use
only those documented OpenClicky pixel routes when the current tool
surface exposes them; otherwise stop and report the missing capability
instead of guessing.
The actions=[...] list on each element is advisory, not
authoritative. cua-driver does not pre-flight check against it —
click({pid, element_index}) always attempts AXPress (or the
action you pass) and surfaces whatever the target returns. Many
apps accept AXPress on elements that don't advertise it — Chrome's
omnibox suggestion AXMenuItem is a live example. Try the click
first — pivot only on the returned AX error code.
Dispatch table (every row assumes a (pid, window_id) pair from the
last get_window_state; window_id is required alongside
element_index):
| Intent | Tool | Notes |
|---|---|---|
| List an app's windows | list_windows({pid}) |
returns window_id, title, bounds, z_index, is_on_screen, on_current_space. Already included in launch_app's response — only call this for long-lived pids |
| Snapshot a window | get_window_state({pid, window_id}) |
returns tree_markdown + screenshot_*; populates the (pid, window_id) element_index cache |
| Left click | click({pid, window_id, element_index}) |
default action: "press". Prefer element_index in this MCP lane; use coordinate forms only when the current OpenClicky tool descriptor documents them. |
| Double-click / open | Use click({pid, window_id, element_index, action: "open"}) when the element advertises an open action |
double_click is not exposed in the computer-use MCP lane because it can fall back to pixel stamping. |
| Right click / context menu | right_click({pid, window_id, element_index}) or click({pid, window_id, element_index, action: "show_menu"}) |
Chromium web-content coerces pixel right-click to left — see WEB_APPS.md |
| Type at cursor | type_text({pid, text, window_id, element_index}) |
AXSelectedText write; focuses first |
| Set whole field value | set_value({pid, window_id, element_index, value}) |
sliders, steppers, text fields; use for keyboard-commit workarounds on minimized windows |
| Scroll | scroll({pid, direction, amount, by, window_id, element_index}) |
synthesizes PageUp/PageDown/arrows via SLEventPostToPid |
| Focus + send key | press_key({pid, key, window_id, element_index, modifiers}) |
element_index sets AXFocused, then posts key |
| Send key to pid | press_key({pid, key, modifiers}) |
no focus change; key goes to pid's current focus |
| Modifier combo | hotkey({pid, keys}) |
e.g. ["cmd","c"]; posted per-pid, not HID tap |
| Unicode keystrokes | type_text({pid, text, delay_ms}) |
AX write with automatic CGEvent fallback; reaches Chromium/Electron inputs |
All keyboard/text primitives require pid. There is no
frontmost-routed variant — every key goes to the named target via
CGEvent.postToPid, so the driver cannot leak keystrokes into the
user's foreground app.
Why element_index is the primary path: works on hidden /
occluded / off-Space windows, no focus steal, stable across
rebuilds, labels tell you what you're clicking. If AX cannot reach the
target, use keyboard shortcuts or page where possible; otherwise use
an explicit OpenClicky pixel route only when the current tool surface
documents it.
Pixel-coordinate clicks
Pixel clicks are supported in OpenClicky only through explicit routes
with explicit coordinate spaces. Do not infer that every click tool
uses the same coordinates.
computer-useMCP element clicks use(pid, window_id, element_index).- Background Computer Use
/v1/clickuses the focused window screenshot pixel coordinate returned by that BCU capture, not global display points. - Native Swift CUA and the external-control bridge use global AppKit
screen points, matching screenshot
displayFramemetadata.
Screenshots in this MCP skill are primarily for visual disambiguation:
use them to choose the right AX element, then dispatch by
element_index. Translate pixels into clicks only when the current
OpenClicky tool descriptor explicitly says that is the accepted input
space.
If the target is a canvas, video player, WebGL surface, game viewport,
or custom-drawn control with no useful AX tree, try reachable
alternatives first: keyboard shortcuts, command palettes, toolbar
controls, or the page tool. If those cannot perform the task, stop
and tell the user the action needs an OpenClicky pixel/foreground route
that is not exposed in the current tool surface.
For browser video playback, prefer page or keyboard controls where
available, such as press_key({pid, key: "k"}) for YouTube-style
players or press_key({pid, key: "space"}) for a focused generic
player. Verify with a fresh snapshot.
Canvases, viewports, games (Blender, Unity, GHOST, Qt, wxWidgets)
Apps whose main surface is an OpenGL / Metal / Qt / wxWidgets
viewport expose no useful AX tree — the whole surface is one
opaque AXGroup or AXWindow from AX's perspective. Per-pid event
paths (SLEventPostToPid, CGEvent.postToPid) are filtered by the
viewport's own event-source check and silently dropped — the event
loop wants "real HID origin".
There is no general AX/page backgrounded path that reaches these apps
today in the computer-use MCP lane. Use an explicit OpenClicky
pixel/foreground route only when the current tool surface exposes one;
otherwise stop and say the task requires a pixel/foreground capability
instead of silently escalating.
Navigating native menu bars (AXMenuBar)
Only drive the menu bar when the target app is frontmost. This
is the single most-misused cua-driver capability. If the target is
backgrounded, don't reach for AXMenuBarItem + AXPick — use
in-window element_index, toolbar controls, command palettes, or
keyboard shortcuts instead. Two reasons, one
functional and one perceptual:
- Functional: menu items that touch document/playback/editor
state go
DISABLEDwhen their owning app isn't the key window (Preview rotate, IINA speed change, most editor commands). AXPick- AXPress will dispatch successfully from the driver's side but no-op at the target — you get a silent false-pass.
- Perceptual (matters for demos, screen recordings, and anything
the user watches live): macOS's screen-rendered menu bar
always belongs to the frontmost app. AXPick on a backgrounded
app's
AXMenuBarItemdispatches to that app's per-process menu at the AX layer, but any visible menu render happens over the frontmost app's menu bar — the viewer sees an IINA submenu flashing on top of Chrome's menus, which reads as "the agent clicked the wrong app." The AX call was correct; the frame the user sees is not. For recorded or observed sessions, this is an integrity bug even though it's not a correctness bug.
Good decision rule: if the target is not already frontmost, do
not use AXMenuBarItem at all. For reading in-window state,
snapshot the window AX tree — most apps expose the same state via
an in-window AXStaticText, title bar, or toolbar. For dispatching
actions, use in-window element_index targets, toolbar controls,
command palettes, keyboard shortcuts, or page where available. If
the only workable path is a coordinate/pixel click, stop and say that
the current computer-use MCP lane does not expose the needed pixel
mode, unless another explicit OpenClicky foreground/pixel route is
available.
When the target IS frontmost, the menu-bar flow below is fine and the canonical path for menus.
The two-snapshot pattern (target frontmost only)
Menu contents are a two-snapshot flow. Closed AXMenu subtrees are
deliberately skipped during snapshot — otherwise every app's File /
Edit / View hierarchy plus every Recent Items macOS has ever seen
would inflate the tree 10-100x. But once a menu is open, its
AXMenuItem children do receive element_index values so you can
click them normally.
- Find the
[N] AXMenuBarItem "<Menu Name>"in the tree. click({pid, element_index: N, action: "pick"})— menu bar items implementAXPick("open my submenu"), notAXPress. Using the default action on an AXMenuBarItem is a no-op.- Re-snapshot. The expanded menu's items now appear under the bar
item as
[M] AXMenuItem "<Item Name>". - Click the target item — most items respond to
AXPress(default action). Submenus nest under the item and are walked the same way. - Re-snapshot and verify.
If you ever need to back out without selecting, press_key({pid, key: "escape"}) closes the open menu. Leaving a menu expanded between
turns poisons subsequent snapshots for that pid.
Commands gated on the target being frontmost
Some menu items and global shortcuts (Preview's Tools → Rotate
Right, ⌘R; anything in the View menu that manipulates the current
document; most editor commands) are disabled unless the target
app is the key / frontmost window. You'll see it in the AX tree
as DISABLED on the menu item even though the user's intent is
obviously valid.
Before activating, confirm you're in this narrow case — the menu
item still reads DISABLED after a fresh snapshot AND the action
the user requested genuinely requires frontmost (Preview rotate,
View menu document manipulation, editor commands). If either
check fails, don't activate.
When both checks pass, the computer-use MCP lane still does not expose
an activation tool and does not allow osascript activation. Hand off
to an explicit OpenClicky foreground/native route if the current tool
surface exposes one; otherwise tell the user the action requires the
target app to be frontmost. If they bring it frontmost, re-snapshot and
continue.
Web-rendered apps (browsers, Electron, Tauri)
For Chrome / Edge / Brave / Arc / Safari, Electron apps (Slack,
VSCode, Notion, Discord), and Tauri apps — see WEB_APPS.md.
Covers: sparse AX tree population (retry-once pattern for Chromium),
URL navigation via launch_app({bundle_id, urls:[...]}), forbidden
omnibox fallbacks (⌘L, URL-like type_text, URL-like set_value),
the set_value workaround for ordinary keyboard commits on
minimized windows (Return silently no-ops — symptom is a macOS
system beep), scrolling via synthetic PageUp/Down keystrokes, in-page
clicks, and typing into web inputs.
Chromium web content specifically also coerces right_click back to
left — use element_index for AX-addressable targets and accept the
limit otherwise.
Browser JS primitives — page tool and get_window_state(javascript=)
When the AX tree doesn't expose the data you need (common in
Chromium/Electron — the tree is sparse for web content), use the
page tool or the javascript param on get_window_state to query
the DOM directly via Apple Events. Requires "Allow JavaScript from
Apple Events" to be enabled — see WEB_APPS.md for the setup path.
Three actions on the page tool:
page({pid, window_id, action: "get_text"})— returnsdocument.body.innerText. Fastest way to read page content, prices, article text, or any raw text the AX tree truncates or omits.page({pid, window_id, action: "query_dom", css_selector: "a[href]", attributes: ["href"]})— runsquerySelectorAlland returns each match's tag, text, and requested attributes as a JSON array. Use for table rows, link hrefs, data attributes, structured page data.page({pid, window_id, action: "execute_javascript", javascript: "..."})— raw JS. Wrap in an IIFE with try-catch. Don't use this for elements already indexed byget_window_state—clickandset_valueare more reliable there.
Co-located read — get_window_state with javascript:
get_window_state({pid, window_id, javascript: "document.title"})
Runs the JS and appends the result as a ## JavaScript result section
alongside the AX snapshot — one round-trip instead of two. Use this
when you need both the element tree (for subsequent clicks) and some
page data in the same turn.
Decision rule — AX vs JS:
| Need | Use |
|---|---|
| Click / type into an element | get_window_state → click / set_value (AX, works backgrounded) |
| Read text the AX tree drops | page(get_text) or get_window_state(javascript=) |
| Scrape structured data (tables, hrefs) | page(query_dom) |
| Trigger JS events / mutations | page(execute_javascript) |
Supported backends:
| App type | How | Context |
|---|---|---|
| Chrome / Brave / Edge | Apple Events execute javascript |
Full DOM ✅ |
| Safari | Apple Events do JavaScript |
Full DOM ✅ |
| Electron (VS Code, Cursor…) | SIGUSR1 → V8 inspector → CDP | Main process only: process, Buffer — no document, no require in sandboxed apps |
| Electron debug-port page target | Future/runtime-explicit only | Full DOM when exposed |
Electron sandbox note: SIGUSR1 connects to the Node.js main process.
Sandboxed Electron apps (VS Code, Cursor) strip require and Electron
APIs there. Useful for: process.env, process.versions, process.cwd(),
process.pid. Do not relaunch user apps with debugging flags in
OpenClicky's default runtime; only use a debug-port page target if the
runtime already exposes it.
Arc returns no values; Firefox has no JS-via-AppleEvents support — see
WEB_APPS.md for the full matrix.
3. Re-snapshot and verify — mandatory
Always call get_window_state({pid, window_id}) after the action.
This isn't optional verification — it's the second half of the
snapshot invariant.
Check the AX tree diff: a changed value, a new element, a new window, or the disappearance of the thing you just clicked (menus collapse after selection, buttons may become disabled, etc.). If nothing changed, the action likely failed silently — tell the user what you attempted and what you observed, don't paper over with "done" language. Agents that skip this step report success on silently-dropped actions — the single most common failure mode.
Recording trajectories
OpenClicky's default runtime does not expose set_recording, get_recording_state, or replay_trajectory.
If the user asks to record, replay, or export a browser/app trajectory,
stop and explain that this build cannot do it through Computer Use.
See RECORDING.md only as upstream reference material for a future
runtime that explicitly exposes those tools.
Common error patterns
| Error text | Meaning | Fix |
|---|---|---|
No cached AX state for pid X window_id W |
You either skipped get_window_state this turn, or passed a different window_id to the click than the one the snapshot cached against |
Call get_window_state({pid: X, window_id: W}) first — the same window_id you intend to click in |
Invalid element_index N for pid X window_id W |
Index is stale or out of range | Re-run get_window_state with the same window_id, pick a fresh index from the new tree |
window_id W belongs to pid P, not … |
Passed a window_id that's owned by a different process | Use list_windows({pid: X}) to enumerate this pid's own windows |
AX action AXPress failed with code … |
Element doesn't support AXPress | Try show_menu, confirm, cancel, or pick |
macOS system-alert beep on press_key with no visible change |
Target window is minimized; Return / Space / Tab commits don't establish real renderer focus on minimized windows | AX-click a clickable equivalent (Go button, Submit button, checkbox) instead of pressing the key; see "Keyboard commits on minimized windows" under the Browser section |
Accessibility permission not granted |
TCC not granted | Stop; tell user to grant in System Settings |
Screen Recording permission not granted |
TCC not granted for capture | Affects screenshot and get_window_state (which always captures). Grant in System Settings — the driver can't operate without it |
Things to avoid
- Never reuse an
element_indexacross a re-snapshot of the same pid. - Never translate screenshot pixels into a click unless the current
OpenClicky tool descriptor explicitly documents that coordinate space.
In the MCP lane, screenshots are for visual disambiguation and
element_indexis the primary action handle. - Prefer AX over pixels in the MCP lane. Exhaust AX paths, command
palettes, toolbar items, keyboard shortcuts, and
pagebefore using or reporting the need for an explicit pixel/foreground route. - Never drive destructive actions (delete files, close unsaved documents, send messages, submit forms) without explicit user intent for that specific destructive step.
- Never launch apps autonomously; confirm with the user first unless their original request clearly implies the launch.
Example end-to-end task
User: "Open the Downloads folder in Finder."
launch_app({bundle_id: "com.apple.finder", urls: ["~/Downloads"]})→{pid: 844, windows: [{window_id: 6123, title: "Downloads", ...}]}. Idempotent launch; plus Finder opens a background-safe window rooted at~/Downloadsviaapplication(_:open:)— zero activation, no focus steal. Thewindowsarray lets you skip alist_windowshop.get_window_state({pid: 844, window_id: 6123})→ verify anAXWindowwhose title contains "Downloads" is present with a populated AX subtree (sidebar, list view, files).- Done.
If the user instead asks to navigate within an already-open Finder window, use the menu-bar flow from the "Navigating native menu bars" section above (click Go → pick a menu item → re-snapshot → click it).