discord-voice - SKILL.md Agent Skill

name: discord-voice
description: Sutando joins a Discord voice channel and runs a 2-way Gemini Live conversation. Standalone TS process: discord.js + @discordjs/voice + bodhi VoiceSession.
when_to_use: When the user (in a DM or task) asks Sutando to "join voice", "join the lounge", or generally to be present in a Discord voice channel for live conversation.

Discord Voice

Sutando joins a Discord voice channel and runs a real-time 2-way conversation via Gemini Live, reusing the same bodhi VoiceSession + tool wiring as skills/phone-conversation/scripts/conversation-server.ts (Twilio path).

When to Use

User says "join voice", "join the lounge", "join <voice channel name>", or any equivalent.
A task arrives asking Sutando to be present in a Discord voice channel.

NOT for: silent presence (no Gemini), text-only Discord channels (use discord-bridge.py), Zoom/Meet/phone (use the respective skills).

Architecture

One process, all in TypeScript:

Discord user voice
    ↓
@discordjs/voice receiver (opus packets per speaking user)
    ↓ prism opus.Decoder → PCM s16le 48k stereo
    ↓ downsample48StereoTo16Mono
    ↓
bodhi VoiceSession.handleAudioFromClient (PCM 16k mono)
    ↓
Gemini Live
    ↓ base64 PCM 24k mono
    ↓ upsample24MonoTo48Stereo
    ↓
@discordjs/voice AudioPlayer → opus-encoded out to voice connection
    ↓
Discord channel audio out
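
The two resampling steps in the pipeline above can be sketched as follows. The function names come from the diagram; the bodies are an assumption (box-filter decimation on the way in, sample repetition on the way out) and not necessarily what the skill's script does. A production path may prefer a proper low-pass resampler.

```typescript
// Naive resamplers for signed 16-bit PCM samples.
// ASSUMPTION: these bodies are illustrative; only the function names appear in the skill.

/** 48 kHz interleaved stereo -> 16 kHz mono: average L/R, then average each group of 3 frames. */
function downsample48StereoTo16Mono(input: Int16Array): Int16Array {
  const frames = Math.floor(input.length / 2); // one frame = one L sample + one R sample
  const outLen = Math.floor(frames / 3);       // 48000 / 16000 = 3 input frames per output sample
  const out = new Int16Array(outLen);
  for (let i = 0; i < outLen; i++) {
    let sum = 0;
    for (let k = 0; k < 3; k++) {
      const f = (i * 3 + k) * 2;               // index of the L sample of this frame
      sum += (input[f] + input[f + 1]) / 2;    // mono mix of the frame
    }
    out[i] = Math.round(sum / 3);
  }
  return out;
}

/** 24 kHz mono -> 48 kHz interleaved stereo: repeat each sample twice in time, on both channels. */
function upsample24MonoTo48Stereo(input: Int16Array): Int16Array {
  const out = new Int16Array(input.length * 4); // 2x rate * 2 channels
  for (let i = 0; i < input.length; i++) {
    const s = input[i];
    out.set([s, s, s, s], i * 4);               // L, R, L, R
  }
  return out;
}
```

Sample repetition keeps the sketch short at the cost of some audible aliasing; linear interpolation would be a small step up if quality matters.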

@discordjs/voice handles Discord's DAVE end-to-end encryption (E2EE) natively via DAVESession, so no extra config is needed.

Setup

Register a Discord bot account at the Discord developer portal. Give it the bot and applications.commands scopes plus the voice permissions (Connect, Speak, Use Voice Activity).
Add the bot token to ~/.claude/channels/discord/.env:
```
DISCORD_BOT_TOKEN=...
```
Invite the bot to your Discord server with voice channel access.
Set GEMINI_API_KEY in .env at the repo root.

Run

DISCORD_VOICE_SERVER=1 \
  npx tsx skills/discord-voice/scripts/discord-voice-server.ts \
  --guild <GUILD_ID> \
  --channel <VOICE_CHANNEL_ID>

Optional env:

VOICE_MODEL / VOICE_NATIVE_AUDIO_MODEL — mirrors voice-agent.ts.
SUTANDO_WORKSPACE — workspace root for tasks/results/data/logs and the per-user config (see below).

DISCORD_VOICE_SERVER=1 flips the polymorphic dismiss tool (src/meeting-tools.ts) into "SIGTERM self" mode instead of its default Zoom AppleScript path. Without it, asking Sutando to "leave"/"dismiss" in the channel would try to leave a (non-existent) Zoom meeting.

Config — per-user, lives in the workspace

This skill's config carries per-user data (your Discord channel ids, your owner-mode choices), so it does NOT live in the git repo. It lives in the workspace:

$SUTANDO_WORKSPACE/config/discord-voice.json

(default ~/.sutando/workspace/config/discord-voice.json; $SUTANDO_WORKSPACE is resolved by the canonical workspace helper).

The repo ships a committed template — skills/discord-voice/config.json.example — with the safe defaults. On first run, if the workspace config is missing, the server copies the template into place; you then edit the workspace copy. (If the copy can't happen, the server falls back to the built-in defaults — owner_mode: false, every channel read-only.) Never commit a live discord-voice.json back into the repo — it's per-user data, not code.
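
The first-run behavior described above might look roughly like this. It is a sketch under assumptions: `loadConfig`, `defaultWorkspace`, and `BUILTIN_DEFAULTS` are hypothetical names, the real server resolves the workspace through its own helper, and treating an unparsable config the same as a failed copy is a choice made here for brevity.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

interface DiscordVoiceConfig {
  model: string;
  googleSearch: boolean;
  owner_mode: boolean;
  channels: Record<string, { owner_mode?: boolean }>;
}

// Built-in defaults as documented: owner_mode off, no per-channel overrides.
const BUILTIN_DEFAULTS: DiscordVoiceConfig = {
  model: "gemini-2.5-flash-native-audio-preview-12-2025",
  googleSearch: true,
  owner_mode: false,
  channels: {},
};

// Hypothetical stand-in for the canonical workspace helper.
function defaultWorkspace(): string {
  return process.env.SUTANDO_WORKSPACE ?? path.join(os.homedir(), ".sutando", "workspace");
}

function loadConfig(workspace: string, templatePath: string): DiscordVoiceConfig {
  const configPath = path.join(workspace, "config", "discord-voice.json");
  try {
    if (!fs.existsSync(configPath)) {
      // First run: seed the workspace copy from the committed template.
      fs.mkdirSync(path.dirname(configPath), { recursive: true });
      fs.copyFileSync(templatePath, configPath);
    }
    const parsed = JSON.parse(fs.readFileSync(configPath, "utf8")) as Partial<DiscordVoiceConfig>;
    return { ...BUILTIN_DEFAULTS, ...parsed };
  } catch {
    // Copy or read failed: fall back to the read-only defaults.
    return { ...BUILTIN_DEFAULTS, channels: {} };
  }
}
```

A caller would pass `defaultWorkspace()` and the path to `skills/discord-voice/config.json.example`.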

Keys:

model / googleSearch — voice model + Web-grounding preference (defaults: gemini-2.5-flash-native-audio-preview-12-2025, true).
owner_mode — skill-wide owner-mode default (boolean). false by default.
channels — per-voice-channel override map: { "<voice_channel_id>": { "owner_mode": true } }. The channel entry is an object so it stays extensible.

Resolution for a given channel: channels[<channel_id>].owner_mode if that entry exists, else the skill-wide owner_mode, else false. A fresh config (owner_mode: false, channels: {}) runs every channel read-only.

{
  "model": "gemini-2.5-flash-native-audio-preview-12-2025",
  "googleSearch": true,
  "owner_mode": false,
  "channels": {
    "111111111111111111": { "owner_mode": true }
  }
}
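
The resolution order above (per-channel entry, then skill-wide value, then false) can be written as a small function. This is a minimal sketch; `resolveOwnerMode` is a hypothetical name, and treating a channel entry without an `owner_mode` key as "no override" is an assumption.

```typescript
interface ChannelOverride {
  owner_mode?: boolean;
}

interface OwnerModeConfig {
  owner_mode?: boolean;
  channels?: Record<string, ChannelOverride>;
}

function resolveOwnerMode(config: OwnerModeConfig, channelId: string): boolean {
  // 1. A per-channel override wins when present.
  const perChannel = config.channels?.[channelId]?.owner_mode;
  if (typeof perChannel === "boolean") return perChannel;
  // 2. Otherwise use the skill-wide default.
  if (typeof config.owner_mode === "boolean") return config.owner_mode;
  // 3. Otherwise stay read-only.
  return false;
}
```

With the example config above, channel 111111111111111111 resolves to owner mode and every other channel stays read-only.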

Trust boundary — read this before enabling owner-mode

owner_mode: false is the safe default: non-owner speakers in the voice channel get the read-only tool surface (current time, status checks, lookups) but NOT owner-tier work, file edits, or message sends.

owner_mode: true, whether set skill-wide or per-channel via channels, is the opt-in for single-operator personal-use mode: it grants owner-tier privileges to every speaker in the channel. It has a sharp edge: anyone who can speak in the same voice channel can delegate work, edit files, send messages, and do anything else the proactive loop can do. Only enable it for voice channels whose membership is fully trusted (your own Lounge, never community/public). Prefer the per-channel channels override over the skill-wide owner_mode so a trusted-channel grant doesn't leak to every channel the bot joins. Set it in the workspace config ($SUTANDO_WORKSPACE/config/discord-voice.json), never the committed .example template.

Trust boundary — per-speaker access tiers

Independently of owner_mode, owner-tier tools are gated per speaker, by Discord user id. Each turn is attributed to the speaker who started it, and tools are gated by that speaker's tier — read from the same ~/.claude/channels/discord/access.json the discord-bridge uses, so the two never drift:

owner — an id in the top-level allowFrom of access.json. Full tool surface: work, dismiss, screen-share, file edits, message sends.
team — an id in any groups[*].allowFrom (per-channel trusted circle: peers, collaborators) that is not also owner. Read-only inline tools + configurable tools + dismiss; no work / file edits. (dismiss is intentional: a teammate can end the bot's voice session — useful when the owner isn't present to close the room; the owner can rejoin via DM.)
other — anyone else speaking in the channel. Read-only inline tools only (time, status, lookups).

owner_mode: true and the per-speaker tiers compose: a speaker gets the owner surface when owner_mode is on for the channel or their id resolves to the owner tier.
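
The tier lookup and its composition with owner_mode can be sketched like this. The `AccessFile` shape is an assumption inferred from the two fields named above (`allowFrom` and `groups[*].allowFrom`); `resolveTier` and `hasOwnerSurface` are hypothetical names.

```typescript
type Tier = "owner" | "team" | "other";

// ASSUMED shape of access.json; only allowFrom and groups[*].allowFrom are documented here.
interface AccessFile {
  allowFrom?: string[];
  groups?: Record<string, { allowFrom?: string[] }>;
}

function resolveTier(access: AccessFile, userId: string): Tier {
  // Top-level allowFrom marks owners and takes precedence over group membership.
  if ((access.allowFrom ?? []).includes(userId)) return "owner";
  // Membership in any group's allowFrom marks a teammate.
  const inTeam = Object.values(access.groups ?? {}).some((g) =>
    (g.allowFrom ?? []).includes(userId),
  );
  return inTeam ? "team" : "other";
}

// Owner surface applies when owner_mode is on for the channel OR the speaker is an owner.
function hasOwnerSurface(tier: Tier, ownerModeForChannel: boolean): boolean {
  return ownerModeForChannel || tier === "owner";
}
```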

This is exactly the model discord-bridge.py uses (top-level allowFrom = owner, groups[*].allowFrom = team), so the same access.json is never read two ways. If allowFrom is empty, the gate falls back to the channel-wide owner_mode resolved from discord-voice.json (see the Config section above).

This means the bot can sit safely in a shared/multi-person voice channel as long as owner_mode is off for that channel: a non-owner speaker cannot trigger owner-tier tools. The gate runs at tool-execution time, so even if the model attempts the call, it is denied.
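
A gate that runs at tool-execution time might look like the sketch below. Everything here is illustrative: `executeGated`, the `MIN_TIER` table, and the deny-by-default rule for unlisted tools are assumptions, and the real classification lives in the skill's tool wiring.

```typescript
type SpeakerTier = "owner" | "team" | "other";
type ToolResult = { ok: true; value: unknown } | { ok: false; error: string };

const RANK: Record<SpeakerTier, number> = { other: 0, team: 1, owner: 2 };

// HYPOTHETICAL minimum tier per tool, following the tier descriptions above.
const MIN_TIER: Record<string, SpeakerTier> = {
  work: "owner",
  share_screen: "owner",
  stop_share_screen: "owner",
  dismiss: "team",
  get_current_time: "other",
  get_core_status: "other",
};

function executeGated(tool: string, speaker: SpeakerTier, run: () => unknown): ToolResult {
  // Unlisted tools are treated as owner-only (deny by default).
  const required: SpeakerTier = MIN_TIER[tool] ?? "owner";
  if (RANK[speaker] < RANK[required]) {
    // The tool body never runs, regardless of what the model requested.
    return { ok: false, error: `${tool} denied for ${speaker}-tier speaker` };
  }
  return { ok: true, value: run() };
}
```

Because the check wraps the call itself, a denied request returns an error to the model without the tool body ever executing.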

DM-triggered join — owner only

The bot joins a voice channel when its owner DMs it "join the lounge voice channel in <server>" — the loop spawns the run command above as a subprocess. The task-bridge → proactive-loop → Bash pipeline handles it.

A join request is honored only when the originating task's access_tier is owner. access_tier is set by discord-bridge.py from access.json (owner = top-level allowFrom; team = the union of groups[*].allowFrom; other = neither). A team- or other-tier "join voice" request is declined — a non-owner cannot make the bot enter a voice channel. This holds at two layers: non-owner Discord tasks are already routed to a read-only sandbox (see CLAUDE.md "Discord access control") which cannot spawn the server, and the join request itself is owner-gated on top of that.
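
The owner-only check on a join request can be sketched as a function that either declines or produces the run command. The `JoinTask` fields other than `access_tier` and the name `buildJoinCommand` are hypothetical; the command itself mirrors the Run section.

```typescript
type AccessTier = "owner" | "team" | "other";

// HYPOTHETICAL task shape; only access_tier is named in the skill.
interface JoinTask {
  access_tier: AccessTier;
  guildId: string;
  channelId: string;
}

interface JoinCommand {
  env: Record<string, string>;
  argv: string[];
}

function buildJoinCommand(task: JoinTask): JoinCommand | null {
  // Team- and other-tier requests are declined outright.
  if (task.access_tier !== "owner") return null;
  return {
    env: { DISCORD_VOICE_SERVER: "1" },
    argv: [
      "npx", "tsx", "skills/discord-voice/scripts/discord-voice-server.ts",
      "--guild", task.guildId,
      "--channel", task.channelId,
    ],
  };
}
```

Returning the argv instead of spawning keeps the decision testable; the caller decides how to launch the subprocess.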

Tools

Inherits the full inlineTools + ownerOnlyTools set from src/inline-tools.ts (same surface as voice-agent.ts and conversation-server.ts). Notable Discord-relevant tools:

work — delegate non-trivial tasks to core (writes tasks/voice-task-{ts}.txt, blocks on result).
dismiss — leave the current voice presence. Polymorphic via DISCORD_VOICE_SERVER env: SIGTERMs self in Discord mode, runs Zoom AppleScript otherwise.
share_screen / stop_share_screen — drive Discord's screen-share picker. Has a hard dependency — see below.
summon — skill-local override redirecting "share my screen" to share_screen (the core summon opens Zoom, wrong app when user is in Discord).
get_current_time, get_core_status, join_zoom, join_gmeet, lookup_meeting_id, call_contact — all standard.

Screen sharing — extra setup required

share_screen / stop_share_screen are NOT free — they CGEvent-click the Discord webapp's "Share Your Screen" button and the Chrome native share-picker. That means:

You need a separate Chrome instance running with Discord logged in. The tool targets the chrome-devtools-mcp Chrome profile specifically (at ~/.cache/chrome-devtools-mcp/chrome-profile), so the share happens as whoever is logged into THAT Chrome — not the bot, not necessarily your main Discord. Recommended: create a secondary ("alt") Discord account and log into the MCP-Chrome as that, so your primary Discord (in regular Chrome / desktop app) stays uninterrupted. The alt and the bot both join the voice channel; the alt's screen is what gets shared. The alt must be a member of the same Discord server — voice channels are server-only (no DM voice for bots / no DM screen-share via this tool).
That Chrome window must be open to the Discord voice channel detail view (not a text channel, not minimized). The script clicks at a hardcoded screen coord that corresponds to the main-view "Share Your Screen" button.
Hardcoded coords assume a maximized Chrome window (screenX=0, screenY=32 on macOS, 1920×972 outer). Move/resize the window and clicks miss. Re-derive coords via macos-use refresh_traversal on the MCP-Chrome main PID, then update COORDS in scripts/share-screen-modal.py.
macOS Accessibility permission is required for the controlling process (Claude Code / Terminal) to post CGEvent clicks. Grant in System Settings → Privacy & Security → Accessibility.

If you don't want screen-sharing, the rest of the skill (voice conversation, tool delegation) works without any of this — share_screen will fail silently with no impact on voice.

Graceful shutdown

SIGTERM/SIGINT triggers cleanupSession() which calls connection.destroy() (sends Discord voice-gateway disconnect frame) and voiceSession.close(). The handler then waits 1.5s before process.exit(0) so the disconnect frame actually flushes — without that delay, Discord pins the bot in-channel until its own 60-90s heartbeat timeout.
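
The shutdown sequence above can be sketched as follows. The interfaces stand in for the real voice connection and VoiceSession objects, and `makeShutdownHandler` with its injectable `exit` and `schedule` parameters is a hypothetical structure chosen to keep the example testable.

```typescript
interface Destroyable { destroy(): void }
interface Closable { close(): void }
type Scheduler = (fn: () => void, ms: number) => void;

function makeShutdownHandler(
  connection: Destroyable,
  voiceSession: Closable,
  exit: (code: number) => void,
  schedule: Scheduler = (fn, ms) => { setTimeout(fn, ms); },
  flushMs = 1500,
): () => void {
  let shuttingDown = false;
  return () => {
    if (shuttingDown) return;   // ignore a second signal while already shutting down
    shuttingDown = true;
    connection.destroy();       // sends the voice-gateway disconnect frame
    voiceSession.close();       // closes the Gemini Live session
    // Delay the exit so the disconnect frame has time to flush.
    schedule(() => exit(0), flushMs);
  };
}

// Assumed wiring in the server entry point:
//   const onSignal = makeShutdownHandler(connection, voiceSession, (c) => process.exit(c));
//   process.on("SIGTERM", onSignal);
//   process.on("SIGINT", onSignal);
```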

Transcripts + session metrics land in conversation.sqlite — the shared conversation and sessions tables (also used by voice + phone) — and are mirrored into the shared logs/conversation.log text log.

Operational/diagnostic output (the [Setup]/[Voice]/[Tool]/[VoiceSession]/[Dismiss] lines) is tee'd to $SUTANDO_WORKSPACE/logs/discord-voice.log — the discord-voice counterpart to logs/discord-bridge.log and logs/voice-agent.log — so the operational history survives a process exit.