prism-xr-empowering-privacy-aware-xr - SKILL.md Agent Skill

name: "prism-xr-empowering-privacy-aware-xr"
description: "Build privacy-aware pipelines that filter sensitive content from visual frames before sending to cloud AI models, using edge preprocessing with object detection (YOLO), text-based scene description, selective cropping, and structured MLLM interaction. Triggers: 'privacy-aware XR pipeline', 'filter sensitive data from camera frames', 'edge preprocessing before cloud AI', 'PRISM-XR privacy pipeline', 'sanitize visual input for LLM', 'multi-user XR collaboration with privacy'"

PRISM-XR: Privacy-Aware Visual Preprocessing for Cloud MLLM Integration

This skill teaches Claude to implement the PRISM-XR pattern: an edge-preprocessing pipeline that intercepts visual frames from cameras or XR headsets, detects and classifies objects by privacy sensitivity, replaces raw images with textual scene descriptions, and only sends minimal cropped regions to cloud MLLMs when absolutely necessary. The technique achieves 90%+ filtering of sensitive objects while maintaining ~89% task accuracy, and applies broadly to any system that sends camera/visual data to cloud AI services.

When to Use

When building an XR/AR application that sends headset camera frames to cloud LLMs (GPT-4o, Claude Vision) and needs to protect bystander faces, ID cards, medical records, or financial information visible in the background
When designing an edge-cloud architecture where a local server preprocesses visual data before forwarding to a cloud API, and you need a privacy filtering stage
When implementing a multi-user collaborative XR system that must synchronize AI-generated 3D content across devices while minimizing private data exposure
When adding a "privacy gate" to any vision pipeline that currently sends full-resolution frames to third-party APIs
When building content moderation or data minimization into a camera-to-LLM pipeline (security cameras, telepresence, remote assistance)
When implementing lightweight spatial registration between multiple AR/VR devices without full environment scanning

Key Technique

PRISM-XR's core insight is replacing raw image frames with structured textual descriptions generated by a local object detector (YOLO v11) running on an edge server. Instead of uploading a full camera frame to a cloud MLLM, the edge server runs YOLO locally, producing a text summary like "keyboard, center (327.80, 352.12), box (214, 283, 441, 421), conf 0.91" for each detected object. This textual description is substituted into the MLLM prompt in place of the image. The cloud model never sees the raw pixels of the user's environment — only structured object metadata.

When the text-only description is insufficient for the task (e.g., the user asks "make that object on my desk into a 3D model"), the system uses a two-stage MLLM pipeline: Stage 1 asks the cloud model to return a CropArea property (bounding box coordinates) identifying exactly which region it needs to see. The edge server then crops only that region from the original frame — excluding sensitive surroundings — and sends it in Stage 2 for detailed generation. This selective-crop approach reduced highly sensitive object exposure from 53.6% (full frames) to 7.1% (cropped frames) in evaluation.
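The decision of whether a requested CropArea is safe to cut out can be sketched as a simple box-intersection test. This is a minimal illustration, assuming boxes are (x1, y1, x2, y2) pixel tuples and that the edge server has kept the boxes of highly sensitive detections from the full frame; `boxes_overlap` and `crop_is_safe` are hypothetical helper names, not part of PRISM-XR.

```python
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

def boxes_overlap(a: Box, b: Box) -> bool:
    """True if two axis-aligned boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def crop_is_safe(crop: Box, sensitive_boxes: Iterable[Box]) -> bool:
    """A crop is safe only if it touches no highly sensitive detection."""
    return not any(boxes_overlap(crop, s) for s in sensitive_boxes)

# Example: a face detected near the right edge of a 1280x720 frame
face = (1000, 100, 1200, 400)
print(crop_is_safe((214, 283, 441, 421), [face]))  # keyboard region, far from the face
print(crop_is_safe((900, 50, 1100, 300), [face]))  # region that overlaps the face
```

A geometric check like this complements re-running the detector on the crop, since a detector may miss an object that is only partially visible inside a small crop.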

For multi-user synchronization, PRISM-XR replaces expensive environment-scanning registration (Meta SSA takes ~6.7s) with AprilTag marker-based coordinate alignment completing in 0.27s. Content synchronization uses a compact 48-byte-per-object wire format (4B ID + 12B position + 16B quaternion rotation + 12B scale + 4B events) broadcast over native TCP sockets, achieving 15.5ms interaction sync latency at 60Hz.
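The 48-byte figure and the resulting bandwidth can be checked with a few lines of arithmetic. The sketch below assumes a little-endian layout in the field order listed above and counts payload bytes only (no TCP/IP overhead); `bandwidth_bytes_per_second` is a hypothetical helper.

```python
import struct

WIRE_FORMAT = "<I3f4f3fI"  # ID, position, quaternion, scale, event flags
BYTES_PER_OBJECT = struct.calcsize(WIRE_FORMAT)

def bandwidth_bytes_per_second(num_objects: int, rate_hz: int = 60) -> int:
    """Worst-case payload rate if every object is sent on every tick."""
    return num_objects * BYTES_PER_OBJECT * rate_hz

print(BYTES_PER_OBJECT)                # 48
print(bandwidth_bytes_per_second(10))  # 28800
```

Ten objects updated on every tick at 60Hz therefore need under 30 KB/s of payload, and change detection lowers this further.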

Step-by-Step Workflow

Set up the edge server with a local object detector. Install YOLO v11 (or equivalent) on a machine with GPU access. Configure it to accept frames via WebSocket from XR clients and return structured detection results (object class, center coordinates, bounding box, confidence score).
Define a privacy sensitivity taxonomy. Classify detectable objects into three tiers:
- Insensitive: furniture, cups, books, lab equipment (always safe to describe)
- Maybe Sensitive: laptops, phones, monitors (require user confirmation before cropping)
- Highly Sensitive: human faces, ID cards, medicine/medical records, personal mail (always filtered, never sent to cloud)
Implement the text-description substitution layer. When a user issues a voice/text command, capture the current camera frame on the XR client and send it to the edge server. Run YOLO detection, then construct a structured text prompt listing all detected objects with their spatial metadata. Send this text — not the image — to the cloud MLLM.
Implement the two-stage MLLM pipeline. In Stage 1, include the text description in a prompt that asks the model to either fulfill the request from text alone or return a CropArea JSON property specifying pixel coordinates of the region it needs. If CropArea is returned (that is, not null or "None"), crop that bounding box from the original frame, verify it doesn't contain highly sensitive objects, and send only the crop in Stage 2.
Enforce structured output from the MLLM. Use the model's structured output / JSON mode to guarantee responses conform to a schema with fields like objectName, position, rotation, scale, CropArea, and animationSequence. This eliminates parsing failures and enables direct deserialization into Unity/3D engine objects.
Implement AprilTag-based registration for multi-user alignment. Place a physical AprilTag marker in the shared space. When each user issues a "register" command, capture a frame, detect the AprilTag pose in that user's world coordinate system, and store the transform on the edge server. Use basis-change math to convert between users' coordinate frames.
Build the content synchronization protocol. Define a 48-byte compact object representation. When the "owner" user manipulates an AI-generated object, detect changes via threshold-based triggering, transform from local to world coordinates, and broadcast the update to all registered users via TCP socket. Receiving clients apply the inverse coordinate transform.
Add user consent and confirmation gates. Before sending any cropped image region to the cloud, present the crop to the user for approval. For "maybe sensitive" objects, require explicit confirmation. For "highly sensitive" objects, block transmission entirely and substitute a text description or placeholder.
Integrate speech-to-text locally. Run Whisper (or equivalent) on the edge server for voice command transcription, keeping audio data off the cloud. Detect speech boundaries using audio magnitude thresholds.
Profile and optimize latency. Measure end-to-end speak-to-action latency (target: <8s). The primary bottleneck is cloud MLLM response time (~2.5s per stage). Cache YOLO results across frames, use async WebSocket communication, and consider edge-deployable models (LLaVA 13B, Llama 3.2 Vision 11B) as cloud fallbacks.
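The speech-boundary detection mentioned in the local speech-to-text step can be sketched with a magnitude threshold over fixed-size frames. This is an illustrative sketch only: the frame length, threshold, and hangover count are assumed values for 16-bit PCM samples, and `find_speech_segments` is a hypothetical helper rather than the paper's implementation.

```python
from typing import List, Sequence, Tuple

def find_speech_segments(samples: Sequence[int], frame_len: int = 160,
                         threshold: float = 500.0,
                         hangover_frames: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of regions whose mean magnitude
    exceeds `threshold`; a segment closes after `hangover_frames` quiet frames."""
    segments = []
    start = None
    last_loud_end = 0
    quiet = 0
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        magnitude = sum(abs(s) for s in frame) / len(frame)
        if magnitude > threshold:
            if start is None:
                start = i              # speech begins
            last_loud_end = i + len(frame)
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= hangover_frames:
                segments.append((start, last_loud_end))
                start = None
    if start is not None:              # audio ended while still speaking
        segments.append((start, last_loud_end))
    return segments

# Synthetic signal: silence, a loud burst, silence
signal = [0] * 480 + [2000, -2000] * 240 + [0] * 800
print(find_speech_segments(signal))
```

Each returned segment can then be passed to the local transcription model, so only audio that actually contains a command is processed.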

Concrete Examples

Example 1: Privacy-filtering edge server for an AR assistant

User: "Build a Python edge server that receives camera frames from a Quest headset, detects objects with YOLO, and sends only text descriptions to GPT-4o."

Approach:

Create a WebSocket server (Python websockets library) that accepts binary frame data from the XR client
Run YOLOv11 inference on each received frame, extracting detections as structured text
Classify each detection against the sensitivity taxonomy
Construct a prompt containing only textual object descriptions and the user's voice command
Call the OpenAI API with the text-only prompt (no image attachment)
If the model responds with a CropArea, crop the frame, validate no sensitive objects are in the crop, and re-query with the cropped image
Return the structured JSON response to the XR client

Output:

import asyncio
import websockets
import json
import numpy as np
from ultralytics import YOLO
from openai import OpenAI

model = YOLO("yolo11n.pt")
client = OpenAI()

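# Labels must match the detector's class names. The stock yolo11n.pt weights use COCO
# labels, so "id_card", "medicine", "mail", and "desk" only match with a custom-trained model.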
SENSITIVITY_MAP = {
    "highly_sensitive": {"person", "id_card", "medicine", "mail"},
    "maybe_sensitive": {"laptop", "cell phone", "tv"},
    "insensitive": {"cup", "chair", "desk", "book", "bottle", "keyboard"}
}

def classify_sensitivity(label: str) -> str:
    for level, objects in SENSITIVITY_MAP.items():
        if label in objects:
            return level
    return "insensitive"

def frame_to_text_description(frame_bytes: bytes) -> tuple[str, dict]:
    """Run YOLO, return text description and detection metadata."""
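    # Assumes a raw, uncompressed 1280x720 3-channel uint8 buffer; decode with
    # cv2.imdecode instead if the client sends JPEG or PNG frames.
    img = np.frombuffer(frame_bytes, dtype=np.uint8).reshape((720, 1280, 3))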
    results = model(img, verbose=False)[0]
    descriptions = []
    detections = []
    for box in results.boxes:
        label = results.names[int(box.cls)]
        sensitivity = classify_sensitivity(label)
        if sensitivity == "highly_sensitive":
            continue  # filter out entirely
        cx, cy = box.xywh[0][:2].tolist()
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        conf = float(box.conf)
        desc = f"{label}, center ({cx:.1f}, {cy:.1f}), box ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}), conf {conf:.2f}"
        descriptions.append(desc)
        detections.append({"label": label, "sensitivity": sensitivity, "box": [x1, y1, x2, y2]})
    return "\n".join(descriptions), {"detections": detections, "frame": img}

async def handle_client(websocket):
    async for message in websocket:
        data = json.loads(message)
        frame_bytes = bytes.fromhex(data["frame_hex"])
        user_command = data["command"]

        text_desc, meta = frame_to_text_description(frame_bytes)

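        # Stage 1: text-only query. The OpenAI client call is blocking, so run it
        # in a worker thread to keep the WebSocket event loop responsive.
        stage1_response = await asyncio.to_thread(
            client.chat.completions.create,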
            model="gpt-4o",
            response_format={"type": "json_object"},
            messages=[{
                "role": "system",
                "content": "You create 3D objects for XR. Return JSON with objectName, position, rotation, scale. If you need to see a specific region, include CropArea as [x1,y1,x2,y2] pixel coords, otherwise set CropArea to null."
            }, {
                "role": "user",
                "content": f"Scene objects:\n{text_desc}\n\nUser request: {user_command}"
            }]
        )
        result = json.loads(stage1_response.choices[0].message.content)

        # Stage 2: selective crop if needed
        if result.get("CropArea"):
            # The coordinates come from the model, so clamp them to the frame
            h, w = meta["frame"].shape[:2]
            x1, y1, x2, y2 = [int(v) for v in result["CropArea"]]
            x1, y1, x2, y2 = max(0, x1), max(0, y1), min(w, x2), min(h, y2)
            crop = np.ascontiguousarray(meta["frame"][y1:y2, x1:x2])
            # Validate that the crop is non-empty and contains no highly sensitive objects
            crop_ok = crop.size > 0
            if crop_ok:
                crop_results = model(crop, verbose=False)[0]
                crop_ok = not any(
                    classify_sensitivity(crop_results.names[int(b.cls)]) == "highly_sensitive"
                    for b in crop_results.boxes
                )
            if crop_ok:
                import base64
                import cv2
                _, buf = cv2.imencode(".jpg", crop)
                b64 = base64.b64encode(buf).decode()
                # Blocking client call, so run it off the event loop
                stage2_response = await asyncio.to_thread(
                    client.chat.completions.create,
                    model="gpt-4o",
                    response_format={"type": "json_object"},
                    messages=[{
                        "role": "user",
                        "content": [
                            {"type": "text", "text": f"Generate 3D object JSON for: {user_command}"},
                            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
                        ]
                    }]
                )
                result = json.loads(stage2_response.choices[0].message.content)

        await websocket.send(json.dumps(result))

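async def main():
    # Keep the server open until the process is stopped
    async with websockets.serve(handle_client, "0.0.0.0", 8765):
        await asyncio.Future()

asyncio.run(main())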

Example 2: AprilTag-based multi-user registration

User: "Implement a lightweight registration system so two Quest headsets can share the same coordinate space using an AprilTag marker instead of environment scanning."

Approach:

Each client captures a frame containing a visible AprilTag
Client sends the frame to the edge server with its current camera pose (position + rotation from XR SDK)
Edge server detects the AprilTag, computes its pose relative to the camera
Server stores the tag pose in each user's world frame: T_world_from_tag = T_world_from_camera * T_camera_from_tag, where T_a_from_b maps points expressed in frame b into frame a
To transform a point between users: P_userB = T_worldB_from_tag * inverse(T_worldA_from_tag) * P_userA

Output:

import numpy as np
from pupil_apriltags import Detector

detector = Detector(families="tag36h11")
user_transforms = {}  # user_id -> 4x4 tag pose in that user's world frame (tag -> world)

def register_user(user_id: str, frame_gray: np.ndarray,
                  camera_matrix: np.ndarray, camera_pose_world: np.ndarray):
    """Register a user's coordinate frame via AprilTag detection."""
    # pupil_apriltags expects camera_params as (fx, fy, cx, cy)
    fx, fy = camera_matrix[0, 0], camera_matrix[1, 1]
    cx, cy = camera_matrix[0, 2], camera_matrix[1, 2]
    tags = detector.detect(frame_gray, estimate_tag_pose=True,
                           camera_params=(fx, fy, cx, cy),
                           tag_size=0.1)  # tag edge length in meters
    if not tags:
        return {"error": "No AprilTag detected"}

    tag = tags[0]
    # pose_R / pose_t give the tag pose in the camera frame (tag -> camera)
    T_cam_from_tag = np.eye(4)
    T_cam_from_tag[:3, :3] = tag.pose_R
    T_cam_from_tag[:3, 3] = tag.pose_t.flatten()

    # camera_pose_world is the 4x4 camera pose in the user's world frame (camera -> world).
    # Assumes it uses the same axis convention as the AprilTag camera frame;
    # convert it first if the XR SDK uses a different handedness or axis order.
    T_world_from_tag = camera_pose_world @ T_cam_from_tag
    user_transforms[user_id] = T_world_from_tag
    return {"status": "registered", "user_id": user_id}

def transform_point(from_user: str, to_user: str, point_world: np.ndarray):
    """Convert a 3D point from one user's world space to another's."""
    T_from = user_transforms[from_user]   # tag -> from_user world
    T_to = user_transforms[to_user]       # tag -> to_user world

    # from_user world -> shared tag frame -> to_user world
    point_tag = np.linalg.inv(T_from) @ np.append(point_world, 1.0)
    point_to_world = T_to @ point_tag
    return point_to_world[:3]
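A small worked example can make the basis change concrete. The sketch below uses only the standard library and represents a rigid transform as a (rotation, translation) pair instead of a 4x4 matrix; the helper names (`apply`, `invert`, `transfer_point`) and the two tag poses are made up for illustration. Each pose maps tag coordinates into a user's world coordinates.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Rigid = Tuple[List[List[float]], Vec3]  # (3x3 rotation R, translation t): p_out = R @ p_in + t

def apply(T: Rigid, p: Sequence[float]) -> Vec3:
    """Apply a rigid transform to a 3D point."""
    R, t = T
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))

def invert(T: Rigid) -> Rigid:
    """Inverse of a rigid transform: (R^T, -R^T t)."""
    R, t = T
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]
    t_inv = tuple(-sum(Rt[i][j] * t[j] for j in range(3)) for i in range(3))
    return Rt, t_inv

def transfer_point(world_a_from_tag: Rigid, world_b_from_tag: Rigid,
                   p_a: Sequence[float]) -> Vec3:
    """User A world -> shared tag frame -> user B world."""
    p_tag = apply(invert(world_a_from_tag), p_a)
    return apply(world_b_from_tag, p_tag)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
quarter_turn_y = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]

tag_in_a = (identity, (1.0, 0.0, 2.0))        # user A sees the tag at (1, 0, 2)
tag_in_b = (quarter_turn_y, (0.0, 0.0, 3.0))  # user B sees it at (0, 0, 3), turned 90 degrees about y

print(transfer_point(tag_in_a, tag_in_b, (1.0, 0.0, 2.0)))  # the tag origin itself
print(transfer_point(tag_in_a, tag_in_b, (2.0, 0.0, 2.0)))  # 1 m along the tag's x axis
```

The tag origin in user A's world lands exactly on the tag position in user B's world, which is a quick sanity check worth running on real registration data as well.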

Example 3: Compact object synchronization protocol

User: "Design a binary sync protocol for broadcasting 3D object state across XR clients at 60Hz with minimal bandwidth."

Approach:

Define a 48-byte wire format per object: 4B uint32 ID, 12B float32 position (x,y,z), 16B float32 quaternion (x,y,z,w), 12B float32 scale (x,y,z), 4B uint32 event flags
Implement threshold-based change detection to avoid sending unchanged objects
Broadcast only changed objects each frame via TCP socket

Output:

import struct
from dataclasses import dataclass

@dataclass
class ObjectState:
    obj_id: int
    position: tuple  # (x, y, z)
    rotation: tuple  # quaternion (x, y, z, w)
    scale: tuple     # (x, y, z)
    events: int      # bitmask: 0x1=created, 0x2=deleted, 0x4=grabbed

    PACK_FORMAT = "<I3f4f3fI"  # 48 bytes total

    def pack(self) -> bytes:
        return struct.pack(
            self.PACK_FORMAT, self.obj_id,
            *self.position, *self.rotation, *self.scale, self.events
        )

    @classmethod
    def unpack(cls, data: bytes) -> "ObjectState":
        vals = struct.unpack(cls.PACK_FORMAT, data)
        return cls(
            obj_id=vals[0],
            position=vals[1:4],
            rotation=vals[4:8],
            scale=vals[8:11],
            events=vals[11]
        )

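POSITION_THRESHOLD = 0.005   # 5mm
ROTATION_THRESHOLD = 0.005   # quaternion distance; ~0.57 degrees (about half the angle in radians)

def has_changed(prev: ObjectState, curr: ObjectState) -> bool:
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    pos_delta = dist(prev.position, curr.position)
    # q and -q encode the same rotation, so take the smaller of the two distances
    rot_delta = min(dist(prev.rotation, curr.rotation),
                    dist(prev.rotation, [-c for c in curr.rotation]))
    moved = pos_delta > POSITION_THRESHOLD or rot_delta > ROTATION_THRESHOLD
    # Scale and event-flag changes must be broadcast too
    return moved or tuple(prev.scale) != tuple(curr.scale) or prev.events != curr.events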

def build_sync_packet(prev_states: dict, curr_states: dict) -> bytes:
    """Build a packet containing only changed objects."""
    changed = []
    for oid, curr in curr_states.items():
        if oid not in prev_states or has_changed(prev_states[oid], curr):
            changed.append(curr.pack())
    header = struct.pack("<I", len(changed))  # object count
    return header + b"".join(changed)
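On the receiving side, a client has to decode the count-prefixed packet and, because TCP delivers a byte stream rather than discrete messages, read exactly the number of bytes each packet needs. The sketch below mirrors the sender's layout (a uint32 object count followed by 48-byte records); `parse_sync_packet` and `recv_exact` are hypothetical helpers, and the little-endian format is an assumption carried over from the packing example.

```python
import struct
from typing import List, Tuple

OBJECT_FORMAT = "<I3f4f3fI"                   # same 48-byte layout as the sender
OBJECT_SIZE = struct.calcsize(OBJECT_FORMAT)
HEADER_FORMAT = "<I"                          # object count
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)

def parse_sync_packet(packet: bytes) -> List[Tuple]:
    """Decode a count-prefixed packet into a list of 12-value tuples."""
    if len(packet) < HEADER_SIZE:
        raise ValueError("packet shorter than header")
    (count,) = struct.unpack_from(HEADER_FORMAT, packet, 0)
    expected = HEADER_SIZE + count * OBJECT_SIZE
    if len(packet) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(packet)}")
    return [struct.unpack_from(OBJECT_FORMAT, packet, HEADER_SIZE + i * OBJECT_SIZE)
            for i in range(count)]

def recv_exact(sock, n: int) -> bytes:
    """Read exactly n bytes from a connected TCP socket."""
    chunks = bytearray()
    while len(chunks) < n:
        chunk = sock.recv(n - len(chunks))
        if not chunk:
            raise ConnectionError("socket closed mid-packet")
        chunks.extend(chunk)
    return bytes(chunks)

# Round trip with one object: ID 7, position, identity rotation, unit scale, "grabbed" flag
obj = struct.pack(OBJECT_FORMAT, 7, 1.0, 2.0, 3.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0x4)
packet = struct.pack(HEADER_FORMAT, 1) + obj
print(parse_sync_packet(packet))
```

A receiver would call `recv_exact(sock, HEADER_SIZE)` to learn the object count, then `recv_exact(sock, count * OBJECT_SIZE)` for the records.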

Best Practices

Do: Run object detection (YOLO) on the edge server, never in the cloud. The entire point is that raw frames stay local. Only text descriptions and explicitly approved crops cross the network boundary.
Do: Use structured output / JSON mode when querying cloud MLLMs. This eliminates parsing errors and allows direct deserialization into 3D engine objects. Define schemas with fields like objectName, position (3-float array), rotation (4-float quaternion), scale, and optional CropArea.
Do: Re-validate cropped regions before sending to the cloud. Run YOLO again on the crop to confirm no sensitive objects leaked into the bounding box. This catches cases where a sensitive object partially overlaps the crop area.
Do: Use threshold-based change detection for synchronization. Broadcasting every object every frame wastes bandwidth. Only send updates when position changes by more than 5mm or rotation changes by more than roughly 0.5 degrees.
Avoid: Sending full-resolution frames to cloud APIs "just in case." The text-description substitution achieves 89% task accuracy without any image. Only escalate to cropped images when the model explicitly requests visual context via CropArea.
Avoid: Using environment scanning (like Meta's Shared Spatial Anchors) for multi-user registration when a simple fiducial marker achieves alignment in 0.27s with <3.5cm accuracy. AprilTags are sufficient for room-scale collaboration.
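Even with JSON mode enabled, it can help to check a reply against the expected fields before handing it to the 3D engine. The following is a minimal sketch using only the standard library; the field shapes (3-number position, 4-number rotation, 3-number scale, optional 4-number CropArea) and the 1280x720 frame size are assumptions for illustration, and `validate_response` is a hypothetical helper.

```python
import json
from typing import Any, Dict

def _is_number_list(value: Any, length: int) -> bool:
    """True for a list of exactly `length` ints or floats (bools excluded)."""
    return (isinstance(value, list) and len(value) == length
            and all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in value))

def validate_response(raw: str, frame_w: int = 1280, frame_h: int = 720) -> Dict[str, Any]:
    """Parse an MLLM reply and check the fields the edge server relies on."""
    data = json.loads(raw)
    if not isinstance(data.get("objectName"), str):
        raise ValueError("objectName must be a string")
    for key, length in (("position", 3), ("rotation", 4), ("scale", 3)):
        if not _is_number_list(data.get(key), length):
            raise ValueError(f"{key} must be a list of {length} numbers")
    crop = data.get("CropArea")
    if crop is not None:
        if not _is_number_list(crop, 4):
            raise ValueError("CropArea must be [x1, y1, x2, y2] or null")
        x1, y1, x2, y2 = crop
        if not (0 <= x1 < x2 <= frame_w and 0 <= y1 < y2 <= frame_h):
            raise ValueError("CropArea lies outside the frame or is empty")
    return data

reply = ('{"objectName": "keyboard", "position": [0, 1, 0.5], "rotation": [0, 0, 0, 1], '
         '"scale": [1, 1, 1], "CropArea": [214, 283, 441, 421]}')
print(validate_response(reply)["CropArea"])
```

Rejecting out-of-frame or empty crop requests at this stage keeps malformed coordinates away from the cropping code.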

Error Handling

YOLO fails to detect relevant objects: Fall back to sending a low-resolution, heavily blurred version of the frame with only the region of interest sharp. Alternatively, prompt the user to describe the object verbally.
Cloud MLLM returns invalid JSON: Retry with an explicit reminder to use structured output mode. Implement a 3-retry loop with exponential backoff. If all retries fail, return a user-friendly error suggesting they rephrase the request.
AprilTag not detected during registration: Check camera exposure and lighting conditions. Provide user feedback ("Move closer to the marker" or "Ensure the marker is fully visible"). Fall back to manual coordinate entry if detection fails after 3 attempts.
Crop still contains sensitive data: If re-validation detects sensitive objects in the crop, expand the crop exclusion zone by 20% padding around the sensitive detection and re-crop. If the sensitive object cannot be excluded, refuse the crop and fall back to text-only mode.
High latency from cloud MLLM (>5s): Consider routing to a smaller edge-deployed model (LLaVA 13B or Llama 3.2 Vision 11B) for latency-critical requests, accepting a potential accuracy tradeoff.
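The retry behavior for invalid JSON can be sketched as a small wrapper with exponential backoff. This is an illustrative sketch: `query_with_retries` is a hypothetical helper, `call_model` stands in for whatever function issues the cloud request and returns the raw reply text, and the 0.5s base delay is an assumed value.

```python
import json
import time
from typing import Any, Callable, Dict

def query_with_retries(call_model: Callable[[], str], max_attempts: int = 3,
                       base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> Dict[str, Any]:
    """Call `call_model` until it returns parseable JSON, backing off between tries."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return json.loads(call_model())
        except json.JSONDecodeError as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))  # delay doubles after each failure
    raise RuntimeError("Model did not return valid JSON; ask the user to rephrase") from last_error

# Demo with a stub that fails twice, then succeeds; delays are recorded instead of slept
replies = iter(["not json", "{broken", '{"objectName": "cup"}'])
delays = []
result = query_with_retries(lambda: next(replies), sleep=delays.append)
print(result, delays)
```

Passing the sleep function as a parameter keeps the wrapper easy to test without real waiting.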

Limitations

Text descriptions lose visual detail. When the task requires fine-grained visual understanding (exact color matching, texture recognition, reading handwritten text), text-only mode is insufficient and the system must fall back to sending cropped images.
YOLO detection coverage. The privacy filter is only as good as the object detector. If YOLO doesn't recognize a sensitive object class (e.g., a prescription bottle not in the training set), it won't be filtered. Extend the detection model or add a catch-all "unknown object" blurring step for unrecognized regions.
Single-marker registration drift. AprilTag alignment accuracy degrades beyond ~3m from the marker (3.5cm average error). For large spaces, use multiple markers or integrate with SLAM-based refinement.
Cloud API dependency. The two-stage pipeline requires two sequential cloud API calls (~5s total), which is the latency bottleneck. Real-time interaction requires either faster models or pre-caching common requests.
Not a replacement for on-device processing. If the edge server is compromised, raw frames are exposed. For maximum privacy, run detection directly on the XR device if hardware permits, treating the edge server as a second layer.

Reference

Paper: PRISM-XR: Empowering Privacy-Aware XR Collaboration with Multimodal Large Language Models (IEEE VR 2026)
Key takeaway: Section III details the two-stage edge preprocessing pipeline (text-description substitution + selective cropping) and Section IV covers the AprilTag registration and 48-byte synchronization protocol. Table I has the sensitivity taxonomy; Tables II-IV have the accuracy and latency benchmarks.