name: vision-emotion-analysis description: Analyzes real-time camera input to detect facial expressions, microexpressions, body posture, hand gestures, environmental context, and dangerous objects. Bridges the gap between verbal and non-verbal communication in therapy.
Vision & Emotion Analysis Skill
Overview
This skill defines how the AI therapist processes and interprets visual data from the user's camera during sessions. It enables the therapist to respond to what users show, not only what they say.
Facial Expression Analysis
Emotion Mapping (Based on Paul Ekman's FACS)
The system must recognize and interpret the 7 universal primary emotions:
| Emotion | Key Facial Indicators | Therapeutic Significance |
|---|---|---|
| Happiness | Lip corners raised, cheeks lifted, Duchenne smile | Genuine vs. masked — check if smile reaches eyes |
| Sadness | Inner brow raise, lip corners down, quivering | May suppress — watch for micro-sadness during "I'm fine" |
| Fear | Brow raise + together, upper eyelid raised, horizontal lip stretch | Anxiety signal — pivot to grounding |
| Anger | Brow lower + together, lip press, jaw clench | Needs validation before exploration |
| Disgust | Nose wrinkle, upper lip raise | Self-directed disgust = shame → handle with care |
| Surprise | Brow raise, widened eyes, dropped jaw | Context-dependent — distress vs. pleasant surprise |
| Contempt | Unilateral lip corner raise | Toward self = self-criticism; toward therapist = rupture signal |
Microexpression Detection
- Microexpressions last 1/25 to 1/5 of a second — must be detected at 25+ fps
- Most important: suppressed emotions leaking through when verbal content contradicts
- Example: User says "I don't care anymore" while microexpression shows fear → flag discrepancy
Discrepancy Detection
IF verbal_sentiment = POSITIVE AND facial_emotion IN [SAD, FEAR, ANGER]:
→ Flag: "Possible emotional masking"
→ Adjust: Shift from structured technique to empathic reflection
→ Do NOT immediately confront — reflect gently: "I noticed something shift in your expression..."
Body Language Analysis
Posture Indicators
| Signal | Meaning | Response |
|---|---|---|
| Closed body (crossed arms/legs, hunched) | Defensive, unsafe | Slow down, increase warmth, don't push |
| Forward lean | Engaged, interested | Continue current approach |
| Backward lean | Withdrawn, uncomfortable | Check in directly |
| Self-touching (face, arms, hair) | Anxiety, self-soothing | Offer grounding exercise |
| Rocking / repetitive movement | High distress, dissociation | Ground immediately |
| Head nodding | Agreement vs. forced compliance | Note frequency and timing |
| Gaze avoidance | Shame, trauma activation | Reduce intensity, normalize |
Hand & Gesture Analysis
- Visible trembling → anxiety/panic indicator → offer breathing exercise
- Touching neck/chest → high anxiety or heart rate spike
- Fist clenching → suppressed anger
- Wringing hands → high stress
Environmental Analysis
Background Context
- Note changes in environment across sessions (consistency = stability signal)
- Messy/chaotic background may indicate mental state decline
- Dark room, blinds closed = possible avoidance / depression sign
- Assess lighting — ensure user can be properly seen for analysis
Safety Scanning
Every frame must be scanned for:
- Weapons: firearms, knives, sharp objects (flag immediately → Crisis Skill)
- Medications: large quantities of pills visible (medium-high alert)
- Third parties: other people in frame, especially aggressive movements
- Physical injuries: visible blood, bruising, bandages not mentioned by user
Presence Detection
- User leaves frame → start 60-second timer
- If user doesn't return: pause session, display check-in prompt
- If no response after 90 seconds: display safety resources and emergency contact option
Real-Time Feedback to Session Engine
Data Format
interface VisualAnalysisFrame {
timestamp: string;
faceDetected: boolean;
primaryEmotion: EmotionLabel;
emotionConfidence: number; // 0-1
microexpressionFlag: boolean;
verbalVisualDiscrepancy: boolean;
bodyLanguage: BodyLanguageSignals[];
environmentSafetyFlags: SafetyFlag[];
attentionLevel: 'high' | 'medium' | 'low' | 'absent';
}
Aggregation Rules
- Single-frame anomalies: log but don't trigger response
- 3+ consecutive frames with same signal: trigger soft response
- Any safety flag at any confidence > 0.7: trigger Crisis Skill immediately
- Aggregate emotion over full 30-second window for session-level insight
Technical Implementation Notes
Recommended Stack
- Client-side:
MediaPipe Face Mesh(164 landmarks, runs in browser via WASM) - Emotion inference:
face-api.jsorTensorFlow.jswith fine-tuned model - Body pose:
MediaPipe PoseorMoveNet(TensorFlow.js) - Object detection:
COCO-SSDmodel for environmental scanning - Performance target: Analysis pipeline < 100ms per frame, 10-25 fps
Privacy Requirements
- All camera processing runs CLIENT-SIDE — raw video frames NEVER sent to server
- Only analysis results (emotion labels, confidence scores) transmitted
- User must explicitly grant camera permission with clear explanation
- Users can disable visual analysis and proceed with audio-only mode
Fallback Behavior
- If camera unavailable: proceed with audio + text only, note reduced analysis capability
- If low-confidence detection (< 0.4): do not act on data, wait for clearer signal
- If performance issues: reduce to 5 fps minimum viable analysis rate