name: "event-vstream-event-driven-real-time-understanding"
description: "Build event-driven video stream processing pipelines that detect meaningful state transitions instead of processing every frame. Use when asked to: 'build a real-time video understanding system', 'detect events in a video stream', 'process long video with memory', 'reduce redundant frame processing', 'stream video to LLM efficiently', 'build an event-aware video pipeline'."
Event-VStream: Event-Driven Real-Time Video Stream Understanding
This skill enables Claude to design and implement event-driven video stream processing systems based on the Event-VStream framework. Processing every frame at fixed intervals wastes compute on redundant content and forgets past context. Event-VStream instead detects semantically meaningful state transitions by fusing motion, semantic, and predictive cues, and triggers language generation only at those boundaries. A persistent memory bank consolidates event embeddings for long-horizon reasoning. The paper reports competitive accuracy with sub-100ms per-token latency across multi-hour streams.
When to Use
- When the user asks to build a real-time video understanding or captioning system that must handle long streams without degrading
- When the user wants to reduce compute costs by skipping redundant frames in a video processing pipeline
- When building a system that needs to detect "something happened" moments in continuous video (surveillance, dashcam, egocentric, sports)
- When the user needs an event memory system that persists across hours of video without running out of context
- When implementing streaming video-to-text pipelines that should speak only when something meaningful changes
- When the user wants to segment continuous video into discrete semantic events for indexing or retrieval
- When adapting a vision-language model for real-time streaming instead of batch processing
Key Technique
Event-driven processing replaces fixed-interval decoding. Traditional streaming video-LLM systems sample frames at a constant rate (e.g., every 0.5s) and decode text continuously. This produces repetitive outputs during static scenes and misses fast transitions. Event-VStream flips this: it monitors a lightweight boundary score fusing three complementary signals, and only invokes the expensive language model when a genuine state transition is detected.
Three-signal boundary detection. The boundary score E_t combines: (1) semantic drift -- cosine distance between the current frame embedding and a running event average, catching content changes; (2) motion cue -- normalized optical flow or frame-difference energy, which empirically precedes semantic drift by ~2 seconds and acts as an early warning; (3) prediction error -- the squared L2 error of a lightweight 3-layer MLP that predicts the current frame embedding from the previous one, catching unexpected transitions. An adaptive threshold tau_t rises during high-motion segments (to avoid false triggers from continuous motion) and falls back toward its base value during stable scenes. Ablation shows all three signals are essential: removing motion drops win rate from 68% to 12%, removing semantics drops it to 38%, and removing prediction drops it to 47%.
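As a rough illustration of the motion cue, the sketch below computes frame-difference energy with NumPy and scales it by a running maximum. The names `frame_diff_energy` and `RunningMaxNormalizer`, the uint8 frame format, and the 30-frame window are assumptions for illustration; the text above only says the cue is "normalized optical flow or frame-difference energy".

```python
from collections import deque

import numpy as np


def frame_diff_energy(prev_frame: np.ndarray, frame: np.ndarray) -> float:
    """Mean squared intensity change between two uint8 frames of equal shape."""
    a = prev_frame.astype(np.float32) / 255.0
    b = frame.astype(np.float32) / 255.0
    return float(np.mean((b - a) ** 2))


class RunningMaxNormalizer:
    """Scales raw motion energy into [0, 1] using the max over a sliding window."""

    def __init__(self, window: int = 30):  # window length is an assumed value
        self.history = deque(maxlen=window)

    def __call__(self, energy: float) -> float:
        self.history.append(energy)
        return energy / (max(self.history) + 1e-8)


# Usage: a black frame followed by a white frame gives the maximum energy of 1.0.
black = np.zeros((4, 4), dtype=np.uint8)
white = np.full((4, 4), 255, dtype=np.uint8)
normalize = RunningMaxNormalizer()
m_tilde = normalize(frame_diff_energy(black, white))
```

Frame differencing is cheaper than optical flow but also reacts to lighting changes and sensor noise, so it may need smoothing on noisy feeds.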
Persistent event memory with merge-or-append. When a boundary fires, frames within the detected segment are aggregated into an event embedding using Gaussian-weighted pooling (emphasizing frames near the boundary). This embedding is either merged into the most recent memory slot (if cosine similarity exceeds a redundancy threshold gamma_mem) or appended as a new entry. This keeps the memory bank compact -- semantically similar consecutive events consolidate rather than accumulating. At generation time, relevant past events are retrieved by similarity to the current event, giving the language model long-horizon context without growing the token budget.
Step-by-Step Workflow
1. Set up the frame ingestion loop. Accept video frames from a stream source (webcam, RTSP, file) at a fixed capture rate (2 FPS is the paper's default). Extract a visual embedding f_t for each frame using a pretrained vision encoder (CLIP, SigLIP, or a VideoLLM-Online encoder).
2. Maintain a running event representation. Keep an exponential moving average f_bar of frame embeddings within the current event segment: f_bar <- (1 - rho) * f_bar + rho * f_t. This serves as the "what the current event looks like" anchor.
3. Compute the three-signal boundary score. For each frame, calculate:
   - Semantic drift: 1 - s_t, where s_t = cosine_similarity(f_t, f_bar)
   - Motion cue: normalized frame-difference energy or optical flow magnitude m_tilde_t
   - Prediction error: c_t = ||MLP(f_{t-1}) - f_t||^2 using a lightweight 3-layer MLP
   - Combined score: E_t = w_sem * (1 - s_t) + w_mot * m_tilde_t + w_pred * c_t
4. Apply the adaptive threshold. Compute tau_t = tau_0 * (1 + eta * Var(m_{t-w:t})) where tau_0 = 0.96 and eta = 0.03. A boundary fires when sigmoid(E_t) > tau_t. Enforce a minimum interval Delta_min between triggers to coalesce bursty updates, and a maximum interval Delta_max to prevent excessive silence.
5. Aggregate the event embedding on boundary detection. Collect all frame embeddings since the last boundary. Compute a Gaussian-weighted average: E_k = sum(w_i * f_i) / sum(w_i) where w_i ~ exp(-|t_i - t_boundary| / sigma). This emphasizes frames near the transition point.
6. Update the persistent memory bank. Apply the merge-or-append rule: if cosine_similarity(E_k, E_last) > gamma_mem, merge via E_last <- (1 - lambda) * E_last + lambda * E_k. Otherwise, append E_k as a new memory slot.
7. Retrieve relevant context and generate text. Query the memory bank for the top-K events most similar to E_k. Concatenate these retrieved event embeddings with E_k as visual context tokens, prepend a system prompt, and invoke the language model to produce a description or answer: y_k = LM(E_k, Retrieve(Memory, E_k)).
8. Between boundaries, remain silent. Do not invoke the language model. Continue the frame ingestion and scoring loop. This is the key efficiency gain -- the LM is called only when something meaningful happens.
9. Reset the running average. After a boundary fires, reset f_bar to the current frame embedding to begin tracking the next event segment.
10. Implement pacing controls for production use. Add a minimum silence interval (prevents chattering during rapid scene changes) and a maximum silence interval (forces a periodic update even during very long static scenes for liveness).
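The aggregation step can be sketched as a small weighted-pooling function. This is a minimal sketch that follows the weight formula quoted in the workflow, w_i ~ exp(-|t_i - t_boundary| / sigma); the function name `aggregate_event` and the default sigma of 2 seconds are assumptions, not values from the paper.

```python
import numpy as np


def aggregate_event(frame_embeddings, timestamps, t_boundary, sigma=2.0):
    """Pool frame embeddings into one event embedding, favouring frames near the boundary.

    Weights follow the workflow's formula exp(-|t_i - t_boundary| / sigma).
    A strictly Gaussian kernel would use exp(-(t_i - t_boundary)**2 / (2 * sigma**2)).
    sigma=2.0 (seconds) is an assumed default.
    """
    f = np.asarray(frame_embeddings, dtype=np.float64)  # shape (n_frames, dim)
    t = np.asarray(timestamps, dtype=np.float64)        # shape (n_frames,)
    w = np.exp(-np.abs(t - t_boundary) / sigma)
    return (w[:, None] * f).sum(axis=0) / w.sum()


# Usage: the frame at the boundary (t=4.0) dominates the pooled embedding.
E_k = aggregate_event([[1.0, 0.0], [0.0, 1.0]], [0.0, 4.0], t_boundary=4.0)
```

A smaller sigma concentrates the event embedding on the transition itself, while a larger sigma approaches a uniform average over the segment.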
Concrete Examples
Example 1: Building a security camera event narrator
User: "I want a system that watches a security camera feed and only generates descriptions when something actually happens -- not every frame."
Approach:
- Ingest RTSP stream at 2 FPS using OpenCV. Extract CLIP embeddings per frame.
- Implement the three-signal boundary detector with default parameters (tau_0=0.96, eta=0.03).
- Train the prediction MLP on 1 hour of typical footage from the same camera (self-supervised: predict next frame embedding from current).
- On boundary detection, aggregate the event embedding and query an LLM with the prompt: "Describe what just changed in this security camera view."
- Store events in a SQLite-backed memory bank with cosine-similarity merge (gamma_mem=0.85).
- Expose results via a WebSocket endpoint that pushes event descriptions to a dashboard.
Output:
import numpy as np
from collections import deque


class EventBoundaryDetector:
    def __init__(self, tau_0=0.96, eta=0.03, window=30, rho=0.05,
                 w_sem=0.4, w_mot=0.35, w_pred=0.25,
                 delta_min=2.0, delta_max=30.0):
        self.tau_0 = tau_0
        self.eta = eta
        self.w_sem = w_sem
        self.w_mot = w_mot
        self.w_pred = w_pred
        self.rho = rho
        self.delta_min = delta_min
        self.delta_max = delta_max
        self.f_bar = None  # Running event average
        self.motion_history = deque(maxlen=window)
        self.last_boundary_time = -float('inf')
        # Lightweight MLP, set externally; must expose prediction_error(f_t)
        self.predictor = None

    def cosine_sim(self, a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    def update(self, f_t, motion_energy, timestamp):
        """Process one frame; return True if an event boundary fires."""
        if self.f_bar is None:
            # First frame: initialise the event anchor, no boundary yet
            self.f_bar = f_t.copy()
            self.last_boundary_time = timestamp
            return False

        # Signal 1: Semantic drift from the running event average
        sem_drift = 1.0 - self.cosine_sim(f_t, self.f_bar)

        # Signal 2: Motion energy normalized by the window maximum
        self.motion_history.append(motion_energy)
        m_norm = motion_energy / (max(self.motion_history) + 1e-8)

        # Signal 3: Prediction error (0.0 until a predictor is attached)
        pred_error = 0.0
        if self.predictor is not None:
            pred_error = self.predictor.prediction_error(f_t)

        # Combined boundary score, squashed to (0, 1).
        # NOTE: sigmoid(E_t) exceeds 0.96 only when E_t > ln(24) ~ 3.18, so the
        # signal scales (pred_error in particular) must be calibrated on your data.
        E_t = (self.w_sem * sem_drift +
               self.w_mot * m_norm +
               self.w_pred * pred_error)
        p_t = 1.0 / (1.0 + np.exp(-E_t))  # sigmoid

        # Adaptive threshold: rises with recent motion variance
        motion_var = np.var(list(self.motion_history)) if len(self.motion_history) > 1 else 0.0
        tau_t = self.tau_0 * (1.0 + self.eta * motion_var)

        # Pacing: enforce min/max intervals between boundaries
        elapsed = timestamp - self.last_boundary_time
        if elapsed < self.delta_min:
            fire = False
        elif elapsed > self.delta_max:
            fire = True  # Force periodic update
        else:
            fire = bool(p_t > tau_t)

        if fire:
            self.f_bar = f_t.copy()  # Reset the anchor to start the next event
            self.last_boundary_time = timestamp
        else:
            # Update running average (EMA) within the current event
            self.f_bar = (1 - self.rho) * self.f_bar + self.rho * f_t
        return fire
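The detector above expects a predictor object with a `prediction_error(f_t)` method but does not define one. The following is a minimal NumPy sketch of such a predictor, assuming a 3-layer MLP with ReLU activations trained by plain gradient descent on consecutive embedding pairs. The class name `NextEmbeddingPredictor`, the hidden width, and the learning rate are illustrative assumptions; a real deployment would more likely use a deep learning framework.

```python
import numpy as np


class NextEmbeddingPredictor:
    """3-layer MLP (two ReLU hidden layers, linear output) predicting f_t from f_{t-1}."""

    def __init__(self, dim, hidden=64, lr=1e-2, seed=0):
        rng = np.random.default_rng(seed)
        sizes = [dim, hidden, hidden, dim]
        # He-style initialisation; hidden width and learning rate are assumed values.
        self.W = [rng.normal(0, np.sqrt(2.0 / m), (m, n)) for m, n in zip(sizes, sizes[1:])]
        self.b = [np.zeros(n) for n in sizes[1:]]
        self.lr = lr
        self.prev = None  # last embedding seen by prediction_error

    def _forward(self, x):
        h1 = np.maximum(0.0, x @ self.W[0] + self.b[0])
        h2 = np.maximum(0.0, h1 @ self.W[1] + self.b[1])
        return h1, h2, h2 @ self.W[2] + self.b[2]

    def train_step(self, x_prev, x_next):
        """One full-batch gradient step on mean squared error; returns the pre-step loss."""
        h1, h2, y = self._forward(x_prev)
        diff = y - x_next
        loss = float(np.mean(np.sum(diff ** 2, axis=1)))
        # Backpropagate; all gradients are computed before any weight is updated.
        g_y = 2.0 * diff / len(x_prev)
        g_h2 = (g_y @ self.W[2].T) * (h2 > 0)
        g_h1 = (g_h2 @ self.W[1].T) * (h1 > 0)
        grads_W = [x_prev.T @ g_h1, h1.T @ g_h2, h2.T @ g_y]
        grads_b = [g_h1.sum(axis=0), g_h2.sum(axis=0), g_y.sum(axis=0)]
        for i in range(3):
            self.W[i] -= self.lr * grads_W[i]
            self.b[i] -= self.lr * grads_b[i]
        return loss

    def prediction_error(self, f_t):
        """Squared L2 error ||MLP(f_{t-1}) - f_t||^2; returns 0.0 on the first frame."""
        err = 0.0
        if self.prev is not None:
            pred = self._forward(self.prev[None, :])[2][0]
            err = float(np.sum((pred - f_t) ** 2))
        self.prev = f_t.copy()
        return err


# Usage: self-supervised training on consecutive embeddings (random stand-in data here).
embeddings = np.random.default_rng(1).normal(size=(65, 8))
predictor = NextEmbeddingPredictor(dim=8, hidden=16)
losses = [predictor.train_step(embeddings[:-1], embeddings[1:]) for _ in range(200)]
```

After training, the object can be assigned to `detector.predictor`. Because the squared error is unbounded, it may be worth rescaling it before it enters the fused score.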
Example 2: Long video indexing with event-based retrieval
User: "I have 4 hours of conference talk recordings. I want to index them so users can search for specific moments by natural language query."
Approach:
- Process the video offline at 2 FPS with the boundary detector to segment into discrete events.
- For each event, generate a text description using the LLM with memory context.
- Store event embeddings, timestamps, and descriptions in a vector database (e.g., ChromaDB).
- At query time, embed the user's natural language query, retrieve top-K events by cosine similarity, return timestamps and descriptions.
Output:
import numpy as np


def cosine_sim(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)


class EventMemoryBank:
    def __init__(self, gamma_mem=0.85, merge_lambda=0.3):
        self.events = []  # List of {"embedding": ..., "start": ..., "end": ..., "description": ...}
        self.gamma_mem = gamma_mem
        self.merge_lambda = merge_lambda

    def add_event(self, embedding, start_time, end_time):
        """Merge into the latest event if similar enough, else append; return the slot index."""
        if self.events:
            sim = cosine_sim(embedding, self.events[-1]["embedding"])
            if sim > self.gamma_mem:
                # Merge: consolidate similar consecutive events
                last = self.events[-1]
                last["embedding"] = ((1 - self.merge_lambda) * last["embedding"]
                                     + self.merge_lambda * embedding)
                last["end"] = end_time
                return len(self.events) - 1
        # Append as new event
        self.events.append({
            "embedding": embedding,
            "start": start_time,
            "end": end_time,
            "description": None
        })
        return len(self.events) - 1

    def retrieve(self, query_embedding, top_k=5):
        """Return the top_k stored events ranked by cosine similarity to the query."""
        scores = [(i, cosine_sim(query_embedding, e["embedding"]))
                  for i, e in enumerate(self.events)]
        scores.sort(key=lambda x: -x[1])
        return [self.events[i] for i, _ in scores[:top_k]]
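For a multi-hour index, the per-event Python loop in `retrieve` can be replaced by a single matrix product. The sketch below assumes the event embeddings have been stacked into one array and that the query was embedded in the same space (for example with the text tower of the same CLIP-style model); `top_k_events` is a hypothetical helper name.

```python
import numpy as np


def top_k_events(query, embeddings, k=5):
    """Return (indices, scores) of the k rows of `embeddings` most cosine-similar to `query`."""
    q = query / (np.linalg.norm(query) + 1e-8)
    e = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-8)
    scores = e @ q                       # cosine similarity of every event to the query
    order = np.argsort(-scores)[:k]      # highest similarity first
    return order, scores[order]


# Usage with three toy 2-D event embeddings.
event_matrix = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
idx, sims = top_k_events(np.array([1.0, 0.1]), event_matrix, k=2)
```

The returned indices can then be mapped back to the stored timestamps and descriptions.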
Example 3: Adapting an existing video-LLM for streaming
User: "I have a batch video QA model. How do I convert it to handle real-time streams without running out of memory on long videos?"
Approach:
- Replace the fixed frame sampler with the event boundary detector. Instead of uniformly sampling N frames from the entire video, process frames online and accumulate only event embeddings.
- Replace the flat frame-token sequence with the memory bank. At each generation step, the model sees only the current event embedding plus retrieved past events (typically 5-10 tokens total), not all frames.
- Add the merge-or-append logic so the memory bank stays bounded regardless of video length.
- Wrap the model in an async loop: ingest frames continuously, fire the decoder only on boundary events.
Output architecture:
Frame Stream (2 FPS)
    |
    v
Vision Encoder (CLIP/SigLIP) --> f_t
    |
    v
Boundary Detector (motion + semantic + prediction)
    |
    |-- No boundary: update EMA, continue
    |
    |-- Boundary detected:
            |
            v
        Event Aggregator (Gaussian-weighted pooling)
            |
            v
        Memory Bank (merge-or-append)
            |
            v
        Retrieve top-K past events
            |
            v
        LLM Decoder (current event + retrieved context -> text)
            |
            v
        Output description / answer
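The async wrapper mentioned in the approach might look like the sketch below. Everything here is a stand-in: `fake_frames`, `make_jump_detector`, and `fake_decode` are hypothetical placeholders for a real capture source, the boundary detector, and the LLM call, and the embeddings are plain scalars so the example runs without any model.

```python
import asyncio


async def stream_loop(frames, detector_update, decode, results):
    """Score every frame, but await the expensive decoder only when a boundary fires.

    frames: async iterator of (embedding, motion_energy, timestamp) tuples
    detector_update: callable returning True when a boundary fires
    decode: async callable producing text for a boundary event
    """
    async for f_t, motion, ts in frames:
        if detector_update(f_t, motion, ts):
            results.append((ts, await decode(f_t)))


async def fake_frames(n=10, fps=2.0):
    # Stand-in source: scalar "embeddings" that jump from 0.0 to 1.0 at frame 5.
    for i in range(n):
        yield (0.0 if i < 5 else 1.0), 0.0, i / fps
        await asyncio.sleep(0)  # yield control, as a real capture loop would


def make_jump_detector(threshold=0.5):
    # Toy detector: fires when the scalar embedding changes by more than `threshold`.
    state = {"prev": None}

    def update(f_t, motion, ts):
        fired = state["prev"] is not None and abs(f_t - state["prev"]) > threshold
        state["prev"] = f_t
        return fired

    return update


async def fake_decode(f_t):
    await asyncio.sleep(0)  # placeholder for the LLM call
    return f"state changed to {f_t}"


results = []
asyncio.run(stream_loop(fake_frames(), make_jump_detector(), fake_decode, results))
print(results)  # a single entry, at the frame where the value jumps
```

This version awaits the decoder inline, so frame ingestion pauses while text is generated. If that stall matters, one option is to hand boundary events to a separate task through an `asyncio.Queue`.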
Best Practices
- Do: Use all three boundary signals (motion, semantic, prediction). Ablations show that removing any one causes major degradation: the win rate falls from 68% with all three to 12% without motion, 38% without semantics, and 47% without prediction.
- Do: Train the prediction MLP on domain-specific video. A model trained on indoor surveillance will have different "surprise" patterns than one trained on driving footage. Self-supervised training (predict next embedding from previous) requires no labels.
- Do: Tune gamma_mem for your use case. Because events merge when their similarity exceeds gamma_mem, lower values (0.7) merge more aggressively and yield fewer, coarser events, while higher values (0.9) merge only near-duplicates and keep events more granular. Start at 0.85 and adjust based on event density.
- Do: Enforce pacing controls (Delta_min, Delta_max). Without them, rapid scene changes (e.g., channel surfing) cause decoder flooding, and static scenes (e.g., parking lot at night) produce no output for hours.
- Avoid: Processing at more than 2-4 FPS unless your hardware supports it. The boundary detector is lightweight, but the vision encoder is the bottleneck. Higher FPS gives diminishing returns for boundary detection accuracy.
- Avoid: Using fixed thresholds without the adaptive component. The variance-based adjustment (eta * Var(motion)) prevents false triggers during sustained motion (e.g., a person continuously walking) while staying sensitive during calm periods.
- Avoid: Skipping the Gaussian weighting in event aggregation. Uniform averaging dilutes the transition signal -- frames near the boundary carry the most information about what changed.
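To make the adaptive threshold concrete, the sketch below evaluates tau_t = tau_0 * (1 + eta * Var(motion)) for a calm and a busy motion window, and converts each threshold into the raw score E_t needed to cross it (the logit of tau_t). The helper names are hypothetical, and the arithmetic relies only on the formula as quoted in this document.

```python
import math

import numpy as np


def adaptive_threshold(motion_window, tau_0=0.96, eta=0.03):
    """tau_t = tau_0 * (1 + eta * Var(motion_window))."""
    return tau_0 * (1.0 + eta * float(np.var(motion_window)))


def required_score(tau_t):
    """Smallest E_t with sigmoid(E_t) > tau_t, i.e. logit(tau_t); inf if tau_t >= 1."""
    return math.inf if tau_t >= 1.0 else math.log(tau_t / (1.0 - tau_t))


calm = adaptive_threshold([0.1, 0.1, 0.1, 0.1])  # zero variance, so tau_t equals tau_0
busy = adaptive_threshold([0.0, 2.0, 0.0, 2.0])  # variance 1.0, so tau_t = 0.96 * 1.03
print(round(calm, 4), round(required_score(calm), 2))
print(round(busy, 4), round(required_score(busy), 2))
```

With the default parameters, tau_t reaches 1 once the motion variance exceeds roughly 1.39, at which point only the Delta_max timer can trigger a boundary. This suggests computing the variance on normalized motion values rather than raw energies.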
Error Handling
- Boundary detector fires too often (chattering): Increase tau_0 or Delta_min. Check if the motion signal is noisy -- apply temporal smoothing (e.g., 3-frame moving average) before feeding it to the detector.
- Boundary detector misses obvious transitions: Decrease tau_0 or check that the vision encoder produces meaningfully different embeddings for different scenes. A poorly-trained encoder yields flat similarity scores.
- Memory bank grows unbounded: The merge-or-append rule should prevent this, but verify gamma_mem is not set too high (events merge only when their similarity exceeds gamma_mem, so a high value rarely merges). Add a hard cap (e.g., 500 events) with FIFO eviction of oldest entries if needed.
- Prediction MLP produces constant error: The MLP may have collapsed to predicting the mean embedding. Retrain with a lower learning rate or add dropout. Ensure training data includes diverse transitions.
- Latency spikes on boundary events: The LLM decode is the expensive step. Use speculative decoding, quantized models, or limit max output tokens to maintain real-time performance. The paper reports sub-100ms per token with LLaMA-3-8B on RTX 6000 Ada.
- Out-of-memory on long streams: Ensure frame embeddings are discarded after event aggregation. Only event-level embeddings should persist in the memory bank. Each event is a single vector, not a sequence of frame tokens.
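The hard cap with FIFO eviction suggested above can be sketched with a bounded deque. The class name `CappedEventStore` is hypothetical, and the sketch covers only the eviction behaviour; the merge-or-append rule would sit on top of it.

```python
from collections import deque

import numpy as np


class CappedEventStore:
    """Event store that evicts the oldest entry once `max_events` is reached."""

    def __init__(self, max_events=500):
        self.events = deque(maxlen=max_events)  # a full deque drops items from the left

    def append(self, embedding, start, end):
        self.events.append({"embedding": np.asarray(embedding), "start": start, "end": end})

    def __len__(self):
        return len(self.events)


# Usage: with a cap of 3, adding 5 events keeps only the 3 most recent.
store = CappedEventStore(max_events=3)
for i in range(5):
    store.append([float(i), 0.0], start=i, end=i + 1)
```

FIFO eviction discards the oldest context first, which may not suit queries about early parts of a stream; evicting the least-retrieved events is one alternative to consider.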
Limitations
- Requires a vision encoder. The boundary detector operates on frame embeddings, not raw pixels. You need a pretrained vision encoder (CLIP, SigLIP, or similar). The choice of encoder affects what counts as "semantically different."
- Not suitable for frame-level precision tasks. Event-VStream operates at event granularity. If you need to identify the exact frame where an action occurs (e.g., frame-accurate video editing), this approach gives only approximate boundaries.
- Motion signal is camera-dependent. Optical flow from a static surveillance camera behaves very differently from a handheld or ego-centric camera. The eta parameter needs tuning per camera setup.
- Cold start problem. The prediction MLP needs domain-specific training data. On completely novel video domains, prediction error provides little signal until the MLP adapts.
- Text generation quality depends on the base LLM. Event-VStream is an architecture for when to generate, not what to generate. The quality of descriptions depends on the underlying vision-language model.
- Single-stream only. The paper addresses one video stream at a time. Multi-camera systems require separate detector instances with a shared or federated memory bank (not covered in the paper).
Reference
Paper: Event-VStream: Event-Driven Real-Time Understanding for Long Video Streams (Guo et al., 2026). Focus on Section 3 (boundary detection formulas and adaptive threshold), Section 4 (memory bank merge-or-append rule), and Table 2 (ablation showing contribution of each signal).