gst-pipeline-optimizer - SKILL.md Agent Skill

name: gst-pipeline-optimizer
description: Optimize GStreamer pipeline performance. Use when a user needs to reduce latency, increase throughput, fix dropped frames, tune buffer sizes, leverage hardware acceleration, or profile pipeline bottlenecks.

GStreamer Pipeline Optimizer

Optimize GStreamer pipelines for throughput, latency, CPU usage, and memory. Covers queue tuning, hardware acceleration, threading, and profiling.

Performance Diagnosis Checklist

Identify the bottleneck: Is it CPU, GPU, memory, I/O, or network?
Measure first: Use tracers and profiling before optimizing
Target the slowest element: One slow element throttles the entire pipeline
Check hardware acceleration: Software encoding/decoding is the most common bottleneck

Queue Placement and Tuning

When to Use Queues

Between every encoder/decoder and the rest of the pipeline
Before and after any element that blocks (network sinks, file I/O)
To create thread boundaries for parallel processing
In tee branches to prevent one slow branch from stalling others
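
The tee case can be sketched as a small helper that assembles a pipeline description with one queue at the head of every branch. This is an illustrative sketch only: the function name `tee_with_queues` and the example branch contents are assumptions, not part of any GStreamer API, and the resulting string is meant to be handed to gst-launch-1.0.

```python
def tee_with_queues(source: str, branches: list[str], tee_name: str = "t") -> str:
    """Build a gst-launch description where every tee branch starts with a queue."""
    if not branches:
        raise ValueError("at least one branch is required")
    parts = [f"{source} ! tee name={tee_name}"]
    for branch in branches:
        # Each branch gets its own queue, so it runs in its own streaming
        # thread and a slow branch does not block the others at the tee.
        parts.append(f"{tee_name}. ! queue ! {branch}")
    return " ".join(parts)


if __name__ == "__main__":
    print(tee_with_queues("videotestsrc", ["videoconvert ! autovideosink", "fakesink"]))
```

For live pipelines the plain `queue` in each branch could be swapped for a leaky one, depending on whether dropping frames in that branch is acceptable.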

queue vs queue2

Feature	`queue`	`queue2`
Buffering	Memory only	Memory + disk
Use case	Thread decoupling	Network stream buffering
Overhead	Lower	Higher
Temp file support	No	Yes

Queue Tuning Properties

# queue: control thread boundary buffering (the values shown are the element defaults)
queue max-size-buffers=200 max-size-bytes=10485760 max-size-time=1000000000
#       200 buffers          10 MB                  1 second (nanoseconds)

# Disable unneeded limits (set to 0)
queue max-size-buffers=0 max-size-bytes=0 max-size-time=2000000000  # Only time-based limit

# leaky queue: drop old/new buffers when full (live pipelines)
queue leaky=downstream max-size-buffers=3   # Drop oldest when full
queue leaky=upstream max-size-buffers=3     # Drop newest when full

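# queue2: network buffering (zero the buffer and time limits so the byte limit is the one that applies)
queue2 use-buffering=true max-size-buffers=0 max-size-time=0 max-size-bytes=20971520  # 20 MB buffer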
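
Because a queue is full as soon as any one enabled limit is reached, it helps to work out which limit bites first for a given stream. The sketch below does that arithmetic. It is an approximation under stated assumptions: the function name `first_limit_hit` is hypothetical, buffers are assumed to be one frame each and of constant size, and the real element may hold one buffer more than computed because it checks fullness before accepting data.

```python
def first_limit_hit(buffer_bytes: int, fps: float,
                    max_buffers: int = 200,
                    max_bytes: int = 10_485_760,
                    max_time_ns: int = 1_000_000_000) -> tuple[str, int]:
    """Return (limit_name, approx_buffers_held) for the limit that fills first.

    A limit of 0 is treated as disabled, matching the queue properties.
    """
    candidates = {}
    if max_buffers:
        candidates["max-size-buffers"] = max_buffers
    if max_bytes:
        candidates["max-size-bytes"] = max_bytes // buffer_bytes
    if max_time_ns:
        candidates["max-size-time"] = int(max_time_ns * fps / 1_000_000_000)
    if not candidates:
        raise ValueError("all limits disabled: the queue is unbounded")
    name = min(candidates, key=candidates.get)
    return name, candidates[name]


if __name__ == "__main__":
    raw_1080p_i420 = 1920 * 1080 * 3 // 2   # bytes per raw frame
    print(first_limit_hit(raw_1080p_i420, fps=30))
    print(first_limit_hit(16_666, fps=30))   # roughly 4000 kbit/s encoded video
```

With the default limits, raw 1080p video runs into the byte limit after about three frames, while an encoded stream of this bitrate runs into the one-second time limit first. That is why raw-video queues often need `max-size-bytes` raised or disabled.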

multiqueue for Parallel Streams

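# Use multiqueue when handling multiple streams (audio + video)
# Each stream gets its own sink_N/src_N pad pair; encoder and audio_encoder are placeholders
decodebin name=dec  multiqueue name=mq
dec. ! video/x-raw ! mq.sink_0   mq.src_0 ! videoconvert ! encoder
dec. ! audio/x-raw ! mq.sink_1   mq.src_1 ! audioconvert ! audio_encoder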

Hardware Acceleration

Detection

# Check for VAAPI support
gst-inspect-1.0 | grep vaapi
vainfo  # Show VAAPI capabilities

# Check for NVIDIA NVDEC/NVENC
gst-inspect-1.0 nvcodec  # Lists the NVDEC/NVENC elements (a bare "grep -i nv" also matches videoconvert)
nvidia-smi  # Verify GPU is available

# Check for V4L2 hardware codecs
gst-inspect-1.0 | grep 'v4l2.*dec'   # Quote the pattern so the shell does not expand it
gst-inspect-1.0 | grep 'v4l2.*enc'
v4l2-ctl --list-devices

Hardware-Accelerated Pipeline Examples

# VAAPI H.264 encoding (Intel/AMD)
v4l2src ! videoconvert ! vaapih264enc rate-control=cbr bitrate=4000 ! h264parse ! mp4mux ! filesink location=out.mp4

# NVIDIA H.264 encoding
v4l2src ! videoconvert ! nvh264enc bitrate=4000 preset=low-latency-hq ! h264parse ! mp4mux ! filesink location=out.mp4

# VAAPI decoding + display (zero-copy)
filesrc location=video.mp4 ! qtdemux ! h264parse ! vaapih264dec ! vaapisink

# NVIDIA decode + encode (transcode on GPU)
filesrc location=input.mp4 ! qtdemux ! h264parse ! nvh264dec ! nvh264enc bitrate=2000 ! h264parse ! mp4mux ! filesink location=output.mp4

Hardware Acceleration Priority

VAAPI - Best Linux support (Intel, AMD)
NVIDIA NVENC/NVDEC - Best for NVIDIA GPUs
V4L2 - Embedded systems (Raspberry Pi, Jetson)
Software - Fallback, always available
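
The priority order above can be applied programmatically. The sketch below picks the first H.264 encoder from that list that is installed. The helper names are hypothetical; `v4l2h264enc` is assumed as the V4L2 entry since the document does not name one, and availability is probed with `gst-inspect-1.0 --exists`, which requires GStreamer to be installed.

```python
import subprocess

# Preference order taken from the list above.
H264_ENCODER_PRIORITY = ["vaapih264enc", "nvh264enc", "v4l2h264enc", "x264enc"]


def element_exists(name: str) -> bool:
    """True if gst-inspect-1.0 reports the element as installed."""
    try:
        result = subprocess.run(["gst-inspect-1.0", "--exists", name],
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
    except FileNotFoundError:
        return False  # GStreamer tools are not on PATH
    return result.returncode == 0


def pick_h264_encoder(available: set[str]) -> str:
    """Return the highest-priority encoder present in `available`."""
    for name in H264_ENCODER_PRIORITY:
        if name in available:
            return name
    raise LookupError("no H.264 encoder found")


if __name__ == "__main__":
    found = {n for n in H264_ENCODER_PRIORITY if element_exists(n)}
    print(pick_h264_encoder(found) if found else "no H.264 encoder installed")
```

An installed element does not guarantee a working device, so a short test encode remains the more reliable check.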

Latency Optimization

Low-Latency Encoding

# x264enc low-latency settings
x264enc tune=zerolatency speed-preset=ultrafast bitrate=2500 key-int-max=30

# Key properties:
#   tune=zerolatency     - Disables B-frames and lookahead
#   speed-preset=ultrafast - Fastest encoding, largest file
#   key-int-max=N        - Keyframe interval (lower = more seekable, larger file)
#   threads=N            - Encoding threads (0 = auto)

# VAAPI low-latency
vaapih264enc rate-control=cbr bitrate=2500 keyframe-period=30
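
Keyframe intervals are given in frames, so the right value depends on the frame rate. A minimal sketch of that conversion, using the x264enc properties shown above; the function names are hypothetical and the one-second default interval is only an example, not a recommendation from the source.

```python
def keyframe_interval_frames(fps: float, seconds: float) -> int:
    """Convert a keyframe interval in seconds to a frame count (at least 1)."""
    return max(1, round(fps * seconds))


def x264_low_latency(bitrate_kbps: int, fps: float, gop_seconds: float = 1.0) -> str:
    """Build an x264enc element description with low-latency settings."""
    gop = keyframe_interval_frames(fps, gop_seconds)
    return ("x264enc tune=zerolatency speed-preset=ultrafast "
            f"bitrate={bitrate_kbps} key-int-max={gop}")


if __name__ == "__main__":
    print(x264_low_latency(2500, fps=30))
    print(x264_low_latency(2500, fps=60, gop_seconds=2))
```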

Pipeline-Level Latency Settings

# Set latency on the pipeline (nanoseconds)
# Programmatic:
pipeline.set_latency(100 * Gst.MSECOND)  # 100ms

# For live sources, a one-buffer leaky queue keeps only the newest frame
gst-launch-1.0 v4l2src ! queue max-size-buffers=1 leaky=downstream ! videoconvert ! autovideosink sync=false

# Disable sync on sinks for lowest latency (frames render as they arrive, so playback may be uneven)
autovideosink sync=false
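
The latency-related settings in this document use different units: `set_latency` and `max-size-time` take nanoseconds, while `rtpjitterbuffer latency` takes milliseconds. The sketch below is a small set of conversion helpers, plus an estimate of the worst-case delay a full queue adds. The helper names are hypothetical, and the delay estimate assumes one frame per buffer at a constant frame rate.

```python
NS_PER_MS = 1_000_000  # same value as Gst.MSECOND


def ms_to_ns(ms: float) -> int:
    """Milliseconds to nanoseconds, for set_latency() or max-size-time."""
    return int(round(ms * NS_PER_MS))


def frame_interval_ms(fps: float) -> float:
    """Duration of one frame in milliseconds."""
    return 1000.0 / fps


def queue_delay_ms(max_buffers: int, fps: float) -> float:
    """Worst-case delay added by a full queue holding max_buffers frames."""
    return max_buffers * frame_interval_ms(fps)


if __name__ == "__main__":
    print(ms_to_ns(100))           # value to pass to pipeline.set_latency()
    print(queue_delay_ms(3, 30))   # a 3-buffer queue at 30 fps adds about 100 ms
```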

Network Streaming Latency

# RTP: minimize jitter buffer
rtpjitterbuffer latency=50  # 50ms (default is 200ms)

# SRT: latency is in milliseconds and defaults to 125; lowering it reduces delay but leaves less time to recover lost packets
srtsrc uri="srt://0.0.0.0:1234?mode=listener" latency=125
srtsink uri="srt://dest:1234" latency=125

# RTMP
rtmpsink location="rtmp://server/live/key live=1"

Throughput Optimization

Parallel Processing with Threads

# Each queue starts a new streaming thread for everything downstream of it
src ! queue ! decoder ! queue ! filter ! queue ! encoder ! queue ! sink
# Four queues split this pipeline into five threads: src | decoder | filter | encoder | sink
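
The same pattern can be generated from a list of stages. This sketch inserts a queue between every pair of stages and counts the resulting streaming threads. The function names are hypothetical, and the count assumes a linear pipeline with a single source thread in which only plain `queue` elements create thread boundaries.

```python
def with_thread_boundaries(stages: list[str]) -> str:
    """Join pipeline stages with a queue between each pair."""
    return " ! queue ! ".join(stages)


def streaming_thread_count(description: str) -> int:
    """One thread for the source, plus one per queue element."""
    elements = [part.split()[0] for part in description.split("!") if part.strip()]
    return 1 + sum(1 for name in elements if name == "queue")


if __name__ == "__main__":
    desc = with_thread_boundaries(["src", "decoder", "filter", "encoder", "sink"])
    print(desc)
    print(streaming_thread_count(desc))
```

A queue per stage is the upper bound on parallelism. Stages that do little work rarely justify their own thread, so removing those queues keeps latency and memory down.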

Batch Processing

# Process multiple files: use pipeline restart or dynamic pipelines
# For transcoding farms, run multiple gst-launch instances (here via GNU parallel, one job per input file)
parallel gst-launch-1.0 filesrc location={} ! decodebin ! videoconvert ! x264enc ! mp4mux ! filesink location={.}.mp4 ::: *.avi
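
Where GNU parallel is not available, the same fan-out can be sketched with the Python standard library. This is a rough equivalent under assumptions: the function names and the worker count of 4 are illustrative, the pipeline is video-only like the command above, and each job is a separate gst-launch-1.0 process.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def transcode_command(src: Path) -> list[str]:
    """Build the gst-launch-1.0 argument list for one input file."""
    dst = src.with_suffix(".mp4")
    return ["gst-launch-1.0",
            "filesrc", f"location={src}", "!", "decodebin", "!", "videoconvert", "!",
            "x264enc", "!", "mp4mux", "!", "filesink", f"location={dst}"]


def transcode_all(files: list[Path], max_workers: int = 4) -> list[int]:
    """Run one gst-launch-1.0 process per file and return the exit codes."""
    def run(src: Path) -> int:
        return subprocess.run(transcode_command(src)).returncode

    # Threads only wait on child processes here, so a thread pool is enough.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, files))
```

A call such as `transcode_all(sorted(Path(".").glob("*.avi")))` would process every AVI file in the current directory. Software encoders are already multi-threaded, so a worker count close to the number of cores can oversubscribe the CPU.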

Memory Optimization

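# Buffer pools are negotiated automatically by most elements; on appsrc, bound the internal queue instead
# block=true makes pushes wait when the queue is full rather than letting memory grow
appsrc format=time block=true max-bytes=1048576  # 1 MB queue limit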

# Reduce queue memory
queue max-size-bytes=1048576 max-size-buffers=5  # Limit memory per queue

# Zero-copy where possible (hardware acceleration, same-GPU processing)

Profiling Tools

Built-in Tracers

# Latency tracer: measure end-to-end latency
GST_TRACERS=latency GST_DEBUG=GST_TRACER:7 gst-launch-1.0 ...

# Stats tracer: per-element statistics
GST_TRACERS=stats GST_DEBUG=GST_TRACER:7 gst-launch-1.0 ...

# Leaks tracer: detect buffer/event leaks
GST_TRACERS=leaks GST_DEBUG=GST_TRACER:7 gst-launch-1.0 ...

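# Framerate tracer: measure FPS at each point (not part of core GStreamer; provided by the separate GstShark package)
GST_TRACERS=framerate GST_DEBUG=GST_TRACER:7 gst-launch-1.0 ...

# Combined (separate tracer names with semicolons)
GST_TRACERS='latency;stats;leaks' GST_DEBUG=GST_TRACER:7 gst-launch-1.0 ...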
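
Tracer output is written to the debug log on stderr, so it can be captured with `2> trace.log` and summarized afterwards (setting `GST_DEBUG_NO_COLOR=1` keeps the log free of color codes). The sketch below extracts latency tracer samples from such a log. It assumes that latency records contain a field of the form `time=(guint64)<nanoseconds>`; that format may differ between GStreamer versions, so check it against a real log first. The function names are hypothetical.

```python
import re
import statistics

# Matches "latency, ... time=(guint64)N" but not "element-latency, ..." records.
_LATENCY_RE = re.compile(r"(?<![\w-])latency,.*?\btime=\(guint64\)(\d+)")


def latency_samples_ns(log_text: str) -> list[int]:
    """Extract end-to-end latency samples (nanoseconds) from tracer log text."""
    return [int(m.group(1)) for m in _LATENCY_RE.finditer(log_text)]


def summarize_ms(samples_ns: list[int]) -> dict:
    """Return count, mean, and maximum latency in milliseconds."""
    if not samples_ns:
        raise ValueError("no latency samples found")
    return {
        "count": len(samples_ns),
        "mean_ms": statistics.mean(samples_ns) / 1e6,
        "max_ms": max(samples_ns) / 1e6,
    }


if __name__ == "__main__":
    import sys
    # Usage: python summarize_latency.py trace.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        print(summarize_ms(latency_samples_ns(fh.read())))
```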

CPU Profiling

# Use perf to find CPU hot spots (perf reports functions and libraries, which map back to elements)
perf record -g gst-launch-1.0 ...
perf report

# Or use sysprof for graphical profiling
sysprof-cli --command "gst-launch-1.0 ..."

Monitoring at Runtime

# identity element: log buffer timestamps and sizes (run gst-launch-1.0 with -v to see the messages)
... ! identity silent=false ! ...

# fpsdisplaysink: show FPS on screen
... ! fpsdisplaysink video-sink=autovideosink text-overlay=true

# Dump the pipeline graph as .dot files on each state change (render with Graphviz, e.g. dot -Tpng)
GST_DEBUG_DUMP_DOT_DIR=/tmp gst-launch-1.0 ...

Common Performance Issues and Fixes

Symptom	Likely Cause	Fix
Dropped frames	Encoder too slow	Use HW acceleration or faster preset
High CPU on encode	Software encoding	Switch to VAAPI/NVENC
Audio/video desync	Missing queues	Add queue before encoder and muxer
Pipeline stalls	Blocking sink	Add leaky queue before sink
Memory grows	Buffer leak	Use leaks tracer, check appsink usage
High latency	Large queue/jitter buffer	Reduce queue sizes, lower jitterbuffer latency
Choppy playback	No thread separation	Add queues between stages

Guidelines

Measure before optimizing - use tracers to identify the actual bottleneck
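Hardware encoding typically offloads most encoding work from the CPU and can be many times faster than software; the gain depends on codec, preset, and hardware, so measure it on the target system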
tune=zerolatency on x264enc is the single biggest low-latency win for software H.264
In live pipelines, use leaky=downstream queues to drop old frames rather than building up latency
sync=false on sinks removes display-clock synchronization overhead, useful for benchmarking or processing pipelines
Zero-copy paths (VAAPI decode -> VAAPI encode, or VAAPI decode -> vaapisink) avoid expensive GPU-to-CPU memory transfers
Over-queuing wastes memory and increases latency; under-queuing causes stalls. Start with max-size-time=1000000000 (1s) and adjust
For network streaming, the bottleneck is usually the encoder - optimize encoding first