name: webrtcvad
description: >-
Reference for py-webrtcvad, the Python wrapper around Google's WebRTC Voice
Activity Detector (import webrtcvad, the Vad class, is_speech). Use this
whenever working with webrtcvad: splitting PCM audio into speech vs. silence,
segmenting recordings, gating a transcription/ASR pipeline on speech, building
a frame_generator / vad_collector, or choosing an aggressiveness mode. Trigger
it for the strict audio-format constraints (16-bit mono PCM at 8/16/32/48 kHz,
frames of exactly 10/20/30 ms), for is_speech raising errors or returning
garbage, for the sliding-window "triggered" segmenter pattern, and any time
someone mentions webrtcvad, WebRTC VAD, or wiseman/py-webrtcvad — even if they
don't name the exact API. Note this is the classic energy/GMM WebRTC VAD, NOT a
neural model like Silero; pick the right tool when both are in play.
py-webrtcvad
py-webrtcvad is a thin Python binding over the voice-activity detector shipped in Google's WebRTC stack. It answers one narrow question per audio frame: does this frame contain speech? It is fast, dependency-free at runtime, and runs on CPU with no model file — but it is a classic signal-processing VAD (energy + a Gaussian mixture over frequency bands), not a neural network. That trade-off drives almost everything below: it is cheap and instant, but pickier about input format and noisier in its judgments than a model like Silero.
The one rule that breaks everyone: input format is strict
WebRTC VAD rejects anything that isn't shaped exactly right, and the failure modes are unhelpful — either a raised exception or a silently wrong answer. Get these four things right and the rest is easy:
- 16-bit signed PCM: int16 samples, little-endian, passed as raw bytes. Not float, not numpy arrays, not 8-bit, not 24-bit.
- Mono: one channel. Downmix stereo before you start.
- Sample rate ∈ {8000, 16000, 32000, 48000} Hz: no other rate is accepted.
- Frame duration ∈ {10, 20, 30} ms, and only those. The frame's byte length must equal sample_rate * (duration_ms / 1000) * 2 (the * 2 is the two bytes per int16 sample). A frame of any other length is the single most common cause of is_speech raising an error.
If you have audio in any other form (mp3, 44.1 kHz, float32, stereo), resample and convert first, e.g. with ffmpeg, soundfile + librosa, or pydub. The VAD will not do it for you.
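The byte-length rule above can be checked before a frame ever reaches the detector. The following is a minimal sketch using only the standard library; the helper names frame_byte_length and is_valid_frame are hypothetical and not part of webrtcvad.

```python
VALID_RATES = (8000, 16000, 32000, 48000)  # Hz, per the constraints above
VALID_FRAME_MS = (10, 20, 30)              # ms, per the constraints above
BYTES_PER_SAMPLE = 2                       # one int16 sample


def frame_byte_length(sample_rate: int, frame_ms: int) -> int:
    """Bytes in one frame of 16-bit mono PCM; ValueError if unsupported."""
    if sample_rate not in VALID_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate}")
    if frame_ms not in VALID_FRAME_MS:
        raise ValueError(f"unsupported frame duration: {frame_ms} ms")
    # Integer arithmetic avoids float rounding in the sample count.
    return sample_rate * frame_ms // 1000 * BYTES_PER_SAMPLE


def is_valid_frame(frame: bytes, sample_rate: int) -> bool:
    """True if `frame` is exactly 10, 20, or 30 ms long at a supported rate."""
    if sample_rate not in VALID_RATES:
        return False
    return any(len(frame) == frame_byte_length(sample_rate, ms)
               for ms in VALID_FRAME_MS)
```

For example, frame_byte_length(16000, 30) gives 960 bytes (480 samples of 2 bytes each), which is the frame size used in the segmenter examples later in this document.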
Core API
import webrtcvad
vad = webrtcvad.Vad() # default aggressiveness (mode 0)
vad = webrtcvad.Vad(2) # or construct with a mode 0-3
vad.set_mode(3) # change mode later
# frame must be exactly 10/20/30 ms of 16-bit mono PCM bytes at sample_rate
speech: bool = vad.is_speech(frame_bytes, sample_rate)
That is the entire surface area: construct, optionally set a mode, and call
is_speech(frame, sample_rate) -> bool per frame. There is no audio buffering
and no segmentation: each call classifies exactly the frame you pass in. The
underlying detector does adapt its internal noise estimate across calls, so use
one Vad instance per stream rather than sharing one between unrelated streams.
All the useful behaviour (turning a stream of yes/no flags into clean speech
segments) is logic you build on top, shown below.
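As an illustration of the per-frame contract, here is a small sketch that maps a PCM buffer to a list of yes/no flags. The function name speech_flags is hypothetical; the vad argument is assumed to be any object exposing is_speech(frame, sample_rate), such as a webrtcvad.Vad instance.

```python
from typing import List


def speech_flags(vad, pcm: bytes, sample_rate: int,
                 frame_ms: int = 30) -> List[bool]:
    """Classify each complete frame of `pcm`; a short trailing tail is dropped."""
    n = sample_rate * frame_ms // 1000 * 2  # bytes per frame (int16 mono)
    return [
        vad.is_speech(pcm[offset:offset + n], sample_rate)
        for offset in range(0, len(pcm) - n + 1, n)
    ]
```

With the real detector this would be called as speech_flags(webrtcvad.Vad(2), pcm, 16000). The resulting flags are the raw, flickering signal that the segmenter pattern later in this document smooths out.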
Aggressiveness modes (0–3)
The mode controls how aggressively the detector filters out non-speech. Higher is stricter about calling something speech:
- 0 — least aggressive. Lets the most audio through as "speech"; good when missing speech is worse than including some noise (high recall).
- 1, 2 — middle ground. Mode 2 or 3 is the usual choice for noisy, real-world audio.
- 3 — most aggressive. Filters hardest; best in noisy environments where you want only confident speech, at the cost of clipping quiet or short utterances.
There is no universally correct mode — it depends on your noise floor and whether false positives or false negatives hurt more. Start at 2 or 3 for captured system/desktop audio, drop toward 0–1 for clean close-mic recordings where you don't want to lose soft speech.
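One way to choose a mode empirically is to run the same frames through all four and compare how much audio each one keeps. This is a sketch under assumptions: voiced_fraction_by_mode is a hypothetical helper, and make_vad is assumed to be a factory such as webrtcvad.Vad that accepts a mode and returns an object with is_speech.

```python
from typing import Callable, Dict, Sequence


def voiced_fraction_by_mode(make_vad: Callable[[int], object],
                            frames: Sequence[bytes],
                            sample_rate: int) -> Dict[int, float]:
    """Fraction of frames flagged as speech for each mode 0 to 3."""
    results = {}
    for mode in range(4):
        vad = make_vad(mode)  # fresh detector per mode, so runs stay independent
        voiced = sum(1 for frame in frames if vad.is_speech(frame, sample_rate))
        results[mode] = voiced / len(frames) if frames else 0.0
    return results
```

Called as voiced_fraction_by_mode(webrtcvad.Vad, frames, 16000) on a representative recording, a sharp drop between two adjacent modes suggests where noise starts being rejected. Listening to what was dropped is still the real check.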
The real pattern: per-frame flags → clean segments
A raw stream of is_speech results flickers — a single noisy frame mid-speech
reads as silence, a lip-smack mid-silence reads as speech. Chopping on every flip
gives you shredded, useless segments. The canonical fix from the project's
example.py is a sliding-window state machine with hysteresis: only start
a segment once a window is mostly-voiced, and only end it once a window is
mostly-unvoiced. This rides through brief glitches in both directions.
Two pieces make it work — a frame generator and the collector. Read
references/example.py.md for the full, runnable reference implementation
(reading/writing WAV, the Frame class, both functions, and a main). Adapt it
rather than retyping from memory; the padding/ratio bookkeeping is easy to get
subtly wrong.
The shape of it:
# 1. Slice contiguous PCM into fixed-duration Frames (carrying a timestamp).
frames = list(frame_generator(frame_duration_ms=30, audio=pcm, sample_rate=sr))

# 2. Feed frames through the collector; it yields one bytes blob per detected
#    utterance, already trimmed to start/end on confident speech transitions.
segments = vad_collector(sr, frame_duration_ms=30,
                         padding_duration_ms=300, vad=vad, frames=frames)
for segment in segments:
    # `segment` is concatenated PCM bytes for one speech run: transcribe,
    # write to a WAV, etc.
    ...
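The frame-slicing step can be sketched as follows. This is an adaptation written for illustration, not the project's code: the Frame field name data is an assumption, and the loop keeps a final frame that fits exactly, whereas a strict `<` comparison would drop it along with any short tail.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Frame:
    data: bytes        # raw int16 mono PCM for this frame (hypothetical field name)
    timestamp: float   # start time in seconds from the beginning of the audio
    duration: float    # frame length in seconds


def frame_generator(frame_duration_ms: int, audio: bytes,
                    sample_rate: int) -> Iterator[Frame]:
    """Yield consecutive fixed-size frames; an incomplete tail is never emitted."""
    n = sample_rate * frame_duration_ms // 1000 * 2  # bytes per frame
    duration = (n // 2) / sample_rate
    index = 0
    while (index + 1) * n <= len(audio):
        start = index * n
        # Timestamp is derived from the index to avoid accumulating float error.
        yield Frame(audio[start:start + n], index * duration, duration)
        index += 1
```

Carrying the timestamp on each frame makes it possible to report where a detected segment starts and ends without recomputing offsets later.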
How the collector decides (the part worth internalising):
- It keeps a ring buffer of the last padding_duration_ms / frame_duration_ms frames (e.g. 300 ms / 30 ms = 10 frames).
- Not triggered (in silence): when >90% of the ring buffer is voiced, it triggers. It emits the buffered padding frames as the segment's lead-in (so you don't clip the word's onset) and starts collecting.
- Triggered (in speech): when >90% of the ring buffer is unvoiced, it de-triggers — yields the collected segment and resets.
- Trailing audio still buffered when the stream ends is flushed as a final segment.
The padding_duration_ms (typically ~300 ms) is the smoothing window: longer
padding tolerates longer gaps within one utterance but merges nearby utterances;
shorter padding splits more eagerly. The 0.9 ratios are the hysteresis
thresholds — they're what stop a single odd frame from starting or ending a
segment.
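The trigger and de-trigger logic described above can be sketched as a generator. This is a simplified illustration rather than the reference implementation: collect_segments is a hypothetical name, it takes raw bytes frames instead of Frame objects, and vad is assumed to be any object with is_speech(frame, sample_rate).

```python
import collections
from typing import Iterable, Iterator


def collect_segments(sample_rate: int, frame_duration_ms: int,
                     padding_duration_ms: int, vad,
                     frames: Iterable[bytes]) -> Iterator[bytes]:
    """Yield one PCM bytes blob per detected speech run."""
    window = padding_duration_ms // frame_duration_ms
    ring = collections.deque(maxlen=window)  # recent (frame, is_speech) pairs
    triggered = False
    collected = []
    for frame in frames:
        speech = vad.is_speech(frame, sample_rate)
        ring.append((frame, speech))
        if not triggered:
            voiced = sum(1 for _, s in ring if s)
            if voiced > 0.9 * window:
                # Start a segment; the buffered frames become its lead-in.
                triggered = True
                collected.extend(f for f, _ in ring)
                ring.clear()
        else:
            collected.append(frame)
            unvoiced = sum(1 for _, s in ring if not s)
            if unvoiced > 0.9 * window:
                # End the segment and go back to waiting for speech.
                triggered = False
                yield b"".join(collected)
                collected = []
                ring.clear()
    if collected:
        # Flush a segment still open when the stream ends.
        yield b"".join(collected)
```

Note that the comparison is against 0.9 of the full window size, not the current buffer fill, so a segment cannot start until the ring buffer has seen nearly a full window of voiced frames.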
Common pitfalls
- Frame length mismatch. Computing frame size in samples but passing the wrong byte count (forgetting * 2 for int16), or feeding a leftover partial frame at the end of the buffer. frame_generator deliberately stops before any short tail (while offset + n < len(audio)); honour that.
- Passing numpy or float audio. is_speech wants raw bytes of int16. From numpy: arr.astype(np.int16).tobytes(). From float in [-1, 1]: (arr * 32767).astype(np.int16).tobytes().
- Unsupported sample rate. 44100 Hz (CD audio) and 22050 Hz are not supported; resample to 16000 or 48000 first.
- Expecting word-level precision. This is frame-level (10–30 ms) and tuned for telephony-style speech presence, not phoneme boundaries. For tight word timings, follow VAD with a forced aligner.
- Treating it as a recognizer. It detects speech vs. not, not who or what. Music, laughter, and some noises can read as speech, especially at low aggressiveness.
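For the format pitfalls above, the conversions can also be done without numpy. The sketch below uses only the standard library; float_to_pcm16 and stereo_to_mono are hypothetical helper names, and the clipping of out-of-range floats is an added safeguard rather than something the text above prescribes.

```python
import array
import sys


def float_to_pcm16(samples) -> bytes:
    """Convert floats in [-1.0, 1.0] to little-endian int16 PCM, clipping overshoot."""
    ints = array.array(
        "h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples)
    )
    if sys.byteorder == "big":
        ints.byteswap()  # emit little-endian regardless of host byte order
    return ints.tobytes()


def stereo_to_mono(pcm: bytes) -> bytes:
    """Average interleaved left/right int16 samples into a single channel."""
    stereo = array.array("h")
    stereo.frombytes(pcm)  # raises ValueError on an odd byte count
    if sys.byteorder == "big":
        stereo.byteswap()
    mono = array.array(
        "h",
        ((stereo[i] + stereo[i + 1]) // 2 for i in range(0, len(stereo) - 1, 2)),
    )
    if sys.byteorder == "big":
        mono.byteswap()
    return mono.tobytes()
```

Neither helper resamples. Audio at 44100 Hz or 22050 Hz still has to be brought to a supported rate with an external tool before framing.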
Installation
pip install webrtcvad
If you hit a C-compiler error building the extension (common on Windows or in
minimal CI images), use the prebuilt-wheel fork instead — same import webrtcvad, same API:
pip install webrtcvad-wheels
Picking webrtcvad vs. a neural VAD (e.g. Silero)
Reach for webrtcvad when you want a tiny, instant, model-free CPU detector and can tolerate some noise sensitivity — quick segmentation, pre-filtering, or environments where pulling in PyTorch is overkill. Reach for a neural VAD like Silero when accuracy in noise matters more than footprint, or when you're already in a torch pipeline. They have different input contracts (Silero takes float tensors and its own chunk sizes), so don't assume frames are interchangeable — if a project already standardises on one, match it rather than mixing the two.