qwen-asr-skill

Local speech-to-text transcription using Qwen3-ASR-1.7B. Use this skill when a user provides an audio file path and needs speech recognition, transcription, or ASR. Supports 52 languages and dialects, and common audio formats including WAV, MP3, FLAC, M4A, and OGG.

By brisyramshere. Updated 2/15/2026

name: qwen-asr-skill
description: Local speech-to-text transcription using Qwen3-ASR-1.7B. Use this skill when a user provides an audio file path and needs speech recognition, transcription, or ASR. Supports 52 languages and dialects, and common audio formats including WAV, MP3, FLAC, M4A, and OGG.

Qwen ASR Skill - Local Speech-to-Text

Transcribe audio files to text locally using the Qwen3-ASR-1.7B model.

Setup

Install dependencies into a Python 3.10+ virtual environment:

python -m venv .venv
source .venv/bin/activate
pip install qwen-asr soundfile silero-vad

For GPUs with compute capability < 7.0 (e.g. GTX 1060), install PyTorch 2.4.x with CUDA 11.8:

pip install torch==2.4.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu118
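
To decide whether the CUDA 11.8 build applies to a given machine, the compute capability can be read from PyTorch when some build of it is already importable. The following is a minimal sketch under that assumption; `needs_cu118_build` and `detect_capability` are hypothetical helper names and are not part of this skill.

```python
from typing import Optional, Tuple

# Threshold taken from the setup note: GPUs below compute capability 7.0
# are pointed at the PyTorch 2.4.x / CUDA 11.8 wheels.
MIN_CAPABILITY = (7, 0)


def needs_cu118_build(capability: Tuple[int, int]) -> bool:
    """Return True when a (major, minor) compute capability is below 7.0."""
    return capability < MIN_CAPABILITY


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return the first CUDA device's capability, or None if it cannot be read."""
    try:
        import torch  # imported lazily so the helper also loads without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)
```

A GTX 1060 reports capability (6, 1), so `needs_cu118_build((6, 1))` is true, while a capability of (7, 0) or higher is not.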

Usage

Run the transcription script with the path to an audio file:

python scripts/transcribe.py <audio_path>

Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| audio_path | Absolute path to audio file (required) | - |
| --language | Force language (e.g. Chinese, English). Auto-detect if omitted | Auto-detect |
| --device | Inference device: auto / cuda / cpu | auto |
| --model-path | Model path or HuggingFace ID | ~/models/Qwen3-ASR-1.7B |
| --max-chunk-sec | Max chunk duration for VAD splitting. Long audio is split at silence boundaries | 90 |
| --max-new-tokens | Max tokens to generate. Increase for long audio | 2048 |
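
For readers who want to see how such an interface might be declared, here is a minimal `argparse` sketch that mirrors the table. It is an illustration only: the actual parser in scripts/transcribe.py may differ, the `build_parser` name is hypothetical, and the argument types (float seconds, integer token count) are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose options and defaults follow the parameter table."""
    parser = argparse.ArgumentParser(
        description="Transcribe an audio file with Qwen3-ASR-1.7B."
    )
    parser.add_argument("audio_path", help="Absolute path to the audio file")
    # None stands for "auto-detect the language".
    parser.add_argument("--language", default=None,
                        help="Force a language, e.g. Chinese or English")
    parser.add_argument("--device", choices=["auto", "cuda", "cpu"],
                        default="auto", help="Inference device")
    parser.add_argument("--model-path", default="~/models/Qwen3-ASR-1.7B",
                        help="Local model path or HuggingFace ID")
    # Assumed to be a duration in seconds.
    parser.add_argument("--max-chunk-sec", type=float, default=90.0,
                        help="Max chunk duration for VAD splitting")
    parser.add_argument("--max-new-tokens", type=int, default=2048,
                        help="Max tokens to generate")
    return parser
```

Calling `build_parser().parse_args(["/path/to/audio.wav"])` yields the defaults listed in the table, with `language` left as `None`.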

Examples

Basic transcription:

python scripts/transcribe.py /path/to/audio.wav

Force language:

python scripts/transcribe.py /path/to/audio.mp3 --language Chinese

Force CPU inference:

python scripts/transcribe.py /path/to/audio.flac --device cpu
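
When --device is left at auto, the script is described as trying the GPU in float16 first and falling back to the CPU in float32. The pattern can be sketched as follows; this is not the script's actual code, and `load_with_fallback` and the caller-supplied `load(device, dtype)` function are hypothetical names used only for illustration.

```python
from typing import Callable, Tuple, TypeVar

Model = TypeVar("Model")


def load_with_fallback(load: Callable[[str, str], Model]) -> Tuple[Model, str]:
    """Try GPU/float16 first; on a runtime or memory error, retry on CPU/float32.

    `load(device, dtype)` is a hypothetical loader supplied by the caller.
    Returns the loaded model together with the device that was used.
    """
    try:
        return load("cuda", "float16"), "cuda"
    except (RuntimeError, MemoryError):
        # PyTorch reports CUDA out-of-memory as a RuntimeError subclass,
        # so insufficient VRAM ends up in this branch.
        return load("cpu", "float32"), "cpu"
```

Passing the loader in as a function keeps the fallback logic independent of any particular model library.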

Output Format

On success, the script prints a single JSON object to stdout. Status messages go to stderr:

{"language": "Chinese", "text": "Transcribed text content"}

On error:

{"error": "Error description"}
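
A caller can tell the two shapes apart by checking for the error key. The sketch below assumes stdout holds exactly one JSON object and does not rely on the process exit code, since its behavior is not documented here. `parse_result`, `transcribe`, and `TranscriptionError` are hypothetical names, not part of the skill.

```python
import json
import subprocess
import sys


class TranscriptionError(RuntimeError):
    """Raised when the script reports an error or prints unusable output."""


def parse_result(stdout: str) -> dict:
    """Parse the script's stdout and return the success payload."""
    try:
        payload = json.loads(stdout)
    except json.JSONDecodeError as exc:
        raise TranscriptionError(f"stdout was not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise TranscriptionError("expected a JSON object on stdout")
    if "error" in payload:
        raise TranscriptionError(payload["error"])
    return payload


def transcribe(audio_path: str, script: str = "scripts/transcribe.py") -> dict:
    """Run the script on one file and return {"language": ..., "text": ...}."""
    proc = subprocess.run(
        [sys.executable, script, audio_path],
        capture_output=True,  # keeps stderr status messages out of the result
        text=True,
    )
    return parse_result(proc.stdout)
```

For example, `transcribe("/path/to/audio.wav")["text"]` would return the transcribed text, or raise `TranscriptionError` carrying the script's error description.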

Notes

  • The first run downloads the model (~4.7 GB) and caches it for subsequent runs
  • Auto mode: tries GPU (float16) first, falls back to CPU (float32) if VRAM is insufficient
  • Supports: WAV, MP3, FLAC, M4A, OGG and other common audio formats
  • 52 languages including Chinese, English, Japanese, Korean, French, German, etc.
  • 22 Chinese dialects supported
  • Long audio: Audio longer than 90s (the --max-chunk-sec default) is automatically split at silence boundaries using silero-vad, transcribed chunk by chunk, then concatenated. This avoids out-of-memory errors on GPUs with limited VRAM.
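
The long-audio note describes cutting only at silences while keeping each chunk under a duration limit. One way to do that, given speech timestamps of the kind a VAD such as silero-vad produces, is a greedy merge. This is a sketch of the idea and not the script's actual implementation; `group_segments` and the `Segment` alias are hypothetical names.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec) of one speech region


def group_segments(segments: List[Segment],
                   max_chunk_sec: float = 90.0) -> List[Segment]:
    """Greedily merge consecutive speech segments into chunks.

    A chunk is extended while its total span stays within max_chunk_sec,
    so cuts only ever fall in the silence between two segments. A single
    segment longer than the limit is kept whole rather than split.
    """
    chunks: List[Segment] = []
    for start, end in segments:
        if chunks and end - chunks[-1][0] <= max_chunk_sec:
            # Extending the current chunk keeps it within the limit.
            chunks[-1] = (chunks[-1][0], end)
        else:
            # Start a new chunk at this segment's start (a silence boundary).
            chunks.append((start, end))
    return chunks
```

Each resulting (start, end) pair can then be sliced out of the waveform and transcribed on its own before the texts are joined in order.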

Install via CLI

npx skills add https://github.com/brisyramshere/qwen-asr-skill --skill qwen-asr-skill