qwen-asr-skill - SKILL.md Agent Skill

name: qwen-asr-skill description: Local speech-to-text transcription using Qwen3-ASR-1.7B. Use this skill when a user provides an audio file path and needs speech recognition, transcription, or ASR. Supports 52 languages and dialects, and common audio formats including WAV, MP3, FLAC, M4A, and OGG.

Transcribe audio files to text locally using the Qwen3-ASR-1.7B model.

Install dependencies into a Python 3.10+ virtual environment:

python -m venv .venv
source .venv/bin/activate
pip install qwen-asr soundfile silero-vad

For GPUs with compute capability < 7.0 (e.g. GTX 1060), install PyTorch 2.4.x with CUDA 11.8:

pip install torch==2.4.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu118

Run the transcription script with the path to an audio file:

python scripts/transcribe.py <audio_path>

Parameter	Description	Default
`audio_path`	Absolute path to audio file (required)	-
`--language`	Force language (e.g. Chinese, English). Auto-detect if omitted	Auto-detect
`--device`	Inference device: auto / cuda / cpu	auto
`--model-path`	Model path or HuggingFace ID	~/models/Qwen3-ASR-1.7B
`--max-chunk-sec`	Max chunk duration for VAD splitting. Long audio is split at silence boundaries	90
`--max-new-tokens`	Max tokens to generate. Increase for long audio	2048

Basic transcription:

python scripts/transcribe.py /path/to/audio.wav

Force language:

python scripts/transcribe.py /path/to/audio.mp3 --language Chinese

Force CPU inference:

python scripts/transcribe.py /path/to/audio.flac --device cpu

The script outputs JSON to stdout and status info to stderr:

{"language": "Chinese", "text": "Transcribed text content"}

On error:

{"error": "Error description"}

First run downloads the model (~4.7GB), cached for subsequent runs
Auto mode: tries GPU (float16) first, falls back to CPU (float32) if VRAM is insufficient
Supports: WAV, MP3, FLAC, M4A, OGG and other common audio formats
52 languages including Chinese, English, Japanese, Korean, French, German, etc.
22 Chinese dialects supported
Long audio: Audio longer than 90s is automatically split at silence boundaries using silero-vad, transcribed chunk by chunk, then concatenated. This prevents OOM on limited VRAM GPUs.