name: listen
description: >
Nested swiss-knife reference for local audio analysis — transcribe speech
with Whisper, or extract musical features (tempo, key, dynamics, spectral
profile) with librosa. Both run on the user's machine with no API key. Read
this when the human asks you to transcribe a voice note, extract lyrics from
singing, critique generated music, or analyze audio characteristics. For
creating music or audio, use the sibling minimax-cli reference (or dj for journal-inspired music) instead.
version: 1.0.0
tags: [audio, transcribe, whisper, librosa, music-analysis, nested-skill]
listen
Nested swiss-knife reference for local-only audio analysis. No API key, no network. Two actions: transcribe (speech → text) or appreciate (music → numerical critique).
Two Actions
| Action | Backend | When |
|---|---|---|
| transcribe | faster-whisper (local Whisper) |
Spoken word, voice notes, podcasts, lectures. Works on singing too but lyrics may be inaccurate. |
| appreciate | librosa (signal processing) |
Music — tempo, key, frequency bands, dynamics. Returns numerical measurements, not subjective descriptions. |
Both actions are wrappers around the bundled scripts. Run them with bash like any other command-line tool:
python3 <skill-path>/scripts/transcribe.py <audio-file>
python3 <skill-path>/scripts/appreciate.py <audio-file>
The scripts auto-install their dependencies via lingtai.venv_resolve.ensure_package on first run, so the first invocation may take ~30 s.
transcribe — speech to text
python3 <skill-path>/scripts/transcribe.py <audio-path> [--model base] [--device cpu]
| Flag | Default | Notes |
|---|---|---|
--model |
base |
Whisper model size: tiny, base, small, medium, large-v2, large-v3. Larger = more accurate, slower, more RAM. |
--device |
cpu |
Use cuda if you have a GPU. |
--compute-type |
int8 |
CTranslate2 compute type. int8 is the fastest CPU mode. Use float16 on GPU. |
Output: a JSON document on stdout with:
{
"text": "<full transcript>",
"language": "en",
"language_probability": 0.99,
"duration": 42.3,
"segments": [
{"start": 0.0, "end": 4.2, "text": "..."},
...
]
}
Best for: Clear spoken word in any of Whisper's supported languages.
Caveats: Singing lyrics often mistranscribed — Whisper is trained on speech, not singing. Background music degrades accuracy. For very noisy input, try --model medium or large-v3.
appreciate — music analysis
python3 <skill-path>/scripts/appreciate.py <audio-path>
No flags — purely analytical. Output: a JSON document with:
| Field | Meaning |
|---|---|
duration |
Audio length in seconds |
tempo_bpm |
Estimated tempo |
beat_regularity_std |
Std-dev of inter-beat intervals — small (<0.05) = steady, large = rubato/free |
key |
Estimated key (e.g. D minor, G major) |
key_confidence |
0–1, correlation with Krumhansl key profile |
chroma_profile |
Per-pitch-class energy — useful for spotting modal mixture |
spectral_centroid_hz |
Brightness — higher = brighter mix |
spectral_bandwidth_hz |
Spread of spectrum |
spectral_rolloff_hz |
85th-percentile frequency — "where the highs end" |
zero_crossing_rate |
Noisiness measure |
dynamic_range_db |
Loud-vs-quiet contrast in dB |
frequency_bands_pct |
Percentage of energy in sub_bass/bass/low_mid/mid/upper_mid/presence/brilliance |
energy_contour |
RMS energy in 10 equal-time segments (loud-vs-quiet shape over time) |
onset_density_per_sec |
How many note-onsets per second — proxy for "busyness" |
These are measurements, not opinions. Your job is to translate the numbers into a critique:
- "tempo_bpm: 84, beat_regularity_std: 0.012" → "steady mid-tempo, ballad pacing".
- "spectral_centroid_hz: 3500, presence: 22%" → "bright, vocal-forward mix".
- "energy_contour: monotonically increasing" → "builds throughout".
Best for: Music. Useless for speech — gives spectral data with no semantic content.
When to use which
| Input | Action |
|---|---|
| Voice note, lecture, podcast | transcribe |
| Music with vocals — want lyrics | transcribe (warn human: lyrics may be wrong) |
| Music — want to know if it matches a brief | appreciate |
Generated music from the sibling minimax-cli reference or dj — QA |
appreciate |
TTS output from the sibling minimax-cli reference — verify pronunciation |
transcribe (round-trip QA) |
| Both (transcript + analysis) | Run both scripts |
Going Deeper
The bundled scripts are deliberately minimal. If you need:
- Per-section analysis (verse vs chorus): segment the file with
librosa.segmentfirst, then runappreciate.pyon each segment. - Multi-track separation: use
demucsorspleeter(heavier deps — install on demand viapip). - Pitch tracking (melody extraction): use
librosa.pyinorcrepe. - Lyrics alignment: the Whisper segments give you word-level timing if you pass
--word-timestamps.
You can write your own scripts using the same dependencies — librosa and faster-whisper are already installed once the bundled scripts have run.
When NOT to use this skill
- Human asked you to create audio (music, speech, sound effect) — use the sibling
minimax-clireference; for journal-inspired music, usedj. - Human asked you to describe a video or image — use the sibling
visionreference (../vision/SKILL.md). - You only need to play audio for the human — use an OS-native player; this reference only analyzes audio files.
Found a bug or issue? If you encounter any problems with this skill, load the
lingtai-issue-reportskill and follow its instructions to report it.