listen

Nested swiss-knife reference for local audio analysis — transcribe speech with Whisper, or extract musical features (tempo, key, dynamics, spectral profile) with librosa. Both run on the user's machine with no API key. Read this when the human asks you to transcribe a voice note, extract lyrics from singing, critique generated music, or analyze audio characteristics. For *creating* music or audio, use the sibling `minimax-cli` reference (or `dj` for journal-inspired music) instead.

By Lingtai-AI. Updated 6/2/2026

name: listen
description: >
  Nested swiss-knife reference for local audio analysis: transcribe speech with
  Whisper, or extract musical features (tempo, key, dynamics, spectral profile)
  with librosa. Both run on the user's machine with no API key. Read this when
  the human asks you to transcribe a voice note, extract lyrics from singing,
  critique generated music, or analyze audio characteristics. For creating music
  or audio, use the sibling minimax-cli reference (or dj for journal-inspired
  music) instead.
version: 1.0.0
tags: [audio, transcribe, whisper, librosa, music-analysis, nested-skill]

listen

Nested swiss-knife reference for local-only audio analysis. No API key, and no network once the dependencies and Whisper model weights have been downloaded on first use. Two actions: transcribe (speech → text) or appreciate (music → numerical critique).

Two Actions

| Action | Backend | When |
|---|---|---|
| transcribe | faster-whisper (local Whisper) | Spoken word, voice notes, podcasts, lectures. Works on singing too but lyrics may be inaccurate. |
| appreciate | librosa (signal processing) | Music: tempo, key, frequency bands, dynamics. Returns numerical measurements, not subjective descriptions. |

Both actions are wrappers around the bundled scripts. Run them with bash like any other command-line tool:

python3 <skill-path>/scripts/transcribe.py <audio-file>
python3 <skill-path>/scripts/appreciate.py <audio-file>

The scripts auto-install their dependencies via lingtai.venv_resolve.ensure_package on first run, so the first invocation may take ~30 s.
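
When the scripts are driven from Python rather than typed into a shell, a thin wrapper keeps the invocation in one place. The following is only a sketch under stated assumptions: `SKILL_PATH` is a hypothetical install location, `build_command` and `run_script` are names invented for this example, and it assumes each script prints nothing but the JSON document on stdout (any first-run installer output may need filtering).

```python
import json
import subprocess
import sys
from pathlib import Path

# Hypothetical install location; substitute the real <skill-path>.
SKILL_PATH = Path("/path/to/skills/listen")


def build_command(script: str, audio: str, **flags) -> list[str]:
    """Build the argv list for one bundled script.

    Keyword names map to CLI flags: compute_type="int8" -> --compute-type int8.
    """
    cmd = [sys.executable, str(SKILL_PATH / "scripts" / script), audio]
    for name, value in flags.items():
        cmd += ["--" + name.replace("_", "-"), str(value)]
    return cmd


def run_script(script: str, audio: str, **flags) -> dict:
    """Run a bundled script and parse the JSON it prints on stdout.

    Assumption: stdout holds only the JSON document.
    """
    result = subprocess.run(
        build_command(script, audio, **flags),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(result.stdout)
```

For example, `run_script("transcribe.py", "note.wav", model="small")` would return the transcript document as a dict, and `run_script("appreciate.py", "song.mp3")` the measurement document.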

transcribe — speech to text

python3 <skill-path>/scripts/transcribe.py <audio-path> [--model base] [--device cpu] [--compute-type int8]

| Flag | Default | Notes |
|---|---|---|
| --model | base | Whisper model size: tiny, base, small, medium, large-v2, large-v3. Larger = more accurate, slower, more RAM. |
| --device | cpu | Use cuda if you have a GPU. |
| --compute-type | int8 | CTranslate2 compute type. int8 is the fastest CPU mode. Use float16 on GPU. |

Output: a JSON document on stdout with:

{
  "text": "<full transcript>",
  "language": "en",
  "language_probability": 0.99,
  "duration": 42.3,
  "segments": [
    {"start": 0.0, "end": 4.2, "text": "..."},
    ...
  ]
}
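
As an illustration of consuming this document, the sketch below turns the `segments` array into timestamped lines that are easy to quote back to the human. The helper names `format_timestamp` and `timestamped_lines` are invented for this example, and the sample document is made-up data in the shape shown above.

```python
import json


def format_timestamp(seconds: float) -> str:
    """Render a time in seconds as MM:SS.s, rounding to tenths first."""
    tenths = round(seconds * 10)
    minutes, rem = divmod(tenths, 600)
    return f"{minutes:02d}:{rem / 10:04.1f}"


def timestamped_lines(doc: dict) -> list[str]:
    """One '[start - end] text' line per transcript segment."""
    return [
        f"[{format_timestamp(s['start'])} - {format_timestamp(s['end'])}] "
        f"{s['text'].strip()}"
        for s in doc["segments"]
    ]


# Made-up sample in the documented output shape.
sample = json.loads("""
{
  "text": "Hello there. Second segment.",
  "language": "en",
  "language_probability": 0.99,
  "duration": 83.5,
  "segments": [
    {"start": 0.0, "end": 4.2, "text": " Hello there."},
    {"start": 4.2, "end": 83.5, "text": " Second segment."}
  ]
}
""")

for line in timestamped_lines(sample):
    print(line)
```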

Best for: Clear spoken word in any of Whisper's supported languages.

Caveats: Sung lyrics are often mistranscribed because Whisper is trained on speech, not singing. Background music degrades accuracy. For very noisy input, try --model medium or large-v3.

appreciate — music analysis

python3 <skill-path>/scripts/appreciate.py <audio-path>

No flags — purely analytical. Output: a JSON document with:

| Field | Meaning |
|---|---|
| duration | Audio length in seconds |
| tempo_bpm | Estimated tempo |
| beat_regularity_std | Standard deviation of inter-beat intervals; small (<0.05) = steady, large = rubato/free |
| key | Estimated key (e.g. D minor, G major) |
| key_confidence | 0–1, correlation with Krumhansl key profile |
| chroma_profile | Per-pitch-class energy, useful for spotting modal mixture |
| spectral_centroid_hz | Brightness; higher = brighter mix |
| spectral_bandwidth_hz | Spread of spectrum |
| spectral_rolloff_hz | 85th-percentile frequency ("where the highs end") |
| zero_crossing_rate | Noisiness measure |
| dynamic_range_db | Loud-vs-quiet contrast in dB |
| frequency_bands_pct | Percentage of energy in sub_bass/bass/low_mid/mid/upper_mid/presence/brilliance |
| energy_contour | RMS energy in 10 equal-time segments (loud-vs-quiet shape over time) |
| onset_density_per_sec | Note onsets per second, a proxy for "busyness" |
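
To make `key` and `key_confidence` less of a black box, here is a minimal sketch of profile-correlation key finding. It is not taken from appreciate.py: it assumes a 12-value chroma vector ordered C through B, uses the published Krumhansl-Kessler probe-tone profiles, and reports the raw Pearson correlation (which can be negative, so the bundled script presumably rescales or clips it to reach 0–1). The names `pearson` and `estimate_key` are invented here.

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Krumhansl-Kessler probe-tone profiles, tonic at index 0.
MAJOR = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]
MINOR = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17]


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def estimate_key(chroma: list[float]) -> tuple[str, float]:
    """Return the best-matching key name and its correlation score."""
    best_name, best_r = "", -2.0
    for tonic in range(12):
        for mode, profile in (("major", MAJOR), ("minor", MINOR)):
            # Rotate the profile so its tonic weight lands on this pitch class.
            rotated = [profile[(i - tonic) % 12] for i in range(12)]
            r = pearson(chroma, rotated)
            if r > best_r:
                best_name, best_r = f"{PITCH_CLASSES[tonic]} {mode}", r
    return best_name, best_r
```

A low score means no single key profile fits well, which is one reason to treat `key` with caution when `key_confidence` is small.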

These are measurements, not opinions. Your job is to translate the numbers into a critique:

  • "tempo_bpm: 84, beat_regularity_std: 0.012" → "steady mid-tempo, ballad pacing".
  • "spectral_centroid_hz: 3500, presence: 22%" → "bright, vocal-forward mix".
  • "energy_contour: monotonically increasing" → "builds throughout".
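
The translations above can be sketched as a small rule table. Only the 0.05 steadiness threshold comes from this reference; the tempo cut-offs (70 and 120 BPM) and the 3000 Hz brightness cut-off are illustrative assumptions, and `describe` is a name invented for this example.

```python
def describe(m: dict) -> list[str]:
    """Turn a few appreciate.py measurements into short critique phrases."""
    notes = []

    bpm = m["tempo_bpm"]
    # Assumed tempo bands, for illustration only.
    pace = "slow" if bpm < 70 else "mid-tempo" if bpm < 120 else "fast"
    # 0.05 is the steadiness threshold given in the field table.
    feel = "steady" if m["beat_regularity_std"] < 0.05 else "loose"
    notes.append(f"{feel} {pace} pulse at {bpm:.0f} BPM")

    # Assumed brightness cut-off, for illustration only.
    notes.append(
        "bright mix" if m["spectral_centroid_hz"] >= 3000 else "dark or warm mix"
    )

    # Compare neighbouring energy_contour values to read the overall shape.
    contour = m["energy_contour"]
    steps = [b - a for a, b in zip(contour, contour[1:])]
    if steps and all(s > 0 for s in steps):
        notes.append("builds throughout")
    elif steps and all(s < 0 for s in steps):
        notes.append("fades throughout")
    else:
        notes.append("energy rises and falls")
    return notes
```

Rules like these are a starting point for wording, not a substitute for reading the numbers against the brief.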

Best for: Music. Useless for speech — gives spectral data with no semantic content.

When to use which

| Input | Action |
|---|---|
| Voice note, lecture, podcast | transcribe |
| Music with vocals, lyrics wanted | transcribe (warn human: lyrics may be wrong) |
| Music, to check whether it matches a brief | appreciate |
| Generated music from the sibling minimax-cli reference or dj, for QA | appreciate |
| TTS output from the sibling minimax-cli reference, to verify pronunciation | transcribe (round-trip QA) |
| Both (transcript + analysis) | Run both scripts |
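
For the round-trip QA row, one possible check is to compare the text that was sent to TTS with the `text` field that transcribe returns. The sketch below is an assumption about how such a comparison might look, not part of the bundled scripts: `normalize` and `word_similarity` are invented names, and the ASCII word pattern only suits English-like text.

```python
import difflib
import re


def normalize(text: str) -> list[str]:
    """Lowercase and split into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_similarity(expected: str, heard: str) -> float:
    """Similarity in [0, 1] between the intended script and the transcript."""
    matcher = difflib.SequenceMatcher(None, normalize(expected), normalize(heard))
    return matcher.ratio()


print(word_similarity("Hello, world!", "hello world"))
print(word_similarity("the quick brown fox", "the quick brown box"))
```

A score well below 1.0 flags a clip worth listening to; it cannot distinguish a TTS mispronunciation from a Whisper transcription error.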

Going Deeper

The bundled scripts are deliberately minimal. If you need:

  • Per-section analysis (verse vs chorus): find section boundaries with a librosa.segment routine (for example librosa.segment.agglomerative), cut the file at those boundaries, then run appreciate.py on each piece.
  • Multi-track separation: use demucs or spleeter (heavier deps — install on demand via pip).
  • Pitch tracking (melody extraction): use librosa.pyin or crepe.
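  • Lyrics alignment: pass --word-timestamps to get word-level timing on the Whisper segments. This flag is not listed in the flag table above; if the bundled script rejects it, call faster-whisper's transcribe(..., word_timestamps=True) directly.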
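
As a rough sketch of the per-section idea, the code below finds section boundaries from chroma features and returns (start, end) times in seconds. It is an illustration under assumptions: the function names are invented, four sections is an arbitrary default, and librosa is imported lazily so the pure helper `section_bounds` can be used without it.

```python
def section_bounds(boundaries_sec: list[float], duration: float) -> list[tuple[float, float]]:
    """Turn boundary times into (start, end) pairs covering the whole file."""
    edges = sorted(set([0.0, *boundaries_sec, duration]))
    return list(zip(edges[:-1], edges[1:]))


def find_sections(path: str, n_sections: int = 4) -> list[tuple[float, float]]:
    """Estimate section boundaries by clustering chroma frames in time order."""
    import librosa  # lazy import: only needed when audio is actually analyzed

    y, sr = librosa.load(path, sr=None, mono=True)
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr)
    # Agglomerative clustering returns the first frame index of each segment.
    frames = librosa.segment.agglomerative(chroma, n_sections)
    times = librosa.frames_to_time(frames, sr=sr)
    return section_bounds(times.tolist(), len(y) / sr)
```

Each (start, end) pair can then be cut into its own file (for example with soundfile or ffmpeg) and passed to appreciate.py.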

You can write your own scripts using the same dependencies — librosa and faster-whisper are already installed once the bundled scripts have run.
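
A custom script for word-level timing might look like the sketch below, which calls faster-whisper directly instead of going through transcribe.py. The function names are invented for this example; the model size, device and compute type mirror the defaults documented above, and the import is lazy so the helper `flatten_words` can be exercised without loading a model.

```python
def flatten_words(segments) -> list[tuple[float, float, str]]:
    """Collect (start, end, word) tuples from segments that carry a .words list."""
    return [
        (w.start, w.end, w.word.strip())
        for seg in segments
        for w in (seg.words or [])
    ]


def transcribe_words(path: str, model_size: str = "base") -> list[tuple[float, float, str]]:
    """Transcribe a file and return per-word timings."""
    from faster_whisper import WhisperModel  # lazy import: loads only when called

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # segments is a generator; transcription runs as it is consumed.
    segments, _info = model.transcribe(path, word_timestamps=True)
    return flatten_words(segments)
```

Word timings from Whisper are approximate, and more so on sung material, so treat them as a first pass for lyrics alignment rather than frame-accurate cues.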

When NOT to use this skill

  • Human asked you to create audio (music, speech, sound effect) — use the sibling minimax-cli reference; for journal-inspired music, use dj.
  • Human asked you to describe a video or image — use the sibling vision reference (../vision/SKILL.md).
  • You only need to play audio for the human — use an OS-native player; this reference only analyzes audio files.

Found a bug or issue? If you encounter any problems with this skill, load the lingtai-issue-report skill and follow its instructions to report it.

Install via CLI
npx skills add https://github.com/Lingtai-AI/lingtai --skill listen
Repository Details
Stars: 435
Forks: 40
Branch: main
Path: SKILL.md