podcast-quality-scorecard - SKILL.md Agent Skill

name: podcast-quality-scorecard
description: Evaluate completed podcast episodes across 10 quality dimensions. Diagnostic tool that produces detailed scorecards with evidence-based ratings, strengths, weaknesses, and workflow improvement recommendations.
user-invocable: false

Podcast Quality Scorecard Skill

Purpose: Evaluate completed podcast episodes across 10 quality dimensions to identify strengths and weaknesses. This is a diagnostic tool, not a before/after comparison: it is applied to every episode to understand that episode's unique profile.

When to use: After any episode is complete (audio generated and published). Can be run on baseline episodes, improved episodes, or experimental formats.

Output: Detailed quality scorecard saved to podcast/episodes/EPISODE_PATH/logs/quality_scorecard.md

Workflow

Phase 1: Gather Episode Materials

Read the following files from the episode directory:

Required:

content_plan.md - Episode structure and NotebookLM guidance
report.md - Research synthesis
EPISODE_SLUG_chapters.json - Chapter structure
Audio transcript (one of):
- EPISODE_SLUG_transcript.json (Whisper output, extract text field)
- transcript.txt (plain text)

Optional but valuable:

research/p3-briefing.md - Research organization
sources.md - Source validation
logs/metadata.md - Publishing metadata (if it exists)
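The transcript lookup above can be sketched as follows. This is a minimal illustration, not part of the skill itself: the function name load_transcript_text is hypothetical, and it assumes the Whisper JSON keeps the full transcript in a top-level "text" field, as the file list describes.

```python
import json
from pathlib import Path


def load_transcript_text(episode_dir: Path, episode_slug: str) -> str:
    """Return transcript text, preferring the Whisper JSON over transcript.txt."""
    whisper_path = episode_dir / f"{episode_slug}_transcript.json"
    if whisper_path.exists():
        data = json.loads(whisper_path.read_text(encoding="utf-8"))
        # Assumption: the full transcript sits in a top-level "text" field.
        return data["text"].strip()

    plain_path = episode_dir / "transcript.txt"
    if plain_path.exists():
        return plain_path.read_text(encoding="utf-8").strip()

    # Neither transcript variant is present, so the evaluation cannot proceed.
    raise FileNotFoundError(f"No transcript found in {episode_dir}")
```

Failing loudly when no transcript exists is a deliberate choice here: several dimensions depend on quoted evidence, so scoring without a transcript would produce unsupported ratings.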

Audio duration:

Extract from chapters JSON (last chapter startTime + estimated final chapter duration ~120s)
OR use ffmpeg -i EPISODE.mp3 2>&1 | grep Duration
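As a rough sketch of the chapters-based estimate, the snippet below adds the ~120s allowance to the last chapter's startTime. The function names are hypothetical, and the layout of the chapters file (a top-level "chapters" list whose entries carry startTime in seconds) is an assumption; adjust it to match the actual file.

```python
import json

FINAL_CHAPTER_ESTIMATE_S = 120  # allowance for the final chapter, per the guidance above


def estimate_duration_seconds(chapters_json: str) -> float:
    """Estimate episode duration as the last chapter's startTime plus ~120s."""
    data = json.loads(chapters_json)
    # Assumption: chapters live under a top-level "chapters" key; a bare list also works.
    chapters = data["chapters"] if isinstance(data, dict) else data
    last_start = max(float(ch["startTime"]) for ch in chapters)
    return last_start + FINAL_CHAPTER_ESTIMATE_S


def format_duration(seconds: float) -> str:
    """Format seconds as M:SS for the scorecard header."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes}:{secs:02d}"
```

Because the final chapter length is a guess, the ffmpeg reading is the better source whenever the audio file is available.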

Phase 2: Evaluate 10 Dimensions

For each dimension, provide:

Score (1-5) using the rating scale
Evidence from episode materials (quotes, examples, observations)
Why this score (not higher/lower)
Specific observations unique to this episode

Dimension 1: Structural Clarity (1-5)

What we're measuring: Can a listener follow the episode's structure and know where they are at any moment?

Rating Scale:

5 - Crystal Clear: Structure stated upfront, clear signposting at transitions, easy to summarize arc in one sentence
4 - Well Structured: Most transitions are clear, structure is followable, minor gaps
3 - Adequate: Structure exists but requires listener effort to discern, some unclear transitions
2 - Meandering: Structure is hard to follow, transitions feel random, listener may get lost
1 - Chaotic: No discernible structure, topics jump without warning

Evidence to gather:

Does the opening preview the structure?
Count signposting phrases: "we just covered X, now we're moving to Y"
Can you write a one-sentence arc summary?
Compare actual episode flow (chapters) to planned structure (content_plan.md)

Document:

One-sentence arc summary
Examples of signposting (or lack thereof)
Structural preview (if present)

Dimension 2: Depth Distribution (1-5)

What we're measuring: Do all major themes get proportional depth, or do some feel rushed/underdeveloped?

Rating Scale:

5 - Perfectly Balanced: All major themes get depth proportional to importance, no theme feels rushed or over-explored
4 - Well Balanced: Minor depth variations, but all themes adequately covered
3 - Uneven: One theme clearly gets more depth than equally important themes
2 - Imbalanced: Important theme feels like an add-on or afterthought, significant depth disparity
1 - Severely Skewed: Major theme mentioned briefly while minor themes dominate

Evidence to gather:

List all major themes from content_plan.md
Calculate time allocation per theme (from chapters)
Identify themes that got <15% of time when they deserved more
Note if depth differences are intentional (primary vs. secondary) or accidental

Document:

Theme analysis table with time allocations and percentages
Critical imbalances identified
Comparison to content plan intentions
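The time-allocation step can be sketched as below. This is illustrative only: theme_allocation and under_covered are hypothetical helper names, and the mapping from chapter titles to themes is something the evaluator supplies by hand after reading content_plan.md.

```python
from collections import defaultdict

UNDER_COVERED_THRESHOLD = 15.0  # percent, from the "<15% of time" check above


def theme_allocation(chapters, theme_of, total_seconds):
    """Return {theme: (seconds, percent)} computed from chapter start times.

    chapters: ordered list of dicts with "title" and "startTime" (seconds).
    theme_of: hand-built mapping from chapter title to theme (hypothetical).
    total_seconds: episode duration, used as the end of the last chapter.
    """
    seconds = defaultdict(float)
    for i, ch in enumerate(chapters):
        # A chapter ends where the next one starts; the last one ends with the episode.
        end = chapters[i + 1]["startTime"] if i + 1 < len(chapters) else total_seconds
        seconds[theme_of[ch["title"]]] += end - ch["startTime"]
    return {t: (s, round(100 * s / total_seconds, 1)) for t, s in seconds.items()}


def under_covered(allocation):
    """List themes whose share of the episode falls below the threshold."""
    return [t for t, (_, pct) in allocation.items() if pct < UNDER_COVERED_THRESHOLD]
```

A theme flagged by under_covered is only a candidate imbalance. Whether the low share was intentional (a secondary theme) or accidental still has to be judged against the content plan.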

Dimension 3: Mode-Switching Clarity (1-5)

What we're measuring: Are transitions between modes (philosophy, research, storytelling, practical, landing) intentional and smooth?

Rating Scale:

5 - Masterful: Modes are clearly defined, transitions feel purposeful, each mode serves its function
4 - Intentional: Modes are distinguishable, transitions mostly smooth, occasional blend
3 - Blended: Modes blend together, transitions not always clear, listener may not notice mode shifts
2 - Muddy: Modes blur together confusingly (philosophy mixed with practical advice, research mixed with opinion)
1 - Undefined: No clear modes, everything feels like one continuous stream

Evidence to gather:

Identify distinct philosophical, research, storytelling, practical, and landing moments
Count explicit mode transitions ("Let's look at what the research found...")
Note where modes blend without markers
Compare to content_plan.md mode intentions

Document:

Modes observed (yes/no for each, with quality rating)
Examples of clear vs. unclear transitions
The blend problem (if one exists)

Dimension 4: Dialogue Dynamics (1-5)

What we're measuring: Does the conversation feel like a genuine exchange with counterpoint, or just mutual agreement and reinforcement?

Rating Scale:

5 - Dynamic Exchange: Multiple counterpoint moments, respectful disagreement, "wait, but..." challenges, diverse perspectives
4 - Engaging: Some counterpoint, occasional push-back, mostly collaborative with texture
3 - Supportive Riff: Mostly agreement, speakers build on each other, limited divergence
2 - Echo Chamber: Pure reinforcement, no push-back, feels like presentation with two voices
1 - Monotone: Could be one person talking, no meaningful interaction

Evidence to gather:

Count counterpoint moments (one speaker challenges or diverges)
Look for "wait, but what about..." or "I see it differently because..." phrases
Identify agreement patterns: "Exactly," "Absolutely," "Precisely," "Correct"
Note missed opportunities for debate (controversial topics presented without tension)

Document:

Counterpoint moments counted (with examples)
Pattern analysis (call-and-response, pure agreement, etc.)
Missing opportunities for counterpoint
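A first pass at the agreement and counterpoint counts can be automated with a simple phrase search, sketched below. The marker lists are taken from the examples above and are not exhaustive; count_markers is a hypothetical helper name.

```python
import re

# Marker phrases drawn from the evidence list above; extend as needed.
AGREEMENT_MARKERS = ["exactly", "absolutely", "precisely", "correct"]
COUNTERPOINT_MARKERS = ["wait, but", "i see it differently"]


def count_markers(transcript, markers):
    """Count case-insensitive, whole-phrase occurrences of each marker."""
    counts = {}
    for marker in markers:
        # Word boundaries keep "correct" from matching inside "incorrect".
        pattern = r"\b" + re.escape(marker) + r"\b"
        counts[marker] = len(re.findall(pattern, transcript, flags=re.IGNORECASE))
    return counts
```

These counts only point to where to look. A "wait, but" can introduce a clarifying question rather than a real challenge, so each hit should still be read in context before it is recorded as a counterpoint moment.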

Dimension 5: Practical Actionability (1-5)

What we're measuring: Does the listener leave with clear, specific, actionable steps?

Rating Scale:

5 - Highly Actionable: 3+ specific tactics, frameworks, or steps a listener can implement immediately
4 - Actionable: 2 specific tactics, clear enough to act on with minimal additional research
3 - Moderately Actionable: 1 specific tactic, or general advice that needs clarification
2 - Vaguely Actionable: Concepts discussed but no clear "how to do this" guidance
1 - Purely Conceptual: Interesting ideas but zero implementation guidance

Evidence to gather:

Extract all specific tactics, frameworks, steps mentioned
Check for timeframes (not "over time" but "9-12 weeks")
Check for thresholds (not "small decisions" but "under $5,000")
Assess: Could a listener implement these tomorrow?

Document:

List of actionable takeaways (numbered, with specificity details)
Assessment of implementation readiness

Dimension 6: Takeaway Clarity (1-5)

What we're measuring: Can a listener articulate 1-3 core takeaways from the episode?

Rating Scale:

5 - Crystal Clear: 1-3 core takeaways explicitly stated, memorable, listener could repeat them
4 - Clear: Takeaways are identifiable with minimal effort, mostly explicit
3 - Inferrable: Listener needs to synthesize or infer takeaways, not explicitly stated
2 - Fuzzy: Hard to identify core takeaways, too many ideas competing for attention
1 - Unclear: No clear takeaways, episode explores but doesn't land on key points

Evidence to gather:

Check closing section for explicit takeaway synthesis
Count core points (are there 1-3, or 10+?)
Test: Could you answer "what was this episode about?" in 1-2 sentences?
Look for callback to opening hook

Document:

Core takeaways (1-3 numbered points)
Whether explicitly stated or inferred
Quality of closing synthesis

Dimension 7: Storytelling Quality (1-5)

What we're measuring: Are examples, case studies, and narratives used effectively to illustrate concepts?

Rating Scale:

5 - Compelling: Multiple memorable stories, well-integrated, emotionally resonant, illustrate key points perfectly
4 - Effective: 2+ stories, good integration, serve to illustrate concepts
3 - Adequate: 1 story, or multiple stories that are functional but not memorable
2 - Minimal: Stories feel tacked on or tangential, limited illustrative power
1 - Absent: No stories, pure abstract discussion

Evidence to gather:

Count stories, examples, case studies
Assess memorability (would a listener remember this story?)
Check integration (do stories illustrate key concepts or feel tangential?)
Note emotional resonance

Document:

Stories/examples identified (numbered list with effectiveness ratings)
Assessment of integration quality

Dimension 8: Episode Arc & Resolution (1-5)

What we're measuring: Does the episode build toward a satisfying resolution, or does it trail off?

Rating Scale:

5 - Satisfying Arc: Clear problem → exploration → resolution, builds momentum, strong ending that lands the point
4 - Good Arc: Identifiable build and resolution, ending feels intentional
3 - Adequate Arc: Some build-up, ending is present but doesn't fully land
2 - Weak Arc: Little build-up, ending feels like it trails off or runs out of steam
1 - No Arc: Flat throughout, no sense of build or resolution

Evidence to gather:

Identify: opening hook → problem definition → exploration → resolution
Check whether the episode builds momentum or meanders at a constant intensity
Assess the closing: a real conclusion, or "ran out of time"?
Look for callback to opening

Document:

Arc structure (opening, problem, exploration, resolution)
Assessment of momentum and build
Quality of resolution

Dimension 9: Packaging & Discoverability (1-5)

What we're measuring: Are episode metadata, descriptions, and resources useful for listeners?

Rating Scale:

5 - Excellent Packaging: Rich description with "What You'll Learn", timestamps, validated sources, clear CTA, useful show notes
4 - Strong Packaging: Description is informative, sources provided, show notes functional
3 - Adequate Packaging: Basic description, some sources, minimal show notes
2 - Weak Packaging: Generic description, few/no sources, poor show notes
1 - Minimal Packaging: Title and basic description only

Evidence to gather:

Read logs/metadata.md (if it exists)
Check for "What You'll Learn" bullets
Check for key timestamps
Assess source descriptions (just URLs vs. actionable descriptions)
Check for CTA (call-to-action)

Document:

Current state (present vs. missing elements)
Assessment of discoverability

Dimension 10: Companion Resource Value (1-5)

What we're measuring: Do companion resources (summary, checklist, diagrams) add value beyond the audio?

Rating Scale:

5 - Highly Valuable: Multiple resources (summary, checklist, framework diagram), professionally formatted, immediately useful
4 - Valuable: 1-2 resources, clear utility, good formatting
3 - Moderately Valuable: Resources exist but basic, limited additional value beyond audio
2 - Low Value: Resources feel auto-generated, not tailored, minimal utility
1 - Absent: No companion resources

Evidence to gather:

Check for: one-page summary, action checklist, framework diagrams, decision trees, landing page
Assess utility: would a listener actually use these, or are they just "nice to have"?
Check formatting quality

Document:

Resources present (checklist)
Assessment of value and utility

Phase 3: Generate Summary

Create summary section with:

Scores Table:

| Dimension | Score | Notes |
|---|---|---|
| 1. Structural Clarity | X / 5 | Brief note |
| 2. Depth Distribution | X / 5 | Theme list |
| 3. Mode-Switching Clarity | X / 5 | |
| ... | ... | ... |

Total: XX / 50 (XX%)

Strengths (scores 4-5):

List 3-5 top strengths with specific examples

Weaknesses (scores 1-2):

List 1-3 critical weaknesses with specific examples

Areas for Improvement (score 3):

List 2-3 moderate improvements needed

Workflow Improvements to Apply for Next Episode:

Prioritize 3-5 high-impact improvements for next episode
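The total and the three score bands above reduce to a few lines of arithmetic, sketched here. The summarize function and its output keys are hypothetical names chosen for illustration.

```python
def summarize(scores):
    """Total the 10 dimension scores and bucket them by the Phase 3 bands.

    scores: dict mapping dimension name to an integer score from 1 to 5.
    """
    if len(scores) != 10 or any(s not in range(1, 6) for s in scores.values()):
        raise ValueError("expected 10 integer scores between 1 and 5")
    total = sum(scores.values())
    return {
        "total": total,                       # out of 50
        "percent": round(100 * total / 50),
        "strengths": [d for d, s in scores.items() if s >= 4],   # scores 4-5
        "improve": [d for d, s in scores.items() if s == 3],     # score 3
        "weaknesses": [d for d, s in scores.items() if s <= 2],  # scores 1-2
    }
```

For example, ten scores that sum to 33 give "Total: 33 / 50 (66%)". The buckets only sort the dimensions; the specific examples required for each strength and weakness still come from the evidence gathered in Phase 2.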

Phase 4: Write Output File

Create podcast/episodes/EPISODE_PATH/logs/quality_scorecard.md with:

Header (episode title, date, evaluator, format, duration)
Scores table (summary)
Full 10-dimension evaluation (each dimension gets its own section with rating scale, evidence, assessment)
Summary (strengths, weaknesses, areas for improvement)
Workflow improvements (specific tasks from improvement plan)
Notes & observations (free-form insights, what worked, what needs work, ideas for next episode)

Quality Standards

Evidence-Based Evaluation

Quote from transcript to support claims about dialogue, signposting, etc.
Reference specific chapters when discussing structure or depth
Compare to content plan to assess execution vs. intention
Avoid vague assessments ("felt rushed") that lack evidence; quantify instead ("AI section: 90 seconds of a 32-minute episode, 4.7% of total time")

Actionable Feedback

Not: "Dialogue needs improvement"
Instead: "Zero counterpoint moments. Founder Mode debate (Ch 9) presented perfect opportunity: one speaker could defend delegation, other defend hands-on involvement. Instead, both agreed throughout."
Not: "Packaging could be better"
Instead: "Missing 'What You'll Learn' bullets. Current description doesn't entice. Add: 'Why the famous 70% rule has zero research backing' + 4 more bullets highlighting key frameworks."

Respectful Tone

This is diagnostic feedback for improvement, not criticism. Focus on:

Opportunities (not "failures")
Specific improvements (not vague "be better")
Strengths to leverage (not just weaknesses)

Example Usage

Invocation via Task Tool

Use the Task tool with subagent_type='general-purpose':

"Run the podcast-quality-scorecard skill on the episode at podcast/episodes/algorithms-for-life/ep3-how-to-delegate/.

Follow the workflow in .claude/skills/podcast-quality-scorecard/SKILL.md to:
1. Gather episode materials (content_plan.md, report.md, transcript, chapters)
2. Evaluate all 10 dimensions with evidence-based scoring
3. Generate summary with strengths, weaknesses, and workflow improvement recommendations
4. Write comprehensive scorecard to logs/quality_scorecard.md

Episode title: Algorithms for Life: Ep. 3, How to Delegate
Format: Standard workflow (baseline evaluation)"

Output Location

podcast/episodes/EPISODE_PATH/logs/quality_scorecard.md

Notes

Not a before/after comparison - Each episode gets its own diagnostic profile
Apply to every episode - Baseline, improved, and experimental formats all get scored
Aggregate over time - After 5-10 episodes, identify patterns (consistent strengths, persistent weaknesses)
Reference quality framework - Quality Waves 1-5 are documented in CLAUDE.md