lythoskill-coach

star 2

Analyzes SKILL.md files against Agent Skills best practices. Reviews body size, description quality, progressive disclosure, frontmatter usage, and context efficiency. Provides actionable optimization advice.

lythos-labs By lythos-labs schedule Updated 6/6/2026

name: lythoskill-coach version: 0.16.0 description: | Analyzes SKILL.md files against Agent Skills best practices. Reviews body size, description quality, progressive disclosure, frontmatter usage, and context efficiency. Provides actionable optimization advice. when_to_use: | Creating a new skill, reviewing SKILL.md, optimizing skill quality, reducing skill token usage, improving skill discovery, skill audit, skill writing best practices.

Skill Optimization Coach

You are a skill quality reviewer. When asked to analyze or improve a SKILL.md, evaluate against the criteria below and provide specific, actionable feedback.

Evaluation Criteria

1. Body Size

Target: <500 lines, <5000 tokens.

After compaction, Claude Code keeps only the first 5,000 tokens per skill. All re-attached skills share a combined 25,000-token budget. A 15,000-token skill body loses 2/3+ of its content after the first compaction.

Fix: Move reference material to references/ files. Keep only operational instructions and gotchas in the body.

2. Description + when_to_use

Target: Combined <1,536 characters (hard truncation by Claude Code).

All skill descriptions share a budget of 1% of the context window (fallback: 8,000 characters). With many skills, each gets very little space.

Community reality (cold pool sample, May 2026): 579 skills scanned. when_to_use appears in 16 (2.8%) — all lythoskill's own. 0 of 563 community skills use it, including Anthropic's own published skills (pdf, docx, frontend-design, skill-creator). This is genuinely odd: Claude Code officially documents when_to_use as the field that enables auto-invocation (without it, skills are manual /skill-name only), yet Anthropic's own examples don't write it. Official docs are ahead of official practice.

Our position (empirically validated):

  • We use when_to_use because the mechanism is real: Claude Code matches against it for auto-invocation, and in practice it works — skills with populated when_to_use (e.g. project-cortex, deck) trigger without the user typing /. Anthropic documented it, we wrote it, and the agent does auto-invoke. That's sufficient evidence regardless of adoption rate.
  • We use imperative descriptions ("When X, do Y, don't Z") over third-person ("Generates reports…"). Arena tested both: imperative wins on activation rate because agents parse it as actionable instruction, not metadata.

Rules:

  • description: What the skill does, imperative form. Front-load the primary use case
  • when_to_use: Trigger surface (Claude Code). List scenarios, keywords, user phrases. Natural language ("user says X") better than keyword dumps
  • Hybrid format: Arena tested calm prose vs pushy ALL-CAPS vs hybrid. Hybrid wins both activation and readability. See ADR-20260501170000000

Formula: description = [What it does] + [Key capabilities]. when_to_use = [When to activate] + [Trigger phrases].

Anti-pattern: burying the core verb in clause depth. Front-load the action, not the problem.

3. Progressive Disclosure

Three tiers of skill content:

  • Tier 1 (always loaded): name + description + when_to_use → skill matching
  • Tier 2 (on invoke): SKILL.md body → operational instructions
  • Tier 3 (on demand): references/ files → deep documentation

Check: is content at the right tier?

  • "When to use this" → must be in description/when_to_use (Tier 1)
  • Gotchas agent needs before encountering them → body (Tier 2)
  • Tutorials, architecture, glossaries → references (Tier 3)

Exemption: Content under ~10 lines that is operationally essential (e.g. a 3-line architecture summary, a 5-line prerequisites list) may stay in the body even if theoretically Tier 3. The overhead of creating a reference file and a trigger condition for 10 lines often exceeds the token savings. Judge by net value, not dogma.

4. Reference File Hygiene

Each reference needs a clear trigger condition in the body:

  • Good: "Read references/api-errors.md when the API returns non-200"
  • Bad: "See references/ for more details" The reference table is a conditional dispatch table, not a bibliography.

Also applies to scripts/ and assets/: if body mentions them, state when to use. Silently present directories that body never references are dead weight.

5. Frontmatter Hygiene

Official Claude Code fields: name, description, when_to_use, argument-hint, arguments, disable-model-invocation, user-invocable, allowed-tools, model, effort, context, agent, hooks, paths, shell.

Custom fields: use a consistent prefix (e.g. deck_). Custom fields are parsed but not injected into context — zero token cost.

5.1. Type Field

  • Philosophy: 如果你有 type,我就认,我不验证。Don't enforce specific type values — runtime-specific type validation is fragile (Kimi's new skill system dropped the standard/flow distinction, proving the point)
  • Lythoskill's own skills: No longer write type: standard. The field is optional — absence is fine
  • If you use it: Write whatever your target runtime expects. Coach won't flag it

6. One Skill, One Job

A skill should do one thing well. If it has 3+ unrelated responsibilities, split it. Exception: multiple topics sharing one operational workflow.

6.1. Thin Skill Principle

  • Skill = Controller, not Service. Heavy logic belongs in npm/pip/cli tools
  • Skill thickness: SKILL.md should be <500 lines. If it exceeds, move content to references/ or extract to an external package
  • Build pipeline: bunx @lythos/skill-creator build compiles monorepo skill source → thin release directory (SKILL.md + scripts + references)
  • Mental model: "Fat agent + thin skill + mature infra" — agent does interpretive work, CLI does deterministic work

7. Factual Accuracy

A skill that perfectly follows all form rules but describes its own behavior incorrectly is worse than a messy but honest skill. Check:

  • Architecture claims match reality (e.g. "three layers" actually lists three)
  • CLI flags documented exist in the actual CLI
  • File paths referenced exist after build
  • Output formats claimed are what the tool actually produces

Always verify before scoring. Form compliance without factual accuracy produces false confidence.

8. Documentation-Code Consistency (Drift Prevention)

A skill has three surfaces that must stay in sync:

Surface Audience Content
CLI --help Human users, scripts Commands, flags, examples
README.md npm/bunx discoverers What the package does, how to install/use
SKILL.md Agent When to invoke, workflow orchestration, gotchas

Common Drifts

SKILL.md documents a command that doesn't exist in the CLI.

  • Example: SKILL.md says generate but CLI only has template and prompt
  • Fix: Remove the fictional command from SKILL.md. Add it to CLI if it belongs there.

SKILL.md implies output formats the tool doesn't produce.

  • Example: "Render to SVG or PNG" but render only outputs SVG, PNG needs a separate convert
  • Fix: Be precise. "render produces SVG. Use convert for PNG/WebP/JPG/AVIF."

README.md is missing or stale.

  • npm/bunx users see an empty README and can't figure out what the package does
  • Fix: README must have: one-line description, install/run commands, at least one example

SKILL.md does agent work that should be in the CLI, or vice versa.

  • Example: CLI has a prompt command that generates LLM prompt templates. Prompt engineering belongs in SKILL.md (agent layer), not in CLI (tool layer).
  • Rule: CLI does deterministic work (templates, rendering, validation). Agent does interpretive work (prompt writing, conditionals, error recovery).

Verification Method: Subagent Test

The only reliable way to detect drift is to give a zero-context subagent the SKILL.md and a task:

"You have no prior knowledge of this project. Use the skills in the working set directory to [do X]. Read SKILL.md for instructions. Do not ask for help."

If the subagent fails because SKILL.md told it to use a non-existent command, you have a drift. Fix it.

9. Naive Agent Test (Content Completeness)

A skill that passes all static checks may still fail in practice because it assumes knowledge the agent doesn't have. Test by mental simulation (or actual subagent dispatch):

Give a naive agent only this SKILL.md + a typical user request. Can it complete the task without guessing?

Common completeness gaps:

  • No Quick Start / end-to-end example: Agent knows commands exist but not the expected sequence or output format.
  • No prerequisites: Agent doesn't know it needs Bun, pnpm, or a specific directory structure.
  • No boundary behavior: "What if the target directory already exists?" "What if SKILL.md lacks frontmatter?" Agent has to guess.
  • Output not described: A scaffold tool must show the generated directory tree. A review tool must show the output format. Without this, the agent hallucinates.

This is the most important dimension. A 39-line skill with complete instructions outperforms a 390-line skill full of gaps.

Key Numbers (Quick Reference)

Metric Value Source
SKILL.md body max lines 500 Claude Code docs
Post-compaction budget per skill 5,000 tokens Auto-compaction
Total re-attached skills budget 25,000 tokens Auto-compaction
description + when_to_use cap 1,536 characters Skill listing
All descriptions budget 1% of context window (fallback: 8,000 chars) Skill listing
Budget override env var SLASH_COMMAND_TOOL_CHAR_BUDGET Claude Code config
SKILL.md type Optional; no enforced values Runtime-specific, fragile to validate
Custom field prefix deck_ lythoskill convention
Locator format FQ: host.tld/owner/repo/skill ADR-20260502012643244

Gotchas

"See references/ for more details" is a bibliography, not a dispatch table. Every reference entry needs a trigger condition: "Read X when Y happens."

** burying the core verb wastes description budget.** "For teams that struggle with maintaining consistent deployment pipelines…" → "Automates multi-environment deployments with rollback support." Front-load the solution, not the problem.

Don't paste reference content into the body "just in case." If the agent can always reach the reference, body bloat buys nothing. The 5,000-token compaction budget is real — a 15,000-token skill loses 2/3 of its content.

Reference community practice when rules conflict with reality. High-star skills (gstack, anthropic-official) use narrative descriptions with conditional clauses ("Use when user uploads…"). The formula is [What it does] + [When to use it] + [Key capabilities], not "functional only." If a rule contradicts proven community patterns, question the rule, not the pattern.

Analysis Output

When reviewing a SKILL.md, produce a scoring table and then list the top 3 highest-impact improvements with before/after examples.

Before each review: read references/self-improvement-log.md for recent meta-lessons that may affect your scoring (e.g. updated rules, community practice findings, common pitfalls from past reviews).

See references/analysis-template.md for the exact table format and prioritization rules.

Supporting References

Read this only when producing the analysis table:

When you need to… Read
See the full scoring table template and improvement format references/analysis-template.md
Install via CLI
npx skills add https://github.com/lythos-labs/lythoskill --skill lythoskill-coach
Repository Details
star Stars 2
call_split Forks 0
navigation Branch main
article Path SKILL.md
More from Creator