plugin-dev-workflow - SKILL.md Agent Skill

name: plugin-dev-workflow
description: "Guide plugin development workflow: editing skills, agents, hooks, or eval framework in this repo. Use when modifying files in plugins/elixir-phoenix/, lab/eval/, or lab/autoresearch/. Ensures changes pass eval, lint, and tests before committing."
effort: medium

Plugin Development Workflow

This repo is the Elixir/Phoenix Claude Code plugin. When editing plugin files, follow this workflow so that every change passes eval, lint, and tests before you commit.

Before You Start

Run make help to see all available commands. The ones used most often in this workflow:

make eval          # Quick: lint + score changed skills/agents
make eval-all      # Full: all 40 skills + 20 agents
make eval-fix      # Auto-fix + show failures
make test          # 52 pytest tests for eval framework
make ci            # Full CI pipeline

Scoring Individual Files (CLI)

IMPORTANT: Always use -m module syntax (python3 -m lab.eval.scorer). Never run scorer.py directly.

# Score ONE skill (use -m, NOT direct file path)
python3 -m lab.eval.scorer plugins/elixir-phoenix/skills/verify/SKILL.md

# Score ONE skill with pretty output
python3 -m lab.eval.scorer plugins/elixir-phoenix/skills/verify/SKILL.md --pretty

# Score all skills
python3 -m lab.eval.scorer --all

# Score ONE agent
python3 -m lab.eval.agent_scorer plugins/elixir-phoenix/agents/verification-runner.md

# Score all agents
python3 -m lab.eval.agent_scorer --all

When Editing Skills (plugins/elixir-phoenix/skills/*/SKILL.md)

Read CLAUDE.md conventions (size limits, frontmatter requirements)
Make your changes
Run make eval — it auto-detects changed skills and scores them
If a skill scores FAIL, check which dimension failed and fix it
Run make lint to verify markdown formatting
Commit

Skill requirements (eval checks all of these):

Frontmatter: name, description, effort. Description must start with action verb + include "Use when..."
Iron Laws section with 1+ numbered items
Under 185 lines (command skills) or 150 lines (reference skills)
No section exceeds 45 lines
All /phx: references point to existing skills
All references/*.md paths exist
No dangerous code patterns outside Iron Laws sections
Code examples present (1+ fenced code blocks)
"Use when..." in description (for trigger accuracy)

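A few of the skill requirements above are purely structural, so they can be illustrated with a short sketch. This is not the repo's scorer: `check_skill`, its `max_lines` parameter, and the assumptions that frontmatter is delimited by `---` lines and that sections use `#` headings are all hypothetical. The authoritative result always comes from python3 -m lab.eval.scorer.

```python
import re

# Hypothetical helper for illustration only, NOT the repo's scorer.
# Assumes '---'-delimited frontmatter and '#'-style markdown headings.
REQUIRED_KEYS = ("name", "description", "effort")
FENCE = "`" * 3  # code-fence marker, built this way to keep this block tidy


def check_skill(text: str, max_lines: int = 185) -> list[str]:
    """Return a list of problems; an empty list means these checks pass."""
    problems = []
    lines = text.splitlines()

    # Parse simple 'key: value' frontmatter between the first two '---' lines.
    frontmatter = {}
    if lines and lines[0].strip() == "---":
        for line in lines[1:]:
            if line.strip() == "---":
                break
            key, sep, value = line.partition(":")
            if sep:
                frontmatter[key.strip()] = value.strip().strip('"')

    for key in REQUIRED_KEYS:
        if key not in frontmatter:
            problems.append(f"missing frontmatter key: {key}")
    if "Use when" not in frontmatter.get("description", ""):
        problems.append('description lacks "Use when..."')

    if len(lines) >= max_lines:
        problems.append(f"{len(lines)} lines, must be under {max_lines}")

    # One pass: count fence lines and numbered items under "Iron Laws".
    in_fence = False
    fence_lines = 0
    in_iron_laws = False
    numbered_laws = 0
    for line in lines:
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            fence_lines += 1
        elif in_fence:
            continue  # ignore '#' comments and numbers inside code blocks
        elif line.startswith("#"):
            in_iron_laws = "Iron Laws" in line
        elif in_iron_laws and re.match(r"\s*\d+\.\s", line):
            numbered_laws += 1

    if fence_lines // 2 < 1:
        problems.append("no fenced code block found")
    if numbered_laws < 1:
        problems.append("Iron Laws section needs 1+ numbered items")
    return problems
```

Pass `max_lines=150` for a reference skill. The sketch skips the checks that need repo context, such as /phx: references, references/*.md paths, and dangerous code patterns.
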
When Editing Agents (plugins/elixir-phoenix/agents/*.md)

Make your changes
Run make eval-agents to score all agents

Agent requirements:

permissionMode: bypassPermissions (always; background agents need it)
disallowedTools: Write, Edit, NotebookEdit for review/analysis agents
model matches effort: haiku=low, sonnet=medium, opus=high
Under 300 lines (specialist) or 535 lines (orchestrator)
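
The agent requirements can be read as a small set of rules over the agent's frontmatter and length. The sketch below encodes them under stated assumptions: `check_agent`, the `meta` dict shape, the `kind` and `read_only` parameters, and the presence of an `effort` field alongside `model` are hypothetical and may not match what lab.eval.agent_scorer actually inspects.

```python
# Hypothetical sketch of the agent rules listed above; not the repo's agent scorer.
EFFORT_FOR_MODEL = {"haiku": "low", "sonnet": "medium", "opus": "high"}
MAX_LINES = {"specialist": 300, "orchestrator": 535}
READ_ONLY_DISALLOWED = {"Write", "Edit", "NotebookEdit"}


def check_agent(
    meta: dict,
    line_count: int,
    kind: str = "specialist",
    read_only: bool = False,
) -> list[str]:
    """Return a list of rule violations for one agent file."""
    problems = []

    if meta.get("permissionMode") != "bypassPermissions":
        problems.append("permissionMode must be bypassPermissions")

    model, effort = meta.get("model"), meta.get("effort")
    if model not in EFFORT_FOR_MODEL or EFFORT_FOR_MODEL[model] != effort:
        problems.append(f"model {model!r} does not match effort {effort!r}")

    if line_count >= MAX_LINES[kind]:
        problems.append(f"{line_count} lines, must be under {MAX_LINES[kind]}")

    # Review/analysis agents must not be able to modify files.
    if read_only:
        raw = meta.get("disallowedTools", "")
        disallowed = {tool.strip() for tool in raw.split(",") if tool.strip()}
        missing = READ_ONLY_DISALLOWED - disallowed
        if missing:
            problems.append("disallowedTools missing: " + ", ".join(sorted(missing)))

    return problems
```

For example, a review agent declared with model opus but effort medium would be reported as a mismatch, while the same agent with effort high would pass that rule.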

When Editing Eval Framework (lab/eval/*.py)

Make your changes
Run make test — 52 pytest tests must pass
Run make eval-all — verify no skills/agents regressed
If adding new matchers: add tests in lab/eval/tests/test_matchers.py
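
As a rough picture of what a matcher plus its tests might look like, here is a self-contained sketch. The matcher name `missing_phx_refs`, its signature, and the skill name `plan` in the test data are hypothetical; the real matcher interface in lab/eval/ may differ, so treat this only as the general shape of a test you would add to test_matchers.py.

```python
import re

# Hypothetical matcher for illustration; real matchers in lab/eval/ may look different.
PHX_REF = re.compile(r"/phx:([a-z0-9][a-z0-9-]*)")


def missing_phx_refs(text: str, existing_skills: set[str]) -> list[str]:
    """Return referenced /phx: skill names that are not in existing_skills."""
    found = PHX_REF.findall(text)
    return sorted({name for name in found if name not in existing_skills})


# pytest collects functions named test_*; plain assert statements are enough.
def test_all_refs_exist():
    text = "Run /phx:verify then /phx:plan."
    assert missing_phx_refs(text, {"verify", "plan"}) == []


def test_reports_unknown_ref_once():
    text = "See /phx:verify and /phx:nope, then /phx:nope again."
    assert missing_phx_refs(text, {"verify"}) == ["nope"]
```

Covering both a passing input and a failing input keeps a new matcher from silently accepting everything.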

When Editing Hooks (plugins/elixir-phoenix/hooks/scripts/*.sh)

Make your changes
Run make lint (markdown in hook comments)
Test the hook manually (hooks run on Edit/Write/Bash events)
Check CLAUDE.md hook documentation is still accurate

Autoresearch (Self-Improvement Loop)

If make eval-fix shows failures, it suggests an autoresearch command:

# Copy-paste the suggested command from eval-fix output
claude -p 'Run autoresearch. Score all skills...' --allowedTools 'Edit,Read,Write,Bash,Glob,Grep'

This runs the autoresearch loop: find weakest skill → fix ONE issue → re-score → keep/revert.
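
One iteration of that loop can be sketched as follows. The function `autoresearch_step` and its callback parameters are hypothetical, and the keep rule used here (keep only when the score strictly improves) is an assumption; the actual logic lives in lab/autoresearch/.

```python
from typing import Callable


def autoresearch_step(
    scores: dict[str, float],
    fix_one_issue: Callable[[str], None],
    rescore: Callable[[str], float],
    revert: Callable[[str], None],
) -> tuple[str, bool]:
    """Run one iteration; return (skill name, whether the change was kept)."""
    # 1. Find the weakest skill.
    weakest = min(scores, key=scores.get)
    before = scores[weakest]

    # 2. Fix ONE issue, then 3. re-score.
    fix_one_issue(weakest)
    after = rescore(weakest)

    # 4. Keep the change only if the score improved (assumed rule), else revert.
    if after > before:
        scores[weakest] = after
        return weakest, True
    revert(weakest)
    return weakest, False
```

Limiting each iteration to a single fix makes it clear which change caused a score to move, which is what makes the keep/revert decision meaningful.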

Pre-Commit Checklist

Before committing any plugin changes:

make lint passes
make eval passes (changed files)
make test passes (if eval framework changed)
CHANGELOG.md updated (if user-visible change)
Version bumped in plugin.json (if releasing)
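
The command part of this checklist could be automated with a small wrapper like the one below. It is a sketch under stated assumptions: the helper names `required_make_targets` and `run_checks` are hypothetical, and detecting an eval framework change by the lab/eval/ path prefix is a simplification. The CHANGELOG.md and plugin.json items still need a manual check.

```python
import subprocess


def required_make_targets(changed_paths: list[str]) -> list[str]:
    """Pick the make targets the checklist requires for the changed files."""
    targets = ["lint", "eval"]
    # make test is only required when the eval framework itself changed.
    if any(path.startswith("lab/eval/") for path in changed_paths):
        targets.append("test")
    return targets


def run_checks(changed_paths: list[str], run=subprocess.run) -> bool:
    """Run each required target in order; stop at the first failure."""
    for target in required_make_targets(changed_paths):
        if run(["make", target]).returncode != 0:
            return False
    return True
```

Calling `run_checks(["lab/eval/scorer.py"])` from the repo root would run make lint, make eval, and make test in that order and return False as soon as one of them exits non-zero.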

References

CLAUDE.md — full conventions, size limits, checklist
lab/eval/ — scoring framework (24 matchers, 8 dimensions)
lab/autoresearch/ — self-improvement loop
lab/findings/interesting.jsonl — log interesting discoveries here