name: apex-skill-forge-v9-claude
description: Production skill engineering system that forges, audits, and packages agent skills with measured trigger accuracy, hard token budgets, and a marketplace-ready distribution pack (one-line install, before/after README, evidence scorecard). Use whenever the user wants to create, build, improve, audit, optimize, package, or publish a skill, including vague requests like "turn this workflow into a skill". Does not cover MCP server development or general prompt optimization.
license: Proprietary - APEX Business Systems Ltd.
APEX-SKILL-FORGE v9
Input: Any skill request — new build, upgrade of an existing skill, audit, or publish prep.
Output: A skill package that passes forge.py lint --strict, ships with scorecard.json, and includes a distribution pack (README with one-line install + before/after, MANIFEST.json, packed .skill archive).
Success: Lint passes strict, token budgets hold, trigger eval exists and is balanced, every numeric claim traces to the scorecard.
Fails when: Intent is still ambiguous after 2 questions, lint fails, or a claim has no scorecard field behind it.
Why this design
Skill marketplaces now index thousands of entries. Supply is commoditized; selection and trust are the bottlenecks. A skill wins on three things only: (1) it triggers exactly when it should, (2) it costs few tokens to carry and run, (3) its quality claims are verifiable. The forge mechanizes all three. Adjectives are not evidence — a scorecard is.
Pipeline
Run every forge job through six phases. Skip none; phases are cheap and each one prevents a specific failure class.
INTENT → SPEC → BUILD → MEASURE → PACK → PUBLISH
Which entry point?
├─ "Create/build a skill" → Phase 1 (INTENT)
├─ "Improve/upgrade/fix my skill" → Phase 3 (BUILD), diff against current
├─ "Audit/score this skill" → Phase 4 (MEASURE) only
└─ "Package/publish/list my skill" → Phase 5–6 (PACK, PUBLISH)
Phase 1 — INTENT
Extract from the request (and conversation history if the user says "turn this into a skill"): purpose, trigger contexts, output format, success check. If two or more of these are missing, ask at most 2 questions, then lock a contract:
Purpose: <one sentence>
Triggers: <3-6 concrete user contexts>
Output: <format + location>
Verify: <command or check that proves success>
Present the contract. User confirms → proceed. This prevents the most expensive failure: building the wrong skill correctly.
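A minimal sketch of the contract gate, assuming the four fields are held in a small dataclass. The class name `Contract` and its methods are hypothetical illustrations, not part of forge.py.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Contract:
    """Hypothetical container for the four Phase 1 contract fields."""
    purpose: str = ""
    triggers: List[str] = field(default_factory=list)
    output: str = ""
    verify: str = ""

    def missing(self) -> List[str]:
        """Return the names of the fields that are still empty."""
        gaps = []
        if not self.purpose.strip():
            gaps.append("purpose")
        if not self.triggers:
            gaps.append("triggers")
        if not self.output.strip():
            gaps.append("output")
        if not self.verify.strip():
            gaps.append("verify")
        return gaps

    def needs_questions(self) -> bool:
        """Two or more gaps means asking the user (at most 2 questions)."""
        return len(self.missing()) >= 2

    def is_locked(self) -> bool:
        """Lockable when nothing is missing and there are 3-6 triggers."""
        return not self.missing() and 3 <= len(self.triggers) <= 6
```

Keeping the gap check separate from the lock check mirrors the flow above: count what is missing first, ask only if needed, then present the contract for confirmation.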
Phase 2 — SPEC
Write the frontmatter description BEFORE the body. The description is the only signal the model has at selection time, so it determines whether the skill ever runs. Use the formula in references/description-engineering.md:
[What it does — 1 sentence, third person]
+ [Use when/whenever: concrete contexts, slightly pushy — models under-trigger]
+ [Does not cover: 1 near-miss exclusion]
Budget: ≤500 characters (spec hard limit is 1024). No hype lexicon — banned words are listed in references/token-economics.md and enforced by lint.
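The formula and budget can be checked mechanically. The sketch below is an illustration under assumptions, not the lint implementation: `check_description` is a hypothetical helper, and the two banned words are placeholders because the real lexicon lives in references/token-economics.md.

```python
import re

SOFT_LIMIT = 500   # forge budget stated above
HARD_LIMIT = 1024  # spec hard limit stated above

# Placeholder only: the real banned list is in references/token-economics.md.
HYPE_WORDS_PLACEHOLDER = {"revolutionary", "ultimate"}


def check_description(text, banned=HYPE_WORDS_PLACEHOLDER):
    """Return a list of problems found in a frontmatter description."""
    problems = []
    n = len(text)
    if n > HARD_LIMIT:
        problems.append(f"over hard limit: {n} > {HARD_LIMIT}")
    elif n > SOFT_LIMIT:
        problems.append(f"over budget: {n} > {SOFT_LIMIT}")
    lowered = text.lower()
    if "use when" not in lowered:  # also matches "Use whenever"
        problems.append("missing 'Use when/whenever' clause")
    if "does not cover" not in lowered:
        problems.append("missing 'Does not cover' exclusion")
    words = set(re.findall(r"[a-z]+", lowered))
    for word in sorted(words & set(banned)):
        problems.append(f"hype word: {word}")
    return problems
```

An empty list means the description satisfies the three-part formula and fits the budget; it says nothing about whether the description actually triggers well, which is what the trigger eval in Phase 4 measures.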
Phase 3 — BUILD
Scaffold, then fill slots instead of free-writing (cheaper to generate, harder to drift):
python scripts/forge.py init <skill-name> --dir <target>
Build rules, each with its reason:
- SKILL.md ≤200 lines / ≤2,500 estimated body tokens. Once loaded, every token competes with the user's conversation. Depth goes to references/, loaded only on need.
- Imperative voice; explain why, not just what. Models follow instructions better when the reason is stated; rigid all-caps walls are a yellow flag for overfitting.
- Decision trees at branch points, prose elsewhere. Trees remove variance where it matters; prose carries reasoning.
- Working code only. Every example must run as pasted. An untested example is a claim without evidence.
- Failure modes documented next to the operation that fails. A table of failures the author never saw is decoration; a check beside the risky step is protection.
- Extract to scripts/ anything the model would otherwise rewrite per run (multi-step transforms, syntax-fragile ops). Scripts execute without loading into context, so their token cost is near zero.
- References >100 lines start with ## Contents. The model reads the ToC, then jumps; it never pays for the whole file.
- Portability: write the body against the open Agent Skills format (folder + SKILL.md). Isolate vendor-specific tool calls in one labeled section; runtimes without code execution apply every gate manually via references/universal-protocol.md: one source, every agent.
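The ToC rule for long references is easy to check by hand or in code. A minimal sketch, assuming "start with" means the first non-blank line is exactly `## Contents`; the function name is hypothetical and the real lint may interpret the rule differently.

```python
def needs_contents_heading(text, threshold=100):
    """True when a reference file exceeds the threshold but lacks a leading ToC.

    Assumption: the first non-blank line must be exactly '## Contents'.
    """
    lines = text.splitlines()
    if len(lines) <= threshold:
        return False
    first = next((ln.strip() for ln in lines if ln.strip()), "")
    return first != "## Contents"
```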
Phase 4 — MEASURE
No skill ships on assertion. Generate evidence:
python scripts/forge.py lint <skill-dir> # structure, budgets, hype, claims
python scripts/forge.py tokens <skill-dir> # per-tier token ledger (estimated)
python scripts/forge.py triggers <skill-dir> # validates evals/trigger-eval.json
python scripts/forge.py score <skill-dir> # writes scorecard.json from the above
python scripts/forge.py rubric <skill-dir> # 20 binary checks x 5 pts; 100 = publish-ready
The rubric is twenty machine-checkable gates — its score certifies structure and evidence presence, never market outcomes. Run it after score, before pack.
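The rubric arithmetic can be sketched as follows. The check names are hypothetical, and the non-zero exit value of 1 is an assumption; the source only states that exit 0 requires a score of 100.

```python
def rubric_score(results):
    """Score 20 binary checks at 5 points each (maximum 100)."""
    if len(results) != 20:
        raise ValueError(f"expected 20 checks, got {len(results)}")
    return 5 * sum(1 for passed in results.values() if passed)


def exit_code(score):
    """Exit 0 only at a perfect score; any other score is a failure (assumed 1)."""
    return 0 if score == 100 else 1
```

Because every check is binary and equally weighted, a score of 95 always means exactly one failed gate, which keeps the number easy to act on.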
Trigger eval: write ≥8 should-trigger and ≥8 should-not-trigger queries into evals/trigger-eval.json. Make them realistic (typos, file paths, casual phrasing) and make the negatives near-misses — adjacent tasks that share keywords but need a different tool. Obvious negatives test nothing. Schema and worked examples: references/description-engineering.md. Where the Claude Code CLI is available, run the trigger set live and record the measured rate into the scorecard; in environments without it, the validated set ships with the skill so any consumer can reproduce the measurement.
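A sketch of the minimum-count check, under a loudly stated assumption: the real schema is defined in references/description-engineering.md and is not shown here, so the `query` and `should_trigger` keys below are hypothetical stand-ins.

```python
import json


def validate_trigger_eval(raw, minimum=8):
    """Return problems found in a trigger eval given as a JSON string.

    Assumed (hypothetical) schema: a list of objects with a 'query' string
    and a 'should_trigger' boolean.
    """
    cases = json.loads(raw)
    positives = [c for c in cases if c["should_trigger"]]
    negatives = [c for c in cases if not c["should_trigger"]]
    problems = []
    if len(positives) < minimum:
        problems.append(f"only {len(positives)} should-trigger queries")
    if len(negatives) < minimum:
        problems.append(f"only {len(negatives)} should-not-trigger queries")
    queries = [c["query"].strip().lower() for c in cases]
    if len(set(queries)) != len(queries):
        problems.append("duplicate queries")
    return problems
```

A count check cannot tell whether the negatives are genuine near-misses; that part still needs a human read of the queries.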
Task evals (for skills with verifiable outputs): 2–3 realistic prompts with objective assertions, run with-skill vs without-skill, grade per references/evidence-loop.md. Skip assertions for purely subjective skills; use human review instead.
Claim policy (hard rule): a number, multiplier, or superlative may appear in SKILL.md or README only if scorecard.json contains the field that produced it. Lint enforces this. The reason is strategic, not cosmetic: in a curated market, unverifiable claims read as spam; measured ones read as engineering.
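One naive way to approximate the claim policy is to look for numbers in the text that match no numeric value in the scorecard. This is a sketch under assumptions, not the lint rule itself: the function name and the scorecard field names in the usage note are hypothetical, and the real check may map claims to fields differently.

```python
import re


def untraced_numbers(text, scorecard):
    """Return numeric tokens in text that match no numeric scorecard value."""
    known = {
        float(v)
        for v in scorecard.values()
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    }
    found = re.findall(r"\d+(?:\.\d+)?", text)
    return [tok for tok in found if float(tok) not in known]
```

With a hypothetical scorecard `{"trigger_accuracy_pct": 94}`, the sentence "Triggers on 94% of queries and saves 3x tokens" yields `["3"]`: the multiplier has no field behind it. A matcher this simple also flags harmless numbers such as version strings, so treat its output as a review list rather than a verdict.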
Phase 5 — PACK
python scripts/forge.py pack <skill-dir>
Runs lint in strict mode (when publishing, warnings become failures), then emits dist/<name>-<version>.skill (a zip archive) plus its SHA-256 checksum. The checksum is the trust anchor: marketplaces tell users to install only from trusted sources, so provenance (source repo + hash) is part of the product.
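The archive-plus-checksum step can be reproduced with the standard library alone. A minimal sketch, assuming a plain zip of the skill folder; `pack_skill` is a hypothetical helper and does not include the strict lint that forge.py runs first.

```python
import hashlib
import zipfile
from pathlib import Path


def pack_skill(skill_dir, out_file):
    """Zip skill_dir into out_file and return the archive's SHA-256 hex digest."""
    skill_dir, out_file = Path(skill_dir), Path(out_file)
    out_file.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(out_file, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(skill_dir.rglob("*")):
            # Skip directories and the archive itself if it sits inside skill_dir.
            if not path.is_file() or path.resolve() == out_file.resolve():
                continue
            # Store entries under the skill folder name, e.g. my-skill/SKILL.md.
            zf.write(path, path.relative_to(skill_dir.parent).as_posix())
    return hashlib.sha256(out_file.read_bytes()).hexdigest()
```

A consumer can verify a download the same way: hash the received file and compare the digest with the one published next to the source repo.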
Phase 6 — PUBLISH
Every published skill carries the distribution pack — this is what converts a listing into installs:
- One-line install at the top of README (all three channels):
  - npx skills add <repo-url> --skill <name> (skills.sh ecosystem)
  - /plugin marketplace add <org>/<marketplace-repo> (Claude Code)
  - Upload dist/<name>-<version>.skill in claude.ai → Settings → Capabilities (zip path)
- Before/After section showing one real workflow: the task, the unaided result, the with-skill result, with an artifact or measured numbers.
- Runtime footer linking the platform the skill is built to run against (for APEX skills: APEX-OmniHub).
- Listing targets + checklist: references/distribution.md.
Token budgets (enforced by lint)
| Tier | Loaded | Budget | Why |
|---|---|---|---|
| name + description | Always, every conversation | desc ≤500 chars | This cost recurs across every session for every installed skill |
| SKILL.md body | On trigger | ≤200 lines, ≤2,500 est. tokens | Competes with conversation context once loaded |
| references/ | On demand | ToC if >100 lines | Pay only for the section needed |
| scripts/ | Executed, not loaded | unbounded | Runs outside context entirely |
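The body budget can be estimated with the chars/4 heuristic that the tokens command uses. A minimal sketch; the function names are hypothetical, and rounding down is an assumption about how the estimate is taken.

```python
def estimate_tokens(text):
    """Estimate tokens as characters divided by 4 (rounding down is assumed)."""
    return len(text) // 4


def body_within_budget(body, max_lines=200, max_tokens=2500):
    """True when a SKILL.md body fits both the line and the token budget."""
    return (
        len(body.splitlines()) <= max_lines
        and estimate_tokens(body) <= max_tokens
    )
```

Both limits apply at once: a short file with very long lines can pass the line count and still fail on tokens.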
Iron rules
| # | Rule | Why |
|---|---|---|
| 1 | Contract before code | Wrong-target builds are the costliest failure |
| 2 | Description before body | Selection failure makes everything else moot |
| 3 | Evidence before claims | Unverifiable numbers are spam in a curated market |
| 4 | Budgets are hard limits | Context is the scarcest shared resource |
| 5 | Scripts for the fragile | Deterministic ops don't belong in sampled text |
| 6 | Ship the scorecard | The measurement is the differentiator |
| 7 | Never invent paths, schemas, or results | Output UNCERTAIN: <gap> instead |
Command reference
python scripts/forge.py init <name> --dir <path> # scaffold from templates/
python scripts/forge.py lint <dir> [--strict] # validate; strict for publish
python scripts/forge.py tokens <dir> # token ledger (estimated, chars/4)
python scripts/forge.py triggers <dir> # trigger-eval validation
python scripts/forge.py score <dir> # write scorecard.json
python scripts/forge.py rubric <dir> # 20 binary checks; exit 0 only at 100
python scripts/forge.py pack <dir> [--label x] # strict lint → .skill zip + sha256
References
- references/description-engineering.md: description formula, trigger eval design
- references/token-economics.md: budgets, hype lexicon, measurement method
- references/evidence-loop.md: scorecard schema, task evals, claim policy
- references/distribution.md: marketplace matrix, install commands, listing checklist
- references/universal-protocol.md: manual gates + rubric for runtimes without code execution