ywc-onboard-repo

name: ywc-onboard-repo description: >- (ywc) Use when entering an existing / unfamiliar repository for the first time, generating a starter CLAUDE.md from detected conventions, or producing an architecture / convention briefing for a new joiner. The output is a printed Onboarding Guide plus a Starter CLAUDE.md written to the repo root (enhancing the existing one in place if present). Triggers: "onboard me", "이 repo 처음이야", "이 codebase 를 이해하게 해줘", "generate CLAUDE.md", "walk me through this repo", "리포 분석해줘", "コードベースを案内して", "onboarding 가이드", "ywc-onboard-repo". Do not use for creating a brand-new repository from scratch (use ywc-project-scaffold — that is the inverse direction), for creating a CodeTour `.tour` JSON walkthrough artifact (out of scope — this skill emits Markdown not .tour), for ad-hoc "explain this one file" requests (answer directly), or for incremental codemap refreshes on a repo you already understand (this skill is the cold-start; refreshes belong to a separate hygiene pass). category: onboarding phase: discovery requires: []

Announce at start: "I'm using the ywc-onboard-repo skill to reconnoiter the repository in 4 phases and emit an Onboarding Guide + Starter CLAUDE.md."

This skill is the canonical cold-start procedure when entering a repository for the first time. It exists because the default agent behavior — reading the README, then guessing — produces a CLAUDE.md filled with framework defaults rather than the conventions this codebase actually follows. Adapted from ECC/codebase-onboarding, tightened to delegate dead-code cleanup to ywc-refactor-clean and to forbid Read-every-file expansion (Glob + Grep reconnaissance only).

The Iron Law

RECONNAISSANCE WITH GLOB/GREP FIRST — NEVER READ EVERY FILE
CONVENTION DETECTED FROM CODE WINS OVER CONVENTION DETECTED FROM CONFIG
EXISTING CLAUDE.md IS ENHANCED IN PLACE, NEVER OVERWRITTEN

The 4-phase workflow (Reconnaissance → Architecture → Conventions → Generate) is sequential — each phase's output is the input to the next. Skipping reconnaissance to "just read the source" defeats the entire purpose: the agent ends up with intuition about a handful of files instead of structural insight about the whole repo.

Rationalization Defense

When tempted to bypass a rule, check this table first:

Excuse	Reality
"I'll just Read the top 20 files and figure it out"	Reading exhausts context before reconnaissance finishes. A 50-file repo has ~80 KB of source; a 5000-file monorepo has ~10 MB. Glob + Grep yield structural signals at ~1% of that token cost. Reading is for the 3-5 files reconnaissance flags as ambiguous, not the survey.
"The README already explains the architecture — I'll copy from it"	READMEs document what the author intended to ship, not what the repo actually contains. New features land without README updates; deprecated modules linger after README purges. The repo on disk is the source of truth; the README is a hint.
"package.json says React 18 — that's the framework"	A `package.json` dependency does not prove the framework is used. Many repos carry vestigial deps from past refactors. Verify by grepping for the framework's distinctive symbols (`useState`, `React.FC`, `'react-router'`) in the actual source.
"I'll generate a fresh CLAUDE.md — easier than reading the old one"	Overwriting CLAUDE.md silently destroys project-specific instructions the team has accumulated (rare-failure notes, secret commands, do-not-touch zones). Always Read the existing file first; if regenerating, diff before write and call out what was added vs preserved.
"Convention detection is vague — I'll just say `we use TypeScript and React`"	"TypeScript and React" is the framework, not a convention. A convention is "files use kebab-case", "errors return `Result<T, E>` not throws", "tests live next to source as `*.test.ts` not in `tests/`". If you cannot point to 2+ files demonstrating the convention, do not document it.
"The repo is empty — I'll produce a generic CLAUDE.md skeleton"	An empty (or just-cloned-shallow) repo cannot support convention detection. Phases 2-3 must report "insufficient signal" honestly rather than emit framework boilerplate. A generic skeleton invites the next reader to treat it as ground truth.
"I detected 3 frameworks (Next.js, NestJS, Remix) — listing all three"	Frameworks in a monorepo belong to specific workspaces. Locate each in the directory tree, scope the detection per workspace, and document accordingly. A flat "this repo uses Next.js + NestJS" claim is wrong; it uses Next.js in `apps/web/` and NestJS in `services/api/`.
"Git history is shallow — I'll skip convention detection"	Shallow history (e.g., `git clone --depth 1`) means commit-message and branch-naming conventions cannot be detected. Report the gap explicitly ("Git history unavailable or too shallow"); do not invent conventions from file content.
"There's an AGENTS.md but I only Read the CLAUDE.md"	`AGENTS.md` (and `.cursorrules`, copilot-instructions) encodes agent conventions the team already wrote for other tools. Skipping it means the generated CLAUDE.md may contradict rules the repo already enforces. Read every existing agent-context file; reconcile, never contradict. This variant still writes CLAUDE.md (the Codex variant owns AGENTS.md output) — but it must not author a CLAUDE.md that conflicts with an existing AGENTS.md.

Violating the letter of this discipline is violating the spirit. A wrong CLAUDE.md is worse than no CLAUDE.md — it teaches future agents the wrong conventions, and the error compounds across every subsequent skill invocation.

Arguments

Parameter	Format	Example	Description
`--scope`	`--scope <dir>`	`--scope apps/web/`	Limit reconnaissance to a workspace (useful in monorepos). Default: repository root.
`--guide-only`	flag	`--guide-only`	Emit the Onboarding Guide but skip writing the Starter CLAUDE.md.
`--claude-md-only`	flag	`--claude-md-only`	Emit only the Starter CLAUDE.md, skip the Guide.
`--enhance`	flag	`--enhance`	Force the "existing CLAUDE.md enhancement" path even when no CLAUDE.md is present (creates an empty stub first).

Workflow

Phase 1: Reconnaissance — Glob + Grep, never Read-all

Run all six signal-gathering passes (no Read tool — Glob and Grep only). The fastest path is the bundled script, which runs every pass in one shot and prints the structured summary:

bash claude-code/skills/ywc-onboard-repo/scripts/recon.sh [repo-dir]

Full per-pass tool invocations (for when you need to run or extend a single pass by hand) live in references/reconnaissance-checklist.md.

Pass	What it surfaces	Tool
1. Package manifest	Language, dependency footprint, scripts	Glob: `package.json`, `go.mod`, `Cargo.toml`, `pyproject.toml`, `pom.xml`, `build.gradle`, `Gemfile`, `composer.json`, `mix.exs`, `pubspec.yaml`
2. Framework fingerprint	Web framework, build tool, runtime	Glob: `next.config.`, `vite.config.`, `nuxt.config.`, `astro.config.`, `angular.json`, `nest-cli.json`, `manage.py`, `app/__init__.py`, `cmd/*/main.go`
3. Entry points	Where execution starts	Glob: `main.`, `index.`, `server.`, `app.`, `cmd/`, `src/main/`
4. Directory structure	Top-level shape (2 levels)	Bash: `find . -maxdepth 2 -type d -not -path '/node_modules/' -not -path '/.git/' -not -path '/dist/' -not -path '/build/'`
5. Tooling	Linter, formatter, CI, container	Glob: `.eslintrc`, `biome.json`, `.prettierrc`, `ruff.toml`, `Makefile`, `Dockerfile`, `docker-compose*`, `.github/workflows/`, `.env.example`
6. Test structure	Test framework, file convention	Glob: `*/test`, `/spec`, `pytest.ini`, `jest.config.`, `vitest.config.`, `playwright.config.`

Agent-context pre-check (not a 7th pass). Alongside the six passes, Glob for existing agent-instruction files: CLAUDE.md, AGENTS.md, .cursorrules, .cursor/rules/, .github/copilot-instructions.md. AGENTS.md is the 2026 vendor-neutral standard read by Codex / Cursor / Copilot and many other tools; a repo worked on by mixed tooling often carries one. Any file found here is a high-signal existing-context source the team already authored for agents — it is Read and reconciled in Phase 4 (Output B), never silently ignored, overwritten, or contradicted.

At the end of Phase 1, write a 10-line Reconnaissance Summary internally (not yet shown to the user): one line per pass, e.g. Pass 1: package.json present, runtime = Node 20.x, scripts = dev/build/test/lint.

Phase 2: Architecture Mapping

From the reconnaissance summary, identify the four architecture facets below. Verify each one by grepping the source, not just by reading the config.

Facet	Question	Verification
Tech stack	Which language and major libraries are actually used?	Cross-check `package.json` deps against `git grep -lE "from ['\"]<dep>['\"]"` — discard any dep with zero source hits.
Architecture pattern	Monolith / monorepo / microservices / serverless?	Monorepo: `packages/` or `apps/` directory present, root `package.json` has `workspaces`. Microservices: separate deploy units per service directory. Serverless: `serverless.yml`, `vercel.json`, AWS SAM template.
Key directories	What does each top-level directory hold?	Sample 1-2 file names per directory; infer purpose from filename pattern. Use `references/directory-conventions.md` for the canonical per-framework mapping.
Request lifecycle	How does one request travel from entry → response?	Locate one route handler, follow imports outward: handler → service → repository → DB. Document the chain as a 3-5 step trace.

When a facet cannot be determined with confidence ≥7/10, document it as Unknown — <one-line reason> rather than guessing.

Phase 3: Convention Detection

Inspect existing source to surface patterns the codebase already follows (not what frameworks default to).

Convention	Source of truth	Procedure
File naming	`find src -type f \| sed 's/.*\///' \| sort -u` then inspect majority pattern	Pick from: kebab-case, camelCase, PascalCase, snake_case. Report the pattern that wins ≥80% of files; report "mixed" otherwise.
Error handling	`git grep -nE "throw new \\w+\\(\|Result<\|return \\{ error\|raise [A-Z]"`	Categorize: throw-based (try/catch), Result/Either type, error-codes, exception with `raise`. Pick the dominant style.
Async pattern	`git grep -nE "async function\|await \|\\.then\\(\|go func\\(\|channels\|coroutine"`	Categorize: async/await, promise chains, callbacks, goroutines/channels, coroutines.
Git conventions	`git log --pretty=format:'%s' -n 50 \| head -20` + `git branch -a \| head -10`	Detect commit-message format (conventional commits?, scope prefix?), branch prefix (feature/, fix/, etc). If history is shallow (`< 10 commits`), report "shallow history — convention undetectable".
Test placement	Compare `find . -path '/tests/' \| wc -l` vs `find . -name '.test.' \| wc -l`	If first dominates: `tests/` directory convention. If second dominates: collocated `.test.`. If both, scope per workspace.

Detected convention with <3 supporting examples is not a convention — leave it out rather than over-claim.

Phase 4: Generate Artifacts

Emit two outputs. Both are required unless --guide-only or --claude-md-only is set.

Output A: Onboarding Guide (printed to conversation)

# Onboarding Guide: <repo-name>

## Overview
<2-3 sentences: what this project does and who it serves, derived from README + package description>

## Tech Stack
| Layer | Technology | Version | Source |
|-------|-----------|---------|--------|
| Language | <name> | <version> | <package.json / go.mod / etc.> |
| Framework | <name> | <version> | <config file> |
| Database | <name> | <version> | <ORM config / docker-compose> |
| Testing | <name> | <version> | <config file> |

## Architecture
<3-5 sentences: monolith / monorepo / microservices, frontend-backend split, API style>

## Key Entry Points
- **<purpose>**: `<path>` — <one-line role>
- **<purpose>**: `<path>` — <one-line role>

## Directory Map
| Path | Purpose |
|------|---------|
| `<top-level>` | <one-line purpose, inferred from sampled files> |

## Request Lifecycle
1. <entry: router / handler>
2. <middleware / validation>
3. <business logic: service / use-case>
4. <persistence: repo / ORM>
5. <response shape>

## Conventions
- File naming: <kebab-case / camelCase / ...>
- Error handling: <throw / Result / ...>
- Async: <async-await / promise / ...>
- Tests live: <collocated / tests-dir>
- Git: <conventional-commits / freeform; branch prefix=<feature/, fix/>>

## Common Tasks
- Dev server: `<command from package.json scripts>`
- Tests: `<command>`
- Lint: `<command>`
- Build: `<command>`
- Database: `<command if detected>`

## Where to Look
| I want to... | Look at... |
|--------------|-----------|
| <add an API endpoint> | `<path>` |
| <add a UI component> | `<path>` |
| <add a test> | `<path-pattern>` |
| <change build config> | `<path>` |

## Detection Confidence
- Detected: <N> facts
- Inferred (medium confidence): <N> facts
- Unknown / shallow: <N> facts (listed inline above as "Unknown — ...")

Output B: Starter CLAUDE.md (written to repo root)

If a CLAUDE.md already exists, Read it first and merge — preserve existing project-specific instructions, append a clearly-labeled ## Detected Conventions (<YYYY-MM-DD>) section at the bottom with the new findings. Never overwrite.

If an AGENTS.md (or .cursorrules / .github/copilot-instructions.md) surfaced in the Phase 1 agent-context pre-check, Read it too before writing. Reconcile: the generated CLAUDE.md must not contradict rules the team already wrote there. When both files will coexist, keep CLAUDE.md focused on Claude-specific guidance and add a one-line pointer to the shared AGENTS.md rather than duplicating its content. This variant writes CLAUDE.md only — emitting AGENTS.md is the Codex variant's job (intentional divergence); do not switch the output here.

When no CLAUDE.md exists, the canonical starter template lives in references/claude-md-starter.md. Copy it and fill in the placeholders from Phases 1-3.

Keep the generated CLAUDE.md under 100 lines — if it grows beyond that, the new content belongs in a project doc that CLAUDE.md links to, not in CLAUDE.md itself.

Output Format

The conversation surface emits the Onboarding Guide (markdown) directly. The Starter CLAUDE.md is written to the repo root and confirmed with a one-line claim:

Wrote CLAUDE.md (<N> lines, <M> sections) at <repo-root>/CLAUDE.md
  - Preserved existing sections: <list or "none — file did not exist">
  - Appended sections: ## Detected Conventions (<YYYY-MM-DD>)

The claim line follows ywc-verify-done's vocabulary rules (no "should" / "probably" / "seems"). The file was either written or it was not.

Integration

Upstream: user invocation (most common); a subagent runner that needs a CLAUDE.md before delegating implementation.
Downstream: ywc-refactor-clean (when reconnaissance reveals prior dead-code accumulation that blocks comprehension); ywc-impl-review (the Onboarding Guide is read by a reviewer entering the repo cold); ywc-plan (Phase 2's Request Lifecycle is the architectural anchor for plan Step 2).
Pairs with: ywc-project-scaffold (the inverse — scaffold creates a repo with conventions; onboard discovers conventions in an existing one). Never both in the same session; pick the direction.
Must not be paired with: ywc-code-gen in the same session — onboarding produces a CLAUDE.md the code-gen will consume; running both in one session means the code-gen reads a half-written CLAUDE.md.

Validation Checklist

Before declaring the onboarding pass complete, verify:

All 6 reconnaissance passes ran (or were explicitly skipped with reason)
Architecture facets have a verification source — not just a config-file claim
Every documented convention is supported by ≥3 source examples
Existing CLAUDE.md (if any) was Read before write — diff captured in the claim line
Existing AGENTS.md / .cursorrules / copilot-instructions (if any) was Read and reconciled — generated CLAUDE.md does not contradict it
Generated CLAUDE.md is ≤100 lines
Confidence section lists at least the count of Unknown facts (zero is allowed; missing the section is not)
If git history is shallow (<10 commits), conventions section reports the shallow-history caveat

Common Mistakes

(Procedural failure modes specific to repo onboarding. Behavioral rationalizations are in the table above — do not duplicate here.)

Reading the README and stopping there. The README is one signal among six. Skipping Phase 1's other five passes means the Onboarding Guide reflects what the author wrote about the project, not what is in the project. Always complete all six passes.
Documenting framework defaults as conventions. "Uses TypeScript strict mode" is not a convention if it ships in every Next.js scaffold — it is the framework's default. Convention = a choice the team made, not a default they accepted. Filter the output accordingly.
Bundling cleanup or refactor work into the onboarding PR. Onboarding is a read-only survey + a CLAUDE.md write. Cleanup is a separate skill (ywc-refactor-clean) and a separate branch. Mixing them produces a PR that is impossible to review.
Skipping the existing-CLAUDE.md Read because "it's outdated anyway". Outdated content may include load-bearing rules (security gates, hidden directory constraints, integration secrets handling) that you cannot detect from the source. Read, preserve, append.
Producing a 300-line CLAUDE.md "for completeness". A CLAUDE.md the reader does not read is worse than a shorter one they do. Hard cap at 100 lines; everything past that goes in a linked doc.

References

Reference	Use when
references/reconnaissance-checklist.md	Running Phase 1's six passes — full Glob / Grep / Bash invocations per ecosystem
references/directory-conventions.md	Mapping `src/api/`, `src/pages/`, `cmd/`, etc. to canonical purpose during Phase 2
references/claude-md-starter.md	Phase 4 Output B — starter CLAUDE.md template with placeholders
../ywc-project-scaffold/SKILL.md	When the user actually wants to create a new repo, not survey an existing one — route there
../ywc-refactor-clean/SKILL.md	When reconnaissance reveals significant dead-code accumulation — schedule a follow-up cleanup PR
../ywc-verify-done/SKILL.md	Vocabulary rules for the final "Wrote CLAUDE.md" claim