---
name: infra-document
description: 'Stage 5 of the DevOps pipeline. Generate a living infrastructure document for an environment: derives the architecture from the actual Terraform module wiring + spec, writes docs/infrastructure.md, an editable docs/diagrams/infra.drawio (AWS-grouped), a temporary Mermaid block for verification, and a top-level README.md entry point. STOPS at human gate G5; never commits.'
disable-model-invocation: true
allowed-tools: Read, Glob, Grep, Bash, Write
argument-hint: '[env-dir]'
---
# Infra Document: Stage 5 (Living infrastructure documentation)
Produce the single source-of-truth document for an environment's infrastructure, plus an editable architecture diagram. This is a living document: re-run it whenever the infra changes so the doc and diagram stay accurate (derived from code, not hand-maintained).
**Human gate G5:** This skill writes docs + a diagram and then **STOPS**. It does not commit and does not run `terraform apply`. Hand the document back for the human to review.
**Argument:** the first token of `$ARGUMENTS` is the environment dir to document (e.g. `environments/dev-care-hub`). Default: current dir. Ask if ambiguous.
**Outputs** (in the project):
- `docs/infrastructure.md`: the document (template: `knowledge/templates/infra-document-template.md`)
- `docs/diagrams/infra.drawio`: editable source diagram (one combined AWS-grouped diagram)
- A temporary Mermaid block inside `infrastructure.md` for cross-checking the `.drawio`
- `README.md`: top-level repo entry point (Phase 4.5; created or refreshed, never clobbered)
## Phase 1: Gather the facts (derive, don't invent)
Read the real sources so the document is as-built, not aspirational:
- **Spec**: `docs/specs/*.spec.md` (architecture intent, environments, cost, SLO).
- **Terraform**: the env dir's `main.tf` (which modules are instantiated and how their outputs wire into each other; this defines the real topology), `terraform.tfvars`, `locals.tf`, `backend.tf`, `providers.tf`.
- **Module catalog**: `MODULES.md` at the custom-infrastructure library root (purpose + I/O of each module used).
- **Review report**: read the latest `docs/reviews/<env>-*.md` (written by `/infra-review`); fold its resolved security/cost posture into §7. Pick the newest by date if several exist.
- **Live outputs (optional)**: only if already applied and the user confirms, run `terraform output` for real endpoints/ARNs. Never run `apply`.
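The "pick the newest by date" rule for review reports can be sketched in shell. This is a minimal sketch, not part of the skill: it assumes the part of the file name after `<env>-` is an ISO date (`YYYY-MM-DD`), so that lexical order equals chronological order. The source does not specify the naming scheme, and `latest_review` is a hypothetical helper name.

```shell
# Hypothetical helper: print the newest review report for an environment.
# ASSUMPTION: files are named <env>-YYYY-MM-DD.md, so the glob's sorted
# order is also chronological order.
latest_review() {
  lr_latest=""
  for lr_file in "${2:-docs/reviews}/$1"-*.md; do
    # An unmatched glob stays literal; -e filters that case out.
    [ -e "$lr_file" ] && lr_latest="$lr_file"
  done
  [ -n "$lr_latest" ] && printf '%s\n' "$lr_latest"
}

# Usage: latest_review dev-care-hub            (looks in docs/reviews)
#        latest_review dev-care-hub some/dir   (explicit directory)
```

The function returns a non-zero status when no report exists, so a caller can tell "no review yet" apart from an empty result. If the file names carry no sortable date, falling back to modification time would be a different rule and should be stated in the document.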
Build a component list: module → AWS resource(s) → role → key inputs/outputs → which subnet/tier.
## Phase 2: Write `docs/infrastructure.md`
Resolve the template from the guideline repo via the symlinked skill (same mechanism as `/secret-scan`: the template is read live from the repo, not copied into projects; `readlink -f` follows the symlink so it works from any project on any machine):
```bash
SK="$(readlink -f "${CLAUDE_SKILL_DIR:-$HOME/.claude/skills/infra-document}" 2>/dev/null)"
GUIDELINE="$(dirname "$(dirname "$(dirname "$SK")")")"
TPL="$GUIDELINE/knowledge/templates/infra-document-template.md"
REF="$(dirname "$SK")/drawio-reference.md"   # ships next to SKILL.md; used in Phase 3
# GUARD: stop loudly (don't just warn) if the guideline repo didn't resolve; otherwise Phase 2
# proceeds with no template and Phase 3 with no stencil reference, silently degrading both.
test -f "$TPL" || { echo "ERROR: doc template not found at '$TPL' (resolved from '$SK'). Is /infra-document SYMLINKED from the guideline repo, not copied? (Guide §1.1)"; exit 1; }
test -f "$REF" || { echo "ERROR: drawio-reference.md not found at '$REF'. Guideline skill dir incomplete."; exit 1; }
echo "template: $TPL"; echo "drawio ref: $REF"
```
Read $TPL (8 sections) and fill it from Phase 1. The template is comprehension-first — a
reader should finish §1–§3 with a correct mental model, then use §4–§8 as reference. Rules:
- State facts derived from code; if something isn't in the code/spec, mark it `TODO` rather than guess.
- **§1 Overview**: include the "big picture" paragraph: what enters, what happens, what comes out, and the 2–4 main building blocks, in plain language with no resource names/jargon. A newcomer reads only this and gets the gist.
- **§2** holds the diagram (PNG ref + temporary Mermaid, see Phase 4), a "How to read this diagram" line (shapes/colors/numbered edges; see Phase 3), and a one-line numbered-path key (`① → ② → ③ …`) that decodes the diagram's edges. There is no separate data-flow section: that key plus the §3 walkthrough (which references the same numbers) covers it.
- **§3 How it works (architecture walkthrough)**: the section that makes the infra click. This is the most important content in the doc; do not reduce it to a table. Format it for scanning, not as an essay: a DevOps/SA should skim the bold labels and bullets and get it.
  - Structure as a few labeled blocks (e.g. one bold lead-in per subsystem/phase, plus a final "Key design decisions" block). Use short bullets, not dense multi-line paragraphs.
  - Group by subsystem or by flow, not by Terraform module.
  - For each major part answer three things: what it is · why it's here · what it connects to.
  - Call out key design decisions and the non-obvious ("X is the handoff between the two halves", "Y exists only so Z passes its check", "single-AZ on purpose, it's a dev lab").
  - Name the same components shown in the diagram and weave the diagram's ① ② ③ numbers into the bullets, so this section doubles as the flow explanation (no separate data-flow section).
  - Keep it tight: a handful of labeled blocks, a few bullets each; push the exhaustive list to §4.
- **§4 Components** is the reference table; §3 explains, §4 enumerates. Don't duplicate prose into the table.
- Link out rather than duplicate: spec, review report, dashboards.
- The template intentionally has no Operations/runbook or change-log section: this doc describes what the infrastructure is, not how to operate it. Keep ops/runbooks in their own doc.
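Once the template is filled in, a quick structural check can confirm that none of the eight sections was dropped. The sketch below assumes section headings follow the `## N. Title` pattern seen in the `## 2. Architecture diagram` example; the actual template may differ, and `missing_sections` is a hypothetical helper name.

```shell
# Hypothetical helper: print the numbers of sections 1..8 that have no
# "## N. Title" heading in the given Markdown file.
# ASSUMPTION: the template numbers its top-level sections this way.
missing_sections() {
  ms_n=1
  while [ "$ms_n" -le 8 ]; do
    grep -qE "^## ${ms_n}\. " "$1" || printf '%s\n' "$ms_n"
    ms_n=$((ms_n + 1))
  done
}

# Usage: missing_sections docs/infrastructure.md
```

Empty output means all eight headings are present. The check says nothing about whether a section's content is correct; that judgment stays with the reviewer at G5.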
## Phase 3: Write `docs/diagrams/infra.drawio` (one combined diagram)
Create `docs/diagrams/` if needed. Read `$REF` (resolved and existence-checked in Phase 2) and hand-author one combined diagram following it, using the proven AWS4 stencil patterns:
- Nest groups: AWS Cloud → Region → (Account) → VPC → public/private subnet → resources (each child's geometry is relative to its parent via `parent=`).
- Use `mxgraph.aws4.resourceIcon` per service with the category fill colors from the reference (compute orange, networking purple, database blue/magenta, storage green, security red).
- Draw edges left→right (ingress → compute → data); number the main data-plane edges ① ② ③, dash metadata/IAM edges. Add a title and a legend.
- Map every component from Phase 1 to exactly one node; wire edges from the Terraform output→input relationships you found in `main.tf`.
- If unsure of an exact `resIcon` name, use the labeled fallback box (reference §Special shapes) rather than a wrong stencil that renders empty.
- Write the matching "How to read this diagram" line into §2 of `infrastructure.md`: explain the conventions you actually used (nesting, numbered solid vs dashed edges, category colors) so a reader can decode the picture without guessing. The diagram and this legend must agree.
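The output→input relationships that drive the diagram's edges can be listed mechanically as a starting point. The sketch below is an approximation under stated assumptions: `main.tf` is `terraform fmt`-formatted (each top-level block closes with `}` in column 0), and every `module.<name>.<output>` reference inside a module block counts as one edge. `module_edges` is a hypothetical helper, and references made through `locals` or resources are not followed.

```shell
# Hypothetical helper: print "producer -> consumer" for every
# module.<producer>.<output> reference found inside a module block.
# ASSUMPTION: top-level blocks end with "}" at the start of a line.
module_edges() {
  awk '
    /^[ \t]*module[ \t]+"/ {
      split($0, parts, "\"")      # parts[2] is the instance name
      current = parts[2]
      next
    }
    /^[}]/ { current = "" }       # left the current top-level block
    current != "" {
      line = $0
      while (match(line, /module\.[A-Za-z0-9_-]+/)) {
        # Skip the 7 characters of "module." to get the producer name.
        print substr(line, RSTART + 7, RLENGTH - 7) " -> " current
        line = substr(line, RSTART + RLENGTH)
      }
    }
  ' "$1" | sort -u
}

# Usage: module_edges environments/dev-care-hub/main.tf
```

The result is a candidate edge list only. Which edges are data-plane (numbered, solid) and which are metadata/IAM (dashed) still has to be decided by reading the modules.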
Validate the file is well-formed before finishing. Use a parser that does not resolve external entities or hit the network (avoids XXE / billion-laughs — drawio files need no DTD/entities):
```bash
# Try xmllint (libxml2); fall back to defusedxml; if NEITHER is available, say so and don't claim OK.
if command -v xmllint >/dev/null; then
  xmllint --nonet --noout docs/diagrams/infra.drawio && echo "drawio XML OK (xmllint)"
elif python3 -c "import defusedxml" 2>/dev/null; then
  python3 -c "import defusedxml.ElementTree as ET; ET.parse('docs/diagrams/infra.drawio'); print('drawio XML OK (defusedxml)')"
else
  echo "WARN: cannot validate XML. Install libxml2-utils (xmllint) or python3-defusedxml. Diagram written but UNVALIDATED."
fi
```
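(Do not fall back to the plain `xml.dom.minidom` / `xml.etree` stdlib parsers: they are not hardened against malicious XML, and on older Expat builds they are vulnerable to entity-expansion attacks such as billion laughs.)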
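Well-formed XML can still carry a broken group hierarchy, for example a node whose `parent=` points at an id that does not exist. The sketch below is a text-level check, not an XML parse: it assumes the file is uncompressed draw.io XML with double-quoted `id` and `parent` attributes, and `dangling_parents` is a hypothetical helper name.

```shell
# Hypothetical helper: print every parent="..." value that has no
# matching id="..." anywhere in the file.
# ASSUMPTION: uncompressed drawio XML with double-quoted attributes.
dangling_parents() {
  dp_ids=$(grep -oE ' id="[^"]*"' "$1" | sed -E 's/^ id="(.*)"$/\1/' | sort -u)
  grep -oE ' parent="[^"]*"' "$1" | sed -E 's/^ parent="(.*)"$/\1/' | sort -u |
  while IFS= read -r dp_parent; do
    # -x matches the whole line, -F treats the id as a literal string.
    printf '%s\n' "$dp_ids" | grep -qxF -- "$dp_parent" || printf '%s\n' "$dp_parent"
  done
}

# Usage: dangling_parents docs/diagrams/infra.drawio
```

Empty output means every `parent=` reference resolves. Any id printed here is worth fixing before G5, since the affected node is attached to a group that does not exist.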
## Phase 3.5: Coverage check (diagram vs code)
Make sure the diagram didn't drop a component. List the module instances in the env's main.tf and
confirm each appears as a node in infra.drawio (and a row in §4 Components):
```bash
grep -nE '^[[:space:]]*module[[:space:]]+"' <env-dir>/main.tf
```
For every module found, verify there's a matching node + components row. Flag any module missing from the diagram and add it — or note why it's intentionally omitted (e.g. a pure IAM/role module). This catches "drew it but forgot X" before the human reviews at G5.
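The comparison itself can be partly automated. The sketch below assumes each diagram node's id or label contains the Terraform module instance name verbatim, which is a convention this document does not require; if your labels use AWS service names instead, the check will over-report. Both function names are hypothetical.

```shell
# Hypothetical helper: list module instance names declared in a .tf file.
list_modules() {
  # Matches lines such as:  module "vpc" {
  sed -nE 's/^[[:space:]]*module[[:space:]]+"([^"]+)".*/\1/p' "$1"
}

# Hypothetical helper: print modules whose name appears nowhere in the diagram.
# ASSUMPTION: node ids or labels contain the module instance name.
missing_from_diagram() {
  list_modules "$1" | while IFS= read -r mfd_name; do
    grep -qF -- "$mfd_name" "$2" || printf '%s\n' "$mfd_name"
  done
}

# Usage: missing_from_diagram <env-dir>/main.tf docs/diagrams/infra.drawio
```

Because this is a substring match, a short name such as `db` would also be satisfied by `db_proxy`. Treat the output as a list of candidates to inspect, not as proof of coverage.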
## Phase 4: Mermaid verification block (temporary)
Inside infrastructure.md §2, emit the same topology as a Mermaid flowchart so the human can
cross-check the .drawio without opening draw.io (guards against a malformed/incorrect diagram).
Wrap it with clear delete markers and a PNG placeholder:
````markdown
## 2. Architecture diagram
![Architecture diagram](diagrams/infra.png)
<!-- ^ PNG not exported yet. Source: diagrams/infra.drawio -->
<!-- VERIFICATION DIAGRAM: delete after confirming infra.drawio (then export drawio → infra.png) -->
```mermaid
flowchart LR
  subgraph AWS["AWS Cloud / ap-northeast-1"]
    subgraph VPC["VPC"]
      cf[CloudFront] --> alb[ALB]
      alb --> ecs[ECS Fargate]
      ecs --> rds[(Aurora PostgreSQL)]
    end
  end
```
<!-- END VERIFICATION DIAGRAM -->
````
The Mermaid must mirror the .drawio exactly (same nodes + edges). It is disposable — tell the
user to delete it after they confirm the drawio and export the PNG.
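A rough way to spot a mismatch between the two diagrams is to compare edge counts. The sketch below assumes the Mermaid block sits between the `VERIFICATION DIAGRAM` comment markers, that only `-->` (solid) and `-.->` (dashed) arrows are used, and that the `.drawio` file is uncompressed XML in which each connector carries `edge="1"`. Both function names are hypothetical.

```shell
# Hypothetical helper: count connector cells in an uncompressed .drawio file.
count_drawio_edges() {
  grep -oE ' edge="1"' "$1" | wc -l | tr -d ' '
}

# Hypothetical helper: count arrows inside the verification block.
# ASSUMPTION: only "-->" and "-.->" arrow styles are used.
count_mermaid_edges() {
  sed -n '/<!-- VERIFICATION DIAGRAM/,/<!-- END VERIFICATION DIAGRAM/p' "$1" |
    grep -v '^<!--' |              # marker comments end in "-->" too
    grep -oE -e '-->|-\.->' | wc -l | tr -d ' '
}

# Usage:
#   [ "$(count_mermaid_edges docs/infrastructure.md)" = \
#     "$(count_drawio_edges docs/diagrams/infra.drawio)" ] || echo "edge count differs"
```

Equal counts do not prove the diagrams mirror each other, since two different edge sets can have the same size. Unequal counts are a reliable sign that one of them drifted.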
## Phase 4.5: Project README (repo entry point)
Write a top-level README.md — the entry point a reader (or a public visitor) sees first. Keep
it short: it orients and links out; docs/infrastructure.md holds the depth (don't duplicate).
If README.md already exists, don't clobber it — refresh only the pipeline-managed sections (or
show a diff and ask). Derive everything from the same facts as Phase 1.
Structure:
````markdown
# <project>: <one-line what-it-is>

<2–3 sentence overview: what this provisions and why. Plain language.>

## Stack
<key services / tools, one line>

## Layout
- `environments/<env>/`: Terraform root(s) · `modules/`: reused modules
- `docs/specs/`: design spec · `docs/infrastructure.md`: **architecture & diagram (start here)**
- `docs/reviews/`: security/cost review reports

## Prerequisites
<terraform version, AWS profile/creds, TF_MODULE_LIB if modules are vendored, tflint/checkov/trivy for local scans>

## Deploy
```bash
cd environments/<env>
terraform init -backend-config=<backend>.hcl
terraform plan -out=tfplan
terraform apply tfplan
```

## Security / CI
- IaC scan gate: `.github/workflows/iac-scan.yml` (fmt/validate/tflint/checkov/trivy on every PR)
- Secret scan gate: `.github/workflows/secret-scan.yml` + local pre-push hook
- Never commit `.mcp.json` / `backend-*.hcl` (gitignored).
````
Adjust sections to what actually exists (omit Deploy specifics you can't derive; mark TODO rather
than guess). This is **public-facing**, so no account IDs, ARNs, or secrets in the README.
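The "no account IDs, ARNs, or secrets" rule can be spot-checked before handing over at G5. The sketch below is a heuristic, not a secret scanner: it flags `arn:aws…` strings, standalone 12-digit numbers (the length of an AWS account ID), and strings shaped like a long-term access key ID (`AKIA` plus 16 characters). `readme_leaks` is a hypothetical helper and does not replace the secret-scan gate.

```shell
# Hypothetical helper: print README lines that look like they expose an
# ARN, a 12-digit account ID, or an access key ID. Exit status is
# non-zero when nothing is found (plain grep behaviour).
readme_leaks() {
  grep -nE 'arn:aws[a-z-]*:|(^|[^0-9])[0-9]{12}([^0-9]|$)|AKIA[0-9A-Z]{16}' "$1"
}

# Usage: readme_leaks README.md && echo "review the lines above before G5"
```

Any other 12-digit number will also be flagged, so expect occasional false positives; a flagged line needs a human look rather than automatic removal.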
## Phase 5: STOP at Gate G5
```
Infrastructure doc ready for review (G5)

Written:
  - docs/infrastructure.md
  - docs/diagrams/infra.drawio (drawio XML OK)
  - README.md (repo entry point, created/refreshed)
  - Mermaid verification block embedded in §2 (temporary)

Diagram summary: [N nodes, M edges; ingress → compute → data]
Components documented: [list modules/resources]

👉 Next:
  - Open docs/diagrams/infra.drawio in draw.io and check it matches the Mermaid block.
  - Export it to docs/diagrams/infra.png, then delete the Mermaid verification block.
  - Review docs/infrastructure.md: does §1–§3 make the infra clear on a single read?

Re-run /infra-document anytime the infra changes; it's a living document.
```
**Do not commit.** Wait for the human.