eks-design - SKILL.md Agent Skill

name: eks-design
description: Use when designing EKS architecture. Generates design documents with Mermaid diagrams, ADRs, security architecture, and validation reports. Translates requirements into tailored EKS designs guided by Well-Architected best practices. Covers cluster architecture, compute, networking, security, addons, observability, cost, and upgrade strategy. Also use when reviewing or validating existing EKS architectures, planning networking or security, evaluating deployment models, or generating architecture diagrams. Skip for short advisory recommendations without a formal document (eks-best-practices), Internal Developer Platforms or progressive delivery (eks-platform-engineering), and GenAI/LLM workload design, including GPU vs Neuron, vLLM/Ray serving, and distributed training (eks-genai).

EKS Design

Generate architecture design documents for production-ready EKS deployments. All output is structured for direct handoff to eks-build for code generation.

When to Use

Designing a new EKS cluster architecture from requirements
Reviewing or validating existing EKS architecture decisions
Choosing between EKS compute options (Karpenter, MNG, Auto Mode, Fargate)
Planning EKS networking or security architecture
Evaluating EKS deployment models (Standard, Auto Mode, Outposts, Anywhere)
Optimizing EKS cost and scalability
Generating architecture documentation, ADRs, or Mermaid diagrams for EKS
Generating standalone Mermaid architecture diagrams (EKS topology, VPC layout, subnet tiers, node groups, pod networking flows, load balancer placement)
Creating design artifacts that feed into eks-build for implementation

Don't Use

Generating Terraform code or Helm charts (use eks-build)
EKS cluster reconnaissance or discovery (use eks-recon)
Terraform module design or testing (use terraform-skill)
Detailed reference material on autoscaling, networking, security, observability, cost, reliability, or upgrades (use eks-best-practices)
Internal Developer Platforms, Backstage portals, golden paths, progressive delivery, or developer self-service (use eks-platform-engineering)
GenAI / LLM workload design — GPU vs Trainium/Inferentia selection, vLLM / Ray Serve / distributed-training architecture, ML storage (FSx for Lustre), or GPU/Neuron scheduling (use eks-genai). Design the cluster here; design the GenAI workload on it there.

Design Output Format

Design documents describe WHAT and WHY — never HOW.

USE in design output	DO NOT USE in design output
Decision tables (compare options)	YAML manifests (K8s, Helm, Kustomize)
Mermaid diagrams (architecture, flows)	Bash/CLI commands (aws, kubectl, helm)
ASCII flow diagrams (sequences, pipelines)	JSON/HCL (IAM policies, Terraform)
Bullet summaries (components, integration)	Code snippets (Python, Go, PromQL, SQL)
DO/DON'T lists (security, operations)	Step-by-step deployment procedures

Rule: If you find yourself writing a code block, stop and convert it to a table, diagram, or description. Implementation code belongs in eks-build.

How to use references: Skill references contain decision frameworks, comparison tables, and architecture patterns. Use them to INFORM your design decisions — do not copy reference content into design documents. Synthesize knowledge into project-specific recommendations.

Internet search (MANDATORY before generating): Before writing any design content, you MUST search the internet to determine the latest EKS version, tool versions (Karpenter, ArgoCD, Kyverno, etc.), and AWS service updates. Do NOT use version numbers from reference files — they are illustrative only and may be outdated. Always verify the chosen EKS version is in standard support (not extended or EOL). Never rely solely on cached knowledge for version numbers.

Design Workflow

MANDATORY: The validation loop (Stages 3-4) is NOT optional. Every design MUST be scored after generation. If the score is below threshold, you MUST fix the gaps and re-score. Do NOT skip to Stage 5 (Handoff) without a passing score. Do NOT present the design to the user as "complete" until it passes. The scoring loop is what separates a draft from a validated design.

Stage 1: Input Assessment

Analyze available inputs (requirements documents, meeting notes, technical assessments) to extract:

Business context: Project scope, stakeholders, success criteria, timeline, budget
Technical context: Existing VPC/network, compliance requirements, tooling preferences
Constraints: Air-gapped, proxy, private registry, multi-account, regulatory

Output: appendices/input-assessment-analysis.md

Rules:

All information must come from verifiable sources — never invent or assume
Focus on WHAT (requirements), not HOW (architecture) — no technology selections yet
Document gaps honestly rather than filling with assumptions
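
One way to keep Stage 1 honest is to record every extracted item together with its source, and to log anything unsourced as a gap instead of a requirement. The sketch below is only an illustration: the class names, the category labels, and the file name security-review.pdf are hypothetical and not part of this skill.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Requirement:
    """One extracted requirement, tied to the source it came from."""
    category: str   # "business", "technical", or "constraint" (hypothetical labels)
    statement: str  # WHAT is required, not HOW to build it
    source: str     # document or meeting the statement was taken from


@dataclass
class InputAssessment:
    requirements: List[Requirement] = field(default_factory=list)
    gaps: List[str] = field(default_factory=list)

    def add(self, category: str, statement: str, source: Optional[str]) -> None:
        """Record a requirement, or log a gap when no source backs it."""
        if not source:
            self.gaps.append(f"Unverified ({category}): {statement}")
            return
        self.requirements.append(Requirement(category, statement, source))


assessment = InputAssessment()
# Hypothetical inputs: one sourced constraint and one unsourced claim.
assessment.add("constraint", "No outbound internet access", "security-review.pdf")
assessment.add("business", "Go-live in Q3", None)
```

The unsourced timeline ends up in gaps, which matches the rule to document gaps rather than fill them with assumptions.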

Stage 2: Architecture Generation

Generate EKS architecture based on requirements. Use the decision frameworks below and search the internet for latest AWS best practices when requirements don't match existing patterns.

Refer to eks-best-practices skill for detailed reference material on autoscaling, networking, security, observability, cost optimization, reliability, and cluster upgrades.

Process:

Select EKS deployment model (Standard, Auto Mode, Fargate, Outposts, Anywhere)
Select compute strategy using the Compute Selection Matrix
Select networking model (VPC CNI mode, ingress pattern)
Select addon management pattern (Pattern 1, 2a, or 2b — see eks-build)
Design security posture (IAM model, PSA levels, secrets, encryption)
Design observability stack
Design upgrade strategy
Document each significant decision as an ADR

Output depends on what the user asked for:

Comprehensive design (user asks for "full design", "system architecture", or doesn't specify a focus): Generate architecture/system-architecture.md covering ALL requirements (compute, networking, addons, security, observability, multi-tenancy, upgrades, cost, DR, constraints). Structure the document with: (1) Executive summary and requirements recap, (2) Cluster architecture overview with Mermaid diagrams (cluster topology, VPC/subnet layout, addon architecture, data flow), (3) Component specifications for cluster, node groups, addons, networking, security, and observability, (4) Integration points with external systems (CI/CD, registries, monitoring), (5) Customization requirements (air-gapped, proxy, private registry, compliance).
Focused design (user asks for "security design", "CI/CD design", "networking design", etc.): Generate architecture/<focus>-architecture.md as the PRIMARY document, going deep on that specific domain. Do NOT force comprehensive coverage when the user asked for a focused design.
Comprehensive + supplementary: When generating a comprehensive design, optionally also generate a <focus>-architecture.md deep-dive if a domain is complex enough (e.g., HIPAA security, multi-tenant CI/CD).

ADRs: architecture/architecture-decision-records/ADR-*.md. Every significant technology choice must have an ADR. Each ADR follows the format: Context → Decision → Alternatives Considered → Rationale → Consequences → Research Sources. Name files ADR-001-compute-strategy.md, ADR-002-networking-model.md, etc. For comprehensive designs, produce 7-9+ ADRs. For focused designs, produce ADRs relevant to the focus area.
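
The naming convention and section order above can be sketched as two small helpers. This is a minimal illustration, assuming titles are plain English phrases; the function names and the Markdown heading levels in the skeleton are assumptions, not something this skill prescribes.

```python
import re

# Section order taken from the ADR format described above.
ADR_SECTIONS = [
    "Context",
    "Decision",
    "Alternatives Considered",
    "Rationale",
    "Consequences",
    "Research Sources",
]


def adr_filename(number: int, title: str) -> str:
    """Build a name such as ADR-001-compute-strategy.md from a decision title."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"ADR-{number:03d}-{slug}.md"


def adr_skeleton(title: str) -> str:
    """Return an empty ADR body with the six sections in the required order."""
    lines = [f"# {title}", ""]
    for section in ADR_SECTIONS:
        lines += [f"## {section}", "", "TODO", ""]
    return "\n".join(lines)


name = adr_filename(1, "Compute Strategy")
body = adr_skeleton("Compute Strategy")
```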

Stage 3: Architecture Validation (MANDATORY — DO NOT SKIP)

You MUST run this stage after generating any design documents. Score the design against five validation dimensions. If the score is below 85/100, you MUST fix the identified gaps before proceeding. This is the quality gate between "draft" and "validated design."

Validation dimensions (each scored per references/architecture-validation.md):

Dimension	Points	What to Evaluate
Requirements Coverage	/25	Every requirement has an architectural solution
Component Integration	/20	All interfaces defined and compatible, data flows documented
Service Limits	/15	AWS service limits assessed with mitigation for high-risk items
Technical Feasibility	/20	Technology choices validated, EKS-specific checks pass
Documentation Completeness	/20	All required docs present, narrative quality (not just tables), ADR quality, diagrams rendered to PNG and embedded in docx/pptx

Output: appendices/architecture-integration-validation.md

Scoring thresholds:

>= 85/100: PASSED — proceed to Stage 4
70-84: CONDITIONAL — fix identified gaps, re-score as next iteration
< 70: FAILED — significant rework needed

How to score: For each dimension, evaluate every criterion in the scoring matrix (see reference), assign points with specific justification, document gaps, and calculate the total. Be honest: inflated scores lead to weak designs that fail during build.
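
The arithmetic behind the Stage 3 gate is simple enough to sketch. The example below assumes each dimension receives an integer score up to the maximum in the table; the per-criterion breakdown lives in references/architecture-validation.md and is not reproduced here. The function name and the sample scores are hypothetical.

```python
# Maximum points per Stage 3 dimension, taken from the table above.
MAX_POINTS = {
    "Requirements Coverage": 25,
    "Component Integration": 20,
    "Service Limits": 15,
    "Technical Feasibility": 20,
    "Documentation Completeness": 20,
}


def stage3_result(scores):
    """Sum the five dimension scores and map the total to a threshold band."""
    for name, cap in MAX_POINTS.items():
        if name not in scores:
            raise ValueError(f"missing dimension: {name}")
        if not 0 <= scores[name] <= cap:
            raise ValueError(f"{name} must be between 0 and {cap}")
    total = sum(scores[name] for name in MAX_POINTS)
    if total >= 85:
        return total, "PASSED"
    if total >= 70:
        return total, "CONDITIONAL"
    return total, "FAILED"


# Hypothetical score sheet: 22 + 17 + 12 + 18 + 16 = 85.
example = stage3_result({
    "Requirements Coverage": 22,
    "Component Integration": 17,
    "Service Limits": 12,
    "Technical Feasibility": 18,
    "Documentation Completeness": 16,
})
```

A total of exactly 85 passes, since the threshold is inclusive.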

Stage 4: Quality Review & Iteration (MANDATORY — DO NOT SKIP)

You MUST run this stage after Stage 3 passes. Apply weighted scoring across architecture quality dimensions. If the score is below 90/100, you MUST fix the gaps and re-score. Do NOT skip to handoff with a score below 90.

Scoring dimensions (weighted):

Dimension	Weight	What to Evaluate
Architecture & Design	30%	Patterns, component design, integration, technology choices
Security	25%	IAM, pod security, network security, encryption, secrets
Reliability & Operations	20%	HA, PDBs, health probes, upgrades, observability, security tool monitoring
Cost & Scalability	15%	Right-sizing, Spot/Graviton, consolidation, service limits
Implementation Readiness	10%	Handoff completeness, ADR quality, build skill compatibility

Output: appendices/iterations/score-sheet-iteration-X.md
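
As a rough illustration of how the weights combine, the sketch below assumes each dimension is first rated on a 0-100 scale and the Stage 4 score is the weighted sum. That rating scale is an assumption made for this example; the authoritative scoring matrix is in references/architecture-validation.md.

```python
# Weights taken from the Stage 4 table above.
WEIGHTS = {
    "Architecture & Design": 0.30,
    "Security": 0.25,
    "Reliability & Operations": 0.20,
    "Cost & Scalability": 0.15,
    "Implementation Readiness": 0.10,
}


def stage4_score(ratings):
    """Weighted total out of 100; each rating is assumed to be on a 0-100 scale."""
    if set(ratings) != set(WEIGHTS):
        raise ValueError("ratings must cover exactly the five weighted dimensions")
    return round(sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS), 1)


# Hypothetical ratings: 28.5 + 22.5 + 17.0 + 12.0 + 10.0 = 90.0.
score = stage4_score({
    "Architecture & Design": 95,
    "Security": 90,
    "Reliability & Operations": 85,
    "Cost & Scalability": 80,
    "Implementation Readiness": 100,
})
ready_for_handoff = score >= 90
```

Because Security carries 25% of the total, a weak security rating is hard to offset with high marks in the lighter dimensions.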

Iteration rules:

Maximum 5 iterations to reach 90/100
Each iteration must show measurable progress (score must increase)
If the same gap persists across 2 iterations, escalate to the user
Final iteration content is promoted to root-level folders
Every score sheet must include: score per dimension, delta from previous iteration, specific gaps, and recommended fixes
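
The iteration rules can be read as a small decision procedure over the score history. The sketch below is one possible reading, assuming "across 2 iterations" means two consecutive score sheets; the return labels ("stalled", "exhausted", and so on) are hypothetical names chosen for this example.

```python
MAX_ITERATIONS = 5
TARGET = 90


def next_action(history):
    """Apply the iteration rules to (score, open_gaps) pairs, oldest first."""
    if not history:
        return "iterate"
    score, gaps = history[-1]
    if score >= TARGET:
        return "handoff"
    if len(history) >= 2:
        previous_score, previous_gaps = history[-2]
        if gaps & previous_gaps:
            return "escalate"  # same gap open in two consecutive iterations
        if score <= previous_score:
            return "stalled"   # no measurable progress since the last iteration
    if len(history) >= MAX_ITERATIONS:
        return "exhausted"     # five iterations used without reaching 90
    return "iterate"


# Hypothetical history: the PDB gap survives a second iteration.
action = next_action([(78, {"no PDBs"}), (84, {"no PDBs", "no probes"})])
```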

The validation loop pattern:

Generate design -> Score (Stage 3) -> Below 85? -> Fix gaps -> Re-score
                                    -> 85 or above? -> Score (Stage 4) -> Below 90? -> Fix gaps -> Re-score
                                                                        -> 90 or above? -> Proceed to Stage 5

Stage 5: Finalize & Handoff

COMPLETION CHECKLIST — every item must be done before handoff. Walk through this list at the end. If any item is unchecked, go back and complete it.

Internet search for latest EKS version, tool versions, and AWS service updates (not from cached knowledge)
Architecture documents generated — architecture/system-architecture.md (or architecture/<focus>-architecture.md for focused designs) exists with narrative prose + diagrams
ADRs generated — architecture/architecture-decision-records/ADR-*.md files exist (minimum 6 for comprehensive, domain-relevant for focused)
Security architecture generated — architecture/security-architecture.md exists (if applicable)
Stage 3 validation scored — appendices/architecture-integration-validation.md exists with score >= 85/100
Stage 4 quality review scored — appendices/iterations/score-sheet-iteration-*.md exists with Stage 4 score >= 90/100
Every section has narrative prose — no table-only or bullet-only sections (0/5 narrative = auto-fail)
Mermaid diagrams rendered to PNG — diagrams/*.png files exist (high-res, 4x scale, white background)
AGENTS.md created — lists which design files the build agent must read
README.md created — provides human-readable navigation
docx/pptx offered to user — asked if they want Word/PowerPoint versions (only generate if confirmed)
If docx/pptx generated: rendered PNGs from diagrams/ embedded in documents (not Mermaid code blocks)

If any item is unchecked, STOP and complete it before proceeding. The files are the proof — if the score sheet doesn't exist, you skipped validation. If diagrams/*.png doesn't exist, you skipped rendering.
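
Since the files are the proof, the existence checks in this list can be automated. The sketch below only confirms that matching files exist under the design directory; it does not read scores or judge narrative quality. The glob patterns are derived from the checklist, while the function name and the temporary directory used in the demo are hypothetical.

```python
import tempfile
from pathlib import Path

# File patterns derived from the completion checklist above.
REQUIRED_PATTERNS = [
    "architecture/*-architecture.md",
    "architecture/architecture-decision-records/ADR-*.md",
    "appendices/architecture-integration-validation.md",
    "appendices/iterations/score-sheet-iteration-*.md",
    "diagrams/*.png",
    "AGENTS.md",
    "README.md",
]


def missing_artifacts(design_dir):
    """Return the patterns that match no file under the design directory."""
    return [p for p in REQUIRED_PATTERNS if not any(design_dir.glob(p))]


# Demo against a throwaway directory that contains only README.md.
with tempfile.TemporaryDirectory() as tmp:
    design = Path(tmp)
    (design / "README.md").write_text("navigation", encoding="utf-8")
    gaps = missing_artifacts(design)
```

An empty result is necessary for handoff but not sufficient: the score thresholds and the narrative requirement still need a manual check.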

Generate handoff artifacts for eks-build:

AGENTS.md — machine-readable instructions listing which design files the build agent must read
README.md — human-readable navigation guide
Verify output structure matches specification
Render Mermaid diagrams to PNG — extract every Mermaid code block from the architecture markdown files, save each as a .mmd file, then convert to PNG in diagrams/. Install and convert: npm install -g @mermaid-js/mermaid-cli && mmdc -i diagram.mmd -o diagrams/<name>.png -b white -s 4. If mmdc doesn't work, search the internet for how to use mermaid-cli to convert .mmd to .png. Requirements: 4x scale, white background, auto-sized canvas (no fixed width/height). Use descriptive kebab-case names (e.g., defense-in-depth-layers.png, pod-identity-flow.png).
Ask the user if they want Word (.docx) and PowerPoint (.pptx) versions. Only generate if confirmed — the docx and aws-pptx skills handle generation. When generating, embed the rendered PNGs from diagrams/ into the documents.

Output: AGENTS.md, README.md, diagrams/*.png, optionally .docx and .pptx
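
The extraction step that precedes rendering can be sketched as follows. This is a minimal example, assuming Mermaid blocks are fenced with three backticks followed by the word mermaid; the function names are hypothetical. It writes the .mmd files and returns mmdc argument lists using the same flags as the command above, leaving execution to the caller.

```python
import re
from pathlib import Path

FENCE = "`" * 3  # a Markdown code fence, built here to avoid nesting fences
MERMAID_BLOCK = re.compile(FENCE + r"mermaid[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_mermaid(markdown):
    """Return the body of every Mermaid code block, in document order."""
    return [body.strip() + "\n" for body in MERMAID_BLOCK.findall(markdown)]


def write_mmd_files(markdown, names, out_dir):
    """Save each block as <name>.mmd and return the matching mmdc commands."""
    blocks = extract_mermaid(markdown)
    if len(blocks) != len(names):
        raise ValueError("provide one kebab-case name per Mermaid block")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    commands = []
    for name, body in zip(names, blocks):
        source = out_dir / f"{name}.mmd"
        source.write_text(body, encoding="utf-8")
        # 4x scale and white background, as required above; no fixed width/height.
        commands.append(["mmdc", "-i", str(source),
                         "-o", str(out_dir / f"{name}.png"),
                         "-b", "white", "-s", "4"])
    return commands


sample = "Intro\n" + FENCE + "mermaid\ngraph TD\n  A-->B\n" + FENCE + "\nOutro\n"
blocks = extract_mermaid(sample)
```

Each returned command can be passed to subprocess.run once mermaid-cli is installed.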

Output Structure

All design output goes to projects/<project-name>/design/:

projects/<project-name>/design/
├── README.md                                # Navigation guide
├── AGENTS.md                                # Build agent instructions
├── architecture/
│   ├── system-architecture.md               # Cluster architecture with Mermaid diagrams
│   ├── architecture-decision-records/
│   │   ├── ADR-001-[decision-name].md
│   │   └── ADR-00X-[decision-name].md
│   └── security-architecture.md             # Security posture design
├── diagrams/                                # Rendered Mermaid diagrams (high-res PNG)
│   ├── cluster-topology.png
│   ├── network-architecture.png
│   └── addon-dependencies.png
├── generate-docx.js                         # DOCX generator script (optional — user must confirm)
├── generate-pptx.js                         # PPTX generator script (optional — user must confirm)
├── system-architecture.docx                 # Word document (optional — with embedded diagrams)
├── system-architecture.pptx                 # PowerPoint deck (optional — with embedded diagrams)
└── appendices/
    ├── input-assessment-analysis.md         # Stage 1 output
    ├── architecture-integration-validation.md # Stage 3 output
    └── iterations/                          # Quality iteration history
        ├── score-sheet-iteration-1.md
        └── score-sheet-iteration-X.md

Detailed file descriptions: See references/output-structure.md.

EKS Architecture Decision Framework

When to Use EKS

Requirement	EKS	ECS	Lambda
Kubernetes ecosystem	Native K8s	AWS-proprietary	No
Portable across clouds	Standard K8s API	AWS-only	AWS-only
Long-running services	Yes	Yes	15 min limit
Minimal ops overhead	Medium	Low	Lowest
GPU/ML workloads	Best support	Limited	No
Complex networking	Full control	Medium	Limited
Team has K8s expertise	Required	Not required	Not required

EKS Deployment Models

Model	Operational Overhead	Use When
EKS Standard	Medium-High	Need full customization
EKS Auto Mode	Low	Want minimal ops, standard workloads
EKS with Fargate	Low	Batch, low-density workloads
EKS on Outposts	High	Data residency, low-latency edge
EKS Anywhere	Highest	Air-gapped, custom hardware

Compute Selection Matrix

Refer to eks-best-practices skill for detailed compute comparison tables, Karpenter configuration patterns, and Auto Mode specifics.

Factor	Fargate	MNG	Karpenter	Auto Mode	Self-Managed
Best for	Batch, small scale	Stable, predictable	Dynamic, varied	Minimal ops	Custom AMI/kernel
Spot support	No	Yes	Yes (native)	Yes	Yes
GPU support	No	Yes	Yes	Yes	Yes
DaemonSets	No	Yes	Yes	Yes	Yes
Node SSH	No	Yes	Yes	No	Yes

Quick decision guide:

Default: Karpenter — best balance of flexibility, cost, and automation
Zero ops: EKS Auto Mode — AWS manages everything
Serverless/batch: Fargate — no nodes, per-pod billing
Predictable: MNG — familiar ASG model
Custom: Self-managed — full control, highest overhead
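
The quick decision guide, combined with the Fargate limitations in the matrix, can be expressed as a small function. This is only a sketch: the parameter names and the order in which the conditions are checked are assumptions, and a real design should still weigh the full matrix and record the outcome in an ADR.

```python
def pick_compute(needs_custom_ami=False, wants_zero_ops=False,
                 serverless_batch=False, predictable_load=False,
                 needs_gpu=False, needs_daemonsets=False):
    """Map coarse requirements to a compute option; Karpenter is the default."""
    if needs_custom_ami:
        return "Self-managed"
    if wants_zero_ops:
        return "EKS Auto Mode"
    # Per the matrix, Fargate supports neither GPUs nor DaemonSets.
    if serverless_batch and not (needs_gpu or needs_daemonsets):
        return "Fargate"
    if predictable_load:
        return "MNG"
    return "Karpenter"


choice = pick_compute(serverless_batch=True, needs_gpu=True)
```

In this example the GPU requirement rules out Fargate, so the function falls through to the Karpenter default.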

Networking Quick Reference

Refer to eks-best-practices skill for detailed networking patterns including VPC CNI deep-dives, subnet planning, service mesh options, and private cluster configurations.

VPC CNI Mode	Use When	Pod Density
Secondary IP (default)	Most workloads	Limited by ENI x IPs per ENI
Prefix Delegation	>30 pods/node, IP-constrained	4-16x more pods
Custom Networking	Pods need different CIDR	Same as underlying mode
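
The pod density column can be made concrete with the commonly cited VPC CNI formula: ENIs x (IPs per ENI - 1) + 2, where prefix delegation multiplies each slot by 16 because a /28 prefix holds 16 addresses. The sketch below uses 3 ENIs with 10 IPv4 addresses each as illustrative instance limits; verify current limits for the chosen instance type, and note that the kubelet max-pods setting is usually capped well below the prefix delegation figure.

```python
PREFIX_SIZE = 16  # a /28 IPv4 prefix holds 16 addresses


def max_pods(enis, ips_per_eni, prefix_delegation=False):
    """Upper bound on pods per node implied by ENI capacity."""
    slots = enis * (ips_per_eni - 1)  # the primary IP of each ENI is not given to pods
    if prefix_delegation:
        slots *= PREFIX_SIZE          # each slot carries a /28 prefix instead of one IP
    return slots + 2                  # plus two pods that use host networking


secondary_ip = max_pods(3, 10)                            # 3 * 9 + 2
with_prefixes = max_pods(3, 10, prefix_delegation=True)   # 3 * 9 * 16 + 2
```

With these inputs the bound rises from 29 to 434, roughly a 15x increase, which is consistent with the 4-16x range in the table.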

Ingress Pattern	Best For
ALB (via LBC)	HTTP/HTTPS web apps, WAF, Cognito
NLB (via LBC)	TCP/UDP, gRPC, low latency, static IPs
Gateway API	Multi-team, new deployments (recommended)
VPC Lattice	Cross-VPC service-to-service, IAM auth

Security Essentials

Refer to eks-best-practices skill for detailed security architecture patterns including IAM deep-dives, pod security standards, network policies, and secrets management.

IAM Approach	Use When
Pod Identity	New workloads (EKS 1.24+) — simpler, session tags, role chaining
IRSA	Older clusters, Fargate

Key rules:

Use Pod Identity for new workloads
Use EKS access entries (API mode) over aws-auth ConfigMap
Move VPC CNI permissions from node role to Pod Identity/IRSA
Never use wildcard conditions in IRSA trust policies
Never attach application permissions to node IAM roles
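
The rule against wildcard conditions can be checked mechanically when reviewing an existing role. The sketch below scans a trust policy document (already parsed into a dictionary) for condition values containing an asterisk. The function name is hypothetical, and the OIDC provider ID and namespace in the sample policy are placeholders.

```python
def wildcard_conditions(trust_policy):
    """List condition entries in a role trust policy whose values contain '*'."""
    findings = []
    statements = trust_policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM also accepts a single statement object
        statements = [statements]
    for statement in statements:
        for operator, conditions in statement.get("Condition", {}).items():
            for key, value in conditions.items():
                values = value if isinstance(value, list) else [value]
                if any("*" in str(v) for v in values):
                    findings.append(f"{operator} {key} = {value}")
    return findings


# Placeholder provider ID and namespace, for illustration only.
SUB_KEY = "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE:sub"
loose_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "sts:AssumeRoleWithWebIdentity",
        "Condition": {"StringLike": {SUB_KEY: "system:serviceaccount:payments:*"}},
    }],
}
findings = wildcard_conditions(loose_policy)
```

The sample policy lets any service account in the namespace assume the role, which is the pattern the rule is meant to prevent.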

Cost Optimization Quick Wins

Refer to eks-best-practices skill for detailed cost optimization strategies, Spot instance patterns, and right-sizing guidance.

Action	Savings	Effort
Graviton (arm64)	20-40%	Low
Spot for non-critical	60-90%	Low
Karpenter consolidation	20-30%	Low
VPA right-sizing	15-30%	Medium
gp3 over gp2	20% on EBS	Low
VPC endpoints	Eliminate NAT costs	Low

EKS Capabilities

EKS Capabilities are AWS-managed features installed and updated as part of the EKS platform. Evaluate managed vs self-managed for each:

Capability	What It Does	When to Use Managed	When to Self-Manage
ArgoCD	GitOps continuous delivery	Multi-account hub-and-spoke, IAM Identity Center integration, minimal ops	Custom plugins, air-gapped, existing ArgoCD investment
ACK	Manage AWS resources via K8s CRDs	Standard AWS resource management (S3, RDS, IAM)	Specific controller version pinning, custom config
KRO	Platform abstractions via ResourceGraphDefinitions	Golden path templates, multi-resource compositions	Early adoption, custom reconciliation logic

Combined pattern: ArgoCD deploys ACK resources + KRO compositions via GitOps, providing a single workflow for both infrastructure and applications.

Required ADR Categories

Every EKS design must produce ADRs for these decision areas (at minimum):

ADR Category	Decision	Common Alternatives
Deployment Model	Standard vs Auto Mode vs Fargate	Operational overhead vs control
Compute Strategy	Karpenter vs MNG vs Auto Mode	Flexibility vs predictability
Networking Model	CNI mode, ingress pattern	Pod density, traffic routing
Addon Pattern	Pattern 1 vs 2a vs 2b	Terraform-only vs GitOps
Security Model	Pod Identity vs IRSA, PSA levels	Simplicity vs compatibility
Observability	AWS-managed vs open source	Cost vs flexibility
Upgrade Strategy	In-place vs blue-green	Risk vs cost
Container Registry	Centralized ECR vs tenant-managed vs enterprise (Artifactory/Harbor)	Isolation vs simplicity
EKS Capabilities	Self-managed addons vs EKS managed capabilities (ArgoCD, ACK, KRO)	Control vs operational overhead

Additional ADRs as needed for: multi-tenancy, multi-account, service mesh, compliance framework, DR strategy.

AGENTS.md Specification

Generate AGENTS.md as a machine-readable handoff to eks-build:

<agent name="eks-build">
  <required-reading>
    <file path="architecture/system-architecture.md" purpose="Cluster architecture, component specs, networking, security posture" />
    <file path="architecture/security-architecture.md" purpose="Security controls, IAM model, encryption, pod security" />
  </required-reading>
  <optional-reading>
    <file path="architecture/architecture-decision-records/" purpose="ADRs for all technology choices" />
    <file path="appendices/architecture-integration-validation.md" purpose="Validation results and service limit analysis" />
  </optional-reading>
  <design-decisions>
    <decision key="pattern" value="[1|2a|2b]" />
    <decision key="compute" value="[karpenter|mng|auto-mode|fargate]" />
    <decision key="iam-model" value="[pod-identity|irsa|mixed]" />
    <decision key="air-gapped" value="[true|false]" />
    <decision key="proxy" value="[true|false]" />
    <decision key="private-registry" value="[true|false]" />
    <decision key="compliance" value="[standard|strict]" />
  </design-decisions>
</agent>
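
Because eks-build reads the design decisions mechanically, it can help to validate the values before writing them out. The sketch below builds only the design-decisions element with Python's standard xml.etree.ElementTree module. The allowed values mirror the placeholders in the specification above; the function name and the sample decisions are hypothetical.

```python
import xml.etree.ElementTree as ET

# Allowed values copied from the placeholders in the specification above.
ALLOWED = {
    "pattern": {"1", "2a", "2b"},
    "compute": {"karpenter", "mng", "auto-mode", "fargate"},
    "iam-model": {"pod-identity", "irsa", "mixed"},
    "air-gapped": {"true", "false"},
    "proxy": {"true", "false"},
    "private-registry": {"true", "false"},
    "compliance": {"standard", "strict"},
}


def design_decisions(decisions):
    """Build the <design-decisions> element, rejecting unknown keys or values."""
    missing = set(ALLOWED) - set(decisions)
    if missing:
        raise ValueError(f"missing decisions: {sorted(missing)}")
    root = ET.Element("design-decisions")
    for key, value in decisions.items():
        if value not in ALLOWED.get(key, set()):
            raise ValueError(f"invalid value {value!r} for decision {key!r}")
        ET.SubElement(root, "decision", key=key, value=value)
    return root


# Hypothetical set of decisions for one project.
element = design_decisions({
    "pattern": "2a",
    "compute": "karpenter",
    "iam-model": "pod-identity",
    "air-gapped": "false",
    "proxy": "false",
    "private-registry": "true",
    "compliance": "strict",
})
xml_text = ET.tostring(element, encoding="unicode")
```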

Detailed References

This skill uses progressive disclosure — essential guidance is above, detailed reference material is loaded on demand:

Output Structure (references/output-structure.md): Read when you need detailed file descriptions, naming conventions, and organization principles for the design output folder
Architecture Validation (references/architecture-validation.md): Read when running Stage 3 or Stage 4 validation; contains the full scoring matrix, criteria details, and report format

For detailed topic-specific reference material (autoscaling, networking, security, observability, cost, reliability, upgrades, container registry, Terraform examples), refer to the eks-best-practices skill which maintains canonical copies of all decision matrices and deep-dive guidance.