eks-design

By aws-samples. Updated 6/4/2026

name: eks-design
description: Use when designing EKS architecture. Generates design documents with Mermaid diagrams, ADRs, security architecture, and validation reports. Translates requirements into tailored EKS designs guided by Well-Architected best practices. Covers cluster architecture, compute, networking, security, addons, observability, cost, and upgrade strategy. Also use when reviewing or validating existing EKS architectures, planning networking or security, evaluating deployment models, or generating architecture diagrams. Skip for short advisory recommendations without a formal document (eks-best-practices), Internal Developer Platforms or progressive delivery (eks-platform-engineering), and GenAI/LLM workload design such as GPU vs Neuron, vLLM/Ray serving, and distributed training (eks-genai).

EKS Design

Generate architecture design documents for production-ready EKS deployments. All output is structured for direct handoff to eks-build for code generation.

When to Use

  • Designing a new EKS cluster architecture from requirements
  • Reviewing or validating existing EKS architecture decisions
  • Choosing between EKS compute options (Karpenter, MNG, Auto Mode, Fargate)
  • Planning EKS networking or security architecture
  • Evaluating EKS deployment models (Standard, Auto Mode, Outposts, Anywhere)
  • Optimizing EKS cost and scalability
  • Generating architecture documentation, ADRs, or Mermaid diagrams for EKS
  • Generating standalone Mermaid architecture diagrams (EKS topology, VPC layout, subnet tiers, node groups, pod networking flows, load balancer placement)
  • Creating design artifacts that feed into eks-build for implementation

Don't Use

  • Generating Terraform code or Helm charts (use eks-build)
  • EKS cluster reconnaissance or discovery (use eks-recon)
  • Terraform module design or testing (use terraform-skill)
  • Detailed reference material on autoscaling, networking, security, observability, cost, reliability, or upgrades (use eks-best-practices)
  • Internal Developer Platforms, Backstage portals, golden paths, progressive delivery, or developer self-service (use eks-platform-engineering)
  • GenAI / LLM workload design — GPU vs Trainium/Inferentia selection, vLLM / Ray Serve / distributed-training architecture, ML storage (FSx for Lustre), or GPU/Neuron scheduling (use eks-genai). Design the cluster here; design the GenAI workload on it there.

Design Output Format

Design documents describe WHAT and WHY — never HOW.

| USE in design output | DO NOT USE in design output |
|---|---|
| Decision tables (compare options) | YAML manifests (K8s, Helm, Kustomize) |
| Mermaid diagrams (architecture, flows) | Bash/CLI commands (aws, kubectl, helm) |
| ASCII flow diagrams (sequences, pipelines) | JSON/HCL (IAM policies, Terraform) |
| Bullet summaries (components, integration) | Code snippets (Python, Go, PromQL, SQL) |
| DO/DON'T lists (security, operations) | Step-by-step deployment procedures |

Rule: If you find yourself writing a code block, stop and convert it to a table, diagram, or description. Implementation code belongs in eks-build.

How to use references: Skill references contain decision frameworks, comparison tables, and architecture patterns. Use them to INFORM your design decisions — do not copy reference content into design documents. Synthesize knowledge into project-specific recommendations.

Internet search (MANDATORY before generating): Before writing any design content, you MUST search the internet to determine the latest EKS version, tool versions (Karpenter, ArgoCD, Kyverno, etc.), and AWS service updates. Do NOT use version numbers from reference files — they are illustrative only and may be outdated. Always verify the chosen EKS version is in standard support (not extended or EOL). Never rely solely on cached knowledge for version numbers.

Design Workflow

MANDATORY: The validation loop (Stages 3-4) is NOT optional. Every design MUST be scored after generation. If the score is below threshold, you MUST fix the gaps and re-score. Do NOT skip to Stage 5 (Handoff) without a passing score. Do NOT present the design to the user as "complete" until it passes. The scoring loop is what separates a draft from a validated design.

Stage 1: Input Assessment

Analyze available inputs (requirements documents, meeting notes, technical assessments) to extract:

  • Business context: Project scope, stakeholders, success criteria, timeline, budget
  • Technical context: Existing VPC/network, compliance requirements, tooling preferences
  • Constraints: Air-gapped, proxy, private registry, multi-account, regulatory

Output: appendices/input-assessment-analysis.md

Rules:

  • All information must come from verifiable sources — never invent or assume
  • Focus on WHAT (requirements), not HOW (architecture) — no technology selections yet
  • Document gaps honestly rather than filling with assumptions

Stage 2: Architecture Generation

Generate EKS architecture based on requirements. Use the decision frameworks below and search the internet for latest AWS best practices when requirements don't match existing patterns.

Refer to eks-best-practices skill for detailed reference material on autoscaling, networking, security, observability, cost optimization, reliability, and cluster upgrades.

Process:

  1. Select EKS deployment model (Standard, Auto Mode, Fargate, Outposts, Anywhere)
  2. Select compute strategy using the Compute Selection Matrix
  3. Select networking model (VPC CNI mode, ingress pattern)
  4. Select addon management pattern (Pattern 1, 2a, or 2b — see eks-build)
  5. Design security posture (IAM model, PSA levels, secrets, encryption)
  6. Design observability stack
  7. Design upgrade strategy
  8. Document each significant decision as an ADR

Output depends on what the user asked for:

  • Comprehensive design (user asks for "full design", "system architecture", or doesn't specify a focus): Generate architecture/system-architecture.md covering ALL requirements (compute, networking, addons, security, observability, multi-tenancy, upgrades, cost, DR, constraints). Structure the document with: (1) Executive summary and requirements recap, (2) Cluster architecture overview with Mermaid diagrams (cluster topology, VPC/subnet layout, addon architecture, data flow), (3) Component specifications for cluster, node groups, addons, networking, security, and observability, (4) Integration points with external systems (CI/CD, registries, monitoring), (5) Customization requirements (air-gapped, proxy, private registry, compliance).
  • Focused design (user asks for "security design", "CI/CD design", "networking design", etc.): Generate architecture/<focus>-architecture.md as the PRIMARY document, going deep on that specific domain. Do NOT force comprehensive coverage when the user asked for a focused design.
  • Comprehensive + supplementary: When generating a comprehensive design, optionally also generate a <focus>-architecture.md deep-dive if a domain is complex enough (e.g., HIPAA security, multi-tenant CI/CD).

ADRs: architecture/architecture-decision-records/ADR-*.md. Every significant technology choice must have an ADR. Each ADR follows the format: Context → Decision → Alternatives Considered → Rationale → Consequences → Research Sources. Name files ADR-001-compute-strategy.md, ADR-002-networking-model.md, etc. For comprehensive designs, produce 7-9+ ADRs. For focused designs, produce ADRs relevant to the focus area.
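The naming and section order described above can be sketched as a small helper for the agent's own bookkeeping (it is tooling around the workflow, not design-document content). This is a minimal sketch: the function names `adr_filename` and `adr_skeleton`, the `# ADR-NNN: Title` heading style, and the `TODO` placeholders are assumptions made for illustration, not part of the skill.

```python
import re

# Section order taken from the ADR format described above.
ADR_SECTIONS = [
    "Context",
    "Decision",
    "Alternatives Considered",
    "Rationale",
    "Consequences",
    "Research Sources",
]


def adr_filename(number: int, title: str) -> str:
    """Build a name such as ADR-001-compute-strategy.md from a number and title."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"ADR-{number:03d}-{slug}.md"


def adr_skeleton(number: int, title: str) -> str:
    """Return an empty ADR body with the six sections in order (layout is assumed)."""
    lines = [f"# ADR-{number:03d}: {title}", ""]
    for section in ADR_SECTIONS:
        lines += [f"## {section}", "", "TODO", ""]
    return "\n".join(lines)


print(adr_filename(1, "Compute Strategy"))  # ADR-001-compute-strategy.md
```

A skeleton like this only guarantees that no section is forgotten; the content of each section still has to come from the requirements and research.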

Stage 3: Architecture Validation (MANDATORY — DO NOT SKIP)

You MUST run this stage after generating any design documents. Score the design against five validation dimensions. If the score is below 85/100, you MUST fix the identified gaps before proceeding. This is the quality gate between "draft" and "validated design."

Validation dimensions (each scored per references/architecture-validation.md):

| Dimension | Points | What to Evaluate |
|---|---|---|
| Requirements Coverage | /25 | Every requirement has an architectural solution |
| Component Integration | /20 | All interfaces defined and compatible, data flows documented |
| Service Limits | /15 | AWS service limits assessed with mitigation for high-risk items |
| Technical Feasibility | /20 | Technology choices validated, EKS-specific checks pass |
| Documentation Completeness | /20 | All required docs present, narrative quality (not just tables), ADR quality, diagrams rendered to PNG and embedded in docx/pptx |

Output: appendices/architecture-integration-validation.md (validation report) and appendices/iterations/score-sheet-iteration-1.md (score sheet)

Scoring thresholds:

  • >= 85/100: PASSED — proceed to Stage 4
  • 70-84: CONDITIONAL — fix identified gaps, re-score as next iteration
  • < 70: FAILED — significant rework needed

How to score: For each dimension, evaluate every criterion in the scoring matrix (see reference), assign points with specific justification, document gaps, and calculate the total. Be honest: inflated scores lead to weak designs that fail during build.
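The Stage 3 arithmetic can be illustrated with a short sketch: sum the five dimension scores, reject any score outside its maximum, and map the total to the documented thresholds. The function names and the example scores are hypothetical; the per-criterion scoring matrix lives in references/architecture-validation.md and is not reproduced here.

```python
# Maximum points per Stage 3 dimension, as listed in the dimensions table.
STAGE3_MAX = {
    "Requirements Coverage": 25,
    "Component Integration": 20,
    "Service Limits": 15,
    "Technical Feasibility": 20,
    "Documentation Completeness": 20,
}


def stage3_total(scores: dict[str, int]) -> int:
    """Sum the dimension scores after checking each one against its maximum."""
    total = 0
    for name, maximum in STAGE3_MAX.items():
        points = scores[name]  # raises KeyError if a dimension was not scored
        if not 0 <= points <= maximum:
            raise ValueError(f"{name}: {points} is outside 0..{maximum}")
        total += points
    return total


def stage3_status(total: int) -> str:
    """Map a total out of 100 to the documented Stage 3 outcome."""
    if total >= 85:
        return "PASSED"
    if total >= 70:
        return "CONDITIONAL"
    return "FAILED"


# Hypothetical example scores, for illustration only.
example = {
    "Requirements Coverage": 22,
    "Component Integration": 17,
    "Service Limits": 12,
    "Technical Feasibility": 18,
    "Documentation Completeness": 16,
}
print(stage3_total(example), stage3_status(stage3_total(example)))  # 85 PASSED
```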

Stage 4: Quality Review & Iteration (MANDATORY — DO NOT SKIP)

You MUST run this stage after Stage 3 passes. Apply weighted scoring across architecture quality dimensions. If the score is below 90/100, you MUST fix the gaps and re-score. Do NOT skip to handoff with a score below 90.

Scoring dimensions (weighted):

| Dimension | Weight | What to Evaluate |
|---|---|---|
| Architecture & Design | 30% | Patterns, component design, integration, technology choices |
| Security | 25% | IAM, pod security, network security, encryption, secrets |
| Reliability & Operations | 20% | HA, PDBs, health probes, upgrades, observability, security tool monitoring |
| Cost & Scalability | 15% | Right-sizing, Spot/Graviton, consolidation, service limits |
| Implementation Readiness | 10% | Handoff completeness, ADR quality, build skill compatibility |
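One way to read the weights is as a weighted sum of per-dimension ratings. The sketch below assumes each dimension is rated on a 0-100 scale before weighting; that scale, the function name, and the example ratings are assumptions, since the text does not say how individual dimensions are rated.

```python
# Stage 4 weights, as listed in the weighted dimensions table.
STAGE4_WEIGHTS = {
    "Architecture & Design": 0.30,
    "Security": 0.25,
    "Reliability & Operations": 0.20,
    "Cost & Scalability": 0.15,
    "Implementation Readiness": 0.10,
}


def stage4_score(ratings: dict[str, float]) -> float:
    """Weighted sum of per-dimension ratings (each assumed to be on a 0-100 scale)."""
    for name, rating in ratings.items():
        if name not in STAGE4_WEIGHTS:
            raise KeyError(f"unknown dimension: {name}")
        if not 0 <= rating <= 100:
            raise ValueError(f"{name}: {rating} is outside 0..100")
    total = sum(STAGE4_WEIGHTS[name] * ratings[name] for name in STAGE4_WEIGHTS)
    return round(total, 2)


# Hypothetical ratings, for illustration only.
ratings = {
    "Architecture & Design": 95,
    "Security": 90,
    "Reliability & Operations": 85,
    "Cost & Scalability": 80,
    "Implementation Readiness": 100,
}
print(stage4_score(ratings))  # 90.0, which meets the 90/100 gate
```

Because Security carries 25% of the weight, a weak security rating is hard to offset with high ratings in the lighter dimensions.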

Output: appendices/iterations/score-sheet-iteration-X.md

Iteration rules:

  • Maximum 5 iterations to reach 90/100
  • Each iteration must show measurable progress (score must increase)
  • If the same gap persists across 2 iterations, escalate to the user
  • Final iteration content is promoted to root-level folders
  • Every score sheet must include: score per dimension, delta from previous iteration, specific gaps, and recommended fixes
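The iteration rules can be expressed as a check over the score-sheet history. In this sketch the `ScoreSheet` type and `next_action` function are hypothetical, and two behaviours are assumptions because the rules do not spell them out: a non-increasing score and reaching the fifth iteration below target are both treated as reasons to escalate to the user.

```python
from dataclasses import dataclass, field

MAX_ITERATIONS = 5  # "Maximum 5 iterations to reach 90/100"
TARGET = 90


@dataclass
class ScoreSheet:
    iteration: int
    score: float
    gaps: set[str] = field(default_factory=set)  # short labels for open gaps


def next_action(history: list[ScoreSheet]) -> tuple[str, str]:
    """Return (action, reason) after the latest score sheet in the history."""
    latest = history[-1]
    if latest.score >= TARGET:
        return ("handoff", "target reached")
    if len(history) >= 2:
        previous = history[-2]
        repeated = latest.gaps & previous.gaps
        if repeated:
            return ("escalate", f"gap persisted across 2 iterations: {sorted(repeated)}")
        if latest.score <= previous.score:
            # Assumption: no measurable progress is escalated rather than retried.
            return ("escalate", "score did not increase")
    if len(history) >= MAX_ITERATIONS:
        # Assumption: running out of iterations is escalated to the user.
        return ("escalate", "iteration limit reached")
    return ("iterate", "fix gaps and re-score")


history = [ScoreSheet(1, 80, {"pdb-coverage"}), ScoreSheet(2, 86, {"pdb-coverage"})]
print(next_action(history)[0])  # escalate
```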

The validation loop pattern:

Generate design -> Score (Stage 3) -> Below 85?    -> Fix gaps -> Re-score
                                   -> 85 or above? -> Score (Stage 4) -> Below 90?    -> Fix gaps -> Re-score
                                                                      -> 90 or above? -> Proceed to Stage 5
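The two-gate flow can also be sketched as control flow. Here `score` and `fix` are caller-supplied callables standing in for the scoring and gap-fixing work, and all names are hypothetical. Applying a limit of five rounds to Stage 3 is an assumption; the text states that limit only for Stage 4.

```python
from typing import Callable


def run_gate(score: Callable[[], float], fix: Callable[[], None],
             threshold: float, max_rounds: int = 5) -> bool:
    """Score the design; while it is below the threshold, fix gaps and re-score."""
    for round_number in range(1, max_rounds + 1):
        if score() >= threshold:
            return True
        if round_number < max_rounds:
            fix()
    return False


def validate(stage3_score: Callable[[], float],
             stage4_score: Callable[[], float],
             fix: Callable[[], None]) -> str:
    """Run Stage 3 (85/100) and then Stage 4 (90/100) before allowing handoff."""
    if not run_gate(stage3_score, fix, threshold=85):
        return "blocked at Stage 3"
    if not run_gate(stage4_score, fix, threshold=90):
        return "blocked at Stage 4"
    return "proceed to Stage 5"


# Toy stand-in: each fix raises both scores by 3 points.
state = {"fixes": 0}


def apply_fix() -> None:
    state["fixes"] += 1


result = validate(lambda: 80 + 3 * state["fixes"],
                  lambda: 82 + 3 * state["fixes"],
                  apply_fix)
print(result, state["fixes"])  # proceed to Stage 5 3
```

The point of the structure is that Stage 4 is never scored until Stage 3 has passed, and handoff is unreachable without both gates returning true.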

Stage 5: Finalize & Handoff

COMPLETION CHECKLIST — every item must be done before handoff. Walk through this list at the end. If any item is unchecked, go back and complete it.

  • Internet search for latest EKS version, tool versions, and AWS service updates (not from cached knowledge)
  • Architecture documents generated — architecture/system-architecture.md (or architecture/<focus>-architecture.md for focused designs) exists with narrative prose + diagrams
  • ADRs generated — architecture/architecture-decision-records/ADR-*.md files exist (minimum 6 for comprehensive, domain-relevant for focused)
  • Security architecture generated — architecture/security-architecture.md exists (if applicable)
  • Stage 3 validation scored — appendices/architecture-integration-validation.md exists with score >= 85/100
  • Stage 4 quality review scored — appendices/iterations/score-sheet-iteration-*.md exists with Stage 4 score >= 90/100
  • Every section has narrative prose — no table-only or bullet-only sections (0/5 narrative = auto-fail)
  • Mermaid diagrams rendered to PNG — diagrams/*.png files exist (high-res, 4x scale, white background)
  • AGENTS.md created — lists which design files the build agent must read
  • README.md created — provides human-readable navigation
  • docx/pptx offered to user — asked if they want Word/PowerPoint versions (only generate if confirmed)
  • If docx/pptx generated: rendered PNGs from diagrams/ embedded in documents (not Mermaid code blocks)

If any item is unchecked, STOP and complete it before proceeding. The files are the proof — if the score sheet doesn't exist, you skipped validation. If diagrams/*.png doesn't exist, you skipped rendering.
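Since the files are treated as proof, the file-based items on the checklist can be checked mechanically. The sketch below only tests that matching files exist under the design folder; it does not read scores, judge narrative quality, or cover the conditional items (security architecture, docx/pptx). The glob patterns are derived from the paths named in the checklist, and the function name is hypothetical.

```python
from pathlib import Path

# Checklist item -> glob pattern relative to projects/<project-name>/design/
REQUIRED_GLOBS = {
    "architecture document": "architecture/*-architecture.md",
    "ADRs": "architecture/architecture-decision-records/ADR-*.md",
    "Stage 3 validation": "appendices/architecture-integration-validation.md",
    "Stage 4 score sheet": "appendices/iterations/score-sheet-iteration-*.md",
    "rendered diagrams": "diagrams/*.png",
    "AGENTS.md": "AGENTS.md",
    "README.md": "README.md",
}


def missing_artifacts(design_dir: Path) -> list[str]:
    """Return the checklist items that have no matching file under design_dir."""
    return [
        label
        for label, pattern in REQUIRED_GLOBS.items()
        if not any(design_dir.glob(pattern))
    ]
```

Calling `missing_artifacts(Path("projects/<project-name>/design"))` with the real project folder returns an empty list only when every file-based item has at least one match.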

Generate handoff artifacts for eks-build:

  1. AGENTS.md — machine-readable instructions listing which design files the build agent must read
  2. README.md — human-readable navigation guide
  3. Verify output structure matches specification
  4. Render Mermaid diagrams to PNG — extract every Mermaid code block from the architecture markdown files, save each as a .mmd file, then convert to PNG in diagrams/. Install and convert: npm install -g @mermaid-js/mermaid-cli && mmdc -i diagram.mmd -o diagrams/<name>.png -b white -s 4. If mmdc doesn't work, search the internet for how to use mermaid-cli to convert .mmd to .png. Requirements: 4x scale, white background, auto-sized canvas (no fixed width/height). Use descriptive kebab-case names (e.g., defense-in-depth-layers.png, pod-identity-flow.png).
  5. Ask the user if they want Word (.docx) and PowerPoint (.pptx) versions. Only generate if confirmed — the docx and aws-pptx skills handle generation. When generating, embed the rendered PNGs from diagrams/ into the documents.

Output: AGENTS.md, README.md, diagrams/*.png, optionally .docx and .pptx
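The extraction step that precedes rendering can be sketched as follows. The regular expression assumes Mermaid blocks are fenced with three backticks and the `mermaid` info string at the start of a line; the function names are hypothetical. The command list reuses the same `mmdc` flags shown in the rendering step (`-i`, `-o`, `-b white`, `-s 4`) and is returned rather than executed.

```python
import re
from pathlib import Path

# Built at runtime so this example does not contain a literal code fence.
FENCE = "`" * 3
MERMAID_BLOCK = re.compile(
    rf"^{FENCE}mermaid[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_mermaid(markdown: str) -> list[str]:
    """Return the body of every Mermaid code block found in the markdown text."""
    return [match.group(1) for match in MERMAID_BLOCK.finditer(markdown)]


def mmdc_command(name: str, out_dir: Path = Path("diagrams")) -> list[str]:
    """Build the mmdc argument list for one diagram (4x scale, white background)."""
    source = out_dir / f"{name}.mmd"
    target = out_dir / f"{name}.png"
    return ["mmdc", "-i", str(source), "-o", str(target), "-b", "white", "-s", "4"]


sample = "# Topology\n\n" + FENCE + "mermaid\ngraph TD\n  A --> B\n" + FENCE + "\n"
print(extract_mermaid(sample))
print(mmdc_command("cluster-topology"))
```

Each extracted body would be written to the `.mmd` path before the matching command is run, for example with `subprocess.run(mmdc_command(name), check=True)`.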

Output Structure

All design output goes to projects/<project-name>/design/:

projects/<project-name>/design/
├── README.md                                # Navigation guide
├── AGENTS.md                                # Build agent instructions
├── architecture/
│   ├── system-architecture.md               # Cluster architecture with Mermaid diagrams
│   ├── architecture-decision-records/
│   │   ├── ADR-001-[decision-name].md
│   │   └── ADR-00X-[decision-name].md
│   └── security-architecture.md             # Security posture design
├── diagrams/                                # Rendered Mermaid diagrams (high-res PNG)
│   ├── cluster-topology.png
│   ├── network-architecture.png
│   └── addon-dependencies.png
├── generate-docx.js                         # DOCX generator script (optional — user must confirm)
├── generate-pptx.js                         # PPTX generator script (optional — user must confirm)
├── system-architecture.docx                 # Word document (optional — with embedded diagrams)
├── system-architecture.pptx                 # PowerPoint deck (optional — with embedded diagrams)
└── appendices/
    ├── input-assessment-analysis.md         # Stage 1 output
    ├── architecture-integration-validation.md # Stage 3 output
    └── iterations/                          # Quality iteration history
        ├── score-sheet-iteration-1.md
        └── score-sheet-iteration-X.md

Detailed file descriptions: See references/output-structure.md.

EKS Architecture Decision Framework

When to Use EKS

| Requirement | EKS | ECS | Lambda |
|---|---|---|---|
| Kubernetes ecosystem | Native K8s | AWS-proprietary | No |
| Portable across clouds | Standard K8s API | AWS-only | AWS-only |
| Long-running services | Yes | Yes | 15 min limit |
| Minimal ops overhead | Medium | Low | Lowest |
| GPU/ML workloads | Best support | Limited | No |
| Complex networking | Full control | Medium | Limited |
| Team has K8s expertise | Required | Not required | Not required |

EKS Deployment Models

| Model | Operational Overhead | Use When |
|---|---|---|
| EKS Standard | Medium-High | Need full customization |
| EKS Auto Mode | Low | Want minimal ops, standard workloads |
| EKS with Fargate | Low | Batch, low-density workloads |
| EKS on Outposts | High | Data residency, low-latency edge |
| EKS Anywhere | Highest | Air-gapped, custom hardware |

Compute Selection Matrix

Refer to eks-best-practices skill for detailed compute comparison tables, Karpenter configuration patterns, and Auto Mode specifics.

| Factor | Fargate | MNG | Karpenter | Auto Mode | Self-Managed |
|---|---|---|---|---|---|
| Best for | Batch, small scale | Stable, predictable | Dynamic, varied | Minimal ops | Custom AMI/kernel |
| Spot support | No | Yes | Yes (native) | Yes | Yes |
| GPU support | No | Yes | Yes | Yes | Yes |
| DaemonSets | No | Yes | Yes | Yes | Yes |
| Node SSH | No | Yes | Yes | No | Yes |

Quick decision guide:

  • Default: Karpenter — best balance of flexibility, cost, and automation
  • Zero ops: EKS Auto Mode — AWS manages everything
  • Serverless/batch: Fargate — no nodes, per-pod billing
  • Predictable: MNG — familiar ASG model
  • Custom: Self-managed — full control, highest overhead
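The yes/no rows of the Compute Selection Matrix can be treated as data and filtered by hard requirements. This is a sketch for narrowing options before writing the compute ADR, not a substitute for the decision itself. The feature keys, the function name, and the use of the quick-guide order as a ranking are assumptions.

```python
# Yes/No rows of the Compute Selection Matrix, encoded as booleans.
COMPUTE_MATRIX = {
    "Fargate":      {"spot": False, "gpu": False, "daemonsets": False, "node_ssh": False},
    "MNG":          {"spot": True,  "gpu": True,  "daemonsets": True,  "node_ssh": True},
    "Karpenter":    {"spot": True,  "gpu": True,  "daemonsets": True,  "node_ssh": True},
    "Auto Mode":    {"spot": True,  "gpu": True,  "daemonsets": True,  "node_ssh": False},
    "Self-Managed": {"spot": True,  "gpu": True,  "daemonsets": True,  "node_ssh": True},
}

# Order of the quick decision guide, used here as a tie-break (an assumption).
PREFERENCE = ["Karpenter", "Auto Mode", "Fargate", "MNG", "Self-Managed"]


def compute_candidates(required: set[str]) -> list[str]:
    """Return compute options that support every required feature, in guide order."""
    return [
        name
        for name in PREFERENCE
        if all(COMPUTE_MATRIX[name][feature] for feature in required)
    ]


print(compute_candidates({"gpu", "node_ssh"}))  # ['Karpenter', 'MNG', 'Self-Managed']
```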

Networking Quick Reference

Refer to eks-best-practices skill for detailed networking patterns including VPC CNI deep-dives, subnet planning, service mesh options, and private cluster configurations.

| VPC CNI Mode | Use When | Pod Density |
|---|---|---|
| Secondary IP (default) | Most workloads | Limited by ENI x IPs per ENI |
| Prefix Delegation | >30 pods/node, IP-constrained | 4-16x more pods |
| Custom Networking | Pods need different CIDR | Same as underlying mode |
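The pod-density limits in the VPC CNI table come down to simple arithmetic. The sketch below uses the commonly published formula for secondary IP mode, ENIs x (IPv4 addresses per ENI - 1) + 2, and the /28 variant (16 addresses per slot) for prefix delegation. The m5.large figures (3 ENIs, 10 IPv4 addresses per ENI) are taken from published EC2 limits and should be verified against current documentation. The prefix-delegation result is an addressing ceiling only; the max-pods value configured on a node is usually capped well below it.

```python
def max_pods_secondary_ip(enis: int, ips_per_eni: int) -> int:
    """Addressable pods in secondary IP mode: one IP per ENI is reserved for the ENI,
    and 2 is added for host-network pods."""
    return enis * (ips_per_eni - 1) + 2


def max_pods_prefix_delegation(enis: int, ips_per_eni: int, prefix_size: int = 16) -> int:
    """Addressing ceiling with /28 prefixes: each secondary slot holds 16 addresses."""
    return enis * (ips_per_eni - 1) * prefix_size + 2


# m5.large: 3 ENIs, 10 IPv4 addresses per ENI (verify for your instance type).
print(max_pods_secondary_ip(3, 10))       # 29
print(max_pods_prefix_delegation(3, 10))  # 434
```

For this instance type the default mode falls just under the 30 pods/node mark in the table, which is why prefix delegation becomes relevant for denser nodes.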

| Ingress Pattern | Best For |
|---|---|
| ALB (via LBC) | HTTP/HTTPS web apps, WAF, Cognito |
| NLB (via LBC) | TCP/UDP, gRPC, low latency, static IPs |
| Gateway API | Multi-team, new deployments (recommended) |
| VPC Lattice | Cross-VPC service-to-service, IAM auth |

Security Essentials

Refer to eks-best-practices skill for detailed security architecture patterns including IAM deep-dives, pod security standards, network policies, and secrets management.

| IAM Approach | Use When |
|---|---|
| Pod Identity | New workloads (EKS 1.24+): simpler, session tags, role chaining |
| IRSA | Older clusters, Fargate |

Key rules:

  • Use Pod Identity for new workloads
  • Use EKS access entries (API mode) over aws-auth ConfigMap
  • Move VPC CNI permissions from node role to Pod Identity/IRSA
  • Never use wildcard conditions in IRSA trust policies
  • Never attach application permissions to node IAM roles

Cost Optimization Quick Wins

Refer to eks-best-practices skill for detailed cost optimization strategies, Spot instance patterns, and right-sizing guidance.

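| Action | Savings | Effort |
|---|---|---|
| Graviton (arm64) | 20-40% | Low |
| Spot for non-critical | 60-90% | Low |
| Karpenter consolidation | 20-30% | Low |
| VPA right-sizing | 15-30% | Medium |
| gp3 over gp2 | 20% on EBS | Low |
| VPC endpoints | Cuts NAT Gateway data-processing charges for AWS service traffic | Low |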

EKS Capabilities

EKS Capabilities are AWS-managed features installed and updated as part of the EKS platform. Evaluate managed vs self-managed for each:

| Capability | What It Does | When to Use Managed | When to Self-Manage |
|---|---|---|---|
| ArgoCD | GitOps continuous delivery | Multi-account hub-and-spoke, IAM Identity Center integration, minimal ops | Custom plugins, air-gapped, existing ArgoCD investment |
| ACK | Manage AWS resources via K8s CRDs | Standard AWS resource management (S3, RDS, IAM) | Specific controller version pinning, custom config |
| KRO | Platform abstractions via ResourceGraphDefinitions | Golden path templates, multi-resource compositions | Early adoption, custom reconciliation logic |

Combined pattern: ArgoCD deploys ACK resources + KRO compositions via GitOps, providing a single workflow for both infrastructure and applications.

Required ADR Categories

Every EKS design must produce ADRs for these decision areas (at minimum):

| ADR Category | Decision | Common Alternatives |
|---|---|---|
| Deployment Model | Standard vs Auto Mode vs Fargate | Operational overhead vs control |
| Compute Strategy | Karpenter vs MNG vs Auto Mode | Flexibility vs predictability |
| Networking Model | CNI mode, ingress pattern | Pod density, traffic routing |
| Addon Pattern | Pattern 1 vs 2a vs 2b | Terraform-only vs GitOps |
| Security Model | Pod Identity vs IRSA, PSA levels | Simplicity vs compatibility |
| Observability | AWS-managed vs open source | Cost vs flexibility |
| Upgrade Strategy | In-place vs blue-green | Risk vs cost |
| Container Registry | Centralized ECR vs tenant-managed vs enterprise (Artifactory/Harbor) | Isolation vs simplicity |
| EKS Capabilities | Self-managed addons vs EKS managed capabilities (ArgoCD, ACK, KRO) | Control vs operational overhead |

Additional ADRs as needed for: multi-tenancy, multi-account, service mesh, compliance framework, DR strategy.

AGENTS.md Specification

Generate AGENTS.md as a machine-readable handoff to eks-build:

<agent name="eks-build">
  <required-reading>
    <file path="architecture/system-architecture.md" purpose="Cluster architecture, component specs, networking, security posture" />
    <file path="architecture/security-architecture.md" purpose="Security controls, IAM model, encryption, pod security" />
  </required-reading>
  <optional-reading>
    <file path="architecture/architecture-decision-records/" purpose="ADRs for all technology choices" />
    <file path="appendices/architecture-integration-validation.md" purpose="Validation results and service limit analysis" />
  </optional-reading>
  <design-decisions>
    <decision key="pattern" value="[1|2a|2b]" />
    <decision key="compute" value="[karpenter|mng|auto-mode|fargate]" />
    <decision key="iam-model" value="[pod-identity|irsa|mixed]" />
    <decision key="air-gapped" value="[true|false]" />
    <decision key="proxy" value="[true|false]" />
    <decision key="private-registry" value="[true|false]" />
    <decision key="compliance" value="[standard|strict]" />
  </design-decisions>
</agent>
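A consumer of this handoff could read it with a standard XML parser. The sketch below assumes AGENTS.md contains only the `<agent>` element shown above, with no surrounding prose. The function names are hypothetical, and treating any value that still starts with `[` as an unfilled template placeholder is an assumption for illustration.

```python
import xml.etree.ElementTree as ET


def read_handoff(agents_md: str) -> dict:
    """Parse the <agent> element into reading lists and a decisions mapping."""
    root = ET.fromstring(agents_md.strip())
    return {
        "agent": root.get("name"),
        "required": [f.get("path") for f in root.findall("./required-reading/file")],
        "optional": [f.get("path") for f in root.findall("./optional-reading/file")],
        "decisions": {
            d.get("key"): d.get("value")
            for d in root.findall("./design-decisions/decision")
        },
    }


def unresolved_decisions(decisions: dict) -> list[str]:
    """Return keys whose value still looks like a template placeholder, e.g. [1|2a|2b]."""
    return [key for key, value in decisions.items() if value.startswith("[")]


sample = """
<agent name="eks-build">
  <required-reading>
    <file path="architecture/system-architecture.md" purpose="Cluster architecture" />
  </required-reading>
  <design-decisions>
    <decision key="pattern" value="2a" />
    <decision key="compute" value="[karpenter|mng|auto-mode|fargate]" />
  </design-decisions>
</agent>
"""
handoff = read_handoff(sample)
print(handoff["decisions"]["pattern"])             # 2a
print(unresolved_decisions(handoff["decisions"]))  # ['compute']
```

A check like `unresolved_decisions` catches a handoff file where the template was copied without filling in every decision.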

Detailed References

This skill uses progressive disclosure — essential guidance is above, detailed reference material is loaded on demand:

  • Output Structure — Read when you need detailed file descriptions, naming conventions, and organization principles for the design output folder
  • Architecture Validation — Read when running Stage 3 or Stage 4 validation; contains the full scoring matrix, criteria details, and report format

For detailed topic-specific reference material (autoscaling, networking, security, observability, cost, reliability, upgrades, container registry, Terraform examples), refer to the eks-best-practices skill which maintains canonical copies of all decision matrices and deep-dive guidance.

Install via CLI
npx skills add https://github.com/aws-samples/sample-apex-skills --skill eks-design