eks-best-practices

star 20

Advisory guidance for Amazon EKS architecture and configuration decisions — compute strategy, networking, security, reliability, cost, autoscaling, observability, multi-tenancy, and upgrade planning. Also answers Terraform configuration questions about terraform-aws-modules/terraform-aws-eks. Use for any EKS planning or architectural judgment call, even when phrased casually. Do NOT use for generating documents or code (eks-design, eks-build), scoring or auditing a live cluster (eks-operation-review, eks-upgrade-check), discovering what is running (eks-recon), MCP tooling setup (eks-mcp-server), or building developer platforms and IDPs (eks-platform-engineering).

aws-samples By aws-samples schedule Updated 6/4/2026

name: eks-best-practices description: Advisory guidance for Amazon EKS architecture and configuration decisions — compute strategy, networking, security, reliability, cost, autoscaling, observability, multi-tenancy, and upgrade planning. Also answers Terraform configuration questions about terraform-aws-modules/terraform-aws-eks. Use for any EKS planning or architectural judgment call, even when phrased casually. Do NOT use for generating documents or code (eks-design, eks-build), scoring or auditing a live cluster (eks-operation-review, eks-upgrade-check), discovering what is running (eks-recon), MCP tooling setup (eks-mcp-server), building developer platforms and IDPs (eks-platform-engineering), or GenAI/LLM workload decisions — GPU vs Trainium/Inferentia, vLLM/Ray serving, distributed training, ML storage (eks-genai).

EKS Best Practices

Comprehensive guidance for designing, deploying, and operating Amazon EKS clusters. Consolidates guidance from the AWS EKS Best Practices Guide, AWS EKS HA/Resiliency Guide, and terraform-aws-modules/terraform-aws-eks examples.

When to Use This Skill

Activate this skill when:

  • Designing a new EKS cluster architecture
  • Choosing between EKS compute options (Fargate, MNG, Karpenter, Auto Mode)
  • Configuring EKS networking (VPC CNI, ingress, service mesh)
  • Implementing EKS security (IAM, pod security, secrets)
  • Planning cluster upgrades or migrations
  • Reviewing EKS architecture decisions
  • Working with terraform-aws-modules/terraform-aws-eks examples
  • Optimizing EKS cost or scaling to large clusters

Don't use this skill for:

  • Generic Kubernetes concepts (Claude knows these)
  • Provider-specific API reference (link to AWS docs)
  • Non-EKS container orchestration (ECS, Lambda)
  • Step-by-step EKS upgrade execution — this skill covers upgrade strategy and architectural decisions, not the per-version procedures themselves.

EKS Architecture Decision Framework

When to Use EKS

Requirement EKS ECS Lambda
Kubernetes ecosystem ✅ Native K8s ❌ AWS-proprietary
Portable across clouds ✅ Standard K8s API ❌ AWS-only ❌ AWS-only
Long-running services ⚠️ 15 min limit
Minimal ops overhead Medium Low Lowest
GPU/ML workloads ✅ Best support Limited
Complex networking ✅ Full control Medium Limited
Team has K8s expertise Required Not required Not required

EKS Deployment Models

Model Description Operational Overhead Use When
EKS Standard Full control over nodes, add-ons, networking Medium-High Need full customization
EKS Auto Mode AWS manages nodes, add-ons, scaling Low Want minimal ops, standard workloads
EKS with Fargate Serverless pods, per-pod billing Low Batch, low-density workloads
EKS on Outposts Run EKS on-premises High Data residency, low-latency edge
EKS Anywhere EKS on your own infrastructure Highest Air-gapped, custom hardware

Shared Responsibility

Component AWS Manages You Manage
Control plane API server, etcd, HA, patching RBAC, admission control, audit logging
Data plane (MNG) AMI updates, node health Instance type, scaling, pod scheduling
Data plane (Fargate) Everything Pod spec, resource requests
Data plane (Auto Mode) Node lifecycle, OS patching Workload definitions
Networking ENI attachment, VPC CNI releases Subnet design, IP planning, ingress
Security Control plane auth IAM, pod security, secrets, network policies

Compute Selection Matrix

Decision Table

Factor Fargate MNG Karpenter Auto Mode Self-Managed
Best for Batch, small scale Stable, predictable Dynamic, varied Minimal ops Custom AMI/kernel
Scaling Per-pod ASG-based Fast, flexible AWS-managed Manual ASG
Spot support ✅ Native
GPU support
DaemonSets
Cost model Per vCPU/GB/hr Per EC2 instance Per EC2 instance Per EC2 instance Per EC2 instance
Max pods/node 1 ENI-based ENI-based AWS-managed ENI-based
Node SSH
Operational Lowest Low Low Lowest Highest

Quick Decision Guide

  • Default choice: Karpenter — best balance of flexibility, cost, and automation
  • Zero ops priority: EKS Auto Mode — AWS manages nodes, add-ons, and scaling via managed Karpenter. Best for teams that want Kubernetes benefits without operational overhead around upgrades, autoscaling, load balancing, and storage
  • Serverless/batch: Fargate — no nodes to manage, per-pod billing
  • Predictable, stable: MNG — familiar ASG model, managed updates
  • Custom requirements: Self-managed — full control, highest overhead

✅ DO:

  • Use Karpenter as the default node autoscaler for new clusters
  • Run system components (CoreDNS, Karpenter) on MNG or Fargate
  • Use multiple instance types for availability and cost optimization

❌ DON'T:

  • Use self-managed nodes without a specific technical requirement
  • Run Fargate for GPU or DaemonSet-dependent workloads
  • Mix Karpenter and Cluster Autoscaler on the same node groups

Networking Quick Reference

VPC CNI Mode Decision

Mode Use When Pod Density
Secondary IP (default) Most workloads, simple setup Limited by ENI × IPs per ENI
Prefix Delegation >30 pods/node, IP-constrained VPC 4-16× more pods per node
Custom Networking Pods need different CIDR than nodes Same as underlying mode

Ingress Pattern Selection

Pattern Best For Key Feature
ALB (via LBC) HTTP/HTTPS web apps Native WAF, Cognito auth
NLB (via LBC) TCP/UDP, gRPC, low latency Static IPs, source IP preservation
Gateway API Multi-team, new deployments ✅ Recommended standard
VPC Lattice Cross-VPC service-to-service No sidecar, IAM auth

IPv4 vs IPv6

Factor IPv4 IPv6
Default choice ✅ Yes When facing IP exhaustion
AWS service support Full Most (check specific services)
Complexity Standard Requires dual-stack VPC

For detailed networking guidance, see: Networking — VPC CNI & IP | Networking — Ingress & DNS

Security Essentials

IAM Strategy

Approach Use When Setup
Pod Identity ✅ New workloads (EKS 1.24+) EKS add-on + association
IRSA Older clusters, Fargate OIDC provider + trust policy

Key rules:

  • ✅ Use Pod Identity for new workloads — simpler setup, session tags, role chaining
  • ✅ Use EKS access entries (API mode) over aws-auth ConfigMap
  • ✅ Move VPC CNI permissions from node role to Pod Identity/IRSA
  • ❌ Don't use wildcard conditions in IRSA trust policies
  • ❌ Don't attach application permissions to node IAM roles

Pod Security Baseline

Apply Pod Security Admission (PSA) labels to all namespaces:

# Minimum: enforce baseline, warn on restricted
metadata:
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/warn: restricted

Secrets Management

Approach Complexity Best For
External Secrets Operator Medium ✅ GitOps workflows
Secrets Store CSI Medium Mount secrets as volumes
KMS envelope encryption Low Encrypt etcd secrets

Always enable KMS envelope encryption for Kubernetes secrets.

For detailed security guidance, see: Security Reference | Runtime & Network | Supply Chain & Compliance

Reliability Essentials

Pod Disruption Budgets

Create PDBs for every production workload with >1 replica:

Workload Recommended PDB
Stateless (3+ replicas) minAvailable: "50%"
Stateful quorum (3) maxUnavailable: 1
Batch/job maxUnavailable: "50%"
Singleton No PDB (would block all disruptions)

Health Probe Strategy

Probe Purpose Key Rule
Startup Wait for slow init Use for apps >10s startup
Readiness Traffic routing ✅ Check dependencies here
Liveness Detect deadlocks ❌ Never check dependencies

Critical rule: Liveness probes must NOT check external dependencies. If the database goes down and liveness checks the DB, ALL pods restart — causing cascading failure.

Graceful Shutdown Pattern

spec:
  terminationGracePeriodSeconds: 60
  containers:
  - lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 15"]

Why sleep 15: Gives kube-proxy and load balancer time to remove the pod from traffic routing before SIGTERM.

Multi-AZ Distribution

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule

For detailed reliability guidance, see: Reliability & Resiliency — Core (see also reliability-advanced.md for DR, deployment strategies, and large-cluster guidance)

Cluster Upgrade Strategy

Upgrade Sequence (Strict Order)

1. Control Plane → 2. EKS Add-ons → 3. Data Plane → 4. Custom Add-ons

Pre-Upgrade Checklist

  1. Check EKS Cluster Insights for upgrade readiness
  2. Scan for deprecated APIs (Pluto, kube-no-trouble)
  3. Verify add-on compatibility with target version
  4. Test in non-prod environment first
  5. Ensure PDBs are configured for graceful node drain
  6. Back up cluster state (Velero or GitOps repo)

Upgrade Strategy Decision

Factor In-Place Blue-Green
Risk Low-Medium Lowest
Cost No extra 2× during migration
Rollback ❌ No CP rollback ✅ Switch back
Use when ✅ Most upgrades Critical workloads

Data Plane with Karpenter

Karpenter automatically replaces nodes via drift detection after control plane upgrade. Control the speed with disruption.budgets:

disruption:
  budgets:
  - nodes: "10%"  # Max 10% of nodes replaced at a time

For detailed upgrade guidance, see: Cluster Upgrades Reference

Autoscaling Quick Reference

Node Autoscaler Selection

Karpenter Cluster Autoscaler Auto Mode
Default choice ✅ Yes Legacy/Outposts Minimal ops
Scale-up speed ~30s ~60-90s AWS-managed
Consolidation ✅ Built-in
Customization High Medium Low

Pod Autoscaler Selection

Scaler Trigger Use Case
HPA CPU, memory, custom Stateless services
VPA Historical usage Right-sizing (recommendation mode)
KEDA External events (SQS, Kafka) Event-driven workloads

For detailed autoscaling guidance, see: Autoscaling Reference | Karpenter Reference

Terraform Examples Quick Start

Based on terraform-aws-modules/terraform-aws-eks.

Example Selection

Starting Point Recommended Example
General production karpenter (MNG for system + Karpenter for workloads)
Minimal ops eks-auto-mode
Managed nodes eks-managed-node-group (AL2023 or Bottlerocket)
Full node control self-managed-node-group
Platform capabilities eks-capabilities (ArgoCD, ACK, KRO)
Hybrid/edge eks-hybrid-nodes

Common Deployment Topologies

Private cluster with Karpenter:

VPC (3 AZs, terraform-aws-modules/vpc/aws)
├── Private subnets → EKS nodes (MNG for system, Karpenter for workloads)
├── Public subnets  → ALB (internet-facing)
├── Intra subnets   → EKS control plane ENIs
└── NAT Gateway     → 1 per AZ for production

Multi-tenant platform:

EKS Cluster (terraform-aws-modules/eks/aws)
├── kube-system         (platform: CoreDNS, kube-proxy, VPC CNI)
├── karpenter           (Karpenter controller on MNG)
├── monitoring          (shared: Prometheus, Grafana)
├── ingress             (shared: AWS LBC)
├── team-a namespace    (RBAC, NetworkPolicy, ResourceQuota)
├── team-b namespace    (RBAC, NetworkPolicy, ResourceQuota)
└── team-c namespace    (RBAC, NetworkPolicy, ResourceQuota)

For detailed examples and terraform patterns, see: Terraform Examples Reference

Cost Optimization Quick Wins

Action Savings Effort
Graviton (arm64) 20-40% Low
Spot for non-critical 60-90% Low
Karpenter consolidation 20-30% Low
VPA right-sizing 15-30% Medium
gp3 over gp2 20% on EBS Low
VPC endpoints Eliminate NAT costs Low

For detailed cost guidance, see: Cost Optimization Reference | For scalability guidance, see: Scalability Reference

Observability Quick Reference

Pillar AWS-Managed Open Source
Metrics Container Insights AMP + Grafana
Logs CloudWatch Logs OpenSearch, Loki
Traces X-Ray ADOT + Jaeger/Tempo

Essential: Enable EKS audit logging and GuardDuty EKS Runtime Monitoring for security visibility.

For detailed observability guidance, see: Observability Reference

EKS Capabilities

EKS Capabilities are AWS-managed features installed and updated as part of the EKS platform. They run in AWS-owned infrastructure separate from your clusters, with AWS handling scaling, patching, and upgrading.

Capability What It Does When to Use Managed When to Self-Manage
ArgoCD GitOps continuous delivery Multi-account hub-and-spoke, IAM IDC integration, minimal ops Custom plugins, air-gapped, existing ArgoCD investment
ACK Manage AWS resources via K8s CRDs (S3, RDS, IAM, etc.) Standard AWS resource management Specific controller version pinning, custom config
KRO Platform abstractions via ResourceGroupDefinitions Golden path templates, multi-resource compositions Early adoption risk concerns, custom reconciliation logic

Combined pattern: ArgoCD deploys ACK resources + KRO compositions via GitOps, providing a single workflow for both infrastructure and applications.

For detailed ArgoCD patterns, see: ArgoCD Patterns Reference

Sources:

Detailed References

This skill uses progressive disclosure — essential guidance is in this main file, detailed reference material is loaded on demand:

  • Security — IAM, Cluster Access Manager, Pod Identity, IRSA, pod security standards, multi-tenancy, secrets management, data encryption
  • Security — Runtime & Network — Runtime threat detection (GuardDuty, seccomp, AppArmor, Falco), network policies, SG for pods, encryption in transit, detective controls
  • Security — Supply Chain & Compliance — Image security (SBOMs, attestations, ECR hardening), infrastructure hardening (Bottlerocket, CIS benchmarks), regulatory compliance, incident response
  • Networking — VPC CNI modes (secondary IP, prefix delegation, custom networking), subnet/CIDR planning, IPv4 vs IPv6, Security Groups for Pods, IP address management
  • Networking — Ingress & DNS — Ingress patterns (ALB, NLB, Gateway API), AWS Load Balancer Controller, service mesh, DNS/CoreDNS tuning, private cluster connectivity
  • Reliability & Resiliency — Core — HA patterns, PDBs, health probes, load balancer health checks, lifecycle hooks, topology spread, resource management
  • Reliability & Resiliency — Advanced — disaster recovery, zonal shift, deployment strategies, large cluster guidance, chaos engineering, admission-controller topology enforcement
  • Autoscaling — Autoscaler selection, Cluster Autoscaler (IAM, Spot, overprovisioning, parameter tuning), HPA, VPA, KEDA, CoreDNS autoscaling
  • Karpenter — Operational best practices, NodePools, EC2NodeClass, Spot/interruption handling, consolidation, multiple NodePool strategy, cost controls, resource management, private clusters, CoreDNS with Karpenter
  • Cluster Upgrades — In-place and blue-green upgrades, pre-upgrade validation, add-on management, API deprecation detection, version skew policy, Bottlerocket updates, rollback procedures
  • Cost Optimization — CFM framework, compute/networking/storage cost strategies, observability cost management, Spot, Graviton, tagging, Kubecost
  • Scalability — Scaling theory (churn rate, QPS), control plane (APF, monitoring), data plane (node sizing, diversity), cluster services (CoreDNS, Metrics Server), workload patterns, IPVS, large-cluster guidance
  • Observability — Observability strategy, CloudWatch Container Insights & Application Signals, Prometheus/Grafana, control plane monitoring, network performance monitoring, logging architecture, distributed tracing, GPU/AI-ML observability, detective controls, alerting patterns
  • Terraform Examples — terraform-aws-modules/terraform-aws-eks examples, submodules, add-on management, Provisioned Control Plane, EFA, VPC patterns, deployment topologies
  • ArgoCD Patterns — ArgoCD architecture, App of Apps, ApplicationSets, GitOps Bridge, multi-cluster patterns (hub-and-spoke, decentralized, hybrid), EKS ArgoCD Capability (managed vs self-managed, migration), ACK/KRO integration, multi-tenant RBAC
  • Container Registry — ECR architecture, operating models, image promotion, vulnerability scanning, base image curation, lifecycle policies, pull-through cache, repository creation templates, managed signing (AWS Signer), archival storage class, registry configuration
  • EKS Auto Mode — Auto Mode architecture, managed NodePools/NodeClasses, migration from standard EKS, comparison with self-managed Karpenter, limitations and FAQ

How to use: When you need detailed information on a topic, reference the appropriate guide. Claude will load it on demand.

Sources

Install via CLI
npx skills add https://github.com/aws-samples/sample-apex-skills --skill eks-best-practices
Repository Details
star Stars 20
call_split Forks 9
navigation Branch main
article Path SKILL.md
More from Creator