name: karpenter-provision
description: Generate optimized Karpenter NodePool and EC2NodeClass configurations based on workload requirements. Handles Spot strategies, disruption budgets, instance diversity, and cost optimization.
/karpenter-provision
Purpose
Generate production-ready Karpenter configurations (NodePool + EC2NodeClass) tuned for cost, reliability, and the characteristics of the target workload.
Requirement Gathering
Ask the user for these details (if not already provided):
- Workload type: stateless API, stateful (database/cache), batch/CI, GPU/ML, mixed
- Resources per pod: typical vCPU and memory requests
- Capacity type: on-demand only, spot only, or mixed (specify ratio)
- Monthly budget target: helps size limits appropriately
- Region and AZs: for subnet and zone configuration
- EKS version and Karpenter version: for API compatibility
- Special requirements: GPU, ARM64, high-memory, Windows, etc.
Generation Rules
Instance Type Selection
- Stateless APIs: c6i, c7g, m6i, m7g (compute/general purpose, good price-performance)
- Memory-intensive: r6i, r7g, r6a (memory optimized)
- Batch/CI: c6i, c6a, c7g, c7a, m6i, m6a, m7g (maximize diversity for Spot)
- GPU: p4d, p5, g5, g6 (match GPU type to workload: training vs inference)
- ARM64 preferred: include Graviton variants (c7g, m7g, r7g), typically 20-40% cheaper
Instance Size Selection
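- Match to pod resource requests: if pods need 1 vCPU/2 GiB, use large to 2xlarge (requests must fit within node allocatable capacity, which is below the nominal vCPU and memory of the instance)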
- Avoid oversized instances (4xlarge+) unless pods are large or bin-packing is critical
- Include at least 3 sizes for flexibility
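The family and size rules above might translate into NodePool requirements along these lines. This is a minimal sketch for a Batch/CI Spot pool: the family list comes from the Batch/CI bullet, while the three sizes and the multi-architecture assumption (container images published for both amd64 and arm64) are illustrative choices rather than requirements of this skill.

```yaml
# Fragment of spec.template.spec in a karpenter.sh/v1 NodePool.
requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot"]
  - key: karpenter.k8s.aws/instance-family
    operator: In
    values: ["c6i", "c6a", "c7g", "c7a", "m6i", "m6a", "m7g"]
  - key: karpenter.k8s.aws/instance-size
    operator: In
    values: ["large", "xlarge", "2xlarge"]
  # Assumption: workloads ship multi-arch images, so x86 and Graviton can mix.
  - key: kubernetes.io/arch
    operator: In
    values: ["amd64", "arm64"]
```

Seven families across three sizes yields up to 21 candidate instance types, subject to what is actually offered in the selected AZs.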
Spot Strategy
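- Minimum 15 instance types per Spot NodePool; Karpenter requires this for single-node Spot-to-Spot consolidation (enabled via the SpotToSpotConsolidation feature gate)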
- Always include on-demand fallback for mixed strategies
- Use the weight field: Spot NodePool weight=100, On-Demand weight=10 (fallback)
- Configure interruption handling (SQS + EventBridge)
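As a rough illustration of the weighted fallback, the two NodePools below differ only in capacity type and weight. The names general-spot and general-on-demand are hypothetical, and the fields marked as omitted (nodeClassRef, instance requirements, disruption, limits) would need to be filled in before these manifests could be applied.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-spot          # hypothetical name
spec:
  weight: 100                 # higher weight: tried first
  template:
    spec:
      # nodeClassRef and instance requirements omitted for brevity
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-on-demand     # hypothetical name
spec:
  weight: 10                  # lower weight: used when the Spot pool cannot satisfy the pods
  template:
    spec:
      # nodeClassRef and instance requirements omitted for brevity
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
```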
Disruption Policy
- Production: consolidateAfter: 60s, budgets: nodes: "20%"
- Batch: consolidateAfter: 30s, budgets: nodes: "40%"
- Add peak-hours protection budget for production
- Use WhenEmptyOrUnderutilized (default, recommended for most cases)
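A sketch of how the production profile could look in a NodePool, including a peak-hours protection budget. The peak window (09:00 to 17:00 UTC on weekdays) is an assumed example; budget schedules are evaluated in UTC, so the cron expression should be adjusted to the actual traffic pattern.

```yaml
# Fragment of spec in a karpenter.sh/v1 NodePool (production profile).
disruption:
  consolidationPolicy: WhenEmptyOrUnderutilized
  consolidateAfter: 60s
  budgets:
    # Allow at most 20% of this pool's nodes to be disrupted at once.
    - nodes: "20%"
    # Assumed peak window: block voluntary disruption for 8h from 09:00 UTC, Mon-Fri.
    - nodes: "0"
      schedule: "0 9 * * mon-fri"
      duration: 8h
```

When several budgets are active at the same time, the most restrictive one applies, so the zero-node budget takes precedence during the peak window.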
Limits
- Set based on budget (rough rates: $0.04/vCPU-hour on-demand, $0.015/vCPU-hour spot; vCPU limit ≈ monthly budget / (730 h × rate))
- Include both cpu and memory limits
- Leave 20% headroom above expected peak
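A worked example of the sizing, under assumed inputs that are not part of this skill: a $5,000 monthly on-demand budget, 730 hours per month, an expected peak of 140 vCPU, and a general-purpose ratio of 4 GiB of memory per vCPU.

```yaml
# Fragment of spec in a karpenter.sh/v1 NodePool.
# Budget cap:  5000 / (730 h * $0.04 per vCPU-hour) ≈ 171 vCPU
# Headroom:    140 vCPU expected peak * 1.2 = 168 vCPU (within the cap)
# Memory:      168 vCPU * 4 GiB per vCPU = 672 GiB
limits:
  cpu: "168"
  memory: 672Gi
```

If the headroom figure exceeded the budget cap, the cap would be the safer value to use, and the gap is worth flagging to the user.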
Output
Generate complete YAML files:
NodePool(s) with:
- Instance family and size requirements
- Capacity type (on-demand/spot)
- AZ requirements
- Labels and taints (for workload isolation)
- Disruption policy with budgets
- Resource limits
- Weight (priority)
- expireAfter (TTL for node rotation)
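Put together, a NodePool covering the fields listed above might look like the following sketch for a stateless API pool on Spot. The pool name, label and taint key, AZ names, EC2NodeClass name, and limit values are all hypothetical placeholders; only the field layout follows the karpenter.sh/v1 API.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: api-spot                       # hypothetical name
spec:
  weight: 100
  template:
    metadata:
      labels:
        workload-type: stateless-api   # hypothetical label
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                  # hypothetical EC2NodeClass name
      expireAfter: 720h                # rotate nodes after 30 days
      taints:
        - key: workload-type           # hypothetical taint for isolation
          value: stateless-api
          effect: NoSchedule
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        # 5 families x 3 sizes = 15 candidate instance types
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["c6i", "c6a", "c7a", "m6i", "m6a"]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge", "2xlarge"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]   # placeholder AZs
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 60s
    budgets:
      - nodes: "20%"
  limits:
    cpu: "200"                         # placeholder; derive from the budget target
    memory: 800Gi
```

Pods intended for this pool would need a matching toleration and node selector for the workload-type key.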
EC2NodeClass with:
- amiSelectorTerms (alias for EKS-optimized AMI)
- subnetSelectorTerms (tag-based discovery)
- securityGroupSelectorTerms (tag-based discovery)
- instanceProfile
- blockDeviceMappings (root volume sizing)
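A minimal EC2NodeClass sketch with the fields above. The cluster name my-cluster, the instance profile name, and the 50Gi root volume are assumptions for illustration; the discovery tag karpenter.sh/discovery is a common convention and must match whatever tags are actually present on the subnets and security groups.

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default                        # hypothetical name
spec:
  amiSelectorTerms:
    # EKS-optimized AL2023 AMI; consider pinning a version instead of
    # "latest" in production so AMI updates roll out deliberately.
    - alias: al2023@latest
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster   # assumed cluster name
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster   # assumed cluster name
  instanceProfile: KarpenterNodeInstanceProfile-my-cluster   # hypothetical profile
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 50Gi
        volumeType: gp3
        encrypted: true
        deleteOnTermination: true
```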
Supporting resources:
- PodDisruptionBudget for the primary workload
- Recommended Prometheus ServiceMonitor config for Karpenter
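For the PodDisruptionBudget, something like the sketch below is usually enough for a replicated stateless service. The name and the app label are hypothetical and would need to match the labels on the workload's pods.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api                   # hypothetical name
spec:
  # Allow voluntary eviction of one pod at a time during consolidation or drift.
  maxUnavailable: 1
  selector:
    matchLabels:
      app: payments-api                # hypothetical label
```

Karpenter respects PodDisruptionBudgets when draining nodes, so a budget that permits no evictions (for example, maxUnavailable: 0) would block consolidation of any node running those pods.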
Validation Checklist
Before presenting the configuration, verify:
- Instance type diversity: ≥15 types for Spot, ≥6 for On-Demand
- All 3 AZs included for HA (if multi-AZ requested)
- Resource limits align with budget target
- consolidateAfter is appropriate (not 0s in production)
- Peak-hours budget protection included for production workloads
- Taints configured for workload isolation (if multiple NodePools)
- expireAfter set for node rotation (720h prod, 168h batch recommended)
- amiSelectorTerms uses alias (not custom AMI) unless specifically required
- No launch template references (not supported in v1 API)
Examples
Common configurations the skill should handle:
- "Provision for a microservices platform with 50 stateless APIs"
- "Set up Karpenter for ML training workloads with p5.48xlarge GPUs"
- "Configure Spot-only NodePool for CI/CD runners"
- "Mixed strategy: critical payment service on On-Demand, everything else on Spot"
- "ARM64-first strategy with x86 fallback for cost optimization"