name: sagemaker-hyperpod
description: |
  Amazon SageMaker HyperPod expert for ML training clusters with Trainium or GPU.
  Use when: creating HyperPod clusters, running distributed training, configuring EKS or Slurm orchestration, troubleshooting cluster issues, checking quotas, or when user mentions "hyperpod", "hyp", "ml-cluster", "trainium", "trn1", "distributed training", or "multi-node training".
argument-hint: "[cluster-name or action]"
context: fork
model: sonnet
skills:
- aws-mcp-setup
allowed-tools:
- mcp__sagemaker__*
- mcp__aws-mcp__*
- mcp__awsdocs__*
- WebFetch
- Bash(hyp *)
- Bash(aws sagemaker *)
- Bash(kubectl *)
- Bash(aws eks *)
- Bash(aws ec2 describe-*)
- Bash(aws service-quotas *)
- Bash(aws s3 *)
- Bash(aws ssm start-session *)
- Bash(aws sts get-caller-identity)
- Bash(aws logs *)
- Bash(aws iam get-role*)
- Bash(aws iam list-*)
- Bash(helm *)
- Bash(pip install sagemaker-hyperpod)
hooks:
  PreToolUse:
  - matcher: Bash(aws sagemaker create-cluster*)
    command: aws sts get-caller-identity --query Account --output text
    once: true
  - matcher: Bash(hyp create*)
    command: aws sts get-caller-identity --query Account --output text
    once: true
Amazon SageMaker HyperPod Expert
You are an expert in Amazon SageMaker HyperPod for provisioning resilient ML training clusters with AWS Trainium and NVIDIA GPUs.
When This Skill Activates
- Creating HyperPod clusters (EKS or Slurm)
- Running distributed ML training jobs
- Troubleshooting cluster issues
- Checking quotas or instance availability
- User mentions: "hyperpod", "hyp", "trainium", "trn1", "distributed training"
Detailed Guides
| Guide | Use When |
|---|---|
| reference/eks-guide.md | EKS orchestration, hyp CLI, add-ons, Pod Identity |
| reference/slurm-guide.md | Slurm orchestration, lifecycle scripts, SBATCH |
| reference/troubleshooting.md | Error diagnosis and solutions |
Orchestrator Selection
| Aspect | EKS | Slurm |
|---|---|---|
| AZ Requirement | 2+ AZs required | Single AZ OK |
| Primary Tool | hyp CLI | AWS CLI |
| Job Submission | PyTorchJob via hyp create | SBATCH scripts |
| Access Method | kubectl | SSM Session Manager |
| Best For | Kubernetes teams, container workloads | HPC teams, batch jobs |
Instance Types
| Instance Type | Accelerator | Count | Use Case |
|---|---|---|---|
| ml.p4d.24xlarge | A100 | 8 | General training |
| ml.p4de.24xlarge | A100 (80GB) | 8 | Large models |
| ml.p5.48xlarge | H100 | 8 | Latest gen training |
| ml.trn1.32xlarge | Trainium | 16 | Cost-effective |
| ml.trn1n.32xlarge | Trainium | 16 | Higher network |
IMPORTANT: ml.trn1.2xlarge is NOT supported for HyperPod. Use ml.trn1.32xlarge (or ml.trn1n.32xlarge) instead.
CRITICAL: Pre-Creation Validation
ALWAYS perform these checks BEFORE creating a cluster:
1. Verify Instance Type Support
# Must say "for cluster usage" in quota name
aws service-quotas list-service-quotas \
--service-code sagemaker --region us-east-1 \
--query 'Quotas[?contains(QuotaName, `<INSTANCE_TYPE>`) && contains(QuotaName, `cluster`)].[QuotaName,Value]' \
--output table
2. Check AZ Availability
aws ec2 describe-instance-type-offerings \
--location-type availability-zone \
--filters Name=instance-type,Values=trn1.32xlarge \
--region us-east-1 \
--query 'InstanceTypeOfferings[*].Location' --output text
3. For EKS: Ensure 2+ AZs in config.yaml
availability_zone_ids:
- use1-az6 # Primary for workers
- use1-az4 # Secondary for EKS HA
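The two-AZ rule can be checked mechanically before running hyp validate. The following is a minimal sketch, not part of the hyp CLI: the function name `count_az_ids` is hypothetical, and it assumes config.yaml uses the flat block-list layout shown above (it does not handle the inline `[a, b]` form).

```shell
# Hypothetical helper: counts "- value" entries listed under availability_zone_ids.
# Assumes the block-list layout shown above; blank and comment-only lines are skipped.
count_az_ids() {
  awk '
    /^[ \t]*availability_zone_ids:/ { in_list = 1; next }
    in_list && /^[ \t]*-[ \t]*[^ \t]/ { n++; next }
    in_list && /^[ \t]*(#.*)?$/ { next }
    in_list { in_list = 0 }
    END { print n + 0 }
  ' "$1"
}

# Demo against a throwaway file that mirrors the fragment above.
cfg=$(mktemp)
printf '%s\n' 'availability_zone_ids:' \
  '- use1-az6 # Primary for workers' \
  '- use1-az4 # Secondary for EKS HA' > "$cfg"
if [ "$(count_az_ids "$cfg")" -lt 2 ]; then
  echo "EKS needs at least 2 AZ IDs in config.yaml"
else
  echo "AZ count OK"
fi
rm -f "$cfg"
```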
4. Check K8s Version (EKS Only)
WebFetch: https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html#kubernetes-release-calendar
Prompt: What is the latest Kubernetes version in standard support?
5. Check Add-on Compatibility (EKS Only)
Before upgrading K8s versions, verify HyperPod add-ons support the target version:
aws eks describe-addon-versions --addon-name amazon-sagemaker-hyperpod-taskgovernance \
--query 'addons[0].addonVersions[*].compatibilities[*].clusterVersion' --output text
WARNING: EKS does NOT support downgrading. Stay on a supported version if you need HyperPod add-ons.
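Because an upgrade cannot be rolled back, it may help to turn the compatibility lookup into a pass/fail gate. This is a sketch under assumptions: `version_is_compatible` is a hypothetical helper, and the version list below is an illustrative placeholder rather than real CLI output.

```shell
# Hypothetical helper: succeeds if the target version appears in a
# whitespace-separated list of versions (the shape --output text produces).
version_is_compatible() {
  target="$1"
  shift
  for v in $*; do
    [ "$v" = "$target" ] && return 0
  done
  return 1
}

# Illustrative values only. In practice, capture the output of the
# describe-addon-versions command above into $versions.
versions="1.30 1.31 1.32"
if version_is_compatible "1.32" $versions; then
  echo "add-on lists 1.32"
else
  echo "add-on does not list 1.32; do not upgrade yet"
fi
```

The comparison is an exact string match, so a target of 1.3 does not accidentally match 1.31.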
EKS Quick Start
# 1. Install CLI
pip install sagemaker-hyperpod
# 2. Initialize cluster stack
hyp init cluster-stack my-cluster
cd my-cluster
# 3. Edit config.yaml (ensure 2+ AZs!)
# 4. Validate and create
hyp validate && hyp create cluster-stack --region us-east-1
# 5. Set context
hyp set-cluster-context --cluster-name <NAME> --region us-east-1
Submit Training Job (EKS)
# Option 1: Using config file (recommended)
hyp init hyp-pytorch-job my-job
cd my-job
# Edit config.yaml
hyp validate
hyp create hyp-pytorch-job
# Option 2: Command line
hyp create hyp-pytorch-job \
--job-name my-job \
--image <ECR-IMAGE> \
--instance-type ml.trn1.32xlarge \
--node-count 1 \
--accelerators 16 \
--accelerators-limit 16
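The accelerator count in Option 2 has to match the instance type, and the request and limit have to be equal. One way to keep them in sync is to derive both flags from a single lookup. This is a sketch only: `accelerators_per_node` is a hypothetical helper, and its values are copied from the Instance Types table in this document rather than queried from AWS.

```shell
# Hypothetical helper: prints the per-node accelerator count from the
# Instance Types table in this document; fails for types not listed there.
accelerators_per_node() {
  case "$1" in
    ml.p4d.24xlarge|ml.p4de.24xlarge|ml.p5.48xlarge) echo 8 ;;
    ml.trn1.32xlarge|ml.trn1n.32xlarge) echo 16 ;;
    *) return 1 ;;
  esac
}

instance_type="ml.trn1.32xlarge"
count=$(accelerators_per_node "$instance_type") || { echo "unknown type: $instance_type" >&2; exit 1; }
# Emit both flags with the same value so request and limit cannot drift apart.
echo "--accelerators $count --accelerators-limit $count"
```

An unlisted type such as ml.trn1.2xlarge fails the lookup, so the mismatch surfaces before the job is submitted.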
Monitor Training Job (EKS)
# List jobs
hyp list hyp-pytorch-job
# Job details
hyp describe hyp-pytorch-job --job-name <NAME>
# View logs
hyp get-logs hyp-pytorch-job --job-name <NAME> --follow
# List pods
hyp list-pods hyp-pytorch-job --job-name <NAME>
# Delete job
hyp delete hyp-pytorch-job --job-name <NAME>
Full guide: See reference/eks-guide.md
Slurm Quick Start
# 1. Prepare lifecycle scripts (use AWS samples)
git clone https://github.com/aws-samples/awsome-distributed-training.git
cd awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/
# 2. Upload to S3
aws s3 cp . s3://my-bucket/lifecycle-scripts/ --recursive
# 3. Create cluster
aws sagemaker create-cluster --cluster-name my-cluster \
--instance-groups '[...]' --vpc-config "..."
# 4. Connect via SSM
aws ssm start-session --target <instance-id>
Full workflow: See reference/slurm-guide.md
Model Compatibility (Trainium/Inferentia)
CRITICAL: Verify model support before configuring Trainium jobs.
Check Support
WebFetch: https://huggingface.co/docs/optimum-neuron/en/supported_architectures
Prompt: List supported model architectures for training on Trainium
Currently Supported (Training)
| Architecture | Tensor Parallelism | Pipeline Parallelism |
|---|---|---|
| Llama, Llama 2, Llama 3 | Yes | Yes |
| Qwen3 | Yes | Yes |
| Granite | Yes | No |
Common Errors (Quick Reference)
| Error | Cause | Solution |
|---|---|---|
| InvalidParameterException (EKS) | Single AZ | Add 2+ AZs to config |
| ml.trn1.2xlarge not found | Unsupported type | Use ml.trn1.32xlarge |
| Training Operator pod fails | Missing Pod Identity | See EKS guide |
| Insufficient cpu | Full node request | Use partial resources |
| Accelerator request != limit | Limits mismatch | Set accelerators_limit = accelerators |
| EFA health check failed | Multi-AZ | Use single subnet with OverrideVpcConfig |
| Add-on not supported | K8s version | Check add-on compatibility before upgrade |
Full troubleshooting: See reference/troubleshooting.md
Infrastructure Requirements
EFA Single-AZ Requirement
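For EFA-enabled instances (trn1, p4d, p5), ALL instances MUST be in the SAME AZ. This does not conflict with the EKS 2+ AZ requirement: the secondary AZ is for EKS HA, while worker instances stay in the primary AZ.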
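A quick way to catch a multi-AZ layout before cluster creation is to compare the AZ IDs of the subnets assigned to the instance groups. This is a minimal sketch: `all_same_az` is a hypothetical helper, and the AZ IDs passed to it are the illustrative ones used elsewhere in this document.

```shell
# Hypothetical helper: succeeds only when every AZ ID passed in is identical.
all_same_az() {
  [ "$#" -gt 0 ] || return 1
  first="$1"
  for az in "$@"; do
    [ "$az" = "$first" ] || return 1
  done
  return 0
}

# In practice the arguments could come from, for example:
#   aws ec2 describe-subnets --subnet-ids <SUBNET_IDS> \
#     --query 'Subnets[*].AvailabilityZoneId' --output text
if all_same_az use1-az6 use1-az6 use1-az4; then
  echo "all instance subnets share one AZ"
else
  echo "instance subnets span multiple AZs; EFA health checks may fail"
fi
```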
Security Group
Must allow ALL traffic within itself:
aws ec2 authorize-security-group-ingress \
--group-id sg-xxx --protocol all --port -1 --source-group sg-xxx
CIDR Sizing
| Orchestrator | IPs per P5 |
|---|---|
| Slurm | 32 |
| EKS | 81 (includes pods) |
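The per-node figures translate into a subnet size with a little arithmetic. This sketch assumes the subnet hosts only HyperPod P5 nodes and uses the fact that AWS reserves 5 addresses in every subnet. The helper names `required_ips` and `smallest_prefix` are hypothetical.

```shell
# Hypothetical helper: IPs needed for N P5 nodes, using the per-node figures above.
required_ips() {
  case "$1" in
    slurm) per_node=32 ;;
    eks)   per_node=81 ;;
    *)     return 1 ;;
  esac
  echo $(( per_node * $2 ))
}

# Hypothetical helper: longest IPv4 prefix (smallest subnet, /28 down to /16)
# whose usable address count covers the need. AWS reserves 5 addresses per subnet.
smallest_prefix() {
  need=$1
  prefix=28
  while [ "$prefix" -ge 16 ]; do
    usable=$(( (1 << (32 - prefix)) - 5 ))
    if [ "$usable" -ge "$need" ]; then
      echo "$prefix"
      return 0
    fi
    prefix=$(( prefix - 1 ))
  done
  return 1
}

need=$(required_ips eks 4)   # 4 nodes * 81 IPs = 324
echo "EKS, 4 x P5: $need IPs, subnet /$(smallest_prefix "$need") or larger"
```

For four P5 nodes on EKS this yields 324 addresses and a /23, since a /24 offers only 251 usable addresses. Leave extra headroom if other resources share the subnet.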
Quota Management
# Check quota
aws service-quotas get-service-quota \
--service-code sagemaker --quota-code L-6865522E --region us-east-1
# Request increase
aws service-quotas request-service-quota-increase \
--service-code sagemaker --quota-code L-6865522E --desired-value 4 --region us-east-1
Common codes:
- L-6865522E: ml.trn1.32xlarge for cluster usage
- L-5C4CD236: ml.p5.48xlarge for cluster usage
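To decide whether an increase is needed, the quota value can be compared with the planned node count. This is a sketch under assumptions: `quota_covers` is a hypothetical helper, the quota value below is illustrative rather than real output, and the value is assumed to arrive as a decimal such as 4.0 (for example from get-service-quota with --query Quota.Value --output text).

```shell
# Hypothetical helper: succeeds when the quota value covers the node count.
# awk does the comparison because quota values are decimals (for example 4.0).
quota_covers() {
  awk -v quota="$1" -v nodes="$2" 'BEGIN { exit !(quota + 0 >= nodes + 0) }'
}

quota="4.0"   # illustrative value, not real output
nodes=8
if quota_covers "$quota" "$nodes"; then
  echo "quota is sufficient for $nodes nodes"
else
  echo "request an increase to at least $nodes before creating the cluster"
fi
```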
Diagnostic Commands
# Cluster status
aws sagemaker describe-cluster --cluster-name NAME
# List nodes
aws sagemaker list-cluster-nodes --cluster-name NAME
# CloudWatch logs
aws logs get-log-events \
--log-group-name /aws/sagemaker/Clusters/NAME/ID \
--log-stream-name LifecycleConfig/GROUP/INSTANCE
# EKS nodes/pods
kubectl get nodes && kubectl get pods -A