sagemaker-hyperpod

Amazon SageMaker HyperPod expert for ML training clusters with Trainium or GPU. Use when: creating HyperPod clusters, running distributed training, configuring EKS or Slurm orchestration, troubleshooting cluster issues, checking quotas, or when user mentions "hyperpod", "hyp", "ml-cluster", "trainium", "trn1", "distributed training", or "multi-node training".

By dgallitelli. Updated 2/11/2026

name: sagemaker-hyperpod
description: |
  Amazon SageMaker HyperPod expert for ML training clusters with Trainium or GPU. Use when: creating HyperPod clusters, running distributed training, configuring EKS or Slurm orchestration, troubleshooting cluster issues, checking quotas, or when user mentions "hyperpod", "hyp", "ml-cluster", "trainium", "trn1", "distributed training", or "multi-node training".
argument-hint: "[cluster-name or action]"
context: fork
model: sonnet
skills:
  - aws-mcp-setup
allowed-tools:
  - mcp__sagemaker__*
  - mcp__aws-mcp__*
  - mcp__awsdocs__*
  - WebFetch
  - Bash(hyp *)
  - Bash(aws sagemaker *)
  - Bash(kubectl *)
  - Bash(aws eks *)
  - Bash(aws ec2 describe-*)
  - Bash(aws service-quotas *)
  - Bash(aws s3 *)
  - Bash(aws ssm start-session *)
  - Bash(aws sts get-caller-identity)
  - Bash(aws logs *)
  - Bash(aws iam get-role*)
  - Bash(aws iam list-*)
  - Bash(helm *)
  - Bash(pip install sagemaker-hyperpod)
hooks:
  PreToolUse:
    - matcher: Bash(aws sagemaker create-cluster*)
      command: aws sts get-caller-identity --query Account --output text
      once: true
    - matcher: Bash(hyp create*)
      command: aws sts get-caller-identity --query Account --output text
      once: true

Amazon SageMaker HyperPod Expert

You are an expert in Amazon SageMaker HyperPod for provisioning resilient ML training clusters with AWS Trainium and NVIDIA GPUs.

When This Skill Activates

  • Creating HyperPod clusters (EKS or Slurm)
  • Running distributed ML training jobs
  • Troubleshooting cluster issues
  • Checking quotas or instance availability
  • User mentions: "hyperpod", "hyp", "trainium", "trn1", "distributed training"

Detailed Guides

| Guide | Use When |
|-------|----------|
| reference/eks-guide.md | EKS orchestration, hyp CLI, add-ons, Pod Identity |
| reference/slurm-guide.md | Slurm orchestration, lifecycle scripts, SBATCH |
| reference/troubleshooting.md | Error diagnosis and solutions |

Orchestrator Selection

| Aspect | EKS | Slurm |
|--------|-----|-------|
| AZ Requirement | 2+ AZs required | Single AZ OK |
| Primary Tool | hyp CLI | AWS CLI |
| Job Submission | PyTorchJob via hyp create | SBATCH scripts |
| Access Method | kubectl | SSM Session Manager |
| Best For | Kubernetes teams, container workloads | HPC teams, batch jobs |

Instance Types

| Instance Type | Accelerator | Count | Use Case |
|---------------|-------------|-------|----------|
| ml.p4d.24xlarge | A100 | 8 | General training |
| ml.p4de.24xlarge | A100 (80GB) | 8 | Large models |
| ml.p5.48xlarge | H100 | 8 | Latest gen training |
| ml.trn1.32xlarge | Trainium | 16 | Cost-effective |
| ml.trn1n.32xlarge | Trainium | 16 | Higher network |

IMPORTANT: ml.trn1.2xlarge is NOT supported for HyperPod - only ml.trn1.32xlarge.
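
The table and the note above can be encoded as a small guard to run before any cluster or job configuration. This is a minimal sketch: `accelerators_per_node` is a hypothetical helper name, and it only knows the instance types listed in the table, which is not assumed to be the full set HyperPod supports.

```shell
# accelerators_per_node TYPE
# Prints the accelerator count for an instance type from the table above.
# Returns non-zero for anything not in that table (e.g. ml.trn1.2xlarge).
# Assumption: the table is not an exhaustive list of HyperPod instance types.
accelerators_per_node() {
  case "$1" in
    ml.p4d.24xlarge|ml.p4de.24xlarge|ml.p5.48xlarge) echo 8 ;;
    ml.trn1.32xlarge|ml.trn1n.32xlarge) echo 16 ;;
    *) echo "unlisted or unsupported instance type: $1" >&2; return 1 ;;
  esac
}
```

For example, `accelerators_per_node ml.trn1.32xlarge` prints 16, while `accelerators_per_node ml.trn1.2xlarge` fails with a message on stderr.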


CRITICAL: Pre-Creation Validation

ALWAYS perform these checks BEFORE creating a cluster:

1. Verify Instance Type Support

# Must say "for cluster usage" in quota name
aws service-quotas list-service-quotas \
  --service-code sagemaker --region us-east-1 \
  --query 'Quotas[?contains(QuotaName, `<INSTANCE_TYPE>`) && contains(QuotaName, `cluster`)].[QuotaName,Value]' \
  --output table

2. Check AZ Availability

aws ec2 describe-instance-type-offerings \
  --location-type availability-zone \
  --filters Name=instance-type,Values=trn1.32xlarge \
  --region us-east-1 \
  --query 'InstanceTypeOfferings[*].Location' --output text

3. For EKS: Ensure 2+ AZs in config.yaml

availability_zone_ids:
  - use1-az6  # Primary for workers
  - use1-az4  # Secondary for EKS HA
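
Step 2 returns AZ names (such as us-east-1a), while config.yaml expects AZ IDs. EC2 can report offerings by AZ ID directly, which avoids a manual mapping. The sketch below assumes your AWS CLI version accepts the `availability-zone-id` location type; `zone_ids_offering` is a hypothetical helper name.

```shell
# zone_ids_offering EC2_TYPE REGION
# Prints one AZ ID per line for each AZ that offers the instance type.
# EC2 type names omit the "ml." prefix (trn1.32xlarge, not ml.trn1.32xlarge).
zone_ids_offering() {
  aws ec2 describe-instance-type-offerings \
    --location-type availability-zone-id \
    --filters "Name=instance-type,Values=$1" \
    --region "$2" \
    --query 'InstanceTypeOfferings[*].Location' --output text | tr '\t' '\n'
}
```

Run `zone_ids_offering trn1.32xlarge us-east-1` and pick the primary worker AZ from its output; the secondary AZ for EKS HA does not need to offer the instance type.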

4. Check K8s Version (EKS Only)

WebFetch: https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html#kubernetes-release-calendar
Prompt: What is the latest Kubernetes version in standard support?

5. Check Add-on Compatibility (EKS Only)

Before upgrading K8s versions, verify HyperPod add-ons support the target version:

aws eks describe-addon-versions --addon-name amazon-sagemaker-hyperpod-taskgovernance \
  --query 'addons[0].addonVersions[*].compatibilities[*].clusterVersion' --output text

WARNING: EKS does NOT support downgrading a cluster's Kubernetes version. If you need HyperPod add-ons, stay on a version they support.
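
The compatibility check can also be expressed as a yes/no test for one target version. This is a sketch under the assumption that `describe-addon-versions` returns an empty `addons` list when filtered by a Kubernetes version the add-on does not support; `addon_supports_version` is a hypothetical helper name.

```shell
# addon_supports_version ADDON_NAME K8S_VERSION
# Exits 0 if at least one add-on version is compatible with K8S_VERSION.
addon_supports_version() (
  count=$(aws eks describe-addon-versions \
    --addon-name "$1" --kubernetes-version "$2" \
    --query 'length(addons)' --output text) || exit 2
  # A non-numeric result also counts as "not supported".
  [ "$count" -gt 0 ] 2>/dev/null
)
```

Usage: `addon_supports_version amazon-sagemaker-hyperpod-taskgovernance <TARGET_VERSION> || echo "do not upgrade yet"`.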


EKS Quick Start

# 1. Install CLI
pip install sagemaker-hyperpod

# 2. Initialize cluster stack
hyp init cluster-stack my-cluster
cd my-cluster

# 3. Edit config.yaml (ensure 2+ AZs!)

# 4. Validate and create
hyp validate && hyp create cluster-stack --region us-east-1

# 5. Set context
hyp set-cluster-context --cluster-name <NAME> --region us-east-1

Submit Training Job (EKS)

# Option 1: Using config file (recommended)
hyp init hyp-pytorch-job my-job
cd my-job
# Edit config.yaml
hyp validate
hyp create hyp-pytorch-job

# Option 2: Command line
hyp create hyp-pytorch-job \
  --job-name my-job \
  --image <ECR-IMAGE> \
  --instance-type ml.trn1.32xlarge \
  --node-count 1 \
  --accelerators 16 \
  --accelerators-limit 16
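
Because the accelerator request and limit must match, one way to rule out a mismatch is to pass both flags from a single value. A minimal sketch using only the flags shown in Option 2; `submit_pytorch_job` is a hypothetical wrapper name.

```shell
# submit_pytorch_job JOB_NAME IMAGE INSTANCE_TYPE NODE_COUNT ACCELERATORS
# Submits a job with --accelerators and --accelerators-limit set to the
# same value, so the two can never drift apart.
submit_pytorch_job() {
  hyp create hyp-pytorch-job \
    --job-name "$1" \
    --image "$2" \
    --instance-type "$3" \
    --node-count "$4" \
    --accelerators "$5" \
    --accelerators-limit "$5"
}
```

Usage: `submit_pytorch_job my-job <ECR-IMAGE> ml.trn1.32xlarge 1 16`.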

Monitor Training Job (EKS)

# List jobs
hyp list hyp-pytorch-job

# Job details
hyp describe hyp-pytorch-job --job-name <NAME>

# View logs
hyp get-logs hyp-pytorch-job --job-name <NAME> --follow

# List pods
hyp list-pods hyp-pytorch-job --job-name <NAME>

# Delete job
hyp delete hyp-pytorch-job --job-name <NAME>

Full guide: See reference/eks-guide.md


Slurm Quick Start

# 1. Prepare lifecycle scripts (use AWS samples)
git clone https://github.com/aws-samples/awsome-distributed-training.git
cd awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/

# 2. Upload to S3
aws s3 cp . s3://my-bucket/lifecycle-scripts/ --recursive

# 3. Create cluster
aws sagemaker create-cluster --cluster-name my-cluster \
  --instance-groups '[...]' --vpc-config "..."

# 4. Connect via SSM
aws ssm start-session --target sagemaker-cluster:<cluster-id>_<instance-group-name>-<instance-id>
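
HyperPod nodes are addressed in SSM by a composite target rather than a bare EC2 instance ID. The sketch below builds that target from a cluster ARN, assuming the `sagemaker-cluster:<cluster-id>_<instance-group-name>-<instance-id>` format and that the cluster ID is the last path segment of the ARN; `hyperpod_ssm_target` is a hypothetical helper name.

```shell
# hyperpod_ssm_target CLUSTER_ARN INSTANCE_GROUP_NAME INSTANCE_ID
# Prints the SSM target string for one HyperPod node.
hyperpod_ssm_target() {
  # ${1##*/} keeps only the text after the last "/" in the ARN.
  printf 'sagemaker-cluster:%s_%s-%s\n' "${1##*/}" "$2" "$3"
}
```

Usage: `aws ssm start-session --target "$(hyperpod_ssm_target <CLUSTER_ARN> <GROUP_NAME> <INSTANCE_ID>)"`. The ARN is available from `aws sagemaker describe-cluster`, and group names and instance IDs from `aws sagemaker list-cluster-nodes`.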

Full workflow: See reference/slurm-guide.md


Model Compatibility (Trainium/Inferentia)

CRITICAL: Verify model support before configuring Trainium jobs.

Check Support

WebFetch: https://huggingface.co/docs/optimum-neuron/en/supported_architectures
Prompt: List supported model architectures for training on Trainium

Currently Supported (Training)

| Architecture | Tensor Parallelism | Pipeline Parallelism |
|--------------|--------------------|----------------------|
| Llama, Llama 2, Llama 3 | Yes | Yes |
| Qwen3 | Yes | Yes |
| Granite | Yes | No |

Common Errors (Quick Reference)

| Error | Cause | Solution |
|-------|-------|----------|
| InvalidParameterException (EKS) | Single AZ | Add 2+ AZs to config |
| ml.trn1.2xlarge not found | Unsupported type | Use ml.trn1.32xlarge |
| Training Operator pod fails | Missing Pod Identity | See EKS guide |
| Insufficient cpu | Full node request | Use partial resources |
| Accelerator request != limit | Limits mismatch | Set accelerators_limit = accelerators |
| EFA health check failed | Multi-AZ | Use single subnet with OverrideVpcConfig |
| Add-on not supported | K8s version | Check add-on compatibility before upgrade |

Full troubleshooting: See reference/troubleshooting.md


Infrastructure Requirements

EFA Single-AZ Requirement

For EFA-enabled instances (trn1, p4d, p5), ALL worker instances MUST be in the SAME AZ. The second AZ that EKS requires is for EKS high availability, not for workers.

Security Group

Must allow ALL traffic within itself:

aws ec2 authorize-security-group-ingress \
  --group-id sg-xxx --protocol all --port -1 --source-group sg-xxx

CIDR Sizing

| Orchestrator | IP addresses per P5 instance |
|--------------|------------------------------|
| Slurm | 32 |
| EKS | 81 (includes pods) |
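
The per-instance figures translate into a minimum subnet size with simple arithmetic. The sketch below assumes all nodes share one subnet (as the EFA requirement implies) and adds the 5 addresses AWS reserves in every subnet; `min_subnet_prefix` is a hypothetical helper name, and the per-node figures apply to P5 only.

```shell
# min_subnet_prefix NODE_COUNT IPS_PER_NODE
# Prints the longest CIDR prefix whose subnet still fits all node IPs.
min_subnet_prefix() (
  needed=$(( $1 * $2 + 5 ))   # +5: addresses AWS reserves per subnet
  bits=0
  size=1
  # Double the subnet size until it covers the required address count.
  while [ "$size" -lt "$needed" ]; do
    size=$(( size * 2 ))
    bits=$(( bits + 1 ))
  done
  echo "/$(( 32 - bits ))"
)
```

For 16 P5 nodes on EKS, 16 × 81 + 5 = 1301 addresses, so `min_subnet_prefix 16 81` prints `/21` (2048 addresses). The same 16 nodes on Slurm need 517 addresses, and `min_subnet_prefix 16 32` prints `/22`.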

Quota Management

# Check quota
aws service-quotas get-service-quota \
  --service-code sagemaker --quota-code L-6865522E --region us-east-1

# Request increase (quotas are per region, so pass the same region as above)
aws service-quotas request-service-quota-increase \
  --service-code sagemaker --quota-code L-6865522E --desired-value 4 \
  --region us-east-1

Common codes:

  • L-6865522E: ml.trn1.32xlarge for cluster usage
  • L-5C4CD236: ml.p5.48xlarge for cluster usage
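
The quota check can be turned into a pass/fail gate that runs before cluster creation. A minimal sketch, assuming the quota value is returned as a number such as 4.0; `quota_covers` is a hypothetical helper name.

```shell
# quota_covers QUOTA_CODE REGION NEEDED_INSTANCES
# Exits 0 if the current quota is at least NEEDED_INSTANCES.
quota_covers() (
  value=$(aws service-quotas get-service-quota \
    --service-code sagemaker --quota-code "$1" --region "$2" \
    --query 'Quota.Value' --output text) || exit 2
  # Compare only the integer part of the returned value (4.0 becomes 4).
  [ "${value%.*}" -ge "$3" ]
)
```

Usage: `quota_covers L-6865522E us-east-1 4 || echo "request a quota increase first"`.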

Diagnostic Commands

# Cluster status
aws sagemaker describe-cluster --cluster-name NAME

# List nodes
aws sagemaker list-cluster-nodes --cluster-name NAME

# CloudWatch logs
aws logs get-log-events \
  --log-group-name /aws/sagemaker/Clusters/NAME/ID \
  --log-stream-name LifecycleConfig/GROUP/INSTANCE

# EKS nodes/pods
kubectl get nodes && kubectl get pods -A
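
Cluster creation takes a while, so the status command is often wrapped in a polling loop. The sketch below assumes `InService` is the ready state and treats `Failed` and `RollingBack` as terminal errors; `wait_for_cluster` and its default interval and retry count are hypothetical.

```shell
# wait_for_cluster NAME [INTERVAL_SECONDS] [MAX_TRIES]
# Polls the cluster status until it is InService, fails, or times out.
wait_for_cluster() (
  name=$1; interval=${2:-60}; tries=${3:-60}
  i=0
  while [ "$i" -lt "$tries" ]; do
    status=$(aws sagemaker describe-cluster --cluster-name "$name" \
      --query ClusterStatus --output text) || exit 2
    case "$status" in
      InService) echo "$status"; exit 0 ;;
      Failed|RollingBack) echo "$status" >&2; exit 1 ;;
    esac
    i=$(( i + 1 ))
    sleep "$interval"
  done
  echo "timed out; last status: $status" >&2
  exit 3
)
```

Usage: `wait_for_cluster my-cluster 30 120 && aws sagemaker list-cluster-nodes --cluster-name my-cluster`.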