name: sagemaker-warm-pool-researcher
description: >
  Orchestrates tight-loop SageMaker training iteration using managed warm pools. Ensures pool reuse by enforcing matching constraints, verifying warm pool status between submissions, using persistent cache, and releasing instances on completion.
  Triggers: "iterate on training", "debug training job", "warm pool", "resubmit training", "bump dependency and retrain", "recipe iteration", "keep_alive", "dependency cascade", "sweep", "try different learning rates", "grid search", "train until metric", "iterate until accuracy", "train until loss", "try these hyperparameters", "bump and retry", "keep training until", "run training with different".
  Also auto-activates when a request implies 2+ sequential SageMaker training jobs on the same instance type, even without explicit mention of warm pools.
argument-hint: "[instance-type] [keep-alive-seconds]"
allowed-tools: Read, Write, Edit, Bash, Grep, Glob
Warm Pool Iteration — Orchestration Skill
You are managing an iterative SageMaker training loop that reuses a managed warm pool between submissions. Your job is to maintain pool affinity across iterations, fail fast on breakage, and release the instance when done.
0. Auto-Detection (Read This First)
This skill applies any time the user's request will produce 2+ sequential SageMaker training jobs on the same instance type. You do not need the user to say "warm pool."
Patterns that signal warm pool candidacy
| Pattern | Example |
|---|---|
| Hyperparameter sweep | "Try learning rates 1e-4, 5e-5, 2e-5" |
| Train-until-metric | "Train until eval loss < 0.3" or "iterate until accuracy > 90%" |
| Dependency debugging | "Bump transformers and retrain" or "fix the import error and rerun" |
| Recipe iteration | "Try LoRA rank 16, then 32, then 64" |
| Ablation study | "Run with and without gradient checkpointing" |
| Config grid | "Try batch sizes 4, 8, 16 on this setup" |
What to do when you detect a candidate
- Inform the user: "This looks like it will take multiple training iterations on the same instance — I'll use a managed warm pool to avoid repeated cold starts."
- Activate this skill's full workflow (quota check → locked fields → iteration loop → cleanup).
- If the user declines, skip warm pools and submit jobs normally.
When NOT to auto-activate
- Single training job with no iteration intent
- User explicitly asks for spot instances (incompatible)
- Heterogeneous cluster request
- Serverless Model Customization (warm pools don't apply)
1. Pre-Flight Checks (Run Once at Loop Start)
Before submitting the first training job:
Quota verification
Each instance type has its own warm pool quota code. Find yours by filtering:
aws service-quotas list-service-quotas \
  --service-code sagemaker \
  --region <region> \
  --query "Quotas[?contains(QuotaName, 'warm pool') && contains(QuotaName, '<instance-type>')].[QuotaCode,Value,QuotaName]" \
  --output table
If Value is 0: STOP. Inform the user that warm pool quota must be requested for the
target instance type. The keep-alive parameter silently no-ops without quota — jobs will
succeed but cold-start every time.
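The same check can be scripted so the loop is gated on the result rather than on a table read by eye. This is a minimal sketch, assuming quota entries shaped like the Service Quotas `list_service_quotas` response (`QuotaName`, `QuotaCode`, `Value`). The helper names are hypothetical, and the name match is case-insensitive, which is slightly looser than the CLI query above.

```python
from typing import Iterable, List


def find_warm_pool_quotas(quotas: Iterable[dict], instance_type: str) -> List[dict]:
    """Keep quota entries whose name mentions both 'warm pool' and the instance type."""
    matches = []
    for quota in quotas:
        name = quota.get("QuotaName", "")
        if "warm pool" in name.lower() and instance_type in name:
            matches.append(quota)
    return matches


def has_warm_pool_quota(quotas: Iterable[dict], instance_type: str) -> bool:
    """True only if at least one matching quota has a value above zero."""
    return any(q.get("Value", 0) > 0 for q in find_warm_pool_quotas(quotas, instance_type))


def fetch_sagemaker_quotas(region: str) -> List[dict]:
    """Fetch all SageMaker quotas for a region (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the pure helpers above work without it

    client = boto3.client("service-quotas", region_name=region)
    paginator = client.get_paginator("list_service_quotas")
    quotas = []
    for page in paginator.paginate(ServiceCode="sagemaker"):
        quotas.extend(page["Quotas"])
    return quotas
```

A caller would stop before the first submission when `has_warm_pool_quota(fetch_sagemaker_quotas(region), instance_type)` is false.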
Lock the immutable fields
These fields MUST remain identical across ALL iterations or the pool breaks. Record them at job 1 and never change them:
| Field | Source |
|---|---|
| RoleArn | IAM execution role |
| ResourceConfig.InstanceType | e.g. ml.g6e.2xlarge |
| ResourceConfig.InstanceCount | e.g. 1 |
| ResourceConfig.VolumeSizeInGB | e.g. 250 |
| ResourceConfig.VolumeKmsKeyId | KMS key ARN (or absent) |
| VpcConfig.SecurityGroupIds | List of SG IDs (or absent) |
| VpcConfig.Subnets | List of subnet IDs (or absent) |
| EnableInterContainerTrafficEncryption | bool |
| EnableNetworkIsolation | bool |
| Session tags (if EnableSessionTagChaining=True) | tag key set |
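One way to enforce the lock is to snapshot these fields from job 1 and diff every later job spec against that snapshot before submitting. The sketch below assumes specs shaped like a `DescribeTrainingJob` response or `CreateTrainingJob` request. The helper names are hypothetical, lists are compared strictly (including order), and session tags are left out because they are not part of that dict.

```python
from typing import Any, Dict, List, Tuple

# Paths into a DescribeTrainingJob-style dict for the locked fields in the table.
LOCKED_PATHS: List[Tuple[str, ...]] = [
    ("RoleArn",),
    ("ResourceConfig", "InstanceType"),
    ("ResourceConfig", "InstanceCount"),
    ("ResourceConfig", "VolumeSizeInGB"),
    ("ResourceConfig", "VolumeKmsKeyId"),
    ("VpcConfig", "SecurityGroupIds"),
    ("VpcConfig", "Subnets"),
    ("EnableInterContainerTrafficEncryption",),
    ("EnableNetworkIsolation",),
]


def _lookup(spec: Dict[str, Any], path: Tuple[str, ...]) -> Any:
    """Walk a nested dict; return None when any key on the path is absent."""
    node: Any = spec
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def locked_fields(spec: Dict[str, Any]) -> Dict[str, Any]:
    """Snapshot of the locked fields, keyed by dotted path."""
    return {".".join(path): _lookup(spec, path) for path in LOCKED_PATHS}


def locked_field_drift(baseline: Dict[str, Any], candidate: Dict[str, Any]) -> List[str]:
    """Names of locked fields whose value differs between the two specs."""
    base, cand = locked_fields(baseline), locked_fields(candidate)
    return [name for name in base if base[name] != cand[name]]
```

An empty list means the candidate should still match the pool on these fields. Any entry is a reason to stop and ask before submitting.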
Fields you CAN change freely between iterations
- Hyperparameters
- Training script / source directory
- Input data channels
- Output S3 path
- Job name (must be unique per job)
- Environment variables
- KeepAlivePeriodInSeconds value (set fresh on each job)
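Because the job name must be unique on every submission, it helps to generate it rather than hand-edit it. A minimal sketch follows. The `make_job_name` helper and its prefix-iteration-timestamp layout are assumptions for illustration, and the only constraint it relies on is that training job names are limited to 63 alphanumeric-or-hyphen characters.

```python
import re
import time
from typing import Optional


def make_job_name(prefix: str, iteration: int, now: Optional[float] = None) -> str:
    """Build a job name like 'lora-sweep-003-1700000000' that is valid and unique per iteration."""
    stamp = int(time.time() if now is None else now)
    # Only alphanumerics and hyphens are allowed, so replace anything else.
    safe_prefix = re.sub(r"[^a-zA-Z0-9-]+", "-", prefix).strip("-") or "job"
    suffix = f"-{iteration:03d}-{stamp}"
    # Trim the prefix, never the suffix, so uniqueness survives the 63-character cap.
    return safe_prefix[: 63 - len(suffix)].rstrip("-") + suffix
```

Passing `now` explicitly keeps the function deterministic in tests. In the loop it can be left at its default.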
2. Job Submission Template
from sagemaker.train.model_trainer import ModelTrainer
from sagemaker.train.configs import Compute, InputData, OutputDataConfig, SourceCode

trainer = ModelTrainer(
    training_image=IMAGE_URI,
    source_code=SourceCode(
        source_dir="src/",
        entry_script="train.py",
    ),
    compute=Compute(
        instance_type="ml.g6e.2xlarge",     # LOCKED
        instance_count=1,                   # LOCKED
        volume_size_in_gb=250,              # LOCKED
        keep_alive_period_in_seconds=1800,  # 30 min default; adjust per cadence
    ),
    output_data_config=OutputDataConfig(s3_output_path=OUTPUT_PATH),
    hyperparameters={...},                  # FREE — change every iteration
    environment={
        "PIP_CACHE_DIR": "/opt/ml/sagemaker/warmpoolcache/pip",
    },
    role=ROLE_ARN,                          # LOCKED
)

trainer.train(wait=True, logs=True)
Why wait=True, logs=True
- Fail-fast: error appears in-shell immediately, no CloudWatch polling
- Agent can parse the failure and patch within seconds
- Billing for the training step stops when the step ends and the pool enters Available; the idle keep-alive period is then billed until the pool is reused or released
Persistent cache
The directory /opt/ml/sagemaker/warmpoolcache (env var
SAGEMAKER_MANAGED_WARMPOOL_CACHE_DIRECTORY) survives across jobs on the same pool.
Use it for:
- pip/conda package cache (set PIP_CACHE_DIR environment variable)
- Model checkpoints for incremental training
- Downloaded datasets that don't change between iterations
- Any artifact expensive to recreate
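Inside the training script, the cache can be used with a small lookup-then-fetch helper. The following is a sketch under stated assumptions: `cache_dir` and `cached_download` are hypothetical names, the `datasets` subdirectory is an arbitrary choice, and `fetch` stands in for whatever download routine the script already has.

```python
import os
from pathlib import Path
from typing import Callable

DEFAULT_CACHE_ROOT = "/opt/ml/sagemaker/warmpoolcache"


def cache_dir(subdir: str) -> Path:
    """Resolve (and create) a subdirectory of the warm pool cache."""
    root = Path(os.environ.get("SAGEMAKER_MANAGED_WARMPOOL_CACHE_DIRECTORY", DEFAULT_CACHE_ROOT))
    path = root / subdir
    path.mkdir(parents=True, exist_ok=True)
    return path


def cached_download(name: str, fetch: Callable[[Path], None]) -> Path:
    """Return the cached file, calling fetch(target_path) only on a cache miss."""
    target = cache_dir("datasets") / name
    if not target.exists():
        partial = target.with_name(target.name + ".partial")
        fetch(partial)
        # Rename last so an interrupted job never leaves a truncated file under the final name.
        partial.replace(target)
    return target
```

On the first job of a pool the fetch runs once. Later jobs on the same pool find the file and skip it. On a cold start the directory is empty and the helper simply fetches again.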
3. Between Iterations (Verify Before Resubmit)
After each completed job, before submitting the next:
aws sagemaker describe-training-job \
  --training-job-name <previous-job-name> \
  --query "WarmPoolStatus"
| Status | Meaning | Action |
|---|---|---|
| Available | Pool is hot, waiting for next match | Submit next job — will reuse |
| Reused | Pool moved to a different job | Expected on the job BEFORE current |
| InUse | Currently running a job | Wait — don't submit a competing job |
| Terminated | Pool died | Next job will cold-start. Investigate why. |
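The table can be encoded as a small decision function. This sketch assumes the `WarmPoolStatus` object returned by `DescribeTrainingJob`, which carries the state under a `Status` key. The action labels are hypothetical, and treating a missing status as a cold start is an assumption rather than documented behavior.

```python
from typing import Optional


def next_action(warm_pool_status: Optional[dict]) -> str:
    """Map the previous job's WarmPoolStatus to what the agent should do next."""
    status = (warm_pool_status or {}).get("Status")
    if status == "Available":
        return "submit"            # pool is hot; a matching job will reuse it
    if status == "InUse":
        return "wait"              # a job is running on the pool; do not compete
    if status == "Reused":
        return "check_newer_job"   # pool moved on; inspect the job that took it over
    # Terminated, or no warm pool information at all (assumed equivalent here).
    return "submit_cold"
```

With boto3 the input would be `sm.describe_training_job(TrainingJobName=name).get("WarmPoolStatus")`.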
Pool death causes
- InternalServerError during previous job (infrastructure failure)
- Explicit stop (StopTrainingJob or KeepAlivePeriodInSeconds=0)
- Keep-alive expiry (exceeded the configured seconds with no matching job)
- Patch update to the cluster
- Mismatched fields — next job didn't match on a LOCKED field
If Terminated: warn the user, note the cold-start cost, and continue. The next job
with keep_alive_period_in_seconds > 0 will create a new pool.
4. Iteration Loop (Agent Behavior)
┌─────────────────────────────────────────────────┐
│ 1. Submit training job (LOCKED fields constant) │
│ 2. wait=True, logs=True → block until complete │
│ 3. Job succeeded? │
│ YES → Go to CLEANUP │
│ NO → Read failure from logs │
│ 4. Is the failure recoverable by patching? │
│ NO → Go to CLEANUP │
│ YES → Patch code/config │
│ 5. Verify WarmPoolStatus == Available │
│ If Terminated → warn, continue (cold start) │
│ Go to step 1 │
│ │
│ CLEANUP (always runs on any exit path): │
│ Release warm pool → EXIT │
└─────────────────────────────────────────────────┘
The agent MUST release the warm pool on every exit path — success, unrecoverable failure, or decision to stop iterating. The only exception is a hard crash of the agent process itself (in which case the keep-alive expiry is the safety net).
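The release-on-every-exit rule maps naturally onto `try`/`finally`. The sketch below mirrors the debugging flow in the diagram and injects the four operations as callables, so it carries no SageMaker dependency. All names are hypothetical, and `max_iterations` is an added safety bound that the diagram does not include.

```python
from typing import Callable, Optional


def run_iteration_loop(
    submit: Callable[[int], Optional[str]],  # returns None on success, else failure text
    patch: Callable[[str], bool],            # returns True if a fix was applied
    pool_status: Callable[[], str],          # returns the current WarmPoolStatus state
    release: Callable[[], None],             # sets keep-alive to 0 on the last job
    max_iterations: int = 10,
) -> dict:
    """Submit, patch, and resubmit until success or an unrecoverable failure."""
    result = {"succeeded": False, "iterations": 0, "cold_starts": 0}
    try:
        for i in range(1, max_iterations + 1):
            result["iterations"] = i
            failure = submit(i)
            if failure is None:
                result["succeeded"] = True
                break
            if not patch(failure):
                break  # unrecoverable: fall through to cleanup
            if pool_status() == "Terminated":
                result["cold_starts"] += 1  # warn the user; the next job cold-starts
    finally:
        release()  # runs on success, failure, and unexpected exceptions alike
    return result
```

A sweep would differ only in continuing after a success instead of breaking. The `finally` clause stays the same.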
Tight-loop economics
At ~90s agent turnaround between jobs:
- 7 iterations ≈ 28 min wall-clock, ~$1.31 (ml.g6e.2xlarge)
- Same 7 iterations cold: 58 min, ~$2.29
- Savings: 43% cost, about 2.1x faster (58 min vs. 28 min), single capacity acquisition
Keep-alive tuning
- Default 1800 (30 min) is safe for human-driven iteration
- For agent-driven tight loops, 600 (10 min) is sufficient and reduces waste if the agent crashes or the user walks away
- Max allowed: 3600 (60 min) per job
- Max chain duration: 28 days of continuous reuse
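The trade-off between these settings is easiest to see as the worst-case idle bill if nobody releases the pool. A small worked calculation, assuming idle time is billed at the same hourly rate as training and using the $2.80/hr ml.g6e.2xlarge figure quoted in this document. The function name is hypothetical.

```python
def max_idle_cost(keep_alive_seconds: int, hourly_rate_usd: float) -> float:
    """Upper bound on idle spend for one unreleased pool, rounded to cents."""
    if not 0 <= keep_alive_seconds <= 3600:
        raise ValueError("keep-alive must be between 0 and 3600 seconds")
    return round(keep_alive_seconds / 3600 * hourly_rate_usd, 2)
```

At $2.80/hr this gives roughly $0.47 for 600 seconds, $1.40 for 1800 seconds, and $2.80 for the 3600-second maximum.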
5. Constraints & Limitations
- No spot instances: warm pools are incompatible with managed spot training
- No heterogeneous clusters: single instance type only
- Volume size is locked: cannot resize EBS between iterations
- Quota is per-instance-type per-region: switching instance types requires a separate quota grant and breaks the current pool
- Pool is single-tenant: only one job can match at a time; don't submit competing jobs from parallel agents against the same pool
6. Cleanup (Run When Done)
Release the warm pool immediately after the final iteration:
import boto3

sm = boto3.client("sagemaker")
sm.update_training_job(
    TrainingJobName=last_job_name,
    ResourceConfig={"KeepAlivePeriodInSeconds": 0},
)
Or via CLI:
aws sagemaker update-training-job \
  --training-job-name <last-job-name> \
  --resource-config KeepAlivePeriodInSeconds=0
This is not optional. Warm pools bill at the full training rate ($2.80/hr for ml.g6e.2xlarge) while idle. If you don't release the pool, you keep paying for an idle instance until the keep-alive period expires: up to 30 min at the default of 1800 seconds.
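For use in a cleanup handler, the release call can be wrapped so it is safe to invoke even when the pool is already gone. This is a sketch, assuming a boto3 SageMaker client is passed in. The `release_warm_pool` name is hypothetical, and skipping the update whenever the status is not Available is a design assumption, not documented API behavior.

```python
def release_warm_pool(sm_client, job_name: str) -> bool:
    """Set keep-alive to 0 if this job still holds a pool; return True if a release was requested."""
    desc = sm_client.describe_training_job(TrainingJobName=job_name)
    status = desc.get("WarmPoolStatus", {}).get("Status")
    if status != "Available":
        # Nothing retained by this job: no pool was created, a later job reused it,
        # or it has already terminated.
        return False
    sm_client.update_training_job(
        TrainingJobName=job_name,
        ResourceConfig={"KeepAlivePeriodInSeconds": 0},
    )
    return True
```

Called as `release_warm_pool(boto3.client("sagemaker"), last_job_name)`, it can sit in the `finally` branch of the iteration loop without needing to know how the loop ended.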
7. Failure Taxonomy (What the Agent Should Recognize)
| Failure pattern | Likely cause | Fix approach |
|---|---|---|
| ModuleNotFoundError / ImportError | Missing or version-incompatible dependency | Bump pin in requirements.txt, resubmit |
| CUDA out of memory | Batch size or model too large for GPU | Reduce batch size, enable gradient checkpointing |
| ResourceLimitExceeded | Quota exhausted | Cannot continue — need quota increase or different instance |
| InternalServerError (job-level) | SageMaker infra issue | Pool is dead — next job cold-starts, retry same config |
| AlgorithmError (exit code != 0) | Bug in training script | Read traceback, patch script |
| Container timeout (no output) | Deadlock or infinite loop | Add timeout/watchdog, reduce scope |
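A first-pass classifier over the failure text might look like the following. It is a sketch: the category labels and the recoverable flag are hypothetical, the patterns are taken from the table rows, and the container-timeout row is omitted because it shows up as an absence of output rather than a matchable string.

```python
import re
from typing import Tuple

# Ordered most specific first: AlgorithmError often wraps a more telling inner error.
FAILURE_RULES = [
    (r"ModuleNotFoundError|ImportError", "dependency", True),
    (r"CUDA out of memory", "oom", True),
    (r"ResourceLimitExceeded", "quota", False),
    (r"InternalServerError", "infra", True),
    (r"AlgorithmError", "script_bug", True),
]


def classify_failure(text: str) -> Tuple[str, bool]:
    """Return (category, recoverable) for the first rule that matches the failure text."""
    for pattern, label, recoverable in FAILURE_RULES:
        if re.search(pattern, text):
            return label, recoverable
    return "unknown", False  # unrecognized failures are treated as unrecoverable
```

Treating unknown failures as unrecoverable keeps the agent from looping on something it does not understand. It goes to cleanup and reports instead.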