fireworks-training - SKILL.md Agent Skill

name: fireworks-training
description: Debug and migrate training work on Fireworks via the cookbook. Covers greenfield work (pick a recipe, fork it, resolve training + deployment shape from a profile), distillation / OPD / SDFT recipe work, user-level recovery (promote a checkpoint, list promotable checkpoints on a trainer, self-check a `WeightSyncScope.PER_TRAINER` vs `PER_DEPLOYMENT` bucket-scope mix-up), and migrating deprecated managed-infra scripts (port `InfraConfig` / `setup_infra` / `ResourceCleanup` to `TrainerConfig` + the recipe's SDK-managed provisioning path). The cookbook is the reference implementation of `fireworks.training.sdk`; fork a recipe or run an example instead of reimplementing. Trigger when the user wants to start, resume, promote, distill, migrate off deprecated infra APIs, or do a first-line diagnosis on a training run; for deeper recovery the skill routes users to Fireworks support.

Fireworks training

The cookbook is the reference implementation of the Fireworks Training SDK. Fork a recipe, run an example, or use the standalone tools in references/tools.md. Use shapes for both the trainer and the deployment; never hand-set accelerator_type / node_count / custom_image_tag.

Task → reference

Task or signal	Reference
"How do I set up / install the cookbook?"	`references/setup.md`
"I want to run something out of the box"	`references/examples.md`
"I want to fork a recipe and edit the Config"	`references/recipes.md`
"Distillation" / "OPD" / "SDFT" / `distillation_loop.py` / `topk_forward_kl` / `teacher_messages`	`references/distillation.md`
"Migrate off `InfraConfig` / `setup_infra` / `ResourceCleanup`" / `TypeError: ... unexpected keyword argument 'infra'` / `ImportError: cannot import name 'setup_infra'`	`references/migrate.md`
"How do I set the training / deployment shape?"	`references/shapes.md`
`RuntimeError: Failed to resolve latest validated training shape`	`references/shapes.md` — don't pin a version; retry or reach out
"Can I run two deployments off one trainer (sampler + eval)?"	`references/rl/hotload.md`
"How does RL dispatch server-side vs client-side loss? What's the cost?"	`references/rl/loss-paths.md`
"How does gradient accumulation work at `optim_step`? What normalization does RL use?"	`references/rl/gradient-accumulation.md`
"Why are some RL samples being filtered?"	`references/rl/dynamic-filter.md`
"Custom loss for RL"	`references/rl/custom-loss.md`
"RL hotload / weight sync cadence, on-policy vs off-policy, `weight_sync_timeout`"	`references/rl/hotload.md`
"Concurrency control for RL rollouts — adaptive vs fixed?"	`references/rl/concurrency.md`
`DeploymentSamplerTimeoutError` / repeated sampler HTTP 408 or 504 / `ReadTimeout` during rollout sampling	`references/rl/sampling-timeouts.md`
"Async RL — overlap rollout with training, off-policy budget, PPO inner minibatches, `rollout_fn(sample_prompt)`"	`references/rl/async-rl.md`
"Black-box coding-agent RL" / "train claude-code or another agent harness" / "ProRL SWE-Gym parity"	`references/rl/async-rl.md`
"Why is `perf/wait_time_ratio` high?" / `perf/sampler_wait_for_trainer_time` / `perf/trainer_wait_for_sampler_time`	`references/rl/async-rl.md`
"How do I size `max_head_offpolicy_versions` and `max_concurrency_rollout_sample`?"	`references/rl/async-rl.md`
"How do I promote a checkpoint?"	`references/tools.md`
"Which checkpoints does the server know about / are promotable?"	`references/tools.md` — `FireworksClient.list_checkpoints(job_id)`
"How do I reconnect a training client to a running trainer?"	`references/tools.md`
"Hotload keeps failing — is this a `PER_TRAINER` / `PER_DEPLOYMENT` scope mix-up?"	`references/rl/hotload.md` — self-check and reach out to Fireworks support
"How do I verify train vs inference logprobs?"	`references/tools.md`
"I'm adding a new renderer — what's the contract?"	`../renderer/SKILL.md`
"I changed a renderer — how do I verify it matches HF / the live gateway?"	`../verifier/SKILL.md`
"Why is my model emitting trailing tokens / hard-appends?" / token stream looks wrong	`../verifier/SKILL.md`
"Where does checkpoint state live?" / CheckpointKind / `checkpoints.jsonl`	`references/checkpoints.md`
"Continue LoRA training from a prior adapter" / `warm_start_from_adapter`	`references/checkpoints.md`
Error: `checkpoint "<name>" not found in GCS`	`references/checkpoints.md` — validate `output_model_id` first; reach out to Fireworks support if still failing
Error: `Hotload failed for snapshot ...` / `Hotload did not complete within ...` / sampler deployment failed to load latest trainer weights	`references/rl/hotload.md` — compare expected vs current snapshot; reattach or recreate the deployment when the attachment is stale
Error: `hotload flow mismatch: trainer wants deployment-first ... but deployment ... is trainer-first`	`references/rl/hotload.md` — the server still emits the old "trainer-first / deployment-first" wording; it maps to `PER_TRAINER` / `PER_DEPLOYMENT` bucket scope. Scopes crossed at `CreateRlorTrainerJob`; pick one scope.
Error: `hotload flow mismatch: trainer T is deployment-first-keyed for deployment D`	`references/rl/hotload.md` — trainer is keyed to a different deployment's bucket (`PER_DEPLOYMENT`); use a `PER_TRAINER`-scope trainer
Error: `hot_load_bucket_url %q conflicts with hot_load_trainer_job %s; set exactly one`	`references/rl/hotload.md` — create-time: drop whichever field is wrong
Error: `hot_load_bucket_url %q conflicts with hot_load_trainer_job %s; update hot_load_trainer_job instead, or clear it first`	`references/rl/hotload.md` — update-time: clear bucket URL first, then PATCH trainer job
Error: `invalid FW_HOSTED hot_load_bucket_url` / `must use gs:// scheme` / `path must start with rl-checkpoints/`	`references/rl/hotload.md` — structural validation on FW_HOSTED URL at create/update
Error: `configured FW_HOSTED hot_load bucket is not reachable` / `control plane lacks permission`	`references/rl/hotload.md` — account's `ModelBucket` misprovisioned; reach out to Fireworks support
Error: `cannot cancel job in state: JOB_STATE_DELETED`	`references/rl/hotload.md` — trainer is tombstoned during the retention window; no action needed
`list_checkpoints` / `promote_checkpoint` returns NOT_FOUND > 30 days after delete	`references/rl/hotload.md` — past retention, expected
HTTP 400 on `output_model_id`	`references/tools.md` — validate before calling
"Is this a `PER_TRAINER` or `PER_DEPLOYMENT` bucket scope?"	`references/rl/hotload.md`
Manual `accelerator_type` / `node_count` set on `Config`	`references/shapes.md` — drop them, the profile owns infra

First debug step — always

Before assuming the platform is broken, confirm the user's installed fireworks-ai satisfies the cookbook's SDK requirement. A stale SDK produces errors that masquerade as server bugs: missing keyword arguments, "unknown field", silent no-ops on new config fields, or promote_checkpoint behaviour that doesn't match the code.

The requirement lives in the cookbook's training/pyproject.toml — look for the fireworks-ai[training] pin:

grep 'fireworks-ai\[training\]' cookbook/training/pyproject.toml
# e.g. "fireworks-ai[training]>=<minimum-sdk-version>,<2"

pip show fireworks-ai | grep -i version

If the installed version doesn't satisfy the pin, upgrade first and retry. Only after the SDK meets the requirement should you start triaging the actual symptom. Users do not need to sync the cookbook to upstream main — whatever cookbook commit they're on declares its own SDK requirement, and matching that is what matters.
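The same check can be scripted. The following is a minimal sketch, not part of the cookbook: the helper names (`find_pin`, `parse_version`, `satisfies`, `check`) are hypothetical, and the comparison handles only plain `>=`, `>`, `<=`, `<`, `==` clauses on numeric versions. It assumes the pin appears in pyproject.toml in the form shown above. For full PEP 440 semantics (pre-releases, `~=`, wildcards), `packaging.specifiers.SpecifierSet` is the more robust choice.

```python
import operator
import re
from importlib import metadata
from typing import Optional, Tuple

PACKAGE = "fireworks-ai"
_OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
        ">": operator.gt, "<": operator.lt}


def find_pin(pyproject_text: str) -> Optional[str]:
    """Return the specifier that follows fireworks-ai[training], or None."""
    match = re.search(r"fireworks-ai\[training\]\s*([^\"']+)", pyproject_text)
    return match.group(1).strip() if match else None


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.2.3' into (1, 2, 3); a suffix such as 'rc1' is ignored."""
    parts = []
    for piece in text.split("."):
        digits = re.match(r"\d+", piece)
        if digits is None:
            break
        parts.append(int(digits.group()))
    return tuple(parts)


def satisfies(installed: str, spec: str) -> bool:
    """Check a version against a comma-separated list of simple clauses."""
    have = parse_version(installed)
    for clause in spec.split(","):
        match = re.fullmatch(r"\s*(>=|<=|==|>|<)\s*(\d\S*)\s*", clause)
        if match is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        op, wanted = match.groups()
        want = parse_version(wanted)
        # Pad with zeros so that 1.2 and 1.2.0 compare as equal.
        width = max(len(have), len(want))
        a = have + (0,) * (width - len(have))
        b = want + (0,) * (width - len(want))
        if not _OPS[op](a, b):
            return False
    return True


def check(pyproject_path: str = "cookbook/training/pyproject.toml") -> bool:
    """Compare the installed SDK against the pin in the given pyproject.toml."""
    with open(pyproject_path, encoding="utf-8") as handle:
        pin = find_pin(handle.read())
    if pin is None:
        raise LookupError("no fireworks-ai[training] pin found")
    # metadata.version raises PackageNotFoundError if the SDK is not installed.
    return satisfies(metadata.version(PACKAGE), pin)
```

Calling `check()` from the directory that contains the cookbook checkout returns `False` when an upgrade is needed; pass a different path if the cookbook lives elsewhere.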

Non-negotiables

Shape first. Prefer leaving cfg.trainer.training_shape_id unset so recipes auto-select the smallest validated shape that fits; set it only when you need an explicit override. The deployment shape comes from the profile. Manual infra fields are a mistake; the backend will reject or ignore them. See references/shapes.md.
WeightSyncScope.PER_TRAINER is the default. Keep DeployConfig(weight_sync_scope=WeightSyncScope.PER_TRAINER), whether set explicitly or left at the default. Do not combine it with hot_load_deployment_id, which belongs to PER_DEPLOYMENT. Pick one bucket scope. See references/rl/hotload.md.
Fork, don't reinvent. Training loop plumbing lives in training/recipes/. Fork the file that matches the task; do not rewire FiretitanServiceClient / FiretitanTrainingClient / deployment hotload from scratch.
Validate output_model_id before promote. Server cap is 63 chars, charset [a-z0-9-]. A rejected promote orphans the sampler blob; the same checkpoint_id returns "not found in GCS" after GC. See references/checkpoints.md.
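The output_model_id rule above can be checked client-side before calling promote. This is a minimal sketch: `validate_output_model_id` is a hypothetical helper, not an SDK function, and it encodes only the two constraints stated here (at most 63 characters, charset [a-z0-9-]). The server may enforce further rules that this check does not cover.

```python
import re
from typing import List

MAX_LEN = 63  # server cap stated above
_DISALLOWED = re.compile(r"[^a-z0-9-]")  # anything outside [a-z0-9-]


def validate_output_model_id(model_id: str) -> List[str]:
    """Return a list of problems; an empty list means the ID passes both checks."""
    problems = []
    if not model_id:
        problems.append("output_model_id is empty")
    if len(model_id) > MAX_LEN:
        problems.append(
            f"output_model_id has {len(model_id)} chars; the cap is {MAX_LEN}"
        )
    bad = sorted(set(_DISALLOWED.findall(model_id)))
    if bad:
        problems.append(f"output_model_id has disallowed characters: {bad}")
    return problems
```

Run the check on the candidate ID and call promote only when the returned list is empty, so that a malformed ID never reaches the server and never orphans the sampler blob.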

SDK surface

The training SDK lives at https://github.com/fw-ai-external/python-sdk under src/fireworks/training/sdk/. For any SDK call an agent needs, read the cookbook recipe that already makes it: recipe files are listed in references/recipes.md.