name: fireworks-training
description: Debug and migrate training work on Fireworks via the cookbook. Covers greenfield work (pick a recipe, fork it, resolve training + deployment shape from a profile), distillation / OPD / SDFT recipe work, user-level recovery (promote a checkpoint, list promotable checkpoints on a trainer, self-check a WeightSyncScope.PER_TRAINER vs PER_DEPLOYMENT bucket-scope mix-up), and migrating deprecated managed-infra scripts (port InfraConfig / setup_infra / ResourceCleanup to TrainerConfig + the recipe's SDK-managed provisioning path). The cookbook is the reference implementation of fireworks.training.sdk; fork a recipe or run an example instead of reimplementing. Trigger when the user wants to start, resume, promote, distill, migrate off deprecated infra APIs, or do a first-line diagnosis on a training run; for deeper recovery the skill routes users to Fireworks support.
Fireworks training
The cookbook is the reference implementation of the Fireworks Training SDK. Fork a recipe, run an example, or use the standalone tools in references/tools.md. Use shapes for both trainer and deployment; never hand-set accelerator_type / node_count / custom_image_tag.
Task → reference
| Task or signal | Reference |
|---|---|
| "How do I set up / install the cookbook?" | references/setup.md |
| "I want to run something out of the box" | references/examples.md |
| "I want to fork a recipe and edit the Config" | references/recipes.md |
| "Distillation" / "OPD" / "SDFT" / distillation_loop.py / topk_forward_kl / teacher_messages | references/distillation.md |
| "Migrate off InfraConfig / setup_infra / ResourceCleanup" / TypeError: ... unexpected keyword argument 'infra' / ImportError: cannot import name 'setup_infra' | references/migrate.md |
| "How do I set the training / deployment shape?" | references/shapes.md |
| RuntimeError: Failed to resolve latest validated training shape | references/shapes.md: don't pin a version; retry or reach out |
| "Can I run two deployments off one trainer (sampler + eval)?" | references/rl/hotload.md |
| "How does RL dispatch server-side vs client-side loss? What's the cost?" | references/rl/loss-paths.md |
| "How does gradient accumulation work at optim_step? What normalization does RL use?" | references/rl/gradient-accumulation.md |
| "Why are some RL samples being filtered?" | references/rl/dynamic-filter.md |
| "Custom loss for RL" | references/rl/custom-loss.md |
| "RL hotload / weight sync cadence, on-policy vs off-policy, weight_sync_timeout" | references/rl/hotload.md |
| "Concurrency control for RL rollouts — adaptive vs fixed?" | references/rl/concurrency.md |
| DeploymentSamplerTimeoutError / repeated sampler HTTP 408 or 504 / ReadTimeout during rollout sampling | references/rl/sampling-timeouts.md |
| "Async RL: overlap rollout with training, off-policy budget, PPO inner minibatches, rollout_fn(sample_prompt)" | references/rl/async-rl.md |
| "Black-box coding-agent RL" / "train claude-code or another agent harness" / "ProRL SWE-Gym parity" | references/rl/async-rl.md |
| "Why is perf/wait_time_ratio high?" / perf/sampler_wait_for_trainer_time / perf/trainer_wait_for_sampler_time | references/rl/async-rl.md |
| "How do I size max_head_offpolicy_versions and max_concurrency_rollout_sample?" | references/rl/async-rl.md |
| "How do I promote a checkpoint?" | references/tools.md |
| "Which checkpoints does the server know about / are promotable?" | references/tools.md — FireworksClient.list_checkpoints(job_id) |
| "How do I reconnect a training client to a running trainer?" | references/tools.md |
| "Hotload keeps failing; is this a PER_TRAINER / PER_DEPLOYMENT scope mix-up?" | references/rl/hotload.md: self-check and reach out to Fireworks support |
| "How do I verify train vs inference logprobs?" | references/tools.md |
| "I'm adding a new renderer — what's the contract?" | ../renderer/SKILL.md |
| "I changed a renderer — how do I verify it matches HF / the live gateway?" | ../verifier/SKILL.md |
| "Why is my model emitting trailing tokens / hard-appends?" / token stream looks wrong | ../verifier/SKILL.md |
| "Where does checkpoint state live?" / CheckpointKind / checkpoints.jsonl | references/checkpoints.md |
| "Continue LoRA training from a prior adapter" / warm_start_from_adapter | references/checkpoints.md |
| Error: checkpoint "<name>" not found in GCS | references/checkpoints.md: validate output_model_id first; reach out to Fireworks support if still failing |
| Error: Hotload failed for snapshot ... / Hotload did not complete within ... / sampler deployment failed to load latest trainer weights | references/rl/hotload.md: compare expected vs current snapshot; reattach or recreate the deployment when the attachment is stale |
| Error: hotload flow mismatch: trainer wants deployment-first ... but deployment ... is trainer-first | references/rl/hotload.md: the server still emits the old "trainer-first / deployment-first" wording; it maps to PER_TRAINER / PER_DEPLOYMENT bucket scope. Scopes crossed at CreateRlorTrainerJob; pick one scope. |
| Error: hotload flow mismatch: trainer T is deployment-first-keyed for deployment D | references/rl/hotload.md: trainer is keyed to a different deployment's bucket (PER_DEPLOYMENT); use a PER_TRAINER-scope trainer |
| Error: hot_load_bucket_url %q conflicts with hot_load_trainer_job %s; set exactly one | references/rl/hotload.md: at create time, drop whichever field is wrong |
| Error: hot_load_bucket_url %q conflicts with hot_load_trainer_job %s; update hot_load_trainer_job instead, or clear it first | references/rl/hotload.md: at update time, clear bucket URL first, then PATCH trainer job |
| Error: invalid FW_HOSTED hot_load_bucket_url / must use gs:// scheme / path must start with rl-checkpoints/ | references/rl/hotload.md: structural validation on FW_HOSTED URL at create/update |
| Error: configured FW_HOSTED hot_load bucket is not reachable / control plane lacks permission | references/rl/hotload.md: account's ModelBucket misprovisioned; reach out to Fireworks support |
| Error: cannot cancel job in state: JOB_STATE_DELETED | references/rl/hotload.md: trainer is tombstoned during the retention window; no action needed |
| list_checkpoints / promote_checkpoint returns NOT_FOUND > 30 days after delete | references/rl/hotload.md: past retention, expected |
| HTTP 400 on output_model_id | references/tools.md: validate before calling |
| "Is this a PER_TRAINER or PER_DEPLOYMENT bucket scope?" | references/rl/hotload.md |
| Manual accelerator_type / node_count set on Config | references/shapes.md: drop them, the profile owns infra |
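
For scripted first-line triage, the error rows of the table can be approximated with a substring lookup. This is only a sketch: `SIGNAL_TO_REFERENCE` and `route_signal` are hypothetical names, not part of the cookbook or the SDK, and the list covers only a few rows copied from the table above.

```python
from typing import Optional

# Hypothetical lookup: each substring is copied from an error row of the table,
# paired with the reference file that row points at. Not exhaustive.
SIGNAL_TO_REFERENCE = [
    ("Failed to resolve latest validated training shape", "references/shapes.md"),
    ("unexpected keyword argument 'infra'", "references/migrate.md"),
    ("cannot import name 'setup_infra'", "references/migrate.md"),
    ("DeploymentSamplerTimeoutError", "references/rl/sampling-timeouts.md"),
    ("hotload flow mismatch", "references/rl/hotload.md"),
    ("Hotload failed for snapshot", "references/rl/hotload.md"),
    ("not found in GCS", "references/checkpoints.md"),
]


def route_signal(message: str) -> Optional[str]:
    """Return the reference file for the first matching signal, else None."""
    for needle, reference in SIGNAL_TO_REFERENCE:
        if needle in message:
            return reference
    return None
```

A message that matches nothing returns `None`; in that case fall back to reading the table by hand.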
First debug step — always
Before assuming the platform is broken, confirm the user's installed fireworks-ai satisfies the cookbook's SDK requirement. A stale SDK produces errors that masquerade as server bugs: missing keyword arguments, "unknown field", silent no-ops on new config fields, or promote_checkpoint behaviour that doesn't match the code.
The requirement lives in the cookbook's training/pyproject.toml — look for the fireworks-ai[training] pin:
```bash
grep 'fireworks-ai\[training\]' cookbook/training/pyproject.toml
# e.g. "fireworks-ai[training]>=<minimum-sdk-version>,<2"
pip show fireworks-ai | grep -i version
```
If the installed version doesn't satisfy the pin, upgrade first and retry. Only after the SDK meets the requirement should you start triaging the actual symptom. Users do not need to sync the cookbook to upstream main — whatever cookbook commit they're on declares its own SDK requirement, and matching that is what matters.
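
The same check can be scripted. The sketch below is a stdlib-only approximation, assuming the pin uses only simple `>=`, `>`, `<=`, `<`, `==` clauses with numeric versions. It ignores pre-release and local-version semantics, so treat it as a first-pass check rather than a full specifier parser. The helper names (`parse_version`, `satisfies`, `installed_version`) are hypothetical.

```python
import operator
import re
from importlib import metadata
from typing import Optional, Tuple

_OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
        ">": operator.gt, "<": operator.lt}


def parse_version(text: str) -> Tuple[int, ...]:
    """Keep the leading numeric components only, e.g. '1.4.2rc1' becomes (1, 4, 2)."""
    parts = []
    for piece in text.strip().split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def satisfies(installed: str, spec: str) -> bool:
    """Check an installed version against comma-separated clauses such as '>=A,<B'."""
    have = parse_version(installed)
    for clause in spec.split(","):
        match = re.fullmatch(r"\s*(>=|<=|==|>|<)\s*([0-9][\w.]*)\s*", clause)
        if match is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        op, bound = match.groups()
        want = parse_version(bound)
        # Pad with zeros so that '1.2' and '1.2.0' compare as equal.
        width = max(len(have), len(want))
        left = have + (0,) * (width - len(have))
        right = want + (0,) * (width - len(want))
        if not _OPS[op](left, right):
            return False
    return True


def installed_version(dist: str = "fireworks-ai") -> Optional[str]:
    """Return the installed distribution version, or None if it is not installed."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None
```

Pass `satisfies` the version from `installed_version()` and the specifier portion of the pin exactly as it appears in the user's own training/pyproject.toml (the part after `fireworks-ai[training]`).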
Non-negotiables
- Shape first. Prefer leaving cfg.trainer.training_shape_id unset so recipes auto-select the smallest validated shape that fits; set it only when you need an explicit override. The deployment shape comes from the profile. Manual infra fields are a mistake; the backend will reject or ignore them. See references/shapes.md.
- WeightSyncScope.PER_TRAINER is the default. Set DeployConfig(weight_sync_scope=WeightSyncScope.PER_TRAINER) (the default). Do not combine it with hot_load_deployment_id; that field belongs to PER_DEPLOYMENT. Pick one bucket scope. See references/rl/hotload.md.
- Fork, don't reinvent. Training loop plumbing lives in training/recipes/. Fork the file that matches the task; do not rewire FiretitanServiceClient / FiretitanTrainingClient / deployment hotload from scratch.
- Validate output_model_id before promote. Server cap is 63 chars, charset [a-z0-9-]. A rejected promote orphans the sampler blob; the same checkpoint_id returns "not found in GCS" after GC. See references/checkpoints.md.
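
The output_model_id rule above is easy to check locally before calling promote. This is a minimal sketch covering only the two constraints stated here (at most 63 characters, charset `[a-z0-9-]`). The function name `output_model_id_problems` is hypothetical, and the server may enforce further rules that this check does not know about.

```python
import re
from typing import List

MAX_OUTPUT_MODEL_ID_LEN = 63  # server cap stated above


def output_model_id_problems(model_id: str) -> List[str]:
    """Return human-readable problems; an empty list means both stated rules pass."""
    problems = []
    if not model_id:
        problems.append("id is empty")
    if len(model_id) > MAX_OUTPUT_MODEL_ID_LEN:
        problems.append(
            f"length {len(model_id)} exceeds {MAX_OUTPUT_MODEL_ID_LEN}"
        )
    # Collect every character outside the allowed charset [a-z0-9-].
    bad = sorted(set(re.findall(r"[^a-z0-9-]", model_id)))
    if bad:
        problems.append("disallowed characters: " + " ".join(repr(c) for c in bad))
    return problems
```

Refusing to promote whenever the list is non-empty avoids the rejected-promote path described above, where the sampler blob is orphaned.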
SDK surface
The training SDK lives at https://github.com/fw-ai-external/python-sdk under src/fireworks/training/sdk/. For any SDK call an agent needs, read the cookbook recipe that already makes it: recipe files are listed in references/recipes.md.