fireworks-training



By fw-ai. Updated 6/8/2026.

name: fireworks-training
description: Debug and migrate training work on Fireworks via the cookbook. Covers greenfield work (pick a recipe, fork it, resolve training + deployment shape from a profile), distillation / OPD / SDFT recipe work, user-level recovery (promote a checkpoint, list promotable checkpoints on a trainer, self-check a WeightSyncScope.PER_TRAINER vs PER_DEPLOYMENT bucket-scope mix-up), and migrating deprecated managed-infra scripts (port InfraConfig / setup_infra / ResourceCleanup to TrainerConfig + the recipe's SDK-managed provisioning path). The cookbook is the reference implementation of fireworks.training.sdk; fork a recipe or run an example instead of reimplementing. Trigger when the user wants to start, resume, promote, distill, migrate off deprecated infra APIs, or do a first-line diagnosis on a training run; for deeper recovery the skill routes users to Fireworks support.

Fireworks training

The cookbook is the reference implementation of the Fireworks Training SDK. Fork a recipe, run an example, or use the standalone tools in references/tools.md. Use shapes for both trainer and deployment; never hand-set accelerator_type / node_count / custom_image_tag.


Task → reference

| Task or signal | Reference |
| --- | --- |
| "How do I set up / install the cookbook?" | references/setup.md |
| "I want to run something out of the box" | references/examples.md |
| "I want to fork a recipe and edit the Config" | references/recipes.md |
| "Distillation" / "OPD" / "SDFT" / distillation_loop.py / topk_forward_kl / teacher_messages | references/distillation.md |
| "Migrate off InfraConfig / setup_infra / ResourceCleanup" / TypeError: ... unexpected keyword argument 'infra' / ImportError: cannot import name 'setup_infra' | references/migrate.md |
| "How do I set the training / deployment shape?" | references/shapes.md |
| RuntimeError: Failed to resolve latest validated training shape | references/shapes.md: don't pin a version; retry or reach out |
| "Can I run two deployments off one trainer (sampler + eval)?" | references/rl/hotload.md |
| "How does RL dispatch server-side vs client-side loss? What's the cost?" | references/rl/loss-paths.md |
| "How does gradient accumulation work at optim_step? What normalization does RL use?" | references/rl/gradient-accumulation.md |
| "Why are some RL samples being filtered?" | references/rl/dynamic-filter.md |
| "Custom loss for RL" | references/rl/custom-loss.md |
| "RL hotload / weight sync cadence, on-policy vs off-policy, weight_sync_timeout" | references/rl/hotload.md |
| "Concurrency control for RL rollouts — adaptive vs fixed?" | references/rl/concurrency.md |
| DeploymentSamplerTimeoutError / repeated sampler HTTP 408 or 504 / ReadTimeout during rollout sampling | references/rl/sampling-timeouts.md |
| "Async RL — overlap rollout with training, off-policy budget, PPO inner minibatches, rollout_fn(sample_prompt)" | references/rl/async-rl.md |
| "Black-box coding-agent RL" / "train claude-code or another agent harness" / "ProRL SWE-Gym parity" | references/rl/async-rl.md |
| "Why is perf/wait_time_ratio high?" / perf/sampler_wait_for_trainer_time / perf/trainer_wait_for_sampler_time | references/rl/async-rl.md |
| "How do I size max_head_offpolicy_versions and max_concurrency_rollout_sample?" | references/rl/async-rl.md |
| "How do I promote a checkpoint?" | references/tools.md |
| "Which checkpoints does the server know about / are promotable?" | references/tools.md: FireworksClient.list_checkpoints(job_id) |
| "How do I reconnect a training client to a running trainer?" | references/tools.md |
| "Hotload keeps failing — is this a PER_TRAINER / PER_DEPLOYMENT scope mix-up?" | references/rl/hotload.md: self-check and reach out to Fireworks support |
| "How do I verify train vs inference logprobs?" | references/tools.md |
| "I'm adding a new renderer — what's the contract?" | ../renderer/SKILL.md |
| "I changed a renderer — how do I verify it matches HF / the live gateway?" | ../verifier/SKILL.md |
| "Why is my model emitting trailing tokens / hard-appends?" / token stream looks wrong | ../verifier/SKILL.md |
| "Where does checkpoint state live?" / CheckpointKind / checkpoints.jsonl | references/checkpoints.md |
| "Continue LoRA training from a prior adapter" / warm_start_from_adapter | references/checkpoints.md |
| `Error: checkpoint "<name>" not found in GCS` | references/checkpoints.md: validate output_model_id first; reach out to Fireworks support if still failing |
| Error: Hotload failed for snapshot ... / Hotload did not complete within ... / sampler deployment failed to load latest trainer weights | references/rl/hotload.md: compare expected vs current snapshot; reattach or recreate the deployment when the attachment is stale |
| Error: hotload flow mismatch: trainer wants deployment-first ... but deployment ... is trainer-first | references/rl/hotload.md: the server still emits the old "trainer-first / deployment-first" wording; it maps to PER_TRAINER / PER_DEPLOYMENT bucket scope. Scopes crossed at CreateRlorTrainerJob; pick one scope. |
| Error: hotload flow mismatch: trainer T is deployment-first-keyed for deployment D | references/rl/hotload.md: trainer is keyed to a different deployment's bucket (PER_DEPLOYMENT); use a PER_TRAINER-scope trainer |
| Error: hot_load_bucket_url %q conflicts with hot_load_trainer_job %s; set exactly one | references/rl/hotload.md: create-time: drop whichever field is wrong |
| Error: hot_load_bucket_url %q conflicts with hot_load_trainer_job %s; update hot_load_trainer_job instead, or clear it first | references/rl/hotload.md: update-time: clear bucket URL first, then PATCH trainer job |
| Error: invalid FW_HOSTED hot_load_bucket_url / must use gs:// scheme / path must start with rl-checkpoints/ | references/rl/hotload.md: structural validation on FW_HOSTED URL at create/update |
| Error: configured FW_HOSTED hot_load bucket is not reachable / control plane lacks permission | references/rl/hotload.md: account's ModelBucket misprovisioned; reach out to Fireworks support |
| Error: cannot cancel job in state: JOB_STATE_DELETED | references/rl/hotload.md: trainer is tombstoned during the retention window; no action needed |
| list_checkpoints / promote_checkpoint returns NOT_FOUND > 30 days after delete | references/rl/hotload.md: past retention, expected |
| HTTP 400 on output_model_id | references/tools.md: validate before calling |
| "Is this a PER_TRAINER or PER_DEPLOYMENT bucket scope?" | references/rl/hotload.md |
| Manual accelerator_type / node_count set on Config | references/shapes.md: drop them, the profile owns infra |
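
Two of the hotload errors listed above are purely structural (the gs:// scheme and rl-checkpoints/ path rules, and the "set exactly one" conflict), so they can plausibly be caught client-side before any create or update call. The following is a minimal sketch under that assumption. `check_fw_hosted_bucket_url` and `exactly_one_hotload_source` are hypothetical helper names, not SDK functions, and the server's real validation may be stricter than what the error strings reveal.

```python
from __future__ import annotations

from urllib.parse import urlparse

# Prefix taken from the error text "path must start with rl-checkpoints/".
REQUIRED_PATH_PREFIX = "rl-checkpoints/"


def check_fw_hosted_bucket_url(url: str) -> list[str]:
    """Return the structural problems found; an empty list means none.

    Hypothetical helper: mirrors only the two rules named in the error
    messages (gs:// scheme, rl-checkpoints/ path prefix).
    """
    problems = []
    parsed = urlparse(url)
    if parsed.scheme != "gs":
        problems.append("must use gs:// scheme")
    # urlparse keeps the leading "/" on the path; drop exactly one.
    path = parsed.path[1:] if parsed.path.startswith("/") else parsed.path
    if not path.startswith(REQUIRED_PATH_PREFIX):
        problems.append("path must start with rl-checkpoints/")
    return problems


def exactly_one_hotload_source(bucket_url: str | None, trainer_job: str | None) -> bool:
    """True when exactly one of the two fields is set (the create-time rule)."""
    return bool(bucket_url) != bool(trainer_job)
```

A non-empty result from `check_fw_hosted_bucket_url` points at the structural-validation row; a `False` from `exactly_one_hotload_source` points at the "set exactly one" row. Reachability and permission failures cannot be detected this way and still go to Fireworks support.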

First debug step — always

Before assuming the platform is broken, confirm the user's installed fireworks-ai satisfies the cookbook's SDK requirement. A stale SDK produces errors that masquerade as server bugs: missing keyword arguments, "unknown field", silent no-ops on new config fields, or promote_checkpoint behaviour that doesn't match the code.

The requirement lives in the cookbook's training/pyproject.toml — look for the fireworks-ai[training] pin:

    grep 'fireworks-ai\[training\]' cookbook/training/pyproject.toml
    # e.g. "fireworks-ai[training]>=<minimum-sdk-version>,<2"

    pip show fireworks-ai | grep -i version

If the installed version doesn't satisfy the pin, upgrade first and retry. Only after the SDK meets the requirement should you start triaging the actual symptom. Users do not need to sync the cookbook to upstream main — whatever cookbook commit they're on declares its own SDK requirement, and matching that is what matters.
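
The comparison between the installed version and the pin can also be scripted. The sketch below is a minimal illustration, assuming the pin is a comma-separated list of simple clauses such as `>=X,<Y` with numeric version components. `parse_version`, `satisfies_pin`, and `installed_sdk_version` are hypothetical helper names, not part of the SDK or the cookbook. For anything beyond simple pins, `packaging.specifiers.SpecifierSet` is the more robust choice.

```python
from __future__ import annotations

import re
from importlib import metadata


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '1.2.3' into (1, 2, 3); a suffix such as 'rc1' is ignored."""
    parts = []
    for piece in text.strip().split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def satisfies_pin(installed: str, pin: str) -> bool:
    """Check `installed` against a pin like '>=1.2.0,<2'.

    Assumption: only the >=, <=, ==, >, < operators appear in the pin.
    """
    ops = {
        ">=": lambda a, b: a >= b,
        "<=": lambda a, b: a <= b,
        "==": lambda a, b: a == b,
        ">": lambda a, b: a > b,
        "<": lambda a, b: a < b,
    }
    have = parse_version(installed)
    for clause in pin.split(","):
        match = re.fullmatch(r"\s*(>=|<=|==|>|<)\s*([\w.]+)\s*", clause)
        if match is None:
            raise ValueError(f"unsupported pin clause: {clause!r}")
        op, want_text = match.groups()
        want = parse_version(want_text)
        # Pad with zeros so that '1.2' and '1.2.0' compare as equal.
        width = max(len(have), len(want))
        a = have + (0,) * (width - len(have))
        b = want + (0,) * (width - len(want))
        if not ops[op](a, b):
            return False
    return True


def installed_sdk_version(dist: str = "fireworks-ai") -> str | None:
    """Return the installed version of `dist`, or None if it is not installed."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None
```

To use it, pass the result of `installed_sdk_version()` and the version part of the pin (the text after `fireworks-ai[training]`) to `satisfies_pin`. A `False` result means upgrade before triaging anything else.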

Non-negotiables

  1. Shape first. Prefer leaving cfg.trainer.training_shape_id unset so recipes auto-select the smallest validated shape that fits; set it only when you need an explicit override. The deployment shape comes from the profile. Manual infra fields are a mistake; the backend will reject or ignore them. See references/shapes.md.
  2. WeightSyncScope.PER_TRAINER is the default. DeployConfig(weight_sync_scope=WeightSyncScope.PER_TRAINER) is what you get when the field is left unset, so setting it explicitly is optional. Do not combine it with hot_load_deployment_id; that field belongs to PER_DEPLOYMENT. Pick one bucket scope. See references/rl/hotload.md.
  3. Fork, don't reinvent. Training loop plumbing lives in training/recipes/. Fork the file that matches the task; do not rewire FiretitanServiceClient / FiretitanTrainingClient / deployment hotload from scratch.
  4. Validate output_model_id before promote. Server cap is 63 chars, charset [a-z0-9-]. A rejected promote orphans the sampler blob; the same checkpoint_id returns "not found in GCS" after GC. See references/checkpoints.md.
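
The output_model_id rule in item 4 is simple enough to check locally before calling promote. The following is a minimal sketch, assuming the 63-character cap and the [a-z0-9-] charset apply to the bare ID string exactly as stated. `check_output_model_id` is a hypothetical helper, not an SDK function, and the server may enforce additional rules not listed here.

```python
from __future__ import annotations

import re

MAX_MODEL_ID_LEN = 63  # server cap stated in item 4


def check_output_model_id(model_id: str) -> list[str]:
    """Return the problems found; an empty list means both stated rules pass."""
    problems = []
    if not model_id:
        problems.append("empty")
    if len(model_id) > MAX_MODEL_ID_LEN:
        problems.append(f"too long: {len(model_id)} > {MAX_MODEL_ID_LEN} chars")
    # Collect every distinct character outside the allowed charset.
    bad = sorted(set(re.findall(r"[^a-z0-9-]", model_id)))
    if bad:
        problems.append(f"characters outside [a-z0-9-]: {''.join(bad)}")
    return problems
```

Running this before promote avoids the rejected-promote path described above, where the sampler blob is orphaned and the same checkpoint_id later returns "not found in GCS".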

SDK surface

The training SDK lives at https://github.com/fw-ai-external/python-sdk under src/fireworks/training/sdk/. For any SDK call an agent needs, read the cookbook recipe that already makes it: recipe files are listed in references/recipes.md.
