name: wandb-weave-ft-retrospective description: Analyze W&B, Weave, and local fine-tuning evaluation artifacts, then produce a concrete next-run improvement plan with data, prompt, and training actions. Use after each SFT or eval cycle.
Wandb Weave Ft Retrospective
Use this skill to turn experiment traces into an actionable fine-tuning plan.
When To Use
- User asks for a retrospective, run diagnosis, or "next fine-tuning plan".
- You have W&B runs but no clear priority of what to fix first.
- You need to convert hard-case outputs into concrete dataset or prompt changes.
Required Inputs
- Submission records:
artifacts/hf_jobs/submissions.jsonlartifacts/hf_jobs/eval_submissions.jsonl- Per-run outputs:
outputs/*/final_metrics.jsonoutputs/*/iter_eval_metrics.jsonloutputs/*/hard_cases.valid.jsonl- Optional trace export:
outputs/*/weave_eval_traces.json
Workflow
- Pin the comparison set.
- Pick one baseline run and one candidate run.
- Record base model id, adapter/full model id, dataset version, and run ids.
- Check gate metrics first.
- Use
json_valid_rateandparse_error_rateas hard gates. - Then inspect quality metrics (
mae_raw,mse_raw, per-dimensionmae_raw_*).
- Cluster dominant failures.
- Group hard cases by
parse_error,source_type, and request pattern. - Separate format failures from value-quality failures.
- Convert clusters into fixes.
- Format failures first: output contract prompt text, decode config, EOS handling.
- Value-quality failures next: add targeted training rows for high-error dimensions.
- Hyperparameter tuning last: only after format validity is stable.
- Produce a run-ready plan.
- Include exact env var overrides for the next run.
- Include promotion criteria and rollback criteria.
Repository Helpers
python scripts/hf/collect_campaign_validation.pypython scripts/hf/debug_json_prompt_variants.pypython scripts/hf/debug_local_eval_outputs.py
Output Contract
Always return all of the following:
- Short diagnosis summary.
- Top 3 failure clusters with evidence references.
- Concrete next-run changes (dataset, prompt, train config).
- Clear success gates for promotion.
References
references/retrospective-template.md