---
name: snapshots
description: Use when investigating KAI Scheduler behavior with captured cluster state, especially to replay scheduler decisions on specific refs or compare behavior across versions.
license: MIT
compatibility: Requires bash, git, kubectl for capture, curl for capture, make/docker or a prebuilt snapshot-tool for replay.
metadata:
  author: KAI Scheduler maintainers
  version: "1.0"
---
# Snapshots
Use this skill when investigating KAI Scheduler behavior with captured cluster state, especially for reproducing scheduling bugs, comparing behavior across KAI versions, or gathering evidence for issues like kai-scheduler/KAI-Scheduler#1517.
## Facts
- `docs/plugins/snapshot.md` is the source of truth for capture.
- The snapshot endpoint is `/get-snapshot` on plugin port `8081`, not the scheduler `--listen-address` port. In the observed clusters here, remote `8080` returned `404` while remote `8081` worked.
- Snapshot files are ZIP archives containing `snapshot.json`, even when named `.gzip`.
- `cmd/snapshot-tool/main.go` rebuilds fake clients from `snapshot.json` and replays the configured scheduler actions.
- Replay is a simulation of scheduler behavior, not a full cluster reproduction.
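
Because the `.gzip` extension is misleading, a quick check of the file's leading bytes can confirm that it is a ZIP archive before any tool touches it. The following is a minimal sketch; `is_zip_archive` is a hypothetical helper for illustration, not part of the repository scripts.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report whether a file looks like a ZIP archive,
# regardless of its extension.

is_zip_archive() {
  local file="$1"
  local magic
  # ZIP records begin with the two ASCII bytes "PK".
  magic="$(head -c 2 -- "$file" 2>/dev/null)"
  [ "$magic" = "PK" ]
}
```

A possible use is `is_zip_archive snapshots/issue-123.gzip && unzip -l snapshots/issue-123.gzip`, which lists the archive members (where `snapshot.json` should appear) only when the header check passes.
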
## Commands
Run scripts from the repository root:
.agents/skills/snapshots/scripts/capture-snapshot.sh --output snapshots/issue-123.gzip
.agents/skills/snapshots/scripts/inspect-snapshot.sh snapshots/issue-123.gzip
.agents/skills/snapshots/scripts/run-snapshot.sh --snapshot snapshots/issue-123.gzip --verbosity 8
.agents/skills/snapshots/scripts/run-snapshot.sh --ref v0.14.2 --snapshot snapshots/issue-123.gzip
.agents/skills/snapshots/scripts/compare-snapshot-refs.sh --snapshot snapshots/issue-123.gzip --refs main,v0.14.2
- `capture-snapshot.sh`: port-forward the scheduler and download `/get-snapshot`. Default target is `deployment/kai-scheduler-default` in namespace `kai-scheduler` on local/remote port `8081`. The script inherits `KUBECONFIG`, for example:
KUBECONFIG=$HOME/.kube/engine-scale-test \
.agents/skills/snapshots/scripts/capture-snapshot.sh --output snapshots/example.gzip
- `inspect-snapshot.sh`: validate that the archive contains `snapshot.json` and print top-level structure. Run this before replaying user-provided artifacts.
- `run-snapshot.sh`: build `snapshot-tool` with `make build-go SERVICE_NAME=snapshot-tool` and replay on the current checkout, or use `--ref` to switch to one Git ref, replay, and restore the original branch or commit. For large snapshots, start with `--verbosity 2`. For reruns, prefer `--no-build --tool bin/snapshot-tool-amd64`. If a ref-based run is interrupted hard enough that the shell trap does not execute, the repo can stay detached; check with `git status --short --branch` and restore with `git switch <branch>`.
- `compare-snapshot-refs.sh`: run the same snapshot against several git refs and save one log per ref plus `summary.tsv`.
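
For orientation, the capture step can be approximated by hand as a port-forward followed by a download. This is a minimal sketch under the defaults stated above (`deployment/kai-scheduler-default`, namespace `kai-scheduler`, port `8081`); it assumes the endpoint is served over plain HTTP on localhost, and the function names are hypothetical. It is not the content of `capture-snapshot.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical helpers illustrating a manual capture; not the repository script.

# Build the local URL for the snapshot endpoint (assumes plain HTTP on localhost).
snapshot_url() {
  local port="${1:-8081}"
  printf 'http://localhost:%s/get-snapshot\n' "$port"
}

# Port-forward the scheduler, download the snapshot, then stop the port-forward.
capture_snapshot_manually() {
  local output="$1"
  local namespace="${2:-kai-scheduler}"
  local target="${3:-deployment/kai-scheduler-default}"
  local port="${4:-8081}"

  kubectl port-forward -n "$namespace" "$target" "${port}:${port}" >/dev/null &
  local pf_pid=$!
  sleep 2   # crude wait for the tunnel; a retry loop would be more robust

  local status=0
  curl --fail --silent --show-error --output "$output" \
    "$(snapshot_url "$port")" || status=$?

  # Always tear down the port-forward, even if the download failed.
  kill "$pf_pid" 2>/dev/null || true
  wait "$pf_pid" 2>/dev/null || true
  return "$status"
}
```

Calling `capture_snapshot_manually snapshots/example.gzip` would then write the archive to the given path. Using `--fail` makes a `404` (as seen on port `8080`) surface as a non-zero exit instead of saving an error page as the snapshot.
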
## Workflow
- Capture or receive the snapshot. Avoid committing snapshot artifacts unless the user explicitly asks.
- Inspect the archive and confirm it contains `snapshot.json`.
- If capture fails, verify the scheduler pod is running, verify the scheduler ConfigMap includes `- name: snapshot`, and verify the scheduler logs contain `Snapshot plugin registering get-snapshot`.
- Replay on the reported KAI version first, aligned to the exact tag or commit.
- Use `--verbosity 2` first. Compare timing from action timestamps inside the logs, not whole-command wall clock, because builds and verbosity can dominate.
- Replay on candidate fixed or regressed refs only after the reported version is understood.
- If a version appears stuck, do not wait indefinitely: stop the run and keep the partial log as evidence. In the runs here, `v0.14.0` completed `reclaim` materially faster than `v0.13.0`, while `v0.14.4` appeared to stall in `reclaim` past an interactive timeout.
- Report refs, commands, log paths, action timings, errors, and whether the issue reproduced.
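
The capture-failure checks in the workflow above can be sketched as two small filters that read text on standard input. Both function names are hypothetical, and the ConfigMap pattern assumes the plugin entry sits on its own line as `- name: snapshot`.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the capture-failure checks; both read stdin.

# Succeed if the scheduler logs show the snapshot plugin registering its endpoint.
has_snapshot_registration() {
  grep -F 'Snapshot plugin registering get-snapshot' >/dev/null
}

# Succeed if the scheduler configuration lists the snapshot plugin
# as a standalone "- name: snapshot" entry (any indentation).
has_snapshot_plugin_entry() {
  grep -E '^[[:space:]]*-[[:space:]]+name:[[:space:]]+snapshot[[:space:]]*$' >/dev/null
}
```

For example, `kubectl logs -n kai-scheduler deployment/kai-scheduler-default | has_snapshot_registration` covers the log check, and piping `kubectl get configmap -n kai-scheduler <scheduler-configmap> -o yaml` into `has_snapshot_plugin_entry` covers the configuration check, where `<scheduler-configmap>` stands for whatever ConfigMap the cluster's scheduler actually uses.
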