megatron-checkpoint-layout - SKILL.md Agent Skill

name: megatron-checkpoint-layout
description: Bilingual guidance for Megatron checkpoint 1D 2D 3D mp_rank layouts across tensor pipeline and expert parallel dimensions
compatibility: opencode
metadata:
  domain: distributed-training
  framework: megatron
  repo: llava-onevision2

Use this skill when diagnosing, designing, or converting Megatron/Megatron-Core checkpoints that may use TP, PP, and EP.

在排查、设计或转换使用 TP、PP、EP 的 Megatron / Megatron-Core checkpoint 时，使用这个 skill。

The key discriminator is whether expert parallelism participates in checkpoint sharding.

真正的分界点是：expert parallelism 是否参与了 checkpoint 切分。

Megatron does not treat pp > 1, on its own, as meaning the checkpoint is 3D.

Megatron 不会因为 pp > 1 就自动把 checkpoint 视为 3D。

So even if tp=1 and pp=1, once EP is enabled the checkpoint naming is still conceptually 3D because ranks are addressed by (tp, pp, ep).

所以即使 tp=1 且 pp=1，只要启用了 EP，checkpoint 在语义上仍然是 3D，因为 rank 仍然由 (tp, pp, ep) 共同定位。

When reading or converting checkpoints:

在读取或转换 checkpoint 时：

Bad assumption:

错误假设：

Why it fails:

为什么会失败：

For this repository, follow this rule:

这个仓库建议遵循以下规则：

This matches Megatron's path-building logic better than using pp > 1 as the branch condition.

这个规则比“用 pp > 1 作为分支条件”更贴近 Megatron 自己的路径生成逻辑。
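The difference between the two branch conditions can be sketched as follows. This is a minimal illustration, not Megatron's own code: the function names `is_3d_by_pp` and `is_3d_by_ep` are hypothetical, and treating `ep_size > 1` as "EP is enabled" is an assumption that should be checked against the loader in use.

```python
def is_3d_by_pp(pp_size, ep_size=None):
    """Discouraged condition: treats pp > 1 as meaning the layout is 3D."""
    return pp_size > 1


def is_3d_by_ep(pp_size, ep_size=None):
    """Recommended condition: only expert parallelism decides.

    Assumption: EP counts as enabled when ep_size is given and greater than 1.
    """
    return ep_size is not None and ep_size > 1


# Two configurations where the conditions disagree.
dense_with_pp = {"pp_size": 4, "ep_size": None}  # dense model, pipeline parallel only
moe_without_pp = {"pp_size": 1, "ep_size": 4}    # MoE model, tp=1, pp=1, EP enabled
```

For `dense_with_pp`, the PP-based check reports 3D although no expert dimension exists. For `moe_without_pp`, it reports non-3D although EP participates in sharding. The EP-based check gives the opposite answer in both cases, which is the behavior the rule above asks for.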

- What are the actual shard directory names under the checkpoint root?
- Was expert_parallel_size provided by the caller?
- Is the model dense or MoE?
- Is the loader branching on EP or incorrectly branching on PP?
- If conversion failed, which exact mp_rank_* pattern was expected and which one exists on disk?

- checkpoint 根目录下，实际 shard 目录名是什么？
- 调用方是否传入了 expert_parallel_size？
- 当前模型是 dense 还是 MoE？
- loader 是按 EP 分支，还是错误地按 PP 分支？
- 如果转换失败，程序期望的 mp_rank_* 模式是什么，磁盘上实际又是什么？
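The first and last questions can be answered by inspecting the directory names directly. The sketch below is one possible way to do that. The helpers `parse_shard_name` and `summarize_shard_dirs` are hypothetical, and the regular expression assumes shard directories are named `mp_rank_` followed by one to three underscore-separated numeric fields. The field count alone does not say which parallel dimension each field represents, so the caller's TP/PP/EP settings are still needed to interpret the result.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed naming: mp_rank_ followed by one to three numeric fields.
_MP_RANK_RE = re.compile(r"^mp_rank_(\d+)(?:_(\d+))?(?:_(\d+))?$")


def parse_shard_name(name):
    """Return the numeric fields of an mp_rank_* name as ints, or None if no match."""
    match = _MP_RANK_RE.match(name)
    if match is None:
        return None
    return tuple(int(field) for field in match.groups() if field is not None)


def summarize_shard_dirs(checkpoint_dir):
    """Count mp_rank_* subdirectories by the number of numeric fields in their names."""
    counts = Counter()
    for entry in sorted(Path(checkpoint_dir).iterdir()):
        if not entry.is_dir():
            continue  # skip plain files stored next to the shard directories
        fields = parse_shard_name(entry.name)
        if fields is not None:
            counts[len(fields)] += 1
    return dict(counts)
```

Comparing the field counts found on disk with the pattern the loader expects shows directly whether a conversion failure comes from a layout mismatch. A result with more than one key means the directory mixes naming patterns.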

When asked to analyze a checkpoint issue, return:

当你被要求分析 checkpoint 问题时，应该返回：